bip-0173: test vectors, HRP, and casing requirements updates #587

mruddy commented at 1:20 AM on September 20, 2017: contributor

a12uel5l: vector added to demonstrate that it and its uppercase version, A12UEL5L are both valid and treated as equal. segwit_addr.bech32_encode('a', [])
10a06t8: vector added to demonstrate that a combined empty HRP and empty data encode to a valid Bech32 string. segwit_addr.bech32_encode('', [])
1qzzfhee: vector added to demonstrate that an empty HRP encodes to a valid Bech32 string. segwit_addr.bech32_encode('', [0])
pzry9x0s0muk: vector moved to a new section as it is misleading in the current section because "No separator character" can not affect checksum validity because the separator char is not used to compute the checksum.

>>> segwit_addr.bech32_create_checksum('', [1,2,3,4,5,6])
[15, 16, 15, 27, 28, 22]
>>> segwit_addr.bech32_verify_checksum('', [1,2,3,4,5,6]+[15, 16, 15, 27, 28, 22])
True

1pzry9x0s0muk: vector removed as incorrect because "Empty HRP" is Bech32 valid -- HRP validity is application specific. segwit_addr.bech32_encode('', [1,2,3,4,5,6])
A1g7sgd8 and A1G7SGD8: vectors added to verify that implementations adhere to BIP requirement: "The lowercase form is used when determining a character's value for checksum purposes.". Note:

>>> segwit_addr.bech32_encode('a', [])
'a12uel5l'
>>> segwit_addr.bech32_encode('A', [])
'A1g7sgd8'  # BUG: violates "The lowercase form is used when determining a character's value for checksum purposes."

10a06t8: test vector added to demonstrate that while a valid Bech32 string, it is segwit application specific invalid.

luke-jr commented at 3:27 AM on September 20, 2017: member

@sipa @gmaxwell

luke-jr added the label Proposed BIP modification on Sep 20, 2017

mruddy renamed this:
~~bip-0173 test vector updates, hrp lowercase~~
bip-0173 test vector updates
on Sep 21, 2017

mruddy renamed this:
~~bip-0173 test vector updates~~
bip-0173 test vector updates and casing requirements
on Sep 22, 2017

mruddy commented at 2:46 PM on September 22, 2017: contributor

In addition to the test vector updates, I made some updates, in a second commit, regarding casing requirements. Here's my rationale for these additional changes:

Encoding: HRP should always be lowercased. This is because:

From the existing BIP: "The lowercase form is used when determining a character's value for checksum purposes."
Encoders should always output a non-mixed case Bech32 string. Lowercase is canonical due the previous item, and the fact that the encoding charset is all lowercase. Otherwise, the HRP would need to be scanned to make sure it is consistent case and then upper/lower case according to that finding.

For presentation, or QR code use, a simple uppercasing can be run externally on the encoding result. This is consistent with the "A12UEL5L" test vector currently in the BIP, e.g.:

>>> segwit_addr.bech32_encode('a', []).upper()
'A12UEL5L'

Decoding: Because of the BIP requirement that, "The lowercase form is used when determining a character's value for checksum purposes.":

Decoding treats uppercase and lowercase inputs as equal. The hrp still needs to be lowercased before hrp expansion during checksum calculation.
The transcription problems associated with mixed-case are avoided without restriction on acceptance of mixed-case input for decoding.
Not accepting mixed-case input for decoding is seemingly punitive.

Tangentially related: https://github.com/sipa/bech32/issues/25

in bip-0173.mediawiki:190 in 52e6c12ab4 outdated

 190 | -the use of
 191 | +Encoders MUST always output an all lowercase Bech32 string.
 192 | +If an uppercase version of the encoding result is desired, (e.g.- for presentation purposes, or QR code use),
 193 | +then an uppercasing procedure can be performed external to the encoding process.
 194 | +
 195 | +Decoders MUST accept uppercase, lowercase, and mixed-case strings.

sipa commented at 7:07 PM on September 22, 2017:

Decoders must not accept mixed-case strings?

mruddy commented at 10:06 PM on September 22, 2017:

Not a typo. I think decoders should (and thus "must") accept mixed-case. Because of the existing requirement to use the lowercase form to determine a character's value for checksum purposes, there is no difference between upper and lower case characters in any position of the bech32 string. That applies to the HRP because it is also used to compute the checksum. So, on a micro level, I don't see a reason why a decoder should fail the input. On a macro level, I can see where you might be trying to discourage mixed case. But, the other changes I made also push in that direction (in fact fixing this: segwit_addr.bech32_encode('A', [])).

in bip-0173.mediawiki:289 in 52e6c12ab4 outdated

 286 |  * <tt>de1lg7wt</tt> + 0xFF: Invalid character in checksum
 287 | +* <tt>A1g7sgd8</tt>: The lowercase form is used when determining a character's value for checksum purposes.
 288 | +* <tt>A1G7SGD8</tt>: The lowercase form is used when determining a character's value for checksum purposes.
 289 | +
 290 | +The following are invalid Bech32 strings (with reason for invalidity):
 291 | +* <tt>pzry9x0s0muk</tt>: No separator character

sipa commented at 7:09 PM on September 22, 2017:

I think this section can be merged with the 'invalid Bech32 checksum' section.

mruddy commented at 10:14 PM on September 22, 2017:

That's how it was to start. I moved it to a new section because I thought it was misleading in the current section because "No separator character" can not affect checksum validity because the separator char is not used to compute the checksum.

sipa commented at 10:17 PM on September 22, 2017:

Right, but that's to for several of the existing examples in that section (like HRP character out of range, or length exceeded.

I think they can be combined into a single section with examples of invalid Bech32 strings.

sipa commented at 10:12 PM on September 22, 2017: member

Rejecting mixed case is very intentional. Doing so massively reduces the chance of accepting arbitrary text as valid Bech32.

mruddy commented at 10:18 PM on September 22, 2017: contributor

Rejecting mixed case is very intentional. Doing so massively reduces the chance of accepting arbitrary text as valid Bech32.

I don't see it that way. The input is always being canonicalized to lower anyways. Am I misunderstanding you somehow? I was working on a java reference impl encoder/decoder and it works. I'll put it up shortly.

sipa commented at 10:22 PM on September 22, 2017: member

@mruddy Just because it's canonicalized to lowercase for the purpose of checksum computation doesn't mean that it's valid. There is are extra requirements for validity besides checksum correctness (including maximum length, range of hrp characters, and no mixed case).

sipa commented at 10:23 PM on September 22, 2017: member

There is a PR for a Java implementation already: https://github.com/sipa/bech32/pull/19

I welcome review!

mruddy commented at 10:23 PM on September 22, 2017: contributor

Here's some java code. Maybe this will clarify:

https://github.com/mruddy/bech32/blob/master/src/org/bitcoin/bips/bip173/bech32/Bech32.java#L110

https://github.com/mruddy/bech32/blob/master/src/org/bitcoin/bips/bip173/bech32/Bech32.java#L177

mruddy commented at 10:24 PM on September 22, 2017: contributor

Yes, I saw that PR and thought that I could do better. This is my attempt https://github.com/mruddy/bech32/ :)

mruddy commented at 10:29 PM on September 22, 2017: contributor

Just because it's canonicalized to lowercase for the purpose of checksum computation doesn't mean that it's valid. There is are extra requirements for validity besides checksum correctness (including maximum length, range of hrp characters, and no mixed case).

Agreed, but the case of a-Z and A-Z is interchangeable due to the lowercase canonicalization. That's what I mean by mixed-case.

sipa commented at 10:30 PM on September 22, 2017: member

I still don't understand why you want to add "Decoder must accept mixed case". I absolutely think it should not.

mruddy commented at 10:31 PM on September 22, 2017: contributor

Because from a user standpoint, if they happen to transcribe and get the case wrong somehow, it should still be accepted since the result will be the same.

sipa commented at 10:33 PM on September 22, 2017: member

If they transcribe it wrong, the software should complain and they should go back and fix it.

Trying to correct it for them is a footgun. If they can accidentally misread and transcribe as the wrong case, they could equally have transcribed as just the wrong character - and you wouldn't want to correct that for them either (even though error correction algorithms exist).

Effectively, the consistent case is part of the consistency check of the address, together with the checksum.

mruddy commented at 10:40 PM on September 22, 2017: contributor

Well, "wrong" was not a great word to use. I should have said "different" because that's all it is -- not wrong, because they are equal. It's less code to be more user friendly and get the same result.

Maybe I won't be able to persuade you on this point. But, how about the rest of the changes? Once we have general agreement, I'll update accordingly.

sipa commented at 10:44 PM on September 22, 2017: member

Big concept ACK on the rest of the changes. I've had other reports that the language was unclear wrt the behaviour of the encoder and uppercase input, and wanted to improve it. This seems like a good solution.

I do want to verify the extra test cases, though.

mruddy commented at 10:49 PM on September 22, 2017: contributor

Cool, thanks.

Also, I was thinking of accepting mixed case not so much as trying to correct users as much as it was unifying code paths.

BTW, the test cases should be testable with the examples in my first comment. I also have those test cases in my java.

mruddy commented at 11:17 PM on September 22, 2017: contributor

FYI, I just added a 3rd commit to capture what we discussed. I'll squash it all when we're all set. I left it separate right now so you can easily see what I incrementally changed.

in bip-0173.mediawiki:271 in 242698e5a6 outdated

 266 | @@ -264,6 +267,9 @@ P2PKH addresses can be used.
 267 |  
 268 |  The following strings have a valid Bech32 checksum.
 269 |  * <tt>A12UEL5L</tt>
 270 | +* <tt>a12uel5l</tt>
 271 | +* <tt>10a06t8</tt>

sipa commented at 6:44 PM on September 23, 2017:

There is a requirement that the HRP is non-empty, so while you're technically right that this has a valid checksum, it's not testable... any correct decoder should fail with that as input.

So I think it's maybe better to rename the section to "The following strings are valid Bech32" (and the next section to "The following string are not valid Bech32 (with reason for invalidity)", and have these two example with empty HRP in that section (with a note that technically their checksum is valid).

mruddy commented at 10:35 PM on September 23, 2017:

I read the BIP contrary to that. My interpretation was that no such requirement that the HRP is non-empty is in the BIP. The BIP even says that the HRP's validity is application specific (which builds on top of generic Bech32). That's why this works:

>>> segwit_addr.bech32_create_checksum('', [])
[15, 29, 15, 26, 11, 7]
>>> segwit_addr.bech32_verify_checksum('', [] + [15, 29, 15, 26, 11, 7])
True

It seems odd that we'd encode something that can't be decoded. I think the reference decoder is implementing a phantom requirement right now. If we add a requirement that HRP be non-empty, I think that's OK, just other things need to be updated also. Am I making sense?

sipa commented at 11:00 PM on September 23, 2017:

You're right - it seems the non-empty HRP isn't mentioned in the BIP anywhere, except in the test vectors.

I prefer modifying it so that it does require that, and is uniformly implemented in decoders and encoders.

mruddy commented at 11:05 PM on September 23, 2017:

ok, i'll update to that effect.

btw, the non-checksum data portion can be empty right now. do you want to keep that as it is now? i think so, just checking.

sipa commented at 11:08 PM on September 23, 2017:

Yes, sounds good. Thanks!

mruddy renamed this:
~~bip-0173 test vector updates and casing requirements~~
bip-0173: test vectors, HRP, and casing requirements updates
on Sep 24, 2017

mruddy commented at 12:17 AM on September 24, 2017: contributor

I've updated and squashed everything back into one commit. Besides this BIP update, the reference code will need some updates.

Just wanted to point out that a couple of the things to fix, 1) mixed case encoding output and 2) using the upper case form of the HRP for checksum creation, are identified by this test:

>>> segwit_addr.bech32_encode('a', [])
'a12uel5l'
>>> segwit_addr.bech32_encode('A', [])
'A1g7sgd8'

bip-0173: test vectors; HRP and casing requirements 96d39e8025

mruddy commented at 3:31 AM on September 24, 2017: contributor

Added some info about conversion to US-ASCII and unmappable characters.

sipa commented at 3:45 AM on September 24, 2017: member

ACK 96d39e80255548c5d95f01f577e59bb90f9d88a9, I've verified all the test vectors.

Thank you!

mruddy commented at 12:08 PM on September 24, 2017: contributor

@sipa thanks for reviewing! One more thing. I think we should definitely use UTF-8 instead of US-ASCII.

I've added a commit and will squash it upon approval.

Doing this will avoid the unmappable character issue and make using this easier for many people using higher level languages that use unicode strings. UTF-8 is compatible with ASCII because the mapping for ASCII uncompatible code points goes to multibyte, with each byte being out of ASCII range (which we'll already reject). This is easy to see in the table at the top of the description at https://en.wikipedia.org/wiki/UTF-8.

>>> '\U00000080'.encode('utf-8')
b'\xc2\x80'
>>> '\U00000800'.encode('utf-8')
b'\xe0\xa0\x80'
>>> '\U00010000'.encode('utf-8')
b'\xf0\x90\x80\x80'

sipa commented at 5:17 PM on September 24, 2017: member

@mruddy Yes, that's why it's restricted to the range of printable characters in the lower 7 bits, to avoid issues with character sets.

You also can't say "UTF-8 code points" (as UTF-8 is a way of encoding code points as bytes). You could say Unicode code points (or Unicode characters), though.

I think it's unnecessary though. The text does not say "The HRP must consist of Unicode characters, converted to US-ASCII" - that would be prone to having character conversion issues. It says "must consist of ASCII characters". If you're using it with other characters, regardless of how you encoding them, it's already used incorrectly.

mruddy commented at 5:43 PM on September 24, 2017: contributor

You also can't say "UTF-8 code points" (as UTF-8 is a way of encoding code points as bytes). You could say Unicode code points (or Unicode characters), though.

Agree, good catch.

The reason I bring this up is because of languages like Java (but not just Java) where the String type is UTF-16. So, that's what a user of the library is going to have and then it's going to be sent in and the library is going to have to encode to what Bech32 needs. For example: https://github.com/mruddy/bech32/blob/master/src/com/github/mruddy/bip173/bech32/Bech32.java#L150-L152 If Bech32 needs UTF-8, then there is no mapping issue. If Bech32 needs US-ASCII, then there can be problems introduced or exceptions generated. Otherwise the library user needs to know to convert to UTF-8 and then pass in a byte array, instead of the native String type, or do a code point check before sending the String in. Since we have the 7-bit restriction, this is safe. I think it makes the BIP more developer friendly.

sipa commented at 5:45 PM on September 24, 2017: member

The API should just take in a String, and then check whether all characters are in the ASCII range. UTF-8 shouldn't be involved at any point.

mruddy commented at 6:25 PM on September 24, 2017: contributor

Oh, I suppose :)

I dropped the UTF-8 commit and it's back to the one that you ACK'd, 96d39e80255548c5d95f01f577e59bb90f9d88a9.

Thanks!

luke-jr merged this on Sep 24, 2017

luke-jr closed this on Sep 24, 2017

mruddy deleted the branch on Sep 24, 2017