unicode normalization in BIP-39 #1429

pull grondilu wants to merge 1 commits into bitcoin:master from grondilu:master changing 4 files +3392 −3392

grondilu commented at 11:10 am on March 4, 2023: none
Four languages have word lists encoded in non-normalized Unicode. See canonical and compatibility equivalence.

This makes display and comparison with input fail on some terminals (admittedly I can only report on alacritty, though).
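As a hedged illustration of the comparison problem (not code from this PR): the same visible word can be stored as different code-point sequences, so a naive string comparison fails unless both sides are normalized to the same form first. The sample word and the helper name `same_word` are assumptions chosen for the sketch.

```python
import unicodedata

# Illustrative word only; any word containing an accented letter behaves the same way.
composed = "\u00e1baco"                                # 'á' as a single code point (NFC form)
decomposed = unicodedata.normalize("NFKD", composed)   # 'a' followed by U+0301 combining acute

def same_word(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

print(composed == decomposed)            # naive comparison of code points
print(same_word(composed, decomposed))   # comparison after normalization
print(composed.encode("utf-8").hex(), decomposed.encode("utf-8").hex())
```

Normalizing user input before lookup fixes the comparison without touching the word list files themselves.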

I suggest replacing all these files with normalized versions.

I obtained these normalized versions with bash/raku (raku always outputs normalized Unicode):
bip-0039 $ for f in *.txt; do raku -e "print q{$f}.IO.slurp" > /tmp/$f; mv {/tmp/,}$f; done
unicode normalization 49f30865b6
apoelstra commented at 2:05 pm on March 4, 2023: contributor

Are the non-normalized versions fed into the hash function that derives the BIP32 seed? If so, changing the encoding will cause seeds to change and people to be unable to recover their coins.

This isn’t concern trolling – I really think this might be true, but I don’t know.
roconnor-blockstream commented at 3:01 pm on March 4, 2023: contributor

I don’t know what I’m talking about, but…

https://docs.raku.org/language/unicode suggests that raku outputs NFC data, but BIP-39 specifically calls for NFKD data. Thus I’m guessing the word lists are written in NFKD format, as one would expect. (Edit: as @apoelstra notes above, this isn’t just a matter of display, rather the exact byte sequence is fed into SHA-512, and any change in the byte sequences of these words will change the resulting master seed and all the associated public and private keys, destroying any existing wallet.)
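To make the concern concrete, here is a minimal sketch of BIP-39 seed derivation as the specification describes it: PBKDF2 with HMAC-SHA512, 2048 iterations, the mnemonic as password and "mnemonic" plus the passphrase as salt, both NFKD-normalized and UTF-8 encoded. The `normalize` flag is a hypothetical switch added only to imitate an implementation that hashes the stored bytes directly, and the sample string is not a valid mnemonic.

```python
import hashlib
import unicodedata

def bip39_seed(mnemonic: str, passphrase: str = "", normalize: bool = True) -> bytes:
    """Derive a 64-byte seed; normalize=False imitates code that skips the NFKD step."""
    if normalize:
        mnemonic = unicodedata.normalize("NFKD", mnemonic)
        passphrase = unicodedata.normalize("NFKD", passphrase)
    salt = ("mnemonic" + passphrase).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha512", mnemonic.encode("utf-8"), salt, 2048, dklen=64)

# Illustrative string only, not a checksum-valid mnemonic.
nfc_words = unicodedata.normalize("NFC", "\u00e1baco \u00e1baco")
nfkd_words = unicodedata.normalize("NFKD", nfc_words)

# Hashing the raw bytes: the two encodings give different seeds.
print(bip39_seed(nfc_words, normalize=False) == bip39_seed(nfkd_words, normalize=False))
# Normalizing first: both encodings give the same seed.
print(bip39_seed(nfc_words) == bip39_seed(nfkd_words))
```

Under these assumptions, re-encoding the word list files would only be harmless for software that normalizes before hashing; anything that hashes the stored bytes as-is would derive a different seed.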

How wallets display these words is probably out of scope. I think normalizing them for display could be reasonable, but TBH it sounds more like alacritty is “broken” for not treating combining characters correctly.
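A small sketch of what display-only normalization could look like, assuming a wallet keeps the stored NFKD word untouched and composes a separate copy for rendering. The helper names `is_nfkd` and `for_display` are hypothetical, and `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata

def is_nfkd(text: str) -> bool:
    """Report whether the text is already in NFKD form (Python 3.8+)."""
    return unicodedata.is_normalized("NFKD", text)

def for_display(text: str) -> str:
    """Return a composed (NFC) copy for rendering; the input is left unchanged."""
    return unicodedata.normalize("NFC", text)

stored = "a\u0301baco"        # illustrative word in decomposed form, as stored
shown = for_display(stored)   # composed copy handed to the terminal or UI

print(is_nfkd(stored), is_nfkd(shown))
print(len(stored), len(shown))
```

Only the copy returned by `for_display` changes form; whatever is hashed or compared would still use the stored NFKD string.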

BIP-39 is listed as “Unanimously Discourage for implementation” for this and many other reasons. We really need to follow that advice and move on to alternatives such as SLIP-39 or Codex32.

P.S. @luke-jr it would be helpful to get a BIP number assigned to Codex32 so that we can start directing people to BIP-39 alternatives.
grondilu commented at 4:04 pm on March 4, 2023: none

Are the non-normalized versions fed into the hash function that derives the BIP32 seed? If so, changing the encoding will cause seeds to change and people to be unable to recover their coins.

I think you are right. A different encoding will generate a different seed. So I suppose it is a bad idea to change this.

My bad.
grondilu commented at 4:10 pm on March 4, 2023: none

From https://docs.raku.org/language/unicode it suggests that raku is outputting NFC data, but BIP-39 specifically calls for NFKD data.

It appears you are correct. This PR is unfounded. Sorry for the inconvenience.
grondilu closed this on Mar 4, 2023
roconnor-blockstream commented at 4:49 pm on March 4, 2023: contributor

No worries.

Contributors
grondilu apoelstra roconnor-blockstream