Fix two errors in the BIP 39 French wordlist #622

nym-zone commented at 5:45 AM on January 1, 2018: contributor

The BIP 39 French wordlist contains two significant technical errors:

Byte Order Marker (BOM) U+FEFF at the beginning of the first line, preceding the word “abaisser”.
No newline '\n' char terminating the last line, after “zoologie”.

The former may cause user loss of funds. An implementation which generates a mnemonic phrase and also turns it into a BIP 39 seed value may feed the string "<U+FEFF>abaisser" to the KDF, while displaying the word “abaisser” to the user. Of course, it cannot be expected that the user would enter "<U+FEFF>abaisser" upon attempt to restore a wallet. In the face of a buggy wordlist, whitespace handling and normalization cannot be absolutely relied on to remove a notoriously mischievous character. Those who provide technical support may be well advised to ask French users with unrestorable wallets, “Did your mnemonic phrase contain the word ‘abaisser’?”

The latter broke the shell script I use to massage wordlists into C sources when building easyseed.

I know of only one commonplace platform where software regularly prepends UTF-8 files with a spurious U+FEFF, and oftentimes omits a line terminator on the last line even when asked to create a Unix ('\n') text file. It is RECOMMENDED that new wordlists be examined for correctness using standard shell tools on a sane platform.

Fix two errors in the BIP 39 French wordlist

The BIP 39 wordlist contained two significant technical errors:

 - Byte Order Marker (BOM) U+FEFF at the beginning of the first line,
   preceding the word "abaisser".

 - No newline '\n' char terminating the last line, after "zoologie".

The former may cause user loss of funds.  An implementation which
generates a mnemonic phrase and also turns it into a BIP 39 seed value
may feed the string "<U+FEFF>abaisser" to the KDF, while displaying the
word "abaisser" to the user.  Of course, it cannot be expected that the
user would enter "<U+FEFF>abaisser" upon attempt to restore a wallet.
In the face of a buggy wordlist, whitespace handling and normalization
cannot be absolutely relied on to remove a notoriously mischievous
character.  Those who provide technical support may be well advised to
ask French users with unrestorable wallets, "Did your mnemonic phrase
contain the word 'abaisser'?"

The latter broke the shell script I use to massage wordlists into C
sources when building https://github.com/nym-zone/easyseed .

I know of only one commonplace platform where software regularly
prepends UTF-8 files with a spurious U+FEFF, and oftentimes omits a line
terminator on the last line even when asked to create a Unix ('\n') text
file.  It is RECOMMENDED that new wordlists be examined for correctness
using standard shell tools on a sane platform.

50c4f1255e

dabura667 commented at 6:30 AM on January 1, 2018: none

ACK 50c4f1255eed6b1c08ca9c20b4d8f380879ed2f5

I checked all other current wordlists and they all:

Did not have the BOM
Did have a trailing LF line break

I pulled @nym-zone 's branch locally and inspected the file using a hex editor to double check the new french.txt is fixed.

As for @nym-zone's concerns:

The only wallet I know of that shows users French phrases is Copay, and they don't use the files as-is from the BIP. https://github.com/bitpay/bitcore-mnemonic/blob/master/lib/words/french.js

If there is a wallet that uses the exact file from the BIP that shows french users phrases in French. That warning applies... but I don't think any wallet does so.

nym-zone commented at 7:29 AM on January 1, 2018: contributor

Thanks, @dabura667. I should have clarified, I discovered this when writing a mnemonic phrase generator tool which embeds all eight wordlists currently in the BIP repository. I reported French, because that’s what was broken. Insofar as I could when dealing with multiple languages I do not know, I exercised reasonable care to assure the integrity of all the wordlists.

(Then I wrote an automated battery of runtime tests against the compile-time SHA-256 hashes of the files, in case somebody out there may have buggy tools which mangle UTF-8 when building...)

I just hexdumped Copay’s french.js to double-check; and its string representation is fine as for the ironically self-abasing “abaisser”.

nym-zone commented at 9:11 PM on January 5, 2018: contributor

Checking additional implementations in the wild, I have not (yet?) found any which carry the spurious U+FEFF. But it is not only a matter of Copay.

From a popular implementation, widely used because it runs in a web browser: https://github.com/iancoleman/bip39/commit/3a8dbe99b4be2084794d1191a06eadc38db0917b, checking wordlist_french.js:

00000030  20 3a 20 57 4f 52 44 4c  49 53 54 53 3b 0a 57 4f  | : WORDLISTS;.WO|
00000040  52 44 4c 49 53 54 53 5b  22 66 72 65 6e 63 68 22  |RDLISTS["french"|
00000050  5d 20 3d 20 5b 0a 22 61  62 61 69 73 73 65 72 22  |] = [."abaisser"|

From https://github.com/NovaCrypto/BIP39/pull/4/commits/5ecf5687564e2c0af0bae133e1fe66095e80ad98 (see also), checking French.java:

00000520  20 6e 65 77 20 53 74 72  69 6e 67 5b 5d 7b 0a 20  | new String[]{. |
00000530  20 20 20 20 20 20 20 20  20 20 20 22 61 62 61 69  |           "abai|
00000540  73 73 65 72 22 2c 0a 20  20 20 20 20 20 20 20 20  |sser",.         |

I will keep my eye out for other French-supporting BIP 39 implementations.

nym-zone referenced this in commit 3a8dbe99b4 on Jan 5, 2018

nym-zone referenced this in commit 234c66cd5d on Jan 7, 2018

nym-zone referenced this in commit ba25dfac56 on Jan 7, 2018

nym-zone commented at 5:13 AM on January 9, 2018: contributor

ping @Kirvx @NicolasDorier @vosine @luke-jr

Bugfix on #152, fbe7196.

This is a simple technical fix which changes no functionality in correct implementations, but can help prevent implementation errors.

Kirvx commented at 7:46 AM on January 9, 2018: contributor

ACK. Thanks @nym-zone and @dabura667. @voisine Are breadwallet french users affected?

nym-zone referenced this in commit fd14d42752 on Jan 9, 2018

nym-zone commented at 8:25 AM on January 9, 2018: contributor

Thanks, @Kirvx.

I didn’t mean to imply anything about Breadwallet; I was simply trying to get the attention of persons with significant involvement in #152 and/or maintenance of BIP 39 wordlists and/or BIP maintenance. But, good idea! Now, I checked the only Breadwallet French file I could find on a brief search, BreadWallet/fr.lproj/BIP39Words.plist from breadwallet/breadwallet-legacy@fd14d42; and it is fine:

$ hd -s 173 -n 16 BIP39Words.plist
000000ad  3c 73 74 72 69 6e 67 3e  61 62 61 69 73 73 65 72  |<string>abaisser|
000000bd

(That file is delimited by XML, not newlines; so as long as “zoologie” is there, that’s fine also.)

By the way, thank you for your work creating this list. I strongly urge that BIP 39 should have broad language support; and French is an important language. I can see in #152 how much work was put in to make and refine the list. Too bad, it seems abaisser wanted to abase itself.

Kirvx commented at 12:17 PM on January 9, 2018: contributor

😃

Thank you for checking breadwallet @nym-zone 👍

luke-jr commented at 6:05 AM on January 10, 2018: member

@slush0 @prusnak @voisine @ebfull

luke-jr added the label Proposed BIP modification on Jan 10, 2018

NicolasDorier commented at 2:16 AM on January 11, 2018: contributor

would be nice to add a test vector. NBitcoin also has french hard coded... I could not find a proper tool on windows (sublime, and vscode don't work for some reason) showing me hidden characters though.

evoskuil commented at 2:32 AM on January 11, 2018: contributor

Try notepad++

nym-zone commented at 4:23 AM on January 11, 2018: contributor

@NicolasDorier:

would be nice to add a test vector.

From the nym-zone/easyseed@c7d698a set of a dozen languages’ test vectors, I give you French test vectors in a convenient JSON format. Please run them with your implementation.

As stated in the preceding commit log from two days ago (nym-zone/easyseed@5f35cd0, q.v.), these vectors are specifically designed to flunk implementations which do not perform proper Unicode NFKD normalization—even with words not containing diacritics (including the English wordlist, too).

(I may rename or modify things; but the versioned link will obviously remain stable.)

nym-zone commented at 5:05 AM on January 11, 2018: contributor

Pertinent to the below, but also as a general point: U+FEFF is not removed or otherwise affected by Unicode NFKD normalization. To shortcut discussion of Unicode character properties, here is an empirical object demonstration with an ICU utility:

$ echo -n $'a\ufeffb' | uconv -x '::nfkd;' | hd
00000000  61 ef bb bf 62                                    |a...b|
00000005

Of course, due to its double-personality as “ZERO WIDTH NO-BREAK SPACE” (ZWNBSP), it’s also zero-width—and therefore invisible on any display which supports Unicode:

$ echo $'a\ufeffb'
ab

(Check, there is an extra character in between there.)

Thus if it gets fed to PBKDF2 for BIP 39 seed generation, and the user is shown the corresponding mnemonic, then the user will write down a phrase which will not restore the wallet. That’s why I panicked when I saw this:

$ head -c16 french.txt | hd
00000000  ef bb bf 61 62 61 69 73  73 65 72 0a 61 62 61 6e  |...abaisser.aban|
00000010

@NicolasDorier, IWordlistSource.cs from MetacoSA/NBitcoin@45a0ad9 does not contain this bug:

$ hd -s 80771 -n 19 IWordlistSource.cs
00013b83  22 66 72 65 6e 63 68 22  2c 20 22 61 62 61 69 73  |"french", "abais|
00013b93  73 65 72                                          |ser|
00013b96

If you used a Windows text editor to somehow copypaste that, I would not expect it to have picked up the U+FEFF. AFAIK, the Windows tools which (unnecessarily) produce a Byte Order Marker for octet-stream UTF-8 tend to not consider it part of the text. But then, I also would not rely on this.

My greater concern is automatically processed text. I know for a fact that the U+FEFF can be slurped up from french.txt, because that’s what happened to me. In my Makefile, I use Unix shell tools to turn the wordlist .txt files into C const char * arrays. The resulting French C array did contain the U+FEFF before I fixed it.

Had I foisted that on users, somebody would have lost funds sooner or later.

(Apologies for the double post. My bad.)

nym-zone referenced this in commit 45a0ad9e32 on Jan 11, 2018

NicolasDorier commented at 1:46 PM on January 13, 2018: contributor

@nym-zone I dodged a bullet.

In .NET, the default encoding is UTF8 BOM: when reading a file it checks BOM, and if present, remove it from the string.

When writing to the file, the default UTF8 BOM bytes is added.

However, UTF8Encoding.GetBytes() does not emit it, even if the UTF8Encoding objects has BOM setting set, for that a call to UTF8Encoding.GetPreamble() is needed.

TL;DR: My implementation is conform... by chance. (Tested against your vector) Hopefully the test vectors of BIP39 (which I added since beginning) should have caught the bug though.

NicolasDorier referenced this in commit d91fa4ad47 on Jan 13, 2018

dabura667 commented at 3:32 PM on January 13, 2018: none

@NicolasDorier the test vector generator included with Trezor-mnemonic library has all 0x00 as a vector generator for edge cases which results in all first words except the last.

cornfeedhobo commented at 10:38 PM on July 4, 2018: none

Can we get a resolution here? My understanding of these wordlist definitions negates the need for the BOM, and it is not recommended for use.

NicolasDorier commented at 11:53 AM on July 5, 2018: contributor

why this has not been merged?

voisine commented at 5:29 PM on July 5, 2018: contributor

ACK

cornfeedhobo commented at 12:24 AM on August 1, 2018: none

@Kirvx Could we get your ACK on this?

Kirvx commented at 4:59 AM on August 1, 2018: contributor

ACK

cornfeedhobo commented at 10:17 PM on August 1, 2018: none

@luke-jr Anything else needed to get this merged?

luke-jr merged this on Aug 9, 2018

luke-jr closed this on Aug 9, 2018