Fix two errors in the BIP 39 French wordlist #622

nym-zone wants to merge 1 commit into bitcoin:master from nym-zone:fix39french, changing 1 file (+2 −2)
  1. nym-zone commented at 5:45 AM on January 1, 2018: contributor

    The BIP 39 French wordlist contains two significant technical errors:

    • Byte Order Marker (BOM) U+FEFF at the beginning of the first line, preceding the word “abaisser”.

    • No newline '\n' char terminating the last line, after “zoologie”.

    The former may cause user loss of funds. An implementation which generates a mnemonic phrase and also turns it into a BIP 39 seed value may feed the string "<U+FEFF>abaisser" to the KDF, while displaying the word “abaisser” to the user. Of course, it cannot be expected that the user would enter "<U+FEFF>abaisser" upon attempt to restore a wallet. In the face of a buggy wordlist, whitespace handling and normalization cannot be absolutely relied on to remove a notoriously mischievous character. Those who provide technical support may be well advised to ask French users with unrestorable wallets, “Did your mnemonic phrase contain the word ‘abaisser’?”

    The latter broke the shell script I use to massage wordlists into C sources when building easyseed.

    I know of only one commonplace platform where software regularly prepends UTF-8 files with a spurious U+FEFF, and oftentimes omits a line terminator on the last line even when asked to create a Unix ('\n') text file. It is RECOMMENDED that new wordlists be examined for correctness using standard shell tools on a sane platform.
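A minimal sketch of the kind of check recommended above, written in Python rather than with shell tools. The function name `check_wordlist` and its parameters are hypothetical; the 2048-word default reflects the BIP 39 wordlist size. It inspects the raw bytes so that no decoder gets a chance to hide the BOM.

```python
BOM_UTF8 = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def check_wordlist(data: bytes, expected_words: int = 2048) -> list:
    """Return a list of problems found in raw wordlist bytes (empty if none)."""
    problems = []
    if data.startswith(BOM_UTF8):
        problems.append("starts with a UTF-8 byte order mark (U+FEFF)")
    if not data.endswith(b"\n"):
        problems.append("last line is not terminated by '\\n'")
    words = data.decode("utf-8").split("\n")
    if words and words[-1] == "":
        words.pop()  # drop the empty field that follows the final newline
    if len(words) != expected_words:
        problems.append(f"expected {expected_words} words, found {len(words)}")
    return problems


# Tiny stand-in inputs built from the two words quoted in this thread.
print(check_wordlist(b"abaisser\nzoologie\n", expected_words=2))
print(check_wordlist(b"\xef\xbb\xbfabaisser\nzoologie", expected_words=2))
```

For a real file, the bytes would be read with `open(path, "rb").read()` and passed in unchanged.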

  2. Fix two errors in the BIP 39 French wordlist
    The BIP 39 wordlist contained two significant technical errors:
    
     - Byte Order Marker (BOM) U+FEFF at the beginning of the first line,
       preceding the word "abaisser".
    
     - No newline '\n' char terminating the last line, after "zoologie".
    
    The former may cause user loss of funds.  An implementation which
    generates a mnemonic phrase and also turns it into a BIP 39 seed value
    may feed the string "<U+FEFF>abaisser" to the KDF, while displaying the
    word "abaisser" to the user.  Of course, it cannot be expected that the
    user would enter "<U+FEFF>abaisser" upon attempt to restore a wallet.
    In the face of a buggy wordlist, whitespace handling and normalization
    cannot be absolutely relied on to remove a notoriously mischievous
    character.  Those who provide technical support may be well advised to
    ask French users with unrestorable wallets, "Did your mnemonic phrase
    contain the word 'abaisser'?"
    
    The latter broke the shell script I use to massage wordlists into C
    sources when building https://github.com/nym-zone/easyseed .
    
    I know of only one commonplace platform where software regularly
    prepends UTF-8 files with a spurious U+FEFF, and oftentimes omits a line
    terminator on the last line even when asked to create a Unix ('\n') text
    file.  It is RECOMMENDED that new wordlists be examined for correctness
    using standard shell tools on a sane platform.
    50c4f1255e
  3. dabura667 commented at 6:30 AM on January 1, 2018: none

    ACK 50c4f1255eed6b1c08ca9c20b4d8f380879ed2f5

    I checked all other current wordlists and they all:

    1. Did not have the BOM
    2. Did have a trailing LF line break

    I pulled @nym-zone 's branch locally and inspected the file using a hex editor to double check the new french.txt is fixed.

    As for @nym-zone's concerns:

    The only wallet I know of that shows users French phrases is Copay, and they don't use the files as-is from the BIP. https://github.com/bitpay/bitcore-mnemonic/blob/master/lib/words/french.js

    If there is a wallet that uses the exact file from the BIP and shows French users phrases in French, that warning applies... but I don't think any wallet does so.

  4. nym-zone commented at 7:29 AM on January 1, 2018: contributor

    Thanks, @dabura667. I should have clarified, I discovered this when writing a mnemonic phrase generator tool which embeds all eight wordlists currently in the BIP repository. I reported French, because that’s what was broken. Insofar as I could when dealing with multiple languages I do not know, I exercised reasonable care to assure the integrity of all the wordlists.

    (Then I wrote an automated battery of runtime tests against the compile-time SHA-256 hashes of the files, in case somebody out there may have buggy tools which mangle UTF-8 when building...)
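As an illustration of this kind of integrity check (not the actual easyseed test code, which belongs to a C build), here is a small Python sketch. The names `verify_wordlist` and `PINNED` are hypothetical, and the pinned hash is computed from stand-in data rather than from any real wordlist file.

```python
import hashlib


def verify_wordlist(data: bytes, pinned_sha256_hex: str) -> bool:
    """True if the embedded wordlist bytes match the hash pinned at build time."""
    return hashlib.sha256(data).hexdigest() == pinned_sha256_hex


# Illustrative stand-in data; a real build would pin the hash of the actual file.
clean = "abaisser\nzoologie\n".encode("utf-8")
PINNED = hashlib.sha256(clean).hexdigest()

mangled = b"\xef\xbb\xbf" + clean  # what a BOM-inserting tool might produce

print(verify_wordlist(clean, PINNED))    # True
print(verify_wordlist(mangled, PINNED))  # False
```

Because the comparison is over raw bytes, any tool that silently rewrites the file, whether by adding a BOM or changing line endings, makes the check fail.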

    I just hexdumped Copay’s french.js to double-check; and its string representation is fine as for the ironically self-abasing “abaisser”.

  5. nym-zone cross-referenced this on Jan 5, 2018 from issue French word list and test vectors by westonal
  6. nym-zone commented at 9:11 PM on January 5, 2018: contributor

    Checking additional implementations in the wild, I have not (yet?) found any which carry the spurious U+FEFF. But it is not only a matter of Copay.

    From a popular implementation, widely used because it runs in a web browser: https://github.com/iancoleman/bip39/commit/3a8dbe99b4be2084794d1191a06eadc38db0917b, checking wordlist_french.js:

    00000030  20 3a 20 57 4f 52 44 4c  49 53 54 53 3b 0a 57 4f  | : WORDLISTS;.WO|
    00000040  52 44 4c 49 53 54 53 5b  22 66 72 65 6e 63 68 22  |RDLISTS["french"|
    00000050  5d 20 3d 20 5b 0a 22 61  62 61 69 73 73 65 72 22  |] = [."abaisser"|
    

    From https://github.com/NovaCrypto/BIP39/pull/4/commits/5ecf5687564e2c0af0bae133e1fe66095e80ad98 (see also), checking French.java:

    00000520  20 6e 65 77 20 53 74 72  69 6e 67 5b 5d 7b 0a 20  | new String[]{. |
    00000530  20 20 20 20 20 20 20 20  20 20 20 22 61 62 61 69  |           "abai|
    00000540  73 73 65 72 22 2c 0a 20  20 20 20 20 20 20 20 20  |sser",.         |
    

    I will keep my eye out for other French-supporting BIP 39 implementations.

  7. nym-zone referenced this in commit 3a8dbe99b4 on Jan 5, 2018
  8. nym-zone referenced this in commit 234c66cd5d on Jan 7, 2018
  9. nym-zone referenced this in commit ba25dfac56 on Jan 7, 2018
  10. nym-zone cross-referenced this on Jan 8, 2018 from issue BIP39: Adds Russian word list by farazdagi
  11. nym-zone cross-referenced this on Jan 8, 2018 from issue Czech wordlist for BIP0039 by zizelevak
  12. nym-zone commented at 5:13 AM on January 9, 2018: contributor

    ping @Kirvx @NicolasDorier @voisine @luke-jr

    Bugfix on #152, fbe7196.

    This is a simple technical fix which changes no functionality in correct implementations, but can help prevent implementation errors.

  13. Kirvx commented at 7:46 AM on January 9, 2018: contributor

    ACK. Thanks @nym-zone and @dabura667. @voisine Are breadwallet french users affected?

  14. nym-zone referenced this in commit fd14d42752 on Jan 9, 2018
  15. nym-zone commented at 8:25 AM on January 9, 2018: contributor

    Thanks, @Kirvx.

    I didn’t mean to imply anything about Breadwallet; I was simply trying to get the attention of persons with significant involvement in #152 and/or maintenance of BIP 39 wordlists and/or BIP maintenance. But, good idea! Now, I checked the only Breadwallet French file I could find on a brief search, BreadWallet/fr.lproj/BIP39Words.plist from breadwallet/breadwallet-legacy@fd14d42; and it is fine:

    $ hd -s 173 -n 16 BIP39Words.plist
    000000ad  3c 73 74 72 69 6e 67 3e  61 62 61 69 73 73 65 72  |<string>abaisser|
    000000bd
    

    (That file is delimited by XML, not newlines; so as long as “zoologie” is there, that’s fine also.)

    By the way, thank you for your work creating this list. I strongly urge that BIP 39 should have broad language support; and French is an important language. I can see in #152 how much work was put in to make and refine the list. Too bad, it seems abaisser wanted to abase itself.

  16. Kirvx commented at 12:17 PM on January 9, 2018: contributor

    😃

    Thank you for checking breadwallet @nym-zone 👍

  17. luke-jr commented at 6:05 AM on January 10, 2018: member
  18. luke-jr added the label Proposed BIP modification on Jan 10, 2018
  19. NicolasDorier commented at 2:16 AM on January 11, 2018: contributor

    It would be nice to add a test vector. NBitcoin also has French hard-coded... However, I could not find a proper tool on Windows that shows hidden characters (Sublime and VS Code don't work for some reason).

  20. evoskuil commented at 2:32 AM on January 11, 2018: contributor

    Try notepad++

  21. nym-zone commented at 4:23 AM on January 11, 2018: contributor

    @NicolasDorier:

    would be nice to add a test vector.

    From the nym-zone/easyseed@c7d698a set of a dozen languages’ test vectors, I give you French test vectors in a convenient JSON format. Please run them with your implementation.

    As stated in the preceding commit log from two days ago (nym-zone/easyseed@5f35cd0, q.v.), these vectors are specifically designed to flunk implementations which do not perform proper Unicode NFKD normalization—even with words not containing diacritics (including the English wordlist, too).

    (I may rename or modify things; but the versioned link will obviously remain stable.)

  22. nym-zone commented at 5:05 AM on January 11, 2018: contributor

    Pertinent to the below, but also as a general point: U+FEFF is not removed or otherwise affected by Unicode NFKD normalization. To shortcut discussion of Unicode character properties, here is an empirical object demonstration with an ICU utility:

    $ echo -n $'a\ufeffb' | uconv -x '::nfkd;' | hd
    00000000  61 ef bb bf 62                                    |a...b|
    00000005
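The same observation can be reproduced without ICU. The following sketch uses Python's standard `unicodedata` module in place of `uconv`; it is an equivalent check, not the tool used above.

```python
import unicodedata

s = "a\ufeffb"
normalized = unicodedata.normalize("NFKD", s)

print(normalized == s)                   # True: U+FEFF survives NFKD
print(normalized.encode("utf-8").hex())  # 61efbbbf62, the same bytes as the hexdump
print(len(normalized))                   # 3 code points, although only "ab" is visible
```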
    

    Of course, due to its double-personality as “ZERO WIDTH NO-BREAK SPACE” (ZWNBSP), it’s also zero-width—and therefore invisible on any display which supports Unicode:

    $ echo $'a\ufeffb'
    ab
    

    (Check, there is an extra character in between there.)

    Thus if it gets fed to PBKDF2 for BIP 39 seed generation, and the user is shown the corresponding mnemonic, then the user will write down a phrase which will not restore the wallet. That’s why I panicked when I saw this:

    $ head -c16 french.txt | hd
    00000000  ef bb bf 61 62 61 69 73  73 65 72 0a 61 62 61 6e  |...abaisser.aban|
    00000010
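To make the consequence concrete, here is a hedged Python sketch of BIP 39 seed derivation (PBKDF2-HMAC-SHA512, 2048 iterations, salt "mnemonic" plus passphrase, 64-byte output). The helper name `bip39_seed` is hypothetical, and the sample phrase is built only from the two words quoted in this thread; it is not claimed to carry a valid BIP 39 checksum.

```python
import hashlib
import unicodedata


def bip39_seed(mnemonic: str, passphrase: str = "") -> bytes:
    """Derive a 64-byte seed the way BIP 39 describes, after NFKD normalization."""
    password = unicodedata.normalize("NFKD", mnemonic).encode("utf-8")
    salt = unicodedata.normalize("NFKD", "mnemonic" + passphrase).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha512", password, salt, 2048, dklen=64)


# Illustrative phrase only; checksum validity does not matter to the KDF.
shown_to_user = "abaisser zoologie abaisser"
fed_to_kdf = "\ufeff" + shown_to_user  # first word as read from the buggy file

print(bip39_seed(shown_to_user) == bip39_seed(fed_to_kdf))  # False
```

The two strings render identically on screen, yet they derive unrelated seeds, so the phrase the user wrote down cannot restore the wallet.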
    

    @NicolasDorier, IWordlistSource.cs from MetacoSA/NBitcoin@45a0ad9 does not contain this bug:

    $ hd -s 80771 -n 19 IWordlistSource.cs
    00013b83  22 66 72 65 6e 63 68 22  2c 20 22 61 62 61 69 73  |"french", "abais|
    00013b93  73 65 72                                          |ser|
    00013b96
    

    If you used a Windows text editor to somehow copypaste that, I would not expect it to have picked up the U+FEFF. AFAIK, the Windows tools which (unnecessarily) produce a Byte Order Marker for octet-stream UTF-8 tend to not consider it part of the text. But then, I also would not rely on this.

    My greater concern is automatically processed text. I know for a fact that the U+FEFF can be slurped up from french.txt, because that’s what happened to me. In my Makefile, I use Unix shell tools to turn the wordlist .txt files into C const char * arrays. The resulting French C array did contain the U+FEFF before I fixed it.
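A sketch of a defensive version of such a conversion step, in Python rather than the Makefile and shell tools mentioned above. The function name `wordlist_to_c_array` is hypothetical, and this is not the easyseed build script. It refuses a BOM or a missing final newline instead of silently passing them into the generated source.

```python
def wordlist_to_c_array(data: bytes, name: str) -> str:
    """Render raw wordlist bytes as a C `const char *` array definition."""
    if data.startswith(b"\xef\xbb\xbf"):
        raise ValueError("wordlist starts with a UTF-8 BOM (U+FEFF)")
    if not data.endswith(b"\n"):
        raise ValueError("wordlist does not end with a newline")
    words = data.decode("utf-8").split("\n")[:-1]  # drop field after last '\n'
    for w in words:
        # isprintable() is False for format characters such as U+FEFF,
        # so a stray BOM in the middle of the file is rejected as well.
        if not w or '"' in w or "\\" in w or not w.isprintable():
            raise ValueError(f"unexpected word: {w!r}")
    body = "".join(f'    "{w}",\n' for w in words)
    return f"const char *{name}[] = {{\n{body}}};\n"


print(wordlist_to_c_array(b"abaisser\nzoologie\n", "french"))
```

Failing loudly is a deliberate choice here: stripping the BOM quietly would hide the fact that the upstream file is malformed.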

    Had I foisted that on users, somebody would have lost funds sooner or later.

    (Apologies for the double post. My bad.)

  23. nym-zone referenced this in commit 45a0ad9e32 on Jan 11, 2018
  24. NicolasDorier commented at 1:46 PM on January 13, 2018: contributor

    @nym-zone I dodged a bullet.

    In .NET, the default encoding is UTF-8 with BOM: when reading a file, it checks for a BOM and, if one is present, removes it from the string.

    When writing to a file, the default UTF-8 BOM bytes are added.

    However, UTF8Encoding.GetBytes() does not emit it, even if the UTF8Encoding object has the BOM setting enabled; for that, a call to UTF8Encoding.GetPreamble() is needed.

    TL;DR: My implementation is conformant... by chance. (Tested against your vector.) Hopefully the BIP 39 test vectors (which I added from the beginning) would have caught the bug anyway.
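For readers without a .NET environment, a loosely analogous split exists in Python's standard codecs. The sketch below is only an analogy and does not verify the .NET behaviour described above: `utf-8-sig` strips or adds the BOM, plain `utf-8` does neither, and `codecs.BOM_UTF8` plays a role similar to a preamble.

```python
import codecs

raw = codecs.BOM_UTF8 + "abaisser\n".encode("utf-8")

print(repr(raw.decode("utf-8")))        # '\ufeffabaisser\n', BOM kept as U+FEFF
print(repr(raw.decode("utf-8-sig")))    # 'abaisser\n', BOM stripped on read
print("abaisser".encode("utf-8"))       # b'abaisser', no BOM emitted
print("abaisser".encode("utf-8-sig"))   # b'\xef\xbb\xbfabaisser', BOM prepended
```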

  25. NicolasDorier referenced this in commit d91fa4ad47 on Jan 13, 2018
  26. dabura667 commented at 3:32 PM on January 13, 2018: none

    @NicolasDorier the test vector generator included with the Trezor-mnemonic library uses all-0x00 entropy as an edge-case vector, which results in the first word of the list in every position except the last.
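A short sketch of why all-zero entropy exercises the first word. It follows the BIP 39 mapping from entropy to 11-bit word indices with a SHA-256 checksum appended; the function name `entropy_to_indices` is hypothetical, and this is not the Trezor generator itself.

```python
import hashlib


def entropy_to_indices(entropy: bytes) -> list:
    """Map entropy to BIP 39 word indices (11 bits each, checksum appended)."""
    ent_bits = len(entropy) * 8
    if ent_bits not in (128, 160, 192, 224, 256):
        raise ValueError("entropy must be 128 to 256 bits in steps of 32")
    cs_bits = ent_bits // 32  # checksum length: 4 to 8 bits
    checksum = hashlib.sha256(entropy).digest()
    # Build entropy || first checksum byte as an integer, then drop the
    # checksum bits that are not used.
    value = int.from_bytes(entropy + checksum[:1], "big") >> (8 - cs_bits)
    total_bits = ent_bits + cs_bits
    return [
        (value >> (total_bits - 11 * (i + 1))) & 0x7FF
        for i in range(total_bits // 11)
    ]


indices = entropy_to_indices(bytes(16))  # 128 bits of 0x00
print(indices)  # eleven zeros, then a final index that holds only checksum bits
```

With the French list, index 0 is "abaisser", so this vector puts the first line of the file in eleven of the twelve positions. That is the word a BOM would corrupt.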

  27. iancoleman cross-referenced this on Jul 4, 2018 from issue Remove erroneous prefix bits from BIP-0039 french wordlist by cornfeedhobo
  28. cornfeedhobo commented at 10:38 PM on July 4, 2018: none

    Can we get a resolution here? As I understand these wordlist definitions, there is no need for the BOM, and its use is not recommended.

  29. NicolasDorier commented at 11:53 AM on July 5, 2018: contributor

    Why has this not been merged?

  30. voisine commented at 5:29 PM on July 5, 2018: contributor

    ACK

  31. cornfeedhobo commented at 12:24 AM on August 1, 2018: none

    @Kirvx Could we get your ACK on this?

  32. Kirvx commented at 4:59 AM on August 1, 2018: contributor

    ACK

  33. cornfeedhobo commented at 10:17 PM on August 1, 2018: none

    @luke-jr Anything else needed to get this merged?

  34. luke-jr merged this on Aug 9, 2018
  35. luke-jr closed this on Aug 9, 2018

