BIP0039 Added Japanese wordlist #92

Pull request: bip39JP wants to merge 7 commits into bitcoin:master from bip39JP:master, changing 3 files (+2070 −1)
  1. bip39JP commented at 4:24 PM on August 14, 2014: contributor

    I have been testing with some colleagues to create a wordlist for BIP0039 for the Japanese language.

    We have decided on the current word list for the following reasons:

    1. The Chinese character set borrowed by Japanese (Kanji) is filled with homonyms that are often only distinguishable by the Kanji used; since the words are random and no context is available, we thought this would hurt memorability. This is why we chose to use only the hiragana phonetic character set.
    2. If the user inputs any character outside the hiragana character set (such as the katakana character set), the developer may automatically replace it with hiragana to check the validity of the words.
    3. Hiragana is easier to write down by hand than Kanji and Katakana, and will be easy for all ages, both young and old.

    We welcome any comments or criticisms that you may have.

  2. BIP0039 Added Japanese wordlist 8990249442
  3. erikahend commented at 4:58 PM on August 14, 2014: none

    Native Japanese speaker here (working with MultiBit HD developers).

    1. Agreed
    2. Disagree. It should be exactly one or the other: it is very unlikely to accidentally type a Katakana/Kanji character that looks similar to Hiragana, so they should be treated as separate letters. There are many Kanji with the same phonetic reading but completely different meanings, so this adds unnecessary complexity.
    3. Agreed :-)
  4. gary-rowe commented at 4:58 PM on August 14, 2014: none

    MultiBit HD developer here. Just corroborating @erikahend is working with us.

  5. bip39JP commented at 5:05 PM on August 14, 2014: contributor

    @erikahend Thanks for the reply.

    I was considering the implications of someone hitting the spacebar accidentally and being so focused that they don't notice the input changed to katakana, etc.

    Also, I wonder, do most wallets validate the seed before running it through the HMAC?

    I guess just throwing an error when validating (since no words contain katakana or kanji) would be sufficient.

    Thanks for the comment!

  6. bip39JP commented at 5:07 PM on August 14, 2014: contributor

    http://bip39jp.github.io/

    Please generate a few and give them a look. Perhaps you will run into something, or notice something that isn't apparent from just looking at the list.

    Please note that word wrap is not working on this site (I am not good at HTML/CSS, so I couldn't fix it for the life of me).

  7. voisine commented at 5:22 PM on August 14, 2014: contributor

    @bip39JP validation of a 128-bit seed only involves a 4-bit checksum, so there's a 1 in 16 chance that any error will still produce a valid checksum. It's better than nothing, but not something to put too much confidence in.
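
    As a sketch of the arithmetic here (Python; the helper name is illustrative, not from any wallet codebase), the BIP39 checksum is the first ENT/32 bits of SHA-256 of the entropy, so 128 bits of entropy gives a 4-bit checksum and a 2^-4 = 1/16 false-accept rate:

    ```python
    import hashlib

    def checksum_bits(entropy: bytes) -> str:
        """Return the BIP39 checksum: the first len(entropy)*8/32 bits
        of SHA-256(entropy), as a bit string."""
        cs_len = len(entropy) * 8 // 32           # 4 bits for 128-bit entropy
        digest = hashlib.sha256(entropy).digest()
        bits = ''.join(f'{b:08b}' for b in digest)
        return bits[:cs_len]

    cs = checksum_bits(bytes(16))                 # 128 bits of (zero) entropy
    assert len(cs) == 4
    # A random error survives validation with probability 2**-4 = 1/16.
    ```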

  8. bip39JP commented at 12:16 PM on August 15, 2014: contributor

    @voisine Very true.

    However, I can imagine the scenario where a user inputs a seed, sees no bitcoins in it (because we hashed their typo along with their seed), and runs off either to complain about wallet xyz online or to post an issue on GitHub, when it was actually user error.

    I would think it is good practice to tell the user "Don't freak out, you mistyped something" right after they input it incorrectly. So even if there is a 1/16 chance of generating a valid checksum from a wrong seed, I would say that 90% of the reason I want to validate is to check that all words are in the wordlist (i.e. that no typos are present), and since checking the checksum is trivial, I don't see why not to check it.

    I think the checksum will help in Japanese, as people trying to remember seeds might mix up word order more often than in, say, English.

    Does anyone have any comment on separating words with the UTF-8 ideographic space rather than the ASCII space? This is the one decision where I suspect some hidden technical consideration might argue against it... I put it in the commit because, in our trials, almost every single person who entered their seed used ideographic spaces (it's the default when typing in hiragana mode). So I figured: 1. generate the seed with it, and 2. on restoration, run the input seed through a replace (ideographic space to ASCII space) just in case.

    Any comments on this decision?
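
    As a sketch of the "just in case" replacement described above (Python; the function name is illustrative), the idea is to map U+3000 IDEOGRAPHIC SPACE to an ASCII space before validating the words:

    ```python
    def normalize_separators(mnemonic: str) -> str:
        """Map the ideographic space (U+3000) to an ASCII space so that
        input typed in hiragana mode validates against the wordlist."""
        return mnemonic.replace('\u3000', ' ')

    # Two hiragana words joined by an ideographic space:
    assert normalize_separators('あいこくしん\u3000あいさつ') == 'あいこくしん あいさつ'
    ```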

  9. luke-jr commented at 12:18 PM on August 15, 2014: member

    Should we really be revising the BIP content for every word list? Maybe there should be a separate index of word lists (bip-0039-wordlists.md?)

  10. voisine commented at 4:30 PM on August 15, 2014: contributor

    I think BIPs are a little more fluid than, say, RFCs, maybe we could start versioning them? Though in this case a separately maintained word list index also works.

  11. luke-jr commented at 4:35 PM on August 15, 2014: member

    Once it leaves Draft stage, BIPs should be about as fixed as RFCs. This one is still Draft, but I can see it being desirable to move it beyond Draft before every language has a word list...

  12. Moved wordlists to separate file. 2248c1dc74
  13. bip39JP commented at 4:57 PM on August 15, 2014: contributor

    I have moved to a separate file.

    I agree with @luke-jr and I think we should keep it separate so that the wordlist document can be more flexible.

    However, I wonder if each wordlist should be given a status? (like "draft" or "finished")

    I do not plan to change the list, but the rules surrounding a language may require special considerations, as with Japanese. These special considerations may only become apparent after actual use.

    So maybe a Draft > Accepted status for the special exceptions, and/or for the wordlists themselves, is in order.

  14. voisine commented at 6:32 PM on August 15, 2014: contributor

    With regard to the ideographic vs. ASCII space: something I discussed with stick and slush but hadn't gotten around to submitting yet was that each word list should specify the word separator to use when deriving the seed. I had an idea for a word list composed of three-letter pronounceable syllables rather than words, where every other syllable would be separated by a dash with no other separators, like: "machec-binnev-dordeb-sogduc-dosmul-sarrum"

    Much shorter, and probably about the same difficulty to remember.

  15. bip39JP commented at 6:54 PM on August 16, 2014: contributor

    +1 @voisine I agree.

    I don't trust my English writing skills enough to edit the BIP in any significant way, so maybe you could make a separate PR, or give me something I could copy/paste in?

    If something like that were added, the wordlists.md file would probably need a table of wordlists that includes the separator for each.

  16. voisine commented at 8:08 PM on August 16, 2014: contributor

    Slight problem: BIP39 requires that phrases are NFKD Unicode normalized prior to deriving a seed, and NFKD converts ideographic spaces to ASCII spaces. So the space issue is already handled by the spec, but it might be good to point this out in a note for the JP word list.
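
    This is easy to verify with Python's standard library: NFKD maps U+3000 IDEOGRAPHIC SPACE to U+0020, so a phrase typed with ideographic separators normalizes to the ASCII-separated form automatically (note that NFKD also decomposes voiced kana like が into base kana plus a combining mark):

    ```python
    import unicodedata

    phrase = 'あいこくしん\u3000あいさつ'      # ideographic space separator
    normalized = unicodedata.normalize('NFKD', phrase)

    assert '\u3000' not in normalized          # ideographic space is gone
    assert ' ' in normalized                   # replaced by an ASCII space
    ```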

  17. Fixed wordlist links to account for new document 1901f2c807
  18. Clarify the normalization will fix mixed space use af05299220
  19. bip39JP commented at 5:20 PM on August 25, 2014: contributor

    Does anyone else have comments or opinions on the way I separated wordlist management into its own file?

    I understand this is the first wordlist since the original English one, so there might be details to iron out for multi-wordlist support.

    Please let me know if you have any concerns.

    I look forward to any comments or criticism you may have. Thank you.

  20. voisine commented at 7:36 PM on August 25, 2014: contributor

    I can't vouch for the wordlist itself, but the rest looks good to me.

    non-japanese speaker ack.

  21. Added Japanese test vectors 0d0520e312
  22. formatting 213b67e8f8
  23. ecdsa commented at 12:51 PM on September 11, 2014: none

    I know it's probably too late to change BIP39, but I think whitespace should be stripped from the mnemonic string before it is hashed. This would make the seed tolerant of Japanese/Chinese users forgetting to type the spaces.

  24. janmoller commented at 2:07 PM on September 11, 2014: none

    It is very much too late for the English word list. However, since no other word lists have officially been added, I don't think it is a disaster to use other whitespace characters for other languages. On the other hand, it seems normal to offer autocompletion for words in the UI; both Trezor and Mycelium do this, and the spaces are added automatically. So maybe spacing is a non-issue?

  25. voisine commented at 2:31 PM on September 11, 2014: contributor

    Yes, my suggestion is that word spacing should be defined for each word list.

  26. ecdsa commented at 2:34 PM on September 11, 2014: none

    I am not saying there should be no spaces in the phrases shown in the UI. I think it is better to remove them before hashing, because they might confuse Chinese and Japanese users. (We also remove accents in Electrum, so after including a wordlist for Japanese I decided to also remove whitespace.)

  27. Y75QMO commented at 2:48 PM on September 11, 2014: contributor

    Removing spaces can be dangerous.

    It should be impossible for different phrases to become the same seed. However, for example,

    "act orange lend" and "actor angel end" BOTH become:

    "actorangelend"

  28. luke-jr commented at 2:51 PM on September 11, 2014: member

    It probably makes more sense to remove spaces for some languages than for others. I'm not fluent in Japanese, but IIRC spaces are usually omitted in ordinary written Japanese.

  29. ecdsa commented at 2:54 PM on September 11, 2014: none

    Not really dangerous. You'd lose a bit of entropy, that's all. Since the wordlists in use usually exclude words that are a prefix of another word in the list, you would need at least 3 words to create such a situation.

  30. Y75QMO commented at 3:03 PM on September 11, 2014: contributor

    You don't need 3 words. Conflicts with 2 words exist: "legal one" and "leg alone" BOTH become "legalone".

    There are many combinations that become the same: "leg end" would conflict with "legend".
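
    The collisions in these examples are easy to demonstrate in a couple of lines (Python, for illustration):

    ```python
    # Distinct phrases that become identical once spaces are stripped:
    pairs = [
        ('act orange lend', 'actor angel end'),
        ('legal one', 'leg alone'),
        ('leg end', 'legend'),
    ]
    for a, b in pairs:
        assert a != b
        assert a.replace(' ', '') == b.replace(' ', '')
    ```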

  31. ecdsa commented at 3:05 PM on September 11, 2014: none

    oh right. I thought that words that are prefixes of other words were avoided in that list

  32. bip39JP commented at 3:09 PM on September 11, 2014: contributor

    I agree with omitting spaces:

    To give context: Thomas has currently implemented our wordlist in Electrum, but it does not word-wrap. This is because traditionally, Japanese does not use spaces in its sentences.

    However, our wordlist uses the phonetic hiragana alphabet to create the words, and these words are only nouns, verbs, and adjectives.

    1. When Japanese speakers make sentences, they use "particles" to mark the pieces of the sentence, and this is what keeps things readable without spaces.
    2. Written Japanese uses three character sets, Kanji, Hiragana, and Katakana, to differentiate homonyms and to aid understanding of context.
    3. With hiragana only, we lose (2); with no particles, we lose (1)... so a problem arises if we include spaces.

    This problem is with wordwrap.

    If wordwrap (which wraps after every character) comes in the middle of one of our "words", like "chousen" (romanized), and splits it into "chou" and "sen", then, given no context, the user could think of those as two separate words and write them down with a space between them. This would give an incorrect seed, even after NFKD.

    So rather than worry about how the seed is displayed (PyQt offered no way to remedy the situation), Thomas proposed to just remove all spaces.

    For reference, look at this screenshot of Pokemon in Japanese: http://blog-imgs-19-origin.fc2.com/k/e/b/kebihabi/20080917124209.jpg

    Another solution I proposed was stretching the screen to a size that could fit the largest possible Japanese seed (though it may look ugly). Thomas also suggested simply splitting the seed with line breaks when showing it to the user.

    Wordwrapping may prove to be an implementation issue that poses more of a problem for non-Japanese developers. I personally have never had to handle hiragana-only phrases in my career, so Japanese and word wrap have never crossed my mind at the same time.

    I will have to decide how to set this, and add into the special considerations list.

    I am still open to suggestions on how to standardize.

    Note: the problem doesn't really go away by using Kanji, Katakana, or particles either. Japanese is a very modular language, so splitting large words up produces smaller words, and homonyms are extremely common.

  33. ecdsa commented at 3:12 PM on September 11, 2014: none

    It would be nice to know how often this occurs with the English list. If it significantly impacts entropy, then we might want to reintroduce spaces.

  34. christophebiocca commented at 3:21 PM on September 11, 2014: none

    It isn't just entropy that's the issue; it's the fact that an implementation that strips spaces is not compatible with one that follows the spec. A user moving the seed from one implementation to another (which is the entire point of standardizing in the first place) will run into problems. The English list needs spaces when hashing, or it's no different from replacing the wordlist wholesale.

  35. janmoller commented at 3:25 PM on September 11, 2014: none

    The more I think about it the more this seems like a non-issue. The user should very carefully write down the words one after the other in the right order. If you display all the words at the same time there will be mistakes. "Write word, click, write word, click..." mitigates the risk of writing them in the wrong order and writing duplicate words. This makes the word-separator an internal detail that the user should not care about.

  36. ecdsa commented at 3:30 PM on September 11, 2014: none

    @christophebiocca: sorry to be off-topic; Electrum is not following BIP39. @janmoller: entering words one by one makes sense on a smartphone, but on a desktop you want to be able to paste it.

  37. janmoller commented at 3:42 PM on September 11, 2014: none

    IMO offering copy-pasting of whole word lists is really bad, as it encourages users to put them into text files, email, or whatnot. Word lists are designed to be put offline in an easy, low-tech way: pen and paper. Showing them one by one nudges the user to do it the right way, IMO.

  38. Y75QMO commented at 3:50 PM on September 11, 2014: contributor

    Here is a list of the two-word conflicts: http://pastebin.com/gwxTVGSt There are only 166 conflicts out of more than 4 million combinations, so entropy is reduced by about 0.004%. I think Electrum is safe.
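
    The count can be reproduced with a brute-force sketch like the following (Python; the function name is illustrative, and the method here is my reconstruction, not necessarily the script that produced the pastebin). Running it over the real 2048-word English list means ~4.2 million concatenations, which is quick:

    ```python
    from itertools import product

    def count_two_word_conflicts(words):
        """Count unordered pairs of distinct two-word phrases that become
        identical once the separating space is removed."""
        seen = {}
        for a, b in product(words, repeat=2):
            seen.setdefault(a + b, []).append((a, b))
        return sum(len(v) * (len(v) - 1) // 2
                   for v in seen.values() if len(v) > 1)

    # Toy list illustrating the idea ("legal one" vs "leg alone"):
    assert count_two_word_conflicts(['legal', 'one', 'leg', 'alone']) == 1
    ```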

    Maybe it makes sense to remove spaces in Japanese, but I would not remove them from languages like English or Spanish.

  39. ecdsa commented at 4:19 PM on September 11, 2014: none

    Thanks for counting the conflicts. Actually, for 12 words it would be more like 0.04%
    p = 1-pow(1-166/(2048*2048.),12-1)

  40. Y75QMO commented at 6:07 AM on September 12, 2014: contributor

    You are right. But I don't understand why 12-1.

    2 words  = 1-pow(1-166/(2048*2048.), 1)
    4 words  = 1-pow(1-166/(2048*2048.), 2)
    ...
    12 words = 1-pow(1-166/(2048*2048.), 6)

    Anyway, given the small impact, exact calculation is not important. And the more the words, the more the entropy.

    In terms of "bits of entropy": 2 words without conflicts would be 22 bits; 2 words with conflicts would be log2(2048^2 - 166) = 21.9999429... bits (99.99974...%). 12 words without conflicts would be 132 bits; 12 words with conflicts would be 131.999657... bits. Same proportion (99.99974...%).

    Sorry for the off-topic.
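
    The entropy figures above check out numerically (Python; this treats 12 words as six independent two-word blocks, the same simplification used in the comment):

    ```python
    from math import log2

    conflicts = 166
    total = 2048 ** 2                     # all ordered two-word phrases

    # Effective entropy of two words once colliding phrases are merged:
    two_word_bits = log2(total - conflicts)
    assert abs(two_word_bits - 21.9999429) < 1e-6     # vs 22 bits exactly

    # Twelve words modeled as six independent two-word blocks:
    twelve_word_bits = 6 * two_word_bits
    assert abs(twelve_word_bits - 131.999657) < 1e-4  # vs 132 bits exactly
    ```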

  41. ecdsa commented at 8:59 AM on September 12, 2014: none

    If you have N words, there are N-1 pairs where a conflict can occur (assuming there is only one conflict): (1<->2) or (2<->3) or (3<->4) etc..

    I am still undecided about keeping whitespace in languages like English and Spanish. It would certainly be better to keep whitespace for those languages, but that means the language has to be detected, which is something I want to avoid. I believe the seed derivation function should be agnostic about the language and the wordlist being used.

  42. Y75QMO commented at 10:08 AM on September 12, 2014: contributor

    I think there are at least 3 different parts:

    1. Mnemonic generation.
    2. Ask Mnemonic AND password from user.
    3. Seed derivation from mnemonic and password.

    1. Mnemonic generation: it needs to know what language you want the mnemonic in, and the word list for that language.
    2. Asking the user for the mnemonic AND password: it needs to know (or detect) what language the mnemonic is in. It has to verify the checksum of the mnemonic and inform the user about misspellings or words outside the valid list for that language. Once the language is known, it can even help the user type the mnemonic (for example, in English and Spanish there is no need to type more than 4 characters of each word, since no two words start with the same 4 characters).

    The seed is derived from the mnemonic AND a password. The password has to be asked of the user. It should be clear for each language whether spaces should be removed from the password as well. I guess in Japanese it might make sense to remove spaces from the password, but that is not the case in English or Spanish.

    How is it possible to verify the checksum of the mnemonic if you don't ask the user to separate the words? You need to know the exact words to recover the original bits of entropy (ENT) and the bits of the checksum (CS).

    | ENT | CS | ENT+CS | MS |
    +-----+----+--------+----+
    | 128 |  4 |  132   | 12 |
    | 160 |  5 |  165   | 15 |
    | 192 |  6 |  198   | 18 |
    | 224 |  7 |  231   | 21 |
    | 256 |  8 |  264   | 24 |

    The final output should be some bytes after processing and encoding the mnemonic and some bytes after processing and encoding the password.

    3. Seed derivation from mnemonic and password: this part just receives whatever bytes were produced by parts 1 and 2 and derives the seed from those bytes (the bytes from the mnemonic and the bytes from the password). It does not need any word lists, nor does it verify any checksums.
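
    The ENT/CS recovery described in part 2 can be sketched as follows (Python; the function names are illustrative, and the 2048-entry wordlist here is synthetic; real implementations use the published lists). Each word encodes 11 bits, and the trailing ENT/32 bits must match the first bits of SHA-256 of the entropy:

    ```python
    import hashlib

    def entropy_to_mnemonic(entropy: bytes, wordlist) -> str:
        """Encode entropy + checksum as 11-bit word indices (BIP39 layout)."""
        bits = ''.join(f'{b:08b}' for b in entropy)
        cs_len = len(bits) // 32
        digest = hashlib.sha256(entropy).digest()
        bits += ''.join(f'{b:08b}' for b in digest)[:cs_len]
        return ' '.join(wordlist[int(bits[i:i + 11], 2)]
                        for i in range(0, len(bits), 11))

    def mnemonic_is_valid(mnemonic: str, wordlist) -> bool:
        """Recover ENT and CS from the words and re-check the checksum."""
        index = {w: i for i, w in enumerate(wordlist)}
        try:
            bits = ''.join(f'{index[w]:011b}' for w in mnemonic.split())
        except KeyError:
            return False                      # word not in list: certain typo
        ent_len = len(bits) * 32 // 33        # e.g. 132 bits -> ENT=128, CS=4
        entropy = int(bits[:ent_len], 2).to_bytes(ent_len // 8, 'big')
        digest = hashlib.sha256(entropy).digest()
        expected = ''.join(f'{b:08b}' for b in digest)[:len(bits) - ent_len]
        return bits[ent_len:] == expected

    # Synthetic wordlist, round-trip check:
    words = [f'w{i:04d}' for i in range(2048)]
    m = entropy_to_mnemonic(bytes(16), words)     # 12-word mnemonic
    assert mnemonic_is_valid(m, words)
    assert not mnemonic_is_valid('bogus ' * 12, words)
    ```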

  43. schildbach commented at 10:10 AM on September 12, 2014: contributor

    Just a quick question: for those languages where we keep whitespace, will it be normalized somehow? E.g. one or more consecutive Unicode whitespace characters replaced by a single regular space, and any whitespace at the start or end of the phrase stripped?

  44. Y75QMO commented at 10:19 AM on September 12, 2014: contributor

    Yes. Detecting the language, detecting words outside the list, detecting invalid checksums, removing extra spaces between words (or not allowing spaces to be typed at all, since words can be detected after the first few keystrokes), and verifying the checksum of the mnemonic should all be done in "part 2" of the process (asking the user for the mnemonic AND password).

  45. ecdsa commented at 10:20 AM on September 12, 2014: none

    @Y75QMO: The checksum in BIP39 depends on the wordlist, which is the reason why I will not follow this BIP. Electrum will use a hash of the seedphrase as the checksum. I think the checksum should be agnostic about the wordlist, just like the seed derivation.

  46. Y75QMO commented at 10:59 AM on September 12, 2014: contributor

    For example, a user has this mnemonic/seedphrase: "kitchen struggle cook area age"

    Now, he has lost his computer. He wants to recover his wallet and types in a new computer:

    "kitchen struggle cool area age".

    BIP39 will warn the user that he made a mistake typing.

    How can electrum detect that there is a mistake in the seedphrase?

  47. ecdsa commented at 11:06 AM on September 12, 2014: none

    Detecting words and assisting the user is still possible, assuming Electrum has the same wordlist that was used to generate the seedphrase. However, I think this should only be a GUI improvement. The assumption that we have the same wordlist should not be used for anything critical.

  48. ecdsa commented at 6:03 PM on September 13, 2014: none

    Update: we will remove whitespace only between CJK (Chinese/Japanese/Korean) characters.
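
    As a sketch of that rule (Python; the function name and the exact character ranges are illustrative approximations covering kana and CJK ideographs, not Electrum's actual implementation), whitespace is stripped only when both neighbors are CJK, so Latin-script phrases are untouched:

    ```python
    import re

    # Approximate CJK coverage: hiragana, katakana, unified ideographs.
    CJK = r'[\u3040-\u30ff\u4e00-\u9fff]'
    _between_cjk = re.compile(rf'(?<={CJK})\s+(?={CJK})')

    def strip_cjk_spaces(phrase: str) -> str:
        """Remove whitespace only when it sits between two CJK characters."""
        return _between_cjk.sub('', phrase)

    assert strip_cjk_spaces('あいこくしん あいさつ') == 'あいこくしんあいさつ'
    assert strip_cjk_spaces('legal one') == 'legal one'   # unchanged
    ```

    Note this avoids the "legal one" / "leg alone" collision class entirely for English and Spanish, while still tolerating forgotten (or ideographic) spaces in Japanese input.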

  49. Wordlist update.
    Brushed up wordlist:
    1. First 3 characters are unique for every word.
    2. No words shorter than 3 characters.
    9fda3dbf20
  50. bip39JP force-pushed on Sep 28, 2014
  51. laanwj commented at 8:31 AM on October 15, 2014: member

    Is this ready for merging?

  52. bip39JP commented at 11:28 AM on October 15, 2014: contributor

    Yes, this is ready for merging as far as I'm concerned. @voisine already gave an ack.

    The whitespace discussion is more of an off-topic, overarching conversation about BIP39 wordlists in general (especially CJK), so I will not include it in the special considerations in my PR.

  53. laanwj referenced this in commit 0557a3eb54 on Oct 15, 2014
  54. laanwj merged this on Oct 15, 2014
  55. laanwj closed this on Oct 15, 2014

  56. luke-jr referenced this in commit 860990fa0a on Jun 6, 2017
  57. ajtowns referenced this in commit 0c7bbf83c6 on Oct 18, 2019

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bips. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-14 23:10 UTC
