BIP39 French Wordlist - My proposal #152

pull Kirvx wants to merge 1 commits into bitcoin:master from Kirvx:master changing 2 files +2072 −0
  1. Kirvx commented at 9:04 PM on May 8, 2015: contributor

    @voisine

    Here are my restrictions:

    1. High priority on simple and common French words.
    2. Only words with 5-8 letters.
    3. A word is fully recognizable by typing its first 4 letters (the special French characters "é" and "è" are considered equal to "e"; for example, "museau" and "musée" cannot both be in the list).
    4. Only infinitive verbs, adjectives and nouns.
    5. No pronouns, no adverbs, no prepositions, no conjunctions, no interjections (unless the noun/adjective is as common as the interjection, like "mince;chouette").
    6. No numeral adjectives.
    7. No words in the plural (except invariable words like "univers", or words spelled the same as their singular like "heureux").
    8. No feminine adjectives (except words spelled the same in the masculine and the feminine, like "magique").
    9. No words that sound alike but have different meanings AND different spellings, like "verre-vert", unless one meaning is much more common than the other, like "perle" and "pairle".
    10. No very similar words with only 1 letter of difference.
    11. No essentially reflexive verbs (unless the verb is also a noun, like "souvenir").
    12. No words with "ô;â;ç;ê;œ;æ;î;ï;û;ù;à;ë;ÿ".
    13. No words ending in "é;ée;è;et;ai;ait".
    14. No demonyms.
    15. No words in conflict with the spelling corrections of 1990 (http://goo.gl/Y8DU4z).
    16. No embarrassing words (in a very, very broad sense) or words belonging to a particular religion.
    17. No words identical to words in the Spanish wordlist (as Y75QMO wants).
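
Some of the restrictions above are mechanical enough to check with a script. The following is a minimal sketch covering only restrictions 2, 3, 12 and 13. The function names are hypothetical, and it is not the program used later in this thread.

```python
import unicodedata

def strip_accents(word):
    """Remove combining marks so that "musée" compares as "musee"."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Normalize the literals to NFC so the checks do not depend on how this file is encoded.
FORBIDDEN_CHARS = set(unicodedata.normalize("NFC", "ôâçêœæîïûùàëÿ"))      # restriction 12
FORBIDDEN_ENDINGS = tuple(unicodedata.normalize("NFC", e)
                          for e in ("é", "ée", "è", "et", "ai", "ait"))  # restriction 13

def check_wordlist(words):
    """Return (word, reason) pairs for violations of restrictions 2, 3, 12 and 13."""
    problems = []
    seen_prefixes = {}
    for word in words:
        w = unicodedata.normalize("NFC", word)
        if not 5 <= len(w) <= 8:                        # restriction 2
            problems.append((w, "length not in 5-8"))
        if FORBIDDEN_CHARS & set(w):
            problems.append((w, "forbidden character"))
        if w.endswith(FORBIDDEN_ENDINGS):
            problems.append((w, "forbidden ending"))
        prefix = strip_accents(w)[:4]                   # restriction 3
        if prefix in seen_prefixes:
            problems.append((w, "same first 4 letters as " + seen_prefixes[prefix]))
        else:
            seen_prefixes[prefix] = w
    return problems
```

The linguistic restrictions (word class, familiarity, homophones, embarrassing words) still need a human reader.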

    4 wordlists used:

    Spelling verified with the Hunspell French dictionary (1990 and Classique) in Notepad++, and meaning verified with https://fr.wiktionary.org and http://www.larousse.fr/ for hundreds of words.

    Guys can review: @ecdsa @NicolasDorier @EricLarch @nicolasbigot @pollastri-pierre

    Thanks to Thomas Voegtlin for his wordlist!

    Please wait before merging.

    --- The following message is partially outdated because of the evolution of the wordlist. ---

    I defined as many "reasonable" restrictions as possible so that a person can guess one of their words as easily as possible if they forget it (or remember it easily).

    As for "embarrassing" words, these are words that can be taken as a nasty insult, certain words relating to serious illness, death, poverty, crime, violence, the medical field, attitudes, and many others.

    I did my best to remove words that resembled another word, whether spoken or written. Several hundred words that differed from another word by 1 letter (or had 1 different letter) were removed. I consider the result rather satisfactory: far from perfect, but quite acceptable. Restrictions no. 6 and 10 also help with this problem.

    I estimate that 1% of the words are potentially unknown to the public (like "quantum"), and that 5% have meanings the public may be unsure of (like "fongible"). I consider these margins acceptable.

    Note that some chemical elements from the periodic table are included, the most popular ones.

    For a pleasant review, here is the printable version (5 A4 pages, PDF): https://www.dropbox.com/sh/xlq3x2anb706uw1/AADUYAqcBvkvUPdhwC2uLWmEa?dl=0

    If you want to check in a single read-through, focus on restrictions no. 2, 3, 5, 8 and 11. Given the homogeneity of the list (and the common sense it should have), words that violate restrictions no. 1, 4 and 13 should jump out at you. Allow 15 minutes of reading per page. I still recommend a second read-through.

    I hope you will like this wordlist. It represents more than 70 hours of work that I did not plan on doing at first, given the scale and responsibility of the task.

    If a word seems inappropriate to you, or if you have comments about the restrictions, please let me know.

    Note also that if it suits you, it will be included in one of the next versions of breadwallet along with the other foreign wordlists.

  2. EricLarch commented at 9:34 PM on May 8, 2015: none

    Are there restrictions requiring each word to differ from every other word by more than one letter? For instance, we had the case of someone writing down "fog" instead of "frog" (or the other way around, I'm not sure), and by chance both were valid words in the dictionary. Since this is a new dictionary, I think we have the opportunity to check this rule with an algorithm, just to reduce the possibility of very costly mistakes.

  3. Kirvx commented at 9:42 PM on May 8, 2015: contributor

    I did not use a script for that; I eliminated the most glaring similarities by hand. I do not have the skills to write one :/

  4. NicolasDorier commented at 8:20 PM on May 9, 2015: contributor

    Good idea, I will also review it. For the similar words, I will check the Levenshtein distance between all pairs of words; that will be quick. Also, I think your word list is not in NFKD normalization (not a big deal, I'll fix that).
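
A pairwise check like the one described above could look like the sketch below. It is an assumption about the approach, not NicolasDorier's actual program: accents are stripped first, then every pair is tested for a single substitution, insertion or deletion.

```python
import unicodedata
from itertools import combinations

def strip_accents(word):
    """Remove combining marks so that accented and plain letters compare equal."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def within_one_edit(a, b):
    """True if a and b differ by at most one substitution, insertion or deletion."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make a the shorter (or equal-length) word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1                           # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # at most one substituted letter at position i
    return a[i:] == b[i + 1:]            # b has one extra letter at position i

def similar_pairs(words):
    """Return all accent-stripped word pairs within one edit of each other."""
    plain = sorted({strip_accents(w) for w in words})
    return [(a, b) for a, b in combinations(plain, 2) if within_one_edit(a, b)]
```

For 2048 words this is roughly two million comparisons, which is small enough that the quadratic loop needs no optimization.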

  5. NicolasDorier commented at 8:24 PM on May 9, 2015: contributor

    Ah, one more question. I'm not sure about some words that are either very obscure or often misspelled (like zircon and wapiti, which are the only ones I noticed after a quick scan, and maybe the only ones).

    Do you think we should change such words ?

  6. Kirvx commented at 8:41 PM on May 9, 2015: contributor

    @NicolasDorier Thanks for the help :)

    I used

    perl nfkd.pl < wordlist.txt > nfkdworldlist.txt
    

    and nfkd.pl is

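    #!/usr/bin/perl
    
    use Unicode::Normalize;    # provides NFKD()
    use strict;
    use warnings;
    use open qw(:std :utf8);   # read and write the standard streams as UTF-8
    
    # Print each input line in NFKD (compatibility decomposition) form.
    while (<>) {
        print NFKD("$_");
    }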
    

    Thanks to Aaron Voisine for this! Is that ok?
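
To confirm that a file is already in NFKD form after running a script like the one above, one option is to re-normalize each line and compare it with the original. This is a minimal sketch; the function name is hypothetical.

```python
import unicodedata

def non_nfkd_lines(lines):
    """Return the 1-based numbers of the lines that change under NFKD normalization."""
    return [number for number, line in enumerate(lines, start=1)
            if unicodedata.normalize("NFKD", line) != line]
```

Reading the wordlist with `open(path, encoding="utf-8")` and passing `f.read().splitlines()` to this function should return an empty list for a fully normalized file.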

    Of course we can change this kind of words if we find a word that is compatible with the restrictions.

  7. NicolasDorier commented at 1:59 PM on May 10, 2015: contributor

    I reviewed the first 1024 words; here are my difficulties:

    acerbe: I don't understand the meaning out of context, much less how to spell it
    agrafer: agrapher ? aggrapher ? easy mistake
    azur: azure with an 'e' ? easy misspell
    bénigne: difficult to pronounce on the phone + rare word
    bielle: never heard this word
    biopsie: never heard
    biotype: never heard
    bluffer: easy misspell "bleuffer"
    brome: never heard
    bruine: never heard
    buccal: easy misspell (bucal)
    cadastre: never heard
    caduc: easy misspell "caduque"
    calepin: never heard
    caneton: easy misspell "canneton"
    césium: never heard (almost)
    cloporte: never heard
    cobalt: never heard
    coccyx: never heard
    cosy: easy misspell "cosie"
    dactylo: did not know it existed, thought it was an abbreviation of "dactylographie"
    embryon: easy misspell "embrillon"
    ethnie: "éthnie" ?
    fakir: easy misspell (faquir)
    fenouil: easy misspell "fenouille"
    filetage: never heard
    final: can be confused with "finale"
    gallium: never heard
    gecko: never heard
    grivois: never heard
    hydromel: never heard
    idylle: never heard
    iguane: difficult to spell right

    All of that is surely subjective. We don't have to replace if you think I am the only one having those difficulties. I'll review the next 1024 later. Let me know what you think about these words.

  8. Kirvx commented at 4:41 PM on May 10, 2015: contributor

    Thanks for the review :) Have you googled these 17 unknown words? I think that after a search most people will say "Ah yes, I know this word". What do you think @EricLarch ? Anyway, they represent 1.6% of these 1024 words, which is quite acceptable. I agree on all the rest.

    "ethnie" is correct :) "dactylo" is the job (http://www.larousse.fr/dictionnaires/francais/dactylo/21484), but there is also "dactylographe", which is the old form according to Larousse, so maybe we should delete this word. "acerbe": http://www.larousse.fr/dictionnaires/francais/acerbe/604

    But before trying to change all these words (if we can; finding extra words is complicated, and I will try each day), it seems more logical to me to work on the 1 letter difference first, no? I'm very curious about how many words are in conflict with another :)

  9. EricLarch commented at 4:50 PM on May 10, 2015: none

    I agree that bénigne and bluffer can be difficult (I have seen poker players write "bleuffer"...). For the others I would think that any French native speaker must know them, and I don't think that anyone literate would ever write "embrillon" or "faquir". I understand some people can have trouble with spelling, but then no words would be safe.

  10. NicolasDorier commented at 7:12 PM on May 10, 2015: contributor

    it seems more logical to me to work on the 1 letter difference first, no?

    Don't, I can do that automatically. I will do it once we have agreed on the words. (I'll also code something up to verify that you respected your restrictions)

    For the others I would think that any French native speaker must know them

    I am a native speaker, but I admit I am not very good. ;) If all of you think that the problem is between my screen and my chair, then I have no problem believing it. ;) I'll review the next ones tomorrow.

  11. Kirvx commented at 8:04 PM on May 10, 2015: contributor

    Don't, I can do that automatically. I will do it once we have agreed on the words. (I'll also code something up to verify that you respected your restrictions)

    Ok, thanks for your time to code :)

    I am a native speaker, but I admit I am not very good. ;) If all of you think that the problem is between my screen and my chair, then I have no problem believing it. ;) I'll review the next ones tomorrow.

    It's also cool to learn words ^^

  12. NicolasDorier commented at 8:56 PM on May 10, 2015: contributor

    yeah it is cool, but I just hope people will not have to spell words over the phone, which will happen with unknown words. But I'm fine with it if you think I am one of the only ones who does not know them.

    I expect most service providers using BIP39 will auto-correct words for the user. (I will surely include that in NBitcoin... even if only for me ;D)

  13. Kirvx commented at 1:23 PM on May 13, 2015: contributor

    @NicolasDorier Have you had the time to review the second part? :)

  14. NicolasDorier commented at 6:27 PM on May 13, 2015: contributor

    shit I forgot, working on that sorry

  15. NicolasDorier commented at 6:43 PM on May 13, 2015: contributor

    Here it is:

    iridium: never heard
    jacinthe: jacynthe ? jacynte ? jacinte ?
    jaloux: jalou ?
    joyau: joyaux ?
    lasso: lasseau ?
    momifier: mommifier ?
    obturer: never heard
    oxyde: oxide ?
    perdrix: perdrie ?
    phoque: foque ?
    pylône: ô ???
    rhodium: never heard
    sextuor: never heard, I understood "sexe tueur" :s
    suricate: never heard
    thorax: torax ?
    ubuesque: never heard
    vanadium: never heard
    wapiti: never heard
    zircon: never heard

    My remarks are typical spelling mistakes that people could make. Once you have agreed on the words to change, let me know, and I'll then run some word analysis on the list (dictionary check, that your rules are satisfied, and that no 2 words are too similar).

  16. Kirvx commented at 7:20 PM on May 13, 2015: contributor

    Thanks :) Never heard suricate? https://i.imgur.com/TE6PlMx.jpg They have a reserved place in the wordlist ^^ I will try to find many words in the next 2 days.

  17. NicolasDorier commented at 8:10 PM on May 13, 2015: contributor

    Well, I heard about suricate, as far as I was concerned, it was a french comedian group on youtube. :p

  18. Kirvx commented at 5:47 PM on May 14, 2015: contributor

    Ok, I propose to change these words:

    bénigne bluffer cosy dactylo césium gecko gallium grivois sextuor
    cadastre rhodium vanadium iridium jacinthe jaloux brome azur
    agrafer caduc zircon lasso momifier fenouil bruine bielle final bise
    bibelot fakir
    

    EDIT: "fakir" too. I would replace them by adding:

    biberon banlieue financer éthanol prélude taureau slogan punaise sternum sottise burin
    tétine filière esquiver binaire festival pyjama opaque pharaon piéton pizza boycott
    phobie fémur féodal fissure rituel rallye
    

    And wombat (https://i.imgur.com/scN9gIU.jpg) :bear:

    What do you think?

  19. NicolasDorier commented at 7:57 PM on May 14, 2015: contributor

    pyjama: pijama ? (I would have bet it was spelled like that)
    rallye: rallie ? (like "rallier")

    Except those, I'm good. I like wombat, but I doubt many people know it. What do you think? (Once again, if you think it is fine, I'm fine with it too; I just hope people do not stress too much when they can't manage to spell 25 words right.)

    Tell me when you update the list so that I can run some code on it.

  20. Kirvx commented at 8:24 PM on May 14, 2015: contributor

    Thanks :) I can replace "rallye" with "rallonge", and "pyjama" with "pyrolyse", "poreux" or "pixel" (you decide), or another word. Maybe add "yacht" (the boat) instead of wombat, because we don't have any words starting with "y".

  21. NicolasDorier commented at 8:38 PM on May 14, 2015: contributor

    rallonge and pixel; ok for yacht.

  22. Kirvx commented at 9:08 PM on May 14, 2015: contributor

    @NicolasDorier Updated. I also deleted "wapiti" and added "linéaire". I changed the encoding, but apparently GitHub did not update the whole file. So here is the original: https://www.dropbox.com/s/chaxgqotio59rf4/french.txt?dl=0 Note that you are a "collaborator" on Kirvx/bips, so feel free to correct whatever you want if you can (I don't know what a collaborator can do).

  23. NicolasDorier commented at 11:41 PM on May 14, 2015: contributor

    thanks, I'll run some word analysis to check everything is fine. (Hopefully before sunday)

  24. Kirvx commented at 4:35 PM on May 15, 2015: contributor

    poncer -> ponctuel
    tréfonds -> trèfle
    pâturage (which is more used in the plural) -> gerbille

    Same dropbox link https://www.dropbox.com/s/chaxgqotio59rf4/french.txt?dl=0

  25. NicolasDorier commented at 4:56 PM on May 15, 2015: contributor

    did you update on github ? I prefer using the github version for my tests, so I'm sure there is no mistake in the modifications. (don't worry about encoding, I'll fix it)

    Ps : gerbille => never heard :D

  26. Kirvx commented at 5:48 PM on May 15, 2015: contributor

    Yes, updated on GitHub. Ok, I will try to change gerbille.

  27. Kirvx commented at 10:52 PM on May 15, 2015: contributor

    "gerboise", "graffiti", "glycémie" or another ? :)

  28. NicolasDorier commented at 12:24 AM on May 16, 2015: contributor

    ok let's take "graffiti"

  29. Kirvx commented at 11:14 AM on May 16, 2015: contributor

    Updated.

  30. NicolasDorier commented at 12:34 PM on May 18, 2015: contributor

    Here are the similar words (differing by 1 letter, accents removed):

    amener,mener //Similar
    argent,urgent
    banque,barque
    baraque,barque
    baron,bâton
    bolide,solide
    bonifier,tonifier
    bonus,tonus
    céder,coder
    choquer,croquer
    crayon,rayon
    créer,crier
    curieux,furieux
    défaire,refaire
    doyen,moyen
    entier,envier
    épreuve,preuve
    éprouver,prouver //Similar
    établir,rétablir //Similar
    fermer,ferrer
    fièvre,lièvre
    figer,fixer
    flaque,plaque
    génie,genre
    herbe,herse
    humeur,humour
    hyène,hymne
    infecter,injecter
    léger,loger
    local,loyal
    loger,louer
    loger,lover
    louer,lover //Lover is unknown + very similar to other words (4 occurrences)
    maison,saison
    malade,salade
    ministre,sinistre
    notaire,notoire
    piéton,piston
    podium,sodium
    préparer,réparer //Similar
    proie,prose
    rare,rire
    redire,réduire
    refermer,réformer
    rejeter,répéter
    réparer,séparer
    soupirer,soutirer
    toiture,voiture
    

    I noted potential problems. What do you think ?

    Checking other stuff...

  31. NicolasDorier commented at 1:09 PM on May 18, 2015: contributor

    I also noted the following collision with Spanish. (btw, the Spanish list is not normalized on github)

    ceder,céder
    enorme,énorme
    gemir,gémir
    ideal,idéal
    serie,série
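
Collisions like the ones listed above only show up once accents are ignored. The following is a minimal sketch of such a comparison, assuming both lists are plain Python lists of words; the function names are hypothetical.

```python
import unicodedata

def strip_accents(word):
    """Remove combining marks so that "céder" compares as "ceder"."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def accent_insensitive_collisions(list_a, list_b):
    """Pair up words from the two lists that are equal once accents are removed."""
    index = {}
    for word in list_a:
        index.setdefault(strip_accents(word), []).append(word)
    return sorted((a, b) for b in list_b for a in index.get(strip_accents(b), []))
```

A plain text diff of the two files would miss these pairs because the accented and unaccented spellings are different strings.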
    
  32. Kirvx commented at 1:53 PM on May 18, 2015: contributor

    Great, thanks :) I was expecting twice as many words differing by one letter :) The Notepad compare plugin apparently let me down on the differences with the Spanish wordlist ^^ Would it be complicated to compare our wordlist against the 300,000-word list, and output to another file all the words that could be added to our wordlist given the restrictions? That would make it easy to replace the words that differ by 1 letter, and also to introduce further restrictions, such as removing 4-letter words (42 words), words containing the letters "ô,â,ç,ê" (28 words), and more, provided we have plenty of words in reserve. It would also help us not to overlook simple words that could replace the ones that are still complicated. I'm suggesting this because two weeks ago it took me 3 hours to "manually" replace a little over thirty words, so if we can easily filter the thing to get a "badass" wordlist :/ Thanks again :)

  33. NicolasDorier commented at 2:13 PM on May 18, 2015: contributor

    That would take too much time, coding, and CPU/memory to get an automated process worth the effort for something we only do once. (Especially since the list is already badass enough, apart from the few words that look alike.)

    I think we are in relatively good shape, apart from the 2 or 3 words that look a bit too similar. The collisions with Spanish have different accents, so I don't think that is a problem.

  34. Kirvx commented at 2:21 PM on May 18, 2015: contributor

    Ok :) I will still try to replace all the conflicting words plus the Spanish words, and remove "lover". It will take me a few days.

  35. dabura667 commented at 2:25 PM on May 18, 2015: none

    #147 Spanish normalization is pending merge.

    BIP39 creator made the PR... don't know what's taking so long...

  36. Kirvx commented at 3:11 PM on May 18, 2015: contributor

    #147 Spanish normalization is pending merge. BIP39 creator made the PR... don't know what's taking so long...

    Maybe ask @laanwj to merge #147 ?

  37. Kirvx commented at 3:23 PM on May 20, 2015: contributor

    @NicolasDorier Quick question: does your program take restriction no. 2 into account?

  38. Kirvx commented at 1:00 PM on May 21, 2015: contributor

    Bip update in 1-2 days.

  39. NicolasDorier commented at 1:29 PM on May 21, 2015: contributor

    Yes, I check restriction 2.

  40. Kirvx commented at 8:42 PM on May 21, 2015: contributor

    Cool :)

    Major changes (see the commit for the new candidates). I managed to reduce the 330k-word list to 25k words with my meagre regex skills. I then read all 25k words and got a little under 150 words out of them. As a result:

    • I replaced/modified all the conflicting words.
    • I removed the Spanish words.
    • I removed all 4-letter words, leaving only words of 5-8 letters.
    • I removed all words containing "ô,â,ç,ê".
    • I removed "lover" and about ten other words for various reasons (anglicisms, complicated words).
    • I removed all feminine adjectives (new restriction).

    This leaves a surplus of 29 words. Are there any objections to particular words? :) I will do another read-through to find the last feminine adjectives (very likely some remain), as well as some other complicated words. @NicolasDorier Could you run a new analysis with your program? ^^

    If there are enough words left, I am thinking of removing the numeral adjectives; I'll see.

  41. Kirvx force-pushed on May 21, 2015
  42. Kirvx commented at 9:30 PM on May 21, 2015: contributor

    Squashed 4 hotfixes

  43. Kirvx force-pushed on May 21, 2015
  44. Kirvx force-pushed on May 21, 2015
  45. NicolasDorier commented at 11:18 PM on May 21, 2015: contributor

    cool, I'll take care of that tomorrow :)

  46. Kirvx force-pushed on May 22, 2015
  47. NicolasDorier commented at 1:03 PM on May 22, 2015: contributor

    With Spanish: elixir, élixir

    Similar words:

    éclipse,ellipse
    émission,mission
    évoquer,révoquer //Should change that
    munition,punition
    notice,novice
    papillon,pavillon
    réduire,séduire
    septuple,sextuple //Should change
    subvenir,survenir

    Unknown word :

    Diapason

    I think that "easy to misspell" is not a good argument against a word after all, since most BIP39 programs will catch that and tell the user: "programe" does not exist, do you mean "programme"? The 4 letters rule is respected.

    PS: I think you should not change the unknown word if I am the only one not knowing it, or we will never end this :D
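
The "do you mean ...?" behaviour mentioned in this comment could be sketched with Python's standard `difflib` module. The three-word list and the 0.8 cutoff below are assumptions chosen for illustration; this is not how any particular BIP39 program implements it.

```python
import difflib

def suggest(word, wordlist, n=3):
    """Return the word itself if it is valid, otherwise up to n close matches."""
    if word in wordlist:
        return [word]
    return difflib.get_close_matches(word, wordlist, n=n, cutoff=0.8)

# Hypothetical three-word list, for illustration only.
WORDS = ["programme", "prologue", "abandon"]
print(suggest("programe", WORDS))
```

A suggestion step like this only helps with typing mistakes. It does not help a user who wrote down a different valid word, which is why the one-letter-difference check still matters.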

  48. Kirvx commented at 3:25 PM on May 22, 2015: contributor

    Updated :D I removed/modified all the words mentioned, plus some others following my ongoing re-read (40% done).

    I think that "easy to misspell" is not a good argument against a word after all, since most BIP39 programs will catch that and tell the user: "programe" does not exist, do you mean "programme"? The 4 letters rule is respected.

    Yes, it's true that the digital keyboard can help ^^

    PS: I think you should not change the unknown word if I am the only one not knowing it, or we will never end this :D

    Well, it's clear that diapason is much less common than other words, and since we have words in stock, it costs little to remove it, like the others.

    Could you run the analysis again when you have time? :)

  49. Kirvx commented at 10:53 PM on May 22, 2015: contributor

    And here is the long-awaited update after my complete re-read. It removes the numeral adjectives (new restriction). It adds new words (académie acajou adéquat adhésif agrume amertume amovible amphibie apéritif apologie). It changes accord -> accolade ; secourir -> secouer ; soulager -> soulever. And it removes some words (see the commit).

    I would consider the wordlist nearly finished if it suits you, and if @NicolasDorier's analysis reveals no more conflicts.

  50. Kirvx commented at 11:09 PM on May 22, 2015: contributor

    Updated PDF version for proofreading on paper: https://www.dropbox.com/sh/xlq3x2anb706uw1/AADUYAqcBvkvUPdhwC2uLWmEa?dl=0

  51. NicolasDorier commented at 11:37 PM on May 22, 2015: contributor

    I reviewed it. Cool list:

    • No Spanish collisions
    • No two words start with the same 4 letters
    • No two words at a Levenshtein distance of 1 (permutation/addition/removal)

    There are some words I find rather rare and unknown, but I don't think it is essential to change them. If it is good for you @Kirvx, we can rebase then squash everything so it is ready for merging.

    hirsute
    nigaud
    opossum
    ballast
    bistouri
    
  52. Kirvx commented at 12:10 AM on May 23, 2015: contributor

    Thank you very much :) ballast and nigaud deleted. What about the UTF-8 NFKD encoding? Is that ok? Can you rebase and squash? I think I will mess something up if I do it ^^ Does bip-0039-wordlists.md look good to you? @EricLarch Is that ok for you? @ecdsa Can you take a look?

    I will read the wordlist again this weekend, and ask for the merge if it's ok for @EricLarch and @ecdsa (he does not respond to my emails asking him to review it; maybe an issue on Electrum would work :D).

  53. NicolasDorier force-pushed on May 23, 2015
  54. NicolasDorier commented at 12:53 AM on May 23, 2015: contributor

    I just rebased and squashed

  55. realindiahotel commented at 6:19 AM on May 23, 2015: none

    Hi guys, I don't think a Spanish collision would matter too much anyway, because the two Chinese wordlists have words in common, so it would just be similar to that?



  56. Kirvx commented at 9:34 AM on May 23, 2015: contributor

    @Thashiznets Hi :) The French wordlist has ≈100 words identical to words in the English wordlist (https://en.wikipedia.org/wiki/List_of_English_words_of_French_origin). I only respected the Spanish wordlist because its creator wanted a wordlist fully distinguishable from the others (and because there were fewer than 20 Spanish words in the original French wordlist).

    There are no words in common between the Spanish wordlist and any other language wordlist, therefore it is possible to detect the language with just one word.

    https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md

    I don't think it's a problem to have words identical to those in other wordlists if the program that generates the words for users clearly specifies the language of the wordlist (@voisine :bowtie:)

  57. NicolasDorier commented at 1:18 PM on May 23, 2015: contributor

    @Thashiznets , yes, I was only checking the restriction that Kirvx wanted to use rather than imposing them. Let's wait some days to see if there is any feedback on this list then ping laanwj to merge that.

  58. voisine commented at 6:13 PM on May 23, 2015: contributor

    I personally think that using common words familiar to as many speakers as possible is a lot more important than being able to detect the language. Since phrases are normalized and hashed to generate the master key, we don't need to even know the language for it to work.
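
The point that the language is not needed to derive the key can be illustrated with BIP39's seed derivation, which runs PBKDF2-HMAC-SHA512 (2048 iterations, 64-byte output) over the NFKD-normalized phrase with the salt "mnemonic" plus an optional passphrase. This is a minimal sketch using only the Python standard library; the function name is hypothetical.

```python
import hashlib
import unicodedata

def mnemonic_to_seed(mnemonic, passphrase=""):
    """Derive the 64-byte BIP39 seed from a mnemonic phrase, without any wordlist."""
    password = unicodedata.normalize("NFKD", mnemonic).encode("utf-8")
    salt = ("mnemonic" + unicodedata.normalize("NFKD", passphrase)).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha512", password, salt, 2048, dklen=64)
```

Note that no wordlist appears anywhere in this step. Validating the checksum of a phrase is a separate step that does require knowing which wordlist the words came from.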

  59. Kirvx commented at 7:43 PM on May 23, 2015: contributor

    @voisine You're right :) @NicolasDorier I am thinking of removing "alpin", which could be taken for a demonym (restriction no. 14), and replacing it with "agrafer", and of replacing "minium" with "minimal" for consistency with "maximal". I still have 800 words to review; I will do 200-400 tonight, and the rest will be done by late tomorrow afternoon.

  60. NicolasDorier commented at 7:49 PM on May 23, 2015: contributor

    If you are replacing some anyway, I think "hirsute", "opossum" and "bistouri" should be replaced; I don't think they are well known.

  61. Kirvx commented at 3:49 PM on May 24, 2015: contributor

    Ok, I have just re-read the rest. Everything is fine for me except "mythe", which I changed to "mythique" because it sounds identical to "mite". I replaced "hirsute" and "opossum" with "vacarme" and "tibia". I did not find any additional word to replace "bistouri", but "hirsute" and "opossum" seem much more obscure to me than it is, so I hope that still works for you @NicolasDorier ... One last analysis by your program and possibly a rebase & squash would be appreciated if you're ok with that ^^ Is the encoding ok?

  62. NicolasDorier commented at 10:44 AM on May 25, 2015: contributor

    works for me, I'll do that this afternoon

  63. French Wordlist 69841de362
  64. NicolasDorier force-pushed on May 25, 2015
  65. NicolasDorier commented at 12:33 PM on May 25, 2015: contributor

    @Kirvx, I ran my program and all seems good. I squashed the commits. I also checked that the words were correctly spelled. It is ready to merge.

  66. Kirvx commented at 12:44 PM on May 25, 2015: contributor

    @NicolasDorier Nice :)

    Ok, I propose to wait until Wednesday before pinging laanwj to merge, if he agrees. @EricLarch The clock is ticking :)

  67. EricLarch commented at 1:01 PM on May 25, 2015: none

    All good for me. Nice work!


  68. laanwj referenced this in commit fbe7196ee6 on May 26, 2015
  69. laanwj merged this on May 26, 2015
  70. laanwj closed this on May 26, 2015

  71. NicolasDorier commented at 12:11 PM on May 26, 2015: contributor

    laanwj is pretty excited today: in one day he closed at least 7 pending merges I have participated in over the last 3 months. Well, I'll add the list to NBitcoin. :)

  72. Kirvx commented at 12:14 PM on May 26, 2015: contributor

    Awesome :) Thanks to everyone :)

  73. voisine commented at 4:34 PM on May 26, 2015: contributor

    @Kirvx @NicolasDorier @EricLarch Thanks guys, this will be going into the next breadwallet update. Vive la France !

  74. realindiahotel commented at 10:48 PM on May 26, 2015: none

    Suppose I best also add French list to BIP39.NET


  75. realindiahotel commented at 5:50 AM on June 8, 2015: none

    Is now added in BIP39.NET

  76. Kirvx commented at 9:25 AM on June 8, 2015: contributor

    Yeah thanks :)


  77. NicolasDorier commented at 10:20 AM on June 8, 2015: contributor

    same in NBitcoin (in master branch, will be out for the next release)

  78. wizardofozzie commented at 3:15 AM on June 11, 2015: none

    I had a big problem trying to detect the bip39 language as French shares ~5% of its words with English.

    With the test vector (entropy 000000000000000000000000000000000000; english mnemonic = "abandon abandon abandon abandon abandon abandon abandon abandon abandon abandon abandon about") it is incorrectly detected as French. I've changed my code to check all of the following (see below), however I'd implore the list to be made completely different to English (or at the very least, don't make the first word the same)

    FRENCH_BIP39_CLASHES = [(1, u'abandon'), (88, u'amateur'), (107, u'angle'), (110, u'animal'), (148, u'aspect'), (190, u'badge'), (230, u'bicycle'), (262, u'bonus'), (277, u'brave'), (323, u'canal'), (328, u'capable'), (347, u'caution'), (403, u'civil'), (409, u'client'), (436, u'concert'), (451, u'correct'), (461, u'coyote'), (478, u'crucial'), (479, u'cruel'), (493, u'cycle'), (498, u'danger'), (562, u'digital'), (573, u'distance'), (594, u'double'), (598, u'dragon'), (631, u'effort'), (725, u'essence'), (757, u'exact'), (763, u'excuse'), (795, u'fatal'), (796, u'fatigue'), (812, u'festival'), (820, u'figure'), (854, u'fortune'), (861, u'fragile'), (880, u'fruit'), (919, u'globe'), (953, u'guide'), (998, u'humble'), (1011, u'image'), (1014, u'immense'), (1017, u'impact'), (1043, u'innocent'), (1053, u'intact'), (1070, u'jaguar'), (1093, u'junior'), (1102, u'label'), (1123, u'lecture'), (1165, u'loyal'), (1178, u'machine'), (1248, u'million'), (1254, u'minute'), (1255, u'miracle'), (1259, u'mobile'), (1286, u'muscle'), (1301, u'nation'), (1302, u'nature'), (1322, u'noble'), (1331, u'notable'), (1381, u'opinion'), (1387, u'orange'), (1409, u'ozone'), (1411, u'palace'), (1416, u'panda'), (1476, u'phrase'), (1478, u'piano'), (1492, u'pizza'), (1524, u'position'), (1548, u'prison'), (1567, u'public'), (1576, u'puzzle'), (1580, u'question'), (1626, u'relief'), (1671, u'rival'), (1674, u'romance'), (1707, u'salon'), (1727, u'science'), (1748, u'sentence'), (1756, u'service'), (1769, u'simple'), (1777, u'social'), (1801, u'source'), (1805, u'spatial'), (1809, u'stable'), (1830, u'surface'), (1833, u'surprise'), (1836, u'suspect'), (1847, u'talent'), (1911, u'train'), (1933, u'tunnel'), (1948, u'unique'), (1954, u'usage'), (1963, u'vague'), (1970, u'valve'), (2008, u'village'), (2014, u'virus'), (2020, u'vital'), (2034, u'volume'), (2039, u'voyage'), (2041, u'wagon')]
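
One way to make detection less fragile than looking at individual words is to require that every word of the phrase belongs to a wordlist, and to return all candidate languages instead of guessing. This is a sketch under that assumption; the tiny wordlists are hypothetical stand-ins for the real 2048-word lists.

```python
def candidate_languages(mnemonic, wordlists):
    """Return every language whose wordlist contains all words of the mnemonic."""
    words = mnemonic.split()
    return sorted(lang for lang, wordlist in wordlists.items()
                  if all(word in wordlist for word in words))

# Hypothetical miniature wordlists, for illustration only.
WORDLISTS = {
    "english": {"abandon", "about", "zoo"},
    "french": {"abandon", "graffiti", "pixel"},
}
print(candidate_languages("abandon abandon about", WORDLISTS))
```

With this approach a phrase made only of shared words still comes back as ambiguous, so the caller has to ask the user or try each candidate.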

  79. NicolasDorier commented at 10:11 AM on June 11, 2015: contributor

    I'm not convinced by that. The auto language detection feature is dangerous in itself (consider Chinese Traditional and Simplified). Also, finding 2048 well-understood words that are not shared with English or any other language is an impossible task.

  80. realindiahotel commented at 11:07 AM on June 11, 2015: none

    Hi @simcity4242, thanks for bringing this up, it is interesting. I don't really think there is any great reason for us to auto-detect the mnemonic language; do you have a specific use case in mind? I'm actually thinking of removing this functionality from BIP39.NET because, at the end of the day, we don't really need to know the language of the mnemonic on input. Unless you have a specific task in mind, it may be wise to just avoid auto language detection altogether. I implemented it before the French list existed. Nicolas is right that if the overlap is only ~5%, chances are the majority of words will be French-only every time, so it shouldn't be an issue, but you will need to account for the edge cases, I guess.

  81. schildbach commented at 11:15 AM on June 11, 2015: contributor

    If auto-detection is not possible, you'd need to add to the 12 words the information about which wordlist is used. So effectively it would be a 13th word.

  82. realindiahotel commented at 11:18 AM on June 11, 2015: none

    Why do you need to know what language is used tho?

  83. schildbach commented at 11:20 AM on June 11, 2015: contributor
    • For offering auto-completion of words.
    • For using the right type of space.
  84. realindiahotel commented at 11:23 AM on June 11, 2015: none

    Surely you would detect the localization from the system for auto detect, just as any other app/program does now? The correct spaces are whatever the user puts in; ideographic-to-normal conversion happens during normalization anyway, so it doesn't matter what spaces are put in.

  85. realindiahotel commented at 11:24 AM on June 11, 2015: none

    Also if you are inputting the words you can't auto detect language as you type the words in!

  86. schildbach commented at 11:30 AM on June 11, 2015: contributor

    On mobile devices, you generally don't type spaces. Everything is auto-completed. This is especially true if there are well-defined dictionaries. You can't use the system locale reliably, as phrases should be exchangeable between devices.

  87. schildbach commented at 11:33 AM on June 11, 2015: contributor

    If all encodings use the same type of space then we're good. But I heard that's not the case?

  88. realindiahotel commented at 11:50 AM on June 11, 2015: none

    On mobile devices the OS handles the auto-complete based on a localized dictionary in most cases. Yes, the space is different for JP; however, the normalization process turns the ideographic space into an ASCII space regardless of what is input, so it doesn't matter what space is auto-added.
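    The normalization point can be illustrated with a minimal sketch, assuming NFKD normalization via Python's standard `unicodedata` module. The helper name `split_mnemonic` is hypothetical and not taken from any wallet implementation.

```python
import unicodedata


def split_mnemonic(raw):
    """Normalize a mnemonic with NFKD, then split it on ASCII spaces.

    Hypothetical helper for illustration only. NFKD maps the ideographic
    space U+3000 to the ASCII space U+0020, so both separators end up
    producing the same list of words.
    """
    normalized = unicodedata.normalize("NFKD", raw)
    # Drop empty strings caused by repeated or trailing spaces.
    return [word for word in normalized.split(" ") if word]


# The same two words, separated once by U+3000 and once by U+0020.
ideographic = split_mnemonic("あい\u3000うえ")
ascii_spaced = split_mnemonic("あい うえ")
```

    Under this assumption both calls yield the same word list, which is why the type of space entered by the user or the keyboard should not matter after normalization.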

  89. dabura667 commented at 11:56 AM on June 11, 2015: none

    Japanese phones don't auto-insert spaces at all, in fact.

  90. schildbach commented at 11:58 AM on June 11, 2015: contributor

    Well, I will use a customized auto-complete. Otherwise it will insert words not contained in the word lists, or maybe it's even missing words from the lists. I assume I will be able to append the space myself.

  91. dabura667 commented at 12:00 PM on June 11, 2015: none

    That is probably best.

    I like Mycelium's setup.

    Japanese list is unique with the first 3 characters so it should be easy to auto-complete

  92. schildbach commented at 12:02 PM on June 11, 2015: contributor

    FWIW, for the first word I plan to auto-complete to all the supported word lists at the same time, so essentially the dictionary is a union of the wordlists. For all subsequent words, I exclude the word lists that can't match anymore. If after the 12th word there still would be multiple word lists matching, I maybe ask the user for what list to use (if that's needed, I'm not sure).
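    A rough sketch of this narrowing approach, assuming each wordlist is held in memory as a set. The function names and the tiny three-word lists are hypothetical stand-ins for the real 2048-word files.

```python
def narrow_wordlists(words, wordlists):
    """Return the names of all wordlists that contain every entered word.

    `wordlists` maps a list name to a set of words. Lists that cannot
    match any more are excluded as each word is processed.
    """
    candidates = set(wordlists)
    for word in words:
        candidates = {name for name in candidates if word in wordlists[name]}
        if not candidates:
            break  # no supported wordlist matches this phrase
    return candidates


def complete(prefix, candidates, wordlists):
    """Suggest completions from the union of the remaining wordlists."""
    union = set()
    for name in candidates:
        union |= wordlists[name]
    return sorted(word for word in union if word.startswith(prefix))


# Toy data only; the real lists have 2048 words each.
TOY_WORDLISTS = {
    "english": {"abandon", "about", "zoo"},
    "french": {"abandon", "abaisser", "zeste"},
}

after_first = narrow_wordlists(["abandon"], TOY_WORDLISTS)
after_second = narrow_wordlists(["abandon", "about"], TOY_WORDLISTS)
suggestions = complete("aba", after_first, TOY_WORDLISTS)
```

    With the toy data, "abandon" alone leaves both lists as candidates, and the second word resolves the ambiguity. If more than one candidate remains after the last word, the caller would still have to ask the user.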

  93. gurnec commented at 4:01 PM on June 11, 2015: none

    FWIW, I use auto-detection in seedrecover. It's just a UI nicety.

    The French word list isn't really that much of a problem; the likelihood of an entire (random) 12-word mnemonic being ambiguous between English and French is less than 1 in 5 × 10^15.

    As NicolasDorier already pointed out, it's the Chinese Simplified and Traditional wordlists which are problematic if you want to do auto-detection, because they share 62% of their words. That's a 1 in 295 likelihood of ambiguity for a 12-word mnemonic, or 1 in 4720 if you also require the checksum to be valid.
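    These figures can be approximated with a short calculation, assuming each word is drawn uniformly and independently from a 2048-word list. The count of 100 shared English/French words comes from the clash list earlier in the thread; the count of 1275 shared Chinese words is a hypothetical value chosen to be consistent with the quoted "62%" and is not measured from the actual files.

```python
WORDLIST_SIZE = 2048


def ambiguity_odds(shared_words, mnemonic_length=12):
    """Return N such that a random mnemonic is ambiguous with odds 1 in N.

    A mnemonic is ambiguous when every one of its words also appears in
    the other wordlist, so the probability is (shared / 2048) ** length.
    """
    probability = (shared_words / WORDLIST_SIZE) ** mnemonic_length
    return 1 / probability


# 100 words shared between the English and French lists.
french_odds = ambiguity_odds(100)

# Hypothetical count, roughly 62% of 2048 (an assumption, see above).
chinese_odds = ambiguity_odds(1275)

# A 12-word mnemonic carries 4 checksum bits, so a random word sequence
# passes the checksum with probability 1/16.
chinese_odds_with_checksum = chinese_odds * 2 ** 4
```

    Under these assumptions the results land close to the numbers quoted above: a little over 5 × 10^15 for French, roughly 295 for Chinese, and roughly 4720 once the checksum is required.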

    This problem (if it even is one) could have been solved by requiring that for each new word list, if it shares a word with an existing word list, that word must be placed in the same position as it is in the existing word list (or just use Electrum 2.x's method).

  94. wizardofozzie commented at 4:24 AM on June 17, 2015: none

    @Thashiznets I initially flagged French because the first test vector contains "abandon" (11/12 words) and my code was just checking the first word (like Electrum), so the English test vectors were returning "French" as the language; I've used a workaround.

    Basically, I've been trying to differentiate mnemonic phrases without needing to know if it's BIP39, or Electrum 2.x (or Electrum 1.x, which is much harder). I just think it's prudent to have certainty in knowing what type of mnemonic it is by the words alone.

    My reasoning also extends to this (which @gurnec answered).

  95. Kirvx commented at 12:01 PM on June 17, 2015: contributor

    Sorry for not answering this problem, I'm not a tech guy :/ Anyway, even if it's still a problem, since the wordlist has been merged, I think it would be wrong to change it for a non-critical issue (because of the problem of having 2 versions of a wordlist).

  96. realindiahotel commented at 12:35 PM on June 17, 2015: none

    Agreed, leave as is, I think trying to guess the spec used i.e. BIP39, Electrum etc could end in tears.

  97. TheBlueMatt referenced this in commit ba5da44239 on May 5, 2016
  98. nym-zone cross-referenced this on Jan 9, 2018 from issue Fix two errors in the BIP 39 French wordlist by nym-zone
  99. brenorb commented at 6:31 PM on August 13, 2018: contributor

    Hey, I was taking a look at BIP0039 to add Portuguese and then I saw that the French wordlist has a lot of words matching the English list. I know it is not in the proposed rules, but I believe it is important not to have words already used in other languages' mnemonic sets.

    These are the ones identical to the English list: 'french.txt': ['abandon', 'amateur', 'angle', 'animal', 'aspect', 'badge', 'bicycle', 'bonus', 'brave', 'canal', 'capable', 'caution', 'civil', 'client', 'concert', 'correct', 'coyote', 'crucial', 'cruel', 'cycle', 'danger', 'digital', 'distance', 'double', 'dragon', 'effort', 'essence', 'exact', 'excuse', 'fatal', 'fatigue', 'festival', 'figure', 'fortune', 'fragile', 'fruit', 'globe', 'guide', 'humble', 'image', 'immense', 'impact', 'innocent', 'intact', 'jaguar', 'junior', 'label', 'lecture', 'loyal', 'machine', 'million', 'minute', 'miracle', 'mobile', 'muscle', 'nation', 'nature', 'noble', 'notable', 'opinion', 'orange', 'ozone', 'palace', 'panda', 'phrase', 'piano', 'pizza', 'position', 'prison', 'public', 'puzzle', 'question', 'relief', 'rival', 'romance', 'salon', 'science', 'sentence', 'service', 'simple', 'social', 'source', 'spatial', 'stable', 'surface', 'surprise', 'suspect', 'talent', 'train', 'tunnel', 'unique', 'usage', 'vague', 'valve', 'village', 'virus', 'vital', 'volume', 'voyage', 'wagon'],
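    A list like this can be produced with a simple set intersection. The following is a minimal sketch; the function name is hypothetical and the short lists stand in for the contents of the real `english.txt` and `french.txt` files.

```python
def shared_words(wordlist_a, wordlist_b):
    """Return the words present in both wordlists, sorted alphabetically."""
    return sorted(set(wordlist_a) & set(wordlist_b))


# Toy excerpts only; real use would read one word per line from each file.
english_excerpt = ["abandon", "ability", "about", "angle", "zoo"]
french_excerpt = ["abaisser", "abandon", "angle", "zeste"]

clashes = shared_words(english_excerpt, french_excerpt)
```

    Running the same check against every existing wordlist before proposing a new one would reveal such overlaps early.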

  100. Kirvx commented at 8:03 PM on August 13, 2018: contributor

    You're right, that was a concern during the creation of the wordlist (https://en.wikipedia.org/wiki/List_of_English_words_of_French_origin), but it wasn't a priority for me, and I think it wasn't easily possible to apply this additional restriction together with the other rules.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bips. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-14 15:10 UTC