Add bip39 Indonesian wordlist #621

perlancar wants to merge 2 commits into bitcoin:master from perlancar:master, changing 3 files (+2059 −0)
  1. perlancar commented at 3:26 am on January 1, 2018: none

    How the wordlist is produced:

    1. Download and uncompress https://dumps.wikimedia.org/idwiki/latest/idwiki-latest-pages-articles.xml.bz2 (the version used when producing the wordlist is 2017-121-30).
    2. Count the words in all the articles inside articles.xml using this script https://github.com/perlancar/perl-WordLists-ID-Common/blob/master/devscripts/count-words-in-mediawiki-articles . The result is https://raw.githubusercontent.com/perlancar/perl-WordLists-ID-Common/master/devdata/words.txt .
    3. Curate the words manually (mostly removing non-Indonesian words). The result is https://raw.githubusercontent.com/perlancar/perl-WordLists-ID-Common/master/devdata/words-curated.txt . You can diff these two wordlist text files to see the differences.
    4. Generate the BIP39 Indonesian wordlist using this script https://github.com/perlancar/perl-WordList-ID-BIP39/blob/master/devscripts/gen-wordlist . This script basically picks the most frequent words in words-curated.txt that are not already in the English, Spanish, French, and Italian BIP39 wordlists.
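
The selection in step 4 can be sketched in a few lines. This is a minimal Python illustration, not the author's Perl script: the function name `pick_bip39_words`, the word-to-count mapping used as input, and the toy data are all assumptions made for the example.

```python
from collections import Counter

def pick_bip39_words(counts, taken, size=2048):
    """Return the `size` most frequent words not in `taken`, sorted.

    counts: mapping of word -> frequency (hypothetical input shape,
            e.g. built from a curated word-count file)
    taken:  set of words already used by other BIP39 wordlists
    """
    picked = []
    # most_common() yields (word, count) pairs from most to least frequent.
    for word, _ in Counter(counts).most_common():
        if word in taken:
            continue
        picked.append(word)
        if len(picked) == size:
            break
    # BIP39 wordlists are stored sorted; plain string sort is enough for ASCII.
    return sorted(picked)

# Toy data; the real inputs are far larger.
counts = {"yang": 900, "dan": 850, "total": 400, "upeti": 792, "radio": 300}
taken = {"total", "radio"}  # stand-in for words found in other BIP39 lists
print(pick_bip39_words(counts, taken, size=2))
```

With the toy data, the two most frequent words that are not in `taken` are selected and then sorted.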
  2. Add bip39 Indonesian wordlist 2b35f485a4
  3. dabura667 commented at 6:38 am on January 1, 2018: none

    Technical Checklist ACK 2b35f485a449619060af724ea88a6ef0e1f44e43

    Checked:

    1. Is NFKD normalized list
    2. No BOM
    3. Has single trailing LF line break and is separated by LF line breaks
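
The three checklist items can be expressed as a small script. This is a sketch under assumptions: the function name `check_wordlist_bytes` and the wording of the messages are made up for illustration, and it is not the tool the reviewer actually used.

```python
import unicodedata

def check_wordlist_bytes(data: bytes) -> list:
    """Return a list of problems found; an empty list means all checks pass."""
    problems = []
    # A UTF-8 byte order mark would be the first three bytes of the file.
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
    # Any CR means CRLF or bare-CR line endings somewhere in the file.
    if b"\r" in data:
        problems.append("file contains CR characters (expected LF only)")
    # Exactly one LF at the end: present, and not doubled.
    if not data.endswith(b"\n") or data.endswith(b"\n\n"):
        problems.append("file must end with exactly one LF")
    # NFKD-normalized text is unchanged by another NFKD pass.
    text = data.decode("utf-8")
    if unicodedata.normalize("NFKD", text) != text:
        problems.append("text is not NFKD-normalized")
    return problems

print(check_wordlist_bytes(b"abad\nabadi\n"))
```

For a pure-ASCII list such as the proposed one, the NFKD check passes trivially, because NFKD leaves ASCII text unchanged.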
  4. perlancar commented at 7:13 am on January 1, 2018: none
    @dabura667 Like English, Indonesian uses only the 26 Latin letters and is encodable in ASCII, so there is no need for a BOM, and asciibetical sorting suffices. Please advise about the trailing LF and LF line breaks, because that is what most of the other wordlist files use too.
  5. dabura667 commented at 8:07 am on January 1, 2018: none
    ACK means “ok to merge” and “Technical Checklist” means I checked for formatting errors and everything is OK.
  6. perlancar commented at 8:13 am on January 1, 2018: none
    Ah ok, thanks.
  7. Add Perl bip39 implementation: Bitcoin::BIP39 4fcfed70d2
  8. nym-zone referenced this in commit b2f66ba1a4 on Jan 5, 2018
  9. nym-zone commented at 3:18 pm on January 5, 2018: contributor

    Has this any independent review from other native speakers or experts in the language? The earlier BIP wordlist addition pull requests witnessed much lively discussion and examination. One was superseded by a new proposal after significant problems were found.

    Since this is said to be ASCII, it is easy to check some basic characteristics in addition to those checked by @dabura667:

    $ grep '^[^a-z]' indonesian.txt
    $ grep -Eo '^[a-z]{0,3}$' indonesian.txt
    $ grep -Eo '^[a-z]{4}' indonesian.txt | sort -s | uniq | wc -l
    2048
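
For readers without grep at hand, the same three checks can be approximated in Python. This is a sketch: the function name `basic_checks` is hypothetical, and Python's `[a-z]` always means the ASCII range, whereas grep's interpretation can depend on the locale.

```python
import re

def basic_checks(words):
    """Mirror the three grep checks on a list of words.

    Returns (words not starting with a-z,
             words made of 0 to 3 lowercase letters,
             number of distinct 4-letter lowercase prefixes).
    """
    bad_start = [w for w in words if re.match(r"[^a-z]", w)]
    too_short = [w for w in words if re.fullmatch(r"[a-z]{0,3}", w)]
    prefixes = {m.group() for w in words if (m := re.match(r"[a-z]{4}", w))}
    return bad_start, too_short, len(prefixes)

# Toy input that deliberately fails each check once.
sample = ["abad", "abadi", "abdi", "Kota", "api"]
print(basic_checks(sample))
```

For a list that passes, the first two results are empty and the third equals the number of words, which is 2048 for a BIP39 wordlist.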
    

    Sample 24-word mnemonic (outside of code-tagging to permit line breaks):

    $ easyseed -b 256 -l id
    keuskupan utuh kegunaan serta pesisir mungkin reguler cermin langsung enam parkir lari gaib bensin babak dinilai meluncurkan mandiri bijaksana keamanan domestik bercerita prefektur legislatif

    @perlancar, a question from an implementer: Are these strings for identifying this wordlist sensible and appropriate for Indonesian-speaking users?

    +	LANG(indonesian,	u8"Bahasa Indonesia",	"id",	ascii_space ),
    

    I have proposed on bitcoin-dev that native language strings and short ASCII codes should be standardized: https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2018-January/015498.html

  10. nym-zone commented at 3:37 pm on January 5, 2018: contributor

    @perlancar, I recommend that you split 4fcfed70d2761c10c2cd228db8c9310af15f64e6, “Add Perl bip39 implementation: Bitcoin::BIP39”, into a separate pull request. That should be reviewed separately from the wordlist; and it’s not even mentioned in the title or description of this pull request #621. People who glance through the list of pull requests will not even realize that this exists.

    It is good that you did atomic commits, so that 2b35f485a449619060af724ea88a6ef0e1f44e43 (the proposed Indonesian wordlist) and 4fcfed70d2761c10c2cd228db8c9310af15f64e6 (listing of the Perl module in bip-0039.mediawiki itself) can be handled separately.

    I am not a maintainer here; therefore, I can only make a “recommendation”.

  11. perlancar commented at 11:15 pm on January 5, 2018: none
    @nym-zone I did separate them into two pull requests, but admittedly I didn’t create a branch for the first PR. Then, when I created the second PR, GitHub merged the two.
  12. perlancar commented at 11:23 pm on January 5, 2018: none
    @nym-zone Thanks for the input. I did publish the proposed Indonesian wordlist as a Perl module (mentioned above) in the hope of gathering some input, but of course the number of Indonesian Perl users is negligible. I will try to solicit input from more communities. For your information, I am a native Indonesian speaker. The words were selected from the most common Indonesian words, which I have curated manually, so I would say they are sensible. But that is my opinion.
  13. nym-zone referenced this in commit d03ddae008 on Jan 6, 2018
  14. nym-zone referenced this in commit 8aaa6f37e8 on Jan 7, 2018
  15. nym-zone referenced this in commit c7d698a35f on Jan 11, 2018
  16. ubunteroz commented at 3:59 pm on May 27, 2018: none

    Hello, native Indonesian speaker here. Thanks @perlancar for your effort! A quick review of your wordlist:

    1. It’s better to remove conjunctions (such as atau, tetapi, yaitu, and yakni) to avoid confusion.
    2. Also, non-root words can cause further confusion (peperangan -> perang, pedesaan -> desa, menyebabkan -> sebab).
    3. Limiting the wordlist to words of at most 8 or 9 letters would make them easier to remember.
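
The three suggestions could be applied mechanically along these lines. This is only a sketch: the names `CONJUNCTIONS`, `ROOTS` and `apply_review` are hypothetical, the two tables contain nothing beyond the examples given in the review, and replacing a derived word with its root is one reading of the "->" notation. A real pass would need a full stoplist and a proper Indonesian stemmer.

```python
# Examples taken from the review; real tables would be much larger.
CONJUNCTIONS = {"atau", "tetapi", "yaitu", "yakni"}
ROOTS = {"peperangan": "perang", "pedesaan": "desa", "menyebabkan": "sebab"}

def apply_review(words, max_len=9):
    """Drop conjunctions, map derived words to roots, cap the word length."""
    result = []
    seen = set()
    for word in words:
        if word in CONJUNCTIONS:
            continue
        word = ROOTS.get(word, word)  # fall back to the word itself
        if len(word) > max_len or word in seen:
            continue
        seen.add(word)
        result.append(word)
    return result

print(apply_review(["atau", "peperangan", "perang", "transportasi", "desa"]))
```

Mapping derived words to roots can create duplicates (here "peperangan" and "perang"), which is why the sketch tracks words it has already kept.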
  17. perlancar commented at 9:42 pm on June 2, 2018: none
    Hi Surya (@ubunteroz), thanks for the comments. All are good. I will manually construct a new, edited wordlist when I have some free time, but PRs or patches are welcome :)
  18. luke-jr commented at 5:19 am on July 5, 2018: member

    @dabura667 Only authors are supposed to ACK changes to BIPs.

    In this case, these people: @slush0 @prusnak @voisine @ebfull

  19. luke-jr added the label Proposed BIP modification on Jul 5, 2018
  20. dlwhitehurst cross-referenced this on Jul 11, 2018 from issue Add Indonesian Word List once Accepted (bip39) by dlwhitehurst
  21. DonaldTsang cross-referenced this on Dec 24, 2018 from issue Binary Lists by DonaldTsang
  22. in bip-0039/indonesian.txt:1989 in 4fcfed70d2
    1984+unsur
    1985+untuk
    1986+upacara
    1987+upah
    1988+upaya
    1989+upeti
    


    heri16 commented at 3:55 am on July 3, 2019:
    This word is unfamiliar even to native speakers.

    perlancar commented at 8:27 am on July 3, 2019:
    Sorry, by “this word” which one are you referring to? The GitHub web UI is showing 4 words: upacara, upah, upaya, upeti. I would argue that they are familiar to native speakers. What is your basis for saying that they are not? Take “upeti” for example: this word is found 792 times in the Wikipedia articles (see the PR description, which I have just updated to describe the process of producing the wordlist).

    perlancar commented at 8:32 am on July 3, 2019:

    > Native Indonesian speaker here. Instead of the Wikipedia corpus (which contains too much technical terminology), have we considered the “Sari Kata Bahasa Indonesia” book?
    >
    > This book is used nationwide in junior/primary schools. It would ensure anyone with a basic education would be able to understand our wordlist.
    >
    > This same book is also used as a reference to test if a foreigner can be permitted to work in Indonesian companies.

    This is a good idea, but what about the license? And is there an online version of it?

  23. in bip-0039/indonesian.txt:1954 in 4fcfed70d2
    1949+tumbuh
    1950+tunai
    1951+tunduk
    1952+tunggal
    1953+tuntutan
    1954+turbin
    


    heri16 commented at 3:56 am on July 3, 2019:
    This word is unfamiliar even to native speakers.

    perlancar commented at 8:28 am on July 3, 2019:
    Which word are you referring to? tunduk, tunggal, tuntutan, turbin? All of them? What is the basis? (See my comment on the upeti word above).
  24. in bip-0039/indonesian.txt:1935 in 4fcfed70d2
    1930+topik
    1931+tradisional
    1932+tragedi
    1933+trailer
    1934+transportasi
    1935+trek
    


    heri16 commented at 3:57 am on July 3, 2019:
    This word is unfamiliar even to native speakers.

    perlancar commented at 8:29 am on July 3, 2019:
    Which word are you referring to? trek? I tend to agree.
  25. in bip-0039/indonesian.txt:1915 in 4fcfed70d2
    1910+tiga
    1911+tiket
    1912+tikus
    1913+timah
    1914+timbul
    1915+timnya
    


    heri16 commented at 3:58 am on July 3, 2019:
    This word is a conjugation.
  26. in bip-0039/indonesian.txt:1901 in 4fcfed70d2
    1896+terowongan
    1897+terpilih
    1898+tersebut
    1899+tertentu
    1900+terutama
    1901+terwujud
    


    heri16 commented at 3:58 am on July 3, 2019:
    This word is a conjugation.

    perlancar commented at 8:29 am on July 3, 2019:
    I agree that conjugated words should be avoided.
  27. in bip-0039/indonesian.txt:1900 in 4fcfed70d2
    1895+ternyata
    1896+terowongan
    1897+terpilih
    1898+tersebut
    1899+tertentu
    1900+terutama
    


    heri16 commented at 3:58 am on July 3, 2019:
    This word is a conjugation.
  28. in bip-0039/indonesian.txt:1854 in 4fcfed70d2
    1849+tantangan
    1850+tanya
    1851+tapi
    1852+tari
    1853+tata
    1854+tatkala
    


    heri16 commented at 3:59 am on July 3, 2019:
    This word might be unfamiliar to some native speakers.
  29. in bip-0039/indonesian.txt:1834 in 4fcfed70d2
    1829+tadi
    1830+tahap
    1831+tahta
    1832+tahun
    1833+tajam
    1834+takhta
    


    heri16 commented at 4:00 am on July 3, 2019:
    This word might be unfamiliar to some native speakers.
  30. in bip-0039/indonesian.txt:1821 in 4fcfed70d2
    1816+surya
    1817+susah
    1818+susu
    1819+sutradara
    1820+swasta
    1821+syair
    


    heri16 commented at 4:00 am on July 3, 2019:
    This word might be unfamiliar to some native speakers.
  31. in bip-0039/indonesian.txt:1812 in 4fcfed70d2
    1807+sungai
    1808+suntingan
    1809+supaya
    1810+support
    1811+surat
    1812+surel
    


    heri16 commented at 4:01 am on July 3, 2019:
    This word might be unfamiliar to some native speakers.
  32. in bip-0039/indonesian.txt:1791 in 4fcfed70d2
    1786+suaminya
    1787+suara
    1788+suasana
    1789+suatu
    1790+subjek
    1791+subspesies
    


    heri16 commented at 4:01 am on July 3, 2019:
    This word might be unfamiliar to some native speakers.
  33. in bip-0039/indonesian.txt:1783 in 4fcfed70d2
    1778+stasiun
    1779+status
    1780+stop
    1781+strategis
    1782+string
    1783+stroke
    


    heri16 commented at 4:01 am on July 3, 2019:
    This word might be unfamiliar to some native speakers.
  34. in bip-0039/indonesian.txt:1782 in 4fcfed70d2
    1777+star
    1778+stasiun
    1779+status
    1780+stop
    1781+strategis
    1782+string
    


    heri16 commented at 4:02 am on July 3, 2019:
    This imported word is unfamiliar to almost all native speakers.
  35. in bip-0039/indonesian.txt:1777 in 4fcfed70d2
    1772+spons
    1773+stabil
    1774+stadion
    1775+staf
    1776+standar
    1777+star
    


    heri16 commented at 4:02 am on July 3, 2019:
    This imported word is unfamiliar to almost all native speakers.
  36. in bip-0039/indonesian.txt:1745 in 4fcfed70d2
    1740+silsilah
    1741+siluman
    1742+simbol
    1743+sinar
    1744+sinetron
    1745+singel
    


    heri16 commented at 4:03 am on July 3, 2019:
    This imported word has multiple spellings debated by native speakers.

    perlancar commented at 8:30 am on July 3, 2019:
    Agreed.
  37. in bip-0039/indonesian.txt:1739 in 4fcfed70d2
    1734+signifikan
    1735+sihir
    1736+sikap
    1737+siklus
    1738+silat
    1739+silinder
    


    heri16 commented at 4:05 am on July 3, 2019:
    Technical term for most native speakers.

    perlancar commented at 8:30 am on July 3, 2019:
    I can agree to this.
  38. in bip-0039/indonesian.txt:1705 in 4fcfed70d2
    1700+sengaja
    1701+seni
    1702+senjata
    1703+sensus
    1704+sentral
    1705+senyawa
    


    heri16 commented at 4:07 am on July 3, 2019:
    Is a conjugation or a technical term for most native speakers.

    perlancar commented at 8:30 am on July 3, 2019:
    Agreed.
  39. in bip-0039/indonesian.txt:1694 in 4fcfed70d2
    1689+selisih
    1690+seluruh
    1691+semakin
    1692+sembilan
    1693+sementara
    1694+seminggu
    


    heri16 commented at 4:09 am on July 3, 2019:
    Conjugation

    perlancar commented at 8:30 am on July 3, 2019:
    Agreed.
  40. in bip-0039/indonesian.txt:1637 in 4fcfed70d2
    1632+salju
    1633+saluran
    1634+sama
    1635+sambil
    1636+sampai
    1637+samurai
    


    heri16 commented at 4:11 am on July 3, 2019:
    Imported word not familiar to most native speakers.

    perlancar commented at 8:31 am on July 3, 2019:
    Agreed on imported words, though I don’t share your opinion that “samurai” is unfamiliar to most native speakers.
  41. heri16 changes_requested
  42. heri16 commented at 4:47 am on July 3, 2019: none

    Native Indonesian speaker here. Instead of the Wikipedia corpus (which contains too much technical terminology), have we considered the “Sari Kata Bahasa Indonesia” book?

    This book is used nationwide in junior/primary schools. It would ensure anyone with a basic education would be able to understand our wordlist.

    This same book is also used as a reference to test if a foreigner can be permitted to work in Indonesian companies.

  43. perlancar commented at 3:33 pm on July 3, 2019: none

    I have created another wordlist. The source is still Indonesian Wikipedia, but this time I manually curated the BIP wordlist using these criteria:

    1. Words from 4 to 10 letters long.
    2. Avoid conjugated words, choose only root words.
    3. Avoid prepositions, conjunctions, pronouns (e.g. dan, atau, jika, kamu, saya, …).
    4. Avoid technical words when possible.
    5. Avoid loan words when possible.
    6. Avoid words that have multiple competing spellings, if possible.
    7. The above criteria are balanced against this: I want every word in the BIP wordlist to be unique by its first 4 letters, so that even if the user omits or mistypes some letters at the end of a word, the word can still be identified.
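
Criterion 7 is what lets software identify a word from its first four letters. A minimal sketch of that lookup follows; `build_prefix_index` and `resolve` are hypothetical names chosen for the example, not functions from any existing BIP39 library.

```python
def build_prefix_index(words, n=4):
    """Map each word's first n letters to the word; fail on a shared prefix."""
    index = {}
    for word in words:
        key = word[:n]
        if key in index:
            raise ValueError(
                f"prefix {key!r} is shared by {index[key]!r} and {word!r}"
            )
        index[key] = word
    return index

def resolve(typed, index, n=4):
    """Return the wordlist entry matching the first n letters, or None."""
    return index.get(typed[:n].lower())

index = build_prefix_index(["upacara", "upah", "upaya", "upeti"])
print(resolve("upacra", index))  # typo after the fourth letter
print(resolve("upet", index))    # truncated input
```

Only the tail of a word is protected this way: a mistake within the first four letters either finds no entry or, worse, resolves to a different word.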

    The resulting BIP wordlist is put here: https://github.com/perlancar/perl-WordList-ID-BIP39/blob/master/devdata/words-bip.txt. I haven’t made a change to this PR. Inputs/comments welcome.

    As you might see, some of the words are indeed not very familiar to native speakers, because I also want to satisfy criterion number 7.

    Also useful is the larger wordlist from which I curated this BIP wordlist: https://github.com/perlancar/perl-WordList-ID-BIP39/blob/master/devdata/words.txt . I indented the words which I want to use in the BIP wordlist. You can suggest corrections or changes by submitting a PR which changes this file.

  44. luke-jr closed this on Jul 2, 2021


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bips. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-24 05:10 UTC
