Create korean.txt #544

juwhankim wants to merge 1 commit into bitcoin:master from juwhankim:patch-1, changing 1 file (+2048 −0)
  1. juwhankim commented at 9:42 pm on May 31, 2017: none

    Wordlist for the Korean language.

    Generated by the following process:

    1. Started from the original copy of “Most frequently used Korean words for education purposes”, published by the National Institute of Korean Language (2003), http://www.korean.go.kr/front/etcData/etcDataView.do?mn_id=46&etc_seq=71
    2. Removed homonyms.
    3. Removed words with fewer than 2 characters (most Korean words are 2~5 characters).
    4. Removed word pairs (or groups) that share their first 2 characters, i.e., kept only words whose leading 2 characters are unique.
    5. Randomly picked 2048 samples (uniformly distributed).
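
The five steps above can be sketched roughly as follows. This is only an illustration, not the script actually used for this pull request: the function name `build_wordlist`, its parameters, and the treatment of homonyms as "spellings that appear more than once in the source list" are all assumptions, and the input is assumed to be composed (NFC) Hangul so that `len()` counts syllable blocks.

```python
import random
from collections import Counter


def build_wordlist(candidates, size=2048, prefix_len=2, seed=None):
    """Filter candidate words, then sample `size` of them uniformly."""
    # Step 2 (assumption): a homonym shows up as the same spelling listed
    # more than once, so keep only spellings that occur exactly once.
    counts = Counter(candidates)
    words = [w for w in counts if counts[w] == 1]

    # Step 3: drop words shorter than 2 characters.
    words = [w for w in words if len(w) >= 2]

    # Step 4: drop every word whose leading prefix is shared with another word.
    prefix_counts = Counter(w[:prefix_len] for w in words)
    words = [w for w in words if prefix_counts[w[:prefix_len]] == 1]

    if len(words) < size:
        raise ValueError(f"only {len(words)} candidates left, need {size}")

    # Step 5: uniform sampling without replacement, sorted for a stable file.
    return sorted(random.Random(seed).sample(words, size))
```

With a real frequency list, `build_wordlist(candidates)` would return 2048 words. Passing a fixed `seed` makes the selection reproducible, which may help reviewers re-derive the list.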
  2. Create korean.txt
    Wordlist for Korean language
    d0802c2c85
  3. dabura667 commented at 11:54 pm on May 31, 2017: none

    Hooray! I was trying to get my Korean friends to do this for a while.

    Is each word uniquely identified by its first two symbols?

    Most language wordlists are unique within the first x symbols, which makes it easier to build predictive software keyboards.

    Also, is the wordlist normalized with NFKD Unicode normalization?

    I will check once I can get to a computer.
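
Both questions above can be checked mechanically. The sketch below is one possible way to do it, not an official BIP 39 tool: the helper names `is_nfkd` and `prefix_collisions` are hypothetical, and measuring the prefix on composed syllable blocks (NFC) rather than on decomposed jamo is an assumption about what "first two symbols" means here.

```python
import unicodedata
from collections import defaultdict


def is_nfkd(word):
    """Return True if `word` is already in NFKD form."""
    return unicodedata.normalize("NFKD", word) == word


def prefix_collisions(words, prefix_len=2):
    """Group words that share their first `prefix_len` syllable blocks."""
    groups = defaultdict(list)
    for word in words:
        # Compose first so one Hangul syllable block counts as one symbol.
        composed = unicodedata.normalize("NFC", word)
        groups[composed[:prefix_len]].append(composed)
    # Only prefixes claimed by more than one word are collisions.
    return {p: ws for p, ws in groups.items() if len(ws) > 1}
```

An empty result from `prefix_collisions` means every word is unique in its first two blocks, and `all(is_nfkd(w) for w in words)` answers the normalization question.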

  4. dabura667 commented at 11:57 pm on May 31, 2017: none

    Also, are these all nouns? Verbs? A mixture?

    For any part of speech that conjugates (verbs, as I understand it, can end in ~da, ~yo, or ~bnida depending on politeness), are they all in the same conjugation? (I saw kayo, but maybe that’s a noun? I also see a lot of ~da, so I am guessing those are verbs.)

  5. dabura667 commented at 0:00 am on June 1, 2017: none

    Also, there’s a lot of ~hada words, which I understand is similar to ~suru in Japanese. These words tend to be of a noun + suru = “do the noun” type verb, so maybe removing the hada and just using the noun might be good.

    What are your thoughts?

  6. juwhankim commented at 0:52 am on June 1, 2017: none

    Hello,

    Wow, a lot of questions.

    Let me try to go over some justifications.

    The first thing to note about the Korean language is that it is agglutinative, which means that ~~hada is a complete lexeme even though it carries a repetitive suffix. The typical structure is 2~3 symbols plus a suffix that defines the part of speech. That is mostly the case for verbs, adverbs, and adjectives.

    Nouns, on the other hand, tend to be shorter because they carry no trailing affixes. They are usually in the 2~3 symbol range.

    If one confines the selection to nouns and word roots, the variety of word lengths becomes rather limited. However, there are also cases where ~~hada and ~~ alone are both legitimate words.

    Hence I removed every word in the frequently used dictionary that appears in multiple forms. For example, if both zzz+hada and zzz were present, I removed both of them to avoid confusion.
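
The "remove both forms" rule described above could look roughly like this. It is a sketch under assumptions: `drop_hada_pairs` is a hypothetical name, the suffix is matched as the literal composed string 하다, and the input is assumed to be NFC text.

```python
HADA = "하다"  # the -hada suffix, written as composed Hangul


def drop_hada_pairs(words):
    """Remove both `stem` and `stem + hada` when both are in the list."""
    present = set(words)
    ambiguous = set()
    for word in present:
        # Only consider words that have a non-empty stem before the suffix.
        if word.endswith(HADA) and len(word) > len(HADA):
            stem = word[: -len(HADA)]
            if stem in present:
                ambiguous.update((stem, word))
    # Preserve the original order of the surviving words.
    return [w for w in words if w not in ambiguous]
```

Note that a -hada word whose bare stem is not in the source list survives this filter, which is consistent with the list still containing many -hada entries.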

    In summary:

    1. To increase the variety of word lengths, I opted to mix verbs, nouns, adverbs, and adjectives.
    2. After all the filtering, I ended up with ~3000 words out of ~5000, which I believe is a proper set to randomly pick 2048 words from.
    3. Repetition of hada, hani, etc. is intrinsic to the Korean language. Basically every verb in its basic form ends in ~~hada. For your information, ~~hada becomes ~~haetda for the past tense, ~~haeseo for “because of doing ~~”, et cetera.
    4. Given the typical length of Korean words, I decided to make the first 2 symbols distinct, considering that each character is already a composition of multiple letters. For example, 닭, which means chicken, is composed of ㄷ(d) ㅏ(a) ㄹ(r) ㄱ(g) and reads as darg. Hence, entropy-wise, one composed character in Korean is roughly equivalent to 2~4 symbols in Roman characters. Converting back, by requiring 2 unique Korean characters, I am taking on average 4~6 unique symbols in the Roman alphabet. That should agree well with the standard published in BIP 39.
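
The composition described in point 4 can be observed with Unicode normalization. The snippet below is a small illustration using Python's standard `unicodedata` module; the helper name `jamo` is hypothetical. One caveat: Unicode encodes the final cluster ㄹ+ㄱ as a single conjoining jamo, so NFKD yields three code points for 닭 rather than the four letters listed above.

```python
import unicodedata

SYLLABLE = "\ub2ed"  # 닭 (chicken) as one precomposed syllable block


def jamo(text):
    """Decompose Hangul syllable blocks into conjoining jamo via NFKD."""
    return list(unicodedata.normalize("NFKD", text))


letters = jamo(SYLLABLE)
# Initial consonant, vowel, and final consonant cluster as code points.
print([f"U+{ord(ch):04X}" for ch in letters])
```

This is also why the NFKD form of a Korean wordlist is several times longer in code points than its composed form, even though it renders identically.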

    However, I share your concern that the structured repetition of hada might have an undesirable impact on randomizing the seed. From the little I know, it should not affect the quality of the final seed, because the mnemonic goes through many rounds of hashing, and hashing is all about distributing bits here and there. But I might well be very wrong.
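
For reference, the hashing step alluded to above is, in BIP 39, PBKDF2 with HMAC-SHA512 and 2048 iterations over the NFKD-normalized mnemonic, salted with the string "mnemonic" plus the passphrase. The sketch below follows that description using only the standard library; the function name `mnemonic_to_seed` and the sample phrase are illustrative assumptions, and the phrase is not a checksum-valid mnemonic.

```python
import hashlib
import unicodedata


def mnemonic_to_seed(mnemonic, passphrase=""):
    """Derive a 64-byte seed from a mnemonic sentence (BIP 39 style)."""
    # Both inputs are NFKD-normalized before being encoded as UTF-8.
    password = unicodedata.normalize("NFKD", mnemonic).encode("utf-8")
    salt = ("mnemonic" + unicodedata.normalize("NFKD", passphrase)).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha512", password, salt, 2048, dklen=64)


# Hypothetical phrase, for illustration only (not checksum-valid).
phrase = "사랑 바다 가끔"
seed = mnemonic_to_seed(phrase)
print(seed.hex())
```

Because of the normalization step, a composed and a decomposed spelling of the same phrase derive the same seed, and any single-word change produces an unrelated 64-byte output.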

  7. luke-jr commented at 10:14 pm on June 7, 2017: member
  8. luke-jr added the label Proposed BIP modification on Jun 7, 2017
  9. junderw commented at 6:42 am on August 4, 2017: contributor

    I was talking to @Annyonghaseyo at the Tokyo Bitcoin Meetup last night, and she showed me several words which she says are misspelled or flat-out incorrect.

    She also echoed the concern that ~hada words should be avoided: many words stand on their own without hada, and having to remember whether a word was used as a verb or a noun is not user friendly.

    She is currently translating wallet apps so that her parents can use Bitcoin in Korea.

    I asked if she could help look over the list, or recommend a new one, and she said she would look into it.

  10. gnujoow commented at 9:31 am on August 4, 2017: none

    I reviewed the list briefly and found that all of the words are acceptable Korean. However, it still contains homonyms, synonyms, informal words, and pairs (or groups) of words that share their first 2 characters.

    I can fix this if needed.

  11. junderw commented at 1:55 am on August 5, 2017: contributor

    @gnujoow Not being Korean nor having a deep understanding of Korean language, I feel underqualified to make comments, but I have been living in Japan for a majority of my life and have a native understanding of the language.

    In that context, when discussing the word list with @Annyonghaseyo, I was concerned about all the -hada words.

    In Japanese, 99% of the time when we use the equivalent, -suru, as a verb, the word before -suru is actually a noun. (The remaining 1% are gerund nouns that make no sense without -suru after them.)

    Seeing the large number of -hada words, I spoke with my Korean friend @Annyonghaseyo about it. She looked at a few and said “oh yeah, this word is used without -hada, so is this one, and this one too”, etc.

    Some people may choose to memorize the words, and if they are remembering tons of words, some of which use hada and some of which don’t, it will be difficult. So when a few others and I decided on the Japanese wordlist, one rule we came up with was “No -suru verbs”.

    I would appreciate more insight into how -hada works, and whether all these words actually require -hada or not, and if not I would suggest not using them.

  12. Annyonghaseyo commented at 5:50 pm on August 5, 2017: none

    The problem is not whether the words are actually in the Korean dictionary; it is more about sense. The words in this list seem to lack common sense in the way they were chosen, and some words are conjugated in weird ways, etc.

    Example:

    1. 검은색/검정색: these two words both mean “black”.
    2. 값싸다 (low price) is just a combination of price + cheap, ‘값+싸다’, so it should be replaced with a word that is not a combination of two words but rather its own word.
    3. 근데 is a colloquial shortening of ‘그런데’ and should not be used.
    4. Adding -hada to adjectives and nouns to turn them into verbs would be confusing to remember, as you would have to remember not only the word but also whether the word list contained the verb-ified -hada word or the original adjective or noun. The list should do all of one or the other, and not mix. Since many words can be used with -hada, I suggest making the rule to NOT use -hada verbs.

    Using words that have meaning in Korean is not enough for this wordlist; the word selection should have some common sense applied for the use case. This list currently needs a lot of work. A lot of the words are not appropriate for a recovery phrase. I think the Japanese word list uses common sense rules that could be applied in a similar way for Korean.

  13. junderw cross-referenced this on Aug 10, 2017 from issue Add Korean Wordlist by junderw
  14. junderw commented at 2:27 am on August 10, 2017: contributor
    Added alternative list @ #570
  15. luke-jr commented at 1:25 pm on August 15, 2017: member
    As the alternative was ACK’d, this seems no longer applicable? Ping me if I need to reopen.
  16. luke-jr closed this on Aug 15, 2017


This is a metadata mirror of the GitHub repository bitcoin/bips. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-24 12:10 UTC
