
From: Ryo IGARASHI <rigarash@gmail.com>
On Mon, Apr 25, 2011 at 11:06 PM, Gevorg Voskanyan <v_gevorg@yahoo.com> wrote:
OK, that tells me case conversions and normalization don't apply to Japanese. But what about collation? Isn't there any "dictionary order" defined for Japanese words? Just curious.
"Dictionary order" depends on what kind of information is in the dictionary. For example, we use a complex sorting algorithm for a 'Kanji' letter dictionary.
However, for a language dictionary (Japanese-Japanese dictionary), we use pronunciation order. But this is impossible for a program to decide, since each 'Kanji' letter usually has 3-4 (sometimes more) completely different pronunciations, and in principle the right one can only be decided from the context.
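To illustrate the point, here is a minimal Python sketch, not a real collation implementation. The `READINGS` table is hypothetical and hand-written: a program cannot derive it from the Kanji alone (for example, 今日 can be read きょう or こんにち depending on context). Sorting the hiragana readings by code point is also only a rough stand-in for true pronunciation order.

```python
# Hypothetical, hand-written reading table. In real text the reading
# depends on context, so it has to come from a human or a dictionary.
READINGS = {
    "東京": "とうきょう",
    "大阪": "おおさか",
    "京都": "きょうと",
    "青森": "あおもり",
}

words = ["東京", "大阪", "京都", "青森"]

# Plain string sort: compares Unicode code points of the Kanji.
by_code_point = sorted(words)

# Sort by reading: assumes hiragana code point order is close enough
# to pronunciation order for this small example.
by_reading = sorted(words, key=lambda w: READINGS[w])

print(by_code_point)  # ['京都', '大阪', '東京', '青森']
print(by_reading)     # ['青森', '大阪', '京都', '東京']
```

The two orders differ, and the second one is only possible because the readings were supplied from outside the program.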
Just FYI.
These are the sizes of the collation rules for different languages in ICU 4.4, by size (top 5):

630641 2010-04-28 18:28 zh.txt
439431 2010-04-28 18:28 ko.txt
438456 2010-04-28 18:28 ja.txt
 23851 2010-04-28 18:28 kn.txt
 23594 2010-04-28 18:28 bn.txt

I've looked into the ja.txt file and it includes a huge dictionary of Kanji letters sorted by their order. I can't check it on my own, but I assume that the collation rules for Japanese are not that simple.

Also, there are customization parameters for collation in locale names, like

ja_JP.UTF-8@collation=unihan

These keywords are taken from: http://www.unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers

"big5han" Pinyin ordering for Latin, big5 charset ordering for CJK characters. (used in Chinese)
"dict" (dictionary) For a dictionary-style ordering (such as in Sinhala)
"direct" Hindi variant
"gb2312" (gb2312han) Pinyin ordering for Latin, gb2312han charset ordering for CJK characters. (used in Chinese)
"phonebk" (phonebook) For a phonebook-style ordering (such as in German)
"phonetic" Requests a phonetic variant if available, where text is sorted based on pronunciation. It may interleave different scripts, if multiple scripts are in common use.
"pinyin" Pinyin ordering for Latin and for CJK characters; that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin. (used in Chinese)
"reformed" Reformed collation (such as in Swedish)
"search" A special collation type dedicated for string search.
"stroke" Pinyin ordering for Latin, stroke order for CJK characters (used in Chinese)
"trad" (traditional) For a traditional-style ordering (such as in Spanish)
"unihan" Pinyin ordering for Latin, Unihan radical-stroke ordering for CJK characters. (used in Chinese)

So I can't check it, but I can assume it does something right...

Artyom
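
As a rough illustration of how a locale name such as the one above is structured, here is a small Python sketch that splits it into its parts. The helper `parse_locale_name` is hypothetical and is not part of ICU or Boost.Locale. It assumes the layout `language_REGION.encoding@key=value;key=value` and does no validation.

```python
def parse_locale_name(name):
    """Split a locale name into language, region, encoding and keywords.

    Hypothetical helper for illustration only; assumes the layout
    language_REGION.encoding@key=value;key=value
    """
    base, _, keyword_part = name.partition("@")
    lang_region, _, encoding = base.partition(".")
    language, _, region = lang_region.partition("_")

    keywords = {}
    for item in keyword_part.split(";"):
        if not item:
            continue  # no keywords given, or a stray ';'
        key, _, value = item.partition("=")
        keywords[key] = value

    return {
        "language": language,
        "region": region,
        "encoding": encoding,
        "keywords": keywords,
    }


parsed = parse_locale_name("ja_JP.UTF-8@collation=unihan")
print(parsed["language"], parsed["region"], parsed["encoding"])
print(parsed["keywords"])  # {'collation': 'unihan'}
```

Whether a given collation keyword actually changes the sort order depends on the rules available for that locale, which this sketch does not check.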