
On 24/04/2011 22:01, Ryou Ezoe wrote:
Collation and Conversions: Japanese doesn't have the concepts of case and accent. Since we don't have these concepts, we never need them.
I believe all CJK characters can be decomposed into radicals, and the composed and decomposed forms are equivalent, so you could still want normalization. Also, converting between halfwidth and fullwidth katakana could have some uses.
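As a minimal sketch of the second point, NFKC normalization already folds halfwidth katakana into their fullwidth equivalents (the locale name and input string below are only illustrative):

    #include <boost/locale.hpp>
    #include <iostream>
    #include <string>

    int main() {
        // The locale name is just an example; any ICU-backed locale will do.
        boost::locale::generator gen;
        std::locale loc = gen("ja_JP.UTF-8");

        // NFKC folds compatibility characters, so halfwidth katakana come out
        // as their fullwidth equivalents.
        std::string halfwidth = "ｶﾀｶﾅ"; // halfwidth katakana, UTF-8 source
        std::string folded =
            boost::locale::normalize(halfwidth, boost::locale::norm_nfkc, loc);

        std::cout << folded << "\n"; // prints the fullwidth "カタカナ"
    }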
Boundary analysis: What is the definition of a boundary, and how does it analyse it? It sounds too smart a name for the small things it actually does.
It uses the boundary analysis algorithms defined by the Unicode standard, which don't use heuristics or anything like that. Remember that Boost.Locale is just a wrapper around ICU, which is the really smart library.
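To make that concrete, here is a rough sketch of the word-boundary interface (the text and locale name are just examples; the segmentation comes from ICU's implementation of UAX #29):

    #include <boost/locale.hpp>
    #include <iostream>
    #include <string>

    int main() {
        namespace bl = boost::locale;
        bl::generator gen;
        std::locale loc = gen("en_US.UTF-8"); // example locale

        std::string const text = "Hello, boundary analysis!";

        // Iterate over the segments between word boundaries; without a rule
        // filter such as word_any this also yields spaces and punctuation.
        bl::boundary::ssegment_index words(bl::boundary::word,
                                           text.begin(), text.end(), loc);
        for (bl::boundary::ssegment_index::iterator it = words.begin();
             it != words.end(); ++it)
            std::cout << "[" << *it << "] ";
        std::cout << "\n";
    }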
I'd rather call it strtok with hard-coded delimiters. Japanese doesn't separate words with spaces, so unless we perform really complicated natural language processing (which can never be perfect, since we will never have a complete Japanese dictionary), we can't split Japanese text into words. Also, Japanese doesn't have a concept of word wrap, so "find appropriate places for line breaks" is unnecessary. Actually, there are some rules for line breaking in Japanese.
You can still break at punctuation marks, and there are places where you should definitely not break. Thai, Lao, Chinese and Japanese do require the use of dictionaries or heuristics to correctly distinguish words. However, the default algorithm provided by Unicode still gives a best-effort implementation without those things.
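For illustration, a sketch of the line-boundary case (the Japanese sentence and locale name are chosen arbitrarily): even without a dictionary, the default UAX #14 rules give usable break opportunities between most kana and kanji, and they roughly won't allow a break before closing punctuation such as "。".

    #include <boost/locale.hpp>
    #include <iostream>
    #include <string>

    int main() {
        namespace bl = boost::locale;
        bl::generator gen;
        std::locale loc = gen("ja_JP.UTF-8"); // example locale

        std::string const text = "これはペンです。それは本です。"; // example text

        // Each segment ends at a permitted line-break opportunity (UAX #14).
        bl::boundary::ssegment_index chunks(bl::boundary::line,
                                            text.begin(), text.end(), loc);
        for (bl::boundary::ssegment_index::iterator it = chunks.begin();
             it != chunks.end(); ++it)
            std::cout << "[" << *it << "]";
        std::cout << "\n";
    }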