
2009/6/20 Artyom <artyomtnk@yahoo.com>
UTF-16 ... This is the recommended encoding for dealing with Unicode internally for general purposes
To be honest, it is most error prone encoding to work with Unicode:
Amen. Really, I don't see why people don't just use UTF-8 all over the place. Even UTF-32 isn't as convenient as most would like, since you still have combining code points and other similar complications.

As a programmer, what I usually care about is some nebulous concept of "characters", and one character can easily be 3 code points or 1/3 of a code point. It feels like the only way to get Unicode string handling right (at the application level, not the library or rendering level) is to deal entirely in strings and regexes.

Suppose I have "difficult" written with the "ffi" ligature code point, and I do a Perl-style split on /i/. I should probably get "d", the "ff" ligature code point, and "cult". I know that if I tried to code that by hand in every application, I'd miss all kinds of evil corner cases like that.
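
To make the ligature case concrete, here is a minimal Python sketch. It rests on an assumption that is not in the message above: that expanding the ligature through NFKC compatibility normalization before splitting is acceptable. That yields a plain two-letter "ff" in the middle piece rather than the "ff" ligature code point (U+FB00), so it only approximates the ideal result described. The variable names are purely illustrative.

```python
import re
import unicodedata

# "difficult" spelled with U+FB03 LATIN SMALL LIGATURE FFI: 7 code points.
word = "di\uFB03cult"

# A naive split only finds the standalone "i"; the one inside the
# ligature code point is invisible to the regex.
naive = re.split("i", word)

# NFKC replaces compatibility characters such as U+FB03 with their
# plain-letter equivalents ("ffi"), giving 9 code points.
expanded = unicodedata.normalize("NFKC", word)
parts = re.split("i", expanded)

# The opposite case: one user-visible character made of several code points.
decomposed = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

print(naive)
print(parts)
print(len(decomposed), len(composed))
```

Normalizing first is a lossy shortcut, since the original ligature cannot be recovered from the pieces. A split that hands back the "ff" ligature itself would need extra logic to re-compose the leftover letters.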