
On 01/19/2011 06:58 AM, Edward Diener wrote:
... elision by Patrick ...
> I do not believe that UTF-8 is the way to go. In fact I know it is not, except perhaps for the very near future for some programmers (Linux advocates).
>
> Inevitably a Unicode standard will be adopted where every character of every language is represented by a single fixed-length number of bits. Nobody will care any longer that this fixed-length set of bits "wastes space", as so many people today are hysterically fixated on. Whether or not UTF-32 can do this now I do not know, but this world where a character in some language on earth is represented by some arcane multi-byte encoding will end. If UTF-32 cannot do it then UTF-nn inevitably will.

UTF-32 is the only fixed-width UCS encoding.
UTF-16 can encode every character of the Basic Multilingual Plane in fixed width, and that plane covers most of the characters in use in the world. If you know your problem domain, and know that you stay within that first plane, then you can use UTF-16 as a fixed-width encoding. If you have to be able to handle any UCS character, then you can't. Currently 107,296 characters are defined in UCS, out of a total code space of 1,114,112 code points (0 to 10FFFF hex).
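To make that concrete, here is a small stand-alone sketch (plain C++, not taken from any particular library) of the two cases UTF-16 has to handle: a BMP code point becomes a single 16-bit unit, while anything above U+FFFF needs a surrogate pair. That is exactly why UTF-16 is only fixed width if you can guarantee you never leave the BMP.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Encode one UCS code point (0..0x10FFFF) as UTF-16 code units.
    // BMP code points take one unit; supplementary ones take a surrogate pair.
    std::vector<std::uint16_t> to_utf16(std::uint32_t cp)
    {
        std::vector<std::uint16_t> units;
        if (cp < 0x10000) {                       // Basic Multilingual Plane
            units.push_back(static_cast<std::uint16_t>(cp));
        } else {                                  // planes 1..16: surrogate pair
            cp -= 0x10000;
            units.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            units.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        }
        return units;
    }

    int main()
    {
        std::uint32_t samples[] = { 0x0041, 0x20AC, 0x1F600 };  // 'A', EURO SIGN, an emoji
        for (std::uint32_t cp : samples) {
            std::printf("U+%06X -> %zu UTF-16 code unit(s)\n",
                        static_cast<unsigned>(cp), to_utf16(cp).size());
        }
    }

The first two samples each come out as one code unit; the emoji needs two.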
> I do not think that shoving UTF-8 down everybody's throats is the best solution even now; I think a good set of classes to convert between encoding standards is much better.
I agree with you. Nobody should shove any one solution down anyone's throat. Instead, I wish that more people would understand the trade-offs of the different encodings and when each might be more desirable, instead of saying, "Oh, we can never do that" or "Oh, we must always do that." The best thing is to understand your problem domain and what the implications of that domain are for each of the possible encodings.

The truth is that the web and XML applications all use Unicode, as do more and more applications; nobody considers doing a new international application with anything other than Unicode. That means you need to know about the three encodings, UTF-8, UTF-16, and UTF-32, and their trade-offs. If you're on a fast, lightly loaded machine with lots of memory, there can be real advantages to UTF-32. If you're running on a hand-held device with limited memory, UTF-8 can be a real winner. That's a simplistic view of a complex decision, but if you're doing the design for something, you should educate yourself and make the complex decision with forethought.

You can get your own copy of the Unicode 5.2 standard as a zipped PDF at http://www.unicode.org/versions/Unicode5.2.0/UnicodeStandard-5.2.zip. The 6.0 standard is being worked on as we speak.

Patrick
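P.S. To put rough numbers on those size trade-offs, here is a self-contained sketch (the helper functions are my own illustration, not from any library) that counts how many bytes the same sequence of code points occupies in each of the three encodings.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Bytes needed to store one code point in each encoding.
    std::size_t utf8_bytes(std::uint32_t cp)
    {
        if (cp < 0x80)    return 1;
        if (cp < 0x800)   return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    std::size_t utf16_bytes(std::uint32_t cp) { return cp < 0x10000 ? 2 : 4; }
    std::size_t utf32_bytes(std::uint32_t)    { return 4; }

    int main()
    {
        // Mostly-ASCII text with a couple of CJK code points thrown in.
        std::vector<std::uint32_t> text = { 'H', 'e', 'l', 'l', 'o', ',', ' ',
                                            0x4E16, 0x754C, '!' };
        std::size_t u8 = 0, u16 = 0, u32 = 0;
        for (std::uint32_t cp : text) {
            u8  += utf8_bytes(cp);
            u16 += utf16_bytes(cp);
            u32 += utf32_bytes(cp);
        }
        std::printf("UTF-8: %zu bytes, UTF-16: %zu bytes, UTF-32: %zu bytes\n",
                    u8, u16, u32);
    }

For this mostly-ASCII sample the totals come out to 14 bytes in UTF-8 against 20 in UTF-16 and 40 in UTF-32; text that is mostly CJK narrows the gap considerably. Running that kind of measurement against your own data is part of making the decision with forethought.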