
On 01/19/2011 06:58 AM, Edward Diener wrote:
... elision by Patrick ...
> I do not believe that UTF-8 is the way to go. In fact I know it is not, except perhaps for the very near future for some programmers (Linux advocates).
>
> Inevitably a Unicode standard will be adopted where every character of every language is represented by a single fixed-length number of bits. Nobody will care any longer that this fixed-length set of bits "wastes space", as so many people today are hysterically fixated on. Whether or not UTF-32 can do this now I do not know, but this world where a character in some language on earth is represented by some arcane multi-byte encoding will end. If UTF-32 cannot do it then UTF-nn inevitably will.

UTF-32 is the only fixed-width UCS encoding.
UTF-16 can encode every character of the Basic Multilingual Plane in fixed width, and that plane covers most of the characters in use in the world. If you know your problem domain, and know that you stay within that first plane, then you can use UTF-16 as a fixed-width encoding. If you have to be able to handle any UCS character, then you can't. Currently 107,296 characters are defined in UCS, out of a total code space of 1,114,112 code points (0 to 10FFFF hex).
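To make that concrete, here is a small stand-alone sketch (plain C++, not taken from any particular library) of the two cases UTF-16 has to handle: a BMP code point becomes a single 16-bit unit, while anything above U+FFFF needs a surrogate pair. That is exactly why UTF-16 is only fixed width if you can guarantee you never leave the BMP.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Encode one UCS code point (0..0x10FFFF) as UTF-16 code units.
    // BMP code points take one unit; supplementary ones take a surrogate pair.
    std::vector<std::uint16_t> to_utf16(std::uint32_t cp)
    {
        std::vector<std::uint16_t> units;
        if (cp < 0x10000) {                       // Basic Multilingual Plane
            units.push_back(static_cast<std::uint16_t>(cp));
        } else {                                  // planes 1..16: surrogate pair
            cp -= 0x10000;
            units.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            units.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        }
        return units;
    }

    int main()
    {
        std::uint32_t samples[] = { 0x0041, 0x20AC, 0x1F600 };  // 'A', EURO SIGN, an emoji
        for (std::uint32_t cp : samples) {
            std::printf("U+%06X -> %zu UTF-16 code unit(s)\n",
                        static_cast<unsigned>(cp), to_utf16(cp).size());
        }
    }

The first two samples each come out as one code unit; the emoji needs two.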
> I do not think that shoving UTF-8 down everybody's throats is the best solution even now; I think a good set of classes to convert between encoding standards is much better.
I agree with you. Nobody should shove any one solution down anyone's throat. Instead, I wish that more people would understand the trade-offs of the different encodings and when each might be more desirable, instead of saying, "Oh, we can never do that" or "Oh, we must always do that." The best thing is to understand your problem domain and what the implications of that domain are for each of the possible encodings.

The truth is that the web and XML applications all use Unicode, as do more and more applications; nobody considers doing a new international application with anything other than Unicode. That means you need to know about the three encodings, UTF-8, UTF-16, and UTF-32, and their trade-offs. If you're on a fast, lightly loaded machine with lots of memory, there can be real advantages to UTF-32. If you're running on a hand-held device with limited memory, UTF-8 can be a real winner. That's a simplistic view of a complex decision, but if you're doing the design for something, you should educate yourself and make the complex decision with forethought.

You can get your own copy of the Unicode 5.2 standard as a zipped PDF at http://www.unicode.org/versions/Unicode5.2.0/UnicodeStandard-5.2.zip. The 6.0 standard is being worked on as we speak.

Patrick
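P.S. To put rough numbers on those size trade-offs, here is a self-contained sketch (the helper functions are my own illustration, not from any library) that counts how many bytes the same sequence of code points occupies in each of the three encodings.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Bytes needed to store one code point in each encoding.
    std::size_t utf8_bytes(std::uint32_t cp)
    {
        if (cp < 0x80)    return 1;
        if (cp < 0x800)   return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    std::size_t utf16_bytes(std::uint32_t cp) { return cp < 0x10000 ? 2 : 4; }
    std::size_t utf32_bytes(std::uint32_t)    { return 4; }

    int main()
    {
        // Mostly-ASCII text with a couple of CJK code points thrown in.
        std::vector<std::uint32_t> text = { 'H', 'e', 'l', 'l', 'o', ',', ' ',
                                            0x4E16, 0x754C, '!' };
        std::size_t u8 = 0, u16 = 0, u32 = 0;
        for (std::uint32_t cp : text) {
            u8  += utf8_bytes(cp);
            u16 += utf16_bytes(cp);
            u32 += utf32_bytes(cp);
        }
        std::printf("UTF-8: %zu bytes, UTF-16: %zu bytes, UTF-32: %zu bytes\n",
                    u8, u16, u32);
    }

For this mostly-ASCII sample the totals come out to 14 bytes in UTF-8 against 20 in UTF-16 and 40 in UTF-32; text that is mostly CJK narrows the gap considerably. Running that kind of measurement against your own data is part of making the decision with forethought.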