Re: [boost] [General] Always treat std::strings as UTF-8

14 Jan 2011

John B. Turpish wrote:
> ...
> On Fri, Jan 14, 2011 at 4:42 AM, Matus Chochlik <chochlik@gmail.com> wrote:
>> ...
>> b) UTF-32 is basically a waste of memory for most localizations.
> I'm not an expert, so take this with a grain of salt. But couldn't it
> just as easily be said that UTF-8 is a waste of CPU? There are a
> number of operations that are constant time if you can assume a fixed
> size for a character that I would think would have to be linear for
> UTF-8, for example accessing the Nth character.

Yes, in principle, but:

- you rarely, if ever, need to access the Nth character;
- waste of space is also a waste of CPU due to more cache misses;
- UTF-8 has the nice property that you can do things with a string without 
even decoding the characters; for example, you can sort UTF-8 strings as-is, 
or split them on a specific (7 bit) character, such as '.' or '/'.
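
To illustrate the last point, here is a minimal sketch of both operations working directly on the bytes. The names split_utf8 and utf8_less are made up for this example and are not from any library; the code assumes the input is well-formed UTF-8. Splitting is safe because a byte below 0x80 never appears inside a multi-byte sequence, and comparing bytes as unsigned values gives the same order as comparing code points.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: split a UTF-8 string on a 7-bit ASCII delimiter
// without decoding. Bytes below 0x80 never occur inside a multi-byte
// UTF-8 sequence, so a plain byte search cannot cut a character in half.
std::vector<std::string> split_utf8(const std::string& s, char delim) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type pos = s.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(s.substr(start));
            break;
        }
        parts.push_back(s.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}

// Hypothetical helper: order two UTF-8 strings by code point without
// decoding. Bytes are compared as unsigned char so that lead bytes
// >= 0x80 sort after ASCII regardless of whether plain char is signed.
bool utf8_less(const std::string& a, const std::string& b) {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](char x, char y) {
            return static_cast<unsigned char>(x) < static_cast<unsigned char>(y);
        });
}
```

Note that this is code point order, not a locale-aware collation; it is enough for things like keys in a std::map or a sorted index.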

Typically, UTF-32 (UCS-4) is needed only as an intermediate representation in 
very few places; the rest of the strings can happily stay UTF-8.
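
A rough sketch of what such an intermediate step might look like: decode to UTF-32 only where indexed access is really needed, and keep everything else as UTF-8. The function decode_utf8 is a hypothetical name for this example, and it assumes well-formed input; a real decoder would also have to reject overlong forms, surrogates and truncated sequences.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode well-formed UTF-8 into UTF-32 code points.
// No validation is performed; malformed input yields unspecified values.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        // Sequence length is determined by the lead byte.
        std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After decoding, indexing the result is constant time, at the cost of one linear pass and up to four bytes per character for the temporary.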

Peter Dimov