
Frank Mori Hess wrote:
> I don't have a lot of experience using non-ASCII strings in my internal
> code, aside from occasional forays into UTF-8 for special characters, but
> wouldn't using UCS-4 for the "core" encoding be the sane thing to do? With
> a UCS-4 encoding, you could use a basic_string<wchar_t> and continue using
> the familiar API without worrying about the complications and confusion
> caused by variable-length encodings.

The sane thing, perhaps. But take a look at Mozilla, for example, who deal with character data a lot. They are currently evaluating the memory and speed effects of switching from UTF-16 to UTF-8 for everything. The reasoning is that even on web pages consisting mostly of exotic characters, there is still a lot of ASCII around (not counting tag names): URIs, IDs, classes, names, etc. Thus, the space savings could be considerable. (Current benchmarks record an average saving of a few percent on an unfortunately not representative set of pages, if I remember correctly.) Can you imagine what these developers would think of switching to UTF-32, where 11 bits per code unit are guaranteed to be wasted, simply because all Unicode planes can be represented in 21 bits?

Sebastian