
Frank Mori Hess wrote:
> I don't have a lot of experience using non-ASCII strings in my internal
> code, aside from occasional forays into UTF-8 for special characters, but
> wouldn't using UCS-4 for the "core" encoding be the sane thing to do? With
> a UCS-4 encoding, you could use a basic_string<wchar_t> and continue using
> the familiar API without worrying about the complications and confusion
> caused by variable-length encodings.

The sane thing, perhaps. But take a look at Mozilla, for example, who deal with character data a lot. They are currently evaluating the memory and speed effects of switching from UTF-16 to UTF-8 for everything. The reasoning is that even on web pages consisting mostly of exotic characters, there is still a lot of ASCII around (not counting tag names): URIs, IDs, classes, names, etc. Thus, the space savings could be considerable. (Current benchmarks record an average saving of a few percent on an unfortunately not representative set of pages, if I remember correctly.) Can you imagine what these developers would think of switching to UTF-32, where 11 bits per code unit are guaranteed to be wasted, simply because all Unicode planes can be represented in 21 bits?

Sebastian