
James Porter wrote:
> For certain special purposes (like the one above), a variable-width string class would be useful, but I think we should focus on storing strings in fixed-width encodings and then converting them appropriately during I/O.

Actually, I disagree with this. The only general-purpose fixed-width encoding available is UTF-32, and hardly anyone actually uses it. For good reason: for English text, it wastes 75% of the space. In general, it wastes about 11 of its 32 bits (roughly a third) on every character, because Unicode code points fit into 21 bits. Under no circumstances can a UTF-8 string be larger than the UTF-32 equivalent, nor can a UTF-16 string.

You may say that linear traversal of UTF-32 is faster, because you don't have to inspect the bytes to find out where the next character starts. I'm tempted to disagree there. Random access may be faster (a lot faster), but you rarely need it. Mostly, you want to access data linearly anyway. And because UTF-8 can squeeze up to four characters into the space where UTF-32 puts one, cache locality is much better.

Then there's the "practical use" issue. The Linux kernel uses UTF-8 internally (when compiled appropriately). The Windows kernel uses UTF-16. No kernel I know of uses UTF-32. This means that for every system call, UTF-32 strings have to be converted. Another performance hit there. (Not to mention a complexity hit, for calling APIs that return text.) Hmm ... I think Qt and wxWidgets (with appropriate configuration) use UTF-32 on Linux.

I think the problem of UTF-8 and UTF-16 strings is important and must be addressed.

Sebastian Redl
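To make the space argument above concrete, here is a small Python sketch (mine, not from the thread) comparing the encoded sizes of the same text in the three Unicode transformation formats; the sample strings are arbitrary:

```python
# Byte counts of the same text in UTF-8, UTF-16, and UTF-32.
# For ASCII-heavy text UTF-8 is a quarter the size of UTF-32; even
# for code points above U+FFFF it is never larger.
samples = {
    "english": "The quick brown fox",
    "greek": "αβγδε",            # 2 bytes each in UTF-8
    "emoji": "😀😁😂",            # code points above U+FFFF: 4 bytes each
}

for name, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))   # -le variant: no 2-byte BOM
    u32 = len(text.encode("utf-32-le"))
    print(f"{name}: utf-8={u8} utf-16={u16} utf-32={u32}")
    assert u8 <= u32 and u16 <= u32       # UTF-32 never wins on size
```

For the English sample this prints utf-8=19, utf-16=38, utf-32=76: the 75% waste mentioned above.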
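The traversal point can also be illustrated. A minimal sketch (my own, assuming well-formed UTF-8 input) of walking a UTF-8 byte string code point by code point: the lead byte alone tells you how far to jump, so the variable width costs one comparison chain per character, not a full decode:

```python
def utf8_lengths(data: bytes):
    """Yield (offset, byte_length) for each code point in well-formed
    UTF-8. The lead byte's high bits encode the sequence length."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n = 1   # 0xxxxxxx: ASCII, one byte
        elif b < 0xE0:
            n = 2   # 110xxxxx: two-byte sequence
        elif b < 0xF0:
            n = 3   # 1110xxxx: three-byte sequence
        else:
            n = 4   # 11110xxx: four-byte sequence
        yield i, n
        i += n

# 'a' (1 byte), 'é' (2), '€' (3), '😀' (4)
text = "aé€😀".encode("utf-8")
print(list(utf8_lengths(text)))  # [(0, 1), (1, 2), (3, 3), (6, 4)]
```

Linear traversal stays O(n) with a trivial constant; it is only random access by character index that degrades, which is the trade-off argued above.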