
James Porter wrote:
> For certain special purposes (like the one above), a variable-width string class would be useful, but I think we should focus on storing strings in fixed-width encodings and then converting them appropriately during I/O.

Actually, I disagree with this. The only general-purpose fixed-width encoding available is UTF-32, and hardly anyone actually uses it. For good reason: for English text, it wastes 75% of the space. In general, it wastes about 11 of its 32 bits (roughly a third) on every character, because Unicode code points fit into 21 bits. Under no circumstances can a UTF-8 string be larger than the UTF-32 equivalent, nor can a UTF-16 string.

You may say that linear traversal of UTF-32 is faster, because you don't have to inspect the bytes to find out where the next character starts. I'm tempted to disagree there. Random access may be faster (a lot faster), but you rarely need it. Mostly, you want to access data linearly anyway. And because UTF-8 can squeeze up to four characters into the space where UTF-32 puts one, cache locality is much better.

Then there's the "practical use" issue. The Linux kernel uses UTF-8 internally (when compiled appropriately). The Windows kernel uses UTF-16. No kernel I know of uses UTF-32. This means that for every system call, UTF-32 strings have to be converted. Another performance hit there. (Not to mention a complexity hit, for calling APIs that return text.) Hmm ... I think Qt and wxWidgets (with appropriate configuration) use UTF-32 on Linux.

I think the problem of UTF-8 and UTF-16 strings is important and must be addressed.

Sebastian Redl
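To make the space argument above concrete, here is a small Python sketch (mine, not from the thread) comparing the encoded sizes of the same text in the three Unicode transformation formats; the sample strings are arbitrary:

```python
# Byte counts of the same text in UTF-8, UTF-16, and UTF-32.
# For ASCII-heavy text UTF-8 is a quarter the size of UTF-32; even
# for code points above U+FFFF it is never larger.
samples = {
    "english": "The quick brown fox",
    "greek": "αβγδε",            # 2 bytes each in UTF-8
    "emoji": "😀😁😂",            # code points above U+FFFF: 4 bytes each
}

for name, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))   # -le variant: no 2-byte BOM
    u32 = len(text.encode("utf-32-le"))
    print(f"{name}: utf-8={u8} utf-16={u16} utf-32={u32}")
    assert u8 <= u32 and u16 <= u32       # UTF-32 never wins on size
```

For the English sample this prints utf-8=19, utf-16=38, utf-32=76: the 75% waste mentioned above.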
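The traversal point can also be illustrated. A minimal sketch (my own, assuming well-formed UTF-8 input) of walking a UTF-8 byte string code point by code point: the lead byte alone tells you how far to jump, so the variable width costs one comparison chain per character, not a full decode:

```python
def utf8_lengths(data: bytes):
    """Yield (offset, byte_length) for each code point in well-formed
    UTF-8. The lead byte's high bits encode the sequence length."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n = 1   # 0xxxxxxx: ASCII, one byte
        elif b < 0xE0:
            n = 2   # 110xxxxx: two-byte sequence
        elif b < 0xF0:
            n = 3   # 1110xxxx: three-byte sequence
        else:
            n = 4   # 11110xxx: four-byte sequence
        yield i, n
        i += n

# 'a' (1 byte), 'é' (2), '€' (3), '😀' (4)
text = "aé€😀".encode("utf-8")
print(list(utf8_lengths(text)))  # [(0, 1), (1, 2), (3, 3), (6, 4)]
```

Linear traversal stays O(n) with a trivial constant; it is only random access by character index that degrades, which is the trade-off argued above.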