
Anthony Williams <anthony_w.geo@yahoo.com> writes:
Miro Jurisic <macdev@meeroh.org> writes:
[snip]
I have already gone over this in other posts, but, in short, std::basic_string makes performance guarantees that are at odds with Unicode strings.
Only if you use an encoding other than UTF-32/UCS-4. The character type has to be a (POD) UDT rather than a typedef, so that one may specialize std::char_traits for it. Of course, if this gets standardized, then it can be a built-in type, since the standard can specialize its own templates.
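To make the UDT point concrete, here is a minimal sketch of what such a type and its traits might look like. The names utf32_t and utf32_string are hypothetical, not from any existing library, and the traits below are only the bare minimum needed to instantiate basic_string; they are an illustration under those assumptions, not a proposed design.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>
#include <ios>
#include <string>

// Hypothetical POD wrapper holding one UTF-32 code unit (one code point).
struct utf32_t {
    std::uint32_t value;
};

namespace std {
// Specializing char_traits is allowed here because utf32_t is a user-defined type.
template <>
struct char_traits<utf32_t> {
    typedef utf32_t char_type;
    typedef std::uint_least32_t int_type;
    typedef std::streamoff off_type;
    typedef std::streampos pos_type;
    typedef std::mbstate_t state_type;

    static void assign(char_type& dst, const char_type& src) { dst = src; }
    static bool eq(char_type a, char_type b) { return a.value == b.value; }
    static bool lt(char_type a, char_type b) { return a.value < b.value; }

    static int compare(const char_type* a, const char_type* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    // Length of a sequence terminated by a zero code unit.
    static std::size_t length(const char_type* s) {
        std::size_t n = 0;
        while (s[n].value != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n, const char_type& c) {
        for (std::size_t i = 0; i < n; ++i) {
            if (eq(s[i], c)) return s + i;
        }
        return 0;
    }
    static char_type* move(char_type* dst, const char_type* src, std::size_t n) {
        if (n != 0) std::memmove(dst, src, n * sizeof(char_type));
        return dst;
    }
    static char_type* copy(char_type* dst, const char_type* src, std::size_t n) {
        if (n != 0) std::memcpy(dst, src, n * sizeof(char_type));
        return dst;
    }
    static char_type* assign(char_type* dst, std::size_t n, char_type c) {
        for (std::size_t i = 0; i < n; ++i) dst[i] = c;
        return dst;
    }
    static char_type to_char_type(int_type i) {
        char_type c = { static_cast<std::uint32_t>(i) };
        return c;
    }
    static int_type to_int_type(char_type c) { return c.value; }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
    // 0xFFFFFFFF is not a valid code point, so it can serve as the EOF marker.
    static int_type eof() { return static_cast<int_type>(0xFFFFFFFFu); }
    static int_type not_eof(int_type i) { return i == eof() ? 0 : i; }
};
}  // namespace std

typedef std::basic_string<utf32_t> utf32_string;
```

With this in place, a code point outside the BMP such as U+1F600 occupies exactly one element of a utf32_string, which is what lets the basic_string complexity guarantees hold.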
Performance guarantees aside, standardizing around UTF-32 is not, IMO, practical.
[snip]
3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, typedef the appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
This is not a good idea. If you do this, you will produce a basic_string which can violate well-formedness of Unicode strings when you use any mutation algorithm other than concatenation, or you will violate performance guarantees of basic_string.
Yes. basic_string<CharType> relies on each CharType being a valid entity in its own right --- for Unicode this means it must be a single Unicode code point, so using basic_string for UTF-8 is out.
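The well-formedness problem is easy to demonstrate with plain std::string holding UTF-8 bytes. The sketch below uses hypothetical helper names, and the checker is deliberately simplified: it validates only the lead/continuation byte layout and does not reject overlong forms or surrogates.

```cpp
#include <cstddef>
#include <string>

// Simplified structural check of UTF-8: lead byte and continuation byte
// layout only. Overlong encodings, surrogates, and values above U+10FFFF
// are NOT rejected; a real validator would need those checks too.
bool is_structurally_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        if (lead < 0x80) len = 1;
        else if ((lead & 0xE0) == 0xC0) len = 2;
        else if ((lead & 0xF0) == 0xE0) len = 3;
        else if ((lead & 0xF8) == 0xF0) len = 4;
        else return false;  // stray continuation byte or invalid lead byte
        if (i + len > s.size()) return false;  // sequence truncated
        for (std::size_t k = 1; k < len; ++k) {
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}

// 'n' followed by U+00E9, which UTF-8 encodes as the two bytes 0xC3 0xA9.
std::string make_sample() { return std::string("n\xC3\xA9"); }

// An ordinary basic_string mutation: drop the last element (one code unit).
std::string erase_last_element(std::string s) {
    s.erase(s.size() - 1, 1);
    return s;
}
```

Erasing the last element of the sample removes only the trailing byte 0xA9 and leaves a dangling lead byte, so the result is no longer well-formed UTF-8 even though the erase call itself is perfectly legal.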
basic_string can still be used as a low-level storage facility, although it was certainly not designed to be used as such, and in treating it as such, many of the "compatibility" advantages are lost anyway. If you are advocating internal representation in UTF-32, however, I would say that performance measurements generally show that UTF-16 is significantly faster for processing, so the convenience of fitting UTF-32 neatly into the existing interfaces does not justify the cost.
You are right that Unicode does not play fair with most standard locale facilities, especially case conversions, which may be 1-1, 1-many, or 1-0, may be context-sensitive (which could be seen as many-many), and may depend on the locale.
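As an illustration of the 1-many case, the sketch below uppercases a string where one code point expands to two. The function name to_upper_full and its tiny table are hypothetical; only ASCII letters, U+00DF (sharp s, whose full uppercase mapping is "SS") and U+FB01 (the fi ligature, which maps to "FI") are handled, and context-sensitive or locale-specific rules are not attempted.

```cpp
#include <cstddef>
#include <string>

// Hypothetical full case mapping for a single code point. The result is a
// string, not a character, because one code point may map to several.
std::u32string to_upper_full(char32_t c) {
    switch (c) {
        case 0x00DF: return U"SS";  // LATIN SMALL LETTER SHARP S
        case 0xFB01: return U"FI";  // LATIN SMALL LIGATURE FI
        default: break;
    }
    if (c >= U'a' && c <= U'z') {
        return std::u32string(1, static_cast<char32_t>(c - (U'a' - U'A')));
    }
    return std::u32string(1, c);  // everything else is left unchanged here
}

// String-level mapping: the output may be longer than the input.
std::u32string to_upper_full(const std::u32string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        out += to_upper_full(s[i]);
    }
    return out;
}
```

A per-character interface such as std::ctype<CharT>::toupper returns a single CharT, so it has no way to express the sharp s mapping above; that is the mismatch being described.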
Collation is one area where the standard library facilities should be OK, since the standard library collation support deals with whole strings. When you install the collation facet in your locale, you choose the Unicode collation options that are relevant to you.
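For reference, this is roughly how the whole-string collation interface is used. The helper names are hypothetical, and the sketch only exercises whatever std::collate facet the given locale carries; a Unicode-aware facet would be a separate class derived from std::collate that overrides do_compare, which is not shown here.

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Compare two whole strings using the collate facet installed in loc.
// Returns a negative, zero, or positive value, like strcmp.
int collate_compare(const std::locale& loc, const std::string& a, const std::string& b) {
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}

// std::locale::operator() compares two strings through the collate facet,
// so a locale object can be passed directly as the sort predicate.
std::vector<std::string> sorted_with(const std::locale& loc, std::vector<std::string> v) {
    std::sort(v.begin(), v.end(), loc);
    return v;
}
```

Because the facet always sees complete strings rather than individual characters, the multi-code-point issues that affect case conversion do not arise at this interface.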
Perhaps, except for the other issues which I have described in other messages.

-- Jeremy Maitin-Shepard