
- The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.
Why not? Once the locale facets are provided, the std iostreams will "just work", that was the whole point of templating them in the first place.
I have already gone over this in other posts, but, in short, std::basic_string makes performance guarantees that are at odds with Unicode strings.
basic_string is a sequence of code points, no more, no less; all performance guarantees for basic_string can be met as such.
However, I think we're getting ahead of ourselves here: a Unicode library should be handled in stages:
1) define the data types for 8/16/32 bit Unicode characters.
The fact that you believe this is a reasonable first step leads me to believe that you have not given much thought to the fact that even if you use a 32-bit Unicode encoding, a character can take up more than 32 bits (and likewise for 16-bit and 8-bit encodings). Unicode characters are not fixed-width data in any encoding.
Well, it is the same first step that ICU takes: there is also a proposal before the C language committee to introduce such data types (they're called char16_t and char32_t), and C++ is likely to follow suit (see http://std.dkuug.dk/jtc1/sc22/wg14/www/docs/n1040.pdf). I'm talking about code points (and sequences thereof), not characters or glyphs, which, as you say, can consist of multiple code points. I would handle "characters" and "glyphs" as iterator adapters sitting on top of sequences of code points. For code points, basic_string is as good a container as any (as are vector and deque and anything else you care to define).
2) define iterator adapters to convert a sequence of one Unicode character type to another.
This is also not as easy as you seem to believe, because even within one encoding many strings can have multiple representations.
I'm not talking about normalisation / composition here: just conversion between encodings. ICU does this already, as do many other libraries. Iterator adapters for normalisation / composition / compression would also be useful additions, as would adapters for iterating "characters" and "glyphs".
3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, typedef the appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
This is not a good idea. If you do this, you will produce a basic_string which can violate well-formedness of Unicode strings when you use any mutation algorithm other than concatenation, or you will violate performance guarantees of basic_string.
Working on sequences of code points always requires care: clearly one could erase a low surrogate and leave a high surrogate "orphaned" behind, for example. One would need to make it clear in the documentation that potential problems like this can occur.
7) Anything I've forgotten :-)
I think you have forgotten to read and understand the complexity of Unicode (or any of the books that discuss the spec less tersely, such as Unicode Demystified), because I think that some of the suggestions you made here are incompatible with how Unicode actually works. Please correct me if I am wrong -- I would love to be wrong :-)
Well, sometimes I'm wrong, and sometimes I'm right ;-) Unicode is such a large and complex issue that it's actually pretty hard to keep even a small fraction of the issues in one's mind at a time, hence my suggestion to split the work up into a series of steps.
The main goal would be to define a good clean interface, the implementation could be:
We can't define a good clean interface until we understand the problems.
Obviously. John.