
In article <00c101c42167$8d0a7f40$1b440352@fuji>, "John Maddock" <john@johnmaddock.co.uk> wrote:
>> - The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.
> Why not? Once the locale facets are provided, the std iostreams will "just work", that was the whole point of templating them in the first place.
I have already gone over this in other posts, but, in short, std::basic_string makes performance guarantees that are at odds with Unicode strings.
> However I think we're getting ahead of ourselves here: I think a Unicode library should be handled in stages:
> 1) define the data types for 8/16/32 bit Unicode characters.
The fact that you believe this is a reasonable first step leads me to believe that you have not given much thought to the fact that even if you use a 32-bit Unicode encoding, a character can take up more than 32 bits (and likewise for 16-bit and 8-bit encodings). Unicode characters are not fixed-width data in any encoding.
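To make the width point concrete, here is a minimal sketch. It assumes a compiler that provides char32_t and std::u32string (C++11 or later), and the helper names are hypothetical, not part of any proposed library. It only recognises one block of combining marks, so treat it as an illustration rather than real character segmentation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true for code points in the Combining Diacritical
// Marks block (U+0300..U+036F). Real segmentation has to handle far more
// than this single block.
bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Counts base code points, folding each following combining mark into the
// character it modifies. An approximation, for illustration only.
std::size_t approx_user_characters(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t c : s) {
        if (!is_combining_mark(c)) ++n;
    }
    return n;
}

// 'e' followed by U+0301 COMBINING ACUTE ACCENT: one character to the
// reader, two 32-bit code units in storage.
std::u32string e_acute_decomposed() {
    return std::u32string{U'e', char32_t(0x0301)};
}

void width_demo() {
    const std::u32string s = e_acute_decomposed();
    assert(s.size() == 2);                    // 64 bits of storage
    assert(approx_user_characters(s) == 1);   // one perceived character
}
```

So even with a 32-bit element type, size() counts code units, not characters.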
> 2) define iterator adapters to convert a sequence of one Unicode character type to another.
This is also not as easy as you seem to believe that it is, because even within one encoding many strings can have multiple representations.
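As a sketch of why a plain code-unit converter is not the whole story, the following encodes UTF-32 as UTF-8 one code point at a time. The function names are hypothetical and the encoder assumes valid input. It converts both spellings of "e with acute accent" faithfully, and the results still differ byte for byte, because conversion between encodings does not normalize.

```cpp
#include <cassert>
#include <string>

// Appends one code point as UTF-8. Assumes cp is a valid scalar value
// (not a surrogate, at most 0x10FFFF); a real converter must check this.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts a whole UTF-32 string, code point by code point.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) append_utf8(out, cp);
    return out;
}

void two_spellings_demo() {
    const std::u32string precomposed{char32_t(0x00E9)};       // U+00E9
    const std::u32string decomposed{U'e', char32_t(0x0301)};  // e + U+0301
    // Both are well-formed and canonically equivalent, yet unequal as data.
    assert(precomposed != decomposed);
    assert(to_utf8(precomposed) != to_utf8(decomposed));
}
```

Any adapter that is meant to make converted strings comparable has to decide where normalization happens. This sketch deliberately leaves that out.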
> 3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, typedef the appropriate string types:
> typedef basic_string<utf8_t> utf8_string; // etc
This is not a good idea. If you do this, you will produce a basic_string which can violate well-formedness of Unicode strings when you use any mutation algorithm other than concatenation, or you will violate performance guarantees of basic_string.
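A small sketch of the well-formedness problem, using plain std::string as a stand-in for the proposed utf8_string. The checker below is a hypothetical, simplified one: it only verifies that each lead byte is followed by the right number of continuation bytes, and does not reject overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified structural check of UTF-8: lead bytes and continuation counts
// only. Not a full validator.
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (b < 0x80) extra = 0;
        else if ((b & 0xE0) == 0xC0) extra = 1;
        else if ((b & 0xF0) == 0xE0) extra = 2;
        else if ((b & 0xF8) == 0xF0) extra = 3;
        else return false;  // stray continuation byte or invalid lead byte
        if (extra > s.size() - i - 1) return false;  // sequence cut short
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}

void erase_breaks_sequence() {
    std::string s = "caf\xC3\xA9";  // "cafe" with e-acute, as UTF-8 bytes
    assert(looks_like_utf8(s));
    s.erase(s.size() - 1);          // an ordinary per-element mutation
    assert(!looks_like_utf8(s));    // lead byte 0xC3 is now dangling
}
```

Making erase() refuse or widen such edits would keep the string well-formed, but only by inspecting neighbouring code units on every mutation, which is where the tension with basic_string's guarantees comes from.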
> 7) Anything I've forgotten :-)
I think you have forgotten to read the Unicode spec (or any of the books that discuss it less tersely, such as Unicode Demystified) and to understand the complexity of Unicode, because I think that some of the suggestions you made here are incompatible with how Unicode actually works. Please correct me if I am wrong -- I would love to be wrong :-)
> The main goal would be to define a good clean interface, the implementation could be:
We can't define a good clean interface until we understand the problems.

meeroh

-- 
If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>