
Miro Jurisic <macdev@meeroh.org> writes:
In article <00c101c42167$8d0a7f40$1b440352@fuji>, "John Maddock" <john@johnmaddock.co.uk> wrote:
- The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.
Why not? Once the locale facets are provided, the std iostreams will "just work", that was the whole point of templating them in the first place.
I have already gone over this in other posts, but, in short, std::basic_string makes performance guarantees that are at odds with Unicode strings.
Only if you use an encoding other than UTF-32/UCS-4. This has to be a (POD) UDT rather than a typedef, so that one may specialize std::char_traits. Of course, if this gets standardized, then it can be a built-in, since the standard can specialize its own templates.
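To make the point concrete, here is a minimal sketch of what such a POD UDT plus char_traits specialization might look like --- utf32_t is just a placeholder name, not a proposal, and a real library would flesh out the details:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>
#include <ios>
#include <string>

// Placeholder POD code-unit type for UTF-32/UCS-4; one object = one code point.
struct utf32_t { std::uint32_t value; };

namespace std {
// Legal because utf32_t is a user-defined type: specialize char_traits
// so that basic_string<utf32_t> can be instantiated.
template<> struct char_traits<utf32_t> {
    typedef utf32_t        char_type;
    typedef std::uint32_t  int_type;
    typedef std::streamoff off_type;
    typedef std::streampos pos_type;
    typedef std::mbstate_t state_type;

    static void assign(char_type& c1, const char_type& c2) { c1 = c2; }
    static bool eq(char_type a, char_type b) { return a.value == b.value; }
    static bool lt(char_type a, char_type b) { return a.value < b.value; }
    static int compare(const char_type* s1, const char_type* s2, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(s1[i], s2[i])) return -1;
            if (lt(s2[i], s1[i])) return 1;
        }
        return 0;
    }
    static std::size_t length(const char_type* s) {
        std::size_t n = 0;
        while (s[n].value != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n, const char_type& a) {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], a)) return s + i;
        return 0;
    }
    static char_type* move(char_type* dst, const char_type* src, std::size_t n) {
        return static_cast<char_type*>(std::memmove(dst, src, n * sizeof(char_type)));
    }
    static char_type* copy(char_type* dst, const char_type* src, std::size_t n) {
        return static_cast<char_type*>(std::memcpy(dst, src, n * sizeof(char_type)));
    }
    static char_type* assign(char_type* s, std::size_t n, char_type a) {
        for (std::size_t i = 0; i < n; ++i) s[i] = a;
        return s;
    }
    static int_type to_int_type(char_type c) { return c.value; }
    static char_type to_char_type(int_type i) { char_type c = { i }; return c; }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
    static int_type eof() { return 0xFFFFFFFFu; }
    static int_type not_eof(int_type i) { return i == eof() ? 0 : i; }
};
}

typedef std::basic_string<utf32_t> ucs4_string;
```

Because each utf32_t is a whole code point, every basic_string operation (indexing, substring, mutation) stays well-formed, which is exactly what the variable-width encodings cannot promise.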
However I think we're getting ahead of ourselves here: I think a Unicode library should be handled in stages:
1) define the data types for 8/16/32 bit Unicode characters.
The fact that you believe this is a reasonable first step leads me to believe that you have not given much thought to the fact that even if you use a 32-bit Unicode encoding, a character can take up more than 32 bits (and likewise for 16-bit and 8-bit encodings). Unicode characters are not fixed-width data in any encoding.
Yes, but a code point is 32 bits, and code points can come in any sequence. A given sequence of code points may or may not have a valid semantic meaning as a "character", but that is like debating whether or not "fjkp" is a valid word --- beyond the scope of basic string handling facilities.
2) define iterator adapters to convert a sequence of one Unicode character type to another.
This is also not as easy as you seem to believe, because even within one encoding many strings can have multiple representations.
That is why there are various canonical forms defined. We should provide a means of converting to the canonical forms. However, this is independent of Unicode encoding --- the same sequence of code points can be represented in each Unicode encoding in precisely one way.
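To illustrate the "precisely one way" point: encoding a single code point is fully deterministic. A rough sketch (the function name is mine, and this is the code-point-to-bytes step only, not the iterator adapter interface itself):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Encode one Unicode code point as UTF-8. Each valid code point has
// exactly one UTF-8 representation, so this mapping has no choices in it.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: ASCII range
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF)  // surrogates are not code points
            throw std::range_error("surrogate values are not encodable");
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x110000) {            // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::range_error("code point out of range");
    }
    return out;
}
```

An iterator adapter over a UTF-32 sequence would just apply this per code point; note that it says nothing about canonical equivalence --- normalization operates on the code point sequence before any of this happens.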
3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, typedef the appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
This is not a good idea. If you do this, you will produce a basic_string which can violate well-formedness of Unicode strings when you use any mutation algorithm other than concatenation, or you will violate performance guarantees of basic_string.
Yes. basic_string<CharType> relies on each CharType being a valid entity in its own right --- for Unicode this means it must be a single Unicode code point, so using basic_string for UTF-8 is out.

You are right that Unicode does not play fair with most standard locale facilities, especially case conversions (1-1, 1-many, 1-0, context sensitivity (which could be seen as many-many), locale specifics).

Collation is one area where the standard library facilities should be OK, since the standard library collation support deals with whole strings. When you install the collation facet in your locale, you choose the Unicode collation options that are relevant to you.

Anthony
-- 
Anthony Williams
Senior Software Engineer, Beran Instruments Ltd.
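[A postscript on the collation point above: since the collate facet compares whole strings, client code never sees the character-level complications. A minimal sketch --- it uses the classic "C" locale so it is portable, where a real program would imbue a locale whose collate facet implements the Unicode collation algorithm with the chosen options:]

```cpp
#include <locale>
#include <string>

// Whole-string comparison via the standard collate facet.
// Swapping in a Unicode-aware locale changes the ordering,
// not this calling code.
bool collate_less(const std::string& a, const std::string& b,
                  const std::locale& loc = std::locale::classic()) {
    const std::collate<char>& coll =
        std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size()) < 0;
}
```

This is also what makes the locale itself usable as a comparison functor for basic_strings: std::locale::operator() forwards to the same facet.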