
I've recently started on the first draft of a Unicode library. One assumption I think is wrong is that wchar_t is suitable for Unicode. Correct me if I'm wrong, but IIRC wchar_t is 16 bits wide on Microsoft compilers, for example. On those compilers the utf8_codecvt_facet implementation will cut off any code point above 0xFFFF (U+1D12C will come out as U+D12C). I think defining unicode::code as uint32_t would be much better. The problem is that codecvt is only implemented for wchar_t and char, so it is not possible to make a Unicode codecvt without manually adding (dummy) implementations of codecvt<unicode::code, char, mbstate_t> to the std namespace. I guess this is the reason that Ron Garcia just used wchar_t.

About Unicode strings: I suggest having a codepoint_string, with the string of code units as a template parameter. Its interface should work with 21-bit values (stored in 32 bits), while internally these are converted to UTF-8 or UTF-16, or remain UTF-32:

    template <class CodeUnitString>
    class codepoint_string {
        CodeUnitString code_units;  // e.g. std::string holding UTF-8
        // ...
    };

The real unicode::string would be the character string, whose interface works with characters, i.e. a base character together with its combining marks:

    template <class CodePointString>
    class string {
        CodePointString codepoints;
        // ...
    };

So unicode::string<unicode::codepoint_string<std::string> > would be a UTF-8-encoded string that is manipulated in terms of its characters. unicode::string should take care of correctly searching for a character string, rather than a code point string.

operator< has never done "the right thing" anyway: it does not handle the difference between uppercase and lowercase sensibly, for example. Probably, locales should be used for collation. The Unicode collation algorithm is pretty well specified.

Hope all this is clear...

Regards,
Rogier
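
P.S. To make the codepoint_string idea concrete, here is a minimal sketch of the UTF-8 case: what codepoint_string<std::string> might do when a 21-bit value is appended. The name append_utf8 and the code typedef are mine, for illustration only, not part of the draft:

    #include <cassert>
    #include <stdint.h>
    #include <string>

    typedef uint32_t code;  // stands in for unicode::code

    // Append one code point to a string of UTF-8 code units.  Note
    // that the four-unit branch handles exactly the values a 16-bit
    // wchar_t cannot represent in one unit.
    void append_utf8(std::string & units, code c)
    {
        assert(c <= 0x10FFFF);  // a real version should also reject
                                // the surrogate range D800-DFFF
        if (c < 0x80) {
            units += static_cast<char>(c);
        } else if (c < 0x800) {
            units += static_cast<char>(0xC0 | (c >> 6));
            units += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            units += static_cast<char>(0xE0 | (c >> 12));
            units += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            units += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            units += static_cast<char>(0xF0 | (c >> 18));
            units += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            units += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            units += static_cast<char>(0x80 | (c & 0x3F));
        }
    }

append_utf8(s, 0x1D12C) yields the four code units F0 9D 84 AC, whereas squeezing the value through a 16-bit wchar_t first truncates it to 0xD12C before the facet ever sees it.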
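In the same spirit, here is a rough sketch of the character-level stepping that unicode::string's searching would need. The names next_character and is_combining are hypothetical; a real is_combining would consult the Unicode character database (general categories Mn, Mc, Me), while the one below only knows the U+0300..U+036F block, purely for illustration:

    #include <stdint.h>
    #include <vector>

    typedef uint32_t code;

    // Grossly simplified stand-in for a character database lookup:
    // only the Combining Diacritical Marks block.
    bool is_combining(code c)
    {
        return c >= 0x0300 && c <= 0x036F;
    }

    // Step over one character: the base code point plus any
    // combining marks that follow it.
    std::vector<code>::const_iterator
    next_character(std::vector<code>::const_iterator pos,
                   std::vector<code>::const_iterator end)
    {
        ++pos;  // the base code point
        while (pos != end && is_combining(*pos))
            ++pos;  // its combining marks
        return pos;
    }

A match would then only count if it starts and ends on such a boundary: searching for "e" must not match the base code point of "e" followed by U+0301, for example.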