
The central difference between ANSI strings and UTF-8 strings is that character access by index is simple for ANSI strings but difficult for UTF-8 encoded strings. std::basic_string can hold utf8, utf16 and utf32 encoded strings, but it offers no access to the decoded string, i.e. to the Unicode values of the characters.
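To make the indexing problem concrete, here is a minimal sketch (assuming C++11 for char32_t). The function names utf8_sequence_length and code_point_at are made up for this illustration and are not part of the proposed library. The point is that s[n] on a std::string yields the n-th byte, while reaching the n-th character requires walking the string from the start:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with lead byte `b`.
// Returns 0 for a continuation byte or an invalid lead byte.
// (Hypothetical helper, for illustration only.)
inline std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

// Decode the code point at *character* index n (not byte index).
// This is an O(n) walk, which is why operator[] cannot be cheap for UTF-8.
// Only minimal validation is done: overlong forms and bad continuation
// bytes are not rejected.
inline char32_t code_point_at(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    for (;;) {
        if (i >= s.size())
            throw std::out_of_range("character index out of range");
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        const std::size_t len = utf8_sequence_length(lead);
        if (len == 0 || i + len > s.size())
            throw std::runtime_error("malformed UTF-8");
        if (n == 0) {
            if (len == 1) return lead;
            // Keep only the payload bits of the lead byte.
            char32_t cp = lead & (0xFF >> (len + 1));
            for (std::size_t k = 1; k < len; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
            return cp;
        }
        i += len;  // skip the whole sequence, not a single byte
        --n;
    }
}
```

With the six-byte string "a\xC3\xA9\xE2\x82\xAC" (the characters a, U+00E9 and U+20AC), s[1] is the byte 0xC3, whereas code_point_at(s, 1) is U+00E9.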
However, it isn't a basic_string, which means it is isolated from the rest of the standard library. In a perfect world I would expect to read/write utf_strings from std::streams in the same way as is provided for std::string, i.e. all the operations like operator>>, getline and so on should be usable on utf_strings. It is always possible to access the basic_string<> data by calling raw()! The standard requires character access, which can't be implemented efficiently for utf8 and utf16 encoded strings.
So in this area I basically agree with Matt Austern's proposal for C++0x ( http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2035.pdf ). I see my approach as an addition to Matt Austern's proposal. While Matt is handling encoded strings, my approach deals with decoded strings. The encoded string types are std::string, std::ustring, and std::u32string. These strings allow access to the raw values of the encoded words. My wrapper allows access to the strings at a symbolic level. It allows conversion between the different encodings and also gives access to the Unicode values of the characters as char32_t values.
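The split between encoded storage and decoded access could look roughly like the sketch below. Only raw() is taken from the discussion above. The class name utf8_string_sketch, the member decode() and the free function latin1_to_utf8 are assumptions made for this example and do not reflect the actual interface. The sketch assumes C++11 and well-formed input:

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical wrapper: stores UTF-8 in a std::string (the encoded level)
// and exposes the decoded code points on request (the symbolic level).
class utf8_string_sketch {
public:
    explicit utf8_string_sketch(std::string encoded)
        : data_(std::move(encoded)) {}

    // Encoded level: the underlying basic_string, usable with streams etc.
    const std::string& raw() const { return data_; }

    // Symbolic level: the Unicode values as char32_t.
    // Assumes well-formed UTF-8; a truncated tail is silently dropped.
    std::u32string decode() const {
        std::u32string out;
        std::size_t i = 0;
        while (i < data_.size()) {
            const unsigned char lead = static_cast<unsigned char>(data_[i]);
            const std::size_t len = lead < 0x80           ? 1
                                  : (lead & 0xE0) == 0xC0 ? 2
                                  : (lead & 0xF0) == 0xE0 ? 3
                                                          : 4;
            if (i + len > data_.size()) break;
            char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
            for (std::size_t k = 1; k < len; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(data_[i + k]) & 0x3F);
            out.push_back(cp);
            i += len;
        }
        return out;
    }

private:
    std::string data_;
};

// Conversion between encodings: every Latin-1 byte value equals its
// Unicode code point, so it needs at most two bytes in UTF-8.
inline std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    for (char c : latin1) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

For example, utf8_string_sketch s(latin1_to_utf8("caf\xE9")) holds five bytes in s.raw(), while s.decode() yields four char32_t values, the last being U+00E9.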
8bit word array -> std::string
16bit word array -> std::ustring
32bit word array -> std::u32string
utf8 encoded strings -> utf8_string (based on std::string)
utf16 encoded strings -> utf16_string (based on std::ustring)
utf32 encoded strings -> utf32_string (based on std::u32string)

but the approach also allows:

latin-1 encoded strings -> latin1_string (based on std::string)
windows-1252 encoded strings -> windows_1252_string (based on std::string)

Regards,
Nils