
On 01/19/2011 12:56 PM, Robert Ramey wrote:
... elision by patrick ...
std::string - a sequence of bytes
utf8_string - a sequence of "code points" implemented in terms of std::string. With the ability to specify a conversion facet to convert from your local encoding to utf-8. The string would still validate the utf-8 received from the conversion facet.
What do you do about text that can validly be represented either by one precomposed character or by a base character followed by one or more combining characters? For example, Ü can be represented by U+00DC, a capital U with diaeresis, or by the two code points U+0055 U+0308, a capital U followed by a combining diaeresis.

Ü <- That one is built from two code points (a base letter plus a combining mark); the previous one is a single code point.

The spec says that these must be treated as equivalent (canonical equivalence, in Unicode's terms). Will our utf8_string class always choose one representation over the other? Certainly, to make choices like this you'd need the Unicode Character Database.

So, if you're iterating the utf8_string with an iterator iter, what type does *iter return? It could _consume_ a lot of bytes. Is it a char32_t with the character in it, or is it another utf8_string with only one character in it? I'd say char32_t, because that can hold anything in UCS.

So then what about *iter = thechar? What type or types can thechar be? char32_t, char16_t, wchar_t, char, unsigned char, int, int32_t, a utf8_string with only one "character" to be copied in, or any utf8_string, from which we'd just take the first char? I'd probably use char32_t in both of those cases.

Food for thought. I agree I'd like to see it be derived from std::string so you can pass it to things that expect a std::string and don't care so much about encoding.

Patrick
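
P.S. To make the Ü example concrete, here is a minimal sketch of what dereferencing such an iterator might have to do if it returned char32_t. The names decode_one and decode_all are hypothetical helpers made up for illustration; they are not part of any proposed utf8_string interface. The sketch assumes C++11 (for char32_t and std::u32string) and only decodes and validates; it does no normalization.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from any proposed API): decode the code point
// that starts at byte offset `pos` in `s` and advance `pos` past it.
// Throws std::runtime_error on malformed UTF-8.
char32_t decode_one(const std::string& s, std::size_t& pos) {
    if (pos >= s.size()) throw std::out_of_range("decode_one: past the end");
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t len;
    char32_t cp;
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else throw std::runtime_error("invalid lead byte");
    if (pos + len > s.size()) throw std::runtime_error("truncated sequence");
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80) throw std::runtime_error("invalid continuation byte");
        cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong encodings, surrogates, and values beyond U+10FFFF.
    static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_for_len[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::runtime_error("invalid code point");
    pos += len;
    return cp;
}

// Hypothetical helper: decode a whole UTF-8 string into code points.
std::u32string decode_all(const std::string& s) {
    std::u32string out;
    for (std::size_t pos = 0; pos < s.size();) out.push_back(decode_one(s, pos));
    return out;
}
```

With this, decode_all("\xC3\x9C") yields the single code point U+00DC, while decode_all("U\xCC\x88") yields two, U+0055 and U+0308. The underlying std::string values differ byte for byte, so a plain comparison reports them as unequal even though they are canonically equivalent. Treating them as equal would need a normalization step (NFC or NFD) on top of this, and that is where the Unicode data tables come in.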