
Erik Wien wrote:
Ultimately I feel that the operation of normalization (which involves canonical decomposition) of Unicode strings should be hidden from the user completely and be performed automatically by the library where that is needed. (Like on a call to the == operator.)
It appears that there are two schools of thought when it comes to string design. One approach treats a string purely as a sequential container of values; the other tries to represent a "string value" as a coherent whole. It doesn't help that in the simple case where the value_type is char, the two approaches result in mostly identical semantics.

My opinion is that the std::char_traits<> experiment failed and conclusively demonstrated that the "string as a value" approach is a dead end, and that practical string libraries must treat a string as a sequential container, vector<char>, vector<char16_t> and vector<char32_t> in our case. The interpretation of that sequence of integers as a concrete string value representation needs to be done by algorithms.

In other words, I believe that string::operator== should always perform the per-element comparison std::equal( lhs.begin(), lhs.end(), rhs.begin() ) that is specified in the Container requirements table. If I want to test whether two sequences of char16_t's, interpreted as UTF-16 Unicode strings, would represent the same string in printed form, I should be given a dedicated function that does just that - or an equivalent. Similarly, if I want to normalize a sequence of chars that are actually UTF-8, I'd call the appropriate 'normalize' function/algorithm.

But I may be wrong. :-)
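To make the distinction concrete, here is a minimal sketch of what I have in mind, using char16_t/std::u16string for brevity. The names code_unit_equal, normalize_nfd and canonical_equivalent are placeholders of my own invention, and the one-entry decomposition table merely stands in for a real Unicode implementation (ICU, or whatever backend the library ends up with):

    #include <algorithm>
    #include <iostream>
    #include <string>

    // Code-unit equality: exactly what the Container requirements give us.
    // Two sequences are equal iff they hold the same char16_t values.
    bool code_unit_equal(const std::u16string& lhs, const std::u16string& rhs)
    {
        return lhs.size() == rhs.size()
            && std::equal(lhs.begin(), lhs.end(), rhs.begin());
    }

    // Hypothetical 'normalize' algorithm (canonical decomposition).
    // A real implementation would consult the Unicode character database;
    // this toy table only knows how to decompose U+00E9.
    std::u16string normalize_nfd(const std::u16string& in)
    {
        std::u16string out;
        for (char16_t c : in) {
            if (c == u'\u00E9') {          // 'é' -> 'e' + combining acute accent
                out.push_back(u'e');
                out.push_back(u'\u0301');
            } else {
                out.push_back(c);
            }
        }
        return out;
    }

    // Hypothetical dedicated comparison: "would these two UTF-16 sequences
    // represent the same string in printed form?"  Expressed in terms of
    // normalization followed by plain per-element comparison.
    bool canonical_equivalent(const std::u16string& lhs, const std::u16string& rhs)
    {
        return normalize_nfd(lhs) == normalize_nfd(rhs);
    }

    int main()
    {
        std::u16string precomposed = u"caf\u00E9";   // 'é' as one code unit
        std::u16string decomposed  = u"cafe\u0301";  // 'e' followed by U+0301

        std::cout << std::boolalpha
                  << "code units equal:       "
                  << code_unit_equal(precomposed, decomposed) << '\n'      // false
                  << "canonically equivalent: "
                  << canonical_equivalent(precomposed, decomposed) << '\n'; // true
    }

The point is that both results are legitimate answers to different questions, so the caller should get to pick the algorithm explicitly rather than have operator== silently normalize behind their back.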