
Erik Wien wrote:
Ultimately I feel that the operation of normalization (which involves canonical decomposition) of Unicode strings should be hidden from the user completely and be performed automatically by the library where that is needed. (Like on a call to the == operator.)
It appears that there are two schools of thought when it comes to string design. One approach treats a string purely as a sequential container of values. The other tries to represent "string values" as a coherent whole. It doesn't help that in the simple case where the value_type is char the two approaches result in mostly identical semantics.
My opinion is that the std::char_traits<> experiment failed and conclusively demonstrated that the "string as a value" approach is a dead end, and that practical string libraries must treat a string as a sequential container, vector<char>, vector<char16_t> and vector<char32_t> in our case.
The interpretation of that sequence of integers as a concrete string value representation needs to be done by algorithms.
In other words, I believe that string::operator== should always perform the per-element comparison lhs.size() == rhs.size() && std::equal( lhs.begin(), lhs.end(), rhs.begin() ) that is specified in the Container requirements table.
If I want to test whether two sequences of char16_t values, interpreted as UTF-16 Unicode strings, would represent the same string in printed form, I should be given a dedicated function that does just that, or an equivalent. Similarly, if I want to normalize a sequence of chars that are actually UTF-8, I'd call the appropriate 'normalize' function/algorithm.
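To make the distinction concrete, here is a minimal sketch of what such a dedicated function could look like next to a plain operator==. The names toy_decompose and toy_canonically_equivalent are hypothetical, and the decomposition table covers only two precomposed characters; a real implementation would need the full Unicode data tables and canonical reordering of combining marks.

```cpp
#include <string>

// Toy canonical decomposition for exactly two precomposed characters.
// ASSUMPTION: illustration only. Real normalization uses the complete
// Unicode Character Database and reorders combining marks.
inline std::u16string toy_decompose(const std::u16string& in)
{
    std::u16string out;
    for (char16_t unit : in) {
        switch (unit) {
        case 0x00E8: // e with grave -> 'e' + U+0300 COMBINING GRAVE ACCENT
            out += u'e';
            out += char16_t(0x0300);
            break;
        case 0x00E9: // e with acute -> 'e' + U+0301 COMBINING ACUTE ACCENT
            out += u'e';
            out += char16_t(0x0301);
            break;
        default:     // everything else is copied through unchanged
            out += unit;
            break;
        }
    }
    return out;
}

// Hypothetical free function: equivalence is requested explicitly,
// while operator== stays a plain code-unit comparison.
inline bool toy_canonically_equivalent(const std::u16string& a,
                                       const std::u16string& b)
{
    return toy_decompose(a) == toy_decompose(b);
}
```

With this sketch, the precomposed and decomposed spellings of "café" compare unequal under operator== on std::u16string, but toy_canonically_equivalent reports them as equivalent.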
Right, and there are several different Normalised forms so we have to be able to choose the algorithm that does the right thing for what we want here. Can I make one other plea here: *please* let's not get too stuck on string class representations; we can have iterator sequences as well (these may well be part of a string, or they may be part of a memory mapped file, or some other smart iterator - like the Unicode encoding transformation iterators I've just been writing), and operations / algorithms on iterators are more important to me than YASC (Yet Another String Class) :-)
John.