
Vladimir Prus wrote:
The second question is whether operator==, operator< or 'find' should operate on vector<char_XX> or on abstract characters using Unicode rules, or whether there should be two versions. I don't really understand why 'unicode-unaware' semantics are ever needed, so we should have only the 'unicode-aware' one.
Look at 21.3/2: "The class template basic_string conforms to the requirements of a Sequence, as specified in (23.1.1). Additionally, because the iterators supported by basic_string are random access iterators (24.1.5), basic_string conforms to the requirements of a Reversible Container, as specified in (23.1)."

Now look at Table 65, Container requirements, operator==: "== is an equivalence relation. a.size()==b.size() && equal(a.begin(), a.end(), b.begin())"

The question is now, what do begin(), end() and size() return for our hypothetical string16? I maintain that the library design is much cleaner if begin() and end() are random access iterators over the underlying _storage_, and size() counts storage elements, rather than any of them working in terms of the codepoint representation or the abstract character representation.

Codepoint iterators and abstract character iterators would still be provided, but they would be constant bidirectional iterators with char32_t as the value_type. Codepoint and abstract character operations would be provided by algorithms taking an iterator range. It is the user, not the container itself, who should remember and honor the encoding (UTF-16, UCS-2, other) of a particular container of char16_t. This is straightforward STL-style container-iterator-algorithm orthogonalization.
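
To make the split concrete, here is a minimal sketch of what I have in mind. It is only an illustration under assumptions, not a proposed interface: string16 is modelled as a plain std::vector<char16_t>, the names storage_equal, next_codepoint and count_codepoints are made up for this example, the caller is assumed to know the data is UTF-16, and char16_t/char32_t are taken to be built-in types (C++11 or later). Unpaired surrogates are passed through unchanged to keep the code short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical storage-level string: a sequence of 16-bit code units.
// begin(), end() and size() all refer to the storage, not to code points.
using string16 = std::vector<char16_t>;

// Storage-level equality, written out the way Table 65 defines operator==.
inline bool storage_equal(const string16& a, const string16& b) {
    return a.size() == b.size() && std::equal(a.begin(), a.end(), b.begin());
}

// Decode one UTF-16 code point from [first, last) and advance first past it.
// Simplification: an unpaired surrogate is returned as-is.
template <class It>
char32_t next_codepoint(It& first, It last) {
    assert(first != last);
    char32_t hi = *first++;
    if (hi >= 0xD800 && hi <= 0xDBFF && first != last) {
        char32_t lo = *first;
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++first;  // consume the low surrogate as well
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    return hi;
}

// A codepoint-level algorithm: it takes an iterator range over code units,
// and the caller asserts (by calling it) that the range holds UTF-16.
template <class It>
std::size_t count_codepoints(It first, It last) {
    std::size_t n = 0;
    while (first != last) {
        next_codepoint(first, last);
        ++n;
    }
    return n;
}
```

For a string16 holding 'a', the surrogate pair 0xD83D 0xDE00 and 'b', size() is 4 while count_codepoints(s.begin(), s.end()) is 3. The container answers the storage question, and the algorithm answers the codepoint question.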