
Peter Dimov wrote:
It appears that there are two schools of thought when it comes to string design. One approach treats a string purely as a sequential container of values. The other tries to represent "string values" as a coherent whole. It doesn't help that in the simple case where the value_type is char the two approaches result in mostly identical semantics.
My opinion is that the std::char_traits<> experiment failed and conclusively demonstrated that the "string as a value" approach is a dead end, and that practical string libraries must treat a string as a sequential container, vector<char>, vector<char16_t> and vector<char32_t> in our case.
The interpretation of that sequence of integers as a concrete string value representation needs to be done by algorithms.
That is roughly what my current implementation does, but the container is not directly accessible by the user (nor do I think it should be). Instead I wrap the vector of code points in a class and provide different types of iterators to iterate through the vector at different "character levels", instead of external algorithms. You can therefore access the string at the code unit level, but the casual user would not necessarily know (or care) about that. Instead he would use the "string as a value" approach, using strings to represent a sentence, a word, or some other language construct. When most people think of a string, they think of text, not the underlying binary representation, and that is therefore, in my opinion, the notion a library should be designed around.
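Roughly, the interface looks like this. This is only a minimal sketch of the idea, not the actual code: the names are illustrative and the grapheme clustering logic is omitted.

    #include <vector>

    // Sketch of a string class that owns a vector of code points but does
    // not expose it directly; access happens through iterators at a chosen
    // "character level".
    class ustring
    {
    public:
        // Lowest level: iterate over the raw code points in the vector.
        using code_point_iterator = std::vector<char32_t>::const_iterator;

        code_point_iterator code_points_begin() const { return data_.begin(); }
        code_point_iterator code_points_end()   const { return data_.end(); }

        // Higher level: iterate over user-perceived characters (grapheme
        // clusters), so that 'o' + combining diaeresis appears as one unit.
        class grapheme_iterator; // declaration only; clustering logic omitted

        grapheme_iterator graphemes_begin() const;
        grapheme_iterator graphemes_end()   const;

    private:
        std::vector<char32_t> data_; // not directly accessible by the user
    };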
In other words, I believe that string::operator== should always perform the per-element comparison std::equal( lhs.begin(), lhs.end(), rhs.begin() ) that is specified in the Container requirements table.
If I want to test whether two sequences of char16_t's, interpreted as UTF-16 Unicode strings, would represent the same string in printed form, I should be given a dedicated function that does just that - or an equivalent. Similarly, if I want to normalize a sequence of chars that are actually UTF-8, I'd call the appropriate 'normalize' function/algorithm.
Though I see where you are coming from, I don't agree with you on that. In my opinion a good Unicode library should hide as much as possible of the complexity of the actual character representation from the user. If we were to require the user to know that a direct binary comparison of strings is not the same as an actual textual comparison, we lose some of the simplicity of the library. Most users of such a library would not know that the character ö can be represented both as 'o' followed by a combining diaeresis (U+0308) and as the single code point ö (U+00F6), and that as a consequence, calling == on two strings could result in the behaviour "ö" != "ö". By removing the need for such knowledge on the user's part, we reduce the learning curve considerably, which is one of the main reasons for abstracting this functionality in the first place.
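To make the point concrete, here is a small, deliberately toy example. toy_nfd and canonical_equivalent are hypothetical names, and the decomposition table is reduced to the single ö case; a real implementation would consult the Unicode character database:

    #include <cassert>
    #include <string>

    // Toy decomposition for demonstration only: expands the single code
    // point U+00F6 ('ö') into 'o' followed by U+0308 (combining diaeresis).
    std::u32string toy_nfd(std::u32string const& s)
    {
        std::u32string out;
        for (char32_t c : s)
        {
            if (c == U'\u00F6') { out += U'o'; out += U'\u0308'; }
            else out += c;
        }
        return out;
    }

    // Hypothetical "dedicated function": normalize both sides, then fall
    // back to the plain per-element comparison.
    bool canonical_equivalent(std::u32string const& a, std::u32string const& b)
    {
        return toy_nfd(a) == toy_nfd(b);
    }

    int main()
    {
        std::u32string precomposed = U"\u00F6";   // "ö" as one code point
        std::u32string decomposed  = U"o\u0308";  // "ö" as o + diaeresis

        assert(precomposed != decomposed);                     // raw == says "different"
        assert(canonical_equivalent(precomposed, decomposed)); // same printed text
    }

Both strings print identically, yet the per-element == declares them different. That is exactly the kind of surprise I don't want to expose to the casual user.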