
Peter Dimov wrote:
It appears that there are two schools of thought when it comes to string design. One approach treats a string purely as a sequential container of values. The other tries to represent "string values" as a coherent whole. It doesn't help that in the simple case where the value_type is char the two approaches result in mostly identical semantics.
My opinion is that the std::char_traits<> experiment failed and conclusively demonstrated that the "string as a value" approach is a dead end, and that practical string libraries must treat a string as a sequential container, vector<char>, vector<char16_t> and vector<char32_t> in our case.
The interpretation of that sequence of integers as a concrete string value representation needs to be done by algorithms.
That is roughly what my current implementation does, but the container is not directly accessible by the user (nor do I think it should be). Instead I wrap the vector of code points in a class and provide different types of iterators to iterate through the vector at different "character levels", instead of external algorithms. You can therefore access the string at the code unit level, but the casual user would not necessarily know (or care) about that. Instead he would use the "string as a value" approach, using strings to represent a sentence, a word, or some other language construct. When most people think of a string, they think of text, not the underlying binary representation, and that is therefore, in my opinion, the notion a library should be designed around.
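Roughly, the interface looks like this. This is only a minimal sketch of the idea, not the actual code: the names are illustrative and the grapheme clustering logic is omitted.

    #include <vector>

    // Sketch of a string class that owns a vector of code points but does
    // not expose it directly; access happens through iterators at a chosen
    // "character level".
    class ustring
    {
    public:
        // Lowest level: iterate over the raw code points in the vector.
        using code_point_iterator = std::vector<char32_t>::const_iterator;

        code_point_iterator code_points_begin() const { return data_.begin(); }
        code_point_iterator code_points_end()   const { return data_.end(); }

        // Higher level: iterate over user-perceived characters (grapheme
        // clusters), so that 'o' + combining diaeresis appears as one unit.
        class grapheme_iterator; // declaration only; clustering logic omitted

        grapheme_iterator graphemes_begin() const;
        grapheme_iterator graphemes_end()   const;

    private:
        std::vector<char32_t> data_; // not directly accessible by the user
    };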
In other words, I believe that string::operator== should always perform the per-element comparison std::equal( lhs.begin(), lhs.end(), rhs.begin() ) that is specified in the Container requirements table.
If I want to test whether two sequences of char16_t's, interpreted as UTF-16 Unicode strings, would represent the same string in printed form, I should be given a dedicated function that does just that - or an equivalent. Similarly, if I want to normalize a sequence of chars that are actually UTF-8, I'd call the appropriate 'normalize' function/algorithm.
Though I see where you are coming from, I don't agree with you on that. In my opinion a good Unicode library should hide as much as possible of the complexity of the actual character representation from the user. If we were to require the user to know that a direct binary comparison of strings is not the same as an actual textual comparison, we lose some of the simplicity of the library. Most users of such a library would not know that the character ö can be represented both as 'o' followed by a combining diaeresis (U+0308) and as the single code point ö (U+00F6), and that as a consequence, calling == on two strings could result in the behaviour "ö" != "ö". By removing the need for such knowledge on the user's part, we reduce the learning curve considerably, which is one of the main reasons for abstracting this functionality in the first place.
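To make the point concrete, here is a small, deliberately toy example. toy_nfd and canonical_equivalent are hypothetical names, and the decomposition table is reduced to the single ö case; a real implementation would consult the Unicode character database:

    #include <cassert>
    #include <string>

    // Toy decomposition for demonstration only: expands the single code
    // point U+00F6 ('ö') into 'o' followed by U+0308 (combining diaeresis).
    std::u32string toy_nfd(std::u32string const& s)
    {
        std::u32string out;
        for (char32_t c : s)
        {
            if (c == U'\u00F6') { out += U'o'; out += U'\u0308'; }
            else out += c;
        }
        return out;
    }

    // Hypothetical "dedicated function": normalize both sides, then fall
    // back to the plain per-element comparison.
    bool canonical_equivalent(std::u32string const& a, std::u32string const& b)
    {
        return toy_nfd(a) == toy_nfd(b);
    }

    int main()
    {
        std::u32string precomposed = U"\u00F6";   // "ö" as one code point
        std::u32string decomposed  = U"o\u0308";  // "ö" as o + diaeresis

        assert(precomposed != decomposed);                     // raw == says "different"
        assert(canonical_equivalent(precomposed, decomposed)); // same printed text
    }

Both strings print identically, yet the per-element == declares them different. That is exactly the kind of surprise I don't want to expose to the casual user.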