Re: [boost] Re: Re: Any interest in adding unicode support to boost?

20 Oct 2004


Erik Wien wrote:
...
Peter Dimov wrote:
...
It appears that there are two schools of thought when it comes to
string design. One approach treats a string purely as a sequential
container of values. The other tries to represent "string values" as
a coherent whole. It doesn't help that in the simple case where the
value_type is char the two approaches result in mostly identical
semantics. My opinion is that the std::char_traits<> experiment failed and
conclusively demonstrated that the "string as a value" approach is a
dead end, and that practical string libraries must treat a string as
a sequential container, vector<char>, vector<char16_t> and
vector<char32_t> in our case.
The interpretation of that sequence of integers as a concrete string
value representation needs to be done by algorithms.
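
To make the "sequential container plus algorithms" position concrete, here is a minimal sketch, assuming the container holds UTF-8 code units. The function name count_code_points is hypothetical and does not come from any proposal in this thread; it also assumes well-formed input and does no validation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical free algorithm: interprets a range of UTF-8 code units
// and counts the code points it encodes.
// Assumption: the input is well-formed UTF-8; nothing is validated here.
template <class InputIt>
std::size_t count_code_points(InputIt first, InputIt last) {
    std::size_t n = 0;
    for (; first != last; ++first) {
        // Continuation bytes look like 10xxxxxx; any other byte
        // starts a new code point.
        if ((static_cast<unsigned char>(*first) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

The container itself knows nothing about text: a std::vector<char> holding 'a', 0xC3, 0xA9, 'b' reports size() == 4, and only the algorithm decides that those four code units form three code points.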
That is kinda what my current implementation does, but the container
is not directly accessible by the user. (Nor do I think it should be)
Instead I wrap the vector of code points in a class and provide different
types of iterators to iterate through the vector at different "character
levels", instead of external algorithms.
That's what external algorithms take, iterators. I don't understand what you 
mean by that.
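
One way to read the two positions together is sketched below, assuming UTF-8 storage: a wrapper that keeps its vector private but hands out iterators at two levels, which ordinary iterator-based algorithms can then consume. The class utf8_text and all member names are hypothetical and are not taken from Erik's implementation; the decoder assumes well-formed input.

```cpp
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical wrapper: the code unit vector is private, but iterators
// are offered at the code unit level and at the code point level.
class utf8_text {
public:
    using unit_iterator = std::vector<char>::const_iterator;

    explicit utf8_text(std::vector<char> units) : units_(std::move(units)) {}

    // Read-only iterator that decodes one code point per step.
    // Assumption: the underlying sequence is well-formed UTF-8.
    class code_point_iterator {
    public:
        using iterator_category = std::input_iterator_tag;
        using value_type = char32_t;
        using difference_type = std::ptrdiff_t;
        using pointer = const char32_t*;
        using reference = char32_t;

        explicit code_point_iterator(unit_iterator pos) : pos_(pos) {}

        char32_t operator*() const {
            const unsigned char lead = static_cast<unsigned char>(*pos_);
            const std::size_t len = length(lead);
            // Strip the length marker bits from the lead byte.
            char32_t cp = (len == 1) ? lead : (lead & (0x7F >> len));
            // Each continuation byte contributes its low six bits.
            for (std::size_t i = 1; i < len; ++i) {
                cp = (cp << 6) | (static_cast<unsigned char>(pos_[i]) & 0x3F);
            }
            return cp;
        }

        code_point_iterator& operator++() {
            pos_ += length(static_cast<unsigned char>(*pos_));
            return *this;
        }

        code_point_iterator operator++(int) {
            code_point_iterator old = *this;
            ++*this;
            return old;
        }

        friend bool operator==(const code_point_iterator& a,
                               const code_point_iterator& b) {
            return a.pos_ == b.pos_;
        }
        friend bool operator!=(const code_point_iterator& a,
                               const code_point_iterator& b) {
            return !(a == b);
        }

    private:
        // Number of code units in the sequence introduced by this lead byte.
        static std::size_t length(unsigned char lead) {
            if (lead < 0x80) return 1;
            if ((lead & 0xE0) == 0xC0) return 2;
            if ((lead & 0xF0) == 0xE0) return 3;
            return 4;
        }

        unit_iterator pos_;
    };

    // Code unit level: the raw element sequence.
    unit_iterator units_begin() const { return units_.begin(); }
    unit_iterator units_end() const { return units_.end(); }

    // Code point level: decoded values.
    code_point_iterator code_points_begin() const {
        return code_point_iterator(units_.begin());
    }
    code_point_iterator code_points_end() const {
        return code_point_iterator(units_.end());
    }

private:
    std::vector<char> units_;
};
```

With this shape, std::distance(t.units_begin(), t.units_end()) and std::distance(t.code_points_begin(), t.code_points_end()) are both plain external algorithms; they differ only in which iterator pair they receive. The unit-level pair is also one possible way to get at the underlying sequence for persistence or for an external library.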
...
You can therefore access the string on a code unit level, but the casual
user would not necessarily know (or care) about that. Instead he would use
the "string as a value" approach, using strings to represent a sentence,
word, or some other language construct. When most people think of a string,
they think of text, and not the underlying binary representation, and
therefore that is, in my opinion, the notion a library should be designed
around.
That may be so. But I don't see how the user can be isolated from the binary 
representation if he needs to pick one of utf8_string, utf16_string, 
ucs2_string, ucs4_string to store his strings. Perhaps I misunderstand your 
idea. Can you post a sketch of your spec? How many string classes do you 
have? What encoding do they use? What do begin(), end(), size() return? Are 
the iterators random access? Bidirectional? Constant? How can the user 
obtain the underlying element sequence to persist it somewhere or to pass it 
to an external library?
...
In my opinion a good unicode library should hide as much as possible of
the complexity of the actual character representation from the user.
Hiding intrinsic complexity isn't necessarily a good idea. Sometimes users 
need to accomplish a specific task and the abstraction layer, in its 
attempts to "hide the complexity", just gets in the way. This should never 
happen.