
Erik Wien wrote:
Peter Dimov wrote:
It appears that there are two schools of thought when it comes to string design. One approach treats a string purely as a sequential container of values. The other tries to represent "string values" as a coherent whole. It doesn't help that in the simple case where the value_type is char the two approaches result in mostly identical semantics. My opinion is that the std::char_traits<> experiment failed and conclusively demonstrated that the "string as a value" approach is a dead end, and that practical string libraries must treat a string as a sequential container, vector<char>, vector<char16_t> and vector<char32_t> in our case.
The interpretation of that sequence of integers as a concrete string value representation needs to be done by algorithms.
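To make the "sequence plus algorithms" position concrete, here is a minimal sketch, assuming the string is stored as a plain sequence of UTF-16 code units. The function name count_code_points_utf16 is hypothetical and not part of any proposed library. The container knows nothing about text; only the algorithm interprets the integers as code points.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical free algorithm (name is illustrative only).
// Counts code points in a range of UTF-16 code units.
// Assumption: an unpaired surrogate is simply counted as one code point.
template <typename Iterator>
std::size_t count_code_points_utf16(Iterator first, Iterator last)
{
    std::size_t count = 0;
    while (first != last) {
        char16_t unit = *first;
        ++first;
        // A high surrogate followed by a low surrogate encodes one code point,
        // so consume the low surrogate as part of the same step.
        if (unit >= 0xD800 && unit <= 0xDBFF && first != last) {
            char16_t next = *first;
            if (next >= 0xDC00 && next <= 0xDFFF) {
                ++first;
            }
        }
        ++count;
    }
    return count;
}
```

With a std::vector<char16_t> named v, size() would report code units while count_code_points_utf16(v.begin(), v.end()) would report code points; the two differ whenever the text contains surrogate pairs.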
That is kinda what my current implementation does, but the container is not directly accessible to the user (nor do I think it should be). Instead, I wrap the vector of code points in a class and, rather than relying on external algorithms, provide different types of iterators to iterate through the vector at different "character levels".
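As a rough illustration of the wrapper idea (not the actual implementation, which has not been posted), here is a sketch assuming UTF-16 storage. The class name utf16_text and the members units_begin(), units_end(), begin() and end() are all hypothetical. The vector stays private, and the class exposes one iterator type per "character level".

```cpp
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical wrapper; every name here is illustrative only.
class utf16_text {
public:
    // Code unit level: plain iterators into the hidden storage.
    using unit_iterator = std::vector<char16_t>::const_iterator;

    // Code point level: decodes one code point per step.
    // Declared as an input iterator because operator* returns by value.
    class code_point_iterator {
    public:
        using iterator_category = std::input_iterator_tag;
        using value_type = char32_t;
        using difference_type = std::ptrdiff_t;
        using pointer = const char32_t*;
        using reference = char32_t;

        code_point_iterator(unit_iterator pos, unit_iterator end)
            : pos_(pos), end_(end) {}

        char32_t operator*() const {
            char32_t hi = *pos_;
            if (is_pair()) {
                char32_t lo = *(pos_ + 1);
                // Combine the surrogate pair into a supplementary code point.
                return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
            }
            return hi; // BMP code point, or an unpaired surrogate passed through
        }
        code_point_iterator& operator++() {
            pos_ += is_pair() ? 2 : 1;
            return *this;
        }
        code_point_iterator operator++(int) {
            code_point_iterator old = *this;
            ++*this;
            return old;
        }
        bool operator==(const code_point_iterator& other) const {
            return pos_ == other.pos_;
        }
        bool operator!=(const code_point_iterator& other) const {
            return pos_ != other.pos_;
        }

    private:
        // True when pos_ points at a high surrogate followed by a low one.
        bool is_pair() const {
            return *pos_ >= 0xD800 && *pos_ <= 0xDBFF && (end_ - pos_) > 1
                && *(pos_ + 1) >= 0xDC00 && *(pos_ + 1) <= 0xDFFF;
        }
        unit_iterator pos_;
        unit_iterator end_;
    };

    explicit utf16_text(std::vector<char16_t> units) : units_(std::move(units)) {}

    unit_iterator units_begin() const { return units_.begin(); }
    unit_iterator units_end() const { return units_.end(); }
    code_point_iterator begin() const { return {units_.begin(), units_.end()}; }
    code_point_iterator end() const { return {units_.end(), units_.end()}; }

private:
    std::vector<char16_t> units_; // not directly accessible to the user
};
```

Note that both levels are still exposed as iterator pairs, so standard algorithms such as std::distance or std::find work on either view.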
That's what external algorithms take: iterators. I don't understand what you mean by that.
You can therefore access the string on a code unit level, but the casual user would not necessarily know (or care) about that. Instead he would use the "string as a value" approach, using strings to represent a sentence, word, or some other language construct. When most people think of a string, they think of text, and not the underlying binary representation, and therefore that is, in my opinion, the notion a library should be designed around.
That may be so. But I don't see how the user can be isolated from the binary representation if he needs to pick one of utf8_string, utf16_string, ucs2_string, ucs4_string to store his strings. Perhaps I misunderstand your idea. Can you post a sketch of your spec? How many string classes do you have? What encoding do they use? What do begin(), end(), size() return? Are the iterators random access? Bidirectional? Constant? How can the user obtain the underlying element sequence to persist it somewhere or to pass it to an external library?
In my opinion, a good Unicode library should hide as much as possible of the complexity of the actual character representation from the user.
Hiding intrinsic complexity isn't necessarily a good idea. Sometimes users need to accomplish a specific task and the abstraction layer, in its attempts to "hide the complexity", just gets in the way. This should never happen.