
[Rearranging paragraphs from your post]

On Mon, 21 Mar 2005 01:50:04 +0100, Erik Wien <wien@start.no> wrote:
One solution could be to make code points the "base level" of abstraction and use normalization policies (like you outlined) for functions where the normalization form actually matters (find etc.). We could still get most of the functionality a grapheme_cluster_string would provide, but without the extra types.
I'm not too sure how you envision using normalisation policies for functions. However, the problem I see with it is that a normalisation form is not a property of a function. It is a property of a string, and I think it should be an invariant of that string. Imagine a std::map<> where you use a Unicode string as a key; you want equivalent strings to map to the same object. operator< for two strings with the same normalisation form and the same encoding is trivial (and as fast as std::basic_string::operator< for UTF-8 or UTF-32). On two strings with unknown normalisation forms, it will be dramatically slower because you'll need to look things up in the Unicode database all the time.
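To make the std::map<> point concrete, here is a minimal sketch of a string type that normalises once, on construction, so that operator< can be a plain code point comparison. The names nfc_string, toy_nfc and count_keys_for_equivalent_spellings are hypothetical, and toy_nfc only handles a single composition (e + U+0301); it is an assumption standing in for a real NFC implementation backed by the Unicode database.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Toy stand-in for NFC, for illustration only: it composes 'e' followed by
// U+0301 (combining acute accent) into U+00E9. Real normalisation needs the
// Unicode database.
std::u32string toy_nfc(const std::u32string& in) {
    const char32_t combining_acute = 0x0301;
    const char32_t e_acute = 0x00E9;
    std::u32string out;
    for (char32_t c : in) {
        if (c == combining_acute && !out.empty() && out.back() == U'e')
            out.back() = e_acute;
        else
            out.push_back(c);
    }
    return out;
}

// Hypothetical string type whose invariant is "always in (toy) NFC".
class nfc_string {
public:
    explicit nfc_string(const std::u32string& raw) : data_(toy_nfc(raw)) {}

    // Both operands are known to be normalised, so a plain code point
    // comparison suffices; no database lookups are needed here.
    friend bool operator<(const nfc_string& a, const nfc_string& b) {
        return a.data_ < b.data_;
    }

private:
    std::u32string data_;
};

// Inserts two canonically equivalent spellings of e-acute and returns the
// number of distinct keys the map ends up with.
std::size_t count_keys_for_equivalent_spellings() {
    std::map<nfc_string, int> m;
    std::u32string precomposed(1, char32_t(0x00E9));
    std::u32string decomposed = {U'e', char32_t(0x0301)};
    m[nfc_string(precomposed)] = 1;
    m[nfc_string(decomposed)] = 2;  // same key, so this overwrites the value
    assert(m.begin()->second == 2);
    return m.size();
}
```

The cost of normalisation is paid once per string here, rather than on every comparison the map performs during lookup or insertion.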
What I really don't like about this solution is that we would end up with a myriad of different types that are all "unicode strings", but at different levels. I can easily imagine mayhem erupting when everyone gets their favorite unicode abstraction and uses that one exclusively in their APIs. Passing strings around would be a complete nightmare.
I'm just afraid that if we have a code_point_string in all encodings, plus the dynamic one, in addition to the same number of strings at the grapheme cluster level, there would simply be too many of them, and it would confuse the users more than it would help them.
As long as there is one boost::unicode_string, I speculate this shouldn't be much of a problem. Developers who want to make a different choice from the one you have made will, I think, fall into one of two categories:
- those who know about Unicode and are not easily confused by encodings and normalisation forms;
- and those who worry about performance.

With a good rationale (based on measured performance in a number of test cases), you should be able to pick one that's good enough in most situations, I think. (Looking at the ICU website, I'd say this would involve UTF-16, but let's see what you come up with.)

Regards,
Rogier