Re: [boost] Re: [Unicode strings] We're off

21 Mar 2005

      [Rearranging paragraphs from your post]

On Mon, 21 Mar 2005 01:50:04 +0100, Erik Wien <wien@start.no> wrote:
...
> One solution could be to make code points the "base level" of
> abstraction and use normalization policies (like you outlined) for
> functions where normalization form actually matters (find etc.). We
> could still get most of the functionality a grapheme_cluster_string
> would provide, but without the extra types.

I'm not too sure how you envision using normalisation policies for
functions. However, the problem I see with it is that a normalisation
form is not a property of a function; it is a property of a string. I
think it should be an invariant of that string.

Imagine a std::map<> where you use a Unicode string as a key; you want
equivalent strings to map to the same object. operator< for two
strings with the same normalisation form and the same encoding is
trivial (and as fast as std::basic_string::operator< for UTF-8 or
UTF-32). On two strings with unknown normalisation forms, it will be
dreadfully slow by comparison, because you'll need to look things up in
the Unicode database all the time.
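
To make the std::map case concrete, here is a minimal sketch, assuming
C++11. The names normalised_string, toy_compose, compare_unknown_form
and count_distinct_keys are hypothetical and are not part of any
proposed Boost interface. toy_compose is only a stand-in for real
normalisation: it handles a single pair, 'e' followed by U+0301, where
a real implementation would consult the Unicode database.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Toy stand-in for real normalisation (an assumption for illustration):
// composes only 'e' + U+0301 (combining acute accent) into U+00E9.
inline std::u32string toy_compose(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        if (c == char32_t(0x0301) && !out.empty() && out.back() == U'e')
            out.back() = char32_t(0x00E9);
        else
            out.push_back(c);
    }
    return out;
}

// Hypothetical string type: the (toy) composed form is an invariant,
// established once at construction.
class normalised_string {
public:
    explicit normalised_string(const std::u32string& raw)
        : data_(toy_compose(raw)) {}

    const std::u32string& code_points() const { return data_; }

    // Both operands are already normalised, so this is a plain
    // code point comparison with no database lookups.
    friend bool operator<(const normalised_string& a,
                          const normalised_string& b) {
        return a.data_ < b.data_;
    }

private:
    std::u32string data_;
};

// The alternative: keys of unknown form, normalised on every comparison.
struct compare_unknown_form {
    bool operator()(const std::u32string& a,
                    const std::u32string& b) const {
        return toy_compose(a) < toy_compose(b);
    }
};

// Inserts two canonically equivalent spellings and reports how many
// distinct keys the map ends up with.
inline std::size_t count_distinct_keys() {
    const std::u32string precomposed{char32_t(0x00E9)};
    const std::u32string decomposed{U'e', char32_t(0x0301)};

    std::map<normalised_string, int> m;
    m[normalised_string(precomposed)] = 1;
    m[normalised_string(decomposed)] = 2;  // same key as above
    return m.size();
}
```

With the invariant, each key is normalised once when it is built. With
compare_unknown_form, std::map would normalise both operands on every
comparison it makes during a lookup or insertion.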
...
> What I really don't like about this solution is that we would end up
> with a myriad of different types that are all "unicode strings", but
> at different levels. I can easily imagine mayhem erupting when
> everyone gets their favorite unicode abstraction and uses that one
> exclusively in their APIs. Passing strings around would be a complete
> nightmare.
...
> I'm just afraid that if we have a code_point_string in all encodings,
> plus the dynamic one, in addition to the same number of strings at the
> grapheme cluster level, there would simply be too many of them, and it
> would confuse the users more than it would help them.

As long as there is one boost::unicode_string, I speculate this
shouldn't be much of a problem. Developers who want to make a different
choice from the one you have made will, I think, fall into one of two
categories:
- those who know about Unicode and are not easily confused by
  encodings and normalisation forms;
- those who worry about performance.

With a good rationale (based on measured performance in a number of
test cases), you should be able to pick one that's good enough in most
situations, I think. (Looking at the ICU website, I'd say this would
involve UTF-16, but let's see what you come up with.)

Regards,
Rogier

Rogier van Dalen