
Sorry about the late reply. I have been away for Easter and, to top it all off, been sick for a while. Anyway, I'm back...
I'm not too sure how you envision using normalisation policies for functions. The problem I see with it is that a normalisation form is not a property of a function; it is a property of a string, and I think it should be an invariant of that string.
Imagine a std::map<> where you use a Unicode string as a key; you want equivalent strings to map to the same object. operator< for two strings with the same normalisation form and the same encoding is trivial (and as fast as std::basic_string::operator< for UTF-8 or UTF-32). With two strings in unknown normalisation forms it will be dreadfully slow, because you'll need to look things up in the Unicode database all the time.
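To make that concrete, here is a rough sketch (not a proposed interface - the class and its names are made up) of a UTF-8 key type whose invariant lets operator< be a plain code-unit comparison:

#include <cassert>
#include <map>
#include <string>

struct nfc_utf8_string {
    std::string code_units;   // invariant: valid UTF-8, already in NFC

    // Trivial: with both sides in the same encoding and normalisation form,
    // ordinary code-unit comparison is enough -- no Unicode database needed.
    friend bool operator<(nfc_utf8_string const& a, nfc_utf8_string const& b)
    {
        return a.code_units < b.code_units;
    }
};

int main()
{
    std::map<nfc_utf8_string, int> m;
    m[nfc_utf8_string{"\xC3\xA9"}] = 1;   // U+00E9, precomposed e-acute (NFC)
    // Without the invariant, the decomposed form "e\xCC\x81" (NFD) would be
    // a different key for the same text -- exactly the lookup problem above.
    assert(m.count(nfc_utf8_string{"\xC3\xA9"}) == 1);
}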
Yep... You are of course right. I should start thinking before I talk. :) Having strings locked to a normalization form would be the most logical way to go. What I don't really see, though, is why you would have to have a separate class (different from the code point string class, that is) for this functionality. If we made the code point string classes (both the static and dynamic ones) take a normalization policy, and provided a policy that doesn't actually do anything in addition to ones that normalize to each of the normalization forms, everyone could have their way. If you don't care about normalization, use the do_nothing one. If you do care (or simply have no clue what normalization is - most users), use NFD or NFC or something.
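Roughly what I picture (all the names here are hypothetical - this is just to show the shape of the policy idea, not a finished design):

#include <string>
#include <utility>

struct do_nothing {                        // the "do nothing" policy
    static void apply(std::u32string&) {}  // leaves the code points untouched
};

struct nfc {
    static void apply(std::u32string&)
    {
        // placeholder: a real policy would canonically decompose and
        // recompose using the Unicode character database
    }
};

template <class NormalizationPolicy>
class code_point_string {
    std::u32string data_;
public:
    explicit code_point_string(std::u32string s) : data_(std::move(s))
    {
        NormalizationPolicy::apply(data_);   // invariant established once, here
    }
    std::u32string const& code_points() const { return data_; }
};

// Users who don't care (or don't know) about normalization:
typedef code_point_string<do_nothing> simple_string;
// Users who want canonically equivalent strings to end up identical:
typedef code_point_string<nfc> nfc_string;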
What I really don't like about this solution is that we would end up with a myriad of different types that are all "Unicode strings", but at different levels. I can easily imagine mayhem erupting when everyone picks their favorite Unicode abstraction and uses that one exclusively in their APIs. Passing strings around would be a complete nightmare.
I'm just afraid that if we have a code_point_string for every encoding, plus the dynamic one, in addition to the same number of strings at the grapheme cluster level, there would simply be too many of them, and it would confuse users more than it would help them.
As long as there is one boost::unicode_string, I speculate this shouldn't be much of a problem.
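Something like this, purely hypothetically (none of these names exist yet; the point is just that the one typedef picks the encoding and normalisation form so most users never have to):

namespace boost {

    // stand-ins for whatever the library actually ends up providing
    struct utf16_encoding {};
    struct nfc_normalization {};

    template <class Encoding, class NormalizationPolicy>
    class grapheme_cluster_string { /* ... */ };

    // the one name most users would ever have to spell
    typedef grapheme_cluster_string<utf16_encoding, nfc_normalization>
        unicode_string;

} // namespace boost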
I hope you are right, because if it turns out to be a problem, it will be a major one! What do the rest of you think? Would a large number of different classes lead to confusion, or would a unicode_string typedef hide this complexity?
Developers wanting to make a different choice than the one you have made will, I think, fall into one of two categories:
- those who know about Unicode and are not easily confused by encodings and normalisation forms;
- and those who worry about performance.
Yep, that sounds about right. Most users should not really care what kind of encoding and normalization form is used. They want to work with the string, not fiddle with its internal representation.
With a good rationale (based on measured performance in a number of test cases), you should be able to pick one that's good enough in most situations, I think. (Looking at the ICU website, I'd say this would involve UTF-16, but let's see what you come up with.)
I would be surprised if any encoding other than UTF-16 ended up as the most efficient one. UTF-8 suffers from a big variation in code unit count from one code point to the next, and UTF-32 is just a waste of space for little performance gain for most users. You never know, though.
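Just to put rough numbers on that variation (storage only, no claim about speed - the real answer has to come from the measurements):

#include <cstdio>

int main()
{
    // code units needed to store one code point in each encoding
    struct row { const char* code_point; int utf8; int utf16; int utf32; };
    const row table[] = {
        { "U+0041  (ASCII 'A')",       1, 1, 1 },
        { "U+00E9  (e with acute)",    2, 1, 1 },
        { "U+4E2D  (CJK ideograph)",   3, 1, 1 },
        { "U+10400 (supplementary)",   4, 2, 1 },
    };
    for (unsigned i = 0; i < sizeof(table) / sizeof(table[0]); ++i)
        std::printf("%-26s UTF-8: %d  UTF-16: %d  UTF-32: %d\n",
                    table[i].code_point, table[i].utf8,
                    table[i].utf16, table[i].utf32);
    return 0;
}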