
Sorry about the late reply. I have been away for Easter and, to top it all off, been sick for a while. Anyway, I'm back...
I'm not too sure how you envision using normalisation policies for functions. The problem I see with it is that a normalisation form is not a property of a function; it is a property of a string, and I think it should be an invariant of that string.
Imagine a std::map<> where you use a Unicode string as a key; you want equivalent strings to map to the same object. operator< for two strings with the same normalisation form and the same encoding is trivial (and as fast as std::basic_string::operator< for UTF-8 or UTF-32). With two strings in unknown normalisation forms it will be dreadfully slow, because you'll need to look things up in the Unicode database all the time.
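To make that concrete, here is a rough sketch (not a proposed interface - the class and its names are made up) of a UTF-8 key type whose invariant lets operator< be a plain code-unit comparison:

#include <cassert>
#include <map>
#include <string>

struct nfc_utf8_string {
    std::string code_units;   // invariant: valid UTF-8, already in NFC

    // Trivial: with both sides in the same encoding and normalisation form,
    // ordinary code-unit comparison is enough -- no Unicode database needed.
    friend bool operator<(nfc_utf8_string const& a, nfc_utf8_string const& b)
    {
        return a.code_units < b.code_units;
    }
};

int main()
{
    std::map<nfc_utf8_string, int> m;
    m[nfc_utf8_string{"\xC3\xA9"}] = 1;   // U+00E9, precomposed e-acute (NFC)
    // Without the invariant, the decomposed form "e\xCC\x81" (NFD) would be
    // a different key for the same text -- exactly the lookup problem above.
    assert(m.count(nfc_utf8_string{"\xC3\xA9"}) == 1);
}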
Yep... You are of course right. I should start thinking before I talk. :) Having strings locked to a normalization form would be the most logical way to go. What I don't really see, though, is why you would have to have a separate class (different from the code point string class, that is) for this functionality. If we made the code point string classes (both the static and dynamic ones) take a normalization policy, and provided a policy that doesn't actually do anything in addition to ones that normalize to each of the normalization forms, everyone could have their way. If you don't care about normalization, use the do_nothing one. If you do care (or simply have no clue what normalization is - most users), use NFD or NFC or something.
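Roughly what I picture (all the names here are hypothetical - this is just to show the shape of the policy idea, not a finished design):

#include <string>
#include <utility>

struct do_nothing {                        // the "do nothing" policy
    static void apply(std::u32string&) {}  // leaves the code points untouched
};

struct nfc {
    static void apply(std::u32string&)
    {
        // placeholder: a real policy would canonically decompose and
        // recompose using the Unicode character database
    }
};

template <class NormalizationPolicy>
class code_point_string {
    std::u32string data_;
public:
    explicit code_point_string(std::u32string s) : data_(std::move(s))
    {
        NormalizationPolicy::apply(data_);   // invariant established once, here
    }
    std::u32string const& code_points() const { return data_; }
};

// Users who don't care (or don't know) about normalization:
typedef code_point_string<do_nothing> simple_string;
// Users who want canonically equivalent strings to end up identical:
typedef code_point_string<nfc> nfc_string;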
What I really don't like about this solution is that we would end up with a myriad of different types that are all "Unicode strings", but at different levels. I can easily imagine mayhem erupting when everyone picks their favorite Unicode abstraction and uses that one exclusively in their APIs. Passing strings around would be a complete nightmare.
I'm just afraid that if we have a code_point_string for every encoding, plus the dynamic one, in addition to the same number of strings at the grapheme cluster level, there would simply be too many of them, and it would confuse users more than it would help them.
As long as there is one boost::unicode_string, I speculate this shouldn't be much of a problem.
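Something like this, purely hypothetically (none of these names exist yet; the point is just that the one typedef picks the encoding and normalisation form so most users never have to):

namespace boost {

    // stand-ins for whatever the library actually ends up providing
    struct utf16_encoding {};
    struct nfc_normalization {};

    template <class Encoding, class NormalizationPolicy>
    class grapheme_cluster_string { /* ... */ };

    // the one name most users would ever have to spell
    typedef grapheme_cluster_string<utf16_encoding, nfc_normalization>
        unicode_string;

} // namespace boost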
I hope you are right, because if it turns out to be a problem, it will be a major one! What do the rest of you think? Would a large number of different classes lead to confusion, or would a unicode_string typedef hide this complexity?
Developers wanting to make a different choice than the one you have made will, I think, fall into one of two categories:
- those who know about Unicode and are not easily confused by encodings and normalisation forms;
- and those who worry about performance.
Yep, that sounds about right. Most users should not really care what kind of encoding and normalization form is used. They want to work with the string, not fiddle with its internal representation.
With a good rationale (based on measured performance in a number of test cases), you should be able to pick one that's good enough in most situations, I think. (Looking at the ICU website, I'd say this would involve UTF-16, but let's see what you come up with.)
I would be surprised if any encoding other than UTF-16 ended up as the most efficient one. UTF-8 suffers from a big variation in code unit count from one code point to the next, and UTF-32 is just a waste of space for little performance gain for most users. You never know, though.
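Just to put rough numbers on that variation (storage only, no claim about speed - the real answer has to come from the measurements):

#include <cstdio>

int main()
{
    // code units needed to store one code point in each encoding
    struct row { const char* code_point; int utf8; int utf16; int utf32; };
    const row table[] = {
        { "U+0041  (ASCII 'A')",       1, 1, 1 },
        { "U+00E9  (e with acute)",    2, 1, 1 },
        { "U+4E2D  (CJK ideograph)",   3, 1, 1 },
        { "U+10400 (supplementary)",   4, 2, 1 },
    };
    for (unsigned i = 0; i < sizeof(table) / sizeof(table[0]); ++i)
        std::printf("%-26s UTF-8: %d  UTF-16: %d  UTF-32: %d\n",
                    table[i].code_point, table[i].utf8,
                    table[i].utf16, table[i].utf32);
    return 0;
}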