
Rogier van Dalen wrote:
<snip> I believe we are talking about different kinds of users. Let's get this clear: I was assuming that the Unicode library will be aimed at programmers doing everyday programming jobs whose programs will have to deal with non-English characters (because they're bound to be localised, or because non-English names will be inserted in a database, or whatever), i.e. people who have no idea about how Unicode works and don't want to, as long as it does work.
That was my initial thought. This Unicode library should, in my opinion, make handling Unicode strings correctly as easy as handling ASCII strings is today. But that does not mean we have to put mittens on everyone else to keep them away from the lower-level details. If you need to manipulate code points, I think you should be allowed to. Code units, on the other hand, I'm a little more wary about, since users could easily screw things up at that level (make a sequence ill-formed, for instance). Furthermore, I don't really see why anyone would need to muck about with code units.
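To make that ill-formedness danger concrete, here is a minimal self-contained sketch (my own illustration, not part of any proposed API) of how blind code unit manipulation breaks a UTF-16 string:

#include <string>

int main()
{
    // U+1D11E MUSICAL SYMBOL G CLEF is the surrogate pair 0xD834 0xDD1E
    // in UTF-16, i.e. one code point but two code units.
    std::basic_string<char16_t> s;
    s.push_back(0xD834);
    s.push_back(0xDD1E);

    // Truncating at an arbitrary code unit boundary splits the pair and
    // leaves s as ill-formed UTF-16: a lone high surrogate.
    s.resize(1);
}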
What I think would be a good interface:
// A string of code points, encoded UTF-16 (or templated).
class code_point_string {
public:
    //...
    const std::basic_string<char16_t> & code_units() const;
};
// A string of "grapheme clusters", with a code_point_string underlying. // The string is always in a normalisation form. template <class NormalisationPolicy = NormalisationFormC> class unicode_string { public: //... const code_point_string & code_points() const; };
Those who need to process code points can happily use code_point_string; others can use unicode_string.
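As an aside, a small self-contained example (plain char16_t strings, none of the proposed classes) of why the grapheme cluster layer has to pin down a normalisation form:

#include <cassert>
#include <string>

int main()
{
    // "é" spelled as one code point: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
    std::basic_string<char16_t> composed(1, char16_t(0x00E9));

    // "é" spelled as two code points: U+0065 followed by
    // U+0301 COMBINING ACUTE ACCENT.
    std::basic_string<char16_t> decomposed;
    decomposed.push_back(0x0065);
    decomposed.push_back(0x0301);

    // At the code point level these are different strings...
    assert(composed != decomposed);

    // ...but both render as the single grapheme cluster "é", which is why
    // unicode_string normalises: under NFC both collapse to U+00E9.
}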
This is starting to look more and more like the way to go, in my opinion. By layering interfaces with an increasing level of abstraction (from code points and up), we could more or less keep everyone happy.

What I really don't like about this solution is that we would end up with a myriad of different types that are all "unicode strings", just at different levels. I can easily imagine mayhem erupting when everyone gets their favourite Unicode abstraction and uses it exclusively in their APIs. Passing strings around would be a complete nightmare.

One solution could be to make code points the "base level" of abstraction and use normalisation policies (like you outlined) for the functions where normalisation form actually matters (find etc.). That way we could still get most of the functionality a grapheme_cluster_string would provide, but without the extra types. I'm just afraid that if we have a code_point_string in all encodings, plus the dynamic one, plus the same number of strings at the grapheme cluster level, there would simply be too many of them, and it would confuse users more than it would help them. Feel free to convince me otherwise, though.
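For what it's worth, here is a rough sketch of what I mean, with std::basic_string<char16_t> standing in for code_point_string and a hypothetical NormalisationFormC policy; a real policy would apply actual Unicode normalisation:

#include <cstddef>
#include <string>

// Hypothetical policy: a real implementation would perform canonical
// decomposition and recomposition; this stub only marks where it happens.
struct NormalisationFormC
{
    static std::basic_string<char16_t>
    normalise(const std::basic_string<char16_t> & s)
    {
        return s; // placeholder for real NFC normalisation
    }
};

// find() at the code point level, with normalisation as a policy instead
// of a separate grapheme-cluster string type.
template <class NormalisationPolicy = NormalisationFormC>
std::size_t find(const std::basic_string<char16_t> & haystack,
                 const std::basic_string<char16_t> & needle)
{
    // Normalising both operands under the same form means "é" matches
    // whether it was stored precomposed or as e + combining acute.
    return NormalisationPolicy::normalise(haystack)
               .find(NormalisationPolicy::normalise(needle));
}

The caller keeps a single string type; only the algorithms that are sensitive to normalisation grow a policy parameter.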