[boost] Re: [Unicode strings] We're off

19 Mar 2005

      Rogier van Dalen wrote:
...
...
Be careful with making a global assertion.  Different users of a Unicode
library will need to access the data at different levels.  Some will
need the raw encoding bytes or words, some will need code points, and
some will need 'grapheme clusters'.
The library should support working at the level that each particular
user needs, and different parts of an application or library may need to
work at multiple levels.
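
To make the three levels concrete, here is a rough sketch in Python (chosen only because its standard library ships the Unicode character database; nothing here is a proposed Boost interface). The function name `levels` is made up for illustration, and the grapheme cluster step is a deliberate simplification: it attaches combining marks to the preceding base character and does not implement the full segmentation rules of UAX #29.

```python
import unicodedata

def levels(text):
    """Return one string viewed at three levels (hypothetical helper)."""
    # Level 1: raw encoding units, here UTF-8 bytes and UTF-16 code units.
    utf8_bytes = list(text.encode("utf-8"))
    utf16 = text.encode("utf-16-le")
    utf16_units = [int.from_bytes(utf16[i:i + 2], "little")
                   for i in range(0, len(utf16), 2)]

    # Level 2: code points.
    code_points = [ord(ch) for ch in text]

    # Level 3: approximate grapheme clusters. A character with a non-zero
    # canonical combining class is glued to the cluster before it.
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)

    return utf8_bytes, utf16_units, code_points, clusters

# "j" + COMBINING CIRCUMFLEX ACCENT + "a": 4 bytes, 3 code points, 2 clusters.
utf8_bytes, utf16_units, code_points, clusters = levels("j\u0302a")
```

The same string has a different length at each level, which is why the choice of level matters for indexing and iteration.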
A decision must be made. Certainly you should have access to code
points; and you should be able to work at multiple levels. However,
one level has to be the default level. Most programmers should be able
to get what they want by using boost::unicode_string (or whatever it's
going to be called). We need to make a "global assertion" that's
correct 99% of the time.
I don't see why there has to be a "default" interface at all.  There
should just be multiple interfaces, one for each level that a programmer 
may have need to work at.
...
I think we need an interface that will work for programmers that have
no idea what the difference between a code point and a grapheme cluster
is, and don't want to be bothered by the difference between
    U+0135 LATIN SMALL LETTER J WITH CIRCUMFLEX
and
    U+006A LATIN SMALL LETTER J
    U+0302 COMBINING CIRCUMFLEX ACCENT
That's fine for *certain* uses.  Other programs may have a need to 
distinguish between the two, and need the ability to convert a Unicode
string between the form where all combining characters are combined and
the form where they are all separate explicit code points.  A way of telling
the library that you don't care about the difference is to ensure that 
every string you use is canonicalized into the form that makes your job 
easier.
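
As an illustration of that conversion, here is a small sketch using Python's standard `unicodedata` module (not any proposed Boost API). The helper names `to_composed` and `to_decomposed` are invented for this example; NFC and NFD are the Unicode normalization forms for the composed and decomposed canonical representations.

```python
import unicodedata

composed = "\u0135"     # LATIN SMALL LETTER J WITH CIRCUMFLEX
decomposed = "j\u0302"  # LATIN SMALL LETTER J + COMBINING CIRCUMFLEX ACCENT

def to_composed(s):
    """Canonicalize to the form with combining characters combined (NFC)."""
    return unicodedata.normalize("NFC", s)

def to_decomposed(s):
    """Canonicalize to the form with separate explicit code points (NFD)."""
    return unicodedata.normalize("NFD", s)

# The raw strings differ code point by code point, but each converts
# to the other.
raw_strings_differ = composed != decomposed
round_trip_ok = (to_composed(decomposed) == composed
                 and to_decomposed(composed) == decomposed)
```

A program that canonicalizes every string on input to one of the two forms never sees the difference afterwards.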

Alternatively, the interface could provide the ability to set state bits 
in the string that indicate whether you want to see the differences or not.
...
String handling includes searching and comparing, for which the two
sequences above should be equivalent. As a programmer, I don't want to be bothered
with different sequences that are canonically equivalent. I want it to
just work. The library should handle the cases I didn't think about.
That's fine *when* you are working at that high a level of abstraction.
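
A sketch of what comparing and searching at that level could look like, again in Python with the standard `unicodedata` module. The names `canonical_equal` and `canonical_contains` are hypothetical, and the approach (normalize both operands, then use ordinary string operations) is only one possible way to do it.

```python
import unicodedata

def _nfd(s):
    return unicodedata.normalize("NFD", s)

def canonical_equal(a, b):
    """True if a and b are canonically equivalent."""
    return _nfd(a) == _nfd(b)

def canonical_contains(haystack, needle):
    """True if needle occurs in haystack once both are normalized.

    Caveat: this matches at code point boundaries, so a bare "j" is
    found inside a decomposed j-with-circumflex.
    """
    return _nfd(needle) in _nfd(haystack)

same = canonical_equal("\u0135", "j\u0302")
found = canonical_contains("a\u0135b", "j\u0302b")
partial_match = canonical_contains("\u0135", "j")
```

The last call returns True, which shows that normalization alone does not give a search that respects what users think of as characters; a higher-level search would also have to check cluster boundaries.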
...
Input and output have to deal with code points, obviously, but I think
going from code points to what users think of as "characters" and vice
versa for I/O should be done by the library. By default.
I have not been able to find another use case for accessing code
points directly. I'm ready to be convinced I'm wrong. However, we'll
have to make a choice.
Another use case would be writing codeset conversion functions.
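
For instance, a conversion from ISO 8859-1 (Latin-1) to UTF-8 is naturally written in terms of code points. The following is a minimal hand-rolled sketch in Python, with the invented names `encode_utf8` and `latin1_to_utf8`; a real program would normally use an existing codec, and the point here is only that the function works one code point at a time.

```python
def encode_utf8(code_point):
    """Encode a single Unicode code point as UTF-8 bytes."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (code_point,))
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

def latin1_to_utf8(data):
    """Convert Latin-1 bytes to UTF-8.

    Each Latin-1 byte value is numerically equal to its code point.
    """
    return b"".join(encode_utf8(byte) for byte in data)

converted = latin1_to_utf8(b"caf\xe9")
```

Nothing in this function needs to know about combining sequences or grapheme clusters, so forcing it through a higher-level interface would only get in the way.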

-- 
Jonathan Biggar
jon@levanta.com