Re: [boost] Re: [Unicode strings] We're off

19 Mar 2005

On Fri, 18 Mar 2005 14:56:53 -0800, Jonathan Biggar <jon@levanta.com> wrote:
...
> Rogier van Dalen wrote:
...
>> I believe this is what a Unicode library should use as its basic unit.
>
> Be careful with making a global assertion.  Different users of a Unicode
> library will need to access the data at different levels.  Some will
> need the raw encoding bytes or words, some will need code points, and
> some will need 'grapheme clusters'.
>
> The library should support working at the level that each particular
> user needs, and different parts of an application or library may need to
> work at multiple levels.

A decision must be made. Certainly you should have access to code
points; and you should be able to work at multiple levels. However,
one level has to be the default level. Most programmers should be able
to get what they want by using boost::unicode_string (or whatever it's
going to be called). We need to make a "global assertion" that's
correct 99% of the time.

I think we need an interface that will work for programmers who have
no idea what the difference between a code point and a grapheme cluster
is, and who don't want to be bothered by the difference between

    U+0135 LATIN SMALL LETTER J WITH CIRCUMFLEX
and
    U+006A LATIN SMALL LETTER J
    U+0302 COMBINING CIRCUMFLEX ACCENT

String handling includes searching and comparing, for which the two
sequences above should be equivalent. As a programmer, I don't want to
be bothered with different sequences that are canonically equivalent.
I want it to just work. The library should handle the cases I didn't
think about.
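
To make the equivalence above concrete, here is a minimal sketch of the
kind of comparison I have in mind. The names compose_pair, toy_compose
and toy_equivalent are hypothetical, not part of any existing or
proposed Boost interface, and the composition table knows only the
single pair listed above. A real library would presumably implement
full Unicode normalization from the character database instead.

```cpp
#include <string>

// Hypothetical helper: returns the precomposed code point for a
// base + combining mark pair, or 0 if this toy table has no entry.
// Assumption: only the one pair from the example is covered.
char32_t compose_pair(char32_t base, char32_t mark) {
    if (base == 0x006A && mark == 0x0302) return 0x0135; // j + circumflex
    return 0;
}

// Composes adjacent pairs known to compose_pair. This is NOT full
// canonical composition (NFC); it only illustrates the idea.
std::u32string toy_compose(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        if (!out.empty()) {
            char32_t composed = compose_pair(out.back(), cp);
            if (composed != 0) {
                out.back() = composed; // replace base with precomposed form
                continue;
            }
        }
        out.push_back(cp);
    }
    return out;
}

// Compares two code point sequences after toy composition, so that
// the two spellings of j-circumflex compare equal.
bool toy_equivalent(const std::u32string& a, const std::u32string& b) {
    return toy_compose(a) == toy_compose(b);
}
```

With this, a plain operator== on the two code point sequences would
report a difference, while toy_equivalent treats them as the same
string, which is the behaviour I would want by default.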

Input and output have to deal with code points, obviously, but I think
going from code points to what users think of as "characters" and vice
versa for I/O should be done by the library. By default.
I have not been able to find another use case for accessing code
points directly. I'm ready to be convinced I'm wrong. However, we'll
have to make a choice.
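
As a rough illustration of going from code points to what users think
of as "characters", the sketch below groups each base code point with
the combining marks that follow it. It is deliberately simplified: the
names is_toy_combining_mark and toy_clusters are hypothetical, and I
assume that only U+0300..U+036F (the Combining Diacritical Marks block)
counts as combining. Real grapheme cluster boundaries follow the rules
in the Unicode text segmentation annex and need character property
data.

```cpp
#include <string>
#include <vector>

// Simplifying assumption: only the Combining Diacritical Marks block
// (U+0300..U+036F) is treated as combining.
bool is_toy_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Splits a code point sequence into clusters, each made of one base
// code point followed by any combining marks attached to it.
std::vector<std::u32string> toy_clusters(const std::u32string& in) {
    std::vector<std::u32string> out;
    for (char32_t cp : in) {
        if (out.empty() || !is_toy_combining_mark(cp)) {
            out.push_back(std::u32string(1, cp)); // start a new cluster
        } else {
            out.back().push_back(cp);             // attach mark to current cluster
        }
    }
    return out;
}
```

Under these assumptions, both spellings of j-circumflex followed by
"a" yield two clusters, even though one input has two code points and
the other has three.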

Regards,
Rogier