
On Fri, 18 Mar 2005 14:56:53 -0800, Jonathan Biggar <jon@levanta.com> wrote:
Rogier van Dalen wrote:
I believe this is what a Unicode library should use as its basic unit.
Be careful with making a global assertion. Different users of a Unicode library will need to access the data at different levels. Some will need the raw encoding bytes or words, some will need code points, and some will need 'grapheme clusters'.
The library should support working at the level that each particular user needs, and different parts of an application or library may need to work at multiple levels.
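
To make the levels mentioned above concrete, here is a minimal Python sketch (not tied to any proposed Boost interface) that views one short string as encoded bytes, UTF-16 code units, code points, and approximate grapheme clusters. The helper `rough_clusters` is hypothetical and only attaches combining marks to the preceding base character; real grapheme cluster segmentation follows UAX #29 and handles many more cases.

```python
import unicodedata

# 'j' followed by U+0302 COMBINING CIRCUMFLEX ACCENT
text = "j\u0302"

# Level 1: raw encoding units
utf8_bytes = text.encode("utf-8")
utf16_units = len(text.encode("utf-16-le")) // 2  # each UTF-16 code unit is 2 bytes

# Level 2: code points
code_points = [f"U+{ord(c):04X}" for c in text]


def rough_clusters(s):
    """Hypothetical helper: group each base code point with the combining
    marks that follow it. An approximation only, not full UAX #29."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# Level 3: what a user would likely call "characters"
clusters = rough_clusters(text)

print(len(utf8_bytes), utf16_units, len(code_points), len(clusters))
```

The same text has a different length at each level, which is why the choice of default unit matters.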
A decision must be made. Certainly you should have access to code points, and you should be able to work at multiple levels. However, one level has to be the default. Most programmers should be able to get what they want by using boost::unicode_string (or whatever it's going to be called). We need to make a "global assertion" that's correct 99% of the time.

I think we need an interface that will work for programmers who have no idea what the difference between a code point and a grapheme cluster is, and who don't want to be bothered by the difference between

U+0135 LATIN SMALL LETTER J WITH CIRCUMFLEX

and

U+006A LATIN SMALL LETTER J
U+0302 COMBINING CIRCUMFLEX ACCENT

String handling includes searching and comparing, and for both of those the two sequences above should be treated as equivalent. As a programmer, I don't want to be bothered with different sequences that are canonically equivalent. I want it to just work. The library should handle the cases I didn't think about.

Input and output have to deal with code points, obviously, but I think going from code points to what users think of as "characters", and vice versa for I/O, should be done by the library. By default.

I have not been able to find another use case for accessing code points directly. I'm ready to be convinced I'm wrong. However, we'll have to make a choice.

Regards,
Rogier
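
As an illustration of the canonical equivalence discussed in this message, the following Python sketch compares the two sequences first code point by code point and then after normalization. It uses the standard `unicodedata` module; the helpers `canonically_equal` and `contains_canonical` are hypothetical names chosen for this example, and normalizing to NFC before comparing is only one possible way a library could implement the "just works" behaviour.

```python
import unicodedata

precomposed = "\u0135"   # LATIN SMALL LETTER J WITH CIRCUMFLEX
decomposed = "j\u0302"   # LATIN SMALL LETTER J + COMBINING CIRCUMFLEX ACCENT

# A plain code-point comparison treats the two strings as different.
naive_equal = precomposed == decomposed


def canonically_equal(a, b):
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def contains_canonical(haystack, needle):
    """Hypothetical helper: substring search on NFC-normalized text."""
    return unicodedata.normalize("NFC", needle) in unicodedata.normalize("NFC", haystack)


print(naive_equal)                                     # code-point level
print(canonically_equal(precomposed, decomposed))      # canonical level
print(contains_canonical("ab" + decomposed, precomposed))
```

A string type that normalizes (or compares as if normalized) by default would give the second and third results without the caller having to know about normalization forms.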