
Rogier van Dalen wrote:
Be careful with making a global assertion. Different users of a Unicode library will need to access the data at different levels. Some will need the raw encoding bytes or words, some will need code points, and some will need 'grapheme clusters'.
The library should support working at the level that each particular user needs, and different parts of an application or library may need to work at multiple levels.
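To make the three levels concrete, here is a minimal C++ sketch that looks at one UTF-8 string as code units, as code points, and as approximate grapheme clusters. The names `decode_utf8`, `is_combining_mark` and `count_clusters_approx` are hypothetical and are not part of any proposed Boost interface. The decoder assumes well-formed input, and the cluster count only treats the Combining Diacritical Marks block (U+0300 to U+036F) as combining; real segmentation follows the rules in UAX #29.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Level 1 is the raw string itself: s.size() is the number of UTF-8 code units.

// Level 2: decode UTF-8 into code points.
// Sketch only: assumes well-formed input and performs no validation.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t len;
        if (b < 0x80)      { cp = b;        len = 1; }  // ASCII
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 2-byte lead
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 3-byte lead
        else               { cp = b & 0x07; len = 4; }  // 4-byte lead
        // Fold in the low six bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Assumption: only U+0300..U+036F counts as combining in this sketch.
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Level 3 (approximation): every non-combining code point starts a cluster.
std::size_t count_clusters_approx(const std::vector<char32_t>& cps) {
    std::size_t n = 0;
    for (char32_t cp : cps)
        if (!is_combining_mark(cp)) ++n;
    return n;
}
```

For the decomposed sequence LATIN SMALL LETTER J followed by U+0302, stored in UTF-8 as "j\xCC\x82", this gives 3 code units, 2 code points and 1 approximate cluster. The precomposed U+0135, stored as "\xC4\xB5", gives 2, 1 and 1.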
A decision must be made. Certainly you should have access to code points, and you should be able to work at multiple levels. However, one level has to be the default level. Most programmers should be able to get what they want by using boost::unicode_string (or whatever it's going to be called). We need to make a "global assertion" that's correct 99% of the time.
I don't see why there has to be a "default" interface at all. There should just be multiple interfaces, one for each level that a programmer may need to work at.
I think we need an interface that will work for programmers who have no idea what the difference between a code point and a grapheme cluster is, and who don't want to be bothered by the difference between
U+0135 LATIN SMALL LETTER J WITH CIRCUMFLEX and U+006A LATIN SMALL LETTER J followed by U+0302 COMBINING CIRCUMFLEX ACCENT.
That's fine for *certain* uses. Other programs may need to distinguish between the two, and need the ability to convert a Unicode string between the form where all combining characters are combined and the form where they are all separate, explicit code points. One way of telling the library that you don't care about the difference is to ensure that every string you use is canonicalized into the form that makes your job easier. Alternatively, the interface could provide the ability to set state bits in the string that indicate whether you want to see the differences or not.
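As a rough illustration of the "canonicalize everything first" approach, the C++ sketch below composes base-plus-mark pairs using a two-entry toy table and then compares the results. The names `Composition`, `kToyTable`, `compose_toy` and `equivalent_toy` are hypothetical. A real implementation would apply the Unicode normalization forms (NFC/NFD, UAX #15) driven by the full Unicode Character Database, including canonical reordering of marks, none of which is attempted here.

```cpp
#include <string>

// Toy composition table: (base, combining mark) -> precomposed code point.
// Assumption: only these two pairs are known; real data comes from the UCD.
struct Composition { char32_t base, mark, composed; };
constexpr Composition kToyTable[] = {
    {0x006A, 0x0302, 0x0135},  // j + COMBINING CIRCUMFLEX ACCENT
    {0x0065, 0x0301, 0x00E9},  // e + COMBINING ACUTE ACCENT
};

// Replace each known (base, mark) pair with its precomposed code point.
std::u32string compose_toy(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        bool merged = false;
        if (!out.empty()) {
            for (const Composition& c : kToyTable) {
                if (out.back() == c.base && cp == c.mark) {
                    out.back() = c.composed;  // fold the mark into the base
                    merged = true;
                    break;
                }
            }
        }
        if (!merged) out.push_back(cp);
    }
    return out;
}

// Compare two strings after bringing both into the same (toy) composed form.
bool equivalent_toy(const std::u32string& a, const std::u32string& b) {
    return compose_toy(a) == compose_toy(b);
}
```

With this, `equivalent_toy(U"j\u0302", U"\u0135")` is true even though the two strings differ code point by code point. A program that needs to tell them apart simply compares the unnormalized strings.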
String handling includes searching and comparing, for which the two sequences above should be equivalent. As a programmer, I don't want to be bothered with different sequences that are canonically equivalent. I want it to just work. The library should handle the cases I didn't think about.
That's fine *when* you are working at that high a level of abstraction.
Input and output have to deal with code points, obviously, but I think going from code points to what users think of as "characters", and vice versa for I/O, should be done by the library. By default. I have not been able to find another use case for accessing code points directly. I'm ready to be convinced I'm wrong. However, we'll have to make a choice.
Another use case would be writing codeset conversion functions.

--
Jonathan Biggar
jon@levanta.com