
Dear Graham,

You seem to have a lot of valuable experience with Unicode. I feel that the interface is the more urgent matter for now: the implementation of the tables can always be changed as long as the interface remains stable. Let me rephrase that: the programmer's interface and the data tables are orthogonal issues, and programmers will care first about the interface and then about performance. This is not to say that the important and difficult issues you raise don't need to be solved at some point. As far as I can see, C++ provides the machinery to abstract away from code points, which is an opportunity to hide more of Unicode's complexity than other programming languages can. I would love to see a library that hides the issue of different normalisation forms from the programmer.
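To make that concrete, here is a minimal sketch of the kind of interface I have in mind; every name in it is hypothetical, and normalize_nfc is a stub standing in for the real table-driven implementation:

// Sketch only: normalize_nfc is a placeholder; a real version would
// apply canonical decomposition and recomposition using the Unicode
// tables.
#include <string>
#include <utility>

namespace unicode {

std::string normalize_nfc(std::string utf8_bytes) { return utf8_bytes; }

// Invariant: data_ always holds UTF-8 in Normalisation Form C.
class ustring
{
public:
    explicit ustring(std::string utf8_bytes)
        : data_(normalize_nfc(std::move(utf8_bytes))) // normalise once, on entry
    {}

    // Both sides hold the same canonical form, so canonically
    // equivalent strings compare equal with a plain byte comparison.
    friend bool operator==(ustring const& a, ustring const& b)
    { return a.data_ == b.data_; }

    std::string const& utf8() const { return data_; }

private:
    std::string data_;
};

} // namespace unicode

Normalising once, on construction, means that equality, hashing and searching can all work on plain bytes afterwards.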
As for the scope of the library:

> How have you hooked in dictionary word break support for languages
> like Thai?

IMO that would be beyond the scope of a general Unicode library.
> How far have you gone? Do you have support for going from logical to
> display on combined ltor and rtol? Customised glyph conversion for
> Indic Urdu?

Correct me if I'm wrong, but I think these issues only become important when rendering Unicode strings. Aren't they therefore better handled by Uniscribe, Pango, or similar OS-specific libraries? I think a Boost Unicode library should focus on processing symbolic Unicode strings and stay away from what happens when they are displayed, just as std::basic_string does.
> How can we ensure that other Boost projects understand the
> implications of Unicode support and the subtle changes required, e.g.
> hooks to allow for canonical decomposition on the string data portions
> of regular expressions in the regexpr project?
As long as the Unicode string abstracts away from normalisation forms, I believe level 2 Unicode support for regular expressions should basically come for free. In general, the Unicode library should incorporate as much Unicode-specific machinery as possible, leaving as little difficulty as possible for other library authors; documentation is important here. Why would you want to do canonical decomposition explicitly in a regular expression?
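To illustrate: if both the pattern and the subject are brought to the same canonical form before the engine sees them, canonically equivalent spellings match without the engine knowing anything about Unicode. A sketch, again with a hypothetical normalize_nfc stub:

// Sketch: normalize_nfc is a placeholder; a real one would apply the
// canonical (de)composition tables from the Unicode Character Database.
#include <cassert>
#include <regex>
#include <string>

std::string normalize_nfc(std::string utf8_bytes) { return utf8_bytes; }

bool search(std::string const& pattern, std::string const& subject)
{
    std::string const p = normalize_nfc(pattern);
    std::string const s = normalize_nfc(subject);
    // The engine stays byte-oriented; canonical equivalence has been
    // dealt with before it ever sees the data.
    return std::regex_search(s, std::regex(p));
}

int main()
{
    // With a real normalize_nfc, a pattern written with the composed
    // U+00E9 would also match a subject using the decomposed e + U+0301.
    assert(search("caf\u00e9", "un caf\u00e9, please"));
}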
> We will need to create a utility to take the 'raw'/published Unicode
> data files along with user-defined private characters to make these
> tables, which would then be used by the set of functions that we will
> agree on, such as isnumeric, ishangul, isstrongrtol, isrtol, etc.
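The table generation and lookup themselves sound sensible to me. As a rough illustration of the lookup side, here is a sketch of the classic two-stage table such a generator could emit; all names are hypothetical, and the toy data below covers only the ASCII digits:

// Sketch: the real stage1/stage2 arrays would be generated from the
// published Unicode data files; this toy version spans U+0000..U+00FF.
#include <array>
#include <cassert>
#include <cstdint>

namespace unicode {

enum : std::uint8_t
{
    prop_numeric     = 1 << 0, // backs isnumeric()
    prop_hangul      = 1 << 1, // backs ishangul()
    prop_strong_rtol = 1 << 2  // backs isstrongrtol() / isrtol()
};

// Stage 2: property bits for each code point within a 256-entry block.
// This hand-filled block marks only the ASCII digits as numeric.
constexpr std::array<std::uint8_t, 256> block0 = [] {
    std::array<std::uint8_t, 256> b{};
    for (char32_t cp = U'0'; cp <= U'9'; ++cp)
        b[cp] = prop_numeric;
    return b;
}();

// Stage 1: maps the high bits of a code point to its stage-2 block.
// One block suffices for this demo; the real table would span U+10FFFF.
constexpr std::array<std::array<std::uint8_t, 256> const*, 1> stage1 = { &block0 };

inline std::uint8_t properties(char32_t cp)
{
    return (*stage1[cp >> 8])[cp & 0xff];
}

inline bool isnumeric(char32_t cp) { return (properties(cp) & prop_numeric) != 0; }

} // namespace unicode

int main()
{
    assert(unicode::isnumeric(U'7'));
    assert(!unicode::isnumeric(U'a'));
}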
However, I find the idea of users embedding private character properties _within_ the standard Unicode tables, and building their own slightly different version of the Unicode library, scary. Why is this needed?

Regards,
Rogier