
In article <022101c4220f$a0c363f0$a8500352@fuji>, "John Maddock" <john@johnmaddock.co.uk> wrote:
- The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.
Why not? Once the locale facets are provided, the std iostreams will "just work"; that was the whole point of templating them in the first place.
I have already gone over this in other posts, but, in short, std::basic_string makes performance guarantees that are at odds with Unicode strings.
Basic_string is a sequence of code points, no more, no less; all performance guarantees for basic_string can be met as such.
If all you want basic_string for is a sequence of code points, you should use a vector<codePointT> instead, as vector does not provide additional methods that would be at best deceptive and at worst dangerous when applied to Unicode strings.
Iterator adapters for normalisation / composition / compression would also be useful additions.
Likewise adapters for iterating "characters" and "glyphs".
Leaving compression out, as I don't see what it has to do with Unicode strings per se, I don't think they would merely be useful additions; I think they would be required in order for a boost Unicode library to meet my expectations.
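To illustrate the lowest layer such adapters would sit on, here is a minimal sketch of walking a UTF-16 sequence by code point rather than by code unit. It assumes C++11's std::u16string as the storage; the names next_code_point and code_points are hypothetical, and replacing unpaired surrogates with U+FFFD is a choice made for this sketch, not something proposed in this thread. Adapters for normalisation, "characters", or "glyphs" would need the Unicode data tables on top of this and are not attempted here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes the code point that starts at s[i] and
// advances i past it. Precondition: i < s.size().
// Assumption of this sketch: an unpaired surrogate decodes to U+FFFD.
inline char32_t next_code_point(const std::u16string& s, std::size_t& i) {
    const char16_t u = s[i++];
    if (u >= 0xD800 && u <= 0xDBFF && i < s.size()) {
        const char16_t v = s[i];
        if (v >= 0xDC00 && v <= 0xDFFF) {
            ++i;  // consume the low surrogate as well
            return 0x10000 + ((char32_t(u) - 0xD800) << 10)
                           + (char32_t(v) - 0xDC00);
        }
    }
    if (u >= 0xD800 && u <= 0xDFFF) {
        return 0xFFFD;  // lone high or lone low surrogate
    }
    return u;  // ordinary BMP code unit
}

// Hypothetical helper: collects every code point in s.
inline std::vector<char32_t> code_points(const std::u16string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ) {
        out.push_back(next_code_point(s, i));
    }
    return out;
}
```

Given the three code units D834 DD1E 0061, code_points yields two elements, U+1D11E and U+0061, whereas size() on the underlying string reports three.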
Working on sequences of code points always requires care: clearly one could erase a low surrogate and leave a high surrogate "orphaned" behind, for example. One would need to make it clear in the documentation that potential problems like this can occur.
It is precisely because this interface is dangerous that I believe that it should not be the default interface to a Unicode string. It is rarely useful and often harmful. It does not make it easy to do things right.
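The surrogate hazard is easy to reproduce. The following is a minimal sketch, assuming UTF-16 stored in C++11's std::u16string; all helper names (is_well_formed_utf16, make_sample, erase_unit) are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration; not part of any proposed interface.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Returns true if every surrogate in s belongs to a correctly ordered pair.
inline bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i])) {
            if (i + 1 == s.size() || !is_low_surrogate(s[i + 1])) {
                return false;  // high surrogate with no low surrogate after it
            }
            ++i;  // skip the low half of the pair
        } else if (is_low_surrogate(s[i])) {
            return false;  // low surrogate with no high surrogate before it
        }
    }
    return true;
}

// U+1D11E (surrogate pair D834 DD1E) followed by 'a': two code points,
// three code units.
inline std::u16string make_sample() {
    return std::u16string{char16_t(0xD834), char16_t(0xDD1E), u'a'};
}

// Erases a single code unit by index, exactly as basic_string::erase allows.
inline std::u16string erase_unit(std::u16string s, std::size_t index) {
    s.erase(index, 1);
    return s;
}
```

Here erase_unit(make_sample(), 1) compiles and runs without complaint, yet the result fails is_well_formed_utf16: the string type offers no resistance to the mistake.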
Unicode is such a large and complex issue that it's actually pretty hard to keep even a small fraction of the issues in one's mind at a time, hence my suggestion to split the issue up into a series of steps.
The problem is that I think that some of the steps you propose do not take us in the direction of a useful Unicode string abstraction in boost, but merely provide convenient wrappers for the simple problems without tackling the complicated problems. I don't have a problem with solving simple problems first, but I would like to have a reason to believe that solving those simple problems gets us closer to solving the hard problems at a later time; I am not convinced the approach you propose fits that bill.
meeroh
--
If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>