
Hi Jeremy,
The IBM International Components for Unicode (ICU) library (http://oss.software.ibm.com/icu/) is an existing C++ library .... and although there are some C++-specific facilities, most of the C++ API is the same as the C API, thus resulting in a less-than-optimal C++ interface.
True. In particular, it looks like they use iterators that have a 'next' method. Hmm... let me guess why: IIRC it was a Java library initially and was then ported to C++.
Nonetheless, I think Boostifying the ICU library would be quite feasible, ...
As Miro said, there are several ways Boost.Unicode might relate to ICU, though reusing code from there is desirable.
The representation of locales does present an issue that needs to be considered. The existing C++ standard locale facets are not very suitable for a variety of reasons:
- The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.
We can just forget about locale::operator() ;-) But there are other issues. For example, 'toupper' takes a charT and returns a charT. The Unicode standard (in 5.18) gives an example of a character which becomes two characters when uppercased. Also, it might be necessary to look at the following code point to find out whether it is a combining character. Other facets, say 'num_put', may not need changes. If it generates data in UCS-2, that's fine.
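To make the 'toupper' problem concrete, here is a minimal sketch of a string-to-string uppercasing function. The name to_upper_full and its tiny mapping table are assumptions made up for illustration, not part of ICU or the standard library. It handles only ASCII letters plus the German sharp s (U+00DF), which uppercases to "SS" and so cannot be expressed by an interface that returns a single charT.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from ICU or the standard library):
// uppercases a UTF-32 string, allowing one code point to expand to several.
// The mapping table is a toy: ASCII letters and U+00DF only.
std::u32string to_upper_full(const std::u32string& in)
{
    std::u32string out;
    for (char32_t c : in) {
        if (c == U'\u00DF') {
            // LATIN SMALL LETTER SHARP S expands to two code points.
            out += U"SS";
        } else if (c >= U'a' && c <= U'z') {
            // ASCII lowercase: upper case is 0x20 below.
            out += static_cast<char32_t>(c - 0x20);
        } else {
            // Everything else is copied unchanged in this sketch.
            out += c;
        }
    }
    return out;
}
```

With this sketch, to_upper_full(U"stra\u00DFe") yields U"STRASSE", one code point longer than the input. That is the reason the interface has to map strings to strings rather than characters to characters.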
- The interface of std::collate<Ch> is not at all suitable for providing all of the functionality desired for Unicode string collation. A suitable Unicode collation facility should at least allow for user-selection of the strength level used (refer to http://www.unicode.org/unicode/reports/tr10/),
Can't you 'imbue' a new facet whenever you need to change something? We do need to decide, though, what to use for 'charT' and which encoding to use. If ICU can compare UTF-16 encoded strings, then it's possible to pass those strings to 'compare'. I don't understand what 'transform' is for, though.
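A rough sketch of what replacing the collation facet could look like, using only the standard std::collate<char> interface. The class ci_collate and the helper functions are hypothetical names invented here, and the comparison is a plain ASCII case-insensitive one standing in for a real collation strength setting. As far as the standard describes it, 'transform' returns a sort key: comparing two keys lexicographically should give the same ordering as 'compare' on the original strings, which is useful when the same strings are compared many times.

```cpp
#include <cassert>
#include <cctype>
#include <locale>
#include <string>

// Hypothetical facet: ASCII case-insensitive replacement for std::collate<char>.
// It stands in for "a collation facet with a different strength".
class ci_collate : public std::collate<char>
{
protected:
    int do_compare(const char* lo1, const char* hi1,
                   const char* lo2, const char* hi2) const override
    {
        for (; lo1 != hi1 && lo2 != hi2; ++lo1, ++lo2) {
            int a = std::tolower(static_cast<unsigned char>(*lo1));
            int b = std::tolower(static_cast<unsigned char>(*lo2));
            if (a != b) return a < b ? -1 : 1;
        }
        if (lo1 != hi1) return 1;   // first string is longer
        if (lo2 != hi2) return -1;  // second string is longer
        return 0;
    }

    // Sort key: keys compared with operator< order the same way as do_compare.
    std::string do_transform(const char* lo, const char* hi) const override
    {
        std::string key;
        for (; lo != hi; ++lo)
            key += static_cast<char>(std::tolower(static_cast<unsigned char>(*lo)));
        return key;
    }
};

// Compare two strings with whatever collate facet the locale holds.
int compare_with(const std::locale& loc, const std::string& a, const std::string& b)
{
    const std::collate<char>& coll = std::use_facet<std::collate<char>>(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}

// Obtain the sort key for a string from the locale's collate facet.
std::string sort_key(const std::locale& loc, const std::string& s)
{
    return std::use_facet<std::collate<char>>(loc)
        .transform(s.data(), s.data() + s.size());
}

// Build a locale that is the classic one except for the collate facet.
std::locale make_ci_locale()
{
    return std::locale(std::locale::classic(), new ci_collate);
}
```

Because ci_collate inherits its id from std::collate<char>, constructing the locale with it replaces the existing collation facet rather than adding a second one. Changing the strength would then mean constructing another locale with a differently configured facet.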
It would still be possible to use the standard locale object as a container of an entirely new set of facets, which could be loaded from the data sources based on the name of the locale, and ``injected'' into an existing locale object, by calling some function. It is not clear, however, what advantage this would serve over simply using a thin-wrapper over a locale name to represent a ``locale,'' as is done in the ICU library.
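The "container of an entirely new set of facets" idea can be sketched with the standard facet mechanism. Everything named here (unicode_collation, inject_unicode_facets, the integer strength field) is a hypothetical illustration rather than an existing Boost or ICU interface, and a real facet would load its data from the locale's data source instead of just storing the name.

```cpp
#include <cassert>
#include <cstddef>
#include <locale>
#include <string>
#include <utility>

// Hypothetical new facet type. It only stores settings; a real one would
// hold collation data loaded for the named locale.
class unicode_collation : public std::locale::facet
{
public:
    static std::locale::id id;  // gives the facet its own slot in a locale

    unicode_collation(std::string locale_name, int strength, std::size_t refs = 0)
        : std::locale::facet(refs),
          name_(std::move(locale_name)),
          strength_(strength) {}

    const std::string& name() const { return name_; }
    int strength() const { return strength_; }

private:
    std::string name_;
    int strength_;
};

std::locale::id unicode_collation::id;

// "Inject" the new facet into a copy of an existing locale object.
// The name is only stored, never passed to the system locale machinery.
std::locale inject_unicode_facets(const std::locale& base,
                                  const std::string& name, int strength)
{
    return std::locale(base, new unicode_collation(name, strength));
}
```

Code that needs the facet would check std::has_facet<unicode_collation>(loc) and then call std::use_facet<unicode_collation>(loc). All the standard facets of the base locale stay in place, since the new facet has its own id.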
First, using std::locale would be more familiar. Second, std::locale allows different facets to be used, and that's a good thing in general. E.g. I have all POSIX locale categories set to "C" except for LC_CTYPE. It would be inconvenient to have only one locale setting for everything.
- Volodya