[boost] Re: Boost Unicode support ideas

13 Apr 2004


      Hi Jeremy,
...
The IBM International Components for Unicode (ICU) library
(http://oss.software.ibm.com/icu/) is an existing C++ library ....
and although
there are some C++-specific facilities, most of the C++ API is the same
as the C API, thus resulting in a less-than-optimal C++ interface.
True. In particular it looks like they use iterators which have a 'next'
method. Hmm... let me guess why -- IIRC it was a Java library initially and
was then ported to C++.
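
To make the 'next' point concrete, here is a rough sketch of how a
Java-style cursor could be wrapped so that it works with standard
algorithms and containers. Both next_style_cursor and cursor_iterator are
made-up names for illustration; this is not ICU's actual API, and char32_t
is only an assumed code point type.

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>

// Hypothetical Java-style cursor: the caller polls has_next() and calls next().
class next_style_cursor {
public:
    explicit next_style_cursor(std::u32string text) : text_(std::move(text)) {}
    bool has_next() const { return pos_ < text_.size(); }
    char32_t next() { return text_[pos_++]; }
private:
    std::u32string text_;
    std::size_t pos_ = 0;
};

// Input-iterator adapter over the cursor. A default-constructed
// cursor_iterator serves as the end-of-sequence marker.
class cursor_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    cursor_iterator() = default;
    explicit cursor_iterator(next_style_cursor& c) : cursor_(&c) { advance(); }

    char32_t operator*() const { return current_; }
    cursor_iterator& operator++() { advance(); return *this; }
    cursor_iterator operator++(int) {
        cursor_iterator old = *this;
        advance();
        return old;
    }
    friend bool operator==(const cursor_iterator& a, const cursor_iterator& b) {
        return a.cursor_ == b.cursor_;
    }
    friend bool operator!=(const cursor_iterator& a, const cursor_iterator& b) {
        return !(a == b);
    }

private:
    // Pull the next code point, or turn into the end marker when exhausted.
    void advance() {
        if (cursor_ && cursor_->has_next()) current_ = cursor_->next();
        else cursor_ = nullptr;
    }
    next_style_cursor* cursor_ = nullptr;
    char32_t current_ = 0;
};
```

With this, something like std::u32string(first, last) or std::copy works on
the cursor, where first wraps the cursor and last is default-constructed.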
...
Nonetheless, I think Boostifying the ICU library would be quite
feasible, ...
As Miro said, there are several ways Boost.Unicode might relate to ICU,
though reusing code from there is desirable.
...
The representation of locales does present an issue that needs to be
considered.  The existing C++ standard locale facets are not very
suitable for a variety of reasons:
- The standard facets (and the locale class itself, in that it is a
   functor for comparing basic_strings) are tied to facilities such as
   std::basic_string and std::ios_base which are not suitable for
   Unicode support.
We can just forget about locale::operator() ;-) But there are other issues.
For example, 'toupper' takes a charT and returns charT. The Unicode
standard (in 5.18) gives an example of a character which becomes two
characters when uppercased. Also, it might be necessary to look at the
following code point to find out whether it is a combining character.
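
A small sketch of why a charT -> charT signature is not enough: the
mapping has to be able to return a string. The function names full_upper
and to_upper are hypothetical, char32_t is an assumed code point type, and
only two special cases (sharp s and the fi ligature, both listed in
Unicode's SpecialCasing data) are hard-coded; a real implementation would
be table-driven.

```cpp
#include <string>

// Hypothetical helper (not from ICU or the standard library): full uppercase
// mapping of a single code point. The result may be longer than one code
// point. Non-ASCII input other than the two special cases is passed through
// unchanged, which is NOT correct in general.
std::u32string full_upper(char32_t c) {
    if (c == U'\u00DF') return U"SS";  // LATIN SMALL LETTER SHARP S
    if (c == U'\uFB01') return U"FI";  // LATIN SMALL LIGATURE FI
    if (c >= U'a' && c <= U'z')
        return std::u32string(1, static_cast<char32_t>(c - U'a' + U'A'));
    return std::u32string(1, c);
}

// Uppercases a whole string; the output can be longer than the input.
std::u32string to_upper(const std::u32string& s) {
    std::u32string out;
    for (char32_t c : s) out += full_upper(c);
    return out;
}
```

For example, to_upper(U"stra\u00DFe") has seven code points although the
input has six, which a per-character 'toupper' cannot express.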

Other facets, say 'num_put', may not need changes. If it generates data
in UCS-2, that's fine.
...
- The interface of std::collate<Ch> is not at all suitable for
   providing all of the functionality desired for Unicode string
   collation.  A suitable Unicode collation facility should at least
   allow for user-selection of the strength level used (refer to
   http://www.unicode.org/unicode/reports/tr10/),
Can't you 'imbue' a new facet whenever you need to change something?
It is still necessary, though, to decide what to use for 'charT' and which
encoding to use. If ICU can compare UTF-16 encoded strings, then it's
possible to pass those strings to 'compare'. I don't understand what
'transform' is for, though.
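
A rough sketch of the "swap in a new facet" idea, using only standard
library calls. The facet caseless_collate and the helpers around it are
hypothetical; ignoring ASCII case is just a crude stand-in for a weaker
collation strength and has nothing to do with the real algorithm in TR10.
Note that only do_compare is overridden here, so 'transform' and 'hash'
would no longer agree with 'compare'; a real facet would override them too.

```cpp
#include <locale>
#include <string>

// Hypothetical facet: compares wide strings while ignoring ASCII case.
class caseless_collate : public std::collate<wchar_t> {
protected:
    int do_compare(const wchar_t* lo1, const wchar_t* hi1,
                   const wchar_t* lo2, const wchar_t* hi2) const override {
        for (; lo1 != hi1 && lo2 != hi2; ++lo1, ++lo2) {
            const wchar_t a = fold(*lo1);
            const wchar_t b = fold(*lo2);
            if (a != b) return a < b ? -1 : 1;
        }
        if (lo1 == hi1 && lo2 == hi2) return 0;
        return lo1 == hi1 ? -1 : 1;  // the shorter string sorts first
    }

private:
    static wchar_t fold(wchar_t c) {
        return (c >= L'A' && c <= L'Z')
                   ? static_cast<wchar_t>(c - L'A' + L'a')
                   : c;
    }
};

// Returns a copy of `base` whose collate<wchar_t> facet is replaced.
// The locale object takes ownership of the new facet.
std::locale with_caseless_collation(const std::locale& base) {
    return std::locale(base, new caseless_collate);
}

// Compares two strings through whatever collate facet `loc` carries.
int compare_in(const std::locale& loc,
               const std::wstring& a, const std::wstring& b) {
    const std::collate<wchar_t>& coll =
        std::use_facet<std::collate<wchar_t> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}
```

Changing the "strength" then means building another locale with a
differently configured facet, rather than mutating an existing one.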
...
It would still be possible to use the standard locale object as a
container of an entirely new set of facets, which could be loaded from
the data sources based on the name of the locale, and ``injected'' into
an existing locale object, by calling some function.  It is not clear,
however, what advantage this would serve over simply using a
thin-wrapper over a locale name to represent a ``locale,'' as is done in
the ICU library.
First, using std::locale would be more familiar. Second, std::locale lets
you mix different facets, and that's a good thing in general. E.g. I have
all POSIX locale categories set to "C" except for LC_CTYPE. It would be
inconvenient to have only one locale setting for everything.
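
The same per-category mix can be expressed with std::locale directly. A
minimal sketch, where with_ctype_from is a hypothetical helper name and the
locale name passed in is whatever the platform happens to provide:

```cpp
#include <locale>

// Takes every category from `base` except the ctype category, which comes
// from the named locale. The std::locale constructor throws
// std::runtime_error if the platform does not know `ctype_name`.
std::locale with_ctype_from(const std::locale& base, const char* ctype_name) {
    return std::locale(base, ctype_name, std::locale::ctype);
}
```

So with_ctype_from(std::locale::classic(), name) corresponds to "everything
is C except LC_CTYPE", assuming `name` is a locale installed on the system.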

- Volodya

Vladimir Prus