[boost] Boost Unicode support ideas

12 Apr 2004

It seems that Unicode support in Boost (which could lead to Unicode
support in the C++ language and standard library) would be quite
desirable.

The IBM International Components for Unicode (ICU) library
(http://oss.software.ibm.com/icu/) is an existing C++ library with what
appears to be a Boost-compatible license.  It provides all or most of
the Unicode support that would be desired in Boost or the C++ standard
library, as well as Unicode equivalents of facilities already in the
standard library or in Boost: number/currency formatting, date
formatting, message formatting, and a regular expression library.
Unfortunately, it has three drawbacks: it signals exceptional conditions
through an error code return mechanism rather than C++ exceptions, it
does not follow Boost naming conventions, and, although there are some
C++-specific facilities, most of the C++ API is the same as the C API,
which results in a less-than-optimal C++ interface.
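
To illustrate the first drawback, here is a minimal sketch of how an
error-code interface might be wrapped so that failures surface as C++
exceptions.  Every name in it (legacy_status, legacy_parse_digit,
unicode_error, parse_digit) is hypothetical and chosen only for
illustration; none of them is part of ICU or Boost.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for a C-style, error-code-based API.
// These names are illustrative only and are NOT part of ICU.
enum legacy_status { legacy_ok = 0, legacy_invalid_argument = 1 };

// C-style function: reports failure through an out-parameter and does
// nothing if an earlier call has already recorded a failure.
inline int legacy_parse_digit(char c, legacy_status* status) {
    if (*status != legacy_ok) return 0;
    if (c < '0' || c > '9') {
        *status = legacy_invalid_argument;
        return 0;
    }
    return c - '0';
}

// Hypothetical exception type that keeps the original status code.
class unicode_error : public std::runtime_error {
public:
    unicode_error(const std::string& what, legacy_status code)
        : std::runtime_error(what), code_(code) {}
    legacy_status code() const { return code_; }
private:
    legacy_status code_;
};

// Translates a failing status code into an exception.
inline void throw_if_failed(legacy_status status, const char* operation) {
    if (status != legacy_ok)
        throw unicode_error(std::string(operation) + " failed", status);
}

// Wrapper with a Boost-style name: callers never see the status code.
inline int parse_digit(char c) {
    legacy_status status = legacy_ok;
    int value = legacy_parse_digit(c, &status);
    throw_if_failed(status, "parse_digit");
    return value;
}
```

With a wrapper of this kind, parse_digit('7') simply returns a value,
and parse_digit('x') throws unicode_error instead of requiring the
caller to inspect a status variable after every call.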

Nonetheless, I think Boostifying the ICU library would be quite
feasible, whereas attempting to reimplement all of the desired
functionality that the ICU library provides would be extremely
time-consuming: the collation and other services in the ICU library
already support a large number of locales, and the character-set
conversion facilities support a large number of character sets.

The representation of locales does present an issue that needs to be
considered.  The existing C++ standard locale facets are not well
suited to Unicode support, for several reasons:

 - The standard facets (and the locale class itself, in that it is a
   functor for comparing basic_strings) are tied to facilities such as
   std::basic_string and std::ios_base which are not suitable for
   Unicode support.

 - The interface of std::collate<Ch> is not at all suitable for
   providing all of the functionality desired for Unicode string
   collation.  A suitable Unicode collation facility should at least
   allow for user-selection of the strength level used (refer to
   http://www.unicode.org/unicode/reports/tr10/), and would ideally
   also support customizations as extensive as the ICU library does
   (refer to
   http://oss.software.ibm.com/icu/userguide/Collate_ServiceArchitecture.html
   and
   http://oss.software.ibm.com/icu/userguide/Collate_Customization.html).

 - Facilities such as Unicode string collation are heavily data-driven,
   and it would be inefficient to load the data for facilities that are
   not used.  This could be avoided by using some sort of lazy loading
   mechanism.
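
As a rough illustration of the second point, the sketch below shows a
collator whose strength level is chosen by the user, something that
std::collate<Ch>::compare, which accepts only the two character ranges,
has no way to express.  This is a toy that handles ASCII only and treats
case as the sole tertiary difference; it is not an implementation of
the Unicode Collation Algorithm, and the names toy_collator and
collation_strength are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical strength levels, loosely modelled on the levels
// described in the Unicode collation report.
enum collation_strength { primary, tertiary };

// Toy collator for ASCII text only; NOT the Unicode Collation Algorithm.
class toy_collator {
public:
    explicit toy_collator(collation_strength strength)
        : strength_(strength) {}

    // Returns a negative value, zero, or a positive value.
    int compare(const std::string& a, const std::string& b) const {
        // Primary level: compare letters while ignoring case.
        int result = compare_with(a, b, fold);
        if (result != 0 || strength_ == primary) return result;
        // Tertiary level: break primary ties using case.  The order of
        // upper versus lower case here is arbitrary (raw code values).
        return compare_with(a, b, identity);
    }

private:
    static int fold(unsigned char c) { return std::tolower(c); }
    static int identity(unsigned char c) { return c; }

    // Compares two strings element by element using the given key.
    static int compare_with(const std::string& a, const std::string& b,
                            int (*key)(unsigned char)) {
        std::size_t n = a.size() < b.size() ? a.size() : b.size();
        for (std::size_t i = 0; i < n; ++i) {
            int ka = key(static_cast<unsigned char>(a[i]));
            int kb = key(static_cast<unsigned char>(b[i]));
            if (ka != kb) return ka < kb ? -1 : 1;
        }
        if (a.size() == b.size()) return 0;
        return a.size() < b.size() ? -1 : 1;
    }

    collation_strength strength_;
};
```

At primary strength "Resume" and "resume" compare equal; at tertiary
strength they do not.  A real facility would need further levels
(accents, at the very least) and locale-specific data.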

It would still be possible to use the standard locale object as a
container for an entirely new set of facets, which could be loaded from
the data sources based on the name of the locale and ``injected'' into
an existing locale object by calling some function.  It is not clear,
however, what advantage this would offer over simply using a thin
wrapper over a locale name to represent a ``locale,'' as is done in the
ICU library.
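
The two alternatives can be sketched roughly as follows.  The first
uses the standard std::locale constructor that takes a facet pointer to
inject a new facet; the second is a thin wrapper over a locale name,
paired with a cache that loads data only on first use.  All of the
class and function names here (unicode_collation_facet,
inject_unicode_facets, locale_name, collation_data_cache) are
hypothetical, and the ``data'' is a placeholder string rather than real
collation tables.

```cpp
#include <cstddef>
#include <locale>
#include <map>
#include <string>
#include <utility>

// Option 1: a hypothetical new facet stored inside a std::locale.
class unicode_collation_facet : public std::locale::facet {
public:
    static std::locale::id id;  // every facet type must provide this
    explicit unicode_collation_facet(const std::string& locale_name,
                                     std::size_t refs = 0)
        : std::locale::facet(refs), locale_name_(locale_name) {}
    const std::string& locale_name() const { return locale_name_; }
private:
    std::string locale_name_;
};
std::locale::id unicode_collation_facet::id;

// Returns a copy of `base` with the new facet added.  The locale takes
// ownership of the facet because it was constructed with refs == 0.
inline std::locale inject_unicode_facets(const std::locale& base,
                                         const std::string& name) {
    return std::locale(base, new unicode_collation_facet(name));
}

// Option 2: a thin wrapper over the locale name.
class locale_name {
public:
    explicit locale_name(const std::string& name) : name_(name) {}
    const std::string& name() const { return name_; }
private:
    std::string name_;
};

// Hypothetical cache: data for a locale is "loaded" on first request.
class collation_data_cache {
public:
    collation_data_cache() : loads_(0) {}

    const std::string& get(const locale_name& loc) {
        std::map<std::string, std::string>::iterator it =
            data_.find(loc.name());
        if (it == data_.end()) {
            ++loads_;  // stands in for reading a data file
            it = data_.insert(std::make_pair(
                     loc.name(), "rules for " + loc.name())).first;
        }
        return it->second;
    }

    int loads() const { return loads_; }

private:
    std::map<std::string, std::string> data_;
    int loads_;
};
```

With the first option, client code retrieves the facet through
std::use_facet<unicode_collation_facet>(loc); with the second, services
are looked up by name and nothing is loaded until it is requested.  In
this sketch, at least, the std::locale container adds little beyond
what the name-based wrapper already provides.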

-- 
Jeremy Maitin-Shepard