Re: [boost] Boost Unicode support ideas

13 Apr 2004

      "John Maddock" <john@johnmaddock.co.uk> writes:
...
...
[snip]
...
...
The representation of locales does present an issue that needs to be
considered.  The existing C++ standard locale facets are not very
suitable for a variety of reasons:
- The standard facets (and the locale class itself, in that it is a
functor for comparing basic_strings) are tied to facilities such as
std::basic_string and std::ios_base which are not suitable for
Unicode support.
...
Why not?  Once the locale facets are provided, the std iostreams will "just
work", that was the whole point of templating them in the first place.
I'm not sure they will "just work," although I will admit I haven't
looked into the issues relating to Unicode support in iostreams
thoroughly yet.
...
[snip]
...
However I think we're getting ahead of ourselves here:  I think a Unicode
library should be handled in stages:
...
1) define the data types for 8/16/32 bit Unicode characters.
I would suggest unsigned char for UTF-8 code units, boost::uint16_t for
UTF-16 code units, and boost::int32_t both for UTF-32 code units and to
represent Unicode code points.
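
As a rough sketch of those choices (the typedef names are hypothetical,
and I use the <cstdint> types here as stand-ins for boost::uint16_t and
boost::int32_t so the fragment is self-contained):

```cpp
#include <climits>
#include <cstdint>

// Hypothetical names; nothing in this thread fixes them.
typedef unsigned char utf8_unit;   // one UTF-8 code unit
typedef std::uint16_t utf16_unit;  // one UTF-16 code unit
typedef std::int32_t  utf32_unit;  // one UTF-32 code unit
typedef std::int32_t  code_point;  // a Unicode code point

// The code space ends at U+10FFFF, so a signed 32-bit type is wide enough.
const code_point max_code_point = 0x10FFFF;

static_assert(sizeof(utf16_unit) * CHAR_BIT == 16, "UTF-16 units are 16 bits");
static_assert(sizeof(code_point) * CHAR_BIT == 32, "code points are held in 32 bits");
```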
...
2) define iterator adapters to convert a sequence of one Unicode character
type to another.
This is easy enough: unicode.org provides optimized C code for this
purpose, which could easily be adapted for use in iterator adapters.
Alternatively, ICU probably provides this, though less directly.
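
To give an idea of the shape such a conversion could take, here is a
minimal iterator-based sketch of UTF-8 to UTF-32 decoding.  It is not
the unicode.org or ICU code; the names utf8_to_utf32 and decode are
made up for illustration, and it assumes well-formed input (no
validation at all):

```cpp
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 from [first, last) and writes one
// 32-bit code point per sequence to out.  Assumes well-formed input:
// overlong forms, stray continuation bytes and truncated sequences are
// NOT detected.
template <typename InputIt, typename OutputIt>
OutputIt utf8_to_utf32(InputIt first, InputIt last, OutputIt out) {
    while (first != last) {
        const unsigned char lead = static_cast<unsigned char>(*first++);
        std::uint32_t cp;
        int trailing;  // number of continuation bytes expected
        if (lead < 0x80)      { cp = lead;        trailing = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; trailing = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; trailing = 2; }
        else                  { cp = lead & 0x07; trailing = 3; }
        for (; trailing > 0 && first != last; --trailing) {
            const unsigned char next = static_cast<unsigned char>(*first++);
            cp = (cp << 6) | (next & 0x3F);  // append six payload bits
        }
        *out++ = cp;
    }
    return out;
}

// Convenience wrapper for whole strings.
inline std::vector<std::uint32_t> decode(const std::string& utf8) {
    std::vector<std::uint32_t> result;
    utf8_to_utf32(utf8.begin(), utf8.end(), std::back_inserter(result));
    return result;
}
```

A real adapter would wrap the same loop in an iterator class and would
have to decide how to report malformed input.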
...
3) define char_traits specialisations (as necessary) in order to get
basic_string working with Unicode character sequences, typedef the
appropriate string types:
...
typedef basic_string<utf8_t> utf8_string; // etc
As for using UTF-8 as the internal encoding, I personally would suggest
UTF-16 instead, because UTF-8 is rather inefficient to work with.
Although I am not overly attached to UTF-16, I do think it is important
to standardize on a single internal representation, because for
practical reasons it is useful to be able to have non-templated APIs
for purposes such as collating.

Other issues I see with using basic_string are that many of its methods
would not be suitable for use with a Unicode string, and that it has
nothing like an operator += that would append a single Unicode code
point (represented as a 32-bit integer).

What it comes down to is that basic_string is designed with fixed-width
character representations in mind.

I would be more in favor of creating a separate type to represent
Unicode strings.
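
To illustrate the kind of operator += I mean, here is a minimal sketch
of a separate string type that stores UTF-16 internally and accepts
whole code points.  The class name utf16_text and its interface are
hypothetical, and I take the code point as an unsigned 32-bit value for
simplicity:

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical Unicode string type with UTF-16 storage.
class utf16_text {
public:
    // Append one code point, encoding it as one or two UTF-16 code units.
    utf16_text& operator+=(std::uint32_t cp) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x10000) {
            units_.push_back(static_cast<std::uint16_t>(cp));
        } else {
            cp -= 0x10000;  // 20 bits remain, split into a surrogate pair
            units_.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            units_.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        }
        return *this;
    }

    // Raw code units, e.g. for passing to a non-templated collation API.
    const std::vector<std::uint16_t>& units() const { return units_; }

private:
    std::vector<std::uint16_t> units_;
};
```

With basic_string<uint16_t> the caller would have to do the surrogate
split by hand before every append.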
...
4) define low level access to the core Unicode data properties (in
UnicodeData.txt).
Reuse of the ICU library would probably be very helpful in this.
...
5) Begin to add locale support - a big job, probably a few facets at a
time.
The issue is that, despite what you say, most or all of the standard
library facets are not suitable for use with Unicode strings.  For
instance, character classification and toupper-like operations need
not be tied to a locale.  Furthermore, many operations, such as toupper
on a single character, are not well defined and must instead be defined
as string-to-string mappings.  Finally, the single-character type must
be a 32-bit integer, while the code unit type will probably not be
(since UTF-32 as the internal representation would be inefficient).
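
A small sketch of why uppercasing has to be a string-to-string mapping:
the sharp s (U+00DF) and the fi ligature (U+FB01) each uppercase to two
code points.  The function below is a toy with a hard-coded table, not
a real implementation; the name to_upper and the use of std::u32string
are my own choices for illustration:

```cpp
#include <string>

// Toy full-case mapping.  It covers only ASCII letters plus two entries
// whose uppercase form is longer than one code point; real data would
// come from the Unicode character database.
std::u32string to_upper(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        if (cp >= U'a' && cp <= U'z')
            out += static_cast<char32_t>(cp - U'a' + U'A');
        else if (cp == 0x00DF)  // LATIN SMALL LETTER SHARP S
            out += U"SS";
        else if (cp == 0xFB01)  // LATIN SMALL LIGATURE FI
            out += U"FI";
        else
            out += cp;          // everything else passes through unchanged
    }
    return out;
}
```

A function with the signature of std::toupper, one character in and one
character out, has no way to return either of those results.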

Specific cases include collate<Ch>, which lacks an interface for
configuring collation, such as which strength level to use, whether
uppercase or lowercase letters should sort first, whether in French
locales accents should be sorted right to left, and other such features.
It is true that an additional, more powerful interface could be
provided, but this would add complexity.  Additionally, collate<Ch>
depends on basic_string<Ch> (note the lack of a char_traits
specification), which is the return type of transform, when something
representing a byte array might be more suitable.
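
For reference, this is roughly all one can do with the standard facet.
The wrapper names collate_compare and sort_key are hypothetical; the
facet calls themselves are the standard ones:

```cpp
#include <locale>
#include <string>

// Compare two strings through std::collate<char>.  The only input besides
// the two ranges is the locale: there is no parameter for strength level,
// case ordering or accent direction.
int collate_compare(const std::locale& loc,
                    const std::string& a, const std::string& b) {
    const std::collate<char>& coll = std::use_facet<std::collate<char>>(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}

// transform yields a sort key, but typed as std::basic_string<char>
// rather than as an opaque byte array.
std::string sort_key(const std::locale& loc, const std::string& s) {
    const std::collate<char>& coll = std::use_facet<std::collate<char>>(loc);
    return coll.transform(s.data(), s.data() + s.size());
}
```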

Additionally, num_put, moneypunct and money_put all allow only a
single code unit in a number of places where a string of multiple code
points would be more suitable.  In addition, those facets also depend on
basic_string<Ch>.
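
The single-code-unit limitation is visible directly in the types.  A
small sketch using moneypunct<char> (the struct money_separators and
the function query_money_punct are hypothetical names of mine):

```cpp
#include <locale>
#include <string>
#include <type_traits>

// What std::moneypunct<char> hands back for a given locale.
struct money_separators {
    char decimal_point;           // exactly one char_type
    char thousands_sep;           // exactly one char_type
    std::string currency_symbol;  // string_type, i.e. basic_string<char>
};

money_separators query_money_punct(const std::locale& loc) {
    const std::moneypunct<char>& mp =
        std::use_facet<std::moneypunct<char>>(loc);
    money_separators result;
    result.decimal_point = mp.decimal_point();
    result.thousands_sep = mp.thousands_sep();
    result.currency_symbol = mp.curr_symbol();
    return result;
}

// With a UTF-8 or UTF-16 code unit as Ch, a separator that needs more
// than one code unit cannot be represented by these return types.
static_assert(std::is_same<std::moneypunct<char>::char_type, char>::value,
              "separators are a single char_type");
static_assert(std::is_same<std::moneypunct<char>::string_type,
                           std::string>::value,
              "string results are basic_string<Ch>");
```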
...
6) define iterator adapters for various Unicode algorithms
(composition/decomposition/compression etc).
7) Anything I've forgotten :-)
A facility for Unicode substring matching, which would use the
collation facilities, would be useful.  This could be based on the ICU
implementation.

Additionally, a date formatting facility for Unicode would be useful.
...
[snip]
-- 
Jeremy Maitin-Shepard