
1) define the data types for 8/16/32 bit Unicode characters.
unsigned char for UTF-8 code units, boost::uint16_t for UTF-16 code units, and boost::int32_t for UTF-32 code units and for representing Unicode code points.
Almost, ICU uses wchar_t for UTF-16 on Win32 (just to complicate things).
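As a sketch, those typedefs might look like the following (utf16_t and utf32_t are names assumed here to match the utf8_t used later in this thread, and &lt;cstdint&gt; stands in for boost/cstdint.hpp):

```cpp
#include <cassert>
#include <cstdint>   // standing in for boost/cstdint.hpp

typedef unsigned char utf8_t;    // UTF-8 code unit
typedef std::uint16_t utf16_t;   // UTF-16 code unit
typedef std::int32_t  utf32_t;   // UTF-32 code unit, also holds a Unicode code point
```

Note that on Win32, interoperating with ICU's wchar_t-based UTF-16 would then need a cast or a separate typedef.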
2) define iterator adapters to convert a sequence of one Unicode character type to another.
This is easy enough: unicode.org provides optimized C code for this purpose, which could easily be adapted for use in iterator adapters. Alternatively, ICU provides this, though less directly.
Yep.
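To illustrate the conversion layer (this is a minimal hand-written sketch, not the optimized unicode.org code, and it does no validation of malformed sequences):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Minimal sketch of UTF-8 -> UTF-32 decoding. The lead byte tells us how
// many continuation bytes follow; each continuation byte contributes 6 bits.
std::vector<std::int32_t> utf8_to_utf32(const std::string& s)
{
    std::vector<std::int32_t> out;
    for (std::size_t i = 0; i < s.size(); )
    {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::int32_t cp;
        int extra;
        if (c < 0x80)      { cp = c;        extra = 0; }  // ASCII
        else if (c < 0xE0) { cp = c & 0x1F; extra = 1; }  // 2-byte sequence
        else if (c < 0xF0) { cp = c & 0x0F; extra = 2; }  // 3-byte sequence
        else               { cp = c & 0x07; extra = 3; }  // 4-byte sequence
        ++i;
        for (int k = 0; k < extra && i < s.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}
```

An iterator adapter would perform the same per-code-point step lazily on dereference instead of converting the whole sequence up front.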
3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, typedef the appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
As for the use of UTF-8 as the internal encoding, I would personally suggest UTF-16 instead, because UTF-8 is rather inefficient to work with. Although I am not overly attached to UTF-16, I do think it is important to standardize on a single internal representation, because for practical reasons it is useful to be able to have non-templated APIs for purposes such as collating.
You can use whatever you want - I don't think users should be constrained to a specific internal encoding. Personally I don't like UTF8 either, but I know some people do...
The other issues I see with using basic_string are that many of its methods would not be suitable for use with a Unicode string, and that it lacks something like an operator+= for appending a single Unicode code point (represented as a 32-bit integer).
What it comes down to is that basic_string is designed with fixed-width character representations in mind.
I would be more in favor of creating a separate type to represent Unicode strings.
Personally I think we have too many string types around already. While I understand your concerns about basic_string, as a container of code points it's just fine IMO. We can always add non-member functions for more advanced manipulation.
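For example, the operator+= concern above could be met by a non-member function that appends one code point as one or two UTF-16 code units (a sketch; append_code_point and the utf16_string typedef are hypothetical names, and char16_t is used here because modern C++ guarantees char_traits for it):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

typedef std::basic_string<char16_t> utf16_string;  // hypothetical typedef

// Append a single Unicode code point to a UTF-16 string, splitting it
// into a surrogate pair when the code point lies outside the BMP.
void append_code_point(utf16_string& s, std::int32_t cp)
{
    if (cp < 0x10000)
    {
        s += static_cast<char16_t>(cp);
    }
    else
    {
        cp -= 0x10000;
        s += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
        s += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    }
}
```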
4) define low level access to the core Unicode data properties (in unidata.txt).
Reuse of the ICU library would probably be very helpful in this.
5) Begin to add locale support - a big job, probably a few facets at a time.
The issue is that, despite what you say, most or all of the standard library facets are not suitable for use with Unicode strings. For instance, the character classification and toupper-like operations need not be tied to a locale.
Accepted - ctype operations are largely (though not completely) independent of the locale; that just makes the ctype specialisations easier IMO.
Furthermore, many of the operations such as toupper on a single character are not well defined, and rather must be defined as a string to string mapping.
I know; however, 1-to-1 approximations are available (those in unidata.txt). I'm not saying that the std locale facets should be the only interface, or even the primary one, but providing them does get a lot of other stuff working.
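To illustrate the 1-to-1 approximation: a simple-case toupper over code points reduces to a table lookup (the two-entry table below is of course only a tiny illustrative subset of the unidata.txt mappings, and simple_toupper is a hypothetical name):

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Simple (1-to-1) uppercase mapping for code points. Full-case mappings
// such as German sharp s (U+00DF -> "SS") cannot be expressed this way,
// which is exactly the string-to-string problem discussed above.
std::int32_t simple_toupper(std::int32_t cp)
{
    // Tiny illustrative subset of the unidata.txt uppercase field.
    static const std::map<std::int32_t, std::int32_t> table = {
        { 0x00E9, 0x00C9 },  // e-acute -> E-acute
        { 0x03B1, 0x0391 },  // Greek small alpha -> capital Alpha
    };
    if (cp >= 'a' && cp <= 'z')
        return cp - ('a' - 'A');
    std::map<std::int32_t, std::int32_t>::const_iterator it = table.find(cp);
    return it != table.end() ? it->second : cp;
}
```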
Finally, the single-character type must be a 32-bit integer, while the code unit type will probably not be (since UTF-32 as the internal representation would be inefficient).
True: with UTF-16, only the core Unicode subset would be supported by std::locale (i.e. no surrogate pairs); this is the same as the situation in Java and JavaScript.
Specific cases include collate<Ch>, which lacks an interface for configuring collation, such as which strength level to use, whether uppercase or lowercase letters should sort first, whether in French locales accents should be sorted right to left, and other such features. It is true that an additional, more powerful interface could be provided, but this would add complexity.
You can provide any constructor interface to the collate facet that you want; for example, to support a locale name and a strength level one might use:

    template <class charT>
    class unicode_collate : public std::collate<charT>
    {
    public:
       unicode_collate(const char* name, int level = INT_MAX);
       /* details */
    };

I'm assuming that we have a non-member function to create a locale object that contains a set of Unicode facets:

    std::locale create_unicode_locale(const char* name);

Usage to create a locale object with primary level collation would then be:

    std::locale l(create_unicode_locale("en_GB"),
                  new unicode_collate<char>("en_GB", 1));
    mystream.imbue(l);
    mystream << something; // etc.
Additionally, it depends on basic_string<Ch> (note lack of char_traits specification), which is used as the return type of transform, when something representing a byte array might be more suitable.
You might have me on that one :-)
Additionally, num_put, moneypunct and money_put would all allow only a single code unit in a number of cases where a string of multiple code points would be appropriate. Those facets also depend on basic_string<Ch>.
I don't understand what the problem is there, please explain.
6) define iterator adapters for various Unicode algorithms (composition/decomposition/compression etc).
7) Anything I've forgotten :-)
A facility for Unicode substring matching, which would use the collation facilities, would be useful. This could be based on the ICU implementation.
Additionally, a date formatting facility for Unicode would be useful.
std::time_get / std::time_put ? :-) John.
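To show what the existing facets already give you, here is a std::time_put usage sketch in the classic "C" locale (format_year is a hypothetical helper name):

```cpp
#include <cassert>
#include <ctime>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Format a year via the std::time_put facet of the stream's locale
// (the classic "C" locale here), using the 'Y' conversion.
std::string format_year(int year)
{
    std::ostringstream os;
    std::tm t = std::tm();
    t.tm_year = year - 1900;  // tm_year counts from 1900
    std::use_facet<std::time_put<char> >(os.getloc()).put(
        std::ostreambuf_iterator<char>(os), os, ' ', &t, 'Y');
    return os.str();
}
```

A Unicode-aware version would presumably follow the same facet shape but produce locale-appropriate calendar text in the chosen internal encoding.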