
"John Maddock" <john@johnmaddock.co.uk> writes:
[snip]
The representation of locales does present an issue that needs to be considered. The existing C++ standard locale facets are not very suitable for a variety of reasons:
- The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.
Why not? Once the locale facets are provided, the std iostreams will "just work", that was the whole point of templating them in the first place.
I'm not sure they will "just work," although I will admit I have not yet looked thoroughly into the issues relating to Unicode support in iostreams.
[snip]
However I think we're getting ahead of ourselves here: I think a Unicode library should be handled in stages:
1) define the data types for 8/16/32 bit Unicode characters.
I would suggest unsigned char for UTF-8 code units, boost::uint16_t for UTF-16 code units, and boost::int32_t both for UTF-32 code units and for representing Unicode code points.
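As a rough sketch, the suggested types might be spelled as below. The typedef names and the is_scalar_value helper are hypothetical (nothing in this thread fixes them), and the <cstdint> types stand in for the boost:: integer typedefs so that the fragment compiles on its own.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names, for illustration only.
typedef unsigned char utf8_unit;    // one UTF-8 code unit
typedef std::uint16_t utf16_unit;   // one UTF-16 code unit (stand-in for boost::uint16_t)
typedef std::int32_t  utf32_unit;   // one UTF-32 code unit (stand-in for boost::int32_t)
typedef std::int32_t  code_point;   // a Unicode code point

// True if cp is a Unicode scalar value: within U+0000..U+10FFFF
// and not in the surrogate range U+D800..U+DFFF.
inline bool is_scalar_value(code_point cp) {
    return cp >= 0 && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

A signed 32-bit type leaves negative values free to act as sentinels (for example "no character"), which is one possible reason to prefer it for code points.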
2) define iterator adapters to convert a sequence of one Unicode character type to another.
This is easy enough: unicode.org provides optimized C code for this purpose, which could be adapted with little change for use in iterator adapters. Alternatively, ICU probably provides this as well, though less directly.
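To give an idea of the conversion logic such an adapter would wrap, here is a minimal sketch written as plain function templates rather than as a full iterator adapter. It is not the unicode.org or ICU code; the names decode_utf8, encode_utf16 and utf8_to_utf16 are hypothetical. Error handling is deliberately thin: truncated or malformed sequences yield U+FFFD, and overlong forms and encoded surrogates are not rejected.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

typedef std::int32_t code_point;

// Decodes one code point from [first, last) and advances first past it.
// Sketch only: malformed or truncated input yields U+FFFD.
template <class It>
code_point decode_utf8(It& first, It last) {
    assert(first != last);
    unsigned char lead = static_cast<unsigned char>(*first);
    ++first;
    int trailing = 0;
    code_point cp = 0;
    if (lead < 0x80) return lead;                                     // 1-byte (ASCII)
    else if ((lead & 0xE0) == 0xC0) { trailing = 1; cp = lead & 0x1F; } // 2-byte
    else if ((lead & 0xF0) == 0xE0) { trailing = 2; cp = lead & 0x0F; } // 3-byte
    else if ((lead & 0xF8) == 0xF0) { trailing = 3; cp = lead & 0x07; } // 4-byte
    else return 0xFFFD;                                               // invalid lead byte
    for (; trailing > 0; --trailing) {
        if (first == last) return 0xFFFD;                             // truncated sequence
        unsigned char c = static_cast<unsigned char>(*first);
        if ((c & 0xC0) != 0x80) return 0xFFFD;                        // not a continuation byte
        ++first;
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}

// Writes cp to out as one or two UTF-16 code units.
template <class Out>
Out encode_utf16(code_point cp, Out out) {
    if (cp < 0x10000) {
        *out++ = static_cast<std::uint16_t>(cp);
    } else {
        cp -= 0x10000;
        *out++ = static_cast<std::uint16_t>(0xD800 + (cp >> 10));     // high surrogate
        *out++ = static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF));   // low surrogate
    }
    return out;
}

// Converts a whole UTF-8 byte string to a sequence of UTF-16 code units.
std::vector<std::uint16_t> utf8_to_utf16(const std::string& in) {
    std::vector<std::uint16_t> out;
    std::string::const_iterator it = in.begin();
    while (it != in.end())
        encode_utf16(decode_utf8(it, in.end()), std::back_inserter(out));
    return out;
}
```

An iterator adapter would perform the decode step in operator* and operator++ instead of in a loop, so that the conversion happens lazily as the sequence is traversed.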
3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, typedef the appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
As for using UTF-8 as the internal encoding, I personally would suggest that UTF-16 be used instead, because UTF-8 is rather inefficient to work with. Although I am not overly attached to UTF-16, I do think it is important to standardize on a single internal representation, because for practical reasons it is useful to be able to have non-templated APIs for purposes such as collating.

The other issues I see with using basic_string are that many of its methods would not be suitable for use with a Unicode string, and that it has no operator+= that appends a single Unicode code point (represented as a 32-bit integer). What it comes down to is that basic_string is designed with fixed-width character representations in mind. I would be more in favor of creating a separate type to represent Unicode strings.
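A minimal sketch of what such a separate type could look like follows, assuming UTF-16 as the single internal representation. The class name unicode_string and all of its members are hypothetical, and only the code point append is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::int32_t code_point;

// Hypothetical Unicode string type storing UTF-16 code units internally.
class unicode_string {
public:
    // Appends one code point, encoding it as one or two UTF-16 code units.
    unicode_string& operator+=(code_point cp) {
        // Precondition: cp is a Unicode scalar value.
        assert(cp >= 0 && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x10000) {
            units_.push_back(static_cast<std::uint16_t>(cp));
        } else {
            cp -= 0x10000;
            units_.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            units_.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        }
        return *this;
    }

    // Number of UTF-16 code units stored.
    std::size_t code_unit_count() const { return units_.size(); }

    // Number of code points: every code unit except a low surrogate starts one.
    std::size_t code_point_count() const {
        std::size_t n = 0;
        for (std::size_t i = 0; i < units_.size(); ++i)
            if (!(units_[i] >= 0xDC00 && units_[i] <= 0xDFFF)) ++n;
        return n;
    }

    const std::vector<std::uint16_t>& code_units() const { return units_; }

private:
    std::vector<std::uint16_t> units_;
};
```

Because the storage type is fixed, functions such as a collator could take a const unicode_string& and live in a compiled library rather than in templates. The split between code_unit_count and code_point_count shows the variable-width issue that basic_string's single size() does not express.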
4) define low level access to the core Unicode data properties (in UnicodeData.txt).
Reuse of the ICU library would probably be very helpful in this.
5) Begin to add locale support - a big job, probably a few facets at a time.
The issue is that, despite what you say, most or all of the standard library facets are not suitable for use with Unicode strings. For instance, the character classification and toupper-like operations need not be tied to a locale. Furthermore, many operations such as toupper are not well defined on a single character, and must instead be defined as a string-to-string mapping. Finally, the single-character type must be a 32-bit integer, while the code unit type will probably not be (since UTF-32 as the internal representation would be inefficient).

Specific cases include collate<Ch>, which lacks an interface for configuring collation: which strength level to use, whether uppercase or lowercase letters should sort first, whether in French locales accents should be sorted right to left, and other such features. It is true that an additional, more powerful interface could be provided, but this would add complexity. collate<Ch> also depends on basic_string<Ch> (note lack of char_traits specification), which is used as the return type of transform, when something representing a byte array might be more suitable.

Similarly, num_put, moneypunct and money_put all would allow only a single code unit in a number of cases where a string of multiple code points would be suitable. Those facets also depend on basic_string<Ch>.
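The point about toupper needing to be a string-to-string mapping can be illustrated with a toy example. The function to_upper below is hypothetical and handles only ASCII letters plus two characters whose full uppercase mapping expands to two code points: U+00DF (sharp s, which maps to "SS") and U+FB01 (the fi ligature, which maps to "FI"). A real implementation would be driven by the Unicode data files rather than by hard-coded cases.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::int32_t code_point;
typedef std::vector<code_point> code_points;

// Toy uppercase mapping over a sequence of code points.
// The output may be longer than the input, which a per-character
// toupper(Ch) -> Ch interface cannot express.
code_points to_upper(const code_points& in) {
    code_points out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        code_point cp = in[i];
        if (cp >= 'a' && cp <= 'z') {
            out.push_back(cp - 'a' + 'A');   // ASCII letters map one to one
        } else if (cp == 0x00DF) {           // sharp s -> "SS"
            out.push_back('S');
            out.push_back('S');
        } else if (cp == 0xFB01) {           // fi ligature -> "FI"
            out.push_back('F');
            out.push_back('I');
        } else {
            out.push_back(cp);               // everything else is left unchanged here
        }
    }
    return out;
}
```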
6) define iterator adapters for various Unicode algorithms (composition/decomposition/compression etc).
7) Anything I've forgotten :-)
A facility for Unicode substring matching, which would use the collation facilities, would be useful. This could be based on the ICU implementation. Additionally, a date formatting facility for Unicode would be useful.
[snip]
-- Jeremy Maitin-Shepard