
1) define the data types for 8/16/32 bit Unicode characters.
unsigned char for UTF-8 code units, boost::uint16_t for UTF-16 code units, and boost::int32_t for UTF-32 code units and for representing Unicode code points.
Almost, ICU uses wchar_t for UTF-16 on Win32 (just to complicate things).
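As a sketch, those typedefs might look like the following (utf16_t and utf32_t are names assumed here to match the utf8_t used later in this thread, and &lt;cstdint&gt; stands in for boost/cstdint.hpp):

```cpp
#include <cassert>
#include <cstdint>   // standing in for boost/cstdint.hpp

typedef unsigned char utf8_t;    // UTF-8 code unit
typedef std::uint16_t utf16_t;   // UTF-16 code unit
typedef std::int32_t  utf32_t;   // UTF-32 code unit, also holds a Unicode code point
```

Note that on Win32, interoperating with ICU's wchar_t-based UTF-16 would then need a cast or a separate typedef.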
2) define iterator adapters to convert a sequence of one Unicode character type to another.
This is easy enough: unicode.org provides optimized C code for this purpose, which could easily be adapted for use in iterator adapters. Alternatively, ICU provides this, though less directly.
Yep.
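To illustrate the conversion layer (this is a minimal hand-written sketch, not the optimized unicode.org code, and it does no validation of malformed sequences):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Minimal sketch of UTF-8 -> UTF-32 decoding. The lead byte tells us how
// many continuation bytes follow; each continuation byte contributes 6 bits.
std::vector<std::int32_t> utf8_to_utf32(const std::string& s)
{
    std::vector<std::int32_t> out;
    for (std::size_t i = 0; i < s.size(); )
    {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::int32_t cp;
        int extra;
        if (c < 0x80)      { cp = c;        extra = 0; }  // ASCII
        else if (c < 0xE0) { cp = c & 0x1F; extra = 1; }  // 2-byte sequence
        else if (c < 0xF0) { cp = c & 0x0F; extra = 2; }  // 3-byte sequence
        else               { cp = c & 0x07; extra = 3; }  // 4-byte sequence
        ++i;
        for (int k = 0; k < extra && i < s.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}
```

An iterator adapter would perform the same per-code-point step lazily on dereference instead of converting the whole sequence up front.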
3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, typedef the appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
As for the use of UTF-8 as the internal encoding, I would personally suggest UTF-16 instead, because UTF-8 is rather inefficient to work with. Although I am not overly attached to UTF-16, I do think it is important to standardize on a single internal representation, because for practical reasons it is useful to be able to have non-templated APIs for purposes such as collating.
You can use whatever you want - I don't think users should be constrained to a specific internal encoding. Personally I don't like UTF8 either, but I know some people do...
The other issues I see with using basic_string are that many of its methods would not be suitable for use with a Unicode string, and that it lacks something like an operator+= for appending a single Unicode code point (represented as a 32-bit integer).
What it comes down to is that basic_string is designed with fixed-width character representations in mind.
I would be more in favor of creating a separate type to represent Unicode strings.
Personally I think we have too many string types around already. While I understand your concerns about basic_string, as a container of code points it's just fine IMO. We can always add non-member functions for more advanced manipulation.
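For example, the operator+= concern above could be met by a non-member function that appends one code point as one or two UTF-16 code units (a sketch; append_code_point and the utf16_string typedef are hypothetical names, and char16_t is used here because modern C++ guarantees char_traits for it):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

typedef std::basic_string<char16_t> utf16_string;  // hypothetical typedef

// Append a single Unicode code point to a UTF-16 string, splitting it
// into a surrogate pair when the code point lies outside the BMP.
void append_code_point(utf16_string& s, std::int32_t cp)
{
    if (cp < 0x10000)
    {
        s += static_cast<char16_t>(cp);
    }
    else
    {
        cp -= 0x10000;
        s += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
        s += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    }
}
```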
4) define low level access to the core Unicode data properties (in unidata.txt).
Reuse of the ICU library would probably be very helpful in this.
5) Begin to add locale support - a big job, probably a few facets at a time.
The issue is that, despite what you say, most or all of the standard library facets are not suitable for use with Unicode strings. For instance, the character classification and toupper-like operations need not be tied to a locale.
Accepted - ctype operations are largely (though not completely) independent of the locale; that just makes the ctype specialisations easier IMO.
Furthermore, many of the operations such as toupper on a single character are not well defined, and rather must be defined as a string to string mapping.
I know; however, 1-to-1 approximations are available (those in unidata.txt). I'm not saying that the std locale facets should be the only interface, or even the primary one, but providing them does get a lot of other stuff working.
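To illustrate the 1-to-1 approximation: a simple-case toupper over code points reduces to a table lookup (the two-entry table below is of course only a tiny illustrative subset of the unidata.txt mappings, and simple_toupper is a hypothetical name):

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Simple (1-to-1) uppercase mapping for code points. Full-case mappings
// such as German sharp s (U+00DF -> "SS") cannot be expressed this way,
// which is exactly the string-to-string problem discussed above.
std::int32_t simple_toupper(std::int32_t cp)
{
    // Tiny illustrative subset of the unidata.txt uppercase field.
    static const std::map<std::int32_t, std::int32_t> table = {
        { 0x00E9, 0x00C9 },  // e-acute -> E-acute
        { 0x03B1, 0x0391 },  // Greek small alpha -> capital Alpha
    };
    if (cp >= 'a' && cp <= 'z')
        return cp - ('a' - 'A');
    std::map<std::int32_t, std::int32_t>::const_iterator it = table.find(cp);
    return it != table.end() ? it->second : cp;
}
```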
Finally, the single-character type must be a 32-bit integer, while the code unit type will probably not be (since UTF-32 as the internal representation would be inefficient).
True: with UTF-16, only the core Unicode subset would be supported by std::locale (i.e. no surrogate pairs); this is the same as the situation in Java and JavaScript.
Specific cases include collate<Ch>, which lacks an interface for configuring collation, such as which strength level to use, whether uppercase or lowercase letters should sort first, whether in French locales accents should be sorted right to left, and other such features. It is true that an additional, more powerful interface could be provided, but this would add complexity.
You can provide any constructor interface to the collate facet that you want; for example, to support a locale name and a strength level one might use:

    template <class charT>
    class unicode_collate : public std::collate<charT>
    {
    public:
       unicode_collate(const char* name, int level = INT_MAX);
       /* details */
    };

I'm assuming that we have a non-member function to create a locale object that contains a set of Unicode facets:

    std::locale create_unicode_locale(const char* name);

Usage to create a locale object with primary level collation would then be:

    std::locale l(create_unicode_locale("en_GB"),
                  new unicode_collate<char>("en_GB", 1));
    mystream.imbue(l);
    mystream << something; // etc.
Additionally, it depends on basic_string<Ch> (note lack of char_traits specification), which is used as the return type of transform, when something representing a byte array might be more suitable.
You might have me on that one :-)
Additionally, num_put, moneypunct and money_put would all allow only a single code unit in a number of cases where a string of multiple code points would be appropriate. Those facets also depend on basic_string<Ch>.
I don't understand what the problem is there, please explain.
6) define iterator adapters for various Unicode algorithms (composition/decomposition/compression etc).
7) Anything I've forgotten :-)
A facility for Unicode substring matching, which would use the collation facilities, would be useful. This could be based on the ICU implementation.
Additionally, a date formatting facility for Unicode would be useful.
std::time_get / std::time_put ? :-) John.
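To show what the existing facets already give you, here is a std::time_put usage sketch in the classic "C" locale (format_year is a hypothetical helper name):

```cpp
#include <cassert>
#include <ctime>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Format a year via the std::time_put facet of the stream's locale
// (the classic "C" locale here), using the 'Y' conversion.
std::string format_year(int year)
{
    std::ostringstream os;
    std::tm t = std::tm();
    t.tm_year = year - 1900;  // tm_year counts from 1900
    std::use_facet<std::time_put<char> >(os.getloc()).put(
        std::ostreambuf_iterator<char>(os), os, ' ', &t, 'Y');
    return os.str();
}
```

A Unicode-aware version would presumably follow the same facet shape but produce locale-appropriate calendar text in the chosen internal encoding.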