
----- Original Message ----- From: "Rogier van Dalen" <rogiervd@gmail.com>
I've recently started on the first draft of a Unicode library.
Interesting. Is there a discussion going on about this library that I have missed, or haven't you posted anything about it yet? I'd hate to start something like this if an effort is already being made on the subject.
An assumption I think is wrong is that wchar_t would be suitable for Unicode. Correct me if I'm wrong, but IIRC wchar_t has 16 bits on Microsoft compilers, for example. The utf8_codecvt_facet implementation will on these compilers cut off any codepoints over 0xFFFF. (U+1D12C will come out as U+D12C.)
I agree. The "Unicode is wide strings" assumption is wrong in my opinion, and I would strive to provide a correct implementation based on the Unicode standard if I were to go ahead with this.
I think a definition of unicode::code as uint32_t would be much better. Problem is, codecvt is only implemented for wchar_t and char, so it's not possible to make a Unicode codecvt without manually adding (dummy) implementations of codecvt<unicode::code,char,mbstate_t> to the std namespace. I guess this is the reason that Ron Garcia just used wchar_t.
I don't really feel locking the code unit size to 32 bits is a good solution either, as strings would then become unnecessarily large. In a test implementation I have recently made, I templated the entire encoding scheme (using an encoding_traits class) and made a common interface for strings that lets you iterate over the code points it controls, no matter what the underlying encoding is. (I will post another message with more details of this library.) This does of course make for problems with other parts of the standard, but solutions to these problems are what I want my thesis to be all about.
About Unicode strings: I suggest having a codepoint_string, with the string of code units as a template parameter. Its interface should work with 21 (32) bit values, while internally these are converted to UTF-8, UTF-16, or remain UTF-32.

template <class CodeUnitString>
class codepoint_string {
    CodeUnitString code_units;
    // ...
};
The real unicode::string would be the character string, which uses a base character with its combining marks for its interface.

template <class CodePointString>
class string {
    CodePointString codepoints;
    // ...
};
So unicode::string<unicode::codepoint_string<std::string> > would be a UTF8-encoded string that is manipulated using its characters.
unicode::string should take care of correctly searching for a character string, rather than a codepoint string.
Thanks. I will take that into consideration. I'm glad to hear any design/implementation ideas, since I want this library to be usable by as many people as possible.
operator< has never done "the right thing" anyway: it does not distinguish between uppercase and lowercase, for example. Probably, locales should be used for collation. The Unicode collation algorithm is pretty well specified.
Yes. I hope to be able to add support for the collation algorithm to enable proper, locale specific collation.