
----- Original Message ----- From: "Rogier van Dalen" <rogiervd@gmail.com>
I've recently started on the first draft of a Unicode library.
Interesting. Is there a discussion going on about this library that I have missed, or haven't you posted anything about it yet? I'd hate to start something like this if an effort is already being made on the subject.
An assumption I think is wrong is that wchar_t would be suitable for Unicode. Correct me if I'm wrong, but IIRC wchar_t has 16 bits on Microsoft compilers, for example. The utf8_codecvt_facet implementation will on these compilers cut off any codepoints over 0xFFFF. (U+1D12C will come out as U+D12C.)
I agree. The "Unicode is wide strings" assumption is wrong in my opinion, and I would strive to provide a correct implementation based on the Unicode standard if I were to go ahead with this.
I think a definition of unicode::code as uint32_t would be much better. Problem is, codecvt is only implemented for wchar_t and char, so it's not possible to make a Unicode codecvt without manually adding (dummy) implementations of codecvt<unicode::code,char,mbstate_t> to the std namespace. I guess this is the reason that Ron Garcia just used wchar_t.
I don't really feel locking the code unit size to 32 bits is a good solution either, as strings would then become unnecessarily large. In a test implementation I have recently made, I templated the entire encoding scheme (using an encoding_traits class) and made a common interface for strings that lets you iterate over the code points it controls, no matter what the underlying encoding is. (I will post another message with more details of this library.) This does of course make for problems with other parts of the standard, but solutions to these problems are what I want my thesis to be all about.
About Unicode strings: I suggest having a codepoint_string, with the string of code units as a template parameter. Its interface should work with 21 (32) bit values, while internally these are converted to UTF-8, UTF-16, or remain UTF-32.

template <class CodeUnitString>
class codepoint_string {
    CodeUnitString code_units;
    // ...
};
The real unicode::string would be the character string, which uses a base character with its combining marks for its interface.

template <class CodePointString>
class string {
    CodePointString codepoints;
    // ...
};
So unicode::string<unicode::codepoint_string<std::string> > would be a UTF8-encoded string that is manipulated using its characters.
unicode::string should take care of correctly searching for a character string, rather than a codepoint string.
Thanks. I will take that into consideration. I'm glad to hear any design/implementation ideas, since I want this library to be usable by as many people as possible.
operator< has never done "the right thing" anyway: it does not distinguish between uppercase and lowercase, for example. Probably, locales should be used for collation. The Unicode collation algorithm is pretty well specified.
Yes. I hope to be able to add support for the collation algorithm to enable proper, locale specific collation.