
Mathias Gaunard wrote:
> Hi everyone. I'm in charge of the Unicode Google Summer of Code project.
> For the past week I have been working on range adaptors to iterate over
> code points in a UTF-x string, as well as on converting those code points
> back to UTF-y.
That's good; these are needed. Also needed are tables that store the various character properties, and (hopefully) some parsers that build the tables directly from the Unicode character database so we can easily rev them whenever the database changes.
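To make the table point concrete, the sort of row a generator run over UnicodeData.txt might emit could look roughly like this; the struct layout, field encodings, and the three sample rows are purely illustrative, not a proposed interface:

struct codepoint_range_properties
{
    char32_t first, last;            // inclusive code point range
    unsigned char general_category;  // e.g. Lu, Ll, Mn, stored as an enum value
    unsigned char combining_class;   // canonical combining class, 0..254
};

// Rows kept sorted by code point so lookups can binary-search at run time.
static const codepoint_range_properties sample_rows[] =
{
    { 0x0041, 0x005A, /* Lu */ 1, 0 },   // LATIN CAPITAL LETTER A..Z
    { 0x0061, 0x007A, /* Ll */ 2, 0 },   // LATIN SMALL LETTER A..Z
    { 0x0300, 0x0314, /* Mn */ 3, 230 }  // combining accents above, ccc 230
};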
> I stopped working on these for a bit to put together some short documentation
> (which is my first quickbook document, so it may not be very pretty). This is
> not documentation of the final work, but rather of what I'm working on at the
> moment.
> I would like to know everyone's opinion of the concepts I am defining, which
> assume the range being worked on is a valid Unicode range in a particular
> encoding, as well as of the system used to enforce those concepts.
> Also, I put the normalization form C as part of the invariant, but maybe
> that should be something orthogonal. I personally don't think it's really
> useful for general-purpose text though.

The invariant of what? The internal data that the iterators traverse? Which iterators? All of them? Are you really talking about an invariant (something that is true of the data both before and after each operation completes), or of pre- or post-conditions?
I should hope there is a way to operate on valid Unicode ranges that happen not to be in normalization form C.
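As a concrete example, one that also shows why a grapheme cluster can span several code points: the character é can be spelled either precomposed or decomposed, only the former is in NFC, and yet both are perfectly valid Unicode:

// Both arrays spell the single character "é" (one grapheme cluster).
// Only the first is in Normalization Form C; the second, decomposed
// form still shows up in real data (HFS+ file names, for instance,
// are stored in a decomposed form).
const char32_t precomposed[] = { 0x00E9, 0 };          // U+00E9 LATIN SMALL LETTER E WITH ACUTE
const char32_t decomposed[]  = { 0x0065, 0x0301, 0 };  // U+0065 'e' + U+0301 COMBINING ACUTE ACCENT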
> While the system doesn't provide conversion from other character sets, this
> can easily be added by using assume_utf32. For example, using an ISO-8859-1
> string as input to assume_utf32 just works, since ISO-8859-1 is included
> verbatim in Unicode.
I personally haven't taken the time to learn how ICU handles Unicode input and character set conversions. It might be illustrative to see how an established and respected Unicode library handles issues like this.
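For reference, the reason the ISO-8859-1 case "just works" is that every Latin-1 byte value is numerically identical to the corresponding Unicode code point (U+0000..U+00FF), so the conversion is a plain widening. A minimal hand-rolled illustration, not the proposed library interface:

#include <string>

// Widen an ISO-8859-1 (Latin-1) string to UTF-32 code points. Each byte,
// read as an unsigned value 0x00..0xFF, already is the code point.
std::u32string latin1_to_utf32(const std::string& latin1)
{
    std::u32string out;
    out.reserve(latin1.size());
    for (std::string::size_type i = 0; i != latin1.size(); ++i)
        out.push_back(static_cast<unsigned char>(latin1[i]));
    return out;
}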
> The documentation also contains some introductory Unicode material.
> You can find it online here:
> http://mathias.gaunard.emi.u-bordeaux1.fr/unicode/doc/html/
Thanks for posting this. Some comments.

<<Core Types>>

The library provides the following core types in the boost namespace: uchar8_t, uchar16_t, uchar32_t. In C++0x, these are called char, char16_t and char32_t. I think uchar8_t is unnecessary, and for a Boost Unicode library, boost::char16 and boost::char32 would work just fine. On a C++0x compiler, they should be typedefs for char16_t and char32_t.

<<Concepts>>

I strongly disagree with requiring normalization form C for the concept UnicodeRange. There are many more valid Unicode sequences. And the UnicodeGrapheme concept doesn't make sense to me. You say, "A model of UnicodeGrapheme is a range of Unicode code points that is a single grapheme cluster in Normalized Form C." A grapheme cluster != a Unicode code point. It may be many code points representing a base character and many zero-width combining characters. So what exactly is being traversed by a UnicodeGrapheme range?

The concepts are of critical importance, and these don't seem right to me. My C++0x concept-foo is weak, and I'd like to involve many more people in this discussion. The purpose of the concepts is to allow algorithms to be implemented generically in terms of the operations provided by the concepts. So, what algorithms do we need, and how can we express them generically in terms of concepts? Without that most critical step, we'll get the concepts all wrong.

I imagine we'll want algorithms for converting from one encoding to another, or from one normalization form (or, more likely, from no normalization form) to another, so we'll need to constrain the algorithms to specific encodings and/or normalization forms. We'll also need a concept that represents Unicode input that hasn't yet been normalized (perhaps in each of the encodings?). Point is, the concrete algorithms must come first.

We may end up back with a single perfectly general UnicodeRange that all algorithms can be implemented in terms of. That'd be nice, but I bet we end up with refinements for the different encodings/normalized forms that make it possible to implement some algorithms much more efficiently.

(I stopped reading the docs at this point.)

--
Eric Niebler
BoostPro Computing
http://www.boostpro.com
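P.S. A sketch of what the boost::char16 / boost::char32 suggestion could look like; the version test and the pre-C++0x fallback types are illustrative only, and real code would go through a Boost.Config feature macro rather than checking __cplusplus directly:

#include <boost/cstdint.hpp>

namespace boost
{
#if __cplusplus >= 201103L
    // C++0x/C++11 compiler: alias the real built-in character types.
    typedef char16_t char16;
    typedef char32_t char32;
#else
    // Older compilers: fall back to fixed-width integers. They have the
    // right size but are not distinct types, so overloading on them and
    // type safety are weaker than with the built-ins.
    typedef boost::uint16_t char16;
    typedef boost::uint32_t char32;
#endif
}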