
Mathias Gaunard wrote:
> Hi everyone. I'm in charge of the Unicode Google Summer of Code project.
> For the past week I have been working on range adaptors to iterate over
> code points in a UTF-x string, as well as on converting those code points
> back to UTF-y.
That's good; these are needed. Also needed are tables that store the various character properties, and (hopefully) some parsers that build the tables directly from the Unicode character database so we can easily rev them whenever the database changes.
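To make the table point concrete, the sort of row a generator run over UnicodeData.txt might emit could look roughly like this; the struct layout, field encodings, and the three sample rows are purely illustrative, not a proposed interface:

struct codepoint_range_properties
{
    char32_t first, last;            // inclusive code point range
    unsigned char general_category;  // e.g. Lu, Ll, Mn, stored as an enum value
    unsigned char combining_class;   // canonical combining class, 0..254
};

// Rows kept sorted by code point so lookups can binary-search at run time.
static const codepoint_range_properties sample_rows[] =
{
    { 0x0041, 0x005A, /* Lu */ 1, 0 },   // LATIN CAPITAL LETTER A..Z
    { 0x0061, 0x007A, /* Ll */ 2, 0 },   // LATIN SMALL LETTER A..Z
    { 0x0300, 0x0314, /* Mn */ 3, 230 }  // combining accents above, ccc 230
};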
> I stopped working on these for a bit to put together some short documentation
> (which is my first quickbook document, so it may not be very pretty). This is
> not documentation of the final work, but rather of what I'm working on at the
> moment.
> I would like to know everyone's opinion of the concepts I am defining, which
> assume the range being worked on is a valid Unicode range in a particular
> encoding, as well as of the system used to enforce those concepts.
> Also, I put the normalization form C as part of the invariant, but maybe
> that should be something orthogonal. I personally don't think it's really
> useful for general-purpose text though.

The invariant of what? The internal data that the iterators traverse? Which iterators? All of them? Are you really talking about an invariant (something that is true of the data both before and after each operation completes), or of pre- or post-conditions?
I should hope there is a way to operate on valid Unicode ranges that happen not to be in normalization form C.
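As a concrete example, one that also shows why a grapheme cluster can span several code points: the character é can be spelled either precomposed or decomposed, only the former is in NFC, and yet both are perfectly valid Unicode:

// Both arrays spell the single character "é" (one grapheme cluster).
// Only the first is in Normalization Form C; the second, decomposed
// form still shows up in real data (HFS+ file names, for instance,
// are stored in a decomposed form).
const char32_t precomposed[] = { 0x00E9, 0 };          // U+00E9 LATIN SMALL LETTER E WITH ACUTE
const char32_t decomposed[]  = { 0x0065, 0x0301, 0 };  // U+0065 'e' + U+0301 COMBINING ACUTE ACCENT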
> While the system doesn't provide conversion from other character sets, this
> can easily be added by using assume_utf32. For example, using an ISO-8859-1
> string as input to assume_utf32 just works, since ISO-8859-1 is included
> verbatim in Unicode.
I personally haven't taken the time to learn how ICU handles Unicode input and character set conversions. It might be illustrative to see how an established and respected Unicode library handles issues like this.
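For reference, the reason the ISO-8859-1 case "just works" is that every Latin-1 byte value is numerically identical to the corresponding Unicode code point (U+0000..U+00FF), so the conversion is a plain widening. A minimal hand-rolled illustration, not the proposed library interface:

#include <string>

// Widen an ISO-8859-1 (Latin-1) string to UTF-32 code points. Each byte,
// read as an unsigned value 0x00..0xFF, already is the code point.
std::u32string latin1_to_utf32(const std::string& latin1)
{
    std::u32string out;
    out.reserve(latin1.size());
    for (std::string::size_type i = 0; i != latin1.size(); ++i)
        out.push_back(static_cast<unsigned char>(latin1[i]));
    return out;
}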
> The documentation also contains some introductory Unicode material.
> You can find it online here:
> http://mathias.gaunard.emi.u-bordeaux1.fr/unicode/doc/html/
Thanks for posting this. Some comments.

<<Core Types>>

The library provides the following core types in the boost namespace: uchar8_t, uchar16_t, uchar32_t. In C++0x, these are called char, char16_t and char32_t. I think uchar8_t is unnecessary, and for a Boost Unicode library, boost::char16 and boost::char32 would work just fine. On a C++0x compiler, they should be typedefs for char16_t and char32_t.

<<Concepts>>

I strongly disagree with requiring normalization form C for the concept UnicodeRange. There are many more valid Unicode sequences. And the UnicodeGrapheme concept doesn't make sense to me. You say, "A model of UnicodeGrapheme is a range of Unicode code points that is a single grapheme cluster in Normalized Form C." A grapheme cluster != a Unicode code point. It may be many code points representing a base character and many zero-width combining characters. So what exactly is being traversed by a UnicodeGrapheme range?

The concepts are of critical importance, and these don't seem right to me. My C++0x concept-foo is weak, and I'd like to involve many more people in this discussion. The purpose of the concepts is to allow algorithms to be implemented generically in terms of the operations provided by the concepts. So, what algorithms do we need, and how can we express them generically in terms of concepts? Without that most critical step, we'll get the concepts all wrong.

I imagine we'll want algorithms for converting from one encoding to another, or from one normalization form (or, more likely, from no normalization form) to another, so we'll need to constrain the algorithms to specific encodings and/or normalization forms. We'll also need a concept that represents Unicode input that hasn't yet been normalized (perhaps in each of the encodings?). Point is, the concrete algorithms must come first.

We may end up back with a single perfectly general UnicodeRange that all algorithms can be implemented in terms of. That'd be nice, but I bet we end up with refinements for the different encodings/normalized forms that make it possible to implement some algorithms much more efficiently.

(I stopped reading the docs at this point.)

--
Eric Niebler
BoostPro Computing
http://www.boostpro.com
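P.S. A sketch of what the boost::char16 / boost::char32 suggestion could look like; the version test and the pre-C++0x fallback types are illustrative only, and real code would go through a Boost.Config feature macro rather than checking __cplusplus directly:

#include <boost/cstdint.hpp>

namespace boost
{
#if __cplusplus >= 201103L
    // C++0x/C++11 compiler: alias the real built-in character types.
    typedef char16_t char16;
    typedef char32_t char32;
#else
    // Older compilers: fall back to fixed-width integers. They have the
    // right size but are not distinct types, so overloading on them and
    // type safety are weaker than with the built-ins.
    typedef boost::uint16_t char16;
    typedef boost::uint32_t char32;
#endif
}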