[boost] [RFC] Unicode and Converters/Segmenters

1 Aug 2010

It's been a while (about a year, actually) since I last got feedback 
on my Unicode library, so here I am requesting comments.

The Unicode library provides facilities to convert between UTF and 
locale encodings in a way that is as nice and generic to use as 
possible, as well as a few Unicode character properties that can be 
used for normalization or segmentation into graphemes.

The largest part of the library is actually a fairly intricate generic 
Converter and Segmenter system which, among other things, lets you 
define a variable-width N-to-M conversion step in an easy and stateless 
way.
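
As a rough illustration of what a stateless, variable-width conversion 
step can look like, here is a minimal sketch. It is not the library's 
actual interface: the names utf8_encode_step, convert_all and 
encode_utf8 are hypothetical, and input validation is left out. One 
step reads exactly one code point (N = 1) and writes one to four UTF-8 
code units (M = 1..4); the driver simply repeats the step until the 
input is exhausted.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical converter (not the library's class): each call is one step
// that reads exactly one code point and writes 1 to 4 UTF-8 code units.
// The object has no members, so nothing is remembered between steps.
// Validation (surrogates, values above U+10FFFF) is deliberately omitted.
struct utf8_encode_step {
    template <typename In, typename Out>
    std::pair<In, Out> operator()(In first, In last, Out out) const {
        assert(first != last);  // a step needs at least one input element
        (void)last;
        const std::uint32_t cp = *first++;
        if (cp < 0x80) {
            *out++ = static_cast<char>(cp);
        } else if (cp < 0x800) {
            *out++ = static_cast<char>(0xC0 | (cp >> 6));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *out++ = static_cast<char>(0xE0 | (cp >> 12));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            *out++ = static_cast<char>(0xF0 | (cp >> 18));
            *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        }
        // Report how far both the input and the output have advanced.
        return std::make_pair(first, out);
    }
};

// Eager application: run steps until the whole input has been consumed.
template <typename Step, typename In, typename Out>
Out convert_all(Step step, In first, In last, Out out) {
    while (first != last) {
        std::pair<In, Out> r = step(first, last, out);
        first = r.first;
        out = r.second;
    }
    return out;
}

// Convenience wrapper, for illustration only.
inline std::string encode_utf8(const std::vector<std::uint32_t>& cps) {
    std::string result;
    convert_all(utf8_encode_step(), cps.begin(), cps.end(),
                std::back_inserter(result));
    return result;
}
```

Because the step only reports how far it advanced, the same object can 
be driven eagerly, as here, or one step at a time by an adaptor.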

The conversion can then be applied eagerly to the whole input, or step 
by step through an iterator or range adaptor, which essentially 
performs a lazy conversion.
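
To make the lazy case concrete, here is a small sketch of an input 
iterator that runs one conversion step only when its buffered output 
has been used up. All names (utf16_encode_step, lazy_convert_iterator, 
to_utf16_lazily) are hypothetical and the design is an assumption made 
for illustration, not the library's adaptor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical step: one code point in, one or two UTF-16 code units out.
struct utf16_encode_step {
    template <typename In, typename Out>
    std::pair<In, Out> operator()(In first, In last, Out out) const {
        assert(first != last);
        (void)last;
        std::uint32_t cp = *first++;
        if (cp < 0x10000) {
            *out++ = static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            *out++ = static_cast<char16_t>(0xD800 | (cp >> 10));
            *out++ = static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
        return std::make_pair(first, out);
    }
};

// Hypothetical adaptor: holds the output of a single step in a small
// buffer and performs the next step only once that buffer is consumed.
template <typename Step, typename In, typename T, std::size_t MaxOut>
class lazy_convert_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef T value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const T* pointer;
    typedef const T& reference;

    lazy_convert_iterator(In pos, In last, Step step = Step())
        : next_(pos), last_(last), step_(step), buf_(), idx_(0), len_(0) {
        fill();
    }

    reference operator*() const { return buf_[idx_]; }

    lazy_convert_iterator& operator++() {
        if (++idx_ == len_) fill();  // buffer used up: convert one more step
        return *this;
    }

    lazy_convert_iterator operator++(int) {
        lazy_convert_iterator old(*this);
        ++*this;
        return old;
    }

    friend bool operator==(const lazy_convert_iterator& a,
                           const lazy_convert_iterator& b) {
        return a.next_ == b.next_ && a.idx_ == b.idx_ && a.len_ == b.len_;
    }
    friend bool operator!=(const lazy_convert_iterator& a,
                           const lazy_convert_iterator& b) {
        return !(a == b);
    }

private:
    void fill() {
        idx_ = 0;
        len_ = 0;
        while (len_ == 0 && next_ != last_) {
            std::pair<In, T*> r = step_(next_, last_, buf_);
            next_ = r.first;
            len_ = static_cast<std::size_t>(r.second - buf_);
        }
    }

    In next_;   // first input element not yet converted
    In last_;   // end of the input
    Step step_;
    T buf_[MaxOut];
    std::size_t idx_, len_;
};

// Illustration: build a string by pulling from the lazy iterator pair.
inline std::u16string to_utf16_lazily(const std::vector<std::uint32_t>& cps) {
    typedef lazy_convert_iterator<utf16_encode_step,
        std::vector<std::uint32_t>::const_iterator, char16_t, 2> iter;
    return std::u16string(iter(cps.begin(), cps.end()),
                          iter(cps.end(), cps.end()));
}
```

Nothing is converted beyond what the consumer actually dereferences, 
apart from the one step that is buffered ahead.
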
Converters can be combined, and can be used to make codecvt facets, 
which allows them to be transparently applied by standard file streams. 
Converters can also be built from codecvt facets, which is how the 
Unicode library provides conversion between locale encodings.
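
For readers unfamiliar with the standard side of this, the following 
sketch shows what a file stream expects from a codecvt facet. It is 
written by hand against std::codecvt only and does not use the 
library; latin1_to_utf8_facet and write_latin1_as_utf8 are hypothetical 
names, and only the output direction is handled. A facet generated 
from a converter would presumably fill the same virtual functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical facet: internal chars are Latin-1 bytes, the external
// (file) encoding is UTF-8. Only do_out is overridden, so reading
// through this facet is not converted.
class latin1_to_utf8_facet : public std::codecvt<char, char, std::mbstate_t> {
public:
    explicit latin1_to_utf8_facet(std::size_t refs = 0)
        : std::codecvt<char, char, std::mbstate_t>(refs) {}

protected:
    result do_out(std::mbstate_t&, const char* from, const char* from_end,
                  const char*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        assert(from <= from_end && to <= to_end);
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            const unsigned char c = static_cast<unsigned char>(*from_next);
            const std::ptrdiff_t need = c < 0x80 ? 1 : 2;
            if (to_end - to_next < need) return partial;  // output is full
            if (c < 0x80) {
                *to_next++ = static_cast<char>(c);
            } else {
                *to_next++ = static_cast<char>(0xC0 | (c >> 6));
                *to_next++ = static_cast<char>(0x80 | (c & 0x3F));
            }
            ++from_next;
        }
        return ok;
    }

    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return 0; }   // variable width
    int do_max_length() const noexcept override { return 2; }
};

// Illustration: the stream applies the facet transparently on output.
inline bool write_latin1_as_utf8(const std::string& path,
                                 const std::string& latin1) {
    std::ofstream file;
    // Imbue before opening; the locale takes ownership of the facet.
    file.imbue(std::locale(file.getloc(), new latin1_to_utf8_facet));
    file.open(path.c_str(), std::ios::binary);
    file << latin1;
    file.close();
    return !file.fail();
}
```

Once the facet is imbued, code that writes to the stream does not need 
to know that a conversion is taking place.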

I think the whole system really deserves to be a library in its own 
right and not just part of Unicode, but I'm unsure how to deal with 
this in Boost.
I think it's quite cool, but I haven't really seen much interest in 
it. I may write a short tutorial on how to write base64 codecs with it 
and how to use them with iostreams, just to show it off a bit more 
outside of a Unicode context.

Anyway, the docs are here:
<http://mathias.gaunard.com/unicode/doc/html/>

And the code is on the sandbox:
<https://svn.boost.org/svn/boost/sandbox/SOC/2009/unicode/>

As I have said before, I will be submitting the full thing for formal 
review *soon*, i.e. mid-September.

The changes that will go in are mostly performance-related: I'm 
experimenting with things right now and running benchmarks, considering 
unsafe codecs and SIMD ones. SIMD is not just an implementation detail 
here: because evaluation is step by step, using SIMD means having a 
much larger step, and of course such a step cannot be safe.
I also need to tackle compile times, which are quite long: I need 
better header separation.
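
To illustrate why a wider step changes the interface and not just the 
implementation, here is a portable sketch that uses a fixed block of 8 
elements as a stand-in for a SIMD register. The names wide_width, 
widen_block_unchecked and latin1_to_utf16 are hypothetical, and a plain 
loop takes the place of real intrinsics. The wide step cannot check 
for the end of the input itself, so the caller has to guarantee that a 
full block is available and fall back to a one-element step for the 
tail.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical block size standing in for the width of a SIMD register.
const std::size_t wide_width = 8;

// Wide, unchecked step: always consumes wide_width Latin-1 bytes and
// always produces wide_width UTF-16 code units. It has no way to see
// where the buffers end, so bounds are entirely the caller's problem.
inline void widen_block_unchecked(const char* in, char16_t* out) {
    assert(in != nullptr && out != nullptr);  // debug-only sanity check
    for (std::size_t i = 0; i < wide_width; ++i)
        out[i] = static_cast<char16_t>(static_cast<unsigned char>(in[i]));
}

// Driver: use the wide step while a full block remains, then finish
// with a one-element step that is safe at the end of the input.
inline std::u16string latin1_to_utf16(const std::string& in) {
    std::u16string out(in.size(), u'\0');
    std::size_t i = 0;
    for (; i + wide_width <= in.size(); i += wide_width)
        widen_block_unchecked(in.data() + i, &out[i]);
    for (; i < in.size(); ++i)
        out[i] = static_cast<char16_t>(static_cast<unsigned char>(in[i]));
    return out;
}
```

An iterator adaptor built on the wide step would have to buffer a 
whole block per step, which is the larger granularity described above.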

I also need to find a better solution, from a binary point of view, to 
expose composition from the shared library, as the current one doesn't 
give much flexibility in implementation.

Mathias Gaunard