
On Thu, Feb 12, 2009 at 6:32 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Cory Nelson wrote:
Is there interest in having a Unicode codec library submitted to Boost?
Here is what I have (only tested with GCC 4.3 on Debian, and VC++ 2008): http://svn.int64.org/viewvc/int64/snips/unicode.hpp
Right now it is pretty simple to use:
transcode<utf8, utf16le>(forwarditerator, forwarditeratorend, outputiterator, traits [, maximum]);
transcode<wchar_encoding, utf32be>(inputrange, outputrange, traits [, maximum]);
There is also a codecvt facet which supports any-to-any.
Supports UTF-8, UTF-16, and UTF-32, in little or big endian. Has a special wchar_encoding that maps to UTF-16 or UTF-32 depending on your platform. A traits class controls error handling.
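For readers unfamiliar with the iterator-in/output-iterator-out shape described above, here is a minimal hand-rolled sketch of a UTF-8 to UTF-16 transcoder with a similar signature. The name utf8_to_utf16 and the lack of error handling are illustrative assumptions, not the library's actual API, which also validates input and handles the other encodings:

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Illustrative sketch only: same iterator-in / output-iterator-out shape as
// the transcode<> calls above, but assumes well-formed UTF-8 input.
template<typename InIt, typename OutIt>
OutIt utf8_to_utf16(InIt first, InIt last, OutIt out) {
    while (first != last) {
        std::uint32_t cp = static_cast<unsigned char>(*first++);
        // Sequence length from the lead byte (valid input assumed).
        int extra = cp < 0x80 ? 0 : cp < 0xE0 ? 1 : cp < 0xF0 ? 2 : 3;
        if (extra == 1)      cp &= 0x1F;
        else if (extra == 2) cp &= 0x0F;
        else if (extra == 3) cp &= 0x07;
        for (int i = 0; i < extra; ++i)  // gather continuation bytes
            cp = (cp << 6) | (static_cast<unsigned char>(*first++) & 0x3F);
        if (cp < 0x10000) {
            *out++ = static_cast<char16_t>(cp);
        } else {                         // above the BMP: surrogate pair
            cp -= 0x10000;
            *out++ = static_cast<char16_t>(0xD800 | (cp >> 10));
            *out++ = static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

Usage mirrors the first transcode<> form: pass a begin/end pair and a std::back_inserter as the output iterator.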
Hi Cory,
Yes, Boost definitely ought to have Unicode conversion. Yours is not the first proposal, and there is actually already one implementation hidden inside another Boost library. I wrote some UTF conversion code a while ago and was trying to build a more comprehensive character set library around it, but realised after a while that it was too mammoth a job. So I think that getting _just_ the UTF conversion into Boost, ASAP, is the right thing to do.
I've had a look at your code. I like that you have implemented what I called an "error policy". It is wasteful to continuously check the validity of input that comes from trusted code, but important to check it when it comes from an untrusted source. However, I'm not sure that your actual conversion is as efficient as it could be. I spent quite a while profiling my UTF-8 conversion and came up with this:
http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh
I think that you could largely copy and paste bits of that into the right places in your algorithm and get a significant speedup.
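The trusted-versus-untrusted distinction Phil describes maps naturally onto a policy class passed as a template parameter, so the validation branch compiles away entirely for trusted input. The sketch below is an assumption about how such a traits class might look; the names checked_policy, trusted_policy, and decode2 are invented for illustration, and it handles only 2-byte sequences to stay short:

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical error-policy traits: the decoder validates only when the
// policy asks for it, so trusted input pays no checking cost.
struct checked_policy {              // for untrusted input
    static void on_bad_sequence() { throw std::runtime_error("invalid UTF-8"); }
    static constexpr bool validate = true;
};

struct trusted_policy {              // for input already known to be valid
    static void on_bad_sequence() {} // never reached when validate is false
    static constexpr bool validate = false;
};

// Decode a 2-byte UTF-8 sequence; the policy controls whether malformed
// lead/continuation bytes are detected.
template<typename Policy>
std::uint32_t decode2(unsigned char lead, unsigned char cont) {
    if (Policy::validate) {
        if ((lead & 0xE0) != 0xC0 || (cont & 0xC0) != 0x80)
            Policy::on_bad_sequence();
    }
    return ((lead & 0x1F) << 6) | (cont & 0x3F);
}
```

With validate a compile-time constant, the check is dead code under trusted_policy and the optimizer removes it, which is the point of making error handling a policy rather than a runtime flag.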
I finally found some time to do some optimizations of my own and have had some good progress using a small lookup table, a switch, and slightly reduced branching. See line 318: http://svn.int64.org/viewvc/int64/snips/unicode.hpp?view=markup
Despite these efforts, Windows 7 still decodes UTF-8 three times faster (~750 MiB/s vs ~240 MiB/s) on my Core 2. I assume they are either using some gigantic lookup tables or SSE.
--
Cory Nelson
http://int64.org
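For context, the small-lookup-table technique mentioned above can be sketched as follows. This is an illustration of the general idea (a 32-entry length table indexed by the lead byte's top five bits, plus a fall-through switch that gathers continuation bytes), not the actual code at the URL, and it assumes well-formed input:

```cpp
#include <cstdint>

// 32-entry table indexed by (lead byte >> 3): one load replaces a chain of
// range comparisons. 0 marks bytes that cannot start a sequence.
static const unsigned char utf8_len[32] = {
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,  // 0xxxxxxx: ASCII
    0,0,0,0,0,0,0,0,                  // 10xxxxxx: continuation byte
    2,2,2,2,                          // 110xxxxx: 2-byte lead
    3,3,                              // 1110xxxx: 3-byte lead
    4,0                               // 11110xxx: 4-byte lead; 0xF8+ invalid
};

// Decode one code point starting at p; sequence length is returned in len.
// Assumes valid UTF-8 for brevity.
std::uint32_t decode_one(const unsigned char* p, int& len) {
    len = utf8_len[*p >> 3];
    static const std::uint32_t lead_mask[5] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    std::uint32_t cp = *p & lead_mask[len];
    switch (len) {                    // fall through gathers continuations
        case 4: cp = (cp << 6) | (p[len - 3] & 0x3F); // fallthrough
        case 3: cp = (cp << 6) | (p[len - 2] & 0x3F); // fallthrough
        case 2: cp = (cp << 6) | (p[len - 1] & 0x3F); // fallthrough
        case 1: break;
    }
    return cp;
}
```

Fast system decoders do tend to go further, consuming 16 bytes at a time with SIMD when the input is ASCII-heavy, which would be consistent with the throughput gap observed here.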