
On Thu, Feb 12, 2009 at 6:32 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Cory Nelson wrote:
Is there interest in having a Unicode codec library submitted to Boost?
Here is what I have (only tested with GCC 4.3 on Debian, and VC++ 2008): http://svn.int64.org/viewvc/int64/snips/unicode.hpp
Right now it is pretty simple to use:
transcode<utf8, utf16le>(forwarditerator, forwarditeratorend, outputiterator, traits [, maximum]);
transcode<wchar_encoding, utf32be>(inputrange, outputrange, traits [, maximum]);
There is also a codecvt facet which supports any-to-any.
Supports UTF-8, UTF-16, and UTF-32, in little or big endian. Has a special wchar_encoding that maps to UTF-16 or UTF-32 depending on your platform. A traits class controls error handling.
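For readers unfamiliar with the iterator-in/output-iterator-out shape described above, here is a minimal hand-rolled sketch of a UTF-8 to UTF-16 transcoder with a similar signature. The name utf8_to_utf16 and the lack of error handling are illustrative assumptions, not the library's actual API, which also validates input and handles the other encodings:

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Illustrative sketch only: same iterator-in / output-iterator-out shape as
// the transcode<> calls above, but assumes well-formed UTF-8 input.
template<typename InIt, typename OutIt>
OutIt utf8_to_utf16(InIt first, InIt last, OutIt out) {
    while (first != last) {
        std::uint32_t cp = static_cast<unsigned char>(*first++);
        // Sequence length from the lead byte (valid input assumed).
        int extra = cp < 0x80 ? 0 : cp < 0xE0 ? 1 : cp < 0xF0 ? 2 : 3;
        if (extra == 1)      cp &= 0x1F;
        else if (extra == 2) cp &= 0x0F;
        else if (extra == 3) cp &= 0x07;
        for (int i = 0; i < extra; ++i)  // gather continuation bytes
            cp = (cp << 6) | (static_cast<unsigned char>(*first++) & 0x3F);
        if (cp < 0x10000) {
            *out++ = static_cast<char16_t>(cp);
        } else {                         // above the BMP: surrogate pair
            cp -= 0x10000;
            *out++ = static_cast<char16_t>(0xD800 | (cp >> 10));
            *out++ = static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

Usage mirrors the first transcode<> form: pass a begin/end pair and a std::back_inserter as the output iterator.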
Hi Cory,
Yes, Boost definitely ought to have Unicode conversion. Yours is not the first proposal, and there is actually already one implementation hidden inside another Boost library. I wrote some UTF conversion code a while ago and was trying to build a more comprehensive character set library around it, but realised after a while that it was too mammoth a job. So I think that getting _just_ the UTF conversion into Boost, ASAP, is the right thing to do.
I've had a look at your code. I like that you have implemented what I called an "error policy". It is wasteful to continuously check the validity of input that comes from trusted code, but important to check it when it comes from an untrusted source. However, I'm not sure that your actual conversion is as efficient as it could be. I spent quite a while profiling my UTF-8 conversion and came up with this:
http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh
I think that you could largely copy and paste bits of that into the right places in your algorithm and get a significant speedup.
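The trusted-versus-untrusted distinction Phil describes maps naturally onto a policy class passed as a template parameter, so the validation branch compiles away entirely for trusted input. The sketch below is an assumption about how such a traits class might look; the names checked_policy, trusted_policy, and decode2 are invented for illustration, and it handles only 2-byte sequences to stay short:

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical error-policy traits: the decoder validates only when the
// policy asks for it, so trusted input pays no checking cost.
struct checked_policy {              // for untrusted input
    static void on_bad_sequence() { throw std::runtime_error("invalid UTF-8"); }
    static constexpr bool validate = true;
};

struct trusted_policy {              // for input already known to be valid
    static void on_bad_sequence() {} // never reached when validate is false
    static constexpr bool validate = false;
};

// Decode a 2-byte UTF-8 sequence; the policy controls whether malformed
// lead/continuation bytes are detected.
template<typename Policy>
std::uint32_t decode2(unsigned char lead, unsigned char cont) {
    if (Policy::validate) {
        if ((lead & 0xE0) != 0xC0 || (cont & 0xC0) != 0x80)
            Policy::on_bad_sequence();
    }
    return ((lead & 0x1F) << 6) | (cont & 0x3F);
}
```

With validate a compile-time constant, the check is dead code under trusted_policy and the optimizer removes it, which is the point of making error handling a policy rather than a runtime flag.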
I finally found some time to do some optimizations of my own and have had some good progress using a small lookup table, a switch, and slightly reduced branching. See line 318: http://svn.int64.org/viewvc/int64/snips/unicode.hpp?view=markup
Despite these efforts, Windows 7 still decodes UTF-8 three times faster (~750 MiB/s vs ~240 MiB/s) on my Core 2. I assume they are either using some gigantic lookup tables or SSE.
--
Cory Nelson
http://int64.org
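For context, the small-lookup-table technique mentioned above can be sketched as follows. This is an illustration of the general idea (a 32-entry length table indexed by the lead byte's top five bits, plus a fall-through switch that gathers continuation bytes), not the actual code at the URL, and it assumes well-formed input:

```cpp
#include <cstdint>

// 32-entry table indexed by (lead byte >> 3): one load replaces a chain of
// range comparisons. 0 marks bytes that cannot start a sequence.
static const unsigned char utf8_len[32] = {
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,  // 0xxxxxxx: ASCII
    0,0,0,0,0,0,0,0,                  // 10xxxxxx: continuation byte
    2,2,2,2,                          // 110xxxxx: 2-byte lead
    3,3,                              // 1110xxxx: 3-byte lead
    4,0                               // 11110xxx: 4-byte lead; 0xF8+ invalid
};

// Decode one code point starting at p; sequence length is returned in len.
// Assumes valid UTF-8 for brevity.
std::uint32_t decode_one(const unsigned char* p, int& len) {
    len = utf8_len[*p >> 3];
    static const std::uint32_t lead_mask[5] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    std::uint32_t cp = *p & lead_mask[len];
    switch (len) {                    // fall through gathers continuations
        case 4: cp = (cp << 6) | (p[len - 3] & 0x3F); // fallthrough
        case 3: cp = (cp << 6) | (p[len - 2] & 0x3F); // fallthrough
        case 2: cp = (cp << 6) | (p[len - 1] & 0x3F); // fallthrough
        case 1: break;
    }
    return cp;
}
```

Fast system decoders do tend to go further, consuming 16 bytes at a time with SIMD when the input is ASCII-heavy, which would be consistent with the throughput gap observed here.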