
Cory Nelson wrote:
Is there interest in having a Unicode codec library submitted to Boost?
Here is what I have (only tested with GCC 4.3 on Debian, and VC++ 2008): http://svn.int64.org/viewvc/int64/snips/unicode.hpp
Right now it is pretty simple to use:
    transcode<utf8, utf16le>(forwarditerator, forwarditeratorend, outputiterator, traits [, maximum]);
    transcode<wchar_encoding, utf32be>(inputrange, outputrange, traits [, maximum]);
There is also a codecvt facet which supports any-to-any.
Supports UTF-8, UTF-16, and UTF-32, in little or big endian. Has a special wchar_encoding that maps to UTF-16 or UTF-32 depending on your platform. A traits class controls error handling.
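A rough usage sketch, hedged: the transcode signature and the encoding tags (utf8, utf16le) are taken verbatim from the description above, but strict_traits is a placeholder name for whatever error-handling traits class the library actually provides.

    // Hypothetical usage of Cory's transcode(); strict_traits is a
    // placeholder, not a name from the actual library.
    #include "unicode.hpp"   // the header linked above
    #include <iterator>
    #include <string>
    #include <vector>

    int main()
    {
        std::string in = "\xC3\xA9tude";     // UTF-8 bytes for "étude"
        std::vector<unsigned short> out;     // UTF-16 code units

        // transcode<From, To>(first, last, output, traits [, maximum])
        transcode<utf8, utf16le>(in.begin(), in.end(),
                                 std::back_inserter(out),
                                 strict_traits());
        return 0;
    }

The second overload shown above would presumably take whole input and output ranges instead of an iterator pair.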
Hi Cory,

Yes, Boost definitely ought to have Unicode conversion. Yours is not the first proposal, and there is actually already one implementation hidden inside another Boost library. I wrote some UTF conversion code a while ago and was trying to build a more comprehensive character set library around it, but realised after a while that that was too mammoth a job. So I think that getting _just_ the UTF conversion into Boost, ASAP, is the right thing to do.

I've had a look at your code. I like that you have implemented what I called an "error policy" (sketched below). It is wasteful to continuously check the validity of input that comes from trusted code, but important to check it when it comes from an untrusted source.

However, I'm not sure that your actual conversion is as efficient as it could be. I spent quite a while profiling my UTF8 conversion and came up with this: http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh I think that you could largely copy & paste bits of that into the right places in your algorithm and get a significant speedup.

Having said all that, I must say that I actually use the code that I wrote quite rarely. I now tend to use UTF8 everywhere and treat it as a sequence of bytes. Because of the properties of UTF8, I find it's rare to need to identify individual code points. For example, if I'm scanning for a matching " or ) I can just look for the next matching byte, without worrying about where the character boundaries are (see the byte-scan sketch below). If I were to use a special UTF8-decoding iterator to do that scan, I would waste a lot of time doing unnecessary conversions.

I'm not sure what conclusion to draw from that: perhaps just that any "UTF8 string", or whatever, should come with a health warning that users should first learn how UTF8 works and review whether or not they actually need it.

Cheers,
Phil.
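A minimal sketch of the "error policy" idea Phil describes, assuming the policy is passed as a template parameter; the names here are illustrative and come from neither library.

    #include <stdexcept>

    // Validation points call Policy::check(ok); with trusting_policy the
    // call is an empty inline function the compiler can remove entirely,
    // so trusted input pays nothing for the checks.
    struct checking_policy {
        static void check(bool ok)
        {
            if (!ok) throw std::runtime_error("invalid UTF-8 sequence");
        }
    };

    struct trusting_policy {
        static void check(bool) {}  // trusted input: skip validation
    };

    // Example validation point: extract the payload of a UTF-8
    // continuation byte, which must have the form 10xxxxxx.
    template <class Policy>
    unsigned decode_continuation(unsigned char b)
    {
        Policy::check((b & 0xC0) == 0x80);
        return b & 0x3F;  // low six payload bits
    }

Code converting bytes from a socket would instantiate with checking_policy; code converting strings it produced itself could use trusting_policy.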
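And a self-contained sketch of the byte-level scan Phil mentions (an illustration, not code from his library). It is safe because every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII byte such as '"' can never appear inside the encoding of another character.

    #include <cassert>

    // Find the next '"' in a UTF-8 buffer by scanning raw bytes.
    // No decoding needed: bytes 0x00-0x7F never occur inside a
    // multi-byte UTF-8 sequence, so any match is a real quote.
    const char* find_quote(const char* p, const char* end)
    {
        while (p != end && *p != '"')
            ++p;
        return p;  // == end if not found
    }

    int main()
    {
        // "He said "héllo"" in UTF-8; é is the byte pair 0xC3 0xA9.
        const char text[] = "He said \"h\xC3\xA9llo\"";
        const char* end = text + sizeof(text) - 1;

        const char* open = find_quote(text, end);
        assert(open != end);

        const char* close = find_quote(open + 1, end);
        assert(close != end && close > open);  // matching quote found
        return 0;
    }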