
Cory Nelson wrote:
Is there interest in having a Unicode codec library submitted to Boost?
Here is what I have (only tested with GCC 4.3 on Debian, and VC++ 2008): http://svn.int64.org/viewvc/int64/snips/unicode.hpp
Right now it is pretty simple to use:
    transcode<utf8, utf16le>(forwarditerator, forwarditeratorend, outputiterator, traits [, maximum]);
    transcode<wchar_encoding, utf32be>(inputrange, outputrange, traits [, maximum]);
There is also a codecvt facet which supports any-to-any.
Supports UTF-8, UTF-16, and UTF-32, in little or big endian. Has a special wchar_encoding that maps to UTF-16 or UTF-32 depending on your platform. A traits class controls error handling.
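A rough usage sketch, hedged: the transcode signature and the encoding tags (utf8, utf16le) are taken verbatim from the description above, but strict_traits is a placeholder name for whatever error-handling traits class the library actually provides.

    // Hypothetical usage of Cory's transcode(); strict_traits is a
    // placeholder, not a name from the actual library.
    #include "unicode.hpp"   // the header linked above
    #include <iterator>
    #include <string>
    #include <vector>

    int main()
    {
        std::string in = "\xC3\xA9tude";     // UTF-8 bytes for "étude"
        std::vector<unsigned short> out;     // UTF-16 code units

        // transcode<From, To>(first, last, output, traits [, maximum])
        transcode<utf8, utf16le>(in.begin(), in.end(),
                                 std::back_inserter(out),
                                 strict_traits());
        return 0;
    }

The second overload shown above would presumably take whole input and output ranges instead of an iterator pair.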
Hi Cory,

Yes, Boost definitely ought to have Unicode conversion. Yours is not the first proposal, and there is actually already one implementation hidden inside another Boost library. I wrote some UTF conversion code a while ago and was trying to build a more comprehensive character set library around it, but realised after a while that that was too mammoth a job. So I think that getting _just_ the UTF conversion into Boost, ASAP, is the right thing to do.

I've had a look at your code. I like that you have implemented what I called an "error policy" (sketched below). It is wasteful to continuously check the validity of input that comes from trusted code, but important to check it when it comes from an untrusted source.

However, I'm not sure that your actual conversion is as efficient as it could be. I spent quite a while profiling my UTF8 conversion and came up with this: http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh I think that you could largely copy & paste bits of that into the right places in your algorithm and get a significant speedup.

Having said all that, I must say that I actually use the code that I wrote quite rarely. I now tend to use UTF8 everywhere and treat it as a sequence of bytes. Because of the properties of UTF8, I find it's rare to need to identify individual code points. For example, if I'm scanning for a matching " or ) I can just look for the next matching byte, without worrying about where the character boundaries are (see the byte-scan sketch below). If I were to use a special UTF8-decoding iterator to do that scan, I would waste a lot of time doing unnecessary conversions.

I'm not sure what conclusion to draw from that: perhaps just that any "UTF8 string", or whatever, should come with a health warning that users should first learn how UTF8 works and review whether or not they actually need it.

Cheers,
Phil.
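A minimal sketch of the "error policy" idea Phil describes, assuming the policy is passed as a template parameter; the names here are illustrative and come from neither library.

    #include <stdexcept>

    // Validation points call Policy::check(ok); with trusting_policy the
    // call is an empty inline function the compiler can remove entirely,
    // so trusted input pays nothing for the checks.
    struct checking_policy {
        static void check(bool ok)
        {
            if (!ok) throw std::runtime_error("invalid UTF-8 sequence");
        }
    };

    struct trusting_policy {
        static void check(bool) {}  // trusted input: skip validation
    };

    // Example validation point: extract the payload of a UTF-8
    // continuation byte, which must have the form 10xxxxxx.
    template <class Policy>
    unsigned decode_continuation(unsigned char b)
    {
        Policy::check((b & 0xC0) == 0x80);
        return b & 0x3F;  // low six payload bits
    }

Code converting bytes from a socket would instantiate with checking_policy; code converting strings it produced itself could use trusting_policy.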
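And a self-contained sketch of the byte-level scan Phil mentions (an illustration, not code from his library). It is safe because every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII byte such as '"' can never appear inside the encoding of another character.

    #include <cassert>

    // Find the next '"' in a UTF-8 buffer by scanning raw bytes.
    // No decoding needed: bytes 0x00-0x7F never occur inside a
    // multi-byte UTF-8 sequence, so any match is a real quote.
    const char* find_quote(const char* p, const char* end)
    {
        while (p != end && *p != '"')
            ++p;
        return p;  // == end if not found
    }

    int main()
    {
        // "He said "héllo"" in UTF-8; é is the byte pair 0xC3 0xA9.
        const char text[] = "He said \"h\xC3\xA9llo\"";
        const char* end = text + sizeof(text) - 1;

        const char* open = find_quote(text, end);
        assert(open != end);

        const char* close = find_quote(open + 1, end);
        assert(close != end && close > open);  // matching quote found
        return 0;
    }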