
On Fri, Jul 17, 2009 at 4:29 PM, Rogier van Dalen <rogiervd@gmail.com> wrote:
On Fri, Jul 17, 2009 at 20:02, Cory Nelson <phrosty@gmail.com> wrote:
On Thu, Feb 12, 2009 at 6:32 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Cory Nelson wrote:
Is there interest in having a Unicode codec library submitted to Boost?
I've had a look at your code. I like that you have implemented what I called an "error policy". It is wasteful to continuously check the validity of input that comes from trusted code, but important to check it when it comes from an untrusted source. However, I'm not sure that your actual conversion is as efficient as it could be. I spent quite a while profiling my UTF-8 conversion and came up with this:
http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh
I think that you could largely copy&paste bits of that into the right places in your algorithm and get a significant speedup.
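As an illustration of the "error policy" idea, a decoder can be parameterised on a policy so that trusted input skips validation while untrusted input pays for the checks. The names and interface below are a hypothetical sketch, not code from either library:

#include <cstdint>
#include <stdexcept>

// Trusted data: assume well-formed UTF-8 and do nothing.
struct trusted_input {
    static void check(bool) {}
};

// Untrusted data: validate and signal failure.
struct checked_input {
    static void check(bool ok) {
        if (!ok) throw std::runtime_error("invalid UTF-8 sequence");
    }
};

// Decode a two-byte sequence 110xxxxx 10xxxxxx starting at p.
template <class ErrorPolicy>
std::uint32_t decode_two_byte(const unsigned char* p) {
    ErrorPolicy::check((p[0] & 0xE0u) == 0xC0u && (p[1] & 0xC0u) == 0x80u);
    std::uint32_t cp = ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
    ErrorPolicy::check(cp >= 0x80u);   // reject overlong encodings
    return cp;
}

// decode_two_byte<checked_input>(p) validates its input;
// decode_two_byte<trusted_input>(p) compiles down to the bare bit manipulation.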
I finally found some time to do some optimizations of my own and have made some good progress using a small lookup table, a switch, and slightly reducing branches. See line 318:
http://svn.int64.org/viewvc/int64/snips/unicode.hpp?view=markup
Despite these efforts, Windows 7 still decodes UTF-8 three times faster (~750 MiB/s vs ~240 MiB/s on my Core 2). I assume they are either using some gigantic lookup tables or SSE.
Dear Cory,
Though I'm not sure decoding this much UTF-8-encoded data is often a bottleneck in practice, let me add my thoughts. I wrote a utf8_codecvt::do_in that uses lookup tables for a Unicode library that Graham and I started long ago (in the vault). I haven't actually compared the performance of my implementation against yours, so mine may well be slower. However, the technique may be of interest.
UTF-8 is the primary bottleneck in XML decoding. That's been my motivation thus far.
From the first UTF-8 byte, the values and the valid ranges of the 0 to 3 following bytes are known (Table 3-7 in the Unicode standard, version 5.0). Therefore, you can put these in lookup tables (7 tables with 256 entries each, each entry holding a 32-bit code point value). A master table with 256 entries maps the first byte to the 0 to 3 tables used for its continuation bytes. You look up the code point values for the consecutive bytes in the appropriate tables and OR them together. The trick that I suspect gave me a speed-up compared to the function the Unicode Consortium publishes is that the code point values for invalid bytes are 0xffffffff. After OR-ing all values, you need only one check to see whether there was an error. This reduces branches.
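A minimal sketch of that scheme, restricted to one- and two-byte sequences to keep it short; the table layout and names are illustrative, not the actual code in the vault:

#include <cstdint>
#include <cstdio>

// Tables for 1- and 2-byte sequences only; the full scheme extends the
// same idea to 3- and 4-byte sequences with further continuation tables.
static std::uint32_t lead_value[256];   // contribution of the lead byte
static std::uint8_t  lead_length[256];  // sequence length in bytes
static std::uint32_t cont_value[256];   // contribution of a continuation byte

static void build_tables() {
    for (int b = 0; b < 256; ++b) {
        lead_value[b] = 0xFFFFFFFFu;    // invalid unless set below
        lead_length[b] = 1;
        cont_value[b] = 0xFFFFFFFFu;
    }
    for (int b = 0x00; b <= 0x7F; ++b)              // ASCII
        lead_value[b] = static_cast<std::uint32_t>(b);
    for (int b = 0xC2; b <= 0xDF; ++b) {            // valid two-byte lead bytes
        lead_value[b] = static_cast<std::uint32_t>(b & 0x1F) << 6;
        lead_length[b] = 2;
    }
    for (int b = 0x80; b <= 0xBF; ++b)              // continuation bytes
        cont_value[b] = static_cast<std::uint32_t>(b & 0x3F);
}

// OR the table values together; any invalid byte forces the result to
// 0xFFFFFFFF, so a single comparison at the end detects all errors.
static std::uint32_t decode_one(const unsigned char* p, int* length) {
    *length = lead_length[p[0]];
    std::uint32_t cp = lead_value[p[0]];
    if (*length == 2)
        cp |= cont_value[p[1]];
    return cp;
}

int main() {
    build_tables();
    const unsigned char text[] = { 0xC3, 0xA9 };    // U+00E9 encoded in UTF-8
    int length = 0;
    std::uint32_t cp = decode_one(text, &length);
    if (cp == 0xFFFFFFFFu)
        std::printf("invalid sequence\n");
    else
        std::printf("U+%04X (%d bytes)\n", static_cast<unsigned>(cp), length);
    return 0;
}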
Sounds interesting, I will try this out and see what happens.
As I said, I'm not sure it's actually faster than your code, but it may be an interesting technique. The debug compilation of utf8_data.cpp, which contains the tables and nothing else, is 30kB on Linux gcc; hardly "gigantic". However, with a threefold speed increase it seems likely that Windows uses bigger tables or something smarter than I came up with.
To answer your original question, "is there any interest in Unicode codecs?", yes.
It now seems to me that a full Unicode library would be hard to get accepted into Boost; it would be more feasible to get a UTF library submitted, which is more along the lines of your library. (A Unicode library could later be based on the same principles.)
Freestanding transcoding functions and codecvt facets are not the only things I believe a UTF library would need, though. I'd add to the list:
- iterator adaptors (input and output);
- range adaptors;
- a code point string;
- compile-time encoding (meta-programming);
- documentation.
I agree, mostly. I'm not sure if a special string (as opposed to basic_string<utf16_t>) would be worthwhile -- what would you do with it that didn't require a full Unicode library supporting it?
(The compile-time encoding thing may sound esoteric. I believe it would be useful for fast parsers. It's not that hard to do at any rate.) I suspect the string is a pain to design to please everyone. Iterator adaptors, I found, are a pain to attach error policies to and to write correctly. For example, with a policy equivalent to your "ReplaceCheckFailures", you need to produce the same code point sequence whether you traverse an invalidly encoded string forward or backward. I've got code for UTF-8 that passes my unit tests, but the error checking and the one-by-one decoding make it much harder to optimise.
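For what it's worth, that forward/backward consistency requirement can be written down as a test. Adaptor below stands in for a hypothetical bidirectional UTF-8 iterator adaptor with a replace-on-failure policy; its constructor taking the current position plus the underlying range is an assumption, not an existing interface:

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// With a replace-on-failure policy, decoding an ill-formed byte string
// forward and backward must yield the same code point sequence.
template <class Adaptor>
bool traversal_is_consistent(const std::string& bytes) {
    Adaptor first(bytes.begin(), bytes.begin(), bytes.end());
    Adaptor last (bytes.end(),   bytes.begin(), bytes.end());

    std::vector<std::uint32_t> forward(first, last);

    std::vector<std::uint32_t> backward;
    for (Adaptor it = last; it != first; )
        backward.push_back(*--it);
    std::reverse(backward.begin(), backward.end());

    return forward == backward;
}

// e.g. traversal_is_consistent<my_utf8_iterator>("\xE2\x82"): a truncated
// three-byte sequence should decode to the same replacement character(s)
// from either direction.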
I believe that Mathias Gaunard is working on a library at <http://blogloufoque.free.fr/unicode/doc/html/>. I don't know how complete it is, but from the documentation it looks well thought-out so far. I'm looking forward to seeing where that's going!
Cheers,
Rogier
-- Cory Nelson http://int64.org