
On Fri, Jul 17, 2009 at 4:29 PM, Rogier van Dalen <rogiervd@gmail.com> wrote:
On Fri, Jul 17, 2009 at 20:02, Cory Nelson <phrosty@gmail.com> wrote:
On Thu, Feb 12, 2009 at 6:32 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Cory Nelson wrote:
Is there interest in having a Unicode codec library submitted to Boost?
I've had a look at your code. I like that you have implemented what I called an "error policy". It is wasteful to continuously check the validity of input that comes from trusted code, but important to check it when it comes from an untrusted source. However, I'm not sure that your actual conversion is as efficient as it could be. I spent quite a while profiling my UTF-8 conversion and came up with this:
http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh
I think that you could largely copy&paste bits of that into the right places in your algorithm and get a significant speedup.
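As an illustration of the "error policy" idea, a decoder can be parameterised on a policy so that trusted input skips validation while untrusted input pays for the checks. The names and interface below are a hypothetical sketch, not code from either library:

#include <cstdint>
#include <stdexcept>

// Trusted data: assume well-formed UTF-8 and do nothing.
struct trusted_input {
    static void check(bool) {}
};

// Untrusted data: validate and signal failure.
struct checked_input {
    static void check(bool ok) {
        if (!ok) throw std::runtime_error("invalid UTF-8 sequence");
    }
};

// Decode a two-byte sequence 110xxxxx 10xxxxxx starting at p.
template <class ErrorPolicy>
std::uint32_t decode_two_byte(const unsigned char* p) {
    ErrorPolicy::check((p[0] & 0xE0u) == 0xC0u && (p[1] & 0xC0u) == 0x80u);
    std::uint32_t cp = ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
    ErrorPolicy::check(cp >= 0x80u);   // reject overlong encodings
    return cp;
}

// decode_two_byte<checked_input>(p) validates its input;
// decode_two_byte<trusted_input>(p) compiles down to the bare bit manipulation.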
I finally found some time to do some optimizations of my own and have made some good progress using a small lookup table, a switch, and slightly reducing branches. See line 318:
http://svn.int64.org/viewvc/int64/snips/unicode.hpp?view=markup
Despite these efforts, Windows 7 still decodes UTF-8 three times faster (~750 MiB/s vs ~240 MiB/s on my Core 2). I assume they are either using some gigantic lookup tables or SSE.
Dear Cory,
Though I'm not sure decoding this much UTF-8-encoded data is often a bottleneck in practice, let me add my thoughts. I wrote a utf8_codecvt::do_in that uses lookup tables for a Unicode library that Graham and I started long ago (in the vault). I haven't actually compared the performance of my implementation against yours, so mine may well be slower. However, the technique may be of interest.
UTF-8 is the primary bottleneck in XML decoding. That's been my motivation thus far.
From the first UTF-8 byte, the values and the valid ranges of the 0 to 3 following bytes are known (Table 3-7 in the Unicode standard, version 5.0). Therefore, you can put these in lookup tables (7 tables with 256 entries each, each entry holding a 32-bit code point value). A master table with 256 entries maps the first byte to the 0 to 3 tables used for its continuation bytes. You look up the code point values for the consecutive bytes in the appropriate tables and OR them together. The trick that I suspect gave me a speed-up compared to the function the Unicode Consortium publishes is that the code point values for invalid bytes are 0xffffffff. After OR-ing all values, you need only one check to see whether there was an error. This reduces branches.
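A minimal sketch of that scheme, restricted to one- and two-byte sequences to keep it short; the table layout and names are illustrative, not the actual code in the vault:

#include <cstdint>
#include <cstdio>

// Tables for 1- and 2-byte sequences only; the full scheme extends the
// same idea to 3- and 4-byte sequences with further continuation tables.
static std::uint32_t lead_value[256];   // contribution of the lead byte
static std::uint8_t  lead_length[256];  // sequence length in bytes
static std::uint32_t cont_value[256];   // contribution of a continuation byte

static void build_tables() {
    for (int b = 0; b < 256; ++b) {
        lead_value[b] = 0xFFFFFFFFu;    // invalid unless set below
        lead_length[b] = 1;
        cont_value[b] = 0xFFFFFFFFu;
    }
    for (int b = 0x00; b <= 0x7F; ++b)              // ASCII
        lead_value[b] = static_cast<std::uint32_t>(b);
    for (int b = 0xC2; b <= 0xDF; ++b) {            // valid two-byte lead bytes
        lead_value[b] = static_cast<std::uint32_t>(b & 0x1F) << 6;
        lead_length[b] = 2;
    }
    for (int b = 0x80; b <= 0xBF; ++b)              // continuation bytes
        cont_value[b] = static_cast<std::uint32_t>(b & 0x3F);
}

// OR the table values together; any invalid byte forces the result to
// 0xFFFFFFFF, so a single comparison at the end detects all errors.
static std::uint32_t decode_one(const unsigned char* p, int* length) {
    *length = lead_length[p[0]];
    std::uint32_t cp = lead_value[p[0]];
    if (*length == 2)
        cp |= cont_value[p[1]];
    return cp;
}

int main() {
    build_tables();
    const unsigned char text[] = { 0xC3, 0xA9 };    // U+00E9 encoded in UTF-8
    int length = 0;
    std::uint32_t cp = decode_one(text, &length);
    if (cp == 0xFFFFFFFFu)
        std::printf("invalid sequence\n");
    else
        std::printf("U+%04X (%d bytes)\n", static_cast<unsigned>(cp), length);
    return 0;
}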
Sounds interesting, I will try this out and see what happens.
As I said, I'm not sure it's actually faster than your code, but it may be an interesting technique. The debug compilation of utf8_data.cpp, which contains the tables and nothing else, is 30kB on Linux gcc; hardly "gigantic". However, with a threefold speed increase it seems likely that Windows uses bigger tables or something smarter than I came up with.
To answer your original question, "is there any interest in Unicode codecs?", yes.
It now seems to me that a full Unicode library would be hard to get accepted into Boost; it would be more feasible to get a UTF library submitted, which is more along the lines of your library. (A Unicode library could later be based on the same principles.)
Freestanding transcoding functions and codecvt facets are not the only things I believe a UTF library would need, though. I'd add to the list:
- iterator adaptors (input and output);
- range adaptors;
- a code point string;
- compile-time encoding (meta-programming);
- documentation.
I agree, mostly. I'm not sure if a special string (as opposed to basic_string<utf16_t>) would be worthwhile -- what would you do with it that didn't require a full Unicode library supporting it?
(The compile-time encoding thing may sound esoteric. I believe it would be useful for fast parsers. It's not that hard to do at any rate.) I suspect the string is a pain to design to please everyone. Iterator adaptors, I found, are a pain to attach error policies to and to write correctly. For example, with a policy equivalent to your "ReplaceCheckFailures", you need to produce the same code point sequence whether you traverse an invalidly encoded string forward or backward. I've got code for UTF-8 that passes my unit tests, but the error checking and the one-by-one decoding make it much harder to optimise.
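For what it's worth, that forward/backward consistency requirement can be written down as a test. Adaptor below stands in for a hypothetical bidirectional UTF-8 iterator adaptor with a replace-on-failure policy; its constructor taking the current position plus the underlying range is an assumption, not an existing interface:

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// With a replace-on-failure policy, decoding an ill-formed byte string
// forward and backward must yield the same code point sequence.
template <class Adaptor>
bool traversal_is_consistent(const std::string& bytes) {
    Adaptor first(bytes.begin(), bytes.begin(), bytes.end());
    Adaptor last (bytes.end(),   bytes.begin(), bytes.end());

    std::vector<std::uint32_t> forward(first, last);

    std::vector<std::uint32_t> backward;
    for (Adaptor it = last; it != first; )
        backward.push_back(*--it);
    std::reverse(backward.begin(), backward.end());

    return forward == backward;
}

// e.g. traversal_is_consistent<my_utf8_iterator>("\xE2\x82"): a truncated
// three-byte sequence should decode to the same replacement character(s)
// from either direction.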
I believe that Mathias Gaunard is working on a library at <http://blogloufoque.free.fr/unicode/doc/html/>. I don't know how complete it is, but from the documentation it looks well thought-out so far. I'm looking forward to seeing where that's going!
Cheers,
Rogier
-- Cory Nelson http://int64.org