
Cory Nelson wrote:
I finally found some time to do some optimizations of my own and have had some good progress using a small lookup table, a switch, and slightly reducing branches. See line 318:
http://svn.int64.org/viewvc/int64/snips/unicode.hpp?view=markup
Despite these efforts, Windows 7 still decodes UTF-8 roughly three times faster (~750MiB/s vs ~240MiB/s on my Core 2). I assume they are either using some gigantic lookup tables or SSE.
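
The linked code is not reproduced here, but a generic sketch of the lookup-table-plus-switch technique Cory describes might look like the following; the table values and function name are illustrative and are not taken from unicode.hpp:

    #include <cstdint>
    #include <cstddef>

    // Illustrative only: classify the sequence length from the lead byte's
    // top five bits, then switch on that length to gather continuation
    // bytes.  Validation of continuation bytes and overlong forms is omitted.
    static const unsigned char utf8_len[32] = {
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,  // 0xxxxxxx : ASCII
        0,0,0,0,0,0,0,0,                  // 10xxxxxx : continuation, invalid as lead
        2,2,2,2,                          // 110xxxxx : two-byte sequence
        3,3,                              // 1110xxxx : three-byte sequence
        4,                                // 11110xxx : four-byte sequence
        0                                 // 11111xxx : invalid
    };

    // Decodes one code point from [p, end); returns bytes consumed, 0 on error.
    inline std::size_t decode_one(const unsigned char* p, const unsigned char* end,
                                  std::uint32_t& out)
    {
        std::size_t len = utf8_len[p[0] >> 3];
        if (len == 0 || static_cast<std::size_t>(end - p) < len) return 0;
        switch (len) {
        case 1: out = p[0]; break;
        case 2: out = ((p[0] & 0x1F) << 6)  |  (p[1] & 0x3F); break;
        case 3: out = ((p[0] & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F); break;
        case 4: out = ((p[0] & 0x07) << 18) | ((p[1] & 0x3F) << 12)
                    | ((p[2] & 0x3F) << 6)  |  (p[3] & 0x3F); break;
        }
        return len;
    }
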
Hi Cory,

What is your test input? When the input is largely ASCII, a worthwhile optimisation is to cast groups of 4 (or 8) characters to an int and AND it with 0x80808080; if the answer is zero, no further conversion is needed.

In general I'm unsure of the performance trade-offs of lookup tables compared to explicit bit-manipulation. Cache effects may be significant, and a benchmark will tend to warm up the cache better than a real application might.

I can't see how SSE could be applied to this problem, but it's not something I know much about.

I don't have much time to work on this right now, but if the algorithm plus test harness and test data were bundled up into something that I can just "make", I will try to compare it with my version.

Regards, Phil.
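
A minimal sketch of the ASCII fast path Phil suggests, assuming an unaligned 4-byte load via memcpy; the helper name is made up here, and handling of the unaligned head and the trailing bytes is left to the surrounding converter:

    #include <cstdint>
    #include <cstring>

    // Read 4 bytes at a time and test the high bit of every byte with a
    // single mask.  memcpy avoids alignment problems on strict platforms.
    inline bool next4_are_ascii(const char* p)
    {
        std::uint32_t word;
        std::memcpy(&word, p, sizeof word);   // unaligned-safe load
        return (word & 0x80808080u) == 0;     // no byte has its top bit set
    }

    // Example use inside a decode loop (src/end/dst are whatever the
    // surrounding converter provides):
    //
    //   while (src + 4 <= end && next4_are_ascii(src)) {
    //       dst[0] = src[0]; dst[1] = src[1]; dst[2] = src[2]; dst[3] = src[3];
    //       src += 4; dst += 4;
    //   }

The same idea extends to 8-byte groups with a std::uint64_t and the mask 0x8080808080808080.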