Re: [boost] GSoC Unicode library: second preview

21 Jun 2009

Mathias Gaunard wrote:
...
Here is the documentation of the current state of the Unicode library 
that I am doing as a google summer of code project:
http://blogloufoque.free.fr/unicode/doc/html/

Hi Mathias,

I have had a quick look at your UTF-8 code at 
https://svn.boost.org/trac/boost/browser/sandbox/SOC/2009/unicode/boost/unic... 
and compared it with mine at 
http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh .  Our 
encoding code is similar, though I have avoided some code duplication 
(which is probably worthwhile in an inline function) and used an 
IF_LIKELY macro to enable GCC's branch hinting.
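
For anyone unfamiliar with the branch-hinting trick, a macro of this kind could be defined roughly as follows. This is only a sketch: the actual IF_LIKELY definition in libpbe may differ, and the `is_ascii` helper is a made-up example. It relies on GCC's `__builtin_expect` and falls back to a plain `if` on other compilers.

```cpp
#include <cstdint>

// Hypothetical definition; the real IF_LIKELY in libpbe may differ.
#if defined(__GNUC__)
   // Tell the compiler that the condition is expected to be true.
#  define IF_LIKELY(cond) if (__builtin_expect(!!(cond), 1))
#else
   // No hint available: behave like an ordinary if.
#  define IF_LIKELY(cond) if (cond)
#endif

// Made-up example: the ASCII test is hinted as the common case.
inline bool is_ascii(std::uint8_t b) {
  IF_LIKELY((b & 0x80) == 0) {
    return true;
  }
  return false;
}
```

The hint changes only code layout and branch prediction defaults, never the result of the test.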

My decoding implementation is rather different from yours, though.  You 
explicitly determine the length of the sequence first and then loop, 
while I do this:

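   // Decode one code point from UTF-8 and advance p past it.
   static char32_t decode(const_char8_ptr_t& p) {
     char8_t b0 = *(p++);
     // 1 byte: 0xxxxxxx (ASCII)
     IF_LIKELY((b0&0x80)==0) {
       return b0;
     }
     char8_t b1 = *(p++);
     check((b1&0xc0)==0x80);   // continuation bytes are 10xxxxxx
     // 2 bytes: 110xxxxx 10xxxxxx
     IF_LIKELY((b0&0xe0)==0xc0) {
       char32_t r = (b1&0x3f) | ((b0&0x1f)<<6);
       check(r>=0x80);         // reject overlong encodings
       return r;
     }
     char8_t b2 = *(p++);
     check((b2&0xc0)==0x80);
     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
     IF_LIKELY((b0&0xf0)==0xe0) {
       char32_t r = (b2&0x3f) | ((b1&0x3f)<<6) | ((b0&0x0f)<<12);
       check(r>=0x800);        // reject overlong encodings
       return r;
     }
     char8_t b3 = *(p++);
     check((b3&0xc0)==0x80);
     // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
     IF_LIKELY((b0&0xf8)==0xf0) {
       char32_t r = (b3&0x3f) | ((b2&0x3f)<<6) | ((b1&0x3f)<<12) | ((b0&0x07)<<18);
       check(r>=0x10000);      // reject overlong encodings
       return r;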
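     }
     // Invalid lead byte: a stray continuation byte or 0xf8..0xff.
     check(false);
     return 0;  // only reached if check() does not throw
   }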

You may find that this is faster.
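
For comparison, the length-first style mentioned above might look roughly like the sketch below. This is not Mathias's actual code: `decode_length_first` and the throwing `check` helper are assumptions made for illustration, and plain `unsigned char` stands in for `char8_t`.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: throws on malformed input.
inline void check(bool ok) {
  if (!ok) throw std::runtime_error("invalid UTF-8");
}

// Length-first decoding: classify the lead byte, then loop over
// the continuation bytes. Advances p past the decoded sequence.
inline char32_t decode_length_first(const unsigned char*& p) {
  const unsigned char b0 = *p++;
  if (b0 < 0x80) return b0;  // 1-byte sequence (ASCII)

  int trailing;        // number of continuation bytes expected
  char32_t r;          // payload bits taken from the lead byte
  char32_t min_value;  // smallest code point allowed for this length
  if ((b0 & 0xe0) == 0xc0)      { trailing = 1; r = b0 & 0x1f; min_value = 0x80; }
  else if ((b0 & 0xf0) == 0xe0) { trailing = 2; r = b0 & 0x0f; min_value = 0x800; }
  else if ((b0 & 0xf8) == 0xf0) { trailing = 3; r = b0 & 0x07; min_value = 0x10000; }
  else { check(false); return 0; }  // stray continuation byte or 0xf8..0xff

  for (int i = 0; i < trailing; ++i) {
    const unsigned char b = *p++;
    check((b & 0xc0) == 0x80);      // continuation bytes are 10xxxxxx
    r = (r << 6) | (b & 0x3f);
  }
  check(r >= min_value);            // reject overlong encodings
  return r;
}
```

Which style is actually faster would need measuring on representative text.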

Regarding the character database, the size is an issue.  Can unwanted 
parts be omitted?  For example, I would guess that the character names 
are not often used except for debugging messages and they are probably 
a large part of it.

Regards,  Phil.

Phil Endecott