
Mathias Gaunard wrote:
> Here is the documentation of the current state of the Unicode library
> that I am doing as a Google Summer of Code project:
> http://blogloufoque.free.fr/unicode/doc/html/

Hi Mathias,

I have looked quickly at your UTF-8 code at
https://svn.boost.org/trac/boost/browser/sandbox/SOC/2009/unicode/boost/unic...
in comparison with mine at
http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh .

The encoding is similar, though I have avoided some code duplication (which is probably worthwhile in an inline function) and used an IF_LIKELY macro to enable gcc's branch hinting.

My decoding implementation is rather different from yours, though. You explicitly determine the length of the code first and then loop, while I do this:

    static char32_t decode(const_char8_ptr_t& p) {
      char8_t b0 = *(p++);
      IF_LIKELY((b0&0x80)==0) {
        return b0;
      }
      char8_t b1 = *(p++);
      check((b1&0xc0)==0x80);
      IF_LIKELY((b0&0xe0)==0xc0) {
        char32_t r = (b1&0x3f) | ((b0&0x1f)<<6);
        check(r>=0x80);
        return r;
      }
      char8_t b2 = *(p++);
      check((b2&0xc0)==0x80);
      IF_LIKELY((b0&0xf0)==0xe0) {
        char32_t r = (b2&0x3f) | ((b1&0x3f)<<6) | ((b0&0x0f)<<12);
        check(r>=0x800);
        return r;
      }
      char8_t b3 = *(p++);
      check((b3&0xc0)==0x80);
      IF_LIKELY((b0&0xf8)==0xf0) {
        char32_t r = (b3&0x3f) | ((b2&0x3f)<<6) | ((b1&0x3f)<<12) | ((b0&0x07)<<18);
        check(r>=0x10000);
        return r;
      }
    }

You may find that this is faster.

Regarding the character database, the size is an issue. Can unwanted parts be omitted? For example, I would guess that the character names are not often used except for debugging messages, and they are probably a large part of it.

Regards, Phil.
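
P.S. For anyone who wants to compile the decoder in isolation, here is a minimal self-contained sketch. The post does not show IF_LIKELY, check or the character typedefs, so the definitions below are assumptions: IF_LIKELY is taken to wrap gcc's __builtin_expect, check is a hypothetical policy that throws on malformed input, and byte_t stands in for char8_t. The final check(false) is also an addition, so that an invalid lead byte is reported instead of control running off the end of the function.

```cpp
#include <stdexcept>

// Assumed definition: the post names IF_LIKELY but does not show it.
#if defined(__GNUC__)
#  define IF_LIKELY(cond) if (__builtin_expect(!!(cond), 1))
#else
#  define IF_LIKELY(cond) if (cond)
#endif

typedef unsigned char byte_t;  // stand-in for the post's char8_t

// Hypothetical error policy: throw on any malformed sequence.
inline void check(bool ok) {
  if (!ok) throw std::runtime_error("invalid UTF-8 sequence");
}

// Decode one code point starting at p and advance p past it.
// Like the original, this does no bounds checking on the input.
inline char32_t decode(const byte_t*& p) {
  byte_t b0 = *(p++);
  IF_LIKELY((b0 & 0x80) == 0) {          // 1 byte: 0xxxxxxx
    return b0;
  }
  byte_t b1 = *(p++);
  check((b1 & 0xc0) == 0x80);            // must be a continuation byte
  IF_LIKELY((b0 & 0xe0) == 0xc0) {       // 2 bytes: 110xxxxx
    char32_t r = (b1 & 0x3f) | ((b0 & 0x1f) << 6);
    check(r >= 0x80);                    // reject overlong encodings
    return r;
  }
  byte_t b2 = *(p++);
  check((b2 & 0xc0) == 0x80);
  IF_LIKELY((b0 & 0xf0) == 0xe0) {       // 3 bytes: 1110xxxx
    char32_t r = (b2 & 0x3f) | ((b1 & 0x3f) << 6) | ((b0 & 0x0f) << 12);
    check(r >= 0x800);
    return r;
  }
  byte_t b3 = *(p++);
  check((b3 & 0xc0) == 0x80);
  IF_LIKELY((b0 & 0xf8) == 0xf0) {       // 4 bytes: 11110xxx
    char32_t r = (b3 & 0x3f) | ((b2 & 0x3f) << 6) | ((b1 & 0x3f) << 12)
               | ((b0 & 0x07) << 18);
    check(r >= 0x10000);
    return r;
  }
  check(false);                          // added: b0 is not a valid lead byte
  return 0;                              // not reached, check(false) throws
}
```

Each branch is ordered from the shortest sequence to the longest, so ASCII input returns after a single test. The sketch still accepts surrogates and values above 0x10FFFF, just as the code in the post does.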