
Here is the documentation of the current state of the Unicode library that I am doing as a Google Summer of Code project: http://blogloufoque.free.fr/unicode/doc/html/

This preview comes a bit later than I planned: after struggling to keep the documentation and the code in sync, I decided to move to automatic documentation generation with Doxygen, which I had some trouble setting up due to my lack of experience with it. The reference still lacks a lot of information, however.

The library only features UTF support and the Unicode Character Database (not fully updated to the latest Unicode version) at the moment, but grapheme cluster and normalization support will come very soon.

I would like to get feedback on the UTF codecs, the various concepts (Pipe, Consumer, BoundaryChecker) and the whole approach of lazy ranges. Grapheme clusters (and other text boundary facilities) will also be provided in terms of the Consumer and BoundaryChecker concepts.

The library features lazy ranges similar to those of Boost.RangeEx, and I used one of the naming conventions that was proposed during its review: u8_encode is the eager algorithm, u8_encoded is the lazy one. Since no naming was really agreed on for RangeEx, I would like this to be discussed as well.
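As a rough illustration of how the eager and lazy forms differ in use (the signatures and types below are only a sketch, not the final interface; only the u8_encode/u8_encoded names themselves are fixed):

    #include <vector>
    #include <iterator>
    #include <boost/cstdint.hpp>
    // plus the library's own headers (paths omitted here)

    void sketch(std::vector<boost::uint32_t> const& input)  // some UTF-32 code points
    {
        std::vector<char> utf8;

        // eager form: encodes the whole input range now, writing the UTF-8
        // code units through the output iterator
        u8_encode(input, std::back_inserter(utf8));

        // lazy form: u8_encoded(input) instead returns a range adaptor that
        // performs the encoding on the fly as it is traversed, so it can be
        // fed to further range adaptors or algorithms without materialising
        // the UTF-8 sequence up front.
    }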

Mathias Gaunard wrote:
Here is the documentation of the current state of the Unicode library that I am doing as a Google Summer of Code project: http://blogloufoque.free.fr/unicode/doc/html/
I forgot to mention the code is available on the Boost Sandbox SVN, under SOC/2009/unicode.

Mathias Gaunard wrote:
Mathias Gaunard wrote:
Here is the documentation of the current state of the Unicode library that I am doing as a Google Summer of Code project: http://blogloufoque.free.fr/unicode/doc/html/
I forgot to mention the code is available on the Boost Sandbox SVN, under SOC/2009/unicode.
Looking good. I wonder if you've talked to Haoyu Bai about the gsoc py3k project, where IIUC string-unicode conversion is a central issue. It'd be a shame if Boost's approach to unicode became fragmented at the moment it started (i.e. if boost.python and your library don't do things in a coherent way), and you may both benefit. Just a thought. You might bring this up on the python c++-sig list where more people have been thinking about boost.python + py3k + unicode. -t

2009/6/20 troy d. straszheim <troy@resophonic.com>
Mathias Gaunard wrote:
Mathias Gaunard wrote:
Here is the documentation of the current state of the Unicode library that I am doing as a Google Summer of Code project: http://blogloufoque.free.fr/unicode/doc/html/
I forgot to mention the code is available on the Boost Sandbox SVN, under SOC/2009/unicode.
Looking good. I wonder if you've talked to Haoyu Bai about the gsoc py3k project, where IIUC string-unicode conversion is a central issue.
I've been watching the progress of this too, as the GSoC CGI library is in a similar position and really needs unicode support. Looks good so far, keep us posted. Cheers, Darren

troy d. straszheim wrote:
Looking good. I wonder if you've talked to Haoyu Bai about the gsoc py3k project, where IIUC string-unicode conversion is a central issue. It'd be a shame if Boost's approach to unicode became fragmented at the moment it started (i.e. if boost.python and your library don't do things in a coherent way), and you may both benefit.
I suggest that what boost.python really needs is some kind of "unicode string type" to which it can translate the Python unicode string type. However, the library is at the moment nothing more than a set of algorithms and tools operating on ranges of raw data. Those can later be composed to create such a unicode string type.

Mathias Gaunard wrote:
Here is the documentation of the current state of the Unicode library that I am doing as a Google Summer of Code project: http://blogloufoque.free.fr/unicode/doc/html/
Hi Mathias,

I have looked quickly at your UTF-8 code at https://svn.boost.org/trac/boost/browser/sandbox/SOC/2009/unicode/boost/unic... in comparison with mine at http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh . The encoding is similar, though I have avoided some code duplication (which is probably worthwhile in an inline function) and used an IF_LIKELY macro to enable gcc's branch hinting.

My decoding implementation is rather different from yours, though. You explicitly determine the length of the sequence first and then loop, while I do this:

    static char32_t decode(const_char8_ptr_t& p) {
        char8_t b0 = *(p++);
        IF_LIKELY((b0&0x80)==0) { return b0; }
        char8_t b1 = *(p++);
        check((b1&0xc0)==0x80);
        IF_LIKELY((b0&0xe0)==0xc0) {
            char32_t r = (b1&0x3f) | ((b0&0x1f)<<6);
            check(r>=0x80);
            return r;
        }
        char8_t b2 = *(p++);
        check((b2&0xc0)==0x80);
        IF_LIKELY((b0&0xf0)==0xe0) {
            char32_t r = (b2&0x3f) | ((b1&0x3f)<<6) | ((b0&0x0f)<<12);
            check(r>=0x800);
            return r;
        }
        char8_t b3 = *(p++);
        check((b3&0xc0)==0x80);
        IF_LIKELY((b0&0xf8)==0xf0) {
            char32_t r = (b3&0x3f) | ((b2&0x3f)<<6) | ((b1&0x3f)<<12) | ((b0&0x07)<<18);
            check(r>=0x10000);
            return r;
        }
    }

You may find that that is faster.

Regarding the character database, the size is an issue. Can unwanted parts be omitted? For example, I would guess that the character names are not often used except for debugging messages, and they are probably a large part of it.

Regards, Phil.
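IF_LIKELY here refers to a wrapper around GCC's __builtin_expect. A minimal sketch of how such a macro is commonly defined (the actual definition in utf8.hh may differ):

    // Sketch only -- not the definition from utf8.hh.  On GCC,
    // __builtin_expect(expr, 1) tells the compiler to lay out the branch so
    // that the "likely" side becomes the fall-through path; on other
    // compilers the macro degrades to a plain if.
    #if defined(__GNUC__)
    # define IF_LIKELY(cond)   if (__builtin_expect(!!(cond), 1))
    # define IF_UNLIKELY(cond) if (__builtin_expect(!!(cond), 0))
    #else
    # define IF_LIKELY(cond)   if (cond)
    # define IF_UNLIKELY(cond) if (cond)
    #endif

    // Usage as in decode() above: hint that the single-byte (ASCII) case
    // is the common one.
    //   IF_LIKELY((b0 & 0x80) == 0) { return b0; }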

Phil Endecott wrote:
I have looked quickly at your UTF8 code at https://svn.boost.org/trac/boost/browser/sandbox/SOC/2009/unicode/boost/unic... in comparison with mine at http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh . The encoding is similar, though I have avoided some code duplication (which is probably worthwhile in an inline function) and used an IF_LIKELY macro to enable gcc's branch hinting.
My decoding implementation is rather different from yours, though. You explicitly determine the length of the sequence first and then loop, while I do this: <code snip />
You may find that that is faster.
My code wasn't fine-tuned for performance at all; I'm still trying to make things work first ;). I'll surely consider your technique when I finally measure. On another note, while I do think IF_LIKELY for UTF-16 is a good idea, doesn't that heavily penalize certain scripts, such as Asian ones, in the case of UTF-8?
Regarding the character database, the size is an issue. Can unwanted parts be omitted? For example, I would guess that the character names are not often used except for debugging messages and they are probably a large part of it.
The current design doesn't allow it to be shrunk any more than this, unfortunately. I'm not too sure how to enhance it to allow parts to be removed, either.

Mathias Gaunard wrote:
On another note, while I do think IF_LIKELY for UTF-16 is a good idea, doesn't that heavily penalize certain scripts, such as Asian ones, in the case of UTF-8?
Not really:

- In many cases, documents that use an exotic script actually contain large numbers of ASCII characters; consider an HTML page, for example, which will be full of HTML punctuation and tags. (I believe that I became aware of this after reading something written by a Mozilla person who had been investigating Unicode issues.)

- The penalty of a wrong branch hint is not "heavy". We probably have lots of places in our code where the compiler's heuristic is wrong, but we don't notice until we study it very carefully (as I did with this UTF-8 code). This is why processors still need to implement dynamic branch prediction.

My normal policy for using compiler branch hints like IF_LIKELY is to compile once with profile-driven optimisation, and then to find the places where it made a significant difference and add branch hints there. I then get close to the profile-driven-optimised performance without needing to actually re-do the profiling.

Regards, Phil.
participants (4):
- Darren Garvey
- Mathias Gaunard
- Phil Endecott
- troy d. straszheim