
10 Mar 2008, 8:31 p.m.
On Mon, Mar 10, 2008 at 1:10 PM, Graham <Graham@system-development.co.uk> wrote:
Sebastian,
As Unicode characters that are not in page zero can require more than 32 bits
to encode them [yes really] this means that one 'character' can be very long
in UTF-8/16 encoding. It is even worse if you start looking at conceptual
characters [graphemes] where you can easily have three characters make up a
conceptual character.

Unicode defines codepoints from 0 to 10FFFF - this can be encoded with 32 bits in UTF-8 and UTF-16.
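Both points can be checked quickly in Python (my own illustration, not code from the thread): a codepoint outside the Basic Multilingual Plane ("page zero") still fits in at most 32 bits in either encoding, while a single grapheme can span several codepoints.

```python
# U+1F600 lies outside the Basic Multilingual Plane ("page zero").
ch = "\U0001F600"
print(len(ch.encode("utf-8")))            # 4 bytes in UTF-8 -- within 32 bits
print(len(ch.encode("utf-16-le")) // 2)   # 2 UTF-16 code units (a surrogate pair)

# A "conceptual character" (grapheme) can be several codepoints:
# 'e' + combining acute accent + combining ring below renders as one character.
grapheme = "e\u0301\u0325"
print(len(grapheme))                      # 3 codepoints
print(len(grapheme.encode("utf-8")))      # 5 bytes in UTF-8
```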
Normalization support would be nice, but it is a huge task that is out of scope for the library (imho). This is where you have to decide if you want a full-blown Unicode library or just a small codec.

-- Cory Nelson
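To illustrate what normalization support entails, here is a sketch using Python's standard `unicodedata` module (not the library under discussion): the same visible character can be stored as one codepoint or as a base letter plus a combining mark, and the two forms only compare equal after normalizing both.

```python
import unicodedata

decomposed = "e\u0301"                         # 'e' + combining acute accent
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))          # 2 codepoints vs. 1 (U+00E9)
print(decomposed == composed)                  # False -- codepoint sequences differ
print(unicodedata.normalize("NFC", decomposed)
      == unicodedata.normalize("NFC", composed))  # True once both are normalized
```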