Re: [boost] UTF-8 conversion etc. (Cory Nelson)

Sebastian,
As Unicode characters that are not in page zero can require more than 32 bits to encode them [yes really] this means that one 'character' can be very long
Unicode defines codepoints from 0 to 10FFFF - this can be encoded with 32 bits in UTF-8 and UTF-16.
Cory,
This is true for simple characters, except that current Unicode specs require support for surrogates, which require twice that, and that's even before you start to discuss logical grouping of characters or graphemes, which can themselves be two or three characters long.
I am glad you recognise that normalisation support is difficult; that's why the character support library is the hard part to develop. I guess we just ran out of steam after that.
Yours,
Graham

On Mon, Mar 10, 2008 at 3:22 PM, Graham <Graham@system-development.co.uk> wrote:
Sebastian,
As Unicode characters that are not in page zero can require more than 32 bits to encode them [yes really] this means that one 'character' can be very long
Unicode defines codepoints from 0 to 10FFFF - this can be encoded with 32 bits in UTF-8 and UTF-16.
Cory,
This is true for simple characters, except that current Unicode specs require support for surrogates, which require twice that, and that's even before you start to discuss logical grouping of characters or graphemes, which can themselves be two or three characters long.
A surrogate pair in UTF-16 takes up two code units for a total of 32 bits. UTF-8 does not have surrogates at all. What are you talking about?
--
Cory Nelson
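
For reference, the size claim above can be checked with a small sketch. This is not code from Boost or from either poster; encode_utf8 and encode_utf16 are hypothetical helper names, and the sketch assumes the input is a valid Unicode scalar value (0 to 10FFFF, excluding the surrogate range D800 to DFFF). It encodes a single code point both ways so the unit counts can be inspected.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a Boost API): encode one Unicode scalar value as UTF-8.
// Assumes cp is in 0..0x10FFFF and is not a surrogate (0xD800..0xDFFF).
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper (not a Boost API): encode one Unicode scalar value as UTF-16.
// Same precondition as encode_utf8.
std::u16string encode_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {                    // one 16-bit code unit
        out += static_cast<char16_t>(cp);
    } else {                               // surrogate pair: two 16-bit code units
        cp -= 0x10000;                     // leaves a 20-bit value
        out += static_cast<char16_t>(0xD800 | (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));  // low surrogate
    }
    return out;
}
```

Under those assumptions, encode_utf8(0x10FFFF) yields four 8-bit code units and encode_utf16(0x10FFFF) yields two 16-bit code units, so the largest code point takes 32 bits in either encoding. Combining sequences and grapheme clusters are made of several code points, which is a separate question from how many bits one code point needs.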
participants (2)
- Cory Nelson
- Graham