
Graham wrote:
> As Unicode characters that are not in page zero can require more than 32
> bits to encode them [yes really]

Unless you're talking about grapheme clusters or composite characters (are
they the same thing?), not in Unicode 5. No Unicode code point needs more than
one UTF-32 unit, more than two UTF-16 units (a surrogate pair), or more than
four UTF-8 units (11110www 10xxxxxx 10yyyyyy 10zzzzzz, for a total of 21 bits).
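
For concreteness, here is a minimal sketch of those unit counts (my own
illustration, not code from the library discussed below; encode_utf8 and
encode_utf16 are made-up names):

    #include <cassert>
    #include <cstdint>
    #include <vector>

    // Encode one Unicode scalar value (<= U+10FFFF, not a surrogate) as
    // UTF-8. Never needs more than four bytes.
    std::vector<std::uint8_t> encode_utf8(char32_t cp) {
        std::vector<std::uint8_t> out;
        if (cp < 0x80) {
            out.push_back(static_cast<std::uint8_t>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        } else {
            // 11110www 10xxxxxx 10yyyyyy 10zzzzzz: 3 + 6 + 6 + 6 = 21 payload bits.
            out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        }
        return out;
    }

    // Encode the same scalar as UTF-16: one unit inside the BMP, otherwise
    // a surrogate pair. Never needs more than two units.
    std::vector<char16_t> encode_utf16(char32_t cp) {
        if (cp < 0x10000)
            return {static_cast<char16_t>(cp)};
        char32_t v = cp - 0x10000;  // 20 bits remain
        return {static_cast<char16_t>(0xD800 | (v >> 10)),     // high surrogate
                static_cast<char16_t>(0xDC00 | (v & 0x3FF))};  // low surrogate
    }

    int main() {
        assert(encode_utf8(U'\U0001D11E').size() == 4);   // MUSICAL SYMBOL G CLEF
        assert(encode_utf16(U'\U0001D11E').size() == 2);  // surrogate pair
    }

The point is simply that 21 bits always fit: no scalar value needs a fifth
UTF-8 byte or a third UTF-16 unit.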
> The only way I have found of handling this is to base the string functions
> on a proper Unicode character support library according to the Unicode
> spec. This means that you need character movement support, grapheme
> support, and sorting support.
There are several issues here. The first is the ability to store text in some
encoding, and to convert it to Unicode code points or to a different encoding.
The second is the ability to process that text, which brings in the Unicode
algorithms such as collation. The third is the ability to display the text:
BIDI support and, if I understand the term correctly, character movement. (Is
that about moving the caret from grapheme to grapheme, taking BIDI and
ligatures into account?)

The nice thing is that the dependencies go strictly upwards: storing doesn't
depend on processing, and processing doesn't depend on displaying. So it's
possible to take these one step at a time, as the sketch below illustrates.
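
As an illustration of the bottom layer only (again my own sketch, not the
library Graham mentions; decode_utf8 is a hypothetical name, and validity
checks are deliberately minimal), here is a decoder from stored UTF-8 bytes
to code points that depends on nothing above it:

    #include <cstddef>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Storage layer only: turn UTF-8 bytes into code points. Knows nothing
    // about collation (processing) or BIDI/caret movement (display).
    std::vector<char32_t> decode_utf8(const std::string& bytes) {
        std::vector<char32_t> out;
        for (std::size_t i = 0; i < bytes.size();) {
            unsigned char b = static_cast<unsigned char>(bytes[i]);
            char32_t cp = 0;
            int extra = 0;  // continuation bytes that follow the lead byte
            if      (b < 0x80) { cp = b; }
            else if (b < 0xC0) { throw std::runtime_error("stray continuation byte"); }
            else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
            else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
            else               { cp = b & 0x07; extra = 3; }
            if (i + extra >= bytes.size())
                throw std::runtime_error("truncated UTF-8 sequence");
            // Overlong-form and surrogate checks omitted in this sketch.
            for (int k = 1; k <= extra; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
            out.push_back(cp);
            i += extra + 1;
        }
        return out;
    }

Collation and display code can then consume the code points without caring
how they were stored, which is what makes the one-step-at-a-time plan
workable.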
> As I said to Phil, Rogier and I completed a Unicode character library for
> release under Boost, but never submitted it to Boost as we had intended to
> release it with a string library built on it, and never had time to do the
> second part of the work.
Post it, and we'll do the second part. It's open source.

Sebastian