
Graham wrote:
> As Unicode characters that are not in page zero can require more than 32
> bits to encode them [yes really]

Unless you're talking about grapheme clusters or composite characters (are
they the same thing?), not in Unicode 5. No Unicode code point needs more than
one UTF-32 unit, more than two UTF-16 units (a surrogate pair), or more than
four UTF-8 units (11110www 10xxxxxx 10yyyyyy 10zzzzzz, for a total of 21 bits).
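
For concreteness, here is a minimal sketch of those unit counts (my own
illustration, not code from the library discussed below; encode_utf8 and
encode_utf16 are made-up names):

    #include <cassert>
    #include <cstdint>
    #include <vector>

    // Encode one Unicode scalar value (<= U+10FFFF, not a surrogate) as
    // UTF-8. Never needs more than four bytes.
    std::vector<std::uint8_t> encode_utf8(char32_t cp) {
        std::vector<std::uint8_t> out;
        if (cp < 0x80) {
            out.push_back(static_cast<std::uint8_t>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        } else {
            // 11110www 10xxxxxx 10yyyyyy 10zzzzzz: 3 + 6 + 6 + 6 = 21 payload bits.
            out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        }
        return out;
    }

    // Encode the same scalar as UTF-16: one unit inside the BMP, otherwise
    // a surrogate pair. Never needs more than two units.
    std::vector<char16_t> encode_utf16(char32_t cp) {
        if (cp < 0x10000)
            return {static_cast<char16_t>(cp)};
        char32_t v = cp - 0x10000;  // 20 bits remain
        return {static_cast<char16_t>(0xD800 | (v >> 10)),     // high surrogate
                static_cast<char16_t>(0xDC00 | (v & 0x3FF))};  // low surrogate
    }

    int main() {
        assert(encode_utf8(U'\U0001D11E').size() == 4);   // MUSICAL SYMBOL G CLEF
        assert(encode_utf16(U'\U0001D11E').size() == 2);  // surrogate pair
    }

The point is simply that 21 bits always fit: no scalar value needs a fifth
UTF-8 byte or a third UTF-16 unit.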
> The only way I have found of handling this is to base the string functions
> on a proper Unicode character support library according to the Unicode
> spec. This means that you need character movement support, grapheme
> support, and sorting support.
There are several issues here. The first is the ability to store text in some
encoding, and to convert it to Unicode code points or to a different encoding.
The second is the ability to process that text, which brings in the Unicode
algorithms such as collation. The third is the ability to display the text:
BIDI support and, if I understand the term correctly, character movement. (Is
that about moving the caret from grapheme to grapheme, taking BIDI and
ligatures into account?)

The nice thing is that the dependencies go strictly upwards: storing doesn't
depend on processing, and processing doesn't depend on displaying. So it's
possible to take these one step at a time, as the sketch below illustrates.
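
As an illustration of the bottom layer only (again my own sketch, not the
library Graham mentions; decode_utf8 is a hypothetical name, and validity
checks are deliberately minimal), here is a decoder from stored UTF-8 bytes
to code points that depends on nothing above it:

    #include <cstddef>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Storage layer only: turn UTF-8 bytes into code points. Knows nothing
    // about collation (processing) or BIDI/caret movement (display).
    std::vector<char32_t> decode_utf8(const std::string& bytes) {
        std::vector<char32_t> out;
        for (std::size_t i = 0; i < bytes.size();) {
            unsigned char b = static_cast<unsigned char>(bytes[i]);
            char32_t cp = 0;
            int extra = 0;  // continuation bytes that follow the lead byte
            if      (b < 0x80) { cp = b; }
            else if (b < 0xC0) { throw std::runtime_error("stray continuation byte"); }
            else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
            else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
            else               { cp = b & 0x07; extra = 3; }
            if (i + extra >= bytes.size())
                throw std::runtime_error("truncated UTF-8 sequence");
            // Overlong-form and surrogate checks omitted in this sketch.
            for (int k = 1; k <= extra; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
            out.push_back(cp);
            i += extra + 1;
        }
        return out;
    }

Collation and display code can then consume the code points without caring
how they were stored, which is what makes the one-step-at-a-time plan
workable.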
> As I said to Phil, Rogier and I completed a Unicode character library for
> release under Boost, but never submitted it to Boost as we had intended to
> release it with a string library built on it, and never had time to do the
> second part of the work.
Post it, and we'll do the second part. It's open source.

Sebastian