Re: [boost] UTF-8 conversion etc. (Sebastian Redl)

We can implement UTF-8's and UTF-16's skip_forward by looking at the
current code unit. But does that work with all encodings? I think it doesn't
work for shift encodings, unless you're willing to come to a stop on a
shift character. I'm not: there's a rule for some shift encodings that
they *must* end in the initial shift state, which means that there's a
good chance that a shift character is the last thing in the string. This
would mean, however, that if you increment an iterator that points to
the last real character, it must scan past the shift character or it
won't compare equal to the end iterator, unless you're willing to scan
past the shift character in the equality test, which is another thing I wouldn't do.
Seems to me that shift encodings are a lot more pain than they're worth.
I really have to wonder why anyone would ever have come up with them.
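
To make the skip_forward idea concrete, here is a minimal sketch for UTF-8, assuming a byte iterator; the function names are illustrative and not taken from any existing Boost library. The lead byte alone determines the length of the sequence, so advancing never needs to decode the character:

#include <cstddef>
#include <iterator>

// Number of code units in the UTF-8 sequence that starts with 'lead'.
inline std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx: single byte (ASCII)
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx: two-byte sequence
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx: three-byte sequence
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx: four-byte sequence
    return 1;                            // invalid lead byte: step over it
}

// Advance 'it' to the start of the next encoded character.
template <typename ByteIterator>
void skip_forward(ByteIterator& it)
{
    std::advance(it, utf8_sequence_length(static_cast<unsigned char>(*it)));
}

The same test works for UTF-16 by checking whether the current unit is a high surrogate (0xD800 through 0xDBFF) and skipping two units instead of one; a shift encoding offers no such per-unit rule, which is exactly the end-iterator problem described above.
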
Sebastian,

As Unicode characters that are not in page zero can require more than 32 bits to encode them [yes really], this means that one 'character' can be very long in UTF-8/16 encoding. It is even worse if you start looking at conceptual characters [graphemes], where you can easily have three characters make up a conceptual character.

The only way I have found of handling this is to base the string functions on a proper Unicode character support library according to the Unicode spec. This means that you need character movement support, grapheme support, and sorting support.

As I said to Phil, Rogier and I completed a Unicode character library for release under Boost, but never submitted it to Boost as we had intended to release it with a string library built on it, and never had time to do the second part of the work.

Yours,
Graham
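
As an illustration of the grapheme point (the byte values below are mine, just to show the idea, not anything from Graham's library): a single user-perceived character can already be several code points, each of which is one or more code units in UTF-8.

#include <iostream>
#include <string>

int main()
{
    // U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT:
    // one grapheme ("e with acute"), two code points, three UTF-8 bytes.
    std::string e_acute = "\x65\xCC\x81";
    std::cout << e_acute.size() << " UTF-8 code units\n"; // prints: 3 UTF-8 code units
}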

On Mon, Mar 10, 2008 at 1:10 PM, Graham <Graham@system-development.co.uk> wrote:
> Sebastian,
> As Unicode characters that are not in page zero can require more than 32 bits
> to encode them [yes really] this means that one 'character' can be very long
> in UTF-8/16 encoding.

Unicode defines codepoints from 0 to 10FFFF - this can be encoded with 32 bits in UTF-8 and UTF-16.

> It is even worse if you start looking at conceptual characters [graphemes]
> where you can easily have three characters make up a conceptual character.

Normalization support would be nice, but is a huge task that is out of scope of the library (imho). This is where you have to decide if you want a full-blown Unicode library or just a small codec.

--
Cory Nelson

Graham wrote:
> As Unicode characters that are not in page zero can require more than 32 bits
> to encode them [yes really]

Unless you're talking about grapheme clusters or composite characters (are they the same thing?), not in Unicode 5. No Unicode code point needs more than one UTF-32 unit, more than two UTF-16 units (a surrogate pair) or more than 4 UTF-8 units (11110www 10xxxxxx 10yyyyyy 10zzzzzz for a total of 21 bits).
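
To make those numbers concrete, here is a small standalone sketch (the names and signatures are mine, purely for illustration): every code point up to U+10FFFF fits in at most two UTF-16 units or four UTF-8 units.

#include <cstdint>
#include <vector>

// Encode one code point (<= 0x10FFFF, not a surrogate) as UTF-16.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
    if (cp <= 0xFFFF)
        return { static_cast<std::uint16_t>(cp) };                  // one unit
    cp -= 0x10000;                                                  // 20 bits remain
    return { static_cast<std::uint16_t>(0xD800 | (cp >> 10)),       // high surrogate
             static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)) };   // low surrogate
}

// Encode one code point as UTF-8 (at most four units).
std::vector<std::uint8_t> encode_utf8(std::uint32_t cp)
{
    if (cp < 0x80)
        return { static_cast<std::uint8_t>(cp) };
    if (cp < 0x800)
        return { static_cast<std::uint8_t>(0xC0 | (cp >> 6)),
                 static_cast<std::uint8_t>(0x80 | (cp & 0x3F)) };
    if (cp < 0x10000)
        return { static_cast<std::uint8_t>(0xE0 | (cp >> 12)),
                 static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),
                 static_cast<std::uint8_t>(0x80 | (cp & 0x3F)) };
    return { static_cast<std::uint8_t>(0xF0 | (cp >> 18)),          // 11110www
             static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)), // 10xxxxxx
             static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),  // 10yyyyyy
             static_cast<std::uint8_t>(0x80 | (cp & 0x3F)) };       // 10zzzzzz
}
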
> The only way I have found of handling this is to base the string functions
> on a proper Unicode character support library according to the Unicode spec.
> This means that you need character movement support, grapheme support, and
> sorting support.

There are several issues here. One is the ability to store text in some encoding, and to convert it to Unicode code points or a different encoding. The second issue is the ability to process this text; this brings in the Unicode algorithms, like Collation. The third issue is the ability to display this text: we're talking BIDI support and, if I understand the term correctly, character movement. (Is this about moving the caret from grapheme to grapheme, taking into account BIDI and ligatures?)

The nice thing is that the dependencies go strictly upwards. Storing doesn't depend on processing, and processing doesn't depend on displaying. So it's possible to take these one step at a time.
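
A rough sketch of that layering, with entirely hypothetical names (none of this is from an existing library or proposal), might look like the following; each layer depends only on the ones below it.

#include <cstddef>

// Layer 1: storage and conversion. Holds code units in some encoding and
// exposes them as Unicode code points, or re-encodes them into another encoding.
template <typename Encoding>
class encoded_string; // e.g. encoded_string<utf8>, encoded_string<utf16>

// Layer 2: processing. Algorithms such as the Unicode Collation Algorithm,
// built only on the code-point view provided by layer 1.
template <typename Encoding>
int collate(encoded_string<Encoding> const& a,
            encoded_string<Encoding> const& b);

// Layer 3: display. BIDI reordering and caret movement over grapheme
// clusters, built on layers 1 and 2.
template <typename Encoding>
std::size_t next_caret_position(encoded_string<Encoding> const& s,
                                std::size_t current);
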
> As I said to Phil, Rogier and I completed a Unicode character library for
> release under Boost, but never submitted it to Boost as we had intended to
> release it with a string library built on it, and never had time to do the
> second part of the work.

Post it, and we'll do the second part. It's open-source.

Sebastian
participants (3)
- Cory Nelson
- Graham
- Sebastian Redl