
We can implement UTF-8's and UTF-16's skip_forward by looking at the
current code unit alone. But does that work with all encodings? I think it doesn't
work for shift encodings, unless you're willing to come to a stop on a
shift character. I'm not: there's a rule for some shift encodings that
they *must* end in the initial shift state, which means that there's a
good chance that a shift character is the last thing in the string. This
would mean, however, that if you increment an iterator that points to
the last real character, it must scan past the shift character or it
won't compare equal to the end iterator, unless you're willing to scan
past the shift in the equality test, which is another thing I wouldn't do.
Seems to me that shift encodings are a lot more pain than they're worth.
I really have to wonder why anyone would ever have come up with them.
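
For concreteness, here's roughly what I mean by skip_forward; the names are
just placeholders, and well-formed input is assumed, so the iterator never
sits on a UTF-8 continuation byte or an unpaired UTF-16 trail surrogate:

    #include <cstddef>

    // UTF-8: the lead byte alone tells you how many bytes the character takes.
    inline const char* utf8_skip_forward(const char* p)
    {
        unsigned char lead = static_cast<unsigned char>(*p);
        std::size_t len = (lead < 0x80) ? 1      // 0xxxxxxx: ASCII
                        : (lead < 0xE0) ? 2      // 110xxxxx
                        : (lead < 0xF0) ? 3      // 1110xxxx
                        :                 4;     // 11110xxx
        return p + len;
    }

    // UTF-16: a lead (high) surrogate means two code units, anything else one.
    inline const char16_t* utf16_skip_forward(const char16_t* p)
    {
        return p + ((*p >= 0xD800 && *p <= 0xDBFF) ? 2 : 1);
    }

Neither function has to look at anything but the code unit under the
iterator, which is exactly what a shift encoding can't give you.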
Sebastian,

As Unicode characters outside plane zero can require more than 16 bits to
encode [yes, really], one 'character' can be quite long in UTF-8/16
encoding. It is even worse once you start looking at conceptual characters
[graphemes], where three code points can easily make up one conceptual
character.

The only way I have found of handling this is to base the string functions
on a proper Unicode character support library written according to the
Unicode spec. That means you need character-movement support, grapheme
support, and sorting support.

As I said to Phil, Rogier and I completed a Unicode character library for
release under Boost, but we never submitted it: we had intended to release
it together with a string library built on top of it, and never had time
for that second part of the work.

Yours,
Graham
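
(As a standalone illustration of the point above, and not part of the
library Graham mentions: a short sketch showing how a single code point
outside plane zero, or a single grapheme, spans several code units.)

    #include <iostream>
    #include <string>

    int main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside plane zero, so it needs
        // four bytes in UTF-8 and a surrogate pair (two code units) in UTF-16.
        std::string    utf8  = "\xF0\x9D\x84\x9E";
        std::u16string utf16 = u"\U0001D11E";

        // One grapheme built from three code points: 'e' plus two combining
        // accents, a single conceptual character occupying five UTF-8 bytes.
        std::string grapheme = "e\xCC\x81\xCC\x82";

        std::cout << "G clef:   " << utf8.size()  << " UTF-8 bytes, "
                  << utf16.size() << " UTF-16 code units\n"
                  << "grapheme: " << grapheme.size() << " UTF-8 bytes\n";
    }

Run as-is, this reports 4 UTF-8 bytes and 2 UTF-16 code units for the G
clef, and 5 UTF-8 bytes for the single accented grapheme.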