
We can implement UTF-8's and UTF-16's skip_forward by looking at the
current code unit alone. But does that work with all encodings? I think it doesn't
work for shift encodings, unless you're willing to come to a stop on a
shift character. I'm not: there's a rule for some shift encodings that
they *must* end in the initial shift state, which means that there's a
good chance that a shift character is the last thing in the string. This
would mean, however, that if you increment an iterator that points to
the last real character, it must scan past the shift character or it
won't compare equal to the end iterator, unless you're willing to scan
past the shift in the equality test, which is another thing I wouldn't do.
Seems to me that shift encodings are a lot more pain than they're worth.
I really have to wonder why anyone would ever have come up with them.
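
For concreteness, here's roughly what I mean by skip_forward; the names are
just placeholders, and well-formed input is assumed, so the iterator never
sits on a UTF-8 continuation byte or an unpaired UTF-16 trail surrogate:

    #include <cstddef>

    // UTF-8: the lead byte alone tells you how many bytes the character takes.
    inline const char* utf8_skip_forward(const char* p)
    {
        unsigned char lead = static_cast<unsigned char>(*p);
        std::size_t len = (lead < 0x80) ? 1      // 0xxxxxxx: ASCII
                        : (lead < 0xE0) ? 2      // 110xxxxx
                        : (lead < 0xF0) ? 3      // 1110xxxx
                        :                 4;     // 11110xxx
        return p + len;
    }

    // UTF-16: a lead (high) surrogate means two code units, anything else one.
    inline const char16_t* utf16_skip_forward(const char16_t* p)
    {
        return p + ((*p >= 0xD800 && *p <= 0xDBFF) ? 2 : 1);
    }

Neither function has to look at anything but the code unit under the
iterator, which is exactly what a shift encoding can't give you.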
Sebastian,

As Unicode characters outside plane zero can require more than 16 bits to
encode [yes, really], one 'character' can be quite long in UTF-8/16
encoding. It is even worse once you start looking at conceptual characters
[graphemes], where three code points can easily make up one conceptual
character.

The only way I have found of handling this is to base the string functions
on a proper Unicode character support library written according to the
Unicode spec. That means you need character-movement support, grapheme
support, and sorting support.

As I said to Phil, Rogier and I completed a Unicode character library for
release under Boost, but we never submitted it: we had intended to release
it together with a string library built on top of it, and never had time
for that second part of the work.

Yours,
Graham
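
(As a standalone illustration of the point above, and not part of the
library Graham mentions: a short sketch showing how a single code point
outside plane zero, or a single grapheme, spans several code units.)

    #include <iostream>
    #include <string>

    int main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside plane zero, so it needs
        // four bytes in UTF-8 and a surrogate pair (two code units) in UTF-16.
        std::string    utf8  = "\xF0\x9D\x84\x9E";
        std::u16string utf16 = u"\U0001D11E";

        // One grapheme built from three code points: 'e' plus two combining
        // accents, a single conceptual character occupying five UTF-8 bytes.
        std::string grapheme = "e\xCC\x81\xCC\x82";

        std::cout << "G clef:   " << utf8.size()  << " UTF-8 bytes, "
                  << utf16.size() << " UTF-16 code units\n"
                  << "grapheme: " << grapheme.size() << " UTF-8 bytes\n";
    }

Run as-is, this reports 4 UTF-8 bytes and 2 UTF-16 code units for the G
clef, and 5 UTF-8 bytes for the single accented grapheme.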