[boost] Unicode string

7 Apr 2004

Miro Jurisic wrote:
...
On the other hand, in order to manipulate a Unicode string without
violating constraints on well-formedness, you have to consider the string
as a sequence of abstract characters (unless, of course, you constrain
yourself to string transformations which operate on code point sequences
yet guarantee that strings remain well-formed; there are few such
transformations -- concatenation is one of them under certain constraints).
[snip]
...
capital letter C; combining caron; lowercase letter e
it contains two abstract characters, but three UCS4 code points; therefore,
removing the first character from that string means removing the first two
code points of three. Removing just the first code point would leave you
with a combining caron followed by a lowercase letter e, which is not a
well-formed Unicode string.

Hi Miro,

so the point is that when a string is used as a container of code points,
even searching for and removing a character or substring might leave an
invalid string? E.g. when looking for the string 'foo', you could in theory
find 'foo' followed by a combining character, and removing just 'foo' would
make the string invalid?
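To make the 'foo' case concrete, here is a minimal C++11 sketch. It assumes
the text is held as UTF-32 code points in a std::u32string, and it treats
only the range U+0300..U+036F as combining marks, which is a deliberate
simplification. The names is_combining, erase_naive and find_whole are made
up for this illustration and are not part of any existing library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplifying assumption: only the Combining Diacritical Marks block
// (U+0300..U+036F) counts as "combining". Real code would consult the
// full Unicode character database.
bool is_combining(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Naive removal: find `needle` as a raw code point sequence and erase
// exactly those code points. Returns the input unchanged if not found.
std::u32string erase_naive(std::u32string s, const std::u32string& needle) {
    const std::size_t pos = s.find(needle);
    if (pos != std::u32string::npos) {
        s.erase(pos, needle.size());
    }
    return s;
}

// Boundary-aware search: a match is accepted only if it does not start
// on a combining mark and is not immediately followed by one, so the
// match never cuts a base character away from its marks.
std::size_t find_whole(const std::u32string& s, const std::u32string& needle) {
    assert(!needle.empty());
    std::size_t pos = s.find(needle);
    while (pos != std::u32string::npos) {
        const std::size_t end = pos + needle.size();
        const bool starts_clean = !is_combining(s[pos]);
        const bool ends_clean = (end == s.size()) || !is_combining(s[end]);
        if (starts_clean && ends_clean) {
            return pos;
        }
        pos = s.find(needle, pos + 1);  // try the next candidate
    }
    return std::u32string::npos;
}
```

With the text U"foo\u0301 foo" (the second 'o' carries a combining acute
accent), erase_naive removes the first three code points and leaves a string
that begins with an orphaned combining mark, while find_whole skips that
match and reports the plain 'foo' further on.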
...
basic_string is not the abstraction you are looking for, but it's also the
only one that is readily available in STL/boost today. It may serve as a
good starting point (questionable, IMNSHO), but it should most definitely
not be treated as the right thing to use for Unicode in the long term.

I wonder what the right abstraction is, then. Is it necessary to have a class
that represents an abstract character together with all of its combining
characters?
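One possible shape for such a class, purely as a sketch: a base code point
plus the combining marks attached to it, with the string stored as a
sequence of these units. Everything here (abstract_char, to_abstract_chars,
erase_first, to_code_points) is a hypothetical name, and the combining test
again covers only U+0300..U+036F, so this is nowhere near a complete
treatment of Unicode text segmentation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical unit: one base code point and its trailing combining marks.
struct abstract_char {
    char32_t base;
    std::u32string marks;
};

// Simplifying assumption: only U+0300..U+036F is treated as combining.
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Group code points so that each combining mark is attached to the
// preceding base. A mark with no preceding base becomes its own unit.
std::vector<abstract_char> to_abstract_chars(const std::u32string& s) {
    std::vector<abstract_char> out;
    for (char32_t cp : s) {
        if (is_combining_mark(cp) && !out.empty()) {
            out.back().marks.push_back(cp);
        } else {
            out.push_back(abstract_char{cp, std::u32string()});
        }
    }
    return out;
}

// Remove the first abstract character, marks included.
void erase_first(std::vector<abstract_char>& chars) {
    assert(!chars.empty());
    chars.erase(chars.begin());
}

// Flatten the units back into a plain code point sequence.
std::u32string to_code_points(const std::vector<abstract_char>& chars) {
    std::u32string s;
    for (const abstract_char& c : chars) {
        s.push_back(c.base);
        s += c.marks;
    }
    return s;
}
```

For the sequence capital C, combining caron, lowercase e, to_abstract_chars
yields two units, and erase_first followed by to_code_points leaves just the
'e', with no stray caron. The cost is one small string per character, which
may or may not be acceptable depending on the use.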

- Volodya

Vladimir Prus