
Miro Jurisic wrote:
On the other hand, in order to manipulate a Unicode string without violating constraints on well-formedness, you have to consider the string as a sequence of abstract characters (unless, of course, you constrain yourself to string transformations which operate on code point sequences yet guarantee that strings remain well-formed; there are few such transformations -- concatenation is one of them under certain constraints).
[snip]
capital letter C; combining caron; lowercase letter e
it contains two abstract characters, but three UCS4 code points; therefore, removing the first character from that string means removing the first two code points of three. Removing just the first code point would leave you with a combining caron followed by a lowercase letter e, which is not a well-formed Unicode string.
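The C / caron / e example can be sketched in code. This is only an illustration under a loud simplifying assumption: `is_combining` treats just the block U+0300..U+036F as combining, whereas a real implementation would consult the Unicode character database. The helper names are hypothetical and do not come from the STL or boost.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// ASSUMPTION: only U+0300..U+036F (Combining Diacritical Marks) counts as
// combining. Real code would look this up in Unicode character data.
bool is_combining(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Index one past the end of the character that starts at `start`:
// one base code point plus every combining mark that directly follows it.
std::size_t character_end(const std::u32string& s, std::size_t start) {
    assert(start <= s.size());
    if (start == s.size()) return start;
    std::size_t end = start + 1;
    while (end < s.size() && is_combining(s[end])) ++end;
    return end;
}

// Remove the first character (base plus marks), not just one code point.
std::u32string remove_first_character(std::u32string s) {
    s.erase(0, character_end(s, 0));
    return s;
}

// True if the string begins with a combining mark that has no base.
bool starts_with_orphan_mark(const std::u32string& s) {
    return !s.empty() && is_combining(s[0]);
}
```

For the three code points C, U+030C, e, `remove_first_character` erases two code points and leaves just the e, while a plain `erase(0, 1)` leaves a string for which `starts_with_orphan_mark` is true.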
Hi Miro, so the point is that when a string is treated as a container of code points, even searching for and removing a character or substring can produce an invalid string? E.g. when looking for the string 'foo', you could in theory find 'foo' followed by a combining character, and removing just 'foo' would leave an invalid string?
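To make the 'foo' case concrete, here is a rough sketch of a search that rejects a match when the code point right after it is a combining mark. It rests on the same simplifying assumption (only U+0300..U+036F is treated as combining), and `find_whole_characters` is a made-up name, not an existing STL or boost function.

```cpp
#include <cstddef>
#include <string>

// ASSUMPTION: only U+0300..U+036F is treated as combining (illustration only).
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Like std::u32string::find, but skips any match that is immediately
// followed by a combining mark, since removing such a match would strand
// the mark. Returns npos if no acceptable match exists.
std::size_t find_whole_characters(const std::u32string& text,
                                  const std::u32string& needle) {
    if (needle.empty()) return std::u32string::npos;
    std::size_t pos = text.find(needle);
    while (pos != std::u32string::npos) {
        std::size_t end = pos + needle.size();
        bool splits_character =
            end < text.size() && is_combining_mark(text[end]);
        if (!splits_character) return pos;
        pos = text.find(needle, pos + 1);  // try the next candidate
    }
    return std::u32string::npos;
}
```

With the code points f, o, o, U+0301, space, f, o, o, a plain `find` of 'foo' reports index 0, whereas this sketch skips that match and reports index 5.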
basic_string is not the abstraction you are looking for, but it's also the only one that is readily available in STL/boost today. It may serve as a good starting point (questionable, IMNSHO), but it should most definitely not be treated as the right thing to use for Unicode in the long term.
I wonder what the right abstraction is, then. Is it necessary to have a class that represents an abstract character together with all of its combining characters? - Volodya
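One possible shape for such a class, purely as a sketch and not as a proposal anyone in this thread has made: a hypothetical `Character` that holds a base code point together with its combining marks, plus functions to split a code point string into such characters and join them back. The combining test is again the simplifying U+0300..U+036F assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical type: one base code point followed by its combining marks.
struct Character {
    std::u32string code_points;
};

// ASSUMPTION: only U+0300..U+036F is treated as combining (illustration only).
bool is_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Split a code point sequence into characters. A mark at the very start
// has no base, so it ends up as a character of its own.
std::vector<Character> split_characters(const std::u32string& s) {
    std::vector<Character> out;
    for (char32_t cp : s) {
        if (out.empty() || !is_mark(cp)) out.push_back(Character{});
        out.back().code_points.push_back(cp);
    }
    return out;
}

// Concatenate characters back into a flat code point sequence.
std::u32string join_characters(const std::vector<Character>& cs) {
    std::u32string s;
    for (const Character& c : cs) s += c.code_points;
    return s;
}
```

Erasing the first element of the vector returned by `split_characters` for C, U+030C, e and joining the rest yields just e, so the container's element type already enforces the boundary that code point indexing ignores.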