
In article <077097E85A6BD3119E910800062786A90B3F2D1C@muc-mail5.ixos.de>, Ferdinand Prantl <ferdinand.prantl@ixos.de> wrote:
> I am afraid there is no universal solution for all users. The easiest solution is based on the native basic_string<>, which is specialized for char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit) usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding would require another basic_string<> specialization.
Assuming that basic_string<> is an appropriate abstraction for Unicode strings is a fallacy; the rest of your post, and some other parts of this thread, seem to make that assumption. It is false because basic_string<T> has performance guarantees which make it incompatible with a useful Unicode string abstraction.

In order to maintain the performance guarantees of basic_string, you have to treat a Unicode string as a sequence of code points, rather than as a sequence of abstract characters. On the other hand, in order to manipulate a Unicode string without violating constraints on well-formedness, you have to consider the string as a sequence of abstract characters (unless, of course, you constrain yourself to string transformations which operate on code point sequences yet guarantee that strings remain well-formed; there are few such transformations -- concatenation is one of them under certain constraints).

It should be noted that basic_string<ucs4char_t> is as misguided an idea as basic_string<utf8char_t>, because even in UCS4 an abstract character might consist of more than one code point. For example, consider the string

    capital letter C; combining caron; lowercase letter e

It contains two abstract characters, but three UCS4 code points; therefore, removing the first character from that string means removing the first two of its three code points. Removing just the first code point would leave you with a combining caron followed by a lowercase letter e, which is not a well-formed Unicode string. (Yes, I know that this particular string could also be written in a canonically precomposed form in which there is indeed one code point per abstract character, but that is not true of all Unicode strings which include combining marks; I am just too lazy to find out exactly which aren't.)

To summarize: basic_string<ucs4char_t> solves very few problems compared to basic_string<utf8char_t>.
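The C / combining caron / e example can be sketched in code. This is only an illustration under stated assumptions: std::u32string (C++11) stands in for basic_string<ucs4char_t>, and is_combining_mark, first_character_length, remove_first_code_point and remove_first_character are hypothetical helpers, not part of any library. The combining-mark test covers only the U+0300..U+036F block; a real implementation would have to consult the Unicode character data.

```cpp
#include <cstddef>
#include <string>

// ASSUMPTION: hypothetical, deliberately incomplete test. It recognizes only
// the Combining Diacritical Marks block (U+0300..U+036F).
bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Number of code points making up the first abstract character:
// one base code point plus any combining marks that follow it.
std::size_t first_character_length(const std::u32string& s) {
    if (s.empty()) return 0;
    std::size_t n = 1;
    while (n < s.size() && is_combining_mark(s[n])) ++n;
    return n;
}

// What basic_string gives you directly: erase by code point index.
std::u32string remove_first_code_point(std::u32string s) {
    if (!s.empty()) s.erase(0, 1);
    return s;
}

// What a Unicode string abstraction would need: erase a whole
// abstract character, however many code points it occupies.
std::u32string remove_first_character(std::u32string s) {
    s.erase(0, first_character_length(s));
    return s;
}
```

For a string s holding U+0043 (C), U+030C (combining caron), U+0065 (e), s.size() is 3 even though there are two abstract characters. remove_first_code_point(s) leaves the caron dangling at the front of the string, while remove_first_character(s) leaves just the e. Note that first_character_length has to scan the string, which is exactly the kind of cost the basic_string interface does not account for.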
Do not be fooled into thinking that the complexities of Unicode can be swept under the UCS4 rug. basic_string is not the abstraction you are looking for, but it's also the only one that is readily available in STL/boost today. It may serve as a good starting point (questionable, IMNSHO), but it should most definitely not be treated as the right thing to use for Unicode in the long term.

meeroh
--
If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>