[boost] Re: [program_options] Re: Unicode support

6 Apr 2004

In article <077097E85A6BD3119E910800062786A90B3F2D1C@muc-mail5.ixos.de>,
Ferdinand Prantl <ferdinand.prantl@ixos.de> wrote:
...
> I am afraid there is no universal solution for all users. The easiest
> solution is based on the native basic_string<>, which is specialized for
> char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit)
> usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding would
> require another basic_string<> specialization.

Assuming that basic_string<> is an appropriate abstraction for Unicode strings
is a fallacy; the rest of your post, and some other parts of this thread, seem
to make that assumption.

The reason is that basic_string<T> makes performance guarantees (such as
constant-time access to any element) that are incompatible with a useful
Unicode string abstraction.

In order to maintain the performance guarantees of basic_string, you have to
treat a Unicode string as a sequence of code points (or, for UTF-8 and UTF-16,
of code units), rather than as a sequence of abstract characters.
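
A rough sketch of what that means in practice, assuming UTF-8 stored in a
plain std::string. The function names count_code_points, code_point_offset
and precomposed_example are made up for this illustration, and the code
assumes its input is already well-formed UTF-8. size() and operator[] stay
cheap because they work on bytes; anything expressed in code points needs a
linear scan.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold well-formed UTF-8 by
// counting every byte that is not a continuation byte (10xxxxxx).
// Linear in the length of the string.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Byte offset at which code point number `index` starts. Also a linear
// scan, unlike utf8[index], which is constant time but indexes bytes.
std::size_t code_point_offset(const std::string& utf8, std::size_t index) {
    assert(index < count_code_points(utf8));  // precondition
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) {
            if (seen == index) return i;
            ++seen;
        }
    }
    return utf8.size();  // not reached when the precondition holds
}

// U+010C LATIN CAPITAL LETTER C WITH CARON (bytes C4 8C), then U+0065 'e'.
// The literal is split so that 'e' is not read as part of the hex escape.
std::string precomposed_example() {
    return std::string("\xC4\x8C" "e");
}
```

For the string returned by precomposed_example(), size() reports three
elements although there are only two code points, and the second code point
starts at byte offset 2, not at index 1.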

On the other hand, in order to manipulate a Unicode string without violating 
constraints on well-formedness, you have to consider the string as a sequence of 
abstract characters (unless, of course, you constrain yourself to string 
transformations which operate on code point sequences yet guarantee that strings 
remain well-formed; there are few such transformations -- concatenation is one 
of them under certain constraints).

It should be noted that basic_string<ucs4char_t> is as misguided an idea as 
basic_string<utf8char_t>, because even in UCS4 an abstract character might 
consist of more than one code point; for example, if you consider the string

capital letter C; combining caron; lowercase letter e

it contains two abstract characters, but three UCS4 code points; therefore, 
removing the first character from that string means removing the first two code 
points of three. Removing just the first code point would leave you with a 
combining caron followed by a lowercase letter e, which is not a well-formed 
Unicode string.
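
The example above can be sketched in code, assuming a C++11 compiler and
using std::u32string (that is, basic_string<char32_t>) as a stand-in for
basic_string<ucs4char_t>. All of the function names below are hypothetical,
and is_combining_mark only knows about the range U+0300..U+036F; a real
implementation would take its character properties from the Unicode
Character Database rather than from one hard-coded range.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true for code points in the Combining Diacritical
// Marks block (U+0300..U+036F) only. A simplification for this sketch.
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Length, in code points, of the first abstract character of s: one base
// code point plus any combining marks that immediately follow it.
std::size_t first_character_length(const std::u32string& s) {
    if (s.empty()) return 0;
    std::size_t n = 1;
    while (n < s.size() && is_combining_mark(s[n])) ++n;
    return n;
}

// Naive removal: drops exactly one element, which is all that
// basic_string::erase knows how to do.
std::u32string erase_first_code_point(std::u32string s) {
    s.erase(0, 1);
    return s;
}

// Removal that keeps a base code point and its combining marks together.
std::u32string erase_first_character(std::u32string s) {
    const std::size_t n = first_character_length(s);
    assert(n <= s.size());
    s.erase(0, n);
    return s;
}

// capital letter C; combining caron; lowercase letter e
std::u32string example_string() {
    return std::u32string{U'C', U'\u030C', U'e'};
}
```

With this input, erase_first_code_point returns a string that begins with
the combining caron, while erase_first_character returns just the e. Neither
size() nor operator[] on the original string says anything about where the
first abstract character ends.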

(Yes, I know that this particular string could also be written in a canonically
precomposed form in which there is indeed one code point per abstract character,
but not every Unicode string that includes combining marks has such a form; I am
just too lazy to find out exactly which ones don't.)

To summarize:

basic_string<ucs4char_t> solves very few problems compared to 
basic_string<utf8char_t>. Do not be fooled into thinking that the complexities 
of Unicode can be swept under the UCS4 rug.

basic_string is not the abstraction you are looking for, but it's also the only 
one that is readily available in STL/boost today. It may serve as a good 
starting point (questionable, IMNSHO), but it should most definitely not be 
treated as the right thing to use for Unicode in the long term.

meeroh

-- 
If this message helped you, consider buying an item
from my wish list: <http://web.meeroh.org/wishlist>

Miro Jurisic