RE: [boost] Re: [program_options] Re: Unicode support

Hello,
From: Vladimir Prus [mailto:ghost@cs.msu.su]
glib has a very good implementation of UTF-8 handling, and Glibmm is a well-done C++ wrapper, but it lacks "standardness". Something like boost::ustring COULD bring a widely accepted, UTF-8 aware Unicode string to C++ programmers. A somewhat relieving thought.
I am not exactly sure whether UTF-8 or UCS-4 is better as a universal solution, but some solution is surely needed.
I am afraid there is no universal solution for all users. The easiest solution is based on the native basic_string<>, which is specialized for char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit), usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding would require another basic_string<> specialization. UCS-2 held all characters in Unicode 1.1; more unique numbers were needed, and UCS-4 was introduced in Unicode 2.0. Unfortunately there is no 4-byte character specialization for basic_string<> in the STL yet.

Generally speaking, UTF encodings (e.g. UTF-8 or UTF-16) are not suitable for fast in-memory usage, because some operations, for example at(int) and length(), are not O(1) but O(n). They (UTF, T = Transformation) are better for storing texts, as they save space by using a variable number of bytes per character. UCS encodings (e.g. UCS-2 or UCS-4), however, are fast for in-memory operations because they use a fixed character size. That is why I would not like to use basic_string<utf8char> in memory, but rather basic_string<ucs4char>; I would not generalize that for all possible applications, though.

You can expect initialization from (const char * argv []) on all platforms, or from (const wchar_t * argv []) in UCS-2 on Windows. With basic_string<> you already have support for parameters in the current locale (char) and for parameters in UCS-2. If we take the option of reading parameters into basic_string<wchar_t> or basic_string<ucs4char_t>, where the character size or encoding is not the same as the native encoding on the command line, there is an affinity to streams. Some shells allow UTF-8 encoded parameters or, generally, characters outside the current locale. This means a program can choose how to encode all characters from Unicode to char: UTF-7/8, etc. I would like to have a solution similar to streams: imbue().
Having this, you could convert internally every argv[x] using imbue(y) applied to a stringstream, where the caller provides the facet y. The caller could choose the target character capabilities by providing a basic_string<> specialization. On the other hand, such a conversion can also be performed by the user. The parameters sent to main() are char* or wchar_t*, and thus program_options can give them back just as they are, in basic_string<char> and basic_string<wchar_t>. The client can use his facet to imbue a basic_stringstream<> initialized with the parameter from program_options. Or a conversion library could be used to perform the conversion, something like lexical_cast<> does for type conversions. It is a matter of convenience only: a separate conversion library (no support for encodings in program_options), or imbue(), performing the conversion inside program_options. However, the conversion should not be implemented for program_options only; that is why I suggested an existing interface - facets.
Or did I miss something? Is something like this part of boost already?
Nope :-( Even a UTF-8 encoder is not in Boost yet.
You can find some converting facets for UTF-8, ready for imbue() into a stream, in the files section on Yahoo. Unfortunately not finished or not reviewed... Ferda
- Volodya

Hi Ferdinand,
I am not exactly sure whether UTF-8 or UCS-4 is better as a universal solution, but some solution is surely needed.
I am afraid there is no universal solution for all users. The easiest solution is based on the native basic_string<>, which is specialized for char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit), usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding would require another basic_string<> specialization.
UCS-2 held all characters in Unicode 1.1; more unique numbers were needed, and UCS-4 was introduced in Unicode 2.0. Unfortunately there is no 4-byte character specialization for basic_string<> in the STL yet.
Or, to be exact, there's no agreement on whether wchar_t should be 32-bit or 16-bit. Linux (or gcc specifically) uses 32 bits, and Windows 16, which means wstring is only suitable for UCS-2 there. Besides, UTF-16 (as opposed to plain UCS-2) has a mechanism, surrogate pairs, to represent characters outside the 16-bit space with two elements, which, I suspect, won't work transparently if wchar_t is 16 bits.
Generally speaking, UTF encodings (e.g. UTF-8 or UTF-16) are not suitable for fast in-memory usage, because some operations, for example at(int) and length(), are not O(1) but O(n).
I believe UTF-8 is popular partly because such operations are believed to be rare.
They (UTF, T = Transformation) are better for storing texts, as they save space by using a variable number of bytes per character. UCS encodings (e.g. UCS-2 or UCS-4), however, are fast for in-memory operations because they use a fixed character size.
What about representing values with two 16-bit values (that's what I've mentioned above)? (BTW, for reference to those interested, http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf talks about different encodings).
That is why I would not like to use basic_string<utf8char> in memory, but rather basic_string<ucs4char>; I would not generalize that for all possible applications, though.
In the case of program options, I suspect that everything will work for UTF-8 strings. IOW, the library does not care that any given 'char' might be part of some Unicode character. That's why it's attractive to me.
You can expect initialization from (const char * argv []) on all platforms, or from (const wchar_t * argv []) in UCS-2 on Windows. With basic_string<> you already have support for parameters in the current locale (char) and for parameters in UCS-2.
What do you mean by 'parameters from the current locale'? I am not sure that ctype::widen is required to care about the user-selected character encoding. Nor do I think that is required of the default codecvt facet. If my locale on Linux is ru_RU.KOI8-R, I don't think the standard requires the codecvt<wchar_t, char> facet in the default instance of 'locale' to do a meaningful conversion from KOI8-R into Unicode.
If we take the option of reading parameters into basic_string<wchar_t> or basic_string<ucs4char_t>, where the character size or encoding is not the same as the native encoding on the command line, there is an affinity to streams. Some shells allow UTF-8 encoded parameters or, generally, characters outside the current locale. This means a program can choose how to encode all characters from Unicode to char: UTF-7/8, etc. I would like to have a solution similar to streams: imbue(). Having this, you could convert internally every argv[x] using imbue(y) applied to a stringstream, where the caller provides the facet y.
That's right. There should be some mechanism to convert from the current 8-bit encoding, and from the source code encoding, into Unicode. At least on Linux there's a mechanism for the former, but I'm not aware of one on Windows.
On the other hand, such a conversion can also be performed by the user. The parameters sent to main() are char* or wchar_t*, and thus program_options can give them back just as they are, in basic_string<char> and basic_string<wchar_t>. The client can use his facet to imbue a basic_stringstream<> initialized with the parameter from program_options. Or a conversion library could be used to perform the conversion, something like lexical_cast<> does for type conversions. It is a matter of convenience only: a separate conversion library (no support for encodings in program_options), or imbue(), performing the conversion inside program_options. However, the conversion should not be implemented for program_options only; that is why I suggested an existing interface - facets.
I agree with basing the mechanism on facets. OTOH, it really can be made orthogonal to program_options. So initially program_options could support only ASCII (strict 7-bit) and Unicode, for which the conversion is trivial.
Or did I miss something? Is something like this part of boost already?
Nope :-( Even a UTF-8 encoder is not in Boost yet.
You can find some converting facets for UTF-8, ready for imbue() into a stream, in the files section on Yahoo. Unfortunately not finished or not reviewed...
Yea, I can find some facets, including one written by myself ;-( And yea, unfortunately they are not in Boost yet. - Volodya

Ferdinand Prantl wrote:
Hello,
From: Vladimir Prus [mailto:ghost@cs.msu.su]
glib has a very good implementation of UTF-8 handling, and Glibmm is a well-done C++ wrapper, but it lacks "standardness". Something like boost::ustring COULD bring a widely accepted, UTF-8 aware Unicode string to C++ programmers. A somewhat relieving thought.
I am not exactly sure whether UTF-8 or UCS-4 is better as a universal solution, but some solution is surely needed.
I am afraid there is no universal solution for all users. The easiest solution is based on the native basic_string<>, which is specialized for char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit), usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding would require another basic_string<> specialization.
UCS-2 held all characters in Unicode 1.1; more unique numbers were needed, and UCS-4 was introduced in Unicode 2.0. Unfortunately there is no 4-byte character specialization for basic_string<> in the STL yet.
Technically, there isn't a 2-byte specialization either; wchar_t might not be 16 bits. Bob

In article <077097E85A6BD3119E910800062786A90B3F2D1C@muc-mail5.ixos.de>, Ferdinand Prantl <ferdinand.prantl@ixos.de> wrote:
I am afraid there is no universal solution for all users. The easiest solution is based on the native basic_string<>, which is specialized for char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit) usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding would require another basic_string<> specialization.
Assuming that basic_string<> is an appropriate abstraction for Unicode strings is a fallacy; the rest of your post, and some other parts of this thread, seem to make that assumption. The reason this is false is that basic_string<T> has performance guarantees which make it incompatible with a useful Unicode string abstraction. In order to maintain the performance guarantees of basic_string, you have to treat a Unicode string as a sequence of code points, rather than as a sequence of abstract characters. On the other hand, in order to manipulate a Unicode string without violating constraints on well-formedness, you have to consider the string as a sequence of abstract characters (unless, of course, you constrain yourself to string transformations which operate on code point sequences yet guarantee that strings remain well-formed; there are few such transformations -- concatenation is one of them, under certain constraints).

It should be noted that basic_string<ucs4char_t> is as misguided an idea as basic_string<utf8char_t>, because even in UCS4 an abstract character might consist of more than one code point. For example, the string "capital letter C, combining caron, lowercase letter e" contains two abstract characters but three UCS4 code points; therefore, removing the first character from that string means removing the first two code points of three. Removing just the first code point would leave you with a combining caron followed by a lowercase letter e, which is not a well-formed Unicode string. (Yes, I know that this particular string could also be written in a canonically precomposed form in which there is indeed one code point per abstract character, but that is not true of all Unicode strings which include combining marks; I am just too lazy to find out exactly which aren't.)

To summarize: basic_string<ucs4char_t> solves very few problems compared to basic_string<utf8char_t>.
Do not be fooled into thinking that the complexities of Unicode can be swept under the UCS4 rug. basic_string is not the abstraction you are looking for, but it is also the only one readily available in the STL/Boost today. It may serve as a good starting point (questionable, IMNSHO), but it should most definitely not be treated as the right thing to use for Unicode in the long term.

meeroh -- If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>
participants (4)
- Ferdinand Prantl
- Miro Jurisic
- Robert Bell
- Vladimir Prus