[boost] RE: Re: [program_options] Re: Unicode support

6 Apr 2004

      Hi Ferdinand,
...
...
I am not exactly sure if UTF-8 or UCS-4 is better as
universal solution, but some solution is surely needed.
I am afraid there is no universal solution for all users. The easiest
solution is based on the native basic_string<>, which is specialized for
char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit)
usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding would
require another basic_string<> specialization.
UCS-2 held all characters in Unicode 1.1. There was a need for more unique
numbers, and UCS-4 was introduced in Unicode 2.0. Unfortunately there is no
4-byte character specialization for basic_string<> in STL yet.
Or, to be exact, there's no agreement on whether wchar_t should be 32-bit or
16-bit. Linux (or gcc specifically) uses 32 bits, and Windows 16, which means
wstring is only suitable for UCS-2 there. Besides, UCS-2 has a mechanism to
represent characters outside of 16-bit space with two elements, which, I
suspect, won't work if wchar_t is 16 bit.
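For reference, the two-element mechanism is what the Unicode standard calls surrogate pairs, and it belongs to UTF-16 rather than to UCS-2 proper. A minimal sketch of the arithmetic, assuming a valid code point as input (the name to_utf16 is made up for illustration and the fixed-width types need a reasonably recent compiler):

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Assumption: cp is a valid scalar value (<= 0x10FFFF and not a surrogate).
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // Fits in a single 16-bit element.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Subtract 0x10000, leaving 20 bits split across two elements.
        std::uint32_t v = cp - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return units;
}
```

With a 16-bit wchar_t, such a character occupies two elements of a wstring, so size() and operator[] count elements and not characters.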
...
Generally speaking, UTF encodings (e.g. UTF-8 or UTF-16) are not suitable
for fast in-memory usage because some operations, for example at(int)
and length(), take O(n) rather than O(1) time.
I believe UTF-8 is popular partly because such operations are believed to be
rare.
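To make the O(n) point concrete, here is a rough sketch of a character count for a UTF-8 string. It assumes well-formed input and does no validation; utf8_length is a name invented for this example (range-based for needs C++11):

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is
// not a continuation byte (continuation bytes look like 10xxxxxx).
// Assumption: the input is well-formed UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;  // lead byte or plain ASCII
    }
    return n;
}
```

The whole string has to be scanned, whereas s.size() on the same string returns the byte count in constant time.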
...
They (UTF,
T=Transformation) are better for storing texts as they save space by using
a variable number of bytes per character. However, UCS encodings (e.g. UCS-2
or UCS-4) are fast for in-memory operations because they use a fixed
character size.
What about representing values with two 16-bit values (that's what I've
mentioned above)?

(BTW, for reference to those interested,   
     http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf
talks about different encodings).
...
That is why I would not like to use basic_string<utf8char> in
memory, rather basic_string<ucs4char> instead, but I would not generalize
it for all possible applications.
In case of program options, I suspect that everything will work for UTF-8
strings. IOW, the library does not care that any given 'char' might be part
of some Unicode character. That's why it's attractive to me.
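A small sketch of why byte-wise parsing should be safe for UTF-8: every byte of a multibyte UTF-8 sequence has its high bit set, so an ASCII delimiter such as '=' cannot appear inside a character. split_option below is a hypothetical helper, not the program_options interface:

```cpp
#include <string>
#include <utility>

// Split "--name=value" at the first '=' without decoding anything.
// Hypothetical helper for illustration only.
std::pair<std::string, std::string> split_option(const std::string& arg) {
    std::string::size_type eq = arg.find('=');
    if (eq == std::string::npos) {
        return std::make_pair(arg, std::string());  // no value given
    }
    return std::make_pair(arg.substr(0, eq), arg.substr(eq + 1));
}
```

The value comes back as the same UTF-8 bytes that were passed in, which the caller can decode or not as it sees fit.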
...
You can expect initialization from (const char * argv []) on all platforms
or (const wchar_t * argv []) on Windows in UCS-2. With the basic_string<>
you already have support for the parameters from the current locale (char)
and for parameters in UCS-2.
What do you mean by 'parameters from the current locale'? I am not sure that
ctype::widen is required to care about user-selected character encoding.
Nor do I think it's required of the default codecvt facet.

If my locale on Linux is ru_RU.KOI8-R, I don't think the standard requires
the codecvt<wchar_t, char> facet in the default instance of 'locale' to do a
meaningful conversion from KOI8-R into Unicode.
...
If we take an option to read parameters into basic_string<wchar_t> or
basic_string<ucs4char_t>, where the character size or encoding is not the
same as the native encoding on the command line, there is an affinity to
streams. Some shells allow usage of UTF-8 encoded parameters or,
generally, usage of characters outside the current locale. That means a
program can choose how to encode all Unicode characters into char: UTF-7,
UTF-8, etc. I would like to have a solution similar to streams: imbue().
Having this, you could convert every argv[x] internally using imbue(y)
applied to a stringstream, where the caller provides the facet y.
That's right. There should be some mechanism to convert from the
current 8-bit encoding and the source code encoding into Unicode. At least
on Linux there's a mechanism to do the first thing, but I'm not aware of one
on Windows.
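As a rough illustration of the facet-based idea, the sketch below calls the codecvt facet of a caller-supplied locale directly. As far as I know, the standard string streams do not apply codecvt by themselves (only file buffers do), so a direct call is used here. to_wide is a hypothetical helper name, and what the facet actually does for a given locale is implementation-dependent:

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Convert a narrow string to a wide one using the codecvt facet found in
// the locale supplied by the caller. Hypothetical helper for illustration.
std::wstring to_wide(const std::string& s, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    // Each wide character consumes at least one byte, so s.size() is enough room.
    std::wstring out(s.size(), L'\0');
    if (s.empty()) return out;

    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;
    std::codecvt_base::result r = cvt.in(
        state, s.data(), s.data() + s.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r == std::codecvt_base::noconv) {
        // The facet declined to convert: fall back to plain per-char widening.
        for (std::size_t i = 0; i < s.size(); ++i)
            out[i] = static_cast<wchar_t>(static_cast<unsigned char>(s[i]));
        return out;
    }
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("conversion failed or input was incomplete");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

A caller would pass each argv[i] together with the locale of its choice, for example std::locale("") for the user's environment, if the platform provides a usable facet for it.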
...
On the other hand, such a conversion can be performed also by a user. The
parameters sent to main() are char* or wchar_t* and thus program_options
can give them back just as they are in basic_string<char> and
basic_string<wchar_t>. The client can use his facet to imbue a
basic_stringstream<> initialized with the parameter from program_options.
Or a conversion library could be used to perform the conversion, something
like what lexical_cast<> does for type conversions. It is a matter of
convenience only: a separate conversion library (no support for encoding in
program_options) or imbue(), performing the conversion inside
program_options. However, the conversion should not be implemented for
program_options only; that is why I suggested an existing interface,
facets.
I agree with basing the mechanism on facets. OTOH, it really can be made
orthogonal to program options. So initially program_options can support
only ASCII (strict 7-bit) and Unicode, for which conversion is trivial.
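The "trivial" conversion could look roughly like this, given that the first 128 Unicode code points coincide with ASCII. ascii_to_wide is a name invented for the example, and rejecting non-ASCII bytes with an exception is just one possible policy:

```cpp
#include <stdexcept>
#include <string>

// Widen a strict 7-bit ASCII string. Each byte maps to the Unicode code
// point with the same value; anything above 0x7F is rejected.
// Hypothetical helper for illustration only.
std::wstring ascii_to_wide(const std::string& s) {
    std::wstring out;
    out.reserve(s.size());
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (c > 0x7F) {
            throw std::invalid_argument("non-ASCII byte in input");
        }
        out.push_back(static_cast<wchar_t>(c));
    }
    return out;
}
```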
...
...
...
Or did I miss something? Is something like this part of
boost already?
Nope :-( Even a UTF-8 encoder is not in Boost yet.
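For those wondering how much work such an encoder is, here is a minimal sketch of the bit layout. append_utf8 is a made-up name; this is not any existing Boost interface, and a real facet would need more care around errors and partial input:

```cpp
#include <stdexcept>
#include <string>

// Append the UTF-8 encoding of one code point to out.
// Surrogates and values above 0x10FFFF are rejected.
void append_utf8(std::string& out, unsigned long cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::invalid_argument("not a valid Unicode scalar value");
    }
    if (cp < 0x80) {                    // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {            // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {          // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                            // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```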
You can find some converting facets for UTF-8, ready to imbue() into a
stream, in the files section on Yahoo. Unfortunately they are not finished
or not reviewed...
Yea, I can find some facets, including one written by myself ;-( And yea,
unfortunately they are not in Boost yet.

- Volodya

Vladimir Prus