
Hi Ferda,
Linux (or gcc specifically) uses 32 bits, and Windows 16, which means wstring there is only suitable for UCS-2.
Wow, I did not know that about gcc, what a smart compiler... ;-)
MSVC usually does it because the Win32 API is ANSI/UCS-2, so the usage is straightforward then.
Maybe gcc is just not burdened with compatibility with older APIs ;-)
Actually, it's not UCS-2 but UTF-16 - another variable-sized character encoding, introduced in Unicode 2.0.
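Just to illustrate why UTF-16 is variable-sized, here is a minimal sketch (the particular code point is an arbitrary example) of the surrogate-pair encoding Unicode 2.0 defines for code points above U+FFFF, i.e. the case where one character needs two 16-bit units:

    #include <cstdio>

    // Surrogate-pair encoding of a code point above U+FFFF, as defined
    // for UTF-16 in Unicode 2.0.
    int main()
    {
        unsigned long cp = 0x1D11E;                  // example: MUSICAL SYMBOL G CLEF
        unsigned long v  = cp - 0x10000;             // 20-bit offset
        unsigned short hi = (unsigned short)(0xD800 + (v >> 10));   // high surrogate
        unsigned short lo = (unsigned short)(0xDC00 + (v & 0x3FF)); // low surrogate
        std::printf("U+%lX -> %04X %04X\n", cp, hi, lo); // prints: U+1D11E -> D834 DD1E
        return 0;
    }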
Thanks for the correction.
I believe UTF-8 is popular partly because such operations are believed to be rare.
Really? string.length() ? You must have been joking... :-)
No, I was not. The point is that you rarely need string.length() to return the number of real characters. You need it when doing output, but then wstring::length() is not accurate either, because there are combining characters which take no space, and there are characters which take twice as much space on screen/terminal as ASCII characters. You might also need it when working with individual Unicode characters. But if you want to find a 7-bit ASCII substring in a Unicode string, you can just iterate over all bytes -- it does not matter if you also check bytes which are part of a multi-byte Unicode character. At least, that's what I think is the case ;-) For program_options this is true.
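A minimal sketch of that last point, assuming the substring being searched for is pure 7-bit ASCII: every byte of a multi-byte UTF-8 sequence has the high bit set, so a plain byte-wise std::string::find cannot produce a false match against it.

    #include <string>
    #include <cassert>

    int main()
    {
        // "naïve --help", with the 'ï' written out as its two UTF-8 bytes.
        std::string line = "na\xC3\xAFve --help";

        // Byte-wise search still finds the pure-ASCII substring: UTF-8
        // continuation bytes (0x80-0xBF) and lead bytes (0xC0 and above)
        // can never equal a 7-bit ASCII byte, so they cannot create a
        // spurious match.
        assert(line.find("--help") != std::string::npos);
        return 0;
    }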
What about representing values with two 16-bit values (that's what I've mentioned above)?
You mean UTF-16. Yes, it is also a possibility, but with wchar_t no longer fixed-size, you would have to provide basic_string<char16_t>... I see it much the same as your UTF-8 option, because both encodings use variable-sized characters, which makes them incompatible with the shared implementation of basic_string<> for char and wchar_t.
Moreover, if basic_string<char> from the API produced a UTF-8 string, then it would only be consistent if basic_string<wchar_t> produced UTF-16 and not UCS-2. To say nothing of supporting 4-byte wchar_t and UTF-32 :-)
In fact, given what Mira said in another message, I'm entirely lost as to what the really right representation for a Unicode string is. So it's better to focus on what representation can be used inside program_options.
In the case of program options, I suspect that everything will work for UTF-8 strings. IOW, the library does not care that any given 'char' might be part of some multi-byte Unicode character. That's why it's attractive to me.
It depends on the way basic_string<> is used inside and outside of program_options. The implementation of basic_string<> is written for fixed-size characters. It does not do searching/iterating/length counting with regard to the character size, which can vary from 1 to 6 bytes in UTF-8.
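To make that concrete, here is a small sketch (the helper name is just for illustration) of what a UTF-8-aware length would have to do, compared with basic_string<char>::length(), which simply returns the number of bytes:

    #include <string>
    #include <cstdio>

    // Count code points in a UTF-8 string by skipping continuation bytes,
    // which always have the bit pattern 10xxxxxx.
    std::size_t utf8_length(const std::string& s)
    {
        std::size_t n = 0;
        for (std::string::size_type i = 0; i < s.size(); ++i)
            if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
                ++n;
        return n;
    }

    int main()
    {
        std::string s = "na\xC3\xAFve";   // 5 code points, but 6 bytes
        std::printf("%u bytes, %u code points\n",
                    (unsigned)s.size(), (unsigned)utf8_length(s));
        return 0;
    }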
Luckily, it won't expose UTF-8 strings, so if all searches use 7-bit strings we're safe.
You can expect initialization from (const char * argv []) on all platforms, or (const wchar_t * argv []) on Windows in UCS-2. With basic_string<> you already have support for parameters in the current locale (char) and for parameters in UCS-2.
What do you mean by 'parameters from the current locale'?
If you set LC_CTYPE to something like "cs_CZ.iso8859-2" on UN*Xes or choose "Czech" in the Windows Control Panel, the command line shell (well, not only it) will start to accept and deliver in (char * argv []) characters from a local alphabet too (Czech, in this example). Such a string you would have to convert into UTF-8 if you wanted an UTF-8 interface. That means you would have to do more than basic_string<char>(argv[x])...
Yes, you're right. But I think I'd still have to do a conversion. E.g. the user creates an option with a type of wstring but parses char** argv.
I am not sure that ctype::widen is required to care about the user-selected character encoding. Nor do I think that's required from the default codecvt facet.
AFAIK it must care; the method widen() needs the locale to provide an extension from char to the templated char_type.
The standard says "The char argument of do_widen is intended to accept values derived from character literals..." and the encoding of character literals usually does not change when the user changes locale. So an implementation might handle only 7-bit values.
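For reference, this is the call we are talking about; with the default locale only the 7-bit case is a safe bet, which is exactly the point in question:

    #include <locale>
    #include <cassert>

    int main()
    {
        std::locale loc;  // the default locale
        const std::ctype<wchar_t>& ct = std::use_facet< std::ctype<wchar_t> >(loc);

        // Guaranteed for characters from the basic (7-bit) character set;
        // for 8-bit characters from a national alphabet the result depends
        // on the implementation and on the locale.
        assert(ct.widen('a') == L'a');
        return 0;
    }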
The STL in MSVC correctly uses mbstowcs() to perform the conversion from the local alphabet to UCS-2. I hope that no one simply casts to wchar_t, which would be reliable only for 7-bit characters. Anyway, it is always possible to write one's own converting facet and force its usage for widen() with imbue.
I've just looked at the implementation in one of the gcc versions I have, and it just casts to wchar_t.
That's right. There should be some mechanism to convert from the current 8-bit encoding and from the source code encoding into Unicode. At least on Linux there's a mechanism to do the first thing, but I'm not aware of one on Windows.
There are such methods in stdlib.h, which produce a wide character or a string of wide characters:

    int mbtowc(wchar_t *pwc, const char *s, size_t n);
    size_t mbstowcs(wchar_t *pwcs, const char *s, size_t n);
Yes, that's the mechanism I was talking about.
These methods work with the locale supported by the operating system, and thus you can always get a wchar_t from a char. And these should be used in the default facet in the STL.
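A rough sketch of how they could be used to widen one locale-encoded argv element (the helper name and the minimal error handling are just for illustration):

    #include <clocale>
    #include <cstdlib>
    #include <string>
    #include <vector>

    // Convert one argument from the current locale's multibyte encoding
    // to a wide string using mbstowcs().
    std::wstring widen_arg(const char* arg)
    {
        std::size_t len = std::mbstowcs(0, arg, 0);   // first ask for the length
        if (len == (std::size_t)-1)
            return std::wstring();                    // invalid multibyte sequence
        std::vector<wchar_t> buf(len + 1);
        std::mbstowcs(&buf[0], arg, len + 1);
        return std::wstring(&buf[0], len);
    }

    int main(int argc, char* argv[])
    {
        std::setlocale(LC_ALL, "");                   // pick up the user's locale
        for (int i = 1; i < argc; ++i)
            std::wstring wide = widen_arg(argv[i]);
        return 0;
    }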
I've tried to find the implementation in gcc, but failed :-(
I agree with basing the mechanism on facets. OTOH, it really can be made orthogonal to program options. So initially program_options can support only ASCII (strict 7-bit) and Unicode, for which conversion is trivial.
Hmm, if it were orthogonal to program options, like facets are to streams, then there would be no need to declare support for encodings. char * and basic_string<char> are simply in the local encoding, and wchar_t * and basic_string<wchar_t> are wide according to the understanding of "wideness" on the platform (Unicode).
It will also support the input which you can expect for program_options from main(): char * in the current locale encoding. If someone wants something else, he can use a facet and streams to convert it (which could be hidden in program_options with imbue, but not necessarily).
I think the point is not to support UTF-8 or UCS-2, but to support encodings whether explicitly declared or not. I would not implicitly return UTF-8 or UTF-16 basic_string<>s, not only because the STL handles their chars implicitly according to the current locale, but also because basic_string<char>(argv[x]) does not behave this way. Keeping the encoding independent of program_options by using some external conversions, possibly imbuable into program_options, makes the library thinner and more concentrated on the problem: parse the command line. Like the streams - read characters from the input.
So, can we agree on this:
1. Sometimes, the program_options library needs to convert from the local encoding to Unicode and from Unicode to the local encoding.
2. This conversion should be done via a codecvt facet.
3. It might be possible to allow the user to imbue/specify a codecvt facet when using program_options.
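To make point 2 concrete, here is a minimal sketch (the function name and the simple buffer handling are only for illustration, and assume at most one wide character per external char) of a narrow-to-wide conversion that goes through whatever codecvt facet is installed in a given locale; replacing that facet is then what point 3 would allow:

    #include <locale>
    #include <string>
    #include <vector>

    // Narrow -> wide conversion through the codecvt facet of 'loc'.
    std::wstring to_wide(const std::string& s, const std::locale& loc)
    {
        typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_type;
        const cvt_type& cvt = std::use_facet<cvt_type>(loc);

        std::vector<wchar_t> out(s.size() + 1);
        std::mbstate_t state = std::mbstate_t();
        const char* from_next;
        wchar_t* to_next;

        cvt.in(state,
               s.data(), s.data() + s.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);

        return std::wstring(&out[0], to_next);
    }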
Those who say "we need full Unicode, UTF-8 or UCS-4" have the same option as when working with streams - imbue or convert. Otherwise they have current-locale char from istream or wchar_t from wistream.
For example, if one expects char** argv to be in UTF-8, he just changes the locale accordingly and all std::string values that program_options *returns* will be in that encoding automatically.
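Something along these lines, assuming a locale name that actually exists on the platform (the name below is only an example):

    #include <clocale>
    #include <locale>

    int main(int argc, char* argv[])
    {
        // With a UTF-8 locale in effect, the mbstowcs()/codecvt machinery
        // treats narrow strings such as argv[] as UTF-8. The std::locale
        // constructor throws if the named locale is not installed.
        std::setlocale(LC_ALL, "en_US.UTF-8");
        std::locale::global(std::locale("en_US.UTF-8"));

        // ... parse argv here ...
        (void)argc; (void)argv;
        return 0;
    }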
You can find some converting facets for UTF-8, ready for imbue() into a stream, in the files section on Yahoo. Unfortunately not finished or not reviewed...
Yea, I can find some facets, including one written by myself ;-( And yea, unfortunately they are not in Boost yet.
Great! You can probably prepare it for review, if your time allows it... :-)
Let's see ;-) - Volodya