RE: [boost] Re: [program_options] Re: Unicode support

Hello,
From: Vladimir Prus [mailto:ghost@cs.msu.su]
glib has a very good implementation of UTF-8 handling, and Glibmm is a well-done C++ wrapper, but it lacks "standardness". Something like boost::ustring COULD bring a widely accepted, UTF-8 aware Unicode string to C++ programmers. A somewhat relieving thought.
I am not exactly sure whether UTF-8 or UCS-4 is better as a universal solution, but some solution is surely needed.
I am afraid there is no universal solution for all users. The easiest solution is based on the native basic_string<>, which is specialized for char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit), usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding would require another basic_string<> specialization. UCS-2 held all characters in Unicode 1.1; more unique numbers were needed, and UCS-4 was introduced in Unicode 2.0. Unfortunately there is no 4-byte character specialization for basic_string<> in the STL yet.

Generally speaking, UTF encodings (e.g. UTF-8 or UTF-16) are not suitable for fast in-memory usage, because some operations, for example at(int) and length(), are not O(1) but O(n). They (UTF, T = Transformation) are better for storing texts, as they save space by using a variable number of bytes per character. UCS encodings (e.g. UCS-2 or UCS-4), however, are fast for in-memory operations because they use a fixed character size. That is why I would not like to use basic_string<utf8char> in memory, but rather basic_string<ucs4char>; I would not generalize that for all possible applications, though.

You can expect initialization from (const char * argv []) on all platforms, or from (const wchar_t * argv []) in UCS-2 on Windows. With basic_string<> you already have support for parameters in the current locale (char) and for parameters in UCS-2. If we take the option of reading parameters into basic_string<wchar_t> or basic_string<ucs4char_t>, where the character size or encoding is not the same as the native encoding on the command line, there is an affinity to streams. Some shells allow UTF-8 encoded parameters or, generally, characters outside the current locale. This means a program can choose how to encode all characters from Unicode to char: UTF-7/8, etc. I would like to have a solution similar to streams: imbue().
Having this, you could convert internally every argv[x] using imbue(y) applied to a stringstream, where the caller provides the facet y. The caller could choose the target character capabilities by providing a basic_string<> specialization. On the other hand, such a conversion can also be performed by the user. The parameters sent to main() are char* or wchar_t*, and thus program_options can give them back just as they are, in basic_string<char> and basic_string<wchar_t>. The client can use his facet to imbue a basic_stringstream<> initialized with the parameter from program_options. Or a conversion library could be used to perform the conversion, something like lexical_cast<> does for type conversions. It is a matter of convenience only: a separate conversion library (no support for encodings in program_options), or imbue(), performing the conversion inside program_options. However, the conversion should not be implemented for program_options only; that is why I suggested an existing interface - facets.
Or did I miss something? Is something like this part of boost already?
Nope :-( Even a UTF-8 encoder is not in Boost yet.
You can find some converting facets for UTF-8, ready for imbue() into a stream, in the files section on Yahoo. Unfortunately not finished or not reviewed... Ferda
- Volodya

Hi Ferdinand,
I am not exactly sure whether UTF-8 or UCS-4 is better as a universal solution, but some solution is surely needed.
I am afraid there is no universal solution for all users. The easiest solution is based on the native basic_string<>, which is specialized for char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit), usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding would require another basic_string<> specialization.
UCS-2 held all characters in Unicode 1.1; more unique numbers were needed, and UCS-4 was introduced in Unicode 2.0. Unfortunately there is no 4-byte character specialization for basic_string<> in the STL yet.
Or, to be exact, there's no agreement on whether wchar_t should be 32-bit or 16-bit. Linux (or gcc specifically) uses 32 bits, and Windows 16, which means wstring is only suitable for UCS-2 there. Besides, UTF-16 (as opposed to plain UCS-2) has a mechanism, surrogate pairs, to represent characters outside the 16-bit space with two elements, which, I suspect, won't work transparently if wchar_t is 16 bits.
Generally speaking, UTF encodings (e.g. UTF-8 or UTF-16) are not suitable for fast in-memory usage, because some operations, for example at(int) and length(), are not O(1) but O(n).
I believe UTF-8 is popular partly because such operations are believed to be rare.
They (UTF, T = Transformation) are better for storing texts, as they save space by using a variable number of bytes per character. UCS encodings (e.g. UCS-2 or UCS-4), however, are fast for in-memory operations because they use a fixed character size.
What about representing values with two 16-bit values (that's what I've mentioned above)? (BTW, for reference to those interested, http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf talks about different encodings).
That is why I would not like to use basic_string<utf8char> in memory, but rather basic_string<ucs4char>; I would not generalize that for all possible applications, though.
In the case of program options, I suspect that everything will work for UTF-8 strings. IOW, the library does not care that any given 'char' might be part of some Unicode character. That's why it's attractive to me.
You can expect initialization from (const char * argv []) on all platforms, or from (const wchar_t * argv []) in UCS-2 on Windows. With basic_string<> you already have support for parameters in the current locale (char) and for parameters in UCS-2.
What do you mean by 'parameters from the current locale'? I am not sure that ctype::widen is required to care about the user-selected character encoding. Nor do I think that is required of the default codecvt facet. If my locale on Linux is ru_RU.KOI8-R, I don't think the standard requires the codecvt<wchar_t, char> facet in the default instance of 'locale' to do a meaningful conversion from KOI8-R into Unicode.
If we take the option of reading parameters into basic_string<wchar_t> or basic_string<ucs4char_t>, where the character size or encoding is not the same as the native encoding on the command line, there is an affinity to streams. Some shells allow UTF-8 encoded parameters or, generally, characters outside the current locale. This means a program can choose how to encode all characters from Unicode to char: UTF-7/8, etc. I would like to have a solution similar to streams: imbue(). Having this, you could convert internally every argv[x] using imbue(y) applied to a stringstream, where the caller provides the facet y.
That's right. There should be some mechanism to convert from the current 8-bit encoding, and from the source code encoding, into Unicode. At least on Linux there's a mechanism for the former, but I'm not aware of one on Windows.
On the other hand, such a conversion can also be performed by the user. The parameters sent to main() are char* or wchar_t*, and thus program_options can give them back just as they are, in basic_string<char> and basic_string<wchar_t>. The client can use his facet to imbue a basic_stringstream<> initialized with the parameter from program_options. Or a conversion library could be used to perform the conversion, something like lexical_cast<> does for type conversions. It is a matter of convenience only: a separate conversion library (no support for encodings in program_options), or imbue(), performing the conversion inside program_options. However, the conversion should not be implemented for program_options only; that is why I suggested an existing interface - facets.
I agree with basing the mechanism on facets. OTOH, it really can be made orthogonal to program_options. So initially program_options could support only ASCII (strict 7-bit) and Unicode, for which the conversion is trivial.
Or did I miss something? Is something like this part of boost already?
Nope :-( Even a UTF-8 encoder is not in Boost yet.
You can find some converting facets for UTF-8, ready for imbue() into a stream, in the files section on Yahoo. Unfortunately not finished or not reviewed...
Yea, I can find some facets, including one written by myself ;-( And yea, unfortunately they are not in Boost yet. - Volodya

Ferdinand Prantl wrote:
Hello,
From: Vladimir Prus [mailto:ghost@cs.msu.su]
glib has a very good implementation of UTF-8 handling, and Glibmm is a well-done C++ wrapper, but it lacks "standardness". Something like boost::ustring COULD bring a widely accepted, UTF-8 aware Unicode string to C++ programmers. A somewhat relieving thought.
I am not exactly sure whether UTF-8 or UCS-4 is better as a universal solution, but some solution is surely needed.
I am afraid there is no universal solution for all users. The easiest solution is based on the native basic_string<>, which is specialized for char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit), usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding would require another basic_string<> specialization.
UCS-2 held all characters in Unicode 1.1; more unique numbers were needed, and UCS-4 was introduced in Unicode 2.0. Unfortunately there is no 4-byte character specialization for basic_string<> in the STL yet.
Technically, there isn't a 2-byte specialization either; wchar_t might not be 16 bits. Bob

In article <077097E85A6BD3119E910800062786A90B3F2D1C@muc-mail5.ixos.de>, Ferdinand Prantl <ferdinand.prantl@ixos.de> wrote:
I am afraid there is no universal solution for all users. The easiest solution is based on the native basic_string<>, which is specialized for char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit) usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding would require another basic_string<> specialization.
Assuming that basic_string<> is an appropriate abstraction for Unicode strings is a fallacy; the rest of your post, and some other parts of this thread, seem to make that assumption. The reason this is false is that basic_string<T> has performance guarantees which make it incompatible with a useful Unicode string abstraction. In order to maintain the performance guarantees of basic_string, you have to treat a Unicode string as a sequence of code points, rather than as a sequence of abstract characters. On the other hand, in order to manipulate a Unicode string without violating constraints on well-formedness, you have to consider the string as a sequence of abstract characters (unless, of course, you constrain yourself to string transformations which operate on code point sequences yet guarantee that strings remain well-formed; there are few such transformations -- concatenation is one of them, under certain constraints).

It should be noted that basic_string<ucs4char_t> is as misguided an idea as basic_string<utf8char_t>, because even in UCS4 an abstract character might consist of more than one code point. For example, the string "capital letter C, combining caron, lowercase letter e" contains two abstract characters but three UCS4 code points; therefore, removing the first character from that string means removing the first two code points of three. Removing just the first code point would leave you with a combining caron followed by a lowercase letter e, which is not a well-formed Unicode string. (Yes, I know that this particular string could also be written in a canonically precomposed form in which there is indeed one code point per abstract character, but that is not true of all Unicode strings which include combining marks; I am just too lazy to find out exactly which aren't.)

To summarize: basic_string<ucs4char_t> solves very few problems compared to basic_string<utf8char_t>.
Do not be fooled into thinking that the complexities of Unicode can be swept under the UCS4 rug. basic_string is not the abstraction you are looking for, but it is also the only one readily available in the STL/Boost today. It may serve as a good starting point (questionable, IMNSHO), but it should most definitely not be treated as the right thing to use for Unicode in the long term.

meeroh -- If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>
participants (4)
- Ferdinand Prantl
- Miro Jurisic
- Robert Bell
- Vladimir Prus