RE: [boost] Re: [program_options] Unicode support

Hi,
From: Vladimir Prus [mailto:ghost@cs.msu.su]
I understand that the STL support for Unicode is not the best, but there are facilities that can provide the required functionality if properly extended/configured.
Let's break the question into two parts.
1. Should 'unicode support' mean that there are two versions of each interface, one using string and the other using wstring? I think this kind of Unicode support is not good. It means that each library which accepts or returns strings must ultimately have a double interface and either be entirely in headers, or instantiate two variants in the sources -- which doubles the size of the library.
Not all libraries must have a doubled interface or a templated interface for any basic_string<C, T, A>. Different libraries have different string types on their interfaces and one must convert among them. A typical example is ANSI <-> UCS-2 on Windows. It is only convenient to be able to avoid such conversions when working with standard or widely used libraries, for example boost_regex, which provides a templated interface so one can choose an 8-bit or 16-bit character space (in the future probably 32-bit with a new basic_string<> ;-). I think program_options, being also such a widely used and quite small library, should also be templated (no offence meant, just about the size, not the importance ;-).
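For illustration, a minimal sketch of the kind of character-templated interface proposed here. The basic_parser name and its parse() member are assumptions made up for this example; they are not the real program_options API.

#include <map>
#include <string>

// Hypothetical character-templated parser; illustrative only, not the
// actual program_options interface.
template <class Ch>
class basic_parser {
public:
    typedef std::basic_string<Ch> string_type;

    // Collects "--name=value" arguments in whatever encoding the caller's
    // Ch and locale imply; no conversion is performed internally.
    std::map<string_type, string_type>
    parse(int argc, const Ch* const* argv) const
    {
        std::map<string_type, string_type> result;
        for (int i = 1; i < argc; ++i) {
            string_type arg(argv[i]);
            typename string_type::size_type eq = arg.find(Ch('='));
            // only "--name=value" is handled; everything else is ignored
            if (arg.size() > 2 && arg[0] == Ch('-') && arg[1] == Ch('-')
                && eq != string_type::npos && eq > 2)
                result[arg.substr(2, eq - 2)] = arg.substr(eq + 1);
        }
        return result;
    }
};

typedef basic_parser<char>    parser;   // narrow, current locale
typedef basic_parser<wchar_t> wparser;  // wide, e.g. UCS-2 on Windows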
2. Should the program_options library use UTF-8 or wstring? As I've said, neither is a clear leader, but UTF-8 seems better.
Here I disagree. Command-line shells work with all characters in the current locale (the whole 255-character space of 8 bits is used). You would give the user a character array in UTF-8 encoding, which is not the typical case today; one processes the parameters with basic_string<char>(argv[x]) in the current locale. I think you should simply use basic_string<> as a template and leave the encoding to the caller, who provides its specialization or performs the conversion himself. Or support the encoding internally by providing an interface to set it, rather than fixing one encoding, even though I like UTF-8 because it supports the full Unicode character range, unlike UCS-2. Thinking about it more, you can expect rather short strings coming into program_options, not megabytes of text. For this usage, fixed-size character encodings are more suitable because they are faster and easier to work with, having direct support in basic_string<>. Does anybody know if there are plans to add something like basic_string<ucs2char_t> to the next C++ standard?
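A minimal sketch of the caller performing the conversion himself, as suggested above: argv arrives in the current locale's 8-bit encoding and is widened with the standard mbstowcs(). The widen_args() helper is an assumption made up for this example only.

#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

// The caller widens argv himself: the shell delivers bytes in the current
// locale, and mbstowcs() converts them to wide characters.
std::vector<std::wstring> widen_args(int argc, char* argv[])
{
    std::setlocale(LC_CTYPE, "");            // use the user's locale
    std::vector<std::wstring> result;
    for (int i = 0; i < argc; ++i) {
        std::string narrow(argv[i]);         // basic_string<char>(argv[x])
        std::vector<wchar_t> buf(narrow.size() + 1);
        std::size_t n = std::mbstowcs(&buf[0], narrow.c_str(), buf.size());
        if (n != (std::size_t)-1)            // skip arguments that fail to convert
            result.push_back(std::wstring(&buf[0], n));
    }
    return result;
}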
I think there is no big reason to try to reinvent the wheel and provide an all-encompassing solution in a library like program_options. It should be enough if it is Unicode-enabled so that it can be used in any specific scenario, provided that all the necessary facilities are in place.
It's *far* from an all-encompassing solution. In fact, the changes in program_options will include:
1. Adding ascii -> UTF-8 conversion in parsers
2. Adding UTF-8 -> ascii conversion in value parsers
3. Adding unicode parsers with UCS-4 -> UTF-8 conversion
4. Adding unicode value parsers and UTF-8 -> UCS-4 conversion
That's all, and given that there are at least two UTF-8 codecs announced on the mailing list, not a lot of work. And this will add Unicode support without changing the interface a bit.
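As a rough idea of what the UCS-4 -> UTF-8 direction in item 3 involves, here is a hand-rolled encoder for a single code point. A real implementation would presumably reuse one of the codecs announced on the list; this is only a sketch of the conversion itself.

#include <string>

// Hand-rolled UCS-4 -> UTF-8 for one code point (up to U+10FFFF); a sketch
// of the conversion only, not code from the library or the announced codecs.
std::string ucs4_to_utf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {
        out += char(cp);
    } else if (cp < 0x800) {
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {
        out += char(0xF0 | (cp >> 18));
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
    return out;
}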
Yes, you are right; there is not much work to add the conversion code into the internals of program_options. I also wrote my own UCS-2 <-> UTF-8 encoding routines to use them for basic_string<char> <-> basic_string<wchar_t> conversion. However, I think we should reuse as much as possible and not rewrite similar code in every library that works with strings coming from real user input. Your solution to support UTF-8 invisibly changes the interface anyway - not the text of the prototypes directly, but the behavior of the interface (the encoding of the strings). Nevertheless, you could support this encoding conversion not only by providing your own conversion routines, but rather by accepting existing facets, which help streams in a similar way (as I wrote in the former e-mail). Then one could simply write a conversion facet once and use it for stream input and also for command-line input, sharing the implementation.
Ferda
- Volodya

Ferdinand Prantl wrote:
1. Should 'unicode support' mean that there are two versions of each interface, one using string and the other using wstring? I think this kind of Unicode support is not good. It means that each library which accepts or returns strings must ultimately have a double interface and either be entirely in headers, or instantiate two variants in the sources -- which doubles the size of the library.
Not all libraries must have a doubled interface or a templated interface for any basic_string<C, T, A>. Different libraries have different string types on their interfaces and one must convert among them. A typical example is ANSI <-> UCS-2 on Windows. It is only convenient to be able to avoid such conversions when working with standard or widely used libraries, for example boost_regex, which provides a templated interface so one can choose an 8-bit or 16-bit character space (in the future probably 32-bit with a new basic_string<> ;-). I think program_options, being also such a widely used and quite small library, should also be templated (no offence meant, just about the size, not the importance ;-).
Making it templated would mean that using the library increases code size for the client -- which I really want to avoid. As for convenience -- consider the cases in the design document. One library does not care about Unicode and exposes options_description<char>. The main application uses Unicode, so it declares options_description<wchar_t>. Now, to add the library's options into the application's options, some additional conversion is needed. The only advantage of making the library templated is that you don't need to convert input into an internal encoding, so it might be faster. But is that really important for a library which is not going to be a performance bottleneck? (E.g. for boost::regex, speed is much more important.)
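A hypothetical illustration of that mixing problem. The options_description_t template below does not exist in program_options; it only stands in for the "library uses char, application uses wchar_t" situation described above.

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a templated options description; not part of
// program_options.
template <class Ch>
struct options_description_t {
    std::vector<std::basic_string<Ch> > names;
    void add(const std::basic_string<Ch>& name) { names.push_back(name); }
};

// A library indifferent to Unicode exposes the narrow variant (defined
// elsewhere)...
options_description_t<char> library_options();

// ...so a Unicode-aware application must convert every option name before
// merging it into its own wide description.
void merge(options_description_t<wchar_t>& app)
{
    options_description_t<char> lib = library_options();
    for (std::size_t i = 0; i < lib.names.size(); ++i)
        // element-wise widening is only correct for 7-bit names; anything
        // else would need a real conversion
        app.add(std::wstring(lib.names[i].begin(), lib.names[i].end()));
}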
2. Should the program_options library use UTF-8 or wstring? As I've said, neither is a clear leader, but UTF-8 seems better.
Here I disagree. Command-line shells work with all characters in the current locale (the whole 255-character space of 8 bits is used). You would give the user a character array in UTF-8 encoding, which is not the typical case today; one processes the parameters with basic_string<char>(argv[x]) in the current locale.
I'm sorry but I'm lost. What does "you would give the user a character array" mean?
I think you should simply use basic_string<> as a template and leave the encoding to the caller, who provides its specialization or performs the conversion himself. Or support the encoding internally by providing an interface to set it, rather than fixing one encoding, even though I like UTF-8 because it supports the full Unicode character range, unlike UCS-2.
Thinking about it more, you can expect rather short strings coming into program_options, not megabytes of text. For this usage, fixed-size character encodings are more suitable because they are faster and easier to work with, having direct support in basic_string<>.
Ok, I have to reiterate: the biggest advantage of UTF-8 is that the existing command line parser will just work with UTF-8, so the "easier" point above does not apply, IMO. As for speed: again, I think it's of minor importance.
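A tiny demonstration of why the existing narrow parser keeps working on UTF-8 input: continuation bytes of a multi-byte sequence always have the high bit set, so ASCII delimiters such as '=' can never appear inside an encoded character. The example string is an assumption chosen just for this test.

#include <cassert>
#include <string>

int main()
{
    // "--city=K\xC3\xB6ln" is "--city=Köln" encoded in UTF-8; the two
    // non-ASCII bytes both have the high bit set.
    std::string arg = "--city=K\xC3\xB6ln";
    std::string::size_type eq = arg.find('=');   // an existing narrow parser
    assert(arg.substr(0, eq) == "--city");       // option name found as usual
    assert(arg.substr(eq + 1) == "K\xC3\xB6ln"); // value passed through intact
    return 0;
}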
That's all, and given that there are at least two UTF-8 codecs announced on the mailing list, not a lot of work. And this will add Unicode support without changing the interface a bit.
Yes, you are right; there is not much work to add the conversion code into the internals of program_options. I also wrote my own UCS-2 <-> UTF-8 encoding routines to use them for basic_string<char> <-> basic_string<wchar_t> conversion. However, I think we should reuse as much as possible and not rewrite similar code in every library that works with strings coming from real user input.
Heh, I'm not going to rewrite anything -- I'll use one of the facets that are available.
Your solution to support UTF-8 invisibly changes the interface anyway - not the text of the prototypes directly, but the behavior of the interface (the encoding of the strings).
The encoding of user-visible strings is not changed. The only user-visible difference I see is that by default, char* input will be required to be 7-bit. But I think even this is not strictly required.
Nevertheless, you could support this encoding conversion not only by providing your own conversion routines, but rather by accepting existing facets, which help streams in a similar way (as I wrote in the former e-mail). Then one could simply write a conversion facet once and use it for stream input and also for command-line input, sharing the implementation.
That's for sure. I plan to use facets as much as possible. - Volodya
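A minimal sketch of that shared-facet idea, assuming a helper called widen_with_facet(): the same standard codecvt facet pulled from a locale can widen a command-line argument, and the very same locale can be imbued into a stream, so the conversion is written only once. Error handling is omitted.

#include <cwchar>
#include <locale>
#include <string>
#include <vector>

// One conversion facet, shared between stream input and command-line input.
std::wstring widen_with_facet(const std::string& narrow, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    std::vector<wchar_t> buf(narrow.size() + 1);
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = &buf[0];

    cvt.in(state,
           narrow.data(), narrow.data() + narrow.size(), from_next,
           &buf[0], &buf[0] + buf.size(), to_next);

    return std::wstring(&buf[0], to_next);
}

// Usage: widen_with_facet(argv[1], std::locale("")) for an argument, and
// std::wcin.imbue(std::locale("")) so stream input uses the same facet.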