RE: [boost] Re: [program_options] Unicode support

Hi,
From: Vladimir Prus [mailto:ghost@cs.msu.su]
I understand that the STL support for Unicode is not the best, but there are facilities that can provide the required functionality if properly extended/configured.
Let's break the question into two parts.
1. Should 'unicode support' mean that there are two versions of each interface, one using string and the other using wstring? I think this kind of Unicode support is not good. It means that each library which accepts or returns strings must ultimately have a double interface and either be entirely in headers, or instantiate two variants in the sources -- which doubles the size of the library.
Not all libraries must have a doubled interface or a templated interface for any basic_string<C, T, A>. Different libraries have different string types on their interfaces and one must convert among them. A typical example is ANSI <-> UCS-2 on Windows. It is only convenient to be able to avoid such conversions when working with standard or widely used libraries, for example boost_regex, which provides a templated interface so one can choose an 8-bit or 16-bit character space (in the future probably 32-bit with a new basic_string<> ;-). I think program_options, being also such a widely used and quite small library, should also be templated (no offence meant, just about the size, not the importance ;-).
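For illustration, a minimal sketch of the kind of character-templated interface proposed here. The basic_parser name and its parse() member are assumptions made up for this example; they are not the real program_options API.

#include <map>
#include <string>

// Hypothetical character-templated parser; illustrative only, not the
// actual program_options interface.
template <class Ch>
class basic_parser {
public:
    typedef std::basic_string<Ch> string_type;

    // Collects "--name=value" arguments in whatever encoding the caller's
    // Ch and locale imply; no conversion is performed internally.
    std::map<string_type, string_type>
    parse(int argc, const Ch* const* argv) const
    {
        std::map<string_type, string_type> result;
        for (int i = 1; i < argc; ++i) {
            string_type arg(argv[i]);
            typename string_type::size_type eq = arg.find(Ch('='));
            // only "--name=value" is handled; everything else is ignored
            if (arg.size() > 2 && arg[0] == Ch('-') && arg[1] == Ch('-')
                && eq != string_type::npos && eq > 2)
                result[arg.substr(2, eq - 2)] = arg.substr(eq + 1);
        }
        return result;
    }
};

typedef basic_parser<char>    parser;   // narrow, current locale
typedef basic_parser<wchar_t> wparser;  // wide, e.g. UCS-2 on Windows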
2. Should the program_options library use UTF-8 or wstring? As I've said, neither is a clear leader, but UTF-8 seems better.
Here I disagree. Command-line shells work with all characters in the current locale (the whole 255-character space of 8 bits is used). You would give the user a character array in UTF-8 encoding, which is not the typical case today; one processes the parameters with basic_string<char>(argv[x]) in the current locale. I think you should simply use basic_string<> as a template and leave the encoding to the caller, who provides its specialization or performs the conversion himself. Or support the encoding internally by providing an interface to set it, rather than fixing one encoding, even though I like UTF-8 because it supports the full Unicode character range, unlike UCS-2. Thinking about it more, you can expect rather short strings coming into program_options, not megabytes of text. For this usage, fixed-size character encodings are more suitable because they are faster and easier to work with, having direct support in basic_string<>. Does anybody know if there are plans to add something like basic_string<ucs2char_t> to the next C++ standard?
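A minimal sketch of the caller performing the conversion himself, as suggested above: argv arrives in the current locale's 8-bit encoding and is widened with the standard mbstowcs(). The widen_args() helper is an assumption made up for this example only.

#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

// The caller widens argv himself: the shell delivers bytes in the current
// locale, and mbstowcs() converts them to wide characters.
std::vector<std::wstring> widen_args(int argc, char* argv[])
{
    std::setlocale(LC_CTYPE, "");            // use the user's locale
    std::vector<std::wstring> result;
    for (int i = 0; i < argc; ++i) {
        std::string narrow(argv[i]);         // basic_string<char>(argv[x])
        std::vector<wchar_t> buf(narrow.size() + 1);
        std::size_t n = std::mbstowcs(&buf[0], narrow.c_str(), buf.size());
        if (n != (std::size_t)-1)            // skip arguments that fail to convert
            result.push_back(std::wstring(&buf[0], n));
    }
    return result;
}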
I think there is no big reason to try to reinvent the wheel and provide an all-encompassing solution in a library like program_options. It should be enough if it is Unicode-enabled so that it can be used in any specific scenario, provided that all the necessary facilities are in place.
It's *far* from an all-encompassing solution. In fact, the changes in program_options will include:
1. Adding ascii -> UTF-8 conversion in parsers
2. Adding UTF-8 -> ascii conversion in value parsers
3. Adding unicode parsers with UCS-4 -> UTF-8 conversion
4. Adding unicode value parsers and UTF-8 -> UCS-4 conversion
That's all, and given that there are at least two UTF-8 codecs announced on the mailing list, not a lot of work. And this will add Unicode support without changing the interface a bit.
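As a rough idea of what the UCS-4 -> UTF-8 direction in item 3 involves, here is a hand-rolled encoder for a single code point. A real implementation would presumably reuse one of the codecs announced on the list; this is only a sketch of the conversion itself.

#include <string>

// Hand-rolled UCS-4 -> UTF-8 for one code point (up to U+10FFFF); a sketch
// of the conversion only, not code from the library or the announced codecs.
std::string ucs4_to_utf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {
        out += char(cp);
    } else if (cp < 0x800) {
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {
        out += char(0xF0 | (cp >> 18));
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
    return out;
}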
Yes, you are right; there is not much work to add the conversion code into the internals of program_options. I also wrote my own UCS-2 <-> UTF-8 encoding routines to use them for basic_string<char> <-> basic_string<wchar_t> conversion. However, I think we should reuse as much as possible and not rewrite similar code in every library that works with strings coming from real user input. Your solution to support UTF-8 invisibly changes the interface anyway - not the text of the prototypes directly, but the behavior of the interface (the encoding of the strings). Nevertheless, you could support this encoding conversion not only by providing your own conversion routines, but rather by accepting existing facets, which help streams in a similar way (as I wrote in the former e-mail). Then one could simply write a conversion facet once and use it for stream input and also for command-line input, sharing the implementation.
Ferda
- Volodya

Ferdinand Prantl wrote:
1. Should 'unicode support' mean that there are two versions of each interface, one using string and the other using wstring? I think this kind of Unicode support is not good. It means that each library which accepts or returns strings must ultimately have a double interface and either be entirely in headers, or instantiate two variants in the sources -- which doubles the size of the library.
Not all libraries must have a doubled interface or a templated interface for any basic_string<C, T, A>. Different libraries have different string types on their interfaces and one must convert among them. A typical example is ANSI <-> UCS-2 on Windows. It is only convenient to be able to avoid such conversions when working with standard or widely used libraries, for example boost_regex, which provides a templated interface so one can choose an 8-bit or 16-bit character space (in the future probably 32-bit with a new basic_string<> ;-). I think program_options, being also such a widely used and quite small library, should also be templated (no offence meant, just about the size, not the importance ;-).
Making it templated would mean that using the library increases code size for the client -- which I really want to avoid. As for convenience -- consider the cases in the design document. One library does not care about Unicode and exposes options_description<char>. The main application uses Unicode, so it declares options_description<wchar_t>. Now, to add the library's options into the application's options, some additional conversion is needed. The only advantage of making the library templated is that you don't need to convert input into an internal encoding, so it might be faster. But is that really important for a library which is not going to be a performance bottleneck? (E.g. for boost::regex, speed is much more important.)
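A hypothetical illustration of that mixing problem. The options_description_t template below does not exist in program_options; it only stands in for the "library uses char, application uses wchar_t" situation described above.

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a templated options description; not part of
// program_options.
template <class Ch>
struct options_description_t {
    std::vector<std::basic_string<Ch> > names;
    void add(const std::basic_string<Ch>& name) { names.push_back(name); }
};

// A library indifferent to Unicode exposes the narrow variant (defined
// elsewhere)...
options_description_t<char> library_options();

// ...so a Unicode-aware application must convert every option name before
// merging it into its own wide description.
void merge(options_description_t<wchar_t>& app)
{
    options_description_t<char> lib = library_options();
    for (std::size_t i = 0; i < lib.names.size(); ++i)
        // element-wise widening is only correct for 7-bit names; anything
        // else would need a real conversion
        app.add(std::wstring(lib.names[i].begin(), lib.names[i].end()));
}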
2. Should the program_options library use UTF-8 or wstring? As I've said, neither is a clear leader, but UTF-8 seems better.
Here I disagree. Command-line shells work with all characters in the current locale (the whole 255-character space of 8 bits is used). You would give the user a character array in UTF-8 encoding, which is not the typical case today; one processes the parameters with basic_string<char>(argv[x]) in the current locale.
I'm sorry but I'm lost. What does "you would give the user a character array" mean?
I think you should simply use basic_string<> as a template and leave the encoding to the caller, who provides its specialization or performs the conversion himself. Or support the encoding internally by providing an interface to set it, rather than fixing one encoding, even though I like UTF-8 because it supports the full Unicode character range, unlike UCS-2.
Thinking about it more, you can expect rather short strings coming into program_options, not megabytes of text. For this usage, fixed-size character encodings are more suitable because they are faster and easier to work with, having direct support in basic_string<>.
Ok, I have to reiterate: the biggest advantage of UTF-8 is that the existing command line parser will just work with UTF-8, so the "easier" point above does not apply, IMO. As for speed: again, I think it's of minor importance.
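A tiny demonstration of why the existing narrow parser keeps working on UTF-8 input: continuation bytes of a multi-byte sequence always have the high bit set, so ASCII delimiters such as '=' can never appear inside an encoded character. The example string is an assumption chosen just for this test.

#include <cassert>
#include <string>

int main()
{
    // "--city=K\xC3\xB6ln" is "--city=Köln" encoded in UTF-8; the two
    // non-ASCII bytes both have the high bit set.
    std::string arg = "--city=K\xC3\xB6ln";
    std::string::size_type eq = arg.find('=');   // an existing narrow parser
    assert(arg.substr(0, eq) == "--city");       // option name found as usual
    assert(arg.substr(eq + 1) == "K\xC3\xB6ln"); // value passed through intact
    return 0;
}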
That's all, and given that there are at least two UTF-8 codecs announced on the mailing list, not a lot of work. And this will add Unicode support without changing the interface a bit.
Yes, you are right; there is not much work to add the conversion code into the internals of program_options. I also wrote my own UCS-2 <-> UTF-8 encoding routines to use them for basic_string<char> <-> basic_string<wchar_t> conversion. However, I think we should reuse as much as possible and not rewrite similar code in every library that works with strings coming from real user input.
Heh, I'm not going to rewrite anything -- I'll use one of the facets that are available.
Your solution to support UTF-8 invisibly changes the interface anyway - not the text of the prototypes directly, but the behavior of the interface (the encoding of the strings).
The encoding of user-visible strings is not changed. The only user-visible difference I see is that by default, char* input will be required to be 7-bit. But I think even this is not strictly required.
Nevertheless, you could support this encoding conversion not only by providing your own conversion routines, but rather by accepting existing facets, which help streams in a similar way (as I wrote in the former e-mail). Then one could simply write a conversion facet once and use it for stream input and also for command-line input, sharing the implementation.
That's for sure. I plan to use facets as much as possible. - Volodya
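A minimal sketch of that shared-facet idea, assuming a helper called widen_with_facet(): the same standard codecvt facet pulled from a locale can widen a command-line argument, and the very same locale can be imbued into a stream, so the conversion is written only once. Error handling is omitted.

#include <cwchar>
#include <locale>
#include <string>
#include <vector>

// One conversion facet, shared between stream input and command-line input.
std::wstring widen_with_facet(const std::string& narrow, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    std::vector<wchar_t> buf(narrow.size() + 1);
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = &buf[0];

    cvt.in(state,
           narrow.data(), narrow.data() + narrow.size(), from_next,
           &buf[0], &buf[0] + buf.size(), to_next);

    return std::wstring(&buf[0], to_next);
}

// Usage: widen_with_facet(argv[1], std::locale("")) for an argument, and
// std::wcin.imbue(std::locale("")) so stream input uses the same facet.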