[program_options] Unicode support

Hello, it seems that Unicode support is the last issue that should be addressed before the library can be added to CVS. Since the issue is somewhat tricky, I'd appreciate some comments before I start coding.

The initial description of the approach I plan is at http://zigzag.cs.msu.su:7813/program_options/html/program_options.design.htm... or http://zigzag.cs.msu.su/~ghost/program_options/html/program_options.design.h...

To summarize the document: the user would be able to pass wchar_t** to the parse_command_line function. If a specific option needs Unicode, this can be specified like this:

("email", unicode::value<Email>(), "email address")

I believe the change will be mostly transparent and won't affect ASCII usage. All opinions are very welcome! TIA, Volodya
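P.S. For illustration, here is a rough sketch of the intended usage. None of the names are final (unicode::value in particular does not exist yet), and Email stands for some user-defined type:

    namespace po = boost::program_options;

    // Sketch: declare one plain option and one Unicode option.
    po::options_description desc;
    // desc.add_options()
    //     ("compression", po::value<int>(), "compression level")    // plain value
    //     ("email", po::unicode::value<Email>(), "email address");  // Unicode value

    // The parsers would additionally gain overloads accepting a wide
    // command line, e.g.:
    //     po::parse_command_line(argc, wide_argv, desc);  // wide_argv is wchar_t**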

Hello, I just wanted to express my sincere wish/hope that any achievements in this sector will become a common boost good. glib has a very good implementation of UTF-8 handling, and Glibmm is a well-done C++ wrapper, but it lacks "standardness". Something like boost::ustring COULD bring a widely accepted UTF-8-aware Unicode string to C++ programmers. A somewhat relieving thought. Or did I miss something? Is something like this part of boost already? Regards, Hagen Moebius.

Hi Hagen,
I just wanted to express my sincere wish/hope that any achievements in this sector will become a common boost good.
That's why I'm asking for comment, rather than silently adding specific implementation. Hopefully we'll agree on something.
glib has a very good implementation of UTF-8 handling, and Glibmm is a well-done C++ wrapper, but it lacks "standardness". Something like boost::ustring COULD bring a widely accepted UTF-8-aware Unicode string to C++ programmers. A somewhat relieving thought.
I am not exactly sure whether UTF-8 or UCS-4 is better as a universal solution, but some solution is surely needed.
Or did I miss something? Is something like this part of boost already?
Nope :-( Even a UTF-8 encoder is not in boost yet. - Volodya

Hi, I have read your proposal. Maybe I'm missing something very serious, but I would prefer a scheme similar to the one used by the STL. That is, there would be variants accepting the char and wchar_t data types, and all possible Unicode problems would be addressed by char_traits and locale. I understand that the STL's support for Unicode is not the best, but there are facilities that can provide the required functionality if properly extended/configured. I think there is no big reason to try to reinvent the wheel and provide an all-encompassing solution in a library like program_options. It should be enough if it is Unicode-enabled so that it can be used in any specific scenario, provided that all the necessary facilities are in place. Regards, Pavol On Tue, Apr 06, 2004 at 11:27:44AM +0400, Vladimir Prus wrote:
Hello, it seems that Unicode support is the last issue that should be addressed before the library can be added to CVS. Since the issue is somewhat tricky, I'd appreciate some comments before I start coding.
The initial description of the approach I plan is at
http://zigzag.cs.msu.su:7813/program_options/html/program_options.design.htm... or http://zigzag.cs.msu.su/~ghost/program_options/html/program_options.design.h...
To summarize the document: the user would be able to pass wchar_t** to the parse_command_line function. If a specific option needs Unicode, this can be specified like this:
("email", unicode::value<Email>(), "email address")
I believe the change will be mostly transparent and won't affect ASCII usage.
All opinions are very welcome!
TIA, Volodya

Hi Pavol,
I have read your proposal. Maybe I'm missing something very serious, but I would prefer a scheme similar to the one used by the STL.
That is, there would be variants accepting the char and wchar_t data types, and all possible Unicode problems would be addressed by char_traits and locale.
Variants of what? The command line parser and config file parser will have two variants of the interface. The storage component need not have two variants. What advantage will it give? Finally, and that's the most important point, I believe that the options description component need only provide two variants of the 'value' function. As the document says, if you have two variants of the options_description class, then the ASCII vs. Unicode decision is global for the entire application, which is not so good.
I understand that the STL's support for Unicode is not the best, but there are facilities that can provide the required functionality if properly extended/configured.
Let's break the question into two parts.
1. Should 'Unicode support' mean that there are two versions of each interface, one using string and the other using wstring? I think this kind of Unicode support is not good. It means that each library which accepts or returns strings must ultimately have a double interface, and either be entirely in headers or instantiate two variants in the sources -- which doubles the size of the library.
2. Should the program_options library use UTF-8 or wstring? As I've said, neither is a clear leader, but UTF-8 seems better.
I think there is no big reason to try to reinvent the wheel and provide an all-encompassing solution in a library like program_options. It should be enough if it is Unicode-enabled so that it can be used in any specific scenario, provided that all the necessary facilities are in place.
It's *far* from an all-encompassing solution. In fact, the changes in program_options will include:
1. Adding ASCII -> UTF-8 conversion in parsers
2. Adding UTF-8 -> ASCII conversion in value parsers
3. Adding Unicode parsers with UCS-4 -> UTF-8 conversion
4. Adding Unicode value parsers with UTF-8 -> UCS-4 conversion
That's all, and given that there are at least two UTF-8 codecs announced on the mailing list, not a lot of work. And this will add Unicode support without changing the interface a bit. - Volodya
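P.S. To give an idea of the scale of these conversions, here is a sketch of the UCS-4 -> UTF-8 direction (item 3). Illustration only: range checks and error handling are omitted.

    #include <string>

    // Append one UCS-4 code point to a UTF-8 encoded string.
    // Sketch: assumes 'cp' is a valid code point <= 0x10FFFF.
    void append_utf8(std::string& out, unsigned long cp)
    {
        if (cp < 0x80) {                          // 1 byte: plain ASCII
            out += char(cp);
        } else if (cp < 0x800) {                  // 2 bytes
            out += char(0xC0 | (cp >> 6));
            out += char(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                // 3 bytes
            out += char(0xE0 | (cp >> 12));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        } else {                                  // 4 bytes
            out += char(0xF0 | (cp >> 18));
            out += char(0x80 | ((cp >> 12) & 0x3F));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        }
    }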

On Tue, Apr 06, 2004 at 03:28:11PM +0400, Vladimir Prus wrote:
Hi Pavol,
I have read your proposal. Maybe I'm missing something very serious, but I would prefer a scheme similar to the one used by the STL.
That is, there would be variants accepting the char and wchar_t data types, and all possible Unicode problems would be addressed by char_traits and locale.
Variants of what? The command line parser and config file parser will have two variants of the interface.
Not really variants. I mean templated, and specialized for the common char* and wchar_t*.
The storage component need not have two variants. What advantage will it give? Finally, and that's the most important point, I believe that the options description component need only provide two variants of the 'value' function. As the document says, if you have two variants of the options_description class, then the ASCII vs. Unicode decision is global for the entire application, which is not so good.
This argument is quite questionable. IMHO you stick with either narrow or wide characters in the whole application. Otherwise you are forced to make conversions at the boundaries. I don't really see a point in the mixed-type approach.
I understand that the STL's support for Unicode is not the best, but there are facilities that can provide the required functionality if properly extended/configured.
Let's break the question into two parts.
1. Should 'Unicode support' mean that there are two versions of each interface, one using string and the other using wstring? I think this kind of Unicode support is not good. It means that each library which accepts or returns strings must ultimately have a double interface, and either be entirely in headers or instantiate two variants in the sources -- which doubles the size of the library.
Actually, for a general-purpose library like this, I don't think that the compile-time overhead implied by templatization of the code is worse than having to do conversions all over the place at runtime. The library should work with basic_string if possible. If my application is Unicode, and all the input I have is Unicode, it is really annoying to convert everything to and fro when interfacing with a library like program_options.
2. Should the program_options library use UTF-8 or wstring? As I've said, neither is a clear leader, but UTF-8 seems better.
Ferda Prant gave quite a good explanation in another mail about the Unicode support in the STL. I'm asking only for seamless integration with the standard facilities.
I think there is no big reason to try to reinvent the wheel and provide an all-encompassing solution in a library like program_options. It should be enough if it is Unicode-enabled so that it can be used in any specific scenario, provided that all the necessary facilities are in place.
It's *far* from an all-encompassing solution. In fact, the changes in program_options will include:
1. Adding ASCII -> UTF-8 conversion in parsers
2. Adding UTF-8 -> ASCII conversion in value parsers
3. Adding Unicode parsers with UCS-4 -> UTF-8 conversion
4. Adding Unicode value parsers with UTF-8 -> UCS-4 conversion
That's all, and given that there are at least two UTF-8 codecs announced on the mailing list, not a lot of work. And this will add Unicode support without changing the interface a bit.
Your proposal does not handle the problem; it merely works around it. Instead of working with character encodings, it does conversions all over the place. Regards, Pavol

Pavol Droba wrote:
Variants of what? The command line parser and config file parser will have two variants of the interface.
Not really variants. I mean templated, and specialized for the common char* and wchar_t*.
Ok.
The storage component need not have two variants. What advantage will it give? Finally, and that's the most important point, I believe that the options description component need only provide two variants of the 'value' function. As the document says, if you have two variants of the options_description class, then the ASCII vs. Unicode decision is global for the entire application, which is not so good.
This argument is quite questionable. IMHO you stick with either narrow or wide characters in the whole application. Otherwise you are forced to make conversions at the boundaries. I don't really see a point in the mixed-type approach.
Ok, let me rephrase. You're writing a boost::http_proxy library and want to make it customizable via program_options. So you need to provide a function 'get_options_descriptions'. What will the function return? If there's only one options_description class, there's no question. If there are two versions, then which one do you return? No matter what you decide, the main application might need to do conversions just because it either needs Unicode or does not need it. And why should an existing operator>> which works for istream only be fixed to support wistream, if some other option needs Unicode support?
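Coming back to the http_proxy example, for concreteness, here is a rough sketch of the single-class scenario (boost::http_proxy and its options are, of course, made up):

    #include <boost/program_options.hpp>
    #include <string>

    namespace po = boost::program_options;

    // With a single options_description class, a library can expose its
    // options without choosing between an ASCII and a Unicode variant.
    po::options_description get_options_descriptions()
    {
        po::options_description desc("HTTP proxy options");
        desc.add_options()
            ("port", po::value<int>(), "port to listen on")
            ("log-file", po::value<std::string>(), "log file name");
        return desc;
    }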
I understand that the STL's support for Unicode is not the best, but there are facilities that can provide the required functionality if properly extended/configured.
Let's break the question into two parts.
1. Should 'Unicode support' mean that there are two versions of each interface, one using string and the other using wstring? I think this kind of Unicode support is not good. It means that each library which accepts or returns strings must ultimately have a double interface, and either be entirely in headers or instantiate two variants in the sources -- which doubles the size of the library.
Actually, for a general-purpose library like this, I don't think that the compile-time overhead implied by templatization of the code is worse than having to do conversions all over the place at runtime. The library should work with basic_string if possible.
I generally tend to ignore speed issues, since with linear-time algorithms and contemporary processors it's not likely to be important. OTOH, code size *is* important. I've just compiled one of the library examples, with static linking and full optimization. It takes 152K. Probably it's partly gcc's fault, or maybe it can be reduced, but for now that's how it is. An empty program takes several K. Now, if I tell anyone "here's a good library for parsing the command line, but it will add 152K to the application size", that someone will say "thanks, I'll parse the command line by hand". However, if the library is shared and available on every Linux installation, then code size is not an issue.
If my application is Unicode, and all the input I have is Unicode, it is really annoying to convert everything to and fro when interfacing with a library like program_options.
You don't have to convert anything. Parsers will accept wstring, and for values where you need Unicode you'll use wstring as well.
It's *far* from an all-encompassing solution. In fact, the changes in program_options will include:
1. Adding ASCII -> UTF-8 conversion in parsers
2. Adding UTF-8 -> ASCII conversion in value parsers
3. Adding Unicode parsers with UCS-4 -> UTF-8 conversion
4. Adding Unicode value parsers with UTF-8 -> UCS-4 conversion
That's all, and given that there are at least two UTF-8 codecs announced on the mailing list, not a lot of work. And this will add Unicode support without changing the interface a bit.
Your proposal does not handle the problem; it merely works around it. Instead of working with character encodings, it does conversions all over the place.
Some of the conversions are unavoidable. E.g. if you have a Unicode-enabled library, you'd still need to accept ASCII input (because you can't expect that all input sources are Unicode -- main in Linux is never Unicode). If you want to support the legacy operator>> you'd need conversion to ASCII. - Volodya

Hi, On Tue, Apr 06, 2004 at 06:29:54PM +0400, Vladimir Prus wrote: [snip]
This argument is quite questionable. IMHO you stick with either narrow or wide characters in the whole application. Otherwise you are forced to make conversions at the boundaries. I don't really see a point in the mixed-type approach.
Ok, let me rephrase. You're writing a boost::http_proxy library and want to make it customizable via program_options. So you need to provide a function 'get_options_descriptions'. What will the function return? If there's only one options_description class, there's no question. If there are two versions, then which one do you return? No matter what you decide, the main application might need to do conversions just because it either needs Unicode or does not need it.
Well, the http library has two options: either it can be char_type independent, or it can simply accept only the char* variants. In the case of the http library, the latter will probably be chosen, because it is quite a domain-specific library.

I see that we are generally arguing about whether the program_options library's domain is generic enough to support char and wchar_t natively (and be templated), or whether it is enough to provide an interface via conversions and support only one encoding internally. I'm in favor of the first approach.

The library works with various sources of information, and its purpose is to restructure the information from these sources into something more usable. I would assume, for such a utility, that information passed on the input has the same encoding and format as the information on the output. From the nature of the library it seems that it might be possible to avoid unnecessary conversion into some intermediate encoding.

Another association might be a container. The library is a kind of container: it parses the input and provides a container-like interface to the information stored there. I find it natural that a container uses the same encoding for its internals as it provides in the external interface.
And why should an existing operator>> which works for istream only be fixed to support wistream, if some other option needs Unicode support?
I don't understand this point. [snip]
I generally tend to ignore speed issues, since with linear-time algorithms and contemporary processors it's not likely to be important. OTOH, code size *is* important. I've just compiled one of the library examples, with static linking and full optimization. It takes 152K.
Probably it's partly gcc's fault, or maybe it can be reduced, but for now that's how it is. An empty program takes several K. Now, if I tell anyone "here's a good library for parsing the command line, but it will add 152K to the application size", that someone will say "thanks, I'll parse the command line by hand".
However, if the library is shared and available on every Linux installation, then code size is not an issue.
I don't think that an overhead of 152kB is too big. We are living in a world of GBs; a few kB does not really change much. If an application uses some STL stuff, it won't be very small anyway. (Probably not the best example, but I have compiled the following program with gcc 3.3.1 in Cygwin with -O3, and stripped it of debug info afterwards:

    #include <iostream>
    using namespace std;

    int main()
    {
        cout << "a test" << endl;
        return 0;
    }

The resulting binary is 200kB.) I would strongly prefer simpler usage of the library to an overhead of 152kB.
If my application is Unicode, and all the input I have is Unicode, it is really annoying to convert everything to and fro when interfacing with a library like program_options.
You don't have to convert anything. Parsers will accept wstring, and for values where you need Unicode you'll use wstring as well.
[snip]
Some of the conversions are unavoidable. E.g. if you have a Unicode-enabled library, you'd still need to accept ASCII input (because you can't expect that all input sources are Unicode -- main in Linux is never Unicode).
If you want to support the legacy operator>> you'd need conversion to ASCII.
I'm not a Linux expert; I'm mainly working on Windows. If I decide to use Unicode, I have the whole API in Unicode without any need for conversions. Actually, in the project I'm working on now, I have encountered a need for conversion only once. I'm using the date_time library, and there was no support for wide strings at the time. Fortunately it is fixed now :) Regards, Pavol

Pavol Droba wrote:
The resulting binary is 200kB.)
I would strongly prefer simpler usage of the library to an overhead of 152kB.
Couple of things... for comparison:

    IRIX MIPSPro 7.4.1m
    CC -O3 -o test.IRIX test.cpp
    strip test.IRIX

    Linux gcc 3.2.3
    gcc32 -O3 -o test.Linux test.cpp -lstdc++
    strip test.Linux

    ls -l test.*
    -rw-rw-r--  1 hxpro  hxpro    101 Apr 6 17:37 test.cpp
    -rwxrwxr-x  1 hxpro  hxpro  22308 Apr 6 17:38 test.IRIX
    -rwxrwxr-x  1 hxpro  hxpro   3896 Apr 6 17:39 test.Linux

So Cygwin support is obviously adding a lot to the compile you did. I also disagree with your "size doesn't matter": we generally don't run *users'* machines with less than 2GB RAM and over half a TB of disk, but if every feature we added to our code added 200K for something as small as options parsing, we'd be avoiding it, because when you scale that up to hundreds of machines accessing a server, it could easily eat a good percentage of the bandwidth, even over Gig-E.

Kevin
--
| Kevin Wheatley                   | These are the opinions of |
| Senior Do-er of Technical Things | nobody and are not shared |
| Cinesite (Europe) Ltd            | by my employers           |

Kevin Wheatley wrote:
I also disagree with your "size doesn't matter": we generally don't run *users'* machines with less than 2GB RAM and over half a TB of disk, but if every feature we added to our code added 200K for something as small as options parsing, we'd be avoiding it, because when you scale that up to hundreds of machines accessing a server, it could easily eat a good percentage of the bandwidth, even over Gig-E.
I have only two points to add:
1. The "as small as options parsing" phrase above is very much to the point. A user will readily agree to 200K for a domain-specific library which might allow him to write his application ten times faster and with ten times fewer bugs. But options parsing is minor functionality, so the requirements are much stricter.
2. Yes, 200K is not much compared to 2GB of RAM, but you also need to download those 200K. Besides, if the library is used by more than one command-line tool, you multiply the 200K by the number of programs.
- Volodya

Pavol Droba wrote:
Ok, let me rephrase. You're writing a boost::http_proxy library and want to make it customizable via program_options. So you need to provide a function 'get_options_descriptions'. What will the function return? If there's only one options_description class, there's no question. If there are two versions, then which one do you return? No matter what you decide, the main application might need to do conversions just because it either needs Unicode or does not need it.
Well, the http library has two options: either it can be char_type independent, or it can simply accept only the char* variants. In the case of the http library, the latter will probably be chosen, because it is quite a domain-specific library.
Who knows? If the http library allows making POST requests, then it needs to accept a Unicode string for the request data.
I see that we are generally arguing about whether the program_options library's domain is generic enough to support char and wchar_t natively (and be templated), or whether it is enough to provide an interface via conversions and support only one encoding internally.
Actually, the question I'm trying to answer is somewhat different. Using two versions will inevitably increase the code size of the library or of client applications. It will also somewhat complicate the implementation. Using one version will decrease performance. I believe that the decrease in performance won't be noticed by users, so a single version is better, and I would like to know if there are issues I've missed.
I'm in favor of the first approach.
The library works with various sources of information, and its purpose is to restructure the information from these sources into something more usable. I would assume, for such a utility, that information passed on the input has the same encoding and format as the information on the output.
Sometimes the information on the output is not a string, but just an 'int', so it has no encoding. In cases where the information on the output is a string, I plan that it will have the same encoding as was passed on the input.
From the nature of the library it seems that it might be possible to avoid unnecessary conversion into some intermediate encoding.
Is there anything wrong with conversion, except for speed?
Another association might be a container. The library is a kind of container: it parses the input and provides a container-like interface to the information stored there. I find it natural that a container uses the same encoding for its internals as it provides in the external interface.
The problem with this analogy is that variables_map is a heterogeneous container: it stores values of different types. So, if it can store values of both std::string and std::wstring, it appears that you need some conversion.
And why should an existing operator>> which works for istream only be fixed to support wistream, if some other option needs Unicode support?
I don't understand this point.
You have a 'class Font' and an operator>> which works for istream only. However, you try to declare an option of this type in options_description<wchar_t>. This would cause instantiation of some code which extracts a 'Font' from a wistream, and there's no suitable operator>>.
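To sketch the failure (Font here is just an illustration):

    #include <istream>

    class Font { };

    // Only the narrow-stream extractor exists.
    std::istream& operator>>(std::istream& is, Font&) { return is; }

    // A hypothetical options_description<wchar_t> holding
    //     ("font", value<Font>(), "font to use")
    // would have to extract a Font from a std::wistream, i.e. it would
    // try to instantiate operator>>(std::wistream&, Font&), which does
    // not exist -- so the program would fail to compile.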
Some of the conversions are unavoidable. E.g. if you have a Unicode-enabled library, you'd still need to accept ASCII input (because you can't expect that all input sources are Unicode -- main in Linux is never Unicode).
If you want to support the legacy operator>> you'd need conversion to ASCII.
I'm not a Linux expert; I'm mainly working on Windows. If I decide to use Unicode, I have the whole API in Unicode without any need for conversions.
Actually, in the project I'm working on now, I have encountered a need for conversion only once. I'm using the date_time library, and there was no support for wide strings at the time. Fortunately it is fixed now :)
In fact, I would not be surprised if the char* functions in Windows just convert the input into Unicode and call the wide versions ;-) - Volodya

Hi Volodya, after reading all the discussion about the complexity of Unicode, and considering your arguments, I withdraw the idea of a fully templated implementation.

However, maybe something in between might be feasible. Actually, your interface is quite close. What I would like to propose is to have a core part working with an arbitrary encoding (implementation-defined) and a set of templated interfaces. So instead of writing unicode::whatever as in your proposal, "whatever" would be a thin templated layer that converts application-specific data to the format understood by the core library and vice versa. This layer should rely fully on locales to do the required conversion. These can be supplied via imbue().

This would simplify the interface from the user's perspective. The user will not have to use the (IMHO unnatural) unicode prefix, the interface will be flexible and open to any reasonable encoding, yet the core library will have all the properties you have declared as important (compiled separately, working with a single encoding, etc.). Does this seem reasonable to you? Regards, Pavol
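P.S. A rough sketch of the layering I mean -- every name below is invented for illustration, and the wide-to-narrow conversion shown is a crude placeholder for a proper codecvt-based one:

    #include <locale>
    #include <string>

    // The non-template core works with one internal encoding.
    void core_parse(const std::string& internal) { /* the core library */ }

    std::string to_internal(const std::string& s, const std::locale&)
    {
        return s;  // already in the core encoding
    }

    std::string to_internal(const std::wstring& s, const std::locale& loc)
    {
        // Wide -> narrow via the imbued locale; a real implementation
        // would use the locale's codecvt facet instead of ctype.
        std::string out;
        for (std::wstring::size_type i = 0; i < s.size(); ++i)
            out += std::use_facet< std::ctype<wchar_t> >(loc).narrow(s[i], '?');
        return out;
    }

    // The thin templated layer: convert at the boundary, then delegate.
    template<class charT>
    void parse_value(const std::basic_string<charT>& input, const std::locale& loc)
    {
        core_parse(to_internal(input, loc));
    }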

From: Pavol Droba <droba@topmail.sk>
However, maybe something in between might be feasible. Actually, your interface is quite close. What I would like to propose is to have a core part working with an arbitrary encoding (implementation-defined) and a set of templated interfaces.
So instead of writing unicode::whatever as in your proposal, "whatever" would be a thin templated layer that converts application-specific data to the format understood by the core library and vice versa. This layer should rely fully on locales to do the required conversion. These can be supplied via imbue().
Wouldn't it be better to use an installable translator (a policy class or a strategy class?) that leaves up to the client the responsibility to perform the required string operations, like concatenation (perhaps not needed), searching for a substring, trimming a string, etc.? That is, have the program_options library defer all string operations that require customization to support Unicode strings to the client. Then the library can provide the default, std::string implementation, and clients can, as they see fit for their own Unicode types, provide alternatives. Eventually there may be a few accepted translators that can be added to the library, but for now program_options won't need to define them, yet it will be ready for the future. -- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;
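P.S. For illustration, the default translator might look something like the following. All the names are invented, and the operations shown are just examples:

    #include <string>

    // Default translator: works on std::string.  Clients supply their
    // own type with the same members for their Unicode string of choice.
    struct default_translator
    {
        typedef std::string string_type;

        static string_type::size_type find(const string_type& s, char c)
        {
            return s.find(c);
        }

        static string_type trim(const string_type& s)
        {
            string_type::size_type b = s.find_first_not_of(" \t");
            string_type::size_type e = s.find_last_not_of(" \t");
            return b == string_type::npos ? string_type()
                                          : s.substr(b, e - b + 1);
        }
    };

    // The library itself would then be parameterized on the translator:
    //     template<class Translator = default_translator>
    //     class basic_options_description { /* ... */ };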

Rob Stewart wrote:
So instead of writing unicode::whatever as in your proposal, "whatever" would be a thin templated layer that converts application-specific data to the format understood by the core library and vice versa. This layer should rely fully on locales to do the required conversion. These can be supplied via imbue().
Wouldn't it be better to use an installable translator (a policy class or a strategy class?) that leaves up to the client the responsibility to perform the required string operations, like concatenation (perhaps not needed), searching for a substring, trimming a string, etc.? That is, have the program_options library defer all string operations that require customization to support Unicode strings to the client.
Then the library can provide the default, std::string implementation, and clients can, as they see fit for their own Unicode types, provide alternatives. Eventually there may be a few accepted translators that can be added to the library, but for now program_options won't need to define them, yet it will be ready for the future.
Hi Rob, I think this solution has a drawback. The approach we were discussing allows the user to obtain a wstring from a wchar_t command line right away, without writing any code at all. What he does with that wstring is up to him, but he gets the string. I'm afraid that if the user can't get this out of the box, it's hard to say that 'program_options supports Unicode'. - Volodya

From: Vladimir Prus <ghost@cs.msu.su>
Rob Stewart wrote:
Wouldn't it be better to use an installable translator (a policy class or a strategy class?) that leaves up to the client the responsibility to perform the required string operations, like concatenation (perhaps not needed), searching for a substring, trimming a string, etc.? That is, have the program_options library defer all string operations that require customization to support Unicode strings to the client.
Then the library can provide the default, std::string implementation, and clients can, as they see fit for their own Unicode types, provide alternatives. Eventually there may be a few accepted translators that can be added to the library, but for now program_options won't need to define them, yet it will be ready for the future.
I think this solution has a drawback. The approach we were discussing allows the user to obtain a wstring from a wchar_t command line right away, without writing any code at all. What he does with that wstring is up to him, but he gets the string.
I'm afraid that if the user can't get this out of the box, it's hard to say that 'program_options supports Unicode'.
I didn't think there was an expectation that things would work with Unicode, given the various Unicode formats and markers (or whatever Miro called them). Perhaps you're saying that a strict pass-through, without any conversions to other types, will work. I still don't know how you'll parse command lines with Unicode unless you know how to find each character. -- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;

Hi Rob,
I think this solution has a drawback. The approach we were discussing allows the user to obtain a wstring from a wchar_t command line right away, without writing any code at all. What he does with that wstring is up to him, but he gets the string.
I'm afraid that if the user can't get this out of the box, it's hard to say that 'program_options supports Unicode'.
I didn't think there was an expectation that things would work with Unicode, given the various Unicode formats and markers (or whatever Miro called them). Perhaps you're saying that a strict pass-through, without any conversions to other types, will work.
Yes, that's what I mean.
I still don't know how you'll parse command lines with Unicode unless you know how to find each character.
In UTF-8, I can search for '=' and '-' with the regular string's 'find' method (except for the combining-characters problem, but that's a minor issue). That's why it's attractive. - Volodya
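P.S. A sketch of why byte-level parsing is safe here: in UTF-8, an ASCII byte such as '=' never occurs inside the multi-byte encoding of another character, so a plain find() on the raw bytes cannot produce a false positive. (The function name below is made up.)

    #include <string>
    #include <utility>

    // Split a UTF-8 encoded "name=value" token on the raw bytes.
    std::pair<std::string, std::string> split_option(const std::string& utf8_token)
    {
        std::string::size_type eq = utf8_token.find('=');
        if (eq == std::string::npos)
            return std::make_pair(utf8_token, std::string());
        return std::make_pair(utf8_token.substr(0, eq),
                              utf8_token.substr(eq + 1));
    }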

If you want it to work with UTF-8, you should avoid using any non-hexadecimal/octal-specified character or string literals for comparison, since there is no guarantee that a character or string literal, even a wide literal, will be encoded in any particular encoding. (This is one of the annoyances of dealing with Unicode in C++ -- and it justifies a language extension which would allow specifying UTF-8-, UTF-16-, and/or UTF-32-encoded string literals, as well as UTF-32 character literals (single code points).) -- Jeremy Maitin-Shepard
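P.S. That is, something along these lines:

    // The value of '=' depends on the implementation's execution
    // character set; '\x3D' pins down the exact byte value (the
    // ASCII/UTF-8 '=').
    const char eq_portable = '\x3D';
    const char eq_implementation_defined = '=';

    // So, for UTF-8 input, compare against the explicit byte:
    //     if (token[i] == '\x3D') { ... }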

Hi Pavol,
After reading all the discussion about the complexity of Unicode, and considering your arguments, I withdraw the idea of a fully templated implementation.
However, maybe something in between might be feasible. Actually, your interface is quite close. What I would like to propose is to have a core part working with an arbitrary encoding (implementation-defined) and a set of templated interfaces.
What I planned was a core part (the parsers) working with a specific encoding, and overloads of the interface functions for char and wchar_t. In fact, that's close to what you propose.
So instead of writing unicode::whatever as in your proposal, "whatever" would be a thin templated layer that converts application-specific data to the format understood by the core library and vice versa. This layer should rely fully on locales to do the required conversion. These can be supplied via imbue().
We now have the 'typed_value' class which is responsible for converting a string into the needed type. It's a template already, so another template parameter -- the char type to use -- will be fine.
This would simplify the interface from the user's perspective. The user will not have to use the (IMHO unnatural) unicode prefix, the interface will be flexible and open to any reasonable encoding, yet the core library will have all the properties you have declared as important (compiled separately, working with a single encoding, etc.).
I should only comment on the 'unicode' prefix. I meant it as a way to easily switch from ASCII to Unicode. E.g. you have ("foo", value<int>(), "foo"), which uses ASCII. You add 'using namespace boost::program_options::unicode' above, and it uses Unicode. But, OTOH, given that I (for now) expect that most options will be parsed using ASCII, maybe this convenience optimization is premature. Nothing is wrong with: ("foo", value<int, wchar_t>(), "foo")
Does this seem reasonable to you?
Yes, it appears the basic approach is settled. Need to iron out some details and start coding again... Thanks, Volodya

I have read your proposal. Maybe I'm missing something very serious, but I would prefer a scheme similar to the one used by the STL.
That is, there would be variants accepting the char and wchar_t data types, and all possible Unicode problems would be addressed by char_traits and locale.
I have to agree. Programs should internally work in terms of fixed-width character sets. When string data needs to be imported/exported, locales should be used to perform the transformation. I would make program_options support an imbue() function that allows a locale to be specified (otherwise the default locale is used), and template any functions that need to process strings on the character type. This provides much more flexibility than just supporting UTF-8. UTF-8 is a really impractical encoding for almost any locale where the majority of text is not ASCII-like, and the user may well prefer to encode text in Shift-JIS or other encodings.
I understand that the STL's support for Unicode is not the best, but there are facilities that can provide the required functionality if properly extended/configured.
The support really isn't that bad. Mostly it's a case of the standard not mandating support for specific features (leaving it as a QOI issue) and of programmers not understanding what's required of them in order to make things work. It's a shame that wchar_t is only guaranteed to be 16-bit, but for almost all real-world uses UCS-2 provides the required functionality. Java (a "Unicode compliant" language) only supports 16-bit wide characters -- really the only difference is that Java doesn't support 8-bit characters and handles all the character transformation in its I/O library without the user having to get involved (most of the time). There is a definite need for a decent UTF-8 code converter. I know there is at least one in the vault. I can't answer for its quality as I haven't tried to use it. The other need is for a type that is guaranteed to be at least 32 bits and can support UCS-4 for the odd occasions when there is a need to use characters from outside the BMP.
I think there is no big reason to try to reinvent the wheel and provide an all-encompassing solution in a library like program_options.
It should be enough if it is Unicode-enabled so that it can be used in any specific scenario, provided that all the necessary facilities are in place.
Hear, hear. - Dale.

Dale Peakall wrote:
That is, there would be variants accepting the char and wchar_t data types, and all possible Unicode problems would be addressed by char_traits and locale.
I have to agree. Programs should internally work in terms of fixed-width character sets. When string data needs to be imported/exported, locales should be used to perform the transformation.
I would make program_options support an imbue() function that allows a locale to be specified (otherwise the default locale is used), and template any functions that need to process strings on the character type.
Why bother? As I've indicated, internal processing will work just fine with UTF-8, and no interface of the library will expose a UTF-8-encoded string. imbue() is a good thing for specifying what encoding the char** input has, but it's orthogonal to the rest of the library -- it's used only on the interface boundary.
This provides much more flexibility than just supporting UTF-8. UTF-8 is a really impractical encoding for almost any locale where the majority of text is not ASCII-like, and the user may well prefer to encode text in Shift-JIS or other encodings.
Again, in my case the user does not use UTF-8 strings, so why would he care how the strings are encoded internally? - Volodya

In article <001b01c41bdb$9882c030$8119fea9@FDI.LOCAL>, "Dale Peakall" <dale@peakall.com> wrote:
I have to agree. Programs should internally work in terms of fixed-width character sets. When string data needs to be imported/exported, locales should be used to perform the transformation.
The only fixed-width Unicode character sets are 32 bits wide; UTF-16 is not fixed width, it's only fixed width most of the time. You will have a hard time convincing me (and most other people, as far as I know) that it's a good idea for all of your strings to take 4x as much space as they need. A fixed-width character set is a win mainly when you are doing character-level manipulation, which isn't very common in any properly internationalized app because internationalized apps read most of their strings from a table and use only concatenation (where sprintf/boost::format are considered little more than concatenation) to produce new strings. If all you are doing is concatenation, you don't care about fixed-width characters. That is not to say that fixed-width character sets don't have their uses, but it is not at all as simple as "you should use them all the time". meeroh -- If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>

In article <200404061127.44141.ghost@cs.msu.su>, Vladimir Prus <ghost@cs.msu.su> wrote:
it seems that Unicode support is the last issue that should be addressed before the library can be added to CVS. Since the issue is somewhat tricky, I'd appreciate some comments before I start coding.
Unicode is a non-trivial problem, and I strongly encourage you not to attempt to seriously tackle Unicode in program_options without spending some time thinking about more general issues of Unicode in boost and the STL. As I see it, there are only two things that can come out of this:

1. You really sit down and write an appropriate Unicode string abstraction for boost, not tied to program_options. If this is the choice you make, then we should be discussing this separately from program_options, and it should be designed separately; when its implementation design begins, program_options can be the first client, of course.

2. You don't really try to solve the whole problem, and you do the minimal amount of work needed for program_options to support Unicode, while ignoring the larger issue of Unicode support in applications. In this case, you need to identify the minimal requirements you need to satisfy, and design program_options appropriately.

I very strongly discourage you from doing anything in between, because Unicode becomes rather complex very, very quickly when you decide to do something non-trivial, and most likely attempting to do something between 1 and 2 will take you down that path before you know it. The complexity of Unicode and internationalization in general should not be underestimated.

That said, remarks on your design:

First of all, there is no guarantee that std::wstring is UCS-4-encoded, nor even that std::wstring is wide enough to hold a UCS-4 code point. Because of the extent to which wchar_t and std::wstring are platform-dependent, I would avoid looking at them at all. (They are so platform-dependent that you can't declare a wide character string literal and be assured that it will work on all reasonable compilers -- because you don't know how wide your characters are.)

Given that, I would simply declare that the extent of Unicode support in program_options will be that it supports UTF-8-encoded std::strings, in either canonically decomposed form or canonically precomposed form. If you make those assertions, you can take advantage of Unicode properties in the following two ways: Searching for a substring X of string Y can be done without regard for character boundaries (because Unicode guarantees that characters are encoded to avoid false positives in this scenario). Strict string comparison can be done without regard for character boundaries (because every character has precisely one encoding in each canonical form).

Basically, those two assumptions allow you to get as close to manipulating strings without considering character boundaries as you can, and IMNSHO that's the best you can do unless you want to design a real Unicode abstraction. meeroh -- If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>

Hi Miro,
it seems that Unicode support is the last issue that should be addressed before the library can be added to CVS. Since the issue is somewhat tricky, I'd appreciate some comments before I start coding.
Unicode is a non-trivial problem, and I strongly encourage you not to attempt to seriously tackle Unicode in program_options without spending some time thinking about more general issues of Unicode in boost and STL. As I see it, there are only two things that can come out of this:
1. You really sit down and write an appropriate Unicode string abstraction for boost, not tied to program_options
If this is the choice you make, then we should be discussing this separately from program_options, and it should be designed separately; when its implementation design begins, program_options can be the first client, of course.
I'd like to avoid this by all means. I'm not a Unicode expert, but what I've learned already sounds complex enough. Besides, competing with existing solutions like ICU (http://oss.software.ibm.com/icu/) is not such a good idea.
2. You don't really try to solve the whole problem, and you do the minimal amount of work needed for program_options to support Unicode, while ignoring the larger issue of Unicode support in applications
In this case, you need to identify the minimal requirements you need to satisfy, and design program_options appropriately.
Right. Here are the requirements:
1. When declaring each option, one should be able to specify whether the value should be parsed using Unicode or ASCII. If it should be parsed using Unicode, all Unicode issues (e.g. normalization) are up to the client.
2. Each parser should have an ASCII and a Unicode version. How the Unicode string is obtained is up to the client.
3. The library guarantees that:
- ASCII input is passed to an ASCII value without change
- Unicode input is passed to a Unicode value without change
- ASCII input passed to a Unicode value, and Unicode input passed to an ASCII value, will be converted using a codecvt facet (which can be specified by the user)
Essentially, the library will allow both ASCII and Unicode strings to pass through unmodified.
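In code, the guarantee amounts to something like this (a sketch; the names are invented):

    #include <string>

    // Matching input and value types pass through untouched.
    std::string  pass(const std::string& in)  { return in; }  // ASCII -> ASCII value
    std::wstring pass(const std::wstring& in) { return in; }  // Unicode -> Unicode value

    // Only the two mixed cases (ASCII input -> Unicode value, and
    // Unicode input -> ASCII value) would go through the
    // user-specified codecvt facet.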
That said, remarks on your design:
First of all, there is no guarantee that std::wstring is UCS4-encoded, nor even that std::wstring is wide enough to hold a UCS4 code point. Because of the extent to which wchar_t and std::wstring are platform-dependent, I would avoid looking at them at all. (They are so platform-dependent that you can't declare a wide character string literal and be assured that it will work on all reasonable compilers -- because you don't know how wide your characters are.)
Oh, I have to agree with this. Even though characters outside the BMP are rare, it's better not to use wstring.
Given that, I would simply declare that the extent of Unicode support in program_options will be that it supports UTF-8-encoded std::strings, in either canonically decomposed form or canonically precomposed form. If you make those assertions, you can take advantage of Unicode properties in the following two ways:
I think there's no need to require that the std::string passed to program_options is in UTF-8. It's better to allow the user to specify a codecvt facet for converting char* into Unicode. So one can use an 8-bit encoding specified by the locale, or use UTF-8, as he likes. (Besides, codecvt uses wchar_t -- does that mean it can't really be used for Unicode either?)
Searching for a substring X of string Y can be done without regard for character boundaries (because Unicode guarantees that characters are encoded to avoid false positives in this scenario).
It might be even simpler: since I only look for characters in ASCII, I don't even care about the canonical form -- I believe all ASCII characters are unambiguous (I mean strict 7-bit ASCII).
Strict string comparison can be done without regard for character boundaries (because every character has precisely one encoding in each canonical form).
For now, I don't plan to support Unicode in option names, so string comparison is not yet needed.
Basically, those two assumptions allow you to get as close to manipulating strings without considering character boundaries as you can, and IMNSHO that's the best you can do unless you want to design a real Unicode abstraction.
Thanks for your comments! - Volodya

In article <c508gd$o78$1@sea.gmane.org>, Vladimir Prus <ghost@cs.msu.su> wrote:
1. You really sit down and write an appropriate Unicode string abstraction for boost, not tied to program_options
If this is the choice you make, then we should be discussing this separately from program_options, and it should be designed separately; when its implementation design begins, program_options can be the first client, of course.
I'd like to avoid this by all means. I'm not a Unicode expert, but what I've learned already sounds complex enough. Besides, competing with existing solutions like ICU (http://oss.software.ibm.com/icu/) is not such a good idea.
I agree that this is not the best choice for you at this time, not because I think boost should avoid competing with ICU, but because I think it's beyond the scope of your current work.
1. When declaring each option, one should be able to specify whether the value should be parsed using Unicode or ASCII. If it should be parsed using Unicode, all Unicode issues (e.g. normalization) are up to the client.
OK, I have to say at this point that I have not spent much time looking at the PO design itself (nor do I have the time right now; it was only the mention of Unicode that brought me out of lurking), so I may be confused about what's going on here. That said, the way I understand it is that you have some character-based input (argv, config file, environment) that's passed to your library, which you then need to parse for options based on a client-provided specification, and set some variables back in the client. The level of Unicode support you want is that you want to accept Unicode characters on input (presumably the config file, as I am not aware of a wchar_t argv variant), and do something sensible (i.e., not mangle the values) from there.

I am going to guess that all the characters used as delimiters in the parsing code are ASCII. If that is the case, you could simply continue to treat all strings as containers of code points, and you would not run into any problems except when parsing a string that contains a delimiter character followed by a combining mark; for example, foo="bar"<combining mark>baz" would be incorrectly parsed as foo="bar". However, you already have to deal with the case of embedded delimiters, and there is no reason why you can't extend whatever you are doing now to this case; for example, if I would have had to write foo="bar\"baz" if the embedded " didn't have a combining mark following it, then I could just as well be required to use foo="bar\"<combining mark>baz" if the embedded " does have a combining mark following it.

So (and again, this is based on very little information about program_options, and mostly on a quick sketch of it that I formed in my head this evening), it seems that as long as you have a mechanism right now to cope with embedded delimiters in program options, you should be able to continue using essentially the same mechanism to cope with Unicode strings. From there, you can decompose your input into keys and values, and now you are left with parsing the values; string values are converted according to the locale (as you said yourself), and numeric values are parsed probably by converting them to ASCII and then using the method you already have.
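To make the combining-mark pitfall concrete, here is a small sketch; the bytes 0xCC 0x81 are the UTF-8 encoding of U+0301, the combining acute accent:

    #include <cassert>
    #include <string>

    int main()
    {
        // bar"<U+0301>baz -- a double quote with a combining mark on it.
        std::string s = "bar\x22\xCC\x81" "baz";

        // A byte-level scan still sees a plain '"' delimiter at index 3,
        // even though the mark logically belongs to that character.
        assert(s.find('"') == 3);
        return 0;
    }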
It might be even simpler: since I only look for characters in ASCII, I don't even care about the canonical form -- I believe all ASCII characters are unambiguous (I mean strict 7-bit ASCII).
Yes, except in the case where they are followed by a combining mark; see remarks above.
For now, I don't plan to support Unicode in option names, so string comparison is not yet needed.
Oh good :-)
Thanks for your comments!
You are welcome! I am admittedly somewhat tired right now; I hope I am making myself clear enough. :-) meeroh -- If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>

On 4/6/04 3:27 AM, "Vladimir Prus" <ghost@cs.msu.su> wrote:
it seems that Unicode support is the last issue that should be addressed before the library can be added to CVS. Since the issue is somewhat tricky, I'd appreciate some comments before I start coding. [TRUNCATE]
What about:
* There's no guarantee that "char" is based on ASCII
* There's no guarantee that "wchar_t" is based on Unicode

Since other text-related parts of Boost don't really deal with Unicode issues, maybe you should address it after putting it in CVS. Maybe after discussions on how Unicode can fit into Boost as a whole. (Other posts in this thread have admitted that the problem is big and difficult. I don't think it's worth delaying the library over. Sometimes, cool-sounding ideas in the abstract turn out to be bad ones in practice.) Even if you do come up with some grand Unicode plan, you would have to make sure your library works with platforms that don't use ASCII/Unicode. -- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT hotmail DOT com

Daryle Walker wrote:
On 4/6/04 3:27 AM, "Vladimir Prus" <ghost@cs.msu.su> wrote:
it seems that Unicode support is the last issue that should be addressed before the library can be added to CVS. Since the issue is somewhat tricky, I'd appreciate some comments before I start coding.
[TRUNCATE]
What about:
* There's no guarantee that "char" is based on ASCII
* There's no guarantee that "wchar_t" is based on Unicode
Since other text-related parts of Boost don't really deal with Unicode issues, maybe you should address it after putting it in CVS.
It was specifically requested that some Unicode/wchar_t support be added before the library is put into CVS.
Maybe after discussions on how Unicode can fit into Boost as a whole. (Other posts in this thread have admitted that the problem is big and difficult. I don't think it's worth delaying the library over. Sometimes, cool-sounding ideas in the abstract turn out to be bad ones in practice.)
What 'cool-sounding idea' do you mean? What I proposed was that Unicode data is just passed through, without modification.
Even if you do come up with some grand Unicode plan, you would have to make sure your library works with platforms that don't use ASCII/Unicode.
Do you know of a specific case where wchar_t does not implicitly mean Unicode? - Volodya

On 4/9/04 3:54 AM, "Vladimir Prus" <ghost@cs.msu.su> wrote:
Daryle Walker wrote:
On 4/6/04 3:27 AM, "Vladimir Prus" <ghost@cs.msu.su> wrote:
it seems that Unicode support is the last issue that should be addressed before the library can be added to CVS. Since the issue is somewhat tricky, I'd appreciate some comments before I start coding.
[TRUNCATE]
What about:
* There's no guarantee that "char" is based on ASCII
* There's no guarantee that "wchar_t" is based on Unicode
Since other text-related parts of Boost don't really deal with Unicode issues, maybe you should address it after putting it in CVS.
It was specifically requested that some Unicode/wchar_t support be added before the library is put into CVS.
That doesn't mean that you _have_ to do it. You can give the person who made the request a (temporary) rejection notice.
Maybe after discussions on how Unicode can fit into Boost as a whole. (Other posts in this thread have admitted that the problem is big and difficult. I don't think it's worth delaying the library over. Sometimes, cool-sounding ideas in the abstract turn out to be bad ones in practice.)
What 'cool-sounding idea' do you mean? What I proposed was that Unicode data is just passed through, without modification.
I read messages in this thread about doing full-blown Unicode handling, and I've read about doing nothing (being as Unicode-ignorant as other text-processing Boost libraries). I wouldn't mind adding "wchar_t" support without necessarily assuming that it's Unicode. However, the Unicode "problem" is so big that it could take more time and effort than everything you have done on program_options so far. _That_ is what I don't want to delay the library for. Also, a solution should be applicable to all of Boost's text libraries, not just this one.
Even if you do come up with some grand Unicode plan, you would have to make sure your library works with platforms that don't use ASCII/Unicode.
Do you know of a specific case where wchar_t does not implicitly mean Unicode?
Not personally, but that's about as relevant as asking for a platform whose "char" isn't 8 bits. (I've heard platforms like that have existed.) Just because all the common platforms do it a certain way (and/or there are no counter-examples) doesn't mean you can portably assume that the common assumption is all that matters. The identities and code points of the members of the (narrow and wide) character sets are implementation-defined. The C++ parser allows characters to be named by their ISO/Unicode number, but they are supposed to be mapped to the platform's code point for that character, not necessarily kept in Unicode. -- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT hotmail DOT com

Daryle Walker wrote:
It was specifically requested that some Unicode/wchar_t support be added before the library is put into CVS.
That doesn't mean that you _have_ to do it. You can give the person who made the request a (temporary) rejection notice.
Well, that was requested by the review manager (while wearing the review manager hat). So, given that I think it's possible to implement with little effort, I'd rather implement it.
What 'cool-sounding idea' do you mean? What I proposed was that Unicode data is just passed through, without modification.
I read messages in this thread about doing full-blown Unicode handling, and I've read about doing nothing (being as Unicode-ignorant as other text-processing Boost libraries). I wouldn't mind adding "wchar_t" support without necessarily assuming that it's Unicode.
I still plan to assume that wchar_t is Unicode, until some user comes up with problems in a real case.
Do you know of a specific case where wchar_t does not implicitly mean Unicode?
Not personally, but that's about as relevant as asking for a platform whose "char" isn't 8 bits. (I've heard platforms like that have existed.)
I know one such platform which exists today, with a 32-bit char. However, it (1) still uses ASCII, and (2) you won't run string algorithms on a DSP anyway. Yes, an implementation is allowed to use a randomly-permuted ASCII encoding. The point is that *if* a user of such an implementation really starts using program_options, it would be possible to accommodate that. - Volodya

On 4/12/04 2:05 AM, "Vladimir Prus" <ghost@cs.msu.su> wrote:
Daryle Walker wrote: [SNIP]
Do you know of a specific case where wchar_t does not implicitly mean Unicode?
Not personally, but that's about as relevant as asking for a platform whose "char" isn't 8 bits. (I've heard platforms like that have existed.)
I know one such platform which exists today, with a 32-bit char. However, it (1) still uses ASCII, and (2) you won't run string algorithms on a DSP anyway.
I meant something like a 9-bit EBCDIC "char" on an older computer. (I made that up; I don't know if that case has really occurred.)
Yes, an implementation is allowed to use a randomly-permuted ASCII encoding. The point is that *if* a user of such an implementation really starts using program_options, it would be possible to accommodate that.
I'd rather not have an indefinitely-ticking portability time bomb left in the code. -- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT hotmail DOT com

Daryle Walker wrote:
Not personally, but that's about as relevant as asking for a platform whose "char" isn't 8 bits. (I've heard platforms like that have existed.)
I know one such platform which exists today, with a 32-bit char. However, it (1) still uses ASCII, and (2) you won't run string algorithms on a DSP anyway.
I meant something like a 9-bit EBCDIC "char" on an older computer. (I made that up; I don't know if that case has really occurred.)
Ok, though EBCDIC is quite ancient anyway.
Yes, an implementation is allowed to use a randomly-permuted ASCII encoding. The point is that *if* a user of such an implementation really starts using program_options, it would be possible to accommodate that.
I'd rather not have an indefinitely-ticking portability time bomb left in the code.
In fact, that's not going to be a time bomb. I expect anyone who's going to use the library to run the tests or to look at the test results, and even a trivial Unicode test will catch problems on EBCDIC. What I don't want is to spend time accommodating systems:
1. which might not exist anymore
2. which might not have any user base
3. which might not have a good enough C++ compiler
- Volodya

In article <BC9BBA99.90D2%darylew@hotmail.com>, Daryle Walker <darylew@hotmail.com> wrote:
Sometimes, cool-sounding ideas in the abstract turn out to be bad ones in practice.
Unicode is the paragon of that entire class of problems :-) meeroh -- If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>
participants (9)
- Dale Peakall
- Daryle Walker
- Hagen Moebius
- Jeremy Maitin-Shepard
- Kevin Wheatley
- Miro Jurisic
- Pavol Droba
- Rob Stewart
- Vladimir Prus