[boost] Re: [program_options] Unicode support

6 Apr 2004

In article <200404061127.44141.ghost@cs.msu.su>,
Vladimir Prus <ghost@cs.msu.su> wrote:

> ...
> it seems that Unicode support is the last issue that should be addressed
> before the library can be added to CVS. Since the issue is somewhat tricky,
> I'd appreciate some comments before I start coding.

Unicode is a non-trivial problem, and I strongly encourage you not to attempt to 
seriously tackle Unicode in program_options without spending some time thinking 
about more general issues of Unicode in boost and STL. As I see it, there are 
only two things that can come out of this:

1. You really sit down and write an appropriate Unicode string abstraction for 
boost, not tied to program_options

If this is the choice you make, then we should be discussing this separately 
from program_options, and it should be designed separately; when its 
implementation design begins, program_options can be the first client, of course.

2. You don't really try to solve the whole problem, and you do the minimal 
amount of work needed for program_options to support Unicode, while ignoring the 
larger issue of Unicode support in applications

In this case, you need to identify the minimal requirements you need to satisfy, 
and design program_options appropriately.

I very strongly discourage you from doing anything in-between, because Unicode 
becomes rather complex very very quickly when you decide to do something 
non-trivial, and most likely attempting to do something between 1 and 2 will 
take you down path 1 before you know it. The complexity of Unicode and 
internationalization in general should not be underestimated.

That said, remarks on your design:

First of all, there is no guarantee that std::wstring is UCS4-encoded, nor even 
that std::wstring is wide enough to hold a UCS4 code point. Because of the 
extent to which wchar_t and std::wstring are platform-dependent, I would avoid 
looking at them at all. (They are so platform-dependent that you can't declare a 
wide character string literal and be assured that it will work on all reasonable 
compilers -- because you don't know how wide your characters are.)
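
To make the wchar_t point concrete, here is a minimal sketch that asks the compiler how wide wchar_t actually is and whether one wchar_t can hold every Unicode code point. The helper names (wchar_bits, fits_in_wchar, to_wchar) are made up for illustration and are not part of program_options or any Boost library; the sketch assumes a C++11 compiler.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <limits>

// Largest code point defined by Unicode (U+10FFFF).
constexpr unsigned long kMaxCodePoint = 0x10FFFFul;

// Number of bits occupied by wchar_t on this platform (commonly 16 or 32).
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if the given code point is representable as a single wchar_t here.
constexpr bool fits_in_wchar(unsigned long code_point) {
    return code_point <=
           static_cast<unsigned long long>(std::numeric_limits<wchar_t>::max());
}

// True if a single wchar_t can represent every Unicode code point.
constexpr bool wchar_holds_any_code_point() {
    return fits_in_wchar(kMaxCodePoint);
}

// Hypothetical helper: store a code point in one wchar_t.
// The precondition fails on platforms where wchar_t is too narrow.
inline wchar_t to_wchar(unsigned long code_point) {
    assert(fits_in_wchar(code_point));
    return static_cast<wchar_t>(code_point);
}
```

On a platform with a 16-bit wchar_t, wchar_holds_any_code_point() is false and to_wchar(0x1F600) trips its assertion; with a 32-bit wchar_t both succeed. Code that depends on the answer is not portable, which is the reason to avoid std::wstring here.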

Given that, I would simply declare that the extent of Unicode support in 
program_options will be that it supports UTF-8-encoded std::strings, in either 
canonically decomposed form or canonically precomposed form. If you make those 
assertions, you can take advantage of Unicode properties in the following two 
ways:

Searching for a substring X of string Y can be done without regard for character 
boundaries (because Unicode guarantees that characters are encoded to avoid 
false positives in this scenario).

Strict string comparison can be done without regard for character boundaries 
(because every character has precisely one encoding in each canonical form).

Basically, those two assumptions allow you to get as close to manipulating 
strings without considering character boundaries as you can, and IMNSHO that's 
the best you can do unless you want to design a real Unicode abstraction.
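
A rough sketch of what those two properties buy you, using nothing but std::string. The function names are invented for illustration and are not a proposed program_options interface; the sketch assumes the inputs are well-formed UTF-8 and that both sides of a comparison use the same canonical form.

```cpp
#include <cassert>
#include <string>

// Byte-wise substring search. For well-formed UTF-8 this cannot match in the
// middle of a character, because lead bytes and continuation bytes use
// disjoint value ranges.
inline bool utf8_contains(const std::string& haystack,
                          const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}

// Byte-wise equality. Meaningful only when both strings are in the same
// canonical form (both precomposed or both decomposed).
inline bool utf8_equal(const std::string& a, const std::string& b) {
    return a == b;
}

inline void check_utf8_assumptions() {
    // "resume" with acute accents, precomposed: U+00E9 is bytes C3 A9.
    const std::string precomposed = "r\xC3\xA9sum\xC3\xA9";
    // Same text, decomposed: 'e' followed by U+0301 (bytes CC 81).
    const std::string decomposed = "re\xCC\x81sume\xCC\x81";

    const std::string arg = "--name=" + precomposed;

    // Substring search works on raw bytes.
    assert(utf8_contains(arg, precomposed));
    // No false positive: plain 'e' does not occur inside the bytes of U+00E9.
    assert(!utf8_contains(precomposed, "e"));

    // Equality works within one canonical form...
    assert(utf8_equal(precomposed, std::string("r\xC3\xA9sum\xC3\xA9")));
    // ...but the two forms of the same text differ byte for byte.
    assert(!utf8_equal(precomposed, decomposed));
}
```

The last assertion is why the library would have to state which canonical form it expects: without that, two spellings of the same option name would compare unequal.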

meeroh

-- 
If this message helped you, consider buying an item
from my wish list: <http://web.meeroh.org/wishlist>

Miro Jurisic