
In article <200404061127.44141.ghost@cs.msu.su>, Vladimir Prus <ghost@cs.msu.su> wrote:
> it seems that Unicode support is the last issue that should be addressed before the library can be added to CVS. Since the issue is somewhat tricky, I'd appreciate some comments before I start coding.
Unicode is a non-trivial problem, and I strongly encourage you not to tackle it seriously in program_options without first spending some time thinking about the more general issues of Unicode in boost and the STL. As I see it, there are only two things that can come out of this:

1. You really sit down and write an appropriate Unicode string abstraction for boost, not tied to program_options. If this is the choice you make, then we should discuss it separately from program_options, and it should be designed separately; when its implementation begins, program_options can of course be the first client.

2. You don't try to solve the whole problem, and you do the minimal amount of work needed for program_options to support Unicode, while ignoring the larger issue of Unicode support in applications. In this case, you need to identify the minimal requirements you must satisfy, and design program_options accordingly.

I very strongly discourage you from doing anything in between, because Unicode becomes rather complex very quickly once you decide to do something non-trivial, and attempting something between 1 and 2 will most likely take you down path 1 before you know it. The complexity of Unicode, and of internationalization in general, should not be underestimated.

That said, some remarks on your design:

First of all, there is no guarantee that std::wstring is UCS-4-encoded, nor even that its wchar_t elements are wide enough to hold a UCS-4 code point. Because of the extent to which wchar_t and std::wstring are platform-dependent, I would avoid looking at them at all. (They are so platform-dependent that you can't declare a wide character string literal and be assured that it will work on all reasonable compilers -- because you don't know how wide your characters are.)
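To make the wchar_t point concrete, here is a minimal sketch that reports how wide wchar_t is on the compiling platform and whether a single wchar_t can hold any Unicode code point. The names wchar_bits and wchar_holds_any_code_point are made up for this illustration; they are not part of boost or program_options.

```cpp
#include <climits>
#include <cwchar>

// Largest Unicode code point, U+10FFFF (needs 21 bits).
const unsigned long kMaxCodePoint = 0x10FFFFul;

// Number of bits occupied by one wchar_t on this platform.
// (Hypothetical helper, for illustration only.)
inline unsigned wchar_bits() {
    return static_cast<unsigned>(sizeof(wchar_t) * CHAR_BIT);
}

// True if one wchar_t can represent every Unicode code point directly.
// Where wchar_t is 16 bits wide this returns false, so a std::wstring
// there cannot store one code point per element.
inline bool wchar_holds_any_code_point() {
    return static_cast<unsigned long long>(WCHAR_MAX) >= kMaxCodePoint;
}
```

The answer differs between compilers and platforms, which is exactly why code that assumes one particular width for wchar_t is not portable.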
Given that, I would simply declare that the extent of Unicode support in program_options is that it supports UTF-8-encoded std::strings, in either canonically decomposed form (NFD) or canonically precomposed form (NFC). If you make those assertions, you can take advantage of Unicode properties in the following two ways:

- Searching for a substring X of string Y can be done without regard for character boundaries (because UTF-8 guarantees that characters are encoded to avoid false positives in this scenario).

- Strict string comparison can be done without regard for character boundaries (because every character has precisely one encoding in each canonical form).

Basically, those two assumptions allow you to get as close to manipulating strings without considering character boundaries as you can, and IMNSHO that's the best you can do unless you want to design a real Unicode abstraction.

meeroh

-- 
If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>
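P.S. A minimal sketch of the two properties, assuming both strings are valid UTF-8 and both are in the same canonical form. The function names utf8_contains and utf8_equal_same_form are hypothetical, not part of program_options, and nothing here validates or normalizes its input.

```cpp
#include <string>

// Byte-wise substring search. For valid UTF-8 input, a match can only
// begin and end on a character boundary, so no decoding is needed.
inline bool utf8_contains(const std::string& haystack,
                          const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}

// Byte-wise equality. This is only meaningful when both strings use the
// same canonical form: the precomposed "\xC3\xA9" (U+00E9) and the
// decomposed "e\xCC\x81" (U+0065 U+0301) look the same when displayed,
// but they are different byte sequences.
inline bool utf8_equal_same_form(const std::string& a,
                                 const std::string& b) {
    return a == b;
}
```

For example, utf8_contains("caf\xC3\xA9", "\xC3\xA9") is true, while utf8_equal_same_form("\xC3\xA9", "e\xCC\x81") is false, which is why the canonical form has to be fixed up front.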