
Hi Ferdinand,
I am not exactly sure whether UTF-8 or UCS-4 is better as a universal solution, but some solution is surely needed.
I am afraid there is no universal solution for all users. The easiest solution is based on the native basic_string<>, which is specialized for char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit) usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding would require another basic_string<> specialization.
UCS-2 held all characters in Unicode 1.1. There was a need for more unique numbers, and UCS-4 was introduced in Unicode 2.0. Unfortunately, there is no 4-byte character specialization for basic_string<> in the STL yet.
Or, to be exact, there's no agreement on whether wchar_t should be 32-bit or 16-bit. Linux (or gcc specifically) uses 32 bits, and Windows 16, which means wstring is only suitable for UCS-2. Besides, UCS-2 has a mechanism to represent characters outside of the 16-bit space with two elements, which, I suspect, won't work if wchar_t is 16 bits.
Generally speaking, UTF encodings (e.g. UTF-8 or UTF-16) are not suitable for fast in-memory usage because some operations, for example at(int) and length(), take O(n) rather than O(1) time.
I believe UTF-8 is popular partly because such operations are believed to be rare.
They (UTF, T=Transformation) are better for storing text, as they save space by using a variable number of bytes per character. However, UCS encodings (e.g. UCS-2 or UCS-4) are fast for in-memory operations because they use a fixed character size.
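To make the O(n) point concrete, here is a minimal sketch of counting characters in a UTF-8 string. The name utf8_length is made up for illustration, and the code assumes well-formed UTF-8 and a C++11 compiler. Every byte has to be inspected, whereas std::string::size() only reports the byte count.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that
// does NOT match that pattern starts a new code point.
// Assumes the input is well-formed UTF-8; no validation is done.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {   // O(n): each byte is visited once
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For a fixed-width encoding the same answer is just the element count of the string, which is why UCS-4 is attractive for in-memory work.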
What about representing values with two 16-bit values (that's what I've mentioned above)? (BTW, for reference to those interested, http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf talks about different encodings).
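For reference, the "two 16-bit values" mechanism is what UTF-16 calls a surrogate pair. A minimal sketch of the encoding step follows; to_surrogates is a made-up name, and the code assumes C++11 fixed-width integer types.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: splits a code point in the range U+10000..U+10FFFF
// into a UTF-16 surrogate pair (high surrogate first, low surrogate second).
std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);   // precondition
    std::uint32_t v = cp - 0x10000;            // 20 significant bits remain
    std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));   // top 10 bits
    std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)); // low 10 bits
    return {high, low};
}
```

A string holding such pairs loses the fixed-width property again: indexing by element no longer means indexing by character.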
That is why I would not like to use basic_string<utf8char> in memory, but rather basic_string<ucs4char>; still, I would not generalize this to all possible applications.
In case of program options, I suspect that everything will work for UTF-8 strings. IOW, the library does not care that any given 'char' might be part of some Unicode character. That's why it's attractive for me.
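A small sketch of why byte-oriented parsing is safe for UTF-8: every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII delimiter such as '=' can never appear inside one. The split_option helper below is hypothetical and is not the program_options API; it only illustrates the idea.

```cpp
#include <string>
#include <utility>

// Hypothetical helper (not part of program_options): splits "--name=value"
// at the first '=' byte and returns {name, value}. The search is done on
// raw bytes and never decodes the text, so UTF-8 values pass through intact.
std::pair<std::string, std::string> split_option(const std::string& arg) {
    // Skip a leading "--" if present.
    std::string::size_type start = (arg.compare(0, 2, "--") == 0) ? 2 : 0;
    std::string::size_type eq = arg.find('=', start);
    if (eq == std::string::npos) {
        return {arg.substr(start), std::string()};   // option without a value
    }
    return {arg.substr(start, eq - start), arg.substr(eq + 1)};
}
```

The value comes back as the same bytes that were passed in, and interpreting them as UTF-8 is left to the caller.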
You can expect initialization from (const char * argv []) on all platforms or (const wchar_t * argv []) on Windows in UCS-2. With the basic_string<> you already have support for the parameters from the current locale (char) and for parameters in UCS-2.
What do you mean by 'parameters from the current locale'? I am not sure that ctype::widen is required to care about the user-selected character encoding. Nor do I think it is required of the default codecvt facet. If my locale on Linux is ru_RU.KOI8-R, I don't think the standard requires the codecvt<wchar_t, char> facet in the default instance of 'locale' to do a meaningful conversion from KOI8-R into Unicode.
If we take the option of reading parameters into basic_string<wchar_t> or basic_string<ucs4char_t>, where the character size or encoding is not the same as the native encoding on the command line, there is an affinity to streams. Some shells allow UTF-8 encoded parameters or, more generally, characters outside the current locale. This means that a program can choose how to encode all Unicode characters into char: UTF-7, UTF-8, etc. I would like to have a solution similar to streams: imbue(). With this, you could internally convert every argv[x] using imbue(y) applied to a stringstream, where the facet y is provided by the caller.
That's right. There should be some mechanism to convert from the current 8-bit encoding and the source code encoding into Unicode. At least on Linux there's a mechanism to do the first thing, but I'm not aware of one on Windows.
On the other hand, such a conversion can also be performed by the user. The parameters sent to main() are char* or wchar_t*, and thus program_options can give them back just as they are, in basic_string<char> and basic_string<wchar_t>. The client can use his own facet to imbue a basic_stringstream<> initialized with the parameter from program_options. Or a conversion library could be used to perform the conversion, something like what lexical_cast<> does for type conversions. It is a matter of convenience only: either a separate conversion library (no support for encodings in program_options) or imbue(), performing the conversion inside program_options. However, the conversion should not be implemented for program_options only, which is why I suggested an existing interface: facets.
I agree with basing the mechanism on facets. OTOH, it really can be made orthogonal to program options. So initially program_options can support only ASCII (strict 7-bit) and Unicode, for which conversion is trivial.
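A rough sketch of what the user-side conversion could look like with the standard codecvt facet. Note that it calls the facet directly instead of going through a stringstream: as far as I know, only file stream buffers are specified to apply codecvt, so imbuing a stringstream alone would not convert anything. The name to_wide is made up, and which encodings are really handled depends entirely on the locale passed in.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a narrow string to a wide one using the
// codecvt facet of the locale supplied by the caller.
std::wstring to_wide(const std::string& s, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    // Assumption: the conversion yields at most one wchar_t per input byte.
    std::wstring out(s.size(), L'\0');
    if (s.empty()) {
        return out;
    }

    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    std::codecvt_base::result r = cvt.in(
        state,
        s.data(), s.data() + s.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok) {
        throw std::runtime_error("conversion failed or input incomplete");
    }
    out.resize(static_cast<std::wstring::size_type>(to_next - &out[0]));
    return out;
}
```

A call such as to_wide(argv[1], std::locale("")) would use the user's environment locale; whether that gives a meaningful result for a particular 8-bit encoding is exactly the open question discussed here.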
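The "trivial" 7-bit case could be sketched as below. widen_ascii is a made-up name, and the code assumes that wchar_t uses the same values as ASCII for the first 128 characters, which holds on the usual Unicode-based platforms. Anything outside 7 bits is rejected rather than guessed at.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: widens a strict 7-bit ASCII string to wstring.
// Throws std::invalid_argument on any byte >= 0x80, because its meaning
// depends on an encoding this function knows nothing about.
std::wstring widen_ascii(const std::string& s) {
    std::wstring out;
    out.reserve(s.size());
    for (unsigned char byte : s) {
        if (byte > 0x7F) {
            throw std::invalid_argument("non-ASCII byte in input");
        }
        out.push_back(static_cast<wchar_t>(byte));
    }
    return out;
}
```

Everything beyond that would then be the job of a separate facet or conversion library.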
Or did I miss something? Is something like this part of boost already?
Nope :-( Even UTF-8 encoder is not in boost yet.
You can find some converting facets for UTF-8, ready to imbue() into a stream, in the files section on Yahoo. Unfortunately, they are not finished or not reviewed...
Yea, I can find some facets, including one written by myself ;-( And yea, unfortunately they are not in Boost yet. - Volodya