
The string-encoding-related discussion boils down for me to the following: what will string handling in C++ look like in the (maybe not immediate) future?

*Scenario A:* We will pick a widely accepted char-based encoding that is able to handle all the writing scripts and alphabets we can think of, has enough reserved space for future additions or is easily extensible, and use it with std::string, which will become the one and only text string 'container' class. All the wstrings, wxStrings, QStrings, utf8strings, etc. will be abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out with the help of convenience classes like ansi_str_t and ucs2_t that will be made obsolete and finally dropped (after the transition).

*Scenario B:* We will add yet another string class named utf8_t to the already crowded set named above. Then:
- library a: will stick to the ANSI encodings with std::string. It has worked in the past, it will work in the future, right?
- library b[oost]: will use utf8_t instead and provide the (seamless and straightforward) conversions between utf8_t and std::string and std::wstring. Some (many, but not all) others will follow.
- library c: will use std::string with UTF-8 ...
- library [.]n[et]: will use the String class ...
- library q[t]: will use QString ...
- library w[xWidgets]: will use wxString and wxChar* ...
- library wi[napi]: will use TCHAR* ...
- library z: will use const char* in an encoding-agnostic way

Now an application using libraries [a..z] will become the developer's nightmare. Which string should he use for class members and constructor parameters? What should he do when the conversions do not work so seamlessly? Also, half of the CPU time assigned to running that application will be wasted on useless string transcoding, and half of the memory will be occupied by useless transcoding-related code and data.

*Scenario C:* This is basically the status quo; a mix of the above. A sad and unsatisfactory state of things.

*Consequences of A:*
- Interface-breaking changes, which will require some fixing in the library client code and some work in the libraries themselves. These should be made as painless as possible with *temporary* utilities or convenience classes that would, for example, handle the transcoding from UTF-8 to UCS-2/UTF-16 in WINAPI and be no-ops on most POSIX systems.
- Silent introduction of bugs for those who still use std::string for ANSI CP####. This is worse than the above and will require some public-relations work on the part of Boost to make it clear that using std::string with ANSI may be an error since Boost version x.y.z.
- We should finally accept the notion that one byte, word, or dword != one character, that there are code points and there are characters, and that both of them can have variable-length encodings, and we should devise tools to handle them as such conveniently (see the sketch after this list).
- Once we overcome the troubled period of transition, everything will be just great: no more headaches related to file encoding detection and transcoding. Think about what will happen once we accept IPv6 and drop IPv4. The process will be painful, but after it is done there will be no more NAT and co., and the whole network infrastructure will be simplified.
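To make the code-point bullet above concrete, here is a minimal sketch (my illustration only, not a proposed interface) of walking the code points of a UTF-8 encoded std::string. One std::string element is a byte, not a character, and the lead byte alone announces how many bytes belong to the current code point:

    #include <cstddef>
    #include <string>
    #include <vector>

    // Illustration only: iterate the code points of a UTF-8 string.
    // Invalid sequences become U+FFFD; real tools need stricter checks
    // (overlong forms, surrogates, out-of-range values).
    std::vector<unsigned long> code_points(const std::string& utf8)
    {
        std::vector<unsigned long> out;
        std::size_t i = 0;
        while (i < utf8.size())
        {
            unsigned char lead = static_cast<unsigned char>(utf8[i]);
            std::size_t len;
            unsigned long cp;
            if (lead < 0x80)                { len = 1; cp = lead; }        // 0xxxxxxx
            else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; } // 110xxxxx
            else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; } // 1110xxxx
            else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; } // 11110xxx
            else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

            if (i + len > utf8.size()) { out.push_back(0xFFFD); break; }

            bool ok = true;
            for (std::size_t j = 1; j < len; ++j)  // 10xxxxxx continuation bytes
            {
                unsigned char c = static_cast<unsigned char>(utf8[i + j]);
                if ((c & 0xC0) != 0x80) { ok = false; break; }
                cp = (cp << 6) | (c & 0x3F);
            }
            out.push_back(ok ? cp : 0xFFFD);
            i += ok ? len : 1;
        }
        return out;
    }

Whatever tools we devise would hide this loop behind a code-point (or grapheme) iterator so that nobody indexes bytes directly. Note also that this only gets us to code points; combining sequences mean that one 'character' may still span several of them.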
*Consequences of B:*
- No fixing of existing interfaces, which IMO means no, or only very slow, movement toward a single encoding.
- Creating another string class which, let us face it, not everybody will accept even with Boost's influence, unless it becomes standard.
- We will abandon std::string and be stuck with utf8_t, which I *personally* already dislike :)
- People will probably start to use other programming languages (although this may be FUD).

*Consequences of C:* Here, pick all the negatives of the above :)

*Note on the encoding to be used:* The best candidate for the widely accepted and extensible encoding vaguely mentioned above is IMO UTF-8.
- It has been given a lot of thought.
- It is an already widely accepted standard.
- It is char-based, so there is no need to switch to std::basic_string<whatever_char_t>.
- It is extensible, so once we have done the painful transition we will not have to do it again. Currently UTF-8 uses 1-4 (or 1-6) byte sequences to encode code points, but the lead-byte scheme is transparently extensible to 1-N bytes (unlike UCS-X; I am not sure about UTF-16/32) - see the P.S. below.

So, [dark-sarcasm] even if we dig out the Stargate, or join the United Federation of Planets and Captain Kirk, every time he returns home, brings a truckload of new writing scripts to support, UTF-8 will be able to handle it. Just my 0.02 strips of gold-pressed latinum :) [/dark-sarcasm]

Best regards, Matus
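P.S. To back the extensibility claim with something concrete, here is the encoding direction as a minimal sketch (again only an illustration, assuming cp <= 0x10FFFF per the current Unicode/RFC 3629 limit). The run of high one-bits in the lead byte announces the sequence length, and nothing in the mechanism stops at four bytes; the old 5- and 6-byte forms follow the same rule:

    #include <string>

    // Illustration only: encode one code point as UTF-8.
    // Assumes cp <= 0x10FFFF; larger values would need the 5- and
    // 6-byte forms of the original scheme.
    std::string encode_utf8(unsigned long cp)
    {
        std::string out;
        if (cp < 0x80)             // 1 byte:  0xxxxxxx
            out += static_cast<char>(cp);
        else if (cp < 0x800)       // 2 bytes: 110xxxxx 10xxxxxx
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else                       // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return out;
    }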