
On Wed, 19 Jan 2011 11:33:02 +0100 Matus Chochlik <chochlik@gmail.com> wrote:
The string-encoding-related discussion boils down for me to the following: what will string handling in C++ look like in the (maybe not immediate) future?
*Scenario A:*
We will pick a widely-accepted char-based encoding [...] and use that with std::strings which will become the one and only text string 'container' class.
All the wstrings, wxString, Qstrings, utf8strings, etc. will be abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out with the help of convenience classes like ansi_str_t and ucs2_t that will be made obsolete and finally dropped (after the transition).
Sounds like a little slice of heaven to me. Though you'll still have the pesky problem of having to verify that the UTF-8 code is valid all the time. More on that below.
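To make that verification burden concrete, here is a minimal structural check of the kind a library would have to run on untrusted byte sequences. This is a sketch, not any library's actual validator; a production validator would also have to reject overlong encodings and surrogate code points, which this one does not:

```cpp
#include <cstddef>
#include <string>

// Minimal UTF-8 structural check (illustrative helper, not a proposed
// interface): the lead byte determines the sequence length, and every
// trailing byte must have the form 10xxxxxx.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        if (c < 0x80)                len = 1;  // 0xxxxxxx: ASCII
        else if ((c & 0xE0) == 0xC0) len = 2;  // 110xxxxx
        else if ((c & 0xF0) == 0xE0) len = 3;  // 1110xxxx
        else if ((c & 0xF8) == 0xF0) len = 4;  // 11110xxx
        else return false;                     // stray continuation or invalid lead
        if (i + len > s.size()) return false;  // truncated sequence at end of data
        for (std::size_t j = 1; j < len; ++j)  // each trailer must be 10xxxxxx
            if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80)
                return false;
        i += len;
    }
    return true;
}
```

Note that this has to walk every byte; that per-call cost is exactly what makes "validate once, then trust the type" attractive.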
*Scenario B:*
We will add yet another string class named utf8_t to the already crowded set named above. [...] Now an application using libraries [a..z] will become the developer's nightmare. What string should he use for class members and constructor parameters, and what should he do when the conversions do not work so seamlessly?
How is that different from what we've got today, except that the utf*_t classes will make converting to and from different string types, and validating the UTF code, a little easier and more automatic?
Also, half of the CPU time assigned to running that application will be wasted on useless string transcoding, and half of the memory will be occupied with useless transcoding-related code and data.
I think that's a bit of an exaggeration. :-) As more libraries move to the assumption that std::string == UTF-8, the need (and code) for transcoding will silently vanish. Eventually, utf8_t will just be a statement by the programmer that the data contained within is guaranteed to be valid UTF-8, enforced by the class -- something that would require at minimum an extra call if using std::string, one that could be forgotten and open up the program to exploits.
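A hypothetical sketch of that guarantee: a utf8_t whose class invariant is "the bytes inside are valid UTF-8", checked once at the boundary instead of being every caller's responsibility. Only the name comes from the discussion; everything else here is illustrative, and the compact validator stands in for whatever the real class would use (a full one would also reject overlong forms and surrogates):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Illustrative utf8_t: construction validates, so code holding a utf8_t
// can rely on the bytes being well-formed UTF-8 without re-checking.
class utf8_t {
public:
    explicit utf8_t(const std::string& bytes) : data_(bytes) {
        if (!valid(data_))
            throw std::invalid_argument("utf8_t: not valid UTF-8");
    }
    const std::string& bytes() const { return data_; }  // valid by invariant
private:
    // Compact structural check: lead byte gives the length, trailers
    // must be 10xxxxxx.
    static bool valid(const std::string& s) {
        for (std::size_t i = 0; i < s.size();) {
            unsigned char c = s[i];
            int n = c < 0x80 ? 1
                  : (c >> 5) == 0x06 ? 2
                  : (c >> 4) == 0x0E ? 3
                  : (c >> 3) == 0x1E ? 4 : 0;
            if (n == 0 || i + n > s.size()) return false;
            for (int j = 1; j < n; ++j)
                if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80)
                    return false;
            i += n;
        }
        return true;
    }
    std::string data_;
};
```

The design point is that the check runs exactly once, at construction; forgetting it is impossible rather than merely inadvisable.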
*Scenario C:*
This is basically the status quo; a mix of the above. A sad and unsatisfactory state of things.
Agreed.
*Consequences of A:*
[...] - Once we overcome the troubled period of transition everything will be just great. No headaches related to file encoding detection and transcoding.
It's the getting-there part that I'm concerned about.
Think about what will happen after we accept IPv6 and drop IPv4. The process will be painful, but after it is done there will be no more NAT and company, and the whole network infrastructure will be simplified.
That's a problem I've been watching carefully for many years now, and I don't see that happening. ISPs will switch to IPv6 (because they have to), and make it possible for their customers to stay on IPv4, so their customers *will* stay on IPv4 because it's cheaper. And if they stay with IPv4, there won't be any impetus for consumer electronics companies to make their equipment IPv6-compatible because consumers won't care about it. Without consumer demand, it won't get done for years, maybe a decade or more. That's what I see happening with std::string and UTF-8 as well.
*Consequences of B:*
- No fixing of existing interfaces, which IMO means no, or very slow, movement to a single encoding.
Which, as stated above, I believe will happen anyway.
- Creating another string class which, let us face it, not everybody will accept even with Boost's influence, unless it becomes standard.
That's the beauty of it -- not everybody *needs* to accept it. Just the people who write code that isn't encoding-agnostic. Boost.FileSystem might provide a utf16_t overload for Windows, for instance, so that strings in the other UTF types can be converted automatically. But I see no reason it would lose the existing interface.
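A hypothetical sketch of that overload idea: the existing narrow-string interface stays exactly as it is, and a UTF-16-aware overload is added beside it for platforms whose native APIs want UTF-16. Both utf16_t and describe_path are invented names for illustration here, not Boost.FileSystem's actual API:

```cpp
#include <string>

// Stand-in for a validated UTF-16 string class like the utf16_t
// discussed above (illustrative only).
struct utf16_t { std::u16string data; };

// The existing narrow interface is untouched...
std::string describe_path(const std::string&) { return "narrow overload"; }

// ...and a UTF-16 overload is simply added beside it; internally it
// could convert to whatever the platform's native encoding is.
std::string describe_path(const utf16_t&)     { return "utf16 overload"; }
```

Callers who never touch the new type never pay for it; overload resolution picks the right entry point silently.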
- We will abandon std::string and be stuck with utf8_t which I *personally* already dislike :)
Any technical reason why, other than what you've already written?
- People will probably start to use other programming languages (although this may be FUD)
I hate to point this out, but people are *already* using other programming languages. :-) C++ isn't new or sexy, and has some pain-points (though many of the most egregious ones will be solved with C++0x). Unicode handling is one of them, and in my opinion, the utf*_t types will only ease that.
*Note on the encoding to be used*
The best candidate for the widely-accepted and extensible encoding vaguely mentioned above is IMO UTF-8. [...]
Apparently a growing number of people agree, as do I.
- It is extensible, so once we have done the painful transition we will not have to do it again. Currently UTF-8 uses 1-4 (or 1-6) byte sequences to encode code points, but the scheme is transparently extensible to 1-N bytes (unlike UCS-X, and I'm not sure about UTF-16/32). [...]
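As a rough sketch of that extensibility point: each trailing byte in a UTF-8 sequence contributes six payload bits, so the reachable code-point range grows mechanically with sequence length, and longer lead-byte prefixes could in principle continue the same pattern. The function below just restates well-known UTF-8 arithmetic; it is not from any particular library:

```cpp
// Largest code point expressible by a UTF-8 sequence of the given byte
// length under the current scheme. The 4-byte form can mechanically
// reach 0x1FFFFF, but Unicode is capped at 0x10FFFF to stay within
// what UTF-16 surrogate pairs can address.
unsigned long max_codepoint_for_length(int len) {
    switch (len) {
        case 1: return 0x7F;      // 0xxxxxxx: 7 payload bits
        case 2: return 0x7FF;     // 110xxxxx + 1 trailer: 5 + 6 bits
        case 3: return 0xFFFF;    // 1110xxxx + 2 trailers: 4 + 6 + 6 bits
        case 4: return 0x1FFFFF;  // 11110xxx + 3 trailers: 3 + 6 + 6 + 6 bits
        default: return 0;        // longer forms would follow the same pattern
    }
}
```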
UTF-16 can't be extended any further than its current definition, not without a major reinterpretation. UTF-32 (and UTF-8) could go up to 0xFFFFFFFF codepoints, but the standards bodies involved have agreed that they'll never be extended past the current UTF-16 limitations. That's subject to change if circumstances change, of course, but nobody foresees such a change right now.

-- 
Chad Nelson
Oak Circle Software, Inc.