
On Wed, Jan 19, 2011 at 2:39 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Wed, 19 Jan 2011 11:33:02 +0100 Matus Chochlik <chochlik@gmail.com> wrote:
*Scenario A:*
Sounds like a little slice of heaven to me. Though you'll still have the pesky problem of having to verify that the UTF-8 code is valid all the time. More on that below.
I am a believer ;) and when people realize that UTF-8 is the way to go, the pesky problems will vanish. Believe me: today, with ANSI, I have to check/detect the encoding of input files created by users on different Windows machines and do the conversions. And checking if data is valid UTF-8 is IMO an easier task. Most people here use Windows-1252, which is not so different from ASCII, so even if something gets garbled it can be rescued. I can't imagine what it is like in countries that have to deal with Semitic languages, Chinese/Japanese/Korean ideograms, etc.
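To make the "checking if data is valid UTF-8" point concrete, here is a minimal sketch of such a check. The function name is_valid_utf8 is made up for illustration and is not taken from any library discussed in this thread; it assumes the rules of RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from Boost or the proposed utf*_t classes):
// returns true if `s` is well-formed UTF-8 in the RFC 3629 sense.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;       // total bytes in this sequence
        unsigned long cp;      // code point decoded so far
        unsigned long min_cp;  // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;               // stray continuation or invalid lead byte
        if (len > n - i) return false;   // sequence truncated at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // outside Unicode range
        i += len;
    }
    return true;
}

// Usage: the same word is accepted as UTF-8 and rejected as Windows-1252.
void demo_validation() {
    assert(is_valid_utf8("caf\xC3\xA9"));  // "cafe" with e-acute, UTF-8 bytes
    assert(!is_valid_utf8("caf\xE9"));     // same word, Windows-1252 bytes
}
```

A check like this only tells you the bytes are well-formed UTF-8. It does not identify which legacy code page a rejected file was written in, which is the harder detection problem described above.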
*Scenario B:*
How is that different from what we've got today, except that the utf*_t classes will make converting to and from different string types, and validating the UTF code, a little easier and more automatic?
Exactly, and I think that we agree that the current status is far from ideal. The automatic conversions would (probably) be OK but introducing yet another string class is not.
Also half of the cpu time assigned to running that application will be wasted on useless string transcoding. And half of the memory will be occupied with useless transcoding-related code and data.
I think that's a bit of an exaggeration. :-)
Yes, sorry I could not resist :)
As more libraries move to the assumption that std::string == UTF-8, the need (and code) for transcoding will silently vanish. Eventually, utf8_t will just be a statement by the programmer that the data contained within is guaranteed to be valid UTF-8, enforced by the class -- something that would require at minimum an extra call if using std::string, one that could be forgotten and open up the program to exploits.
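As a rough illustration of "guaranteed to be valid UTF-8, enforced by the class", here is a minimal sketch. The class name utf8_string, its constructor and its str() accessor are all assumptions made up for this example; the actual utf8_t interface is not shown in this thread and may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the proposed utf8_t, for illustration only.
class utf8_string {
public:
    // Validation happens once, here, so callers cannot forget it.
    explicit utf8_string(const std::string& bytes) : data_(bytes) {
        if (!well_formed(data_))
            throw std::invalid_argument("input is not valid UTF-8");
    }

    // Read-only access keeps the "always valid" invariant intact.
    const std::string& str() const { return data_; }

private:
    // Well-formedness check following RFC 3629 (assumed rule set).
    static bool well_formed(const std::string& s) {
        std::size_t i = 0;
        while (i < s.size()) {
            const unsigned char b0 = static_cast<unsigned char>(s[i]);
            std::size_t len;
            unsigned long cp, min_cp;
            if (b0 < 0x80)                { len = 1; cp = b0;        min_cp = 0; }
            else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
            else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
            else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
            else return false;                       // invalid lead byte
            if (len > s.size() - i) return false;    // truncated sequence
            for (std::size_t k = 1; k < len; ++k) {
                const unsigned char b = static_cast<unsigned char>(s[i + k]);
                if ((b & 0xC0) != 0x80) return false;
                cp = (cp << 6) | (b & 0x3F);
            }
            // Reject overlong forms, surrogates and out-of-range values.
            if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
                return false;
            i += len;
        }
        return true;
    }

    std::string data_;
};

// Usage: a function taking utf8_string never needs to re-validate its input.
void demo_wrapper() {
    utf8_string ok("caf\xC3\xA9");             // valid UTF-8, accepted
    assert(ok.str().size() == 5);
    bool threw = false;
    try { utf8_string bad("caf\xE9"); }        // Windows-1252 bytes, rejected
    catch (const std::invalid_argument&) { threw = true; }
    assert(threw);
}
```

With plain std::string the same guarantee depends on every caller remembering to run the check, which is the "extra call that could be forgotten" mentioned above.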
Yes, but why not enforce it "organizationally", with the power and influence Boost has? Again, I know that it would break a lot of stuff, but are all those people who now use std::string really ready to change all their code to use utf8_t instead? Which will involve more work? I'm convinced that it will be the latter, but I can be wrong. And many people already *do* use std::string for UTF-8 and are doing the "right" (sorry :)) thing; by introducing utf8_t we are "punishing" them because we want them, for the sake of people who still dwell on ANSI, to change their code. IMO we should do the opposite.
[...] - Once we overcome the troubled period of transition everything will be just great. No headaches related to file encoding detection and transcoding.
It's the getting-there part that I'm concerned about.
Me too, but again many other people already pointed out that a large portion of the code is completely encoding agnostic so there would be no impact if we stayed with std::string. There would be, if we add utf8_t.
Think about what will happen after we adopt IPv6 and drop IPv4. The process will be painful, but after it is done there will be no more NAT and co., and the whole network infrastructure will be simplified.
That's a problem I've been watching carefully for many years now, and I don't see that happening. ISPs will switch to IPv6 (because they have to), and make it possible for their customers to stay on IPv4, so their customers *will* stay on IPv4 because it's cheaper. And if they stay with IPv4, there won't be any impetus for consumer electronics companies to make their equipment IPv6-compatible because consumers won't care about it. Without consumer demand, it won't get done for years, maybe a decade or more.
That's what I see happening with std::string and UTF-8 as well.
Yes, people (me included) are resistant to big changes, even for the better. But I've learned that I should always consider the long-term impact.
- Creating another string class, which, let us face it, not everybody will accept even with the Boost influence unless it becomes standard.
That's the beauty of it -- not everybody *needs* to accept it. Just the people who write code that isn't encoding-agnostic. Boost.FileSystem might provide a utf16_t overload for Windows, for instance, so that it can automatically convert strings in other UTF types. But I see no reason it would lose the existing interface.
So you suggest that, for example, in the STL there would be a third "ufstream" besides the existing fstream and wfstream? I think that we should actually be reducing the interface, not expanding it (yes, I hear it ... "breaking changes!" :)).
- We will abandon std::string and be stuck with utf8_t which I *personally* already dislike :)
Any technical reason why, other than what you've already written?
Besides the ugly name and that it is a new class? No :)
I hate to point this out, but people are *already* using other programming languages. :-) C++ isn't new or sexy, and has some pain-points (though many of the most egregious ones will be solved with C++0x). Unicode handling is one of them, and in my opinion, the utf*_t types will only ease that.
And the solution is long overdue. Creating utf8_t just puts the problem off; it does not really solve it.