
On Wed, 19 Jan 2011 15:08:06 +0100 Matus Chochlik <chochlik@gmail.com> wrote:
How is that different from what we've got today, except that the utf*_t classes will make converting to and from different string types, and validating the UTF code, a little easier and more automatic?
Exactly, and I think that we agree that the current status is far from ideal. The automatic conversions would (probably) be OK but introducing yet another string class is not.
Do you see another way to provide those conversions, and automatic verification of proper UTF coding? (Automatic verification is a very good thing, without it someone won't use it or will forget to, and open up their programs to exploitation.)
the assumption that std::string == UTF-8, the need (and code) for transcoding will silently vanish. Eventually, utf8_t will just be a statement by the programmer that the data contained within is guaranteed to be valid UTF-8, enforced by the class -- something that would require at minimum an extra call if using std::string, one that could be forgotten and open up the program to exploits.
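A minimal sketch of what "enforced by the class" could look like, assuming a hypothetical utf8_t that validates once, in its constructor. The helper name is_valid_utf8, the str() accessor, and the choice to throw std::invalid_argument are illustrative assumptions, not the design actually proposed in this thread.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: true if `s` is well-formed UTF-8. Rejects overlong
// forms, surrogate code points, and values above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }            // single-byte (ASCII)
        std::size_t len = 0;
        unsigned long cp = 0;   // code point being decoded
        unsigned long min = 0;  // smallest value this length may encode
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false;                          // stray continuation or invalid lead byte
        if (n - i < len) return false;              // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}

// Hypothetical utf8_t: holding an instance implies the bytes were checked.
class utf8_t {
public:
    explicit utf8_t(std::string s) : data_(std::move(s)) {
        if (!is_valid_utf8(data_))
            throw std::invalid_argument("malformed UTF-8");
    }
    const std::string& str() const { return data_; }
private:
    std::string data_;
};
```

With a plain std::string, the equivalent guarantee needs a separate is_valid_utf8 call at every entry point, which is the call that could be forgotten.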
Yes, but why not enforce it "organizationally" with the power and influence Boost has? Again, I know that it would break a lot of stuff, but really, are all those people who now use std::string ready to change all their code to use utf8_t instead? Which will involve more work? I'm convinced that it will be the latter, but I can be wrong.
If Boost comes out with a version that breaks existing programs, companies just won't upgrade to it. I can keep one of the companies that mine works with upgrading, because the group that I work with is the only one there using C++ and they listen to me, but most companies have a lot more invested in the existing system. Believe me, any breaking changes have to be eased in over many versions -- the "boiling a frog" approach. :-)
And many people already *do* use std::string for UTF-8 and are doing the "right" (sorry :)) thing. By introducing utf8_t we are "punishing" them, because we want them to change their code for the sake of people who still dwell on ANSI. IMO we should do the opposite.
If they're already using UTF-8 strings, then we provide something like BOOST_ALL_STD_STRINGS_ARE_UTF8 that they can define. The utf*_t classes configure themselves to accept std::strings as UTF-8-encoded, and any changes are completely transparent to those people. No punishment involved. For everyone else, we introduce the utf*_t API alongside the std::string one, for those classes and functions that are not encoding-agnostic. The std::string one can be deprecated in future versions if the library author desires. Again, no punishment involved.
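One way such a switch might work, as a rough sketch: the macro name BOOST_ALL_STD_STRINGS_ARE_UTF8 comes from the message above, but everything else here (the UTF8_T_FROM_STD_STRING helper macro, the byte_length function, and the idea of toggling between an implicit and an explicit constructor) is an assumption made for illustration. The post does not say how the classes would configure themselves, and validation is left out to keep the example short.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Normally set on the compiler command line by a project whose
// std::strings already hold UTF-8; defined here only for illustration.
#define BOOST_ALL_STD_STRINGS_ARE_UTF8

#ifdef BOOST_ALL_STD_STRINGS_ARE_UTF8
#  define UTF8_T_FROM_STD_STRING            // implicit conversion allowed
#else
#  define UTF8_T_FROM_STD_STRING explicit   // caller must convert by hand
#endif

// Hypothetical utf8_t (validation omitted in this sketch).
class utf8_t {
public:
    UTF8_T_FROM_STD_STRING utf8_t(std::string s) : data_(std::move(s)) {}
    const std::string& str() const { return data_; }
private:
    std::string data_;
};

// Hypothetical encoding-aware library function taking the new type.
inline std::size_t byte_length(const utf8_t& text) { return text.str().size(); }

// With the macro defined, existing std::string call sites compile unchanged.
static_assert(std::is_convertible<std::string, utf8_t>::value,
              "std::string converts implicitly when the macro is defined");
```

Under this assumption, code that already passes UTF-8 in std::string keeps calling byte_length(some_std_string) with no source changes, while projects that leave the macro undefined have to write utf8_t(some_std_string) and so state the encoding at the call site.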
[...] - Once we overcome the troubled period of transition everything will be just great. No headaches related to file encoding detection and transcoding.
It's the getting-there part that I'm concerned about.
Me too, but again many other people already pointed out that a large portion of the code is completely encoding agnostic so there would be no impact if we stayed with std::string. There would be, if we add utf8_t.
Those portions of the code that are encoding-agnostic can continue using std::string, and nothing changes. It's only the functions that need to know the encoding that would change, and that change can be gradual.
Think about what will happen after we accept IPv6 and drop IPv4. The process will be painful, but after it is done there will be no more NAT and co., and the whole network infrastructure will be simplified.
That's a problem I've been watching carefully for many years now, and I don't see that happening. [...]
Yes, people (me included) are resistant to big changes, even for the better. But I've learned that I should always consider the long-term impact.
As have I. :-) I think the design I'm proposing is low-impact enough that people will adopt it. Slowly, but they will.
That's the beauty of it -- not everybody *needs* to accept it. Just the people who write code that isn't encoding-agnostic. Boost.FileSystem might provide a utf16_t overload for Windows, for instance, so that it can automatically convert strings in other UTF types. But I see no reason it would lose the existing interface.
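A sketch of how a utf16_t overload could sit next to an existing std::string interface and convert from another UTF type automatically. All names here (utf8_to_utf16, path_length, the two wrapper classes) are hypothetical and are not Boost.FileSystem APIs. The converter assumes its input is already valid UTF-8, and the wrappers skip validation for brevity.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical utf8_t (validation omitted in this sketch).
class utf8_t {
public:
    explicit utf8_t(std::string s) : data_(std::move(s)) {}
    const std::string& str() const { return data_; }
private:
    std::string data_;
};

// Re-encodes UTF-8 (assumed already validated) as UTF-16.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (c < 0x80)      { cp = c;        len = 1; }
        else if (c < 0xE0) { cp = c & 0x1F; len = 2; }
        else if (c < 0xF0) { cp = c & 0x0F; len = 3; }
        else               { cp = c & 0x07; len = 4; }
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical utf16_t that accepts a utf8_t implicitly and transcodes it.
class utf16_t {
public:
    explicit utf16_t(std::u16string s) : data_(std::move(s)) {}
    utf16_t(const utf8_t& s) : data_(utf8_to_utf16(s.str())) {}
    const std::u16string& str() const { return data_; }
private:
    std::u16string data_;
};

// Existing interface: untouched, counts bytes of whatever it is given.
inline std::size_t path_length(const std::string& path) { return path.size(); }

// New overload next to it: counts UTF-16 code units.
inline std::size_t path_length(const utf16_t& path) { return path.str().size(); }
```

Callers that pass std::string keep hitting the first overload. A caller holding a utf8_t reaches the second one through the converting constructor, so the transcoding happens without an explicit call.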
So you suggest that in the STL there would be, for example, a third "ufstream" besides the existing fstream and wfstream? I think that we actually should be reducing the interface, not expanding it (yes, I hear it ... "breaking changes!" :)).
I don't expect that the utf*_t classes will make it into the standard. They definitely won't make it into the now-misnamed C++0x standard, and it'll likely be another ten years before another one is hashed out -- by then, the UTF-8 conversion should be complete, so there will be no need for it, except possibly to confirm that a string isn't malformed.
- We will abandon std::string and be stuck with utf8_t which I *personally* already dislike :)
Any technical reason why, other than what you've already written?
Besides the ugly name and that it is a new class? No :)
If you can think of a more-acceptable-but-still-descriptive name for it, I'm all ears. :-)
I hate to point this out, but people are *already* using other programming languages. :-) C++ isn't new or sexy, and has some pain-points (though many of the most egregious ones will be solved with C++0x). Unicode handling is one of them, and in my opinion, the utf*_t types will only ease that.
And the solution is long overdue. And creating utf8_t is just putting the problem off, not really solving it.
I see it as merely easing the transition.
--
Chad Nelson
Oak Circle Software, Inc.