Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]

19 Jan 2011

On Wed, Jan 19, 2011 at 2:39 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
...
On Wed, 19 Jan 2011 11:33:02 +0100
Matus Chochlik <chochlik@gmail.com> wrote:
...
*Scenario A:*
Sounds like a little slice of heaven to me. Though you'll still have
the pesky problem of having to verify that the UTF-8 code is valid all
the time. More on that below.
I am a believer ;) and when people realize that UTF-8 is the way to
go, the pesky problems will vanish. Believe me today with ANSI

Today I have to check/detect the encoding of input files created
by users on different Windows machines and do the conversions.
Checking whether data is valid UTF-8 is IMO an easier task.

Most people here use Windows-1252, which is not so different
from ASCII, so even if something gets garbled it can be rescued.
I can't imagine what it is like in countries that have to deal with
Semitic languages, Chinese/Japanese/Korean ideograms, etc.
...
...
*Scenario B:*
How is that different from what we've got today, except that the utf*_t
classes will make converting to and from different string types, and
validating the UTF code, a little easier and more automatic?
Exactly, and I think that we agree that the current status is far from
ideal. The automatic conversions would (probably) be OK but
introducing yet another string class is not.
...
...
Also half of the cpu time assigned to running that application will
be wasted on useless string transcoding. And half of the memory will
be occupied with useless transcoding-related code and data.
I think that's a bit of an exaggeration. :-) As more libraries move to
Yes, sorry I could not resist :)
...
the assumption that std::string == UTF-8, the need (and code) for
transcoding will silently vanish. Eventually, utf8_t will just be a
statement by the programmer that the data contained within is
guaranteed to be valid UTF-8, enforced by the class -- something that
would require at minimum an extra call if using std::string, one that
could be forgotten and open up the program to exploits.
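
To make the trade-off in the paragraph above concrete, here is a minimal
sketch of both options: a free function that a std::string user would have
to remember to call, and a wrapper whose constructor makes that call
unavoidable. The names is_valid_utf8, utf8_t and str() are illustrative
assumptions only; this is not the interface of the utf*_t classes under
discussion, and it assumes a C++0x compiler for std::move.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Rejects stray continuation bytes, truncated and overlong sequences,
// UTF-16 surrogates, and code points above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;       // total bytes in this sequence
        unsigned long cp;      // code point being decoded
        unsigned long min_cp;  // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min_cp = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;              // continuation byte or invalid lead byte
        if (len > n - i) return false;  // sequence runs past the end of the data
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
        i += len;
    }
    return true;
}

// Hypothetical wrapper: an instance can only exist if its bytes passed
// validation, so code receiving a utf8_t need not check again.
class utf8_t {
public:
    utf8_t() {}  // the empty string is trivially valid
    explicit utf8_t(std::string bytes) : data_(std::move(bytes)) {
        if (!is_valid_utf8(data_))
            throw std::invalid_argument("utf8_t: input is not valid UTF-8");
    }
    const std::string& str() const { return data_; }  // read-only access
private:
    std::string data_;
};
```

With plain std::string the caller must write `if (!is_valid_utf8(s)) ...`
at every trust boundary; with the wrapper, `utf8_t checked(s);` either
succeeds or throws. Whether that guarantee is worth a new type is exactly
the point being debated here.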
Yes, but why not enforce it "organizationally" with the power
and influence Boost has? Again, I know that it would break a lot
of stuff, but are all those people who now use std::string really ready
to change all their code to use utf8_t instead? Which will involve
more work? I'm convinced that it will be the latter, but I could be wrong.

And many people already *do* use std::string for UTF-8 and are
doing the "right" (sorry :)) thing. By introducing utf8_t we are "punishing"
them, because we want them to change their code for the sake of people
who still dwell on ANSI. IMO we should do the opposite.
...
...
[...] - Once we overcome the troubled period of transition everything
will be just great. No headaches related to file encoding detection
and transcoding.
It's the getting-there part that I'm concerned about.
Me too, but again, many other people have already pointed out
that a large portion of the code is completely encoding-agnostic,
so there would be no impact if we stayed with std::string. There
would be if we added utf8_t.
...
...
Think about what will happen after we accept IPv6 and drop IPv4. The
process will be painful, but after it is done there will be no more
NAT and co., and the whole network infrastructure will be simplified.
That's a problem I've been watching carefully for many years now, and I
don't see that happening. ISPs will switch to IPv6 (because they have
to), and make it possible for their customers to stay on IPv4, so their
customers *will* stay on IPv4 because it's cheaper. And if they stay
with IPv4, there won't be any impetus for consumer electronics
companies to make their equipment IPv6-compatible because consumers
won't care about it. Without consumer demand, it won't get done for
years, maybe a decade or more.
That's what I see happening with std::string and UTF-8 as well.
Yes, people (me included) are resistant to big changes, even for the better.
But I've learned that I should always consider the long-term impact.
...
...
- Creating another string class, which, let us face it, not everybody
will accept even with the Boost influence unless it becomes standard.
That's the beauty of it -- not everybody *needs* to accept it. Just the
people who write code that isn't encoding-agnostic. Boost.FileSystem
might provide a utf16_t overload for Windows, for instance, so that it
can automatically convert strings in other UTF types. But I see no
reason it would lose the existing interface.
So you suggest that in the STL, for example, there would be
a third "ufstream" besides the existing fstream and wfstream?
I think that we should actually be reducing the interface,
not expanding it (yes, I hear it ... "breaking changes!" :)).
...
...
- We will abandon std::string and be stuck with utf8_t which I
*personally* already dislike :)
Any technical reason why, other than what you've already written?
Besides the ugly name and the fact that it is a new class? No :)
...
I hate to point this out, but people are *already* using other
programming languages. :-) C++ isn't new or sexy, and has some
pain-points (though many of the most egregious ones will be solved with
C++0x). Unicode handling is one of them, and in my opinion, the utf*_t
types will only ease that.
And the solution is long overdue. Creating utf8_t is just putting
the problem off, not really solving it.