Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]

19 Jan 2011


On Wed, 19 Jan 2011 15:08:06 +0100
Matus Chochlik <chochlik@gmail.com> wrote:
...
...
How is that different from what we've got today, except that the
utf*_t classes will make converting to and from different string
types, and validating the UTF code, a little easier and more
automatic?
Exactly, and I think that we agree that the current status is far from
ideal. The automatic conversions would (probably) be OK but
introducing yet another string class is not.
Do you see another way to provide those conversions, and automatic
verification of proper UTF coding? (Automatic verification is a very
good thing; without it, someone will skip the check or forget it, and
open up their programs to exploitation.)
...
...
the assumption that std::string == UTF-8, the need (and code) for
transcoding will silently vanish. Eventually, utf8_t will just be a
statement by the programmer that the data contained within is
guaranteed to be valid UTF-8, enforced by the class -- something that
would require at minimum an extra call if using std::string, one that
could be forgotten and open up the program to exploits.
Yes, but why not enforce it "organizationally", with the power
and influence Boost has? Again, I know that it would break a lot
of stuff, but are all those people who now use std::string really
ready to change all their code to use utf8_t instead? Which will
involve more work? I'm convinced that it will be the latter, but I
could be wrong.
If Boost comes out with a version that breaks existing programs,
companies just won't upgrade to it. I can keep one of the companies
that mine works with upgrading, because the group that I work with is
the only one there using C++ and they listen to me, but most companies
have a lot more invested in the existing system. Believe me, any
breaking changes have to be eased in over many versions -- the "boiling
a frog" approach. :-)
...
And many people already *do* use std::string for UTF-8 and are doing
the "right" (sorry :)) thing. By introducing utf8_t we are
"punishing" them, because we want them, for the sake of people who
still dwell on ANSI, to change their code. IMO we should do the
opposite.
If they're already using UTF-8 strings, then we provide something like
BOOST_ALL_STD_STRINGS_ARE_UTF8 that they can define. The utf*_t classes
configure themselves to accept std::strings as UTF-8-encoded, and any
changes are completely transparent to those people. No punishment
involved.
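
To make that concrete, here's a rough sketch of how such an opt-in could
look. Only the names utf8_t and BOOST_ALL_STD_STRINGS_ARE_UTF8 come from
this thread; is_valid_utf8, UTF8_SKETCH_EXPLICIT, and the choice to make
the macro switch the constructor between explicit and implicit are my
assumptions for illustration, not a settled design.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Assumption: with the opt-in macro defined, std::string converts to
// utf8_t implicitly; without it, the conversion must be written out.
#ifdef BOOST_ALL_STD_STRINGS_ARE_UTF8
#  define UTF8_SKETCH_EXPLICIT
#else
#  define UTF8_SKETCH_EXPLICIT explicit
#endif

// Hypothetical helper: true if s is well-formed UTF-8 (rejects overlong
// forms, surrogates, truncated sequences, and values above U+10FFFF).
inline bool is_valid_utf8(const std::string& s) {
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;             // stray continuation or bad lead byte
        if (i + len > n) return false; // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp[len]) return false;             // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        if (cp > 0x10FFFF) return false;                // out of range
        i += len;
    }
    return true;
}

// Sketch of the proposed class: holding one means the bytes were checked.
class utf8_t {
public:
    UTF8_SKETCH_EXPLICIT utf8_t(const std::string& bytes) : data_(bytes) {
        if (!is_valid_utf8(data_))
            throw std::invalid_argument("utf8_t: malformed UTF-8");
    }
    const std::string& str() const { return data_; }
private:
    std::string data_;
};
```

Someone whose std::strings already hold UTF-8 would build with the macro
defined and pass them straight to anything taking a utf8_t; everyone
else writes utf8_t(s) at the boundary. In this sketch the bytes are
checked either way, which is my choice, not something decided here.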

For everyone else, we introduce the utf*_t API alongside the
std::string one, for those classes and functions that are not
encoding-agnostic. The std::string one can be deprecated in future
versions if the library author desires. Again, no punishment involved.
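
As an illustration of the side-by-side API, here is a minimal sketch.
The function name char_count is made up for the example, and the utf8_t
shown is a bare stand-in that skips the validation the real class would
do.

```cpp
#include <cstddef>
#include <string>

// Stand-in for the proposed class; a real utf8_t would validate the
// bytes on construction. Validation is left out here for brevity.
class utf8_t {
public:
    explicit utf8_t(const std::string& bytes) : data_(bytes) {}
    const std::string& str() const { return data_; }
private:
    std::string data_;
};

// Existing interface, unchanged: the encoding is unknown, so the only
// thing it can safely count is bytes.
inline std::size_t char_count(const std::string& s) {
    return s.size();
}

// New overload alongside it: the encoding is known to be UTF-8, so it
// can count code points by skipping continuation bytes (10xxxxxx).
inline std::size_t char_count(const utf8_t& s) {
    std::size_t n = 0;
    const std::string& bytes = s.str();
    for (std::string::size_type i = 0; i < bytes.size(); ++i) {
        if ((static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

Existing callers that pass a std::string keep getting the old behavior
because the utf8_t constructor is explicit; only code that opts in with
utf8_t(s) reaches the new overload. A library author who later wants to
retire the std::string version could mark it deprecated at that point.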
...
...
...
[...] - Once we overcome the troubled period of transition
everything will be just great. No headaches related to file
encoding detection and transcoding.
It's the getting-there part that I'm concerned about.
Me too, but again many other people already pointed out
that a large portion of the code is completely encoding agnostic
so there would be no impact if we stayed with std::string. There
would be, if we add utf8_t.
Those portions of the code that are encoding-agnostic can continue
using std::string, and nothing changes. It's only the functions that
need to know the encoding that would change, and that change can be
gradual.
...
...
...
Think about what will happen after we accept IPv6 and drop IPv4. The
process will be painful, but after it is done there will be no more
NAT and co., and the whole network infrastructure will be
simplified.
That's a problem I've been watching carefully for many years now,
and I don't see that happening. [...]
Yes, people (me included) are resistant to big changes, even for the
better. But I've learned that I should always consider the long-term
impact.
As have I. :-) I think the design I'm proposing is low-impact enough
that people will adopt it. Slowly, but they will.
...
...
That's the beauty of it -- not everybody *needs* to accept it. Just
the people who write code that isn't encoding-agnostic.
Boost.FileSystem might provide a utf16_t overload for Windows, for
instance, so that it can automatically convert strings in other UTF
types. But I see no reason it would lose the existing interface.
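
Roughly what I have in mind, as a sketch only: none of this is actual
Boost.FileSystem code. The struct layouts, utf8_to_utf16, and
to_native_path are hypothetical names for illustration, and the
transcoder assumes its input was already validated (which is what
holding a utf8_t is supposed to guarantee).

```cpp
#include <cstddef>
#include <string>

// Hypothetical transcoder: UTF-8 bytes (assumed already validated) to
// UTF-16 code units, emitting surrogate pairs above U+FFFF.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        const std::size_t len = c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
        // Lead-byte payload mask: 0x1F, 0x0F, 0x07 for lengths 2, 3, 4.
        char32_t cp = (len == 1) ? c : (c & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}

// Bare stand-ins for the proposed classes (no validation shown).
struct utf8_t {
    std::string bytes;
};

struct utf16_t {
    std::u16string units;
    utf16_t(const std::u16string& u) : units(u) {}
    // Implicit on purpose: lets a utf8_t be passed where utf16_t is expected.
    utf16_t(const utf8_t& s) : units(utf8_to_utf16(s.bytes)) {}
};

// Hypothetical overload added next to the existing interface. A Windows
// implementation would hand the code units to the wide-character API;
// here it just returns them.
inline std::u16string to_native_path(const utf16_t& p) {
    return p.units;
}
```

A caller holding a utf8_t can pass it directly to to_native_path and the
conversion happens on the way in, while the existing std::string and
std::wstring entry points stay exactly as they are.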
So you suggest that in the STL, for example, there would be a third
"ufstream" besides the existing fstream and wfstream. I think that we
actually should be reducing the interface, not expanding it (yes, I
hear it ... "breaking changes!" :)).
I don't expect that the utf*_t classes will make it into the standard.
They definitely won't make it into the now-misnamed C++0x standard, and
it'll likely be another ten years before another one is hashed out --
by then, the UTF-8 conversion should be complete, so there will be no
need for it, except possibly to confirm that a string isn't malformed.
...
...
...
- We will abandon std::string and be stuck with utf8_t which I
*personally* already dislike :)
Any technical reason why, other than what you've already written?
Besides the ugly name and that it is a new class? No :)
If you can think of a more-acceptable-but-still-descriptive name for
it, I'm all ears. :-)
...
...
I hate to point this out, but people are *already* using other
programming languages. :-) C++ isn't new or sexy, and has some
pain-points (though many of the most egregious ones will be solved
with C++0x). Unicode handling is one of them, and in my opinion, the
utf*_t types will only ease that.
And the solution is long overdue. And creating utf8_t is just putting
the problem off, not really solving it.
I see it as merely easing the transition.
-- 
Chad Nelson
Oak Circle Software, Inc.