
On Wed, 19 Jan 2011 11:33:02 +0100 Matus Chochlik <chochlik@gmail.com> wrote:
The string-encoding-related discussion boils down for me to the following: what will string handling in C++ look like in the (maybe not immediate) future?
*Scenario A:*
We will pick a widely-accepted char-based encoding [...] and use that with std::strings which will become the one and only text string 'container' class.
All the wstrings, wxString, Qstrings, utf8strings, etc. will be abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out with the help of convenience classes like ansi_str_t and ucs2_t that will be made obsolete and finally dropped (after the transition).
Sounds like a little slice of heaven to me. Though you'll still have the pesky problem of having to verify that the UTF-8 code is valid all the time. More on that below.
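To make that verification burden concrete, here is a minimal structural check of the kind a library would have to run on untrusted byte sequences. This is a sketch, not any library's actual validator; a production validator would also have to reject overlong encodings and surrogate code points, which this one does not:

```cpp
#include <cstddef>
#include <string>

// Minimal UTF-8 structural check (illustrative helper, not a proposed
// interface): the lead byte determines the sequence length, and every
// trailing byte must have the form 10xxxxxx.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        if (c < 0x80)                len = 1;  // 0xxxxxxx: ASCII
        else if ((c & 0xE0) == 0xC0) len = 2;  // 110xxxxx
        else if ((c & 0xF0) == 0xE0) len = 3;  // 1110xxxx
        else if ((c & 0xF8) == 0xF0) len = 4;  // 11110xxx
        else return false;                     // stray continuation or invalid lead
        if (i + len > s.size()) return false;  // truncated sequence at end of data
        for (std::size_t j = 1; j < len; ++j)  // each trailer must be 10xxxxxx
            if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80)
                return false;
        i += len;
    }
    return true;
}
```

Note that this has to walk every byte; that per-call cost is exactly what makes "validate once, then trust the type" attractive.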
*Scenario B:*
We will add yet another string class named utf8_t to the already crowded set named above. [...] Now an application using libraries [a..z] will become the developer's nightmare. What string should he use for class members and constructor parameters, and what should he do when the conversions do not work so seamlessly?
How is that different from what we've got today, except that the utf*_t classes will make converting to and from different string types, and validating the UTF code, a little easier and more automatic?
Also, half of the CPU time assigned to running that application will be wasted on useless string transcoding, and half of the memory will be occupied with useless transcoding-related code and data.
I think that's a bit of an exaggeration. :-) As more libraries move to the assumption that std::string == UTF-8, the need (and code) for transcoding will silently vanish. Eventually, utf8_t will just be a statement by the programmer that the data contained within is guaranteed to be valid UTF-8, enforced by the class -- something that would require at minimum an extra call if using std::string, one that could be forgotten and open up the program to exploits.
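A hypothetical sketch of that guarantee: a utf8_t whose class invariant is "the bytes inside are valid UTF-8", checked once at the boundary instead of being every caller's responsibility. Only the name comes from the discussion; everything else here is illustrative, and the compact validator stands in for whatever the real class would use (a full one would also reject overlong forms and surrogates):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Illustrative utf8_t: construction validates, so code holding a utf8_t
// can rely on the bytes being well-formed UTF-8 without re-checking.
class utf8_t {
public:
    explicit utf8_t(const std::string& bytes) : data_(bytes) {
        if (!valid(data_))
            throw std::invalid_argument("utf8_t: not valid UTF-8");
    }
    const std::string& bytes() const { return data_; }  // valid by invariant
private:
    // Compact structural check: lead byte gives the length, trailers
    // must be 10xxxxxx.
    static bool valid(const std::string& s) {
        for (std::size_t i = 0; i < s.size();) {
            unsigned char c = s[i];
            int n = c < 0x80 ? 1
                  : (c >> 5) == 0x06 ? 2
                  : (c >> 4) == 0x0E ? 3
                  : (c >> 3) == 0x1E ? 4 : 0;
            if (n == 0 || i + n > s.size()) return false;
            for (int j = 1; j < n; ++j)
                if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80)
                    return false;
            i += n;
        }
        return true;
    }
    std::string data_;
};
```

The design point is that the check runs exactly once, at construction; forgetting it is impossible rather than merely inadvisable.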
*Scenario C:*
This is basically the status quo; a mix of the above. A sad and unsatisfactory state of things.
Agreed.
*Consequences of A:*
[...] - Once we overcome the troubled period of transition everything will be just great. No headaches related to file encoding detection and transcoding.
It's the getting-there part that I'm concerned about.
Think about what will happen after we accept IPv6 and drop IPv4. The process will be painful, but after it is done there will be no more NAT and company, and the whole network infrastructure will be simplified.
That's a problem I've been watching carefully for many years now, and I don't see that happening. ISPs will switch to IPv6 (because they have to), and make it possible for their customers to stay on IPv4, so their customers *will* stay on IPv4 because it's cheaper. And if they stay with IPv4, there won't be any impetus for consumer electronics companies to make their equipment IPv6-compatible because consumers won't care about it. Without consumer demand, it won't get done for years, maybe a decade or more. That's what I see happening with std::string and UTF-8 as well.
*Consequences of B:*
- No fixing of existing interfaces, which IMO means no, or very slow, movement to a single encoding.
Which, as stated above, I believe will happen anyway.
- Creating another string class which, let us face it, not everybody will accept even with Boost's influence, unless it becomes standard.
That's the beauty of it -- not everybody *needs* to accept it. Just the people who write code that isn't encoding-agnostic. Boost.FileSystem might provide a utf16_t overload for Windows, for instance, so that strings in the other UTF types can be converted automatically. But I see no reason it would lose the existing interface.
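A hypothetical sketch of that overload idea: the existing narrow-string interface stays exactly as it is, and a UTF-16-aware overload is added beside it for platforms whose native APIs want UTF-16. Both utf16_t and describe_path are invented names for illustration here, not Boost.FileSystem's actual API:

```cpp
#include <string>

// Stand-in for a validated UTF-16 string class like the utf16_t
// discussed above (illustrative only).
struct utf16_t { std::u16string data; };

// The existing narrow interface is untouched...
std::string describe_path(const std::string&) { return "narrow overload"; }

// ...and a UTF-16 overload is simply added beside it; internally it
// could convert to whatever the platform's native encoding is.
std::string describe_path(const utf16_t&)     { return "utf16 overload"; }
```

Callers who never touch the new type never pay for it; overload resolution picks the right entry point silently.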
- We will abandon std::string and be stuck with utf8_t which I *personally* already dislike :)
Any technical reason why, other than what you've already written?
- People will probably start to use other programming languages (although this may be FUD)
I hate to point this out, but people are *already* using other programming languages. :-) C++ isn't new or sexy, and has some pain-points (though many of the most egregious ones will be solved with C++0x). Unicode handling is one of them, and in my opinion, the utf*_t types will only ease that.
*Note on the encoding to be used*
The best candidate for the widely-accepted and extensible encoding vaguely mentioned above is IMO UTF-8. [...]
Apparently a growing number of people agree, as do I.
- It is extensible, so once we have done the painful transition we will not have to do it again. Currently UTF-8 uses 1-4 (or 1-6) byte sequences to encode code points, but the scheme is transparently extensible to 1-N bytes (unlike UCS-X, and I'm not sure about UTF-16/32). [...]
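As a rough sketch of that extensibility point: each trailing byte in a UTF-8 sequence contributes six payload bits, so the reachable code-point range grows mechanically with sequence length, and longer lead-byte prefixes could in principle continue the same pattern. The function below just restates well-known UTF-8 arithmetic; it is not from any particular library:

```cpp
// Largest code point expressible by a UTF-8 sequence of the given byte
// length under the current scheme. The 4-byte form can mechanically
// reach 0x1FFFFF, but Unicode is capped at 0x10FFFF to stay within
// what UTF-16 surrogate pairs can address.
unsigned long max_codepoint_for_length(int len) {
    switch (len) {
        case 1: return 0x7F;      // 0xxxxxxx: 7 payload bits
        case 2: return 0x7FF;     // 110xxxxx + 1 trailer: 5 + 6 bits
        case 3: return 0xFFFF;    // 1110xxxx + 2 trailers: 4 + 6 + 6 bits
        case 4: return 0x1FFFFF;  // 11110xxx + 3 trailers: 3 + 6 + 6 + 6 bits
        default: return 0;        // longer forms would follow the same pattern
    }
}
```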
UTF-16 can't be extended any further than its current definition, not without a major reinterpretation. UTF-32 (and UTF-8) could go up to 0xFFFFFFFF codepoints, but the standards bodies involved have agreed that they'll never be extended past the current UTF-16 limitations. That's subject to change if circumstances change, of course, but nobody foresees such a change right now.

-- 
Chad Nelson
Oak Circle Software, Inc.