
The string-encoding-related discussion boils down for me to the following: what will string handling in C++ look like in the (maybe not immediate) future?

*Scenario A:* We will pick a widely accepted char-based encoding that is able to handle all the writing scripts and alphabets we can think of, has enough reserved space for future additions or is easily extensible, and use it with std::string, which will become the one and only text string 'container' class. All the wstrings, wxStrings, QStrings, utf8strings, etc. will be abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out with the help of convenience classes like ansi_str_t and ucs2_t that will be made obsolete and finally dropped (after the transition).

*Scenario B:* We will add yet another string class named utf8_t to the already crowded set named above. Then:
- library a: will stick to the ANSI encodings with std::string. It has worked in the past, it will work in the future, right?
- library b[oost]: will use utf8_t instead and provide the (seamless and straightforward) conversions between utf8_t and std::string and std::wstring. Some (many, but not all) others will follow.
- library c: will use std::string with UTF-8 ...
- library [.]n[et]: will use the String class ...
- library q[t]: will use QString ...
- library w[xWidgets]: will use wxString and wxChar* ...
- library wi[napi]: will use TCHAR* ...
- library z: will use const char* in an encoding-agnostic way

Now an application using libraries [a..z] will become the developer's nightmare. Which string should he use for class members and constructor parameters? What should he do when the conversions do not work so seamlessly? Also, half of the CPU time assigned to running that application will be wasted on useless string transcoding, and half of the memory will be occupied by useless transcoding-related code and data.

*Scenario C:* This is basically the status quo; a mix of the above. A sad and unsatisfactory state of things.

*Consequences of A:*
- Interface-breaking changes, which will require some fixing in the library client code and some work in the libraries themselves. These should be made as painless as possible with *temporary* utilities or convenience classes that would, for example, handle the transcoding from UTF-8 to UCS-2/UTF-16 in WINAPI and be no-ops on most POSIX systems.
- Silent introduction of bugs for those who still use std::string for ANSI CP####. This is worse than the above and will require some public-relations work on the part of Boost to make it clear that using std::string with ANSI may be an error since Boost version x.y.z.
- We should finally accept the notion that one byte, word, or dword != one character, that there are code points and there are characters, and that both of them can have variable-length encodings, and we should devise tools to handle them as such conveniently (see the sketch after this list).
- Once we overcome the troubled period of transition, everything will be just great: no more headaches related to file encoding detection and transcoding. Think about what will happen once we accept IPv6 and drop IPv4. The process will be painful, but after it is done there will be no more NAT and co., and the whole network infrastructure will be simplified.
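To make the code-point bullet above concrete, here is a minimal sketch (my illustration only, not a proposed interface) of walking the code points of a UTF-8 encoded std::string. One std::string element is a byte, not a character, and the lead byte alone announces how many bytes belong to the current code point:

    #include <cstddef>
    #include <string>
    #include <vector>

    // Illustration only: iterate the code points of a UTF-8 string.
    // Invalid sequences become U+FFFD; real tools need stricter checks
    // (overlong forms, surrogates, out-of-range values).
    std::vector<unsigned long> code_points(const std::string& utf8)
    {
        std::vector<unsigned long> out;
        std::size_t i = 0;
        while (i < utf8.size())
        {
            unsigned char lead = static_cast<unsigned char>(utf8[i]);
            std::size_t len;
            unsigned long cp;
            if (lead < 0x80)                { len = 1; cp = lead; }        // 0xxxxxxx
            else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; } // 110xxxxx
            else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; } // 1110xxxx
            else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; } // 11110xxx
            else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

            if (i + len > utf8.size()) { out.push_back(0xFFFD); break; }

            bool ok = true;
            for (std::size_t j = 1; j < len; ++j)  // 10xxxxxx continuation bytes
            {
                unsigned char c = static_cast<unsigned char>(utf8[i + j]);
                if ((c & 0xC0) != 0x80) { ok = false; break; }
                cp = (cp << 6) | (c & 0x3F);
            }
            out.push_back(ok ? cp : 0xFFFD);
            i += ok ? len : 1;
        }
        return out;
    }

Whatever tools we devise would hide this loop behind a code-point (or grapheme) iterator so that nobody indexes bytes directly. Note also that this only gets us to code points; combining sequences mean that one 'character' may still span several of them.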
*Consequences of B:*
- No fixing of existing interfaces, which IMO means no, or only very slow, movement toward a single encoding.
- Creating another string class which, let us face it, not everybody will accept even with Boost's influence, unless it becomes standard.
- We will abandon std::string and be stuck with utf8_t, which I *personally* already dislike :)
- People will probably start to use other programming languages (although this may be FUD).

*Consequences of C:* Here, pick all the negatives of the above :)

*Note on the encoding to be used:* The best candidate for the widely accepted and extensible encoding vaguely mentioned above is IMO UTF-8.
- It has been given a lot of thought.
- It is an already widely accepted standard.
- It is char-based, so there is no need to switch to std::basic_string<whatever_char_t>.
- It is extensible, so once we have done the painful transition we will not have to do it again. Currently UTF-8 uses 1-4 (or 1-6) byte sequences to encode code points, but the lead-byte scheme is transparently extensible to 1-N bytes (unlike UCS-X; I am not sure about UTF-16/32) - see the P.S. below.

So, [dark-sarcasm] even if we dig out the Stargate, or join the United Federation of Planets and Captain Kirk, every time he returns home, brings a truckload of new writing scripts to support, UTF-8 will be able to handle it. Just my 0.02 strips of gold-pressed latinum :) [/dark-sarcasm]

Best regards, Matus
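P.S. To back the extensibility claim with something concrete, here is the encoding direction as a minimal sketch (again only an illustration, assuming cp <= 0x10FFFF per the current Unicode/RFC 3629 limit). The run of high one-bits in the lead byte announces the sequence length, and nothing in the mechanism stops at four bytes; the old 5- and 6-byte forms follow the same rule:

    #include <string>

    // Illustration only: encode one code point as UTF-8.
    // Assumes cp <= 0x10FFFF; larger values would need the 5- and
    // 6-byte forms of the original scheme.
    std::string encode_utf8(unsigned long cp)
    {
        std::string out;
        if (cp < 0x80)             // 1 byte:  0xxxxxxx
            out += static_cast<char>(cp);
        else if (cp < 0x800)       // 2 bytes: 110xxxxx 10xxxxxx
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else                       // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return out;
    }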