
On Wed, Jan 19, 2011 at 8:28 PM, Dave Abrahams <dave@boostpro.com> wrote:
OK, I see. But is there any chance that the standard itself would be updated so that it would first recommend using UTF-8 with C++ strings?
Well, never say "never," but... never. Such recommendations are not part of the standard's mission. It doesn't do things like that.
My view of what the standardizing committee is willing to do may be naive, but generally I don't see why this could not be done. Other major languages (Java, C#, etc.) picked a single "standard" encoding, and in those languages you treat text in other encodings as a special case. If C++ recommended the use of UTF-8, this would probably kick-start the OS and compiler vendors into following, or at least into fixing their implementations of the standard library and the OS APIs to accept UTF-8 by default (if we agree that this is a good idea).
After some period of time all other encodings would be deprecated
By whom?
By the same committee that made the recommendation in the first place.
I really see all the obstacles that prevent us from just switching to UTF-8, but adding a new string class will not help for the same reasons adding wstring did not help.
I don't see the parallel at all. wstring is just another container of bytes, for all practical purposes. It doesn't imply any particular encoding, and does nothing to segregate the encoded from the raw.
Maybe wstring is not officially UTF-16, UTF-32, or UCS, but on most platforms it is at least treated as "the Unicode string," however vague that term is. What I am afraid of is that, just as the use of wchar_t and wstring spawned the dual interface used by the WinAPI and followed by many others (including myself in the past), introducing a third (semi-)standard string class will spawn a "ternary" interface (but I may be wrong, or may be mixing up the order of the events mentioned above).
As I already said elsewhere I think that this is a problem that has to be solved "organizationally".
Perhaps. The type system is one of our organizational tools, and Boost has an impact insofar as it produces components that people use, so if we aren't able to produce some flagship library components that help with the solution, we have little traction.
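A minimal sketch of what that kind of type-level segregation might look like (the name utf8_string and its interface here are purely illustrative, not a proposed design):

    #include <string>
    #include <utility>

    // Illustrative only: a distinct type whose invariant is "these
    // bytes are well-formed UTF-8". Explicit construction keeps raw
    // and encoded strings from mixing silently; the compiler does
    // the bookkeeping for you.
    class utf8_string {
    public:
        explicit utf8_string(std::string bytes)
            : bytes_(std::move(bytes)) {}   // real code would validate here

        const std::string& bytes() const { return bytes_; }

    private:
        std::string bytes_;                 // invariant: well-formed UTF-8
    };

    // An interface taking utf8_string states its encoding requirement in
    // the signature; passing a raw std::string by accident is a type error.
    void render_label(const utf8_string&);

    int main() {
        std::string raw = "caf\xC3\xA9";    // just bytes, encoding unknown
        utf8_string label(raw);             // deliberate, visible conversion
        // render_label(raw);               // error: no implicit conversion
        return label.bytes().empty();
    }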
I believe in strong typing, but... OK, for the sake of argument, where do we imagine utf8_t (or whatever its name will be) will be used, and what is our long-term plan for std::string? If I design a library or an application, should I use utf8_t everywhere? As the type of a class's member variables and the parameters of functions and constructors? Or should I stick to std::string (or perhaps wstring) for maximum compatibility with the rest of the world?
*Scenario E:* We add another string class and everyone adopts it
I meant that, for example, on POSIX OSes the POSIX C API did not have to be changed or extended with a new set of functions doing the same things but using a new character type when the platform switched from the old encodings to UTF-8.
...and people still have the problem that they lose track of what's "raw" and what's encoded as UTF-8.
Yes, but in the end, they will get used to it. There are many dangerous things in C++ that you should not do (for example, dereferencing a null or dangling pointer, doing C pointer arithmetic in the presence of inheritance, etc.), and mixing UTF-8 and other encodings would be one of them. It is a breaking change, but it would not be the first one in C++'s history.
To compare two strings you can still use strcmp and not utf8strcmp; to collate strings you use strcoll and not utf8strcoll; etc.
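For illustration, assuming a UTF-8 locale: because UTF-8 is a superset of ASCII and contains no embedded NUL bytes, the existing byte-oriented calls keep working unchanged:

    #include <cstdio>
    #include <cstring>

    int main() {
        // "naïve.txt" spelled out as UTF-8 bytes; to strcmp/fopen it
        // is just a NUL-terminated byte sequence like any other.
        const char* name = "na\xC3\xAFve.txt";

        if (std::FILE* f = std::fopen(name, "r"))  // no utf8fopen needed
            std::fclose(f);

        return std::strcmp(name, name);  // 0: identical byte sequences
    }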
Yeah... but surely POSIX's strcmp only tells you whether the two strings have the same sequence of code points, not whether they have the same characters, right? And if you inadvertently compare a "raw" string with an equivalent UTF-8-encoded string, what happens?
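For instance, a precomposed character and its decomposed (combining-mark) spelling are canonically equivalent Unicode text but different byte sequences, so strcmp reports them as different:

    #include <cstring>
    #include <iostream>

    int main() {
        // Two canonically equivalent spellings of "é" in UTF-8:
        const char* precomposed = "\xC3\xA9";   // U+00E9
        const char* decomposed  = "e\xCC\x81";  // 'e' + U+0301 combining acute

        // strcmp compares bytes, so equivalent *text* compares unequal.
        std::cout << std::strcmp(precomposed, decomposed) << '\n';  // nonzero
    }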
Undefined behavior: your application segfaults, aborts, silently fails... (what happens if you dereference a dangling pointer?)

BR,
Matus