
On Wed, Jan 19, 2011 at 8:28 PM, Dave Abrahams <dave@boostpro.com> wrote:
OK, I see. But is there any chance that the standard itself would be updated so that it would first recommend using UTF-8 with C++ strings?
Well, never say "never," but... never. Such recommendations are not part of the standard's mission. It doesn't do things like that.
My view of what the standardizing committee is willing to do may be naive, but generally I don't see why this could not be done. Other major languages (Java, C#, etc.) picked a single "standard" encoding, and in those languages you treat text in other encodings as a special case. If C++ recommended the use of UTF-8, this would probably kick-start the OS and compiler vendors into following, or at least into fixing their implementations of the standard library and the OS APIs to accept UTF-8 by default (if we agree that this is a good idea).
After some period of time all other encodings would be deprecated
By whom?
By the same committee that made the recommendation in the first place.
I really see all the obstacles that prevent us from just switching to UTF-8, but adding a new string class will not help for the same reasons adding wstring did not help.
I don't see the parallel at all. wstring is just another container of bytes, for all practical purposes. It doesn't imply any particular encoding, and does nothing to segregate the encoded from the raw.
Maybe wstring is not officially UTF-16, UTF-32, or UCS, but on most platforms it is at least treated as "the Unicode string," however vague that term is. What I am afraid of is that, just as the use of wchar_t and wstring spawned the dual interface used by the WinAPI and followed by many others (including myself in the past), introducing a third (semi-)standard string class will spawn a "ternary" interface (but I may be wrong, or may be mixing up the order of the events mentioned above).
As I already said elsewhere I think that this is a problem that has to be solved "organizationally".
Perhaps. The type system is one of our organizational tools, and Boost has an impact insofar as it produces components that people use, so if we aren't able to produce some flagship library components that help with the solution, we have little traction.
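A minimal sketch of what that kind of type-level segregation might look like (the name utf8_string and its interface here are purely illustrative, not a proposed design):

    #include <string>
    #include <utility>

    // Illustrative only: a distinct type whose invariant is "these
    // bytes are well-formed UTF-8". Explicit construction keeps raw
    // and encoded strings from mixing silently; the compiler does
    // the bookkeeping for you.
    class utf8_string {
    public:
        explicit utf8_string(std::string bytes)
            : bytes_(std::move(bytes)) {}   // real code would validate here

        const std::string& bytes() const { return bytes_; }

    private:
        std::string bytes_;                 // invariant: well-formed UTF-8
    };

    // An interface taking utf8_string states its encoding requirement in
    // the signature; passing a raw std::string by accident is a type error.
    void render_label(const utf8_string&);

    int main() {
        std::string raw = "caf\xC3\xA9";    // just bytes, encoding unknown
        utf8_string label(raw);             // deliberate, visible conversion
        // render_label(raw);               // error: no implicit conversion
        return label.bytes().empty();
    }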
I believe in strong typing, but... OK, for the sake of argument, where do we imagine utf8_t (or whatever its name will be) will be used, and what is our long-term plan for std::string? If I design a library or an application, should I use utf8_t everywhere? As the type of a class's member variables and the parameters of functions and constructors? Or should I stick to std::string (or perhaps wstring) for maximum compatibility with the rest of the world?
*Scenario E:* We add another string class and everyone adopts it
I meant that, for example, on POSIX OSes the POSIX C API did not have to be changed or extended with a new set of functions doing the same things but using a new character type when the platform switched from the old encodings to UTF-8.
...and people still have the problem that they lose track of what's "raw" and what's encoded as UTF-8.
Yes, but in the end, they will get used to it. There are many dangerous things in C++ that you should not do (for example, dereferencing a null or dangling pointer, doing C pointer arithmetic in the presence of inheritance, etc.), and mixing UTF-8 and other encodings would be one of them. It is a breaking change, but it would not be the first one in C++'s history.
To compare two strings you can still use strcmp and not utf8strcmp; to collate strings you use strcoll and not utf8strcoll; etc.
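For illustration, assuming a UTF-8 locale: because UTF-8 is a superset of ASCII and contains no embedded NUL bytes, the existing byte-oriented calls keep working unchanged:

    #include <cstdio>
    #include <cstring>

    int main() {
        // "naïve.txt" spelled out as UTF-8 bytes; to strcmp/fopen it
        // is just a NUL-terminated byte sequence like any other.
        const char* name = "na\xC3\xAFve.txt";

        if (std::FILE* f = std::fopen(name, "r"))  // no utf8fopen needed
            std::fclose(f);

        return std::strcmp(name, name);  // 0: identical byte sequences
    }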
Yeah... but surely POSIX's strcmp only tells you whether the two strings have the same sequence of code points, not whether they have the same characters, right? And if you inadvertently compare a "raw" string with an equivalent UTF-8-encoded string, what happens?
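For instance, a precomposed character and its decomposed (combining-mark) spelling are canonically equivalent Unicode text but different byte sequences, so strcmp reports them as different:

    #include <cstring>
    #include <iostream>

    int main() {
        // Two canonically equivalent spellings of "é" in UTF-8:
        const char* precomposed = "\xC3\xA9";   // U+00E9
        const char* decomposed  = "e\xCC\x81";  // 'e' + U+0301 combining acute

        // strcmp compares bytes, so equivalent *text* compares unequal.
        std::cout << std::strcmp(precomposed, decomposed) << '\n';  // nonzero
    }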
Undefined behavior: your application segfaults, aborts, silently fails... (what happens if you dereference a dangling pointer?)

BR,
Matus