
At Wed, 19 Jan 2011 20:03:59 +0100, Matus Chochlik wrote:
On Wed, Jan 19, 2011 at 7:39 PM, Dave Abrahams <dave@boostpro.com> wrote:
Our influence, if we introduce new library components, is very great, because they're on a de-facto fast track to standardization, and an improved string library is exactly the sort of thing that would be adopted upstream. If we simply agree to a programming convention, that will have some impact, but much less.
OK, I see. But is there any chance that the standard itself would be updated so that it would first recommend using UTF-8 with C++ strings?
Well, never say "never," but... never. Such recommendations are not part of the standard's mission. It doesn't do things like that.
After some period of time, all other encodings would be deprecated
By whom?
and using them would cause undefined behavior. Could Boost be the driving force here?
This doesn't seem like a very plausible scenario to me, based on my experience. Of course, others may disagree.
I really see all the obstacles that prevent us from just switching to UTF-8, but adding a new string class will not help for the same reasons adding wstring did not help.
I don't see the parallel at all. wstring is just another container of bytes, for all practical purposes. It doesn't imply any particular encoding, and does nothing to segregate the encoded from the raw.
As I already said elsewhere I think that this is a problem that has to be solved "organizationally".
Perhaps. The type system is one of our organizational tools, and Boost has an impact insofar as it produces components that people use, so if we aren't able to produce some flagship library components that help with the solution, we have little traction.
*Scenario E:* We add another string class and everyone adopts it
OK, I admit that this is possible. But let me ask: how did the C world make the transition without abandoning char?
The transition from what to what?
I meant that, for example, on POSIX OSes the POSIX C API did not have to be changed or extended with a new set of functions doing the same things but using a new character type when they switched from the old encodings to UTF-8.
...and people still have the problem that they lose track of what's "raw" and what's encoded as utf-8.
To compare two strings you can still use strcmp and not utf8strcmp; to collate strings you use strcoll and not utf8strcoll; etc.
Yeah... but surely POSIX's strcmp only tells you whether the two strings have the same sequence of code points, not whether they have the same characters, right? And if you inadvertently compare a "raw" string with an equivalent utf-8-encoded string, what happens? -- Dave Abrahams BoostPro Computing http://www.boostpro.com