Re: [boost] [string] proposal

27 Jan 2011

On 01/27/2011 04:45 AM, Matus Chochlik wrote:
...
... elision by patrick ...
In general? Nothing. I do not have (nor did I have in the past)
anything against a general efficient encoding-agnostic string
if it is called general_string. But std::string IMO is and always
has been primarily about handling text. I certainly do not know
anyone who would store an MPEG inside std::string.
You may think it strange, but there's a lot of code out there that uses 
std::string as a binary buffer.
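To make that concrete, here is a minimal sketch of std::string holding bytes that are not text in any encoding. The helper names make_buffer and sample_payload are made up for illustration; only the std::string (pointer, length) constructor is standard.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Build a std::string from raw bytes. The (pointer, length) constructor
// copies exactly `len` bytes and does not stop at an embedded '\0'.
std::string make_buffer(const char* bytes, std::size_t len) {
    return std::string(bytes, len);
}

// Hypothetical sample payload: four arbitrary bytes, including NULs and
// 0xFF, which can never appear in well-formed UTF-8.
std::string sample_payload() {
    const char raw[] = {'\x00', '\x01', '\xff', '\x00'};
    return make_buffer(raw, sizeof raw);
}
```

size() on the result reports all four bytes, which is why this usage works with std::string today and would not survive a string that enforced an encoding.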
...
...
You have a sequence of bytes (in a string).
You want to interpret that sequence of bytes in a given encoding (with a view).
Why does the encoding have to apply only to text?
It doesn't, and in your immutable string (or with std::string also) your 
idea of views is a nice one.  It would have different benefits than a 
utf-xx_string with intrinsic encoding.
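A rough sketch of what such a view could look like, assuming nothing about the actual proposal: encoded_view, latin1_tag and utf8_tag are hypothetical names, not anything in Boost or the standard library. The same bytes are left untouched and only the interpretation changes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical encoding tags, for illustration only.
struct latin1_tag {};
struct utf8_tag {};

// A non-owning view: it keeps a pointer to the bytes and interprets them
// according to the Encoding tag. The string must outlive the view.
template <typename Encoding>
class encoded_view {
public:
    explicit encoded_view(const std::string& bytes) : bytes_(&bytes) {}

    // Number of characters under this view's interpretation of the bytes.
    std::size_t length() const { return count(Encoding{}); }

private:
    // Latin-1: every byte is exactly one character.
    std::size_t count(latin1_tag) const { return bytes_->size(); }

    // UTF-8: count every byte that is not a continuation byte (10xxxxxx).
    // This assumes the bytes are well-formed; nothing is validated here.
    std::size_t count(utf8_tag) const {
        std::size_t n = 0;
        for (unsigned char c : *bytes_) {
            if ((c & 0xC0) != 0x80) ++n;
        }
        return n;
    }

    const std::string* bytes_;
};
```

For the five bytes "caf\xC3\xA9", the Latin-1 view reports five characters and the UTF-8 view reports four. Note that the view does no checking, so it will happily report a length for bytes that are not valid in its encoding.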
...
Encoding does not have to apply only to text, but my,
let's call it a vision, is that the "everyday" handling
of text would use a single encoding. There are people
who have invested a whole lotta love :) and time into
making it possible and they are generally called the
Unicode consortium. C++(1x) already adopts part of
their work via the u"" and U"" literal types, because
it has countless advantages. Why not take one more
step in that direction and use it for the 'string' type by
default?
That won't happen with std::string though.  It's in the C++ spec as 
behaving a certain way and you won't change that.  You might have a 
chance of getting a utf-8_string in there though.
...
...
...
[snip/]
...
But this already happens, it's called 7-bit clean byte encoding --
barring any endianness issues, just stuff whatever you already have in
a `char const *` into a socket. HTTP, FTP, and even memcached's
protocol work fine without the need to interpret strings other than a
sequence of bytes; my original opposition is having a string that by
default looked at data in it as UTF-8 when really a string would just
be a sequence of bytes not necessarily contiguous.
Again, where you see a string primarily as a class for handling
raw data that can be interpreted in hundreds of different ways,
I see string primarily as a class for encoding human readable text.
And you see it as encoding it in utf-8.  Don't forget that.  It's a very 
specialized use out of the many that std::string supports today.
...
...
So what's the difference between a string for encoding human readable
text and a string that handles raw data?
Usability. It is usually more difficult to use super-generic,
everything-solving things. I repeat, for probably the 10th time, that I'm
not against such a string in general, but this is not std::string.
And neither would a string that enforced utf-8 encoding be std::string.  
We already have one in the spec, and it's not that.
...
... elision by patrick ...
Unnecessary verbosity.
Do you really want all the people that now do:
struct person
{
     std::string name;
     std::string middle_name;
     std::string family_name;
     // .. etc.
};
to do this ?
struct person
{
     boost::view<some_encoding_tag>  name;
     boost::view<some_encoding_tag>  middle_name;
     boost::view<some_encoding_tag>  family_name;
     // .. etc.
};
If their encoding is not utf-8 compatible it works with std::string, but 
wouldn't work with your utf-8 string.  The same argument applies to 
your string.
...
... elision by patrick ...
...
Right, what I meant to say is that it hardly has any bearing when
we're talking about engineering solutions. So your circumstances and
mine may very well be different, but that doesn't change that we're
trying to solve the same problem. :)
No.  You're not trying to solve the same problem at all!  (And neither 
of you is trying to deal with std::string.)

You, Dean, are trying to solve an efficiency problem caused by mutable 
strings, and note that an external view can interpret the data in any 
encoding desired.  You correctly point out that this is more general and 
flexible, that it has a power that can be applied to many things while 
giving you all the efficiency advantages of immutable data types.  
(Although why a general buffer for immutable data would be called string 
which is normally associated with text _is_ a bit confusing.  I suspect 
you've gone down a road you never intended trying to make this point.)

You, Matus, are trying to solve a problem caused by a plethora of 
possible encodings and the extra work that has to be done every time you 
have to deal with them, by specifying that a string will have an 
encoding type associated with it, (and in particular utf-8 as the 
natural default), and that the specialized string itself will enforce 
the encoding as well as provide ways to convert other encodings to it.  
(And I think the natural way to do this is with code conversion 
facets.)  You correctly point out that this specificity allows a power 
in solving this one particular problem that a more general solution 
wouldn't be able to match.  A general string with a view into it would 
allow you to get invalidly encoded data into it (N.B. for an immutable 
string _into it_ would have a different meaning) and you would only know 
about this after the fact.
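To illustrate the difference, here is a minimal sketch of an enforcing string, under loud assumptions: utf8_string, looks_like_utf8 and accepts are hypothetical names, and the check is only structural (lead bytes followed by the right number of continuation bytes). It does not reject overlong forms, surrogates, or values above U+10FFFF, and it does not use code conversion facets.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Simplified structural UTF-8 check (an assumption of this sketch, not a
// complete validator).
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80) extra = 0;                  // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;   // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;   // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;   // 4-byte sequence
        else return false;  // stray continuation byte or invalid lead byte
        if (extra > s.size() - i - 1) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char t = static_cast<unsigned char>(s[i + k]);
            if ((t & 0xC0) != 0x80) return false;    // not a continuation
        }
        i += extra + 1;
    }
    return true;
}

// Hypothetical enforcing string: bad data is rejected at construction,
// so it never gets in, rather than being discovered later through a view.
class utf8_string {
public:
    explicit utf8_string(std::string bytes) : bytes_(std::move(bytes)) {
        if (!looks_like_utf8(bytes_)) {
            throw std::invalid_argument("bytes are not well-formed UTF-8");
        }
    }
    const std::string& bytes() const { return bytes_; }

private:
    std::string bytes_;
};

// Returns true if a utf8_string can be constructed from `s`.
bool accepts(const std::string& s) {
    try {
        utf8_string checked(s);
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}
```

A plain std::string accepts the Latin-1 bytes "caf\xE9" without complaint, while the enforcing type refuses them up front. That is the trade-off in a nutshell: the general buffer takes anything, the specialized string guarantees its encoding.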

These are both great things.  Kudos to you both.  You're both right.  
You guys keep arguing apples and orangutans and it makes it hard for 
others to talk about either one of your ideas because you're so busy 
going back and forth telling each other that the other doesn't get what 
they're trying to say.

I wish you'd split into threads like [immutable string] and [unicode 
string].

Patrick

Patrick Horgan