
On 01/27/2011 04:45 AM, Matus Chochlik wrote:
... elision by patrick ... In general? Nothing. I do not have (nor did I have in the past) anything against a general efficient encoding-agnostic string if it is called general_string. But std::string IMO is and always has been primarily about handling text. I certainly do not know anyone who would store an MPEG inside std::string.
You may think it strange, but there's a lot of code out there that uses std::string as a binary buffer.
You have a sequence of bytes (in a string). You want to interpret that sequence of bytes in a given encoding (with a view).
Why does the encoding have to apply only to text?
It doesn't, and in your immutable string (or with std::string also) your idea of views is a nice one. It would have different benefits than a utf-xx_string with intrinsic encoding.
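To make the view idea concrete, here is a minimal C++ sketch. It is not anything from Boost or the standard: the names encoded_view, latin1_tag and utf8_tag are hypothetical, and the UTF-8 case assumes the bytes are already well-formed (a view like this does no validation). It only shows the same bytes reporting a different length depending on which view is applied.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical encoding tags, invented for this illustration.
struct latin1_tag {};
struct utf8_tag {};

// A non-owning view that interprets a byte sequence under a chosen encoding.
// The underlying buffer must outlive the view.
template <class Encoding>
class encoded_view {
public:
    encoded_view(const char* data, std::size_t size) : data_(data), size_(size) {
        assert(data != nullptr || size == 0);
    }
    explicit encoded_view(const std::string& bytes)
        : data_(bytes.data()), size_(bytes.size()) {}

    // Number of code points the bytes represent under Encoding.
    std::size_t length() const;

private:
    const char* data_;
    std::size_t size_;
};

// Latin-1: every byte is exactly one code point.
template <>
inline std::size_t encoded_view<latin1_tag>::length() const {
    return size_;
}

// UTF-8: count every byte that is not a continuation byte (10xxxxxx).
// Assumes well-formed input; malformed data is not detected here.
template <>
inline std::size_t encoded_view<utf8_tag>::length() const {
    std::size_t count = 0;
    for (std::size_t i = 0; i < size_; ++i) {
        const unsigned char c = static_cast<unsigned char>(data_[i]);
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

Given the six bytes of "naïve" encoded as UTF-8, the Latin-1 view would report a length of 6 and the UTF-8 view a length of 5. The buffer itself is never touched, which is the flexibility being argued for.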
Encoding does not have to apply only to text, but my, let's call it a vision, is that the "everyday" handling of text would use a single encoding. There are people who have invested a whole lotta love :) and time into making it possible, and they are generally called the Unicode Consortium. C++(1x) already adopts part of their work via the u"" and U"" literal types, because it has countless advantages. Why not take one more step in that direction and use it for the 'string' type by default?
That won't happen with std::string though. It's in the C++ spec as behaving a certain way and you won't change that. You might have a chance of getting a utf-8_string in there though.
[snip/]
But this already happens, it's called 7-bit clean byte encoding -- barring any endianness issues, just stuff whatever you already have in a `char const *` into a socket. HTTP, FTP, and even memcached's protocol work fine without the need to interpret strings as anything other than a sequence of bytes; my original opposition is to having a string that by default looked at the data in it as UTF-8 when really a string would just be a sequence of bytes, not necessarily contiguous.
Again, where you see a string primarily as a class for handling raw data that can be interpreted in hundreds of different ways, I see string primarily as a class for encoding human readable text.
And you see it as encoding it in utf-8. Don't forget that. It's a very specialized use out of the many that std::string supports today.
So what's the difference between a string for encoding human readable text and a string that handles raw data? Usability. It is usually more difficult to use the super-generic everything-solving things. I repeat, for probably the 10th time, that I'm not against such a string in general, but this is not std::string.
And neither would a string that enforced utf-8 encoding be std::string. We already have one in the spec, and it's not that.
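For contrast with a view, a string that enforces its own encoding might look roughly like the sketch below. The class name utf8_string and the helper is_valid_utf8 are made up for illustration; this is not a proposed std or Boost interface, and it leaves out conversion from other encodings entirely. The point is only that invalid data is rejected when it goes in, rather than discovered later.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Returns true if s is well-formed UTF-8. Rejects truncated sequences,
// overlong forms, surrogates, and code points above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        unsigned long min;  // smallest code point allowed for this length
        if (c < 0x80)                { len = 1; cp = c;        min = 0; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false;              // stray continuation or invalid lead byte
        if (n - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min) return false;                       // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        i += len;
    }
    return true;
}

// Hypothetical wrapper whose invariant is "bytes_ is valid UTF-8".
class utf8_string {
public:
    explicit utf8_string(std::string bytes) : bytes_(std::move(bytes)) {
        if (!is_valid_utf8(bytes_))
            throw std::invalid_argument("not valid UTF-8");
    }

    // Concatenating two valid UTF-8 strings keeps the invariant.
    void append(const utf8_string& other) {
        bytes_ += other.bytes_;
        assert(is_valid_utf8(bytes_));
    }

    const std::string& bytes() const { return bytes_; }

private:
    std::string bytes_;
};
```

Constructing a utf8_string from a lone 0xFF byte throws here, whereas std::string would happily store it. That is exactly why such a type cannot simply be std::string: it refuses input that std::string is specified to accept.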
... elision by patrick ... Unnecessary verbosity.
Do you really want all the people that now do:
struct person
{
    std::string name;
    std::string middle_name;
    std::string family_name;
    // .. etc.
};
to do this?
struct person
{
    boost::view<some_encoding_tag> name;
    boost::view<some_encoding_tag> middle_name;
    boost::view<some_encoding_tag> family_name;
    // .. etc.
};
If their encoding is not utf-8 compatible, it works with std::string but wouldn't work with your utf-8 string. The argument you make against views applies equally to your own string.
... elision by patrick ...
Right, what I meant to say is that it hardly has any bearing when we're talking about engineering solutions. So your circumstances and mine may very well be different, but that doesn't change that we're trying to solve the same problem. :)
No. You're not trying to solve the same problem at all! (And neither of you is trying to deal with std::string.)
You, Dean, are trying to solve an efficiency problem caused by mutable strings, and note that an external view can interpret the data as any encoding desired. You correctly point out that this is more general and flexible, that it has a power that can be applied to many things while giving you all the efficiency advantages of immutable data types. (Although why a general buffer for immutable data would be called string, which is normally associated with text, _is_ a bit confusing. I suspect you've gone down a road you never intended trying to make this point.)
You, Matus, are trying to solve a problem caused by a plethora of possible encodings and the extra work that has to be done every time you have to deal with them, by specifying that a string will have an encoding type associated with it (and in particular utf-8 as the natural default), and that the specialized string itself will enforce the encoding as well as provide ways to convert other encodings to it. (And I think the natural way to do this is with code conversion facets.) You correctly point out that this specificity allows a power in solving this one particular problem that a more general solution wouldn't be able to match. A general string with a view into it would allow you to get invalidly encoded data into it (N.B.: for an immutable string, _into it_ would have a different meaning) and you would only know about this after the fact.
These are both great things. Kudos to you both. You're both right. You guys keep arguing apples and orangutans, and it makes it hard for others to talk about either one of your ideas because you're so busy going back and forth telling each other that the other doesn't get what they're trying to say. I wish you'd split into threads like [immutable string] and [unicode string].
Patrick