
On Thu, Jan 27, 2011 at 8:45 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Thu, Jan 27, 2011 at 12:09 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
But why do you need to separate text encoding from encoding in general? Here's the logic:
In general? Nothing. I do not have (nor did I have in the past) anything against a general, efficient, encoding-agnostic string if it is called general_string. But std::string IMO is, and always has been, primarily about handling text. I certainly do not know anyone who would store an MPEG inside std::string.
std::string has not been about handling text -- it's about encapsulating the notion of a sequence of characters with a suitable definition of `character`. You have string algorithms that apply mostly to strings -- pattern matching, slicing/concatenation, character location, tokenization, etc. The notion of "text" is actually a higher-level concept which imbues a string with things like encoding, language, locality, etc., all of which live at a different level. As for people storing <encoded> data inside a string, note that most text-based protocols now transfer things in Base64 or Base32 or some variant of those encodings -- precisely so that they can be dealt with as character sequences. If you were catching an XMPP stream-fed, Base64-encoded H.264 video stream, why not put it in a string? I wouldn't put it in std::string if I had any *sane* choice, because it's just broken IMO, but like most people who intend to do things with data in memory obtained from a character stream, you put it in a string.
You have a sequence of bytes (in a string). You want to interpret that sequence of bytes in a given encoding (with a view).
Why does the encoding have to apply only to text?
Encoding does not have to apply only to text, but my vision (let's call it that) is that the "everyday" handling of text would use a single encoding. There are people who have invested a whole lotta love :) and time into making that possible, and they are generally called the Unicode consortium. C++1x already adopts part of their work via the u"" and U"" literals, because it has countless advantages. Why not take one more step in that direction and use it for the 'string' type by default?
So the literals are already encoded, and guess what: they're still a sequence of bytes. The only "sane" way to deal with them is to provide an appropriate *view* of the encoded data at the appropriate level of abstraction. A string, I argue, is *not* that level of abstraction.
So what's the difference between a string for encoding human readable text and a string that handles raw data?
Usability. It is usually more difficult to use the super-generic, everything-solving things. I repeat, for probably the tenth time, that I'm not against such a string in general, but this is not std::string.
Usability of what, the type? Any type is as usable as any other the way I see it -- they're all just types. So aside from aesthetic/cosmetic differences, what's the point?
So what's wrong with:
    view<some_encoding_0> x = get_x();
    view<some_encoding_1> y = get_y();
    view<some_encoding_3> z = x + y;
    float w = log(as<acme_float_encoding>(z));
Unnecessary verbosity.
What verbosity? We deal with that through typedefs and descriptive names. Heck, C++0x has auto, so I don't know what 'verbosity' you're referring to. And if you really wanted to know the encoding of the data from the type, how else would you do it?
Do you really want all the people that now do:
    struct person {
        std::string name;
        std::string middle_name;
        std::string family_name;
        // .. etc.
    };
to do this ?
    struct person {
        boost::view<some_encoding_tag> name;
        boost::view<some_encoding_tag> middle_name;
        boost::view<some_encoding_tag> family_name;
        // .. etc.
    };
Well:

    typedef boost::strings::view<boost::strings::utf8_encoding> utf8_string;

    struct person {
        utf8_string name, middle_name, family_name;
    };

Where's the verbosity in that?
See, there's absolutely 0 reason why you *have* to deal with a raw sequence of bytes if what you really want is to deal with a view of these bytes from the outset.
Again I ask, am I missing something here?
Please see the example above.
I did and I saw an even more succinct way of doing it. So again, I don't see what I'm missing here.
[snip/]
Right, what I meant to say is that it hardly has any bearing when we're talking about engineering solutions. So your circumstances and mine may very well be different, but that doesn't change that we're trying to solve the same problem. :)
If, along with solving your problem (all the completely valid points that you had about performance), we also solve my and others' problems (completely valid points about the encoding), and we think about acceptability and "adoptability",
I don't know what "acceptability" and "adoptability" mean in this context. Both of these are a matter of taste and not of technical merit.
we provide a backward-compatible interface for people who do not have the time to re-implement all their string-related code at once, and we try really hard to get it into the standard, then I do not have a thing against it.
Backward compatibility with a broken implementation hardly seems like a worthy goal. Deprecation is a better route IMHO. Even if it does become std::string, it will be a deprecation of the original definition. Deprecation *is* an option.

HTH

--
Dean Michael Berris
about.me/deanberris