
On Thu, Jan 27, 2011 at 8:45 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Thu, Jan 27, 2011 at 12:09 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
But why do you need to separate text encoding from encoding in general? Here's the logic:
In general? Nothing. I do not have (nor did I have in the past) anything against a general, efficient, encoding-agnostic string if it is called general_string. But std::string IMO is, and always has been, primarily about handling text. I certainly do not know anyone who would store an MPEG inside std::string.
std::string has not been about handling text -- it's about encapsulating the notion of a sequence of characters with a suitable definition of `character`. You have string algorithms that apply mostly to strings -- pattern matching, slicing/concatenation, character location, tokenization, etc. The notion of "text" is actually a higher-level concept which imbues a string with things like encoding, language, locality, etc., all of which live at a different level. As for people storing <encoded> data inside a string, note that most text-based protocols now transfer things in Base64 or Base32 or some variant of those encodings -- precisely so that they can be dealt with as character sequences. If you were catching an XMPP stream-fed, Base64-encoded H.264 video stream, why not put it in a string? I wouldn't put it in std::string if I had any *sane* choice, because it's just broken IMO, but like most people who intend to do things with data in memory obtained from a character stream, you put it in a string.
You have a sequence of bytes (in a string). You want to interpret that sequence of bytes in a given encoding (with a view).
Why does the encoding have to apply only to text?
Encoding does not have to apply only to text, but my vision (let's call it that) is that the "everyday" handling of text would use a single encoding. There are people who have invested a whole lotta love :) and time into making that possible, and they are generally called the Unicode consortium. C++1x already adopts part of their work via the u"" and U"" literals, because it has countless advantages. Why not take one more step in that direction and use it for the 'string' type by default?
So the literals are already encoded, and guess what: they're still a sequence of bytes. The only "sane" way to deal with them is to provide an appropriate *view* of the encoded data at the appropriate level of abstraction. A string, I argue, is *not* that level of abstraction.
So what's the difference between a string for encoding human readable text and a string that handles raw data?
Usability. It is usually more difficult to use the super-generic, everything-solving things. I repeat, for probably the tenth time, that I'm not against such a string in general, but this is not std::string.
Usability of what, the type? Any type is as usable as any other the way I see it -- they're all just types. So aside from aesthetic/cosmetic differences, what's the point?
So what's wrong with:
    view<some_encoding_0> x = get_x();
    view<some_encoding_1> y = get_y();
    view<some_encoding_3> z = x + y;
    float w = log(as<acme_float_encoding>(z));
Unnecessary verbosity.
What verbosity? We deal with that through typedefs and descriptive names. Heck, C++0x has auto, so I don't know what 'verbosity' you're referring to. And if you really wanted to know the encoding of the data from the type, how else would you do it?
Do you really want all the people that now do:
    struct person {
        std::string name;
        std::string middle_name;
        std::string family_name;
        // .. etc.
    };
to do this ?
    struct person {
        boost::view<some_encoding_tag> name;
        boost::view<some_encoding_tag> middle_name;
        boost::view<some_encoding_tag> family_name;
        // .. etc.
    };
Well:

    typedef boost::strings::view<boost::strings::utf8_encoding> utf8_string;

    struct person {
        utf8_string name, middle_name, family_name;
    };

Where's the verbosity in that?
See, there's absolutely 0 reason why you *have* to deal with a raw sequence of bytes if what you really want is to deal with a view of these bytes from the outset.
Again I ask, am I missing something here?
Please see the example above.
I did and I saw an even more succinct way of doing it. So again, I don't see what I'm missing here.
[snip/]
Right, what I meant to say is that it hardly has any bearing when we're talking about engineering solutions. So your circumstances and mine may very well be different, but that doesn't change that we're trying to solve the same problem. :)
If, along with solving your problem (all the completely valid points that you had about performance), we also solve my and others' problems (completely valid points about the encoding), and we think about acceptability and "adoptability",
I don't know what "acceptability" and "adoptability" mean in this context. Both of these are a matter of taste and not of technical merit.
we provide a backward-compatible interface for people who do not have the time to re-implement all their string-related code at once, and we try really hard to get it into the standard, then I do not have a thing against it.
Backward compatibility with a broken implementation hardly seems like a worthy goal. Deprecation is a better route IMHO. Even if it does become std::string, it will be a deprecation of the original definition. Deprecation *is* an option.

HTH

--
Dean Michael Berris
about.me/deanberris