Re: [boost] [string] proposal

24 Jan 2011

On Mon, Jan 24, 2011 at 3:04 PM, Patrick Horgan <phorgan1@gmail.com> wrote:
...
On 01/23/2011 06:34 PM, Dean Michael Berris wrote:
...
... elision by patrick ...
I think I disagree with this. A string is by definition a sequence of
something -- a string of integers, a string of events, a string of
characters. Encoding is not an intrinsic property of a string.
I'm with you here, but to be fair to Chad, you could add to that list a
string of utf-8 encoded characters.  If a string contains things with a
particular encoding, there's value in being able to keep track of whether
it's validly encoded.  It may very well be that a std::string is part of
another type, or that there's some encoding wrapper that lets you see it as
utf-8 in the same way an external iterator lets you look at chars.
Sure, however I personally don't see the value of making the encoding
an intrinsic property of a string object. I still personally think
that encoding/decoding are algorithms applied on data, in which case I
would like string to just be the data. I see the encoding being a
different concern from the type of a data structure -- where I see a
string as basically an encapsulation that represents a collection of
"characters" for any suitable definition of "character".
...
...
As for using the existing std::string, I think the problem *is*
std::string and the way it's implemented. In particular I think
allowing for mutation of individual arbitrary elements makes users
that don't need this mutation pay for the cost of having it. Because
of this requirement, things like SSO, copy-on-write optimizations(?),
and all the other algorithm baggage that comes with the std::string
implementation make it a really bad basic string for the language.
So you're saying that there _also_ needs to be an immutable string type that
wouldn't pay this penalty.
Yes -- and I would argue that the string type that is immutable is a
better start for building algorithms around it than those that are
mutable (of which std::string is one example). I personally think that
building a string is a different concern from dealing with a string
that is already built -- putting both in the same abstract data type
is a little misguided. That's not the case though for other data types
like trees, vectors, stacks, etc.
...
...
In a world where individual element mutation is a requirement,
std::string may very well be an acceptable implementation. In other
cases where you really don't need to be mutating any character in the
string that's already there, well it's a really bad string
implementation.
So what's wrong with having two different strings?
Nothing -- which is why I think if we were going to create a
boost::string, it should be the string that is immutable, because if
you wanted a mutable string, every other string implementation out
there (including std::string) is already mutable. ;)
...
...
For the purpose of interpreting a string as something else, you don't
need mutation -- and hence you gain a lot by having a string that is
immutable but interpretable in many different ways.
Consider the case where for example I want to interpret the same
string as UTF-8 and then later on as UTF-32.
Are you saying that you try it as utf-8, it doesn't decode and then you try
utf-32 to see if it works?  Cause the same string couldn't be both.   Or are
you saying that the string has some underlying encoding but something lets
it be viewed in other encodings, for example it might actually be EUC, but
external iterators let you view it as utf-8 or utf-16 or utf-32 interpreting
on the fly?
I'm saying the string could contain whatever it contains (which is
largely of little consequence) but that you can give a "view" of the
string as UTF-8 if it's valid UTF-8, or UTF-32 if it's valid UTF-32. I
think that encoding/decoding on the fly would be terribly inefficient,
therefore describing precisely what kind of interpretation you need at
the point of interpretation would be a much more scalable approach.

Consider the following:

  template <class String>
  void needs_utf8(String const & s) {
    view<utf8_encoded> utf8_string(s);
    if (!valid(utf8_string)) throw invalid_string("I need a UTF-8 string.");
  }

  template <class String>
  void needs_utf16(String const & s) {
    view<utf16_encoded> utf16_string(s);
    if (!valid(utf16_string)) throw invalid_string("I need a UTF-16 string.");
  }

I would say you have four choices when implementing `view` and `valid`:

1. view converts, and valid is a no-op.
2. view doesn't convert, and valid does the validation on the underlying string.
3. view converts, and valid does the validation on the underlying string.
4. view doesn't convert, but valid does the validation on the view.

I'm leaning towards #2.
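
A minimal sketch of what option #2 could look like, under stated assumptions. The names `view`, `valid` and `utf8_encoded` come from the snippet above, but their definitions here are hypothetical: the view only accepts a `std::string`, `underlying()` is an invented accessor, and the view merely remembers which string to look at. All the work happens in `valid`, which walks the underlying bytes and rejects malformed lead or continuation bytes, truncated sequences, overlong forms, surrogates, and values above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical encoding tag; only the name echoes the snippet above.
struct utf8_encoded {};

// Non-converting view: it stores a pointer to the string, nothing else.
// The caller must keep the string alive for as long as the view is used.
template <class Encoding>
class view {
public:
  explicit view(std::string const & s) : data_(&s) {}
  std::string const & underlying() const { return *data_; }
private:
  std::string const * data_;  // no copy, no conversion
};

// Option #2: validation is done on the underlying bytes.
inline bool valid(view<utf8_encoded> const & v) {
  static const unsigned long min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
  std::string const & s = v.underlying();
  std::size_t const n = s.size();
  std::size_t i = 0;
  while (i < n) {
    unsigned char const c = static_cast<unsigned char>(s[i]);
    std::size_t len;
    unsigned long cp;
    if (c < 0x80)                { len = 1; cp = c; }
    else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
    else return false;              // stray continuation or invalid lead byte
    if (n - i < len) return false;  // sequence runs past the end
    for (std::size_t k = 1; k < len; ++k) {
      unsigned char const cc = static_cast<unsigned char>(s[i + k]);
      if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (cc & 0x3F);
    }
    if (len > 1 && cp < min_cp[len]) return false;   // overlong form
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
    if (cp > 0x10FFFF) return false;                 // beyond Unicode range
    i += len;
  }
  return true;
}

// Usage in the spirit of needs_utf8 above.
inline void example_needs_utf8() {
  std::string const s("caf\xC3\xA9");
  view<utf8_encoded> utf8_string(s);
  assert(valid(utf8_string));
}
```

Constructing the view costs one pointer copy, so a function that never calls `valid` pays nothing for the encoding machinery.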
...
...
In your proposal I would
need to copy the type that has a UTF-8 encoding into another type that
has a UTF-32 encoding. If somehow the copy was trivial and didn't
need to give any programmer pause, then that would be a good thing --
which is why an immutable string is something that your
implementation would benefit from, from a "plumbing" perspective.
You could imagine:
utf8_string u8s;
utf32_string u32s;
// some code that gives a value to u32s
u8s = u32s;  // this would go through a converting constructor or assignment operator
Actually, if you didn't do any "immediate" enforcement of the UTF
invariant on the strings, then the assignment would amount to a
pointer copy.
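
A rough sketch of how that could work, assuming C++11 and a shared, reference-counted buffer. Everything here (`immutable_string`, `encoded_string`, `shares_buffer_with`, the tag types) is a hypothetical name invented for illustration, not part of any proposal in this thread. The converting constructor only re-labels the bytes: it does not transcode and does not validate, so the assignment copies one pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable string: bytes are fixed at construction and
// shared between copies through a reference-counted pointer.
class immutable_string {
public:
  explicit immutable_string(std::string s)
    : data_(std::make_shared<std::string>(std::move(s))) {}

  std::size_t size() const { return data_->size(); }
  char const * data() const { return data_->data(); }

  // True when both objects refer to the very same buffer.
  bool shares_buffer_with(immutable_string const & other) const {
    return data_ == other.data_;
  }
private:
  std::shared_ptr<std::string const> data_;
};

// Hypothetical encoding tags.
struct utf8_encoded {};
struct utf32_encoded {};

// The encoding is a label on top of the bytes, checked lazily (if at all).
template <class Encoding>
class encoded_string {
public:
  explicit encoded_string(immutable_string s) : bytes_(std::move(s)) {}

  // Converting constructor: shares the buffer, no transcoding, no validation.
  template <class Other>
  encoded_string(encoded_string<Other> const & other)
    : bytes_(other.bytes()) {}

  immutable_string const & bytes() const { return bytes_; }
private:
  immutable_string bytes_;
};

inline void example_assignment() {
  encoded_string<utf32_encoded> u32s(immutable_string("hello"));
  encoded_string<utf8_encoded> u8s(immutable_string(""));
  u8s = u32s;  // pointer copy; the bytes themselves are not touched
  assert(u8s.bytes().shares_buffer_with(u32s.bytes()));
}
```

Whether the bytes really are valid in the new encoding is left to a later check at the point of interpretation.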
...
That would be cool.  But what if someone had one of these that represented
an edit buffer and was doing a global search and replace?  I suppose then
the underlying string would not be able to be the immutable one.  Perhaps
the std::string or std::immutable_string would be a template argument to
basic_utf_string<encoding,stringtype>.
I think with an immutable string, you would go about it a different
way -- instead of dealing with the underlying string directly, I would
say that you would represent the edit buffer as a raw buffer of bytes.
...
From the UI perspective (assuming a GUI application) you can do the
rendering based on user preferences, in which case you don't directly
deal with immutable string objects. That would allow the editing to
happen in a different buffer as compared to having it apply on string
objects (which is a bad way to go about it IMO).
The strings would only come into the picture if you're applying
algorithms on the string data -- and/or viewing the strings in a given
encoding -- for example when saving the file, or loading files from
streams.
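
To make the split concrete, here is a small sketch assuming the edit buffer is a plain `std::vector<char>` and the immutable string is approximated by a shared pointer to a const `std::string`. The names `edit_buffer`, `replace_all` and `freeze` are hypothetical. The search and replace works on the raw buffer, and a string object is only created when the contents are frozen, for example just before saving.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical edit buffer: just raw, mutable bytes.
typedef std::vector<char> edit_buffer;

// Global search and replace on the raw buffer. A new buffer is built
// and then swapped in, so `buf` ends up holding the replaced text.
inline void replace_all(edit_buffer & buf,
                        std::string const & from,
                        std::string const & to) {
  if (from.empty()) return;  // nothing sensible to search for
  edit_buffer out;
  edit_buffer::const_iterator pos = buf.begin();
  for (;;) {
    edit_buffer::const_iterator hit =
        std::search(pos, buf.cend(), from.begin(), from.end());
    out.insert(out.end(), pos, hit);  // copy the text before the match
    if (hit == buf.cend()) break;     // no further matches
    out.insert(out.end(), to.begin(), to.end());
    pos = hit + static_cast<edit_buffer::difference_type>(from.size());
  }
  buf.swap(out);
}

// Snapshot the buffer into an immutable string, e.g. right before saving.
inline std::shared_ptr<std::string const> freeze(edit_buffer const & buf) {
  return std::make_shared<std::string>(buf.begin(), buf.end());
}

inline void example_edit_then_freeze() {
  std::string const text("a cat and a cat");
  edit_buffer buf(text.begin(), text.end());
  replace_all(buf, "cat", "dog");
  std::shared_ptr<std::string const> saved = freeze(buf);
  assert(*saved == "a dog and a dog");
}
```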
...
...
...
...
If you can wrap the string in a UTF-8, UTF-16, UTF-32 encoder/decoder
then that should be the way to go. However building it into the string
is not something that will scale in case there are other encodings
that would be supported -- think about not just Unicode, but things
like Base64, Zip, <insert encoding here>.
I assume that there is some unique identification for each language and
encoding, or that one could be created. But that's too big a task for
one volunteer developer, so my UTF classes are intended only to handle
the three types that can encode any Unicode code-point.
Sure, but that doesn't mean that you can't design it in a way that
others can extend it appropriately. This was/is the beauty of how the
iterator/range abstraction works out for generic code.
That's a wonderful idea, you could design it to work with stateful
encodings like JIS and EUC and non-stateful encodings like the utf
encodings.
And if you get the efficiency win of the immutable strings along with
it, then that's a double win IMO. :)
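
One way the extension point could look, sketched with a traits template keyed on an encoding tag. All names (`encoding_traits`, `valid_as`, the tag types) are hypothetical. The string type stays untouched; someone adding a new encoding only supplies a tag and a traits specialization. The Base64 check below is purely syntactic (alphabet, length, padding position), and stateful encodings would need a decoder-state object that this sketch does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Primary template is left undefined: an unknown encoding fails to compile.
template <class Encoding> struct encoding_traits;

// An encoding shipped with the (hypothetical) library.
struct ascii_encoded {};
template <> struct encoding_traits<ascii_encoded> {
  static bool valid(std::string const & bytes) {
    for (char ch : bytes)
      if (static_cast<unsigned char>(ch) > 0x7F) return false;
    return true;
  }
};

// An encoding added later by a user, without touching the string type.
struct base64_encoded {};
template <> struct encoding_traits<base64_encoded> {
  static bool valid(std::string const & bytes) {
    if (bytes.size() % 4 != 0) return false;  // padded Base64 comes in quads
    std::size_t pad = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
      char const ch = bytes[i];
      if (ch == '=') {
        if (i + 2 < bytes.size()) return false;  // '=' only in the last two
        ++pad;
        continue;
      }
      bool const in_alphabet =
          (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z') ||
          (ch >= '0' && ch <= '9') || ch == '+' || ch == '/';
      if (!in_alphabet || pad > 0) return false;  // no data after padding
    }
    return true;
  }
};

// Generic entry point: dispatches on the encoding tag at compile time.
template <class Encoding>
bool valid_as(std::string const & bytes) {
  return encoding_traits<Encoding>::valid(bytes);
}

inline void example_extension() {
  assert(valid_as<ascii_encoded>("hello"));
  assert(valid_as<base64_encoded>("aGVsbG8="));
}
```

This mirrors how generic code is written against iterators and ranges: algorithms depend on the traits interface, not on a closed list of encodings.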
...
...
...
...
Ultimately the underlying string should be efficient and could be
operated upon in a predictable manner. It should be lightweight so
that it can be referred to in many different situations and there
should be an infinite number of possibilities for what you can use a
string for.
You've just described std::string. Or alternately, std::vector<char>.
Except these are mutable containers which are exactly what I *don't* want.
But of course as you said before that if you _do_ want mutability then
std::string is acceptable.  It seems that we just need a lighter weight
immutable addition to the fold.
Yup, which is what I think boost::string should be in the first place. ;)

-- 
Dean Michael Berris
about.me/deanberris