Re: [boost] [string] proposal

24 Jan 2011

On Sat, Jan 22, 2011 at 10:43 AM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
...
On Sat, 22 Jan 2011 01:56:36 +0800
Dean Michael Berris <mikhailberis@gmail.com> wrote:
...
...
...
I think strings are different from the encoding they're interpreted
as. Let's fix the problem of a string data structure first, then tack
on encoding/decoding as something that depends on the string
abstraction.
That gets back to the problem that I was originally trying to solve
with the UTF types: that a string needs a way to carry around its
encoding. A UTF-8 type could be built on such a thing very easily.
Hmm... I OTOH don't think the encoding should be part of the string.
The encoding is really external to the string, more like a function
that is applied to the string.
It's a property of the string. It may change, but some encoding (even
if it's just "none") should be associated with a particular string
throughout its existence. Otherwise you might as well use the existing
std::string.
I think I disagree with this. A string is by definition a sequence of
something -- a string of integers, a string of events, a string of
characters. Encoding is not an intrinsic property of a string.

As for using the existing std::string, I think the problem *is*
std::string and the way it's implemented. In particular, I think
allowing mutation of arbitrary individual elements makes users who
don't need that mutation pay the cost of having it. Because of this
requirement, things like SSO, copy-on-write optimizations(?), and all
the other algorithm baggage that comes with the std::string
implementation make it a really bad basic string for the language.

In a world where individual element mutation is a requirement,
std::string may very well be an acceptable implementation. In other
cases where you really don't need to be mutating any character in the
string that's already there, well it's a really bad string
implementation.

For the purpose of interpreting a string as something else, you don't
need mutation -- and hence you gain a lot by having a string that is
immutable but interpretable in many different ways.

Consider the case where for example I want to interpret the same
string as UTF-8 and then later on as UTF-32. In your proposal I would
need to copy the type that has a UTF-8 encoding into another type that
has a UTF-32 encoding. If the copy were trivial and gave no programmer
pause, then that would be a good thing -- which is why an immutable
string is something that your implementation would benefit from, from
a "plumbing" perspective.
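
A minimal sketch of the idea above, under stated assumptions: `istring`,
`as_utf8` and `as_utf32le` are hypothetical names made up for illustration,
not part of any proposal in this thread, and "interpret as UTF-32" is read
here as "view the same bytes as little-endian UTF-32". The point is only
that copying an immutable string can be as cheap as copying a pointer, and
that interpretation can live outside the string type.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical immutable byte string. Copying it only copies a shared_ptr,
// so the bytes are never duplicated and cannot change under a reader.
class istring {
public:
    explicit istring(std::string bytes)
        : data_(std::make_shared<const std::string>(std::move(bytes))) {}

    std::size_t size() const { return data_->size(); }

    // Read-only element access; there is deliberately no mutating overload.
    unsigned char operator[](std::size_t i) const {
        return static_cast<unsigned char>((*data_)[i]);
    }

    // True when both objects refer to the same underlying storage.
    bool shares_storage_with(const istring& other) const {
        return data_ == other.data_;
    }

private:
    std::shared_ptr<const std::string> data_;
};

// Interpretation 1: read the bytes as UTF-8. Assumes well-formed input;
// a real decoder would have to reject malformed sequences.
std::vector<char32_t> as_utf8(const istring& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = s[i++];
        char32_t cp = lead;
        int trailing = 0;
        if (lead >= 0xF0)      { cp = lead & 0x07; trailing = 3; }
        else if (lead >= 0xE0) { cp = lead & 0x0F; trailing = 2; }
        else if (lead >= 0xC0) { cp = lead & 0x1F; trailing = 1; }
        for (; trailing > 0 && i < s.size(); --trailing)
            cp = (cp << 6) | (s[i++] & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Interpretation 2: read the same bytes as little-endian UTF-32.
std::vector<char32_t> as_utf32le(const istring& s) {
    assert(s.size() % 4 == 0 && "UTF-32 needs a multiple of 4 bytes");
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); i += 4) {
        out.push_back(static_cast<char32_t>(s[i]) |
                      static_cast<char32_t>(s[i + 1]) << 8 |
                      static_cast<char32_t>(s[i + 2]) << 16 |
                      static_cast<char32_t>(s[i + 3]) << 24);
    }
    return out;
}
```

Handing the same `istring` to both functions costs one reference-count
increment per copy, and neither interpretation can disturb the other
because nothing can write to the buffer.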
...
...
If you can wrap the string in a UTF-8, UTF-16, UTF-32 encoder/decoder
then that should be the way to go. However, building it into the string
will not scale once other encodings need to be supported -- think
about not just Unicode, but things like Base64, Zip, <insert encoding
here>.
I assume that there is some unique identification for each language and
encoding, or that one could be created. But that's too big a task for
one volunteer developer, so my UTF classes are intended only to handle
the three types that can encode any Unicode code-point.
Sure, but that doesn't mean that you can't design it in a way that
others can extend it appropriately. This was/is the beauty of how the
iterator/range abstraction works out for generic code.
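
To make the extension point concrete, here is a rough sketch, assuming a
design where each encoding is a separate "codec" type that works on
iterator pairs. All names (`latin1_codec`, `hex_codec`, `decode_as`) are
hypothetical and only for illustration; hex stands in for Base64 because it
is shorter to write. Adding a codec does not require touching the string
type or the other codecs.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical codec: every byte is one code point (ISO-8859-1).
struct latin1_codec {
    using value_type = char32_t;

    template <class It, class Out>
    static void decode(It first, It last, Out out) {
        for (; first != last; ++first)
            *out++ = static_cast<unsigned char>(*first);
    }
};

// Another hypothetical codec, written independently of the first:
// pairs of hex digits become bytes.
struct hex_codec {
    using value_type = unsigned char;

    static int nibble(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;  // not a hex digit
    }

    template <class It, class Out>
    static void decode(It first, It last, Out out) {
        while (first != last) {
            int hi = nibble(*first++);
            assert(first != last && "odd number of hex digits");
            int lo = nibble(*first++);
            assert(hi >= 0 && lo >= 0 && "non-hex character");
            *out++ = static_cast<unsigned char>(hi * 16 + lo);
        }
    }
};

// Generic entry point: works with any range that std::begin/std::end
// accept and any type that follows the codec shape used above.
template <class Codec, class Range>
std::vector<typename Codec::value_type> decode_as(const Range& r) {
    std::vector<typename Codec::value_type> out;
    Codec::decode(std::begin(r), std::end(r), std::back_inserter(out));
    return out;
}
```

Usage would look like `decode_as<hex_codec>(std::string("48656c6c6f"))`,
and the same call shape works for any range type, which is the property
the iterator/range abstraction gives generic code.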
...
...
Ultimately the underlying string should be efficient and could be
operated upon in a predictable manner. It should be lightweight so
that it can be referred to in many different situations and there
should be an infinite number of possibilities for what you can use a
string for.
You've just described std::string. Or alternately, std::vector<char>.
Except these are mutable containers which are exactly what I *don't* want.

-- 
Dean Michael Berris
about.me/deanberris