Re: [boost] [string] proposal

26 Jan 2011

      On Wed, Jan 26, 2011 at 10:46 PM, Stewart, Robert
<Robert.Stewart@sig.com> wrote:
...
Dean Michael Berris wrote:
...
On Wed, Jan 26, 2011 at 5:10 PM, Matus Chochlik
<chochlik@gmail.com> wrote:
...
On Wed, Jan 26, 2011 at 9:34 AM, Dean Michael Berris
<mikhailberis@gmail.com> wrote:
...
The immutability *does not* have a thing to do with the problems
in the use-cases described above. Encoding *does*.
Right, so why should the encoding be part of the string then? I say
the encoding should be external to the string (which I've been saying
for the Nth time I think) and just a transformation on an input
string. The transformation doesn't even have to be immediate -- it
could and probably should be lazy. When you have immutable strings the
lazy application of transformations is really a game changer
especially in the way people (at least in C++) are used to when
dealing with strings.
Based upon previous discussion, I think you need to present your case better for immutability.
Right, I think I need to spend a little more time to hash that out.
With something close to 100 messages on the thread already I think
it's about time I did that. Expect a different thread starter then in
the next, oh, few minutes or so.
...
Others consider mutability to be an intrinsic and beneficial characteristic of a string class.  You are proposing to drop that and assume all others implicitly understand why that would be good for all.
I don't think I'm assuming that others will implicitly understand -- I
was more hoping that those reading, or at least those interested in the
proposition, would try to work it out in their heads and maybe
discuss what's unclear to them. ;) That said, it falls on me to be
clearer, I agree. :D
...
...
...
...
Sure, but still I don't see why you need to add c_str() to an
immutable string when you're fine to create an std::string
from that immutable string and call c_str() from the std::string
instance instead?
One word. Performance :)
So you're saying c_str() is a performance enhancing feature? Did you
consider that the reason why std::string concatenation is a bad
performance killer is precisely because it has to support c_str()?
That's an interesting viewpoint.  Note, however, that it is extremely common to build a string, piecewise, and then use it as an array of characters with some OS API.  Those APIs won't change, so c_str(), in some form, is definitely needed.  Furthermore, the piecewise assembly seems likely to be inefficient given an immutable string, especially one from which a contiguous array of characters is needed.  Can you illustrate how that would be done and how it can be as efficient or more so than the status quo?
Right. Here's one attempt at presenting how you would: build an
immutable string, perform (lazy) transformations of that string, and
have a means of linearizing that string into a `char const *` (which
in the end is really what c_str() is):

To build a string, let's borrow from the sstream library in the STL:

  boost::ostringstream ss;
  ss << "This is a string literal"
      << L"this is another literal"
      << instance.some_char_const_ptr()
      << foo_that_returns_an_std_string_perhaps() // can be moved
      << etcetera(); // what it returns can be copied
  boost::string s = ss.str(); // #1

In #1 above, str() returns an immutable string already, which means
that the ostringstream would be building a segmented data structure
that packs chunks together potentially using Boost.Intrusive data
structures -- to make it simple, it would potentially be
statically-sized chunks that lay things out in a memory page's worth
of memory.

So really there's no cost to copy -- well, there will be an increment
on the reference count in the string's metadata block. We can also
implement a function in ss called 'move' which will actually move the
string already built to the holder.
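
To make the builder idea a bit more concrete, here is a minimal sketch.
Everything in it is hypothetical (the names immutable_string,
string_builder, move() and use_count() are mine, not an existing Boost
API), and it uses a plain std::vector of std::string chunks behind a
std::shared_ptr instead of page-sized Boost.Intrusive chunks. It only
shows that copying the finished string is a reference-count increment
and that move() hands the chunks over without copying characters:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical immutable string: shared, read-only segments.
class immutable_string {
public:
    typedef std::vector<std::string> segments;

    immutable_string() : data_(std::make_shared<segments>()) {}
    explicit immutable_string(segments s)
        : data_(std::make_shared<segments>(std::move(s))) {}

    // Total number of chars across all segments.
    std::size_t size() const {
        std::size_t n = 0;
        for (std::size_t i = 0; i < data_->size(); ++i) n += (*data_)[i].size();
        return n;
    }

    // One pass over the segments to produce a contiguous copy.
    std::string to_std_string() const {
        std::string out;
        out.reserve(size());
        for (std::size_t i = 0; i < data_->size(); ++i) out += (*data_)[i];
        assert(out.size() == size());
        return out;
    }

    // Exposed only so the sharing can be observed in this sketch.
    long use_count() const { return data_.use_count(); }

private:
    std::shared_ptr<const segments> data_;  // copying bumps the count only
};

// Hypothetical builder standing in for the proposed ostringstream.
class string_builder {
public:
    string_builder& operator<<(const std::string& piece) {
        segments_.push_back(piece);  // each piece becomes its own segment
        return *this;
    }

    // Hands the accumulated segments to an immutable string without
    // copying characters; the builder is left empty.
    immutable_string move() {
        immutable_string result(std::move(segments_));
        segments_.clear();
        return result;
    }

private:
    immutable_string::segments segments_;
};
```

A real implementation would presumably pack pieces into fixed-size
chunks rather than keep one segment per insertion; that part is left
out here.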

We can implement output iterators that deal with boost::ostringstream
objects to allow existing STL algorithms to stuff data into a
boost::ostringstream using the familiar output iterator type.

Once we have that immutable string, we can start copying it around and
not worry about having to manually synchronize or ensure that the data
is actually copied when dealing with it in different threads. The
reference counting can happen transparently in true RAII fashion which
is a good thing IMO.

Let me try to describe "lazy" transformations then. Let's consider the
case of obtaining a substring of the original string as a lazy
transformation:

  boost::string substring = substr(substr(s, 0, 10), -5, 0);

In a mutable string implementation, substr would *have* to create
another string object in the call to substr(s, 0, 10) then apply the
substr(/* temporary string */, -5, 0) on the temporary, to create yet
another temporary that gets assigned to substring -- just so that you
preserve the invariants on the existing string (s) which could change
at any given point in this nesting of operations. Now consider the
immutable case, where you don't have to make the copies at all: you
encapsulate the bounds in an unspecified type that supports the string
API as well, then later on, once the actual assignment is performed, you
copy the resulting bounds information and voila, you have a substring
from characters (or code points) 5..10 *of the original immutable
string that you can still refer to in the end*. Of course you would
probably want to refer to just the blocks of the original string that
contains these characters, or if the resulting string is short enough
(an optimization point) you can actually break that off as a copy in a
different memory location.
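
A rough sketch of what such a bounds-only substring could look like
follows. The type lazy_substring and its members are hypothetical, the
storage is a single contiguous std::string rather than segments, and
substr here takes (pos, len) like std::string::substr instead of the
signed bounds used in the example above. The point is only that nested
substr calls adjust two offsets and share the original storage; no
characters are copied until str() is called:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical type: records bounds over shared, never-modified storage.
class lazy_substring {
public:
    explicit lazy_substring(std::string text)
        : data_(std::make_shared<std::string>(std::move(text))),
          begin_(0), end_(data_->size()) {}

    // Narrow to [pos, pos + len) relative to this view. Out-of-range
    // values are clamped; only the two offsets change.
    lazy_substring substr(std::size_t pos, std::size_t len) const {
        lazy_substring result(*this);  // shares storage, copies two offsets
        result.begin_ = begin_ + std::min(pos, size());
        result.end_ = result.begin_ + std::min(len, end_ - result.begin_);
        return result;
    }

    std::size_t size() const {
        assert(begin_ <= end_);
        return end_ - begin_;
    }

    // Materialize a contiguous copy only when one is actually needed.
    std::string str() const { return data_->substr(begin_, size()); }

    // True when both views refer to the same underlying storage.
    bool shares_storage_with(const lazy_substring& other) const {
        return data_ == other.data_;
    }

private:
    std::shared_ptr<const std::string> data_;
    std::size_t begin_;
    std::size_t end_;
};
```

With this, s.substr(0, 10).substr(5, 5) yields a view of characters
5..10 of s, and that view still refers to the storage of s.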

I'm not even going as far as I can go with a DSEL for the string which
would allow you to (with something like Proto) determine at compile
time if it was possible to just reduce the nested substring into a
single application of the substring transformation on the string.

The last case I promised to show was how to linearize an immutable
string into something accessible through a `char const *`.

Now that you can see that a string's innards can be implemented in a
segmented data structure, you can then implement an algorithm external
to the string that deals with traversing these segments to turn out a
potentially interned `char const *` that's unique to the immutable
string. This interned `char const *` can be referred to in the
metadata block of the string and will only ever have to be built once.
The interface of that would look something like:

  template <class String>
  char const * linearize(String s) {
    return interned(s);
  }
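
As an illustration only, here is one way the "built once, kept in the
metadata block" part could be sketched. The names interning_string and
metadata are hypothetical, and this version guards the one-time build
with a plain std::mutex rather than anything lock-free, so treat it as
a simplification of the idea, not as the proposed implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical segmented string whose copies share one metadata block.
class interning_string {
    struct metadata {
        std::vector<std::string> segments;  // never modified after construction
        std::mutex guard;                   // protects the one-time build below
        bool linearized;
        std::string linear;                 // contiguous form, built at most once
        metadata() : linearized(false) {}
    };
    std::shared_ptr<metadata> meta_;

public:
    explicit interning_string(std::vector<std::string> segs)
        : meta_(std::make_shared<metadata>()) {
        meta_->segments = std::move(segs);
    }

    // Returns a NUL-terminated pointer that stays valid for as long as any
    // copy of the string is alive. The first call builds the contiguous
    // form; later calls (from any copy) return the same pointer.
    friend char const* linearize(interning_string const& s) {
        assert(s.meta_);
        metadata& m = *s.meta_;
        std::lock_guard<std::mutex> lock(m.guard);
        if (!m.linearized) {
            for (std::size_t i = 0; i < m.segments.size(); ++i)
                m.linear += m.segments[i];
            m.linearized = true;
        }
        return m.linear.c_str();
    }
};
```

Because copies share the metadata block, linearizing one copy makes the
result available to all of them.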

You can do all sorts of lock-free implementations (potentially
leveraging TLS where it matters; for some C APIs TLS is actually
preferable) on the assignment of the linearized block into the
metadata block of s, or in the worst case you can just have a
different data structure to hold the linearized string. Another
interface that would be thread-friendly would be:

  template <class String>
  char const * linearize(String s, void * buf, size_t buf_len) {
    // linearize to the buffer, then
    return static_cast<char const *>(buf);
  }

Since s will never ever change, this doesn't need to synchronize
access to s in any thread.
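
Filling in the body of that second interface could look roughly like
this. The segment_list typedef is a hypothetical stand-in for the
segmented string, and returning a null pointer when the buffer is too
small is my assumption; the interface above doesn't say what should
happen in that case:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a segmented immutable string.
typedef std::vector<std::string> segment_list;

// Copies every segment into the caller's buffer and NUL-terminates it.
// Returns a null pointer if buf is null or too small (assumed behaviour).
// Only reads s, so concurrent calls with distinct buffers need no locking.
inline char const* linearize(segment_list const& s, void* buf, std::size_t buf_len) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < s.size(); ++i) total += s[i].size();
    if (buf == nullptr || buf_len < total + 1) return nullptr;

    char* out = static_cast<char*>(buf);
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::memcpy(out, s[i].data(), s[i].size());
        out += s[i].size();
    }
    assert(static_cast<std::size_t>(out - static_cast<char*>(buf)) == total);
    *out = '\0';
    return static_cast<char const*>(buf);
}
```

The caller decides where the characters end up, for example a
stack-based char array, so no free-store allocation is needed.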
...
...
I'd argue that converting an immutable string object (which doesn't
have to be stored as a contiguous chunk of memory unlike how C strings
are handled) to an std::string can be the (amortized constant) cost of
a segmented traversal of the same string.
That's quite interesting, but I'd argue that creating a std::string, which allocates a buffer on the free store to hold a duplicate of the sequence in the immutable string object can be unnecessary overhead should the immutable string already hold a contiguous array of characters.  Thus, the sequence you suggest -- immutable string object to std::string to contiguous array of characters -- may be unnecessarily inefficient.
That really depends on the interface to the linearization function. If
all you needed was to be able to control where you would linearize the
immutable string to, then maybe my example above allows you to put the
string in a *gulp* stack-based char array. It doesn't necessarily have
to be to a std::string if you don't want to put the data there. ;)

HTH

-- 
Dean Michael Berris
about.me/deanberris