
On Wed, Jan 26, 2011 at 10:46 PM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Dean Michael Berris wrote:
On Wed, Jan 26, 2011 at 5:10 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Jan 26, 2011 at 9:34 AM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
The immutability *does not* have a thing with the problems in the use-cases described above. Encoding *does*.
Right, so why should the encoding be part of the string then? I say the encoding should be external to the string (which I've been saying for the Nth time I think) and just a transformation on an input string. The transformation doesn't even have to be immediate -- it could and probably should be lazy. When you have immutable strings the lazy application of transformations is really a game changer especially in the way people (at least in C++) are used to when dealing with strings.
Based upon previous discussion, I think you need to present your case better for immutability.
Right, I think I need to spend a little more time to hash that out. With something close to 100 messages on the thread already I think it's about time I did that. Expect a different thread starter then in the next, oh, few minutes or so.
Others consider mutability to be an intrinsic and beneficial characteristic of a string class. You are proposing to drop that and assume all others implicitly understand why that would be good for all.
I don't think I'm assuming that others will implicitly understand -- I was more hoping that those reading or at least are interested in the proposition would likely try to work it out in their heads and maybe discuss what's unclear to them. ;) That said, it falls on me to be clearer I agree. :D
Sure, but still I don't see why you need to add c_str() to an immutable string when you're fine to create an std::string from that immutable string and call c_str() from the std::string instance instead?
One word. Performance :)
So you're saying c_str() is a performance enhancing feature? Did you consider that the reason why std::string concatenation is a bad performance killer is precisely because it has to support c_str()?
That's an interesting viewpoint. Note, however, that it is extremely common to build a string, piecewise, and then use it as an array of characters with some OS API. Those APIs won't change, so c_str(), in some form, is definitely needed. Furthermore, the piecewise assembly seems likely inefficient given and immutable string, especially one from which a contiguous array of characters is needed. Can you illustrate how that would be done and how it can be as efficient or more so than the status quo?
Right. Here's one attempt at presenting how you would: build an immutable string, perform (lazy) transformations of that string, and have a means of linearizing that string into a `char const *` (which in the end is really what c_str() is): To build a string, let's borrow from the sstream library in the STL: boost::ostringstream ss; ss << "This is a string literal" << L"this is another literal" << instance.some_char_const_ptr() << foo_that_returns_an_std_string_perhaps() // can be moved << etcetera(); // what it returns can be copied boost::string s = ss.str(); // #1 In #1 above, str() returns an immutable string already, which means that the ostringstream would be building a segmented data structure that packs chunks together potentially using Boost.Intrusive data structures -- to make it simple, it would potentially be statically-sized chunks that lay things out in a memory page's worth of memory. So really there's no cost to copy -- well, there will be an increment on the reference count in the string's metadata block. We can also implement a function in ss called 'move' which will actually move the string already built to the holder. We can implement output iterators that deal with boost::ostringstream objects to allow existing STL algorithms to stuff data into an boost::ostringstream using the familiar output iterator type. Once we have that immutable string, we can start copying it around and not worry about having to manually synchronize or ensure that the data is actually copied when dealing with it in different threads. The reference counting can happen transparently in true RAII fashion which is a good thing IMO. Let me try to describe "lazy" transformations then. Let's consider the case of obtaining a substring of the original string as a lazy transformation: boost::string substring = substr(substr(s, 0, 10), -5, 0); In a mutable string implementation, substr would *have* to create another string object in the call to substr(s, 0, 10) then apply the substr(/* temporary string */, -5, 0) on the temporary, to create yet another temporary that gets assigned to substring -- just so that you preserve the invariants on the existing string (s) which could change at any given point in this nesting of operations. Now consider the immutable case where you didn't have to make the copies at all, and encapsulate the bounds in an unspecified type that supports the string API as well, then later on once the actual assignment is performed you copy the resulting bounds information and voila you have a substring from characters (or code points) 5..10 *of the original immutable string that you can still refer to in the end*. Of course you would probably want to refer to just the blocks of the original string that contains these characters, or if the resulting string is short enough (an optimization point) you can actually break that off as a copy in a different memory location. I'm not even going as far as I can go with a DSEL for the string which would allow you to (with something like Proto) determine at compile time if it was possible to just reduce the nested substring into a single application of the substring transformation on the string. The last case I promised to show was how to linearize an immutable string into something accessible through a `char const *`. Now that you can see that a string's innards can be implemented in a segmented data structure, you can then implement an algorithm external to the string that deals with traversing these segments to turn out a potentially interned `char const *` that's unique to the immutable string. This interned `char const *` can be referred to in the metadata block of the string and will only ever have to be built once. The interface of that would look something like: template <class String> char const * linearize(String s) { return interned(s); } You can do all sorts of lock-free implementations (potentially leveraging TLS where it matters (for some C APIs TLS is actually preferrable)) on the assignment of the linearized block into the metadata block of s, or in the worst case you can just have a different data structure to hold the linearized string. Another interface that would be thread-friendly would be: template <class String> char const * linearize(String s, void * buf, size_t buf_len) { // linearize to the buffer, then return static_cast<char const *>(buf); } Since s will never ever change, this doesn't need to synchronize access to s in any thread.
I'd argue that converting an immutable string object (which doesn't have to be stored as a contiguous chunk of memory unlike how C strings are handled) to an std::string can be the (amortized constant) cost of a segmented traversal of the same string.
That's quite interesting, but I'd argue that creating a std::string, which allocates a buffer on the free store to hold a duplicate of the sequence in the immutable string object can be unnecessary overhead should the immutable string already hold a contiguous array of characters. Thus, the sequence you suggest -- immutable string object to std::string to contiguous array of characters -- may be unnecessarily inefficient.
That really depends on the interface to the linearization function. If all you needed was to be able to control where you would linearize the immutable string to, then maybe my example above allows you to put the string in a *gulp* stack-based char array. It doesn't necessarily have to be to a std::string if you don't want to put the data there. ;) HTH -- Dean Michael Berris about.me/deanberris