Strings and thread safety (was: [lexical_cast] optimization commited to HEAD)

Hello,

It just occurred to me that the thread safety problems with std::string are caused by the fact that the library is trying to decode the user's intention from what methods he calls, instead of letting him declare it and checking at compile time that he is complying with the declared intention.

I'm thinking of a solution in which the user has the choice of an immutable string, which behaves much like a ref-counted "const char *" (or a String in Java), and a mutable string like Java's StringBuilder. Something similar was also proposed during the discussion on SuperString, and thread safety now gives it a further motivation.

For performance reasons (which matter here; otherwise a basic deep-copy implementation would suffice to solve all problems), I think a third usage type can be added: temporary strings, which are used to mimic move semantics until they are introduced into the language.

Now some sketched code, to explain better what I mean (I left out a lot of details, such as the templated charT, charTraits and allocator):

    namespace boost { namespace strings {

    namespace detail { class repr {}; }  // ref-counted representation, details omitted
    using detail::repr;

    class imm_string;
    class string_builder;

    // the movable, temporary string
    // invariant: rep has only one ref
    class temp_string {
        mutable intrusive_ptr<repr> rep;
        // used only by the other two classes in release()
        friend class imm_string;
        friend class string_builder;
        temp_string(intrusive_ptr<repr> r) : rep(r) {}
    public:
        temp_string(const temp_string& t) : rep(t.take()) {}  // move semantics
        // the next two are defined once both classes are complete,
        // as rep(r.rep->clone())
        temp_string(const string_builder& r);  // always deep copy builders
        temp_string(const imm_string& r);      // deep copy here

        // this is called when the ownership of the rep is going to be taken
        intrusive_ptr<repr> take() const {
            intrusive_ptr<repr> r = rep;
            rep.reset();
            return r;
        }

        // all mutable operations that need to be chained are defined here
        temp_string& append(const temp_string&);
        temp_string& append(const imm_string&);
        temp_string& append(const string_builder&);
    };

    // all free functions and free operators return temp_string
    template<typename S1, typename S2>
    temp_string operator+(const S1& a, const S2& b) {
        temp_string r(a);
        r.append(b);
        return r;
    }

    // the immutable string behaves as const char *
    // the content cannot be changed, but you can assign a new content to it.
    // release() can be used when we want to cast from one string type to another,
    // destroying the source (it is a move-semantic cast).
    // otherwise the cast is implicit, but does a deep copy
    class imm_string {
        friend class temp_string;
        intrusive_ptr<repr> rep;
    public:
        imm_string(const temp_string& t) : rep(t.take()) {}  // move temp -> imm
        // autogenerated copy constructor and operator= do shallow copy
        char operator[](unsigned i) const { return rep->at(i); }
        temp_string release() {  // see comment above the class declaration
            intrusive_ptr<repr> t = rep;
            rep.reset();
            // refs() is assumed to be the reference count kept by repr
            if (t->refs() == 1) return temp_string(t);
            return temp_string(t->clone());
        }
    };

    // the string builder behaves as vector<char>
    // the content can be changed, and you can assign a new content to it
    // release() can be used when we want to cast from one string type to another,
    // destroying the source (it is a move-semantic cast).
    class string_builder {
        friend class temp_string;
        intrusive_ptr<repr> rep;
    public:
        string_builder(const temp_string& t) : rep(t.take()) {}  // move temp -> builder
        string_builder(const string_builder& r) : rep(r.rep->clone()) {}  // always deep copy builders
        string_builder& operator=(const temp_string& t) {
            rep = t.take();
            return *this;
        }
        string_builder& operator=(const string_builder& t) {
            if (rep->capacity() > t.rep->size()) rep->copy(t.rep);
            else rep = t.rep->clone();
            return *this;
        }
        // non-chainable mutating operators are defined here
        char& operator[](unsigned i) { return rep->at(i); }
        temp_string release() {  // see comment above the class declaration
            intrusive_ptr<repr> t = rep;
            rep.reset();
            // we can assume as invariant that string_builder always owns the rep (refcount==1)
            return temp_string(t);
        }
        // mutating operators and methods are defined in terms of temp_string
        template<typename S2>
        string_builder& operator+=(const S2& r) {
            temp_string t(release());
            return this->operator=(t + r);
        }
        template<typename S2>
        string_builder& append(const S2& r) {
            temp_string t(release());
            t.append(r);
            return this->operator=(t);
        }
    };

    }}

This scheme addresses the performance problems noted in http://www.sgi.com/tech/stl/string_discussion.html for reference counted strings with unshareable state (like the g++ implementation), because now shareable/unshareable state is assigned by the user at compile time, and can be explicitly changed. In this way the user will know what to expect from the performance point of view, and the semantics will be (to my eyes) clearer.

Corrado
-- 
dott. Corrado Zoccolo mailto: czoccolo (at) gmail.com
PhD - Department of Computer Science - University of Pisa, Italy

Corrado Zoccolo wrote:
> Hello, It just occurred to me that the thread safety problems with std::string are caused by the fact that the library is trying to decode the user's intention from what methods he calls, instead of letting him declare it and checking at compile time that he is complying with the declared intention.
This may be the resulting reason, but the primary reason is that the C++ std lib isn't designed to be threadsafe. Getting back to lexical cast for a moment, as I think I said somewhere in the thread, string is not your only problem: std::stringstream is not thread safe either....so whatever gets done in this thread we won't be making lexical_cast threadsafe until you fix iostreams...
> I'm thinking of a solution in which the user has the choice of an immutable string, which behaves much like a ref-counted "const char *" (or a String in Java), and a mutable string like Java's StringBuilder. Something similar was also proposed during the discussion on SuperString, and thread safety now gives it a further motivation.
> For performance reasons (which matter here; otherwise a basic deep-copy implementation would suffice to solve all problems), I think a third usage type can be added: temporary strings, which are used to mimic move semantics until they are introduced into the language.
I guess. I have to say that after doing some pretty heavy string coding in Java I personally don't like having to think about the distinction between the 'builder' and the 'string'. Perhaps I'm just brainwashed to the C++ way, but it's nice to just write std::string and not worry about whether the code I write tomorrow will be modifying it. Having to mess with another type is clunky and requires refactoring previous thinking. But, I do understand the reasons and respect that some programmers prefer this approach. So far the performance tests I wrote during the super_string discussion don't give a compelling edge to boost const_string -- it could just be a problem with the implementation, but at the moment it mostly loses to std::string.
> Now some sketched code, to explain better what I mean (I left out a lot of details, such as the templated charT, charTraits and allocator):
> <snip details>
Maybe a usage example of this code would be useful in understanding? I guess the point is that since there is no mutating operator[] on imm_string it means that all query operations are threadsafe?
> This scheme addresses the performance problems noted in http://www.sgi.com/tech/stl/string_discussion.html for reference counted strings with unshareable state (like the g++ implementation), because now shareable/unshareable state is assigned by the user at compile time, and can be explicitly changed. In this way the user will know what to expect from the performance point of view, and the semantics will be (to my eyes) clearer.
This seems like a reasonable approach for those that want to go the immutable route....care to develop it into a full blown proposal? Jeff
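A usage sketch along the lines asked for above. It is not the proposed Boost code: it only approximates the three roles with standard C++11 types, under stated assumptions. std::shared_ptr<const std::string> stands in for imm_string, a plain std::string for string_builder, and std::move for the temp_string hand-off. The names imm_str, make_imm and build_greeting are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for imm_string: shared, read-only contents.
// Copying an imm_str copies a pointer and bumps a reference count.
using imm_str = std::shared_ptr<const std::string>;

// Freeze a builder into an immutable string. The buffer is handed over
// rather than cloned, which is the role release() plays in the proposal.
inline imm_str make_imm(std::string&& builder) {
    return std::make_shared<std::string>(std::move(builder));
}

// Build with a mutable string owned by one thread, then publish
// the result as an immutable, shareable value.
inline imm_str build_greeting(const std::string& name) {
    std::string builder;
    builder.reserve(name.size() + 7);
    builder += "Hello, ";
    builder += name;
    return make_imm(std::move(builder));
}
```

Once published, the contents can only be read through the const pointer, so any number of threads may query the same imm_str without further locking, provided each thread works on its own copy of the handle.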

On 9/3/06, Jeff Garland <jeff@crystalclearsoftware.com> wrote:
> Corrado Zoccolo wrote:
> > Hello, It just occurred to me that the thread safety problems with std::string are caused by the fact that the library is trying to decode the user's intention from what methods he calls, instead of letting him declare it and checking at compile time that he is complying with the declared intention.
> This may be the resulting reason, but the primary reason is that the C++ std lib isn't designed to be threadsafe. Getting back to lexical cast for a moment, as I think I said somewhere in the thread, string is not your only problem: std::stringstream is not thread safe either....so whatever gets done in this thread we won't be making lexical_cast threadsafe until you fix iostreams...
I can attest to your statement about std::stringstream, Jeff. We made both a string w/o reference counting and a stringstream that wraps up the string in the same way as std::stringstream does to std::string to get around the thread safety issue we have been discussing here. Perhaps there is more to std::stringstream in this regard, as I remember you talked about some global variables in it, but it appeared the issue stopped at replacing it with the customized stringstream (in addition to replacing std::string). <snip>
Jeff _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
-- Thanks, Greg

Gregory Dai wrote:
> On 9/3/06, Jeff Garland <jeff@crystalclearsoftware.com> wrote:
> > Corrado Zoccolo wrote:
> > > Hello, It just occurred to me that the thread safety problems with std::string are caused by the fact that the library is trying to decode the user's intention from what methods he calls, instead of letting him declare it and checking at compile time that he is complying with the declared intention.
> > This may be the resulting reason, but the primary reason is that the C++ std lib isn't designed to be threadsafe. Getting back to lexical cast for a moment, as I think I said somewhere in the thread, string is not your only problem: std::stringstream is not thread safe either....so whatever gets done in this thread we won't be making lexical_cast threadsafe until you fix iostreams...
> I can attest to your statement about std::stringstream, Jeff. We made both a string w/o reference counting and a stringstream that wraps up the string in the same way as std::stringstream does to std::string to get around the thread safety issue we have been discussing here. Perhaps there is more to std::stringstream in this regard, as I remember you talked about some global variables in it, but it appeared the issue stopped at replacing it with the customized stringstream (in addition to replacing std::string).
Well, I suppose you could replace std::stringstream with a custom one which is sure to be threadsafe, but I'd bet all of the performance gain would be gone by the time you were done. Anyway, my whole objection is to the 'claim' that 'thread safety in lexical_cast' had somehow been lost. lexical_cast was NEVER thread safe to start -- not one word in the docs indicates that it is thread-safe, and the implementation clearly isn't because of several things that are used. So clients using lexical_cast from multiple threads should beware... Jeff

> lexical_cast was NEVER thread safe to start -- not one word in the docs indicates that it is thread-safe, and the implementation clearly isn't because of several things that are used. So clients using lexical_cast from multiple threads should beware...
There's something I want to clear up: lexical_cast is reentrant and only uses "stack" variables, no? It looks like the "thread-dangerosity" you're speaking about is the same we would have by using two different stringstream objects in two different threads... Enlighten me if I missed something.
Philippe

Philippe Vaucher wrote:
> > lexical_cast was NEVER thread safe to start -- not one word in the docs indicates that it is thread-safe, and the implementation clearly isn't because of several things that are used. So clients using lexical_cast from multiple threads should beware...
> There's something I want to clear up: lexical_cast is reentrant and only uses "stack" variables, no?
So what? Those stack types are free to reference global data. And in the case of stringstream, in combination with operator<< and operator>>, they most certainly do access global data in the form of locales and facets.
> It looks like the "thread-dangerosity" you're speaking about is the same we would have by using two different stringstream objects in two different threads...
Yep.
> Enlighten me if I missed something.
Well again, please show me in the lexical_cast documentation the line that indicates that it is a 'thread-safe' operation. You won't find it because it isn't. You have the issues of both the internal implementation and the types being converted to/from (e.g. string) using global data in their implementation. Since lexical_cast is generic, it has no way of controlling the thread safety of user defined types and their associated operator>> and operator<<. The real problem here is that lexical_cast *will work* most of the time in an MT environment without difficulty. But that's not good enough to declare it 'thread-safe'...which is the entire assumption in this thread. Actually, as this whole discussion illustrates, it's probably enough to make it 'thread-dangerous' ;-)
Jeff
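A small sketch of the point about stack objects and global data, using only the standard library. A locally constructed stream is imbued with a copy of the global locale at construction, so its output depends on state that other code can change through std::locale::global. The names comma_decimal and format_double are hypothetical.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet that prints ',' as the decimal point.
struct comma_decimal : std::numpunct<char> {
    char do_decimal_point() const override { return ','; }
};

// Uses nothing but a local stream, yet the stream picks up the
// global locale when it is constructed.
inline std::string format_double(double value) {
    std::ostringstream out;
    out << value;
    return out.str();
}
```

Installing a locale that carries comma_decimal with std::locale::global changes what format_double(1.5) returns, even though the function itself touches only stack variables. If another thread swaps the global locale while conversions are running, nothing in the standard of that era says what happens.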

> So what? Those stack types are free to reference global data. And in the case of stringstream, in combination with operator<< and operator>>, they most certainly do access global data in the form of locales and facets.
Sure, I wanted to clear that up :)
> Well again, please show me in the lexical_cast documentation the line that indicates that it is a 'thread-safe' operation. You won't find it because it isn't. You have the issues of both the internal implementation and the types being converted to/from (e.g. string) using global data in their implementation. Since lexical_cast is generic, it has no way of controlling the thread safety of user defined types and their associated operator>> and operator<<.
I never said lexical_cast was thread safe :) I've heard that on some compilers you can tell the stdlib to be threadsafe, though. Thank you for your explanations.
Philippe

On 9/8/06, Jeff Garland <jeff@crystalclearsoftware.com> wrote:
> [snipped]
> > There's something I want to clear up: lexical_cast is reentrant and only uses "stack" variables, no?
> So what? Those stack types are free to reference global data. And in the case of stringstream, in combination with operator<< and operator>>, they most certainly do access global data in the form of locales and facets.
But I believe almost every STL implementation has thread-safe iostreams, with locked access to global locales and so on. I know that VC 7.1 and STLport do. Without it, it is almost impossible to write thread-correct C++ programs. Or am I missing something?
> [snipped]
> The real problem here is that lexical_cast *will work* most of the time in an MT environment without difficulty. But that's not good enough to declare it 'thread-safe'...which is the entire assumption in this thread. Actually, as this whole discussion illustrates, it's probably enough to make it 'thread-dangerous' ;-)
It is only thread-safe if what it uses has the basic thread-safety guarantees, which, as far as I know, is true in every implementation that cares about writing thread-safe programs.
> Jeff
best regards,
-- 
Felipe Magno de Almeida

Felipe Magno de Almeida wrote:
> On 9/8/06, Jeff Garland <jeff@crystalclearsoftware.com> wrote:
> > [snipped]
> > > There's something I want to clear up: lexical_cast is reentrant and only uses "stack" variables, no?
> > So what? Those stack types are free to reference global data. And in the case of stringstream, in combination with operator<< and operator>>, they most certainly do access global data in the form of locales and facets.
> But I believe almost every STL implementation has thread-safe iostreams, with locked access to global locales and so on. I know that VC 7.1 and STLport do. Without it, it is almost impossible to write thread-correct C++ programs. Or am I missing something?
It's not portable because the standard says nothing about threading or thread safety in I/O. And as I recall I've seen at least one implementation that supports compiling out all thread safety, because it is a huge performance hit to make I/O threadsafe. So, again it may mostly work, but it isn't assured by anything in the standard. So in the context of lexical_cast, nothing is assured...
> > [snipped]
> > The real problem here is that lexical_cast *will work* most of the time in an MT environment without difficulty. But that's not good enough to declare it 'thread-safe'...which is the entire assumption in this thread. Actually, as this whole discussion illustrates, it's probably enough to make it 'thread-dangerous' ;-)
> It is only thread-safe if what it uses has the basic thread-safety guarantees, which, as far as I know, is true in every implementation that cares about writing thread-safe programs.
If it were me, I'd guard any use of lexical cast I thought was to be used in an MT context because it's a time-bomb waiting to fail. I'm sorry to keep going on about this, but the whole thread that spawned this one had the embedded presumption that lexical_cast is thread-safe and I wanted to dispel that myth -- it's obviously deeply entrenched. Jeff
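One way to read "guard any use of lexical cast" is to serialize every conversion behind a single mutex. The sketch below is an assumption-laden illustration, not Boost code: guarded_cast and conversion_mutex are hypothetical names, and the body is a plain stringstream round trip in the style of lexical_cast rather than a call into the library.

```cpp
#include <cassert>
#include <mutex>
#include <sstream>
#include <stdexcept>
#include <string>

// One mutex shared by every instantiation of guarded_cast.
inline std::mutex& conversion_mutex() {
    static std::mutex m;
    return m;
}

// Hypothetical helper: stream the source in, stream the target out,
// and reject input that is not consumed completely.
template <typename Target, typename Source>
Target guarded_cast(const Source& value) {
    std::lock_guard<std::mutex> lock(conversion_mutex());
    std::stringstream stream;
    Target result{};
    if (!(stream << value) || !(stream >> result) || !(stream >> std::ws).eof())
        throw std::runtime_error("guarded_cast: bad conversion");
    return result;
}
```

The lock only protects conversions that go through this helper. Code elsewhere that changes the global locale, or user-defined operator<< and operator>> that touch their own shared state, would still need to cooperate, and the serialization costs throughput.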

Jeff Garland said: (by the date of Fri, 08 Sep 2006 10:24:24 -0700)
> If it were me, I'd guard any use of lexical cast I thought was to be used in an MT context because it's a time-bomb waiting to fail. I'm sorry to keep going on about this, but the whole thread that spawned this one had the embedded presumption that lexical_cast is thread-safe and I wanted to dispel that myth -- it's obviously deeply entrenched.
Sorry about that, it's just that when I hear "internal buffer" a bell rings in my head and I think "thread safety". It's only because I'm currently learning how to do concurrent programming.
-- 
Janek Kozicki

On 9/3/06, Jeff Garland <jeff@crystalclearsoftware.com> wrote:
> <snip previous discussion>
> > This scheme addresses the performance problems noted in
> > http://www.sgi.com/tech/stl/string_discussion.html for reference
> > counted strings with unshareable state (like the g++ implementation),
> > because now shareable/unshareable state is assigned by the user at
> > compile time, and can be explicitly changed. In this way the user will
> > know what to expect from the performance point of view, and the
> > semantics will be (to my eyes) more clear.
>
> This seems like a reasonable approach for those that want to go the immutable
> route....care to develop it into a full blown proposal?

I developed my ideas into three concrete classes. I uploaded them to the boost vault as imm_string_and_builder.zip (under Strings - Text Processing). The tiny url to the zip file is http://tinyurl.com/hznt4 . I added a test file to compare the copy/access cost of my string implementations with std::string and const char *.

From my tests, splitting the string abstraction into immutable string and string builder allows a more efficient implementation. My test results are:

1) compiling without threads configured (BOOST_HAS_THREADS undefined), copying an immutable string has a small overhead over plain const char *
2) compiling with BOOST_USE_ASM_ATOMIC_H, imm_string and string_builder always outperform std::string (I'm using the implementation delivered with g++ 4.0.1). Overhead is large w.r.t. const char * (due to the high cost of lock operations on Pentium IVs), but passing a const & to imm_string is as fast as passing a const char * (not true for std::string).
3) using move semantics over string builders achieves the same performance as copying immutable strings (see below for the actual numbers)

I'm interested in seeing the test results on other platforms/compilers.

It is interesting to note that with this split of the abstraction into different classes, the thread safety properties become compatible with POSIX rules (that was my original goal):

* accessing an object as read only from multiple threads is safe
* even writing to different areas of a string builder (without changing its length) is MT-safe, like writing to a preallocated char []

I think that, if the upcoming standard deals with threads, some thread-safety issues in the standard library will need to be addressed. I think that for string, these abstractions will be a useful starting point.

Corrado

Test results for BOOST_USE_ASM_ATOMIC_H, on a Pentium IV 2.8GHz with HT

Baseline (const char *, without atomic count)
    Took 0.09 s on PKc
Baseline (const char *, with atomic count)
    Took 0.94 s on PKc
Pass by value
    Took 0.96 s on N5boost7strings10imm_stringIcSt11char_traitsIcEEE
    Took 2.45 s on N5boost7strings14string_builderIcSt11char_traitsIcEEE
    Took 3.48 s on Ss // this is std::string
Pass by reference
    Took 0.09 s on N5boost7strings10imm_stringIcSt11char_traitsIcEEE
    Took 0.08 s on N5boost7strings14string_builderIcSt11char_traitsIcEEE
    Took 0.13 s on Ss // this is std::string
Modified, Pass by value
    Took 1 s on N5boost7strings10imm_stringIcSt11char_traitsIcEEE
    Took 2.45 s on N5boost7strings14string_builderIcSt11char_traitsIcEEE
    Took 3.78 s on Ss // this is std::string
Modified, Pass by reference
    Took 0.09 s on N5boost7strings10imm_stringIcSt11char_traitsIcEEE
    Took 0.09 s on N5boost7strings14string_builderIcSt11char_traitsIcEEE
    Took 0.13 s on Ss // this is std::string
Leaked, Pass by value // Leaked state is peculiar to g++ string implementations
    Took 1 s on N5boost7strings10imm_stringIcSt11char_traitsIcEEE
    Took 2.45 s on N5boost7strings14string_builderIcSt11char_traitsIcEEE
    Took 2.56 s on Ss // this is std::string
Leaked, Pass by reference // Leaked state is peculiar to g++ string implementations
    Took 0.09 s on N5boost7strings10imm_stringIcSt11char_traitsIcEEE
    Took 0.09 s on N5boost7strings14string_builderIcSt11char_traitsIcEEE
    Took 0.13 s on Ss // this is std::string
Temp String, with copy
    Took 2.49 s on N5boost7strings10imm_stringIcSt11char_traitsIcEEE
    Took 2.49 s on N5boost7strings14string_builderIcSt11char_traitsIcEEE
    Took 4.39 s on Ss // this is std::string
Temp String, Move semantics
    Took 1.25 s on N5boost7strings10imm_stringIcSt11char_traitsIcEEE
    Took 1.03 s on N5boost7strings14string_builderIcSt11char_traitsIcEEE
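The two POSIX-style rules listed in the message above can be illustrated with standard C++11 threads. This is a sketch under assumptions, not the code from the uploaded zip: count_in_parallel and fill_halves are hypothetical names, a const std::string plays the immutable string, and a std::vector<char> of fixed size plays the string builder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Rule 1: several threads read the same const string without locking.
// Each reader writes only to its own slot of the result vector.
inline std::vector<std::size_t> count_in_parallel(const std::string& text,
                                                  char wanted,
                                                  unsigned readers) {
    std::vector<std::size_t> counts(readers, 0);
    std::vector<std::thread> pool;
    for (unsigned r = 0; r < readers; ++r)
        pool.emplace_back([&, r] {
            std::size_t n = 0;
            for (char ch : text)
                if (ch == wanted) ++n;
            counts[r] = n;
        });
    for (std::thread& t : pool) t.join();
    return counts;
}

// Rule 2: two threads write to disjoint halves of a preallocated buffer.
// The length never changes, so the storage is never reallocated.
inline std::string fill_halves(std::size_t size) {
    std::vector<char> buffer(size, '?');
    const std::size_t mid = size / 2;
    std::thread left([&] {
        for (std::size_t i = 0; i < mid; ++i) buffer[i] = 'L';
    });
    std::thread right([&] {
        for (std::size_t i = mid; i < size; ++i) buffer[i] = 'R';
    });
    left.join();
    right.join();
    return std::string(buffer.begin(), buffer.end());
}
```

Both functions rely on the threads being joined before the results are read. Anything that changes the buffer's length, such as an append, would fall outside the second rule and need external synchronization.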
participants (6)
- Corrado Zoccolo
- Felipe Magno de Almeida
- Gregory Dai
- Janek Kozicki
- Jeff Garland
- Philippe Vaucher