Input archive loads seem to corrupt shared character std::strings

Folks, We were recently tracking down some weirdness in our application, and think it may interest other users of boost archives. In essence, we are using the std::string library that comes with gcc 4.x, which reference counts the underlying characters. The iarchive loads seem to corrupt these characters if they are shared with other std::string instances. Here is a paraphrased example of what we were doing. // First, we have a string instance static std::string defaultStringValue = "defaultValue"; // Next we make a class that uses this default to initialize a member in its base constructor class A { public: std::string member; A() : member(defaultStringValue); template<class Archive> void serialize(Archive & ar, const unsigned int serializer_version) { ar & member; } } What we find happening is that if you deserialize an instance of A, the defaultStringValue *can* get clobbered (this has a bit to do with the length of the underlying value, as to whether the characters get clobbered or not). The scenario goes something like this: 1. Create an instance of A, this member will get a reference to the underlying "defaultValue" characters. 2. Load the instance of A from an archive that has a different value of member (the same length, or maybe shorter. If you test this, try something like "eefaultValue"). 3. This seems to mess up the value of defaultStringValue, you can print it out and see. Inspection of the code reveals the following is happening (both in 1.34 and HEAD): text_iarchive_impl<Archive>::load(std::string &s) { std::size_t size; * this->This() >> size; // skip separating space is.get(); // borland de-allocator fixup #if BOOST_WORKAROUND(_RWSTD_VER, BOOST_TESTED_AT(20101)) if(NULL != s.data()) #endif s.resize(size); is.read(const_cast<char *>(s.data()), size); // <== This looks suspect } Basically, data() is const because of the reference counting. Doing this const_cast and modification causes other strings that share the ref counted characters to be corrupted. We have been applying the following patch to our production version of boost (1.34.0), to work around this issue. Basically, it makes its own temp string and then does the assignment. diff -du -r boost_1_34_0/boost/archive/impl/basic_binary_iprimitive.ipp boost_1_34_0.my/boost/archive/impl/basic_binary_iprimitive.ipp --- boost_1_34_0/boost/archive/impl/basic_binary_iprimitive.ipp 2006-02-09 01:58:07.000000000 -0500 +++ boost_1_34_0.my/boost/archive/impl/basic_binary_iprimitive.ipp 2007-12-03 18:05:43.000000000 -0500 @@ -85,8 +85,9 @@ template<class Archive, class Elem, class Tr> BOOST_ARCHIVE_OR_WARCHIVE_DECL(void) -basic_binary_iprimitive<Archive, Elem, Tr>::load(std::string & s) +basic_binary_iprimitive<Archive, Elem, Tr>::load(std::string & res_s) { + std::string s; std::size_t l; this->This()->load(l); // borland de-allocator fixup @@ -96,6 +97,9 @@ s.resize(l); // note breaking a rule here - could be a problem on some platform load_binary(const_cast<char *>(s.data()), l); + + // Respect other references to the chars by assigning result string + res_s = s; } #ifndef BOOST_NO_CWCHAR @@ -113,8 +117,9 @@ #ifndef BOOST_NO_STD_WSTRING template<class Archive, class Elem, class Tr> BOOST_ARCHIVE_OR_WARCHIVE_DECL(void) -basic_binary_iprimitive<Archive, Elem, Tr>::load(std::wstring & ws) +basic_binary_iprimitive<Archive, Elem, Tr>::load(std::wstring & res_ws) { + std::wstring ws; std::size_t l; this->This()->load(l); // borland de-allocator fixup @@ -124,6 +129,9 @@ ws.resize(l); // note breaking a rule here - is could be a problem on some platform load_binary(const_cast<wchar_t *>(ws.data()), l * sizeof(wchar_t) / sizeof(char)); + + // Respect other references to the chars by assigning result string + res_ws = ws; } #endif diff -du -r boost_1_34_0/boost/archive/impl/text_iarchive_impl.ipp boost_1_34_0.my/boost/archive/impl/text_iarchive_impl.ipp --- boost_1_34_0/boost/archive/impl/text_iarchive_impl.ipp 2005-12-11 01:12:52.000000000 -0500 +++ boost_1_34_0.my/boost/archive/impl/text_iarchive_impl.ipp 2007-12-03 17:56:04.000000000 -0500 @@ -42,8 +42,9 @@ template<class Archive> BOOST_ARCHIVE_DECL(void) -text_iarchive_impl<Archive>::load(std::string &s) +text_iarchive_impl<Archive>::load(std::string &res_s) { + std::string s; std::size_t size; * this->This() >> size; // skip separating space @@ -54,6 +55,9 @@ #endif s.resize(size); is.read(const_cast<char *>(s.data()), size); + + // Respect other references to the chars by assigning result string + res_s = s; } #ifndef BOOST_NO_CWCHAR @@ -74,8 +78,9 @@ #ifndef BOOST_NO_STD_WSTRING template<class Archive> BOOST_ARCHIVE_DECL(void) -text_iarchive_impl<Archive>::load(std::wstring &ws) +text_iarchive_impl<Archive>::load(std::wstring &res_ws) { + std::wstring ws; std::size_t size; * this->This() >> size; // borland de-allocator fixup @@ -86,6 +91,9 @@ // skip separating space is.get(); is.read((char *)ws.data(), size * sizeof(wchar_t)/sizeof(char)); + + // Respect other references to the chars by assigning result string + res_ws = ws; } #endif // BOOST_NO_STD_WSTRING diff -du -r boost_1_34_0/boost/archive/impl/text_wiarchive_impl.ipp boost_1_34_0.my/boost/archive/impl/text_wiarchive_impl.ipp --- boost_1_34_0/boost/archive/impl/text_wiarchive_impl.ipp 2005-07-02 01:52:14.000000000 -0400 +++ boost_1_34_0.my/boost/archive/impl/text_wiarchive_impl.ipp 2007-12-03 18:05:07.000000000 -0500 @@ -79,8 +79,10 @@ #ifndef BOOST_NO_STD_WSTRING template<class Archive> BOOST_WARCHIVE_DECL(void) -text_wiarchive_impl<Archive>::load(std::wstring &ws) +text_wiarchive_impl<Archive>::load(std::wstring &res_ws) { + std::wstring ws; + std::size_t size; * this->This() >> size; // skip separating space @@ -93,6 +95,9 @@ ws.resize(size); // note breaking a rule here - is this a problem on some platform is.read(const_cast<wchar_t *>(ws.data()), size); + + // Respect other references to the chars by assigning result string + res_ws = ws; } #endif Curious if anyone has thoughts on the matter... Regards, Chuck

Hmmm, this is a very well thought out and clearly explained post. And you know, we've been surprised by some problems with gcc 4.1+ related to export, which depends on serialization of a string which identifies the class. Hmmmmmmmmmmm. And I can appreciate what a huge amount of effort you had to expend to find this. I would like to a) see if someone can confirm this issue and its cause. b) see if this is the best solution. Since this string serialization is a very common operation, the method should be as fast as possible (lol - which seems to be the exact reason we find ourselves with this predicament). Somehow I think just pulling /pushing the characters from/to the input/output directly to to the string will be faster than creating a temporary string. c) after a/b above, it can be decided which version of boost the fix should be rolled into. Robert Ramey Chuck Bear wrote:
Folks,
We were recently tracking down some weirdness in our application, and think it may interest other users of boost archives. In essence, we are using the std::string library that comes with gcc 4.x, which reference counts the underlying characters. The iarchive loads seem to corrupt these characters if they are shared with other std::string instances. Here is a paraphrased example of what we were doing.
// First, we have a string instance static std::string defaultStringValue = "defaultValue";
// Next we make a class that uses this default to initialize a member in its base constructor class A { public: std::string member; A() : member(defaultStringValue); template<class Archive> void serialize(Archive & ar, const unsigned int serializer_version) { ar & member; } }
What we find happening is that if you deserialize an instance of A, the defaultStringValue *can* get clobbered (this has a bit to do with the length of the underlying value, as to whether the characters get clobbered or not). The scenario goes something like this: 1. Create an instance of A, this member will get a reference to the underlying "defaultValue" characters. 2. Load the instance of A from an archive that has a different value of member (the same length, or maybe shorter. If you test this, try something like "eefaultValue"). 3. This seems to mess up the value of defaultStringValue, you can print it out and see.
Inspection of the code reveals the following is happening (both in 1.34 and HEAD): text_iarchive_impl<Archive>::load(std::string &s) { std::size_t size; * this->This() >> size; // skip separating space is.get(); // borland de-allocator fixup #if BOOST_WORKAROUND(_RWSTD_VER, BOOST_TESTED_AT(20101)) if(NULL != s.data()) #endif s.resize(size); is.read(const_cast<char *>(s.data()), size); // <== This looks suspect }
Basically, data() is const because of the reference counting. Doing this const_cast and modification causes other strings that share the ref counted characters to be corrupted.
We have been applying the following patch to our production version of boost (1.34.0), to work around this issue. Basically, it makes its own temp string and then does the assignment.
diff -du -r boost_1_34_0/boost/archive/impl/basic_binary_iprimitive.ipp boost_1_34_0.my/boost/archive/impl/basic_binary_iprimitive.ipp --- boost_1_34_0/boost/archive/impl/basic_binary_iprimitive.ipp 2006-02-09 01:58:07.000000000 -0500 +++ boost_1_34_0.my/boost/archive/impl/basic_binary_iprimitive.ipp 2007-12-03 18:05:43.000000000 -0500 @@ -85,8 +85,9 @@
template<class Archive, class Elem, class Tr> BOOST_ARCHIVE_OR_WARCHIVE_DECL(void) -basic_binary_iprimitive<Archive, Elem, Tr>::load(std::string & s) +basic_binary_iprimitive<Archive, Elem, Tr>::load(std::string & res_s) { + std::string s; std::size_t l; this->This()->load(l); // borland de-allocator fixup @@ -96,6 +97,9 @@ s.resize(l); // note breaking a rule here - could be a problem on some platform load_binary(const_cast<char *>(s.data()), l); + + // Respect other references to the chars by assigning result string + res_s = s; }
#ifndef BOOST_NO_CWCHAR @@ -113,8 +117,9 @@ #ifndef BOOST_NO_STD_WSTRING template<class Archive, class Elem, class Tr> BOOST_ARCHIVE_OR_WARCHIVE_DECL(void) -basic_binary_iprimitive<Archive, Elem, Tr>::load(std::wstring & ws) +basic_binary_iprimitive<Archive, Elem, Tr>::load(std::wstring & res_ws) { + std::wstring ws; std::size_t l; this->This()->load(l); // borland de-allocator fixup @@ -124,6 +129,9 @@ ws.resize(l); // note breaking a rule here - is could be a problem on some platform load_binary(const_cast<wchar_t *>(ws.data()), l * sizeof(wchar_t) / sizeof(char)); + + // Respect other references to the chars by assigning result string + res_ws = ws; } #endif
diff -du -r boost_1_34_0/boost/archive/impl/text_iarchive_impl.ipp boost_1_34_0.my/boost/archive/impl/text_iarchive_impl.ipp --- boost_1_34_0/boost/archive/impl/text_iarchive_impl.ipp 2005-12-11 01:12:52.000000000 -0500 +++ boost_1_34_0.my/boost/archive/impl/text_iarchive_impl.ipp 2007-12-03 17:56:04.000000000 -0500 @@ -42,8 +42,9 @@
template<class Archive> BOOST_ARCHIVE_DECL(void) -text_iarchive_impl<Archive>::load(std::string &s) +text_iarchive_impl<Archive>::load(std::string &res_s) { + std::string s; std::size_t size; * this->This() >> size; // skip separating space @@ -54,6 +55,9 @@ #endif s.resize(size); is.read(const_cast<char *>(s.data()), size); + + // Respect other references to the chars by assigning result string + res_s = s; }
#ifndef BOOST_NO_CWCHAR @@ -74,8 +78,9 @@ #ifndef BOOST_NO_STD_WSTRING template<class Archive> BOOST_ARCHIVE_DECL(void) -text_iarchive_impl<Archive>::load(std::wstring &ws) +text_iarchive_impl<Archive>::load(std::wstring &res_ws) { + std::wstring ws; std::size_t size; * this->This() >> size; // borland de-allocator fixup @@ -86,6 +91,9 @@ // skip separating space is.get(); is.read((char *)ws.data(), size * sizeof(wchar_t)/sizeof(char)); + + // Respect other references to the chars by assigning result string + res_ws = ws; }
#endif // BOOST_NO_STD_WSTRING diff -du -r boost_1_34_0/boost/archive/impl/text_wiarchive_impl.ipp boost_1_34_0.my/boost/archive/impl/text_wiarchive_impl.ipp --- boost_1_34_0/boost/archive/impl/text_wiarchive_impl.ipp 2005-07-02 01:52:14.000000000 -0400 +++ boost_1_34_0.my/boost/archive/impl/text_wiarchive_impl.ipp 2007-12-03 18:05:07.000000000 -0500 @@ -79,8 +79,10 @@ #ifndef BOOST_NO_STD_WSTRING template<class Archive> BOOST_WARCHIVE_DECL(void) -text_wiarchive_impl<Archive>::load(std::wstring &ws) +text_wiarchive_impl<Archive>::load(std::wstring &res_ws) { + std::wstring ws; + std::size_t size; * this->This() >> size; // skip separating space @@ -93,6 +95,9 @@ ws.resize(size); // note breaking a rule here - is this a problem on some platform is.read(const_cast<wchar_t *>(ws.data()), size); + + // Respect other references to the chars by assigning result string + res_ws = ws; } #endif
Curious if anyone has thoughts on the matter...
Regards, Chuck _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Robert, Thanks for your quick reply. Please allow me to apologize for not providing a full reproducer. (Wanted to get home for dinner.) Here it is (please ignore my general sloppiness with this but I'm still in a bit of a hurry): ------------------------------------------------- #include <string> #include <boost/archive/text_iarchive.hpp> #include <boost/archive/text_oarchive.hpp> #include <boost/serialization/serialization.hpp> #include <boost/serialization/string.hpp> #include <boost/serialization/export.hpp> #include <sstream> #include <stdio.h> std::string defaultStringValue = "defaultValue"; class A { public: friend class boost::serialization::access; std::string member; template<class Archive> void serialize(Archive & ar, const unsigned int serializer_version) { ar & member; } A() : member(defaultStringValue) {} }; int main(int argc, char **argv) { printf("%s\n", defaultStringValue.c_str()); A start; start.member = "eefaultValue"; std::ostringstream oss; { const A &tmp = start; boost::archive::text_oarchive oar(oss); oar << tmp; } std::string archString = oss.str(); A end; std::istringstream iss(archString); { boost::archive::text_iarchive ar(iss); ar >> end; } printf("%s\n", defaultStringValue.c_str()); return 0; } ------------------------------ I have tried this on g++ 4.0.2 (RHEL4 update 3) and g++ 4.1.1. In both cases, it was 64-bit x86. The result is the same: hp09:/home/cbear $ g++ -I/v3p/third-party/install/include -L/v3p/third-party/install/lib -lboost_serialization-gcc-mt-d str.cpp hp09:/home/cbear $ ./a.out defaultValue eefaultValue Anyway, I hope this helps you (or others) with concern (a) below. With regard to (b) below, I think you should proceed empirically. However I can't resist the temptation to speculate about the performance of, at least, the gcc implementation of std::string, in different implementations of a load() method: 1. I can't imagine that the temp std::string implementation would be a disaster, because of the character ref counting. It means you'd create the chars for the temp string, fill them in, and then transfer the reference over to the real string (without further adjustment of the characters). I do think that the reference counts are done with fancy atomic operations, so there is a bit of memory-synchronizing or bus locking going on here (in my patch) that wasn't in the 1.34 implementation. 2. I have to imagine that the typical case is that load() is loading into an empty string. Also I suspect that "is a string empty" can be established very cheaply. So if my suggestion is too slow, you can do what I've outlined only if the string to be loaded was non-empty when passed in. Anyway, that's my $.02. Thanks again for looking into this. Regards, Chuck -----Original Message----- From: boost-bounces@lists.boost.org on behalf of Robert Ramey Sent: Mon 12/3/2007 8:47 PM To: boost@lists.boost.org Subject: Re: [boost] Input archive loads seem to corrupt sharedcharacterstd::strings Hmmm, this is a very well thought out and clearly explained post. And you know, we've been surprised by some problems with gcc 4.1+ related to export, which depends on serialization of a string which identifies the class. Hmmmmmmmmmmm. And I can appreciate what a huge amount of effort you had to expend to find this. I would like to a) see if someone can confirm this issue and its cause. b) see if this is the best solution. Since this string serialization is a very common operation, the method should be as fast as possible (lol - which seems to be the exact reason we find ourselves with this predicament). Somehow I think just pulling /pushing the characters from/to the input/output directly to to the string will be faster than creating a temporary string. c) after a/b above, it can be decided which version of boost the fix should be rolled into. Robert Ramey

On Dec 4, 2007 5:57 AM, Chuck Bear <chuck@vertica.com> wrote:
With regard to (b) below, I think you should proceed empirically. However I can't resist the temptation to speculate about the performance of, at least, the gcc implementation of std::string, in different implementations of a load() method: 1. I can't imagine that the temp std::string implementation would be a disaster, because of the character ref counting. It means you'd create the chars for the temp string, fill them in, and then transfer the reference over to the real string (without further adjustment of the characters). I do think that the reference counts are done with fancy atomic operations, so there is a bit of memory-synchronizing or bus locking going on here (in my patch) that wasn't in the 1.34 implementation. 2. I have to imagine that the typical case is that load() is loading into an empty string. Also I suspect that "is a string empty" can be established very cheaply. So if my suggestion is too slow, you can do what I've outlined only if the string to be loaded was non-empty when passed in.
This atomic operations can be quite heavy on modern hardware. For small strings, it costs as much as an allocation on a P4. You can improve things by swapping the string instead of using assignment. template<class Archive, class Elem, class Tr> BOOST_ARCHIVE_OR_WARCHIVE_DECL(void) -basic_binary_iprimitive<Archive, Elem, Tr>::load(std::string & s) +basic_binary_iprimitive<Archive, Elem, Tr>::load(std::string & res_s) { + std::string s; std::size_t l; this->This()->load(l); // borland de-allocator fixup @@ -96,6 +97,9 @@ s.resize(l); // note breaking a rule here - could be a problem on some platform load_binary(const_cast<char *>(s.data()), l); + + // Respect other references to the chars by assigning result string *+ res_s.swap(s);* } This will avoid the atomic operations, but will still cost one allocation and one deallocation more than initial solution. Corrado

I'd like to go back to Chuck's original posting:
Inspection of the code reveals the following is happening (both in 1.34 and HEAD): text_iarchive_impl<Archive>::load(std::string &s) { std::size_t size; * this->This() >> size; // skip separating space is.get(); // borland de-allocator fixup #if BOOST_WORKAROUND(_RWSTD_VER, BOOST_TESTED_AT(20101)) if(NULL != s.data()) #endif s.resize(size); is.read(const_cast<char *>(s.data()), size); // <== This looks suspect }
Inspection of the g++ string implementation (latest 4.1.x) shows that resize() will *almost* always create a new rep and thus a new data() for the string, if it is shared. The only exception is when the requested size == s.size(). Chuck, was this the bug that hit you? Seems to me, since you are going to rewrite your string anyway, you might as well clear() it first, then resize(size). That'll take care of creating a new representation. The problem with that is that it will write over the string twice, first filling it with 0s then with the read data. But think about it: if the string is shared, you need to construct a new representation with the requested size anyway, and it will have to be filled with 0s. Still, suppose you wanted to avoid going over your string twice, what you'd need is to create a string with an unitialized buffer and start writing into it directly. There's only one way I can think of to achieve that: have an input iterator from your "is" stream, such as std::istream_iterator<char>(is), and use s.assign(is_begin, is_end), assuming you can define and end that will stop reading after size chars (don't even know how to do that with istream_iterator, and there is no assign(int, InputIterator) method in string, sadly). In any case, however, you then lose the benefit of the single is.read() call, because the iterator is going to do ++iter and read character by character. If your stream is buffered, there shouldn't much of an impact, but then you have a double copy from stream buffer to string buffer anyway, which is akin to going over your string twice. In short, I don't think you can avoid the cost of going over your string twice, so I can't see anything better than: text_iarchive_impl<Archive>::load(std::string &s) { std::size_t size; * this->This() >> size; // skip separating space is.get(); // borland de-allocator fixup #if BOOST_WORKAROUND(_RWSTD_VER, BOOST_TESTED_AT(20101)) if(NULL != s.data()) #endif + s.clear(); s.resize(size); is.read(const_cast<char *>(s.data()), size); // <== This looks suspect } -- Hervé Brönnimann hervebronnimann@mac.com

On Tue, Dec 04, 2007 at 04:59:17AM -0500, Hervé Brönnimann wrote:
Inspection of the g++ string implementation (latest 4.1.x) shows that resize() will *almost* always create a new rep and thus a new data() for the string, if it is shared. The only exception is when the requested size == s.size(). Chuck, was this the bug that hit you?
To be honest, isn't a specific implementation not completely unimportant? You don't want to rely on implementation details, right? There is only one solution: remove the buggy casting code and replace it with a proper one as suggested. Whether the proper one is slower or not is unimportant!
Seems to me, since you are going to rewrite your string anyway, you might as well clear() it first, then resize(size). That'll take care of creating a new representation. The problem with that is that it will write over the string twice, first filling it with 0s then with the read data. But think about it: if the string is shared, you need to construct a new representation with the requested size anyway, and it will have to be filled with 0s.
No, the problem is that this is implementation specific. Don't rely on such stuff, rely on the C++ standard! Jens

Sorry for the double (triple) posting. My mailer crashed and I started again with the draft, unaware that the post went through. Jens: My post wasn't specific to the shared string implementation, the original posting was. After all, Robert's code was working fine with non-shared implementation.
On Dec 4, 2007, at 5:13 AM, Jens Seidel wrote: To be honest, isn't a specific implementation not completely unimportant? You don't want to rely on implementation details, right?
I don't rely on gcc's string implementation details, but on whether a string implementation is shared or not. The original problem and argument I built upon it is true for any shared implementation. If you think there's no way in C++ to tell whether a string class is shared or not, think about Chuck's posting: bool is_string_class_shared() { const std::string s1('x'); const std::string s2(s1); *(const_cast<char*>s1.data()) = 'y'; return s2[1] == 'y'; } Speaking of the programming with only the C++ standard, it specifically mentions allowing a shared implementation of std::string.
There is only one solution: remove the buggy casting code and replace it with a proper one as suggested. Whether the proper one is slower or not is unimportant!
I strongly disagree. Performance of string serialization is important, as Robert posted in a previous msg. For that reason, the use of const_cast<char*>s.data() seems justified.
Seems to me, since you are going to rewrite your string anyway, you might as well clear() it first, then resize(size). That'll take care of creating a new representation. The problem with that is that it will write over the string twice, first filling it with 0s then with the read data. But think about it: if the string is shared, you need to construct a new representation with the requested size anyway, and it will have to be filled with 0s.
No, the problem is that this is implementation specific. Don't rely on such stuff, rely on the C++ standard!
Again, my argumentation isn't implementation specific, it's true of any shared string class. My patch should work with any standard-compliant string. What I argued was that it's better than the previously proposed fixes for gcc's implementation. What I didn't argue was that it should also be very efficient with non-shared implementation. If I had, I would've noticed that for implementations which release memory upon clearing or assignment, however, it will first free the capacity, then reacquire it, an unnecessary operation. For non-shared implementation, the resize(size) alone was sufficient, so I would simply amend my previous posting by enclosing the "s = std::string ();" within a #if BOOST_SHARED_STD_STRING (suitably defined in boost/config.hpp, and detected per platform using e.g. the code I posted above). -- Hervé Brönnimann hervebronnimann@mac.com
On Tue, Dec 04, 2007 at 04:59:17AM -0500, Hervé Brönnimann wrote:
Inspection of the g++ string implementation (latest 4.1.x) shows that resize() will *almost* always create a new rep and thus a new data() for the string, if it is shared. The only exception is when the requested size == s.size(). Chuck, was this the bug that hit you?
Jens _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/ listinfo.cgi/boost

From: Hervé Brönnimann
On Dec 4, 2007, at 5:13 AM, Jens Seidel wrote: To be honest, isn't a specific implementation not completely unimportant? You don't want to rely on implementation details, right?
I don't rely on gcc's string implementation details, but on whether a string implementation is shared or not.
But whether a string is shared or not is an implementation detail.
The original problem and argument I built upon it is true for any shared implementation ... that you can think of. I bet I can imagine an architecture where the behaviour is different.
If you think there's no way in C++ to tell whether a string class is shared or not, think about Chuck's posting:
bool is_string_class_shared() { const std::string s1('x'); const std::string s2(s1); *(const_cast<char*>s1.data()) = 'y';
At this point you have invoked undefined behaviour, so it is impossible to use the C++ standard alone to reason about the behaviour of the code.
return s2[1] == 'y'; }
Speaking of the programming with only the C++ standard, it specifically mentions allowing a shared implementation of std::string.
Yes, and puts certain restrictions on what a programmer can do so that a shared implementation of std::string can work. (For example, a programmer may not write the code in "is_string_class_shared".)
There is only one solution: remove the buggy casting code and replace it with a proper one as suggested. Whether the proper one is slower or not is unimportant!
I strongly disagree. Performance of string serialization is important, as Robert posted in a previous msg. For that reason, the use of const_cast<char*>s.data() seems justified.
This is the heart of the matter. What is the difference in performance between relying on the const_cast, and the strictly legal version using a vector buffer: text_iarchive_impl<Archive>::load(std::string &s) { std::size_t size; * this->This() >> size; // skip separating space is.get(); // Read into a buffer. std::vector<char> v(size); is.read( &v[0], size ); // Return the string. std::swap( s, std::string( v.begin(), v.end() )); } Clearly there are inefficiencies there, but are they bad enough to be worth the const_cast? [snip]
No, the problem is that this is implementation specific. Don't rely on such stuff, rely on the C++ standard!
Again, my argumentation isn't implementation specific, it's true of any shared string class. Which is implementation specific.
My patch should work with any standard-compliant string. I don't think so. (Although it may well work with any /plausible/ implementation of a standard compliant string).
What I argued was that it's better than the previously proposed fixes for gcc's implementation. Ah. Now that's a /very/ different argument :-)
-- Martin Bonner Senior Software Engineer/Team Leader PI SHURLOK LTD Telephone: +44 1223 441434 / 203894 (direct) Fax: +44 1223 203999 Email: martin.bonner@pi-shurlok.com www.pi-shurlok.com disclaimer

On Dec 4, 2007 12:19 PM, Martin Bonner <Martin.Bonner@pi-shurlok.com> wrote:
From: Hervé Brönnimann
On Dec 4, 2007, at 5:13 AM, Jens Seidel wrote: There is only one solution: remove the buggy casting code and replace it with a proper one as suggested. Whether the proper one is slower or not is unimportant!
I strongly disagree. Performance of string serialization is important, as Robert posted in a previous msg. For that reason, the use of const_cast<char*>s.data() seems justified.
This is the heart of the matter. What is the difference in performance between relying on the const_cast, and the strictly legal version using a vector buffer:
text_iarchive_impl<Archive>::load(std::string &s) { std::size_t size; * this->This() >> size; // skip separating space is.get();
// Read into a buffer. std::vector<char> v(size); is.read( &v[0], size );
// Return the string. std::swap( s, std::string( v.begin(), v.end() )); }
Clearly there are inefficiencies there, but are they bad enough to be worth the const_cast?
[of course I think you meant std::string(...).swap(s)] What about something like this (warning: pseudo code) text_iarchive_impl<Archive>::load(std::string &s) { static const size_t static_buffer_size = 256; std::size_t size; std::string temp; * this->This() >> size; // skip separating space is.get(); size_t left = size; while(left) { // Read into a buffer. boost::array<char, static_buffer_size> v(size); size_t read_n = std::min(left, v.size()); is.read( &v[0], read_n); left -= read_n; // Return the string. temp.insert(v.begin(), v.begin() + read_n); } temp.swap(s); } This way you do not have to pay for an extra allocation for vector. By choosing static_buffer_size appropriately you will only do one allocation of temp for small strings (which will dominate over the cost of copying the small buffer) and for big strings it is likely that the cost of the read() call will dominate over everything else. HTH, gpd

Jens: I think you just proved once again that there usually are no benefits in going around good ol' standard compliant code. BTW, I looked up the undefined behavior rules in the C++ standard and it is unclear to me that the const_cast (however buggy) is undefined behavior, if you stick by the rules of 7.1.6.1; after all the data() pointer does not point to a const object (it points to the member s._M_data in that implementation, which isn't const since s is passed non-const in that scope). But regardless, the code is buggy with data(). So I tried to follow your advice and eliminate the const_cast. Comes a simple idea: just replace s.data() by &(*s.begin()). No need for clear() or any other operation, it's standard-compliant and well-defined behavior, etc. Minimal change, no swap, no unnecessary allocation (I think). Even strongly exception safe, if string is strongly exception safe, because if resize() or begin() trigger an exception during allocation, s should be unchanged. Do you like that? I do. I"m cc'ing Robert on this one. -- Herve Bronnimann hervebronnimann@mac.com PS: Speaking of C++ standard and strings and exception, I can't find anywhere that specifies that strings are strongly exception safe. Vector guarantees it, and it should be easier for strings than for vector since it can only contain PODs, but I truly cannot see the clause in Chapter 21. Anyone? On Tuesday, December 04, 2007, at 05:12AM, "Jens Seidel" <jensseidel@users.sf.net> wrote:
There is only one solution: remove the buggy casting code and replace it with a proper one as suggested. Whether the proper one is slower or not is unimportant! [snip] Don't rely on such stuff, rely on the C++ standard!
Jens _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

I've been reading the posts on this subject with interest. I don't have strong position myself. I do believe that some extra effort and/or risk is justified to get best performance. I'm leaving for you guys to do battle over the various alternatives. I strongly believe in const correctness and believe that things are "const" for a reason so I'm not enamored with a const_cast of any kind. I would much prefer a "guarenteed correct" solution if its close to best efficiency. Hence I like the current suggestion. But does't it presume that s.begin() points to a real array which would be an implemenation feature?. That is, do I know that a standard conforming string is actually an array of adjacent characters? Robert Ramey Herve Bronnimann wrote:
Jens: I think you just proved once again that there usually are no benefits in going around good ol' standard compliant code. BTW, I looked up the undefined behavior rules in the C++ standard and it is unclear to me that the const_cast (however buggy) is undefined behavior, if you stick by the rules of 7.1.6.1; after all the data() pointer does not point to a const object (it points to the member s._M_data in that implementation, which isn't const since s is passed non-const in that scope). But regardless, the code is buggy with data().
So I tried to follow your advice and eliminate the const_cast. Comes a simple idea: just replace s.data() by &(*s.begin()). No need for clear() or any other operation, it's standard-compliant and well-defined behavior, etc. Minimal change, no swap, no unnecessary allocation (I think). Even strongly exception safe, if string is strongly exception safe, because if resize() or begin() trigger an exception during allocation, s should be unchanged.
Do you like that? I do. I"m cc'ing Robert on this one.
On Tuesday, December 04, 2007, at 05:12AM, "Jens Seidel" <jensseidel@users.sf.net> wrote:
There is only one solution: remove the buggy casting code and replace it with a proper one as suggested. Whether the proper one is slower or not is unimportant! [snip] Don't rely on such stuff, rely on the C++ standard!
Jens _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

.On Dec 5, 2007 2:04 PM, Robert Ramey <ramey@rrsd.com> wrote:
I've been reading the posts on this subject with interest. I don't have strong position myself. I do believe that some extra effort and/or risk is justified to get best performance. I'm leaving for you guys to do battle over the various alternatives. I strongly believe in const correctness and believe that things are "const" for a reason so I'm not enamored with a const_cast of any kind. I would much prefer a "guarenteed correct" solution if its close to best efficiency.
Hence I like the current suggestion. But does't it presume that s.begin() points to a real array which would be an implemenation feature?. That is, do I know that a standard conforming string is actually an array of adjacent characters?
For C++03 it isn't guaranteed, AFAIK. But the new standard will enforce this. Just as std::vector is guaranteed to be contiguous. So I would say it is ok to expect this from an implementation at this point.
Robert Ramey
[snip] -- Felipe Magno de Almeida

Robert Ramey:
I've been reading the posts on this subject with interest. I don't have strong position myself. I do believe that some extra effort and/or risk is justified to get best performance. I'm leaving for you guys to do battle over the various alternatives. I strongly believe in const correctness and believe that things are "const" for a reason so I'm not enamored with a const_cast of any kind. I would much prefer a "guarenteed correct" solution if its close to best efficiency.
Hence I like the current suggestion. But does't it presume that s.begin() points to a real array which would be an implemenation feature?. That is, do I know that a standard conforming string is actually an array of adjacent characters?
The standard doesn't guarantee that a std::string is a contiguous array, but all existing implementations are. You can use &s[0] and take the risk, or you can prefer dispatching on whether s.begin() returns a pointer.

On Wednesday, December 05, 2007, at 12:30PM, "Peter Dimov" <pdimov@pdimov.com> wrote:
The standard doesn't guarantee that a std::string is a contiguous array, but all existing implementations are. You can use &s[0] and take the risk, or you can prefer dispatching on whether s.begin() returns a pointer.
Peter: Section 21.3.1 [string.require] paragraph 3, page 578, of the current working draft: The char-like objects in a basic_string object shall be stored contiguously. That is, for any basic_string object s, the identity &*(s.begin() + n) == &*s.begin() + n shall hold for all values of n such that 0 <= n < s.size(). Isn't that what Robert is asking for? Is this a new requirement (i.e., not in C++03)? Indeed, I can't see it in C++03. -- Herve Bronnimann hervebronnimann@mac.com

Herve Bronnimann:
On Wednesday, December 05, 2007, at 12:30PM, "Peter Dimov" <pdimov@pdimov.com> wrote:
The standard doesn't guarantee that a std::string is a contiguous array, but all existing implementations are. You can use &s[0] and take the risk, or you can prefer dispatching on whether s.begin() returns a pointer.
Peter: Section 21.3.1 [string.require] paragraph 3, page 578, of the current working draft:
The char-like objects in a basic_string object shall be stored contiguously. That is, for any basic_string object s, the identity &*(s.begin() + n) == &*s.begin() + n shall hold for all values of n such that 0 <= n < s.size().
Isn't that what Robert is asking for? Is this a new requirement (i.e., not in C++03)? Indeed, I can't see it in C++03.
I think that it's new for C++0x, but even so, this requirement is a strong indication that every std::string is already contiguous. So using &s[0] or &*s.begin() instead of const_cast'ing data() would definitely be a step forward.

Hervé (and others), Yes, our problem happened when the string was the same size (as it is in the reproducing code I submitted a while ago). (I did not test smaller strings. Nor did I look at the implementation. It seems to me that one could construct reasonable implementations of std::string that worked either way, depending on things like how c_str() is optimized.) In your code below, beware the #ifdef'd 'if' above your change. I'm not sure you want what you'd get if the #ifdef is triggered. Then again, I'm having a hard time understanding the original routine in that regard. It seems very strange to ensure that the string is resized only if the data() isn't NULL. Am I just misreading the code, or does this ensure that the null pointer is chased if the string has NULL data() upon entry to the routine? In general I've heard a lot of suggestions about what would be the fastest on different implementations of std::string. Of course it seems like a good idea to go for the fastest "correct" implementation possible. Does there exist a set of programs that could be run on different platforms (hardware, and libraries) for measuring the relative performance of the proposed solutions? Regards, Chuck -----Original Message----- From: boost-bounces@lists.boost.org on behalf of Hervé Brönnimann Sent: Tue 12/4/2007 4:59 AM To: boost@lists.boost.org Subject: Re: [boost] Input archive loads seem to corruptsharedcharacterstd::strings --snip-- In short, I don't think you can avoid the cost of going over your string twice, so I can't see anything better than: text_iarchive_impl<Archive>::load(std::string &s) { std::size_t size; * this->This() >> size; // skip separating space is.get(); // borland de-allocator fixup #if BOOST_WORKAROUND(_RWSTD_VER, BOOST_TESTED_AT(20101)) if(NULL != s.data()) #endif + s.clear(); s.resize(size); is.read(const_cast<char *>(s.data()), size); // <== This looks suspect }

I'd like to go back to Chuck's original posting:
Inspection of the code reveals the following is happening (both in 1.34 and HEAD): text_iarchive_impl<Archive>::load(std::string &s) { std::size_t size; * this->This() >> size; // skip separating space is.get(); // borland de-allocator fixup #if BOOST_WORKAROUND(_RWSTD_VER, BOOST_TESTED_AT(20101)) if(NULL != s.data()) #endif s.resize(size); is.read(const_cast<char *>(s.data()), size); // <== This looks suspect }
Inspection of the g++ string implementation (latest 4.1.x) shows that resize() will *almost* always create a new rep and thus a new data() for the string, if it is shared. The only exception is when the requested size == s.size(). Chuck, was this the bug that hit you? Seems to me, since you are going to rewrite your string anyway, you might as well clear() it first, then resize(size). That'll take care of creating a new representation (either clear, or resize if the string was already empty). The problem with that is that it will write over the string twice, first filling it with 0s then with the read data. But think about it: if the string is shared, you need to construct a new representation with the requested size anyway, and it will have to be filled with 0s (or with a copy of what's already in the string if you decide to resize to a different size instead of clearing). Still, suppose you wanted to avoid going over your string twice, what you'd need is to create a string with an unitialized buffer and start writing into it directly. There's only one way I can think of to achieve that: have an input iterator from your "is" stream, such as std::istream_iterator<char>(is), and use s.assign(is_begin, is_end), assuming you can define an end iterator that will stop reading after size chars (don't even know how to do that with istream_iterator, and there is no assign(int, InputIterator) method in string, sadly). In any case, however, you then lose the benefit of the single is.read() call, because the iterator is going to do ++iter and read character by character. If your stream is buffered, there shouldn't much of an impact, but then you have a double copy from stream buffer to string buffer anyway, which is akin to going over your string twice. What would be nice would be to create a new buffer and read directly into it, then transfer ownership to the string... Can't do that with std::string though. In short, I don't think you can avoid the cost of going over your string twice. The problem with clear() in gcc 4.x's implementation is that it creates a new rep with the same capacity as the old string (), at least if I've read the code correctly. With a shared string implementation, it is cheaper to assign the empty string, and that works with any other implementation. In short, I've argued that you can't do much better than: text_iarchive_impl<Archive>::load(std::string &s) { std::size_t size; * this->This() >> size; // skip separating space is.get(); // borland de-allocator fixup #if BOOST_WORKAROUND(_RWSTD_VER, BOOST_TESTED_AT(20101)) if(NULL != s.data()) #endif + s = std::string(); s.resize(size); is.read(const_cast<char *>(s.data()), size); // <== This looks suspect } Compared to Corrado's implementation, it's almost the same, but frees up resources earlier. You may actually be able to reuse the same block of memory for the resize() that you just freed. HTH, -- Hervé Brönnimann hervebronnimann@mac.com

On Dec 4, 2007 11:18 AM, Hervé Brönnimann <hervebronnimann@mac.com> wrote:
text_iarchive_impl<Archive>::load(std::string &s) { std::size_t size; * this->This() >> size; // skip separating space is.get(); // borland de-allocator fixup #if BOOST_WORKAROUND(_RWSTD_VER, BOOST_TESTED_AT(20101)) if(NULL != s.data()) #endif + s = std::string(); s.resize(size); is.read(const_cast<char *>(s.data()), size); // <== This looks suspect }
Compared to Corrado's implementation, it's almost the same, but frees up resources earlier. You may actually be able to reuse the same block of memory for the resize() that you just freed.
This doesn't enforce strong exception safety. My solution either modified the string with the deserialized value, or it kept the original value. This one may leave the string empty. Corrado
HTH, -- Hervé Brönnimann hervebronnimann@mac.com _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
-- __________________________________________________________________________ dott. Corrado Zoccolo mailto:zoccolo@di.unipi.it PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- The self-confidence of a warrior is not the self-confidence of the average man. The average man seeks certainty in the eyes of the onlooker and calls that self-confidence. The warrior seeks impeccability in his own eyes and calls that humbleness. Tales of Power - C. Castaneda

On Tuesday 04 December 2007 09:29:45 Corrado Zoccolo wrote:
You can improve things by swapping the string instead of using assignment.
template<class Archive, class Elem, class Tr> BOOST_ARCHIVE_OR_WARCHIVE_DECL(void) -basic_binary_iprimitive<Archive, Elem, Tr>::load(std::string & s) +basic_binary_iprimitive<Archive, Elem, Tr>::load(std::string & res_s) { + std::string s; std::size_t l; this->This()->load(l); // borland de-allocator fixup @@ -96,6 +97,9 @@ s.resize(l); // note breaking a rule here - could be a problem on some platform load_binary(const_cast<char *>(s.data()), l); + + // Respect other references to the chars by assigning result string *+ res_s.swap(s);* }
How about this: std::size_t l; this->This()->load(l); std::string tmp(l, '\0'); /* We write directly into the string's buffer using a const_cast Note: this relies on the implementation to provide an own buffer to every std::string, which we try to assure by using a temporary here. We are _not_ using resize() on the existing string because if the size remains the same this is a no-op and does not cause a buffer to be allocated on e.g. GCC4's libstdc++. */ load_binary(const_cast<char*>(tmp.data()), tmp.size()); res.swap(tmp); This might even make the Borland workaround unnecessary. Note also that I used tmp.data() and tmp.size() in the load_binary call just for symmetry. Uli
participants (11)
-
Chuck Bear
-
Corrado Zoccolo
-
Felipe Magno de Almeida
-
Giovanni Piero Deretta
-
Herve Bronnimann
-
Hervé Brönnimann
-
Jens Seidel
-
Martin Bonner
-
Peter Dimov
-
Robert Ramey
-
Ulrich Eckhardt