Re: [Boost-users] [Serialization] How to restore ref-counted strings

9 Mar 2007

      On Friday, March 9, 2007 at 10:53:20 (-0800) Robert Ramey writes:
...
...
Define your own serialization for std::string and use it instead
of the one in the serialization library.  This is probably a bad
idea as it would attribute your special behavior to a standard
object and would make your archives and programs non
portable and harder to support if you want to ask us for help.
Definite downsides, true, but I'm not sure that it would be
non-portable, except perhaps that I have a different idea of "define
your own serialization for std::string".  I have done what I
considered to be this, and posted it below.
...
Define your own string class derived from std::string.  This
string class could be serialized using your own special sauce
without losing portability.  This could be formulated as
a "serialization wrapper" as described in the manual so that
your code would only have to use this "special string"
in the process of serialization and not throughout your program.
Look in the recent documentation and the "is_wrapper" type trait
for more information.
Ok, I'll have a look at that --- sounds like a reasonable alternative
to what I've done.
...
So now the problem boils down to how you're going to capture
and restore the fact that these strings share underlying data.
At first one would think that just letting your wrapper class
use the default tracking behavior to eliminate duplicates would
solve your problem.  But I don't think so.  As I said above,
I don't think that you're serializing the SAME (see above)
string one million times. I think you're serializing a million
different strings which happen to contain the same data.
It seems to me that you'll have to delve into the implementation
of the string class you're using and gain access to the internals
of the implementation and figure out how to capture the
reference to the shared contents and serialize that.
The strings share data on assign, so:

string a = "foo";
string b = a;

means they share the underlying memory "foo", with a logical refcount
of 2 (the physical refcount, for implementation reasons, is actually
1).  Once you muck with a or b, they get their own copy of the memory,
decremented ref count, etc.  If I serialize a and b, and deserialize,
the load will "break" this ref count --- I get two "unshared" strings,
each with a block of memory "foo".  Not the fault of the serialization
library, of course ...
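
To make the lost sharing concrete, here is a minimal sketch. It does not
use the ref-counted std::string itself: a std::shared_ptr<const std::string>
stands in for the shared buffer, and naive_save, naive_load and
sharing_survives_round_trip are hypothetical names invented for this
illustration. The on-disk layout ("<size> <bytes>") only mimics what the
text archive's load() parses.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>

// Stand-in for a ref-counted string: the shared_ptr plays the role of the
// shared buffer. This is an assumption made for illustration only.
typedef std::shared_ptr<const std::string> rc_string;

// Writes "<size> <bytes>", mirroring the layout that load() parses.
void naive_save(std::ostream &os, const rc_string &s) {
    os << s->size() << ' ' << *s;
}

// Reads one string back; every call allocates a fresh buffer.
rc_string naive_load(std::istream &is) {
    std::size_t size = 0;
    is >> size;   // operator>> skips any leading whitespace
    is.get();     // skip separating space
    std::string buf(size, '\0');
    if (size > 0)
        is.read(&buf[0], static_cast<std::streamsize>(size));
    return std::make_shared<std::string>(buf);
}

// Returns true if two strings that shared a buffer before saving
// still share one after a save/load round trip.
bool sharing_survives_round_trip() {
    rc_string a = std::make_shared<std::string>("foo");
    rc_string b = a;                  // b shares a's buffer
    assert(a.get() == b.get());

    std::stringstream ss;
    naive_save(ss, a);
    ss << ' ';
    naive_save(ss, b);

    rc_string a2 = naive_load(ss);
    rc_string b2 = naive_load(ss);
    assert(*a2 == "foo" && *b2 == "foo");
    return a2.get() == b2.get();      // false: each load made its own copy
}
```

The contents survive the round trip, but the two loaded strings no longer
point at the same buffer, which is the effect described above.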

So, here is how I've coded this to test it out.  The test I've just
completed shows that the memory bloat is completely removed --- this
is a major relief, as the bloat was literally expanding by 3-4
gigabytes a process that was already near our VM limit.  Think of this
as just a proof-of-concept, if you like
(boost/archive/impl/text_iarchive_impl.ipp):

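#ifdef LL_STRING_DESERIALIZATION_CACHE
// Cache of every distinct string loaded so far; only the keys matter.
typedef std::map<std::string, bool> ll_cache;
static ll_cache ll_string_cache;

// Release the cached strings once loading is finished.
void nuke_ll_string_cache() {
    ll_string_cache.clear();
}
#endif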

template<class Archive>
BOOST_ARCHIVE_DECL(void)
text_iarchive_impl<Archive>::load(std::string &s)
{
#ifndef LL_STRING_DESERIALIZATION_CACHE
    std::size_t size;
    * this->This() >> size;
    // skip separating space
    is.get();
    // borland de-allocator fixup
    #if BOOST_WORKAROUND(_RWSTD_VER, BOOST_TESTED_AT(20101))
    if(NULL != s.data())
    #endif
        s.resize(size);
    is.read(const_cast<char *>(s.data()), size);
#else
    std::size_t size;
    * this->This() >> size;
    // skip separating space
    is.get();
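    // read into a temporary so the cache can be consulted first
    std::string input_string;
    input_string.resize(size);
    is.read(const_cast<char *>(input_string.data()), size);

    // look for a previously loaded string with the same contents
    ll_cache::iterator i = ll_string_cache.find(input_string);

    if (i == ll_string_cache.end()) {
        std::pair<ll_cache::iterator, bool> x =
            ll_string_cache.insert(std::make_pair(input_string, true));
        i = x.first;
    }

    // assign from the cached key so that s shares its ref-counted buffer
    s = i->first;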
#endif
}

If you have thoughts on how to make this cleaner, without actually
hacking into the actual boost implementation details, that would be
great (if this is what you already suggested about a wrapper, just
say so).

Thanks for the help.

Bill

Bill Lear