On Friday, March 9, 2007 at 10:53:20 (-0800) Robert Ramey writes:
... Define your own serialization for std::string and use it instead of the one in the serialization library. This is probably a bad idea, as it would attribute your special behavior to a standard object and would make your archives and programs non-portable and harder to support if you want to ask us for help.
Definite downsides, true, but I'm not sure that it would be non-portable, except perhaps that I have a different idea of "define your own serialization for std::string". I have done what I considered to be this, and posted it below.
Define your own string class derived from std::string. This string class could be serialized using your own special sauce without losing portability. This could be formulated as a "serialization wrapper" as described in the manual, so that your code would only have to use this "special string" in the process of serialization and not throughout your program. Look in the recent documentation and at the "is_wrapper" type trait for more information.
Ok, I'll have a look at that --- sounds like a reasonable alternative to what I've done.
So now the problem boils down to how you're going to capture and restore the fact that these strings share underlying data. At first one would think that just letting your wrapper class use the default tracking behavior to eliminate duplicates would solve your problem. But I don't think so. As I said above, I don't think that you're serializing the SAME (see above) string one million times. I think you're serializing a million different strings which happen to contain the same data.
It seems to me that you'll have to delve into the implementation of the string class you're using and gain access to the internals of the implementation and figure out how to capture the reference to the shared contents and serialize that.
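To make the distinction above concrete, here is a minimal sketch of address-based tracking. It is not Boost code: `AddressTracker`, `needs_write` and `count_writes` are hypothetical names used only for illustration, and the sketch assumes nothing more than that tracking identifies objects by address rather than by value.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for address-based tracking: an object counts as
// "already saved" only if this exact address has been seen before.
struct AddressTracker {
    std::set<const void*> seen;

    // Returns true if the object is new and would have to be written out.
    bool needs_write(const std::string& s) {
        return seen.insert(&s).second;
    }
};

// Counts how many of three save attempts would actually write data.
int count_writes() {
    AddressTracker tracker;
    std::string a = "foo";
    std::string b = a;  // equal contents, but a distinct object

    int writes = 0;
    if (tracker.needs_write(a)) ++writes;  // first time this address is seen
    if (tracker.needs_write(b)) ++writes;  // different address, written again
    if (tracker.needs_write(a)) ++writes;  // same address as before, skipped
    return writes;
}
```

Under these assumptions `a` and `b` are both written even though their contents are equal, which is why tracking alone would not collapse a million equal-valued strings.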
The strings share data on assign, so:

    string a = "foo";
    string b = a;

means they share the underlying memory "foo", with a logical refcount of 2 (the physical refcount, for implementation reasons, is actually 1). Once you muck with a or b, they get their own copy of the memory, decremented ref count, etc.

If I serialize a and b, and deserialize, the load will "break" this ref count --- I get two "unshared" strings, each with a block of memory "foo". Not the fault of the serialization library, of course ...

So, here is how I've coded this to test it out. The test I've just completed shows that the memory bloat is completely removed --- this is a major relief, as the bloat was literally expanding by 3-4 gigabytes a process that was already near our VM limit. Think of this as just a proof-of-concept, if you like (boost/archive/impl/text_iarchive_impl.ipp):

    #ifdef LL_STRING_DESERIALIZATION_CACHE
    typedef std::map<std::string, bool> ll_cache;
    static std::map<std::string, bool> ll_string_cache;

    void nuke_ll_string_cache() {
        ll_string_cache.clear();
    }
    #endif

    template<class Archive>
    BOOST_ARCHIVE_DECL(void)
    text_iarchive_impl<Archive>::load(std::string &s)
    {
    #ifndef LL_STRING_DESERIALIZATION_CACHE
        std::size_t size;
        * this->This() >> size;
        // skip separating space
        is.get();
        // borland de-allocator fixup
        #if BOOST_WORKAROUND(_RWSTD_VER, BOOST_TESTED_AT(20101))
        if(NULL != s.data())
        #endif
            s.resize(size);
        is.read(const_cast<char *>(s.data()), size);
    #else
        std::size_t size;
        * this->This() >> size;
        // skip separating space
        is.get();
        std::string input_string;
        input_string.resize(size);
        is.read(const_cast<char *>(input_string.data()), size);
        ll_cache::iterator i = ll_string_cache.find(input_string);
        if (i == ll_string_cache.end()) {
            std::pair<ll_cache::iterator, bool> x =
                ll_string_cache.insert(std::make_pair(input_string, true));
            i = x.first;
        }
        s = i->first;
    #endif
    }

If you have thoughts on how to make this cleaner, without actually hacking into the actual boost implementation details, that would be great (if this is what you already suggested about a wrapper, just say so).

Thanks for the help.

Bill
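For reference, the caching idea in the proof-of-concept can be sketched outside of Boost as a small value-keyed pool. This is only an illustration under stated assumptions: `StringPool` and `intern` are hypothetical names, and a `std::set` stands in for the `std::map<std::string, bool>` above because the `bool` is never read. The sketch hands out references to the pooled copy and does not depend on the string implementation sharing memory on assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical pool keeping exactly one stored copy per distinct value.
class StringPool {
public:
    // Returns a reference to the pooled copy of `value`, inserting it on
    // first use. std::set never moves its elements, so the reference
    // stays valid until clear() is called or the pool is destroyed.
    const std::string& intern(const std::string& value) {
        return *pool_.insert(value).first;
    }

    // Number of distinct values currently held.
    std::size_t size() const { return pool_.size(); }

    // Drops all pooled strings (the role played by nuke_ll_string_cache()).
    void clear() { pool_.clear(); }

private:
    std::set<std::string> pool_;
};
```

A loader would call `intern()` on each string it reads. Whether assigning the result to another `std::string` then shares memory, as in the proof-of-concept's `s = i->first;`, depends on a reference-counted string implementation.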