I would first consider something else. It turns out that, for reasons I don't want to go into here, std::string is designated as "primitive". That is, there is no reference counting (by the serialization load), versioning, etc. for std::string objects. This is probably a good choice, except that it makes std::string "special" in comparison to other types. In general, the only other types considered "primitive" are C++ data types that are truly primitive. So, in contrast to the default behavior for the standard collections, if the SAME string is saved twice, an actual copy is saved and restored. This might be what most people expect from a datatype like std::string, but it might be an issue in some applications. Note that by "SAME string" I'm referring to the same datum - not two different strings with the same contents. So the serialization could be the source of your issue if you're saving the SAME string many times.

A closer reading of your post suggests that the above isn't what you're referring to. I left it in just to clarify my thinking on the issue. It sounds to me that you're telling me that the gcc standard library keeps a reference-counted, "copy on write" string implementation, so that if a and b are strings, the operation a = b doesn't result in a duplication of the contents. That would surprise me. But assuming that's the case, serialization would "lose" the reference counts when the strings are recreated without using the "=" operator. If this is the case, here are a couple of ideas to consider.

Define your own serialization for std::string and use it instead of the one in the serialization library. This is probably a bad idea, as it would attribute your special behavior to a standard object and would make your archives and programs non-portable and harder to support if you want to ask us for help.

Define your own string class derived from std::string. This string class could be serialized using your own special sauce without losing portability.
This could be formulated as a "serialization wrapper" as described in the manual, so that your code would only have to use this "special string" in the process of serialization and not throughout your program. Look in the recent documentation and the "is_wrapper" type trait for more information.

So now the problem boils down to how you're going to capture and restore the fact that these strings share underlying data. At first one would think that just letting your wrapper class use the default tracking behavior to eliminate duplicates would solve your problem. But I don't think so. As I said above, I don't think that you're serializing the SAME (see above) string one million times. I think you're serializing a million different strings which happen to contain the same data. It seems to me that you'll have to delve into the implementation of the string class you're using, gain access to the internals of the implementation, and figure out how to capture the reference to the shared contents and serialize that.

Robert Ramey

Bill Lear wrote:
We have a massive amount of data to serialize, on the order of several gigabytes. Lots of strings involved, maybe hundreds of millions.
We discovered that the data structure in memory would bloat enormously when read back in from disk (say, from 2 gig to 3.1 gig). We think we have tracked this down to (gcc implementation) string reference counts not being "restored". I think a solution for us is to do something like the following:
static map<string, bool> string_map;

template <class Archive>
void read_string(Archive& ar, string& a_string) {
    string s;
    ar >> s;  // read from disk
    map<string, bool>::iterator i = string_map.find(s);
    if (i == string_map.end()) {
        // insert() returns a pair<iterator, bool>; keep the iterator
        i = string_map.insert(make_pair(s, true)).first;
    }
    a_string = i->first;  // share the pooled copy (via gcc's COW)
}
void destroy_map() { string_map.clear(); }
Then, when the data structures have all been read, invoke the destroy_map() function to clear the string_map object, thus decrementing the refcounts of all the pooled strings by one.
Has anyone else encountered this and found a solution?
Also, if anyone has bright ideas on a better data structure than std::map to use for storing hundreds of millions of strings at once for the above purpose, that also might be nice.
Thanks.
Bill