[Serialization] Proposal for improved support for ptrs to contained objects + array data

The current implementation of serialization has some limitations when handling contained data that I would like to remove. By contained data, I mean data that exists inside the allocated boundary of a containing object that is also being serialized. For example, in the code:

    class TMyClass { int x; int* y; };

x is contained data of TMyClass. The data pointed to by y is not necessarily contained data (and in general, will not be). Similarly, all the elements in an array of objects are contained data.

One problem in the current implementation is that pointers to contained data can only be serialized after the contained data is serialized. Also, special code needs to be written to handle pointers that point to elements in arrays (special-case code already exists for handling pointers to elements in STL vectors, but this code could also be made more efficient).

Here's a rough proposal for changing serialization to handle these issues. It's not a complete proposal, just a starting point for discussion.

The basic idea is to serialize in two passes. In the first pass, we "walk" the objects using normal serialization order from the root and determine which objects are contained by other objects. In the second pass, we serialize as we do in the current implementation, except that instead of always serializing an object the first time we encounter a pointer to it, we only serialize the object through such a pointer if it is not a contained object.

Below is a more detailed description of the algorithm:

Pass 1:
-----------------
1) For each object to be serialized via a pointer, check whether it is already in the ObjectManager set. If not, add the object's TObjectInfo to the ObjectManager.

    class TObjectInfo {
        int ObjectId;            //consecutively assigned when object is first added
        void* Object;            //beginning of object boundary
        void* ObjectEnd;         //end of object boundary
        TClassInfo* ClassInfo;
        int OwnerId;
        unsigned int OwnerOffset;
    };

    class TClassInfo {
        int ClassId;             //consecutively assigned when first object of class is added
        TSerializationFunctionPtr SerializeFunction;
    };

Data members of oarchive (similar members already exist with somewhat different implementations):

    set<TObjectInfo> ObjectManager;
    set<TClassInfo> ClassManager;
    vector<TObjectInfo*> SortedObjectInfo;

2) Create a vector SortedObjectInfo containing the TObjectInfo entries from the ObjectManager, sorted by object address. Mark every object in this vector that lies within the range of another object by setting its OwnerId (id of the containing object) and OwnerOffset (byte offset of the contained object inside its container). If two objects have the same address, the object with the larger size "contains" the smaller object. If multiple objects contain an object, the largest container is the owner. Generate a warning and/or exception if an object is partially contained but not fully contained by any object, and treat it as an uncontained object (overlapping data will be duplicated).

Pass 2:
-----------------
Starting from the root object again, serialize each non-contained object mostly as in the existing serialization implementation. That is, the first time a pointer to a non-contained object is encountered, write out the actual data, and on subsequent encounters of that pointer, write out the object id. For pointers to contained objects, always write the OwnerId and OwnerOffset. We need some way to differentiate between ObjectIds and OwnerIds when deserializing. I suppose we should be able to use the same mechanism already employed to differentiate between object pointers and actual object data during deserialization.
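
As a rough illustration of step 2 of Pass 1, the sketch below marks owners for a set of address ranges. It is only a sketch under stated assumptions, not the library's code: the struct is a simplified stand-in for TObjectInfo (addresses are held as integers, ClassInfo is omitted), the names MakeObjectInfo, MarkContainedObjects and PartialOverlap are hypothetical, an OwnerId of -1 is assumed to mean "not contained", and identical ranges are assumed not to contain each other. For clarity it uses a simple quadratic scan over the sorted vector; a real implementation would probably want a linear sweep.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for the proposal's TObjectInfo (assumption).
struct TObjectInfo {
    int ObjectId = 0;
    std::uintptr_t Begin = 0;     // address of first byte (Object)
    std::uintptr_t End = 0;       // one past the last byte (ObjectEnd)
    int OwnerId = -1;             // assumed sentinel: -1 means "not contained"
    std::size_t OwnerOffset = 0;  // byte offset inside the owner
    bool PartialOverlap = false;  // hypothetical flag that would drive the warning
};

// Hypothetical helper: build an entry from a pointer and a size in bytes.
inline TObjectInfo MakeObjectInfo(int id, const void* p, std::size_t size) {
    TObjectInfo info;
    info.ObjectId = id;
    info.Begin = reinterpret_cast<std::uintptr_t>(p);
    info.End = info.Begin + size;
    return info;
}

// Sorts by address (larger object first on ties) and sets OwnerId and
// OwnerOffset for every object fully inside a strictly larger object.
inline void MarkContainedObjects(std::vector<TObjectInfo>& objects) {
    std::sort(objects.begin(), objects.end(),
              [](const TObjectInfo& a, const TObjectInfo& b) {
                  if (a.Begin != b.Begin) return a.Begin < b.Begin;
                  return (a.End - a.Begin) > (b.End - b.Begin);
              });
    for (std::size_t i = 0; i < objects.size(); ++i) {
        TObjectInfo& inner = objects[i];
        const std::size_t innerSize = inner.End - inner.Begin;
        std::size_t bestSize = 0;
        // After sorting, any container of `inner` appears before it.
        for (std::size_t j = 0; j < i; ++j) {
            TObjectInfo& outer = objects[j];
            if (outer.End <= inner.Begin) continue;  // disjoint ranges
            const std::size_t outerSize = outer.End - outer.Begin;
            if (outer.End >= inner.End) {
                // Fully covered; the largest container becomes the owner.
                if (outerSize > innerSize && outerSize > bestSize) {
                    bestSize = outerSize;
                    inner.OwnerId = outer.ObjectId;
                    inner.OwnerOffset = inner.Begin - outer.Begin;
                }
            } else {
                // Ranges overlap but neither covers the other.
                outer.PartialOverlap = true;
                inner.PartialOverlap = true;
            }
        }
    }
}
```

In Pass 2, an entry whose OwnerId is still -1 would be written as data on first encounter, while any other entry would be written as an (OwnerId, OwnerOffset) pair. Entries with PartialOverlap set and no owner are the ones the proposal suggests warning about.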

Note: The ObjectId could potentially be eliminated and the original object ptr used as an id instead, if desired, but the reproducibility of using an id seems better for testing. It also makes for faster deserialization, since we can use vector lookup instead of set lookup during deserialization to map between the id/old pointer and the new location for the data.

Deserialization:
-----------------

    class TObjectDependents {
        void* Object;
        vector<void*> ObjectPointersToFixup;
    };

Data member in iarchive:

    std::vector<TObjectDependents> ObjectDependents;

During deserialization of each tracked object, store its newly allocated location in the Object field of its entry in the ObjectDependents vector (indexed by the object's id) and fix up any addresses in ObjectPointersToFixup.

Whenever we encounter an ObjectId during deserialization of an object, check ObjectDependents[ptrId].Object. If it is not null, use this address for the pointer fixup. If Object is null (because the object pointed to has not been loaded yet), save the address of the pointer in that object's list of pointers to be patched, ObjectDependents[ptrId].ObjectPointersToFixup.

Similarly, whenever we encounter an OwnerId, check the ObjectDependents vector, but in this case we need to add the OwnerOffset as part of the fixup process.
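
The deserialization bookkeeping described above could be sketched roughly as follows. This is a minimal sketch under assumptions, not the archive's actual code: the class name TPointerFixups and the functions ResolvePointer and ObjectLoaded are hypothetical, and each pending entry stores the byte offset next to the pointer slot (a small departure from the plain vector<void*> above) so that deferred OwnerId references can add their OwnerOffset once the owner is loaded. A plain ObjectId reference is treated as offset 0.

```cpp
#include <cstddef>
#include <vector>

// One deferred fixup: where the pointer lives and the offset to add (assumption).
struct TPendingFixup {
    void** Slot;
    std::size_t Offset;
};

struct TObjectDependents {
    void* Object = nullptr;                            // new address once loaded
    std::vector<TPendingFixup> ObjectPointersToFixup;  // slots waiting for it
};

// Hypothetical helper class wrapping the ObjectDependents vector.
class TPointerFixups {
public:
    // Called when an id (ObjectId, or OwnerId plus OwnerOffset) is read for
    // the pointer stored at `slot`. The slot must stay at a stable address
    // until the referenced object has been loaded.
    void ResolvePointer(std::size_t id, std::size_t offset, void** slot) {
        TObjectDependents& entry = EntryFor(id);
        if (entry.Object != nullptr) {
            *slot = static_cast<char*>(entry.Object) + offset;
        } else {
            *slot = nullptr;  // patched later by ObjectLoaded
            entry.ObjectPointersToFixup.push_back(TPendingFixup{slot, offset});
        }
    }

    // Called once the object with this id has been allocated and loaded.
    void ObjectLoaded(std::size_t id, void* address) {
        TObjectDependents& entry = EntryFor(id);
        entry.Object = address;
        for (const TPendingFixup& pending : entry.ObjectPointersToFixup) {
            *pending.Slot = static_cast<char*>(address) + pending.Offset;
        }
        entry.ObjectPointersToFixup.clear();
    }

private:
    // Ids are assigned consecutively, so a vector indexed by id suffices.
    TObjectDependents& EntryFor(std::size_t id) {
        if (id >= ObjectDependents.size()) ObjectDependents.resize(id + 1);
        return ObjectDependents[id];
    }

    std::vector<TObjectDependents> ObjectDependents;
};
```

A pointer read before its target is loaded is left null and patched when ObjectLoaded is called for that id. A pointer read afterwards is resolved immediately.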

Dan Notestein wrote:
Here's a rough proposal for changing serialization to handle these issues.
...

For me a one-pass solution was a requirement, even though I didn't state it explicitly. I just presumed that users would object to the extra cost of a second pass. So I never really considered it.

Robert Ramey

On 08/17/2006 04:18 PM, Dan Notestein wrote:
The current implementation of serialization has some limitations when handling contained data that I would like to remove. By contained data, I mean data that exists inside the allocated boundary of a containing object that is also being serialized.

I've got another problem which required a definition of "contained data", only I called it "interior data". However, then I remembered:
http://lists.boost.org/Archives/boost/2006/01/99387.php
which suggests "subobject" may be a better term. Would it be desirable to use "subobject" in your description instead of "contained data" in order to avoid a proliferation of aliased terms?

[snip]

Yes, subobject is definitely the term I was looking for. Hadn't seen this before, thanks for pointing it out!

----- Original Message -----
From: "Larry Evans" <cppljevans@cox-internet.com>
To: <boost@lists.boost.org>
Sent: Friday, August 18, 2006 2:52 PM
Subject: Re: [boost] [Serialization] Proposal for improved support for ptrs to contained objects + array data

participants (3)
- Dan Notestein
- Larry Evans
- Robert Ramey