New subject: [Serialization] Proposal for improved support for ptrs to contained objects + array data

17 Aug 2006

      The current implementation of serialization has some limitations when
handling contained data that I would like to remove. By contained data,
I mean data that exists inside the allocated boundary of a containing
object that is also being serialized. For example, in the code:

class TMyClass { int x; int* y; };

x is contained data of TMyClass. The data pointed to by y is not
necessarily contained data (and in general, will not be).
Similarly, all the elements in an array of objects are contained data.
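
To make the definition concrete, here is a minimal sketch of the
address-range test that "contained" implies. It is only an illustration,
not code from the library; IsContained is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

struct TMyClass { int x; int* y; };

// Hypothetical helper: true if the byte range [p, p + size) lies entirely
// inside the byte range [owner, owner + ownerSize).
inline bool IsContained(const void* p, std::size_t size,
                        const void* owner, std::size_t ownerSize)
{
    assert(p != nullptr && owner != nullptr);
    const char* b  = static_cast<const char*>(p);
    const char* ob = static_cast<const char*>(owner);
    // std::less_equal gives a total order even for unrelated pointers.
    std::less_equal<const char*> le;
    return le(ob, b) && le(b + size, ob + ownerSize);
}
```

With a TMyClass instance obj, the test is true for &obj.x against obj, and
false for obj.y whenever y points outside obj. It is also true for any
element of an array against the array as a whole.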

One problem in the current implementation is that pointers to contained
data can only be serialized after the contained data is serialized.
Also, special code needs to be written to handle pointers that point to
elements in arrays (special case code already exists for handling
pointers to elements in STL vectors, but this code could also be made
more efficient). 

Here's a rough proposal for changing serialization to handle these
issues. It's not a complete proposal, just a starting point for
discussion. The basic idea is to serialize in two passes. In the first
pass, we "walk" the objects using normal serialization order from the
root and determine which objects are contained by other objects. In the
second pass, we serialize as we do in the current implementation, except
that instead of always serializing an object the first time we encounter
a pointer to it, we only do so if the object is not a contained object.
Below is a more detailed description of the algorithm:

Pass 1: 
----------------- 
1) For each object to be serialized via a pointer, check whether it is
already in the ObjectManager set. If not, add the object's TObjectInfo
to the ObjectManager.

struct TClassInfo; // defined below

struct TObjectInfo // struct, so the members are accessible
{
  int ObjectId;    // consecutively assigned when object is first added
  void* Object;    // beginning of object boundary
  void* ObjectEnd; // end of object boundary
  TClassInfo* ClassInfo;

  int          OwnerId;     // id of containing object, if any
  unsigned int OwnerOffset; // byte offset of this object inside its owner
};

struct TClassInfo
{
  int ClassId; // consecutively assigned when first object of class is added
  TSerializationFunctionPtr SerializeFunction;
};

Data members of oarchive (similar members already exist with somewhat
different implementations):

set<TObjectInfo>    ObjectManager;
set<TClassInfo>     ClassManager;
vector<TObjectInfo*> SortedObjectInfo;
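
As a rough sketch of step 1, registration with consecutive ids could look
like the following. Several things here are my assumptions and not part of
the proposal: the class name TObjectRegistry, the use of std::map keyed on
(address, size) in place of set<TObjectInfo>, and -1 as the "no owner"
value. Keying on address alone would conflate an object with its first
member, which is why the size is part of the key. A real implementation
would probably also need the class in the key.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

struct TObjectInfo
{
    int         ObjectId;    // consecutively assigned on first registration
    const void* Object;      // beginning of object boundary
    const void* ObjectEnd;   // one past the last byte of the object
    int         OwnerId;     // -1 until step 2 finds an owner (assumption)
    std::size_t OwnerOffset; // byte offset inside the owner
};

// Hypothetical stand-in for the tracking state kept by oarchive.
class TObjectRegistry
{
    using TKey = std::pair<std::uintptr_t, std::size_t>;

public:
    // Returns the id of the object at p, adding it on the first encounter.
    int Register(const void* p, std::size_t size)
    {
        assert(p != nullptr && size > 0);
        const TKey key(reinterpret_cast<std::uintptr_t>(p), size);
        auto it = Objects.find(key);
        if (it != Objects.end())
            return it->second.ObjectId;          // already tracked
        const int id = static_cast<int>(Objects.size());
        const char* begin = static_cast<const char*>(p);
        Objects.emplace(key, TObjectInfo{id, p, begin + size, -1, 0});
        return id;
    }

    std::size_t Size() const { return Objects.size(); }

private:
    std::map<TKey, TObjectInfo> Objects;
};
```

Registering the same (address, size) pair twice returns the id assigned
the first time, so the ids stay consecutive and reproducible from run to
run as long as the walk order is the same.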

2) Create a vector SortedObjectInfo containing the TObjectInfo from the
ObjectManager, sorted by object address, and mark all objects in this
vector that are contained in the range of other objects by setting their
OwnerId (id of containing object) and OwnerOffset (byte offset of
contained object inside container). If two objects have the same
address, the object with the larger size "contains" the smaller object.
If multiple objects contain an object, the largest container is the
owner. Generate a warning and/or exception if an object is partially
contained, but not fully contained, by any object, and treat it as an
uncontained object (overlapping data will be duplicated).
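
Step 2 can be sketched as a sort followed by a single sweep. This is one
possible reading of the rules above, not a definitive implementation. The
assumptions are mine: addresses are held as std::uintptr_t (Begin/End) so
they can be compared safely, OwnerId == -1 means "not contained", the
function name MarkContainedObjects is hypothetical, and partial overlaps
are only counted where real code would warn or throw.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TObjectInfo
{
    int            ObjectId;
    std::uintptr_t Begin;       // address of the first byte
    std::uintptr_t End;         // one past the last byte
    int            OwnerId;     // -1 = not contained (assumption)
    std::size_t    OwnerOffset; // byte offset inside the owner
};

// Sets OwnerId/OwnerOffset on every object and returns the number of
// partially overlapping objects that were treated as uncontained.
inline int MarkContainedObjects(std::vector<TObjectInfo>& objects)
{
    std::vector<TObjectInfo*> sorted;
    for (TObjectInfo& o : objects) sorted.push_back(&o);

    // Sort by address; at equal addresses the larger object comes first,
    // so that it "contains" the smaller one.
    std::sort(sorted.begin(), sorted.end(),
        [](const TObjectInfo* a, const TObjectInfo* b) {
            if (a->Begin != b->Begin) return a->Begin < b->Begin;
            if (a->End != b->End)     return a->End > b->End;
            return a->ObjectId < b->ObjectId;
        });

    int partialOverlaps = 0;
    const TObjectInfo* top = nullptr; // uncontained object reaching farthest
    for (TObjectInfo* o : sorted) {
        assert(o->Begin <= o->End);
        o->OwnerId = -1;
        o->OwnerOffset = 0;
        if (top && o->End <= top->End) {
            // Fully inside the outermost container seen so far.
            o->OwnerId = top->ObjectId;
            o->OwnerOffset = o->Begin - top->Begin;
            continue;
        }
        if (top && o->Begin < top->End)
            ++partialOverlaps;        // starts inside top, ends outside it
        top = o;                      // o becomes the new outermost object
    }
    return partialOverlaps;
}
```

Because only the outermost uncontained object is tracked, nested
containers (A contains B contains C) all get A as owner, which matches
"the largest container is the owner".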

Pass 2:
-----------------
Starting from the root object again, serialize each non-contained object
mostly as in the existing serialization implementation. That is, the
first time a non-contained object pointer is encountered, write out the
actual data, and on subsequent encounters of that pointer, write out the
object id. For pointers to contained objects, always write the OwnerId
and OwnerOffset. We need some way to differentiate between ObjectIds and
OwnerIds when deserializing. I suppose we should be able to use the same
mechanism already employed to differentiate between object pointers and
actual object data during deserialization.
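
The decision made for each pointer in Pass 2 can be sketched as below.
The TPointerRecord tag is my assumption about how the three cases might
be told apart in the archive; the proposal leaves that mechanism open.
ClassifyPointer and the "written" set are likewise hypothetical names,
and OwnerId >= 0 is assumed to mean "contained".

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical tag written ahead of each pointer record.
enum class TPointerRecord
{
    ObjectData,  // first encounter of a non-contained object: data follows
    ObjectIdRef, // later encounter: only the ObjectId follows
    OwnerRef     // contained object: OwnerId and OwnerOffset follow
};

struct TObjectInfo
{
    int         ObjectId;
    int         OwnerId;     // -1 = not contained (assumption)
    std::size_t OwnerOffset;
};

// Decide what to emit for a pointer to the object described by info.
// written holds the ids of non-contained objects already written out.
inline TPointerRecord ClassifyPointer(const TObjectInfo& info,
                                      std::set<int>& written)
{
    if (info.OwnerId >= 0)
        return TPointerRecord::OwnerRef;    // never serialized via pointer
    if (written.insert(info.ObjectId).second)
        return TPointerRecord::ObjectData;  // first time: write the data
    return TPointerRecord::ObjectIdRef;     // afterwards: write the id
}
```

Note that a contained object never enters the written set: its data is
written only as part of its owner, regardless of how many pointers to it
are encountered or in what order.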

Note: The ObjectId could potentially be eliminated and the original
object ptr used as an id instead, if desired, but the reproducibility of
using an id seems better for testing, and it also makes for faster
deserialization, since we can use vector lookup instead of set lookup
during deserialization to map between the id/old pointer and the new
location for the data.

Deserialization:
-----------------
class TObjectDependents
{
  void* Object;                        // new location; null until loaded
  vector<void*> ObjectPointersToFixup; // addresses of pointers to patch
};

Data member in iarchive:
std::vector<TObjectDependents> ObjectDependents;

During deserialization of each tracked object, store its newly allocated
location in the Object field of its entry in the ObjectDependents vector
(indexed using the object's id) and fix up any addresses in
ObjectPointersToFixup.

Whenever we encounter an ObjectId during deserialization of an object,
check ObjectDependents[ptrId].Object. If not null, use this address for
the pointer fixup. If Object is null (because the object pointed to has
not been loaded yet), save the address of the pointer in that object's
list of pointers to be patched,
ObjectDependents[ptrId].ObjectPointersToFixup.

Similarly, whenever we encounter an OwnerId, check the ObjectDependents
vector, but in this case we need to add the OwnerOffset as part of the
fixup process.
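
Here is a minimal sketch of the fixup table described above, covering
both the ObjectId case (offset 0) and the OwnerId case (offset equal to
OwnerOffset). The names TFixupTable, TPendingFixup, ResolvePointer and
ObjectLoaded are hypothetical. Storing an offset alongside each pending
pointer is my extension of the vector<void*> field, so that forward
references to contained objects can be patched too. It also assumes that
the pointers being patched do not move before the fixup happens.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct TPendingFixup
{
    void**      Pointer; // address of the pointer to patch
    std::size_t Offset;  // 0 for an ObjectId, OwnerOffset for an OwnerId
};

struct TObjectDependents
{
    void* Object = nullptr;             // null until the object is loaded
    std::vector<TPendingFixup> Pending; // pointers waiting for Object
};

class TFixupTable
{
public:
    // Called when a pointer record (id, offset) is read from the archive.
    void ResolvePointer(int id, std::size_t offset, void** pointer)
    {
        TObjectDependents& d = Entry(id);
        if (d.Object)
            *pointer = static_cast<char*>(d.Object) + offset;
        else
            d.Pending.push_back({pointer, offset}); // patch later
    }

    // Called once the object with this id has its new location.
    void ObjectLoaded(int id, void* object)
    {
        TObjectDependents& d = Entry(id);
        d.Object = object;
        for (const TPendingFixup& f : d.Pending)
            *f.Pointer = static_cast<char*>(object) + f.Offset;
        d.Pending.clear();
    }

private:
    // Vector lookup by id, growing the table on demand.
    TObjectDependents& Entry(int id)
    {
        assert(id >= 0);
        const std::size_t index = static_cast<std::size_t>(id);
        if (index >= Table.size())
            Table.resize(index + 1);
        return Table[index];
    }

    std::vector<TObjectDependents> Table;
};
```

A pointer read before its target is loaded stays untouched until
ObjectLoaded is called for that id; a pointer read afterwards is resolved
immediately.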


Dan Notestein

