Serialization and Editing Archives

22 Apr 2006

I've been looking at the posts regarding the Property Tree Library.

I haven't had the time to give it a proper review.  Fortunately
it seems that there are a good number of people willing to help
with this.

I would like to make some observations regarding the serialization
library and the creation of "editable" archives.  This seems to me
the central issue being discussed.

The fundamental notion that the serialization library is meant to
address is the deconstruction of an arbitrary set of C++ data
structures into a series of bytes in such a way that the original
set can be restored at another time and place.

The requirements for such a library were:

a) completeness - basically no constraints on the C++ data structures
to be saved/restored.

b) factoring out the actual storage of the bytes in such a way
that this could be changed by the user to suit his application.

c) minimal effort - the library has to leverage the structure
of the data as described by the C++ code itself rather than
requiring some extra specification as to how all the
data is organized and related.
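
To make requirements (b) and (c) concrete, here is a minimal sketch
using only the standard library.  The names toy_oarchive, toy_iarchive,
gps_position and round_trip are made up for illustration and are not
the library's actual classes; the point is only that one serialize()
function describes the data, and the archive class decides how the
bytes are stored.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Toy saving archive (hypothetical, not part of Boost): writes each
// value as whitespace-separated text to whatever stream it is given.
class toy_oarchive {
    std::ostream& os_;
public:
    explicit toy_oarchive(std::ostream& os) : os_(os) {}
    template <class T>
    toy_oarchive& operator&(T& value) { os_ << value << ' '; return *this; }
};

// Toy loading archive: reads the values back in the same order.
class toy_iarchive {
    std::istream& is_;
public:
    explicit toy_iarchive(std::istream& is) : is_(is) {}
    template <class T>
    toy_iarchive& operator&(T& value) { is_ >> value; return *this; }
};

// The class describes its own layout once; the same function drives
// both saving and loading, so no separate format description exists.
struct gps_position {
    int degrees = 0;
    int minutes = 0;
    double seconds = 0.0;

    template <class Archive>
    void serialize(Archive& ar) { ar & degrees & minutes & seconds; }
};

// Save a position to an in-memory stream and restore it into a
// fresh object.
inline gps_position round_trip(gps_position original) {
    std::stringstream storage;
    toy_oarchive out(storage);
    original.serialize(out);

    gps_position restored;
    toy_iarchive in(storage);
    restored.serialize(in);
    assert(!storage.fail());  // all three fields were read back
    return restored;
}
```

Swapping std::stringstream for a file stream, or the toy archives for
binary ones, would not touch gps_position at all.  Note also that the
archive contents only make sense to code that knows the order of the
members, which is the dependency discussed below.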

This goal has been in large part achieved with the help of many
people on this list.  The set of requirements was demanding, and
required lots of code to implement.  Much of that code had to rely
on more demanding techniques such as template metaprogramming.  Also,
in some cases, extra information had to be appended to "reflect"
information necessary for the implementation.  BOOST_CLASS_EXPORT is
an example of this.  In addition, a number of user customizable
attributes had to be defined to permit the library to be applied as
widely as was reasonable to expect.  I know not everyone is totally
happy with the library, but all in all I'm satisfied that a good
balance has been struck.

The goal and requirements dictate that the resultant archive (however
it is stored) is dependent on the C++ data structures defined at
compile time and invoked at runtime.

In general, one cannot expect that any changes in an archive will be
possible without changing the C++ data structures and code to which
the changes correspond.

Note this is the reverse of the most common situation, exemplified
by doing something like parsing an XML file.  In that case
one will load the data into some dynamic data structure which will
reflect the structure of the data being loaded.  At that point some code
will have to traverse the structure to extract or set the data in memory.
Note that there is no requirement that the C++ data structures map
in any special way to the file.  Also the file doesn't have to include
as much "detail" because it doesn't have to reconstruct an arbitrary
set of C++ data structures.

And of course we see and will continue to see a proliferation of file
formats and programming modules to support them.

Some attempts have been made to map serialization to some formats
which permit arbitrary data - XDR and a couple of others.  I don't
think they've been successful because part of the file format is
really held in the C++ code which saves and loads the archives.  The
requirement to be able to arbitrarily edit an archive clearly cannot
be met.

But then ... it seems tantalizingly close - I mean, hell,
it's XML, I can read it!  So could anything be done?  I haven't
spent too much time on this but I can speculate some.  Suppose
one wanted to create an editable XML format.  First of all
one will have to accept the fact that not all editing is going to be
possible.  One won't be able to make a new class which the
C++ code doesn't implement.  It might be possible to do something
like the following:

a) parse the whole XML file with a standard parser
b) as the serialization algorithm visits each data structure, rather than
just presuming that all the data is there in sequence, it would check
to see if the tag has been loaded.  This would permit the following
types of editing:
types of editing:

i) deletions
ii) changing data

A little more work would permit expanding / contracting collections.
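
Steps (a) and (b) might look roughly like the following sketch.  It is
only an illustration under assumptions: a std::map of tag name to text
stands in for the output of the XML parser, and tolerant_loader,
window_config and the tag names are invented for the example, not
taken from the library.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Stand-in for step (a): the whole file already parsed into
// tag name -> text.  A real version would hold a parsed XML tree.
using parsed_tags = std::map<std::string, std::string>;

// Hypothetical loading archive for step (b).
class tolerant_loader {
    const parsed_tags& tags_;
public:
    explicit tolerant_loader(const parsed_tags& tags) : tags_(tags) {}

    // Look the tag up by name instead of assuming it comes next in
    // sequence.  Returns false and leaves value untouched when the
    // tag was deleted or its text no longer parses.
    template <class T>
    bool load(const std::string& name, T& value) const {
        auto it = tags_.find(name);
        if (it == tags_.end()) return false;      // tag was deleted
        std::istringstream in(it->second);
        T parsed{};
        if (!(in >> parsed)) return false;        // malformed edit
        value = parsed;                           // changed data
        return true;
    }
};

// The defaults are what survive when a tag has been deleted.
struct window_config {
    int width = 640;
    int height = 480;
    std::string title = "untitled";   // single word in this sketch

    template <class Loader>
    void serialize(Loader& ar) {
        ar.load("width", width);
        ar.load("height", height);
        ar.load("title", title);
    }
};

// Someone edited "width" and deleted "height" in the file.
inline void example_deleted_tag() {
    parsed_tags tags{{"width", "1024"}, {"title", "main"}};
    tolerant_loader loader(tags);
    window_config cfg;
    cfg.serialize(loader);
    assert(cfg.width == 1024);   // changed data is picked up
    assert(cfg.height == 480);   // deleted tag keeps the default
    assert(cfg.title == "main");
}
```

Note what the sketch cannot do: a tag that the C++ code never asks for
is silently ignored, which is the limit on editing mentioned above.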

I really don't know how much work this would end up being but no
one has had sufficient interest to undertake it - or perhaps they have
but it hit a dead end.

To summarize, I see two almost irreconcilable paths:

a) Archive formats reflect the set of C++ data structures which
generate them.
b) A format for arbitrary data (e.g. XML) is defined and a general -
not application specific - data structure (e.g. tree) is defined.
Then the application walks the tree in an application dependent
manner.

Basically only one thing can be boss - either the C++ data structures
in the app or the file format.  One can drive the other but they
can't be varied independently.
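
For contrast, path (b) can be sketched as below.  This is a minimal
illustration, not the property tree library's interface: node,
find_child, read_int and machine_settings are all hypothetical names.
Here the file format is boss, and the mapping onto the application's
own types is hand-written code that has to be kept in sync with it.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// General-purpose tree: knows nothing about the application's classes.
struct node {
    std::string name;
    std::string value;
    std::vector<node> children;
};

// Find a direct child by name, or nullptr if the file lacked it.
inline const node* find_child(const node& parent, const std::string& name) {
    for (const node& child : parent.children)
        if (child.name == name) return &child;
    return nullptr;
}

// Read an integer child, falling back when missing or malformed.
inline int read_int(const node& parent, const std::string& name, int fallback) {
    const node* child = find_child(parent, name);
    if (child == nullptr) return fallback;
    std::istringstream in(child->value);
    int parsed = 0;
    return (in >> parsed) ? parsed : fallback;
}

// Application-specific part: this walk is what has to be rewritten
// whenever either the file layout or the struct changes.
struct machine_settings {
    int speed = 0;
    int retries = 0;
};

inline machine_settings settings_from_tree(const node& root) {
    machine_settings s;
    s.speed = read_int(root, "speed", 100);
    s.retries = read_int(root, "retries", 3);
    return s;
}

inline void example_tree_walk() {
    node root{"machine", "",
              {node{"speed", "250", {}},
               node{"comment", "edited by hand", {}}}};
    machine_settings s = settings_from_tree(root);
    assert(s.speed == 250);    // taken from the file
    assert(s.retries == 3);    // absent in the file, fallback used
}
```

The tree happily carries a "comment" node the application never looks
at, and the application expects a "retries" node the file never had.
Neither side knows about the other until the walk is written.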

It seems to me that the property tree library is driven by the need
to be able to edit one's "serialized" data outside the main
application.  My personal view is that if one feels that he needs to
do this - he should be rethinking his application.

A huge portion of applications are really variations on just this
theme: load some data, edit it, save data.  And it's not just word
processors or things like that.  Machines start up, run, and shut
down.  A customer record is accessed, updated, and saved, etc.  So I
think many more applications can be considered in terms of
serialization than in fact are.  The applications edit data.

There can also be the case where different applications edit the
same data.  I might make an application which loads configuration,
runs a machine and saves the data from the run.  Someone else needs
an application to browse and/or annotate the data.  One could say -
oh, then we need the data in some application independent form such
as XML.  Well, that's one way - but then we have to recode all over
again to get and save the data.  Or we can use something like a text
editor, but then we are casting aside all the facility of using a
computer in the first place.  Finally, we can make a new application
which uses the original C++ serialization code and data structures -
which hopefully have been compiled into a library.  Now we just use
our special editor for that "other" application.  To me - definition
of a file format independent of the C++ code which is going to use it
has wasted billions of man years of human brain time just trying to
keep things in sync.

So, I see the whole XML approach as a misguided attempt to separate the
fundamentally inseparable - data and code - they are one now.  I predict
that XML will be made continually more elaborate and spawn ever more
variations in vain attempts to reconcile the irreconcilable.  Remember
you heard it here first - XML and its progeny are destined for the ash
heap of history! (hmmm - like COBOL?)

For someone who has nothing else to do there are interesting things to
look at here.  What would be very interesting would be an xml_archive
which created in parallel an (XML?)_schema which would be used
by a generic editor to edit the xml (and maybe other) data in the
archive within the bounds of what is permitted.  I see this as quite
doable.

Another idea would be a generic editor which could be compiled
from the serialization code which would exploit the "reflected"
data about the C++ and perhaps define a little bit more.  For
example I could imagine a template for each datatype which
would invoke a function to edit it.  On a GUI, just compile
the generic editor with your own serialization code and BAM -
you've got an application specific editor which runs at full
runtime speed.
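
A rough sketch of that generic editor idea follows, with the GUI left
out.  Everything here is hypothetical: editor_archive, edit_field,
customer_record and the field() member are invented for illustration,
and a std::map of typed-in text stands in for whatever widgets a real
editor would show.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Stand-in for what the user typed into the GUI, keyed by field name.
using user_input = std::map<std::string, std::string>;

// One edit function per datatype.  A real editor would open a
// suitable control here; std::stoi and std::stod throw on bad text.
inline void edit_field(int& value, const std::string& typed) {
    value = std::stoi(typed);
}
inline void edit_field(double& value, const std::string& typed) {
    value = std::stod(typed);
}
inline void edit_field(std::string& value, const std::string& typed) {
    value = typed;
}

// "Archive" that neither saves nor loads: it visits every member the
// serialization code names and offers it for editing.
class editor_archive {
    const user_input& input_;
    std::vector<std::string> visited_;
public:
    explicit editor_archive(const user_input& input) : input_(input) {}

    template <class T>
    void field(const std::string& name, T& value) {
        visited_.push_back(name);                 // every field is shown
        auto it = input_.find(name);
        if (it != input_.end()) edit_field(value, it->second);
    }
    const std::vector<std::string>& visited() const { return visited_; }
};

// The application's own class and serialization code, unchanged.
struct customer_record {
    int id = 0;
    std::string name;
    double balance = 0.0;

    template <class Archive>
    void serialize(Archive& ar) {
        ar.field("id", id);
        ar.field("name", name);
        ar.field("balance", balance);
    }
};

inline void example_edit() {
    customer_record rec;
    rec.id = 42;
    rec.name = "Smith";
    rec.balance = 10.0;
    user_input typed{{"name", "Jones"}, {"balance", "12.5"}};
    editor_archive editor(typed);
    rec.serialize(editor);
    assert(rec.id == 42);            // not edited
    assert(rec.name == "Jones");
    assert(rec.balance == 12.5);
}
```

The editor can only ever offer the fields that serialize() visits, so
it stays within the bounds of what the C++ code permits without any
separate schema.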

Just a couple of miscellaneous observations:

a) It seems there is a perception that the serialization library is
large, cumbersome and inefficient.  I believe it can be used
inefficiently (like all libraries) and it can be made faster (and
incremental improvements are being made).  But I don't think this
perception is true in general.  When I compile in release mode,
make sure I haven't included the debug symbols, and use the DLL
(which isn't all that big), a program which saves and loads 4
different types of archives with several classes only takes 30 KB.
So if you're not getting similar results, take a closer look.

b) much has been made of things like - you can't turn off tracking etc.
and the difficulty of modifying the library.  I believe that most of
these difficulties come about by trying to implement some external
requirement - which I've described above as trying to reconcile the
irreconcilable.  If one turns off tracking - then how is one going to
recover pointers?  By default, something is tracked only if it's used
as a pointer somewhere.  And why would one have pointers in something
like configuration parameters?  And if tracking is turned off and
something is duplicated - then what if one edits only one of the
copies?  etc.  It's not so much that I want to discuss tracking - it's
more that I think these kinds of questions reveal that things
originally thought to be almost identical, e.g. an XML archive and a
general XML file, are more different than they first appear.  So when
this occurs it's time to step back and look at the big picture.  (BTW
I believe it would be easy to suppress the object_id etc. in the xml
archive just by overriding with stubs the serializations for these
types - but then how would one read the archive?)
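
To show why the tracking question matters, here is a toy sketch of the
idea.  It is an assumption-laden illustration, not how the library
implements tracking: tracking_saver and the "id=", "ref=" and "value="
text are invented for the example.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Toy saver (hypothetical): with tracking on, the first visit to an
// address writes the object with an id and later visits write only a
// reference to that id.  With tracking off, every visit writes a copy.
class tracking_saver {
    std::ostringstream out_;
    std::map<const void*, int> ids_;
    bool tracking_;
public:
    explicit tracking_saver(bool tracking) : tracking_(tracking) {}

    void save_pointer(const int* p) {
        if (!tracking_) {
            out_ << "value=" << *p << ' ';        // duplicate the data
            return;
        }
        auto it = ids_.find(p);
        if (it != ids_.end()) {
            out_ << "ref=" << it->second << ' ';  // already saved
            return;
        }
        int id = static_cast<int>(ids_.size());
        ids_.emplace(p, id);
        out_ << "id=" << id << " value=" << *p << ' ';
    }

    std::string str() const { return out_.str(); }
};

// Two pointers to the same object are saved one after the other.
inline std::string save_shared_twice(bool tracking) {
    int shared = 7;
    tracking_saver saver(tracking);
    saver.save_pointer(&shared);
    saver.save_pointer(&shared);
    return saver.str();
}

inline void example_tracking() {
    assert(save_shared_twice(true) == "id=0 value=7 ref=0 ");
    assert(save_shared_twice(false) == "value=7 value=7 ");
}
```

In the untracked output the loader has no way to know the two values
were ever the same object, and someone editing the file by hand could
change one copy and not the other.  In the tracked output the ids are
exactly the "detail" that a general XML file would not carry.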

Anyway - I'm sorry if this is a little rambling and off topic

Robert Ramey
