RE: Serialization Library format stability

Walter Landry wrote:
I am thinking of using your serialization library as part of another project [1]. It will be writing files that need to be read by multiple operating systems far into the future. So I was wondering, how stable is the output format? I will be serializing maps, lists, strings, bitsets, and some simple classes. Do you foresee any possible changes in the disk format of those kinds of objects?
This is an exceptionally good question. I'm amazed that no one has raised it before now. If you don't mind, I will post it to the list, as I don't believe others have given it the importance I believe it deserves. Anyway, here's my answer:

It is one of the stated goals of this system to produce portable archives as well as portable code. By portable archives I mean archives that can save data structures on one platform (compiler/os/library) and recover them on another. This is not as easy as it might seem. I have taken specific steps in the design and implementation to achieve this:

a) My preferred archive - the simple text archive - saves/loads primitives as text. This is inherently portable across platforms. XML is also portable. The included binary archive is not portable. The signature processing code will detect egregious violations of portability and trap with an exception, but this is not an exhaustive check. There is a demo which includes a portable binary archive; floating point primitives would have to be added to make it complete. Ralf W Grosse-Kunstleve has submitted code that would address this, but I haven't spent any time with it.

b) I have taken extra pains to ensure that certain things are serialized the same way across platforms. Take std::list<int> for example. Practical efficiency considerations suggest that it should be stored without class information. The simple way of specifying this depends on the use of TPS, which not all environments support, so I have used a different method to accomplish it.

c) The archive signature contains a library format version index. If the archive format changes, newer code can include a branch to deal with the older format. I'm hoping it will never be necessary to use this, but it's there - by the way, it's now at #3. This would be a fallback.

Possible problems:

a) In practice, true archive portability will require more.
For example, suppose I save an integer on a 64-bit machine to a text archive and load it on a machine which supports 32 bits. This would work fine - until the 64-bit machine saves an integer > 2^32. The problem wouldn't be detected until the archive is loaded - way too late. Note that wchar_t, which is serialized as an integer, is 4 bytes on gcc and 2 bytes everywhere else, so it's possible - even likely - that this problem would surface where you least expect it.

b) There are subtler problems that can arise. On some machines size_t is a macro that resolves to an integer, while on others it's a true type. Saving a datum as an integer in one environment while recovering it as a type (with a class id) on another machine creates surprises. Worse (or maybe better), this issue can look different for different archives. Text archives save/load book-keeping info (class-id, object-id, etc.) in sequence, so differences of this type will cause problems. XML archives include this information as attributes in the XML start tag. Attributes are retrieved by tag rather than by sequence (thanks to Spirit), so XML archives might work where the normal text archive fails.

The good news is that this problem has been considered a goal from the very start, so all the code has been written to accommodate it. The bad news is that this facility hasn't been exhaustively tested. What's needed is a separate/additional test suite which tests archive portability. This would mean a set of "canonical archives". The save test would create new archives and compare each one to its canonical counterpart. The load test would load the canonical archive and check that we recovered what we expected. Implementation of such tests would reveal a host of small errors. For example, int64_t is just a typedef with gcc on a 64-bit platform, while it's a separate type on a 32-bit platform. This breaks compilation of the polymorphic archive at compile time. In itself, that's not too hard to resolve.
But it requires careful thought to resolve it once and for all.

Conclusion and recommendation:

So, if you want to use the serialization library for your project, I would suggest the following.

a) Just start doing it. The serialization library saves an incredible amount of coding, and you can effectively just get on with the other, more interesting parts of your project.

b) When it's time to build your app on another platform and transfer archives between environments, you can rethink the issue. This is basically an economic argument. At that point:

i) You will have invested little effort in serialization, so if it were deemed unsuitable, you would have lost little if you decide to replace it with something else.

ii) Maybe you'll get lucky and someone else will have made the archive portability test suite, and the archives you're interested in can be "guaranteed" portable.

iii) If ii) above hasn't happened, you'll have to make your own archive portability test suite. The cost of this is also zero, because if you use some alternative to the serialization library, you'll still have to test the passing of serialized data between platforms in any case. If you do make an archive portability test suite, please consider adding it to the library and its tests.

(Hmm - you're an academic - making such a test suite would be a great small project for one of your students. Important, educational, small enough to do in a reasonable time, orthogonal to your main project goal - whatever that might be.)

Thanks for considering the serialization library for your project.

Robert Ramey