RE: Serialization Library format stability

Walter Landry wrote:
I am thinking of using your serialization library as part of another project [1]. It will be writing files that need to be read by multiple operating systems far into the future. So I was wondering, how stable is the output format? I will be serializing maps, lists, strings, bitsets, and some simple classes. Do you foresee any possible changes in the disk format of those kinds of objects?
This is an exceptionally good question. I'm amazed that no one has raised it before now. If you don't mind, I will post it to the list, as I don't believe others have given it the importance I believe it deserves. Anyway, here's my answer:

It is one of the stated goals of this system to produce portable archives as well as portable code. By portable archives I mean archives that can save data structures on one platform (compiler/os/library) and recover them on another. This is not as easy as it might seem. I have taken specific steps in the design and implementation to achieve this:

a) My preferred archive - the simple text archive - saves/loads primitives as text. This is inherently portable across platforms. XML is also portable. The included binary archive is not portable. The signature processing code will detect egregious violations of portability and trap with an exception, but this is not an exhaustive check. There is a demo which includes a portable binary archive; floating point primitives would have to be added to make it complete. Ralf W Grosse-Kunstleve has submitted code that would address this, but I haven't spent any time with it.

b) I have taken extra pains to ensure that certain things are serialized the same way across platforms. Take std::list<int> for example. Practical efficiency considerations suggest that it should be stored without class information. The simple way of specifying this depends on the use of TPS, which not all environments support, so I have used a different method to accomplish it.

c) The archive signature contains a library format version index. If the archive format changes, newer code can include a branch to deal with the older format. I'm hoping it will never be necessary to use this, but it's there - by the way, it's now at #3. This would be a fallback.

Possible problems:

a) In practice, true archive portability will require more.
For example, suppose I save an integer on a 64-bit machine to a text archive and load it on a machine which supports 32 bits. This would work fine - until the 64-bit machine saves an integer > 2^32. The problem wouldn't be detected until the archive is loaded - way too late. Note that wchar_t, which is serialized as an integer, is 4 bytes on gcc and 2 bytes everywhere else, so it's possible - even likely - that this problem would surface where you least expect it.

b) There are subtler problems that can arise. On some machines size_t is a macro that resolves to an integer, while on others it's a true type. Saving a datum as an integer in one environment while recovering it as a type (with a class id) on another machine creates surprises. Worse (or maybe better), this issue can look different for different archives. Text archives save/load book-keeping info (class-id, object-id, etc.) in sequence, so differences of this type will cause problems. XML archives include this information as attributes in the XML start tag. Attributes are retrieved by tag rather than by sequence (thanks to Spirit), so XML archives might work where the normal text archive fails.

The good news is that this problem has been considered a goal from the very start, so all the code has been written to accommodate it. The bad news is that this facility hasn't been exhaustively tested. What's needed is a separate/additional test suite which tests archive portability. This would mean a set of "canonical archives". The save test would create new archives and compare each one to its canonical counterpart. The load test would load the canonical archive and check that we recovered what we expected. Implementation of such tests would reveal a host of small errors. For example, int64_t is just a typedef with gcc on a 64-bit platform, while it's a separate type on a 32-bit platform. This breaks compilation of the polymorphic archive at compile time. In itself, that's not too hard to resolve.
But it requires careful thought to resolve it once and for all.

Conclusion and recommendation:

So, if you want to use the serialization library for your project, I would suggest the following.

a) Just start doing it. The serialization library saves an incredible amount of coding, and you can effectively just get on with the other, more interesting parts of your project.

b) When it's time to build your app on another platform and transfer archives between environments, you can rethink the issue. This is basically an economic argument. At that point:

i) You will have invested little effort in serialization, so if it were deemed unsuitable, you would have lost little if you decide to replace it with something else.

ii) Maybe you'll get lucky and someone else will have made the archive portability test suite, and the archives you're interested in can be "guaranteed" portable.

iii) If ii) above hasn't happened, you'll have to make your own archive portability test suite. The cost of this is also zero, because if you use some alternative to the serialization library, you'll still have to test the passing of serialized data between platforms in any case. If you do make an archive portability test suite, please consider adding it to the library and its tests.

(Hmm - you're an academic - making such a test suite would be a great small project for one of your students. Important, educational, small enough to do in a reasonable time, orthogonal to your main project goal - whatever that might be.)

Thanks for considering the serialization library for your project.

Robert Ramey