[serialization] portable_binary_archive

Hi Robert, boost...

I've implemented a portable binary archive and am wondering if there is general interest. I've tested it very thoroughly on our target platforms, which are 32 and 64 bit Linux and OS X on a G5; each platform reads and writes archives written by the others, down to the bit.

The archive itself is stored little-endian (as the vast majority of my users are on Intel hardware), so big-endian platforms are responsible for byte-swapping, though this could be made configurable. The problem types are (unsigned) long int and long double. For long int, the archive checks whether the value is small enough to fit into 32 bits; if so it stores it in 32 bits, otherwise it throws. For long double it just throws, as that type is 12 bytes on some platforms and 16 on others, there is no good way to handle this, and the type is seldom used anyhow. Float and double are assumed to be IEEE 754 and are handled as blocks of 32 or 64 bits. So if the write succeeds, the read is guaranteed to succeed.

Part of the idea here is that, without resorting to widespread use of int32_t and friends, platforms with 4-byte longs are always able to write archives readable on platforms with 8-byte longs. On 8-byte-long platforms you can write archives readable on the smaller machines, but it is best to choose either int or long long int, or a typedef thereof (int32_t or int64_t).

The code is at http://svn.resophonic.com/pub/boost, in

    boost/archive/detail/portable_binary_archive.hpp
    boost/archive/portable_binary_iarchive.hpp
    boost/archive/portable_binary_oarchive.hpp
    libs/serialization/src/portable_binary_iarchive.cpp
    libs/serialization/src/portable_binary_oarchive.cpp
    libs/serialization/test/portable_binary_archive.cpp

Issues that come to mind: The test isn't Boost.Test; that would need rewriting. I haven't tried any of this on Microsoft compilers. Maybe there is some preferred way to detect endianness. The decision to go little-endian was relatively arbitrary; maybe the byte-swapping should be configurable. A possibility I considered was to upgrade all 32 bit longs to 64 bits and do the range-check/throw when *reading* (we nixed this idea, preferring to have the error come up sooner rather than later); maybe this should be configurable as well. There are surely platforms out there with other differences that aren't handled; I don't have a good feel for whether they would make the archive useless to a wider audience. The tests require having access to binary archives written on other platforms; I haven't yet thought about how to put that together in the boost build/test context.

So have a look, let me know if you're interested. Thanks again for a great lib.

troy d. straszheim
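For concreteness, a minimal sketch of the range-check-and-store policy described above; the function name and the use of std::overflow_error are illustrative, not the posted code:

    #include <boost/cstdint.hpp>
    #include <ostream>
    #include <stdexcept>

    // Sketch: store a long as 32 little-endian bits, throwing at
    // write time if the value doesn't fit (names are illustrative).
    void save_portable_long(std::ostream & os, long l) {
        if (l > 2147483647L || l < -2147483647L - 1)
            throw std::overflow_error("long does not fit in 32 bits");
        boost::uint32_t v =
            static_cast<boost::uint32_t>(static_cast<boost::int32_t>(l));
        char bytes[4];
        for (int i = 0; i < 4; ++i)      // least significant byte first
            bytes[i] = static_cast<char>((v >> (8 * i)) & 0xff);
        os.write(bytes, 4);              // big-endian hosts swap on load
    }

On a platform where long is already 32 bits the range check can never fire, which matches the "write always succeeds on 4-byte-long platforms" property claimed above.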

There might be people interested. You might upload it to the vault. You don't have to write a test program; you can use all the Boost archive test programs with your own archives by following the instructions in the documentation under Reference / Archive Class Reference / Testing. The batch file run_archive_test.bat will run all the tests with your archive.

Robert Ramey

Robert Ramey wrote:
There might be people interested. You might upload it to the vault.
You don't have to write a test program; you can use all the Boost archive test programs with your own archives by following the instructions in the documentation under Reference / Archive Class Reference / Testing. The batch file run_archive_test.bat will run all the tests with your archive.
Right, I've used this testing architecture: the portable_binary_(i|o)archive and the polymorphic version pass all the existing serialization tests, alongside the regular binary, text, and xml archives. (I just got all that done; it's sitting here on disk.) So that much is good.

But of course the real test of such an archive is to verify that you can correctly read on platform A what was written on platform B, and vice versa. I've done this testing (probably to the point of overkill), but again with a different testing scheme. One thing inherent to the problem is that you are required to have portable binary archives checked in to source control.

So there is a directory, $SOMEWHERE/test/portable_archives/, in which there are several subdirectories, each of which contains a file called "xml" and one called "pbin". On platform P, when you run the tests, you first create a directory $SOMEWHERE/test/portable_archives/<my_hostname>, serialize a bunch of stuff out to both the file "xml" (in xml, naturally) and "pbin" (in portable binary), and close them. Then you go through all the other directories under $SOMEWHERE/test/portable_archives (each of which is named after some other host, preferably on a different architecture), reading in from "xml" and "pbin" and comparing the results in memory. Those should all pass, and at the end the "pbin" files in all the subdirectories should have the same checksum.

One would probably want to change this a bit, as the hostname was convenient but doesn't tell you anything about the architecture. One might want to manufacture a name for the binary that describes the architecture in terms of the primitive type sizes, for instance

    E-C-S-I-L-LL-F-D.(pbin|xml)

where E is 0 for big-endian and 1 for little-endian, C is sizeof(char), S is sizeof(short), I is sizeof(int), L is sizeof(long), etc.; see the sketch below.

OR, maybe you really don't think all this is worth fooling with, that's cool too, I dunno. -t
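A minimal sketch of building that architecture tag at runtime; the function name and exact field order are just the illustration above, not settled convention:

    #include <sstream>
    #include <string>

    // Build the E-C-S-I-L-LL-F-D tag suggested above
    // (1 = little-endian, 0 = big-endian).
    std::string architecture_tag() {
        const int one = 1;
        const bool little = *reinterpret_cast<const char *>(&one) == 1;
        std::ostringstream tag;
        tag << (little ? 1 : 0)
            << '-' << sizeof(char)  << '-' << sizeof(short)
            << '-' << sizeof(int)   << '-' << sizeof(long)
            << '-' << sizeof(long long)
            << '-' << sizeof(float) << '-' << sizeof(double);
        return tag.str();  // e.g. "1-1-2-4-4-8-4-8" on 32-bit x86
    }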

troy d. straszheim wrote:
Right, I've used this testing architecture: the portable_binary_(i|o)archive and the polymorphic version pass all the existing serialization tests, alongside the regular binary, text, and xml archives.
But of course the real test of such an archive is to verify that you can correctly read on platform A what was written on platform B, and vice versa. I've done this testing (probably to the point of overkill), but again with a different testing scheme. One thing inherent to the problem is that you are required to have portable binary archives checked in to source control.
This is something that my test suite should test but doesn't. Regardless of what happens with your portable_binary archive, it would be very helpful to have the test suite enhanced to explicitly test archive portability. Since we're just talking, I would like to see the following changes in the test suite:

a) Output of the tests goes to a particular directory. There would be a directory for each column in the test matrix.

b) There would be a set of directories for each boost release: 1.32, 1.33, ...

c) Tests would be run on the four combinations which result from:
   i) the set of archives created by the previous version
   ii) the set of archives created by other compilers (portable archives only)

This would result in a HUGE number of tests, so it couldn't be run any more than on an occasional basis. But it would be helpful to run very occasionally. This would require more investigation into Boost.Test in order to find out what facilities are already there for helping out with this.

d) An alternative to the above might be just to make some specific tests for backward compatibility and for cross compiler compatibility. This is probably a better idea, as it would result in tests that could be run more frequently.

e) The tests move to the Boost unit test library rather than the test exec library.

f) A number of tests leave memory leaks. I would prefer to fix these if it were possible to do so without making the tests so complex that they no longer serve as convincing tests.

Just food for thought.

Robert Ramey

Robert Ramey wrote:
This is something that my test suite should test but doesn't. Regardless of what happens with your portable_binary archive, it would be very helpful to have the test suite enhanced to explicitly test archive portability.
Since we're just talking, I would like to see the following changes in the test suite:
a) Output of the tests goes to a particular directory. There would be a directory for each column in the test matrix.
b) There would be a set of directories for each boost release: 1.32, 1.33, ...
c) Tests would be run on the four combinations which result from:
   i) the set of archives created by the previous version
   ii) the set of archives created by other compilers (portable archives only)
This would result in a HUGE number of tests, so it couldn't be run any more than on an occasional basis. But it would be helpful to run very occasionally.
This would require more investigation into Boost.Test in order to find out what facilities are already there for helping out with this.
Yeah, hmm. It's a lot of work, more tedious than difficult. Since we're just talking, here's some thinking out loud on the subject, for sanity-check purposes.

There are a few different classes of tests. The compile-fail ones don't need modification, of course. A great many contain the following or similar, one or more times:

    int test_main( int /* argc */, char* /* argv */[] ) {
        const char * testfile = boost::archive::tmpnam(NULL);
        BOOST_REQUIRE(NULL != testfile);
        BOOST_CHECKPOINT("something");
        S s_out = create_some_serializable_data_structure();
        {
            test_ostream os(testfile, TEST_STREAM_FLAGS);
            test_oarchive oa(os);
            oa << boost::serialization::make_nvp("S", s_out);
        }
        S s_in;
        {
            test_istream is(testfile, TEST_STREAM_FLAGS);
            test_iarchive ia(is);
            ia >> boost::serialization::make_nvp("S", s_in);
        }
        BOOST_CHECK(s_out == s_in);
        return EXIT_SUCCESS;
    }

One would conceivably want to factor these out so that the read-in-and-compare section happens multiple times, once per boost version and per compiler. For portable archive types, it would happen for all boost versions, compilers *and* platforms. An additional check is that at the end, all portable archive types have the same checksum.

tmpnam() would change to, say, persistent_testfile(). Many of the test modules create multiple archives; you could deal with this by keeping track of some index and appending it to the file name. So you would end up with files like

    $SERIALIZATION_TESTDATA_DIR/BOOST_PLATFORM/BOOST_VERSION/BOOST_COMPILER/__FILE__.INDEX.ARCHIVETYPE

for instance

    (testdata)/BeOS/103300/Gnu_C++_version_4.0.1/test_something.cpp.2.xml

which would be the third file written by tests inside test_something.cpp.
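A minimal sketch of such a persistent_testfile(), using the real BOOST_PLATFORM, BOOST_VERSION, and BOOST_COMPILER macros from Boost.Config; the signature and path layout are only the illustration above, not a settled interface:

    #include <boost/config.hpp>   // BOOST_PLATFORM, BOOST_COMPILER
    #include <boost/version.hpp>  // BOOST_VERSION
    #include <sstream>
    #include <string>

    // Build a persistent, platform-tagged path for the index-th
    // archive written by the named test.
    std::string persistent_testfile(const char * test_name,
                                    int index,
                                    const char * archive_type) {
        std::ostringstream path;
        path << "testdata/" << BOOST_PLATFORM << '/' << BOOST_VERSION
             << '/' << BOOST_COMPILER << '/'
             << test_name << '.' << index << '.' << archive_type;
        return path.str();  // spaces in BOOST_COMPILER would need escaping
    }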
d) An alternative to the above might be just to make some specific tests for backward compatibility and for cross compiler compatibility. This is probably a better idea, as it would result in tests that could be run more frequently.
I would think that you would in fact want to run the full battery of tests periodically, or maybe constantly, on a couple of machines dedicated to the purpose; but I agree that the average Joe who checks something out of CVS and runs the tests shouldn't be expected to sit through the whole lot. All that looks doable to me.

I wonder about storage. If you're really going to keep all that data around, the question is where, and how you are going to get at it. One could easily put together a little web interface, where the test suites use a little python script to post and retrieve test data. Let me see what I can come up with. -t

I implemented some of that plan, to see how it would go. I tarred up the test/ directory for you to peruse; it's at http://www.resophonic.com/boost-serialization-test.tar.gz

I added a parameter to the jamfile, BOOST_SERIALIZATION_DATA_ROOT, which gets set to BOOST_ROOT/libs/serialization/test/data by default. I messed with test/test_tools.hpp and added a routine test_serialization(), where the action is. It writes a type out to an archive, reads it back in, and calls a template function check(), whose default implementation is just

    BOOST_CHECK(what_was_written_out == what_was_read_in);

I revamped test_vector.cpp, test_simple_class.cpp, test_set.cpp, test_null_ptr.cpp, and test_variant.cpp. In the process I converted these to use the cool autoregistering unit tests. There are corresponding tweaks in the jamfile.

test_serialization("name_of_test", object_to_test) will create tempfiles in, as mentioned,

    BOOST_SERIALIZATION_DATA_ROOT/BOOST_PLATFORM/BOOST_VERSION/BOOST_COMPILER/BOOST_ARCHIVE_TEST/name_of_test

so this "name_of_test" has to be unique. My first implementation had it calculating a name based on __FILE__ and an index, but this would cause problems when you move test routines around.

test_vector.cpp and test_simple_class.cpp show how much boilerplate code disappears with this scheme. test_variant.cpp and test_set.cpp show writing a specialization of check() for the type being tested. In test_null_ptr.cpp you have to write a wrapper class to ensure that the two pointer types get written in the right order. This probably comes up a lot; it makes me think that one would like to be able to serialize boost::tuple (that one sounds interesting and tractable, I'll have a look) so that you could just stack things up that way.

So I believe this clears the way for comparing different boost versions, compilers, and portable archives across platforms... test_serialization just needs to scan some directories, deserialize from foreign archives, and compare. I think that even if you don't do all that, the changes are an improvement maintenance-wise.

Whaddaya think? -t
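To make the shape concrete, a minimal sketch of such a helper; it is shown with text archives so it is self-contained, whereas the tarball's version is parameterized on archive type and builds the persistent data path described above:

    #include <boost/archive/text_iarchive.hpp>
    #include <boost/archive/text_oarchive.hpp>
    #include <boost/serialization/nvp.hpp>
    #include <boost/test/test_tools.hpp>
    #include <fstream>
    #include <string>

    // Default comparison; write a specialization for types (like the
    // set and variant tests) that need something fancier.
    template <class T>
    void check(const T & written, const T & read) {
        BOOST_CHECK(written == read);
    }

    // Write out, read back, compare. T must be default-constructible.
    template <class T>
    void test_serialization(const std::string & name, const T & out) {
        const std::string file = name + ".archive";  // placeholder path
        {
            std::ofstream ofs(file.c_str());
            boost::archive::text_oarchive oa(ofs);
            oa << boost::serialization::make_nvp("T", out);
        }
        T in;
        {
            std::ifstream ifs(file.c_str());
            boost::archive::text_iarchive ia(ifs);
            ia >> boost::serialization::make_nvp("T", in);
        }
        check(out, in);
    }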

On Sep 14, 2005, at 2:57 AM, troy d. straszheim wrote:
The archive itself is stored little-endian (as the vast majority of my users are on Intel hardware), so big-endian platforms are responsible for byte-swapping, though this could be made configurable. The problem types are (unsigned) long int and long double. For long int, the archive checks whether the value is small enough to fit into 32 bits; if so it stores it in 32 bits, otherwise it throws.
This seems to be fine if all you want is compatibility at the level of the least common denominator. By checking whether the value of an integer fits into 32 bits, you make this archive useless for people who might need compatibility between 64 bit platforms.

What do you think about the following idea: the portable binary archive implements serialization for the fixed length integers (int32_t, int64_t, ...), and it is the user's responsibility to use only the 32 bit integers if they want portability to all 32 bit platforms. Also watch out that on some platforms even short and int are 64 bit.

Matthias
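One reading of this suggestion, as a minimal sketch; every name here is invented for illustration, and only the output side is shown:

    #include <boost/cstdint.hpp>
    #include <cstddef>
    #include <ostream>

    // The archive offers save() only for fixed-width integers, so
    // callers must commit to int32_t or int64_t before the code
    // even compiles.
    class fixed_width_oarchive {
        std::ostream & os_;
        void write_le(boost::uint64_t v, std::size_t n) {
            for (std::size_t i = 0; i < n; ++i)   // little-endian on disk
                os_.put(static_cast<char>((v >> (8 * i)) & 0xff));
        }
    public:
        explicit fixed_width_oarchive(std::ostream & os) : os_(os) {}
        void save(boost::int32_t v) {
            write_le(static_cast<boost::uint32_t>(v), 4);
        }
        void save(boost::int64_t v) {
            write_le(static_cast<boost::uint64_t>(v), 8);
        }
        // deliberately no save(long) or save(int): width must be explicit
    };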

Matthias Troyer wrote:
This seems to be fine if all you want is compatibility at the level of the least common denominator. By checking whether the value of an integer fits into 32 bits, you make this archive useless for people who might need compatibility between 64 bit platforms. What do you think about the following idea: the portable binary archive implements serialization for the fixed length integers (int32_t, int64_t, ...), and it is the user's responsibility to use only the 32 bit integers if they want portability to all 32 bit platforms. Also watch out that on some platforms even short and int are 64 bit.
Note that the version of portable_binary that is included with the demo addresses this issue, even though it's not explicitly stated. It stores the integers in a compiler-independent format (a length followed by that many bytes) using save_binary, and restores them using load_binary. If it turns out that the saved integer_type from platform A is too large to fit into the loaded integer_type on platform B, it throws a run-time exception.

Now if one uses, say, int32_t, then he is guaranteed that on both platforms the integer will fit in 32 bits, so the exception can never be thrown. (Well, almost never: if someone stored a 64 bit integer in an int32_t on platform A it would be a problem, but that's really a programming error of the user, and in any case one would get an exception.) So in my view, handling int32_t and its brethren shouldn't be any sort of issue.

Robert Ramey
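A minimal sketch of that length-prefixed layout, not the demo code itself; names are illustrative, the unsigned case only is shown, and stream error handling is elided:

    #include <istream>
    #include <ostream>
    #include <stdexcept>

    // Write the number of significant bytes, then the bytes,
    // least significant first.
    void save_integer(std::ostream & os, unsigned long x) {
        char buf[sizeof(unsigned long)];
        unsigned char n = 0;
        while (x) {                      // strip leading zero bytes
            buf[n++] = static_cast<char>(x & 0xff);
            x >>= 8;
        }
        os.put(static_cast<char>(n));    // length prefix
        os.write(buf, n);
    }

    // Reassemble; throw if the saved value can't fit on this platform.
    unsigned long load_integer(std::istream & is) {
        unsigned char n = static_cast<unsigned char>(is.get());
        if (n > sizeof(unsigned long))
            throw std::overflow_error("integer too large for this platform");
        char buf[sizeof(unsigned long)];
        is.read(buf, n);
        unsigned long x = 0;
        for (unsigned i = 0; i < n; ++i)
            x |= static_cast<unsigned long>(
                     static_cast<unsigned char>(buf[i])) << (8 * i);
        return x;
    }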

Why not use an existing portable binary format? Perhaps http://hdf.ncsa.uiuc.edu/

On Sep 15, 2005, at 7:36 PM, Neal Becker wrote:
Why not use an existing portable binary format? Perhaps http://hdf.ncsa.uiuc.edu/
In our physics applications we currently use the XDR format with Boost serialization. See 'man xdr' on a Unix system.
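For readers unfamiliar with XDR, a minimal round trip through the Sun RPC XDR library (available on most Unix systems; error handling mostly elided), just to show the flavor of the format:

    #include <rpc/xdr.h>   // Sun XDR; see 'man xdr'
    #include <cstdio>

    int main() {
        char buf[64];
        XDR xo;
        xdrmem_create(&xo, buf, sizeof(buf), XDR_ENCODE);
        int i_out = 42;
        double d_out = 3.25;
        if (!xdr_int(&xo, &i_out) || !xdr_double(&xo, &d_out))
            return 1;                       // 4 + 8 bytes, big-endian

        XDR xi;
        xdrmem_create(&xi, buf, sizeof(buf), XDR_DECODE);
        int i_in = 0;
        double d_in = 0;
        if (!xdr_int(&xi, &i_in) || !xdr_double(&xi, &d_in))
            return 1;
        std::printf("%d %g\n", i_in, d_in); // prints "42 3.25"
        return 0;
    }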

Matthias Troyer wrote:
In our physics applications we currently use the XDR format with Boost serialization. See 'man xdr' on a Unix system.
Thanks very much for your comments... this is very interesting. I looked back at the list archives and see this xdr archive discussed as far back as the serialization code review. I'd love to see the code. I certainly prefer the notion of XDR over HDF, given that XDR appears comparatively lightweight, mature, and is everywhere: external dependencies cause many of my biggest headaches. And if it makes this portable binary archive thing unnecessary, great.
This seems to be fine if all you want is compatibility at the level of the least common denominator. By checking whether the value of an integer fits into 32 bits, you make this archive useless for people who might need compatibility between 64 bit platforms.
Right. The approach is purposefully least common denominator; the thinking is that the error should occur as soon as possible. For instance, one could have everything padded out to 64 bits, but then it could easily happen (in my world) that somebody writes a terabyte of simulation data that can't be read on a 32 bit machine and then throws a temper tantrum. But perhaps idiot-proofing to this degree isn't warranted for a boost component, given that it ruins 64bit<->64bit portability...

This could easily be made configurable, though: the archive could store a list of sizes for each primitive type in the header (as the binary archive does already), and the iarchive could upsize/downsize and range-check as necessary. The constructor of an oarchive could take size modifiers as arguments:

    ostream os;
    portable_binary_oarchive(os, size_of<short>(4), size_of<int>(8));

or something like that; a sketch follows below. You could put an endianness flag in there as well. One can then force some types to be written with smaller type sizes as needed, and if one does nothing, one has compatibility between machines with equivalent primitive type sizes. Not clear to me what a good default would be.

While I'm thinking out loud here, it crossed my mind to allow changing the save or load sizes of primitive types on an archive that is already open, with something iomanip-ish, e.g.

    my_portable_binary_oarchive << setsize<short>(4);

but I rejected that, as maintaining this state would take a prohibitive amount of space, and it doesn't go with the "you don't pay for what you don't use" philosophy.
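A minimal sketch of what that size_of<> modifier could look like; everything here is invented for illustration, not an actual interface:

    #include <cstddef>
    #include <ostream>

    // Tag carrying "store T in n bytes" into the archive constructor.
    template <class T>
    struct size_of {
        std::size_t bytes;
        explicit size_of(std::size_t n) : bytes(n) {}
    };

    struct sketch_oarchive {
        std::size_t short_bytes;
        std::size_t int_bytes;
        sketch_oarchive(std::ostream &,
                        size_of<short> s = size_of<short>(sizeof(short)),
                        size_of<int> i = size_of<int>(sizeof(int)))
            : short_bytes(s.bytes), int_bytes(i.bytes) {}
        // save() overloads would consult short_bytes/int_bytes and
        // range-check before writing.
    };

    // usage, as in the message above:
    //   sketch_oarchive oa(os, size_of<short>(4), size_of<int>(8));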
What do you think about the following idea: the portable binary archive implements serialization for the fixed length integers (int32_t, int64_t, ...), and it is the user's responsibility to use only the 32 bit integers if they want portability to all 32 bit platforms. Also watch out that on some platforms even short and int are 64 bit.
It sounds good to me; this is essentially the least common denominator approach, no? I can't help thinking that there should be a way to get 64 bit integers through there as well, via int64_t, but this is frustrating me at the moment; I think I've just been staring at this code too long. I'm wondering whether I wouldn't rather abandon it if there is (like this xdr_archive) a good alternative already in use. Thanks, -t
participants (4): Matthias Troyer, Neal Becker, Robert Ramey, troy d. straszheim