serialization archive types: random_iarchive, pretty_oarchive

Hey all, Robert specifically --

I've gotten fascinated with all the uses for this serialize() method, and I'm kicking around ideas for a couple of archive types. The use case is an "Event" that occurs in a neutrino detector. This event is very big: it contains containers of smart pointers to containers of maps of smart pointers to... you get the idea. 15k lines of just containers of data.

First notion is a pretty_oarchive. Currently I have people implementing operator<<(), which calls a virtual member function ToStream() (so reference-to-base works) for debugging/hacking purposes, and if everything has a serialize method anyway, it would be a big savings to forget the ToStream() stuff and say

pretty_oarchive(cout) << my_class;

where pretty_oarchive is something like xml_oarchive, but with the formatting somehow factored out and modified. I admit that other than the header and start/end tags I haven't looked at this too closely, because I assumed it had been discussed; thought I'd check for showstoppers first.

random_iarchive (written already): The other issue is that with such a large event, you'd want to be able to verify that it makes the round trip to/from the archive intact. Obviously boost::serialization has thorough test suites, but with people constantly tinkering around in these classes, I was hoping to get a test suite going that does exactly what we're doing. You'd like to be able to just

Event written_e, read_e;
binary_oarchive oa;
binary_iarchive ia;
random_iarchive ria;
ria >> written_e;
oa << written_e;
// close, open as input
ia >> read_e;
assert(written_e == read_e);

where random_iarchive sets fundamental types to random values, expands containers to some random length (random() % MAX_RANDOM_LENGTH), and creates a T when it sees shared_ptr<T>. I've written this already; a sketch of the idea follows below. Of course you have to write your operator==()'s so that they have value semantics, that is, they compare what is on the ends of their component shared_ptrs, not whether these pointers point to the same object or not.

The original idea was to set the serialize() method and the operator==() against one another for verification... but if people forget to modify both, there's no way to tell from the test suites, it seems. It kind of started as an experiment, and now that it is written it looks like there's no way to keep users from shooting themselves in the foot like this.

Another problem is that if a class contains vector<shared_ptr<Base> >, you'd like to be able to populate this with shared_ptr<Derived>, where Derived is randomly selected from the set of classes that inherit from Base. Since serialization requires these classes to be registered, it seemed to me there might be a way to do this. But maybe it's all overkill.

Anyhow, this random_iarchive exists (except for the Base/Derived thing, above); maybe it would make a good tutorial case for custom serialization archives, maybe people want to use it for something. I'd be more than glad to write up some tutorial material, I'm sure I'd get a lot out of it.

As usual, thanks for your time and for a killer library...

-t
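P.S. For concreteness, here's a stripped-down sketch of the idea behind random_iarchive. This is not the real Boost.Serialization archive interface (the real thing derives from the library's archive base classes and hooks user types through their serialize() methods); the names and the std::rand() calls are only illustrative:

#include <cstdlib>
#include <vector>
#include <boost/shared_ptr.hpp>

const unsigned MAX_RANDOM_LENGTH = 16; // assumed cap on container sizes

struct random_filler {
    // fundamental types get random values
    void load(int &x)    { x = std::rand(); }
    void load(double &x) { x = std::rand() / double(RAND_MAX); }

    // containers are expanded to a random length, each element
    // filled recursively
    template <typename T>
    void load(std::vector<T> &v) {
        v.resize(std::rand() % MAX_RANDOM_LENGTH);
        for (typename std::vector<T>::iterator i = v.begin(); i != v.end(); ++i)
            load(*i);
    }

    // a shared_ptr<T> gets a freshly created T, also filled recursively
    template <typename T>
    void load(boost::shared_ptr<T> &p) {
        p.reset(new T);
        load(*p);
    }
};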

troy d. straszheim wrote:
Hey all, Robert specifically --
I've gotten fascinated with all the uses for this serialize() method, and I'm kicking around ideas for a couple of archive types. The use case is an "Event" that occurs in a neutrino detector. This event is very big: it contains containers of smart pointers to containers of maps of smart pointers to... you get the idea. 15k lines of just containers of data.
I love hearing this - I always wanted to be associated with particle physics. LOL
First notion is a pretty_oarchive. Currently I have people implementing operator<<(), which calls a virtual member function ToStream() (so reference-to-base works) for debugging/hacking purposes, and if everything has a serialize method anyway, it would be a big savings to forget the ToStream() stuff and say
pretty_oarchive(cout) << my_class;
where pretty_oarchive is something like xml_oarchive, but with the formatting somehow factored out and modified. I admit that other than the header and start/end tags I haven't looked at this too closely because I assumed it had been discussed, thought I'd check for showstoppers first.
Of course you know that is straightforward - and you have the xml archives that can be used as examples. If you just want to display the information and not load it, a bunch of tags like object id, etc. can be suppressed. Personally I would just use the xml_archive and concentrate my efforts on a program that displays XML in a convenient and perhaps customizable way. I suspect you could find or make a suitable program of that nature for free or at low cost. To re-iterate, I would factor the "pretty display" from the serialization and make it customizable according to the kind of display required. In fact, if I had nothing else to do, and had that much interest, I would make an enhanced version of xml_archive that would output TWO files: a) the xml_archive and b) an xml_schema which could be used by other programs to parse the xml_archive. Just random thoughts.
random_iarchive (written already):
The other issue is that with such a large event, you'd want to be able to verify that it makes the round trip to/from the archive intact.
Obviously boost::serialization has thorough test suites, but with people constantly tinkering around in these classes, I was hoping to get a test suite going that does exactly what we're doing. You'd like to be able to just
Event written_e, read_e;
binary_oarchive oa;
binary_iarchive ia;
random_iarchive ria;
ria >> written_e;
oa << written_e;
// close, open as input
ia >> read_e;
assert(written_e == read_e);
where random_iarchive sets fundamental types to random values, expands containers to some random length (random() % MAX_RANDOM_LENGTH), and creates a T when it sees shared_ptr<T>. I've written this already. Of course you have to write your operator==()'s so that they have value semantics, that is they compare what is on the ends of their component shared_ptrs, not whether these pointers point to the same object or not. The original idea was to set the serialize() method and the operator==() against one another for verification... but if people forget to modify both, there's no way to tell from the test suites, it seems. It kind of started as an experiment and now that it is written it looks like there's no way to keep users from shooting themselves in the foot like this.
I'm not sure I'm convinced of this. I recommend the following when you make a new archive:

a) run the code module for the new archive through Gimple LINT and fix up the obvious oversights.
b) make a file similar to text_archive.hpp in the test directory for your new archive - new_archive.hpp
c) modify the Jamfile in the serialization test directory to include your new archive
d) invoke the batch/script file run_archive_test <compiler> <new_archive.hpp>

This will run all the serialization tests against your new archive. It takes a while - but it's worth it.

I recommend the following when you make a new serializable class:

a) run the code module for the new serializable class through Gimple LINT and fix up the obvious oversights.
b) using the other tests as a basis, make a new test for your new serializable class.
c) in the course of this you may have to make additions to your new class, such as operator==, or you might not. Perhaps a global operator==(const T &lhs, const T &rhs) might be added just to the test.
d) add the test for your new class to the Jamfile in serialization/test
e) invoke the batch/shell script runtest <compiler> to generate a table of all tests including your new one.

These tests will run your new class against all currently defined archives. This is important, as some archives are not sensitive to some errors. For example, tagged XML can recover from some errors whereas the more efficient native binary cannot.

Even if you only use one particular compiler for the application you ship, I would recommend building and running all tests on at least two pretty good, different compilers. For example, gcc 3.4(?) and VC 7.1 is a good combination. This will often uncover subtle ambiguities that would otherwise linger on for years inflicting programmer pain. I have to say the single most important thing I've learned from boost is that it's cheaper to maintain the test suite and build for several compilers than it is to debug the application. bjam (which DOES drive me crazy) is a godsend for doing this kind of thing.
Another problem is that if a class contains vector<shared_ptr<Base> >, you'd like to be able to populate this with shared_ptr<Derived>, where Derived is randomly selected from the set of classes that inherit from Base. Since serialization requires these classes to be registered, it seemed to me there might be a way to do this. But maybe it's all overkill.
If you don't find the above sufficient, then it's not overkill. As I said, the pain of writing the test is nothing compared to shipping a product with a bug.
Anyhow, this random_iarchive exists (except for the Base/Derived thing, above), maybe it would make a good tutorial case for custom serialization archives, maybe people want to use it for something. I'd be more than glad to write up some tutorial material, I'm sure I'd get a lot out of it.
As I said, I'm not convinced that the random test data should be part of the archive class. But I'm certainly pleased that someone finds the serialization library sufficiently useful and interesting to do stuff like this. So if you want to polish this up and add it to the Files section on SourceForge I think it would be great.

FYI, I'm trying to cut back on the time spent on boost in general, and the serialization library in particular. However, I have to confess I'm sort of a boost addict. (Is there a support group for this?) I've already added and tested the following in the main CVS:

a) serialization library as a DLL
b) serialization of variant - hmmm - I think that's yours

I have no idea when boost 1.33 will come out - it's not up to me. In the meantime I'm working on a couple of things at a leisurely pace:

a) more formal documentation.
b) test for serialization of classes implemented in DLLs. This is supported in the current code but hasn't been tested - so it likely doesn't work.
c) demo for serialization of classes implemented in DLLs. This will likely require an enhancement to extended_type_info in order to include class factory functionality similar to COM and CORBA. At this point just a small enhancement will be required.
d) documentation for the archive adaptor, including demo and test
e) memoization_archive - an archive adaptor which does a deep copy using the serialize templates. This also requires some extra help from extended_type_info.
f) demo for using serialization as a debug and/or database transaction rollback/roll-forward logger. This requires a small enhancement in the basic archive classes to permit suppression of object tracking for an archive.

So, anyone who wants to take on one of these things is welcome to contact me. Note that the serialization library supports compilers going back to borland 5.51, MSVC 6.?, and gcc 2.95, and I want to maintain that. So if anyone wants a piece of this, you'll have to sign up for that too.

Robert Ramey

Robert Ramey wrote:
I love hearing this - I always wanted to be associated with particle physics. LOL
Heh heh... *sigh*. Don't... don't get me started. :)
Of course you know that is straightforward - and you have the xml archives that can be used as examples. If you just want to display the information and not load it, a bunch of tags like object id, etc. can be suppressed. Personally I would just use the xml_archive and concentrate my efforts on a program that displays XML in a convenient and perhaps customizable way. I suspect you could find or make a suitable program of that nature for free or at low cost. To re-iterate, I would factor the "pretty display" from the serialization and make it customizable according to the kind of display required.
In fact, if I had nothing else to do, and had that much interest, I would make an enhanced version of xml_archive that would output TWO files: a) the xml_archive and b) an xml_schema which could be used by other programs to parse the xml_archive. Just random thoughts.
Sure, that much I've got: suppressing the object id tags, overriding save() methods for pointers so that they get "skipped", all that. I guess I'm talking more about factoring out the markup within the serialization library itself. It would be easy to just copy/paste the entire xml_ thing, rename the classes and change the tags and so forth, but of course this would be despicable nastiness. Better to do something like an nvp_archive that referred to some kind of formatting policy class, with xml_archive as nvp_archive<xml_formatting_policy>... or something like that (a hypothetical sketch follows below).

This could also get you the ability to do SpitAndDuctTapeNVPML, or, say, some kind of binary_nvp_archive, basically the same as XML but without all the ascii bloat. We would certainly find this handy, as we never know if we might be forced to convert to XML at some point, but there's just so much data that we can't afford the ascii bloat in our storage. But of course you can just zip the xml stuff, and a binary_nvp_archive is a lot more work than just factoring tags and indentation out of xml_archive... OK, I've gone off on a tangent. Never mind. And your points about where to focus the effort are well taken.

Anyway, the purpose isn't visualization after the program has run, it is more like

pretty(log_stream) << my_particle;

in the code itself. I'm catering to printf-style debugging. This is what the Beakers (jargon for "Physicists"... think Dr. Bunsen Honeydew and his assistant...) like to do, and since I'm ripping out the old serialization method (from the ROOT analysis toolkit, which involves running your headers through a quasi-compiler which generates serialization functions, and which pukes the moment it sees anything in namespace boost...), and since the beakers will react violently to this at first, it would be good to toss them a bone as well, like "you get ToStream(ostream&) for free". Now that I see memoization_archive, I see I can give them something else, too....

But to focus on reformatting as you suggest, I could conceivably do it with some kind of xml-reformatting stream: make a convenience function that wraps the insertion into xml_oarchive(stringstream) and the pass through the tag-removing reformatter. Would be a good opportunity to play with iostreams. Yeah, sounds good. OK.
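Hypothetically, the factoring might look something like the following. Every name here is invented; it's only meant to show where the policy boundary would sit:

#include <string>

// markup comes from a policy; the archive itself would own the
// name-value-pair traversal, object ids, tracking, and so on
struct xml_format {
    static std::string start_tag(const std::string &name) { return "<" + name + ">"; }
    static std::string end_tag(const std::string &name)   { return "</" + name + ">"; }
};

struct pretty_format {
    static std::string start_tag(const std::string &name) { return name + ": "; }
    static std::string end_tag(const std::string &)       { return ""; }
};

template <typename FormatPolicy>
class nvp_oarchive {
    // common nvp machinery, calling FormatPolicy for the markup
};

typedef nvp_oarchive<xml_format>    xml_style_oarchive;
typedef nvp_oarchive<pretty_format> pretty_oarchive;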
I'm not sure I'm convinced of this.
I recommend the following when you make a new archive
a) run the code module for the new archive through Gimple LINT and fix up the obvious oversights.
b) make a file similar to text_archive.hpp in the test directory for your new archive - new_archive.hpp
c) modify the Jamfile in the serialization test directory to include your new archive
d) invoke the batch/script file run_archive_test <compiler> <new_archive.hpp>
This will run all the serialization tests against your new archive. It takes a while - but it's worth it.
Sure. I did this with variant, it works great.
I recommend the following when you make a new serializable class.
a) run the code module for the new serializable class through Gimple LINT and fix up the obvious oversights.
b) using the other tests as a basis, make a new test for your new serializable class.
c) in the course of this you may have to make additions to your new class, such as operator==, or you might not. Perhaps a global operator==(const T &lhs, const T &rhs) might be added just to the test.
d) add the test for your new class to the Jamfile in serialization/test
e) invoke the batch/shell script runtest <compiler> to generate a table of all tests including your new one.

These tests will run your new class against all currently defined archives. This is important, as some archives are not sensitive to some errors. For example, tagged XML can recover from some errors whereas the more efficient native binary cannot.
I've already learned from experience to have the test suites run on all archive types automatically, if for no other reason than to catch places where you've forgotten to use make_nvp(). I'm with you. The random_iarchive is intended as a tool to be used in this process: for instance, I won't sleep well until I have seen a terabyte's worth of events get serialized in one run.... The tests have to be *big*, stressful, lots of data.
Even if you only use one particular compiler for the application you ship, I would recommend building and running all tests on at least two pretty good, different compilers. For example, gcc 3.4(?) and VC 7.1 is a good combination. This will often uncover subtle ambiguities that would otherwise linger on for years inflicting programmer pain.
I have to say the single most important thing I've learned from boost is that it's cheaper to maintain the test suite and build for several compilers than it is to debug the application. bjam (which DOES drive me crazy) is a godsend for doing this kind of thing.
Sure, you don't have to convince me of this. There's nothing more beautiful than a rigorous set of test suites. I'm a crusty UNIX guy with abysmal debugger skills, I'm dependent on them. We have a similar testing infrastructure that I've thrown together... We're a "make" shop... I wasn't sold on bjam. And running classes through all archive types, automatically, is obviously the only way to do it: I put together a few macros to accomplish this in code rather than in a bunch of build-system mechanics. One macro creates tests for one class through all archives. Not sure if they would integrate with Boost.Test so easily, though, and Boost.Test is surely more robust in various ways in case of failure. I can post 'em if you're curious.
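For the curious, the macros boil down to something like this sketch (written here as function templates instead; it assumes T is default-constructible and has a value-semantics operator==):

#include <cassert>
#include <sstream>
#include <boost/archive/text_oarchive.hpp>
#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/xml_oarchive.hpp>
#include <boost/archive/xml_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/serialization/nvp.hpp>

// round-trip `written` through one archive pair and check equality
template <typename OArchive, typename IArchive, typename T>
void roundtrip(const T &written) {
    std::stringstream s;
    {
        OArchive oa(s);
        oa << boost::serialization::make_nvp("t", written); // nvp keeps xml happy
    }
    T read;
    IArchive ia(s);
    ia >> boost::serialization::make_nvp("t", read);
    assert(written == read);
}

// one call runs a class through every archive flavor
template <typename T>
void roundtrip_all(const T &t) {
    roundtrip<boost::archive::text_oarchive, boost::archive::text_iarchive>(t);
    roundtrip<boost::archive::xml_oarchive, boost::archive::xml_iarchive>(t);
    roundtrip<boost::archive::binary_oarchive, boost::archive::binary_iarchive>(t);
}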
Another problem is that if a class contains vector<shared_ptr<Base> >, you'd like to be able to populate this with shared_ptr<Derived>, where Derived is randomly selected from the set of classes that inherit from Base. Since serialization requires these classes to be registered, it seemed to me there might be a way to do this. But maybe it's all overkill.
If you don't find the above sufficient, then it's not overkill. As I said, the pain of writing the test is nothing compared to shipping a product with a bug.
I was wondering how to accomplish it. I am in, say,

template <typename T> void random_iarchive::load_override(vector<shared_ptr<T> >), with T = Base.

My random_iarchive has had Base and several types Derived registered with it already. Because I know what Base is (from T), I can easily populate the vector with shared_ptr<Base>, but in order to populate it with Base and a variety of classes Derived, I have to somehow ask the archive what possibilities are registered and choose one... Forgive me if I'm way off base. The whole business of type registration in the archives is still pretty opaque to me, and my gut says that this is either impossible or overkill.
Anyhow, this random_iarchive exists (except for the Base/Derived thing, above), maybe it would make a good tutorial case for custom serialization archives, maybe people want to use it for something. I'd be more than glad to write up some tutorial material, I'm sure I'd get a lot out of it.
As I said, I'm not convinced that the random test data should be part of the archive class. But I'm certainly pleased that someone finds the serialization library sufficiently useful and interesting to do stuff like this.
I have also created a root_oarchive which creates root "trees", in case anybody is working with the ROOT analysis toolkit. The way one does this "normally" is a real nightmare, and being able to wrap all that in operator<< is a huge, huge win for cleanliness and maintainability. Testament to the flexibility of the serialization library. One big thing here is that the serialization library allows you to "flatten" nested structures into tuples by keeping track of the nvp paths in a deque inside the oarchive. Kind of like xml output, but without start/end tags, and where each nvp has all of its parents prepended to it, separated by some path separator character. One could conceivably create an iarchive for these things as well; I haven't bothered.
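The path bookkeeping is nothing fancy; roughly this, with invented names:

#include <deque>
#include <string>

// push a name entering each nvp, pop leaving it; a leaf's column
// label is the joined path, e.g. "event.track.energy"
class nvp_path {
    std::deque<std::string> path_;
public:
    void enter(const std::string &name) { path_.push_back(name); }
    void leave()                        { path_.pop_back(); }
    std::string str() const {
        std::string s;
        for (std::deque<std::string>::const_iterator i = path_.begin();
             i != path_.end(); ++i)
            s += (s.empty() ? "" : ".") + *i;
        return s;
    }
};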
So if you want to polish this up and add it to the Files section on source forge I think it would be great.
So the attempt is to factor out the business of populating classes with random test data into an iarchive class, in an effort to thoroughly test the "real" archive classes, and so that as a user with a bunch of serializable classes, I can fill them up with random stuff and serialize them through all the various archive types until my CPU smokes, without writing fill_with_random_data() routines by hand for every one of them.

Actually, now that you mention the memoization_archive, it would actually be ideal if there were an archive that could do a deep *comparison*, thus eliminating the need to write all those operator==()s. I had thought about this and deemed it impossible, but if you're talking about deep copy.... Then you've got a real full-of-data workout canned in a function for an arbitrary serializable user class:

(for each A in xml, text, binary)
MyHugeClass src, dst;
random_iarchive >> src; // src now swollen with data
A_oarchive oa(somewhere) << src;
A_iarchive ia(somewhere) >> dst;
comparison_archive ca(src) << dst; // or however that looks

From your serialization(archive) method, you get xml/txt/binary i/o, comparison and copy.
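Incidentally, a crude deep copy already falls out of the serialize templates today, by round-tripping through a buffer; presumably the memoization_archive would do the same without materializing an archive. A sketch (assumes T is default-constructible):

#include <sstream>
#include <boost/archive/text_oarchive.hpp>
#include <boost/archive/text_iarchive.hpp>

template <typename T>
T deep_copy(const T &src) {
    std::stringstream buffer;
    {
        boost::archive::text_oarchive oa(buffer);
        oa << src;   // walk the whole object graph out...
    }
    T dst;
    boost::archive::text_iarchive ia(buffer);
    ia >> dst;       // ...and rebuild it: fresh objects, same values
    return dst;
}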
e) memoization_archive - an archive adaptor which does a deep copy using the serialize templates. This also requires some extra help from extended_type_info.
This is big to us. I'll contact you... Thanks again, -t

But of course you can
just zip the xml stuff, and a binary_nvp_archive is a lot more work than just factoring tags and indentation out of xml_archive...
You could just store the stuff using binary serialization. Remember you can always make a small program which reads the binary_?archive and creates the equivalent xml one. If storage space is an issue I would consider:

a) use Jonathan Turkanis's stream library to make a matched pair of compressed input and output streams.
b) use these streams with the binary archive to create the output file.
c) make a small program which de-serializes these files and perhaps selects the "interesting" part and re-serializes it to something "readable" like XML
d) and pipe the xml output to your favorite viewer.

I guess I'm basically a lazy person.
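Steps a) and b) in code would look roughly like this; a sketch using Boost.Iostreams' gzip filter, with a trivial Event standing in for the real class:

#include <fstream>
#include <vector>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/serialization/vector.hpp>

struct Event { // stand-in for the real event class
    std::vector<double> hits;
    template <class Archive>
    void serialize(Archive &ar, const unsigned int /*version*/) {
        ar & hits;
    }
};

void save_compressed(const Event &e, const char *filename) {
    std::ofstream file(filename, std::ios::binary);
    boost::iostreams::filtering_ostream out;
    out.push(boost::iostreams::gzip_compressor()); // compress first...
    out.push(file);                                // ...then write to disk
    boost::archive::binary_oarchive oa(out);
    oa << e;
}   // oa flushes before out, out before file: destruction order matters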
Anyway, the purpose isn't visualization after the program has run, it is more like
pretty(log_stream) << my_particle;
The random_iarchive is intended as a tool to be used in this process: for instance, I won't sleep well until I have seen a terabyte's worth of events get serialized in one run.... The tests have to be *big*, stressful, lots of data.
I'm thinking just the opposite, that I'll sleep well UNTIL someone tries that.
We have a similar testing infrastructure that I've thrown together... We're a "make" shop... I wasn't sold on bjam. And running classes through all archive types, automatically, is obviously the only way to do it: I put together a few macros to accomplish this in code rather than in a bunch of build-system mechanics. One macro creates tests for one class through all archives. Not sure if they would integrate with Boost.Test so easily, though, and Boost.Test is surely more robust in various ways in case of failure. I can post 'em if you're curious.
I've accommodated myself to bjam and Boost.Test. I've got complaints about both, but I don't want to make an issue of them because it seems that the authors are working hard to address them and I don't want to annoy them. Just for the record, my complaints are:

a) bjam - I just can't understand it.
b) test - changes are made to the development tree which break programs that rely on the test code. Then the corrections aren't promptly made.

Boost.Test is the bedrock of the whole boost foundation. For Boost.Test the priority has to be:

i) correctness across all platforms boost supports
ii) backward compatibility
iii) new features

Currently I believe there are a couple of small issues with the test system:

i) it seems we have problems building the test library with sunpro 5.
ii) the last time I ran tests on borland compilers in release mode, the programs failed in the test library
iii) the test library won't build with Comeau due to an issue with libcomo and va_arg

I should note that I've noticed improvements re CW and others here, so I suppose these kinds of questions are being addressed. It may seem that I'm holding the test library to a higher standard than others - I suppose that's true. It's only because it has to be. Maybe we should run the boost tests with the previous version of the library! (Hmmm - that might actually be a good idea.)
Since serialization requires these classes to be registered, it seemed to me there might be a way to do this. But maybe it's all overkill.
I was wondering how to accomplish it. I am in, say,
template <typename T> void random_iarchive::load_override(vector<shared_ptr<T> >), with T = Base.
My random_iarchive has had Base and several types Derived registered with it already. Because I know what Base is (from T), I can easily populate the vector with shared_ptr<Base>, but in order to populate it with Base and a variety of classes Derived, I have to somehow ask the archive what possibilities are registered and choose one... Forgive me if I'm way off base. The whole business of type registration in the archives is still pretty opaque to me, and my gut says that this is either impossible or overkill.
LOL - the whole business of type registration IS pretty opaque. It's also one hell of a lot harder than it seems at first glance. At least for me. I'm not sure, but BOOST_CLASS_EXPORT might work better for you than explicit registration. It MIGHT be possible to access all the exported types and generate a test automatically, but I haven't looked into this. If it were me, I would just write a test which explicitly tries all the known derived types. I don't like the idea of randomness in tests.
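That is, the test itself enumerates the hierarchy instead of asking the archive; something like this, with Derived1/Derived2 as placeholders for the real (exported) classes:

#include <vector>
#include <boost/shared_ptr.hpp>

struct Base     { virtual ~Base() {} /* serialize(), etc. */ };
struct Derived1 : Base {};
struct Derived2 : Base {};

// one of each known subclass, by hand; then round-trip the vector
// through each archive and compare, as above
void fill_with_known_derived_types(std::vector<boost::shared_ptr<Base> > &v) {
    v.push_back(boost::shared_ptr<Base>(new Derived1));
    v.push_back(boost::shared_ptr<Base>(new Derived2));
}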
Actually, now that you mention the memoization_archive, it would actually be ideal if there were an archive that could do a deep *comparison*, thus eliminating the need to write all those operator==()s. I had thought about this and deemed it impossible, but if you're talking about deep copy.... Then you've got a real full-of-data workout canned in a function for an arbitrary serializable user class:
(for each A in xml, text, binary)
MyHugeClass src, dst;
random_iarchive >> src; // src now swollen with data
A_oarchive oa(somewhere) << src;
A_iarchive ia(somewhere) >> dst;
comparison_archive ca(src) << dst; // or however that looks
From your serialization(archive) method, you get xml/txt/binary i/o, comparison and copy.
I envisioned the "memoization" archive as just an adaptor which tweaks the handling of pointers to implement deep copy. This might be useful for storing/recovering data state without creating a bunch of new objects. It's 99% easy, then runs into a "small" issue with objects of a derived class serialized through the base class. There it stands for now. You're the first to consider a "comparison" archive; of course the issues are the same.

There also exists the possibility of a free implementation of deep copies and comparisons by overriding serialize functions, without storing the data at all. If one considers the serialize functions as a "reflection" of the class to which they correspond, then the whole topic spins off the track of serialization. Some day things might look like:

class A
    data about class A // serializable members
    // automatically generated deep copy, deep compare, and serialization

Food for thought.

Robert Ramey