Serialization and Editing Archives

I've been looking at the posts regarding the Property Tree Library. I haven't had the time to give it a proper review. Fortunately it seems that there are a good number of people willing to help with this. I would like to make some observations regarding the serialization library and the creation of "editable" archives. This seems to me the central issue being discussed.

The fundamental notion that the serialization library is meant to address is the deconstruction of an arbitrary set of C++ data structures into a series of bytes in such a way that the original set can be restored at another time and place. The requirements for such a library were:

a) completeness - basically no constraints on the C++ data structures to be saved/restored.
b) factoring out the actual storage of the bytes in such a way that this could be changed by the user to suit his application.
c) minimal effort - the library has to leverage the structure of the data as described by the C++ code itself rather than requiring some extra specification as to how all the data is organized and related.

This goal has been in large part achieved with the help of many people on this list. The set of requirements was demanding, and required lots of code to implement. Much of that code had to rely on more demanding techniques such as template metaprogramming. Also, in some cases, extra information had to be appended to "reflect" information necessary for the implementation. BOOST_CLASS_EXPORT is an example of this. In addition a number of user customizable attributes had to be defined to permit the library to be applied as widely as was reasonable to expect. I know not everyone is totally happy with the library but all in all I'm satisfied that a good balance has been struck.

The goal and requirements dictate that the resultant archive (however it is stored) is dependent on the C++ data structures defined at compile time and invoked at runtime.
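To make that dependence concrete, here is a minimal sketch of the idea. It is not Boost.Serialization itself: `TextOut`, `TextIn`, `Point`, `Segment`, `save` and `load` are hypothetical names invented for illustration, and the real library's `serialize` also takes a version argument that is omitted here. One `serialize` function per class drives both saving and loading, and the storage of the bytes is factored out into the archive type.

```cpp
#include <sstream>
#include <string>

// Toy output archive (hypothetical, not a Boost class): writes each
// primitive as a space-terminated token and recurses into class types.
struct TextOut {
    std::ostringstream os;
    template <class T>
    TextOut& operator&(T& value) { put(value); return *this; }
private:
    void put(int& v) { os << v << ' '; }
    void put(double& v) { os << v << ' '; }
    template <class T>
    void put(T& obj) { obj.serialize(*this); }
};

// Toy input archive: reads tokens back in exactly the order they were written.
struct TextIn {
    std::istringstream is;
    explicit TextIn(const std::string& s) : is(s) {}
    template <class T>
    TextIn& operator&(T& value) { get(value); return *this; }
private:
    void get(int& v) { is >> v; }
    void get(double& v) { is >> v; }
    template <class T>
    void get(T& obj) { obj.serialize(*this); }
};

struct Point {
    int x = 0, y = 0;
    // The same function describes both directions; the archive decides which.
    template <class Archive>
    void serialize(Archive& ar) { ar & x & y; }
};

struct Segment {
    Point a, b;
    double weight = 0.0;
    template <class Archive>
    void serialize(Archive& ar) { ar & a & b & weight; }
};

template <class T>
std::string save(T& obj) { TextOut out; out & obj; return out.os.str(); }

template <class T>
void load(const std::string& bytes, T& obj) { TextIn in(bytes); in & obj; }
```

In this sketch a `Segment` from (1,2) to (3,4) with weight 0.5 is stored as the text `1 2 3 4 0.5 `. Nothing in those bytes says which number is which; that knowledge lives only in the `serialize` functions.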
In general, one cannot expect that any changes in an archive will be possible without changing the C++ data structures and code to which the changes correspond.

Note that this is the reverse of the most common situation, exemplified by something like parsing an XML file. In that case one will load the data into some dynamic data structure which reflects the structure of the data being loaded. At that point some code will have to traverse the structure to extract or set the data in memory. Note that there is no requirement that the C++ data structures map in any special way to the file. Also the file doesn't have to include as much "detail" because it doesn't have to reconstruct an arbitrary set of C++ data structures. And of course we see and will continue to see a proliferation of file formats and programming modules to support them.

Some attempts have been made to map serialization to some formats which permit arbitrary data - XDR and a couple of others. I don't think they've been successful because part of the file format is really held in the C++ code which saves and loads the archives.

So the requirement to be able to arbitrarily edit an archive clearly cannot be met. But then ... it seems tantalizingly close - I mean, hell, it's XML, I can read it! So could anything be done? I haven't spent too much time on this but I can speculate some.

Suppose one wanted to create an editable XML format. First of all one will have to accept the fact that not all editing is going to be possible. One won't be able to make a new class which the C++ code doesn't implement. It might be possible to do something like the following:

a) parse the whole XML file with a standard parser
b) as the serialization algorithm visits each data structure, rather than just presuming that all the data is there in sequence, it would check to see if the tag has been loaded.
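A rough sketch of steps a) and b), under heavy assumptions: `parse_flat_xml` is a hypothetical stand-in for a real XML parser and handles only flat, non-nested `<tag>value</tag>` elements, and `TolerantLoader`, `field`, `Config` and `load_config` are invented for illustration and are not part of Boost.Serialization. The `field` call plays a role loosely similar to a name-value pair.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Step a) stand-in: collect every <tag>value</tag> pair from a flat document.
// A real implementation would use a proper XML parser.
std::map<std::string, std::string> parse_flat_xml(const std::string& text) {
    std::map<std::string, std::string> tags;
    std::size_t pos = 0;
    while ((pos = text.find('<', pos)) != std::string::npos) {
        std::size_t close = text.find('>', pos);
        if (close == std::string::npos) break;
        std::string name = text.substr(pos + 1, close - pos - 1);
        std::string end_tag = "</" + name + ">";
        std::size_t end = text.find(end_tag, close + 1);
        if (end == std::string::npos) break;
        tags[name] = text.substr(close + 1, end - close - 1);
        pos = end + end_tag.size();
    }
    return tags;
}

// Step b): a loading "archive" that looks each member up by tag instead of
// assuming the data arrives in sequence.
class TolerantLoader {
public:
    explicit TolerantLoader(std::map<std::string, std::string> tags)
        : tags_(std::move(tags)) {}

    template <class T>
    void field(const char* tag, T& value) {
        auto it = tags_.find(tag);
        if (it == tags_.end()) return;      // tag was deleted: keep the default
        std::istringstream in(it->second);
        T parsed{};
        if (in >> parsed) value = parsed;   // unreadable value: keep the default
    }

private:
    std::map<std::string, std::string> tags_;
};

struct Config {
    int port = 80;
    int retries = 3;
    std::string name = "default";           // single word only in this sketch

    template <class Archive>
    void serialize(Archive& ar) {
        ar.field("port", port);
        ar.field("retries", retries);
        ar.field("name", name);
    }
};

Config load_config(const std::string& xml) {
    Config c;
    TolerantLoader loader(parse_flat_xml(xml));
    c.serialize(loader);
    return c;
}
```

With this sketch, loading `<port>8080</port><name>alpha</name>` gives a `Config` whose `retries` keeps its compiled-in default, because that tag was deleted from the file. The C++ code still decides which tags mean anything: a tag the code never asks for is silently ignored.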
This would permit the following types of editing:

i) deletions
ii) changing data

A little more work would permit expanding / contracting collections. I really don't know how much work this would end up being but no one has had sufficient interest to undertake it - or perhaps they have but it hit a dead end.

To summarize, I see two almost irreconcilable paths:

a) Archive formats reflect the set of C++ data structures which generate them.
b) A format for arbitrary data (e.g. XML) is defined and a general - not application specific - data structure (e.g. a tree) is defined. Then the application walks the tree in an application dependent manner.

Basically only one thing can be boss - either the C++ data structures in the app or the file format. One can drive the other but they can't be varied independently.

It seems to me that the property tree library is driven by the need to be able to edit one's "serialized" data outside the main application. My personal view is that if one feels that he needs to do this - he should be rethinking his application. A huge portion of applications are really variations on just this theme: load some data, edit it, save the data. And it's not just word processors or things like that. Machines start up, run, and shut down. A customer record is accessed, updated, and saved. Etc. So I think many more applications can be considered in terms of serialization than in fact are. The applications edit data.

There can also be the case where different applications edit the same data. I might make an application which loads a configuration, runs a machine and saves the data from the run. Someone else needs an application to browse and/or annotate the data. One could say - oh then we need the data in some application independent form such as XML. Well that's one way - but then we have to recode all over again to get and save the data.
Or we can use something like a text editor but then we are casting aside all the facility of using a computer in the first place. Finally we can make a new application which uses the original C++ serialization code and data structures - which hopefully have been compiled into a library. Now we just use our special editor for that "other" application.

To me - the definition of a file format independent of the C++ code which is going to use it has wasted billions of man years of human brain time just trying to keep things in sync. So, I see the whole XML approach as a misguided attempt to separate the fundamentally inseparable - data and code - they are one now. I predict that XML will be made continually more elaborate and spawn ever more variations in vain attempts to reconcile the irreconcilable. Remember you heard it here first - XML and its progeny are destined for the ash heap of history! (hmmm - like COBOL?)

For someone who has nothing else to do there are interesting things to look at here. What would be very interesting would be an xml_archive which created in parallel an (XML?) schema which would be used by a generic editor to edit the XML (and maybe other) data in the archive within the bounds of what is permitted. I see this as quite doable.

Another idea would be a generic editor which could be compiled from the serialization code, which would exploit the "reflected" data about the C++ and perhaps define a little bit more. For example I could imagine a template for each datatype which would invoke a function to edit it. For a GUI, just compile the generic editor with your own serialization code and BAM - you've got an application specific editor which runs at full runtime speed.

Just a couple of miscellaneous observations:

a) It seems there is a perception that the serialization library is large, cumbersome and inefficient. I believe it can be used inefficiently (like all libraries) and it can be made faster (and incremental improvements are being made).
But I don't think this is true in general. When I compile in release mode, make sure I haven't included the debug symbols, and use the DLL (which isn't all that big), a program which saves and loads 4 different types of archives with several classes only takes 30 KB. So if you're not getting similar results, take a closer look.

b) Much has been made of things like - you can't turn off tracking etc. - and the difficulty of modifying the library. I believe that most of these difficulties come about by trying to implement some external requirement - which I've described above as trying to reconcile the irreconcilable. If one turns off tracking - then how is one going to recover pointers? By default, something is tracked only if it's used as a pointer somewhere. And why would one have pointers in something like configuration parameters? And if tracking is turned off and something is duplicated - then what if one edits only one of the copies? Etc.

It's not so much that I want to discuss tracking - it's more that I think these kinds of questions reveal that things originally thought to be almost identical, e.g. an XML archive and a general XML file, are more different than they first appear. So when this occurs it's time to step back and look at the big picture. (BTW I believe it would be easy to suppress the object_id etc. in the XML archive just by overriding with stubs the serializations for these types - but then how would one read the archive?)

Anyway - I'm sorry if this is a little rambling and off topic.

Robert Ramey