Formal review: serialization

20 Apr 2004

      Here's my review of the library. It's based on reading docs and source for 
several hours, as well as using almost every previous version of the library 
on my project.

I think the library should be accepted. I voted "no" the previous time, but I 
think library was considerably improved. I'd also like to thank Robert for 
pushing this effort forward!

Detailed comments are below. Some of them are marked as "critical" -- it means 
I believe they should be addressed before the library is added to official 
distribution.

** General issues **

-- (critical) The primary issue with the library is lack of reference 
documentation -- exact description of semantics for every method (preferrable 
in C++ std style) and overview for all important classes. The library is 
non-trivial, and without such documentation user will have to experiment to 
find out that exact semantic.

Better internal docs (comments) are also needed. For example, I'm looking at 
save_access::end_preamble and don't know what it does. Comment for 
basic_oarchive::end_preamble says what this method is for, but does not say 
when this method is called, which makes it hard to do anything. Again, 
implementation is non-trivial. Lack of comments make it somewhat dangerous to 
use the library -- if I hit a bug while Robert is on vacation, I'm not sure 
I'll be able to quickly fix it locally.

As an illustraction, I've already had a couple of bugs each costing several 
hours to debug.

-- (critical) While BOOST_CLASS_EXPORT is a big step forward, I think it can 
be further improved. Now, if I export a class I need to include some archive 
headers before invoking BOOST_CLASS_EXPORT. The class can be only serialized 
to archives which were included. This means that to really export a class, I 
need to include *all* archive headers. That's not good. What I suggest is

     - create new class 'polymorphic_archive', which will have virtual
       functions for saving/loading and create templated
      'polymorphic_wrapper', which can be used to add this polymorphic
      interface to existing archive.
     - always register all exported classes with polymorphic archive. When
       serializing in a specific archive, check if class is registered with
       that archive. If not, serialize it via polymprphic wrapper.

In fact, I'd be willing to help with polymorphic archive implementation, but 
some changes inside serialization library will be required.

-- (critical) I suspect BOOST_CLASS_EXPORT does not play nice with templates. 
Say I'm implementing RPC system:

      class function_call_base { virtual ~function_call_base() {} } ;
      template<class Arg>
      class function_call1 : public function_call_base {};

      // in some code
      rpc_server s;
      s.register("foo", &foo);

      function_call_base* c;
      while(archive >> c) 
		(*c)();

The idea is the caller will serialize instances of function_call1 and the 
callee will deserialize it and call the proper function. To make this work, 
we have to BOOST_CLASS_EXPORT all used instantinations of function_call1. We 
can assume that all functions (and therefore all argument types) are passed 
to the 'register' function and can function can invoke BOOST_CLASS_EXPORT.

Now the problem: BOOST_CLASS_EXPORT can be used only at top-level of a 
translation unit. There's not way 'register' and automatically export a 
class.

The solution I'd like is:

   template<class T>
   class export_class { 
   public:
		static void instantiate();
   };

So that the object of this type can be created anywhere.

-- One usage of the serialization library I have in mind is taking "debug 
dumps" for an application. After the run, a separate tool can be used to look 
at those dumps. The size of data can potentially be rather large (e.g. 
100*640*480 or maybe more), so I'd rather not to load all data, but build 
some index. So the question is: how is it possible to access specific part of 
archive without reading it all. I understand it's not possible in general, 
but in some cases (e.g. BOOST_CLASS_EXPORT for everything) it should work.

-- The documentation of 'make_binary_object' is very incomplete and does not 
tell the exact semantic. As far as I can tell, the function can be only used 
to load/store objects which have fixed size and will crash when 
saving/loading new[]-d array -- which seemed an obvious use-case for me.

I would suggest that we have two functions:

   make_pod_wrapper -- which is templated, has one parameter -- object, and
   stores sizeof(T) bytes. Another name -- make_static_array_wrapper.

   make_dynarray_wrapper -- which can be used to store/load arrays allocated
   by new[]

-- The library requires exact match between static types for saving and 
loading. For example:

     D* d = new D;
     ia << d;
     B* b = new B;
     ia >> b;

does not work. Moreover, it does not work even if 'D' is 
BOOST_CLASS_EXPORT-ed. I'd suggest that this case should work, since it's 
very easy to save derived classes using pointer to derived class, without 
casting to base class.

-- It appears that XML archive require member/variable names in all cases. I'm 
not sure that's reasonable -- given that the library requires exact 
correspondence between loading and saving order even for XML archives. It 
would load archive back even if I change all the names manually in XML file 
right? Then why requiring names?

-- I have some concerns about XML archives. The point of XML, in general is 
that it's somewhat readable by humans, and readable by other applications 
written using different libraries or in different languages. However, when I 
look at:

    <c2 class_id_reference="1" object_id="_2">
           <i>10</i>
           <j>4</j>
           <base object_id="_3"></base>
    </c2>

it's not clear if any of those benefits are there. This actually corresponds 
to a class "C". To find out this, one has to find "class_id" attribute with 
value of "1" somewhere above. If the above was:

    <C name="c" ...>

finding all "C" elements would be trivial with XPath, or even with more simple 
methods. As it stands now, even finding all instances of "C" is hard. There 
may be a number of similiar issues. If the goal is to produce XML which can 
be processed by other programs, then at least the class name should become 
element name.

-- I believe that library should not change codecvt facet of the stream. There 
are two reasons. First, if user has already imbued some facet (say utf-8), he 
might have good reasons to do so (say, he believes utf-8 is the most 
space-efficient for his data). So the serializaton library should not 
silently change codecvt to ucs-2. Second, the whole locale stuff is complex, 
and it's not even clear if it's possible to change codecvt facet on the fly, 
and when the change will take effect, and so on. I believe requiring user to 
'imbue' proper facet himself is better than some unexpected internal effects.

-- I've tried to play with "implementation level" and created the example at

    http://zigzag.cs.msu.su:7813/tracking.cpp

It stores a pointer to a single class with two data members and a single base 
class. I get the following archive content.

   1
   1 C
   0 10 4 1 0
   1

Is seems too much for a class with implementation level of "object 
serializable". 10 and 4 is the data itself. "1 C" is probably class name, but 
what other data is?

** Implementation issues **

-- I suggest that all MPL expressions in the library are refactored for better 
readability. For example, the expression in save_non_pointer_type::invoke is 
28 lines long. I think it should be possible to extract some of the 
subexressions and create a typedef with good name -- looks like 
instantination of those expression in all cases should not cause compile 
errors.

** Custom archive creation **

I've tried to create a custom archive class (in fact, polymorphic_oarchive), 
and got some comments.

- Documentation says the derived class must provide operator<< and method 
'load'. It also says there common_iarchive::operator<<. I wonder why the 
derived class should provide operator<<. Can't base operator call 'load' 
directly?

- Even if not, why derived operator<< should have

     return * This();

not plain "return *this"? After all, we're in most derived object

- It's needed to add

  friend class boost::serialization::load_access;

to class declaration. However, it's not stated where this type is declared, 
and in fact in declared in *internal* header 

    boost/archive/detail/iserializer.hpp

And, finally, the right syntax seems to be:

  friend class boost::archive::load_access;

(Note the namespace!)

- The example uses:

        boost::serialization::load(* This(), t);

Again, what header is this function in? The answer is that it's in the same 
header as above, and the line should be:

        boost::archive::load(* This(), t);

- The documentation should really state the minimal set of 'load'/'save' 
overloads which will make archive usable. For example, it's probably not 
necessary to provide separate overload for 'bool', right?

- (critical) It should be explained how the newly-created class can be tested. 
As it stands, one should either do superficial tests or try to integrate the 
new archive into existing tests. As for latter -- I don't know how to do 
that, and also, I think it's not necessary to run all the tests with the new 
archive. For example, serialization of vector and map is not likely to be 
broken with the new type. What I'm looking for is a test which saves/loads 
all possible types and verifies results.

** Various comments **

-- The BOOST_CLASS_EXPORT macros defines an explicit instantiation of a 
function. gcc rejests this when BOOST_CLASS_EXPORT is used inside some 
namespace. I've asked on comp.std.c++:

http://groups.google.com/groups?q=Explicit+instantiation&hl=en&lr=&ie=UTF-8&safe=off&selm=u5boi1-7te.ln1%
40zigzag.lvk.cs.msu.su&rnum=3

and got a reply directing me to:

    http://std.dkuug.dk/jtc1/sc22/wg21/docs/cwg_defects.html#275

Which means gcc is mostly right. So, the macro should either be changed 
somehow, or at least the restriction should bes documented -- using 
BOOST_CLASS_EXPORT outside of any namespaces is OK.

-- Docs and code use "const unsigned long int version" everywhere. Is that 
really necessary, as opposed to "unsigned version"?

-- The library saves char* as arrays of integers. The rationale section says:

    input of char * and wchar_t * could be altered to create a buffer overrun
    and potential security problem. 

I don't understand this point at all. How storing the elements as integer 
values helps with the buffer overrun? Moreover, library stores char* 
internally (e.g. for archive signature). What if that string is modified? 
Will that cause buffer overrun? 

-- The following program (yes, of two lines), causes a compile error:

    #include <boost/archive/text_oarchive.hpp>
    #include <boost/serialization/export.hpp>

It's because export.hpp uses 'instantiate_pointer_iserializer' but does not 
include the header where that class is defined. 'text_oarchive.hpp' does not 
include it either. The error goes away if I include

    #include <boost/archive/text_iarchive.hpp>

but a better diagnostic is desirable.

- Volodya

Vladimir Prus

tags

participants (1)