New subject: Formal review: serialization

20 Apr 2004

      Vladimir Prus wrote:
...
Detailed comments are below. Some of them are marked as "critical" -- it
means I believe they should be addressed before the library is added to
official distribution.
** General issues **
-- (critical) The primary issue with the library is lack of reference
documentation -- exact description of semantics for every method
(preferrable in C++ std style) and overview for all important classes. The
library is non-trivial, and without such documentation user will have to 
experiment to find out that exact semantic.
Better internal docs (comments) are also needed. For example, I'm looking
at
save_access::end_preamble and don't know what it does. Comment for
basic_oarchive::end_preamble says what this method is for, but does not
say
when this method is called, which makes it hard to do anything. Again,
implementation is non-trivial. Lack of comments make it somewhat dangerous
to
use the library -- if I hit a bug while Robert is on vacation, I'm not
sure
I'll be able to quickly fix it locally.
As an illustraction, I've already had a couple of bugs each costing
several hours to debug.
I think this is a little harsh. I've had lots of things I had spend several
hours on - including iterator_adaptors, spirit, and mpl.  I'm not against
adding things to the documentation but in general I think it as good as or
better than any other boost documentation.  

Having said that, I do incrementally clarify bits of documentation as people
ask me questions.  

There's another point that has me a little bit stumped.  Most of the boost
libraries are code that one invokes with a documented signature.  The list
of invocations, syntax and semantics is what I think Vladimir is referring
to here with "reference documentation".  However, the serialization library
is more "upside down" in that using it consists almost entirely of filling
in stubs called by the serialization algorithm.  So its not clear to me how
the "reference documentation" would look compared to the current documents.
...
-- (critical) While BOOST_CLASS_EXPORT is a big step forward, I think it
can
be further improved. Now, if I export a class I need to include some
archive
headers before invoking BOOST_CLASS_EXPORT. The class can be only
serialized
to archives which were included. This means that to really export a class,
I
need to include *all* archive headers. That's not good. What I suggest is
- create new class 'polymorphic_archive', which will have virtual
       functions for saving/loading and create templated
      'polymorphic_wrapper', which can be used to add this polymorphic
      interface to existing archive.
     - always register all exported classes with polymorphic archive. When
       serializing in a specific archive, check if class is registered
with
       that archive. If not, serialize it via polymprphic wrapper.
In fact, I'd be willing to help with polymorphic archive implementation,
but
some changes inside serialization library will be required.
There was much objection to the previous version which was based on virtual
functions.  The usage of templates has addressed those concerns but raised
others related to instanticatiation.  In the previous system the virtual
function interface was fixed across all archives (even those not invented
yet !) in the current system serialization code is generated from the
archive and serialization templates a new every time.  So as it stands one
can't make a "library for type X" for all future archives.  This can be
addressed in two ways.

For current archives one can follow the demo_pimpl example, and generate a
library that has pre instantiated code for all know archives.  This will be
satisfactory in many cases.

For those cases not addressed by the above solution - i.e. those a virtual
interface will be required.  This can be addressed by creating of a
"polymorphic_archive" without requiring changes to the internals to the
library.  This is not difficult if done carefully, but it hasn't been done
yet.  It entails definition of an interface class.  The concrete archive
class would inherit and from both this interface as well as one of the
current implementations.  Its exactly the way the Microsoft COM is
implemented.  I have yet to do this.
...
-- (critical) I suspect BOOST_CLASS_EXPORT does not play nice with
templates.
Say I'm implementing RPC system:
class function_call_base { virtual ~function_call_base() {} } ;
      template<class Arg>
      class function_call1 : public function_call_base {};
// in some code
      rpc_server s;
      s.register("foo", &foo);
function_call_base* c;
      while(archive >> c)
      (*c)();
The idea is the caller will serialize instances of function_call1 and the
callee will deserialize it and call the proper function. To make this
work,
we have to BOOST_CLASS_EXPORT all used instantinations of function_call1.
We
can assume that all functions (and therefore all argument types) are
passed
to the 'register' function and can function can invoke BOOST_CLASS_EXPORT.
Now the problem: BOOST_CLASS_EXPORT can be used only at top-level of a
translation unit. There's not way 'register' and automatically export a
class.
The solution I'd like is:
template<class T>
   class export_class {
   public:
      static void instantiate();
   };
So that the object of this type can be created anywhere.
Doesn't including BOOST_CLASS_EXPORT in the header fill corresponding to the
exported data type perform the equivalent function? It was certainly my
intention that that was how it would be used.  In this way the "export"
would be addressed once and for all. That way all types used - and only
those types would be instantiated.
...
-- One usage of the serialization library I have in mind is taking "debug
dumps" for an application. After the run, a separate tool can be used to
look
at those dumps. The size of data can potentially be rather large (e.g.
100*640*480 or maybe more), so I'd rather not to load all data, but build
some index. So the question is: how is it possible to access specific part
of
archive without reading it all. I understand it's not possible in general,
but in some cases (e.g. BOOST_CLASS_EXPORT for everything) it should work.
I think the could fairly easily be addressed by making a derivation from the
desired archive type and making an index on class_id and object_id or
something like that.  I wonder if there isn't an XML reader that opens just
the toplevel and only opens the file as one "drills down".  Then just
serialize to XML and use that reader.
...
-- The documentation of 'make_binary_object' is very incomplete and does
not
tell the exact semantic. As far as I can tell, the function can be only
used
to load/store objects which have fixed size and will crash when
saving/loading new[]-d array -- which seemed an obvious use-case for me.
I would suggest that we have two functions:
make_pod_wrapper -- which is templated, has one parameter -- object,
and
   stores sizeof(T) bytes. Another name -- make_static_array_wrapper.
make_dynarray_wrapper -- which can be used to store/load arrays
allocated
   by new[]
I can expand the doc a bit.  binary_object is just a wrapper around a size
and pointer to permit them to be handled as pair. It presumes the pointer
already points to allocated storage.  It also presumes that the the size of
thing its pointing to is the same on save and load. Its very lightweight.

I don't see a need for the others, but of course any user can make the
wrappers he needs.
...
-- The library requires exact match between static types for saving and
loading. For example:
D* d = new D;
     ia << d;
     B* b = new B;
     ia >> b;
( presume you mean don't mean "new B" as the library creates a new B)
...
does not work. Moreover, it does not work even if 'D' is
BOOST_CLASS_EXPORT-ed. I'd suggest that this case should work, since it's
very easy to save derived classes using pointer to derived class, without
casting to base class.
It doesn't work. Generally saving to one type and loading to another is not
supported. In this particular case it could be made to work but it would
significantly complicate the library.  Also, supporting a practice such as
the above, would lead to serialization which is much harder to debug.  I
don't believe that there is a situation in which doing the above is
necessary.  The appropriate coding of the above would be:

    B* b = new D;
    ia << d;
    B* b2;
    ia >> b2;

I don't think that's an unreasonable burden.
...
-- It appears that XML archive require member/variable names in all cases.
I'm
not sure that's reasonable -- given that the library requires exact
correspondence between loading and saving order even for XML archives. It
would load archive back even if I change all the names manually in XML
file
right? Then why requiring names?
...
-- I have some concerns about XML archives. The point of XML, in general
is
that it's somewhat readable by humans, and readable by other applications
written using different libraries or in different languages. However, when
I
look at:
<c2 class_id_reference="1" object_id="_2">
           <i>10</i>
           <j>4</j>
           <base object_id="_3"></base>
    </c2>
it's not clear if any of those benefits are there. This actually
corresponds
to a class "C". To find out this, one has to find "class_id" attribute
with
value of "1" somewhere above. If the above was:
<C name="c" ...>
finding all "C" elements would be trivial with XPath, or even with more
simple
methods. As it stands now, even finding all instances of "C" is hard.
There
may be a number of similiar issues. If the goal is to produce XML which
can
be processed by other programs, then at least the class name should become
element name.
Correct, the names aren't used on input.  The real question here is what XML
is to be used for.  Personally I have no use for it.  I did it to satisfy
boost members who thought it was indispensable.  After studying the options
in more detail and learning more about it, I'm even more convinced its
utility is way over hyped. The question is what should be included in the
attribute list.  I made my choices and no one has complained.  I didn't put
anything in there that wasn't necessary for de-serialiation except the name.
(Do you think those requiring XML would be satisified if it didn't have the
name?).  The export class name is included if its necessary.  The XML
attribute list could be tweaked a little without too much trouble.  But it
really comes down to what is XML serialization going to be used for besides
de-serialization.  Until that's answered, I think this will have to remain
an open question.

As its stands, I looked at, made my choices and hardly anyone has
complained.  I guess that its probably has limited utility.  I think that
everyone has assumed that someone else will find it indispensable.

BTW, one usage of XML archives did occur to me.  By checking the name tag on
input, we can implement a crude check that save and load operations are
synchronized.  This would effectively a debug mode for archives and might be
useful.
...
-- I believe that library should not change codecvt facet of the stream.
There
are two reasons. First, if user has already imbued some facet (say utf-8),
he
might have good reasons to do so (say, he believes utf-8 is the most
space-efficient for his data). So the serializaton library should not
silently change codecvt to ucs-2. Second, the whole locale stuff is
complex,
and it's not even clear if it's possible to change codecvt facet on the
fly,
and when the change will take effect, and so on. I believe requiring user
to
'imbue' proper facet himself is better than some unexpected internal
effects.
I've wrestled with that.  My current thinking is that the system should not
be changed.  The reason I did this was that I found that the default codecvt
on wchar_t files was to narrow the character and throw away the high order
bits.  Also UTF-8 is need for wchar_t XML. I included the ability to
override this by specifying no_codecvt flag upon opening the archive which
leaves the codecvt unchanged.  So I'm happy with the current method which is
to reflects the most common user intention and permits one to suppress this
default behavior. I've come to conclude that that the only situation where
the codecvt can't be changed on the fly is with the old dinkumware library.
I'm still checking for this though.
...
-- I've tried to play with "implementation level" and created the example
at
http://zigzag.cs.msu.su:7813/tracking.cpp
It stores a pointer to a single class with two data members and a single
base
class. I get the following archive content.
1
   1 C
   0 10 4 1 0
   1
Is seems too much for a class with implementation level of "object
serializable". 10 and 4 is the data itself. "1 C" is probably class name,
but
what other data is?
The easiest way to verify this is to use an XML archive which labels all the
items.  Current archives contain no extraneous data (except the name in the
XML archive).

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE boost_serialization>
<boost_serialization signature="serialization::archive" version="3">
<c class_id="1" class_name="C" object_id="_0">
	<i>10</i>
	<j>4</j>
	<base class_id="0" tracking_level="1" object_id="_1"></base>
</c>
</boost_serialization>
...
** Implementation issues **
-- I suggest that all MPL expressions in the library are refactored for
better
readability. For example, the expression in save_non_pointer_type::invoke
is
28 lines long. I think it should be possible to extract some of the
subexressions and create a typedef with good name -- looks like
I think the indentation is better than creating new names.
...
instantination of those expression in all cases should not cause compile
errors.
Hmm - please expand on this.
...
** Custom archive creation **
I've tried to create a custom archive class (in fact,
polymorphic_oarchive),
and got some comments.

...
- Documentation says the derived class must provide operator<< and method
'load'. It also says there common_iarchive::operator<<. I wonder why the
derived class should provide operator<<. Can't base operator call 'load'
directly?
- Even if not, why derived operator<< should have
return * This();
not plain "return *this"? After all, we're in most derived object

...
- It's needed to add
friend class boost::serialization::load_access;
to class declaration. However, it's not stated where this type is
declared,
and in fact in declared in *internal* header
I wasn't aware that such a friend class had to be declared.  Recently one
compiler has flagged with a warning which is annoying.  So, we'll look at
this if only to suppress a benign warning.
...
boost/archive/detail/iserializer.hpp
And, finally, the right syntax seems to be:
friend class boost::archive::load_access;
(Note the namespace!)
Hmm - I don't see that this is required given that its be called from within
boost::archive::detail. Does any compiler complain?
...
- The example uses:
boost::serialization::load(* This(), t);
Again, what header is this function in? The answer is that it's in the
same
header as above, and the line should be:
boost::archive::load(* This(), t);
Correct thanks.
...
- The documentation should really state the minimal set of 'load'/'save'
overloads which will make archive usable. For example, it's probably not
necessary to provide separate overload for 'bool', right?
The document states:

" However, we're not quite done. The above code addresses serialization of
all non-primitive types. To be complete, each primitive type must either be
covered by a definition of template<typename T> void load(T & t); or an
overload of the >> operator"

And

" If all primitive types have been accounted for, any program with
serialization defined should work with the new archive."

Maybe I can expand upon that a little to something like: 

"Any program with serialization defined should work with the new archive as
long as every primitive type has a matching save/load function prototype or
template."

The trivial_?archive include the "fall-through" template that matches any
primitive type - including bool.  For example, the "fall-through" for
text_oarchive is

Void load(T &t){
    os >> t;
}

Which covers all primitive types.
...
- (critical) It should be explained how the newly-created class can be
tested.
As it stands, one should either do superficial tests or try to integrate
the
new archive into existing tests. As for latter -- I don't know how to do
that, and also, I think it's not necessary to run all the tests with the
new
archive. For example, serialization of vector and map is not likely to be
broken with the new type. What I'm looking for is a test which saves/loads
all possible types and verifies results.
I've concluded this myself.  Its pretty easy given the bjam setup.  Maybe it
can be made even easier with a bjam argument or environmental variable. It
just needs to be explained.  The easiest is to just run all the tests - for
one archive it probably takes less than 30 minutes and its fairly
exhaustive.  So its not worth trying to figure out a way to abbreviate the
test. (Not to mention the fact that the test I skip is always the one that
catches me).  In some situations the brain-dead way is really the best way.
...
** Various comments **
-- The BOOST_CLASS_EXPORT macros defines an explicit instantiation of a
function. gcc rejests this when BOOST_CLASS_EXPORT is used inside some
namespace. I've asked on comp.std.c++:
http://groups.google.com/groups?q=Explicit+instantiation&hl=en&lr=&ie=UTF-
8&safe=off&selm=u5boi1-7te.ln1%
40zigzag.lvk.cs.msu.su&rnum=3
and got a reply directing me to:
http://std.dkuug.dk/jtc1/sc22/wg21/docs/cwg_defects.html#275
Which means gcc is mostly right. So, the macro should either be changed
somehow, or at least the restriction should bes documented -- using
BOOST_CLASS_EXPORT outside of any namespaces is OK.
I'll add language reminding users that ALL these macros have to be invoked
outside of any name space.  Hmmm
...
-- Docs and code use "const unsigned long int version" everywhere. Is that
really necessary, as opposed to "unsigned version"?
That's been fixed.  I'm not sure you have version # 18.
...
-- The library saves char* as arrays of integers. The rationale section
says:
input of char * and wchar_t * could be altered to create a buffer
overrun
    and potential security problem.
I don't understand this point at all. How storing the elements as integer
values helps with the buffer overrun? Moreover, library stores char*
internally (e.g. for archive signature). What if that string is modified?
Will that cause buffer overrun?
The manual is wrong.  I'll fix it.  I commented out the code that
serializaed (char *) and (wchar_t *) as strings.  I was concerned about
buffer overrun security problems. I don't remember my reasoning for this
right now but I don't think implemention char * serialization as a primitive
type is a good idea.
...
-- The following program (yes, of two lines), causes a compile error:
#include <boost/archive/text_oarchive.hpp>
    #include <boost/serialization/export.hpp>
Not on my machine. So I guess I fixed it.

Whoops, fails on gcc - so I've fixed with a forward template declaration.

Robert Ramey

Re:[boost] Formal review: serialization

Robert Ramey

David Abrahams

Vladimir Prus

David Abrahams

tags

participants (3)