Re: [boost] [serialization] fast array serialization (10x speedup)

11 Oct 2005

      Matthias Troyer wrote:
...
On Oct 10, 2005, at 6:48 PM, Robert Ramey wrote:
...
That sounds very good to me. Maybe I spoke too soon.
I don't see how this would require and changes at all to the
serialization
library.
I don't see why has_fast_array_serialization has to be part of the
serialization library.
Maybe all the code can be included in boost/serialization/
fast_arrary.hpp ?
This header would be included by all the classes that use it and no
others.
[snip]
...
If has_fast_array_serialization<Archive,Type> is defined
boost/serialization/fast_arrary.hpp I'm still OK with it.
That's where it is defined in my proposal.
...
The only thing I'm missing here is why the serialization library
itself has
to be
modified to support all this.  It seems that all this could easily be
encapulated in
one (or more) separate optional headers.  This would be the best of
all
possible
worlds.
There are actually only very few modifications:
1. boost/archive/detail/oserializer.hpp and iserializer.hpp require
modifications for the serialization of C-arrays of fixed length. In
my version, the class save_array_type is modified to dispatch to
save_array when fast_array_serialization is possible. The underlying
problem here is that oserializer.hpp implements the serialization of
a type here (the C array!). The optimal solution to this problem
would be to move the array serialization to a separate header boost/
serialization/array.hpp, as is done for all C++ classes.
My intention was to include all types "built-in" to the C++ language in 
?serializer.hpp
so that's why it includes C++ arrays.  Separating this into another header 
would
break this "concept" and would result in a small additional *.hpp file that 
would
have to be included explicitly.  So I would be against it in this case.  On 
the
other hand, I do appreciate ideas that remove pieces from the "core" of the 
library
and turn them into more independent modules so don't get discouraged here.
...
2. boost/serialization/vector.hpp is also modified to dispatch to
save_array and load_array where possible. I don't think that this is
a problem?
That would mean that users are including save/load_array even if they
don't want them or want to use their own versions.  Oh - then documentation
has to be enhanced to explain all this internal behavior.  I would prefer
something like the following:

class my_class {
    stl::vector<int> m_vi;
    ...
};
template<class Archive>
my_class::serialize(Archive &ar, const unsigned int version){
    // standard way
    ar & m_vi;
    // or fast way which defaults to standard way in appropriate cases.
    save_array(ar, m_vi);
}

This
    a) keeps the stl portion of the library smaller
    b) leave the user in control of what's going on
    c) permits development of save/load array to be on an independent 
parallel track
    with everyting else.

If we eventually discover that that everyone is always using save/load array 
then
we can study whether we want to just enhance stl::vector, etc to include it 
as
default functionality.
...
3. I had to introduce a new strong typedef in basic_archive.hpp:
BOOST_STRONG_TYPEDEF(std::size_t, container_size_type)
BOOST_CLASS_IMPLEMENTATION(boost::archive::container_size_type,
primitive_type)
...
...
I don't think that you will object to this.
...
I looked at this and right away I noticed that, as written this would make
all existing binary archives unreadable.  The real fix for is:

a) bump up the library version from 3 to 4
b) alter save/load collections to use unsigned int or size_t depending
upon he library version.

It has to be done this way since stl collections are marked as 
non-versioned.

I was sort of reluctant to do this as it seemed to me that it would be yet
a while before anyone starts to save more than 2,000,000,000 objects
with the serialization library and it seemed to me wasteful for the
native binary archive to included an extra 4 bytes containing 0's for every
collection serialzed.  But maybe now (I mean 1.34) is the time to bite
the bullet on this.  I don't have a strong prefererence for either changing
it or leaving it the same.  If you need it, I would say fine - we'll do it.
...
4. boost/archive/basic_binary_[io]archive.hpp serialize
container_size_type as an unsigned int as done till now. It might be
better to bump the file version and serialize them as std::size_t.
agreed - no point in using size_t in the collection code just to throw it 
away.
...
All the other changes were to modify the binary archives and the
polymorphic archive to support fast array serialization. In contrast
to the above points this is optional. Instead we could provide
fast_binary_[io]archive and fast_polymporphic_[io]archive, that
differ from their normal versions just by supporting fast array
serialization. I could live with this as well, although it makes more
sense in my opinion to just addd the save_array/load_array features
to the existing archives.
sounds like you can go along with me on this.
...
Of all the points above, I believe that you will not have anything
against points 3 and 4 since you proposed something similar already
in the past if I remember correctly.
I don't see a problem here.
...
Issue 2 should also be noncontroversial, and the main discussion
should thus be on issue 1, on how one can improve the design of [io]
serializer to move the the implementation of array serialization into
a separate header.
looks like we'll have to arm wrestle a little more here.  In a way you've 
got
me over a barrel.  I continually advocate keeping the "core" of the library 
small
by factoring out everything that can be factor out.  Now you've got me on
C++ array. sort of.  I'll re-iterate my view that the "core" library address 
serialization
of all "built-in" language types - and no others.

The other part of your proposal really is an enhancement of the stl 
serializations
which are not part of the "core" part of the library.  I'm confident that 
your
enhancement will be quite useful to many people and I'm very happy to see 
that
you've done it.  That's not the same as forcing it on all users of the 
library.  So
I prefer to see your enhancement as an option explicitly invoked by the 
user.
This has a number of advantages:

a) the user can see what he is doing.  No hidden complex behavior.
b) some users might want just minimal code since thier collections are small
c) your enhancement will require significant documentation.  This will be 
much
easier if it is an optional add-on to the serialization library.
d) parallel or layered developement is facilitated.

So to summarize the issues

a) should C++ array serialization be in a separate header? I say no, you say 
yes.
b) should save/load array be incorporated into stl collection serialization 
to make
its usage oblicatory? I say no, you say yes.

Regardless of how these questions are answered, its clear to me that your
enhancements to stl containers will be available to users.  Assuming that 
this
is packaged as an optional add-on as I would hope, the only questions 
remaining will
be:

a) Should this be a totally separate library with its own 
documentation/tests/directory tree etc?
It should be separate but not totally so:

Maybe a files or maybe directory within serialization for save/load array 
and within archive
for fast...archive adaptor.

A group of its own tests - just like we have tests for all other 
combinations of serializations
and archives - I can hear the howling already.  We'll have to see what to do 
about this.

A separate documenation section in the documenation of the serialization 
library.  Similar to the
miscelleneas.  But miscellaneas holds things that are really separate so 
we'll find a good place for it.  Maybe a section titled something like 
"Special Considerations When Serializing Collections" (but shorter).

Note that your fast...archive will really be a new creature - an "archive 
adaptor".  This merits
an new section in the documentation in any case.  This is a cool and useful 
technique
and will encourage future great ideas which enhance the serialization 
library without
making the core bigger.

b) Should such an optional enhancement be subject to some sort of review? 
I'm agonostic
on this.  I would be happy to just be the gate keeper and accept it or make 
you alter it to my taste.
Of course, no guarentee that anyone else would be happy with this.  I'm 
willing to go along
with whatever the consensus is regarding if and/or how this thing should be 
reviewed.

===============

A final note to others wishing to participate in the serialization library.

Welcome and good luck.

Having said that, I will give a little advice on how to be most successful 
in getting
my cooperation.  (encouragement you get for free)

a) Notice that I make huge efforts to keep the "core" library from growing. 
Anything
that adds to it makes my job bigger - I have to make this job smaller.

b) I am out of the business of writing serializations for specific classes. 
Its up to
those who need it to get it done.  Of course I'm always full of advice - 
sorry about that.

c) I am out of the business of writing new archive types.  Instead, I want 
to improve
the documentation to make this easier.

d) My efforts will be focused on implemenation aspects of the "core" 
library.  This
includes:

i) pending issue regarding dynamic loading/unloading of code that contains 
serialization
for specific types.  This includes testing of support for plug-ins.

ii) making a good performance test and correcting any performance 
bottlenecks in the
core library.

iii) making testing more efficient.

iv) fixing bugs

I am getting a little "burned out" on the serialization library.  Its only 
my obsessive nature that
makes me continue to the bitter end which I'm hoping I am approaching - at 
least
asymptotically.

I am extremely gratified by your efforts and those of others to enhance the 
library and
will do everything (subject to the above reservations) I can to encourage 
and support them.
Nothing would make me happier to see people spinning off their own improved 
serializations
and archive classes and making the package the default choice for C++ object 
serializaition.

Robert Ramey