
To summarize how we arrived here
================================

a) Mattias augmented binary_?archive to replace element-by-element serialization of primitive types with save/load_binary for C++ arrays, std::vector and boost::valarray. This resulted in a 10x speed up of the serialization process.

b) From this it has been concluded that binary archives should be enhanced to provide this facility automatically and transparently to the user.

c) The structure of the library and the documentation suggest that the convenient way to do this is to specify an overload for each combination of archive/type which can benefit from special treatment.

d) The above (c) is deemed inconvenient because it has been supposed that many archive classes will share a common implementation of load/save_array. This suggests that using (c), though simple and straightforward, will result in code repetition.

e) So it has been proposed that binary_iarchive be re-implemented in the following way:

    iarchive        - contains the default implementation of load_array
    binary_iarchive - presumably contains an implementation of load_array
                      in terms of the currently defined load_binary

It's not clear whether all archives would be modified in this way or just binary_iarchive. The idea is that each type which can benefit from load_array calls it, and the version of load_array corresponding to that particular archive is invoked. This will require that:

i) the serialization function for each type which can benefit from some load_array function call it, and

ii) only a small number of load_array functions be written - one for each archive.

So the number of special functions to be written would be one for each type which might use load_array and "one" for each archive.

Problems with the Design
========================

a) It doesn't address the root cause of the "slow" performance of binary archives.

The main problem is that it doesn't address the cause of the 10x speed up. It's a classic case of premature optimization.
The 10x speed up was based on a test program. For a C++ array, the test boils down to replacing 10,000 invocations of stream write(...) with one invocation of a stream write 10,000 times longer. Which is, of course, faster. Unfortunately, the investigation stopped here, with the conclusion that the best way to improve performance is to reduce the number of stream write calls in a few specific cases.

As far as I know, the test was never profiled, so I can't know for sure, but past experience and common sense suggest that stream write is a costly operation for binary i/o. This design proposal (as well as the previous one) fails to address this, so it's hard to take it as a serious proposal to speed up native binary serialization.

The current binary archives are implemented in terms of stream i/o. This was convenient to do and has worked well. But basing the implementation on streams results in a slow implementation. The documentation explicitly states that archives do not have to be implemented in terms of streams. The binary archives don't use any of the stream interface other than read(...) and write(...), so it would be quite easy to make another binary archive which isn't based on stream i/o. It could be based on fread/fwrite.

Given that the authors' concern is to make the library faster for machine-to-machine communication, and the desired protocols (MPI) don't use file i/o, the fastest approach would be just a buffer - say a buffer_archive - which doesn't do any i/o at all. It would just fill up a user-specified buffer whose address is handed in at buffer_archive construction time. This would totally eliminate stream i/o from the equation.

Note that this would be easy to do: just clone binary_archive and modify it so it doesn't use a stream (you probably don't want to derive from basic_binary_archive). I would guess that would take a couple of hours at most. I would be surprised if the 10x speed up still exists with this "buffer_archive".
Note that for the intended application - MPI communication - some archive which doesn't use stream i/o will have to be created anyway.

b) Re-implementation of binary_archive in such a way as not to break existing archives would be an error-prone process.

The switch between the new and old methods "should" result in exactly the same byte sequence, but it could easily occur that a small, subtle change renders archives created under the previous binary_archive unreadable.

c) The premise that one will save a lot of coding (see d) above) compared to the current method of overloading based on the pair of archive/type is overly optimistic.

This is explained in Peter Dimov's post here:

http://lists.boost.org/Archives/boost/2005/11/97089.php

I'm aware this is speculative. I haven't investigated MPI, XDR and the others enough to know how much code sharing is possible. It does seem that there will be no sharing with the "fast binary archive" of the previous submission. From the short descriptions of MPI I've seen on this list, along with my cursory investigation of XDR, I'm doubtful that there is any sharing there either.

Conclusions
===========

a) The proposal suffers from "premature optimization". A large amount of design effort has been expended on areas which are likely not the source of the observed performance bottlenecks.

b) The proposal suffers from "over generalization". The attempt to generalize results in a much more complex system. Such a system will result in a net loss of conceptual integrity and implementation transparency. The claim that this generalization will actually result in a reduction of code is not convincing.

c) By re-implementing a currently existing and used archive, it risks creating a maintenance headache for no real benefit.

Suggestions
===========

a) Do more work in finding the speed bottlenecks. Run a profiler. Make a buffer-based, non-stream archive and re-run your tests.

b) Make your MPI, XDR and whatever other archives you need.
Determine how much opportunity for code sharing is really available.

c) If you still believe your proposal has merit, make your own "optimized binary archive". Don't derive from binary_archive, but rather from common_?archive or perhaps basic_binary_archive. In this way you will have a totally free hand, and you won't have to achieve consensus with the rest of us, which will save us all a huge amount of time.

Robert Ramey