
Matthias Troyer wrote:
Hi Robert,
I'll let Dave comment on the parts where you review his proposal, and will focus on the performance.
On Nov 24, 2005, at 6:59 PM, Robert Ramey wrote:
a) It doesn't address the root cause of "slow" performance of binary archives.
I ran the benchmarks you asked for last night (see below), and they indeed show that the root cause of the slow performance is the individual writing of many small elements instead of "block-writing" the array in a single call to something like save_array.
b) re-implementation of binary_archive in such a way as not to break existing archives would be an error-prone process. Switching between the new and old methods "should" result in exactly the same byte sequence, but a small, subtle change could easily render archives created under the previous binary_archive unreadable.
Dave's design does not change anything in your archives or serialization functions, but only adds an additional binary archive using save_array and load_array.
Hmm - that's not the way I read it. I've touched on this in another post.
c) The premise that one will save a lot of coding (see d) above) compared to the current method of overloading based on the archive/type pair is overly optimistic.
Actually I have implemented two new archive classes (MPI and XDR) which can profit from it, and it does avoid a lot of code duplication. All of the serialization functions for types that can make use of such an optimization can be shared among all these archive types. In addition, formats such as HDF5 and netCDF have been mentioned, which can reuse the *same* serialization function to achieve optimal performance.
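[Editor's sketch, for readers following the thread: one shape such sharing could take. The trait has_save_array and the hook save_array are placeholder names for illustration, not existing Boost.Serialization interfaces or the exact names of the proposal.]

    #include <cstddef>
    #include <vector>

    // Hypothetical trait: which archive types provide a block save_array() hook.
    // A real design would specialize this per archive (binary, MPI, XDR, ...).
    template <class Archive>
    struct has_save_array { static const bool value = false; };

    template <bool HasHook> struct save_vector_impl;

    // Block path: one call covering the whole contiguous array.
    template <> struct save_vector_impl<true> {
        template <class Archive, class T>
        static void apply(Archive& ar, const std::vector<T>& v) {
            if (!v.empty()) ar.save_array(&v[0], v.size());
        }
    };

    // Fallback path: element by element, works with every archive.
    template <> struct save_vector_impl<false> {
        template <class Archive, class T>
        static void apply(Archive& ar, const std::vector<T>& v) {
            for (std::size_t i = 0; i < v.size(); ++i) ar << v[i];
        }
    };

    // The single, shared serialization function: every archive type that
    // specializes has_save_array<> picks up the fast path automatically.
    // (Writing the element count is omitted here for brevity.)
    template <class Archive, class T>
    void save(Archive& ar, const std::vector<T>& v) {
        save_vector_impl<has_save_array<Archive>::value>::apply(ar, v);
    }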
There is nothing "optimistic" here since we have the actual implementations, which show that code duplication can be avoided.
OK - I can really only comment on that which I've seen.
Conclusions
===========

a) The proposal suffers from "premature optimization". A large amount of design effort has been expended on areas which are likely not the source of observed performance bottlenecks.
As Dave pointed out, one main reason for a save_array/load_array or save_sequence/load_sequence hook is to utilize existing APIs for serialization (including message passing) that provide optimized functions for arrays of contiguous data. Examples include MPI, PVM, XDR, and HDF5. All of these libraries provide special functions for contiguous arrays for a well-established reason: they all observed the same bottlenecks. These bottlenecks have been well known in high-performance computing for decades, and they are what caused all these APIs to include special support for contiguous arrays of data.
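[Editor's sketch of why those bulk APIs matter here: an array-aware archive could forward an entire contiguous block to a single MPI_Send instead of issuing one send per element. The archive class below is purely hypothetical; only MPI_Send itself is a real MPI call.]

    #include <mpi.h>
    #include <cstddef>

    // Hypothetical MPI output "archive": sends serialized data to a peer rank.
    class mpi_toy_oarchive {
    public:
        mpi_toy_oarchive(MPI_Comm comm, int dest) : comm_(comm), dest_(dest) {}

        // Per-element path: one MPI_Send per double, i.e. one message and
        // one latency cost per element.
        void save(double d) {
            MPI_Send(&d, 1, MPI_DOUBLE, dest_, 0, comm_);
        }

        // Block path: the whole contiguous array in a single MPI_Send,
        // letting MPI transfer it as one message.
        void save_array(const double* p, std::size_t n) {
            MPI_Send(const_cast<double*>(p), static_cast<int>(n),
                     MPI_DOUBLE, dest_, 0, comm_);
        }

    private:
        MPI_Comm comm_;
        int dest_;
    };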
I admit I'm skeptical of the benefits, but I've not disputed that someone should be able to do this without a problem. The difference lies in where the implementation should be placed.
b) The proposal suffers from "over-generalization". The attempt to generalize results in a much more complex system. Such a system will result in a net loss of conceptual integrity and implementation transparency. The claim that this generalization will actually result in a reduction of code is not convincing.
I'm confused by your statement. Actually the implementations of fast binary archives, MPI archives and XDR archives do share common serialization functions, and this does indeed result in code reduction and avoids code duplication.
Upon reflection - I think I would prefer the term "premature generalization". I concede that's speculation on my part. It seems a lot of effort has been invested to avoid the MxN problem. My own experiments with bitwise_array_archive_adaptor have failed to convince me that the library needs more API to deal with this problem. Shortly, I will be uploading some code which perhaps will make my reasons for this belief more obvious.
c) By re-implementing a currently existing and used archive, it risks creating a maintenance headache for no real benefit.
To avoid any such potential problems Dave proposed to add a new archive in an array sub namespace.
As I said - that's not how I understood it.
I guess that alleviates your concerns? Also, a 10x speedup might not be a benefit for you and your applications but as you can see from postings here, it is a concern for many others.
LOL - No one has ever disputed the utility of a 10x speed up. The question is how best to achieve it without creating a ripple of side effects.
Suggestions
===========
a) Do more work in finding the speed bottlenecks. Run a profiler. Make a buffer-based, non-stream-based archive and re-run your tests.
I have attached a benchmark for such an archive class and ran benchmarks for std::vector<char> serialization. Here are the numbers (using gcc-4 on a Powerbook G4):
Time using serialization library:          13.37
Time using direct calls to save in a loop: 13.12
Time using direct call to save_array:       0.4
In this case the buffer initially had size 0 and needed to be resized during the insertions. Here are the numbers for the case where enough memory has been set aside up front with reserve():
Time using serialization library:          12.61
Time using direct calls to save in a loop: 12.31
Time using direct call to save_array:       0.35
And here are the numbers for std::vector<double>, using a vector of 1/8th the size:
Time using serialization library:          1.95
Time using direct calls to save in a loop: 1.93
Time using direct call to save_array:      0.37
Since there are fewer calls for these larger types it looks slightly better, but even now there is a more than 5x difference in this benchmark.
As you can see the overhead of the serialization library (less than 2%) is insignificant compared to the cost of doing lots of individual insertion operations into the buffer instead of one big one. The bottleneck is thus clearly the many calls to save() instead of a single call to save_array().
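[Editor's sketch of the two measured paths, for readers without the attachment: this is not the benchmark code itself, just an illustration of "many small inserts" versus "one big insert" into an in-memory buffer.]

    #include <cstddef>
    #include <vector>

    // In-memory output buffer, as in a non-stream-based archive.
    struct buffer_oarchive {
        std::vector<char> buffer;

        // Path timed as "save in a loop": one small append per element.
        template <class T>
        void save(const T& t) {
            const char* p = reinterpret_cast<const char*>(&t);
            buffer.insert(buffer.end(), p, p + sizeof(T));
        }

        // Path timed as "save_array": one append for the whole block.
        template <class T>
        void save_array(const T* t, std::size_t n) {
            const char* p = reinterpret_cast<const char*>(t);
            buffer.insert(buffer.end(), p, p + n * sizeof(T));
        }
    };

    void demo(const std::vector<double>& v) {
        buffer_oarchive a1, a2;
        for (std::size_t i = 0; i < v.size(); ++i)   // many small inserts
            a1.save(v[i]);
        if (!v.empty())
            a2.save_array(&v[0], v.size());          // one big insert
    }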
Well, this is interesting data. The call to save() resolves inline to fetching an element from std::vector and stuffing the value into the buffer. I wonder how much of this cost is in std::vector and how much is in the save to the buffer. It does diminish my skepticism about how much benefit array serialization would provide, at least in these specific cases. So I'll concede that this will be a useful facility for a significant group of users. Now we can focus on how to implement it with minimal collateral damage.
b) Make your MPI, XDR and whatever archives. Determine how much opportunity for code sharing is really available.
This has been done, and it is the reason for the proposal to introduce something like the save_array/load_array functions. I have coded an XDR archive and two different types of MPI archives (one using a buffer, the other not). A single serialization function for std::valarray, using the load_array hook, suffices to exploit the optimized APIs in MPI and XDR as well as a faster binary archive, and the same is true for other types.
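[Editor's sketch of what such a single, shared function could look like on the loading side. load_array is a placeholder name for the proposed hook, and the count-prefix convention is assumed purely for illustration.]

    #include <cstddef>
    #include <valarray>

    // One shared load function: any archive type that provides a block
    // load_array(ptr, count) hook -- an MPI archive, an XDR archive, a fast
    // binary archive, ... -- can use it, so the function is written once
    // rather than once per archive type.
    template <class Archive, class T>
    void load(Archive& ar, std::valarray<T>& v) {
        std::size_t n = 0;
        ar >> n;                       // element count, stored by the matching save()
        v.resize(n);
        if (n != 0)
            ar.load_array(&v[0], n);   // one bulk read / receive / decode
    }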
c) If you still believe your proposal has merit, make your own "optimized binary archive". Don't derive from binary_archive but rather from common_?archive or perhaps basic_binary_archive. In this way you will have a totally free hand and won't have to achieve consensus with the rest of us, which will save us all a huge amount of time.
I'm confused. I realize that one should not derive from binary_iarchive, but why should one not derive from binary_iarchive_impl?
What I meant is if you don't change the current binary_i/oarchive implementation you won't have to worry about backward compatibility with any existing archives. I (mis?)understood the proposal to include adjustments to the current implementation so that it could be derived from.
Also, following Dave's proposal, none of your archives is touched; instead, additional faster ones are provided.
This wasn't clear to me from my reading of the proposal. Robert Ramey