
Despite saying I would no longer participate in this thread, I re-read Robert's recent list of queries and was struck by the thought that most, if not all, of it is based on accidental misreadings of some previous posts, combined with some completely understandable misunderstandings about performance issues that are well known in the HPC domain (since clarified elsewhere by some benchmarks). So, in the spirit of perhaps rescuing the situation, I reply again:

Robert Ramey wrote:
To summarize how we arrived here
================================
a) Matthias augmented binary_?archive to replace element-by-element serialization of primitive types with save/load_binary for C++ arrays, std::vector and std::valarray. This resulted in a 10x speedup of the serialization process.
Right. If you see some function in the profile that is called 10,000 times and is a bottleneck, which is better? Optimize that function a bit and get a factor 2 speedup (see http://lists.boost.org/Archives/boost/2005/11/97156.php), or change the calling sequence so that that function is only called once, and get a factor 10 speedup?
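To make the tradeoff concrete, here is a schematic sketch (my own illustration, not the library's actual code) of the two calling sequences:

#include <cstddef>
#include <ostream>

// Element-by-element: enters the stream's write() once per element,
// paying the per-call overhead 10,000 times for a 10,000-element array.
void save_elementwise(std::ostream& os, const double* data, std::size_t n)
{
    for (std::size_t i = 0; i != n; ++i)
        os.write(reinterpret_cast<const char*>(&data[i]), sizeof(double));
}

// Whole-array: one write() call that is 10,000 times longer; the
// per-call overhead is paid exactly once.
void save_array(std::ostream& os, const double* data, std::size_t n)
{
    os.write(reinterpret_cast<const char*>(data), n * sizeof(double));
}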
b) From this it has been concluded that binary archives should be enhanced to provide this facility automatically and transparently to the user.
Right. From the point of view of a user, this is completely analogous to, say, a standard library implementor optimizing std::copy(container.begin(), container.end(), ostream_iterator<T>(stream)). If he/she did this, would the user be interested in *disabling* that optimization? What would be the point? But note, 'should' here is really 'could': the current proposal from David Abrahams explicitly does *not* modify any of the existing archives.
c) The structure of the library and the documentation suggest that the convenient way to do this is to specify an overload for each combination of archive/type which can benefit from special treatment.
As I understand it (someone please correct me if I got this wrong), the core idea of the new proposal (from http://lists.boost.org/Archives/boost/2005/11/96923.php and followups) is to provide a single point of customization that *serialization function authors* can utilize to serialize an array in one call. Of course making use of this hook is optional, but since it is also a good convenience function (it saves a couple of lines by avoiding having to manually code a loop), there isn't really any point in not using it. Note that under the proposal in http://lists.boost.org/Archives/boost/2005/11/96923.php this 'hook' may not even be part of the serialization library proper; the point is that *it must be globally accessible* and not specific to a particular archive.
From http://lists.boost.org/Archives/boost/2005/11/96923.php :
David Abrahams wrote:
| In an upcoming message I'm planning to start by describing the least
| intrusive design that could possibly work -- one that makes no changes
| at all to the existing serialization library. It's not a bad design,
| but it has a few drawbacks that I'd like to discuss. At that point it
| should become clear what I meant about "hijacking" the serialization
| library. Finally, I'll describe the smallest set of changes to the
| existing serialization library that would be needed to address those
| drawbacks.
|
| Just to state the obvious, I hope to convince you that those few
| changes are worth making. Of course, as the maintainer of
| Boost.Serialization, it's completely up to you whether to do so.

Moving on to your next point:
d) The above (c) is deemed inconvenient because it has been supposed that many archive classes will share a common implementation of load/save array. This would suggest that using (c) above, though simple and straightforward, will result in code repetition.
I think this must have resulted from a misunderstanding.

1. As far as I can tell, the proposal is completely consistent with (c) above. The question of how the save/load_array is actually dispatched to the archive is a detail that has relevance *only to archive implementors*. Obviously different dispatch mechanisms have different tradeoffs, as discussed elsewhere on this thread, but none of this should be visible to people who are not archive authors.

2. I don't recall seeing anyone suggest that different archive types would be able to share a common implementation of load/save array (except for the trivial case where an archive has no array support and instead uses a default implementation that serializes each element in a loop!). Can you cite where you saw this, so that someone can clarify it?

3. Without a single 'hook' to use when writing serialization functions, the only alternative is to specialize each serialization function that can make use of array optimizations *separately for each archive*. For example, an MPI archive might provide a helper function Save_MPI_Datatype(void* buffer, size_t count, MPI_Datatype type) and require serialization function authors to call this member. An XDR archive might provide a function SaveArray(T* Begin, T* End). With no cooperation between the authors of the archives on a common function for array serialization, the poor user will have to write two sets of serialization functions, one that calls Save_MPI_Datatype() and one that calls SaveArray(). I hope it is obvious to you that this situation is completely untenable! A sketch of the resulting duplication follows below.
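(All the archive and container types below are hypothetical, following the examples just given; this is only an illustration of the duplication, not anyone's actual code.)

#include <cstddef>

// Two hypothetical archives, each exposing its own array primitive;
// neither knows anything about the other's interface.
struct mpi_oarchive
{
    // int stands in for MPI_Datatype to keep the sketch self-contained
    void Save_MPI_Datatype(const void* buffer, std::size_t count, int datatype);
};

struct xdr_oarchive
{
    template <class T> void SaveArray(const T* begin, const T* end);
};

// A user's container type...
template <class T>
struct my_vector
{
    const T* data;
    std::size_t size;
};

// ...now needs one serialization function per archive:
template <class T>
void save(mpi_oarchive& ar, const my_vector<T>& v)
{
    ar.Save_MPI_Datatype(v.data, v.size, /* MPI datatype for T */ 0);
}

template <class T>
void save(xdr_oarchive& ar, const my_vector<T>& v)
{
    ar.SaveArray(v.data, v.data + v.size);
}
// ...and yet another overload for every new archive that appears.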
e) So it has been proposed binary_iarchive be re-implemented in the following way
iarchive - containing default implementation of load_array
binary_iarchive - ? presumably contains implementation of load_array in terms of the currently defined load_binary
This was part of the original (pre Nov 19) proposal and is no longer relevant.
It's not clear whether all archives would be modified in this way or just binary_iarchive.
What specific implementation of load_array would you suggest for the other existing archive types?
The idea is that each type which can benefit from load_array can call it and the version of load_array corresponding to that particular archive will be invoked. This will require
i) the serialization function for types which can benefit from some load_array function would call it.
Right. The set of types that can benefit is very dependent on the archive, but in general it is anything that looks like a container.
ii) Only a small number of load_array functions would have to be written for each archive. So the number of special functions to be written would be one for each type which might use load_array and "one" for each archive.
Right.
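For the archive side, here is a minimal sketch of what the "one per archive" piece might look like. Free functions stand in for whatever dispatch mechanism is finally chosen; only load_binary is an existing member of the binary archives, the rest is my assumption:

#include <cstddef>
#include <boost/archive/binary_iarchive.hpp>

// A binary archive can implement its load_array in one line, in terms
// of the load_binary it already has:
template <class T>
void load_array(boost::archive::binary_iarchive& ar, T* begin, std::size_t count)
{
    ar.load_binary(begin, count * sizeof(T));  // one bulk read
}

// Archives without array support keep the default: an element loop.
template <class Archive, class T>
void load_array(Archive& ar, T* begin, std::size_t count)
{
    for (std::size_t i = 0; i != count; ++i)
        ar >> begin[i];
}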
Problems with the Design
========================
a) It doesn't address the root cause of "slow" performance of binary archives.
The main problem is that it doesn't address the cause of the 10x speedup. It's a classic case of premature optimization. The 10x speedup was based on a test program. For a C++ array, the test boils down to replacing 10,000 invocations of the stream's write(...) with one invocation of write that is 10,000 times longer.
I think this issue has been covered elsewhere, including benchmarks. If you can see any remaining problems with the claim that the root cause of the poor performance is the multiple calls to the buffer's write(), and that even using a specialized buffer *does not significantly help*, then please say so. [snip]
I would be surprised to see if the 10x speed up still exists with this "buffered_archive".
As shown elsewhere, you are surprised.
note that for the intended application - MPI communication - some archive which doesn't use stream I/O has to be created anyway.
Right. In this case, the write() to the stream is replaced either by an immediate call to MPI_Send(data, size, ...), or by a mechanism that constructs a derived MPI_Datatype, which essentially records a map of which memory locations need to be sent and passes this directly to MPI. In both cases, the cost of not utilizing array operations would not be 10x, but more like 1000x or more.
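For illustration, a sketch of the first alternative using the standard MPI C API (error handling omitted; MPI_Init assumed to have been called):

#include <mpi.h>

// One call transfers the whole array; MPI handles any packing internally.
void send_array(double* data, int count, int dest, int tag)
{
    MPI_Send(data, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
}

// The element-wise alternative pays the full per-message latency for
// every element - this is where a 1000x-or-worse slowdown comes from.
void send_elementwise(double* data, int count, int dest, int tag)
{
    for (int i = 0; i != count; ++i)
        MPI_Send(&data[i], 1, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
}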
b) re-implementation of binary_archive in such a way as not to break existing archives would be an error-prone process. Switching between the new and old methods "should" result in exactly the same byte sequence. But it could easily occur that a small, subtle change might render archives created under the previous binary_archive unreadable.
Again, this is not part of the revised proposal. But is the binary serialization really *that* fragile, that this is a significant concern? It suggests that binary archives using the boost serialization lib would be really hard to write correctly (much harder than my experience suggests)!
c) The premise that one will save a lot of coding (see d) above) compared to the current method of overloading based on the pair of archive/type is overly optimistic. This is explained in Peter Dimov's post here:
I think this is based on the same misunderstanding as point (d) above? Anyway, what do you mean by "current method"? Can you describe how I should write my single serialization function (hopefully just one of them!) for my_matrix_type using the "current method"?
I'm aware this is speculative. I haven't investigated MPI, XDR and others enough to know how much code sharing is possible. It does seem that there will be no sharing with the "fast binary archive" of the previous submission. From the short descriptions of MPI I've seen on this list, along with my cursory investigation of XDR, I'm doubtful that there is any sharing there either.
Again, I think this is based on the same misunderstanding. Archives will typically not share implementations of array processing.
Conclusions
===========

a) The proposal suffers from "premature optimization". A large amount of design effort has been expended on areas which are likely not the source of the observed performance bottlenecks.
Not true, as already explained here and elsewhere.
b) The proposal suffers from "over-generalization". The attempt to generalize results in a much more complex system. Such a system will result in a net loss of conceptual integrity and implementation transparency. The claim that this generalization will actually result in a reduction of code is not convincing.
I think this is based on the same misunderstanding as point (d) above?
c) by re-implementing a currently existing and used archive, it risks creating a maintenance headache for no real benefit.
Again, this is not part of the current proposal. This issue is not very important because most of the interesting uses for array optimizations are not based on archives currently in the serialization lib. Whether existing archives make use of array optimizations is a side-issue that can be discussed if/when the necessary array hooks actually exist.
Suggestions
===========
a) Do more work in finding the speed bottlenecks. Run a profiler. Make a buffer-based, non-stream-based archive and re-run your tests.
Done, see
http://lists.boost.org/Archives/boost/2005/11/97166.php
http://lists.boost.org/Archives/boost/2005/11/97156.php
For the people here who have experience in this problem domain, the results are completely obvious. In hindsight it is equally obvious that someone from a different background would not be aware of this! Sorry.
b) Make your MPI, XDR and whatever archives. Determine how much opportunity for code sharing is really available.
I think Matthias already has a prototype MPI archive as well as an XDR archive? But this point is again based on the misunderstanding of point (d) above. I doubt there is any possibility of code-sharing between the load/save_array functions of MPI and XDR *at all*; the point is that I, as a user, want to be able to write just a single serialization function for my_matrix_type that will make use of whatever array optimizations the archive I am using can provide. Something like the sketch below:
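(Here save_array is a stand-in for whatever name the common hook finally gets, and my_matrix_type's interface is invented for the example; this is a sketch, not the proposal's actual code.)

#include <cstddef>

// Declaration of the hypothetical common hook (provided elsewhere).
template <class Archive, class T>
void save_array(Archive& ar, const T* begin, std::size_t count);

// Invented interface, for illustration only.
template <class T>
class my_matrix_type
{
public:
    std::size_t rows() const;
    std::size_t cols() const;
    const T* data() const;   // contiguous storage
};

// The one and only serialization function the user writes.  save_array
// is the common hook: an MPI archive maps it onto an MPI_Datatype, an
// XDR archive onto its array primitive, and any other archive falls
// back to a default element-by-element loop.
template <class Archive, class T>
void save(Archive& ar, const my_matrix_type<T>& m, const unsigned int /*version*/)
{
    std::size_t rows = m.rows();
    std::size_t cols = m.cols();
    ar << rows << cols;
    save_array(ar, m.data(), rows * cols);
}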
c) If you still believe your proposal has merit, make your own "optimized binary archive". Don't derive from binary_archive but rather from common_?archive or perhaps basic_binary_archive. In this way you will have a totally free hand and won't have to achieve consensus with the rest of us, which will save us all a huge amount of time.
I don't understand what this means.

Regards,
Ian