
On Nov 23, 2005, at 10:53 AM, Peter Dimov wrote:
Matthias Troyer wrote:
Indeed this sounds like a lot of work and that's why this mechanism for message passing was rarely used in the past. The hard part is to manually build up the custom MPI datatype, i.e. to inform MPI about what the offsets and types of the various data members in a struct are.
This is where the serialization library fits in and makes the task extraordinarily easy. Saving a data member with such an MPI archive registers its address and type (as well as the number of identical consecutive elements in an array) with the MPI library. Thus the serialization library already does all the hard work!
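To illustrate what that hard work looks like when done by hand, here is a minimal sketch for a hypothetical particle struct (the type and function names are purely illustrative, and error handling is omitted); an MPI archive built on the serialization library could collect exactly this address/type/count information from the type's ordinary serialize() function instead:

    #include <mpi.h>

    // Hypothetical struct used only for illustration.
    struct particle {
        double position[3];
        double mass;
        int    id;
    };

    // Manually describing `particle` to MPI -- the tedious step that an
    // MPI-aware archive could generate automatically.
    MPI_Datatype make_particle_type()
    {
        particle p;
        int          blocklengths[3]  = { 3, 1, 1 };
        MPI_Datatype types[3]         = { MPI_DOUBLE, MPI_DOUBLE, MPI_INT };
        MPI_Aint     displacements[3];
        MPI_Aint     base;

        MPI_Get_address(&p, &base);
        MPI_Get_address(&p.position, &displacements[0]);
        MPI_Get_address(&p.mass,     &displacements[1]);
        MPI_Get_address(&p.id,       &displacements[2]);
        for (int i = 0; i < 3; ++i)
            displacements[i] -= base;

        MPI_Datatype particle_type;
        MPI_Type_create_struct(3, blocklengths, displacements, types, &particle_type);
        MPI_Type_commit(&particle_type);
        return particle_type;
    }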
I still don't see the 10x speedup in the subject.
The 10x speedup was reported at the start of the thread several weeks ago, for benchmarks comparing the writing of large arrays and vectors through the serialization library with writing them directly into a stream or memory buffer. The MPI case was brought up to show that the serialization library can also be used efficiently there if we have save_array and load_array functionality. Please keep this in mind when reading my replies: we are discussing MPI here as another example, next to binary archives, where the save_array and load_array optimizations will be important.
For an X[], the two approaches are:
1. for each x in X[], "serialize" it into an MPI descriptor;
2. serialize X[0] into an MPI descriptor and construct an array descriptor from it.
Correct.
Conceptual issues with (2) aside (the external format of X is determined by X itself and you have no idea whether the structure of X[0] also describes X[1]),
Of course (2) can only be used for contiguous arrays of a single type, and not for types with pointer or polymorphic members. It will work for any type that is layout-compatible with a POD and contains no pointers or unions: examples are std::complex<T>, tuples of fundamental types, or any struct having only fundamental types as members. For these types the memory layout of X[0] is the same as that of X[1]. Where that does not hold, e.g. because of union or pointer members, the optimization can of course not be applied.
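As a concrete example of (2), here is a minimal sketch (names are illustrative, error handling omitted) that describes a single std::complex<double> to MPI once and then lets MPI apply that description to a whole contiguous vector; it relies on the layout-compatibility assumption stated above:

    #include <mpi.h>
    #include <complex>
    #include <vector>

    // Send a contiguous vector of std::complex<double> with one call,
    // assuming std::complex<double> is layout-compatible with double[2].
    void send_complex_vector(const std::vector<std::complex<double> >& v,
                             int dest, int tag, MPI_Comm comm)
    {
        // Describe one element ...
        MPI_Datatype complex_type;
        MPI_Type_contiguous(2, MPI_DOUBLE, &complex_type);
        MPI_Type_commit(&complex_type);

        // ... and let MPI apply that description to all v.size() elements at once.
        MPI_Send(const_cast<std::complex<double>*>(&v[0]),
                 static_cast<int>(v.size()), complex_type, dest, tag, comm);

        MPI_Type_free(&complex_type);
    }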
I'm not sure that there will be such a major speedup compared to the naive (1).
Oh yes, there can be a huge difference. Let me give a few reasons.

1) In the applications we are talking about we regularly have to send huge contiguous arrays of numbers (stored e.g. in a matrix, vector, valarray or multi_array) over the network. The typical size is 100 million numbers and upwards; I'll stick to 100 million as a typical number in the following. Storing these 100 million numbers already takes up 800 MByte and nearly fills the memory of the machine, which causes problems:

a) copying these numbers into a buffer using the serialization library needs another 800 MByte of memory that might not be available;

b) creating an MPI datatype entry for each element separately means storing at least 12 bytes per element (4 bytes each for the address, type and count), for a total of 1200 MByte instead of just 12 bytes. Again we have a memory problem.

2) But the main issue is speed. Serializing 100 million numbers one by one requires 100 million accesses to the network interface, while serializing the whole block at once causes a single call, and the rest is done by the hardware (see the sketch below). The reason we cannot afford this overhead is that on modern high-performance networks ** the network bandwidth is the same as the memory bandwidth **, so even if everything could be perfectly inlined and optimized, the time to read the MPI datatype for each element when using (1) would completely overwhelm the time actually needed to send the message using (2).

To substantiate my claim (**) above, a few numbers:

* the "Black Widow" network of the Cray X1 series has a network bandwidth of 55 GByte/second!
* the "Red Storm" network of the Cray XT3 Opteron clusters uses one HyperTransport channel for network access and another for memory access, so the network bandwidth there equals the memory bandwidth;
* the IBM Blue Gene/L has a similarly fast network, with 4.2 GByte/second of network bandwidth per node;
* even on cheaper commodity hardware, such as Quadrics interconnects, 1 GByte/second is common nowadays.

I am sure you will understand that to keep up with these data transfer rates we cannot afford additional operations, such as accessing the network interface once per double to read the address, quite apart from the memory issue raised above. I hope this clarifies why approach (2) should be taken whenever possible.
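The sketch mentioned above, contrasting the two approaches for a large array of doubles (illustrative only, error handling omitted):

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    // Approach (2): one call, one descriptor; the network hardware can
    // stream the 800 MByte directly out of the existing buffer.
    void send_block(const std::vector<double>& data, int dest, MPI_Comm comm)
    {
        MPI_Send(const_cast<double*>(&data[0]),
                 static_cast<int>(data.size()), MPI_DOUBLE, dest, 0, comm);
    }

    // Approach (1): one descriptor (or call) per element -- for 100 million
    // doubles the per-element bookkeeping alone dwarfs the cost of moving
    // the actual payload.
    void send_elementwise(const std::vector<double>& data, int dest, MPI_Comm comm)
    {
        for (std::size_t i = 0; i != data.size(); ++i)
            MPI_Send(const_cast<double*>(&data[i]), 1, MPI_DOUBLE, dest, 0, comm);
    }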
Robert's point also deserves attention; a portable binary archive that writes directly into a socket eliminates the MPI middleman and will probably achieve performance similar to your two-pass MPI approach.
This is indeed a nice idea and would remove the need for MPI in some applications on standard-speed TCP/IP based networks such as Ethernet, but it is not a general solution, for a number of reasons:

1. Sockets do not even exist on most dedicated network hardware, while MPI is still available since it is the standard API for message passing on parallel computers. Even where sockets are available, they just add additional layers of (expensive) function calls between the network hardware and the serialization library, whereas the vendor-provided MPI implementations usually access the network hardware directly.

2. MPI is much more than a point-to-point communication protocol built on top of sockets. It is a standardized API for all high-performance network hardware. In addition to point-to-point communication (synchronous, asynchronous, buffered and one-sided), it provides a large number of global operations, such as broadcasts, reductions, gather and scatter. These work with log(N) complexity on N nodes and often use special network hardware dedicated to the task (such as on an IBM Blue Gene/L machine). All of these operations can take advantage of the MPI datatype mechanism (see the sketch below).

3. An MPI implementation can determine at runtime whether the transformation to a portable binary archive is actually needed or whether the bits can just be streamed, and it will do this transparently, hiding it from the user.
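The collective operations mentioned in point 2 have no direct socket-level equivalent; a minimal sketch (illustrative only) of a broadcast and a reduction, both expressed through the same datatype mechanism:

    #include <mpi.h>
    #include <vector>

    void broadcast_and_sum(std::vector<double>& data, MPI_Comm comm)
    {
        // Broadcast the buffer from rank 0 to every node, in O(log N) steps.
        MPI_Bcast(&data[0], static_cast<int>(data.size()), MPI_DOUBLE, 0, comm);

        // Element-wise sum across all nodes, gathered on rank 0; dedicated
        // reduction hardware is used where the machine provides it.
        std::vector<double> sums(data.size());
        MPI_Reduce(&data[0], &sums[0], static_cast<int>(data.size()),
                   MPI_DOUBLE, MPI_SUM, 0, comm);
    }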
It also supports versioned non-PODs and other nontrivial types. As an example, I have a type which is saved as
save( x ):
    save( x.anim.name() ); // std::string

and loaded as

load( x ):
    std::string tmp; load( tmp ); x.set_animation( tmp );
Not everything is field-based.
Indeed, and for such types the optimization would not apply.
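One possible way to make that distinction mechanical is a trait-based dispatch; the trait name and the archive member used below are purely hypothetical, not any existing library interface:

    #include <complex>
    #include <cstddef>

    // Hypothetical trait: true only for types whose array layout is
    // element-wise identical (no pointers, unions or polymorphic members).
    template <class T> struct is_layout_compatible
    { static const bool value = false; };
    template <> struct is_layout_compatible<double>
    { static const bool value = true; };
    template <class T> struct is_layout_compatible<std::complex<T> >
    { static const bool value = is_layout_compatible<T>::value; };

    template <bool Optimized> struct save_array_impl;

    // Optimized path: hand the whole contiguous block to the archive at once.
    template <> struct save_array_impl<true>
    {
        template <class Archive, class T>
        static void apply(Archive& ar, const T* p, std::size_t n)
        { ar.save_block(p, n); }   // hypothetical archive member
    };

    // Fallback: serialize element by element, as for the animation example above.
    template <> struct save_array_impl<false>
    {
        template <class Archive, class T>
        static void apply(Archive& ar, const T* p, std::size_t n)
        { for (std::size_t i = 0; i != n; ++i) ar << p[i]; }
    };

    template <class Archive, class T>
    void save_array(Archive& ar, const T* p, std::size_t n)
    {
        save_array_impl<is_layout_compatible<T>::value>::apply(ar, p, n);
    }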