
On Nov 23, 2005, at 10:53 AM, Peter Dimov wrote:
Matthias Troyer wrote:
Indeed this sounds like a lot of work and that's why this mechanism for message passing was rarely used in the past. The hard part is to manually build up the custom MPI datatype, i.e. to inform MPI about what the offsets and types of the various data members in a struct are.
This is where the serialization library fits in and makes the task extraordinarily easy. Saving a data member with such an MPI archive registers its address and type (as well as the number of identical consecutive elements in an array) with the MPI library. Thus the serialization library already does all the hard work!
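To illustrate what that hard work looks like when done by hand, here is a minimal sketch for a hypothetical particle struct (the type and function names are purely illustrative, and error handling is omitted); an MPI archive built on the serialization library could collect exactly this address/type/count information from the type's ordinary serialize() function instead:

    #include <mpi.h>

    // Hypothetical struct used only for illustration.
    struct particle {
        double position[3];
        double mass;
        int    id;
    };

    // Manually describing `particle` to MPI -- the tedious step that an
    // MPI-aware archive could generate automatically.
    MPI_Datatype make_particle_type()
    {
        particle p;
        int          blocklengths[3]  = { 3, 1, 1 };
        MPI_Datatype types[3]         = { MPI_DOUBLE, MPI_DOUBLE, MPI_INT };
        MPI_Aint     displacements[3];
        MPI_Aint     base;

        MPI_Get_address(&p, &base);
        MPI_Get_address(&p.position, &displacements[0]);
        MPI_Get_address(&p.mass,     &displacements[1]);
        MPI_Get_address(&p.id,       &displacements[2]);
        for (int i = 0; i < 3; ++i)
            displacements[i] -= base;

        MPI_Datatype particle_type;
        MPI_Type_create_struct(3, blocklengths, displacements, types, &particle_type);
        MPI_Type_commit(&particle_type);
        return particle_type;
    }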
I still don't see the 10x speedup in the subject.
The 10x speedup was reported at the start of the thread several weeks ago, for benchmarks comparing the writing of large arrays and vectors through the serialization library with writing them directly into a stream or memory buffer. The MPI case was brought up to show that the serialization library can also be used efficiently there if we have save_array and load_array functionality. Please keep this in mind when reading my replies: we are discussing MPI here as another example, next to binary archives, where the save_array and load_array optimizations will be important.
For an X[], the two approaches are:
1. for each x in X[], "serialize" it into an MPI descriptor;
2. serialize X[0] into an MPI descriptor and construct an array descriptor from it.
Correct.
Conceptual issues with (2) aside (the external format of X is determined by X itself and you have no idea whether the structure of X[0] also describes X[1]),
Of course (2) can only be used for contiguous arrays of a single type, and not for types with pointer or polymorphic members. It will work for any type that is layout-compatible with a POD and contains no pointers or unions: examples are std::complex<T>, tuples of fundamental types, or any struct having only fundamental types as members. For these types the memory layout of X[0] is the same as that of X[1]. Where that does not hold, e.g. because of union or pointer members, the optimization can of course not be applied.
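As a concrete example of (2), here is a minimal sketch (names are illustrative, error handling omitted) that describes a single std::complex<double> to MPI once and then lets MPI apply that description to a whole contiguous vector; it relies on the layout-compatibility assumption stated above:

    #include <mpi.h>
    #include <complex>
    #include <vector>

    // Send a contiguous vector of std::complex<double> with one call,
    // assuming std::complex<double> is layout-compatible with double[2].
    void send_complex_vector(const std::vector<std::complex<double> >& v,
                             int dest, int tag, MPI_Comm comm)
    {
        // Describe one element ...
        MPI_Datatype complex_type;
        MPI_Type_contiguous(2, MPI_DOUBLE, &complex_type);
        MPI_Type_commit(&complex_type);

        // ... and let MPI apply that description to all v.size() elements at once.
        MPI_Send(const_cast<std::complex<double>*>(&v[0]),
                 static_cast<int>(v.size()), complex_type, dest, tag, comm);

        MPI_Type_free(&complex_type);
    }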
I'm not sure that there will be such a major speedup compared to the naive (1).
Oh yes, there can be a huge difference. Let me give a few reasons.

1) In the applications we are talking about we regularly have to send huge contiguous arrays of numbers (stored e.g. in a matrix, vector, valarray or multi_array) over the network. The typical size is 100 million numbers and upwards; I'll stick to 100 million as a typical number in the following. Storing these 100 million numbers already takes up 800 MByte and nearly fills the memory of the machine, which causes problems:

a) copying these numbers into a buffer using the serialization library needs another 800 MByte of memory that might not be available;

b) creating an MPI datatype entry for each element separately means storing at least 12 bytes per element (4 bytes each for the address, type and count), for a total of 1200 MByte instead of just 12 bytes. Again we have a memory problem.

2) But the main issue is speed. Serializing 100 million numbers one by one requires 100 million accesses to the network interface, while serializing the whole block at once causes a single call, and the rest is done by the hardware (see the sketch below). The reason we cannot afford this overhead is that on modern high-performance networks ** the network bandwidth is the same as the memory bandwidth **, so even if everything could be perfectly inlined and optimized, the time to read the MPI datatype for each element when using (1) would completely overwhelm the time actually needed to send the message using (2).

To substantiate my claim (**) above, a few numbers:

* the "Black Widow" network of the Cray X1 series has a network bandwidth of 55 GByte/second!
* the "Red Storm" network of the Cray XT3 Opteron clusters uses one HyperTransport channel for network access and another for memory access, so the network bandwidth there equals the memory bandwidth;
* the IBM Blue Gene/L has a similarly fast network, with 4.2 GByte/second of network bandwidth per node;
* even on cheaper commodity hardware, such as Quadrics interconnects, 1 GByte/second is common nowadays.

I am sure you will understand that to keep up with these data transfer rates we cannot afford additional operations, such as accessing the network interface once per double to read the address, quite apart from the memory issue raised above. I hope this clarifies why approach (2) should be taken whenever possible.
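The sketch mentioned above, contrasting the two approaches for a large array of doubles (illustrative only, error handling omitted):

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    // Approach (2): one call, one descriptor; the network hardware can
    // stream the 800 MByte directly out of the existing buffer.
    void send_block(const std::vector<double>& data, int dest, MPI_Comm comm)
    {
        MPI_Send(const_cast<double*>(&data[0]),
                 static_cast<int>(data.size()), MPI_DOUBLE, dest, 0, comm);
    }

    // Approach (1): one descriptor (or call) per element -- for 100 million
    // doubles the per-element bookkeeping alone dwarfs the cost of moving
    // the actual payload.
    void send_elementwise(const std::vector<double>& data, int dest, MPI_Comm comm)
    {
        for (std::size_t i = 0; i != data.size(); ++i)
            MPI_Send(const_cast<double*>(&data[i]), 1, MPI_DOUBLE, dest, 0, comm);
    }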
Robert's point also deserves attention; a portable binary archive that writes directly into a socket eliminates the MPI middleman and will probably achieve performance similar to your two-pass MPI approach.
This is indeed a nice idea and would remove the need for MPI in some applications on standard-speed TCP/IP based networks such as Ethernet, but it is not a general solution, for a number of reasons:

1. Sockets do not even exist on most dedicated network hardware, while MPI is still available since it is the standard API for message passing on parallel computers. Even where sockets are available, they just add additional layers of (expensive) function calls between the network hardware and the serialization library, whereas the vendor-provided MPI implementations usually access the network hardware directly.

2. MPI is much more than a point-to-point communication protocol built on top of sockets. It is a standardized API for all high-performance network hardware. In addition to point-to-point communication (synchronous, asynchronous, buffered and one-sided), it provides a large number of global operations, such as broadcasts, reductions, gather and scatter. These work with log(N) complexity on N nodes and often use special network hardware dedicated to the task (such as on an IBM Blue Gene/L machine). All of these operations can take advantage of the MPI datatype mechanism (see the sketch below).

3. An MPI implementation can determine at runtime whether the transformation to a portable binary archive is actually needed or whether the bits can just be streamed, and it will do this transparently, hiding it from the user.
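The collective operations mentioned in point 2 have no direct socket-level equivalent; a minimal sketch (illustrative only) of a broadcast and a reduction, both expressed through the same datatype mechanism:

    #include <mpi.h>
    #include <vector>

    void broadcast_and_sum(std::vector<double>& data, MPI_Comm comm)
    {
        // Broadcast the buffer from rank 0 to every node, in O(log N) steps.
        MPI_Bcast(&data[0], static_cast<int>(data.size()), MPI_DOUBLE, 0, comm);

        // Element-wise sum across all nodes, gathered on rank 0; dedicated
        // reduction hardware is used where the machine provides it.
        std::vector<double> sums(data.size());
        MPI_Reduce(&data[0], &sums[0], static_cast<int>(data.size()),
                   MPI_DOUBLE, MPI_SUM, 0, comm);
    }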
It also supports versioned non-PODs and other nontrivial types. As an example, I have a type which is saved as
save( x ):
    save( x.anim.name() ); // std::string

and loaded as

load( x ):
    std::string tmp; load( tmp ); x.set_animation( tmp );
Not everything is field-based.
Indeed, and for such types the optimization would not apply.
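One possible way to make that distinction mechanical is a trait-based dispatch; the trait name and the archive member used below are purely hypothetical, not any existing library interface:

    #include <complex>
    #include <cstddef>

    // Hypothetical trait: true only for types whose array layout is
    // element-wise identical (no pointers, unions or polymorphic members).
    template <class T> struct is_layout_compatible
    { static const bool value = false; };
    template <> struct is_layout_compatible<double>
    { static const bool value = true; };
    template <class T> struct is_layout_compatible<std::complex<T> >
    { static const bool value = is_layout_compatible<T>::value; };

    template <bool Optimized> struct save_array_impl;

    // Optimized path: hand the whole contiguous block to the archive at once.
    template <> struct save_array_impl<true>
    {
        template <class Archive, class T>
        static void apply(Archive& ar, const T* p, std::size_t n)
        { ar.save_block(p, n); }   // hypothetical archive member
    };

    // Fallback: serialize element by element, as for the animation example above.
    template <> struct save_array_impl<false>
    {
        template <class Archive, class T>
        static void apply(Archive& ar, const T* p, std::size_t n)
        { for (std::size_t i = 0; i != n; ++i) ar << p[i]; }
    };

    template <class Archive, class T>
    void save_array(Archive& ar, const T* p, std::size_t n)
    {
        save_array_impl<is_layout_compatible<T>::value>::apply(ar, p, n);
    }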