Re: [boost] [Boost-users] What's so cool about Boost.MPI?

On 5 Nov 2010, at 20:04, Riccardo Murri wrote:
On Fri, Nov 5, 2010 at 3:51 PM, Peter Foelsche <foelsche@sbcglobal.net> wrote:
Can somebody explain to me what is so great about MPI or Boost.MPI?
Daunting task, but I'll give it a try :-)
MPI is a library interface[1] to ease some common communication patterns in scientific/HPC software. In particular, MPI provides library functions for:
1. Reliable delivery of messages between two concurrent processes; there are functions for both synchronous and asynchronous communication.
2. Collective communication of the running processes: e.g., one process broadcasting a message to all others; all processes broadcasting to all; scattering a large array to all running processes, etc.
3. "Remote shared memory": a process can expose a memory buffer that all others can directly read/write.
4. Parallel I/O, where many processes access (possibly interleaved) portions of a file independently.
These are non-trivial features to implement on top of a standard TCP socket interface, but MPI implementations are, above all, focused on performance: they can usually take advantage of special hardware (e.g., InfiniBand or Myrinet interconnects) and use its native protocols (instead of TCP/IP) to provide faster communication.
A "message" in usual socket programming is just a flat array of bytes. A "message" in MPI is a (potentially non-contiguous portion) of a C or FORTRAN data structure; MPI can handle C structures and multi-dimensional C-style arrays, and multi-dimensional arrays of C structures (containing multi-dimensional arrays, etc.). However, since MPI is implemented as a library, you have to tdescribe to MPI the in-memory layout of the data structures you want to send as a message; thus "duplicating" work that is done by the compiler. You can imagine that this quickly gets tedious and unwieldy once you start having non-trivial data structures, possibly with members of unknown length (e.g., a list).
This is where Boost.MPI comes to the rescue: if you can provide Boost.Serialization support for your classes, you can send them as MPI messages with no extra effort: no need for lengthy calls to MPI_Type_create_struct() etc. And it does this with minimal or even no overhead! (Plus you also get a nice C++ interface to MPI functionality, whereas the MPI C calls are quite low-level.)
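For comparison, a minimal sketch of the same kind of exchange through Boost.MPI (again, the type and values are invented for illustration); the hand-written layout description disappears, and variable-length members such as std::string and std::vector come along for free:

    #include <boost/mpi.hpp>
    #include <boost/serialization/string.hpp>
    #include <boost/serialization/vector.hpp>
    #include <string>
    #include <vector>

    namespace mpi = boost::mpi;

    // An ordinary user-defined type; all it needs is Boost.Serialization
    // support -- no MPI_Type_create_struct() calls anywhere.
    struct Sample {
        std::string label;
        std::vector<double> values;

        template <class Archive>
        void serialize(Archive& ar, const unsigned int /*version*/) {
            ar & label;
            ar & values;
        }
    };

    int main(int argc, char* argv[]) {
        mpi::environment env(argc, argv);
        mpi::communicator world;

        if (world.rank() == 0) {
            Sample s;
            s.label = "temperatures";
            s.values.assign(3, 1.5);
            world.send(1, 0, s);      // serialized and sent in one call
        } else if (world.rank() == 1) {
            Sample s;
            world.recv(0, 0, s);      // received and deserialized
        }
        return 0;
    }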
.. [1] MPI proper is a specification; there are several implementations of the spec (e.g., OpenMPI, MPICH, MVAPICH, plus many vendor/proprietary ones), but they are all compatible at the source level.
Best regards, Riccardo
Nice summary! Matthias

On Sat, Nov 6, 2010 at 3:37 AM, Matthias Troyer <troyer@phys.ethz.ch> wrote:
Nice summary!
Agreed. Can anyone suggest changes to my article that would obviate the need for a second explanation? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Sun, Nov 7, 2010 at 1:47 AM, Dave Abrahams <dave@boostpro.com> wrote:
On Sat, Nov 6, 2010 at 3:37 AM, Matthias Troyer <troyer@phys.ethz.ch> wrote:
Nice summary!
Agreed. Can anyone suggest changes to my article that would obviate the need for a second explanation?
A backgrounder on Beowulf clusters, parallel programming, and high performance computing could frame Boost.MPI and MPI better. It's sad that not everyone knows how to program clusters of workstations to turn them into parallel supercomputers. An article like this is a good way of introducing the "art" of massively parallel computing and the importance of high performance computing at large. That would be my suggestion, at least, as someone who was once a heavy user of MPI, and later of Boost.MPI, for solving large problems. -- Dean Michael Berris deanberris.com

On Sat, Nov 6, 2010 at 3:37 AM, Matthias Troyer <troyer@phys.ethz.ch> wrote:
Nice summary! Dave Abrahams
I've been meaning to ask you... (after reading your web page in a rushed manner) How is your library different from RPC? Unless I missed the point, the library efficiently delivers data to a receiver so that it can process it offline. If you can, please point out some important points that I may not have grokked (other than the network and OS abstraction layer). Also, do you believe that efficiency in data transmission is important? Usually, the headers of serialized data add about 5% overhead, unless the data packets are small. The internet and local networks are about 10-gigabit now, and are pushing into the 100-gigabit range. By the time you get a lot of programmers coding to the library, the networks and CPUs will be so fast that I hardly think the small overheads will make any difference. One thing I've noticed for a decade now is that networks are an order of magnitude or two faster than computers; that is, networks deliver data faster than a computer can process it. Thanks, -Sid Sacek

At Wed, 10 Nov 2010 02:05:43 -0500, Sid Sacek wrote:
On Sat, Nov 6, 2010 at 3:37 AM, Matthias Troyer <troyer@phys.ethz.ch> wrote:
Nice summary! Dave Abrahams
I've been meaning to ask you... (after reading your web page in a rushed manner)
How is your library different from RPC? Unless I missed the point, the library efficiently delivers data to a receiver so that it can process it offline. If you can, please point out some important points that I may not have grokked (other than the network and OS abstraction layer).
RPC is a *paradigm* in which you make Remote Procedure Calls: the idea is that you call some function foo(bar) over here, and it invokes a function called foo on some representation of bar's value on a remote machine. MPI (which is not my library) is an *API* with a bunch of available implementations that offers *Message Passing*. That's just about movement of data, and the computation patterns aren't limited to those you can create with an imperative interface like RPC. Massively parallel computations tend to do well with a Bulk Synchronous Parallel (BSP) model, and MPI supports that paradigm very nicely. Boost.MPI is a library that makes it much nicer to do MPI programming in C++.
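As a rough illustration of the BSP style mentioned here (the names and the placeholder computation are made up, not taken from the thread): each process computes on its own data, and a collective operation both exchanges the results and marks the boundary between supersteps.

    #include <boost/mpi.hpp>
    #include <boost/mpi/collectives.hpp>
    #include <functional>

    namespace mpi = boost::mpi;

    int main(int argc, char* argv[]) {
        mpi::environment env(argc, argv);
        mpi::communicator world;

        // Superstep: each process computes on its own data...
        double local = static_cast<double>(world.rank());  // stand-in for real local work

        // ...then all processes exchange results in one collective step,
        // which also acts as the synchronization point between supersteps.
        double total = 0.0;
        mpi::all_reduce(world, local, total, std::plus<double>());

        // 'total' now holds the sum of 'local' across all ranks; the next
        // superstep would start here.
        return 0;
    }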
Also, do you believe that efficiency in data transmission is important?
I'm sure it depends on the underlying technology and the problem you're applying it to.
Usually, the headers of serialized data add about 5% overhead, unless the data packets are small. The internet and local networks are about 10-gigabit now, and are pushing into the 100-gigabit range. By the time you get a lot of programmers coding to the library, the networks and CPUs will be so fast that I hardly think the small overheads will make any difference. One thing I've noticed for a decade now is that networks are an order of magnitude or two faster than computers; that is, networks deliver data faster than a computer can process it.
That's not commonly the case with systems built on MPI. Communication costs tend to be very significant in a system where every node needs to talk to an arbitrary set of other nodes, and that's a common pattern for HPC problems. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Nov 10, 2010, at 8:11 PM, David Abrahams wrote:
Usually, the headers of serialized data add about 5% overhead, unless the data packets are small. The internet and local networks are about 10-gigabit now, and are pushing into the 100-gigabit range. By the time you get a lot of programmers coding to the library, the networks and CPUs will be so fast that I hardly think the small overheads will make any difference. One thing I've noticed for a decade now is that networks are an order of magnitude or two faster than computers; that is, networks deliver data faster than a computer can process it.
That's not commonly the case with systems built on MPI. Communication costs tend to be very significant in a system where every node needs to talk to an arbitrary set of other nodes, and that's a common pattern for HPC problems.
I also disagree with the statement that communication is faster than computation. Even if you have a 10 Gb/second network into a compute node, that corresponds only to about 150 million double precision floating point numbers per second. Let's connect that to a node with a *single* quad-core Nehalem CPU that operates at actually measured sustained speeds of 74 Gflop/s, and you see that the network is 500 times slower. Using four such CPUs in one node brings the ratio to 2000! Even networks ten times faster will only take this down to a factor of 200. Thus, in contrast to your statement, networks are *not* an order of magnitude or two faster than computers, but two or three orders of magnitude slower than the compute nodes. This will only get worse by an additional two orders of magnitude once we add GPUs or other future accelerator chips to the nodes. One reason why you might have the impression that you cannot process the incoming data fast enough might be limitations in how you can get the data from the network to the CPU. That's where dedicated high performance network hardware and optimized MPI libraries help. Boost.MPI makes it easier to use those MPI libraries. Matthias
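For reference, the back-of-the-envelope arithmetic behind those numbers, as a small sketch (the 10 Gb/s and 74 Gflop/s figures are the ones quoted in the message above, not measurements of any particular machine):

    #include <cstdio>

    int main() {
        // Figures taken from the message above (assumptions, not measurements).
        const double link_bits_per_s  = 10e9;   // 10 Gb/s network link
        const double bytes_per_double = 8.0;
        const double cpu_flop_per_s   = 74e9;   // sustained flop/s quoted for one CPU

        // Doubles that fit through the link each second: bits -> bytes -> doubles.
        const double doubles_per_s = link_bits_per_s / 8.0 / bytes_per_double;

        std::printf("doubles delivered per second: %.0f million\n",
                    doubles_per_s / 1e6);                       // ~156 million
        std::printf("flop available per delivered double: %.0f\n",
                    cpu_flop_per_s / doubles_per_s);            // ~474, i.e. roughly 500
        return 0;
    }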

I also disagree with the statement that communication is faster than computation. Even if you have a 10 Gb/second network into a compute node, that corresponds only to about 150 million double precision floating point numbers per second. Let's connect that to a node with a *single* quad-core Nehalem CPU that operates at actually measured sustained speeds of 74 Gflop/s, and you see that the network is 500 times slower. Using four such CPUs in one node brings the ratio to 2000! Even networks ten times faster will only take this down to a factor of 200. Thus, in contrast to your statement, networks are *not* an order of magnitude or two faster than computers, but two or three orders of magnitude slower than the compute nodes. This will only get worse by an additional two orders of magnitude once we add GPUs or other future accelerator chips to the nodes.
Wow! You picked (cherry-picked) a very particular data type, and then performed a simple division between the FPU speed and the incoming data rate. There are so many things that occur in the CPU before you can process network data: NIC interrupts to the drivers, driver interrupt processing, drivers signaling the running processes, task swaps, page faults, paging, cache flushes, cache updates, data being transferred between buffers two to five times before it is processed, endian conversions, programs switching on key data bytes to call the proper procedures to process the data, the processed data then being used to trigger new actions, and so on. A much better algorithm for estimating performance is to determine how many assembly instructions you anticipate it will take to process a single byte of data. Data comes in infinite forms. Before the FPU gets a crack at the data, it has to pass through the CPU. Think about it... the data coming in from the network isn't being fed straight into your FPU hardware and the results being tossed away. My experience in "network data" processing is very different from yours. -Sid

On 12 Nov 2010, at 05:26, Sid Sacek wrote:
I also disagree with the statement that communication is faster than computation. Even if you have a 10 Gb/second network into a compute node, that corresponds only to about 150 million double precision floating point numbers per second. Let's connect that to a node with a *single* quad-core Nehalem CPU that operates at actually measured sustained speeds of 74 Gflop/s, and you see that the network is 500 times slower. Using four such CPUs in one node brings the ratio to 2000! Even networks ten times faster will only take this down to a factor of 200. Thus, in contrast to your statement, networks are *not* an order of magnitude or two faster than computers, but two or three orders of magnitude slower than the compute nodes. This will only get worse by an additional two orders of magnitude once we add GPUs or other future accelerator chips to the nodes.
Wow! You picked (cherry-picked) a very particular data type, and then performed a simple division between the FPU speed and the incoming data rate.
Not at all! I picked a typical high performance computing application for which MPI is designed. Such performance is achieved in real world applications, and the main bottleneck is usually the network.
There are so many things that occur in the CPU before you can process network data: NIC interrupts to the drivers, driver interrupt processing, drivers signaling the running processes, task swaps, page faults, paging, cache flushes, cache updates, data being transferred between buffers two to five times before it is processed, endian conversions, programs switching on key data bytes to call the proper procedures to process the data, the processed data then being used to trigger new actions, and so on.
Actually, this is where MPI comes into play, by getting rid of most of that overhead. What you describe above is exactly the reason why high performance computers use special network hardware, and why high performance MPI implementations are not built on top of TCP/IP. Using MPI we bypass most of the layers you list above, and we typically know what kind of data to expect. Even so, network latency and bandwidth are a worsening bottleneck for the scaling of parallel programs.
A much better algorithm for estimating performance is to determine how many assembly instructions you anticipate it will take to process a single byte of data. Data comes in infinite forms. Before the FPU gets a crack at the data, it has to pass through the CPU.
Think about it... the data coming in from the network isn't being fed straight into your FPU hardware and the results being tossed away.
My experience in "network data" processing is very different from yours.
Indeed. I am talking about high performance computing, which is apparently very different from your kind of applications. But in HPC we are often able to saturate the floating point units, and memory and network performance are big problems. If you like, I can send you application codes that are totally limited by bandwidth, where the CPU is idle 99% of the time. Keep in mind that MPI (and Boost.MPI) is specifically designed for high performance computing applications, and we should look at such applications when talking about MPI. There my estimates above are right, even if they are an upper bound for what we see: even doing ten or a hundred times more operations on the data, the CPU is still five times faster than the network. In addition, the processing power of the compute nodes grows much faster than network speeds, so this issue will only get worse. Matthias

MPI (which is not my library) is an *API* with a bunch of available implementations that offers *Message Passing*. That's just about movement of data, and the computation patterns aren't limited to those you can create with an imperative interface like RPC. Massively parallel computations tend to do well with a Bulk Synchronous Parallel (BSP) model, and MPI supports that paradigm very nicely.
Boost.MPI is a library that makes it much nicer to do MPI programming in C++.
Dave Abrahams
Thank you very much for that thorough explanation. -Sid
participants (5)
- Dave Abrahams
- David Abrahams
- Dean Michael Berris
- Matthias Troyer
- Sid Sacek