
Hi! I'm using the boost::mpi library for an HPC project. I really like the interface, but I'm currently getting very poor performance from the library. I started out by serializing my objects (which are full of pointers, allocated memory, and so on), but that didn't perform at all, so I went for a more brute-force approach instead. No luck there either.

Essentially, what I want to do in a typical case is send a set of indexes (4 integers) followed by an array of doubles. The array sizes are fixed at startup and range from 10 to 60 kB each. There are usually many of these arrays, and the total amount of data to be communicated at the end of a calculation is on the order of 1 GB. Here is my current implementation:

1) Pack the indexes into an array of 4 integers and send (or broadcast) them to the receiver(s). The receiver figures out where to store the next packet based on the indexes (this takes next to no time).

2) Send the array to the receivers using:

    double *data = coefs->data();
    world.send(who, tag, data, nCoefs);

where coefs is a pointer to an Eigen2 vector, and data is a pointer to a contiguous array of doubles.

I'm running the code on a big HPC cluster whose individual nodes have 8 cores and 16 GB of memory, all connected with Infiniband. Using this setup I achieve a maximum transfer rate of 66 MB/s doing all-to-one communication, which is roughly 10 times less than what I should get. I will not even mention how long a broadcast takes, but suffice to say that it takes 20-25 times longer than doing the calculation. I get the same poor performance regardless of whether I'm communicating only over 127.0.0.1 or over the network. Since our environment is homogeneous, I have compiled both the MPI library and my program with the BOOST_MPI_HOMOGENEOUS macro defined.

I will try to batch more packages into larger units, but earlier experience (with plain MPI) has shown that with 65 kB arrays, transfer rates of 1 GB/s are possible over our Infiniband switch. Asynchronous transfer is an option, but it complicates the load-balancing algorithms to a point where I really don't want to go unless at gunpoint.

Any suggestions?

Best regards,
-jonas-
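As a self-contained sketch, the pattern described above looks roughly like the following. The names who, tag, coefs, and nCoefs come from the post; the communicator argument, the Eigen vector type, and the wrapper function itself are assumptions added for illustration:

    #include <boost/mpi.hpp>
    #include <Eigen/Core>

    namespace mpi = boost::mpi;

    // Two-step send as described in the post: first the 4 indexes,
    // then the contiguous array of doubles straight from the Eigen buffer.
    void send_block(mpi::communicator &world, int who, int tag,
                    const int idx[4], Eigen::VectorXd *coefs)
    {
        // 1) Send the indexes so the receiver knows where to store the packet.
        world.send(who, tag, idx, 4);

        // 2) Send the coefficient array without any serialization.
        double *data = coefs->data();
        int nCoefs = static_cast<int>(coefs->size());
        world.send(who, tag, data, nCoefs);
    }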

On 19 Oct 2009, at 18:26, Jonas Juselius wrote:
Hi! I'm using the boost::mpi library for an HPC project. I really like the interface, but I'm currently getting very poor performance from the library. I started out by serializing my objects (which are full of pointers, allocated memory, and so on), but that didn't perform at all, so I went for a more brute-force approach instead. No luck there either.
Essentially, what I want to do in a typical case is send a set of indexes (4 integers) followed by an array of doubles. The array sizes are fixed at startup and range from 10 to 60 kB each. There are usually many of these arrays, and the total amount of data to be communicated at the end of a calculation is on the order of 1 GB. Here is my current implementation:

1) Pack the indexes into an array of 4 integers and send (or broadcast) them to the receiver(s). The receiver figures out where to store the next packet based on the indexes (this takes next to no time).

2) Send the array to the receivers using:
    double *data = coefs->data();
    world.send(who, tag, data, nCoefs);
where coefs is a pointer to an Eigen2 vector, and data is a pointer to a contiguous array of doubles.
I'm running the code on a big HPC cluster whose individual nodes have 8 cores and 16 GB of memory, all connected with Infiniband. Using this setup I achieve a maximum transfer rate of 66 MB/s doing all-to-one communication, which is roughly 10 times less than what I should get. I will not even mention how long a broadcast takes, but suffice to say that it takes 20-25 times longer than doing the calculation. I get the same poor performance regardless of whether I'm communicating only over 127.0.0.1 or over the network. Since our environment is homogeneous, I have compiled both the MPI library and my program with the BOOST_MPI_HOMOGENEOUS macro defined.

I will try to batch more packages into larger units, but earlier experience (with plain MPI) has shown that with 65 kB arrays, transfer rates of 1 GB/s are possible over our Infiniband switch. Asynchronous transfer is an option, but it complicates the load-balancing algorithms to a point where I really don't want to go unless at gunpoint. Any suggestions?
Dear Jonas,

I am surprised by that slow performance. It should actually be the same as doing an MPI_Send and passing it the same pointer, length, and MPI_DOUBLE as the data type. Could you please try this instead and let me know if it speeds things up?

Matthias
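For comparison, a minimal raw-MPI version of the same transfer might look like the following. MPI_COMM_WORLD and the wrapper function are assumptions; the pointer and count are the same ones used in the boost::mpi call:

    #include <mpi.h>

    // Same send expressed with the C MPI API, useful for timing
    // against the boost::mpi version.
    void send_raw(int who, int tag, double *data, int nCoefs)
    {
        MPI_Send(data, nCoefs, MPI_DOUBLE, who, tag, MPI_COMM_WORLD);
    }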

On 10/20/09 3:26 AM, Matthias Troyer wrote:
On 19 Oct 2009, at 18:26, Jonas Juselius wrote:
Hi! I'm using the boost::mpi library for an HPC project. I really like the interface, but I'm currently getting very poor performance from the library. I started out by serializing my objects (which are full of pointers, allocated memory, and so on), but that didn't perform at all, so I went for a more brute-force approach instead. No luck there either.
Dear Jonas,
I am surprised by that slow performance. It should actually be the same as doing an MPI_Send and passing it the same pointer, length, and MPI_DOUBLE as the data type. Could you please try this instead and let me know if it speeds things up?
Matthias
I did some much more careful testing, just timing the individual calls to world.send() and broadcast(world). It turns out that I was very wrong, and the transfer speeds are actually between 750 and 950 MB/s, which is quite acceptable ;) However, the average transfer rate is only 66 MB/s once all the waiting etc. is included.

Unfortunately it seems that the only way to improve the situation is to bite the bullet and go for asynchronous sends and receives, and/or use all_gather(). Currently I need to collect all the data on the master process. In the future everything needs to be distributed anyway, because no single node will be able to hold all the data in memory.

Best regards,
-jonas-
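For what it's worth, a minimal sketch of the non-blocking collection mentioned above could look like this. The buffer layout, source list, and helper function are assumptions, and the index exchange is omitted; the idea is simply to post all receives up front and wait once, so transfers can overlap instead of being serviced one at a time:

    #include <boost/mpi.hpp>
    #include <vector>

    namespace mpi = boost::mpi;

    // Master-side collection: start every receive, then wait for all of them.
    void collect_async(mpi::communicator &world, int tag,
                       std::vector<std::vector<double> > &buffers,
                       const std::vector<int> &sources)
    {
        std::vector<mpi::request> reqs;
        reqs.reserve(sources.size());

        // Post one non-blocking receive per expected array.
        for (std::size_t i = 0; i < sources.size(); ++i) {
            reqs.push_back(world.irecv(sources[i], tag, &buffers[i][0],
                                       static_cast<int>(buffers[i].size())));
        }

        // Block only once, after all transfers have been started.
        mpi::wait_all(reqs.begin(), reqs.end());
    }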