
Hi! I'm using the boost::mpi library for an HPC project. I really like the interface, but I'm currently getting very poor performance from the library. I started out by serializing my objects (which are full of pointers, allocated memory, and so on), but that didn't perform at all, so I went for a more brute-force approach instead. No luck there either.

Essentially, what I want to do in a typical case is send a set of indexes (4 integers) followed by an array of doubles. The array sizes are fixed at startup and range from 10 to 60 kB each. There are usually many of these arrays, and the total amount of data to be communicated at the end of a calculation is on the order of 1 GB. Here is my current implementation:

1) Pack the indexes into an array of 4 integers and send (or broadcast) them to the receiver(s). The receiver figures out where to store the next packet based on the indexes (this takes next to no time).

2) Send the array to the receivers using:

    double *data = coefs->data();
    world.send(who, tag, data, nCoefs);

where coefs is a pointer to an Eigen2 vector, and data is a pointer to a contiguous array of doubles.

I'm running the code on a big HPC cluster whose individual nodes have 8 cores and 16 GB of memory, all connected with Infiniband. Using this setup I achieve a maximum transfer rate of 66 MB/s doing all-to-one communication, which is roughly 10 times less than what I should get. I will not even mention how long a broadcast takes, but suffice to say that it takes 20-25 times longer than doing the calculation. I get the same poor performance regardless of whether I'm communicating only over 127.0.0.1 or over the network. Since our environment is homogeneous, I have compiled both the MPI library and my program with the BOOST_MPI_HOMOGENEOUS macro defined.

I will try to batch more packages into larger units, but earlier experience (with plain MPI) has shown that with 65 kB arrays, transfer rates of 1 GB/s are possible over our Infiniband switch. Asynchronous transfer is an option, but it complicates the load-balancing algorithms to a point where I really don't want to go unless at gunpoint.

Any suggestions?

Best regards,
-jonas-
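As a self-contained sketch, the pattern described above looks roughly like the following. The names who, tag, coefs, and nCoefs come from the post; the communicator argument, the Eigen vector type, and the wrapper function itself are assumptions added for illustration:

    #include <boost/mpi.hpp>
    #include <Eigen/Core>

    namespace mpi = boost::mpi;

    // Two-step send as described in the post: first the 4 indexes,
    // then the contiguous array of doubles straight from the Eigen buffer.
    void send_block(mpi::communicator &world, int who, int tag,
                    const int idx[4], Eigen::VectorXd *coefs)
    {
        // 1) Send the indexes so the receiver knows where to store the packet.
        world.send(who, tag, idx, 4);

        // 2) Send the coefficient array without any serialization.
        double *data = coefs->data();
        int nCoefs = static_cast<int>(coefs->size());
        world.send(who, tag, data, nCoefs);
    }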

On 19 Oct 2009, at 18:26, Jonas Juselius wrote:
Hi! I'm using the boost::mpi library for an HPC project. I really like the interface, but I'm currently getting very poor performance from the library. I started out by serializing my objects (which are full of pointers, allocated memory, and so on), but that didn't perform at all, so I went for a more brute-force approach instead. No luck there either.
Essentially, what I want to do in a typical case is send a set of indexes (4 integers) followed by an array of doubles. The array sizes are fixed at startup and range from 10 to 60 kB each. There are usually many of these arrays, and the total amount of data to be communicated at the end of a calculation is on the order of 1 GB. Here is my current implementation:

1) Pack the indexes into an array of 4 integers and send (or broadcast) them to the receiver(s). The receiver figures out where to store the next packet based on the indexes (this takes next to no time).

2) Send the array to the receivers using:
    double *data = coefs->data();
    world.send(who, tag, data, nCoefs);
where coefs is a pointer to an Eigen2 vector, and data is a pointer to a contiguous array of doubles.
I'm running the code on a big HPC cluster whose individual nodes have 8 cores and 16 GB of memory, all connected with Infiniband. Using this setup I achieve a maximum transfer rate of 66 MB/s doing all-to-one communication, which is roughly 10 times less than what I should get. I will not even mention how long a broadcast takes, but suffice to say that it takes 20-25 times longer than doing the calculation. I get the same poor performance regardless of whether I'm communicating only over 127.0.0.1 or over the network. Since our environment is homogeneous, I have compiled both the MPI library and my program with the BOOST_MPI_HOMOGENEOUS macro defined.

I will try to batch more packages into larger units, but earlier experience (with plain MPI) has shown that with 65 kB arrays, transfer rates of 1 GB/s are possible over our Infiniband switch. Asynchronous transfer is an option, but it complicates the load-balancing algorithms to a point where I really don't want to go unless at gunpoint. Any suggestions?
Dear Jonas,

I am surprised by that slow performance. It should actually be the same as doing an MPI_Send and passing it the same pointer, length, and MPI_DOUBLE as the data type. Could you please try this instead and let me know if it speeds things up?

Matthias
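For comparison, a minimal raw-MPI version of the same transfer might look like the following. MPI_COMM_WORLD and the wrapper function are assumptions; the pointer and count are the same ones used in the boost::mpi call:

    #include <mpi.h>

    // Same send expressed with the C MPI API, useful for timing
    // against the boost::mpi version.
    void send_raw(int who, int tag, double *data, int nCoefs)
    {
        MPI_Send(data, nCoefs, MPI_DOUBLE, who, tag, MPI_COMM_WORLD);
    }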

On 10/20/09 3:26 AM, Matthias Troyer wrote:
On 19 Oct 2009, at 18:26, Jonas Juselius wrote:
Hi! I'm using the boost::mpi library for an HPC project. I really like the interface, but I'm currently getting very poor performance from the library. I started out by serializing my objects (which are full of pointers, allocated memory, and so on), but that didn't perform at all, so I went for a more brute-force approach instead. No luck there either.
Dear Jonas,
I am surprised by that slow performance. It should actually be the same as doing an MPI_Send and passing it the same pointer, length, and MPI_DOUBLE as the data type. Could you please try this instead and let me know if it speeds things up?
Matthias
I did some much more careful testing, just timing the individual calls to world.send() and broadcast(world). It turns out that I was very wrong, and the transfer speeds are actually between 750 and 950 MB/s, which is quite acceptable ;) However, the average transfer rate is only 66 MB/s once all the waiting etc. is included.

Unfortunately it seems that the only way to improve the situation is to bite the bullet and go for asynchronous sends and receives, and/or use all_gather(). Currently I need to collect all the data on the master process. In the future everything needs to be distributed anyway, because no single node will be able to hold all the data in memory.

Best regards,
-jonas-
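For what it's worth, a minimal sketch of the non-blocking collection mentioned above could look like this. The buffer layout, source list, and helper function are assumptions, and the index exchange is omitted; the idea is simply to post all receives up front and wait once, so transfers can overlap instead of being serviced one at a time:

    #include <boost/mpi.hpp>
    #include <vector>

    namespace mpi = boost::mpi;

    // Master-side collection: start every receive, then wait for all of them.
    void collect_async(mpi::communicator &world, int tag,
                       std::vector<std::vector<double> > &buffers,
                       const std::vector<int> &sources)
    {
        std::vector<mpi::request> reqs;
        reqs.reserve(sources.size());

        // Post one non-blocking receive per expected array.
        for (std::size_t i = 0; i < sources.size(); ++i) {
            reqs.push_back(world.irecv(sources[i], tag, &buffers[i][0],
                                       static_cast<int>(buffers[i].size())));
        }

        // Block only once, after all transfers have been started.
        mpi::wait_all(reqs.begin(), reqs.end());
    }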