
On Sep 16, 2006, at 2:20 AM, K. Noel Belcourt wrote:
I was able to review the implementations of the broadcast, gather, scatter, and reduce functions, which all call through to the corresponding MPI_ function. This is perfectly reasonable. But these, and other, functions can be implemented much more efficiently using sends and recvs. These less efficient implementations may adversely impact adoption of Boost.MPI by the larger high performance computing community. I would like the authors to consider these more efficient algorithms at some point in the future.
Performance is extremely important to us, so I want to make sure I understand exactly what you mean. One of the biggest assumptions we make, particularly with the collectives, is that using the most specialized MPI call gives the best performance. So if the user sums up integers with a reduce() call, we should call MPI_Reduce(..., MPI_INT, MPI_SUM, ...) to get the best performance, because it has probably been optimized by the MPI vendor, both in general (i.e., a better algorithm than ours) and for their specific hardware.

Of course, if the underlying MPI has a poorly optimized implementation of MPI_Reduce, it is conceivable that Boost.MPI's simple tree-based implementation could perform better. I haven't actually run into this problem yet, but it clearly can happen: I've peeked at one or two MPI implementations and have been appalled at how naively some of the collectives are implemented. I think this is the point you're making: depending on the underlying MPI implementation, it might be better not to specialize down to, e.g., the MPI_Reduce call.

There is at least one easy way to address this issue. We could introduce a set of global, compile-time flags that state whether the underlying implementation of a given collective is better than ours. These flags would vary depending on the underlying MPI. For instance, maybe Open MPI has a fast broadcast implementation, so we would have

  typedef mpl::true_ has_fast_bcast;

whereas LAM/MPI might not have a fast broadcast:

  typedef mpl::false_ has_fast_bcast;

These flags would be queried in the algorithm dispatch logic:

  template<typename T>
  void broadcast(const communicator& comm, T& value, int root = 0)
  {
    detail::broadcast_impl(comm, value, root,
                           mpl::and_<is_mpi_datatype<T>, has_fast_bcast>());
  }

The only tedious part of implementing this is determining which collectives are well-optimized in all of the common MPI implementations, although we could certainly assume the best and tweak the configuration as our understanding evolves.

  Doug
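P.S. To make the two sides of that dispatch concrete, here is a rough sketch of what the detail::broadcast_impl overloads might look like. The mpl::true_ overload just defers to the vendor's MPI_Bcast; the mpl::false_ overload shows the kind of binomial-tree broadcast one could build from Boost.MPI's point-to-point send/recv. Treat both bodies (and has_fast_bcast itself) as illustrative only, not the exact code in the library:

  #include <boost/mpi/communicator.hpp>
  #include <boost/mpi/datatype.hpp>
  #include <boost/mpl/bool.hpp>
  #include <mpi.h>

  namespace detail {

  // Fast path: T maps directly to an MPI datatype and the vendor's
  // MPI_Bcast is trusted, so just call it.
  template<typename T>
  void broadcast_impl(const boost::mpi::communicator& comm, T& value,
                      int root, boost::mpl::true_)
  {
    MPI_Bcast(&value, 1, boost::mpi::get_mpi_datatype(value), root, comm);
  }

  // Fallback: a binomial tree over point-to-point messages. Ranks are
  // renumbered so the root becomes 0; each round doubles the set of
  // ranks that already hold the value.
  template<typename T>
  void broadcast_impl(const boost::mpi::communicator& comm, T& value,
                      int root, boost::mpl::false_)
  {
    const int size = comm.size();
    const int rank = (comm.rank() - root + size) % size;  // relative to root

    for (int step = 1; step < size; step <<= 1) {
      if (rank < step) {
        // Already have the value; forward it to the partner 'step' away.
        int child = rank + step;
        if (child < size)
          comm.send((child + root) % size, 0, value);
      } else if (rank < 2 * step) {
        // Receive from the partner that obtained the value in an earlier round.
        comm.recv((rank - step + root) % size, 0, value);
      }
    }
  }

  } // namespace detail

The same pattern would apply to reduce, gather, and scatter; the only per-MPI work is deciding which of the built-in collectives to trust.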