
On Sep 16, 2006, at 2:20 AM, K. Noel Belcourt wrote:
I was able to review the implementations of the broadcast, gather, scatter, and reduce functions, which all call through to the corresponding MPI_ function. This is perfectly reasonable. But these, and other, functions can be implemented much more efficiently using sends and recvs. These less efficient implementations may adversely impact adoption of Boost.MPI by the larger high performance computing community. I would like the authors to consider these more efficient algorithms at some point in the future.
Performance is extremely important to us, so I want to make sure I understand exactly what you mean. One of the biggest assumptions we make, particularly with the collectives, is that using the most specialized MPI call gives the best performance. So if the user sums up integers with a reduce() call, we should call MPI_Reduce(..., MPI_INT, MPI_SUM, ...) to get the best performance, because it has probably been optimized by the MPI vendor, both in general (i.e., a better algorithm than ours) and for their specific hardware.

Of course, if the underlying MPI has a poorly optimized implementation of MPI_Reduce, it is conceivable that Boost.MPI's simple tree-based implementation could perform better. I haven't actually run into this problem yet, but it clearly can happen: I've peeked at one or two MPI implementations and have been appalled at how naively some of the collectives are implemented. I think this is the point you're making: depending on the underlying MPI implementation, it might be better not to specialize down to, e.g., the MPI_Reduce call.

There is at least one easy way to address this issue. We could introduce a set of global, compile-time flags that state whether the underlying implementation of a given collective is better than ours. These flags would vary depending on the underlying MPI. For instance, maybe Open MPI has a fast broadcast implementation, so we would have

  typedef mpl::true_ has_fast_bcast;

whereas LAM/MPI might not have a fast broadcast:

  typedef mpl::false_ has_fast_bcast;

These flags would be queried in the algorithm dispatch logic:

  template<typename T>
  void broadcast(const communicator& comm, T& value, int root = 0)
  {
    detail::broadcast_impl(comm, value, root,
                           mpl::and_<is_mpi_datatype<T>, has_fast_bcast>());
  }

The only tedious part of implementing this is determining which collectives are well-optimized in all of the common MPI implementations, although we could certainly assume the best and tweak the configuration as our understanding evolves.

  Doug
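P.S. To make the two sides of that dispatch concrete, here is a rough sketch of what the detail::broadcast_impl overloads might look like. The mpl::true_ overload just defers to the vendor's MPI_Bcast; the mpl::false_ overload shows the kind of binomial-tree broadcast one could build from Boost.MPI's point-to-point send/recv. Treat both bodies (and has_fast_bcast itself) as illustrative only, not the exact code in the library:

  #include <boost/mpi/communicator.hpp>
  #include <boost/mpi/datatype.hpp>
  #include <boost/mpl/bool.hpp>
  #include <mpi.h>

  namespace detail {

  // Fast path: T maps directly to an MPI datatype and the vendor's
  // MPI_Bcast is trusted, so just call it.
  template<typename T>
  void broadcast_impl(const boost::mpi::communicator& comm, T& value,
                      int root, boost::mpl::true_)
  {
    MPI_Bcast(&value, 1, boost::mpi::get_mpi_datatype(value), root, comm);
  }

  // Fallback: a binomial tree over point-to-point messages. Ranks are
  // renumbered so the root becomes 0; each round doubles the set of
  // ranks that already hold the value.
  template<typename T>
  void broadcast_impl(const boost::mpi::communicator& comm, T& value,
                      int root, boost::mpl::false_)
  {
    const int size = comm.size();
    const int rank = (comm.rank() - root + size) % size;  // relative to root

    for (int step = 1; step < size; step <<= 1) {
      if (rank < step) {
        // Already have the value; forward it to the partner 'step' away.
        int child = rank + step;
        if (child < size)
          comm.send((child + root) % size, 0, value);
      } else if (rank < 2 * step) {
        // Receive from the partner that obtained the value in an earlier round.
        comm.recv((rank - step + root) % size, 0, value);
      }
    }
  }

  } // namespace detail

The same pattern would apply to reduce, gather, and scatter; the only per-MPI work is deciding which of the built-in collectives to trust.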