
On Sep 16, 2006, at 5:05 AM, Matthias Troyer wrote:
On Sep 16, 2006, at 10:22 AM, Markus Blatt wrote:
The question came up when I looked into mpi/collectives/broadcast.hpp:
// We're sending a type that does not have an associated MPI
// datatype, so we'll need to serialize it. Unfortunately, this
// means that we cannot use MPI_Bcast, so we'll just send from the
// root to everyone else.
template<typename T>
void broadcast_impl(const communicator& comm, T& value, int root,
                    mpl::false_)
If this function gets called, the performance will definitely be suboptimal, as the root will send to all others. Is this only the case if no MPI_Datatype was constructed (as for the linked list), or is it called whenever Boost serialization is used?
OK, I see your concern. This is actually only used when no MPI_Datatype can be constructed, that is, when no MPI_Datatype is possible (such as for a linked list) and you do not use the skeleton & content mechanism either.
Right. From the code standpoint, in addition to the broadcast_impl shown above, there is one that looks like this:

// We're sending a type that has an associated MPI datatype, so
// we'll use MPI_Bcast to do all of the work.
template<typename T>
void broadcast_impl(const communicator& comm, T& value, int root,
                    mpl::true_)

That last parameter decides which implementation to use, based on whether we have or can create an MPI_Datatype for the type T.
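(A quick aside for readers following the thread: below is a minimal sketch of how that third-parameter dispatch can be expressed, using the Boost.MPI traits is_mpi_datatype<T> and get_mpi_datatype(). The "sketch" namespace and the wrapper function are illustrative, not the library's actual source.)

#include <mpi.h>
#include <boost/mpi/communicator.hpp>
#include <boost/mpi/datatype.hpp>   // is_mpi_datatype, get_mpi_datatype
#include <boost/mpl/bool.hpp>

namespace sketch {
  using boost::mpi::communicator;

  // Fast path: T maps onto an MPI_Datatype, so a single MPI_Bcast
  // over the value's raw representation is enough.
  template<typename T>
  void broadcast_impl(const communicator& comm, T& value, int root,
                      boost::mpl::true_)
  {
    MPI_Bcast(&value, 1, boost::mpi::get_mpi_datatype(value), root,
              static_cast<MPI_Comm>(comm));
  }

  // Slow path: T has to be serialized; the root ships the packed
  // buffer to every other rank (see the sketches further down).
  template<typename T>
  void broadcast_impl(const communicator& comm, T& value, int root,
                      boost::mpl::false_);

  // The public entry point picks one overload at compile time.
  template<typename T>
  void broadcast(const communicator& comm, T& value, int root)
  {
    broadcast_impl(comm, value, root, boost::mpi::is_mpi_datatype<T>());
  }
}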
Since this part of the code was written by Doug Gregor, I ask him to correct me if I say something wrong or miss something. When no MPI datatype exists, we need to pack the object into a buffer using MPI_Pack, and that buffer needs to be broadcast. So far we all seem to agree. The problem is that the receiving side needs to know the size of the buffer in order to allocate enough memory, but there is no MPI_Probe for collectives that could be used to inquire about the message size. I believe this was the reason for implementing the broadcast as a sequence of nonblocking sends and receives (Doug?).
Yes, this was the reason for the sequence of nonblocking sends and receives.
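(For concreteness, a rough sketch of that scheme using the public Boost.MPI point-to-point calls rather than the library's internal packing code; the wrapper name and the tag value are made up for illustration. The point is that with point-to-point messages the receiver can learn the payload size before allocating, which MPI_Bcast does not allow.)

#include <vector>
#include <boost/mpi/communicator.hpp>
#include <boost/mpi/request.hpp>
#include <boost/mpi/nonblocking.hpp>   // boost::mpi::wait_all

// Root serializes `value` and sends it to every other rank with
// nonblocking sends; each non-root rank receives and deserializes.
template<typename T>
void broadcast_by_sends(const boost::mpi::communicator& comm,
                        T& value, int root, int tag = 0)
{
  if (comm.rank() == root) {
    std::vector<boost::mpi::request> reqs;
    for (int dest = 0; dest < comm.size(); ++dest)
      if (dest != root)
        reqs.push_back(comm.isend(dest, tag, value));
    boost::mpi::wait_all(reqs.begin(), reqs.end());
  } else {
    comm.recv(root, tag, value);
  }
}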
Thinking about it, I realize that one could instead do two consecutive broadcasts: one to send the size of the buffer and then another to send the buffer itself. This will definitely be faster on machines with special hardware for collectives. On Beowulf clusters, on the other hand, the current version is faster, since most MPI implementations just perform the broadcast as a sequence of N-1 send/receive operations from the root instead of optimizing it.
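(Sketched below, just to make the two-step idea concrete: the size and the bytes go through two plain MPI_Bcast calls, with serialization done here through Boost.Serialization binary archives rather than the library's internal packed archives; the helper name is made up.)

#include <sstream>
#include <string>
#include <vector>
#include <mpi.h>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/mpi/communicator.hpp>

// Two consecutive MPI_Bcasts: first the buffer size, then the buffer
// itself, so hardware-accelerated broadcasts can be used even for
// types without an MPI_Datatype.
template<typename T>
void broadcast_in_two_steps(const boost::mpi::communicator& comm,
                            T& value, int root)
{
  MPI_Comm c = comm;
  std::vector<char> buffer;

  if (comm.rank() == root) {
    std::ostringstream os;
    boost::archive::binary_oarchive oa(os);
    oa << value;
    const std::string& s = os.str();
    buffer.assign(s.begin(), s.end());
  }

  // Step 1: broadcast the size so the receivers can allocate the buffer.
  unsigned long size = static_cast<unsigned long>(buffer.size());
  MPI_Bcast(&size, 1, MPI_UNSIGNED_LONG, root, c);

  // Step 2: broadcast the serialized bytes.
  buffer.resize(size);
  MPI_Bcast(buffer.data(), static_cast<int>(size), MPI_BYTE, root, c);

  if (comm.rank() != root) {
    std::string bytes(buffer.begin(), buffer.end());
    std::istringstream is(bytes);
    boost::archive::binary_iarchive ia(is);
    ia >> value;
  }
}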
Right. I guess we could provide some kind of run-time configuration switch that decides between the two implementations, if someone runs into a case where it matters.

Doug
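(If it ever comes to that, such a switch could be as simple as the sketch below, which checks a made-up environment variable; the variable name and both helpers come from the earlier sketches in this thread, not from anything Boost.MPI actually provides.)

#include <cstdlib>

// Illustrative only: pick one of the two strategies at run time via an
// environment variable (the name BOOST_MPI_BCAST_TWO_STEPS is invented).
template<typename T>
void broadcast_serialized(const boost::mpi::communicator& comm,
                          T& value, int root)
{
  static const bool two_steps =
      std::getenv("BOOST_MPI_BCAST_TWO_STEPS") != 0;
  if (two_steps)
    broadcast_in_two_steps(comm, value, root);  // size + payload broadcasts
  else
    broadcast_by_sends(comm, value, root);      // root sends to each rank
}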