
On Sep 16, 2006, at 10:22 AM, Markus Blatt wrote:
Just forget about it. I was missing the tags in the collective communication, where there definitely are none in the MPI standard. Probably I should have gotten more sleep. Sorry.
I would actually also love to have tags there :-)
I hope these answers address the issues you had in mind. I can elaborate if you want.
The question came up when I looked into mpi/collectives/broadcast.hpp:
// We're sending a type that does not have an associated MPI
// datatype, so we'll need to serialize it. Unfortunately, this
// means that we cannot use MPI_Bcast, so we'll just send from the
// root to everyone else.
template<typename T>
void broadcast_impl(const communicator& comm, T& value, int root, mpl::false_)
If this function gets called, the performance will definitely be suboptimal, as the root will send to all others. Is this just the case if no MPI_Datatype was constructed (like for the linked list), or is it called whenever Boost serialization is used?
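To make the question concrete, here is a toy example (my own, not from the library, and assuming the usual boost::mpi header and namespace names) of the two cases I have in mind:

#include <boost/mpi.hpp>
#include <boost/serialization/list.hpp>
#include <list>

namespace mpi = boost::mpi;

int main(int argc, char* argv[])
{
  mpi::environment env(argc, argv);
  mpi::communicator world;

  int i = 42;               // int has an associated MPI datatype
  std::list<int> l(10, 42); // a linked list has none

  mpi::broadcast(world, i, 0); // should map to a plain MPI_Bcast
  mpi::broadcast(world, l, 0); // ends up in the overload quoted above?

  return 0;
}

I would expect the first call to use MPI_Bcast directly and only the second one to fall back to the root-sends-to-everyone scheme.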
OK, I see your concern. This is actually only used when no MPI_Datatype can be constructed, that is, when no MPI_Datatype is possible, such as for a linked list, and when you do not use the skeleton&content mechanism either. Since this part of the code was written by Doug Gregor, I ask him to correct me if I say something wrong now or if I miss something.

When no MPI datatype exists, we need to pack the object into a buffer using MPI_Pack, and the buffer needs to be broadcast. So far we all seem to agree. The problem now is that the receiving side needs to know the size of the buffer in order to allocate enough memory, but there is no MPI_Probe for collectives that could be used to inquire about the message size. I believe that this was the reason for implementing the broadcast as a sequence of nonblocking sends and receives (Doug?).

Thinking about it, I realize that one could instead do two consecutive broadcasts: one to send the size of the buffer and another one to send the buffer itself (see the sketch in the PS below). This would definitely be faster on machines with special hardware for collectives. On Beowulf clusters, on the other hand, the current version is faster, since most MPI implementations just perform the broadcast as a sequence of N-1 send/receive operations from the root instead of optimizing it.

Matthias
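PS: A rough sketch of the two-broadcast idea, in plain MPI rather than the actual Boost.MPI code; the serialization step is glossed over and the packed data is treated as an opaque buffer that already exists on the root:

#include <mpi.h>
#include <vector>

void broadcast_packed(MPI_Comm comm, std::vector<char>& buffer, int root)
{
  int rank;
  MPI_Comm_rank(comm, &rank);

  // First broadcast: the size of the packed buffer, so that the
  // non-root ranks know how much memory to allocate.
  unsigned long size = (rank == root)
    ? static_cast<unsigned long>(buffer.size()) : 0;
  MPI_Bcast(&size, 1, MPI_UNSIGNED_LONG, root, comm);

  if (rank != root)
    buffer.resize(size);

  // Second broadcast: the packed bytes themselves.
  if (size > 0)
    MPI_Bcast(&buffer[0], static_cast<int>(size), MPI_PACKED, root, comm);
}

On hardware with a native broadcast both MPI_Bcast calls would benefit, whereas the current point-to-point scheme cannot.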