[mpi] Automatically Cascading/Nested Reduce?

Hi Guys,

I'm writing quite a number of parallel-distributed applications in my day job, and I'm missing a facility that allows me to "automagically" nest my reductions. The reduction strategy I'm looking for is a networked approach where (if you can imagine a tree):

0
|-1
| |-2
| |-3
|-4
  |-5
  |-6

Nodes 2 and 3 send their data to node 1, nodes 5 and 6 send their data to node 4, and eventually nodes 1 and 4 send their (reduced) data to node 0. My idea is that it should be simple to implement this without having to go through too much trouble with communicators -- or that it could be done automagically by hiding the communicator splits inside the special reduce implementation.

Would something like this be best implemented within Boost.MPI? Or is there a way of implementing it by composing the functionality already available in Boost.MPI?

The reason I ask is that the reduction step sometimes dominates the application's run time, especially when you have quite a number of nodes (around 90ish). Being able to parallelize (or nest) the reduction would definitely help, but the cost of supporting that routine across a number of applications seems to warrant a special implementation of the Boost.MPI reduce.

TIA

--
Dean Michael Berris | Software Engineer, Friendster, Inc.
blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com
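To make the idea concrete, here is a rough sketch of the two-stage composition described above, built only from what Boost.MPI already provides (communicator::split plus reduce). The group size of three, the int payload, and std::plus<int> are placeholder assumptions for illustration, not anything the library chooses for you:

    // hierarchical_reduce.cpp -- rough sketch only; group size, payload type
    // and the sum operation are placeholders for the real application data.
    #include <boost/mpi.hpp>
    #include <functional>
    #include <iostream>

    namespace mpi = boost::mpi;

    int main(int argc, char* argv[])
    {
        mpi::environment env(argc, argv);
        mpi::communicator world;

        const int group_size = 3;             // fan-in of each intermediate node
        const int my_value   = world.rank();  // stand-in for the real local data

        // Stage 1: reduce inside each group of `group_size` consecutive ranks.
        // split() keeps the original rank order, so the lowest world rank in
        // each group becomes rank 0 (the "leader") of its group communicator.
        mpi::communicator group = world.split(world.rank() / group_size);

        int partial = 0;
        if (group.rank() == 0)
            mpi::reduce(group, my_value, partial, std::plus<int>(), 0);
        else
            mpi::reduce(group, my_value, std::plus<int>(), 0);

        // Stage 2: reduce the partial results across the leaders only.  Every
        // rank has to call split(); non-leaders simply never use the
        // communicator they get back.
        mpi::communicator leaders = world.split(group.rank() == 0 ? 0 : 1);

        if (group.rank() == 0) {
            if (world.rank() == 0) {
                int total = 0;
                mpi::reduce(leaders, partial, total, std::plus<int>(), 0);
                std::cout << "total = " << total << std::endl;
            } else {
                mpi::reduce(leaders, partial, std::plus<int>(), 0);
            }
        }
        return 0;
    }

Whether a two-stage composition like this actually beats a flat reduce depends on how the underlying MPI implementation schedules its own collectives, which is exactly the question that comes up below.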

Dean Michael Berris wrote:
Hi Guys,
I'm writing quite a number of parallel-distributed applications in my day job, and I'm missing a facility that allows me to "automagically" nest my reductions. The reduction strategy I'm looking for is a networked approach where (if you can imagine a tree):
0
|-1
| |-2
| |-3
|-4
  |-5
  |-6
Nodes 2 and 3 send their data to node 1, nodes 5 and 6 send their data to node 4, and eventually nodes 1 and 4 send their (reduced) data to node 0. My idea is that it should be simple to implement this without having to go through too much trouble with communicators -- or that it could be done automagically by hiding the communicator splits inside the special reduce implementation.
Would something like this be best implemented within Boost.MPI? Or is there a way of implementing this by composing the functionality already available in Boost.MPI?
The reason I ask is that the reduction step sometimes dominates the application's run time, especially when you have quite a number of nodes (around 90ish). Being able to parallelize (or nest) the reduction would definitely help, but the cost of supporting that routine across a number of applications seems to warrant a special implementation of the Boost.MPI reduce.
TIA
Dean,

What MPI implementation are you using? I ask because I believe that at least some of them already use optimizations of this sort when implementing reduce. That is really the more appropriate place for a communications scheme of this sort. Boost.MPI just uses the facilities for this that are provided by MPI itself. In principle, the MPI implementation should do the reduce in as efficient a way as it can. What does yours currently do?

John
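As I understand it, for a built-in type the Boost.MPI call delegates to the underlying MPI collective, so the communication schedule is chosen entirely inside the MPI library. A minimal raw-MPI sketch of roughly what boost::mpi::reduce(world, value, sum, std::plus<int>(), 0) boils down to (the variable names and the sum operation are just for illustration):

    // raw_reduce.cpp -- conceptual sketch, not the actual Boost.MPI dispatch.
    #include <mpi.h>
    #include <iostream>

    int main(int argc, char* argv[])
    {
        MPI_Init(&argc, &argv);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int value = rank;   // the local contribution
        int sum   = 0;      // only meaningful on the root

        // Any tree, pipeline, or segmented strategy is selected inside this
        // single call by the MPI library itself.
        MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            std::cout << "sum = " << sum << std::endl;

        MPI_Finalize();
        return 0;
    }

In Open MPI's case the choice of reduce algorithm lives in its collectives components (the "tuned" component in particular), which is why tuning or upgrading the MPI implementation is usually the first lever to pull.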

Hi John,

On Tue, Apr 21, 2009 at 7:46 AM, John Phillips <phillips@mps.ohio-state.edu> wrote:
What MPI implementation are you using?
I'm using OpenMPI version 1.2.6 on Gentoo Linux.
I ask because I believe that at least some of them already use optimizations of this sort when implementing reduce.
Oh, that's good to know.
That is really more the right place for a communications scheme of this sort.
You may be right, I haven't looked at it that way. :-)
Boost.MPI just uses the facilities for this that are provided by MPI itself. In principle, the MPI implementation should do the reduce in as efficient a way as it can. What does yours currently do?
Right. I have noticed that when you use the 'world' communicator, the reduce spends more time on the root node than the communication does. This led me to think that the cost of reducing the data coming from the other nodes dominates the whole execution time -- if the reductions happened in a hierarchical fashion, I'm thinking this would limit the sequential part of the computation and leverage the available parallelism in the system.

I haven't looked at whether OpenMPI does what I'm looking for; that's why I was asking if it might be worth implementing at the Boost.MPI layer. Although it would be nice if OpenMPI did it for me in the first place. ;-)
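One quick way to check that suspicion, by the way, is to time the reduce call on every rank and compare the root against the rest. A minimal sketch using boost::mpi::timer; the double payload and std::plus stand in for the real data and operation:

    // timing_sketch.cpp -- rough per-rank timing of a single reduce call.
    #include <boost/mpi.hpp>
    #include <boost/mpi/timer.hpp>
    #include <functional>
    #include <iostream>

    namespace mpi = boost::mpi;

    int main(int argc, char* argv[])
    {
        mpi::environment env(argc, argv);
        mpi::communicator world;

        double value  = world.rank() * 1.0;  // stand-in for the real local result
        double result = 0.0;

        world.barrier();                     // line everyone up before timing
        mpi::timer t;                        // wraps MPI_Wtime()

        if (world.rank() == 0)
            mpi::reduce(world, value, result, std::plus<double>(), 0);
        else
            mpi::reduce(world, value, std::plus<double>(), 0);

        // If the root's time is much larger than everyone else's, the combining
        // step on the root (not the communication) is the likely bottleneck.
        std::cout << "rank " << world.rank() << ": reduce took "
                  << t.elapsed() << " s" << std::endl;
        return 0;
    }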

Dean Michael Berris wrote:
Hi John,
On Tue, Apr 21, 2009 at 7:46 AM, John Phillips <phillips@mps.ohio-state.edu> wrote:
What MPI implementation are you using?
I'm using OpenMPI version 1.2.6 on Gentoo Linux.
...
I haven't looked at whether OpenMPI does what I'm looking for; that's why I was asking if it might be worth implementing at the Boost.MPI layer. Although it would be nice if OpenMPI did it for me in the first place. ;-)
Their current version is 1.3.1, and I know the transition to 1.3 included a number of performance optimizations in the code. They also have a fairly active and quite responsive user list, with good involvement from the developers. So, as a first step, I'd suggest checking with them.

I don't think Doug's intent when he submitted Boost.MPI was to have it evolve into an MPI implementation, so I don't know that he would want to add something like this. (He is, obviously, the one to ask to be sure.)

John