
Hello,

I have a problem creating an mpi::isend call. I have a thread loop like this (pseudo-code):

    while (thread_is_running) {
        std::size_t id = 0;
        if (!mpicom.rank()) {
            try {
                id = getID();
                mpicom.isend(id);
            } catch (...) {
            }
        } else {
            mpicom.ireceive(id);
        }
        if (id > 0) {
            mpi::barrier();
            // do something with id
        }
        boost::thread::yield();
    }

I would like to create a non-blocking communication over all nodes: my node with rank 0 checks whether a dataset exists, and if it does, the id should be sent to all other hosts. After all hosts have received this id, they should start the calculation. The nodes must be synchronized before the "working part" is started, but if no data is sent to a host, it should do nothing.

I don't know how to use isend to send a message to all hosts. I only know the mpi::any_source flag, but I cannot find an any_destination. IMHO the receive call must be written with iprobe, like:

    if (boost::optional<mpi::status> l_status = mpicom.iprobe(0, myTag)) {
        std::size_t id = 0;
        mpicom.recv(l_status->source(), l_status->tag(), id);
    }

Is there a way to create a non-blocking group communication? How can I send a message to all nodes if data exists, and receive it on each node, also with non-blocking communication?

Thanks for your help
Phil

On Thu, Dec 27, 2012 at 3:15 PM, Philipp Kraus wrote:
Is there a way to create a non-blocking group communication? How can I send a message to all nodes if data exists, and receive it on each node, also with non-blocking communication?
I'm afraid there's no non-blocking collective communication primitive in MPI 2. Hence, no such feature in Boost.MPI.

The "broadcast" collective apparently is the closest match to your requirements, except that it's blocking:
http://www.boost.org/doc/libs/1_52_0/doc/html/boost/mpi/broadcast.html
Use of `broadcast` would imply that your code sends an id value even when there is nothing to send.

Another option would be to write your own "ibroadcast" on top of isend/irecv, e.g., rank n sends the id value to ranks 2n and 2n+1.

Cheers,
Riccardo
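P.S. In case it helps, here is a minimal, untested sketch of such a hand-rolled tree broadcast with Boost.MPI; the function name tree_broadcast and the std::size_t payload are only placeholders. Rank 0 feeds rank 1, every other rank r receives from its parent r/2 and forwards to its children 2r and 2r+1; a forwarding rank has to complete its receive before it can pass the value on, so only the outgoing sends stay non-blocking.

    #include <vector>
    #include <boost/mpi.hpp>

    namespace mpi = boost::mpi;

    // Tree broadcast built from point-to-point calls. The pending isend
    // requests are returned so the caller can complete them later with
    // mpi::wait_all() (or poll them with test()).
    std::vector<mpi::request>
    tree_broadcast(mpi::communicator& comm, std::size_t& value, int tag)
    {
        const int rank = comm.rank();
        const int size = comm.size();

        if (rank > 0) {
            // The value must be here before it can be forwarded, so the
            // receive from the parent (rank / 2) is completed right away.
            comm.recv(rank / 2, tag, value);
        }

        // Children of the root are {1}; children of rank r > 0 are {2r, 2r+1}.
        const int first_child = (rank == 0) ? 1 : 2 * rank;
        const int last_child  = (rank == 0) ? 1 : 2 * rank + 1;

        std::vector<mpi::request> pending;
        for (int child = first_child; child <= last_child; ++child) {
            if (child < size) {
                pending.push_back(comm.isend(child, tag, value));
            }
        }
        return pending;
    }

The caller would drain the returned requests with mpi::wait_all(pending.begin(), pending.end()) once the iteration's work is done, and has to keep `value` alive until those requests have completed.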

Hi,

since you mention both threads and MPI, I might add that the threading support of most MPI implementations contains some caveats, to say the least. It is certainly possible to emulate asynchronous collectives either with isend/irecv (although that's a really tough job given the level of optimization found in any major MPI implementation) or by just using (possibly empty) synchronous broadcasts (as Riccardo suggested). But to get a clearer picture I'd like to ask for some details:

- Which MPI are you using?
- How many MPI processes do your jobs contain?
- Which threading level do you request via MPI_Init_thread()?
- How do you ensure asynchronous progress? (I.e.: mostly, MPI will only send/receive data while some MPI function is being called. Unless e.g. MPI_Test is being polled or your MPI supports progress threads, the bulk of communication won't be carried out until you call MPI_Wait().)

Cheers
-Andreas
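P.S. To illustrate the last point with Boost.MPI (just a sketch with placeholder names, not your actual code): keep the request object around and poll it with test(); every test() call also gives MPI a chance to make progress.

    #include <cstddef>
    #include <boost/mpi.hpp>
    #include <boost/thread.hpp>
    #include <boost/optional.hpp>

    namespace mpi = boost::mpi;

    // Poll a pending receive instead of waiting on it.
    void poll_for_id(mpi::communicator& comm, volatile bool& thread_is_running)
    {
        const int tag = 666;                        // arbitrary example tag
        std::size_t id = 0;
        mpi::request req = comm.irecv(0, tag, id);  // post once, outside the loop

        while (thread_is_running) {
            // ... database work of this iteration goes here ...

            if (boost::optional<mpi::status> st = req.test()) {
                // the receive has completed, so 'id' is valid now
                // ... start the calculation with 'id' ...
                req = comm.irecv(0, tag, id);       // re-arm for the next message
            }
            boost::thread::yield();
        }
        req.cancel();                               // drop the last pending receive
    }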
--
==========================================================
Andreas Schäfer
HPC and Grid Computing
Chair of Computer Science 3
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-27910
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==========================================================

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your signature to help him gain world domination!

On 29.12.2012 at 23:32, Andreas Schäfer wrote:
Hi,
since you mention both threads and MPI, I might add that the threading support of most MPI implementations contains some caveats, to say the least. It is certainly possible to emulate asynchronous collectives either with isend/irecv (although that's a really tough job given the level of optimization found in any major MPI implementation) or by just using (possibly empty) synchronous broadcasts (as Riccardo suggested). But to get a clearer picture I'd like to ask for some details:
- Which MPI are you using?
At the moment I use Open MPI (but it should also work under MS Windows).
- How many MPI processes do your jobs contain?
The system has 64 cores (each core can run 2 threads).
- Which threading level do you request via MPI_Init_thread()?
For my tests I use MPI_THREAD_SERIALIZED.
- How do you ensure asynchronous progress? (I.e.: mostly, MPI will only send/receive data while some MPI function is being called. Unless e.g. MPI_Test is being polled or your MPI supports progress threads, the bulk of communication won't be carried out until you call MPI_Wait().)
Each MPI process runs some database calls in the thread loop, so after each database block I check whether there is a message from MPI rank 0; if there is, all ranks should hit the barrier. Rank 0 checks after its database calls whether there is any data and, if so, sends this data to the other ranks. So I cannot use an MPI_Wait call, because that creates a blocking communication.

Thanks
Phil

On 00:02 Sun 30 Dec, Philipp Kraus wrote:
At the moment I use Open MPI (but it should also work under MS Windows).
Did you enable progress threads in Open MPI?
The system has 64 cores (each core can run 2 threads).
Are we talking about a single machine with 64 cores, or a small cluster?
- Which threading level do you request via MPI_Init_thread()?
For my tests I use MPI_THREAD_SERIALIZED.
In your code fragment you called boost::thread::yield(). Do you ensure that no concurrent calls to MPI take place? Otherwise I'd assume that this doesn't comply with the MPI standard.
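(As an illustration of what I mean, and only as a sketch with made-up names, not your actual code: with MPI_THREAD_SERIALIZED every MPI call from every thread has to be funnelled through one lock, for example:)

    #include <boost/mpi.hpp>
    #include <boost/thread/mutex.hpp>
    #include <boost/optional.hpp>

    namespace mpi = boost::mpi;

    // MPI_THREAD_SERIALIZED allows several threads to call MPI, but never
    // at the same time, so every MPI call is guarded by the same mutex.
    boost::mutex g_mpi_mutex;

    bool message_waiting(mpi::communicator& comm, int source, int tag)
    {
        boost::mutex::scoped_lock lock(g_mpi_mutex);
        boost::optional<mpi::status> status = comm.iprobe(source, tag);
        return status ? true : false;
    }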
- How do you ensure asynchronous progress? (I.e.: mostly, MPI will only send/receive data while some MPI function is being called. Unless e.g. MPI_Test is being polled or your MPI supports progress threads, the bulk of communication won't be carried out until you call MPI_Wait().)
Each MPI process runs some database calls in the thread loop, so after each database block I check whether there is a message from MPI rank 0; if there is, all ranks should hit the barrier. Rank 0 checks after its database calls whether there is any data and, if so, sends this data to the other ranks. So I cannot use an MPI_Wait call, because that creates a blocking communication.
From what I understood so far, I fear that you have to change your architecture. MPI needs cycles to make progress, and this can only be achieved by calling it. Also, your code fragment suggests that you repeatedly call MPI_Irecv() and MPI_Isend() without ever waiting for completion. This results in a memory leak, as new handles are created within MPI for each communication.

As your jobs are reasonably small and waiting for MPI 3 seems to be no option, I'd suggest you use MPI_Iprobe(), similar to the following code. Notice that MPI_Send/Recv are only called when necessary and their blocking nature creates an implicit barrier. The fragment uses a hard-coded binary tree to implement a reasonably fast/scalable broadcast (even though Open MPI's MPI_Bcast() would be much faster).

    while (thread_is_running) {
        int id = 0;

        if (rank == 0) {
            try {
                id = getID();
                MPI_Send(&id, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } catch (...) {
            }
        } else {
            // 'predecessor' is the parent in the binary tree, i.e. rank / 2
            int flag;
            MPI_Iprobe(predecessor, 0, MPI_COMM_WORLD, &flag, MPI_STATUS_IGNORE);

            if (flag) {
                MPI_Recv(&id, 1, MPI_INT, predecessor, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if ((2 * rank + 0) < size) {
                    MPI_Send(&id, 1, MPI_INT, 2 * rank + 0, 0, MPI_COMM_WORLD);
                }
                if ((2 * rank + 1) < size) {
                    MPI_Send(&id, 1, MPI_INT, 2 * rank + 1, 0, MPI_COMM_WORLD);
                }
                do_something(id);
            }
        }

        // no yield here, but you may call some worker function here
    }

HTH
-Andreas

On 30.12.2012 at 22:37, Andreas Schäfer wrote:
On 00:02 Sun 30 Dec, Philipp Kraus wrote:
At the moment I use Open MPI (but it should also work under MS Windows).
Did you enable progress threads in Open MPI?
Yes, I built the Open MPI library myself.
The system has 64 cores (each core can run 2 threads).
Are we talking about a single machine with 64 cores, or a small cluster?
a small cluster system with 8 nodes
- Which threading level do you request via MPI_Init_thread()?
For my tests I use MPI_THREAD_SERIALIZED.
In your code fragment you called boost::thread::yield(). Do you ensure that no concurrent calls to MPI take place? Otherwise I'd assume that this doesn't comply with the MPI standard.
- How do you ensure asynchronous progress? (I.e.: mostly, MPI will only send/receive data while some MPI function is being called. Unless e.g. MPI_Test is being polled or your MPI supports progress threads, the bulk of communication won't be carried out until you call MPI_Wait().)
Each MPI process runs some database calls in the thread loop, so after each database block I check whether there is a message from MPI rank 0; if there is, all ranks should hit the barrier. Rank 0 checks after its database calls whether there is any data and, if so, sends this data to the other ranks. So I cannot use an MPI_Wait call, because that creates a blocking communication.
From what I understood so far I fear that you have to change your architecture. MPI needs cycles to make progress. This can only be achieved by calling it. Also, your code fragment suggests that you repetitively call MPI_Irecv() and MPI_Isend() without ever waiting for completion. This results in a memory leak as new handles for each communication will be created within MPI.
Sorry, I forgot a key piece of information: these calls are a pre-execution phase of the algorithm. The main algorithm runs in cycles and uses blocking MPI communication, so only the pre-execution phase has to be loosely synchronized.
As your jobs are reasonably small and waiting for MPI 3 seems to be no option, I'd suggest you use MPI_Iprobe(), similar to the following code. Notice that MPI_Send/Recv are only called when necessary and their blocking nature creates an implicit barrier. The fragment uses a hard-coded binary tree to implement a reasonably fast/scalable broadcast (even though Open MPI's MPI_Bcast() would be much faster).

    while (thread_is_running) {
        int id = 0;

        if (rank == 0) {
            try {
                id = getID();
                MPI_Send(&id, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } catch (...) {
            }
        } else {
            int flag;
            MPI_Iprobe(predecessor, 0, MPI_COMM_WORLD, &flag, MPI_STATUS_IGNORE);

            if (flag) {
                MPI_Recv(&id, 1, MPI_INT, predecessor, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if ((2 * rank + 0) < size) {
                    MPI_Send(&id, 1, MPI_INT, 2 * rank + 0, 0, MPI_COMM_WORLD);
                }
                if ((2 * rank + 1) < size) {
                    MPI_Send(&id, 1, MPI_INT, 2 * rank + 1, 0, MPI_COMM_WORLD);
                }
                do_something(id);
            }
        }

        // no yield here, but you may call some worker function here
    }
Thanks for the code, but it is based on Open MPI. My program must also work with MPICH2 (an MPI implementation for Windows-based systems), so I would like to create a Boost-only solution. At the moment I do it like this:

    while (thread_is_running) {
        if (!l_mpicom.rank()) {
            for (int i = 1; i < l_mpicom.size(); ++i)
                l_mpicom.isend(i, 666, l_task.getID());
        } else if (boost::optional<mpi::status> l_status = l_mpicom.iprobe(0, 666)) {
            std::size_t l_taskid = 0;
            l_mpicom.recv(l_status->source(), l_status->tag(), l_taskid);
        }
    }

Thanks
Phil
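P.S. One remaining issue in the fragment above is that the isend handles are never completed (the leak Andreas pointed out). A rough sketch of the same loop body that keeps the requests and completes them could look like this (announce_task is only an illustrative name; the tag 666 and the std::size_t id are taken from the code above):

    #include <vector>
    #include <boost/mpi.hpp>
    #include <boost/optional.hpp>

    namespace mpi = boost::mpi;

    void announce_task(mpi::communicator& comm, std::size_t taskid)
    {
        const int tag = 666;

        if (comm.rank() == 0) {
            // post one isend per worker and keep the requests ...
            std::vector<mpi::request> pending;
            for (int i = 1; i < comm.size(); ++i)
                pending.push_back(comm.isend(i, tag, taskid));
            // ... then complete them so that no handles are leaked
            // (for a single integer the sends normally finish immediately)
            mpi::wait_all(pending.begin(), pending.end());
        } else if (boost::optional<mpi::status> st = comm.iprobe(0, tag)) {
            std::size_t received = 0;
            comm.recv(st->source(), st->tag(), received);
            // start the calculation with 'received' here
        }
    }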
participants (3): Andreas Schäfer, Philipp Kraus, Riccardo Murri