bjam hangs on select (in develop branch)

Hi,

I am trying to test Boost.MPI with Intel's implementation, and I am stuck while trying to run simple tests through bjam. bjam hangs on the select (not pselect?) call of the Unix exec_wait. As far as processes are concerned:

  PID   USER   PR NI S %CPU TIME+   PPID  COMMAND
  .......................
  16882 alainm 20 0  S 0.0  0:01.61 6507  bjam
  16899 alainm 20 0  T 0.0  0:00.00 16882 sh
  16900 alainm 20 0  Z 0.0  0:00.00 16899 mpiexec.hydra <defunct>
  .......

bjam calls a generated shell script (below), which calls mpiexec.hydra, and that works perfectly fine outside bjam. The mpiexec.hydra dies, but the shell refuses to let it go.

The shell script, generated by bjam, is:

===============================================
[alainm@gurney engine]$ more /proc/16899/cmdline
/bin/sh

LD_LIBRARY_PATH="/gpfs/scratch/alainm/view/boost/bin.v2/libs/mpi/build/intel-linux/debug:/gpfs/scratch/alainm/view/boost/bin.v2/libs/serialization/build/intel-linux/debug:/softs/intel/composer_xe_2015.0.090/bin/lib:/softs/intel/composer_xe_2015.0.090/lib/intel64:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH

status=0
if test $status -ne 0 ; then
    echo Skipping test execution due to testing.execute=off
    exit 0
fi
mpiexec.hydra -n 2 "../../../bin.v2/libs/mpi/test/broadcast_stl_test-2.test/intel-linux/debug/broadcast_stl_test-2" blob > "../../../bin.v2/libs/mpi/test/broadcast_stl_test-2.test/intel-linux/debug/broadcast_stl_test-2-run.output" 2>&1
status=$?
echo >> "../../../bin.v2/libs/mpi/test/broadcast_stl_test-2.test/intel-linux/debug/broadcast_stl_test-2-run.output"
echo EXIT STATUS: $status >> "../../../bin.v2/libs/mpi/test/broadcast_stl_test-2.test/intel-linux/debug/broadcast_stl_test-2-run.output"
if test $status -eq 0 ; then
    cp "../../../bin.v2/libs/mpi/test/broadcast_stl_test-2.test/intel-linux/debug/broadcast_stl_test-2-run.output" "../../../bin.v2/libs/mpi/test/broadcast_stl_test-2.test/intel-linux/debug/broadcast_stl_test-2-run"
fi
verbose=0
if test $status -ne 0 ; then
    verbose=1
fi
if test $verbose -eq 1 ; then
    echo ====== BEGIN OUTPUT ======
    cat "../../../bin.v2/libs/mpi/test/broadcast_stl_test-2.test/intel-linux/debug/broadcast_stl_test-2-run.output"
    echo ====== END OUTPUT ======
fi
exit $status
[alainm@gurney engine]$
=================================================

Note that the select only tests for the subprocess output; at the hanging point mpiexec.hydra is done with its output.

Any idea?

Alain

PS: there was a cmake-based project some time ago; is it still active, or is bjam here to stay?
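For readers who have not looked at the build engine: the pattern described here is bjam forking /bin/sh with its stdout and stderr tied to a pipe, and exec_wait() then blocking in select() on the read end until the command produces output or closes the pipe. A minimal sketch of that pattern, with illustrative names only (this is not the actual execunix.c code):

    /* Sketch: fork a shell, capture its output through a pipe, and block in
     * select() until the pipe is readable or closed, i.e. the exec_wait
     * pattern described above.  Illustrative only. */
    #include <stdio.h>
    #include <sys/select.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int out[2];
        if (pipe(out) < 0) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {                      /* child: the generated "sh" */
            dup2(out[1], STDOUT_FILENO);
            dup2(out[1], STDERR_FILENO);
            close(out[0]); close(out[1]);
            execl("/bin/sh", "sh", "-c", "echo hello", (char *)NULL);
            _exit(127);
        }
        close(out[1]);

        for (;;) {                           /* parent: exec_wait-style loop */
            fd_set fds;
            FD_ZERO(&fds);
            FD_SET(out[0], &fds);
            if (select(out[0] + 1, &fds, NULL, NULL, NULL) < 0) break;
            char buf[4096];
            ssize_t n = read(out[0], buf, sizeof buf);
            if (n <= 0) break;               /* EOF: every writer closed the pipe */
            fwrite(buf, 1, (size_t)n, stdout);
        }
        int status;
        waitpid(pid, &status, 0);            /* reaped only after the pipe closes */
        return 0;
    }

This also shows why a stopped sh keeps bjam in select() forever: select() only returns when the pipe has data or all writers are gone, and a stopped child neither writes nor exits, so the read end never reaches EOF and the engine never gets to waitpid().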

Hi Alain,
I’ve seen this problem before but it appears to affect very few people so I’ve not needed to fix it. Perhaps the time has come to address it.
Was bjam passed a -j option? If so, what was it?
— Noel
On Oct 20, 2014, at 9:33 AM, Alain Miniussi
Hi,
I am trying to test Boost.MPI with Intel's implementation and I am stuck while trying to run simple tests through bjam. bjam hangs on the select (not pselect?) call of the Unix exec_wait. [...]

Hi Noel,

No, no -j option.

I tried the -p (since bjam is hanging in a select on output streams) with no effect. I don't know if that's relevant, but it seems that most calls to setpgid (including those on the sh process) set errno to 13 (permission problem). The select is waiting (without -p) on the stdout of the 'sh' process (with the redirected stderr). If I replace mpiexec.hydra (a binary) with mpirun (a wrapper around that binary), only mpiexec.hydra will be defunct.

  PID  USER   PR NI S %CPU TIME+   PPID COMMAND
  769  alainm 20 0  S 0.0  0:02.79 768  bjam
  1028 alainm 20 0  T 0.0  0:00.00 769  sh
  1029 alainm 20 0  T 0.0  0:00.00 1028 mpirun
  1034 alainm 20 0  Z 0.0  0:00.00 1029 mpiexec.hydra <defunct>

Alain

On 20/10/2014 19:10, Belcourt, Kenneth wrote:
Hi Alain,
I’ve seen this problem before but it appears to affect very few people so I’ve not needed to fix it. Perhaps the time has come to address it.
Was bjam passed a -j option? If so, what was it?
— Noel
[...]
--
Alain

On Oct 20, 2014, at 4:02 PM, Alain Miniussi
Hi Noel,
No, no -j option.
Interesting. I’ve usually seen bjam miss the subprocess termination signal when -j is around 64 or more. I’ve got an Intel MPI setup I can try to reproduce that zombie child with. This is code I added quite a few years ago, so I’ll have to dust off my bjam hat and track this down. Sorry about the hassle; it might take me a few days before I can debug this.
— Noel

On Oct 20, 2014, at 4:11 PM, Belcourt, Kenneth
On Oct 20, 2014, at 4:02 PM, Alain Miniussi
wrote: No, no -j option.
Interesting. I’ve usually seen bjam miss the subprocess termination signal when -j is around 64 or more. I’ve got an Intel MPI setup I can try to reproduce that zombie child with.
Just pushed a fix to develop: commit 252b5aa019
Can you check if this fixes your issue?
— Noel

Sorry, the problem is still here:

  6817 alainm 20 0 S 0.0 0:01.74 4517 b2
  6870 alainm 20 0 T 0.0 0:00.00 6817 sh
  6871 alainm 20 0 T 0.0 0:00.00 6870 mpirun
  6876 alainm 20 0 Z 0.0 0:00.00 6871 mpiexec.hydra <defunct>

Bottom of b2 strace:

  lseek(4, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
  rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
  select(5, [4], NULL, NULL, NULL

On 21/10/2014 08:29, Belcourt, Kenneth wrote:
Just pushed a fix to develop:
commit 252b5aa019
Can you check if this fixes your issue?
— Noel
--
Alain

On Oct 21, 2014, at 6:35 AM, Alain Miniussi
Sorry, the problem is still here: [...]
bottom of b2 strace:
lseek(4, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
select(5, [4], NULL, NULL, NULL
Okay, that’s helpful. Let me try a couple of other things. Thanks Alain.
— Noel
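A side note on the select-versus-pselect question raised in the first message: the strace above shows SIGCHLD being blocked (rt_sigprocmask(SIG_BLOCK, [CHLD], ...)) right before the select() call. With plain select() there is a classic race: if the child changes state between unblocking the signal and entering select(), the wakeup can be lost. pselect() exists to close that window by swapping the signal mask atomically for the duration of the call. A hedged sketch of that pattern, not the code bjam actually uses:

    /* Sketch: wait for output on fd, but let SIGCHLD interrupt the wait
     * without a race, using pselect().  Illustrative only. */
    #include <errno.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/select.h>

    static volatile sig_atomic_t got_sigchld = 0;
    static void on_sigchld(int sig) { (void)sig; got_sigchld = 1; }

    int wait_for_output(int fd)
    {
        sigset_t block, orig;
        sigemptyset(&block);
        sigaddset(&block, SIGCHLD);
        sigprocmask(SIG_BLOCK, &block, &orig);  /* SIGCHLD blocked outside the wait */

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigchld;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGCHLD, &sa, NULL);

        for (;;) {
            fd_set fds;
            FD_ZERO(&fds);
            FD_SET(fd, &fds);
            /* pselect() installs 'orig' (SIGCHLD unblocked) atomically while it
             * sleeps, so a child changing state at any moment interrupts it. */
            int rc = pselect(fd + 1, &fds, NULL, NULL, NULL, &orig);
            if (rc > 0) return 0;                                   /* output ready */
            if (rc < 0 && errno == EINTR && got_sigchld) return 1;  /* go reap */
            if (rc < 0 && errno != EINTR) return -1;                /* real error */
        }
    }

That said, the hang described in this thread turns out to involve a stopped child rather than a missed exit notification, so this is only tangential to the eventual fix.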

Hi,

I don't know if it can help, but I attached a minimal (ok, let's say small) example that reproduces the problem without the MPI test or Boost code. It's basically a minimized ~100-LOC version of bjam. I sent it to Intel so they can investigate, since mpiexec.hydra might be part of the problem, although I think bjam should be able to deal with it, since the MPI test passes on the command line.

Alain

On 21/10/2014 15:56, Belcourt, Kenneth wrote:
[...]
Okay, that’s helpful. Let me try a couple of other things. Thanks Alain.
— Noel
--
Alain

Sorry, I probably screwed something up with the last test; please ignore the minimized code.

On 24/10/2014 12:08, Alain Miniussi wrote:
Hi,
I don't know if it can help, but I attached a minimal (ok, let's say small) example that reproduces the problem without the MPI test or Boost code. [...]
--
Alain

I did a gnu/openmpi 1.8.2 build on Ubuntu which exhibits the same problem.

Can the fact that the setpgid system calls fail be an issue? I notice they are among the few system calls whose return code is not tested (under gdb, I noticed they fail with errno 13, a permission issue).

Alain

On 21/10/2014 15:56, Belcourt, Kenneth wrote:
[...]
Okay, that’s helpful. Let me try a couple of other things. Thanks Alain.
— Noel
--
Alain

On 24/10/2014 15:33, Alain Miniussi wrote:
I did a gnu/openmpi 1.8.2 build on Ubuntu which exhibits the same problem.
It did not; I just forgot to edit a field in project-config.jam. Only the Intel mpiexec/mpirun hangs.
Can the fact that the setpgid system calls fail be an issue? I notice they are among the few system calls whose return code is not tested (under gdb, I noticed they fail with errno 13, a permission issue).
Alain
--
Alain

Hi Alain,
On Oct 24, 2014, at 7:56 AM, Alain Miniussi
On 24/10/2014 15:33, Alain Miniussi wrote:
I did a gnu/openmpi 1.8.2 build on Ubuntu which exhibits the same problem.
It did not; I just forgot to edit a field in project-config.jam. Only the Intel mpiexec/mpirun hangs.
Can the fact that the setpgid system calls fail be an issue?
Perhaps. We make the forked child process its own process group leader so that if it’s an MPI job and it dies, all the MPI ranks are cleaned up as well. We’ve been using this syntax for a number of years on multiple platforms without issues, so I’m a little surprised it fails on Ubuntu with OpenMPI 1.8.2. That said, it’s possible that there’s a race condition that you’re able to tickle.

For example, we fork the child process and, right before we exec the child process, we set the child process group. We also set the child process group in the parent process. Perhaps we should only do this once, not twice (i.e. only in the child or only in the parent, not both). Or perhaps there’s a race if the child and parent calls to setpgid run concurrently.

I’m still looking at this.

— Noel
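For reference, the pattern Noel describes (calling setpgid() for the child in both the parent and the child) is the standard way to close the fork/exec race, and it also explains the errno 13 Alain sees: once the child has exec'd, the parent's setpgid() on it fails with EACCES (13), which is expected and normally just ignored. A hedged sketch of that pattern, with illustrative names, not the actual execunix.c code:

    /* Sketch: put the spawned command into its own process group so the whole
     * MPI job can be signalled (and cleaned up) as a group.  Both sides call
     * setpgid() so it works whichever of parent/child runs first. */
    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    pid_t spawn_in_own_group(const char *cmd)
    {
        pid_t pid = fork();
        if (pid == 0) {
            /* child: become leader of a new process group */
            if (setpgid(0, 0) < 0)
                perror("setpgid (child)");
            execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
            _exit(127);
        }
        if (pid > 0) {
            /* parent: same thing; if the child already exec'd, this loses the
             * race and fails with EACCES (errno 13), which is harmless because
             * the child already set its own group. */
            if (setpgid(pid, pid) < 0 && errno != EACCES)
                perror("setpgid (parent)");
        }
        return pid;
    }

So the failing setpgid() calls are, by themselves, very likely benign.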

On Oct 24, 2014, at 4:33 PM, Belcourt, Kenneth
[...]
Just pushed this commit, 7bcbc5ac31ab1, to develop, which adds checks to the setpgid calls and, if they fail, indicates whether it was the parent or the child process that made the call. Can you give this a try and let me know which call is failing?
— Noel

On Oct 24, 2014, at 4:43 PM, Belcourt, Kenneth
[...]
Just pushed this commit, 7bcbc5ac31ab1, to develop, which adds checks to the setpgid calls and, if they fail, indicates whether it was the parent or the child process that made the call. Can you give this a try and let me know which call is failing?
Well I be danged. I was just testing this change on my Mac and found this in the output:

  setpgid (parent): Permission denied

So it seems we’ve been ignoring this problem for some time and didn’t know it. That would be my bad. Let me work on a fix (will probably remove the duplicate call in the parent process).
— Noel

On Oct 24, 2014, at 4:52 PM, Belcourt, Kenneth
[...]
So it seems we’ve been ignoring this problem for some time and didn’t know it. Let me work on a fix (will probably remove the duplicate call in the parent process).
I left both setpgid checks in, but removed the call to exit(), so we’ll see the failed call to setpgid without killing b2.

commit 156bc5c42ec3 in develop.

— Noel

On 25/10/2014 02:14, Belcourt, Kenneth wrote:
[...]
I left both setpgid checks in, but removed the call to exit(), so we’ll see the failed call to setpgid without killing b2.
commit 156bc5c42ec3 in develop.
Thanks.

So the mpiexec.hydra is still defunct, *but* I have something new. Let's say I am in the following situation:

  PID                                    PPID
  20104 alainm 20 0 S 0.0 0:13.33 17184 bjam
  20170 alainm 20 0 T 0.0 0:00.00 20104 sh
  20171 alainm 20 0 Z 0.0 0:00.00 20170 mpiexec.hydra <defunct>

  [alainm@gurney ~]$ pstree 20104
  bjam───sh───mpiexec.hydra
  [alainm@gurney ~]$

So mpiexec is dead, and the calling shell should take notice, but somehow doesn't. It just waits, but with no conviction:

  $ gdb /bin/sh 20170
  ................
  (gdb) bt
  #0  0x0000003bd92ac8ce in __libc_waitpid (pid=-1, stat_loc=0x7fff344888bc, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:32
  #1  0x000000000043ec82 in waitchld (wpid=<value optimized out>, block=1) at jobs.c:3064
  #2  0x000000000043ff1f in wait_for (pid=20171) at jobs.c:2422
  #3  0x00000000004309f9 in execute_command_internal (command=0x18beda0,

The interesting thing is that, if I just enter a <continue> command under gdb, then bjam magically proceeds up to the next mpiexec. Which gave me the idea to just run

  $ kill -CONT <shell id>

to see the next target proceed.

So my current theory is that mpiexec.hydra pauses its calling process by sending it a STOP signal (why would it do that? I have no clue) and then exits without sending a CONTINUE signal.

Maybe signaling the child from exec_cmd just before the select would be a solution, but it looks like a pretty ugly one...

Alain
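The stopped-parent state described above is easy to reproduce in isolation. A small hedged demonstration (illustrative only, unrelated to the Intel tools): a child stops its parent and then exits; the parent stays blocked in waitpid(), the child shows up as <defunct>, and everything resumes the moment someone sends the parent SIGCONT, exactly like the kill -CONT workaround:

    /* Sketch: reproduce <parent stopped in waitpid()> + <child defunct>. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            kill(getppid(), SIGSTOP);   /* stop the parent that waits on us... */
            _exit(0);                   /* ...then die; parent cannot reap us yet */
        }
        int status;
        /* While stopped, this process sits here and the child stays a zombie
         * (state Z).  An external `kill -CONT <this pid>` lets waitpid() finish. */
        waitpid(pid, &status, 0);
        printf("reaped child, status %d\n", WEXITSTATUS(status));
        return 0;
    }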
--
Alain

On 27/10/2014 16:32, Alain Miniussi wrote:
[...]
So my current theory is that mpiexec.hydra pauses its calling process by sending it a STOP signal (why would it do that? I have no clue) and then exits without sending a CONTINUE signal.
Maybe signaling the child from exec_cmd just before the select would be a solution, but it looks like a pretty ugly one...
Ok, I might have a fix (nothing to be proud of, though) that basically consists of inserting:

  for ( i = 0; i < globs.jobs; ++i )
  {
      if ( cmdtab[ i ].pid != 0 )
      {
          kill( cmdtab[ i ].pid, SIGCONT );
      }
  }

at the beginning of exec_wait. Maybe killpg would be better (didn't check), since I'm not sure that a simple kill will deal with mpirun (which adds a shell layer between bjam and mpiexec.hydra). I'll try to propose a pull request tonight.

Thanks!

Alain
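On the kill-versus-killpg question: since each command was made its own process-group leader with setpgid() earlier, signalling the group (killpg, or kill with a negative pid) also reaches the extra shell layer that mpirun inserts, not just the direct child. A hedged sketch of that variant, reusing the cmdtab/globs names from the snippet above as stand-ins (the real definitions live in the engine):

    #include <signal.h>
    #include <unistd.h>

    #define MAXJOBS 64
    /* Stand-ins for the engine's own bookkeeping, named after the snippet above. */
    static struct { pid_t pid; } cmdtab[MAXJOBS];
    static struct { int jobs; } globs = { MAXJOBS };

    /* Wake every still-registered child before waiting on it; killpg() sends
     * SIGCONT to the whole process group, so an intermediate mpirun shell and
     * anything it spawned are resumed as well. */
    static void wake_stopped_children(void)
    {
        int i;
        for (i = 0; i < globs.jobs; ++i)
            if (cmdtab[i].pid != 0)
                killpg(cmdtab[i].pid, SIGCONT);
    }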
--
Alain

On 27/10/2014 17:13, Alain Miniussi wrote:
[...]
Ok, I might have a fix (nothing to be proud of, though) that basically consists of inserting a SIGCONT loop at the beginning of exec_wait. Maybe killpg would be better (didn't check), since I'm not sure that a simple kill will deal with mpirun (which adds a shell layer between bjam and mpiexec.hydra). I'll try to propose a pull request tonight.
More complicated than it looked: the pause signal can be sent at any time, so the wake-up call probably needs to be interleaved with the select.
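One way to do that interleaving, sketched under the same assumptions as above (illustrative only, not the code of the actual pull request): replace the indefinitely blocking select() with a short timeout and re-send SIGCONT to the child's process group on every lap, so a STOP delivered at any point is undone shortly afterwards:

    #include <signal.h>
    #include <sys/select.h>
    #include <unistd.h>

    /* Wait until fd is readable (or closed), nudging the child's process
     * group with SIGCONT once per second in case it was stopped meanwhile. */
    static int wait_readable(int fd, pid_t child_pgid)
    {
        for (;;) {
            killpg(child_pgid, SIGCONT);      /* undo any STOP received so far */

            fd_set fds;
            FD_ZERO(&fds);
            FD_SET(fd, &fds);
            struct timeval tv = { 1, 0 };     /* re-check once a second */

            int rc = select(fd + 1, &fds, NULL, NULL, &tv);
            if (rc > 0) return 0;             /* output ready or pipe closed */
            if (rc < 0) return -1;            /* real error, errno is set */
            /* rc == 0: timeout, loop and nudge the child again */
        }
    }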
--
Alain

Pull request 47 (https://github.com/boostorg/build/pull/47) seems to fix the (well, my) problem.

On 25/10/2014 02:14, Belcourt, Kenneth wrote:
[...]
I left both setpgid checks in, but removed the call to exit(), so we’ll see the failed call to setpgid without killing b2.
commit 156bc5c42ec3 in develop.
— Noel
--
Alain
participants (2)
- Alain Miniussi
- Belcourt, Kenneth