On 27/10/2014 17:13, Alain Miniussi wrote:
On 27/10/2014 16:32, Alain Miniussi wrote:
On 25/10/2014 02:14, Belcourt, Kenneth wrote:
On Oct 24, 2014, at 4:52 PM, Belcourt, Kenneth wrote:
On Oct 24, 2014, at 4:43 PM, Belcourt, Kenneth wrote:
On Oct 24, 2014, at 4:33 PM, Belcourt, Kenneth wrote:
On Oct 24, 2014, at 7:56 AM, Alain Miniussi wrote:

> On 24/10/2014 15:33, Alain Miniussi wrote:
>> I did a gnu/openmpi 1.8.2 build on Ubuntu which exhibits the same problem.
> It did not; I had just forgotten to edit a field in project-config.jam. Only the Intel mpiexec/mpirun hangs.
>> Can the fact that the setpgid system call fails be an issue?

Perhaps. We make the forked child process its own process group leader so that, if it's an MPI job and it dies, all the MPI ranks are cleaned up as well. We've been using this approach for a number of years on multiple platforms without issue, so I'm a little surprised it fails on Ubuntu with OpenMPI 1.8.2. That said, it's possible there's a race condition that you're able to tickle.
For example, we fork the child process and, right before we exec it, we set the child's process group. We also set the child's process group in the parent process. Perhaps we should only do this once, not twice (i.e. only in the child or only in the parent, not both). Or perhaps there's a race if the child's and the parent's calls to setpgid run concurrently.

Just pushed this commit, 7bcbc5ac31ab1, to develop, which adds checks to the setpgid calls and, if they fail, indicates whether it was the parent or the child process that called. Can you give this a try and let me know which call is failing?

Well, I'll be danged. I was just testing this change on my Mac and found this in the output:
setpgid (parent): Permission denied
So it seems we've been ignoring this problem for some time without knowing it. That would be my bad. Let me work on a fix (probably removing the duplicate call in the parent process). I left both setpgid checks in, but removed the call to exit(), so we'll see the failed call to setpgid without killing b2.
commit 156bc5c42ec3 in develop.
Thanks,
So the mpiexec.hydra is still defunct, *but* I have something new. Let's say I am in the following situation:
  PID USER    PR NI S %CPU TIME+    PPID COMMAND
20104 alainm  20  0 S  0.0 0:13.33 17184 bjam
20170 alainm  20  0 T  0.0 0:00.00 20104 sh
20171 alainm  20  0 Z  0.0 0:00.00 20170 mpiexec.hydra <defunct>

[alainm@gurney ~]$ pstree 20104
bjam───sh───mpiexec.hydra
[alainm@gurney ~]$
So, mpiexec is dead; the calling shell should take notice, but somehow doesn't. It just waits, without much conviction:

$ gdb /bin/sh 20170
................
(gdb) bt
#0  0x0000003bd92ac8ce in __libc_waitpid (pid=-1, stat_loc=0x7fff344888bc, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:32
#1  0x000000000043ec82 in waitchld (wpid=<value optimized out>, block=1) at jobs.c:3064
#2  0x000000000043ff1f in wait_for (pid=20171) at jobs.c:2422
#3  0x00000000004309f9 in execute_command_internal (command=0x18beda0,
The interesting thing is that if I just enter a continue command under gdb, then bjam magically proceeds up to the next mpiexec.
Which gave me the idea to just run $ kill -CONT <shell pid> to see the next target proceed.
So my current theory is that mpiexec.hydra pauses its calling process by sending it a STOP signal (why would it do that? I have no clue) and then exits without sending a CONTINUE signal.
Maybe signaling the child from exec_cmd just before the select would be a solution, but it looks like a pretty ugly one...
Ok, I might have a fix (nothing to be proud of, though) that basically consists in inserting:

    for ( i = 0; i < globs.jobs; ++i )
    {
        if ( cmdtab[ i ].pid != 0 )
            kill( cmdtab[ i ].pid, SIGCONT );
    }

at the beginning of exec_wait. Maybe killpg would be better (I didn't check), since I'm not sure a plain kill will deal with mpirun (which adds a shell layer between bjam and mpiexec.hydra). I'll try to propose a pull request tonight.
More complicated than it looked: the pause signal can be sent at any time, so the wake-up call probably needs to be interleaved with the select.
Thanks!
Alain
— Noel
_______________________________________________
Boost-users mailing list
Boost-users@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users
--
Alain