What bothers me is that it doesn't segfault (with or without mpirun) when
built as a 'classical' executable with a main() function, but it crashes when
I run it as a Boost test without mpirun. I must admit I didn't know there is
no guarantee that this should work without mpirun, so maybe I'm complaining
about a problem that isn't one.
I ran the executable under gdb and, curiously enough, it terminated correctly
without reporting any problems. I also ran it under valgrind to check for
memory errors. The segfault happens when I call MPI_Finalize(), even though
the MPI environment has been initialized and not yet finalized at that point.
The output is below.
Martin
==11986== Command: ./utest-mpi
==11986==
Global fixture constructor:
==11986== Syscall param writev(vector[...]) points to uninitialised byte(s)
==11986== at 0x14F0D9E7: writev (in /usr/lib/libc-2.19.so)
==11986== by 0x1790BF72: mca_oob_tcp_msg_send_handler (oob_tcp_msg.c:249)
==11986== by 0x1790D0B3: mca_oob_tcp_peer_send (oob_tcp_peer.c:204)
==11986== by 0x179109BB: mca_oob_tcp_send_nb (oob_tcp_send.c:167)
==11986== by 0x172FDC5A: orte_rml_oob_send (rml_oob_send.c:136)
==11986== by 0x172FE228: orte_rml_oob_send_buffer (rml_oob_send.c:270)
==11986== by 0x17D1B7BF: modex (grpcomm_bad_module.c:573)
==11986== by 0x5577324: ompi_mpi_init (ompi_mpi_init.c:541)
==11986== by 0x558E7D2: PMPI_Init (pinit.c:84)
==11986== by 0x40E52B: boost::unit_test::ut_detail::global_fixture_impl<MPIFixture>::test_start(unsigned long) (utest-Poisson.cpp:509)
==11986== by 0x6A44763: boost::unit_test::ut_detail::callback0_impl_t::invoke() (in /usr/lib/libboost_unit_test_framework.so.1.55.0)
==11986== by 0x6A36175: boost::execution_monitor::catch_signals(boost::unit_test::callback0<int> const&) (in /usr/lib/libboost_unit_test_framework.so.1.55.0)
==11986== Address 0x164d2341 is 161 bytes inside a block of size 256 alloc'd
==11986== at 0x4C2AA3E: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11986== by 0x56060F7: opal_dss_buffer_extend (dss_internal_functions.c:63)
==11986== by 0x560650D: opal_dss_copy_payload (dss_load_unload.c:164)
==11986== by 0x55DACC2: orte_grpcomm_base_pack_modex_entries (grpcomm_base_modex.c:861)
==11986== by 0x17D1B6CE: modex (grpcomm_bad_module.c:563)
==11986== by 0x5577324: ompi_mpi_init (ompi_mpi_init.c:541)
==11986== by 0x558E7D2: PMPI_Init (pinit.c:84)
==11986== by 0x40E52B: boost::unit_test::ut_detail::global_fixture_impl<MPIFixture>::test_start(unsigned long) (utest-Poisson.cpp:509)
==11986== by 0x6A44763: boost::unit_test::ut_detail::callback0_impl_t::invoke() (in /usr/lib/libboost_unit_test_framework.so.1.55.0)
==11986== by 0x6A36175: boost::execution_monitor::catch_signals(boost::unit_test::callback0<int> const&) (in /usr/lib/libboost_unit_test_framework.so.1.55.0)
==11986== by 0x6A369B2: boost::execution_monitor::execute(boost::unit_test::callback0<int> const&) (in /usr/lib/libboost_unit_test_framework.so.1.55.0)
==11986== by 0x6A3FDB1: boost::unit_test::framework::run(unsigned long, bool) (in /usr/lib/libboost_unit_test_framework.so.1.55.0)
==11986==
MPI environment is initialized: 1
MPI environment is finalized: 0
Running 1 test case...
Running dummy test case
Global fixture destructor
MPI environment is initialized: 1
MPI environment is finalized: 0
==11986== Invalid write of size 8
==11986== at 0x6A358AC: ??? (in /usr/lib/libboost_unit_test_framework.so.1.55.0)
==11986== by 0x14E643FF: ??? (in /usr/lib/libc-2.19.so)
==11986== by 0x14F0D9E6: writev (in /usr/lib/libc-2.19.so)
==11986== by 0x1790BF72: mca_oob_tcp_msg_send_handler (oob_tcp_msg.c:249)
==11986== by 0x1790D0B3: mca_oob_tcp_peer_send (oob_tcp_peer.c:204)
==11986== by 0x179109BB: mca_oob_tcp_send_nb (oob_tcp_send.c:167)
==11986== by 0x172FDC5A: orte_rml_oob_send (rml_oob_send.c:136)
==11986== by 0x172FE228: orte_rml_oob_send_buffer (rml_oob_send.c:270)
==11986== by 0x55F6EEC: orte_routed_base_register_sync (routed_base_register_sync.c:86)
==11986== by 0x17B17276: finalize (routed_binomial.c:115)
==11986== by 0x55F64F7: orte_routed_base_close (routed_base_components.c:126)
==11986== by 0x55D6BB4: orte_ess_base_app_finalize (ess_base_std_app.c:265)
==11986== Address 0xa98 is not stack'd, malloc'd or (recently) free'd
==11986==
==11986==
==11986== Process terminating with default action of signal 11 (SIGSEGV)
==11986== Access not within mapped region at address 0xA98
==11986== at 0x6A358AC: ??? (in /usr/lib/libboost_unit_test_framework.so.1.55.0)
==11986== by 0x14E643FF: ??? (in /usr/lib/libc-2.19.so)
==11986== by 0x14F0D9E6: writev (in /usr/lib/libc-2.19.so)
==11986== by 0x1790BF72: mca_oob_tcp_msg_send_handler (oob_tcp_msg.c:249)
==11986== by 0x1790D0B3: mca_oob_tcp_peer_send (oob_tcp_peer.c:204)
==11986== by 0x179109BB: mca_oob_tcp_send_nb (oob_tcp_send.c:167)
==11986== by 0x172FDC5A: orte_rml_oob_send (rml_oob_send.c:136)
==11986== by 0x172FE228: orte_rml_oob_send_buffer (rml_oob_send.c:270)
==11986== by 0x55F6EEC: orte_routed_base_register_sync (routed_base_register_sync.c:86)
==11986== by 0x17B17276: finalize (routed_binomial.c:115)
==11986== by 0x55F64F7: orte_routed_base_close (routed_base_components.c:126)
==11986== by 0x55D6BB4: orte_ess_base_app_finalize (ess_base_std_app.c:265)
==11986== If you believe this happened as a result of a stack
==11986== overflow in your program's main thread (unlikely but
==11986== possible), you can try to increase the size of the
==11986== main thread stack using the --main-stacksize= flag.
==11986== The main thread stack size used in this run was 8388608.
==11986==
==11986== HEAP SUMMARY:
==11986== in use at exit: 530,391 bytes in 4,383 blocks
==11986== total heap usage: 8,298 allocs, 3,915 frees, 13,119,175 bytes allocated
==11986==
==11986== LEAK SUMMARY:
==11986== definitely lost: 5,064 bytes in 34 blocks
==11986== indirectly lost: 5,390 bytes in 22 blocks
==11986== possibly lost: 25,881 bytes in 584 blocks
==11986== still reachable: 494,056 bytes in 3,743 blocks
==11986== suppressed: 0 bytes in 0 blocks
==11986== Rerun with --leak-check=full to see details of leaked memory
==11986==
==11986== For counts of detected and suppressed errors, rerun with: -v
==11986== Use --track-origins=yes to see where uninitialised values come from
==11986== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 2 from 1)
Segmentation fault (core dumped)
On Friday 21 March 2014 08:50:25 Rhys Ulerich wrote:
it doesn't work for me.
To be sure I understand, because you were vague: you believe it
shouldn't segfault when the binary is executed without using mpirun?
Whether that's even possible depends on your MPI stack. There's zero
guarantee in the MPI standard, IIRC, that an MPI-based binary can be
executed without mpirun.
The latter case does not segfault for me on MPICH2 1.4.1p1, gcc 4.6.3,
Boost 1.5.1.
I suggest you attach a debugger and isolate the origin of the segfault.
- Rhys
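
Since running without mpirun is not guaranteed to work, one possible workaround is to let the test binary check whether it was started by a launcher before it touches MPI at all. The following is only a rough sketch under assumptions: the function name launched_by_mpi_launcher is made up for illustration, and the environment variable names reflect what Open MPI's mpirun and MPICH's Hydra launcher typically export. None of this is defined by the MPI standard, so other stacks may need different names.

```cpp
#include <cstdlib>

// Hypothetical helper: returns true if an environment variable that an MPI
// launcher typically exports is present in this process' environment.
// The variable names below are assumptions about specific implementations;
// the MPI standard itself does not define them.
bool launched_by_mpi_launcher()
{
    static const char* const launcher_vars[] = {
        "OMPI_COMM_WORLD_SIZE",  // assumed: exported by Open MPI's mpirun
        "PMI_RANK",              // assumed: exported by MPICH's Hydra launcher
        "PMI_SIZE"               // assumed: exported by MPICH's Hydra launcher
    };

    for (const char* name : launcher_vars) {
        if (std::getenv(name) != nullptr) {
            return true;  // at least one launcher variable is set
        }
    }
    return false;  // nothing found; probably started directly
}
```

A global fixture could call this in its constructor and skip MPI_Init() and MPI_Finalize() (or report a clear error) when it returns false, instead of relying on the direct launch happening to work.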