[MPI] error in hybrid OpenMP + Boost.MPI application
Hello,

I have an OpenMP+MPI application which crashes with an exception on some inputs. The backtrace shows that the error originates in the Boost.MPI code:

    terminate called after throwing an instance of 'boost::archive::archive_exception'
      what():  unregistered class
    ...
    [compute-0-7:23208] [ 0] /lib64/libpthread.so.0 [0x3110c0e4c0]
    [compute-0-7:23208] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x3110430215]
    [compute-0-7:23208] [ 2] /lib64/libc.so.6(abort+0x110) [0x3110431cc0]
    [compute-0-7:23208] [ 3] /usr/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x114) [0x31114bec44]
    [compute-0-7:23208] [ 4] /usr/lib64/libstdc++.so.6 [0x31114bcdb6]
    [compute-0-7:23208] [ 5] /usr/lib64/libstdc++.so.6 [0x31114bcde3]
    [compute-0-7:23208] [ 6] /usr/lib64/libstdc++.so.6 [0x31114bceca]
    [compute-0-7:23208] [ 7] /home/oci/murri/sw/lib/libboost_serialization.so.1.43.0(_ZN5boost7archive6detail19basic_iarchive_impl12load_pointerERNS1_14basic_iarchiveERPvPKNS1_25basic_pointer_iserializerEPFS9_RKNS_13serialization18extended_type_infoEE+0x23b) [0x2b3d485348ab]
    [compute-0-7:23208] [ 8] /home/oci/murri/rank/rank_wf_dbg(_ZN5boost7archive6detail17load_pointer_typeINS_3mpi15packed_iarchiveEE6invokeIPN9Waterfall9Processor9SparseRowEEEvRS4_RT_+0x49) [0x4d7989]
    [compute-0-7:23208] [ 9] /home/oci/murri/rank/rank_wf_dbg(_ZN5boost7archive4loadINS_3mpi15packed_iarchiveEPN9Waterfall9Processor9SparseRowEEEvRT_RT0_+0x22) [0x4d79f6]
    [compute-0-7:23208] [10] /home/oci/murri/rank/rank_wf_dbg(_ZN5boost7archive6detail15common_iarchiveINS_3mpi15packed_iarchiveEE13load_overrideIPN9Waterfall9Processor9SparseRowEEEvRT_i+0x28) [0x4d7a20]
    [compute-0-7:23208] [11] /home/oci/murri/rank/rank_wf_dbg(_ZN5boost7archive21basic_binary_iarchiveINS_3mpi15packed_iarchiveEE13load_overrideIPN9Waterfall9Processor9SparseRowEEEvRT_i+0x23) [0x4d7a45]
    [compute-0-7:23208] [12] /home/oci/murri/rank/rank_wf_dbg(_ZN5boost3mpi15packed_iarchive13load_overrideIPN9Waterfall9Processor9SparseRowEEEvRT_iN4mpl_5bool_ILb0EEE+0x23) [0x4d7a6b]
    [compute-0-7:23208] [13] /home/oci/murri/rank/rank_wf_dbg(_ZN5boost3mpi15packed_iarchive13load_overrideIPN9Waterfall9Processor9SparseRowEEEvRT_i+0x2a) [0x4d7a98]
    [compute-0-7:23208] [14] /home/oci/murri/rank/rank_wf_dbg(_ZN5boost7archive6detail18interface_iarchiveINS_3mpi15packed_iarchiveEErsIPN9Waterfall9Processor9SparseRowEEERS4_RT_+0x2a) [0x4d7ac4]
    [compute-0-7:23208] [15] /home/oci/murri/rank/rank_wf_dbg(_ZNK5boost3mpi12communicator9recv_implIPN9Waterfall9Processor9SparseRowEEENS0_6statusEiiRT_N4mpl_5bool_ILb0EEE+0x93) [0x4f41ef]
    [compute-0-7:23208] [16] /home/oci/murri/rank/rank_wf_dbg(_ZNK5boost3mpi12communicator4recvIPN9Waterfall9Processor9SparseRowEEENS0_6statusEiiRT_+0x3f) [0x4f427b]
    [compute-0-7:23208] [17] /home/oci/murri/rank/rank_wf_dbg(_ZN9Waterfall4rankEv+0x291) [0x4acf27]
    [compute-0-7:23208] [18] /home/oci/murri/rank/rank_wf_dbg(main+0x698) [0x4ad9b6]
    [compute-0-7:23208] [19] /lib64/libc.so.6(__libc_start_main+0xf4) [0x311041d974]
    [compute-0-7:23208] [20] /home/oci/murri/rank/rank_wf_dbg [0x4aa219]
    [compute-0-7:23208] *** End of error message ***

One MPI rank is started per compute node; all OpenMP threads may call mpi::isend(), but only one thread does mpi::iprobe()/mpi::recv().

Although the above error is in mpi::communicator::recv(), serializing the mpi::isend() calls apparently solves the issue; similarly, the program runs fine if I run it on one node only, and with some other (smaller) inputs it runs fine as well. This leads me to think that it is a thread-safety issue with the MPI part.
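To be explicit, by "serializing the mpi::isend() calls" I mean something like the following. This is only a minimal sketch, not the actual application code: SparseRow here is a stand-in for the real Waterfall::Processor::SparseRow (which is sent as a pointer in the real program), and the names send_row and mpi_send are made up for illustration.

    #include <boost/mpi.hpp>
    #include <boost/serialization/vector.hpp>
    #include <vector>

    namespace mpi = boost::mpi;

    // Stand-in for Waterfall::Processor::SparseRow, which lives in the
    // application and has its own serialize() member.
    struct SparseRow {
      std::vector<double> values;
      template<class Archive>
      void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & values;
      }
    };

    // All OpenMP threads may call this, but the named critical section
    // ensures only one of them is inside the MPI library at any time.
    mpi::request send_row(const mpi::communicator& world,
                          int dest, int tag, const SparseRow& row)
    {
      mpi::request req;
    #pragma omp critical (mpi_send)
      {
        req = world.isend(dest, tag, row);
      }
      return req;  // the caller still has to wait()/test() on the request
    }
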
I have checked that the MPI library (OpenMPI 1.4.2) is initialized with MPI_Init_thread() and that it provides the threading level MPI_THREAD_MULTIPLE.

So, the question: is there a (known) thread-safety issue with Boost.MPI, or should I definitely look somewhere else?

Thanks for any help!
Riccardo
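P.S. For completeness, the startup check is essentially the following (a minimal sketch using the plain MPI C API; the real code also sets up the boost::mpi::environment, omitted here):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv)
    {
      // Request full multithreading support and check what was granted.
      int provided = MPI_THREAD_SINGLE;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      if (provided < MPI_THREAD_MULTIPLE) {
        std::fprintf(stderr, "got thread support level %d, "
                     "need MPI_THREAD_MULTIPLE\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
      }
      // ... rest of the program (OpenMP regions, Boost.MPI calls) ...
      MPI_Finalize();
      return 0;
    }
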
Riccardo Murri wrote:
> One MPI rank is started per compute node; all OpenMP threads may call mpi::isend(), but only one thread does mpi::iprobe()/mpi::recv().
> Although the above error is in mpi::communicator::recv(), serializing the mpi::isend() calls apparently solves the issue; similarly, the program runs fine if I run it on one node only, and with some other (smaller) inputs it runs fine as well. This leads me to think that it is a thread-safety issue with the MPI part. I have checked that the MPI library (OpenMPI 1.4.2) is initialized with MPI_Init_thread() and that it provides the threading level MPI_THREAD_MULTIPLE.
> So, the question: is there a (known) thread-safety issue with Boost.MPI, or should I definitely look somewhere else?
As MPI+OpenMP users, we take great care not to let multiple OpenMP threads perform MPI operations, not because of Boost.MPI but because of the way MPI implementations handle things, which is not all that thread-safe.
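Concretely, we keep all MPI calls out of the parallel regions and funnel the communication through a single thread, something along these lines (a minimal sketch with the plain MPI C API, not our actual code; the reduction is just a stand-in for real work):

    #include <mpi.h>
    #include <omp.h>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Sketch of a "funneled" design: the OpenMP threads only compute into
    // per-thread slots; a single thread does all the MPI communication.
    int main(int argc, char** argv)
    {
      int provided = MPI_THREAD_SINGLE;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

      int rank = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      std::vector<double> partial(omp_get_max_threads(), 0.0);

    #pragma omp parallel
      {
        // Purely local work; no MPI calls inside the parallel region.
        partial[omp_get_thread_num()] = rank + 0.5 * omp_get_thread_num();
      }

      // Back on a single thread: this is the only place MPI is used.
      double local_sum = 0.0;
      for (std::size_t i = 0; i < partial.size(); ++i)
        local_sum += partial[i];

      double global_sum = 0.0;
      MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                    MPI_COMM_WORLD);

      if (rank == 0)
        std::printf("global sum = %g\n", global_sum);

      MPI_Finalize();
      return 0;
    }
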