Hello,

I tried to run your code, but it's still too big and complex for me to say anything without a long debugging session, which I cannot do now. So please take this email with a grain of salt, as I could have totally misunderstood the code...

I compiled the code you sent with Boost.MPI 1.45 and OpenMPI 1.4. Running it on two MPI ranks, I always get the same two errors:

* Rank 1 dies with the "archive_exception / array size too short" error, but
* Rank 0 dies with a segmentation fault.

I managed to get rid of the "archive_exception / array size too short" error from rank 1 (modified slbmpi.h attached), but rank 0 still segfaults.

(1) Concerning the "array size" error: your code reads:

    reqs.push_back( world.isend( Sender , SendTag , &Neighbor2Proc[ i ] , Msg2Send_size ) ) ;

[Aside: I think there's a typo here: the first argument to "world.isend" is the *destination rank*, so I guess you have "Sender" where "Receiver" should be...]

This sends Msg2Send_size elements of type "lattice_type" starting at location "&Neighbor2Proc[i]". However, the corresponding "world.irecv" has:

    reqs.push_back( world.irecv( msg.source() , msg.tag() , Neighbor4Proc[ i ] ) ) ;
    // , Msg2Send_size: not compiling: request irecv(int source, int tag, T * values, int n) const;
    // @ http://www.boost.org/doc/libs/1_48_0/doc/html/boost/mpi/communicator.html#id...

so you are receiving an array of "Msg2Send_size" elements into a single value of type "lattice_type". If you change the sender line to:

    reqs.push_back( world.isend( Receiver , SendTag , Neighbor2Proc[i] ) );

then the type of the sent object and the receiving slot match, and the error is gone. If you want to send more than one element of Neighbor2Proc, then you have to use an exactly corresponding type (pointer plus element count) in the recv call.

(2) Regarding the segfault: after adding some debug statements to slbmpi.h, I can see that it dies when executing "world.isend(..., Neighbor2Proc[i])":

    rmurri@xenia:~/tst$ mpirun -np 2 --tag-output ./a.out
    ...
    [1,0]<stdout>:Process=0's MiniGridSize= 3 3 3
    ...
    [1,0]<stdout>:init_internal_neighbors_wf: Process 0 of 2 about to exchange (if necessary) w/+/-1! Sender=0, Receiver=1, Neighbor2Proc.size()=1, Msg2Send_size=1, i=0 @ idx= 0 0 0
    [1,0]<stdout>:
    [1,0]<stdout>:Pause @ "init_internal_neighbors_wf: _slbmpi_h: 108: pre-exchange" if 1 process: <Enter> or <Return> continues; ^C aborts:
    [1,0]<stderr>:DEBUG: slbmpi.h:124     <==== THIS IS JUST BEFORE world.isend(...)
    [1,0]<stderr>:[xenia:06768] *** Process received signal ***
    [1,0]<stderr>:[xenia:06768] Signal: Segmentation fault (11)
    [1,0]<stderr>:[xenia:06768] Signal code: (128)
    [1,0]<stderr>:[xenia:06768] Failing at address: (nil)
    [1,0]<stderr>:[xenia:06768] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10060) [0x7f6453010060]
    [1,0]<stderr>:[xenia:06768] [ 1] ./a.out(_ZN5boost7archive4saveINS_3mpi15packed_oarchiveEKP12lattice_typeEEvRT_RT0_+0x14) [0x44b4aa]
    ...
    [1,0]<stderr>:[xenia:06768] [29] ./a.out(_ZN5boost7archive4saveINS_3mpi15packed_oarchiveEKNS_13serialization5arrayIKP12lattice_typeEEEEvRT_RT0_+0x23) [0x44a1f0]
    [1,0]<stderr>:[xenia:06768] *** End of error message ***

From the backtrace it seems that the code dies when performing serialization of a "lattice_type" element. This leads me to think that the "lattice_type" element "Neighbor2Proc[i]" has not been fully initialized.

Now the serialization code for "lattice_type" reads:

    struct lattice_type {
      ...
      public:
        lattice_type* neighbors[ en - 1 ];
      ....
      protected:
        template<class Archive> //serializes (boost::mpi::packed_iarchive& ar) & deserializes (boost::mpi::packed_oarchive& ar)
        inline void serialize( Archive & ar , const unsigned int )
        {
          ar & neighbors ; // for 'packing' (& unpacking) for message-passing: put together in 'series' to exchange
        }
      ...

As far as I understand, this means the boost::serialization code will try to dereference each pointer in the "neighbors" array and serialize the pointed-to elements. Could it be that some elements of "lattice_type::neighbors" are NULL pointers? (It would make sense for elements at a corner of the grid.)

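One quick way to test the NULL-pointer hypothesis would be to count the null entries in "neighbors" just before the send. The following is only a minimal sketch and not code from slbmpi.h: the struct is a stand-in for the real "lattice_type", "kNeighbors" is a placeholder for the actual "en - 1" (whose value I don't know), and "count_null_neighbors" is a hypothetical helper name.

```cpp
#include <cstddef>

// Placeholder for `en - 1`; the real value is not shown in this thread.
constexpr std::size_t kNeighbors = 26;

// Stand-in for the real lattice_type: only the pointer array matters here.
struct lattice_type {
    lattice_type* neighbors[kNeighbors];
};

// Hypothetical helper: returns how many entries of a neighbor array are null.
template <std::size_t N>
std::size_t count_null_neighbors(lattice_type* const (&neighbors)[N]) {
    std::size_t nulls = 0;
    for (std::size_t k = 0; k < N; ++k) {
        if (neighbors[k] == nullptr) {
            ++nulls;
        }
    }
    return nulls;
}
```

Assuming "Neighbor2Proc[i]" is a "lattice_type" object, printing count_null_neighbors(Neighbor2Proc[i].neighbors) right before "world.isend(...)" would show whether the cell being sent has missing neighbors. Note that this only detects null pointers, not uninitialized (garbage) ones.
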
Also, IIRC, the "serialize" code is responsible for serializing *all* fields in a struct: in this case you are basically transmitting just a tiny part of the "lattice_type" and will thus get garbage on the receiving side...

Hope this helps,
Riccardo