
Hi, sorry for opening a new thread, but I just joined the list. I was pointed to this review process at the beginning of this week and unfortunately did not find time to evaluate the design until now. Yesterday I took a short glance at the documentation, and today I investigated some issues that struck me at once. I was not on the boost list until today and am not really familiar with the whole boost project, so I apologize if I get some things about boost wrong for lack of insight. So let's start with my actual evaluation.
- What is your evaluation of the design?
The design is pretty straightforward, so I think everybody familiar with the C interface of MPI will feel at home at once. One thing that struck me immediately is the absence of tags for the communication. In the MPI standard they are everywhere and are usually used to distinguish different communications (as they might arrive at the target process in arbitrary order). How can the programmer make sure that the message a process is receiving is the message it wants to receive at this point of the program? If there is no such means, I would consider this a serious flaw in the design.
- What is your evaluation of the implementation?
One thing I do not understand is why the collective operations use std::vector to collect the data. I always thought that there is no guarantee by the standard that the values are stored in contiguous memory (which is what MPI needs). What if the user wants to communicate from/to a given data structure represented by a standard C array? This is very often the case in scientific computing, where you tend to reuse existing libraries (probably implemented in C or by a novice C++ programmer). As far as I can see, the implementation uses the boost serialization framework to send data types that do not correspond to a standard MPI_Datatype. While this way it is still possible to send values between different architectures, I fear that there is (in some cases) a great efficiency loss: e.g. for non-POD types, broadcasts (and probably other collective operations) do not use MPI's collective operations, but independent asynchronous MPI sends and receives. This may forgo special optimizations based on the underlying network topology (e.g. on a hypercube the broadcast takes just O(log2 P) steps, where P is the number of processes). This is a knock-out criterion for people doing High Performance Computing, where efficiency is everything, unless there is a way of sending at least 1-dimensional arrays the usual way. But I fear that even then, what they gain by using Boost.MPI is not enough to persuade them to use it. Maybe it would be possible to use MPI_Pack or create custom MPI_Datatypes? I have not tested my statement; this should probably be done to see whether I am right.
- What is your evaluation of the documentation?
The documentation is really good. It was possible to understand the usage of the framework at once.
- What is your evaluation of the potential usefulness of the library?
In its current state it will probably not be used by people doing High Performance Computing, due to the lack of efficiency stated above. For people who do not need the best performance, the library is useful as it is, since it is by far easier to use than MPI for a C++ programmer.
- Did you try to use the library? With what compiler?
No, unfortunately there was no time.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
I read through the documentation and then studied issues that struck me in the source.
- Are you knowledgeable about the problem domain?
I have been programming in High Performance Computing seriously since 2001. Currently I am doing research and implementation in parallel iterative solvers (especially parallel algebraic multigrid methods). Regarding whether I think Boost.MPI should be included in boost: I don't think I am in a position to argue about that, as I do not have enough background knowledge about boost, but: if you want to address people who are into High Performance Computing, you should address the issues above before you incorporate it into boost. If you just want people to get started with parallel programming more easily, go for it. But unless efficiency is guaranteed by Boost.MPI, those people will have to hand-code and deal with MPI directly sooner or later to get the efficiency they need. I hope I could help in the evaluation process. Best regards, Markus Blatt -- DUNE -- The Distributed And Unified Numerics Environment <http://www.dune-project.org>

On Sep 15, 2006, at 9:12 PM, Markus Blatt wrote:
- What is your evaluation of the design?
The design is pretty straightforward, so I think everybody familiar with the C interface of MPI will feel at home at once.
One thing that struck me immediately is the absence of tags for the communication. In the MPI standard they are everywhere and are usually used to distinguish different communications (as they might arrive at the target process in arbitrary order). How can the programmer make sure that the message a process is receiving is the message it wants to receive at this point of the program? If there is no such means, I would consider this a serious flaw in the design.
I don't fully understand that. There are tags in all the point-to-point communications, which is where the MPI standard supports them.
- What is your evaluation of the implementation?
One thing I do not understand is why the collective operations use std::vector to collect the data. I always thought that there is no guarantee by the standard that the values are stored in contiguous memory (which is what MPI needs).
Section 23.2.4 of C++03 states "The elements of a vector are stored contiguously"
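So handing a vector's storage directly to MPI is safe. Just as a sketch with the plain C API (this is not Boost.MPI code, only an illustration of the idiom):

#include <mpi.h>
#include <vector>

// Broadcast the contents of a std::vector<double> in place.
void broadcast_doubles(std::vector<double>& v, int root, MPI_Comm comm)
{
    // &v[0] points to a contiguous block of v.size() doubles
    // (guaranteed by C++03, section 23.2.4).
    if (!v.empty())
        MPI_Bcast(&v[0], static_cast<int>(v.size()), MPI_DOUBLE, root, comm);
}

This of course presumes that all processes already agree on the size.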
What if the user wants to communicate from/to a given data structure represented by a standard C array? This is very often the case in scientific computing, where you tend to reuse existing libraries (probably implemented in C or by a novice C++ programmer).
This is a good point. Doug, can we extend the interface to also allow a C-array?
As far as I can see, the implementation uses the boost serialization framework to send data types that do not correspond to a standard MPI_Datatype. While this way it is still possible to send values between different architectures, I fear that there is (in some cases) a great efficiency loss: e.g. for non-POD types, broadcasts (and probably other collective operations) do not use MPI's collective operations, but independent asynchronous MPI sends and receives. This may forgo special optimizations based on the underlying network topology (e.g. on a hypercube the broadcast takes just O(log2 P) steps, where P is the number of processes).
This is a knock-out criterion for people doing High Performance Computing, where efficiency is everything, unless there is a way of sending at least 1-dimensional arrays the usual way. But I fear that even then, what they gain by using Boost.MPI is not enough to persuade them to use it.
I believe that your comments are based on a misunderstanding. The goal of the library design was to actually make it efficient and useful for high performance applications. The library allows the use of custom MPI_Datatypes and actually creates them automatically whenever possible, using the serialization library. For those data types for which this is not possible (e.g. the pointers in a linked list) the serialization library is used to serialize the data structure, which is then packed into a message using MPI_Pack.

The skeleton&content mechanism is essentially a simple way to create a custom MPI datatype. The "content" is just an MPI_Datatype created for the data members of the object you want to send.

Another case is when sending, e.g., a std::vector or other array-like data structure of a type for which a custom MPI_Datatype can be constructed. In that case the serialization library is called once to construct the needed MPI_Datatype for one element in the array, and then communication is done using that MPI_Datatype.

You might be worried that the use of the serialization library leads to inefficiencies, since in the released versions each element of an array was serialized independently. The recent changes, which are in the CVS HEAD, actually address this issue by providing optimizations for array-like data structures.

I hope these answers address the issues you had in mind. I can elaborate if you want.
Maybe it would be possible to use MPI_Pack or create custom MPI_Datatypes?
Yes, both are implemented and used automatically.
- What is your evaluation of the potential usefulness of the library?
In its current state it will probably not be used by people doing High Performance Computing, due to the lack of efficiency stated above. For people who do not need the best performance, the library is useful as it is, since it is by far easier to use than MPI for a C++ programmer.
I believe, as stated above, that this is just a misunderstanding.
I don't think I am in a position to argue about that, as I do not have enough background knowledge about boost, but: if you want to address people who are into High Performance Computing, you should address the issues above before you incorporate it into boost.
This was one of the main goals, and a large fraction of the design and implementation time was spent on optimizing the array serialization so that the serialization library could be used to create MPI_Datatypes as well as to allow efficient message passing using MPI_Pack.
If you just want people to get started with parallel programming more easily, go for it. But unless efficiency is guaranteed by Boost.MPI, those people will have to hand-code and deal with MPI directly sooner or later to get the efficiency they need.
I hope I could help in the evaluation process.
Thank you very much for your comments! Matthias

On Sep 15, 2006, at 4:40 PM, Matthias Troyer wrote:
What if the user wants to communicate from/to a given data structure represented by a standard C array? This is very often the case in scientific computing, where you tend to reuse existing libraries (probably implemented in C or by a novice C++ programmer).
This is a good point. Doug, can we extend the interface to also allow a C-array?
Yes, we'll do that. Cheers, Doug

Hi Matthias, On Fri, Sep 15, 2006 at 10:40:25PM +0200, Matthias Troyer wrote:
On Sep 15, 2006, at 9:12 PM, Markus Blatt wrote:
One thing that struck me immediately is the absence of tags for the communication. In the MPI standard they are everywhere and are usually used to distinguish different communications (as they might arrive at the target process in arbitrary order).
I don't fully understand that. There are tags in all the point-to-point communications, which is where the MPI standard supports them.
Just forget about it. I was missing the tags in the collective communication, where there definitely are none in the MPI standard. Probably I should have gotten more sleep. Sorry.
- What is your evaluation of the implementation?
One thing I do not understand is why the collective operations use std::vector to collect the data. I always thought that there is no guarantee by the standard that the values are stored in contiguous memory (which is what MPI needs).
Section 23.2.4 of C++03 states "The elements of a vector are stored contiguously"
Thanks. That's a relief. I should definitely get a newer standard.
As far as I can see, the implementation uses the boost serialization framework to send data types that do not correspond to a standard MPI_Datatype. While this way it is still possible to send values between different architectures, I fear that there is (in some cases) a great efficiency loss: e.g. for non-POD types, broadcasts (and probably other collective operations) do not use MPI's collective operations, but independent asynchronous MPI sends and receives. This may forgo special optimizations based on the underlying network topology (e.g. on a hypercube the broadcast takes just O(log2 P) steps, where P is the number of processes).
I believe that your comments are based on a misunderstanding. The goal of the library design was to actually make it efficient and useful for high performance applications. The library allows the use of custom MPI_Datatypes and actually creates them automatically whenever possible, using the serialization library. For those data types for which this is not possible (e.g. the pointers in a linked list) the serialization library is used to serialize the data structure, which is then packed into a message using MPI_Pack.
The skeleton&content mechanism is essentially a simple way to create a custom MPI datatype. The "content" is just an MPI_Datatype created for the data members of the object you want to send.
Another case is when sending, e.g. a std::vector or other array-like data structure of a type for which a custom MPI_Datatype can be constructed. In that case the serialization library is called once to construct the needed MPI_Datatype for one element in the array, and then communication is done using that MPI_Datatype.
You might be worried that the use of the serialization library leads to inefficiencies since in the released versions each element of an array was serialized independently. The recent changes, which are in the CVS HEAD, actually address this issue by providing optimizations for array-like data structures.
I hope these answers address the issues you had in mind. I can elaborate if you want.
The question came up when I looked into mpi/collectives/broadcast.hpp:

// We're sending a type that does not have an associated MPI
// datatype, so we'll need to serialize it. Unfortunately, this
// means that we cannot use MPI_Bcast, so we'll just send from the
// root to everyone else.
template<typename T>
void broadcast_impl(const communicator& comm, T& value, int root, mpl::false_)

If this function gets called, the performance will definitely be suboptimal, as the root will send to all others. Is this only the case if no MPI_Datatype could be constructed (as for the linked list), or is it called whenever boost serialization is used?

Regards, Markus Blatt -- DUNE -- The Distributed And Unified Numerics Environment <http://www.dune-project.org>

On Sep 16, 2006, at 10:22 AM, Markus Blatt wrote:
Just forget about it. I was missing the tags in the collective communication, where there definitely are none in the MPI standard. Probably I should have gotten more sleep. Sorry.
I would actually also love to have tags there :-)
I hope these answers address the issues you had in mind. I can elaborate if you want.
The question came up when I looked into mpi/collectives/broadcast.hpp:
// We're sending a type that does not have an associated MPI
// datatype, so we'll need to serialize it. Unfortunately, this
// means that we cannot use MPI_Bcast, so we'll just send from the
// root to everyone else.
template<typename T>
void broadcast_impl(const communicator& comm, T& value, int root, mpl::false_)
If this function gets called, the performance will definitely be suboptimal, as the root will send to all others. Is this only the case if no MPI_Datatype could be constructed (as for the linked list), or is it called whenever boost serialization is used?
OK, I see your concern. This is actually only used when no MPI_Datatype can be constructed, that is, when no MPI_Datatype is possible, such as for a linked list, and you do not use the skeleton&content mechanism either.

Since this part of the code was written by Doug Gregor, I ask him to correct me if I say something wrong now or if I miss something. When no MPI datatype exists then we need to pack the object into a buffer using MPI_Pack, and the buffer needs to be broadcast. So far we all seem to agree. The problem now is that the receiving side needs to know the size of the buffer to allocate enough memory, but there is no MPI_Probe for collectives that could be used to inquire about the message size. I believe that this was the reason for implementing the broadcast as a sequence of nonblocking sends and receives (Doug?).

Thinking about it, I realize that one could instead do two consecutive broadcasts: one to send the size of the buffer and then another one to send the buffer. This will definitely be faster on machines with special hardware for collectives. On Beowulf clusters, on the other hand, the current version is faster, since most MPI implementations just perform the broadcast as a sequence of N-1 send/receive operations from the root instead of optimizing it.
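In code, the two-broadcast variant would look roughly like this (just a sketch against the plain C API, not what is currently implemented):

#include <mpi.h>
#include <vector>

// Sketch: broadcast a buffer that was filled with MPI_Pack on the root.
void broadcast_packed(std::vector<char>& buffer, int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    // First broadcast: the size of the packed buffer.
    int size = (rank == root) ? static_cast<int>(buffer.size()) : 0;
    MPI_Bcast(&size, 1, MPI_INT, root, comm);
    // Now the receivers can allocate enough memory.
    if (rank != root)
        buffer.resize(size);
    // Second broadcast: the buffer itself.
    if (size > 0)
        MPI_Bcast(&buffer[0], size, MPI_PACKED, root, comm);
}

Matthias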

On Sat, Sep 16, 2006 at 11:05:12AM +0200, Matthias Troyer wrote:
On Sep 16, 2006, at 10:22 AM, Markus Blatt wrote:
The question came up when I looked into mpi/collectives/broadcast.hpp:
If this function gets called, the performance will definitely be suboptimal, as the root will send to all others. Is this only the case if no MPI_Datatype could be constructed (as for the linked list), or is it called whenever boost serialization is used?
OK, I see your concern. This is actually only used when no MPI_Datatype can be constructed, that is, when no MPI_Datatype is possible, such as for a linked list, and you do not use the skeleton&content mechanism either.
So there is always a way to circumvent that. That's fair enough. Cheers, Markus Blatt -- DUNE -- The Distributed And Unified Numerics Environment <http://www.dune-project.org>

On Sep 16, 2006, at 5:05 AM, Matthias Troyer wrote:
On Sep 16, 2006, at 10:22 AM, Markus Blatt wrote:
The question came up when I looked into mpi/collectives/broadcast.hpp:
// We're sending a type that does not have an associated MPI
// datatype, so we'll need to serialize it. Unfortunately, this
// means that we cannot use MPI_Bcast, so we'll just send from the
// root to everyone else.
template<typename T>
void broadcast_impl(const communicator& comm, T& value, int root, mpl::false_)
If this function gets called, the performance will definitely be suboptimal, as the root will send to all others. Is this only the case if no MPI_Datatype could be constructed (as for the linked list), or is it called whenever boost serialization is used?
OK, I see your concern. This is actually only used when no MPI_Datatype can be constructed, that is, when no MPI_Datatype is possible, such as for a linked list, and you do not use the skeleton&content mechanism either.
Right. From a code standpoint, in addition to the broadcast_impl shown above, there is one that looks like this:

// We're sending a type that has an associated MPI datatype, so
// we'll use MPI_Bcast to do all of the work.
template<typename T>
void broadcast_impl(const communicator& comm, T& value, int root, mpl::true_)

That last parameter decides which implementation to use, based on whether we have or can create an MPI_Datatype for the type T.
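The public broadcast then just dispatches on that compile-time boolean, roughly like this (a simplified sketch of the mechanism, not the exact source):

template<typename T>
void broadcast(const communicator& comm, T& value, int root)
{
  // is_mpi_datatype<T> derives from mpl::true_ or mpl::false_, so this
  // call selects either the MPI_Bcast overload or the serializing
  // fallback at compile time.
  broadcast_impl(comm, value, root, is_mpi_datatype<T>());
}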
Since this part of the code was written by Doug Gregor, I ask him to correct me if I say something wrong now or if I miss something. When no MPI datatype exists then we need to pack the object into a buffer using MPI_Pack, and the buffer needs to be broadcast. So far we all seem to agree. The problem now is that the receiving side needs to know the size of the buffer to allocate enough memory, but there is no MPI_Probe for collectives that could be used to inquire about the message size. I believe that this was the reason for implementing the broadcast as a sequence of nonblocking sends and receives (Doug?).
Yes, this was the reason for the sequence of nonblocking sends and receives.
Thinking about it, I realize that one could instead do two consecutive broadcasts: one to send the size of the buffer and then another one to send the buffer. This will definitely be faster on machines with special hardware for collectives. On Beowulf clusters, on the other hand, the current version is faster, since most MPI implementations just perform the broadcast as a sequence of N-1 send/receive operations from the root instead of optimizing it.
Right. I guess we could provide some kind of run-time configuration switch that decides between the two implementations, if someone runs into a case where it matters. Doug

Hi, in light of the performance questions let me summarize some details of how the proposed Boost.MPI library sends data:

If an object is sent for which an MPI_Datatype exists, then communication is done using that MPI_Datatype. This applies both to the builtin primitive types and to custom MPI_Datatypes for "POD-like" types for which an MPI_Datatype can be constructed. For each such type, an MPI_Datatype is built *once* during execution of the program by using the serialization library.

If a fixed-size array is sent for which an MPI_Datatype exists, then again the send is done using the MPI_Datatype.

Thus for these two cases we get optimal performance, and much simplified usage, since the creation of MPI_Datatypes is made much easier than in MPI.

For all other types (variable-sized vectors, linked lists, trees) the data structure is serialized into a buffer by the serialization library using MPI_Pack. Again MPI_Datatypes are used wherever they exist, and contiguous arrays of homogeneous types for which MPI_Datatypes exist are serialized using a single MPI_Pack call (using a new optimization in Boost.Serialization). At the end, the buffer is sent using MPI_Send. Note here that while MPI_Pack calls do incur an overhead, we are talking about sending complex data structures for which no corresponding MPI call exists, and any program directly written using MPI would also need to first serialize the data structure into a buffer.

To counter the mentioned overhead, there is the "skeleton&content" mechanism for cases when a data structure needs to be sent multiple times, with different "contents", while the "skeleton" (the sizes of arrays, the values of pointers, ...) of the data structure remains unchanged. In that case only the structural information (sizes, types, pointers) is serialized using MPI_Pack and sent once, so that the receiving side can create an identical data structure to receive the data. Afterwards an MPI_Datatype for the "contents" (the data members) of the data structure is created, and sending the content is done using this custom MPI_Datatype, which again gives optimal performance.

It seems to me that the simplicity of the interface does a good job at hiding these optimizations from the user. If anyone knows of a further optimization trick that could be used, then please post it to the list.
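In user code, the skeleton&content mechanism looks roughly like this (a sketch following the proposed interface):

#include <boost/mpi.hpp>
#include <list>

// Sketch: process 0 repeatedly sends the values of a list whose
// structure (its length) does not change between sends.
void send_repeatedly(boost::mpi::communicator& world, std::list<double>& l)
{
    // Transmit the structure once; the receiver mirrors this with
    // world.recv(0, 0, boost::mpi::skeleton(l)).
    world.send(1, 0, boost::mpi::skeleton(l));
    for (int step = 0; step < 100; ++step) {
        // ... update the values in l ...
        // Only the data members are sent, via a custom MPI_Datatype.
        world.send(1, 1, boost::mpi::get_content(l));
    }
}

Matthias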

On Sat, Sep 16, 2006 at 08:41:53AM +0200, Matthias Troyer wrote:
Hi,
in light of the performance questions let me summarize some details of how the proposed Boost.MPI library sends data:
If an object is sent for which an MPI_Datatype exists, then communication is done using that MPI_Datatype. This applies both to the builtin primitive types and to custom MPI_Datatypes for "POD-like" types for which an MPI_Datatype can be constructed. For each such type, an MPI_Datatype is built *once* during execution of the program by using the serialization library.
If a fixed-size array is sent for which an MPI_Datatype exists, then again the send is done using the MPI_Datatype.
Thus for these two cases we get optimal performance, and a much simplified usage, since the creation of MPI_Datatypes is made much easier than in MPI.
For all other types (variable-sized vectors, linked lists, trees) the data structure is serialized into a buffer by the serialization library using MPI_Pack. Again MPI_Datatypes are used wherever they exist, and contiguous arrays of homogeneous types for which MPI_Datatypes exist are serialized using a single MPI_Pack call (using a new optimization in Boost.Serialization). At the end, the buffer is sent using MPI_Send. Note here that while MPI_Pack calls do incur an overhead, we are talking about sending complex data structures for which no corresponding MPI call exists, and any program directly written using MPI would also need to first serialize the data structure into a buffer.
To counter the mentioned overhead, there is the "skeleton&content" mechanism for cases when a data structure needs to be sent multiple times, with different "contents", while the "skeleton" (the sizes of arrays, the values of pointers, ...) of the data structure remains unchanged. In that case only the structural information (sizes, types, pointers) is serialized using MPI_Pack and sent once, so that the receiving side can create an identical data structure to receive the data. Afterwards an MPI_Datatype for the "contents" (the data members) of the data structure is created, and sending the content is done using this custom MPI_Datatype, which again gives optimal performance.
It seems to me that the simplicity of the interface does a good job at hiding these optimizations from the user. If anyone knows of a further optimization trick that could be used then please post it to the list.
As far as I can tell, there are three choices when sending a message:

1. Send it as MPI_PACK, in which case structure and content can be encoded together arbitrarily.

2. Send it as a datatype with the entire structure known beforehand. This allows easy use of nonblocking sends and receives.

3. Send it as a datatype with the structure determined up to the total size of the message. This requires getting the size with MPI_Probe, then building a datatype, then receiving it.

The third option allows you to send a variable size vector with no extra explicit buffer. The same applies to a vector plus a constant amount of data (such as pair<int,vector<int> >). That would be quite useful, but probably difficult to work out automatically.

Unfortunately, this trick interacts very poorly with multiple nonblocking messages, since the only ways to wait for one of several messages sent this way are to either busy-wait or use the same tag for all messages. This restriction probably makes it impossible to hide this inside the implementation of a general nonblocking send function.

Also, if my classification of message strategies is wrong, I would love to know!
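To make option 3 concrete, receiving a variable size vector that way might look like this (a sketch with the C API):

#include <mpi.h>
#include <vector>

// Sketch of option 3: probe for the size, then receive in place.
void receive_vector(std::vector<double>& v, int source, int tag, MPI_Comm comm)
{
    MPI_Status status;
    MPI_Probe(source, tag, comm, &status);       // wait until the message is visible
    int count;
    MPI_Get_count(&status, MPI_DOUBLE, &count);  // size of the incoming message
    v.resize(count);                             // no extra explicit buffer needed
    MPI_Recv(v.empty() ? 0 : &v[0], count, MPI_DOUBLE,
             source, tag, comm, MPI_STATUS_IGNORE);
}

Geoffrey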

On Sep 16, 2006, at 9:21 AM, Geoffrey Irving wrote:
As far as I can tell, there are three choices when sending a message:
1. Send it as MPI_PACK, in which case structure and content can be encoded together arbitrarily.
This is the default if no MPI datatype exists and the user does not employ the skeleton&content trick.
2. Send it as a datatype with the entire structure known beforehand. This allows easy use of nonblocking sends and receives.
This is the default if there exists an MPI datatype for the structure.
3. Send it as a datatype with the structure determined up to the total size of the message. This requires getting the size with MPI_Probe, then building a datatype, then receiving it.
The third option allows you to send a variable size vector with no extra explicit buffer. The same applies to a vector plus a constant amount of data (such as pair<int,vector<int> >). That would be quite useful, but probably difficult to work out automatically.
This could be implemented as an optimization for e.g. std::vector and std::valarray. Our general mechanism that will work even for more complex data structures, such as a pair of two vectors of arbitrary length, is the skeleton&content mechanism where we first send serialized information about the sizes, and then can send the content (multiple times) using an MPI data type.
Unfortunately, this trick interacts very poorly with multiple nonblocking messages, since the only ways to wait for one of several messages sent this way are to either busy wait or use the same tag for all messages. This restriction probably makes it impossible to hide this inside the implementation of a general nonblocking send function.
For std::vector and std::valarray of MPI datatypes I can see your trick being generally useful and it can be implemented as a special optimization (Doug, what do you think?). For more general types I believe that the skeleton&content mechanism is the appropriate generalization of the third option. Matthias

On Sep 16, 2006, at 4:43 AM, Matthias Troyer wrote:
On Sep 16, 2006, at 9:21 AM, Geoffrey Irving wrote:
3. Send it as a datatype with the structure determined up to the total size of the message. This requires getting the size with MPI_Probe, then building a datatype, then receiving it.
The third option allows you to send a variable size vector with no extra explicit buffer. The same applies to a vector plus a constant amount of data (such as pair<int,vector<int> >). That would be quite useful, but probably difficult to work out automatically.
For std::vector and std::valarray of MPI datatypes I can see your trick being generally useful and it can be implemented as a special optimization (Doug, what do you think?). For more general types I believe that the skeleton&content mechanism is the appropriate generalization of the third option.
We can do this, although MPI_Probe() is both a problem for threading (as Geoffrey noted) and for performance (with some MPI implementations). Our best bet would be to do the same thing we do with serialized types, but without the serialization: send a small message containing the size of the vector, followed by the data itself. In a multithreaded environment, we need to send the second message with a unique tag and send that tag value as part of the first message. Again, it's the same thing we would do for serialized types. It would be so much easier if the MPI committee had come up with an MPI_Probe that actually worked :) Doug
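P.S. In code, the scheme might look roughly like this (a sketch only, not the actual implementation; allocate_unique_tag() is an invented helper for picking a tag not used by other pending messages):

#include <mpi.h>
#include <vector>

int allocate_unique_tag(); // hypothetical: returns a currently unused tag

// Sketch: first a small fixed-size header {size, data tag}, then the payload.
void send_vector(const std::vector<double>& v, int dest, int tag, MPI_Comm comm)
{
    long header[2];
    header[0] = static_cast<long>(v.size());
    header[1] = allocate_unique_tag();
    MPI_Send(header, 2, MPI_LONG, dest, tag, comm);
    if (!v.empty())
        MPI_Send(const_cast<double*>(&v[0]), static_cast<int>(v.size()),
                 MPI_DOUBLE, dest, static_cast<int>(header[1]), comm);
    // The receiver reads the header first, resizes its vector, and posts
    // the receive for the payload with the tag taken from header[1].
}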

Hi, On Sat, Sep 16, 2006 at 08:41:53AM +0200, Matthias Troyer wrote:
To counter the mentioned overhead, there is the "skeleton&content" mechanism for cases when a data structure needs to be sent multiple times, with different "contents", while the "skeleton" (the sizes of arrays, the values of pointers, ...) of the data structure remains unchanged. In that case only the structural information (sizes, types, pointers) is serialized using MPI_Pack and sent once, so that the receiving side can create an identical data structure to receive the data. Afterwards an MPI_Datatype for the "contents" (the data members) of the data structure is created, and sending the content is done using this custom MPI_Datatype, which again gives optimal performance.
This skeleton&content approach sounds pretty powerful to me. Is there a way for the user to tune this behaviour? Until now I always had the impression that you only allow the sending and receiving of complete objects and/or their structure. But often there are other scenarios, too: maybe one wants to send just parts of an object. And it might be the case that the local structure of your object is different from the global structure, e.g. it might need redistribution in the communication.

Let me give you an example: consider a distributed array or vector. Each entry of the global vector has a consecutive, zero-based local index l and a corresponding global index g, together with a tag specifying whether the process is the owner of the value (meaning that it can compute a new value from consistent data without communication) or not.

Process 0:
local global tag
0     0      owner
1     3      owner
2     4      notowner

Process 1:
local global tag
0     1      owner
1     2      owner
3     3      notowner

This would represent a global array of 5 entries. In our case we have the following communication patterns:

owner to notowner: Each process sends to each other process, in one message, all entries of the array that are tagged as owner locally and as notowner on the other process. The matching is done via the global id. In this case process 0 sends to process 1 the value at local index 1 (global: 3), and process 1 stores it at local index 3. If that is not clear, take a look at http://hal.iwr.uni-heidelberg.de/~mblatt/communication.pdf (early draft) for the more exact notion of index sets.

Would something like this be possible, or can one just send and receive objects with the same layout? Cheers, Markus Blatt -- DUNE -- The Distributed And Unified Numerics Environment <http://www.dune-project.org>

On Sep 16, 2006, at 12:09 PM, Markus Blatt wrote:
This skeleton&content approach sounds pretty powerful to me. Is there a way for the user to tune this behaviour?
There is no such mechanism provided explicitly by Boost.MPI, but you can tune the behavior almost arbitrarily with serialization wrappers and archive adaptors. E.g. you could filter out certain members depending on whether you are on the sending or receiving side.
Until now I always had the impression that you only allow the sending and receiving of complete objects and/or their structure. But often there are other scenarios, too:
Maybe one wants to send just parts of an object. And it might be the case that the local structure of your object is different from the global structure, e.g. it might need redistribution in the communication.
Let me give you an example:
Consider a distributed array or vector. Each entry of the global vector has a consecutive, zero-based local index l and a corresponding global index g, together with a tag specifying whether the process is the owner of the value (meaning that it can compute a new value from consistent data without communication) or not.
Process 0:
local global tag
0     0      owner
1     3      owner
2     4      notowner
Process 1:
local global tag
0     1      owner
1     2      owner
3     3      notowner
This would represent a global array of 5 entries.
In our case we have the following communication patterns:
owner to notowner: Each process sends to each other process, in one message, all entries of the array that are tagged as owner locally and as notowner on the other process. The matching is done via the global id.
In this case process 0 sends to process 1 the value at local index 1 (global: 3), and process 1 stores it at local index 3. If that is not clear, take a look at http://hal.iwr.uni-heidelberg.de/~mblatt/communication.pdf (early draft) for the more exact notion of index sets.
Would something like this be possible or can one just send and receive objects with the same layout?
The easiest solution in your case is to just send the object at index 1 and receive the object at index 3, by calling send/receive with these sub-objects. For a more general case, I assume that you want to use different MPI_Datatypes on the receiving and sending side, and create them in some automatic way? It depends on the details of your mechanism, but I think it should be possible to automate it. I would propose to introduce a new class for these local/global arrays and to provide a special serialize function for them.
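Just to sketch the direction (all names here are invented for illustration; this is not working library code for your problem):

#include <boost/serialization/split_member.hpp>
#include <vector>

// Sketch: a view of a distributed array that serializes only the
// entries this process owns, keyed by their global index.
struct owner_entries
{
    std::vector<double>* data;    // local values
    std::vector<int>*    global;  // local -> global index map
    std::vector<bool>*   owner;   // ownership tags

    template <class Archive>
    void save(Archive& ar, const unsigned int /*version*/) const
    {
        std::size_t n = 0;
        for (std::size_t i = 0; i < data->size(); ++i)
            if ((*owner)[i]) ++n;
        ar & n;                        // how many pairs follow
        for (std::size_t i = 0; i < data->size(); ++i)
            if ((*owner)[i]) {
                ar & (*global)[i];     // global index ...
                ar & (*data)[i];       // ... and its value
            }
    }

    template <class Archive>
    void load(Archive& ar, const unsigned int /*version*/)
    {
        std::size_t n;
        ar & n;
        for (std::size_t k = 0; k < n; ++k) {
            int g; double x;
            ar & g; ar & x;
            store_at_global(g, x);
        }
    }

    void store_at_global(int g, double x)
    {
        // Linear search is enough for a sketch; a real implementation
        // would keep a map from global to local indices.
        for (std::size_t i = 0; i < global->size(); ++i)
            if ((*global)[i] == g && !(*owner)[i])
                (*data)[i] = x;
    }

    BOOST_SERIALIZATION_SPLIT_MEMBER()
};

Sending such a wrapper through Boost.MPI would then use exactly the MPI_Pack path described earlier.

Matthias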

On Sep 16, 2006, at 6:09 AM, Markus Blatt wrote:
Consider a distributed array or vector. Each entry of the global vector has a consecutive, zero-based local index l and a corresponding global index g, together with a tag specifying whether the process is the owner of the value (meaning that it can compute a new value from consistent data without communication) or not.
Process 0:
local global tag
0     0      owner
1     3      owner
2     4      notowner
Process 1:
local global tag
0     1      owner
1     2      owner
3     3      notowner
This would represent a global array of 5 entries.
Very interesting. We're actually working on this very problem on a different project (our parallel graph library). If we come up with a good, general abstraction, we could try to get it into Boost.MPI. At the moment, however, I think Matthias has the right approach, using a special data-type wrapper that provides an alternative way to serialize the local/global array when building the skeleton. Doug