boost message passing

25 Sep 2006

Hi Boosters!

In this post I just wanted to throw in my two cents about the newly
accepted MPI library, from a luser's point of view.
(doc: http://tinyurl.com/fjc9x)

It was a great excitement to see the news on boost.org "Message Passing
Review Begins"! The C bindings of MPI are a great pain to use -- a fact 
which is mostly denied by programmers who are unaware of C++'s beauty 
and content with even good old F77 most of the time. And, as it has been 
pointed out by Boost.MPI developers, the C++ bindings of standard MPI, 
though an improvement, are nothing more than simple wrappers on top of 
the C bindings. Moreover, certain supercomputing centers do not even 
have these C++ bindings of MPI installed on their otherwise 
state-of-the-art clusters.

So, anyway, it was great to hear that MPI will finally be *boost*ed.
There is a lot to like about this new Boost library. For one,
quoting Jeremy: "One particularly nice aspect of Boost.MPI is the way it
leverages the Boost serialization library". This indeed appears to be a
very elegant way to pass user types around in messages. There are other
features to applaud, such as reduce() via arbitrary functors. Yet, in 
this post, with the intent of providing constructive criticism, I'd like 
to point out several design issues related to the user interface; mostly 
in comparison with other MPI implementations.

(hereafter the discussion will assume specific knowledge about MPI)

Here is a brief outline of what I'll be talking about, somewhat in the 
order of importance:
1. Port object
2. Message object
3. Skeleton and content questions
4. MPI-2 bindings

*Port Object*

Those of you who have actively programmed with MPI certainly have used 
or at least heard of the OOMPI library (I believe Doug Gregor, one of 
the two developers of Boost.MPI, is affiliated with the institution
where OOMPI was born). Its web page, http://www.osl.iu.edu/research/oompi/ ,
greets you with random (mostly funny) messages, such as "OOMPI- Isn't it 
time you did something for YOU?" :)

Anyway, it was thanks to the OOMPI library that I realized that MPI
programming could in fact be fun! In particular, there is this *OOMPI_Port*
abstraction that literally _enlighten_s the way point-to-point (P2P) and
rooted-collective (RC) operations are made. Let's contrast the way P2P
is carried out in the three interfaces (standard MPI, Boost.MPI and OOMPI):

//=============================================
// Standard way
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

int msg;
int msg_tag = 10;
if (rank == 0) {
    msg = 17;
    MPI_Send(&msg, 1, MPI_INT, 1, msg_tag, MPI_COMM_WORLD);
    // (synopsis: http://tinyurl.com/ejmgg )
} else if (rank == 1) {
    MPI_Recv(&msg, 1, MPI_INT, 0, msg_tag, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    // (synopsis: http://tinyurl.com/elkez )
}

// Boost.MPI way
mpi::communicator world;
int msg;
int msg_tag = 10;
if (world.rank() == 0) {
    msg = 17;
    world.send(1, msg_tag, msg);
} else if (world.rank() == 1) {
    world.recv(0, msg_tag, msg);
}

// OOMPI way
OOMPI_Comm_world& world = OOMPI_COMM_WORLD;
world.Init(argc,argv);
int msg;
int msg_tag = 10;
OOMPI_Port to   (world[1]);
OOMPI_Port from (world[0]);
if (world.Rank() == from.Rank()) {
    msg = 17;
    to.Send(msg,msg_tag);
} else if (world.Rank() == to.Rank()) {
    from.Recv(msg,msg_tag);
}
//=============================================

The standard C bindings provide global functions for these simplest and
most commonly employed Send/Recv commands. These functions require a long
list of arguments, which are usually a pain to keep track of (here is
another P2P command which is really harsh: http://tinyurl.com/mk6dq).
Also, such an interface does not provide any conceptual description of
what is going on. The design of MPI is indeed mostly object oriented
(http://tinyurl.com/qm9hd); however, the C interface does not reflect
this design.

Boost.MPI improves two aspects of this. For one, the P2P commands are
member functions of the communicator (same as the standard C++
bindings); and two, the message's type is deduced through templates, so
the user does not have to repeatedly specify what she is sending. But, as
in the C version, one still has to think of the points (the P's in P2P) of
communication as a bunch of integer values rather than as, well, the points
of communication, which are entities in themselves.
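
To illustrate the second point, here is a minimal sketch of how a templated
send can pick the datatype from the static type of its argument. It does not
use MPI at all: toy_datatype, datatype_of, send_args and describe_send are
hypothetical names made up for this illustration, and Boost.MPI's actual
mechanism may well differ.

```cpp
#include <cassert>

// Hypothetical stand-ins for MPI datatype handles such as MPI_INT;
// a real wrapper would return MPI_Datatype values from <mpi.h>.
enum toy_datatype { TOY_INT, TOY_DOUBLE, TOY_CHAR };

// The primary template is declared but not defined, so passing an
// unsupported type fails at compile time.
template <class T> struct datatype_of;
template <> struct datatype_of<int>    { static toy_datatype get() { return TOY_INT; } };
template <> struct datatype_of<double> { static toy_datatype get() { return TOY_DOUBLE; } };
template <> struct datatype_of<char>   { static toy_datatype get() { return TOY_CHAR; } };

// What a C-style call would have to be told explicitly.
struct send_args {
    toy_datatype type;
    int count;
    int dest;
    int tag;
};

// Same argument order as world.send(dest, tag, value) above; the datatype
// and the count of 1 are filled in from the static type of `value`.
template <class T>
send_args describe_send(int dest, int tag, const T& value) {
    (void)value;  // only the type matters in this sketch
    assert(dest >= 0);
    send_args args = { datatype_of<T>::get(), 1, dest, tag };
    return args;
}
```

With this, describe_send(1, 10, 17) reports an int payload and
describe_send(1, 10, 2.5) a double, without the caller naming either type.
Note that the destination is still a bare integer.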

OOMPI takes this one step further, and things fall into place with
the whole framework of MPI's design: the points of communication are
objects themselves, abstracted as the OOMPI_Port class. P2P functions are
member functions of OOMPI_Port, and an OOMPI_Port is in turn a certain piece
of a given communicator. Here is what an RC communication looks like:

//=============================================
// Rooted-collective communication with OOMPI.
// The message will be created at the process with rank 4 and
//+ will be broadcast to everyone in the world communicator.
OOMPI_Port root = world[4];
root.Bcast(msg);
// or if the Port will not be used for anything else, it is
//+ possible to create a temporary:
world[4].Bcast(msg);

// Here is how this is with Boost.MPI
broadcast(world, msg, 4);
//=============================================

Clearly the OOMPI version is more instructive of the communication 
process. A more elaborate illustration can be given in the context of 
communication topologies (a feature that is not yet available in 
Boost.MPI, unless I am mistaken). For instance:

//=============================================
// Create a cartesian virtual topology
enum {X,Y,NDIM};
int shape[NDIM] = {3,2}; // requires world.size() >= 6;
bool is_periodic[] = {true,true};
OOMPI_Cart_comm grid(world,NDIM,shape,is_periodic);

// Get nodes of subsequent P2P communication
OOMPI_Port left  = grid[grid.Shift(X,-1)]; // left neighbor
OOMPI_Port right = grid[grid.Shift(X, 1)]; // right neighbor

// get cartesian coordinates
int mycoords[NDIM]; grid.Coords(mycoords);

// create a message
int msg = grid.Rank();

// P2P communicate
left.Send(msg);  // send to left
right.Recv(msg); // recv from right

//================================================

I wonder why this Port abstraction (or a similar idea) couldn't make its
way into Boost.MPI? Is there an underlying design decision behind this?
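
To make the question concrete, here is a minimal sketch of a port layered on
top of a communicator whose send/recv take (rank, tag, value), which is the
calling shape of the Boost.MPI snippet above. Everything in it is hypothetical:
toy_network and toy_communicator are in-memory stand-ins so that the example
runs without MPI, and port is not a class of Boost.MPI or OOMPI.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <tuple>

// Hypothetical shared state standing in for the network: one FIFO
// per (source rank, destination rank, tag).
struct toy_network {
    std::map<std::tuple<int, int, int>, std::queue<int> > boxes;
};

// Hypothetical per-process view of a communicator. It mirrors only the
// (rank, tag, value) calling shape, nothing else.
class toy_communicator {
public:
    toy_communicator(toy_network& net, int rank, int size)
        : net_(&net), rank_(rank), size_(size) {}

    int rank() const { return rank_; }
    int size() const { return size_; }

    void send(int dest, int tag, int value) {
        assert(0 <= dest && dest < size_);
        net_->boxes[std::make_tuple(rank_, dest, tag)].push(value);
    }

    // Returns false instead of blocking when nothing is queued.
    bool recv(int source, int tag, int& value) {
        assert(0 <= source && source < size_);
        std::queue<int>& box = net_->boxes[std::make_tuple(source, rank_, tag)];
        if (box.empty()) return false;
        value = box.front();
        box.pop();
        return true;
    }

private:
    toy_network* net_;
    int rank_;
    int size_;
};

// The port: a communicator plus one peer rank, so call sites name the
// endpoint once instead of passing an integer on every call.
class port {
public:
    port(toy_communicator& comm, int peer) : comm_(&comm), peer_(peer) {}
    int rank() const { return peer_; }
    void send(int value, int tag) { comm_->send(peer_, tag, value); }
    bool recv(int& value, int tag) { return comm_->recv(peer_, tag, value); }

private:
    toy_communicator* comm_;
    int peer_;
};

// Plays ranks 0 and 1 in a single process; returns what rank 1 received,
// or -1 if nothing arrived.
int demo_port_exchange() {
    toy_network net;
    toy_communicator world0(net, 0, 2);  // "world" as seen by rank 0
    toy_communicator world1(net, 1, 2);  // "world" as seen by rank 1
    const int msg_tag = 10;

    port to(world0, 1);    // rank 0's handle on rank 1
    to.send(17, msg_tag);

    port from(world1, 0);  // rank 1's handle on rank 0
    int msg = 0;
    return from.recv(msg, msg_tag) ? msg : -1;
}
```

The wrapper itself is a handful of forwarding one-liners, which suggests that
such a port need not cost anything at run time; whether it fits the rest of
Boost.MPI's design is of course for the developers to say.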

Also, why are the P2P commands members of the communicator class while RC
commands such as broadcast and gather are global?

*Message object*

A message in MPI is created by specifying a data type, a data count, a data
address and finally a message tag. Note the C interface for passing a
message in a P2P communication (see the first example above). Four of the six
arguments to MPI_Send() are there just to specify the message. And for every
communication statement of the same kind, this list has to be repeated.

Boost.MPI (ignoring the skeleton/content idea for a second) reduces this 
to three arguments (http://tinyurl.com/k2w7q): Reference to the data 
(from which the type is deduced), data count and the message tag. 
However, this message is still not a separate entity; and all this 
information is repeatedly specified as arguments to communication functions.

OOMPI provides the convenience of an OOMPI_Message object, which is then
used in communications. For instance, to pass an array of integers:

//=============================================
const int size = 5;  // constant, so that arr is a regular array
int arr[size];
OOMPI_Tag msg_tag = 10;
OOMPI_Message msg(arr,size,msg_tag);

if (world.Rank() == root.Rank()) { /* fill up arr */ }
root.Bcast(msg);
//=============================================
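
The idea of a message as a separate entity can be sketched without any MPI
at all. In the following, message, make_message and deliver are hypothetical
names made up for this illustration (they are not OOMPI or Boost.MPI API), and
deliver() merely stands in for a matched send/receive pair inside one process.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical message descriptor: buffer address, element count and tag
// bundled once (the element type is carried by the template parameter).
template <class T>
struct message {
    T* data;
    std::size_t count;
    int tag;
};

// Deduces both the element type and the count from a built-in array.
template <class T, std::size_t N>
message<T> make_message(T (&arr)[N], int tag) {
    message<T> m = { arr, N, tag };
    return m;
}

// Stand-in for a matched send/receive pair: copies the payload only when
// the two descriptors agree on tag and count.
template <class T>
bool deliver(const message<T>& from, const message<T>& to) {
    assert(from.data != 0 && to.data != 0);
    if (from.tag != to.tag || from.count != to.count) return false;
    std::copy(from.data, from.data + from.count, to.data);
    return true;
}

// Returns the last element seen by the receiver, or -1 on a mismatch.
int demo_message() {
    int sent[5] = { 1, 2, 3, 4, 5 };
    int received[5] = { 0, 0, 0, 0, 0 };
    const int msg_tag = 10;

    message<int> out = make_message(sent, msg_tag);
    message<int> in  = make_message(received, msg_tag);
    return deliver(out, in) ? received[4] : -1;
}
```

Once built, the descriptor can be handed to any number of communication
calls without repeating the type, count and tag each time.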

*Skeleton and Content - Separating structure from content*

Boost.MPI provides this promising functionality which not only 
simplifies message specifications during communication calls but also 
helps to easily create more complex data types.

But as an admittedly naive user, I am not so sure that this won't actually
degrade performance, or why it is even necessary at all. For
instance, let's examine the example given here
(http://tinyurl.com/mgwkt). An implementation of the same thing on top of
the plain C bindings would be something like this:

//===================================================
int root = 0;
// Create the message type (this section of the
//+ code is run by ALL processes in the communicator).
int list_len = 10; // size of the message
std::list<int> mylist(list_len);
std::vector<int*> data_locations(list_len);
std::vector<int > block_lengths(list_len,1);
using boost::lambda::_1;
std::transform( mylist.begin(), mylist.end(),
                 data_locations.begin(), &_1);
MPI_Datatype msg_type;
MPI_Type_hindexed(list_len, &block_lengths[0],
                   reinterpret_cast<MPI_Aint*>(&data_locations[0]),
                   MPI_INT, &msg_type);
MPI_Type_commit(&msg_type);

// fill with useful data in the master node
if (world.Rank() == root )
   std::generate(mylist.begin(),mylist.end(),&std::rand);

// distribute the data
MPI_Bcast(MPI_BOTTOM,1,msg_type,root,MPI_COMM_WORLD);
//====================================================

What is happening here is that each process initially creates, in its
own address space, a list of some pre-specified size that is equal in
every process. Then a local _definition_ of the container is created as
a collection of the addresses of every element in the list. A type is then
associated with this list, to be used in subsequent message passing
calls. The root node fills up the list in its own memory space with
useful data and, via a broadcast call (a rooted-collective call),
message-passes the values to the slave processes. The slave processes
gather the received data into their own lists, as specified by the
local type that has just been created. As a matter of fact, the slaves
might have gathered the whole data into a vector of ints if they so
pleased (and this is usually more efficient, since the underlying MPI
implementation already receives the data as a consecutive chunk of
values before processing it with the specified MPI_Datatype).

Now, Boost.MPI conveniently reduces the data type definition down to a 
single statement:

mpi::skeleton(mylist);

But, then the example has this extra broadcasting statement (indeed, the 
example is such that the skeleton (MPI_Datatype) is created as a 
temporary):

broadcast(world, mpi::skeleton(mylist), 0);

Here, I am not so sure why this is useful. The only thing that the
master's and the slaves' lists have in common is list_len, which is
already assumed to be known by all. The containers live in totally
distinct memory spaces and the elements are located at different
addresses. So everybody has to call mpi::skeleton(mylist) separately,
rather than _receive_ it from the master.

The actual data is passed by subsequent, possibly multiple, broadcast calls:

mpi::content c = mpi::get_content(mylist);
broadcast(world, c, root);

Now, is the content here anything more than MPI_BOTTOM? What extra
information does it carry? Is it possible to apply the same skeleton to
a different list<int> of the same size?

Another point, as I indicated before with the C example: is it possible
to create skeletons of different containers of the same value_type (say
list<int> on the master and vector<int>'s on the slaves) and pass the
contents around without a performance penalty?
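
What I have in mind with the last question can be sketched in plain C++,
with no MPI involved. The names pack_content and unpack_content are
hypothetical and made up for this illustration; this is an analogy for
separating structure from content, not a description of how Boost.MPI
implements mpi::skeleton or mpi::content.

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <vector>

// "Content": the element values flattened into one contiguous buffer.
template <class Container>
std::vector<int> pack_content(const Container& c) {
    return std::vector<int>(c.begin(), c.end());
}

// Overwrites the elements of an existing container of the right size.
// Its structure (size, node addresses) was set up locally and is untouched.
template <class Container>
bool unpack_content(const std::vector<int>& content, Container& c) {
    if (content.size() != c.size()) return false;
    std::copy(content.begin(), content.end(), c.begin());
    return true;
}

// A list<int> on the "master" side and a vector<int> on the "slave" side;
// returns the received digits as one number, or -1 on a size mismatch.
int demo_list_to_vector() {
    const int list_len = 3;
    std::list<int> on_master;
    for (int i = 1; i <= list_len; ++i) on_master.push_back(i);

    std::vector<int> on_slave(list_len);  // only the length is agreed upon
    assert(on_slave.size() == on_master.size());
    if (!unpack_content(pack_content(on_master), on_slave)) return -1;
    return on_slave[0] * 100 + on_slave[1] * 10 + on_slave[2];
}
```

Here the two sides share nothing but list_len, and the receiving side is free
to choose a different container type for the same values.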

*MPI-2 Bindings*

Somewhere in the documentation a reference to MPI-2 bindings is
made (http://tinyurl.com/jhjo8). Including modern MPI-2 bindings is of
course an important provision; however, as I indicated previously, there
are certain supercomputing centers which do not support MPI-2 yet. Does
this restrict the usage of Boost.MPI in any way? In other words, would
my code still compile and link if I avoid the MPI-2 subset of Boost.MPI?

To conclude my rather huge post, I was perhaps expecting Boost.MPI to be
a *boost*ed OOMPI implementation (OOMPI, for instance, does have a lot of
disturbing issues, especially related to user data types), one that amplifies
the elegant abstractions while incurring a minimal performance penalty. A
lot of work has been put into the user type manipulations and it is 
exciting to see the serialization library in such good use. However, 
with the essentials such as the communication statements, Boost.MPI does 
not seem to add much on top of the standard MPI C++ bindings.

Also, as indicated by the developers themselves, the current 
implementation supports only a limited subset of MPI 1.1. Yet, IMHO, 
given that this subset doesn't even include virtual topology creation, it
is perhaps a bit too early to have it in Boost.

Well, thanks for bearing with me this far. I hope I was able to provide 
useful feedback for this fresh library.

- Levent

Levent Yilmaz
