distributed breadth first search doesn't work properly?

Hi, I am new to Boost and I am using the PBGL to write software that computes some graph analyses. I've installed Boost 1.44.0 and OpenMPI 1.5.

I tried to run the breadth first search example located in $BOOST_ROOT/libs/graph_parallel/example with the graph $BOOST_ROOT/libs/graph/test/weighted_graph.gr.

All went well until I figured out that the result is different if I run the algorithm with a different number of processors. Only with one processor are the values correct. Here's what happens:

mpirun -np 1 ./breadth_first_search.out

graph g { subgraph cluster_0 { label="p0"; n0[label=0]; n1[label=2]; n2[label=1]; n3[label=2]; n4[label=1]; }
n0 -- n2; n1 -- n1; n1 -- n3; n1 -- n4; n2 -- n1; n2 -- n3; n3 -- n4; n4 -- n0; n4 -- n1; }

---------------------

mpirun -np 2 ./breadth_first_search.out

graph g { subgraph cluster_0 { label="p0"; n0[label=0]; n1[label=1]; n2[label=1]; }
n0 -- n2; n1 -- n1; n1 -- n3; n1 -- n4; n2 -- n1; n2 -- n3;
subgraph cluster_1 { label="p1"; n3[label=1]; n4[label=0]; }
n3 -- n4; n4 -- n0; n4 -- n1; }

Is that correct? From the docs I can't figure out why...

-- Best regards, Mattia Lambertini

Hi, On Wednesday, 8. December 2010 10:49:54 Mattia Lambertini wrote:
Hi, I am new to Boost and I am using the PBGL to write software that computes some graph analyses. I've installed Boost 1.44.0 and OpenMPI 1.5.
I tried to run the breadth first search example located in $BOOST_ROOT/libs/graph_parallel/example
with the graph $BOOST_ROOT/libs/graph/test/weighted_graph.gr
All went well until I figured out that the result is different if I run the algorithm with a different number of processors. Only with one processor are the values correct.
Here's what happens:
mpirun -np 1 ./breadth_first_search.out
graph g { subgraph cluster_0 { label="p0"; n0[label=0]; n1[label=2]; n2[label=1]; n3[label=2]; n4[label=1]; }
n0 -- n2; n1 -- n1; n1 -- n3; n1 -- n4; n2 -- n1; n2 -- n3; n3 -- n4; n4 -- n0; n4 -- n1; }
---------------------
mpirun -np 2 ./breadth_first_search.out
graph g { subgraph cluster_0 { label="p0"; n0[label=0]; n1[label=1]; n2[label=1]; }
n0 -- n2; n1 -- n1; n1 -- n3; n1 -- n4; n2 -- n1; n2 -- n3;
subgraph cluster_1 { label="p1"; n3[label=1]; n4[label=0]; }
n3 -- n4; n4 -- n0; n4 -- n1; }
Is that correct?
I am not sure if this is correct, but given the concept of the PBGL, where vertices are local to processors, it seems reasonable. After all, the edges are the same; they are merely split up into different subgraphs, which again looks reasonable if one thinks of the subgraphs as local to the processors (due to the local vertices). Perhaps you could try running it on more processors and see whether you get the corresponding number of subgraphs?
From the docs I can't figure out why...
Best, Cedric

I am not sure if this is correct, but given the concept of the PBGL, where vertices are local to processors, it seems reasonable. After all, the edges are the same; they are merely split up into different subgraphs, which again looks reasonable if one thinks of the subgraphs as local to the processors (due to the local vertices).
I am sorry, I wasn't clear enough. I am talking about the 'label' attribute, which should be the distance (number of edges) from vertex 0 to every other reachable vertex (the owner of the attribute). Maybe I misunderstood the distance computed by the PBGL algorithm: is it not the global distance that the BFS algorithm usually computes?

-- Best regards, Mattia Lambertini
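P.S. To show what I mean by the "usual" BFS distance, here is a minimal sketch with the plain, sequential BGL on the same 5-vertex graph (my own test code, not the PBGL example; edge weights dropped, since BFS ignores them):

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/breadth_first_search.hpp>
#include <boost/graph/visitors.hpp>
#include <iostream>
#include <vector>

int main()
{
    using namespace boost;
    typedef adjacency_list<vecS, vecS, undirectedS> Graph;

    // The same 5-vertex graph, including the self-loop on n1.
    Graph g(5);
    add_edge(0, 2, g); add_edge(1, 1, g); add_edge(1, 3, g);
    add_edge(1, 4, g); add_edge(2, 1, g); add_edge(2, 3, g);
    add_edge(3, 4, g); add_edge(4, 0, g); add_edge(4, 1, g);

    std::vector<int> distance(num_vertices(g), 0);

    // On every BFS tree edge (u, v): distance[v] = distance[u] + 1.
    breadth_first_search(g, vertex(0, g),
        visitor(make_bfs_visitor(
            record_distances(&distance[0], on_tree_edge()))));

    for (std::size_t i = 0; i < num_vertices(g); ++i)
        std::cout << "n" << i << ": " << distance[i] << "\n";
    // Prints n0:0 n1:2 n2:1 n3:2 n4:1, i.e. the labels of the -np 1 run.
}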

On Wednesday, 8. December 2010 16:08:42 Mattia Lambertini wrote:
I am not sure if this is correct, but given the concept of the PBGL, where vertices are local to processors, it seems reasonable. After all, the edges are the same; they are merely split up into different subgraphs, which again looks reasonable if one thinks of the subgraphs as local to the processors (due to the local vertices).
I am sorry, I wasn't clear enough. I am talking about the 'label' attribute, which should be the distance (number of edges) from vertex 0 to every other reachable vertex (the owner of the attribute). Maybe I misunderstood the distance computed by the PBGL algorithm: is it not the global distance that the BFS algorithm usually computes?
Ok, got you now on this one. Following your explanation, this seems strange to me too. Perhaps you could attach the code of a minimal working example? I have no way to test distributed graphs (therefore no example files available for me), nevertheless I would like to have a look at it. Thank you. Best, Cedric

Perhaps you could attach the code of a minimal working example?
Of course. I can give you a link to the example I've tested [0], and here below is the graph I used (METIS format):

5 9 1
3 1
2 2 4 1 5 2
2 7 4 3
5 1
1 1 2 1

[0] http://www.boost.org/doc/libs/1_44_0/libs/graph_parallel/example/breadth_first_search.cpp

To compile the example you have to link against: boost_graph_parallel-mt, boost_mpi-mt, boost_system-mt

Thanks for your time.

-- Best regards, Mattia
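P.S. A typical compile line (the exact -mt suffixes and any -I/-L paths depend on how Boost was installed on your system) would be something like:

mpic++ breadth_first_search.cpp -o breadth_first_search.out -lboost_graph_parallel-mt -lboost_mpi-mt -lboost_system-mt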

Hi, On Friday, 10. December 2010 17:06:34 Mattia Lambertini wrote:
Perhaps you could attach the code of a minimal working example?
Of course. I can give you a link to the example I've tested [0], and here below is the graph I used (METIS format):
5 9 1
3 1
2 2 4 1 5 2
2 7 4 3
5 1
1 1 2 1
[0] http://www.boost.org/doc/libs/1_44_0/libs/graph_parallel/example/breadth_first_search.cpp
To compile the example you have to link against:
boost_graph_parallel-mt, boost_mpi-mt, boost_system-mt
Thanks for your time.
Thank you for the link and all. There is something in the documentation at http://www.osl.iu.edu/research/pbgl/documentation/breadth_first_search.html that makes me believe the observed behavior is actually normal. If you look at "Making Visitors Safe for Distributed BFS", point 3, it basically says that the "best" value needs to be stored for each vertex. Looking at your results, this seems to be the case, as the distances are all smaller than or equal to the distances in the single-process case.

The only thing that actually confuses me about this is that there is a second node with label "0" in the distributed execution result. It looks as if the BFS algorithm started there again. Because that node seems to be local to a different process than the original start node, this could indeed be the case, but I am not sure.

Unfortunately, I am not familiar with METIS at all, so is there a way to provide the underlying graph in a more intuitive format? An actual image would be nice, or again the dot output for Graphviz. This would make it easier to understand how the graph is actually partitioned among the processes, and perhaps one could then understand why the "better" (smaller) values are set. Again, I have no distributed Boost here, nor MPI, so I can't test it on my own, sorry.

Best, Cedric

In addition to my previous answer, your observed behavior seems more and more reasonable to me. It may seem incorrect, but IMHO that is a question of the point of view. Solutions to some problems differ when going from sequential to distributed computing, and the concept chosen by the PBGL implies that the "best" value for the distance should be stored here.

But now to explain why I further support the observed behavior. The algorithms in the (P)BGL are based on visitors that define event points. One of those is the initialization of a vertex. Due to the distributed execution, you have two (or more) processes that can start independently. Thus, these two processes will usually move from initializing to discovering (another event point) at different moments. If you look at the code of the distributed BFS visitor, you will see the tree_edge() function definition, which relies on the stored value of the source in order to assign the appropriate distance value to the target of an edge (see the small sketch after the quoted message below). Due to the distributed computation, vertex n4 can/will start discovering while vertex n0 is _still_ initializing (the formulation is probably not 100% correct, I just want to illustrate the general idea). Because vertex n4 has not yet been discovered by n0 (only initialized, or maybe even default-initialized), it will discover vertex n3 and assign it the discovery value of n4 (which is 0) plus 1. When n0 now discovers vertex n4, it would like to assign it the value 1, but this is prohibited by set_property_map_role(). The same holds for n3.

What further supports this is the label (distance value) of vertex n1. In your single-process run it gets a value of 2, while in the multi-process run this changes to 1. Because this is a BFS, n1 is discovered at the same level as n3, via n4. Thus, it gets the same value as n3 and will not be assigned a value by the discovery via n0 (where it appears one level deeper in the underlying graph).

The only thing I have not come up with a satisfying explanation for is why the second BFS starts at n4 (I mean, it is the "last" vertex). I suspect that this is due to an initialization by vertex n0, and only then can a second process start. After all, it is one of the vertices that will be initialized first, probably even the first vertex to be initialized besides n0. If the second (or third, etc.) process were started arbitrarily, then the results of the multi-process run should differ between some consecutive executions, at least if it were truly random.

In this context it would also be interesting to see how the whole approach behaves when it is run on more than two processes, or with a different graph, preferably one whose initial (start) vertex has degree > 2 (possibly 3 for simplicity).

Best, Cedric

On Friday, 10. December 2010 17:06:34 Mattia Lambertini wrote:
Perhaps you could attach the code of a minimal working example?
Of course. I can give you a link to the example I've tested [0], and here below is the graph I used (METIS format):
5 9 1
3 1
2 2 4 1 5 2
2 7 4 3
5 1
1 1 2 1
[0] http://www.boost.org/doc/libs/1_44_0/libs/graph_parallel/example/breadth_first_search.cpp
To compile the example you have to link against:
boost_graph_parallel-mt, boost_mpi-mt, boost_system-mt
Thanks for your time.
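Here is the small sketch I mentioned above. It is only my own illustration of the mechanism (not the actual distributed visitor from the PBGL): on every tree edge the target simply inherits the source's distance plus one, so whichever process reaches a vertex first determines the value written for it.

#include <boost/graph/breadth_first_search.hpp>
#include <boost/property_map/property_map.hpp>

// Hypothetical stand-in for the tree_edge() behaviour described above.
template <class DistanceMap>
struct distance_on_tree_edge : boost::default_bfs_visitor
{
    explicit distance_on_tree_edge(DistanceMap d) : distance(d) {}

    template <class Edge, class Graph>
    void tree_edge(Edge e, const Graph& g)
    {
        // distance[target] = distance[source] + 1
        boost::put(distance, target(e, g), boost::get(distance, source(e, g)) + 1);
    }

    DistanceMap distance;
};

In the sequential case this is essentially what record_distances() does; in the distributed case the interesting part is what happens when two processes disagree about the value for the same target.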

I am not familiar with METIS at all, so is there a way to provide the underlying graph in a more intuitive format? An actual image would be nice. Or again, the dot-output for graphviz?
An image of the graph distributed among two processes: http://lamberti.web.cs.unibo.it/bfs.png

If I get your point, process 0 starts a BFS with n0 as root and process 1 starts with n4 as root, because n0 and n4 are the local node 0 of each process. And the output is the minimum distance value of each node, whether discovered via n0 or via n4? To what end?

This is the output with three processes:

graph g { subgraph cluster_0 { label="p0"; n0[label=0]; n1[label=0]; }
n0 -- n2; n1 -- n1; n1 -- n3; n1 -- n4;
subgraph cluster_1 { label="p1"; n2[label=0]; n3[label=1]; }
n2 -- n1; n2 -- n3; n3 -- n4;
subgraph cluster_2 { label="p2"; n4[label=0]; }
n4 -- n0; n4 -- n1; }

This seems to confirm your explanation (at least what I understood of it).

-- Best regards, Mattia

On Monday, 13. December 2010 16:57:23 Mattia Lambertini wrote:
I am not familiar with METIS at all, so is there a way to provide the underlying graph in a more intuitive format? An actual image would be nice. Or again, the dot-output for graphviz?
An image of the graph distributed among two processes. http://lamberti.web.cs.unibo.it/bfs.png
Thank you. I confused the graph here with the resulting BFS tree.
If I get your point, process 0 starts a BFS with n0 as root and process 1 starts with n4 as root, because n0 and n4 are the local node 0 of each process.
Although it might be that process 1 sees n4 as its root, I don't see why this should be the case. It is more that the BFS starting at n0 reaches n4 and thus might take this node as the start node of a second BFS running in process 1. But you seem to have gotten the general idea.
And the output is the minimum distance value of each node, whether discovered via n0 or via n4? To what end?
The reason why it takes the minimum distance is simply "by design". If you look at my last mail, a distributed algorithm needs to take care that an already visited/finished part of the problem is not arbitrarily changed at a later moment during its execution. Because a BFS starting at n4 will reach one of its neighbors "faster" than it would when starting at n0, n4 gets the first chance to write a result (although a node per se usually does not perform an action...). When n0 now arrives at a node that n4 has already finished, the program needs to know how to handle that situation. And the BFS visitor of the PBGL simply states that the value of a node (which will be assigned to the label in the output) is supposed to be the smallest.

If I correctly interpret the question "To what end?", you are wondering why it is the smallest value and not the one it would have when executed with a single process? If so, this would impose a check of the process ID, and in the long run this would lead, IMHO, to a distributed BFS version that is effectively the same as the serial version. After all, it would have to wait each time for process 0 to reach that node, and the running time would become equal (maybe worse, due to the distribution overhead) to the serial version. So would/should the result of the distributed execution...

This is what the documentation says:

<quote>
3. Since multiple distance values may be written for each vertex, we must always choose the best value we can find to update the distance map. The distributed property map distance needs to resolve distances to the smallest distance it has seen. For instance, process 0 may find vertex u at level 3 but process 1 finds it at level 5: the distance must remain at 3. To do this, we state that the property map's role is as a distance map, which introduces an appropriate reduction operation:

set_property_map_role(vertex_distance, distance);
</quote>
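To illustrate what that reduction amounts to, conceptually (this is only a sketch of the idea, not the PBGL interface), resolving a conflicting write boils down to:

// Conceptual sketch only: with the distance-map role set, two writes to the
// same vertex are resolved by keeping the smaller ("best") value.
template <class Dist>
Dist resolve_distance(Dist stored, Dist incoming)
{
    return incoming < stored ? incoming : stored;
}

Note that a rule like this only decides which value ends up being stored for a vertex; it does not by itself say anything about continuing the search from a vertex whose value just improved.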
This is the ouput with three processes:
graph g { subgraph cluster_0 { label="p0"; n0[label=0]; n1[label=0]; }
n0 -- n2; n1 -- n1; n1 -- n3; n1 -- n4;
subgraph cluster_1 { label="p1"; n2[label=0]; n3[label=1]; }
n2 -- n1; n2 -- n3; n3 -- n4;
subgraph cluster_2 { label="p2"; n4[label=0]; }
n4 -- n0; n4 -- n1; }
This seems to confirm your explanation (at least what I understood of it).
It seems so, yes. However, I am wondering why node n1 (the one with the self-loop) is assigned a value of 0. If my explanation holds, this could only occur when a separate process is started on that node. But the grouping of the subgraphs (matching the number of processes) does not indicate this. It might very well be that this depends on the internals of the distributed BFS implementation, or on the general concept of distributing vertices over processes in PBGL/MPI/whatever... It might sound a bit vague, and could be a little too biased to count as valid speculation, but it may be that process 0 becomes available after finding out that all adjacent vertices have their own processes running, and process 0 is then "spawned" freshly on node n1, which is at the same time a neighbor of n4 and n2. The reason why n1 gets its "own" process might be similar to the reason why n4 gets its "own" process in the two-process execution.

Recognizing (or validating) a pattern might need further executions with higher process counts and/or running the BFS on other graphs. It might be of particular interest to see how a linear graph behaves, where each vertex has only one predecessor and one successor (except for the "source" and the "sink").

Unfortunately, I can't really confirm that this is the correct behavior, nor can I say that something in particular is going wrong. This probably doesn't help you much; I apologize for that. Maybe you could tell me why you are particularly interested in a distributed BFS? There are of course obvious reasons like "it exists and I have the hardware here, so I want to use it". It might be that your reason is more specific and there is an alternative or a workaround possible?!

I do understand that you are supposed to solve a particular problem and not deal with the correction or validation of some code. The sad truth is that working on an actual problem and verifying that the basis (even for off-the-shelf products) works as expected (here, rather, as designed) go more or less hand in hand. I have had to experience that several times already. So unless someone else drops in to confirm (or not) the behavior of the PBGL BFS, it would be interesting (at least for me ;)) to see how it behaves in different scenarios (like more processes or linear graphs) in order to shed some more light on this.

Best, Cedric

If I correctly interpret the question "To what end?", you are wondering why it is the smallest value and not the one it would have when executed with a single process?
right.
If so, this would impose a check of the process ID, and in the long run this would lead, IMHO, to a distributed BFS version that is effectively the same as the serial version. After all, it would have to wait each time for process 0 to reach that node, and the running time would become equal (maybe worse, due to the distribution overhead) to the serial version. So would/should the result of the distributed execution...
IMHO it doesn't. A distributed algorithm should return the same output as the sequential algorithm, but in less time (wall-clock time). Obviously not all algorithms can be distributed efficiently, but the goal of a distributed algorithm is to spread the computation over many nodes (give a slice of the global computation to each node) and at the end obtain the same result as the sequential one. All the other distributed algorithms I have used work this way. I don't want to sound arrogant, but I still don't understand the reasons behind a design like this one (I don't demand that you provide all the answers :)). What does such a result mean? Is it useful for something?
This is what the documentation says: <quote> 3. Since multiple distance values may be written for each vertex, we must always choose the best value we can find to update the distance map. The distributed property map distance needs to resolve distances to the smallest distance it has seen. For instance, process 0 may find vertex u at level 3 but process 1 finds it at level 5: the distance must remain at 3. To do this, we state that the property map's role is as a distance map, which introduces an appropriate reduction operation:
set_property_map_role(vertex_distance, distance); </quote>
I thought that the documentation, in that section, was talking about the smallest distance from the root node. Let me explain: process 0 has found that the distance from vertex s to vertex u is 5, through the path s -> n1 -> n2 -> n3 -> n4 -> u. But process 1 has found that the distance from vertex n3 to vertex u is equal to 1 (they are connected). Thus, in the communication phase (according to BSP [0]), processes 0 and 1 exchange their knowledge of the world, and the result should be the shorter path s -> n1 -> n2 -> n3 -> u (distance 4).
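To make that concrete, here is a rough sketch of the combine step I would expect at a BSP superstep boundary (my own pseudocode, nothing from the PBGL): the owner of a vertex keeps the minimum of the incoming candidate distances and, if the value improved, puts the vertex back on the frontier, so the better distance keeps propagating along s -> n1 -> n2 -> n3 -> u.

#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

typedef std::pair<std::size_t, int> Update; // (local vertex, candidate distance)

void combine_superstep(std::vector<int>& distance,
                       const std::vector<Update>& incoming,
                       std::queue<std::size_t>& next_frontier)
{
    for (std::size_t i = 0; i < incoming.size(); ++i) {
        const std::size_t v = incoming[i].first;
        const int d = incoming[i].second;
        if (d < distance[v]) {       // same min-rule as the reduction...
            distance[v] = d;
            next_frontier.push(v);   // ...but the improvement is re-explored
        }
    }
}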
Maybe you could tell me why you are particularly interested in a distributed BFS? There are of course obvious reasons like "it exists and I have the hardware here, so I want to use it". It might be that your reason is more specific and there is an alternative or a workaround possible?!
We have to analyze huge graphs (millions of nodes) and we simply want to do it in the shortest possible time. We tried a sequential and a parallel (shared-memory) approach, but they take too much time. I want to thank you very much for your time.

[0] http://www.osl.iu.edu/research/pbgl/documentation/process_group.html

-- Best regards, Mattia Lambertini

On Monday, 13. December 2010 22:37:20 Mattia Lambertini wrote:
If I correctly interpret the question "To what end?", you are wondering why it is the smallest value and not the one it would have when executed with a single process?
right.
If so, this would impose a check of the process ID, and in the long run this would lead, IMHO, to a distributed BFS version that is effectively the same as the serial version. After all, it would have to wait each time for process 0 to reach that node, and the running time would become equal (maybe worse, due to the distribution overhead) to the serial version. So would/should the result of the distributed execution...
IMHO it doesn't. A distributed algorithm should return the same output as the sequential algorithm, but in less time (wall-clock time). Obviously not all algorithms can be distributed efficiently, but the goal of a distributed algorithm is to spread the computation over many nodes (give a slice of the global computation to each node) and at the end obtain the same result as the sequential one. All the other distributed algorithms I have used work this way.
I don't want to sound arrogant, but I still don't understand the reasons behind a design like this one (I don't demand that you provide all the answers :)). What does such a result mean? Is it useful for something?
I agree with you. It's just that it was designed this way for some reason, and that is exactly how it seems to work. To find out the real reasons for this design decision, you could try mailing the authors of the PBGL BFS personally.
This is what the documentation says: <quote> 3. Since multiple distance values may be written for each vertex, we must always choose the best value we can find to update the distance map. The distributed property map distance needs to resolve distances to the smallest distance it has seen. For instance, process 0 may find vertex u at level 3 but process 1 finds it at level 5: the distance must remain at 3. To do this, we state that the property map's role is as a distance map, which introduces an appropriate reduction operation:
set_property_map_role(vertex_distance, distance); </quote>
I thought that the documentation, in that section, was talking about the smallest distance from the root node. Let me explain:
Process 0 has found that the distance from vertex s to vertex u is 5, through the path s -> n1 -> n2 -> n3 -> n4 -> u.
But process 1 has found that the distance from vertex n3 to vertex u is equal to 1 (they are connected). Thus, in the communication phase (according to BSP [0]), processes 0 and 1 exchange their knowledge of the world, and the result should be the shorter path s -> n1 -> n2 -> n3 -> u (distance 4).
This is indeed a good point... which makes me suggest again an e-mail directly to the original authors. I had hoped that keeping the discussion alive would either result in a solution found by discussion or in the participation of the authors or someone more familiar with the topic. This discussion could also indicate the need to update the documentation, perhaps with a concrete example demonstrating the behavior of the PBGL BFS.
Maybe you could tell me why you are particularly interested in a distributed BFS? There are of course obvious reasons like "it exists and I have the hardware here, so I want to use it". It might be that your reason is more specific and there is an alternative or a workaround possible?!
We have to analyze huge graphs (millions of nodes) and we simply want to do it in the shortest possible time. We tried a sequential and a parallel (shared-memory) approach, but they take too much time.
I want to thank you very much for your time.
You are welcome.
[0] http://www.osl.iu.edu/research/pbgl/documentation/process_group.html
Best, Cedric

Hi, I don't know whether you are still following this, but the problem that you reported in December of last year kept puzzling me, and I tried to get it cleared up after all. For a summary, you can look at http://groups.google.com/group/boost-list/browse_thread/thread/273160db809f1702?fwc=2

Basically, what I understand so far is that the BFS example seems to be "wrong" because the reduction operation (among others) is not appropriate. However, it seems that it will be fixed soon.

Best, Cedric

On Wednesday, 8. December 2010 10:49:54 Mattia Lambertini wrote:
Hi, I am new to Boost and I am using the PBGL to write software that computes some graph analyses. I've installed Boost 1.44.0 and OpenMPI 1.5.
I tried to run the breadth first search example located in $BOOST_ROOT/libs/graph_parallel/example
with the graph $BOOST_ROOT/libs/graph/test/weighted_graph.gr
All went well until I figured out that the result is different if I run the algorithm with a different number of processors. Only with one processor are the values correct.
Here's what happens:
mpirun -np 1 ./breadth_first_search.out
graph g { subgraph cluster_0 { label="p0"; n0[label=0]; n1[label=2]; n2[label=1]; n3[label=2]; n4[label=1]; }
n0 -- n2; n1 -- n1; n1 -- n3; n1 -- n4; n2 -- n1; n2 -- n3; n3 -- n4; n4 -- n0; n4 -- n1; }
---------------------
mpirun -np 2 ./breadth_first_search.out
graph g { subgraph cluster_0 { label="p0"; n0[label=0]; n1[label=1]; n2[label=1]; }
n0 -- n2; n1 -- n1; n1 -- n3; n1 -- n4; n2 -- n1; n2 -- n3;
subgraph cluster_1 { label="p1"; n3[label=1]; n4[label=0]; }
n3 -- n4; n4 -- n0; n4 -- n1; }
Is that correct?
From the docs I can't figure out why...