
On Oct 13, 2011, at 9:53 AM, Yan wrote:
Hi all,

I'm learning to use the distributed adjacency list and the distributed Dijkstra shortest-paths algorithm. My test data contains about 3,800,000 edges and 2,800,000 vertices. I pass the data to the distributed adjacency list through a pair of iterators, like this:

    Graph g(reader.begin(), reader.end(), reader.weight_begin(),
            reader.num_vertices(), pg, dist);

"reader" is a user-defined class that supplies the required data (edges, vertices, weights). "pg" is defined as "mpi_process_group pg;" and "dist" is supplied by the reader as "graph::metis_distribution dist = reader.distribution();", based on a METIS partition file corresponding to the test data. I run 2 processes on my notebook computer (Intel Core 2 2.0 GHz, 2 GB RAM, Windows XP SP3).
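A stripped-down sketch of what I'm doing is below. It is only an outline, not my exact code: the file names are placeholders, I substitute the library's own graph::metis_reader for my custom reader class, and the interior vertex_distance property is just one way to hold the results.

    #include <boost/graph/use_mpi.hpp>
    #include <boost/mpi/environment.hpp>
    #include <boost/graph/distributed/mpi_process_group.hpp>
    #include <boost/graph/distributed/adjacency_list.hpp>
    #include <boost/graph/dijkstra_shortest_paths.hpp>
    #include <boost/graph/metis.hpp>
    #include <fstream>

    using namespace boost;
    using boost::graph::distributed::mpi_process_group;

    // Distributed graph with an interior distance property on vertices
    // and an interior weight property on edges.
    typedef adjacency_list<vecS,
                           distributedS<mpi_process_group, vecS>,
                           directedS,
                           property<vertex_distance_t, int>,
                           property<edge_weight_t, int> >
      Graph;

    int main(int argc, char* argv[])
    {
      boost::mpi::environment env(argc, argv);
      mpi_process_group pg;

      // METIS graph file and the matching partition file
      // (file names here are placeholders).
      std::ifstream in_graph("weighted_graph.metis");
      std::ifstream in_partitions("weighted_graph.metis.part.2");
      graph::metis_reader reader(in_graph);
      graph::metis_distribution dist(in_partitions, process_id(pg));

      // Same constructor form as above.
      Graph g(reader.begin(), reader.end(), reader.weight_begin(),
              reader.num_vertices(), pg, dist);

      // Run Dijkstra from global vertex 0; the interior vertex_distance
      // property map of a distributed adjacency list is already distributed.
      dijkstra_shortest_paths(g, vertex(0, g),
                              distance_map(get(vertex_distance, g)));
      return 0;
    }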
I also tested the same data with the sequential BGL Dijkstra shortest-paths algorithm on this computer.
The results:

  Parallel BGL:   creating the graph takes *10080 seconds*; the Dijkstra computation takes 28.281 seconds.
  Sequential BGL: creating the graph takes *4.563 seconds*; the Dijkstra computation takes 1.703 seconds.
As you can see, constructing the distributed adjacency list takes an enormous amount of time. Is this normal?
I have also done some more comparisons between the 2-process parallel Dijkstra on one Core 2 computer and the sequential Dijkstra on the same computer, using data sets of different scales. In every case the sequential BGL Dijkstra takes less time than the 2-process parallel Dijkstra. What is the problem? I have read the Boost documentation at "boost_1_46_1\libs\graph_parallel\doc\html\dijkstra_shortest_paths.html", whose charts show good performance and speedup. Doesn't that mean it should take less time than the sequential program?
Any advice would be appreciated.
Thanks,
Yan
While I can't offer categorical answers, my first reaction is that you are very likely paging in the example you describe. With only 2 GB of RAM, the memory consumed by the mpi_process_group (discussed in another thread on this list; look there for a description of how to reduce it) plus whatever else is running on your laptop, it's entirely possible that you're running out of memory. Once the PBGL begins to page between mpi_process_group buffers and graph data, the runtime becomes entirely unpredictable, and if the degree of paging is high, very, very long.

It's also worth noting that the parallel speedup of the PBGL vs. the sequential BGL depends on the graph structure and size; it is entirely possible, and not uncommon, to find graphs for which no speedup or negative speedup is observed. The distributed communication mechanisms in the PBGL incur non-trivial copying and serialization overhead. The serialization overhead is so high that I have seldom found it beneficial to run multiple MPI ranks within a single address space (as you are doing on your two-core laptop). There are a number of ways to speed up serialization, such as defining BOOST_MPI_HOMOGENEOUS to avoid MPI_Pack()/MPI_Unpack() overhead, amongst other optimizations.

I considered rewriting PBGL to make it possible to avoid serialization entirely, and in fact the new prototype active-message-based version does this, but it's quite a bit of work and I likely won't have the time to make similar changes to the Boost version of the library. The new version of the library also incorporates parallelism within address spaces via threading or (potentially) off-loading to accelerators, rather than running multiple MPI processes.

Hope that answers your question and gives you some ideas about where you can expect to find (known) performance issues in PBGL, how to ameliorate them to some degree... as well as what you could do to improve PBGL if you're so inclined!

Cheers,
Nick
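P.S. A rough sketch of what I mean by the BOOST_MPI_HOMOGENEOUS route. The exact build setup will vary; the important point is that the macro is seen consistently everywhere Boost.MPI code is compiled (your application and the Boost.MPI library build itself), so in practice you would normally pass it on the compiler command line rather than #define it in one file as shown here.

    // Illustrative only: on a homogeneous set of machines, Boost.MPI can skip
    // the MPI_Pack()/MPI_Unpack() path and ship data as raw bytes when
    // BOOST_MPI_HOMOGENEOUS is defined. Normally passed as
    // -DBOOST_MPI_HOMOGENEOUS to every translation unit (and when building
    // the Boost.MPI library) rather than defined in a single source file.
    #define BOOST_MPI_HOMOGENEOUS

    #include <boost/graph/use_mpi.hpp>
    #include <boost/mpi/environment.hpp>
    #include <boost/graph/distributed/mpi_process_group.hpp>

    int main(int argc, char* argv[])
    {
      boost::mpi::environment env(argc, argv);
      boost::graph::distributed::mpi_process_group pg;
      // ... build the distributed graph and run dijkstra_shortest_paths
      //     exactly as before; only the serialization path changes ...
      return 0;
    }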