Boost Graph: preventing duplicate vertices when constructing graph incrementally

29 Mar 2010

      Hi,

The vast majority of Boost Graph's sample code has pre-defined source  
data. For example, from city_visitor.cpp:

   enum { SanJose, SanFran, LA, SanDiego, Fresno, LasVegas, Reno,
          Sacramento, SaltLake, Phoenix, N };
   E edge_array[] = { E(Sacramento, Reno), E(Sacramento, SanFran),
                      E(Reno, SaltLake),
                      E(SanFran, SanJose),
                      E(SanJose, Fresno), E(SanJose, LA),
                      E(LA, LasVegas), E(LA, SanDiego),
                      E(LasVegas, Phoenix) };

Unfortunately, static source data with enums as the vertices isn't  
very typical in real-world programs. It's much more common to  
construct the graph incrementally from a streaming data set, such as a  
file containing a list of edge pairs. But this presents a problem: As  
the list of edges is read in, a vertex may be encountered more than  
once, so add_vertex can't simply be called on every vertex found  
because that will result in redundant vertices.

An obvious fix is to check whether a vertex exists before calling  
add_vertex. If it is found, it can be used for the edge; otherwise  
add_vertex can be called and the returned value used for the edge. The  
only question is, How do I check whether a vertex exists? I could  
iterate through all vertices in the graph, searching for one that  
equals the one I'm trying to add, but this approach doesn't scale  
well. It has a time complexity of O(V).

A faster solution to this problem appears to be in the  
boost_web_graph.cpp sample. It deals with the same issue: It reads in  
graph data from a file called boost_web.dat, and this file contains  
the same vertex value in multiple places. Rather than call add_vertex  
multiple times (creating redundancies), the code puts every vertex it  
finds into a map. The map associates the vertex's unique value (a  
string in this case) with its corresponding vertex descriptor. Then,  
before calling add_vertex, the code checks the map for the unique  
value; if found, it uses the corresponding vertex descriptor;  
otherwise it calls add_vertex and adds the new ID/descriptor pair to  
the map.

I assume that this is faster (compared to my naive approach of  
searching linearly through the vertices) because std::map's insert  
function is supposed to have logarithmic time complexity. It probably  
achieves this with a red-black tree, so as long as the keys are  
sortable, O(log(V)) complexity is guaranteed. Strings (used in the  
sample code) are certainly sortable, although in my program I'm using  
pointers as the vertex property. But I guess pointers are sortable  
too, since they're just numbers really. My initial tests show that  
this approach works properly with pointers as the values, though I've  
not done any performance testing, only correctness testing.

Any feedback on this would be appreciated. Thanks,

Trevor

Trevor Harmon

tags

participants (1)