[Graph] Problem with large vertex ids

Stephen Woodbridge

15 May 2013 15 May '13

3:54 p.m.

Hi all, I'm trying to understand/fix some bugs in pgRouting that use the boost libraries. I know C not not C++, so I can understand most of the more C-ish parts. We have an old bug that I believe is caused when we add an edge and it has large vertex ids or when there is a large number between the min and max vertex numbers. We currently identify the min id number and down shift the ids by that, but that does not help if I add an edge (1)->(100000000). I could renumber all the nodes and then un-renumber them when done if I have to. Questions: 1. is this a known limitiation in Boost graph? 2. is there a different type of adjacency list mechanism that does not have this issue 3. currently the code crashes the database backend, ideally I would like to catch any issues in the C++ wrapper, free memory and return an error code to the C caller so we don't kill the server or leak lots of memory. Here is a link to our boost wrapper code: https://github.com/pgRouting/pgrouting/blob/sew-devel-2_0/src/dijkstra/src/b... We use that pattern for a lot of functions so if there is a better pattern, I might be able to update most(all?) of the functions that use this. I could really use some help resolving this issue. Meanwhile, I'm off to write a C main() to run this outside of the database and in gdb to see whats happening. Thanks, -Steve, trying to release pgRouting 2.0

Show replies by date

Jeremiah Willcock

15 May 15 May

5:19 p.m.

On Wed, 15 May 2013, Stephen Woodbridge wrote:

...

Hi all,

I'm trying to understand/fix some bugs in pgRouting that use the boost libraries. I know C not not C++, so I can understand most of the more C-ish parts. We have an old bug that I believe is caused when we add an edge and it has large vertex ids or when there is a large number between the min and max vertex numbers.

We currently identify the min id number and down shift the ids by that, but that does not help if I add an edge (1)->(100000000). I could renumber all the nodes and then un-renumber them when done if I have to.

For adjacency lists with vecS as the vertex container type (the default), there is an assumption that vertex numbers are in the range 0...num_vertices(g)-1 (inclusive). Thus, if you want to use a vertex number 10000, your graph will need to have at least 10001 vertices (before you add the edge). There are several data structures in the graph class that whose sizes are proportional to the number of vertices, so using very large vertex numbers can end up eating large amounts of memory. There should not be any limits on vertex counts other than available memory; the vertex index type is usually size_t or some other large-enough type if I remember correctly.

...

Questions:

1. is this a known limitiation in Boost graph?

We have run it with larger numbers of vertices before (but more often with compressed_sparse_row_graph), so it should not be a limitation.

...

2. is there a different type of adjacency list mechanism that does not have this issue

Are you using widely separated vertex numbers (i.e., you use large numbers as vertex IDs but don't actually have a large number of vertices)? If so, you can try using labeled_graph. Otherwise,

...

3. currently the code crashes the database backend, ideally I would like to catch any issues in the C++ wrapper, free memory and return an error code to the C caller so we don't kill the server or leak lots of memory.

I suspect you are hitting a buffer overflow here, so that won't be easy to trap. Compiling with _GLIBCXX_DEBUG defined (with GCC) will turn some of those problems into assertion failures, but that still won't help you get an error code. I think the best thing to try would be to fix any overflow issues; you are likely to get an exception if you try to add more vertices or edges than will fit into virtual memory (which probably won't happen, since you will most likely run out of physical memory first, and it's up to your OS what happens in that case).

...

Here is a link to our boost wrapper code:

https://github.com/pgRouting/pgrouting/blob/sew-devel-2_0/src/dijkstra/src/b...

We use that pattern for a lot of functions so if there is a better pattern, I might be able to update most(all?) of the functions that use this.

I see a few potential improvements: 1. If you have all of the edge properties up front, you can create the adjacency list (or a compressed_sparse_row_graph, which will use less memory) directly using your edges and their properties, rather than adding them one-at-a-time. Even if you do need to use add_edge, you can add the edge properties as a fourth argument to that function. The constructor to adjacency_list to consider using is: template <class EdgeIterator, class EdgePropertyIterator> adjacency_list(EdgeIterator first, EdgeIterator last, EdgePropertyIterator ep_iter, vertices_size_type n, edges_size_type m = 0, const GraphProperty& p = GraphProperty()) 2. Using a raw pointer as a property map (as on lines 115 and 117) often breaks, especially on Visual C++. The recipe to use for that is: boost::make_iterator_property_map (predecessors.begin(), get(boost::vertex_index, graph)) 3. You can use edge_predecessor_recorder as a visitor to Dijkstra's algorithm to get the edges in the path directly from the algorithm, rather than needing to find them yourself. Use http://www.boost.org/doc/libs/1_53_0/libs/graph/doc/bfs_visitor.html as a template for what to do, replacing predecessor_recorder with edge_predecessor_recorder and using a property map that stores edge_descriptors rather than vertex_descriptors. 4. Including "using namespace boost" and "using namespace std" together is likely to break in the future, since C++11's namespace std includes many of the same names as namespace boost does. Visual C++ has C++11 mode enabled by default in newer versions.

...

I could really use some help resolving this issue.

Meanwhile, I'm off to write a C main() to run this outside of the database and in gdb to see whats happening.

OK. You might be able to attach to the database server itself with gdb as well, but Valgrind might be a more useful tool for seeing what the problem is. -- Jeremiah Willcock

Stephen Woodbridge

7:12 p.m.

Hi Jeremiah, I always appreciate you concise and informative responses. Thank you. I feww comments below inline: On 5/15/2013 1:19 PM, Jeremiah Willcock wrote:

...

On Wed, 15 May 2013, Stephen Woodbridge wrote:

...
Hi all,

I'm trying to understand/fix some bugs in pgRouting that use the boost libraries. I know C not not C++, so I can understand most of the more C-ish parts. We have an old bug that I believe is caused when we add an edge and it has large vertex ids or when there is a large number between the min and max vertex numbers.

We currently identify the min id number and down shift the ids by that, but that does not help if I add an edge (1)->(100000000). I could renumber all the nodes and then un-renumber them when done if I have to.

For adjacency lists with vecS as the vertex container type (the default), there is an assumption that vertex numbers are in the range 0...num_vertices(g)-1 (inclusive). Thus, if you want to use a vertex number 10000, your graph will need to have at least 10001 vertices (before you add the edge). There are several data structures in the graph class that whose sizes are proportional to the number of vertices, so using very large vertex numbers can end up eating large amounts of memory. There should not be any limits on vertex counts other than available memory; the vertex index type is usually size_t or some other large-enough type if I remember correctly.

We seem to have two use cases that are common in pgRouting. 1. the user loads GIS data and we assign vertex ids that start with one and are sequential over the data set. These are usually not a problem for the existing code. 2. the user loads GIS data and the vender has assigned unique ids for the vertices. These tend to be arbitrary, non-sequential, and increase over time as the dataset evolves. So big numbers and not sequential. We really have no idea which kind of data we are going to get. The number of edges in the graph varies based on the spatial extents the the start and end points define expanded a little to catch edge conditions. So I guess we have to pick the worst case, and hold that that does not have too much overhead.

...

...
Questions:

1. is this a known limitiation in Boost graph?

We have run it with larger numbers of vertices before (but more often with compressed_sparse_row_graph), so it should not be a limitation.

This sounds interesting.

...

...
2. is there a different type of adjacency list mechanism that does not have this issue

Are you using widely separated vertex numbers (i.e., you use large numbers as vertex IDs but don't actually have a large number of vertices)? If so, you can try using labeled_graph. Otherwise,

Yes, both. sometimes we have a large number of vertices or in bug report I have 23 edges, but huge range on numbers between: 4401465 <-> 134860887

...

...
3. currently the code crashes the database backend, ideally I would like to catch any issues in the C++ wrapper, free memory and return an error code to the C caller so we don't kill the server or leak lots of memory.

I suspect you are hitting a buffer overflow here, so that won't be easy to trap. Compiling with _GLIBCXX_DEBUG defined (with GCC) will turn some of those problems into assertion failures, but that still won't help you get an error code. I think the best thing to try would be to fix any overflow issues; you are likely to get an exception if you try to add more vertices or edges than will fit into virtual memory (which probably won't happen, since you will most likely run out of physical memory first, and it's up to your OS what happens in that case).

OK, from my C test code it is dying with a BadAlloc. I was able to wrapper the whole C++ function body in: try{ ... } catch(...) { return ERROR; } And this allowed me to catch and return an error. I suspect that this is a little brute force, because without some finer grain checking this will probably leak memory depend on where it throws from.

...

...
Here is a link to our boost wrapper code:

https://github.com/pgRouting/pgrouting/blob/sew-devel-2_0/src/dijkstra/src/b...

We use that pattern for a lot of functions so if there is a better pattern, I might be able to update most(all?) of the functions that use this.

I see a few potential improvements:

1. If you have all of the edge properties up front, you can create the adjacency list (or a compressed_sparse_row_graph, which will use less memory) directly using your edges and their properties, rather than adding them one-at-a-time. Even if you do need to use add_edge, you can add the edge properties as a fourth argument to that function. The constructor to adjacency_list to consider using is:

template <class EdgeIterator, class EdgePropertyIterator> adjacency_list(EdgeIterator first, EdgeIterator last, EdgePropertyIterator ep_iter, vertices_size_type n, edges_size_type m = 0, const GraphProperty& p = GraphProperty())

2. Using a raw pointer as a property map (as on lines 115 and 117) often breaks, especially on Visual C++. The recipe to use for that is:

boost::make_iterator_property_map (predecessors.begin(), get(boost::vertex_index, graph))

3. You can use edge_predecessor_recorder as a visitor to Dijkstra's algorithm to get the edges in the path directly from the algorithm, rather than needing to find them yourself. Use http://www.boost.org/doc/libs/1_53_0/libs/graph/doc/bfs_visitor.html as a template for what to do, replacing predecessor_recorder with edge_predecessor_recorder and using a property map that stores edge_descriptors rather than vertex_descriptors.

4. Including "using namespace boost" and "using namespace std" together is likely to break in the future, since C++11's namespace std includes many of the same names as namespace boost does. Visual C++ has C++11 mode enabled by default in newer versions.

This is great feedback. I've add your response to put ticket and I've ask on out user list for a C++ user to step and help us out with these changes.

...

...
I could really use some help resolving this issue.

Meanwhile, I'm off to write a C main() to run this outside of the database and in gdb to see whats happening.

OK. You might be able to attach to the database server itself with gdb as well, but Valgrind might be a more useful tool for seeing what the problem is.

Yes, I know how to do that, but postgresql is compiled in such a way the gdb doesn't do much that is useful, I do regularly check our code with valgrind on the backend to make sure we are not leaking. As always, many thanks. -Steve

...

-- Jeremiah Willcock _______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users

Anthony Foiani

16 May 16 May

6:03 a.m.

Stephen Woodbridge <woodbri@swoodbridge.com> writes:

...

2. the user loads GIS data and the vender has assigned unique ids for the vertices. These tend to be arbitrary, non-sequential, and increase over time as the dataset evolves. So big numbers and not sequential.

We really have no idea which kind of data we are going to get. The number of edges in the graph varies based on the spatial extents the the start and end points define expanded a little to catch edge conditions.

I haven't used BGL very much, but when I did play around with user-assigned labels for the vertices, it was painful all the way around. If I were to do anything with the BGL today, I would maintain my own map from "externally meaningful id" to sequential integers (starting at 0 or 1). My analysis was that the mapping incurs a O(n) or O(n log n) cost twice (once to forward map, once to reverse map). However, the actual graph ops are so much faster with a vector than with any other representation, and looking up vertices by id happens many more times than twice. Maybe I don't understand the BGL. I would not be surprised. :) I did find it telling that the majority of the examples in the manual use small integers. Best, Anthony Foiani

Jeremiah Willcock

2:50 p.m.

On Thu, 16 May 2013, Anthony Foiani wrote:

...

Stephen Woodbridge <woodbri@swoodbridge.com> writes:

...
2. the user loads GIS data and the vender has assigned unique ids for the vertices. These tend to be arbitrary, non-sequential, and increase over time as the dataset evolves. So big numbers and not sequential.

We really have no idea which kind of data we are going to get. The number of edges in the graph varies based on the spatial extents the the start and end points define expanded a little to catch edge conditions.

I haven't used BGL very much, but when I did play around with user-assigned labels for the vertices, it was painful all the way around.

If I were to do anything with the BGL today, I would maintain my own map from "externally meaningful id" to sequential integers (starting at 0 or 1).

That probably does make the most sense, especially if you can use a hash table to do that mapping.

...

My analysis was that the mapping incurs a O(n) or O(n log n) cost twice (once to forward map, once to reverse map). However, the actual graph ops are so much faster with a vector than with any other representation, and looking up vertices by id happens many more times than twice.

Maybe I don't understand the BGL. I would not be surprised. :)

I did find it telling that the majority of the examples in the manual use small integers.

Your analysis is right -- the examples (and most of the work that I do with BGL myself) use small integers so they can be used as indexes into arrays. Many BGL algorithms won't work by default on graphs that don't have that kind of mapping available, even if the vertex descriptors themselves aren't small integers. -- Jeremiah Willcock

4472

Age (days ago)

4473

Last active (days ago)

List overview

Download

4 comments

3 participants

participants (3)

Anthony Foiani
Jeremiah Willcock
Stephen Woodbridge