BGL: interrupting breadth_first_search() at a certain distance/predicate?
Hi all, I have a graph where I start at a certain vertex and want to find the next undiscovered vertex at a maximum distance of 5 or so from the source vertex. Now I am running breadth_first_search() using a record_distances() visitor and I get the distances to all other vertices (say 10 000 of them) in my graph. This makes my algorithm slow, since I only need to know about vertices at a max distance of 5-6, not all of them. Is there a way of interrupting the algorithm when a certain predicate function supplied by me evaluates to true? (e.g. max distance reached, or a new interesting/valid vertex found, etc) thanks, AW
Hi!
AFAIK you should throw an exception from the visitor. At least this is what
came out in this year's Boost Con Graph Talk.
Regards,
Ovanes
On Mon, Aug 2, 2010 at 4:46 PM, Anders Wallin
ed by me evaluates to true? (e.g. max distance reached, or a new interesting/valid vertex found, etc)
On Aug 2, 2010, at 10:46 AM, Anders Wallin wrote:
Hi all,
I have a graph where I start at a certain vertex and want to find the next undiscovered vertex at a maximum distance of 5 or so from the source vertex. Now I am running breadth_first_search() using a record_distances() visitor and I get the distances to all other vertices (say 10 000 of them) in my graph. This makes my algorithm slow, since I only need to know about vertices at a max distance of 5-6, not all of them. Is there a way of interrupting the algorithm when a certain predicate function supplied by me evaluates to true? (e.g. max distance reached, or a new interesting/valid vertex found, etc)
For this particular job I think I'd build a wrapper around your graph that presents the nearby sub-graph. I think there may already be a filtered_graph adaptor you can leverage in the BGL. -- David Abrahams BoostPro Computing http://boostpro.com
On Aug 2, 2010, at 10:46 AM, Anders Wallin wrote:
Hi all,
I have a graph where I start at a certain vertex and want to find the next undiscovered vertex at a maximum distance of 5 or so from the source vertex. Now I am running breadth_first_search() using a record_distances() visitor and I get the distances to all other vertices (say 10 000 of them) in my graph. This makes my algorithm slow, since I only need to know about vertices at a max distance of 5-6, not all of them. Is there a way of interrupting the algorithm when a certain predicate function supplied by me evaluates to true? (e.g. max distance reached, or a new interesting/valid vertex found, etc)
For this particular job I think I'd build a wrapper around your graph that presents the nearby sub-graph. I think there may already be a filtered_graph adaptor you can leverage in the BGL.
Yes, I've been working with filtered_graph recently. It is lightweight and quick, so you could just filter on distance from the vertex and do the search on the filtered_graph. I'm sure that the filtering would be quicker than searching all other vertices. See recent thread for example code on filtered_graph. Hope that helps, Adam. -- Dr Adam Spargo High Performance Assembly Group email: aws@sanger.ac.uk Wellcome Trust Sanger Institute Tel: +44 (0)1223 834244 x7728 Hinxton, Cambridge CB10 1SA Fax: +44 (0)1223 494919 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
Eventhough I love filtered graphs I do not see how it would work in your situation assuming that you don't always use the same source vertex (If you do use the same source vertex it will be a good option). I would probably make my own breath first search on the boost graph that kept track of the depth and stopped after depth 5-6. Best Line ________________________________________ Fra: boost-users-bounces@lists.boost.org [boost-users-bounces@lists.boost.org] På vegne af Anders Wallin [anders.e.e.wallin@gmail.com] Sendt: 2. august 2010 16:46 Til: boost-users Emne: [Boost-users] BGL: interrupting breadth_first_search() at a certain distance/predicate? Hi all, I have a graph where I start at a certain vertex and want to find the next undiscovered vertex at a maximum distance of 5 or so from the source vertex. Now I am running breadth_first_search() using a record_distances() visitor and I get the distances to all other vertices (say 10 000 of them) in my graph. This makes my algorithm slow, since I only need to know about vertices at a max distance of 5-6, not all of them. Is there a way of interrupting the algorithm when a certain predicate function supplied by me evaluates to true? (e.g. max distance reached, or a new interesting/valid vertex found, etc) thanks, AW _______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
Eventhough I love filtered graphs I do not see how it would work in your situation assuming that you don't always use the same source vertex (If you do use the same source vertex it will be a good option). I would probably make my own breath first search on the boost graph that kept track of the depth and stopped after depth 5-6.
Best Line
I was thinking to reset the filtered graph at every source vertex. Roughly equivalent to keeping track of depth, without having to write the code. -- Dr Adam Spargo High Performance Assembly Group email: aws@sanger.ac.uk Wellcome Trust Sanger Institute Tel: +44 (0)1223 834244 x7728 Hinxton, Cambridge CB10 1SA Fax: +44 (0)1223 494919 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
I was thinking to reset the filtered graph at every source vertex. Roughly equivalent to keeping track of depth, without having to write the code.
thanks for all replies, I will try writing an exception-throwing visitor or my own depth-limited bfs. I should probably describe my application, since there might be a much better solution than the one I am trying. As part of a computer-aided-manufacturing toolpath calculation I am producing planar graphs like this: http://tinyurl.com/377lvt8 red edges go in the X-direction, blue edges in the Y-direction. the vertices I am interested in are drawn as blue/red dots(the color does not matter here), and the problem is to find the correct order of these vertices which creates a "silhouette" or "contour" of the graph. A naive solution would just hop from one dot to the closest one, but in more complicated cases this is not correct, so I would prefer traversing the graph to the next dot. sometimes the graph has many disconnected components: (no colored dots in this pic, and nevermind the white triangulated surface...) http://tinyurl.com/3aefpgn and note that the donut-shaped component of the graph has two of these "loops" I am interested in, one on the outside, one on the inside. I've been thinking about two approaches: 1) start at one vertex, search in the neighbourhood of this vertex for the next one, figure out which one of the found points is the correct next vertex. iterate. 2) something based on the idea of surface tension, or some kind of relaxation/removal/addition of edges. start in the middle of the graph and work your way towards the outside and when we are finished only the correct edges remain. any ideas? AW
On Aug 3, 2010, at 12:51 PM, Anders Wallin wrote:
I was thinking to reset the filtered graph at every source vertex. Roughly equivalent to keeping track of depth, without having to write the code.
thanks for all replies, I will try writing an exception-throwing visitor or my own depth-limited bfs.
Really, that approach is likely to have very poor performance for your application. You can use the exception approach once for a very large search, but if you use it repeatedly on very short searches, you won't like the result. filtered_graph is your friend. -- David Abrahams BoostPro Computing http://boostpro.com
my previous explanation of my application was maybe not the clearest, so here is another try: http://tinyurl.com/3aoejnd looking at the BGL-documentation, would a planar_face_traversal of the outermost/innermost face of my graph traverse the edges/vertices in a useful order for producing the output I want? thanks, Anders
thanks for all replies, I will try writing an exception-throwing visitor or my own depth-limited bfs.
Really, that approach is likely to have very poor performance for your application. You can use the exception approach once for a very large search, but if you use it repeatedly on very short searches, you won't like the result. filtered_graph is your friend.
On Tue, 3 Aug 2010, Anders Wallin wrote:
my previous explanation of my application was maybe not the clearest, so here is another try: http://tinyurl.com/3aoejnd
looking at the BGL-documentation, would a planar_face_traversal of the outermost/innermost face of my graph traverse the edges/vertices in a useful order for producing the output I want?
I'm not too familiar with them, but isn't this the sort of thing that marching squares/marching cubes algorithms are for? What is the actual input to your algorithm? Just a graph where the nodes have positions attached? Is that guaranteed to be a planar drawing? Also, is there anything in CGAL (www.cgal.org) that would help you? It appears that CGAL has algorithms for generating meshes from point sets. -- Jeremiah Willcock
On Tue, Aug 3, 2010 at 1:30 PM, Anders Wallin
my previous explanation of my application was maybe not the clearest, so here is another try: http://tinyurl.com/3aoejnd
looking at the BGL-documentation, would a planar_face_traversal of the outermost/innermost face of my graph traverse the edges/vertices in a useful order for producing the output I want?
thanks, Anders
thanks for all replies, I will try writing an exception-throwing visitor or my own depth-limited bfs.
Really, that approach is likely to have very poor performance for your application. You can use the exception approach once for a very large search, but if you use it repeatedly on very short searches, you won't like the result. filtered_graph is your friend.
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
Hi Anders, I'm still not really sure what you're asking after reading that post, but let me give it a shot - if you have a graph that looks like: a | b--c--d | e--f--g | h which I think is what you're describing, basically, then you could use planar_face_traversal to create an "outer border" like the red outline in your post. planar_face_traversal will iterate over the edges of the above graph in sequence like (a,c) (c,d) (d,c) (c,f) (f,g) (g,f) (f,h) (h,f) (f,e) (e,f) (f,c) (c,b) (b,c) (c,a). Take that sequence and keep only the pairs of adjacent duplicate edges: (a,c) (c,d) (d,c) (f,g) (g,f) (f,h) (h,f) (f,e) (e,f) (c,b) (b,c) (c,a) (it's a cyclic order, so we keep (a,c) at the beginning and (c,a) at the end.) Now from each pair of duplicate edges, extract any vertex of degree 1 and you get a, d, g, h, e, b, a which defines the sequence of edges you need to add to get the "outer border": (a,d) (d,g) (g,h) (h,e) (e,b), and (b,a). The same argument applies to getting an "inner border" like the orange outline in your post. Regards, Aaron
I'm still not really sure what you're asking after reading that post, but let me give it a shot - if you have a graph that looks like: a | b--c--d | e--f--g | h which I think is what you're describing, basically, then you could use planar_face_traversal to create an "outer border" like the red outline in your post.
yes, this is exactly what I want. However I suspect that the planar embedding is not unique (?). As drawn above if we proceed clockwise the output would indeed be a, d, g, h, e, b, a but another planar embedding could for example have the degree-1 adjacent vertices of c in a completely different order and the output would be different. which means I should probably create the planar embedding myself, and not delegate to boyer_myrvold_planarity_test as is done in the example. Are there any examples out there of how create a planar embedding? from the example it looks like a std::vector< std::vector<EdgeDescriptor> > would something like this work: [ [(a,c)] ] the row for vertex a. a has only one edge (a,c) [ [ (b,c) ] ] same thing for b [ [ (c,b) , (c,a) , (c,d), (c,f) ] ] the row for vertex c. edges should be traversed in the order (c,b) then (c,a) then (c,d) then (c,f) ... and so on... thanks, Anders
On Wed, Aug 4, 2010 at 4:12 AM, Anders Wallin
I'm still not really sure what you're asking after reading that post, but let me give it a shot - if you have a graph that looks like: a | b--c--d | e--f--g | h which I think is what you're describing, basically, then you could use planar_face_traversal to create an "outer border" like the red outline in your post.
yes, this is exactly what I want. However I suspect that the planar embedding is not unique (?). As drawn above if we proceed clockwise the output would indeed be a, d, g, h, e, b, a but another planar embedding could for example have the degree-1 adjacent vertices of c in a completely different order and the output would be different.
which means I should probably create the planar embedding myself, and not delegate to boyer_myrvold_planarity_test as is done in the example. Are there any examples out there of how create a planar embedding? from the example it looks like a std::vector< std::vector<EdgeDescriptor> > would something like this work: [ [(a,c)] ] the row for vertex a. a has only one edge (a,c) [ [ (b,c) ] ] same thing for b [ [ (c,b) , (c,a) , (c,d), (c,f) ] ] the row for vertex c. edges should be traversed in the order (c,b) then (c,a) then (c,d) then (c,f) ... and so on...
Yeah, you're right - there can be many planar embeddings for a graph, and there's no way to specify which one you want boyer_myrvold_planarity_test to generate. So if you already have an implicit ordering in the 2D plane on the edges around each vertex in the graph, you'll want to create the embedding yourself. And yes, you're on the right track in your example to create a planar embedding - using a vector of vectors of edges, each edge should appear twice in the planar embedding, once in the clockwise cyclic order around each vertex the edge is adjacent to. The planar embedding concept documentation (http://www.boost.org/doc/libs/1_43_0/libs/graph/doc/PlanarEmbedding.html) has some more information about what a planar embedding means to the BGL, although I can't find a link to any code that manually constructs a planar embedding. -Aaron
On Wed, Aug 4, 2010 at 11:12 AM, Anders Wallin
I'm still not really sure what you're asking after reading that post, but let me give it a shot - if you have a graph that looks like: a | b--c--d | e--f--g | h which I think is what you're describing, basically, then you could use planar_face_traversal to create an "outer border" like the red outline in your post. yes, this is exactly what I want. However I suspect that the planar embedding is not unique (?). which means I should probably create the planar embedding myself,
thanks for all the help with this, it now seems to work quite well! The end result are these "waterline" toolpaths: http://imagebin.org/108079 If anyone is interested, I wrote my own planar embedding-code like this: http://pastebin.com/qwHuVAc1 thanks, Anders
Anders Wallin
I should probably describe my application, since there might be a much better solution than the one I am trying. As part of a computer-aided-manufacturing toolpath calculation I am producing planar graphs like this: http://tinyurl.com/377lvt8
I like the whole idea of BGL, but I don't have a specific use for it in my current job. I am extremely interested in the fields where it is being used. Thanks for sharing the application you are using it for. I would appreciate it if anyone else could share what application they use BGL for. Jerry
On Wed, 4 Aug 2010, Jerry wrote:
Anders Wallin
writes: I should probably describe my application, since there might be a much better solution than the one I am trying. As part of a computer-aided-manufacturing toolpath calculation I am producing planar graphs like this: http://tinyurl.com/377lvt8
I like the whole idea of BGL, but I don't have a specific use for it in my current job. I am extremely interested in the fields where it is being used. Thanks for sharing the application you are using it for. I would appreciate it if anyone else could share what application they use BGL for.
There is a list of some users at URL:https://svn.boost.org/svn/boost/trunk/libs/graph/doc/users.html. -- Jeremiah Willcock
On 04/08/2010 06:10, Jeremiah Willcock wrote:
On Wed, 4 Aug 2010, Jerry wrote:
Anders Wallin
writes: I should probably describe my application, since there might be a much better solution than the one I am trying. As part of a computer-aided-manufacturing toolpath calculation I am producing planar graphs like this: http://tinyurl.com/377lvt8
I like the whole idea of BGL, but I don't have a specific use for it in my current job. I am extremely interested in the fields where it is being used. Thanks for sharing the application you are using it for. I would appreciate it if anyone else could share what application they use BGL for.
There is a list of some users at URL:https://svn.boost.org/svn/boost/trunk/libs/graph/doc/users.html.
When I try to connect to that site I get this message... You have asked Firefox to connect securely to svn.boost.org, but we can't confirm that your connection is secure. Normally, when you try to connect securely, sites will present trusted identification to prove that you are going to the right place. However, this site's identity can't be verified. Techincal details: svn.boost.org uses an invalid security certificate. The certificate is not trusted because the issuer certificate is unknown. ...is that normal? Thanks, Louis
On Wed, 4 Aug 2010, Louis Lavery wrote:
On 04/08/2010 06:10, Jeremiah Willcock wrote:
On Wed, 4 Aug 2010, Jerry wrote:
Anders Wallin
writes: I should probably describe my application, since there might be a much better solution than the one I am trying. As part of a computer-aided-manufacturing toolpath calculation I am producing planar graphs like this: http://tinyurl.com/377lvt8
I like the whole idea of BGL, but I don't have a specific use for it in my current job. I am extremely interested in the fields where it is being used. Thanks for sharing the application you are using it for. I would appreciate it if anyone else could share what application they use BGL for.
There is a list of some users at URL:https://svn.boost.org/svn/boost/trunk/libs/graph/doc/users.html.
When I try to connect to that site I get this message...
You have asked Firefox to connect securely to svn.boost.org, but we can't confirm that your connection is secure.
Normally, when you try to connect securely, sites will present trusted identification to prove that you are going to the right place. However, this site's identity can't be verified.
Techincal details:
svn.boost.org uses an invalid security certificate.
The certificate is not trusted because the issuer certificate is unknown.
...is that normal?
Yes -- we have a self-signed certificate on the SVN server and so browsers complain about it. -- Jeremiah Willcock
On Aug 3, 2010, at 10:04 PM, Jerry wrote:
I like the whole idea of BGL, but I don't have a specific use for it in my current job. I am extremely interested in the fields where it is being used. Thanks for sharing the application you are using it for. I would appreciate it if anyone else could share what application they use BGL for.
I am doing static code analysis and use BGL to represent a control flow graph of the code. Nothing sophisticated really, I just needed a generic C++ graph data structure -- with abstractions for vertices, edges, etc. -- along with basic routines for adding, removing, and iterating over them. I'm surprised there's nothing like this in the STL. Although there is a handful of open-source projects providing a C+ + graph data structure, BGL seems to be the most popular, feature- rich, and well-supported of the lot. Trevor
On 03/08/2010 17:51, Anders Wallin wrote:
I was thinking to reset the filtered graph at every source vertex. Roughly equivalent to keeping track of depth, without having to write the code.
thanks for all replies, I will try writing an exception-throwing visitor or my own depth-limited bfs.
I should probably describe my application, since there might be a much better solution than the one I am trying. As part of a computer-aided-manufacturing toolpath calculation I am producing planar graphs like this: http://tinyurl.com/377lvt8 red edges go in the X-direction, blue edges in the Y-direction. the vertices I am interested in are drawn as blue/red dots(the color does not matter here), and the problem is to find the correct order of these vertices which creates a "silhouette" or "contour" of the graph. A naive solution would just hop from one dot to the closest one, but in more complicated cases this is not correct, so I would prefer traversing the graph to the next dot.
sometimes the graph has many disconnected components: (no colored dots in this pic, and nevermind the white triangulated surface...) http://tinyurl.com/3aefpgn and note that the donut-shaped component of the graph has two of these "loops" I am interested in, one on the outside, one on the inside.
I've been thinking about two approaches: 1) start at one vertex, search in the neighbourhood of this vertex for the next one, figure out which one of the found points is the correct next vertex. iterate. 2) something based on the idea of surface tension, or some kind of relaxation/removal/addition of edges. start in the middle of the graph and work your way towards the outside and when we are finished only the correct edges remain.
any ideas?
Yes. 1. Remove (or filter out) all vtx of deg 1 together with their edges. 2. Mark all vtx of deg < 4 with p. 3. Starting at, and working away from, the p vtxs of deg 2[1], join with the adjacent p vtx. 4. Now any p vtx of pdeg < 2 will be adjacent to a non p vtx that is adjacent to another p vtx of pdeg < 2, make the non p vertex a p vertex and join with the two adjacent p vtxs. (pdeg is the number of adjacent p vtxs, of course). 5. That's it, bar sticking the loose threads back on etc. No shadow of a proof, so may contain bugs! Louis. [1] Looks like you need to move away from the deg 2 vtxs in step, in case you're on a strip 2 vtxs wide, in which case there'll be two adjacent p vtxs you could join with.
Hi, Are there any experts on the list that would give me some advice on a shortest path with resource constraints type problem. In brief: I have a graph with tangles. The tangles are due to spurious edges which arise due to repeats in genomes. If the underlying genome didn't have any repeats, then I could just find the Euler path (which would coincide with the Hamilton path) and I'd be done. I have extra data which can be interpretted as the distance between pairs of nodes, which I hope to use to remove as many of the spurious edges as possible, but having read some papers this sounds like the SPPRC. The trouble is that the distances aren't exact, they come from a distribution, which is only approximately known. However my graph will have lots of untangled regions from which I can make a better approximation, iteratively maybe. -- Dr Adam Spargo High Performance Assembly Group email: aws@sanger.ac.uk Wellcome Trust Sanger Institute Tel: +44 (0)1223 834244 x7728 Hinxton, Cambridge CB10 1SA Fax: +44 (0)1223 494919 On Wed, 4 Aug 2010, Louis Lavery wrote:
On 03/08/2010 17:51, Anders Wallin wrote:
I was thinking to reset the filtered graph at every source vertex. Roughly equivalent to keeping track of depth, without having to write the code.
thanks for all replies, I will try writing an exception-throwing visitor or my own depth-limited bfs.
I should probably describe my application, since there might be a much better solution than the one I am trying. As part of a computer-aided-manufacturing toolpath calculation I am producing planar graphs like this: http://tinyurl.com/377lvt8 red edges go in the X-direction, blue edges in the Y-direction. the vertices I am interested in are drawn as blue/red dots(the color does not matter here), and the problem is to find the correct order of these vertices which creates a "silhouette" or "contour" of the graph. A naive solution would just hop from one dot to the closest one, but in more complicated cases this is not correct, so I would prefer traversing the graph to the next dot.
sometimes the graph has many disconnected components: (no colored dots in this pic, and nevermind the white triangulated surface...) http://tinyurl.com/3aefpgn and note that the donut-shaped component of the graph has two of these "loops" I am interested in, one on the outside, one on the inside.
I've been thinking about two approaches: 1) start at one vertex, search in the neighbourhood of this vertex for the next one, figure out which one of the found points is the correct next vertex. iterate. 2) something based on the idea of surface tension, or some kind of relaxation/removal/addition of edges. start in the middle of the graph and work your way towards the outside and when we are finished only the correct edges remain.
any ideas?
Yes.
1. Remove (or filter out) all vtx of deg 1 together with their edges.
2. Mark all vtx of deg < 4 with p.
3. Starting at, and working away from, the p vtxs of deg 2[1], join with the adjacent p vtx.
4. Now any p vtx of pdeg < 2 will be adjacent to a non p vtx that is adjacent to another p vtx of pdeg < 2, make the non p vertex a p vertex and join with the two adjacent p vtxs. (pdeg is the number of adjacent p vtxs, of course).
5. That's it, bar sticking the loose threads back on etc.
No shadow of a proof, so may contain bugs!
Louis.
[1] Looks like you need to move away from the deg 2 vtxs in step, in case you're on a strip 2 vtxs wide, in which case there'll be two adjacent p vtxs you could join with. _______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
-- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
On Wed, 4 Aug 2010, Adam Spargo wrote:
Hi, Are there any experts on the list that would give me some advice on a shortest path with resource constraints type problem.
In brief: I have a graph with tangles. The tangles are due to spurious edges which arise due to repeats in genomes. If the underlying genome didn't have any repeats, then I could just find the Euler path (which would coincide with the Hamilton path) and I'd be done. I have extra data which can be interpretted as the distance between pairs of nodes, which I hope to use to remove as many of the spurious edges as possible, but having read some papers this sounds like the SPPRC.
The trouble is that the distances aren't exact, they come from a distribution, which is only approximately known. However my graph will have lots of untangled regions from which I can make a better approximation, iteratively maybe.
There is a resource-constrained shortest paths algorithm in BGL, although I have never used it. There is someone working on a newer version, but I don't know about the recent progress of that. Do you have a more precise description of what you are trying to accomplish, especially in graph terms? -- Jeremiah Willcock
ok, I'll try to give a better description. When you sequence a genome, you have lots of short fragments coming off the sequencing machine in a random order, the genome is covered with overlapping fragments to a considerable depth. The assembly task is then to put them together again in the correct order. It's kind of like doing a big jigsaw puzzle. I'm following the overlap-layout-consensus model. You find all overlaps between fragments and draw a graph with fragments as vertices and overlaps as edges. Each edge is bidirectional, labelled with the sequences that overhang each end of the overlap. The bidirectionality reflects the double stranded nature of DNA. The correct assembly is then a path through this graph. For example if you have overlapping fragments A and B, with overlap O(A,B), the edge between A and B has (A-B) in one direction and (B-A) in the other direction. Then A.(B-A) or B.(A-B) are valid assemblies of A and B. The problem is that there are many spurious overlaps - edges which look like true overlaps but which are due to repetitive sequences in the genome. There are also sequencing errors, but I'll deal with them in phase 2, I'm just on simulated data at present. Traditional algorithms look for Hamiltonian Path in this graph, in practice what happens is that you output the sequence on simple walks, leaving gaps between the 'contigs' at every cycle. So the sequencing guys worked out a way to produce read-pairs - each fragment has a pair, the distribution of the insert size between the pairs is not really known, for example when they aim for 3000 nucleotides the mean could be something like 2000 or 4000, with a sd of something like 600. In general you have several libraries of read-pairs with different insert size. What is currently done is to approximate the insert size distribution on the contigs and then use this information to join contigs into scaffold and attempt to fill in the gaps, similarly contigs which contradict the read-pair information are broken or discarded. What I want to do is to use the read-pair data on the graph, by removing infeasible edges which hopefully correspond to spurious overlaps. If I have insert sizes longer than the repeats in the genome I should be able to disambiguate all the tangles and be left with an acyclic path from which I can just read off the assembly without the need for a layout algorithm at all. I plan to do what is called transitive reduction on the graph before looking at the read pairs, that is if X->Y, Y->Z, X->Z - remove X->Z - just retain the maximal overlaps and kick out the shorter ones with the same information. I was just wondering if this problem fits with a well known SPPRC or am I in unvisited territory? It's quite a worthwhile project, because the 'finishing' stage of sequencing a genome is the most expensive these days, but we had the necessary information on the graph already, keeping the easy bits and throwing away all the tangles. The finishers never even get to look at the tangles. Maybe that was an even worse description ... At the moment I'm just seeing what the graph looks like for different classes of repeat which I insert into my simulation data and trying to get my head around the problem. I'm unclear as to whether I should find the fundemental cycle basis and attack each one in turn with the pair information or whether I could formulate this as a SPPRC and just call the BGL function. I'm not expecting the answer to come out of the ether, but if anyone has any insights I'd be glad to hear them. Adam. -- Dr Adam Spargo High Performance Assembly Group email: aws@sanger.ac.uk Wellcome Trust Sanger Institute Tel: +44 (0)1223 834244 x7728 Hinxton, Cambridge CB10 1SA Fax: +44 (0)1223 494919 On Wed, 4 Aug 2010, Jeremiah Willcock wrote:
On Wed, 4 Aug 2010, Adam Spargo wrote:
Hi, Are there any experts on the list that would give me some advice on a shortest path with resource constraints type problem.
In brief: I have a graph with tangles. The tangles are due to spurious edges which arise due to repeats in genomes. If the underlying genome didn't have any repeats, then I could just find the Euler path (which would coincide with the Hamilton path) and I'd be done. I have extra data which can be interpretted as the distance between pairs of nodes, which I hope to use to remove as many of the spurious edges as possible, but having read some papers this sounds like the SPPRC.
The trouble is that the distances aren't exact, they come from a distribution, which is only approximately known. However my graph will have lots of untangled regions from which I can make a better approximation, iteratively maybe.
There is a resource-constrained shortest paths algorithm in BGL, although I have never used it. There is someone working on a newer version, but I don't know about the recent progress of that. Do you have a more precise description of what you are trying to accomplish, especially in graph terms?
-- Jeremiah Willcock _______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
-- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
On Thu, 5 Aug 2010, Adam Spargo wrote:
ok, I'll try to give a better description.
When you sequence a genome, you have lots of short fragments coming off the sequencing machine in a random order, the genome is covered with overlapping fragments to a considerable depth. The assembly task is then to put them together again in the correct order. It's kind of like doing a big jigsaw puzzle.
I'm following the overlap-layout-consensus model. You find all overlaps between fragments and draw a graph with fragments as vertices and overlaps as edges. Each edge is bidirectional, labelled with the sequences that overhang each end of the overlap. The bidirectionality reflects the double stranded nature of DNA. The correct assembly is then a path through this graph.
For example if you have overlapping fragments A and B, with overlap O(A,B), the edge between A and B has (A-B) in one direction and (B-A) in the other direction. Then A.(B-A) or B.(A-B) are valid assemblies of A and B.
The problem is that there are many spurious overlaps - edges which look like true overlaps but which are due to repetitive sequences in the genome. There are also sequencing errors, but I'll deal with them in phase 2, I'm just on simulated data at present.
Traditional algorithms look for Hamiltonian Path in this graph, in practice what happens is that you output the sequence on simple walks, leaving gaps between the 'contigs' at every cycle.
So the sequencing guys worked out a way to produce read-pairs - each fragment has a pair, the distribution of the insert size between the pairs is not really known, for example when they aim for 3000 nucleotides the mean could be something like 2000 or 4000, with a sd of something like 600.
In general you have several libraries of read-pairs with different insert size.
What is currently done is to approximate the insert size distribution on the contigs and then use this information to join contigs into scaffold and attempt to fill in the gaps, similarly contigs which contradict the read-pair information are broken or discarded.
What I want to do is to use the read-pair data on the graph, by removing infeasible edges which hopefully correspond to spurious overlaps. If I have insert sizes longer than the repeats in the genome I should be able to disambiguate all the tangles and be left with an acyclic path from which I can just read off the assembly without the need for a layout algorithm at all.
I plan to do what is called transitive reduction on the graph before looking at the read pairs, that is if X->Y, Y->Z, X->Z - remove X->Z - just retain the maximal overlaps and kick out the shorter ones with the same information.
There is a transitive reduction algorithm in BGL, and a better version on
the way. I do not believe it handles weights, though, which it looks like
you need. You might want to contact Eric Wolf
I was just wondering if this problem fits with a well known SPPRC or am I in unvisited territory?
It's quite a worthwhile project, because the 'finishing' stage of sequencing a genome is the most expensive these days, but we had the necessary information on the graph already, keeping the easy bits and throwing away all the tangles. The finishers never even get to look at the tangles.
Maybe that was an even worse description ...
At the moment I'm just seeing what the graph looks like for different classes of repeat which I insert into my simulation data and trying to get my head around the problem. I'm unclear as to whether I should find the fundemental cycle basis and attack each one in turn with the pair information or whether I could formulate this as a SPPRC and just call the BGL function.
I'm not expecting the answer to come out of the ether, but if anyone has any insights I'd be glad to hear them.
I'll think about this further and see if I have any insights. -- Jeremiah Willcock
participants (10)
-
Aaron Windsor
-
Adam Spargo
-
Anders Wallin
-
David Abrahams
-
Jeremiah Willcock
-
Jerry
-
Line Blander Reinhardt
-
Louis Lavery
-
Ovanes Markarian
-
Trevor Harmon