[BGL] SPPRC formulation
Hi, I would like to use the r_c_shortest_paths algorithm. My constraint is that given pairs of vertices i, j (each i have a single mate j) must have a distance between them which is consistent with a given distribution. I am looking for the path which is most consistent with this distribution. Does anybody have any experience of formulating such a constraint in BGL? The second problem is that I am currently modelling the problem with an undirected graph. The edges could be viewed as bi-directional, but only due to the double stranded nature of DNA. Once I have started on a path I just choose the label which is consistent with the direction I am going, but the graph is essentially undirected, you can always go in either direction down an edge. Could an undirected graph be used with this algorithm or must I use a directed graph? I have tried to ask this question before, hopefully this time I have phrased it better - plus I am actually ready to code it now. As always, any advice gratefully received. Adam. -- Dr Adam Spargo High Performance Assembly Group email: aws@sanger.ac.uk Wellcome Trust Sanger Institute Tel: +44 (0)1223 834244 x7728 Hinxton, Cambridge CB10 1SA Fax: +44 (0)1223 494919 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
On Thu, 23 Sep 2010, Adam Spargo wrote:
Hi, I would like to use the r_c_shortest_paths algorithm. My constraint is that given pairs of vertices i, j (each i have a single mate j) must have a distance between them which is consistent with a given distribution. I am looking for the path which is most consistent with this distribution. Does anybody have any experience of formulating such a constraint in BGL?
I'm not sure what you mean here. What exactly does "consistent with this distribution" mean? Is that referring to a probability distribution? Are you trying to find a path that maximizes some property of edge labels? If so, is the path required to be simple (contain each vertex at most once)?
The second problem is that I am currently modelling the problem with an undirected graph. The edges could be viewed as bi-directional, but only due to the double stranded nature of DNA. Once I have started on a path I just choose the label which is consistent with the direction I am going, but the graph is essentially undirected, you can always go in either direction down an edge. Could an undirected graph be used with this algorithm or must I use a directed graph?
The documentation says that the graph needs to be directed. However, have you tried it with an undirected graph? I don't see a reason in the pseudocode that justifies the directedness requirement. Note too that there is work going on to rewrite the r_c_shortest_paths code to be more generic; that may end up increasing its capabilities later. -- Jeremiah Willcock
Hi, thanks for your reply
I would like to use the r_c_shortest_paths algorithm. My constraint is that given pairs of vertices i, j (each i have a single mate j) must have a distance between them which is consistent with a given distribution. I am looking for the path which is most consistent with this distribution. Does anybody have any experience of formulating such a constraint in BGL?
I'm not sure what you mean here. What exactly does "consistent with this distribution" mean? Is that referring to a probability distribution? Are you trying to find a path that maximizes some property of edge labels? If so, is the path required to be simple (contain each vertex at most once)?
Sorry ... each edge has a length, and pairs of vertices have a distance between them (called the insert size), depending on how well the DNA library is made the insert size of the pairs has a distribution. All we get is the size they were aiming for - we usually assume a normal distribution with this size as mean and s.d. of 20% - but this is completely arbitrary and is often wrong. I was planning to approximate the distribution from parts of my graph without cycles. BUT in the first instance implementing a SPPRC with exact insert size could be seen as progress ... Yes, the path should be simple.
The second problem is that I am currently modelling the problem with an undirected graph. The edges could be viewed as bi-directional, but only due to the double stranded nature of DNA. Once I have started on a path I just choose the label which is consistent with the direction I am going, but the graph is essentially undirected, you can always go in either direction down an edge. Could an undirected graph be used with this algorithm or must I use a directed graph?
The documentation says that the graph needs to be directed. However, have you tried it with an undirected graph? I don't see a reason in the pseudocode that justifies the directedness requirement.
Thanks, I will implement a simple example and try to build up to the real problem. Best, Adam. -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
On Wed, 29 Sep 2010, Adam Spargo wrote:
Hi, thanks for your reply
I would like to use the r_c_shortest_paths algorithm. My constraint is that given pairs of vertices i, j (each i have a single mate j) must have a distance between them which is consistent with a given distribution. I am looking for the path which is most consistent with this distribution. Does anybody have any experience of formulating such a constraint in BGL?
I'm not sure what you mean here. What exactly does "consistent with this distribution" mean? Is that referring to a probability distribution? Are you trying to find a path that maximizes some property of edge labels? If so, is the path required to be simple (contain each vertex at most once)?
Sorry ... each edge has a length, and pairs of vertices have a distance between them (called the insert size), depending on how well the DNA library is made the insert size of the pairs has a distribution. All we get is the size they were aiming for - we usually assume a normal distribution with this size as mean and s.d. of 20% - but this is completely arbitrary and is often wrong. I was planning to approximate the distribution from parts of my graph without cycles. BUT in the first instance implementing a SPPRC with exact insert size could be seen as progress ...
Is there a description somewhere of what you're trying to do at a high level? I still don't completely understand the actual problem you're trying to solve. Are you just looking for paths that minimize the total vertex distances, or are there more limitations or objectives that you have?
Yes, the path should be simple.
Could you ever have a situation similar to a negative-weight cycle where the "best" overall (not necessarily simple) path is not simple? -- Jeremiah Willcock
Hi, You are quite right - my biggest problem is getting the problem definition right. I'm away for a week from tomorrow, but I'll try to work up a better description. Adam. -- Dr Adam Spargo High Performance Assembly Group email: aws@sanger.ac.uk Wellcome Trust Sanger Institute Tel: +44 (0)1223 834244 x7728 Hinxton, Cambridge CB10 1SA Fax: +44 (0)1223 494919 On Wed, 29 Sep 2010, Jeremiah Willcock wrote:
On Wed, 29 Sep 2010, Adam Spargo wrote:
Hi, thanks for your reply
I would like to use the r_c_shortest_paths algorithm. My constraint is that given pairs of vertices i, j (each i have a single mate j) must have a distance between them which is consistent with a given distribution. I am looking for the path which is most consistent with this distribution. Does anybody have any experience of formulating such a constraint in BGL?
I'm not sure what you mean here. What exactly does "consistent with this distribution" mean? Is that referring to a probability distribution? Are you trying to find a path that maximizes some property of edge labels? If so, is the path required to be simple (contain each vertex at most once)?
Sorry ... each edge has a length, and pairs of vertices have a distance between them (called the insert size), depending on how well the DNA library is made the insert size of the pairs has a distribution. All we get is the size they were aiming for - we usually assume a normal distribution with this size as mean and s.d. of 20% - but this is completely arbitrary and is often wrong. I was planning to approximate the distribution from parts of my graph without cycles. BUT in the first instance implementing a SPPRC with exact insert size could be seen as progress ...
Is there a description somewhere of what you're trying to do at a high level? I still don't completely understand the actual problem you're trying to solve. Are you just looking for paths that minimize the total vertex distances, or are there more limitations or objectives that you have?
Yes, the path should be simple.
Could you ever have a situation similar to a negative-weight cycle where the "best" overall (not necessarily simple) path is not simple?
-- Jeremiah Willcock _______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
-- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
On Thu, 30 Sep 2010, Adam Spargo wrote:
Hi, You are quite right - my biggest problem is getting the problem definition right. I'm away for a week from tomorrow, but I'll try to work up a better description.
One thing you could do that might be easier is to see if you could write your problem as an optimization problem subject to various constraints. You should use the constraints and objective function you would ideally want to use, even if you don't think they're feasible to optimize; modifications can come later if any are needed. -- Jeremiah Willcock
participants (2)
-
Adam Spargo
-
Jeremiah Willcock