OK, I'll try to give a better description. When you sequence a genome, lots of short fragments come off the sequencing machine in a random order; the genome is covered with overlapping fragments to a considerable depth. The assembly task is then to put them back together in the correct order. It's kind of like doing a big jigsaw puzzle.

I'm following the overlap-layout-consensus model. You find all overlaps between fragments and draw a graph with fragments as vertices and overlaps as edges. Each edge is bidirectional, labelled with the sequence that overhangs each end of the overlap; the bidirectionality reflects the double-stranded nature of DNA. The correct assembly is then a path through this graph. For example, if you have overlapping fragments A and B, with overlap O(A,B), the edge between A and B has (A-B) in one direction and (B-A) in the other direction. Then A.(B-A) or B.(A-B) are valid assemblies of A and B.

The problem is that there are many spurious overlaps - edges which look like true overlaps but which are due to repetitive sequences in the genome. There are also sequencing errors, but I'll deal with those in phase 2; I'm just on simulated data at present. Traditional algorithms look for a Hamiltonian path in this graph. In practice what happens is that you output the sequence on simple walks, leaving gaps between the 'contigs' at every cycle.

So the sequencing guys worked out a way to produce read-pairs: each fragment has a mate, and the distribution of the insert size between the pairs is only roughly known - for example, when they aim for 3000 nucleotides the mean could be something like 2000 or 4000, with a standard deviation of something like 600. In general you have several libraries of read-pairs with different insert sizes. What is currently done is to estimate the insert-size distribution on the contigs and then use it to join contigs into scaffolds and attempt to fill in the gaps; similarly, contigs which contradict the read-pair information are broken up or discarded.

What I want to do is to use the read-pair data on the graph itself, by removing infeasible edges which hopefully correspond to spurious overlaps. If I have insert sizes longer than the repeats in the genome, I should be able to disambiguate all the tangles and be left with an acyclic path from which I can just read off the assembly, without the need for a layout algorithm at all. I plan to do what is called transitive reduction on the graph before looking at the read pairs: if X->Y, Y->Z and X->Z all exist, remove X->Z - just retain the maximal overlaps and kick out the shorter ones which carry the same information.

I was just wondering whether this problem fits a well-known SPPRC formulation or whether I'm in unvisited territory. It's quite a worthwhile project, because the 'finishing' stage of sequencing a genome is the most expensive these days, yet the necessary information was on the graph all along - we keep the easy bits and throw away all the tangles, so the finishers never even get to look at them.

Maybe that was an even worse description ... At the moment I'm just seeing what the graph looks like for different classes of repeat which I insert into my simulation data, and trying to get my head around the problem. I'm unclear whether I should find the fundamental cycle basis and attack each cycle in turn with the pair information, or whether I could formulate this as an SPPRC and just call the BGL function. I'm not expecting the answer to come out of the ether, but if anyone has any insights I'd be glad to hear them.

Adam.
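P.S. In case it helps to make the graph model concrete, here is a rough BGL sketch of the overlap graph and the transitive-reduction step I described. The type and field names (OverlapEdge, reduce_transitive_edges, the overhang strings) are just made up for illustration, and I've flattened the bidirected edges into plain directed ones - real code would also have to carry an orientation flag on each edge end.

#include <boost/graph/adjacency_list.hpp>
#include <string>
#include <vector>

// Each overlap edge carries the sequence that overhangs each end of the
// overlap, i.e. (A-B) in one direction and (B-A) in the other.
struct OverlapEdge {
    std::string overhang_out;  // hypothetical field: sequence hanging off the target end
    std::string overhang_in;   // hypothetical field: sequence hanging off the source end
};

// setS out-edge lists, so removing one edge leaves the other edge
// descriptors valid while we sweep.
typedef boost::adjacency_list<boost::setS, boost::vecS, boost::directedS,
                              boost::no_property, OverlapEdge> OverlapGraph;
typedef boost::graph_traits<OverlapGraph>::vertex_descriptor Vertex;
typedef boost::graph_traits<OverlapGraph>::edge_descriptor   Edge;

// Transitive reduction as described above: if X->Y, Y->Z and X->Z all exist,
// X->Z carries no extra information, so drop it.
void reduce_transitive_edges(OverlapGraph& g)
{
    std::vector<Edge> redundant;
    boost::graph_traits<OverlapGraph>::edge_iterator ei, ei_end;
    for (boost::tie(ei, ei_end) = boost::edges(g); ei != ei_end; ++ei) {
        Vertex x = boost::source(*ei, g);
        Vertex z = boost::target(*ei, g);
        // Look for an intermediate Y with X->Y and Y->Z.
        boost::graph_traits<OverlapGraph>::adjacency_iterator ai, ai_end;
        for (boost::tie(ai, ai_end) = boost::adjacent_vertices(x, g); ai != ai_end; ++ai) {
            Vertex y = *ai;
            if (y != z && boost::edge(y, z, g).second) {
                redundant.push_back(*ei);   // X->Z is implied by X->Y and Y->Z
                break;
            }
        }
    }
    for (std::size_t i = 0; i < redundant.size(); ++i)
        boost::remove_edge(redundant[i], g);
}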
--
Dr Adam Spargo
High Performance Assembly Group
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA
email: aws@sanger.ac.uk
Tel: +44 (0)1223 834244 x7728
Fax: +44 (0)1223 494919

On Wed, 4 Aug 2010, Jeremiah Willcock wrote:
On Wed, 4 Aug 2010, Adam Spargo wrote:
Hi,

Are there any experts on the list who could give me some advice on a shortest-path-with-resource-constraints type of problem?
In brief: I have a graph with tangles. The tangles are due to spurious edges which arise from repeats in genomes. If the underlying genome didn't have any repeats, I could just find the Euler path (which would coincide with the Hamilton path) and I'd be done. I have extra data which can be interpreted as the distance between pairs of nodes, which I hope to use to remove as many of the spurious edges as possible, and having read some papers this sounds like an SPPRC.
The trouble is that the distances aren't exact; they come from a distribution which is only approximately known. However, my graph will have lots of untangled regions from which I can estimate it better, perhaps iteratively.
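To be concrete about the test I have in mind: estimate the insert-size distribution from pairs that land in the untangled regions, then treat an edge as feasible only if the mate distance it implies falls within a few standard deviations of the mean. A rough sketch only - all names are invented, and it assumes at least two observed pairs:

#include <cmath>
#include <numeric>
#include <vector>

struct InsertSizeModel {
    double mean;
    double sd;
};

// Estimate mean and sd from mate distances observed in untangled regions.
InsertSizeModel estimate_insert_size(const std::vector<double>& observed)
{
    double mean = std::accumulate(observed.begin(), observed.end(), 0.0) / observed.size();
    double ss = 0.0;
    for (std::size_t i = 0; i < observed.size(); ++i)
        ss += (observed[i] - mean) * (observed[i] - mean);
    InsertSizeModel m = { mean, std::sqrt(ss / (observed.size() - 1)) };
    return m;
}

// A read pair supports an edge only if the distance the edge implies
// between the mates is within k standard deviations of the mean.
bool pair_is_feasible(double implied_distance, const InsertSizeModel& m, double k = 3.0)
{
    return std::fabs(implied_distance - m.mean) <= k * m.sd;
}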
There is a resource-constrained shortest paths algorithm in BGL, although I have never used it. Someone is working on a newer version, but I don't know how far along that is. Do you have a more precise description of what you are trying to accomplish, especially in graph terms?
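For reference, the entry point is boost::r_c_shortest_paths in <boost/graph/r_c_shortest_paths.hpp>. A bare-bones call follows the pattern below; the resource container and the extension/dominance functors are illustrative stand-ins (a simple accumulated-distance cap), not a formulation of your assembly problem, and the vertex/edge index fields are assumed to be filled in elsewhere.

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/r_c_shortest_paths.hpp>
#include <vector>

struct VertProp { int num; };                  // vertex index 0..n-1, filled in elsewhere
struct EdgeProp { int num; double length; };   // edge index 0..m-1 plus an edge length

typedef boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS,
                              VertProp, EdgeProp> Graph;

// Resource container: accumulated distance along the partial path.
struct ResCont {
    ResCont(double d = 0.0) : dist(d) {}
    double dist;
};
bool operator==(const ResCont& a, const ResCont& b) { return a.dist == b.dist; }
bool operator<(const ResCont& a, const ResCont& b)  { return a.dist <  b.dist; }

// Resource extension: extend along an edge; infeasible once we exceed max_dist.
struct ResExtension {
    double max_dist;
    explicit ResExtension(double m) : max_dist(m) {}
    bool operator()(const Graph& g, ResCont& new_cont, const ResCont& old_cont,
                    boost::graph_traits<Graph>::edge_descriptor ed) const {
        new_cont.dist = old_cont.dist + g[ed].length;
        return new_cont.dist <= max_dist;
    }
};

// Dominance: one label dominates another if it is no longer.
struct ResDominance {
    bool operator()(const ResCont& a, const ResCont& b) const {
        return a.dist <= b.dist;
    }
};

void run(const Graph& g,
         boost::graph_traits<Graph>::vertex_descriptor s,
         boost::graph_traits<Graph>::vertex_descriptor t,
         double max_dist)
{
    std::vector<std::vector<boost::graph_traits<Graph>::edge_descriptor> > solutions;
    std::vector<ResCont> resources;
    boost::r_c_shortest_paths(g,
                              boost::get(&VertProp::num, g),
                              boost::get(&EdgeProp::num, g),
                              s, t,
                              solutions, resources,
                              ResCont(0.0),
                              ResExtension(max_dist),
                              ResDominance());
}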
--
Jeremiah Willcock