On Thu, 5 Aug 2010, Adam Spargo wrote:
ok, I'll try to give a better description.
When you sequence a genome, lots of short fragments come off the sequencing machine in a random order; the genome is covered with overlapping fragments to a considerable depth. The assembly task is then to put them back together in the correct order. It's kind of like doing a big jigsaw puzzle.
I'm following the overlap-layout-consensus model. You find all overlaps between fragments and draw a graph with fragments as vertices and overlaps as edges. Each edge is bidirectional, labelled with the sequences that overhang each end of the overlap. The bidirectionality reflects the double-stranded nature of DNA. The correct assembly is then a path through this graph.
For example, if you have overlapping fragments A and B, with overlap O(A,B), the edge between A and B has (A-B) in one direction and (B-A) in the other direction, where (B-A) is the part of B overhanging the overlap. Then A.(B-A) or B.(A-B), with '.' denoting concatenation, are valid assemblies of A and B.
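As a toy illustration (my own sketch, not anything from the actual code; overlap detection here is naive exact suffix-prefix matching, ignoring reverse complements and sequencing errors), the A.(B-A) merge looks like:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def assemble(a, b):
    """A.(B-A): fragment a followed by b's overhang past the overlap."""
    return a + b[overlap(a, b):]

A = "ACGTACGT"
B = "TACGTGGA"
print(overlap(A, B))   # 5, the shared "TACGT"
print(assemble(A, B))  # ACGTACGTGGA
```

The B.(A-B) direction would be the same computation after reverse-complementing, which this sketch leaves out.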
The problem is that there are many spurious overlaps - edges which look like true overlaps but which are due to repetitive sequences in the genome. There are also sequencing errors, but I'll deal with those in phase 2; I'm working on simulated data at present.
Traditional algorithms look for a Hamiltonian path in this graph. In practice, what happens is that you output the sequence along simple walks, leaving gaps between the 'contigs' at every cycle.
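By way of illustration (my sketch; the node names and branch handling are assumptions), "output the sequence on simple walks" amounts to emitting maximal non-branching chains: a chain stops at any branch point, and nodes trapped inside a cycle never get a starting point and are simply skipped:

```python
def contigs(edges):
    """Emit maximal non-branching chains ('contigs') from a directed graph.

    Chains start at any node that is not internal (internal means exactly
    one edge in and one edge out) and extend through internal nodes only.
    Pure cycles have no legal start node and are skipped entirely.
    """
    succ, indeg, outdeg, nodes = {}, {}, {}, set()
    for x, y in edges:
        succ.setdefault(x, []).append(y)
        outdeg[x] = outdeg.get(x, 0) + 1
        indeg[y] = indeg.get(y, 0) + 1
        nodes |= {x, y}

    def internal(v):
        return indeg.get(v, 0) == 1 and outdeg.get(v, 0) == 1

    paths = []
    for v in sorted(nodes):
        if not internal(v) and outdeg.get(v, 0) >= 1:
            for w in succ[v]:
                path = [v, w]
                while internal(path[-1]):
                    path.append(succ[path[-1]][0])
                paths.append(path)
    return paths

# A->B->C is a chain; C is a branch point (X also feeds into it),
# so three contigs come out instead of one long walk.
print(contigs([("A", "B"), ("B", "C"), ("C", "D"), ("X", "C")]))
```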
So the sequencing guys worked out a way to produce read-pairs: each fragment has a pair, and the distribution of the insert size between the pairs is not really known. For example, when they aim for 3000 nucleotides the mean could be something like 2000 or 4000, with a standard deviation of something like 600.
In general you have several libraries of read-pairs with different insert sizes.
What is currently done is to approximate the insert-size distribution on the contigs and then use this information to join contigs into scaffolds and attempt to fill in the gaps; similarly, contigs which contradict the read-pair information are broken up or discarded.
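As a concrete (entirely made-up) example of approximating the insert-size distribution on the contigs, using only pairs whose two reads happen to map within the same contig:

```python
import statistics

# Made-up implied insert sizes for pairs mapping within one contig,
# from a library that was aimed at 3000 nt.
observed = [2100, 1950, 2800, 2300, 1700, 2500, 2050, 2400]

mean = statistics.mean(observed)
sd = statistics.stdev(observed)
print(f"estimated insert size: {mean:.0f} +/- {sd:.0f} nt")
```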
What I want to do is to use the read-pair data on the graph itself, by removing infeasible edges, which hopefully correspond to spurious overlaps. If I have insert sizes longer than the repeats in the genome, I should be able to disambiguate all the tangles and be left with an unambiguous, acyclic path from which I can just read off the assembly, without the need for a layout algorithm at all.
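One simple shape for that feasibility test (a sketch under my own assumptions; the 3-standard-deviation cutoff and the numbers are illustrative, not part of the plan): a read pair supports a candidate edge only if the distance it implies is plausible under the library's insert distribution.

```python
def pair_feasible(implied_distance, mean, sd, k=3.0):
    """A read pair supports a placement if the implied insert size lies
    within k standard deviations of the library mean."""
    return abs(implied_distance - mean) <= k * sd

# With a nominal 3000 nt library that really centres near 2000 +/- 600:
print(pair_feasible(2500, mean=2000, sd=600))  # True
print(pair_feasible(9000, mean=2000, sd=600))  # False: edge looks spurious
```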
I plan to do what is called transitive reduction on the graph before looking at the read pairs: that is, if edges X->Y, Y->Z and X->Z all exist, remove X->Z. This retains just the maximal overlaps and kicks out the shorter ones that carry the same information.
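That step can be sketched as follows (a minimal single-pass version of my own, not BGL's implementation; it removes only edges shortcutting a two-step path, which is exactly the X->Y, Y->Z, X->Z case described):

```python
def transitive_reduce(edges):
    """Remove edge X->Z whenever X->Y and Y->Z also exist.

    Single pass over length-2 paths; assumes the subgraph formed by
    these edges is acyclic. A full transitive reduction would iterate
    over longer shortcuts as well.
    """
    succ = {}
    for x, y in edges:
        succ.setdefault(x, set()).add(y)
    reduced = set(edges)
    for x, y in edges:
        for z in succ.get(y, ()):
            reduced.discard((x, z))  # shortcut implied by x->y->z
    return reduced

# X->Y, Y->Z plus the redundant shortcut X->Z:
print(sorted(transitive_reduce({("X", "Y"), ("Y", "Z"), ("X", "Z")})))
# [('X', 'Y'), ('Y', 'Z')]
```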
There is a transitive reduction algorithm in BGL, and a better version on the way. I do not believe it handles weights, though, which it looks like you need. You might want to contact Eric Wolf.
I was just wondering if this problem fits a well-known SPPRC (shortest path problem with resource constraints), or am I in uncharted territory?
It's quite a worthwhile project, because the 'finishing' stage of sequencing a genome is the most expensive these days, yet we have the necessary information in the graph already: currently we keep the easy bits and throw away all the tangles, so the finishers never even get to look at the tangles.
Maybe that was an even worse description ...
At the moment I'm just seeing what the graph looks like for different classes of repeat which I insert into my simulation data, and trying to get my head around the problem. I'm unclear whether I should find the fundamental cycle basis and attack each cycle in turn with the pair information, or whether I could formulate this as an SPPRC and just call the BGL function.
I'm not expecting the answer to come out of the ether, but if anyone has any insights I'd be glad to hear them.
I'll think about this further and see if I have any insights. -- Jeremiah Willcock