On Thu, 5 Aug 2010, Adam Spargo wrote:
ok, I'll try to give a better description.
When you sequence a genome, lots of short fragments come off the sequencing machine in a random order; the genome is covered with overlapping fragments to a considerable depth. The assembly task is then to put them back together in the correct order. It's kind of like doing a big jigsaw puzzle.
I'm following the overlap-layout-consensus model. You find all overlaps between fragments and draw a graph with fragments as vertices and overlaps as edges. Each edge is bidirectional, labelled with the sequences that overhang each end of the overlap. The bidirectionality reflects the double-stranded nature of DNA. The correct assembly is then a path through this graph.
For example, if you have overlapping fragments A and B, with overlap O(A,B), the edge between A and B has (A-B) in one direction and (B-A) in the other direction, where (B-A) is the part of B overhanging the overlap. Then A.(B-A) or B.(A-B), with '.' denoting concatenation, are valid assemblies of A and B.
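As a toy illustration (my own sketch, not anything from the actual code; overlap detection here is naive exact suffix-prefix matching, ignoring reverse complements and sequencing errors), the A.(B-A) merge looks like:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def assemble(a, b):
    """A.(B-A): fragment a followed by b's overhang past the overlap."""
    return a + b[overlap(a, b):]

A = "ACGTACGT"
B = "TACGTGGA"
print(overlap(A, B))   # 5, the shared "TACGT"
print(assemble(A, B))  # ACGTACGTGGA
```

The B.(A-B) direction would be the same computation after reverse-complementing, which this sketch leaves out.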
The problem is that there are many spurious overlaps - edges which look like true overlaps but which are due to repetitive sequences in the genome. There are also sequencing errors, but I'll deal with those in phase 2; I'm working on simulated data at present.
Traditional algorithms look for a Hamiltonian path in this graph. In practice, what happens is that you output the sequence along simple walks, leaving gaps between the 'contigs' at every cycle.
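By way of illustration (my sketch; the node names and branch handling are assumptions), "output the sequence on simple walks" amounts to emitting maximal non-branching chains: a chain stops at any branch point, and nodes trapped inside a cycle never get a starting point and are simply skipped:

```python
def contigs(edges):
    """Emit maximal non-branching chains ('contigs') from a directed graph.

    Chains start at any node that is not internal (internal means exactly
    one edge in and one edge out) and extend through internal nodes only.
    Pure cycles have no legal start node and are skipped entirely.
    """
    succ, indeg, outdeg, nodes = {}, {}, {}, set()
    for x, y in edges:
        succ.setdefault(x, []).append(y)
        outdeg[x] = outdeg.get(x, 0) + 1
        indeg[y] = indeg.get(y, 0) + 1
        nodes |= {x, y}

    def internal(v):
        return indeg.get(v, 0) == 1 and outdeg.get(v, 0) == 1

    paths = []
    for v in sorted(nodes):
        if not internal(v) and outdeg.get(v, 0) >= 1:
            for w in succ[v]:
                path = [v, w]
                while internal(path[-1]):
                    path.append(succ[path[-1]][0])
                paths.append(path)
    return paths

# A->B->C is a chain; C is a branch point (X also feeds into it),
# so three contigs come out instead of one long walk.
print(contigs([("A", "B"), ("B", "C"), ("C", "D"), ("X", "C")]))
```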
So the sequencing guys worked out a way to produce read-pairs: each fragment has a pair, and the distribution of the insert size between the pairs is not really known. For example, when they aim for 3000 nucleotides the mean could be something like 2000 or 4000, with a standard deviation of something like 600.
In general you have several libraries of read-pairs with different insert sizes.
What is currently done is to approximate the insert-size distribution on the contigs and then use this information to join contigs into scaffolds and attempt to fill in the gaps; similarly, contigs which contradict the read-pair information are broken up or discarded.
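As a concrete (entirely made-up) example of approximating the insert-size distribution on the contigs, using only pairs whose two reads happen to map within the same contig:

```python
import statistics

# Made-up implied insert sizes for pairs mapping within one contig,
# from a library that was aimed at 3000 nt.
observed = [2100, 1950, 2800, 2300, 1700, 2500, 2050, 2400]

mean = statistics.mean(observed)
sd = statistics.stdev(observed)
print(f"estimated insert size: {mean:.0f} +/- {sd:.0f} nt")
```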
What I want to do is to use the read-pair data on the graph itself, by removing infeasible edges, which hopefully correspond to spurious overlaps. If I have insert sizes longer than the repeats in the genome, I should be able to disambiguate all the tangles and be left with an unambiguous, acyclic path from which I can just read off the assembly, without the need for a layout algorithm at all.
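One simple shape for that feasibility test (a sketch under my own assumptions; the 3-standard-deviation cutoff and the numbers are illustrative, not part of the plan): a read pair supports a candidate edge only if the distance it implies is plausible under the library's insert distribution.

```python
def pair_feasible(implied_distance, mean, sd, k=3.0):
    """A read pair supports a placement if the implied insert size lies
    within k standard deviations of the library mean."""
    return abs(implied_distance - mean) <= k * sd

# With a nominal 3000 nt library that really centres near 2000 +/- 600:
print(pair_feasible(2500, mean=2000, sd=600))  # True
print(pair_feasible(9000, mean=2000, sd=600))  # False: edge looks spurious
```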
I plan to do what is called transitive reduction on the graph before looking at the read pairs: that is, if edges X->Y, Y->Z and X->Z all exist, remove X->Z. This retains just the maximal overlaps and kicks out the shorter ones that carry the same information.
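That step can be sketched as follows (a minimal single-pass version of my own, not BGL's implementation; it removes only edges shortcutting a two-step path, which is exactly the X->Y, Y->Z, X->Z case described):

```python
def transitive_reduce(edges):
    """Remove edge X->Z whenever X->Y and Y->Z also exist.

    Single pass over length-2 paths; assumes the subgraph formed by
    these edges is acyclic. A full transitive reduction would iterate
    over longer shortcuts as well.
    """
    succ = {}
    for x, y in edges:
        succ.setdefault(x, set()).add(y)
    reduced = set(edges)
    for x, y in edges:
        for z in succ.get(y, ()):
            reduced.discard((x, z))  # shortcut implied by x->y->z
    return reduced

# X->Y, Y->Z plus the redundant shortcut X->Z:
print(sorted(transitive_reduce({("X", "Y"), ("Y", "Z"), ("X", "Z")})))
# [('X', 'Y'), ('Y', 'Z')]
```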
There is a transitive reduction algorithm in BGL, and a better version on the way. I do not believe it handles weights, though, which it looks like you need. You might want to contact Eric Wolf.
I was just wondering if this problem fits a well-known SPPRC (shortest path problem with resource constraints), or am I in uncharted territory?
It's quite a worthwhile project, because the 'finishing' stage of sequencing a genome is the most expensive these days, yet we have the necessary information in the graph already: currently we keep the easy bits and throw away all the tangles, so the finishers never even get to look at the tangles.
Maybe that was an even worse description ...
At the moment I'm just seeing what the graph looks like for different classes of repeat which I insert into my simulation data, and trying to get my head around the problem. I'm unclear whether I should find the fundamental cycle basis and attack each cycle in turn with the pair information, or whether I could formulate this as an SPPRC and just call the BGL function.
I'm not expecting the answer to come out of the ether, but if anyone has any insights I'd be glad to hear them.
I'll think about this further and see if I have any insights. -- Jeremiah Willcock