Bowtie and Single End Mapped Paired End Data
One of the researchers here has been working on an assembly of a reasonably small fungal genome, using short read technology with which I've been somewhat involved. It's been an educational project for me, so I'm glad I had the opportunity to contribute. One of the key elements of the paper, however, was the use of Paired End Tags (PET) from an early Illumina machine to assess the quality of the assembly.
Unfortunately, an early run of the software I'd written to analyze the Maq alignments of the paired ends to the assemblies had a bug, which made it look like the data supported the theory quite nicely - but alas, it was just a bug. The bug was fixed a few weeks ago, and a re-run of the data turned up something interesting:
If you use Maq alignments, you're biasing your alignments towards the smith-waterman aligned sequences, which are not an independent measure of the quality of the assembly. Not shocking? I know - it's pretty obvious.
Unfortunately, we hadn't put it in context before hand, so we had to find another way of doing this. We wanted to get a fairly exhaustive set of alignments for each short read - and we wanted to get it quickly, so we turned to Bowtie. While I wasn't the one running the alignments, I have to admit, I'm impressed with the quick performance of the aligner. Multi-matches work well, the file format is intuitive - similar to an eland file, and the quality of the information seems to be good. (The lack of a good scoring mechanism is a problem, but wasn't vital for this project.)
Anyhow, by performing a Single End Tag style alignment on PET data, while retaining multimatches, we were able to remove the biases of the aligners, and perform the pairings ourselves - learning more about the underlying assembly to which we were aligning. This isn't something you'd want to do often, unless you're assembling genomes fairly regularly, but it's quite cool.
Alas, for the paper, I don't think this data was good enough quality - there may not be a good story in the results that will replace the one we thought we had when the data was first run... but there were other good results in the data, and the journey was worth the trip, even if the destination wasn't all we'd hoped it would be.
As a sidenote, my code was run on an otherwise unused 16-way box, and managed to get 1592% CPU usage, at one point, while taking up 46Gb of RAM. (I should have processed each lane of reads separately, but didn't.) That's the first time I've seen CPU usage that high on my code. My previous record was 1499%. I should note that I only observe these scalings during the input processing - so the overal speed up was only 4x.
The code is part of the Vancouver Short Read Analysis Package, called BowtieToBedFormat.java, if anyone needs a process similar to this for Bowtie Reads. (There's no manual for it at the moment, but I can put one togther, upon request.)