Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: - Please come visit my blog there.

Monday, November 10, 2008

Bowtie and Single End Mapped Paired End Data

Strange title for a posting, but it actually makes sense, if you think about it.

One of the researchers here has been working on an assembly of a reasonably small fungal genome, using short read technology with which I've been somewhat involved. It's been an educational project for me, so I'm glad I had the opportunity to contribute. One of the key elements of the paper, however, was the use of Paired End Tags (PET) from an early Illumina machine to assess the quality of the assembly.

Unfortunately, an early run of the software I'd written to analyze the Maq alignments of the paired ends to the assemblies had a bug, which made it look like the data supported the theory quite nicely - but alas, it was just a bug. The bug was fixed a few weeks ago, and a re-run of the data turned up something interesting:

If you use Maq alignments, you're biasing your alignments towards the smith-waterman aligned sequences, which are not an independent measure of the quality of the assembly. Not shocking? I know - it's pretty obvious.

Unfortunately, we hadn't put it in context before hand, so we had to find another way of doing this. We wanted to get a fairly exhaustive set of alignments for each short read - and we wanted to get it quickly, so we turned to Bowtie. While I wasn't the one running the alignments, I have to admit, I'm impressed with the quick performance of the aligner. Multi-matches work well, the file format is intuitive - similar to an eland file, and the quality of the information seems to be good. (The lack of a good scoring mechanism is a problem, but wasn't vital for this project.)

Anyhow, by performing a Single End Tag style alignment on PET data, while retaining multimatches, we were able to remove the biases of the aligners, and perform the pairings ourselves - learning more about the underlying assembly to which we were aligning. This isn't something you'd want to do often, unless you're assembling genomes fairly regularly, but it's quite cool.

Alas, for the paper, I don't think this data was good enough quality - there may not be a good story in the results that will replace the one we thought we had when the data was first run... but there were other good results in the data, and the journey was worth the trip, even if the destination wasn't all we'd hoped it would be.

As a sidenote, my code was run on an otherwise unused 16-way box, and managed to get 1592% CPU usage, at one point, while taking up 46Gb of RAM. (I should have processed each lane of reads separately, but didn't.) That's the first time I've seen CPU usage that high on my code. My previous record was 1499%. I should note that I only observe these scalings during the input processing - so the overal speed up was only 4x.

The code is part of the Vancouver Short Read Analysis Package, called, if anyone needs a process similar to this for Bowtie Reads. (There's no manual for it at the moment, but I can put one togther, upon request.)

Labels: , , ,


Blogger morin.ryan said...

The more recent versions of Maq allow you to turn off Smith-Waterman alignment:

Usage: maq map [options] chr.bfa reads_1.bfq [reads_2.bfq]

-W disable Smith-Waterman alignment

I wonder if this is a more tractable option for the future?

November 10, 2008 4:27:00 PM PST  
Blogger Anthony said...

Hi Ryan,

I suppose there are two reasons why Maq isn't ideal: Maw is slower - and more importantly, maq won't do multiple alignments. The key to this analysis was to identify the potential forks where one end of the PET maps to more than one other contig. BY using bowtie, we found all of the forks - though the disappointing part of the analysis was just that it doesn't seem to help the assembly.

We were aware of the -W option, but didn't think it would have answered the core question of forks.

Ironically, by doing the pairs manually, you can see that most of the simple ones that are recoverable do look pretty good (same scaffold, right distance. etc.) The more complex (1-to-many or many-to-many) forks also contain interesting data, which is another post for another day. (=

November 10, 2008 4:53:00 PM PST  

Post a Comment

<< Home