Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Please come visit my blog there.

Monday, February 18, 2008

Aligning DNA - comments from above

I've been pretty bad about continuing my posts on how the different aligners work. It's a lot of work keeping up with them, since I seem to hear about a new one each week. However, a post-doc in my lab gave a presentation contrasting the various aligners and discussing each one's strengths and weaknesses for short (Illumina) read alignments.

Admittedly, I don't know how accurate the presenter's data was - most of the presentation was used to set up his own in-house aligner development, and thus all of the aligners were painted in a poor light, except his, of course. That being said, there's some truth to what he found: most of the aligners out there have some pretty serious issues.

Eland is still limited by its 32-base limit, which you'd think they'd have gotten past by now. For crying out loud, the company that produces it is trying to sell kits for doing 36-base alignments. It's in their best interest to have an aligner that does more than 32 bases. (Yes, they have a new work-around in their Gerald program, but it's hardly ideal.)

MAQ, apparently, has a weird "feature" that if multiple alignments are found, it just picks one at random as the "best". Hardly ideal for most experiments.
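
To make the complaint concrete, here is a minimal Python sketch of the behaviour I'd rather see: still report one hit, but flag the read as ambiguous so downstream code can decide whether to keep it. The `Hit` record and the `pick_hit` function are hypothetical names of my own, and this is an illustration only, not MAQ's actual algorithm.

```python
import random
from typing import NamedTuple, Optional, Sequence, Tuple


class Hit(NamedTuple):
    """A candidate alignment for one read (hypothetical record layout)."""
    chrom: str
    pos: int
    mismatches: int


def pick_hit(hits: Sequence[Hit],
             rng: Optional[random.Random] = None) -> Tuple[Optional[Hit], bool]:
    """Return (chosen_hit, is_unique).

    The hit with the fewest mismatches wins. If several hits tie for
    best, one is still chosen at random, but is_unique is False so the
    caller can filter the read out instead of trusting the placement.
    """
    if not hits:
        return None, False
    best = min(h.mismatches for h in hits)
    ties = [h for h in hits if h.mismatches == best]
    if len(ties) == 1:
        return ties[0], True
    chooser = rng if rng is not None else random
    return chooser.choice(ties), False
```

With something like this, an experiment that can't tolerate randomly placed reads just drops every read where the second value comes back `False`.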

Mosaik provides output in .ace files - which are useless for any further work, unless you want to reverse engineer converters to other, more reasonable, formats.

SOAP only aligns against the forward strand! (How hard can it be to map the reverse complement???)
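
For what it's worth, reverse-complementing a read takes only a few lines. Here is a minimal Python sketch; the function names are my own for illustration, and this says nothing about how SOAP or any other aligner works internally. It just shows that producing the second query per read is cheap.

```python
from typing import List, Tuple

# Complement table for DNA bases; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(read: str) -> str:
    """Return the reverse complement of a DNA read."""
    return read.translate(_COMPLEMENT)[::-1]


def strand_queries(read: str) -> List[Tuple[str, str]]:
    """Return the two sequences an aligner would need to look up:
    the read as sequenced ('+') and its reverse complement ('-')."""
    return [("+", read), ("-", reverse_complement(read))]
```

Searching both entries returned by `strand_queries` against a forward-strand index is one simple way to cover both strands, at the cost of doubling the number of lookups.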

Exonerate is great when run in "slow mode", at which point it's hardly usable for 40M reads, and when it's run in "fast mode", its results are hardly usable at all.

SHRiMP, I just don't know enough about to comment on.

And yes, even the post-doc's in-house aligner (called Slider) has some serious issues: it's going to miscall all SNPs, unless you're aligning fragments from the reference sequence back to itself. (That's not counting the 20 hours I've already put in to translate the thing to proper Java, patching memory leaks, and the like...)

Seriously, what's with all of these aligners? Why hasn't anyone stepped up to the plate and come up with a decent open-source aligner? There have got to be hundreds of groups out there who are struggling to make these work, and not one of them is ideal for use with Illumina reads. Isn't there one research group out there dog-fooding their own Illumina sequence aligner?

At this rate, I may have to build my own. I know what they say about software, though: You can have fast, efficient or cheap - pick any two. With aligners, it seems that's exactly where we're stuck.



Anonymous Jason Stajich said...

While I am all for people writing their own software, writing your own aligner seems more difficult than figuring out how to get what you want from a published (and used) format like ACE assembly files. Looking at the recent Hillier et al. paper, they end up using a combination of BLAT and Mosaik. It seems like an email to LaDeana or Gabor Marth's group could probably get you some ideas of some quick ways to extract what you need from the format? This seems easier than all the debugging of writing your own aligner from scratch...

I also think that while it is weird for MAQ to randomly choose one place for a read to align when there are multiple locations, the quality score of the SNP is really the important part. So to me, it makes sense to use it as a SNP calling pipeline, but not for a where-do-all-these-tags-map-to-the-genome question. I haven't tried the comparative assembly process, but that too seems like it wouldn't do so great using tags with multiple locations.

When trying to map the locations of a transcript piece from an EST library, it does seem a different tool is useful.

I would also still suggest taking existing software like SHRiMP for a spin before writing your own. The authors appear to have come up with a fast, optimized Smith-Waterman for short reads using SSE2 to align both colorspace reads (from ABI SOLiD) and short Solexa/Illumina reads. I've used it for our projects and it seems to really fly through the data and give us interpretable results in a dead-simple file format.

I've also heard the Carrington Lab at Oregon State has their own in-house aligner which is quite fast and doesn't have a 36-mer limit. Can't remember the name, something like CacheXOR. The algorithm seemed quite similar to SHRiMP in terms of how matches were seeded. I believe it is in production within their group, but may still be in development for free distribution. But they are open to users if you contact them.

What I am most curious about is how people are planning to do the statistics of gene expression comparison from the EST sequencing library approach. It made sense to me for the SAGE approach, but how do you get the overall expression for the gene (really, you want the per-transcript numbers)? Do you assemble and count the union of all tags across a transcript? Do you normalize that by length of the transcript? Do you only count 3' biased tags?

February 18, 2008 11:47:00 PM PST  
Blogger Mediocre said...

It is a bit hard for me to comment on these aligners from my position. However, I will say that all of these programs have some interesting bits.

I have not used SHRiMP, but from its README, I think it should be a wonderful piece of software developed by an extremely strong person. The algorithm behind it is also very attractive. Nonetheless, I am a bit worried that SHRiMP may not be fast enough for whole-human-genome alignment. (Also judged from its README.) Anyway, based on Jason's comments, I will definitely try it out.

As for MAQ, you might have overlooked the most important bit behind it: the mapping quality. It does randomly align reads, but you will know which reads are randomly aligned and which tend to be wrong. You can easily filter them out if you do not like them, and yet get more from the mapping quality.

I used to try SOAP. I think it also maps reads to the reverse strand. I can hardly imagine how forward-strand-only software could ever get published in Bioinformatics.

In GAPipeline-0.3.0, Eland has been updated, assisted by a few other programs. Eland now finds multiple hits and handles paired-end alignment a bit better. Another important improvement is that it, like MAQ, now gives mapping quality. Anyway, Eland is still the fastest of all. Although SOAP is claimed to be as fast as Eland, it is missing some interesting features found in Eland.

I surely believe you are very capable and I am happy to see another capable aligner!

(PS: You can delete the same comments sent by the anonymous a minute ago. Sorry for posting twice.)

February 19, 2008 11:58:00 AM PST  
Anonymous Anonymous said...

In the middle of all this serious discussion...
I am helping a student write a talk on Markov models in board games and in bioinformatics. We came across the brief article "Hidden Markov Models Made Easy". Did anyone write more on this approach to introducing the ideas?

Thanks for any advice, Terry Bisson

March 17, 2008 4:48:00 AM PDT  
Blogger Anthony said...

Hi Terry,

I wrote that piece back in undergrad - roughly a decade ago. As far as I'm aware, no one else picked up on the method I used, though the article itself was mirrored all over the place.

If you're interested in taking the idea further, or would like me to write on more of these topics, I'd be happy to discuss that. It would fit nicely into my blog, these days.



March 17, 2008 9:29:00 AM PDT  
Anonymous Anonymous said...

SeqMap ( - works like Eland, can do 3 or more bp mismatches and also indels

June 19, 2008 1:22:00 PM PDT  
Blogger Sparks said...

I've just finished writing a short read aligner. It uses qualities and handles inserts and deletes of up to 7bp per single-ended read, with speed comparable to Eland. There's also a paired-end module. All specific to Illumina. It's available for free download at if anyone wants to try it. Any feedback would be appreciated.

July 4, 2008 1:26:00 AM PDT  