
Tuesday, July 22, 2008

SNP calling from MAQ

With that title, you're probably expecting a discussion on how MAQ calls snps, but you're not going to get it. Instead, I'm going to rant a bit, but bear with me.

Rather than just use the MAQ snp caller, I decided to write my own. Why, you might ask? Because I already had all of the code for it, my snp caller has several added functionalities that I wanted to use, and *of course*, I thought it would be easy. Was it, you might also ask? No - but not for the reasons you might expect.

I spent the last 4 days doing nothing but working on this. I thought it would be simple to just tie the elements together: I have a working .map file parser (don't get me started on platform dependent binary files!), I have a working snp caller, I even have all the code to link them together. What I was missing was all of the little tricks, particularly the ones for intron-spanning reads in transcriptome data sets, and the code that links together the "kludges" with the method I didn't know about when I started. After hacking away at it, bit by bit things began to work. Somewhere north of 150 code commits later, it all came together.

If you're wondering why it took so long, it comes down to three things:

1. I started off depending on someone else's method, since they came up with it. As is often the case, that person was working quickly to get results, and I don't think they had the goal of writing production-quality code. Since I didn't have their code (though, honestly, I didn't ask for it either, since it was in perl, which is another rant for another day) it took a long time to settle all of the off-by-one, off-by-two and otherwise unexpected bugs. They had given me all of the clues, but there's a world of difference between being pointed in the general direction of your goal and having a GPS to navigate you there.

2. I was trying to write code that would be re-usable. That's something I'm very proud of, as most of my code is modular and easy to re-purpose in my next project. Halfway through this, I gave up: the code for this snp calling is not going to be re-usable. Though, truth be told, I think I'll have to redo the whole experiment from the start at some point because I'm not fully satisfied with the method, and we won't be doing it exactly this way in the future. I just hope the change doesn't happen in the next 3 weeks.

3. Namespace errors. For some reason, every single person has a different way of addressing the 24-ish chromosomes in the human genome. (Should we include the mitochondrial genome in that count?) I find myself building functions that strip and rename chromosomes all the time, using similar rules. Is the mitochondrial genome an "MT" or just "M"? What case do we use for "X" and "Y" (or is it "x" and "y"?) in our files? Should we prepend "chr" to our chromosome names? And what on earth is "chr5_random" doing as a chromosome? This is even worse when you need to compare two active indexes, plus the strings in each read... bleh.
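The chromosome-renaming functions mentioned in point 3 can be sketched roughly like this. The specific rules here (strip a "chr" prefix, canonicalize the mitochondrial genome to "MT", drop the "_random" pseudo-chromosomes) are illustrative assumptions, not the convention any particular pipeline at the GSC actually uses:

```python
def normalize_chromosome(name):
    """Map the many aliases for a human chromosome to one canonical form.

    Returns None for unplaced-contig pseudo-chromosomes like "chr5_random".
    """
    name = name.strip()
    # Strip an optional "chr" prefix ("chrX" -> "X").
    if name.lower().startswith("chr"):
        name = name[3:]
    # Skip unplaced-contig pseudo-chromosomes.
    if name.endswith("_random"):
        return None
    upper = name.upper()
    # Canonicalize the mitochondrial genome: "M" and "MT" both become "MT".
    if upper in ("M", "MT"):
        return "MT"
    # Everything else ("1".."22", "X", "Y") is just upper-cased.
    return upper
```

Every tool then gets its comparisons done in the one canonical namespace, rather than each function carrying its own ad hoc renaming rules.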

Anyhow, I fully admit that SNP calling isn't hard to do. Once you've read in all of your sequences, determined which bases are worth keeping (using their prb quality scores), and applied a minimum level of coverage and a minimum number of variant bases needed to call a SNP, there's not much left to do. I check it all against the Ensembl database to determine which ones are non-synonymous, and then: tada, you have all your SNPs.
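The filtering steps just listed can be condensed into a toy per-position caller. The thresholds below (base quality ≥ 20, coverage ≥ 4, at least 2 variant bases) are arbitrary placeholders for illustration, not the values used in my pipeline, and the Ensembl synonymous/non-synonymous check is left out entirely:

```python
from collections import Counter

MIN_BASE_QUALITY = 20   # placeholder thresholds, not my real settings
MIN_COVERAGE = 4
MIN_VARIANT_BASES = 2

def call_snp(reference_base, observed):
    """Call a SNP at one position from a list of (base, quality) pairs.

    Returns the variant base if one is called, otherwise None.
    """
    # Keep only bases whose quality score clears the cutoff.
    kept = [base for base, quality in observed if quality >= MIN_BASE_QUALITY]
    # Require a minimum depth of usable coverage.
    if len(kept) < MIN_COVERAGE:
        return None
    # Count the bases that disagree with the reference.
    variants = Counter(base for base in kept if base != reference_base)
    if not variants:
        return None
    variant, count = variants.most_common(1)[0]
    # Require a minimum number of supporting variant bases.
    return variant if count >= MIN_VARIANT_BASES else None
```

As the post says, there's not much to it once the alignments are in hand; the real work is upstream.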

However, once you've done all of this, you realize that the big issue is that there are now too many SNP callers, and everyone and their pet dog is working on one. There are several now in use at the GSC: mine, at least one custom one that I'm aware of, one built into an aligner (Bad Idea(tm)) under development here, and the one tacked on to the swiss army knife of aligners and tools: MAQ. Do they all give different results, or is one better than another? Who knows. I look forward to finding someone who has the time to compare, but I really doubt there's much difference beyond the alignment quality.

Unfortunately, because the aligner field is still immature, there is no single file output format that's common to all aligners, so the comparison is a pain to do - which means it's probably a long way off. That, in itself, might be a good topic for an article, one day.



Blogger William said...

Yet another interesting post. I think I'll just be commenting up your blog today.

I'd like to put a bug in your ear about not just SNP calling, but indel calling with short read sequencing. One short read aligner that I am aware of, SOAP, seems to do a good job calling not only SNPs but short indels (although computational time starts to get prohibitive for indels of significant size). I'm aware that SNP calling is sort of the hot topic of the day, but I suspect that indel markers could become just as important as sequencing becomes a viable alternative to arrays. To my knowledge no packages (as opposed to the plethora of packages for SNPs) currently exist to process such data in an appropriately nuanced way.

July 23, 2008 9:42:00 AM PDT  
Blogger Anthony said...

Hi William,

That's an interesting point, actually. I've tried doing the same thing with Exonerate, but never really had much success with it. Gapped alignments are such a tricky affair with short reads that I'm really hesitant to even delve into them while our reads are under about 50bp.

On the other hand, we're expecting to get ~70 bp reads by the end of the year, if the Illumina predictions come true, which is more than long enough to do short gapped alignment without making too many assumptions.

Still, I think the real underlying issue is that all of this is just about the quality of the alignments. Once you know where the reads align, all of this becomes VERY easy. All that's left are the details (e.g. scoring gaps is one of those things that no one is talking about yet...) and tweaking the code.
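On the gap-scoring detail flagged above: one standard choice (an assumption on my part, not what MAQ, SOAP, or any specific short-read aligner uses) is the affine gap penalty, where opening a gap costs more than extending it, so one contiguous indel scores better than the same number of scattered single-base gaps:

```python
GAP_OPEN = 5     # illustrative penalty for opening a gap
GAP_EXTEND = 1   # illustrative penalty per additional gapped base

def gap_penalty(length):
    """Affine penalty for a gap of `length` bases: open once, extend the rest."""
    if length <= 0:
        return 0
    return GAP_OPEN + GAP_EXTEND * (length - 1)
```

Under this scheme a single 3-base deletion (penalty 7) is cheaper than three separate 1-base deletions (penalty 15), which matches the biological intuition that one indel event is likelier than three independent ones.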

Have you any idea how much time it takes for SOAP to do a gapped alignment for a constant number of reads? If it's not too bad, I could give this a try.


July 23, 2008 10:47:00 AM PDT  
Anonymous Anonymous said...

On indel calling, I agree more with Anthony: for the human genome, short indel detection is much messier than most people have thought. With single-ended ~35bp reads, I would not trust most indel calls.

SNP calling is easy. The hard part is to achieve the best accuracy. There are many good read aligners, but there are relatively few SNP callers which are ready for end users. This is kind of telling.

July 23, 2008 1:34:00 PM PDT  
Blogger Anthony said...

Thanks for the comment, Anonymous.

In some ways I agree with you, but I think I'd refine your argument a little further.

I used Eland until recently, switched to using MAQ, and now I'm considering jumping ship to novoalign - or something else. All the while, I still use Exonerate for the occasional job. This suggests we're still at the stage of "good enough" aligners - and with the number of aligners still being developed, the field probably has a lot more consolidation left to go.

Where I diverge from your comment is in where the difficulty lies. SNP and indel calling is nothing more than processing the results of the alignment and looking for oddities. Contrary to what you suggest, the accuracy always comes from the aligner, not the SNP caller. I could give you a SNP caller now, and it won't change dramatically from the SNP caller in use two years from now, with the exception of adding in indel calling as the aligners handle that better.

Still, with Pacific Biosciences on the horizon, is any of the work we're doing now really going to be useful in 3 years?

Food for thought.

July 23, 2008 1:50:00 PM PDT  
OpenID stowaway-geek said...

Jim Kent's BLAT isn't specifically designed for this sort of thing, but it can do this. It's slower than ELAND, clearly, but if you're only mapping the things that didn't match before, this is less of a problem.

You need the options "-fine" and "-minScore=15", and will probably need to try different values for "-maxIntron". (And maybe the "-stepSize" option needs tweaking as well.)
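Putting those options together, the invocation might look something like the following. The genome file, reads file, and the -maxIntron/-stepSize values here are placeholders to be tuned per experiment, not tested settings:

```shell
# Sketch only: file names and numeric values are placeholders.
# -fine        search harder, suited to short high-quality queries
# -minScore=15 lower the minimum alignment score so short reads can pass
blat -fine -minScore=15 -maxIntron=100000 -stepSize=5 \
    hg18.2bit unmapped_reads.fa unmapped_reads.psl
```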

And such short reads are apt to be unreliable without some filtering. But as people said here, longer reads should help...

-- Josh Burdick

July 24, 2008 3:11:00 PM PDT  
Blogger Anthony said...

Hi Stowaway,

That's a good point - this stuff *can* always be done - there are plenty of aligners that will do it for you. (BLAT is a good example of one that definitely will work. I suspect SOAP and Exonerate are also viable alternatives.)

There are two potential problems with your suggestion: Can you really assume that everything that maps to a genomic location is correctly mapped, and would not map equally well or better with a gapped alignment? And would you really trust a gapped alignment with 42 bases or less?

For my data sets, which are cancer cell lines, I don't think I'm ready to take those blind leaps of faith; however, that's probably not an issue for less divergent samples (e.g. sequencing a lab strain of C. elegans and mapping it to another lab strain).

What it boils down to, I guess, is that I need to add one more thing to my list, which has come up for me several times today. The more divergent your reference sequence is from your sample, the less you can trust your alignments, regardless of the aligner you use.

July 24, 2008 3:45:00 PM PDT  
