Thursday, August 14, 2008

Indel Calling

William left an interesting comment yesterday, which I figured was worthy of its own post, albeit a short one. (Are any of my posts really short, tho?) His point was that everyone in the genomics field is currently paying a lot of attention to SNPs, and very little attention to INDELs (Insertions and Deletions). And, I have to admit - this is true. I'm not aware of anyone really trying to do indels with Solexa reads, yet. But there are very good reasons for this.

The first is a lack of tools - most aligners aren't doing gapped alignments. Well, there's Exonerate and ZOOM, and possibly BLAST, but all of those have their problems. (Exonerate's gapped mode is very slow and poorly supported, and we had a very difficult time making some versions work at all; BLAST is infeasible for short reads on a reasonable time scale and isn't guaranteed to give the right answer; and ZOOM just hasn't been released yet. If there are others, feel free to let me know.) Nothing currently available will really do a good gapped short read alignment. And, if you've noticed the key words, "short read", you're on to the real reason why no one is currently working with indels.

Yep - the reads are just too short. If you have a 36bp read, or even, say a 42bp read, you're looking at (best case) having your indel right in the middle of the sequence, giving you two 21-base sequences, one on each side of the gap. Think about that for a moment and let that settle in. How much of the genome is unique for 21bp reads, which may have 2 or more SNPs or sequencing errors? I'd venture to say it's 60% or so. With the 36 base pair read, you're looking at two 18-bp reads, which is more like 40-50% of the genome. (Please don't quote me on those numbers - they're just estimates.) And that's best case.
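
If you'd rather measure uniqueness than take my guess for it, here is a minimal Python sketch of the idea. It counts the fraction of positions in a sequence whose k-mer occurs exactly once. The function name `unique_kmer_fraction` is just my own illustration, and the sketch only counts exact matches on one strand, so it ignores SNPs, sequencing errors and the reverse complement. Treat it as a toy for small sequences, not a way to reproduce the percentages above.

```python
from collections import Counter


def unique_kmer_fraction(sequence: str, k: int) -> float:
    """Fraction of k-mer start positions whose k-mer occurs exactly once.

    Assumption: exact matching on the forward strand only, so mismatches
    and reverse complements are not considered.
    """
    n_positions = len(sequence) - k + 1
    if k <= 0 or n_positions <= 0:
        raise ValueError("k must be positive and no longer than the sequence")
    # Count how often each k-mer appears anywhere in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(n_positions))
    # A k-mer seen once corresponds to exactly one uniquely placed position.
    unique_positions = sum(1 for count in counts.values() if count == 1)
    return unique_positions / n_positions
```

On any real genome you would want something far more memory-efficient, but the trend is the same: the shorter the k-mer, the smaller the unique fraction.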

If your gap is closer to the end, you'll get something more like a 32bp read and a 10bp read.... and I wouldn't trust a 10bp seed to give the correct match against the genome no matter what aligner you've got - especially if it comes from the "poor" end of an Illumina sequence.
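
The arithmetic here is simple enough to write down. In this small Python sketch, `flank_lengths` and `seeds_usable` are names I made up for illustration, and the `min_seed` threshold is an assumption you would pick yourself, not a property of any particular aligner.

```python
def flank_lengths(read_length: int, gap_offset: int) -> tuple[int, int]:
    """Lengths of the two pieces of a read split by an indel.

    gap_offset is the number of read bases to the left of the gap.
    """
    if not 0 < gap_offset < read_length:
        raise ValueError("the gap must fall strictly inside the read")
    return gap_offset, read_length - gap_offset


def seeds_usable(read_length: int, gap_offset: int, min_seed: int) -> bool:
    """True if both flanks are at least min_seed bases long.

    min_seed is a hypothetical cutoff for a flank you would trust as a seed.
    """
    left, right = flank_lengths(read_length, gap_offset)
    return min(left, right) >= min_seed


print(flank_lengths(42, 21))  # (21, 21): best case for a 42bp read
print(flank_lengths(36, 18))  # (18, 18): best case for a 36bp read
print(flank_lengths(42, 32))  # (32, 10): gap near the end of the read
```

With a cutoff of, say, 18 bases, only gaps in the middle 6 positions of a 42bp read leave two usable seeds, and a 36bp read has to have its gap dead centre.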

So that leaves you with two options: use a paired end tag, or use a longer read.

Paired end tags (PET) have been around for a couple of months now. We're still trying to figure out the best way of using the technology, but it's coming. People are mostly interested in using them for other applications - gross structural abnormalities, inversions, duplications, etc. - but indels will be in there. It should be a few more months before we really see some good work done with PETs in the literature. I know of a couple of neat applications already, but a lot of the difficulty was just getting a good PET aligner going. Maq is there now, and it does an excellent job, although post-processing the .map files for PET is not a lot of fun. (I do have software for it, tho, so it's definitely a tractable problem.)

Longer reads are also good. Once reads are long enough that a gap leaves enough bases on either side to do double-seed searches, we'll get fast indel-capable aligners - and I'm sure they're coming. There are long reads being attempted this week at the GSC. I don't know anything about them, or the quality, but if they work, I'd expect to see a LOT more sequences being generated like this in the future, and a lot more attention being paid to indels.
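
To make "double-seed search" a bit more concrete, here is a toy Python sketch of the idea as I think of it. It is not how any existing aligner works. It looks up the first and last `k` bases of a read in an exact-match index of the reference, and takes the difference between the observed and expected seed spacing as the indel size. The names `build_index` and `double_seed_candidates`, the `max_indel` cutoff and the tiny reference string are all made up for illustration, and the sketch ignores mismatches, the reverse strand and where exactly the gap falls.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def double_seed_candidates(read, reference, k, max_indel=10):
    """Return (left_seed_position, indel_size) candidates for one read.

    indel_size > 0: reference bases missing from the read (deletion).
    indel_size < 0: extra bases in the read (insertion).
    indel_size == 0: the two seeds are spaced as in an ungapped match.
    """
    if len(read) < 2 * k:
        raise ValueError("read must be long enough for two non-overlapping seeds")
    index = build_index(reference, k)
    left_seed, right_seed = read[:k], read[-k:]
    # Distance between the two seed starts if the read had no indel.
    expected_spacing = len(read) - k
    candidates = []
    for left_pos in index.get(left_seed, []):
        for right_pos in index.get(right_seed, []):
            indel_size = (right_pos - left_pos) - expected_spacing
            if abs(indel_size) <= max_indel:
                candidates.append((left_pos, indel_size))
    return candidates


reference = "ACGTTGCAAGGCTTAACCGGATCC"
# Read with reference bases 10-12 removed, i.e. a 3 base deletion.
read = reference[2:10] + reference[13:21]
print(double_seed_candidates(read, reference, k=6))  # [(2, 3)]
```

The catch is the same one as above: both seeds have to be long enough to land uniquely, which is exactly why this only becomes practical with longer reads.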

So, I can only agree with William: we need to pay more attention to indels, but we need the technology to catch up first.

P.S. For 454 fans out there, yes, you do get longer reads, but I think you also need a lot of redundancy to show that the reads aren't artifacts. As 454 ramps up its throughput, we'll see both the Solexa and 454 platforms converge towards better data for indel studies.
