### Aligning DNA - Eland

Continuing in my series of blog articles on short-read sequencing, it's time for a little bit of a discussion on the first short-sequence aligner that I'd ever met: Eland. The problem with writing anything about Eland, is that 90% of the information I have on it can't be confirmed independently. In fact, the majority of what I know comes from the header of the ELAND.cpp file. I'll admit, while I've gone through the source code a couple of times, I haven't spent much time getting to know it's well - I've processed billions of lines of Eland files, but have yet to run it, myself...

Anyhow, here's what I can tell you: Eland stands for "Efficient Local Alignment of Nucleotide Data" and was written by Anthony J. Cox for Solexa Ltd. Of course, Solexa has since been bought out by Illumina, so I assume the copyright has been transferred along with the technology during that transaction.

What makes Eland so fascinating to me is that it's a direct departure from the dynamic programming based algorithms (like the venerable Smith-Waterman alignment), making it somewhat interesting as a computational program.

The basic algorithm works along these lines: Given a sequence of length N, it can be divided into four subsequences (A, B, C, and D), which are of equal (or nearly equal length). Assuming there are no more than 2 errors, at least two of these subsequences will be "error free", so that the two error free sequences can then be searched for in a database containing all possible subsequences in the genome of interest. Thus, you can search your database for the subsequence AB and CD. Of course, you don't know which subsequences are the error free ones, so you need to try all possible combination.

What Eland does to speed things up is to combine these subsequences into sets of two. Thus, rather than searching for 4 independent entries in their database, they simply have to search for 2 sets of two, and as long as one set matches, you can use looser matching criteria on the other two, ie, allowing for mismatches. That is to say, if you make two sequences out of the 4 subsequences {AB and CD}, you can search for an exact match for AB, and test all the possible results for mismatches in the {CD} portion of the sequence. This has the effect of significantly reducing the search space. (Only sequences containing the exact match to {AB}.)

Searching for the {AB and CD} subsequences would only work if the first half of your sequence has no errors. What if B and C had the errors? The simple answer is to shuffle your subsequences to make other combinations: ({AB and CD}, {AC and BD}, {AD and CD}, {BA and CD}, etc.) This still provides you with a relatively small search space for each sequence, as there are only 4! possible combinations (which is 4x3x2x1 = 24 possible sequences) to search for.

Fortunately, you can bound this even further. You always know that you want sequences in which the first pair and second pair are in the correct order, (ie {AB and CD} and {AC and BD} are ok, but {CA and DB} and {BA and DC} would give you an incorrect result) limiting you to only six possible combinations, which still allows you to find any correct match where at least two of the four subsequences are error free.

That, in general is how it works.. with a few nifty shortcuts. ie, they only need to pass over the genome search space 3 times to find all possible matches with up to two errors, because they can search for subsequence combinations simultaneously and efficiently. In other words, you get a very quick alignment.

On the other hand, the way in which ELAND was implemented is very important. The Eland aligner has some severe restrictions, some of which are algorithm specific, some of which are implementation specific:

I'm sure I could go on and list other disadvantages, but I'd like to point out the main advantage: It's FAST. Blazing fast, compared to the other aligners. From my own personal experience, Eland may only be able to map 60-70% of your sequences to the genome (compared to 70-80% for other aligners with which I have less experience), but it does so in a few hours, compared to several days or weeks for most other aligners (or years, compared to Blast). (Comparisons are for a single processor, matching 1 million+ reads to a mamalian size genome.) The only other aligner that can come close to this speed is, to the best of my knowledge, SXOligoSearch, produced by Synamatix, though that beast will cost upwards of $225,000CAD to license.

Anyhow, to close this discussion, it's worth mentioning that rumours of a new version of Eland have been circulating since March 2007. Beta versions, capable of performing searches on sequences larger than 32 base pairs in length and able to perform multi-matches, were supposed to be released in September 2007. Unfortunately, the new betas appear to be vapourware, which has prompted us to look into several other aligners, which I'll discuss in future posts.

(2008-01-28 Update: I stand corrected: I've heard that the new betas have finally arrived! I haven't seen what they can do yet, but at least we know my prediction that they're vapourware is wrong.)

Anyhow, here's what I can tell you: Eland stands for "Efficient Local Alignment of Nucleotide Data" and was written by Anthony J. Cox for Solexa Ltd. Of course, Solexa has since been bought out by Illumina, so I assume the copyright has been transferred along with the technology during that transaction.

What makes Eland so fascinating to me is that it's a direct departure from the dynamic programming based algorithms (like the venerable Smith-Waterman alignment), making it somewhat interesting as a computational program.

The basic algorithm works along these lines: Given a sequence of length N, it can be divided into four subsequences (A, B, C, and D), which are of equal (or nearly equal length). Assuming there are no more than 2 errors, at least two of these subsequences will be "error free", so that the two error free sequences can then be searched for in a database containing all possible subsequences in the genome of interest. Thus, you can search your database for the subsequence AB and CD. Of course, you don't know which subsequences are the error free ones, so you need to try all possible combination.

What Eland does to speed things up is to combine these subsequences into sets of two. Thus, rather than searching for 4 independent entries in their database, they simply have to search for 2 sets of two, and as long as one set matches, you can use looser matching criteria on the other two, ie, allowing for mismatches. That is to say, if you make two sequences out of the 4 subsequences {AB and CD}, you can search for an exact match for AB, and test all the possible results for mismatches in the {CD} portion of the sequence. This has the effect of significantly reducing the search space. (Only sequences containing the exact match to {AB}.)

Searching for the {AB and CD} subsequences would only work if the first half of your sequence has no errors. What if B and C had the errors? The simple answer is to shuffle your subsequences to make other combinations: ({AB and CD}, {AC and BD}, {AD and CD}, {BA and CD}, etc.) This still provides you with a relatively small search space for each sequence, as there are only 4! possible combinations (which is 4x3x2x1 = 24 possible sequences) to search for.

Fortunately, you can bound this even further. You always know that you want sequences in which the first pair and second pair are in the correct order, (ie {AB and CD} and {AC and BD} are ok, but {CA and DB} and {BA and DC} would give you an incorrect result) limiting you to only six possible combinations, which still allows you to find any correct match where at least two of the four subsequences are error free.

That, in general is how it works.. with a few nifty shortcuts. ie, they only need to pass over the genome search space 3 times to find all possible matches with up to two errors, because they can search for subsequence combinations simultaneously and efficiently. In other words, you get a very quick alignment.

On the other hand, the way in which ELAND was implemented is very important. The Eland aligner has some severe restrictions, some of which are algorithm specific, some of which are implementation specific:

- The Eland aligner can only align sequences up to 32 base pairs long. (Implementation specific: they use four integers to represent the subsequences {A,B,C and D}, which reduces the memory requirements.)
- The Eland aligner can only align sequences with up to two mismatches. (Algorithm specific: at least two of the keys must match, although the algorithm could be expanded to use more subsequences, at the expense of expanding the sequence space.)
- The Eland aligner only returns sequences that match to one unique location in the genome (Implementation specific: Eland already tells you when a sequence matches to more than one location, it just doesn't tell you what those locations are.)
- The Eland aligner can do mismatches in the sequence, but can not handle insertions or deletions (i.e, gapped alignments). (Algorithm specific: The algorithm for matching the keys could be expanded to include this, but it would be very costly in time/memory.)
- The code is not freely available. (Implementation specific: I don't know of anywhere you can go to get the code... or documentation!)
- Changing the length of the alignment sequence (I called it N, above), requires a recompile. (Implementation, obviously. Why this isn't a command line parameter, I don't know.)

I'm sure I could go on and list other disadvantages, but I'd like to point out the main advantage: It's FAST. Blazing fast, compared to the other aligners. From my own personal experience, Eland may only be able to map 60-70% of your sequences to the genome (compared to 70-80% for other aligners with which I have less experience), but it does so in a few hours, compared to several days or weeks for most other aligners (or years, compared to Blast). (Comparisons are for a single processor, matching 1 million+ reads to a mamalian size genome.) The only other aligner that can come close to this speed is, to the best of my knowledge, SXOligoSearch, produced by Synamatix, though that beast will cost upwards of $225,000CAD to license.

Anyhow, to close this discussion, it's worth mentioning that rumours of a new version of Eland have been circulating since March 2007. Beta versions, capable of performing searches on sequences larger than 32 base pairs in length and able to perform multi-matches, were supposed to be released in September 2007. Unfortunately, the new betas appear to be vapourware, which has prompted us to look into several other aligners, which I'll discuss in future posts.

(2008-01-28 Update: I stand corrected: I've heard that the new betas have finally arrived! I haven't seen what they can do yet, but at least we know my prediction that they're vapourware is wrong.)

Labels: Aligners, DNA, Sequencing, Solexa/Illumina

## 10 Comments:

You may want to take a look at SHRiMP which was developed for both colorspace matching from SOLiD and for short-read technology.

http://compbio.cs.toronto.edu/shrimp/

Apparently the SOAP program deals with the shortcomings of ELAND fantastically. It handles repeat hits (and can report them all, none or one random). And it can align with in-dels up to 3bp. Tried it just now. All works.

http://soap.genomics.org.cn/ and

http://www.ncbi.nlm.nih.gov/pubmed/18227114

Hello

I found your blog whilst searching for details about the way that Eland actually works.

I understand your description of the algorithm but what interests me is the data structure that is used to hold the reads (in RAM) and which the algorithm searches over (or in SOAPs case the reference). Do you know how this is done?

Hi Malarky,

I don't know how happy Illumina/Solexa or Anthony Cox would be if I gave out all of the details of their implementation, so I'm going to refrain from giving too much detail. Though the algorithm is relatively involved, using look up tables and such, they appear to be using a suffix tree to hold the oligos.

I hope that helps.

Anthony

Hi

Thats good enough. Thanks

Hi Anthony, I just discovered your blog and it looks very interesting to me! Since this article on Eland is now more than one year old, I was wondering if the description at point 3 about multi matching locations is still applicable to the Eland program in the Illumina pipeline 1.3. More in general, would you trust the multi matching locations extracted from the multi_eland output files to perform a repeat enrichment analysis over an experiment of ChIP-seq? If no, why? Thank you in advance for your attention

Have you taken a look at Bowtie http://bowtie-bio.sourceforge.net/index.shtml . It is the fastest aligner out there and does much more than ELAND can. And its open source.

Hi Anshul,

This article is pretty old, so it really doesn't reflect the current state of the aligner "wars".

I have tried an early version of Bowtie, and it was fast, but it wasn't particularly robust. I have heard that it has improved dramatically recently, but the general consensus seems to be that people are jumping over bowtie to BWA. (Written by Heng Li.)

I don't see any advantage to bowtie over BWA, except that it produces .map format files. (And of course, .map is being phased out in favour of SAM/BAM files.)

Cheers

Ah! Sorry I didn't notice that the post was old. Thanks for pointing out BWA. Need to take a look at it.

No worries,

I've just updated the comment so that they show the date as well. Hopefully that should clear up any confusion for other commenters too. (=

Post a Comment

<< Home