Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Friday, May 29, 2009

Science Cartoons - 2

Comic #2/5. This one, obviously, is about aligners. I've added the copyright on the far right this time. If I expect people to respect my copyright, I really do need to put it on there, don't I?


Wednesday, January 30, 2008

Comments on de novo assembly

First off, I'd like to say thanks to Paul and stajich for their comments: Paul for raising several points that I'd like to discuss tonight, and stajich for bringing my attention to the SHRiMP aligner. Obviously, I haven't had much chance to look at SHRiMP yet, but I'll definitely get around to it.

So, Paul mentioned several things that are worth discussing:

The Velvet roadmap for mRNA could be interesting. Sometimes the intron is snipped, sometimes it isn't, get a nice little bubble. And the similar transcripts will do... interesting things.

Short reads would be fine for de-novo if DNA was completely random. Pity that it isn't.


For the most part, I agree with Paul: using Velvet on a transcriptome would be pretty cool. Yes, you'd get bubbles, but you'd also get a lot of nifty information about new transcripts, or intermediate forms of transcripts. This neatly avoids one of the major hurdles I currently face when working with transcriptomes: I can either align against a genome or against a list of known genes, and neither one really gives me much opportunity to identify new genes. (Of course, I'm working on doing just that with a colleague of mine, but I'm saving that for another post.)

Where I disagree with Paul, however, is with his final statement: that short reads would be fine for de novo assembly if they were truly random. I have two problems with that: the first is mappability (or sequenceability), and the second is the short nature of the reads themselves.

Mappability has been defined in many different ways, but for general purposes it's the property that a read sequenced from a given location on a chromosome can be unambiguously aligned back to that location. Fortunately, with the 36-mer reads produced by the Illumina 1G, something like 70-80% of the genome is mappable. Not 100% mappable, but mappable to some degree. This may seem trivial, but it's important.
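For the curious, here's a toy version of that definition in Python: call a position mappable if the k-mer starting there occurs exactly once in the genome. (This is my own minimal sketch, ignoring the reverse strand and sequencing errors, not anyone's production code.)

```python
from collections import Counter

def mappability(genome: str, k: int = 36) -> float:
    """Fraction of k-mer start positions whose k-mer occurs exactly
    once in the genome, i.e., aligns back unambiguously."""
    n_positions = len(genome) - k + 1
    counts = Counter(genome[i:i + k] for i in range(n_positions))
    unique = sum(1 for i in range(n_positions)
                 if counts[genome[i:i + k]] == 1)
    return unique / n_positions

# Toy example: the repeated ACGT stretch makes 4 of 13 positions
# ambiguous, so mappability is 9/13 (~0.69), not 100%.
print(mappability("ACGTACGTAAGGCCTT", k=4))
```

On a real genome you'd also count hits on the reverse complement and use a proper k-mer index, but the principle is the same.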

It's important because a genome with 95% mappability doesn't mean you get a single contig covering 95% of the genome, plus a chunk of 5% of your reads that don't cover your genome. It's more like every 20-100 kb you'll find something that you just can't assemble across. (I'm guesstimating that number, by the way.) This means you get lots of small contigs with no overlap. That is, of course, assuming you had enough highly mappable reads to do a relatively error-free assembly, and, also of course, assuming your sequencing errors haven't completely interfered with your assembly. I don't think either can really be taken for granted.

In reality, the mappability of a genome isn't a property you can figure out until you've sequenced it, so I don't want to discourage anyone from trying. I'm aware of people working on using Illumina reads to do this, although they've supplemented the reads with BAC sequencing, ESTs and various other options, which will provide a nice scaffolding. This approach allows them to augment their sequencing power by obtaining a lot of coverage through the Illumina runs, rather than having to use those reads as the basis of a serious de novo assembly - which seems wise to me. (And even then, they're only doing a fungus - not a higher vertebrate!)

And to return to my earlier point about the mappable genome not being 100% mappable, this is where I think Paul's point oversimplifies. A base may be only 50% mappable and still be considered "mappable" in common discussion, but that means 50% of the possible sequencing reads in which it's likely to participate will not yield an unambiguous fragment! That means you can say goodbye to 50% of the likely fragments you would need to generate an overlap that spans two forming contigs. That, to me, indicates that any de novo assembly is unlikely to proceed correctly past that base, and 50% mappability is not unusual in the genome.

The other point of Paul's that I wanted to discuss was the assertion that the non-random fragmenting of DNA is what's stopping us from doing a serious de novo assembly. While it's true that shearing DNA isn't an entirely random process, it's also true that it doesn't need to be. People have been doing restriction enzyme digests for decades now, in order to map vectors and inserts. (I learned how to do it in grade 11 biology, and that was back in 1994.) So while sonication or digests may not be random, what's to stop someone from stripping DNA of its histones and then doing 25 different digests? The net effect is just about the same (assuming you pick 25 good restriction enzymes with different recognition sites), and will yield fragments that get around Paul's issue. But does that mean we can now do a de novo assembly?
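As an aside, the multiple-digest idea is easy to play with in silico. Here's a minimal sketch of a single digest in Python; note that real enzymes cut at a fixed offset within a double-stranded recognition site, while this toy just splits at the start of each site, and the example sequence is made up for illustration.

```python
import re

def digest(seq: str, site: str) -> list[str]:
    """Cut seq at every occurrence of a recognition site.
    (Toy model: real enzymes cut at a specific offset within the
    site, on both strands; here we simply split at the site start.)"""
    cuts = [m.start() for m in re.finditer(site, seq)]
    starts = [0] + cuts
    ends = cuts + [len(seq)]
    return [seq[s:e] for s, e in zip(starts, ends) if e > s]

seq = "AAGAATTCGGCCGAATTCTTGGATCCAA"
print(digest(seq, "GAATTC"))  # EcoRI-like site: 3 fragments
print(digest(seq, "GGATCC"))  # BamHI-like site: 2 fragments
```

Run the same sequence through several different sites and you get overlapping fragment sets, which is the whole point of the 25-digest thought experiment.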

No... I don't think so. I doubt the lack of a random fragmentation pattern is the limiting factor in de novo assemblies. Unfortunately, the only way to prove myself or Paul right or wrong is to simulate it: take the mouse genome, fragment it into random 36-mers (the same size you get from an Illumina sequencing run), inject 1 random base error for every 1000 bases read (I'll be generous and assume a 0.1% error rate, though the real thing is closer to 3-5% from what I've seen), and then try running Velvet on it.
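The read-generation half of that simulation might look something like the following sketch (the toy genome stands in for mouse; loading a real genome and running Velvet on the output are left as an exercise):

```python
import random

def simulate_reads(genome: str, read_len: int = 36,
                   coverage: float = 40.0, error_rate: float = 0.001,
                   seed: int = 42) -> list[str]:
    """Draw random fixed-length reads to the requested coverage and
    inject uniform substitution errors at the given rate."""
    rng = random.Random(seed)
    n_reads = int(coverage * len(genome) / read_len)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i in range(read_len):
            if rng.random() < error_rate:  # substitute a wrong base
                read[i] = rng.choice([b for b in "ACGT" if b != read[i]])
        reads.append("".join(read))
    return reads

# A 1 kb random toy genome; 40x coverage of 36-mers = 1111 reads.
genome = "".join(random.Random(1).choice("ACGT") for _ in range(1000))
reads = simulate_reads(genome)
print(len(reads))
```

Feed the resulting reads to an assembler and you can measure saturation and assembled fraction directly, which is exactly the bet I'm making below.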

I bet you'll observe that somewhere around 40x coverage you'll start to saturate, and you'll discover that your assembly has covered anywhere from 60-80% of the genome (with, say, 5% of that just plain wrong), and that it will have taken you longer to do that assembly than it would have to just sequence the damned thing in the first place with PCR.

Anyhow, I take Paul's point to heart: we're still early on in this game, and there are a lot of neat things we can try. I'm looking forward to seeing the results of many of them. (Who has time to try them all?)

By the way, FindPeaks 3.0.1 is done, and I'll be presenting it as a poster at AGBT 2008 this week. If you're interested in ChIP-Seq/ChIP-Solexa (or, for the pedantic, ChIP-Illumina), come find me, and I'll tell you about some of its new features.


Tuesday, January 22, 2008

Solexa Transcriptome Shotgun: Transcriptome Alignments vs. Genome Alignments

First off, I was up late trying to finish off one of my many projects, so I didn't get a lot of sleep. Thus, if my writing is more incoherent than usual, that's probably why. And now on with the show.

I'm jumping gears a bit today. I haven't finished my discussion about aligners; I still want to talk about Exonerate in detail, and then discuss some of the other aligners in overview. (For instance, the one I found out about today, called GMAP: a Genomic Mapping and Alignment Program for mRNA and EST sequences.) Anyhow, the point is that part of the purpose of using an aligner is to align to something in particular, such as a genome or a contig, but selecting what you align your sequences back to is a big issue.

When you're re-sequencing a genome, you map back to the genome template. Yes, you're probably sequencing a different individual, so you'll find occasional sections that don't match, but most humans are ~99% identical, and you can look into SNP (single nucleotide polymorphism) databases to find out if the differences you're seeing are already commonly known changes. Of course, if you're re-sequencing Craig Venter, you don't need to worry about SNPs as much. Fortunately, most of us are sequencing more exciting genomes.

When you're sequencing a genome you've never sequenced before, you can't do mapping at all; instead, you turn to one of the various assemblers, e.g., Velvet (written by Daniel Zerbino, who is a lot of fun to hang out with at conferences, I might add...), SSAKE (written by Rene Warren, who incidentally also works at the GSC, although several floors above me), or Euler (which I'd never heard of till I googled the web page for Velvet...). The key point: you don't need to worry about what you're mapping back to when you do de novo assembly, since you're creating your own map. I'll digress further for one brief comment: assembly from Solexa/Illumina sequences is a bad idea, because they're so short!

Moving right along, we come to the third thing people are sequencing these days: transcriptomes. (Yes, I'm ignoring cloned libraries... they're so 90's!) Transcriptomes are essentially a snapshot of the mRNA in a set of cells at a given point in time. Of course, mRNA is rather unstable, so protocols have been developed to convert mRNA to cDNA (complementary DNA), which is essentially a copy of the mRNA in DNA form. (Yes, I'm ignoring some complexity here, because it makes for a better story.) But I'm getting ahead of myself. Let's talk about the mRNA, but be aware that the sequencing is actually done on cDNA.

mRNA is an interesting beast. Unlike genomic DNA, it's a more refined creature. In eukaryotes, the mRNA is processed by the cell, removing some segments that are non-coding. Once the cell removes these segments (called introns) and leaves the other segments (called exons), we have a sequence of bases that no longer matches the genomic DNA sequence from which it came. Sure, there are short segments that map back very well (i.e., the exons), but if you take a small random snippet from the mRNA, there's a chance it might overlap the boundary between two exons, which means the bases you have won't map back nicely to the genomic point of origin. That can be a serious problem.

Sure, you say, we can do a gapped alignment and try to find the two points where this sequence originated, with one big gap. If you're sophisticated, you'll even know that introns usually carry signals that indicate their presence. And yes, you'd be right: we can do that. Unfortunately, for most Solexa runs, you get 20,000,000+ sequences. At 10 seconds a sequence (which doesn't seem like much, really), how long would it take to do that alignment?

Too long.
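How long is "too long"? A quick back-of-the-envelope calculation with the numbers above:

```python
reads = 20_000_000           # a typical Solexa run, per the text
seconds_per_read = 10        # the gapped-alignment estimate above
total_seconds = reads * seconds_per_read   # 2e8 seconds
years = total_seconds / (60 * 60 * 24 * 365)
print(f"{years:.1f} years")  # about 6.3 years on one CPU
```

Even parallelized across a decent cluster, that's not a turnaround time anyone wants per experiment.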

So most of the time, we don't do gapped alignments. Instead, we have two choices:
  1. Align against the genome, and throw away reads that we can't align (i.e., those that overlap intron/exon boundaries).

  2. Align against a collection of known coding DNA sequences


Number two isn't a bad option: it already has all the introns spliced out, so you don't need to worry about missing those alignments. Unfortunately, there are several issues with this approach:
  • Do we really know all of the coding DNA sequences? For most species, probably not, but this is a great idea for animals like Drosophila. (I tried this yesterday on a fruit fly Illumina run and it worked VERY well.)

  • Many transcripts are very similar. This isn't a problem with the data, but with your alignment program. If your aligner doesn't handle multi-matches (as is the case with Eland), this will never work.

  • Many transcripts are very similar. Did I say that already? Actually, it causes more problems. How do you know which transcript was really the source of the sequence? I have several ways to get around it, but nothing foolproof yet.

  • Aligning to a transcript is hard to visualize. This is going to be one of my next projects... but with all the fantastic genomes out there, I'm still not aware of a good transcriptome visualization tool.


And that brings us to the conclusion: aligning a transcriptome run against a genome and aligning it against a transcriptome both have serious problems, and there really are no good solutions for this yet.

For now, all I can do is run both: they tell you very different things, but both have fantastic potential. I haven't released my code for either one, yet, but they both work, and if you contact my supervisor, he'd probably be interested in collaborating, or possibly sharing the code.


Thursday, January 17, 2008

Aligning DNA - Eland

Continuing in my series of blog articles on short-read sequencing, it's time for a little bit of a discussion on the first short-sequence aligner that I'd ever met: Eland. The problem with writing anything about Eland is that 90% of the information I have on it can't be confirmed independently. In fact, the majority of what I know comes from the header of the ELAND.cpp file. I'll admit, while I've gone through the source code a couple of times, I haven't spent much time getting to know it well - I've processed billions of lines of Eland files, but have yet to run it myself...

Anyhow, here's what I can tell you: Eland stands for "Efficient Local Alignment of Nucleotide Data" and was written by Anthony J. Cox for Solexa Ltd. Of course, Solexa has since been bought out by Illumina, so I assume the copyright has been transferred along with the technology during that transaction.

What makes Eland so fascinating to me is that it's a direct departure from the dynamic programming based algorithms (like the venerable Smith-Waterman alignment), making it somewhat interesting as a computational program.

The basic algorithm works along these lines: given a sequence of length N, divide it into four subsequences (A, B, C, and D) of equal (or nearly equal) length. Assuming there are no more than 2 errors, at least two of these subsequences will be error free, so those error-free subsequences can be searched for in a database containing all possible subsequences in the genome of interest. Thus, you can search your database for subsequence pairs such as AB and CD. Of course, you don't know which subsequences are the error-free ones, so you need to try all possible combinations.

What Eland does to speed things up is to combine these subsequences into sets of two. Thus, rather than searching for 4 independent entries in their database, they only have to search for 2 sets of two, and as long as one set matches exactly, you can use looser matching criteria on the other, i.e., allowing for mismatches. That is to say, if you make two sequences out of the 4 subsequences ({AB} and {CD}), you can search for an exact match for {AB}, and test all the possible results for mismatches in the {CD} portion of the sequence. This has the effect of significantly reducing the search space (to only the sequences containing an exact match to {AB}).

Searching for the {AB and CD} pair would only work if the first half of your sequence has no errors. What if B and C had the errors? The simple answer is to shuffle your subsequences to make other combinations: ({AB and CD}, {AC and BD}, {AD and BC}, {BA and DC}, etc.) This still provides you with a relatively small search space for each sequence, as there are only 4! possible orderings (4x3x2x1 = 24 possible sequences) to search for.

Fortunately, you can bound this even further. You always know that you want combinations in which the pieces within each pair are in their correct order (i.e., {AB and CD} and {AC and BD} are OK, but {CA and DB} and {BA and DC} would give you an incorrect result), limiting you to only six possible combinations, which still allows you to find any correct match where at least two of the four subsequences are error free.
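To make the pigeonhole idea concrete, here's a toy Python version of the search. It checks every genome position against the six in-order piece pairs and then verifies the full read; actual Eland builds hash keys from the pairs rather than scanning, which is where the speed comes from, so treat this purely as an illustration of the algorithm, not of the implementation.

```python
from itertools import combinations

def pieces(s: str) -> list[str]:
    """Split s into four (nearly) equal pieces A, B, C, D."""
    q = len(s) // 4
    return [s[0:q], s[q:2*q], s[2*q:3*q], s[3*q:]]

def align(read: str, genome: str, max_mm: int = 2) -> list[int]:
    """Pigeonhole search: with at most 2 mismatches spread over 4
    pieces, at least two pieces are exact, so one of the 6
    order-preserving piece pairs matches the genome exactly."""
    pairs = list(combinations(range(4), 2))  # AB AC AD BC BD CD
    hits = []
    for start in range(len(genome) - len(read) + 1):
        window = genome[start:start + len(read)]
        rp, wp = pieces(read), pieces(window)
        # Seed test: some in-order pair of pieces matches exactly.
        if any(rp[i] == wp[i] and rp[j] == wp[j] for i, j in pairs):
            # Verify the whole read within the mismatch budget.
            if sum(a != b for a, b in zip(read, window)) <= max_mm:
                hits.append(start)
    return hits

print(align("ACGTACGT", "TTTTACGTACGATTTT"))  # [4]: 1 mismatch, found
```

Note that a window can pass the seed test (two exact pieces) and still be rejected by the full-read check, which is exactly the "looser matching criteria on the other set" step described above.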

That, in general, is how it works... with a few nifty shortcuts: e.g., they only need to pass over the genome search space 3 times to find all possible matches with up to two errors, because they can search for several subsequence combinations simultaneously and efficiently. In other words, you get a very quick alignment.

On the other hand, the way in which ELAND was implemented is very important. The Eland aligner has some severe restrictions, some of which are algorithm specific, some of which are implementation specific:

  1. The Eland aligner can only align sequences up to 32 base pairs long. (Implementation specific: they use four integers to represent the subsequences {A,B,C and D}, which reduces the memory requirements.)

  2. The Eland aligner can only align sequences with up to two mismatches. (Algorithm specific: at least two of the keys must match, although the algorithm could be expanded to use more subsequences, at the expense of expanding the sequence space.)

  3. The Eland aligner only returns sequences that match to one unique location in the genome. (Implementation specific: Eland already tells you when a sequence matches to more than one location, it just doesn't tell you what those locations are.)

  4. The Eland aligner can handle mismatches in the sequence, but cannot handle insertions or deletions (i.e., gapped alignments). (Algorithm specific: the key-matching algorithm could be expanded to include this, but it would be very costly in time/memory.)

  5. The code is not freely available. (Implementation specific: I don't know of anywhere you can go to get the code... or documentation!)

  6. Changing the length of the alignment sequence (I called it N, above), requires a recompile. (Implementation, obviously. Why this isn't a command line parameter, I don't know.)


I'm sure I could go on and list other disadvantages, but I'd like to point out the main advantage: it's FAST. Blazing fast, compared to the other aligners. From my own personal experience, Eland may only be able to map 60-70% of your sequences to the genome (compared to 70-80% for other aligners with which I have less experience), but it does so in a few hours, compared to several days or weeks for most other aligners (or years, compared to BLAST). (Comparisons are for a single processor, matching 1 million+ reads to a mammalian-sized genome.) The only other aligner that comes close to this speed, to the best of my knowledge, is SXOligoSearch, produced by Synamatix, though that beast will cost upwards of $225,000 CAD to license.

Anyhow, to close this discussion, it's worth mentioning that rumours of a new version of Eland have been circulating since March 2007. Beta versions, capable of performing searches on sequences larger than 32 base pairs in length and able to perform multi-matches, were supposed to be released in September 2007. Unfortunately, the new betas appear to be vapourware, which has prompted us to look into several other aligners, which I'll discuss in future posts.

(2008-01-28 Update: I stand corrected: I've heard that the new betas have finally arrived! I haven't seen what they can do yet, but at least we know my prediction that they're vapourware is wrong.)


Wednesday, January 9, 2008

Aligning DNA - a quick intro to some of the tools

A long time ago, I mentioned that I'd use this blog for some science. I don't really know who my audience is, but I suspect it's 1 part people who google in for linux related key words, 1 part people who are googling me because they've read my name somewhere, and 1 part random people... oh, and not to forget those people who I actually know in real life, but don't get to see often. Since I'm unlikely to scare off the googlers or the random people with some science, and the people I know in person will just skip over this posting (that's a hint, if you haven't already stopped reading...), I'll just go ahead with a quick lesson in genomics.

As I've mentioned to many people, my research is currently heavily embedded in writing code for the processing of Solexa/Illumina reads. For those who aren't in the know, the Illumina 1G is a DNA sequencing machine that can produce ~40 million DNA sequencing reactions in about 36 hours, where each sequence is 36 bases long. This is drastically different from traditional sequencing, where you get far fewer sequences (~100s), each of which can be up to 1000 bases long. With the low volume of alignments required, and the long reads, traditional sequencing can be processed with a rather laissez-faire attitude: if you take 10 seconds to align each of 100 reads, you're talking about 1000 seconds to do all the alignments, or roughly 17 minutes.

In contrast, with 40 million reads, at ten seconds each, you're talking about waiting 12 years for your data to be processed. Clearly this is not an option, if you want to get into a cutting edge journal.

(For those who are curious to know where the 10 second mark came from, my supervisor timed a few blast searches on a local server.)

Consequently, computational scientists have risen to the task, and created several new algorithms for performing these alignments. The first one that I met is called "Eland", which I believe stands for "Efficient Local Alignment of Nucleotide Data". To be honest, I'm not really sure how to get Eland, as there is very little information available on the web. I've gleaned a few scraps of information: It was written by Anthony Cox, and was distributed by Solexa/Illumina. As far as I can tell, it came with the Illumina 1G machines.

The second one I met was called "Mosaik". This aligner comes from the lab of Gabor Marth at Boston College. The third was "Exonerate", followed by several others. Hands down, the best name for an aligner has got to be "MAPASS"... I'll let you ponder that for a few minutes.

Anyhow, each aligner has its advantages and disadvantages. Eland, for instance, is crippled by an algorithmic limitation to only ever being able to align the first 32 bases of a sequence, and by its inability to map a read to more than one location in a target DNA source (i.e., genome or transcriptome). On the other hand, it's one of the fastest aligners out there.

Mosaik, on the other hand, is a bit slower, but has several nice features - and is able to handle the cases that Eland can't. Unfortunately, it dumps out its alignments to a file format that really isn't convenient for any further processing, which is a product of its original use in sequence assembly.

Just to throw in one more curve, it's worth mentioning a competing proprietary piece of software. There's a company in Malaysia that has what appears to be the fastest aligner out there, with none of the limitations of Eland, and a flexible file output. Like all commercial products for the science market, they set the base price of their product outside of the reach of most consumers: I can't seem to find a good justification to get my supervisor to pay $200,000+ USD/CAD for a copy of their software. (You could hire four post-docs for a year for that... or 10 grad students.)

Anyhow, now you know who the players are. My next post on this topic will be to introduce some of them in detail, explain how they work, and then tell you how we use them.

Yes, I'm feeling ambitious: I got three pieces of software working today, two of which are likely to be in production in the next couple of weeks at the Genome Sciences Centre - one for ChIP-Seq experiments (FindPeaks 3.0) and the other for transcriptome processing for mammalian cells (yet to be named).
