Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Tuesday, October 6, 2009

DNA sequencing Videos.

With IBM tossing its hat into the ring of "next-next-generation" sequencing, I'm starting to get lost as to which generation is which. For the moment, I'm sort of lumping things together while I wait to see how the field plays out. In my mind, first generation is anything that requires chain termination, second generation is chemically based pyrosequencing, and third generation is single-molecule sequencing based on a nano-scale mechanical process. It's a crude divide, but it seems to have some consistency.

At any rate, I decided I'd collect a few videos to illustrate each one. For Sanger, there are a LOT of videos, many of which are quite excellent, but I only wanted one. (Sorry if I didn't pick yours.) For second- and third-generation DNA sequencing, the selection thins out considerably, and two of the videos come from corporate sites rather than YouTube - which seems to be the general consensus repository for technology videos.

Personally, I find it interesting to see how each group is selling themselves. You'll notice some videos press heavily on the technology, while others focus on the workflow.

As an aside, I also find it interesting to look for places where the illustrations don't make sense... there's a lovely place in the 454 video where two strands of DNA split from each other on the bead, leaving the two full strands and a complete primer sequence... mysterious! (Yes, I do enjoy looking for inconsistencies when I go to the movies.)

Ok, get out your popcorn.

First Generation:
Sanger Entry: Link



Second Generation:
Pyrosequencing Entry: Link



Helicos Entry: Link



Illumina (Corporate site): Link

(Click to see the Flash animation)


454 Entry: Link



Third Generation:

Pacific Biosciences: Link

(Click to see the Flash Video)


Oxford Nanopore Entry: Link



IBM's Entry: Link



Note: If I've missed something, please let me know. I'm happy to add to this post whenever something new comes up.


Friday, October 2, 2009

Base quality by position

A colleague of mine was working on a nifty tool to give graphs of the base quality at each position in a read using Eland export files, which could be incorporated into his pipeline. During a discussion about how long that analysis should take (his script was taking an hour, and I said it should take about a minute to analyze 8M Illumina reads...), I ended up saying I'd write my own version to do the analysis, just to show how quickly it could be done.

Well, I was wrong about it taking about a minute. It turns out that the file has more than double the originally quoted 8 million reads (QC, no-match and multi-match reads had not been filtered out), and the whole file was bzipped, which adds to the processing time.

Fortunately, I didn't have to add bzip support into the reader, as tcezard (Tim) had already added a cool "PIPE" option for piping whatever data format I want into applications of the Vancouver Short Read Analysis Package. Thus, I was able to do the following:
time bzcat /archive/solexa1_4/analysis2/HS1406/42E6FAAXX_7/42E6FAAXX_7_2_export.txt.bz2 | java6 src/projects/maq_utilities/QualityReport -input PIPE -output /projects/afejes/temp -aligner elandext

Quite a neat use of piping, really.
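For anyone curious what's going on under the hood, the logic of a per-position quality report is dead simple. Here's a minimal sketch in Java (not the actual QualityReport code from the package) that reads tab-delimited export lines from stdin and prints the average quality at each cycle - the quality column index and the ASCII offset are assumptions you'd need to check against your own pipeline version:

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Minimal sketch (not the real QualityReport): average base quality per read position,
// reading tab-delimited export lines piped in on stdin.
// The quality column index and ASCII offset below are assumptions - verify them
// against your pipeline's documentation before trusting the numbers.
public class QualityByPosition {
    public static void main(String[] args) throws Exception {
        final int QUAL_COLUMN = 9;    // assumed 0-based column holding the quality string
        final int ASCII_OFFSET = 64;  // assumed Illumina-style quality encoding offset
        long[] sums = new long[0];
        long[] counts = new long[0];
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split("\t");
            if (fields.length <= QUAL_COLUMN) continue;
            String qual = fields[QUAL_COLUMN];
            if (qual.length() > sums.length) {          // grow to the longest read seen
                long[] s = new long[qual.length()];
                long[] c = new long[qual.length()];
                System.arraycopy(sums, 0, s, 0, sums.length);
                System.arraycopy(counts, 0, c, 0, counts.length);
                sums = s;
                counts = c;
            }
            for (int i = 0; i < qual.length(); i++) {
                sums[i] += qual.charAt(i) - ASCII_OFFSET;
                counts[i]++;
            }
        }
        for (int i = 0; i < sums.length; i++) {
            double mean = counts[i] == 0 ? 0.0 : (double) sums[i] / counts[i];
            System.out.println((i + 1) + "\t" + mean);
        }
    }
}

You'd run it the same way as above, e.g. bzcat export.txt.bz2 | java QualityByPosition, and plot the two columns it spits out.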

Anyhow, the fun part is that the library was a 100-mer Illumina run, and it makes a pretty picture. Slapping the output into OpenOffice yields the following graph:

[Graph: average base quality by position across the 100-mer reads]

I didn't realize quality dropped so dramatically at 100bp - although I remember when qualities looked like that for 32bp reads...

Anyhow, I'll include this tool in FindPeaks 4.0.8 in case anyone is interested in it. And for the record, this run took 10 minutes, of which about 4 were taken up by bzcat. Of the 16.7M reads in the file, only 1.5M were aligned, probably due to the poor quality beyond 60-70bp.


Wednesday, September 30, 2009

Second Gen Sequencer Map and naming our generations properly

Yes, this is old news, but I've found myself searching several times in the last few days for the second-generation sequencing map that was started at SeqAnswers. Just to make it even easier to find - and of course to give some positive publicity to a really cool project - here's the Google Maps-based list of facilities that run second-generation (next-generation) sequencing machines:

http://tinyurl.com/orm8cr

For the record, I understand a few places are still missing. I hear there's some second-generation sequencing going on in Alberta, which clearly hasn't appeared on the map yet.

And, as a footnote, now that I see other people have picked up on it and started calling "next-generation sequencing" by the more appropriate label of "second generation sequencing" (e.g. here at Nature), I'm going to drop next-gen as a label entirely. First-gen is Sanger/dideoxy/capillary/etc., second-gen is pyrosequencing and third-gen is biotech-based (using cellular components such as DNA polymerases and the like). Let's end the confusion and name our generations accordingly, shall we?


Monday, June 15, 2009

Another day, another result...

I had the urge to just sit down and type out a long rant, but then common sense kicked in and I realized that no one is really interested in yet another graduate student's rant about their project not working. However, it only took a few minutes for me to figure out why it's relevant to the general world - something that's (unfortunately) missing from most grad student projects.

If you follow along with Daniel MacArthur's blog, Genetic Future, you may have caught the announcement that Illumina is getting into the personal genome sequencing game. While I can't say I was surprised by the news, I will admit that I am somewhat skeptical about how it's going to play out.

If your business is using arrays, then you'll have an easy time sorting through the relevance of the known "useful" changes to the genome - there are only a couple hundred or thousand that are relevant at the moment, and several hundred thousand more that might be relevant in the near future. However, when you're sequencing a whole genome, interpretation becomes a lot more difficult.

Since my graduate project is really the analysis of transcriptome sequencing (a subset of genome sequencing), I know firsthand the frustration involved. Indeed, my project was originally focused on identifying changes to the genome common to several cancer cell lines. Unfortunately, this is what brought on my need to rant: there is vastly more going on in the genome than small sequence changes.

We tend to believe blindly what we were taught as the "central dogma of molecular biology": genes are copied to mRNA, mRNA is translated to proteins, and the protein goes off to do its work. However, cells are infinitely more complex than that. Genes can be inactivated by small changes; they can be chopped up and spliced together to become inactivated or even deregulated; interference can be run by distally modified sequences; gene splicing can be completely co-opted by inactivating genes we barely even understand yet; and desperately over-expressed proteins can be marked for deletion by over-activated garbage collection systems so that they never get where they were needed in the first place. And here we are, looking for single nucleotide variations, which make up a VERY small portion of the information in a cell.

I don't have the solution yet, but whatever we do in the future, it's not going to involve $48,000 genome re-sequencing. That information on its own is pretty useless - we'll have to study expression (WTSS or RNA-Seq, so figure another $30,000), changes to epigenetics (of which there are many histone marks, so figure 30 x $10,000) and even DNA methylation (I don't begin to know what that process costs).

So, yes, while I'm happy to see genome re-sequencing move beyond the confines of array based SNP testing, I'm pretty confident that this isn't the big step forward it might seem. The early adopters might enjoy having a pretty piece of paper that tells them something unique about their DNA, and I don't begrudge it. (In fact, I'd love to have my DNA sequenced, just for the sheer entertainment value.) Still, I don't think we're seeing a revolution in personal genomics - not quite yet. Various experiments have shown we're on the cusp of a major change, but this isn't the tipping point: we're still going to have to wait for real insight into the use of this information.

When Illumina offers a nice toolkit that allows you to get all of the SNVs, changes in expression and full ChIP-Seq analysis - and maybe even a few mutant transcription factor ChIP-Seq experiments thrown in - and all for $48,000, then we'll have a truly revolutionary system.

In the meantime, I think I'll hold out on buying my genome sequence. $48,000 would buy me a couple more weeks in Tahiti, which would currently offer me a LOT more peace of mind. (=

And on that note, I'd better get back to doing the things I do.... new FindPeaks tag, anyone?


Monday, June 1, 2009

Science Cartoons - 5 (RNA-Seq)

This is the last science cartoon I did for my poster. I was pretty happy with the pictures, although I've since learned a few more tricks that I'd use if I were to do it over again.

Anyhow, my favorite effect in this picture is the "text to path" one, where you can make any string follow any line - who knew graphic design could be so much fun? It definitely makes for some interesting graphics. I'd definitely use this effect in an RNA folding paper, if I ever got the chance to do another one. (-;


Thursday, April 16, 2009

Multi-match reads in ChIP-Seq

I had an interesting comment left on my blog today, which is worth taking a few minutes to write a response to:
"Hi Anthony, I just discovered your blog and it looks very interesting to me!
Since this article on Eland is now more than one year old, I was wondering
if the description at point 3 about multi matching locations is still
applicable to the Eland program in the Illumina pipeline 1.3. More in general,
would you trust the multi matching locations extracted from the multi_eland
output files to perform a repeat enrichment analysis over an experiment of
ChIP-seq? If no, why? Thank you in advance for your attention."

The first question asks about multi-matching locations - and whether the point in question (point 3) still applies to the Illumina Pipeline 1.3. Since point 3 was just that the older pipeline didn't provide the locations of the multi-match reads, I suppose this no longer really applies: I understand the new version of Eland does provide multi-match alignment information, as do other aligners such as Bowtie. However, I should also mention that since I adopted Maq as my preferred aligner, I haven't used Eland much - so it's hard for me to give an informed opinion on the quality of the matches. I simply don't know if they're any good, and I won't belabour that point. I have used Bowtie specifically because it was able to do multi-matches, but we didn't use it for ChIP-Seq, and the multi-matches had other uses in that experiment.

So, the more interesting question is whether I'd use multi-match reads in a ChIP-Seq analysis. And, offhand, my answer has to be no. But let me explain my reasoning, and the conditions under which I would change that answer.

First, let's assume that we have single-end tags, so the multi-match information is not resolvable. That means any time we have a read that maps to two locations, we either map it to its source or we map it incorrectly - a 50% chance of "getting it right." The greater the number of multi-match locations, the smaller the chance we're actually finding its correct origin. So, at best we've got a 50-50 chance that we're not adversely affecting the outcome of the experiment. That's not great.

That said, there are things we could do to make them usable. The most widely used method in FindPeaks is the weighted fragment distribution type, so we could extend that principle to weight each fragment according to the number of sites it matches. That would be... bearable. But would it significantly add to the quality of the analysis?
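(To be concrete about what I mean by weighting - a hypothetical sketch, not anything that's actually in FindPeaks - each of a read's N candidate locations would contribute 1/N to the coverage, so a unique read adds 1.0 and a four-way multi-match adds 0.25 at each site.)

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of fractional weighting for multi-match reads (not FindPeaks code):
// each of a read's N candidate alignments contributes 1/N to the pileup.
public class WeightedCoverage {
    private final Map<String, Double> coverage = new HashMap<String, Double>();  // "chr:pos" -> depth

    public void addRead(String[] chromosomes, int[] starts, int readLength) {
        double weight = 1.0 / starts.length;              // N candidate locations
        for (int i = 0; i < starts.length; i++) {
            for (int pos = starts[i]; pos < starts[i] + readLength; pos++) {
                String key = chromosomes[i] + ":" + pos;
                Double current = coverage.get(key);
                coverage.put(key, (current == null ? 0.0 : current) + weight);
            }
        }
    }

    public double coverageAt(String chromosome, int position) {
        Double value = coverage.get(chromosome + ":" + position);
        return value == null ? 0.0 : value;
    }
}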

I'm still going to say no. Fragments we see in ChIP-Seq experiments tend to fall within 200-300bp of the regions where the transcription factor (or other protein of interest) binds. Thus, even if we were concerned that a particular transcription factor binds primarily to similar motif regions at two sites, there should be more than enough unique sequence around each site (the motif itself is usually <30-40bp in length) for fragments to align to. That should compensate for the loss of the multi-match fragments.

Even more importantly, as read lengths increase, the amount of non-unique sequence decreases rapidly, making the shrinking number of multi-match reads less important.

The same argument can be extended to paired-end tags: just as increasing read lengths reduce the number of multi-match sites, more of the multi-match reads will be resolved by pairing them with a second read, which is unlikely to fall within the same repeat region, thus reducing the number of reads that remain unresolvable multi-matches. Proportionally, one would expect the reads left out this way to become a smaller and smaller segment of the population, so we'd have to worry less and less about their contribution.

So, then, when would I want them?

Well, on the off chance you're working with very short reads, you can pull off the weighting properly, you have single-end tags, and the multi-match reads make up a significant proportion of the total - then it's worth exploring.

You'd need to start asking the tough questions: did the aligner simply find that a small k-mer of the read aligned to multiple locations (and was then unable to resolve the tie by extension, the way some Eland versions work)? Does the aligner use quality scores to identify mis-alignments? How reliable are the alignments (what's their error rate)? What was your sample, and how divergent is it from the reference? (e.g. cancer samples have a high variation rate, which encourages false alignments and makes the alignments less reliable.)

Overall, I really don't see too many cases where you're going to gain a lot by digging into the multi-match files. That's not to say that you won't find anything good in there - you probably would, if you knew where to look - but the signal-to-noise ratio is going to be pretty poor, just by virtue of the fact that they're multi-match reads. You'll just have to ask if it's worth your time.

For the moment, I don't think my time (even at grad student wages) is worth it. It's just not low hanging fruit, when it comes to ChIP-Seq.


Wednesday, March 25, 2009

Searching for SNPs... a disaster waiting to happen.

Well, I'm postponing my planned article, because I just don't feel in the mood to work on that tonight. Instead, I figured I'd touch on something a little more important to me this evening: WTSS SNP calls. Well, as my committee members would say, they're not SNPs, they're variations or putative mutations. Technically, that makes them Single Nucleotide Variations, or SNVs. (They're only polymorphisms if they're common to a portion of the population.)

In this case, they're from cancer cell lines, so after I filter out all the real SNPs, what's left are SNVs... and they're bloody annoying. This is the second major project I've done where SNP calling has played a central role. The first was based on very early 454 data, where homopolymer errors were frequent, and thus finding SNVs was pretty easy: they were all over the place! After much work, it turned out that pretty much all of them were fake (false positives), and I learned to check for homopolymer runs - a simple trick, easily accomplished by visualizing the data.
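If you'd rather not rely on eyeballing alone, the check is trivial to automate. Here's a rough sketch (hypothetical code, not part of my pipeline) that flags a candidate variant sitting next to a homopolymer run in the reference:

// Hypothetical sketch: flag a candidate variant if either flanking reference base
// starts a homopolymer run of at least minRun identical bases.
public class HomopolymerCheck {
    public static boolean nearHomopolymer(String reference, int position, int minRun) {
        for (int direction = -1; direction <= 1; direction += 2) {   // look left, then right
            int i = position + direction;
            if (i < 0 || i >= reference.length()) continue;
            char base = reference.charAt(i);
            int run = 1;
            int j = i + direction;
            while (j >= 0 && j < reference.length() && reference.charAt(j) == base) {
                run++;
                j += direction;
            }
            if (run >= minRun) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Toy example: the candidate at index 4 sits in a run of A's at least 4 long.
        System.out.println(nearHomopolymer("CGTGAAAAAC", 4, 4));   // prints true
    }
}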

We moved on to Illumina after that - actually, it was still Solexa at the time. Yes, this is older data, nearly a year old. It wasn't particularly reliable, and I've since used several different aligners, references and settings, each time (I thought) improving the data. We came down to a couple of very intriguing variations, and decided to sequence them. After several rounds of primer design, we finally got one that worked... and lo and behold: 0/2. Neither of them is real. So now comes the post-mortem: why did we get false positives this time? Is it bias from the platform? Bad alignments? Or something even more suspicious... do we have evidence of edited RNA? Who knows. The game begins all over again, in the quest to answer the question "why?" Why do we get unexpected results?

Fortunately, I'm a scientist, so that question is really something I like. I don't begrudge the last year's worth of work - which apparently is now more or less down the toilet - but I hope that the why leads to something more interesting this time. (Thank goodness I have other projects on the go, as well!)

Ah, science. Good thing I'm hooked, otherwise I'd have tossed in the towel long ago.


Friday, July 4, 2008

8 Postdoc positions

I don't want to spam anything, but since this is my own web page, I guess I can advertise as much as I'd like. I was just passed an email from a colleague in the Plant Science department at UBC, where they're currently looking to fill eight post-doc positions: mainly people who have, or are interested in gaining, Illumina sequence processing experience.

I figured this is noteworthy for several reasons:
  1. There is a growing demand for next-gen trained bioinformaticians, which looks good for the future career prospects of anyone in the Next-gen Sequencing/Genomics field (though this is hardly a surprise),
  2. Genomics is beginning to expand out of the narrow {yeast | human | C.elegans | etc} model organism fields into areas such as plant science, where it will have a huge impact. (going mainstream is always a good thing for a field of science, in my humble opinion.)
  3. Some of the positions will put bioinformaticians into key positions where they become the cornerstone of research projects, which is a far cry from the "bioinformaticians as a service" role that's been popular in many research settings.

Anyhow, I can highly recommend at least one of these positions, having worked with the professor before, so if anyone is interested, I'd be happy to forward along the advertisements.


Sunday, April 13, 2008

Genomics Forum 2008

You can probably guess what this post is about from the title - which means I still haven't gotten around to writing an entry on thresholding for ChIP-Seq. Actually, it's probably a good thing I haven't, as we've been learning a lot about thresholding in the past week. It seems many things we took for granted aren't really the case. Anyhow, I'm not going to say too much about that, as I plan to collect my thoughts and discuss it in a later entry.

Instead, I'd like to discuss the 2008 Genomics Forum, sponsored by Genome BC, which took place on Friday - though, in particular, I'm going to focus on one talk, near to my own research. Dr. Barbara Wold from Caltech gave the first of the science talks, and focussed heavily on ChIP-Seq and Whole Transcriptome Shotgun Sequencing (WTSS). But before I get to that, I wanted to mention a few other things.

The first is that Genome BC took a few minutes to announce a really neat funding competition, which really impressed me, the Genome BC Science Opportunities Fund. (There's nothing up on the web page yet, but if you google for it, you'll come across the agenda for Friday's forum in which it's mentioned - I'm sure more will appear soon.) Its whole premise revolves around the question: "Are there experiments that we need to be doing, that are of strategic importance to the BC life science community?" I take that to mean, are there projects that we can't afford not to undertake, that we wouldn't have the funding to do otherwise? I find that to be very flexible, and very non-academic in nature - but quite neat. I hope the funding competition goes well, and I'm looking forward to seeing what they think falls into the "must do" category.

The second was the surprising demand for Bioinformaticians. I'm aware of several jobs for bioinformaticians with experience in next-gen sequencing, but the surprise to me was the number of times (5) I heard people mention that they were actively recruiting. If anyone with next-gen experience is out there looking for a job (post-doc, full time or grad student), drop me a note, and I can probably point you in the right direction.

The third was one of the afternoon talks, on journalism in science, from the perspective of traditional newspaper/TV journalists. It seems so foreign to me, yet the talk touched on several interesting points, including the fact that journalists are struggling to come to terms with "new media." (...which doesn't seem particularly new to those of us who have been using the net since the 90's, but I digress.) That gave me several ideas about things I can do with my blog, to bring it out of the simple text format I use now. I guess even those of us who live/breathe/sleep internet don't do a great job of harnessing its power for communicating effectively. Food for thought.

Ok... so on to the main topic of tonight's blog: Dr. Wold's talk.

Dr. Wold spoke at length on two topics, ChIP-Seq and Whole Transcriptome Shotgun Sequencing. Since these are the two subjects I'm actively working on, I was obviously very interested in hearing what she had to say, though I'll comment more on the ChIP-Seq side of things.

One of the great open questions at the Genome Sciences Centre has been how to do an effective control for a ChIP-Seq experiment. It's not something we've done much of in the past, but the Wold lab demonstrated why controls are necessary, and how to do them well. It seems that ChIP-Seq experiments tend to yield fragments in several genomic regions that have nothing to do with the antibody or the experiment itself. The educated guess is that these are caused by hypersensitive sites in the genome that tend to fragment in repeatable patterns, giving rise to peaks that appear in all samples. Indeed, I spent a good portion of this past week talking about observations of peaks exactly like that, and how to "filter" them out of the ChIP-Seq results. I wasn't able to get a good idea of how the Wold lab does this, other than by eye (which isn't very high throughput), but knowing what needs to be done now, it shouldn't be particularly difficult to incorporate into our next release of the FindPeaks code.
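The obvious first pass - and I stress this is just a sketch of the idea, not how the Wold lab or FindPeaks does it - is to call peaks on the control library as well, and throw out any sample peak that overlaps a control peak:

import java.util.ArrayList;
import java.util.List;

// Sketch of a naive control filter (not FindPeaks or Wold-lab code): drop any sample
// peak that overlaps a peak called in the control library, on the assumption that
// shared peaks come from hypersensitive sites rather than the antibody.
public class ControlFilter {
    public static class Peak {
        final String chromosome;
        final int start;
        final int end;
        public Peak(String chromosome, int start, int end) {
            this.chromosome = chromosome;
            this.start = start;
            this.end = end;
        }
        boolean overlaps(Peak other) {
            return chromosome.equals(other.chromosome)
                    && start <= other.end && other.start <= end;
        }
    }

    public static List<Peak> filter(List<Peak> samplePeaks, List<Peak> controlPeaks) {
        List<Peak> kept = new ArrayList<Peak>();
        for (Peak sample : samplePeaks) {
            boolean sharedWithControl = false;
            for (Peak control : controlPeaks) {     // fine for thousands of peaks; use an
                if (sample.overlaps(control)) {     // interval index if you have many more
                    sharedWithControl = true;
                    break;
                }
            }
            if (!sharedWithControl) kept.add(sample);
        }
        return kept;
    }
}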

Another smart thing that the Wold lab has done is to separate ChIP-Seq interactions into two different types, Type 1 and Type 2, where Type 1 refers to single molecule-DNA binding events, which give rise to sharp peaks and very clean profiles. These tend to be transcription factors like NRSF or STAT1, upon which the first generation of ChIP-Seq papers were published. Type 2 interactomes tend to be less clear, as they involve transcription factors that recruit other elements, or form complexes that bind to the DNA at specific sites and require other proteins to encourage transcription. My own interpretation is that the number of identifiable binding sites should indicate the type - so if there were three identifiable transcription factor consensus sites lined up, it should be considered a Type 3 interactome - though that may be simplifying things tremendously, as there are undoubtedly many other proteins that must be recruited before any transcription will take place.

In terms of applications, the members of the Wold lab have been using their identified peaks to locate novel binding site motifs. I think this is the first thing everyone thinks of when they hear about ChIP-Seq for the first time, but it's pretty cool to see it in action. (We also do it at the GSC, I might add.) The neatest thing, however, was that they were able to identify a rather strange binding site, with two halves of a motif split by a variable distance. I haven't quite figured out how that works in terms of DNA/protein structure, but it's conceptually quite neat. They were able to show that the distance between the two halves of the motif varies by 10-20 bases, making it a challenge to identify for most traditional motif scanners. Nifty.

Another neat thing - which I think everyone suspects, but which was cool to hear has actually been shown - is that the binding sites often line up with areas of high conservation across species. I use that as a test in my own work, but it was good to have it confirmed.

Finally, one of the things Dr. Wold mentioned was that they were interested in using the information in the directionality of reads in their analysis. Oddly enough, this was one of the first problems I worked on in ChIP-Seq, months ago, and discovered several ways to handle it. I enjoyed knowing that there's at least one thing my own ChIP-Seq code does that is unique, and possibly better than the competition. (-;

As for the transcriptome work, there were only a couple of things worth mentioning. The Wold lab seems to be using MAQ and a list of splice junctions assembled from annotated exons to map the transcriptome sequences. I've heard that before, actually, from someone at the GSC who is doing exactly the same thing. It's a small world. I'm not really a fan of the technique, however. Yes, you'll get a lot of the exon junction reads, but you'll only find the ones you're looking for, which is exactly the criticism all the next-gen people throw at microarrays. There has got to be a better solution... but I don't yet know what it is. (We thought it was Exonerate, but we can't seem to get it to work well, due to several bugs in the software. It's clearly a work in progress.)
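For what it's worth, my understanding of the junction-library trick (I haven't seen their code, so take this as a sketch of the general idea rather than the Wold lab's method) is to glue the tail of one annotated exon to the head of the next, so that reads spanning the junction have something to align to:

import java.util.ArrayList;
import java.util.List;

// Sketch of building a splice-junction library from annotated exons (my reading of the
// general approach, not the Wold lab's code): join the last (readLength - 1) bases of one
// exon to the first (readLength - 1) bases of the next, so a read crossing the junction
// can align to the synthetic sequence even though it won't align to the genome.
// A fuller version would join every plausible exon pair within a gene, not just neighbours.
public class JunctionLibrary {
    public static List<String> buildJunctions(List<String> exonSequences, int readLength) {
        int flank = readLength - 1;
        List<String> junctions = new ArrayList<String>();
        for (int i = 0; i + 1 < exonSequences.size(); i++) {
            String left = exonSequences.get(i);
            String right = exonSequences.get(i + 1);
            String leftEnd = left.substring(Math.max(0, left.length() - flank));
            String rightStart = right.substring(0, Math.min(flank, right.length()));
            junctions.add(leftEnd + rightStart);
        }
        return junctions;
    }
}

Of course, that's exactly where my criticism comes from: the library only ever contains the junctions you already knew to build.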

Anyhow, I think I'm going to stop here. I'll just sum it all up by saying it was a pretty good talk, and it's given me lots of things to think about. I'm looking forward to getting back to coding tomorrow.


Wednesday, April 2, 2008

New ChIP-Seq tool from Illumina

Ok, I had to blog this. Someone on the SeqAnswers forum brought it to my attention that Illumina has a new tool for ChIP-Seq experiments. That in itself doesn't bother me - the more people in this space, the faster we learn about what makes us tick.

What surprises me, though, is the tool itself (the BeadStudio data analysis software - ChIP sequencing module). It's implemented only for Windows, for one. (Don't most self-respecting scientists use Macs or Linux these days? Or at least use and develop tools that can be run cross-platform?) Second, the feature set appears to be a re-implementation of the UCSC Genome Browser. Given the choice between the two, I don't see any reason to buy the Illumina version. (Yes, you have to pay for it, whereas UCSC is free and flexible.) I can't tell if it loads BED files or WIG files, but the screenshots show a rather inflexible tool that looks like a graphical version of Gap4 or Consed. I'm not particularly impressed.

Worse still, I can't see this being implemented in a pipeline. If you're processing hundreds of ChIP-Seq experiments in a year, or thousands once this technique really starts to hit its stride, why would you want to force it all through a GUI? I just don't get it.

Well, what do I know? Maybe there's a big market for people out there who don't want free cross-platform tools, and would rather pay for a brand name science application than use something that works. Come to think of it, I'm willing to bet there are a few pharma companies out there who do think like that, and Illumina is likely to conquer that market with their tool. Happy clicking, Vista users.


Friday, March 28, 2008

ChIP-Seq on a laptop.

Every once in a while, I get an interesting email or comment from someone, which is worthy of a blog post. I woke up this morning to a comment left on my last post:

What kind of computing infrastructure do you need to be able to handle chipSEQ datasets? I'm guessing my standard IBM T60 laptop is not going to do the trick - but what does?


So yes, I do check my email when I get up... but that's not the point. Moving along...

The simple answer is, your laptop probably can't run a ChIP-Seq data set with the currently available programs... but it's probably not far off. My laptop has 2Gb of RAM (after tossing in two new 1Gb sticks last week) and a dual-core AMD running at 2GHz. Ironically, that's not far from what we do use to run FindPeaks: a quad-core 2.2GHz Opteron with 8Gb of RAM.

Of course, FindPeaks doesn't take up the full resources of the machine. (Though I've seen it take up to 12 CPUs on one of our fancier servers.) Generally, I only allocate 4Gb of RAM to the process, which should be more than enough for several million or even tens of millions of reads. The reason it takes so much RAM? Because I don't drop any of the information contained in the Eland file, and because I need to retain all of the reads in memory to do a sort.

What does it need? Generally, FindPeaks only needs the start, end and direction of each read to do the basics; however, I don't throw away any of the other information collected, in case we want to use it later. If I did that, or if I pre-sorted the reads, I could probably drop the memory requirements by an order of magnitude or more. (No one's asked me to do this yet, however.) Is there anyone out there who needs a laptop-runnable version of FindPeaks?

This is somewhat timely, though. For the last few days I've been playing with a companion piece of software for FindPeaks which generates UCSC-compatible BED tracks from Eland data, where I've adopted a minimalist approach. It takes about the same amount of time, and runs in under 300MB of RAM. That, clearly, should run on anyone's laptop.
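The minimalist approach really is minimalist - conceptually it's nothing more than the sketch below (illustrative only, not the actual companion tool; the column indices are assumptions about the export format that you'd want to double check):

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Illustrative sketch of an Eland-to-BED conversion (not the actual companion tool):
// stream lines from stdin, keep only chromosome, position, read length and strand,
// and print one BED line per aligned read. Column indices are assumptions.
public class ElandToBed {
    public static void main(String[] args) throws Exception {
        final int SEQ_COLUMN = 1;      // assumed: read sequence (used only for its length)
        final int CHROM_COLUMN = 6;    // assumed: chromosome / match field
        final int POS_COLUMN = 7;      // assumed: 1-based alignment position
        final int STRAND_COLUMN = 8;   // assumed: F or R
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split("\t");
            if (fields.length <= STRAND_COLUMN) continue;
            String chrom = fields[CHROM_COLUMN];
            if (chrom.length() == 0 || chrom.startsWith("QC") || chrom.startsWith("NM")) {
                continue;              // skip unaligned / QC-failed reads
            }
            int start = Integer.parseInt(fields[POS_COLUMN]) - 1;   // BED is 0-based
            int end = start + fields[SEQ_COLUMN].length();
            String strand = fields[STRAND_COLUMN].startsWith("R") ? "-" : "+";
            System.out.println(chrom + "\t" + start + "\t" + end + "\t.\t0\t" + strand);
        }
    }
}

Because it never holds more than one line in memory, the footprint stays tiny no matter how big the run is.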


Friday, March 21, 2008

Catching up....

I can't believe it's been nearly a month since my last post! I feel like I've been neglecting this a bit more than I should, but I'll try to rectify that as best I can.

For an indication of how busy I've been, I sat down to update my resume yesterday, and ended up adding 3 papers (all in submission) and two posters. That just about doubles what was in there previously in the papers section.

Anyhow, Next-generation sequencing doesn't stand still, so I thought I'd outline some of the things I want to talk about in my next posts, and set up a few pointers to other resources:

1. SeqAnswers. This aptly named forum has been around for a few months now, but has recently become more popular, and it's a great place for discussing relevant next-gen technology and sequencing methods. I'm especially happy to see the automated posts triggered by new literature on the subject, which are a great resource for those of us who are busy and forget to check for new publications ourselves.

2. There's one thread in particular that's of great interest: benchmarking different aligners. It appears to be a well done comparison (if lightweight) that may be a good focal point for other people who are interested in comparing aligners and discussing them in a wider forum.

3. For people interested in ChIP-Seq, or Chromatin immunoprecipitation and massively parallel sequencing, I've finally gotten around to posting FindPeaks 3.1 on the web. I'd consider this release (3.1.3) an alpha release. I'd love to get more people contributing by using this application and telling me what could be improved on it, or what enhancements they'd like to see. I'm always happy to discuss new features, and can probably add most of them in with a relatively quick turn around time.

4. For people interested in assessing the quality of the whole transcriptome shotgun sequencing (WTSS), I'm about to break out a tool that should fit that purpose. If anyone is interested in giving suggestions on ways they'd like to see quality tests performed, I'd be more than happy to code those into my applications. (And yes, if you contribute to the tool, I will provide you a copy of the tool to use. Collaborations, etc, can be discussed, as well.)

5. A quick question, of which I'll post more in the future. Has anyone here managed to get Exonerate 2.0 to work in client/server mode on two separate machines?

6. I'll also post a little more about this in the future: building environments, ant and java. Why are people still doing large projects in perl?

7. One last thing I wanted to mention. I was going to write more on this topic, but eh... I'll let Slashdot do it for me: the more you drink, the less you publish. Well, so much for keeping a bottle of tequila under the desk. Now I know what to get the competition for x-mas, though...

Cheers!


Monday, February 25, 2008

Transcriptome sequencing.

In an earlier comment, Jason Stajich asked:

What I am most curious about is how people are planning to do the statistics of gene expression comparison from the EST sequencing library approach. It made sense to me for the SAGE approach, but how do you get the overall expression for the gene (really you want the per-transcript numbers). Do you assemble and count the union of all tags across a transcript? Do you normalize that by length of the transcript? Do you only count 3' biased tags?


Though I've been taking my time about answering, it's a really good question. I've been working with transcriptomes for a while now, and have a fair amount of experience with them. I don't want to give away all of my secrets, but I can give a few pointers. If anyone wants to collaborate on something, you know where to find me. (-;

So, first things first: with transcriptome sequencing using Illumina-based sequencing, each read you get is presumably from a single molecule of DNA, which presumably came from a single molecule of cDNA in your library. I can't speak for all of the protocols used by the labs here at the Genome Sciences Centre, but the results I've seen have shown a very close correlation with expression levels measured by NimbleGen/Affymetrix DNA arrays, and so I tend to believe that the number of tags we're observing per region (e.g. gene/transcript/exon) is a direct (or nearly direct) measurement of the RNA levels in the general cell population used to create the library.

I should also mention that this is very dependent upon the protocols being used. If your protocol involves amplifying the cDNA with the use of PCR, you're really not going to maintain that relationship. Consult an expert on this subject, if you plan to try this at home. (-;

The other questions Jason asked are not quite as straightforward. We have a protocol here at the GSC that gives pretty darn even coverage across transcripts as a whole, which means that transcript end bias is pretty minimal. That largely negates the need to correct for bias. Of course, this comes down to a lot of lab technique (which is totally outside the scope of this post), as it seems to depend on following the appropriate protocols. I've seen libraries which are completely skewed, libraries that perfectly follow the transcript outlines, and libraries somewhere in between. As it stands, I now run my tools over each data set as it appears, to judge the quality before I ask for more lanes.

So, the short answer is: no, I don't normalize or exclude any data when I deal with transcriptomes, but I'm in the fortunate position of being able to identify (and accept or reject!) which data sets meet a reasonable level of quality before I process them.
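For what it's worth, the counting itself - the part of Jason's question about tallying tags per transcript and (optionally) normalizing by length - is the easy bit. A rough illustrative sketch, not my actual analysis code:

// Illustrative sketch of per-region tag counting (not my analysis code): count the reads
// whose start position falls inside each region, and optionally divide by region length
// in kb so regions of different sizes can be compared.
public class RegionTagCount {
    public static double[] countTags(int[] readStarts, int[] regionStarts, int[] regionEnds,
                                     boolean normalizeByLength) {
        double[] counts = new double[regionStarts.length];
        for (int start : readStarts) {
            for (int r = 0; r < regionStarts.length; r++) {
                if (start >= regionStarts[r] && start <= regionEnds[r]) {
                    counts[r]++;
                }
            }
        }
        if (normalizeByLength) {
            for (int r = 0; r < counts.length; r++) {
                double lengthInKb = (regionEnds[r] - regionStarts[r] + 1) / 1000.0;
                counts[r] /= lengthInKb;     // tags per kb of region
            }
        }
        return counts;
    }
}

The hard part, as I said above, is knowing whether the library was even enough that those numbers mean anything.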


Monday, February 18, 2008

Aligning DNA - comments from above

I've been pretty bad about continuing my posts on how the different aligners work. It's a lot of work keeping up with them, since I seem to hear about a new one each week. However, a post-doc in my lab gave a presentation on contrasting the various aligners, to discuss each of their strengths and weaknesses for doing short (Illumina) read alignments.

Admittedly, I don't know how accurate the presenter's data was - most of the presentation was geared towards setting up his own in-house aligner development, and thus all of the aligners were painted in a poor light, except his, of course. That being said, there's some truth to what he found: most of the aligners out there have some pretty serious issues.

Eland is still limited by its 32-base limit, which you'd think they'd have gotten past by now. For crying out loud, the company that produces it is trying to sell kits for doing 36-base reads. It's in their best interest to have an aligner that handles more than 32 bases. (Yes, they have a new work-around in their Gerald program, but it's hardly ideal.)

MAQ, apparently, has a weird "feature": if multiple alignments are found, it just picks one at random as the "best". Hardly ideal for most experiments.

Mosaik provides output in .ace files - which are useless for any further work, unless you want to reverse engineer converters to other, more reasonable, formats.

SOAP only aligns against the forward strand! (How hard can it be to map the reverse complement???)
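(For the record, it really isn't hard - something along these lines does it, leaving aside ambiguity codes:)

// Reverse complementing a sequence - about as simple as bioinformatics code gets.
public class ReverseComplement {
    public static String reverseComplement(String sequence) {
        StringBuilder result = new StringBuilder(sequence.length());
        for (int i = sequence.length() - 1; i >= 0; i--) {
            switch (sequence.charAt(i)) {
                case 'A': result.append('T'); break;
                case 'T': result.append('A'); break;
                case 'C': result.append('G'); break;
                case 'G': result.append('C'); break;
                default:  result.append('N'); break;   // anything ambiguous becomes N
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("ATGCCGTA"));   // prints TACGGCAT
    }
}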

Exonerate is great when run in "slow mode", at which point it's hardly usable for 40M reads; and when it's run in "fast mode", its results are hardly usable at all.

SHRiMP, I just don't know enough about to comment on.

And yes, even the post-doc's in-house aligner (called Slider) has some serious issues: it's going to miscall all SNPs unless you're aligning fragments from the reference sequence back to itself. (That's not counting the 20 hours I've already put in to translate the thing into proper Java, patching memory leaks, and the like...)

Seriously, what's with all of these aligners? Why hasn't anyone stepped up to the plate and come up with a decent open-source aligner? There have got to be hundreds of groups out there struggling to make these work, and not one of the aligners is ideal for use with Illumina reads. Isn't there one research group out there dog-fooding their own Illumina sequence aligner?

At this rate, I may have to build my own. I know what they say about software, though: You can have fast, efficient or cheap - pick any two. With aligners, it seems that's exactly where we're stuck.


Thursday, February 7, 2008

AGBT post #2.

Good news.. my bag arrived! I'm going to go pick it up after the current session, and finally get some clean clothes and a shave. Phew!

Anyhow, on the AGBT side of things, I just came back from the Pacific Biosciences panel discussion, which was pretty neat. The discussion was on "how many base pairs will it take to enable personalized medicine?" - a topic I'm really quite interested in.

The answers stretched from infinite, to 6 billion, to 100TB, to 100 people (if they can pick the right person), to 1 (if they find the right one). It was a pretty decent discussion, covering everything from American politics, to SNP finding, to healthcare... you get the idea. The moderator - the host of a show (Biotechworld?) on NPR - was also good.

My one problem is that in giving their answers, they brushed on several key points but never really followed up on them.

1) Just having the genome isn't enough. Stuff like transcription factor binding sites, methylation, regulation, and so forth are all important. If you don't know how the genome works, personal medicine applications aren't going to fall out of it. (Elaine Mardis did mention this, but there was little discussion of it.)

2) Financial aspects will drive this. That, in itself, was mentioned, but the real paradigm shift will happen when you can convince the U.S. insurance companies that preventive medicine is cheaper than treating illness. That's only a matter of time, but I think it will drive FAR more long-term change than having people's genomes will. (If insurance companies gave obese people a personal trainer and cooking lessons, assuming their health issues are diet related, they'd save a bundle by not having to pay for diabetes medication, heart surgery and associated costs... but targeting people for preventive treatment requires much more personal medicine than we have now.)

Other points that were well covered include the effect of computational power as a limiting factor in processing information, the importance of sequencing the right people, and how it's impossible to predict where the technology will take us, both morally and scientifically.

Anyhow, as I'm typing this while sitting in other talks:

Inanc Birol, also from the GSC, gave a talk on his work on a new de novo assembler:

80% reconstruction of the C. elegans genome from 30x coverage, which required 6 hours (on 10 CPUs) for data preparation, with the assembly itself done in less than 10 minutes on a single CPU, using under 4Gb of RAM.

There you go... the question for me (relevant to the last post) is "how much of the remaining 20% has poor sequencability?" I'm willing to bet it's largely the same poorly-mappable fraction I wrote about in that post.

And I just heard a talk on SSAHA_pileup, which seems to try to sort out SNPs. Unfortunately, every SNP-caller talk I see assumes 30x coverage... how realistic is that for human data? Anyhow, I'm sure I missed something. I'll have to check out the slides on slideshare.net once they're posted.

And the talks continue....


btw, remind me to look into the fast Smith-Waterman in cross-match - it sounds like it could be useful.


AGBT post #1.

I'm here... and online. I almost didn't make it, thanks to bad weather in Florida, but at least the car we rented didn't break down on the road the way the other group's did. Apparently the police saved them from the alligators and wild pigs... No one can say AGBT hasn't been exciting so far.

Anyhow, lots of good topics, and meeting interesting people already. (I'm even sitting beside an exec from Illumina, in the ABI sponsored lunch.. how's that for irony?) Anyhow, I'm excited to start the poster sessions and get some good discussions going.

Unfortunately, I missed two of the talks this morning, while I negotiated with the good people at United airlines to have my bag delivered. The three others I've seen so far have been good. Some interesting points:

The best graphics are the ones with the two DNA strands shown separately. Too cool - must include that in my FindPeaksToR scripts.

Loss or gain of homozygosity can screw up what you think you have, compared to what's really there. Many models assume you have only one copy of a gene, or just don't really do much to make sense of these events.

From David Cox, I learned that Barcoding isn't new (it's not), but that it usually doesn't work well (I can't prove that), but hey, they got it to work (and that's good!).

And yes, my favorite line from David Cox's presentation was something like: 900 PCR products, 90 people['s samples]... 1 tube. Make sure you don't drop it!

Anyhow, I'm getting lots of ideas, and I'm thoroughly enjoying this conference. I'm saturated in next-gen sequencing work.

Anyhow, if anyone else is reading this, my poster is #38... feel free to come by and talk.


Tuesday, February 5, 2008

AGBT and Sequencability

First of all, I figured I'd try to do some blogging from AGBT while I'm there. I don't know how effective it'll be, or even how real-time, but we'll give it a shot. (Wireless in Linux on the Vostro 1000 isn't particularly reliable, and I don't even know how accessible the internet will be.)

Second, what I wrote yesterday wasn't very clear, so I thought I'd take one more stab at it.

Sequencability (or mappability) is a direct measure of how well you'll be able to sequence a genome using short reads. Thus, by definition, the success of de novo sequencing of a genome is going to be a direct result of the sequencability of that genome. Unfortunately, when people talk about sequencability, they state it in terms of "X% of the genome is sequencable", which really means "sequencability is non-zero for X% of the genome."

Unfortunately, even if sequencability is non-zero, it doesn't mean you can generate all of the sequences (even if you could do 100% random fragments, which we can't), meaning that much of the genome beyond that magical "X% sequencable" is still really not assemblable. (Wow, that's such a bad word.)

Fortunately, sequencability is a function of the length of the reads used, and as the read length increases, so does sequencability.
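If you want to see that effect for yourself, it's easy to play with on a toy example: count what fraction of positions have a k-mer that occurs exactly once, and watch it climb as k grows. A sketch (a real genome needs something far more memory-conscious than a HashMap of strings):

import java.util.HashMap;
import java.util.Map;

// Toy illustration of sequencability as a function of read length: the fraction of
// positions whose k-mer occurs exactly once grows as k grows. For a real genome you'd
// need a far more memory-conscious index than a HashMap of substrings.
public class Sequencability {
    public static double uniqueFraction(String genome, int k) {
        Map<String, Integer> kmerCounts = new HashMap<String, Integer>();
        for (int i = 0; i + k <= genome.length(); i++) {
            String kmer = genome.substring(i, i + k);
            Integer count = kmerCounts.get(kmer);
            kmerCounts.put(kmer, count == null ? 1 : count + 1);
        }
        int uniquePositions = 0;
        for (Integer count : kmerCounts.values()) {
            if (count == 1) uniquePositions++;   // a k-mer seen once maps to exactly one position
        }
        int totalPositions = genome.length() - k + 1;
        return totalPositions > 0 ? (double) uniquePositions / totalPositions : 0.0;
    }

    public static void main(String[] args) {
        String toyGenome = "ACGTACGTTTACGTAAACGT";   // deliberately repetitive toy sequence
        for (int k = 4; k <= 12; k += 4) {
            System.out.println("k=" + k + "  unique fraction=" + uniqueFraction(toyGenome, k));
        }
    }
}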

Thus, there's hope that if we increase the read length of the Illumina machines, or someone else comes up with a way to do longer sequences with the same throughput (e.g. ABI SOLiD, or 454's GS FLX), the assemblability of the genome will increase accordingly. All of this goes hand in hand: longer reads and better lab techniques always make a big contribution to the end results.

Personally, I think the real answer lies in using a variety of techniques: paired-end tags to span difficult-to-sequence areas (e.g. low or zero sequencability regions), and single-end tags to get high coverage... and hey, throw in a few BAC and EST reads for good luck. (=


Wednesday, January 30, 2008

Comments on de novo assembly

First off, I'd like to say thanks to Paul and stajich for their comments. Paul for raising several points that I'd like to discuss tonight, and stajich for bringing my attention to the SHRiMP aligner. Obviously, I haven't had much chance to look at SHRiMP, yet, but I'll definitely get around to it.

So, Paul mentioned several things that are worth discussing:

The Velvet roadmap for mRNA could be interesting. Sometimes the intron is snipped, sometimes it isn't, get a nice little bubble. And the similar transcripts will do... interesting things.

Short reads would be fine for de-novo if DNA was completely random. Pity that it isn't.


For the most part, I agree with Paul: using Velvet on a transcriptome would be pretty cool. Yes, you'd get bubbles, but you'd also get a lot of nifty information about new transcripts, or intermediate forms of transcripts. This neatly avoids one of the major hurdles I currently face when working with transcriptomes: I can either align against a genome or against a list of known genes, and neither one really gives me much opportunity to identify new genes. (Of course, I'm working on doing just that with a colleague of mine, but I'm saving that for another post.)

Where I disagree with Paul, however, is his final statement that short reads would be fine for de novo assembly if they were truly random. I have two problems with that: the first is mappability (or sequencability), and the second is the short nature of the reads themselves.

Mappability has been defined in many different ways, but for general purposes, it's the ability of a sequenced read to be unambiguously aligned to a single location on a chromosome. Fortunately, with 36-mer reads, as are produced by Illumina 1Gs, something like 70-80% of the genome is mappable. Not 100% mappable, but mappable to some degree. This may seem trivial, but it's important.

Why it's important is that a genome with 95% mappability doesn't mean you get a single contig covering 95% of the genome, with 5% of your reads left over. It's more like every 20-100kb you'll find something that you just can't assemble over. (I'm guesstimating that number, by the way.) This means you have lots of small contigs with no overlap between them. That is, of course, assuming you had enough highly mappable reads to do a relatively error-free assembly - and also assuming your sequencing errors haven't completely interfered with the assembly. I don't think either can really be taken for granted.

In reality, the mappability of a genome isn't a property you can figure out until you've sequenced it, so I don't want to discourage anyone from trying. I'm aware of people working on using Illumina reads to do this, although they've supplemented the reads with BAC sequencing, ESTs and various other options, which provide a nice scaffolding. This approach allows them to augment their sequencing power by obtaining a lot of coverage through the Illumina runs, rather than having to use the short reads as the basis of a serious de novo assembly - which seems wise to me. (And even then, they're only doing a fungus - not a higher vertebrate!)

And to return to my earlier point about the mappable genome not being 100% mappable, this is where I think Paul's point oversimplifies. A base that is only 50% mappable still counts as "mappable" in common discussion, but it means that 50% of the possible sequencing reads in which it's likely to participate will not yield an unambiguous alignment! That means you can say goodbye to half of the likely fragments you would need to generate an overlap spanning two forming contigs. That, to me, indicates that any de novo assembly is unlikely to correctly proceed past such a base - and 50% mappability is not unusual in the genome.

The other point of Paul's that I wanted to discuss was the assertion that the non-random fragmenting of DNA is what's stopping us from doing a serious de novo assembly. While it's true that shearing DNA isn't an entirely random process, it's also true that it doesn't need to be. People have been doing restriction enzyme digests for decades now, in order to map vectors and inserts. (I learned how to do it in grade 11 biology, and that was back in 1994.) So, while sonication or digests may not be random, what's to stop someone from stripping DNA of its histones and then doing 25 different digests? The net effect is just about the same (assuming you pick 25 good restriction enzymes with different recognition sequences), and it will yield fragments that get around Paul's issue. But does that mean we can now do a de novo assembly?

No... I don't think so. I doubt the lack of a random fragmentation pattern is the limiting factor in de novo assemblies. Unfortunately, the only way to prove myself or Paul right or wrong is to simulate it: take the mouse genome, fragment it into random 36-mers (the same size you get from an Illumina sequencing run), then inject 1 random base for every 1000 bases read (I'll be generous and assume a 0.1% error rate, though the real thing is closer to 3-5% from what I've seen), and then try running Velvet on it.
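Generating the simulated reads is the easy half of that experiment - something like the sketch below would do it (illustrative only; the randomly generated toy reference stands in for a real genome sequence). Running Velvet on the output and scoring the assembly is where the actual work lies:

import java.util.Random;

// Sketch of the simulation described above: draw random 36-mers from a reference
// sequence and inject substitution errors at roughly 1 per 1000 bases, writing FASTA
// to stdout. The randomly generated "reference" is a stand-in for a real genome.
public class ReadSimulator {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    public static void main(String[] args) {
        String reference = randomSequence(100000, new Random(7));
        int readLength = 36;
        double errorRate = 0.001;                                      // ~1 error per 1000 bases
        long numberOfReads = 40L * reference.length() / readLength;    // roughly 40x coverage
        Random random = new Random(42);
        for (long n = 0; n < numberOfReads; n++) {
            int start = random.nextInt(reference.length() - readLength + 1);
            char[] read = reference.substring(start, start + readLength).toCharArray();
            for (int i = 0; i < read.length; i++) {
                if (random.nextDouble() < errorRate) {
                    read[i] = BASES[random.nextInt(BASES.length)];     // may pick the same base
                }
            }
            System.out.println(">read_" + n + "\n" + new String(read));
        }
    }

    private static String randomSequence(int length, Random random) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) sb.append(BASES[random.nextInt(BASES.length)]);
        return sb.toString();
    }
}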

I bet you'll observe that somewhere around 40x coverage you'll start to saturate, and you'll discover that your assembly has covered anywhere from 60-80% of the genome (with, say, 5% of that just plain wrong), and that it'll have taken you longer to do that assembly than it would have taken to just sequence the damned thing in the first place with PCR.

Anyhow, I take Paul's point to heart: we're still early on in this game, and there are a lot of neat things we can try. I'm looking forward to seeing the results of many of them. (Who has time to try them all?)

By the way, FindPeaks 3.0.1 is done, and I'll be presenting it as a poster at AGBT 2008 this week. If you're interested in ChIP-Seq/ChIP-Solexa (or, for the pedantic, ChIP-Illumina), come find me, and I'll tell you about some of its new features.


Tuesday, January 22, 2008

Solexa Transcriptome Shotgun: Transcriptome Alignments vs. Genome Alignments

First off, I was up late trying to finish off one of my many projects, so I didn't get a lot of sleep. Thus, if my writing is more incoherent than usual, that's probably why. And now on with the show.

I'm jumping gears a bit today. I haven't finished my discussion about aligners - I still want to talk about Exonerate in detail, and then discuss some of the other aligners in overview. (For instance, the one I found out about today, called GMAP: a Genomic Mapping and Alignment Program for mRNA and EST sequences.) Anyhow, the point is that part of the purpose of using an aligner is to align to something in particular, such as a genome or a set of contigs, but selecting what you align your sequences back to is a big issue.

When you're re-sequencing a genome, you map back to the genome template. Yes, you're probably sequencing a different individual, so you'll find occasional sections that don't match, but most humans are ~99% identical, and you can look into SNP (single nucleotide polymorphism) databases to find out if the differences you're seeing are already commonly known changes. Of course, if you're re-sequencing Craig Venter, you don't need to worry about SNPs as much. Fortunately, most of us are sequencing more exciting genomes and so forth.

When you're sequencing a genome you've never sequenced before, you can't do mapping at all; instead, you use one of the various assemblers, such as Velvet (written by Daniel Zerbino, who is a lot of fun to hang out with at conferences, I might add...), SSAKE (written by Rene Warren, who incidentally also works at the GSC, although several floors above me), or Euler (which I'd never heard of till I googled the web page for Velvet...). The key point: you don't need to worry about what you're mapping back to when you do de novo assembly, since you're creating your own map. I'll digress further for one brief comment: assembly from Solexa/Illumina sequences is a bad idea, because they're so short!

Moving right along, we come to the third thing people are sequencing these days: transcriptomes. (Yes, I'm ignoring cloned libraries... they're so 90's!) Transcriptomes are essentially a snapshot of the mRNA in a set of cells at a given point in time. Of course, mRNA is rather unstable, so protocols have been developed to convert mRNA to cDNA (complementary DNA), which is essentially a copy of the mRNA in DNA form. (Yes, I'm ignoring some complexity here, because it makes for a better story.) But I'm getting ahead of myself. Let's talk about the mRNA, but be aware that the sequencing is actually done on cDNA.

mRNA is an interesting beast. Unlike Genomic DNA, it's a more refined creature. For Eukaryotes, the mRNA is processed by the cell, removing some segments that are non-coding. Once the cell removes these segments (labeled introns), and leaves other segments (labeled exons), we have a sequence of bases that no longer matches the genomic DNA sequence from which it came. Sure, there are short segments that map back very well (i.e. the exons), but if you take a small random snippet from the mRNA, there's a small chance that it might overlap the boundaries between two exons, which means the bases you have won't map back nicely to the genomic point of origin. That can be a serious problem.

Sure, you say, we can do a gapped alignment, and try to find the two points from which this sequence originated, with one big gap in between. If you're sophisticated, you'll even know that introns usually carry signals that indicate their presence. And yes, you'd be right: we can do that. Unfortunately, for most Solexa runs, you get 20,000,000+ sequences. At 10 seconds a sequence (which doesn't seem like much, really), how long would it take to do that alignment?

Too long.
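(Back-of-the-envelope, in case "too long" sounds vague: 20,000,000 reads x 10 seconds each is 2 x 10^8 seconds, or more than six years of CPU time - still weeks of wall-clock time even if you spread it across a hundred processors.)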

So most of the time, we don't do gapped alignments. Instead, we have two choices:
  1. Align against the genome, and throw away reads that we can't align (i.e. those that overlap intron/exon boundaries).

  2. Align against a collection of known coding DNA sequences


Number two isn't a bad option: it already has all the introns spliced out, so you don't need to worry about missing those alignments. Unfortunately, there are several issues with this approach:
  • Do we really know all of the coding DNA sequences? For most species, probably not, but this is a great idea for organisms like Drosophila. (I tried this yesterday on a fruit fly Illumina run and it worked VERY well.)

  • Many transcripts are very similar. This isn't a problem with the data, but with your alignment program. If your aligner doesn't handle multi-matches (like Eland), this will never work.

  • Many transcripts are very similar. Did I say that already? Actually, it causes more problems. How do you know which transcript was really the source of the sequence? I have several ways to get around it, but nothing foolproof yet.

  • Aligning to a transcript is hard to visualize. This is going to be one of my next projects... but with all the fantastic genomes out there, I'm still not aware of a good transcriptome visualization tool.


And that brings us to the conclusion: aligning a transcriptome run against a genome and aligning it against a transcriptome both have serious problems, and there really are no good solutions for this yet.

For now, all I can do is run both: they tell you very different things, but both have fantastic potential. I haven't released my code for either one, yet, but they both work, and if you contact my supervisor, he'd probably be interested in collaborating, or possibly sharing the code.

Labels: , , , , ,

Thursday, January 17, 2008

Aligning DNA - Eland

Continuing in my series of blog articles on short-read sequencing, it's time for a little discussion of the first short-read aligner I ever met: Eland. The problem with writing anything about Eland is that 90% of the information I have on it can't be confirmed independently. In fact, the majority of what I know comes from the header of the ELAND.cpp file. I'll admit that while I've gone through the source code a couple of times, I haven't spent much time getting to know it well - I've processed billions of lines of Eland files, but have yet to run the program myself...

Anyhow, here's what I can tell you: Eland stands for "Efficient Local Alignment of Nucleotide Data" and was written by Anthony J. Cox for Solexa Ltd. Of course, Solexa has since been bought out by Illumina, so I assume the copyright has been transferred along with the technology during that transaction.

What makes Eland so fascinating to me is that it's a direct departure from dynamic-programming-based algorithms (like the venerable Smith-Waterman alignment), which makes it an interesting beast from a computational point of view.

The basic algorithm works along these lines: a sequence of length N can be divided into four subsequences (A, B, C and D) of equal (or nearly equal) length. If the sequence contains no more than 2 errors, then at least two of these subsequences must be "error free", and those error-free pieces can be searched for exactly in a database containing all possible subsequences of the genome of interest. Thus, you could search your database for, say, the pair AB and the pair CD. Of course, you don't know in advance which subsequences are the error-free ones, so you need to try all the possible combinations.

What Eland does to speed things up is to combine these subsequences into sets of two. Rather than searching for 4 independent entries in the database, it only has to search for 2 pairs, and as long as one pair matches exactly, looser matching criteria (i.e. allowing mismatches) can be used on the other pair. That is to say, if you make two keys out of the 4 subsequences, {AB and CD}, you can search for an exact match to AB and then test each of the resulting locations for mismatches in the {CD} portion of the sequence. This significantly reduces the search space: only locations containing an exact match to {AB} ever need to be examined.

Searching for the pair {AB and CD} only works if the first half of your sequence has no errors, though. What if B and C carry the errors? The simple answer is to shuffle your subsequences into other combinations: {AB and CD}, {AC and BD}, {AD and BC}, {BA and DC}, and so on. This still gives you a relatively small search space for each sequence, as there are at most 4! orderings of the four subsequences (4x3x2x1 = 24) to consider.

Fortunately, you can bound this even further. The subsequences always have to stay in their original order within the keys (i.e. {AB and CD} and {AC and BD} are fine, but {CA and DB} and {BA and DC} would give you incorrect results), limiting you to only six possible combinations: one for each way of choosing which two of the four subsequences must be error free. That still lets you find any correct match in which at least two of the four subsequences are error free.
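
To make that concrete, here's a rough Python sketch of the same strategy. This is my own toy reconstruction, not Eland's actual code, and the genome, read and index below are invented: split the read into quarters, and for each of the six ways of choosing which two quarters must match exactly, look up the first quarter, confirm the second, and then check the whole read for at most two mismatches.

# Toy reconstruction of the "at least 2 of 4 seeds must be exact" idea.
# Not Eland's real implementation; genome and read are made up.
from itertools import combinations

genome = "ACGTACGTTTGACCAGGTTACATTGACCAGGA"   # made-up reference
read   = "TTGACCAGGTTACATT"                   # 16 bp read, up to 2 mismatches allowed

n = len(read)
seed_len = n // 4
# The four quarters A, B, C, D, each with its offset within the read.
quarters = [(i * seed_len, read[i * seed_len:(i + 1) * seed_len]) for i in range(4)]

# Index every seed-length substring of the genome by its sequence.
index = {}
for pos in range(len(genome) - seed_len + 1):
    index.setdefault(genome[pos:pos + seed_len], []).append(pos)

def mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))

hits = set()
# combinations() keeps the quarters in their original order, giving exactly
# the six allowed ways of choosing two error-free quarters.
for (off1, seq1), (off2, seq2) in combinations(quarters, 2):
    for pos in index.get(seq1, []):
        start = pos - off1                    # implied start of the full read
        if start < 0 or start + n > len(genome):
            continue
        if genome[start + off2:start + off2 + seed_len] != seq2:
            continue                          # the second seed must also be exact
        if mismatches(genome[start:start + n], read) <= 2:
            hits.add(start)

print(hits)                                   # {8}: the read maps to position 8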

That, in general, is how it works... with a few nifty shortcuts. For example, they only need to pass over the genome search space 3 times to find all possible matches with up to two errors, because the subsequence combinations can be searched for simultaneously and efficiently. In other words, you get a very quick alignment.

That said, the way in which Eland was implemented is very important. The aligner has some severe restrictions, some of which are algorithm specific and some of which are implementation specific:

  1. The Eland aligner can only align sequences up to 32 base pairs long. (Implementation specific: they use four integers to represent the subsequences {A, B, C and D}, which reduces the memory requirements; see the sketch after this list.)

  2. The Eland aligner can only align sequences with up to two mismatches. (Algorithm specific: at least two of the keys must match, although the algorithm could be expanded to use more subsequences, at the expense of expanding the sequence space.)

  3. The Eland aligner only returns sequences that match to one unique location in the genome. (Implementation specific: Eland already tells you when a sequence matches more than one location; it just doesn't tell you what those locations are.)

  4. The Eland aligner can handle mismatches in the sequence, but cannot handle insertions or deletions (i.e. gapped alignments). (Algorithm specific: the key-matching algorithm could be expanded to include this, but it would be very costly in time and memory.)

  5. The code is not freely available. (Implementation specific: I don't know of anywhere you can go to get the code... or documentation!)

  6. Changing the length of the alignment sequence (I called it N, above), requires a recompile. (Implementation, obviously. Why this isn't a command line parameter, I don't know.)
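
On point 1, here's a guess (and it really is only a guess about the representation, not Eland's actual code) at the kind of two-bit packing that makes integer representations of short sequences so compact and so fast to compare:

# Two-bit packing of DNA: each base needs only two bits, so an 8-base
# subsequence fits in 16 bits and a full 32-base read fits in a single
# 64-bit word. (A guess at the representation, not Eland's actual code.)
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value

read = "ACGTACGTACGTACGTACGTACGTACGTACGT"   # 32 bases
print(pack(read).bit_length())              # <= 64 bits for the whole read
print(pack(read[:8]).bit_length())          # <= 16 bits per 8-base subsequence

# Comparing two packed keys is a single integer comparison, which is what
# makes the exact-match lookups so cheap.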


I'm sure I could go on and list other disadvantages, but I'd like to point out the main advantage: it's FAST. Blazing fast, compared to the other aligners. From my own personal experience, Eland may only be able to map 60-70% of your sequences to the genome (compared to 70-80% for other aligners with which I have less experience), but it does so in a few hours, compared to several days or weeks for most other aligners (or years, in the case of Blast). (Comparisons are for a single processor matching 1 million+ reads to a mammalian-sized genome.) The only other aligner that comes close to this speed, to the best of my knowledge, is SXOligoSearch, produced by Synamatix, though that beast will cost upwards of $225,000 CAD to license.

Anyhow, to close this discussion, it's worth mentioning that rumours of a new version of Eland have been circulating since March 2007. Beta versions, capable of handling sequences longer than 32 base pairs and of reporting multi-matches, were supposed to be released in September 2007. Unfortunately, those betas appear to be vapourware so far, which has prompted us to look into several other aligners; I'll discuss them in future posts.

(2008-01-28 Update: I stand corrected: I've heard that the new betas have finally arrived! I haven't seen what they can do yet, but at least we know my prediction that they're vapourware is wrong.)

Labels: , , ,

Wednesday, January 9, 2008

Aligning DNA - a quick intro to some of the tools

A long time ago, I mentioned that I'd use this blog for some science. I don't really know who my audience is, but I suspect it's 1 part people who google in for Linux-related keywords, 1 part people who are googling me because they've read my name somewhere, and 1 part random people... oh, and not to forget those people who I actually know in real life, but don't get to see often. Since I'm unlikely to scare off the googlers or the random people with some science, and the people I know in person will just skip over this posting (that's a hint, if you haven't already stopped reading...), I'll just go ahead with a quick lesson in genomics.

As I've mentioned to many people, my research is currently heavily embedded in writing code for the processing of Solexa/Illumina reads. For those who aren't in the know, the Illumina 1G is a DNA sequencing machine that can produce ~40 million sequences in about 36 hours, where each sequence is 36 bases long. This is drastically different from traditional sequencing, where you get far fewer sequences (on the order of hundreds), each of which can be up to 1000 bases long. With the low volume of alignments and the long reads, traditional sequencing can be processed with a rather laissez-faire attitude: if each read takes 10 seconds to align, aligning 100 reads takes about 1000 seconds, or roughly 15 minutes.

In contrast, with 40 million reads at ten seconds each, you're talking about waiting more than 12 years for your data to be processed. Clearly that is not an option if you want to get into a cutting-edge journal.

(For those who are curious to know where the 10 second mark came from, my supervisor timed a few blast searches on a local server.)
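
For the record, the back-of-the-envelope arithmetic behind that 12-year figure:

# Back-of-the-envelope: why 10 seconds per read doesn't scale.
reads = 40_000_000
seconds_per_read = 10
total_seconds = reads * seconds_per_read
print(total_seconds / (60 * 60 * 24 * 365))   # ~12.7 years on a single processor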

Consequently, computational scientists have risen to the task, and created several new algorithms for performing these alignments. The first one that I met is called "Eland", which I believe stands for "Efficient Local Alignment of Nucleotide Data". To be honest, I'm not really sure how to get Eland, as there is very little information available on the web. I've gleaned a few scraps of information: It was written by Anthony Cox, and was distributed by Solexa/Illumina. As far as I can tell, it came with the Illumina 1G machines.

The second one I met was called "Mosaik". This aligner comes from the lab of Gabor Marth at Boston College. The third was "Exonerate", followed by several others. Hands down, the best name for an aligner has got to be "MAPASS"... I'll let you ponder that for a few minutes.

Anyhow, each aligner has its advantages and disadvantages. Eland, for instance, is crippled by only ever being able to align the first 32 bases of a sequence, and by its inability to map a read to more than one location in a target DNA source (i.e. genome or transcriptome). On the other hand, it's one of the fastest aligners out there.

Mosaik is a bit slower, but has several nice features and is able to handle the cases that Eland can't. Unfortunately, it dumps its alignments to a file format that really isn't convenient for any further processing, a product of its original use in sequence assembly.

Just to throw in one more curve, it's worth mentioning a competing proprietary piece of software. There's a company in Malaysia that has what appears to be the fastest aligner out there, with none of the limitations of Eland, and a flexible file output. Like all commercial products for the science market, they set the base price of their product outside of the reach of most consumers: I can't seem to find a good justification to get my supervisor to pay $200,000+ USD/CAD for a copy of their software. (You could hire four post-docs for a year for that... or 10 grad students.)

Anyhow, now you know who the players are. My next post on this topic will be to introduce some of them in detail, explain how they work, and then tell you how we use them.

Yes, I'm feeling ambitious: I got three pieces of software working today, two of which are likely to be in production at the Genome Sciences Centre within the next couple of weeks: one for ChIP-Seq experiments (FindPeaks 3.0), and the other for transcriptome processing in mammalian cells (yet to be named).

Labels: , , ,

Monday, December 3, 2007

Molecules

Planning and reality often don't mix in grad school, I'm discovering. Since I have a committee meeting tomorrow, I thought I'd use today to finish my presentation and written documentation for it. Instead, I was told I had to have code running by 1pm that would process transcriptome data using paired-end Solexa reads. Five hours later, I had code analyzing single-end transcriptome data, but nothing finished for my committee meeting.

On that note, I'm not going to blog about paired-end data - I'll save that for another day - and instead I'm going to go work on my presentation and report. However, I did want to provide a neat link Elaine passed to me: made with molecules. Jewelry for the rich and geeky. (I'm geeky enough to like it, but not rich enough to pay THAT much for it, even if I wore ornaments.)

Oh, and one last parting note... my laptop is being assembled today. If I weren't going to be so busy tomorrow, I'd probably be checking its status every 10 minutes. The curse and blessing of real-time tracking data.

Labels: , , ,