fejes.ca

Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Thursday, December 17, 2009

One lane is (still) not enough...

After my quick post yesterday where I said one lane isn't enough, I was asked to elaborate a bit more, if I could. Well, I don't want do get into the details of the experiment itself, but I'm happy to jump into the "controls" a bit more in depth.

What I can tell is that with one lane of RNA-Seq (Illumina data50bp), all of the variations I find show up either in known polymorphism database or as somatic SNPs, with a few exceptions. The few exceptions just turn out to be exceptions for lack of coverage.

For a "control", I took two data sets (from two separate patients) - each with 6 individual lanes of sequencing data. (I realize this isn't the most robust experiment, but it shows a point.) In the perfect world, each of the 6 lanes per person would have sampled the original library equally well.

So, I matched up one lane from each patient into 6 sets and asked the question: How many transcripts are void (less than 5 tags) in one sample and at least 5x greater in the other sample. (I did this in both directions.)

The results aren't great. In one direction, I see an average of 1245 Transcripts (about 680 genes, so there's some overlap amongst the transcript set) with a std dev. of 38 Transcripts. That sounds pretty consistent, till you look for the overlap in actual transcripts: avg 27.3 with a std dev of 17.4. (range 0-60). And, when with do the calculations, the most closely matched data sets only have a 5% overlap.

The results for the opposite direction were similar: Average of 277 transcripts found that met the criteria (std.dev of 33.61), with an average overlap between data sets being 4.8, std. dev 4.48. (range of 0-11 transcripts in common.) The best overlap in "upregulated" genes for this dataset was just over 4% concordance with a second pair of lanes.

So, what this tells me (for a VERY dirty experiment) is that expression of genes in one lane is highly variable depending on the lane for genes expressed at the low end. (Sampling at the high end usually pretty good, so I'm not too concerned about that.)

What I haven't answered yet is how many lanes is enough. Alas, I have to go do some volunteering, so that experiment will have to wait for another day. And, of course, the images I created along the way will have to follow later as well.

Labels: Bioinformatics, false discovery rate, Genomics, Sequencing, Vancouver Short Read Analysis Package

Tuesday, October 6, 2009

DNA sequencing Videos.

With IBM tossing it's hat into the ring of "next-next-generation" sequencing, I'm starting to get lost as to which generation is which. For the moment, I'm sort of lumping things together, while I wait to see how the field plays out. In my mind, first generation is anything that requires chain termination, Second generation is chemical based pyrosequencing, and third generation is single molecule sequencing based on a nano-scale mechanical process. It's a crude divide, but it seems to have some consistency.

At any rate, I decided I'd collect a few videos to illustrate each one. For Sanger, there are a LOT of videos, many of which are quite excellent, but I only wanted one. (Sorry if I didn't pick yours.) For second and third generation DNA sequencing videos, the selection kind of flattens out, and two of them come from corporate sites, rather than youtube - which seems to be the general consensus repository of technology videos.

Personally, I find it interesting to see how each group is selling themselves. You'll notice some videos press heavily on the technology, while others focus on the workflow.

As an aside, I also find it interesting to look for places where the illustrations don't make sense... there's a lovely place in the 454 video where two strands of DNA split from each other on the bead, leaving the two full strands and a complete primer sequence... mysterious! (Yes, I do enjoy looking for inconsistencies when I go to the movies.)

Ok, get out your popcorn.

First Generation:
Sanger Entry: Link

Second Generation:
Pyrosequencing Entry: Link

Helicose Entry: Link

Illumina (Corporate site): Link

(Click to see the Flash animation)

454 Entry: Link

Third Generation:

Pacific Biosciences: Link

(Click to see the Flash Video)

Oxford Nanopore Entry: Link

IBM's Entry: Link

Note: If I've missed something, please let me know. I'm happy to add to this post whenever something new comes up.

Labels: Sequencing, SMRT, Solexa/Illumina

Friday, October 2, 2009

Base quality by position

A colleague of mine was working on a nifty tool to give graphs of the base quality at each position in a read using Eland Export files, which could be incorporated into his pipeline. Over a discussion about the length of time it should take to do that analysis, (His script was taking an hour, and I said it should take about a minute to analyze 8M illumina reads...) I ended up saying I'd write my own version to do the analysis, just to show how quickly it could be done.

Well, I was wrong about it taking about a minute. It turns out that the file has a lot more than about double the originally quoted 8 million reads (QC, no match and multi match reads were not previously filtered), and the whole file was bzipped, which adds to the processing time.

Fortunately, I didn't have to add bzip support in to the reader, as tcezard (Tim) had already added in a cool "PIPE" option for piping in whatever data format I want in to applications of the Vancouver Short Read Analysis Package, thus, I was able to do the following:

time bzcat /archive/solexa1_4/analysis2/HS1406/42E6FAAXX_7/42E6FAAXX_7_2_export.txt.bz2 | java6 src/projects/maq_utilities/QualityReport -input PIPE -output /projects/afejes/temp -aligner elandext

Quite a neat use of piping, really.

Anyhow, the fun part is that this was that the library was a 100-mer illumina run, and it makes a pretty picture. Slapping the output into openoffice yields the following graph:

I didn't realize quality dropped so dramatically at 100bp - although I remember when qualities looked like that for 32bp reads...

Anyhow, I'll include this tool in Findpeaks 4.0.8 in case any one is interested in it. And for the record, this run took 10 minutes, of which about 4 were taken up by bzcat. Of the 16.7M reads in the file, only 1.5M were aligned, probably due to the poor quality out beyond 60-70bp.

Labels: application development, Open Source, programming, Sequencing, Solexa/Illumina, Vancouver Short Read Analysis Package

Wednesday, September 30, 2009

Second Gen Sequencer Map and naming our generations properly

Yes, this is old news, but I've found myself searching for the second generation sequencing map that was started at seqanswers several times in the last few days. Just to make it even easier to find - and of course to give some positive publicity to a really cool project - here's the Google Maps based list of facilities that run second generation (Next-Generation) sequencing machines:

http://tinyurl.com/orm8cr

For the record, I understand a few places are missing, however. I hear there's some second-generation sequencing going on Alberta, which clearly hasn't appeared on the map yet.

And, as a foot note, now that I see other people have picked up on it and started calling "next-generation sequencing" the more appropriate label of "second generation sequencing" (e.g. here at nature), I'm going to drop next-gen as a label entirely. First-gen is sanger/dideoxy/capillary/etc, second-gen is pyrosequencing and the third gen is biotech-based (using cellular components such as DNA polymerases and the like). Lets end the confusion and name our generations accordingly, shall we?

Labels: Sequencing, Solexa/Illumina

Tuesday, August 18, 2009

new repository of second generation software

I finally have a good resource for locating second gen (next gen) sequencing analysis software. For a long time, people have just been collecting it on a single thread in the bioinformatics section of the SeqAnswers.com forum, however, the brilliant people at SeqAnswers have spawned off a wiki for it, with an easy to use form. I highly recommend you check it out, and possibly even add your own package.

http://seqanswers.com/wiki/SEQanswers

Labels: Algorithms, Aligners, application development, assembly, Bioinformatics, Chip-Seq, documentation, new technology, programming, Sequencing, short-reads, SNP calling, Software

Saturday, August 15, 2009

What would you do with 10kbp reads?

I just caught a tweet about an article on the Pathogens blog (What can you do with 1000 base pair reads?), which is specifically about 454 reads. Personally, I'm not so interested in 454 reads - the technology is good, but I don't have access to 454 data, so it's somewhat irrelevant to me. (Not to say 1kbp reads isn't neat, but no one has volunteered to pass me 454 data in a long time...)

So, anyhow, I'm trying to think two steps ahead. 2010 is supposed to be the year that Pacific Biosciences (and other companies) release the next generation of sequencing technologies - which will undoubtedly be longer than 1k. (I seem to recall hearing that PacBio has 10k+ reads.- UPDATE: I found a reference.) So to heck with 1kbp reads, this raises the real question: What would you do with a 10,000bp read? And, equally important, how do you work with a 10kbp read?

What software do you have now that can deal with 10k reads?
Will you align or assemble with a 10k read?
What experiments will you be able to do with a 10k read?

Frankly, I suspect that nothing we're currently using will work well with them - we'll all have to go back to the drawing board and rework the algorithms we use.

So, what do you think?

Labels: Aligners, Bioinformatics, future, Pacific Biosciences, Sequencing

Wednesday, July 29, 2009

Aligner tests

You know what I'd kill for? A simple set of tests for each aligner available. I have no idea why we didn't do this ages ago. I'm sick of off-by-one errors caused by all sorts of slightly different formats available - and I can't do unit tests without a good simple demonstration file for each aligner type.

I know Sam format should help with this - assuming everyone adopts it - but even for SAM I don't have a good control file.

I've asked someone here to set up this test using a known sequence- and if it works, I'll bundle the results into the Vancouver Package so everyone can use it.

Here's the 50-mer I picked to do the test. For those of you with some knowledge of cancer, it comes from tp53. It appears to blast uniquely to this location only.

>forward - chr17:7,519,148-7,519,197
CATGTGCTGTGACTGCTTGTAGATGGCCATGGCGCGGACGCGGGTGCCGG

>reverse - chr17:7,519,148-7,519,197
ccggcacccgcgtccgcgccatggccatctacaagcagtcacagcacatg

Labels: Aligners, application development, Bioinformatics, Code planning, Sequencing, short-reads, Vancouver Short Read Analysis Package

Monday, July 27, 2009

how recently was your sample sequenced?

One more blog for the day. I was postponing writing this one because it's been driving me nuts, and I thought I might be able to work around it... but clearly I can't.

With all the work I've put into the controls and compares in FindPeaks, I thought I was finally clear of the bugs and pains of working on the software itself - and I think I am. Unfortunately, what I didn't count on was that the data sets themselves may not be amenable to this analysis.

My control finally came off the sequencer a couple weeks ago, and I've been working with it for various analyses (snps and the like - it's a WTSS data set)... and I finally plugged it into my FindPeaks/FindFeatures pipeline. Unfortunately, while the analysis is good, the sample itself is looking pretty bad. In looking at the data sets, the only thing I can figure is that the year and a half of sequencing chemistry changes has made a big impact on the number of aligning reads and the quality of the reads obtained. I no longer get a linear correlation between the two libraries - it looks partly sigmoidal.

Unfortunately, there's nothing to do except re-seqeunce the sample. But really, I guess that makes sense. If you're doing a comparison between two data-sets, you need them to have as few differences as possible.

I just never realized that the time between samples also needed to be controlled. Now I have a new question when I review papers: How much time elapsed between the sequencing of your sample and it's control?

Labels: biochemistry, Bioinformatics, biology, FindPeaks, Genomics, Sequencing, Vancouver Short Read Analysis Package

Friday, May 29, 2009

Science Cartoons - 3

I wasn't going to do more than one comic a day, but since I just published it into the FindPeaks 4.0 manual today, I may as well put it here too, and kill two birds with one stone.

Just to clarify, under copyright laws, you can certainly re-use my images for teaching purposes or your own private use (that's generally called "fair use" in the US, and copyright laws in most countries have similar exceptions), but you can't publish it, take credit for it, or profit from it without discussing it with me first. However, since people browse through my page all the time, I figure I should mention that I do hold copyright on the pictures, so don't steal them, ok?

Anyhow, Comic #3 is a brief description of how the compare in FindPeaks 4.0 works. Enjoy!

Labels: Algorithms, Chip-Seq, Code planning, Copyright, documentation, figures, FindPeaks, Sequencing, Software, Vancouver Short Read Analysis Package

Wednesday, March 25, 2009

Searching for SNPs... a disaster waiting to happen.

Well, I'm postponing my planned article, because I just don't feel in the mood to work on that tonight. Instead, I figured I'd touch on something a little more important to me this evening: WTSS SNP calls. Well, as my committee members would say, they're not SNPs, they're variations or putative mutations. Technically, that makes them Single Nucleotide Variations, or SNVs. (They're only polymorphisms if they're common to a portion of the population.

In this case, they're from cancer cell lines, so after I filter out all the real SNPs, what's left are SNVs... and they're bloody annoying. This is the second major project I've done where SNP calling has played a central role. The first was based on very early 454 data, where homopolymers were frequent, and thus finding SNVs was pretty easy: they were all over the place! After much work, it turned out that pretty much all of them were fake (false positives), and I learned to check for homopolymer runs - a simple trick, easily accomplished by visualizing the data.

We moved onto Illumina, after that. Actually, it was still Solexa at the time. Yes, this is older data - nearly a year old. It wasn't particularly reliable, and I've now used several different aligners, references and otherwise, each time (I thought) improving the data. We came down to a couple very intriguing variations, and decided to sequence them. After several rounds of primer design, we finally got one that worked... and lo and behold. 0/2. Neither of them are real. So, now comes the post-mortem: Why did we get the false positives this time? Is it bias from the platform? Bad alignments? Or something even more suspicious... do we have evidence of edited RNA? Who knows. The game begins all over again, in the quest for answering the question "why?" Why do we get unexpected results?

Fortunately, I'm a scientist, so that question is really something I like. I don't begrudge the last year's worth of work - which apparently is now more or less down the toilet - but I hope that the why leads to something more interesting this time. (Thank goodness I have other projects on the go, as well!)

Ah, science. Good thing I'm hooked, otherwise I'd have tossed in the towel long ago.

Labels: Bioinformatics, biology, Sequencing, short-reads, SNP calling, SNPS, Solexa/Illumina

Sunday, April 13, 2008

Genomics Forum 2008

You can probably guess what this post is about from the title - which means I still haven't gotten around to writing an entry on thresholding for ChIP-Seq. Actually, it's probably a good thing I haven't, as we've been learning a lot about thresholding in the past week. It seems many things we took for granted aren't really the case. Anyhow, I'm not going to say too much about that, as I plan to collect my thoughts and discuss it in a later entry.

Instead, I'd like to discuss the 2008 Genomics Forum, sponsored by Genome BC, which took place on Friday - though, in particular, I'm going to focus on one talk, near to my own research. Dr. Barbara Wold from Caltech gave the first of the science talks, and focussed heavily on ChIP-Seq and Whole Transcriptome Shotgun Sequencing (WTSS). But before I get to that, I wanted to mention a few other things.

The first is that Genome BC took a few minutes to announce a really neat funding competition, which really impressed me, the Genome BC Science Opportunities Fund. (There's nothing up on the web page yet, but if you google for it, you'll come across the agenda for Friday's forum in which it's mentioned - I'm sure more will appear soon.) Its whole premise revolves around the question: "Are there experiments that we need to be doing, that are of strategic importance to the BC life science community?" I take that to mean, are there projects that we can't afford not to undertake, that we wouldn't have the funding to do otherwise? I find that to be very flexible, and very non-academic in nature - but quite neat. I hope the funding competition goes well, and I'm looking forward to seeing what they think falls into the "must do" category.

The second was the surprising demand for Bioinformaticians. I'm aware of several jobs for bioinformaticians with experience in next-gen sequencing, but the surprise to me was the number of times (5) I heard people mention that they were actively recruiting. If anyone with next-gen experience is out there looking for a job (post-doc, full time or grad student), drop me a note, and I can probably point you in the right direction.

The third was one of the afternoon talks, on journalism in science, from the perspective of traditional news paper/tv journalists. It seems so foreign to me, yet the talk touched on several interesting points, including the fact that journalists are struggling to come to terms with "new media." (... which doesn't seem particularly new to those of us who have been using the net since the 90's, but I digress.) That gave me several ideas about things I can do with my blog, to bring it out of the simple text format I use now. I guess even those of us who live/breath/sleep internet don't do a great job of harnessing it's power for communicating effectively. Food for though.

Ok... so on to the main topic of tonight's blog: Dr. Wold's talk.

Dr. Wold spoke at length on two topics, ChIP-Seq and Whole Transcriptome Shotgun Sequencing. Since these are the two subject I'm actively working on, I was obviously very interested in hearing what she has to say, though I'll comment more on the ChIP-Seq side of things.

One of the great open questions at the Genome Sciences Centre has been how to do an effective control for a ChIP-Seq experiment. It's not something we've done much of, in the past, but the Wold lab demonstrated why they're necessary, and how to do them well. It seems that ChIP-Seq experiments tend to yield fragments in several genomic regions that have nothing to do with the antibody or experiment itself. The educated guess is that these are caused by hypersensitive sites in the genome that tend to fragment in repeatable patterns, giving rise to peaks that appear in all samples. Indeed, I spend a good portion of this past week talking about observations of peaks exactly like that, and how to "filter" them out of the ChIP-Seq results. I wasn't able to get a good idea of how the Wold lab does this, other than by eye, (which isn't very high throughput), but knowing what needs to be done now, it shouldn't be particularly difficult to incorporate into our next release of the FindPeaks code.

Another smart thing that the Wold lab has done is to separate the interactions of ChIP-Seq into two different types: Type 1 and Type 2, where Type 1 refers to single molecule-DNA binding events, which give rise to sharp peaks, and very clean profiles. These tend be transcription factors like NRSF, or STAT1, upon which the first generation of ChIP-Seq papers were published. Type 2 interactomes tend to be less clear, as they are transcription factors that recruit other elements, or form complexes that bind to the DNA at specific sites, and require other proteins to bind to encourage transcription. My own interpretation is that the number of identifiable binding sites should indicate the type, and thus, if there were three identifiable transcription factor consensus sites lined up, it should be considered a Type 3 interactome, though, that may be simplifying the case tremendously, as there are, undoubtedly, many other proteins that must be recruited before any transcription will take place.

In terms of applications, the members of the wold lab have been using their identified peaks to locate novel binding site motifs. I think this is the first thing everyone thinks of when they hear of ChIP-Seq for the first time, but it's pretty cool to see it in action. (We also do it at the GSC too, I might add.) The neatest thing, however, was that they were able to identify a rather strange binding site, with two halves of a motif, split by a variable distance. I haven't quite figured out how that works, in terms of DNA/Protein structure, but it's conceptually quite neat. They were able to show that the distance between the two halves of the structure vary by 10-20 bases, making it a challenge to identify, for most traditional motif scanners. Nifty.

Another neat thing, which I think everyone knows, but was cool to hear that it's been shown is that the binding sites often line up on areas of high conservation across species. I use that as a test for my own work, but it was good to have it confirmed.

Finally, one of the things Dr. Wold mentioned was that they were interested in using the information in the directionality of reads in their analysis. Oddly enough, this was one of the first problems I worked on in ChIP-Seq, months ago, and discovered several ways to handle it. I enjoyed knowing that there's at least one thing my own ChIP-Seq code does that is unique, and possibly better than the competition. (-;

As for transcriptome work, there were only a couple things that are worth mentioning. The Wold lab seems to be using MAQ and a list of splice junctions assembled from annotated exons to map the transcriptome sequences. I've heard that before, actually, from someone at the GSC who is doing exactly the same thing. It's a small world. I'm not really a fan of the technique, however. Yes, you'll get a lot of the exon junction reads, but you'll only find the ones you're looking for, which is exactly the criticism all the next-gen people throw at the use of micro-arrays. There has got to be a better solution... but I don't yet know what it is. (We thought it was Exonerate, but we can't seem to get it to work well, due to several bugs in the software. It's clearly a work in progress.)

Anyhow, I think I'm going to stop here. I'll just sum it all up by saying it was a pretty good talk, and it's given me lots of things to think about. I'm looking forward to getting back to coding tomorrow.

Labels: Chip-Seq, Sequencing, Solexa/Illumina, Talks, transcriptome

Friday, March 21, 2008

Catching up....

I can't believe it's been nearly a month since my last post! I feel like I've been neglecting this a bit more than I should, but I'll try to rectify that as best I can.

For an indication of how busy I've been, I sat down to update my resume yesterday, and ended up adding 3 papers (all in submission) and two posters. That just about doubles what was in there previously in the papers section.

Anyhow, Next-generation sequencing doesn't stand still, so I thought I'd outline some of the things I want to talk about in my next posts, and set up a few pointers to other resources:

1. SeqAnswers. This aptly named forum has been around for a few months now, but has recently become more popular, and a great forum for discussing relevant Next-gen technology and sequencing methods. I'm especially happy to see the automated posts triggered by new literature on the subject, which are a great resource for those of us who are busy and forget to check for new publications ourselves.

2. There's one forum in particular that's of great interest: Benchmarking different aligners. This appears to be a well done comparison (if lightweight) that may be a good focus for other people who are interested in comparing aligners, and discussing it in a wider forum.

3. For people interested in ChIP-Seq, or Chromatin immunoprecipitation and massively parallel sequencing, I've finally gotten around to posting FindPeaks 3.1 on the web. I'd consider this release (3.1.3) an alpha release. I'd love to get more people contributing by using this application and telling me what could be improved on it, or what enhancements they'd like to see. I'm always happy to discuss new features, and can probably add most of them in with a relatively quick turn around time.

4. For people interested in assessing the quality of the whole transcriptome shotgun sequencing (WTSS), I'm about to break out a tool that should fit that purpose. If anyone is interested in giving suggestions on ways they'd like to see quality tests performed, I'd be more than happy to code those into my applications. (And yes, if you contribute to the tool, I will provide you a copy of the tool to use. Collaborations, etc, can be discussed, as well.)

5. A quick question, of which I'll post more in the future. Has anyone here managed to get Exonerate 2.0 to work in client/server mode on two separate machines?

6. I'll also post a little more about this in the future: building environments, ant and java. Why are people still doing large projects in perl?

7. One last thing I wanted to mention. I was going to write more on this topic, but eh... I'll let slashdot do it for me: The more you drink, the less you publish. Well, So much for keeping a bottle of tequila under the desk. Now I know what to get the competition for x-mas, though...

Cheers!

Labels: Aligners, Code planning, Sequencing, Solexa/Illumina

Monday, February 25, 2008

Transcriptome sequencing.

In an earlier comment, Jason Stajich asked:

What I am most curious about is how people are planning to do the statistics of gene expression comparison from the EST sequencing library approach. It made sense to me for the SAGE approach, but how do you get the overall expression for the gene (really you want the per-transcript numbers). Do you assemble and count the union of all tags across a transcript? Do you normalize that by length of the transcript? Do you only count 3' biased tags?

Though I've been taking my time about answering, its a really good question. I've been working with transcriptomes for a while, now, and have a fair amount of experience with it. I don't want to give away all of my secrets, but I can give a few pointers. If anyone wants to collaborate on something, you know where to find me. (-;

So, first things first, with transcriptome sequencing using Illumina based sequencing, each read you get is presumably from a single molecule of DNA, which presumably came from a single molecule of cDNA, from your library. I can't speak for all of the protocols used by the labs here at the Genome Science Centre, but the results I've seen have shown a very close correlation with expression levels measured by Nimblegen/Affymetrix DNA arrays, and so I tend to believe that the number of tags we're observing per region(eg gene/transcript/exon) are a direct (or nearly direct) measurement of the RNA levels in the general cell population used to create the library.

I should also mention that this is very dependent upon the protocols being used. If your protocol involves amplifying the cDNA with the use of PCR, you're really not going to maintain that relationship. Consult an expert on this subject, if you plan to try this at home. (-;

The other questions Jason asked are not quite as straight forward. We have a protocol here at the GSC that gives pretty darn even coverage across transcripts as a whole, which means that transcript end bias is pretty minimal. That totally negates the need to look at biases, or otherwise. Of course, this comes down to a lot of lab technique (which is totally outside the scope of my post), as it seems to be dependent on following the appropriate protocols. I've seen libraries which are completely skewed, libraries that perfectly follow the transcript outlines, and libraries somewhere in between. As it stands, I now run my tools over each data set as it appears to judge the quality before I ask for more lanes.

So, the short answer is: no, I don't normalize or exclude any data when I deal with transcriptomes, but I'm in the fortunately position of being able to identify (and accept or reject!) which data sets meet a reasonable level of quality before I process them.

Labels: Sequencing, Solexa/Illumina, transcriptome

Tuesday, February 19, 2008

Jason's comment

I received a comment on my posting last night from Jason Stajich, which (as always) is right on the money. To quote the main theme I want to reply to:

While I am all for people writing their own software, writing your own aligner seems more difficult than figuring out how to get what you want from a published (and used format) like ACE assembly files.

That's absolutely correct. I think writing my own aligner is probably the last thing I want to do. While I have the source for at least our own internal aligner on my desktop this week (and have been doing some serious hacking on it), I really don't want to have to build my own. It's a distraction from my real research, it's a pain to support, it's going to have to compete with much larger groups doing similar things, and it's probably not a long-term viable option. That said, I still think the current state of affairs in the realm of aligners is pretty poor - thought that's not to say I think I could do it better. I'll leave writing assemblers to the professionals - or at least those who have more time to do this stuff.

Still, maybe if I can offer one helpful suggestion to people writing aligners. Get together and use a standard format! I'm less inclined to test Yet Another Aligner when I have to write another filter to interpret the results. There are even currently efforts under way to create a unified format for aligned reads and short-read data. (Disclaimer: I've only read a couple handfuls of posts, and not all of it appears to be on topic.)

The formats we're familiar with (such as fasta) aren't really designed for this type of work. (It hasn't stopped several aligners from using them however - which is still better than creating their own file formats, I might add.) What's actually needed is a purpose built format.

I'm sure you can guess from this rant that I'm a big fan of stuff like the OpenDocument Format (ODF), which are levelling the playing field for word processing documents (despite MicroSoft's best efforts to remain the dominant force at all costs), but even a limited (i.e. non-ISO standardized) approach could make a huge impact in this area.

Why not have a two day get together for all the people building aligners, and decide on a subset of formats? Make it versioned, so that it can change over time, and elect a maintainer of the official standard. While you're at it decide on a few conventions - what's a U0 hit? (Slider can have up to 9 SNPs and still call it a U0...) What is a read start? How do you represent stranded genomic information - which end is really the "start"?

Anyhow, I know it's wishful thinking, but hey, maybe this rant will encourage a few people to band together and come up with something to make all of our lives easer. Even something as simple as Fasta freed up thousands (millions?) of man-hours (or woman-hours) for bioinformaticians, since they no longer need a custom format library for every new program they use. Maybe it's time to do the same for aligned data.

P.S. I'll answer Jason's questions on EST/transcriptome approaches in my next post.

Labels: Aligners, Sequencing, short-reads, standards

Saturday, February 9, 2008

Pacific Biotech new sequencing technology

I have some breaking news. I doubt I'm the first to blog this, but I find it absolutely amazing, so I had to share.

Steve Turner from Pacific Biosciences (PacBio), just gave the final talk of the AGBT session, and it was damn impressive. They have a completely new method of doing sequencing that uses DNA polymerase as a sequencing engine. Most impressively, they've completed their proof of concept, and they presented data from it in the session.

The method is called Single Molecule Real Time (SMRT) sequencing. It's capable of producing 5000-25,000 base pair reads, at a rate of 10 bases/second. (They apparently have 25bps techniques in development, and expect to release when they have 50bps working!)

The machinery has zero moving part, and once everything is in place, they anticipate that they'll have a sequencing rate of greater than 100 Gb per hour! As they are proud to mention, that's about a full draft genome for a human being in 15 minutes, and at a cost of about $100. Holy crap!

Labels: Bioinformatics, new technology, Pacific Biosciences, Sequencing, SMRT

Wednesday, January 30, 2008

Comments on de novo assembly

First off, I'd like to say thanks to Paul and stajich for their comments. Paul for raising several points that I'd like to discuss tonight, and stajich for bringing my attention to the SHRiMP aligner. Obviously, I haven't had much chance to look at SHRiMP, yet, but I'll definitely get around to it.

So, paul mentioned several things that are worth discussing:

The Velvet roadmap for mRNA could be interesting. Sometimes the intron is snipped, sometimes it isn't, get a nice little bubble. And the similar transcripts will do... interesting things.

Short reads would be fine for de-novo if DNA was completely random. Pity that it isn't.

For the most part, I agree with Paul, Using Velvet on a transcriptome would be pretty cool. Yes, you'd get bubbles, but you'd get a lot of nifty information about new transcripts, or intermediate forms of transcripts. This neatly avoids one of the major hurdles I currently face when working with transcriptomes: I can either align against a genome, or a list of known genes, and neither one really gives me much opportunity to identify new genes. (Of course, I'm working on doing just that, with a colleague of mine, but I'm saving that for another post.)

Where I disagree with paul, however, is his final statement, that short reads would be fine for de novo assembly if they were truly random. I have two problems with that: The first is mapability (or sequencability), and the second is the short nature of the reads themselves.

Mappability has been defined in many different ways, but for general purposes, it's the ability to identify a sequenced read that can be unambiguously aligned to that location on a chromosome. Fortunately, with 36-mer reads, as are produced by Illumina 1G's, something like 70-80% of the genome is mappable. Not 100% mappable, but mappable to some degree. This may seem trivial, but it's important.

Why it's important is that a genome with 95% mapability doesn't mean you get a single contig covering 95% of the genome, and a chunk of 5% of your reads that don't cover your genome. It's more like every 20-100kb you'll find something that you just can't assemble over. (I'm guestimating that number, by the way.) This means you have lots of small contigs that have no overlap. That is, of course, assuming you had enough highly mappable reads to do a relatively error free assembly, and, also of course, assuming your sequencing errors haven't completely interfered with your assembly. I don't think either can really be taken for granted.

In reality, the mappability of a genome isn't a property you can figure out until you've sequenced it, so I don't want to discourage anyone from trying. I'm aware of people working on using Illumina reads to do this, albeit they've supplemented the reads with BAC sequencing, ESTs and various other options, which will provide a nice scaffolding. This approach allows them to augment their sequencing power by obtaining a lot of coverage through the Illumina runs, rather than having to use them as the basis of a serious de novo assembly - which seems wise to me. (And even then, they're only doing a fungus - not a higher vertebrate!)

And to return to my earlier point about the mappable genome not being 100% mappable, this is where I think paul's point over simplifies. Although some bases may be 50% mappable, and are thus considered to be "mappable" in common discussion, that means that 50% of the possible sequencing reads in which they're likely to participate will not yield an un-ambiguous fragment! That means you can say goodbye to 50% of the likely fragments which you would need to generate an overlap to span two forming contigs. That, to me, indicates that any de novo assembly is unlikely to correctly proceed past that base, and 50% mappability is not unusual in the genome.

The other point of paul's that I wanted to discuss, was the assertion that the non-random fragmenting of DNA is what's stopping us from doing a serious de novo assembly. While it's true that shearing DNA isn't an entirely random process, it's also true that it doesn't need to be. People have been doing restriction enzyme digests for decades now, in order to map vectors and inserts. (I learned how to do it in grade 11 biology, and that was back in 1994). So yes, while sonication or digests may not be random, what's to stop someone from stripping DNA of it's histones, and then doing 25 different digests? The net effect is just about the same (assuming you pick 25 good restriction enzymes with different base recognitions), and will yield fragments that will get around paul's issue. But does that mean we can now do a de novo assembly?

No... I don't think so. I doubt the lack of a random fragmentation pattern is the limiting factor in de novo assemblies. Unfortunately, the only way to prove myself or paul right or wrong is to simulate this: Take the mouse genome, fragment it into random 36-mers (the same size you get from an Illumina sequencing run), then inject 1 random base for every 1000 bases read (I'll be generous and assume a .1% error rate, though the real thing is closer to 3-5% from what I've seen), and then try running velvet on it.

I bet you'll observe that somewhere around 40x coverage you'll start to saturate and discover that your assembly has covered anywhere from 60-80% of the genome (say 5% of that is just way wrong), and that it'll have taken you longer to do that assembly than it would have to just sequenced the damned thing in the first place with PCR.

Anyhow, I take paul's point to heart: We're still early on in this game, and there are a lot of neat things we can try. I'm looking forward to seeing the results of many of them. (Who has time to try them all?)

By the way, FindPeaks 3.0.1 is done, and I'll be presenting it as a poster at AGBT 2008 this week. If you're interested in ChIP-Seq/Chip-Solexa (or for the pedantic, ChIP-Illumina), come find me, and I'll tell you some of its new features.

Labels: Aligners, Bioinformatics, Chip-Seq, DNA, Sequencing, Solexa/Illumina

Tuesday, January 22, 2008

Solexa Transcriptom Shotgun:Transcriptome alignments vs. Genome Alignments

First off, I was up late trying to finish off one of my many projects, so I didn't get a lot of sleep. Thus, if my writing is more incoherent than usual, that's probably why. And now on with the show.

I'm jumping gears a bit, today. I haven't finished my discussion about Aligners, of which I still want to talk about Exonerate in detail, and then discuss some of the other aligners in overview. (For instance, the one that I found out about today, called GMAP, a Genomic Mapping and Alignment Program for mRNA and EST sequences.) Anyhow, the point is that part of the purpose of using an aligner is to align to something in particular, such as a genome or a contig, but selecting what you align your sequences back to is a big issues.

When you're re-sequencing a genome, you map back to the genome template. Yes, you're probably sequencing a different individual, so you'll find occasional sections that don't match, but most humans are ~99% identical, and you can look into SNP (single nucleotide polymorphism) databases to find out if the differences you're seeing are already commonly known changes. Of course, if you're re-sequencing Craig Venter, you don't need to worry about SNPs as much. Fortunately, most of us are sequencing more exciting genomes and so forth.

When you're sequencing a genome you've never sequenced before, you can't do mapping at all. There are various assemblers (i.e., Velvet (written by Daniel Zerbino, who is a lot of fun to hang out with at conferences, I might add... ), SSake (written by Rene Warren, who incidentally also works at the GSC, although several floors above me.), and Euler (which I'd never heard of till I googled the web page for velvet...). The key point: you don't need to worry about what you're mapping back to when you do de novo assembly, since you're creating your own map. I'll digress further for one brief comment: assembly from Solexa/Illumina sequences is a bad idea, because they're so short!

Moving right along, we come to the third thing people are sequencing these days: Transcriptomes. (Yes, I'm ignoring cloned libraries... they're so 90's!) Transriptomes are essentially a snapshot of the mRNA in a set of cells at a given point in time. Of course, mRNA is rather unstable, so protocols have been developed to convert mRNA to cDNA (complementary DNA), which is essentially a copy of the mRNA in DNA form. (Yes, I'm ignoring some complexity here, because it makes for a better story.) But I'm getting ahead of myself. Lets talk about the mRNA, but be aware that the sequencing is actually done on cDNA.

mRNA is an interesting beast. Unlike Genomic DNA, it's a more refined creature. For Eukaryotes, the mRNA is processed by the cell, removing some segments that are non-coding. Once the cell removes these segments (labeled introns), and leaves other segments (labeled exons), we have a sequence of bases that no longer matches the genomic DNA sequence from which it came. Sure, there are short segments that map back very well (i.e. the exons), but if you take a small random snippet from the mRNA, there's a small chance that it might overlap the boundaries between two exons, which means the bases you have won't map back nicely to the genomic point of origin. That can be a serious problem.

Sure, you say, we can do a gapped alignment, and try to find two points where this sequence originated, with one big gap. If you're sophisticated, you'll even know that introns often have signals that indicate their presence most of the time. And yes, you'd be right, we can do that. Unfortunately, for most solexa runs, you get 20,000,000+ sequences. At 10 seconds a sequence (which doesn't seem like much, really), how long would it take to do that alignment?

Too long.

So most of the time, we don't do gapped alignments. Instead, we have two choices:

Align against the genome, and throw away reads that we can't align (i.e. those that over lap intron/exon boundaries.)

Align against a collection of known coding DNA sequences

Number two isn't a bad option: it already has all the introns spliced out, so you don't need to worry about missing those alignments. Unfortunately, there are several issues with this approach:

Do we really know all of the coding DNA sequences? For most species, probably not, but this is a great idea for animals like Drosophila. (I tried this yesterday on a fruit fly Illumina run and it worked VERY well.

Many transcripts are very similar. This isn't a problem with the data, but with your alignment program. If your aligner doesn't handle multi-matches (like Eland), this will never work.

Many transcripts are very similar. Did I say that already? Actually, it causes more problems. How do you know which transcript was really the source of the sequence? I have several ways to get around it, but nothing foolproof yet.

Aligning to a transcript is hard to visualize. This is going to be one of my next projects... but with all the fantastic genomes out there, I'm still not aware of a good transcriptome visualization tool.

And that brings us to the conclusion. Aligning a transcriptome run against a genome or against a transcriptome both have serious problems, and there really are no good solutions for this yet.

For now, all I can do is run both: they tell you very different things, but both have fantastic potential. I haven't released my code for either one, yet, but they both work, and if you contact my supervisor, he'd probably be interested in collaborating, or possibly sharing the code.

Labels: Aligners, Bioinformatics, DNA, Grad School, Sequencing, Solexa/Illumina

Thursday, January 17, 2008

Aligning DNA - Eland

Continuing in my series of blog articles on short-read sequencing, it's time for a little bit of a discussion on the first short-sequence aligner that I'd ever met: Eland. The problem with writing anything about Eland, is that 90% of the information I have on it can't be confirmed independently. In fact, the majority of what I know comes from the header of the ELAND.cpp file. I'll admit, while I've gone through the source code a couple of times, I haven't spent much time getting to know it's well - I've processed billions of lines of Eland files, but have yet to run it, myself...

Anyhow, here's what I can tell you: Eland stands for "Efficient Local Alignment of Nucleotide Data" and was written by Anthony J. Cox for Solexa Ltd. Of course, Solexa has since been bought out by Illumina, so I assume the copyright has been transferred along with the technology during that transaction.

What makes Eland so fascinating to me is that it's a direct departure from the dynamic programming based algorithms (like the venerable Smith-Waterman alignment), making it somewhat interesting as a computational program.

The basic algorithm works along these lines: Given a sequence of length N, it can be divided into four subsequences (A, B, C, and D), which are of equal (or nearly equal length). Assuming there are no more than 2 errors, at least two of these subsequences will be "error free", so that the two error free sequences can then be searched for in a database containing all possible subsequences in the genome of interest. Thus, you can search your database for the subsequence AB and CD. Of course, you don't know which subsequences are the error free ones, so you need to try all possible combination.

What Eland does to speed things up is to combine these subsequences into sets of two. Thus, rather than searching for 4 independent entries in their database, they simply have to search for 2 sets of two, and as long as one set matches, you can use looser matching criteria on the other two, ie, allowing for mismatches. That is to say, if you make two sequences out of the 4 subsequences {AB and CD}, you can search for an exact match for AB, and test all the possible results for mismatches in the {CD} portion of the sequence. This has the effect of significantly reducing the search space. (Only sequences containing the exact match to {AB}.)

Searching for the {AB and CD} subsequences would only work if the first half of your sequence has no errors. What if B and C had the errors? The simple answer is to shuffle your subsequences to make other combinations: ({AB and CD}, {AC and BD}, {AD and CD}, {BA and CD}, etc.) This still provides you with a relatively small search space for each sequence, as there are only 4! possible combinations (which is 4x3x2x1 = 24 possible sequences) to search for.

Fortunately, you can bound this even further. You always know that you want sequences in which the first pair and second pair are in the correct order, (ie {AB and CD} and {AC and BD} are ok, but {CA and DB} and {BA and DC} would give you an incorrect result) limiting you to only six possible combinations, which still allows you to find any correct match where at least two of the four subsequences are error free.

That, in general is how it works.. with a few nifty shortcuts. ie, they only need to pass over the genome search space 3 times to find all possible matches with up to two errors, because they can search for subsequence combinations simultaneously and efficiently. In other words, you get a very quick alignment.

On the other hand, the way in which ELAND was implemented is very important. The Eland aligner has some severe restrictions, some of which are algorithm specific, some of which are implementation specific:

The Eland aligner can only align sequences up to 32 base pairs long. (Implementation specific: they use four integers to represent the subsequences {A,B,C and D}, which reduces the memory requirements.)

The Eland aligner can only align sequences with up to two mismatches. (Algorithm specific: at least two of the keys must match, although the algorithm could be expanded to use more subsequences, at the expense of expanding the sequence space.)

The Eland aligner only returns sequences that match to one unique location in the genome (Implementation specific: Eland already tells you when a sequence matches to more than one location, it just doesn't tell you what those locations are.)

The Eland aligner can do mismatches in the sequence, but can not handle insertions or deletions (i.e, gapped alignments). (Algorithm specific: The algorithm for matching the keys could be expanded to include this, but it would be very costly in time/memory.)

The code is not freely available. (Implementation specific: I don't know of anywhere you can go to get the code... or documentation!)

Changing the length of the alignment sequence (I called it N, above), requires a recompile. (Implementation, obviously. Why this isn't a command line parameter, I don't know.)

I'm sure I could go on and list other disadvantages, but I'd like to point out the main advantage: It's FAST. Blazing fast, compared to the other aligners. From my own personal experience, Eland may only be able to map 60-70% of your sequences to the genome (compared to 70-80% for other aligners with which I have less experience), but it does so in a few hours, compared to several days or weeks for most other aligners (or years, compared to Blast). (Comparisons are for a single processor, matching 1 million+ reads to a mamalian size genome.) The only other aligner that can come close to this speed is, to the best of my knowledge, SXOligoSearch, produced by Synamatix, though that beast will cost upwards of $225,000CAD to license.

Anyhow, to close this discussion, it's worth mentioning that rumours of a new version of Eland have been circulating since March 2007. Beta versions, capable of performing searches on sequences larger than 32 base pairs in length and able to perform multi-matches, were supposed to be released in September 2007. Unfortunately, the new betas appear to be vapourware, which has prompted us to look into several other aligners, which I'll discuss in future posts.

(2008-01-28 Update: I stand corrected: I've heard that the new betas have finally arrived! I haven't seen what they can do yet, but at least we know my prediction that they're vapourware is wrong.)

Labels: Aligners, DNA, Sequencing, Solexa/Illumina

Wednesday, January 9, 2008

Aligning DNA - a quick intro to some of the tools

A long time ago, I mentioned that I'd use this blog for some science. I don't really know who my audience is, but I suspect it's 1 part people who google in for linux related key words, 1 part people who are googling me because they've read my name somewhere, and 1 part random people... oh, and not to forget those people who I actually know in real life, but don't get to see often. Since I'm unlikely to scare of the googlers or the random people with some science, and the people I know in person will just skip over this posting (that's a hint, if you haven't already stopped reading...), I'll just go ahead with a quick lesson in Genomics.

As I've mentioned to many people, my research is currently heavily embedded in writing code for the processing of Solexa/Illumina reads. For those who aren't in the know, the Illumina 1G is a DNA sequencing machine that can produce ~40 Million DNA sequencing reactions in about 36 hours, where each sequence is 36 bases long. This is drastically different from traditional sequencing, where you get far fewer sequences (~100's), each of which can be up to 1000 bases long. With the low volume of alignments produced, and the long reads, traditional sequencing can be processed with a rather laisez-faire attitute. if you take 10 seconds to align 100 reads, you're talking about 1000 seconds to do all the alignments, or roughly 15 minutes.

In contrast, with 40 million reads, at ten seconds each, you're talking about waiting 12 years for your data to be processed. Clearly this is not an option, if you want to get into a cutting edge journal.

(For those who are curious to know where the 10 second mark came from, my supervisor timed a few blast searches on a local server.)

Consequently, computational scientists have risen to the task, and created several new algorithms for performing these alignments. The first one that I met is called "Eland", which I believe stands for "Efficient Local Alignment of Nucleotide Data". To be honest, I'm not really sure how to get Eland, as there is very little information available on the web. I've gleaned a few scraps of information: It was written by Anthony Cox, and was distributed by Solexa/Illumina. As far as I can tell, it came with the Illumina 1G machines.

The second one I met was called "Mosaik". This aligner comes from the lab of Gabor Marth, at boston college. The third was "Exonerate", followed by several others. Hands down, the best name for an aligner has got to be "MAPASS"... I'll let you ponder that for a few minutes.

Anyhow, each alginer has it's advantages and disadvantages. Eland, for instance is crippled by a algorithmic limitation to only ever being able to align the first 32 bases of a sequence, and it's inability to map a read to more than one location in a target DNA source (i.e genome or transcriptome). On the other hand, it's one of the fastest aligners out there.

Mosaik, on the other hand, is a bit slower, but has several nice features - and is able to handle the cases that Eland can't. On the other hand, it dumps out it's alignments to a file format that really isn't convenient for doing any further processing, which is a product of it's original use in sequence assembly.

Just to throw in one more curve, it's worth mentioning a competing proprietary piece of software. There's a company in Malaysia that has what appears to be the fastest aligner out there, with none of the limitations of Eland, and a flexible file output. Like all commercial products for the science market, they set the base price of their product outside of the reach of most consumers: I can't seem to find a good justification to get my supervisor to pay $200,000+ USD/CAD for a copy of their software. (You could hire four post-docs for a year for that... or 10 grad students.)

Anyhow, now you know who the players are. My next post on this topic will be to introduce some of them in detail, explain how they work, and then tell you how we use them.

Yes, I'm feeling ambitious: I got three pieces of software working today: two of which are likely to be in production in the next couple of weeks at the Genome Sciences Centre: One for ChIP-Seq experiments (FindPeaks 3.0) and the other for transcriptome processing for mammalian cells (yet to be named.)