Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: - Please come visit my blog there.

Thursday, January 14, 2010

How to be a better Programmer: Tactics.

I'm a bit too busy for a long post, but a link was circulating around the office that I thought was worth passing on to any bioinformaticians out there.

The article above is on how to be a better programmer - and I wholeheartedly agree with what the author proposed, with one caveat that I'll get to in a minute. The point of the the article is that learning to see the big picture (not specific skills) will make you a better programmer. In fact, this is the same advice Sun Tzu gives in "The Art of War", where understanding the terrain, the enemy, etc are the tools you need to be a better general. [This would be in contrast to learning how to wield each weapon, which would only make you a better warrior.] Frankly, it's good advice, and this leads you down the path towards good planning and clear thinking - the keys to success in most fields.

The caveat, however, is that there are times in your life where this is the wrong approach: ie. grad school. As a grad student, your goal isn't to be great at everything you touch - it's to specialize in some small corner of one field, and tactics are no help here. If grad school existed for Ninjas, the average student would walk out being the best (pick one of: poisoner/dart thrower/wall climber/etc) in the world - and likely knowing little or nothing about how to be a real ninja beyond what they learned in their Ninja undergrad. Tactics are never a bad investment, but they aren't always what is being asked of you.

Anyhow, I plan to take the advice in the article and to keep studying the tactics of bioinformatics in my spare time, even though my daily work is more on the details and implementation side of it. There are a few links in the comments of the original article to sites the author believes are good comp-sci tactics... I'll definitely be looking into those tonight. Besides, when it comes down to it, the tactics are really the fun parts of the problems, although there is also something to be said for getting your code working correctly and efficiently.... which I'd better get back to. (=

Happy coding!

Labels: , , , , ,

Thursday, December 17, 2009

One lane is (still) not enough...

After my quick post yesterday where I said one lane isn't enough, I was asked to elaborate a bit more, if I could. Well, I don't want do get into the details of the experiment itself, but I'm happy to jump into the "controls" a bit more in depth.

What I can tell is that with one lane of RNA-Seq (Illumina data50bp), all of the variations I find show up either in known polymorphism database or as somatic SNPs, with a few exceptions. The few exceptions just turn out to be exceptions for lack of coverage.

For a "control", I took two data sets (from two separate patients) - each with 6 individual lanes of sequencing data. (I realize this isn't the most robust experiment, but it shows a point.) In the perfect world, each of the 6 lanes per person would have sampled the original library equally well.

So, I matched up one lane from each patient into 6 sets and asked the question: How many transcripts are void (less than 5 tags) in one sample and at least 5x greater in the other sample. (I did this in both directions.)

The results aren't great. In one direction, I see an average of 1245 Transcripts (about 680 genes, so there's some overlap amongst the transcript set) with a std dev. of 38 Transcripts. That sounds pretty consistent, till you look for the overlap in actual transcripts: avg 27.3 with a std dev of 17.4. (range 0-60). And, when with do the calculations, the most closely matched data sets only have a 5% overlap.

The results for the opposite direction were similar: Average of 277 transcripts found that met the criteria ( of 33.61), with an average overlap between data sets being 4.8, std. dev 4.48. (range of 0-11 transcripts in common.) The best overlap in "upregulated" genes for this dataset was just over 4% concordance with a second pair of lanes.

So, what this tells me (for a VERY dirty experiment) is that expression of genes in one lane is highly variable depending on the lane for genes expressed at the low end. (Sampling at the high end usually pretty good, so I'm not too concerned about that.)

What I haven't answered yet is how many lanes is enough. Alas, I have to go do some volunteering, so that experiment will have to wait for another day. And, of course, the images I created along the way will have to follow later as well.

Labels: , , , ,

Thursday, November 12, 2009

Go from Google...

Just a short post, since I'm actually (although you probably can't tell) rather busy today. However, I'm absolutely fascinated by Google's new language, Go. It's taken the best from just about every existing language out there, and appears so clean!

I'm currently watching Google's talk on it, while I write... I'm only a few minutes in, but it seems pretty good. Watching this seriously makes me want to start a new bio-go project... so nifty!

Labels: , ,

Friday, November 6, 2009

ChIP-Seq normalization.

I've spent a lot of time working on ChIP-Seq controls recently, and wanted to raise an interesting point that I haven't seen addressed much: How to normalize well. (I don't claim to have read ALL of the chip-seq literature, and someone may have already beaten me to the punch... but I'm not aware of anything published on this yet.)

The question of normalization occurs as soon as you raise the issue of controls or comparing any two samples. You have to take it in to account when doing any type of comparision, really, so it's somewhat important as the backbone to any good second-gen work.

The most common thing I've heard to date is to simply normalize by the number of tags in each data set. As far as I'm concerned, that really will only work when your data sets come from the same library, or two highly correlated samples - when nearly all of your tags come from the same locations.

However, this method fails as soon as you move into doing a null control.

Imagine you have two samples, one is your null control, with the "background" sequences in it. When you seqeunce, you get ~6M tags, all of which represent noise. The other is ChIP-Seq, so some background plus an enriched signal. When you sequence, hopefully you sequence 90% of your signal, and 10% of the background to get ~8M tags - of which ~.8M are noise. When you do a compare, the number of tags isn't quite doing justice to the relationship between the two samples.

So what's the real answer? Actually, I'm not sure - but I've come up with two different methods of doing controls in FindPeaks: One where you normalize by identifying a (symmetrical) linear regression through points that are found in both samples, the other by identifying the points that appear in both samples and summing up their peak heights. Oddly enough, they both work well, but in different scenarios. And clearly, both appear (so far) to work better than just assuming the number of tags is a good normalization ratio.

More interesting, yet, is that the normalization seems to change dramatically between chromosomes (as does the number of mapping reads), which leads you to ask why that might be. Unfortunately, I'm really not sure why it is. Why should one chromosome be over-represented in an "input dna" control?

Either way, I don't think any of us are getting to the bottom of the rabbit hole of doing comparisons or good controls yet. On the bright side, however, we've come a LONG way from just assuming peak heights should fall into a nice Poisson distribution!

Labels: , , , ,

Wednesday, November 4, 2009

New ChIP-seq control

Ok, so I've finally implemented and debugged a second type of control in FindPeaks... It's different, and it seems to be more sensitive, requiring less assumptions to be made about the data set itself.

What it needs, now, is some testing. Is anyone out there willing to try a novel form of control on a dataset that they have? (I won't promise it's flawless, but hey, it's open source, and I'm willing to bug fix anything people find.)

If you do, let me know, and I'll tell you how to activate it. Let the testing begin!

Labels: , , , , ,

Wednesday, October 14, 2009

Useful error messages.... and another format rant.

I'll start with the error message, since it had me laughing, while everything else seems to have the opposite reaction.

I sent a query to Biomart the other day, as I often do. Most of the time, I get back my results quickly, and have no problems whatsoever. It's one of my "go-to" sites for useful genomic data. Unfortunately, every time I tried to download the results of my query, I'd get 2-3Mb into the file before the download would die. (It was a LONG list of snps, and the file size was supposed to be in the 10Mb ballpark.)

Anyhow, in frustration, I tried the "email results to you" option, whereupon I got the following email message:

Your results file FAILED.
Here is the reason why:
Error during query execution: Server shutdown in progress

That has to be the first time I've ever had a server shutdown cause a result failure. Ok, it's not that funny, but I am left wondering if that was the cause of the other 10 or so aborted downloads. Anyone know if Biomart runs on Microsoft products? (-;

The other thing on my mind this afternoon is that I am still looking to see my first Variant Call Format file for SNPs. A while back, I was optimistic about seeing the VCF files in the real world. Not that I can complain, but I thought adoption would be a little faster. A uniform SNP format would make my life much more enjoyable - I now have 7 different SNP format iterators to maintain, and would love to drop most of them.

What surprised me, upon further investigation, is that I'm also unable to find a utility that actually creates VCF files from .map, SAM/BAM, eland, bowtie or even pileup files. I know of only one SNP caller that creates VCF compatible files, and unfortunately, it's not freely available, which is somewhat un-helpful. (I don't know when or if it will be available, although I've heard rumours about it being put into our pipeline...)

That's kind of a sad state of affairs - although I really shouldn't complain. I have more than enough work on my plate, and I'm sure the same can be said for those who are actively maintaining SNP callers.

In the meantime, I'll just have to sit here and be patient... and maybe write an 8th snp format iterator.

Labels: , , , , , ,

Monday, October 5, 2009

Why peak calling is painful.

In discussing my work, I'm often asked how hard it is to write a peak calling algorithm. The answer usually surprises people: It's trivial. Peak calling itself isn't hard. However, there are plenty of pitfalls that can surprise the unwary. (I've found myself in a few holes along the way, which have been somewhat challenging to get out of.)

The pitfalls, when they do show up, can be very painful - masking the triviality of the situation.

In reality, the three most frustrating things that occur in peak calling:
  1. Maintaining the software

  2. Peak calling without unlimited resources eg, 64Gb RAM

  3. Keeping on the cutting edge

On the whole, each of these things is a separate software design issue worthy of a couple of seconds of discussion.

When it comes to building software, it's really easy to fire up a "one-off" script. Anyone can write something that can be tossed aside when they're done with it - but code re-use and recycling is a skill. (And an important one.) Writing your peak finder to be modular is a lot of work, and a huge amount of time investment is required to keep the modules in good shape as the code grows. A good example of why this is important can be illustrated with file formats. Since the first version of FindPeaks, we've transitioned through two versions of Eland output, Maq's .map format and now on to SAM and BAM (but not excluding BED, GFF, and several other more or less obscure formats). In each case, we've been able to simply write a new iterator and plug it into the existing modular infrastructure. In fact, SAM support was added in quite rapidly by Tim with only a few hours of investment. That wouldn't have been possible without the massive upfront investment in good modularity.

The second pitfall is memory consumption - and this is somewhat more technical. When dealing with sequencing reads, you're faced with a couple of choices: you either sort the reads and then move along the reads one at a time, determining where they land - OR - you can pre-load all the reads, then move along the chromosome. The first model takes very little memory, but requires a significant amount of pre-processing, which I'll come back to in a moment. The second requires much less cpu time - but is intensely memory thirsty.

If you want to visualize this, the first method is to organize all of your reads by position, then to walk down the length of the chromosome with a moving window, only caring about the reads that fall into the window at any given point in time. This is how FindPeaks works now. The second is to build a model of the chromosome, much like a "pileup" file, which then can be processed however you like. (This is how I do SNP calling.) In theory, it shouldn't matter which one you do, as long as all your reads can be sorted correctly. The first can usually be run with a limited amount of memory, depending on the memory strucutures you use, whereas the second pretty much is determined by the size of the chromosomes you're using (multiplied by a constant that also depends on the structures you use.)

Unfortunately, using the first method isn't always as easy as you might expect. For instance, when doing alignments with transcriptomes (or indels), you often have gapped reads. An early solution to this in FindPeaks was to break each portion of the read into separate aligned reads, and process them individually - which works well when correctly sorted. Unfortunately, new formats no longer allow that - using a "pre-sorted" bam/sam file, you can now find multi-part reads, but there's no real option of pre-fragmenting those reads and re-sorting. Thus, FindPeaks now has an additional layer that must read ahead and buffer sam reads in order to make sure that the next one returned is the correct order. (You can get odd bugs, otherwise, and yes, there are many other potential solutions.)

Moving along to the last pitfall, the one thing that people want out of a peak finder is that it is able to do the latest and greatest methods - and do it ahead of everyone else. That on it's own is a near impossible task. To keep a peak finder relevant, you not only need to implement what everyone else is doing, but also do things that they're not. For a group of 30 people, that's probably not too hard, but for academic peak callers, that can be a challenge - particularly since every use wants something subtly different than the next.

So, when people ask how hard it is to write their own peak caller, that's the answer I give: It's trivial - but a lot of hard work. It's rewarding, educational and cool, but it's a lot of work.

Ok, so is everyone ready to write their own peak caller now? (-;

Labels: , , , , , , ,

Tuesday, August 18, 2009

new repository of second generation software

I finally have a good resource for locating second gen (next gen) sequencing analysis software. For a long time, people have just been collecting it on a single thread in the bioinformatics section of the forum, however, the brilliant people at SeqAnswers have spawned off a wiki for it, with an easy to use form. I highly recommend you check it out, and possibly even add your own package.

Labels: , , , , , , , , , , , ,

Monday, August 17, 2009

SNP Datatabase v0.1

Good news, my snp database seems to be in good form, and is ready for importing SNPs. For people who are interested, you can download the Vancouver Short Read Package from SVN, and find the relevant information in

There's a schema for setting up the tables and indexes, as well as applications for running imports from maq SNP calls and running a SNP caller on any form of alignment supported by FindPeaks (maq, eland, etc...).

At this point, there are no documents on how to use the software, since that's the plan for this afternoon, and I'm assuming everyone who uses this already has access to a postgresql database (aka, a simple ubuntu + psql setup.)

But, I'm ready to start getting feature requests, requests for new SNP formats and schema changes.

Anyone who's interested in joining onto this project, I'm only a few hours away from having some neat toys to play with!

Labels: , , , , , , , , , ,

Saturday, August 15, 2009

What would you do with 10kbp reads?

I just caught a tweet about an article on the Pathogens blog (What can you do with 1000 base pair reads?), which is specifically about 454 reads. Personally, I'm not so interested in 454 reads - the technology is good, but I don't have access to 454 data, so it's somewhat irrelevant to me. (Not to say 1kbp reads isn't neat, but no one has volunteered to pass me 454 data in a long time...)

So, anyhow, I'm trying to think two steps ahead. 2010 is supposed to be the year that Pacific Biosciences (and other companies) release the next generation of sequencing technologies - which will undoubtedly be longer than 1k. (I seem to recall hearing that PacBio has 10k+ reads.- UPDATE: I found a reference.) So to heck with 1kbp reads, this raises the real question: What would you do with a 10,000bp read? And, equally important, how do you work with a 10kbp read?
  • What software do you have now that can deal with 10k reads?
  • Will you align or assemble with a 10k read?
  • What experiments will you be able to do with a 10k read?
Frankly, I suspect that nothing we're currently using will work well with them - we'll all have to go back to the drawing board and rework the algorithms we use.

So, what do you think?

Labels: , , , ,

Thursday, August 13, 2009

Ridiculous Bioinformatics

I think I've finally figured out why bioinformatics is so ridiculous. It took me a while to figure this one out, and I'm still not sure if I believe it, but let me explain to you and see what you think.

The major problem is that bioinformatics isn't a single field, rather, it's the combination of (on a good day) biology and computer science. Each field on it's own is a complete subject that can take years to master. You have to respect the biologist who can rattle off the biochemicals pathway chart and then extrapolate that to the annotations of a genome to find interesting features of a new organism. Likewise, theres some serious respect due to the programmer who can optimize code down at the assembly level to give you incredible speed while still using half the amount of memory you initially expected to use. It's pretty rare to find someone capable of both, although I know a few who can pull it off.

Of course, each field on it's own has some "fudge factors" working against you in your quest for simplicity.

Biologists don't actually know the mechanisms and chemistry of all the enzymes they deal with - they are usually putting forward their best guesses, which lead them to new discoveries. Biology can effectively be summed us as "reverse engineering the living part of the universe", and we're far from having all the details worked out.

Computer Science, on the other hand, has an astounding amount of complexity layered over every task, with a plethora of languages and system, each with their own "gotchas" (are your arrays zero based or 1 based? how does your operating system handle wild cards at the command line? what does your text editor do to gene names like "Sep9") leading to absolute confusion for the novice programmer.

In a similar manner, we can also think about probabilities of encountering these pitfalls. If you have two independent events, and each of which has a distinct probability attached, you can multiply the probabilities to determine the likelihood of both events occurring simultaneously.

So, after all that, I'd like to propose "Fejes' law of interdisciplinary research"

The likelihood of achieving flawless work in an interdisciplinary research project is the product of the likelihood of achieving flawless work in each independent area.

That is to say, that if your biology experiments (on average) are free of mistakes 85% of the time, and your programming is free of bugs 90% of the time. (eg, you get the right answers), your likely hood of getting the right answer in a bioinformatics project is:
Fp = Flawless work in Programming
Fb = Flawless work in Biology
Fbp = Flawless work in Bioinformatics

Thus, according to Fejes' law:
Fb x Fp = Fbp

and the example given:
0.90 x 0.85 = 0.765

Thus, even an outstanding programmer and bioinformatician will struggle to get an extremely high rate of flawless results.

Fortunately, there's one saving grace to all of this: The magnitude of the errors is not taken into account. If the bug in the code is tiny, and has no impact on the conclusion, then that's hardly earth shattering, or if the biology measurements have just a small margin of error, it's not going to change the interpretation.

So there you have it, bioinformticians. if i haven't just scared you off of ever publishing anything again, you now know what you need to do...

Unit tests, anyone?

Labels: , , , ,

Tuesday, August 11, 2009

SNP/SNV callers minimum useful information

Ok, I sent a tweet about it, but it didn't solve the frustration I feel on the subject of SNP/SNV callers. There are so many of them out there that you'd think they grow on trees. (Actually, they grow on arrays...) I've written one, myself, and I know there are at least 3 others written at the GSC.

Anyhow, At first sight, what pisses me off is that there's no standard format. Frankly, that's not even the big problem, however. What's really underlying that problem is that there's no standard "minimum information" content being produced by the SNP/SNV callers. Many of them give a bare minimum information, but lack the details needed to really evaluate the information.

So, here's what I propose. If you're going to write a SNP or SNV caller, make sure your called variations contain the following fields:
  • chromosome: obviously the coordinate to find the location
  • position: the base position on the chromo
  • genome: the version of the genome against which the snp was called (eg. hg18 vs. hg19)
  • canonical: what you expect to see at that position. (Invaluable for error checking!)
  • observed: what you did see at that position
  • coverage: the depth at that position (filtered or otherwise)
  • canonical_obs: how many times you saw the canonical base (key to evaluating what's at that position
  • variation_obs: how many times you saw the variation
  • quality: give me something to work with here - a confidence value between 0 and 1 would be ideal... but lets pick something we compare across data sets. Giving me 9 values and asking me to figure something out is cheating. Sheesh!
Really, most of the callers out there give you most, if not all of it - but I have yet to see the final "quality" being given. The MAQ SNP caller (which is pretty good) asks you to look at several different fields and make up your own mind. That's fine for a first generation, but maybe I can convince people that we can do better in the second gen snp callers.

Ok, now I've got that off my chest! Phew.

Labels: , , , ,

Thursday, August 6, 2009

New Project Time... variation database

I don't know if anyone out there is interested in joining in - I'm starting to work on a database that will allow me to store all of the snps/variations that arise in any data set collected at the institution. (Or the subset to which I have the right to harvest snps, anyhow.) This will be part of the Vancouver Short Read Analysis Package, and, of course, will be available to anyone allowed to look at GPL code.

I'm currently on my first pass - consider it version 0.1 - but already have some basic functionality assembled. Currently, it uses a built in snp caller to identify locations with variations and to directly send them into a postgresql database, but I will shortly be building tools to allow SNPs from any snp caller to be migrated into the db.

Anyhow, just putting it out there - this could be a useful resource for people who are interested in meta analysis, and particularly those who might be interested in collaborating to build a better mousetrap. (=

Labels: , , , , , ,

Tuesday, August 4, 2009

10 minutes in a room with microsoft

As the title suggests, I spent 10 minutes in a room with reps from Microsoft. It counts as probably the 2nd least productive time span in my life - second only to the hour I spent at lunch while the Microsoft reps told us why they were visiting.

So, you'd think this would be educational, but in reality, it was rather insulting.

Wisdom presented by Microsoft during the first hour included the fact that Silverlight is cross platform, Microsoft is a major supporter of interoperability and that bioinformaticians need a better platform to replace bio{java|perl|python|etc} in .net.

My brain was actively leaking out of my ear.

My supervisor told me to be nice and courteous - and I was, but sometimes it can be hard.

The 30 minute meeting was supposed to be an opportunity for Microsoft to learn what my code does, and to help them plan out their future bioinformatics tool kit. Instead, they showed up with 8 minutes remaining in the half hour, during which myself and another grad student were expected to explain our theses, and still allow for 4 minutes of questions. (Have you ever tried to explain two thesis projects in 4 minutes?)

The Microsoft reps were all kind and listened to our spiel, and then engaged in a round-table question and discussion. What I learned during the process was interesting:
  • Microsoft people aren't even allowed to look at GPL software - legally, they're forbidden.
  • Microsoft developers also have no qualms about telling other developers "we'll just read your paper and re-implement the whole thing."
And finally,
  • Microsoft reps just don't get biology development: the questions they asked all skirted around the idea that they already knew what was best for developers doing bioinformatics work.
Either they know something I don't know, or they assumed they did. I can live with that part, tho - They probably know lots of things I don't know. Particularly, I'm sure they know lots about doing coding for biology applications that require no new code development work.

So, in conclusion, all I have to say is that I'm very glad I only published a bioinformatics note instead of a full description of my algorithms (They're available for EVERYONE - except Microsoft - to read in the source code anyhow) and that I produce my work under the GPL. While I never expected to have to defend my code from Microsoft, today's meeting really made me feel good about the path I've charted for developing code towards my PhD.

Microsoft, if you're listening, any one of us here at the GSC could tell you why the biology application development you're doing is ridiculous. It's not that I think you should stop working on it - but you should really get to know the users (not customers) and developers out there doing the real work. And yes, the ones that are doing the innovative and ground breaking code are are mainly working with the GPL. You can't keep your head in the sand forever.

Labels: , ,

Wednesday, July 29, 2009

Aligner tests

You know what I'd kill for? A simple set of tests for each aligner available. I have no idea why we didn't do this ages ago. I'm sick of off-by-one errors caused by all sorts of slightly different formats available - and I can't do unit tests without a good simple demonstration file for each aligner type.

I know Sam format should help with this - assuming everyone adopts it - but even for SAM I don't have a good control file.

I've asked someone here to set up this test using a known sequence- and if it works, I'll bundle the results into the Vancouver Package so everyone can use it.

Here's the 50-mer I picked to do the test. For those of you with some knowledge of cancer, it comes from tp53. It appears to blast uniquely to this location only.
>forward - chr17:7,519,148-7,519,197

>reverse - chr17:7,519,148-7,519,197

Labels: , , , , , ,

Monday, July 27, 2009

how recently was your sample sequenced?

One more blog for the day. I was postponing writing this one because it's been driving me nuts, and I thought I might be able to work around it... but clearly I can't.

With all the work I've put into the controls and compares in FindPeaks, I thought I was finally clear of the bugs and pains of working on the software itself - and I think I am. Unfortunately, what I didn't count on was that the data sets themselves may not be amenable to this analysis.

My control finally came off the sequencer a couple weeks ago, and I've been working with it for various analyses (snps and the like - it's a WTSS data set)... and I finally plugged it into my FindPeaks/FindFeatures pipeline. Unfortunately, while the analysis is good, the sample itself is looking pretty bad. In looking at the data sets, the only thing I can figure is that the year and a half of sequencing chemistry changes has made a big impact on the number of aligning reads and the quality of the reads obtained. I no longer get a linear correlation between the two libraries - it looks partly sigmoidal.

Unfortunately, there's nothing to do except re-seqeunce the sample. But really, I guess that makes sense. If you're doing a comparison between two data-sets, you need them to have as few differences as possible.

I just never realized that the time between samples also needed to be controlled. Now I have a new question when I review papers: How much time elapsed between the sequencing of your sample and it's control?

Labels: , , , , , ,

Picard code contribution

Update 2: I should point out that the subject of this post has been resolved. I'll mark it down to a misunderstanding. The patches I submitted were accepted several days after being sent and rejected, once the purpose of the patch was clarified with the developers. I will leave the rest of the post here, for posterity sake, and because I think that there is some merit to the points I made, even if they were misguided in their target.

Today is going to be a very blog-ful day. I just seem to have a lot to rant about. I'll be blaming it on the spider and a lack of sleep.

One of the things that thrills me about Open Source software is the ability for anyone to make contributions (above and beyond the ability to share and understand the source code) - and I was ecstatic when I discovered the java based Picard project, an open source set of libraries for working with SAM/BAM files. I've been slowly reading through the code, as I'd like to use it in my project for reading/writing SAM format files - which nearly all of the aligners available are moving towards.

One of those wonderful tools that I use for my own development is called Enerjy. It's an Eclipse plug-in designed to help you write better java code by making suggestions about things that can be improved. A lot of it's suggestions are simple: re-order imports to make them alphabetical (and more readable), fill in missing javadoc flags, etc. They're not key pieces, but they are important to maintain your code's good health. It does also point the way to things that will likely cause bugs as well (such as doing string comparisons with the "==" operator).

While reading through the Picard libraries and code, Enerjy threw more than 1600 warnings. It's not in bad shape, but it's got a lot of little "problems" that could easily be fixed. Mainly a lot of missing javadoc, un-cast generic types, arrays being passed between classes and the like. As part of my efforts to read through and understand the code, which I want to do before using it, I figured I'd fix these details. As I ramped up into the more complex warnings, I wanted to start small while still making a contribution. Open source at it's best, right?

The sad part of the tale is that open source only works when the community's contributions are welcome. Apparently, with Picard, code cleaning and maintenance isn't. My first set of patches (dealing mainly with the trivial warnings) were rejected. With that reception, I'm not going to waste my time submitting the second set of changes I made. That's kind of sad, in my opinion. I expressly told them that these patches were just a small start and that I'd begin making larger code contributions as my familiarity with the code improves - and at this rate, my familiarity with the code is definitely not going to mature as quickly, since I have much less motivation to clean up their warnings if they themselves aren't interested in fixing them.

At any rate, perhaps I should have known. Open source in science usually means people have agendas about what they'd like to accomplish with the software - and including contributions may mean including someone on a publication downstream if and when it does become published. I don't know if that was the case here: it was well within the project leader's rights to reject my patches on any grounds they like, but I can't say it makes me happy. I still don't enjoy staring at 1600+ warnings every time I open Eclipse.

The only lesson I take away from this is that next time I see "Open Source" software, I'll remember that just because it's open source, it doesn't mean all contributions are welcome - I should have confirmed with the developers before touching the code that they are open to small changes, and not just bug fixes. In the future, I suppose I'll be tempering my excitement for open source science software projects.

update: A friend of mine pointed me to a link that's highly related. Anyone with an open source project (or interested in getting started in one) should check out this blog post titled Teaching people to fish.

Labels: , , , , , ,

Friday, July 17, 2009


This week has been a tremendous confluence of concepts and ideas around community. Not that I'd expect anyone else to notice, but it really kept building towards a common theme.

The first was just a community of co-workers. Last week, my lab went out to celebrate a lab-mate's successful defense of her thesis (Congrats, Dr. Sleumer!). During the second round of drinks (Undrinkable dirty martinis), several of us had a half hour conversation on the best way to desalinate an over-salty martini. As weird as it sounds, it was an interesting and fun conversation, which I just can't imagine having with too many people. (By the way, I think Obi's suggestion wins: distillation.) This is not a group of people you want to take for granted!

The second community related event was an invitation to move my blog over to a larger community of bloggers. While I've temporarily declined, it raised the question of what kind of community I have while I keep my blog on my own server. In some ways, it leaves me isolated, although it does provide a "distinct" source of information, easily distinguishable from other people's blogs. (One of the reasons for not moving the larger community is the lack of distinguishing marks - I don't want to sink into a "borg" experience with other bloggers and just become assimilated entirely.) Is it worth moving over to reduce the isolation and become part of a bigger community, even if it means losing some of my identity?

The third event was a talk I gave this morning. I spent a lot of time trying to put together a coherent presentation - and ended talking about my experiences without discussing the actual focus of my research. Instead, it was on the topic of "successes and failures in developing an open source community" as applied to the Vancouver Short Read Analysis Package. Yes, I'm happy there is a (small) community around it, but there is definitely room for improvement.

Anyhow, at the risk of babbling on too much, what I really wanted to say is that communities are all around us, and we have to seriously consider our impact on them, and the impact they have on us - not to mention how we integrate into them, both in our work and outside. If you can't maximize your ability to motivate them (or their ability to motivate you), then you're at a serious disadvantage. How we balance all of that is an open question, and one I'm still working hard at answering.

I've attached my presentation from this morning, just in case anyone is interested. (I've decorated it with pictures from the South Pacific, in case all of the plain text is too boring to keep you awake.)

Here it is (it's about 7Mb.)

Labels: , , , , , , , ,

Thursday, July 9, 2009

New Tool: KeepNote

Obviously I haven't updated much here lately - I've been pretty busy and inspiration hasn't struck me much in the last few days to get anything written. However, I started using some new software this morning, and I'm enjoying it so much I figured I have to share.

One of the big problems I have, as a bioinformatician, is keeping track of all the notes and one off scripts I write. I don't want to use an SVN, because it's just a repository with no organization. I don't want to use a wiki, because it's a huge hassle to maintain for small projects, and I hate using text files.

The compromise, it seems, is to use standards compliant files with a hell of a wrapper around them that does the organization for you, and the one I found is called KeepNote. The project page and downloads can be found at The software is available for all major OS (Linux, Mac and even Windows), and can be installed relatively quickly and (for the most part) painlessly. (Linux builds are missing a library in the dependencies, but that can be figured out pretty quickly - just apt-get the missing lib and re-install if you hit this problem.)

While it may not fit everyone's workflow, my few hours of using it have already helped me get my tools organized and assembled in a logical manner, and it's allowed me to remove a load of files from my desktop. There are still bugs with it: I had to manually do some configuration of the the web browser, text editor and such before I could get started, but so far I haven't hit any of the bugs.

It also claims to help you organize notes - which I can clearly see. next time I go to a conference, I'll be using this for recording and organizing the usual 30-40 pages of notes I take.

For me, this falls under the heading of required tools for bioinformaticians and students alike and I look forward to seeing the project evolve and grow.

Labels: , , ,

Tuesday, June 30, 2009

An interesting converation on bioinformatics business models

Every once in a while, I suddenly remember, and rush over there to see what I've been missing. (My occasional lapses generally coincide with my bi-weekly meetings with my supervisor, an upcoming talk or something of that sort...) SeqAnswers is easily the best resource on Next-Gen sequencing, and I truly enjoy the people that hang out on that forum.

Anyhow, I've been participating in an interesting conversation on the business of bioinformatics and next-gen sequencing. It started off on a question on market research, and then blossomed into a much wider ranging conversation. One re-occurring thread in the discussion is if there are valid bioinformatics business models in which the bioinformatics application is the commodity. I maintain that there aren't but clearly other people disagree.

In the name of encouraging a wider audience to contribute, I thought I'd ask anyone who's reading my blog what they think. Join in here or on the forums.



Monday, June 15, 2009

Another day, another result...

I had the urge to just sit down and type out a long rant, but then common sense kicked in and I realized that no one is really interested in yet another graduate student's rant about their project not working. However, it only took a few minutes for me to figure out why it's relevant to the general world - something that's (unfortunately) missing from most grad student projects.

If you follow along with Daniel McArthur's blog, Genetic Future, you may have caught the announcement that Illumina is getting into the personal genome sequencing game. While I can't admit that I was surprised by the news, I will have to admit that I am somewhat skeptical about how it's going to play out.

If your business is using arrays, then you'll have an easy time sorting through the relevance of the known "useful" changes to the genome - there are only a couple hundred or thousand that are relevant at the moment, and several hundred thousand more that might be relevant in the near future. However, when you're sequencing a whole genome, interpretation becomes a lot more difficult.

Since my graduate project is really the analysis of transcriptome sequencing (a subset of genome sequencing), I know firsthand the frustration involved. Indeed, my project was originally focused on identifying changes to the genome common to several cancer cell lines. Unfortunately, this is what brought on my need to rant: there is vastly more going on in the genome than small sequence changes.

We tend to believe blindly what we were taught as the "central paradigm of molecular biology". Genes are copied to mRNA, mRNA is translated to proteins, and the protein goes off to do it's work. However, cells are infinitely more complex than that. Genes can be inactivated by small changes, can be chopped up and spliced together to become inactivated or even deregulated, interference can be run by distally modified sequences, gene splicing can be completely co-opted by inactivating genes we barely even understand yet and desperately over-expressed proteins can be marked for deletion by over-activating garbage collection systems so that they don't have a chance to get where they were needed in the first place. And here we are, looking for single nucleotide variations, which make up a VERY small portion of the information in a Cell.

I don't have the solution, yet, but whatever we do in the future, it's not going to involve $48,000 genome re-sequencing. That information on it's own is pretty useless - we'll have to study expression (WTSS or RNA-Seq, so figure another $30,000), changes to epigenetics (of which there are many histone marks, so figure 30 x $10,000) and even dna methylation (I don't begin to know what this process costs.)

So, yes, while I'm happy to see genome re-sequencing move beyond the confines of array based SNP testing, I'm pretty confident that this isn't the big step forward it might seem. The early adopters might enjoy having a pretty piece of paper that tells them something unique about their DNA, and I don't begrudge it. (In fact, I'd love to have my DNA sequenced, just for the sheer entertainment value.) Still, I don't think we're seeing a revolution in personal genomics - not quite yet. Various experiments have shown we're on the cusp of a major change, but this isn't the tipping point: we're still going to have to wait for real insight into the use of this information.

When Illumina offers a nice toolkit that allows you to get all of the SNVs, changes in expression and full ChIP-Seq analysis - and maybe even a few mutant transcription factor ChIP-Seq experiments thrown in - and all for $48,000, then we'll have a truly revolutionary system.

In the meantime, I think I'll hold out on buying my genome sequence. $48,000 would buy me a couple more weeks in Tahiti, which would currently offer me a LOT more peace of mind. (=

And on that note, I'd better get back to doing the things I do.... new FindPeaks tag, anyone?

Labels: , , , ,

Saturday, May 16, 2009

BIoinformatics in the lab

After yesterday's talk by Dr. Bowdish (I just feel weird calling professors by the first name when referring to their talks), I walked away with several different trains of thought, one of which was the easy integration of bioinformatics into the research program she'd undertaken. The interesting thing isn't so much that it was there, but the absolutely relaxed attitude with which it had been presented.

When I first started talking to professors about the interface between computers and biology or biochemistry, the field had barely even been given a name - and most of the professors were utterly confused about what computers could do to enhance their research programs. (Yes, I was an undergrad in the mid-90's.) I remember several profs saying they couldn't think of a reason to have computers in their labs at all. (Of course, at the time, there probably wasn't much use for computers in the lab anyhow.)

There was one prof who was working on the edge of the two subjects: Dr. Patricia Schulte. Although she was working on the field of fish biology, somehow she was able to see the value and encourage her students to explore the interface of bioinformatics and lab integration - and she was the first person to introduce me to the term Bioinformatics (among many other topics: HMMs, Neural Nets, etc...)

Anyhow, at that point, I was hooked on bioinformatics, but finding the opportunity to do hands on work was nearly impossible. The biology professors didn't know what it could do for them - and clearly didn't have the vocabulary with which to express their interests in computational information. It was awkward, at times. One prof couldn't figure out why I wanted to use word processors for biology.

To my great amazement, things have dramatically changed in the (nearly) decade and a half since I started my first undergrad, and yesterday's talk was really a nice opportunity to contemplate that change. Dr. Bowdish's talk included a significant amount of biology, genomics and bioinformatics predictions. When the predictions didn't turn out (eg. the putative myristolation site wasn't actually important), there was no accompanying comment about how unreliable bioinformatics is (which I used to see ALL the time in the early days of the field), and there was no hesitation to jump in to the next round of bioinformatics predictions (structure predictions for the enzyme).

I think even this quiet incorporation of bioinformatics into a young lab is incredibly encouraging. Perhaps it's Dr. Bowdish's past, having done her PhD in Dr. Hancock's lab, who himself was an early adopter of bioinformatics predictions, or possibly it's just researchers who have grown up with computers for most of their life finally getting into the ranks of academia. Either way, I'm impressed and encouraged. Bioinformatics gold age may not be here yet, but I think the idea that they'll never become mainstream has finally started to fade from the halls of the ivory tower.

Labels: ,

Friday, May 15, 2009

On the necessity of controls

I guess I've had this rant building up for a while, and it's finally time to write it up.

One of the fundamental pillars of science is the ability to isolate a specific action or event, and determine it's effects on a particular closed system. The scientific method actually demands that we do it - hypothesize, isolate, test and report in an unbiased manner.

Unfortunately, for some reason, the field of genomics has kind of dropped that idea entirely. At the GSC, we just didn't bother with controls for ChIP-Seq for a long time. I can't say I've even seen too many matched WTSS (RNA-SEQ) experiments for cancer/normals. And that scares me, to some extent.

With all the statistics work I've put in to the latest version of FindPeaks, I'm finally getting a good grasp of the importance of using controls well. With the other software I've seen, they do a scaled comparison to calculate a P-value. That is really only half of the story. It also comes down to normalization, to comparing peaks that are present in both sets... and to determining which peaks are truly valid. Without that, you may as well not be using a control.

Anyhow, that's what prompted me to write this. As I look over the results from the new FindPeaks (, both for ChIP-Seq and WTSS, I'm amazed at how much clearer my answers are, and how much better they validate compared to the non-control based runs. Of course, the tests are still not all in - but what a huge difference it makes. Real control handling (not just normalization or whatever everyone else is doing) vs. Monte Carlo show results that aren't in the same league. The cutoffs are different, the false peak estimates are different, and the filtering is incredibly more accurate.

So, this week, as I look for insight in old transcription factor runs and old WTSS runs, I keep having to curse the lack of controls that exist for my own data sets. I've been pushing for a decent control for my WTSS lanes - and there is matched normal for one cell line - but it's still two months away from having the reads land on my desk... and I'm getting impatient.

Now that I'm able to find all of the interesting differences with statistical significance between two samples, I want to get on with it and find them, but it's so much more of a challenge without an appropriate control. Besides, who'd believe it when I write it up with all of the results relative to each other?

Anyhow, just to wrap this up, I'm going to make a suggestion: if you're still doing experiments without a control, and you want to get them published, it's going to get a LOT harder in the near future. After all, the scientific method has been pretty well accepted for a few hundred years, and genomics (despite some protests to the contrary) should never have felt exempt from it.

Labels: , , , , , , , ,

Thursday, April 16, 2009

Multi-match reads in ChIP-Seq

I had an interesting comment left on my blog today, which is worth taking a few minutes to write a response to:
"Hi Anthony, I just discovered your blog and it looks very interesting to me!
Since this article on Eland is now more than one year old, I was wondering
if the description at point 3 about multi matching locations is still
applicable to the Eland program in the Illumina pipeline 1.3. More in general,
would you trust the multi matching locations extracted from the multi_eland
output files to perform a repeat enrichment analysis over an experiment of
ChIP-seq? If no, why? Thank you in advance for your attention."

The first question asks about multi-matching locations - and if the point in question (point 3) applies to the Illumina Pipeline 1.3. Since point 3 was just that the older pipeline didn't provide the locations of the multi-matche reads, I suppose this no longer really applies: I understand the new version of Eland does provide multi-match alignment information, as do other aligners such as Bowtie. However, I should also mention that since I adopted Maq as my preferred aligner, I haven't used Eland much - so it's hard for me to give an informed opinion on the quality of the matches. I simply don't know if they're any good, and I won't belabour that point. I have used Bowtie specifically because it was able to do mutli-matches, but we didn't use it for ChIP-Seq, and the multi-matches had other uses in that experiment.

So, the more interesting question is whether I'd use multi-match reads in a ChIP-Seq analysis. And, off hand, my answer has to be no. But let me explain my reasoning, and the conditions in which I would change that answer.

First, lets assume that we have Single End Tags, so the multi-match information is not resolvable. That means anytime we have a read that maps to more than one location, we have the possibility that we can either map it to it's source - or we're mapping it incorrectly. A 50% change of "getting it right." The greater the number of multi-match locations, the smaller the chance we're actually finding it's correct origin. So, at best we've got a 50-50 chance that we're not adversely affecting the outcome of the experiment. That's not great.

In contrast, there are things we could do to make them usable. The most widely used method from FindPeaks is the weighted fragment distribution type. Thus, we could expand the principle to weight the fragments according to the number of sites. That would be... bearable. But would it significantly add to the quality of the alignment?

I'm still going to say no. Fragments we see in ChIP-Seq experiments tend to fall within 200-300bp of the regions in which the transcription factor (or other sites) bind. Thus, even if we were concerned that a particular transcription factor binds primarily to the similar motif regions at two sites, there should be more than enough (unique) sequence around that site (which is usually <30-40bp in length) to which you'll still see fragments aligning. That should compensate for the loss of the multi-match fragments.

Even more importantly, as read lengths increase, the amount of non-unique sequence decreases rapidly, making the shrinking number of multi-match reads less important.

The same argument can be extended for paired end tags: Just as read lengths improve and reduce the number of multi-match sites, more of the multi-match reads will be resolved by pairing them with a second read, which is unlikely to be within the same repeat region, thus reducing the number of reads that become unresolvable multi-matches. Proportionally, one would then expect that leaving out these reads become a smaller and smaller segment of the population, and would have to worry less and less about their contribution.

So, then, when would I want them?

Well, on the odd chance you're working with very short reads, you can pull off the weighting properly, and you have single end tags - and the multi-match reads make up a significant proportion of the reads, then it's worth exploring.

You'd need to start asking the tough questions: did the aligner simply find that a small k-mer of the read aligned to multiple locations (and was then unable to resolve the tie by extension the way some Eland aligners work)? Does the aligner use quality scores to identify mis-alignments? How reliable are the alignments (what's their error rate)? What was your sample, and how divergent is it from reference ? (e.g., cancer samples have a high variation rate, and so encourage many false alignments, making the alignments less reliable.)

Overall, I really don't see too many cases where you're going to gain a lot by digging in the multi-match files. That's not too say that you won't find anything good in there - you probably would, if you knew where to look, but the noise to signal ratio is going to be pretty poor - just by definition of the fact that they're mutli-match reads alone. You'll just have to ask if it's worth your time.

For the moment, I don't think my time (even at grad student wages) is worth it. It's just not low hanging fruit, when it comes to ChIP-Seq.

Labels: , , , , , , ,

Wednesday, March 25, 2009

Searching for SNPs... a disaster waiting to happen.

Well, I'm postponing my planned article, because I just don't feel in the mood to work on that tonight. Instead, I figured I'd touch on something a little more important to me this evening: WTSS SNP calls. Well, as my committee members would say, they're not SNPs, they're variations or putative mutations. Technically, that makes them Single Nucleotide Variations, or SNVs. (They're only polymorphisms if they're common to a portion of the population.

In this case, they're from cancer cell lines, so after I filter out all the real SNPs, what's left are SNVs... and they're bloody annoying. This is the second major project I've done where SNP calling has played a central role. The first was based on very early 454 data, where homopolymers were frequent, and thus finding SNVs was pretty easy: they were all over the place! After much work, it turned out that pretty much all of them were fake (false positives), and I learned to check for homopolymer runs - a simple trick, easily accomplished by visualizing the data.

We moved onto Illumina, after that. Actually, it was still Solexa at the time. Yes, this is older data - nearly a year old. It wasn't particularly reliable, and I've now used several different aligners, references and otherwise, each time (I thought) improving the data. We came down to a couple very intriguing variations, and decided to sequence them. After several rounds of primer design, we finally got one that worked... and lo and behold. 0/2. Neither of them are real. So, now comes the post-mortem: Why did we get the false positives this time? Is it bias from the platform? Bad alignments? Or something even more suspicious... do we have evidence of edited RNA? Who knows. The game begins all over again, in the quest for answering the question "why?" Why do we get unexpected results?

Fortunately, I'm a scientist, so that question is really something I like. I don't begrudge the last year's worth of work - which apparently is now more or less down the toilet - but I hope that the why leads to something more interesting this time. (Thank goodness I have other projects on the go, as well!)

Ah, science. Good thing I'm hooked, otherwise I'd have tossed in the towel long ago.

Labels: , , , , , ,

Friday, March 20, 2009

Universal format converter for aligned reads

Last night, I was working on FindPeaks when I realized what an interesting treasure trove of libraries I was really sitting on. I have readers and writers for many of the most common aligned read formats, and I have several programs that do useful functions. So, that raise the distinctly interesting point that all of them should be applied together in one shot... and so I did exactly that.

I now have an interesting set of utilities that can be used to convert from one file format to another: bed, gff, eland, extended eland, MAQ .map (read only), mapview, bowtie.... and several other more obscure formats.

For the moment, the "conversion utility" forces the output to bed file format (since that's the file type with the least information, and I don't have to worry about unexpected file information loss), which can then be viewed with the UCSC browser, or interpreted by FindPeaks to generate wig files. (BED files are really the lowest common denominator of aligned information.) But why stop there?

Why not add a very simple functionality that lets one format be converted to the other? Actually, there's no good reason not to, but it does involve some heavy caveats. Conversion from one format type to another is relatively trivial until you hit the quality strings. since these aren't being scaled or altered, you could end up with some rather bizzare conversions unless they're handled cleanly. Unfortunately, doing this scaling is such a moving target that it's just not possible to keep up with that and do all the other devlopment work I have on my plate. (I think I'll be asking for a co-op student for the summer to help out.)

Anyhow, I'll be including this nifty utility in my new tags. Hopefully people will find the upgraded conversion utility to be helpful to them. (=

Labels: , , , , , , , , , , ,

Wednesday, March 4, 2009

Bioinfomatics in a spreadsheet?

This is an old article, but it just came to my attention today.

Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics

The title really does say it all. Alas, I just tested it with openoffice 3.0, and it has the same problem.

Good thing I do my gene name storage in databases!


Friday, January 9, 2009

No More Maq?

Another grad student at the GSC forwarded an email to our mailing list the other day, which was in turn from the maq-help mailing list. Unfortunately, the link on the maq-help mailing list takes you to another page, which incidentally (and erroneously) complains that FindPeaks doesn't work with Maq .map files - which it does. Instead, I suggest checking out this post on SeqAnswers from Li Heng, the creator of Maq, which has a very similar message.

The main gist of it is that the .map file format will be deprecated, and there will be no new versions of the Maq software package in the future. Instead, they will be working on two other projects (from the forwarded email):
  1. Samtools: replaces maq's (reference-based) "assembly"
  2. bwa: replaces maq's "mapping" for whole human genome alignment.
I suppose it means that eventually FindPeaks should support the Samtools formats, which I'll have to look into at some point. For those of you who are still using Maq, you may need to start following those projects as well, simply because it raises the question of long-term Maq support. As with many early generation Bioinformatics tools, we'll just have to be patient and watch how the software landscape evolves.

It probably also means that I'll have to start watching the Samtools development more carefully for use with my thesis project - many of the tools they are planning seem to replace the ones I've already developed in the Vancouver Short Read Alignment Package. Eventually, I'll have to evaluate both sets against each other. (That could also be an interesting project.)

While this was news to me, it's probably no more than the expected churn of a young technology field. I'm sure it's not going to be long until even the 2nd generation sequencing machines themselves evolve into something else.

Labels: , , ,

Tuesday, January 6, 2009

My Geneticist dot com

A while back, I received an email from a company called that is doing genetic testing to help patients identify adverse drug reactions. I'm not sure what the relationship is, but they seem to be a part of something called DiscoverMe technologies. I bring mygeneticist up, because I had an "interview" with one of their partners, to determine if I am a good subject for their genetic testing program. It seems I'm too healthy to be included, unless they later decide to include me as a control. Nuts-it! (I'm still trying to figure out how to get my genome sequenced here at the GSC too, but I don't think anyone wants to fund that...)

At any rate, I spoke with the representative of their clinical side of operations this morning and had an interesting conversation about my background. In typical fashion, I also took the time to ask a few specific questions about their operations. I'm pretty sure they didn't tell me much more than was available on their various web pages, but I think there was some interesting information that came out of it.

When I originally read their email, I had assumed that they were going to be doing WTSS on each of their patients. At about $8000 per patient, it's expensive, but a relatively cheap form of discovery - if you can get around some of the challenges involved in tissue selection, etc. Instead, it seems that they're doing specific gene interrogation, although I wasn't able to get the type of platform their using. This leads me to believe that they're probably doing some form of literature check for genes related to the drugs of interest, followed by a PCR or Array based validation across their patient group. Considering the challenges of associating drug reactions with SNPs and genomic variation, I would be very curious to see what they have planned for "value-added" resources. Any drug company can find out (and probably does already know) what's in the literature, and any genetic testing done without approval from the FDA will probaby be sued/litigated/regulated out of existance... which doesn't leave a lot of wiggle room for them.

And that lead me to thinking about a lot of other questions, which went un-asked. (I'll probably email the Genomics expert there to ask some questions, though I'm mostly interested in the business side of it, which they probably won't answer.) What makes them think that people will pay for their services? How can they charge a low-enough fee to make the service attractive while getting making a profit? And, from the scientific side, assuming they're not just a diagnostic application company, I'm not sure how they'll get a large enough cohort to make sense of the data they receive through their recruitment strategy.

Anyhow, I'll be keeping my eyes on this company - if they're still around in a year or two, I'd be very interested in talking to them again about their plans in the next-generation sequencing field.

Labels: , , ,

Saturday, December 6, 2008

Nothing like reading to stimulate ideas

Well, this week has been exciting. The house sale competed last night, with only a few hiccups. Both us and the seller of the house we were buying got low-ball offers during the week, which provided the real estate agents lots to talk about, but never really made an impact. We had a few sleepless nights waiting to find out of the seller would drop our offer and take the competing one that came in, but in the end it all worked out.

On the more science-related side, despite the fact I'm not doing any real work, I've learned a lot, and had the chance to talk about a lot of ideas.

There's been a huge ongoing discussion about the qcal values, or calibrated base call scores that are appearing in Illumina runs these days. It's my understanding that in some cases, these scores are calibrated by looking at the number of perfect alignments, 1-off alignments, and so on, and using the SNP rate to identify some sort of metric which can be applied to identify an expected rate of mismatched base calls. Now, that's fine if you're sequencing an organism that has a genome identical to, or nearly identical to the reference genome. When you're working on cancer genomes, however, that approach may seriously bias your results for very obvious reasons. I've had this debate with three people this week, and I'm sure the conversation will continue on for a few more weeks.

In terms of studying for my comprehensive exam, I'm now done the first 12 chapters of the Weinberg "Biology of Genomes" textbook, and I seem to be retaining it fairly well. My girlfriend quizzed me on a few things last night, and I did reasonably well answering the questions. 6 more days, 4 more chapters to go.

The most interesting part of the studying was Thursday's seminar day. In preparation for the Genome Sciences Centre's bi-annual retreat, there was an all-day seminar series, in which many of the PIs spoke about their research. Incidentally, 3 of my committee members were speaking, so I figured it would be a good investment of my time to attend. (Co-incidentally, the 4th committee member was also speaking that day, but on campus, so I missed his talk.)

Indeed - having read so many chapters of the textbook on cancer biology, I was FAR better equipped to understand what I was hearing - and many of the research topics presented picked up exactly where the textbook left off. I also have a pretty good idea what questions they will be asking now: I can see where the questions during my committee meetings have come from; it's never far from the research they're most interested in. Finally, the big picture is coming together!

Anyhow, two specific things this week have stood out enough that I wanted to mention them here.

The first was the keynote speaker's talk on Thursday. Dr. Morag Park spoke about the environment of tumours, and how it has a major impact on the prognosis of the cancer patient. One thing that wasn't settled was why the environment is responding to the tumour at all. Is the reaction of the environment dictated by the tumour, making this just another element of the cancer biology, or does the environment have it's own mechanism to detect growths, which is different in each person. This is definitely an area I hadn't put much thought into until seeing Dr. Park speak. (She was a very good speaker, I might add.)

The second item was something that came out of the textbook. They have a single paragraph at the end of chapter 12, which was bothering me. After discussing cancer stem cells, DNA damage and repair, and the whole works (500 pages of cancer research into the book...), they mention progeria. In progeria, children age dramatically quickly, such that a 12-14 year old has roughly the appearance of an 80-90 year old. It's a devastating disease. However, the textbook mentions it in the context of DNA damage, suggesting that the progression of this disease may be caused by general DNA damage sustained by the majority of cells in the body over the short course of the life of a progeria patient. This leaves me of two minds: 1), the DNA damage to the somatic cells of a patient would cause them to lose tissues more rapidly, which would have to be regenerated more quickly, causing more rapid degradation of tissues - shortening telomeres would take care of that. This could be cause a more rapid aging process. However, 2) the textbook just finished describing how stem cells and rapidly reproducing progenitor cells are dramatically more sensitive to DNA damage, which are the precursors involved in tissue repair. Wouldn't it be more likely then that people suffering with this disease are actually drawing down their supply of stem cells more quickly than people without DNA repair defects? All of their tissues may also suffer more rapid degradation than normal, but it's the stem cells which are clearly required for long term tissue maintenance. An interesting experiment could be done on these patients requiring no more than a few milliliters of blood - has their CD34+ ratio of cells dropped compared to non-sufferers of the disease? Alas, that's well outside of what I can do in the next couple of years, so I hope someone else gives this a whirl.

Anyhow, just some random thoughs. 6 days left till the exam!

Labels: , , , , , ,

Sunday, October 5, 2008

Field Programmable Gate Arrays

Yes, I'm procrastinating again. I have two papers, two big chunks of code and a thesis proposal to write, a paper to review (it's been done but I have yet to type out my comments..), several major experiments to do and at least one poster looming on the horizon - not to mention squeezing in a couple of manuals for the Vancouver Package Software. And yet, I keep finding other stuff to work on, because it's the weekend.

So, I figured this would be a good time to touch on a topic of Field Programmable Gate Arrays or FPGAs. I've done very little research on this topic, since it's so far removed from my own core expertise, but it's a hot topic in bioinformatics, so I'd be doing a big disservice by not touching on this subject at all. However, I hope people will correct me if they spot errors.

So what is an FPGA? I'd suggest you read the wikipedia article linked above, but I'd sum it up as a chip that can be added to a computer, which has the ability to optimize the way in which information is processed, so as to accellerate a given algorithm. It's a pretty cool concept - move a particular part of an algorithm into the hardware itself to speed it up. Of course, there are disadvantages as well. Reprogramming is (was? - this may have changed) a few orders of magnitude slower than processing information, so you can't change the programming on the fly while processing data and still hope to get a speed up. Some chips can change programming of unused sub-sections, while other algorithms are running... but now we're getting really technical.

(For a very good technical discussion, I suggest this book, of which I've read a few useful paragraphs.)

Rather than discuss FPGAs, which are a cool subject on their own, I'd rather discuss their applications in Bioinformatics. As far as I know, they're not widely used for most applications at the moment. The most processor intensive bioinformatics applications, Molecular Modeling and drug docking, are mainly vector-based calculationd, so vector chips (eg Graphics Processing Units - GPUs) are more applicable for them. As for the rest, CPUs have traditionally been "good enough". However, recently the following two things seem to have accelerated this potential mariage of technology:
  1. The makers of FPGAs have been looking for applications for their products for years and have targeted bioinformatics because of it's intense computer use. Heavy computer use is always considered to be a sign that more efficient processing speed is an industry need - and FPGAs appear to meet that need - on the surface.
  2. Bioinformatics was doing well with the available computers, but suddenly found itself behind the processing curve with the advent of Second Generation Sequencing (SGS). Suddenly, the amount of information being processed spiked by an order of magnitude (or more), causing the bioinformaticians to scream for more processing power and resources.
So, it was inevitable that FPGA producers would hear about the demand for more power in the field, and believe that it's the ideal market into which they should pluge. To the casual observer, Bioinformatics needs more efficiency and power, and FPGA producers are looking for a martet where efficiency and power are needed! Is this a match made in heaven or what?

Actually, I contend that FPGAs are the wrong solution for several reasons.

While Second Generation Sequencing produces tons more data, the algorithms being employed haven't yet settled down. Every 4 months we pick a different aligner. Every 3 months we add a new data base. Every month we produce a more efficient version of our algorithms for interpreting the data. Due to the overhead in producing an algorithm translation into hardware necessary to use the FPGA (which seems large to me, but may not be to people more fluent in HDL) would mean that you'd spend a disproportionate amount of time trying to get the chips set up to process your data - which you're only going to use for a short period of time before moving on. And the gain of efficiency would probably be wiped out by the amount of effort introduced.

Furthermore, even when we do know the algorithms being used are going to stay around, a lot of our processing isn't necessarily CPU bound - but rather is I/O or memory bound. When you're trawling through 16Gb or memory, it's not necessarily obvious that adding more speed to the CPU will help. Pre-fetching and pre-caching are probably doing more to help you out than anything else bound to your CPU.

In the age of multi-CPUs, using multi-threaded programs already reduces many of the pains that plague bioinformaticians. Most of my java code is thrilled to pull 2, 3, or more processors in to work faster - without a lotof explicit multi-treadding. (My record so far is 1496% cpu usage - nearly 15 processors.) I would expect that buying 16-way processors is probably more cost-efficient than buying 16 FPGAs in terms of processing data for many of the current algorithms in use.

Buying more conventional resources will probably alleviate the sudden bottle-neck in compute power, rather than innovating around new solutions to solve the need. It's likely that many groups getting into the second generation genomics technologies failed to understand the processing demands of the data, and thus didn't plan adequately for the resources. This means that much of the demand for data processing is just temporary, and may even be aleviated with more efficient algorithms in the future.

So where does the FPGA fit in?

I'd contend that there are very few products out there that would benefit from FPGAs in Bioinformatics... but there are a few. Clearly, all bioinformaticians know that aligning short reads is one of those areas. Considering that a full Maq run for a flow cell from an Illumina GAII takes 14+ hours on a small cluster, that would be one area in which they'd clearly benefit.

Of course, no bioinformatician wants to have to reprogram an FPGA on the fly to utilize their work. Were I to pick a model, it would probably be to team up with an aligner group, to produce a stand alone, multi-FPGA/CPU hybrid box with 32Gb of RAM, and a 3-4 year upgrade path. Every 4 months you produce a new aligner algorithm and HDL template, and users pick up the aligner and HDL upgrade, and "flash" their computer to use the new software/hardware. This would follow the Google Appliance model: an automated box that does one task, and does it well, with the exception that hardware "upgrades" come along with the software patches. That would certainly turn a few heads.

At any rate, only time will tell. If the algorithms settle down, FPGAs may become more useful. If the FPGAs become easier to program for bioinformaticians, they may find a willing audience. If the FPGAs begin to understand the constraints of the bioinformatics groups, they may find niche applications that will truly benefit from this technology. I look forward to seeing where this goes.

Ok... now that I've gone WAY out on a limb, I think it's time to tackle a few of those tasks on my list.

Labels: , , ,

Tuesday, August 12, 2008

SNP callers.

I thought I'd switch gears a bit this morning. I keep hearing people say that the next project their company/institute/lab is going to tackle is a SNP calling application, which strikes me as odd. I've written at least 3 over the last several months, and they're all trivial. They seem to perform as well as any one else's SNP calls, and, if they take up more memory, I didn't think that was too big of a problem. We have machines with lots of RAM these days, and it's relatively cheap, these days.

What really strikes me as odd is that people think there's money in this. I just can't see it. The barrier to creating a new SNP calling program is incredibly low. I'd suggest it's even lower than creating an aligner - and there are already 20 or so of those out there. There's even an aligner being developed at the GSC (which I don't care for in the slightest, I might add) that works reasonably well.

I think the big thing that everyone is missing is that it's not the SNPs being called that important - it's SNP management. In order to do SNP filtering, I have a huge postgresql database with SNPs from a variety of sources, in several large tables, which have to be compared against the SNPs and gene calls from my data set. Even then, I would have a very difficult time handing off my database to someone else - my database is scalable, but completely un-automated, and has nothing but the psql interface, which is clearly not the most user friendly. If I were going to hire a grad student and allocate money to software development, I wouldn't spend the money on a SNP caller and have the grad student write the database - I'd put the grad student to work on his own SNP caller and buy a SNP management tool. Unfortunately, it's a big project, and I don't think there's a single tool out there that would begin to meet the needs of people managing output from massively-parallel sequencing efforts.

Anyhow, just some food for thought, while I write tools that manage SNPs this morning.


Labels: , , ,

Monday, February 18, 2008

Aligning DNA - comments from above

I've been pretty bad about continuing my posts on how the different aligners work. It's a lot of work keeping up with them, since I seem to hear about a new one each week. However, a post-doc in my lab gave a presentation on contrasting the various aligners, to discuss each of their strengths and weaknesses for doing short (Illumina) read alignments.

Admittedly, I don't know how accurate the presenter's data was - most of the presentation was in being used to set up his own in-house aligner development, and thus all of the aligners were painted in a poor light, except his, of course. That being said, there's some truth to what he found: most of the aligners out there have some pretty serious issues.

Eland is still limited by it's 32-base limit, which you'd think they'd have been over by now. For crying out loud, the company that produces it is trying to sell kits for doing 36-base alignments. It's in their best interest to have an aligner that does more than 32 bases. (Yes, they have a new work-around in their Gerald program, but it's hardly ideal.)

MAQ, apparently, has a weird "feature" that if multiple alignments are found, it just picks one at random as the "best". Hardly ideal for most experiments.

Mosaik provides output in .ace files - which are useless for any further work, unless you want to reverse engineer converters to other, more reasonable, formats.

SOAP only aligns against the forward strand! (How hard can it be to map the reverse compliment???)

Exonerate is great when run in "slow mode", at which point it's hardly usable for 40M reads, and when it's run in "fast mode", it's results are hardly usable at all.

SHRiMP, I just don't know enough about to comment on.

And yes, even the post-doc's in-house aligner (called Slider) has some serious issues: it's going to miscall all SNPs, unless you're aligning fragments from the reference sequence back to itself. (That's not counting the 20 hours I've already put in to translate the thing to java proper, patching memory leaks, and the like...)

Seriously, what's with all of these aligners? Why hasn't anyone stepped up to the plate and come up with a decent open-source aligner? There are got to be hundreds of groups out there who are struggling to make these work, and not one of them is ideal for use with Illumina reads. Isn't there one research group out there dog-fooding their own Illumina sequence aligner?

At this rate, I may have to build my own. I know what they say about software, though: You can have fast, efficient or cheap - pick any two. With aligners, it seems that's exactly where we're stuck.

Labels: , ,

Saturday, February 9, 2008

Pacific Biotech new sequencing technology

I have some breaking news. I doubt I'm the first to blog this, but I find it absolutely amazing, so I had to share.

Steve Turner from Pacific Biosciences (PacBio), just gave the final talk of the AGBT session, and it was damn impressive. They have a completely new method of doing sequencing that uses DNA polymerase as a sequencing engine. Most impressively, they've completed their proof of concept, and they presented data from it in the session.

The method is called Single Molecule Real Time (SMRT) sequencing. It's capable of producing 5000-25,000 base pair reads, at a rate of 10 bases/second. (They apparently have 25bps techniques in development, and expect to release when they have 50bps working!)

The machinery has zero moving part, and once everything is in place, they anticipate that they'll have a sequencing rate of greater than 100 Gb per hour! As they are proud to mention, that's about a full draft genome for a human being in 15 minutes, and at a cost of about $100. Holy crap!

Labels: , , , ,

Thursday, February 7, 2008

AGBT post #2.

Good news.. my bag arrived! I'm going to go pick it up after the current session, and finally get some clean clothes and a shave. Phew!

Anyhow, on the AGBT side of things, I just came back form the Pacific Biosciences panel discussion, which was pretty neat. The discussion was on "how many base pairs will it take to enable personalized medicine?" A topic I'm really quite interested in.

The answers stretched from infinite, to 6 Billion, to 100TB, to 100 people (if they can pick the right person), to 1 (if they find the right one). It was a pretty decent discussion, covering things from American politics, to snp finding, to healthcare... you get the idea. The moderator was also good, the host of a show (Biotechworld?) on NPR.

My one problem is that in giving their answers, they brushed on several key points, but never really followed up on it.

1) just having the genome isn't enough. Stuff like transription factor binding sites, methylation, regulation, and so forth are all important. If you don't know how the genome works, personal medicine applications aren't going to fall out of it. (Elaine Mardis did mention this, but there was little discussion of it.)

2) Financial aspects will drive this. That, in itself was mentioned, but the real paradigm shifts will happen when you can convince the U.S. insurance companies that preventive medicine is cheaper than treating illness. That's only a matter of time, but I think that will drive FAR more long term effects than having people's genomes. (If insurance companies gave obese people a personal trainer and cooking lessons, assuming their health issues are diet related, they'd save a bundle in not having to pay for diabetes medicine, heart surgery, and associated costs.... but targeting people for preventive treatment requires much more personal medicine than we have now.)

Other points that were well covered include the effect of computational power as a limiting agent in processing information, the importance of sequencing the right people, and how its impossible to predict where the technology will take us, both morally and scientifically.

Anyhow, as I'm typing this while sitting in other talks:

Inanc Birol, also from the GSC, gave a talk on his work on a new de novo assembler:

80% reconstruction of the C.elegans genome from 30x coverage, which required 6 hours (10 cpu) for data preparation and performing the assembly in less than 10 minutes on a single CPU, using under 4Gb of RAM.

There you go.. the question for me (relevant to the last posting) is "how much of the 20% remaining has poor sequencability?" I'm willing to bet it's the same.

And I just heard a talk on SSAHA_pileup, which seems to try to sort snps. Unfortunately, every SNP caller talk I see always assumes 30X coverage.. How realistic is that for human data? Anyhow, I'm sure I missed something. I'll have to check out the slides on, once they're posted.

And the talks continue....

btw, remind me to look into the fast smith-waterman in cross-match - it sounds like it could be useful.

Labels: , , ,

Tuesday, February 5, 2008

AGBT and Sequencability

First of all, I figured I'd try to do some blogging from ABGT, while I'm there. I don't know how effective it'll be, or even how real-time, but we'll give it a shot. (Wireless in Linux on the Vostro 1000 isn't particularly reliable, and I don't even know how accessible internet will be.)

Second, what I wrote yesterday wasn't very clear, so I thought I'd take one more stab at it.

Sequencability (or mappability) is a direct measure of how well you'll be able to sequence a genome using short reads. Thus, by definition, de novo sequencing of a genome is going to be a direct result of the sequencability of that genome. Unfortunately, when people talk about the sequencability, they talk about it in terms of "X% of the genome is sequencable", which means "sequencability is not zero for X% of the genome."

Unfortunately, even if sequencability is not zero, it doesn't mean you can generate all of the sequences (even if you could do 100% random fragments, which we can't), indicating that much of the genome beyond that magical "X% sequencable" is still really not assemblable. (Wow, that's such a bad word.)

Fortunately, sequencability is a function of the length of the reads used, and as the read length increases, so does sequencability.

Thus, there's hope that if we increase the read length of the Illumina machines, or someone else comes up with a way to do longer sequences with the same throughput (e.g. ABI Solid, or 454's GS FLX), the assemblability of the genome will increase accordingly. All of this goes hand in hand: longer reads and better lab techniques always make a big contribution to the end results.

Personally, I think the real answer lays in using a variety of techniques: Paired-End-Tags to span difficult to sequence areas (eg. low or zero sequencability regions), and Single-End-Tags to get high coverage... and hey throw in a few BACs and ESTs reads for good luck. (=

Labels: , , , , ,

Wednesday, January 30, 2008

Comments on de novo assembly

First off, I'd like to say thanks to Paul and stajich for their comments. Paul for raising several points that I'd like to discuss tonight, and stajich for bringing my attention to the SHRiMP aligner. Obviously, I haven't had much chance to look at SHRiMP, yet, but I'll definitely get around to it.

So, paul mentioned several things that are worth discussing:

The Velvet roadmap for mRNA could be interesting. Sometimes the intron is snipped, sometimes it isn't, get a nice little bubble. And the similar transcripts will do... interesting things.

Short reads would be fine for de-novo if DNA was completely random. Pity that it isn't.

For the most part, I agree with Paul, Using Velvet on a transcriptome would be pretty cool. Yes, you'd get bubbles, but you'd get a lot of nifty information about new transcripts, or intermediate forms of transcripts. This neatly avoids one of the major hurdles I currently face when working with transcriptomes: I can either align against a genome, or a list of known genes, and neither one really gives me much opportunity to identify new genes. (Of course, I'm working on doing just that, with a colleague of mine, but I'm saving that for another post.)

Where I disagree with paul, however, is his final statement, that short reads would be fine for de novo assembly if they were truly random. I have two problems with that: The first is mapability (or sequencability), and the second is the short nature of the reads themselves.

Mappability has been defined in many different ways, but for general purposes, it's the ability to identify a sequenced read that can be unambiguously aligned to that location on a chromosome. Fortunately, with 36-mer reads, as are produced by Illumina 1G's, something like 70-80% of the genome is mappable. Not 100% mappable, but mappable to some degree. This may seem trivial, but it's important.

Why it's important is that a genome with 95% mapability doesn't mean you get a single contig covering 95% of the genome, and a chunk of 5% of your reads that don't cover your genome. It's more like every 20-100kb you'll find something that you just can't assemble over. (I'm guestimating that number, by the way.) This means you have lots of small contigs that have no overlap. That is, of course, assuming you had enough highly mappable reads to do a relatively error free assembly, and, also of course, assuming your sequencing errors haven't completely interfered with your assembly. I don't think either can really be taken for granted.

In reality, the mappability of a genome isn't a property you can figure out until you've sequenced it, so I don't want to discourage anyone from trying. I'm aware of people working on using Illumina reads to do this, albeit they've supplemented the reads with BAC sequencing, ESTs and various other options, which will provide a nice scaffolding. This approach allows them to augment their sequencing power by obtaining a lot of coverage through the Illumina runs, rather than having to use them as the basis of a serious de novo assembly - which seems wise to me. (And even then, they're only doing a fungus - not a higher vertebrate!)

And to return to my earlier point about the mappable genome not being 100% mappable, this is where I think paul's point over simplifies. Although some bases may be 50% mappable, and are thus considered to be "mappable" in common discussion, that means that 50% of the possible sequencing reads in which they're likely to participate will not yield an un-ambiguous fragment! That means you can say goodbye to 50% of the likely fragments which you would need to generate an overlap to span two forming contigs. That, to me, indicates that any de novo assembly is unlikely to correctly proceed past that base, and 50% mappability is not unusual in the genome.

The other point of paul's that I wanted to discuss, was the assertion that the non-random fragmenting of DNA is what's stopping us from doing a serious de novo assembly. While it's true that shearing DNA isn't an entirely random process, it's also true that it doesn't need to be. People have been doing restriction enzyme digests for decades now, in order to map vectors and inserts. (I learned how to do it in grade 11 biology, and that was back in 1994). So yes, while sonication or digests may not be random, what's to stop someone from stripping DNA of it's histones, and then doing 25 different digests? The net effect is just about the same (assuming you pick 25 good restriction enzymes with different base recognitions), and will yield fragments that will get around paul's issue. But does that mean we can now do a de novo assembly?

No... I don't think so. I doubt the lack of a random fragmentation pattern is the limiting factor in de novo assemblies. Unfortunately, the only way to prove myself or paul right or wrong is to simulate this: Take the mouse genome, fragment it into random 36-mers (the same size you get from an Illumina sequencing run), then inject 1 random base for every 1000 bases read (I'll be generous and assume a .1% error rate, though the real thing is closer to 3-5% from what I've seen), and then try running velvet on it.

I bet you'll observe that somewhere around 40x coverage you'll start to saturate and discover that your assembly has covered anywhere from 60-80% of the genome (say 5% of that is just way wrong), and that it'll have taken you longer to do that assembly than it would have to just sequenced the damned thing in the first place with PCR.

Anyhow, I take paul's point to heart: We're still early on in this game, and there are a lot of neat things we can try. I'm looking forward to seeing the results of many of them. (Who has time to try them all?)

By the way, FindPeaks 3.0.1 is done, and I'll be presenting it as a poster at AGBT 2008 this week. If you're interested in ChIP-Seq/Chip-Solexa (or for the pedantic, ChIP-Illumina), come find me, and I'll tell you some of its new features.

Labels: , , , , ,

Tuesday, January 22, 2008

Solexa Transcriptom Shotgun:Transcriptome alignments vs. Genome Alignments

First off, I was up late trying to finish off one of my many projects, so I didn't get a lot of sleep. Thus, if my writing is more incoherent than usual, that's probably why. And now on with the show.

I'm jumping gears a bit, today. I haven't finished my discussion about Aligners, of which I still want to talk about Exonerate in detail, and then discuss some of the other aligners in overview. (For instance, the one that I found out about today, called GMAP, a Genomic Mapping and Alignment Program for mRNA and EST sequences.) Anyhow, the point is that part of the purpose of using an aligner is to align to something in particular, such as a genome or a contig, but selecting what you align your sequences back to is a big issues.

When you're re-sequencing a genome, you map back to the genome template. Yes, you're probably sequencing a different individual, so you'll find occasional sections that don't match, but most humans are ~99% identical, and you can look into SNP (single nucleotide polymorphism) databases to find out if the differences you're seeing are already commonly known changes. Of course, if you're re-sequencing Craig Venter, you don't need to worry about SNPs as much. Fortunately, most of us are sequencing more exciting genomes and so forth.

When you're sequencing a genome you've never sequenced before, you can't do mapping at all. There are various assemblers (i.e., Velvet (written by Daniel Zerbino, who is a lot of fun to hang out with at conferences, I might add... ), SSake (written by Rene Warren, who incidentally also works at the GSC, although several floors above me.), and Euler (which I'd never heard of till I googled the web page for velvet...). The key point: you don't need to worry about what you're mapping back to when you do de novo assembly, since you're creating your own map. I'll digress further for one brief comment: assembly from Solexa/Illumina sequences is a bad idea, because they're so short!

Moving right along, we come to the third thing people are sequencing these days: Transcriptomes. (Yes, I'm ignoring cloned libraries... they're so 90's!) Transriptomes are essentially a snapshot of the mRNA in a set of cells at a given point in time. Of course, mRNA is rather unstable, so protocols have been developed to convert mRNA to cDNA (complementary DNA), which is essentially a copy of the mRNA in DNA form. (Yes, I'm ignoring some complexity here, because it makes for a better story.) But I'm getting ahead of myself. Lets talk about the mRNA, but be aware that the sequencing is actually done on cDNA.

mRNA is an interesting beast. Unlike Genomic DNA, it's a more refined creature. For Eukaryotes, the mRNA is processed by the cell, removing some segments that are non-coding. Once the cell removes these segments (labeled introns), and leaves other segments (labeled exons), we have a sequence of bases that no longer matches the genomic DNA sequence from which it came. Sure, there are short segments that map back very well (i.e. the exons), but if you take a small random snippet from the mRNA, there's a small chance that it might overlap the boundaries between two exons, which means the bases you have won't map back nicely to the genomic point of origin. That can be a serious problem.

Sure, you say, we can do a gapped alignment, and try to find two points where this sequence originated, with one big gap. If you're sophisticated, you'll even know that introns often have signals that indicate their presence most of the time. And yes, you'd be right, we can do that. Unfortunately, for most solexa runs, you get 20,000,000+ sequences. At 10 seconds a sequence (which doesn't seem like much, really), how long would it take to do that alignment?

Too long.

So most of the time, we don't do gapped alignments. Instead, we have two choices:
  1. Align against the genome, and throw away reads that we can't align (i.e. those that over lap intron/exon boundaries.)

  2. Align against a collection of known coding DNA sequences

Number two isn't a bad option: it already has all the introns spliced out, so you don't need to worry about missing those alignments. Unfortunately, there are several issues with this approach:
  • Do we really know all of the coding DNA sequences? For most species, probably not, but this is a great idea for animals like Drosophila. (I tried this yesterday on a fruit fly Illumina run and it worked VERY well.

  • Many transcripts are very similar. This isn't a problem with the data, but with your alignment program. If your aligner doesn't handle multi-matches (like Eland), this will never work.

  • Many transcripts are very similar. Did I say that already? Actually, it causes more problems. How do you know which transcript was really the source of the sequence? I have several ways to get around it, but nothing foolproof yet.

  • Aligning to a transcript is hard to visualize. This is going to be one of my next projects... but with all the fantastic genomes out there, I'm still not aware of a good transcriptome visualization tool.

And that brings us to the conclusion. Aligning a transcriptome run against a genome or against a transcriptome both have serious problems, and there really are no good solutions for this yet.

For now, all I can do is run both: they tell you very different things, but both have fantastic potential. I haven't released my code for either one, yet, but they both work, and if you contact my supervisor, he'd probably be interested in collaborating, or possibly sharing the code.

Labels: , , , , ,