Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Monday, February 25, 2008

Transcriptome sequencing.

In an earlier comment, Jason Stajich asked:

What I am most curious about is how people are planning to do the statistics of gene expression comparison from the EST sequencing library approach. It made sense to me for the SAGE approach, but how do you get the overall expression for the gene (really you want the per-transcript numbers). Do you assemble and count the union of all tags across a transcript? Do you normalize that by length of the transcript? Do you only count 3' biased tags?


Though I've been taking my time about answering, it's a really good question. I've been working with transcriptomes for a while now, and have a fair amount of experience with them. I don't want to give away all of my secrets, but I can give a few pointers. If anyone wants to collaborate on something, you know where to find me. (-;

So, first things first: with transcriptome sequencing using Illumina-based sequencing, each read you get is presumably from a single molecule of DNA, which presumably came from a single molecule of cDNA in your library. I can't speak for all of the protocols used by the labs here at the Genome Sciences Centre, but the results I've seen have shown a very close correlation with expression levels measured by Nimblegen/Affymetrix DNA arrays, and so I tend to believe that the number of tags we're observing per region (e.g. gene/transcript/exon) is a direct (or nearly direct) measurement of the RNA levels in the general cell population used to create the library.
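To put Jason's question in concrete terms, here's a minimal sketch (in Python, with made-up gene coordinates and read positions) of what counting tags per region boils down to. It's not the code we actually run, just an illustration of treating raw tag counts per gene as the expression measurement.

```python
# Toy illustration: count aligned tags per gene region and treat the counts
# as a (nearly direct) readout of expression. The gene coordinates and read
# positions below are invented for the example.
from collections import defaultdict

# hypothetical annotation: gene -> (chromosome, start, end)
genes = {
    "geneA": ("chr1", 1000, 5000),
    "geneB": ("chr1", 9000, 12000),
}

# hypothetical aligned reads: (chromosome, start position)
reads = [("chr1", 1500), ("chr1", 4200), ("chr1", 9100), ("chr2", 700)]

counts = defaultdict(int)
for chrom, pos in reads:
    for gene, (g_chrom, g_start, g_end) in genes.items():
        if chrom == g_chrom and g_start <= pos <= g_end:
            counts[gene] += 1

for gene in genes:
    print(gene, counts[gene])  # raw tag counts - no length normalization
```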

I should also mention that this is very dependent upon the protocols being used. If your protocol involves amplifying the cDNA by PCR, you're really not going to maintain that relationship. Consult an expert on this subject if you plan to try this at home. (-;

The other questions Jason asked are not quite as straightforward. We have a protocol here at the GSC that gives pretty darn even coverage across transcripts as a whole, which means that transcript end bias is pretty minimal. That pretty much negates the need to count only 3'-biased tags or to correct for bias at all. Of course, this comes down to a lot of lab technique (which is totally outside the scope of my post), as it seems to be dependent on following the appropriate protocols. I've seen libraries that are completely skewed, libraries that perfectly follow the transcript outlines, and libraries somewhere in between. As it stands, I now run my tools over each data set as it arrives, to judge the quality before I ask for more lanes.
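As an aside, a toy version of the kind of end-bias check I mean might look like this (Python, with an invented coverage vector). It isn't my actual QC tooling, just the gist: compare coverage near the 5' end of a transcript against the 3' end.

```python
# Toy end-bias check: compare mean coverage over the 5' and 3' thirds of a
# transcript. The per-base coverage vector below is invented for the example.
def end_bias(coverage):
    """Ratio of mean coverage in the 5' third to mean coverage in the 3' third."""
    third = len(coverage) // 3
    five_prime = coverage[:third]
    three_prime = coverage[-third:]
    mean5 = sum(five_prime) / float(len(five_prime))
    mean3 = sum(three_prime) / float(len(three_prime))
    return mean5 / mean3 if mean3 else float("inf")

coverage = [12, 15, 14, 13, 16, 15, 14, 13, 12, 11, 10, 9]
print("5'/3' coverage ratio: %.2f" % end_bias(coverage))  # ~1.0 means even coverage
```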

So, the short answer is: no, I don't normalize or exclude any data when I deal with transcriptomes, but I'm in the fortunate position of being able to identify (and accept or reject!) which data sets meet a reasonable level of quality before I process them.


Thursday, February 21, 2008

Document Freedom Day

Just another quick note - on the subject of open source and standard document formats, check out
Document Freedom Day, on March 26th, 2008.


Synthetic genomes

A nifty announcement this morning pre-empted my transcriptome post:


Scientists at the J. Craig Venter Institute have succeeded in creating a fully synthetic bacterial genome, which they have named Mycoplasma genitalium JCVI-1.0. This DNA structure is the largest man-made molecule in existence, measuring 582,970 base pairs.


Kind of neat, really. Unfortunately, I think it's putting the cart before the horse. We don't understand 95% of what's actually going on in the genome, so making an artificial genome is more like having a Finnish person make a copy of the English dictionary by leaving out random words (just one or two), and then seeing if Englishmen can still have a decent conversation with what he's left them. When he finds that leaving out two words still results in a reasonable discussion on toothpaste, he declares he's created a new dialect.

Still, it's an engineering feat to build a genome from scratch, much like the UBC engineers hanging VW bugs off of bridges. Pointless and incomprehensible, but neat.


Tuesday, February 19, 2008

Jason's comment

I received a comment on my posting last night from Jason Stajich, which (as always) is right on the money. To quote the main theme I want to reply to:

While I am all for people writing their own software, writing your own aligner seems more difficult than figuring out how to get what you want from a published (and used format) like ACE assembly files.


That's absolutely correct. I think writing my own aligner is probably the last thing I want to do. While I have the source for at least our own internal aligner on my desktop this week (and have been doing some serious hacking on it), I really don't want to have to build my own. It's a distraction from my real research, it's a pain to support, it's going to have to compete with much larger groups doing similar things, and it's probably not a long-term viable option. That said, I still think the current state of affairs in the realm of aligners is pretty poor - though that's not to say I think I could do it better. I'll leave writing assemblers to the professionals - or at least those who have more time to do this stuff.

Still, maybe I can offer one helpful suggestion to people writing aligners: get together and use a standard format! I'm less inclined to test Yet Another Aligner when I have to write another filter to interpret the results. There are even currently efforts under way to create a unified format for aligned reads and short-read data. (Disclaimer: I've only read a couple of handfuls of posts, and not all of them appear to be on topic.)

The formats we're familiar with (such as Fasta) aren't really designed for this type of work. (That hasn't stopped several aligners from using them, however - which is still better than creating their own file formats, I might add.) What's actually needed is a purpose-built format.

I'm sure you can guess from this rant that I'm a big fan of stuff like the OpenDocument Format (ODF), which is levelling the playing field for word processing documents (despite Microsoft's best efforts to remain the dominant force at all costs), but even a limited (i.e. non-ISO-standardized) approach could make a huge impact in this area.

Why not have a two-day get-together for all the people building aligners, and decide on a subset of formats? Make it versioned, so that it can change over time, and elect a maintainer of the official standard. While you're at it, decide on a few conventions - what's a U0 hit? (Slider can have up to 9 SNPs and still call it a U0...) What is a read start? How do you represent stranded genomic information - which end is really the "start"?
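I'm not going to pretend to draft the spec in a blog post, but just to show how small it could be, here's a strawman in Python: a versioned, tab-delimited record with the handful of fields every short-read aligner already reports. The field names and version header are invented for the example - the point is only that agreeing on something this simple would be enough.

```python
# Strawman "unified aligned-read" record: versioned, tab-delimited, built from
# the fields every short-read aligner already reports. The field names and the
# version tag are invented for this example - this is not a real spec.
FORMAT_VERSION = "0.1"
FIELDS = ["read_id", "reference", "start", "strand", "mismatches", "match_type"]

def format_record(read_id, reference, start, strand, mismatches, match_type):
    # 'start' is the leftmost base on the forward strand, so the convention
    # doesn't depend on which strand the read hit.
    return "\t".join([read_id, reference, str(start), strand,
                      str(mismatches), match_type])

def parse_record(line):
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))

header = "#aligned_reads\tversion=%s" % FORMAT_VERSION
record = format_record("read_0001", "chr12", 128344, "+", 0, "U0")
print(header)
print(record)
print(parse_record(record))
```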

Anyhow, I know it's wishful thinking, but hey, maybe this rant will encourage a few people to band together and come up with something to make all of our lives easier. Even something as simple as Fasta freed up thousands (millions?) of man-hours (or woman-hours) for bioinformaticians, since they no longer need a custom format library for every new program they use. Maybe it's time to do the same for aligned data.

P.S. I'll answer Jason's questions on EST/transcriptome approaches in my next post.


Monday, February 18, 2008

Aligning DNA - comments from above

I've been pretty bad about continuing my posts on how the different aligners work. It's a lot of work keeping up with them, since I seem to hear about a new one each week. However, a post-doc in my lab gave a presentation contrasting the various aligners, discussing each of their strengths and weaknesses for doing short (Illumina) read alignments.

Admittedly, I don't know how accurate the presenter's data was - most of the presentation was being used to set up his own in-house aligner development, and thus all of the aligners were painted in a poor light, except his, of course. That being said, there's some truth to what he found: most of the aligners out there have some pretty serious issues.

Eland is still limited by its 32-base limit, which you'd think they'd have gotten past by now. For crying out loud, the company that produces it is trying to sell kits for doing 36-base alignments. It's in their best interest to have an aligner that does more than 32 bases. (Yes, they have a new work-around in their Gerald program, but it's hardly ideal.)

MAQ, apparently, has a weird "feature" that if multiple alignments are found, it just picks one at random as the "best". Hardly ideal for most experiments.

Mosaik provides output in .ace files - which are useless for any further work, unless you want to reverse engineer converters to other, more reasonable, formats.

SOAP only aligns against the forward strand! (How hard can it be to map the reverse complement???)
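For the record, the whole operation is a lookup table and a reversal - here it is in a few lines of Python:

```python
# Reverse-complement a read so it can be checked against the forward strand
# as well: complement each base, then reverse the sequence.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

print(reverse_complement("ACCTGN"))  # prints NCAGGT
```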

Exonerate is great when run in "slow mode", at which point it's hardly usable for 40M reads, and when it's run in "fast mode", its results are hardly usable at all.

SHRiMP, I just don't know enough about to comment on.

And yes, even the post-doc's in-house aligner (called Slider) has some serious issues: it's going to miscall all SNPs, unless you're aligning fragments from the reference sequence back to itself. (That's not counting the 20 hours I've already put in to translate the thing to proper Java, patch memory leaks, and the like...)

Seriously, what's with all of these aligners? Why hasn't anyone stepped up to the plate and come up with a decent open-source aligner? There have got to be hundreds of groups out there who are struggling to make these work, and not one of them is ideal for use with Illumina reads. Isn't there one research group out there dog-fooding their own Illumina sequence aligner?

At this rate, I may have to build my own. I know what they say about software, though: You can have fast, efficient or cheap - pick any two. With aligners, it seems that's exactly where we're stuck.


Sunday, February 10, 2008

Just overheard

Just announced at the Southwestern Florida Airport:

"Would the person who left behind a black suitcase, containing a red suitcase, containing a pair of boots.... please claim it at security."

That's some carry-on luggage.


Saturday, February 9, 2008

Pacific Biotech new sequencing technology

I have some breaking news. I doubt I'm the first to blog this, but I find it absolutely amazing, so I had to share.

Steve Turner from Pacific Biosciences (PacBio) just gave the final talk of the AGBT session, and it was damn impressive. They have a completely new method of doing sequencing that uses DNA polymerase as a sequencing engine. Most impressively, they've completed their proof of concept, and they presented data from it in the session.

The method is called Single Molecule Real Time (SMRT) sequencing. It's capable of producing 5,000-25,000 base pair reads, at a rate of 10 bases/second. (They apparently have 25 bases/second techniques in development, and expect to release when they have 50 bases/second working!)

The machinery has zero moving parts, and once everything is in place, they anticipate that they'll have a sequencing rate of greater than 100 Gb per hour! As they are proud to mention, that's about a full draft genome for a human being in 15 minutes, and at a cost of about $100. Holy crap!
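The back-of-the-envelope math on that claim works out, if you assume a roughly 3 Gb human genome and something like 8x coverage for a draft (my assumptions, not theirs):

```python
# Back-of-envelope check on the claimed throughput. Genome size and draft
# coverage are my own rough assumptions, not PacBio's numbers.
throughput_gb_per_hour = 100.0
genome_size_gb = 3.0
draft_coverage = 8.0

bases_needed_gb = genome_size_gb * draft_coverage        # ~24 Gb
hours_needed = bases_needed_gb / throughput_gb_per_hour  # ~0.24 hours
print("Minutes for a draft human genome: %.0f" % (hours_needed * 60))  # ~14
```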


Thursday, February 7, 2008

AGBT post #2.

Good news.. my bag arrived! I'm going to go pick it up after the current session, and finally get some clean clothes and a shave. Phew!

Anyhow, on the AGBT side of things, I just came back from the Pacific Biosciences panel discussion, which was pretty neat. The discussion was on "how many base pairs will it take to enable personalized medicine?" A topic I'm really quite interested in.

The answers stretched from infinite, to 6 billion, to 100 TB, to 100 people (if they can pick the right person), to 1 (if they find the right one). It was a pretty decent discussion, covering things from American politics, to SNP finding, to healthcare... you get the idea. The moderator was also good, the host of a show (Biotechworld?) on NPR.

My one problem is that in giving their answers, they brushed over several key points, but never really followed up on them.

1) Just having the genome isn't enough. Stuff like transcription factor binding sites, methylation, regulation, and so forth are all important. If you don't know how the genome works, personal medicine applications aren't going to fall out of it. (Elaine Mardis did mention this, but there was little discussion of it.)

2) Financial aspects will drive this. That, in itself, was mentioned, but the real paradigm shift will happen when you can convince the U.S. insurance companies that preventive medicine is cheaper than treating illness. That's only a matter of time, but I think that will drive FAR more long-term effects than having people's genomes. (If insurance companies gave obese people a personal trainer and cooking lessons, assuming their health issues are diet related, they'd save a bundle in not having to pay for diabetes medicine, heart surgery, and associated costs... but targeting people for preventive treatment requires much more personal medicine than we have now.)

Other points that were well covered include the effect of computational power as a limiting agent in processing information, the importance of sequencing the right people, and how it's impossible to predict where the technology will take us, both morally and scientifically.

Anyhow, I'm typing this while sitting in other talks:

Inanc Birol, also from the GSC, gave a talk on his work on a new de novo assembler:

80% reconstruction of the C. elegans genome from 30x coverage: data preparation took 6 hours on 10 CPUs, and the assembly itself ran in less than 10 minutes on a single CPU, using under 4 GB of RAM.

There you go.. the question for me (relevant to the last posting) is "how much of the 20% remaining has poor sequencability?" I'm willing to bet it's much the same set of regions.

And I just heard a talk on SSAHA_pileup, which seems to try to sort out SNPs. Unfortunately, every SNP caller talk I see always assumes 30x coverage.. How realistic is that for human data? Anyhow, I'm sure I missed something. I'll have to check out the slides on slideshare.net, once they're posted.

And the talks continue....


btw, remind me to look into the fast Smith-Waterman in cross-match - it sounds like it could be useful.


AGBT post #1.

I'm here.. and online. I almost didn't make it, thanks to bad weather in Florida, but at least the car we rented didn't break down on the road, the way the other group's did. Apparently the police saved them from the alligators and wild pigs... No one can say AGBT hasn't been exciting, so far.

Anyhow, lots of good topics, and meeting interesting people already. (I'm even sitting beside an exec from Illumina, in the ABI sponsored lunch.. how's that for irony?) Anyhow, I'm excited to start the poster sessions and get some good discussions going.

Unfortunately, I missed two of the talks this morning, while I negotiated with the good people at United airlines to have my bag delivered. The three others I've seen so far have been good. Some interesting points:

The best graphics are the ones with the two DNA strands shown separately. Too cool - must include that in my FindPeaksToR scripts.

Loss or gain of homozygosity can screw up what you think you have, compared to what's really there. Many models assume you have only one copy of a gene, or just don't really do much to make sense of these events.

From David Cox, I learned that Barcoding isn't new (it's not), but that it usually doesn't work well (I can't prove that), but hey, they got it to work (and that's good!).

And yes, my favorite line from David Cox's presentation was something like: 900 PCR products, 90 people['s samples]... 1 tube. Make sure you don't drop it!

Anyhow, I'm getting lots of ideas, and I'm thoroughly enjoying this conference. I'm saturated in next-gen sequencing work.

Anyhow, if anyone else is reading this, my poster is #38... feel free to come by and talk.


Tuesday, February 5, 2008

AGBT and Sequencability

First of all, I figured I'd try to do some blogging from AGBT while I'm there. I don't know how effective it'll be, or even how real-time, but we'll give it a shot. (Wireless in Linux on the Vostro 1000 isn't particularly reliable, and I don't even know how accessible the internet will be.)

Second, what I wrote yesterday wasn't very clear, so I thought I'd take one more stab at it.

Sequencability (or mappability) is a direct measure of how well you'll be able to sequence a genome using short reads. Thus, by definition, how well de novo sequencing of a genome works is going to be a direct result of the sequencability of that genome. Unfortunately, when people talk about sequencability, they talk about it in terms of "X% of the genome is sequencable", which really means "sequencability is not zero for X% of the genome."

Unfortunately, even if sequencability is not zero, that doesn't mean you can generate all of the sequences (even if you could do 100% random fragments, which we can't), meaning that much of the genome, despite that magical "X% sequencable" figure, is still really not assemblable. (Wow, that's such a bad word.)

Fortunately, sequencability is a function of the length of the reads used, and as the read length increases, so does sequencability.
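To make that concrete, here's a toy way to measure it in Python: for a read length k, call a position sequencable if the k-mer starting there occurs exactly once in the genome, and watch how the fraction changes as k grows. A real calculation would also consider the reverse strand and allowed mismatches - this is just the shape of the idea, on a made-up genome.

```python
# Toy sequencability/mappability: the fraction of positions whose k-mer is
# unique in the genome, as a function of read length k. The "genome" is a
# short invented string; real calculations also handle the reverse strand
# and a mismatch tolerance.
from collections import Counter

genome = "ACGTACGTTTACGTGGGACGTACGTCCC"  # toy genome with repeats

def sequencability(genome, k):
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / float(len(kmers))

for k in (4, 8, 12):
    print("read length %2d: %.0f%% of positions unique" %
          (k, 100 * sequencability(genome, k)))
```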

Thus, there's hope that if we increase the read length of the Illumina machines, or someone else comes up with a way to do longer sequences with the same throughput (e.g. ABI SOLiD, or 454's GS FLX), the assemblability of the genome will increase accordingly. All of this goes hand in hand: longer reads and better lab techniques always make a big contribution to the end results.

Personally, I think the real answer lies in using a variety of techniques: Paired-End-Tags to span difficult-to-sequence areas (e.g. low or zero sequencability regions), and Single-End-Tags to get high coverage... and hey, throw in a few BAC and EST reads for good luck. (=
