Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Friday, February 26, 2010

AGBT 2010 - Brian Haas - Broad

Genome annotation using mRNA-Seq: A case study of Schizosaccharomyces pombe

Leverage evidence for genome annotation
* eg, 3 ab initio gene predictions

Major challenge:
* lack of high quality evidence
* this is changing with NGS.
* we now have evidence - but we need to standardize and develop algorithms
* reconstructing transcripts is difficult

Approach 1: de novo assembly
* treat them like ESTs
* align to genome

Approach 2: align reads to genome
* reconstruct based on alignments

Sequencing genomes from Schizosaccharomyces
* pombe is model organism - sequenced in 2002
* 12.5Mb, 5k genes, avg gene 1,489 bp
* genome should be well annotated, good quality annotations
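
[A back-of-the-envelope check on those numbers, mine and not from the talk. It assumes "5k" means exactly 5,000 genes and "12.5Mb" means 12,500,000 bp, and that the average gene length applies to all of them. Under those assumptions genes cover roughly 60% of the genome, which suggests neighbouring genes sit close together.]

```python
# Rough gene-density arithmetic from the figures quoted in the talk.
# Assumptions (not from the talk): "5k" = 5,000 genes, "12.5Mb" = 12,500,000 bp.
genome_bp = 12_500_000
n_genes = 5_000
avg_gene_bp = 1_489

genic_bp = n_genes * avg_gene_bp        # total bases covered by genes
genic_fraction = genic_bp / genome_bp   # share of the genome that is genic
bp_per_gene = genome_bp / n_genes       # average genomic "slot" per gene

print(f"genic bases:    {genic_bp:,}")
print(f"genic fraction: {genic_fraction:.1%}")
print(f"bp per gene:    {bp_per_gene:,.0f}")
```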

Seq:
* 44M reads, 65% aligned (Maq)
* align to genome - look good
* challenge is to bring it to high quality automated state

Align: Use TopHat for short read alignment + Cufflinks
Assemble: Velvet/Ananas + GMAP

ELT structures are transferred into PASA, which does refinement, handles alt splicing and validates existing annotations

This is all exploration - This is NOT a tool Bake off.

Elts: Velvet (21167), Cufflinks (4158), Ananas (8309)
Almost all alignments to genome were perfect.

Then, test how many assemblies support full-length gene reconstruction: Ananas did best, Cufflinks 2nd best, Velvet only 1/3 of those done by Ananas.
* Velvet did very well with supporting introns

Problems:
* readthrough and encroachment
* again, Ananas did best, Velvet 2nd best, Cufflinks worst (by a long shot).

Examples given.
* Velvet seems to give fractionated transcripts... it breaks where coverage is high. [Probably seq errors are causing it to break?]
* some annotations needed to be extended
* corrected genes - merging two genes that are really one.

Compare:
* none of these methods are great - they're all missing some that others caught.

Challenges:
* some well covered genomic loci not fully reconstructed (paralogs?)
* intron readthrough/encroachment
* incorrectly merged genes/transcripts
* UTR structures and alt splicing.

For well covered genomic loci not fully reconstructed
* identify disjoint regions
* collect reads and assemble independently
* genome directed to avoid misassembly
* very fast to do this
* This helps, but still have a long way to go.
* more tuning needed (expect to get up to 90%)
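
[The "identify disjoint regions, collect reads" step could look something like the sketch below. This is my own illustration, not the code used in the talk: the function name `disjoint_regions`, the `max_gap` parameter and the half-open `(start, end)` coordinates are all assumptions. It merges overlapping read alignments into separate covered regions and keeps the reads belonging to each, so that each bin could then be handed to an assembler on its own.]

```python
def disjoint_regions(alignments, max_gap=0):
    """Group read alignments into disjoint covered regions.

    alignments: iterable of (start, end) half-open genomic intervals.
    max_gap: hypothetical tolerance; reads starting within this many
             bases of the current region's end are merged into it.
    Returns a list of (region_start, region_end, reads_in_region).
    """
    regions = []
    for start, end in sorted(alignments):
        if regions and start <= regions[-1][1] + max_gap:
            # Overlaps (or touches) the current region: extend it.
            region = regions[-1]
            region[1] = max(region[1], end)
            region[2].append((start, end))
        else:
            # Gap in coverage: start a new region.
            regions.append([start, end, [(start, end)]])
    return [(s, e, reads) for s, e, reads in regions]


# Toy example: five 36 bp reads forming two separate loci.
reads = [(100, 136), (120, 156), (150, 186), (400, 436), (420, 456)]
for region_start, region_end, members in disjoint_regions(reads):
    print(region_start, region_end, len(members))
```

Each region's reads would then be assembled independently; restricting an assembly to reads from one locus is presumably what keeps it fast and avoids mixing paralogs.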

Dissecting merged transcripts.
* use coverage based assembly clipping - break up transcripts
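
[No details were given on how the clipping works, so the following is only a guess at the general idea. The rule used here (cut wherever depth drops below a fraction of the transcript's peak depth) and the names `clip_by_coverage` and `min_frac` are my assumptions, not the method from the talk.]

```python
def clip_by_coverage(coverage, min_frac=0.1):
    """Split a transcript at low-coverage stretches.

    coverage: per-base read depth along the assembled transcript.
    min_frac: hypothetical cutoff, as a fraction of the peak depth.
    Returns half-open (start, end) segments whose depth stays at or
    above the cutoff.
    """
    if not coverage or max(coverage) == 0:
        return []
    threshold = min_frac * max(coverage)
    segments = []
    seg_start = None
    for pos, depth in enumerate(coverage):
        if depth >= threshold:
            if seg_start is None:
                seg_start = pos          # entering a well-covered stretch
        elif seg_start is not None:
            segments.append((seg_start, pos))  # coverage dropped: close segment
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, len(coverage)))
    return segments


# Toy example: two well-covered genes joined by a thin readthrough bridge.
depth = [50] * 5 + [2] * 3 + [40] * 4
print(clip_by_coverage(depth))
```

A fixed fraction of peak depth will struggle when the two merged genes are expressed at very different levels, so a real implementation would likely need something more local.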

Technology will greatly facilitate efforts
* Use stranded mRNA-seq

Summary:
* the information from mRNA-seq is needed for high throughput annotation
* current tools show progress
* still much more to be done in optimization
* need for optimized methods for ALL types of genomes.

4 Comments:

Anonymous Anonymous said...

Hi... not familiar with Ananas. Do you have a reference or link to this software? Thanks!

March 1, 2010 12:28:00 PM PST  
Blogger Anthony Fejes said...

Sorry, I don't have much information on Ananas. I believe it is an in-house development project at the Broad. The only thing I can suggest is to contact Brian Haas about it.

March 1, 2010 12:33:00 PM PST  
Anonymous lcollado said...

I wonder how Oases (by Zerbino et al, in the making) compares to the other tools. I guess that it should do better as it will still handle indels (like Velvet) but have fewer issues with partitioned transcripts.

March 1, 2010 9:40:00 PM PST  
Blogger Anthony Fejes said...

Dr. Haas was very clear that this wasn't intended to be a tool bake-off - just a test to see how well annotation could be done from RNA-Seq. I'm sure there's room for someone to do the tool comparison - must be a good paper in there somewhere.

March 2, 2010 11:37:00 AM PST  
