AGBT 2010 - Brian Haas - Broad
Leverage evidence for genome annotation
* eg, 3 ab initio gene predictions
* lack of high quality evidence
* this is changing with NGS.
* we now have evidence, but we need to standardize and develop algorithms
* reconstructing transcripts is difficult
Approach 1: de novo assembly
* treat the assembled transcripts like ESTs
* align to genome
Approach 2: align reads to genome
* reconstruct based on alignments
Sequencing genomes from Schizosaccharomyces
* pombe is model organism - sequenced in 2002
* 12.5Mb, 5k genes, avg gene 1,489 bp
* genome should be well annotated, good quality annotations
* 44M reads, 65% aligned (Maq)
* alignments to genome look good
* challenge is to bring it to high quality automated state
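As a quick sanity check on the numbers above, here is a back-of-the-envelope sketch. It assumes the rounded figures from these notes (12.5 Mb, 5k genes, 1,489 bp average gene, 44M reads, 65% aligned) and treats "avg gene" as the genomic span of a gene, which the talk did not specify. The function names are made up for illustration.

```python
# Rounded figures taken from the notes; not exact values from the talk.
GENOME_BP = 12_500_000
N_GENES = 5_000
AVG_GENE_BP = 1_489
TOTAL_READS = 44_000_000
ALIGNED_FRACTION = 0.65


def genic_fraction(n_genes: int, avg_gene_bp: int, genome_bp: int) -> float:
    """Fraction of the genome covered by genes, assuming no overlap."""
    return n_genes * avg_gene_bp / genome_bp


def aligned_reads(total_reads: int, fraction: float) -> int:
    """Number of reads that aligned, rounded to the nearest read."""
    return round(total_reads * fraction)


genic = genic_fraction(N_GENES, AVG_GENE_BP, GENOME_BP)
aligned = aligned_reads(TOTAL_READS, ALIGNED_FRACTION)

print(f"genic fraction: {genic:.1%}")
print(f"aligned reads:  {aligned:,}")
```

Under these assumptions roughly 60% of the genome is genic, so most aligned reads should land in or near annotated genes.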
Align: Use TopHat for short read alignment + Cufflinks
Assemble: Velvet/Ananas + GMAP
ELT structures transferred into PASA, which does refinement, alt splicing, and validation of existing annotations
This is all exploration. This is NOT a tool bake-off.
Elts: Velvet (21167), Cufflinks (4158), Ananas (8309)
Almost all alignments to genome were perfect.
Then, test how many assemblies support full-length gene reconstruction: Ananas did best, Cufflinks 2nd best, Velvet only 1/3 of those done by Ananas.
* Velvet did very well with supporting introns
* readthrough and encroachment
* again, Ananas did best, Velvet 2nd best, Cufflinks worst (by a long shot).
* Velvet seems to give fractionated transcripts, with breaks where coverage is high. [Probably seq errors are causing it to break?]
* some annotations needed to be extended
* corrected genes - merging two genes that are really one.
* none of these methods are great - they're all missing some that others caught.
* some well covered genomic loci not fully reconstructed (paralogs?)
* intron readthrough/encroachment
* incorrectly merged genes/transcripts
* UTR structures and alt splicing.
For well covered genomic loci not fully reconstructed
* identify disjoint regions
* collect reads and assemble independently
* genome directed to avoid misassembly
* very fast to do this
* This helps, but still have a long way to go.
* more tuning needed (expect to get up to 90%)
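The first two steps above (identify disjoint regions, collect reads per region) can be sketched as plain interval merging. This is only an illustration, not the speaker's pipeline: it assumes alignments are half-open `(start, end)` intervals on a single chromosome, and `disjoint_regions`, `reads_by_region`, and `max_gap` are hypothetical names.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def disjoint_regions(alignments: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge overlapping (or nearly adjacent) alignments into disjoint regions."""
    regions: List[Interval] = []
    for start, end in sorted(alignments):
        if regions and start <= regions[-1][1] + max_gap:
            # Extends the current region.
            prev_start, prev_end = regions[-1]
            regions[-1] = (prev_start, max(prev_end, end))
        else:
            # Starts a new region.
            regions.append((start, end))
    return regions


def reads_by_region(
    alignments: List[Interval], regions: List[Interval]
) -> Dict[Interval, List[Interval]]:
    """Bucket each alignment under the region that contains it."""
    buckets: Dict[Interval, List[Interval]] = {r: [] for r in regions}
    for aln in alignments:
        for region in regions:
            if region[0] <= aln[0] and aln[1] <= region[1]:
                buckets[region].append(aln)
                break
    return buckets


alignments = [(100, 136), (120, 156), (150, 186), (400, 436), (420, 456)]
regions = disjoint_regions(alignments)
buckets = reads_by_region(alignments, regions)
print(regions)
```

Each bucket could then be handed to an assembler on its own, which is what keeps the assembly genome directed. The linear scan in `reads_by_region` is for readability only; a real data set would need an indexed lookup.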
Dissecting merged transcripts.
* use coverage based assembly clipping - break up transcripts
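The talk did not spell out the clipping rule, so the following is only a minimal sketch of one way coverage-based clipping could work: split a transcript wherever per-base read depth falls below a fixed threshold. The threshold, the `min_len` filter, and the function name are all assumptions.

```python
from typing import List, Tuple


def clip_by_coverage(
    coverage: List[int], min_cov: int, min_len: int = 1
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) segments where depth stays >= min_cov.

    Segments shorter than min_len are dropped.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= min_cov:
            if start is None:
                start = pos  # open a new segment
        elif start is not None:
            if pos - start >= min_len:
                segments.append((start, pos))
            start = None  # coverage dipped: close the segment
    # Close a segment that runs to the end of the transcript.
    if start is not None and len(coverage) - start >= min_len:
        segments.append((start, len(coverage)))
    return segments


# Toy merged transcript: two well-covered genes joined by a low-coverage bridge.
coverage = [30, 32, 31, 29, 2, 1, 2, 40, 42, 41]
segments = clip_by_coverage(coverage, min_cov=5)
print(segments)
```

A fixed threshold is the simplest choice. A threshold relative to the flanking coverage would probably behave better when neighbouring genes differ a lot in expression.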
Technology will greatly facilitate efforts
* Use stranded mRNA-seq
* the information from mRNA-seq is needed for high throughput annotation
* current tools show progress
* still much more to be done in optimization
* need for optimized methods for ALL types of genomes.