Adam Siepel, Biological Statistics & Computational Biology – Comparative Analysis of 2x genomes: Progress, challenges and opportunities.
Placental mammals (eutherians) are well sequenced. The last of the 2x assemblies were released just last week. There are 22 genomes being focussed on: most are 2x, a couple are 7x, and some are in the process of being ramped up.
One of the main obstacles is error (sequencing or otherwise). Miscalled bases and indels from erroneous sequences have a big impact, so the goal is to clean up the 2x sets. In 120 bases, there were 5 spurious indels and 7 miscalled bases. [Wow, that's a lot of error.] Nearly a third of all 1-2 base indels are spurious.
Thus, comparative genomics often gets hit hard by these errors.
A solution: error correction, using redundancy to systematically reduce error. In some sense, there is a version built in: we can use comparative genomics to “decode” the error-correcting code. This can be done because the changes between species tend to vary in predictable ways.
The core idea is indel imputation: “Lineage-specific indels in low-quality sequence are likely to be spurious.”
Do an “automatic reconstruction” using parsimony... If a lineage-specific indel falls in low-quality sequence, then assume it's an error. More computationally intense methods are actually not much more effective.
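To make the rule concrete, here is a minimal sketch of how it might look in code. This is not the speaker's implementation: the function name `impute_indels`, the dictionary-based alignment format, and the per-column `low_quality` flags are all assumptions for illustration. It also skips the phylogeny entirely and treats an indel as "lineage-specific" only when exactly one species disagrees with all the others in a column, which is the simplest case a parsimony reconstruction would flag.

```python
from typing import Dict, List

GAP = "-"


def impute_indels(alignment: Dict[str, str],
                  low_quality: Dict[str, List[bool]]) -> Dict[str, str]:
    """Undo single-species indels that fall in low-quality sequence.

    alignment:   species name -> aligned sequence (equal lengths, '-' for gaps)
    low_quality: species name -> one flag per alignment column; True means the
                 surrounding read quality is low (hypothetical input format)
    """
    species = list(alignment)
    length = len(alignment[species[0]])
    result = {name: list(seq) for name, seq in alignment.items()}

    for col in range(length):
        gapped = [s for s in species if alignment[s][col] == GAP]
        ungapped = [s for s in species if alignment[s][col] != GAP]

        if len(gapped) == 1 and len(ungapped) >= 2:
            # Apparent deletion in one species only: if it is low quality,
            # assume it is spurious and restore an unknown base.
            lone = gapped[0]
            if low_quality[lone][col]:
                result[lone][col] = "N"
        elif len(ungapped) == 1 and len(gapped) >= 2:
            # Apparent insertion in one species only: if it is low quality,
            # assume it is spurious and drop the extra base.
            lone = ungapped[0]
            if low_quality[lone][col]:
                result[lone][col] = GAP

    return {name: "".join(chars) for name, chars in result.items()}


example = {"human": "ACGTA", "mouse": "ACGTA", "tenrec": "AC-TA"}
flags = {
    "human": [False] * 5,
    "mouse": [False] * 5,
    "tenrec": [False, False, True, False, False],
}
corrected = impute_indels(example, flags)
```

In this toy alignment the single-column gap in the low-coverage species is filled with an N, while the same gap in high-quality sequence would be left alone as a plausible real deletion.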
There is also base masking: don't try to guess what the bases should be, but just change them to N's. Doing these things may change reading frames, however.
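Base masking is simple enough to sketch in a few lines. The version below is only an illustration: the function name `mask_low_quality`, the use of per-base integer quality scores, and the cutoff of 20 are my assumptions, not details given in the talk.

```python
from typing import List


def mask_low_quality(seq: str, quals: List[int], min_qual: int = 20) -> str:
    """Replace every base whose quality is below min_qual with 'N'.

    min_qual=20 is an arbitrary illustrative cutoff, not a value from the talk.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must be the same length")
    return "".join(base if q >= min_qual else "N"
                   for base, q in zip(seq, quals))


masked = mask_low_quality("ACGT", [30, 10, 25, 5])
```

Masking on its own leaves the sequence length unchanged; only adding or removing bases, as indel correction does, can shift a reading frame.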
After doing the error correction, the error appears to drop dramatically. (I'm not sure what the metric was, however.)
Summary: good dataset with some error. The correction method used here is a “blunt instrument”, but many or most errors can be masked or corrected if some over-correction is allowed. There is a trade-off, of course.
Conservation analysis has its own problems as well. Thus, they have been working on new programs for this type of work: PhyloP, which has multiple algorithms for scaling phylogeny and the like. Extensive evaluations of power for these methods were undertaken. However, the problem is that people are at the limit of what they can get out of conservation, depending on what's there. Power is pretty reasonable when selection is strong, or when the elements are longer (e.g. 3 bp).
Discussing uses of conservation.... moving towards single base pair resolution.
Labels: AGBT 2009