Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: - Please come visit my blog there.

Sunday, February 28, 2010

Complete Genomics, Revisited (Feb 2010)

While I'm writing up my notes on my way back to Vancouver, I thought I'd include one more set of notes - the ones I took while talking to the Complete Genomics team.

Before launching into my notes (which won't really be in note form), I should give the backstory on how this came to be. Normally, I don't do interviews, and I was very hesitant about doing one this time. In fact, the format came out more like a chat, so I don't mind discussing it - with Complete Genomic's permission.

Going back about a week or so, I received an email from someone working on PR for Complete Genomics, inviting me to come talk with them at AGBT. They were aware of my blog post from last year, written after discussing some aspects of their company with several members of the Complete Genomics team.

I suppose in the world of marketing, any publicity is good publicity, and perhaps they were looking for an update for the blog entry. Either way, I was excited to have an opportunity to speak with them again, and I'm definitely happy to write what I learned. I won't have much to contribute beyond what they've discussed elsewhere, but hey, not everything has to be new, right?

In the world of sequencing, who is Complete Genomics? They're clearly not 2nd generation technology. Frankly, their technology is the dinosaur in the room. While everyone else is working on single molecule sequencing, Complete Genomics is using technology from the stone age of sequencing - and making it work.

Their technology doesn't have any bells and whistles - and in fact, the first time I saw their ideas, I was fairly convinced that it wouldn't be able to compete in the world of the Illuminas and Pac Bios... and all the rest. Actually, I think I was right. What I didn't know at the time was that they don't need to compete. They're clearly in their own niche - and they have the potential to become the 300 pound gorilla.

While they're never going to be the nimble or agile technology developers, they do have a shot at dominating the market they've picked: Low cost, standardized genomics. As long as they stick with this plan - and manage to keep their cost lower than everyone else, they've got a shot... Only time will tell.

A lot of my conversation with Complete Genomics revolved around the status of their technology - what it is that they're offering to their customers. That's old hat, though. You can look through their web page and get all of the information - you'll probably even get more up to date information - so go check it out.

What is important is that their company is based on well developed technology. Nothing that they're doing is bleeding edge, nothing is going to be a surprise show stopper: of all of the companies doing genomics, they're the only one that can accurately chart the path ahead with clear vision. Pac bio may never solve their missing base problem, Illumina may never get their reads past 100bp, Life Tech may never solve their dark base problem, and Ion Torrent may never have a viable product. You never know... but Complete Genomics is the least likely to hit a snag in their plans.

That's really the key to their future fate - there are no bottle necks to scaling up their technology. We'll all watch as they bring down the distance between the spots on their chips, lower the amount of reagent required, and continue to automate their technology. It's not rocket science - it's just engineering. Each time they drop the scale of their technology down, they also drop the cost of the genome. That's clearly the point - low cost.

The other interesting thing about their company is that they've really put an emphasis on automation and value-added services. Their process is one of the more hands off processes out there. It's an intriguing concept. You fed-ex the DNA to them, and you get back a report. Done.

Of course, I have to say that while this may be their strength, it's probably also one of their weaknesses. As a scientist, I don't know that the bioinformatics of the field are well enough developed yet that I trust someone to do everything from alignment to analysis on a sample for me. I've seen aligners come and go so many times in the last 3 years that I really believe that there is value in having the latest modifications.
What you're getting from Complete Genomics is a snapshot of where their technology is at the moment you (figuratively) click the "go" button. Researchers like do play with their data, revisit it, optimize it and squeeze every last drop out of it - something that is not going to be easy with a Complete Genomics dataset. (They aren't sharing their tools..) However, as I said earlier, they're not in the business of competing with the other sequencing companies - so really, they may be able to side step this weakness entirely by just not targeting those people who feel this way about genomic data.

And that also brings me to their second weakness - they are fixated on doing one thing, and doing it well. That's often the sign of a good start-up company: a dogged pursuit of a single goal of excellence in one endeavour. However, in this one case, I disagree with Dr. Dramanac. Providing complete genomes is only part of the picture. In the long run, genomic information will have to be placed in the context of epigenetics, and so I wonder if this is an avenue that they'll be forced to travel in the future. For the moment, Dr. Drmanac insists that this is not something they'll do. If they haven't put any thought into it, when it does become necessary, it's something that will drive customers towards a company that can provide that information. Not all research questions can be solved by gazing into genomic sequences, and that's a reality that could bite them hard.

For the moment, at least, Complete Genomics is well positioned to work well with researchers who don't want to do the lab and bioinformatics tweaking themselves. You can't ask a microbiology lab to give up their PCR machine, and sequencing centres will never drop the 2nd (and now 3rd) generation technology lab to jump on board the 1st generation sequencing provided by Complete Genomics. Despite the few centres that have ordered a few genomes (wow.. I can't just believe I said "a few genomes"), I don't see any of them committing to it in the long run for all of the reasons I've pointed out above.

However, Complete Genomics could take over genomic testing for pharma or hospital diagnostics. Whoever is best able to identify variations (structural or otherwise) in genomes for the lowest cost will be the best bet to do cohort studies for patient stratification studies - and hey, maybe they'll be the back end for the next 23andMe.

So, to conclude, Complete Genomics has impressed me with their business model, and they have come to know themselves well. I'll never understand why they think AGBT is the right conference to showcase their company, when it's not likely to yield that many customers in the long run. But, I'm glad I've had the chance to watch them grow. Although they may be a dinosaur in the technology race, the T-Rex is still a fearsome beast, and I'd hate to meet one in a dark alley.


Saturday, February 27, 2010

AGBT 2010 - Illumina Workshop

[I took these notes on a scrap of paper, when my laptop was starting to run low on batteries. They're less complete than most of the other talks I've taken notes on, but should still give the gist of the talks. Besides, now that I'm at the airport, it's nice to be able to lose a few pieces of scrap paper.]

Introducing the HiSeq 2000(tm)
- redefining the trajectory of sequencing

First presentation:
- Jared from Marketing

Overview of machine.
- real data of Genome and transcriptome
- more than 2 billion base pairs per run
- more than 25Gb per day
- uses line scanning (scan in rows, like a photocopier, instead of a whole picture at once, like a camera)
- now uses "dual surface engineering": image both the top and bottom surface, which means you have twice as much area to form clusters
- Machine holds two individual flow cells
- flow cells are held in by a vacuum
- simple insertion - just toggle a switch through three positions - an LED lights up when you've turned it on.
- preconfigured reagenets - bottles all stacked together: just push in the rack
- touch screen user interface
- "wizard" like set up for runs
- realtime metrics available on interface - even an ipod app (available for ipad too..)
- multimedia help will walk you through things you may not understand.
- major focus on ease of use
- it has the "simplest workflow" of any of the sequencing machines available
- tile size reduced [that's what I wrote but I seem to recall him saying that the number of tiles is smaller, but the tiles themselves are larger?]
- 1 run can now do a 30x coverage for a cancer and a normal (one in each flow cell.)
- 2 methylomes can be done in a week
- you could do 20 RNA-Seq experiments in 4 days.

Next up:
David Bently

Major points:
- error rates and feel of data are similar if not identical to the GAIIx.
- from a small sampling of experiments shown it looks like error rate is very slightly higher
- Demonstrated 300Gb/run, more than 25Gb per day at release
- PET 2x100 supported.
- Software is same for GAII [Although somewhere in the presentation, I heard that they are working on a new version of the pipeline (v 1.6?)... no details on it, tho.]

Next up:
Eliot Margulies, NHGRI/NIH Sequencing
- talking about projects today for the undiagnosed disease program

work flow
- basically same as in his earlier talk [notes are already posted.]
- use cross match to do realignment of reads that don't map first time
- use MPG scores

[In a technology talk, I didn't want to take notes on the experiment itself... mainly points are on the HiSeq data.

Data set: concordance with SNP Chips was in the range of 98% for each flow cell, 99% when both are combined (72x coverage)

- Speed: Increased throughput
- more focus on biology rather than on tweaking pipelines and bioinformatic processing. (eg, biological analysis takes front seat.)

Next Up:
Gary Schroth

Working on a project for Body Map 2.0 : Total human transcriptome
- 16 tissues, each PET 2x50bp, 1x75bp
- $8,900 for 1x50bp
- multiplexing will reduce cost further.
- if you only need 7M reads, you could mutliplex 192 samples (on both cells, I assume), and the cost would be $46. (including seqeuncing, not sample prep.

[which just makes the whole cost equation that much more vague in my mind... Wouldn't it be nice to know how much it costs to do the whole process?]

[Many examples of how RNA-seq looks on HiSeq 2000 (tm)]

- output has 5 billion reads, 300Gb of data.

Next up:
David Bently

Present a graph
- amount of sequence per run.
- looks like a "hockey stick graph"

[Shouldn't it be sequence per machine per day? It'd still look good - and wouldn't totally shortchange the work done on the human genome project. This is really a bad graph.... at least put it on a log scale.]

In the past 5 years:
- 10^4 scale in throughput
- 10^7 scale up in parallelizations

Buzzwords about the future of the technology:
- "Democratizating sequencing"
- "putting it to work"


AGBT 2010 - Complete Genomics Workshop

Complete Genomics CEO:

- sequence only human genomes - 1 Million genomes in the next 5 years
- build out tools to gain a good undertanding of the human genome
- done 50 genomes last year
- Recent Science publication
- expect to do 500 genomes/month

Lots of Customers.
- Deep projects

- don't waste pixels,
- use ligases to read
- very high quality reads - low cost reagents
- provide all bioinformatics to customers

- don't sell technology, just results.
- just return all the processed calls (snps, snv, sv, etc)
- more efficient to outsource the "engineering" for groups who just want to do biology
- fedex sample, get back results.
- high throughput "on demand" sequencing
- 10 centres around the world
- Sequence 1 Million genomes to "break the back" of the research problem

Value add
- they do the bioinformatics

- first wave: understand functional genomics
- second wave: pharmaceutical - patientient stratification
- third wave: personal genomics - use that for treatment

Focus on research community

Two customers to present results:
First Customer:

Jared Roach, Senior Research Sceintist, Institute for Systems Biology (Rare Genetic disease study)

Miller Syndrome
- studied coverage in four genomes
- 85-92% of genome
- 96% coverage in at least one individual
- Excellent coverage in unique regions.

Breakpoint resolution
- within 25bp, and some places down to 10bp
- identified 125 breakpoints
- 90/125 occur at hotspots
- can reconstruct breakpoints in the family

Since they have twins, they can do some nice tests
- infer error rate: 1x10^-5
- excluded regions with compression blocks (error goes up to 1.1^-5)
- Homozygous only: 8.0x10^-6 (greater than 90% of genome)
- Heterozygous only: 1.7x10^-4

[Discussion of genes found - no names, so there's no point in taking notes. They claim they get results that make sense.]

[Time's up - on to next speaker.

Second Customer:
Zemin Zhang, Senior Scientist, Genentech/Roche (Lung Cancer Study)

Cancer and Mutations
[Skipping overview of what cancer is.... I think that's been well covered elsewhere.]

- lung cancer is the leading cause of cancer related mortality worldwide...
- significant unmet need for treatment

Start with one patient
- non small cell lung adenocarcinoma.
- 25 cigarettes/day
- tumour: 95% cancer cells

Genomic characterization on Affy and Agilent arrays
- lots of CNV and LOH
- circos diagrams!

- 131GB mapped sequence in normal, 171Gb mapped seq in tumour
- 46x coverage normal, 60x tumour
[Skipping some info on coverage...]

KRAS G12C mutation

what about rest of 2.7M SNVs?
- SomaticScore predicts SNV validation rates
- 67% are somatic by prediction
- more than 50,000 somatic SNV are projected

Selection and bias observed in the lung cancer genome by comparing somatic and germline mutations

GC to TA changes: Tobacco-associated DNA damage signature

Protection against mutations in coding and promoter regions.
- look at coding regions only - mutations are dramatically less than expected - there is probably strong selection pressure and/or repair

Fewer mutations in expressed genes.
- expressed genes have fewer mutations even lower in transcribed strand
- non-expressed genes have mutation rate similar to non-genic regions

Positive selection in subsets of genes
- KRAS is the only previously known mutation
- Genes also mutated in other lung cancers...
- etc

Finding structural variation by paired end reads
- median dist between pairs 300bp.
- distance almost never goes beyond 1kb.

Look for clusters of sequence reads where one arm is on a different chromosome or more than 1kb away
- small number of reads
- 23 inter-chr
- 56 intra-chr
- use fish + pcr
- validate results
- 43/65 test cases are found to be somatic and have nucleotide level breakpoint junctions
- chr 4 to 9 translocation
- 50% of cells showed this fusion (FISH)

Possible scenario of Chr15 inversion and deletion investigated.
[got distracted, missed point.. oops.]

Genomic landscape:
- very nice Circos diagram
- > 1 mutation for every 3 cigarettes

In the process of doing more work with Complete Genomics


AGBT 2010 - Yardena Samuels - NHGRI

Mutational Analysis of the Melanoma Genome

Histological progression of Melanocyte Transformation
- too much detail to copy down

- mutational analysis of signal transduction gene families in genome
- evaluate most highly mutated gene family members
- translational

Somatic mutation analysis.
- matched tumor normal
- make cell lines

Tumor Bank establishment
- 100 tumor normal samles
- also have original OCT blocks
- have clinical information
- do SNP detection for matching normal/tumor
- 75% of cells are cancer
- look for highly mutated oncogenes

Start looking for somatic mutations
- looking at TK family (kinome)
- known to be frequently mutated by cancer

Sanger did this in the past, but only did 6 melanomas
- two phases: discovery, validation
- started with 29 samples - all kinase domains
- looked for somatic mutations
- move on to sequence all domains...

- 99 NS mutations
- 19 genes

[She's talking fast, and running through the slides fast! I can't keep up no matter how fast I type.]

Somatic mutations in ERBB4 - 19% in total
- one alteration was known in lung cancer

[Pathway diagram - running through the members VERY quickly] (Hynes and Lane, Nature Reviews)

Which mutation to investigate? Able to use crystal structure to identify location of mutations. Select for the ones that were previously found in EGFR1 and (something else?)

Picked 7 mutations, cloned and over-expressed - basic biochemistry followed.

[Insert westerns here - pricket et al Nature Genetics 41, 2009]

ERBB4 mutations have increased basal activity - also seen in melanoma cells

Mutant ERBB4 promotes NIH3T3 Transformation

Expression of Mutant ERBB4 Provides an Essential cell Survival Signal in Melanoma
- oncogene addiction

Is this a good target in the clinic.
- used lapatinib.
- showed that it also works here in melanoma. Mutant ERBB4 sensitizes cells to lapatinib
- mechanism is apoptosis
- it does not kill 100% of cells - may be necessary to combine it with other drugs.

- ERBB4 is mutated in 19% of melanomas
- reiterate poitns
- new oncogene in melanoma
- can use lapatinib
[only got 4 of the 8 or 9]

Future studies
- maybe use in clinics - trying a clinical trial.
- will isolated tumor dna w ICM
... test several hypotheses.
- sensitivity to lapatinib

What else should be sequenced? not taking into account whole genome sequencing.
- look at crosstalk to get good targets
- List of targets. (mainly transduction genes)

Want to look at other cancers, where whole exome was done.
- revealed : few gene alterations in majority of cancers. Limited number of siganlling pathways. Pathway oriented models will work better than Gene oriented models

[ chart that looks like london subway system... have no idea what it was.]

Personalized Medicine
- their next goal.

[great talk - way too fast, and is cool, but no NGS tie in. Seems odd that she's picking targets this way - WGSS would make sense, and narrow things down faster.]


AGBT 2010 - Joseph Puglisi - Stanford University School of Meicine

The Molecular Choreography of Translation

Questions have made the same, despite recent advances - we still want to understand how the molecular machines work. We always have snapshots that capture the element of motion, but we want animation, not snapshots

- Converting nucleotides to amino acids.
- ribosome 1-20 aa/s
- 1/10^4 errors
- very complex process (tons of proteins factors, etc, required for the process)
-requires micro-molar concentrations of each component

- we now know the structure of the ribosome
- nobel prize given for it.
- 2 subunits. (50S & 30S)
- 3 sites, E, P & A
- image 3 trna's to a ribosome - in the 3 sites...
- all our shots are static - no animated
- The Ribosome selects tRNA for Catalysis - must be correct, and incorrect must be rapidly rejected
- EFTu involved in rejection

[Walking us through how ribosomes work - there are better sources for this on the web, so I'm not going to copy it.]

Basic questions:
= timing of factor
- initiation pathway
- origins of translational fidelity
- mechanisms

Look at it as a high dynamic process
- flux of tRNAs
- movements of the ribosome (internal and external)
- much slower than photosynthesis, so easier to observe.

Can we track this process in real time?
- Try: Label the ligand involved in translation.
- Problem: solution averaging destroys signal (many copies of ribosome get out of sync FAST.) would require single molecule monitoring
- Solution: immobilization of single molecule - also allows us to watch for a long time

Single molecule real time translation
- Functional fluorescent labeling of tRNAs ribosomes and factors
- surface immobilization retains function.
- observation of translation at micromolar conc. fluorescent components
- instrumentation required to resolve multiple colors
- yes, it does work.
- you can tether with biotin-streptavidin, instead of fixing to surface
- immobilization does not modify kinetics

Tried this before talking to Pac Bio - It was a disaster. Worst experiments they'd ever tried.

- use PAcBio ZMW to do this experiment.
- has multiple colour resolution required
- 10ms time resolution

Can you put a 20nm ribosome into a 120nm hole? Use biotin tethering - Yes

Can consecutive tRNA binding be observed in real time? Yes

Flourescence doesn't leave after... they overlap because the labeled tRNA must transit through the ribosome.
- at low nanomolar sigals, you can see the signals move through individual
- works at higher conc.
- if you leave EF-G out, you get binding, but no transit - then photobleaching.
- demonstrate Lys-tRNA
- 3 three labeled dyes (M, F, K)... you can see it work.
- timing isn't always the same (pulse length)
-missing stop coding - so you see really long stall with labeled dye... and then sampling, as other tRNAs try to fit.
- you can also sequence as you code. [neat]

Decreased tRNA transit time at higher EF-G concentrations
- if you translocate faster, pulses are faster
- you can titrate to get the speed you'd like.
- translation is slowest for first couple of codons, but then speeds up. This may have to do with settling the reading frame? Much work to do here.

Ribosome is a target for antibiotics
- eg. erythromycin
- peptides exit through a channel in the 50S subunit.
- macrolide antibiotics block this channel by binding inside at narrowest point.
- They kill peptide chains at 6 bases. Are able to demonstrate this using the system.

Which model of tRNA dissociation during translation is correct
- tRNA arrival dependent model
- Translocate dependent model

Post syncrhonization of number of tRNA occupancy
- "remix our data"
- data can then be set up to synchronize an activity - eg, the 2nd binding.

Fusidic acid allows the translocation but blocks arrival of subsequent tRNA to A site.
- has no effect on departure rate of tRNA.

only ever 2 trnas at once on Ribosome. - it can happen, but not normally

Translocation dependent model is correct.

Correlating ribosome and tRNA dynamics
- towards true molecular movies
- label tRNAs... monitor fluctuation and movement

Translational processes are highly regulated
- regulation of initiation (51 and 3` UTR)
- endpoint in signallig pathways (mTOR, PKR)
- programmed changes in reading frames (frameshifts)
- control of translation mode (IRES, nromal)
- target of therapeutics (PTC124 [ribosome doesn't respect stop codons] and antibiotics)

- directly track in real time
- tRNAs dissociate from the E site post translocation and no correlation...

Paper is in Nature today.


AGBT 2010 - Joseph Puglisi - Stanford University School of Meicine

The Molecular Choreography of Translation

Questions have made the same, despite recent advances - we still want to understand how the molecular machines work. We always have snapshots that capture the element of motion, but we want animation, not snapshots

- Converting nucleotides to amino acids.
- ribosome 1-20 aa/s
- 1/10^4 errors
- very complex process (tons of proteins factors, etc, required for the process)
-requires micro-molar concentrations of each component

- we now know the structure of the ribosome
- nobel prize given for it.
- 2 subunits. (50S & 30S)
- 3 sites, E, P & A
- image 3 trna's to a ribosome - in the 3 sites...
- all our shots are static - no animated
- The Ribosome selects tRNA for Catalysis - must be correct, and incorrect must be rapidly rejected
- EFTu involved in rejection

[Walking us through how ribosomes work - there are better sources for this on the web, so I'm not going to copy it.]

Basic questions:
= timing of factor
- initiation pathway
- origins of translational fidelity
- mechanisms

Look at it as a high dynamic process
- flux of tRNAs
- movements of the ribosome (internal and external)
- much slower than photosynthesis, so easier to observe.

Can we track this process in real time?
- Try: Label the ligand involved in translation.
- Problem: solution averaging destroys signal (many copies of ribosome get out of sync FAST.) would require single molecule monitoring
- Solution: immobilization of single molecule - also allows us to watch for a long time

Single molecule real time translation
- Functional fluorescent labeling of tRNAs ribosomes and factors
- surface immobilization retains function.
- observation of translation at micromolar conc. fluorescent components
- instrumentation required to resolve multiple colors
- yes, it does work.
- you can tether with biotin-streptavidin, instead of fixing to surface
- immobilization does not modify kinetics

Tried this before talking to Pac Bio - It was a disaster. Worst experiments they'd ever tried.

- use PAcBio ZMW to do this experiment.
- has multiple colour resolution required
- 10ms time resolution

Can you put a 20nm ribosome into a 120nm hole? Use biotin tethering - Yes

Can consecutive tRNA binding be observed in real time? Yes

Flourescence doesn't leave after... they overlap because the labeled tRNA must transit through the ribosome.
- at low nanomolar sigals, you can see the signals move through individual
- works at higher conc.
- if you leave EF-G out, you get binding, but no transit - then photobleaching.
- demonstrate Lys-tRNA
- 3 three labeled dyes (M, F, K)... you can see it work.
- timing isn't always the same (pulse length)
-missing stop coding - so you see really long stall with labeled dye... and then sampling, as other tRNAs try to fit.
- you can also sequence as you code. [neat]

Decreased tRNA transit time at higher EF-G concentrations
- if you translocate faster, pulses are faster
- you can titrate to get the speed you'd like.
- translation is slowest for first couple of codons, but then speeds up. This may have to do with settling the reading frame? Much work to do here.

Ribosome is a target for antibiotics
- eg. erythromycin
- peptides exit through a channel in the 50S subunit.
- macrolide antibiotics block this channel by binding inside at narrowest point.
- They kill peptide chains at 6 bases. Are able to demonstrate this using the system.

Which model of tRNA dissociation during translation is correct
- tRNA arrival dependent model
- Translocate dependent model

Post syncrhonization of number of tRNA occupancy
- "remix our data"
- data can then be set up to synchronize an activity - eg, the 2nd binding.

Fusidic acid allows the translocation but blocks arrival of subsequent tRNA to A site.
- has no effect on departure rate of tRNA.

only ever 2 trnas at once on Ribosome. - it can happen, but not normally

Translocation dependent model is correct.

Correlating ribosome and tRNA dynamics
- towards true molecular movies
- label tRNAs... monitor fluctuation and movement

Translational processes are highly regulated
- regulation of initiation (51 and 3` UTR)
- endpoint in signallig pathways (mTOR, PKR)
- programmed changes in reading frames (frameshifts)
- control of translation mode (IRES, nromal)
- target of therapeutics (PTC124 [ribosome doesn't respect stop codons] and antibiotics)

- directly track in real time
- tRNAs dissociate from the E site post translocation and no correlation...

Paper is in Nature today.


AGBT 2010 - Bing Ren - UCSD

Epigenomic Landscapes of Pluripotent and Lineage-Committed Human Cells

Sequencing of the human genome has led to
* identification of disease causing genes
* Personalized medicine
* advanced sequencing technologies
* Foundation for understanding the construction of human beings

But DNA is only half the story
* variations in DNA alone not account for all variations in phenotypic traits
* organisms with identical DNA often exhibit distinct phenotypes (eg plants, insects, mammals)
* Epigenetic changes contribute to human diseases, phenotypes, etc

We know about the mechanisms
* DNA is wrapped around histone proteins which can be modified
* DNA is itself modified (methylation)

[paraphrased] DNA is hardware, epigenome is the software (Duke university quote... missed author's name)

* very complex
* varies among different cell types
* generally reprogrammed during the life cycle of tan organism
* Epigenome is also affected by environmental clues

How do we ecipher the "epigentic code"?
* sytematic approach
* large scale profindg of chromatin modification
* finding common modifications
* validation

* ChIP-Seq based. (started with Tiling arrays)
* use antibodies that recognize chromatin modification.

[beautiful pictures]
* Chromatin signature for the promoter and gene body
* H3K4me3 marks active promoters
* H3K36me3 marks gene body of active genes
* Signature has led to identification of thousands of long non-coding RNA genes.

Chromatin signatures of enhancers
* Can use information about modifications to model patterns
* predict enhancers in the human genome.
* 36,589 enhancer predictions were made
* 56% found in intergenic regions
* test a few with reporter assays - show that 80% of predicted enhancers do drive reporter genes. (Far fewer of the control sequences do - missed number)

Finding chromatin modification patterns in the genome de novo
(Hon et al, PLoS Comp Bio 2009)
* 16 different patterns of chromosome modification
* some are enhancers,
* others have no associations
* one has pattern highly enriched for exons.. regulates alt splicing.

* chromatin modification patterns could be used to annotate ...
* Epigenome Roadmap project (Generate reference epigenome maps for a large number of primary human cells and tissues)

Datasets are available at GEO. (NCBI)

Mapping of DNA methyltion and 53 histone modifications in human cells
* Human embryonic stem cells (H1)
* Fetal fibroblast cell line

Method for mapping DNA methylation
* Ryan Lister and Joe Ecker (Salk)
* sodium bisulfite (C to U), if not methylated
* Must do deep sequencing. If using HiSeq - could do it in 10 days. Used to take 20 runs
* Methylation status for more than 94% of cytosines determined.
* 75.5% in H1, 99.98% in Fibroblast
* DNA methylation is depletee from functional sequences
* no-CpG methlyation is enriched in gene body of transcribed genes suggesting link to the transcription process

11 chromatin modification marks
* comparing cells: different results
* K9me3 and K27me3 become dramatically extended (7% in ES to more than 30% in fibroblast.)
* genes with above marks are highly enriched in developmental genes.

Reduction of repressive chromatins in induced pluripotent cells

Repressive chromatin domains occupy small fraction of genome which is maintained as open structure in stem cells

Repressive chromatin domains occupy large fraction of genome, keeping genes involved in development silenced in differentiated cells.

* widespread difference in epigenomes of ES and fibroblasts
* stem cells are characterized by abundant non-CpG methylation
* Expansion of repressive domains may be a key characteristic of cellular differentiation
* [Missed 2]


AGBT 2010 - Jesse Gray - Harvard Medical School

Widespread RNA Polymerase II Recruitment and Transcription at Enhancers During Stimulus-Dependent Gene Expression

Mamalian brain is [paraphrased] Awesome technology
* Sensory experience shapes brain wiring via neuronal activation
* Whiskers compete for real estate in meta-sensory cortex.
* Brain can re-wire to adapt to environment
* Transcriptional changes in nucleous as brain cells reprogram
* (Discussion in terms of real-estate for rat whisker areas of brain.)

Neuronal activation affects circuit function by altering gene expression
* Activity dependent gene expression

* Ca++ influx
* kinases & phosphatases
* recruit Creb binding protein
* Induce about 50-100x expression in genes (eg, fos)
* Can we do genome wide approaches to understand what's being expressed?

An experimental system for genome-wide analysis of activity-regulated gene expression
* grow in dish
* depolarize with KCl
* do ChIP-seq and RNA-seq

CBP and transcription factor binding at fos locus
* see CBP binding at a conserved region upstream, as well as the promoter for the fos gene
* also see NPAS4 CREB and SRF with similar (but not identical) binding sites

Is the activity-dependent CBP binding restricted to the locus or genome wide?
* compare CBP peaks in both conditions
* binding appears limited to KCl stimulated only.

Are CBP-bound sites enhancers or promoters or both?
* Promoters don't necessarily drive transcription
* Promoters have H3K4Me3 histone modifications (enhancers don't)
* 3d configuration to bring enhancers together with promoters.

Most CBP peaks are not at TSSs and do not show H3K4Me3
* 5,079 at TSSs
* 36,069 not at TSSs

Align all seqs that are enhancers
* there is much H3K4Me1 (clear pattern)
* there is not much H3K4Me3

Use known site
* upstream from Arc - used to build a construct

CBP and H3K4Me1-marked loci function as activity-dependent transcriptional enhancers.
* Found 8 enhancers

* about 20,000 CBP sites that are activity-regulated enhancers
* do not correspond to annotated start sites
* H3K4Me1 modified
* lack H3K4Me3 mark
* do not initiate long RNAs
* confer activity-regulation on the arc promoter

Questions about activity-regulated enhancers
* do they play a role in binding RNA Polymerase II?
* Evidence is tending towards saying that most enhancers do not seem to have RNAPII binding.

fos enhancers bind RNAPII
* use chip for RNAPII and CBP
* 10-20% of sites have RNAPII at enhancer
* potential artifact - crosslinking conditions may exaggerate this by tying promoters and enhancers together.

Does RNAPII at enhancers synthesize RNA?
* Enhancers at the fos locus produce enhancer RNAs
* non-polyadenylated RNA? Yes.
* you do get some transcription at enhancers... [doesn't this start to describe lincRNA?]

Enhancer transcription is correlated with promoter transcription.

The Arc enhancer can be activated without the presence of the Arc promoter
* increases in polymerase binding at enhancer even when promoter is gone.
* preliminary - but may not be transcription when the promoter is gone.
* what is the function of eRNA transcription? (don't know the answer yet)
* Could be that it helps to lay down epigenetic marks.


AGBT 2010 - Keynote: Henry Erlich - Roche Molecular Systems

Applications of Next Generation Sequencing: HLA Typing With the GSFLX System

High Throughput HLA typing
* the allelic diversity is enormous
* Focussing on HLA class I and II genes (germ-line)

Challenging because it's the most polymorphic region in the genome
* HLA-B has well over 1000 alleles
* only 68 different serological types can be distinguished
* 3,529 alleles at 12 loci as of April 2009
* chromosome 6
* Can't be typed using existing conventional techniques [I assume in high throughput]
* DR-DQ region - involved in type I diabetes
[Much detail here, which I can't get down fast enough with any hope at accuracy.]

Polymorphism is highly localized.
* virtually all of the polymorphic amino acid residues are localized to a groove.
* most allelic differences are protein coding.
* critical to distinguish known alleles

* eg HLA-A*24020101
* only the first 4 numbers are the ones that distinguish the protein.

Survival curve for bone marrow transplant
* even with 8/8 allele matches, there are WAY more things that need to be matched - and so you need the best possible match.
* a single coding mismatch can cause graft vs host disease.
* Bone Marrow matching requires high precision

[List of disease applications - 22 different diseases including Narcolepsy, cancers, drug allergic reactions..]

GWAS in Type 1 diabetes.
* identified disease related genes - HLA SNPs are significant
* Dr-DQ haplotypes are associated strongly with Odds ratio for diabetes
* looking at genomic risk factors, risk increases up to 40x

[something about a particular combination of DR-DQ giving VERY high risk, and consequently is never seen in humans...]

* Dot blots... evolved into Probe Array Typing System.
* Even if you have hundreds of probes, you still have "HLA Genotype Ambiguity"
* "Fail to distinguish alleles" without NGS (with or without phasing..)

[Explanation of how 454 works - protocol]

* amplify exons with MID primers/emPCR/sequence

Benefits of clonal sequencing
* set phase to reduce ambiguity
* allow amplification and sequencing of multiple members of multi-gene family with generic primers
* allow sorting/separation of co-amplified sequences from target sequence (signal)

Parallel clonal sequencing of 8 loci x 24 samples

[More protocol... ]

Graph of read length : around 250bp

Connexio Assignment of DRB1 Genotype
* image reassuring to an HLA researcher.
* like the interface (plug for the company)
* aligns sequence, consensus sequence, does genotype assignment
* [Must admit, the information on this interface is rather mysterious to me...]
* [Several more slides of Connexio data and immunology types that mean nothing to me.]
* get a genotype report...


Testing on SCIDS patient
* patients are potentially chimeric
* look for presence of non-transmitted maternal allele
* can find stuff in "fail layer" because software assumes only two alleles possible.

[Wow... I know I don't know much immunology, but I'm not getting much out of this. This is a lot of software for immunologists, and I really don't understand the terminology, making it challenging to get coherent notes.]

Takes about 4 days - [says 5-7 on the slide]
* amplicon prep
* emulsion
* DNA bead process
* loading wells
* sequencing on GS FLX
* Data analysis

[Missed slide on how much data they were getting - 1M reads?]

Multiplex - 500 samples in one run
* Got good results [not copying down seemingly random DRB numbers...]


Friday, February 26, 2010

AGBT 2010 - Christopher Mason - Weill Cornell Medical College

Developmental Changes in Human Neocortical Transcriptome Revealed by RNA-Seq

How do we go from sequence to organism?

Example of a disease where they were able to find a change in an exon... but that's not the norm. The brain transcriptome is especially bad.

Complexity of transcriptome is vast.

NGS transformed the amount of data we're getting

Compared microarrays vs RNA-seq
* RNA-seq gives you much more information on DE.
* Metric for RNA-seq expression (Reads per kb per million reads)
* Controls: spike in synthetic w poly-A tails [next slide: control worked]
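The RNA-seq expression metric from the slide (reads per kb per million reads, i.e. RPKM) is simple enough to write down - a quick sketch of my own, just to show the normalization:

```python
# RPKM: read count normalized by gene length (in kb) and by
# sequencing depth (in millions of mapped reads), so expression
# values are comparable across genes and across libraries.

def rpkm(gene_reads: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    return gene_reads / (gene_length_bp / 1_000) / (total_mapped_reads / 1_000_000)

# e.g. 500 reads on a 2 kb gene in a 10M-read library:
print(rpkm(500, 2_000, 10_000_000))  # 25.0
```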

Looking at brain
* validate existing gene boundaries.
* longer isoforms
* find other genes
* 70-90% of genes expressed in the brain with strong neuro-developmental correlation
* Ensembl genes categories expressed: many types of RNAs found
* ~18% of splice forms are unique to each individual - splicing levels similar across development
* at high expression, 80-90% of genes have alt isoforms

[Lists of genes that were DE in fetal/adult brain - "things that make sense"]

What is different is Transcription Factors - especially Zinc Finger TFs.
* Shift towards fetal expression

Zinc Finger
* most rapidly expanding class of genes

Look at UTRs
* fetal brain exhibits myriad extensions of gene models and variable UTRs.
* TARs found. (Transcriptionally activated regions) - confirmed with PCR

No visible end of gene discovery.
* the deeper you go, the more new things you see.

ROC plot
* sensitivity (TP / TP + FN) and specificity
* looks incredible - nearly straight to 1.
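For reference, the sensitivity formula from the slide, plus the matching specificity - a trivial sketch of my own:

```python
# ROC metrics from true/false positive/negative counts.
# Sensitivity (true positive rate) and specificity (true negative
# rate) are the two axes behind an ROC plot.

def sensitivity(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp)

print(sensitivity(90, 10))   # 0.9
print(specificity(95, 5))    # 0.95
```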

Source of "wiggles" in RNA-seq.
* it's everything, really
* biggest problem: annotation is one source.

Human genome is not just 33Mb.... it's only 1/2 to 1/5th of the exome capture.
* 165 Mb have been validated on multiple SeQC platforms!

There aren't just 20,000 genes - it's closer to 45,000!

Begat: every bp of the genome is a locus for testing, each remaining sequence is a variable.

Don't forget, we also have to filter out viruses/bacteria/other
* Code for Begat is available. (Email given - forgot to copy it down.)


AGBT 2010 - Manuel Garber - Broad

Annotating LincRNA Transcripts Using Targeted Sequencing

Goal: Identify functional large ncRNAs in the mammalian genome
* look like mRNA, but non-coding
* Use ChIP-Seq to separate genome into regions
* use Tiling arrays, hybridize RNA...
* Tiling arrays - no information about connectivity, limited resolution

* studying the functions of lincRNAs requires precise sequences for both experimental and computational analyses.

Use RNA-Seq protocol to build transcriptome

what RNA-seq gives you:
* RNA, map to genome
* introns... junction reads.
* use reads with mate in poly-A to find end.

Used Tophat to align

Junction reads:
* Longer reads provide junction evidence
* first, use only reads that align with a gap. (Build connectivity map)
* topology map
* use map with ChIP-Seq data to build "paths"
* use paths to call transcripts
* clean up with Paired End Data -> join or kill unlikely isoforms.
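The connectivity-map idea above can be sketched as a toy graph problem - this is my own illustration, not Garber's actual method: gapped junction reads become exon-to-exon edges, and paths through the resulting graph are candidate transcripts.

```python
# Toy transcript reconstruction from junction reads: each gapped
# alignment contributes an (upstream exon, downstream exon) edge;
# depth-first paths through the graph enumerate candidate isoforms.
from collections import defaultdict

def build_graph(junction_reads):
    """junction_reads: iterable of (upstream_exon, downstream_exon) pairs."""
    graph = defaultdict(set)
    for up, down in junction_reads:
        graph[up].add(down)
    return graph

def enumerate_paths(graph, start):
    """Enumerate exon paths from `start` to any terminal exon."""
    paths = []
    def walk(node, path):
        nexts = graph.get(node, set())
        if not nexts:
            paths.append(path)
            return
        for nxt in sorted(nexts):
            walk(nxt, path + [nxt])
    walk(start, [start])
    return paths

junctions = [("e1", "e2"), ("e2", "e3"), ("e1", "e3")]  # e2 is skippable
print(enumerate_paths(build_graph(junctions), "e1"))
# two candidate isoforms: e1-e2-e3 and e1-e3
```

This is where the paired-end cleanup step earns its keep: the path enumeration happily produces isoforms with no real support, and mate-pair consistency is what joins or kills the unlikely ones.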

* Mouse ES
* Illumina sequence (156M - 76bp reads)
* 75% exonic alignment
* correctly reconstruct most expressed known genes at single nucleotide resolution.
* works even on overlapping genes.
* 81% genes fully-reconstructed
* Good recovery of genes at all expression levels.

Novel Transcripts discovered:
* 800 loci between genes
** 250 out of 317 ES lincRNAs are reconstructed
* 200 loci overlapping genes
** 131 overlap coding exons. (making them antisense for visual purpose.)

Are they protein coding genes?
* LincRNAs are probably too small to produce proteins [Strange assumption, IMHO... maybe I'm missing something.]
* 650 of 800 lincRNAs have no coding potential
* have lower expression level than coding regions.
* intergenic transcript conservations.. (similar conservation to old lincRNAs)
* Antisense transcripts? - no antisense coding potential
* antisense expression - very low antisense expression
* Antisense conservation - a little more conserved than sense lincRNA because of overlap with exons of genes
* antisense exons are not conserved.

What do overlapping transcripts do?
* expression is low,
* little or no conservation
* correlation with overlapping transcripts
* Thus: artifacts, noise, fine tuners? other ideas?

* novel statistical method takes advantage of longer reads
* mouse ES coding gene novelties
* intergenic non coding RNA (lincRNA)
* new family of antisense non coding RNA
* validation of 18/20.


AGBT 2010 - Brian Haas - Broad

Genome annotation using mRNA-Seq: A case study of Schizosaccharomyces pombe

Leverage evidence for genome annotation
* eg, 3 ab initio gene predictions

Major challenge:
* lack of high quality evidence
* this is changing with NGS.
* we now have evidence - but we need to standardize and develop algorithms
* reconstructing transcripts is difficult

Approach 1: de novo assembly
* treat them like EST
* align to genome

Approach 2: align reads to genome
* reconstruct based on alignments

Sequencing genomes from Schizosaccharomyces
* pombe is model organism - sequenced in 2002
* 12.5Mb, 5k genes, avg gene 1,489 bp
* genome should be well annotated, good quality annotations

* 44M reads, 65% aligned (Maq)
* align to genome - look good
* challenge is to bring it to high quality automated state

Align: Use TopHat for short read alignment + Cufflinks
Assemble: Velvet/Ananas + GMAP

ELT structures transferred into PASA, which does refinement, alt splicing and validate existing annotations

This is all exploration - This is NOT a tool Bake off.

Elts: Velvet (21167), Cufflinks (4158), Ananas (8309)
Almost all alignments to genome were perfect.

Then, test how many assembled to reconstruct full-length gene support: Ananas did best, Cufflinks 2nd best, Velvet only 1/3 of those done by Ananas.
* Velvet did very well with supporting introns

* readthrough and encroachment
* again, ananas did best, velvet 2nd best, Cufflinks worst (by a long shot.)

Examples given.
* Velvet seems to give fractionated transcripts.. breaks where coverage is high. [Probably seq errors are causing it to break?]
* some annotations needed to be extended
* corrected genes - merging two genes that are really one.

* none of these methods are great - they're all missing some that others caught.

* some well covered genomic loci not fully reconstructed (paralogs?)
* intron readthrough/encroachment
* incorrectly merged genes/transcripts
* UTR structures and alt splicing.

For well covered genomic loci not fully reconstructed
* identify disjoint regions
* collect reads and assemble independently
* genome directed to avoid misassembly
* very fast to do this
* This helps, but still have a long way to go.
* more tuning needed (expect to get up to 90%)

Dissecting merged transcripts.
* use coverage based assembly clipping - break up transcripts

Technology will greatly facilitate efforts
* Use stranded mRNA-seq

* the information from mRNA-seq is needed for high throughput annotation
* current tools show progress
* still much more to be done in optimization
* need for optimized methods for ALL types of genomes.


AGBT 2010 - Shuro Sen - NHGRI

Transcriptome Profiling of ClinSeq Participants by Massively Parallel Short-Read DNA Sequencing

[No Microphone - I may not get much from this talk. Mostly I will be pulling from Slides, I think]

* cohort of 1,000 individuals
* initial focus on Cardiovascular disease
* Consent for follow up
* transcriptome, exome + few genomes
* application of large-scale medical sequencing in a clinical research setting.
* concurrent "Omes" from same individual
* move on to other diseases in the long term

* started with sanger
* now moved to Illumina

* published marker paper on this topic last Sept in Genome Research

* transcriptome component of ClinSeq
* demonstrate use of RNA-seq in clinical research
* better than SAGE or Microarray

Transcriptome + Exome
* gene expression
* splicing
* gene fusions
* etc

* hardening of arteries
* Looking for biomarkers for calcification
* can look for it by CT scan (in the example, arteries look like bone... [Ouch!])

* 4 people w high calcification, 4 with low calcification
* two RNA sources: LCLs and whole blood
* emphasis on uniform cell culture conditions
* repeated EBV transformation from same individual (see noise)
* RNA Fragmentation (Covaris S2)
* PCR amplification 12 cycles
* two PE 51bp lanes Illumina

Differential gene expression
* Expression vs Statistical Significance.
* "upside down volcano plot"
* found about 100 genes that were differently expressed and significant
* Looking at those 100 in detail
* Many of these genes are noise.
* more sequencing reads to improve statistical depth

Discussing his best hits - but not giving names of genes.

[Kind of silly to take notes on random unnamed genes. Take home message is that some of the genes found were already known to be in the process - but obviously not all of them. TFs, TKs and something associated with rheumatoid arthritis. This might be a good time for me to rant about how picking any random list of proteins will give you things that you think are promising. All gene hit sets are "interesting" at first, and useless when not validated... but that's obvious, no?]

Coming up
* analysis of next 8 subjects
* follow up
* sequence more subjects for rare variants
* integrated analysis of genome and transcriptome data to uncover SNV loci underlying differential expression. ("integrating multiple omes")


AGBT 2010 - Nicole Cloonan - The University of Queensland

Translation-State RNAseq of Human Embryonic Stem Cells using Paired-End Sequencing.

Intro to Stem Cells
* hot topic - potential for cell generating therapies
* Self renewable
* pluripotent
* directable
* tractable

Looking at Extracellular space network.
* molecules that control cell-cell interactions (among others)

The "Plurinet"
* defines the pluripotent status of the cell
* protein-protein interactions
(Muller et al, Nature 455:401-405)

Transcriptional complexity
* 6 transcripts per gene on average
* so how does this affect the plurinet

* have a pipeline... [too fast]
* done SET and PET.
* 80% of tags map, 194M 50mers, 114M 25mers

Tags that don't map:
* LincRNA, intergene, etc...

* alternate splicing
* works well if you know what the annotations are.
* with PET, you can build transcript models if you don't have them already - learn more about alt. splice
* can be used for novel exon discovery

ChIP-Seq from Ku et al
* Extended Exons. 3' exon extensions can be very long.

[Why is this ChIP-Seq?]

Do Virtual Northerns
* Size fractionations
* What you find is that most annotated genes have the right refseq predicted lengths.
* however, some are shorter, some are longer
* Frequency at which tags from a particular library match predicted (based on refseq) vs from RNA data... You do see that some have very different results.

RNA are translated...
* if no signal peptide, cytoplasmic (on free ribosomes)
* if has signal, then it's translated by ribosomes bound to membranes
* use sucrose gradient to separate the two populations
* do PET, (35/75bp reads)
* compare signals in both fractions - they come up well in the predicted fraction.

Novel transcription
* membrane associated RNA have very different proportions of extension (mainly long 3' UTRs) than those in the cytoplasmic fraction

MiRNA biogenesis and mRNA interactions.
* use fractionation to test
* RISC associated with polysomes (which works with fractionation)
* complexes stay together through fractionation
* Long UTRS are enriched for mRNA binding sites

Back to Plurinet
* Complexity is incredibly increased with the extra products and miRNA

* PET allows you to reconstruct loci level complexity from RNAseq data
* Size fractionation is useful
* translation-state RNAseq allows the capture of mRNA and miRNA data from polyribosomes
* Transcriptional complexity impacts greatly on interactions.


AGBT 2010 - Jonas Korlach - Pacific Biosciences

Direct Single Molecule, Real Time RNA Sequencing.

Opportunity to further work with this platform to replace enzyme in ZMW with other enzymes of interest - can observe new functionality.

"Single Molecule Realtime Biology" [SRMB? How do you say that acronym?]

Of interest: Reverse Transcription
* replace polymerase with rna polymerase (reverse transcriptase)
* have done this - simple extension tests.
* done kinetic analysis, and the phospho-dNTPs are incorporated well, but MUCH slower (1 order of magnitude slower) than unlabeled nucleotides

Tested the system out anyhow.
* Seems to work in principle - albeit slowly. One dNTP in the enzyme is not yet one nucleotide insertion.

Ribosomal RNA Sequencing.
* Can withhold the catalytic metal, which allows binding but not incorporation. Thus, you can just watch the fluorescence - and in this case, binding only happens with the correct nucleotide.
* can also detect modified RNA bases - eg, Pseudouridine. Can measure binding time - takes longer.

Detection of Modified RNA bases
* pauses indicate kinetic changes

For viruses, you can get a single enzyme to process the entire genome of a virus - very long read lengths at the tail end of the distribution.

HIV reverse transcriptase translocation dynamics.
* use terminating bases and AIDS drugs - and monitor incorporation and pulses.
* Show graphs of kinetic analysis of P-Sites and N-site
* Can then study binding in the presence of the terminators/drugs.
* Can calculate binding energy from pulses.

* Demonstrated SMRT RNA sequencing - still room to grow.
* Demonstrated SMRT Biology - Translation (shown tomorrow) and reverse transcriptase.


AGBT 2010 - Pacific Biosciences Workshop

"The debut of the 3rd Generation"

* Came from the basement of a building at Cornell. [what is it with basements on campus?]
* technology detects 500 photons per base
* Raised $266M in company history

* show slide with first results that launched company - detecting 3 labeled C's, barely

"yes, it is big, yes, it is heavy, and yes, it does work"
* smallest: $50,000 desktop version
* Largest: full human genome in 15 minutes.

Already have manufacturing for reagents - and building a facility to construct machines.

Steve Turner: Founder, CSO, Board Member

1. brief overview of technology
2. Update on Collaborations
3. Instrument debut
4. Applications
5. Scalability

* Video of polymerase - same one from web.

* influenza
* cancer transcript
* long read progress
* strobe sequencing for structural variation
* Palustris systems biology [Go palustris!!!]
* circular consensus sequencing
* survey of coverage bias
* direct detection of methylation and DNA modification

* serotyping doesn't give the full picture - immunologically distinct viruses.
* Fast Time to result: 9 hours from sample Extraction to sequencing analysis completion.
* did not look at consensus call - used single molecule reads.
* match single molecules with sequenced reference genomes of similar influenza.
* Turned out that the strain was misidentified - phylogeny was incorrect.
* side benefit: in every case, each segment was covered in single reads. Potential for quasi-species studies of viruses.

Sequenced MCF-7
* known alt. splice forms implicated in tumorigenesis.
* Can map entire transcripts (2400bases) in single read.
[neat stuff]

10,351 base read scrolling... goes on and on.
* they see up to 20kb reads.

Strobe sequencing
* answer to Mate Pairs?
* Polymerase is damaged by laser, so reads will continue until damaged
* Turn off the light, and the polymerase is unharmed... will continue till you turn the lights back on.
* Who needs mate pairs when you can just sequence 10kb at a time?
* show repeat lengths - at 20kb, you can sequence most of your repeat regions. - Strobe it as well...
* Very useful for assembly.

Insertion AC223433 fosmid
* can use time as a way to look at insert size.

* 58 contigs from palustris
* Hybrid assembly - now have a single contig. (Used Strobe, straight and other tech..)

Read Length.
* Expect that you can expand reads to 50-70kb.
* demonstrate by hairpin ligation to lambda genome (linear)

circular consensus sequencing
* make something circular, then go 'round and 'round till you get consensus.
* Q40 on single molecules by going over it many times
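The Q40-from-many-passes claim makes sense if you treat each pass over the circle as an independent observation. A rough back-of-the-envelope sketch of my own (simple majority voting - real consensus calling is certainly more sophisticated than this):

```python
# With per-pass error rate p, the chance that a majority of n
# independent passes are wrong at a given base follows the binomial
# tail; Phred quality is Q = -10 * log10(error probability).
import math

def binom_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

def consensus_q(n_passes: int, per_pass_error: float) -> float:
    majority = n_passes // 2 + 1
    err = binom_tail(n_passes, majority, per_pass_error)
    return -10 * math.log10(err)

# e.g. a 15% single-pass error rate (roughly Q8) after 9 passes:
print(round(consensus_q(9, 0.15), 1))
```

Even this crude model shows quality climbing steadily with the number of passes, which is the whole point of going 'round and 'round.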

* results in Low bias for GC content
* tested on many organisms

modified nuclear bases
* look at kinetics of base incorporation
* modified nuclear base Methylated Adenosine causes kinetic differences
** 6-10x kinetic changes.
* Methylated Cytosine - still get a signal
* Hydroxymethycytosine: can also see that - also different from other traces
* duration and spacing are different for the three bases.
* Single base resolution, less than 1% FP, methylation detection on single molecules
* also looked at other modifications - can always tell that it's different.
* Polymerase stalls at T-dimers.

[Summarized it all]

[Insert CEO talk here - wonderful company, wonderful people, "state of the art", hard work.]

Unveil the world's first 3rd generation sequencer
* Movie time!
* 8 Cells per package - $100 per cell.
* SMRT Cell - 96 / tray.
* reagent plate (96 well)
* each cell works independently - in any protocol
* Uses CSV files
* API to LIMS with designs.
* System looks pretty child-proof (though probably not idiot-proof)

Monitoring a run:
1. monitor at instrument or remotely
2. View real time base incorporations
3. remaining runtime
4. status of each cell from cell prep to run.

Signal to noise ratio is dramatically improved from last year


* web based interface
* accessible from any computer
* automated secondary analysis

* full complement of reports automatically generated
* quality files
* ....

Browser integrated into viewer.

* FastQ
* etc...

All in one day.
* sample prep to analysis.

* methylation sequencing will be released in an update
* direct rRNA sequencing.

Working towards SMRT Translation
* replace Trancription (Polymerase) with translation (ribosome & labeled tRNA....)

[ok, didn't see that coming]

Scaling of performance over instrument life
* current yield 30%, improved to 90%
* Multiplex: 80k improves to 160,000
* speed 1-3bps improving to 15bps

Throughput should pass 2nd generation with this instrument. Expect new instrument in 3 years to blow all of this away.

Interpretation of Genomics will require epigenetics, etc etc etc. and much data processing. [Oddly, That's what I tried to convince Complete Genomics people of this morning, without success.]

* Dark Bases? They are not dark bases - they are missed bases. They now have better bases that bind better than the natural bases. Missed bases are a problem - the nucleotide docks, and if it happens too fast, you don't get enough photons...

* Something about algorithms for de novo assembly - check out the posters, and we'll have more information for you.

* What is your error rate? [Very aggressive question] Single pass error rate is greater than ensemble sequencing. You don't get systematic error in Pac Bio - Approach towards consensus is linear. You know when you see systematic errors - you can catch and repair. Expect Q90 with this technology.

* Exponential decay on read lengths.


AGBT 2010 - Elaine Mardis - Washington University School of Medicine

Single Molecule Sequencing to Detect and Characterize Somatic Mutations in Cancer Genomes

[Disclaimer Statement - she is a Pac Bio board member]

Why Sequence Whole Genomes?
* [same as always - nothing new]

Focus on talk today is on point mutations

How Current NGS (eg, Illumina) works:
* Sequence tumour & normal to 30x,
* Compare to reference, then compare tumour to normal, and remove known dbSNP sites, etc etc...
* Validate SNVs.

4 Tier levels.
* focus validation on Tier 1 results.

Why Validate?
* Pipeline is tuned to have a slightly elevated false positive mutation rate so things aren't missed.
* Orthogonal validation is important.
* Validation is expensive and time consuming, however.

Why check for prevalence of mutations?
* Each tumour DNA sample consists of the contributions of many tumour cells
* digital nature of NGS data allows an estimation of how common each validated mutation is in the tumor cell population
* more prevalent mutations are likely "older" - happen earlier in progression.
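The "digital" estimate is straightforward: the fraction of reads carrying a validated variant approximates the fraction of mutant alleles in the sample. A toy sketch of my own (ignoring copy number changes and normal contamination, both of which real estimates must correct for):

```python
# Mutation prevalence from read counts: the variant allele fraction
# (VAF) is mutant reads over total reads. For a heterozygous point
# mutation, each carrying cell contributes one mutant allele out of
# two, so cell prevalence is roughly 2 * VAF.

def variant_allele_fraction(variant_reads: int, total_reads: int) -> float:
    return variant_reads / total_reads

def het_cell_prevalence(vaf: float) -> float:
    """Fraction of cells carrying a heterozygous variant (capped at 1)."""
    return min(2 * vaf, 1.0)

vaf = variant_allele_fraction(120, 500)  # 0.24
print(het_cell_prevalence(vaf))          # 0.48 of tumour cells
```

Under this model, a validated mutation at high prevalence was present in most of the tumour cell population - consistent with the "older mutations happened earlier" reading in the talk.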

Recurrent SNVs
* why? Adding evidence. The ones that happen more often are likely to be earlier in progression and are thus more likely to be drivers. [Not sure I buy that logic, however.]

* Faster Sequence data generation (analysis is not getting cheaper)
* Increased validation/prevalence data demand (need to decrease cost)
* Recurrent mutation screening (site specific vs whole gene)

Medical impact:
* always want our results to be useful. [Kind of ignoring this part... selling us on the use of sequencing for medical use.]

Discussion of AML project, as discussed in last talk.
* prognostic IDH1 mutations.

[Dr. Mardis' talks always remind me of an infomercial... It has the feel of a commercial presentation, but with data to back it up. It's glossy, the slides are clean, and the presentations feel well rehearsed - something we just don't get much of in science talks.]

Insert sales pitch for Pac Bio systems here.

[5 slides later... ]

three experiments:
* first for accuracy
* second for sensitivity
* third for detection of mutational prevalence

* 32 directed PCR products from glioblastoma tumor/normal pair
* 77% neoplastic cellularity
* SMRT sequencing (alpha prototype detector)
* Wrote software for SNP detection
* 94% of 86 known sites were found
* 6 FP and 6FN results

* 5 LOH sites were detected properly
* All mutations were detected at different confidence levels

* used AML genome
* 95% population purity
* All variants detected at each cellularity...

Detection of Mutational Prevalence:
* Concordance with Illumina is good - but not great in tier 3 mutations. C to T mutations were slightly biased against.

* Platform is Ramping up quickly


AGBT 2010 - Keynote speaker: James Downing - St. Jude Children's Hospital

The Molecular Pathology of Acute Leukemia

Was head of pathology at St. Jude for many years - doing cancer genomics before it was called cancer genomics.

First time at AGBT
No methodology or technology - focus on biology and clinical relevance. Not going to present NGS data! Using completely outdated technology - and all of it was published in the last 12 months.

The cancer he's focused on is the best characterized of all the cancers.

What leukemia really is: Proliferating B-cells that rapidly take over the whole body. Highest tumor load of all the cancers. In his generation, 95% of children died within 12 months of diagnosis. Now have 80-85% cure... but relapses happen in 30%.

[Classical diagram of immune system lineage]

Mutations in early progenitors generate leukemias. Two types: ALL and AML. They are not homogeneous diseases, however. Distinct biological subtypes are characterized by translocations. - They contribute to the leukemia: Necessary, but not sufficient.

What are the biologic processes that need to be altered to generate leukemia:
1. Alteration in self-renewal capacity - need to become "immortal" (unlimited self-renewal) (eg AML1-ETO)
2. Need to have an altered response to growth signals - continued growth (eg. BCR-ABL1)
3. Block in apoptosis (eg PML-RAR alpha)
4. Block in differentiation

Doing "routine molecular diagnosis":
* CNV, expression, etc
* Use Affy Chips

What have they found? (using 242 diagnostic ALLs with matched germ line DNA.)
* there are a small number of copy number changes per case... they vary markedly across the different subtypes. (eg, MLL: ~1, others have ~11)
* more Deletions than Amplifications
* 60% of B-lineage ALL have a genetic lesion in a gene regulating B-cell differentiation (PAX5, Ikaros, EBF1, LEF1, BLNK)

PAX5 deletions most common.
* 10 exons...
* Half of the deletions remove half of the gene
* Others delete required domains
* some were homozygous, but not all.
* Lots of fusions with this gene occurs as well.
* Point mutations were also seen in binding domains...

[Ok, so this gene can be deleted in many ways... got it. The cells find ways to kill off this gene.]

Haploinsufficiency in PAX5 deficient mice
* Was not sufficient to cause lymphoma.
* cooperates with BCR-ABL1 to cause lymphoma. (Mouse Model)
* strong driving pressure for disabling the B-cell differentiation genes in Leukemia.

60% of B-progenitors ALL have Mutations in B-cell regulatory Genes

Look at Ikaros
* entire literature about altered isoforms.
* saw a high frequency of mutations in BCR-ABL1 ALL,
* 85% of BCR-ABL ALL have deletions of Ikaros; almost never see these deletions in other leukemias.
* mapping deletions of Ikaros: Some are complete, but there is a subset of deletions that commonly knock out all 4 zinc fingers (exons 3-6).
* Never see Ikaros "isoforms" without these deletions. There probably are no isoforms - it's always genetic lesions.
* Deletions typically happen within a few bases of each other - result from aberrant RAG-mediated recombinations.

Start putting the lesions together. [Nice lists of genes for each of the 3 pathways]

Clinical relevance:
* looking for markers in a new cohort. Remove two types of ALL (BCR-ABL1 + infant), look at 221 samples: Are there new markers?
* Yes, it was Ikaros: 75% of relapse if you have Ikaros deletions.

Compare BCR-ABL1- and Ikaros- (Bad outcome) with BCR-ABL1+ ALL (Also has Ikaros deletions)
* Significant expression similarity
* Look at the Kinases: JAK family, which have a high rate of mutations in ALLs.

JAK mutations:
* not seen in other types of cancers - unique to JH2 domain, clustering in a single spot. (R683)
* Turns out that high-risk ALL have JAK mutations.

* Over expression of this receptor (compensating for Jak Mutations and lack of signaling), combine to cause a proliferative signal. [I didn't get everything here.]

Looking at high risk again:
* Ikaros deletions
* Jak Mutations
* CRLF2 (cytokine receptor mutations)

What other kinases are activated in this subset of patients?
* Work in progress
* quick review of other genes they're now finding... [too fast to get that down.]

Genetic Alterations Acquired at Relapse
* Relapse samples are often only 20% blast population.
* Need to Flow sort.
* CDKN2A/B mutations
* [list of genes, including ikaros... ]
* No common mechanism of relapse - variety of pathways
* Varieties do not include drug target mutations. It's always in signalling, etc.
* 7% of relapse is "unrelated" (secondary leukemia)
* 8% same as diagnosis
* 34% clonal evolution from diagnosis
* 51% clonal evolution from pre-leukemic clone

* small number of variations
* Ikaros mutations
* Aberrant RAG-mediated recombination
* JAK mutations
* ...

This disease "begs for NGS" - Get a complete picture of what's going on.
* Collaborating with WashU. (Mardis, Wilson, Ley)
* Doing the "Bad" leukemias (infant, high risk, CBF)
* also doing brain and solid tumours (neuroblastoma, osteosarcoma, retinoblastoma)
* Started Feb 1st - already have 5 genomes and matched normals.
* over $50M invested in this project


Thursday, February 25, 2010

AGBT 2010 - Ogan Abaan - NIH/NCI

Identification of novel cancer mutations in sarcomas

Sarcomas: two categories
* simple genetic changes (eg. Ewings)
* Complex genetic changes (eg, osteosarcomas)
Soft tissue sarcomas in general:
* rare
* high metastasis rate.
* connective tissue origin.
* 50 subgroups - most have unknown biology
Tumour samples from 24 soft tissue sarcoma patients
* matched normals will be sequenced when they become available.

* 15k exons from 1334 genes
* used "in-solution" capture method.
* 33.5k 150-mers
* no repeat masking.
* biotinylated baits

Used Eland - and used GAII or GAIIx, as available - mixed read lengths

Custom python scripts - wrote them himself. Still a work in progress.

Variant Calling is VERY simple. Uses Phred score based approach, adjusted by error rate at that position.
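[A minimal sketch of what such a Phred-based caller could look like - my reconstruction for illustration, not the speaker's actual code: convert each base quality to an error probability, then demand non-reference support beyond what the error rates alone would explain.]

```python
def phred_to_error(q):
    """Phred quality Q -> probability the base call is wrong."""
    return 10 ** (-q / 10)

def call_variant(pileup, ref, min_fraction=0.2):
    """pileup: list of (base, phred_quality) tuples at one position.
    Call a variant when non-reference support clearly exceeds what the
    per-base error rates alone would produce."""
    expected_errors = sum(phred_to_error(q) for _, q in pileup)
    nonref = sum(1 for b, _ in pileup if b != ref)
    return (nonref > 2 * expected_errors
            and nonref / len(pileup) >= min_fraction)
```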

Did the standard: filter on dbsnp130, annotate on UCSC refGene and Visual confirmation (IGV Browser)

Shows stats - they don't look great, but they seem similar to those published in Tewhey et al (Genome Biol 2009). [Shown to justify low rates?]

Some optimization could be done to get more coverage.
* gets 23-46% at greater than or equal to 10x, paper gets 88% or more at 7x

6 of the variants are known in the COSMIC db.

KEGG pathway: Many mismatch repair... [actually, this is the usual set you'd see with any cancer sample. Nothing sticks out.]

* 305 variants, no common variants.

* increase sample size.
* pathway analysis
* Understand biology

[Not the most impressive talk - I could give the same talk on my cell lines, and would have roughly the same results.... nothing particularly interesting.]


AGBT 2010 - Ian Bosdet - BC Cancer Agency

Mutational Profiling of Pre- and Post-Treatment Lung Tumors Using Whole-Transcriptome Sequencing and Targeted Sequence Capture

EGF receptor is often mutated (non-small cell)
* some tyrosine kinase inhibitors exist, but response is variable.
* clinical characteristics known to be associated with response were used as the primary criteria for recruitment

Identifying patients that are likely to benefit from TKI therapy can have a significant impact on overall survival.
* cells become addicted to the rampant signalling from TK. Cutting it off can kill them
* often a mutation arises that can dampen or negate the effect of the drug.

All cancers used were first line.
* non-smoker
* female & asian
* stage IIIb
* NSCLC 1st line.

Majority of patients have now progressed - and are encouraged to donate a 2nd biopsy.

65 patients over 2 years,
goal: non progression over 8 weeks.
80% did not progress in 8 weeks.
* 23 partial response,
* 24 stable disease

30 tumours selected for RNA sequencing
* 13 responders, 14 non-responders
* 3 progression tumours
* gene expression analysis and mutation discovery
* some correlation to clinical characteristics.

One gene correlated with EGFR sensitivity mutations.
Another seemed to correlate with smokers who did not respond: IER5L

Excess unaligned reads were aligned to virus transcripts - highly enriched for Epstein-Barr Virus. Tumour ended up being re-classified.

3 patients then sequenced with Capture:
* Used Agilent (47,558 baits)
* Normal, pre-treatment and post-treatment tumour samples
* can be used to identify small deletions
* Putative somatic mutations resulting in significant amino-acid alterations were identified using SNVMix
* Mutations similar between patients were not observed, but pre-treatment tumour pairs show significant overlap.

[Talking about putative somatic mutations.... I got ripped into for doing the exact same analysis and calling the same mutations "most likely" somatic 2 weeks ago... DOH.]

* clinical selection of patients can greatly enhance the incidence of EGFR mutations and response to erlotinib at 8 weeks
* EGFR mutation status is a good but imperfect predictor of patient response
* mutation discovery in treatment-naive lung tumours has identified a relatively small number of mutations (need validation)
* more progressions will be analyzed.


AGBT 2010 - Daniel MacArthur - Wellcome Trust Sanger Institute

Loss-of-Function Mutations in Healthy Human Genomes: Implications for Clinical Genome Sequencing

[Missed the first couple of minutes?]
Analysis of 1000 genomes data.

Loss of Function sub-group
Aim: create a catalogue of variants predicted to result in severe disruption of gene function
What is a LOF variant: [annotation based on GENCODE v3lb]
1. stop codon SNPs
2. splice disruption SNPs
3. frame shift indels
4. disruptive structural variants. (eg. loss of exons, loss of start codons...)
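[In code terms, the cataloguing step is just annotation bucketing. A toy version - the annotation terms here are illustrative stand-ins, not the actual GENCODE vocabulary:]

```python
# Map variant annotations onto the four LOF categories listed above.
LOF_CLASSES = {
    "stop_gained": "stop codon SNP",
    "splice_donor": "splice disruption SNP",
    "splice_acceptor": "splice disruption SNP",
    "frameshift_indel": "frame shift indel",
    "exon_loss": "disruptive structural variant",
    "start_loss": "disruptive structural variant",
}

def lof_class(annotation):
    """Return the LOF category for an annotation, or None if not LOF."""
    return LOF_CLASSES.get(annotation)

def filter_lof(variants):
    """Keep only variants predicted to severely disrupt gene function."""
    return [v for v in variants if lof_class(v["annotation"])]
```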

LOF variants:
* enriched for:
** severe recessive mutations
** other variants with functional effects
** neutral variants in redundant genes/pseudogenes
** sequencing and annotation artefacts

Many of these will be neutral.

3 pilots.
* total of 1,6556 unique genes affected.
* that is to say that a substantial portion of the genome has LOF variants
* acknowledging that there are errors, that's still a lot. (=

Disrupted genes per individual. Visible difference between European vs. Yoruba. (Africans have higher variability)

Structural variants seem relatively constant, splicing seems constant, stops seem to vary most. (CEU, CHB, JPT, YRI) [I'm eyeballing]

Expect to see some carriers for recessive disease mutations
* Several likely carrier mutations identified. [didn't catch them]

Derived allele frequency spectra.
* stop and splice are heavily shifted to the low end (0.05+)
LOF sites are enriched for artefacts
* Conserved regions have fewer polymorphisms, but an equal amount of error.
* Non-conserved regions have more polymorphisms, and equal error:
** thus the artefact rate tends to be higher in conserved regions.

LOF clustering points to mapping and annotation artefacts
* 91% of LOF-carrying genes contain only one LOF variant.
* there are some genes that are enriched for multiple independent LOF variants.
** many of them are CNV, seg dup, close paralogues.... which means that they're artefacts too.
* other annotation artefacts exist too... LOFs are making them stand out.

Beyond cataloging:
* large scale sequencing studies tend to produce many potential LOF candidates
* discriminate between disease causing and benign variations.
* is there a functional profile distinguishing recessive and LOF-tolerant genes?

Compare LOF-tolerant genes (& non-OR) to 725 recessive disease genes from OMIM. (Early results)
* use it to do classification
* linear discriminant analysis

[Kind of feels like a fast drive-by-blogging... my notes really didn't do justice to Daniel's explanations - i just managed to get down some of the points.]


AGBT 2010 - Timothy Triche - Children's Hospital Los Angeles

Unraveling the Complexity of Primary and Metastatic Ewing's Sarcoma Using Helicos Single Molecule Sequencing

Came out of ongoing studies of high risk childhood cancers.
Ewing Sarcoma
* had no survivors, now has 50% survival rate.
* If it metastasizes, there is little survival (poor outcome)
Started with 16 year old female
* metastasized 6 months later
* had a lot of DNA from stockpiled bone marrow

* use RNA/DNA/Epigenomics to understand cancer
Interested in just about all types of sequencing, and integrating it all. List pretty much every type of Next gen sequencing technique. [Not much they aren't interested in.]

Using Helicos to do sequencing
* Identified two p53 mutations - both previously known.
* Chimeric genes in sarcomas usually mean the rest of the genome is less rearranged. However, there were fairly significant rearrangements in the metastasis. (eg. entire chr 7 & 8 duplicated)

Metastasis is not just an explant.
* Statistically, there is a strong association between CNV duplications and RNA up expression of genes.

[something about cell adhesion molecules?]

Mechanisms of double strand breaks... uniform at nearly single base resolution.
* 11q24.3
* in middle of FLI1 gene.
* approximately 1Mb deletion in tumour, in metastasis, this completely disappears.

Breakpoint @ 22q12.2
* less defined... again CNV changes disappear.
18qter DEL & LOH in CHLA9 disappears in CHLA10: Is the metastasis derived from the primary?
* Deletions "disappear", so the dominant clone in the primary is not likely the one that metastasized.

Comparing primary to metastasis

[Whoa... colours... orange fused to blue, turned to light blue.. something diluted... much too fast to take notes on this without pictures.]
* Dosage effects are seen.
* 22 chromosomes show LOH and profound Homozygosity in the Metastasis that is not seen in the primary. 16/20 chromosomes.
* This shows a major simplification in the genome.

Used RNA-Seq... some filtering on RNA.
* random primers, poly-A tail addition, hybridize and ...
* look for fusion - get EWS-FLI1 fusion

overall RNA expression.
* far more complex, especially intronic, 5' and 3' of exons.
* This is regulated under controls that have yet to be discovered.
* more than 40% of transcription in the primary tumour and metastasis is non-exonic.
* Genes are up regulated in metastases
* Sometimes you see lots of intron expression, sometimes you see LOTS.

[ I'm going to go see another talk - Have to stop notes here.]


AGBT 2010 - Kristian Cibulskis - Broad Institute

muTector: Accurate Somatic Mutation Detection in Whole Genome and Exome Capture

Mutation Detection - the goal
* Somatic Point Mutations: SNV in the tumour DNA that are not present in the normal

One challenge: Sensitivity
* tumour purity : Normal tissue gets into the sample - you may be testing normal tissue in high quantities
* ploidy: often there are multiple copies of the DNA

60% tumour is common - with 3x ploidy and min allele fraction: 0.23
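[The 0.23 works out from purity/ploidy arithmetic: one mutant allele contributes purity × 1 against purity × tumour ploidy plus (1 − purity) × normal ploidy total alleles. A quick sketch:]

```python
def expected_allele_fraction(purity, tumour_ploidy,
                             mutant_copies=1, normal_ploidy=2):
    """Expected fraction of reads showing a somatic mutation carried on
    `mutant_copies` chromosomes, given tumour purity and ploidies."""
    total_alleles = (purity * tumour_ploidy
                     + (1 - purity) * normal_ploidy)
    return purity * mutant_copies / total_alleles

# 60% tumour content, triploid tumour, one mutant copy:
print(round(expected_allele_fraction(0.6, 3), 2))  # 0.23
```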

Challenge 2: Specificity
* Signal: 1 somatic mutation per Mb
* Noise: 1000 common germline variants per Mb (in dbsnp)
* Mutations are not recurrent. (Constant discovery mode)
* 1000s mutations per sample, 100s of samples
* Too expensive to validate every mutation - would cost more than to discover.

* Core detection algorithm and practical artifact filters
* Under dev since Nov 2008
* Built upon GATK

Some artifacts can be cleaned up globally
* Remove molecular Duplicates
* Recalibrate Quality Scores (make Q values match)
* Locally Realign [Gapped - uses SW - I saw the poster]

Core Statistical Test
* Prior genotype probabilities enforce the variant expectation rate.
* first calculate a score for non-reference (for tumor)
* then calculate a score for it being reference (for normal)
* Controlling sequencing error
* Controlling for missing a germline variant in the normal.

Running: you get more somatic mutations
* expected 30 somatic mutations, ended up with 133 in 30mb of coding sequence
* Error processes not captured by the core statistic produce high confidence mistakes
* Information about reference alleles and mutant alleles should come from similar distributions
* linked mutations, library errors... etc

* Sequence context causes base hallucinations
* Fisher's exact test to check distribution of strand of reads containing reference allele versus alternate allele
* Bigger effect in capture than whole genome
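[The strand check is a 2x2 Fisher's exact test: reference vs. alternate allele counts, split by strand. A stdlib-only sketch - scipy.stats.fisher_exact does the same job:]

```python
from math import comb

def fisher_exact_p(table):
    """Two-sided Fisher's exact test on a 2x2 table
    [[ref_fwd, ref_rev], [alt_fwd, alt_rev]], computed from the
    hypergeometric distribution."""
    (a, b), (c, d) = table
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def p_of(a_):
        # Probability of a table with a_ in the top-left cell,
        # given the fixed row and column totals.
        return comb(row1, a_) * comb(row2, col1 - a_) / comb(n, col1)

    p_obs = p_of(a)
    lo = max(0, col1 - row2)
    hi = min(col1, row1)
    # Two-sided: sum all tables at least as extreme as the observed one.
    return sum(p for p in (p_of(x) for x in range(lo, hi + 1))
               if p <= p_obs + 1e-12)
```

A strongly skewed table (e.g. alternate reads almost all on one strand while reference reads are balanced) gives a small p-value and flags the call as a likely artifact.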

* Sequencers/Aligners tend to make reproducible errors, which then show up in alignments

Small changes to filters have big effects
* Very sensitive!

Filtering goes from 133 to 35.

* 26/29, 30/35, 31/36, 92/100
* Around 95%

How Sensitive?
* use core statistics.
* depends on coverage! [of course]
* use theoretical prediction data and ultra deep coverage as "control"
* Both seem to give the same/similar results
* Average 60-80% power to detect

Beta Testing going on
* Release of the software will be soon!


AGBT 2010 - Elliot Margulies - NHGRI/NIH

Sequencing and analysis of matched tumor and normal genomes from a melanoma patient

Experimental Design:
* melanoma tumor sample - sequence it
* matched normal blood sample - sequence it
* seems simple, but takes new tools.
* unique advertisement strategies. (-;

Saved 10 runs of Images alone - more than 100 Tb of storage

Compare Illumina 1.6 v 1.4
* Uniquely aligning read and next_phred
* Didn't explain the results of the graphs shown... missed the point.

Used Eland, partition into bins
* realign with cross_match. (well characterized and scales well.)

In the end, 2 whole genome datasets
* 2 x 100 bp read
* 33 tumour and 24 normal (lanes)
* total runs (5 and 3)
* total alignable reads 1billion/1.2billion

Coverage statistics:
* Greater than 99% covered 1x
* 5x-10x range for variants covered by 94-95%

Method for variant detection
* Most Probable Genotype
* Bayesian statistical approach, prior probability of observing a non-ref allele (expected mutation rate)
* Equation given - not going to copy that for html.
* Confidence is the difference between the best call and the next most probable call.
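[A sketch of that kind of Bayesian genotype caller - my own reconstruction, not the NHGRI code: compute a posterior over the three genotypes, with the confidence score being the gap between the top two.]

```python
import math

def genotype_posteriors(bases, quals, ref, alt, theta=0.001):
    """Log10 posterior over the genotypes ref/ref, ref/alt and alt/alt.
    theta is the prior probability of a non-reference allele."""
    priors = {(ref, ref): math.log10(1 - theta),
              (ref, alt): math.log10(theta / 2),
              (alt, alt): math.log10(theta / 2)}
    post = {}
    for gt, log_prior in priors.items():
        ll = log_prior
        for b, q in zip(bases, quals):
            e = 10 ** (-q / 10)
            # P(read shows base b), averaged over the genotype's alleles.
            p = sum((1 - e) if b == a else e / 3 for a in gt) / 2
            ll += math.log10(p)
        post[gt] = ll
    return post

def mpg_score(post):
    """Confidence = best call minus the next most probable call (log10)."""
    best, runner_up = sorted(post.values(), reverse=True)[:2]
    return best - runner_up
```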

[This looks VERY much like SNVMix2...]

Graph concordance with percentage called. If you use a cutoff of 10, you get 95% in the normal genome, 90% in the tumor.

Moved from MPG to Most Probable Variant (MPV)
* Compare between the best call and the probability of the reference data.
* improves the quality of the call.

* Using MPV greater than 10 (4Million variants)
* Subtract out evidence for germ line or low coverage
** take out high confidence germline variants
** subtract where MPG is less than 10, but it looks like a variant.
** throw out low confidence somatic variants.
* leaves 189,000 somatic variants (tumour variants)
* also filtering dbsnp
* break into coding/non-coding
* synonymous/non-synonymous
* verify SNVs by sanger sequencing. (75/84 verify) It may be that some of them are there, but not detectable by sanger.

Summary table of SNV pipeline.
* 174,000 non coding variants.

Paper: Local DNA Topography correlates with functional noncoding regions of the human genome.

Impact of SNPs on Local DNA Structure - sometimes this can change the structure a lot.

Use "Chai" to do structure informed evolutionary information
* only about 10,000 overlap "chai" regions
* 2,176 appear to dramatically change DNA shape.

"Chai" spots are "mutation cold spots"
Future plans, look at more tumor normal pairs, and investigate it further.


AGBT 2010 - Penny Chisholm - MIT

From Single Cells to Global Metagenomics: What Prochlorococcus and its Phage Can Teach Us About Life.

It does not cause disease... but all our lives depend on it. Not enough time to discuss both of them.

World view:

* Information genome architecture

* Cellular machinery and physiology

* Population and community dynamics

* Biogeochemistry and Physics (Global Biospheric Process)

First 2 are How, last 2 are why.

Our goal: Study a single microbe at all scales of organization

What is Prochlorococcus?

* smallest and most abundant photosynthetic organism on earth.

* cyanobacteria

* .6-.8 um diameter

* single species - less than 3% difference in 16S rRNA

* very small and simple

* discovered in 1985

Account for 25-70% of photosynthesis in the oceans.

* sequesters up to 5 billion Volkswagen weights worth of CO2.

Light optimum for growth differs among strains

Also come in temperature optimized strains

Can map different ecotypes at different depths of oceans, as well as latitude of the earth.

Prochlorococcus has the smallest genome that can make life from scratch.

* (requires no organic compounds)

* has 2000 genes.

* complete self-sufficiency

* Core genes: saturates at 1250

* Optional genes: up to 5736

What's the global pan-genome?

* has a lot.. and is always changing. (lots of gene exchange)

A lot of data from global ocean survey (thanks Craig Venter!)

[Nice graph with location of core/optional genes along genome. They tend to form clusters in variable numbers.] -- Island regions

80% of island genes are not of cyanobacterial origin. (Probably phage involved in that)

Lots of phage in culture - lots of genome.

* Phage actually carry photosynthesis gene, among others.

2 vignettes:

1. hunt for NO3-utilizing Prochlorococcus

* problem: Synechococcus shares a common ancestor, which can use all 3 forms of nitrogen

* Prochlorococcus: some use ammonia, some use NO2, none use NO3

* confirmed that cells did not contain NO3 reductase.... (used sequencing)

* However, cells were all cultured on NH4+ media...

Went back to database from global ocean survey

* recruit all fragments that had NO3 reductase gene (narB) (Seems to use a cloned library?)

* Find those that also have a known Prochlorococcus gene.

* were able to find and identify them.

* not distributed everywhere, however - these are only found in some regions.

2. modeling the fitness consequence of loss of NO3- utilization in the global ocean.

* used global physical chemical model from Mick Follows (Darwin Project)

[WOW... incredible false color model movie of earth and nutrient concentrations and populations of phytoplankton!!!!!!!!!!]

Where in the oceans are the NO3- loss mutants least disadvantaged relative to null?

Answer: Least disadvantaged in tropical pacific.

* Agrees with the global survey data.

[That is cool...]


* directed isolations for NO3 processing prochlorococcus

* single cell sequencing - using cell sorting.

* would like to know what other genes go along with the NO3 processing gene (narB)

Sea water is loaded with extraneous DNA... doing sorting (twice) reduces contaminating DNA.

Normalization reduces unevenness in sequencing coverage

* Genome recovery as high as 99.6% with reference guided assembly (done with cultured cells.)

Next Topic:

Do Prochlorococcus populations in the Atlantic and Pacific have the same genetic composition?

* metagenomic sequencing with 454 FLX

* Recruit prochlorococcus reads

* analyze frequency of genes - tells you about selective pressures.


* only major difference is.... P (phosphorus) acquisition! (Lots of genes fit here)

* the same thing happens for a different bacterium, Pelagibacter - it's all phosphorus genes, but different ones.

Cyanophages are also carrying phosphorus genes in the pacific.

Thus, simple system with a clear answer.

Would like to do this more, now using Illumina. Gives same results as 454. Will increase throughput.

Last topic: Choreography of cellular metabolism

* cell cycle synthesizes to light dark cycle.

* transcription is dynamic: 80% of genes are cycled through light/dark cycle.

* Proteome: some proteins are dynamic, others are not.

* Many genes cycle, but do not change protein levels much!

Smoking gun: co-evolution between host and phage.... details about calvin cycle, but no time to get into it.


* we now have a global understanding at every level ("integrated systems biology")

* genome to global understanding


AGBT 2010 - Thomas Briese - Columbia University

New Frontiers in Molecular Diagnosis of Infectious Diseases

The Zoonotic Pool (Morse 1993)
* Assuming that 20 viruses exist for each vertebrate, more than 99% of viruses remain to be found.

Recent emerging/re-emerging diseases - threats to global health.
[Pictures of diseases, where they come from.... comic about king Tut being diagnosed with West Nile virus...]

* More serious discussion of history of West Nile Virus.
* Identified separately as both human encephalitis, and death in birds... eventually converged

Hong Kong & Sars...
* unfortunate ad: "Visiting Hong Kong will take your breath away."

Why do we need new methods?
* 40-60% of samples will be unidentified or undiagnosed - acute respiratory infection
* 60% of encephalitis is never explained
* enteric infection is undiagnosed in 40% of cases

Introducing a staged sample for pathogen detection.
* this is faster
* more sensitive
* automate-able
* work with non-infectious nucleotides
* incoming samples screened (multiplexed PCR)
* if not known, move to array...
* if still not known, move to sequencing.
* Surveillance assays, QPCR, Serology, Koch's Postulates...

MassTag PCR.
* New technology for sensitive, highly multiplexed, rapid differential diagnosis of common viral diseases.
* Has software, aligners, etc.

* primers are tagged, pcr purification on plate, elute to well, use mass spec, automated injection, identification of pathogen by signal analysis.

Has been applied - eg, New York State, 2004-2005.
* decrease in positive signals for influenza, so a retrospective study was undertaken.
* 151 patient samples - found the same results as previously, but were also able to diagnose 30% more samples. eg, triple infections, etc.
* undiagnosed also included rhinoviruses (common cold)
* many rhinoviruses turned out to be type C, which was unexpected. [not sure if it's because type C wasn't well known, or if it's just these types were not known.]

Not everything is resolved by multiplex PCR... Greene chips.
* named after Donor.
* highly multiplexed
Task 1: Select probes; devised a selection mechanism that represents all sequence in GenBank
* Quarterly results
Task 2: Amplify and label. Random amplification in sample, and lift to clinically relevant detection level.
Tested a sample on MARV, couldn't be found in standard pcr, etc...
ended up diagnosing as something else... plasmodia? [not sure how that fits with the gene chip.]

Pathogen discovery by NGS.
* used to sequence EVERYTHING in sample.
* if massTag PCR and GreeneChip fail, do HTS.
* Started with Colony Collapse Disorder
* All technologies were geared to vertebrates, but they did isolate a virus... however, not necessarily associated with causation.

Example: Transplant associated pathogenicity.
* Had a difficult time, as virus was very different from known viruses... only 14 tags observed
Forgotten Scripts and the puzzled librarian.
* How do you work with lost languages?
* Use patterns or word probabilities
* Did the same thing for viruses.

Unknown Disease of Parrots
* took tissue from brain of parrots
* discovered Bornavirus

[This talk is incredibly difficult to follow - it's a lot or anecdotes on a theme. I'm sure that's reflected in my notes. Each item is interesting, but it's difficult to tie the common elements together to form a single picture. Mixing method anecdote and acknowledgements together...]

Nosocomial infection
* all died - except patient number 5.
* no disease was initially identified.
* eventually discovered new Hemorrhagic fever by sequencing.
* it has a very diverse biology, sequence, etc.

Hot spots for infectious diseases.
* Go to where the emerging diseases are likely to be..
* Look at the reservoirs
* work on preventative issues.
Projects include
* bats from Bangladesh

Lastly: Unexplained encephalopathy
* HTS revealed two sequences with 30% homology to mink astrovirus
* Not an efficient path.
* PCR gave more,
* Eventually identified new astrovirus.
* caused them to rethink pipeline, which is leading to new methods.
* better performance.

Prospective birth cohort samples with Norwegian govt.
* building a biobank from mothers, cord blood, fathers, etc

Quote from Einstein

Student: The questions are all the same from last year
Einstein: True, but this year the answers are all different

Question: how often are outbreaks followed up? And how many can't be solved?
Answer: depends on how many they are made aware of. The better the quality of the sample, the better the odds of getting the disease. Can even handle tissue samples.


AGBT 2010 - David Wang - Washington University in St. Louis School of Medicine

Metagenomic approaches to Pathogen Discovery

Disease this group studies
* Respiratory tract infections
* Diarrheal diseases

Which viruses cause Respiratory tract infections? Lots!
* 20-30% can not be associated with any known virus.

genomic approach
* Get nucleic acids
* Use ViroChip
* use Mass Sequencing
* now use NGS - 454 + a little Illumina

Lots of this was published - 20,000 sequences on an array... etc

Good example for the application of this information: SARS.
* Collaborated with CDC
* ViroChip identified coronavirus signal
* Elution and sequencing of SARS fragments from microarray (affinity capture-sequencing.)
* Sanger shotgun sequencing of SARS culture -- 80% of SARS genome.
* (Rota et al, Science 2003)

Applications to respiratory infections

Genomic screening for Novel viruses - pre-NGS
* start with 3 year old with pneumonia
* Sample was negative for all platforms at the time... used Sequence Analysis Pipeline (High throughput sanger sequencing.)
* less than 50% identity to known viruses

Polyomavirus background
* circular, double stranded DNA
* infect mammals and birds
* two human viruses of this type were known at the time
* has a T-antigen (may be cancer related)
* pathogenic in compromised individuals
* persistent once contracted.

Novelty of the virus they isolated
* thought they'd found the 3rd polyomavirus
* turned out they HAD found a 3rd... and another group in Sweden found a 4th by the same method at the same time.

* prevalence is 1-8%
* 80-90% of adults have antibodies against WUV.
* Infection peaks in early childhood.
* There is still a lot of work left to understand how they work, what they're doing, etc

* can it be cultured?
* is it associated with cancer, respiratory disease, etc?

On to Diarrhea...
* 1.5-2M children die from acute diarrhea each year
* Major viral causes of diarrhea
* not much activity on this now.
* even today, 40% of diarrhea cases can't be attributed to a known virus.

Start with Pediatric
* children had failed known screening.
* 1864 high quality unique sequences - 14 were viral
* as many as 3 viruses in one sample
* evidence for novel viruses.

* ssRNA ~6-8kb in length
* only one species had been described - 8 serotypes

Novel Astroviruses fragments detected.
* MLB1 is clearly a novel astrovirus
* does not cluster with other serotypes
* globally widespread!

1. is it a cause of human diarrhea?
2. does it cause disease elsewhere, but shed in stool?
3. is it commensal/symbiotic part of the virome?
4. is it a result of dietary ingestion?

How do we go deeper into this story?
* pooling and NGS!

* Gastroenteritis outbreak in day care, Virginia
* Found another novel Astrovirus
* ~2000 sequences - 313 novel
* Assembled all of this:
** 6.5 kb contig
* one single run of 454 generated the entire genome, with 5' end missing 5 nt.

* Same questions as before, however.

In new work, have now identified 3 new astroviruses.

What is better than one stool sample? Look at 1000's of them!
[Picture of a field of toilets...]

Working now, with raw sewage samples. Fecal and urine microbiome. Correlate back to individual viruses.


AGBT 2010 - Julie Segre - NHGRI

Human Skin Microbiome

Each human cell has the same protein encoding potential. Microbes are more diverse and dynamic than the human genome.

Our interactions with our environment strongly affect our microbiome - they bring a huge diversity of interactions - and may affect our ability to develop personalized medicine.

Skin: Barrier to infection, but also home to our microbes.

We can investigate through the use of 16S rRNA. 16S rRNA is more variable in loops than in stems. This has to do with conservation of structure.

HMP spent a lot of focus on amplification strategy that accurately represents bacterial population (match with cultured isolates) with Sanger and 454/Roche.

Do we get more information if we sequence than by using traditional culture-based methods?

Test: Parallel swabs - culture on various agars vs. sequence. You see a huge bias in the culture samples. Staph are very good at culture - so they swamp out the other actinobacteria, cyanobacteria, etc.

Do this for the whole human body - eg, rash diagnoses are dependent on *location*. Human cells are the same, but bacteria change depending on the environment. Oily vs. dry vs. moist, vs.... etc. [Let's not go too far with this.]

[Neat.] Variation between sites is greater than variation between individuals.

[also neat:] More diverse sites are more stable.

Sanger sequencing doesn't give nearly enough depth about the rare factors - which is a huge component about what's going on at any given site.

4 vignettes: [I only counted 3... maybe I missed or grouped two together.]

1. Kid with Severe Eczema. (Atopic Dermatitis)
* Chronic episodic itchy red skin
* Prevalence 15% in USA
* incidence has tripled in the last 30 years
* Treatment: topical or oral antibiotics and /or steroids
* 50% of children with moderate to severe AD will go on to develop asthma and/or hay fever. (Have morbidity and mortality associations.)
* Compare with healthy twin: During flare, diversity is lost, staph is overabundant. Post treatment, diversity returns.
Ask: does 16S have the granularity we need to understand AD? Will we need metagenomics? (Bacteria are undergoing gene transfer/sharing - it may not be sufficient.)

Test: use 454/Roche-XLR to find out what the genome looks like.
* Gene content is HIGHLY variable. 80% of the genes are identical - 20% of genes are variable. The genome sizes, however, stay relatively constant.
* 80% is core genes
* 20% are defense mechanisms, etc.

Why do we care about staph?
* It's the most common hospital-acquired infection.
* they form biofilms
* infect on medical equipment.
* when staff wakes you up at night to do a blood draw, they're looking for staph - do you have staph in your blood?

Use genomics to identify difference between "medical device" vs. "commensal" strains.

2. Does your immune system shape your microbiome?
* Patients seeking cancer treatment often become temporarily immuno-compromised.
* Patients lack Th17 Cells.
* Serratia predominates on Hyper-IgE skin, lack of other Proteobacteria.

3. Chronic skin wounds: A complication of many diseases, especially diabetes
* Half of total cost of skin diseases ($10 billion/yr)
* antibiotic treatment common with minimal efficacy
* Many bacteria associated with delayed wound healing based on culture studies.
* Diabetic mice model
** diabetic mice have an increase in staph, and microbiome composition changes.
** retain expression of immune response genes longer than non-diabetic
** Staph abundance positively correlates with skin expression of defense genes.

How to bridge genomics with microbiology.
* Still don't have a good equivalent of a LOD score

To do:
* greater bacterial diversity than appreciated by pure culture studies.
* Establish bacterial baseline of human skin sites
* Informatics challenges
* ... [2 more I missed]

Human Microbiome Project (HMP)
* not about disease, but about health.

Are pro-biotics and anti-biotics promoting health or disease? Is this neutral to our health or our environment.


AGBT 2010 - Carlos Bustamante - Stanford University School of Medicine

Complete Genome Sequencing and Analysis of a Diploid African-American and Mexican-American Genome: Implications for Personal Ancestry Reconstruction and Multi-Ethnic Medical Genomics

Motivation and Objectives:
* GWAS has been successful
* Many traits, however, are not being explained by GWAS
* Understanding rare and common genetic variants will require multi-ethnic sequencing

* reseq two admixed genomes to high coverage
* compare to population and demographic models
* Understand diversity [?... missed this point]

* Establish resource for studying human population genetics, recent demography and admixture
* Using Affymetrix 500k
* 500 samples

You can cluster by ancestry using principal component analyses.
* Admixing: S. Asian and Mexican.
* Using PC 3 and PC 4, you get a huge amount of diversity from Native Americans that isn't sampled by current chips.

Think about approaches that are "dated ribbon?"
* proportion of african ancestry P = b / (a+b)

[I'm really missing stuff in this talk - it's very quick, and I know nothing about admixing.... will read up on that later.]

* PCA along windows across genome
* Use HMM for Admixture Estimation in African Americans
* This identifies "Ancestry switchpoints" which seem to be cross over events that skew towards one or the other ancestry within the same chromosome.
* Multiple events in one chromosome are possible.
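To make the windowed-HMM idea above concrete, here's a toy two-ancestry Viterbi decoder. This is my own sketch, not the speaker's method: the per-window log-likelihoods, the two-state simplification and the switch probability are all assumptions on my part.

```python
import math

def viterbi_ancestry(window_logliks, switch_prob=0.01):
    """Toy 2-state Viterbi decoding of per-window ancestry.

    window_logliks: one (loglik_ancestry0, loglik_ancestry1) pair per
    genomic window. Returns the most likely ancestry label (0 or 1)
    for each window; changes in label mark candidate switchpoints."""
    stay, switch = math.log(1 - switch_prob), math.log(switch_prob)
    dp = [window_logliks[0][0], window_logliks[0][1]]
    back = []  # back[t][s] = best predecessor state for state s at step t
    for emit in window_logliks[1:]:
        new_dp, ptr = [], []
        for s in (0, 1):
            from0 = dp[0] + (stay if s == 0 else switch)
            from1 = dp[1] + (stay if s == 1 else switch)
            best = 0 if from0 >= from1 else 1
            new_dp.append((from0 if best == 0 else from1) + emit[s])
            ptr.append(best)
        dp, back = new_dp, back + [ptr]
    state = 0 if dp[0] >= dp[1] else 1  # best final state, then trace back
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

Runs of identical labels correspond to ancestry tracts; each label change is a candidate "ancestry switchpoint" of the kind described above.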

Individual ancestry results:
* You get a lot of variation in content across single chromosomes; you can then quantify this amount.
* Latin Americans, however, are all over the place - they are really mosaic.

Great variety in amount of ancestry and location of breakpoints.

Take home message:
Personal ancestry reconstruction, including detection of admixture tracts, is feasible on a genome-wide scale

How to improve ability to deconvolute this using Sequencing.
* use reference human genome samples sequenced with SOLiD.
* includes 100 genomes data

New Tools:
* "STRUCTURE 2.0-like" algorithm -- J. Degenhardt

Reconstruction shows each chromosome in different colors, representing which of the ancestries is likely at each window.
* Can see small regions. Are they important? Are they real? Do they matter?

Haplotype-based Admixture deconvolution
* can reveal fine-scale admixture.
* Seems the signals (small regions) are real.
* Lots of small regions (segments) of diverse ancestry signatures in the genome.
* Do they happen at hotspots?

Looked at the Mexican-American genome
* Many more switch points than previous example (African)
* [Sums up history.... ]

Distribution of Ancestry switches is used to compare
* can look at history of mixture - correlates with how long the populations have been mixing
* Scales as (1 + k) / (1 + theta).

Can use this information to find "time to most recent common ancestor" (TMRCA)

Extrapolate this to show lengths of time for whole chromosomes and genomes.
* TMRCA varies dramatically along the genome.
* Also fits nicely with SNP work that people have done (dbSNP mentioned as well.)

Functional implications
* discovered ~10,000 nsSNPs in each genome (varies by individual)
** Some might be deleterious...
* Functional annotation of nsSNPs using PolyPhen
* Show that admixed populations share more snps, I think.
* snps that are probably damaging are highest in CEPH and MEX.

[Moving very fast over a bunch of slides showing the same message - no notes here.]

Bottleneck in European founding population - Europeans show more deleterious SNPs.

Demographic models explain difference in dN/dS

* 3M snps for each genome
* 10k nsSNPs
* thinking about demographic history is important
* we've really only been working on one small bit of the diversity of the human genome. More will be necessary moving forward for medical applications.


AGBT 2010 - Arend Sidow - Stanford University School of Medicine

Extremely High-Resolution Nucleosome Organization Maps and Gene Expression Analysis in Purified Human Cells

Interested in how sequence and regulation interact in the cell. Focused on the role of nucleosome organization. Long standing collaboration : Anton Valouev, Steve Johnson, and LifeTech.

[Review of Compaction of DNA. bases, wrapped on nucleosomes, clusters of nucleosomes, extended form of DNA, condensed sections, chromosomes...]

Very fancy picture of DNA-Nucleosome Structures - ~156 bp wrap around each nucleosome.

Require regulatory elements to release DNA from nucleosomes.
* what are most people thinking about? Histone modifications. However, this is not what they're interested in.
* Instead, looking at how DNA is organized on nucleosomes.
* Nice figure of averaged coverage of nucleosomes around promoter -> gene region. Shows open regions, etc.
* Fits nicely pol II binding.
* Data for specific cell types

* DNA Sequencing
* Local gene regulatory functions/protein
* Global/cellular parameters

Work in the field:
* Yeast (Segal, Widom and colleagues)
** Predictions on how they work
* C. elegans: Fire, Sidow and colleagues
** fits slightly differently than Yeast
* Human... may yet be different from the others.

Org chart:
* Maps of nucleosome position -> parameters of cellular nucleosome organization -> merge with (sequence-dictated preferences of nucleosomes) -> form models

Illustration of the technique:
* Micrococcal nuclease digest used
* Use high salt concentration. (1 nucleosome/kb?)
* Used 4 cell types (CD4 and CD8 T lymphocytes, granulocytes, in vitro reconstitution, & control)

10bp Rotational Setting
* Pronounced propensity for AA/AT/TA-rich sequence to be where the minor groove contacts the nucleosome
* GC-rich where the major groove contacts the nucleosome?
* Overall GC preference for nucleosome binding.

[I think I missed something in the above explanation.]

Phasing isn't always pretty - it has gaps and isn't perfectly even.
Plot phasing - you get an aggregate data set
* peaks are not even and smooth - the signal-to-noise starts to get messy.
* this phasing diagram is called a "phaseogram".

Positioning can drive phasing in a stereotypic way.
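As I understand it, the phaseogram is essentially a histogram of nucleosome-midpoint offsets downstream of well-positioned "anchor" nucleosomes; peaks at multiples of the core+linker repeat indicate phasing. A toy sketch, with coordinates as plain integers (all names here are mine, not the speaker's):

```python
from collections import Counter

def phaseogram(midpoints, anchors, max_dist=1000):
    """Histogram of nucleosome-midpoint offsets downstream of anchor
    (strongly positioned) nucleosomes. Peaks at multiples of the
    repeat length (e.g. ~193 bp core + linker) reveal phasing."""
    counts = Counter()
    for a in anchors:  # naive O(len(anchors) * len(midpoints)) scan
        for m in midpoints:
            d = m - a
            if 0 < d <= max_dist:
                counts[d] += 1
    return counts
```

Plotting the counts against offset gives the aggregate phasing plot described above, messy signal-to-noise and all.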

Core + linker (193bp)
linker ~ 47bp (at least in this cell type)

Linker distance tends to vary slightly in cell type (eg, granulocytes vs. CD4/CD8).
in CD4/CD8, it's closer to 55bp.

Thus, if linker lengths change slightly, then the rotational setting and the sequence specificity must be weaker.

Sequence alone is incapable of positioning nucleosomes next to each other.
However, you DO see a sequence signature when nucleosomes assemble spontaneously. This seems to TUNE the nucleosome placements.

In vivo, this still seems to happen. Specific signals are highly enriched for positioning sites. However, these are relatively sparse. Phasing is also clean in close proximity to these locations.

Binding Proteins:
* Looking at NRSF binding sites, there is strong enrichment for phasing on either side,
* NRSF actually sits at this location, not a nucleosome,
* This, however, nucleates the phasing location.
* Same thing with CTCF
* the actual spacing between the nucleosomes is still cell-type dependent; however, that depends on the biology of the cell, not on the factor itself.

In vivo vs in vitro dinucleotide coverage
* GC likes nucleosomes in vitro
* AT does not like nucleosomes in vitro
* In vivo, none of these seem to matter at all; biology overrides it... with exceptions!
* CpG islands are really liked by nucleosomes; however, the biology overrides this, as they are usually nucleosome-free.

Expressed vs non-expressed genes.
* As expression level rises: coverage falls (though, still at 90%)
* However, spacing drops more obviously. (200bp is default when not expressed, drops by 10% when expressed) [I hope I got that in the right order]

* Sequence preferences
* Nucleation of nucleosome arrays
* inter nucleosome distances

All combine to decide nucleosome spacing.


AGBT 2010 - Michael Zody - Broad

Detecting signatures of selection in domestic chickens by whole genome re-sequencing

Work mostly done at Uppsala University

Chickens were domesticated ~8,000 years ago.
Mainly descended from red jungle fowl

Sequence pools of chickens - what are most variant alleles?
Used SOLiD sequencing - 4-5x coverage for each chicken

Model organisms... this is a different type of model.
Question: Can we find the loci that underlie variations that people have selected for in domestication?

Why use chickens?
* numerous QTLs
* small genome
* etc.

Draft chicken genome already
* 3 million snps (capillary reseq)
* microsatellite and snp maps
* 38 autosomes + W and Z
* 10 are macrochromosomes
* 28 microchromosomes - high GC + recombination
* 9 of them have no sequence yet in draft.

Genome is 1Gb
* Female has WZ, male has ZZ.
* W is poorly assembled.

Two libraries of red jungle fowl.
* one from a pool of 8 from a zoo
* second was the reference genome.

Also did some commercial chickens (broilers)
high growth and low growth from "White Plymouth Rock", selected for high and low growth.
high growth chickens have a huge appetite - will eat themselves to death
low growth chickens are anorexic... won't reproduce at all if not required to eat.
Other strains:
* Rhode Island
* White Leghorn

[Wow... lots of types of chickens... I'm missing some of this part]

80% of genome covered in ref.
7 million snps identified - Corona Lite caller
Most snps are high quality - tested with arrays, and get 98.8% validation
Some errors seen are in the reference chicken

saw some evidence of grey jungle fowl in the domestic bird. (Yellow skin....)

Selective sweeps
* looking for regions of homozygosity.
* use Z score < -6
* proof of principle: BCDO2, yellow skin
* graph showing windows of heterozygosity - drops dramatically around BCDO2
* demonstrate power to do sweeps.
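A minimal sketch of that windowed scan, assuming per-window pooled heterozygosity values are already computed (the genome-wide Z normalization here is my guess at the ZH statistic, not the talk's exact definition):

```python
import statistics

def sweep_candidates(het_by_window, z_cutoff=-6.0):
    """Flag windows whose pooled heterozygosity is extremely low
    relative to the genome-wide distribution - candidate selective
    sweeps (e.g. the drop around BCDO2)."""
    mean = statistics.fmean(het_by_window)
    sd = statistics.stdev(het_by_window)
    return [i for i, h in enumerate(het_by_window)
            if (h - mean) / sd < z_cutoff]
```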

Can do this over the whole genome.
* 23 loci in "all domestic" had one or more windows ZH < -6
* 9 loci in broilers only, 6 in layers only.
* TSHR - early domestication locus in chickens?

Key step in domestication of chicken was the lack of seasonal reproduction check
* Look for this in genomes.

Also look for deleterious mutation
* eg, myostatin mutation in double-muscled cattle
* Loss of function may be positively selected
* may accumulate due to relaxed restrictions

look for gene mutations fixed in one or more lines
* remember, using pooled data

* removed anything not covered in reference
* calculate p-value
* correct for multiple testing
* only considered autosomes
* deletions matched to Ensembl
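"Correct for multiple" presumably means something like a Bonferroni-style adjustment across all tested sites; the talk didn't specify the method, so this one-liner is purely my assumption:

```python
def bonferroni_keep(p_values, alpha=0.05):
    """Keep candidate deletions whose p-value survives a simple
    Bonferroni correction: p * n_tests <= alpha."""
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p * n <= alpha]
```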

1,284 mutations
* 27 in protein coding regions by annotation
* only 7 of them were real deletions
* validate by sanger

Known deletion (GHR)
* turned out to be a good control.
* detected as 1802bp deletion in commercial broiler 2
* removes the C-terminus, modifying the protein
* causes sex-linked dwarfism
* used to reduce growth.... interesting aside about how this deletion is used commercially

Novel deletion (SH3RF2)
* removes exons 2-5. (human)
* deletion in high-growth line
* possible problems in annotation
* found in QTL region

Phenotype-genotype correlation
* strongly associated with growth rate
* affects males and females
* Effect not linked to parent of origin.
* Tested expression - doesn't change at exon 1
* No expression in exons 2-5 in high.
* This gene may play a role in appetite

May be a great model for future work on other domesticated model organisms.


AGBT 2010 - Stacey Gabriel, Broad Institute

Applications of New Sequencing Technology to Medical and Cancer Genetics

Testing the full range of allele frequencies
* High frequency polymorphisms = 90% of heterozygosity
* Low frequency polymorphisms = 9%
* Rare mutations = <1%

As the cost of generating sequence drops, the ability to generate sequence rises quickly; we can now work on rare variants more effectively.
1. 1000 Genomes
2. Exome sequencing
3. Genome and exome sequencing in cancer - the hardest set.

1000 Genomes:
* Shotgun and targeted sequencing in reference samples
* Very large collaborative project
* Should get all variation down to 0.5%.
* Will enable better array development in the future

Analytic pipeline
* unmapped reads to genetic variation.
1. data production
2. mapping (QC and calibration)
3. pilots (shallow or deep)
4. Bayesian modeling
5. Variant calling
(check out poster on this)

Pilot results:
* SNPs (18M SNPs, 50% are novel)
* FDR of novel variants <10%
* 1.5M short indels (Validation under review)
* CNVs (~10,000)
* All of this will be in dbsnp131

What does it mean?
* Now, 90-95% of snps called in any genome have already been discovered
* dbsnp is growing a LOT in 131.

Move on to de novo Mutations (or rare mutations)
* Exome sequencing ($5k -> $1k per sample in the future)
* Whole genome ($50k -> 10k per sample in future)

There will always be a need to do targeted exomes.

Goals of sequencing exomes at scale
* > 95% of alignable exome at 10% or less of the cost of whole genome sequencing
* Remove sample prep capacity as a bottleneck

Sequencing more individual molecules reduces the chance that PCR errors become high quality snp calls

Targets: v.1.1
* 18,560 Genes
* 188,260 targets
* 32.7 Mb target territory
* 473k baits
* 120 bp bait length
* 41 Mb bait territory

Exome coverage: mean coverage 150x
* mean 15k snps, 88% concordance (1 snp per 1,700 bases)
* Scale up : 1500 selections every month
* should go towards 8000 per month
* planning to do 5000 by end of year

* First example: pilot - Extreme LDL-C
* 6 children with high LDL (95th percentile LDL)
* Previously screened by other labs for normal genes, nothing found
* Called snps, ended up with 16,000; not in dbsnp: 784; not in 1000 genomes: 512; not in 60 controls: 286
* Of that, synonymous: 105, missense: 170, premature stop: 4, splice sites: 7.
* So, look at burden on genes -> no smoking guns, however. There are interesting glimpses.
* 2 of the individuals had 2 mutations missed in earlier screening
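The filtering cascade above (called SNPs, minus dbSNP, minus 1000 Genomes, minus controls) is just successive set subtraction; a sketch with hypothetical variant IDs:

```python
def filter_candidates(called_snps, *known_sets):
    """Successively remove variants seen in each reference set,
    mirroring the cascade above (called -> not in dbSNP ->
    not in 1000 Genomes -> not in controls)."""
    remaining = set(called_snps)
    for known in known_sets:
        remaining -= set(known)
    return remaining
```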

Another example, 5 families, 6 samples - Low LDL
* Found one gene with 2 stop mutations. Has been implicated in other LDL studies.
* Sequencing data came off this weekend.

Cancer:
* New mutations that occur only in tumour and not in matched normals.
* Whole genome shotgun sequencing.
* Looking at all types of changes
* on 120 genomes (60 tumour/normal pairs)
(Circos diagrams!)
* Metrics: mutation rate, copy number, interchromosomal, intrachromosomal
* Won't do well until we start looking at similar tumours of one type
* Need to sequence ~500 samples for good statistical power

Multiple Myeloma:
* Have enough samples to do this.
* Incurable - median survival ~4 years
* 26 pairs whole genome
* 17 pairs whole exome

* point mutations: avg 40 mutations per sample (95% validation)
* rearrangements: 200 candidates per sample (30-50% validation)
* Nothing recurrent (rearrangements)

* Mutation count - 1.5 mutations per Mb.

* Significant mutated genes: Yes.
* NRAS, KRAS, TP53, etc.

Pathway analysis also shows some good information: coagulation pathway (novel finding)

Found known mutations
* Found activating mutations in BRAF
* Found coagulation pathway
* DIS3 mutations and FAM46c mutations (25% of cases)
* Mutations in HOX modifying genes
* IRF4 mutations (two identical mutations) - known required for MM survival

Sequencing is getting easier and cheaper.... Firehose analogy.

* analysis and interpretation
* data quality


AGBT 2010 - Keynote Speaker: Debbie Nickerson

Debbie Nickerson - University of Washington, Department of Genome Sciences

Exome resequencing: Next Generation Mendelian Genetics and Beyond

Start with acknowledgements: Sarah Ng, Jay Shendure, Emily Turner, Mike Bamshad

Exome sequencing: 180,000 exons, 30 megabases (inc. splice sites & RNA)

Why? The exome is only ~1% of the genome, so we can sequence more exomes for the cost.

Disadvantage: missing non-coding variants.

Advantages: larger effect sizes, more interpretable, easier to follow up.

Many different approaches (just 3):
* most of today's work is array hybridization; however, it doesn't scale
* in-solution hybridization scales better
* also, molecular inversion probes

Exome captured by Nimblegen.

Compare GAIIx vs HiSeq: 2.7 times more data (50x -> 138x). Will now do multiple exomes per lane.

Why tackle Mendelian disorders?
* most impactful
* foundation of human genetics
* more insights into gene function
* thousands remain unsolved

Example published last year: Freeman-Sheldon syndrome. Sequenced individuals and identified a gene that was already known.

Example currently presented: Miller syndrome (postaxial acrofacial dysostosis). Presumed autosomal recessive. (Recessives will be easier.)

Sequenced 4 exomes from 3 families.
* two individuals were siblings
* includes CF-like lung function in siblings
* other: facial deformities, limb development problems
* recently published


First pass: ~2,000 snps

Filter on dbsnp129 (may not be valid in the future... more and more complex traits won't work well at all.)

Brings down to 12

Further filtering on HapMap (8 individuals) brings it down to 3 genes

Candidate genes quickly become apparent - recessive diseases are relatively easy this way.
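For a presumed recessive disorder, the approach boils down to removing known variants and intersecting the remaining candidate genes across affected exomes. A toy sketch (function and input names are mine, not from the talk):

```python
from functools import reduce

def shared_candidate_genes(genes_per_exome, known_variant_genes):
    """Keep genes carrying novel variants (after dbSNP/HapMap-style
    filtering) in every affected exome; the surviving intersection
    is the candidate list (e.g. DHODH in Miller syndrome)."""
    filtered = [set(genes) - set(known_variant_genes)
                for genes in genes_per_exome]
    return reduce(set.intersection, filtered)
```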

Turns out the gene is a key enzyme in de novo pyrimidine biosynthesis (DHODH). Important in limb development - analogous to Drosophila wing mutations. (Found originally in 1975)

What about the lung phenotype - found a gene that is common to the siblings

Back to exome capture: not all diseases have worked out this way. Some have been unsuccessful.

Starting a project to scale exome analysis to complex traits. (Heart and lung)

Looking at other rare traits - including early onset and extremes of phenotypes

* Early onset MI
* Extremes of LDL cholesterol
* Extremes of BMI
* Lung diseases: pulmonary hypertension
* COPD... etc.

How do we replicate this study? Following up on 1000's of samples with ~50 genes is daunting. We'll have to work out how these findings can be replicated in the future.

[missed some stuff at the end when blogger crashed and lost the end of the lecture. - sorry]


AGBT 2010 is underway

Well, here's my first post from AGBT - I've only got a few minutes to get something out quickly before the keynote starts, so it'll be short.

I guess the important parts: The posters are starting to sprout like weeds after a spring rain shower, and I'm eager to go check them out. There are a few I've seen already that look really neat. The other part is that I've seen several bloggers already - and looking far more awake than I am this morning... I imagine that the blogging coverage of AGBT is going to be pretty impressive.

Just looking around at the crowd as they file in, I also recognize a lot of faces. It seems like everyone is here and ready for the torrent of talks and information to start...

Let the games begin!


Monday, February 22, 2010

The eve of AGBT 2010

I guess everyone is getting into the AGBT mood already. The list of Tweets is starting to grow @ #AGBT. I've even seen a quick blog entry from Luke Jostins on his plans for AGBT. (And, just for the record, getting eaten by crocodiles really does seem to be reserved for PI's - us bloggers are probably safe. Just as long as you're not sharing a van with the PI's from the BC Genome Science Centre... did you hear about the time they got stuck in the swamp?)

For those of you who don't know, AGBT is the once-a-year conference where all of the newest, shiniest technology for sequencing comes out. Sort of like Comdex for Sequencing geeks, I suppose. You can check out the full agenda and list of speakers at, and yes, it's worth checking out. Pretty much everyone who's into sequencing technology and analysis will be there. Well, those who got their applications in early - they cut off registration before the abstract deadline, as far as I can tell.

Unfortunately, unlike a lot of the other people who are leaving tomorrow, I've got a nearly-full day's worth of work ahead of me, as I don't leave till early Wednesday morning. While everyone else is in the air, I'll be doing some database admin, some debugging - and hopefully a bit of analysis on at least one experiment. And then, of course, I'll head home to pack. Fortunately my poster has already been sent by fedex, so I just need to pack what I'll wear, my swimsuit - and of course, my mac.

For those of you who know me, you'll know I'm not a mac person: I prefer macs to microsoft products, but really, I'm a Linux guy. Unfortunately, my linux laptop doesn't travel well - and the mac (which is usually loaned to my fiancée to keep her off of my linux boxes) goes into my backpack. In fact, I'll have to find a good twitter client for the mac before I get on the plane. Anyone have any suggestions?

Anyhow, the other thing I'll be using the mac for (aside from keeping in touch with my fiancée) is blogging. Yes, I'll try to match last year's performance of blogging all of the talks I see. I'll be looking at conference policies when I arrive, but I imagine most of the speakers won't be too upset to have me discuss their talks online. I will probably reduce my personal comments - my blog has a wider audience than it used to and I'd hate to offend anyone with an unintended insult. It's only prudent to be careful about what I say - although hey, it is my blog.

So, finally, I figured I should leave a comment on what I'm looking forward to:
  • Of course, I hope to meet up with the other next-gen bloggers - I learned a lot from them last year, and I'm sure they still have more to teach this year.
  • I'm looking forward to seeing the new technologies from the major vendors - I'm still a huge fan of Pacific Biosciences SMRT technology and I can't wait for their updates.
  • I'm very excited to hear from Complete Genomics - they were somewhat the dark horse of last year's presenters, but they've just announced a lot of great contracts. With any luck, I'll have the opportunity to discuss some bioinformatics with them.
  • I'm looking forward to all of the networking. Hopefully, by this time next year, I'll be done with my degree, which means it's nearly time to start looking for jobs/post-docs.
  • I'm looking forward to meeting the friends I made last time - and I hope to make a few more.
  • I'm looking forward to seeing some damn cool science. Hey, last year I learned how the Burrows-Wheeler algorithm works, and I'm sure there will be more interesting posters as well.
Well, the countdown to AGBT has started. I'm definitely excited.


Wednesday, February 10, 2010

Citation for parameters for accurate genome alignment

I just saw an interesting article citation on twitter (via BioInfo - his tweet):
Frith M, Hamada M, Horton P. Parameters for accurate genome alignment. BMC Bioinformatics. 2010 February;11(1):80+
Unfortunately, it's not yet available as early access... but I imagine it will be tomorrow. Sounds like a good read!


Way off topic: Disproving God with the double slit experiment?

Ok, I've been watching a lot of Hitchens and Dawkins recently on YouTube, so maybe it's not a surprise that I would eventually start thinking about the age-old question: "is there a god?"

After thinking about it for a while, I'm eventually left with a simple logical problem.

If you know the classical quantum physics problem, (youtube video here), you'll know that an electron can act as a wave or a particle, depending on whether or not there is anyone watching it. When no one watches it, the electron is capable of transforming itself into a wave and interfering with itself.... Yes, that sounds odd, but that is in fact the nature of the universe we inhabit. However, when someone observes it, the electron is forced to pick a single path through the two slits, and you get a different pattern emerging beyond the two slits. (The video explains it fairly well, albeit with cheesy narration by a disembodied head...)

The only thing that decides whether it passes through one or both slits is whether it is observed to do so or not. A second piece of information about the universe we live in is that observation requires that we interact in some small way with the universe. You can't observe an event or its indirect effects if there is no information escaping the event you're observing. (think black holes...) So, if you are observing something, you are, in a small way, interacting with the event - either monitoring the change in an electric field (interacting with the event), or an escaping photon or other form of wave. No matter how you slice it, an observation is an interaction.

So one thing that puzzles me with the double slit experiment and the existence of a god (and I don't care which one), is that the electron is able to transform itself into a complex wave of all possible paths when we don't interact with it. As long as we humans (or any device we can conceive of) aren't watching it, the electron will travel through all possible paths. Once we directly observe it, the electron has no choice but to follow a single path.

If observation prevents the electron from being a wave, and non-observation allows the wave, doesn't that mean that when we see the results of the electron acting as a wave, no one else is watching it either?

If you assume there is a god watching everything, how can you explain that the electron is ever able to become a wave and travel through all of the slits when there is a constant observer? The only escape is that you propose your god can observe without interacting with the universe.... and that opens the can of worms that the god must not be part of the universe. (Yes, all things in the universe interact, so not being a part of the universe necessitates non-existence... just like the Ether. If you can't ever interact with it, it doesn't exist.)

Food for thought. Clearly this argument won't convince anyone with faith, but somehow it seems to destroy the concept of an all-seeing god for me. And now, back to genetics...


Friday, February 5, 2010

GFF3 undocumented feature...

Earlier today, I tweeted:
Does anyone know how to decypher a diBase GFF3 file? They don't identify the "most abundant" nucleotide uniquely. seems useless to me.
Apparently, there is a solution, albeit undocumented:

The attribute "genotype" contains an IUB code that is limited to either a single-base or a double-base annotation (i.e., it should not contain H, B, V, D or N, but may contain R, Y, W, S, M or K). You can then subtract the "reference" attribute (which must be a canonical base) from the "genotype" IUB code to obtain the new SNP - but only when the "genotype" attribute is not itself a canonical base.

If only that were documented somewhere...
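For what it's worth, the rule as described can be sketched like this (my interpretation only - and as the update below notes, it doesn't always hold):

```python
# Two-base IUB ambiguity codes, plus single bases mapping to themselves.
IUB = {"R": {"A", "G"}, "Y": {"C", "T"}, "W": {"A", "T"},
       "S": {"C", "G"}, "M": {"A", "C"}, "K": {"G", "T"},
       "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"}}

def novel_allele(reference, genotype):
    """Subtract the "reference" base from the "genotype" IUB code.

    Returns the non-reference allele for a heterozygous call, the base
    itself for a homozygous-variant call, or None when there is no new
    allele (or when the reference isn't among the called bases)."""
    alleles = IUB[genotype.upper()]
    ref = reference.upper()
    if len(alleles) == 1:  # canonical base: homozygous call
        (base,) = alleles
        return base if base != ref else None
    if ref not in alleles:  # the failure mode the update describes
        return None
    return (alleles - {ref}).pop()  # het: subtract the reference
```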

UPDATE: Actually, this turns out not to be the case at all -- there are still positions for which the "genotype" attribute is an IUB code, and the reference is not one of the called bases. DOH!

Labels: , , ,

Thursday, February 4, 2010

Why do you blog - part II?

In the first part, I answered the generic "why do you blog" questions. In the second part, I wanted to address one of the questions Heather Etchevers asked, because it really is the core reason why we blog:

Who are you blogging for/who are you talking to?

After much soul searching, I have to give the answer that probably lies behind all blogs: I blog for myself. If I didn't get something out of it, I wouldn't be doing it. Although, that doesn't mean there's nothing altruistic about the time I invest - there are people who find some of the information I post to be useful. It makes me happy when I get a comment telling me that something I posted made someone think, answered a question or just helped them get their computer configured. Yes, I (unashamedly) love to help people, and that is what I get out of blogging.

The less subtle question implied is "who do you think your target audience is?" As to that, I have to admit, I'm not sure. There are several distinct groups who might find information I post to be useful:
  1. People who do Chip-Seq may enjoy the posts on FindPeaks
  2. Next Generation Sequencing related posts may have a broader audience of scientists in the field
  3. People who use Linux probably enjoy the Ubuntu related posts
  4. Grad Students might find my school related posts to be insightful (maybe?)
  5. Anyone who enjoys art might find some of my science art to be unique.
And yes, what that should tell you is that I have a wide, diverse audience. I would suggest that many of the groups above are non-overlapping, so at any one time, I'm probably boring 80% of my audience.

That is the core of the "why am I blogging" question: who am I writing for? Between now and the time I move my blog over to NN (yes, I think that's where I'm headed), I'm going to try to narrow it down a bit. Some decisions are fairly easy: I'm probably going to drop my Linux related posts (there are better forums, and I already participate in them) and the art/photography is already on the wane. The grad school posts will probably accelerate for about a year (hopefully) and then tail off completely. That should provide a little more focus in the wake of my scholastic adventures, assuming I can continue blogging once I leave academia - I'll cross that bridge when I get to it.

So, why do I blog? Because I enjoy the conversations and the community. As long as people are reading what I have to say, as long as I get the occasional comment, and as long as there is a reason to keep talking, I'll keep blogging.

Ah... Clarity. (=


Why do you blog - part I?

In light of recent events, this is a question I've had to ask myself. Why am I blogging, and is it worth continuing? Actually, it's not hard to answer, but it's worth returning to periodically.

Since I've been thinking about it quite a bit recently, it's no surprise that other blogger's posts on the same topic are of interest to me. One of the blogs I read quite often belongs to Heather Etchevers, and she has an interesting take on it. It's also worth noticing the link on the top of her post to a discussion on it, as well as the answers from other bloggers.

Anyhow, I thought I'd take a stab at the questions myself.

1. What is your blog about?

My blog is generally anything related to next-generation sequencing, the open source science development I do and my journey through grad studies. Anything that catches my eye that's related (sometimes tenuously) to one of those is fair game.

2. What will you never write about?

Anyone who hasn't explicitly agreed to be a part of my blog. I am fantastically lucky to have wonderful people in my life, but their participation in my life isn't consent to being included in anything I write.

3. Have you ever considered leaving science?

Yep - After leaving my start-up company, I briefly toyed with the idea of doing other things and just starting fresh in another field. In the end, my love of science won out.

4. What would you do instead?

Oddly enough, I'd probably have done photography. I'm content to let it be a hobby, for now, but Travel photography is really a passion of mine, and I'd love to do more of it. Incidentally, I started the blog about the same time, because I had initially intended to use it to display my pictures. Interesting how things work out...

5. What do you think will science blogging be like in 5 years?

Not all that much more different - just a lot more condensed. Twitter is becoming an alternative to blogging, and I think the two will converge somewhere for most people.

6. What is the most extraordinary thing that happened to you because of blogging

Wow... that's tough. All the really cool people I've met have been an incredible bonus that I never expected. The fact that people read my blog at all never ceases to amaze me. Anytime I'm at a conference and someone recognizes my name, I'm thrilled - and that's more than extraordinary enough for me. (When they actually pronounce it correctly, it blows my mind)

7. Did you write a blog post or comment you later regretted?

Of course... but most of those were done early on in my blogging days, on a blog that's no longer visible (thank goodness!) I've pissed off friends, insulted people, and even annoyed people in my own lab. My first blog taught me a LOT about what not to do online. I hope I've learned most of those lessons.

8. When did you first learn about science blogging?

Long after I started posting science on my blog, really. People started telling me that I should take a look at other blogs, and the more I read, the more I discovered there was a community out there.

9. What do your colleagues at work say about your blogging?

Not much, really. Occasionally, one of them will comment on something I wrote, or offer me advice on something I've discussed, but for the most part, it doesn't come up much in conversation. Although, there is the "blog effect", where people around you suddenly know things going on in your life/research that you are sure you didn't tell them. It's somewhat creepy, and it has taught me not to tell stories to people who read my blogs - they already know what I have to say on some topics.

10. How the heck do you have time to blog and do research at the same time?

Code, commit, run, wait... wait.... (blog)... wait... RESULTS!


Tuesday, February 2, 2010

The end is near...

Well, here we are, nearly 350 posts into my blog, and I have to say, I think it's coming down to the end. This isn't something I was contemplating until about 15 minutes ago, so it's a bit of a surprise to me. I should start at the beginning, however, and explain what's going on.

About 6 months ago, I received an invitation to join the Nature Networks blogs, which really appeals to me, in that they have a fantastic community going over there. There are a lot of advantages to participating in such a neat group of people who all share a common interest in science. So, pending a few changes at NN, I thought I'd move my blog over there at some point, or at the very least, associate my blog with my NN account.

And then, just this morning, I got an email from Blogger/Google letting me know that they are going to drop support for FTP-published blogs. I have until the end of March to step in line and move my blog to their server, because supporting FTP-based blog publishing is no longer cost-effective for them. Unfortunately, the only reason I had picked Blogger in the first place was that they supported FTP publishing, which leaves me somewhat in the lurch.

Regardless of what I do next, this blog will have to go through some major changes in the next month. There are three changes I can see that might solve the issue:
  1. change back-ends and migrate away from Blogger (eg. try out WordPress)
  2. change servers, and set up a "custom domain" (eg. use Blogger's servers)
  3. move my blog to a completely new venue (eg. start fresh with Nature Blogs)
Each of the above options will take a lot of work. At the very least, I'll have to archive the current posts and comments, and then likely be forced to turn off commenting unless I pick option #2. With that said, though, I'm thinking it's time to just bite the bullet and move on to a completely new blog, and Nature Blogs does seem like a good place. For the moment, it is the most appealing option to me.

On the bright side, if I do start a new blog, maybe I can come up with a slightly easier-to-say name. (Yes, Fejes is actually pronounced "fey-esh".) Unfortunately, on short notice, the best new blog name I can come up with is "Blog-seq". If anyone has better ideas, I'm definitely open to them. (=

Well, here's to the closing of one chapter, and the opening of a few new doors... I don't know where this will take me, but as always, the journey should be fun!