Friday, April 4, 2008

Dr. Henk Stunnenberg's lecture

I saw an interesting seminar today, which I thought I'd like to comment on. Unfortunately, I didn't bring my notes home with me, so I can only report on the details I recall - and my apologies in advance if I make any errors - as always, any mistakes are obviously with my recall, and not the fault of the presenter.

Ironically, I almost skipped the talk - it was billed as discussing Epigenetics using "ChIP-on-Chip", which I wrote off several months ago as being a "poor man's ChIP-Seq." I try not to say that too loud, usually, since there are still people out there who put a lot of faith in it, and I have no evidence to say it's bad. Or, at least, I didn't until today.

The presenter was Dr. Stunnenberg, from the Nijmegen Center for Molecular Sciences, whose web page doesn't do him justice in any respect. To begin with, Dr. Stunnenberg gave a big apology for the change in date of his talk - I gather the originally scheduled talk had to be postponed because someone had stolen his bags while he was on the way to the airport. That has got to suck, but I digress...

Right away, we were told that the talk would focus not on "ChIP-on-Chip", but on ChIP-Seq, instead, which cheered me up tremendously. We were also told that the poor graduate student (Mark?) who had spent a full year generating the first data set based on the ChIP-on-Chip method had had to throw away all of his data and start over again once the ChIP-Seq data had become available. Yes, it's THAT much better. To paraphrase Dr. Stunnenberg, it wasn't worth anyone's time to work with the ChIP-on-Chip data set when compared to the accuracy, speed and precision of the ChIP-Seq technology. Ah, music to my ears.

I'm not going to go over what data was presented, as it would mostly be of interest only to cancer researchers, other than to mention it was based on estrogen receptor mediated binding. However, I do want to raise two interesting points that Dr. Stunnenberg touched upon: the minimum height threshold they applied to their data, and the use of Polymerase occupancy.

With respect to their experiment, they performed several lanes of sequencing on their ChIP-Seq sample, and used standard peak finding to identify areas of enrichment. This yielded a large number of sites, which I seem to recall was in the range of 60-100k peaks, with a "statistically derived" peak height cutoff around 8-10. No surprise, this is a typical result for a complex interaction with a relatively promiscuous transcription factor: a lot of peaks! The surprise to me was that they decided that this was too many peaks, and so applied an arbitrary minimum peak height threshold of 30, which reduced the number of peaks to roughly 6,400. Unfortunately, I can't come up with a single justification for setting this threshold at 30. In fact, I don't know that anyone could, including Dr. Stunnenberg, who admitted it was rather arbitrary and was chosen because they thought the first number, in the tens of thousands of peaks, was too many.

I'll be puzzling over this for a while, but it seems like a lot of good data was rejected for no particularly good reason. Yes, it made the data set more tractable, but considering the number of peaks we work on regularly at the GSC, I'm not really sure this is a defensible reason. I'm personally convinced that there is a lot of biological relevance for the peaks with low peak heights, even if we aren't aware of what that is yet, and arbitrarily raising the minimum height threshold 3-fold over the statistically justifiable cutoff is a difficult pill to swallow.
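To make the effect of stacking an arbitrary cutoff on top of a statistical one concrete, here is a minimal Python sketch. It is purely illustrative: the `Peak` class, the function names and the toy peak heights are all my own assumptions, not anything from Dr. Stunnenberg's pipeline or from real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    """Hypothetical peak record; height = max number of overlapping fragments."""
    chrom: str
    start: int
    end: int
    height: int


def filter_by_height(peaks, min_height):
    """Keep only peaks whose height is at least min_height."""
    return [p for p in peaks if p.height >= min_height]


def fraction_discarded(peaks, statistical_cutoff, arbitrary_cutoff):
    """Fraction of statistically supported peaks lost to the stricter cutoff."""
    supported = filter_by_height(peaks, statistical_cutoff)
    if not supported:
        return 0.0
    kept = filter_by_height(supported, arbitrary_cutoff)
    return 1 - len(kept) / len(supported)


# Toy data only: five made-up peaks on one chromosome.
toy_peaks = [
    Peak("chr1", 100 * i, 100 * i + 50, h)
    for i, h in enumerate([5, 9, 12, 30, 45])
]
lost = fraction_discarded(toy_peaks, statistical_cutoff=10, arbitrary_cutoff=30)
```

With the real numbers from the talk (as best I recall them), the same calculation is what takes 60-100k peaks down to about 6,400.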

Moving along, the part that did impress me a lot (one of many impressive parts, really) was the use of Polymerase occupancy ChIP-Seq tracks. Whereas the GSC tends to do a lot of transcriptome work to identify the expression of genes, Dr. Stunnenberg demonstrated that polymerase ChIP can be used to gain the same information, but with much less sequencing. (I believe he said 2-3 lanes of Solexa data were all that were needed, whereas our transcriptomes have been done up to a full 8 lanes.) Admittedly, I'd rather have both transcriptome and polymerase occupancy, since it's not clear where each one has weaknesses, but I can see obvious advantages to both methods, particularly the benefits of having direct DNA evidence, rather than mapping cDNA back to genomic locations for the same information. I think this is something I'll definitely be following up on.

In summary, this was clearly a well thought through talk, delivered by a very animated and entertaining speaker. (I don't think Greg even thought about napping through this one.) There's clearly some good work being done at the Nijmegen Center for Molecular Sciences, and I'll start following their papers more closely. In the meantime, I'm kicking myself for not going to the lunch to talk with Dr. Stunnenberg afterwards, but alas, the ChIP-on-Chip poster sent out in advance had me fooled, and I had booked myself into a conflicting meeting earlier this week. Hopefully I'll have another opportunity in the future.

By the way, Dr. Stunnenberg made a point of mentioning they're hiring bioinformaticians, so interested parties may want to check out his web page.



Blogger Malarky said...

There are a lot of interesting points in your post:
1. ChIP-Seq sounds great and I am looking forward to trying it on our Solexa soon. However, I have seen a paper recently where a side-by-side comparison showed excellent overlap between both methods. I still see ChIP-chip as a useful and cheap first pass before pyrosequencing particular sites. I guess in a year or 2 the costs may converge.

2. I find the point about the multiplicity of peaks interesting. In computational modelling of TF they are beginning to reject absolute thresholds too.

see for instance this paper:
Statistical Modeling of Transcription Factor Binding Affinities Predicts Regulatory Interactions
Thomas Manke*, Helge G. Roider, Martin Vingron
Max Planck Institute for Molecular Genetics, Berlin, Germany
PLOS Comp Biol.

Here they model the strength of all interactions on a sequence and sum them to produce a score for a region (rather than just adding 'hits').

April 5, 2008 3:12:00 AM PDT  
Anonymous Anonymous said...

guess you should better have come to the lunch meeting than writing about it on internet

April 5, 2008 8:03:00 AM PDT  
Blogger Anthony said...

To anonymous: I posted a blog item on it 14 hours after the seminar, when I had the time to do so. I don't quite follow your point.

To Malarky:

While I agree that it's a cheap first pass, I suspect that the money may not be well spent if you plan to do any ChIP-Seq afterwards. As Dr. Stunnenberg pointed out, once they had done the ChIP-Seq, they tossed out the ChIP-chip results for reasons of quality. So, unless you really don't have the money to follow up with a lane of ChIP-Seq, I'm not sure a quick and dirty first pass is a good investment. (I do agree that they show excellent agreement, however, from what few comparisons I've seen.)

Also, thanks for the reference. I'll look into it - and possibly post something on it. I've been moving towards different scoring schemes myself, in my own ChIP-seq code.

April 5, 2008 9:09:00 AM PDT  
Anonymous Chipper said...


I am interested in what kind of negative controls you use in your ChIP-seq experiments, and whether Henk commented on how they do it. For STAT1, one IgG control was run but just tossed away; I have not been able to find the data anywhere...

I do not see how ChIP-chip could be a cheap first-pass unless you get the chips for free? Unless you are comparing it to the 454...

Setting the threshold can be really difficult - how do you calculate a meaningful FDR? For example, do you only use perfect alignments or do you take possible misalignments into account? Do you know if you have read only one or both ends of a fragment? I do not dispute the high number of bindings (well, ok the 40K STAT1 perhaps ;) but you should not rely too much on a statistical number based on a model that may have just too many discrepancies from your experiment.

Anyway, thanks for a nice blog and some insights also into how you do the transcriptomes. Right now it is just too expensive if you have to run a whole flowcell, I guess.

April 5, 2008 12:42:00 PM PDT  
Blogger Anthony said...

Hi Chipper,

For controls, we've tried several different things, though I believe this is still an open question - and probably ChIP-Seq's biggest weakness. Dr. Stunnenberg did not mention (or I did not catch) if they performed a control experiment. Instead, they attempted to use the Polymerase occupancy (Poc) experiment such that only peaks supported by both ChIP-Seq and Poc were considered real. Of course, because of the thresholding, they threw away 85%+ of their ChIP-Seq peaks anyhow, so possibly controls of any other sort would have been superfluous.

For ChIP-chip being cheap, it's all relative. You don't need to purchase the Solexa machine, and many labs already have the equipment for ChIP-chip, though I believe Malarky is the only one supporting this point. I personally believe that doing the ChIP-chip isn't worth the time or effort, unless you never plan to run ChIP-Seq.

On thresholding, I'll talk more about it in another blog post, but my own ChIP-Seq thresholding in FindPeaks now uses a Monte Carlo (MC) simulation that replicates all of the experimental conditions, but uses randomly placed tags to generate the FDR. This is much more robust than what we were doing in the past, but there's still plenty of room for improvement.
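For anyone curious what an MC-based FDR looks like in general, here is a rough Python sketch of the idea. To be clear, this is not the FindPeaks code: the function names, the single linear "genome", the fixed fragment length and the definition of a peak as an island of overlapping fragments are all simplifying assumptions for illustration.

```python
import random
from collections import Counter


def peak_heights(starts, fragment_length):
    """Merge overlapping fragments into islands; return each island's max depth."""
    events = []
    for s in starts:
        events.append((s, 1))                     # fragment begins
        events.append((s + fragment_length, -1))  # fragment ends
    events.sort()  # ends sort before starts at the same position
    heights, depth, best = [], 0, 0
    for _, delta in events:
        depth += delta
        best = max(best, depth)
        if depth == 0:  # island closed
            heights.append(best)
            best = 0
    return heights


def simulate_fdr(observed_heights, n_tags, genome_length, fragment_length,
                 n_sims=100, seed=0):
    """Estimate FDR per height cutoff from randomly placed tags.

    FDR(h) = (mean number of random peaks with height >= h)
             / (number of observed peaks with height >= h), capped at 1.
    """
    if not observed_heights:
        return {}
    rng = random.Random(seed)
    random_counts = Counter()
    for _ in range(n_sims):
        starts = [rng.randrange(genome_length) for _ in range(n_tags)]
        random_counts.update(peak_heights(starts, fragment_length))
    observed_counts = Counter(observed_heights)
    fdr = {}
    for h in range(1, max(observed_counts) + 1):
        expected = sum(c for k, c in random_counts.items() if k >= h) / n_sims
        observed = sum(c for k, c in observed_counts.items() if k >= h)
        fdr[h] = min(1.0, expected / observed)
    return fdr


# Toy example: 50 singleton "peaks" plus 5 strong ones of height 20.
toy_observed = [1] * 50 + [20] * 5
toy_fdr = simulate_fdr(toy_observed, n_tags=150, genome_length=100_000,
                       fragment_length=100)
```

A height cutoff can then be chosen as the smallest height whose estimated FDR falls below whatever level you are willing to accept. A real implementation would also need to match things like mappability and the actual fragment size distribution.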

Finally, just to address your last point - though no one has yet demonstrated good convergence for a ChIP-Seq experiment, only a few lanes (2-3) are needed to get good resolution for most experiments. Thus, the cost of doing a ChIP-Seq run is probably more like 1/3rd of a flow cell, anyhow. As for transcriptomes, the question is what depth you want, if you're not doing normalization - and we wanted very good depth for what we're doing. (Again, that's another post for another day.)


April 5, 2008 3:57:00 PM PDT  
