Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Thursday, July 24, 2008

ChIP-Seq control Data

I promised a quick discussion on ChIP-Seq and controls, as well as a comparison of FindPeaks with USeq, though I may have to defer on the second topic. I haven't played much with USeq, though I did look through its code several months ago. I think I'd best keep my comments to a minimum and say only that ChIP-Seq software evolves quickly, and that it's a moving target. Comparing FP 3.1.9.2 would be doing a disservice to my code, which was tagged several months back, while USeq is being developed out in the open, and you can get a copy of what the developer was working on last night. Truly an unfair comparison. I think I'll revisit this topic when I release FP 4.0.

The other topic is the use of controls. It's an area that everyone at GSC acknowledges we could do better in. It's just common sense that you'd want controls - this is science, after all.

However, there are different types of controls. Do you want a control that simulates a null-antibody? Do you want a genomic control that leaves out the ChIP entirely? Possibly there are other kinds I haven't even thought about, and they are all "valid" controls for different things in a ChIP-Seq experiment.

Which brings up the really key point: there are more things going on in ChIP-Seq than just peaks, and you can't completely ignore all of them, just to filter out peaks that are found in your control. That's a relatively naive method that other groups used several months ago. (I don't know if they're still using it.) I believe "Singapore" was using controls simply by checking if each peak's maximum had a greater than 4:1 ratio over the same base location in their control and throwing away anything that didn't pass that criterion. I believe other labs are doing similar things.
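To make the 4:1 filter concrete, here is a minimal sketch in Python. It is my reading of the description above, not anyone's actual pipeline: the function name, the (position, height) peak representation and the handling of positions with no control coverage are all assumptions.

```python
def filter_peaks_by_control_ratio(peaks, control_coverage, min_ratio=4.0):
    """Keep peaks whose maximum exceeds min_ratio times the control height
    at the same base position.

    peaks: list of (position_of_maximum, height) tuples (hypothetical format).
    control_coverage: dict mapping base position -> control height.
    """
    kept = []
    for position, height in peaks:
        control_height = control_coverage.get(position, 0)
        # Assumption: a position with no control coverage passes the filter.
        if control_height == 0 or height / control_height > min_ratio:
            kept.append((position, height))
    return kept


# Made-up numbers, purely for illustration.
peaks = [(1200, 40), (5300, 12), (9100, 25)]
control_coverage = {1200: 5, 5300: 6, 9100: 0}
print(filter_peaks_by_control_ratio(peaks, control_coverage))
# → [(1200, 40), (9100, 25)]
```

Note that everything except the single base under each peak maximum is ignored, which is exactly why this only works as a first pass.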

Anyhow, I think it's safe to say that such a simple method is a great first pass (and yes, ALL of us are on our first pass, at the moment, so it's definitely not an insult), but I don't think this will stand the test of time.

For now, when I get a few minutes, I'm going to start working on integrating controls into FindPeaks 4.0. As of this morning, I ran the first few successful PET runs with the software, so work on controls won't be that far off.

4 Comments:

Blogger William said...

Thanks for the update, and this goes a long way to answer some of the questions I was wondering about. I have some ideas about how to introduce at least slightly more sophisticated control methods, but as with all things in such early stages, it will mean sacrificing a good deal of the data's complexity.

July 25, 2008 11:31:00 AM PDT  
Blogger William said...

I hope you will forgive me the following rant.

Your peak finding algorithm does an excellent job of estimating peak significance, from what I can tell by using an internally generated 'control' to generate a distribution, from which it takes only those peaks that deviate significantly from that distribution. This might be an elementary or even dead wrong understanding of your algorithm, but that's what I'm assuming.

If a control is available then that 'significance' assessment can be generated by comparing tag counts between the control run and the ChIP run. It may turn out that a region of ChIP enrichment will appear not as a peak in the ChIP-seq output, but rather the absence of a dip.

However, a key problem is that, at any individual base position, it's probable that one will have only one or zero tags at some of the sequencing depths used in practice. This would be insufficient to assess significance. But if the tags are compounded into regional averages, using methods such as the kernel density estimator used in FindPeaks, perhaps they can be aggregated sufficiently to gain statistical power, though perhaps at limited physical resolution on the chromosome.

For example, suppose you take a 1 kb (or whatever) chunk of the genome around a peak. You could naively compute the number of reads binning into that chunk (just a sum, here), and compare that to the number of reads binning into the same chunk from the control data set. One would then wish to determine whether the underlying parameter, the probability that a read falls into that bin vs outside that bin, is identical for the control run and the ChIP run, or different.
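As a rough sketch of that naive count, assuming each data set is just a sorted list of read start positions on one chromosome (the function name and the toy numbers are mine, purely for illustration):

```python
from bisect import bisect_left


def reads_in_window(sorted_starts, center, width=1000):
    """Count reads whose start falls in the half-open window
    [center - width // 2, center - width // 2 + width)."""
    lo = center - width // 2
    hi = lo + width
    return bisect_left(sorted_starts, hi) - bisect_left(sorted_starts, lo)


# Toy read start positions, sorted; not real data.
chip_reads = [100, 480, 510, 520, 700, 990, 1500, 3000]
control_reads = [50, 600, 2000, 2500, 2900, 3100, 4000, 4500]

chip_in_bin = reads_in_window(chip_reads, center=500)
control_in_bin = reads_in_window(control_reads, center=500)

# Observed fraction of each library that lands in the bin.
print(chip_in_bin, chip_in_bin / len(chip_reads))            # → 6 0.75
print(control_in_bin, control_in_bin / len(control_reads))   # → 2 0.25
```

The two fractions are the observed estimates of the 'bin probability' for each run; the question is then whether they differ by more than chance would allow.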

Obviously this would require some kind of measure of significant difference. I was thinking a prior distribution on the 'bin probability' parameter's value could be generated from the actual values in randomly sampled 1 kb chunks of the control data set.

At this point Bayesian methods could be used to evaluate the two models, one in which there is a single bin probability parameter for both the control and the ChIP, and one in which the parameter varies freely for both. I'm not quite sure what the statistical methods would be to do this, but perhaps you do.
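One standard way to compare two such models is a Bayes factor with a Beta prior on the bin probability, since the Beta-binomial marginal likelihood has a closed form. The sketch below is only one possible reading of the idea, not an established ChIP-Seq method. The function names are made up, and the Beta(1, 1) default is a placeholder assumption; the prior suggested above would instead be fitted to randomly sampled control bins.

```python
from math import lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_bayes_factor(k_chip, n_chip, k_ctrl, n_ctrl, a=1.0, b=1.0):
    """Log Bayes factor for 'separate bin probabilities' over 'one shared
    bin probability', with a Beta(a, b) prior on each probability.

    k_*: reads falling in the bin; n_*: total reads in that run.
    The binomial coefficients are identical in both models and cancel.
    """
    separate = (log_beta(k_chip + a, n_chip - k_chip + b)
                + log_beta(k_ctrl + a, n_ctrl - k_ctrl + b)
                - 2.0 * log_beta(a, b))
    shared = (log_beta(k_chip + k_ctrl + a,
                       n_chip + n_ctrl - k_chip - k_ctrl + b)
              - log_beta(a, b))
    return separate - shared


# Positive values favour different bin probabilities, negative values
# favour a single shared one. Counts below are invented.
print(log_bayes_factor(90, 100, 10, 100))
print(log_bayes_factor(5, 10, 5, 10))
```

With realistic ChIP-Seq numbers (a few dozen reads in a bin out of millions), the result depends heavily on the prior, which is a good argument for estimating it from the control data rather than using the flat placeholder here.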

Anyhow, an idea that I wanted to share with someone who was working on these matters, though my thoughts may be somewhat naive. Congrats on creating an excellent package that is sure to see continued use, and best of luck on 4.0!

July 25, 2008 1:27:00 PM PDT  
Blogger Anthony said...

Thanks for the rant, William!

I just want to hit a few points in reply. Let me know if I've missed something.

First, that was long, but hardly a rant - it's definitely relevant and so it fits more into the suggestion bin than the rant bin. (I'm already taking your suggestion and binning things...)

Second, your synopsis of my FP3 method is pretty accurate, and I agree - it's only useful when you don't have a control. Indeed, for much of my time working on FindPeaks, I haven't had a real control lane/flow cell to work with, which is why that algorithm was developed. I'm very aware that it's lacking in many respects, particularly when you do have a control data set to use alongside your ChIP-Seq experiment.

Third, your point about "regional averages" is very much in line with what I am working on. Needless to say, I've put a lot of thought into where this is going, and I think I have a relatively sophisticated solution that doesn't require binning.

Third (and a half), I generally don't like binning algorithms for ChIP-Seq. If you have base pair resolution, why throw it away?

Fourth, my stats background is probably not as good as you think, but I'm learning as I go through this. If anyone wants to help me out with the stats behind it, I'd be very happy to accept their help. All I feel comfortable saying at this point is: if all goes as planned, I will have a MUCH better method of doing a real FDR, which won't be based on the thresholding method we use now.

Fifth, well, I really don't want to go telling everyone on my blog what I have in mind, as I'd hate to commit to something if it turns out to be impossible. I don't mind talking about these things in a less public forum, however. That said, I think you're on the right track, though what gets implemented is probably not going to match closely with what you've suggested. I've discussed my algorithm with several of the people here, so there's been a bit of refinement of my idea already.

Finally (Sixth?), the most interesting thing for me is that the algorithm clearly will work best with a PET solution. (All ChIP-Seq works best with PET...) I just have to wonder if people will continue doing SET ChIP-Seq, or if I should just move completely over to PET runs.

That's another question for another day.

July 25, 2008 2:07:00 PM PDT  
Blogger Anthony said...

Sorry Chipper,

I accidentally hit delete on your comment, but here it is:

"I am curious about the PET reads, do you see a reduction in alignments in centromers and other types of repeats where control peaks are found?"

Again, without saying too much, I believe a better alignment leads to less noise, which in turn leads to cleaner peaks. My contention is that using PETs should give us the ability to make better alignments.

On the other hand, I believe the centromere peaks are coming from another phenomenon: the DNA itself is fragile in certain regions, or more prone to breakage, etc. Thus, I don't think the centromere peaks will disappear, though I haven't looked at the data sets I've processed yet - there are other people working on the interpretation of the results, and they would probably be able to address that better than I can.

Anthony

July 29, 2008 2:01:00 PM PDT  
