(UPDATE: A response to this article was kindly provided by Anton Valouev, and can be read here.)
I once wrote a piece of software called WINQ, which was the predecessor of a piece of software called Quest. I'm not going to talk about that particular Quest for long, but bear with me a moment - it makes a nice lead-in.
The software I wrote wasn't started before the University of Waterloo's version of Quest, but it was released first. Waterloo was implementing a multimillion-dollar set of software for managing student records, built on Oracle databases, PeopleSoft software, and tons of custom extensions to web interfaces and reporting. Unfortunately, the project was months behind, and the Quest system was nowhere near being deployed. (Vendor problems and the like.) That's when I became involved - in two months of long days, I used Cognos tools (several of them, involving 5 separate scripting and markup languages) to build the WINQ system, which gave the faculty a way to query the Oracle database through a secure web frontend and get all of the information they needed. It was supposed to be in use for about 4-6 months, until Quest took over... but I heard it was used for more than two years. (There are many good stories there, but I'll save them for another day.)
Back to ChIP-Seq's QuEST: this application was the subject of a recently published article. In a parallel to the Waterloo story, QuEST was probably started before I got involved in ChIP-Seq, and was definitely released after I released my software - but this time I don't think it will replace my software.
The paper in question (Valouev et al., Nature Methods, Advance Online Publication) is called "Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data". I suspect it was published with the intent of being the first article on ChIP-Seq software, which, unfortunately, it wasn't. What's strangest to me is that it seems to be simply a reiteration of the methods used by Johnson et al. in their earlier ChIP-Seq paper. I don't see anything novel in this paper, though maybe someone else has seen something I've missed.
The one thing that surprises me about this paper, however, is their use of a "kernel density bandwidth", which appears to be a sliding window of pre-set length. This flies in the face of the major advantage of ChIP-Seq, which is the ability to get very strong signals at high resolution. By forcing a "window" over their data, they are likely losing a lot of the resolution they could have found by investigating the reads directly. (Admittedly, with a window of 21bp, as used in the article, they're not losing much, so it's not a very heavy criticism.) I suppose it could be used to provide a quick way of doing subpeaks (finding individual peaks in areas of contiguous read coverage) at a cost of losing some resolving power, but I don't see that discussed as an advantage.
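To make the resolution point concrete, here is a minimal Python sketch, assuming the "bandwidth" really does behave like a fixed-width sliding window over read start positions. This is my reading of the paper, not QuEST's actual code; the function names and the toy read positions are hypothetical and chosen only for illustration.

```python
def per_base_counts(read_starts, length):
    """Count read starts at each base position (the unsmoothed signal)."""
    counts = [0] * length
    for pos in read_starts:
        if 0 <= pos < length:
            counts[pos] += 1
    return counts


def sliding_window_profile(counts, window=21):
    """Sum the counts inside a centred window of fixed width.

    Assumption: a plain (uniform) window stands in for the paper's
    "kernel density bandwidth"; a real kernel may weight positions.
    """
    half = window // 2
    n = len(counts)
    profile = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        profile.append(sum(counts[lo:hi]))
    return profile


def regions_at_or_above(values, threshold):
    """Return (start, end) pairs of contiguous positions >= threshold."""
    regions = []
    start = None
    for i, value in enumerate(values):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(values) - 1))
    return regions


# Toy data: two pile-ups of reads only 10 bp apart.
reads = [40] * 5 + [50] * 5
counts = per_base_counts(reads, 100)
profile = sliding_window_profile(counts, window=21)

print(regions_at_or_above(counts, 5))    # two separate positions
print(regions_at_or_above(profile, 10))  # one merged plateau
```

In this toy case the raw counts keep the two pile-ups apart, while the 21bp window turns them into a single flat-topped region. That is the loss of resolving power I mean, and it also shows why the effect is modest at this window size: anything more than about 20bp apart would still separate.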
The second thing they've done is provide a directional component to peak finding. Admittedly, I tried to do the same thing, but found it didn't really add much value. Both the QuEST publication and my application note on FindPeaks 3.1 mention the ability to do this - and then fail to show any data that demonstrates the value of using this mechanism versus identifying peak maxima. (In my case, I wasn't expected to provide data in the application note.)
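For readers who haven't seen the directional idea before, here is a rough Python sketch of the general principle: reads on the forward strand tend to pile up on one side of a binding site and reads on the reverse strand on the other, so the site can be estimated as the midpoint between the two pile-ups. This is not the implementation in QuEST or in FindPeaks; the function names and the toy coordinates are hypothetical.

```python
from collections import Counter


def strand_mode(positions):
    """Most frequent 5' end position on one strand (leftmost on ties)."""
    counts = Counter(positions)
    best = max(counts.values())
    return min(pos for pos, count in counts.items() if count == best)


def directional_estimate(forward_ends, reverse_ends):
    """Estimate a binding site from strand-specific pile-ups.

    Assumption: the forward-strand pile-up lies to the left of the
    reverse-strand pile-up; if not, no estimate is returned.
    Returns (estimated_site, distance_between_pileups) or None.
    """
    fwd = strand_mode(forward_ends)
    rev = strand_mode(reverse_ends)
    if fwd > rev:
        return None
    return (fwd + rev) // 2, rev - fwd


# Toy data: forward reads cluster near 100, reverse reads near 200.
forward = [90, 100, 100, 100, 110]
reverse = [190, 200, 200, 210]
print(directional_estimate(forward, reverse))
```

The open question is whether an estimate like this lands anywhere meaningfully different from the maximum of the combined coverage, and that comparison is exactly the data neither publication shows.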
Anyhow, that was the downside. There are two very good aspects to this paper. The first is that they do use controls. Even now, the Genome Sciences Centre is struggling with ChIP-Seq controls, while it seems everyone else is using them to great effect. I really enjoyed this aspect of it. In fact, I was rather curious how they'd done it, so I took a look through the source code of the application. I found the code somewhat difficult to wade through, as the coding style was very different from my own, but it was well organized. Unfortunately, I couldn't find any code for dealing with controls, which leads me to think this is an unreleased feature, and that the controls were handled by post-processing the results of their application. Too bad.
The second thing I really appreciated was the motif finding work, which isn't strictly ChIP-Seq, but is one of the uses to which the data can be applied. Unfortunately, this is also not new, as I'm aware of many earlier experiments (published and unpublished) that did this as well, but it does make a nice little story. There's good science behind this paper - and the data collected on the chosen transcription factors will undoubtedly be exploited by other researchers in the future.
So, here's my summary of this paper: As a presentation of a new algorithm, it fails to produce anything novel, and no experiments are provided to show the value of its algorithms versus any others. On the other hand, as a paper on growth-associated binding protein and serum response factor (GABP and SRF, respectively), it presents a nice, compact story.