Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: - Please come visit my blog there.

Wednesday, December 10, 2008

countdown to comps continues..

Today has been a relatively productive day in one sense. I finally finished reading the Biology of Cancer (Weinberg), after about 2 solid weeks of doing a chapter a day. At about 60-75 pages a day, it was pretty intense, but I learned a LOT. (Well, how could I not?)

Since the last chapter was on drug design, an area I'm familiar with, it was pretty easy reading and went by pretty quickly. There are only so many times you can learn about Gleevec.

So I moved on to a few other review questions suggested by my committee... such as "Draw a gene" and "penetrance vs. expressivity," which made for a nice general review.

And then I moved on to a few papers. One of them discussed doing PET sequencing from 2nd generation machines to find chromosomal fusion points in cancer (Bashir et al., 2008). The math was interesting enough, but the end result was that they tested it on BACs. When I get around to doing PET on my samples, this will be a good review to make sure we get the parameters right, but the paper didn't go far enough, really, in my humble opinion. I was hoping for more.

A second paper that I went over was on identifying SNPs in 2nd generation sequencing, using bovine DNA (Van Tassell et al., 2008). I don't know what to make of their method, though. They combined samples from 8 different breeds of cow (I didn't know there were that many breeds of cow!), and then sequenced the pool to a depth of 10x, so that on average, each read covering a specific location should reflect the sequence from a different breed. Maybe I'm missing something, but SNP calling on this should technically be impossible - even assuming 0% sequencing errors, how confident can you be in a SNP found only once? Anyhow, I had to abandon this paper... I just didn't understand how they could draw any conclusions at all from this data.

Finally, a colleague of mine (Thanks Simon!) recommended a review by Weinberg himself on "Mechanisms of Malignant Progression" (Weinberg, 2008). For anyone who ends up reading his textbook from cover to cover, I highly suggest following up with this review. Things have changed a bit between the time the textbook was published and now, which makes the review a timely follow-up. In particular, it fleshes out much of the textbook's discussion of Epithelial-Mesenchymal Transitions (EMTs). Not that I want to go into much detail, but it's clearly worth a browse - and it's only 4 pages long. (MUCH shorter than the 790 pages I had to read to understand what he was talking about in the first place.) (-:

So now, this leaves me with 1 day left - just enough time to gather together my 15 minute presentation, to review the chapter summaries from the Biology of Cancer, and still have enough time to freak out a little bit. Perfect timing.


Thursday, October 16, 2008

Manuscript review.

Several weeks ago, I was really flattered to be asked to review a manuscript, even if it was for a journal I'd never heard of. As a grad student, it's cool that people have even heard of me enough to put me down as a potential reviewer. (I've also had other requests in the meantime, though I don't think I'll have the spare hours to tackle more reviews - still flattering, though!)

Unfortunately, I'm disappointed in the quality of the very first manuscript I've ever reviewed. I won't go into details, however.

On a completely unrelated topic, and for future reference, I think I'll provide a few links that may be useful to people who want to submit manuscripts in the future:
  • Plagiarism (Wikipedia). Covers what plagiarism is, and the possible consequences of it as a student.
  • How to avoid Plagiarism (Purdue). General tips on when and how to give credit to the originators of the idea and source of the material.
  • A good discussion of different forms of plagiarism (Andrew Roberts at Middlesex University). I highly suggest this if you're unsure where the grey line between copying and paraphrasing begins.
  • If things are still unclear, this reference (Irving Hexham at the University of Calgary) provides examples and demonstrates the correct way to quote and reference other people's work.


Friday, September 12, 2008


One more day, one more piece of ChIP-Seq software to cover. I haven't talked much about FindPeaks, the software descended from Robertson et al., for obvious reasons: the paper was just an application note, and I'm intimately familiar with how it works, so I'm not going to review it. I have talked about QuEST, however, which was presumably descended from Johnson et al. Those of you who have been following ChIP-Seq papers since the early days will realize that there's still something missing: the peak finder descended from Barski et al., which is the subject of today's blog: SISSR. Those were the first three published ChIP-Seq papers, so it's no surprise that each of them was followed up with a paper (or application note!) on its software.

So, today, I'll take a look at SISSR, to complete the series.

From the start, the Barski paper discussed both histone modifications and transcription factors, so the context of this peak finder is a little different. Where FindPeaks (and possibly QuEST as well) was originally conceived for identifying single peaks and later expanded to handle multiple peaks, I would imagine that SISSR was conceived with the idea of working on "complex" areas of overlapping peaks. That's only relevant in terms of their analysis, though - I'll come back to that.

The most striking thing you'll notice about this paper is that the datasets look familiar. They are, in fact, the sets from Robertson, Barski and Johnson: STAT1, CTCF and NRSF, respectively. This is the first of the ChIP-Seq application papers to actually perform a comparison between the available peak finders - and, of course, to claim that theirs is the best. Again, I'll come back to that.

The method used by SISSR is almost identical to the method used by FindPeaks, with the use of directional information built into the base algorithm, whereas FindPeaks provides it as an optional module (-directional flag, which uses a slightly different method). They provide an excellent visual image on the 4th page of the article, demonstrating their concept, which will explain the method better than I can, but I'll try anyhow.

In ChIP-Seq, many real tags are expected to point at a binding site: tags upstream of the site should be on the sense strand, and tags downstream should be on the anti-sense strand. Thus, a real binding site should sit at a transition point, where the majority of tags switch from the sense to the anti-sense strand. By identifying these transition points, they can identify the locations of real binding sites. More or less, that describes the algorithm employed, with the following modifications: a window (20bp by default) is used instead of working base by base, and parameter estimation is employed to guess the length of the fragments.
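Just to make the idea concrete, here's a rough sketch of strand-transition detection - my own toy illustration, not SISSR's actual code, and the minimum-tag threshold is entirely made up:

```python
# Toy sketch of strand-transition peak detection (my own illustration,
# not SISSR's actual code): bin sense/anti-sense tag 5' positions into
# fixed windows, then report boundaries where the dominant strand flips
# from sense to anti-sense - the expected signature of a binding site.

def find_transitions(sense_starts, antisense_starts, window=20, min_tags=3):
    """Return coordinates where tag direction switches from sense-dominant
    to antisense-dominant (candidate binding sites)."""
    if not sense_starts and not antisense_starts:
        return []
    last = max(sense_starts + antisense_starts)
    n_windows = last // window + 1
    net = [0] * n_windows    # sense tags count +1, anti-sense tags -1
    total = [0] * n_windows
    for pos in sense_starts:
        net[pos // window] += 1
        total[pos // window] += 1
    for pos in antisense_starts:
        net[pos // window] -= 1
        total[pos // window] += 1
    sites = []
    for i in range(1, n_windows):
        # transition: sense-dominant window followed by antisense-dominant,
        # with enough supporting tags to be believable
        if net[i - 1] > 0 and net[i] < 0 and total[i - 1] + total[i] >= min_tags:
            sites.append(i * window)
    return sites

# Sense tags piling up just below position 100, anti-sense just above it:
peaks = find_transitions([80, 85, 90, 95], [105, 110, 115])  # -> [100]
```

The window-based binning here is exactly the resolution trade-off discussed below; the same scan done tag by tag would recover single-base resolution.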

In my review of QuEST, I complained that windows are a bad idea(tm) for ChIP-Seq, only to be corrected that QuEST wasn't using a window. This time, the window is explicitly described - and again, I'm puzzled. FindPeaks uses an identical operation without windows, and it runs blazingly fast. Why throw away resolution when you don't need to?

On the subject of length estimation, I'm again less than impressed. I realize this is probably an early attempt at it - and FindPeaks has gone through its fair share of bad length estimators, so it's not a major criticism, but it is a weakness. To quote a couple of lines from the paper: "For every tag i in the sense strand, the nearest tag k in the anti-sense strand is identified. Let j be the tag in the sense strand immediately upstream of k." Then follows a formula based upon the distances between (i,j) and (j,k). I completely fail to understand how this provides an accurate assessment of the real fragment length - I'm sure I'm missing something. As a function that describes the width of peaks, it may be a good method, which is really what the experiment is aiming for anyhow - so it's possible that this is just poorly named.

In fairness, they also provide options for a manual length estimation (or XSET, as it was referred to at the time), which overrides the length estimation. I didn't see a comparison in the paper about which one provided the better answers, but having lots of options is always a good thing.

Moving along, my real complaint about this article is the analysis of their results compared to past results, which comes in two parts. (I told you I'd come back to it.)

The first complaint is what they were comparing against. The article was submitted for publication in May 2008, but they compared results to those published in the June 2007 Robertson article for STAT1. By August, our count of peaks had changed. By January 2008, several upgraded versions of FindPeaks were available, and many bugs had been ironed out. It's hardly fair to compare the June 2007 FindPeaks results to the May 2008 version of SISSR, and then declare SISSR the clear winner. Still, that's not a great problem - albeit somewhat misleading.

More vexing is their quality metric. In the motif analysis, they clearly state that, because of the large amount of computing power required, only the top X% of reads were used in their analysis. For the comparison with FindPeaks, the top 5% of peaks were used - and the same motifs were observed. Meanwhile, their claim to find 74% more peaks than FindPeaks is never really discussed in terms of the quality of those additional sites. (FindPeaks was also modified to identify sub-peaks after the original data set was published, so this is really comparing apples to oranges - a fact glossed over in the discussion.)

Anyhow, complaints aside, it's good to see a paper finally compare the various peak finders out there. They provide some excellent graphics, and a nice overview on how their ChIP-Seq application works, while contrasting it to published data available. Again, I enjoyed the motif work, particularly that of figure 5, which correlates four motif variants to tag density - which I feel is a fantastic bit of information, buried deeper in the paper than it should be.

So, in summary, this paper presents a rather unfair competition by using metrics guaranteed to make SISSR stand out, but still provides a good read with background on ChIP-Seq, excellent illustrations and the occasional moment of deep insight.


Tuesday, September 9, 2008

ChIP-Seq in silico

Yesterday I got to dish out some criticism, so it's only fair that I take some myself, today. It came in the form of an article called "Modeling ChIP Sequencing In Silico with Applications", by Zhengdong D. Zhang et al., PLoS Computational Biology, August 2008: 4(8).

This article is actually very cool. They've settled several points that have been hotly debated here at the Genome Sciences Centre, and made the case for some of the stuff I've been working on - and then show me a few places where I was dead wrong.

The article takes direct aim at the work done in Robertson et al., using the STAT1 transcription factor data produced in that study. Their key point is that the "FDR" used in that study was far from ideal, and that it could be significantly improved. (Something I strongly believe as well.)

For those that aren't aware, Robertson et al. is sort of the ancestral origin of the FindPeaks software, so this particular paper is more or less aiming at the FindPeaks thresholding method. (Though I should mention that they're comparing their results to the peaks in the publication, which used the unreleased FindPeaks 1.0 software - not the FindPeaks 2+ versions, of which I'm the author.) Despite the comparison to the not-quite current version of the software, their points are still valid, and need to be taken seriously.

Mainly, I think there are two points that stand out:

1. The null model isn't really appropriate.
2. The even distribution isn't really appropriate.

The first, the null model, is relatively obvious - everyone has been pretty clear from the start that the null model doesn't really work well. This model, pretty much consistent across ChIP-Seq platforms, can be paraphrased as "if my reads were all noise, what would the data look like?" This assumption is destined to fail every time - the reads we obtain aren't all noise, and thus assuming they are as a control is really a "bad thing"(tm).

The second, the even distribution model, is equally disastrous. This can be paraphrased as "if all of my noise were evenly distributed across some portion of the chromosome, what would the data look like?" Alas, noise doesn't distribute evenly in these experiments, so it should be fairly obvious why this is also a "bad thing"(tm).
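For what it's worth, here's roughly what an even-distribution null computes - a toy sketch of the general idea with made-up numbers, not anything from the paper. Under uniform placement, per-base coverage is approximately Poisson, and the peak threshold falls wherever the tail probability becomes negligible:

```python
import math

# Toy version of the "even distribution" null model (the one being
# criticized), with invented numbers: if n reads of length L land
# uniformly on a genome of length G, per-base coverage is roughly
# Poisson(n*L/G), and a peak-height threshold is read off the tail.

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return 1.0 - cdf

n_reads, read_len, genome_len = 10_000_000, 30, 3_000_000_000
lam = n_reads * read_len / genome_len          # expected coverage per base
threshold = next(h for h in range(1, 100) if poisson_sf(h, lam) < 1e-9)
# Any pileup taller than "threshold" gets called a peak - which is exactly
# why clustered, non-uniform noise blows this model up: real noise piles
# up far higher than uniform placement predicts.
```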

The solution presented in the paper is fairly obvious: create a full simulation for your ChIP-Seq data. Their version requires a much more rigorous process, however. They simulate a genome-space, remove areas that would be gaps or repeats in the real chromosome, then tweak the genome simulation to replicate their experiment using weighted statistics collected from the ChIP-Seq experiment.
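In spirit, the procedure might look something like this - a heavily simplified toy with invented parameters, and definitely not the authors' pipeline:

```python
import random

# Heavily simplified toy of the simulation idea (not the authors'
# pipeline): reads are placed only on "mappable" positions, optionally
# weighted by an empirical sampling bias, and the resulting coverage
# shows how tall a pileup random placement alone can produce.

def simulate_background(genome_len, masked, n_reads, read_len, weights, rng):
    """masked: positions excluded (gaps/repeats); weights: per-position
    relative sampling probability (the empirical bias model)."""
    starts = [p for p in range(genome_len - read_len) if p not in masked]
    w = [weights.get(p, 1.0) for p in starts]
    coverage = [0] * genome_len
    for s in rng.choices(starts, weights=w, k=n_reads):
        for p in range(s, s + read_len):
            coverage[p] += 1
    return coverage

rng = random.Random(42)
masked = set(range(400, 600))        # pretend this stretch is a repeat/gap
cov = simulate_background(1000, masked, n_reads=200, read_len=30,
                          weights={}, rng=rng)
max_null_peak = max(cov)             # tallest pileup expected by chance alone
```

Thresholding real data against `max_null_peak` (or against the tail of many such simulations) is the sense in which this gives a control without a matched control sample.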

On the one hand, I really like this method, as it should give a good version of a control; on the other hand, I don't like that you need to know a lot about the genome of interest (i.e., mappability, repeat-masking, etc.) before you can analyze your ChIP-Seq data. Of course, if you're going to simulate your genome, simulate it well - I agree with that.

I don't want to belabor the point, but this paper provides a very nice method for simulating ChIP-Seq noise in the absence of a control, as in Robertson et al. However, I think there are two things that have changed since this paper was submitted (January 2008) that should be mentioned:

1. FDR calculations haven't stood still. Even at the GSC, we've been working on two separate FDR models that no longer use the null model; however, both still make even-distribution assumptions, which is also not ideal.

2. I believe everyone has now acknowledged that there are several biases that can't be accounted for in any simulation technique, and that controls are the way forward. (They're used very successfully in QuEST, which I discussed yesterday.)

Anyhow, to summarize this paper: Zhang et al. provide a fantastic critique of the thresholding and FDR used in early ChIP-Seq papers (which are still in use today, in one form or another), and demonstrate a viable and clearly superior method for refining ChIP-Seq results without a matched control. This paper should be read by anyone working on FDRs for next-gen sequencing and ChIP-Seq software.

(Post-script: In preparation for my comprehensive exam, I'm trying to prepare critical evaluations of papers in the area of my research. I'll provide comments, analysis and references (where appropriate), and try to make the posts somewhat interesting. However, these posts are simply comments and - coming from a graduate student - shouldn't be taken too seriously. If you disagree with my points, please feel free to comment on the article and start a discussion. Nothing I say should be taken as personal or professional criticism - I'm simply trying to evaluate the science in the context of the field as it stands today.)


Monday, September 8, 2008


(Pre-script: In preparation for my comprehensive exam, I'm trying to prepare critical evaluations of papers in the area of my research. I'll provide comments, analysis and references (where appropriate), and try to make the posts somewhat interesting. However, these posts are simply comments and - coming from a graduate student - shouldn't be taken too seriously. If you disagree with my points, please feel free to comment on the article and start a discussion. Nothing I say should be taken as personal or professional criticism - I'm simply trying to evaluate the science in the context of the field as it stands today.)

(UPDATE: A response to this article was kindly provided by Anton Valouev, and can be read here.)

I once wrote a piece of software called WINQ, which was the predecessor of a piece of software called Quest. Not that I'm going to talk about that particular piece of Quest software for long, but bear with me a moment - it makes a nice lead in.

The software I wrote wasn't started before the University of Waterloo's version of Quest, but it was released first. Waterloo was implementing a multi-million dollar suite of software for managing student records, built on Oracle databases, PeopleSoft software, and tons of custom extensions to web interfaces and reporting. Unfortunately, the project was months behind, and the Quest system was nowhere near being deployed. (Vendor problems and the like.) That's when I became involved - in two months of long days, I used Cognos tools (several of them, involving 5 separate scripting and markup languages) to build the WINQ system, which gave the faculty a secure web front end for querying the Oracle database and getting all of the information they needed. It was supposed to be in use for about 4-6 months, until Quest took over... but I heard it was used for more than two years. (There are many good stories there, but I'll save them for another day.)

Back to ChIP-Seq's QuEST: this application was the subject of a recently published article. In a parallel to the Waterloo story, QuEST was probably started before I got involved in ChIP-Seq and was definitely released after I released my software - but this time, I don't think it will replace my software.

The paper in question (Valouev et al., Nature Methods, Advance Online Publication) is called "Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data." I suspect it was published with the intent of being the first article on ChIP-Seq software, which, unfortunately, it wasn't. What's most strange to me is that it seems to be simply a reiteration of the methods used by Johnson et al. in their earlier ChIP-Seq paper. I don't see anything novel in this paper, though maybe someone else has seen something I've missed.

The one thing that surprises me about this paper, however, is their use of a "kernel density bandwidth", which appears to be a sliding window of pre-set length. This flies in the face of the major advantage of ChIP-Seq, which is the ability to get very strong signals at high resolution. By forcing a "window" over their data, they are likely losing a lot of the resolution they could have found by investigating the reads directly. (Admittedly, with a window of 21bp, as used in the article, they're not losing much, so it's not a very heavy criticism.) I suppose it could be used to provide a quick way of doing subpeaks (finding individual peaks in areas of contiguous read coverage) at a cost of losing some resolving power, but I don't see that discussed as an advantage.
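To illustrate what a kernel-density profile does to the data, here's a generic KDE sketch - an illustration of the concept with an arbitrary Gaussian kernel, not QuEST's implementation:

```python
import math

# Generic kernel-density sketch (an illustration of the concept, not
# QuEST's code): each tag position contributes a Gaussian bump of width
# "bandwidth", and peaks are maxima of the smoothed profile. A wider
# bandwidth means a smoother profile - and lower resolution.

def density_profile(tag_positions, bandwidth=21):
    """Return (position, density) pairs over the span of the tags."""
    lo, hi = min(tag_positions), max(tag_positions)
    profile = []
    for x in range(lo - bandwidth, hi + bandwidth + 1):
        d = sum(math.exp(-0.5 * ((x - t) / bandwidth) ** 2)
                for t in tag_positions)
        profile.append((x, d))
    return profile

# A tight cluster of tags around 100-110, plus one stray tag at 300:
tags = [100, 102, 105, 106, 110, 300]
profile = density_profile(tags)
peak_pos = max(profile, key=lambda xy: xy[1])[0]   # smoothed maximum
```

With a ~21bp bandwidth the cluster still resolves to roughly the right base, which is why this isn't a heavy criticism - but the smoothing is still discarding resolution that the raw reads contain.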

The second thing they've done is provide a directional component to peak finding. Admittedly, I tried to do the same thing, but found it didn't really add much value. Both the QuEST publication and my application note on FindPeaks 3.1 mention the ability to do this - and then fail to show any data that demonstrates the value of using this mechanism versus identifying peak maxima. (In my case, I wasn't expected to provide data in the application note.)

Anyhow, that was the down side. There are two very good aspects to this paper. The first is that they do use controls. Even now, the Genome Sciences Centre is struggling with ChIP-Seq controls, while it seems everyone else is using them to great effect. I really enjoyed this aspect of it. In fact, I was rather curious how they'd done it, so I took a look through the source code of the application. I found the code somewhat difficult to wade through, as the coding style was very different from my own, but well organized. Unfortunately, I couldn't find any code for dealing with controls, which leads me to think this is an unreleased feature, and was handled by post-processing the results of their application. Too bad.

The second thing I really appreciated was the motif finding work, which isn't strictly ChIP-Seq, but is one of the uses to which the data can be applied. Unfortunately, this is also not new, as I'm aware of many earlier experiments (published and unpublished) that did this as well, but it does make a nice little story. There's good science behind this paper - and the data collected on the chosen transcription factors will undoubtedly be exploited by other researchers in the future.

So, here's my summary of this paper: as a presentation of a new algorithm, it fails to produce anything novel, and no experiments are provided to demonstrate the value of those algorithms versus any others. On the other hand, as a paper on growth-associated binding protein and serum response factor (GABP and SRF, respectively), it presents a nice, compact story.
