Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Please come visit my blog there.

Tuesday, September 30, 2008

More ups, less downs

This has been an interesting week, and it's only Tuesday. While last week dragged on - mainly because I was stuck at home sick - this week is flying by in a hurry. I have several topics to blog about (ABI SOLiD, Arrays vs Chip-Seq, Excellent organizational tools, etc) but now find myself with enough work that I don't really have much time to blog on all of them today. Hopefully I can get around to it in the next couple days.

Last week, I blogged about my frustration with the GSC environment, but I've since found a few things to act as a counter-point. Whereas last week I was stuck at home and could only watch things unfold from a distance, this week I'm back in the office and able to talk with my co-workers, and I've realized how valuable a tool that can be. Even if I'm frustrated with the jostling for position that goes on, there are people at the GSC who work hard and are happy to lend a hand. The GSC isn't perfect, but there are definitely fantastic people here.

That said, on a more personal note, I've had quite a few "perks" appear this week:
  • For the first time in my life, I've been asked to review a paper for a journal,
  • I've also become involved in several really cool projects,
  • I've had the opportunity to review some ABI SOLiD files for inclusion in FindPeaks,
  • I was even offered a chance to apply for a post-doc with a lab that I've been following closely. Talk about a flattering offer! (Unfortunately, I'm not close to graduating yet, so it's a bit premature. DRATS!)
Anyhow, it's stuff like that that keeps me writing the software and working hard. I guess that answers the question of why I try to do good work all of the time - a little bit of recognition goes a long way toward keeping me motivated. (=

Happy Tuesday, everyone.


Saturday, September 27, 2008

James Bond and evil hackers...

Totally off-topic blog today, but it's the weekend, so I feel like that's not a bad thing.

I watched "Tomorrow Never Dies" this evening, the 1997 James Bond movie starring Pierce Brosnan. I swear, I thought I'd seen most, if not all, of the James Bond flicks, though I guess this one had slipped right by me. It wasn't the best Bond movie I'd seen, but it was amusing enough for a Saturday night without a televised hockey game.

On my evening walk with the puppy, I got to thinking about who I'd be if I were in a Bond Movie.

Clearly, I wouldn't be James Bond himself - I leave that to my older brother, who probably is the Canadian James Bond. At least, I always picture him that way. (Though I don't usually picture him in a tuxedo, and I seriously doubt that martinis are his drink. I'll have to ask him next time I see him.)

I figured, based on my physical prowess, I probably wouldn't be any of the action heroes. I'm a little too clumsy for that sort of stuff. (You can ask any of my lab partners from undergrad about how well glassware and I get along.)

Which leaves me with two options: the guy who gets run over during the car chase (which isn't too flattering) or one of the generic assistant genius characters that hacks computers or develops bio-weapons in the pay of the truly insane evil world domination plotter. Well, since the glassware and I probably wouldn't get along on the set any better than in real life, that probably leaves the assistant hacker role.

And that brings me to my next observation. In most films these days, the hackers are either fat slobs (in the style of Dennis Nedry in Jurassic Park) or twig-thin, socially inept types (Neo from The Matrix). Instead, the evil American techno-genius computer hacker in this particular flick was made up to look like none other than Richard Stallman. (Considering how they lampoon Microsoft in this film, I wouldn't be surprised if that were deliberate.) Both the shot at Microsoft and this are geeky jokes, but an interesting comment on society from Hollywood's perspective. Anyhow, judge for yourself.

On the left, evil American techno-genius Henry Gupta (A very prescient name, considering the techno-expert Gupta that works at evil American corporation SCO...) played by Magician Ricky Jay, and Benevolent Genius Richard Stallman (as himself!) on the right. What do you think?

Left: Evil Genius Gupta. Right: Benevolent Genius Richard Stallman.

Thursday, September 25, 2008

grad school - phase II?

I'm trying to keep up with my one-posting per-day plan, but I considered abandoning it today. It's just hard to keep focussed long enough to write a post when you've been coughing all day. Fortunately, the coughing let up, the fever went away, and - even if my brain isn't back to normal - I'm feeling much better this afternoon than I was this morning.

Anyhow, spending the day on the couch gave me lots of time for reflection, even if my brain was at 1/4 capacity. I've been able to follow several email threads, and contribute to a couple along the way. And, of course, all this sitting around and watching email go by made me re-evaluate my attitude to grad school. Of course, this posting could just be because I'm sick and grumpy. Still, there's a nugget of truth in here somewhere...

Until today, I've been trying to write good quality code that can be used in our production pipelines - and some of it has been. FindPeaks has been a great success in that respect. [EDITED]

However, several other pieces of my work had been slated for production for several months - until a few days ago. Now they've all been discarded by the GSC in favour of other people's work. Of course, I can't complain about that: they're free to use whatever they want, and - despite doing my research at the GSC - I have to remind myself that I'm not a GSC employee, and they have no obligation to use my code.

To be fair, the production requirements for one element of the software changed this morning, and my code that was ready to be put in production a year ago is no longer cutting edge.

However, in hindsight, I guess I have to ask some questions: why have I been writing good code, when I could have been writing code that does only what I want, and doesn't handle the hundreds of other cases that I don't come across in my research? What advantages were there for me in doing a good job?
  • If I want to sell my code, it belongs to UBC. Clearly this is a non-starter.
  • If I want the GSC to use my code, I won't get any recognition for it. (That's the problem with writing a tool: Many people will get the results, but no one will ever know that it was my code in use, and unlike lab work, pipeline software doesn't seem to get included on publications.)
  • If I'm doing it just for myself, then there's the satisfaction of a job well done, but it distracts me from other projects I could be working on.
  • If I want to publish using my code, it doesn't have to be production ready - it only has to do the job I want it to, so there's clearly no advantage here.
Which leaves me with only one decent answer:
  • develop all my code freely on the web, and let other people participate. The more production-ready it is, the more people will use it. The more people who use it, the more my efforts are recognized - which in a way, is what grad school is all about:
Publications and recognition.

I seem to have more or less already adopted that model [EDITED]

Some time last year, I came to the realization that I should stop helping people who just take advantage of my willingness to work, but I think this is a new, more jaded version of that.

Viva la grad school.

[EDITED] I guess now I understand why people leave Academia.


Wednesday, September 24, 2008

The best laid schemes...

"The best laid schemes o’ mice an’ men / Gang aft a-gley.” - Robert Burns

Well, ok, I don't speak 18th-century Scots, but I fully understand the sentiment. I have several major things in the works, and yet I'm suffering from a bad cold and sore throat. So the FindPeaks manual is going to be delayed a few days. I guess I'm not the only one working on it, but I definitely have plans for it.

Anyhow, between the FindPeaks manual, some new FindPeaks features, a poster abstract and two papers I'm working on, I have enough on my plate - but just don't have the energy to get it all done. Damn colds! They always get you just as the stress level starts to rise.

Anyhow, Thanks to tcezard (Tim), who's taking on an ever increasing role on the FindPeaks work, FindPeaks 3.2.1 was tagged today. Development is so much faster when people work together!

I thought I'd also go one further today, and add a plug in for a very cool technology that hasn't (to my knowledge) gotten much coverage - ODF@WWW. As a scientist who uses wikis and blogs, this has to be one of the best applications I've seen of extending the wiki concept. This has the ability to change the way scientists communicate data, and allow wikis to be incorporated into environments where they previously were "too complex to be useful." Of course, more people have to adopt ODF before this will really take off, but with applications like this, it's only a matter of time. (Yay open formats for making this possible!)


Monday, September 22, 2008

Two milestones reached

Two of my three projects have met significant milestones this weekend, which I thought were worth sharing.

The first one is that FindPeaks had its first official patch submitted to me from developers outside the Genome Science Centre, on Sunday night. Thank you very much to the submitters! I'm very glad to see someone using (and playing with!) the code. While the patch was minor, it tells me that there is definite value in having released it to the wild.

The second milestone is for my blog, which - while not officially a "project" in the academic sense of the word - has consumed enough of my time that it starts to feel like one. The milestone? There are now as many people coming here directly as there are finding it through Google. It's not a huge number of hits, but it's nice to know there's an audience enjoying my random rants. (-:


Friday, September 19, 2008

MACS and the end of the first generation of peak finders.

First of all, I'd like to thank the anonymous poster who brought this paper (Zhang et al., Model-based Analysis of ChIP-Seq (MACS), Genome Biology) to my attention, and Xue, who passed me the link to it. It didn't appear in Google for me until this morning, so knowing about it beforehand was very helpful.

Unlike the other ChIP-Seq software papers that I reviewed, I think this one is the first that I feel was really well done - and it's a pretty decent read too. My biggest complaint is just that we had printer issues this morning, so it took me nearly 2 hours to get it printed out. (-;

Ok. Before I jump into technical details, I just have to ask one (rhetorical) question to Dr. Johnson: is that really your work email address? (-:

MACS appears to be an interesting application. Feature for feature, it competes well with the other applications in its group. In fact, from the descriptions given, I think it does almost identical calculations to those implemented in FindPeaks 3.1. (And yes, they seem to have used 3.1+ in their comparisons, though it's not explicitly stated which version was used.)

There are two ways in which MACS differs from the other applications:
  1. the use of a "shift" (referred to as an "adaptive fragment length" in FindPeaks), and
  2. the use of controls in a more coherent manner.
The first is interesting because it's one of the last open questions in ChIP-Seq data: how do you determine the actual length of a single end tag?

Alas, while FindPeaks had a very similar module for calculating fragment lengths in version 3.1.0, it was never released to the public. Here, MACS beats FindPeaks to the punch, and has demonstrated that an adaptive fragment size can work. In fact, their solution is quite elegant: identify the top X peaks, then calculate the distance from the peak at which the ends normally occur. I think there's still room for improvement, but this has now opened the Pandora's box of fragment length estimation for everyone to get their hands dirty and play with. And it's about time.
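To make the idea concrete, here's a minimal sketch of how such a top-peaks fragment-length estimate could work. This is my own toy reconstruction of the approach described above, not MACS's actual code; all function and variable names are invented.

```python
# Toy reconstruction of the "top X peaks" fragment-length estimate:
# take the strongest candidate regions, find the modal tag start
# position on each strand, and average the distance between the modes.

def estimate_fragment_length(peaks, top_x=100):
    """peaks: list of (fwd_starts, rev_starts) tag-position lists for
    candidate regions, sorted strongest-first."""
    distances = []
    for fwd_starts, rev_starts in peaks[:top_x]:
        if not fwd_starts or not rev_starts:
            continue
        # The modal start position on each strand approximates the
        # fragment boundary flanking the true binding site.
        fwd_mode = max(set(fwd_starts), key=fwd_starts.count)
        rev_mode = max(set(rev_starts), key=rev_starts.count)
        if rev_mode > fwd_mode:
            distances.append(rev_mode - fwd_mode)
    return sum(distances) / len(distances) if distances else None

# Forward tags cluster near 100, reverse tags near 280, suggesting
# fragments of roughly 180 bp.
print(estimate_fragment_length([([98, 100, 100, 102],
                                 [278, 280, 280, 281])], top_x=1))
```

The real estimator would of course operate on millions of alignments and be smarter about ties and noise, but the shape of the calculation is the same.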

The second improvement is their handling of controls. Several applications have moved "control" handling into the core algorithms, while FindPeaks has chosen to keep it part of the post-processing. Again, this is a good step in the right direction, and I'm glad to see them make controls an integral (though optional) part of the ChIP-Seq analysis.

To get into the specifics of the paper, there are several points that are worth discussing.

As a minor point, the terminology in the paper is mildly different, and doesn't seem to reflect the jargon used in the field. They describe fragment lengths in terms of shifting their positions, rather than extending the lengths, and terms like "Watson strand" and "Crick strand" are used to refer to the forward and reverse strands - I haven't heard those terms used since high school, but it's cool. (=

Feature wise, MACS implements familiar concepts such as
  • duplicate filtering
  • thresholding misnamed as FDR (just like FindPeaks!)
  • peak finding that reports the peak max location
  • requirement of an effective genome size
and then offers some very nice touches
  • p-value cutoffs for peak reporting
  • built in analysis of controls
  • built in saturation calculations
Altogether, this makes for a great setup for a head-to-head competition. So guess what they do next? That's right: it's a peak-finding bake-off!

For each test, MACS is compared against ChIP-Seq Peak Finder (from the Wold lab), FindPeaks (GSC) and QuEST (Valouev). And the competition is on (See Figure 2 in the paper).

Test number one (a) shows that MACS is able to keep the FDR very low for the "known set of FOXA1 binding sites" compared to the other three peak finders. This graph could either be very informative or very confusing, as there are no details on how the number of sites for each application was chosen, whose FDR was used, or otherwise. It could be telling us that MACS underestimates the real FDR for its peaks, that the other peak finders overstate the FDR for their peaks, or that only MACS is able to identify peaks that are true positives. MACS wins this round - though it may be by a technicality.

(Assuming that the peaks are all the same across peak callers, then all this tells us is that the peaks are assigned different FDRs by each peak caller, so I'm not even sure it's a test of any sort.)

Test number two (b) repeats the first test for a different binding site, with the same result. I still can't tell if the binding sites are the same between peak callers or different, making this comparison rather meaningless, except for understanding the FDR biases produced by each peak caller.

Test number three (c) starts to be a little more informative: using a subset of peaks, how many of them contain the motif of interest? With the exception of the Wold lab's Peak Finder, they all start out around 65% for the top(?) 1000 peaks, and then slowly wind their way down. By 7000 peaks, MACS stays up somewhere near 60%, FindPeaks is about 55%, and QuEST never actually makes it that far, ending up with about 58% by 6000 peaks. (The Wold lab Peak Finder starts at 58% and ends at about 54% at 6000 peaks as well.) So MACS seems to win this one - its enrichment stays the best over the greatest number of peaks. This is presumably due to its integration of controls into the application itself. Not surprisingly, QuEST, which also integrates controls, ends up a close second. FindPeaks and the Wold lab Peak Finder lose here. (As a small note, one assumes that many of these peaks would be filtered out in the post-processing used by the Wold lab's software and FindPeaks, but that brings in a whole separate pipeline of tools.)

The fourth test is the same as the third test, but with the NRSF data set. The results are tighter this time, and surprisingly, FindPeaks ends up doing just a little better than QuEST, despite the lack of built-in control checking.

The fifth and sixth tests roughly parallel the last two, but this time give the spatial resolution of the motif. MACS wins again, with FindPeaks a close second for most tests, and QuEST a hot contender. But it's really hard to get excited about this one: the gap between best and worst here rarely exceeds 3 bases for FoxA1, and MACS appears to stay within 1-1.5 bases of the motifs identified by FindPeaks for NRSF.

Take-away message: MACS does appear to be the best application among those tested, while the Wold lab Peak Finder seems to lose in all cases. Ouch. QuEST and FindPeaks duke it out for second place, trading punches.

Like all battles, however, this is far from the last word on the subject. Both MACS and FindPeaks are now being developed as open source applications, which means that I expect to see future rounds of this battle being played out - and a little competition is a good thing. I know FindPeaks is pushing ahead in terms of improving FDR/thresholding, and I'm sure all of the other applications are as well.

One problem with all of these applications is that they still assume a Poisson (or similar) model for the noise/peaks - which, we're all slowly learning, isn't all it's cracked up to be. (See ChIP-Seq in Silico.)

But, in contrast, I think we can now safely say that the low-hanging fruit of ChIP-Seq has all been picked. Four independently developed software applications have all been compared - and they all have roughly the same peaks, within a very narrow margin of error on the peak maxima.

To wrap this up, I think it's pretty clear that the next round of ChIP-Seq software will be battled out by those projects that gain some momentum and community. With the peak finders giving more or less equivalent performance (with MACS just a hair above the others for the moment) and more peak finders becoming open source, the question of which one will continue moving forwards can be viewed as a straight software problem. Which peak finder will gain the biggest and most vibrant community?

That, it seems, will probably come down to a question of programming languages.

Thursday, September 18, 2008

FindPeaks 3.2.0

Obviously, I didn't write an article review this morning. Instead, I finished work on an "unstable" FindPeaks 3.2.0 release. It's now been tagged, for anyone who'd like to download the source code and build it for themselves. The documentation has some catching up to do (which I plan to work on toward the end of the week), but the majority of the features are in place and appear to be functional.

I was debating posting the changelog to SEQanswers, but I think I'll hold off until I have a more stable version. In the meantime, I'll just post a summary from the tag commit here. Welcome to the FindPeaks 3.2 series!

(Update: I should also mention two important details. This release includes code contributions and fixes from Timothee Cezard, another developer at the GSC, as well as a code donation (Lander-Waterman FDR algorithm) from Mikhail Bilenky, also at the GSC.)

Changes since FindPeaks
  • New FindPeaks Core Module
  • Significant reorganization/modularity/OOP of code
  • Significant cleanup of code
  • ability to read MAQ .map files
  • ability to read bed files (both reduced and extended)
  • Log_Buffer to provide buffered and logged output to file
  • Several bug fixes to reported issues
  • better error handling
  • better error messages and verbose output
  • New coverage based FDR (Thresholding methods) incorporated
  • Native mode (mode 3) for MAQ SET/PET or Bed files, or generating wigs
  • Conforms to most Enerjy best practice recommendations. (over 2000 code changes.)
  • More controlled exits to handle system crashes gracefully.
  • better handling of constants
  • smoothing can be turned on or off for adaptive fragment size estimation.
  • utilities for working with PETs
  • FindPeaks can now be used as a simple wig generator
  • hundreds more fixes and updates!


Wednesday, September 17, 2008

Math as art

I came across an article on the BBC website about the art of maths, which is well worth the two minutes or so it takes to play. The images are stunning (I particularly like the four dimensional picture in two dimensions), and the narration is quite interesting. As a photographer, I especially enjoyed the comparison of math as art to photography as an art. My own take is that both math and photography share the common artistic element of forcing the artist to determine how to best express what it is they're seeing. Two people can see the same thing, and still get very different photos, which is also a component of expressing math in an artistic manner.

That got me thinking about genomics as art as well. I'm aware of people who've made music out of DNA sequences, but not visual art. Oh, sure, you can mount a karyotype or electrophoresis image on your wall, and it's pretty, but I don't think genomics has realized the potential for expressing DNA form and function in a non-linear manner yet.

Still, it's obviously on its way. Every time I go to a presentation, I see a few more "pretty" graphs. Or maybe I've just gone to too many seminars when a graph of the clustering of gene arrays starts to look like a Mondrian. Who knows... maybe ChIP-Seq will start looking like that too? (=


Tuesday, September 16, 2008

Processing Paired End Reads

Ok, I'm taking a break from journals. I didn't like the overly negative tone I was taking in those reviews, so I'm rethinking how I write about articles. Admittedly, I think criticism is in order for some papers, but I can focus on papers that are worth reviewing. Unfortunately, I'm rarely a fan of papers discussing bioinformatics applications, as I always feel like there's a hidden agenda behind them. Whether it's simply proving their application is the best, or just getting it out first, computer application papers are rarely impartial.

Anyhow, I have three issues to cover today:
  • FindPeaks is now on sourceforge
  • Put the "number of reads under a peak" metric to rest - permanently, I hope.
  • Bed files for different data sources.

The first one is pretty obvious. FindPeaks is now available under the GPL on SourceForge, and I hope people will participate in using and improving the software. We're aiming for our first tagged release on Friday, with frequent tags thereafter. Since I'm no longer the only developer on this project, it should continue moving forward quickly, even while I'm busy studying for my comps.

The second point is this silly notion that keeps coming up: "How many reads were found under each peak?" I'm quite sick of that question, because it really makes no sense. This was a metric produced in Robertson et al.'s STAT1 data set, and I think other people have simply included or copied it. Unfortunately, it's rubbish.

The reason it worked in STAT1 was that they used a fixed length (or XSET) value on their data set. This let them treat every read as having the same known extended length, which allowed them to figure out how many reads were "contiguously linked in each peak." Readers who are paying attention will also notice the second problem... they didn't use subpeaks either. Once you start identifying subpeaks, you can no longer decide which peak a read spanning several peaks belongs to. Beyond that, what do you do with reads in a long tail? Are they part of the peak or not?

Anyhow, the best measure for a peak, at the moment at least, is its height. This can also incorporate weighted reads, so that reads which are unlikely to contribute to a peak actually contribute less, bringing in a scaled value. After all, unless you have paired end tags, you really don't know how long the original DNA fragment was, which means you don't know where it ended.
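For illustration, here's a small sketch of what a weighted peak height could look like. The weighting scheme and all names here are my own invention for the example, not FindPeaks internals: each read carries a weight (say, down-weighted if it maps ambiguously), and the peak height is the maximum weighted coverage.

```python
# Sketch: peak height as maximum weighted coverage. Each read is an
# interval plus a weight; a sweep over interval endpoints tracks the
# running weighted depth and records the maximum.

def weighted_height(read_intervals, weights):
    """read_intervals: list of (start, end); weights: per-read weights.
    Returns the maximum weighted coverage over any base."""
    events = []
    for (start, end), w in zip(read_intervals, weights):
        events.append((start, w))    # coverage rises at read start
        events.append((end, -w))     # and falls at read end
    events.sort()
    height = best = 0.0
    for _, delta in events:
        height += delta
        best = max(best, height)
    return best

reads = [(10, 50), (20, 60), (30, 70)]
print(weighted_height(reads, [1.0, 1.0, 1.0]))   # all unique reads
print(weighted_height(reads, [1.0, 0.5, 0.25]))  # ambiguous reads down-weighted
```

With uniform weights this is just the ordinary peak height; with scaled weights, dubious reads inflate the peak less.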

That also makes a nice segue to my last point. There are several ways of processing paired end tags. When it comes to peak calling it's pretty simple: you use the default method - you know where the two ends are, and they span the full read. For other applications, however, there are complexities.

If the data source is a transcriptome, your read didn't cover the intron, so you need to process the transcript to include breaks when mapping it back to the genome. That's a real pain, but it is clearly the best way to visualize transcriptome PETs.

If the data source is unclear, or you don't know where the introns are (which is a distinct possibility), then you have to be more conservative and not assume the extension of each tag. Thus, you end up with a "tag and bridge" format. I've included an illustration to make it clear.
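To show what I mean, here's a rough sketch that writes a paired-end tag in tag-and-bridge style using BED12 blocks, which genome browsers draw as two solid tags joined by a thin connecting line. Only the BED12 field layout is standard; the conversion function and its names are my own illustration, not my actual code.

```python
# Sketch: emit a paired-end tag as a BED12 line with two blocks (the
# sequenced tags) and a thin "bridge" between them, rather than one
# solid interval spanning the whole fragment.

def pet_to_bed12(chrom, tag1, tag2, name="pet"):
    """tag1/tag2: (start, end) of the two sequenced ends, 0-based
    half-open, on the same chromosome."""
    (s1, e1), (s2, e2) = sorted([tag1, tag2])
    start, end = s1, e2
    block_sizes = [e1 - s1, e2 - s2]
    block_starts = [0, s2 - start]   # offsets relative to `start`
    return "\t".join(map(str, [
        chrom, start, end, name, 0, "+",
        start, end,          # thickStart/thickEnd
        "0,0,0", 2,          # itemRgb, blockCount
        ",".join(map(str, block_sizes)),
        ",".join(map(str, block_starts)),
    ]))

print(pet_to_bed12("chr1", (1000, 1036), (1400, 1436)))
```

The default mode, by contrast, would just emit `chr1 1000 1436` as one filled interval, implicitly claiming the fragment covers everything in between.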

So why bring it up? Because I've had several requests for the tag-and-bridge format, and my code only works on the default mode. Time to make a few adjustments.


Friday, September 12, 2008


One more day, one more piece of ChIP-Seq software to cover. I've not talked much about FindPeaks, the software descended from Robertson et al., for obvious reasons: the paper was just an application note, and, well, I'm really familiar with how it works, so I'm not going to review it. I have talked about QuEST, however, which was presumably descended from Johnson et al. And those of you who have been following ChIP-Seq papers since the early days will realize that there's still something missing: the peak finder descended from Barski et al., which is the subject of today's blog: SISSR. Those were the first three published ChIP-Seq papers, so it's no surprise that each of them was followed up with a paper (or application note!) on their software.

So, today, I'll take a look at SISSR, to complete the series.

From the start, the Barski paper discussed both histone modifications and transcription factors, so the context of the peak finder is a little different. Where FindPeaks (and possibly QuEST as well) was originally conceived for identifying single peaks and later expanded to handle multiple peaks, I would imagine that SISSR was conceived with the idea of working on "complex" areas of overlapping peaks. That's only relevant in terms of their analysis, though - I'll come back to it.

The most striking thing you'll notice about this paper is that the datasets look familiar. They are, in fact, the sets from Robertson, Barski and Johnson: STAT1, CTCF and NRSF, respectively. This is the first of the ChIP-Seq application papers that actually performs a comparison between the available peak finders, and, of course, it claims that theirs is the best. Again, I'll come back to that.

The method used by SISSR is almost identical to the method used by FindPeaks, with the use of directional information built into the base algorithm, whereas FindPeaks provides it as an optional module (-directional flag, which uses a slightly different method). They provide an excellent visual image on the 4th page of the article, demonstrating their concept, which will explain the method better than I can, but I'll try anyhow.

In ChIP-Seq, a binding site is expected to have many real tags pointing at it: tags upstream should be on the sense strand, and tags downstream should be on the anti-sense strand. Thus, a real binding site should exist at transition points, where the majority of tags switch from the sense to the anti-sense strand. By identifying these transition points, they are able to identify the locations of real binding sites. More or less, that describes the algorithm employed, with the following modifications: a window is used (20bp by default) instead of doing it on a base-by-base basis, and parameter estimation is employed to guess the length of the fragments.
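Here's a stripped-down sketch of that transition-point idea as I understand it. The window size and all the names are assumptions for the illustration, not SISSR's actual implementation: slide a window along the chromosome, compute the net (sense minus anti-sense) tag count per window, and report positions where the bias flips from sense-heavy to anti-sense-heavy.

```python
# Sketch: strand-transition detection. Counts are per-base arrays of
# tag starts on each strand; a transition is a window boundary where
# the net strand bias flips from positive (sense) to negative.

def find_transitions(sense_counts, antisense_counts, window=20):
    """Returns genomic positions (window boundaries) where the net
    sense-minus-antisense tag count flips from + to -."""
    n = len(sense_counts)
    net = []
    for i in range(0, n - window + 1, window):
        s = sum(sense_counts[i:i + window])
        a = sum(antisense_counts[i:i + window])
        net.append(s - a)
    transitions = []
    for k in range(1, len(net)):
        if net[k - 1] > 0 and net[k] < 0:
            transitions.append(k * window)
    return transitions

sense =     [3] * 20 + [0] * 20   # sense tags pile up upstream
antisense = [0] * 20 + [3] * 20   # anti-sense tags pile up downstream
print(find_transitions(sense, antisense))  # one site at position 20
```

The real algorithm layers parameter estimation and significance testing on top, but the core signal it looks for is exactly this sign flip.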

In my review of QuEST, I complained that windows are a bad idea(tm) for ChIP-Seq, only to be corrected that QuEST wasn't using a window. This time, the window is explicitly described - and again, I'm puzzled. FindPeaks uses an identical operation without windows, and it runs blazingly fast. Why throw away resolution when you don't need to?

On the subject of length estimation, I'm again less than impressed. I realize this is probably an early attempt at it - and FindPeaks has gone through its fair share of bad length estimators, so it's not a major criticism, but it is a weakness. To quote a couple of lines from the paper: "For every tag i in the sense strand, the nearest tag k in the anti-sense strand is identified. Let J be the tag in the sense strand immediately upstream of k." Then follows a formula based upon the distances between (i,j) and (j,k). I completely fail to understand how this provides an accurate assessment of the real fragment length - I'm sure I'm missing something. As a function describing the width of peaks, it may be a good method, which is really what the experiment is aiming for anyhow - so it's possible that this is just poorly named.

In fairness, they also provide options for a manual length estimation (or XSET, as it was referred to at the time), which overrides the length estimation. I didn't see a comparison in the paper about which one provided the better answers, but having lots of options is always a good thing.

Moving along, my real complaint about this article is the analysis of their results compared to past results, which comes in two parts. (I told you I'd come back to it.)

The first complaint is what they were comparing against. The article was submitted for publication in May 2008, but they compared results to those published in the June 2007 Robertson article for STAT1. By August, our count of peaks had changed. By January 2008, several upgraded versions of FindPeaks were available, and many bugs had been ironed out. It's hardly fair to compare the June 2007 FindPeaks results to the May 2008 version of SISSR, and then declare SISSR the clear winner. Still, that's not a great problem - albeit somewhat misleading.

More vexing is their quality metric. In the motif analysis, they clearly state that, because of the large amount of computing power required, only the top X% of reads were used in their analysis. For comparison with FindPeaks, the top 5% of peaks were used - and the same motifs were observed. Meanwhile, their claim to find 74% more peaks than FindPeaks is not really discussed in terms of the quality of the additional sites. (FindPeaks was also modified to identify sub-peaks after the original data set was published, so this is really comparing apples to oranges - a fact glossed over in the discussion.)

Anyhow, complaints aside, it's good to see a paper finally compare the various peak finders out there. They provide some excellent graphics, and a nice overview on how their ChIP-Seq application works, while contrasting it to published data available. Again, I enjoyed the motif work, particularly that of figure 5, which correlates four motif variants to tag density - which I feel is a fantastic bit of information, buried deeper in the paper than it should be.

So, in summary, this paper presents a rather unfair competition by using metrics guaranteed to make SISSR stand out, but still provides a good read with background on ChIP-Seq, excellent illustrations and the occasional moment of deep insight.


Thursday, September 11, 2008

Response to QuEST posting

One thing I really enjoy about this blog is that it seems to encourage communication between people in the 2nd Generation Genomics field. In particular, I get to hear from a lot of people that I otherwise would never have the privilege of meeting. After my posting on the QuEST software for ChIP-Seq the other day, I got a reply from Anton Valouev, the first author on the QuEST paper. (Whose name I accidentally misspelled in the original post. Sorry - it's now been fixed!)

He was kind enough to point out some of my mistakes, and to point me to the wikipedia entry on Kernel Density Estimation (KDE), which I had misunderstood as a windowing function. Anyhow, not only was he willing to teach me some science, he was also kind enough to let me print his reply here, which rebuts several of my points very well, and clearly points out the significance of the paper, which I had understated.

Here's Anton Valouev's reply:

Hi Anthony,

my name is Anton Valouev and I really enjoyed your blog post featuring QuEST. I wanted to point out a couple of things about your blog post that I wanted to clarify.

Although kernel density estimation uses bandwidth, it is not strictly a windowed approach since the points beyond the bandwidth window contribute to the estimation (see the formula in the methods section of our paper). KDE is an extremely powerful approach and surpasses windowed counts in every single respect except perhaps in simplicity of calculations. Wikipedia provides good reading on KDE approaches.

The novelty of the paper is in the software itself which appears to outperform the Wold lab peak caller (featured in Joh[n]son et al, although we do not demonstrate this in the paper as you have noted).

For these reasons, QuEST is currently adopted as a default peak caller by the ENCODE consortium. Yes, the analysis of the data is somewhat similar to Johnson et al Science paper, but the software is completely new which is the main point of the paper. Another new result is in direct detection of the ELK4 motif in the SRF data which hasn't been shown before. Obviously motif finding is not new, but this particular result is.

The code dealing with controls is in the peak_caller.cpp lines 248-256, so this is not an unreleased feature. Also, I'd really appreciate if you can change my last name to correct spelling.

Good luck with your FindPeaks software, ChIP-Seq is seeing its early days, so much work is still to be done in this area.

Very best, Anton.
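Anton's distinction between a hard windowed count and kernel density estimation can be illustrated with a quick toy sketch. (This is my own illustration for the blog - it's not QuEST's actual code, and the read positions are made up.) A read just outside a window contributes nothing to the count, while under a Gaussian KDE every read contributes something, weighted by distance:

```python
import math

def window_count(reads, center, width):
    # Hard window: reads outside it contribute exactly nothing.
    return sum(1 for r in reads if abs(r - center) <= width / 2)

def kde_density(reads, x, bandwidth):
    # Gaussian KDE: every read contributes, weighted by its distance
    # from x, so points beyond the bandwidth still add a small amount.
    norm = 1.0 / (len(reads) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - r) / bandwidth) ** 2)
                      for r in reads)

reads = [100, 102, 103, 105, 150]      # hypothetical read positions (bp)
print(window_count(reads, 103, 21))    # the read at 150 is invisible here
print(kde_density(reads, 103, 21))     # ...but still contributes here
```

That's the sense in which KDE "is not strictly a windowed approach": the bandwidth controls the weighting, not a cutoff.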

Wednesday, September 10, 2008


I had an article all ready to write about today, but then decided to save it for tomorrow, and instead write about something completely unrelated: the CIRA AGM. Yes, I went downtown for part of the afternoon to the annual general meeting of the organization that controls the top level domain for Canada. (The .ca domain.)

Ok, so I'll admit, they had good door prizes for people who sat through the whole thing (which I did), and with five chances to win good stuff (which I didn't), I was tempted to go. The free lunch and 1Gb USB key really did make it worth my time. Besides, it seemed like something that could be related to my work with computers. (It wasn't - but I really didn't know that when I decided to go.)

Lunch was somewhat disappointing - they ran out of a lot of food before I got there. That's my biggest complaint. Anyhow....

The AGM itself wasn't really enlightening, though the question period was entertaining. The guy who stood up and called CIRA "a bloated bureaucracy" was amusing, and the other guy, who insisted the President defend their choice of speaker (who was in the room at the time), really made the question period a hoot. I was laughing pretty hard at the fellow who demanded the CIO help him troubleshoot why his domain wasn't appearing in the listings of several search engines.

The more relevant part was the keynote speaker, Amber McArthur. For a crowd that included a lot of over-50 attendees, an introduction to web 2.0 social networking was probably an eye opener. For anyone who's ever blogged, it was underwhelming - but it made for an interesting social commentary on the management of CIRA.

Putting this in the context of some of the other statements made during the president's address and by other members of the board, a clear picture starts to emerge (my paraphrasing):
  • They're a windows only shop, internally, but don't mind using unix for their DNS servers.
  • They've selected Microsoft Exchange as the premier email management and calendaring tool (and blew a lot of money on it this year).
  • They believe they're on the cutting edge of domain registry authorities in the world in terms of technology and transparency.
  • They're extremely proud of their use of Oracle 10g (migrated this year).
  • They're proud of their recent decision to use Cognos Business tools. (I've developed for them before... Cognos tools aren't all they're cracked up to be.)
  • One of their biggest achievements this year was to automate password recovery for members! (it was previously a manual process.)

I could go on, but I won't.

Essentially, I don't think they're a bloated bureaucracy, but they're clearly wedded to the proprietary business model for the software they acquire (which is probably not a bad thing, considering their need for 100% uptime), and they aren't spending much time figuring out how to evolve with the internet. They're a perfect model of what a software/internet shop should have been in the late 1990s. Do you think they use wikis in their office environment? How about instant messaging between developers? What about social networking to educate the public on their mandate? Have they ever contributed information to the Wikipedia page on CIRA?

Given that they needed to invite a speaker to introduce them to web 2.0, I highly doubt it. They don't "grok" where the web is going.

That's the irony: the organization in charge of keeping Canada's internet presence in order is clueless as to where the internet is going, and how it's transforming itself through new technology. Come on, people - if most of your board members have never heard of Facebook, how are they going to guide you through the next 10 years of the web's evolution?


Tuesday, September 9, 2008

ChIP-Seq in silico

Yesterday I got to dish out some criticism, so it's only fair that I take some myself, today. It came in the form of an article called "Modeling ChIP Sequencing In Silico with Applications", by Zhengdong D. Zhang et al., PLoS Computational Biology, August 2008: 4(8).

This article is actually very cool. They've settled several points that have been hotly debated here at the Genome Sciences Centre, and made the case for some of the stuff I've been working on - and then show me a few places where I was dead wrong.

The article takes direct aim at the work done in Robertson et al., using the STAT1 transcription factor data produced in that study. Their key point is that the "FDR" used in that study was far from ideal, and that it could be significantly improved. (Something I strongly believe as well.)

For those that aren't aware, Robertson et al. is sort of the ancestral origin of the FindPeaks software, so this particular paper is more or less aiming at the FindPeaks thresholding method. (Though I should mention that they're comparing their results to the peaks in the publication, which used the unreleased FindPeaks 1.0 software - not the FindPeaks 2+ versions, of which I'm the author.) Despite the comparison to the not-quite current version of the software, their points are still valid, and need to be taken seriously.

Mainly, I think there are two points that stand out:

1. The null model isn't really appropriate.
2. The even distribution model isn't really appropriate.

The first, the null model, is relatively obvious - everyone has been pretty clear from the start that the null model doesn't really work well. This model, which is pretty much consistent across ChIP-Seq platforms, can be paraphrased as "if my reads were all noise, what would the data look like?" This assumption is destined to fail every time - the reads we obtain aren't all noise, and thus assuming they are as a control is really a "bad thing"(tm).

The second, the even distribution model, is equally disastrous. This can be paraphrased as "if all of my noise were evenly distributed across some portion of the chromosome, what would the data look like?" Alas, noise doesn't distribute evenly in these experiments, so it should be fairly obvious why this is also a "bad thing"(tm).
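For concreteness, here's a rough sketch of what an even-distribution null model amounts to. (This is my own simplification, not code from FindPeaks or the paper; all numbers are invented.) Scatter the same number of reads uniformly at random, record the deepest pileup in each trial, and call anything taller than most null maxima a "peak":

```python
import random

def null_max_heights(n_reads, genome_len, read_len, trials=100, seed=42):
    # "If my reads were all noise": place reads uniformly at random and
    # record the deepest pileup observed in each simulated trial.
    rng = random.Random(seed)
    maxima = []
    for _ in range(trials):
        depth = [0] * genome_len
        for _ in range(n_reads):
            start = rng.randrange(genome_len - read_len + 1)
            for pos in range(start, start + read_len):
                depth[pos] += 1
        maxima.append(max(depth))
    return maxima

maxima = null_max_heights(n_reads=500, genome_len=10000, read_len=25)
# A height exceeded in under 5% of null trials passes a naive 5% cutoff.
cutoff = sorted(maxima)[int(0.95 * len(maxima))]
```

The flaw is exactly the one the authors point out: real noise isn't uniform, so this cutoff is calibrated against a background that never actually occurs.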

The solution presented in the paper is fairly obvious: create a full simulation for your ChIP-Seq data. Their version requires a much more rigorous process, however. They simulate a genome-space, remove areas that would be gaps or repeats in the real chromosome, then tweak the simulated genome to replicate their experiment, using weighted statistics collected from the ChIP-Seq experiment.
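A minimal sketch of the masking step, as I read it (my paraphrase of their approach - the genome size and masked regions are invented for illustration): sample read starts uniformly, but reject any that land in gap or repeat regions, so the simulated background only accumulates where reads could actually map.

```python
import random

def sample_mappable_starts(n_reads, genome_len, masked, seed=7):
    # Rejection-sample start positions: anything falling in a masked
    # (gap/repeat) region is discarded and redrawn.
    rng = random.Random(seed)
    starts = []
    while len(starts) < n_reads:
        pos = rng.randrange(genome_len)
        if not any(lo <= pos < hi for lo, hi in masked):
            starts.append(pos)
    return starts

# Hypothetical 10 kb "genome" with two masked regions.
starts = sample_mappable_starts(1000, 10000, [(2000, 2500), (7000, 7800)])
```

Because the same number of reads now lands in less sequence, the simulated background is denser in mappable regions than a naive uniform model would predict - which is part of why the naive model misjudges noise peaks.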

On the one hand, I really like this method, as it should give a good approximation of a control; on the other hand, I don't like that you need to know a lot about the genome of interest before you can analyze your ChIP-Seq data (i.e., mappability, repeat-masking, etc.). Of course, if you're going to simulate your genome, simulate it well - I agree with that.

I don't want to belabor the point, but this paper provides a very nice method for simulating ChIP-Seq noise in the absence of a control, as in Robertson et al. However, I think there are two things that have changed since this paper was submitted (January 2008) that should be mentioned:

1. FDR calculations haven't stood still. Even at the GSC, we've been working on two separate FDR models that no longer use the null model; however, both still make even-distribution assumptions, which is also not ideal.

2. I believe everyone has now acknowledged that there are several biases that can't be accounted for in any simulation technique, and that controls are the way forward. (They're used very successfully in QuEST, which I discussed yesterday.)

Anyhow, to summarize this paper: Zhang et al. provide a fantastic critique of the thresholding and FDR used in early ChIP-Seq papers (still in use today, in one form or another), and demonstrate a viable and clearly superior method for refining ChIP-Seq results without a matched control. This paper should be read by anyone working on FDRs for next-gen sequencing and ChIP-Seq software.

(Post-script: In preparation for my comprehensive exam, I'm trying to prepare critical evaluations of papers in the area of my research. I'll provide comments, analysis and references (where appropriate), and try to make the posts somewhat interesting. However, these posts are simply comments and - coming from a graduate student - shouldn't be taken too seriously. If you disagree with my points, please feel free to comment on the article and start a discussion. Nothing I say should be taken as personal or professional criticism - I'm simply trying to evaluate the science in the context of the field as it stands today.)

Labels: , ,

Monday, September 8, 2008


(Pre-script: In preparation for my comprehensive exam, I'm trying to prepare critical evaluations of papers in the area of my research. I'll provide comments, analysis and references (where appropriate), and try to make the posts somewhat interesting. However, these posts are simply comments and - coming from a graduate student - shouldn't be taken too seriously. If you disagree with my points, please feel free to comment on the article and start a discussion. Nothing I say should be taken as personal or professional criticism - I'm simply trying to evaluate the science in the context of the field as it stands today.)

(UPDATE: A response to this article was kindly provided by Anton Valouev, and can be read here.)

I once wrote a piece of software called WINQ, which was the predecessor of a piece of software called Quest. Not that I'm going to talk about that particular piece of Quest software for long, but bear with me a moment - it makes a nice lead in.

The software I wrote wasn't started before the University of Waterloo's version of Quest, but it was released first. Waterloo was implementing a multi-million dollar suite of software for managing student records, built on Oracle databases, PeopleSoft software, and tons of custom extensions to web interfaces and reporting. Unfortunately, the project was months behind, and the Quest system was nowhere near being deployed. (Vendor problems and the like.) That's when I became involved - in two months of long days, I used Cognos tools (several of them, involving 5 separate scripting and markup languages) to build the WINQ system, which gave the faculty a way to query the Oracle database through a secure web frontend and get all of the information they needed. It was supposed to be in use for about 4-6 months, until Quest took over... but I heard it was used for more than two years. (There are many good stories there, but I'll save them for another day.)

Back to ChIP-Seq's QuEST, this application was the subject of a recently published article. In a parallel timeline to the Waterloo story, QuEST was probably started before I got involved in ChIP-Seq, and was definitely released after I released my software - but this time I don't think it will replace my software.

The paper in question (Valouev et al., Nature Methods, Advance Online Publication) is called "Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data." I suspect it was published with the intent of being the first article on ChIP-Seq software, which, unfortunately, it wasn't. What's most strange to me is that it seems to be simply a reiteration of the methods used by Johnson et al. in their earlier ChIP-Seq paper. I don't see anything novel in this paper, though maybe someone else has spotted something I've missed.

The one thing that surprises me about this paper, however, is their use of a "kernel density bandwidth", which appears to be a sliding window of pre-set length. This flies in the face of the major advantage of ChIP-Seq, which is the ability to get very strong signals at high resolution. By forcing a "window" over their data, they are likely losing a lot of the resolution they could have found by investigating the reads directly. (Admittedly, with a window of 21bp, as used in the article, they're not losing much, so it's not a very heavy criticism.) I suppose it could be used to provide a quick way of doing subpeaks (finding individual peaks in areas of contiguous read coverage) at a cost of losing some resolving power, but I don't see that discussed as an advantage.

The second thing they've done is provide a directional component to peak finding. Admittedly, I tried to do the same thing, but found it didn't really add much value. Both the QuEST publication and my application note on FindPeaks 3.1 mention the ability to do this - and then fail to show any data that demonstrates the value of using this mechanism versus identifying peak maxima. (In my case, I wasn't expected to provide data in the application note.)

Anyhow, that was the down side. There are two very good aspects to this paper. The first is that they do use controls. Even now, the Genome Sciences Centre is struggling with ChIP-Seq controls, while it seems everyone else is using them to great effect. I really enjoyed this aspect of it. In fact, I was rather curious how they'd done it, so I took a look through the source code of the application. I found the code somewhat difficult to wade through, as the coding style was very different from my own, but well organized. Unfortunately, I couldn't find any code for dealing with controls, which leads me to think this is an unreleased feature, and was handled by post-processing the results of their application. Too bad.

The second thing I really appreciated was the motif finding work, which isn't strictly ChIP-Seq, but is one of the uses to which the data can be applied. Unfortunately, this is also not new, as I'm aware of many earlier experiments (published and unpublished) that did this as well, but it does make a nice little story. There's good science behind this paper - and the data collected on the chosen transcription factors will undoubtedly be exploited by other researchers in the future.

So, here's my summary of this paper: as a presentation of a new algorithm, they failed to produce anything novel, and no experiments were provided to demonstrate the value of their algorithms versus any others. On the other hand, as a paper on growth-associated binding protein and serum response factor (GABP and SRF, respectively), it presents a nice, compact story.

Labels: ,

Thursday, September 4, 2008

Data Storage Problems?

Here I was, thinking that we next-gen genomics researchers (genomicists?) had a tough time with data management, simply because of the phenomenal number of gigabytes and terabytes of data we're generating. Then I read this article, titled How the Large Hadron Collider Might Change the Web [Scientific American].

When it's up and running, they'll be generating 15 petabytes (15 million gigabytes) per year, compared to the several Terabytes per year we work with. Ouch.

Anyhow, as a model of distributed data analysis, it's definitely of interest. Maybe we'll learn something from the physicists. Or maybe they can lend me a few Terabytes of space for my projects...

Wednesday, September 3, 2008

Back to work...

I'm back from my vacation - and just about ready to get back to work. Almost. Before I do, I figured I should share my pictures, and tie up a few loose ends.

Pictures first! Here's the full album.

And here's one of my favorites:

Ok, now that I'm past showing off pictures, on to more sciency stuff:

The commenting system seems to have worked very well in my absence, so I'll leave it as is - I won't be moderating posts as they are added to the blog, though I reserve the right to delete rude or otherwise inappropriate posts - it hasn't been a problem yet, though, and I hope it doesn't become one.

I had planned to write more on indels, but I think everyone's on the same page with it now. Longer reads will give more information, and we're better off waiting a few months for those, rather than trying to work our way through it now.

I've been swamped with emails about FindPeaks while I was away, so anyone who sent me an email that I haven't yet replied to - not to worry, I'm almost caught up. A few more hours, and I'll be back to working instead of emailing. I've also decided that FindPeaks needs to be open sourced ASAP - there are too many things that could be fixed by people who are using the program, rather than waiting for me while I play customer support. It's wasting my time, which could be better spent on developing new features.

FindPeaks has also had its first commit from someone other than me. Yay! It happens to be another developer at the GSC, but it's a good first step.

And yes, I think I finally have a date for my committee meeting, and I'm also narrowing down a date for my comprehensive exam. I suspect that means a lot of my posts in the future will be literature reviews, roundups and discussions of a wide range of genomic topics, starting in October.

Ok, that about does it for now - Only about 45 more emails to process, and I'll be back on top of things.