Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: - Please come visit my blog there.

Friday, May 29, 2009

Science Cartoons - 3

I wasn't going to do more than one comic a day, but since I just published it in the FindPeaks 4.0 manual today, I may as well put it here too and kill two birds with one stone.

Just to clarify, under copyright laws, you can certainly re-use my images for teaching purposes or your own private use (that's generally called "fair use" in the US, and copyright laws in most countries have similar exceptions), but you can't publish them, take credit for them, or profit from them without discussing it with me first. However, since people browse through my page all the time, I figure I should mention that I do hold copyright on the pictures, so don't steal them, ok?

Anyhow, Comic #3 is a brief description of how the compare in FindPeaks 4.0 works. Enjoy!


Science Cartoons - 2

Comic #2/5. This one, obviously, is about aligners. I've added the copyright on the far right this time. If I expect people to respect my copyright, I really do need to put it on there, don't I?


Thursday, May 28, 2009

Science Cartoons - 1

I've been busy putting together a poster for a student conference I'm attending next week in Winnipeg, which has distracted me from just about everything else I had planned to accomplish. Not to worry, though, I'll be back on it in a day or two.

Anyhow, part of the fun this time around is that the poster competition includes criteria for judging that involve how pretty the poster is - so I decided to go all out and learn to draw with inkscape. Instead of the traditional background section on a poster, I went with a comic theme. It's a great way to use text and figures to get across complex topics very quickly. So, for the next couple days, I thought I'd post some of them, one at a time.

Here's the first, called "Second Generation Sequencing". It's my least favorite of the bunch, since the images feel somewhat sloppy, but it's not a bad try for a first pass. Yes, I do own the copyrights - and you do need to ask permission to re-use them.


Monday, May 25, 2009

Can't we use ChIP-chip controls on *-Seq?

Thanks to Nicholas, who left this comment on my web page this morning, in reference to my post on controls in Second-Gen Sequencing:
Hi Anthony,

Don't you think that controls used for microarray (expression and ChIP-chip) are well established and that we could use these controls with NGS?

I think this is a valid question, and one that should be addressed. My committee asked me the same thing during my comprehensive exam, so I've had a chance to think about it. Unfortunately, I'm not a statistics expert, or a ChIP-chip expert, so I would really value other people's opinion on the matter.

Anyhow, I think the answer has to be put in perspective: Yes, we can learn from ChIP-chip and Arrays for the statistics that are being used, but no, they're not directly applicable.

Both ChIP-chip and array experiments are based on hybridization to a probe, which makes them cheap and reasonably reliable. Unfortunately, it also gives them a much lower dynamic range: they saturate at the high end, and can be undetectable at the low end of the spectrum. This alone is a key difference. What signal would be detected from a single hybridization event on a microarray?
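To make that concrete, here's a toy model (entirely hypothetical numbers, not anyone's real calibration curve): hybridization signal is often described by a saturating binding isotherm, while read counts scale roughly linearly with abundance, so the same fold change that is obvious at the low end nearly vanishes at the high end of an array.

```python
def array_signal(conc, smax=10000.0, k=50.0):
    """Toy saturating hybridization response (Langmuir-style isotherm)."""
    return smax * conc / (k + conc)

def seq_signal(conc, reads_per_unit=10.0):
    """Toy sequencing response: counts scale linearly with abundance."""
    return reads_per_unit * conc

# The same 2-fold change, at low abundance vs. at high abundance:
low_ratio = array_signal(2.0) / array_signal(1.0)
high_ratio = array_signal(2000.0) / array_signal(1000.0)
print(f"array, low end:  {low_ratio:.2f}x apparent change")
print(f"array, high end: {high_ratio:.2f}x apparent change")
print(f"sequencing:      {seq_signal(2000.0) / seq_signal(1000.0):.2f}x everywhere")
```

The saturation constant and scale factors here are invented for illustration; the point is only the shape of the two response curves.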

Additionally, the resolution of a ChIP-chip probe is vastly different from that of a sequencing reaction. In ChIP-Seq or RNA-Seq, we can get unique signals for sequences with start locations only one base apart, which should then be interpreted differently. With ChIP-chip, the resolution is closer to 400bp windows, and the statistics take that into account.

Another reason why I think the statistics are vastly different is because of the way we handle the data itself, when setting up an experiment. With arrays, you repeat the same experiment several times, and then use that data as several repeats of the same experiment, in order to quantify the variability (deviation and error) between the repeats. With second-generation sequencing, we pool the results from several different lanes, meaning we always have N=1 in our statistical analysis.
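A concrete way to see the replicate problem (hypothetical numbers): with three array replicates you can estimate a per-probe standard deviation, but once several sequencing lanes are pooled into one library, you have a single observation and no within-experiment variance estimate at all.

```python
import statistics

# Three array replicates for one probe (hypothetical intensities):
array_replicates = [1520.0, 1498.0, 1545.0]
mean = statistics.mean(array_replicates)
sd = statistics.stdev(array_replicates)  # sample stdev; needs n >= 2
print(f"array probe: mean={mean:.1f}, sd={sd:.1f}, n={len(array_replicates)}")

# Several sequencing lanes pooled into one library before analysis:
lane_counts = [412, 388, 405]
pooled = sum(lane_counts)  # effectively N=1 after pooling
print(f"pooled peak count: {pooled} (no replicate variance available)")

# statistics.stdev refuses to work on a single observation:
try:
    statistics.stdev([pooled])
except statistics.StatisticsError as e:
    print(f"can't estimate variance: {e}")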

So, yes, I think we can learn from other methods of statistical analysis, but we can't blindly apply the statistics from ChIP-chip and assume they'll correctly interpret our results. The more statistics I learn, the more I realize how many assumptions go into each method - and how much more work it is to get the statistics right for each type of experiment.

At any rate, these are the three most compelling reasons I have, but they certainly aren't the only ones. If anyone would like to add more reasons, or tell me why I'm wrong, please feel free to add a comment!


Thursday, May 21, 2009


Apparently there's a rumour going around that I think FindPeaks isn't a good software program, and that I've said so on my blog. I don't know who started this, or where it came from, but I would like to set the record straight.

I think FindPeaks is one of the best ChIP-Seq packages currently available. Yes, I would like to have better documentation, yes, there are occasionally bugs in it, and yes, the fact that I'm the only full-time developer on the project is a detriment.... But those are only complaints, and they don't diminish the fact that I think FindPeaks can out-peak-call, out-control, out-compare.... and will outlast, outwit and outplay all of its opponents. (=

FindPeaks is actually a VERY well written piece of code, and has the flexibility to do SO much more than just ChIP-Seq. If I had my way, I'd have another developer or two working with the code to expand many of the really cool features that it already has.

If people are considering using it, I'm always happy to help get them started, and I'm thrilled to add more people to the project if they would like to contribute.

So, just to make things clear - I think FindPeaks (and the rest of the Vancouver Short Read Analysis Package) is a great piece of software, and truly worth using. If you don't believe me, send me an email, and I'll help make you a believer too. (=


Monday, May 18, 2009

Questions about controls.

After my post on controls the other day, I've discovered I'd left a few details out of my blog entry. It was especially easy to come to that conclusion, since it was pointed out by Aaron in a comment below.... Anyhow, since his questions were right on, I figured they deserved a post of their own, not just a quick reply.

There were two parts to the questions, so I'll break up the reply into two parts. First:

One question - what exactly are you referring to as 'control' data? For chip-seq, sequencing the "input" ie non-enriched chromatin of your IPs makes sense, but does doubling the amount of sequencing you perform (therefore cost & analysis time) give enough of a return? eg in my first analysis we are using a normal and cancer cell line, and looking for differences between the two.
Control data, in the context I was using it, should be experiment specific. Non-enriched chromatin is probably going to be the most common for experiments with transcription factors, or possibly even histone modifications. Since you're most interested in the enrichment, a non-enriched control makes sense. At the GSC, we've had many conversations about what the appropriate controls are for any given experiment, and our answers have ranged from genomic DNA or non-enriched-DNA all the way to stimulated/unstimulated sample pairs. The underlying premise is that the control is always the one that makes sense for the given experiment.

On that note, I suspect it's probably worth rehashing the Robertson 2007 paper on stimulated vs. unstimulated IFN-gamma STAT1 transcription markers... hrm..

Anyhow, the second half of that question boils down to "is it really worth it to spend the money on controls?"

YES! Unequivocally YES!

Without a control, you'll never be able to get a good handle on the actual statistics behind your experiment, and you'll never be able to remove bias in your samples (from the sequencing itself - and yes, it is very much present). In the tests I've been running on the latest FindPeaks code, I've been seeing what I think are incredibly promising results. If it cost 5 times as much to do controls, I'd still be pushing for them. (Of course, it's actually not my money, so take that for what it's worth.)

In any case, there's a good light at the end of the cost tunnel on this. Although it does appear that the better the control, the better the results will be, we've also been able to "re-use" controls. In fact, having one good set of controls per species already seems to make a big contribution. Thus, controls should be amortizable over the life of the sequencing data, so the cost should not be anywhere near double for large sequencing centres. (I don't have hard data to back this up yet, but it is generally the experience I've had so far.)

To get to the second half of the question:

Also, what controls can you use for WTSS? Or do you just mean comparing two groups eg cancer vs normal?
That question is similar to the one above, and so is my answer: it depends on the question you're asking. If you want to compare against another sample (eg, using FP's compare function), then that other sample will be your "control". I've been working with this pipeline for a few weeks now and have been incredibly impressed with how well it works over my own (VERY old 32bp) data sets.

On the other hand, if you're actually interested in discovering new exons and new genes using this approach, you'd want to use a whole genome shotgun as your control.

As with all science, there's no single right answer for which protocol you need to use until you've decided what question you want to ask.

And finally, there's one last important part of this equation - have you sequenced either your control or your sample to enough depth? Even if you have controls working well, it's absolutely key to make sure your data can answer your question well. So many questions, so little time!


Saturday, May 16, 2009

Bioinformatics in the lab

After yesterday's talk by Dr. Bowdish (I just feel weird calling professors by their first names when referring to their talks), I walked away with several different trains of thought, one of which was the easy integration of bioinformatics into the research program she'd undertaken. The interesting thing isn't so much that it was there, but the absolutely relaxed attitude with which it was presented.

When I first started talking to professors about the interface between computers and biology or biochemistry, the field had barely even been given a name - and most of the professors were utterly confused about what computers could do to enhance their research programs. (Yes, I was an undergrad in the mid-90's.) I remember several profs saying they couldn't think of a reason to have computers in their labs at all. (Of course, at the time, there probably wasn't much use for computers in the lab anyhow.)

There was one prof who was working on the edge of the two subjects: Dr. Patricia Schulte. Although she was working in the field of fish biology, somehow she was able to see the value and encourage her students to explore the interface of bioinformatics and the lab - and she was the first person to introduce me to the term Bioinformatics (among many other topics: HMMs, Neural Nets, etc...)

Anyhow, at that point, I was hooked on bioinformatics, but finding the opportunity to do hands on work was nearly impossible. The biology professors didn't know what it could do for them - and clearly didn't have the vocabulary with which to express their interests in computational information. It was awkward, at times. One prof couldn't figure out why I wanted to use word processors for biology.

To my great amazement, things have dramatically changed in the (nearly) decade and a half since I started my first undergrad, and yesterday's talk was really a nice opportunity to contemplate that change. Dr. Bowdish's talk included a significant amount of biology, genomics and bioinformatics predictions. When the predictions didn't turn out (eg. the putative myristoylation site wasn't actually important), there was no accompanying comment about how unreliable bioinformatics is (which I used to see ALL the time in the early days of the field), and there was no hesitation to jump in to the next round of bioinformatics predictions (structure predictions for the enzyme).

I think even this quiet incorporation of bioinformatics into a young lab is incredibly encouraging. Perhaps it's Dr. Bowdish's past, having done her PhD in Dr. Hancock's lab, who himself was an early adopter of bioinformatics predictions, or possibly it's just that researchers who have grown up with computers for most of their lives are finally getting into the ranks of academia. Either way, I'm impressed and encouraged. Bioinformatics' golden age may not be here yet, but I think the idea that it will never become mainstream has finally started to fade from the halls of the ivory tower.


Friday, May 15, 2009

UBC Seminar - Dr. Dawn Bowdish, McMaster University

[This talk was given by a good friend, Dr. Dawn Bowdish - and is WAY outside of the topics that I normally cover. However, there is some interesting work with SNPs which is worth mentioning, if you can get past the non-genomic part at the start - which I suggest. As always, mistakes are my misunderstanding of the topic - not the speakers!]

Talk title: The class A scavenger receptors are associated with host defense towards Mycobacterium tuberculosis.

Lab URL:

Her post-doc was done at Oxford, where most of the work that will be presented today was done.
1. The role of the scavenger receptors in Mycobacterium tuberculosis infection
2. Polymorphisms in scavenger receptors and susceptibility to M. tuberculosis infection
3. The role of the cytoplasmic tail in scavenger receptor signalling
4. Evolution of scavenger receptor domains.

Macrophages are beautiful cells. They don't have a single form – you know it when you see it. Paraphrased: 'phooi to t-cells.'

[at this point, the projector died. Dawn offers to tell macrophage stories... Someone wants to know all about Oxford. “It was very Harry Potter.” AV people mess around in the back....]

Macrophages are central to everything she studies. They are an integral part of mammalian biology:
  • Embryonic development, organ structure
  • chronic disease
  • genetic diseases
  • infectious disease
  • autoimmunity
  • cancer
Macrophages receptors are indicators of phenotype, function and biomarkers for disease phenotype

Scavenger receptors: several classes of them exist. The only conserved feature is that they bind modified lipids (acLDL) with varying efficiency.

Class A scavengers: includes the two that Dawn studies specifically: MARCO and SRA (I and II). Found in organisms from plants and yeast to humans. They are involved in cell-cell interactions, and have been adapted to many other cell interactions.

MARCO (macrophage receptor with collagenous structure) and SRA (scavenger receptor class A) have similar ligand ranges, which are very broad. "Molecular fly paper." In general, their expression is restricted to macrophages.

They only bind some bacteria well, but not all.

SRA plays a role in homeostasis, infectious disease and septic shock.
MARCO plays a role in infectious disease. (Redundancy in vitro - requires a double knockout.)

The binding domains, however, are very different. In MARCO, binding is at the end of the receptor; in SRA, it's in the second-last domain.

MARCO is not expressed in any cell line, and is not in bone marrow macrophages. Thus, it's often overlooked.

Three types of activation: classical, alternative, and innate (hypothesized). MARCO seems to reflect innate activation, and the properties/phenotype are not well understood. Possibly a phenotype for immuno-modulation, but when it's used is not known. It fills a niche which doesn't quite fit with the known models in mouse.

So, how does it work in TB? (Not something Dr. Bowdish intended to study, but she ended up falling into it at Oxford.)

There are many types of uptake – and many new ones have been discovered. There's still room for more receptors, however, and it's possible that the scavenger receptors might be involved in TB.

SRA is up-regulated in response to IFN-gamma and BCG, and knockouts are susceptible to BCG-induced shock. But MARCO? No clear connection. There is still no human anti-MARCO antibody, so these experiments can't be repeated in human cells.

A collaboration with Dr. Russell and Dr. Sakamoto from Cornell led to her getting involved: they had a ligand (trehalose dimycolate, TDM) that no one had ever found a receptor for - and that turned out to be MARCO. Using TDM-coated beads, you could see if it was picked up.

Using a cell line with the MARCO receptor and the beads: cells with MARCO picked up the beads; those with SRA did not. The uptake could be knocked down with a specific inhibitor of MARCO (shown with fluorescence microscopy).

Previous work had shown that TDM induces cytokine production in a MyD88-dependent fashion. There was a TLR2 and TLR4 response - so knockouts were done, showing that it could use either of them.

The minimum signalling complex required is MARCO + TLR (2 or 4). This recreates the pro-inflammatory response. It could never be recreated with SRA.

Is MARCO the missing factor in TDM signalling? Yes. So it's not that cells have lost the pathway or the ability - they're just lacking the particular co-receptor to interact with TDM.

How MARCO works in the cytoplasm, however, is another story - it has a very small cytoplasmic tail... which includes a predicted myristoylation site. Constructs were made with different parts of the tail, which didn't change the signalling much. The model proposed, however, is that MARCO is a tethering receptor, which binds or transports the TDM beads to TLRs via CD14. (Similar to the LPS signalling complex.) This was tested with an NF-kB reporter system.

More experiments were done using the knockouts without MARCO, or double knockouts, and it was found that MARCO appears to be involved in the response to M. tuberculosis.

Up till now, this was in vitro and mouse. A switch was made to human models.

She started looking for groups studying SNPs in humans, and joined a study (Adrian Hill's?) interested in whether these SNPs are related to human disease.

It works well because TB has been around for a long time – 40,000 years.

The Hill group has samples from The Gambia to study TB. They screened 3,500 individuals (HIV-free), controlled for the usual factors (age, sex, etc.), and then screened 25 SNPs in MARCO and 22 in MSR1.

[Presents a fancy map, showing coverage.]

Much to everyone's surprise, there were no SNPs whatsoever in SRA - but four were found in MARCO with associations to susceptibility and resistance. However, they were all in introns, and one was discovered to be in a putative splice site. (There were no splice variants known in mice at the time - and there are still none known.) Using assays, Dr. Bowdish found there were indeed splice variants, caused by the SNP.

Oddly enough, this splice variant seems to knock out the binding domain of MARCO. (And the SNP seems to be predominant in African populations - it is very uncommon in Caucasians.)

Tentative model: TDM induces MARCO expression. MARCO is regulated at the transcriptional and post-translational levels. Thus, splice variants may cause differences in response to the TB bacterium.

Goals for the future:
  • Understand role of macrophage receptors in infectious disease
  • Attribute functional significance of genetic variability in macrophage genes
  • Characterize phenotype of innate activation & determine if this can be manipulated by immunomodulation
  • Collaborating with people studying other receptors.
Open day on October 26th, 2009 : Institute of infectious disease research opening day.


On the necessity of controls

I guess I've had this rant building up for a while, and it's finally time to write it up.

One of the fundamental pillars of science is the ability to isolate a specific action or event and determine its effects on a particular closed system. The scientific method actually demands that we do it: hypothesize, isolate, test and report in an unbiased manner.

Unfortunately, for some reason, the field of genomics has kind of dropped that idea entirely. At the GSC, we just didn't bother with controls for ChIP-Seq for a long time. I can't say I've seen many matched WTSS (RNA-Seq) experiments for cancer/normal pairs, either. And that scares me, to some extent.

With all the statistics work I've put into the latest version of FindPeaks, I'm finally getting a good grasp of the importance of using controls well. With the other software I've seen, a scaled comparison is done to calculate a P-value. That is really only half of the story. It also comes down to normalization, to comparing peaks that are present in both sets... and to determining which peaks are truly valid. Without that, you may as well not be using a control.
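For readers who haven't seen what a scaled comparison looks like, here's a minimal sketch of the generic version (a plain Poisson comparison with hypothetical peak counts - this is NOT FindPeaks's actual statistics, which do considerably more): scale the control to the sample's sequencing depth, then ask how surprising each peak's count is given the scaled control.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by summing the lower tail."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

sample_total, control_total = 10_000_000, 5_000_000
scale = sample_total / control_total  # normalize control depth to sample depth

# (peak_id, reads_in_sample, reads_in_control) -- hypothetical peaks
peaks = [("peak_a", 120, 20), ("peak_b", 45, 21)]
for name, obs, ctrl in peaks:
    expected = ctrl * scale
    p = poisson_sf(obs, expected)
    print(f"{name}: observed={obs}, expected={expected:.0f}, p={p:.3g}")
```

Even in this toy, the depth normalization matters as much as the test itself: skip the scaling and peak_a looks six-fold enriched instead of three-fold.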

Anyhow, that's what prompted me to write this. As I look over the results from the new FindPeaks, both for ChIP-Seq and WTSS, I'm amazed at how much clearer my answers are, and how much better they validate compared to the non-control-based runs. Of course, the tests are still not all in - but what a huge difference it makes. Real control handling (not just normalization, or whatever everyone else is doing) vs. Monte Carlo shows results that aren't in the same league. The cutoffs are different, the false peak estimates are different, and the filtering is incredibly more accurate.

So, this week, as I look for insight in old transcription factor runs and old WTSS runs, I keep having to curse the lack of controls that exist for my own data sets. I've been pushing for a decent control for my WTSS lanes - and there is matched normal for one cell line - but it's still two months away from having the reads land on my desk... and I'm getting impatient.

Now that I'm able to find all of the interesting differences with statistical significance between two samples, I want to get on with it and find them, but it's so much more of a challenge without an appropriate control. Besides, who'd believe it when I write it up with all of the results relative to each other?

Anyhow, just to wrap this up, I'm going to make a suggestion: if you're still doing experiments without a control, and you want to get them published, it's going to get a LOT harder in the near future. After all, the scientific method has been pretty well accepted for a few hundred years, and genomics (despite some protests to the contrary) should never have felt exempt from it.


Tuesday, May 12, 2009

Quality vs Quantity

Today was an interesting day, for many reasons. The first was the afternoon tours for high-school students that came by the Genome Sciences Centre and the labs. I've been taking part in an outreach program for some of the students at two local high schools, which has involved visiting the students to teach them a bit of the biology and computers we do, as well as the tours that bring them by to see us "at work." Honestly, it's a lot of fun, and I really enjoy interacting with the kids. Their questions are always amusing and insightful - and are often a lot of fun to answer well. (How do you explain how the academic system works in 2 minutes or less?)

For my part, I introduced the kids to Pacific Biosciences' SMRT technology. I came up with a relatively slick monologue that goes well with a video from PacBio. (If you haven't seen their video, you should definitely check it out.) The kids seem genuinely impressed with the concept, and really enjoy the graphics - although they enjoy the desktop effects in Ubuntu too... so maybe that's not the best criterion to use for evaluation.

Anyhow, aside from that distraction, I've also had the pleasure of working on some of my older code today. After months of people at the GSC ignoring the fact that I'd already written code to solve many of the problems they were trying to develop software for, a few people have decided to pick up some pieces of the Vancouver Short Read Package and give them a test spin.

One of them was looking at FindFeatures - which I've used recently to find exons of interest in WTSS libraries - and the other at the PSNPAnalysiPipeline code - which does some neat metrics for WTSS.

The fun part is that the code for both of those applications was written months ago - in some cases before I had the data to test it on. Revisiting them and now actually putting the code to use, I was really surprised by the number of options I'd tossed in to account for situations that hadn't even been seriously anticipated. Someone renamed all of your fasta files? No worries, just use the -prepend option! Your junction library has a completely non-standard naming? No problem, just use the -override_mapname option! Some of your MAQ-aligned reads have indels? Well, ok, I can give you a 1-line patch to make that work too.

I suppose that really makes me wonder: if I were writing one-off scripts, which would obviously lack this kind of flexibility, I'd be able to move faster and more nimbly across the topics that interest me. (Several other grad students do that, and are well published because of it.) Is that a trade-off I'm willing to make, though?

Someone really needs to hold a forum on this topic: "Grad students: quality or quantity?" I'd love to sit through those panel discussions. As for myself, I'm still somewhat undecided on the issue. I'd love more publications, but having the code just work (which gets harder and harder as the codebase hits 30k lines) is also a nice thing. While I'm sure users of my software are happy when these options exist, I wonder what my supervisor thinks of the months I've spent building all of these tools - and not writing papers.

Ah well, I suppose when it comes time to defend, I'll find out exactly what he thinks about that issue. :/


Monday, May 11, 2009

FindPeaks 4.0

After much delay, and more statistics than I'd ever hoped to see in my entire life, I'm just about ready to tag FindPeaks 4.0.

There are some really cool things going on - better handling of control data, better compares, and even saturation analysis made it in. Overall, I'm really impressed with how well things are working on that front.

The one thing that I'm disappointed to have not been able to include was support for SAM/BAM files. No one is really pushing for it yet, here at the GSC, but it's only a matter of time. Unfortunately, the integration is not trivial, and I made the executive decision to work on other things first.

Still, I can't say I'm unhappy at all - there's room for a fresh publication, if we decide to go down that route. If not, I have at least 2 possible publications in the novel applications of the code for the biology. That can't be a bad thing.

Anyhow, I'm off to put a few final updates on the code before tagging the final 3.3.x release.


Friday, May 8, 2009

Science Nightmares

I had friends over last week, and an interesting conversation came up where we were discussing nightmares. Apparently people who have braces have the nightmare of all their teeth falling out, undergrads have the "I missed an exam" nightmare (although I did that one in real life, so the nightmares weren't that disturbing afterwards), and profs have the "I missed a talk" nightmare.

Well, if it's a sign that my career is on its way forward, I had the "missed a talk" nightmare this morning. The ironic thing is that I've never been invited to give a talk at a conference, so it's a bit premature.

Anyhow, it probably has more to do with the fact that I'm somewhat freaked out about the huge changes in FindPeaks. We learn SO much every day about the biology behind the experiments that it's really nerve-wracking to keep on top of it. The development is going well, although bug testing is always a challenge.

At any rate, we're finally getting to the point where there are very few arbitrary decisions - the data decides how to do the analysis. Quite the contrast to 3 months ago, where we thought we'd hit the end of what new things we could pull out of the data.

Anyhow, debugging calls. Back to work....


Friday, May 1, 2009

2 weeks of neglect on my blog = great thesis progress.

I wonder if my blogging output is inversely proportional to my progress on my thesis. I stopped writing two weeks ago for a little break, and ended up making big steps forward. The vast majority of my work went into FindPeaks, which included the following:

  • A complete threaded Saturation analysis for next-gen libraries.

  • A method of comparing next-gen libraries to identify peaks that are statistically significant outliers. (It's also symmetric, unlike linear-regression-based methods.)

  • A better control method

  • A whole new way of analysing WTSS data, which gives statistically valid expression differences

And, of course, many many other changes. Not everything is bug-free yet, but it's getting there. All that's left on my task list is debugging a couple of things in the compare mode, relating to peaks present in only one of the two libraries, and an upgrade to my FDR cutoff prediction methods. Once those are done, I think I'll be ready to push out FindPeaks 4.0. YAY!
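For readers who haven't run a saturation analysis before: the real thing in FindPeaks is surely more involved, but the core idea can be sketched in a few lines (a toy version with invented numbers, where a "peak" is just a distinct position hit) - subsample the library at increasing depths and watch whether the number of distinct peaks levels off.

```python
import random

random.seed(42)

# Hypothetical library: 100,000 reads over 5,000 possible peak positions,
# skewed so a few peaks soak up most of the reads (as real libraries are).
positions = [int(random.paretovariate(1.2)) % 5000 for _ in range(100_000)]

saturation = []
for fraction in (0.1, 0.25, 0.5, 1.0):
    subsample = random.sample(positions, int(len(positions) * fraction))
    saturation.append((fraction, len(set(subsample))))

for fraction, distinct in saturation:
    print(f"{fraction:>4.0%} of reads -> {distinct} distinct peak positions")
```

If the distinct-peak count is still climbing steeply at 100%, the library isn't saturated and more sequencing would still find new peaks.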

Actually, what was surprising to me was the sheer amount of work that I've done on this since January. I compiled the change list since my last "quarterly report" for a project that used FindPeaks (but doesn't support it, ironically.... why am I doing reports for them again?) and came up with 24 pages of commit messages - over 575 commits. Considering the amount of work I've done on my actual projects, away from FindPeaks, I think I've been pretty darn productive.

Yes, I'm going to take this opportunity to pat myself on the back in public for a whole 2 seconds... ok, done.

So, overall, blogging may not have been distracting me from my work - even at the height of my blogging (around AGBT), I was still getting lots done - but the past two weeks have really been a help. I'll be back to blogging all the good stuff on Monday. And I'm looking forward to doing some writing now on some of the cool things in FP4.0 that haven't made it into the manual... yet.

Anyone want some fresh ChIP-Seq results? (-;
