Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: - Please come visit my blog there.

Thursday, April 16, 2009

Multi-match reads in ChIP-Seq

I had an interesting comment left on my blog today, which is worth taking a few minutes to write a response to:
"Hi Anthony, I just discovered your blog and it looks very interesting to me!
Since this article on Eland is now more than one year old, I was wondering
if the description at point 3 about multi matching locations is still
applicable to the Eland program in the Illumina pipeline 1.3. More in general,
would you trust the multi matching locations extracted from the multi_eland
output files to perform a repeat enrichment analysis over an experiment of
ChIP-seq? If no, why? Thank you in advance for your attention."

The first question asks about multi-matching locations - and if the point in question (point 3) applies to the Illumina Pipeline 1.3. Since point 3 was just that the older pipeline didn't provide the locations of the multi-matche reads, I suppose this no longer really applies: I understand the new version of Eland does provide multi-match alignment information, as do other aligners such as Bowtie. However, I should also mention that since I adopted Maq as my preferred aligner, I haven't used Eland much - so it's hard for me to give an informed opinion on the quality of the matches. I simply don't know if they're any good, and I won't belabour that point. I have used Bowtie specifically because it was able to do mutli-matches, but we didn't use it for ChIP-Seq, and the multi-matches had other uses in that experiment.

So, the more interesting question is whether I'd use multi-match reads in a ChIP-Seq analysis. And, off hand, my answer has to be no. But let me explain my reasoning, and the conditions in which I would change that answer.

First, lets assume that we have Single End Tags, so the multi-match information is not resolvable. That means anytime we have a read that maps to more than one location, we have the possibility that we can either map it to it's source - or we're mapping it incorrectly. A 50% change of "getting it right." The greater the number of multi-match locations, the smaller the chance we're actually finding it's correct origin. So, at best we've got a 50-50 chance that we're not adversely affecting the outcome of the experiment. That's not great.

In contrast, there are things we could do to make them usable. The most widely used method from FindPeaks is the weighted fragment distribution type. Thus, we could expand the principle to weight the fragments according to the number of sites. That would be... bearable. But would it significantly add to the quality of the alignment?

I'm still going to say no. Fragments we see in ChIP-Seq experiments tend to fall within 200-300bp of the regions in which the transcription factor (or other sites) bind. Thus, even if we were concerned that a particular transcription factor binds primarily to the similar motif regions at two sites, there should be more than enough (unique) sequence around that site (which is usually <30-40bp in length) to which you'll still see fragments aligning. That should compensate for the loss of the multi-match fragments.

Even more importantly, as read lengths increase, the amount of non-unique sequence decreases rapidly, making the shrinking number of multi-match reads less important.

The same argument can be extended for paired end tags: Just as read lengths improve and reduce the number of multi-match sites, more of the multi-match reads will be resolved by pairing them with a second read, which is unlikely to be within the same repeat region, thus reducing the number of reads that become unresolvable multi-matches. Proportionally, one would then expect that leaving out these reads become a smaller and smaller segment of the population, and would have to worry less and less about their contribution.

So, then, when would I want them?

Well, on the odd chance you're working with very short reads, you can pull off the weighting properly, and you have single end tags - and the multi-match reads make up a significant proportion of the reads, then it's worth exploring.

You'd need to start asking the tough questions: did the aligner simply find that a small k-mer of the read aligned to multiple locations (and was then unable to resolve the tie by extension the way some Eland aligners work)? Does the aligner use quality scores to identify mis-alignments? How reliable are the alignments (what's their error rate)? What was your sample, and how divergent is it from reference ? (e.g., cancer samples have a high variation rate, and so encourage many false alignments, making the alignments less reliable.)

Overall, I really don't see too many cases where you're going to gain a lot by digging in the multi-match files. That's not too say that you won't find anything good in there - you probably would, if you knew where to look, but the noise to signal ratio is going to be pretty poor - just by definition of the fact that they're mutli-match reads alone. You'll just have to ask if it's worth your time.

For the moment, I don't think my time (even at grad student wages) is worth it. It's just not low hanging fruit, when it comes to ChIP-Seq.

Labels: , , , , , , ,

Tuesday, September 16, 2008

Processing Paired End Reads

Ok, I'm taking a break from journals. I didn't like the overly negative tone I was taking in those reviews, so I'm rethinking how I write about articles. Admittedly, I think criticism is in order for papers , but I can definitely focus on papers that are worth reviewing. Unfortunately, I'm rarely a fan of papers discussing bioinformatics applications, as I always feel like there's a hidden agenda behind them. Whether it's simply proving their application is the best, or just getting it out first, computer application papers are rarely impartial.

Anyhow, I have three issues to cover today:
  • FindPeaks is now on sourceforge
  • Put the "number of reads under a peak" to rest. permanently, I hope.
  • Bed files for different data sources.

The first one is pretty obvious. FindPeaks is now available under the GPL on sourceforge, and I hope people will participate in using and improving the software. We're aiming for our first tagged release on friday, with frequent tags thereafter. Since I'm no longer the only developer on this project, it should continue moving forward quickly, even while I'm busy studying for my comps.

The second point is this silly notion that keeps coming up. "How many reads were found under each peak?" I'm quite sick of that question, because it really makes no sense. Unfortunately, this was a metric produced in Robertson et al.'s STAT1 data set, and I think other people have included it or copied it. Unfortunately it's rubbish.

The reason it worked in STAT1 was because they used a fixed length (or XSET) value on their data set. This allowed them to determine the exact length of each read, which allowed them to figure out how many reads were "contiguously linked in each peak." Readers who are paying attention will also realize what the second problem is... They didn't use subpeaks either. Once you start identifying subpeaks, you can no longer assign to which peak a read spanning peaks belongs. Beyond that, what do you do with reads in a long tail? Are they part of the peak or not?

Anyhow, the best measure for a peak, at the moment at least, is the height of the peak. This can also include weighted reads, so that reads which are unlikely to contribute to a peak actually contribute less, bringing in a scaled value. After all, unless you have paired end tags, you really don't know how long the original DNA fragment was, which means you don't know where it ended.

That also makes a nice segue to my last point. There are several ways of processing paired end tags. When it comes to peak calling it's pretty simple: you use the default method - you know where the two ends are, and they span the full read. For other applications, however, there are complexities.

If the data source is a transcriptome, your read didn't cover the intron, so you need to process the transcript to include breaks, when mapping it back to the genome. That's really a pain, but it is clearly the best way to visualize transcriptome PETs.

If the data source is unclear, or you don't know where the introns are (which is a distinct possibility), then you have to be more conservative, and not assume the extension of each tag. Thus, you end up with a "tag and bridge" format. I've included an illustration to make it clear.

So why bring it up? Because I've had several requests for the tag-and-bridge format, and my code only works on the default mode. Time to make a few adjustments.

Labels: , , ,

Monday, December 3, 2007


Planning and reality often don't mix in grad school, I'm discovering. Since I have a committee meeting tomorrow, I thought I'd use today to finish my presentation and written documentation for it. Instead, I was told I had to have code running by 1pm that would process transcriptome data using paired end solexa reads. 5 hours later, I had code analyzing Single end transcriptome data, but nothing finished for my committee meeting.

On that note, I'm not going to blog about paired end data - I'll save that for another day, and instead, I'm going to go work on my presentation and report. However, I did want to provide a neat link Elaine passed to me: made with molecules. Jewelry for the rich and geeky. (I'm geeky enough to like it, but not rich enough to pay THAT much for it, even if I wore ornaments.)

Oh, and one last parting note... my laptop is being assembled today. If I wasn't going to be so busy tomorrow, I'd probably be checking it's status every 10 minutes. The curse and blessing of real-time tracking data.

Labels: , , ,