Thanks for visiting my blog - I have now moved to a new location at Nature Networks. URL: http://blogs.nature.com/fejes - Please come visit my blog there.

Saturday, April 18, 2009

sshfs: another useful Linux tool

Ok, I just had to share this tool - I'm sure anyone who's familiar with Linux already knows it, but just in case there are people out there who are behind me on the Linux learning curve, this one is worth knowing: sshfs

If you've never heard of it before, the full name is the SSH Filesystem. What it does is quite simple: it allows you to mount any ssh-accessible directory as a local file system. It's not any faster than ssh access would be, and apparently (from various sources on the web) it doesn't do well under heavy load, but for my purposes, it's been brilliant.

Normally, if I want to do anything with my work account, I have to ssh into the work machines, then download/scp files all over the place, leaving several copies on each of the computers I work on. This inevitably leads to versioning nightmares.

This is where sshfs steps in. Instead of copying files, I now just mount each of the ssh-accessible computers as a mount point, and operate directly over the ssh share. It's pure genius.

To use it, you need to install the sshfs package, which is trivially easy on Ubuntu or any other Debian-based system:
sudo apt-get install sshfs

Then you need to create the mount point (which I usually do in /media, just for ease of use), and assign yourself permissions to use that mount point. This is done with your own username on the local machine, and doesn't change the permissions on the server:
cd /media
sudo mkdir servername
sudo chown username:username servername/

This sets up a point for you to attach the server. All that's left is then to attach the directory on the server to your local mountpoint, using a command eerily similar to scp's:
sshfs username@servername.example.com:/home/username /media/servername/

Tada. All that's left is to open your file manager of choice and open up the directory.
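When you're done with the share, detaching it should be just as painless: fusermount, the FUSE unmounting tool that sshfs relies on, will take care of it, and since the mount belongs to your own user, no sudo should be needed:
fusermount -u /media/servername/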

If only I'd discovered this tool sooner!


Thursday, April 16, 2009

Multi-match reads in ChIP-Seq

I had an interesting comment left on my blog today, which is worth taking a few minutes to write a response to:
"Hi Anthony, I just discovered your blog and it looks very interesting to me!
Since this article on Eland is now more than one year old, I was wondering
if the description at point 3 about multi matching locations is still
applicable to the Eland program in the Illumina pipeline 1.3. More in general,
would you trust the multi matching locations extracted from the multi_eland
output files to perform a repeat enrichment analysis over an experiment of
ChIP-seq? If no, why? Thank you in advance for your attention."

The first question asks about multi-matching locations - and whether the point in question (point 3) applies to the Illumina Pipeline 1.3. Since point 3 was just that the older pipeline didn't provide the locations of the multi-match reads, I suppose this no longer really applies: I understand the new version of Eland does provide multi-match alignment information, as do other aligners such as Bowtie. However, I should also mention that since I adopted Maq as my preferred aligner, I haven't used Eland much - so it's hard for me to give an informed opinion on the quality of the matches. I simply don't know if they're any good, and I won't belabour that point. I have used Bowtie specifically because it was able to do multi-matches, but we didn't use it for ChIP-Seq, and the multi-matches had other uses in that experiment.

So, the more interesting question is whether I'd use multi-match reads in a ChIP-Seq analysis. And, off hand, my answer has to be no. But let me explain my reasoning, and the conditions in which I would change that answer.

First, let's assume that we have single-end tags, so the multi-match information is not resolvable. That means any time we have a read that maps to more than one location, we can either map it to its source, or map it incorrectly. With two candidate locations, that's a 50% chance of "getting it right", and the more multi-match locations there are, the smaller the chance we're actually finding the read's correct origin (with N candidate locations, a blind guess is right only 1 time in N). So, at best, we've got a 50-50 chance that we're not adversely affecting the outcome of the experiment. That's not great.

In contrast, there are things we could do to make them usable. The most widely used method from FindPeaks is the weighted fragment distribution type, and we could expand that principle to weight each fragment according to the number of sites its read matches: a read mapping to three locations would contribute a third of a fragment to each, for example. That would be... bearable. But would it significantly add to the quality of the analysis?
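To make that weighting concrete, here's a toy sketch - written in Java purely for illustration, with made-up alignments, and emphatically not FindPeaks code - in which each read contributes 1/N of a fragment to each of its N candidate locations:

import java.util.*;

// Toy illustration of the weighting idea: each read contributes 1/N of a
// fragment to each of its N candidate start positions, and the weighted
// contributions are summed into a per-position profile.
public class WeightedMultiMatch {
    public static void main(String[] args) {
        // Hypothetical alignments: each array holds one read's candidate positions.
        List<int[]> readAlignments = Arrays.asList(
                new int[]{100},            // unique read: full weight at 100
                new int[]{100, 5000},      // two sites: half a fragment each
                new int[]{250, 5000, 9000} // three sites: a third each
        );

        Map<Integer, Double> profile = new TreeMap<>();
        for (int[] positions : readAlignments) {
            double weight = 1.0 / positions.length; // 1/N per candidate site
            for (int pos : positions) {
                profile.merge(pos, weight, Double::sum);
            }
        }

        // Positions supported by unique reads keep most of their signal.
        profile.forEach((pos, w) ->
                System.out.printf("position %d -> weight %.2f%n", pos, w));
    }
}

The bookkeeping is trivial; the open question is whether those fractional contributions add anything over the unique reads alone.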

I'm still going to say no. Fragments we see in ChIP-Seq experiments tend to fall within 200-300bp of the regions to which the transcription factor (or other protein) binds. Thus, even if we were concerned that a particular transcription factor binds primarily to similar motif regions at two sites, there should be more than enough unique sequence around each site (the binding site itself is usually <30-40bp in length) for fragments to align to. That should compensate for the loss of the multi-match fragments.

Even more importantly, as read lengths increase, the amount of non-unique sequence decreases rapidly, making the shrinking number of multi-match reads less important.

The same argument can be extended to paired-end tags: just as increasing read lengths reduce the number of multi-match sites, more of the multi-match reads will be resolved by pairing them with a second read, which is unlikely to fall within the same repeat region, thus reducing the number of reads that remain unresolvable multi-matches. Proportionally, one would then expect the reads being left out to become a smaller and smaller segment of the population, and one would have to worry less and less about their contribution.

So, then, when would I want them?

Well, on the odd chance that you're working with very short reads, that you can pull off the weighting properly, that you only have single-end tags, and that the multi-match reads make up a significant proportion of your data, then it's worth exploring.

You'd need to start asking the tough questions: did the aligner simply find that a small k-mer of the read aligned to multiple locations (and was then unable to resolve the tie by extension, the way some Eland aligners work)? Does the aligner use quality scores to identify mis-alignments? How reliable are the alignments (what's their error rate)? What was your sample, and how divergent is it from the reference? (e.g., cancer samples have a high variation rate, and so encourage many false alignments, making the alignments less reliable.)

Overall, I really don't see too many cases where you're going to gain a lot by digging into the multi-match files. That's not to say that you won't find anything good in there - you probably would, if you knew where to look - but the signal-to-noise ratio is going to be pretty poor, just by virtue of the fact that they're multi-match reads. You'll just have to ask if it's worth your time.

For the moment, I don't think my time (even at grad student wages) is worth it. It's just not low hanging fruit, when it comes to ChIP-Seq.


Friday, April 10, 2009

Nifty little trick for debugging frozen applications

This trick is just too cool not to mention. I was trying to debug an application that was getting stuck in an endless loop the other day. It was a rather complicated set of changes that was required, and I had no idea where the program was getting stuck.

In the past, I would have just ended the program with a control-c, and then started dropping in print statements until I could isolate exactly where the program was getting stuck. Instead, I stumbled upon a very nifty little trick: using the kill command to make the running program dump the state of every thread to the screen, with the command:
kill -3 [pid]

For Java code running from class files, the thread dump shows you exactly which line is being executed in each thread, allowing you to find out precisely where the problem is - making debugging go much more quickly.
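As a quick example of putting it together (assuming the stuck program is a Java process and the JDK's jps tool is on your path; plain ps works too), you can look up the process ID and then fire off the signal:
jps -l
kill -3 [pid]

One thing to note: the stack traces are written to the stdout of the Java process itself, so look in the terminal (or log file) where the program was launched.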

Anyhow, I haven't yet tested whether this works on a .jar file, or what else you can do with a quick "kill -3", but this certainly broadens my toolkit of debugging utilities, and gives me a whole new respect for the kill signals. I may have to test out a few of the other ones....


Thursday, April 9, 2009

BC Genome Forum 2009

I had a lot of stuff to blog about, but just haven't had the time to write any of it down. I haven't forgotten about any of it, but it's just not going to happen before this weekend. I'm currently bogged down in debugging something that I REALLY want to get working (and it mostly is, but still has something slightly fishy going on...), and there's just too much going on outside of work to get it done otherwise.

Still, I figured I should mention a few things of interest before I forget to discuss them at all.

I attended some of the BC Genome Forum lectures on Friday. I skipped the morning, since the talks seemed mainly irrelevant to anything I do - which was later confirmed - but caught the session on personalized medicine. For the most part, it was focused on the ethics of personalized medicine. I was considering blogging those talks, but they just didn't have enough interest factor, individually.

For the most part, the speakers were caught in a pre-2006 time warp. Everything was about micro-arrays. One of the speakers even said something to the effect of "maybe one day we'll be able to sequence the whole human genome for patients, but we're nowhere near that yet." Needless to say, my colleagues and I all exchanged startled glances.

Still, there were a few things of interest. There was a lot of discussion about which conditions found in genomic screens you should feel obligated to discuss with the DNA donor. They gave the example of one volunteer who tested positive for a condition that could be life threatening if they were to undergo surgery for any reason. It's easily treated, and can be easily managed - if you're aware of it. Since the donor was in the healthy control group, they were clearly unaware that they had the condition. In this case, where the donor was clearly at risk, the penetrance of the gene is 100%, and the patient could clearly do something about the problem, it was "a no-brainer" that the donor should be notified.

However, for most of the information people are pulling from arrays, it's not always clear whether the ethics tilt so heavily towards breaking confidentiality and reporting information to the patient. How this type of situation should be managed was touched upon by several of the speakers. The best solution we heard during the forum came from one group that had set up an advisory board, which sits down every six to twelve months to determine which - if any - conditions should be returned to the donors.

Unfortunately, no one described the criteria used to make that decision, but the concept is pretty solid.

The surprising thing for me was that after several years of using mechanisms like this, only 12-20 conditions were being returned. In the world of genomics, that's a VERY small number, but it's probably more representative of the fact that they're using arrays to do genome screens.

And that is one of the reasons why it felt like 2006 all over again. All the mechanisms they've put in place are fine when you're talking about a couple of new conditions being screened each year. Within two years we'll be routinely doing whole genome sequencing with Pacific Biosciences SMRT (or equivalent) systems, and whole genome association studies will become vastly more plentiful and powerful. Thus, when your independent board gets 1200 candidate diagnostic genes with actionable outcomes per year, that mechanism won't fly.

What's really needed (in my humble opinion) is for a national board to be created in each country to determine what gene information should be disseminated as useful and actionable - possibly as part of the FDA in the States. That would also be very useful for reining in companies like 23andMe and the like... but that's another story altogether.

Moving along, there were a few other interesting things at the event. My personal favorite was from the Smit lab in the microbiology & immunology department at UBC, presented by Dr. John Nomelini, who I know from my days in the same department. They have a pretty cool system, based on the Caulobacter bacterial system, where they can pull down antibodies (similarly to streptavidin beads) but using a much cheaper and easier system. While I don't know the legal issues around the university's licensing of the technology, Dr. Nomelini is trying to find people interested in using the technology for ChIP experiments. I've proposed the idea to a few people here to test it out on ChIP-Seq, which would help bring the cost down by a few hundred dollars. We'll see if it gets off the ground.

So, if you've made it this far, hopefully you've gleaned something useful from this rambling post. I have some coding to do before my parents arrive for the Easter weekend. Time to get back to debugging...
