Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Thursday, January 14, 2010

How to be a better Programmer: Tactics.

I'm a bit too busy for a long post, but a link was circulating around the office that I thought was worth passing on to any bioinformaticians out there.

http://dlowe-wfh.blogspot.com/2007/06/tactics-tactics-tactics.html

The article above is on how to be a better programmer - and I wholeheartedly agree with what the author proposed, with one caveat that I'll get to in a minute. The point of the article is that learning to see the big picture (not specific skills) will make you a better programmer. In fact, this is the same advice Sun Tzu gives in "The Art of War", where understanding the terrain, the enemy, etc. are the tools you need to be a better general. [This would be in contrast to learning how to wield each weapon, which would only make you a better warrior.] Frankly, it's good advice, and it leads you down the path towards good planning and clear thinking - the keys to success in most fields.

The caveat, however, is that there are times in your life when this is the wrong approach: i.e., grad school. As a grad student, your goal isn't to be great at everything you touch - it's to specialize in some small corner of one field, and tactics are no help there. If grad school existed for ninjas, the average student would walk out being the best (pick one of: poisoner/dart thrower/wall climber/etc.) in the world - and likely knowing little or nothing about how to be a real ninja beyond what they learned in their ninja undergrad. Tactics are never a bad investment, but they aren't always what is being asked of you.

Anyhow, I plan to take the advice in the article and keep studying the tactics of bioinformatics in my spare time, even though my daily work is more on the details and implementation side of it. There are a few links in the comments of the original article to sites the author believes teach good comp-sci tactics... I'll definitely be looking into those tonight. Besides, when it comes down to it, the tactics are really the fun part of the problems, although there's also something to be said for getting your code working correctly and efficiently... which I'd better get back to. (=

Happy coding!


Monday, October 5, 2009

Why peak calling is painful.

In discussing my work, I'm often asked how hard it is to write a peak calling algorithm. The answer usually surprises people: It's trivial. Peak calling itself isn't hard. However, there are plenty of pitfalls that can surprise the unwary. (I've found myself in a few holes along the way, which have been somewhat challenging to get out of.)

The pitfalls, when they do show up, can be very painful - masking the triviality of the situation.

In reality, the three most frustrating things about peak calling are:
  1. Maintaining the software

  2. Peak calling without unlimited resources (e.g., 64 GB of RAM)

  3. Keeping on the cutting edge

On the whole, each of these things is a separate software design issue worthy of a couple of seconds of discussion.

When it comes to building software, it's really easy to fire up a "one-off" script. Anyone can write something that gets tossed aside when they're done with it - but code re-use and recycling is a skill. (And an important one.) Writing your peak finder to be modular is a lot of work, and a huge investment of time is required to keep the modules in good shape as the code grows. A good example of why this matters can be illustrated with file formats. Since the first version of FindPeaks, we've transitioned through two versions of Eland output, Maq's .map format and now on to SAM and BAM (not excluding BED, GFF, and several other more or less obscure formats). In each case, we've been able to simply write a new iterator and plug it into the existing modular infrastructure. In fact, SAM support was added quite rapidly by Tim with only a few hours of investment. That wouldn't have been possible without the massive upfront investment in good modularity.
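
To make that concrete, here's a minimal sketch of the kind of modular reader layer I mean. None of these class names are the actual FindPeaks classes - it's just the shape of the idea: every format implements one small iterator interface, and the rest of the code never needs to know which format it's reading.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical minimal read record - not the real FindPeaks class.
class AlignedRead {
    final String chromosome;
    final int start;
    final int length;

    AlignedRead(String chromosome, int start, int length) {
        this.chromosome = chromosome;
        this.start = start;
        this.length = length;
    }
}

// Every file format implements this; the peak caller only ever sees the interface.
interface ReadIterator extends Iterator<AlignedRead> {
    void close() throws IOException;
}

// Supporting a new format means writing one new class like this and plugging it in.
class BedReadIterator implements ReadIterator {
    private final BufferedReader reader;
    private String nextLine;

    BedReadIterator(String path) throws IOException {
        reader = new BufferedReader(new FileReader(path));
        nextLine = reader.readLine();
    }

    public boolean hasNext() {
        return nextLine != null;
    }

    public AlignedRead next() {
        if (nextLine == null) {
            throw new NoSuchElementException();
        }
        // chrom, chromStart and chromEnd are the first three BED columns.
        String[] fields = nextLine.split("\t");
        int start = Integer.parseInt(fields[1]);
        int end = Integer.parseInt(fields[2]);
        AlignedRead read = new AlignedRead(fields[0], start, end - start);
        try {
            nextLine = reader.readLine();
        } catch (IOException e) {
            nextLine = null;
        }
        return read;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    public void close() throws IOException {
        reader.close();
    }
}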

The second pitfall is memory consumption - and this is somewhat more technical. When dealing with sequencing reads, you're faced with a choice: either you sort the reads and then move along them one at a time, determining where they land, or you pre-load all the reads and then move along the chromosome. The first model takes very little memory but requires a significant amount of pre-processing, which I'll come back to in a moment. The second requires much less CPU time - but is intensely memory-thirsty.

If you want to visualize this, the first method is to organize all of your reads by position, then walk down the length of the chromosome with a moving window, only caring about the reads that fall into the window at any given point in time. This is how FindPeaks works now. The second is to build a model of the chromosome, much like a "pileup" file, which can then be processed however you like. (This is how I do SNP calling.) In theory, it shouldn't matter which one you do, as long as all your reads can be sorted correctly. The first can usually be run with a limited amount of memory, depending on the memory structures you use, whereas the second's footprint is pretty much determined by the size of the chromosomes you're using (multiplied by a constant that also depends on the structures you use).
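
If it helps to see it in code, here's a stripped-down sketch of the windowed walk - purely illustrative, with made-up names (MovingWindowDepth, etc.), and nothing like the real FindPeaks internals:

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the windowed walk: reads are already sorted by start position, and only
// the reads overlapping the current window are ever held in memory.
class MovingWindowDepth {

    static void walk(int[] sortedReadStarts, int readLength, int chromosomeLength, int windowSize) {
        Deque<Integer> window = new ArrayDeque<Integer>(); // starts of reads overlapping the window
        int nextRead = 0;

        for (int pos = 0; pos < chromosomeLength; pos += windowSize) {
            int windowEnd = pos + windowSize;

            // Pull in every read that starts before the end of the current window.
            while (nextRead < sortedReadStarts.length && sortedReadStarts[nextRead] < windowEnd) {
                window.addLast(sortedReadStarts[nextRead]);
                nextRead++;
            }
            // Drop reads that end before the window begins - they can never matter again.
            while (!window.isEmpty() && window.peekFirst() + readLength <= pos) {
                window.removeFirst();
            }
            if (!window.isEmpty()) {
                System.out.println("window at " + pos + ": " + window.size() + " overlapping reads");
            }
        }
    }
}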

Unfortunately, using the first method isn't always as easy as you might expect. For instance, when doing alignments with transcriptomes (or indels), you often have gapped reads. An early solution to this in FindPeaks was to break each portion of the read into separate aligned reads and process them individually - which works well when everything is correctly sorted. Unfortunately, newer formats no longer allow that: with a "pre-sorted" BAM/SAM file, you can now encounter multi-part reads, but there's no real option of pre-fragmenting those reads and re-sorting. Thus, FindPeaks now has an additional layer that must read ahead and buffer SAM reads in order to make sure the next one returned is in the correct order. (You can get odd bugs otherwise, and yes, there are many other potential solutions.)
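
The buffering layer itself boils down to something like the following sketch. Again, these are hypothetical names, not the actual FindPeaks code; it reuses the AlignedRead/ReadIterator types from the earlier sketch and leaves the gapped-read splitting as a placeholder:

import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the read-ahead idea: split gapped reads into fragments, and only hand a
// fragment back once the file can no longer produce anything that starts before it.
class SortedFragmentBuffer {
    private final ReadIterator source; // underlying reader, sorted by read start
    private final PriorityQueue<AlignedRead> buffer = new PriorityQueue<AlignedRead>(64,
            new Comparator<AlignedRead>() {
                public int compare(AlignedRead a, AlignedRead b) {
                    return a.start - b.start;
                }
            });
    private int lastSourceStart = Integer.MIN_VALUE;

    SortedFragmentBuffer(ReadIterator source) {
        this.source = source;
    }

    // Placeholder: real code would parse the alignment's gap structure here.
    private List<AlignedRead> splitGappedRead(AlignedRead read) {
        return Collections.singletonList(read);
    }

    // Returns fragments in sorted order, or null once everything has been consumed.
    AlignedRead next() {
        // Read ahead until the earliest buffered fragment is guaranteed safe to emit:
        // the source is sorted, so nothing it still holds can start before lastSourceStart.
        while (source.hasNext() && (buffer.isEmpty() || buffer.peek().start > lastSourceStart)) {
            AlignedRead read = source.next();
            lastSourceStart = read.start;
            buffer.addAll(splitGappedRead(read));
        }
        return buffer.poll();
    }
}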

Moving along to the last pitfall: the one thing people want out of a peak finder is that it does the latest and greatest methods - and does them ahead of everyone else. That on its own is a near-impossible task. To keep a peak finder relevant, you not only need to implement what everyone else is doing, but also do things that they're not. For a group of 30 people, that's probably not too hard, but for academic peak callers, it can be a challenge - particularly since every user wants something subtly different from the next.

So, when people ask how hard it is to write their own peak caller, that's the answer I give: It's trivial - but a lot of hard work. It's rewarding, educational and cool, but it's a lot of work.

Ok, so is everyone ready to write their own peak caller now? (-;


Friday, October 2, 2009

Base quality by position

A colleague of mine was working on a nifty tool to graph the base quality at each position in a read using Eland export files, which could be incorporated into his pipeline. During a discussion about how long that analysis should take (his script was taking an hour, and I said it should take about a minute to analyze 8M Illumina reads...), I ended up saying I'd write my own version of the analysis, just to show how quickly it could be done.
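
For anyone curious, the core of such a tool is just a per-position tally. Here's a minimal sketch - QualityByPosition is a made-up name, not the actual QualityReport code - assuming the quality string for each read has already been parsed out of the export file and that the qualities use the ASCII+64 encoding typical of Illumina exports from that era (worth verifying against your own files):

// Tally mean base quality at each read position.
class QualityByPosition {
    static void report(Iterable<String> qualityStrings, int readLength) {
        long[] sums = new long[readLength];
        long[] counts = new long[readLength];

        for (String quals : qualityStrings) {
            for (int i = 0; i < quals.length() && i < readLength; i++) {
                sums[i] += quals.charAt(i) - 64; // convert the ASCII character to a quality score
                counts[i]++;
            }
        }
        // One line per cycle: 1-based position and mean quality, ready to paste into a spreadsheet.
        for (int i = 0; i < readLength; i++) {
            double mean = counts[i] == 0 ? 0.0 : (double) sums[i] / counts[i];
            System.out.println((i + 1) + "\t" + mean);
        }
    }
}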

Well, I was wrong about it taking about a minute. It turns out that the file has more than double the originally quoted 8 million reads (QC, no-match and multi-match reads had not been filtered out), and the whole file was bzipped, which adds to the processing time.

Fortunately, I didn't have to add bzip support to the reader, as tcezard (Tim) had already added a cool "PIPE" option for piping whatever data format I want into applications of the Vancouver Short Read Analysis Package. Thus, I was able to do the following:
time bzcat /archive/solexa1_4/analysis2/HS1406/42E6FAAXX_7/42E6FAAXX_7_2_export.txt.bz2 | java6 src/projects/maq_utilities/QualityReport -input PIPE -output /projects/afejes/temp -aligner elandext

Quite a neat use of piping, really.

Anyhow, the fun part is that the library was a 100-mer Illumina run, and it makes a pretty picture. Slapping the output into OpenOffice yields the following graph:

[Graph: mean base quality by position across the 100-mer reads]

I didn't realize quality dropped so dramatically at 100bp - although I remember when qualities looked like that for 32bp reads...

Anyhow, I'll include this tool in FindPeaks 4.0.8 in case anyone is interested in it. And for the record, this run took 10 minutes, of which about 4 were taken up by bzcat. Of the 16.7M reads in the file, only 1.5M were aligned, probably due to the poor quality beyond 60-70 bp.


Tuesday, August 18, 2009

new repository of second generation software

I finally have a good resource for locating second-gen (next-gen) sequencing analysis software. For a long time, people have just been collecting it in a single thread in the bioinformatics section of the SeqAnswers.com forum; however, the brilliant people at SeqAnswers have now spawned off a wiki for it, with an easy-to-use form. I highly recommend you check it out, and possibly even add your own package.

http://seqanswers.com/wiki/SEQanswers


Friday, May 15, 2009

On the necessity of controls

I guess I've had this rant building up for a while, and it's finally time to write it up.

One of the fundamental pillars of science is the ability to isolate a specific action or event and determine its effects on a particular closed system. The scientific method actually demands that we do it - hypothesize, isolate, test and report in an unbiased manner.

Unfortunately, for some reason, the field of genomics seems to have dropped that idea entirely. At the GSC, we just didn't bother with controls for ChIP-Seq for a long time. I can't say I've seen too many matched WTSS (RNA-Seq) experiments for cancer/normal pairs, either. And that scares me, to some extent.

With all the statistics work I've put into the latest version of FindPeaks, I'm finally getting a good grasp of the importance of using controls well. The other software I've seen does a scaled comparison to calculate a P-value. That is really only half of the story. It also comes down to normalization, to comparing peaks that are present in both sets... and to determining which peaks are truly valid. Without that, you may as well not be using a control.
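
To be clear about what I mean by a "scaled comparison": at its simplest, it's nothing more than putting the control on the same scale as the sample before comparing peak heights, something like the sketch below. This is the naive version with made-up names, not what FindPeaks actually does - it's just the starting point that the normalization and peak-matching work has to build on.

// The crudest possible control comparison: linearly scale the control to the
// sample's sequencing depth, then apply a fold-change cutoff. Illustrative only.
class NaiveControlScaling {

    // Scale a control peak height to the sample's total sequencing depth.
    static double scaledControlHeight(double controlHeight, long sampleTotalReads, long controlTotalReads) {
        return controlHeight * ((double) sampleTotalReads / controlTotalReads);
    }

    // Is the sample peak at least foldCutoff times the depth-scaled control?
    static boolean sampleEnriched(double sampleHeight, double controlHeight,
                                  long sampleTotalReads, long controlTotalReads, double foldCutoff) {
        double scaled = scaledControlHeight(controlHeight, sampleTotalReads, controlTotalReads);
        return sampleHeight >= foldCutoff * Math.max(scaled, 1.0); // guard against an empty control region
    }
}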

Anyhow, that's what prompted me to write this. As I look over the results from the new FindPeaks (3.3.3.1), both for ChIP-Seq and WTSS, I'm amazed at how much clearer my answers are, and how much better they validate compared to the runs without controls. Of course, the tests are still not all in - but what a huge difference it makes. Real control handling (not just normalization or whatever everyone else is doing) versus Monte Carlo produces results that aren't even in the same league: the cutoffs are different, the false peak estimates are different, and the filtering is incredibly more accurate.

So, this week, as I look for insight in old transcription factor runs and old WTSS runs, I keep having to curse the lack of controls for my own data sets. I've been pushing for a decent control for my WTSS lanes - and there is a matched normal for one cell line - but it's still two months before those reads land on my desk... and I'm getting impatient.

Now that I'm able to find all of the interesting differences with statistical significance between two samples, I want to get on with it and find them, but it's so much more of a challenge without an appropriate control. Besides, who'd believe it when I write it up with all of the results relative to each other?

Anyhow, just to wrap this up, I'm going to make a suggestion: if you're still doing experiments without a control, and you want to get them published, it's going to get a LOT harder in the near future. After all, the scientific method has been pretty well accepted for a few hundred years, and genomics (despite some protests to the contrary) should never have felt exempt from it.


Tuesday, May 12, 2009

Quality vs Quantity

Today was an interesting day, for many reasons. The first was the afternoon tours for high-school students that came by the Genome Sciences Centre and the labs. I've been taking part in an outreach program for some of the students at two local high schools, which has involved visiting the students to teach them a bit of the biology and computers we do, as well as the tours that bring them by to see us "at work." Honestly, it's a lot of fun, and I really enjoy interacting with the kids. Their questions are always amusing and insightful - and are often a lot of fun to answer well. (How do you explain how the academic system works in 2 minutes or less?)

For my part, I introduced the kids to Pacific Biosciences' SMRT technology. I came up with a relatively slick monologue that goes well with a video from PacBio. (If you haven't seen their video, you should definitely check it out.) The kids seemed genuinely impressed with the concept and really enjoyed the graphics - although they enjoyed the desktop effects in Ubuntu too... so maybe that's not the best criterion to use for evaluation.

Anyhow, aside from that distraction, I've also had the pleasure of working on some of my older code today. After months of people at the GSC ignoring the fact that I'd already written code to solve many of the problems they were trying to write software for, a few people have decided to pick up some pieces of the Vancouver Short Read Package and give them a test spin.

One of them was looking at FindFeatures - which I've used recently to find exons of interest in WTSS libraries - and the other at the PSNPAnalysiPipeline code - which does some neat metrics for WTSS.

The fun part is that the code for both of those applications was written months ago - in some cases before I had the data to test it on. Revisiting them now and actually putting the code to use, I was really surprised by the number of options I'd tossed in to account for situations that hadn't even been seriously anticipated. Someone renamed all of your fasta files? No worries, just use the -prepend option! Your junction library has a completely non-standard naming? No problem, just use the -override_mapname option! Some of your MAQ-aligned reads have indels? Well, ok, I can give you a one-line patch to make that work too.

I suppose that really makes me wonder: if I were writing one-off scripts, which would obviously lack this kind of flexibility, I'd be able to move faster and more nimbly across the topics that interest me. (Several other grad students do that, and are well published because of it.) Is that a trade-off I'm willing to make, though?

Someone really needs to hold a forum on this topic: "Grad students: quality or quantity?" I'd love to sit through those panel discussions. As for myself, I'm still somewhat undecided on the issue. I'd love more publications, but having the code just work (which gets harder and harder as the codebase hits 30k lines) is also a nice thing. While I'm sure users of my software are happy when these options exist, I wonder what my supervisor thinks of the months I've spent building all of these tools - and not writing papers.

Ah well, I suppose when it comes time to defend, I'll find out exactly what he thinks about that issue. :/


Friday, April 10, 2009

Nifty little trick for debugging frozen applications

This trick is just too cool not to mention. The other day, I was trying to debug an application that was getting stuck in an endless loop. A rather complicated set of changes was required, and I had no idea where the program was getting stuck.

In the past, I would have just ended the program with a control-c, and then started dropping in print statements until I could isolate exactly where the program was getting stuck. Instead, I stumbled upon a very nifty little trick: using kill to make the JVM dump the stack of every thread to the screen with the command:
kill -3 [pid]

For Java code running from class files, the thread dump shows you exactly which line each thread is executing, allowing you to find out precisely where the problem is - which makes debugging go much more quickly.

Anyhow, I haven't yet tried whether this works on a .jar file, or what else you can do with a quick "kill -3", but it certainly broadens my toolkit of debugging utilities, and gives me a whole new respect for kill signals. I may have to test out a few of the other ones....


Wednesday, February 18, 2009

Three lines of Java code you probably don't want to write

Ah, debugging. Ironically, it's one of the programming "skills" I'm good at. In fact, I'm usually able to debug code faster than I can write it - which leads to some interesting workflows for me. Anyhow, today I managed to really mess myself up, which took several hours of rewriting code and debugging to figure out. In the end, it all came down to three lines, which I hadn't looked at carefully enough - any of the 8-10 times I went over that subroutine.

The point was to transfer all of the reads in the local buffer back into the buffer_ahead, preserving their order - they need to be at the front of the queue. The key word here was "all".

In any case, I thought I'd share it as an example of what you shouldn't do. (Does anyone else remember the Berenstain Bears books? "Son, this is what you should not do, now let this be a lesson to you.")

Enjoy:
for (int r = 0; r < buffer_ahead.size(); r++) {
    buffer_ahead.add(r, local_buffer.remove(0));
}
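
For the record, here's one way to do what was actually intended - move all of local_buffer to the front of buffer_ahead, preserving order (assuming both buffers are java.util.Lists, which the add(int, E) and remove(0) calls above imply):

// Move ALL of local_buffer to the front of buffer_ahead, preserving its order.
// (The loop above terminates on buffer_ahead.size(), which grows with every add.)
buffer_ahead.addAll(0, local_buffer);
local_buffer.clear();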


Thursday, August 7, 2008

programming with a loaded gun.

I had an anonymous comment the other day that started off like this:

Don't blame sloppy code practices on Perl; it would be just as easy to obfuscate the line you mention in another language.


That got me thinking, and I rapidly came to the conclusion that you can't obfuscate code as much in another language as you can in Perl. Perl is like a loaded gun. Used wisely, you could win a biathlon, catch crooks or... um... defend your country. (I'm not actually a big fan of guns, so this metaphor is stretching things a bit for me.) Used irresponsibly, you could shoot yourself in the foot, rob a bank or invade Kuwait. The gun-lovers tell you it's not the gun that's responsible for the bad things that are done with it. The famous quote is that "Guns don't kill people, people kill people." Well, I don't want to get into a debate on the merits of guns, but as a tool, they need to be used wisely. I don't think anyone would debate that.

I argue, so does Perl.

Giving a novice programmer Perl is like giving a teenager a loaded shotgun without a safety. The consequences are less dramatic, but it's equally irresponsible. You need training to use both. Without training, you write code that can never be understood, can't be trusted, and is unmaintainable. How do you correct a bug in code you can't wrap your head around?

Anyhow, yes, other languages can be used to obfuscate code, but Perl, in my humble opinion, is designed to give you a flexibility that just isn't found in other languages. How many other modern languages let you read in a variable and then use its contents as the name of another variable (Perl's symbolic references)? Certainly not C, Java or VB. And that's just my favorite example at the moment; I'm sure there are others.

Where this is leading me is that all languages have problems, and all languages can be abused. A good program in a given language has certain traits, just as bad ones do. I'm working on my next blog post, in which I figure I should try to answer the question: what does good code look like in a given language, and - more entertaining for me - what does bad code look like? I've seen a lot of examples of each lately, so I may as well share what I know.
