Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Thursday, November 12, 2009

Go from Google...

Just a short post, since I'm actually (although you probably can't tell) rather busy today. However, I'm absolutely fascinated by Google's new language, Go. It's taken the best from just about every existing language out there, and appears so clean!

I'm currently watching Google's talk on it, while I write... I'm only a few minutes in, but it seems pretty good. Watching this seriously makes me want to start a new bio-go project... so nifty!


Wednesday, November 4, 2009

New ChIP-seq control

Ok, so I've finally implemented and debugged a second type of control in FindPeaks... It's different, and it seems to be more sensitive, requiring fewer assumptions to be made about the data set itself.

What it needs, now, is some testing. Is anyone out there willing to try a novel form of control on a dataset that they have? (I won't promise it's flawless, but hey, it's open source, and I'm willing to bug fix anything people find.)

If you do, let me know, and I'll tell you how to activate it. Let the testing begin!


Monday, October 5, 2009

Why peak calling is painful.

In discussing my work, I'm often asked how hard it is to write a peak calling algorithm. The answer usually surprises people: It's trivial. Peak calling itself isn't hard. However, there are plenty of pitfalls that can surprise the unwary. (I've found myself in a few holes along the way, which have been somewhat challenging to get out of.)

The pitfalls, when they do show up, can be very painful - and they mask just how trivial the core task really is.

In reality, the three most frustrating things that occur in peak calling are:
  1. Maintaining the software

  2. Peak calling without unlimited resources (e.g., 64 GB of RAM)

  3. Keeping on the cutting edge

On the whole, each of these things is a separate software design issue worthy of a couple of seconds of discussion.

When it comes to building software, it's really easy to fire up a "one-off" script. Anyone can write something that gets tossed aside when they're done with it - but code re-use and recycling are skills in their own right. (And important ones.) Writing your peak finder to be modular is a lot of work, and it takes a serious ongoing investment of time to keep the modules in good shape as the code grows. A good example of why this matters is file formats. Since the first version of FindPeaks, we've transitioned through two versions of Eland output, Maq's .map format, and now on to SAM and BAM (not to mention BED, GFF, and several other more or less obscure formats). In each case, we've been able to simply write a new iterator and plug it into the existing modular infrastructure. In fact, SAM support was added quite rapidly by Tim, with only a few hours of work. That wouldn't have been possible without the massive upfront investment in good modularity.
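To make that concrete, here's a rough sketch of what that modularity looks like in practice. The names (AlignedRead, AlignedReadIterator, BedReader) are hypothetical, not the actual FindPeaks classes - the point is just that every format hides behind one small iterator interface, so adding a format means adding one small class:

import java.io.IOException;
import java.util.Iterator;

// One minimal record type shared by every format.
// Coordinates here are stored 0-based, half-open (BED-style).
class AlignedRead {
    final String chromosome;
    final int start;
    final int end;
    final boolean reverseStrand;

    AlignedRead(String chromosome, int start, int end, boolean reverseStrand) {
        this.chromosome = chromosome;
        this.start = start;
        this.end = end;
        this.reverseStrand = reverseStrand;
    }
}

// Every file format implements the same iterator contract,
// so the peak caller never needs to know what it is reading.
interface AlignedReadIterator extends Iterator<AlignedRead> {
    void close() throws IOException;
}

// Adding a new format (SAM, Eland, Maq, ...) means writing one new class
// like this and plugging it into the existing infrastructure.
class BedReader implements AlignedReadIterator {
    private final java.io.BufferedReader reader;
    private String nextLine;

    BedReader(String path) throws IOException {
        reader = new java.io.BufferedReader(new java.io.FileReader(path));
        nextLine = reader.readLine();
    }

    public boolean hasNext() { return nextLine != null; }

    public AlignedRead next() {
        String[] f = nextLine.split("\t");
        // BED is already 0-based, half-open, so no coordinate shift is needed.
        AlignedRead read = new AlignedRead(f[0],
                Integer.parseInt(f[1]), Integer.parseInt(f[2]),
                f.length > 5 && f[5].equals("-"));
        try {
            nextLine = reader.readLine();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return read;
    }

    public void remove() { throw new UnsupportedOperationException(); }

    public void close() throws IOException { reader.close(); }
}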

The second pitfall is memory consumption - and this is somewhat more technical. When dealing with sequencing reads, you're faced with a couple of choices: you either sort the reads and then move along them one at a time, determining where they land - OR - you can pre-load all the reads, then move along the chromosome. The first model takes very little memory, but requires a significant amount of pre-processing, which I'll come back to in a moment. The second requires much less CPU time - but is intensely memory-hungry.

If you want to visualize this, the first method is to organize all of your reads by position, then walk down the length of the chromosome with a moving window, only caring about the reads that fall into the window at any given point in time. This is how FindPeaks works now. The second is to build a model of the chromosome, much like a "pileup" file, which can then be processed however you like. (This is how I do SNP calling.) In theory, it shouldn't matter which one you do, as long as all your reads can be sorted correctly. The first can usually be run with a limited amount of memory, depending on the memory structures you use, whereas the second's footprint is pretty much determined by the size of the chromosomes you're using (multiplied by a constant that also depends on the structures you use).
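A minimal sketch of the first (windowed) approach, re-using the hypothetical AlignedRead record from above and leaving out the chromosome-boundary handling a real implementation needs:

import java.util.ArrayDeque;
import java.util.Deque;

class SlidingWindowWalker {
    // Walk a position-sorted stream of reads, keeping only the reads that can
    // still overlap the current window. Memory use is bounded by the local
    // coverage, not by the chromosome length.
    static void walk(Iterable<AlignedRead> sortedReads, int windowSize) {
        Deque<AlignedRead> window = new ArrayDeque<AlignedRead>();
        for (AlignedRead read : sortedReads) {
            // Evict reads that ended before the left edge of the window.
            while (!window.isEmpty()
                    && window.peekFirst().end < read.start - windowSize) {
                window.pollFirst();
            }
            window.addLast(read);
            // window.size() is the read count near read.start; a peak caller
            // would apply its height thresholds and statistics here.
            System.out.println(read.chromosome + "\t" + read.start + "\t" + window.size());
        }
    }
}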

Unfortunately, using the first method isn't always as easy as you might expect. For instance, when doing alignments with transcriptomes (or indels), you often have gapped reads. An early solution to this in FindPeaks was to break each portion of the read into separate aligned reads and process them individually - which works well when correctly sorted. Unfortunately, newer formats no longer allow that - using a "pre-sorted" BAM/SAM file, you can now find multi-part reads, but there's no real option of pre-fragmenting those reads and re-sorting. Thus, FindPeaks now has an additional layer that must read ahead and buffer SAM reads in order to make sure that the next one returned is in the correct order. (You can get odd bugs otherwise, and yes, there are many other potential solutions.)
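For what it's worth, that read-ahead layer amounts to something like the sketch below (again using the hypothetical AlignedRead type, and ignoring chromosome boundaries). It assumes the input is never out of order by more than maxDisorder bases - which is exactly the sort of assumption that has to be made explicit somewhere:

import java.util.Comparator;
import java.util.Iterator;
import java.util.PriorityQueue;

class ResortingIterator implements Iterator<AlignedRead> {
    private final Iterator<AlignedRead> source;
    private final int maxDisorder;   // how far out of order the input may be, in bases
    private final PriorityQueue<AlignedRead> buffer;
    private int lastStartSeen = Integer.MIN_VALUE;

    ResortingIterator(Iterator<AlignedRead> source, int maxDisorder) {
        this.source = source;
        this.maxDisorder = maxDisorder;
        this.buffer = new PriorityQueue<AlignedRead>(1024,
                new Comparator<AlignedRead>() {
                    public int compare(AlignedRead a, AlignedRead b) {
                        return a.start - b.start;
                    }
                });
    }

    public boolean hasNext() {
        return !buffer.isEmpty() || source.hasNext();
    }

    public AlignedRead next() {
        // Read ahead until nothing later in the stream can possibly start
        // before the current head of the buffer; only then is it safe to emit.
        while (source.hasNext() && (buffer.isEmpty()
                || lastStartSeen < buffer.peek().start + maxDisorder)) {
            AlignedRead read = source.next();
            lastStartSeen = read.start;
            buffer.add(read);
        }
        return buffer.poll();
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}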

Moving along to the last pitfall: the one thing that people want out of a peak finder is that it's able to do the latest and greatest methods - and do it ahead of everyone else. That on its own is a near-impossible task. To keep a peak finder relevant, you not only need to implement what everyone else is doing, but also do things that they're not. For a group of 30 people, that's probably not too hard, but for academic peak callers, that can be a challenge - particularly since every user wants something subtly different from the next.

So, when people ask how hard it is to write their own peak caller, that's the answer I give: It's trivial - but a lot of hard work. It's rewarding, educational and cool, but it's a lot of work.

Ok, so is everyone ready to write their own peak caller now? (-;


Friday, October 2, 2009

Base quality by position

A colleague of mine was working on a nifty tool to give graphs of the base quality at each position in a read using Eland Export files, which could be incorporated into his pipeline. Over a discussion about the length of time it should take to do that analysis (his script was taking an hour, and I said it should take about a minute to analyze 8 million Illumina reads...), I ended up saying I'd write my own version to do the analysis, just to show how quickly it could be done.

Well, I was wrong about it taking about a minute. It turns out that the file contained more than double the originally quoted 8 million reads (QC, no-match and multi-match reads hadn't been filtered out), and the whole file was bzipped, which adds to the processing time.

Fortunately, I didn't have to add bzip support to the reader, as tcezard (Tim) had already added a cool "PIPE" option for piping whatever data format I want into applications of the Vancouver Short Read Analysis Package. Thus, I was able to do the following:
time bzcat /archive/solexa1_4/analysis2/HS1406/42E6FAAXX_7/42E6FAAXX_7_2_export.txt.bz2 | java6 src/projects/maq_utilities/QualityReport -input PIPE -output /projects/afejes/temp -aligner elandext

Quite a neat use of piping, really.
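For anyone curious what the tool actually has to do, the core of it is just a tally of quality scores per read position. This is not the actual QualityReport code - just a bare-bones sketch, and the quality column index and the Phred+64 ASCII offset are assumptions about the Eland export format of the day:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class QualityByPosition {
    public static void main(String[] args) throws Exception {
        long[] qualitySums = new long[200];   // more than long enough for 100-mers
        long[] counts = new long[200];

        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split("\t");
            String quality = fields[9];                    // quality-string column (illustrative)
            for (int i = 0; i < quality.length(); i++) {
                qualitySums[i] += quality.charAt(i) - 64;  // Phred+64 ASCII offset
                counts[i]++;
            }
        }
        // One line per read position: position, average quality.
        for (int i = 0; i < counts.length && counts[i] > 0; i++) {
            System.out.println((i + 1) + "\t" + ((double) qualitySums[i] / counts[i]));
        }
    }
}

It runs exactly the same way as above: bzcat the export file, pipe it in, and redirect the two-column output to a file for graphing.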

Anyhow, the fun part is that the library was a 100-mer Illumina run, and it makes a pretty picture. Slapping the output into OpenOffice yields the following graph:

[Graph: average base quality by read position for the 100-mer Illumina run]

I didn't realize quality dropped so dramatically at 100bp - although I remember when qualities looked like that for 32bp reads...

Anyhow, I'll include this tool in FindPeaks 4.0.8 in case anyone is interested in it. And for the record, this run took 10 minutes, of which about 4 were taken up by bzcat. Of the 16.7M reads in the file, only 1.5M were aligned, probably due to the poor quality out beyond 60-70bp.


Tuesday, August 18, 2009

new repository of second generation software

I finally have a good resource for locating second-gen (next-gen) sequencing analysis software. For a long time, people have just been collecting it on a single thread in the bioinformatics section of the SeqAnswers.com forum; however, the brilliant people at SeqAnswers have spawned off a wiki for it, with an easy-to-use form. I highly recommend you check it out, and possibly even add your own package.

http://seqanswers.com/wiki/SEQanswers


Tuesday, August 11, 2009

SNP/SNV callers minimum useful information

Ok, I sent a tweet about it, but it didn't solve the frustration I feel on the subject of SNP/SNV callers. There are so many of them out there that you'd think they grow on trees. (Actually, they grow on arrays...) I've written one, myself, and I know there are at least 3 others written at the GSC.

Anyhow, at first sight, what pisses me off is that there's no standard format. Frankly, that's not even the big problem, however. What's really underlying that problem is that there's no standard "minimum information" content being produced by the SNP/SNV callers. Many of them give a bare minimum of information, but lack the details needed to really evaluate the calls.

So, here's what I propose. If you're going to write a SNP or SNV caller, make sure your called variations contain the following fields:
  • chromosome: obviously, the chromosome on which the variation was found
  • position: the base position on the chromosome
  • genome: the version of the genome against which the SNP was called (e.g. hg18 vs. hg19)
  • canonical: what you expect to see at that position. (Invaluable for error checking!)
  • observed: what you did see at that position
  • coverage: the depth at that position (filtered or otherwise)
  • canonical_obs: how many times you saw the canonical base (key to evaluating what's at that position)
  • variation_obs: how many times you saw the variation
  • quality: give me something to work with here - a confidence value between 0 and 1 would be ideal... but let's pick something we can compare across data sets. Giving me 9 values and asking me to figure something out is cheating. Sheesh!
Really, most of the callers out there give you most, if not all, of it - but I have yet to see the final "quality" being given. The MAQ SNP caller (which is pretty good) asks you to look at several different fields and make up your own mind. That's fine for a first generation, but maybe I can convince people that we can do better in the second-gen SNP callers.
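To put that in concrete terms, here's what a single record would look like if every caller agreed on the minimum set above. The class is purely hypothetical (not from any existing caller); the quality conversion in the comment is just the standard Phred relationship:

// A hypothetical "minimum information" record for a called SNV - one field per
// item in the list above. A Phred-scaled score Q converts to a confidence in
// [0,1] as: confidence = 1 - 10^(-Q/10).
class SnvCall {
    String chromosome;     // e.g. "chr17"
    int    position;       // base position on the chromosome
    String genomeBuild;    // e.g. "hg18"
    char   canonical;      // base expected at this position in the reference
    char   observed;       // variant base actually seen
    int    coverage;       // total read depth at the position
    int    canonicalObs;   // reads supporting the canonical base
    int    variationObs;   // reads supporting the variation
    double quality;        // single confidence value between 0 and 1

    String toTabDelimited() {
        return chromosome + "\t" + position + "\t" + genomeBuild + "\t"
             + canonical + "\t" + observed + "\t" + coverage + "\t"
             + canonicalObs + "\t" + variationObs + "\t" + quality;
    }
}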

Ok, now I've got that off my chest! Phew.


Thursday, August 6, 2009

New Project Time... variation database

I don't know if anyone out there is interested in joining in - I'm starting to work on a database that will allow me to store all of the snps/variations that arise in any data set collected at the institution. (Or the subset to which I have the right to harvest snps, anyhow.) This will be part of the Vancouver Short Read Analysis Package, and, of course, will be available to anyone allowed to look at GPL code.

I'm currently on my first pass - consider it version 0.1 - but I already have some basic functionality assembled. Currently, it uses a built-in SNP caller to identify locations with variations and to send them directly into a PostgreSQL database, but I will shortly be building tools to allow SNPs from any SNP caller to be migrated into the db.
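For the curious, the loading side is not much more than a JDBC insert along these lines. The table and column names below are invented for illustration - the real schema in the package may look quite different:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// A minimal sketch of pushing called variants into PostgreSQL over JDBC.
public class VariationDbLoader {
    public static void main(String[] args) throws Exception {
        Connection db = DriverManager.getConnection(
                "jdbc:postgresql://localhost/variationdb", "user", "password");
        PreparedStatement insert = db.prepareStatement(
                "INSERT INTO observed_snps "
              + "(library, chromosome, position, canonical, observed, coverage) "
              + "VALUES (?, ?, ?, ?, ?, ?)");

        // In the real tool these values would come from the SNP caller's output.
        insert.setString(1, "library_001");
        insert.setString(2, "chr17");
        insert.setInt(3, 7519148);
        insert.setString(4, "C");
        insert.setString(5, "T");
        insert.setInt(6, 42);
        insert.executeUpdate();

        insert.close();
        db.close();
    }
}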

Anyhow, just putting it out there - this could be a useful resource for people who are interested in meta analysis, and particularly those who might be interested in collaborating to build a better mousetrap. (=


Wednesday, July 29, 2009

Aligner tests

You know what I'd kill for? A simple set of tests for each aligner available. I have no idea why we didn't do this ages ago. I'm sick of off-by-one errors caused by all sorts of slightly different formats available - and I can't do unit tests without a good simple demonstration file for each aligner type.

I know SAM format should help with this - assuming everyone adopts it - but even for SAM I don't have a good control file.

I've asked someone here to set up this test using a known sequence - and if it works, I'll bundle the results into the Vancouver Package so everyone can use it.

Here's the 50-mer I picked to do the test. For those of you with some knowledge of cancer, it comes from TP53. It appears to blast uniquely to this location.
>forward - chr17:7,519,148-7,519,197
CATGTGCTGTGACTGCTTGTAGATGGCCATGGCGCGGACGCGGGTGCCGG

>reverse - chr17:7,519,148-7,519,197
ccggcacccgcgtccgcgccatggccatctacaagcagtcacagcacatg
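Once an aligner's output for that sequence is in hand, the unit test itself can be tiny. This is only a sketch (it borrows the hypothetical BedReader and AlignedRead from the peak-calling post above, plus JUnit); each supported format would get an equivalent test against the same expected coordinates:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class AlignerOutputTest {
    // chr17:7,519,148-7,519,197 in 1-based, inclusive coordinates becomes
    // start=7519147, end=7519197 in BED's 0-based, half-open convention -
    // exactly the sort of off-by-one detail these tests are meant to pin down.
    @Test
    public void tp53FiftyMerMapsToExpectedCoordinates() throws Exception {
        AlignedRead read = new BedReader("test_data/tp53_50mer.bed").next();
        assertEquals("chr17", read.chromosome);
        assertEquals(7519147, read.start);
        assertEquals(7519197, read.end);
    }
}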


Monday, July 27, 2009

Picard code contribution

Update 2: I should point out that the subject of this post has been resolved. I'll mark it down to a misunderstanding. The patches I submitted were eventually accepted, several days after first being sent and rejected, once the purpose of the patch was clarified with the developers. I will leave the rest of the post here, for posterity's sake, and because I think that there is some merit to the points I made, even if they were misguided in their target.


Today is going to be a very blog-ful day. I just seem to have a lot to rant about. I'll be blaming it on the spider and a lack of sleep.

One of the things that thrills me about Open Source software is the ability for anyone to make contributions (above and beyond the ability to share and understand the source code) - and I was ecstatic when I discovered the Java-based Picard project, an open source set of libraries for working with SAM/BAM files. I've been slowly reading through the code, as I'd like to use it in my project for reading/writing SAM format files - which nearly all of the aligners available are moving towards.

One of those wonderful tools that I use for my own development is called Enerjy. It's an Eclipse plug-in designed to help you write better Java code by making suggestions about things that can be improved. A lot of its suggestions are simple: re-order imports to make them alphabetical (and more readable), fill in missing javadoc tags, etc. They're not key pieces, but they are important to maintaining your code's good health. It also points the way to things that are likely to cause bugs (such as doing string comparisons with the "==" operator).
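That last one is worth a tiny, self-contained illustration, because it's such an easy bug to write and such a hard one to spot:

public class StringComparisonPitfall {
    public static void main(String[] args) {
        String a = new String("chr1");     // a distinct object with the same characters
        String b = "chr1";
        System.out.println(a == b);        // false - "==" compares object references
        System.out.println(a.equals(b));   // true  - equals() compares contents
    }
}

With interned literals, the "==" version can even appear to work in quick tests and then fail on strings read from a file - which is exactly why it slips through.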

While reading through the Picard libraries and code, Enerjy threw more than 1600 warnings. It's not in bad shape, but it's got a lot of little "problems" that could easily be fixed: mainly a lot of missing javadoc, un-cast generic types, arrays being passed between classes, and the like. As part of my efforts to read through and understand the code, which I want to do before using it, I figured I'd fix these details. As I ramped up into the more complex warnings, I wanted to start small while still making a contribution. Open source at its best, right?

The sad part of the tale is that open source only works when the community's contributions are welcome. Apparently, with Picard, code cleaning and maintenance isn't. My first set of patches (dealing mainly with the trivial warnings) were rejected. With that reception, I'm not going to waste my time submitting the second set of changes I made. That's kind of sad, in my opinion. I expressly told them that these patches were just a small start and that I'd begin making larger code contributions as my familiarity with the code improves - and at this rate, my familiarity with the code is definitely not going to mature as quickly, since I have much less motivation to clean up their warnings if they themselves aren't interested in fixing them.

At any rate, perhaps I should have known. Open source in science usually means people have agendas about what they'd like to accomplish with the software - and including contributions may mean including someone on a publication downstream if and when it does become published. I don't know if that was the case here: it was well within the project leader's rights to reject my patches on any grounds they like, but I can't say it makes me happy. I still don't enjoy staring at 1600+ warnings every time I open Eclipse.

The only lesson I take away from this is that next time I see "Open Source" software, I'll remember that just because it's open source, it doesn't mean all contributions are welcome - I should have confirmed with the developers before touching the code that they are open to small changes, and not just bug fixes. In the future, I suppose I'll be tempering my excitement for open source science software projects.

update: A friend of mine pointed me to a link that's highly related. Anyone with an open source project (or interested in getting started in one) should check out this blog post titled Teaching people to fish.


Friday, July 17, 2009

Community

This week has been a tremendous confluence of concepts and ideas around community. Not that I'd expect anyone else to notice, but it really kept building towards a common theme.

The first was just a community of co-workers. Last week, my lab went out to celebrate a lab-mate's successful defense of her thesis (Congrats, Dr. Sleumer!). During the second round of drinks (Undrinkable dirty martinis), several of us had a half hour conversation on the best way to desalinate an over-salty martini. As weird as it sounds, it was an interesting and fun conversation, which I just can't imagine having with too many people. (By the way, I think Obi's suggestion wins: distillation.) This is not a group of people you want to take for granted!

The second community-related event was an invitation to move my blog over to a larger community of bloggers. While I've temporarily declined, it raised the question of what kind of community I have while I keep my blog on my own server. In some ways, it leaves me isolated, although it does provide a "distinct" source of information, easily distinguishable from other people's blogs. (One of the reasons for not moving to the larger community is the lack of distinguishing marks - I don't want to sink into a "borg" experience with other bloggers and just become assimilated entirely.) Is it worth moving over to reduce the isolation and become part of a bigger community, even if it means losing some of my identity?

The third event was a talk I gave this morning. I spent a lot of time trying to put together a coherent presentation - and ended up talking about my experiences without discussing the actual focus of my research. Instead, it was on the topic of "successes and failures in developing an open source community" as applied to the Vancouver Short Read Analysis Package. Yes, I'm happy there is a (small) community around it, but there is definitely room for improvement.

Anyhow, at the risk of babbling on too much, what I really wanted to say is that communities are all around us, and we have to seriously consider our impact on them, and the impact they have on us - not to mention how we integrate into them, both in our work and outside. If you can't maximize your ability to motivate them (or their ability to motivate you), then you're at a serious disadvantage. How we balance all of that is an open question, and one I'm still working hard at answering.

I've attached my presentation from this morning, just in case anyone is interested. (I've decorated it with pictures from the South Pacific, in case all of the plain text is too boring to keep you awake.)

Here it is (it's about 7Mb.)


Tuesday, May 12, 2009

Quality vs Quantity

Today was an interesting day, for many reasons. The first was the afternoon tours for high-school students that came by the Genome Sciences Centre and the labs. I've been taking part in an outreach program for some of the students at two local high schools, which has involved visiting the students to teach them a bit of the biology and computers we do, as well as the tours that bring them by to see us "at work." Honestly, it's a lot of fun, and I really enjoy interacting with the kids. Their questions are always amusing and insightful - and are often a lot of fun to answer well. (How do you explain how the academic system works in 2 minutes or less?)

For my part, I introduced the kids to Pacific Biosciences' SMRT technology. I came up with a relatively slick monologue that goes well with a video from PacBio. (If you haven't seen their video, you should definitely check this out.) The kids seemed genuinely impressed with the concept, and really enjoyed the graphics - although they enjoyed the desktop effects in Ubuntu too... so maybe that's not the best criterion to use for evaluation.

Anyhow, aside from that distraction, I've also had the pleasure of working on some of my older code today. After months of people at the GSC ignoring the fact that I'd already written code to solve many of the problems they were trying to develop software for, a few people have decided to pick up some of the pieces of the Vancouver Short Read Package and give them a test spin.

One of them was looking at FindFeatures - which I've used recently to find exons of interest in WTSS libraries - and the other was PSNPAnalysiPipeline code - which does some neat metrics for WTSS.

The fun part of it is that the code for both of those applications was written months ago - in some cases before I had the data to test it on. When revisiting them and now actually putting the code to use, I was really surprised by the number of options I'd tossed in to account for situations that hadn't even been seriously anticipated. Someone renamed all of your fasta files? No worries, just use the -prepend option! Your junction library has a completely non-standard naming? No problem, just use the -override_mapname option! Some of your MAQ-aligned reads have indels? Well, ok, I can give you a 1-line patch to make that work too.

I suppose that really makes me wonder: if I were writing one-off scripts, which would obviously lack this kind of flexibility, I'd be able to move faster and more nimbly across the topics that interest me. (Several other grad students do that, and are well published because of it.) Is that a trade-off I'm willing to make, though?

Someone really needs to hold a forum on this topic: "Grad students: quality or quantity?" I'd love to sit through those panel discussions. As for myself, I'm still somewhat undecided on the issue. I'd love more publications, but having the code just work (which gets harder and harder as the codebase hits 30k lines) is also a nice thing. While I'm sure users of my software are happy when these options exist, I wonder what my supervisor thinks of the months I've spent building all of these tools - and not writing papers.

Ah well, I suppose when it comes time to defend, I'll find out exactly what he thinks about that issue. :/


Friday, May 1, 2009

2 weeks of neglect on my blog = great thesis progress.

I wonder if my blogging output is inversely proportional to my progress on my thesis. I stopped writing two weeks ago for a little break, and ended up making big steps forward. The vast amount of my work went into FindPeaks, which included the following:

  • A complete threaded Saturation analysis for next-gen libraries.

  • A method of comparing next-gen libraries to identify peaks that are statistically significant outliers. (It's also symmetric, unlike linear-regression-based methods.)

  • A better control method

  • A whole new way of analysing WTSS data, which gives statistically valid expression differences


And, of course, many many other changes. Not everything is bug-free yet, but it's getting there. All that's left on my task list is debugging a couple of things in the compare mode, relating to peaks present in only one of the two libraries, and an upgrade to my FDR cutoff prediction methods. Once those are done, I think I'll be ready to push out FindPeaks 4.0. YAY!

Actually, what was surprising to me was the sheer amount of work that I've done on this since January. I compiled the change list since my last "quarterly report" for a project that used FindPeaks (but doesn't support it, ironically.... why am I doing reports for them again?) and came up with 24 pages of commit messages - over 575 commits. Considering the amount of work I've done on my actual projects, away from FindPeaks, I think I've been pretty darn productive.

Yes, I'm going to take this opportunity to pat myself on the back in public for a whole 2 seconds... ok, done.

So, overall, blogging may not have been distracting me from my work - even at the height of my blogging (around AGBT), I was still getting lots done - but the past two weeks have really been a help. I'll be back to blogging all the good stuff on Monday. And I'm looking forward to doing some writing now, on some of the cool things in FP4.0 that haven't made it into the manual... yet.

Anyone want some fresh ChIP-Seq results? (-;


Friday, April 10, 2009

Nifty little trick for debugging frozen applications

This trick is just too cool not to mention. I was trying to debug an application that was getting stuck in an endless loop the other day. It was a rather complicated set of changes that was required, and I had no idea where the program was getting stuck.

In the past, I would have just ended the program with a control-c, and then started dropping in print statements until I could isolate exactly where the program was getting stuck. Instead, I stumbled upon a very nifty little trick: using the kill command to make the JVM dump a stack trace of every thread to the screen (without actually killing the process) with the command:
kill -3 [pid]

For Java code running from class files, the thread dump shows you exactly which line is being executed in each thread, allowing you to find out precisely where the problem is - making debugging go much more quickly.
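If you'd rather trigger the same information from inside the program (say, to write it to a log instead of the console), the standard library can do it too. A small sketch using Thread.getAllStackTraces():

import java.util.Map;

public class ThreadDumper {
    // Prints the stack trace of every live thread - much the same picture
    // that "kill -3" gives you, but generated from inside the JVM.
    public static void dumpAllThreads() {
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            System.err.println("Thread: " + thread.getName() + " (" + thread.getState() + ")");
            for (StackTraceElement frame : entry.getValue()) {
                System.err.println("    at " + frame);
            }
        }
    }
}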

Anyhow, I haven't yet tried if this works on a .jar file, or what else you can do with a quick "kill -3", but this certainly broadens my toolkit of debugging utilities, and gives me a whole new respect for the kill signals.  I may have to test out a few of the other ones....


Friday, March 20, 2009

Universal format converter for aligned reads

Last night, I was working on FindPeaks when I realized what an interesting treasure trove of libraries I was really sitting on. I have readers and writers for many of the most common aligned-read formats, and I have several programs that do useful things with them. That raised the distinctly interesting point that all of them could be tied together in one shot... and so I did exactly that.

I now have an interesting set of utilities that can be used to convert from one file format to another: bed, gff, eland, extended eland, MAQ .map (read only), mapview, bowtie.... and several other more obscure formats.

For the moment, the "conversion utility" forces the output to bed file format (since that's the file type with the least information, and I don't have to worry about unexpected file information loss), which can then be viewed with the UCSC browser, or interpreted by FindPeaks to generate wig files. (BED files are really the lowest common denominator of aligned information.) But why stop there?
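The actual conversion step is almost embarrassingly simple once everything is read into a common record. A sketch, again using the hypothetical AlignedRead record from the peak-calling post above (which stores 0-based, half-open coordinates, so a SAM or Eland reader would have done its off-by-one shift at parse time):

class BedWriter {
    // Emit one BED line per read - the lowest-common-denominator output.
    static String toBedLine(AlignedRead read, String name) {
        return read.chromosome + "\t"
             + read.start + "\t"                       // already 0-based, half-open
             + read.end + "\t"
             + name + "\t"
             + "0\t"                                   // score placeholder
             + (read.reverseStrand ? "-" : "+");
    }
}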

Why not add a very simple functionality that lets one format be converted to any other? Actually, there's no good reason not to, but it does involve some heavy caveats. Conversion from one format type to another is relatively trivial until you hit the quality strings. Since these aren't being scaled or altered, you could end up with some rather bizarre conversions unless they're handled cleanly. Unfortunately, doing this scaling is such a moving target that it's just not possible to keep up with it and do all the other development work I have on my plate. (I think I'll be asking for a co-op student for the summer to help out.)

Anyhow, I'll be including this nifty utility in my new tags. Hopefully people will find the upgraded conversion utility to be helpful to them. (=


Thursday, March 19, 2009

Findpeaks 3.3... continued

Patch, compile, read bug, search code, compile, remember to patch, compile, test, find bug, realized it's the wrong bug, test, compile, test....

Although I really enjoy working on my apps, sometimes a whole day goes by where tons of changes are made, and I really don't feel like I've gotten much done. I suppose it's more about the scale of things left to do than the number of tasks. I've managed to solve a few mysteries and make an impact for some people using the software, but I haven't gotten around to testing the big changes I've been working on for a few days - using different compare mechanisms for FindPeaks.

(One might then ask why I'm blogging instead of doing that testing... and that would be a very good question.)

Some quick ChIP-Seq things on my mind:
  • Samtools: there is a very complete Java/Samtools/Bamtools API that I could be integrating, but after staring at it for a while, I've realized that the complete lack of documentation on how to integrate it is really slowing the effort down. I will probably return to it next week.
  • Compare and Control: It seems people are switching to this paradigm on several other projects - I just need to get the new compare mechanism in, and then integrate it in with the control at the same time. That will provide a really nice method for doing both at once, which is really key for moving forward.
  • Eland "extended" format: I ended up reworking all of the Eland Export file functions today. All of the original files I worked with were pre-sorted and pre-formatted. Unfortunately, that's not how they exist in the real world. I now have updated the sort and separate chromosome functions for eland ext. I haven't done much testing on them, unfortunately, but that's coming up too.
  • Documentation: I'm so far behind - writing one small piece of manual a day seems like a good target - I'll try to hold myself to it. I might catch up by the end of the month, at that pace.
Anyhow, lots of really fun things coming up in this version of FindPeaks... I just have to keep plugging away.


Friday, January 9, 2009

No More Maq?

Another grad student at the GSC forwarded an email to our mailing list the other day, which was in turn from the maq-help mailing list. Unfortunately, the link on the maq-help mailing list takes you to another page, which incidentally (and erroneously) complains that FindPeaks doesn't work with Maq .map files - which it does. Instead, I suggest checking out this post on SeqAnswers from Heng Li, the creator of Maq, which has a very similar message.

The main gist of it is that the .map file format will be deprecated, and there will be no new versions of the Maq software package in the future. Instead, they will be working on two other projects (from the forwarded email):
  1. Samtools: replaces maq's (reference-based) "assembly"
  2. bwa: replaces maq's "mapping" for whole human genome alignment.
I suppose it means that eventually FindPeaks should support the Samtools formats, which I'll have to look into at some point. For those of you who are still using Maq, you may need to start following those projects as well, simply because it raises the question of long-term Maq support. As with many early generation Bioinformatics tools, we'll just have to be patient and watch how the software landscape evolves.

It probably also means that I'll have to start watching the Samtools development more carefully for use with my thesis project - many of the tools they are planning seem to replace the ones I've already developed in the Vancouver Short Read Alignment Package. Eventually, I'll have to evaluate both sets against each other. (That could also be an interesting project.)

While this was news to me, it's probably no more than the expected churn of a young technology field. I'm sure it's not going to be long until even the 2nd generation sequencing machines themselves evolve into something else.


Thursday, January 8, 2009

The Future of FindPeaks

At the end of my committee meeting, last month, my advisors suggested I spend less time on engineering questions, and more time on the biology of the research I'm working on. Since that means spending more time on the cancer biology project, and less on FindPeaks, I've been spending some time thinking about how I want to proceed forward - and I think the answer is to work smarter on FindPeaks. (No, I'm not dropping FindPeaks development. It's just too much fun.)

For me, the amusing part of it is that FindPeaks is already on its 4th major structural iteration. Matthew Bainbridge wrote the first, I duplicated it by re-writing its code for the second version, then came the first round of major upgrades in version 3.1, and then I did the massive cleanup that resulted in the 3.2 branch. After all that, why would I want to write another version?

Somewhere along the line, I've realized that there are several major engineering things that could be done that would make FindPeaks faster, more versatile and able to provide more insight into the biology of ChIP-Seq and similar experiments. Most of the changes are a reflection of the fact that the underlying aligners being used have changed. When I first got involved we were using Eland 0.3 (?), which was simple compared to the tools we now have available. It just aligned each fragment individually and spat out the results, which left the filtering and sorting up to FindPeaks. Thus, early versions of FindPeaks were centred on those basic operations. As we moved to sorted formats like .map and _sorted.txt files, those issues have mostly disappeared, allowing more emphasis to be placed on the statistics and functionality.

At this point, I think we're coming to the next generation of biology problems - integrating FindPeaks into the wider toolset - and generating real knowledge about what's going on in the genome, and I think it's time for FindPeaks to evolve to fill that role, growing out to better use the information available in the sorted aligner results.

Ever since the end of my exam, I haven't been able to stop thinking of neat applications for FindPeaks and the rest of my tool kit - so, even if I end up focussing on the cancer biology that I've got in front of me, I'm still going to find the time to work on FindPeaks, to better take advantage of the information that FindPeaks isn't currently using.

I guess that desire to do things well, and to get at the answers that are hidden in the data is what drives us all to do science. And probably what drives grad students to work late into the night on their projects.... I think I see a few more late nights in the near future. (-;


Tuesday, August 12, 2008

SNP callers.

I thought I'd switch gears a bit this morning. I keep hearing people say that the next project their company/institute/lab is going to tackle is a SNP calling application, which strikes me as odd. I've written at least 3 over the last several months, and they're all trivial. They seem to perform as well as anyone else's SNP calls, and if they take up more memory, I didn't think that was too big of a problem. We have machines with lots of RAM, and it's relatively cheap these days.

What really strikes me as odd is that people think there's money in this. I just can't see it. The barrier to creating a new SNP calling program is incredibly low. I'd suggest it's even lower than creating an aligner - and there are already 20 or so of those out there. There's even an aligner being developed at the GSC (which I don't care for in the slightest, I might add) that works reasonably well.

I think the big thing that everyone is missing is that it's not the SNP calls themselves that are important - it's SNP management. In order to do SNP filtering, I have a huge PostgreSQL database with SNPs from a variety of sources, in several large tables, which have to be compared against the SNPs and gene calls from my data set. Even then, I would have a very difficult time handing off my database to someone else - my database is scalable, but completely un-automated, and has nothing but the psql interface, which is clearly not the most user-friendly. If I were going to hire a grad student and allocate money to software development, I wouldn't spend the money on a SNP caller and have the grad student write the database - I'd put the grad student to work on his own SNP caller and buy a SNP management tool. Unfortunately, it's a big project, and I don't think there's a single tool out there that would begin to meet the needs of people managing output from massively-parallel sequencing efforts.

Anyhow, just some food for thought, while I write tools that manage SNPs this morning.

Cheers.
