Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Thursday, January 14, 2010

How to be a better Programmer: Tactics.

I'm a bit too busy for a long post, but a link was circulating around the office that I thought was worth passing on to any bioinformaticians out there.

http://dlowe-wfh.blogspot.com/2007/06/tactics-tactics-tactics.html

The article above is on how to be a better programmer - and I wholeheartedly agree with what the author proposed, with one caveat that I'll get to in a minute. The point of the article is that learning to see the big picture (not specific skills) will make you a better programmer. In fact, this is the same advice Sun Tzu gives in "The Art of War", where understanding the terrain, the enemy, etc. are the tools you need to be a better general. [This would be in contrast to learning how to wield each weapon, which would only make you a better warrior.] Frankly, it's good advice, and this leads you down the path towards good planning and clear thinking - the keys to success in most fields.

The caveat, however, is that there are times in your life where this is the wrong approach: i.e., grad school. As a grad student, your goal isn't to be great at everything you touch - it's to specialize in some small corner of one field, and tactics are no help here. If grad school existed for Ninjas, the average student would walk out being the best (pick one of: poisoner/dart thrower/wall climber/etc.) in the world - and likely knowing little or nothing about how to be a real ninja beyond what they learned in their Ninja undergrad. Tactics are never a bad investment, but they aren't always what is being asked of you.

Anyhow, I plan to take the advice in the article and to keep studying the tactics of bioinformatics in my spare time, even though my daily work is more on the details and implementation side of it. There are a few links in the comments of the original article to sites the author believes are good comp-sci tactics... I'll definitely be looking into those tonight. Besides, when it comes down to it, the tactics are really the fun parts of the problems, although there is also something to be said for getting your code working correctly and efficiently.... which I'd better get back to. (=

Happy coding!

Tuesday, December 22, 2009

Link Roundup Returns - Dec 16-22

I've been busy with my thesis project for the past couple of weeks, which I think is understandable, but all work and no play kinda doesn't sit well with me. So, over the weekend, I learned Go, Google's new programming language, and wrote myself a simple application for keeping track of links - and dumping them out in a pretty HTML format that I can just cut and paste into my blog.

While I'm not quite ready to release the code for my little Go application, I am ready to test it out. I went back through the last 200 Twitter posts I have (about 8 days' worth), and grabbed the ones that looked interesting to me. I may have missed a few, or grabbed a few less-than-thrilling ones. It's simply a consequence of me skimming some of the articles less well than others. I promise the quality of my links will be better in the future.

Anyhow, this experiment gave me a few insights into the process of "reprocessing" tweets. The first is that my app only records the person from whom I got the tweet - not the people from whom they got it. I'll try to address that in the future. The second is that it's a very simple interface - and a lot of things I wanted to say just didn't fit. (Maybe that's for the better... who knows.)

Regardless (or irregardless, for those of you in the U.S.) here are my picks for the week.

Bioinformatics:
  • Bringing back Blast (Blast+) (PDF) - Link (via @BioInfo)
  • Incredibly vague advice on how to become a bioinformatician - Link (via @KatherineMejia)
  • Cleaning up the Human Genome - Link (via @dgmacarthur)
  • Neat article on "4th paradigm of computing: exaflood of observational data" - Link (via @genomicslawyer)

Biology:
  • Gene/Protein Annotation is worse than you thought - Link (via @BioInfo)
  • Why are europeans white? - Link (via @lukejostins)

Future Technology:
  • D-Wave Surfaces again in discussions about bioinformatics - Link (via @biotechbase)
  • Changing the way we give credit in science - Link (via @genomicslawyer)

Off topic:
  • On scientists getting quote-mined by the press - Link (via @Etche_homo)
  • Give away of the best science cookie cutters ever - Link (via @apfejes)
  • Neat early history of the electric car - Link (via @biotechbase)
  • Wild (inaccurate and funny) conspiracy theories about the Wellcome Trust Sanger Institute - Link (via @dgmacarthur)
  • The Eureka Moment: An Interview with Sir Alec Jeffreys (Inventor of the DNA Fingerprint) - Link (via @dgmacarthur)
  • Six types of twitter user (based on The Tipping Point) - Link (via @ritajlg)

Personal Medicine:
  • Discussion on mutations in cancer (in the press) - Link (via @CompleteGenomic)
  • Upcoming Conference: Personalized Medicine World Conference (Jan 19-20, 2010) - Link (via @CompleteGenomic)
  • deCODEme offers free analysis for 23andMe customers - Link (via @dgmacarthur)
  • UK government waking up to the impact of personalized medicine - Link (via @dgmacarthur)
  • Doctors not adopting genomic based tests for drug suitability - Link (via @dgmacarthur)
  • Quick and dirty biomarker detection - Link (via @genomicslawyer)
  • Personal Genomics article for the masses - Link (via @genomicslawyer)

Sequencing:
  • Paper doing the rounds: Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data - Link (via @BioInfo)
  • Archiving Next Generation Sequencing Data - Link (via @BioInfo)
  • Epigenetics takes aim at cancer and other illnesses - Link (via @BioInfo)
  • (Haven't yet read) Changing economics of DNA Synthesis - Link (via @biotechbase)
  • Genomic players for investors. (Very light overview) - Link (via @genomicslawyer)
  • Haven't read yet: Recommended review of 2nd and 3rd generation seq. technologies - Link (via @nanopore)
  • De novo assembly of Giant Panda Genome - Link (via @nanopore)
  • Wellcome Trust summary of 2nd Gen sequencing technologies - Link (via @ritajlg)

Thursday, November 12, 2009

Go from Google...

Just a short post, since I'm actually (although you probably can't tell) rather busy today. However, I'm absolutely fascinated by Google's new language, Go. It's taken the best from just about every existing language out there, and appears so clean!

I'm currently watching Google's talk on it, while I write... I'm only a few minutes in, but it seems pretty good. Watching this seriously makes me want to start a new bio-go project... so nifty!

Monday, October 5, 2009

Why peak calling is painful.

In discussing my work, I'm often asked how hard it is to write a peak calling algorithm. The answer usually surprises people: It's trivial. Peak calling itself isn't hard. However, there are plenty of pitfalls that can surprise the unwary. (I've found myself in a few holes along the way, which have been somewhat challenging to get out of.)

The pitfalls, when they do show up, can be very painful - masking the triviality of the situation.

In reality, the three most frustrating things that occur in peak calling are:
  1. Maintaining the software

  2. Peak calling without unlimited resources (e.g., 64 GB of RAM)

  3. Keeping on the cutting edge

On the whole, each of these things is a separate software design issue worthy of a couple of seconds of discussion.

When it comes to building software, it's really easy to fire up a "one-off" script. Anyone can write something that can be tossed aside when they're done with it - but code re-use and recycling is a skill. (And an important one.) Writing your peak finder to be modular is a lot of work, and a huge amount of time investment is required to keep the modules in good shape as the code grows. A good example of why this is important can be illustrated with file formats. Since the first version of FindPeaks, we've transitioned through two versions of Eland output, Maq's .map format and now on to SAM and BAM (but not excluding BED, GFF, and several other more or less obscure formats). In each case, we've been able to simply write a new iterator and plug it into the existing modular infrastructure. In fact, SAM support was added in quite rapidly by Tim with only a few hours of investment. That wouldn't have been possible without the massive upfront investment in good modularity.
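To make the "write a new iterator and plug it in" idea concrete, here is a minimal sketch in Python. It is not the FindPeaks code (which is Java); the `AlignedRead` record, the reader functions and the `READERS` registry are all hypothetical names made up for illustration. The only format facts it relies on are that BED uses 0-based, half-open coordinates and GFF uses 1-based, inclusive ones.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator


@dataclass(frozen=True)
class AlignedRead:
    """Hypothetical common record that every format reader produces."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


def read_bed(lines: Iterable[str]) -> Iterator[AlignedRead]:
    """BED columns: chrom, start, end, name, score, strand (0-based, half-open)."""
    for line in lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        f = line.rstrip("\n").split("\t")
        strand = f[5] if len(f) > 5 else "+"
        yield AlignedRead(f[0], int(f[1]), int(f[2]), strand)


def read_gff(lines: Iterable[str]) -> Iterator[AlignedRead]:
    """GFF columns: seqid, source, type, start, end, score, strand, ... (1-based, inclusive)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        # Shift the start down by one so both readers agree on coordinates.
        yield AlignedRead(f[0], int(f[3]) - 1, int(f[4]), f[6])


# Supporting a new format means adding one function and one entry here;
# nothing downstream of open_reads() has to change.
READERS: Dict[str, Callable[[Iterable[str]], Iterator[AlignedRead]]] = {
    "bed": read_bed,
    "gff": read_gff,
}


def open_reads(fmt: str, lines: Iterable[str]) -> Iterator[AlignedRead]:
    return READERS[fmt](lines)
```

The peak finder itself only ever sees `AlignedRead` objects, which is the whole point: the cost of a new aligner format stays confined to one small reader.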

The second pitfall is memory consumption - and this is somewhat more technical. When dealing with sequencing reads, you're faced with a couple of choices: you either sort the reads and then move along the reads one at a time, determining where they land - OR - you can pre-load all the reads, then move along the chromosome. The first model takes very little memory, but requires a significant amount of pre-processing, which I'll come back to in a moment. The second requires much less cpu time - but is intensely memory thirsty.

If you want to visualize this, the first method is to organize all of your reads by position, then to walk down the length of the chromosome with a moving window, only caring about the reads that fall into the window at any given point in time. This is how FindPeaks works now. The second is to build a model of the chromosome, much like a "pileup" file, which then can be processed however you like. (This is how I do SNP calling.) In theory, it shouldn't matter which one you do, as long as all your reads can be sorted correctly. The first can usually be run with a limited amount of memory, depending on the memory structures you use, whereas the memory needed by the second is pretty much determined by the size of the chromosomes you're using (multiplied by a constant that also depends on the structures you use).
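For the curious, the two approaches can be sketched side by side. This is a toy illustration in Python, not how FindPeaks is actually written: reads are assumed to be plain `(start, end)` tuples with half-open coordinates, and both function names are made up. The first function only remembers the reads overlapping the current position; the second allocates one counter per base of the chromosome.

```python
import heapq


def max_coverage_windowed(sorted_reads):
    """Walk reads sorted by start, keeping only the reads that overlap the
    current position. Returns (max depth, position where it was first reached)."""
    active_ends = []  # min-heap of end coordinates of reads still "in the window"
    best_depth, best_pos = 0, None
    for start, end in sorted_reads:
        # Drop reads that finished before this read begins.
        while active_ends and active_ends[0] <= start:
            heapq.heappop(active_ends)
        heapq.heappush(active_ends, end)
        if len(active_ends) > best_depth:
            best_depth, best_pos = len(active_ends), start
    return best_depth, best_pos


def coverage_pileup(reads, chrom_length):
    """Build a full per-base depth array. Read order does not matter,
    but memory use scales with the chromosome length."""
    depth = [0] * chrom_length
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    return depth
```

Both give the same answer on the same data; the trade-off is that the windowed version insists on sorted input, while the pileup version insists on enough RAM to hold the chromosome.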

Unfortunately, using the first method isn't always as easy as you might expect. For instance, when doing alignments with transcriptomes (or indels), you often have gapped reads. An early solution to this in FindPeaks was to break each portion of the read into separate aligned reads, and process them individually - which works well when correctly sorted. Unfortunately, new formats no longer allow that - using a "pre-sorted" BAM/SAM file, you can now find multi-part reads, but there's no real option of pre-fragmenting those reads and re-sorting. Thus, FindPeaks now has an additional layer that must read ahead and buffer SAM reads in order to make sure that the next one returned is in the correct order. (You can get odd bugs, otherwise, and yes, there are many other potential solutions.)
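One way to picture that read-ahead layer is as a small priority queue. The sketch below is Python and purely illustrative (the real buffering code is Java and handles far more detail); it assumes each read arrives as a list of `(start, end)` blocks with the leftmost block first, and that reads are sorted by the start of that first block, as in a coordinate-sorted file. The function name is hypothetical.

```python
import heapq


def blocks_in_order(reads):
    """Yield the individual blocks of gapped reads in sorted order.

    A block can be released once the next read starts after it, because no
    later read in a sorted file can contribute a block that starts earlier."""
    pending = []  # min-heap of (start, end) blocks waiting to be released
    for blocks in reads:
        read_start = blocks[0][0]
        while pending and pending[0][0] < read_start:
            yield heapq.heappop(pending)
        for block in blocks:
            heapq.heappush(pending, block)
    # End of file: flush whatever is still buffered.
    while pending:
        yield heapq.heappop(pending)
```

The buffer only ever holds the blocks that could still be overtaken, so memory stays small unless a read spans a very long gap.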

Moving along to the last pitfall, the one thing that people want out of a peak finder is that it is able to do the latest and greatest methods - and do it ahead of everyone else. That on its own is a near impossible task. To keep a peak finder relevant, you not only need to implement what everyone else is doing, but also do things that they're not. For a group of 30 people, that's probably not too hard, but for academic peak callers, that can be a challenge - particularly since every user wants something subtly different than the next.

So, when people ask how hard it is to write their own peak caller, that's the answer I give: It's trivial - but a lot of hard work. It's rewarding, educational and cool, but it's a lot of work.

Ok, so is everyone ready to write their own peak caller now? (-;

Monday, September 28, 2009

Recursive MC solution to a simple problem...

I'm trying to find balance between writing and experiments/coding. You can't do both at the same time without going nuts, in my humble opinion, so I've come up with the plan of alternating days. One day of FindPeaks work, one day on my project. At that rate, I may not give the fastest responses (yes, I have a few emails waiting), but it should keep me sane and help me graduate in a reasonable amount of time. (For those of you waiting, tomorrow is FindPeaks day.)

That left today to work on the paper I'm putting together. Unfortunately, working on the paper doesn't mean I don't have any coding to do. I had a nice simulation that I needed to run: given the data sets I have, what are the likely overlaps I would expect?

Of course, I hate solving a problem once - I'd rather solve the general case and then plug in the particulars.

Today's problem can be summed up as: "Given n data sets, each with i_n genes, what is the expected number of genes common to each possible overlap of 2 or more datasets?"

My solution, after thinking about the problem for a while, was to use a recursive solution. I haven't written recursive code in years, so I was a little hesitant to give it a shot. Despite that, I whipped up the code and ran it - and it worked the first time. (That's sometimes a rarity with my code - I'm a really good debugger, but can often be sloppy when writing code quickly the first time.) Best of all, the code is extensible - if I have more data sets later, I can just add them in and re-run. No code modification needed beyond changing the data. (Yes, I was sloppy and hard coded it, though it would be trivial to read it from a data file, if someone wants to re-use this code.)
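To give a flavour of what a recursive approach to this can look like, here is a small Python sketch. It is not the code in the repository, and it takes an analytic shortcut instead of simulating: it assumes every data set is an independent, uniformly random sample from a shared universe of `universe` genes (that parameter, the function name and the data set names below are all assumptions for illustration). Under that assumption, the expected number of genes found in every member of a combination is the universe size times the product of each set's sampling fraction.

```python
def expected_overlaps(datasets, universe, chosen=(), prob=1.0):
    """Recursively decide, for each data set, whether it is part of the
    combination, and report the expected overlap for every combination
    of two or more data sets.

    datasets: list of (name, number_of_genes) pairs
    universe: total number of genes a data set could have drawn from
    """
    if not datasets:
        # All decisions made: keep only combinations of 2 or more sets.
        return {chosen: universe * prob} if len(chosen) >= 2 else {}
    (name, size), rest = datasets[0], datasets[1:]
    # Branch 1: leave this data set out of the combination.
    result = expected_overlaps(rest, universe, chosen, prob)
    # Branch 2: include it, multiplying in the chance a gene was sampled by it.
    result.update(
        expected_overlaps(rest, universe, chosen + (name,), prob * size / universe)
    )
    return result
```

Adding another data set is just one more entry in the list, e.g. `expected_overlaps([("A", 1000), ("B", 2000), ("C", 500)], universe=20000)`, which is the kind of extensibility described above.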

Anyhow, it turned out to be an elegant solution to a rather complex problem - and I was happy to see that the results I have for the real experiment stick out like a sore thumb: it's far greater than random chance.

If anyone is interested in seeing the code, it was uploaded into the Vancouver Short Read Analysis Package svn repository: here. (I'm doubting the number of page views that'll get, but what the heck, it's open source anyhow.)

I love it when code works properly - and I love it even more when it works properly the first time.

All in all, I'd say it's been a good day, not even counting the 2 hours I spent at the fencing club. En garde! (-;

Monday, August 17, 2009

SNP Database v0.1

Good news, my snp database seems to be in good form, and is ready for importing SNPs. For people who are interested, you can download the Vancouver Short Read Package from SVN, and find the relevant information in
/trunk/src/transcript_analysis/SNP_Database/

There's a schema for setting up the tables and indexes, as well as applications for running imports from maq SNP calls and running a SNP caller on any form of alignment supported by FindPeaks (maq, eland, etc...).

At this point, there are no documents on how to use the software, since that's the plan for this afternoon, and I'm assuming everyone who uses this already has access to a postgresql database (aka, a simple ubuntu + psql setup.)

But, I'm ready to start getting feature requests, requests for new SNP formats and schema changes.

Anyone who's interested in joining onto this project, I'm only a few hours away from having some neat toys to play with!

Thursday, August 6, 2009

New Project Time... variation database

I don't know if anyone out there is interested in joining in - I'm starting to work on a database that will allow me to store all of the snps/variations that arise in any data set collected at the institution. (Or the subset to which I have the right to harvest snps, anyhow.) This will be part of the Vancouver Short Read Analysis Package, and, of course, will be available to anyone allowed to look at GPL code.

I'm currently on my first pass - consider it version 0.1 - but already have some basic functionality assembled. Currently, it uses a built in snp caller to identify locations with variations and to directly send them into a postgresql database, but I will shortly be building tools to allow SNPs from any snp caller to be migrated into the db.
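For anyone wondering what "call SNPs and send them straight into a database" might look like in miniature, here is a rough Python sketch. Everything in it is a stand-in: it uses Python's built-in `sqlite3` module instead of postgresql so that it runs anywhere, the table layout is invented for the example (it is not the schema in the package), and the caller is a deliberately naive majority-vote rule with made-up thresholds.

```python
import sqlite3


def call_snps(reference, pileup, min_depth=4, min_fraction=0.8):
    """Naive caller: report positions where the most common observed base
    differs from the reference. `pileup` holds one {base: count} dict per
    reference position. Thresholds are illustrative only."""
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        base, n = max(counts.items(), key=lambda kv: kv[1])
        if base != reference[pos] and n / depth >= min_fraction:
            yield pos, reference[pos], base, depth


def store_snps(conn, sample, chrom, snps):
    """Insert called SNPs into a hypothetical `snp` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snp ("
        "sample TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, depth INTEGER)"
    )
    conn.executemany(
        "INSERT INTO snp VALUES (?, ?, ?, ?, ?, ?)",
        [(sample, chrom, *snp) for snp in snps],
    )
    conn.commit()
```

Keeping the caller and the storage step separate is what would make it easy to feed in calls from other SNP callers later: anything that can produce `(pos, ref, alt, depth)` tuples can reuse `store_snps`.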

Anyhow, just putting it out there - this could be a useful resource for people who are interested in meta analysis, and particularly those who might be interested in collaborating to build a better mousetrap. (=

Wednesday, July 29, 2009

Aligner tests

You know what I'd kill for? A simple set of tests for each aligner available. I have no idea why we didn't do this ages ago. I'm sick of off-by-one errors caused by all sorts of slightly different formats available - and I can't do unit tests without a good simple demonstration file for each aligner type.

I know SAM format should help with this - assuming everyone adopts it - but even for SAM I don't have a good control file.

I've asked someone here to set up this test using a known sequence- and if it works, I'll bundle the results into the Vancouver Package so everyone can use it.

Here's the 50-mer I picked to do the test. For those of you with some knowledge of cancer, it comes from tp53. It appears to blast uniquely to this location only.
>forward - chr17:7,519,148-7,519,197
CATGTGCTGTGACTGCTTGTAGATGGCCATGGCGCGGACGCGGGTGCCGG

>reverse - chr17:7,519,148-7,519,197
ccggcacccgcgtccgcgccatggccatctacaagcagtcacagcacatg
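As a quick sanity check on the test sequences, the snippet below (Python, with function and variable names of my own choosing) confirms that the second sequence really is the reverse complement of the first, and that the coordinates describe exactly 50 bases when read as 1-based and inclusive. That last detail is precisely the kind of thing that produces off-by-one errors between formats.

```python
# Translation table mapping each base to its complement, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


FORWARD = "CATGTGCTGTGACTGCTTGTAGATGGCCATGGCGCGGACGCGGGTGCCGG"
REVERSE = "ccggcacccgcgtccgcgccatggccatctacaagcagtcacagcacatg"

# chr17:7,519,148-7,519,197, read as 1-based inclusive coordinates.
START, END = 7519148, 7519197

print(reverse_complement(FORWARD).lower() == REVERSE)
print(len(FORWARD) == END - START + 1)
```

In a 0-based, half-open format such as BED, the same interval would be written with a start of 7519147 and an end of 7519197.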

Monday, July 27, 2009

Picard code contribution

Update 2: I should point out that the subject of this post has been resolved. I'll mark it down to a misunderstanding. The patches I submitted were initially rejected, then accepted several days later, once the purpose of the patch was clarified with the developers. I will leave the rest of the post here, for posterity's sake, and because I think that there is some merit to the points I made, even if they were misguided in their target.


Today is going to be a very blog-ful day. I just seem to have a lot to rant about. I'll be blaming it on the spider and a lack of sleep.

One of the things that thrills me about Open Source software is the ability for anyone to make contributions (above and beyond the ability to share and understand the source code) - and I was ecstatic when I discovered the java based Picard project, an open source set of libraries for working with SAM/BAM files. I've been slowly reading through the code, as I'd like to use it in my project for reading/writing SAM format files - which nearly all of the aligners available are moving towards.

One of those wonderful tools that I use for my own development is called Enerjy. It's an Eclipse plug-in designed to help you write better Java code by making suggestions about things that can be improved. A lot of its suggestions are simple: re-order imports to make them alphabetical (and more readable), fill in missing javadoc flags, etc. They're not key pieces, but they are important to maintain your code's good health. It also points the way to things that will likely cause bugs (such as doing string comparisons with the "==" operator).

While reading through the Picard libraries and code, Enerjy threw more than 1600 warnings. It's not in bad shape, but it's got a lot of little "problems" that could easily be fixed. Mainly a lot of missing javadoc, un-cast generic types, arrays being passed between classes and the like. As part of my efforts to read through and understand the code, which I want to do before using it, I figured I'd fix these details. Before ramping up to the more complex warnings, I wanted to start small while still making a contribution. Open source at its best, right?

The sad part of the tale is that open source only works when the community's contributions are welcome. Apparently, with Picard, code cleaning and maintenance isn't. My first set of patches (dealing mainly with the trivial warnings) were rejected. With that reception, I'm not going to waste my time submitting the second set of changes I made. That's kind of sad, in my opinion. I expressly told them that these patches were just a small start and that I'd begin making larger code contributions as my familiarity with the code improves - and at this rate, my familiarity with the code is definitely not going to mature as quickly, since I have much less motivation to clean up their warnings if they themselves aren't interested in fixing them.

At any rate, perhaps I should have known. Open source in science usually means people have agendas about what they'd like to accomplish with the software - and including contributions may mean including someone on a publication downstream if and when it does become published. I don't know if that was the case here: it was well within the project leader's rights to reject my patches on any grounds they like, but I can't say it makes me happy. I still don't enjoy staring at 1600+ warnings every time I open Eclipse.

The only lesson I take away from this is that next time I see "Open Source" software, I'll remember that just because it's open source, it doesn't mean all contributions are welcome - I should have confirmed with the developers before touching the code that they are open to small changes, and not just bug fixes. In the future, I suppose I'll be tempering my excitement for open source science software projects.

update: A friend of mine pointed me to a link that's highly related. Anyone with an open source project (or interested in getting started in one) should check out this blog post titled Teaching people to fish.

Friday, July 17, 2009

Community

This week has been a tremendous confluence of concepts and ideas around community. Not that I'd expect anyone else to notice, but it really kept building towards a common theme.

The first was just a community of co-workers. Last week, my lab went out to celebrate a lab-mate's successful defense of her thesis (Congrats, Dr. Sleumer!). During the second round of drinks (Undrinkable dirty martinis), several of us had a half hour conversation on the best way to desalinate an over-salty martini. As weird as it sounds, it was an interesting and fun conversation, which I just can't imagine having with too many people. (By the way, I think Obi's suggestion wins: distillation.) This is not a group of people you want to take for granted!

The second community related event was an invitation to move my blog over to a larger community of bloggers. While I've temporarily declined, it raised the question of what kind of community I have while I keep my blog on my own server. In some ways, it leaves me isolated, although it does provide a "distinct" source of information, easily distinguishable from other people's blogs. (One of the reasons for not moving to the larger community is the lack of distinguishing marks - I don't want to sink into a "borg" experience with other bloggers and just become assimilated entirely.) Is it worth moving over to reduce the isolation and become part of a bigger community, even if it means losing some of my identity?

The third event was a talk I gave this morning. I spent a lot of time trying to put together a coherent presentation - and ended up talking about my experiences without discussing the actual focus of my research. Instead, it was on the topic of "successes and failures in developing an open source community" as applied to the Vancouver Short Read Analysis Package. Yes, I'm happy there is a (small) community around it, but there is definitely room for improvement.

Anyhow, at the risk of babbling on too much, what I really wanted to say is that communities are all around us, and we have to seriously consider our impact on them, and the impact they have on us - not to mention how we integrate into them, both in our work and outside. If you can't maximize your ability to motivate them (or their ability to motivate you), then you're at a serious disadvantage. How we balance all of that is an open question, and one I'm still working hard at answering.

I've attached my presentation from this morning, just in case anyone is interested. (I've decorated it with pictures from the South Pacific, in case all of the plain text is too boring to keep you awake.)

Here it is (it's about 7Mb.)

Friday, May 29, 2009

Science Cartoons - 3

I wasn't going to do more than one comic a day, but since I just published it into the FindPeaks 4.0 manual today, I may as well put it here too, and kill two birds with one stone.

Just to clarify, under copyright laws, you can certainly re-use my images for teaching purposes or your own private use (that's generally called "fair use" in the US, and copyright laws in most countries have similar exceptions), but you can't publish it, take credit for it, or profit from it without discussing it with me first. However, since people browse through my page all the time, I figure I should mention that I do hold copyright on the pictures, so don't steal them, ok?

Anyhow, Comic #3 is a brief description of how the compare in FindPeaks 4.0 works. Enjoy!

Friday, March 20, 2009

Universal format converter for aligned reads

Last night, I was working on FindPeaks when I realized what an interesting treasure trove of libraries I was really sitting on. I have readers and writers for many of the most common aligned read formats, and I have several programs that do useful functions. So, that raised the distinctly interesting point that all of them should be applied together in one shot... and so I did exactly that.

I now have an interesting set of utilities that can be used to convert from one file format to another: bed, gff, eland, extended eland, MAQ .map (read only), mapview, bowtie.... and several other more obscure formats.

For the moment, the "conversion utility" forces the output to bed file format (since that's the file type with the least information, and I don't have to worry about unexpected file information loss), which can then be viewed with the UCSC browser, or interpreted by FindPeaks to generate wig files. (BED files are really the lowest common denominator of aligned information.) But why stop there?

Why not add a very simple functionality that lets one format be converted to the other? Actually, there's no good reason not to, but it does involve some heavy caveats. Conversion from one format type to another is relatively trivial until you hit the quality strings. Since these aren't being scaled or altered, you could end up with some rather bizarre conversions unless they're handled cleanly. Unfortunately, doing this scaling is such a moving target that it's just not possible to keep up with that and do all the other development work I have on my plate. (I think I'll be asking for a co-op student for the summer to help out.)
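To show why the quality strings are the awkward part, here is a minimal Python sketch of the simplest case only: two formats that both store Phred scores as ASCII characters but use different offsets (Sanger-style files use an offset of 33, while some Illumina pipeline versions used 64). The function name and defaults are hypothetical, and this is not part of the conversion utility. Older Solexa-style scores use a different scale altogether, so a pure offset shift like this one would not be enough for them.

```python
def rescale_quality(qual, from_offset=64, to_offset=33):
    """Shift an ASCII-encoded Phred quality string from one offset to another.

    Raises ValueError if the input cannot have been encoded with from_offset,
    which is a cheap way to catch a file whose encoding was guessed wrongly."""
    scores = [ord(ch) - from_offset for ch in qual]
    if any(score < 0 for score in scores):
        raise ValueError("quality string is not encoded with offset %d" % from_offset)
    return "".join(chr(score + to_offset) for score in scores)
```

For example, a Phred+64 string of `h` characters (score 40) becomes a string of `I` characters in Phred+33. Copying the string across untouched would silently turn score 40 into score 71, which is the sort of bizarre conversion mentioned above.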

Anyhow, I'll be including this nifty utility in my new tags. Hopefully people will find the upgraded conversion utility to be helpful to them. (=

Thursday, January 8, 2009

The Future of FindPeaks

At the end of my committee meeting, last month, my advisors suggested I spend less time on engineering questions, and more time on the biology of the research I'm working on. Since that means spending more time on the cancer biology project, and less on FindPeaks, I've been spending some time thinking about how I want to proceed forward - and I think the answer is to work smarter on FindPeaks. (No, I'm not dropping FindPeaks development. It's just too much fun.)

For me, the amusing part of it is that FindPeaks is already on its 4th major structural iteration. Matthew Bainbridge wrote the first, I duplicated it by re-writing its code for the second version, then came the first round of major upgrades in version 3.1, and then I did the massive cleanup that resulted in the 3.2 branch. After all that, why would I want to write another version?

Somewhere along the line, I've realized that there are several major engineering things that could be done that would make FindPeaks faster, more versatile and able to provide more insight into the biology of ChIP-Seq and similar experiments. Most of the changes are a reflection of the fact that the underlying aligners that are being used have changed. When I first got involved we were using Eland 0.3 (?), which was simple compared to the tools we now have available. It just aligned each fragment individually and spit out the results, which left the filtering and sorting up to FindPeaks. Thus, early versions of FindPeaks were centred on those basic operations. As we moved to sorted formats like .map and _sorted.txt files, those issues have mostly disappeared, allowing more emphasis to be placed on the statistics and functionality.

At this point, I think we're coming to the next generation of biology problems - integrating FindPeaks into the wider toolset - and generating real knowledge about what's going on in the genome, and I think it's time for FindPeaks to evolve to fill that role, growing out to better use the information available in the sorted aligner results.

Ever since the end of my exam, I haven't been able to stop thinking of neat applications for FindPeaks and the rest of my tool kit - so, even if I end up focussing on the cancer biology that I've got in front of me, I'm still going to find the time to work on FindPeaks, to better take advantage of the information that FindPeaks isn't currently using.

I guess that desire to do things well, and to get at the answers that are hidden in the data is what drives us all to do science. And probably what drives grad students to work late into the night on their projects.... I think I see a few more late nights in the near future. (-;

Friday, March 21, 2008

Catching up....

I can't believe it's been nearly a month since my last post! I feel like I've been neglecting this a bit more than I should, but I'll try to rectify that as best I can.

For an indication of how busy I've been, I sat down to update my resume yesterday, and ended up adding 3 papers (all in submission) and two posters. That just about doubles what was in there previously in the papers section.

Anyhow, Next-generation sequencing doesn't stand still, so I thought I'd outline some of the things I want to talk about in my next posts, and set up a few pointers to other resources:

1. SeqAnswers. This aptly named forum has been around for a few months now, but has recently become more popular, and a great forum for discussing relevant Next-gen technology and sequencing methods. I'm especially happy to see the automated posts triggered by new literature on the subject, which are a great resource for those of us who are busy and forget to check for new publications ourselves.

2. There's one forum in particular that's of great interest: Benchmarking different aligners. This appears to be a well done comparison (if lightweight) that may be a good focus for other people who are interested in comparing aligners, and discussing it in a wider forum.

3. For people interested in ChIP-Seq, or Chromatin immunoprecipitation and massively parallel sequencing, I've finally gotten around to posting FindPeaks 3.1 on the web. I'd consider this release (3.1.3) an alpha release. I'd love to get more people contributing by using this application and telling me what could be improved on it, or what enhancements they'd like to see. I'm always happy to discuss new features, and can probably add most of them in with a relatively quick turn around time.

4. For people interested in assessing the quality of the whole transcriptome shotgun sequencing (WTSS), I'm about to break out a tool that should fit that purpose. If anyone is interested in giving suggestions on ways they'd like to see quality tests performed, I'd be more than happy to code those into my applications. (And yes, if you contribute to the tool, I will provide you a copy of the tool to use. Collaborations, etc, can be discussed, as well.)

5. A quick question, of which I'll post more in the future. Has anyone here managed to get Exonerate 2.0 to work in client/server mode on two separate machines?

6. I'll also post a little more about this in the future: building environments, ant and java. Why are people still doing large projects in perl?

7. One last thing I wanted to mention. I was going to write more on this topic, but eh... I'll let slashdot do it for me: The more you drink, the less you publish. Well, so much for keeping a bottle of tequila under the desk. Now I know what to get the competition for x-mas, though...

Cheers!

Wednesday, May 16, 2007

No more support

For once, I think I'll toss out a quick rant that I haven't really thought through, so don't mind if it's a little rough.

I've spent some time thinking about the various projects I've been working on, and that I'd like to be working on in my grad school future. Surprisingly, I'm really happy with them, and I'm eager to delve into all of them, with one major exception: I think I'm sitting on a potential train wreck.

From a project planning perspective, the one major issue that I foresee is that I'm inheriting a legacy of a few tens of kloc (thousand lines of code) done in Java. I'm not really proficient in Java, but it's not hard, compared to some of the other languages I've used - that's not the issue. The big problem is that almost all of it takes advantage of something called the Ensembl API, which is a quick way for programmers to access all sorts of fantastic functions and data related to various genomic information. It's a fantastic resource, but Ensembl (who made the API) has decided to stop supporting the Java version in favour of the Perl version.

Even now, I'm stuck using the annotations from version 41 of the Ensembl Human Genome, whereas v.43 is the most current. How much difference will this make? Probably not much, at the moment. However, in the long term, I think that could become a major issue.

Now, I have worked in Perl before, years ago, so that's not a problem. But what do I do about the 10kloc? Recreating it will take the better part of a year, at least. For now, the solution is to postpone the decision, but I think that'll only work for another month or two. Eventually something is going to give, and I'm just going to have to suck it up and redo all of the code we've got in house. Yuck.
