Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: - Please come visit my blog there.

Thursday, March 11, 2010

Wolfram Alpha recreates ensembl?

Ok, this might not be my most coherent post - I'm finally getting better from being sick for a whole week, which has left my brain felling somewhat... spongy. Several of us the AGBT-ers have come down with something after getting back, and I have a theory that it was something in the food we were given.... Maybe someone slipped something into the food to slow down research at the GSC??? (-; [insert conspiracy theory here.]

Anyhow, I just received a link [aka spam] from Wolfram Alpha, via a posting on linked in, letting me know all about their great new product: Wolfram Alpha now has genome information!

Somehow, looking at their quick demo, I'm somewhat less than impressed. Here's the link, if you'd like to check it out yourself: Wolfram Alpha Blog Post (Genome)

I'm unimpressed for two reasons: the first is that there are TONS of other resources that do this - and apparently do it better, from the little I've seen on the blog. For the moment, they have 11 genomes in there, which they hope to expand in the future. I'm going to have to look more closely, if I find the motivation, as I might be missing something, but I really don't see much that I can't do in the UCSC genome browser or the Ensembl web page. The second thing is that I'm still unimpressed by Wolfram Alpha's insistence that it's more than just a search engine, and that if you use it to answer a question, you need to cite it.

I'm all in favour of using really cool algorithms and searches are no exception. [I don't think I've mentioned this to anyone yet, but if you get a chance check out Unlimited Detail's use of search engine optimization to do unbelievable 3D graphics in real time.] However, if you're going to send links boasting about what you can do with your technology, do something other people can't do - and be clear what it is. From what I can tell, this is just a mash-up meta analysis of a few small publicly available resources. It's not like we don't have other engines that do the same thing, so I'm wondering what it is that they think they do that makes it worth going there for... anyone?

Worst of all, I'm not sure where they get their information from... where do they get their SNP calls from? How can you trust that, when you can't even trust dbSNP?

Anyhow, for the moment, I'll keep using resources that I can cite specifically, instead of just citing Wolfram Alpha... I don't know how reviewers would take it if I cured cancer... and cited Wolfram as my source.

Happy searching, people!

Labels: , , ,

Thursday, January 14, 2010

How to be a better Programmer: Tactics.

I'm a bit too busy for a long post, but a link was circulating around the office that I thought was worth passing on to any bioinformaticians out there.

The article above is on how to be a better programmer - and I wholeheartedly agree with what the author proposed, with one caveat that I'll get to in a minute. The point of the the article is that learning to see the big picture (not specific skills) will make you a better programmer. In fact, this is the same advice Sun Tzu gives in "The Art of War", where understanding the terrain, the enemy, etc are the tools you need to be a better general. [This would be in contrast to learning how to wield each weapon, which would only make you a better warrior.] Frankly, it's good advice, and this leads you down the path towards good planning and clear thinking - the keys to success in most fields.

The caveat, however, is that there are times in your life where this is the wrong approach: ie. grad school. As a grad student, your goal isn't to be great at everything you touch - it's to specialize in some small corner of one field, and tactics are no help here. If grad school existed for Ninjas, the average student would walk out being the best (pick one of: poisoner/dart thrower/wall climber/etc) in the world - and likely knowing little or nothing about how to be a real ninja beyond what they learned in their Ninja undergrad. Tactics are never a bad investment, but they aren't always what is being asked of you.

Anyhow, I plan to take the advice in the article and to keep studying the tactics of bioinformatics in my spare time, even though my daily work is more on the details and implementation side of it. There are a few links in the comments of the original article to sites the author believes are good comp-sci tactics... I'll definitely be looking into those tonight. Besides, when it comes down to it, the tactics are really the fun parts of the problems, although there is also something to be said for getting your code working correctly and efficiently.... which I'd better get back to. (=

Happy coding!

Labels: , , , , ,

Tuesday, December 22, 2009

Link Roundup Returns - Dec 16-22

I've been busy with my thesis project for the past couple weeks, which I think is understandable, but all work and no play kinda doesn't sit well for me. So, over the weekend, I learned go, google's new programming languages, and wrote myself a simple application for keeping track of links - and dumping them out in a pretty html format that I can just cut and paste into my blog.

While I'm not quite ready to release the code for my little go application, I am ready to test it out. I went back through the last 200 twitter posts I have (about 8 days worth), and grabbed the ones that looked interesting to me. I may have missed a few, or grabbed a few less than thrilling ones. It's simply a consequence of me skimming some of the articles less well than others. I promise the quality of my links will be better in the future.

Anyhow, this experiment gave me a few insights into the process of "reprocessing" tweets. The first is that my app only records the person from whom I got the tweet - not the people from who they got it. I'll try to address that in the future. The second is that it's a very simple interface - and a lot of things I wanted to say just didn't fit. (Maybe that's for the better.. who knows.)

Regardless (or irregardless, for those of you in the U.S.) here are my picks for the week.

  • Bringing back Blast (Blast+) (PDF) - Link (via @BioInfo)
  • Incredibly vague advice on how to become a bioinformatician - Link (via @KatherineMejia)
  • Cleaning up the Human Genome - Link (via @dgmacarthur)
  • Neat article on "4th paradigm of computing: exaflod of observational data" - Link (via @genomicslawyer)

  • Gene/Protein Annotation is worse than you thought - Link (via @BioInfo)
  • Why are europeans white? - Link (via @lukejostins)

Future Technology:
  • D-Wave Surfaces again in discussions about bioinformatics - Link (via @biotechbase)
  • Changing the way we give credit in science - Link (via @genomicslawyer)

Off topic:
  • On scientists getting quote-mined by the press - Link (via @Etche_homo)
  • Give away of the best science cookie cutters ever - Link (via @apfejes)
  • Neat early history of the electric car - Link (via @biotechbase)
  • Wild (innacurate and funny) conspiracy theories about the Wellcome Trust Sanger Institute - Link (via @dgmacarthur)
  • The Eureka Moment: An Interview with Sir Alec Jeffreys (Inventor of the DNA Fingerprint) - Link (via @dgmacarthur)
  • Six types of twitter user (based on The Tipping Point) - Link (via @ritajlg)

Personal Medicine:
  • Discussion on mutations in cancer (in the press) - Link (via @CompleteGenomic)
  • Upcoming Conference: Personalized Medicine World Conference (Jan 19-20, 2010) - Link (via @CompleteGenomic)
  • deCODEme offers free analysis for 23andMe customers - Link (via @dgmacarthur)
  • UK government waking up to the impact of personalized medicine - Link (via @dgmacarthur)
  • Doctors not adopting genomic based tests for drug suitabiity - Link (via @dgmacarthur)
  • Quick and dirty biomarker detection - Link (via @genomicslawyer)
  • Personal Genomics article for the masses - Link (via @genomicslawyer)

  • Paper doing the rounds: Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data - Link (via @BioInfo)
  • Archiving Next Generation Sequencing Data - Link (via @BioInfo)
  • Epigenetics takes aim at cancer and other illnesses - Link (via @BioInfo)
  • (Haven't yet read) Changing ecconomics of DNA Synthesis - Link (via @biotechbase)
  • Genomic players for investors. (Very light overview) - Link (via @genomicslawyer)
  • Haven't read yet: Recommended review of 2nd and 3rd generation seq. technologies - Link (via @nanopore)
  • De novo assembly of Giant Panda Genome - Link (via @nanopore)
  • Welcome Trust summary of 2nd Gen sequencing technologies - Link (via @ritajlg)

Labels: , ,

Wednesday, October 14, 2009

Useful error messages.... and another format rant.

I'll start with the error message, since it had me laughing, while everything else seems to have the opposite reaction.

I sent a query to Biomart the other day, as I often do. Most of the time, I get back my results quickly, and have no problems whatsoever. It's one of my "go-to" sites for useful genomic data. Unfortunately, every time I tried to download the results of my query, I'd get 2-3Mb into the file before the download would die. (It was a LONG list of snps, and the file size was supposed to be in the 10Mb ballpark.)

Anyhow, in frustration, I tried the "email results to you" option, whereupon I got the following email message:

Your results file FAILED.
Here is the reason why:
Error during query execution: Server shutdown in progress

That has to be the first time I've ever had a server shutdown cause a result failure. Ok, it's not that funny, but I am left wondering if that was the cause of the other 10 or so aborted downloads. Anyone know if Biomart runs on Microsoft products? (-;

The other thing on my mind this afternoon is that I am still looking to see my first Variant Call Format file for SNPs. A while back, I was optimistic about seeing the VCF files in the real world. Not that I can complain, but I thought adoption would be a little faster. A uniform SNP format would make my life much more enjoyable - I now have 7 different SNP format iterators to maintain, and would love to drop most of them.

What surprised me, upon further investigation, is that I'm also unable to find a utility that actually creates VCF files from .map, SAM/BAM, eland, bowtie or even pileup files. I know of only one SNP caller that creates VCF compatible files, and unfortunately, it's not freely available, which is somewhat un-helpful. (I don't know when or if it will be available, although I've heard rumours about it being put into our pipeline...)

That's kind of a sad state of affairs - although I really shouldn't complain. I have more than enough work on my plate, and I'm sure the same can be said for those who are actively maintaining SNP callers.

In the meantime, I'll just have to sit here and be patient... and maybe write an 8th snp format iterator.

Labels: , , , , , ,

Tuesday, August 18, 2009

new repository of second generation software

I finally have a good resource for locating second gen (next gen) sequencing analysis software. For a long time, people have just been collecting it on a single thread in the bioinformatics section of the forum, however, the brilliant people at SeqAnswers have spawned off a wiki for it, with an easy to use form. I highly recommend you check it out, and possibly even add your own package.

Labels: , , , , , , , , , , , ,

Monday, August 17, 2009

SNP Datatabase v0.1

Good news, my snp database seems to be in good form, and is ready for importing SNPs. For people who are interested, you can download the Vancouver Short Read Package from SVN, and find the relevant information in

There's a schema for setting up the tables and indexes, as well as applications for running imports from maq SNP calls and running a SNP caller on any form of alignment supported by FindPeaks (maq, eland, etc...).

At this point, there are no documents on how to use the software, since that's the plan for this afternoon, and I'm assuming everyone who uses this already has access to a postgresql database (aka, a simple ubuntu + psql setup.)

But, I'm ready to start getting feature requests, requests for new SNP formats and schema changes.

Anyone who's interested in joining onto this project, I'm only a few hours away from having some neat toys to play with!

Labels: , , , , , , , , , ,

Thursday, August 6, 2009

New Project Time... variation database

I don't know if anyone out there is interested in joining in - I'm starting to work on a database that will allow me to store all of the snps/variations that arise in any data set collected at the institution. (Or the subset to which I have the right to harvest snps, anyhow.) This will be part of the Vancouver Short Read Analysis Package, and, of course, will be available to anyone allowed to look at GPL code.

I'm currently on my first pass - consider it version 0.1 - but already have some basic functionality assembled. Currently, it uses a built in snp caller to identify locations with variations and to directly send them into a postgresql database, but I will shortly be building tools to allow SNPs from any snp caller to be migrated into the db.

Anyhow, just putting it out there - this could be a useful resource for people who are interested in meta analysis, and particularly those who might be interested in collaborating to build a better mousetrap. (=

Labels: , , , , , ,

Monday, July 27, 2009

Picard code contribution

Update 2: I should point out that the subject of this post has been resolved. I'll mark it down to a misunderstanding. The patches I submitted were accepted several days after being sent and rejected, once the purpose of the patch was clarified with the developers. I will leave the rest of the post here, for posterity sake, and because I think that there is some merit to the points I made, even if they were misguided in their target.

Today is going to be a very blog-ful day. I just seem to have a lot to rant about. I'll be blaming it on the spider and a lack of sleep.

One of the things that thrills me about Open Source software is the ability for anyone to make contributions (above and beyond the ability to share and understand the source code) - and I was ecstatic when I discovered the java based Picard project, an open source set of libraries for working with SAM/BAM files. I've been slowly reading through the code, as I'd like to use it in my project for reading/writing SAM format files - which nearly all of the aligners available are moving towards.

One of those wonderful tools that I use for my own development is called Enerjy. It's an Eclipse plug-in designed to help you write better java code by making suggestions about things that can be improved. A lot of it's suggestions are simple: re-order imports to make them alphabetical (and more readable), fill in missing javadoc flags, etc. They're not key pieces, but they are important to maintain your code's good health. It does also point the way to things that will likely cause bugs as well (such as doing string comparisons with the "==" operator).

While reading through the Picard libraries and code, Enerjy threw more than 1600 warnings. It's not in bad shape, but it's got a lot of little "problems" that could easily be fixed. Mainly a lot of missing javadoc, un-cast generic types, arrays being passed between classes and the like. As part of my efforts to read through and understand the code, which I want to do before using it, I figured I'd fix these details. As I ramped up into the more complex warnings, I wanted to start small while still making a contribution. Open source at it's best, right?

The sad part of the tale is that open source only works when the community's contributions are welcome. Apparently, with Picard, code cleaning and maintenance isn't. My first set of patches (dealing mainly with the trivial warnings) were rejected. With that reception, I'm not going to waste my time submitting the second set of changes I made. That's kind of sad, in my opinion. I expressly told them that these patches were just a small start and that I'd begin making larger code contributions as my familiarity with the code improves - and at this rate, my familiarity with the code is definitely not going to mature as quickly, since I have much less motivation to clean up their warnings if they themselves aren't interested in fixing them.

At any rate, perhaps I should have known. Open source in science usually means people have agendas about what they'd like to accomplish with the software - and including contributions may mean including someone on a publication downstream if and when it does become published. I don't know if that was the case here: it was well within the project leader's rights to reject my patches on any grounds they like, but I can't say it makes me happy. I still don't enjoy staring at 1600+ warnings every time I open Eclipse.

The only lesson I take away from this is that next time I see "Open Source" software, I'll remember that just because it's open source, it doesn't mean all contributions are welcome - I should have confirmed with the developers before touching the code that they are open to small changes, and not just bug fixes. In the future, I suppose I'll be tempering my excitement for open source science software projects.

update: A friend of mine pointed me to a link that's highly related. Anyone with an open source project (or interested in getting started in one) should check out this blog post titled Teaching people to fish.

Labels: , , , , , ,

Thursday, July 9, 2009

New Tool: KeepNote

Obviously I haven't updated much here lately - I've been pretty busy and inspiration hasn't struck me much in the last few days to get anything written. However, I started using some new software this morning, and I'm enjoying it so much I figured I have to share.

One of the big problems I have, as a bioinformatician, is keeping track of all the notes and one off scripts I write. I don't want to use an SVN, because it's just a repository with no organization. I don't want to use a wiki, because it's a huge hassle to maintain for small projects, and I hate using text files.

The compromise, it seems, is to use standards compliant files with a hell of a wrapper around them that does the organization for you, and the one I found is called KeepNote. The project page and downloads can be found at The software is available for all major OS (Linux, Mac and even Windows), and can be installed relatively quickly and (for the most part) painlessly. (Linux builds are missing a library in the dependencies, but that can be figured out pretty quickly - just apt-get the missing lib and re-install if you hit this problem.)

While it may not fit everyone's workflow, my few hours of using it have already helped me get my tools organized and assembled in a logical manner, and it's allowed me to remove a load of files from my desktop. There are still bugs with it: I had to manually do some configuration of the the web browser, text editor and such before I could get started, but so far I haven't hit any of the bugs.

It also claims to help you organize notes - which I can clearly see. next time I go to a conference, I'll be using this for recording and organizing the usual 30-40 pages of notes I take.

For me, this falls under the heading of required tools for bioinformaticians and students alike and I look forward to seeing the project evolve and grow.

Labels: , , ,

Friday, May 29, 2009

Science Cartoons - 3

I wasn't going to do more than one comic a day, but since I just published it into the FindPeaks 4.0 manual today, I may as well put it here too, and kill two birds with one stone.

Just to clarify, under copyright laws, you can certainly re-use my images for teaching purposes or your own private use (that's generally called "fair use" in the US, and copyright laws in most countries have similar exceptions), but you can't publish it, take credit for it, or profit from it without discussing it with me first. However, since people browse through my page all the time, I figure I should mention that I do hold copyright on the pictures, so don't steal them, ok?

Anyhow, Comic #3 is a brief description of how the compare in FindPeaks 4.0 works. Enjoy!

Labels: , , , , , , , , ,

Thursday, May 21, 2009


Apparently there's a rumour going around that I think FindPeaks isn't a good software program, and that I've said so on my blog. I don't know who started this, or where it came from, but I would like to set the record straight.

I think FindPeaks is one of the best ChIP-Seq packages currently available. Yes, I would like to have better documentation, yes there are occasionally bugs in it, and yes, the fact that I'm the only full-time developer on the project are detriments.... But those are only complaints, and they don't diminish the fact that I think FindPeaks can out peak-call, out control, out compare.... and will outlast, outwit and outplay all of it's opponents. (=

FindPeaks is actually a VERY well written piece of code, and has the flexibility to do SO much more than just ChIP-Seq. If I had my way, I'd have another developer or two working with the code to expand many of the really cool features that it already has.

If people are considering using it, I'm always happy to help get them started, and I'm thrilled to add more people to the project if they would like to contribute.

So, just to make things clear - I think FindPeaks (and the rest of the Vancouver Short Read Analysis Package) are great pieces of software, and are truly worth using. If you don't believe me, send me an email, and I'll help make you a believer too. (=

Labels: , ,

Wednesday, November 19, 2008

Has Apple gone too far?

As a bioinformatician, I enjoy a good looking piece of computer hardware and, for the last few years, the best looking hardware around has been the Apple Macs. I've even thought about buying their new macbooks, although for the same specs, you can pick up a dell on sale at 1/3rd of the price, so it's hardly a good deal. I really can't see myself running anything other than Linux on it, though, so despite the beautiful engineering, I can't see myself paying ~$300 for an OS I'd just remove. (I was even upset at paying ~$50 for a copy of Windows XP with my current laptop. Drop me a line if you want to buy the license - It doesn't even have a Valid EULA... but that's another story.)

Anyhow, I've got to admit, Apple has finally managed to turn me off completely. Check out this article. To paraphrase, Apple has decided to follow suit with Microsoft and Intel in order to prevent you from enjoying the content you own in the way you'd like to use it. In other words, Mac OSX is now claiming control over your media files. (And, I might add, this is not about copyright, because the article shows uses that are clearly restricting "fair use" as well.) DRM is now built right into your hardware, and if your hardware isn't DRM enabled, you can't use it. Ouch.

I feel sorry for those people who have jumped the Microsoft ship just to end up in the Apple camp and are about to discover that Apple doesn't have their best interests at heart either. Why shouldn't you be able show a movie on an external monitor or projector?

In the long run, this is probaby good advertising for GNU/Linux, which doesn't enforce media company greed on it's users. So, if anyone wants a free Ubuntu disk to make their Apple harware work for them instead of against them, here you go.

Labels: , ,

Wednesday, October 8, 2008

Maq Bug

I came across an interesting bug today, when trying to work with reads aligned by Smith-Waterman (flag = 130), in PET alignments. Indeed, I even filed a Bug Report on it.

The low down on the "bug" or "feature" (I'm not sure which it is yet), is that sequences containing deletions don't actually show the deletions - they show the straight genomic sequence at that location. The reason that I figure this is a bug instead of by design is because sequences with insertions show the insertion. So why the discrepancy?

Anyhow, the upshot of it is that I'm only able to use 1/2 of the Smith-Waterman alignments maq produces when doing SNP calls in my software. (I can't trust that the per-base quality scores align to the correct bases with this bug) Since other people are doing SNP calls using MAQ alignments... what are they doing?

Labels: , , ,

Tuesday, July 22, 2008

SNP calling from MAQ

With that title, you're probably expecting a discussion on how MAQ calls snps, but you're not going to get it. Instead, I'm going to rant a bit, but bear with me.

Rather than just use the MAQ snp caller, I decided to write my own. Why, you might ask? Because I already had all of the code for it, my snp caller has several added functionalities that I wanted to use, and *of course*, I thought it would be easy. Was it, you might also ask? No - but not for the reasons you might expect.

I spent the last 4 days doing nothing but working on this. I thought it would be simple to just tie the elements together: I have a working .map file parser (don't get me started on platform dependent binary files!), I have a working snp caller, I even have all the code to link them together. What I was missing was all of the little tricks, particularly the ones for intron-spanning reads in transcriptome data sets, and the code that links together the "kludges" with the method I didn't know about when I started. After hacking away at it, bit by bit things began to work. Somewhere north of 150 code commits later, it all came together.

If you're wondering why it took so long, it's three fold:

1. I started off depending on someone else's method, since they came up with it. As is often the case, that person was working quickly to get results, and I don't think they had the goal of writing production quality code. Since I didn't have their code (though, honestly, I didn't ask for it either since it was in perl, which is another rant for another day) it took a long time to settle all of the 1-off, 2-off and otherwise unexpected bugs. They had given me all of the clues, but there's a world of difference between being pointed in the general direction towards your goal and having a GPS to navigate you there.

2. I was trying to write code that would be re-usable. That's something I'm very proud of, as most of my code is modular and easy to re-purpose in my next project. Half way through this, I gave up: the code for this snp calling is not going to be re-usable. Though, truth be told, I think I'll have to redo the whole experiment from the start at some point because I'm not fully satisfied with the method, and we won't be doing it exactly this way in the future. I just hope the change doesn't happen in the next 3 weeks.

3. Name space errors. For some reason, every single person has a different way of addressing the 24-ish chromosomes in the human genome. (Should we include the mitochondrial genome in our own?) I find myself building functions that strip and rename chromosomes all the time, using similar rules. Is the Mitochondrial genome a "MT" or just "M"? What case do we use for "X" and "Y" (or is it "x" and "y"?) in our files? Should we pre-pend "chr" to our chromsome names? And what on earth is "chr5_random" doing as a chromosome? This is even worse when you need to compare two active indexes, plus the strings in each read... bleh.

Anyhow, I fully admit that SNP calling isn't hard to do. Once you've read all of your sequences in, determined which bases are worth keeping (prb scores), determined the minimum level of coverage, minimum number of bases that are needed to call a snp, there's not much left to do. I check it all against the Ensembl database to determine which ones are non-synonymous, and then: tada, you have all your snps.

However, once you're done all of this, you realize that the big issue is that there are now too many snp callers, and everyone and their pet dog is working on one. There are several now in use at the GSC: Mine, at least one custom one that I'm aware of, one built into an aligner (Bad Idea(tm)) under development here and the one tacked on to the swiss army knife of aligners and tools: MAQ. Do they all give different results, or is one better than another? who knows. I look forward to finding someone who has the time to compare, but I really doubt there's much difference beyond the alignment quality.

Unfortunately, because the aligner field is still immature, there is no single file output format that's common to all aligners, so the comparison is a pain to do - which means it's probably a long way off. That, in itself, might be a good topic for an article, one day.

Labels: , , , ,

Wednesday, April 2, 2008

New ChIP-Seq tool from Illumina

Ok, I had to blog this. Someone on the SeqAnswers forum brought it to my attention that Illumina has a new tool for ChIP-Seq experiments. That in itself doesn't bother me - the more people in this space, the faster we learn about what makes us tick.

What surprises me, though, is the tool itself (beadstudio data analysis software - chip sequencing module). It's implemented only for Windows, for one. (Don't most self-respecting scientists use Macs or Linux these days? Or at least use and develop tools that can be used cross-platform?) Second, the feature set appears to be a re-implementation of the UCSC Genome Browser. Given the choice between the two, I don't see any reason to buy the Illumina version. (Yes, you have to pay for it, whereas UCSC is free and flexible.) I can't tell if it loads bed files or wig files, but the screen shots show a rather unflexible tool that looks like a graphical version of Gap4 or Consed. I'm not particularly impressed.

Worse still, I can't see this being implemented in a pipeline. If you're processing 100's of ChIP-Seq experiments in a year, or 1000's once this technique really starts to hit it's stride, why would you want to force it all through a GUI? I just don't get it.

Well, what do I know? Maybe there's a big market for people out there who don't want free cross-platform tools, and would rather pay for a brand name science application than use something that works. Come to think of it, I'm willing to bet there are a few pharma companies out there who do think like that, and Illumina is likely to conquer that market with their tool. Happy clicking, Vista users.

Labels: , ,