Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Friday, February 5, 2010

GFF3 undocumented feature...

Earlier today, I tweeted:
Does anyone know how to decypher a diBase GFF3 file? They don't identify the "most abundant" nucleotide uniquely. seems useless to me.
Apparently, there is a solution, albeit undocumented:

The attribute "genotype" contains an IUB code that is limited to using either a single base or a double base annotation (eg, it should not contain, H, B, V, D or N - but may contain R, Y, W, S, M or K ), which then allows you to subtract the "reference" attribute (that must be canonical) from the "genotype" attribute IUB code to obtain the new SNP - but only when the "genotype" attribute is not a canonical base.

If only that were documented somewhere...

UPDATE: Actually, this turns out not to be the case at all -- there are still positions for which the "genotype" attribute is an IUB code, and the reference is not one of the called bases. DOH!

Labels: , , ,

Wednesday, October 14, 2009

Useful error messages.... and another format rant.

I'll start with the error message, since it had me laughing, while everything else seems to have the opposite reaction.

I sent a query to Biomart the other day, as I often do. Most of the time, I get back my results quickly, and have no problems whatsoever. It's one of my "go-to" sites for useful genomic data. Unfortunately, every time I tried to download the results of my query, I'd get 2-3Mb into the file before the download would die. (It was a LONG list of snps, and the file size was supposed to be in the 10Mb ballpark.)

Anyhow, in frustration, I tried the "email results to you" option, whereupon I got the following email message:


Your results file FAILED.
Here is the reason why:
Error during query execution: Server shutdown in progress


That has to be the first time I've ever had a server shutdown cause a result failure. Ok, it's not that funny, but I am left wondering if that was the cause of the other 10 or so aborted downloads. Anyone know if Biomart runs on Microsoft products? (-;

The other thing on my mind this afternoon is that I am still looking to see my first Variant Call Format file for SNPs. A while back, I was optimistic about seeing the VCF files in the real world. Not that I can complain, but I thought adoption would be a little faster. A uniform SNP format would make my life much more enjoyable - I now have 7 different SNP format iterators to maintain, and would love to drop most of them.

What surprised me, upon further investigation, is that I'm also unable to find a utility that actually creates VCF files from .map, SAM/BAM, eland, bowtie or even pileup files. I know of only one SNP caller that creates VCF compatible files, and unfortunately, it's not freely available, which is somewhat un-helpful. (I don't know when or if it will be available, although I've heard rumours about it being put into our pipeline...)

That's kind of a sad state of affairs - although I really shouldn't complain. I have more than enough work on my plate, and I'm sure the same can be said for those who are actively maintaining SNP callers.

In the meantime, I'll just have to sit here and be patient... and maybe write an 8th snp format iterator.

Labels: , , , , , ,

Friday, August 21, 2009

SNP Database v0.2

My SNP database is now up and running, with the first imports of data working well. That's a huge improvement over the v0.1, where the data had to be entered under pretty tightly controlled circumstances. The API now uses locks, better indexes, and I've even tuned the database a little. (I also cheated a little and boosted the P4 running it to 1Gb RAM.)

So, what's most interesting to me? Some of the early stats:

11,545,499 snps in total, made from:
  • 870549 snp calls from the 1000 genome project
  • 11361676 snps from dbsnp
So, some quick math:
11,361,676 + 870,549 - 11,545,499 = 686,726 Snps that overlapped between the 1000 genome project (34 data sets) and the dbSNP calls.

That is a whopping 1.6% of the SNPs in my database were not previously annotated in dbSNP.

I suppose that's not a bad thing, since those samples were all "normals", and it's good to get some sense as to how big dbSNP really is.

Anyhow, now the fun with the database begins. A bit of documentation, a few scripts to start extracting data, and then time to put in all of the cancer datasets....

This is starting to become fun.

Labels: , ,

Monday, August 17, 2009

SNP Datatabase v0.1

Good news, my snp database seems to be in good form, and is ready for importing SNPs. For people who are interested, you can download the Vancouver Short Read Package from SVN, and find the relevant information in
/trunk/src/transcript_analysis/SNP_Database/

There's a schema for setting up the tables and indexes, as well as applications for running imports from maq SNP calls and running a SNP caller on any form of alignment supported by FindPeaks (maq, eland, etc...).

At this point, there are no documents on how to use the software, since that's the plan for this afternoon, and I'm assuming everyone who uses this already has access to a postgresql database (aka, a simple ubuntu + psql setup.)

But, I'm ready to start getting feature requests, requests for new SNP formats and schema changes.

Anyone who's interested in joining onto this project, I'm only a few hours away from having some neat toys to play with!

Labels: , , , , , , , , , ,

Thursday, January 29, 2009

FindPeaks 3.3.0 and AGBT proposal

Well, I finally have a version of FindPeaks 3.3.0 that runs without known bugs. Tracking down that last bug was tricky, and took me 3 days to find and squash it. It's hard to find bugs that only happen when they're near to a fragment that is duplicated. (-;

Anyhow, now that that's working better, it's time to add in the new functionality. The most pressing parts are the controls (in two parts - one of which is a top secret collaboration, while the other is just too boring to really talk about), and the other is implementing SAM/BAMtools interface. Whenever the "new MAQ" is ready, I'd like to be prepared for it.

Incidentally, I think controls will be the easier of the two, and I think I'll be able to finish the boring parts off this week. At the rate things are going, it might be another 2 days of debugging after that, but that's what makes software writing fun. (Just imagine a tall thin guy hunched over a computer keyboard cackling insanely while staring deep into the monitor displaying green matrix-style characters drifting downward...)

At any rate, I'm also working towards my poster for AGBT, which reminds me of what else I wanted to suggest. If anyone who reads my blog is going to be at AGBT and is happy to meet up to talk some ChIP-Seq or SNP finding (or anything remotely related), let me know. I'm thinking it would be neat to gather people together who are working on the same topic and talk for a bit. (I'm even willing to miss formal talks for it, as long as they're not directly related to my work.)

So, to that effect, I'll point to this page on SeqAnswers, and suggest if anyone is interested they let me know. (= It would definitely be an efficient way to network.

Oh, and (still) for those of you who've already registered for AGBT, check out the nifty package Illumina is sending out to people. I'm HIGHLY impressed with the creative idea and timeliness. (If you don't know what it is, the suspence is killing you and you care enough to ask, I'll put the answer in the comments.) (-:

Labels: , ,