Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Friday, February 5, 2010

GFF3 undocumented feature...

Earlier today, I tweeted:
Does anyone know how to decypher a diBase GFF3 file? They don't identify the "most abundant" nucleotide uniquely. seems useless to me.
Apparently, there is a solution, albeit undocumented:

The attribute "genotype" contains an IUB code that is limited to using either a single base or a double base annotation (eg, it should not contain, H, B, V, D or N - but may contain R, Y, W, S, M or K ), which then allows you to subtract the "reference" attribute (that must be canonical) from the "genotype" attribute IUB code to obtain the new SNP - but only when the "genotype" attribute is not a canonical base.

If only that were documented somewhere...

UPDATE: Actually, this turns out not to be the case at all -- there are still positions for which the "genotype" attribute is an IUB code, and the reference is not one of the called bases. DOH!

Labels: , , ,

Wednesday, October 14, 2009

Useful error messages.... and another format rant.

I'll start with the error message, since it had me laughing, while everything else seems to have the opposite reaction.

I sent a query to Biomart the other day, as I often do. Most of the time, I get back my results quickly, and have no problems whatsoever. It's one of my "go-to" sites for useful genomic data. Unfortunately, every time I tried to download the results of my query, I'd get 2-3Mb into the file before the download would die. (It was a LONG list of snps, and the file size was supposed to be in the 10Mb ballpark.)

Anyhow, in frustration, I tried the "email results to you" option, whereupon I got the following email message:


Your results file FAILED.
Here is the reason why:
Error during query execution: Server shutdown in progress


That has to be the first time I've ever had a server shutdown cause a result failure. Ok, it's not that funny, but I am left wondering if that was the cause of the other 10 or so aborted downloads. Anyone know if Biomart runs on Microsoft products? (-;

The other thing on my mind this afternoon is that I am still looking to see my first Variant Call Format file for SNPs. A while back, I was optimistic about seeing the VCF files in the real world. Not that I can complain, but I thought adoption would be a little faster. A uniform SNP format would make my life much more enjoyable - I now have 7 different SNP format iterators to maintain, and would love to drop most of them.

What surprised me, upon further investigation, is that I'm also unable to find a utility that actually creates VCF files from .map, SAM/BAM, eland, bowtie or even pileup files. I know of only one SNP caller that creates VCF compatible files, and unfortunately, it's not freely available, which is somewhat un-helpful. (I don't know when or if it will be available, although I've heard rumours about it being put into our pipeline...)

That's kind of a sad state of affairs - although I really shouldn't complain. I have more than enough work on my plate, and I'm sure the same can be said for those who are actively maintaining SNP callers.

In the meantime, I'll just have to sit here and be patient... and maybe write an 8th snp format iterator.

Labels: , , , , , ,

Friday, September 4, 2009

Variant Call Format

After earlier discussions, there is now some information available on the Variant Call Format available to the public.

If you're intrested in working with snps, this may be required reading:

http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2

Labels: , , ,

Monday, August 17, 2009

SNP Datatabase v0.1

Good news, my snp database seems to be in good form, and is ready for importing SNPs. For people who are interested, you can download the Vancouver Short Read Package from SVN, and find the relevant information in
/trunk/src/transcript_analysis/SNP_Database/

There's a schema for setting up the tables and indexes, as well as applications for running imports from maq SNP calls and running a SNP caller on any form of alignment supported by FindPeaks (maq, eland, etc...).

At this point, there are no documents on how to use the software, since that's the plan for this afternoon, and I'm assuming everyone who uses this already has access to a postgresql database (aka, a simple ubuntu + psql setup.)

But, I'm ready to start getting feature requests, requests for new SNP formats and schema changes.

Anyone who's interested in joining onto this project, I'm only a few hours away from having some neat toys to play with!

Labels: , , , , , , , , , ,

Monday, July 27, 2009

Picard code contribution

Update 2: I should point out that the subject of this post has been resolved. I'll mark it down to a misunderstanding. The patches I submitted were accepted several days after being sent and rejected, once the purpose of the patch was clarified with the developers. I will leave the rest of the post here, for posterity sake, and because I think that there is some merit to the points I made, even if they were misguided in their target.


Today is going to be a very blog-ful day. I just seem to have a lot to rant about. I'll be blaming it on the spider and a lack of sleep.

One of the things that thrills me about Open Source software is the ability for anyone to make contributions (above and beyond the ability to share and understand the source code) - and I was ecstatic when I discovered the java based Picard project, an open source set of libraries for working with SAM/BAM files. I've been slowly reading through the code, as I'd like to use it in my project for reading/writing SAM format files - which nearly all of the aligners available are moving towards.

One of those wonderful tools that I use for my own development is called Enerjy. It's an Eclipse plug-in designed to help you write better java code by making suggestions about things that can be improved. A lot of it's suggestions are simple: re-order imports to make them alphabetical (and more readable), fill in missing javadoc flags, etc. They're not key pieces, but they are important to maintain your code's good health. It does also point the way to things that will likely cause bugs as well (such as doing string comparisons with the "==" operator).

While reading through the Picard libraries and code, Enerjy threw more than 1600 warnings. It's not in bad shape, but it's got a lot of little "problems" that could easily be fixed. Mainly a lot of missing javadoc, un-cast generic types, arrays being passed between classes and the like. As part of my efforts to read through and understand the code, which I want to do before using it, I figured I'd fix these details. As I ramped up into the more complex warnings, I wanted to start small while still making a contribution. Open source at it's best, right?

The sad part of the tale is that open source only works when the community's contributions are welcome. Apparently, with Picard, code cleaning and maintenance isn't. My first set of patches (dealing mainly with the trivial warnings) were rejected. With that reception, I'm not going to waste my time submitting the second set of changes I made. That's kind of sad, in my opinion. I expressly told them that these patches were just a small start and that I'd begin making larger code contributions as my familiarity with the code improves - and at this rate, my familiarity with the code is definitely not going to mature as quickly, since I have much less motivation to clean up their warnings if they themselves aren't interested in fixing them.

At any rate, perhaps I should have known. Open source in science usually means people have agendas about what they'd like to accomplish with the software - and including contributions may mean including someone on a publication downstream if and when it does become published. I don't know if that was the case here: it was well within the project leader's rights to reject my patches on any grounds they like, but I can't say it makes me happy. I still don't enjoy staring at 1600+ warnings every time I open Eclipse.

The only lesson I take away from this is that next time I see "Open Source" software, I'll remember that just because it's open source, it doesn't mean all contributions are welcome - I should have confirmed with the developers before touching the code that they are open to small changes, and not just bug fixes. In the future, I suppose I'll be tempering my excitement for open source science software projects.

update: A friend of mine pointed me to a link that's highly related. Anyone with an open source project (or interested in getting started in one) should check out this blog post titled Teaching people to fish.

Labels: , , , , , ,

Friday, March 20, 2009

Universal format converter for aligned reads

Last night, I was working on FindPeaks when I realized what an interesting treasure trove of libraries I was really sitting on. I have readers and writers for many of the most common aligned read formats, and I have several programs that do useful functions. So, that raise the distinctly interesting point that all of them should be applied together in one shot... and so I did exactly that.

I now have an interesting set of utilities that can be used to convert from one file format to another: bed, gff, eland, extended eland, MAQ .map (read only), mapview, bowtie.... and several other more obscure formats.

For the moment, the "conversion utility" forces the output to bed file format (since that's the file type with the least information, and I don't have to worry about unexpected file information loss), which can then be viewed with the UCSC browser, or interpreted by FindPeaks to generate wig files. (BED files are really the lowest common denominator of aligned information.) But why stop there?

Why not add a very simple functionality that lets one format be converted to the other? Actually, there's no good reason not to, but it does involve some heavy caveats. Conversion from one format type to another is relatively trivial until you hit the quality strings. since these aren't being scaled or altered, you could end up with some rather bizzare conversions unless they're handled cleanly. Unfortunately, doing this scaling is such a moving target that it's just not possible to keep up with that and do all the other devlopment work I have on my plate. (I think I'll be asking for a co-op student for the summer to help out.)

Anyhow, I'll be including this nifty utility in my new tags. Hopefully people will find the upgraded conversion utility to be helpful to them. (=

Labels: , , , , , , , , , , ,

Monday, April 21, 2008

Eland file Format

I haven't written much over the past couple of days. I have a few things piled up that all need doing urgently... and it never fails, that's when you get sick. I spent today in bed, fighting off a cold, sore throat and fever. Wonderful combination.

Anyhow, since I'm starting to feel better, I thought I'd write a few lines before going to bed, and wanted to mention that I've finally seen a file produced by the new Eland. It's a little different, and the documentation provided (ahem, that I was able to obtain from a colleague who uses the pipeline) was pretty scarce.

In fact, much of what you see in the file is pretty obvious, with the same general concept as the previous Eland files, except this has a few caveats:

1) the library name and 4-coordinate position of the sequence are all separated by tabs in one of the files I saw, but concatenated with a separating ":" in another. I'm not sure which is the real format, but there are at least 2 formats for line identification.

2) There's a string that seems to encode the base quality scores from the prb files, but it's in a format for which I can't find any information.

3) there's a new format for mismatches within the alignment. Instead of telling you the location of the mismatch, you now get a summary of the alignment itself. If Eland could do insertions, it would work well for those too. From the document, it tells you the number of aligned bases, with letters interspersed to show the mismatch. (e.g. if you had a 32 base alignment, with a mismatch A at character 10, you'd get the string "9A22".) I also understand that upper and lower case mismatches mean something different, though I haven't probed the format too much.

So, in the discussion of formats, I understand there's some community effort around using a so-called "short read format" or SRF format. It's been adopted by Helicos, GEO, as well as several other groups.

Maybe it's time I start converting Eland formats to this as well. Wouldn't it be nice if we only had to work with one format? (If only Microsoft understood that too! - ps, don't let the community name fool you, it's well known Microsoft sponsored that site.)

Labels: , ,