Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Wednesday, October 14, 2009

Useful error messages.... and another format rant.

I'll start with the error message, since it had me laughing, while everything else seems to have the opposite effect.

I sent a query to Biomart the other day, as I often do. Most of the time, I get back my results quickly, and have no problems whatsoever. It's one of my "go-to" sites for useful genomic data. Unfortunately, this time, whenever I tried to download the results of my query, I'd get 2-3 MB into the file before the download would die. (It was a LONG list of SNPs, and the file size was supposed to be in the 10 MB ballpark.)

Anyhow, in frustration, I tried the "email results to you" option, whereupon I got the following email message:


Your results file FAILED.
Here is the reason why:
Error during query execution: Server shutdown in progress


That has to be the first time I've ever had a server shutdown cause a result failure. Ok, it's not that funny, but I am left wondering if that was the cause of the other 10 or so aborted downloads. Anyone know if Biomart runs on Microsoft products? (-;

The other thing on my mind this afternoon is that I am still looking to see my first Variant Call Format file for SNPs. A while back, I was optimistic about seeing the VCF files in the real world. Not that I can complain, but I thought adoption would be a little faster. A uniform SNP format would make my life much more enjoyable - I now have 7 different SNP format iterators to maintain, and would love to drop most of them.
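To make the "one iterator instead of seven" wish concrete, here is a minimal sketch of what a VCF SNP iterator could look like. It assumes tab-separated data lines whose first five columns are CHROM, POS, ID, REF and ALT, with meta and header lines starting with "#". The `Snp` tuple and the `iter_vcf_snps` name are hypothetical, and so is the policy of skipping anything that isn't a single-base substitution. It is not code from any existing package.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Snp(NamedTuple):
    """Hypothetical minimal SNP record used only for this sketch."""
    chrom: str
    pos: int   # position exactly as written in the file
    ref: str
    alt: str


BASES = {"A", "C", "G", "T", "N"}


def iter_vcf_snps(handle: TextIO) -> Iterator[Snp]:
    """Yield one Snp per single-base substitution found in a VCF-style stream."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank lines, "##" meta lines and the "#CHROM" header
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # assumed policy: silently skip truncated lines
        chrom, pos, _id, ref, alt = fields[:5]
        if ref.upper() not in BASES:
            continue  # not a single reference base, so not a simple SNP
        for allele in alt.split(","):
            if allele.upper() in BASES:
                yield Snp(chrom, int(pos), ref, allele)


# Usage with a tiny in-memory example (made-up data).
example = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG,T\t50\tPASS\t.\n"
    "chr2\t5\trs1\tAT\tA\t10\tPASS\t.\n"
)
for snp in iter_vcf_snps(io.StringIO(example)):
    print(snp)
```

The point of the sketch is only that a single, well-defined column layout reduces each iterator to a few lines of splitting and filtering.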

What surprised me, upon further investigation, is that I'm also unable to find a utility that actually creates VCF files from .map, SAM/BAM, eland, bowtie or even pileup files. I know of only one SNP caller that creates VCF compatible files, and unfortunately, it's not freely available, which is somewhat un-helpful. (I don't know when or if it will be available, although I've heard rumours about it being put into our pipeline...)

That's kind of a sad state of affairs - although I really shouldn't complain. I have more than enough work on my plate, and I'm sure the same can be said for those who are actively maintaining SNP callers.

In the meantime, I'll just have to sit here and be patient... and maybe write an 8th snp format iterator.


Tuesday, August 18, 2009

new repository of second generation software

I finally have a good resource for locating second gen (next gen) sequencing analysis software. For a long time, people have just been collecting it on a single thread in the bioinformatics section of the SeqAnswers.com forum. However, the brilliant people at SeqAnswers have now spun off a wiki for it, with an easy-to-use form. I highly recommend you check it out, and possibly even add your own package.

http://seqanswers.com/wiki/SEQanswers


Monday, August 17, 2009

SNP Database v0.1

Good news, my snp database seems to be in good form, and is ready for importing SNPs. For people who are interested, you can download the Vancouver Short Read Package from SVN, and find the relevant information in
/trunk/src/transcript_analysis/SNP_Database/

There's a schema for setting up the tables and indexes, as well as applications for running imports from maq SNP calls and running a SNP caller on any form of alignment supported by FindPeaks (maq, eland, etc...).

At this point, there are no documents on how to use the software, since that's the plan for this afternoon, and I'm assuming everyone who uses this already has access to a postgresql database (aka, a simple ubuntu + psql setup.)

But, I'm ready to start getting feature requests, requests for new SNP formats and schema changes.

Anyone who's interested in joining onto this project, I'm only a few hours away from having some neat toys to play with!


Friday, July 17, 2009

Community

This week has been a tremendous confluence of concepts and ideas around community. Not that I'd expect anyone else to notice, but it really kept building towards a common theme.

The first was just a community of co-workers. Last week, my lab went out to celebrate a lab-mate's successful defense of her thesis (Congrats, Dr. Sleumer!). During the second round of drinks (Undrinkable dirty martinis), several of us had a half hour conversation on the best way to desalinate an over-salty martini. As weird as it sounds, it was an interesting and fun conversation, which I just can't imagine having with too many people. (By the way, I think Obi's suggestion wins: distillation.) This is not a group of people you want to take for granted!

The second community-related event was an invitation to move my blog over to a larger community of bloggers. While I've temporarily declined, it raised the question of what kind of community I have while I keep my blog on my own server. In some ways, it leaves me isolated, although it does provide a "distinct" source of information, easily distinguishable from other people's blogs. (One of the reasons for not moving to the larger community is the lack of distinguishing marks - I don't want to sink into a "borg" experience with other bloggers and just become assimilated entirely.) Is it worth moving over to reduce the isolation and become part of a bigger community, even if it means losing some of my identity?

The third event was a talk I gave this morning. I spent a lot of time trying to put together a coherent presentation - and ended up talking about my experiences without discussing the actual focus of my research. Instead, it was on the topic of "successes and failures in developing an open source community" as applied to the Vancouver Short Read Analysis Package. Yes, I'm happy there is a (small) community around it, but there is definitely room for improvement.

Anyhow, at the risk of babbling on too much, what I really wanted to say is that communities are all around us, and we have to seriously consider our impact on them, and the impact they have on us - not to mention how we integrate into them, both in our work and outside. If you can't maximize your ability to motivate them (or their ability to motivate you), then you're at a serious disadvantage. How we balance all of that is an open question, and one I'm still working hard at answering.

I've attached my presentation from this morning, just in case anyone is interested. (I've decorated it with pictures from the South Pacific, in case all of the plain text is too boring to keep you awake.)

Here it is (it's about 7Mb.)


Friday, May 29, 2009

Science Cartoons - 3

I wasn't going to do more than one comic a day, but since I just published it into the FindPeaks 4.0 manual today, I may as well put it here too, and kill two birds with one stone.

Just to clarify, under copyright laws, you can certainly re-use my images for teaching purposes or your own private use (that's generally called "fair use" in the US, and copyright laws in most countries have similar exceptions), but you can't publish it, take credit for it, or profit from it without discussing it with me first. However, since people browse through my page all the time, I figure I should mention that I do hold copyright on the pictures, so don't steal them, ok?

Anyhow, Comic #3 is a brief description of how the compare in FindPeaks 4.0 works. Enjoy!


Science Cartoons - 2

Comic #2/5. This one, obviously, is about aligners. I've added in the copyright on the far right, this time. If I expect people to respect my copyright, I really do need to put it on there, don't I?


Thursday, March 19, 2009

Findpeaks 3.3... continued

Patch, compile, read bug, search code, compile, remember to patch, compile, test, find bug, realize it's the wrong bug, test, compile, test....

Although I really enjoy working on my apps, sometimes a whole day goes by where tons of changes are made, and really I don't feel like I've gotten much done. I suppose it's more of the scale of things left to do, rather than the number of tasks. I've managed to solve a few mysteries and make an impact for some people using the software, but haven't got around to testing the big changes I've been working on for a few days on using different compare mechanisms for FindPeaks.

(One might then ask why I'm blogging instead of doing that testing... and that would be a very good question.)

Some quick ChIP-Seq things on my mind:
  • Samtools: there is a very complete Java/Samtools/Bamtools API that I could be integrating, but after staring at it for a while, I've realized that the complete lack of documentation on how to integrate it is really slowing the effort down. I will probably return to it next week.
  • Compare and Control: It seems people are switching to this paradigm on several other projects - I just need to get the new compare mechanism in, and then integrate it in with the control at the same time. That will provide a really nice method for doing both at once, which is really key for moving forward.
  • Eland "extended" format: I ended up reworking all of the Eland Export file functions today. All of the original files I worked with were pre-sorted and pre-formatted. Unfortunately, that's not how they exist in the real world. I now have updated the sort and separate chromosome functions for eland ext. I haven't done much testing on them, unfortunately, but that's coming up too.
  • Documentation: I'm so far behind - writing one small piece of manual a day seems like a good target - I'll try to hold myself to it. I might catch up by the end of the month, at that pace.
Anyhow, lots of really fun things coming up in this version of FindPeaks... I just have to keep plugging away.


Thursday, October 9, 2008

Maq .map to .bed format converter

I've finally "finished" the manual for one section of the Vancouver Short Read Analysis Package - though it's not findpeaks. (I'm still working on that - but it's a big application.) It could still use pictures and graphs, and stuff... but it's now functional.

One down... about 7 more manuals to write. Documentation is never the fun stuff.

What slowed this down, ironically, was my inability to read the Maq documentation. I completely missed the fact that unmapped reads are now included in PET aligned .map files, but with a different paired flag status. Previously, unmapped ends were thrown away, and I had to handle the unpaired ends. With the new version, those unmapped reads are now included, but given a status of 192, so they can be paired again - albeit there's not much information in the pairing. In fact, I can even handle the other ends as I find them, because they're given a status of 64. (Do these numbers seem arbitrary to anyone other than me?)
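For what it's worth, the two values look a little less arbitrary when written as sums of powers of two: 64 is a single bit, and 192 is 64 + 128. The sketch below just checks that arithmetic and labels the two statuses the way they are described in this post. Reading them as bit flags is an assumption here, not something quoted from the Maq documentation, and the names `describe_status` and `set_bits` are made up for illustration.

```python
from typing import List

# Status values as described in the post above.
MATE_UNMAPPED = 64    # the mapped end whose partner could not be mapped
READ_UNMAPPED = 192   # the unmapped end itself, kept so the pair can be rebuilt


def describe_status(status: int) -> str:
    """Label the two paired-flag statuses mentioned in the post."""
    if status == READ_UNMAPPED:
        return "unmapped end"
    if status == MATE_UNMAPPED:
        return "mapped end, mate unmapped"
    return "other"


def set_bits(status: int) -> List[int]:
    """Return the powers of two that add up to the given status."""
    return [1 << i for i in range(status.bit_length()) if status & (1 << i)]


for value in (MATE_UNMAPPED, READ_UNMAPPED):
    print(value, describe_status(value), set_bits(value))
```

If the bit-flag reading is right, a test such as `status & 64` would catch both ends of a half-mapped pair in one go, though that should be checked against the Maq documentation before relying on it.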

Anyhow, the .map to .bed converter finally works - and there's a manual to go with it.
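The fiddly part of any conversion to .bed is usually the coordinates, since BED uses a zero-based start and an exclusive end. Here is a minimal sketch of that one step, assuming the aligner reports a 1-based leftmost position and a read length. The function name and its parameters are hypothetical and this is not the converter's actual code.

```python
def to_bed_line(chrom: str, pos_1based: int, read_len: int,
                name: str, score: int, strand: str) -> str:
    """Format one aligned read as a six-column BED line.

    Assumes pos_1based is the leftmost aligned base, counted from 1.
    """
    start = pos_1based - 1     # BED starts are zero-based
    end = start + read_len     # BED ends are exclusive
    return "\t".join([chrom, str(start), str(end), name, str(score), strand])


# A 36 bp read whose first base sits at position 101 on chr1 (made-up data).
print(to_bed_line("chr1", 101, 36, "read_1", 30, "+"))
```

With those inputs the read covers bases 101 to 136 in 1-based terms, which BED writes as start 100 and end 136.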

Cheers.
