Thanks for visiting my blog - I have now moved to a new location at Nature Networks. URL: http://blogs.nature.com/fejes - please come visit my blog there.

Wednesday, October 21, 2009

5-Way Venn diagram, with Paths

I was looking for a way to clean up my 5-way Venn diagrams (i.e., remove the regions containing zeros) when I discovered you can do some pretty amazing things in Inkscape once you convert your objects to paths.

Since I plan to use this as a figure, I've removed the relevant numbers but left the shapes - I think it's pretty obvious right away how the relationships work, which isn't bad, considering it IS a 5-way Venn diagram.

Pretty, isn't it?



As I mentioned above, the image was made in Inkscape (available for Windows/Linux/Mac). The software natively produces scalable vector graphics, which can be exported to PNG. Despite the complexity of the image, it really doesn't take long to do this, and Inkscape is pretty easy to use once you get the hang of it.

Anyhow, while it's not immediately clear how to interpret the figure, it's still an interesting representation of data that would otherwise be totally impossible to interpret with the naked eye.


Tuesday, October 20, 2009

Funny Conversations, part 2

I was amused:

-fejes- effort is relative.
-person2- No, it's absolute. That's why I gave a reference point.
-person2- Sarcasm is also absolute.
-fejes- if it were absolute, you wouldn't need a reference point.
-fejes- sarcasm also lacks a good scale.
-person2- You're somehow less funny than I am.
-fejes- that's relative.


Wednesday, October 14, 2009

Useful error messages.... and another format rant.

I'll start with the error message, since it had me laughing, while everything else seems to have the opposite reaction.

I sent a query to Biomart the other day, as I often do. Most of the time, I get my results back quickly and have no problems whatsoever. It's one of my "go-to" sites for useful genomic data. Unfortunately, every time I tried to download the results of my query, I'd get 2-3 MB into the file before the download would die. (It was a LONG list of SNPs, and the file size was supposed to be in the 10 MB ballpark.)

Anyhow, in frustration, I tried the "email results to you" option, whereupon I got the following email message:


Your results file FAILED.
Here is the reason why:
Error during query execution: Server shutdown in progress


That has to be the first time I've ever had a server shutdown cause a result failure. Ok, it's not that funny, but I am left wondering if that was the cause of the other 10 or so aborted downloads. Anyone know if Biomart runs on Microsoft products? (-;

The other thing on my mind this afternoon is that I am still waiting to see my first Variant Call Format (VCF) file for SNPs. A while back, I was optimistic about seeing VCF files in the real world. Not that I can complain, but I thought adoption would be a little faster. A uniform SNP format would make my life much more enjoyable - I now have 7 different SNP format iterators to maintain, and would love to drop most of them.
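Just to make the point concrete, here's a minimal sketch (in Java, with names I've invented purely for illustration - this isn't code from any real package) of what a single VCF-backed iterator could replace all seven with: one record type parsed from the fixed VCF columns, and one interface that everything downstream consumes.

import java.util.Iterator;

/** One SNP call, independent of the file format it was read from. */
class SnpCall {
    final String chrom;
    final int pos;      // 1-based position
    final String ref;
    final String alt;

    SnpCall(String chrom, int pos, String ref, String alt) {
        this.chrom = chrom;
        this.pos = pos;
        this.ref = ref;
        this.alt = alt;
    }

    /** Parse one VCF data line; the fixed columns are CHROM, POS, ID, REF, ALT, ... */
    static SnpCall fromVcfLine(String line) {
        String[] f = line.split("\t");
        return new SnpCall(f[0], Integer.parseInt(f[1]), f[3], f[4]);
    }
}

/** Each input format supplies one of these; the code that uses the SNPs never changes. */
interface SnpIterator extends Iterator<SnpCall> {
    void close();
}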

What surprised me, upon further investigation, is that I'm also unable to find a utility that actually creates VCF files from .map, SAM/BAM, Eland, Bowtie or even pileup files. I know of only one SNP caller that creates VCF-compatible files, and unfortunately, it's not freely available, which is somewhat unhelpful. (I don't know when or if it will be available, although I've heard rumours about it being put into our pipeline...)

That's kind of a sad state of affairs - although I really shouldn't complain. I have more than enough work on my plate, and I'm sure the same can be said for those who are actively maintaining SNP callers.

In the meantime, I'll just have to sit here and be patient... and maybe write an 8th SNP format iterator.


Thursday, October 8, 2009

Speaking English?

I have days where I wonder what language comes out of my mouth, or if I'm actually having conversations with people that make sense to anyone.

Due to unusual circumstances (Translation to English: my lunch was forcibly ejected from the fridge at work, which was incompatible with the survival of the glass-based container it was residing in at the time of the incident), I had to go out to get lunch. In the name of getting back to work quickly, as Thursdays are short days for me, I went to Wendy's. This is a reasonable approximation of the conversation I had with one of the employees.

Employee: "What kind of dressing for your salad?"

Me: "Honey-dijon, please."

Employee: "What kind of dressing do you want?"

Me: "Honey-dijon."

Employee: "dressing."

Me: "Honey-dee-john"

Employee: "What kind of dressing for your salad?"

Me: "Honey-dijahn. It says honey-dijon on the board, it's a dressing, right?"

Employee: "You have the salad with your meal?"

Me: "yes.."

Employee: "You want the Honey Mustard?"

Me: "Yes."


Sometimes I just don't get fast food joints - they make me wonder if I have Asperger's syndrome. After that conversation, I wasn't even going to touch the issue that my "Sprite, no ice" had more ice than Sprite.


Tuesday, October 6, 2009

DNA sequencing videos.

With IBM tossing its hat into the ring of "next-next-generation" sequencing, I'm starting to get lost as to which generation is which. For the moment, I'm sort of lumping things together while I wait to see how the field plays out. In my mind, first generation is anything that requires chain termination, second generation is chemical-based pyrosequencing, and third generation is single-molecule sequencing based on a nano-scale mechanical process. It's a crude divide, but it seems to have some consistency.

At any rate, I decided I'd collect a few videos to illustrate each one. For Sanger, there are a LOT of videos, many of which are quite excellent, but I only wanted one. (Sorry if I didn't pick yours.) For second- and third-generation DNA sequencing videos, the selection kind of flattens out, and two of them come from corporate sites rather than YouTube - which seems to be the general consensus repository for technology videos.

Personally, I find it interesting to see how each group is selling themselves. You'll notice some videos lean heavily on the technology, while others focus on the workflow.

As an aside, I also find it interesting to look for places where the illustrations don't make sense... there's a lovely place in the 454 video where two strands of DNA split from each other on the bead, leaving the two full strands and a complete primer sequence... mysterious! (Yes, I do enjoy looking for inconsistencies when I go to the movies.)

Ok, get out your popcorn.

First Generation:
Sanger Entry: Link



Second Generation:
Pyrosequencing Entry: Link



Helicos Entry: Link



Illumina (Corporate site): Link

(Click to see the Flash animation)


454 Entry: Link



Third Generation:

Pacific Biosciences: Link

(Click to see the Flash Video)


Oxford Nanopore Entry: Link



IBM's Entry: Link



Note: If I've missed something, please let me know. I'm happy to add to this post whenever something new comes up.


Monday, October 5, 2009

Why peak calling is painful.

In discussing my work, I'm often asked how hard it is to write a peak calling algorithm. The answer usually surprises people: It's trivial. Peak calling itself isn't hard. However, there are plenty of pitfalls that can surprise the unwary. (I've found myself in a few holes along the way, which have been somewhat challenging to get out of.)

The pitfalls, when they do show up, can be very painful - masking the triviality of the situation.

In reality, the three most frustrating things about peak calling are:
  1. Maintaining the software

  2. Peak calling without unlimited resources (e.g., 64 GB of RAM)

  3. Keeping on the cutting edge

On the whole, each of these things is a separate software design issue worthy of a couple of seconds of discussion.

When it comes to building software, it's really easy to fire up a "one-off" script. Anyone can write something that can be tossed aside when they're done with it - but code re-use and recycling is a skill (and an important one). Writing your peak finder to be modular is a lot of work, and a huge investment of time is required to keep the modules in good shape as the code grows. A good example of why this is important can be illustrated with file formats. Since the first version of FindPeaks, we've transitioned through two versions of Eland output, Maq's .map format and now on to SAM and BAM (not to mention BED, GFF, and several other more or less obscure formats). In each case, we've been able to simply write a new iterator and plug it into the existing modular infrastructure. In fact, SAM support was added quite rapidly by Tim with only a few hours of investment. That wouldn't have been possible without the massive upfront investment in good modularity.
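For anyone wondering what that modularity looks like in practice, here's a rough sketch of the pattern (my own toy classes, not FindPeaks' actual code): every format is one small class behind a common interface, so plugging in a new format means writing one new iterator and nothing else. A bare-bones BED-style reader is enough to show the shape of it.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** What the rest of the peak finder consumes, regardless of the input format. */
class AlignedRead {
    final String chromosome;
    final int start;   // coordinates kept as they appear in the file
    final int end;

    AlignedRead(String chromosome, int start, int end) {
        this.chromosome = chromosome;
        this.start = start;
        this.end = end;
    }
}

/** Each file format gets exactly one of these. */
interface ReadIterator extends Iterator<AlignedRead>, AutoCloseable {
    void close() throws IOException;
}

/** Example plug-in: a minimal BED reader (chrom, start, end, tab-separated). */
class BedReadIterator implements ReadIterator {
    private final BufferedReader in;
    private String nextLine;

    BedReadIterator(String path) throws IOException {
        in = new BufferedReader(new FileReader(path));
        nextLine = in.readLine();
    }

    public boolean hasNext() { return nextLine != null; }

    public AlignedRead next() {
        if (nextLine == null) throw new NoSuchElementException();
        String[] f = nextLine.split("\t");
        AlignedRead read = new AlignedRead(f[0], Integer.parseInt(f[1]), Integer.parseInt(f[2]));
        try { nextLine = in.readLine(); } catch (IOException e) { nextLine = null; }
        return read;
    }

    public void close() throws IOException { in.close(); }
}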

The second pitfall is memory consumption - and this is somewhat more technical. When dealing with sequencing reads, you're faced with a couple of choices: you either sort the reads and then move along them one at a time, determining where they land - OR - you pre-load all the reads, then move along the chromosome. The first model takes very little memory but requires a significant amount of pre-processing, which I'll come back to in a moment. The second requires much less CPU time - but is intensely memory-thirsty.

If you want to visualize this, the first method is to organize all of your reads by position, then walk down the length of the chromosome with a moving window, only caring about the reads that fall into the window at any given point in time. This is how FindPeaks works now. The second is to build a model of the chromosome, much like a "pileup" file, which can then be processed however you like. (This is how I do SNP calling.) In theory, it shouldn't matter which one you do, as long as all your reads can be sorted correctly. The first can usually be run with a limited amount of memory, depending on the memory structures you use, whereas the second's footprint is pretty much determined by the size of the chromosomes you're using (multiplied by a constant that also depends on the structures you use).
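Here's a toy version of the first model, just to illustrate the idea (a sketch, not FindPeaks code): reads arrive sorted by start position, and only the reads that overlap the current position are held in memory at any moment.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class WindowWalker {
    /** Each read is {start, end} with inclusive coordinates, sorted by start. */
    static void walk(List<int[]> reads) {
        Deque<int[]> window = new ArrayDeque<>();
        for (int[] read : reads) {
            // Forget anything that ends before the position we've reached.
            window.removeIf(r -> r[1] < read[0]);
            window.add(read);
            System.out.println("pos " + read[0] + " depth " + window.size());
        }
    }

    public static void main(String[] args) {
        walk(List.of(new int[]{100, 135}, new int[]{110, 145}, new int[]{200, 235}));
    }
}

The window never holds more than the local depth of coverage, which is why this model stays small no matter how big the data set gets.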

Unfortunately, using the first method isn't always as easy as you might expect. For instance, when doing alignments with transcriptomes (or indels), you often have gapped reads. An early solution to this in FindPeaks was to break each portion of the read into separate aligned reads and process them individually - which works well when correctly sorted. Unfortunately, new formats no longer allow that - using a "pre-sorted" BAM/SAM file, you can now encounter multi-part reads, but there's no real option of pre-fragmenting those reads and re-sorting. Thus, FindPeaks now has an additional layer that must read ahead and buffer SAM reads in order to make sure that the next one returned is in the correct order. (You can get odd bugs otherwise, and yes, there are many other potential solutions.)
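The buffering layer boils down to something like this sketch (again, my own illustration rather than the real implementation): the aligned blocks of each gapped read go into a priority queue, and a block is only released once the input has advanced past its start, at which point nothing earlier can still turn up.

import java.util.PriorityQueue;

class BlockBuffer {
    // Aligned blocks, {start, end}, ordered by start position.
    private final PriorityQueue<int[]> pending =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));

    /** Buffer the blocks of one (possibly gapped) read; input reads arrive sorted by start. */
    void add(int[][] blocks) {
        for (int[] block : blocks) pending.add(block);
    }

    /** Release every block that can no longer be overtaken by a later read. */
    void emitUpTo(int currentReadStart) {
        while (!pending.isEmpty() && pending.peek()[0] <= currentReadStart) {
            int[] block = pending.poll();
            System.out.println("block " + block[0] + "-" + block[1]);
        }
    }
}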

Moving along to the last pitfall: the one thing that people want out of a peak finder is that it is able to do the latest and greatest methods - and do it ahead of everyone else. That on its own is a near-impossible task. To keep a peak finder relevant, you not only need to implement what everyone else is doing, but also do things that they're not. For a group of 30 people, that's probably not too hard, but for academic peak callers, that can be a challenge - particularly since every user wants something subtly different from the next.

So, when people ask how hard it is to write their own peak caller, that's the answer I give: It's trivial - but a lot of hard work. It's rewarding, educational and cool, but it's a lot of work.

Ok, so is everyone ready to write their own peak caller now? (-;


Friday, October 2, 2009

Base quality by position

A colleague of mine was working on a nifty tool to give graphs of the base quality at each position in a read using Eland export files, which could be incorporated into his pipeline. During a discussion about how long that analysis should take (his script was taking an hour, and I said it should take about a minute to analyze 8M Illumina reads...), I ended up saying I'd write my own version of the analysis, just to show how quickly it could be done.

Well, I was wrong about it taking about a minute. It turns out the file contained more than double the originally quoted 8 million reads (QC, no-match and multi-match reads had not been filtered out), and the whole file was bzipped, which adds to the processing time.

Fortunately, I didn't have to add bzip support to the reader, as tcezard (Tim) had already added a cool "PIPE" option for piping whatever data format I want into applications of the Vancouver Short Read Analysis Package. Thus, I was able to do the following:
time bzcat /archive/solexa1_4/analysis2/HS1406/42E6FAAXX_7/42E6FAAXX_7_2_export.txt.bz2 | java6 src/projects/maq_utilities/QualityReport -input PIPE -output /projects/afejes/temp -aligner elandext

Quite a neat use of piping, really.
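For anyone curious what the analysis itself boils down to, here's a bare-bones sketch of the idea (not the QualityReport code): feed it quality strings on stdin, one per line, and it prints the average quality at each cycle. It assumes fixed-length reads and the +64 ASCII offset Illumina files used at the time - cut the quality column out of the export file before piping it in.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

class QualityByPosition {
    public static void main(String[] args) throws IOException {
        long[] sums = new long[0];
        long reads = 0;
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line; (line = in.readLine()) != null; reads++) {
            if (line.length() > sums.length) {            // grow to the longest read seen
                long[] bigger = new long[line.length()];
                System.arraycopy(sums, 0, bigger, 0, sums.length);
                sums = bigger;
            }
            for (int i = 0; i < line.length(); i++) {
                sums[i] += line.charAt(i) - 64;           // assumed ASCII offset
            }
        }
        for (int i = 0; i < sums.length; i++) {
            System.out.println((i + 1) + "\t" + (double) sums[i] / reads);
        }
    }
}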

Anyhow, the fun part is that the library was a 100-mer Illumina run, and it makes a pretty picture. Slapping the output into OpenOffice yields the following graph:



I didn't realize quality dropped so dramatically at 100bp - although I remember when qualities looked like that for 32bp reads...

Anyhow, I'll include this tool in FindPeaks 4.0.8 in case anyone is interested in it. And for the record, this run took 10 minutes, of which about 4 were taken up by bzcat. Of the 16.7M reads in the file, only 1.5M were aligned, probably due to the poor quality beyond 60-70 bp.
