Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: - Please come visit my blog there.

Tuesday, August 18, 2009

new repository of second generation software

I finally have a good resource for locating second gen (next gen) sequencing analysis software. For a long time, people have just been collecting it on a single thread in the bioinformatics section of the forum, however, the brilliant people at SeqAnswers have spawned off a wiki for it, with an easy to use form. I highly recommend you check it out, and possibly even add your own package.

Labels: , , , , , , , , , , , ,

Friday, January 9, 2009

No More Maq?

Another grad student at the GSC forwarded an email to our mailing list the other day, which was in turn from the maq-help mailing list. Unfortunately, the link on the maq-help mailing list takes you to another page, which incidentally (and erroneously) complains that FindPeaks doesn't work with Maq .map files - which it does. Instead, I suggest checking out this post on SeqAnswers from Li Heng, the creator of Maq, which has a very similar message.

The main gist of it is that the .map file format will be deprecated, and there will be no new versions of the Maq software package in the future. Instead, they will be working on two other projects (from the forwarded email):
  1. Samtools: replaces maq's (reference-based) "assembly"
  2. bwa: replaces maq's "mapping" for whole human genome alignment.
I suppose it means that eventually FindPeaks should support the Samtools formats, which I'll have to look into at some point. For those of you who are still using Maq, you may need to start following those projects as well, simply because it raises the question of long-term Maq support. As with many early generation Bioinformatics tools, we'll just have to be patient and watch how the software landscape evolves.

It probably also means that I'll have to start watching the Samtools development more carefully for use with my thesis project - many of the tools they are planning seem to replace the ones I've already developed in the Vancouver Short Read Alignment Package. Eventually, I'll have to evaluate both sets against each other. (That could also be an interesting project.)

While this was news to me, it's probably no more than the expected churn of a young technology field. I'm sure it's not going to be long until even the 2nd generation sequencing machines themselves evolve into something else.

Labels: , , ,

Thursday, February 7, 2008

AGBT post #2.

Good news.. my bag arrived! I'm going to go pick it up after the current session, and finally get some clean clothes and a shave. Phew!

Anyhow, on the AGBT side of things, I just came back form the Pacific Biosciences panel discussion, which was pretty neat. The discussion was on "how many base pairs will it take to enable personalized medicine?" A topic I'm really quite interested in.

The answers stretched from infinite, to 6 Billion, to 100TB, to 100 people (if they can pick the right person), to 1 (if they find the right one). It was a pretty decent discussion, covering things from American politics, to snp finding, to healthcare... you get the idea. The moderator was also good, the host of a show (Biotechworld?) on NPR.

My one problem is that in giving their answers, they brushed on several key points, but never really followed up on it.

1) just having the genome isn't enough. Stuff like transription factor binding sites, methylation, regulation, and so forth are all important. If you don't know how the genome works, personal medicine applications aren't going to fall out of it. (Elaine Mardis did mention this, but there was little discussion of it.)

2) Financial aspects will drive this. That, in itself was mentioned, but the real paradigm shifts will happen when you can convince the U.S. insurance companies that preventive medicine is cheaper than treating illness. That's only a matter of time, but I think that will drive FAR more long term effects than having people's genomes. (If insurance companies gave obese people a personal trainer and cooking lessons, assuming their health issues are diet related, they'd save a bundle in not having to pay for diabetes medicine, heart surgery, and associated costs.... but targeting people for preventive treatment requires much more personal medicine than we have now.)

Other points that were well covered include the effect of computational power as a limiting agent in processing information, the importance of sequencing the right people, and how its impossible to predict where the technology will take us, both morally and scientifically.

Anyhow, as I'm typing this while sitting in other talks:

Inanc Birol, also from the GSC, gave a talk on his work on a new de novo assembler:

80% reconstruction of the C.elegans genome from 30x coverage, which required 6 hours (10 cpu) for data preparation and performing the assembly in less than 10 minutes on a single CPU, using under 4Gb of RAM.

There you go.. the question for me (relevant to the last posting) is "how much of the 20% remaining has poor sequencability?" I'm willing to bet it's the same.

And I just heard a talk on SSAHA_pileup, which seems to try to sort snps. Unfortunately, every SNP caller talk I see always assumes 30X coverage.. How realistic is that for human data? Anyhow, I'm sure I missed something. I'll have to check out the slides on, once they're posted.

And the talks continue....

btw, remind me to look into the fast smith-waterman in cross-match - it sounds like it could be useful.

Labels: , , ,

Tuesday, February 5, 2008

AGBT and Sequencability

First of all, I figured I'd try to do some blogging from ABGT, while I'm there. I don't know how effective it'll be, or even how real-time, but we'll give it a shot. (Wireless in Linux on the Vostro 1000 isn't particularly reliable, and I don't even know how accessible internet will be.)

Second, what I wrote yesterday wasn't very clear, so I thought I'd take one more stab at it.

Sequencability (or mappability) is a direct measure of how well you'll be able to sequence a genome using short reads. Thus, by definition, de novo sequencing of a genome is going to be a direct result of the sequencability of that genome. Unfortunately, when people talk about the sequencability, they talk about it in terms of "X% of the genome is sequencable", which means "sequencability is not zero for X% of the genome."

Unfortunately, even if sequencability is not zero, it doesn't mean you can generate all of the sequences (even if you could do 100% random fragments, which we can't), indicating that much of the genome beyond that magical "X% sequencable" is still really not assemblable. (Wow, that's such a bad word.)

Fortunately, sequencability is a function of the length of the reads used, and as the read length increases, so does sequencability.

Thus, there's hope that if we increase the read length of the Illumina machines, or someone else comes up with a way to do longer sequences with the same throughput (e.g. ABI Solid, or 454's GS FLX), the assemblability of the genome will increase accordingly. All of this goes hand in hand: longer reads and better lab techniques always make a big contribution to the end results.

Personally, I think the real answer lays in using a variety of techniques: Paired-End-Tags to span difficult to sequence areas (eg. low or zero sequencability regions), and Single-End-Tags to get high coverage... and hey throw in a few BACs and ESTs reads for good luck. (=

Labels: , , , , ,