Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: - Please come visit my blog there.

Thursday, March 11, 2010

Wolfram Alpha recreates Ensembl?

Ok, this might not be my most coherent post - I'm finally getting better after being sick for a whole week, which has left my brain feeling somewhat... spongy. Several of us AGBT-ers have come down with something since getting back, and I have a theory that it was something in the food we were given.... Maybe someone slipped something into the food to slow down research at the GSC??? (-; [insert conspiracy theory here.]

Anyhow, I just received a link [aka spam] from Wolfram Alpha, via a posting on LinkedIn, letting me know all about their great new product: Wolfram Alpha now has genome information!

Looking at their quick demo, I'm somewhat less than impressed. Here's the link, if you'd like to check it out yourself: Wolfram Alpha Blog Post (Genome)

I'm unimpressed for two reasons: the first is that there are TONS of other resources that do this - and apparently do it better, from the little I've seen on the blog. For the moment, they have 11 genomes in there, which they hope to expand in the future. I'm going to have to look more closely, if I find the motivation, as I might be missing something, but I really don't see much that I can't do in the UCSC genome browser or the Ensembl web page. The second thing is that I'm still unimpressed by Wolfram Alpha's insistence that it's more than just a search engine, and that if you use it to answer a question, you need to cite it.

I'm all in favour of using really cool algorithms and searches are no exception. [I don't think I've mentioned this to anyone yet, but if you get a chance check out Unlimited Detail's use of search engine optimization to do unbelievable 3D graphics in real time.] However, if you're going to send links boasting about what you can do with your technology, do something other people can't do - and be clear what it is. From what I can tell, this is just a mash-up meta analysis of a few small publicly available resources. It's not like we don't have other engines that do the same thing, so I'm wondering what it is that they think they do that makes it worth going there for... anyone?

Worst of all, I'm not sure where they get their information from... where do they get their SNP calls from? How can you trust that, when you can't even trust dbSNP?

Anyhow, for the moment, I'll keep using resources that I can cite specifically, instead of just citing Wolfram Alpha... I don't know how reviewers would take it if I cured cancer... and cited Wolfram as my source.

Happy searching, people!


Thursday, December 17, 2009

One lane is (still) not enough...

After my quick post yesterday where I said one lane isn't enough, I was asked to elaborate a bit more, if I could. Well, I don't want to get into the details of the experiment itself, but I'm happy to jump into the "controls" a bit more in depth.

What I can tell you is that with one lane of RNA-Seq (50 bp Illumina data), all of the variations I find show up either in known polymorphism databases or as somatic SNPs, with a few exceptions - and the few exceptions turn out to be due to lack of coverage.

For a "control", I took two data sets (from two separate patients) - each with 6 individual lanes of sequencing data. (I realize this isn't the most robust experiment, but it illustrates the point.) In a perfect world, each of the 6 lanes per person would have sampled the original library equally well.

So, I matched up one lane from each patient into 6 pairs and asked the question: how many transcripts are void (less than 5 tags) in one sample and expressed at least 5x higher in the other? (I did this in both directions.)
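In code, the comparison for each pairing looks roughly like this. The tag counts below are made up for illustration, and treating a zero count as 1 for the fold calculation is my own assumption, not necessarily how the real analysis handled it:

```python
# Sketch of the lane-vs-lane comparison: find transcripts that are
# void (fewer than 5 tags) in one lane but at least 5x higher in the
# paired lane from the other patient.

def differential(lane_a, lane_b, void_max=4, fold=5):
    """Transcripts void in lane_a (<= void_max tags) whose count in
    lane_b is at least `fold` times greater (zero counts treated as 1)."""
    hits = set()
    for tx, count_b in lane_b.items():
        count_a = lane_a.get(tx, 0)
        if count_a <= void_max and count_b >= fold * max(count_a, 1):
            hits.add(tx)
    return hits

# Hypothetical per-transcript tag counts for one lane from each patient.
lane_a = {"TX1": 0, "TX2": 2, "TX3": 50}
lane_b = {"TX1": 12, "TX2": 3, "TX3": 55}

up_in_b = differential(lane_a, lane_b)
# Only TX1 qualifies: void in lane_a, well above threshold in lane_b.
```

The surprisingly small overlap described below is between hit sets like `up_in_b` computed for different lane pairings of the same two patients.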

The results aren't great. In one direction, I see an average of 1245 transcripts (about 680 genes, so there's some overlap amongst the transcript set) with a std. dev. of 38 transcripts. That sounds pretty consistent, till you look for the overlap in actual transcripts: avg 27.3 with a std. dev. of 17.4 (range 0-60). And, when we do the calculations, the most closely matched data sets only have a 5% overlap.

The results for the opposite direction were similar: an average of 277 transcripts met the criteria (std. dev. of 33.61), with an average overlap between data sets of 4.8, std. dev. 4.48 (range of 0-11 transcripts in common). The best overlap in "upregulated" genes for this dataset was just over 4% concordance with a second pair of lanes.

So, what this tells me (for a VERY dirty experiment) is that for genes expressed at the low end, the expression measured in a single lane is highly variable from lane to lane. (Sampling at the high end is usually pretty good, so I'm not too concerned about that.)

What I haven't answered yet is how many lanes is enough. Alas, I have to go do some volunteering, so that experiment will have to wait for another day. And, of course, the images I created along the way will have to follow later as well.


Monday, August 17, 2009

SNP Database v0.1

Good news, my snp database seems to be in good form, and is ready for importing SNPs. For people who are interested, you can download the Vancouver Short Read Package from SVN, and find the relevant information in

There's a schema for setting up the tables and indexes, as well as applications for running imports from maq SNP calls and running a SNP caller on any form of alignment supported by FindPeaks (maq, eland, etc...).
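To give a flavour of what such a schema might look like, here's a minimal illustrative stand-in - the table and column names are hypothetical (the real schema ships with the Vancouver Short Read Package), and sqlite3 is used only so the sketch runs anywhere; the real database is PostgreSQL:

```python
import sqlite3

# Hypothetical, minimal SNP table: one row per call, indexed by locus
# so lookups by position are fast. Not the package's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE snp (
        id INTEGER PRIMARY KEY,
        chromosome TEXT NOT NULL,
        position INTEGER NOT NULL,
        reference_base TEXT,
        observed_base TEXT,
        caller TEXT,          -- e.g. 'maq' or an alignment-based caller
        sample_id TEXT
    )
""")
conn.execute("CREATE INDEX snp_locus ON snp (chromosome, position)")

# A SNP import is then just a stream of inserts from the caller's output.
conn.execute(
    "INSERT INTO snp (chromosome, position, reference_base, "
    "observed_base, caller, sample_id) "
    "VALUES ('chr1', 12345, 'A', 'G', 'maq', 'patient1')"
)
rows = conn.execute(
    "SELECT observed_base FROM snp "
    "WHERE chromosome = 'chr1' AND position = 12345"
).fetchall()
```

The same DDL runs essentially unchanged on a stock Ubuntu + psql setup.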

At this point, there are no documents on how to use the software, since that's the plan for this afternoon, and I'm assuming everyone who uses this already has access to a postgresql database (aka, a simple ubuntu + psql setup.)

But, I'm ready to start getting feature requests, requests for new SNP formats and schema changes.

For anyone who's interested in joining this project: I'm only a few hours away from having some neat toys to play with!


Monday, July 27, 2009

how recently was your sample sequenced?

One more blog for the day. I was postponing writing this one because it's been driving me nuts, and I thought I might be able to work around it... but clearly I can't.

With all the work I've put into the controls and compares in FindPeaks, I thought I was finally clear of the bugs and pains of working on the software itself - and I think I am. Unfortunately, what I didn't count on was that the data sets themselves may not be amenable to this analysis.

My control finally came off the sequencer a couple weeks ago, and I've been working with it for various analyses (snps and the like - it's a WTSS data set)... and I finally plugged it into my FindPeaks/FindFeatures pipeline. Unfortunately, while the analysis is good, the sample itself is looking pretty bad. In looking at the data sets, the only thing I can figure is that the year and a half of sequencing chemistry changes has made a big impact on the number of aligning reads and the quality of the reads obtained. I no longer get a linear correlation between the two libraries - it looks partly sigmoidal.
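For anyone wanting to run the same sanity check on their own libraries, correlating per-gene tag counts between the two libraries is a quick first look. The counts and the Pearson helper below are purely illustrative, not my actual pipeline:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length count vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-gene tag counts from two WTSS libraries sequenced
# many months apart. A high r here can still hide a sigmoidal (rather
# than linear) relationship at the low-expression end, so plotting the
# scatter is essential, not optional.
old_library = [5, 12, 40, 100, 400]
new_library = [1, 3, 35, 110, 420]
r = pearson(old_library, new_library)
```

A near-1 correlation driven by a handful of highly expressed genes is exactly the trap: the divergence I saw lives in the low-count tail.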

Unfortunately, there's nothing to do except re-sequence the sample. But really, I guess that makes sense. If you're doing a comparison between two data sets, you need them to have as few differences as possible.

I just never realized that the time between samples also needed to be controlled. Now I have a new question when I review papers: how much time elapsed between the sequencing of your sample and its control?


Monday, June 15, 2009

Another day, another result...

I had the urge to just sit down and type out a long rant, but then common sense kicked in and I realized that no one is really interested in yet another graduate student's rant about their project not working. However, it only took a few minutes for me to figure out why it's relevant to the general world - something that's (unfortunately) missing from most grad student projects.

If you follow along with Daniel MacArthur's blog, Genetic Future, you may have caught the announcement that Illumina is getting into the personal genome sequencing game. While I can't admit that I was surprised by the news, I will have to admit that I am somewhat skeptical about how it's going to play out.

If your business is using arrays, then you'll have an easy time sorting through the relevance of the known "useful" changes to the genome - there are only a couple hundred or thousand that are relevant at the moment, and several hundred thousand more that might be relevant in the near future. However, when you're sequencing a whole genome, interpretation becomes a lot more difficult.

Since my graduate project is really the analysis of transcriptome sequencing (a subset of genome sequencing), I know firsthand the frustration involved. Indeed, my project was originally focused on identifying changes to the genome common to several cancer cell lines. Unfortunately, this is what brought on my need to rant: there is vastly more going on in the genome than small sequence changes.

We tend to believe blindly what we were taught as the "central dogma of molecular biology": genes are copied to mRNA, mRNA is translated to proteins, and the protein goes off to do its work. However, cells are infinitely more complex than that. Genes can be inactivated by small changes; they can be chopped up and spliced together to become inactivated or even deregulated; interference can be run by distally modified sequences; gene splicing can be completely co-opted by inactivating genes we barely even understand yet; and desperately over-expressed proteins can be marked for deletion by over-activated garbage-collection systems so that they never get a chance to reach where they were needed in the first place. And here we are, looking for single nucleotide variations, which make up a VERY small portion of the information in a cell.

I don't have the solution yet, but whatever we do in the future, it's not going to involve $48,000 genome re-sequencing. That information on its own is pretty useless - we'll have to study expression (WTSS or RNA-Seq, so figure another $30,000), changes to epigenetics (of which there are many histone marks, so figure 30 x $10,000) and even DNA methylation (I don't begin to know what that process costs.)

So, yes, while I'm happy to see genome re-sequencing move beyond the confines of array based SNP testing, I'm pretty confident that this isn't the big step forward it might seem. The early adopters might enjoy having a pretty piece of paper that tells them something unique about their DNA, and I don't begrudge it. (In fact, I'd love to have my DNA sequenced, just for the sheer entertainment value.) Still, I don't think we're seeing a revolution in personal genomics - not quite yet. Various experiments have shown we're on the cusp of a major change, but this isn't the tipping point: we're still going to have to wait for real insight into the use of this information.

When Illumina offers a nice toolkit that allows you to get all of the SNVs, changes in expression and full ChIP-Seq analysis - and maybe even a few mutant transcription factor ChIP-Seq experiments thrown in - and all for $48,000, then we'll have a truly revolutionary system.

In the meantime, I think I'll hold out on buying my genome sequence. $48,000 would buy me a couple more weeks in Tahiti, which would currently offer me a LOT more peace of mind. (=

And on that note, I'd better get back to doing the things I do.... new FindPeaks tag, anyone?


Friday, March 20, 2009

Universal format converter for aligned reads

Last night, I was working on FindPeaks when I realized what an interesting treasure trove of libraries I was sitting on. I have readers and writers for many of the most common aligned read formats, and I have several programs that perform useful functions. That raised the distinctly interesting possibility that all of them could be tied together in one shot... and so I did exactly that.

I now have an interesting set of utilities that can be used to convert from one file format to another: bed, gff, eland, extended eland, MAQ .map (read only), mapview, bowtie.... and several other more obscure formats.

For the moment, the "conversion utility" forces the output to bed file format (since that's the file type with the least information, and I don't have to worry about unexpected file information loss), which can then be viewed with the UCSC browser, or interpreted by FindPeaks to generate wig files. (BED files are really the lowest common denominator of aligned information.) But why stop there?
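As a rough illustration of that lowest-common-denominator idea, here's what reducing an aligned read to a BED line might look like. The input field layout below is a simplified, hypothetical eland-style record, not the package's real parser:

```python
# Sketch: any aligned read boils down to a BED line of
# (chrom, start, end, name, score, strand). The only real subtlety
# captured here is the coordinate convention: the aligner reports a
# 1-based position, while BED starts are 0-based and ends exclusive.

def aligned_to_bed(line, read_length=36):
    """Convert a simplified tab-delimited alignment record
    (name, chrom, 1-based position, strand) to a BED line."""
    name, chrom, pos, strand = line.rstrip("\n").split("\t")
    start = int(pos) - 1                      # 1-based -> 0-based
    end = start + read_length                 # end is exclusive in BED
    return "\t".join([chrom, str(start), str(end), name, "0", strand])

bed_line = aligned_to_bed("read1\tchr2\t1000\t+")
```

Because BED carries the least information, conversion in this direction can never lose anything unexpectedly - which is exactly why it makes a safe default target.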

Why not add a very simple functionality that lets any one format be converted to any other? Actually, there's no good reason not to, but it does involve some heavy caveats. Conversion from one format type to another is relatively trivial until you hit the quality strings. Since these aren't being scaled or altered, you could end up with some rather bizarre conversions unless they're handled cleanly. Unfortunately, doing this scaling is such a moving target that it's just not possible to keep up with it and do all the other development work I have on my plate. (I think I'll be asking for a co-op student for the summer to help out.)
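For the curious, here's why the quality strings are the hard part: older Solexa scores are log-odds rather than log-probabilities of error, and they live at a different ASCII offset (64) than Sanger/Phred scores (33). A minimal sketch of the standard mapping between the two scales:

```python
import math

# Solexa quality: Q_solexa = -10 * log10(p / (1 - p))
# Phred quality:  Q_phred  = -10 * log10(p)
# The standard conversion between the two scales:

def solexa_to_phred(q_solexa):
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)

def convert_quality_string(solexa_qs):
    """Solexa-encoded quality string (ASCII offset 64) ->
    Sanger/Phred encoding (ASCII offset 33)."""
    return "".join(
        chr(int(round(solexa_to_phred(ord(c) - 64))) + 33)
        for c in solexa_qs
    )

# The scales converge at high quality and diverge at low quality:
high = round(solexa_to_phred(30), 2)   # ~30.0
low = round(solexa_to_phred(-5), 2)    # ~1.19
```

The moving-target part is that instruments and pipelines kept changing which encoding they emitted, so a converter has to know (or guess) the provenance of every file it touches.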

Anyhow, I'll be including this nifty utility in my new tags. Hopefully people will find the upgraded conversion utility to be helpful to them. (=


Thursday, January 29, 2009

Canada's Government is pulling the carpet from Genomics.

Canada's Conservative Party, which currently controls the parliament, has decided that it won't fund genomics research any more. They seem to have decided that the $140 million it takes to power Canada's genome centres is too much, although they don't have a problem pouring a billion and a half into building upgrades. Wow. Just... wow.

I have never been so impressed with the shortsightedness of the Conservatives. Yes, it's great to have a nice shiny genomics building, but when there's no money to operate the machinery inside it, that's just sad. Considering the work that's being done in Canada with the new genomics technology, this is like deciding that all electronics after the invention of the lightbulb are superfluous. Great job, guys. Genome Canada, through Genome BC, has played a large part in the work I do on breast cancer, on ChIP-Seq work... etc.

Anyhow, this sudden disbelief in science from the government has me wondering what the future has in store for science research in Canada. In the next two years I'll be leaving the happy confines of the Genome Sciences Centre with a nifty doctorate, and my hope was that I'd be able to stay in Vancouver (ideally), or at least in Canada, to do a post-doc or something genomics related. Unfortunately, the biggest agency helping to get this type of research off the ground was Genome Canada and its affiliates. Now that they've had a $140 million/year budget pulled out from under them, I'm guessing that's pretty darn unlikely.

This means we'll be dropping funding for age-related diseases, cancers, promising new pharmaceutical technologies.... ok, I'm not going to list it all out. I sure hope the Government knows what it's doing, because all I see in the future is another move... and this time it'll probably be south of the border.

Well, either that or I go back to school for another two years and learn carpentry to help build the empty buildings on campus.
