Thanks for visiting my blog - I have now moved to a new location at Nature Networks. URL: http://blogs.nature.com/fejes - please come visit my blog there.

Wednesday, September 30, 2009

Second Gen Sequencer Map and naming our generations properly

Yes, this is old news, but I've found myself searching for the second-generation sequencing map that was started at SEQanswers several times in the last few days. Just to make it even easier to find - and, of course, to give some positive publicity to a really cool project - here's the Google Maps-based list of facilities that run second-generation (next-generation) sequencing machines:

http://tinyurl.com/orm8cr

For the record, I understand a few places are still missing. I hear there's some second-generation sequencing going on in Alberta, which clearly hasn't appeared on the map yet.

And, as a footnote, now that I see other people have picked up on it and started calling "next-generation sequencing" by the more appropriate label of "second-generation sequencing" (e.g. here at Nature), I'm going to drop next-gen as a label entirely. First-gen is Sanger/dideoxy/capillary/etc., second-gen is pyrosequencing, and third-gen is biotech-based (using cellular components such as DNA polymerases and the like). Let's end the confusion and name our generations accordingly, shall we?


Monday, September 28, 2009

Recursive MC solution to a simple problem...

I'm trying to find balance between writing and experiments/coding. You can't do both at the same time without going nuts, in my humble opinion, so I've come up with the plan of alternating days. One day of FindPeaks work, one day on my project. At that rate, I may not give the fastest responses (yes, I have a few emails waiting), but it should keep me sane and help me graduate in a reasonable amount of time. (For those of you waiting, tomorrow is FindPeaks day.)

That left today to work on the paper I'm putting together. Unfortunately, working on the paper doesn't mean I don't have any coding to do. I had a nice simulation that I needed to run: given the data sets I have, what overlaps would I expect by random chance?

Of course, I hate solving a problem once - I'd rather solve the general case and then plug in the particulars.

Today's problem can be summed up as: "Given n data sets, where data set k contains i_k genes, what is the expected number of genes common to each possible overlap of 2 or more data sets?"

My solution, after thinking about the problem for a while, was to use recursion. Not surprisingly, I haven't written recursive code in years, so I was a little hesitant to give it a shot. Regardless, I whipped up the code and gave it a shot - and it worked the first time. (That's something of a rarity with my code - I'm a really good debugger, but I can be sloppy when writing code quickly the first time.) Best of all, the code is extensible: if I have more data sets later, I can just add them in and re-run. No code modification needed beyond changing the data. (Yes, I was sloppy and hard-coded the data, though it would be trivial to read it from a data file if someone wants to re-use this code.)
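For the curious, the idea looks something like the sketch below. This is not the code from the repository - just a toy Python reconstruction, and the genome size, set sizes and trial count are all made-up placeholders:

import random

# Toy parameters (all made up): a universe of GENOME_SIZE genes, with one
# gene set of each listed size drawn uniformly at random from it.
GENOME_SIZE = 20000
SET_SIZES = [1200, 800, 450]
TRIALS = 1000

def overlaps(sets, chosen=(), start=0, results=None):
    # Recursively enumerate every combination of 2 or more sets and
    # record how many genes are common to all sets in the combination.
    if results is None:
        results = {}
    for i in range(start, len(sets)):
        combo = chosen + (i,)
        if len(combo) >= 2:
            results[combo] = len(set.intersection(*(sets[j] for j in combo)))
        overlaps(sets, combo, i + 1, results)
    return results

# Monte Carlo: average each combination's overlap over many random trials.
totals = {}
for _ in range(TRIALS):
    random_sets = [set(random.sample(range(GENOME_SIZE), n)) for n in SET_SIZES]
    for combo, size in overlaps(random_sets).items():
        totals[combo] = totals.get(combo, 0) + size

for combo, total in sorted(totals.items()):
    print(combo, total / TRIALS)  # expected overlap by random chance alone

Adding another data set is just one more entry in SET_SIZES, which is the sort of extensibility I was happy about.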

Anyhow, it turned out to be an elegant solution to a rather complex problem - and I was happy to see that the results for the real experiment stick out like a sore thumb: the observed overlap is far greater than random chance would predict.

If anyone is interested in seeing the code, it was uploaded into the Vancouver Short Read Analysis Package svn repository: here. (I'm doubting the number of page views that'll get, but what the heck, it's open source anyhow.)

I love it when code works properly - and I love it even more when it works properly the first time.

All in all, I'd say it's been a good day, not even counting the 2 hours I spent at the fencing club. En garde! (-;


Tuesday, September 22, 2009

Time to publish?

Although it's not quite the first time in my life I've been told I'm slacking, I got the lecture from my supervisor yesterday. To paraphrase: "You're sitting on data. Publish it now!"

I guess there's a spectrum of people out there in research: those who publish fast and furious, and those who publish slowly and painstakingly. I'm on the far end of that spectrum: I like to make sure the data I have is really right before pushing it out the door.

This particular data set was collected about a year and a half ago, back when 36-bp Illumina reads were all the rage, so yes, I've been sitting on it for a long time. However, if you read my notebooks, there's a clear evolution. Even in my database, the table names are marked "tbl_run5_", so you can get an idea of how many times I've done this analysis. (I didn't start with the database on the first pass.)

At this point, and as of late last week (aka Thursday), I'm finally convinced that my analysis of the data is reproducible, reliable and accurate - and I'm thrilled to get it down in a paper. I just look back at the lab book full of markings and have to wonder what would have happened if I'd published earlier... bleh!

So, this is my own personal dilemma: how do you publish as quickly as possible without opening the door to mistakes? I always err on the side of caution, but judging by what makes it out to publication, maybe that's not the right path. I've heard stories of people sending results they knew to be incorrect to the reviewers, on the assumption that they could fix them up by the time the reviewers came back with comments.

Balancing publication quality, quantity and speed is probably another of those necessary skills that grad students just magically pick up somewhere along the way towards getting a PhD. (Others on the list include teaching a class of 300 undergraduates, getting and keeping grants, and starting up your own lab.)

I think I'm going to spend a few minutes this afternoon (between writing paragraphs, maybe?) looking for a good grad school HOWTO. The few I've come across haven't dealt with this particular subject, but I'm sure it's out there somewhere.


Thursday, September 17, 2009

3-year post-doc? I hope not!

I started replying to a comment left on my blog the other day and then realized it warranted a little more than just a footnote on my last entry.

This comment was left by "Mikael":

[...] you can still do a post-doc even if you don't think you'll continue in academia. I've noticed many life science companies (especially big pharmas) consider it a big plus if you've done say 3 years of post-doc.


I definitely agree that it's worth doing a post-doc, even if you decide you don't want to continue down the academic pathway. I'm beginning to think that the best time to make that decision (ivory tower vs. indentured slavery) may actually be during your post-doc, since that's the closest you'll come to being a professor before making the decision. As a graduate student, I'm not sure I'm fully aware of the risks and rewards of the academic lifestyle. (I haven't yet taken a course on the subject, and one only gets so much of an idea through exposure to professors.)

However, at this point, I can't stand the idea of doing a 3-year post-doc. After 6 years of undergrad, 2.5 years of a masters, 3 years of (co-)running my own company, and about 3.5 years of a PhD by the time I'm done... well, 3 more years of school is about as appealing as going back to the wet lab. (No, glassware and I don't really get along.)

If I'm going to do a post-doc (and I probably will), it will be a short and sweet one - no more than a year and a half at the longest. I have friends who are stuck in 4-5 year post-docs, and I've heard of people doing 10-year post-docs. I know what being a post-doc for that long means: not a good career-building move. If you're not getting publications out quickly in your post-doc, I can imagine it won't reflect well on your C.V., destroying your chances of moving into the limited number of faculty positions - and wreaking havoc on your chances of getting grants.

Still, it's more about what you're doing than how long you're doing it. I'd consider a longer post-doc if it were in a great lab with the possibility of many good publications. If there's one thing I've learned from discussions with collaborators and friends who are years ahead of me, it's that getting into a lab where publications aren't forthcoming - and where you're not happy - can burn you out of science quickly.

Given that I've spent this long as a science student (and it's probably far too late for me to change my mind on becoming a professional musician or photographer), I want to make sure that I end up somewhere where I'm happy with the work and can make reasonable progress: this is a search that I'm taking pretty seriously.

[And, just for the record, if a company needs me to do 3 years of post-doc at this point, I have to wonder just who it is I'm competing with for that job - and what it is they think you learn in your 2nd and 3rd years as a post-doc.]

With that in mind, I'm also going to put my (somewhat redacted) resume up on the web in the next few days. It might be a little early - but as I said, I'm taking this seriously.

In the meantime, since I want to actually graduate soon, I'd better go see if my analyses were successful. (=


Tuesday, September 15, 2009

Depressing view of Academia

So I officially started going through available post-doc positions this week, now that I'm back from my vacation. I'm still trying to figure out what I want to do when I finish my PhD next year (assuming I do...), and of course, I came back to the academia vs. industry question.

In weighing the evidence, a friend pointed me to this article on the problems facing new scientists in academia. Somehow, it does a nice job of dissuading me from going down that route - although I'm not completely convinced industry is the way to go, either.

Read for yourself: Real Lives and White Lies in the Funding of Scientific Research


Friday, September 4, 2009

Risk Factors #2 and thanks to Dr. Steve.

Gotta love drive-by comments from Dr. Steve:

I don't have time to go into all of the many errors in this post, but for a start, the odds ratio associated with the celiac snp is about 5-10X *per allele* (about 50X for a homozygote). This HLA allele accounts for about 90% of celiac disease and its mode of action is well understood.

I understand this is just a blog and you are not supposed to be an expert, but you should do some basic reading on genetics before posting misinformation. Or better yet, leave this stuff to Daniel MacArthur.


While even the 6th-grade bullies I knew could give Dr. Steve a lesson in making friends, I may as well at least clarify my point.

To start with, Dr. Steve made one good point. I didn't know what the risk factor for a single given celiac SNP was - and thanks to Dr. Steve's incredibly educational message, I still don't. I simply made up a risk factor, which was probably the source of Dr. Steve's confusion. (I did say I made it up in the post, but apparently hypothetical situations aren't within Dr. Steve's repertoire.)

But let's revisit how risk factors work, as I understand them. If someone would like to tell me how I'm wrong, I'll accept that criticism. Telling me I'm wrong without saying why is useless, so don't bother.

A risk factor is a multiplicative factor that indicates your risk of expressing a given phenotype relative to the general population. If you have a risk factor of 5, and the phenotype appears in 10% of the general population, that means you have a 50% chance of expressing the phenotype. (0.10 x 5 = 0.50, which is 50%.)

In more complex cases, say with two independent SNPs, each with an independent risk factor, you multiply the whole set of risk factors by the probability of the phenotype appearing in the general population. (You don't need me to do the math for you, do you?)
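To make that concrete, here's the arithmetic in a few lines of Python, using only the hypothetical numbers from above (a 10% background rate, plus two made-up risk factors):

# Hypothetical numbers only: a 10% background phenotype rate and two
# independent SNPs with invented risk factors of 5 and 1.2.
background_rate = 0.10
risk_factors = [5.0, 1.2]

risk = background_rate
for factor in risk_factors:
    risk *= factor  # independent risk factors multiply

print(f"estimated individual risk: {risk:.0%}")  # prints 60%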

My rather long-winded point was that discussing risk factors without discussing the phenotype's background rate in the population is pointless, unless you know that the risk ratio leads you to a diagnostic test and predicts, in a demonstrated, statistically significant manner, that the information is actionable.

Where I went out on a limb was in discussing other unknowns: error rates in the test, and possible other factors. Perhaps Dr. Steve knows he has a 0% error rate in his DTC SNP calls, or assumes as much - I'm just not ready to make that assumption. Dr. Steve may also have objected to my point about extraneous environmental factors, which may be folded into the risk factor, although I passed over that in my previous post without much discussion.

(I would love to hear how a SNP risk factor for something like Parkinson's disease would be modulated by vitamin D levels, which depend on your latitude. That can't possibly be built into a DTC report. Oh wait, this is hypothetical again - sorry, Dr. Steve!)

My main point from the previous post was that I have a difficult time accepting that genomics consultants consider a "risk factor" a useful piece of genomic information in the absence of an accompanying expected background phenotypic risk. A risk factor is simply a modulator of risk, and if you talk about a risk factor, you absolutely need to know what the background risk is.

Ok, I'm done rehashing my point from the previous post, and that takes me to my point for today:

Dr. Steve, telling people who have an interest in DTC genomics to stay out of the conversation in favour of the experts is shooting yourself in the foot. Whether it's me or someone else, we're going to ask the questions, and telling us to shut up isn't going to get the questions answered. If I'm asking these questions (and, contrary to your condescending comment, I do have a genomics background), people without a genomics background will be asking them as well.

So I'd like to conclude with a piece of advice for you: maybe you should leave the discussion to Daniel MacArthur too - he's doing a much better job of spreading information than you are, and he does it without gratuitously insulting people.

And I thought doctors were taught to have a good bedside manner.


Variant Call Format

After earlier discussions, there is now some information on the Variant Call Format available to the public.

If you're interested in working with SNPs, this may be required reading:

http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2
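For a taste of what the format looks like, each record is a tab-delimited line with eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), followed by genotype columns. Something like this - though I'm borrowing from the style of the later, widely circulated versions of the spec, so the v3.2 draft above may differ in the details:

#CHROM  POS    ID         REF  ALT  QUAL  FILTER  INFO
20      14370  rs6054257  G    A    29    PASS    NS=3;DP=14;AF=0.5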


Best Software Licence Ever!

I was looking for some example code for a Mahalanobis distance calculator and came across what I happen to believe is the most entertaining license I have ever seen. I had to share:

The program is free to use for non-commercial academic purposes, but for course works, you must understand what is going inside to use. The program can be used, modified, or re-distributed for any purposes if you or one of your group understand codes (the one must come to court if court cases occur.) Please contact the authors if you are interested in using the program without meeting the above conditions.


The Source.
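For what it's worth, the calculation itself is short enough that you barely need anyone else's code - here's a minimal sketch in Python with NumPy (toy data, and nothing to do with the licensed program above):

import numpy as np

def mahalanobis(x, mean, cov):
    # Mahalanobis distance: sqrt((x - mean)' * inv(cov) * (x - mean))
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

data = np.random.randn(100, 3)   # toy data set: 100 points in 3 dimensions
mean = data.mean(axis=0)
cov = np.cov(data, rowvar=False)
print(mahalanobis(data[0], mean, cov))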


Thursday, September 3, 2009

DTC SNPs... no more risk factors!

I've been reading Daniel's blog again. Whenever I end up commenting on things I don't understand well, that's usually why. Still, it's always food for thought.

First of all, has anyone quantified the actual error rate on these tests? We know they have all sorts of mistakes going on. (This one was recently in the news - and yes, unlike Wikipedia, Daniel is a valid reference source for anything genomics-related.) I'll come back to this point in a minute.

As I understand it, a risk factor is an adjustment applied to the general population's likelihood of disease, in order to characterize the risk of an individual suffering from that particular disease.

So, as I interpret it, you take whatever your likelihood of having the disease was and multiply it by the risk factor. Take, for instance, a disease like Jervell and Lange-Nielsen syndrome, where 6 of every 1 million people suffer from its effects. (This is a bad example, since you would have discovered it in childhood - but ignoring that for the moment, we can assume some other rare disease with a similar rate.) If our DTC test shows a 1.17 risk factor because we carry a SNP, we multiply by 1.17:

6/1,000,000 x 1.17 ≈ 7/1,000,000

If I've understood it all correctly, that means you've gone from knowing you have a 0.0006% chance to being certain you have a 0.0007% chance of suffering from your selected disease. (What a great way to spend your money!)

But let's not stop there. Let's ask what the error rate on actually calling that SNP is. From my own experience in SNP validation, I'd guess that the validation rate is close to 80-90%. Let's even be generous and take the high end. Thus:

You've gone from being 100% sure you've got a 0.0006% chance of having a disease to being 90% sure you have a 0.0007% chance, and 10% sure you've still got a 0.0006% chance of having the disease.

Wow, I'm feeling enlightened.

Let's do the same for something like celiac disease, which is estimated to strike 1 in 250 people, but is only diagnosed in 1 in 4,700 people in the U.S.A. - and let's be generous and assume that the SNP in your DTC test has a 1.1 risk factor. (Celiac disease is far from rare, I might add.)

As a member of the average U.S. population, you had a 0.4% chance of having the disease, but only a 0.02% chance of being diagnosed with it. That's a pretty big disparity, so maybe there's a good reason to have this test done. As a Canadian, the odds are somewhat different, but let's carry on with the calculations anyhow.

Let's say you do the test and find out you have a 1.1-times risk factor for the disease. Omg, scary!

Wait, let's not freak out yet. That sounds bad, but we haven't finished the calculations.

Your test has the SNP... 1.1 x 1/250 = 0.44% likelihood that you have the disease. Because celiac disease requires a biopsy to diagnose definitively (and treatment doesn't start until you have that diagnosis), would you run out and submit yourself to a biopsy on a 0.44% chance of having the disease? Probably not, unless you have some other reason to believe you're likely to have the disease already.

Then we factor in the 90% likelihood of getting the SNP call correct: you have a 90% likelihood of having a 0.44% chance of having the disease, and a 10% likelihood of having a 0.4% chance.
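Put in code form, the whole back-of-the-envelope calculation looks something like this - a throwaway Python sketch, where every number is one of the made-up examples from above:

def dtc_risk(background, risk_factor, call_accuracy):
    # Two possibilities: the adjusted risk if the SNP call is right,
    # or the unchanged background risk if the call is wrong.
    print(f"{call_accuracy:.0%} sure your risk is {background * risk_factor:.4%}, "
          f"{1 - call_accuracy:.0%} sure it is still {background:.4%}")

dtc_risk(6.0 / 1000000, 1.17, 0.90)  # the rare-disease example
dtc_risk(1.0 / 250, 1.10, 0.90)      # the celiac example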

Ok, I'd be done panicking about now. And we've only considered two simple things here. Let's add one more, just for fun.

Let's pretend that an unknown environmental stressor is actually involved in triggering the condition, which would explain why the odds are somewhat different in Canada. Since we know nothing about that environmental trigger, we can't even project the odds of coming into contact with it. Who knows what effect it has on the SNP you know about.

By now, I can't help thinking that all of this is just a wild goose chase.

So, when people start talking about how you have to take your DTC results to a genetic counsellor or to your MD, I really have to wonder. I can't help but think that unless you have a very good reason to suspect a disease, or some form of a priori knowledge, this whole thing is generally a waste. Your genetic counsellor will probably just laugh at you, and your MD will order a lot of unnecessary tests - which of those sounds productive?

Let me make a proposal (and I'm happy to hear dissent): risk factors are great, but they're absolutely useless when it comes to discussing how genetic factors affect you. Let's leave the risk factors to the people writing the studies, and ask the DTC companies to make a statement: what are your odds of being affected by a given condition? And if you can't make a helpful prediction (aka a diagnostic test), maybe you shouldn't be selling it as a test.


Tuesday, September 1, 2009

How much time I spend...

I was just thinking about the division of time amongst the various things I work on - and realized it's pretty bizarre. Unlike most grad students, I have to interface with people using my software for many different analysis types - some of which are "production" quality. That has its own challenges, but I'll leave them for another day.

I figured I could probably recreate my average week in pie chart form, covering the work I've been doing...

[Pie chart of estimated weekly hours went here.]
Honestly, though, it's just an estimate - and the sum is actually more than 40 hours a week. (I do work in the evenings sometimes, and support for FindPeaks happens when I check my email in the evening, too. To compensate, I may have been stingy on the hours spent goofing off....)

Anyhow, I think it would be an interesting project to try to keep track of how I spend my time. Maybe I'll give it a try when I come back from vacation. (Yes, I'll be away next week.)

Still, even from this estimate, three things are very clear:
  1. I need to spend more time up front writing tests for my software, to cut down on debugging.
  2. I need to spend more time reading journals.
  3. I am clearly underestimating the time I spend playing Ping Pong. But hey, I work through lunch!
