
Friday, August 21, 2009

SNP Database v0.2

My SNP database is now up and running, with the first imports of data working well. That's a huge improvement over v0.1, where the data had to be entered under pretty tightly controlled circumstances. The API now uses locks and better indexes, and I've even tuned the database a little. (I also cheated a little and boosted the P4 running it to 1 GB of RAM.)

So, what's most interesting to me? Some of the early stats:

11,545,499 SNPs in total, made from:
  • 870,549 SNP calls from the 1000 Genomes Project
  • 11,361,676 SNPs from dbSNP
So, some quick math:
11,361,676 + 870,549 - 11,545,499 = 686,726 SNPs that overlapped between the 1000 Genomes Project calls (34 data sets) and the dbSNP calls.

That means a whopping 1.6% of the SNPs in my database (870,549 - 686,726 = 183,823) were not previously annotated in dbSNP.
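For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation above. The variable names are only for illustration, and it assumes the database total counts each distinct SNP once, so the overlap follows from inclusion-exclusion.

```python
# Counts quoted in the post.
thousand_genomes = 870549   # SNP calls from the 1000 Genomes data sets
dbsnp = 11361676            # SNPs imported from dbSNP
total = 11545499            # distinct SNPs in the database

# Inclusion-exclusion: overlap = |A| + |B| - |A union B|
overlap = thousand_genomes + dbsnp - total

# SNPs seen only in the 1000 Genomes calls, i.e. not already in dbSNP
novel = thousand_genomes - overlap
novel_pct = 100.0 * novel / total

print(overlap)              # 686726
print(novel)                # 183823
print(round(novel_pct, 1))  # 1.6
```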

I suppose that's not a bad thing, since those samples were all "normals", and it's good to get some sense as to how big dbSNP really is.

Anyhow, now the fun with the database begins. A bit of documentation, a few scripts to start extracting data, and then time to put in all of the cancer datasets....

This is starting to become fun.



Blogger Dan Fornika said...

Re: your tweet about "non-lethal, non-detrimental" i.e. neutral mutations saturate in the genome.

In a Wright-Fisher ideal population (no selection, random mating, no migration, non-overlapping generations) the average number of generations it takes for a new mutation to become fixed is 4 times the effective population size.

Different human populations have different effective population sizes, due to population bottlenecks in the past. One estimate is ~3,100 for CEU, and ~7,500 for YRI.

So a good guess would be about 20,000 generations (500,000 years).
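As a rough sketch of that estimate in Python, here is the 4 x Ne rule applied to each population separately. The 25-year generation time is an assumption (it is what 20,000 generations and 500,000 years imply), and the names are illustrative only.

```python
# Effective population size estimates quoted in the comment.
ne_estimates = {"CEU": 3100, "YRI": 7500}

# Assumption: roughly 25 years per human generation.
YEARS_PER_GENERATION = 25

# Expected generations to fixation for a neutral mutation that does
# eventually fix, under the Wright-Fisher model: 4 * Ne.
fixation_generations = {pop: 4 * ne for pop, ne in ne_estimates.items()}

for pop, generations in fixation_generations.items():
    years = generations * YEARS_PER_GENERATION
    print(pop, generations, "generations,", years, "years")
```

The two populations bracket the 20,000-generation guess: 12,400 generations for CEU and 30,000 for YRI.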

August 21, 2009 3:27:00 PM PDT  
Blogger Anthony Fejes said...

That's cool - That's not at all what I meant, but a very neat calculation. (=

I was thinking in terms of "how many genomes do I need to sequence before I stop finding new neutral SNPs?"

Although, I rather guess the two are related - if I knew when the SNP was first introduced, I should be able to guess how far it's been spread - or vice versa, I suppose.

Does that work? If I know the percentage of the population carrying it, can I figure out when it happened?

August 21, 2009 4:47:00 PM PDT  
Blogger Dan Fornika said...

You can make some hand-waving guesses, but the allele frequency alone will not tell you how old the allele is.

Again, if you're looking at a neutral change (not selected for or against) then I think that the number of people who have the new allele cannot greatly exceed the number of generations since its origin (I'm badly paraphrasing from a theory of pop. genetics text I read recently but don't have nearby).

The process is sort of a random walk, so most new mutations will just die out by chance, and a few will become fixed much later than 500,000 years. Also, in real populations there are migration, wars, population bottlenecks, non-random mating, socio-economic factors... etc.

But the genome is a dynamic thing, and new mutations are being made every minute. Even when we have 1000 or 1M genomes, it is still a snapshot of this point in human history, and only for the populations that have been sequenced.

August 24, 2009 11:39:00 AM PDT  
Blogger Anthony Fejes said...

Hi Dan,

Thanks - that makes sense, and more or less reflects what I recall learning in high school population genetics.

Your point about it being a snapshot is important, however. I think a lot of people working on genomics tend to forget that it's something that will change over time (myself included). Although, I suppose it'll be a while before we need a new reference genome.

August 24, 2009 1:19:00 PM PDT  
Blogger Qi Wang said...

I've been wondering about the same thing for a while. Thanks for making my life easier by having done the analysis ~

"That is a whopping 0.15% of the SNPs in my database were not previously annotated in dbSNP."

Just did some calculation based on the numbers provided in the post.

>>> (1-11361676/11545499.)*100

>>> (870549-686726)/11545499.*100

Should it be 1.6%? Or am I missing some information...

February 16, 2010 3:41:00 PM PST  
Blogger Anthony Fejes said...

Ooops... well, as I said, quick calculations. Thanks for catching that.

February 17, 2010 2:15:00 PM PST  
