Thanks for visiting my blog - I have now moved to a new location at Nature Networks. URL: http://blogs.nature.com/fejes - please come visit my blog there.

Monday, August 24, 2009

Science Online London 2009 blogs

I had several cool ideas for blog posts this morning: I was going to illustrate the physics of how I think water flow affects the height of a splash when doing a "human cannonball" into a swimming pool... and I had a great idea for another cartoon-style piece on how a sugar rush affects the human body and the longevity of your life... but then I got distracted by the Science Online London 2009 blog posts.

Unfortunately, I'm on the wrong continent to attend it... and frankly, I had other things to do at 2am on a Sunday morning. (Yes, I was asleep...) But the beauty of blogged conferences is that people do actually blog them.

While I missed out on the content firsthand, there are plenty of commentaries, reviews and discussions about the conference.

Martin Fenner (Gobbledygook) has put together the most comprehensive list of posts on Science Online London 2009 that I've found so far.

One of my favorite reviews (at least, one of the most topical for me) is at the mind wobbles, which provides a summary of Mark Henderson, Dave Munger and Daniel MacArthur's "Blogging for Impact" talk. I found the points really quite useful - as always.

Since it's still just early morning here, I'm WAY behind on catching up with the blogosphere... I guess I'll have to save my illustrated posts for another day. So many blogs to read, so little time!


Friday, August 21, 2009

SNP Database v0.2

My SNP database is now up and running, with the first imports of data working well. That's a huge improvement over v0.1, where the data had to be entered under pretty tightly controlled circumstances. The API now uses locks and better indexes, and I've even tuned the database a little. (I also cheated a little and boosted the P4 running it to 1 GB of RAM.)

So, what's most interesting to me? Some of the early stats:

11,545,499 SNPs in total, made up of:
  • 870,549 SNP calls from the 1000 Genomes Project
  • 11,361,676 SNPs from dbSNP
So, some quick math:
11,361,676 + 870,549 - 11,545,499 = 686,726 SNPs that overlap between the 1000 Genomes Project calls (34 data sets) and dbSNP.
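
The same arithmetic, as a quick inclusion-exclusion sanity check (a throwaway snippet, not part of the package):

# Sanity check of the numbers above, copied from the post.
thousand_genomes = 870_549    # SNP calls from the 1000 Genomes Project
dbsnp = 11_361_676            # SNPs from dbSNP
total_in_db = 11_545_499      # distinct SNPs in the database

overlap = thousand_genomes + dbsnp - total_in_db
novel = thousand_genomes - overlap

print(overlap)                        # 686726 SNPs present in both sources
print(round(novel / total_in_db, 3))  # 0.016 -> the ~1.6% mentioned below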

That means a whopping 1.6% of the SNPs in my database were not previously annotated in dbSNP.

I suppose that's not a bad thing, since those samples were all "normals", and it's good to get some sense of how big dbSNP really is.

Anyhow, now the fun with the database begins. A bit of documentation, a few scripts to start extracting data, and then time to put in all of the cancer datasets....

This is starting to become fun.

Labels: , ,

Tuesday, August 18, 2009

new repository of second generation software

I finally have a good resource for locating second-gen (next-gen) sequencing analysis software. For a long time, people have just been collecting it in a single thread in the bioinformatics section of the SeqAnswers.com forum; however, the brilliant people at SEQanswers have now spun off a wiki for it, with an easy-to-use form. I highly recommend you check it out, and possibly even add your own package.

http://seqanswers.com/wiki/SEQanswers


Monday, August 17, 2009

Networking Session - Shepa Learning Company

I mentioned last week that I'd spent Friday at a "how-to" session on networking. The day was organized by the Mitacs group at UBC, and was run by the Shepa Learning Company, of "Work The Pond" fame. (A book on "positive networking".)

Overall, I was really glad I attended - the two women who ran the course did an excellent job. (One of them is the founder of the company "Cookies-by-George", which I remember from my childhood - man, I loved the chocolate cookies with cheese in them....) Anyhow, this is one of those things you really have to attend yourself to get the full value out of - however, I can pass along some of the more valuable tips.
  1. It's not who you know well - it's who you know vaguely. When looking for a job, it's probably not your close contacts who will hire you, but rather it'll be a connection through a connection - so cast your network wide, and make friends with everyone. (Corollary: if you see someone regularly, you should get to know their name - you never know who will be a good connection.)
  2. Good networking isn't about what people can do for you, but what you can do for them. This puts things into perspective a little better than the hustling that people usually associate with networking. In fact, this is more of a western view of karma: if you do good, good will come back to you - so engage in it with the perspective that you should meet people with the aim of being a good person and helping them out. Don't dismiss people because they can't help you - you may be able to help them, so go for it. [Actually, this point resonated very well with me, as it's the approach I take with my blog. Put information where it's available in the hope that it helps people, and some of that goodwill may come back to me one day - and so far, I have no complaints! I can already vouch for this method of networking.]
  3. Networking isn't just meeting people and exchanging cards. Always remember to follow up with the people you meet and to take the time to organize your notes/cards. I found that to be good advice - I spent a few hours organizing my card collection, making notes on where I met people, what we talked about, etc. All of this will help me next time I come across a card and want to know where it came from.
  4. Develop your brand. The way you present yourself, the way you communicate - even the way you interact - distinguishes you from other people. All of that should be reflected in your appearance, your cards, and even your "elevator pitch" when someone asks you what you do.
  5. Keep business cards everywhere - It's always a good idea to keep a few on hand when you meet someone new. Put some into the pockets of all your coats, bags, etc. Oh, and don't keep them in your wallet - it doesn't look so good and the cards get mangled.
  6. When you go to an event, set a goal. For instance, set out to meet 7 new people, or to rescue one person who is too shy to get involved. Your job at events should be to make connections - not just to meet them for yourself. Try to find people that could help each other, and put them in touch.
  7. The glowing introduction. After you meet someone, you should be able to introduce them to someone else in a flattering way. Learn how to do this: it'll help you remember who they are, and it will facilitate introductions.
  8. You have 3 seconds to make a good first impression. Make eye contact, have a firm handshake, and most importantly, decide you're going to like someone BEFORE you meet them - you'll have a good smile and you'll find you tend to like more people.
  9. Use people's names in conversation. It's a nice touch - and it helps you remember names. Win-win.
  10. Always give people your full attention when talking with them. It's just rude not to, and well, people don't multi-task as well as they think they do.
The take-away tasks are cool, though:
  • Stash business cards everywhere
  • Go to more events
  • Work out your introduction (in response to "what do you do?") - make sure it sells you and your desired brand
  • Meet more people (practice, practice, practice!)
  • Step out of your comfort zone
  • Connect people who could use each other's help
  • Reconnect with your old network
Finally - there were three secrets, which I'm going to paraphrase for you. (I didn't see a trademark sign on any of them, but they're useful.)

  1. Network to figure out what you can do for other people - don't expect a return from everyone you meet.
  2. You'll have to meet (and help) a lot of people before you meet people who can help you, so meet everyone you can.
  3. You need to give yourself permission to engage with people you don't know.
Was all of that useful to anyone? Probably not without the context of the course or even their book, but I found some of their points to be thought provoking. As for whether I'm a better networker today than I was on Thursday, I don't know, but I'm willing to go a little further out of my shell.


SNP Database v0.1

Good news: my SNP database seems to be in good form, and is ready for importing SNPs. For people who are interested, you can download the Vancouver Short Read Package from SVN, and find the relevant information in
/trunk/src/transcript_analysis/SNP_Database/

There's a schema for setting up the tables and indexes, as well as applications for running imports from Maq SNP calls and for running a SNP caller on any form of alignment supported by FindPeaks (Maq, Eland, etc...).

At this point, there are no documents on how to use the software, since writing them is the plan for this afternoon, and I'm assuming everyone who uses this already has access to a PostgreSQL database (i.e., a simple Ubuntu + psql setup).
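
Until those documents exist, here's a purely illustrative sketch of the sort of schema involved. The table and column names below are invented, and sqlite3 stands in for PostgreSQL so the snippet runs anywhere; the real schema is the one in the SVN path above.

# Hypothetical illustration only - the real schema ships in the package.
import sqlite3  # stand-in for PostgreSQL, so the sketch is self-contained

DDL = """
CREATE TABLE snp (
    snp_id     INTEGER PRIMARY KEY,
    chromosome TEXT NOT NULL,
    position   INTEGER NOT NULL,
    canonical  TEXT NOT NULL,   -- reference base
    observed   TEXT NOT NULL,   -- variant base
    source     TEXT NOT NULL    -- e.g. 'maq', 'dbsnp', '1000genomes'
);
CREATE INDEX snp_locus ON snp (chromosome, position);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.execute("INSERT INTO snp (chromosome, position, canonical, observed, source)"
             " VALUES ('chr1', 12345, 'A', 'G', 'maq')")
print(conn.execute("SELECT COUNT(*) FROM snp").fetchone()[0])  # -> 1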

But, I'm ready to start getting feature requests, requests for new SNP formats and schema changes.

Anyone who's interested in joining this project: I'm only a few hours away from having some neat toys to play with!


301st blog post.

This is my 301st post to my blog - an impressive number considering that I originally started with the idea of using it to publish my photography, and have since put almost no photography on it. Even more impressive, to me, is that I don't seem to have run out of things to talk about yet. Hopefully I still have another 300 posts in me over the next year or so.

Interestingly, three things have been popular over the course of the blog: my posts on AGBT 2009, my summary of how Eland works, and, most popular of all time, the posts with the 4- and 5-way Venn diagrams. I never would have predicted that any of those articles would find so much of an audience, but hey, I'm very pleased that people have found my blog to be useful.

Well, here's to hoping that I can find 3 more useful articles in the next 300. (-;


Saturday, August 15, 2009

What would you do with 10kbp reads?

I just caught a tweet about an article on the Pathogens blog (What can you do with 1000 base pair reads?), which is specifically about 454 reads. Personally, I'm not so interested in 454 reads - the technology is good, but I don't have access to 454 data, so it's somewhat irrelevant to me. (Not to say 1kbp reads aren't neat, but no one has volunteered to pass me 454 data in a long time...)

So, anyhow, I'm trying to think two steps ahead. 2010 is supposed to be the year that Pacific Biosciences (and other companies) release the next generation of sequencing technologies - which will undoubtedly produce reads longer than 1kbp. (I seem to recall hearing that PacBio has 10k+ reads. UPDATE: I found a reference.) So to heck with 1kbp reads; this raises the real question: what would you do with a 10,000bp read? And, equally important, how do you work with a 10kbp read?
  • What software do you have now that can deal with 10k reads?
  • Will you align or assemble with a 10k read?
  • What experiments will you be able to do with a 10k read?
Frankly, I suspect that nothing we're currently using will work well with them - we'll all have to go back to the drawing board and rework the algorithms we use.

So, what do you think?


Thursday, August 13, 2009

Ridiculous Bioinformatics

I think I've finally figured out why bioinformatics is so ridiculous. It took me a while to work this one out, and I'm still not sure if I believe it, but let me explain and see what you think.

The major problem is that bioinformatics isn't a single field; rather, it's the combination of (on a good day) biology and computer science. Each field on its own is a complete subject that can take years to master. You have to respect the biologist who can rattle off a biochemical pathway chart and then extrapolate from it to the annotations of a genome to find interesting features of a new organism. Likewise, there's some serious respect due to the programmer who can optimize code down at the assembly level to give you incredible speed while still using half the memory you initially expected. It's pretty rare to find someone capable of both, although I know a few who can pull it off.

Of course, each field on its own has some "fudge factors" working against you in your quest for simplicity.

Biologists don't actually know the mechanisms and chemistry of all the enzymes they deal with - they are usually putting forward their best guesses, which lead them to new discoveries. Biology can effectively be summed up as "reverse engineering the living part of the universe", and we're far from having all the details worked out.

Computer science, on the other hand, has an astounding amount of complexity layered over every task, with a plethora of languages and systems, each with their own "gotchas" (are your arrays zero-based or one-based? How does your operating system handle wildcards at the command line? What does your spreadsheet do to gene names like "Sep9"?), leading to absolute confusion for the novice programmer.
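
To pick on just one of those gotchas, here's the zero-based/one-based trap in miniature (a toy snippet, nothing more):

# The same three bases, addressed two different ways.
seq = "ACGTACGTAC"

# Bases 3 through 5, counting from 1 (1-based, inclusive - GFF-style):
one_based = seq[3 - 1 : 5]   # 'GTA'

# The identical region, 0-based and half-open (BED-style):
zero_based = seq[2:5]        # 'GTA'

assert one_based == zero_based == "GTA"  # forget the conversion and you're off by one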

In a similar manner, we can think about the probability of encountering these pitfalls. If you have two independent events, each with its own probability, you can multiply the probabilities to determine the likelihood of both events occurring simultaneously.

So, after all that, I'd like to propose "Fejes' law of interdisciplinary research":

The likelihood of achieving flawless work in an interdisciplinary research project is the product of the likelihood of achieving flawless work in each independent area.


That is to say, if your biology experiments (on average) are free of mistakes 85% of the time, and your programming is free of bugs 90% of the time (i.e., you get the right answer), then your likelihood of getting the right answer in a bioinformatics project is:

Fp = probability of flawless work in programming
Fb = probability of flawless work in biology
Fbp = probability of flawless work in bioinformatics

Thus, according to Fejes' law:

Fbp = Fb x Fp

and, in the example given:

Fbp = 0.85 x 0.90 = 0.765
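
For what it's worth, the law generalizes to any number of disciplines; a throwaway sketch (the third rate is an invented example):

from math import prod

def flawless_probability(*per_field_rates: float) -> float:
    """Fejes' law: flawless interdisciplinary work requires flawless
    work in every contributing field simultaneously."""
    return prod(per_field_rates)

print(flawless_probability(0.85, 0.90))        # biology x programming -> 0.765
print(flawless_probability(0.85, 0.90, 0.95))  # add, say, statistics -> ~0.727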

Thus, even someone who is both an outstanding biologist and an outstanding programmer will struggle to achieve an extremely high rate of flawless results.

Fortunately, there's one saving grace in all of this: the magnitude of the errors is not taken into account. If the bug in the code is tiny and has no impact on the conclusion, that's hardly earth-shattering; likewise, if the biology measurements have just a small margin of error, it's not going to change the interpretation.

So there you have it, bioinformaticians. If I haven't just scared you off of ever publishing anything again, you now know what you need to do...

Unit tests, anyone?
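
In that spirit, a minimal example of what chipping away at the programming half of the error budget looks like (the helper function, its name and its test values are all hypothetical, not from my actual code):

# Hypothetical helper plus a unit test for it - illustrative only.
def variant_fraction(canonical_obs: int, variation_obs: int) -> float:
    """Fraction of reads at a position that support the variant."""
    total = canonical_obs + variation_obs
    if total == 0:
        raise ValueError("no coverage at this position")
    return variation_obs / total

def test_variant_fraction():
    assert variant_fraction(6, 2) == 0.25   # 2 variant reads out of 8
    assert variant_fraction(0, 4) == 1.0    # every read supports the variant

test_variant_fraction()  # or run under a framework such as pytest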


Tuesday, August 11, 2009

SNP/SNV callers minimum useful information

Ok, I sent a tweet about it, but that didn't resolve the frustration I feel on the subject of SNP/SNV callers. There are so many of them out there that you'd think they grow on trees. (Actually, they grow on arrays...) I've written one myself, and I know there are at least three others written at the GSC.

Anyhow, at first sight, what pisses me off is that there's no standard format. Frankly, though, that's not even the big problem. What's really underlying it is that there's no standard "minimum information" content being produced by SNP/SNV callers. Many of them give a bare minimum of information, but lack the details needed to really evaluate the calls.

So, here's what I propose. If you're going to write a SNP or SNV caller, make sure your called variations contain the following fields (a sketch of such a record follows the list):
  • chromosome: obviously, the coordinate to find the location
  • position: the base position on the chromosome
  • genome: the version of the genome against which the SNP was called (e.g. hg18 vs. hg19)
  • canonical: what you expect to see at that position (invaluable for error checking!)
  • observed: what you did see at that position
  • coverage: the depth at that position (filtered or otherwise)
  • canonical_obs: how many times you saw the canonical base (key to evaluating what's at that position)
  • variation_obs: how many times you saw the variation
  • quality: give me something to work with here - a confidence value between 0 and 1 would be ideal... but let's pick something we can compare across data sets. Giving me 9 values and asking me to figure something out is cheating. Sheesh!
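To make that concrete, here's a sketch of a record carrying exactly those fields. This is just my proposal rendered as code - it isn't any existing caller's format, and the example values are invented:

# Sketch of the proposed minimum record; field names follow the list above.
from dataclasses import dataclass

@dataclass
class SnvCall:
    chromosome: str      # e.g. 'chr7'
    position: int        # base position on the chromosome
    genome: str          # reference build, e.g. 'hg18'
    canonical: str       # expected (reference) base
    observed: str        # base actually seen
    coverage: int        # read depth at the position
    canonical_obs: int   # reads supporting the reference base
    variation_obs: int   # reads supporting the variant
    quality: float       # single confidence value in [0, 1]

call = SnvCall('chr7', 55249071, 'hg18', 'C', 'T', 42, 18, 24, 0.98)
print('\t'.join(str(v) for v in vars(call).values()))  # one call per line, tab-delimited
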
Really, most of the callers out there give you most, if not all, of this - but I have yet to see that final "quality" value being given. The Maq SNP caller (which is pretty good) asks you to look at several different fields and make up your own mind. That's fine for a first generation, but maybe I can convince people that we can do better in the second-gen SNP callers.

Ok, now I've got that off my chest! Phew.


Friday, August 7, 2009

DNA replication video

This happens to be one of the coolest videos I've ever seen of a molecular simulation. I knew the theory of how DNA replication happens, but I'd never actually seen how all of the pieces fit together. If you have a minute, check out the video.


Thursday, August 6, 2009

New Project Time... variation database

I don't know if anyone out there is interested in joining in - I'm starting work on a database that will allow me to store all of the SNPs/variations that arise in any data set collected at the institution. (Or the subset from which I have the right to harvest SNPs, anyhow.) This will be part of the Vancouver Short Read Analysis Package and, of course, will be available to anyone allowed to look at GPL code.

I'm currently on my first pass - consider it version 0.1 - but I already have some basic functionality assembled. Currently, it uses a built-in SNP caller to identify locations with variations and send them directly into a PostgreSQL database, but I will shortly be building tools to allow SNPs from any SNP caller to be migrated into the db.
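
Purely as an illustration of what such a migration tool might look like (psycopg2 is a standard PostgreSQL driver, but the table layout, column names and connection string below are all invented - the real tools will ship with the package):

# Sketch only: push (chromosome, position, canonical, observed) tuples
# from any third-party caller into a hypothetical 'snp' table.
import psycopg2

def import_calls(calls, dsn="dbname=snp_db"):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO snp (chromosome, position, canonical, observed)"
                " VALUES (%s, %s, %s, %s)",
                list(calls),
            )

import_calls([("chr1", 12345, "A", "G")])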

Anyhow, just putting it out there - this could be a useful resource for people who are interested in meta-analysis, and particularly those who might be interested in collaborating to build a better mousetrap. (=


Tuesday, August 4, 2009

10 minutes in a room with Microsoft

As the title suggests, I spent 10 minutes in a room with reps from Microsoft. It counts as probably the second least productive time span of my life - second only to the hour I spent at lunch while the Microsoft reps told us why they were visiting.

So, you'd think this would be educational, but in reality, it was rather insulting.

The wisdom presented by Microsoft during the first hour included the facts that Silverlight is cross-platform, that Microsoft is a major supporter of interoperability, and that bioinformaticians need a better platform in .NET to replace bio{java|perl|python|etc}.

My brain was actively leaking out of my ear.

My supervisor told me to be nice and courteous - and I was, but sometimes it can be hard.

The 30-minute meeting was supposed to be an opportunity for Microsoft to learn what my code does, and to help them plan out their future bioinformatics toolkit. Instead, they showed up with 8 minutes remaining in the half hour, during which another grad student and I were expected to explain our theses and still allow 4 minutes for questions. (Have you ever tried to explain two thesis projects in 4 minutes?)

The Microsoft reps were all kind and listened to our spiel, and then engaged in a round-table question-and-discussion session. What I learned during the process was interesting:
  • Microsoft people aren't even allowed to look at GPL software - legally, they're forbidden.
  • Microsoft developers also have no qualms about telling other developers "we'll just read your paper and re-implement the whole thing."
And finally,
  • Microsoft reps just don't get biology development: the questions they asked all skirted around the idea that they already knew what was best for developers doing bioinformatics work.
Either they know something I don't, or they assumed they did. I can live with that part, though - they probably know lots of things I don't know. In particular, I'm sure they know lots about coding for biology applications that require no new code development work.

So, in conclusion, all I have to say is that I'm very glad I published only a bioinformatics note instead of a full description of my algorithms (they're available for EVERYONE - except Microsoft - to read in the source code anyhow), and that I produce my work under the GPL. While I never expected to have to defend my code from Microsoft, today's meeting really made me feel good about the path I've charted for developing code towards my PhD.

Microsoft, if you're listening: any one of us here at the GSC could tell you why the biology application development you're doing is ridiculous. It's not that I think you should stop working on it - but you should really get to know the users (not customers) and developers out there doing the real work. And yes, the ones doing the innovative and groundbreaking code are mainly working with the GPL. You can't keep your head in the sand forever.
