Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Thursday, October 8, 2009

Speaking English?

I have days where I wonder what language comes out of my mouth, or if I'm actually having conversations with people that make sense to anyone.

Due to unusual circumstances (Translation to English: my lunch was forcibly ejected from the fridge at work, which was incompatible with the survival of the glass-based container it was residing in at the time of the incident), I had to go out to get lunch. In the name of getting back to work quickly, as Thursdays are short days for me, I went to Wendy's. This is a reasonable approximation of the conversation I had with one of the employees.

Employee: "What kind of dressing for your salad?"

Me: "Honey-dijon, please."

Employee: "What kind of dressing do you want?"

Me: "Honey-dijon."

Employee: "dressing."

Me: "Honey-dee-john"

Employee: "What kind of dressing for your salad?"

Me: "Honey-dijahn. It says honey-dijon on the board, it's a dressing, right?"

Employee: "You have the salad with your meal?"

Me: "yes.."

Employee: "You want the Honey Mustard?"

Me: "Yes."


Sometimes I just don't get fast food joints - they make me wonder if I have Asperger's syndrome. After that conversation, I wasn't even going to touch the issue that my "Sprite, no ice" had more ice than Sprite.


Tuesday, September 22, 2009

Time to publish?

Although it's not quite the first time in my life I've been told that I'm slacking, I got the lecture from my supervisor yesterday. To paraphrase: "You're sitting on data. Publish it now!"

I guess there's a spectrum of people out there in research: those who publish fast and furiously, and those who publish slowly and painstakingly. I'm on the far end of that spectrum: I really like to make sure that the data I have is right before pushing it out the door.

This particular data set was collected about a year and a half ago, back when 36-bp Illumina reads were all the rage, so yes, I've been sitting on it for a long time. However, if you read my notebooks, there's a clear evolution. Even in my database, the table names are marked as "tbl_run5_", so you can get the idea of how many times I've done this analysis. (I didn't start with the database on the first pass.)

At this point, and as of late last week (aka, Thursday), I'm finally convinced that my analysis of the data is reproducible, reliable and accurate - and I'm thrilled to get it down in a paper. I just look back at the lab book full of markings and have to wonder what would have happened if I'd published earlier... bleh!

So, this is my own personal dilemma: How do you publish as quickly as possible without opening the door to making mistakes? I always err on the side of caution, but judging by what makes it out to publication, maybe that's not the right path. I've heard stories of people sending results they knew to be incorrect to the reviewers, with the assumption that they could fix it up by the time the reviewers came back with comments.

Balancing publication quality, quantity and speed is probably another of those necessary skills that grad students will just magically pick up somewhere along the way towards getting a PhD. (Others on the list include teaching a class of 300 undergraduates, getting and keeping grants, and starting up your own lab.)

I think I'm going to spend a few minutes this afternoon (between writing paragraphs, maybe?) looking for a good grad school HOWTO. The few I've come across haven't dealt with this particular subject, but I'm sure it's out there somewhere.


Thursday, September 17, 2009

3 year post doc? I hope not!

I started replying to a comment left on my blog the other day and then realized it warranted a little more than just a footnote on my last entry.

This comment was left by "Mikael":

[...] you can still do a post-doc even if you don't think you'll continue in academia. I've noticed many life science companies (especially big pharmas) consider it a big plus if you've done say 3 years of post-doc.


I definitely agree that it's worth doing a post-doc, even if you decide you don't want to go on through the academic pathway. I'm beginning to think that the best time to make that decision (ivory tower vs. indentured slavery) may actually be during your post-doc, since that will be the closest you come to being a professor before making the decision. As a graduate student, I'm not sure I'm fully aware of the risks and rewards of the academic lifestyle. (I haven't yet taken a course on the subject, and one only gets so much of an idea through exposure to professors.)

However, at this point, I can't stand the idea of doing a 3-year post-doc. After 6 years of undergrad, 2.5 years of a master's, 3 years of (co-)running my own company, and about 3.5 years of doing a PhD by the time I'm done... well, 3 more years of school is about as appealing as going back to the wet lab. (No, glassware and I don't really get along.)

If I'm going to do a post-doc (and I probably will), it will be a short and sweet one - no more than a year and a half at the longest. I have friends who are stuck in 4-5 year post-docs and have heard of people doing 10-year post-docs. I know what it means to be a post-doc for that long: "Not a good career building move." If you're not getting publications out quickly in your post-doc, I can imagine it won't reflect well on your C.V., destroying your chances of moving into the limited number of faculty positions - and wreaking havoc on your chances of getting grants.

Still, it's more about what you're doing than how long you're doing it. I'd consider a longer post-doc if it's in a great lab with the possibility of many good publications. If there's one thing I've learned from discussions with collaborators and friends who are years ahead of me, it's that getting into a lab where publications aren't forthcoming - and where you're not happy - can burn you out of science quickly.

Given that I've spent this long as a science student (and it's probably far too late for me to change my mind on becoming a professional musician or photographer), I want to make sure that I end up somewhere where I'm happy with the work and can make reasonable progress: this is a search that I'm taking pretty seriously.

[And, just for the record, if a company needs me to do 3 years of post-doc at this point, I have to wonder just who it is I'm competing with for that job - and what it is that they think you learn in your 2nd and 3rd years as a postdoc.]

With that in mind, I'm also going to put my (somewhat redacted) resume up on the web in the next few days. It might be a little early - but as I said, I'm taking this seriously.

In the meantime, since I want to actually graduate soon, I'd better go see if my analyses were successful. (=


Friday, September 4, 2009

Risk Factors #2 and thanks to Dr. Steve.

Gotta love drive-by comments from Dr. Steve:

I don't have time to go into all of the many errors in this post, but for a start, the odds ratio associated with the celiac snp is about 5-10X *per allele* (about 50X for a homozygote). This HLA allele accounts for about 90% of celiac disease and its mode of action is well understood.

I understand this is just a blog and you are not supposed to be an expert, but you should do some basic reading on genetics before posting misinformation. Or better yet, leave this stuff to Daniel MacArthur.


While even the 6th grade bullies I knew could give Dr. Steve a lesson in making friends, I may as well at least clarify my point.

To start with, Dr. Steve made one good point. I didn't know what the risk factor for a single given Celiac SNP is - and thanks to Dr. Steve's incredibly educational message - I still don't. I simply made up a risk factor, which was probably the source of Dr. Steve's confusion. (I did say I made it up in the post, but apparently hypothetical situations aren't within Dr. Steve's repertoire.)

But let's revisit how risk factors work, as I understand them. If someone would like to tell me how I'm wrong, I'll accept that criticism. Telling me I'm wrong without saying why is useless, so don't bother.

A risk factor is a multiplicative factor that indicates your risk of expressing a given phenotype relative to the general population. If you have a risk factor of 5, and the phenotype appears in 10% of the general population, that means you have a 50% chance of expressing the phenotype. (0.10 x 5 = 0.5, which is 50%.)

In more complex cases, say with two independent SNPs, each with its own independent risk factor, you multiply the risk factors together and then by the probability of the phenotype appearing in the general population. (You don't need me to do the math for you, do you? There's a quick sketch of it below anyhow.)
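To make that calculation concrete, here's a minimal sketch in Python. The numbers are hypothetical, just like the ones in my earlier example - this is only meant to show the arithmetic, not any real SNP data:

# Hypothetical numbers only - combining independent per-SNP risk factors
# with the background rate of the phenotype in the general population.

def combined_risk(background_rate, risk_factors):
    """Multiply the independent risk factors together, then apply them
    to the background phenotype rate. Returns a probability."""
    risk = background_rate
    for factor in risk_factors:
        risk *= factor
    return min(risk, 1.0)  # a probability can't exceed 100%

# One SNP with a made-up risk factor of 5 on a 10% background rate:
print(combined_risk(0.10, [5]))       # 0.5, i.e. 50%

# Two independent SNPs with made-up risk factors of 5 and 1.5:
print(combined_risk(0.10, [5, 1.5]))  # 0.75, i.e. 75%

Notice that the risk factors on their own (the 5 and the 1.5) tell you nothing until you supply the 10% background rate - which is exactly the point I'm making in the next paragraph.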

My rather long-winded point was that discussing risk factors without discussing the background rate of the phenotype in the population is pointless - unless you already know that the risk ratio feeds into a diagnostic test and predicts, in a demonstrated, statistically significant manner, that the information is actionable.

Where I went out on a limb was in discussing other unknowns: error rates in the test, and possible other factors. Perhaps Dr. Steve knows he has a 0% error rate in his DTC SNP calls, or assumes as much - I am just not ready to make that assumption. Another point Dr. Steve may have objected to was my mention of extraneous environmental factors, which may be included in the risk factor, although I just passed over it in my previous post without much discussion.

(I would love to hear how a SNP risk factor for something like Parkinson's disease would be modulated by Vitamin D levels depending on your latitude. It can't possibly be built into a DTC report. Oh, wait, this is hypothetical again - Sorry Dr. Steve!)

My main point from the previous post was that I have a difficult time accepting that genomics consultants consider a "risk factor" a useful piece of genomic information in the absence of an accompanying "expected background phenotypic risk." A risk factor is simply a modulator of risk, and if you talk about a risk factor, you absolutely need to know what the background risk is.

Ok, I'm done rehashing my point from the previous post, and that takes me to my point for today:

Dr. Steve, telling people who have an interest in DTC genomics to stay out of the conversation in favor of the experts is shooting yourself in the foot. Whether it's me or someone else, we're going to ask the questions, and telling us to shut up isn't going to get the questions answered. If I'm asking these questions, and contrary to your condescending comment I do have a genomics background, people without a genomics background will be asking them as well.

So I'd like to conclude with a piece of advice for you: Maybe you should leave the discussion to Daniel MacArthur too - he's doing a much better job of spreading information than you are, and he does it without gratuitously insulting people.

And I thought Doctors were taught to have a good bedside manner.


Tuesday, August 4, 2009

10 minutes in a room with Microsoft

As the title suggests, I spent 10 minutes in a room with reps from Microsoft. It counts as probably the 2nd least productive time span in my life - second only to the hour I spent at lunch while the Microsoft reps told us why they were visiting.

So, you'd think this would be educational, but in reality, it was rather insulting.

Wisdom presented by Microsoft during the first hour included the fact that Silverlight is cross platform, Microsoft is a major supporter of interoperability and that bioinformaticians need a better platform to replace bio{java|perl|python|etc} in .net.

My brain was actively leaking out of my ear.

My supervisor told me to be nice and courteous - and I was, but sometimes it can be hard.

The 30-minute meeting was supposed to be an opportunity for Microsoft to learn what my code does, and to help them plan out their future bioinformatics tool kit. Instead, they showed up with 8 minutes remaining in the half hour, during which another grad student and I were expected to explain our theses and still allow for 4 minutes of questions. (Have you ever tried to explain two thesis projects in 4 minutes?)

The Microsoft reps were all kind and listened to our spiel, and then engaged in a round-table question and discussion. What I learned during the process was interesting:
  • Microsoft people aren't even allowed to look at GPL software - legally, they're forbidden.
  • Microsoft developers also have no qualms about telling other developers "we'll just read your paper and re-implement the whole thing."
And finally,
  • Microsoft reps just don't get biology development: the questions they asked all skirted around the idea that they already knew what was best for developers doing bioinformatics work.
Either they know something I don't know, or they assumed they did. I can live with that part, though - they probably know lots of things I don't know. In particular, I'm sure they know lots about coding for biology applications that require no new code development work.

So, in conclusion, all I have to say is that I'm very glad I only published a bioinformatics note instead of a full description of my algorithms (They're available for EVERYONE - except Microsoft - to read in the source code anyhow) and that I produce my work under the GPL. While I never expected to have to defend my code from Microsoft, today's meeting really made me feel good about the path I've charted for developing code towards my PhD.

Microsoft, if you're listening, any one of us here at the GSC could tell you why the biology application development you're doing is ridiculous. It's not that I think you should stop working on it - but you should really get to know the users (not customers) and developers out there doing the real work. And yes, the ones doing the innovative and groundbreaking code are mainly working with the GPL. You can't keep your head in the sand forever.


Thursday, July 30, 2009

I hate Facebook - part 2

I wasn't going to elaborate on yesterday's rant about hating facebook, but several people made comments, which got me thinking even more.

My main point yesterday was that I hate facebook because its protocols aren't open, and it's consequently a "Walled Garden" approach to social networking. (Here's another great rant on the subject.) That's not to say that you can't work with it - there are plugins for Pidgin that let you chat on the facebook protocol, and there are clients (as was pointed out to me) that will integrate your IMs with facebook chat on Windows. But that wasn't my point anyways.

My point is that it's creating its own separate protocols, each independent of the ones before it. In contrast to a service like twitter, where the underlying protocol is XML and thus easily manipulated, using Facebook requires that you work within their universe of standards. (I'm not the first person to come up with this - google will find you lots of examples of other people blogging the same thing.)

On the whole, that's not necessarily a bad thing, but common, reusable standards are what drive progress.

For instance, without a common HTML standard, the web would not have flourished - we'd have many independent webs. If AOL had their way, they'd still have you dialing up into their own proprietary Internet.

Without a common electrical standard, we'd have to pick the appropriate set of appliances for our homes based on their individual plugs - buying a hair dryer would be infinitely more painful than it needs to be.

Without a common word processing format, we'd suffer every time we tried to send a document to someone who isn't using the same word processor we do. (Oh wait, that's actually Microsoft's game - they refuse to properly support the one common document format everyone else uses.)

So, when it comes to Facebook, my hate is this - if they used a simple RSS feed for the wall, I could have used that instead of twitter on my site. If they used a simple Jabber format for their chat, I could have merged that with my google chat account. And then there's their private message system... well, that's just email, but not accessible by IMAP or POP.

What they've done is try to resurrect a business model that the web-unsavvy keep trying. In the short term, it's pure money. You drive people into it because everyone is using it. The innovative concept makes its adoption rapid and ubiquitous - but then you fall into the trap. The second generation of sites use open standards, and that allows FAR more cool things to be accomplished.

Examples of companies trying the walled garden approach on the net:

AOL and their independent internet, accessible only to AOL subscribers. Current Status: Laughable

Microsoft's Hotmail, where hotmail users can't export their email to migrate away. Current Status: GMail fodder.

Yahoo's communities. Current Status: irrelevant.

Wall Street Journal's new site. Current Status: ridiculed by people younger than 45.

Apple's i(phone/pod/tunes/etc). Current Status: Frequently hacked, forced to accept the de facto .mp3 format. (No Ogg yet...)

Ok, that's enough examples. All I have to say is that when Google (or anyone else) gets around to building a social networking site that's open and easy to play with, it won't be long before Facebook collapses.

The moral of the story? Don't invest too much in your facebook profile - it'll be obsolete in a few years.


Wednesday, July 29, 2009

I hate facebook

I have a short rant to end the day, brought on by my ever-increasing tie-in between the web and my desktop (now KDE 4.3):

I hate facebook.

It's not that I hate it the way I hate Myspace, which I hate because it's so easy to make horribly annoying web pages. It's not even that I hate it the way I hate Microsoft, which I hate because their business engages in unethical practices.

I hate it because it's a walled garden. Not that I have a problem with walled gardens in principle, but it's just so inaccessible - which is exactly what the facebook owners want. If you can only get at facebook through the facebook interface, you have to see their ads, which makes them money if you ever get sucked into them. (You now have to manually opt out of having your picture used in ads for your friends... it's a new option for your profile in your security settings, if you don't believe me.)

Seriously, the whole facebook wall can be recreated with twitter, the photo albums with flickr, the private messages with gmail.... and all of it can be tied together in one place. Frankly, I suspect that's what Google's "Wave" will be.

If I could integrate my twitter account with my wall on facebook, that would be seriously useful - but why should I invest the energy to update my status twice? Why should I have to maintain my own web page AND the profile on facebook...

Yes, it's a minor rant, but I just wanted to put that out there. Facebook is a great idea and a leader of its genre, but in the end, it's going to die if its community starts drifting towards equivalent services that are more easily integrated into the desktop. I can now update twitter using an applet on my desktop - but facebook still requires a login so that I can see their ads.

Anyhow, if you don't believe me about where this is all going, wait and see what Google Wave and Chrome do for you. I'm willing to bet desktop publishing will have a whole new meaning, and online communities will be a part of your computer experience even before you open your browser window.

For a taste of what's now on my desktop, check out the OpenDesktop, Remember the Milk and microblog ( or even Choqok) plasmoids.


Friday, July 17, 2009

Community

This week has been a tremendous confluence of concepts and ideas around community. Not that I'd expect anyone else to notice, but it really kept building towards a common theme.

The first was just a community of co-workers. Last week, my lab went out to celebrate a lab-mate's successful defense of her thesis (Congrats, Dr. Sleumer!). During the second round of drinks (Undrinkable dirty martinis), several of us had a half hour conversation on the best way to desalinate an over-salty martini. As weird as it sounds, it was an interesting and fun conversation, which I just can't imagine having with too many people. (By the way, I think Obi's suggestion wins: distillation.) This is not a group of people you want to take for granted!

The second community-related event was an invitation to move my blog over to a larger community of bloggers. While I've temporarily declined, it raised the question of what kind of community I have while I keep my blog on my own server. In some ways, it leaves me isolated, although it does provide a "distinct" source of information, easily distinguishable from other people's blogs. (One of the reasons for not moving to the larger community is the lack of distinguishing marks - I don't want to sink into a "borg" experience with other bloggers and just become assimilated entirely.) Is it worth moving over to reduce the isolation and become part of a bigger community, even if it means losing some of my identity?

The third event was a talk I gave this morning. I spent a lot of time trying to put together a coherent presentation - and ended up talking about my experiences without discussing the actual focus of my research. Instead, it was on the topic of "successes and failures in developing an open source community" as applied to the Vancouver Short Read Analysis Package. Yes, I'm happy there is a (small) community around it, but there is definitely room for improvement.

Anyhow, at the risk of babbling on too much, what I really wanted to say is that communities are all around us, and we have to seriously consider our impact on them, and the impact they have on us - not to mention how we integrate into them, both in our work and outside. If you can't maximize your ability to motivate them (or their ability to motivate you), then you're at a serious disadvantage. How we balance all of that is an open question, and one I'm still working hard at answering.

I've attached my presentation from this morning, just in case anyone is interested. (I've decorated it with pictures from the South Pacific, in case all of the plain text is too boring to keep you awake.)

Here it is (it's about 7Mb.)


Monday, June 22, 2009

4 Freedoms of Research

I'm going to venture off the beaten track for a few minutes. Ever since the discussion about conference blogging started to take off, I've been thinking about what the rights of scientists really are - and then came to the conclusion that there really aren't any. There is no scientist's manifesto or equivalent oath that scientists take upon receiving their degree. We don't wear the iron ring like engineers, which signifies our commitment to integrity...

So, I figured I should do my little part to fix that. I'd like to propose the following 4 basic freedoms to research, without which science can not flourish.
  1. Freedom to explore new areas
  2. Freedom to share your results
  3. Freedom to access findings from other scientists
  4. Freedom to verify findings from other scientists
Broadly, these rights should be self evident. They are tightly intermingled, and can not be separated from each other:
  • The right to explore new ideas depends on us being able to trust and verify the results of experiments upon which our exploration is based.
  • The right to share information is contingent upon other groups being able to access those results.
  • The purpose of exploring new research opportunities is to share those results with people who can use them and build upon them.
  • Being able to verify findings from other groups requires that we have access to their results.
In fact, they are so tightly mingled, that they are a direct consequence of the scientific method itself.
  1. Ask a question that explores a new area
  2. Use your prior knowledge, or access the literature to make a best guess as to what the answer is
  3. Test your result and confirm/verify if your guess matches the outcome
  4. Share your results with the community.
(I liked the phrasing on this site.) Of course, if your question in step 1 is not new, you're performing the verification step.

There are constraints on what we are allowed to do as scientists as well: we have to respect the ethics of the field in which we do our exploring, and we have to respect the fact that ultimately we are responsible to report to the people who fund the work.

However, that's where we start to see problems. To the best of my knowledge, funding sources define the directions science is able to explore. For the past 8 years, we saw the U.S. restrict science funding in order to throttle research in various fields (violating Research Freedom #1), which effectively halted stem cell research, suppressed work on alternative fuel sources, and so on. In the long term, this technique won't work, because scientists migrate to where the funding is. As the U.S. restores funding to these areas, the science is returning. Unfortunately, it's now Canada's turn, with the Conservative government (featuring a science minister who doesn't believe in evolution) removing all funding from genomics research. The cycle of ignorance continues.

Moving along, and clearly in a related vein, Freedom #4 is also a problem of funding. Researchers who would like to verify other groups' findings (a key responsibility of the basic peer-review process) aren't funded to do this type of work. While admitting my lack of exposure to granting committees, I've never heard of a grant being given to verify someone else's findings. However, this is the basic way by which scientists are held accountable. If no one can repeat your work, you will have many questions to answer - and yet the funding for ensuring accountability is rarely present.

The real threat to an open scientific community occurs with Freedoms #2 and #3: sharing and access. If we're unable to discuss the developments in our field, or are not even able to gain information on the latest work done, then science will come grinding to a major halt. We'll waste all of our time and money exploring areas that have been exhaustively covered, or worse yet, come to the wrong conclusions about what areas are worth exploring in our ignorance of what's really going on.

Ironically, Freedoms 2 and 3 are the most eroded in the scientific community today. Even considering only the academic world, where these freedoms are taken for granted, our interactions with the forums for sharing (and accessing) information are horribly stunted:
  • We do not routinely share negative results (causing unnecessary duplication and wasting resources)
  • We must pay to have our results shared in journals (limiting what can be shared)
  • We must pay to access other scientists results in journals (limiting what can be accessed)
It's trivial to think of other examples of how these two freedoms are being eroded. Unfortunately, it's not so easy to think of how to restore these basic rights to science, although there are a few things we can all do to encourage collaboration and sharing of information:
  • Build open source scientific software and collaborate to improve it - reducing duplication of effort
  • Publish in open access journals to help disseminate knowledge and bring down the barriers to access
  • Maintain blogs to help disseminate knowledge that is not publishable
If all scientists took advantage of these tools and opportunities to further collaborative research, I think we'd find a shift away from conferences towards online collaboration and the development of tools favoring faster and more efficient communication. This, in turn, would provide a significant speed up in the generation of ideas and technologies, leading to more efficient and productive research - something I believe all scientists would like to achieve.

To close, I'd like to propose a hypothesis of my own:
By guaranteeing the four freedoms of research, we will be able to accomplish higher quality research, more efficient use of resources and more frequent breakthroughs in science.
Now, all I need to do is to get someone to fund the research to prove this, but first, I'll have to see what I can find in the literature...


Wednesday, June 17, 2009

More on conference blogging...

If you've been following along with the debate on conference blogging, you've surely been reading Daniel MacArthur's blog, Genetic Future. His latest post on the subject provides a nifty idea: presenters who are ok with their talks being discussed should have an icon in the conference proceedings beside the announcement of their talks, so that members of the audience know it's safe to discuss their work. He even goes so far as to present a few icons that could be used.

On the whole, I'm not opposed to such a scheme - particularly at a conference like Cold Spring, where unpublished information is commonly presented and even encouraged by the organizers. However, Cold Spring is one of the rare venues where attendance is "open", but the policy on disclosing the information is restricted. It's entirely regulated for journalists, but in the past has not been an issue for scientists. However, if a conference begins to restrict what scientists are allowed to disclose outside of the meetings, the organizers are really removing themselves from the free and open scientific debate. A conference that does that isn't technically a conference - at best it's a closed-door meeting - and the material should explicitly be labeled as confidential.

Assuming that the vast majority of presentations can't be discussed without explicit permission is quite anathema to science. If you look at the way technology is handled in western society, you'll see a general trend: the patent system is based around the idea of disclosure, copyright is based on the idea of retaining rights after disclosure, and even our publication/peer review system demands full disclosure as the minimum standard. (Well, that plus a wad of cash for most journals...) For most conferences, then, I suggest we use a more fitting model than opting in to allow disclosure, as proposed by Daniel. Rather, we should provide the opportunity to opt out.

All presenters should have the option of choosing "I do not want my presentation disclosed." We can even label their presentation with a nice little doohickey that indicates that the material is not for public discussion.

Audience members who attend the talk then agree that they are not allowed to discuss this information after leaving the room. Why operate in half measures? It's either confidential or it's not. Why should we forbid people from discussing it online, and then turn a blind eye to someone reading their notes in front of the non-attending members of their institution?

Hyperbole aside, what we're all after here is a common middle-ground. Science Bloggers don't want to bite the hands of the conference organizers, and I can't really imagine conference organizers not being interested in fostering a healthy discussion. After all, conferences like AGBT have done well because of the buzz that surrounds their organization.

As I said in my last post on the topic, science does well when the free and open exchange of ideas is allowed to take place, and people presenting at conferences should be aware of why they're presenting. (I leave figuring out those reasons as an exercise for the student.)

Let's not throw the blogger out with the bathwater in our haste to find a solution: conferences are about disclosure and blogs are about communication - aren't we all working towards the same goal?


Monday, June 15, 2009

Another day, another result...

I had the urge to just sit down and type out a long rant, but then common sense kicked in and I realized that no one is really interested in yet another graduate student's rant about their project not working. However, it only took a few minutes for me to figure out why it's relevant to the general world - something that's (unfortunately) missing from most grad student projects.

If you follow along with Daniel MacArthur's blog, Genetic Future, you may have caught the announcement that Illumina is getting into the personal genome sequencing game. While I can't admit that I was surprised by the news, I will have to admit that I am somewhat skeptical about how it's going to play out.

If your business is using arrays, then you'll have an easy time sorting through the relevance of the known "useful" changes to the genome - there are only a couple hundred or thousand that are relevant at the moment, and several hundred thousand more that might be relevant in the near future. However, when you're sequencing a whole genome, interpretation becomes a lot more difficult.

Since my graduate project is really the analysis of transcriptome sequencing (a subset of genome sequencing), I know firsthand the frustration involved. Indeed, my project was originally focused on identifying changes to the genome common to several cancer cell lines. Unfortunately, this is what brought on my need to rant: there is vastly more going on in the genome than small sequence changes.

We tend to believe blindly what we were taught as the "central dogma of molecular biology": genes are copied to mRNA, mRNA is translated to proteins, and the protein goes off to do its work. However, cells are infinitely more complex than that. Genes can be inactivated by small changes, can be chopped up and spliced together to become inactivated or even deregulated, interference can be run by distally modified sequences, gene splicing can be completely co-opted by inactivating genes we barely even understand yet, and desperately over-expressed proteins can be marked for deletion by over-activating garbage collection systems so that they don't have a chance to get where they were needed in the first place. And here we are, looking for single nucleotide variations, which make up a VERY small portion of the information in a cell.

I don't have the solution yet, but whatever we do in the future, it's not going to involve $48,000 genome re-sequencing. That information on its own is pretty useless - we'll have to study expression (WTSS or RNA-Seq, so figure another $30,000), changes to epigenetics (of which there are many histone marks, so figure 30 x $10,000) and even DNA methylation (I don't begin to know what that process costs).
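As a rough back-of-the-envelope sum using those figures: $48,000 + $30,000 + (30 x $10,000) already works out to about $378,000 per individual - and that's before the methylation work is even priced in.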

So, yes, while I'm happy to see genome re-sequencing move beyond the confines of array based SNP testing, I'm pretty confident that this isn't the big step forward it might seem. The early adopters might enjoy having a pretty piece of paper that tells them something unique about their DNA, and I don't begrudge it. (In fact, I'd love to have my DNA sequenced, just for the sheer entertainment value.) Still, I don't think we're seeing a revolution in personal genomics - not quite yet. Various experiments have shown we're on the cusp of a major change, but this isn't the tipping point: we're still going to have to wait for real insight into the use of this information.

When Illumina offers a nice toolkit that allows you to get all of the SNVs, changes in expression and full ChIP-Seq analysis - and maybe even a few mutant transcription factor ChIP-Seq experiments thrown in - and all for $48,000, then we'll have a truly revolutionary system.

In the meantime, I think I'll hold out on buying my genome sequence. $48,000 would buy me a couple more weeks in Tahiti, which would currently offer me a LOT more peace of mind. (=

And on that note, I'd better get back to doing the things I do.... new FindPeaks tag, anyone?


Saturday, June 6, 2009

Once more into the breach...

I haven't been able to follow the whole conversation going on with respect to conference blogging, since I'm still away at a conference for another day. Technically, the conference ended on Thursday, but I'm still here visiting with some of the more important people in my life - so that is my excuse.

At any rate, I received an interesting comment from someone posting as "such.ire", to which I wrote a reply. In the name of keeping the argument going (since it is such a fascinating topic), I thought I'd post my reply to the front page. For context, I suggest reading such.ire's comment first:

click here for his comment.

My reply is below:

-------

Hi Such.ire,

I really appreciate your comment - it's a great counterpoint to what I said, and really emphasizes the fact that this debate will have plenty of nuances, which will undoubtedly carry this conversation on long after the blogosphere has finished with it.

To rebut a few of your points, however, I should point out that your examples aren't all correct.

Yes, conferences are well within their rights to ask you to sign NDAs as an attendee - or to require that confidentiality is a part of the conference - there is no debate on that point. However, if you attend a conference that is open and does not have an explicit policy, then it really is an open forum, and they do not have the right to retroactively dictate what you can (or can't) do with the information you gathered at the conference.

I think all of us would agree that the boundaries for a conference should be clearly specified at the time of registration.

As for lab talks for your lab members - those are not "public disclosures" in the eyes of the law. All of your lab colleagues are bound by the rules that govern your institution, and I would be surprised if your institution hadn't asked you to sign various confidentiality rules or policies about disclosure at the time you joined them.

Department seminars are somewhat different - if they are advertised outside the department to individuals that are not members of the institution, then again, I would suggest they are fair game.

I don't blog departmental talks or RIP talks for that reason. They are not public disclosures of information.

Finally, my last point was not that journalists and bloggers do anything different up front, but that the method of their publishing should have a major impact on how they are treated. Bloggers can make corrections that reach all of their audience members and can update their stories, while journalists can not.

If a conference demands to see the material a journalist publishes up front, it makes sense. If they demand to do the same thing for a blogger, it completely ignores the context of the media in which the communication occurs.


Thursday, June 4, 2009

The Rights of Science Blogging

An article recently appeared on scienceweb, in relation to Daniel MacArthur's blogging coverage of a conference he attended at Cold Spring Harbor, which has raised a few eyebrows (the related article is here). Cold Spring Harbor has a relatively strict policy for journalists, but it appears that Daniel wasn't constrained by it, since he's not a "journalist" by the narrow definition of the word. More than half of the advice I've ever received on blogging science conferences comes from Daniel, and I would consider him one of the more experienced and professional of the science bloggers - which makes this whole affair just that much more interesting. If anyone were going to take exception to blogging, Daniel's coverage of an event is guaranteed to be the least offensive, best-researched and most professional of the blogs, and hence the least likely to be the one that causes an outcry.

As far as I can tell from the articles, Cold Spring is relatively upset about this whole affair, and is going down the path that many other institutions have chosen: trying to suppress blogging instead of embracing it. Unfortunately, there are really very few reasons for this to be an issue - and I thought I'd put forward a few counter-points to those who think science blogging should be restrained.

1.  Public disclosure

Unless the conference organizers have explicitly asked each participant to sign a non-disclosure agreement, the conference contents are considered to be a form of public disclosure. This is relevant not because the potential for people to talk about it is important, but because legally, this is when the clock starts ticking if you intend to profit from your discovery. In most countries, the first time an invention is disclosed is when you begin to lose rights to it - broadly speaking, it often means that you have one year to officially file the patent, or the patent rights become void. Public disclosure can be as simple as emailing your invention in an un-encrypted file, or leaving a copy of a document in a public place... the bar for public disclosure is really quite low. More crucially, you can lose your rights to patenting things at all if they're disclosed publicly before the patent is filed.

Closer to home, you might have to worry about academic competition. If you stand up in front of a room and tell everyone what you've just discovered (before you've submitted it), anyone can then replicate that experiment and scoop you... The academic world works on who has published what first - so we already have the built-in instinct to keep our work quiet until we're ready to release it. (There's another essay in that on open source science, but I'll get to it another day.) So, when academics stand up in front of an audience, it's always something that's ready to be broadcast to the world. The fact that it's then being blogged to a larger audience is generally irrelevant at that point.

2.  Content quality

An argument raised by Cold Spring suggests that they are afraid that the material being blogged may not be an accurate reflection of the content of the presentation.  I'm entirely prepared to call B*llsh!t on this point.

Given a journalist with a bachelor's degree in general science, possibly a year or two of journalism school and maybe a couple of years of experience writing articles, versus a graduate student with several years of experience tightly focussed on the subject of the conference, who is going to write the more accurate article?

I can't seriously believe that Cold Spring or anyone else would have a quality problem with science blogging - when it's done by scientists with an interest in the field.  More on this in the conclusion.

3. Journalistic control

This one is more iffy to begin with. Presumably, the conference would like to have tighter control over the journalists who write articles in order to make sure that the content is presented in a manner befitting the institution at which the conference took place. Frankly, I have a hard time separating this from the last point: if the quality of the article is good, what right does the institution have to dictate the way it's presented by anyone who attended? If I sit down over beers with my colleagues and discuss what I saw at the conference, we'd all laugh if a conference organizer tried to censor my conversation. It's both impossible and a violation of the right to free speech. (Of course, if you're in Russia or China, that argument might have a completely different meaning, but in North America or Europe, this shouldn't be an issue.) The fact that I record that conversation and allow free access to it in print or otherwise should not change my right to freely convey my opinions to my colleagues.

Thus, I would argue you can either have a closed conference, or an open conference - you have to pick one or the other, and not hold different attendees to different standards depending on the mode by which they converse with their colleagues.

4. Bloggers are journalists

This is a fine line. Daniel and I have very different takes on how we interact with the blogosphere. I tend to publish notes and essays, where Daniel focusses more on news, views and well-researched topic reviews. (Sorry about the alliteration.) There is no one format for bloggers, just as there isn't one for journalists. Rather, it's a continuous spectrum of how information is distributed, and for journalists to get upset about bloggers in general makes very little sense. Most bloggers work in the niches where journalists are sparse. In fact, for most people, the niches are what make blogs interesting. (I'm certainly not aware of any journalists who work on ChIP-Seq full time, and that is, I suspect, the main reason why people read my feeds.)

Despite anything I might have to say on the subject, the final answer will be decided by the courts, who have been working on this particular thorny issue for years. (Try plugging "are bloggers journalists" into google, and you'll find more nuances to the issue than you might expect.)

What it comes down to is that bloggers are generally protected by the same laws that protect journalists, such as the right to keep their sources confidential, and bound by the same limits, such as the ability to be sued for spreading false information.  Responsibility goes hand in hand with accountability.

And, of course, that should be how institutions like Cold Spring Harbor have to address the issue.

Conclusion:

Treating science bloggers the way Cold Spring Harbor treats journalists doesn't make sense. Specialists talking about a field in public is something the community has been trying to encourage for years: greater disclosure, more open dialog and sharing of ideas are the fundamental pillars of western science. To force bloggers into the category of journalists in the world of print magazines is utterly ridiculous: bloggers' articles can be updated to fix typos, to adjust the content and to ensure clarity. Journalists work in a world in which a typo becomes part of the permanent record and misunderstandings can remain in the public mind for decades. The power to reach a large audience exists in both - but only bloggers have the ability to go back and make corrections. Working with bloggers is a far better strategy than working against them.

No matter how you slice it, institutions with a vested interest in a single business model always resist change - and so do those who have not yet come to terms with the advances of technology. Unfortunately, it sounds like Cold Spring Harbor hasn't yet adapted to the internet age and is trying to fit a square peg into a round hole.

I'd like to go on the record in support of Daniel MacArthur - blogging a conference is an important method of creating dialog in the science community. We can't all attend each conference, but we shouldn't all be left out of the discussion - and blogs are one important way that can be achieved.

If Cold Spring Harbor has a problem with Daniel's blog, let them come forward and identify the problem. Sure, they can ask bloggers to announce their blog URLs before the conference, allowing the organizers to follow along and be aware of the reporting - I wouldn't argue against that. It provides accountability for those blogging the conference - which serious bloggers won't object to - and it allows the bloggers to go forth and engage the community.

To strangle the communication between conference attendees and their colleagues, however, is to throttle the scientific community itself. Let's all challenge Cold Spring to do the right thing and adapt with the times, rather than ask scientists to drop a useful tool just because it's inconvenient and doesn't fit in with the way the conference organizers currently interact with their audience.


Friday, November 28, 2008

It never rains, but it pours...

Today is a stressful day. Not only do I need to finish my thesis proposal revisions (which are not insignificant, because my committee wants me to focus more on the biology of cancer), but we're also in the middle of real estate negotiations. Somehow, this is more than my brain can handle on the same day... At least we should know by 2pm if our counter-offer was accepted on the sales portion of the transaction, which would officially trigger the countdown on the purchase portion of the transaction. (Of course, if it's not accepted, then more rounds of offers and counter-offers will probably take place this afternoon. WHEE!)

I'm just dreading the idea of doing my comps the same week as trying to arrange moving companies and insurance - and the million other things that need to be done if the real estate deal happens.

If anyone was wondering why my blog posts have dwindled down this past couple of weeks, well, now you know! If the deal does go through, you probably won't hear much from me for the rest of this year. Some of the key dates this month:
  • Dec 1st: hand in completed and reviewed Thesis Proposal
  • Dec 5th: Sales portion of real estate deal completes.
  • Dec 6th: remove subjects on the purchase, and begin the process of arranging the move
  • Dec 7th: Significant Other goes to Hong Kong for ~2 weeks!
  • Dec 12th: Comprehensive exam (9am sharp!)
  • Dec 13th: Start packing 2 houses like a madman!
  • Dec 22nd: Hanukkah
  • Dec 24th: Christmas
  • Dec 29th: Completion date on the new house
  • Dec 30th: Moving day
  • Dec 31st: New Years!
And now that I've procrastinated by writing this, it's time to get down to work. I seem to have stuff to do today.


Tuesday, November 18, 2008

Bioinformatics Companies

I was working on my poster this afternoon, when I got an email asking me to provide my opinions on certain bioinformatics areas I've blogged on before, in return for an Apple iPod Touch in a survey that would take about half an hour to complete. Considering that ratio of value to time (roughly 44x what I get paid as a graduate student), I took the time to take the survey.

Unfortunately, at the very end of the survey, it told me I wasn't eligible to receive the iPod. Go figure. Had they told me that first, I probably would have (wisely) spent that half hour on my poster or studying. (Then they told me they'd ship it in 4-6 weeks... ok, then.)

In any case, the survey asked very targeted questions with multiple choice answers which really didn't encompass the real/full answers to the questions, and questions which were so leading that there really was no way to give a complete answer. (I like boxes to give my opinions... which kind of describes my blog, I suppose - a box into which I write my opinion. Anyhow...) In some ways, I have to wonder if the people who wrote the survey were trying to sell their product or get feedback on it. Still, it led me to think about bioinformatics applications companies. (Don't worry, this will make sense in the end.)

The first thing you have to notice as a bioinformatics software company is that you have a small audience. A VERY small audience. If Microsoft could only sell its OS to a couple hundred or a thousand labs, how much would it have had to charge to make several billion dollars? (Answer: too much.)

And that's the key issue - bioinformatics applications don't come cheap. To make a profit on a bioinformatics application, you can only do one of four things:
  1. Sell at a high volume
  2. Sell at a high price
  3. Find a way to tie it to something high price, like a custom machine.
  4. Sell a service using the application.
The first is hard to do - there aren't enough bioinformatics labs for that. The second is common, but really alienates the audience. (Too many bioinformaticians believe that a grad student can just build their own tools from scratch cheaper than buying a pre-made and expensive tool, but that's another rant for another day. I'll just say I'm glad it's not a problem in my lab!) The third is good, but buying a custom machine has hidden support costs and, in a world where applications get faster all the time, runs the risk of the device becoming obsolete all too fast. The last one is somewhat of a non-starter. Who wants to send their results to a third party for processing? Data ownership issues aside, if the bandwidth isn't expensive enough, the network transfer time usually negates the advantages of doing that.

So that leaves anyone who wants to make a profit in bioinformatics in a tight spot - and I haven't even mentioned the worst part of it yet:

If you are writing proprietary bioinformatics software, odds are someone's writing a free version of it out there somewhere too. How do you compete against free software, which is often riding on the cutting edge? Software patents are also going to be hard to enforce in the post-Bilski legal world, and even if a company managed to sue a piece of software out of existence (e.g. with injunctions), someone else would just come along and write their own version. After all, bioinformaticians are generally able to program their own tools if they need to.

Anyhow, all this was sparked by the survey today, making me want to give the authors of the survey some feedback.
  1. Your audience knows things - give them boxes to fill in to give their opinions. (Even if they don't know things, I'm sure it's entertaining.)
  2. Don't try to lead the respondents to the answers you want - let them give you their opinions. (That can also be paraphrased as "less promotional material, and more opinion asking." Isn't that the point of asking for their opinions in the first place?)
  3. Make sure your survey works! (The one I did today asked a few questions to test if I was paying attention to what I was reading, and then told me I got the answers wrong, despite confirming that the answer I checked was correct. Oops.)
So how does all of that tie together?

If you ask questions with the only possible answers being the ones you've provided, you're going to convince yourself that the audience and pricing for your product are something that it may not be. Bioinformatics software is a hard field to be successful in - and asking the wrong questions will only make it harder to understand the pitfalls ahead. With pressure on both the business side and the software side, this is not a field in which you can afford to ask the wrong questions.


Tuesday, July 22, 2008

SNP calling from MAQ

With that title, you're probably expecting a discussion on how MAQ calls snps, but you're not going to get it. Instead, I'm going to rant a bit, but bear with me.

Rather than just use the MAQ snp caller, I decided to write my own. Why, you might ask? Because I already had all of the code for it, my snp caller has several added functionalities that I wanted to use, and *of course*, I thought it would be easy. Was it, you might also ask? No - but not for the reasons you might expect.

I spent the last 4 days doing nothing but working on this. I thought it would be simple to just tie the elements together: I have a working .map file parser (don't get me started on platform-dependent binary files!), I have a working snp caller, and I even have all the code to link them together. What I was missing was all of the little tricks, particularly the ones for intron-spanning reads in transcriptome data sets, and the code that links the "kludges" together with the method I didn't know about when I started. After hacking away at it, bit by bit things began to work. Somewhere north of 150 code commits later, it all came together.

If you're wondering why it took so long, it's threefold:

1. I started off depending on someone else's method, since they came up with it. As is often the case, that person was working quickly to get results, and I don't think they had the goal of writing production-quality code. Since I didn't have their code (though, honestly, I didn't ask for it either, since it was in Perl, which is another rant for another day), it took a long time to settle all of the off-by-one, off-by-two and otherwise unexpected bugs. They had given me all of the clues, but there's a world of difference between being pointed in the general direction of your goal and having a GPS to navigate you there.

2. I was trying to write code that would be re-usable. That's something I'm very proud of, as most of my code is modular and easy to re-purpose in my next project. Halfway through this, I gave up: the code for this snp calling is not going to be re-usable. Though, truth be told, I think I'll have to redo the whole experiment from the start at some point, because I'm not fully satisfied with the method, and we won't be doing it exactly this way in the future. I just hope the change doesn't happen in the next 3 weeks.

3. Namespace errors. For some reason, every single person has a different way of addressing the 24-ish chromosomes in the human genome. (Should we include the mitochondrial genome in the count?) I find myself building functions that strip and rename chromosomes all the time, using similar rules. Is the mitochondrial genome "MT" or just "M"? What case do we use for "X" and "Y" (or is it "x" and "y"?) in our files? Should we pre-pend "chr" to our chromosome names? And what on earth is "chr5_random" doing as a chromosome? This is even worse when you need to compare two active indexes, plus the strings in each read... bleh.
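Just to illustrate the kind of function I keep rewriting, here's a toy sketch in Python - not my actual code, and the rules are invented to match the examples above:

def normalize_chromosome(name):
    """Collapse the many spellings of human chromosome names onto one
    convention: strip any 'chr' prefix, uppercase the rest, and call
    the mitochondrial genome 'MT'. Illustrative rules only."""
    name = name.strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    name = name.upper()
    if name == "M":   # some files use 'M' for the mitochondrial genome
        name = "MT"
    return name       # note: 'chr5_random' just becomes '5_RANDOM'

# All of these collapse onto the same key for comparisons:
assert normalize_chromosome("chrX") == normalize_chromosome("x") == "X"
assert normalize_chromosome("chrM") == normalize_chromosome("MT") == "MT"

Of course, the real headaches start when two data sources disagree on which of these rules to apply in the first place.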

Anyhow, I fully admit that SNP calling isn't hard to do. Once you've read all of your sequences in, determined which bases are worth keeping (prb scores), and set a minimum level of coverage and a minimum number of bases needed to call a SNP, there's not much left to do. I check it all against the Ensembl database to determine which ones are non-synonymous, and then: tada, you have all your SNPs.
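For what it's worth, the thresholding step boils down to something like this toy Python sketch - emphatically not my actual caller, and the cutoffs are invented for the example:

from collections import Counter

MIN_COVERAGE = 8       # hypothetical: minimum reads covering the position
MIN_VARIANT_BASES = 3  # hypothetical: minimum reads supporting the variant

def call_snp(reference_base, observed_bases):
    """Given the high-quality bases observed at one position, return the
    variant base if it passes the thresholds, or None otherwise."""
    if len(observed_bases) < MIN_COVERAGE:
        return None
    variant_counts = Counter(b for b in observed_bases if b != reference_base)
    if not variant_counts:
        return None
    variant, support = variant_counts.most_common(1)[0]
    if support < MIN_VARIANT_BASES:
        return None
    return variant

# A position covered by 10 reads, 4 of which show a 'T' over a 'C' reference:
print(call_snp("C", list("CCCCCCTTTT")))  # prints 'T'

Checking the surviving calls against Ensembl for synonymous versus non-synonymous status is a separate annotation step layered on top of this.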

However, once you're done all of this, you realize that the big issue is that there are now too many SNP callers, and everyone and their pet dog is working on one. There are several now in use at the GSC: mine, at least one custom one that I'm aware of, one built into an aligner (Bad Idea(tm)) under development here, and the one tacked on to the Swiss Army knife of aligners and tools: MAQ. Do they all give different results, or is one better than another? Who knows. I look forward to finding someone who has the time to compare, but I really doubt there's much difference beyond the alignment quality.

Unfortunately, because the aligner field is still immature, there is no single file output format that's common to all aligners, so the comparison is a pain to do - which means it's probably a long way off. That, in itself, might be a good topic for an article, one day.
