Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Please come visit my blog there.

Thursday, January 14, 2010

How to be a better Programmer: Tactics.

I'm a bit too busy for a long post, but a link was circulating around the office that I thought was worth passing on to any bioinformaticians out there.

The article above is on how to be a better programmer - and I wholeheartedly agree with what the author proposed, with one caveat that I'll get to in a minute. The point of the article is that learning to see the big picture (not specific skills) will make you a better programmer. In fact, this is the same advice Sun Tzu gives in "The Art of War", where understanding the terrain, the enemy, etc. are the tools you need to be a better general. [This would be in contrast to learning how to wield each weapon, which would only make you a better warrior.] Frankly, it's good advice, and this leads you down the path towards good planning and clear thinking - the keys to success in most fields.

The caveat, however, is that there are times in your life where this is the wrong approach: i.e., grad school. As a grad student, your goal isn't to be great at everything you touch - it's to specialize in some small corner of one field, and tactics are no help here. If grad school existed for Ninjas, the average student would walk out being the best (pick one of: poisoner/dart thrower/wall climber/etc) in the world - and likely knowing little or nothing about how to be a real ninja beyond what they learned in their Ninja undergrad. Tactics are never a bad investment, but they aren't always what is being asked of you.

Anyhow, I plan to take the advice in the article and to keep studying the tactics of bioinformatics in my spare time, even though my daily work is more on the details and implementation side of it. There are a few links in the comments of the original article to sites the author believes are good comp-sci tactics... I'll definitely be looking into those tonight. Besides, when it comes down to it, the tactics are really the fun parts of the problems, although there is also something to be said for getting your code working correctly and efficiently.... which I'd better get back to. (=

Happy coding!


Monday, December 14, 2009

Talk: Writing a manuscript for bio-medical publication - Ian E.P. Taylor (Professor emeritus of Botany, UBC)

I was asked to blog this by a colleague. I haven't caught up with the other talks from last week, but since I'm here, and I found an outlet, I figured I could just dump my notes to the web. So here they are. I've left in a few comments of my own, but really, the lecture stands on its own. Most of the notes are derived directly from the talk - but they mirror the distributed handouts pretty closely. And frankly, the talk was full of examples and anecdotes that really helped illustrate the point. If you get a chance to see Dr. Taylor speak, I highly suggest it. Any mistakes, as always, are mine, and you shouldn't take my advice on publishing...

[I will clean up the (HORRIBLY BAD) HTML that was caused by dumping in an open office document. (Update: I've now removed the bad HTML - and I won't be cutting and pasting from open office again. That was brutal.)]

Writing a manuscript for bio-medical publication

By Ian E.P. Taylor (Professor emeritus of Botany)

Started with an anecdote: When you publish a paper, the people you have to impress are 2 reviewers and an editor. They are harder to impress, and you have to worry about first impressions, or they will be hostile. Don't be negative about your work. For things that are missing: keep them for your next experiment.

Peer review: An independent but generic tool that allows an editor to
  • Determine originality
  • Assess operational competence
  • Ensure coherent reports of research
  1. Unpublished research may as well not have been done.
  2. Unread publications may as well not have been written.
Four steps to understand: (Plan for the lecture.)
  1. Plan and plan for your journal
  2. Real and credible authorship. (Those who are listed have done something... names have been put on cheques for authorships.)
  3. Peer review. (The web hasn't changed this much – people still want to see peer review.)
  4. Responses to review.

Picking a journal:
Polling the audience – lots of usual reasons. [Joke about having your professor on the editorial board....]
His reasons:

  1. indexed
  2. best in field
  3. appropriate readership
  4. I read it a lot
  5. timeliness
  6. costs
  7. profile of other authors
  8. professor-choice

Impact factor:

  • Kind of a fake idea.... the papers have impact factors, not journals. Several examples of high impact papers in low impact factor journals.

The plan:

Know the giants upon whose shoulders you stand. What are their backgrounds, and where do they publish?

  1. Know your goals, hypothesis
  2. If you can't write what you discovered in 20 words, you haven't discovered it.
  3. How did you discover it
  4. prepare the outcomes (text, figs, tables, supplementary)
  5. what do they mean?
  6. Keep your references as you write. (Missing references piss off reviewers!)
[The only decent cheese in the world is cheshire cheese – and yes, that's where he grew up.]

Planning abstract:
  • State what you discovered in the first line, and your conclusion in the second line
  • As you write, challenge each sentence against your plan.
  • If necessary, change plan

“We have discovered...” Everything you write should fit this goal.

Elements of a paper:
  • Results – no results, no paper
  • Methods – how you got the results
  • Discussion – why the methods. (May or may not be the choppy approach....)
  • Introduction – direct the reader
  • Conclusions – not repeat of results
  • Aside on abstract: can be structured – explains order of what you're discussing, or can be intro: explain significance.
How to read a paper:
  • Introduction -> Methods -> Discussion.
  • Introduction should lead you straight to discussion.... should be able to skip results and methods.
  • Writing author (always singular) – You should try to write it out yourself, as this is key. You don't want to mix styles.
  • Co-authors
  • senior author (first)
  • senior author (PI)
  • corresponding author.
In case it's not obvious, first is the person who did the work, last is the PI. (for Biomedical) Some places do alphabetical, but it depends on your field.

Writing Author:
  • Get the instructions to authors from the journal
  • write the first draft, which is a record of your work
  • unify style
  • share draft with all co-authors
  • Responsible for ensuring the record of ethical performance.
  • It's key that you get the style correct for your journal.

Before you write – make sure you know which journal.

[Great advice to me: Journals do not publish the truth! They publish results, written truthfully!]

Wonderful anecdote about how they used to write papers by putting everyone into a room and not leaving till the 2nd draft was done.

Who are the authors?
  • Only the people who should be on there. Eg, supervisors who wrote the grant, didn't actually supervise.
  • Data providers
  • Data analysts (e.g. statisticians)
  • political.... yeah, it happens
  • All who contributed to an essential part of the actual research reported
  • No one who is there for contributing a gift of ANY sort
  • check the journal for the criteria.
  • All authors must agree to the authorship list
  • All authors must agree to submission.
People who create libraries should not be included. Eg, if they supply you with a DNA library or something of that nature – if they didn't do the work, they don't belong on the paper. Just as an acknowledgement. See American Journal of Medicine – they have a form for this. (=

Fabrication, Falsification, Publication: they are criminal offenses in the scholarly community. Most of the time, detection is by accident.

Author obligations - everyone on the paper takes responsibility for the whole paper.
  • inform editor of related works
  • refer to instructions to authors for details
  • covering letter should give the context of other published work. Make sure each paper is an original idea.
  • Inform editors of financial or other conflicts of interest
  • Identify (un)acceptable bias
  • Full justification of 'representative results'
  • Negative results???
  • The stuff you'd expect
  • list of suitable reviewers! Don't just pick the people who are best in the field - give others too. Don't just pick editorial board either. Give some reason why you picked them.
Remember: You are the world expert on the subject you're writing on.

Reviewer Obligations:
  • Treat as you would wish a stranger to treat you
  • Follow directions from editor
  • Consult editor before consulting others
  • group reviews must adhere to confidentiality
  • One person must sign off on it.
  • Refuse, rather than delay: 2 weeks max. Do not delay!
  • if you're waiting for journals, don't put up with delay, get on the phone and call!
  • If asked for recommendation, follow criteria.
  • Annotate manuscript, but AVOID being rude.
If you're reviewing, Track changes is acceptable
  • Don't steal ideas or use their ideas for your gain
  • Do not break confidentiality
  • Avoid becoming investigator. - don't tell the researcher how to do the research.
Disclose potential conflicts of interest - let the editor decide if it's a problem
and yeah, don't talk to the author unless the editor says it's ok (explicitly)
Reviewer conflicts:
  • Recent collaborations
  • intellectual conflicts
  • scientific bias or personal animosity
  • parallel research activity, research work on a competing project
  • potential financial benefit
  • AND potential benefit from advanced knowledge of new work
Responding to reviewers:

  • In the end, the editors decide what's in the journal - not the reviewer
  • It is not an election - the editor can ignore the recommendations
  • Take all the comments - particularly the editor's - seriously
  • The editor can say "your paper is accepted subject to the actions of the reviewers...." It has not yet been accepted.
  • Vent about negative comments - but only for 10 minutes.
  • Mark every point on a copy of the manuscript
  • Fix all typos and mechanics IMMEDIATELY. (within 24 hours.)
  • Fix the criticisms, THEN and ONLY THEN worry about the rebut.
  • The comment expressed by reviewer 1 is incorrect because...
  • We have addressed....
  • Don't delay!


Monday, November 23, 2009

What I've learned about PhD committees

[Update: thanks to some excellent feedback, I thought I'd revisit this article and clean it up. I've tried to be clear where the revisions are, and only made minor clarifications to the body of the text where warranted.]

This has been a really bad week for me. It started with a botched committee meeting, a death in the family, and then a series of technical errors that have annoyed me to no end. All that has made me walk away from my computer in frustration several times, only to return and find something else that upsets me. Unfortunately, the technical issues are mostly just that: technical. They're not something that other people will learn anything from, with the possible exception of this:

I understand dbSNP 130 has now begun to include cancer causing mutations and pretty much everything else in their annotated SNPs. And, of course, there doesn't seem to be any mention of this on the web. Knowing this, you obviously shouldn't use it for filtering out "neutral" changes. It won't work. (If you're working on genetic variations from RNA-seq like I am, this warning might save you a few hours of pain - or better, prevent severe embarrassment if you start talking about filtering in front of an audience, as a fellow grad student and friend of mine did recently.)
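To make the warning concrete, here is a minimal sketch of the kind of guard I mean. Everything in it is hypothetical: the record layout, the `clinically_annotated` flag and the function names are made up for illustration and are not taken from any real dbSNP export. The only point is that "it's in the known-SNP set" should not, on its own, be treated as "it's neutral".

```python
# Hypothetical known-SNP records keyed by (chromosome, position, ref, alt).
# The "clinically_annotated" field is an assumption for this sketch, not a
# real dbSNP column name.
known_snps = {
    ("chr1", 1000, "A", "G"): {"clinically_annotated": False},
    ("chr7", 5500, "C", "T"): {"clinically_annotated": True},  # e.g. a cancer-associated entry
}

# Candidate variants called from the sequencing data (made-up values).
candidates = [
    ("chr1", 1000, "A", "G"),  # known, no clinical annotation
    ("chr7", 5500, "C", "T"),  # known, but clinically annotated
    ("chr9", 42, "G", "C"),    # not in the known set at all
]


def is_presumed_neutral(variant, known):
    """Treat a variant as neutral only if it is known AND not clinically annotated."""
    record = known.get(variant)
    return record is not None and not record["clinically_annotated"]


def filter_candidates(variants, known):
    """Drop presumed-neutral variants; keep novel and clinically annotated ones."""
    return [v for v in variants if not is_presumed_neutral(v, known)]


kept = filter_candidates(candidates, known_snps)
print(kept)
```

A naive filter that drops anything found in the known set would have thrown away the chr7 entry, which is exactly the failure described above. Whether the database you use exposes an annotation you can key on like this is something you'd have to check for yourself.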

Anyhow, the greater part of the lessons I learned this week were about Grad School, and what I learned about committee meetings can be summed up in a few quick points: [Note, this is advice on meeting with the committee as a body, not meeting with individual members.]
  • Your relationship with your committee is not [necessarily] a friendly one.
They may be friendly with you, but they're not there to give you friendly advice and guide you through your PhD. Instead, the committee (as a body) is really engaged in an adversarial relationship in which they are the gate keepers that will decide when you can leave this pit of doom, and they are the ones that will open the door at the end when they believe you're ready to depart. Yes, they do have the roadmap to letting you out, but they would much rather you figure it out yourself instead of asking them to help plan it. [To be clear, it is the job of your committee members individually to give you advice and help you out - and the job of your advisor to help you find that road map. The committee exists to make sure that you've satisfied the requirements. Every project is different, so you'll have to chart your own path - and the committee as a body knows where you'll end up, but not how you'll get there.]
  • Your committee is not interested in your progress - they're interested in your results.
The difference may be subtle, but it changes how I view my committee meetings. No more will I go in there with a "progress report" style presentation. Instead, I'm going to go in to present results, the same way I would if I were in a journal club presenting a paper. They don't care if I've learned new coding languages, solved 12 cold cases and rescued a baby from a burning house. They only want to know what my results look like - because those will go into my thesis, and that's all that matters. (Don't bother asking what they think your thesis should include... you should decide that, and then they'll tell you afterwards if you're wrong - see "roadmap" in the first point.)
  • Your committee is not expecting great things from you - they want you to know what they know.
Actually, they expect you to memorize useless details, be able to regurgitate the names of people in your field blindly, and know which journal has the highest impact factor in your field. What they're really after is that you should be able to point to the people they know in the field and explain how they solved the problems you're working on. If you know who they know, you'll know whose papers they read. [This probably sounds much more harsh than I meant it to be. Getting a PhD means you're an expert in your field, and thus know all the details - when you know all of them, that's when you're ready to leave. The purpose of the committee as a body is to ensure you're an expert, not that you're destined for a Nobel prize. The only criteria they have to judge you on is what they know about your field. So knowing what they know about your field is the way to show that you're an expert - after all, those are the questions they'll ask you to determine if you're right about what you know. And yes, as undergrads, we all learned that the right answer to a question is what the professor gave in his notes, not what you think the right answer is.]
  • When your committee asks you an opinion question, they aren't asking your opinion - they're asking their opinion.
This should be obvious to any 1st year undergrad student, but as a grad student, we tend to forget it. Professors may ask a question that starts with "what do you think about/is...." The correct answer is not what you think it is - it's what THEY think it is. (Remember this subtle point - it will probably be needed in your defense as well.) [Again, somewhat harsh, but it's like I've said above - they're asking you questions to test if you know the answers - and the correct answers, in their mind, will be what they know and what they believe. One on one, you can discuss and debate these issues with your committee members as individuals, but my committee meetings rarely seem to be discussions.]
  • Your committee won't know why your results are important unless you explicitly explain it to them.
Again, this is something you learn as an undergrad, but may have faded with time. A committee member looked at my presentation and said at the end (paraphrased) "You're just turning a crank and out pops Venn diagrams". Obviously, I didn't do a good job of explaining the 26,000 lines of code I've written and the novel algorithms that went into it.
  • Your committee will change their minds - and not know it.
Don't expect that your committee will remember what they told you last time... they don't. [It's probably been a year or more since last time you met. We all forget.] My last committee meeting, I was told (explicitly) I should not include my ChIP-Seq work in my thesis. This time, they told me I'd be crazy to leave it out. (They may have used a different word... I was somewhat in awe at this point in the conversation.) [Clarification: What details may seem important to you are probably insignificant in their lives. Don't expect them to remember for you, and a year is a long time - things may have changed.]
  • Don't expect sympathy from professors.
Once you've irritated all your committee members by doing nothing but turning cranks, remember that your job is just to keep producing results - that's all that matters. When your committee has just discussed that you're not turning the crank fast enough, your advisor isn't going to come for a friendly chat to find out why - they'll just send you notes that antagonize you. They assume that nothing is going on in your life and that your results are just not there because you have become lazy. They were in grad school once, so they know that your inability to conquer impossible problems is just because you're off playing ping pong or getting coffee. (Be prepared for this - it's the inevitable result of a bad committee meeting, if you haven't taken my advice above.) [Again, harsh, but yeah, I was upset. Still, this was just an example. I do play ping pong, but I don't drink coffee, and yes, I did get a sarcastic email from my supervisor - probably deserved after such a poor presentation to my committee. As they say, Your Mileage May Vary - if you're lucky enough to have a supervisor that is holding your hand through the process, that's great - but don't expect it. PhDs are all about preparing the student for the real world - and the real world is harsh.]
  • Professors are very good at juggling tasks, and the only way to learn is trial by fire
Since I've discussed it several times with my advisor that I'm doing too many things, and that's how it's been all year, maybe it shouldn't be a surprise I can't focus on turning out papers. It seems to me that the people who make the grade to become profs are the ones that are able to write grants while juggling 4 projects, and are able to make progress in all of them. That clearly makes sense - lousy PhDs don't make good profs. However, those people who make it to professorship are (in my opinion) often the ones that are naturally good at managing their own tasks. For those of us trying to manage too many tasks, don't expect them to help manage your priorities - they do it instinctively for themselves, and they expect you to do it instinctively too - even when your priorities are 180 degrees opposite from what you thought they were. [One of the major lessons I've learned is that in grad school, priorities are what you make them. Your committee exists to make sure they don't slip too far from what they think you need to accomplish - as much as we may all want hand holding, professors are busy people, and your priorities are exactly that: Yours to set and to juggle.]

So, there you have it - it's been an educational week. I've learned:
  1. What a PhD committee is for.
  2. How to talk to and answer questions from committee members.
  3. What to expect from my committee and doing research.
  4. That I need to completely re-organize the way I manage my tasks.
While I'm helpless to do anything about the botched committee meeting, I have been able to work on that last point. I've changed how I manage my software, how I interact with my colleagues, what projects get my time, and I'm making a point of saying No to things that won't get me out of here. With luck, that will put me back on track - which is what my committee wanted in the first place, right?


Tuesday, September 22, 2009

Time to publish?

Although not quite the first time I've been told that I'm slacking in my life, I got the lecture from my supervisor yesterday. To paraphrase: "You're sitting on data. Publish it now!"

I guess there's a spectrum of people out there in research: those who publish fast and furious, and those who publish slowly and painstakingly. I guess I'm on the far end of the spectrum: I really like to make sure that the data I have is really right before pushing it out the door.

This particular data set was collected about a year and a half ago, back when 36-bp Illumina reads were all the rage, so yes, I've been sitting on it for a long time. However, if you read my notebooks, there's a clear evolution. Even in my database, the table names are marked as "tbl_run5_", so you can get the idea of how many times I've done this analysis. (I didn't start with the database on the first pass.)

At this point, and as of late last week (aka, Thursday), I'm finally convinced that my analysis of the data is reproducible, reliable and accurate - and I'm thrilled to get it down in a paper. I just look back at the lab book full of markings and have to wonder what would have happened if I'd have published earlier... bleh!

So, this is my own personal dilemma: How do you publish as quickly as possible without opening the door to making mistakes? I always err on the side of caution, but judging by what makes it out to publication, maybe that's not the right path. I've heard stories of people sending results they knew to be incorrect to the reviewers with the assumption that they could fix it up by the time the reviewers come back with comments...

Balancing publication quality, quantity and speed is probably another of those necessary skills that grad students will just magically pick up somewhere along the way towards getting a PhD. (Others in the list include teaching a class of 300 undergraduates, getting and keeping grants and starting up your own lab.)

I think I'm going to spend a few minutes this afternoon (between writing paragraphs, maybe?) looking for a good grad school HOWTO. The few I've come across haven't dealt with this particular subject, but I'm sure it's out there somewhere.


Thursday, September 17, 2009

3 year post doc? I hope not!

I started replying to a comment left on my blog the other day and then realized it warranted a little more than just a footnote on my last entry.

This comment was left by "Mikael":

[...] you can still do a post-doc even if you don't think you'll continue in academia. I've noticed many life science companies (especially big pharmas) consider it a big plus if you've done say 3 years of post-doc.

I definitely agree that it's worth doing a post-doc, even if you decide you don't want to go on through the academic pathway. I'm beginning to think that the best time to make that decision (ivory tower vs indentured slavery) may actually be during your post-doc, since that will be the closest you come to being a professor before making the decision. As a graduate student, I'm not sure I am fully aware of risks and rewards of the academic lifestyle. (I haven't yet taken a course on the subject, and one only gets so much of an idea through exposure to professors.)

However, at this point, I can't stand the idea of doing a 3 year post doc. After 6 years of undergrads, 2.5 years of masters, 3 years of (co-)running my own company, and about 3.5 years of doing a PhD by the time I'm done... well, 3 more years of school is about as appealing as going back to the wet lab. (No, glassware and I don't really get along.)

If I'm going to do a post-doc (and I probably will), it will be a short and sweet one - no more than a year and a half at the longest. I have friends who are stuck in 4-5 year post-docs and have heard of people doing 10-year post-docs. I know what it means to be a post-doc for that long: "Not a good career building move." If you're not getting publications out quickly in your post-doc, I can imagine it won't reflect well on your C.V., destroying your chances of moving into the limited number of faculty positions - and wreaking havoc on your chances of getting grants.

Still, it's more about what you're doing than how long you're doing it. I'd consider a longer post doc if it's in a great lab with the possibility of many good publications. If there's one thing I've learned from discussions with collaborators and friends who are years ahead of me, it's that getting into a lab where publications aren't forthcoming - and where you're not happy - can burn you out of science quickly.

Given that I've spent this long as a science student (and it's probably far too late for me to change my mind on becoming a professional musician or photographer), I want to make sure that I end up somewhere where I'm happy with the work and can make reasonable progress: this is a search that I'm taking pretty seriously.

[And, just for the record, if a company needs me to do 3 years of post-doc at this point, I have to wonder just who it is I'm competing with for that job - and what it is that they think you learn in your 2nd and 3rd years as a postdoc.]

With that in mind, I'm also going to put my (somewhat redacted) resume up on the web in the next few days. It might be a little early - but as I said, I'm taking this seriously.

In the meantime, since I want to actually graduate soon, I'd better go see if my analyses were successful. (=


Tuesday, September 15, 2009

Depressing view of Academia

So I officially started going through available post-doc positions this week, now that I'm back from my vacation. I'm still trying to figure out what I want to do when I finish my PhD next year (assuming I do...), and of course, I came back to the academia vs. industry question.

In weighing the evidence, a friend pointed me to this article on the problems facing new scientists in academia. Somehow, it does a nice job of dissuading me from thinking about going down that route - although I'm not completely convinced industry is the way to go yet either.

Read for yourself: Real Lives and White Lies in the Funding of Scientific Research


Tuesday, September 1, 2009

How much time I spend...

I was just thinking about the division of time amongst the various things I work on - and realized it's pretty bizarre. Unlike most grad students, I have to interface with people using my software for many different analysis types - some of which are "production" quality. That has its own challenges, but I'll leave that for another day.

I figured I could probably recreate my average week in a pie chart form, covering the work I've been doing...

Honestly, though, it's just an estimate - and the sum is actually more than 40 hours a week. (I do work in the evenings, sometimes - and support for FindPeaks happens when I check my email in the evening, too. To compensate, I may have been stingy on the hours spent goofing off....)

Anyhow, I think it would be an interesting project to try to keep track of how I spend my time. Maybe I'll give it a try when I come back from vacation. (Yes, I'll be away next week.)

Still, even from this estimate, three things are very clear:
  1. I need to spend more time upfront writing tests for my software to cut down on debugging
  2. I need to spend more time reading journals.
  3. I am clearly underestimating the time I spend playing Ping Pong. But hey, I work through lunch!


Thursday, July 23, 2009

What I wish people had told me when I started graduate studies

I suggested to my friend (who finished his masters degree last year) that he should know everything there is to know about grad school and should write a book on it - and he declined.

So, I spent some time thinking this morning about what advice I'd pass along to grad students who are just starting out, which resulted in this post: 10 things I'd like to tell new grad students. If I do a second post on the topic, it might include advice specifically targeted for bioinformatics students, but that's a long way off. Either way, please feel free to add comments on this post. I'm sure I have missed things.

At any rate, this advice is for people who are already in grad school or are about to start. None of this will convince you to stay here if you're doubting yourself (we all did), or if you're trying to weather the ups and downs (we all have them too...). There are already excellent resources out there for that, to which I can't hope to contribute anything. So, for those of you who are ready to stick it out, here's my advice:

  1. Keep good notes! Believe it or not, just about everything you ever do as a graduate student will some day be useful - you'll be writing a presentation and realize you just need that one number you collected in your first 8 months in the lab, or you'll remember that you did some long drawn out set of scripts at one point and running them again would be useful - if only you'd written them down... I guarantee that good and complete notes will knock at least a month or two off of time you spend in school.

  2. Keep yourself motivated. I won't beat around the bush here. There will be days you don't want to work. There might even be weeks or months. Sometimes it'll be burnout, sometimes it'll just be depression (experiments don't always work) and sometimes it'll just be plain laziness. Whatever it is, you're going to have to find a way to power through it and keep your momentum. One of the best pieces of advice I've had is to start each day or week by setting goals and then holding myself to them. You can seriously improve your work ethic this way.

  3. Keep sight of the big picture. Remember that you're working on something innovative and creative, and the little details like software bugs and optimizing protocols are just stepping stones along the way. It's easy to get sucked into the mindset that your job is only those details, but if you keep sight of the big picture, you'll have an easier time evaluating what it is you really need to be doing. (Will that bug help you get a paper? No? Maybe you should spend more time writing out a more efficient algorithm anyway...) Your goal is to graduate with some cool research - and you shouldn't be working on the details unless they get you there - or accomplish a goal that is clearly useful to you.

  4. Learn everything you can. One of those magic things about grad school is that you're given free rein to learn. Use it! Learn new software applications, learn new subjects, learn new hobbies. You never know what will be a help to you in the future or will help you get a job down the road. I learned to snowboard while doing my masters degree, and it turned out to be the only way to get sunshine in January in Vancouver, which is probably the only reason I don't get S.A.D. every year (and spend a month or two without the ability to get things done.) Seriously - Just about every skill you pick up will help you out somehow... one day.

  5. Build your community. Grad school is also one of those few times in your life when everyone is willing to talk to you. You've already established you're not a fool by finishing your undergrad and getting into graduate school, so nearly everyone will be willing to give you the time of day and most of them will even be willing to give you advice on how to make your life easier. You never know who might be able to help you later in life, and you never know who might turn into a great friend. Go forth and meet your peers! They'll provide support when (not if) you need it, and can help you work through your problems - and remember, your peers aren't all in the same lab or university as you.

  6. Don't be afraid to take/ignore advice. Ok, so you asked the Nobel prize winner that hangs out in the cafeteria for some advice on your project. First of all, remember that it's just advice. If you're doing science, no one actually knows the answer (unless they beat you to the experiment), so take all the advice you get with a grain of salt. Remember to judge the advice upon its merits, rather than upon the status of the person who gives it. Nobel prize winners can give advice that's every bit as bad as the next person's, and even a "lowly undergrad" can point you in the right direction now and then. Even if the Nobel prize winning scientist tells you your idea is the dumbest thing they've ever heard, don't be afraid to test it out now and then. Of course, don't forget that sometimes dumb ideas really are dumb, and sometimes it is worth listening to the advice. A good scientist will eventually figure out the difference and know which leads to test out. In the meantime, be courageous and try to be insightful when evaluating what people tell you.

  7. Don't fall prey to false economy reasoning. One of those things I've gotten used to hearing is comments like "I don't have time to clean up my code - I need to put in new features!" or "I can't learn how to use a new tool, I just need the results." This is called false economy: when you forget that investments pay dividends in the long term, and only take the short term goals into account. For one good example, I learned how to use professional desktop publishing software, which forced me to spend a whole week making a single poster. Ever since then, I've been able to reuse the template and rely on that learned skill set to bang out posters in about 2 days each - and I've done 7 or 8 since. By investing the time, I've become much more productive in the long term. (And yes, spending a week on cleaning up your code will save you weeks or months on debugging and ease future coding.)

  8. Be assertive about what you want. Many beginning grad students are intimidated by the people around them just because everyone else is more senior, and that can be devastating to your own ambitions. You can easily get sucked into other people's projects, goals or even politics if you're afraid to strike out on your own. Remember that you are an individual and you have your own tasks to accomplish. Be assertive about what you think will help you achieve those goals and get you through the project. On the other hand, don't forget to respect your peers - you probably won't make it through if you piss them all off.

  9. Nothing will be perfect. Several years in, this one still gets me. No publication is ever perfect, your results don't need to be bulletproof before you publish them and posters will never tell you the whole story. Do the best you can, and try to do as much as you can, and revisions will help you fix everything else up as you go. That is to say, try to balance out your ability to pay attention to details (don't be sloppy), but don't expect to have every last detail in place before you start writing it up.

  10. Write lots. As a grad student, your productivity is measured by your ability to publish what you've done. However, that's not the only measure you should take of your time as a grad student. Write as much as you can, and practice your communication skills. The more you write, the more you learn about how to get your point across. Write emails to build up your network, write a blog to practice getting your point across, write publications to build your resume and write notes to cultivate it as a habit. The more you write, and it doesn't really matter what you write, the better off you'll be.

So, there you have it. 10 things you can do that will enhance your grad student experience. If you need more advice, try this site, which has more collected advice than any other place I've ever seen.

Good luck and good publishing!


Thursday, July 9, 2009

New Tool: KeepNote

Obviously I haven't updated much here lately - I've been pretty busy and inspiration hasn't struck me much in the last few days to get anything written. However, I started using some new software this morning, and I'm enjoying it so much I figured I have to share.

One of the big problems I have, as a bioinformatician, is keeping track of all the notes and one-off scripts I write. I don't want to use SVN, because it's just a repository with no organization. I don't want to use a wiki, because it's a huge hassle to maintain for small projects, and I hate using text files.

The compromise, it seems, is to use standards compliant files with a hell of a wrapper around them that does the organization for you, and the one I found is called KeepNote. The project page and downloads can be found at The software is available for all major OS (Linux, Mac and even Windows), and can be installed relatively quickly and (for the most part) painlessly. (Linux builds are missing a library in the dependencies, but that can be figured out pretty quickly - just apt-get the missing lib and re-install if you hit this problem.)

While it may not fit everyone's workflow, my few hours of using it have already helped me get my tools organized and assembled in a logical manner, and it's allowed me to remove a load of files from my desktop. There are still bugs with it: I had to manually do some configuration of the web browser, text editor and such before I could get started, but so far I haven't hit any of the bugs.

It also claims to help you organize notes - which I can clearly see. Next time I go to a conference, I'll be using this for recording and organizing the usual 30-40 pages of notes I take.

For me, this falls under the heading of required tools for bioinformaticians and students alike and I look forward to seeing the project evolve and grow.


Tuesday, March 24, 2009

Decision time

Well, now that I've heard that there's a distinct possibility that I might be done my PhD in about a year, it's time to start making some decisions. Frankly, I didn't think I'd be done that quickly - although, really, I'm not done yet. I have a lot of publications to put together, and things to make sense of before I leave, but the clock to start figuring out what to do next has officially begun.

I suppose all of those post-doc blogs I've been reading for the last year have influenced me somewhat: I'm going to look for a lab where I'll find a good mentor, a good environment, and a commitment to publishing and completing post-docs relatively quickly. Although that sounds simple, judging by other blogs I've been reading, it's probably not all that easy to work out. Add to that the fact that my significant other isn't interested in leaving Vancouver (and that I would prefer to stay here as well), and I think this will be a difficult process.

I do need to put together a timeline, however - and since I'm not yet entirely convinced which track I should follow (academic vs industry), it's going to be a somewhat complex timeline. Anyhow, the point of blogging this is that it's an excellent way to open communication channels with people you wouldn't be able to connect with in person - and the first one I'd like to open is to ask readers if they have any suggestions.

Input at this time would be VERY welcome, both on the point of academia vs. industry, as well as what I should be looking for in a good post-doc position, if that ends up being the path I go down. (=

Anyhow, just to mention, I have another blog post coming, but I'll save it for tomorrow. I'd like to comment on another series of blog posts from John Hawks and Daniel MacArthur. I'm sure the whole blogosphere has heard all about the subject of training bioinformatics students from both the biology and computer science paths by now, but I feel I have something unique to talk about on that issue. In the meantime, I'd better get back to debugging and testing code. FindPeaks has a very cool new method of comparing different samples - and I'd like to get the testing finished. (=


Thursday, January 8, 2009

The Future of FindPeaks

At the end of my committee meeting, last month, my advisors suggested I spend less time on engineering questions, and more time on the biology of the research I'm working on. Since that means spending more time on the cancer biology project, and less on FindPeaks, I've been spending some time thinking about how I want to proceed forward - and I think the answer is to work smarter on FindPeaks. (No, I'm not dropping FindPeaks development. It's just too much fun.)

For me, the amusing part of it is that FindPeaks is already on its 4th major structural iteration. Matthew Bainbridge wrote the first, I duplicated it by re-writing its code for the second version, then came the first round of major upgrades in version 3.1, and then I did the massive cleanup that resulted in the 3.2 branch. After all that, why would I want to write another version?

Somewhere along the line, I've realized that there are several major engineering things that could be done that would make FindPeaks faster, more versatile and able to provide more insight into the biology of ChIP-Seq and similar experiments. Most of the changes are a reflection of the fact that the underlying aligners that are being used have changed. When I first got involved we were using Eland 0.3 (?), which was simple compared to the tools we now have available. It just aligned each fragment individually and spit out the results, which left the filtering and sorting up to FindPeaks. Thus, early versions of FindPeaks were centred on those basic operations. As we moved to sorted formats like .map and _sorted.txt files, those issues have mostly disappeared, allowing more emphasis to be placed on the statistics and functionality.

At this point, I think we're coming to the next generation of biology problems - integrating FindPeaks into the wider toolset - and generating real knowledge about what's going on in the genome, and I think it's time for FindPeaks to evolve to fill that role, growing out to better use the information available in the sorted aligner results.

Ever since the end of my exam, I haven't been able to stop thinking of neat applications for FindPeaks and the rest of my tool kit - so, even if I end up focussing on the cancer biology that I've got in front of me, I'm still going to find the time to work on FindPeaks, to better take advantage of the information that FindPeaks isn't currently using.

I guess that desire to do things well, and to get at the answers that are hidden in the data is what drives us all to do science. And probably what drives grad students to work late into the night on their projects.... I think I see a few more late nights in the near future. (-;


Friday, December 12, 2008

new title

That's right - I'm no longer just a "graduate student," I'm now officially a "PhD Candidate."

Not that that means anything, and not that it was really worth the pain, but I guess it's some compensation for going through the adventure of comprehensive exams.

Pain, you ask? Why pain?

Well, there was the whole idea that I should learn all of the biology of cancer in 3 weeks, which I think I did a reasonably good job of doing, only to have VERY few questions on it. Instead, my committee members asked about the following topics:
  • What are common breast cancer drugs, and what are their mechanisms?
  • How does Blast work? What algorithms are used for analyzing DNA arrays?
  • What are ESTs, and how could you apply those technologies to your research?
In general, they're not bad topics, but I didn't study a single one of those, since they're utterly unrelated to my thesis.

The examiner who asked about ESTs was making a point, suggesting that I should use a Motif scanner to search my ~220Million DNA fragments (of ~36bp each) for motifs that would indicate a splice junction point, and then use the fragments on either side of the presumed break point (down to ~16-mers) and blast those against the genome. The more I think about this idea, the less I like it. I also don't have 5 years to let the job run.
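To put a rough number on that, here is a small sketch of the split-and-BLAST step using only the figures above (36 bp reads, 16-mer minimum on either side, ~220 million fragments). The function names are made up for illustration, and the final count is an upper bound that assumes every fragment survives the motif scan, which a real scan would not allow.

```python
READ_LENGTH = 36            # bp per fragment, from the text above
MIN_FLANK = 16              # shortest piece worth aligning, from the text above
N_FRAGMENTS = 220_000_000   # approximate number of fragments, from the text above


def split_points(read_length, min_flank):
    """Positions where a read can be cut so both pieces are >= min_flank long."""
    return list(range(min_flank, read_length - min_flank + 1))


def split_read(read, min_flank):
    """Return every (left, right) pair around a presumed break point."""
    return [(read[:i], read[i:]) for i in split_points(len(read), min_flank)]


points = split_points(READ_LENGTH, MIN_FLANK)

# Hypothetical upper bound: every fragment is a candidate, and each split
# produces two pieces that must be searched against the genome separately.
queries_upper_bound = N_FRAGMENTS * len(points) * 2

example = split_read("ACGT" * 9, MIN_FLANK)  # a dummy 36 bp read

print(f"break points per read: {points}")
print(f"worst-case number of searches: {queries_upper_bound:,}")
```

Even if the motif scan threw away the vast majority of fragments, what remains is still a very large number of short, low-specificity queries.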

Anyhow, the exam is over with. I could have answered the questions better, I could have done a better presentation, and I probably could have prepared differently, but it never would have occurred to me to study the blast algorithm, drug mechanisms and clustering techniques for arrays. Who knew?

If I had to do comps again, I probably wouldn't have done much differently. I asked my committee members for what topics they would like to discuss with me and what I needed to study, and then prepared for those questions. One committee member was very helpful in that respect, and I learned a lot preparing for the suggested questions. I think I even answered that set reasonably well. (nerves notwithstanding.)

At the end of the day, I'm just glad it's over. It's humbling to know how much you don't know.

In my presentation today, I closed with two quotes, and I thought I'd share them here. I'm sure all graduate students can immediately grasp their relevance.

“In the fields of observation chance favors only the prepared mind.”
--Louis Pasteur

“To achieve great things, two things are needed; a plan, and not quite enough time.”
--Leonard Bernstein
Edit: I suppose I shouldn't be quite so obtuse. Many of the things my committee brought up were to demonstrate that many of the issues facing 2nd generation sequencing have been tackled by other technologies in the past, and that knowing those technologies would be a good way to keep from reinventing the wheel. That part, I take as constructive criticism. I will have to read up on how array data is processed at some point, though I still don't think I'll use motif finding on 220Million+ fragments. (-;


Wednesday, December 10, 2008

countdown to comps continues..

Today has been a relatively productive day in one sense. I finally finished reading the Biology of Cancer (Weinberg), after about 2 solid weeks of doing a chapter a day. At about 60-75 pages a day, it was pretty intense, but I learned a LOT. (Well, how could I not?)

Since the last chapter was on drug design, an area I'm familiar with, it was pretty easy reading and went by pretty quickly. There are only so many times you can learn about Gleevec.

So I moved on to a few other review questions suggested by my committee... such as "Draw a gene" and "penetrance vs. expressivity," which made for a nice general review.

And then I moved on to a few papers. One of them discussed doing PET sequencing from 2nd generation machines to find chromosomal fusion points in cancer (Bashir et al., 2008). The math was interesting enough, but the end result was that they tested it on BACs. When I get around to doing PET on my samples, this will be a good review to make sure we get the parameters right, but the paper didn't go far enough, really, in my humble opinion. I was hoping for more.

A second paper that I went over was on identifying SNPs in 2nd generation sequencing, using bovine DNA (Tassell et al., 2008). I don't know what to make of their method though. They combined samples from 8 different types of cow (I didn't know there were that many types of cow!), and then sequenced the pool to a depth of 10x, so that on average, each read covering a specific location should reflect the sequence from a different breed. Maybe I'm missing something, but SNP calling on this should technically be impossible - even assuming you get 0% sequencing errors, how confident can you be in a SNP found only once? Anyhow, I had to abandon this paper... I just didn't understand how they could draw any conclusions at all from this data.
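A back-of-the-envelope version of that worry, as a sketch only: it assumes the 8 samples were pooled in equal amounts and that per-site coverage follows a Poisson distribution. Both assumptions are mine, not the paper's, and the names are purely illustrative.

```python
import math

TOTAL_DEPTH = 10.0   # pooled sequencing depth, from the text above
N_BREEDS = 8         # number of pooled samples, from the text above


def poisson_pmf(k, mean):
    """Probability of seeing exactly k reads when the expected count is mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def prob_at_least(k_min, mean):
    """Probability of seeing k_min or more reads."""
    return 1.0 - sum(poisson_pmf(k, mean) for k in range(k_min))


# Assumption: equal pooling, so each breed contributes depth / n_breeds.
per_breed = TOTAL_DEPTH / N_BREEDS

p_zero = poisson_pmf(0, per_breed)         # breed not sampled at all at a site
p_two_plus = prob_at_least(2, per_breed)   # breed seen at least twice at a site

print(f"expected reads per breed per site: {per_breed:.2f}")
print(f"P(breed not covered at a site):    {p_zero:.3f}")
print(f"P(breed covered 2+ times):         {p_two_plus:.3f}")
```

Under those assumptions, a variant carried by only one of the pooled animals would be missed entirely at roughly 29% of sites and seen twice or more at only about 36% of them, which is the heart of the "SNP found only once" problem.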

Finally, a colleague of mine (Thanks Simon!) recommended a review by Weinberg himself on "Mechanisms of Malignant Progression." (Weinberg, 2008). For anyone who ends up reading his textbook from cover to cover, I highly suggest following up with his review. Things have changed a bit from the time the textbook was published to now, which makes this review a timely followup. In particular, it fleshes out much of the textbook's discussion of Epithelial-Mesenchymal Transitions (EMTs). Not that I want to go into much detail, but it's clearly worth a browse - and it's only 4 pages long. (MUCH shorter than the 790 pages I had to read to understand what he was talking about in the first place.) (-:

So now, this leaves me with 1 day left - just enough time to gather together my 15 minute presentation, to review the chapter summaries from the Biology of Cancer, and still have enough time to freak out a little bit. Perfect timing.


Saturday, December 6, 2008

Nothing like reading to stimulate ideas

Well, this week has been exciting. The house sale completed last night, with only a few hiccups. Both we and the seller of the house we were buying got low-ball offers during the week, which gave the real estate agents lots to talk about, but never really made an impact. We had a few sleepless nights waiting to find out if the seller would drop our offer and take the competing one that came in, but in the end it all worked out.

On the more science-related side, despite the fact I'm not doing any real work, I've learned a lot, and had the chance to talk about a lot of ideas.

There's been a huge ongoing discussion about the qcal values, or calibrated base call scores that are appearing in Illumina runs these days. It's my understanding that in some cases, these scores are calibrated by looking at the number of perfect alignments, 1-off alignments, and so on, and using the SNP rate to identify some sort of metric which can be applied to identify an expected rate of mismatched base calls. Now, that's fine if you're sequencing an organism that has a genome identical to, or nearly identical to the reference genome. When you're working on cancer genomes, however, that approach may seriously bias your results for very obvious reasons. I've had this debate with three people this week, and I'm sure the conversation will continue on for a few more weeks.
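To make the concern concrete, here is a minimal sketch of alignment-based calibration. It is not Illumina's actual procedure: the function names, the assumed SNP rate and all of the counts are invented for illustration, and the tumour figure is deliberately exaggerated. The only fixed ingredient is the standard Phred definition, Q = -10 * log10(error probability).

```python
import math


def phred(error_rate):
    """Standard Phred scale: Q = -10 * log10(probability of a wrong base)."""
    return -10.0 * math.log10(error_rate)


def empirical_quality(mismatches, bases, assumed_variant_rate):
    """Estimate base quality from mismatches against the reference.

    Mismatches explained by the assumed variant (SNP) rate are subtracted,
    and whatever is left over is treated as sequencing error.
    """
    observed = mismatches / bases
    error = max(observed - assumed_variant_rate, 1e-9)  # avoid log10(0)
    return phred(error)


# Hypothetical counts: both samples have the same true sequencing error
# (1 in 1000 bases), but differ in how far they diverge from the reference.
BASES = 1_000_000
SEQ_ERRORS = 1_000
ASSUMED_SNP_RATE = 0.001

normal_mismatches = SEQ_ERRORS + 1_000   # real variants match the assumption
tumour_mismatches = SEQ_ERRORS + 5_000   # exaggerated, heavily mutated genome

q_normal = empirical_quality(normal_mismatches, BASES, ASSUMED_SNP_RATE)
q_tumour = empirical_quality(tumour_mismatches, BASES, ASSUMED_SNP_RATE)

print(f"calibrated quality, near-reference sample: Q{q_normal:.1f}")
print(f"calibrated quality, divergent sample:      Q{q_tumour:.1f}")
```

In this toy setup the sequencer is equally accurate in both runs, yet the divergent sample comes out at about Q23 instead of Q30, because real differences from the reference get counted as base-calling errors.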

In terms of studying for my comprehensive exam, I'm now done the first 12 chapters of the Weinberg "Biology of Cancer" textbook, and I seem to be retaining it fairly well. My girlfriend quizzed me on a few things last night, and I did reasonably well answering the questions. 6 more days, 4 more chapters to go.

The most interesting part of the studying was Thursday's seminar day. In preparation for the Genome Sciences Centre's bi-annual retreat, there was an all-day seminar series, in which many of the PIs spoke about their research. Incidentally, 3 of my committee members were speaking, so I figured it would be a good investment of my time to attend. (Co-incidentally, the 4th committee member was also speaking that day, but on campus, so I missed his talk.)

Indeed - having read so many chapters of the textbook on cancer biology, I was FAR better equipped to understand what I was hearing - and many of the research topics presented picked up exactly where the textbook left off. I also have a pretty good idea what questions they will be asking now: I can see where the questions during my committee meetings have come from; it's never far from the research they're most interested in. Finally, the big picture is coming together!

Anyhow, two specific things this week have stood out enough that I wanted to mention them here.

The first was the keynote speaker's talk on Thursday. Dr. Morag Park spoke about the environment of tumours, and how it has a major impact on the prognosis of the cancer patient. One thing that wasn't settled was why the environment is responding to the tumour at all. Is the reaction of the environment dictated by the tumour, making this just another element of the cancer biology, or does the environment have its own mechanism to detect growths, which is different in each person? This is definitely an area I hadn't put much thought into until seeing Dr. Park speak. (She was a very good speaker, I might add.)

The second item was something that came out of the textbook. They have a single paragraph at the end of chapter 12, which was bothering me. After discussing cancer stem cells, DNA damage and repair, and the whole works (500 pages of cancer research into the book...), they mention progeria. In progeria, children age dramatically quickly, such that a 12-14 year old has roughly the appearance of an 80-90 year old. It's a devastating disease. However, the textbook mentions it in the context of DNA damage, suggesting that the progression of this disease may be caused by general DNA damage sustained by the majority of cells in the body over the short course of the life of a progeria patient. This leaves me of two minds: 1) the DNA damage to the somatic cells of a patient would cause them to lose tissues more rapidly, which would have to be regenerated more quickly, causing more rapid degradation of tissues - shortening telomeres would take care of that. This could cause a more rapid aging process. However, 2) the textbook just finished describing how stem cells and rapidly reproducing progenitor cells, which are the precursors involved in tissue repair, are dramatically more sensitive to DNA damage. Wouldn't it be more likely then that people suffering with this disease are actually drawing down their supply of stem cells more quickly than people without DNA repair defects? All of their tissues may also suffer more rapid degradation than normal, but it's the stem cells which are clearly required for long term tissue maintenance. An interesting experiment could be done on these patients requiring no more than a few milliliters of blood - has their CD34+ ratio of cells dropped compared to non-sufferers of the disease? Alas, that's well outside of what I can do in the next couple of years, so I hope someone else gives this a whirl.

Anyhow, just some random thoughts. 6 days left till the exam!


Friday, November 28, 2008

It never rains, but it pours...

Today is a stressful day. Not only do I need to finish my thesis proposal revisions (which are not insignificant, because my committee wants me to focus more on the biology of cancer), but we're also in the middle of real estate negotiations. Somehow, this is more than my brain can handle on the same day... At least we should know by 2pm if our counter-offer was accepted on the sales portion of the transaction, which would officially trigger the countdown on the purchase portion of the transaction. (Of course, if it's not accepted, then more rounds of offers and counter-offers will probably take place this afternoon. WHEE!)

I'm just dreading the idea of doing my comps the same week as trying to arrange moving companies and insurance - and the million other things that need to be done if the real estate deal happens.

If anyone was wondering why my blog posts have dwindled down this past couple of weeks, well, now you know! If the deal does go through, you probably won't hear much from me for the rest of this year. Some of the key dates this month:
  • Dec 1st: hand in completed and reviewed Thesis Proposal
  • Dec 5th: Sales portion of real estate deal completes.
  • Dec 6th: remove subjects on the purchase, and begin the process of arranging the move
  • Dec 7th: Significant Other goes to Hong Kong for ~2 weeks!
  • Dec 12th: Comprehensive exam (9am sharp!)
  • Dec 13th: Start packing 2 houses like a madman!
  • Dec 22nd: Hanukkah
  • Dec 24th: Christmas
  • Dec 29th: Completion date on the new house
  • Dec 30th: Moving day
  • Dec 31st: New Years!
And now that I've procrastinated by writing this, it's time to get down to work. I seem to have stuff to do today.


Thursday, November 27, 2008

Should Bioinformatics be on your degree?

An interesting topic came up the other day - Should your specialization be on your graduate degree? Apparently, it's under consideration at my university and the faculty is consulting with staff and students to decide.

Unlike most consultation processes, this one got a lot of "reply all" comments, which showed two distinct responses, one for, and one against. (That there are two sides to this story shouldn't be a surprise, I hope!)

Those in favour claimed that having a generic M.Sc or Ph.D. really doesn't reflect the value of the work you've done in achieving the degree, and it should reflect the subject area you've contributed to. Since nearly all of the replies I saw were from people in the bioinformatics program, having a M.Sc. (Bioinformatics) or Ph.D. (Bioinformatics) just seems way cooler than a generic degree from the faculty of science. Future employers will look at your publication record, anyhow, not what's on your degree.

On the other hand, those against proposed that Bioinformatics is too new a field and is likely to be swallowed up by other fields in the future - making a Ph.D. (Bioinformatics) more of a liability than an advantage. Equally important, many researchers switch fields several times as they follow their research throughout the course of their career, meaning that the bioinformatics specialization could constrain you as you apply for jobs in the future.

So, what's the answer? I haven't the faintest idea. My masters is pretty darn plain, and no one would have the faintest idea that I did it in microbiology and immunology. I have to admit I was underwhelmed when I saw it for the first time... but when it comes time to apply for jobs, I might be very glad to avoid giving away any pretense that I might know some immunology!


Monday, November 3, 2008


It has been a VERY busy week since I last wrote. Mainly, that was due to my committee meeting on Friday, where I had to present my thesis proposal. I admit, there were a few things left hanging going into the presentation, but none of them will be hard to correct. As far as topics go for my comprehensive exams, it sounds like the majority of the work I need to do is to shore up my understanding of cancer. With a field that big, though, I have a lot of work to do.

Still, it was encouraging. There's a very good chance I could be wrapping up my PhD in 18-24 months. (=

Things have also been busy at home - we're still working on selling a condo in Vancouver, and had two showings and two open houses over the weekend. Considering the open houses were well attended, that is an encouraging sign.

FindPeaks has also had a busy weekend, even though I wasn't doing any coding myself. A system upgrade took FindPeaks off the web for a while and required a manual intervention to bring it back up. (Yay for the Systems Dept!) A bug was also found in all versions of 3.1 and 3.2, which could be fairly significant - and I'm still investigating. At this point, I've confirmed the bug, but haven't had a chance to identify if it's just for this one file, or for all files...

Several other FindPeaks projects are also going to be coming to the forefront this week: controls and automated file builds. Despite the bug I mentioned, FindPeaks would do well with an automated trunk build. More users would help me get more feedback, which would help me figure out what things people are using, so I can focus more on them. At least that's the idea. It might also recruit more developers, which would be a good thing.

And, of course, several new things have appeared that I would like to get going on. Bowtie is the first one. If it does multiple alignments (as it claims to), I'll be giving it a whirl as the new basis of some of my work on transcriptomes. At a rough glance, the predicted 35x speedup compared to Maq is a nifty enticement for me. Then there's the opportunity to do some code cleanup on the whole Vancouver package for command line parameter processing. A little work there could unify and clean up several thousand lines of code, and make new development much easier.

First things first, though, I need to figure out the source and the effects of that bug in FindPeaks!


Tuesday, September 30, 2008

More ups, less downs

This has been an interesting week, and it's only Tuesday. While last week dragged on - mainly because I was stuck at home sick - this week is flying by in a hurry. I have several topics to blog about (ABI SOLiD, Arrays vs ChIP-Seq, Excellent organizational tools, etc) but now find myself with enough work that I don't really have much time to blog on all of them today. Hopefully I can get around to it in the next couple days.

Last week, I blogged about my frustration with the GSC environment, but then found a few things to act as a counter-point. Whereas last week I was stuck at home and could only watch things as they unfolded from a distance, this week I'm back in the office and able to talk with my co-workers, and I've realized how valuable a tool that can be. Even if I'm frustrated with the jostling for position that goes on, there are fantastic people at the GSC who work hard and are happy to lend a hand. The GSC isn't perfect, but there are definitely fantastic people here.

That said, on a more personal note, I've had quite a few "perks" appear this week:
  • For the first time in my life, I've been asked to review a paper for a journal,
  • I've also become involved in several really cool projects,
  • I've had the opportunity to review some ABI SOLiD files for inclusion in FindPeaks,
  • I was even offered a chance to apply for a post-doc with a lab that I've been following closely. Talk about a flattering offer! (Unfortunately, I'm not close to graduating yet, so it's a bit premature. DRATS!)
Anyhow, it's stuff like that that keeps me writing the software and working hard. I guess that solves the question of why I try to do good work all of the time - a little bit of recognition goes a long way to keep me motivated. (=

Happy Tuesday, everyone.


Thursday, September 25, 2008

grad school - phase II?

I'm trying to keep up with my one-posting-per-day plan, but I considered abandoning it today. It's just hard to keep focussed long enough to write a post when you've been coughing all day. Fortunately, the coughing let up, the fever went away, and - even if my brain isn't back to normal - I'm feeling much better this afternoon than I was this morning.

Anyhow, spending the day on the couch gave me lots of time for reflection, even if my brain was at 1/4 capacity. I've been able to follow several email threads, and contribute to a couple along the way. And, of course, all this sitting around and watching email go by made me re-evaluate my attitude to grad school. Of course, this posting could just be because I'm sick and grumpy. Still, there's a nugget of truth in here somewhere...

Until today, I've been trying to write good quality code that can be used in our production pipelines - which some of it has. FindPeaks has been a great success in that respect. [EDITED]

However, several other pieces of my work were slated to be put into production for several months, until a few days ago. Now, they've all been discarded by the GSC in favour of other people's work. Of course, I can't complain about that - they're free to use whatever they want, and - despite doing my research at the GSC - I have to remind myself that I'm not a GSC employee, and they have no obligation to use my code.

To be fair, the production requirements for one element of the software changed this morning, and my code that was ready to be put in production a year ago is no longer cutting edge.

However, in hindsight, I guess I have to ask some questions: why have I been writing good code, when I could have been writing code that does only what I want, and doesn't handle the 100's of other cases that I don't come across in my research? What advantages were there for me to do a good job?
  • If I want to sell my code, it belongs to UBC. Clearly this is a non-starter.
  • If I want the GSC to use my code, I won't get any recognition for it. (That's the problem with writing a tool: Many people will get the results, but no one will ever know that it was my code in use, and unlike lab work, pipeline software doesn't seem to get included on publications.)
  • If I'm doing it just for myself, then there's the satisfaction of a job well done, but it distracts me from other projects I could be working on.
  • If I want to publish using my code, it doesn't have to be production ready - it only has to do the job I want it to, so there's clearly no advantage here.
Which leaves me with only one decent answer:
  • develop all my code freely on the web, and let other people participate. The more production-ready it is, the more people will use it. The more people who use it, the more my efforts are recognized - which in a way, is what grad school is all about:
Publications and recognition.

I seem to have more or less already adopted that model [EDITED]

Some time last year, I came to the realization that I should stop helping people who just take advantage of my willingness to work, but I think this is a new, more jaded version of it.

Viva la grad school.

[EDITED] I guess now I understand why people leave Academia.


Monday, June 16, 2008

Random Update on FP/Coding/etc.

I had promised to update my blog more often, but then failed miserably to follow through last week. I guess I have to chalk it up to unforeseen circumstances. On the bright side, it gave me the opportunity to come up with several things to discuss here.

1. Enerjy: I learned about this tool on Slashdot, last week while doing some of my usual lunch hour "open source news" perusal. I can unequivocally say that installing the Enerjy tool in Eclipse has improved my coding by unimaginable leaps and bounds. I tested it out on my Java codebase that has my FindPeaks application and the Transcriptome/Genome analysis tools, and was appalled by the number of suggestions it gave. Admittedly, I'm self taught in Java, but I thought I had grasped the "Zen" of Java by now, though the 2000+ warnings it gave disagreed. I've since been cleaning up the code like there's no tomorrow, and have brought it down to 533 warnings. The best part is that it pointed out several places where bugs were likely to have occurred, which have now all been cleaned up.

2. Threading has also come up this past week. Although I didn't "need" it, there was no way to get around it - learning threads was the appropriate solution to one problem that came up, so my development version is now beginning to include some thread management, which is likely to spread into the core algorithms. Who knew??

3. Random politics: If you're a grad student in a mixed academic/commercial environment, I have a word of warning for you: Not everyone there is looking out for your best interests. In fact, some people are probably looking out for their own interests, and they're definitely not the same as yours.

4. I read Michael Smith's biography this week. I was given a free copy by the Michael Smith Foundation for Health Research, who were kind enough to provide my funding for the next few years. It's fantastic to understand a lot of the history behind the British Columbia Biotechnology scene. I wish I'd read that before having worked at Zymeworks. That would have provided me with a lot more insight into the organizations and people I met along the way. Hindsight is 20/20.

5. FindPeaks 4.0: Yes, I'm skipping plans for a FindPeaks 3.3. I've changed well over 12,000 lines of code, according to the automated scripts that report such things, which includes a major refactoring and the start I made at threading. If that doesn't warrant a major version number change, I don't know what does.

Well, on that note, back to coding... I'm going to be competing with people here, in the near future, so I had best be productive!


Tuesday, February 5, 2008

AGBT and Sequencability

First of all, I figured I'd try to do some blogging from AGBT, while I'm there. I don't know how effective it'll be, or even how real-time, but we'll give it a shot. (Wireless in Linux on the Vostro 1000 isn't particularly reliable, and I don't even know how accessible internet will be.)

Second, what I wrote yesterday wasn't very clear, so I thought I'd take one more stab at it.

Sequencability (or mappability) is a direct measure of how well you'll be able to sequence a genome using short reads. Thus, by definition, de novo sequencing of a genome is going to be a direct result of the sequencability of that genome. Unfortunately, when people talk about the sequencability, they talk about it in terms of "X% of the genome is sequencable", which means "sequencability is not zero for X% of the genome."

Unfortunately, even if sequencability is not zero, it doesn't mean you can generate all of the sequences (even if you could do 100% random fragments, which we can't), indicating that much of the genome beyond that magical "X% sequencable" is still really not assemblable. (Wow, that's such a bad word.)

Fortunately, sequencability is a function of the length of the reads used, and as the read length increases, so does sequencability.
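To make that concrete, here's a toy sketch in Python. It is not the code I actually use, and it assumes a simplified definition that I'm making up for this illustration: a position counts as sequencable if the read starting there occurs exactly once in the genome. The function name `unique_fraction` and the six-base "genome" are hypothetical.

```python
from collections import Counter


def unique_fraction(genome: str, read_length: int) -> float:
    """Fraction of read start positions whose read occurs exactly once.

    Assumption for this sketch: "sequencable" means the read of the given
    length starting at that position is unique within the genome.
    """
    n_reads = len(genome) - read_length + 1
    if n_reads <= 0:
        raise ValueError("read_length must not exceed the genome length")
    # Every possible read of this length, one per start position.
    reads = [genome[i:i + read_length] for i in range(n_reads)]
    counts = Counter(reads)
    unique = sum(1 for read in reads if counts[read] == 1)
    return unique / n_reads


toy_genome = "ACACGT"  # made-up sequence with one repeated 2-mer ("AC")
for k in (2, 3):
    print(k, f"{unique_fraction(toy_genome, k):.2f}")
# prints "2 0.60" then "3 1.00"
```

With 2-base reads the repeated "AC" can't be placed, so only 3 of 5 positions are unique; at 3 bases every read is unique. Real genomes need much longer reads, of course, but the trend is the same.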

Thus, there's hope that if we increase the read length of the Illumina machines, or someone else comes up with a way to do longer sequences with the same throughput (e.g. ABI Solid, or 454's GS FLX), the assemblability of the genome will increase accordingly. All of this goes hand in hand: longer reads and better lab techniques always make a big contribution to the end results.

Personally, I think the real answer lies in using a variety of techniques: Paired-End-Tags to span difficult-to-sequence areas (e.g. low or zero sequencability regions), and Single-End-Tags to get high coverage... and hey, throw in a few BAC and EST reads for good luck. (=


Tuesday, January 22, 2008

Solexa Transcriptome Shotgun: Transcriptome Alignments vs. Genome Alignments

First off, I was up late trying to finish off one of my many projects, so I didn't get a lot of sleep. Thus, if my writing is more incoherent than usual, that's probably why. And now on with the show.

I'm switching gears a bit, today. I haven't finished my discussion about Aligners, of which I still want to talk about Exonerate in detail, and then discuss some of the other aligners in overview. (For instance, the one that I found out about today, called GMAP, a Genomic Mapping and Alignment Program for mRNA and EST sequences.) Anyhow, the point is that part of the purpose of using an aligner is to align to something in particular, such as a genome or a contig, but selecting what you align your sequences back to is a big issue.

When you're re-sequencing a genome, you map back to the genome template. Yes, you're probably sequencing a different individual, so you'll find occasional sections that don't match, but most humans are ~99% identical, and you can look into SNP (single nucleotide polymorphism) databases to find out if the differences you're seeing are already commonly known changes. Of course, if you're re-sequencing Craig Venter, you don't need to worry about SNPs as much. Fortunately, most of us are sequencing more exciting genomes and so forth.

When you're sequencing a genome you've never sequenced before, you can't do mapping at all. There are various assemblers for this, e.g. Velvet (written by Daniel Zerbino, who is a lot of fun to hang out with at conferences, I might add...), SSake (written by Rene Warren, who incidentally also works at the GSC, although several floors above me), and Euler (which I'd never heard of till I googled the web page for Velvet...). The key point: you don't need to worry about what you're mapping back to when you do de novo assembly, since you're creating your own map. I'll digress further for one brief comment: assembly from Solexa/Illumina sequences is a bad idea, because they're so short!

Moving right along, we come to the third thing people are sequencing these days: Transcriptomes. (Yes, I'm ignoring cloned libraries... they're so 90's!) Transcriptomes are essentially a snapshot of the mRNA in a set of cells at a given point in time. Of course, mRNA is rather unstable, so protocols have been developed to convert mRNA to cDNA (complementary DNA), which is essentially a copy of the mRNA in DNA form. (Yes, I'm ignoring some complexity here, because it makes for a better story.) But I'm getting ahead of myself. Let's talk about the mRNA, but be aware that the sequencing is actually done on cDNA.

mRNA is an interesting beast. Unlike Genomic DNA, it's a more refined creature. For Eukaryotes, the mRNA is processed by the cell, removing some segments that are non-coding. Once the cell removes these segments (labeled introns), and leaves other segments (labeled exons), we have a sequence of bases that no longer matches the genomic DNA sequence from which it came. Sure, there are short segments that map back very well (i.e. the exons), but if you take a small random snippet from the mRNA, there's a small chance that it might overlap the boundaries between two exons, which means the bases you have won't map back nicely to the genomic point of origin. That can be a serious problem.
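Here's a small Python sketch of that problem. Everything in it is made up for illustration: the exon and intron sequences, the 6-base read length, and the use of exact substring matching as a stand-in for an ungapped aligner. It only looks at the forward strand and ignores mismatches.

```python
# Hypothetical toy gene: two exons separated by one intron.
exon1 = "ATGGCCAAC"
intron = "GTAAGTTTTTCAG"
exon2 = "CATTCCTGA"

genome = exon1 + intron + exon2  # what we align against
mrna = exon1 + exon2             # what actually gets sequenced (as cDNA)

read_length = 6
junction = len(exon1)  # position in the mRNA where exon1 meets exon2

unmapped = []
for start in range(len(mrna) - read_length + 1):
    read = mrna[start:start + read_length]
    # Stand-in for an ungapped aligner: exact substring search.
    if read not in genome:
        unmapped.append(start)

# Reads that straddle the junction start before it and end after it.
spanning = [s for s in range(len(mrna) - read_length + 1)
            if s < junction < s + read_length]

print("unmapped read starts:", unmapped)
print("junction-spanning read starts:", spanning)
```

In this toy case the reads that fail to map are exactly the ones that straddle the exon/exon junction. In real data a read with only a base or two of overhang can still match by chance, so the correspondence isn't always this clean.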

Sure, you say, we can do a gapped alignment, and try to find two points where this sequence originated, with one big gap. If you're sophisticated, you'll even know that introns usually carry signals that indicate their presence. And yes, you'd be right, we can do that. Unfortunately, for most Solexa runs, you get 20,000,000+ sequences. At 10 seconds a sequence (which doesn't seem like much, really), how long would it take to do that alignment?

Too long.
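For the record, here's the back-of-the-envelope arithmetic in Python, assuming a single CPU working through the reads one at a time (the variable names and the 365.25-day year are just my conventions for this sketch):

```python
reads = 20_000_000        # sequences in a typical run, from the figure above
seconds_per_read = 10     # assumed cost of one gapped alignment

total_seconds = reads * seconds_per_read
days = total_seconds / 86_400   # seconds in a day
years = days / 365.25

print(f"{total_seconds:,} seconds")
print(f"about {days:,.0f} days, or {years:.1f} years")
```

That works out to 200,000,000 seconds, which is roughly 2,315 days, or a bit over six years on one processor. You can throw a cluster at it, but it's still a lot of compute for one run.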

So most of the time, we don't do gapped alignments. Instead, we have two choices:
  1. Align against the genome, and throw away reads that we can't align (i.e. those that overlap intron/exon boundaries.)

  2. Align against a collection of known coding DNA sequences

Number two isn't a bad option: it already has all the introns spliced out, so you don't need to worry about missing those alignments. Unfortunately, there are several issues with this approach:
  • Do we really know all of the coding DNA sequences? For most species, probably not, but this is a great idea for animals like Drosophila. (I tried this yesterday on a fruit fly Illumina run and it worked VERY well.)

  • Many transcripts are very similar. This isn't a problem with the data, but with your alignment program. If your aligner doesn't handle multi-matches (like Eland), this will never work.

  • Many transcripts are very similar. Did I say that already? Actually, it causes more problems. How do you know which transcript was really the source of the sequence? I have several ways to get around it, but nothing foolproof yet.

  • Aligning to a transcript is hard to visualize. This is going to be one of my next projects... but with all the fantastic genomes out there, I'm still not aware of a good transcriptome visualization tool.
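The multi-match issue from the list above is easy to show with a toy example in Python. The two "isoforms" below are invented sequences that share their first exon, and `matching_transcripts` is a hypothetical helper that uses exact substring matching instead of a real aligner.

```python
# Two made-up transcripts that share the same first nine bases.
transcripts = {
    "isoform_a": "ATGGCCAACCATTCCTGA",
    "isoform_b": "ATGGCCAACGGGTTTAAA",
}


def matching_transcripts(read: str) -> list[str]:
    """Return the names of every transcript that contains the read exactly."""
    return sorted(name for name, seq in transcripts.items() if read in seq)


shared_read = "ATGGCC"   # comes from the exon both isoforms share
unique_read = "CATTCC"   # only present in isoform_a

print(shared_read, matching_transcripts(shared_read))
print(unique_read, matching_transcripts(unique_read))
```

The first read hits both isoforms, so an aligner that only reports unique matches would discard it, and one that reports all matches still can't tell you which transcript it really came from. Only the second read is informative on its own.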

And that brings us to the conclusion. Aligning a transcriptome run against a genome and aligning it against a transcriptome both have serious problems, and there really are no good solutions for this yet.

For now, all I can do is run both: they tell you very different things, but both have fantastic potential. I haven't released my code for either one, yet, but they both work, and if you contact my supervisor, he'd probably be interested in collaborating, or possibly sharing the code.


Tuesday, December 4, 2007

Committee Meeting #1

I had my first PhD committee meeting this afternoon, and it went reasonably well. I don't know what I was expecting, but having spoken to 3 of the 4 committee members about my project multiple times over the past 5 days, I guess I wasn't exactly expecting to be raked over the coals.

My only real comment is that I should have spent a bit more time practicing my presentation - but who has time for stuff like that? I'm going to try to get the Paired End Tag code working tomorrow, and the paired Eland reader - somehow it's just way more important to do the work than to talk about it, which is what it feels like I've been doing lately.

Anyhow, so much for taking the evening off - I just spent the last half hour trying to figure out why someone in Wisconsin can't seem to get my FindPeaks application to work. I don't really get it - I can run my code with his file just fine. Anyhow, another thing to look at tomorrow.


Monday, December 3, 2007


Planning and reality often don't mix in grad school, I'm discovering. Since I have a committee meeting tomorrow, I thought I'd use today to finish my presentation and written documentation for it. Instead, I was told I had to have code running by 1pm that would process transcriptome data using paired-end Solexa reads. 5 hours later, I had code analyzing single-end transcriptome data, but nothing finished for my committee meeting.

On that note, I'm not going to blog about paired end data - I'll save that for another day, and instead, I'm going to go work on my presentation and report. However, I did want to provide a neat link Elaine passed to me: made with molecules. Jewelry for the rich and geeky. (I'm geeky enough to like it, but not rich enough to pay THAT much for it, even if I wore ornaments.)

Oh, and one last parting note... my laptop is being assembled today. If I wasn't going to be so busy tomorrow, I'd probably be checking its status every 10 minutes. The curse and blessing of real-time tracking data.


Thursday, September 20, 2007

Textbook Chapter

Lately I haven't had much to write about, as is clearly shown by the complete lack of content for the past month. I've been too busy to do much photography (I haven't even put up my most recent shots onto my web page)... but I have figured out some of what I'd like my blog to be, and I'm going to put some effort into breaking it up into two sections: photography and Grad School.

Of course, that will have to wait until I come back from my trip - oh, right - I don't think I mentioned I'll be in England, Scotland and Wales for about 10 days. This trip has been in the works for two years, so you can imagine that I've been looking forward to it... just a little bit. (-;

Anyhow, before I go, I have a few things to finish, one of which is a textbook chapter. I agreed to take this project on about a month ago, and since it's on an area in which I'm actively working, I thought it would be a good idea.

The topic of the chapter is a method or protocol called ChIP-Seq (or ChIPSeq, or ChIP-Solexa, or a few other things, actually), which is a combination of chromatin immunoprecipitation and a so-called "next-generation" sequencing technology. (If anyone wants to know more, send me a message, and I'll do a post on it.)

At any rate, when I was first approached about it, I was in the process of rewriting all of the production code that's being used to analyze the results - so I'm now intimately familiar with how that works. And, once that was done, I spent the next week doing a scholarship application, for which I had to do some lit review... and finally this week, I got to working on the chapter.
I spent a few days reading, a day outlining, and a few hours here and there doing some writing. It's slow work, to be honest, but it's getting there.

The funny part is that I found a great paper yesterday, that would have pointed me to the right place, and has a great introduction. Where did I find it, you ask? On my desk. I've had it for about 2 months - it's the paper on ChIP-Seq written by the guy who sits across the cubicle wall from me.

Anyhow, that's life. Now I just need to sit down and write like mad - I'd like to get a draft out before my vacation starts.
