Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: - Please come visit my blog there.

Thursday, June 26, 2008

Hello to Oregon State University.

I love google analytics, but I always forget to look at it. It's such a fantastic interface. Unfortunately, I can't see the individual IP addresses of the people who are hitting my blog, but I can see the cities from which I'm getting traffic.

Can you believe nearly 1/2rd of my traffic comes from Corvallis, Oregon? Something like 25 hits a day. (Yes, I average around 50 hits a day on my blog.) So, what's in Corvallis? The only thing I can find is the Oregon State University - which incidentally has a Center for Genome Research and Biocomputing, where they do some Illumina related Sequencing. Coincidence? Maybe.

Anyhow, I just thought I'd say hi to whoever it is in Corvallis that likes my blog. (=


Chip-Seq revisited

In the world of ChIP-Seq, things don't seem to slow down. A collaborator of mine pointed out the new application called MACS, which is yet another peak finder, written in python as an open source project. That makes 2 open source peak finders that I'm aware of: Useq and now MACS.

The interesting thing, to me, is the maturity of the code (in terms of features implemented). In neither cases is it all that great, as it's mostly lacking features I consider to be relatively basic, and relatively naive in terms of algorithms used for peak detection. Though, I suppose I've been working with FindPeaks long enough that nearly everything else will seem relatively basic in comparison.

However, I'll come back to a few more FP related things in a moment. I wanted to jump to another ChIP-Seq related item that I'd noticed this week. The Wold lab merged their Peak Finder software into a larger development package for Genomic and Transcriptome work, which I think they're calling ERANGE. I've long argued that the Peak Finding tools are really just a subset of the whole Illumina tool-set required, and it's nice to see other people doing this.

This is the development model I've been using, though I don't know if the wold lab does exactly the same thing. The high-level organization uses a core library set, core object set, and then FindPeaks and other projects just sit on top, using those shared layers. It's a reasonably efficient model. And, in a blog a while ago, I mentioned that I'd made a huge number of changes to my code after coming across the tool called "Enerjy". I sat down to figure out how many lines were changed in the last two weeks: 26,000+ lines of code, comments and javadoc. That's a startling figure, since my entire code base ( grep -r " " * | wc -l) is only 22,884 lines, of which 15,022 contain semi-colons.

Anyhow, I have several plans for the next couple of days:
  1. try to get my SVN repository to somewhere other people can work on it as well, and not just restricted to GSC developers.
  2. Improve the threading I've got going
  3. Clean up the documentation, where possible
  4. and work on the Adaptive mode code.

Hopefully, that'll clean things up a bit.

Back to FindPeaks itself, the latest news is that my Application note in Bioinformatics has been accepted. Actually, it was accepted about a week ago, but I'm still waiting to see it in the advanced access section - hopefully it won't be much longer. I also have a textbook chapter on ChIP-Seq coming out relatively soon, (I'm absolutely honoured to have been given that opportunity!) assuming I can get my changes done by Monday.

I don't think that'll be a problem.

Labels: , ,

Monday, June 23, 2008

Reject Bill C-61.

I was passed a link today (thanks Obi!) that I think is worth passing along to any Canadians who read my blog.

Canadian Industry Minister Jim Prentice introduced a bill (Bill C-61), which would basically shut down Canada's IT sector in terms of small start-ups ability to innovate, erode away most of the consumers rights to use the media and content they own, and introduce an even more strict version of the dreaded DMCA (Digitial Millenium Copyright Act). I highly suggest that every Canadian tell the government how unwanted that legislation is. (I certainly won't be voting for any party that makes it illegal to listen to music I've purchased on my own MP3 player.)

This link gives you a template letter that you can just send along to your MP, but also provides you the opportunity to edit the letter if you'd like. Here's my version:


June 24, 2008

The Honourable Hedy Fry
House of Commons
Parliament Buildings
Ottawa, Ontario K1A 0A6

These recipients will be CC'd on your email:
The Honourable Jim Prentice P.C, M.P.
5th floor, West Tower
C.D. Howe Building
235 Queen St.
Ottawa, Ontario K1A 0H5

The Honourable Josée Verner, P.C., M.P.
Minister of Canadian Heritage
25 Eddy Street
Gatineau, Quebec K1A 0M5

Subject: Please Stand Against the New Copyright Bill

Dear Madam,

I'm a constituent who has been following recent developments in Canadian copyright law. As a worker in the high-tech/biotech sector, I'm concerned about the terrible economic cost such a bill would have on Canadian innovation, research and competitive ability to thrive in the knowledge based economy. Poorly crafted bills such as this are on the slippery slope towards "respecting intellectual property" such as Microsoft's patent for a single mouse click - which, if enforced, would prevent any non-Microsoft customers from being unable to use their computer without paying a Microsoft tax.

The American DMCA legislation has been throughly criticized by many leading innovators in the software industry and in other IT based sectors for it's ability to be exploited for commercial gain at the expense of the consumer. To enact legislation such as that outlined in the Copyright bill (C-61) presented by the government on June 12th would go far further in eroding consumer rights, and outlaw the lawful use of copyrighted material. It does not take into account the needs of consumers and Canada's creative community who are exploiting the potential of digital technology. Like many other Canadians, I'm disappointed that this bill adopts an American approach to digital copyright laws, instead of crafting a Canadian approach.

Copyright laws were designed to strike a balance between the needs of the consumer, and those producing content. When the content producers are able to restrict the rights of the ordinary consumer, such as preventing a user from unlocking their own cell phone, or copying the contents of purchased DVDs for use in video iPods, then the purposed of copyright is no longer being served.

The current bill outlaws the currently legal practices above. This bill paves the road to importing the consumer file-sharing lawsuit strategy that has failed so spectacularly in the United States. Canada deserves better.

Please ensure that C-61 really is made for Canadians by allowing all Canadian stakeholders a say in its final contents. That means meaningful consultation in the coming months, and opening up Canada's copyright policy to more than just the special interests that lobbied behind the scenes for this law. As my MP, I urge you to represent my interests in the copyright debate.


Anthony Fejes


Friday, June 20, 2008

MAQ mapview format. - Updated

Well, I promised a quick update on the MAQ mapview format, after I wrote the interpreter for it, but there isn't much to say.

Key bits of information:
  • The file is simply a zipped text file. (Update: this file is not normally gzipped. While the .map file is zipped, the .mapview file is not normally found in the gzipped state.) You can unzip it with 'gunzip' on a linux system.
  • Reads are pre-sorted by chromosome, then position.
  • The format is 1 based, so if you're using a zero based format, you'll need to convert.
  • Starting points are for the "left end", ie, regardless of which strand the sequence aligned to, the matching position with the lowest position is reported.
  • Sequences are not contained in this file, but if you go back to the original fastq file you can retrieve the sequence. If you do so, you will need to obtain the reverse compliment of any read that maps to the reverse strand to map to your fasta file sequence. Forward strand sequences will map correctly to the fasta sequence.
  • Most of the fields are not useful for any form of analysis, and what's given is mostly incomprehensible.

The best information I had was from the MAQ manpage. In a slightly more readable format:

  1. read name
  2. chromosome
  3. position
  4. strand
  5. insert size from the outer coorniates of a pair
  6. paired flag
  7. mapping quality
  8. single-end mapping quality
  9. alternative mapping quality
  10. number of mismatches of the best hit
  11. sum of qualities of mismatched bases of the best hit
  12. number of 0-mismatch hits of the first 24bp
  13. number of 1-mismatch hits of the first 24bp on the reference
  14. length of the read
  15. read sequence
  16. quality

Most strikingly, you'll notice 16 fields are listed above, while the file appears to have 14 fields. It seems not all files have the last two fields. I don't know if it's just the file I have, or if it's usually that way. (Update: Actually, there are normally 16 fields. The files I was given were generated using the mapview -B flag, which strips out some of the information, which I believe are the final two fields. Thus, the comments above reflect the -B flag output only! Thanks to Ryan for catching that!)

If I come across anything else that needs to be added, I'll update this entry.

Labels: , ,

Thursday, June 19, 2008

Downloads and aligners

At the Genome Science Centre, we have a platform Scientific Advisory Board meeting coming up in the near future, so a lot of people are taking the time to put together some information on what's being going on since the last meeting. As part of that, I was asked to provide some information on the FindPeaks application, since it's been relatively successful. One of those things was how many times it's been downloaded.

Surprisingly, it's not an insigificant number: FindPeaks 2.x has been downloaded 280 times, while FindPeaks 3.1.x has been downloaded 325 times. I was amazed. Admittedly, I filtered out google and anything calling itself "spider", and didn't filter on unique IPs, but it's still more than I expected.

The other surprise to me was the institutions from which it was downloaded, which are scattered across the world. Very cool to see it's not just Vancouver people who took an interest in FindPeaks, though I somewhat suspected as much, already.

Thinking of FindPeaks, I put in one last marathon cleaning session for all my software. I'm sure I'm somewhere north of 15,000 lines of code modified in the past week. Even when you consider that some of those changes were massive search-and-replace jobs, (very few, really), or refactoring and renaming variables (many many more), it's still an impressive number for two weeks. With that done, I'm looking to see some significant gains in development speed, and in developer productivity. (I learned a LOT doing this.)

The last 122 warnings will just have to get cleaned up when I get around to it - although they're really only 4 warnings types repeated so many times. (The majority of them are just that the branched logic goes deeper than 5 levels, or that my objects have public variables. (You can only write so many accessor functions in a week.)

Anyhow, I'm looking forward to testing out FindPeaks 4.0 alphas starting tomorrow, and to put some documents together on it. (And catch up on the flood of emails I received in the last 2 days.

Finally, I'm working on a MAQ file interpreter for another project I'm working on. If I ever manage to figure out how to interpret the (nearly undocumented) files, I'll post that here. If anyone's done this already (though I didn't see anything publicly available on the web), I'd love to hear from them.


Labels: , , ,

Monday, June 16, 2008

Random Update on FP/Coding/etc.

I had promised to update my blog more often, but then failed miserably to follow through last week. I guess I have to chalk it up to unforeseen circumstances. On the bright side, it gave me the opportunity to come up with several things to discuss here.

1. Enerjy: I learned about this tool on Slashdot, last week while doing some of my usual lunch hour "open source news" perusal. I can unequivocally say that installing the Enerjy tool in Eclipse has improved my coding by unimaginable leaps and bounds. I tested it out on my Java codebase that has my FindPeaks application and the Transcriptome/Genome analysis tools, and was appalled by the number of suggestions it gave. Admittedly, I'm self taught in Java, but I thought I had grasped the "Zen" of Java by now, though the 2000+ warnings it gave disagreed. I've since been cleaning up the code like there's no tomorrow, and have brought it down to 533 warnings. The best part is that it pointed out several places where bugs were likely to have occurred, which have now all been cleaned up.

2. Threading has also come up this past week. Although I didn't "need" it, there was no way to get around it - learning threads was the appropriate solution to one problem that came up, so my development version is now beginning to include some thread management, which is likely to spread into the the core algorithms. Who knew??

3. Random politics: If you're a grad student in a mixed academic/commercial environment, I have a word of warning for you: Not everyone there is looking out for your best interests. In fact, some people are probably looking out for their own interests, and they're definitely not the same as yours.

4. I read Michael Smith's biography this week. I was given a free copy by the Michael Smith Foundation for Health Research, who were kind enough to provide my funding for the next few years. It's fantastic to understand a lot of the history behind the British Columbia Biotechnology scene. I wish I'd read that before having worked at Zymeworks. That would have provided me with a lot more insight into the organizations and people I met along the way. Hindsight is 20/20.

5. FindPeaks 4.0: Yes, I'm skipping plans for a FindPeaks 3.3. I've changed well over 12000+ lines of code, according to the automated scripts that report such things, which have included a major refactoring and the start I made at threading. If that doesn't warrant an major number version change, I don't know what does.

Well, on that note, back to coding... I'm going to be competing with people here, in the near future, so I had best be productive!

Labels: , , , , ,

Friday, June 6, 2008

Playing at work...

One of the fun things about doing science research is that you don't really know what the right answer is until you test a bunch of potential solutions and figure out which one (if any of them) is right. For the computer nerds and engineers out there, science research is really reverse engineering nature. Scientists come up with a theory, and then do their best to figure out whether they're right. I think that's one of the things that the non-scientifically inclined don't understand: We're constantly refining our theories to better incorporate the subtle nuances of how nature works. It's a game of trying to understand and really "grok" how the universe functions - which has always been one of my favorite pastimes.

Anyhow, where I was going with that is to talk a little about some of the work I've been doing lately, and why it doesn't seem like work.

I'm trying to split my time between three projects; developing new (and hopefully improved) methods and software for the analysis of ChIP-Seq experiments, developing new (and, again, hopefully improved) software for analysis of whole transcriptome shotgun Illumina sequencing software, and finally, trying to analyze the results of the whole transcriptome shotgun sequencing I'm supposed to be working on for my PhD work. Each project has it's own amusement value, and it's surprisingly different.

I spent the last 3 days hammering through the changes to my next version of the FindPeaks software I'm working on. It turned out that the development model I had in mind was completely impossible - which obviously slowed my progress down well beyond what I expected. (Apologies to my collaborators!) However, along the way, I tried several new things, which gave me a lot of opportunity to refine the application. The code runs faster, performs vastly more in depth analysis of the results, sheds light on several things we'd never observed before, and has the potential to change the way we interpret ChIP-Seq data. To a scientist, that's fun: the discovery of something unexpected.

With the transcriptome software work, I've only been working on it incidentally this week, so I have less progress to report. It represents more of the usual type of science... the grinding work that goes on behind the scenes. While I may not be excited about the work itself, I can look back over the past 6 months and see how far it's come. Maybe it's not "fun", exactly... but I can take pride in the progress. By next week, I'll be able to hand over a lot more code to my collaborators on this project.

The last project, the interpretation of my own data set, is something else entirely. It's not fun, and not really pride in work well done, but just the thrill of basic research. I'm not seeing anything unexpected - but I am seeing something new for the first time. Being the first to peel back the layers and see what's going on under the hood in a cancer cell line is pretty cool. I may not understand everything, but it's all new. I'm sure it's the same thrill, albeit somewhat less grand, of an explorer being the first to climb a mountain, visit some remote corner of the earth, or take the first step on the moon. The newness doesn't fade quickly.

Between the three projects, I definitely find that there's a lot of excitement and entertainment in the grad school experience. And, as my friend Abby reminded me today, if you're having fun doing something, it's not work.