Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Friday, October 2, 2009

Base quality by position

A colleague of mine was working on a nifty tool to give graphs of the base quality at each position in a read using Eland Export files, which could be incorporated into his pipeline. During a discussion about the length of time it should take to do that analysis (his script was taking an hour, and I said it should take about a minute to analyze 8M Illumina reads...), I ended up saying I'd write my own version to do the analysis, just to show how quickly it could be done.

Well, I was wrong about it taking about a minute. It turns out that the file has more than double the originally quoted 8 million reads (QC, no match and multi match reads were not previously filtered), and the whole file was bzipped, which adds to the processing time.

Fortunately, I didn't have to add bzip support into the reader, as tcezard (Tim) had already added a cool "PIPE" option for piping whatever data format I want into applications of the Vancouver Short Read Analysis Package. Thus, I was able to do the following:
time bzcat /archive/solexa1_4/analysis2/HS1406/42E6FAAXX_7/42E6FAAXX_7_2_export.txt.bz2 | java6 src/projects/maq_utilities/QualityReport -input PIPE -output /projects/afejes/temp -aligner elandext

Quite a neat use of piping, really.
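
For anyone curious what the core of such a report looks like, here is a minimal Python sketch of per-position quality averaging over lines streamed from a pipe. It is not the QualityReport code from the package. The column index (the 10th tab-separated field) and the ASCII offset of 64 are assumptions about the export format, and the function name is made up for illustration.

```python
from typing import Iterable, List

# Assumptions, not details from the post: the quality string sits in the
# 10th tab-separated column (index 9) and uses an ASCII offset of 64.
QUALITY_COLUMN = 9
PHRED_OFFSET = 64


def mean_quality_by_position(lines: Iterable[str],
                             column: int = QUALITY_COLUMN,
                             offset: int = PHRED_OFFSET) -> List[float]:
    """Return the mean quality score at each read position."""
    totals: List[int] = []  # sum of scores seen at each position
    counts: List[int] = []  # number of reads covering each position
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= column:
            continue  # skip malformed or truncated lines
        quality = fields[column]
        # Grow the accumulators if this read is longer than any seen so far.
        while len(totals) < len(quality):
            totals.append(0)
            counts.append(0)
        for i, ch in enumerate(quality):
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]
```

Because it only needs an iterable of lines, it could be handed sys.stdin at the end of a bzcat pipe, and it makes a single pass over the data without holding the reads in memory.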

Anyhow, the fun part is that the library was a 100-mer Illumina run, and it makes a pretty picture. Slapping the output into OpenOffice yields the following graph:



I didn't realize quality dropped so dramatically at 100bp - although I remember when qualities looked like that for 32bp reads...

Anyhow, I'll include this tool in FindPeaks 4.0.8 in case anyone is interested in it. And for the record, this run took 10 minutes, of which about 4 were taken up by bzcat. Of the 16.7M reads in the file, only 1.5M were aligned, probably due to the poor quality out beyond 60-70bp.


Friday, September 4, 2009

Best Software Licence Ever!

I was looking for some example code of a Mahalanobis distance calculator and came across what I happen to believe is the most entertaining license I have ever seen. I had to share:

The program is free to use for non-commercial academic purposes, but for course works, you must understand what is going inside to use. The program can be used, modified, or re-distributed for any purposes if you or one of your group understand codes (the one must come to court if court cases occur.) Please contact the authors if you are interested in using the program without meeting the above conditions.


The Source.


Wednesday, July 29, 2009

I hate facebook

I have a short rant to end the day, brought on by my ever increasing tie-in between the web and my desktop (now KDE 4.3):

I hate facebook.

It's not that I hate it the way I hate Myspace, which I hate because it's so easy to make horribly annoying web pages. It's not even that I hate it the way I hate Microsoft, which I hate because their business engages in unethical practices.

I hate it because it's a walled garden. Not that I have a problem with walled gardens in principle, but it's just so inaccessible - which is exactly what the facebook owners want. If you can only get at facebook through the facebook interface, you have to see their ads, which makes them money, if you ever get sucked into them. (You now have to manually opt out of having your picture used in ads for your friends... it's a new option for your profile in your security settings, if you don't believe me.)

Seriously, the whole facebook wall can be recreated with twitter, the photo albums with flickr, the private messages with gmail.... and all of it can be tied together in one place. Frankly, I suspect that's what Google's "Wave" will be.

If I could integrate my twitter account with my wall on facebook, that would be seriously useful - but why should I invest the energy to update my status twice? Why should I have to maintain my own web page AND the profile on facebook?

Yes, it's a minor rant, but I just wanted to put that out there. Facebook is a great idea and a leader of its genre, but in the end, it's going to die if its community starts drifting towards equivalent services that are more easily integrated into the desktop. I can now update twitter using an applet on my desktop - but facebook still requires a login so that I can see their ads.

Anyhow, if you don't believe me about where this is all going, wait to see what Google Wave and Chrome do for you. I'm willing to bet desktop publishing will have a whole new meaning, and on-line communities will be a part of your computer experience even before you open your browser window.

For a taste of what's now on my desktop, check out the OpenDesktop, Remember the Milk and microblog (or even Choqok) plasmoids.


Monday, July 27, 2009

Picard code contribution

Update 2: I should point out that the subject of this post has been resolved. I'll mark it down to a misunderstanding. The patches I submitted were initially rejected, but were accepted several days later, once the purpose of the patches was clarified with the developers. I will leave the rest of the post here, for posterity's sake, and because I think that there is some merit to the points I made, even if they were misguided in their target.


Today is going to be a very blog-ful day. I just seem to have a lot to rant about. I'll be blaming it on the spider and a lack of sleep.

One of the things that thrills me about Open Source software is the ability for anyone to make contributions (above and beyond the ability to share and understand the source code) - and I was ecstatic when I discovered the Java-based Picard project, an open source set of libraries for working with SAM/BAM files. I've been slowly reading through the code, as I'd like to use it in my project for reading/writing SAM format files - which nearly all of the aligners available are moving towards.

One of those wonderful tools that I use for my own development is called Enerjy. It's an Eclipse plug-in designed to help you write better Java code by making suggestions about things that can be improved. A lot of its suggestions are simple: re-order imports to make them alphabetical (and more readable), fill in missing javadoc flags, etc. They're not key pieces, but they are important to maintain your code's good health. It also points the way to things that will likely cause bugs (such as doing string comparisons with the "==" operator).

While reading through the Picard libraries and code, Enerjy threw more than 1600 warnings. It's not in bad shape, but it's got a lot of little "problems" that could easily be fixed. Mainly a lot of missing javadoc, un-cast generic types, arrays being passed between classes and the like. As part of my efforts to read through and understand the code, which I want to do before using it, I figured I'd fix these details. As I ramped up into the more complex warnings, I wanted to start small while still making a contribution. Open source at its best, right?

The sad part of the tale is that open source only works when the community's contributions are welcome. Apparently, with Picard, code cleaning and maintenance aren't. My first set of patches (dealing mainly with the trivial warnings) was rejected. With that reception, I'm not going to waste my time submitting the second set of changes I made. That's kind of sad, in my opinion. I expressly told them that these patches were just a small start and that I'd begin making larger code contributions as my familiarity with the code improves - and at this rate, my familiarity with the code is definitely not going to mature as quickly, since I have much less motivation to clean up their warnings if they themselves aren't interested in fixing them.

At any rate, perhaps I should have known. Open source in science usually means people have agendas about what they'd like to accomplish with the software - and including contributions may mean including someone on a publication downstream if and when it does become published. I don't know if that was the case here: it was well within the project leader's rights to reject my patches on any grounds they like, but I can't say it makes me happy. I still don't enjoy staring at 1600+ warnings every time I open Eclipse.

The only lesson I take away from this is that next time I see "Open Source" software, I'll remember that just because it's open source, it doesn't mean all contributions are welcome - I should have confirmed with the developers before touching the code that they are open to small changes, and not just bug fixes. In the future, I suppose I'll be tempering my excitement for open source science software projects.

Update: A friend of mine pointed me to a link that's highly related. Anyone with an open source project (or interested in getting started in one) should check out this blog post titled Teaching people to fish.


Thursday, July 9, 2009

New Tool: KeepNote

Obviously I haven't updated much here lately - I've been pretty busy and inspiration hasn't struck me much in the last few days to get anything written. However, I started using some new software this morning, and I'm enjoying it so much I figured I have to share.

One of the big problems I have, as a bioinformatician, is keeping track of all the notes and one-off scripts I write. I don't want to use SVN, because it's just a repository with no organization. I don't want to use a wiki, because it's a huge hassle to maintain for small projects. And I hate using text files.

The compromise, it seems, is to use standards-compliant files with a hell of a wrapper around them that does the organization for you, and the one I found is called KeepNote. The project page and downloads can be found at http://rasm.ods.org/keepnote/. The software is available for all major OSes (Linux, Mac and even Windows), and can be installed relatively quickly and (for the most part) painlessly. (Linux builds are missing a library in the dependencies, but that can be figured out pretty quickly - just apt-get the missing lib and re-install if you hit this problem.)

While it may not fit everyone's workflow, my few hours of using it have already helped me get my tools organized and assembled in a logical manner, and it's allowed me to remove a load of files from my desktop. There are still bugs with it: I had to manually do some configuration of the web browser, text editor and such before I could get started, but so far I haven't hit any of the bugs.

It also claims to help you organize notes - which I can clearly see. Next time I go to a conference, I'll be using this for recording and organizing the usual 30-40 pages of notes I take.

For me, this falls under the heading of required tools for bioinformaticians and students alike and I look forward to seeing the project evolve and grow.


Monday, June 22, 2009

4 Freedoms of Research

I'm going to venture off the beaten track for a few minutes. Ever since the discussion about conference blogging started to take off, I've been thinking about what the rights of scientists really are - and then came to the conclusion that there really aren't any. There is no scientist's manifesto or equivalent oath that scientists take upon receiving their degree. We don't wear an iron ring, as engineers do, to signify our commitment to integrity...

So, I figured I should do my little part to fix that. I'd like to propose the following 4 basic freedoms to research, without which science can not flourish.
  1. Freedom to explore new areas
  2. Freedom to share your results
  3. Freedom to access findings from other scientists
  4. Freedom to verify findings from other scientists
Broadly, these rights should be self-evident. They are tightly intermingled, and can not be separated from each other:
  • The right to explore new ideas depends on us being able to trust and verify the results of experiments upon which our exploration is based.
  • The right to share information is contingent upon other groups being able to access those results.
  • The purpose of exploring new research opportunities is to share those results with people who can build upon them.
  • Being able to verify findings from other groups requires that we have access to their results.
In fact, they are so tightly mingled, that they are a direct consequence of the scientific method itself.
  1. Ask a question that explores a new area
  2. Use your prior knowledge, or access the literature to make a best guess as to what the answer is
  3. Test your result and confirm/verify if your guess matches the outcome
  4. Share your results with the community.
(I liked the phrasing on this site) Of course if your question in step 1 is not new, you're performing the verification step.

There are constraints on what we are allowed to do as scientists as well: we have to respect the ethics of the field in which we do our exploring, and we have to respect the fact that ultimately we are responsible to report to the people who fund the work.

However, that's where we start to see problems. To the best of my knowledge, funding sources define the directions science is able to explore. We saw the U.S. restrict funding to science in order to throttle research in various fields (violating Research Freedom #1) for the past 8 years, which was effectively able to completely halt stem cell research, and suppress alternative fuel sources, etc. In the long term, this technique won't work, because the scientists migrate to where the funding is. As the U.S. restores funding to these areas, the science is returning. Unfortunately, it's Canada's turn, with the conservative government (featuring a science minister who doesn't believe in evolution) removing all funding from genomics research. The cycle of ignorance continues.

Moving along, and clearly in a related vein, Freedom #4 is also a problem of funding. Researchers who would like to verify other groups' findings (a key responsibility of the basic peer-review process) aren't funded to do this type of work. While admitting my lack of exposure to granting committees, I've never heard of a grant being given to verify someone else's findings. However, this is the basic way by which scientists are held accountable. If no one can repeat your work, you will have many questions to answer - and yet the funding for ensuring accountability is rarely present.

The real threat to an open scientific community occurs with the remaining two Freedoms: sharing and access. If we're unable to discuss the developments in our field, or are not even able to gain information on the latest work done, then science will come grinding to a major halt. We'll waste all of our time and money exploring areas that have been exhaustively covered, or worse yet, come to the wrong conclusions about what areas are worth exploring in our ignorance of what's really going on.

Ironically, Freedoms 2 and 3 are the most eroded in the scientific community today. Even considering only the academic world, where freedoms are taken for granted, our interactions with the forums for sharing (and accessing) information are horribly stunted:
  • We do not routinely share negative results (causing unnecessary duplication and wasting resources)
  • We must pay to have our results shared in journals (limiting what can be shared)
  • We must pay to access other scientists' results in journals (limiting what can be accessed)
It's trivial to think of other examples of how these two freedoms are being eroded. Unfortunately, it's not so easy to think of how to restore these basic rights to science, although there are a few things we can all do to encourage collaboration and sharing of information:
  • Build open source scientific software and collaborate to improve it - reducing duplication of effort
  • Publish in open access journals to help disseminate knowledge and bring down the barriers to access
  • Maintain blogs to help disseminate knowledge that is not publishable
If all scientists took advantage of these tools and opportunities to further collaborative research, I think we'd find a shift away from conferences towards online collaboration and the development of tools favoring faster and more efficient communication. This, in turn, would provide a significant speed up in the generation of ideas and technologies, leading to more efficient and productive research - something I believe all scientists would like to achieve.

To close, I'd like to propose a hypothesis of my own:
By guaranteeing the four freedoms of research, we will be able to accomplish higher quality research, more efficient use of resources and more frequent breakthroughs in science.
Now, all I need to do is to get someone to fund the research to prove this, but first, I'll have to see what I can find in the literature...


Thursday, May 28, 2009

Science Cartoons - 1

I've been busy putting together a poster for a student conference I'm attending next week in Winnipeg, which has distracted me from just about everything else I had planned to accomplish. Not to worry, though, I'll be back on it in a day or two.

Anyhow, part of the fun this time around is that the poster competition includes criteria for judging that involve how pretty the poster is - so I decided to go all out and learn to draw with inkscape. Instead of the traditional background section on a poster, I went with a comic theme. It's a great way to use text and figures to get across complex topics very quickly. So, for the next couple days, I thought I'd post some of them, one at a time.

Here's the first, called "Second Generation Sequencing". It's my least favorite of the bunch, since the images feel somewhat sloppy, but it's not a bad try for a first pass. Yes, I do own the copyrights - and you do need to ask permission to re-use them.


Saturday, March 28, 2009

Taking control of your documents

It's always a mystery to me how bioinformaticians, who are generally steeped in computer culture, can be Microsoft users. Not that Microsoft's software is necessarily bad, (although I maintain that it doesn't come with all of the tools built in that bioinformaticians need, depending on what form of bioinformatics you're doing), but for those who have been immersed in the high tech environment, Microsoft's well documented business practices and bad-neighbor behaviour seem to be somewhat unenlightened. That led me to leave the MS ecosystem in search of more friendly environments nearly a decade ago.

Ever since then, I've been trying to move people away from Microsoft products and towards either the truly open Linux ecosystem, or the proprietary (and less open) Apple Macintosh ecosystem. (I run 3 Linux machines and a Mac laptop at home.) As part of that move - and probably the most important part - I always suggest people take control of their documents and not hand them over to Microsoft's trust.

One of the great proponents of this is Rob Weir, who has a vested interest in the process, but is able to provide a fantastically objective perspective on the subject, in my opinion. (Microsoft employees frequently disagree.)

Anyhow, I just thought it was worth linking to a particular article of his, on that subject. Even if you don't want to move away from your Microsoft supplied word processor, he gives advice on how to keep your documents as open as possible. I highly recommend you give this article a quick read - and maybe take some of Mr. Weir's advice.

http://www.robweir.com/blog/2009/03/taking-control-of-your-documents.html


Thursday, March 26, 2009

TomTom has no Linux support?

I'm still procrastinating - a plumber is supposed to show up to cut a hole in my ceiling in a few minutes, basically as exploratory surgery on my new house, in order to find a leak that's developed in the pipes leading away from the washer and dryer. So, I thought I'd spend the intervening moments doing something utterly useless. I looked up TomTom's web site and took a look at what they have to offer.

If you don't know TomTom, they're a company that produces GPS units for personal and car use. They've recently shot to fame because Microsoft decided to sue them for a bunch of really pointless patents. The most interesting ones of the bunch are the ones that Microsoft seems to think are being infringed just because TomTom is using Linux.

Anyhow, this post wasn't going to be about the patents, since I already gave my opinion of that. Instead, since I'd been thinking about buying a GPS unit for a while, I thought it might be worth buying one from someone who uses embedded Linux - and I'd like to support TomTom in their fight against the Redmond Monopoly. Unfortunately - and this is the part that boggles my mind - TomTom offers absolutely zero support for people who run Linux as their computer operating system. Like many other companies, they're a Windows/Mac support only shop.

This strikes me as rather silly - all of the open source users out there would probably be interested in buying an open source GPS, and would probably be happy to support TomTom in their fight... but they've completely neglected that market. They've generated a great swelling of goodwill in many communities by standing up to Microsoft's bullying, but then completely shut that same market segment out of purchasing their products.

Well, that's some brilliant strategy right there. I only hope TomTom changes their mind at some point - since otherwise all that goodwill is just going right down the toilet...

And thinking of plumbing, again, it's time to go see about a hole in my ceiling.


Wednesday, February 25, 2009

Microsoft Sues TomTom over patents

I saw a link to Microsoft suing a Linux-based GPS maker, TomTom, which made me wonder what Microsoft is up to. Some people were saying that this is Microsoft's way of attacking Linux, but I thought not. I figured Microsoft probably has something more sly up its sleeve.

Actually, I was disappointed.

I went into the legal document (the complaint) to find out what patents Microsoft is suing over... and was astounded by how bad the patents are. Given the recent Bilski ruling, I think this is really Microsoft looking for a soft target on which to test the waters and see how valid its patents are in the post-Bilski court environment... Of course, I think these are probably some of Microsoft's softest patents. I have a hard time seeing how any of them will stand up in court. (That is, pass the obviousness test and, simultaneously, the transformative test proposed in Bilski.)

If Microsoft wins this case, it'll be back to claiming Linux violates 200+ patents. If it loses the case, I'm willing to bet we won't hear that particular line of FUD again. I can't imagine any of the 200+ patents it says that Linux violates are any better than the crap it's enforcing here.

Anyhow, for your perusal, if you'd like to see what Microsoft engineers have been patenting in the last decade, here are the 8 that Microsoft is trying to enforce. Happy reading:

6,175,789

Summary: Attaching any form of a computer to a car.

7,054,745

Summary: Giving driving instructions from the perspective of the driver.

6,704,032

Summary: Having an interface that lets you scroll and pan around, changing the focus of the scroll.

7,117,286

Summary: A computer that interacts or docks with a car stereo.

6,202,008

Summary: A computer in your car... with Internet access!

5,579,517

Summary: File names that aren't all the same length - in one operating system.

5,758,352

Summary: File names that aren't all the same length - in one operating system... again.

6,256,642

Summary: A file system for flash-erasable, programmable, read-only memory (FEProm).

Overwhelmed by the brilliance at Microsoft yet? (-;


Wednesday, November 19, 2008

Has Apple gone too far?

As a bioinformatician, I enjoy a good looking piece of computer hardware and, for the last few years, the best looking hardware around has been the Apple Macs. I've even thought about buying one of their new MacBooks, although for the same specs, you can pick up a Dell on sale at 1/3rd of the price, so it's hardly a good deal. I really can't see myself running anything other than Linux on it, though, so despite the beautiful engineering, I can't see myself paying ~$300 for an OS I'd just remove. (I was even upset at paying ~$50 for a copy of Windows XP with my current laptop. Drop me a line if you want to buy the license - it doesn't even have a valid EULA... but that's another story.)

Anyhow, I've got to admit, Apple has finally managed to turn me off completely. Check out this article. To paraphrase, Apple has decided to follow suit with Microsoft and Intel in order to prevent you from enjoying the content you own in the way you'd like to use it. In other words, Mac OSX is now claiming control over your media files. (And, I might add, this is not about copyright, because the article shows uses that are clearly restricting "fair use" as well.) DRM is now built right into your hardware, and if your hardware isn't DRM enabled, you can't use it. Ouch.

I feel sorry for those people who have jumped the Microsoft ship just to end up in the Apple camp and are about to discover that Apple doesn't have their best interests at heart either. Why shouldn't you be able to show a movie on an external monitor or projector?

In the long run, this is probably good advertising for GNU/Linux, which doesn't enforce media company greed on its users. So, if anyone wants a free Ubuntu disk to make their Apple hardware work for them instead of against them, here you go.


Monday, November 17, 2008

Synergy!

Studying for my comprehensive exam is moving along slowly, rather disrupted by the poster I'm creating for the annual Cancer Conference taking place this week. I'm a little behind, but I'm getting there. Anyhow, I thought I'd take a minute to mention something that's come up several times in conversation this week: Synergy.

This is one of those applications that is an absolute must for bioinformatics students and researchers, or anyone who uses more than one computer. (Don't we all, these days?) I've been using it, myself, for about a year now, and it's one of the most useful applications on my computers.

Synergy is an open source software implementation of a KVM switch. Like a KVM switch, it can be used across operating systems - anything from Win95 to XP to OSX to Linux/*nix. It's not even hard to install. The beauty of it is really in its simplicity. Not only can your mouse and keyboard move across your computers, but it also carries a clipboard with it. Cutting and pasting between computers, on its own, is worth its weight in gold. (Though that probably depends on how much you have on your clipboard...)

Anyhow, just because not everyone is aware of this nifty little tool, I figured I'd mention it. Hopefully it's useful to a few people out there!


Tuesday, September 16, 2008

Processing Paired End Reads

Ok, I'm taking a break from journals. I didn't like the overly negative tone I was taking in those reviews, so I'm rethinking how I write about articles. Admittedly, I think criticism is in order for papers, but I can definitely focus on papers that are worth reviewing. Unfortunately, I'm rarely a fan of papers discussing bioinformatics applications, as I always feel like there's a hidden agenda behind them. Whether it's simply proving their application is the best, or just getting it out first, computer application papers are rarely impartial.

Anyhow, I have three issues to cover today:
  • FindPeaks is now on sourceforge
  • Put the "number of reads under a peak" to rest - permanently, I hope.
  • Bed files for different data sources.

The first one is pretty obvious. FindPeaks is now available under the GPL on sourceforge, and I hope people will participate in using and improving the software. We're aiming for our first tagged release on Friday, with frequent tags thereafter. Since I'm no longer the only developer on this project, it should continue moving forward quickly, even while I'm busy studying for my comps.

The second point is this silly notion that keeps coming up. "How many reads were found under each peak?" I'm quite sick of that question, because it really makes no sense. Unfortunately, this was a metric produced in Robertson et al.'s STAT1 data set, and I think other people have included it or copied it. The trouble is, it's rubbish.

The reason it worked in STAT1 was because they used a fixed length (or XSET) value on their data set. This allowed them to determine the exact length of each read, which allowed them to figure out how many reads were "contiguously linked in each peak." Readers who are paying attention will also realize what the second problem is... They didn't use subpeaks either. Once you start identifying subpeaks, you can no longer decide which peak a read spanning two peaks belongs to. Beyond that, what do you do with reads in a long tail? Are they part of the peak or not?

Anyhow, the best measure for a peak, at the moment at least, is the height of the peak. This can also include weighted reads, so that reads which are unlikely to contribute to a peak actually contribute less, bringing in a scaled value. After all, unless you have paired end tags, you really don't know how long the original DNA fragment was, which means you don't know where it ended.
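
To make the idea of a weighted peak height concrete, here is a rough Python sketch. It is not FindPeaks' actual algorithm. The linear fall-off between a minimum and maximum fragment length, the forward-strand-only reads and both function names are assumptions chosen purely for illustration.

```python
from typing import List, Sequence, Tuple


def read_weight(offset: int, min_len: int, max_len: int) -> float:
    """Weight a read contributes `offset` bases downstream of its start.

    Assumed model: full weight up to min_len, then a linear fall-off
    that reaches zero at max_len.
    """
    if offset < min_len:
        return 1.0
    if offset >= max_len:
        return 0.0
    return (max_len - offset) / (max_len - min_len)


def peak_height(read_starts: Sequence[int], min_len: int, max_len: int,
                region: Tuple[int, int]) -> float:
    """Maximum weighted coverage inside the half-open region [start, end)."""
    start, end = region
    coverage: List[float] = [0.0] * (end - start)
    for read_start in read_starts:
        # Only forward-strand reads are modelled, to keep the sketch short.
        lo = max(read_start, start)
        hi = min(read_start + max_len, end)
        for pos in range(lo, hi):
            coverage[pos - start] += read_weight(pos - read_start, min_len, max_len)
    return max(coverage, default=0.0)
```

The height is well defined no matter how the region is later split into subpeaks or how long its tails are, which is exactly where a per-peak read count runs into trouble.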

That also makes a nice segue to my last point. There are several ways of processing paired end tags. When it comes to peak calling it's pretty simple: you use the default method - you know where the two ends are, and they span the full fragment. For other applications, however, there are complexities.

If the data source is a transcriptome, your read didn't cover the intron, so you need to process the transcript to include breaks, when mapping it back to the genome. That's really a pain, but it is clearly the best way to visualize transcriptome PETs.

If the data source is unclear, or you don't know where the introns are (which is a distinct possibility), then you have to be more conservative, and not assume the extension of each tag. Thus, you end up with a "tag and bridge" format. I've included an illustration to make it clear.




So why bring it up? Because I've had several requests for the tag-and-bridge format, and my code only works on the default mode. Time to make a few adjustments.
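
As a sketch of what a tag-and-bridge record might look like in BED terms, here is one plausible encoding in Python: a 12-column BED line with two blocks (the tags), which genome browsers typically draw joined by a thin line (the bridge). Whether this matches the format FindPeaks will emit is an assumption. The function name, the fixed tag length and the default name and colour values are all hypothetical.

```python
def tag_and_bridge_bed(chrom: str, left_start: int, right_start: int,
                       tag_length: int, name: str = "pet",
                       strand: str = "+") -> str:
    """Format one paired-end tag as a BED12 line with two blocks.

    Coordinates are 0-based starts, as in BED. Both tags are assumed to
    have the same length.
    """
    if right_start < left_start + tag_length:
        # BED blocks may not overlap, so overlapping tags need other handling.
        raise ValueError("tags overlap; cannot draw a bridge between them")
    chrom_end = right_start + tag_length
    block_sizes = f"{tag_length},{tag_length}"
    block_starts = f"0,{right_start - left_start}"  # relative to chromStart
    fields = [chrom, left_start, chrom_end, name, 0, strand,
              left_start, chrom_end, "0,0,0", 2, block_sizes, block_starts]
    return "\t".join(str(field) for field in fields)
```

For example, tag_and_bridge_bed("chr1", 100, 300, 36) describes two 36bp tags starting at 100 and 300, with nothing assumed about the 164 bases in between.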
