Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Monday, July 7, 2008

MAQ map format

I'm currently working on a couple of coding projects, though the one that's urgent priority is to analyze my samples of interest using the MAQ aligner, which shouldn't be too bad. Unfortunately, there are no real standards when it comes to output from alignment programs, which makes it impossible to do a simple replacement of the previous aligner with the current aligner. Thus, Eland and MAQ are not interchangeable modular components. Switching between them takes a large investment in time and effort. Blah.

MAQ, however, is also more tricky because of it's poor documentation. I can understand why that is - if they stopped to document, they might not have time to write the code, which is fair enough since all of us in this field are busy. Compounding the problem, however, is MAQ's use of really cryptic file formats. Unfortunately, I don't know enough to really discuss that issue for the majority of the MAQ formats, but I did want to make two points in this morning's post.

The first is simply that MAQ's map format is a little sketchy - it's a binary format that depends heavily on c code calls to "sizeof()", which means that compiling code with different compilers may give very different results. That's always a "BAD THING"(tm), in my (possibly not so) humble opinion. It's also brutal for interoperability. I'm now working on coding up a java interface that reads the map file, allowing users to skip the mapview file - a gzipped human readable file - that takes up a lot of space. My issue is that I have to worry about many more things than I should need to: is the size of the variable the same between c and java? is the compiler for c big endian or little endian, and does java use the same conventions? how do I use masks to de-convolute the compressed flags? Admittedly, I love the idea of the binary file to save space for these large documents, but I'm not really a fan of the method.

The second is an appeal to people to stop coding in perl! Come on people, can't we pick a language that is maintainable? I'd like to show you a piece of code from the MAQ package, which I thought was responsible for converting the .map file to a .mapview file. (Fortunately, I was wrong, but it took me a day of staring at it before I realized I was looking at the wrong piece of code.)


while (<>) {
my @t = split;
next if ($t[8] <= $opts{q});
next if ($t[5] == 18 || $t[5] == 64 || $t[5] == 130);
my $do_break = ($t[1] ne $lasts || $t[2] - $lastc > $d)? 1 : 0;
if ($do_break) {
if ($lastc - $beginc > $opts{s}) { # skip short/unreliable regions
my $k = @regs;
my $flag = ($lastc - $beginc < $opts{i})? '*' : '.';
push(@reg_name, "$begins\t$beginc\t$lastc\t$flag");
my $p = \@{$regs[$k]};
foreach (@reg_seq) {
push(@$p, $_);
my @s = split;
push(@{$read{$s[0]}}, $k);
}
}
($begins, $beginc) = @t[1..2];
@reg_seq = ();
}
push(@reg_seq, join(" ", @t[0..3,5,8,13]));
($lasts, $lastc) = @t[1..2];
}


Take a look at that code. I have no idea what 18, 64 and 130 are supposed to be - and I have no documentation to tell me. It took me 2 days to figure out that $p is actually being pulled from the data file as a string, and then the content of $p becomes the name of the array further on, which is populated from the gzipped binary file. Even now, I'm still unsure about how the $regs variable gets populated.

If this were written in python, java, c.... well, anything but perl, it would probably be interpretable and maintainable. I've heard it said that if you want to fix a bug in perl, it's easier to rewrite the program than to look for the error - and after trying to read this script, I believe it.

Anyhow, while slagging perl, I should also mention that perl can be done in a maintainable way, and well documented. Unfortunately, the perl programming community seems to emphasize witty, compact and undecipherable code styles, so I've just rarely seen it done. If you're going to open source your code, why not do it in a way that makes your code re-usable?

5 Comments:

Blogger Dan F said...

Hey Anthony. You've convinced at least one grad student to take a look at Java. Not that I write all that much code, but you have to start somewhere! I'm looking forward to seeing your Java .map interface too.

July 10, 2008 5:36:00 PM PDT  
Blogger Anthony said...

Hey Dan,

Thanks for the comment. I don't advocate java for everything, but it has certainly provided a great framework for my work. (=

It won't be long till the .map code is available GPL'd. I just hope it'll be useful to other people!

Anthony

July 10, 2008 6:56:00 PM PDT  
Anonymous Anonymous said...

Don't blame sloppy code practices on Perl; it would be just as easy to obfuscate the line you mention in another language. A simple comment statement indicating what field $t actually IS and why it is okay to equal 18, 64, or 130 is all that is required here.

August 1, 2008 1:16:00 PM PDT  
Blogger Anthony said...

Hi Anonymous,

I don't blame sloppy code practises on Perl, but I do blame perl for encouraging programmers to write incredibly difficult to understand code.

Perl has a (well deserved) reputation for being unmaintainable. I've seen many people comment that given the choice between trying to fix someone else's perl scripts and rewriting the whole thing to avoid the bug, they'd rather re-write it. That's far from objective proof, I know, but it conveys the sentiment.

Still, your comment is roughly analogous to the strawman argument that "guns don't kill people." Perl doesn't make your code bad - but if we got perl off the street, I think there'd be a lot more maintainable and reusable code out there.

August 1, 2008 3:59:00 PM PDT  
Blogger cassj said...

Hello,

Came across this while looking for something on converting solexa to phred quality scores. In the name of procrastination, couldn't resist trying to make sense of the $p, $regs thing:

I think:

$k is set to the next free array index in @regs. When $p is defined, they dereference an array in the regs array at position $k. As this doesn't exist yet, it gets created automagically. So, $p is a reference to the same array as the one to which you've just created a reference in $regs[$k]. So when you push to the array ref'd by $p, you're pushing to the array ref'd by $regs[$k]. Can't see why they don't just push to @{$regs[$k]}, eg:


#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @regs = ("foo","bar","baz");
my $k = @regs;
push @{$regs[$k]}, "meeep";
print Dumper \@regs;


right, back to work...

March 15, 2009 4:04:00 PM PDT  

Post a Comment

<< Home