Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Monday, April 21, 2008

Eland File Format

I haven't written much over the past couple of days. I have a few things piled up that all need doing urgently... and it never fails, that's when you get sick. I spent today in bed, fighting off a cold, sore throat and fever. Wonderful combination.

Anyhow, since I'm starting to feel better, I thought I'd write a few lines before going to bed, and wanted to mention that I've finally seen a file produced by the new Eland. It's a little different, and the documentation provided (ahem, that I was able to obtain from a colleague who uses the pipeline) was pretty scarce.

In fact, much of what you see in the file is pretty obvious and follows the same general concept as the previous Eland files, with a few caveats:

1) The library name and 4-coordinate position of the sequence are separated by tabs in one of the files I saw, but concatenated with a separating ":" in another. I'm not sure which is the real format, but there are at least 2 formats for line identification.

2) There's a string that seems to encode the base quality scores from the prb files, but it's in a format for which I can't find any information.

3) There's a new format for mismatches within the alignment. Instead of telling you the location of the mismatch, you now get a summary of the alignment itself. If Eland could do insertions, it would work well for those too. According to the documentation, the string gives the number of matching bases in each run, with letters interspersed to mark the mismatches. (e.g. if you had a 32 base alignment, with a mismatch A at position 10, you'd get the string "9A22": 9 matching bases, the mismatch, then 22 more matching bases.) I also understand that upper and lower case mismatch letters mean different things, though I haven't probed the format too much.
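
To make the "9A22" example concrete, here's a minimal Python sketch of how I read that descriptor. It assumes the string contains only runs of digits (matching bases) and single letters (mismatches). The name parse_match_descriptor is my own invention, not anything from the Eland pipeline, and I keep each letter's case as-is since I don't know what upper versus lower case signifies.

```python
import re


def parse_match_descriptor(descriptor):
    """Parse a descriptor such as "9A22".

    Returns (aligned_length, mismatches), where mismatches is a list of
    (1-based position, letter) pairs. Assumption: the descriptor holds only
    digit runs and single letters; anything else is rejected.
    """
    if not re.fullmatch(r"(?:[0-9]+|[A-Za-z])*", descriptor):
        raise ValueError("unexpected character in descriptor: %r" % descriptor)

    position = 0
    mismatches = []
    for token in re.findall(r"[0-9]+|[A-Za-z]", descriptor):
        if token.isdigit():
            # A number is a run of matching bases.
            position += int(token)
        else:
            # A letter is a single mismatched base at the next position.
            position += 1
            mismatches.append((position, token))
    return position, mismatches


print(parse_match_descriptor("9A22"))  # (32, [(10, 'A')])
```

If the real format turns out to allow other symbols, this will raise a ValueError rather than silently guess.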

So, on the subject of formats, I understand there's some community effort around using a so-called "short read format", or SRF. It's been adopted by Helicos and GEO, as well as several other groups.

Maybe it's time I start converting Eland formats to this as well. Wouldn't it be nice if we only had to work with one format? (If only Microsoft understood that too! - ps, don't let the community name fool you, it's well known Microsoft sponsored that site.)


4 Comments:

Anonymous said...

Hi,

isn't it fun to always get data in a new format?... The quality string is ASCII code = quality value +64.

April 24, 2008 12:25:00 AM PDT  
Blogger Anthony said...

Dear anonymous,

Thanks so much for the tip. We figured it had to be something like that, but it's nice to have it explained!
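
For anyone else who lands here, a minimal sketch of that rule in Python. It assumes the tip above (ASCII code = quality value + 64) holds for every character in the string, and decode_quality is just a name I made up:

```python
def decode_quality(quality_string, offset=64):
    """Turn a quality string into a list of integer quality values.

    Assumption: each character encodes one base, with
    ASCII code = quality value + offset (64, per the tip above).
    """
    return [ord(ch) - offset for ch in quality_string]


print(decode_quality("@Ah"))  # [0, 1, 40]
```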

April 24, 2008 12:43:00 AM PDT  
Anonymous said...

I have not read through the SRF specification. As far as I know, though, it is specifically designed for raw trace data, not for alignments. I know a group has started work on a universal alignment format, but they seem to have a long way to go. To me, alignments seem more complicated than raw traces. Maybe I am wrong.

April 27, 2008 12:50:00 PM PDT  
Blogger Anthony said...

Hey Anonymous,

I think you're right. I've spoken to several people about the SRF format, but didn't make the connection that this was it. I've also heard that they're working on an alignment format, but the details haven't been very clear.

That would make a good blog subject though! Thanks for your comment.

Anthony

April 27, 2008 3:42:00 PM PDT  
