Eland File Format
Anyhow, since I'm starting to feel better, I thought I'd write a few lines before going to bed, and wanted to mention that I've finally seen a file produced by the new Eland. It's a little different, and the documentation provided (ahem, that I was able to obtain from a colleague who uses the pipeline) was pretty scarce.
In fact, much of what you see in the file is pretty obvious, with the same general concept as the previous Eland files, except this has a few caveats:
1) The library name and 4-coordinate position of the sequence are all separated by tabs in one of the files I saw, but concatenated with a separating ":" in another. I'm not sure which is the real format, but there are at least two formats for line identification.
2) There's a string that seems to encode the base quality scores from the prb files, but it's in a format for which I can't find any information.
3) There's a new format for mismatches within the alignment. Instead of telling you the location of each mismatch, you now get a summary of the alignment itself. If Eland could do insertions, it would work well for those too. According to the document, the string gives the number of consecutive matching bases, with letters interspersed to mark the mismatches. (e.g. if you had a 32-base alignment with a mismatch A at position 10, you'd get the string "9A22": 9 matching bases, the mismatch, then 22 more matching bases.) I also understand that upper- and lower-case mismatch letters mean different things, though I haven't probed the format too much.
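For what it's worth, here is a minimal Python sketch of how I'd expect to decode that mismatch string, based only on the "9A22" example above. The function name parse_match_descriptor is my own invention, not anything from the Eland documentation, and since I don't know what upper versus lower case means, the sketch simply hands the letters back unchanged.

```python
import re


def parse_match_descriptor(descriptor):
    """Decode a mismatch summary string such as "9A22".

    Assumption (taken from the single example in the text): digits give a
    run of matching bases, and each letter marks one mismatched base.
    Returns (aligned_length, mismatches), where mismatches is a list of
    (1-based position, letter) tuples. Letter case is preserved because
    its meaning is not known here.
    """
    position = 0
    mismatches = []
    # Each token is either a run of digits or a single letter.
    for run, letter in re.findall(r"(\d+)|([A-Za-z])", descriptor):
        if run:
            position += int(run)   # skip over the matching bases
        else:
            position += 1          # a mismatch occupies one base
            mismatches.append((position, letter))
    return position, mismatches


length, mismatches = parse_match_descriptor("9A22")
print(length, mismatches)
```

For the example string, this should report a 32-base alignment with a single mismatch A at position 10. Anything beyond digits and letters (an insertion code, say) is silently ignored, so treat it as a starting point rather than a real parser.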
So, while on the subject of formats, I understand there's some community effort around using a so-called "short read format", or SRF. It's been adopted by Helicos and GEO, as well as several other groups.
Maybe it's time I start converting Eland formats to this as well. Wouldn't it be nice if we only had to work with one format? (If only Microsoft understood that too! P.S. Don't let the community name fool you; it's well known that Microsoft sponsored that site.)