Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Friday, March 20, 2009

Universal format converter for aligned reads

Last night, I was working on FindPeaks when I realized what an interesting treasure trove of libraries I was really sitting on. I have readers and writers for many of the most common aligned read formats, and I have several programs that perform useful functions. So, that raises the distinctly interesting point that all of them should be applied together in one shot... and so I did exactly that.

I now have an interesting set of utilities that can be used to convert from one file format to another: bed, gff, eland, extended eland, MAQ .map (read only), mapview, bowtie.... and several other more obscure formats.

For the moment, the "conversion utility" forces the output to bed file format (since that's the file type with the least information, and I don't have to worry about unexpected file information loss), which can then be viewed with the UCSC browser, or interpreted by FindPeaks to generate wig files. (BED files are really the lowest common denominator of aligned information.) But why stop there?
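To make the "lowest common denominator" point concrete, here is a minimal Python sketch of reducing an aligned read to a single BED line. It is not the FindPeaks converter code: the `AlignedRead` record, its field names, and the assumption that the source format reports a 1-based leftmost position are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    """Hypothetical minimal record; real aligner formats carry many more fields."""
    name: str
    chrom: str
    start: int   # assumed: 1-based leftmost aligned position
    length: int  # number of aligned bases
    strand: str  # "+" or "-"


def to_bed_line(read: AlignedRead) -> str:
    """Render a read as a six-column BED line.

    BED uses a 0-based start and an exclusive end, so a 1-based start
    has to be shifted down by one. The score column is set to 0 because
    this sketch deliberately drops all quality information.
    """
    chrom_start = read.start - 1
    chrom_end = chrom_start + read.length
    fields = [read.chrom, str(chrom_start), str(chrom_end), read.name, "0", read.strand]
    return "\t".join(fields)


if __name__ == "__main__":
    print(to_bed_line(AlignedRead("read1", "chr1", 100, 36, "+")))
```

Everything beyond position, name and strand (mismatches, quality strings, alternative hits) is simply discarded here, which is why going to BED is safe and going back from BED is not.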

Why not add a very simple functionality that lets one format be converted to the other? Actually, there's no good reason not to, but it does involve some heavy caveats. Conversion from one format type to another is relatively trivial until you hit the quality strings. Since these aren't being scaled or altered, you could end up with some rather bizarre conversions unless they're handled cleanly. Unfortunately, doing this scaling is such a moving target that it's just not possible to keep up with that and do all the other development work I have on my plate. (I think I'll be asking for a co-op student for the summer to help out.)
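As an illustration of what quality rescaling involves, the Python sketch below handles only the simplest case: re-encoding a quality string from an ASCII offset of 64 to an offset of 33, assuming both sides use the same Phred scale. The function name is made up for this example, and it is not something the converter described here does. Older Solexa-style scores use a different (log-odds) scale and would need a nonlinear transformation on top of the offset change, which this sketch does not attempt.

```python
def phred64_to_phred33(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33.

    Assumes the input really is Phred-scaled with an ASCII offset of 64.
    Any character below that offset means the assumption is wrong, so
    the function raises instead of silently producing garbage.
    """
    converted = []
    for ch in quality:
        score = ord(ch) - 64
        if score < 0:
            raise ValueError(f"character {ch!r} is not valid in a Phred+64 string")
        converted.append(chr(score + 33))
    return "".join(converted)


if __name__ == "__main__":
    print(phred64_to_phred33("hhhB@"))
```

The explicit error is the important design choice: if a quality string is copied across formats without checking which encoding it uses, the output still looks like a valid quality string, and the problem only shows up much later.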

Anyhow, I'll be including this nifty utility in my new tags. Hopefully people will find the upgraded conversion utility to be helpful to them. (=

16 Comments:

Anonymous Kenneth said...

Man, I could really use something like that. I've been left with nothing but an eland extended file and limited tools that can utilize this format.

When will 3.2 be released to us non-developer types so we can get that functionality?

April 23, 2009 3:18:00 PM PDT  
Blogger Anthony Fejes said...

Hi Kenneth,

FindPeaks 3.3 builds have been available for several months now, and the "universal converter" tool is in the package. You can check it out at:

http://vancouvershortr.sourceforge.net/

April 23, 2009 3:27:00 PM PDT  
Anonymous Heather said...

Psyched to find your blog in the jungle of short read mapping! (Just so's you know.)

June 20, 2009 6:47:00 AM PDT  
Anonymous Marco said...

Hi,

I'm currently working with ChIP-seq data files, and taking into consideration that I don't have huge skills in programming (rather none), I'm always looking for some cool tools that make life easier. For instance, do you have a converter from FASTQ to BED files that does not need to run any complicated code? I mean, why not establish web access where we could just upload FASTQ files and get out BED files?
greetings
Marco A.

June 22, 2009 4:05:00 AM PDT  
Blogger Anthony Fejes said...

Hi Marco,

The reason there's no FASTQ -> BED converter is that FASTQ is used to store unaligned reads, and BED is used to store read positions after they've been aligned to contigs. Thus, it's not a matter of conversion as much as missing information.

FASTQ -> Aligner -> New format -> converter -> BED.

Unfortunately, you can't skip the aligner step.

Cheers

June 22, 2009 8:47:00 AM PDT  
Blogger Michael said...

Tried PF 4.0 package today for the first time. It's really a breeze and the format converters make my Perl-Hacks superfluous.

June 29, 2009 6:13:00 AM PDT  
Blogger Elizabeth said...

Hi Anthony,

Do you have a converter for going from either bowtie or bed to eland _results format? :)

Thanks!

September 14, 2009 4:17:00 PM PDT  
Blogger Anthony Fejes said...

Hi Elizabeth,

The short answer is No.

Slightly longer: You could certainly make one out of the converter I have - but I'm rather concerned about data loss, so I never implemented it.

Converting from BED -> Eland would be a disaster, since you'd be missing nearly all of the fields that go into an eland file.

Converting from Bowtie to Eland would be the equivalent of trying to use a pair of pants as a t-shirt: things just don't match up in expected ways - and it's the unexpected things that will bite you in the end.

If you're interested in doing it yourself, you could certainly download the source code for the Vancouver Short Read Analysis Package and modify the ConvertToBed program to do just that.

September 14, 2009 4:27:00 PM PDT  
Blogger ivangreg said...

Hi Anthony,

Have you considered creating a BED to WIG conversion tool?

That would be very useful for visualisation. Unlike BED to Eland, BED to WIG is an innocuous downgrading conversion from feature position to feature density.

Thanks,

September 17, 2009 9:10:00 AM PDT  
Blogger Anthony Fejes said...

Hi Ivan,

FindPeaks already converts nearly any format you'd like (including bed) to a wig file. I don't have any plans to create a second standalone version that just creates the wig file without the peaks file - but you can always ignore the peaks. (=

Anthony

September 17, 2009 9:30:00 AM PDT  
Anonymous Silvia said...

Hi Anthony,
do you know if there is any script or program's utility available to convert a Bowtie or SAM file into a BED, WIG or any other format that can be visualized by UCSC Genome Browser?
Thank u a lot!

Silvia

November 6, 2009 6:59:00 AM PST  
Blogger Anthony Fejes said...

Hi Silvia,

FindPeaks will accept bowtie and sam files and convert them to wigs. There are also utilities in the Vancouver Short Read Analysis Package for creating bed files from the same. All of those files will be viewable through the UCSC browser.

Anthony

November 6, 2009 9:10:00 AM PST  
Blogger Irina said...

The output wig files, are they just the overlaps between all the read sequences aligned, or are they a product of some transformation from the software?

January 8, 2010 6:01:00 AM PST  
Blogger Anthony Fejes said...

Hi Irina,

Both are options, depending on the distribution type. If you use the "native" distribution type (3, if I recall correctly), no transformation is applied. The other available distributions impose a transformation, which is useful for ChIP-Seq experiments.

Anthony

January 8, 2010 9:18:00 AM PST  
Blogger Irina said...

Thanks a lot, this is very helpful. I did not know how to handle that. I also had a question on the auto_threshold parameter, because I tried applying it, and I noticed it works for calculating saturation (a number of peaks is returned for each fraction of the reads used), but in the normal mode I can only extrapolate from the histogram and then have to filter myself based on height. Is this correct?

January 8, 2010 10:53:00 AM PST  
Blogger Anthony Fejes said...

I'm not sure what you mean by "normal mode". It would depend on which FDR you're applying (as there are several). However, it's much more reliable to do the filtering yourself - or, as other people used to do:

1. run once to get FDR,

2. run a second time with -minimum to have it filter for you.

It's not a great method, but I haven't had a chance to write code that does this automatically AND without errors. (There are several false starts in the code base, but none I would feel comfortable releasing.)

Anthony

January 8, 2010 10:58:00 AM PST  
