WiggleTools: a pocket calculator for very large datasets

We are pleased to announce a new bioinformatics application, WiggleTools, described in a recent Application Note in Bioinformatics. It allows you to quickly and conveniently compute statistics across many (up to hundreds of) genome-wide datasets.

WiggleTools is first and foremost a data summary tool. It collapses a large collection of genome-wide datasets, such as BigWig, BigBed or BAM files, into a single summary. You can then view a single statistic on the Ensembl browser that combines all the datasets for a given project, rather than displaying a pileup of many data tracks.

For example, if you wanted to display the average binding probability for each TF in the ENCODE dataset, you could display a huge number of tracks on the browser, one for each TF. Such a view is clearly difficult to interpret, as one has to scroll up and down endlessly:

[Figure: Ensembl browser view with one track per TF]

And here is a summary that recaps all of the data above in a single track (explained below):

[Figure: Ensembl browser view with a single summary track]
Overall binding probability for all TFs in a single track. Note that you now have room to add other datasets in the ‘Region in detail’ view.

To better handle different types of signal, WiggleTools offers a range of statistics on a set of values, such as mean, median, minimum, maximum, or variance.
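To make the idea of a per-position statistic concrete, here is a minimal Python sketch. It is not WiggleTools code or syntax: the toy `tracks` lists and the `summarise` helper are made up for illustration, and real BigWig, BigBed or BAM files would need a proper parser.

```python
from statistics import mean, median

# Toy data (hypothetical): each inner list is one dataset's signal
# at five consecutive genomic positions.
tracks = [
    [0.0, 1.0, 4.0, 2.0, 0.0],
    [1.0, 1.0, 2.0, 2.0, 0.0],
    [2.0, 1.0, 0.0, 5.0, 3.0],
]


def summarise(tracks, statistic):
    """Apply `statistic` to the values seen at each position across all tracks."""
    # zip(*tracks) regroups the data by position instead of by dataset.
    return [statistic(values) for values in zip(*tracks)]


mean_track = summarise(tracks, mean)
median_track = summarise(tracks, median)
min_track = summarise(tracks, min)
max_track = summarise(tracks, max)
```

Each call yields one summary track of the same length as the inputs, which is the kind of single track you would then load into the browser.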

Besides boiling down large collections of data into a single track, WiggleTools also allows you to compare groups of datasets. For example, if you have a collection of case and control replicates, you can compare the means of the cases and controls, but you can also apply more advanced statistics such as Welch’s t-test (for normally distributed variables) or Wilcoxon’s rank-sum test (for other variables).
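As a rough illustration of a per-position case/control comparison, the sketch below computes Welch's t statistic at each position using only the Python standard library. The `welch_t` helper and the toy replicate values are hypothetical, and this says nothing about how WiggleTools implements or reports the test. It stops at the t statistic; turning it into a p-value needs a t-distribution, for instance via `scipy.stats.ttest_ind(cases, controls, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(cases, controls):
    """Welch's t statistic for two samples with possibly unequal variances."""
    # Each group needs at least two replicates, and the groups must not
    # both be constant, otherwise the standard error is zero.
    standard_error = sqrt(
        variance(cases) / len(cases) + variance(controls) / len(controls)
    )
    return (mean(cases) - mean(controls)) / standard_error


# Toy data (hypothetical): rows are replicates, columns are genomic positions.
case_tracks = [[5.0, 1.0], [7.0, 2.0], [6.0, 3.0]]
control_tracks = [[1.0, 1.0], [2.0, 3.0], [3.0, 2.0]]

# One t statistic per position, comparing cases against controls.
t_track = [
    welch_t(case_values, control_values)
    for case_values, control_values in zip(zip(*case_tracks), zip(*control_tracks))
]
```

In this toy example the first position shows a clear difference between the groups, while the second has identical means and therefore a t statistic of zero.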

WiggleTools has been designed with efficiency in mind. Streaming the data keeps memory requirements to a minimum by only storing local information. Functional components communicate directly in memory, without disk access or string passing. Parallel threads keep the system going smoothly regardless of irregularities in disk access. Finally, a novel BigWig file merging tool, bigWigCat, which we contributed to Jim Kent’s C library, allows WiggleTools to make the most of a cluster of computers. For example, computing the sum of 126 BigWig files (a total of 121 GB) takes less than 17 minutes in total on 116 CPUs and fits in less than 5.5 GB of RAM.
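The streaming idea can be sketched in a few lines of Python. This is only an analogy, not the actual C implementation: it assumes each dataset is exposed as a hypothetical iterator of `(position, value)` pairs sorted by position, and it sums them lazily so that only one pending item per dataset is held in memory at any time.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter


def stream_sum(*tracks):
    """Lazily sum sorted (position, value) streams without loading them fully."""
    # merge() interleaves the sorted streams by position, reading one item
    # from each stream at a time.
    merged = merge(*tracks, key=itemgetter(0))
    # groupby() gathers consecutive items that share the same position.
    for position, group in groupby(merged, key=itemgetter(0)):
        yield position, sum(value for _, value in group)


# Toy streams (hypothetical): sparse signal from two datasets.
track_a = iter([(1, 2.0), (3, 1.0)])
track_b = iter([(1, 1.0), (2, 4.0)])

summed = list(stream_sum(track_a, track_b))
```

Because the result is itself a generator, it could be fed straight into another operation, which mirrors the idea of components passing data to each other in memory rather than through intermediate files.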

A statistics package for genomic datasets

With WiggleTools, you can pretty much play with the BigWig, BigBed and BAM files lying on your filesystem as if they were vectors loaded in R, NumPy or MATLAB. A simple language, which resembles LISP, is enough to define the functions that WiggleTools then runs in a single pass through the data.

A use case: we wanted a summary of transcription factor (TF) binding across the genome. For every position in the genome, we had estimated for each TF the probability of observing binding in a random cell type. To combine these datasets, we wanted to compute an overall probability of observing any binding at that position. We therefore wanted to compute:

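$$P(\text{any binding}) = 1 - \prod_{i=1}^{n} (1 - p_i)$$

where p_i is the estimated binding probability of TF i at that position and n is the number of TFs, assuming the TFs bind independently of one another.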

WiggleTools can create the appropriate function on the fly and compute the result in a single pass through the files. Total run time: 34s, max memory: 20MB.
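For readers who prefer code to notation, here is a small Python sketch of that calculation on toy data. It assumes the TFs bind independently, so the probability of any binding is one minus the product of the per-TF probabilities of no binding. The function name and the example values are hypothetical, and this is not the WiggleTools syntax.

```python
from math import prod


def any_binding_probability(probabilities):
    """Probability that at least one TF binds, assuming independent binding."""
    # The chance that no TF binds is the product of (1 - p) over all TFs.
    return 1.0 - prod(1.0 - p for p in probabilities)


# Toy data (hypothetical): one row per TF, one column per genomic position.
tf_tracks = [
    [0.5, 0.0, 0.1],
    [0.5, 0.0, 0.2],
]

# Combine the per-TF tracks into a single overall track, position by position.
overall = [any_binding_probability(values) for values in zip(*tf_tracks)]
```

Two TFs that each bind with probability 0.5 give an overall probability of 0.75 at the first position, and a position where no TF ever binds stays at zero.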

For more information on WiggleTools, have a look at our paper in Bioinformatics and our code on GitHub.
