With release 76 looming large in our calendars and the final deadlines out of the way for GRCh38 data production, it’s a good time to look back and take stock of what we’ve been doing in the Ensembl Regulation office. We have been rather quiet these past few months, working feverishly on an ambitious overhaul of our infrastructure. We’ve already given you a sneak peek at the new Ensembl Regulatory Build, so I’d like to take a look at the workhorse underlying all of our data: the ‘Ensembl Regulation Analysis Pipeline’.

The end result is a core resource that centralises epigenomic data from multiple public sources, processes them through a universal pipeline, then summarises them into easily understood annotations. Ensembl Regulation aims to be a single entry point for obtaining an overview of all the available regulatory data, from individual datasets to summary annotations, all coming to a browser near you, very soon. Underlying it is a full ‘end-to-end’ pipeline for producing the input data to the Regulatory Build, from FASTQ download through alignment, IDR processing and peak calling, to the final motif alignments.

The inputs to the Regulatory Segmentation and Build are experiments (ChIP-Seq and DNase-Seq) describing the chromatin status (i.e. histone modifications) and transcription factor landscape across various cell lines. These experiments range from large projects (e.g. ENCODE, Roadmap Epigenomics and BLUEPRINT) through to individual experiments made accessible via archives such as the ERA/ENA, SRA and GEO.

The main outputs of the pipeline are genome alignments, peak calls and ‘collection’ files, which provide coverage statistics across the genome. Managing and processing these data is no simple task, and we expect the number of available epigenomic datasets to increase significantly in the years to come. Also, with the arrival of GRCh38, we needed to reprocess all of the existing data in a short timeframe. We therefore integrated our processes into a shiny new fully automated pipeline using the ensembl-hive framework. Here follows a brief summary of the new features of the regulation analysis pipeline.
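To give a feel for what a ‘collection’ file holds, here is a minimal Python sketch that summarises aligned reads into fixed-size windows of mean coverage. The interval representation and bin size are illustrative assumptions; the real collection format is more compact and genome-wide.

```python
from collections import defaultdict

def binned_coverage(reads, bin_size=50):
    """Summarise aligned reads into fixed-size windows of mean coverage.

    `reads` is a list of (start, end) intervals (0-based, end-exclusive)
    on a single chromosome -- an illustrative stand-in for a BAM file.
    Returns {bin_index: mean_coverage}.
    """
    # Per-base read depth (sparse: only covered positions are stored).
    depth = defaultdict(int)
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    # Fold per-base depth into fixed-size bins of mean coverage.
    bins = defaultdict(float)
    for pos, d in depth.items():
        bins[pos // bin_size] += d / bin_size
    return dict(bins)
```

Summarising at this level is what lets a browser draw a signal plot without touching the raw reads.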

The Tracking Database

This now constitutes our main analysis and archive database, tracking the data both within our pipeline and in external repositories. In it, we register the metadata from different projects and data repositories, providing a single point of reference to query the data available in the public domain. This has been crucial in determining which cell lines meet the requirements for a build.
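As an illustration of the kind of query the tracking database supports, here is a toy sketch using an in-memory SQLite table. The schema, the example cell lines and the build criteria are all hypothetical stand-ins; the real tracking database is considerably richer.

```python
import sqlite3

# Hypothetical, minimal schema -- the real tracking database is richer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment (cell_line TEXT, feature_type TEXT, project TEXT)")
conn.executemany(
    "INSERT INTO experiment VALUES (?, ?, ?)",
    [("GM12878", "H3K4me3", "ENCODE"),
     ("GM12878", "DNase1", "ENCODE"),
     ("HUVEC", "H3K4me3", "ENCODE")])

# Illustrative stand-in for the real build criteria.
REQUIRED = {"H3K4me3", "DNase1"}

def buildable_cell_lines(conn):
    """Return cell lines that have all required feature types registered."""
    seen = {}
    for cell, feat in conn.execute("SELECT cell_line, feature_type FROM experiment"):
        seen.setdefault(cell, set()).add(feat)
    return sorted(c for c, feats in seen.items() if REQUIRED <= feats)
```

A single query like this replaces what used to be a manual trawl through project metadata.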

Read Alignment and Peak Calling

We first align reads using BWA, then call peaks using SWEMBL for short regions and CCAT for broader-ranging histone modifications. Replicates are processed in parallel to support ENCODE’s Irreproducible Discovery Rate (IDR) methodology.
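To illustrate why replicates are processed in parallel, here is a deliberately simplified Python stand-in for replicate filtering: it keeps only peaks supported by an overlapping peak in the other replicate. The real IDR method instead ranks peaks by score in each replicate and models the consistency of those ranks, so treat this purely as a sketch of the idea.

```python
def overlaps(a, b):
    """True if half-open intervals (start, end) a and b overlap."""
    return a[0] < b[1] and b[0] < a[1]

def reproducible_peaks(rep1, rep2):
    """Keep rep1 peaks supported by an overlapping rep2 peak.

    A crude stand-in for IDR: the real method models rank consistency
    of peak scores between replicates rather than simple overlap.
    """
    return [p for p in rep1 if any(overlaps(p, q) for q in rep2)]
```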

Pipeline Improvements

Flexibility has been a key aim of the redesign, and the hive infrastructure has helped here by allowing us to define each logical part of the pipeline as a separate configuration which can be ‘topped up’ as required. This means it’s easy to run just the read alignment stage (which we require as input to the segmentation), or to add in the peak calling and collection file writing stages later, even whilst it’s still running. All the necessary state information is captured in the tracking database, so it’s really easy to pick things up at any point and run the later stages of the pipeline.
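The ‘top up’ idea can be sketched in a few lines of Python: stages run in a fixed order, completed stages are recorded in persistent state, and a later invocation picks up where the previous one stopped. In the real pipeline this state lives in the tracking database and ensembl-hive handles the scheduling; the function and stage names below are only a toy.

```python
def run_stages(stages, state, upto):
    """Run stages in order up to `upto`, skipping any already completed.

    `stages` maps stage name -> callable (insertion order = run order);
    `state` is a set of completed stage names, standing in for the
    state captured in the tracking database.
    """
    order = list(stages)
    for name in order[: order.index(upto) + 1]:
        if name not in state:
            stages[name]()
            state.add(name)
    return state
```

Topping up is then just a second call with a later target stage: already-completed stages are skipped.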

Due to the size of our input data set and the resulting rolling data footprint, we set up a garbage collection of intermediate files and added inline archiving. This has limited our footprint, and enabled us to reprocess the entire human data set in one go.
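A minimal sketch of what inline archiving plus garbage collection might look like, assuming intermediates are simply gzip-archived and then deleted once their downstream products are safe. The real pipeline’s bookkeeping is driven by the tracking database; the paths and policy here are illustrative.

```python
import gzip
import os
import shutil

def collect_garbage(intermediates, archive_dir=None):
    """Delete intermediate files, optionally gzip-archiving them first.

    `intermediates` is a list of file paths whose downstream products
    have already been produced. Returns the paths actually removed.
    """
    removed = []
    for path in intermediates:
        if not os.path.exists(path):
            continue
        if archive_dir is not None:
            os.makedirs(archive_dir, exist_ok=True)
            dest = os.path.join(archive_dir, os.path.basename(path) + ".gz")
            # Inline archiving: compress a copy before deleting the original.
            with open(path, "rb") as src, gzip.open(dest, "wb") as out:
                shutil.copyfileobj(src, out)
        os.remove(path)
        removed.append(path)
    return removed
```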

The combination of the above improvements, the new ensembl-hive implementation and a whole load of other refinements means much less manual intervention is required, resulting in a large reduction in run times. For the alignments in particular, what was taking several weeks now takes just ~5 days!

What does the future hold?

We’ve already identified some more optimisations to the structure of the pipeline, so the runtimes are likely to drop even further. This will be crucial to handle the hundreds of cell types currently being examined within Roadmap Epigenomics, Blueprint, ENCODE 3 and other projects. We will also be revising our schemas to better reflect tissue specific data. This is part of a larger push within Ensembl to better describe the dynamics of gene regulation and transcription.

Finally, we are keeping up with lab techniques, and will be extending our pipelines to handle newer types of data, such as chromatin conformation assays or eQTLs. Although we do not process these data ourselves, we have already integrated and remapped the FANTOM5 CAGE-tag annotations onto GRCh38.

p.s. If you want even more info, keep an eye on this page. Once release 76 is out it will be updated with our new Regulatory Build documentation.

We’re always looking at ways we can improve the Ensembl browser experience. Quite often this results in a focus on the speed at which a particular display or track can be provided. Over the past year or so we have been generating various pre-computed data which is compressed and optimised for web display purposes. For release 62 we took the step of moving some of this data outside of the database into binary files.

The signal plot or ‘wiggle’ style displays provided as part of the regulation evidence (see ‘Functional genomics’ in the config panel) are now served from collection or ‘col’ files. This provides significant speed-ups, allowing many more tracks to be turned on without adversely affecting the response time of the display. In fact, it is now possible to turn on all the current signal plots for human – that’s 334 distinct data sets!
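The speed-up comes from random access into a pre-computed binary file rather than querying MySQL. The actual ‘col’ format is not described here, so the sketch below assumes a hypothetical flat array of little-endian float32 scores, one per window, which lets any window be fetched with a single seek.

```python
import struct

def write_scores(path, scores):
    """Pack per-window float scores into a flat binary file (float32, LE)."""
    with open(path, "wb") as fh:
        fh.write(struct.pack("<%df" % len(scores), *scores))

def read_window(path, index):
    """Random-access one window's score without reading the whole file."""
    with open(path, "rb") as fh:
        fh.seek(index * 4)  # 4 bytes per little-endian float32
        return struct.unpack("<f", fh.read(4))[0]
```

Because the file is a fixed-stride array, serving a visible region is a seek plus a short sequential read, regardless of genome size.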

Moving the data outside of the database also overcomes various constraints and issues with managing the data, and greatly reduces the size of the funcgen database. This is beneficial for those who want to download the funcgen databases for API access only; however, if you are running a mirror site and want to display the signal plots, you will need to download the col files.

We intend to broaden support for standard data formats (e.g. BAM) as we identify more data which is amenable to flat file access. As these data are not contained in the MySQL data dumps, we have created a new area on the FTP site:


More information can be found in the README files contained in this directory.

Over the last 3 releases we have significantly increased the content of the functional genomics databases, by including data from large public projects such as ENCODE and The Epigenomics Roadmap. In release 59 we have over 200 human data sets representing 10 cell types, 41 histone modifications and 14 transcription factors. These numbers will steadily increase in the forthcoming releases as more data is incorporated.

These data sets are now available in ‘Region In Detail’. Cell type tracks can be turned on via the ‘Functional genomics’ menu of the configuration panel – click on ‘configure this page’ on the left to access it. These are split into ‘Core’ and ‘Other’ evidence types, reflecting how we deal with these data within the Regulatory Build. Display options include a peak track, with the underlying raw data available as a ‘multi-wiggle’ track. Further configuration is available via the ‘Cell/Tissue’ tab, where individual feature types can be turned on or off.

Work is ongoing to improve the flexibility of these displays.

The Ensembl Functional Genomics (eFG) environment has been expanded to incorporate array mapping functionality. Historically, arrays from different vendors have been processed in similar, but non-identical ways due to differing array designs, with the output being stored in the core database. The ‘arrays’ environment unifies this process within the eFG database to provide a new standardised array mapping procedure for all array formats. This involves a two-step process whereby probe sequences are aligned to both genomic and transcript sequences, and transcripts are then annotated with xrefs (DBEntries) dependent on the quality of the probe alignments around a given transcript locus.
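A toy sketch of the second step: annotating transcripts with probe xrefs based on alignment quality around the transcript locus. The coordinate scheme, identity threshold and overlap rule are illustrative assumptions, not Ensembl’s real mapping rules.

```python
def annotate_transcripts(transcripts, probe_hits, min_identity=0.97):
    """Attach probe xrefs to transcripts the probe aligns within.

    `transcripts`: {name: (start, end)} loci (half-open intervals);
    `probe_hits`: list of (probe, start, end, identity) alignments.
    The 97% identity threshold is an illustrative assumption.
    """
    xrefs = {name: [] for name in transcripts}
    for probe, start, end, identity in probe_hits:
        if identity < min_identity:
            continue  # low-quality alignment: no xref
        for name, (tstart, tend) in transcripts.items():
            if start < tend and tstart < end:
                xrefs[name].append(probe)
    return xrefs
```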

The ‘arrays’ environment provides easily accessible and interactive command line functions to help run and administer the array mapping pipeline. Recent developments include broader array format support and multi-species capability, along with capture of much more detailed mapping information. This data has yet to be seen in the Ensembl browser, but from release 55 we will start redirecting the web displays to use the eFG data, with a view to developing a more detailed ‘Probe’ panel at some point later in the year.

We will endeavour to provide alignments and mappings for all popular arrays; for all others, we invite you to try out the eFG ‘arrays’ environment. For more information check out (literally):


Or see it online here.

If you have any questions, please mail ensembl-dev@ebi.ac.uk