The New Ensembl Regulatory Annotation

GWAS after GWAS returns statistically significant hits that are hard to interpret because they fall outside coding regions, which calls for better functional annotation of regulatory regions. We at Ensembl have provided such an annotation for a few years, and we are now redesigning from the ground up the way we define these regions. This work is in progress and we would love to hear your suggestions and comments.

[Figure: regulation diagram. An overview of the major elements involved in gene expression regulation.]

In short, we are looking for all regions of the genome which display regulatory function. Much ink has been spilled over the definition of the word ‘functional’, so we’re going to expand a bit.

We propose to map out the regions of the genome that display epigenomic marks and/or transcription factor binding sites (TFBS) associated with proximal and distal regulatory elements, transcription start sites (TSS), and CTCF insulators.

Ensembl’s Regulation pipeline

We will post more details next week on Computing the New Ensembl Regulatory Annotation. To cut to the chase, we defined the following regions from publicly available ENCODE and Roadmap Epigenomics datasets:

Label              Count     Avg. length (bp)   Max. length (bp)   Total length (Mbp)
TSS                40,249    973.2              11,400             39.2
Proximal Reg.      101,206   1,005.5            15,000             101.8
Distal Reg.        209,081   526.1              8,400              110.0
CTCF               108,284   550.1              5,200              59.6
Unannotated TFBS   163,528   155.8              1,630              25.5
All                                                                299.2
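
As a quick sanity check on the table, the total length of each label should be roughly its count multiplied by its average length. The short Python sketch below recomputes those totals from the figures above; the variable and function names are ours, chosen for illustration only.

```python
# Rows copied from the table above:
# label -> (count, average length in bp, reported total length in Mbp)
ROWS = {
    "TSS": (40_249, 973.2, 39.2),
    "Proximal Reg.": (101_206, 1_005.5, 101.8),
    "Distal Reg.": (209_081, 526.1, 110.0),
    "CTCF": (108_284, 550.1, 59.6),
    "Unannotated TFBS": (163_528, 155.8, 25.5),
}
REPORTED_ALL_MBP = 299.2


def total_mbp(count, avg_bp):
    """Total length in Mbp implied by a feature count and an average length in bp."""
    return count * avg_bp / 1e6


# Recompute each row's total and print it next to the reported value.
implied = {label: total_mbp(count, avg) for label, (count, avg, _) in ROWS.items()}
for label, (_, _, reported) in ROWS.items():
    print(f"{label:17s} implied {implied[label]:6.1f} Mbp, reported {reported:6.1f} Mbp")

# Compare the plain sum of the rows with the reported "All" figure.
summed = sum(implied.values())
print(f"Sum of rows: {summed:.1f} Mbp; reported 'All': {REPORTED_ALL_MBP} Mbp")
```

The implied totals agree with the reported ones to within rounding. The plain sum of the rows (about 336 Mbp) is larger than the reported "All" figure of 299.2 Mbp, which would be consistent with overlapping features being counted only once in "All"; that reading is our assumption, not something stated in the table.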

The new Regulatory Build will allow us to separate state from function, as shown below, upstream of VNN2:


The track at the top colours each region by function, independently of cell type. In the cell-type-specific tracks below, features are greyed out if we do not have evidence of activity for that region in that cell type.

We have incorporated our preliminary results into a track hub, along with some of our intermediate data. Next week we will post more details on Computing Ensembl’s New Regulatory Build. We aim to integrate this build officially into Ensembl release 76, sometime in the third or fourth quarter of 2014.


Are you saying that 9.7% of the genome is functional?

Not quite. We’re saying that if you split the genome into 200 bp bins, 9.7% of them show epigenomic marks or TF binding. Remember that histone marks are measured at nucleosome resolution, so this signal is at best at a 140 bp resolution. If you add in experimental noise (typically proportional to the ChIP-seq fragment length), the exact position of these elements on the genome is rather fuzzy. At the same time, the epigenome is a dynamic system, and we only have some assays on some cell lines. No doubt more regions will be annotated as more datasets come in.
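
To make the binning idea concrete, here is a minimal Python sketch that counts which fixed-size bins are touched by a set of annotated intervals. It is not our pipeline code: the function name, the half-open interval convention, and the rule that any overlap marks a bin as covered are all assumptions made for this illustration, and the intervals are toy values.

```python
import math


def covered_bin_fraction(intervals, genome_length, bin_size=200):
    """Fraction of fixed-size bins overlapped by at least one interval.

    Intervals are half-open (start, end) pairs in bp. A bin counts as
    covered as soon as any interval overlaps it (assumed rule).
    """
    n_bins = math.ceil(genome_length / bin_size)
    covered = set()
    for start, end in intervals:
        if end <= start:
            continue  # skip empty or inverted intervals
        first = start // bin_size
        last = (end - 1) // bin_size  # last bin that contains a base of the interval
        covered.update(range(first, last + 1))
    return len(covered) / n_bins


# Toy example: a 2,000 bp "genome" (10 bins) with two annotated regions.
toy_intervals = [(150, 450), (1000, 1100)]
fraction = covered_bin_fraction(toy_intervals, genome_length=2000)
print(f"{fraction:.1%} of bins covered")  # bins 0, 1, 2 and 5 -> 40.0%
```

Note how the first toy region is only 300 bp long but marks three bins (600 bp) as covered: a bin-level percentage is a coarse upper bound on annotated sequence, which is one reason not to read it as "percent of the genome that is functional".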

What happened to the 80.4% of the genome being functional?

This statistic from the main ENCODE paper took into account other biochemical markers, in particular those associated with transcription, which can be observed over most of the genome. We therefore recommend using curated genesets such as GENCODE to define gene bodies. Nonetheless, the number of regulatory elements and promoters described here is of the same order of magnitude as that discussed in the ENCODE paper.

Is this the same as ENCODE?

This is more than just ENCODE. There are other fascinating epigenomic surveys out there, such as Roadmap Epigenomics or BLUEPRINT, to name a few. Here at Ensembl, we have started merging all these datasets (including ENCODE) to provide the most comprehensive overview possible, and we will update our calls as new projects come along. Also, as discussed above, we are producing a cell-type-independent summary of epigenomic function, which can be used to inform studies on new cell types.

What about other species?

We have focused our Regulation database primarily on human: that is what most of our users ask for, and what we have the most data for. But that does not mean that we ignore other species. Ensembl already has regulatory information for mouse, and we plan to extend this shortly to farm animals, in collaboration with the Roslin Institute.

Can you assign regulatory elements to genes?

We’re working on it. Correlations are easy to find, but multiple testing quickly gets in the way when testing 310,287 regulatory elements against 40,249 TSSs.
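
To give a sense of scale, the sketch below counts the pairwise tests implied by those two numbers and the per-test threshold a plain Bonferroni correction would demand. Bonferroni and the 0.05 significance level are used here purely as an illustration, not as the method we will adopt; the 310,287 figure matches the sum of the proximal and distal counts in the table above.

```python
# Counts taken from the table above.
n_proximal = 101_206
n_distal = 209_081
n_tss = 40_249

# Proximal plus distal elements gives the 310,287 regulatory elements quoted in the text.
n_regulatory = n_proximal + n_distal

# Testing every regulatory element against every TSS.
n_tests = n_regulatory * n_tss

# Illustrative family-wise error rate and the resulting Bonferroni threshold.
alpha = 0.05
bonferroni_threshold = alpha / n_tests

print(f"Regulatory elements: {n_regulatory:,}")
print(f"Pairwise tests:      {n_tests:,}")
print(f"Per-test threshold:  {bonferroni_threshold:.1e}")
```

With more than twelve billion candidate pairs, a naive all-against-all comparison needs p-values around 4e-12 to survive this correction, so restricting the candidate pairs (for example by genomic distance) or using a less conservative procedure becomes attractive.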

Remember, this is a work in progress, and we would love to hear your suggestions. Please leave your comments here or drop us a line.