Computing Ensembl’s New Regulatory Annotation

In a previous post, we described Ensembl’s new regulatory annotation of the genome. Here, we go into greater detail about how we computed it.

We started by running ChromHMM over 17 cell types, using publicly available ENCODE and Roadmap Epigenomics data. This produced a segmentation, or annotation, of the genome for each of these cell types under 25 labels, or segmentation states. These states were given arbitrary names (P0, P1, …) after a preliminary comparison to the earlier ENCODE segmentations across six cell types.

For each state and each position in the genome, we computed the number of cell types that have that state at that position. This resulted in 25 segmentation state summary tracks. We also pulled in all the ChIP-seq peaks from the January 2011 ENCODE data freeze, and kept those that overlapped with open chromatin (i.e. DNase I hypersensitivity) in the same cell type. From all these assays, we computed a summary track, which indicates the probability (between 0 and 1) of seeing a TF peak at any location of the genome. The segmentations and summary tracks can be seen in the illustration below.

Summary tracks generated from the segmentation and TFBS peaks
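
To make the counting step concrete, here is a minimal Python sketch on toy data. It assumes each segmentation has been binned into a list of state labels of equal length, which is an illustrative simplification and not the actual pipeline format. The post does not give the formula behind the TF summary track, so `tf_peak_fraction` (the fraction of assays with a peak in a bin) is only one plausible reading, and all function names here are hypothetical.

```python
def state_summary_tracks(segmentations, states):
    """Count, for each state and bin, how many cell types carry that state.

    segmentations: dict mapping cell type -> list of state labels (one per bin).
    Assumes every cell type has the same number of bins.
    """
    n_bins = len(next(iter(segmentations.values())))
    tracks = {state: [0] * n_bins for state in states}
    for labels in segmentations.values():
        for i, label in enumerate(labels):
            tracks[label][i] += 1
    return tracks


def tf_peak_fraction(peak_tracks):
    """Assumed TF summary: fraction of assays with a peak (1) in each bin."""
    n_assays = len(peak_tracks)
    return [sum(column) / n_assays for column in zip(*peak_tracks)]


# Toy example: three cell types, four bins, three states.
segmentations = {
    "cell_A": ["P0", "P0", "P1", "P2"],
    "cell_B": ["P0", "P1", "P1", "P2"],
    "cell_C": ["P2", "P0", "P1", "P2"],
}
tracks = state_summary_tracks(segmentations, ["P0", "P1", "P2"])

# Toy example: two assays, four bins (1 = peak retained after the open chromatin filter).
tf_summary = tf_peak_fraction([[1, 0, 0, 1], [1, 1, 0, 0]])
```

On this toy input, the P0 track holds 2 in the first two bins and 0 elsewhere, since two of the three cell types carry P0 there.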

We then used TF binding data to determine the specificity of the ChromHMM states, as regulatory regions are presumably correlated with TF binding. Each function (TSS, CTCF insulator, proximal regulatory elements, distal regulatory elements) was thus associated with several states. For example, transcription start sites were strongly associated with the P0 state in the segmentations. We overlapped these state signals to define function-specific signals. A simple count threshold was set to maximise the detection of TF binding sites. This led to the regions displayed at the bottom of the following figure:

Selected summary tracks were overlapped to define the Ensembl Regulatory features
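
The overlap-and-threshold step can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's code: the state tracks are combined by simple summation, the state pairing (P0 with P3) is invented for the example, and the threshold of 2 is arbitrary, whereas the real threshold was tuned against TF binding data.

```python
def combine_tracks(tracks, states):
    """Sum the summary tracks of the states associated with one function."""
    return [sum(values) for values in zip(*(tracks[s] for s in states))]


def call_regions(signal, threshold):
    """Return half-open (start, end) bin intervals where signal >= threshold."""
    regions = []
    start = None
    for i, value in enumerate(signal):
        if value >= threshold and start is None:
            start = i  # a region opens here
        elif value < threshold and start is not None:
            regions.append((start, i))  # the region closes before bin i
            start = None
    if start is not None:
        regions.append((start, len(signal)))  # region runs to the last bin
    return regions


# Hypothetical state summary tracks over five bins.
toy_tracks = {"P0": [2, 2, 0, 0, 1], "P3": [1, 0, 0, 0, 2]}
tss_signal = combine_tracks(toy_tracks, ["P0", "P3"])
tss_regions = call_regions(tss_signal, threshold=2)
```

Here the combined signal is `[3, 2, 0, 0, 3]`, so the sketch calls two regions: bins 0 to 1 and bin 4.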

These new regions concur strongly with the TF binding datasets: 73.4% of the TF ChIP-seq peaks were captured in the ChromHMM regions, equivalent to an 8.3x enrichment with respect to the genomic average. Conversely, 24.0% of the ChromHMM-based regions were covered by observed TF ChIP-seq peaks. To avoid losing information, TFBS peaks which were not covered by any of these elements were marked as ‘Unannotated TFBS’.
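
As a rough sanity check on the enrichment figure, suppose fold enrichment means the fraction of peaks captured divided by the fraction of the genome covered by the regions. This definition is an assumption, as the post does not spell out the formula. Under it, the two quoted numbers imply that the regions cover just under 9% of the genome:

```python
def fold_enrichment(frac_peaks_captured, frac_genome_covered):
    """Assumed definition: capture rate relative to the genomic coverage."""
    return frac_peaks_captured / frac_genome_covered


frac_peaks_captured = 0.734  # 73.4% of TF ChIP-seq peaks, from the text
reported_enrichment = 8.3    # 8.3x enrichment, from the text

# Genome fraction implied by the two figures under the assumed definition.
implied_coverage = frac_peaks_captured / reported_enrichment
print(f"Implied genome coverage: {implied_coverage:.1%}")
```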

Having defined consensus regulatory regions, we returned to the original data to determine which region is active in which cell line:

The Regulatory Features are then compared to the cell type specific segmentations to determine their activity in each cell type.
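
One way to picture this comparison is sketched below. The rule used here is purely hypothetical: a feature counts as active in a cell type when at least half of its bins carry one of the states associated with its function in that cell type's segmentation. Neither the rule nor the `min_fraction` parameter comes from the pipeline itself.

```python
def is_active(region, cell_labels, active_states, min_fraction=0.5):
    """Hypothetical activity call for one feature in one cell type.

    region: half-open (start, end) bin interval of the regulatory feature.
    cell_labels: that cell type's segmentation, one state label per bin.
    active_states: set of states associated with the feature's function.
    """
    start, end = region
    if end <= start:
        raise ValueError("region must span at least one bin")
    hits = sum(1 for label in cell_labels[start:end] if label in active_states)
    return hits / (end - start) >= min_fraction


feature = (0, 4)
active_in_a = is_active(feature, ["P0", "P0", "P5", "P0"], {"P0"})
active_in_b = is_active(feature, ["P5", "P5", "P5", "P0"], {"P0"})
```

In this toy case the feature is called active in the first cell type (three of four bins match) and inactive in the second (one of four).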

Validation

In the median cell type, 83% of FANTOM tags supported by three CAGE tags or more were annotated by our pipeline.

The VISTA Enhancer database contains enhancer sequences validated by in vivo staining assays on transgenic mice (hats off to the VISTA team for their years of meticulous work). It currently contains 1,575 predicted enhancers, of which 807 were experimentally confirmed. Of those, 491 (60.8%) were picked up by our pipeline.

  • TF binding motifs

We found two estimates of active TF binding motifs:

Source               JASPAR     Arbiza et al.
Number               803,489    2,013,074
Avg. length (bp)     13.1       10.8
Total length (Mbp)   10.5       21.7
% covered            59.0       87.0
Enrichment (fold)    6.1        9.0

JASPAR motifs are not mapped by default to the human genome, so we are missing a few. We will therefore be remapping them in time for Ensembl release 75, in early 2014.

Computing Ensembl’s new regulatory annotation is a work in progress. If you have any ideas on the subject, feel free to leave a comment or send them to helpdesk@ensembl.org.