You may have heard us squeaking about our new mouse regulatory build in our Ensembl 93 release blog. If you’re interested in finding out what exactly a ‘regulatory build’ is, and how to view and download this data in Ensembl, then this is the blog for you!
What is the Ensembl regulatory build?
The Ensembl regulatory build is our method of annotating features on the genome (currently for human and mouse only) that are involved in the regulation of gene expression (Figure 1).
These features include:
- Promoters and promoter flanking regions
- Enhancers
- CTCF binding sites
- Transcription factor binding sites
- Open chromatin regions (a.k.a. hypersensitive sites)
For mouse, the sequencing data that we use as evidence for annotating these features comes from the ENCODE consortium (Figure 1). We import data from a wide range of cell types (a.k.a. epigenomes), and we use evidence from all of these to annotate the locations of the features. These annotated features create the ‘regulatory build’ and we also display the activity and raw data for each of these features for each cell type.
For every cell type in the regulatory build we predict a regulatory feature to have one of the following levels of activity:
- Active (where the epigenetic signature includes modifications associated with activity)
- Poised (where the epigenetic signature shows the potential to be activated)
- Repressed (where the epigenetic signature includes modifications known to repress activity)
- Inactive (where the epigenetic signature shows no evidence of modifications related to activity)
- NA (no available data in this cell type)
What does the new mouse regulatory build bring?
Release 93 brought with it a completely updated regulatory build for mouse; the last update was in release 87 (December 2016). For the new version we imported three terabytes of new data from ENCODE, equivalent to 93 billion reads or four trillion base pairs. If we printed all of this data, it would be long enough to go to the Moon and back to Earth 26 times. Without parallelisation it would have taken two years and four months to process, but thanks to our brilliant software engineers this has been reduced to five days!
This new data came from 79 different cell types, a significant increase from the previous regulatory build, which had only eight. You can view the full list of cell types on our help page. The new build created from this data has over 100,000 more regulatory feature annotations than the previous release, and the regulatory build now covers approximately 15% of the genome sequence.
Where can I find this data in Ensembl?
The genome browser display
The regulatory build track is displayed by default in the genome browser views. This short video shows you how to select different evidence and cell type specific data from the regulatory build in the Location tab.
It is important to note that we do not link the regulatory features to genes. Reliably inferring interactions between cis-regulatory elements (regulatory elements that regulate neighbouring genes), such as those between enhancers and promoters, from experimental results such as Hi-C or eQTL data remains an open research question that we are investigating. From the Regulation page in the Gene tab you can add the same tracks that are available in the Location tab by using the ‘Configure this Page’ option.
The regulation tab
Each regulatory feature is assigned a unique stable ID (e.g. ENSMUSR00000659207), and searching for this ID will take you to the regulation tab in the browser. This is where you can view the cell type specific activity of your regulatory feature of interest, without the need to add tracks to the genome browser image. The Summary page shows the regulatory feature of interest centred in the browser view. Below this image is a table that summarises the activity of this feature in the different cell types.
The Details by cell type page has a display where you can view the activity of the feature in the different cell types, as well as the raw data. Raw data can be viewed as peaks (by default), shown as coloured rectangles with black arrows where the signal is strongest, or as the raw data signals, or both! You can quickly toggle between the views using the buttons above the display, where you’ll also find buttons to choose cells and different types of evidence.
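If you would rather pull the activity of a single feature into a script than read it off the page, the sketch below shows one way to do it in Python. It assumes that the REST API's regulatory feature endpoint (`regulatory/species/:species/id/:id`) and its `activity` parameter behave as described in the REST documentation. The ID pattern and the function names are illustrative choices made for this example, not part of any Ensembl library.

```python
import json
import re
import urllib.request

REST_SERVER = "https://rest.ensembl.org"

# Assumption: regulatory stable IDs are "ENS", an optional three-letter
# species code (e.g. "MUS"), the letter "R", then 11 digits.
STABLE_ID_PATTERN = re.compile(r"^ENS([A-Z]{3})?R\d{11}$")


def regulatory_feature_url(stable_id, species="mus_musculus"):
    """Build the REST URL for one regulatory feature, including activity."""
    if not STABLE_ID_PATTERN.match(stable_id):
        raise ValueError(f"not a regulatory stable ID: {stable_id!r}")
    return f"{REST_SERVER}/regulatory/species/{species}/id/{stable_id}?activity=1"


def fetch_regulatory_feature(stable_id, species="mus_musculus"):
    """Fetch the feature as parsed JSON (needs network access)."""
    request = urllib.request.Request(
        regulatory_feature_url(stable_id, species),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Only builds the URL; call fetch_regulatory_feature() to query the server.
print(regulatory_feature_url("ENSMUSR00000659207"))
```

Checking the ID before building the URL means a gene or transcript ID pasted in by mistake fails straight away, rather than producing a confusing server error.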
The Ensembl variant effect predictor (VEP)
If you’re working with a specific cell type, you can add a filter before you submit your job so that results are shown only if they fall in a regulatory feature that is active in that cell type. When using the web browser, simply expand the ‘Extra options’ section, find the ‘Regulatory data’ header and select ‘Yes and limit by cell type’ from the drop-down menu (Figure 6). A box will appear with a list of all of the cell types available to filter by. You can choose more than one! If you’re using the command-line version of the VEP, you can use the --cell_type flag, further details here.
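For the command-line route, here is a minimal Python sketch that assembles a VEP command limited to a set of cell types. It assumes that `vep` is on your path, that a cache is installed (hence `--cache`), and that `--cell_type` takes a comma-separated list alongside `--regulatory`. The file names and cell type names are placeholders, so swap in your own.

```python
import shlex


def build_vep_command(input_file, output_file, cell_types, species="mus_musculus"):
    """Assemble a VEP command that reports regulatory overlaps for given cell types."""
    if not cell_types:
        raise ValueError("at least one cell type is required")
    return [
        "vep",
        "--input_file", input_file,
        "--output_file", output_file,
        "--species", species,
        "--cache",        # assumption: a local cache is installed
        "--regulatory",   # report overlaps with regulatory features
        "--cell_type", ",".join(cell_types),
    ]


# Placeholder file and cell type names, for illustration only.
command = build_vep_command(
    "variants.vcf", "vep_output.txt", ["cell_type_a", "cell_type_b"]
)
print(shlex.join(command))
```

Building the command as a list rather than a single string means it can be passed directly to `subprocess.run` without worrying about shell quoting.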
BioMart for bulk data download
We have a separate dataset for regulation data in BioMart. Here you can download data about regulatory features from the build, as well as other regulation data (Figure 7). You can add filters to find features in a specific region of the genome, to download specific types of feature (e.g. enhancers or promoters only), or to restrict the results to specific cell types. Further details on using BioMart can be found on our help pages.
Command-line and API access
If you’re more comfortable working on the command line than in the browser, or need to access data on a large scale, then our REST API or Perl API would suit you well! We have a range of endpoints for regulation data in our language-agnostic REST API, which lets you query the database from Python, R, Java or any other programming language. We also have dedicated help pages and tutorials for using our Perl API for the regulation database.
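As a taste of the REST route, the Python sketch below asks for all regulatory features overlapping a region and tallies them by type. It assumes the `overlap/region` endpoint accepts `feature=regulatory`, and that each returned record carries the feature type in a `description` field; check the REST documentation for the exact fields. The region is an arbitrary example and the function names are illustrative.

```python
import json
import urllib.request
from collections import Counter
from urllib.parse import urlencode

REST_SERVER = "https://rest.ensembl.org"


def overlap_url(region, species="mus_musculus"):
    """Build the REST URL for regulatory features overlapping a region."""
    query = urlencode({"feature": "regulatory"})
    return f"{REST_SERVER}/overlap/region/{species}/{region}?{query}"


def fetch_regulatory_features(region, species="mus_musculus"):
    """Return the overlapping features as a list of dicts (needs network access)."""
    request = urllib.request.Request(
        overlap_url(region, species),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def count_by_type(features, key="description"):
    """Tally features by type; `key` is an assumed field name."""
    return Counter(feature.get(key, "unknown") for feature in features)


# Arbitrary example region; only the URL is built here.
print(overlap_url("7:100000000-100100000"))
```

To run the query itself, pass the result of `fetch_regulatory_features(...)` to `count_by_type(...)`. The REST server limits how large a region you can request in one call, so longer stretches need to be split into chunks.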
You can download a single file containing all of the annotated regulatory features, peaks and activity from the Ensembl FTP site. If you wish to perform your own reanalysis, individual raw signal density and segmentation files can also be downloaded from the FTP site. New for this release, you will also find text files here that describe how the data was processed to generate the peaks, along with the results of the data quality checks.
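Once you have a feature file from the FTP site, a few lines of Python are enough to get a quick overview of it. The sketch below assumes that the file is gzip-compressed GFF3 (nine tab-separated columns, with `key=value` attributes in the last one), so check the format of the file you downloaded first. The file name in the usage note is a placeholder.

```python
import gzip
from collections import Counter


def parse_gff_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    columns = line.split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    seqid, _source, feature_type, start, end, _score, _strand, _phase, attributes = columns
    # The ninth column holds semicolon-separated key=value pairs.
    attrs = dict(pair.split("=", 1) for pair in attributes.split(";") if "=" in pair)
    return {
        "seqid": seqid,
        "type": feature_type,
        "start": int(start),
        "end": int(end),
        "attributes": attrs,
    }


def count_feature_types(path):
    """Count the features of each type in a gzip-compressed GFF3 file."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            record = parse_gff_line(line)
            if record is not None:
                counts[record["type"]] += 1
    return counts
```

For example, `count_feature_types("regulatory_features.gff.gz")` would return a tally of promoters, enhancers and so on, which is a handy sanity check before a larger analysis.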
Further information and help
If you would like to see a demonstration of how to find the data, please register for our release 93 webinar on Tuesday 24th of July at 16:00 BST. This will also be made available on our YouTube channel afterwards. If you have any questions about the data please leave a comment, or contact us.