The Ensembl regulation resources FTP site received a facelift in release 87. We modified the directory structure to make files easier to find, made the file names more descriptive, and now provide our data in a greater variety of file formats. All data files on our FTP site now adhere to a naming convention, which is described in greater detail here. The file names include the following information, separated by dots (‘.’):
- assembly version
- cell type (if applicable)
- feature type (if applicable)
- analysis name
- results type
- data freeze date
- file format.
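As a rough illustration of this convention, the sketch below joins the fields in the order listed above and skips the two optional ones when they do not apply. The function names and all field values are hypothetical and chosen only for the example; real file names on the FTP site may contain additional components or dots inside a field (for example a compressed file extension), so the strict parser here is an assumption that only holds when all seven fields are present and dot-free.

```python
from typing import Dict, Optional

# Field order as listed in the naming convention above.
FIELD_ORDER = (
    "assembly",
    "cell_type",
    "feature_type",
    "analysis",
    "results_type",
    "freeze_date",
    "file_format",
)


def build_filename(
    assembly: str,
    analysis: str,
    results_type: str,
    freeze_date: str,
    file_format: str,
    cell_type: Optional[str] = None,
    feature_type: Optional[str] = None,
) -> str:
    """Join the fields with dots, omitting optional fields that are not set."""
    parts = [
        assembly,
        cell_type,
        feature_type,
        analysis,
        results_type,
        freeze_date,
        file_format,
    ]
    return ".".join(part for part in parts if part)


def parse_full_filename(name: str) -> Dict[str, str]:
    """Split a name that carries all seven fields (assumption: no dots inside a field)."""
    parts = name.split(".")
    if len(parts) != len(FIELD_ORDER):
        raise ValueError(f"expected {len(FIELD_ORDER)} dot-separated fields, got {len(parts)}")
    return dict(zip(FIELD_ORDER, parts))


# Hypothetical values, for illustration only.
example = build_filename(
    "GRCh38", "ExampleCaller", "peaks", "20161111", "gff",
    cell_type="A549", feature_type="CTCF",
)
fields = parse_full_filename(example)
```

Because the cell type and feature type are optional, a name with fewer than seven fields cannot be mapped back to field names by position alone, which is why the parser refuses such names instead of guessing.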
The data available on our FTP site include:
Peaks: The set of peaks for transcription factors, histone modifications and variants that are part of our regulatory resources. In previous releases these were collated in a single file, called ‘AnnotatedFeatures.gff.gz’, but with our recent expansion to 88 human cell types with ChIP-seq data, the file became too big. We have therefore split it into separate files by cell type and feature type in the ‘Peaks’ subdirectory. The peaks are now available in gff, bed and bigBed formats.
Quality scores: The outcome of the quality checks we run when processing the ChIP-seq data that yielded the peaks. They are provided in JSON format in the ‘QualityChecks’ subdirectory and include:
- the number of mapped reads
- the estimated fragment length and the NSC and RSC values (normalised and relative strand cross-correlation), computed using phantompeakqualtools
- the proportion of reads in peaks
- the enrichment of the ChIP over the Input using CHANCE.
Regulatory build: The current set of regulatory features along with their predicted activity in every cell type. We provide one gff file per cell type in the ‘regulatory_features’ subdirectory.
Transcription factor motifs: The transcription factor binding motifs found within the enriched regions identified by our ChIP-seq analysis pipeline, using position weight matrices from JASPAR. They are provided in gff format.
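Since the peaks, the regulatory build and the motifs are all offered in gff format, a small reader can be handy for a first look at any of them. The sketch below is a minimal example and not an official Ensembl tool: it assumes the usual nine tab-separated gff columns with `key=value` pairs separated by semicolons in the last column. The sample line, the attribute names and the helper names (`read_gff`, `parse_attributes`, `open_maybe_gzip`) are made up for illustration.

```python
import gzip
import io
from typing import Dict, Iterator, TextIO

GFF_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_attributes(field: str) -> Dict[str, str]:
    """Turn 'key=value;key=value' into a dict, ignoring malformed pairs."""
    attributes = {}
    for pair in field.strip().split(";"):
        key, sep, value = pair.strip().partition("=")
        if sep and key:
            attributes[key.strip()] = value.strip()
    return attributes


def read_gff(handle: TextIO) -> Iterator[dict]:
    """Yield one dict per feature line, skipping comments and short lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != len(GFF_COLUMNS):
            continue
        record = dict(zip(GFF_COLUMNS, columns))
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        record["attributes"] = parse_attributes(record["attributes"])
        yield record


def open_maybe_gzip(path: str) -> TextIO:
    """Open a plain or gzip-compressed text file based on its extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


# Made-up sample line; real files will have different sources and attributes.
sample = (
    "##gff-version 3\n"
    "1\texample_source\texample_feature\t100\t500\t.\t.\t.\t"
    "ID=example_1;activity=ACTIVE\n"
)
records = list(read_gff(io.StringIO(sample)))
```

For a downloaded file, the same generator can be fed from `open_maybe_gzip(path)`, which streams the file line by line rather than loading it into memory.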