We often get questions about Ensembl VEP caches – the compressed data files we create for each new Ensembl release, which can be automatically installed with the VEP installer script – so here’s a quick introduction to these handy data bundles.
Underneath the bonnet in Ensembl are relational databases of transcripts, regulatory features and variants, which are used to display our web pages and are accessible through our APIs. But querying a database isn’t the quickest way for VEP to look these up to annotate your variants – a cache is much quicker because the data is stored in local files organised by genomic region. (You will also need a local FASTA genome sequence file for optimal speed.)
Another benefit of the cache is privacy. If you’re working with commercial or clinical data, you may not want to send your queries over the internet to databases on our servers. Cache files allow for completely local data analysis.
The cache directory is built using a complex pipeline that incorporates information from data files and our gene, variation and regulation databases. It cannot be easily edited without running the pipeline from scratch, nor is it easy for humans to read. Inside it you will find an info.txt file, which contains detailed information on the data types and versions available, including the column headers for the variant file and the cell types for which regulatory data is available. This file is essential for the correct functioning of VEP. You may also see a chr_synonyms.txt file, which provides a list of different commonly used names for the reference sequences. This helps if your variants, or any custom or plugin data, use sequence accessions rather than names.
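To get a feel for what info.txt holds, here is a minimal Python sketch that reads it as simple key/value lines. It assumes each line is a key and a value separated by a tab, that lines starting with `#` are comments, and that the variant column headers sit under a key called `variation_cols` as a comma-separated list. Those layout details and the sample text are assumptions for illustration, so check them against the info.txt in your own cache.

```python
from typing import Dict, List


def parse_info(text: str) -> Dict[str, str]:
    """Parse tab-separated key/value lines, skipping blanks and '#' comments."""
    info: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first tab only; a line without a tab yields an empty value.
        key, _, value = line.partition("\t")
        info[key] = value.strip()
    return info


def variant_columns(info: Dict[str, str], key: str = "variation_cols") -> List[str]:
    """Return the variant file column names (the key name is an assumption)."""
    return [col for col in info.get(key, "").split(",") if col]


# Hypothetical excerpt, not copied from a real cache.
sample = (
    "# illustrative info.txt excerpt\n"
    "species\thomo_sapiens\n"
    "assembly\tGRCh38\n"
    "variation_cols\tchr,variation_name,start,end\n"
)

info = parse_info(sample)
print(info["assembly"])
print(variant_columns(info))
```

This only reads the file; as noted above, the cache is not meant to be edited by hand.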
You will find a sub-directory for each chromosome, LRG, MT or other patch sequence – these contain the reference data. The transcript and regulatory data (where available) are stored as compressed Perl objects, which do not need to be unzipped for the cache to work. A separate file is created for each 1Mb region of the genome and these are stored in the sequence-specific directories. Transcript objects for human, most model organisms and farm animals also contain matrices of SIFT results for all possible single nucleotide changes. PolyPhen-2 data is also available for human transcripts. Variant data is stored in a single tab-delimited file per reference sequence. These files are sorted by position to allow rapid extraction of data for specific regions using tabix.
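The 1Mb binning can be sketched in a few lines of Python. The arithmetic below maps a 1-based position to the region that contains it; the `<start>-<end>.gz` file naming and the helper names `region_for` and `region_file` are assumptions made for illustration rather than a description of VEP's own code.

```python
from typing import Tuple

REGION_SIZE = 1_000_000  # one cache file per 1Mb region


def region_for(pos: int, size: int = REGION_SIZE) -> Tuple[int, int]:
    """Return the 1-based, inclusive (start, end) of the region holding pos."""
    if pos < 1:
        raise ValueError("positions are 1-based")
    start = ((pos - 1) // size) * size + 1
    return start, start + size - 1


def region_file(chrom: str, pos: int, suffix: str = ".gz") -> str:
    """Build a relative path for the region file (naming is an assumption)."""
    start, end = region_for(pos)
    return f"{chrom}/{start}-{end}{suffix}"


print(region_for(1_000_000))          # last base of the first region
print(region_for(1_000_001))          # first base of the second region
print(region_file("7", 140_753_336))
```

Because every position maps to exactly one region, a lookup only needs to open the files for the regions that the input variants actually fall in.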
VEP works by taking a chunk of your input variants (by default 5000 at a time) and finding the genes they hit and the known variants they match. When annotating a chunk of variants, VEP looks up only the reference data for the regions covered, to keep memory usage low. This means it is important to sort your input variants by position: a sorted chunk covers fewer regions, so annotation is faster and more efficient.
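A toy Python model shows why sorting matters. It splits variants into fixed-size chunks and counts how many distinct 1Mb regions each chunk touches; the chunk size of 2 and the variant positions are made up to keep the example small, and the model ignores any reuse of data between chunks, so treat it as a sketch of the idea rather than of VEP's actual behaviour.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Set, Tuple

Variant = Tuple[str, int]  # (chromosome, 1-based position)


def chunks(variants: Iterable[Variant], size: int = 5000) -> Iterator[List[Variant]]:
    """Yield successive lists of up to `size` variants."""
    it = iter(variants)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def regions_touched(batch: List[Variant], region_size: int = 1_000_000) -> Set[Tuple[str, int]]:
    """Return the set of (chromosome, region index) pairs a chunk covers."""
    return {(chrom, (pos - 1) // region_size) for chrom, pos in batch}


def total_region_loads(variants: Iterable[Variant], chunk_size: int) -> int:
    """Sum the regions needed per chunk, assuming nothing is kept between chunks."""
    return sum(len(regions_touched(b)) for b in chunks(variants, chunk_size))


# Made-up input: eight variants spread over four regions, interleaved.
unsorted_variants: List[Variant] = [
    ("1", 500), ("1", 3_000_500), ("1", 600), ("1", 3_000_600),
    ("1", 1_000_500), ("1", 2_000_500), ("1", 1_000_600), ("1", 2_000_600),
]
sorted_variants = sorted(unsorted_variants)

print(total_region_loads(unsorted_variants, chunk_size=2))  # 8
print(total_region_loads(sorted_variants, chunk_size=2))    # 4
```

In this toy case the unsorted input needs twice as many region lookups as the sorted input for exactly the same variants.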
Caches are created for each species/assembly in each Ensembl release. We always recommend using a cache whose version matches the version of your VEP code, as there are sometimes changes in format which can lead to inconsistencies between versions.
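If you script your analyses, a small guard can catch a mismatch before a run starts. The Python sketch below assumes the cache sits in a directory named `<release>_<assembly>` (for example `110_GRCh38`) and that you know your VEP code release number; the directory naming and the helper names are assumptions for illustration, so adapt them to how your installation is laid out.

```python
import re
from typing import Optional, Tuple


def parse_cache_dir(name: str) -> Optional[Tuple[int, str]]:
    """Split a '<release>_<assembly>' directory name; return None if it does not fit."""
    match = re.fullmatch(r"(\d+)_(.+)", name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def versions_match(code_version: int, cache_dir_name: str) -> bool:
    """True only when the cache release equals the code release."""
    parsed = parse_cache_dir(cache_dir_name)
    return parsed is not None and parsed[0] == code_version


print(parse_cache_dir("110_GRCh38"))
print(versions_match(110, "110_GRCh38"))
print(versions_match(111, "110_GRCh38"))
```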