Some Variant Effect Predictor (VEP) jobs are small, just ten or fewer variants, and that’s easy. Some VEP jobs are big: if you do variant calling on one whole human genome, that’s five million variants! The more variants you have, the more computing power the VEP needs to process them, which can make it slow. But there are ways to speed it up.
The VEP (Variant Effect Predictor) is our most popular tool and is incredibly useful for annotating genetic variants with the genes they hit and what effect they have on them. But did you know you can filter your results? Both in the web interface and using the script?
The Variant Effect Predictor (VEP) is one of Ensembl’s most popular tools. It has grown in six years from a simple Perl script with just a couple of hundred lines of code to become a multi-limbed beast with thousands of lines of code and well over 100 configurable options.
VEP is now used by many high-profile projects, institutes and companies around the world. In order to effectively manage this growth and ensure we deliver the most reliable and feature-rich variant annotator out there, we’ve had to go back to basics. Over the past six months the VEP codebase has been totally rewritten, and the new version is now available for download. Users of VEP’s web and REST API interfaces should see virtually no difference with the new version, so if that’s you, you can stop reading now!
For users of our command line tool, you can trial the new VEP by visiting https://github.com/Ensembl/ensembl-vep. The full list of changes to the code can be found in the README on GitHub, but these are the main points of note:
- Faster: process an individual genome in around 30 minutes.
- Backward-compatible: all data sources (cache files, databases) and most command line flags from the old code are fully compatible with the new code.
- More reliable: test-driven development means the new code is covered by more than 1500 unit tests with over 99% statement coverage.
For those tied to the current codebase, it is still available as part of the ensembl-tools GitHub repository, though updates and support for this will cease over time. Ensembl release 87 will be the last for which the ensembl-tools version of VEP will be the “primary” VEP codebase. Of course, the previous code and supporting data will remain available as part of Ensembl’s archiving strategy.
Some other points of note:
- The documentation at ensembl.org still refers to the old code. From Ensembl release 88 onwards full documentation for the new code will be made available.
- If possible, please report any issues you may find with the new code as a GitHub Issue.
- The code that calculates variant consequence types (e.g. missense_variant, stop_gained) remains a part of the ensembl-variation API module and has not been (significantly) updated; it is used by both the old and new code. The ensembl-vep codebase performs the following functions:
- parsing command line flags
- parsing input
- reading data from annotation sources (databases, cache files, flat files)
- interval alignment of input variants with annotation data
- writing output
- monitoring statistics
- data filtering interface
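As a rough illustration of the interval-alignment step in the list above, the sketch below matches variants to annotation features by coordinate overlap. It is not the ensembl-vep implementation: the `Feature` and `Variant` types, their field names and the example coordinates are all made up for illustration, and a real annotator would use an indexed structure rather than a linear scan.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    # Hypothetical annotation record (e.g. a transcript); 1-based inclusive coordinates.
    name: str
    chrom: str
    start: int
    end: int


class Variant(NamedTuple):
    # Hypothetical input variant; 1-based inclusive coordinates.
    name: str
    chrom: str
    start: int
    end: int


def overlapping_features(variant, features):
    """Return names of features on the same chromosome that overlap the variant."""
    return [
        f.name
        for f in features
        if f.chrom == variant.chrom
        and f.start <= variant.end
        and variant.start <= f.end
    ]


# Made-up example data.
features = [
    Feature("transcript_A", "1", 100, 500),
    Feature("transcript_B", "1", 450, 900),
    Feature("transcript_C", "2", 100, 500),
]
snp = Variant("var1", "1", 470, 470)
print(overlapping_features(snp, features))
```

Two intervals overlap when each one starts no later than the other ends, which is the test used in the list comprehension.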
As “it says on the tin”, the VEP predicts the effect of variants (i.e. SNPs, indels and CNVs) on genes and regulatory elements. It tells you where your variants are located (e.g. introns, coding exons, transcription factor binding motifs), what effect they may have on protein coding sequences, and whether these effects might be deleterious or benign.
The VEP does this by mapping your variants against genes, transcripts, translations, and regulatory features that we annotate in Ensembl.
The Variant Effect Predictor can also be run against other gene sets: you can predict the effect of your variants on RefSeq genes too!
What is new?
The new web interface is more user-friendly and has lots of improvements:
Increased number of variants you can input
You can upload up to one million variants in a compressed format, with a 50 MB file size limit. To upload these larger files, you simply need to log in. If you do not have an Ensembl account, you are missing out, as there are many perks of registering. It’s easy to do: just provide your name and email address. If you’d rather not register, the upload limit drops down to 5 MB, i.e. around 100,000 variants.
Display of results
We provide a summary statistics table and pie charts illustrating the different SO terms and the classes of coding consequences for the variants you input.
The results preview table with additional details is shown after the pie charts. You can apply a range of filters to any of the data fields and limit the results you see. The full or filtered results can be downloaded as VCF or tab-delimited text for import into Excel.
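If you download the tab-delimited results, per-consequence counts like those in the summary can be tallied with a few lines of Python. This is only a sketch under assumptions: it assumes metadata lines start with `##`, that the first remaining line is the header, and that a column named `Consequence` holds comma-separated SO terms. The `count_consequences` helper and the example rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_consequences(handle, column="Consequence"):
    """Count SO consequence terms in a tab-delimited results file."""
    # Skip '##' metadata lines; the first remaining line is taken as the header.
    lines = (line for line in handle if not line.startswith("##"))
    reader = csv.DictReader(lines, delimiter="\t")
    counts = Counter()
    for row in reader:
        # A row may list several comma-separated terms.
        counts.update(term for term in row[column].split(",") if term)
    return counts


# Made-up example input; the layout is an assumption, not a format specification.
example = io.StringIO(
    "## example metadata line\n"
    "#Uploaded_variation\tLocation\tConsequence\n"
    "var1\t1:470\tmissense_variant\n"
    "var2\t1:880\tintron_variant\n"
    "var3\t2:150\tmissense_variant,splice_region_variant\n"
)
counts = count_consequences(example)
print(counts.most_common())
```

For a real file, pass an open file handle instead of the `io.StringIO` example and adjust the column name if your header differs.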
The new ticket tracker
You can run several jobs at the same time and track them back at a later date via the ticket numbers assigned to them. You can easily edit and re-run previous jobs. These jobs will be kept on our Ensembl servers for 30 days. If you register, though, the jobs will be kept for as long as you like.
Population data from the NHLBI Exome Sequencing Project (ESP)
The VEP provides frequency data for known variants from both the 1000 Genomes and NHLBI exome sequencing projects.
You can also use this frequency data to filter your variants: you may wish to exclude known variants with a frequency above 1%, for example.
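The logic of such a frequency filter can be sketched as follows. This is an illustration only, not how the VEP implements filtering: the `is_rare` helper, the record layout and the frequency values are made up, and it assumes that variants with no frequency data should be kept.

```python
def is_rare(frequencies, threshold=0.01):
    """True if no reported population frequency exceeds the threshold.

    Variants with no frequency data (e.g. novel variants) are kept.
    """
    return all(freq <= threshold for freq in frequencies.values())


# Made-up example records: variant name -> {population: allele frequency}.
variants = {
    "var1": {"EUR": 0.25, "AFR": 0.10},
    "var2": {"EUR": 0.004},
    "var3": {},  # not seen in any population
}

rare = [name for name, freqs in variants.items() if is_rare(freqs)]
print(rare)
```

Here a variant is excluded as soon as any population reports a frequency above 1%; you could instead test a single population of interest.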
VEP results are linked to BioMart
The results table in the VEP is now directly linked to BioMart, a data export tool.
This allows you to retrieve additional data about known variants or the genes your variants affect.
You just need to select the attributes in BioMart, e.g. phenotype, orthologues, Gene Ontology terms, and you are ready to go.
Other ways to access the VEP
If you use a command line, you can run the VEP with our script on your own computer. With the Perl script, you can do everything you can do in the online version plus much, much more! It’s the most powerful way to use the VEP.
A couple of VEP functions (e.g. fetching variant consequences) are also available in the beta version of our language-agnostic REST API.
Help on the VEP
If you have questions or comments, please get in touch with us.
It has been quite a while since we’ve blogged about the VEP (Variant Effect Predictor), and in that time we’ve added a whole load of new features, particularly to the downloadable script version.
The VEP now supports finding the consequences of structural variants, with input either in VCF or tab-delimited format. Using the web interface to the VEP you can visualise which transcripts and features your structural variants overlap by clicking through to the Region in Detail view.
We’ve really pushed the VEP script’s capabilities when using local “caches” (as opposed to using remote databases). Almost every feature of the VEP is now available when using the cache in offline mode. You can use a local FASTA file to quickly retrieve the sequences required to construct HGVS notations. You can even construct your own cache from a GTF file if your species isn’t supported by Ensembl.
Our cache for human now contains allele frequency data from phase 1 of the 1000 Genomes Project, and you can use these frequencies to filter your input (for example, you might want to filter out variants that are common in the combined European (EUR) population). We also now provide SIFT predictions for 8 species – human, mouse, zebrafish, pig, cow, chicken, rat and dog.
We’re always trying to add new and useful features to the VEP, but we also recognise that other users have great ideas that they’d like to implement. The VEP script enables the use of plugins; these are bits of code that add extra functionality to the VEP. They can be used to retrieve data from remote sources, run external tools, filter output; pretty much anything you can think of can be accomplished in a plugin!
It’s easy to get started, and a basic plugin can be just a few lines of code – have a look at some of the examples we’ve created.
I recently added a plugin to retrieve data from dbNSFP – this is a great resource created by Liu et al. in Houston, TX. They have, for every possible missense substitution in the human genome, pre-calculated pathogenicity scores, frequencies, conservation scores and a plethora of other things, and made all of this available as an easily downloadable file. To use this with the VEP, you just download the file and the plugin, run a couple of commands to get the data into the right format, and away you go – the VEP can now provide you with scores from LRT, MutationAssessor, MutationTaster, FATHMM and more for any missense substitution in your input.
Summary and HTML output
We had a number of requests for the VEP to provide summary statistics at the end of each run, and who are we to disappoint our loyal users?!? The VEP now writes a pretty HTML summary.
You can also view your output as HTML using the --html flag, which allows you to sort, filter and analyse your output on the fly.