Normalising variants to standardise Ensembl VEP output

Variants can be represented in many different ways: Ensembl VEP currently accepts input in several formats, including VCF, HGVS and SPDI. However, even within a single specification, a variant can be described ambiguously, because insertions and deletions within repeated regions can be placed at multiple equivalent locations. For example, VCF files conventionally describe such variants at their most 5’ position, while HGVS describes them at their most 3’ position.

Starting in Ensembl 100, VEP optionally normalises variants within repeated regions by shifting them as far as possible in the 3’ direction before consequence calculation. This standardises VEP output for equivalent variant alleles which are described using different conventions. 
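To make the idea of 3’ shifting concrete, here is a minimal Python sketch for a deletion. It is not VEP's implementation: the function name, the toy sequence and the 0-based coordinates are assumptions made for illustration, and it works on a single strand without any notion of transcripts.

```python
def shift_deletion_3prime(seq: str, start: int, length: int) -> tuple[int, str, int]:
    """Slide a deletion as far 3' as possible without changing the result.

    seq is the reference on the strand of interest, start is the 0-based
    index of the first deleted base and length is the number of deleted bases.
    Returns (new_start, deleted_bases, distance_shifted).
    """
    original_start = start
    # Moving the deletion window one base 3' leaves the resulting sequence
    # unchanged exactly when the base leaving the window equals the base
    # entering it.
    while start + length < len(seq) and seq[start] == seq[start + length]:
        start += 1
    return start, seq[start:start + length], start - original_start


reference = "ACTGGAGGAGGATCA"  # toy sequence containing a GGA repeat
print(shift_deletion_3prime(reference, 3, 3))  # (9, 'GGA', 6)
```

The deletion first described at index 3 ends up at index 9, the last position at which removing three bases still yields the same sequence.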

We have introduced three options to VEP to provide this functionality:

  • shift_3prime – Right aligns all variants relative to their associated transcripts prior to consequence calculation.
  • shift_genomic – Right aligns all variants, including intergenic variants, before consequence calculation and updates the Location field.
  • shift_length – Reports the distance each variant has been shifted when used in conjunction with --shift_3prime.
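As a rough sketch of how these options might be combined on the command line, the hypothetical Python helper below assembles a VEP argument list. The helper name and the file names are made up for illustration, and passing the value 1 to --shift_3prime and --shift_genomic is an assumption that should be checked against the documentation for your VEP version.

```python
def build_vep_command(input_file: str, output_file: str,
                      shift_3prime: bool = True,
                      shift_genomic: bool = False,
                      shift_length: bool = False) -> list[str]:
    """Assemble a VEP command line as a list of arguments (hypothetical helper)."""
    cmd = ["vep", "-i", input_file, "-o", output_file, "--cache"]
    if shift_3prime:
        # Assumption: the option takes 1 to switch shifting on.
        cmd += ["--shift_3prime", "1"]
    if shift_genomic:
        # Assumption: same on/off convention as --shift_3prime.
        cmd += ["--shift_genomic", "1"]
    if shift_length:
        cmd.append("--shift_length")
    return cmd


print(" ".join(build_vep_command("input.vcf", "output.txt", shift_length=True)))
```

The resulting list can be passed to subprocess.run if you want to launch VEP from a script.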

For the vast majority of variants, this change will have little to no effect on VEP analysis. Only 2.5% of variants found within ClinVar (2019-03-19) required shifting. However, for 12% of these shifted variants (0.3% of all variants within our dataset), normalisation led to entirely new consequence predictions. When a repetitive sequence spans an intron-exon boundary, a start or stop codon, or the junction between coding and non-coding sequence, normalising variants in this way can lead to significantly different VEP annotations.
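The relationship between these percentages is a simple product, which the short Python check below works through using only the figures quoted above (the variable names are ours).

```python
shifted = 0.025             # fraction of ClinVar variants that required shifting
changed_if_shifted = 0.12   # fraction of shifted variants with new consequences

# Fraction of all variants in the dataset whose consequences changed.
changed_overall = shifted * changed_if_shifted
print(f"{changed_overall:.1%}")  # 0.3%
```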

For example, consider the deletion of GGA at 1:237674095-237674097, which lies within a GGA repeat that overlaps three Ensembl transcripts, including ENST00000360064. At its reported position, the variant overlaps the intron-exon boundary and is reported as a splice_acceptor_variant. After normalisation, however, the variant is shifted in the 3’ direction and falls entirely within the coding region, so the inframe_deletion consequence is reported instead.
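The same effect can be reproduced on a toy sequence. The Python sketch below does not use the real locus: the sequence, the boundary position and the region labels are invented for illustration, and the labels are not VEP consequence terms. It shows that two descriptions of a GGA deletion give the same resulting sequence while sitting differently relative to an intron-exon boundary.

```python
INTRON = "TTTCAG"        # toy intron ending in an AG acceptor site
EXON = "GAGGAGGATCC"     # toy exon; the GGA repeat straddles the boundary
SEQ = INTRON + EXON
EXON_START = len(INTRON)  # 0-based index of the first exonic base


def delete(seq: str, start: int, length: int) -> str:
    """Return seq with `length` bases removed from 0-based index `start`."""
    return seq[:start] + seq[start + length:]


def region(start: int, length: int, exon_start: int) -> str:
    """Describe where a deletion sits relative to the intron-exon boundary."""
    if start + length <= exon_start:
        return "intron"
    if start >= exon_start:
        return "exon"
    return "spans boundary"


reported, shifted = 5, 11  # most 5' and most 3' descriptions of the deletion
print(delete(SEQ, reported, 3) == delete(SEQ, shifted, 3))  # True
print(region(reported, 3, EXON_START))  # spans boundary
print(region(shifted, 3, EXON_START))   # exon
```

Both descriptions produce an identical allele, yet only the 5’ description touches the intron, which is why the choice of convention can change the reported consequence.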

Initially, this functionality is available through our command line and REST interfaces; we will make it available via our web tool in a future release. Additionally, human variants viewed through the Ensembl browser have their consequences shifted by default. More information regarding shifting alongside some command-line examples can be found here.

Currently, the normalisation is optional to minimise disruption to your workflows while providing our most up-to-date annotations for those who want them. We intend to change this so that normalisation will be the default functionality in the future. Please get in touch if you have any queries or comments.

This blog was written by Andrew Parton, Senior Scientific Programmer in the Variation Annotation team.