Cool stuff the Ensembl VEP can do: parse protein HGVS

HGVS notation is an excellent way to describe variants in proteins, and VEP can interpret variants described this way to see if they are already known or if they affect other genomic features, so long as there is enough information to find a unique genomic location. If there isn’t, the Variant Recoder can help you to find the variant you need.

Consider the variant description PHF21B:p.Tyr124Cys. A human reader understands that the tyrosine at position 124 of the protein PHF21B is replaced by a cysteine. To interpret it computationally, however, VEP needs to make some assumptions. The first assumption is which protein the description refers to. Since most genes have multiple protein-coding transcripts, VEP has to identify any transcripts of the gene PHF21B with a tyrosine at position 124 in the amino acid sequence. If VEP cannot find any such transcript, it will reject the variant as not mapping to the genome. If it finds multiple transcripts with a tyrosine in the right place (and these may be different tyrosines), it will identify a primary transcript by transcript quality and use that. You can make it easier for VEP by specifying a protein sequence accession such as an Ensembl ENSP or RefSeq NP, or even a transcript ID such as an Ensembl ENST or RefSeq NM. If you don’t supply a sequence accession and version (these are required in the HGVS guidelines), the genomic variant may not be precisely identified.
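To make the first assumption concrete, here is a minimal Python sketch of reading a simple protein substitution and checking which protein sequences carry the expected residue. It is an illustration only, not how VEP is implemented: the names parse_protein_hgvs and matching_transcripts are hypothetical, the pattern handles only plain three-letter substitutions, and the sequences in the example are made-up toy data rather than real PHF21B proteins.

```python
import re
from typing import NamedTuple

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

class ProteinSubstitution(NamedTuple):
    reference: str  # gene symbol or sequence accession before the colon
    ref_aa: str     # one-letter reference amino acid
    position: int   # 1-based position in the protein
    alt_aa: str     # one-letter replacement amino acid

# Only simple substitutions such as "PHF21B:p.Tyr124Cys" are recognised.
PATTERN = re.compile(
    r"^(?P<reference>[^:]+):p\.(?P<ref>[A-Z][a-z]{2})(?P<pos>\d+)(?P<alt>[A-Z][a-z]{2})$"
)

def parse_protein_hgvs(description):
    """Split a simple protein substitution into its parts (hypothetical helper)."""
    match = PATTERN.match(description)
    if match is None or match["ref"] not in THREE_TO_ONE or match["alt"] not in THREE_TO_ONE:
        raise ValueError(f"Not a simple protein substitution: {description!r}")
    return ProteinSubstitution(
        match["reference"],
        THREE_TO_ONE[match["ref"]],
        int(match["pos"]),
        THREE_TO_ONE[match["alt"]],
    )

def matching_transcripts(variant, proteins):
    """Return IDs whose protein has the reference residue at the given position."""
    index = variant.position - 1  # HGVS positions are 1-based
    return [
        transcript_id
        for transcript_id, sequence in proteins.items()
        if index < len(sequence) and sequence[index] == variant.ref_aa
    ]

# Toy sequences, invented for illustration only.
toy_proteins = {
    "toy_tx_A": "M" + "A" * 122 + "Y" + "K" * 10,  # tyrosine at position 124
    "toy_tx_B": "M" + "A" * 133,                   # alanine at position 124
    "toy_tx_C": "M" * 50,                          # too short
}
variant = parse_protein_hgvs("PHF21B:p.Tyr124Cys")
print(matching_transcripts(variant, toy_proteins))
```

With an empty result the description would not map anywhere, and with several matches some further rule, such as the transcript quality ranking mentioned above, would be needed to pick one.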

The second assumption it makes is the underlying base change. The tyrosine in PHF21B is encoded by UAU, and cysteine can be encoded by UGU or UGC. VEP assumes the variant results from the fewest possible changes with respect to the reference sequence, so it assumes that the UAU tyrosine codon becomes a UGU cysteine codon, making it a single A/G base change.
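The fewest-changes idea can be sketched in a few lines of Python. This is a toy illustration of the reasoning, not VEP's own code: the function name minimal_codon_changes is hypothetical, and it simply compares the reference codon with every codon for the new amino acid in the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons ordered UUU, UUC, UUA, UUG, UCU, ...
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def base_differences(codon_a, codon_b):
    """Count the positions at which two codons differ."""
    return sum(a != b for a, b in zip(codon_a, codon_b))

def minimal_codon_changes(ref_codon, alt_amino_acid):
    """Return the codons for alt_amino_acid that need the fewest base changes."""
    candidates = [c for c, aa in CODON_TABLE.items() if aa == alt_amino_acid]
    fewest = min(base_differences(ref_codon, c) for c in candidates)
    return [c for c in candidates if base_differences(ref_codon, c) == fewest]

# Tyrosine UAU to cysteine (one-letter code C): only UGU is one change away.
print(minimal_codon_changes("UAU", "C"))  # ['UGU']
```

A result with exactly one codon means the protein change pins down a single base change. A result with more than one codon means the protein description alone cannot say which base changed.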

If several equally small changes are possible, as with ENST00000218516.3:p.Met290Leu, VEP returns nothing. Methionine is encoded by AUG, and leucine can be created from that by two different single base changes: UUG or CUG. VEP cannot tell which change you mean, so it gives no result. To get around this, you can use the Variant Recoder tool, which will convert your protein HGVS into all compatible transcript and genomic HGVS. You can then input all of these into VEP, or pick the one you want to use.
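If you prefer to script this step, the Variant Recoder is also available through the Ensembl REST service. The sketch below assumes the public server at https://rest.ensembl.org, its variant_recoder/:species/:id endpoint and the content-type query parameter; the helper names recoder_url and fetch_recoded are hypothetical, and the shape of the returned JSON is not assumed here, so check the REST documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

# Assumed public Ensembl REST server.
REST_SERVER = "https://rest.ensembl.org"

def recoder_url(hgvs, species="human"):
    """Build a Variant Recoder request URL for one variant description."""
    # Keep the colon readable, percent-encode anything else that needs it.
    encoded = urllib.parse.quote(hgvs, safe=":")
    return (
        f"{REST_SERVER}/variant_recoder/{species}/{encoded}"
        "?content-type=application/json"
    )

def fetch_recoded(hgvs, species="human"):
    """Fetch and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(recoder_url(hgvs, species), timeout=30) as response:
        return json.load(response)

print(recoder_url("ENST00000218516.3:p.Met290Leu"))
```

Calling fetch_recoded("ENST00000218516.3:p.Met290Leu") would then return the decoded response, from which you can take the transcript or genomic HGVS you want and pass it to VEP.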

VEP decodes protein HGVS when the underlying base change is unambiguous, but not when it’s ambiguous.

Since VEP is a value-added tool, it can supply further information about the variant PHF21B:p.Tyr124Cys such as that its identifier is rs115264708, it has a MAF of 0.0032 and doesn’t currently have any clinical significance assertions.