Variation consequence types, such as “intronic” or “non-synonymous”, describe the location of a variation or its effect on a transcript. For the latest version of Ensembl (release 62) we have made some significant changes to the way we determine these consequence types, and we’d like to give an overview of these improvements.
Firstly, we can now assign a specific effect to every allele of a variant. For example, rs12795274 has three alleles: the reference allele is T, and there are two alternative alleles, C and A. The A allele is predicted to cause an amino acid change, while the C allele is synonymous. We now list the effect of each individual allele on the website, and you can also fetch them separately when using the variation API.
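To illustrate what a per-allele effect means, here is a minimal Python sketch that classifies each alternative allele by translating the changed codon. It is not the Ensembl variation API. The codon GAT and the position of the variant within it are hypothetical, chosen only so that the outcome mirrors the T/C/A pattern described above; they are not the real sequence context of rs12795274. Any amino acid change, including a change to or from a stop codon, is simply labelled non-synonymous here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def allele_effects(ref_codon, offset, alt_alleles):
    """Return a coarse effect label for each alternative allele.

    ref_codon   -- reference codon containing the variant (hypothetical input)
    offset      -- 0-based position of the variant base within the codon
    alt_alleles -- single-base alternative alleles
    """
    ref_aa = CODON_TABLE[ref_codon]
    effects = {}
    for alt in alt_alleles:
        # Substitute the alternative base into the reference codon.
        alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
        alt_aa = CODON_TABLE[alt_codon]
        effects[alt] = "synonymous" if alt_aa == ref_aa else "non-synonymous"
    return effects


# Hypothetical codon: reference base T in the third position of GAT (Asp).
effects = allele_effects("GAT", 2, ["C", "A"])
print(effects)
```

With this made-up codon, C leaves the amino acid unchanged (GAC is still Asp) while A changes it (GAA is Glu), so the two alternative alleles of one variant receive different consequence labels.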
Another improvement we’ve made is that “under the hood” we now use terms defined in the Sequence Ontology (SO) to describe the consequence types. Moving to this set of externally maintained terms should make it easier to compare Ensembl annotations with those from other groups. The SO also groups the various terms we use into a hierarchical tree and, in the future, this will let users query for variants with particular effects in a much smarter way than is possible now. On the website we are still using our old terms by default, but you can see the mapping between the old terms and the SO terms on the variation documentation page, and you can use “Configure this page” on several variation views to choose which set of terms you want to see (here’s an example).
We also now provide SIFT and PolyPhen predictions for any variant that is predicted to cause an amino acid substitution in human. These are popular tools, developed by external groups, that try to predict the effect of a non-synonymous mutation on the function of the protein. You can see these predictions on several variation views; a useful example is the protein variation view.
You can find more information about these tools and how we run them in Ensembl on the variation documentation page.
All of these improvements are also available for analysing your own data with the Variant Effect Predictor (VEP). The VEP has new configuration options that let you choose which set of terms to use for the consequence annotations, and it also offers options to fetch SIFT and PolyPhen predictions for any missense mutations in your data. We are able to provide these predictions for novel mutations because we compute the SIFT and PolyPhen predictions for all possible amino acid substitutions in human proteins and store them in the variation database. We hope that this makes the VEP even more useful for mining your data, and we plan to add support for this sort of tool in other species in the near future.
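To give a feel for what “all possible amino acid substitutions” covers, the following Python sketch enumerates every single-residue substitution for a short peptide. The peptide "MKT" and the function name are hypothetical, and the sketch says nothing about how the predictions are actually computed or stored in the variation database; it only shows the set of keys such a precomputed table would need.

```python
# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def all_substitutions(protein_seq):
    """Yield (position, ref_aa, alt_aa) for every possible single substitution.

    Positions are 1-based. Each residue can be replaced by any of the other
    19 standard amino acids.
    """
    for pos, ref_aa in enumerate(protein_seq, start=1):
        for alt_aa in AMINO_ACIDS:
            if alt_aa != ref_aa:
                yield pos, ref_aa, alt_aa


# Hypothetical three-residue peptide, used only for illustration.
subs = list(all_substitutions("MKT"))
print(len(subs))
```

Each position contributes 19 substitutions, so a protein of length N needs 19 × N stored predictions per tool. Because every substitution is covered in advance, a novel missense mutation can be looked up without rerunning SIFT or PolyPhen.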

Hello
I have been looking at the source code for the variant predictor script and
I have a question regarding a section of the code:
828 else {
829 $ref = substr($ref, 1);
830 $ref = '-' if $ref eq '';
831 $start++;
832
833 foreach my $alt_allele(split /\,/, $alt) {
834 $alt_allele = substr($alt_allele, 1);
835 $alt_allele = '-' if $alt_allele eq '';
836 push @alts, $alt_allele;
837 }
838 }
839
840 $alt = join "/", @alts;
It does not make sense to me that you define $alt_allele as a substring of
$alt_allele in line 834.
Indels can be bigger than 1 bp, so why are you doing this?
You do the same for variations that have only 1 alternative allele (line
871):
868 else {
869 # chop off first base
870 $ref = substr($ref, 1);
871 $alt = substr($alt, 1);
Can you explain why this is necessary or indeed correct?
Best regards,
Duarte Molha
Hi,
The right place for questions about specific scripts is dev@ensembl.org. I see you have already posted it there!
For others, here is the reply from Will on the variation team:
This relates to the different ways in which VCF and Ensembl represent
insertions and deletions.
In VCF format, the base immediately before the change is included in
the reference sequence column, and the start coordinate represents
this base.
In Ensembl format, we only include the bases affected by the change,
so we have to trim off that extra base in order for the API to be able
to interpret the variant in the same way as those given to the script
in other formats.
Here is an example of a deletion:
Reference: AACTG
Variant: AACG
In VCF, this would be denoted with the reference sequence CT, variant
sequence C, and a coordinate of 3.
In Ensembl, this is denoted with reference sequence T, variant
sequence - (where “-” represents the absence of any sequence), and a
start coordinate of 4, end coordinate 4.
Hence to convert between the two we trim off the first base of the
reference and variant sequence columns (using substr($string, 1)) and
increment the coordinate by 1. In the case of the variant sequence,
trimming off the first base leaves an empty string, so we substitute
this with “-”.
Here is an example of an insertion:
Reference: AACTA
Variant: AACGGTA
In VCF, this is denoted with reference sequence C, variant sequence
CGG, and a coordinate of 3.
In Ensembl, the same variant is denoted reference sequence -, variant
sequence GG, start 4 and end 3 (start is greater than end to denote an
insertion).
Again, to convert between the two we trim off the first base; in this
case the reference sequence is now an empty string, so it is replaced
by “-”.
I hope this makes sense!
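The conversion described in the reply above can be sketched in a few lines. This is a standalone Python illustration, not the VEP script itself (which is Perl); the function name and return shape are assumptions. The end coordinate rule is also an assumption, inferred from the two worked examples rather than taken from the quoted code. Like the quoted branch, it should only be applied to indel records whose first base is the shared base before the change.

```python
def vcf_indel_to_ensembl(pos, ref, alts):
    """Convert a VCF-style indel to Ensembl-style coordinates and alleles.

    pos  -- VCF coordinate (the base immediately before the change)
    ref  -- VCF reference sequence, including that leading base
    alts -- list of VCF alternative sequences, each including that base
    Returns (start, end, ref_allele, alt_alleles).
    """
    # Trim the leading base; an empty result means "no sequence", written "-".
    trimmed_ref = ref[1:]
    ens_ref = trimmed_ref or "-"
    ens_alts = [(alt[1:] or "-") for alt in alts]

    # The first affected base is one position after the VCF coordinate.
    start = pos + 1
    # Assumed rule: end covers the trimmed reference bases. For an insertion
    # there are none, so end = start - 1 (start greater than end).
    end = start + len(trimmed_ref) - 1
    return start, end, ens_ref, ens_alts


# Deletion example from the reply: AACTG -> AACG
print(vcf_indel_to_ensembl(3, "CT", ["C"]))
# Insertion example from the reply: AACTA -> AACGGTA
print(vcf_indel_to_ensembl(3, "C", ["CGG"]))
```

Slicing from index 1 removes only the leading base, so indels longer than 1 bp keep all of their affected bases, which answers the original question about substr($alt_allele, 1).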