Variation consequences in release 62

Variation consequence types, such as “intronic” or “non-synonymous”, describe where a variation falls relative to a transcript or what effect it has on that transcript. For the latest version of Ensembl (release 62) we have made some significant changes to the way we determine these consequence types, and we’d like to give an overview of the improvements.

Firstly, we are now able to assign a specific effect to every allele of a variant. For example, rs12795274 has three alleles: the reference allele T and two alternative alleles, C and A. The A allele is predicted to cause an amino acid change, while the C allele is synonymous. We now list the effect of each individual allele on the website, and you can also fetch them separately when using the variation API.

Another improvement is that “under the hood” we now use terms defined in the Sequence Ontology (SO) to describe the consequence types. Moving to this set of externally maintained terms should make it easier to compare Ensembl annotations with those from other groups. The SO also arranges the terms we use in a hierarchy and, in the future, this will let users query for variants with particular effects in a much smarter way than is possible now. On the website we still use our old terms by default, but you can see the mapping between the old terms and the SO terms on the variation documentation page, and you can use “Configure this page” on several variation views to choose which set of terms you want to see (here’s an example).

We now also provide SIFT and PolyPhen predictions for any variant that is predicted to cause an amino acid substitution in human. These are popular tools, developed by external groups, that try to predict the effect of a non-synonymous mutation on the function of the protein. You can see these predictions on several variation views; a useful example is the protein variation view. You can find more information about these tools and how we run them in Ensembl on the variation documentation page.


All of these improvements are also available for analysing your own data with the Variant Effect Predictor (VEP). The VEP has new configuration options that let you choose which set of terms to use for the consequence annotations, and it also offers options to fetch SIFT and PolyPhen predictions for any missense mutations in your data. We are able to provide these predictions for novel mutations because we precompute the SIFT and PolyPhen predictions for all possible amino acid substitutions in human proteins and store them in the variation database. We hope that this makes the VEP even more useful for mining your data, and we plan to add support for these sorts of tools in other species in the near future.

3 thoughts on “Variation consequences in release 62”

  1. Hello

    I have been looking at the source code for the variant predictor script and
    I have a question regarding a section of the code:

    828 else {
    829     $ref = substr($ref, 1);
    830     $ref = '-' if $ref eq '';
    831     $start++;
    832
    833     foreach my $alt_allele(split /\,/, $alt) {
    834         $alt_allele = substr($alt_allele, 1);
    835         $alt_allele = '-' if $alt_allele eq '';
    836         push @alts, $alt_allele;
    837     }
    838 }
    839
    840 $alt = join "/", @alts;

    It does not make sense to me that you define $alt_allele as a substring of
    $alt_allele in line 834.

    Indels can be bigger than 1 bp so why are you doing this?

    You do the same for variations that have only 1 alternative allele (line
    871):

    868 else {
    869 # chop off first base
    870 $ref = substr($ref, 1);
    871 $alt = substr($alt, 1);

    Can you explain why this is necessary or indeed correct?

    Best regards,

    Duarte Molha

  2. Hi,

    The right place for questions about specific scripts is dev@ensembl.org. I see you have already posted it there!

    For others, here is the reply from Will on the variation team:

    This relates to the different ways in which VCF and Ensembl represent
    insertions and deletions.

    In VCF format, the base immediately before the change is included in
    the reference sequence column, and the start coordinate represents
    this base.

    In Ensembl format, we only include the bases affected by the change,
    so we have to trim off that extra base in order for the API to be able
    to interpret the variant in the same way as those given to the script
    in other formats.

    Here is an example of a deletion:

    Reference: AACTG
    Variant: AACG

    In VCF, this would be denoted with the reference sequence CT, variant
    sequence C, and a coordinate of 3.

    In Ensembl, this is denoted with reference sequence T, variant
    sequence - (where "-" represents the absence of any sequence), and a
    start coordinate of 4, end coordinate 4.

    Hence to convert between the two we trim off the first base of the
    reference and variant sequence columns (using substr($string, 1)) and
    increment the coordinate by 1. In the case of the variant sequence,
    trimming off the first base leaves an empty string, so we substitute
    this with "-".

    Here is an example of an insertion:

    Reference: AACTA
    Variant: AACGGTA

    In VCF, this is denoted with reference sequence C, variant sequence
    CGG, coordinate 3.

    In Ensembl, the same variant is denoted reference sequence -, variant
    sequence GG, start 4 and end 3 (start is greater than end to denote an
    insertion).

    Again, to convert between the two we trim off the first base; in this
    case the reference sequence is now an empty string, so it is replaced
    by "-".

    I hope this makes sense!

  3. Pingback: Condel goes Ensembl! » Computational Oncogenomics
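
The VCF-to-Ensembl conversion described in the second comment can be sketched in a few lines. This is a minimal illustration in Python rather than the script's Perl, and it is not the VEP's actual code: the function name `vcf_indel_to_ensembl` is hypothetical, and the end-coordinate rule (start plus the number of remaining reference bases, minus one) is an assumption inferred from the two worked examples in the reply. It applies only to indels written with the leading anchor base, not to plain substitutions.

```python
def vcf_indel_to_ensembl(pos, ref, alt):
    """Convert a VCF-style indel to an Ensembl-style (start, end, allele string).

    Hypothetical helper for illustration only. `alt` may hold several
    comma-separated alternative alleles, as in the VCF ALT column.
    """
    # VCF includes the base immediately before the change; trim it off.
    trimmed_ref = ref[1:]
    trimmed_alts = [allele[1:] for allele in alt.split(",")]

    # The start coordinate moves past the trimmed base.
    start = pos + 1

    # Assumption inferred from the worked examples: end covers the remaining
    # reference bases, so a pure insertion (no bases left) gives end = start - 1.
    end = start + len(trimmed_ref) - 1

    # An empty allele is written as "-" (absence of any sequence).
    alleles = [trimmed_ref or "-"] + [allele or "-" for allele in trimmed_alts]
    return start, end, "/".join(alleles)


# Deletion example from the reply: AACTG -> AACG
deletion = vcf_indel_to_ensembl(3, "CT", "C")

# Insertion example from the reply: AACTA -> AACGGTA
insertion = vcf_indel_to_ensembl(3, "C", "CGG")
```

With the inputs from the reply, `deletion` corresponds to T/- at start 4, end 4, and `insertion` to -/GG at start 4, end 3.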