We are pleased to announce the release of Ensembl 102 and the corresponding release of Ensembl Genomes 49. This release features new and updated data, including human population frequency data from the NCBI Allele Frequency Aggregator, new plant species and a large update of the available bacterial data.
Genome assemblies and annotation for many new species are also being continuously added to the Ensembl Rapid Release genome browser.
Major data updates for human
Update to translate all non-ATG start codons as Methionine
Up until Ensembl 101, Ensembl/GENCODE followed a literal interpretation of the genetic code using the standard vertebrate codon-translation table. However, a small number of genes use a non-ATG start codon, which the ribosomal machinery translates as Methionine. For human, there are 50 annotated genes with a non-ATG start codon. From Ensembl 102 onwards, these genes will display a Methionine as the first residue in the protein translation. The affected genes are all manually tagged with a ‘non-ATG start’ annotation remark by the HAVANA annotators, and a ‘non-ATG’ attribute will be visible for these transcripts in the transcript tab.
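As a rough illustration of the change, the sketch below translates a coding sequence with the standard genetic code and optionally replaces the first residue with Methionine. It is not the Ensembl implementation: the function name `translate`, the `force_met_start` flag and the example sequence are all hypothetical, and in Ensembl the rule applies only to transcripts carrying the ‘non-ATG start’ tag.

```python
# Standard genetic code, laid out in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds, force_met_start=False):
    """Translate a coding sequence up to the first stop codon.

    force_met_start is a hypothetical flag standing in for the
    'non-ATG start' tag: when set, the first residue is shown as M
    regardless of the actual start codon.
    """
    cds = cds.upper()
    protein = []
    # Walk the sequence codon by codon, ignoring any trailing partial codon.
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        residue = CODON_TABLE[cds[pos:pos + 3]]
        if residue == "*":
            break
        protein.append(residue)
    if force_met_start and protein:
        protein[0] = "M"
    return "".join(protein)


# Made-up coding sequence with a CTG start codon (not a real transcript).
example_cds = "CTGGCCAAATGA"
literal = translate(example_cds)                        # literal reading of the codon table
updated = translate(example_cds, force_met_start=True)  # first residue displayed as M
```

Under the literal reading the CTG start codon is translated as Leucine; with the flag set, the same sequence begins with Methionine and the rest of the protein is unchanged.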
Addition of population frequency data from NCBI Allele Frequency Aggregator (ALFA)
The NCBI Allele Frequency Aggregator (ALFA) was launched earlier this year to provide summary data for variants from more than 1 million individuals across approved controlled-access studies in dbGaP. The initial release of allele frequency data from 100 thousand individuals from 12 populations includes allele counts and frequencies for 447 million variants, and will be available in Ensembl through the Population Genetics pages in the variant tab.
Genome sequences and annotation will be added for three new plant species and two metazoan species:
- Saltwater cress (Eutrema salsugineum) – assembly and annotation from ENA.
- Lavender scallops (Kalanchoe fedtschenkoi) – assembly from ENA and annotation from Phytozome.
- Valley oak (Quercus lobata) – assembly from ENA and annotation from the Valley Oak Genome Project.
- Honey bee mite (Varroa destructor) – assembly from GenBank and annotation from NCBI Annotation Release 100.
- Clytia hemisphaerica – assembly from GenBank and annotation from the Marine Invertebrate Models Database (MARIMBA).
New Assemblies and/or Annotation
An updated genome assembly and annotation of the Tasmanian Devil (Sarcophilus harrisii) will be added to Ensembl 102. The Tasmanian Devil is a carnivorous marsupial which is currently endangered with a declining population on the island of Tasmania. Understanding genetic diversity among the Tasmanian Devil population is thought to be an important step in conservation efforts. The Tasmanian Devil is also an important model organism in the study of Devil Facial Tumour Disease, which is an example of transmissible cancer.
- Purple sea urchin (Strongylocentrotus purpuratus)
- Red fire ant (Solenopsis invicta)
- European honey bee (Apis mellifera)
- Jewel wasp (Nasonia vitripennis)
In Ensembl 102, there will also be a batch update of bacterial and archaeal genomes and annotation from ENA. There will be 31,332 genomes available in Ensembl Bacteria 102, reflecting the following changes:
- 22,088 new genomes added
- 34,804 genomes removed due to redundancy
The update also includes updated annotation of pathogen-host interaction data from PHI-base, alignments to Rfam covariance models (available through the ‘Rfam models’ track) and updated protein features for all species using InterProScan 77.0. Read more in our separate blog post about the updates to Ensembl Bacteria.
Other updates and changes
- Variation data added for soybean and Phaseolus vulgaris from the European Variation Archive.
- Plant Reactome mappings for plant species from Gramene.
- Updated repeat element annotation for selected plants using a custom plant library (nrTEplants).
- Retirement of Ensembl 81 archive site (jul2015.archive.ensembl.org).