Ensembl 97 and Ensembl Genomes 44 have been released!

In this release you’ll find many new species, including some hybrid livestock, as well as important changes to gene sets for human and mouse and a new update to the human Regulatory Build.

Read on to explore the full details.

GENCODE updates and lncRNA biotype changes

Both the human and mouse GENCODE gene sets are updated, to versions 31 and M22 respectively. There are revisions to both protein-coding and non-coding gene annotation (more about the latter below). You can find the new gene counts and statistics here for human, and here for mouse.

There have been major changes to long non-coding RNA (lncRNA) genes, with several thousand new transcripts being added as a result of a new pipeline created by the GENCODE team: TAGENE. There are also significant changes to the biotype categories of lncRNA transcripts in the GENCODE gene sets for human and mouse. Previously (release 96 and earlier) there were nine biotypes classified under the header of ‘lncRNA’:

  1. non_coding
  2. lincRNA
  3. macro_lncRNA
  4. antisense
  5. sense_intronic
  6. sense_overlapping
  7. 3’_overlapping_ncRNA
  8. bidirectional_promoter_lncRNA
  9. retained_intron

Terms 1-8 have now been retired. From now on, any transcripts that previously had these biotypes are referred to simply as lncRNAs. The exception is retained_intron transcripts, which remain unchanged and will continue to be assigned as a biotype in the future. Terms 1-8 will now be stored as ‘legacy’ terms in the files you can download from the FTP site. You can view all of the biotypes that we assign here.
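In practice, the biotype change described above amounts to a simple remapping. Here is a minimal sketch in Python, assuming the eight retired terms are spelled as in the list above (compared case-insensitively, with a plain apostrophe). The exact spellings in Ensembl files may differ, and the function name `collapse_biotype` is hypothetical rather than part of any Ensembl API.

```python
# Retired lncRNA biotype terms (1-8 in the list above), lower-cased.
# Assumption: spellings follow this post; check the actual files before relying on them.
LEGACY_LNCRNA_BIOTYPES = frozenset({
    "non_coding",
    "lincrna",
    "macro_lncrna",
    "antisense",
    "sense_intronic",
    "sense_overlapping",
    "3'_overlapping_ncrna",
    "bidirectional_promoter_lncrna",
})


def collapse_biotype(biotype: str) -> str:
    """Return 'lncRNA' for a retired term; leave any other biotype unchanged."""
    # Normalise whitespace, curly apostrophes and case before the lookup.
    normalised = biotype.strip().replace("\u2019", "'").lower()
    if normalised in LEGACY_LNCRNA_BIOTYPES:
        return "lncRNA"
    return biotype


if __name__ == "__main__":
    for term in ("lincRNA", "Antisense", "retained_intron", "protein_coding"):
        print(term, "->", collapse_biotype(term))
```

Note that retained_intron passes through untouched, matching the exception described above.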

Updates to the human Regulatory Build: New data from Roadmap Epigenomics

We are adding data from Roadmap Epigenomics for these 13 new human Cell/Tissue types:

  1. CD8+ ab T (PB)
  2. foreskin fibroblast_2
  3. foreskin keratinocyte_1
  4. foreskin keratinocyte_2
  5. foreskin melanocyte_1
  6. foreskin melanocyte_2
  7. germinal matrix
  8. mammary myoepithelial
  9. mammary epithelial_2
  10. mononuclear (PB)
  11. neurosphere (C)
  12. neurosphere (GE)
  13. UCSF-4

You can view the full list of Cell/Tissue types available for human here. Alongside this, we have curated our current Cell/Tissue types, condensing some into a single record and treating them as replicates within a unique Cell/Tissue type. Including the 13 new Cell/Tissue types, we now have a total of 118 epigenomes (previously 123). We have also re-run our Regulatory Build pipeline incorporating this newly available data, with the aim of refining and improving our annotations of regulatory features. The new total number of annotated regulatory features is 613,944 (previously 675,965).
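As a quick check on the figures quoted above, the arithmetic can be worked through in a few lines of Python. The variable names are illustrative only, and the "net records condensed" figure is derived here on the assumption that each of the 13 new Cell/Tissue types was added as a separate record; it is not a number stated in the release itself.

```python
# Figures quoted in this post.
epigenomes_96 = 123
epigenomes_97 = 118
new_types = 13
features_96 = 675_965
features_97 = 613_944

# Records that would exist with no merging, minus what actually remains.
# Assumption: each of the 13 new types counts as exactly one record.
net_condensed = epigenomes_96 + new_types - epigenomes_97

# Absolute and relative change in annotated regulatory features.
feature_change = features_97 - features_96
percent_change = 100 * feature_change / features_96

print(f"Net epigenome records condensed: {net_condensed}")
print(f"Change in regulatory features: {feature_change:,} ({percent_change:.1f}%)")
```

Under that assumption, curation accounts for a net reduction of 18 records, and the regulatory feature count falls by roughly 9%.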

Figure 1: Bar chart showing the counts of different types of regulatory features between the previous release (96) and the new release (97) of the Ensembl Regulatory Build.

We are also updating our miRNA target features for both human and mouse with this release; these are being imported from TarBase v8.0. You can find the TarBase track for these data in the Gene tab > Regulation section, the Location > Region in Detail page as well as in the Regulatory feature tabs, by clicking on the Configure this page button.

New species and strains

Livestock

This release sees the first ever hybrid genomes incorporated into Ensembl. First up is the pig cross-breed USMARC. This pig is a cross-bred offspring of a dystrophin-deficient line of pigs, submitted by the USDA ARS. This assembly is available in addition to the existing pig reference. We also have two new cattle hybrid genome assemblies, the results of a project from the University of Adelaide. These two genomes are the result of reciprocal cross-breeding of two Bos species (Bos indicus and Bos taurus) and are fully haplotype-resolved. They complement the pure-bred cow (Bos taurus ARS-UCD1.2) reference genome assembly that we host in Ensembl.

  • Pig cross-breed (Sus scrofa USMARC)
  • Hybrid cattle – Bos indicus (Bos indicus X Bos taurus, paternal haplotype)
  • Hybrid cattle – Bos taurus (Bos indicus X Bos taurus, maternal haplotype)

Other mammals

We’re expanding our representation of marsupials by integrating a new wombat genome produced by the MRC Institute of Genetics and Molecular Medicine at the University of Edinburgh. The wombat will be joining its closest relative, the koala, which is already at home in Ensembl.

Fish

Yes, more fish genomes! Building on the 38 new fish species we added in release 94, we now have an additional four fish species. All the fish that we had in Ensembl before release 97 had undergone three rounds of whole genome duplication (WGD, up to the Teleost WGD). However, the genome of the new Elephant shark (which is not a teleost fish) has only undergone two rounds of WGD, and the salmonid family (of which the Huchen is a part) had its own additional WGD event, meaning their genomes have duplicated four times in their evolutionary timeline! Our other new species, Barramundi perch and Electric eel, are still on three rounds like most other fish, but they obviously have other fascinating attributes!

 

Metazoa

A diverse group of invertebrates has been added to Ensembl Metazoa. The Scrub typhus mite is part of a group called ‘chiggers’, and this species is the host and transmitter of the intracellular bacterial parasite that causes scrub typhus, an important human disease. The velvet mite is actually an arachnid and, aside from being beautiful, the males have a fascinating ritual for attracting mates. Finally, the lancelet is a filter-feeding chordate, and a useful organism for studying the origin and evolution of vertebrates.

Plants

Bryophytes (including mosses, liverworts, and hornworts) represent the bridge between aquatic plants and land plants, and so are crucially important in understanding the evolution of plant life as a whole. Marchantia polymorpha is our first liverwort species, and will complement our other Bryophyte, the moss Physcomitrella patens that we already host at Ensembl Plants.

Protists

This bumper release of Ensembl Protists has a fresh import of genomes from the ENA. There are 68 new genomes across the following groups: Alveolata, Amoebozoa, Choanoflagellida, Cryptophyta, Euglenozoa, Fornicata, Heterolobosea, Parabasalia, Rhizaria and Stramenopiles.

There is a whole range of small species here with big impact. For example, we now have Phytophthora megakarya, a crucial oomycete plant pathogen that causes black pod disease. Found in West and Central Africa, its only known host is the much-loved cocoa tree (Theobroma cacao), and it causes significant yield loss in an industry worth approximately $70 billion annually. This release also includes the addition of an important animal pathogen, Babesia ovata, a species widespread in East Asian countries which causes anemia in cattle.

It’s not all pathogens though! Fortunately, we are increasingly able to harness the powers of microbes to combat disease. For instance, in this release we have Planoprotostelium fungivorum, an exclusively mycophagous amoeba (meaning it needs a diet of fungi) that has been used in studies to target human fungal pathogens.

Other new additions in this release include the unusual psychrophile (cold-loving) Fragilariopsis cylindrus CCMP1102, found in Arctic and Antarctic sea water, with compelling ecological and biotechnological applications. You can view the full list of all species in Ensembl Protists here.

P. megakarya image source: tinyurl.com/y2rclxde

Updated assemblies and annotations

Other updates

  • The Variant Effect Predictor (VEP) will now report if a human GRCh38 transcript identified in your VEP results is in the MANE Select transcript set. By default, VEP returns the clinical significance assertion by variant, but a variant may have multiple alleles with different assertions. A new ‘--clin_sig_allele’ option returns only the assertion for the input alternate allele, and only when a phenotype is listed in ClinVar.
  • Updated Wheat variant data:
  • The Pan-taxonomic Compara set of gene trees has been updated and two new plant species added: the new liverwort Marchantia polymorpha and the model grass Brachypodium distachyon. Three species were removed due to an unforeseen issue: the cyanobacterium Synechocystis sp. 6803, Rhizobium leguminosarum bv. viciae 3841 and Chondrus crispus. These species will be reinstated in Pan-taxonomic Compara in future releases. You can view the full list of species included in the Pan-taxonomic Compara analysis here.
  • A new HMM-driven Compara pipeline has been run with reference protist species.