Since its inception, Ensembl Bacteria has imported user-submitted annotations from the International Nucleotide Sequence Database Collaboration (INSDC) for prokaryotic genomes. In-house gene annotation produced by Ensembl remained largely within the scope of vertebrate genomes, with the odd metazoan or microbial genome annotated as a test. Recently, however, Ensembl Bacteria has been able to establish a robust and scalable pipeline to produce consistent annotation for prokaryotes. We believe that a consistent set of annotations will go a long way towards enabling better comparisons between genomes and the computation of pangenomes. Furthermore, with increasing volumes of prokaryotic assemblies being submitted to INSDC without accompanying gene annotation, we believe a robust approach will allow Ensembl Bacteria to make a meaningful contribution to this space.
A common annotation pipeline has been developed to annotate both isolate genomes and metagenome-assembled genomes (commonly referred to as MAGs) in bacteria, through a collaboration between the Microbiome Informatics and Ensembl microbial groups at EMBL-EBI. This pipeline comprises Prokka for gene calling, followed by cmscan, InterProScan and EggNOG tools to bolster the functional annotation, and Codetta to screen for alternative genetic codes in a scalable manner. This approach will also make it easier to extend the annotation framework to features such as operons, pathways and biosynthetic gene clusters in future releases.
We have deployed this annotation framework on all 31,332 genomes hosted in Ensembl Bacteria and are making a phased transition to the new annotation. In Ensembl release 110 / Ensembl Genomes release 57, we have replaced the annotations of all but 115 of our genomes. These 115 are key species whose annotation has long been used in pan-taxonomic comparative analysis in Ensembl, and they will remain unchanged for the next few releases.
We have also taken this opportunity to implement a systematic naming scheme for the genes in Ensembl Bacteria based on rules provided by the Global Alliance for Genomics & Health (GA4GH). We take the following five facts about each gene and hash them together using the SHA-512 algorithm. We then use the first 15 characters of this checksum, prefixed with “ENSB:”, as the gene identifier. The greatest benefit of such a system is that identical genes can be identified unambiguously and referred to by the same identifier, even when alternative gene prediction tools are used.
NCBI taxon identifier of species
GA4GH checksum (sha512t24u) of the DNA sequence the CDS is on (SHA-512 digest truncated to 24 bytes)
Start of CDS
End of CDS
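The identifier scheme described above can be sketched in a few lines of Python. This is an illustration only, not the Ensembl implementation: the function names, the ":" separator used to join the facts, and the choice of a hexadecimal encoding for the final checksum are all assumptions, since the exact serialisation is not specified here. The sha512t24u helper follows the published GA4GH definition (SHA-512, first 24 bytes, base64url-encoded).

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """GA4GH-style digest: SHA-512, keep the first 24 bytes, base64url-encode."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def gene_identifier(facts, prefix="ENSB:", length=15):
    """Build a deterministic gene identifier from a sequence of facts.

    ASSUMPTIONS (not confirmed by the text): the facts are joined with ":"
    and the SHA-512 checksum is rendered as hexadecimal before truncation.
    """
    payload = ":".join(str(fact) for fact in facts).encode("utf-8")
    checksum = hashlib.sha512(payload).hexdigest()
    return prefix + checksum[:length]


# Toy example: a made-up sequence and coordinates, with NCBI taxon 562.
toy_sequence = b"ATGGCTAAAGGTTAA"
facts = [562, sha512t24u(toy_sequence), 1, 15]
print(gene_identifier(facts))
```

Because the identifier depends only on the facts themselves, two annotation runs that predict the same CDS on the same sequence produce the same identifier, whichever gene prediction tool was used.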
Authors: Robert Finn, Andy Yates and Nishadi De Silva
Editors: Benjamin Moore, Louisse Paola Mirabueno