Human, mouse, rat and zebrafish genes without an official gene symbol have conventionally been named after BAC clones in Ensembl. We plan to remove these names in Ensembl release 104 and use the Ensembl gene stable IDs instead, in line with the practice already adopted for all other vertebrate species.
Thanks to the availability of deeper data, better computational methods and, for some species, intensive manual review and intervention, gene annotation is on an upward path towards the best possible representation of genes and transcripts. One happy consequence of this is the identification of genes that were not previously known, either in the collections of annotated genes (or gene sets) produced by Ensembl or in other reference annotation databases such as RefSeq, UniProt, MGI, RGD or ZFIN.
For some key species (human, mouse, rat and zebrafish), the gene set combines computational annotation produced by the Ensembl-genebuild team with manual gene annotation generated by the Ensembl-HAVANA team. The two are merged to produce the final gene set you see in the Ensembl browser or on the FTP site. These species were among the first vertebrates to have high-quality genomes produced and genes annotated. The original genomic sequences were obtained by sequencing blocks of genomic DNA cloned into vectors called bacterial artificial chromosomes (BACs); the resulting clone sequences were then assembled into complete chromosomes.
When the initial gene annotation was being carried out, known human genes with gene symbols approved by the HUGO Gene Nomenclature Committee (HGNC) were assigned the appropriate official symbol. In species other than human, the respective nomenclature committees played an equivalent role. However, newly discovered genes didn’t have official names. They were assigned unique symbols built from the BAC clone on which they were identified (e.g. AC000032) and a number distinguishing the novel genes on that clone (e.g. AC000032.1 and AC000032.2), to provide systematic labelling and a human-readable identifier. The same is true for new genes today.
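The naming scheme described above can be illustrated with a minimal Python sketch. It only mirrors the shape of the examples given here (a clone accession followed by a dot and a counter starting at 1); the function name and parameters are hypothetical and this is not how Ensembl's annotation pipeline is actually implemented.

```python
from typing import List


def clone_based_symbols(clone_accession: str, novel_gene_count: int) -> List[str]:
    """Build one placeholder symbol per novel gene found on a clone.

    Hypothetical helper: each symbol is the clone accession plus a
    running number, starting at 1, as in AC000032.1 and AC000032.2.
    """
    if novel_gene_count < 0:
        raise ValueError("novel_gene_count must be non-negative")
    return [f"{clone_accession}.{n}" for n in range(1, novel_gene_count + 1)]


# Two novel genes identified on clone AC000032.
print(clone_based_symbols("AC000032", 2))
```

The symbol encodes where a gene was found, and nothing about what it does.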
These identifiers, however, have outlived their usefulness. The name/symbol gives no information regarding the function of the locus and is drifting towards irrelevance at a time when all new assemblies are produced using sequencing methods that avoid the use of BAC clones completely.
Given the above and to avoid the confusion that parallel nomenclature systems can generate, from Ensembl 104 onward, we plan to remove this additional nomenclature system. Genes with clone-based identifiers will have them replaced by the Ensembl gene stable IDs (ENSGxx) and new genes without an official gene name will display the gene stable ID from their creation. The adoption of this convention will bring the species with merged manual-automated gene annotation in line with all other vertebrate species.
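For anyone who wants to anticipate the change in their own scripts, the rule can be sketched in Python as follows. This is an illustration under stated assumptions, not Ensembl code: the regular expression is inferred only from the shape of the example AC000032.1 (two letters, six digits, a dot, a counter) and may not cover every clone-based name in the gene sets; the function name is hypothetical; and ENSG00000000001 is a made-up placeholder ID.

```python
import re
from typing import Optional

# Assumed shape of a clone-based name, inferred from "AC000032.1" only:
# two uppercase letters, six digits, a dot, then a counter.
CLONE_BASED = re.compile(r"[A-Z]{2}\d{6}\.\d+")


def display_label(stable_id: str, symbol: Optional[str]) -> str:
    """Return the label to show for a gene (hypothetical helper).

    Genes with no symbol, or with a clone-based symbol, fall back to
    the gene stable ID; official symbols are kept unchanged.
    """
    if symbol is None or CLONE_BASED.fullmatch(symbol):
        return stable_id
    return symbol


# Placeholder stable ID paired with a clone-based name: the ID is shown.
print(display_label("ENSG00000000001", "AC000032.1"))
# A gene with an official symbol keeps it.
print(display_label("ENSG00000139618", "BRCA2"))
```

Matching on the whole string with fullmatch avoids misclassifying official symbols that merely contain a similar run of characters.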
The Ensembl gene IDs will provide a stable symbol until official symbols are created by the official nomenclature bodies and will remain available as stable IDs for the life of the annotation.
This blog post has been written by Adam Frankish and edited by Michal Szpak.