Update to the Ensembl Canonical transcript set

The upcoming Ensembl release 104 will bring an update to the Ensembl Canonical transcripts. The new Ensembl Canonical definition will prioritise well-supported, biologically representative, highly expressed and highly conserved transcripts. MANE Select will be used as the canonical transcript for human protein-coding genes where available.

Many genes have multiple alternative transcripts (also known as isoforms or splice variants), which reflect the full complexity of expression mechanisms at a given locus. Some analyses, however, require only one representative transcript per locus, e.g. the Ensembl gene tree calculation. To keep such analyses consistent, we designate a single, representative Ensembl Canonical transcript at every locus. This choice has conventionally focused on the longest coding sequence (CDS), or on the overall transcript length for non-protein-coding genes. The longest translation or transcript, however, is not necessarily the most biologically relevant transcript of a gene. To address this, Ensembl release 104 will bring an update to the Ensembl Canonical transcript selection.

The new Ensembl Canonical transcript selection for protein-coding genes will focus on the transcript that, on balance, is the most conserved, is the most highly expressed, has the longest coding sequence and is represented in other key resources, such as NCBI and UniProt. To identify this transcript, we consider, where available, evidence of functional potential (such as evolutionary conservation of a coding region and transcript expression levels), transcript length and evidence from other resources (such as concordance with the APPRIS ‘principal isoform’ and with the UniProt/Swiss-Prot ‘canonical isoform’).

For human transcripts, the algorithmically selected Ensembl Canonical transcript undergoes additional review in a collaboration between EMBL-EBI and NCBI that defines a joint default set of transcripts identical between RefSeq and GENCODE (the MANE project). If approved jointly by manual curators at both groups, the Ensembl Canonical will be designated as the MANE Select transcript at that locus, overriding the computational selection. Where a MANE Select transcript has not yet been determined, the Ensembl Canonical will be presented.

The Ensembl Canonical / MANE Select will be found at the top of the transcript table and will be easily searchable with BioMart’s new ‘Ensembl Canonical’ filter. Importantly, these changes won’t affect the GRCh37 annotation (or annotations on other archived assemblies), which will continue to use the current ‘default selection’, as GRCh37 has been frozen since September 2013.

Updated Ensembl Canonical selection algorithm

The Ensembl Canonical transcript is assigned to the transcript with the highest score, which is a sum of the component scores based on the following:

  1. CDS conservation measured by PhyloCSF
  2. Expression, using two types of data:
    • RNA-seq supported intron data (e.g. Intropolis in human)
    • Cap Analysis of Gene Expression (CAGE) data (e.g. FANTOM5)
  3. Concordance with the APPRIS Principal (P1) CDS isoform
  4. Concordance with the UniProt canonical protein isoform
  5. Length, using two considerations:
    • CDS length
    • Length override: Disqualification of transcripts whose CDS length is 75% or less of the longest CDS at the locus, to avoid conservation bias towards shorter isoforms
  6. Clinical variation (for human): Identification of transcripts covering the largest number of pathogenic variants
  7. Partial transcript status: Disqualification of incomplete transcripts
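
The selection described by the list above can be sketched roughly as follows. This is only an illustration, not the Ensembl pipeline: the class and function names are hypothetical, the component scores are assumed to be already computed and are summed with equal weight (the actual scoring values are not given here), and the longest CDS is assumed to be measured over all transcripts at the locus. Components with no data for a species are simply left out of the sum.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ScoredTranscript:
    """Hypothetical container for one transcript at a locus."""
    stable_id: str
    cds_length: int
    is_partial: bool = False
    # Component scores keyed by name (e.g. "phylocsf", "appris").
    # A missing key means no data were available for that component.
    scores: Dict[str, float] = field(default_factory=dict)


def pick_canonical(
    transcripts: List[ScoredTranscript],
    min_cds_fraction: float = 0.75,
) -> Optional[ScoredTranscript]:
    """Return the highest-scoring eligible transcript, or None if none qualify."""
    if not transcripts:
        return None
    longest_cds = max(t.cds_length for t in transcripts)
    eligible = [
        t for t in transcripts
        # Partial transcript status: incomplete transcripts are disqualified.
        if not t.is_partial
        # Length override: a CDS at 75% or less of the longest is disqualified.
        and t.cds_length > min_cds_fraction * longest_cds
    ]
    if not eligible:
        return None
    # Total score is the sum of whichever component scores are present.
    # Ties go to the first transcript in input order (an assumption).
    return max(eligible, key=lambda t: sum(t.scores.values()))


locus = [
    ScoredTranscript("T1", 900, scores={"phylocsf": 2.0, "appris": 1.0}),
    ScoredTranscript("T2", 1000, scores={"phylocsf": 1.0}),
    ScoredTranscript("T3", 700, scores={"phylocsf": 5.0, "uniprot": 1.0}),
    ScoredTranscript("T4", 950, is_partial=True, scores={"phylocsf": 9.0}),
]
chosen = pick_canonical(locus)
print(chosen.stable_id if chosen else "no eligible transcript")
```

In this made-up locus, T3 and T4 carry the highest raw scores but are excluded by the length override and the partial-transcript rule respectively, so the choice falls between T1 and T2.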

A score is assigned for each component where data are available for a species, and the selection can be made using only a subset of the data. In the absence of the data above (which currently applies to all non-human genomes), transcripts will be prioritised using the current ‘default selection’, which focuses on the longest combined exon length and the transcript biotype, following the hierarchy:

  1. Protein coding
  2. NMD
  3. Non-stop decay
  4. Polymorphic pseudogene
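
As a rough sketch of the ‘default selection’ fallback, the snippet below ranks transcripts by biotype first and then by longest combined exon length. How the two criteria are actually combined is not spelled out above, so this ordering is an assumption, as are the hypothetical names and the biotype strings used to represent the four categories.

```python
from typing import List, NamedTuple, Tuple

# Biotype hierarchy from the list above; the string labels are assumed.
BIOTYPE_PRIORITY = [
    "protein_coding",
    "nonsense_mediated_decay",
    "non_stop_decay",
    "polymorphic_pseudogene",
]


class BasicTranscript(NamedTuple):
    """Hypothetical minimal transcript record."""
    stable_id: str
    biotype: str
    exons: List[Tuple[int, int]]  # (start, end), 1-based inclusive


def combined_exon_length(t: BasicTranscript) -> int:
    """Sum of exon lengths, i.e. the spliced transcript length."""
    return sum(end - start + 1 for start, end in t.exons)


def default_selection(transcripts: List[BasicTranscript]) -> BasicTranscript:
    """Pick by biotype rank, then by longest combined exon length."""
    def sort_key(t: BasicTranscript) -> Tuple[int, int]:
        # Biotypes outside the hierarchy rank after all listed ones.
        if t.biotype in BIOTYPE_PRIORITY:
            rank = BIOTYPE_PRIORITY.index(t.biotype)
        else:
            rank = len(BIOTYPE_PRIORITY)
        # Negate the length so that min() prefers longer transcripts.
        return (rank, -combined_exon_length(t))

    return min(transcripts, key=sort_key)


candidates = [
    BasicTranscript("A", "nonsense_mediated_decay", [(1, 1000)]),
    BasicTranscript("B", "protein_coding", [(1, 100), (201, 300)]),
    BasicTranscript("C", "protein_coding", [(1, 150), (201, 350)]),
    BasicTranscript("D", "lncRNA", [(1, 5000)]),
]
print(default_selection(candidates).stable_id)
```

Here the two protein-coding transcripts outrank the longer NMD and lncRNA transcripts, and the one with the greater combined exon length is chosen between them.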

The Ensembl Canonical for non-protein-coding genes (such as lncRNAs) will be the transcript with the longest genomic span at the locus. Most of the data included in this pipeline are only suited to protein-coding genes, but we hope that a greater understanding of the functionally important regions of lncRNAs will, in the future, enable us to extend the pipeline with data types that support their identification. You can find the full documentation here.
