It’s probably reasonable to assume that the coding sequence (CDS) of a protein-coding transcript model is the feature of primary interest to most people who use Ensembl. However, both the 5’ and 3’ untranslated regions (UTRs) are important biological entities in their own right, and it is vital that we in Ensembl do the best we can to represent them accurately. Annotating UTRs is complicated, so in this article we’re going to focus on the annotation process for 3’ UTRs (Figure 1).
How we annotate 3’ UTRs
Ensembl gene annotation comes in two flavours: automatic annotation and manual annotation. Automatic annotation is carried out on all of the genomes in Ensembl using a complex computational pipeline. For a select few species – predominantly human and mouse – we also have a dedicated team of annotators, the HAVANA team, who describe genes based on the manual review of evidence.
Virtually all Ensembl transcript models are based on transcribed RNA evidence, so the end coordinate of a given model can be defined as the base at which the alignment between the genomic DNA and the RNA evidence ends. If the transcript model has a CDS and a STOP codon, the 3’ UTR is the span of sequence from the STOP codon to the end of the model. For our computationally annotated genomes, the description of 3’ UTRs is essentially that simple. For those genomes that undergo manual annotation, however, the process is a little more complicated.
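To make the simple case concrete, here is a minimal sketch of how a 3’ UTR could be derived from a transcript model's exons and the position of its STOP codon. It is an illustration only, not the Ensembl pipeline: the function names, the 1-based inclusive coordinates, and the choice to start the UTR at the first base after the STOP codon are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical representation: 1-based inclusive genomic (start, end), start <= end.
Exon = Tuple[int, int]


def three_prime_utr(exons: List[Exon], cds_end: int, strand: int) -> List[Exon]:
    """Return the exonic segments that lie downstream of the CDS.

    `cds_end` is assumed to be the genomic position of the last base of the
    STOP codon. `strand` is 1 (forward) or -1 (reverse).
    """
    segments = []
    for start, end in sorted(exons):
        if strand == 1:
            # Forward strand: downstream means larger coordinates.
            if end > cds_end:
                segments.append((max(start, cds_end + 1), end))
        else:
            # Reverse strand: downstream means smaller coordinates.
            if start < cds_end:
                segments.append((start, min(end, cds_end - 1)))
    return segments


def utr_length(segments: List[Exon]) -> int:
    """Total number of bases covered by the UTR segments."""
    return sum(end - start + 1 for start, end in segments)


if __name__ == "__main__":
    # Made-up coordinates, purely for illustration.
    toy_exons = [(100, 200), (300, 400), (500, 900)]
    utr = three_prime_utr(toy_exons, cds_end=350, strand=1)
    print(utr, utr_length(utr))
```

The end of the last segment is simply the end of the model, which is why the quality of the RNA evidence determines how much of the real 3’ UTR is captured.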
For manually annotated genomes, we try to define the 3’ UTR of our genes to its maximum extent, which may not be apparent from a single piece of RNA evidence. One reason for this is that mRNA or cDNA sequences found in silico are rarely full-length, i.e. they may not extend to the true 3’ end (or 5’ end) of the mRNA molecule that was captured in the cell. This is illustrated by human ARHGAP11A (Figure 2).
Happily, it is easy to tell if an RNA sequence is 3’ complete based on whether or not it contains the polyadenylation (polyA) tail that is found at the end of virtually all mature mRNAs. These tails have long been used to manually describe both polyA sites and polyA signals in our human and mouse gene sets.
You can visualise the evidence used to annotate transcript models from the ‘Supporting evidence’ page in the transcript tab in Ensembl. For example, take a look at the evidence used to annotate ARHGAP11A-201, shown in Figure 2.
New sequencing methods, new annotation evidence
Traditionally, polyA tails were identified in cDNA and expressed sequence tag (EST) sequences, while in more recent years a variety of RNAseq protocols have been developed to specifically target the 3’ ends of transcripts, generically referred to as 3P-seq. Ultimately, by manually combining ‘old-school’ cDNA and EST transcript evidence with 3P-seq (e.g. from the PolyAsite repository) – and often using short-read coverage graphs to ‘fill in’ the gaps – we can now be far more confident that our genes are being annotated to their full 3’ length. Indeed, it is now apparent that 3’ UTRs can be extremely long; that of mouse Grin2b (Figure 3) extends nearly 19 kb, a size that could not have been captured by traditional cDNA sequencing protocols.
Most protein-coding genes have multiple polyA sites
3P-seq datasets are enormous, and demonstrate that transcriptomes contain a staggering amount of polyA complexity. While localised variation in the exact location of a polyA tail is typical, resulting in short clusters of polyA sites, most genes also utilise distinct clusters separated by hundreds or even thousands of base-pairs (Figure 2). We don’t have time to consider the biological implications of this variability right now, although note that differential polyA usage has been implicated in phenomena such as transcript localisation and stability (see Giammartino, Nishida, and Manley, 2011 for further reading).
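The distinction between localised variation and distinct clusters can be sketched with a simple gap-based grouping of polyA site positions. This is a toy illustration, not how PolyAsite or Ensembl define clusters: the function name `cluster_sites` and the 25-base `max_gap` threshold are assumptions chosen for the example.

```python
from typing import Iterable, List


def cluster_sites(positions: Iterable[int], max_gap: int = 25) -> List[List[int]]:
    """Group polyA site positions into clusters.

    A site joins the current cluster if it lies within `max_gap` bases of the
    previous site; otherwise it starts a new cluster. The threshold is a
    hypothetical value, not one taken from any published catalogue.
    """
    clusters: List[List[int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


if __name__ == "__main__":
    # Made-up positions: two tight clusters and one isolated distal site.
    for cluster in cluster_sites([1000, 1004, 1010, 1850, 1853, 5200]):
        print(cluster[0], cluster[-1], len(cluster))
```

With these made-up positions, the sites a few bases apart collapse into short clusters, while the sites hundreds or thousands of bases away form clusters of their own.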
Instead, let’s consider the complications that polyA variability has for Ensembl. As noted, manual gene annotation involves trying to establish the maximum extent of the 3’ UTR, i.e. the most downstream polyA site of that locus. However, we now know that a given gene will almost certainly produce a variety of distinct transcripts utilising different 3’ polyA sites. Ultimately, it is not practical – or even possible – to represent all the transcripts that a gene produces based on the combination of alternative splicing and alternative polyA usage. This is partly because we would have to create far more transcript models for a given gene, and also because the general lack of full-length RNA evidence makes it unclear which polyA sites are used by which specific transcripts.
Our solution for manually annotated genes such as ARHGAP11A is to extend the 3’ UTR to its full extent based on combined evidence only for a single chosen transcript model (Figure 1, transcript model 1); the 3’ ends of the other transcripts annotated within that gene are then defined solely by the RNA evidence that matches their intron-exon structure (Figure 1, transcript models 2-6). PolyA sites and polyA signals are then annotated as features of the genome, and are not directly associated with a given transcript model in our databases.
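The logic of that approach can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the HAVANA workflow: the `Transcript` class and the `most_downstream` and `extend_three_prime_end` functions are hypothetical, and the real decision draws on manual review of combined evidence rather than a single list of positions.

```python
from dataclasses import dataclass, replace
from typing import Iterable


@dataclass(frozen=True)
class Transcript:
    """Hypothetical minimal transcript model (1-based inclusive coordinates)."""
    name: str
    start: int
    end: int
    strand: int  # 1 for forward, -1 for reverse


def most_downstream(sites: Iterable[int], strand: int) -> int:
    """Pick the polyA site furthest along the direction of transcription."""
    sites = list(sites)
    return max(sites) if strand == 1 else min(sites)


def extend_three_prime_end(transcript: Transcript, polya_sites: Iterable[int]) -> Transcript:
    """Return a copy of the chosen model extended to the most downstream site.

    The model is never shortened: if its current 3' end already lies beyond
    every site, it is returned with unchanged coordinates.
    """
    site = most_downstream(polya_sites, transcript.strand)
    if transcript.strand == 1:
        return replace(transcript, end=max(transcript.end, site))
    return replace(transcript, start=min(transcript.start, site))


if __name__ == "__main__":
    # Made-up coordinates: only the chosen model is extended; the other
    # models would keep the 3' ends given by their own RNA evidence.
    chosen = Transcript("model_1", 1000, 5000, 1)
    print(extend_three_prime_end(chosen, [4800, 5600, 7200]))
```

Note that the polyA sites are passed in separately from the transcript, mirroring the fact that they are stored as features of the genome and not as properties of any one model.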
Until now, polyA evidence has only been used in manual 3’ UTR annotation, and only manual 3’ UTR annotation leads to the annotation of polyA sites and signals. We therefore do not computationally incorporate the polyA sites found in catalogues such as the PolyAsite repository into our automatically annotated genesets. However, to repeat: these 3P-seq catalogues are enormous, and they (or whatever libraries come next) will inevitably get larger in the future. They will also start to contain RNA from more tissue and cell types, and it will be interesting to understand the spatial and temporal expression profiles of differential polyA usage. In the meantime, our major focus is to find and improve human and mouse protein-coding genes that would benefit from a little 3’ UTR TLC. We’re going to be busy!