Ensembl insights: Annotating readthrough transcription in Ensembl

Ever come across a transcript that seems to span multiple genes? These are called ‘readthrough transcripts’, or sometimes ‘conjoined genes’, and they’re more common than you might think. Read on to find out about what they are and what they do, and how we annotate these at Ensembl.

This article was written by Jonathan Mudge of the HAVANA and GENCODE projects.

What are readthrough transcripts; are they even real?

This blog is dedicated to readthrough transcription, a weird and perhaps even controversial phenomenon that can present a significant challenge to gene annotation projects such as Ensembl. We should probably start with a definition. Readthrough transcripts are RNA molecules that are formed via the splicing of exons from more than one distinct gene (Figure 1). The genes involved are found in the same chromosomal region and on the same strand, typically adjacent to one another. While most readthrough events involve two genes – which may be protein-coding genes, lncRNAs or transcribed pseudogenes – we also see cases where transcription utilises exons from three or more genes. Readthrough transcription is apparently found in all vertebrate genomes, and may even be common to multicellular organisms. Here, we’ll stick to human and mouse. We should quickly distinguish readthrough transcripts from those aberrant RNA molecules that can be generated from ‘chimeric’ gene structures or ‘fusions’ (typically in cancer cells), i.e. when genomic rearrangements disrupt normal gene structures and place exons into novel contexts. Also, we do not consider the bespoke splicing of immunoglobulin segments in blood cells to represent readthrough events.

Figure 1: Readthrough transcription between two protein-coding genes. In this hypothetical example, genes 1 and 2 are ‘known’ protein-coding loci found adjacent to one another on the same chromosome (Coding sequence (CDS) outline in green; non-coding sequence in red fill). Experimental data provides additional support for the existence of transcription start sites (‘TSS’) and polyadenylation sites (‘pA’) for both genes, providing additional confidence that they are indeed separate loci. However, further transcript data provides evidence for readthrough transcription, i.e. for the existence of two RNA molecules that utilise exonic sequence from both protein-coding genes. The top model has a putative CDS that incorporates the reading frame of both genes. In contrast, the lower model incorporates additional exonic sequence that disrupts the reading frame, and the model is predicted to undergo nonsense mediated decay (NMD, CDS outline in purple). For annotation purposes, both readthrough transcripts are grouped together as part of the same gene, which is a separate third gene to the two known loci.

Unlike the fusions seen in cancer cells, readthrough transcription is not confined to abnormal contexts: RNA-seq data make it clear that it occurs in normal cells across the whole range of tissues. It is essentially ubiquitous, and in our experience it is rare to find adjacent genes that do not show at least some evidence of readthrough. True, these splicing events tend to have significantly lower expression levels than the ‘canonical’ RNAs from either gene, generally (and anecdotally) at least 100 times lower. That said, a portion of readthrough events have high expression, and there are certainly thousands of human and mouse readthrough cDNAs out there; we know this because we’ve already annotated them.
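To make the definition concrete, here is a minimal Python sketch that flags a transcript model as a candidate readthrough when its exons overlap exons of two or more genes on the same chromosome and strand. It is an illustration only, not a description of how annotation is done at Ensembl: the `Model` type, the function names and the toy coordinates are all hypothetical, and a real candidate would still need to be checked by eye (for example, to rule out a single gene that has been wrongly split in two).

```python
from typing import NamedTuple


class Model(NamedTuple):
    """A toy gene or transcript model (hypothetical, for illustration only)."""
    chrom: str
    strand: str
    exons: tuple  # tuple of (start, end) pairs, 1-based inclusive


def overlaps(a, b):
    """Return True if two closed intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def genes_sharing_exons(transcript, genes):
    """List the IDs of genes whose exons overlap any exon of the transcript.

    Only genes on the same chromosome and strand as the transcript count.
    """
    hits = []
    for gene_id, gene in genes.items():
        if (gene.chrom, gene.strand) != (transcript.chrom, transcript.strand):
            continue
        if any(overlaps(e, g) for e in transcript.exons for g in gene.exons):
            hits.append(gene_id)
    return sorted(hits)


def is_candidate_readthrough(transcript, genes):
    """A transcript using exons from two or more genes is a candidate."""
    return len(genes_sharing_exons(transcript, genes)) >= 2


# Toy data: two adjacent plus-strand genes and one minus-strand gene.
genes = {
    "gene1": Model("chr1", "+", ((100, 200), (300, 400))),
    "gene2": Model("chr1", "+", ((1000, 1100), (1200, 1300))),
    "gene3": Model("chr1", "-", ((150, 250),)),
}
candidate = Model("chr1", "+", ((100, 200), (300, 400), (1200, 1300)))

print(genes_sharing_exons(candidate, genes))  # ['gene1', 'gene2']
```

The minus-strand `gene3` is ignored even though its exon overlaps the candidate's coordinates, which mirrors the same-strand requirement in the definition.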

What do readthrough transcripts do?

So let’s get to the most obvious question: what is the biological relevance of readthrough transcription? Theoretically, we might suppose that this process plays a similar role to alternative splicing, i.e. it functions to increase the size and diversity of the transcriptome, leading to an overall increase in organismal complexity. It could also be that splicing between protein-coding genes acts to generate novel protein isoforms. Here’s the thing though: there is very little evidence in the literature for the biological function of a readthrough transcript. Instead, in our view it is entirely plausible that readthrough transcription doesn’t have a general biological role; it may simply be ‘noise’. It is well established that the processes of transcription and splicing are not completely efficient, and it could be that splicing between genes occurs when a splice donor site in the upstream locus ‘misses’ its canonical acceptor site, and instead finds an acceptor site in the downstream locus.

So if readthrough transcription is potentially just noise, why bother annotating it? In the early days of our human and mouse annotation projects, a core remit of our work was to annotate all cDNAs, regardless of their potential functionality. It seemed important to account for all of these transcripts, and it is still common for users to point out to us that a certain cDNA has been missed. Today, as noted above, readthrough splicing events are ubiquitous in RNA-seq datasets, and the presence of readthrough models in the geneset therefore facilitates the mapping of these introns. Moreover, whatever suspicions we may have about the biological relevance of this process, it makes sense for our project to remain essentially agnostic when considering its functionality. It may be that one day someone will demonstrate that readthrough transcription is more interesting than we thought, at least for certain genes. It will be harder for people to study this phenomenon if we don’t annotate it in the first place.

How do we annotate readthrough transcripts?

OK, so let’s talk about annotation guidelines. Firstly, note that we currently only highlight readthrough models in our human and mouse genesets (i.e. those produced by Ensembl/GENCODE). That is not to say readthrough transcripts are not found in other genesets; they will be there, but they won’t have the ‘readthrough_transcript’ attribute in the download files. Also, readthrough annotation in human and mouse is at present only performed manually. As noted, we traditionally annotated readthrough based on cDNAs and EST libraries, and now also describe models found in long-read datasets, i.e. from the PacBio and Nanopore sequencing platforms. However, these newer datasets are enormous, and ever increasing in number. Exactly how far we push the annotation of readthrough transcription based on PacBio / Nanopore data remains an open question. Note that we do not base readthrough models on short-read RNA-seq data alone.

The first key decision to make when manually assessing a potential readthrough scenario is to establish whether we are in fact dealing with a single gene that has been incorrectly separated into multiple segments. For splicing between protein-coding genes at least (Figure 1), we typically have strong evidence to suggest where each gene starts and ends. Most obviously, the structure of a coding gene is largely defined by its coding sequence (CDS), and in practice it’s typically easy to classify a readthrough event where transcription is seen to splice exons from what are normally ‘known’ independent translations. Furthermore, most protein-coding genes can be ‘anchored’ at their endpoints by transcriptomics data – especially using the fruits of 5′ and 3′ sequencing protocols such as CAGE or polyAseq (see our previous blog) – and, in combination with the generally lower expression of readthrough transcripts, the genes usually present themselves as independent loci.

What about non-coding transcripts?

Long non-coding RNAs (lncRNAs) can be more problematic: these genes tend to have lower expression – making it less obvious where one gene ends and another starts – and of course we don’t have CDS to guide us (Figure 2). Also, the actual functional relevance of the vast majority of lncRNAs has yet to be established. For these reasons, our tendency now is to merge or ‘collapse’ readthrough events between lncRNA genes into a single locus (this is an ongoing process), unless this seems like the wrong thing to do in a certain case (for example, if we are considering a readthrough event between two lncRNA genes that are well established as known, independent loci).

Figure 2: Potential readthrough transcription between two lncRNA genes. In this hypothetical example, genes 1 and 2 are existing lncRNA genes. Additional transcript data provides evidence for a novel model that incorporates exons from both genes, i.e. a potential readthrough transcript. However, in contrast to the protein-coding example in Figure 1, it is not obvious in this scenario that genes 1 and 2 are truly independent genes. In the absence of data that provides a strong argument for keeping these genes separate – and therefore for annotating the readthrough transcript as a third gene – each of these models will instead be merged into a single gene.

We have also annotated a smaller number of readthrough events between ‘mixed’ gene biotypes, e.g. that utilise exons from any combination of a protein-coding gene, a lncRNA and a pseudogene. Essentially, the same rules apply here; for example, if the potential readthrough event connects a protein-coding gene and a lncRNA, then we will first wish to establish whether the latter is mis-annotated and should actually be considered as part of the former.

In conclusion

This brings us to the major annotation take-home: if the genes joined together by an apparent readthrough event are indeed, on inspection, considered to be distinct genes, then the readthrough model is almost always annotated as a distinct third locus, i.e. with its own Ensembl gene ID. Furthermore, if multiple readthrough transcripts are identified, these will typically be grouped as a single readthrough gene. These transcripts may be annotated as protein-coding or non-coding, which emphasises a very important point: our annotation of readthrough transcription commonly produces additional protein-coding genes, even where these CDS only consist of translated regions found in the overlapping genes. You should also be aware that a readthrough gene may contain only transcript models with CDS predicted to trigger the nonsense-mediated decay pathway; this is in contrast to ‘normal’ protein-coding genes. A final point: readthrough events are incorporated in the gene counts presented on the Ensembl and GENCODE websites, and they are included by default in the ‘Comprehensive’ (i.e. all gene annotations) genesets for both species, although they can be filtered out of the model collections based on the ‘readthrough_transcript’ tag.
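If you do want to filter on the tag, the idea can be sketched in a few lines of Python. This is a rough illustration rather than an official tool. It assumes a GTF-style download in which tags appear in the ninth column as `tag "readthrough_transcript";`, and the function names and toy records are made up for this example. Only lines that carry the tag themselves are dropped, so check how gene-level records are tagged in the release you are using.

```python
import re

# Matches every `tag "..."` entry in a GTF attribute column.
TAG_PATTERN = re.compile(r'tag "([^"]+)"')


def has_tag(gtf_line, tag="readthrough_transcript"):
    """Return True if a GTF record carries the given tag.

    Comment lines and lines without nine tab-separated columns
    are treated as untagged.
    """
    if gtf_line.startswith("#"):
        return False
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return False
    return tag in TAG_PATTERN.findall(fields[8])


def drop_readthrough(lines):
    """Keep every line that is not tagged as a readthrough transcript."""
    return [line for line in lines if not has_tag(line)]


# Toy records with invented IDs and coordinates (not real annotation).
header = "##description: toy example"
normal = ('chr17\tHAVANA\ttranscript\t100\t400\t.\t+\t.\t'
          'gene_id "G1"; transcript_id "T1"; tag "basic";')
readthrough = ('chr17\tHAVANA\ttranscript\t100\t900\t.\t+\t.\t'
               'gene_id "G3"; transcript_id "T3"; tag "readthrough_transcript";')

kept = drop_readthrough([header, normal, readthrough])
print(len(kept))  # 2
```

Matching on the parsed tag values, rather than searching the whole line for the string, avoids false hits if the same text ever appears in another attribute.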

So there you go. While we have hopefully been able to shed some light on the annotation of readthrough transcription in Ensembl/GENCODE, it seems we’ll all have to wait a little longer before we get definitive answers to the Big Questions, like ‘How?’ and ‘Why?’. In the meantime, it remains unclear exactly what the result of our annotation efforts will be: are we describing a fascinating set of transcripts that will one day foster a new understanding of RNA biology? Or are we just flagging these transcripts to make it easier for you guys to ignore them? The jury is out.

Why not check out the NME1-NME2 gene to see how readthrough transcripts are displayed in the Ensembl genome browser?

Correction edit: It was previously stated that all readthrough transcripts were filtered out of the GENCODE basic set; this is not the case. The statement was edited to describe how to filter readthrough transcripts out of the gene sets.
