Rebuilding the zebrafish gene set

The zebrafish (Danio rerio) is close to an ideal model for understanding vertebrate development, as it combines some of the best attributes of several other model organisms. More importantly, there is extensive similarity between the zebrafish and human genomes. Many human developmental and disease genes therefore have counterparts in zebrafish, and linking such genes is key to elucidating human gene function. Consequently, an array of zebrafish models of human diseases has been produced for testing candidate drugs. The Wellcome Trust Sanger Institute recently released a video on the usefulness of zebrafish in research:

Until recently, we at Ensembl relied mostly on protein and cDNA alignments to produce transcript models. However, the number of known zebrafish proteins and cDNAs is relatively small. With the advent of RNASeq, many more splice variants can be identified and used in the gene building process. Not only do these data provide proof of transcription, they also give us information on splice sites, UTRs and tissue expression.

The RNASeq pipeline
The zebrafish gene set was among the first to be enhanced by the incorporation of RNASeq data. For those of you with some knowledge of model organisms, starting a new genebuild process using zebrafish as the test case may seem counter-intuitive when far simpler eukaryotic genomes are available, such as those of D. melanogaster and C. elegans. Ironically, the complexity of the zebrafish genome, and the predicted difficulty in annotating it, was the main reason for its selection as the new pipeline’s first test subject. At the time, the general consensus throughout the Ensembl genebuild team was that annotating a complex genome was the best way to prepare for future annotations of mammalian genomes.

The following video describes how RNASeq data is used in our gene sets:

Unsurprisingly, the zebrafish RNASeq annotation process did prove to be a difficult endeavour. Some particularly long genes, such as those encoding Nebulin and Titin, were problematic because of the very high number of possible combinations of exons and introns, so their models had to be simplified. Perhaps most frustratingly, however, the zebrafish assembly was updated from version 8 to version 9 just as the genebuild team were finishing. This meant that, after running a full analysis, they had to rerun the entire pipeline on the new assembly. Fortunately, the hard work paid off in the end. The same pipeline that was developed for zebrafish can now be used for all other eukaryotes. The anole lizard gene set, for example, has already been updated using the RNASeq pipeline, and rabbit will soon follow.
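To give a feel for why very long genes are hard to model, here is a minimal sketch in Python. It assumes, purely for illustration, that each optional exon in a gene can be independently kept or skipped; real splicing is far more constrained, and this is not how the Ensembl pipeline counts or simplifies models. The function name and the exon counts are hypothetical.

```python
def count_exon_combinations(n_optional_exons: int) -> int:
    """Count exon chains when every optional exon is either kept or skipped.

    Illustrative assumption: each optional exon is an independent yes/no
    choice, so the number of combinations doubles with every extra exon.
    """
    if n_optional_exons < 0:
        raise ValueError("n_optional_exons must be non-negative")
    return 2 ** n_optional_exons


# Hypothetical exon counts, chosen only to show how quickly the number grows.
for n in (5, 20, 100):
    print(f"{n} optional exons -> {count_exon_combinations(n)} combinations")
```

Even under this toy assumption the count doubles with each additional exon, which is why enumerating every candidate model for a gene with hundreds of exons is impractical.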

The new zebrafish gene models
The zebrafish RNASeq pipeline took data from 5 tissues and 7 developmental stages and assembled them into 25,748 gene models. After careful filtering, these models were incorporated into the Ensembl genebuild process. This was followed by a merge with the manually curated VEGA gene models to produce a final set of 26,152 genes, represented by 51,569 transcripts.

The different sets of gene models each contribute something different to the final product, and achieving the best possible result means striking a balance between including correct models and excluding incorrect ones. In this particular case, the RNASeq gene models were used to adjust intron and exon boundaries, confirm expression and improve the accuracy of the 3′ UTRs. If you would like to know more, the entire genebuild process for the zebrafish is summarised here, and you can read the published paper here.
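As a rough illustration of what it means for RNASeq data to confirm intron boundaries, the sketch below checks which introns of a gene model are matched exactly by spliced-read junctions. It is a toy example under stated assumptions, not the Ensembl genebuild code: the `Intron` type, the `split_by_support` function, the `min_reads` threshold and all coordinates are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Intron(NamedTuple):
    """An intron as a hashable record (hypothetical layout, 1-based inclusive)."""
    chrom: str
    start: int
    end: int
    strand: str


def split_by_support(
    model_introns: Iterable[Intron],
    junction_counts: Dict[Intron, int],
    min_reads: int = 1,
) -> Tuple[List[Intron], List[Intron]]:
    """Split model introns into those with and without exact junction support."""
    supported: List[Intron] = []
    unsupported: List[Intron] = []
    for intron in model_introns:
        # An intron counts as supported only if spliced reads span exactly
        # the same coordinates at least `min_reads` times.
        if junction_counts.get(intron, 0) >= min_reads:
            supported.append(intron)
        else:
            unsupported.append(intron)
    return supported, unsupported


# Toy data: the second model intron ends at 2099, but the reads splice at 2105.
model = [Intron("1", 1200, 1499, "+"), Intron("1", 1650, 2099, "+")]
junctions = {
    Intron("1", 1200, 1499, "+"): 42,
    Intron("1", 1650, 2105, "+"): 17,
}

supported, unsupported = split_by_support(model, junctions)
print("supported:", supported)
print("needs review:", unsupported)
```

In this toy case the unsupported intron is the kind of boundary that the RNASeq evidence would suggest adjusting.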

Viewing the data on our website
You can view the RNASeq information for a particular species from its Ensembl homepage:

  • First, search for a gene on the homepage and click on it.
  • Then click “Configure this page”, on the left of the page.
  • Click “RNASeq models”, also on the left of the page.
  • Click “Enable/disable all RNASeq models”.

New tracks will now appear. These tracks can then be repositioned to facilitate simple comparisons. The video below gives a visual demonstration, as well as a more detailed description, of the steps involved.

In future Ensembl releases, more annotated species will be updated with such information. As new species are introduced, their genebuilds will also incorporate RNASeq data wherever possible. The final outcome will be more accurate UTR and splice site annotations, as well as a clearer picture of gene expression patterns.