For our latest release (e87) we’ve produced annotations from some new embryonic zebrafish RNA-seq data using the Ensembl genebuild RNA-seq pipeline. The collection of new data we’re providing consists of gene sets and alignments for 18 separate embryonic developmental stages, from the single-celled zygote right up until 120 hours post fertilisation. As usual, these features can be viewed in our browser as separate tracks, or they can be downloaded from our FTP site.

The RNA-seq data we used were produced by the Vertebrate Genetics and Genomics Group at the Sanger Institute. The team collected 96 embryos from each of the 18 stages, examining their morphology so as to ensure every single embryo was at the correct phase of development. Such an undertaking, although extensive, is more achievable in zebrafish than in many other vertebrates due to features such as large clutch size and external fertilisation and development. The team made 5 libraries for each of the developmental stages, each one comprising a pool of 12 embryos. All 90 libraries were made simultaneously by a robot to reduce batch effects, and strand-specific sequencing was used to reveal information on genes overlapping on the opposing strand. The data were released to ENA directly after sequencing, to allow public access as early as possible. Variation in gene structure across development can be viewed in Ensembl and the changing expression level can be viewed in Expression Atlas. A manuscript describing the changes in gene structure and expression level across development is currently in preparation.
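As a quick sanity check on the library numbers above, here is a minimal Python sketch. The constant and function names are ours, chosen for illustration only; the figures come straight from the paragraph.

```python
# Experimental design figures quoted in the post.
STAGES = 18               # embryonic developmental stages
LIBRARIES_PER_STAGE = 5   # libraries made for each stage
EMBRYOS_PER_LIBRARY = 12  # embryos pooled into each library


def total_libraries(stages=STAGES, per_stage=LIBRARIES_PER_STAGE):
    """Number of libraries across all stages."""
    return stages * per_stage


def pooled_embryos_per_stage(per_stage=LIBRARIES_PER_STAGE,
                             per_library=EMBRYOS_PER_LIBRARY):
    """Number of embryos that end up in libraries for a single stage."""
    return per_stage * per_library


print(total_libraries())           # 90
print(pooled_embryos_per_stage())  # 60
```

The first figure matches the 90 libraries mentioned above.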

The alignments and annotations generated from the data are viewable in the Ensembl browser, and the individual tracks can be configured using the RNA-seq tissue matrix. The initial introduction of this matrix was covered in a previous blog post. The new zebrafish entries appear in chronological order under the heading ‘WTSI stranded RNA-seq’. A merged set, which contains all of the new developmental RNA-seq data, is also selectable.

We expect these RNA-seq data will expose new isoforms of previously annotated genes, which may be especially prevalent during, and perhaps even unique to, early embryonic development. The alignments may also reveal interesting expression patterns for specific genes.

We’d like to encourage our users to take full advantage of these exciting new data, and we hope they’ll facilitate some interesting new research.

Please send any questions to our helpdesk.

The Ensembl Pre! site has been updated for four species: zebrafish (Danio rerio), rat (Rattus norvegicus), sperm whale (Physeter macrocephalus) and fugu (Takifugu rubripes).

Sperm whale is a new species to Ensembl. Our main site already displays earlier assemblies for fugu, zebrafish and rat.

Zebrafish

The zebrafish assembly, GRCz10 (GCA_000002035.3), was made available by The Genome Reference Consortium in September 2014. Since the previous release, Zv9 in July 2010, the GRC has taken over the task of improving and maintaining the zebrafish assembly. The most notable changes in the chromosome landscape since the previous release can be found on chromosome 4, which has gained about 15 Mb in length. Furthermore, 94 of the 112 previously unplaced contigs are now located on chromosomes. In total, this assembly consists of 26 chromosomes and 3,399 unplaced scaffolds. The full annotation of an older zebrafish assembly, Zv9, can be found on our main website. Click here to go to the zebrafish Pre! site, where you can view alignments of zebrafish UniProt proteins and human Ensembl translations, as well as gene models projected from the previous zebrafish assembly.

Rat

The new rat assembly, Rnor_6.0 (GCA_000001895.4), was produced by The Rat Genome Sequencing and Mapping Consortium and was released in July 2014. This assembly comprises 954 toplevel sequences, 22 of which are chromosomes (chromosome Y is a new addition in this assembly), and 1,395 of which are unplaced scaffolds. The full annotation of an older rat assembly, Rnor_5.0, can be found on our main website. Otherwise, click here to visit the rat Pre! site, where you can view alignments of rat UniProt proteins and human and mouse Ensembl translations, as well as gene models projected from the previous rat assembly.

Sperm Whale

The sperm whale assembly, PhyMac_2.0.2 (GCA_000472045.1), was produced in September 2013 by The Aquatic Genome Models Consortium. The assembly does not contain any assembled chromosomes or linkage groups and is instead made up of 11,711 unplaced scaffolds. The species is an important model for a number of human conditions such as respiratory disease, metal toxicity and cancer. For example, sperm whales exposed to high levels of chromium show no adverse health effects, whereas humans do. Studying this species could lead to the development of treatments for human chromium-related disorders. Click here to visit the sperm whale Pre! site, where you can view alignments of human and dolphin Ensembl translations.

Fugu

The fugu genome assembly, FUGU5 (GCA_000180615.2), was released in October 2011 by The Fugu Genome Sequencing Consortium. It is composed of 22 autosomal chromosomes, with a total sequence length of 391 Mb. The species was initially proposed as a useful model for annotating and understanding the human genome, as its genome contains a similar repertoire of genes to the human genome yet is only roughly one-eighth of the size. It is among the smallest vertebrate genomes, and previous assemblies of this species have already shown themselves to be useful reference genomes for identifying genes and other functional elements in other vertebrate species. The full annotation of an older fugu assembly, FUGU 4.0, can be found on our main website. Click here to visit the fugu Pre! site, where you can view alignments of human and dolphin Ensembl translations.

We’re now only a couple of weeks away from releasing our full annotation of the new human genome assembly (GRCh38). Before we make it publicly available we’d like to update you on our progress and to share a few key pieces of information.

Changes in the assembly

The GRCh38 assembly is made up of 455 top-level sequences. These sequences include 24 chromosomes, mitochondrial DNA, alternative reference loci and a number of unplaced scaffolds. For the first time ever, centromere sequences have also been included in a human reference assembly. The total contig length for this new assembly is 3.4 Gb, a small increase on the previous assembly, and the total chromosomal length is 3.1 Gb (excluding haplotypes). There are 261 alternate loci, including the LRC/KIR complex on chromosome 19 (35 alternate sequences) and the MHC region on chromosome 6 (7 alternate sequences). We have aligned nearly half a million proteins and over 200,000 cDNAs to the new assembly and have annotated a total of 63,263 models, 22,469 of which are protein-coding.

Figure: karyotype. Blue regions represent assembly gaps. Image credit: Kerstin Howe

For GRCh38, in addition to the usual steps involved in a genebuild, we have also made clone data available. The clone sets were loaded, along with other data, into the core human database. Although these data are not required for genebuilding, the information is extremely useful for some of our users.

What stage is the annotation at?

The Genebuilders have completed the final gene set, which has been merged with manual annotation from HAVANA to create the GENCODE 20 set. The data were then passed on to other teams within Ensembl so that they could carry out the remaining analyses. This entire process of data exchange between the different Ensembl teams is coordinated by the Ensembl Production team, who also conduct a series of quality control steps along the way.

The comparative genomics team (Compara) have now generated orthologues to all other Ensembl species from the new human geneset. They’ve also revised all pairwise and multi-species whole genome and transcript alignments so that users can identify conserved and constrained regions between human and other vertebrate species. Updating with the new human assembly, therefore, means that a large part of the Compara database also needs to be updated.

The Variation team have now collected all variant and phenotype data, linking the information to other data in Ensembl. This is so that useful variation data can be accessed and interpreted by our users. The variant effect predictor (VEP), for example, is an extremely useful tool that determines effects of variants, such as SNPs or indels, on genes, transcripts, proteins, regulatory regions and phenotypes. A user simply has to input the coordinates and sequence changes of the variants of interest.
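To make "coordinates and sequence changes" concrete, here is a minimal Python sketch that formats a variant as one line of input for VEP. It assumes VEP's whitespace-separated default input format (chromosome, start, end, reference/alternate allele, strand); the `Variant` class and its method are hypothetical helpers written for this illustration and are not part of VEP. Please check the VEP documentation for the authoritative description of the accepted formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A simple variant record (hypothetical helper, not a VEP class)."""
    chrom: str
    start: int
    end: int
    ref: str
    alt: str
    strand: str = "+"

    def to_vep_line(self) -> str:
        # Assumed column order: chromosome, start, end, ref/alt allele, strand.
        return f"{self.chrom} {self.start} {self.end} {self.ref}/{self.alt} {self.strand}"


# An illustrative single-nucleotide change from T to C (made-up coordinates).
snv = Variant("5", 140532, 140532, "T", "C")
print(snv.to_vep_line())  # 5 140532 140532 T/C +
```

Each variant of interest would become one such line in the file or text box submitted to the tool.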

And finally, the Regulation team have used the new Ensembl regulatory annotation build to locate regions in the human genome that are involved in the regulation of gene expression.

Some sample regulatory features as seen in the Ensembl browser

Now that the last parts of the relevant analyses are being completed, the Ensembl Webteam are currently working on the Ensembl website, ensuring that all the relevant data will be accessible to you in the most user-friendly manner.

The final release is still on target for the end of July, after which the GRCh37 annotation will be available on a separate archive site. Although we have produced the GENCODE 20 gene set for the upcoming Ensembl release (e76), we are still in the process of refining it. We therefore recommend, particularly for large consortia, waiting for the GENCODE 21 release, which will be available with e77. In the meantime, until the e76 release, the human Pre! site is still up and running.

If you have any questions then please don’t hesitate to contact us, either through Twitter or by emailing helpdesk.

The Amazon molly (Poecilia formosa) is now available on Ensembl Pre! This particular species is especially interesting to scientific research due to its origins, its method of reproduction and the manner in which it interacts with other closely related fish.

Amazon molly

The single-sex interspecies school
Considering the name of the species, you would be forgiven for thinking these fish can be found swimming around the Amazon River. The Amazon molly actually resides in the warm waters of north-eastern Mexico and southern Texas, and derives its name from something far more interesting than its habitat.

One of very few asexual vertebrates, this fish reproduces via a process known as gynogenesis, or sperm-dependent parthenogenesis. Despite being a method of asexual reproduction, gynogenesis does involve the mating of a male with a female. However, the genetic material from the male is not incorporated into the already diploid eggs and the sperm serves only to trigger embryonic development, thereby producing clones of the mother. The entire species is therefore female, and is thus named after the legendary society of female Amazon warriors.

Life finds a way
Due to the absence of male Amazon mollies, the females act as sexual parasites by mating with males from other closely related species. These mates come from species such as P. latipinna, P. mexicana, P. latipunctata and, occasionally, P. sphenops. In fact, it is thought that the Amazon molly originated from a hybridization event between two of these species, the Atlantic molly (P. mexicana) and the Sailfin molly (P. latipinna), approximately 280,000 years ago. However, all attempts to create P. formosa-like hybrids in the laboratory have, so far, been unsuccessful.

Distribution of molly species in coastal regions of the Gulf of Mexico

As the male fish do not contribute their genes to the next generation, one would expect that natural selection would act against them being ‘fooled’ into mating with the heterospecific Amazon females. Furthermore, experiments indicate that the males are able to tell the difference between females of their own species and the Amazon species. So why do they mate with these Amazon mollies? Unfortunately, the answer is that we simply don’t know. However, findings have suggested that the male individuals may actually benefit from this behaviour as mating with Amazon mollies seems to make them more attractive to females from their own species. The strange relationship between the Amazon mollies and these male mollies may therefore benefit both parties.

Asexual versus sexual
The main advantage of asexual over sexual reproduction, in any species, is an increase in reproductive output. With asexual females, there is no need to produce males that cannot give birth, resulting in twice as many grandchildren as would be produced by sexual reproduction. Asexual reproduction, therefore, should be the preferred method. As Amazon molly offspring are clones of the mother in an environment in which the mother was able to survive, they are also likely to survive and reproduce. This type of reproduction helps colonize new territory very quickly, but a population that reproduces in this manner will likely be unable to adapt to changing environments. Additionally, according to an evolutionary theory known as Muller’s ratchet, deleterious mutations in small asexual populations can accumulate at a fast rate due to a lack of gene recombination, which can eventually result in extinction.
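The "twice as many grandchildren" argument can be sketched with a toy calculation. The Python below assumes every female produces the same number of offspring, that a sexual female produces sons and daughters in equal numbers, and that only daughters give birth; the function name and the brood size of 10 are illustrative choices, not measured values.

```python
def grandchildren_via_daughters(offspring_per_female, fraction_female):
    """Grandchildren born to the daughters of one female (toy model)."""
    daughters = offspring_per_female * fraction_female
    return daughters * offspring_per_female


# Asexual female: all offspring are daughters.
asexual = grandchildren_via_daughters(10, 1.0)
# Sexual female: assumed 50:50 sex ratio, so only half the offspring give birth.
sexual = grandchildren_via_daughters(10, 0.5)

print(asexual, sexual)   # 100.0 50.0
print(asexual / sexual)  # 2.0
```

Under these simplifying assumptions the ratio is 2 regardless of the brood size chosen.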

Why study the Amazon molly?
A popular endeavour in modern evolutionary biology is to explain the evolution and persistence of sexual reproduction, given the higher costs of producing male individuals when compared with asexual reproduction. One effective way to research the relative strengths and weaknesses of the two reproductive methods is to study the dynamics of the coexistence of sexual and asexual organisms. The Amazon molly’s unique situation, both with respect to the way in which it reproduces and its interaction with other molly species, makes it an extremely valuable model. It has already been used in studies focused on determining whether or not sexual selection is necessary for high diversity of the MHC. Findings have suggested that the asexual molly has polymorphic MHC loci despite its clonal reproduction, yet these loci are more polymorphic in the sexual species. The Amazon molly is also used as a model for carcinogenicity studies, and is extremely easy to breed and rear in captivity. Furthermore, the clonality of the fish allows researchers to carry out studies on individuals that are genetically identical.

Browsing the genome
The Amazon molly genome assembly was made publicly available in October 2013. We have carried out a preliminary gene annotation, generated by alignments of Ensembl human, stickleback and zebrafish translations from Ensembl release 75. You can find this information on our Pre! site.

Region of the Amazon molly genome as seen in the Ensembl browser. The gene models shown are derived from human and zebrafish proteins.

We’re extremely excited to be carrying out a complete genebuild, incorporating data such as RNASeq, which will be available in a future Ensembl release. Keep an eye on our blog to find out when, and if you have any questions feel free to contact us.

In-depth knowledge of the human genome is fundamental in an array of scientific fields, such as forensics, research, anthropology and medicine. Since the completion of the Human Genome Project in 2003 thousands of human genomes have been sequenced, sequencing technology has improved significantly, and the amount of available data has vastly increased.

The new human assembly (GRCh38) arrived last week, and our objective over the next few months will be to thoroughly annotate it, ultimately providing our users with the best possible gene set.

What does the new assembly look like? 
Though the underlying genomic DNA will be identical, or very similar, to that of the previous release (GRCh37), certain improvements mean this is a particularly important assembly. These changes include:

The reference GRCh38 assembly consists of the ‘primary assembly’ and ‘alternate sequences’. The primary assembly is made up of 24 chromosomes, 42 unlocalized scaffolds and 127 unplaced scaffolds, which contain genomic sequences that have not yet been assigned to chromosomes. The alternate sequences are a collection of 261 ‘alt loci’. These include the haplotypes for the MHC region on chromosome 6, as well as shorter regions on other chromosomes where the GRC provide alternate alleles present in the population.

The figure above shows the MHC region of chromosome 6 on the reference genome.
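The sequence counts quoted for the primary assembly and the alternate sequences can be tallied with a few lines of Python. This is only a bookkeeping sketch: the dictionary keys are our own labels, and the tally covers just the categories listed above (it does not include the mitochondrial genome, for example).

```python
# Sequence counts for GRCh38 as quoted in the text.
primary_assembly = {
    "chromosomes": 24,
    "unlocalized_scaffolds": 42,
    "unplaced_scaffolds": 127,
}
alt_loci = 261

primary_total = sum(primary_assembly.values())
listed_total = primary_total + alt_loci

print(primary_total)  # 193
print(listed_total)   # 454
```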

What does it mean for our users? 
The updates will facilitate an improved understanding of the human genome, and increase the accuracy of the Ensembl annotation. It is important to note, particularly for users who use coordinate-based systems, that these changes may affect the lengths of chromosomes and the positions of many genes.
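For users who store coordinates, one way to convert them between assemblies is an assembly-mapping service. The Python sketch below only builds a request URL and does not contact any server. It assumes the Ensembl REST API's `map` endpoint and the `https://rest.ensembl.org` server, neither of which is mentioned in this post, so please treat the URL pattern as an assumption and confirm it against the REST documentation. The function name is hypothetical.

```python
def assembly_map_url(chrom, start, end,
                     source="GRCh37", target="GRCh38",
                     species="human",
                     server="https://rest.ensembl.org"):
    """Build a URL asking to map a region from one assembly to another.

    Assumed pattern: /map/<species>/<source>/<region>/<target>
    where the region is written as chrom:start..end:strand.
    """
    region = f"{chrom}:{start}..{end}:1"  # forward strand
    return (f"{server}/map/{species}/{source}/{region}/{target}"
            "?content-type=application/json")


print(assembly_map_url("X", 1000000, 1000100))
```

Fetching that URL (for example with `urllib.request`) would be a separate step, and the structure of the response should be taken from the service's own documentation rather than from this sketch.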

What does it mean for the Ensembl GeneBuilders?
When genome sequencing was a new technology, it was thought that an assembly could be represented by a single ‘Golden Path’: a set of overlapping sequences that could be selected to produce a non-redundant chromosome sequence (with gaps), fully representing the sequence at all loci. The reference human assembly, however, is not a simple linear model and it includes additional information on an array of different alleles. Fortunately, our GeneBuild pipelines have previously been updated to deal with such alternative sequences, as we faced a similar challenge with GRCh37 patches. The most prominent challenge, however, will involve careful database storage and disk space management as we will be aligning a massive amount of data to the new assembly, such as EST (>8 million human ESTs) and RNASeq data.
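The ‘Golden Path’ idea can be illustrated with a toy Python sketch: given overlapping sequenced pieces placed on a chromosome, pick a non-redundant stretch from each and record gaps where nothing covers the chromosome. This is a simple greedy illustration with made-up clone names and coordinates (1-based, inclusive); it is not the method the GRC or Ensembl actually use to build a tiling path.

```python
def golden_path(clones, chrom_length):
    """Return a non-redundant path of (name, start, end) pieces with gaps.

    clones: iterable of (name, start, end) placements on one chromosome.
    """
    path = []
    covered_to = 0  # last chromosome position already represented
    # Visit clones left to right; for equal starts, prefer the longer one.
    for name, start, end in sorted(clones, key=lambda c: (c[1], -c[2])):
        if end <= covered_to:
            continue  # fully redundant with sequence already chosen
        if start > covered_to + 1:
            path.append(("gap", covered_to + 1, start - 1))
        # Use only the part of this clone that extends the path.
        path.append((name, max(start, covered_to + 1), end))
        covered_to = end
    if covered_to < chrom_length:
        path.append(("gap", covered_to + 1, chrom_length))
    return path


clones = [("A", 1, 100), ("B", 80, 200), ("C", 251, 300)]
for piece in golden_path(clones, 320):
    print(piece)
# ('A', 1, 100)
# ('B', 101, 200)
# ('gap', 201, 250)
# ('C', 251, 300)
# ('gap', 301, 320)
```

A single linear path like this has no place for a second version of the same locus, which is why alternative sequences need separate handling.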

What is involved in re-annotating the new assembly?
Even though most of the genic regions in GRCh38 will be the same as for the previous assembly, we are going to throw away all the automatically annotated gene models (keeping the manually annotated genes from Havana) and begin the entire annotation process again. While it would be far easier and quicker to just copy the pre-existing gene models, we are more interested in producing the best possible gene set.

All of the gene models we produce are based on biological evidence: a protein and/or mRNA sequence must align to the genome in order for us to annotate a gene model. Just as the genome assembly has been updated to remove incorrect sequence and to add new DNA, so too have the public databases been improved since we last produced a gene set on human. New protein and cDNA sequences are now available, and others may have been removed. We therefore have a great opportunity to refresh the entire human gene set, and possibly find many genes that could not be annotated before due to a lack of evidence.

When can users expect to see the new assembly and gene set?
We plan on releasing a Pre! site this quarter to give users a chance to view the new assembly (BLAST/BLAT will be available). In order to produce this temporary Pre! site we are aligning the old human gene set to the new assembly to indicate where we expect the genes to be. It will take approximately three months to automatically annotate the new assembly using Ensembl pipelines, and about one month to merge the automatically annotated gene models with the manually annotated ones from Havana. The gene set should be finalized during the second quarter of this year, after which the data will be passed to the other teams in Ensembl (comparative genomics, variation, and regulation teams).

The complete GRCh38 annotation will be made available on the Ensembl website in the third quarter of 2014. From then on we will support GRCh38 on our main site (www.ensembl.org). The GRCh37 assembly will still be available, but will be contained on its own site (GRCh37.ensembl.org) and will remain static. Any further updates will be done exclusively on the new assembly.

Additional information can be found on the Genome Reference Consortium blog here. There is also other useful information here in the form of a poster, which was presented by James Torrance on behalf of the GRC at the ISMB/ECCB conference in July 2013. It is important to note, however, that the poster was produced before the final version of GRCh38 was released and some of the numbers it contains may be out of date.

New Pre! sites have been released for four species: the Mexican blind cave fish (Astyanax mexicanus), southern white rhinoceros (Ceratotherium simum simum), prairie vole (Microtus ochrogaster) and armadillo (Dasypus novemcinctus).

The prairie vole assembly, MicOch1.0 (GCA_000317375.1), was made available by The Broad Institute in October 2012. The prairie vole is a model for social behaviour. The animals live in colonies and have been known to display aspects of human-like behaviour such as lifelong pair bonding. The vole assembly is composed of 21 chromosomes and 6,314 unplaced scaffolds. Click here to go to the vole Pre! site, where you can view alignments of vole proteins from UniProt, and mouse and human Ensembl translations.

The armadillo assembly, Dasnov3.0 (GCA_000208655.2), was made available by the Baylor College of Medicine in December 2011. The armadillo is a natural reservoir of leprosy, which can be acquired by humans who handle or consume the animals. The species is also used to study multiple births and delayed implantation of embryos, as it usually produces identical quadruplets. The armadillo assembly is composed of 46,558 unplaced scaffolds. Click here to go to the armadillo Pre! site, where you can view alignments of the armadillo UniProt proteins and human Ensembl translations.

The cave fish assembly, AstMex102 (GCA_000372685.1), became available in April 2013. The species has both surface-dwelling (surface fish) and cave-adapted (cave fish) morphs and is an important model in evolutionary biology. The cave fish differ from the surface fish in several traits such as the enhancement of non-visual sensory systems and the loss of eyes and pigmentation. As the two different morphs are inter-fertile, the species is a useful model for microevolution studies, although it is primarily used as a model for retinal degeneration diseases. The cave fish assembly comprises 10,735 unplaced scaffolds. Click here to go to the cave fish Pre! site, where you can view alignments of cave fish UniProt proteins, and stickleback and zebrafish Ensembl translations.

The southern white rhinoceros assembly, CerSimSim1.0 (GCA_000283155.1), was made available by the Broad Institute in August 2012. This particular species is a valuable model for a number of reasons. Potential longevity has been estimated to be between 40 and 50 years, and although the animals may be sexually mature by the age of 4, they often do not reproduce until much later in life. They are also important for comparative studies with horses, having diverged about 55 million years ago. Finally, rhino studies are important for conservation reasons, as the species is near threatened due to poaching. The rhino assembly comprises 3,086 unplaced scaffolds. Click here to go to the rhinoceros Pre! site, where you can view alignments of the rhinoceros UniProt proteins and human Ensembl translations.

The zebrafish (Danio rerio) is pretty much the ideal model for understanding vertebrate development as it combines some of the best attributes from several other model organisms. More importantly, there is extensive similarity between the zebrafish and human genomes. Thus, many human developmental and disease genes have counterparts in zebrafish, and linking such genes is key to elucidating human gene function. Consequently, an array of zebrafish models of human diseases have been produced for the purpose of testing candidate drugs. The Wellcome Trust Sanger Institute recently released a video addressing the usefulness of using zebrafish in research:

Until recently, we at Ensembl have relied mostly on protein and cDNA alignments to produce transcript models. However, the number of known zebrafish proteins and cDNAs is relatively small. With the advent of RNASeq, many more splice variants can be identified and used in the gene building process. Not only do these data provide proof of transcription, but RNASeq also presents us with information on splice sites, UTRs and tissue expression.

The RNA-Seq pipeline
The zebrafish gene set was among the first to be enhanced by the incorporation of RNASeq data. For those of you with some knowledge of model organisms, starting a new genebuild process using zebrafish as the test case may seem counter-intuitive when far simpler eukaryotic genomes are available, such as those of D. melanogaster and C. elegans. Ironically, the complexity of the zebrafish genome, and the predicted difficulty in annotating it, was the main reason for its selection as the new pipeline’s first test subject. At the time, the general consensus throughout the Ensembl genebuild team was that annotating a complex genome was the best way to prepare for future annotations of mammalian genomes.

The following video describes how RNASeq data is used in our gene sets:

Unsurprisingly, the zebrafish RNASeq annotation process did prove to be a difficult endeavour. Some particularly long genes, such as those encoding Nebulin and Titin, were problematic due to the very high number of possible combinations of exons and introns. The models, therefore, had to be simplified. Perhaps most frustratingly, however, the zebrafish assembly was updated from version 8 to version 9 just as the genebuild team were finishing. This meant that after running a full analysis they had to rerun the entire pipeline on the new assembly. Fortunately the hard work paid off in the end. The same pipeline that was developed for zebrafish can now be used for all other eukaryotes. Anole lizard, for example, has already been updated using the RNASeq pipeline and rabbit will soon follow.

The new zebrafish gene models
The RNASeq zebrafish pipeline took data from 5 tissues and 7 developmental stages and assembled them into 25,748 gene models. These elements were then incorporated into the Ensembl genebuild process after careful filtering. This was followed by a merge with the manually curated VEGA gene models to produce a final set of 26,152 genes, represented by 51,569 transcripts.

The different sets of gene models contribute contrasting elements to the final product, and achieving the best possible result is a balance between including correct models while excluding incorrect ones. In this particular case, the RNASeq gene models were used to adjust intron and exon boundaries, confirm expression and improve on the accuracy of the 3′ UTRs. If you would like to know more, the entire genebuild process for the zebrafish is summarised here, and you can read the published paper here.

Viewing the data on our website
If you’d like to, you can view the RNASeq information for a particular species on its Ensembl homepage:

  • Firstly, search for a gene on the homepage and click on it.
  • Then click “Configure this page”, which is to the left of the page.
  • Click “RNASeq models”, also to the left of the page.
  • Click “Enable/disable all RNASeq models”.

New tracks will now appear. These tracks can then be repositioned to facilitate simple comparisons. The video below gives a visual demonstration, as well as a more detailed description, of the steps involved.

In future Ensembl releases more annotated species will be updated with such information. As new species are introduced their genebuilds will also incorporate RNASeq data wherever possible. The final outcome will be better, more accurate UTR and splice site annotations, as well as a clearer picture of gene expression patterns.