The Ensembl Pre! site has been updated for four species: zebrafish (Danio rerio), rat (Rattus norvegicus), sperm whale (Physeter macrocephalus) and fugu (Takifugu rubripes).

Sperm whale is a new species to Ensembl. Our main site already displays earlier assemblies for fugu, zebrafish and rat.

Zebrafish

The zebrafish assembly, GRCz10 (GCA_000002035.3), was made available by The Genome Reference Consortium in September 2014. Since the previous release, Zv9 in July 2010, the GRC has taken over the task of improving and maintaining the zebrafish assembly. The most notable changes in the chromosome landscape since the previous release can be found on chromosome 4, which has gained about 15 Mb in length. Furthermore, 94 of the 112 previously unplaced contigs are now located on chromosomes. In total, this assembly consists of 26 chromosomes and 3,399 unplaced scaffolds. The full annotation of an older zebrafish assembly, Zv9, can be found on our main website. Click here to go to the zebrafish Pre! site, where you can view alignments of zebrafish UniProt proteins and human Ensembl translations, as well as gene models projected from the previous zebrafish assembly.

Rat

The new rat assembly, Rnor_6.0 (GCA_000001895.4), was produced by The Rat Genome Sequencing and Mapping Consortium and was released in July 2014. This assembly comprises 954 toplevel sequences, 22 of which are chromosomes (chromosome Y is a new addition in this assembly), and 1,395 of which are unplaced scaffolds. The full annotation of an older rat assembly, Rnor_5.0, can be found on our main website. Otherwise, click here to visit the rat Pre! site, where you can view alignments of rat UniProt proteins and human and mouse Ensembl translations, as well as gene models projected from the previous rat assembly.

Sperm Whale

The sperm whale assembly, PhyMac_2.0.2 (GCA_000472045.1), was produced in September 2013 by The Aquatic Genome Models Consortium. The assembly does not contain any assembled chromosomes or linkage groups and is instead made up of 11,711 unplaced scaffolds. The species is an important model for a number of human conditions such as respiratory disease, metal toxicity and cancer. For example, sperm whales exposed to high levels of chromium show no adverse health effects, whereas humans do. Studying this species could lead to the development of treatments for human chromium-related disorders. Click here to visit the sperm whale Pre! site, where you can view alignments of human and dolphin Ensembl translations.

Fugu

The fugu genome assembly, FUGU5 (GCA_000180615.2), was released in October 2011 by The Fugu Genome Sequencing Consortium. It is composed of 22 autosomal chromosomes, with a total sequence length of 391 Mb. The species was initially proposed as a useful model for annotating and understanding the human genome, as it contains a similar repertoire of genes to human yet is only roughly one-eighth of the size. It is among the smallest vertebrate genomes, and previous assemblies of this species have already shown themselves to be useful reference genomes for identifying genes and other functional elements in other vertebrate species. The full annotation of an older fugu assembly, FUGU 4.0, can be found on our main website. Click here to visit the fugu Pre! site, where you can view alignments of human and dolphin Ensembl translations.

Please note that the archive website for Ensembl release 65 (Dec 2011) will be retired in December when version 78 is released.

This is in accordance with our rolling retirement policy, whereby archives more than three years old are retired unless they include the last instance of the previous assembly from one of our key species (human, mouse and zebrafish).

For more information about how to use archives, please see our previous blog post on the topic; a list of all current archives is available on the main website.

You may have noticed our beta REST server has been retired. We have replaced it with our new service, http://rest.ensembl.org, and have a handy migration guide to help you update existing scripts. Details about the new server can be found in the article published in Bioinformatics. Some of the improvements include:

  • New POST endpoints
    POST messages allow users to submit a list of inputs as a single request
    This is supported for the archive, lookup and vep endpoints
  • The rate limit has been increased, with up to 15 requests per second allowed
    Combined with POST, we were able to process 1000 variants per second!
  • New /variation endpoint to retrieve variation information linked to a gene or a transcript
  • New /regulatory endpoint to retrieve data from the regulatory build
  • HTTPS support for clients working with a secure environment
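
As a rough illustration of the POST support described above, the following Python sketch batches a list of stable IDs into requests for the lookup endpoint. It only builds the requests; sending them needs network access. The batch size and helper names (`chunk`, `build_lookup_request`, `fetch`) are assumptions made for this example, and the JSON body shape should be checked against the REST documentation before relying on it.

```python
import json
import urllib.request

SERVER = "https://rest.ensembl.org"  # HTTPS form of the server named above
BATCH_SIZE = 200                     # assumed batch size, not an official limit


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_lookup_request(stable_ids):
    """Build (but do not send) a POST request carrying a list of IDs."""
    body = json.dumps({"ids": stable_ids}).encode("utf-8")
    return urllib.request.Request(
        SERVER + "/lookup/id",
        data=body,
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )


def fetch(request):
    """Send a prepared request and decode the JSON reply (needs network)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Two stable IDs that appear elsewhere in this text, used as sample input.
ids = ["ENSG00000168528", "ENSG00000268651"]
requests_to_send = [build_lookup_request(batch)
                    for batch in chunk(ids, BATCH_SIZE)]
```

Calling `fetch(req)` on each prepared request would send it; a short pause between calls is one simple way to stay under the 15 requests per second limit.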

This server provides access to the latest data in Ensembl, including the new human build on the GRCh38 assembly. For those wishing to use data from the GRCh37 assembly, a dedicated server is available at http://grch37.rest.ensembl.org.

The brand new Ensembl Regulatory Build on the new GRCh38 human assembly has been released in Ensembl 76. This involved a complete redesign of the build process with a new statistically rigorous logic, a streamlining of all the backend processes, and the remapping and peak calling of all our data sets.

The build and its constituent data can be viewed directly in Ensembl through the Regulation section of “Configure This Page”, but we have also made them accessible by creating a public track hub. A track hub is a pre-configured set of tracks which you can load together into Ensembl or other genome browsers such as the UCSC browser. In addition to the data loaded into Ensembl, it also contains tracks that summarise the data used to generate the build.

What’s in the Ensembl Regulatory Build Track Hub?

GRCh37 data

You’ve been meaning to do that data remapping for a while now, but didn’t quite get to it? Or maybe you’re just feeling nostalgic? No worries, the track hub covers both GRCh37 and GRCh38.

Raw data

You can scan the header files for GRCh37 or GRCh38 for direct access to the raw data in BigBed and BigWig format.

Transcription Factors

For each transcription factor, we calculate the probability of binding at any position from the available data sets, simply by dividing the number of overlapping peaks by the number of data sets. These probabilities can be viewed in the TFBS Summaries section. An overall probability of any binding is viewable in the TFBS Summary track of the Ensembl build overview section.
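
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the production code: the function name, the 1-based inclusive coordinates and the choice to count at most one overlapping peak per data set are all assumptions made for the example.

```python
def binding_probability(peak_sets, position):
    """Fraction of data sets with a peak covering `position`.

    peak_sets: one list of (start, end) peaks per data set,
               1-based inclusive coordinates (an assumption).
    """
    if not peak_sets:
        raise ValueError("need at least one data set")
    overlapping = sum(
        any(start <= position <= end for start, end in peaks)
        for peaks in peak_sets
    )
    return overlapping / len(peak_sets)


# Four hypothetical data sets for one transcription factor.
datasets = [
    [(100, 200)],
    [(120, 180), (400, 450)],
    [(140, 160)],
    [(300, 350)],
]
probability = binding_probability(datasets, 150)  # 3 of 4 data sets overlap
```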

Segmentations

We use genome segmentation software (Segway) to partition the genome into regions of similar signal over these assays, and label these states as, for example, predicted promoters, enhancers or repressed regions. The segmentations for each cell type can be found in the Cell Type Segmentations tracks.

For each state of the segmentation, we also create a summary track which represents the number of cell types that have that state at any given base pair of the genome.
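
A per-state summary of this kind could be computed roughly as follows. This toy sketch assumes 0-based half-open coordinates and non-overlapping segments within each cell type; the cell type names, labels and function name are hypothetical and chosen only for illustration.

```python
def state_summary(segmentations, state, length):
    """Count, for every base, how many cell types carry `state` there.

    segmentations: {cell_type: [(start, end, label), ...]} using
                   0-based half-open coordinates (an assumption).
    """
    counts = [0] * length
    for segments in segmentations.values():
        for start, end, label in segments:
            if label != state:
                continue
            for pos in range(max(start, 0), min(end, length)):
                counts[pos] += 1
    return counts


# Hypothetical segmentations of a 10 bp region in three cell types.
segmentations = {
    "cell_A": [(0, 4, "promoter"), (4, 10, "repressed")],
    "cell_B": [(0, 2, "repressed"), (2, 6, "promoter"), (6, 10, "enhancer")],
    "cell_C": [(0, 10, "repressed")],
}
promoter_track = state_summary(segmentations, "promoter", 10)
```

A real track would use a sparse or run-length representation rather than one counter per base; the list here simply keeps the idea visible.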

The Ensembl Regulatory Build

The summarised Ensembl Regulatory Build can be viewed in the “Ensembl Reg. Build” track of the Ensembl Build Overview section. For each cell type, we then annotate each feature as on or off, as displayed in the Cell Type Activity tracks.

Please note that the archive websites for Ensembl release 62 (April 2011) and 63 (June 2011) will be retired in August when version 76 is released.

This is in accordance with our rolling retirement policy, whereby archives more than three years old are retired unless they include the last instance of the previous assembly from one of our key species (human, mouse and zebrafish).

For more information about how to use archives, please see our previous blog post on the topic; a list of all current archives is available on the main website.

With the release of Ensembl 76 fast approaching, the variation team would like to provide more information on how we moved our variation data to the new human assembly, GRCh38. There are different methods available for re-annotating variants on a new assembly. The most accurate way would be to re-run experiments, or variant calling pipelines that identified the variant in the first place, on the new assembly. The necessary material and computational resources required for such an endeavour, however, are very expensive. Therefore, we have developed computational methods so that, for most of the data, such investments are not necessary.

Because the new assembly retains much of the sequence of the previous assembly, we can use computational methods that derive the new location from information we already have about a variant, namely its:

  • Location on the old assembly
  • Flanking sequence (DNA sequence from the old assembly surrounding the variant)

Based on this prior knowledge we can either project or remap our variation data.

Projecting variants

The projection algorithm compares two assemblies and computes the new location based on sequence similarity between them. The computation of the new location is successful for ~98% of our variation data. However, when the sequence in the new assembly has changed too much compared to the old assembly, the projection fails, and for those variants we fall back on a remapping strategy, as explained below. The projection functionality is implemented in the Ensembl core API.

Remapping variants

For the remapping approach we generate a sequence read by adding upstream and downstream sequence from the old assembly to a variant. We then map the read to the new assembly using BWA.
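
The read construction step might look like the sketch below. The flank sequences, the variant name and the use of the reference allele between the flanks are assumptions for this example; the text does not specify the flank length or allele used. The resulting FASTA record is what would then be handed to BWA.

```python
def build_read(upstream, allele, downstream):
    """Join the old-assembly flanks around a variant allele into one read."""
    return upstream + allele + downstream


def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Hypothetical variant with very short flanks, for readability only.
read = build_read("ACGT", "A", "TTGC")
record = to_fasta("variant_example", read, width=4)
```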

Workflow

We have ~64M variants and ~69M variation features (VF) in Ensembl release 75, GRCh37. You can think of a variation feature as a combination of a variant and its location on the genome. Most variants have one variation feature. If a variant maps to multiple locations on the genome, the variant has as many variation features as it has locations on the genome.
We can divide variation features into:

  1. VF that map uniquely to the reference genome (chromosomes 1-22, X, Y, MT)
  2. VF that have multiple mappings on the reference genome
  3. VF that are located on an alternative locus
  4. VF that are located on a fix patch region

Workflow for re-annotating variation data to the new assembly

We first attempt to project all VF that map uniquely to the genome or are located on alternative loci. In a second step we use our remapping approach for VF that could not be projected. Of the ~62M variants with a unique location on GRCh37, only ~200,000 could not be projected and were remapped to the new assembly. Variants with multiple mappings on GRCh37 have been remapped to GRCh38 using their flanking sequence information as submitted to dbSNP.

Together, projection and remapping create the new set of variation feature locations in release 76. We do not need to re-annotate variants located on fix patch regions from GRCh37 because the fix patch regions have been incorporated into the primary sequence for the new assembly.
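
The decision logic of this workflow can be summarised in a small dispatcher. This is a sketch under stated assumptions: the category labels and the `project` and `remap` callables are hypothetical stand-ins for the real projection and BWA-based remapping steps, and each callable is assumed to return a new location or `None`.

```python
def reannotate(category, feature, project, remap):
    """Pick a re-annotation route for one variation feature.

    Returns a (method, location) tuple.
    """
    if category == "fix_patch":
        # Fix patches are part of the new primary sequence: nothing to do.
        return ("skipped", None)
    if category in ("unique", "alt_locus"):
        location = project(feature)
        if location is not None:
            return ("projected", location)
        # Projection failed, so fall back on remapping.
        return ("remapped", remap(feature))
    if category == "multiple":
        # Multi-mapping features go straight to flank-based remapping.
        return ("remapped", remap(feature))
    raise ValueError("unknown category: " + category)


# Toy stand-ins: projection only works for features named 'easy'.
def toy_project(feature):
    return "chr1:1000" if feature == "easy" else None


def toy_remap(feature):
    return "chr1:2000"


result = reannotate("unique", "hard", toy_project, toy_remap)
```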

Alternative Loci

The Genome Reference Consortium increased the number of alternative sequence representations for variant regions (ALT LOCI) in GRCh38. In our workflow diagram we described how we re-annotate variants to ALT LOCI that were present in GRCh37. Additionally, we provide variant annotations for new ALT LOCI by remapping variation features from the primary reference sequence (GRCh38) that overlap an alternative locus. We added ~1.5M extra variation features with this approach. However, this only gives an idea of how known variants map to ALT LOCI. Ideally, variant calling would be done against the set of primary reference sequences and ALT LOCI. We expect variants to be called on ALT LOCI in the near future, as variant calling tools add the option of including ALT LOCI information.

Are you ready to move to GRCh38?

Ensembl provides a reliable representation of variation data on the new human assembly, GRCh38. In addition to re-annotating variation data from release 75 to 76, we also updated our data (e.g. from ClinVar, the NHLBI GO Exome Sequencing Project or from COSMIC) and projected and remapped the data where necessary to the new assembly GRCh38. But there is no need to worry if you are not yet ready to make the move to GRCh38. Starting with Ensembl release 76 we will support and update variation annotation for GRCh37 and GRCh38. If you have questions or comments, please get in touch with us.

With release 76 looming large in our calendars and the final deadlines out of the way for GRCh38 data production, it’s a good time to look back and take stock of what we’ve been doing in the Ensembl Regulation office. We have been rather quiet the past few months, working feverishly on an ambitious overhaul of our infrastructure. We’ve already given you a sneak peek at the new Ensembl Regulatory Build, so I’d like to take a look at the workhorse underlying all of our data, the ‘Ensembl Regulation Analysis Pipeline’.

The end result is a core resource that centralises epigenomic data from multiple public sources, processes them through a universal pipeline, then summarises them into easily understood annotations. Ensembl Regulation aims to be a single entry point to obtain an overview of all the available regulatory data, from individual datasets to summary annotations, all coming to a browser near you, very soon. Underlying it is a full ‘end to end’ pipeline for producing the input data to the Regulatory Build, from fastq download, to alignment, IDR processing, peak calling and finally motif alignments.

The inputs to the Regulatory Segmentation and Build are experiments (ChIP-seq & DNase-seq) describing the chromatin status (i.e. histone modifications) and transcription factor landscape across various cell lines. These experiments range from large projects (e.g. ENCODE, Roadmap Epigenomics and BLUEPRINT) through to individual experiments made accessible via archives such as the ERA/ENA, SRA and GEO.

The main outputs of the pipeline are genome alignments, peak calls and  ‘collection’ files which provide coverage statistics across the genome. Managing and processing these data is no simple task, and we expect the number of available epigenomic datasets to increase significantly in the years to come. Also, with the arrival of GRCh38, we needed to reprocess all of the existing data in a short timeframe. We therefore integrated our processes into a shiny new fully automated pipeline using the ensembl-hive framework. Here follows a brief summary of the new features of the regulation analysis pipeline.

The Tracking Database

This now constitutes our main analysis and archive database, tracking the data both within our pipeline and in external repositories. In it, we register the metadata from different projects and data repositories, providing a single point of reference to query the data available in the public domain. This has been crucial in determining which cell lines meet the requirements for a build.

Read Alignment and Peak Calling

We first align reads using BWA, then call peaks using SWEMBL for short regions and CCAT for broader ranging histone modifications. Replicates are processed in parallel to support ENCODE’s Irreproducible Discovery Rate (IDR) methodology.

Pipeline Improvements

Flexibility has been a key aim of the redesign, and the hive infrastructure has helped here by allowing us to define each logical part of the pipeline as a separate configuration which can be ‘topped up’ as required. This means that it’s easy to run just the read alignment stage (which we require as input to the segmentation), or at your pleasure add in the peak calling and collection file writing stages whilst it’s still running.  All the necessary state information is captured in the tracking database, so it’s really easy to pick things up at any point and start running the later stages of the pipeline.

Due to the size of our input data set and the resulting rolling data footprint, we set up a garbage collection of intermediate files and added inline archiving. This has limited our footprint, and enabled us to reprocess the entire human data set in one go.

The combination of the above improvements, the new ensembl-hive implementation and a whole load of other refinements, means much less manual intervention is required, resulting in a large reduction in run times.  For the alignments in particular, what was taking several weeks now takes just ~5 days!

What does the future hold?

We’ve already identified some more optimisations to the structure of the pipeline, so the runtimes are likely to drop even further. This will be crucial to handle the hundreds of cell types currently being examined within Roadmap Epigenomics, Blueprint, ENCODE 3 and other projects. We will also be revising our schemas to better reflect tissue specific data. This is part of a larger push within Ensembl to better describe the dynamics of gene regulation and transcription.

Finally, we are keeping up with lab techniques, and will be extending our pipelines to handle newer types of data, such as chromatin conformation assays or eQTLs. Although we do not process these data ourselves, we have already integrated and remapped the FANTOM5 CAGE-tag annotations onto GRCh38.

p.s. If you want even more info, keep an eye on this page. Once release 76 is out it will be updated with our new Regulatory Build documentation.

We’re now only a couple of weeks away from releasing our full annotation of the new human genome assembly (GRCh38). Before we make it publicly available we’d like to update you on our progress and to share a few key pieces of information.

Changes in the assembly

The GRCh38 assembly is made up of 455 top-level sequences. These sequences include 24 chromosomes, mitochondrial DNA, alternative reference loci and a number of unplaced scaffolds. For the first time ever, centromere sequences have also been included in a human reference assembly. The total contig length for this new assembly is 3.4 Gb, a small increase on the previous assembly, and the total chromosomal length is 3.1 Gb (excluding haplotypes). There are 261 alternate loci, including the LRC/KIR complex on chromosome 19 (35 alternate sequences) and the MHC region on chromosome 6 (7 alternate sequences). We have aligned nearly half a million proteins and over 200,000 cDNAs to the new assembly and have annotated a total of 63,263 models, 22,469 of which are protein-coding.

Blue regions represent assembly gaps
Image credit: Kerstin Howe

For GRCh38, in addition to the usual steps involved in a genebuild, we have also made clone data available. The clone sets were loaded, along with other data, into the core human database. Although these data are not required for genebuilding, the information is extremely useful for some of our users.

What stage is the annotation at?

The Genebuilders have completed the final gene set, which has been merged with manual annotation from HAVANA to create the GENCODE 20 set. The data were then passed on to other teams within Ensembl so that they could carry out the remaining analyses. This entire process of data exchange between the different Ensembl teams is coordinated by the Ensembl Production team, who also conduct a series of quality control steps along the way.

The comparative genomics team (Compara) have now generated orthologues to all other Ensembl species from the new human geneset. They’ve also revised all pairwise and multi-species whole genome and transcript alignments so that users can identify conserved and constrained regions between human and other vertebrate species. Updating with the new human assembly, therefore, means that a large part of the Compara database also needs to be updated.

The Variation team have now collected all variant and phenotype data, linking the information to other data in Ensembl. This is so that useful variation data can be accessed and interpreted by our users. The variant effect predictor (VEP), for example, is an extremely useful tool that determines effects of variants, such as SNPs or indels, on genes, transcripts, proteins, regulatory regions and phenotypes. A user simply has to input the coordinates and sequence changes of the variants of interest.

And finally, the Regulation team have used the new Ensembl regulatory annotation build to locate regions in the human genome that are involved in the regulation of gene expression.

Some sample regulatory features as seen in the Ensembl browser

Now that the last parts of the relevant analyses are being completed, the Ensembl Webteam are currently working on the Ensembl website, ensuring that all the relevant data will be accessible to you in the most user-friendly manner.

The final release is still on target for the end of July, after which the GRCh37 annotation will be available on a separate archive site. Although we have produced the GENCODE 20 gene set for the upcoming Ensembl release (e76), we are still in the process of refining it. We therefore recommend, particularly for large consortia, waiting for the GENCODE 21 release, which will be available with e77. In the meantime, until the e76 release, the human Pre! site is still up and running.

If you have any questions then please don’t hesitate to contact us, either through Twitter or by emailing helpdesk.

As mentioned in another post, due to the presence of patches in both GRCh37 and GRCh38, the assembly mapping has proven challenging.
Related to this, another novelty arises when assigning stable ids to genes.

Every time a gene set is updated for a species, we compare the newest gene set with the previous one.
If we find a perfect match between the two gene sets, the stable id assigned to the older model will be used for the new model.
Even if the model has changed slightly (longer UTR for example), we try to map the old stable id whenever possible, with a version change to indicate that it was not a perfect match.
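
The mapping rules above can be expressed as a short decision function. This sketch assumes three match levels and that a "version change" means incrementing the version by one; the class, the function and the placeholder accession for new models are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class StableId:
    accession: str
    version: int


def assign_stable_id(old, match, new_accession):
    """Decide the stable ID for a new model given its match to an old one.

    match: 'perfect', 'partial' (e.g. a longer UTR) or 'none'.
    """
    if match == "perfect":
        return StableId(old.accession, old.version)
    if match == "partial":
        # Same accession, bumped version to flag the imperfect match.
        return StableId(old.accession, old.version + 1)
    if match == "none":
        return StableId(new_accession, 1)
    raise ValueError("unknown match level: " + match)


old_id = StableId("ENSG00000168528", 3)  # version number is made up
updated = assign_stable_id(old_id, "partial", "NEW_ACCESSION_PLACEHOLDER")
```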

To provide a better comparison between the last GRCh37 gene set (e!75) and the new GRCh38 gene set (e!76), we have decided to project the old set onto the new assembly. This allows for overlap comparisons rather than simple sequence alignments. However, this means that around 2% of the genes are lost, as they cannot be mapped onto the new assembly. If these gene models are still present in the new assembly, they are assigned a new stable id.

Putting this in the perspective of patch fixes integrated into the new reference, we also have cases where two genes in GRCh37 (one on the reference, one on the patch) both match the same gene on the new reference in GRCh38.
In that case, we have decided to arbitrarily keep the longest standing stable ID, which is likely to be the one on the reference.
The stable ID which was used on the patch is recorded as retired but a link is provided to its replacement. For example, searching for ENSG00000260384 (SERINC2 gene on HG989_PATCH) will redirect the user to ENSG00000168528 (SERINC2 on the primary assembly).

This resulted in the deletion of around 3% of our genes.

In other cases, the difference between the GRCh37 reference (without patch) and the GRCh38 reference (with integrated patch fix from GRCh37) is too great to project annotations from the reference. Only annotations from the patch are then kept, along with the stable ids. For these cases, if there is a known alt_allele to a gene on the GRCh37 reference, it is added as a link to its equivalent on the patch.

Consequently, searching for ENSG00000183678 (CTAG1A gene on the GRCh37 primary assembly) will redirect the user to ENSG00000268651 (CTAG1A gene on HG1497_PATCH in GRCh37, on the primary assembly in GRCh38).
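
The redirect behaviour in the two examples above amounts to following a retired ID to its replacement. The sketch below models that with a plain dictionary holding the two ID pairs quoted in the text; the table layout and the `resolve` function are assumptions for illustration and do not reflect how Ensembl stores this history.

```python
# Retired stable ID -> replacement, using the two examples from the text.
RETIRED_TO_CURRENT = {
    "ENSG00000260384": "ENSG00000168528",  # SERINC2: patch ID -> primary ID
    "ENSG00000183678": "ENSG00000268651",  # CTAG1A: reference ID -> patch ID
}


def resolve(stable_id, redirects=RETIRED_TO_CURRENT):
    """Follow redirects until reaching an ID that has not been retired."""
    seen = set()
    while stable_id in redirects:
        if stable_id in seen:
            raise ValueError("redirect loop at " + stable_id)
        seen.add(stable_id)
        stable_id = redirects[stable_id]
    return stable_id


current = resolve("ENSG00000260384")
```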

As mentioned in the blog post about the new gene set, a new assembly implies a number of underlying changes in the gene structure.
Despite this, 95% of all the gene stable ids have been assigned to the new gene models.
With this work, we try and ensure that you will still be able to find your favourite gene using the same stable id as in GRCh37.