In line with EMBL-EBI policy, from the end of 2015 Ensembl will be removing support for DAS from our browser. This means that we will no longer provide our annotations over DAS and that we will not visualise third party annotation provided to us via DAS. If you have data with genomic coordinates that you wish to present in Ensembl then we recommend that you do this using TrackHubs. For annotation on other coordinate systems, we are currently working on providing support for this and will announce developments in this area over the course of the coming year. If you need more details then please get in touch with us at email@example.com.
My recent trip to Malawi as part of a Wellcome Trust Open Door Workshop has reminded me how privileged I really am. I’m an Outreach Officer, which means that I have the privilege to travel out to institutes around the world to deliver free Ensembl workshops. Most of the time, these workshops are in Europe or the US, at fancy research institutes and universities, and it’s an awesome privilege to facilitate research at these institutes.
An even greater privilege is to be involved in the Open Door Workshops on Working with the Human Genome Sequence, organised by Wellcome Trust Advanced Courses, which head out to developing countries to teach. They’re called ‘Open Door’ because all the resources we teach in them are free and open on the web, which means anyone, anywhere, with nothing but an internet connection can use them. I teach the Ensembl section of the course, but we also cover other resources from the EBI, Sanger Institute, NCBI and elsewhere.
We hold these courses at Wellcome Trust research centres, for example the Malawi-Liverpool Wellcome Trust I visited recently, which are fantastic investments by the Wellcome Trust in research around the world. Participants travel from all over the continent to attend the course; attendance is free (with selection) and the Wellcome Trust can even fund travel bursaries. It is a great privilege for me to be able to travel to these locations and to teach the participants all about Ensembl.
I am proud to present Ensembl to these workshop participants. Partly because I think it’s an amazing resource that can really facilitate research. Partly because we give it away for free, and I know this makes a huge difference to researchers whose labs are not well funded. Even in labs with £1 million grants, money is always tight, but for many of the people who attend our workshops, labs struggle with knackered PCR machines, ghost equipment that they can’t afford to buy the reagents for and a complete reliance on Open Access publishing as they can’t pay for journal subscriptions, yet they still manage to produce world-class science. If they had to choose between replacing those broken machines and a pay-per-use or subscription-only bioinformatics resource, it would really be a no-brainer. But giving them a free resource means they don’t have to make that choice. Indeed, it gives them the opportunity to carry out research that doesn’t need any expensive equipment or reagents.
The Wellcome Trust is one of the major funders of Ensembl. We are so grateful to them for allowing us to make our data freely available, so that everybody can make use of it. It really is a privilege.
The Ensembl Pre! site has been updated for four species: zebrafish (Danio rerio), rat (Rattus norvegicus), sperm whale (Physeter macrocephalus) and fugu (Takifugu rubripes).
The zebrafish assembly, GRCz10 (GCA_000002035.3), was made available by The Genome Reference Consortium in September 2014. Since the previous release, Zv9 in July 2010, the GRC has taken over the task of improving and maintaining the zebrafish assembly. The most notable changes in the chromosome landscape since the previous release can be found on chromosome 4, which has gained about 15 Mb in length. Furthermore, 94 of the 112 previously unplaced contigs are now located on chromosomes. In total, this assembly consists of 26 chromosomes and 3,399 unplaced scaffolds. The full annotation of an older zebrafish assembly, Zv9, can be found on our main website. Click here to go to the zebrafish Pre! site, where you can view alignments of zebrafish UniProt proteins and human Ensembl translations, as well as gene models projected from the previous zebrafish assembly.
The new rat assembly, Rnor_6.0 (GCA_000001895.4), was produced by The Rat Genome Sequencing and Mapping Consortium and was released in July 2014. This assembly comprises 954 toplevel sequences, 22 of which are chromosomes (chromosome Y is a new addition in this assembly), and 1,395 of which are unplaced scaffolds. The full annotation of an older rat assembly, Rnor_5.0, can be found on our main website. Otherwise, click here to visit the rat Pre! site, where you can view alignments of rat UniProt proteins and human and mouse Ensembl translations, as well as gene models projected from the previous rat assembly.
The sperm whale assembly, PhyMac_2.0.2 (GCA_000472045.1), was produced in September 2013 by The Aquatic Genome Models Consortium. The assembly does not contain any assembled chromosomes or linkage groups and is instead made up of 11,711 unplaced scaffolds. The species is an important model for a number of human conditions such as respiratory disease, metal toxicity and cancer. For example, sperm whales exposed to high levels of chromium have no adverse health effects whereas humans do. Studying this species could lead to development of treatments for human chromium-related disorders. Click here to visit the sperm whale Pre! site, where you can view alignments of human and dolphin Ensembl translations.
The fugu genome assembly, FUGU5 (GCA_000180615.2), was released in October 2011 by The Fugu Genome Sequencing Consortium. It is composed of 22 autosomal chromosomes, with a total sequence length of 391 Mb. The species was initially proposed as a useful model for annotating and understanding the human genome, as it contains a similar repertoire of genes to human yet is only roughly one-eighth of the size. It is among the smallest vertebrate genomes, and previous assemblies of this species have already proved to be useful reference genomes for identifying genes and other functional elements in other vertebrate species. The full annotation of an older fugu assembly, FUGU 4.0, can be found on our main website. Click here to visit the fugu Pre! site, where you can view alignments of human and dolphin Ensembl translations.
Please note that the archive website for Ensembl release 65 (Dec 2011) will be retired in December when version 78 is released.
This is in accordance with our rolling retirement policy, whereby archives more than three years old are retired unless they include the last instance of the previous assembly from one of our key species (human, mouse and zebrafish).
You may have noticed our beta REST server has been retired. We have replaced it with our new service, http://rest.ensembl.org, and have a handy migration guide to help you update existing scripts. Details about the new server can be found in the article published in Bioinformatics. Some of the improvements include:
- New POST endpoints
- POST messages allow users to submit a list of inputs as a single request. This is supported for the archive, lookup and vep endpoints
- The rate limit has been increased, with up to 15 requests per second allowed. Combined with POST, we were able to process 1000 variants per second!
- New /variation endpoint to retrieve variation information linked to a gene or a transcript
- New /regulatory endpoint to retrieve data from the regulatory build
- HTTPS support for clients working with a secure environment
This server provides access to the latest data in Ensembl, including the new human build on the GRCh38 assembly. For those wishing to use data from the GRCh37 assembly, a dedicated server is available on http://grch37.rest.ensembl.org
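As a rough illustration of the POST and rate-limit points above, the Python sketch below batches identifiers into POST requests to the lookup endpoint and paces itself to stay under 15 requests per second. The `/lookup/id` path and the `ids` payload key reflect the REST documentation as we understand it; the batch size of 200 and all helper names are assumptions made for this example, so please check the migration guide before relying on them.

```python
import json
import time
import urllib.request

SERVER = "http://rest.ensembl.org"   # server named in this post
MAX_REQUESTS_PER_SECOND = 15         # rate limit quoted in this post


def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_lookup_request(ids, server=SERVER):
    """Build (but do not send) one POST request carrying a list of IDs."""
    body = json.dumps({"ids": list(ids)}).encode("utf-8")
    return urllib.request.Request(
        server + "/lookup/id",
        data=body,
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )


def lookup_many(ids, batch_size=200):
    """Send the IDs in batches, never faster than the quoted rate limit.

    batch_size=200 is an arbitrary choice for this sketch, not a
    documented limit.
    """
    results = {}
    min_interval = 1.0 / MAX_REQUESTS_PER_SECOND
    for batch in chunked(list(ids), batch_size):
        started = time.monotonic()
        with urllib.request.urlopen(build_lookup_request(batch)) as response:
            results.update(json.load(response))
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
    return results
```

Calling `lookup_many(["ENSG00000157764"])` needs network access. To target GRCh37 data, the same request can be built with `server="http://grch37.rest.ensembl.org"`.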
The brand new Ensembl Regulatory Build on the new GRCh38 human assembly has been released in Ensembl 76. This involved a complete redesign of the build process with a new statistically rigorous logic, a streamlining of all the backend processes and a remapping and peak calling of all our data sets.
The build and its constituent data can be viewed directly in Ensembl via the Regulation section of “Configure This Page”, but we have also made them accessible through a public track hub. A track hub is a pre-configured set of tracks which you can load together into Ensembl or other genome browsers such as the UCSC Genome Browser. In addition to data loaded into Ensembl, it also contains tracks that summarise the data used to generate the build.
What’s in the Ensembl Regulatory Build Track Hub?
You’ve been meaning to do that data remapping a while now, but didn’t quite get to it? Or maybe you’re just feeling nostalgic? No worries, the track hub covers both GRCh37 and 38.
For each transcription factor, we calculate the probability of having binding at any position, based on the available data sets by simply dividing the number of overlapping peaks by the number of data sets. These probabilities can be viewed in the TFBS Summaries section. An overall probability of any binding is viewable in the TFBS Summary track of the Ensembl build overview section.
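The per-position calculation described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the build: the data layout (a dictionary of peak intervals per data set), the inclusive coordinates and the function names are all assumptions, and a data set is counted once if any of its peaks covers the position.

```python
def covers(peak, position):
    """True if an inclusive (start, end) peak covers the position."""
    start, end = peak
    return start <= position <= end


def binding_probability(peaks_by_dataset, position):
    """Fraction of data sets with a peak overlapping `position`."""
    if not peaks_by_dataset:
        return 0.0
    hits = sum(
        1
        for peaks in peaks_by_dataset.values()
        if any(covers(peak, position) for peak in peaks)
    )
    return hits / len(peaks_by_dataset)


# Hypothetical peaks for one transcription factor in four data sets.
datasets = {
    "dataset_a": [(100, 200)],
    "dataset_b": [(150, 250)],
    "dataset_c": [(400, 500)],
    "dataset_d": [],
}

print(binding_probability(datasets, 175))  # 2 of 4 data sets -> 0.5
```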
We use genome segmentation software (Segway) to partition the genome into regions of similar signal over these assays, and label these states as e.g. predicted promoters, enhancers or repressed regions. The segmentations for each cell type can be found in the Cell Type Segmentations tracks.
For each state of the segmentation, we also create a summary track which represents the number of cell types that have that state at any given base pair of the genome.
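To make the summary track idea concrete, here is a minimal Python sketch that counts, for every base pair in a region, how many cell types carry a given state. The tuple layout `(start, end, state)`, the inclusive coordinates and the cell type names are assumptions for this example; it also assumes segments within one cell type do not overlap, so each cell type contributes at most one to any position.

```python
def state_summary(segmentations, state, region_start, region_end):
    """Per-base count of cell types labelled `state` in an inclusive region."""
    counts = [0] * (region_end - region_start + 1)
    for segments in segmentations.values():
        for start, end, label in segments:
            if label != state:
                continue
            # Clip the segment to the requested region.
            first = max(start, region_start)
            last = min(end, region_end)
            for position in range(first, last + 1):
                counts[position - region_start] += 1
    return counts


# Hypothetical segmentations for three cell types over positions 1-10.
segmentations = {
    "cell_a": [(1, 5, "promoter"), (6, 10, "enhancer")],
    "cell_b": [(1, 3, "repressed"), (4, 10, "promoter")],
    "cell_c": [(1, 10, "promoter")],
}

print(state_summary(segmentations, "promoter", 1, 10))
# [2, 2, 2, 3, 3, 2, 2, 2, 2, 2]
```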
The Ensembl Regulatory Build
The summarised Ensembl Regulatory Build can be viewed in the “Ensembl Reg. Build” track of the Ensembl Build Overview section. For each cell type, we then annotate each feature as on or off, as displayed in the Cell Type Activity tracks.
Please note that the archive websites for Ensembl release 62 (April 2011) and 63 (June 2011) will be retired in August when version 76 is released.
This is in accordance with our rolling retirement policy, whereby archives more than three years old are retired unless they include the last instance of the previous assembly from one of our key species (human, mouse and zebrafish).
With the release of Ensembl 76 fast approaching, the variation team would like to provide more information on how we moved our variation data to the new human assembly, GRCh38. There are different methods available for re-annotating variants on a new assembly. The most accurate way would be to re-run the experiments, or the variant calling pipelines that identified the variant in the first place, on the new assembly. The material and computational resources required for such an endeavour, however, are very expensive. Therefore, we have developed computational methods so that, for most of the data, such investments are not necessary.
Because the new assembly retains much of the sequence from the previous assembly, we can use computational methods that derive the new location from information we already have about a variant, namely its:
- Location on the old assembly
- Flanking sequence (DNA sequence from the old assembly surrounding the variant)
Based on this prior knowledge we can either project or remap our variation data.
The projection algorithm compares two assemblies and computes the new location based on sequence similarity between the two assemblies. The computation of the new location is successful for ~98% of our variation data. However, when the sequence in the new assembly has changed too much compared to the old assembly, the projection fails, and for those variants we fall back on a remapping strategy, as explained below. The projection functionality is implemented in the Ensembl core API.
For the remapping approach we generate a sequence read by adding upstream and downstream sequence from the old assembly to a variant. We then map the read to the new assembly using BWA.
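The read construction can be pictured with the short Python sketch below. It is an illustration under simplifying assumptions rather than our production code: the function names are invented for the example, the BWA step itself is not shown, and the coordinate arithmetic assumes a gap-free, forward-strand alignment reported as a 1-based leftmost position. A real pipeline also has to deal with reverse-strand hits, indels in the alignment and multiple mappings.

```python
def build_read(upstream, allele, downstream):
    """Join old-assembly flanks around the variant allele to form a read."""
    return upstream + allele + downstream


def to_fasta(name, sequence):
    """Format one read as a FASTA record ready to hand to an aligner."""
    return ">{}\n{}\n".format(name, sequence)


def variant_start_from_alignment(read_start, upstream_length):
    """New 1-based variant start, given where the read maps.

    Assumes the read aligned on the forward strand without gaps.
    """
    return read_start + upstream_length


# Hypothetical variant with short flanks (real flanks would be longer).
upstream, allele, downstream = "ACGTAC", "G", "TTGCA"
read = build_read(upstream, allele, downstream)

print(to_fasta("variant_1", read), end="")
print(variant_start_from_alignment(1000, len(upstream)))  # 1006
```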
We have ~64M variants and ~69M variation features (VF) in Ensembl release 75, GRCh37. You can think of a variation feature as a combination of a variant and its location on the genome. Most variants have one variation feature. If a variant maps to multiple locations on the genome, the variant has as many variation features as it has locations on the genome.
We can divide variation features into:
- VF that map uniquely to the reference genome (chromosome 1-22, X, Y, MT)
- VF that have multiple mappings on the reference genome
- VF that are located on an alternative locus
- VF that are located on a fix patch region
We first attempted to project all VF that map uniquely to the genome or are located on alternative loci. In a second attempt we use our remapping approach for VF that couldn’t be projected. For the ~62M variants with a unique location on GRCh37, only ~200,000 variants could not be projected and were remapped to the new assembly. Variants with multiple mappings on GRCh37 have been remapped to GRCh38 using their flanking sequence information as submitted to dbSNP.
As a result, both projection and remapping create the new set of variation feature locations in release 76. We do not need to re-annotate variants located on fix patch regions from GRCh37 because the fix patch regions have been incorporated into the primary sequence for the new assembly.
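The overall decision logic described above can be summarised as a small Python sketch. The category names, the dictionary layout and the `project`/`remap` callables are placeholders invented for this example (the real projection lives in the Ensembl core API and the real remapping uses BWA), and for simplicity each feature ends up with at most one new location.

```python
def reannotate(features, project, remap):
    """Project first, remap on failure; skip fix patch features.

    `project` and `remap` take a feature and return a new location or None.
    """
    new_locations = {}
    failed = []
    for feature in features:
        category = feature["category"]
        if category == "fix_patch":
            # Fix patches are part of the new primary sequence already.
            continue
        location = None
        if category in ("unique", "alt_locus"):
            location = project(feature)
        if location is None:
            # Covers failed projections and features with multiple mappings.
            location = remap(feature)
        if location is None:
            failed.append(feature["id"])
        else:
            new_locations[feature["id"]] = location
    return new_locations, failed


# Toy stand-ins that read precomputed answers from the feature itself.
def toy_project(feature):
    return feature.get("projected")


def toy_remap(feature):
    return feature.get("remapped")


features = [
    {"id": "vf1", "category": "unique", "projected": ("1", 1000)},
    {"id": "vf2", "category": "unique", "projected": None, "remapped": ("1", 2050)},
    {"id": "vf3", "category": "multiple", "projected": ("2", 5), "remapped": ("2", 300)},
    {"id": "vf4", "category": "fix_patch"},
    {"id": "vf5", "category": "alt_locus", "projected": None, "remapped": None},
]

locations, failed = reannotate(features, toy_project, toy_remap)
```

In this toy run `vf1` is projected, `vf2` and `vf3` are remapped, `vf4` is skipped and `vf5` is reported as failed.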
The Genome Reference Consortium increased the number of alternative sequence representations for variant regions (ALT LOCI) in GRCh38. In our workflow diagram we described how we re-annotate variants to ALT LOCI that were present in GRCh37. Additionally, we provide variant annotations to new ALT LOCI by remapping variation features from the primary reference sequence (GRCh38) that overlap an alternative locus. We added ~1.5M extra variation features with this approach. However, this only gives an idea of how known variants map to ALT LOCI. Ideally, you would do the variant calling against the set of primary reference sequences and ALT LOCI. We can expect variants to be called on ALT LOCI in the near future as variant calling tools add the option of including ALT LOCI information.
Are you ready to move to GRCh38?
Ensembl provides a reliable representation of variation data on the new human assembly, GRCh38. In addition to re-annotating variation data from release 75 to 76, we also updated our data (e.g. from ClinVar, the NHLBI GO Exome Sequencing Project or from COSMIC) and projected and remapped the data where necessary to the new assembly GRCh38. But there is no need to worry if you are not yet ready to make the move to GRCh38. Starting with Ensembl release 76 we will support and update variation annotation for GRCh37 and GRCh38. If you have questions or comments, please get in touch with us.
With release 76 looming large in our calendars and the final deadlines out of the way for GRCh38 data production, it’s a good time to look back and take stock of what we’ve been doing in the Ensembl Regulation office. We have been rather quiet the past few months, working feverishly on an ambitious overhaul of our infrastructure. We’ve already given you a sneak peek at the new Ensembl Regulatory Build, so I’d like to take a look at the workhorse underlying all of our data, the ‘Ensembl Regulation Analysis Pipeline’.
The end result is a core resource that centralises epigenomic data from multiple public sources, processes them through a universal pipeline, then summarises them into easily understood annotations. Ensembl Regulation aims to be a single entry point to obtain an overview of all the available regulatory data, from individual datasets to summary annotations, all coming to a browser near you, very soon. Underlying it is a full ‘end to end’ pipeline for producing the input data to the Regulatory Build, from fastq download, to alignment, IDR processing, peak calling and finally motif alignments.
The inputs to the Regulatory Segmentation and Build are experiments (ChIP-seq and DNase-seq) describing the chromatin status (i.e. histone modifications) and transcription factor landscape across various cell lines. These experiments range from large projects (e.g. ENCODE, Roadmap Epigenomics and BLUEPRINT) through to individual experiments made accessible via archives such as the ERA/ENA, SRA and GEO.
The main outputs of the pipeline are genome alignments, peak calls and ‘collection’ files which provide coverage statistics across the genome. Managing and processing these data is no simple task, and we expect the number of available epigenomic datasets to increase significantly in the years to come. Also, with the arrival of GRCh38, we needed to reprocess all of the existing data in a short timeframe. We therefore integrated our processes into a shiny new fully automated pipeline using the ensembl-hive framework. Here follows a brief summary of the new features of the regulation analysis pipeline.
The Tracking Database
This now constitutes our main analysis and archive database, tracking the data both within our pipeline and in external repositories. In it, we register the meta-data from different projects and data repositories, providing a single point of reference to query the data available in the public domain. This has been crucial in determining which cell lines meet the requirements for a build.
Read Alignment and Peak Calling
We first align reads using BWA, then call peaks using SWEMBL for short regions and CCAT for broader ranging histone modifications. Replicates are processed in parallel to support ENCODE’s Irreproducible Discovery Rate (IDR) methodology.
Flexibility has been a key aim of the redesign, and the hive infrastructure has helped here by allowing us to define each logical part of the pipeline as a separate configuration which can be ‘topped up’ as required. This means that it’s easy to run just the read alignment stage (which we require as input to the segmentation), or at your pleasure add in the peak calling and collection file writing stages whilst it’s still running. All the necessary state information is captured in the tracking database, so it’s really easy to pick things up at any point and start running the later stages of the pipeline.
Due to the size of our input data set and the resulting rolling data footprint, we set up a garbage collection of intermediate files and added inline archiving. This has limited our footprint, and enabled us to reprocess the entire human data set in one go.
The combination of the above improvements, the new ensembl-hive implementation and a whole load of other refinements, means much less manual intervention is required, resulting in a large reduction in run times. For the alignments in particular, what was taking several weeks now takes just ~5 days!
What does the future hold?
We’ve already identified some more optimisations to the structure of the pipeline, so the runtimes are likely to drop even further. This will be crucial to handle the hundreds of cell types currently being examined within Roadmap Epigenomics, Blueprint, ENCODE 3 and other projects. We will also be revising our schemas to better reflect tissue-specific data. This is part of a larger push within Ensembl to better describe the dynamics of gene regulation and transcription.
Finally, we are keeping up with lab techniques, and will be extending our pipelines to handle newer types of data, such as chromatin conformation assays or eQTLs. Although we do not process these data ourselves, we have already integrated and remapped the FANTOM5 CAGE-tag annotations onto GRCh38.
p.s. If you want even more info, keep an eye on this page. Once release 76 is out it will be updated with our new Regulatory Build documentation.
We’re now only a couple of weeks away from releasing our full annotation of the new human genome assembly (GRCh38). Before we make it publicly available we’d like to update you on our progress and to share a few key pieces of information.
Changes in the assembly
The GRCh38 assembly is made up of 455 top-level sequences. These sequences include 24 chromosomes, mitochondrial DNA, alternative reference loci and a number of unplaced scaffolds. For the first time ever, centromere sequences have also been included in a human reference assembly. The total contig length for this new assembly is 3.4 Gb, a small increase on the previous assembly, and the total chromosomal length is 3.1 Gb (excluding haplotypes). There are 261 alternate loci, including the LRC/KIR complex on chromosome 19 (35 alternate sequences) and the MHC region on chromosome 6 (7 alternate sequences). We have aligned nearly half a million proteins and over 200,000 cDNAs to the new assembly and have annotated a total of 63,263 models, 22,469 of which are protein-coding.
For GRCh38, in addition to the usual steps involved in a genebuild, we have also made clone data available. The clone sets were loaded, along with other data, into the core human database. Although these data are not required for genebuilding, the information is extremely useful for some of our users.
What stage is the annotation at?
The Genebuilders have completed the final gene set, which has been merged with manual annotation from HAVANA to create the GENCODE 20 set. The data were then passed on to other teams within Ensembl so that they could carry out the remaining analyses. This entire process of data exchange between the different Ensembl teams is coordinated by the Ensembl Production team, who also conduct a series of quality control steps along the way.
The comparative genomics team (Compara) have now generated orthologues to all other Ensembl species from the new human geneset. They’ve also revised all pairwise and multi-species whole genome and transcript alignments so that users can identify conserved and constrained regions between human and other vertebrate species. Updating with the new human assembly, therefore, means that a large part of the Compara database also needs to be updated.
The Variation team have now collected all variant and phenotype data, linking the information to other data in Ensembl. This is so that useful variation data can be accessed and interpreted by our users. The variant effect predictor (VEP), for example, is an extremely useful tool that determines effects of variants, such as SNPs or indels, on genes, transcripts, proteins, regulatory regions and phenotypes. A user simply has to input the coordinates and sequence changes of the variants of interest.
And finally, the Regulation team have used the new Ensembl regulatory annotation build to locate regions in the human genome that are involved in the regulation of gene expression.
Now that the last parts of the relevant analyses are being completed, the Ensembl Webteam are currently working on the Ensembl website, ensuring that all the relevant data will be accessible to you in the most user-friendly manner.
The final release is still on target for the end of July, after which the GRCh37 annotation will be available on a separate archive site. Although we have produced the GENCODE 20 gene set for the upcoming Ensembl release (e76), we are still in the process of refining it. We therefore recommend, particularly for large consortia, waiting for the GENCODE 21 release, which will be available with e77. In the meantime, until the e76 release, the human Pre! site is still up and running.