In-depth knowledge of the human genome is fundamental in an array of scientific fields, such as forensics, research, anthropology and medicine. Since the completion of the Human Genome Project in 2003, thousands of human genomes have been sequenced, sequencing technology has improved significantly, and the amount of available data has vastly increased.

The new human assembly (GRCh38) arrived last week, and our objective over the next few months will be to thoroughly annotate it, ultimately providing our users with the best possible gene set.

What does the new assembly look like? 
Though the underlying genomic DNA will be identical, or very similar, to that of the previous release (GRCh37), certain improvements mean this is a particularly important assembly. These changes include:

The reference GRCh38 assembly consists of the ‘primary assembly’ and ‘alternate sequences’. The primary assembly is made up of 24 chromosomes, 42 unlocalized scaffolds and 127 unplaced scaffolds, which contain genomic sequences that have not yet been assigned to chromosomes. The alternate sequences are a collection of 261 ‘alt loci’. These include the haplotypes for the MHC region on chromosome 6, as well as shorter regions on other chromosomes where the GRC provide alternate alleles present in the population.

The figure above shows the MHC region of chromosome 6 on the reference genome.

What does it mean for our users? 
The updates will facilitate an improved understanding of the human genome, and increase the accuracy of the Ensembl annotation. It is important to note, particularly for users who use coordinate-based systems, that these changes may affect the lengths of chromosomes and the positions of many genes.

What does it mean for the Ensembl GeneBuilders?
When genome sequencing was a new technology, it was initially thought that an assembly could be represented by a single ‘Golden Path’: a set of overlapping sequences that could be selected to produce a non-redundant chromosome sequence (with gaps), fully representing the sequence at all loci. The reference human assembly, however, is not a simple linear model and it includes additional information on an array of different alleles. Fortunately, our GeneBuild pipelines have previously been updated to deal with such alternative sequences, as we faced a similar challenge with GRCh37 patches. The most prominent challenge, however, will involve careful database storage and disk space management, as we will be aligning a massive amount of data to the new assembly, such as EST (>8 million human ESTs) and RNA-Seq data.

What is involved in re-annotating the new assembly?
Even though most of the genic regions in GRCh38 will be the same as for the previous assembly, we are going to throw away all the automatically annotated gene models (keeping the manually annotated genes from Havana) and begin the entire annotation process again. While it would be far easier and quicker to just copy the pre-existing gene models, we are more interested in producing the best possible gene set.

All of the gene models we produce are based on biological evidence: a protein and/or mRNA sequence must align to the genome in order for us to annotate a gene model. Just as the genome assembly has been updated to remove incorrect sequence and to add new DNA, so too have the public databases been improved since we last produced a gene set on human. New protein and cDNA sequences are now available, and others may have been removed. We therefore have a great opportunity to refresh the entire human gene set, and possibly find many genes that could not be annotated before due to a lack of evidence.

When can users expect to see the new assembly and gene set?
We plan on releasing a Pre! site this quarter to give users a chance to view the new assembly (BLAST/BLAT will be available). In order to produce this temporary Pre! site we are aligning the old human gene set to the new assembly to indicate where we expect the genes to be. It will take approximately three months to automatically annotate the new assembly using Ensembl pipelines, and about one month to merge the automatically annotated gene models with the manually annotated ones from Havana. The gene set should be finalized during the second quarter of this year, after which the data will be passed to the other teams in Ensembl (comparative genomics, variation, and regulation teams).

The complete GRCh38 annotation will be made available on the Ensembl website in the third quarter of 2014. From then on we will support GRCh38 on our main site (www.ensembl.org). The GRCh37 assembly will still be available, but will be contained on its own site (GRCh37.ensembl.org) and will remain static. Any further updates will be done exclusively on the new assembly.

Additional information can be found on the Genome Reference Consortium blog here. There is also other useful information here in the form of a poster, which was presented by James Torrance on behalf of the GRC at the ISMB/ECCB conference in July 2013. It is important to note, however, that the poster was produced before the final version of GRCh38 was released and some of the numbers it contains may be out of date.

Please note that the archive websites for Ensembl release 61 (Feb 2011) will be retired in February this year when version 75 is released.

This is in accordance with our rolling retirement policy, whereby archives more than three years old are retired unless they include the last instance of the previous assembly from one of our key species (human, mouse and zebrafish).

For more information about how to use archives, please see our previous blog post on the topic; a list of all current archives is available on the main website.

We described in a previous post Ensembl’s new regulatory annotation of the genome. Now, we will go in greater detail into how we computed it.

We started by running ChromHMM over 17 cell types, using publicly available ENCODE and Roadmap Epigenomics data. This produced a segmentation, or annotation, of the genome for each of these cell types under 25 labels, or segmentation states. These states were given arbitrary names (P0, P1, …) after a preliminary comparison to the earlier ENCODE segmentations across six cell types.

For each state and each position in the genome, we computed the number of cell types that have that state at that position. This resulted in 25 segmentation state summary tracks. We also pulled in all the ChIP-Seq peaks from the January 2011 ENCODE data freeze, and kept those that overlapped with open chromatin (i.e. DNase I hypersensitivity) in the same cell type. From all these assays, we computed a summary track, which indicates the probability (between 0 and 1) of seeing a TF peak at any location of the genome. The segmentations and summary tracks can be seen in the illustration below.

Summary tracks generated from the segmentation and TFBS peaks
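The per-state counting step can be illustrated with a minimal sketch. This is not the pipeline's code: the function name, the toy cell types and the representation of a segmentation as one state label per fixed-size bin are all assumptions made for illustration.

```python
def state_summary_tracks(segmentations, states):
    """Count, per state and per bin, how many cell types carry that state.

    segmentations: dict mapping cell type -> list of state labels,
                   one label per genomic bin (all lists the same length).
    states:        iterable of all state labels (e.g. "P0", "P1", ...).
    Returns a dict mapping state -> list of counts, one count per bin.
    """
    n_bins = len(next(iter(segmentations.values())))
    tracks = {state: [0] * n_bins for state in states}
    for labels in segmentations.values():
        if len(labels) != n_bins:
            raise ValueError("all segmentations must cover the same bins")
        for i, state in enumerate(labels):
            tracks[state][i] += 1
    return tracks


# Toy input: three hypothetical cell types over five bins.
toy = {
    "cell_A": ["P0", "P0", "P3", "P3", "P1"],
    "cell_B": ["P0", "P1", "P3", "P3", "P1"],
    "cell_C": ["P1", "P1", "P3", "P0", "P1"],
}
tracks = state_summary_tracks(toy, ["P0", "P1", "P3"])
print(tracks["P0"])  # [2, 1, 0, 1, 0]
```

With 17 cell types and 25 states, the same loop would yield 25 tracks whose values range from 0 to 17.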

We then used TF binding data to determine the specificity of the ChromHMM states, as regulatory regions are presumably correlated with TF binding. Each function (TSS, CTCF insulator, proximal regulatory elements, distal regulatory elements) was thus associated with several states. For example, transcription start sites were strongly associated with the P0 state in the segmentations. We overlapped these state signals to define function-specific signals. A simple count threshold was set to maximise the detection of TF binding sites. This led to the regions as displayed at the bottom of the following figure:

Selected summary tracks were overlapped to define the Ensembl Regulatory features
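As a rough illustration of the thresholding step, the sketch below combines several state summary tracks into one function-specific signal and then reports runs of bins at or above a count threshold. Combining the tracks by summing them, the 200 bp bin size and the threshold value are assumptions for the example, not details confirmed by the pipeline.

```python
def combine_tracks(tracks):
    """Sum several per-bin count tracks into one signal (an assumed way to overlap them)."""
    return [sum(values) for values in zip(*tracks)]


def regions_above_threshold(counts, threshold, bin_size=200):
    """Return half-open (start, end) intervals in bp covering runs of
    consecutive bins whose count is at least `threshold`."""
    regions = []
    start = None
    for i, count in enumerate(counts):
        if count >= threshold:
            if start is None:
                start = i  # a new run begins at this bin
        elif start is not None:
            regions.append((start * bin_size, i * bin_size))
            start = None
    if start is not None:  # close a run that reaches the last bin
        regions.append((start * bin_size, len(counts) * bin_size))
    return regions


# Two hypothetical state tracks associated with the same function.
signal = combine_tracks([[2, 1, 0, 1, 0], [0, 2, 0, 0, 3]])
print(signal)                              # [2, 3, 0, 1, 3]
print(regions_above_threshold(signal, 2))  # [(0, 400), (800, 1000)]
```

In practice the threshold would be tuned against the TF binding data, as described above, rather than fixed by hand.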

These new regions concur strongly with the TF binding datasets: 73.4% of the TF ChIP-Seq peaks were captured in the ChromHMM regions, equivalent to an 8.3x enrichment with respect to the genomic average. Conversely, 24.0% of the ChromHMM-based regions were covered by observed TF ChIP-Seq peaks. To avoid losing information, TFBS peaks which were not covered by any of these elements were marked as ‘Unannotated TFBS’.

Having defined consensus regulatory regions, we returned to the original data to determine which region is active in which cell type:

The Regulatory Features are then compared to the cell type specific segmentations to determine their activity in each cell type.
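One plausible way to picture this comparison is sketched below: a feature counts as active in a cell type if any bin it overlaps carries a state associated with the feature's function in that cell type's segmentation. The rule itself, the 200 bp bin size and all names except the P0 state are assumptions for illustration only.

```python
def is_active(feature, segmentation, active_states, bin_size=200):
    """Return True if any bin overlapped by `feature` carries an active state.

    feature:       half-open (start, end) interval in bp.
    segmentation:  list of state labels, one per bin, for one cell type.
    active_states: set of states associated with the feature's function.
    """
    start, end = feature
    first_bin = start // bin_size
    last_bin = (end - 1) // bin_size
    return any(state in active_states
               for state in segmentation[first_bin:last_bin + 1])


# A hypothetical TSS feature spanning bins 1 and 2, checked in two toy cell types.
tss_feature = (200, 600)
tss_states = {"P0"}
segmentations = {
    "cell_A": ["P5", "P0", "P0", "P5"],
    "cell_B": ["P0", "P5", "P5", "P5"],
}
activity = {cell: is_active(tss_feature, seg, tss_states)
            for cell, seg in segmentations.items()}
print(activity)  # {'cell_A': True, 'cell_B': False}
```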

Validation

In the median cell type, 83% of FANTOM tags supported by three CAGE tags or more were annotated by our pipeline.

The Vista Enhancer database contains enhancer sequences validated by in vivo staining assays on transgenic mice (hats off to the VISTA team for their years of meticulous work). It currently contains 1,575 predicted enhancers, of which 807 were experimentally confirmed. 491 of those (60.8%) were picked up by our pipeline.

  • TF binding motifs

We found two estimates of active TF binding motifs:

Source                JASPAR     Arbiza et al.
Number                803,489    2,013,074
Avg. length (bp)      13.1       10.8
Total length (Mbp)    10.5       21.7
% covered             59.0       87.0
Enrichment (fold)     6.1        9.0
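The figures in this table can be cross-checked with a little arithmetic. The sketch below assumes that total length is simply number times average length, and that fold enrichment is the percentage of motifs covered divided by the percentage of the genome that is annotated. That second definition is an assumption, as is the 9.7% baseline, which is taken from the share of 200 bp bins reported in the FAQ further down.

```python
# Values copied from the table; the 9.7% baseline is an assumed denominator.
motif_sets = {
    "JASPAR": {"number": 803_489, "avg_length_bp": 13.1, "pct_covered": 59.0},
    "Arbiza et al.": {"number": 2_013_074, "avg_length_bp": 10.8, "pct_covered": 87.0},
}
PCT_GENOME_ANNOTATED = 9.7


def total_length_mbp(number, avg_length_bp):
    """Total motif length in Mbp, assuming number x average length."""
    return number * avg_length_bp / 1e6


def fold_enrichment(pct_covered, pct_genome_annotated):
    """Observed coverage relative to the coverage expected by chance."""
    return pct_covered / pct_genome_annotated


for name, row in motif_sets.items():
    length = total_length_mbp(row["number"], row["avg_length_bp"])
    enrichment = fold_enrichment(row["pct_covered"], PCT_GENOME_ANNOTATED)
    print(f"{name}: {length:.1f} Mbp, {enrichment:.1f}-fold")
```

Under these assumptions the computed totals and enrichments round to the values in the table, which suggests the enrichment column is measured against the overall annotated fraction of the genome.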

JASPAR motifs are not mapped by default to the human genome, so we are missing a few. We will therefore be remapping them in time for Ensembl release 75, in early 2014.

The computing of Ensembl’s new regulatory annotation is work in progress. If you have any ideas on the subject, feel free to leave a comment or to send them to helpdesk@ensembl.org.

GWAS after GWAS return statistically significant hits that are hard to interpret because they fall outside of coding regions, and this begs for more functional annotation of regulatory regions. We at Ensembl have been providing such an annotation for a few years now and we are now redesigning from the ground up the way we define these regions. This work is in progress and we would love to hear your suggestions and comments.

An overview of the major elements involved in gene expression regulation

In short, we are looking for all regions of the genome which display regulatory function. Much ink has been spilled over the definition of the word ‘functional’, so we’re going to expand a bit.

We propose to map out the regions of the genome that display epigenomic marks and/or transcription factor binding sites (TFBS) associated with proximal and distal regulatory elements, transcription start sites (TSS), and CTCF insulators.

Ensembl’s Regulation pipeline

We will post more details next week on Computing the New Ensembl Regulatory Annotation. To cut to the chase, we defined the following regions from publicly available ENCODE and Roadmap Epigenomics datasets:

Label               Count      Avg. length (bp)   Max. length (bp)   Total length (Mbp)
TSS                 40,249     973.2              11,400             39.2
Proximal Reg.       101,206    1,005.5            15,000             101.8
Distal Reg.         209,081    526.1              8,400              110.0
CTCF                108,284    550.1              5,200              59.6
Unannotated TFBS    163,528    155.8              1,630              25.5
All                                                                  299.2

The new Regulatory Build will allow us to separate state from function, as shown below, upstream of VNN2:

Ensembl location view (Homo sapiens, release 74) of the region upstream of VNN2

The track at the top colours the region by function, independently of the cell type. In the cell specific tracks below, the various features are greyed out if we do not have evidence of activity for that region.

We incorporated our preliminary results into a track hub, along with some of our intermediary data. Next week we will post more details on Computing Ensembl’s New Regulatory Build. We want to integrate this build officially into Ensembl release 76, sometime during the 3rd or 4th quarter of 2014.

FAQ:

Are you saying that 9.7% of the genome is functional?

Not quite. We’re saying that if you split the genome into 200 bp bins, 9.7% of them show epigenomic marks or TF binding. Remember that histone marks are measured at nucleosome resolution, so this signal is at best at a 140 bp resolution. If you add in experimental noise (typically proportional to the ChIP-Seq fragment length), the exact position of these elements on the genome is rather fuzzy. At the same time, the epigenome is a dynamic system, and we only have some assays on some cell lines. No doubt more regions will be annotated as more datasets come in.
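To see why a bin-based percentage is not the same as a base-level one, here is a small sketch that counts the share of fixed-size bins touched by at least one feature. The function name, the toy coordinates and the half-open, 0-based interval convention are assumptions for the example.

```python
def fraction_of_bins_annotated(genome_length, features, bin_size=200):
    """Fraction of fixed-size bins overlapped by at least one feature.

    features: iterable of half-open, 0-based (start, end) intervals in bp.
    """
    n_bins = -(-genome_length // bin_size)  # ceiling division
    annotated = set()
    for start, end in features:
        if end <= start:
            continue  # skip empty intervals
        first_bin = start // bin_size
        last_bin = (end - 1) // bin_size
        annotated.update(range(first_bin, last_bin + 1))
    return len(annotated) / n_bins


# Toy genome of 2,000 bp (10 bins) with two 100 bp features.
features = [(150, 250), (1000, 1100)]
bin_fraction = fraction_of_bins_annotated(2000, features)
base_fraction = sum(end - start for start, end in features) / 2000
print(bin_fraction, base_fraction)  # 0.3 0.1
```

In this toy case 10% of the bases are annotated but 30% of the bins are flagged, because a feature that straddles a bin boundary marks both bins.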

What happened to the 80.4% of the genome being functional?

This statistic from the main ENCODE paper took into account other biochemical markers, in particular those associated with transcription, which can be observed over most of the genome. We therefore recommend using curated genesets such as GENCODE to define gene bodies. Nonetheless, the number of regulatory elements and promoters described here is of the same order of magnitude as that discussed in the ENCODE paper.

Is this the same as ENCODE?

This is more than just ENCODE. There are other fascinating epigenomic surveys out there, such as Roadmap Epigenomics or BLUEPRINT to name a few. Here at Ensembl, we have started merging all these datasets (including ENCODE) to provide the most comprehensive overview possible, updating our calls as new projects come along. Also, as discussed above, we are producing a cell-type independent summary of epigenomic function, which can be used to inform studies on new cell types.

What about other species?

We focused our Regulation database primarily on human: that is what most of our users ask for, and what we have most data for. But that does not mean that we ignore other species. Ensembl already has regulatory information for mouse, and we plan on shortly expanding this to farm animals, in collaboration with the Roslin Institute.

Can you assign regulatory elements to genes?

We’re working on it. Correlations are easy to find, but multiple testing quickly gets in the way when testing 310,287 regulatory elements against 40,249 TSSs.
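The scale of the problem is easy to put into numbers. The sketch below assumes a naive design in which every regulatory element is tested against every TSS, and uses a Bonferroni correction at a significance level of 0.05. Both the all-pairs design and the choice of correction are assumptions for illustration, not a description of our method.

```python
n_elements = 310_287  # regulatory elements, from the text above
n_tss = 40_249        # transcription start sites, from the text above
alpha = 0.05          # assumed family-wise significance level

# Every element against every TSS (naive all-pairs design).
n_tests = n_elements * n_tss

# Bonferroni: each individual test must pass alpha / n_tests.
bonferroni_threshold = alpha / n_tests

# Without correction, this many pairs would pass by chance alone.
expected_false_positives = alpha * n_tests

print(f"{n_tests:,} tests")
print(f"per-test threshold: {bonferroni_threshold:.2e}")
print(f"expected chance hits at p < {alpha}: {expected_false_positives:,.0f}")
```

With roughly 12.5 billion pairs, a per-test p-value would need to fall below about 4e-12 to survive this correction, which is why restricting the candidate pairs (for example by distance) is an obvious first step.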

Remember, this is work in progress, and we would love to hear your suggestions. Please leave your comments here or drop us a line.

Today I am happy to announce Ensembl’s migration away from CVS as our primary version control system (VCS) to Git. This migration sees the end of nearly a year’s worth of work to ensure that our Git repositories provide the same historical record as CVS.

To summarise the changes:

  • Ensembl’s code is now provided from our GitHub organisation at http://github.com/Ensembl
  • We have migrated Ensembl’s versioning back to 1999
  • We will continue to back-port changes to CVS for the next 3 releases (support ending with release 77)
  • You can still download our API release tarballs from our FTP site

CVS (Concurrent Versions System) was first released in 1990 and was based on an earlier system called RCS (released in 1982). It relies on a centralised single server to hold all previous revisions, with none of this information held on the client. CVS also assumes that commits to files within the same project are independent of each other. When the Ensembl project started in 1999, we chose CVS as it was one of the best available VCS in the open source community.

Since that choice, a new breed of VCS has appeared: decentralised/distributed version control systems (DVCS). These systems favour local copies of the repositories, removing the need to communicate with a centralised server, except when sending or receiving new commits, and work with sets of file changes as a single atomic block of work. According to Black Duck’s comparison of repositories, Git is the dominant DVCS in open source projects. We have decided to use the code hosting company GitHub as the location of our repositories. GitHub has been a major contributor behind the success of Git by providing an infrastructure that promotes social coding between developers.

Whilst CVS has been a very good servant to Ensembl, the time has come to move on to better tools. We have seen other projects within EMBL-EBI and the Wellcome Trust Sanger Institute make a similar transition to Git. None of them have looked back, citing better tooling, a larger support base and an ability to support both long-term and short-term development branches. We agree and cannot wait to start using this exciting technology.

Monday morning in London: hustle and bustle of the start of the working week.
Packed trains and tube. Huge crowds on their customary way to work.

Congestion on the London underground.

I too (armed with my laptop, a course booklet and a pen) was on my way to London to deliver training in Ensembl.

Aida Santaolalla and Anita Grigoriadis organised this Ensembl browser workshop for the cancer research community at Guy’s hospital, a large NHS hospital with breathtaking views of the River Thames.

View of London from Guy’s hospital.

We had 18 participants with mixed professional roles within the hospital: postdocs, PhD students, staff researchers, principal investigators, and a physician. All had one research theme in mind: cancer.

Cancer is a broad range of diseases characterised by unregulated cell growth due to varied and complex causes, one of them being genetic predisposition. During the workshop, I demonstrated, among other things, how Ensembl could be used to find genes and variants (i.e. SNPs, CNVs and indels) that may confer such genetic predisposition. The location of variants that are associated with breast cancer, for instance, can be seen in Ensembl.

Among the attendees, 40% were completely new to Ensembl and a whopping 80% had never used Ensembl BioMart before. The comments received in my feedback survey were very rewarding: ‘This was a very good workshop where I learned the basics of Ensembl software’, ‘This workshop was particularly useful in pointing out tools in Ensembl that I wouldn’t have known existed through personal investigation.’ And better still, every single participant indicated they would use Ensembl and Ensembl BioMart more often after this workshop, and they would recommend the workshop to a colleague.

Ensembl workshop delivered. Happy participants. Great feedback. Job done.
The day was far from over though. I still had to make my way out of the city.

Monday evening in London: hustle and bustle of the end of the first day of the working week. Packed trains and tube: huge crowds on their customary way home.

Can’t deny I was chuffed to bits to go back to Ensembl HQ in quiet, tranquil Cambridgeshire.

Now I’m ready for my next adventure: University of Pavia, Italy.
Maybe I will come to your institute one day.

If you want to organise a training session in Ensembl, you can follow the steps of our hostesses at Guy’s, who contacted us to organise this successful Ensembl workshop in London for the cancer research community.

Today the Wellcome Trust Sanger Institute and the European Bioinformatics Institute (EMBL-EBI) announced plans to reorganise the Ensembl project so that we can best leverage the strengths of Ensembl’s parent institutes to capitalise on emerging opportunities in genomics.

From our users’ perspective, all existing Ensembl services including our genome browser, APIs and regular releases will continue as usual. Behind the scenes, the current Ensembl services will be consolidated at EMBL-EBI to help us strengthen the existing resources and facilitate closer links with UniProt, Ensembl Genomes and the Expression Atlas, which are all based at EMBL-EBI. Ensembl will also continue to be closely involved in the GENCODE project coordinated at Sanger.

New Ensembl activities that focus on novel methods for storing and representing human variation will be based at the Sanger Institute. These efforts will be aligned with the aims of the global alliance for the secure sharing of genomic and clinical data. As part of the reorganisation, Ensembl will also be connected more closely with the DECIPHER project. DECIPHER is an interactive, web-based service to support sharing of likely functional, rare clinical variation and engagement with the clinical community. These connections will improve clinically relevant access to the rich genome annotations provided within Ensembl.

This is an extraordinary time for genomics. Ensembl has supported and contributed to the dramatic advances in genomics over almost 15 years and we are excited about what the future holds.

Please note that the archive websites for Ensembl release 60 (Nov 2010) will be retired in mid December when version 74 is released.

This is in accordance with our rolling retirement policy, whereby archives more than three years old are retired unless they include the last instance of the previous assembly from one of our key species (human, mouse and zebrafish).

For more information about how to use archives, please see our previous blog post on the topic; a list of all current archives is available on the main website.

The Genome Reference Consortium (GRC) plans to release a new human assembly (GRCh38) later this year. What is the reason for an update? The current human genome in Ensembl, UCSC and NCBI (GRCh37) is indeed high quality. The GRC reports it’s accurate to an error rate of ~1 in 100,000 bases. However, there are still gaps in the assembly and there are a number of difficult loci still to be resolved. The new update addresses many of these issues. More reasons for the update can be found on the GRC blog.

The new assembly will be available in Ensembl next year (the third or fourth quarter of 2014). What happens between the public release of the updated human assembly and the Ensembl release of the annotated genome? A series of posts from our team will cover the work required to annotate genes, variants, and more to a high standard. These articles will reflect not only the efforts to deliver high quality annotation, but to integrate the data in a useful way for our users. At least one post each month will reflect our thorough analysis of the genome in the following areas:

  • Release cycle (how does it work?)
  • The GRCh38 assembly
  • The new regulation build: integration of ENCODE and Blueprint data
  • Coordinate mapping from one assembly version to another
  • Processing dbSNP and other sources of variants
  • High quality genes, annotating the GENCODE set
  • Updates to the Variant Effect Predictor (VEP)
  • Determining stable ids for the new gene set
  • Whole Genome Alignments: pairing up the new human assembly with other species
  • Quality control: how do we check our data?

Keep your eye on our blog for more posts in this series, marked with the category “GRCh38 Ensembl”.

Are the Ensembl databases and API a core part of your research? Did you develop tools and scripts to get exactly what you need from these resources? Why not share them with the community?

Ensembl are pleased to announce e!code, a directory of programming resources for use with the Ensembl datasets and codebase. It currently includes a selection of VEP plugins developed by the Ensembl team plus external contributions such as the Java API JEnsembl.

Please note that this new site is not a repository, only a central listing of available resources. If you would like to contribute, please be sure to read our contributor guidelines.

Visit our e!code mini-site (part of this blog) for more information!