We are pleased to announce the public release of manual annotation on the new human GRCh38 assembly on the Vega website. This release follows on from the publication of a preliminary gene set on Pre! Ensembl and represents one of the final steps before the release of the full human Gencode 20 gene set in Ensembl release 76.

Vega website.

The Vega website uses Ensembl technology to present the latest manual annotation produced by the Havana group based at the Wellcome Trust Sanger Institute. It has significance for researchers who want to see the most up-to-date annotation – every two weeks we run a streamlined, automated production pipeline that identifies new or updated annotation and presents it on Vega. Consequently, there are never more than 14 days between annotation being created or updated by Havana and being made available to the public.

Vega update gene

Annotation of gene PCDHB9 has been updated within the last two weeks

Human GRCh38 manual annotation gene set.

The actual gene numbers have not changed greatly overall, but there has been a lot of work going on in the background to refine the gene set. The numbers of genes on GRC patches have been reduced from GRCh37 as many of these patches have now been incorporated into the primary genome assembly.

The initial step in the manual annotation of the new assembly was a computational one: projecting the manual annotation from GRCh37 onto GRCh38. As part of this process we generated a list of the loci that did not project due to genomic changes. Many of them were in the regions of greatest change between assemblies, including regions of chromosomes 1, 9, 17 and X. There were about 800 of these loci, and each of them needed manual intervention. This took a dedicated effort by the Havana group over a period of about three weeks. The changes made fall into a number of categories:

(i) The use of single haplotypes across certain gene clusters, such as the XAGE and GAGE gene families on the X chromosome.

(ii) Filling, moving or even introducing gaps in the assembly to give a much more accurate representation of difficult regions. An example of such re-arrangement is the XAGE1B gene that is now placed on the opposite strand compared to the previous assembly.

(iii) A decrease in the number of polymorphic pseudogenes due to changes made in the assembly to include a haplotype with a coding version of the gene.  Polymorphic pseudogenes are coding in some individuals and disabled in other individuals due to sequence variation.

(iv) A large increase in the number of long non-coding RNAs (lncRNA) because we have been able to take advantage of new RNA-seq and PolyA-seq data rather than because of the new assembly per se.

Further annotation of the new assembly is ongoing, with the focus having changed from fixing projection errors to finalizing the annotation.

Merge with Ensembl geneset (Gencode 20)

The Havana manual annotation has been merged with the annotation arising from the rerun of the Ensembl genebuild pipeline. This improves the gene set, primarily by taking into account new experimental evidence generated since the manual annotation was originally performed. In addition, the comparison between the manually and automatically generated gene sets contributes to the continuous enhancement of both annotation systems. It is the merged gene set that will be released as Gencode 20.

The Amazon molly (Poecilia formosa) is now available on Ensembl Pre! This particular species is especially interesting to scientific research due to its origins, its method of reproduction and the manner in which it interacts with other closely related fish.

Amazon molly


The single-sex interspecies school
Considering the name of the species, you would be forgiven for thinking these fish can be found swimming around The Amazon River. The Amazon molly actually resides in the warm waters of North-eastern Mexico and Southern Texas, and derives its name from something far more interesting than its habitat.

One of very few asexual vertebrates, this fish reproduces via a process known as gynogenesis, or sperm-dependent parthenogenesis. Despite being a method of asexual reproduction, gynogenesis does involve the mating of a male with a female. However, the genetic material from the male is not incorporated into the already diploid eggs and the sperm serves only to trigger embryonic development, thereby producing clones of the mother. The entire species is therefore female, and is thus named after the legendary society of female Amazon warriors.

Life finds a way
Due to the absence of male Amazon mollies, the females act as sexual parasites by mating with males from other closely related species. These mates come from species such as P. latipinna, P. mexicana, P. latipunctata and, occasionally, P. sphenops. In fact, it is thought that the Amazon molly originated from a hybridization event between two of these species, the Atlantic molly (P. mexicana) and the Sailfin molly (P. latipinna), approximately 280 KYA. However, all attempts to create P. formosa-like hybrids in the laboratory have, so far, been unsuccessful.


Distribution of molly species in coastal regions of The Gulf of Mexico

As the male fish do not contribute their genes to the next generation, one would expect that natural selection would act against them being ‘fooled’ into mating with the heterospecific Amazon females. Furthermore, experiments indicate that the males are able to tell the difference between females of their own species and the Amazon species. So why do they mate with these Amazon mollies? Unfortunately, the answer is that we simply don’t know. However, findings have suggested that the male individuals may actually benefit from this behaviour as mating with Amazon mollies seems to make them more attractive to females from their own species. The strange relationship between the Amazon mollies and these male mollies may therefore benefit both parties.

Asexual versus sexual
The main advantage of asexual over sexual reproduction, in any species, is an increase in reproductive output. With asexual females, there is no need to produce males that cannot give birth, resulting in twice as many grandchildren as would be produced by sexual reproduction. Asexual reproduction, therefore, should be the preferred method. As Amazon molly offspring are clones of the mother in an environment in which the mother was able to survive, they are also likely to survive and reproduce. This type of reproduction helps colonize new territory very quickly, but a population that reproduces in this manner will likely be unable to adapt to changing environments. Additionally, according to an evolutionary theory known as Muller’s ratchet, deleterious mutations in small asexual populations can accumulate at a fast rate due to a lack of gene recombination, which can eventually result in extinction.

Why study the Amazon molly?
A popular endeavour in modern evolutionary biology is to explain the evolution and persistence of sexual reproduction, given the higher costs of producing male individuals when compared with asexual reproduction. One effective way to research the relative strengths and weaknesses of the two reproductive methods is to study the dynamics of the coexistence of sexual and asexual organisms. The Amazon molly’s unique situation, both with respect to the way in which it reproduces and its interaction with other molly species, makes it an extremely valuable model. It has already been used in studies focused on determining whether or not sexual selection is necessary for high diversity of the MHC. Findings have suggested that the asexual molly has polymorphic MHC loci despite its clonal reproduction, yet these loci are more polymorphic in the sexual species. The Amazon molly is also used as a model for carcinogenicity studies, and is extremely easy to breed and rear in captivity. Furthermore, the clonality of the fish allows researchers to carry out studies on individuals that are genetically identical.

Browsing the genome
The Amazon molly genome assembly was made publicly available in October 2013. We have carried out a preliminary gene annotation, generated by alignments of Ensembl human, stickleback and zebrafish translations from Ensembl release 75. You can find this information on our Pre! site.


Region of the Amazon molly genome as seen in the Ensembl browser. The gene models shown are derived from human and zebrafish proteins.

We’re extremely excited to be carrying out a complete genebuild, incorporating data such as RNASeq, which will be available in a future Ensembl release. Keep an eye on our blog to find out when, and if you have any questions feel free to contact us.


As you may know, the new GRCh38 assembly for human was released in December 2013. This is a major update for Ensembl and will require months of hard work to provide high quality annotation for our users. Our goal is to provide a full genebuild on the GRCh38 assembly, as well as regulation, comparative and variation features.

As part of the Ensembl core team, I am responsible for generating a reliable mapping between the GRCh37 and GRCh38 assemblies. This mapping will be used by other teams to project existing annotations onto new coordinates. Therefore, it is important to get this right if we don’t want to end up with features in the wrong location!

The basic principle of assembly mapping is relatively simple. Let’s say we are mapping chromosome 1 in GRCh37 to chromosome 1 in GRCh38. For both chromosomes, we get the list of contigs used to construct the chromosome. If the same contigs are used, in the same order, in both chromosomes, these can be mapped directly. For the remaining unmapped regions, where no shared contigs can be found, the sequences are aligned using lastz.
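As a rough illustration of the shared-contig step, here is a minimal Python sketch. It is not the Ensembl pipeline code: the `Placement` record, the `map_shared_contigs` function and the contig names are hypothetical, strand is ignored, and a contig only counts as shared if the same length of it is used in both assemblies. The order-preserving match is done with the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher
from typing import NamedTuple


class Placement(NamedTuple):
    contig: str  # contig accession (hypothetical IDs in the example below)
    start: int   # 1-based inclusive start on the chromosome
    end: int     # 1-based inclusive end on the chromosome


def map_shared_contigs(old, new):
    """Pair up contigs that occur in the same order in both assemblies.

    Returns (mapped, unmapped):
      mapped   -- list of ((old_start, old_end), (new_start, new_end)) pairs
      unmapped -- old placements with no direct match, i.e. the regions
                  that would be handed to an aligner such as lastz
    """
    old_ids = [p.contig for p in old]
    new_ids = [p.contig for p in new]
    matcher = SequenceMatcher(None, old_ids, new_ids, autojunk=False)

    mapped = []
    matched_old = set()
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            o = old[block.a + offset]
            n = new[block.b + offset]
            # Assumption: map directly only if the same length of contig is used.
            if o.end - o.start == n.end - n.start:
                mapped.append(((o.start, o.end), (n.start, n.end)))
                matched_old.add(block.a + offset)

    unmapped = [p for i, p in enumerate(old) if i not in matched_old]
    return mapped, unmapped


# Toy chromosome: ctgB was replaced by a shorter ctgX in the new assembly.
old_chr = [Placement("ctgA", 1, 100), Placement("ctgB", 101, 250), Placement("ctgC", 251, 400)]
new_chr = [Placement("ctgA", 1, 100), Placement("ctgX", 101, 180), Placement("ctgC", 181, 330)]
mapped, unmapped = map_shared_contigs(old_chr, new_chr)
```

In this toy case `ctgA` and `ctgC` map directly (with `ctgC` shifted by 70 bases), while `ctgB` is left over for sequence alignment.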


Results of the mapping: how similar are the assemblies?

Faced with the results, the mapping meets our expectations. For the 24 chromosomes as well as haplotype regions, we map between 95 and 100% of the non-N sequence. Out of 82 regions, 72 map over 99%. To check the consistency of these mappings, the Ensembl gene set (GENCODE) is copied from GRCh37 to GRCh38. 97% of the transcripts find an identical model in GRCh38, with 98.5% of exons mapped correctly. Only 1.5% of the total transcripts do not have an equivalent model in the new assembly. This is expected, as we know some regions in GRCh37 do not exist in GRCh38.
For example, the gene PPIAL4A, associated with CCDS30835.1, is on a reference region in GRCh37 which is overlapped by patch HG1287. In GRCh38, that region does not exist and PPIAL4A is lost. PPIAL is a family of retrogenes, and other PPIAL4 models will still be present in GRCh38.

Two additional regions have proved challenging for our mapping.

Chromosome 17:22904289-37003842:
This region in GRCh37 has become a haplotype in GRCh38 (HSCHR17_1_CTG4). As we do not provide mappings between haplotypes, we have only an approximate alignment between the reference in GRCh37 and the reference in GRCh38.

Chromosome 9:42900000-66450000, flanking the centromeric region:
This region in GRCh37 corresponds to 9:40700000-61600000 in GRCh38 and has undergone some massive changes on a sequence level. Some of the contigs have been split, shortened, extended, or simply removed. This means that gene models located on this region will change considerably from the gene models in GRCh37.

If your favourite gene is not in one of these regions though, there is a good chance you will be able to identify it using the same stable_id as in release 75! You’ll be able to read more about stable id mapping in a future post in this series.

Challenge: Patches

One of the major challenges when mapping GRCh37 to GRCh38 comes with patch regions. Shortly after GRCh37 was first released, a number of sequence differences were noticed. Rather than provide a whole new assembly, the concept of patches was introduced.

For regions where a sequencing error was corrected, a patch fix was added. It contains the corrected reference sequence as well as some padding on both ends, to locate it onto the genome.

In Ensembl, we provide annotation for both the reference and the patch region. Where the modified sequence is relatively short, a number of annotations are identical between the reference and the patch.

For example, CHAMP1 is a merged gene on chromosome 13 but has also been annotated on patch HG531_PATCH.

For regions where an alternative sequence was found, a patch novel is added.

In GRCh38, all patch novels will still exist as haplotypes. For the patch fixes, it is another story altogether. Given that these patches fix an error in the GRCh37 reference sequence, they will become the reference in GRCh38, replacing the GRCh37 sequence. This means that we are likely to keep the annotation produced on patches in GRCh37 while losing the GRCh37 reference annotation.

To deal with the special patch cases, we add an additional step in the assembly mapping. For patch fixes in GRCh37, we know their contig composition, as well as where they are mapped against the reference. Presuming the contig composition has not changed, we should be able to locate the same region in the reference in GRCh38. It should then be possible to map any feature in GRCh37, whether on patch or reference, onto GRCh38.
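The idea of locating a patch feature through its contig composition can be sketched as follows. This is a simplified illustration rather than the actual mapping code: the `Placement` record, the `project_position` function and the contig names are hypothetical, every contig is assumed to be used whole and on the forward strand, and a position is only projected when its contig appears with the same length in the target.

```python
from typing import NamedTuple, Optional, Sequence


class Placement(NamedTuple):
    contig: str  # contig accession (hypothetical IDs in the example below)
    start: int   # 1-based inclusive start on the patch or chromosome
    end: int     # 1-based inclusive end on the patch or chromosome


def project_position(pos: int,
                     source: Sequence[Placement],
                     target: Sequence[Placement]) -> Optional[int]:
    """Project a 1-based position from `source` onto `target` via a shared contig.

    Returns None if the position lies in a gap, or if its contig is missing
    from the target or is placed there with a different length.
    """
    target_by_contig = {p.contig: p for p in target}
    for p in source:
        if p.start <= pos <= p.end:
            hit = target_by_contig.get(p.contig)
            if hit is None or hit.end - hit.start != p.end - p.start:
                return None
            # Same offset into the contig, new position on the target.
            return hit.start + (pos - p.start)
    return None


# Toy example: a GRCh37 patch fix built from two contigs that both
# reappear inside the (hypothetical) GRCh38 reference chromosome.
patch_grch37 = [Placement("ctgP1", 1, 500), Placement("ctgP2", 501, 900)]
reference_grch38 = [
    Placement("ctgQ", 1, 10000),
    Placement("ctgP1", 10001, 10500),
    Placement("ctgP2", 10501, 10900),
]

new_pos = project_position(520, patch_grch37, reference_grch38)
missing = project_position(950, patch_grch37, reference_grch38)
```

Here position 520 on the patch lies 19 bases into `ctgP2`, so it lands at 10520 on the new reference; position 950 is outside the patch and returns `None`.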

Working as an Outreach Officer for Ensembl means lots of exciting adventures around the world to teach Ensembl to local scientists. Last month I was privileged to be able to travel to South Korea to give an ENCODE workshop with Bob Kuhn from UCSC.

Bob and me, with our Korean hosts


The participants were interested to learn about how Ensembl and UCSC use genome assemblies from projects like the Genome Reference Consortium. It was clear that many people had not realised the extent of the work that goes into producing a genome assembly, by sequencing the genome in contigs then putting them back together (learn more in this video), or how errors in the human genome are dealt with using patches. I was able to explain to them how Ensembl works together with groups like Havana to produce the genes for the trusted GENCODE gene sets for human and mouse, and how they could find out about these genes in Ensembl.

Even more excitingly, I got to preview some new data. The Ensembl regulation team gave me access to their new regulatory build track hub, which you can learn more about in Daniel Zerbino’s blog post. I was able to show off how Ensembl brings together and processes the raw ChIP-seq data from ENCODE and other sources to try to identify where regulatory features might be on a genome-wide scale, and the activities of those features. It was exciting to be able to preview something new for the workshop participants, and their feedback suggests that the new regulatory build is going to be a hit.

Bob showed off the UCSC genome browser in the same workshop, and we worked together well. Though we have competing browsers, we’re really on the same team: the team that believes high quality genomic data and tools should be available to all and works hard to provide that. We can learn from each other to provide the best way of giving our users the data and analyses they need.

My next stop is the Open Door Workshop in Uruguay. Don’t forget, you can host a workshop at your institute to learn the basics of using Ensembl, and to find out about the latest Ensembl functionality and what’s coming with the new human assembly.

In-depth knowledge of the human genome is fundamental in an array of scientific fields, such as forensics, research, anthropology and medicine. Since the completion of the Human Genome Project in 2003 thousands of human genomes have been sequenced, sequencing technology has improved significantly, and the amount of available data has vastly increased.

The new human assembly (GRCh38) arrived last week, and our objective over the next few months will be to thoroughly annotate it, ultimately providing our users with the best possible gene set.

What does the new assembly look like? 
Though the underlying genomic DNA will be identical, or very similar, to that of the previous release (GRCh37), certain improvements mean this is a particularly important assembly. These changes include:

The reference GRCh38 assembly consists of the ‘primary assembly’ and ‘alternate sequences’. The primary assembly is made up of 24 chromosomes, 42 unlocalized scaffolds and 127 unplaced scaffolds, which contain genomic sequences that have not yet been assigned to chromosomes. The alternate sequences are a collection of 261 ‘alt loci’. These include the haplotypes for the MHC region on chromosome 6, as well as shorter regions on other chromosomes where the GRC provide alternate alleles present in the population.

mhc region

The figure above shows the MHC region of chromosome 6 on the reference genome.

What does it mean for our users? 
The updates will facilitate an improved understanding of the human genome, and increase the accuracy of the Ensembl annotation. It is important to note, particularly for users who use coordinate-based systems, that these changes may affect the lengths of chromosomes and the positions of many genes.

What does it mean for the Ensembl GeneBuilders?
When genome sequencing was a new technology it was initially thought that an assembly could be represented by a single ‘Golden Path’; a set of overlapping sequences that could be selected to produce a non-redundant chromosome sequence (with gaps), fully representing the sequence at all loci. The reference human assembly, however, is not a simple linear model and it includes additional information on an array of different alleles. Fortunately, our GeneBuild pipelines have previously been updated to deal with such alternative sequences, as we faced a similar challenge with GRCh37 patches. The most prominent challenge, however, will involve careful database storage and disk space management as we will be aligning a massive amount of data to the new assembly, such as EST (>8 million human ESTs) and RNASeq data.

What is involved in re-annotating the new assembly?
Even though most of the genic regions in GRCh38 will be the same as for the previous assembly, we are going to throw away all the automatically annotated gene models (keeping the manually annotated genes from Havana) and begin the entire annotation process again. While it would be far easier and quicker to just copy the pre-existing gene models, we are more interested in producing the best possible gene set.

All of the gene models we produce are based on biological evidence: a protein and/or mRNA sequence must align to the genome in order for us to annotate a gene model. Just as the genome assembly has been updated to remove incorrect sequence and to add new DNA, so too have the public databases been improved since we last produced a gene set on human. New protein and cDNA sequences are now available, and others may have been removed. We therefore have a great opportunity to refresh the entire human gene set, and possibly find many genes that could not be annotated before due to a lack of evidence.

When can users expect to see the new assembly and gene set?
We plan on releasing a Pre! site this quarter to give users a chance to view the new assembly (BLAST/BLAT will be available). In order to produce this temporary Pre! site we are aligning the old human gene set to the new assembly to indicate where we expect the genes to be. It will take approximately three months to automatically annotate the new assembly using Ensembl pipelines, and about one month to merge the automatically annotated gene models with the manually annotated ones from Havana. The gene set should be finalized during the second quarter of this year, after which the data will be passed to the other teams in Ensembl (comparative genomics, variation, and regulation teams).

The complete GRCh38 annotation will be made available on the Ensembl website in the third quarter of 2014. From then on we will support GRCh38 on our main site (www.ensembl.org). The GRCh37 assembly will still be available, but will be contained on its own site (GRCh37.ensembl.org) and will remain static. Any further updates will be done exclusively on the new assembly.

Additional information can be found on the Genome Reference Consortium blog here. There is also other useful information here in the form of a poster, which was presented by James Torrance on behalf of the GRC at the ISMB/ECCB conference in July 2013. It is important to note, however, that the poster was produced before the final version of GRCh38 was released and some of the numbers it contains may be out of date.

Please note that the archive websites for Ensembl release 61 (Feb 2011) will be retired in February this year when version 75 is released.

This is in accordance with our rolling retirement policy, whereby archives more than three years old are retired unless they include the last instance of the previous assembly from one of our key species (human, mouse and zebrafish).

For more information about how to use archives, please see our previous blog post on the topic; a list of all current archives is available on the main website.

We described in a previous post Ensembl’s new regulatory annotation of the genome. Now, we will go in greater detail into how we computed it.

We started by running ChromHMM over 17 cell types, using publicly available ENCODE and Roadmap Epigenomics data. This produced a segmentation, or annotation, of the genome for each of these cell types under 25 labels, or segmentation states. These states were given arbitrary names (P0, P1, …) after a preliminary comparison to the earlier ENCODE segmentations across six cell types.

For each state and each position in the genome, we computed the number of cell types that have that state at that position. This resulted in 25 segmentation state summary tracks. We also pulled in all the ChIP-seq peaks from the January 2011 ENCODE data freeze, and kept those that overlapped with open chromatin (i.e. DNase I hypersensitivity) in the same cell type. From all these assays, we computed a summary track, which indicates the probability (between 0 and 1) of seeing a TF peak at any location of the genome. The segmentations and summary tracks can be seen in the illustration below.
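The per-state counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the genome is represented as a list of fixed-size bins per cell type, and the function name `state_summary_tracks`, the cell type names and the three-state toy data are all hypothetical.

```python
def state_summary_tracks(segmentations, states):
    """Count, per state and per bin, how many cell types carry that state.

    segmentations -- dict mapping cell type -> list of state labels, one per bin
    states        -- iterable of all state labels (e.g. "P0", "P1", ...)
    Returns a dict mapping state -> list of counts, one per bin.
    """
    n_bins = len(next(iter(segmentations.values())))
    tracks = {state: [0] * n_bins for state in states}
    for cell_type, labels in segmentations.items():
        if len(labels) != n_bins:
            raise ValueError(f"{cell_type} has {len(labels)} bins, expected {n_bins}")
        for i, label in enumerate(labels):
            tracks[label][i] += 1
    return tracks


# Toy data: three cell types, four bins, three states.
toy_segmentations = {
    "cellA": ["P0", "P0", "P1", "P2"],
    "cellB": ["P0", "P1", "P1", "P2"],
    "cellC": ["P2", "P0", "P1", "P2"],
}
toy_tracks = state_summary_tracks(toy_segmentations, ["P0", "P1", "P2"])
```

With the real data this would give 25 tracks, each holding a value between 0 and 17 at every position.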

Summary tracks generated from the segmentation and TFBS peaks

We then used TF binding data to determine the specificity of the ChromHMM states, as regulatory regions are presumably correlated with TF binding. Each function (TSS, CTCF insulator, proximal regulatory elements, distal regulatory elements) was thus associated with several states. For example, transcription start sites were strongly associated with the P0 state in the segmentations. We overlapped these state signals to define function-specific signals. A simple count threshold was set to maximise the detection of TF binding sites. This led to the regions as displayed at the bottom of the following figure:

Selected summary tracks were overlapped to define the Ensembl Regulatory features

These new regions concur strongly with the TF binding datasets: 73.4% of the TF ChIP-seq peaks were captured in the ChromHMM regions, equivalent to an 8.3x enrichment with respect to the genomic average. Conversely, 24.0% of the ChromHMM-based regions were covered by observed TF ChIP-seq peaks. To avoid losing information, TFBS peaks which were not covered by any of these elements were marked as ‘Unannotated TFBS’.
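The post does not spell out how the enrichment figure was derived. One common definition, assumed here, is the fraction of peaks captured divided by the fraction of the genome that the regions cover. The Python sketch below applies that definition to a toy genome; the function names and all coordinates are hypothetical, and intervals are half-open `(start, end)` pairs.

```python
def covered_length(intervals):
    """Total number of bases covered by half-open intervals, merging overlaps."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def capture_and_enrichment(peaks, regions, genome_length):
    """Return (fraction of peaks overlapping a region, fold enrichment).

    Fold enrichment is assumed to be the captured fraction divided by
    the fraction of the genome covered by the regions.
    """
    captured = sum(
        any(p_start < r_end and r_start < p_end for r_start, r_end in regions)
        for p_start, p_end in peaks
    )
    frac_captured = captured / len(peaks)
    frac_genome = covered_length(regions) / genome_length
    return frac_captured, frac_captured / frac_genome


# Toy genome of 1,000 bp: the regions cover 150 bp (15%) and
# three of the four peaks overlap a region (75%).
toy_regions = [(100, 150), (140, 200), (600, 650)]
toy_peaks = [(110, 120), (190, 210), (300, 310), (640, 660)]
frac, fold = capture_and_enrichment(toy_peaks, toy_regions, 1000)
```

In the toy case 75% of peaks fall in regions covering 15% of the genome, a roughly five-fold enrichment.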

Having defined consensus regulatory regions, we returned to the original data to determine which regions are active in which cell lines:

The Regulatory Features are then compared to the cell type specific segmentations to determine their activity in each cell type.


In the median cell type, 83% of FANTOM tags supported by three CAGE tags or more were annotated by our pipeline.

The Vista Enhancer database contains enhancer sequences validated by in vivo staining assays on transgenic mice (hats off to the VISTA team for their years of meticulous work). It currently contains 1,575 predicted enhancers, of which 807 were experimentally confirmed. 491 of those (60.8%) were picked up by our pipeline.

  • TF binding motifs

We found two estimates of active TF binding motifs:

Source                JASPAR       Arbiza et al.
Number                803,489      2,013,074
Avg. length (bp)      13.1         10.8
Total length (Mbp)    10.5         21.7
% covered             59.0         87.0
Enrichment (fold)     6.1          9.0

JASPAR motifs are not mapped by default to the human genome, so we are missing a few. We will therefore be remapping them in time for Ensembl release 75, in early 2014.

The computing of Ensembl’s new regulatory annotation is work in progress. If you have any ideas on the subject, feel free to leave a comment or to send them to helpdesk@ensembl.org.

GWAS after GWAS return statistically significant hits that are hard to interpret because they fall outside of coding regions, and this begs for more functional annotation of regulatory regions. We at Ensembl have been providing such an annotation for a few years now and we are now redesigning from the ground up the way we define these regions. This work is in progress and we would love to hear your suggestions and comments.

Regulation diagram

An overview of the major elements involved in gene expression regulation

In short, we are looking for all regions of the genome which display regulatory function. Much ink has been spilled over the definition of the word ‘functional’, so we’re going to expand a bit.

We propose to map out the regions of the genome that display epigenomic marks and/or transcription factor binding sites (TFBS) associated with proximal and distal regulatory elements, transcription start sites (TSS), and CTCF insulators.

Ensembl’s Regulation pipeline

We will post more details next week on Computing the New Ensembl Regulatory Annotation. To cut to the chase, we defined the following regions from publicly available ENCODE and Roadmap Epigenomics datasets:

Label               Count      Avg. length (bp)   Max. length (bp)   Total length (Mbp)
TSS                 40,249     973.2              11,400             39.2
Proximal Reg.       101,206    1,005.5            15,000             101.8
Distal Reg.         209,081    526.1              8,400              110.0
CTCF                108,284    550.1              5,200              59.6
Unannotated TFBS    163,528    155.8              1,630              25.5
All                                                                  299.2

The new Regulatory Build will allow us to separate state from function, as shown below, upstream of VNN2:


The track at the top colours the region by function, independently of the cell type. In the cell specific tracks below, the various features are greyed out if we do not have evidence of activity for that region.

We incorporated our preliminary results into a track hub, along with some of our intermediary data. Next week we will post more details on Computing Ensembl’s New Regulatory Build. We want to integrate this build officially into Ensembl release 76, sometime during the 3rd or 4th quarter of 2014.


Are you saying that 9.7% of the genome is functional?

Not quite. We’re saying that if you split the genome into 200 bp bins, 9.7% of them show epigenomic marks or TF binding. Remember that histone marks are measured at nucleosome resolution, so this signal is at best at a 140 bp resolution. If you add in experimental noise (typically proportional to the ChIP-seq fragment length), the exact position of these elements on the genome is rather fuzzy. At the same time, the epigenome is a dynamic system, and we only have some assays on some cell lines. No doubt more regions will be annotated as more datasets come in.
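To make the binning argument concrete, here is a small Python sketch that counts the fraction of 200 bp bins touched by at least one feature. It is an illustration only: the function name `fraction_of_bins_marked` and the toy coordinates are hypothetical, and features are half-open, 0-based `(start, end)` intervals.

```python
def fraction_of_bins_marked(features, genome_length, bin_size=200):
    """Fraction of fixed-size bins overlapped by at least one feature."""
    n_bins = -(-genome_length // bin_size)  # ceiling division
    marked = set()
    for start, end in features:
        if end <= start:
            continue  # skip empty intervals
        first_bin = start // bin_size
        last_bin = (end - 1) // bin_size
        marked.update(range(first_bin, last_bin + 1))
    return len(marked) / n_bins


# Toy genome of 2,000 bp (10 bins): a 100 bp feature straddling a bin
# boundary marks two bins, and a single base marks a whole bin.
toy_features = [(150, 250), (1000, 1001)]
toy_fraction = fraction_of_bins_marked(toy_features, 2000)
```

Here 101 bp of features mark three of the ten bins, which is why a bin-level percentage should be read as an upper bound on the sequence actually involved.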

What happened to the 80.4% of the genome being functional?

This statistic from the main ENCODE paper took into account other biochemical markers, in particular those associated with transcription, which can be observed over most of the genome. We therefore recommend using curated genesets such as GENCODE to define gene bodies. Nonetheless, the number of regulatory elements and promoters described here is of the same order of magnitude as that discussed in the ENCODE paper.

Is this the same as ENCODE?

This is more than just ENCODE. There are other fascinating epigenomic surveys out there, such as Roadmap Epigenomics or BLUEPRINT to name a few. Here at Ensembl, we have started merging all these datasets (including ENCODE), and provide the most comprehensive overview possible, updating our calls as new projects come along. Also, as discussed above, we are producing a cell-type independent summary of epigenomic function, which can be used to inform studies on new cell types.

What about other species?

We focused our Regulation database primarily on human: that is what most of our users ask for, and what we have most data for. But that does not mean that we ignore other species. Ensembl already has regulatory information for mouse, and we plan on shortly expanding this to farm animals, in collaboration with the Roslin Institute.

Can you assign regulatory elements to genes?

We’re working on it. Correlations are easy to find, but multiple testing quickly gets in the way when testing 310,287 regulatory elements against 40,249 TSSs.
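To give a sense of the scale of the multiple testing problem, the short Python sketch below counts the pairwise tests and derives a per-test significance threshold. The Bonferroni correction and the 0.05 family-wise error rate are assumptions chosen for illustration; the post does not say which correction would be used, and the function name is hypothetical.

```python
def pairwise_tests_and_threshold(n_elements, n_tss, alpha=0.05):
    """Return (number of element-TSS pairs, Bonferroni per-test threshold)."""
    n_tests = n_elements * n_tss
    return n_tests, alpha / n_tests


# Counts quoted in the text: regulatory elements and TSSs.
n_tests, threshold = pairwise_tests_and_threshold(310_287, 40_249)
print(f"{n_tests:,} pairwise tests")
print(f"per-test p-value threshold: {threshold:.2e}")
```

Testing every element against every TSS means over 12 billion comparisons, so under this correction a correlation would need a p-value of about 4e-12 to count as significant.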

Remember, this is work in progress, and we would love to hear your suggestions. Please leave your comments here or drop us a line.

Today I am happy to announce Ensembl’s migration away from CVS as our primary version control system (VCS) to Git. This migration sees the end of nearly a year’s worth of work to ensure that our Git repositories provide the same historical record as CVS.

To summarise the changes:

  • Ensembl’s code is now provided from our GitHub organisation at http://github.com/Ensembl
  • We have migrated Ensembl’s versioning back to 1999
  • We will continue to back-port changes to CVS for the next 3 releases (support ending with release 77)
  • You can still download our API release tarballs from our FTP site

CVS (Concurrent Versions System) was first released in 1990 and was based on an earlier system called RCS (released in 1982). It relies on a single centralised server to hold all previous revisions, with none of this information held on the client. CVS also assumes that commits to files within the same project are independent of each other. When the Ensembl project started in 1999, we chose CVS as it was one of the best available VCS in the open source community.

Since that choice, a new breed of VCS has appeared: decentralised/distributed version control systems (DVCS). These systems favour local copies of the repositories, removing the need to communicate with a centralised server, except when sending or receiving new commits, and work with sets of file changes as a single atomic block of work. According to Black Duck’s comparison of repositories, Git is the dominant DVCS in open source projects. We have decided to use the code hosting company GitHub as the location of our repositories. GitHub has been a major contributor behind the success of Git by providing an infrastructure that promotes social coding between developers.

Whilst CVS has been a very good servant to Ensembl, the time has come to move on to better tools. We have seen other projects within EMBL-EBI and the Wellcome Trust Sanger Institute make a similar transition to Git. None of them have looked back, citing better tooling, a larger support base and an ability to support both long-term and short-term development branches. We agree and cannot wait to start using this exciting technology.