A GeneBuilder’s Perspective on the New Human Genome Assembly

In-depth knowledge of the human genome is fundamental to an array of scientific fields, such as forensics, research, anthropology and medicine. Since the completion of the Human Genome Project in 2003, thousands of human genomes have been sequenced, sequencing technology has improved significantly, and the amount of available data has vastly increased.

The new human assembly (GRCh38) arrived last week, and our objective over the next few months will be to thoroughly annotate it, ultimately providing our users with the best possible gene set.

What does the new assembly look like? 
Though the underlying genomic DNA will be identical, or very similar, to that of the previous release (GRCh37), certain improvements mean this is a particularly important assembly. These changes include:

The reference GRCh38 assembly consists of the ‘primary assembly’ and ‘alternate sequences’. The primary assembly is made up of 24 chromosomes, 42 unlocalized scaffolds and 127 unplaced scaffolds, which contain genomic sequences that have not yet been assigned to chromosomes. The alternate sequences are a collection of 261 ‘alt loci’. These include the haplotypes for the MHC region on chromosome 6, as well as shorter regions on other chromosomes where the GRC provide alternate alleles present in the population.

[Figure: MHC region]

The figure above shows the MHC region of chromosome 6 on the reference genome.

What does it mean for our users? 
The updates will facilitate an improved understanding of the human genome, and increase the accuracy of the Ensembl annotation. It is important to note, particularly for users who rely on coordinate-based systems, that these changes may affect the lengths of chromosomes and the positions of many genes.
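To illustrate why coordinate-based workflows are affected, here is a minimal sketch of how a position might be translated from one assembly to another using a table of aligned blocks. The block coordinates, the `Block` type and the `remap` function are hypothetical and invented purely for illustration; they are not real GRCh37-to-GRCh38 mappings, nor an Ensembl API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Block:
    """One ungapped aligned block between two assemblies (1-based, inclusive)."""
    old_start: int
    old_end: int
    new_start: int


# Hypothetical blocks for a single chromosome (illustration only):
# 50 bases were inserted in the new assembly after old position 1000,
# and old positions 2001-2100 were removed from the new assembly.
BLOCKS = (
    Block(old_start=1, old_end=1000, new_start=1),
    Block(old_start=1001, old_end=2000, new_start=1051),
    Block(old_start=2101, old_end=3000, new_start=2051),
)


def remap(position: int, blocks: Sequence[Block] = BLOCKS) -> Optional[int]:
    """Translate an old-assembly position to the new assembly.

    Returns None when the position falls in sequence that has no
    counterpart in the new assembly.
    """
    for block in blocks:
        if block.old_start <= position <= block.old_end:
            return block.new_start + (position - block.old_start)
    return None


if __name__ == "__main__":
    for pos in (500, 1500, 2050, 2500):
        print(pos, "->", remap(pos))
```

In this toy example a position before the insertion keeps its coordinate, positions after it are shifted, and a position inside the removed sequence cannot be mapped at all. Stored coordinates therefore need to be re-derived rather than assumed to carry over.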

What does it mean for the Ensembl GeneBuilders?
When genome sequencing was a new technology, it was thought that an assembly could be represented by a single ‘Golden Path’: a set of overlapping sequences that could be selected to produce a non-redundant chromosome sequence (with gaps), fully representing the sequence at all loci. The reference human assembly, however, is not a simple linear model and it includes additional information on an array of different alleles. Fortunately, our GeneBuild pipelines have previously been updated to deal with such alternative sequences, as we faced a similar challenge with GRCh37 patches. The most prominent challenge, however, will involve careful database storage and disk space management, as we will be aligning a massive amount of data to the new assembly, such as EST (>8 million human ESTs) and RNASeq data.
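The ‘Golden Path’ idea mentioned above can be sketched in a few lines: given overlapping sequences placed on a chromosome, pick a non-redundant tiling and record the gaps between them. This is a simplified illustration under our own assumptions (clones are already placed on chromosome coordinates, and the `golden_path` function and clone names are hypothetical); it is not the method used by the GRC or by Ensembl.

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]    # (name, start, end), 1-based inclusive
Segment = Tuple[str, int, int]  # (clone name or "GAP", start, end)


def golden_path(clones: List[Clone]) -> List[Segment]:
    """Build a non-redundant path from overlapping clones.

    Each chromosome position is represented by at most one clone;
    uncovered stretches are reported as "GAP" segments.
    """
    path: List[Segment] = []
    covered_to = 0  # last chromosome position already represented
    # Visit clones by start position; for equal starts, longest first.
    for name, start, end in sorted(clones, key=lambda c: (c[1], -c[2])):
        if end <= covered_to:
            continue  # this clone adds no new sequence
        if start > covered_to + 1:
            path.append(("GAP", covered_to + 1, start - 1))
        # Use only the part of the clone that is not yet covered.
        path.append((name, max(start, covered_to + 1), end))
        covered_to = end
    return path


if __name__ == "__main__":
    # Hypothetical clone placements, for illustration only.
    clones = [("A", 1, 100), ("B", 80, 200), ("C", 90, 150), ("D", 251, 300)]
    for segment in golden_path(clones):
        print(segment)
```

A single linear path like this has no place for alternative alleles at the same locus, which is why the alt loci in the reference assembly need separate handling.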

What is involved in re-annotating the new assembly?
Even though most of the genic regions in GRCh38 will be the same as for the previous assembly, we are going to throw away all the automatically annotated gene models (keeping the manually annotated genes from Havana) and begin the entire annotation process again. While it would be far easier and quicker to just copy the pre-existing gene models, we are more interested in producing the best possible gene set.

All of the gene models we produce are based on biological evidence: a protein and/or mRNA sequence must align to the genome in order for us to annotate a gene model. Just as the genome assembly has been updated to remove incorrect sequence and to add new DNA, so too have the public databases been improved since we last produced a gene set on human. New protein and cDNA sequences are now available, and others may have been removed. We therefore have a great opportunity to refresh the entire human gene set, and possibly find many genes that could not be annotated before due to a lack of evidence.

When can users expect to see the new assembly and gene set?
We plan on releasing a Pre! site this quarter to give users a chance to view the new assembly (BLAST/BLAT will be available). In order to produce this temporary Pre! site we are aligning the old human gene set to the new assembly to indicate where we expect the genes to be. It will take approximately three months to automatically annotate the new assembly using Ensembl pipelines, and about one month to merge the automatically annotated gene models with the manually annotated ones from Havana. The gene set should be finalized during the second quarter of this year, after which the data will be passed to the other teams in Ensembl (comparative genomics, variation, and regulation teams).

The complete GRCh38 annotation will be made available on the Ensembl website in the third quarter of 2014. From then on we will support GRCh38 on our main site (www.ensembl.org). The GRCh37 assembly will still be available, but will be contained on its own site (GRCh37.ensembl.org) and will remain static. Any further updates will be done exclusively on the new assembly.

Additional information can be found on the Genome Reference Consortium blog. Other useful information is available in the form of a poster, which was presented by James Torrance on behalf of the GRC at the ISMB/ECCB conference in July 2013. It is important to note, however, that the poster was produced before the final version of GRCh38 was released and some of the numbers it contains may be out of date.