We’re now only a couple of weeks away from releasing our full annotation of the new human genome assembly (GRCh38). Before we make it publicly available we’d like to update you on our progress and to share a few key pieces of information.
Changes in the assembly
The GRCh38 assembly is made up of 455 top-level sequences. These sequences include 24 chromosomes, mitochondrial DNA, alternative reference loci and a number of unplaced scaffolds. For the first time ever, centromere sequences have also been included in a human reference assembly. The total contig length for this new assembly is 3.4 Gb, a small increase on the previous assembly, and the total chromosomal length is 3.1 Gb (excluding haplotypes). There are 261 alternate loci, including the LRC/KIR complex on chromosome 19 (35 alternate sequences) and the MHC region on chromosome 6 (7 alternate sequences). We have aligned nearly half a million proteins and over 200,000 cDNAs to the new assembly and have annotated a total of 63,263 models, 22,469 of which are protein-coding.
For GRCh38, in addition to the usual steps involved in a genebuild, we have also made clone data available. The clone sets were loaded, along with other data, into the core human database. Although these data are not required for genebuilding, the information is extremely useful for some of our users.
What stage is the annotation at?
The Genebuilders have completed the final gene set, which has been merged with manual annotation from HAVANA to create the GENCODE 20 set. The data were then passed on to other teams within Ensembl so that they could carry out the remaining analyses. This entire process of data exchange between the different Ensembl teams is coordinated by the Ensembl Production team, who also conduct a series of quality control steps along the way.
The comparative genomics team (Compara) have now generated orthologues to all other Ensembl species from the new human geneset. They’ve also revised all pairwise and multi-species whole genome and transcript alignments so that users can identify conserved and constrained regions between human and other vertebrate species. Updating with the new human assembly, therefore, means that a large part of the Compara database also needs to be updated.
The Variation team have now collected all variant and phenotype data, linking the information to other data in Ensembl. This is so that useful variation data can be accessed and interpreted by our users. The variant effect predictor (VEP), for example, is an extremely useful tool that determines effects of variants, such as SNPs or indels, on genes, transcripts, proteins, regulatory regions and phenotypes. A user simply has to input the coordinates and sequence changes of the variants of interest.
And finally, the Regulation team have used the new Ensembl regulatory annotation build to locate regions in the human genome that are involved in the regulation of gene expression.
Now that the last parts of the relevant analyses are being completed, the Ensembl Webteam are currently working on the Ensembl website, ensuring that all the relevant data will be accessible to you in the most user-friendly manner.
The final release is still on target for the end of July, after which the GRCh37 annotation will be available on a separate archive site. Although we have produced the GENCODE 20 gene set for the upcoming Ensembl release (e76), we are still in the process of refining it. We therefore recommend, particularly for large consortia, waiting for the GENCODE 21 release, which will be available with e77. In the mean time, until the e76 release, the human Pre! site is still up and running.
If you have any questions then please don’t hesitate to contact us, either through twitter or by emailing helpdesk.