We are pleased to announce the public release of manual annotation on the new human GRCh38 assembly on the Vega website. This release follows the publication of a preliminary gene set on Pre! Ensembl and is one of the final steps before the release of the full human Gencode 20 gene set in Ensembl release 76.
The Vega website uses Ensembl technology to present the latest manual annotation produced by the Havana group, based at the Wellcome Trust Sanger Institute. It is aimed at researchers who want to see the most up-to-date annotation: every two weeks we run a streamlined, automated production pipeline that identifies new or updated annotation and presents it on Vega. Consequently, there is never more than 14 days between annotation being created or updated by Havana and being made available to the public.
Human GRCh38 manual annotation gene set
The overall gene numbers have not changed greatly, but a lot of work has gone on in the background to refine the gene set. The number of genes on GRC patches has fallen compared with GRCh37, as many of these patches have now been incorporated into the primary genome assembly.
The initial step in the manual annotation of the new assembly was a computational one: projecting the manual annotation from GRCh37 onto GRCh38. As part of this process we generated a list of the loci that did not project because of genomic changes. Many of them were in the regions of greatest change between assemblies, including regions of chromosomes 1, 9, 17 and X. There were about 800 of these loci, each of which needed manual intervention. This took a dedicated effort by the Havana group over about three weeks. The changes made fall into a number of categories:
(i) The use of single haplotypes across certain gene clusters, such as the XAGE and GAGE gene families on the X chromosome.
(ii) Filling, moving or even introducing gaps in the assembly to give a much more accurate representation of difficult regions. An example of such a rearrangement is the XAGE1B gene, which is now placed on the opposite strand compared to the previous assembly.
(iii) A decrease in the number of polymorphic pseudogenes due to changes made in the assembly to include a haplotype with a coding version of the gene. Polymorphic pseudogenes are coding in some individuals and disabled in other individuals due to sequence variation.
(iv) A large increase in the number of long non-coding RNAs (lncRNAs). This reflects our use of new RNA-seq and PolyA-seq data rather than the new assembly per se.
Further annotation of the new assembly is ongoing, with the focus having changed from fixing projection errors to finalizing the annotation.
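For readers who want a feel for the projection step described above, here is a minimal sketch in Python. It is not the Havana or Ensembl pipeline: the `Block` and `Locus` types, the `project` and `project_all` functions and all coordinates are hypothetical, and the sketch assumes that the two assemblies are related by simple ungapped blocks and that chromosome names are unchanged. A locus that lies wholly inside one block is carried over (with its strand flipped if the block is inverted in the new assembly); anything else is put on a list for manual review.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical ungapped alignment block shared by the old and new assembly."""
    chrom: str
    old_start: int            # 1-based inclusive start on the old assembly
    old_end: int              # 1-based inclusive end on the old assembly
    new_start: int            # lowest coordinate of the block on the new assembly
    same_strand: bool = True  # False if the block is inverted in the new assembly


@dataclass(frozen=True)
class Locus:
    name: str
    chrom: str
    start: int
    end: int
    strand: int  # +1 or -1


def project(locus: Locus, blocks: List[Block]) -> Optional[Locus]:
    """Return the locus in new-assembly coordinates, or None if it does not project."""
    for b in blocks:
        inside = (b.chrom == locus.chrom
                  and b.old_start <= locus.start
                  and locus.end <= b.old_end)
        if not inside:
            continue
        if b.same_strand:
            offset = b.new_start - b.old_start
            return Locus(locus.name, locus.chrom,
                         locus.start + offset, locus.end + offset, locus.strand)
        # Inverted block: coordinates are mirrored and the strand flips.
        new_end = b.new_start + (b.old_end - b.old_start)
        return Locus(locus.name, locus.chrom,
                     new_end - (locus.end - b.old_start),
                     new_end - (locus.start - b.old_start),
                     -locus.strand)
    return None  # no single block contains the whole locus


def project_all(loci: List[Locus],
                blocks: List[Block]) -> Tuple[List[Locus], List[Locus]]:
    """Split loci into those that project cleanly and those needing manual review."""
    projected, needs_review = [], []
    for locus in loci:
        result = project(locus, blocks)
        if result is None:
            needs_review.append(locus)
        else:
            projected.append(result)
    return projected, needs_review


# Made-up example data: one collinear block and one inverted block.
blocks = [
    Block("chr1", 1, 1000, 201),
    Block("chr1", 2001, 3000, 5001, same_strand=False),
]
loci = [
    Locus("geneA", "chr1", 100, 200, 1),    # inside the first block
    Locus("geneB", "chr1", 900, 1100, 1),   # runs off the end of the first block
    Locus("geneC", "chr1", 2101, 2200, -1), # inside the inverted block
]
projected, needs_review = project_all(loci, blocks)
```

In this toy run `geneB` ends up in `needs_review`, the analogue of the loci that needed manual intervention, while `geneC` projects but changes strand, much as described for XAGE1B.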
Merge with Ensembl gene set (Gencode 20)
The Havana manual annotation has been merged with the annotation arising from the rerun of the Ensembl genebuild pipeline. This improves the gene set, primarily by taking into account new experimental evidence generated since the manual annotation was originally performed. In addition, the comparison between the manually and automatically generated gene sets contributes to the continuous enhancement of both annotation systems. It is the merged gene set that will be released as Gencode 20.