We’re excited to announce that the initial annotation of the new human assembly, GRCh38, is now live on the Ensembl Pre! site. A new human assembly is always an exciting (and scary) prospect for both Ensembl and the wider scientific community. Since the release of GRCh38 we’ve been working tirelessly to produce annotation on the new assembly, and the completion of the human Pre! site is the first major milestone on that journey.
Human GRCh38 assembly information:
GRCh38 was produced by the Genome Reference Consortium and released in December 2013. It consists of 24 chromosomes (1-22, X and Y), 127 unplaced scaffolds and 42 unlocalized scaffolds. In addition, there are 261 haplotypes and for the first time centromeres are represented in the assembly.
Preliminary annotation of GRCh38:
Ensembl Pre! sites are designed to offer users an early glimpse at the data, through a range of analyses that include repeat finding, gene prediction, nucleotide and protein alignments, and the construction of an initial set of gene and transcript models using the aforementioned analyses. We have imported or mapped pre-existing models from RefSeq, CCDS and GENCODE 19. The GRCh38 Pre! site includes:
- RepeatMasker and Dust analyses, masking 50.46% of the genome
- Centromeres covering 2% of the genome
- Total Genscan predictions: 50,117
- Distinct UniGene BLAST hits: 348,921
- Distinct UniProt BLAST hits: 451,431
- GENCODE 19 models mapped from e!75: 61,349
- CCDS models mapped from e!75: 28,781
- RefSeq annotation release 106 models: 26,670
- Human-specific cDNA and EST alignments
We have now begun construction of a new set of Ensembl gene models for GRCh38. Human-specific data are key to building a high-quality final set of Ensembl gene models. For the Pre! site, just over 80,000 uniquely accessioned, human-specific proteins were aligned to the genome to create preliminary gene models. Note that these are an initial set of raw gene models; during the full annotation process for release 76 we will create models from multiple data sources and carry out a filtration step to remove low-quality or redundant models. This will bring the final gene count in line with that of the previous assembly.
Variation data for GRCh38:
As human is one of our flagship species, we have also prepared a preliminary view of variation data on the new assembly.
The variation team has mapped all variant locations from GRCh37 to GRCh38 using a method developed by the Ensembl core team to project features between assemblies. Using this method we successfully projected 98.5% of our variants from release 75 to GRCh38. Reasons why a projection might fail are explained in the assembly mapping blog post.
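To give a feel for what projecting a feature between assemblies involves, here is a minimal Python sketch. It is not the Ensembl core team's method: it assumes a hypothetical list of ungapped, same-strand blocks on a single sequence that pair a range on the old assembly with a start on the new one, and it treats any position outside those blocks as a failed projection. The `Block` type, the `project` and `projected_fraction` helpers and the coordinates in `BLOCKS` are all invented for illustration.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class Block(NamedTuple):
    """One ungapped, same-strand block shared by both assemblies (hypothetical)."""
    old_start: int  # 1-based inclusive start on the old assembly
    old_end: int    # 1-based inclusive end on the old assembly
    new_start: int  # 1-based start of the same block on the new assembly


def project(pos: int, blocks: Sequence[Block]) -> Optional[int]:
    """Project a position through blocks sorted by old_start.

    Returns the new-assembly position, or None when the position
    falls outside every block (a failed projection).
    """
    starts = [b.old_start for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    block = blocks[i]
    if pos > block.old_end:
        return None
    return block.new_start + (pos - block.old_start)


def projected_fraction(positions: Sequence[int], blocks: Sequence[Block]) -> float:
    """Fraction of positions that project successfully."""
    hits = sum(project(p, blocks) is not None for p in positions)
    return hits / len(positions)


# Made-up mapping: old positions 1001-1500 have no counterpart on the new assembly.
BLOCKS = [Block(1, 1000, 1), Block(1501, 3000, 1201)]
```

With this toy mapping, a variant at old position 2000 lands at 1700 on the new assembly, while one at 1200 sits in the unmapped region and is reported as `None`. A real projection also has to handle strand changes, indels within blocks and features that span block boundaries.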
For the Pre! site we also provide variant consequence information. We used the projected Ensembl gene set and computed the consequence of each variant that overlaps an Ensembl transcript. You can view these data on the Pre! site or download them from our Pre! FTP site in GVF or VCF format. Additionally, we computed consequences for variants overlapping NCBI’s RefSeq transcripts. These are also available as GVF and VCF files from our Pre! FTP site.
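If you download the VCF files, a small reader along the following lines may help as a starting point. This is only a sketch based on the standard eight fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); the function name `parse_vcf_line`, the example record and the INFO key `EXAMPLE_KEY` are made up, and the actual INFO keys used in the Pre! files are described in each file's header lines rather than assumed here.

```python
def parse_vcf_line(line: str) -> dict:
    """Split one tab-delimited VCF data line into its fixed columns.

    INFO is returned as a dict: 'KEY=value' entries keep their value
    as a string, and bare flags are stored as True.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, _qual, _filter, info = fields[:8]

    info_map = {}
    if info != ".":  # '.' is the VCF marker for a missing value
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_map[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),  # ALT may list several alleles
        "info": info_map,
    }


# Hypothetical record, not taken from the Pre! FTP site.
EXAMPLE = "1\t12345\tvariant_example\tA\tC,G\t.\t.\tEXAMPLE_KEY=value;EXAMPLE_FLAG"
RECORD = parse_vcf_line(EXAMPLE)
```

When reading a whole file, skip lines that begin with `#`, since those carry the header and metadata rather than variant records.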
The variation team is now working on the rest of our variation data for Ensembl release 76, including genotype information, phenotype data, LD data, structural variations and mappings of variants to all alternative loci that are new in GRCh38.
What’s next for GRCh38:
Over the coming months we will continue to annotate GRCh38. This work includes producing the GENCODE 20 gene set that will be released in Ensembl 76, as well as pairwise genome alignments between human and other species in Ensembl, and an updated Regulatory Build.