With the release of Ensembl 76 fast approaching, the variation team would like to explain how we moved our variation data to the new human assembly, GRCh38. There are different methods available for re-annotating variants on a new assembly. The most accurate would be to re-run, on the new assembly, the experiments or variant calling pipelines that identified each variant in the first place. However, the material and computational resources required for such an endeavour are very expensive. Therefore, we have developed computational methods so that, for most of the data, such investments are not necessary.
Because the new assembly retains much of the sequence of the previous assembly, we can use computational methods that derive the new location from information we already have about a variant, namely its:
- Location on the old assembly
- Flanking sequence (DNA sequence from the old assembly surrounding the variant)
Based on this prior knowledge we can either project or remap our variation data.
The projection algorithm compares the two assemblies and computes the new location based on sequence similarity between them. This computation succeeds for ~98% of our variation data. However, when the sequence in the new assembly has changed too much compared to the old assembly, the projection fails, and for those variants we fall back on a remapping strategy, as explained below. The projection functionality is implemented in the Ensembl core API.
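As a rough illustration of what projecting a coordinate involves, the sketch below assumes the assembly comparison has already been reduced to a list of gap-free aligned blocks. The `Block` type, the `project` function and the example coordinates are hypothetical and are not the Ensembl core API; they only show why a variant inside a conserved block receives a new location while a variant in a changed region does not.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Block:
    """A gap-free block shared by the old and new assembly (hypothetical type)."""
    old_start: int  # 1-based, inclusive
    old_end: int    # 1-based, inclusive
    new_start: int  # 1-based start of the same block on the new assembly
    strand: int     # 1 if same orientation, -1 if the block is inverted


def project(start: int, end: int, blocks: List[Block]) -> Optional[Tuple[int, int, int]]:
    """Return (new_start, new_end, strand), or None if no single block covers the feature."""
    for b in blocks:
        if b.old_start <= start and end <= b.old_end:
            if b.strand == 1:
                offset = b.new_start - b.old_start
                return (start + offset, end + offset, 1)
            # Inverted block: positions are mirrored within the block.
            block_new_end = b.new_start + (b.old_end - b.old_start)
            return (block_new_end - (end - b.old_start),
                    block_new_end - (start - b.old_start),
                    -1)
    return None  # not covered by a conserved block: candidate for remapping


# Toy blocks with made-up coordinates.
blocks = [Block(100, 200, 1100, 1), Block(300, 400, 5000, -1)]
print(project(150, 150, blocks))  # inside the first block
print(project(250, 250, blocks))  # between blocks, so projection fails
```

A feature that spans a block boundary is also reported as failed here, which is the conservative choice for a sketch: it is then handed to the remapping step rather than being split.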
For the remapping approach we generate a sequence read by adding upstream and downstream sequence from the old assembly to a variant. We then map the read to the new assembly using BWA.
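The read construction can be sketched as follows. This is a minimal example under stated assumptions: the function names and the sequences are made up, the allele occupies at least one base, and the read is assumed to align end to end on the forward strand without gaps. The aligner itself (BWA) is not invoked; the sketch only shows how the read is assembled and how the variant position follows from the reported alignment start.

```python
from typing import Tuple


def build_read(upstream: str, allele: str, downstream: str) -> str:
    """Concatenate old-assembly flanks around the variant allele."""
    return upstream + allele + downstream


def to_fasta(name: str, read: str) -> str:
    """Format the read as a single FASTA record for an aligner to consume."""
    return f">{name}\n{read}\n"


def variant_position(alignment_start: int, upstream_length: int,
                     allele_length: int) -> Tuple[int, int]:
    """Derive the 1-based variant start and end on the new assembly.

    Assumes a gap-free, forward-strand alignment of the whole read.
    """
    start = alignment_start + upstream_length
    end = start + allele_length - 1
    return start, end


# Made-up example: a single-base variant with short flanks.
upstream, allele, downstream = "ACGTAC", "G", "TTGCA"
read = build_read(upstream, allele, downstream)
print(to_fasta("variant_1", read), end="")
print(variant_position(1001, len(upstream), len(allele)))
```

Real alignments can contain gaps, soft clipping or reverse-strand hits, so a production version would have to walk the alignment description instead of adding a fixed offset.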
We have ~64M variants and ~69M variation features (VF) in Ensembl release 75, GRCh37. You can think of a variation feature as a combination of a variant and its location on the genome. Most variants have one variation feature. If a variant maps to multiple locations on the genome, the variant has as many variation features as it has locations on the genome.
We can divide variation features into:
- VF that map uniquely to the reference genome (chromosomes 1-22, X, Y, MT)
- VF that have multiple mappings on the reference genome
- VF that are located on an alternative locus
- VF that are located on a fix patch region
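The four groups above can be expressed as a small classification routine. The sketch below is illustrative only: the `VariationFeature` type, the `region_type` labels and the region names are hypothetical and do not reflect the Ensembl variation schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

# Reference genome sequences as listed above: chromosomes 1-22, X, Y, MT.
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


@dataclass(frozen=True)
class VariationFeature:
    variant_name: str
    seq_region: str
    region_type: str  # hypothetical label: "primary", "alt_locus" or "fix_patch"


def classify(features: List[VariationFeature]) -> Dict[str, List[VariationFeature]]:
    """Split variation features into the four groups described in the text."""
    # Count how many reference-genome locations each variant has.
    primary_counts = Counter(
        vf.variant_name for vf in features if vf.seq_region in PRIMARY
    )
    groups: Dict[str, List[VariationFeature]] = {
        "unique": [], "multiple": [], "alt_locus": [], "fix_patch": [],
    }
    for vf in features:
        if vf.seq_region in PRIMARY:
            key = "unique" if primary_counts[vf.variant_name] == 1 else "multiple"
        else:
            key = vf.region_type  # must be "alt_locus" or "fix_patch"
        groups[key].append(vf)
    return groups


features = [
    VariationFeature("v1", "1", "primary"),
    VariationFeature("v2", "2", "primary"),
    VariationFeature("v2", "X", "primary"),
    VariationFeature("v3", "ALT_REGION_A", "alt_locus"),
    VariationFeature("v4", "PATCH_REGION_B", "fix_patch"),
]
for name, members in classify(features).items():
    print(name, [vf.variant_name for vf in members])
```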
We first attempted to project all VF that map uniquely to the genome or are located on alternative loci. In a second step, we used our remapping approach for VF that could not be projected. Of the ~62M variants with a unique location on GRCh37, only ~200,000 could not be projected and were remapped to the new assembly instead. Variants with multiple mappings on GRCh37 were remapped to GRCh38 using their flanking sequence information as submitted to dbSNP.
Together, projection and remapping produce the new set of variation feature locations in release 76. We do not need to re-annotate variants located on fix patch regions from GRCh37 because the fix patch regions have been incorporated into the primary sequence of the new assembly.
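The "project first, remap on failure" order can be sketched as a small driver. The `reannotate` function and its two callables are hypothetical stand-ins for the projection and remapping steps; each is assumed to return a new location or `None` when it fails.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple

Location = Tuple[str, int, int]  # (sequence name, start, end), a made-up layout


def reannotate(
    features: Iterable[Hashable],
    project: Callable[[Hashable], Optional[Location]],
    remap: Callable[[Hashable], Optional[Location]],
) -> Tuple[Dict[Hashable, Tuple[str, Location]], List[Hashable]]:
    """Try projection first; fall back to remapping; collect what still fails."""
    located: Dict[Hashable, Tuple[str, Location]] = {}
    failed: List[Hashable] = []
    for vf in features:
        method = "projected"
        location = project(vf)
        if location is None:
            method = "remapped"
            location = remap(vf)
        if location is None:
            failed.append(vf)
        else:
            located[vf] = (method, location)
    return located, failed


# Toy stand-ins: projection works for "a", remapping rescues "b", "c" fails both.
projected = {"a": ("1", 1150, 1150)}
remapped = {"b": ("2", 7007, 7007)}
located, failed = reannotate(["a", "b", "c"], projected.get, remapped.get)
print(located)
print(failed)
```

Recording which method produced each location keeps the provenance of every new variation feature, which is useful when checking the remapped minority more closely.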
The Genome Reference Consortium increased the number of alternative sequence representations for variant regions (ALT LOCI) in GRCh38. In our workflow diagram we described how we re-annotate variants on ALT LOCI that were present in GRCh37. Additionally, we provide variant annotations for new ALT LOCI by remapping variation features from the primary reference sequence (GRCh38) that overlap an alternative locus. We added ~1.5M extra variation features with this approach. This, however, gives only an indication of how known variants map to ALT LOCI. Ideally, variant calling would be done against the full set of primary reference sequences and ALT LOCI. We expect variants to be called on ALT LOCI in the near future, once variant calling tools offer the option to include ALT LOCI information.
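Selecting the primary-assembly variation features that overlap an alternative locus amounts to an interval overlap test. The sketch below assumes each ALT LOCUS comes with the primary-assembly region it corresponds to; the tuple layouts, names and coordinates are made up for illustration.

```python
from typing import List, Tuple

Feature = Tuple[str, str, int, int]    # (variant name, chromosome, start, end)
Placement = Tuple[str, str, int, int]  # (alt locus name, chromosome, start, end)


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def features_overlapping_alt_loci(
    features: List[Feature], placements: List[Placement]
) -> List[Tuple[str, str]]:
    """Return (variant name, alt locus name) pairs that are candidates for remapping."""
    pairs = []
    for name, chrom, start, end in features:
        for alt_name, alt_chrom, alt_start, alt_end in placements:
            if chrom == alt_chrom and overlaps(start, end, alt_start, alt_end):
                pairs.append((name, alt_name))
    return pairs


placements = [("ALT_REGION_A", "6", 1000, 2000)]
features = [
    ("v1", "6", 1500, 1500),  # inside the region
    ("v2", "6", 2500, 2500),  # same chromosome, outside the region
    ("v3", "7", 1500, 1500),  # different chromosome
    ("v4", "6", 2000, 2003),  # touches the region boundary
]
print(features_overlapping_alt_loci(features, placements))
```

The nested loop is fine for a handful of regions; with millions of features an interval index or a sorted sweep would be the more practical choice.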
Are you ready to move to GRCh38?
Ensembl provides a reliable representation of variation data on the new human assembly, GRCh38. In addition to re-annotating variation data from release 75 for release 76, we also updated our data (e.g. from ClinVar, the NHLBI GO Exome Sequencing Project and COSMIC) and projected or remapped it to GRCh38 where necessary. But there is no need to worry if you are not yet ready to make the move to GRCh38. Starting with Ensembl release 76, we will support and update variation annotation for both GRCh37 and GRCh38. If you have questions or comments, please get in touch with us.