NCBI has announced big changes to how dbSNP manages human variation data, which will be reflected in Ensembl. These changes include a new allele normalisation approach and the removal of some older population genetics data.
The biggest change will be to the normalisation and clustering of short variants, as a result of the introduction of SPDI format. This will specifically affect insertions and deletions, which will now be converted to a maximal representation prior to clustering. For example:
ref:  AGTCGTCGAAAAAGCACTCATGGTC
alt1: AGTCGTCGAAAGCACTCATG
alt2: AGTCGTCGAAAAAAGCACTCATGGTC
Previously these changes would be represented as two different refSNPs, a deletion of AA (alt1) and an insertion of A (alt2). Now any variants that represent expansions or retractions of the same repeat, even if they are of different sizes, will be merged into a single variant, so this would be represented as a single refSNP with the alleles AAAAA/AAA/AAAAAA. This is useful in several ways: it makes it clear when an insertion or deletion is actually an expansion or retraction of a repeat and, because the variant covers the whole repeat region, there is no mismatch between different sources reporting the same variant. Such mismatches can be a problem when comparing a variant in VCF, which left-aligns, with one in HGVS, which right-aligns.
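To make the idea of a maximal representation concrete, here is a minimal Python sketch. It is not dbSNP's implementation: the function names are hypothetical, and it assumes each alt sequence differs from ref by a single change. It measures the flanks shared by ref and alt and, for a pure insertion or deletion, widens the variant to every position where that change could be placed. Note that alt1 is written here with the same 3′ flank as ref, which is an assumption.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters shared by a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def maximal_alleles(ref: str, alt: str) -> tuple[int, str, str]:
    """Return (0-based start, ref allele, alt allele).

    For a single insertion or deletion the alleles span every position
    at which the change could be placed (the whole repeat). Anything
    else is simply trimmed of its shared flanks.
    """
    shorter = min(len(ref), len(alt))
    prefix = common_prefix_len(ref, alt)
    suffix = common_prefix_len(ref[::-1], alt[::-1])
    if prefix + suffix >= shorter:
        # Shared flanks overlap: a pure insertion/deletion that can shift.
        start = shorter - suffix
        ref_end = prefix + len(ref) - shorter
        alt_end = prefix + len(alt) - shorter
    else:
        # Substitution or mixed change: no shifting, trim flanks only.
        start = prefix
        ref_end = len(ref) - suffix
        alt_end = len(alt) - suffix
    return start, ref[start:ref_end], alt[start:alt_end]


# Example region from the text, built from its flanks and A-run.
LEFT, RIGHT = "AGTCGTCG", "GCACTCATGGTC"
REF = LEFT + "A" * 5 + RIGHT
ALT1 = LEFT + "A" * 3 + RIGHT  # assumption: same 3' flank as ref
ALT2 = LEFT + "A" * 6 + RIGHT

for name, alt in (("alt1", ALT1), ("alt2", ALT2)):
    print(name, maximal_alleles(REF, alt))
```

Both alleles come out at the same offset with the same reference allele, (8, 'AAAAA', 'AAA') and (8, 'AAAAA', 'AAAAAA'), which is what allows them to be merged into AAAAA/AAA/AAAAAA.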
Another effect will be that variants with alleles that are not the same type of change (tandem, other insertion/deletion, substitution) will be split into different variants. So if we consider further alt sequences in our example region:
alt3: AGTCGTCGGCCAAAAAGCACTCATGGTC
alt4: AGTCGTCGAAAAAAAGCACTCATGGTC
Previously this may have been described as a variant like -/AA/GCC. We will now see this as two separate variants, with the repeat expansion AAAAA/AAAAAAA being merged with the variant above and the standalone insertion -/GCC being considered a separate variant.
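The merging and splitting described here can be sketched as a grouping step. The Python below is illustrative only: the text does not spell out dbSNP's clustering rules, so keying on the reference span is an assumption, and the names and 0-based offsets into the 25-base example ref are hypothetical. It takes alleles already in maximal representation and groups them by the reference span they replace.

```python
from collections import defaultdict

# (label, 0-based start in ref, ref allele, alt allele), all in
# maximal representation. An empty ref allele marks a plain insertion.
normalised = [
    ("alt1", 8, "AAAAA", "AAA"),
    ("alt2", 8, "AAAAA", "AAAAAA"),
    ("alt3", 8, "", "GCC"),
    ("alt4", 8, "AAAAA", "AAAAAAA"),
]


def cluster(records):
    """Group alt alleles by the reference span they replace (assumed key)."""
    groups = defaultdict(list)
    for _label, start, ref_allele, alt_allele in records:
        groups[(start, ref_allele)].append(alt_allele)
    return dict(groups)


def allele_string(ref_allele, alts):
    """Format alleles as ref/alt1/alt2..., writing '-' for an empty allele."""
    return "/".join(allele or "-" for allele in [ref_allele, *alts])


clusters = cluster(normalised)
for (start, ref_allele), alts in clusters.items():
    print(start, allele_string(ref_allele, alts))
```

The three A-run alleles share one reference span and land in a single cluster, AAAAA/AAA/AAAAAA/AAAAAAA, while the GCC insertion has an empty reference span and forms its own, -/GCC.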
An example of variants that will change significantly is:
7 117548628 rs756031917 G GT,GTGTT,GTT
7 117548633 rs1805177 <(T)5> <(T)7>,<(T)9>
7 117548635 rs200454589 T TA,TTT
These all represent changes in this sequence:
117548628 - GTTTTTTTAA - 117548637
All three variants include a representation of expansions of the T repeat, but give a different location for it. rs756031917 also includes a representation of a TGTT insertion and rs200454589 includes an A repeat expansion at the 3′ end.
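That the old records place the same T expansion at different positions can be checked with a short Python sketch. It is a hedged illustration, not the dbSNP pipeline: the helper names are hypothetical, only the 10-base window shown above is used (so a repeat running past the window edge would be truncated), and the symbolic alleles of rs1805177 are skipped. Each old allele is applied to the window and then widened to the full span over which the insertion could be placed.

```python
from os.path import commonprefix  # character-wise, so it works on any strings

WINDOW_START = 117548628
WINDOW = "GTTTTTTTAA"  # chromosome 7, 117548628-117548637, as shown above


def apply_record(pos, ref, alt):
    """Return the window sequence with the VCF-style ref replaced by alt."""
    off = pos - WINDOW_START
    if WINDOW[off:off + len(ref)] != ref:
        raise ValueError("ref allele does not match the window")
    return WINDOW[:off] + alt + WINDOW[off + len(ref):]


def maximal(ref_seq, alt_seq):
    """Return (position, ref allele, alt allele) over the whole shiftable span."""
    shorter = min(len(ref_seq), len(alt_seq))
    prefix = len(commonprefix([ref_seq, alt_seq]))
    suffix = len(commonprefix([ref_seq[::-1], alt_seq[::-1]]))
    if prefix + suffix < shorter:
        raise ValueError("not a single insertion or deletion")
    start = shorter - suffix
    return (
        WINDOW_START + start,
        ref_seq[start:prefix + len(ref_seq) - shorter],
        alt_seq[start:prefix + len(alt_seq) - shorter],
    )


old_alleles = [
    ("rs756031917", 117548628, "G", "GT"),
    ("rs756031917", 117548628, "G", "GTT"),
    ("rs200454589", 117548635, "T", "TTT"),
    ("rs756031917", 117548628, "G", "GTGTT"),
]

for rsid, pos, ref, alt in old_alleles:
    print(rsid, f"{ref}>{alt}", maximal(WINDOW, apply_record(pos, ref, alt)))
```

Under these assumptions, G>GTT from rs756031917 and T>TTT from rs200454589 resolve to the same span and alleles, while the TGTT insertion resolves to T/TGTTT at 117548629.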
The new variants will be cleaner, with all the T expansions in one variant, and the two non-T expansions shown as other variants.
7 117548629 rs756031917 T TGTTT
7 117548628 rs1805177 GTTTTTT G,GTTTT,GTTTTT,GTTTTTTT,GTTTTTTTT,GTTTTTTTTT,GTTTTTTTTTTT,GTTTTTTTTTTTTT
7 117548635 rs200454589 T TA
In this case, the alleles from three variants were normalised into three new variants that retain the rsIDs of the originals, although the alleles have moved between them. This is not always the case, and many old rsIDs have been deprecated.
Population genetics data
The HapMap project was a worldwide collaboration that investigated the haplotype structure of the human genome by generating genotypes and allele frequencies across a range of populations. The data produced by HapMap has been superseded by 1000 Genomes, which included most of the samples from HapMap and many more in a standardised sequencing and variant calling process. The gnomAD project has generated still deeper frequency data by variant calling many thousands of genome and exome sequences. As these more expansive measures of population genetics are available, dbSNP have decided to discontinue support for HapMap data, and these frequencies will also cease to be available through Ensembl. We will also remove the HapMap evidence status for variants.
RefSeq intron HGVS
HGVS nomenclature for variants within the introns of RefSeq transcripts is not available in dbSNP152. As we import these annotations from dbSNP, we will also no longer display them on our variant pages. However, HGVS is calculated on the fly in VEP, so we can still display RefSeq intron HGVS descriptions in VEP.