1000 Genomes phase 3 frequencies, genotypes and LD data

Ensembl Variation recently incorporated the latest versions of the dbSNP and 1000 Genomes datasets. While we were able to import all of the variant loci from phase 3 of the 1000 Genomes Project, the sheer volume of genotype data (2500 individuals x 80 million sites = 200 billion data points!) meant we had to create a new solution to deliver this data through our API and website.

To this end, we have extended the Ensembl Variation API to read genotype data directly from tabix-indexed VCF files. The API then calculates frequency and linkage disequilibrium (LD) data from these genotypes on the fly. You can see this in action on a typical population genetics page:
[Screenshot: a population genetics page]
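
As a rough illustration of the kind of calculation involved, here is a minimal sketch in Python rather than the API's own Perl. It is not the API's implementation: the function names allele_frequencies and r_squared are made up for this example, it assumes phased, biallelic data, and the input is toy data rather than real 1000 Genomes genotypes. It shows how allele frequencies, and the r-squared measure of LD between two sites, can be computed directly from genotypes.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} from a list of per-individual genotypes.

    Each genotype is a tuple of alleles, e.g. ("A", "G") for a diploid call.
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


def r_squared(haplotypes):
    """Return r^2 between two biallelic sites from phased haplotypes.

    Each haplotype is a pair (a, b) of 0/1 codes: a for site 1, b for site 2,
    where 1 means the haplotype carries the alternate allele.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # the disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None  # undefined when either site is monomorphic
    return d * d / denom


# Toy data (assumed for illustration): four diploid individuals at one site
genotypes = [("A", "G"), ("G", "G"), ("A", "G"), ("G", "G")]
print(allele_frequencies(genotypes))  # {'A': 0.25, 'G': 0.75}

# Toy data (assumed for illustration): four phased haplotypes across two sites
haplotypes = [(1, 1), (1, 1), (0, 0), (0, 0)]
print(r_squared(haplotypes))  # 1.0
```

With this toy data, each frequency is the allele count divided by the total number of chromosomes, and the two sites are in perfect LD because the same haplotypes carry the alternate allele at both.
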
To use this functionality with your local API installation, there are a couple of extra dependencies to install. You may even have them already!

Tabix

The tabix utility is used for rapid random access into compressed position-based text files. It also allows access to data across HTTP and FTP protocols, downloading only a small index file in the process.
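
To give a feel for why an index makes random access fast, here is a toy sketch in Python. It is only an analogy: the real tabix index format and the block compression it works with are more involved, and the names build_index and fetch are invented for this example. The sketch assumes a position-sorted, tab-separated file for a single chromosome with the position in the second column, as in VCF. It records the byte offset of every Nth line, then seeks straight to the nearest recorded offset instead of scanning from the top.

```python
import bisect
import io


def build_index(handle, every=2):
    """Record (position, byte offset) for every Nth line of a sorted file."""
    index = []
    line_no = 0
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line_no % every == 0:
            position = int(line.split(b"\t")[1])
            index.append((position, offset))
        line_no += 1
    return index


def fetch(handle, index, start, end):
    """Return the lines whose position lies in [start, end]."""
    positions = [position for position, _ in index]
    # Last checkpoint strictly before start, so no matching line is skipped
    i = max(bisect.bisect_left(positions, start) - 1, 0)
    handle.seek(index[i][1])
    hits = []
    for line in handle:
        position = int(line.split(b"\t")[1])
        if position > end:
            break  # the file is sorted, so nothing further can match
        if position >= start:
            hits.append(line.decode().rstrip("\n"))
    return hits


# Toy data (assumed for illustration): chromosome, position, identifier
data = io.BytesIO(
    b"1\t100\trs1\n1\t200\trs2\n1\t300\trs3\n1\t400\trs4\n1\t500\trs5\n"
)
index = build_index(data, every=2)
print(fetch(data, index, 300, 400))  # the rs3 and rs4 records
```

The index is much smaller than the data, which is why a remote reader only needs to download the index and the handful of blocks covering the region of interest.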

To install it, we clone it from GitHub and run a couple of “make” commands. From here on we assume that you typically install things in your $HOME/src/ directory and that you are using bash or a bash-like shell.

# Fetch the source and build the tabix binary
cd ~/src
git clone https://github.com/samtools/tabix.git
cd tabix
make
# Build and install the Perl bindings
cd perl
perl Makefile.PL PREFIX=${HOME}/src/
make && make install

You may need the tabix binary in your path; you can either copy ~/src/tabix/tabix to a directory in your path, or add this to your path:

PATH=${PATH}:${HOME}/src/tabix/
export PATH

You should also add the relevant path to your PERL5LIB environment variable if it isn’t already there; the path in question is shown in the output from the “make && make install” command above.

# Adjust this to the path reported by "make install"; the Perl version will vary
PERL5LIB=${PERL5LIB}:${HOME}/src/lib/perl/5.14.2/
export PERL5LIB

ensembl-io

The ensembl-io package contains objects and methods for parsing and writing data formats commonly used in bioinformatics. If you installed the API using Git and Ensembl Git tools, chances are you already have the module.

If not, it’s simple to install with git:

cd ~/src
git clone https://github.com/Ensembl/ensembl-io.git
# Make the ensembl-io modules visible to Perl
PERL5LIB=${PERL5LIB}:${HOME}/src/ensembl-io/modules
export PERL5LIB

Using VCFs in the API

That’s it! Now to use this in an API script, there’s a simple flag we have to set on the Variation DBAdaptor object:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');

# Tell API to use VCFs
$variation_adaptor->db->use_vcf(1);

# Fetch the variant and all of its allele objects
my $variation = $variation_adaptor->fetch_by_name('rs699');
my $alleles = $variation->get_all_Alleles();

foreach my $allele (@{$alleles}) {
  # Skip alleles that have no population or frequency attached
  next unless defined($allele->population) && defined($allele->frequency);

  my $allele_string = $allele->allele;
  my $frequency = $allele->frequency;
  my $population_name = $allele->population->name;
  printf("Allele %s has frequency %.3g in %s\n", $allele_string, $frequency, $population_name);
}

This script should print out frequency data for a number of populations, including those from 1000 Genomes phase 3:

....
Allele A has frequency 0.121 in 1000GENOMES:phase_3:KHV
Allele G has frequency 0.879 in 1000GENOMES:phase_3:KHV
Allele A has frequency 0.149 in 1000GENOMES:phase_3:JPT
Allele G has frequency 0.851 in 1000GENOMES:phase_3:JPT
Allele A has frequency 0.295 in 1000GENOMES:phase_3:ALL
Allele G has frequency 0.705 in 1000GENOMES:phase_3:ALL

You can call “->db->use_vcf(1)” on any adaptor from the variation adaptor group.

Once set, it will affect fetching objects of the following types:

  • Allele
  • PopulationGenotype
  • IndividualGenotype
  • LDFeatureContainer

Advanced configuration

The value we pass to use_vcf() also affects the behaviour of the API:

  • 0 : fetch data only from database
  • 1 : fetch data from VCFs and database
  • 2 : fetch data only from VCFs

One final thing; the API is pre-configured to use VCFs hosted on the Ensembl FTP site. It is also possible to use VCFs on your local machine or any arbitrary server. The configuration is found in the ensembl-variation folder:

cat ~/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json
{
 "collections": [
   {
     "id": "1000genomes_phase3",
     "species": "homo_sapiens",
     "assembly": "GRCh37",
     "type": "remote",
     "strict_name_match": 1,
     "filename_template": "ftp://ftp.ensembl.org/pub/grch37/release-79/variation/vcf/homo_sapiens/1000GENOMES-phase_3-genotypes/ALL.chr###CHR###.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.vcf.gz",
     "chromosomes": [
       "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22"
     ],
     "individual_prefix": "1000GENOMES:phase_3:"
   },
   {
     "id": "1000genomes_phase3",
     "species": "homo_sapiens",
     "assembly": "GRCh38",
     "type": "remote",
     "strict_name_match": 1,
     "filename_template": "ftp://ftp.ensembl.org/pub/release-80/variation/vcf/homo_sapiens/1000GENOMES-phase_3-genotypes/ALL.chr###CHR###.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz",
     "chromosomes": [
       "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22"
     ],
     "individual_prefix": "1000GENOMES:phase_3:"
   }
 ]
}

Feel free to edit the filename_template entry in this file. Note there are separate entries for the two currently supported human assemblies, GRCh37 and GRCh38; the relevant entries will be used depending on which port you connect to in your API script (3306 for GRCh38, 3337 for GRCh37).

“###CHR###” is a placeholder that allows the API to read from a set of files distributed as one per chromosome. This is not mandatory, and indeed a single genome-wide VCF file could be used. The only requirement is that the chromosomes contained in the VCF or set of VCFs are listed in the “chromosomes” field of the JSON configuration file.
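
The way the placeholder and the “chromosomes” field work together can be sketched as follows, again in Python and only as an illustration. The function name vcf_path and the local file paths are hypothetical, and the API's own lookup logic may differ. The first entry uses one file per chromosome and the second uses a single genome-wide file, so the placeholder is simply absent there.

```python
import json

# Hypothetical, trimmed-down configuration using local paths (assumed)
CONFIG = json.loads("""
{
  "collections": [
    {
      "id": "1000genomes_phase3",
      "assembly": "GRCh37",
      "filename_template": "/data/vcf/grch37/ALL.chr###CHR###.vcf.gz",
      "chromosomes": ["1", "2", "22"]
    },
    {
      "id": "1000genomes_phase3",
      "assembly": "GRCh38",
      "filename_template": "/data/vcf/grch38/genome_wide.vcf.gz",
      "chromosomes": ["1", "2", "22"]
    }
  ]
}
""")


def vcf_path(config, assembly, chromosome):
    """Return the VCF path for a chromosome on an assembly, or None."""
    for collection in config["collections"]:
        if collection["assembly"] != assembly:
            continue
        if chromosome not in collection["chromosomes"]:
            continue  # chromosomes not listed in the config are never looked up
        # Substitute the placeholder; a no-op for a single genome-wide file
        return collection["filename_template"].replace("###CHR###", chromosome)
    return None


print(vcf_path(CONFIG, "GRCh37", "22"))  # /data/vcf/grch37/ALL.chr22.vcf.gz
print(vcf_path(CONFIG, "GRCh38", "22"))  # /data/vcf/grch38/genome_wide.vcf.gz
print(vcf_path(CONFIG, "GRCh37", "X"))   # None
```

In this sketch a chromosome missing from the “chromosomes” list yields no file at all, which is why the list needs to cover everything present in your VCFs.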

Any questions, don’t hesitate to get in touch!

2 thoughts on “1000 Genomes phase 3 frequencies, genotypes and LD data”

  1. Hi Ensembl team, I’m excited for this new data with larger sample sizes.

    I have ~5000 SNPs, and for each one I would like to find all SNPs in LD (r-squared over 0.5). More information on the task is available here. Does your API support this operation using the phase 3 data? If so how, and if not is there an alternative you recommend?