Response to Proposed Update to the NIH Genomic Data Sharing Policy’s Access Model for Genomic Summary Results

I am writing in my capacity as leader of the Ensembl project at the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI), located near Cambridge, England. Ensembl is one of the world’s leading sources of genome information and a central aggregation point for genomic data.

Ensembl software is fully open source and our data is freely available through our web browser; our Perl and REST APIs; our BioMart data mining platform, which serves direct queries and Bioconductor access via biomaRt; and direct downloads of all data. Ensembl’s Variant Effect Predictor (VEP) is the premier open source variant annotation tool. VEP is actively supported and is used by thousands of companies, research groups and hospitals.

Ensembl’s web interface supports over five million user sessions annually, and many millions more (which are not captured by analytics) access Ensembl data via our APIs, tools and data incorporated programmatically or directly into hundreds of other bioinformatics resources. More than half of Ensembl’s usage at our main sites is for human data. As a reference resource for genomics, Ensembl aims to provide the most general and most useful data resources in a consistent form. We already incorporate genomic summary results from ExAC, gnomAD, TOPMed, UK10K and other projects. These data are heavily used in many contexts and are deeply embedded in the Ensembl ecosystem and tool set described above.

Since Ensembl’s first release in 2000, our policy has always been to have as few barriers as possible: no limits are put on the data provided by the project, no Ensembl user pays a license fee, and no data in Ensembl requires click-through access. We strongly believe Ensembl’s openness has significantly contributed to the genomic revolution.

We understand and support the idea that human genomic data must be treated appropriately, and so care must be taken when either erecting or removing barriers to access, especially for vulnerable study populations where additional protection may be more easily justified. It is also possible that datasets arising from vulnerable study populations will have lower user demand than more general reference datasets and so may not be targets for incorporation into Ensembl and other resources. Regardless, accessing human genomic data remains more challenging than it should be in many cases, and there is room for significant improvement to the process of controlled access for both summary and individual-level data.

We are extremely supportive of the concept of further increasing access to genomic summary data. We believe that there are many benefits to doing so and have neither observed nor received any reports of harm associated with this type of access. However, we do not support the use of click-through licenses for genomic summary data, and we include in this response some of the potential consequences for the Ensembl platform should such a policy come into force.

  1. Data removal vs. costly retrofitting

If the proposed click-through policy were immediately implemented, we would be forced to remove any summary data subject to the policy while we considered how, or if, to support the policy. Although adding a check box to the web site is relatively easy, Ensembl’s power arises from deep and consistent integration across multiple tools including our APIs, VEP and BioMart. These tools include data access points that have been created over the project’s history, and rapidly retrofitting a seamless user interface into our system would be neither easy nor cheap.

Moreover, our archive websites and related resources, which stretch back five years for websites and longer for API and database access, are no longer subject to active development. Retrofitting this code would be extremely costly. The alternative, closing access to these data and websites, would disrupt ongoing research projects and impair reproducibility of already published results. Unlike a commercial company, we cannot hire several new developers for a year with the promise of greater profits after the development is complete. Our tools and services are free.

  2. Variable international applicability

Scientific research is international. We are uncertain about the mechanism of enforcement for click-through agreements for researchers outside the United States. We also recognise that researchers using these data in various jurisdictions may not have the contractual authority to agree to the click-through license and this could limit their use of Ensembl and any other resource incorporating such data.

  3. Challenges for data integration from various sources

The NIH does not make policy in a vacuum. Many funders and policy makers worldwide are likely to follow the lead of the NIH, and we believe it unlikely that more than a few will adopt policies less restrictive than those enacted by the NIH. This is especially true for those organisations that may be less inclined to share data for a wide variety of reasons.

Ensembl currently supports summary genomics datasets funded by the NIH and the Wellcome Trust, but we expect that this list will grow considerably over the next several years as more sequencing takes place across the world. Supporting a growing number of click-through agreements would be infeasible. In fact, we do not believe that statements from each funder or data source could (or should) be accommodated on every page providing summary data. Instead, we urge the NIH to work with others around the world to create a Code of Conduct for use of summary genomic data.


Paul Flicek