It has been quite a while since we’ve blogged about the VEP (Variant Effect Predictor), and in that time we’ve added a whole load of new features, particularly to the downloadable script version.

Structural variants

The VEP now supports finding the consequences of structural variants, with input either in VCF or tab-delimited format. Using the web interface to the VEP you can visualise which transcripts and features your structural variants overlap by clicking through to the Region in Detail view.
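To give a feel for what "overlap" means here, this is a minimal Python sketch of an interval test between a structural variant and a set of features. It is not the VEP's own code: the `Feature` type, the feature names and all coordinates are made up for illustration, and coordinates are assumed to be 1-based and inclusive.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """A genomic interval on one chromosome (1-based, inclusive coordinates)."""
    name: str
    chrom: str
    start: int
    end: int


def overlapping_features(sv: Feature, features: list[Feature]) -> list[str]:
    """Return the names of features that share at least one base with the variant."""
    return [
        f.name
        for f in features
        if f.chrom == sv.chrom and f.start <= sv.end and sv.start <= f.end
    ]


# Hypothetical coordinates, for illustration only.
transcripts = [
    Feature("transcript_A", "1", 1_000, 5_000),
    Feature("transcript_B", "1", 8_000, 9_000),
    Feature("transcript_C", "2", 1_000, 5_000),
]
deletion = Feature("my_deletion", "1", 4_500, 8_200)

print(overlapping_features(deletion, transcripts))
```

The deletion spans the end of the first transcript and the start of the second, so both are reported; the third is on a different chromosome and is skipped.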

The cache

We’ve really pushed the VEP script’s capabilities when using local “caches” (as opposed to using remote databases). Almost every feature of the VEP is now available when using the cache in offline mode. You can use a local FASTA file to quickly retrieve the sequences required to construct HGVS notations. You can even construct your own cache from a GTF file if your species isn’t supported by Ensembl.

Our cache for human now contains allele frequency data from phase 1 of the 1000 Genomes Project, and you can use these frequencies to filter your input (for example, you might want to filter out variants that are common in the combined European (EUR) population). We also now provide SIFT predictions for 8 species – human, mouse, zebrafish, pig, cow, chicken, rat and dog.

Plugins

We’re always trying to add new and useful features to the VEP, but we also recognise that other users have great ideas that they’d like to implement. The VEP script enables the use of plugins; these are bits of code that add extra functionality to the VEP. They can be used to retrieve data from remote sources, run external tools, filter output; pretty much anything you can think of can be accomplished in a plugin!

It’s easy to get started, and a basic plugin can be just a few lines of code – have a look at some of the examples we’ve created.

I recently added a plugin to retrieve data from dbNSFP – this is a great resource created by Liu et al in Houston, TX. They have, for every possible missense substitution in the human genome, pre-calculated pathogenicity scores, frequencies, conservation scores and a plethora of other things, and made all of this available as an easily downloadable file. To use this with the VEP, you just download the file and the plugin, run a couple of commands to get the data into the right format, and away you go – the VEP can now provide you with scores from LRT, MutationAssessor, MutationTaster, FATHMM and more for any missense substitution in your input.

Summary and HTML output

We had a number of requests for the VEP to provide summary statistics at the end of each run, and who are we to disappoint our loyal users?!? The VEP now writes a pretty HTML summary.

You can also view your output as HTML using the --html flag, which allows you to sort, filter and analyse your output on the fly.

Don’t hesitate to get in touch with us about the VEP – our developer mailing list is the best place for technical questions, with helpdesk for everything else.


From release 70 we store and display information on the type of consequence a variant has on overlapping regulatory regions (Ensembl regulatory features and Ensembl motif features) for human and mouse.

Web display
Consequence types for variations overlapping regulatory features in region in detail view
One of the major benefits of this is that we can highlight the predicted consequence types for a variant overlapping regulatory regions in the region in detail view. The Variation – Genes and regulation page gives more information on the type of consequence a single variant has on a specific regulatory region.

API
We store the data in two new tables: regulatory_feature_variation and motif_feature_variation. Both are populated in a similar way to the transcript_variation table. You can find further information on the table structures on our Variation database schema description page.

You can access the data using the Ensembl Variation Perl API. Please check the API documentation for examples of how to use the RegulatoryFeatureVariationAdaptor and the MotifFeatureVariationAdaptor. These new modules allow you to fetch MotifFeatureVariations or RegulatoryFeatureVariations on a VariationFeature, MotifFeature or RegulatoryFeature. This is in addition to the existing functionality for getting all RegulatoryFeatureVariations and MotifFeatureVariations using the VariationFeatureAdaptor.

If you have any questions please email helpdesk.

If you are interested in knowing the evolutionary history of your preferred Ensembl gene, you are in luck. Starting from this release (69), Ensembl has a new gene gain/loss tree view just for this purpose. This view shows the evolutionary history of a gene family by showing gains (expansions) or losses (contractions) in the number of members belonging to that family.

The example below shows a detail of the evolutionary history of the human gene ZNF235 as displayed by this new Ensembl view. As you can see, it is a species tree with annotated branches showing significant expansions (in red), contractions (in green) or no significant changes (in blue). Each node, whether it represents an extant species or an ancestor, is labelled with the number of members of the family and the statistical significance of this change (or the lack of it).

Gene gain/loss tree view example (view the example in Ensembl)

If you want to know more about this view and how the data is generated, check out its help page.

Please try it out with your preferred genes and let us know your impressions (helpdesk contact form). We are working to include more useful information in the view and your input is important!

Did you ever wish you could resize our images/views to make them bigger? We now have a new icon on the blue image toolbar in beta.ensembl.org, and you can resize the image with one click.

Image resize icon

On clicking the icon, a menu to choose the size will appear with your current size greyed out (see figure below). There is also a best fit option which will resize the image according to your screen resolution.

Image resize menu

Have a look and let us know your thoughts by sending them to ensembl-beta@sanger.ac.uk or by clicking on the black feedback button at the right of the views in Ensembl Beta.

All feedback and suggestions for improvement are welcome, specifically:

Is it useful?

Is the menu clear enough?

Any other improvements?

Many thanks for your feedback!


Have you ever spent time changing your favourite Ensembl view (for example adding new tracks, changing the track order, or uploading custom data) and wished you could easily send the configured display to a colleague through one simple URL? You can now do this on beta.ensembl.org.

Configurable images now have a link icon in their toolbars. If you click on this, it will give you a link to share with another user.

If you have any custom tracks turned on for the image, you will get the option to share these too (this is opt-in via checkboxes). This works with uploaded files, attached URLs, DAS and data hubs.

Custom tracks will only be shareable if they are displayed on the image (or in the case of data hubs, if any of the tracks in the hub are displayed).

If you send the URL to a colleague, they will see the image configured in the same way that you have it.

You can also share configurations for a whole page by using the Share this page button in the left menu.

Please try it out. If you encounter any problems, please use the Feedback button on the beta site to tell us about them (or email ensembl-beta@sanger.ac.uk), making sure to include the link you are trying to share.

We are pleased to announce that we are now providing access to the ENCODE integrative analysis data from within Ensembl. These analyses bring together a multitude of experiments targeted at determining functional elements in the human genome sequence. This data is provided from an external source (a track hub at the EBI). Although the Ensembl code supporting track hubs is still in preliminary form, we considered this ENCODE set important enough to release the code early so that we could provide access to it.

Important: Please read the instructions below before activating this data!

As this dataset is very large (over 2800 tracks) it is not configured on by default in the Ensembl browser. To add the ENCODE hub tracks, click on the link below. Warning: users of IE6 or IE7 should not do this because performance in those browsers is inadequate and the page will not load.

Link to add ENCODE integrative analysis hub

No tracks from the hub are switched on by default. To turn on tracks from ENCODE, go to ‘Configure this page’ and click on one of the submenus under ‘ENCODE data’, for example ‘ENCODE genome segmentations’. It will take a few seconds to bring up the track list. Then switch tracks on or off by clicking on the box next to the track name and choosing a track style. For genome segmentations the ‘Compact’ track style looks good. More information on configuring the display is available in our recently released video tutorial on the Region in Detail view. Here’s an example of a region showing a few ENCODE tracks (HepG2 and K562 genome segmentations and cytosolic RNASeq tracks):

If you no longer need access to the ENCODE set of tracks, the hub can be turned off by going to the ‘Manage your data’ link in the left hand menu, and clicking on the trash bin icon for the ‘ENCODE data’ source to delete it from the ‘Configure this page’ menu.

We will be working over the next few months to extend our track hub support, including improving performance and adding features to the configuration interface.

From release 68, we are using Sequence Ontology (SO) terms for the variation consequences, in an effort to standardise terms across the different browsers and make it easier for users to compare variation annotation between them. The UCSC Genome Browser will use these terms on their SNP details page around mid-August, dbSNP will update their web display in the next few weeks, and the ICGC also intend to standardise on SO terms for describing somatic mutation consequences.

At the same time, we have added a couple of more specific consequences for SNPs and indels (splice donor variant and splice acceptor variant, for example), and consequences for larger structural variants are now available through the Variant Effect Predictor (VEP). The complete list of terms and definitions is in our documentation. As you will see, the SO equivalents for our old terms are fairly straightforward. The most notable difference is that we have replaced “non-synonymous” with the more specific term “missense” for amino acid changes that do not introduce a stop codon, as we already have a specific term for stop gained.

The old Ensembl terms are still available on the website (using “Configure this page”), and if you have text files or VEP output files with our old Ensembl terms, you can easily update these to use the SO terms by running the following script.
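The conversion script itself is not reproduced in this post. As a rough illustration of what such a conversion involves, here is a minimal Python sketch that rewrites one column of a tab-delimited line using a lookup table. The table is deliberately partial and should be treated as an assumption: please check our documentation for the authoritative list of terms. The function name, column index and sample line are all hypothetical.

```python
# Partial, illustrative mapping from old Ensembl consequence terms to SO terms.
# Assumption: consult the Ensembl documentation for the complete, authoritative list.
OLD_TO_SO = {
    "NON_SYNONYMOUS_CODING": "missense_variant",
    "SYNONYMOUS_CODING": "synonymous_variant",
    "STOP_GAINED": "stop_gained",
    "STOP_LOST": "stop_lost",
    "FRAMESHIFT_CODING": "frameshift_variant",
    "INTRONIC": "intron_variant",
    "INTERGENIC": "intergenic_variant",
}


def convert_line(line: str, column: int, sep: str = "\t") -> str:
    """Replace old terms with SO terms in one column of a delimited line.

    The column may hold several comma-separated terms. Terms that are not in
    the table are left unchanged, so nothing is silently dropped.
    """
    fields = line.rstrip("\n").split(sep)
    terms = fields[column].split(",")
    fields[column] = ",".join(OLD_TO_SO.get(term, term) for term in terms)
    return sep.join(fields)


# Hypothetical three-column line: identifier, location, consequence(s).
old_line = "var1\t1:100\tNON_SYNONYMOUS_CODING,SPLICE_SITE"
print(convert_line(old_line, column=2))
```

Leaving unknown terms untouched is a deliberate choice in this sketch: some old terms correspond to more than one SO term, and those cases need more context than a simple lookup can provide.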

For release 67 we changed how we store the protein function predictions from SIFT and PolyPhen so that they can also be used for more than just Ensembl transcripts, including RefSeq transcripts. We use these tools to compute the predicted effect of every possible amino acid substitution in the human proteome (over 2 billion predictions!). Now, the complete set of predictions for a particular protein is retrieved using the protein sequence itself as an identifier rather than an Ensembl stable identifier (we actually use the MD5 hash of the sequence). This means that you can retrieve predictions for any protein that has the same amino acid sequence as an Ensembl translation. So if you work with RefSeq transcripts, you can now get SIFT and PolyPhen predictions for any missense variants that fall in the 95% of RefSeq transcripts that match an Ensembl transcript exactly, using both the Variant Effect Predictor (VEP) and the Variation API.
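The idea of keying predictions on the sequence rather than on an identifier can be sketched in a few lines of Python. This is only an illustration, not our storage code: the in-memory dictionary, the function names, the sequence and the prediction are all made up, and the normalisation step (stripping whitespace and upper-casing before hashing) is an assumption.

```python
import hashlib


def sequence_key(protein_sequence: str) -> str:
    """Return a hex MD5 digest to use as a lookup key for a protein sequence.

    Assumption: whitespace is stripped and the sequence is upper-cased before
    hashing; the exact normalisation used by Ensembl is not described here.
    """
    normalised = "".join(protein_sequence.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


# Hypothetical store: sequence key -> {(position, alternate amino acid): prediction}
predictions_by_key: dict[str, dict[tuple[int, str], str]] = {}


def store_predictions(protein_sequence, predictions):
    """File a set of predictions under the hash of the protein sequence."""
    predictions_by_key[sequence_key(protein_sequence)] = predictions


def fetch_predictions(protein_sequence):
    """Return the predictions stored for this exact sequence, or None."""
    return predictions_by_key.get(sequence_key(protein_sequence))


# Made-up sequence and prediction, for illustration only.
ensembl_translation = "MKTAYIAKQRQISFVK"
store_predictions(ensembl_translation, {(2, "E"): "tolerated"})

# A protein from another source with an identical sequence finds the same
# predictions, whatever its identifier is.
refseq_protein = "mktayiakqr qisfvk"
print(fetch_predictions(refseq_protein))
```

Because the key depends only on the amino acid sequence, a single differing residue gives a different key, which is why only exactly matching proteins can share predictions.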

Also new in release 67 are predictions from both classifier models supplied with PolyPhen. Previously we provided predictions using a classifier trained on the HumVar dataset, which is intended to distinguish severely deleterious alleles from the background of abundant variation with milder effects. This is still the default, but when using the API you can now also opt to use predictions from the classifier trained on the HumDiv dataset, which is intended to help evaluate rarer alleles potentially involved in complex disease. For more details on how these datasets are composed, please refer to the PolyPhen website.

The Variant Effect Predictor (VEP) software can predict the consequence of genomic variants using the genomic annotations provided by Ensembl. In release 63 of Ensembl we have added new features to both the script and web versions of the VEP.

Regulatory consequences have made their return; the VEP now reports if a variant falls within a regulatory region or a transcription factor binding motif, and furthermore if the variant falls in a high information locus within the motif.
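This post does not define what counts as a "high information" position, so the following Python sketch is only one plausible reading. It uses the Shannon information content of each column of a DNA motif (2 bits minus the column's entropy), which is a common way to measure how constrained a position is. The motif frequencies and the 1.5 bit threshold are arbitrary assumptions, not values used by the VEP.

```python
import math


def information_content(column: dict[str, float]) -> float:
    """Information content in bits of one motif column of base frequencies.

    For DNA this ranges from 0 (all four bases equally likely) to 2 (one base
    only). Zero frequencies are skipped because they contribute nothing.
    """
    return 2.0 + sum(p * math.log2(p) for p in column.values() if p > 0)


# Arbitrary cut-off in bits; an assumption for this sketch only.
HIGH_INFO_THRESHOLD = 1.5


def high_information_positions(motif, threshold=HIGH_INFO_THRESHOLD):
    """Return the 1-based motif positions whose information content meets the threshold."""
    return [
        position
        for position, column in enumerate(motif, start=1)
        if information_content(column) >= threshold
    ]


# Made-up three-position motif, for illustration only.
motif = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},  # strongly prefers A
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # no preference
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},      # always G
]

print(high_information_positions(motif))
```

A variant hitting the first or third position would change a base the motif depends on heavily, while a change at the second position is unlikely to matter much to the motif.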

The VEP now also has a dedicated area of the Ensembl website documentation.

Script version

To improve performance for users in the USA, we have now deployed a mirror of the public database server; to use this simply pass the flag “--host useastdb.ensembl.org” when running the script.

We have also implemented a caching system in the VEP, such that it is possible to use almost all of the functionality of the script without the script querying the database at all. Simply download and unpack a pre-built cache, run the script with the flag “--cache”, and hey presto! No more network dependencies.

We have now made “whole genome mode” the default run mode of the script – this code has been rewritten and optimized such that it should be suitable for all use cases. We’ve also improved the status output of the script as it runs, so users with lots of data can easily track their progress.

See the new documentation for further details on all of these new features, or just download the script!

Web version

It is now possible to filter your input variants by their frequency as observed in the 1000 Genomes or HapMap populations. You can either include or exclude input variants that are co-located with existing variants, based on frequencies in any particular population or across a range of populations.
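The include/exclude logic can be pictured with a small Python sketch. This is not how the web VEP is implemented; the record layout, population labels, frequencies and threshold are hypothetical, and treating a missing frequency as zero is an assumption made for this example.

```python
def is_common(frequencies, populations, threshold):
    """True if the frequency exceeds the threshold in any of the chosen populations.

    Assumption: a population with no recorded frequency counts as 0.0.
    """
    return any(frequencies.get(pop, 0.0) > threshold for pop in populations)


def filter_variants(variants, populations, threshold, mode="exclude"):
    """Filter input variants by the frequency of their co-located known variant.

    mode="exclude" drops variants that are common in the chosen populations;
    mode="include" keeps only those variants.
    """
    if mode not in ("include", "exclude"):
        raise ValueError("mode must be 'include' or 'exclude'")
    keep_common = mode == "include"
    return [
        v for v in variants
        if is_common(v["frequencies"], populations, threshold) == keep_common
    ]


# Hypothetical input: each variant carries the frequencies of any co-located
# existing variant (empty if there is none).
variants = [
    {"id": "var1", "frequencies": {"EUR": 0.20, "AFR": 0.02}},
    {"id": "var2", "frequencies": {"EUR": 0.001}},
    {"id": "var3", "frequencies": {}},
]

rare_in_eur = filter_variants(variants, ["EUR"], threshold=0.05, mode="exclude")
print([v["id"] for v in rare_in_eur])
```

Passing several populations, for example `["EUR", "AFR"]`, applies the same test across a range of populations: a variant counts as common if it exceeds the threshold in any of them.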

As before, you can access the web VEP through the tools page, or via the “Manage your data” link on any species-specific page.

Alongside our website, Ensembl provides direct access to our databases through our public MySQL server, ensembldb.ensembl.org. As of today, we are pleased to announce the availability of a second MySQL mirror, hosted on the east coast of the US. The new server is running on Amazon Cloud with the hostname

useastdb.ensembl.org

It can be accessed directly with the mysql client using port 5306 and username anonymous, e.g.:

mysql -h useastdb.ensembl.org -u anonymous -P5306

It may also be accessed through our Perl API with the following registry incantation:

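use strict;
use warnings;
use Bio::EnsEMBL::Registry;
# Load the registry from the US east coast mirror as the anonymous user
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db( -host => 'useastdb.ensembl.org',
                                  -port => 5306,
                                  -user => 'anonymous');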

useastdb will provide the current Ensembl release alongside the previous one on a rolling basis. This means that useastdb is currently hosting only the release 63 and 62 databases; this will become releases 64 and 63 after our next release. Our full set of older releases will continue to be hosted on ensembldb.ensembl.org.
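If you script against both servers, the rolling window is easy to encode. The Python helper below is a hypothetical convenience, not part of any Ensembl API: it simply prefers the mirror when the release you want falls inside the two-release window and falls back to the main server otherwise.

```python
MIRROR_HOST = "useastdb.ensembl.org"
MAIN_HOST = "ensembldb.ensembl.org"


def mirror_releases(current_release: int, window: int = 2) -> list[int]:
    """Releases held on the mirror: the current one and the one before it."""
    return [current_release - offset for offset in range(window)]


def host_for(release: int, current_release: int) -> str:
    """Pick the mirror if it holds the release, otherwise the main server."""
    if release in mirror_releases(current_release):
        return MIRROR_HOST
    return MAIN_HOST


# With release 63 as the current release:
print(mirror_releases(63))
print(host_for(62, current_release=63))
print(host_for(61, current_release=63))
```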

We hope that our users enjoy the faster access to our data that this new MySQL mirror should provide.