Have you ever spent time changing your favourite Ensembl view (for example adding new tracks, changing the track order, or uploading custom data) and wished you could easily send the configured display to a colleague through one simple URL? You can now do this on beta.ensembl.org.

Configurable images now have a link icon in their toolbars. If you click on this, it will give you a link to share with another user.

If you have any custom tracks turned on for the image, you will get the option to share these too (this is opt-in via checkboxes). This works with uploaded files, attached URLs, DAS and data hubs.

Custom tracks will only be shareable if they are displayed on the image (or in the case of data hubs, if any of the tracks in the hub are displayed).

If you send the URL to a colleague, they will see the image configured in the same way that you have it.

You can also share configurations for a whole page by using the Share this page button in the left menu.

Please try it out. If you encounter any problems, please use the Feedback button on the beta site to tell us about them (or email ensembl-beta@sanger.ac.uk), making sure to include the link you are trying to share.

We are pleased to announce that we are now providing access to the ENCODE integrative analysis data from within Ensembl. These analyses bring together a multitude of experiments targeted at determining functional elements in the human genome sequence. This data is provided from an external source (a track hub at the EBI). Although the Ensembl code supporting track hubs is still in preliminary form, we considered this ENCODE set important enough to release the code early so that we could provide access to it.

Important: Please read the instructions below before activating this data!

As this dataset is very large (over 2800 tracks) it is not configured on by default in the Ensembl browser. To add the ENCODE hub tracks, click on the link below. Warning: users of IE6 or IE7 should not do this because performance in those browsers is inadequate and the page will not load.

Link to add ENCODE integrative analysis hub

No tracks from the hub are switched on by default. To turn on tracks from ENCODE, go to ‘Configure this page’ and click on one of the submenus under ‘ENCODE data’, for example ‘ENCODE genome segmentations’. It will take a few seconds to bring up the track list. Then switch tracks on or off by clicking on the box next to the track name and choosing a track style. For genome segmentations the ‘Compact’ track style looks good.  More information on configuring the display is available in our recently released video tutorial on region in detail view. Here’s an example of a region showing a few ENCODE tracks (HepG2 and K562 genome segmentations and cytosolic RNASeq tracks):

If you no longer need access to the ENCODE set of tracks, the hub can be turned off by going to the ‘Manage your data’ link in the left hand menu, and clicking on the trash bin icon for the ‘ENCODE data’ source to delete it from the ‘Configure this page’ menu.

We will be working over the next few months to extend our track hub support, including improving performance and adding features to the configuration interface.

From release 68, we are using Sequence Ontology (SO) terms for the variation consequences, in an effort to standardise terms across the different browsers, making it easier for users to do a cross comparison of variation annotation.  The UCSC Genome Browser will use these terms on their SNP details page around mid-August, dbSNP will update their web display in the next few weeks and the ICGC also intend to standardise on SO terms for describing somatic mutation consequences.

At the same time, we have added a couple more specific consequences for SNPs and indels (splice donor variant and splice acceptor variant, for example), and consequences for larger structural variants are now available through the Variant Effect Predictor (VEP). The complete list of terms and definitions is in our documentation. As you will see, the SO equivalents for our old terms are fairly straightforward. The most notable difference is that we have replaced “non-synonymous” with the more specific term “missense” for amino acid changes that do not include stop gained, as we already have a specific term for stop gained.

The old Ensembl terms are still available on the website (using “Configure this page”), and if you have text files or VEP output files with our old Ensembl terms, you can easily update these to use the SO terms by running the following script.
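The linked script is not reproduced in this text. As a rough illustration of what such a conversion involves, here is a minimal Python sketch. It is not the Ensembl script. The mapping table is deliberately incomplete, and both the function name and the exact SO term strings are assumptions that should be checked against the documentation.

```python
import re

# Illustrative, incomplete mapping from old Ensembl consequence terms to
# SO terms (assumed spellings; verify each pair against the documentation).
OLD_TO_SO = {
    "NON_SYNONYMOUS_CODING": "missense_variant",
    "SYNONYMOUS_CODING": "synonymous_variant",
    "FRAMESHIFT_CODING": "frameshift_variant",
    "STOP_GAINED": "stop_gained",
}

# The \b anchors match whole terms only, so SYNONYMOUS_CODING is never
# matched inside NON_SYNONYMOUS_CODING ("_" counts as a word character).
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, OLD_TO_SO)) + r")\b")


def convert_line(line: str) -> str:
    """Replace every known old term in one line of text with its SO term."""
    return _PATTERN.sub(lambda match: OLD_TO_SO[match.group(1)], line)


if __name__ == "__main__":
    print(convert_line("rs1\tENST0001\tNON_SYNONYMOUS_CODING,STOP_GAINED"))
```

Terms that are not in the table are left untouched, so a partially converted file is easy to spot and the table can be extended incrementally.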

For release 67 we changed how we store the protein function predictions from SIFT and PolyPhen so that they also can be used for more than just Ensembl transcripts, including RefSeq transcripts. We use these tools to compute the predicted effect of every possible amino acid substitution in the human proteome (over 2 billion predictions!). Now, the complete set of predictions for a particular protein are retrieved using the protein sequence itself as an identifier rather than an Ensembl stable identifier (we actually use the MD5 hash of the sequence). This means that you can retrieve predictions for any protein that has the same amino acid sequence as an Ensembl translation. So if you work with RefSeq transcripts, you can now get SIFT and PolyPhen predictions for any missense variants that fall in the 95% of RefSeq transcripts that match an Ensembl transcript exactly, using both the Variant Effect Predictor (VEP) and the Variation API.
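To make the idea of using the sequence itself as the identifier concrete, here is a small Python sketch of a sequence-derived key. The function name and the normalisation step (stripping whitespace, upper-casing) are assumptions made for illustration; the text above only says that an MD5 hash of the sequence is used.

```python
import hashlib


def sequence_key(protein_sequence: str) -> str:
    """Return the MD5 hex digest of a peptide sequence.

    Assumption: whitespace is removed and residues are upper-cased before
    hashing, so formatting differences do not change the key.
    """
    normalised = "".join(protein_sequence.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Two records from different sources with the same amino acid sequence
    # (toy sequences, not real proteins) map to the same key.
    ensembl_translation = "MKTAYIAKQR"
    refseq_protein = "mktay iakqr\n"
    print(sequence_key(ensembl_translation) == sequence_key(refseq_protein))
```

Because the key depends only on the residues, any transcript set whose protein matches an Ensembl translation exactly can share the same stored predictions.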

Also new in release 67 are predictions from both classifier models supplied with PolyPhen. Previously we provided predictions using a classifier trained on the HumVar dataset, which is intended to distinguish severely deleterious alleles from the background of abundant variation with milder effects. This is still the default, but when using the API you can now also opt to use predictions from the classifier trained on the HumDiv dataset, which is intended to help evaluate rarer alleles potentially involved in complex disease. For more details on how these datasets are composed, please refer to the PolyPhen website.

The Variant Effect Predictor (VEP) software can predict the consequence of genomic variants using the genomic annotations provided by Ensembl. In release 63 of Ensembl we have added new features to both the script and web versions of the VEP.

Regulatory consequences have made their return; the VEP now reports if a variant falls within a regulatory region or a transcription factor binding motif, and furthermore if the variant falls in a high information locus within the motif.

The VEP now also has a dedicated area of the Ensembl website documentation.

Script version

To improve performance for users in the USA, we have now deployed a mirror of the public database server; to use this, simply pass the flag “--host useastdb.ensembl.org” when running the script.

We have also implemented a caching system in the VEP, such that it is possible to use almost all of the functionality of the script without the script querying the database at all. Simply download and unpack a pre-built cache, run the script with the flag “--cache”, and hey presto! No more network dependencies.

We have now made “whole genome mode” the default run mode of the script – this code has been rewritten and optimized such that it should be suitable for all use cases. We’ve also improved the status output of the script as it runs, so users with lots of data can easily track their progress.

See the new documentation for further details on all of these new features, or just download the script!

Web version

It is now possible to filter your input variants by their frequency as observed in the 1000 Genomes or HapMap populations. You can either include or exclude input variants that are co-located with existing variants, based on frequencies in any particular population or across a range of populations.
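The include/exclude logic can be sketched in a few lines of Python. This is only an illustration of the filtering idea, not the VEP implementation: the record layout, the function names and the "greater than threshold" rule are all assumptions, and the frequencies are made up.

```python
from typing import Dict, Iterable, List


def is_common(frequencies: Dict[str, float],
              populations: Iterable[str],
              threshold: float) -> bool:
    """True if a co-located known variant exceeds the threshold in any
    of the chosen populations (missing populations count as 0.0)."""
    return any(frequencies.get(pop, 0.0) > threshold for pop in populations)


def filter_variants(variants: List[dict],
                    populations: List[str],
                    threshold: float,
                    mode: str = "exclude") -> List[dict]:
    """Keep only common variants (mode='include') or drop them
    (mode='exclude')."""
    if mode not in ("include", "exclude"):
        raise ValueError("mode must be 'include' or 'exclude'")
    keep_common = mode == "include"
    return [v for v in variants
            if is_common(v["frequencies"], populations, threshold) == keep_common]


# Hypothetical input: frequency of the co-located existing variant per population.
variants = [
    {"id": "var1", "frequencies": {"CEU": 0.30, "YRI": 0.02}},
    {"id": "var2", "frequencies": {"CEU": 0.005}},
    {"id": "var3", "frequencies": {}},  # no co-located known variant
]

if __name__ == "__main__":
    rare = filter_variants(variants, ["CEU", "YRI"], 0.01, mode="exclude")
    print([v["id"] for v in rare])
```

Passing a single population restricts the test to that population, while passing several mimics filtering across a range of populations.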

As before, you can access the web VEP through the tools page, or via the “Manage your data” link on any species-specific page.

Alongside our website, Ensembl provides direct access to our databases through our public MySQL server ensembldb.ensembl.org. As of today, we are pleased to announce the availability of a second MySQL mirror hosted on the east coast of the US. The new server is running on Amazon Cloud with the hostname useastdb.ensembl.org. It can be accessed directly with the mysql client using port 5306 and username anonymous.

mysql -h useastdb.ensembl.org -u anonymous -P5306

It may also be accessed through our perl API with the following registry incantation:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# Point the registry at the US east coast mirror instead of ensembldb
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db( -host => 'useastdb.ensembl.org',
                                  -user => 'anonymous');

useastdb will provide the current Ensembl release alongside the previous one on a rolling basis. This means that useastdb is currently hosting only the release 63 and release 62 databases; after our next release it will host releases 64 and 63. Our full set of older releases will continue to be hosted on ensembldb.ensembl.org.

We hope that our users enjoy the faster access to our data that this new MySQL mirror should provide.

With some satisfaction, I am happy to announce the arrival of a new documentation resource with Release 63, intended to assist programmers in getting the most out of EnsEMBL.

Using a custom filter and the open source tool Doxygen we now bring you a more pleasing perspective on the EnsEMBL API, with the following features:

  1. Search box – find the class you need fast
  2. Inherited methods – no more hunting for superclasses
  3. Class and dependency diagrams – see how the API is structured
  4. Multiple perspectives – view by class, namespace, directory or method

The new reference can be found through the website, so update your bookmarks and have a look around. You might see some artifacts in the automated documentation, but we will be aiming to remove these as part of an ongoing effort to standardise code comments. I hope you enjoy the advantages of this new modern view of our API.

I’d like to introduce you to an exciting new data set that we’ve added in Ensembl release 62: RNASeq data from Illumina’s Human BodyMap 2.0 project. The data, generated on HiSeq 2000 instruments in 2010, consist of 16 human tissue types, including adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells. Raw reads are available for download here. For each tissue, we have aligned the raw reads to the genome and then linked exons into tissue-specific transcript models using the reads that span an exon-exon boundary.

You can view these data in the Region in Detail view. Click on ‘Configure this page’ and choose ‘RNA-Seq’ at the left of the main panel. Enable any or all of the 32 tracks and then close the configuration panel. Out of 32 possible tracks you can draw, 16 are tissue ‘gene model’ tracks, and 16 are ‘intron’ tracks.

The ‘gene model’ track shows you a transcript model. The ‘intron’ track shows you how many raw reads aligned across an exon-exon junction. The higher the intron block, the more highly expressed the transcript isoform is.
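The counting behind an intron track can be illustrated with a short Python sketch. It is not the Ensembl pipeline. It assumes each read is given as an ordered list of aligned blocks in 1-based inclusive coordinates, and that every gap between consecutive blocks is an intron rather than a deletion.

```python
from collections import Counter
from typing import List, Tuple

Block = Tuple[int, int]  # (start, end) of one aligned block of a read


def count_junction_reads(reads: List[List[Block]]) -> Counter:
    """Count, for each intron (start, end), the reads that splice across it."""
    counts: Counter = Counter()
    for blocks in reads:
        # Each pair of neighbouring blocks implies one exon-exon junction.
        for (_, left_end), (right_start, _) in zip(blocks, blocks[1:]):
            counts[(left_end + 1, right_start - 1)] += 1
    return counts


# Toy alignments: three reads span the intron 151-200, one spans 221-300,
# and one read is unspliced and so contributes nothing.
reads = [
    [(100, 150), (201, 250)],
    [(120, 150), (201, 230)],
    [(100, 140)],
    [(130, 150), (201, 220), (301, 320)],
]

if __name__ == "__main__":
    for intron, n in sorted(count_junction_reads(reads).items()):
        print(intron, n)
```

The height of an intron block in the track corresponds to the count attached to that intron here.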

In this example, the kidney gene model track shows a transcript (dark blue) with an exon structure that matches the gold-coloured Ensembl transcript AQP6-001. The kidney transcript model includes coding and noncoding exons (in the example above, the empty box is UTR, and the filled boxes are coding exons). Click on the kidney intron track to see that 192 raw reads were split between the first and second exons.

This example is interesting because it shows a gene with high expression in kidney tissue, and almost no expression in any other tissue.

The high read coverage for kidney means that the transcript’s exon-intron structure produced for the gene track has a good chance of being correct. When read coverage is very low, it is not always possible to build a full-length transcript model. Look at the colon and brain intron tracks to see that two colon reads and three brain reads have aligned across the transcript’s middle exon-exon junction. Although this read coverage is low, our pipeline has generated a transcript model for brain tissue. However, the pipeline was not able to predict the two splice junctions on either side, because no raw reads from brain aligned across them.

Below is a nice example of a gene that seems to be expressed in all 16 tissues, spermidine synthase (SRM).

Try dump_transcripts.pl as an example script to access the RNAseq-based transcript models. Have fun with these new data!

Have you noticed any strange-looking chromosome names when browsing the human data? For example, you might notice sequence region names looking like “Chromosome HSCHR17_2_CTG4: 68,302,419-68,526,413” or “Chromosome HG75_PATCH: 34,442,621-34,976,908”.

The names refer to genomic sequence that differs from the genomic DNA on the primary assembly. These alternate sequences come in two types: allelic sequence (haplotypes and novel patches) and fix patches. Haplotypes are known variations to the primary assembly, due to variability in the human genome sequence (e.g. the highly variable MHC locus containing haplotypes HSCHR6_MHC_COX, HSCHR6_MHC_SSTO, HSCHR6_MHC_APD, HSCHR6_MHC_DBB, HSCHR6_MHC_MANN, HSCHR6_MHC_MCF, and HSCHR6_MHC_QBL). Novel patches also represent new allelic loci but they are not necessarily haplotypes. Fix patches are where the primary assembly was found to be incorrect, and the patch reflects the corrected sequence. Haplotypes, novel patches and fix patches are determined by the GRC, not by Ensembl.

In the Ensembl browser, as in the figure below, the allelic sequences (haplotypic regions and novel patches) are coloured red and the fix patches are coloured green. If you have a look at the top image in Region In Detail for chromosome 17, you’ll see examples of both types of alternate sequence.


There are several ways to view alternate sequences in Ensembl:

  • If you know the name of the sequence you’re looking for, you can find it by searching in our Search bar.
  • You can view alternate sequence regions in the top image of any Location page, e.g. Region In Detail, Region Overview, Chromosome Summary.
  • Some alternate sequences are available through BioMart.
  • If you’re comfortable using MySQL, you can access the list through the assembly_exception table as follows:

mysql -uanonymous -hensembldb.ensembl.org -P5306 -Dhomo_sapiens_core_62_37g -e "select sr2.name as chr_name, exc_seq_region_start, exc_seq_region_end, exc_type, sr1.name as alternate_seq_name, seq_region_start, seq_region_end from assembly_exception ae, seq_region sr1, seq_region sr2 where sr1.seq_region_id=ae.seq_region_id and sr2.seq_region_id=ae.exc_seq_region_id order by chr_name, exc_seq_region_start"

Click here for the full list of e62 alternate sequences

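# Fetch all top-level slices, including non-reference (alternate) sequences
my $slices = $slice_adaptor->fetch_all( 'toplevel', undef, 1 );

# Fetch the assembly exception features that overlap a given slice
my $assembly_exception_features = $assembly_exception_feature_adaptor->fetch_all_by_Slice($slice);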

When using the API, the primary assembly is known as the ‘reference’ sequence and the alternate sequences are known as ‘non-reference’ sequences.


Ensembl provides annotations indicating regions in the genome that are experimentally verified to be bound by transcription factors (from ChIP-Seq experiments). Within these regions, we now also provide precise transcription factor binding sites. To generate these binding sites, we make use of publicly available Position Weight Matrices (PWM) from Jaspar.

Transcription factor binding sites can be seen as black boxes in the Regulatory Features track. If you click on a Regulatory Feature you can see information regarding the binding sites contained within that regulatory feature. This includes the binding matrix used and a binding score representing how well a particular site matches the binding matrix. Clicking on a specific black box within the regulatory feature will highlight the corresponding information on the menu (the darker blue line in the figure showing information for a CTCF binding site). Transcription factor binding sites are also displayed as evidence for a regulatory feature (as ‘Core PWM’ entries).

To generate these PWM matches we take Jaspar matrices and find matches throughout the genome. Then, we use experimental binding data to stringently choose high confidence binding sites that fall within regions enriched in ChIP-Seq experiments for the corresponding factor. More details on this process can be found here.
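For readers unfamiliar with PWM matching, the following Python sketch shows one common way to score how well a site matches a matrix: a log-odds score against a uniform background. It is a generic illustration under stated assumptions, not the Ensembl pipeline. The pseudocount, the background frequency of 0.25 and the toy count matrix are all invented for the example, and the sequence is assumed to contain only A, C, G and T.

```python
import math
from typing import Dict, List, Tuple

BASES = "ACGT"


def log_odds_matrix(counts: List[Dict[str, int]],
                    pseudocount: float = 1.0,
                    background: float = 0.25) -> List[Dict[str, float]]:
    """Turn per-position base counts into log2 odds versus the background."""
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + pseudocount * len(BASES)
        matrix.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return matrix


def score_site(matrix: List[Dict[str, float]], site: str) -> float:
    """Sum the per-position scores of a site with the same width as the matrix."""
    if len(site) != len(matrix):
        raise ValueError("site length must equal matrix width")
    return sum(column[base] for column, base in zip(matrix, site.upper()))


def best_match(matrix: List[Dict[str, float]], sequence: str) -> Tuple[float, int]:
    """Return (score, 0-based start) of the best-scoring window.

    Raises ValueError if the sequence is shorter than the matrix.
    """
    width = len(matrix)
    return max((score_site(matrix, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


# Toy three-column count matrix whose consensus is ACG (not a real Jaspar matrix).
toy_counts = [
    {"A": 8, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 8, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 8, "T": 0},
]

if __name__ == "__main__":
    pwm = log_odds_matrix(toy_counts)
    print(best_match(pwm, "TTACGTT"))
```

In a genome-wide setting the same scoring would be applied to every window, with a score threshold and overlap with ChIP-Seq enriched regions deciding which matches are kept.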