Ensembl 82 is scheduled for September 2015 and includes:

Updated gene sets and annotations

  • Human: variation data updated to dbSNP 144, including variants from the Exome Aggregation Consortium (ExAC)
  • Mouse: updated to GENCODE M7 annotation
  • Zebrafish: RNASeq data from developmental stages

Other highlights and data sets

  • Extra functionality for the VEP web and REST interfaces (via plugins)
  • Phenotype data updated for several species, including human, mouse, rat and horse
  • Updated HGMD (version 2015.2) and NHLBI Exome Sequencing Project data, plus HumanCoreExome-12 chip variants
  • Improved data upload web form

Second update to the GRCh37 site

This release will also see a fresh update to our dedicated GRCh37 site (grch37.ensembl.org) with:

  • Variant data from dbSNP 144, which includes variants from the Exome Aggregation Consortium (ExAC)
  • Updated variants from NHLBI Exome Sequencing Project and HGMD (2015.2)
  • Phenotype data from NHGRI-EBI GWAS, OMIM, ClinVar, UniProt, Orphanet and Decipher
  • Variants from the HumanCoreExome-12 chip

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.

What is new?

Gene Ontology (GO) annotations are some of the updates for S. pombe genes in Ensembl Fungi release 28

Protein domains updated from InterProScan for Fungi, Protists, Metazoa and Plants in this new release

  • Updated BioMart: Protists, Metazoa, Fungi and Plants.

Any questions or comments? Get in touch.

What’s new in e81:

Human gene set update and new assembly patches

The human gene set now corresponds to GENCODE 23 while the assembly has been updated to include new assembly patches for GRCh38.p3.

Mouse and zebrafish clone tracks

Mouse and zebrafish clone libraries have been imported from the NCBI clone database to replace our previous DAS tracks. The new clone tracks can be found under “Clones and misc regions” in the configuration menu on the left-hand side, while the coordinates for the BAC ends can be found as tracks under “Simple features”, also in the configuration menu.

New mouse regulatory build

The Regulatory Build on Mouse was re-computed, converting the “old style” build to the “new style” build introduced on human in e!76. All Regulatory Builds in Ensembl are now updated to the new style. We have also increased the number of mouse cell types to 8.

Transcript sequence mark up

Transcript sequences can now be marked up to show exons as alternating upper and lower case characters, rather than grey/blue text. Simply tick the “Show exons as alternating upper/lower case” box in the “Configure this page” panel on Transcript cDNA or Transcript Protein pages.

This markup option will also carry over to the sequence export if RTF format is chosen.
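
To picture what the alternating-case markup produces, here is a minimal shell sketch. It is not Ensembl code, and the three exon sequences are made up purely for illustration.

```shell
#!/bin/sh
# Made-up exon sequences, one per line, in transcript order.
exons="ATGGCC
TTAGGA
CCGTAA"

# Odd-numbered exons are printed in upper case, even-numbered ones in
# lower case, and everything is joined into one transcript string.
marked=$(printf '%s\n' "$exons" | awk '
  { printf "%s", (NR % 2 ? toupper($0) : tolower($0)) }
  END { print "" }')

echo "$marked"   # prints ATGGCCttaggaCCGTAA
```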

Other news

  • Mouse: updated to GENCODE M6 including HAVANA annotation, with the new assembly patches (GRCm38.p4)
  • Annotations now available in GFF3 format for all our species on our FTP site
  • Phenotype data updated for several species, including human, mouse, sheep and chicken
  • Sheep: updated gene set including lincRNA genes

A complete list of the changes can be found on the Ensembl website.

Ensembl 81 is scheduled for July 2015. Highlights include:

Updated gene sets and annotations

  • Human: updated to GENCODE 23 including manually annotated HAVANA annotation, with the new assembly patches (GRCh38.p3)
  • Mouse: updated to GENCODE M6 including HAVANA annotation, with the new assembly patches (GRCm38.p4)
  • Sheep: updated gene set including lincRNA genes

Other highlights and data sets

  • Phenotype data updated for several species, including human, mouse, sheep and chicken
  • New mouse regulatory build
  • Genomic alignments of mouse BAC and FOSMID clones
  • Annotations in GFF3 format for all species

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.

Ensembl Variation recently incorporated the latest versions of the dbSNP and 1000 Genomes datasets. While we are able to import all of the variant loci from phase 3 of the 1000 Genomes project, the vast amount of genotype data (2500 individuals x 80 million sites = 200 billion data points!!!) meant we had to create a new solution to deliver this data through our API and website.
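
As a quick sanity check of that figure, the multiplication can be reproduced in a shell, using the rounded numbers quoted above:

```shell
#!/bin/sh
# Rounded figures from the text: 2500 individuals and 80 million sites.
individuals=2500
sites=80000000

# One genotype per individual per site.
genotypes=$((individuals * sites))

echo "$genotypes"   # prints 200000000000, i.e. 200 billion
```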

To this end we have extended the Ensembl Variation API to read genotype data directly from tabix-indexed VCF files. The API then calculates frequency and linkage disequilibrium (LD) data from these genotypes on-the-fly. You can see this in action on a typical population genetics page on the website.

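
For a rough idea of what calculating frequencies from genotypes involves, here is a minimal sketch using awk. It is not the API's implementation, and the five genotypes are made up; real ones would come from the GT fields of a VCF file.

```shell
#!/bin/sh
# Made-up phased genotypes for one biallelic site:
# 0 = reference allele, 1 = alternate allele.
genotypes="0|0 0|1 1|1 0|1 0|0"

# $genotypes is left unquoted on purpose so that each genotype lands on
# its own line. awk splits every genotype into its two alleles, counts
# them, and divides each count by the total number of alleles seen.
freqs=$(printf '%s\n' $genotypes | awk -F'[|/]' '
  { for (i = 1; i <= NF; i++) { count[$i]++; total++ } }
  END { printf "ref=%.3f alt=%.3f\n", count["0"] / total, count["1"] / total }')

echo "$freqs"   # prints ref=0.600 alt=0.400
```

Six of the ten alleles are reference and four are alternate, hence 0.6 and 0.4.
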
In order to use this functionality with your local API installation, there are a couple of extra dependencies to install. You may even have them already!

Tabix

The tabix utility is used for rapid random access into compressed position-based text files. It also allows access to data across HTTP and FTP protocols, downloading only a small index file in the process.

To install it, we clone it from GitHub and run a couple of “make” statements. From here on we assume that you typically install things in your $HOME/src/ directory and that you are using bash or a bash-like terminal.

cd ~/src
git clone https://github.com/samtools/tabix.git
cd tabix
make
cd perl
perl Makefile.PL PREFIX=${HOME}/src/
make && make install

You may need the tabix binary in your path; you can either copy ~/src/tabix/tabix to a directory in your path, or add this to your path:

PATH=${PATH}:${HOME}/src/tabix/
export PATH
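
A quick way to confirm that the shell can now find the binary is sketched below. It only reports what it finds and changes nothing.

```shell
#!/bin/sh
# Report whether the tabix binary is visible on the current PATH.
if command -v tabix >/dev/null 2>&1; then
  tabix_status="found at $(command -v tabix)"
else
  tabix_status="not found; check your PATH"
fi
echo "tabix: $tabix_status"

# Once installed, typical usage is:
#   tabix <file.vcf.gz or URL> <chromosome:start-end>
```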

If it isn’t already, you should also add the relevant path to your PERL5LIB environment variable; the path in question is shown in the output from the “make && make install” command above.

PERL5LIB=${PERL5LIB}:${HOME}/src/lib/perl/5.14.2/
export PERL5LIB

ensembl-io

The ensembl-io package contains objects and methods for parsing and writing data formats commonly used in bioinformatics. If you installed the API using Git and Ensembl Git tools, chances are you already have the module.

If not, it’s simple to install with git:

cd ~/src
git clone https://github.com/Ensembl/ensembl-io.git
PERL5LIB=${PERL5LIB}:${HOME}/src/ensembl-io/modules
export PERL5LIB
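
If Perl later complains that it cannot find a module, it is worth checking that every directory in PERL5LIB really exists. A minimal sketch follows; check_lib_path is a hypothetical helper written for this post, not part of any Ensembl tool.

```shell
#!/bin/sh
# check_lib_path (hypothetical helper) prints "ok" or "missing" for each
# non-empty entry in a colon-separated list such as PERL5LIB.
check_lib_path() {
  printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r dir; do
    if [ -z "$dir" ]; then
      continue
    elif [ -d "$dir" ]; then
      echo "ok $dir"
    else
      echo "missing $dir"
    fi
  done
}

check_lib_path "${PERL5LIB:-}"
```

Any line starting with “missing” points at a path that was mistyped or not yet created.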

Using in the API

That’s it! Now to use this in an API script, there’s a simple flag we have to set on the Variation DBAdaptor object:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');

# Tell API to use VCFs
$variation_adaptor->db->use_vcf(1);

my $variation = $variation_adaptor->fetch_by_name('rs699');
my $alleles = $variation->get_all_Alleles();

foreach my $allele (@{$alleles}) {
  next unless 
    (defined $allele->population) &&
    (defined $allele->frequency);
  my $allele_string = $allele->allele;
  my $frequency = $allele->frequency;
  my $population_name = $allele->population->name;
  printf("Allele %s has frequency %.3g in %s\n", $allele_string, $frequency, $population_name);
}

This script should print out frequency data for a number of populations, including those from 1000 Genomes phase 3:

....
Allele A has frequency 0.121 in 1000GENOMES:phase_3:KHV
Allele G has frequency 0.879 in 1000GENOMES:phase_3:KHV
Allele A has frequency 0.149 in 1000GENOMES:phase_3:JPT
Allele G has frequency 0.851 in 1000GENOMES:phase_3:JPT
Allele A has frequency 0.295 in 1000GENOMES:phase_3:ALL
Allele G has frequency 0.705 in 1000GENOMES:phase_3:ALL

You can use the “->db->use_vcf(1)” stub on any adaptor from the variation adaptor group.

Once set, it will affect fetching objects of the following types:

  • Allele
  • PopulationGenotype
  • IndividualGenotype
  • LDFeatureContainer

Advanced configuration

The value we pass to use_vcf() also affects the behaviour of the API:

  • 0 : fetch data only from database
  • 1 : fetch data from VCFs and database
  • 2 : fetch data only from VCFs

One final thing; the API is pre-configured to use VCFs hosted on the Ensembl FTP site. It is also possible to use VCFs on your local machine or any arbitrary server. The configuration is found in the ensembl-variation folder:

cat ~/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json
{
 "collections": [
   {
     "id": "1000genomes_phase3",
     "species": "homo_sapiens",
     "assembly": "GRCh37",
     "type": "remote",
     "strict_name_match": 1,
     "filename_template": "ftp://ftp.ensembl.org/pub/grch37/release-79/variation/vcf/homo_sapiens/1000GENOMES-phase_3-genotypes/ALL.chr###CHR###.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.vcf.gz",
     "chromosomes": [
       "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22"
     ],
     "individual_prefix": "1000GENOMES:phase_3:"
   },
   {
     "id": "1000genomes_phase3",
     "species": "homo_sapiens",
     "assembly": "GRCh38",
     "type": "remote",
     "strict_name_match": 1,
     "filename_template": "ftp://ftp.ensembl.org/pub/release-80/variation/vcf/homo_sapiens/1000GENOMES-phase_3-genotypes/ALL.chr###CHR###.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz",
     "chromosomes": [
       "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22"
     ],
     "individual_prefix": "1000GENOMES:phase_3:"
   }
 ]
}

Feel free to edit the filename_template entry in this file. Note there are separate entries for the two currently supported human assemblies, GRCh37 and GRCh38; the relevant entries will be used depending on which port you connect to in your API script (3306 for GRCh38, 3337 for GRCh37).

“###CHR###” is a placeholder that allows the API to read from a set of files distributed as one per chromosome. This is not mandatory, and indeed a single genome-wide VCF file could be used. The only requirement is that the chromosomes contained in the VCF or set of VCFs are listed in the “chromosomes” field of the JSON configuration file.
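
To illustrate how the placeholder is filled in, the sketch below mimics the substitution in a shell. It is not the API's own code; expand_template is a hypothetical helper and the local path is made up.

```shell
#!/bin/sh
# Hypothetical local template with one VCF file per chromosome.
template='/data/vcf/ALL.chr###CHR###.genotypes.vcf.gz'

# expand_template (hypothetical helper): replace the placeholder in the
# template ($1) with a chromosome name ($2).
expand_template() {
  printf '%s\n' "$1" | sed "s/###CHR###/$2/g"
}

# Chromosome names as they would be listed in the "chromosomes" field.
for chr in 1 2 22; do
  expand_template "$template" "$chr"
done
```

A template without the placeholder passes through unchanged, which corresponds to the single genome-wide file case.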

Any questions, don’t hesitate to get in touch!

What is new?

  • Expansion of Protists and Fungi with hundreds of annotated genomes
  • Variation data for bread wheat, rice, Aedes aegypti, and Ixodes scapularis
  • Whole genome alignments for O. longistaminata and T. cacao
  • Non-coding RNA gene models in Bacteria
  • New assembly of tomato (version 2.50)
  • Full support for UCSC Track Hub format for hosting your own data in Ensembl

Expansion of Fungi and Protists

All protist and fungal genomes whose sequence and annotation are complete and submitted to the International Nucleotide Sequence Database Collaboration (INSDC) have now been included in Ensembl Genomes. Future releases will continue to be updated with all newly submitted sequences.

Mushrooms, mould, parasites of animals and plants: just to name a few genomes in the comprehensive and varied collection of new species in Fungi and Protists.

Ensembl Fungi now contains 408 genomes from 271 species, with 355 new genomes from 236 species included in this release.

Ensembl Protists now contains 133 genomes from 91 species, with 101 new genomes from 66 species.

The new genomes are available on our websites, MySQL databases, and APIs (REST and Perl). We are currently working on making them available in BioMart and expect this to be available in release 28.

A representative selection of 57 protists and 191 fungi genomes have been added to our comparative genomics analyses, making gene trees and homology calls available.

Gene tree of the RBP gene in Leishmania mexicana, one of the many new genomes in Ensembl Protists, and its 12 orthologues.

New variation data for bread wheat

Variation data provided by the HapMap consortium is now available in Ensembl Plants for bread wheat. The data was generated by re-sequencing 62 diverse wheat lines. In total, 1.57 million SNPs and 162 thousand small indels were identified across the 21 chromosomes of bread wheat. In addition, genotypes of 475 individuals from the Axiom 820K SNP Array, provided by CerealsDB, have been added.

SNPs and short indels from the wheat HapMap and CerealsDB, annotated with VEP, can be viewed in our Ensembl Plants browser.

Other news

  • Updated gene models in Metazoa, Protists and Fungi
  • Updated comparative genomics across all divisions
  • New probe data for barley
  • Updated BioMarts

A complete list of both new and updated data can be found on our website.

Any questions or comments? Get in touch.

biomaRt is a Bioconductor package that makes accessing and retrieving Ensembl data from R very easy. The recent Bioconductor 3.1 release includes a new version of biomaRt packed with many new Ensembl-friendly functions, allowing you to connect to the Ensembl marts and retrieve data in record time.

To celebrate the new Bioconductor release, we’ve just launched a brand new mart documentation page. This new documentation covers the biomaRt package, and also how to combine species data and how to use the BioMart RESTful and Perl APIs.

Want to get some Ensembl data from BioMart using biomaRt? Easy, just follow the simple guide below.

How can I install the biomaRt R package?

First make sure you have installed R on your computer. Then run the following commands from your R terminal to install the Bioconductor biomaRt package:

source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")

What are the Ensembl marts?

The following function lists the currently available Ensembl marts:

> library(biomaRt)

> listEnsembl()

     biomart               version
1    ensembl               Ensembl Genes 80
2        snp               Ensembl Variation 80
3 regulation               Ensembl Regulation 80
4       vega               Vega 60
5      pride               PRIDE (EBI UK)

Which Ensembl species have Variation data?

The listDatasets function will list all the species available for a given mart.

> library(biomaRt)

> variation = useEnsembl(biomart="snp")

> listDatasets(variation)

What data can I get from the Variation mart (filters and attributes)?

The listFilters and listAttributes functions will give you the list of all the filters and attributes available for a given mart.

> library(biomaRt)
 
> variation = useEnsembl(biomart="snp", dataset="hsapiens_snp")

> listFilters(variation)

> listAttributes(variation)

How can I get data about a variant using an rsID?

In the following example, you will be able to retrieve Variation source, Chromosome locations, Minor allele, Frequency and count, Consequences, Ensembl Gene and Transcript IDs for the Variation name “rs1333049”.

> library(biomaRt)
 
> variation = useEnsembl(biomart="snp", dataset="hsapiens_snp")

> rs1333049 <- getBM(attributes=c('refsnp_id','refsnp_source','chr_name','chrom_start','chrom_end','minor_allele','minor_allele_freq','minor_allele_count','consequence_allele_string','ensembl_gene_stable_id','ensembl_transcript_stable_id'), filters = 'snp_filter', values ="rs1333049", mart = variation)

> rs1333049

How can I get data on all genes on a chromosome?

In the following example, you will be able to retrieve Ensembl Gene IDs, HGNC symbols and biotypes located on the human chromosome Y.

> library(biomaRt)

> ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")

> chrY_genes <- getBM(attributes=c('ensembl_gene_id','gene_biotype','hgnc_symbol','chromosome_name','start_position','end_position'), filters = 'chromosome_name', values ="Y", mart = ensembl)

> chrY_genes 

How can I get protein domain information mapped to an Ensembl Gene ID?

In the following example, you will be able to retrieve Ensembl Gene, Transcript and Protein IDs, plus InterPro and Pfam protein domain IDs and locations, mapped to the Ensembl Gene ID “ENSG00000198763”.

> library(biomaRt)

> ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")

> domain_location_ENSG00000198763 <- getBM(attributes=c('ensembl_gene_id','ensembl_transcript_id','ensembl_peptide_id','interpro','interpro_start','interpro_end','pfam','pfam_start','pfam_end'), filters ='ensembl_gene_id', values ="ENSG00000198763", mart = ensembl) 

> domain_location_ENSG00000198763

The Bioconductor biomaRt R package and complete documentation can be found on the biomaRt Bioconductor page.

As we announced in our e80 release post, we have rolled out a prototype display for cis-regulatory interactions: essentially arcs connecting any two elements on the same chromosome. It was designed mainly to let you display your eQTL or Hi-C data with little effort, and is based on the WashU Epigenomics Explorer interaction track.

All you need to do is prepare a tab-delimited interaction file, optionally indexing it with Tabix if it is too large to upload directly. You can use it to represent intra-chromosomal rearrangements, such as a micro-inversion.
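
Indexing a large file with Tabix, as mentioned above, might look like the sketch below. The rows are made up, and the column layout (chromosome, start and end in the first three columns, followed by a placeholder) is an assumption for illustration only; please check the format documentation for the real column definitions.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Made-up rows: chromosome, start, end, then a placeholder column.
printf '%s\t%s\t%s\t%s\n' \
  chr2 500 600 placeholder \
  chr1 300 400 placeholder \
  chr1 100 200 placeholder > "$workdir/interactions.txt"

# Tabix needs the file sorted by chromosome, then by start position.
sort -k1,1 -k2,2n "$workdir/interactions.txt" > "$workdir/interactions.sorted.txt"

# Compress and index only if both tools are installed.
if command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1; then
  bgzip -c "$workdir/interactions.sorted.txt" > "$workdir/interactions.sorted.txt.gz"
  # -s, -b and -e name the sequence, begin and end columns.
  tabix -f -s 1 -b 2 -e 3 "$workdir/interactions.sorted.txt.gz"
fi

cat "$workdir/interactions.sorted.txt"
```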

@RNA3DHub even suggested using it to display RNA secondary structure.

You can do what you want really!

What’s new in e80:

1000 Genomes phase 3 and dbSNP build 142

We are happy to announce that Human dbSNP 142 incorporating 1000 Genomes phase 3 data is now available for the GRCh38 assembly.

Gene Expression Atlas Widget

The Gene Expression Atlas widget has been embedded into Ensembl. You can now view where the gene is expressed anatomically and also which experiment it is associated with.

Updated zebrafish and rat gene annotation based on the GRCz10 and Rnor_6.0 assemblies

We are really excited to release the full gene annotation, dbSNP and microarray updates for rat (Rnor_6.0) and zebrafish (GRCz10).

Track label improvement

Some tracks in images now appear within sections, so that related tracks can be grouped within a category.

Each section is identified by a heading underlined in a particular colour, and each track within that section is marked with the same colour on the left-hand side.

Also, some tracks now have labels within the image itself, to allow longer descriptions. These in-image labels can be configured on or off via the configuration panel.

New user data track type: long-range interactions

We are very pleased to announce that Ensembl now supports long-range pairwise interaction data, which can be drawn as arcs on Region in Detail. Scores are indicated using a grey-to-black gradient, and labels can be displayed by selecting the appropriate track style from the configuration menu.

Initially we support the two formats developed by WashU for their Epigenomics browser. More information on both formats can be found in our online documentation.

We hope to support more formats in the future, so please let us know which formats you are currently using!

Other news

  • Mouse GENCODE M5 (GRCm38.p3): An updated version of the GENCODE gene set
  • We’ve imported sequence variants from dbSNP, including build 142 for human
  • New BioMart documentation
  • New export options for comparative views (homologues, gene trees and OrthoXML filtering)
  • Gap initiation update for BLAT and BLAST

A complete list of the changes can be found on the Ensembl website.

Ensembl 80 is scheduled for May 2015. Highlights include:

Variation data imports and updates

  • 1000 Genomes phase 3 studies will be imported for human
  • Variant locations will be added from the ExAC project
  • The latest sequence variants will be imported from:
    • dbSNP build 142 for human, mouse, zebrafish and cow
    • dbSNP build 143 for sheep and pig

Updated gene sets and annotations

  • Mouse GENCODE M5 (GRCm38.p3): An updated version of the GENCODE gene set
  • Updated zebrafish gene annotation based on the GRCz10 assembly
  • Updated rat gene annotation based on the new Rnor_6.0 assembly
  • RefSeq genomic to mRNA comparison attributes will be added for human

New web features

  • New export options for comparative views (homologues, gene trees and OrthoXML filtering)
  • New display styles for BigWig files on karyotype
  • Support for long-range interaction data

Other updates

  • The pairwise and multiple alignments have been updated to use the new Zebrafish and Rat assemblies

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.