Boom!

Due to a somewhat catastrophic hardware failure during our production cycle for Ensembl release 79, we have only been able to release human dbSNP 142 incorporating 1000 Genomes phase 3 data on our GRCh37-based services (Web: grch37.ensembl.org, REST API: grch37.rest.ensembl.org, MySQL: anonymous@ensembldb.ensembl.org:3337, VEP: grch37.ensembl.org/vep). We hope to have these data on GRCh38 for Ensembl release 80; our GRCh38 database is based on dbSNP 138 until then.

VEP and GRCh38

For users of the Variant Effect Predictor (VEP) living in the brave new GRCh38 world, we have made available a VCF file which can be used to incorporate IDs and allele frequencies from the 1000 Genomes phase 3 data into your results. See this handy guide for details.
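If you would like to look inside the VCF yourself before feeding it to the VEP, the following is a minimal Python sketch of how an ID and an allele frequency could be pulled out of one VCF record. The sample record is made up, and the INFO key `AF` is an assumption; check the header of the real file for the keys it actually uses.

```python
def parse_vcf_record(line):
    """Split one tab-separated VCF data line into its core fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, _qual, _filter, info = fields[:8]
    info_map = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue
        key, _, value = entry.partition("=")
        # INFO flags have no "=value" part; store them as True
        info_map[key] = value if value else True
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),
        "info": info_map,
    }


# Made-up record for illustration only; "AF" is an assumed INFO key
SAMPLE = "1\t10177\trs0000001\tA\tAC\t.\t.\tAF=0.25;EUR_AF=0.4"

record = parse_vcf_record(SAMPLE)
print(record["id"], record["info"].get("AF"))
```

For real work on a bgzipped, indexed file you would normally reach for a dedicated VCF library rather than splitting lines by hand.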

Viewing phase 3 data on the web

Using the same VCF as above, you can attach this as a custom data file to your Ensembl browser; this will allow you to see the genomic locations of the variants and their consequence types on our Region in Detail view.

Attach the following URL as a VCF (indexed): ftp://ftp.ensembl.org/pub/release-79/variation/VEP/1KG.phase3.GRCh38.vcf.gz


You can then see the data alongside the tracks from our database.

After running our brand new Ensembl Regulatory Build on the human GRCh38 and GRCh37 assemblies, we spent some time revamping our Regulation mart to make it faster and easier to use, and to pack it with brand new features. The mart has been completely redesigned behind the scenes so that it performs better and can cope with the growing volume of data. The new Regulation mart can still be found on the Ensembl website under the BioMart tab.

The first thing you will notice is that you can now access each regulation data type separately. This lets you retrieve the data quickly and keeps the filter and attribute sections neater.

New Regulation mart dataset dropdown

Each Regulation section holds its own type of data. For example, you can get the following data for human GRCh38:

  • Binding Motifs (TFBS Annotation)
    • Binding Matrix ID (e.g. MA0003.1)
    • Feature type data (e.g. BHLHE40, CTCFL,…)
  • Other Regulatory Regions
    • Feature type data (e.g. FANTOM predictions, VISTA Enhancers)
    • Identifiers (e.g. hs1, 1:922877-923268,…)
  • Regulatory Evidence (Regulatory Build Information)
    • Feature type data (e.g. ATF3, DNase1,…)
    • Cell type (e.g. A549, H1ESC,…)
    • Feature Type class (e.g. Histone, Polymerase, Transcription Factor, Open Chromatin)
    • Project name (e.g. ENCODE and Roadmap Epigenomics)
    • SRA Experiment Accession (e.g. SRX018823, SRX056730,…)
  • Regulatory Features (Regulatory Build Information)
    • Ensembl Regulatory Stable ID (e.g. ENSR00001516677)
    • Feature type (e.g. CTCF Binding Site, Enhancer, Open chromatin, Promoter, Promoter Flanking Region, TF binding site)
    • Cell type (e.g. A549, DND-41, GM12878, H1ESC,…)
  • Regulatory Segments (Segmentation Information)
    • Feature type (e.g. CTCF enriched, Predicted Enhancer, Predicted heterochromatin, Predicted low activity, Predicted Promoter flank, Predicted Promoter with TSS, Predicted Repressed, Predicted Transcribed Region)
    • Cell type (e.g. HeLa-S3, HepG2)
  • miRNA Target Regions (TarBase miRNA target predictions)
    • miRNA Identifier ID (e.g. hsa-miR-124-3p, hsa-miR-122-5p,…)
    • miRNA Accession ID (e.g. MIMAT0000069, MIMAT0000646,…)

Binding Motifs, Regulatory Evidence, Regulatory Features and miRNA Target Regions data are also available for mouse, and Other Regulatory Regions data are available for both mouse and fruit fly.

In addition to the above, we have added the following brand new information:

  • Band Start/End, Marker Start/End and ENCODE Pilot Regions filters to the six Regulation data sections
  • SO name and accession to the six Regulation data sections
  • EFO Term accession to Regulatory Evidence, Regulatory Features and Regulatory Segments
  • “Has evidence”, which denotes whether Regulatory Features have supporting evidence in a particular cell type
  • Chromosome Strand and Evidence to miRNA Target Regions

Working with R?

Did you know that you can access all our marts using the biomaRt Bioconductor R package?

To do this, first install the Bioconductor biomaRt package: http://bioconductor.org/packages/release/bioc/html/biomaRt.html.

The following R code will then give you the chromosome location and scores for the human GRCh38 Binding matrix ID MA0005.2:

library(biomaRt)

# Connect to the Regulation mart and select the human motif feature dataset
ensembl_regulation <- useMart(biomart = "ENSEMBL_MART_FUNCGEN", host = "www.ensembl.org", dataset = "hsapiens_motif_feature")

# Retrieve location, strand and score for binding matrix MA0005.2
binding_matrix <- getBM(attributes = c("binding_matrix_id", "chromosome_name", "chromosome_start", "chromosome_end", "chromosome_strand", "score"), filters = "motif_binding_matrix_id", values = "MA0005.2", mart = ensembl_regulation)
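If you are not working in R, the same request can be written as a BioMart XML query. The Python sketch below only builds the query and a URL for it; nothing is sent over the network. The dataset, filter and attribute names are taken from the R example, while the `martservice` URL and the `Query` settings (formatter, header and so on) are assumptions based on the usual BioMart conventions, so treat it as a starting point rather than a definitive recipe.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr


def build_biomart_query(dataset, attributes, filters):
    """Assemble a BioMart XML query string for one dataset."""
    parts = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        "<!DOCTYPE Query>",
        '<Query virtualSchemaName="default" formatter="TSV" header="1" '
        'uniqueRows="0" datasetConfigVersion="0.6">',
        f'<Dataset name={quoteattr(dataset)} interface="default">',
    ]
    for name, value in filters.items():
        parts.append(f"<Filter name={quoteattr(name)} value={quoteattr(value)}/>")
    for name in attributes:
        parts.append(f"<Attribute name={quoteattr(name)}/>")
    parts += ["</Dataset>", "</Query>"]
    return "".join(parts)


query = build_biomart_query(
    dataset="hsapiens_motif_feature",
    attributes=[
        "binding_matrix_id",
        "chromosome_name",
        "chromosome_start",
        "chromosome_end",
        "chromosome_strand",
        "score",
    ],
    filters={"motif_binding_matrix_id": "MA0005.2"},
)

# Assumed service location; the query is passed as a URL-encoded parameter
url = "http://www.ensembl.org/biomart/martservice?" + urlencode({"query": query})
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the results as tab-separated text if the assumptions above hold.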


The new Regulation mart is available for human, mouse and fruit fly on both www.ensembl.org and www.grch37.ensembl.org.

Annotation of the most recent human assembly, GRCh38, was released in e76 in August 2014. Since then we have been maintaining a site dedicated to the GRCh37 assembly. We continue to update annotations on the previous human assembly to support users who still have data annotated on the old assembly and who cannot yet run their analyses on the new one. The Genome Reference Consortium (GRC) keeps a blog on the assemblies that they maintain, which may be a good source of information if you are still contemplating a move to GRCh38. If you are wondering about the migration from GRCh37 to GRCh38 within Ensembl, we published a blog series which may be of interest.

We are now pleased to announce that the GRCh37 archive site has been updated with new human data sets. In addition to data imports, we have also utilised the improved regulatory build pipeline for mapping all available human regulatory features to the GRCh37 assembly. We have also re-built the GRCh37 Ensembl, Regulation and Variation BioMarts to integrate the updated data sets.

Highlights from some of the data imports for this release are:

  • Genotypes from 1000 Genomes Phase 3
  • dbSNP142 human data
  • Latest release of public HGMD data (version 2014.4)
  • COSMIC version 71
  • RefSeq GFF3 annotation

A complete list of the changes can be found on the Ensembl GRCh37 website.

In line with EMBL-EBI policy, from the end of 2015 Ensembl will be removing support for DAS from our browser. This means that we will no longer provide our annotations over DAS and that we will not visualise third party annotation provided to us via DAS. If you have data with genomic coordinates that you wish to present in Ensembl then we recommend that you do this using TrackHubs. We are currently working on support for annotation on other coordinate systems and will announce developments in this area over the course of the coming year. If you need more details then please get in touch with us at helpdesk@ensembl.org.

Highlights


Explore the new variation data in the plant pathogen Z. tritici. This variation database was constructed from study SRP017760, downloaded from ENA.

Genome comparisons for Triticeae and related species

Whole genome alignments between seven pairs of Triticeae genomes, including the bread wheat A, B and D component genomes, Triticum urartu (the A genome progenitor), Aegilops tauschii (the D genome progenitor), and barley are now available. The alignments were obtained using ATAC, and the statistics on genome and coding exon coverage can be found on our website. See the ATAC results for the comparison between the T. urartu and A. tauschii genomes.

MAF files on our FTP sites

In response to several requests from our users we now provide the pairwise alignments as MAF files. These can be found on the FTP download site of all Ensembl Genomes divisions. See an example of this data in Ensembl Metazoa.
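To give a feel for what is inside a MAF file, here is a minimal Python sketch that reads alignment blocks from MAF text. It handles only the `a` (block) and `s` (sequence) lines, and the species names and coordinates in the sample are invented for illustration.

```python
def parse_maf(text):
    """Return a list of alignment blocks, each a list of sequence rows."""
    blocks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "a" or line.startswith("a "):
            # An "a" line opens a new alignment block
            current = []
            blocks.append(current)
        elif line.startswith("s ") and current is not None:
            _, src, start, size, strand, src_size, seq = line.split()
            current.append({
                "src": src,
                "start": int(start),      # zero-based start on the given strand
                "size": int(size),        # number of non-gap bases
                "strand": strand,
                "src_size": int(src_size),
                "sequence": seq,          # aligned text, gaps shown as "-"
            })
    return blocks


# Invented example block; real files hold many blocks
SAMPLE_MAF = """##maf version=1
a score=100.0
s species_a.chr1 10 8 + 1000 ACGT-ACGT
s species_b.chr2 50 7 - 2000 ACGTTAC--
"""

blocks = parse_maf(SAMPLE_MAF)
for row in blocks[0]:
    print(row["src"], row["start"], row["start"] + row["size"], row["strand"])
```

Every row in a block has the same aligned length, so columns can be compared position by position across species.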

Other news

A complete list of both new and updated data can be found on our website.

Get in touch if you have any questions or comments.

The Ensembl Genomes team

What’s new in e79:

Human update

The human gene set now corresponds to GENCODE 22 while the assembly has been updated to include new assembly patches for GRCh38.p2.

Corrected RYBP gene in GRCh38.p2 assembly

The HG-126 patch (KN538364.1) in GRCh38.p2 corrects a misassembly in this region which affects the RYBP gene.

Comparison of human RefSeq transcripts to Ensembl models

In this release we provide comparisons between the imported human RefSeq transcripts and all overlapping Ensembl models. The comparison is done at the transcript level: the genomic coordinates of all exons are compared, as are the transcript sequences of the two models. We have also compared the genomic sequences of the RefSeq transcripts to the Ensembl models. Both of these data sets are available via our API.
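As a rough illustration of what a transcript-level comparison involves, the Python sketch below compares two models by their exon coordinates and, optionally, their sequences. This is not the code used in our pipeline; the function name, the result keys and the coordinates are all hypothetical.

```python
def compare_transcripts(exons_a, exons_b, seq_a=None, seq_b=None):
    """Compare two transcript models given as lists of (start, end) exons."""
    set_a, set_b = set(exons_a), set(exons_b)
    result = {
        # True only when both models have exactly the same exon coordinates
        "exon_match": sorted(exons_a) == sorted(exons_b),
        "shared_exons": len(set_a & set_b),
        "only_in_a": sorted(set_a - set_b),
        "only_in_b": sorted(set_b - set_a),
    }
    if seq_a is not None and seq_b is not None:
        result["sequence_match"] = seq_a == seq_b
    return result


# Hypothetical models: same first and last exon, different middle exon
model_a = [(100, 200), (300, 400), (500, 650)]
model_b = [(100, 200), (300, 410), (500, 650)]

comparison = compare_transcripts(model_a, model_b, "ACGTACGT", "ACGTACGT")
print(comparison)
```

In this made-up case the spliced sequences are identical but one exon boundary differs, which is why exon coordinates and sequences are worth checking separately.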

Gene gain/loss tree view

It is now possible to view the Gene gain/loss tree with our new interactive view (which uses the same engine as the species tree view introduced last release). Click on the toolbar to change the layout of the tree or on a node to get its details and expand, collapse, or focus on it.


Global Alliance REST Endpoints

Our REST server now implements the Global Alliance for Genomics and Health (GA4GH) Genomics API. The aim of the GA4GH API is to allow the interoperable exchange of genomic information across multiple organisations and on multiple platforms – see http://ga4gh.org/#/api for further details. Phase 3 genotype data from the 1000 Genomes project is now available from Ensembl via three new GA4GH endpoints.
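A search against a GA4GH-style endpoint is typically a POST request with a JSON body. The Python sketch below builds such a request without sending it. The server address, the endpoint path and the field names in the body are assumptions made for illustration, and the variant set ID and coordinates are placeholders; please check the REST documentation for the endpoints and fields that are actually supported.

```python
import json
from urllib.request import Request

REST_SERVER = "http://rest.ensembl.org"   # assumed server address
ENDPOINT = "/ga4gh/variants/search"       # assumed endpoint path


def build_variant_search(variant_set_id, reference_name, start, end, page_size=10):
    """Build (but do not send) a POST request for a variant search."""
    body = {
        # Field names are assumptions, not confirmed against the API
        "variantSetId": variant_set_id,
        "referenceName": reference_name,
        "start": start,
        "end": end,
        "pageSize": page_size,
    }
    return Request(
        REST_SERVER + ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


# Placeholder variant set ID and region
request = build_variant_search("1", "22", 1000000, 1001000)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it; the response would then be JSON that can be decoded with `json.load`.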

Import of NextGen project sheep genotype data

Genotype data for three sheep populations have been imported from the NextGen project, which aims to preserve the biodiversity of farm animals. The imported populations are Iranian Ovis aries, Iranian Ovis orientalis and Moroccan Ovis aries. You can read more about other NextGen data sets on the Ensembl projects page.

Other news

  • Updated human HAVANA annotation in Vega
  • Import of phenotype and disease data from the Rat Genome Database (RGD)
  • RefSeq GFF3 annotations for the majority of Ensembl species
  • Addition of non-coding genes to the vervet-AGM gene set
  • Updated APPRIS flags for human, mouse, rat, zebrafish with addition of pig
  • Assembly and gene set update for Drosophila to BDGP6 (FB2014_05)

A complete list of the changes can be found on the Ensembl website.

Find out more at the Ensembl Release Webinar e79 (16.00 GMT, Thursday 26th March). Register here (for free!).

Are you looking for whole genomes, protein sequences, alignments or other genome-wide data from Ensembl?

Look no further; our FTP site is the place for you:

  • Download our data from the current release only (i.e. Ensembl 78)
  • Download our data from current and previous releases (including GRCh37)

These are some of our data that can be downloaded in bulk and for free; file types are described in brackets:

  • DNA, cDNA, CDS, ncRNA sequences (FASTA)
  • Annotations of our coding and non-coding genes (GTF)
  • Annotation of regulatory elements for the human and mouse genomes (GFF)
  • Variation data (VCF) for more than 20 Ensembl species
  • RNASeq reads (BAM) aligned against 25 genomes
  • GERP scores to identify constrained elements (BED)
  • Alignments of resequencing data for several species (EMF)
  • Multiple and pairwise genome alignments (MAF)
  • Ensembl databases for local installation (MySQL)
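Most of these formats are plain text and easy to work with in a few lines of code. As an example, here is a minimal Python sketch that reads one line of a GTF annotation file. The sample line and its identifiers are invented, and the parser only captures attributes whose values are quoted.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line(line):
    """Split one GTF line into its nine columns and an attribute dictionary."""
    cols = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand, frame, attributes = cols
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),   # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": dict(ATTRIBUTE.findall(attributes)),
    }


# Invented example line; identifiers are placeholders
SAMPLE_GTF = (
    "1\tensembl\tgene\t1000\t2000\t.\t+\t.\t"
    'gene_id "GENE0001"; gene_name "EXAMPLE"; gene_biotype "protein_coding";'
)

gene = parse_gtf_line(SAMPLE_GTF)
print(gene["attributes"]["gene_id"], gene["end"] - gene["start"] + 1)
```

Real files also contain comment lines starting with `#`, which should be skipped before parsing.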

How can the Ensembl FTP foster research?

Let’s look at coiled-coils, simple dimers in protein sequences found in many species and believed to enable protein-protein interaction in a variety of biological processes.


Structure of coiled-coil domain from PDBe. Homohexameric assembly by Li et al. (2014)

Coiled-coil domains differ immensely from their globular counterparts, and distinct evolutionary constraints on them are expected. How conserved are coiled-coils? What has driven their evolution?

Intrigued by these questions, Surkont and Pereira-Leal (2015) set out on a journey to compare different protein sequences across several vertebrates and yeast. They show that substitution patterns do differ in coiled-coil versus globular regions, and they developed an evolutionary model to improve the detection of coiled-coils by homology and the inference of their phylogeny.

Where did Surkont and Pereira-Leal find these proteomes for their investigation? On our FTP site.

Why not explore the Ensembl FTP site to see what we’ve got in store for you?

Any comments or questions, just get in touch.

My recent trip to Malawi as part of a Wellcome Trust Open Door Workshop has reminded me how privileged I really am. I’m an Outreach Officer, which means that I have the privilege to travel out to institutes around the world to deliver free Ensembl workshops. Most of the time, these workshops are in Europe or the US, at fancy research institutes and universities, and it’s an awesome privilege to facilitate research at these institutes.

An even greater privilege is to be involved in the Open Door Workshops on Working with the Human Genome Sequence, organised by Wellcome Trust Advanced Courses, which travel to developing countries to teach. They’re called ‘Open Door’ because all the resources we teach in them are free and open on the web, which means anyone, anywhere, with nothing but an internet connection can use them. I teach the Ensembl section of the course, but we also cover other resources from the EBI, Sanger Institute, NCBI and elsewhere.

We hold these courses at Wellcome Trust research centres, such as the Malawi-Liverpool Wellcome Trust I visited recently. These centres are fantastic investments by the Wellcome Trust in research around the world. Participants travel from all over the continent to attend the course; attendance is free (with selection) and the Wellcome Trust can even fund travel bursaries. It is a great privilege for me to be able to travel to these locations and to teach participants all about Ensembl.

Group photograph

The group from the Open Door Workshop at the Malawi-Liverpool Wellcome Trust. Featuring instructors me (seated, second from left), Jane Loveland (Sanger Institute; seated, middle), Rob Finn (EBI; back row, far left), Charlie Steward (Sanger Institute; back row, middle) and Matt Clark (TGAC; back row, second from right). Photo by Heidi Hauser (Wellcome Trust Advanced Courses).

I am proud to present Ensembl to these workshop participants. Partly because I think it’s an amazing resource that can really facilitate research. Partly because we give it away for free, and I know this makes a huge difference to researchers whose labs are not well funded. Even in labs with £1 million grants, money is always tight, but many of the people who attend our workshops work in labs that struggle with knackered PCR machines, ghost equipment for which they can’t afford the reagents, and a complete reliance on Open Access publishing because they can’t pay for journal subscriptions. Yet they still manage to produce world-class science. If they had to choose between replacing those broken machines and a pay-per-use or subscription-only bioinformatics resource, it would really be a no-brainer. But giving them a free resource means they don’t have to make that choice. Indeed, it gives them the opportunity to carry out research that doesn’t need any expensive equipment or reagents.

The Wellcome Trust is one of the major funders of Ensembl. We are so grateful to them for allowing us to make our data freely available, so that everybody can make use of it. It really is a privilege.

Ensembl 79 is scheduled for March 2015. Highlights include:

Updated gene sets and annotations

  • Human GENCODE release 22 (GRCh38.p2): An updated version of the GENCODE gene set, which combines Havana’s manual annotation and Ensembl’s evidence-based automatic annotation, will be released
  • Assembly patches will be added and annotated for the new human assembly GRCh38.p2
  • RefSeq to Ensembl model comparison attributes will be added for human
  • Fruitfly assembly will be updated to BDGP6

Variation data imports and updates

  • 1000 Genomes phase 3 studies will be imported for human
  • The latest sequence variants from dbSNP build 142 for human will be imported
  • New Global Alliance standards REST endpoints will be available for sets of Variation data
  • NextGen Project genotype data will be added from 3 sheep populations (Iranian Ovis aries, Iranian Ovis orientalis, Moroccan Ovis aries)
  • New rat strain-specific variants and genotypes, and QTLs and phenotypes from the Rat Genome Database (RGD)

New web features

  • Updated Gene gain/loss tree view
  • New summary statistics of the homologs predicted between each pair of species

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.

The Ensembl Pre! site has been updated for four species: zebrafish (Danio rerio), rat (Rattus norvegicus), sperm whale (Physeter macrocephalus) and fugu (Takifugu rubripes).

Sperm whale is a new species to Ensembl. Our main site already displays earlier assemblies for fugu, zebrafish and rat.

Zebrafish

The zebrafish assembly, GRCz10 (GCA_000002035.3), was made available by The Genome Reference Consortium in September 2014. Since the previous release, Zv9 in July 2010, the GRC has taken over the task of improving and maintaining the zebrafish assembly. The most notable changes in the chromosome landscape since the previous release can be found on chromosome 4, which has gained about 15 Mb in length. Furthermore, 94 of the 112 previously unplaced contigs are now located on chromosomes. In total, this assembly consists of 26 chromosomes and 3,399 unplaced scaffolds. The full annotation of an older zebrafish assembly, Zv9, can be found on our main website. Click here to go to the zebrafish Pre! site, where you can view alignments of zebrafish UniProt proteins and human Ensembl translations, as well as gene models projected from the previous zebrafish assembly.

Rat

The new rat assembly, Rnor_6.0 (GCA_000001895.4), was produced by The Rat Genome Sequencing and Mapping Consortium and was released in July 2014. This assembly comprises 954 toplevel sequences, 22 of which are chromosomes (chromosome Y is a new addition in this assembly), and 1,395 of which are unplaced scaffolds. The full annotation of an older rat assembly, Rnor_5.0, can be found on our main website. Otherwise, click here to visit the rat Pre! site, where you can view alignments of rat UniProt proteins and human and mouse Ensembl translations, as well as gene models projected from the previous rat assembly.

Sperm Whale

The sperm whale assembly, PhyMac_2.0.2 (GCA_000472045.1), was produced in September 2013 by The Aquatic Genome Models Consortium. The assembly does not contain any assembled chromosomes or linkage groups and is instead made up of 11,711 unplaced scaffolds. The species is an important model for a number of human conditions such as respiratory disease, metal toxicity and cancer. For example, sperm whales exposed to high levels of chromium show no adverse health effects, whereas humans do. Studying this species could lead to the development of treatments for human chromium-related disorders. Click here to visit the sperm whale Pre! site, where you can view alignments of human and dolphin Ensembl translations.

Fugu

The fugu genome assembly, FUGU5 (GCA_000180615.2), was released in October 2011 by The Fugu Genome Sequencing Consortium. It is composed of 22 autosomal chromosomes, with a total sequence length of 391 Mb. The species was initially proposed as a useful model for annotating and understanding the human genome, as it contains a similar repertoire of genes to human yet is only roughly one-eighth of the size. It is among the smallest vertebrate genomes, and previous assemblies of this species have already shown themselves to be useful reference genomes for identifying genes and other functional elements in other vertebrate species. The full annotation of an older fugu assembly, FUGU 4.0, can be found on our main website. Click here to visit the fugu Pre! site, where you can view alignments of human and dolphin Ensembl translations.