We’re already gearing up for Ensembl 89, scheduled for May 2017. It’s a slimline release this time, with just a handful of highlights:

Updated assemblies, gene sets and annotations

  • Human: updated cDNA alignments
  • Mouse: updated cDNA alignments and update to Ensembl-Havana GENCODE gene set

Other updates and highlights

  • Variation and phenotype database updates, including COSMIC version 80.
  • GnomAD frequencies will be available via the website, VEP and APIs.
  • Mapping of array probes to 15 different mouse strains in Ensembl.

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.

The Ensembl regulation resources FTP site saw a facelift in release 87. The directory structure has been modified to make files easier to find: the file names are more descriptive and we now provide our data in a greater variety of file formats. All data files on our FTP site now adhere to a naming convention, which is described in greater detail here. The filenames include the following information, separated by dots (‘.’):

  • species
  • assembly version
  • cell type (if applicable)
  • feature type (if applicable)
  • analysis name
  • results type
  • data freeze date
  • file format.

E.g.: homo_sapiens.GRCh38.K562.Regulatory_Build.regulatory_activity.20161111.gff.gz
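As a rough illustration of the convention, the sketch below splits a filename into its fields. It is not part of the Ensembl codebase: the function and class names are hypothetical, and it assumes that the freeze date is the only eight-digit field and that no individual field contains a dot. Because cell type and feature type are both optional, a single optional field cannot be told apart by position alone, so the sketch returns the optional fields together.

```python
import re
from typing import NamedTuple, Tuple


class RegulationFileName(NamedTuple):
    """Fields of a regulation FTP filename (hypothetical helper type)."""
    species: str
    assembly: str
    optional_fields: Tuple[str, ...]  # cell type and/or feature type, if present
    analysis: str
    results_type: str
    freeze_date: str
    file_format: str


def parse_regulation_filename(name: str) -> RegulationFileName:
    """Split a dot-separated filename into the fields listed above.

    Assumes the freeze date is the first eight-digit field and that
    no field contains a dot itself.
    """
    parts = name.split(".")
    date_idx = next(
        (i for i, p in enumerate(parts) if re.fullmatch(r"\d{8}", p)), None
    )
    # Need at least species, assembly, analysis and results type before
    # the date, and at least one file-format field after it.
    if date_idx is None or date_idx < 4 or date_idx == len(parts) - 1:
        raise ValueError(f"unexpected filename layout: {name!r}")
    return RegulationFileName(
        species=parts[0],
        assembly=parts[1],
        optional_fields=tuple(parts[2:date_idx - 2]),
        analysis=parts[date_idx - 2],
        results_type=parts[date_idx - 1],
        freeze_date=parts[date_idx],
        file_format=".".join(parts[date_idx + 1:]),
    )


parsed = parse_regulation_filename(
    "homo_sapiens.GRCh38.K562.Regulatory_Build.regulatory_activity.20161111.gff.gz"
)
print(parsed)
```

For the example filename above, the single optional field is the cell type (K562), and the file format comes out as gff.gz.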

The data available on our FTP site include:

Peaks: The set of peaks for transcription factors, histone modifications and variants that are part of our regulatory resources. In previous releases these were collated in one file, called ‘AnnotatedFeatures.gff.gz’, but with our recent expansion to 88 human cell types with ChIP-seq data, the file became too big. We have therefore split it into separate files by cell and feature type in the ‘Peaks’ subdirectory. The peaks are now available in gff, bed and bigBed formats.

Quality scores: The outcome of our quality checks from processing the ChIP-seq data that yielded the peaks. They are in JSON format in the ‘QualityChecks’ subdirectory:

  • the number of mapped reads
  • the estimated fragment length, the NSC and RSC values using phantompeakqualtools
  • the proportion of reads in peaks
  • the enrichment of the ChIP over the Input using CHANCE.
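To make one of these metrics concrete, the proportion of reads in peaks is simply the number of mapped reads falling inside called peaks divided by the total number of mapped reads. The sketch below shows that arithmetic only. The function name and the read counts are hypothetical, and it does not reflect the layout or key names of the JSON files.

```python
def fraction_of_reads_in_peaks(reads_in_peaks: int, mapped_reads: int) -> float:
    """Return the proportion of mapped reads that fall inside peaks."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    if not 0 <= reads_in_peaks <= mapped_reads:
        raise ValueError("reads_in_peaks must be between 0 and mapped_reads")
    return reads_in_peaks / mapped_reads


# Hypothetical counts, for illustration only.
frip = fraction_of_reads_in_peaks(reads_in_peaks=2_500_000, mapped_reads=20_000_000)
print(f"{frip:.3f}")
```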

Regulatory build: The current set of regulatory features along with their predicted activity in every cell type. We provide one gff file per cell type in the ‘regulatory_features’ subdirectory.

Transcription factor motifs: The transcription factor motifs, identified using position weight matrices from JASPAR in enriched regions found by our ChIP-seq analysis pipeline, are provided in gff format.

For our latest release (e87) we’ve produced annotations from some new embryonic zebrafish RNA-seq data using the Ensembl genebuild RNA-seq pipeline. The collection of new data we’re providing consists of gene sets and alignments for 18 separate embryonic developmental stages, from the single-celled zygote right up until 120 hours post fertilisation. As usual, these features can be viewed in our browser as separate tracks, or they can be downloaded from our FTP site.

The RNA-seq data we used were produced by the Vertebrate Genetics and Genomics Group at the Sanger Institute. The team collected 96 embryos from each of the 18 stages, examining their morphology to ensure every embryo was at the correct phase of development. Such an undertaking, although extensive, is more achievable in zebrafish than in many other vertebrates due to features such as large clutch size and external fertilisation and development. The team made 5 libraries for each of the developmental stages, each one comprising a pool of 12 embryos. All 90 libraries were made simultaneously by a robot to reduce batch effects, and strand-specific sequencing was used to reveal information on genes overlapping on the opposite strand. The data were released to ENA directly after sequencing, to allow public access as early as possible. Variation in gene structure across development can be viewed in Ensembl and the changing expression level can be viewed in Expression Atlas. A manuscript describing the changes in gene structure and expression level across development is currently in preparation.

The alignments and annotations generated from the data are viewable in the Ensembl browser, and the individual tracks can be configured using the RNA-seq tissue matrix. The initial introduction of this matrix was covered in a previous blog post. The new zebrafish entries appear in chronological order under the heading ‘WTSI stranded RNA-seq’. A merged set, which contains all of the new developmental RNA-seq data, is also selectable.

We expect these RNA-seq data will expose new isoforms of previously annotated genes, which may be especially prevalent during, and perhaps even unique to, early embryonic development. The alignments may also reveal interesting expression patterns for specific genes.

We’d like to encourage our users to take full advantage of these exciting new data, and we hope they’ll facilitate some interesting new research.

Please send any questions to our helpdesk.

The Variant Effect Predictor (VEP) is one of Ensembl’s most popular tools. It has grown in six years from a simple Perl script with just a couple of hundred lines of code to become a multi-limbed beast with thousands of lines of code and well over 100 configurable options.

VEP is now used by many high-profile projects, institutes and companies around the world. To manage this growth effectively and ensure we deliver the most reliable and feature-rich variant annotator out there, we’ve had to go back to basics. Over the past six months the VEP codebase has been totally rewritten, and the new version is now available for download. Users of VEP’s web and REST API interfaces should see virtually no difference with the new version, so if that’s you, you can stop reading now!

For users of our command line tool, you can trial the new VEP by visiting https://github.com/Ensembl/ensembl-vep. The full list of changes to the code can be found in the README on GitHub, but these are the main points of note:

  • Faster: process an individual genome in around 30 minutes.
  • Backward-compatible: all data sources (cache files, databases) and most command line flags from the old code are fully compatible with the new code.
  • More reliable: test-driven development means the new code is covered by more than 1500 unit tests with over 99% statement coverage.

For those tied to the current codebase, it is still available as part of the ensembl-tools GitHub repository, though updates and support for this will cease over time. Ensembl release 87 will be the last for which the ensembl-tools version of VEP will be the “primary” VEP codebase. Of course, the previous code and supporting data will remain available as part of Ensembl’s archiving strategy.

Some other points of note:

  • The documentation at ensembl.org still refers to the old code. From Ensembl release 88 onwards full documentation for the new code will be made available.
  • If possible, please report any issues you may find with the new code as a GitHub Issue.
  • The code that calculates variant consequence types (e.g. missense_variant, stop_gained) remains a part of the ensembl-variation API module and has not been (significantly) updated; it is used by both the old and new code. The ensembl-vep codebase performs the following functions:
    • parsing command line flags
    • parsing input
    • reading data from annotation sources (databases, cache files, flat files)
    • interval alignment of input variants with annotation data
    • writing output
    • monitoring statistics
    • threading
    • data filtering interface
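
To give a feel for the interval-alignment step listed above, here is a minimal sketch of matching input variants to the annotation features they overlap. It is not VEP’s implementation, which is written in Perl. The `Feature` and `Variant` types, the function names and the coordinates are all hypothetical, and it assumes 1-based inclusive coordinates.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Feature(NamedTuple):
    """An annotation interval, e.g. a transcript (hypothetical type)."""
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    name: str


class Variant(NamedTuple):
    """An input variant interval (hypothetical type)."""
    chrom: str
    start: int
    end: int
    name: str


def build_index(features: Iterable[Feature]) -> Dict[str, List[Feature]]:
    """Group features by chromosome, sorted by start position."""
    by_chrom: Dict[str, List[Feature]] = defaultdict(list)
    for feature in features:
        by_chrom[feature.chrom].append(feature)
    for chrom_features in by_chrom.values():
        chrom_features.sort(key=lambda f: f.start)
    return dict(by_chrom)


def overlapping(index: Dict[str, List[Feature]], variant: Variant) -> List[Feature]:
    """Return the features on the variant's chromosome that overlap it."""
    features = index.get(variant.chrom, [])
    # Start positions are rebuilt on each call to keep the sketch short.
    starts = [f.start for f in features]
    # Only features starting at or before the variant's end can overlap it.
    upper = bisect_right(starts, variant.end)
    return [f for f in features[:upper] if f.end >= variant.start]


# Hypothetical annotation and variant, for illustration only.
index = build_index([
    Feature("1", 100, 200, "tx1"),
    Feature("1", 150, 300, "tx2"),
    Feature("2", 100, 200, "tx3"),
])
hits = overlapping(index, Variant("1", 160, 160, "var1"))
print([f.name for f in hits])
```

A production tool would typically use a structure such as an interval tree or region-based caching rather than filtering a sorted list, but the overlap test itself (feature start at or before variant end, feature end at or after variant start) is the same.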

What’s New in e87:

Updated assemblies, gene sets and annotations

In Ensembl 87, there are a number of updates to the assemblies and gene sets for several species:

  • Human: updated cDNA alignments and RefSeq import
  • Mouse: updated gene set and assembly, see below
  • Zebrafish: updated gene set
  • Chicken: updated gene set

Updated gene models for mouse olfactory receptors

e87 includes an updated Ensembl-Havana mouse gene set, a merge of complete Ensembl gene models and the latest Havana gene annotation. All CCDS genes are included in this gene set.

This latest Havana gene annotation includes improved gene models for the mouse olfactory receptors. Over 2 Mbp of additional sequence has been added to the mouse olfactory genes to create several hundred multi-exonic models. These new models are based on RNA-seq data from Ibarra-Soria X et al.

The mouse assembly has been updated to GRCm38.p5. The patches for GRCm38.p5 were annotated using a combination of manual annotation, annotation projected from the primary assembly and annotation derived from cDNA and protein alignment evidence.

New lincRNA data

New regulation summary activity table

Due to the high number of epigenomes now available in the Human Regulatory Build, we can no longer show them all by default on the Regulation Summary image, in the Regulation Tab. We have therefore added a table listing the cell types by their regulatory feature activity.

Other News

  • DGVa structural variant study updates for Human, Cow and Macaque
  • dbSNP updates for Sheep
  • COSMIC version 78 imported for human
  • Phenotype data updates for several species

A complete list of the changes can be found on the Ensembl website.

Find out more about the new release and ask the team questions, in our free webinar: Wednesday 14th December, 4pm GMT. Register here.

What’s new?

Ensembl Plants takes centre stage in the release of Ensembl Genomes 33, with a variety of new data available for a number of different species:

  • Incorporation of the Araport 11 gene model annotation for Arabidopsis thaliana
  • Addition of mitochondrial and plastid genome sequences to the current maize (Zea mays) chromosomal assembly (AGPv4)
  • Alignment between the A, B and D genomes of bread wheat (Triticum aestivum) updated to use TGACv1 genome assemblies
  • Whole genome alignment between bread wheat and Brachypodium distachyon

Other News

You can find more details in the release notes.

What’s New in e86:

Mouse strain genomes

In Ensembl 86, you will now be able to view the annotated genome assemblies, variation data and comparative analyses of 16 different mouse strains, produced by the Mouse Genomes Project. While the GRCm38 assembly (produced from Mus musculus strain C57BL/6J) remains the reference assembly, variants and comparative analyses for the other strains can be viewed through the Gene tab and the Location tab.

You can find the gene trees and orthologue/paralogue predictions for the mouse strains through the Strains option in the menu in the Gene tab. The mouse strain gene tree depicts the evolutionary history of genes (left) and protein alignment (right) for the individual mouse strains and rat.

You can find the variants between these mouse strains through the Strain table option in the menu in the Location tab. The strain table displays the alleles identified at variant positions across the 16 mouse strains.

Updated assemblies, gene sets and annotations

In Ensembl 86, there are also updates to the assemblies and gene sets for a number of different species:

  • Human: updated cDNA alignments and RefSeq import
  • Mouse: updated cDNA alignments and RefSeq import
  • Zebrafish: updated gene set and RefSeq import
  • Chicken: updated to the Galgal_5.0 assembly
  • Mouse lemur: updated to the Mmur_2.0 assembly
  • Macaque: updated to the Mmul_8.0.1 assembly

New lincRNA data

New Mobile Site Views

As of release 86, you can view transcripts on the mobile version of Ensembl. You can also view exon sequence, cDNA sequence and protein sequence by clicking on the left-hand arrow.

The gene sequence is also now available to view on mobile devices. Just go to any gene page, click on the left-hand arrow and then choose sequence.

Other News

  • Variation and phenotype databases updated
  • You can now select ‘Manhattan plot’ as an option when configuring bigWig files

A complete list of the changes can be found on the Ensembl website.

Find out more about the new release and ask the team questions, in our free webinar: Tuesday 11th October, 4pm BST. Register here.