We are pleased to announce that Ensembl Genomes 36 has now been released, which includes new and updated genome assemblies and gene annotation as well as updated variation data and comparative genomics analyses. Find out more below:

  • Ensembl Bacteria includes 142 more genomes than release 35, together with an update to gene families.
  • Ensembl Fungi has added gene symbols for 1-to-1 orthologues from S. cerevisiae to Botrytis cinerea and includes updated PHI-base 4.3 annotations.
  • Ensembl Metazoa now has automated RNA gene annotation for 37 species (i.e. all species that have not been imported from FlyBase, VectorBase or WormBase) and alignment of Rfam 12.2 covariance models for all species. There are also updated protein features, which now include features from new sources (CDD, MobiDB and SFLD).
  • Ensembl Protists now has new automatic ncRNA alignments across all protist species as well as updated PHI-base 4.3 annotations.
  • Ensembl Plants now includes the new genome assembly for Hordeum vulgare (barley), the largest diploid genome yet sequenced, which is included in updated comparative peptide analyses for all species. There are also new ncRNA gene annotations and new plant reactome cross references across all plant species. New and updated variation data have also been included in this release for both Oryza sativa and Arabidopsis thaliana. Last, but not least, 80829 variation markers from the iSelect 90k array and 13.8 million Inter-Homoeologous Variants (IHVs) have been added to the wheat assembly, along with chloroplast and mitochondrial components (including gene annotations) imported from ENA.

Please see the release notes for full details of the updates.

Ensembl 90 is scheduled for August 2017 and it’s set to be our biggest release ever in terms of new genome annotation. Here’s what you can look forward to:

New assemblies, gene sets and annotations

  • Annotation of 15 rodent genomes, including three updates to old genomes:
    • Brazilian guinea pig
    • Chinese hamster
    • Damara mole rat
    • Degu
    • Golden hamster
    • Guinea pig (update)
    • Kangaroo rat (update)
    • Lesser Egyptian jerboa
    • Long-tailed chinchilla
    • Naked mole-rat – we have two different assemblies for naked mole-rat so you can keep working with your preferred genome
    • Northern American deer mouse
    • Prairie vole
    • Squirrel (update)
    • Upper Galilee mountains blind mole rat
  • Bringing in annotation of the well-used rodent cell-line, Chinese Hamster Ovary, and two mouse species, Ryukyu mouse and Shrew mouse.
  • Annotation on the latest Pig genome assembly, Sscrofa11.1
  • Updating the Human gene set to GENCODE 27.
  • Updating the Mouse gene set to GENCODE M15.
  • Adding transcript models from RNA-seq to the gene database and pri-miRNAs to the otherfeatures database in Zebrafish.

Other updates and highlights

  • Updating our human variation database with:
    • COSMIC 81 somatic variants
    • HGMD 2016.4
    • dbSNP 150
    • DGVa structural variants
    • TopMed in GRCh37
    • Phenotypes from NHGRI-EBI GWAS, OMIM, ClinVar, UniProt, Cosmic Gene Census, DDG2P, MIM Morbid and Orphanet
  • In other species we also have variation updates as follows:
    • DGVa in Cow, Dog and Mouse
    • Phenotype updates from relevant databases in Cat, Chicken, Chimpanzee, Cow, Dog, Horse, Macaque, Mouse, Pig, Rat, Sheep, Turkey and Zebrafish
  • Updating our microarray probe mappings in:
    • C.intestinalis
    • Caenorhabditis elegans
    • Chicken
    • Chimpanzee
    • Cow
    • Dog
    • Fruitfly
    • Human
    • Macaque
    • Mouse
    • Mouse 129S1/SvImJ
    • Mouse A/J
    • Mouse AKR/J
    • Mouse BALB/cJ
    • Mouse C3H/HeJ
    • Mouse C57BL/6NJ
    • Mouse CAST/EiJ
    • Mouse CBA/J
    • Mouse DBA/2J
    • Mouse FVB/NJ
    • Mouse LP/J
    • Mouse NOD/ShiLtJ
    • Mouse NZO/HlLtJ
    • Mouse PWK/PhJ
    • Mouse SPRET/EiJ
    • Mouse WSB/EiJ
    • Pig
    • Platypus
    • Rabbit
    • Rat
    • Saccharomyces cerevisiae
    • Xenopus
    • Zebrafish

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.

We are pleased to announce that Ensembl Genomes 35 has now been released.

New and updated genomic sequences are available in all EG sub-portals, while updated comparative peptide analyses have been performed for Fungi, Metazoa, Plants, and Protists:

  • Ensembl Bacteria now incorporates 2460 new genomes, as well as revised assemblies and annotation for 188 and 234 genomes, respectively;
  • Ensembl Fungi now incorporates more than 100 new genomes, including the Puccinia striiformis f. sp. tritici PST-130 v1.0 assembly from the Joint Genome Institute, and provides updates to existing genomes and annotation. In particular, a new, manually-annotated genebuild, curated by the community using the WebApollo tool, has been added for Botrytis cinerea B05.10;
  • Ensembl Metazoa adds three new genomes, including that of Hessian fly. In addition, orthologue metrics have been calculated for all metazoan species and have been used to compute a set of “high-confidence” orthologues;
  • Ensembl Plants includes a new genome assembly and genebuild for Sorghum bicolor, and an updated genebuild for maize. New variation data are available for bread wheat, as are new comparative peptide analyses for all species;
  • Ensembl Protists contains 11 new genomes, along with revised genomic assemblies for more than 25 other species. Variation data have been newly included for Phaeodactylum tricornutum, and have been updated for Phytophthora infestans and Plasmodium falciparum; new comparative peptide analyses have also been performed.

Please see the release notes for full details of the updates: http://ensemblgenomes.org/info/release-notes/35

We’re already gearing up for Ensembl 89, scheduled for May 2017. It’s a slimline release this time, with just a handful of highlights:

Updated assemblies, gene sets and annotations

  • Human: updated cDNA alignments
  • Mouse: updated cDNA alignments and update to Ensembl-Havana GENCODE gene set

Other updates and highlights

  • Variation and phenotype database updates, including COSMIC version 80.
  • GnomAD frequencies will be available via the website, VEP and APIs.
  • Mapping of array probes to 15 different mouse strains in Ensembl.

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.

The Ensembl regulation resources FTP site saw a facelift in release 87. The directory structure has been modified to make it easier to find files: the file names have become more descriptive, and we now also provide our data in a greater variety of file formats. All data files on our FTP site now adhere to a naming convention, which is described in greater detail here. The file names include the following information, separated by dots (‘.’):

  • species
  • assembly version
  • cell type (if applicable)
  • feature type (if applicable)
  • analysis name
  • results type
  • data freeze date
  • file format.

E.g.: homo_sapiens.GRCh38.K562.Regulatory_Build.regulatory_activity.20161111.gff.gz
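
As an illustration only, the naming convention above could be unpacked with a few lines of Python. This is a minimal sketch, not an official Ensembl tool: the names RegulationFileName and parse_regulation_filename are hypothetical, and the sketch assumes that the data freeze date is always eight digits and that no individual field contains a dot. Because cell type and feature type are both optional, the tokens found between the assembly and the analysis name are returned together rather than guessed at.

```python
import re
from typing import NamedTuple, Tuple

# Assumption: the data freeze date is written as eight digits (YYYYMMDD).
DATE_RE = re.compile(r"^\d{8}$")


class RegulationFileName(NamedTuple):
    """Hypothetical container for the dot-separated fields of a file name."""
    species: str
    assembly: str
    optional: Tuple[str, ...]  # cell type and/or feature type, when present
    analysis: str
    results_type: str
    freeze_date: str
    file_format: str           # may include a compression suffix, e.g. "gff.gz"


def parse_regulation_filename(name: str) -> RegulationFileName:
    parts = name.split(".")
    date_positions = [i for i, p in enumerate(parts) if DATE_RE.match(p)]
    if not date_positions:
        raise ValueError(f"no freeze date found in {name!r}")
    d = date_positions[-1]
    head, tail = parts[:d], parts[d + 1:]
    # Before the date: species, assembly, 0-2 optional fields, analysis, results type.
    if not 4 <= len(head) <= 6 or not tail:
        raise ValueError(f"unexpected number of fields in {name!r}")
    return RegulationFileName(
        species=head[0],
        assembly=head[1],
        optional=tuple(head[2:-2]),
        analysis=head[-2],
        results_type=head[-1],
        freeze_date=parts[d],
        file_format=".".join(tail),
    )


example = parse_regulation_filename(
    "homo_sapiens.GRCh38.K562.Regulatory_Build.regulatory_activity.20161111.gff.gz"
)
print(example)
```

Anchoring on the date keeps the sketch robust to the optional fields: everything after the date is treated as the file format, and the two tokens immediately before it as the analysis name and results type.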

The data available on our FTP site include:

Peaks: The set of peaks for transcription factors, histone modifications and variants that are part of our regulatory resources. In previous releases these were collated in one file, called ‘AnnotatedFeatures.gff.gz’, but with our recent expansion to 88 human cell types with ChIP-seq data, the file became too big. Therefore, we split it into separate files by cell and feature type in the ‘Peaks’ subdirectory. The peaks are now available in gff, bed and bigBed formats.

Quality scores: The outcome of our quality checks from processing the ChIP-seq data that yielded the peaks. They are in JSON format in the ‘QualityChecks’ subdirectory:

  • the number of mapped reads
  • the estimated fragment length, the NSC and RSC values using phantompeakqualtools
  • the proportion of reads in peaks
  • the enrichment of the ChIP over the Input using CHANCE.
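
For readers unfamiliar with these metrics, the sketch below shows how three of them are commonly defined in the ChIP-seq literature (for example in the ENCODE quality guidelines). It is an illustration under stated assumptions rather than the code of our pipeline or of phantompeakqualtools: the function and parameter names are hypothetical, and the keys used in the JSON files themselves are not described here.

```python
def fraction_of_reads_in_peaks(reads_in_peaks: int, mapped_reads: int) -> float:
    """Proportion of mapped reads that fall inside called peaks."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    if not 0 <= reads_in_peaks <= mapped_reads:
        raise ValueError("reads_in_peaks must lie between 0 and mapped_reads")
    return reads_in_peaks / mapped_reads


def nsc(cc_fragment: float, cc_min: float) -> float:
    """Normalised strand cross-correlation coefficient (usual definition):
    cross-correlation at the estimated fragment length divided by the
    minimum (background) cross-correlation."""
    return cc_fragment / cc_min


def rsc(cc_fragment: float, cc_phantom: float, cc_min: float) -> float:
    """Relative strand cross-correlation coefficient (usual definition):
    background-subtracted fragment-length peak divided by the
    background-subtracted read-length ("phantom") peak."""
    return (cc_fragment - cc_min) / (cc_phantom - cc_min)


# Made-up numbers, purely to show the arithmetic.
print(fraction_of_reads_in_peaks(250_000, 1_000_000))
print(nsc(0.30, 0.20))
print(rsc(0.30, 0.25, 0.20))
```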

Regulatory build: The current set of regulatory features along with their predicted activity in every cell type. We provide one gff file per cell type in the ‘regulatory_features’ subdirectory.

Transcription factor motifs: The transcription factor motifs found by scanning the enriched regions identified by our ChIP-seq analysis pipeline with position weight matrices from JASPAR. These are provided in gff format.

For our latest release (e87) we’ve produced annotations from some new embryonic zebrafish RNA-seq data using the Ensembl genebuild RNA-seq pipeline. The collection of new data we’re providing consists of gene sets and alignments for 18 separate embryonic developmental stages, from the single-celled zygote right up until 120 hours post fertilisation. As usual, these features can be viewed in our browser as separate tracks, or they can be downloaded from our FTP site.

The RNA-seq data we used were produced by the Vertebrate Genetics and Genomics Group at the Sanger Institute. The team collected 96 embryos from each of the 18 stages, examining their morphology to ensure every single embryo was at the correct phase of development. Such an undertaking, although extensive, is more achievable in zebrafish than in many other vertebrates due to features such as large clutch size and external fertilisation and development. The team made 5 libraries for each of the developmental stages, each one comprising a pool of 12 embryos. All 90 libraries were made simultaneously by a robot to reduce batch effects, and strand-specific sequencing was used to reveal information on genes overlapping on the opposing strand. The data were released to ENA directly after sequencing, to allow public access as early as possible. Variation in gene structure across development can be viewed in Ensembl and the changing expression level can be viewed in Expression Atlas. A manuscript describing the changes in gene structure and expression level across development is currently in preparation.

The alignments and annotations generated from the data are viewable in the Ensembl browser, and the individual tracks can be configured using the RNA-seq tissue matrix. The initial introduction of this matrix was covered in a previous blog post. The new zebrafish entries appear in chronological order under the heading ‘WTSI stranded RNA-seq’. A merged set, which contains all of the new developmental RNA-seq data, is also selectable.

We expect these RNA-seq data will expose new isoforms of previously annotated genes, which may be especially prevalent during, and perhaps even unique to, early embryonic development. The alignments may also reveal interesting expression patterns for specific genes.

We’d like to encourage our users to take full advantage of these exciting new data, and we hope they’ll facilitate some interesting new research.

Please send any questions to our helpdesk.

The Variant Effect Predictor (VEP) is one of Ensembl’s most popular tools. In six years it has grown from a simple Perl script with just a couple of hundred lines of code into a multi-limbed beast with thousands of lines of code and well over 100 configurable options.

VEP is now used by many high-profile projects, institutes and companies around the world. To manage this growth effectively and ensure we deliver the most reliable and feature-rich variant annotator out there, we’ve had to go back to basics. Over the past six months the VEP codebase has been totally rewritten, and the new version is now available for download. Users of VEP’s web and REST API interfaces should see virtually no difference with the new version, so if that’s you, you can stop reading now!

For users of our command line tool, you can trial the new VEP by visiting https://github.com/Ensembl/ensembl-vep. The full list of changes to the code can be found in the README on GitHub, but these are the main points of note:

  • Faster : process an individual genome in around 30 minutes.
  • Backward-compatible : all data sources (cache files, databases) and most command line flags from the old code are fully compatible with the new code.
  • More reliable : test-driven development means the new code is covered by more than 1500 unit tests with over 99% statement coverage.

For those tied to the current codebase, it is still available as part of the ensembl-tools GitHub repository, though updates and support for this will cease over time. Ensembl release 87 will be the last for which the ensembl-tools version of VEP will be the “primary” VEP codebase. Of course, the previous code and supporting data will remain available as part of Ensembl’s archiving strategy.

Some other points of note:

  • The documentation at ensembl.org still refers to the old code. From Ensembl release 88 onwards full documentation for the new code will be made available.
  • If possible, please report any issues you may find with the new code as a GitHub Issue.
  • The code that calculates variant consequence types (e.g. missense_variant, stop_gained) remains a part of the ensembl-variation API module and has not been (significantly) updated; it is used by both the old and new code. The ensembl-vep codebase performs the following functions:
    • parsing command line flags
    • parsing input
    • reading data from annotation sources (databases, cache files, flat files)
    • interval alignment of input variants with annotation data
    • writing output
    • monitoring statistics
    • threading
    • data filtering interface
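
To give a flavour of what “interval alignment of input variants with annotation data” means, here is a minimal Python sketch of finding the annotated features that overlap a variant. It is not how ensembl-vep (which is written in Perl) implements this step; the names Feature and FeatureIndex are hypothetical, and the sketch assumes 1-based inclusive coordinates on a single chromosome.

```python
from bisect import bisect_right
from typing import Iterable, List, NamedTuple


class Feature(NamedTuple):
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    name: str


class FeatureIndex:
    """Features on one chromosome, sorted by start for binary search."""

    def __init__(self, features: Iterable[Feature]) -> None:
        self._features = sorted(features, key=lambda f: f.start)
        self._starts = [f.start for f in self._features]

    def overlapping(self, var_start: int, var_end: int) -> List[Feature]:
        # Features starting after the variant ends cannot overlap it,
        # so only the first `hi` features need to be checked.
        hi = bisect_right(self._starts, var_end)
        return [f for f in self._features[:hi] if f.end >= var_start]


index = FeatureIndex([
    Feature(100, 200, "transcript_A"),
    Feature(150, 300, "transcript_B"),
    Feature(400, 500, "transcript_C"),
])

# A single-base variant at position 180 falls in two of the three features.
print([f.name for f in index.overlapping(180, 180)])
```

The binary search only rules out features that start after the variant; a production tool would typically use a structure such as an interval tree or binned cache to avoid scanning the remaining candidates.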