Ensembl Blog – Page 28 – News about the Ensembl Project and its genome browser

What’s coming in Ensembl 91

6th October 2017 by Ben (Outreach)

Ensembl 91 is scheduled for December 2017 and we’re continuing our push to include the genome annotation for lots of new species. This time, we’re adding a whole new set of primate species to Ensembl.

Here’s what you can look forward to:

New assemblies, gene sets and annotations

Annotation of 12 new primate genomes, as well as updates to 6 existing genomes:
- Nancy Ma’s night monkey
- White-headed capuchin
- Sooty mangabey
- Angola colobus
- Crab-eating macaque
- Southern pig-tailed macaque
- Drill
- Bonobo
- Coquerel’s sifaka
- Black snub-nosed monkey
- Golden snub-nosed monkey
- Black-capped squirrel monkey
- Chimpanzee (update)
- Gibbon (update)
- Gorilla (update)
- Mouse lemur (update)
- Olive baboon (update)
- Tarsier (update)
Annotation on the latest Cat genome assembly, Felis_catus_8.0
C. elegans gene set and annotation updated to WormBase release WS260
Fruit fly gene set and annotation updated to FlyBase release FB2017_04 (dmel_r6.17)
Updated Human cDNA alignments
Updated Mouse cDNA alignments
Updated microarray probe mappings and comparative genomics analyses for all new and updated species

Other updates and highlights

Updating our human variation database with:
- COSMIC 82 somatic variants
- HGMD 2017.2
- DGVa structural variants
- Phenotypes from NHGRI-EBI GWAS, OMIM, ClinVar, UniProt, Cosmic Gene Census, DDG2P, MIM Morbid and Orphanet
In other species we also have variation updates as follows:
- dbSNP 150 in macaque, mouse, zebrafish, sheep, pig, horse, cow and chicken
- DGVa in cow, dog, mouse, horse, macaque, pig, sheep and zebrafish
- Phenotype updates from relevant databases in rat, zebrafish and mouse
Links to PharmGKB added from human variants
New web tool for Linkage Disequilibrium (LD) calculation
Updated GRCh37 regulatory features

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.

Upcoming API changes in Regulation for e!91

28th September 2017 by Michael Nuhn

The upcoming Ensembl release (e!91) will include several updates to the regulation API and, with them, a farewell to many objects that have given the regulation API its characteristic look and feel over the years.

The changes listed below only affect the way we store high-throughput sequencing experiments and their results. Probe-feature-related objects and regulatory features are not affected. If you use any of the following in your scripts, please keep an eye out for our updated Doxygen documentation once Ensembl release 91 is out.

ResultSet and InputSubset

The long-serving ResultSet object and its faithful companion, the InputSubset object, will be removed. Over the last few releases these data types have been extensively modified and their contents moved to more specific API objects, until they only served to store information about the read files (InputSubset) and their respective alignments (ResultSet).

From now on alignments are handled by a new API object, called “Alignment”.

The InputSubset object will be replaced by two new objects:

ReadFile
ReadFileExperimentalConfiguration

A ReadFile represents a FASTQ file generated by a high-throughput sequencing experiment, such as ChIP-seq or DNase-seq.

The experimental configuration that led to the creation of the read file is stored in the ReadFileExperimentalConfiguration object. It links the Experiment object to the ReadFiles generated by it and contains the following information:

which biological and
which technical replicate a ReadFile is within an Experiment,
whether it is paired-end, and
whether it is the result of multiple sequencing runs of the same sample.
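
As a rough illustration of the information listed above, here is a small Python model of it. This is not the Ensembl API (which is written in Perl): the class layout, the field names and the `group_by_replicate` helper are all assumptions made for this sketch, chosen only to mirror the four items in the list.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ReadFile:
    """A FASTQ file produced by a sequencing experiment (illustrative only)."""
    name: str


@dataclass(frozen=True)
class ReadFileExperimentalConfiguration:
    """Links an experiment to one of its read files.

    All field names are hypothetical; they simply mirror the list above.
    """
    experiment: str
    read_file: ReadFile
    biological_replicate: int
    technical_replicate: int
    paired_end: bool
    multiple_runs: bool  # result of several sequencing runs of one sample


def group_by_replicate(
    configs: Iterable[ReadFileExperimentalConfiguration],
) -> Dict[Tuple[int, int], List[str]]:
    """Group read file names by (biological, technical) replicate number."""
    grouped: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for config in configs:
        key = (config.biological_replicate, config.technical_replicate)
        grouped[key].append(config.read_file.name)
    return dict(grouped)
```

A pipeline could use a grouping like this to decide which read files belong together before alignment, although how ERSA actually does this is not described here.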

Using the experimental configuration, the Ensembl Regulation Sequence Alignment (ERSA) pipeline decides how to analyse the various high-throughput sequencing data.

AnnotatedFeature and FeatureSet

In the current API the AnnotatedFeature object represents enriched regions or peaks from ChIP-seq and DNase-seq experiments.

In the future the AnnotatedFeature API object will become the Peak object.

AnnotatedFeature objects used to be accessed by first fetching an appropriate FeatureSet object and then the AnnotatedFeatures linked to it.

A FeatureSet object that links to a set of AnnotatedFeatures represented a peak calling analysis from a ChIP-seq-like experiment. These are now represented by the new PeakCalling object.

DataSet

The venerable DataSet object will be retired and it will not be replaced.

CoordSystem

The CoordSystem object in regulation, not to be confused with the CoordSystem object used for Ensembl core databases, has been retired after many years of service.

It was mostly known for its Adaptor, which gave scary error messages if the Registry had been misconfigured. It could also make features unexpectedly vanish from the website.

There are no plans to replace its function.

Summary

Current Object	New Object	Notes
InputSubset	ReadFile, ReadFileExperimentalConfiguration
ResultSet	Alignment
AnnotatedFeature	Peak
FeatureSet	PeakCalling
DataSet		Retired.
CoordSystem		Retired. Regulation-specific object. Not to be confused with that used for the Ensembl core databases.
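
If you want a quick way to check whether your own scripts mention any of the objects in the table, something like the following Python sketch could help. It is only a text search, not part of any Ensembl tooling; the function name and the matching rule are assumptions for this example. Note that it cannot tell the regulation CoordSystem apart from the core one, so those hits need a manual look.

```python
import re
from typing import List, Optional, Tuple

# Old object name -> replacement from the summary table (None = retired).
RENAMED = {
    "InputSubset": "ReadFile, ReadFileExperimentalConfiguration",
    "ResultSet": "Alignment",
    "AnnotatedFeature": "Peak",
    "FeatureSet": "PeakCalling",
    "DataSet": None,
    "CoordSystem": None,
}

# Match a name that is not preceded by a letter or digit, so that
# "get_FeatureSetAdaptor" is caught but "MyFeatureSet" is not.
_PATTERN = re.compile(r"(?<![A-Za-z0-9])(" + "|".join(RENAMED) + r")")


def find_retired_names(source: str) -> List[Tuple[int, str, Optional[str]]]:
    """Return (line number, old name, replacement or None) for each hit."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            old = match.group(1)
            hits.append((lineno, old, RENAMED[old]))
    return hits
```

Reading a script with `open(path).read()` and passing the text to `find_retired_names` gives a list of lines worth revisiting once the updated documentation is available.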

Ensembl Genomes 37 is now live

12th September 2017 by Emily (Outreach)

We’re pleased to announce the latest release from Ensembl Genomes. There’s new data and software available. Find out more:


Getting to know us: Sophie from the Research Management Office

8th September 2017 by Emily (Outreach)

Our latest introduction to the Ensembl Team post comes from Sophie Janacek, who takes care of all our money.

Adjusting Custom Tracks in Ensembl

31st August 2017 by Ben (Outreach)

In the latest Ensembl release (Ensembl 90, August 2017), we have added the option for you to adjust the y-axis of your custom “wiggle” tracks, such as BigWig and bedGraph files.

Ensembl 90 has been released!

22nd August 2017 by Emily (Outreach)

Ensembl 90 is now live and it’s absolutely massive! Read on to find out why:

The UK and EMBL continue to support the EBI post-Brexit

11th August 2017 by Emily (Outreach)

Are you thinking about applying for a job at Ensembl, but worried about working in the UK post-Brexit? You don’t need to. The UK government and EMBL are committed to continuing support for the EBI in Cambridge. What does this mean?

Getting to know us: Jyo from Web

11th August 2017 by Ben (Outreach)

This month we’re meeting Jyothish Bhai, who works in the Ensembl web team.

Getting to know us: Bronwen, Vertebrate Annotation team leader

14th July 2017 by Emily (Outreach)

This month we’re meeting Bronwen Aken who heads up our Vertebrate Annotation team.

GFF3 and Sequence Ontology terms

6th July 2017 by Magali (Coordinator)

Since release 81, Ensembl has provided gene annotations in GFF3 files alongside the existing GTF ones. While GTF uses its own controlled vocabulary to classify features, GFF3 takes advantage of the Sequence Ontology (SO). In the initial release, we attempted to map all existing Ensembl biotypes to equivalent SO terms.

This has proven unsatisfactory for several reasons:

not all biotypes have an equivalent SO term
there are too many levels of granularity, with 25 terms for genes and another 25 for transcripts
some SO mappings do not respect the parent-child relationship expected between gene and transcript SO terms
some SO mappings are inaccurate or missing
the SO term is mostly redundant with the biotype, which is also provided as an attribute
there can be confusion when most features have identical values in the third column (the SO term) and the biotype attribute, yet a handful do not

For all these reasons, our SO term mapping has undergone a major overhaul to take advantage of the functionality that the Sequence Ontology offers. This new mapping, which will be used from release 90 onwards, attempts to provide general biotype groupings that match the ones used on the website. As a result, all gene biotypes are mapped to one of three groups: coding, non-coding or pseudogene. Meanwhile, transcript biotypes are mapped to one of five main groups: mRNA, pseudogenic_transcript, long non-coding RNAs, short non-coding RNAs and IG biotypes.

Additionally, the groupings remove some of the previous granularity, which can still be explored via the biotype attribute, and the assigned terms respect the gene-transcript relationship where possible.
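
To get a feel for how the third column (the SO term) relates to the biotype attribute in a given file, a short Python sketch along these lines may be useful. It relies only on the standard GFF3 layout (nine tab-separated columns, with `key=value` pairs separated by semicolons in the last one) and on the attribute being named `biotype`; the function names are made up for this example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split the GFF3 attributes column into a key -> value dictionary."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = value
    return attributes


def tally_type_vs_biotype(lines: Iterable[str]) -> Counter:
    """Count (SO term in column 3, biotype attribute) pairs.

    Comment lines, blank lines and features without a biotype are skipped.
    """
    counts: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        biotype = parse_attributes(columns[8]).get("biotype")
        if biotype is not None:
            counts[(columns[2], biotype)] += 1
    return counts
```

Passing an open, uncompressed GFF3 file to `tally_type_vs_biotype` returns a counter whose keys show which broad SO term each biotype has been grouped under.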

To see the full extent of those changes, as they will be reflected in the GFF3 files provided from release 90 onwards, please check these files on the FTP site.
We hope this improvement will help our users take better advantage of the GFF3 format.