This month we’re meeting Jyothish Bhai, who works in the Ensembl web team.
Since release 81, Ensembl has provided gene annotations in GFF3 files alongside the existing GTF ones. While GTF uses its own controlled vocabulary to classify features, GFF3 takes advantage of the Sequence Ontology (SO). In the initial release, we attempted to map all existing Ensembl biotypes to equivalent SO terms.
This has proven unsatisfactory for several reasons:
- not all biotypes have an equivalent SO term
- there are too many levels of granularity, with 25 terms for genes and another 25 for transcripts
- some SO mappings do not respect the parent-child relationship expected between gene and transcript SO terms
- some SO mappings are inaccurate or missing
- the SO term is mostly redundant with the biotype, which is also provided as an attribute
- there can be confusion when most features have identical values in the third column (the SO term) and the biotype attribute, yet a handful do not
For all these reasons, our SO term mapping has undergone a major overhaul to take better advantage of what the Sequence Ontology offers. The new mapping, which will be used from release 90 onwards, attempts to provide general biotype groupings that match the ones used on the website. As a result, all gene biotypes are mapped to one of three groups: coding, non-coding or pseudogene. Meanwhile, transcript biotypes are mapped to one of five main groups: mRNA, pseudogenic_transcript, long non-coding RNAs, short non-coding RNAs and IG biotypes.
Additionally, the groupings remove some of the previous granularity, which can still be explored via the biotype attribute, and the assigned terms respect the gene-transcript relationship where possible.
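As a rough illustration of the grouping idea, the sketch below uses two small lookup tables. The group names come from the description above, but the tables are hypothetical and deliberately incomplete: the biotypes listed are just a few familiar examples, and their assignments are assumptions rather than the official Ensembl mapping.

```python
# Illustrative only: a handful of biotypes assigned to the groups named above.
# These tables are assumptions for the example, not the real Ensembl mapping.
GENE_GROUPS = {
    "protein_coding": "coding",
    "lincRNA": "non-coding",
    "miRNA": "non-coding",
    "processed_pseudogene": "pseudogene",
}

TRANSCRIPT_GROUPS = {
    "protein_coding": "mRNA",
    "processed_pseudogene": "pseudogenic_transcript",
    "lincRNA": "long non-coding RNA",
    "miRNA": "short non-coding RNA",
    "IG_V_gene": "IG",
}


def group_for(biotype, table):
    """Return the group for a biotype, or None if the table has no entry."""
    return table.get(biotype)


if __name__ == "__main__":
    for biotype in ("protein_coding", "miRNA", "processed_pseudogene"):
        print(biotype,
              group_for(biotype, GENE_GROUPS),
              group_for(biotype, TRANSCRIPT_GROUPS))
```

The detailed biotype is never lost in this scheme: the coarse group is looked up from it, so both can be reported side by side.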
To see the full extent of those changes, as they will be reflected in the GFF3 files provided from release 90 onwards, please check these files on the FTP.
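One way to inspect such a file is to tally each pairing of the third column (the SO term) with the biotype attribute. The following is a minimal sketch, assuming standard tab-separated GFF3 with nine columns and `key=value` attributes separated by semicolons. The sample lines and the `tally_types` helper are made up for illustration and are not taken from an Ensembl file.

```python
from collections import Counter
from urllib.parse import unquote


def parse_attributes(field):
    """Split a GFF3 attribute column into a dict of decoded key/value pairs."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = unquote(value)
    return attrs


def tally_types(lines):
    """Count (column-3 type, biotype attribute) pairs across GFF3 lines."""
    counts = Counter()
    for line in lines:
        # Skip blank lines, directives and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        biotype = parse_attributes(cols[8]).get("biotype")
        if biotype is not None:
            counts[(cols[2], biotype)] += 1
    return counts


# Invented sample records, for illustration only.
SAMPLE = [
    "##gff-version 3",
    "1\texample\tgene\t100\t900\t.\t+\t.\tID=gene:G1;biotype=protein_coding",
    "1\texample\tmRNA\t100\t900\t.\t+\t.\tID=transcript:T1;Parent=gene:G1;biotype=protein_coding",
    "1\texample\tpseudogene\t2000\t2500\t.\t-\t.\tID=gene:G2;biotype=processed_pseudogene",
]

if __name__ == "__main__":
    for (so_term, biotype), n in sorted(tally_types(SAMPLE).items()):
        print(so_term, biotype, n, sep="\t")
```

For a real file, the same function can be given an open file handle instead of the sample list, since it only iterates over lines.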
We hope this improvement will help our users take better advantage of the GFF3 format.
We are pleased to announce that Ensembl Genomes 36 has now been released, which includes new and updated genome assemblies and gene annotation as well as updated variation data and comparative genomics analyses. Find out more below:
- Ensembl Bacteria includes 142 more genomes than release 35, together with an update to gene families.
- Ensembl Fungi has added gene symbols for 1-to-1 orthologues from S. cerevisiae to Botrytis cinerea and includes updated PHI-base 4.3 annotations.
- Ensembl Metazoa now has automated RNA gene annotation for 37 species (i.e. all species that have not been imported from FlyBase, VectorBase or WormBase) and alignment of Rfam 12.2 covariance models for all species. There are also updated protein features, which now include features from new sources (CDD, MobiDB and SFLD).
- Ensembl Protists now has new automatic ncRNA alignments across all protist species as well as updated PHI-base 4.3 annotations.
- Ensembl Plants now includes the new genome assembly for Hordeum vulgare (barley), the biggest diploid yet sequenced, which is included in updated comparative peptide analyses for all species. There are also new ncRNA gene annotations and new Plant Reactome cross-references across all plant species. New and updated variation data have also been included in this release for both Oryza sativa and Arabidopsis thaliana. Last, but not least, 80,829 variation markers from the iSelect 90k array and 13.8 million Inter-Homoeologous Variants (IHVs) have been added to the wheat assembly, along with chloroplast and mitochondrial components (including gene annotations) imported from ENA.
Please see the release notes for full details of the updates.
Ensembl 90 is scheduled for August 2017 and it’s set to be our biggest release ever in terms of new genome annotation. Here’s what you can look forward to:
New assemblies, gene sets and annotations
- Annotation of 15 rodent genomes, including three updates to old genomes:
- Brazilian guinea pig
- Chinese hamster
- Damara mole rat
- Golden Hamster
- Guinea Pig (update)
- Kangaroo rat (update)
- Lesser Egyptian jerboa
- Long-tailed chinchilla
- Naked mole-rat – we have two different assemblies for naked mole-rat so you can keep working with your preferred genome
- Northern American deer mouse
- Prairie vole
- Squirrel (update)
- Upper Galilee mountains blind mole rat
- Bringing in annotation of the well-used rodent cell-line, Chinese Hamster Ovary, and two mouse species, Ryukyu mouse and Shrew mouse.
- Annotation on the latest Pig genome assembly, Sscrofa11.1
- Updating the Human gene set to GENCODE 27.
- Updating the Mouse gene set to GENCODE M15.
- Adding transcript models from RNA-seq to the gene database and pri-miRNAs to the otherfeatures database in Zebrafish.
Other updates and highlights
- Updating our human variation database with:
- COSMIC 81 somatic variants
- HGMD 2016.4
- dbSNP 150
- DGVa structural variants
- TopMed in GRCh37
- Phenotypes from NHGRI-EBI GWAS, OMIM, ClinVar, UniProt, Cosmic Gene Census, DDG2P, MIM Morbid and Orphanet
- In other species we also have variation updates as follows:
- DGVa in Cow, Dog and Mouse
- Phenotype updates from relevant databases in Cat, Chicken, Chimpanzee, Cow, Dog, Horse, Macaque, Mouse, Pig, Rat, Sheep, Turkey and Zebrafish
- Updating our microarray probe mappings in:
- Caenorhabditis elegans
- Mouse 129S1/SvImJ
- Mouse A/J
- Mouse AKR/J
- Mouse BALB/cJ
- Mouse C3H/HeJ
- Mouse C57BL/6NJ
- Mouse CAST/EiJ
- Mouse CBA/J
- Mouse DBA/2J
- Mouse FVB/NJ
- Mouse LP/J
- Mouse NOD/ShiLtJ
- Mouse NZO/HlLtJ
- Mouse PWK/PhJ
- Mouse SPRET/EiJ
- Mouse WSB/EiJ
- Saccharomyces cerevisiae
For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.
This is the third of our monthly posts introducing a member of the Ensembl team and what they do in Ensembl. This time it’s Matthew Laird, who works in the Core team.
Ensembl transcripts have two identifiers: the versioned ENST, which is stable through time and can be tracked from release to release, and a separate identifier that incorporates a gene symbol. The latter have changed in e!89; read on for more details.
Ensembl 89 is now live. Read on to find out about the new features and data in this release.
We’re really excited to be a part of the ESHG conference again, this time in Copenhagen from the 27th-30th May. We can’t wait to see all the great science that’s going to be presented, but here’s a guide to the talks, workshops and posters from Ensembl and some of our close friends:
W18 – Ensembl & GENCODE Workshop
Monday 29th May 3pm-4:30pm Ancona Room
After the introduction to the Ensembl genome browser, there will be hands-on demonstrations to teach you how to:
- annotate SNPs and CNVs with functional consequences using the Variant Effect Predictor
- investigate quick alternatives to the browser (BioMart and the REST API)
You can find more information about the workshop location and timings on the ESHG programme. If you have any questions or want to talk about Ensembl and GENCODE you can e-mail the Ensembl Helpdesk to arrange a time to meet, tweet us or simply come by and meet us after the workshop.
In addition to the Ensembl and GENCODE workshop, Amonida will be presenting an electronic poster (E-P16.08) that describes genome annotation and assembly assessment in Ensembl. Electronic Posters will be on display in the Poster Area and can be accessed during exhibition opening hours from 09:30 on Saturday 27th May to 17:45 on Monday 29th May by all participants.
- Maria will be presenting an electronic poster (E-P16.12) with new ideas for increasing the speed at which GWAS data are incorporated into the GWAS Catalog.
- Jackie will be presenting a poster about the steps they are taking to ‘Increase the utility of the NHGRI-EBI genome-wide association study (GWAS) Catalog for users’. Jackie’s poster number is P16.36D and you can come and talk to her between 16:45 and 17:45 on Monday 29th May.
Jackie and Maria would love to hear feedback from GWAS Catalog users during the conference, particularly on the new and proposed functionality presented in their posters. They will also be able to answer any questions on the GWAS Catalog. Any users who would like to talk to them should e-mail the GWAS Catalog team at firstname.lastname@example.org to arrange a time to meet, or simply come by and meet them at their posters.
And last, but not least, Giselle Kerry will be representing The European Genome-Phenome Archive (EGA) with a poster about their future plans. Giselle’s poster number is P19.44D and you can come and talk to her between 10:15 and 11:15 on Monday 29th May.
Looking forward to seeing you all there!