Joannella Morales, Jane Loveland and Adam Frankish contributed to this post.

Back in October, we introduced you to our new joint initiative with the NCBI — the Matched Annotation from the NCBI and EMBL-EBI (MANE) transcript set. We are now pleased to update you on our progress so far.

The goal of this project is to share annotation and converge on a high-confidence, genome-wide transcript set, with a matched transcript in both RefSeq and Ensembl/GENCODE. We are doing this in two phases. During phase 1, we will release the “MANE Select” transcript set to include one well-supported transcript for every protein-coding locus. We envision the adoption of the MANE Select set as a default set across genomics resources. In phase 2, we intend to release an expanded set (“MANE Plus”) to include additional transcripts per locus that are well-supported or of particular user interest.


One of the biggest highlights of Ensembl Plants release 40 is the inclusion of the new wheat genome assembly (RefSeq v1.0) from the International Wheat Genome Sequencing Consortium (IWGSC).

Sequencing wheat has been no easy ride, because its genome is large and highly repetitive. This new assembly from the IWGSC bridges many gaps from the initial genome sequencing effort. Read on to find out more about this exciting new genome assembly!


Since release 81, Ensembl has provided gene annotations in GFF3 files alongside the existing GTF ones. While GTF uses its own controlled vocabulary to classify features, GFF3 takes advantage of the Sequence Ontology (SO). In the initial release, we attempted to map every existing Ensembl biotype to an equivalent SO term.

This has proven unsatisfactory for several reasons:

  • not all biotypes have an equivalent SO term
  • there are too many levels of granularity, with 25 terms for genes and another 25 for transcripts
  • some SO mappings do not respect the parent-child relationship expected between gene and transcript SO terms
  • some SO mappings are inaccurate or missing
  • the SO term is mostly redundant with the biotype, which is also provided as an attribute
  • there can be confusion when most features have identical values in the third column (the SO term) and the biotype attribute, yet a handful do not
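To see how often the third column and the biotype attribute agree in a given file, the two values can be tallied side by side. The following is a minimal sketch, assuming the attribute is stored as `biotype=...` in the ninth column; the function names and the example records are made up for illustration and are not taken from an Ensembl release.

```python
from collections import Counter


def parse_attributes(column9):
    """Parse a GFF3 attributes column: 'key=value' pairs separated by ';'."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def type_vs_biotype(lines):
    """Count (type column, biotype attribute) pairs for features with a biotype."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a feature line
        biotype = parse_attributes(fields[8]).get("biotype")
        if biotype is not None:
            counts[(fields[2], biotype)] += 1
    return counts


# Made-up records, for illustration only.
example = [
    "##gff-version 3",
    "1\tensembl\tgene\t100\t900\t.\t+\t.\tID=gene:G1;biotype=protein_coding",
    "1\tensembl\tlincRNA_gene\t2000\t2500\t.\t-\t.\tID=gene:G2;biotype=lincRNA",
    "1\tensembl\tpseudogene\t3000\t3400\t.\t+\t.\tID=gene:G3;biotype=pseudogene",
]

counts = type_vs_biotype(example)
for (so_term, biotype), n in sorted(counts.items()):
    flag = "same" if so_term == biotype else "differs"
    print(so_term, biotype, n, flag)
```

Run against a real GFF3 file (for example by passing an open file handle instead of `example`), the tally shows which features carry identical values and which do not.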

For all these reasons, our SO term mapping has undergone a major overhaul to take better advantage of what the Sequence Ontology offers. The new mapping, which will be used from release 90 onwards, provides general biotype groupings that match the ones used on the website. As a result, every gene biotype is mapped to one of three groups: coding, non-coding or pseudogene. Meanwhile, transcript biotypes are mapped to one of five main groups: mRNA, pseudogenic_transcript, long non-coding RNAs, short non-coding RNAs and IG biotypes.

Additionally, the groupings remove some of the previous granularity, which can still be explored via the biotype attribute, and the assigned terms respect the gene-transcript relationship where possible.
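The idea of collapsing many biotypes into a few groups can be sketched as a simple lookup. The table below is a hand-picked, illustrative subset and is not the mapping Ensembl ships; the group labels follow the three gene groups named above, and `GENE_GROUPS`, `gene_group` and `summarise` are hypothetical names.

```python
from collections import Counter

# Illustrative subset only: NOT the full or official Ensembl mapping.
GENE_GROUPS = {
    "protein_coding": "coding",
    "lincRNA": "non-coding",
    "miRNA": "non-coding",
    "processed_pseudogene": "pseudogene",
}


def gene_group(biotype, mapping=GENE_GROUPS):
    """Return the coarse group for a gene biotype, or 'unmapped' if unknown."""
    return mapping.get(biotype, "unmapped")


def summarise(biotypes, mapping=GENE_GROUPS):
    """Count how many biotypes fall into each coarse group."""
    return Counter(gene_group(b, mapping) for b in biotypes)


summary = summarise(["protein_coding", "miRNA", "lincRNA", "processed_pseudogene"])
print(dict(summary))
```

The finer-grained biotype stays available alongside the group, so nothing is lost by grouping: the lookup only decides which coarse label is reported.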

To see the full extent of those changes, as they will be reflected in the GFF3 files provided from release 90 onwards, please check these files on the FTP.
We hope this improvement will help our users take better advantage of the GFF3 format.

Ensembl transcripts have two identifiers: the versioned ENST, which is stable through time and can be tracked from release to release, and a separate identifier that incorporates a gene symbol. The latter changed in e!89; read on for more details.

File Chameleon

Transforming file formats has always been a troublesome issue in bioinformatics because of the numerous standards and the slight eccentricities in formatting required by some software packages. How many times have you needed to transform chromosome names between 1, 2, 3 and chr1, chr2, chr3, or vice versa? With the introduction of File Chameleon we hope to smooth this process for data consumers.
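For reference, the kind of hand-rolled chromosome renaming described here often looks like the sketch below. It assumes the simplest convention (prefix `chr`, with `MT` corresponding to `chrM`); scaffold and patch names do not follow such a simple rule and are not handled. The function names are hypothetical, and this is not File Chameleon's own code.

```python
def to_ucsc(name):
    """Ensembl-style name to UCSC style, under the simple assumed rules."""
    if name.startswith("chr"):
        return name  # already UCSC style
    if name == "MT":
        return "chrM"
    return "chr" + name


def to_ensembl(name):
    """UCSC-style name to Ensembl style, under the same assumed rules."""
    if name == "chrM":
        return "MT"
    if name.startswith("chr"):
        return name[3:]
    return name


def rename_first_column(line, convert):
    """Apply a name conversion to the first tab-separated column of a line."""
    if line.startswith("#") or not line.strip():
        return line  # leave comments and blank lines untouched
    seqname, sep, rest = line.partition("\t")
    return convert(seqname) + sep + rest


record = "1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id \"G1\";\n"
print(rename_first_column(record, to_ucsc), end="")
```

The same helper works for GTF and GFF3 lines, since both keep the sequence name in the first column.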

File Chameleon is a web service introduced by Ensembl to transform Ensembl FTP files for easier use across the spectrum of bioinformatics tools. Need UCSC-style chromosome names? Need genes longer than 4 Mbp removed? File Chameleon can do that. From the File Chameleon web interface, simply select the species and the flat file you want to download (individual chromosome GTF, full assembly FASTA, etc.), then select the filters you want to apply. The file will be transformed and ready to download within a few minutes.
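To make the gene-length filter concrete, here is a rough sketch of what such a filter does to a GTF file: find the genes whose span exceeds a threshold, then drop every record that belongs to them. This illustrates the idea only and is not how File Chameleon is implemented; the function names and the example records are made up.

```python
import re

# GTF attributes look like: gene_id "ENSG..."; transcript_id "ENST...";
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def long_gene_ids(lines, max_length=4_000_000):
    """Collect gene_ids of 'gene' features longer than max_length bases."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        length = int(fields[4]) - int(fields[3]) + 1  # coordinates are 1-based, inclusive
        match = GENE_ID.search(fields[8])
        if match and length > max_length:
            ids.add(match.group(1))
    return ids


def drop_genes(lines, ids):
    """Remove every record (gene, transcript, exon, ...) tied to the given gene_ids."""
    kept = []
    for line in lines:
        match = GENE_ID.search(line)
        if match and match.group(1) in ids:
            continue
        kept.append(line)
    return kept


# Made-up records, for illustration only.
gtf = [
    "#!genome-build example",
    '1\tensembl\tgene\t1\t5000000\t.\t+\t.\tgene_id "G1";',
    '1\tensembl\ttranscript\t1\t5000000\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\tensembl\tgene\t6000000\t6001000\t.\t-\t.\tgene_id "G2";',
]

filtered = drop_genes(gtf, long_gene_ids(gtf))
print("\n".join(filtered))
```

Two passes are used so that transcripts and exons are removed together with their parent gene, regardless of the order in which records appear in the file.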

Currently File Chameleon only operates on GTF, GFF3 and FASTA formats and has a very limited set of filters for each format; however, we're committed to expanding the tool over future releases. Please take a look and give us feedback: which of the Ensembl formats would be useful to add and, more importantly, what transformations and filters would make the data more useful for you? What awk or sed script do you run on the files you download that we could run for you, or that others might find helpful?

File Chameleon is also available as a standalone tool and is designed to have easily pluggable filters. If you find the tool useful, you can run it locally and extend it by writing your own plugins to further process files. The package can be downloaded from GitHub, along with extensive documentation and examples.