One of the biggest highlights of the new Ensembl Plants release 40 is the inclusion of the new Wheat (RefSeq v1.0) genome from the International Wheat Genome Sequencing Consortium (IWGSC).

Sequencing the wheat genome has been no easy ride, because the genome is large and highly repetitive. This new assembly from the IWGSC bridges many gaps left by the initial genome sequencing effort. Read on to find out more about this exciting new genome assembly!

Since release 81, Ensembl has provided gene annotations in GFF3 files alongside the existing GTF ones. While GTF uses its own controlled vocabulary to classify features, GFF3 uses Sequence Ontology (SO) terms. In the initial release, we attempted to map all existing Ensembl biotypes to equivalent SO terms.

This has proven unsatisfactory for several reasons:

  • not all biotypes have an equivalent SO term
  • there are too many levels of granularity, with 25 terms for genes and another 25 for transcripts
  • some SO mappings do not respect the parent-child relationship expected between gene and transcript SO terms
  • some SO mappings are inaccurate, missing or wrong
  • it is mostly redundant with the biotypes which are also provided as an attribute
  • there can be confusion when most features have identical values in the third column (the SO term) and the biotype attribute, yet a handful do not

For all these reasons, our SO term mapping has undergone a major overhaul to take better advantage of the functionality that the Sequence Ontology offers. The new mapping, which will be used from release 90 onwards, provides general biotype groupings that match the ones used on the website. As a result, all gene biotypes are mapped to one of three groups: coding, non-coding or pseudogene. Transcript biotypes are mapped to one of five main groups: mRNA, pseudogenic_transcript, long non-coding RNAs, short non-coding RNAs and IG biotypes.

The groupings also remove some of the previous granularity, which can still be explored via the biotype attribute, and the assigned terms respect the gene-transcript relationship where possible.
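As a rough illustration of what such a grouping looks like, here is a minimal Python sketch. The biotype names are a small hand-picked sample and the tables are hypothetical; this is not the official Ensembl mapping, which should be taken from the GFF3 files themselves.

```python
# Illustrative only: a tiny, hand-picked sample of biotypes.
# These tables are NOT the official Ensembl mapping.
GENE_GROUPS = {
    "protein_coding": "coding",
    "lincRNA": "non-coding",
    "miRNA": "non-coding",
    "processed_pseudogene": "pseudogene",
}

TRANSCRIPT_GROUPS = {
    "protein_coding": "mRNA",
    "processed_pseudogene": "pseudogenic_transcript",
    "lincRNA": "long non-coding RNA",
    "miRNA": "short non-coding RNA",
    "IG_V_gene": "IG biotype",
}


def group_for(biotype, feature="gene"):
    """Return the general group for a biotype, or None if it is not in the sample table."""
    table = GENE_GROUPS if feature == "gene" else TRANSCRIPT_GROUPS
    return table.get(biotype)
```

The point of the sketch is that many fine-grained biotypes collapse into a handful of groups, while the original biotype stays available as an attribute for anyone who needs the detail.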

To see the full extent of those changes, as they will be reflected in the GFF3 files provided from release 90 onwards, please check these files on the FTP.
We hope this improvement will help our users take better advantage of the GFF3 format.

Ensembl transcripts have two identifiers: the versioned ENST, which is stable through time and can be tracked from release to release, and a separate identifier that incorporates a gene symbol. The latter has changed in e!89; read on for more details.


Transforming file formats has always been a troublesome issue in bioinformatics because of the numerous standards and slight eccentricities in formatting required by some software packages. How many times have you needed to transform chromosome names between 1,2,3 and chr1, chr2, chr3 or vice versa? With the introduction of File Chameleon we hope to somewhat smooth this process for data consumers.

File Chameleon is a web service introduced by Ensembl to transform Ensembl FTP files for easier use across the spectrum of bioinformatics tools. Need UCSC-style chromosome names? Need genes longer than 4 Mbp removed? File Chameleon can do that. From the File Chameleon web interface, simply select the species and the flat file you want to download (individual chromosome GTF, full assembly FASTA, etc.), then select the filters you want to apply. The file will be transformed and ready to download within a few minutes.
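To make the two transformations mentioned above concrete, here is a minimal Python sketch of what they amount to on a GTF file. It is not File Chameleon's own code: the function names are hypothetical, only simple names such as 1, X and MT are handled (scaffold names would need a proper lookup table), and only the gene line itself is dropped, not its child transcript or exon lines.

```python
def to_ucsc_name(seq_name):
    """Convert an Ensembl-style sequence name (1, X, MT) to UCSC style (chr1, chrX, chrM)."""
    if seq_name.startswith("chr"):
        return seq_name
    if seq_name == "MT":
        return "chrM"
    return "chr" + seq_name


def transform_gtf_lines(lines, max_gene_length=4_000_000):
    """Yield GTF lines with UCSC-style names, skipping gene lines longer than max_gene_length."""
    for line in lines:
        # Pass header/comment lines and blank lines through unchanged.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        feature, start, end = fields[2], int(fields[3]), int(fields[4])
        # GTF coordinates are 1-based and inclusive, hence the +1.
        if feature == "gene" and end - start + 1 > max_gene_length:
            continue
        fields[0] = to_ucsc_name(fields[0])
        yield "\t".join(fields) + "\n"
```

A typical use would be to open a downloaded GTF file, pass the file object to `transform_gtf_lines`, and write the yielded lines to a new file.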

Currently File Chameleon only operates on GTF, GFF3 and FASTA formats and has a very limited set of filters for each format, but we're committed to expanding the tool over future releases. Please take a look and give us feedback: which of the Ensembl formats would be useful to add and, more importantly, what transformations and filters on the data would make it more useful for you? What awk or sed script do you run on the files you download that we could run for you, or that others might find helpful?

File Chameleon is also available as a standalone tool and is designed to have easily pluggable filters. If you find the tool useful, you can run it locally and extend it by writing your own plugins to further process files. The package can be downloaded via GitHub along with extensive documentation and examples.

The CRISPR/Cas9 system has revolutionised scientific research over the last few years, offering an efficient method of genome editing. CRISPR/Cas9 utilises the cellular machinery used by bacteria to recognise and edit the DNA of invading viruses. It is formed of two key components: Cas9, an enzyme that can cut a double DNA strand at a precise point; and CRISPR, a short strand of RNA that guides the Cas9 enzyme to recognise and cleave at specific DNA sites.

Cas9 cuts DNA only next to a specific Protospacer Adjacent Motif (PAM), whose sequence is species-dependent (for example, 5′ NGG 3′ for Streptococcus pyogenes Cas9). Coupling Cas9 with a custom CRISPR guide RNA (gRNA) therefore targets its cutting activity to specific locations in the genome that contain a PAM.
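A naive way to see what a PAM-based candidate site looks like is to scan a sequence for NGG motifs and take the bases immediately upstream as the potential gRNA target. The Python sketch below does this for the forward strand only. The 20-base guide length is an assumption (a common convention for S. pyogenes Cas9), the function name is hypothetical, and a real predictor would also scan the reverse strand and score off-target matches.

```python
import re


def find_ngg_sites(sequence, guide_length=20):
    """Return (start, protospacer, pam) tuples for each NGG PAM on the forward strand.

    start is the 0-based position of the first protospacer base.
    """
    sequence = sequence.upper()
    sites = []
    # A lookahead is used so that overlapping PAMs (e.g. in "AGGG") are all found.
    for match in re.finditer(r"(?=([ACGT]GG))", sequence):
        pam_start = match.start()
        if pam_start < guide_length:
            continue  # not enough upstream sequence for a full-length guide
        start = pam_start - guide_length
        sites.append((start, sequence[start:pam_start], match.group(1)))
    return sites
```

Each returned tuple pairs a candidate target sequence with the PAM that follows it, which mirrors the two-part structure of a CRISPR site: a gRNA binding sequence plus its adjacent PAM.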

The latest release of Ensembl (Ensembl 85, July 2016) now includes annotated CRISPR/Cas9 sites predicted by the Wellcome Trust Sanger Institute Genome Editing (WGE) group for human and mouse genomes.

The WGE group have predicted CRISPR sites and developed an accompanying database to help you design genome editing experiments. You can view these WGE-predicted sites by adding the ‘WGE CRISPR sites’ track to any ‘Region in Detail’ view for human or mouse in Ensembl. Click on the ‘Configure this page’ option in the menu on the left-hand side of the page, then add the track, which can be found in the ‘Other regulatory regions’ category, by clicking the empty box and selecting the track style from the pop-up window.

Below, you can see an example of the WGE-predicted CRISPR site track added, on both the forward and reverse strands, to the genomic region containing the human BRCA2 gene in the ‘structure’ style. Each CRISPR site is drawn as a single green box, which appears as a single vertical line when viewing a large genomic region.

From the example above, we have now zoomed into a specific region of interest. You can see the structure of each CRISPR site, with the filled green box matching up with the PAM motif and the unfilled box representing the potential gRNA binding sequence. Clicking on any of these individual CRISPR sites will open a pop-up window that provides more information about the specific genomic co-ordinates of the CRISPR site as well as a link to the WGE database.

You can find more information about the CRISPR site prediction method in the published description of the WGE database.

Continued HapMap variation data access through Ensembl

NCBI have recently announced plans to retire their HapMap interface immediately. However, data from the HapMap Project will continue to be freely accessible through Ensembl. Extensive help and documentation, as well as video tutorials, are available to help you learn how to access variant data in Ensembl. This post complements those materials by highlighting the methods for accessing HapMap Project variant data specifically.

Finding HapMap variants by ID

You can find data from the HapMap Project for a specific variant by searching for its rsID. In Ensembl, the information shown for variants identified in the HapMap Project includes population genetics statistics.

However, as you can see from the example above, some of the populations represented in the HapMap Project have two separate entries in the Population Genetics table. This is because the HapMap Project was completed in a number of phases. In the first phase, a number of different groups used different genotyping platforms to type variants from a number of population panels (CEU, YRI, HCB, JPT). In a later phase, a larger set of samples were added to the samples from the initial phase and submitted as HapMap3. The two entries refer to the two submitted phases of the HapMap Project, where the number in brackets next to the allele frequency indicates the number of samples in that population.

It is also possible to view HapMap Project results by gene of interest by searching the Variant Table. The Variant Table can be filtered by ‘Evidence’ type so you can choose to see only HapMap Project variants, for example.

Filtering the Variant Table by ‘Evidence type = HapMap’ restricts the displayed variants to those identified in the HapMap Project. These are marked by the HapMap evidence icon in the Evidence column.

Finding HapMap Project variant data using BioMart 

HapMap SNP data can also be retrieved using BioMart. There is help and documentation and a video tutorial to help you while using BioMart.

When querying the Homo sapiens short variants dataset in BioMart, you can access HapMap variant data specifically by using the ‘Variant Set Name’ filter and selecting the HapMap populations that are relevant for your research.

Finding HapMap variants using the Ensembl API

It is also possible to access variation data through the Ensembl APIs. Using the Perl API, for example, it is possible to retrieve variation data specifically related to the HapMap Project variant set, either as the whole HapMap variant set, or as individual populations represented in the HapMap Project.
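The post refers to the Perl API; as an alternative illustration, the Python sketch below uses the Ensembl REST service instead, which is a different access route from the one described above. It assumes the `variation/:species/:id` endpoint with `pops=1` returns a `populations` list whose entries carry a `population` name, and that HapMap populations can be recognised by "hapmap" appearing in that name. Both points, and whether a given release still reports HapMap populations, are assumptions to check against the REST documentation.

```python
import json
import urllib.request

REST_SERVER = "https://rest.ensembl.org"


def fetch_variation(rs_id, species="human"):
    """Fetch a variant record, including population frequencies, from the REST service."""
    request = urllib.request.Request(
        f"{REST_SERVER}/variation/{species}/{rs_id}?pops=1",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def hapmap_frequencies(record):
    """Keep only the population entries whose name mentions HapMap (assumed naming)."""
    return [
        pop
        for pop in record.get("populations", [])
        if "hapmap" in pop.get("population", "").lower()
    ]
```

With network access, `hapmap_frequencies(fetch_variation(rs_id))` would return the HapMap entries for the rsID of interest, if any are reported for that variant.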