As of Ensembl release 93, which is due at the end of the month, the Gene Variant Image view will be retired for human. We have elected to retire this page because we feel that the density of known genetic variation is too great for this view to be informative in its current form.
In the latest Ensembl release (Ensembl 90, August 2017), we have added the option for you to adjust the y-axis of your custom “wiggle” tracks, such as BigWig and bedGraph files.
Since release 81, Ensembl has provided gene annotations in GFF3 files alongside the existing GTF ones. While GTF uses its own controlled vocabulary to classify features, GFF3 takes advantage of the Sequence Ontology (SO). In the initial release, we attempted to map all existing Ensembl biotypes to equivalent SO terms.
This has proven unsatisfactory for several reasons:
- not all biotypes have an equivalent SO term
- there are too many levels of granularity, with 25 terms for genes and another 25 for transcripts
- some SO mappings do not respect the parent-child relationship expected between gene and transcript SO terms
- some SO mappings are inaccurate or missing
- the SO term is mostly redundant with the biotype, which is also provided as an attribute
- there can be confusion when most features have identical values in the third column (the SO term) and the biotype attribute, yet a handful do not
For all these reasons, our SO term mapping has undergone a major overhaul to take advantage of the functionality sequence ontologies offer. This new mapping, which will be used from release 90 onwards, attempts to provide general biotype groupings that match the ones used on the website. As a result, all gene biotypes are mapped to one of three groups: coding, non-coding or pseudogene. Meanwhile, transcript biotypes are mapped to one of five main groups: mRNA, pseudogenic_transcript, long non-coding RNAs, short non-coding RNAs and IG biotypes.
Additionally, the groupings remove some of the previous granularity, which can still be explored via the biotype attribute, and the assigned terms respect the gene-transcript relationship where possible.
To see the full extent of those changes, as they will be reflected in the GFF3 files provided from release 90 onwards, please check these files on the FTP.
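If you want to check how the third column relates to the biotype attribute in a file you have downloaded, a cross-tabulation is one quick way to do it. The sketch below uses base R only and runs on a few made-up GFF3 lines: the IDs, coordinates and feature types are hypothetical and only imitate the general layout of an Ensembl GFF3 file, so treat it as a starting point rather than a description of the real files.

```r
# Hypothetical GFF3 lines in the Ensembl style (IDs, coordinates and types are made up).
gff_lines <- c(
  "1\tensembl\tgene\t1000\t5000\t.\t+\t.\tID=gene:G1;biotype=protein_coding",
  "1\tensembl\tmRNA\t1000\t5000\t.\t+\t.\tID=transcript:T1;Parent=gene:G1;biotype=protein_coding",
  "1\tensembl\tncRNA_gene\t7000\t7100\t.\t-\t.\tID=gene:G2;biotype=miRNA",
  "1\tensembl\tmiRNA\t7000\t7100\t.\t-\t.\tID=transcript:T2;Parent=gene:G2;biotype=miRNA"
)

# Split the nine tab-separated GFF3 columns and keep the two we need.
fields <- do.call(rbind, strsplit(gff_lines, "\t", fixed = TRUE))
gff <- data.frame(type = fields[, 3], attributes = fields[, 9],
                  stringsAsFactors = FALSE)

# Pull the value of the biotype attribute out of column 9.
# Lines without a biotype attribute would be returned unchanged by sub().
gff$biotype <- sub(".*(?:^|;)biotype=([^;]*).*", "\\1", gff$attributes, perl = TRUE)

# Cross-tabulate the SO term (column 3) against the biotype attribute.
type_by_biotype <- table(type = gff$type, biotype = gff$biotype)
print(type_by_biotype)
```

For a real file you would replace `gff_lines` with the non-comment lines read from the GFF3 file, for example with `readLines()` followed by dropping lines that start with `#`.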
We hope this improvement will help our users take better advantage of the GFF3 format.
Ensembl transcripts have two identifiers, the versioned ENST, which is stable through time and can be tracked from release to release, and a separate identifier that incorporates a gene symbol. The latter have changed in e!89; read on for more details.
Transforming file formats has always been a troublesome issue in bioinformatics because of the numerous standards and slight eccentricities in formatting required by some software packages. How many times have you needed to transform chromosome names between 1,2,3 and chr1, chr2, chr3 or vice versa? With the introduction of File Chameleon we hope to somewhat smooth this process for data consumers.
File Chameleon is a web service introduced by Ensembl to transform Ensembl FTP files for easier use across the spectrum of bioinformatics tools. Need UCSC style chromosome names? Need genes longer than 4Mbp removed? File Chameleon can do that. From the File Chameleon web interface simply select the species and which flat file you want to download (individual chromosome GTF, full assembly FASTA, etc), then select which filters you want to apply. The file will be transformed and ready to download within a few minutes.
Currently File Chameleon only operates on GTF, GFF3 and FASTA formats and has a very limited set of filters for each format, but we’re committed to expanding the tool over future releases. Please take a look and give us feedback: which of the Ensembl formats would be useful to add and, more importantly, what transformations and filters on the data would make it more useful for you? What is the awk or sed script you run on the files you download that we could do for you, or that others might find helpful?
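As an illustration of the kind of one-off script we have in mind, here is a minimal base R sketch of the chromosome renaming mentioned above. It is not how File Chameleon is implemented: the function names `to_ucsc` and `to_ensembl` are made up for this example, and it only handles the primary chromosomes and the mitochondrion (MT in Ensembl, chrM at UCSC), not scaffolds or patches, whose names differ in less regular ways.

```r
# Convert Ensembl-style names (1, 2, X, MT) to UCSC-style names (chr1, chr2, chrX, chrM).
# Names that already start with "chr" are left untouched.
to_ucsc <- function(chrom) {
  converted <- ifelse(chrom == "MT", "chrM", paste0("chr", chrom))
  ifelse(startsWith(chrom, "chr"), chrom, converted)
}

# Convert UCSC-style names back to Ensembl-style names.
to_ensembl <- function(chrom) {
  stripped <- sub("^chr", "", chrom)
  ifelse(stripped == "M", "MT", stripped)
}

print(to_ucsc(c("1", "X", "MT", "chr2")))
print(to_ensembl(c("chr1", "chrM", "Y")))
```

To apply this to a GTF or GFF3 file you would rewrite only the first tab-separated column of each non-comment line and leave the rest of the line as it is.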
File Chameleon is also available as a standalone tool and is designed to have easily pluggable filters. If you find the tool useful, you can run it locally and extend it by writing your own plugins to further process files. The package can be downloaded via GitHub along with extensive documentation and examples.
The CRISPR/Cas9 system has revolutionised scientific research over the last few years, offering an efficient method of genome editing. CRISPR/Cas9 utilises the cellular machinery used by bacteria to recognise and edit the DNA of invading viruses. It is formed of two key components: Cas9, an enzyme that can cut double-stranded DNA at a precise point; and CRISPR, a short strand of RNA that guides the Cas9 enzyme to recognise and cleave at specific DNA sites.
Cas9 cuts DNA next to specific Protospacer Adjacent Motifs (PAMs), whose sequence depends on the species the Cas9 enzyme comes from (for example, 5′ NGG 3′ for Streptococcus pyogenes Cas9). Therefore, by pairing Cas9 with a custom guide RNA (gRNA), its cutting activity can be targeted to specific locations in the genome that lie next to a PAM.
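To make the PAM idea concrete, the base R sketch below scans a DNA string for candidate sites made of a 20-base stretch followed by an NGG motif. It is a simplified illustration only: it looks at the forward strand alone, the function name `find_pam_sites` and the default guide length of 20 are assumptions made for this example, and it is not the prediction method used by the WGE group.

```r
# Find candidate sites on the forward strand: guide_length bases followed by NGG.
# Returns one row per site with its 1-based start, the guide-matching bases and the PAM.
find_pam_sites <- function(dna, guide_length = 20L) {
  dna <- toupper(dna)
  site_length <- guide_length + 3L
  empty <- data.frame(start = integer(0), protospacer = character(0),
                      pam = character(0), stringsAsFactors = FALSE)
  if (nchar(dna) < site_length) return(empty)

  # Slide a window of site_length bases along the sequence, one base at a time.
  starts <- seq_len(nchar(dna) - site_length + 1L)
  windows <- substring(dna, starts, starts + site_length - 1L)

  # Keep windows made of unambiguous bases that end in any base followed by GG.
  pattern <- paste0("^[ACGT]{", guide_length, "}[ACGT]GG$")
  is_site <- grepl(pattern, windows)

  data.frame(start = starts[is_site],
             protospacer = substr(windows[is_site], 1L, guide_length),
             pam = substr(windows[is_site], guide_length + 1L, site_length),
             stringsAsFactors = FALSE)
}

# Toy sequence: 20 A's, then TGG, then CCC.
sites <- find_pam_sites(paste0(strrep("A", 20), "TGG", "CCC"))
print(sites)
```

Sites on the reverse strand could be found the same way by running the function on the reverse complement of the sequence and converting the coordinates back.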
The latest release of Ensembl (Ensembl 85, July 2016) now includes annotated CRISPR/Cas9 sites predicted by the Wellcome Trust Sanger Institute Genome Editing (WGE) group for human and mouse genomes.
The WGE group have predicted CRISPR sites and developed an accompanying database to help you design genome editing experiments, and you can view these WGE-predicted sites by adding the ‘WGE CRISPR sites’ track to any ‘Region in Detail’ view for human or mouse in Ensembl. Click on the ‘Configure this page’ option from the menu on the left hand side of the page, and then add the track, which can be found in the ‘Other regulatory regions’ category, by clicking the empty box and selecting the track style from the pop-up window:
Below, you can see an example of the WGE-predicted CRISPR site track, in the ‘structure’ style, added to both the forward and reverse strands of the genomic region containing the human BRCA2 gene. Each CRISPR site is shown as a single green box, which appears as a single vertical line when viewing a large genomic region.
From the example above, we have now zoomed into a specific region of interest. You can see the structure of each CRISPR site, with the filled green box matching up with the PAM motif and the un-filled box representing the potential gRNA binding sequence. Clicking on any of these individual CRISPR sites will open a pop-up window that provides you with more information about the specific genomic co-ordinates of the CRISPR site as well as a link to the WGE database.
You can find more information about the CRISPR site prediction method in the published description of the WGE database.
Continued HapMap variation data access through Ensembl
NCBI have recently announced plans to retire their HapMap interface immediately. However, data from the HapMap Project will continue to be freely accessible through Ensembl. There is plenty of help and documentation, as well as video tutorials, to help you learn how to access variant data in Ensembl. This post complements those materials by highlighting the methods for accessing the HapMap Project variant data specifically.
Finding HapMap variants by ID
You can find HapMap Project data for a specific variant by searching for its rsID. The information Ensembl holds for variants identified in the HapMap Project includes population genetics statistics:
However, as you can see from the example above, some of the populations represented in the HapMap Project have two separate entries in the Population Genetics table. This is because the HapMap Project was completed in a number of phases. In the first phase, a number of different groups used different genotyping platforms to type variants from a number of population panels (CEU, YRI, HCB, JPT). In a later phase, a larger set of samples were added to the samples from the initial phase and submitted as HapMap3. The two entries refer to the two submitted phases of the HapMap Project, where the number in brackets next to the allele frequency indicates the number of samples in that population.
It is also possible to view HapMap Project results by gene of interest by searching the Variant Table. The Variant Table can be filtered by ‘Evidence’ type so you can choose to see only HapMap Project variants, for example.
Finding HapMap Project variant data using BioMart
When querying the Homo sapiens short variants dataset in BioMart, you can access HapMap variant data specifically by using the ‘Variant Set Name’ filter and selecting the HapMap populations that are relevant for your research.
Finding HapMap variants using the Ensembl API
It is also possible to access variation data through the Ensembl APIs. Using the Perl API, for example, it is possible to retrieve variation data specifically related to the HapMap Project variant set, either as the whole HapMap variant set, or as individual populations represented in the HapMap Project.
We released the new Ensembl mobile site in release 82 and it’s already being used by our communities. m.ensembl.org has a slim-line functionality, designed to be a quick reference tool for genes, variants and associated phenotypes.
We hope you will find this useful if, for example, you’re on your mobile device at a conference, or you’re having coffee with a colleague and want to quickly look up something.
Each page features a share button, which gives you the option of posting the page to Facebook or Twitter, or emailing the URL to a colleague or even yourself so that you can explore the annotation in more detail on the desktop website when you get back to your computer.
biomaRt is a Bioconductor package that makes accessing and retrieving Ensembl data from the R software very easy. The recent Bioconductor 3.1 release includes a new version of biomaRt packed with many new Ensembl-friendly functions allowing you to connect to and retrieve data from the Ensembl marts in record time.
To celebrate the new Bioconductor release, we’ve just launched a brand new mart documentation page. This new documentation covers the biomaRt package as well as how to combine data across species and how to use the BioMart RESTful and Perl APIs.
Want to get some Ensembl data from BioMart using biomaRt? Easy, just follow the simple guide below.
How can I install the biomaRt R package?
First make sure you have installed the R software on your computer. Then, run the following commands from your R terminal to install the Bioconductor biomaRt R package:
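The installation commands are not reproduced in this copy of the post. As a sketch, assuming a current R installation with internet access, the package can be installed through BiocManager, which has since replaced the biocLite script that was in use around the Bioconductor 3.1 release:

```r
# Current Bioconductor installation route (BiocManager replaced biocLite).
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("biomaRt")

# Route used around the Bioconductor 3.1 release (now retired), kept for reference:
# source("http://bioconductor.org/biocLite.R")
# biocLite("biomaRt")
```

Note that a current installation gives you the latest biomaRt version rather than the one released with Bioconductor 3.1, so mart names and versions in the output may differ from those shown in this post.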
What are the Ensembl marts?
The listEnsembl function lists the currently available Ensembl marts:
> library(biomaRt)
> listEnsembl()
     biomart       version
1    ensembl       Ensembl Genes 80
2    snp           Ensembl Variation 80
3    regulation    Ensembl Regulation 80
4    vega          Vega 60
5    pride         PRIDE (EBI UK)
Which Ensembl species have Variation data?
The listDatasets function will list all the species available for a given mart.
> library(biomaRt)
> variation = useEnsembl(biomart="snp")
> listDatasets(variation)
What data can I get from the Variation mart (filters and attributes)?
The listFilters and listAttributes functions will give you the list of all the filters and attributes available for a given mart.
> library(biomaRt)
> variation = useEnsembl(biomart="snp", dataset="hsapiens_snp")
> listFilters(variation)
> listAttributes(variation)
How can I get data about a variant using an rsID?
In the following example, you will be able to retrieve the variation source, chromosome location, minor allele with its frequency and count, consequences, and Ensembl gene and transcript IDs for the variant “rs1333049”.
> library(biomaRt)
> variation = useEnsembl(biomart="snp", dataset="hsapiens_snp")
> rs1333049 <- getBM(
    attributes = c('refsnp_id', 'refsnp_source', 'chr_name', 'chrom_start', 'chrom_end',
                   'minor_allele', 'minor_allele_freq', 'minor_allele_count',
                   'consequence_allele_string', 'ensembl_gene_stable_id',
                   'ensembl_transcript_stable_id'),
    filters = 'snp_filter', values = "rs1333049", mart = variation)
> rs1333049
How can I get data on all genes on a chromosome?
In the following example, you will be able to retrieve the Ensembl gene IDs, biotypes, HGNC symbols and genomic coordinates of all genes located on human chromosome Y.
> library(biomaRt)
> ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")
> chrY_genes <- getBM(
    attributes = c('ensembl_gene_id', 'gene_biotype', 'hgnc_symbol',
                   'chromosome_name', 'start_position', 'end_position'),
    filters = 'chromosome_name', values = "Y", mart = ensembl)
> chrY_genes
How can I get protein domain information mapped to an Ensembl Gene ID?
In the following example, you will be able to retrieve Ensembl gene, transcript and protein IDs, together with InterPro and Pfam protein domain IDs and their locations, for the Ensembl Gene ID “ENSG00000198763”.
> library(biomaRt)
> ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")
> domain_location_ENSG00000198763 <- getBM(
    attributes = c('ensembl_gene_id', 'ensembl_transcript_id', 'ensembl_peptide_id',
                   'interpro', 'interpro_start', 'interpro_end',
                   'pfam', 'pfam_start', 'pfam_end'),
    filters = 'ensembl_gene_id', values = "ENSG00000198763", mart = ensembl)
> domain_location_ENSG00000198763
The Bioconductor biomaRt R package and complete documentation can be found on the biomaRt Bioconductor page.
As we announced in our e80 release post, we have rolled out a prototype display for cis-regulatory interactions, essentially arches connecting any two elements on the same chromosome. This was mainly designed to display your eQTL or Hi-C data with little effort, based on the WashU Epigenomics Explorer interaction track:
All you need to do is prepare a tab-delimited interaction file, optionally indexing it with Tabix if it is too large to upload directly. You can use it to represent intra-chromosomal rearrangements such as the micro-inversion below:
@RNA3DHub even suggested using it to display RNA secondary structure:
You can do what you want really!