We’re really excited to be a part of the ESHG conference again, this time in Copenhagen from the 27th-30th May. We can’t wait to see all the great science that’s going to be presented, but here’s a guide to the talks, workshops and posters from Ensembl and some of our close friends:

W18 – Ensembl & GENCODE Workshop

Monday 29th May 3pm-4:30pm Ancona Room 

This joint workshop is organised and will be presented by Ben Moore and Amonida Zadissa from Ensembl and Adam Frankish from HAVANA. It is aimed at attendees familiar with Ensembl, including wet-lab biologists, clinicians and the bioinformatics community.
The workshop will start with a brief introduction to the Ensembl project and genome browser along with a talk about the gene annotation process carried out by the Ensembl and HAVANA teams to produce the GENCODE gene set.

After the introduction to the Ensembl genome browser, there will be hands-on demonstrations to teach you how to:

  • annotate SNPs and CNVs with functional consequences using the Variant Effect Predictor
  • investigate quick alternatives to the browser (BioMart and the REST API)

You can find more information about the workshop location and timings on the ESHG programme. If you have any questions or want to talk about Ensembl and GENCODE you can e-mail the Ensembl Helpdesk to arrange a time to meet, tweet us or simply come by and meet us after the workshop.

Poster Sessions

As well as the Ensembl and GENCODE workshop, Amonida will also be presenting an electronic poster (E-P16.08) that describes genome annotation and assembly assessment in Ensembl. Electronic Posters will be on display in the Poster Area and can be accessed during exhibition opening hours from 09:30 on Saturday 27th May to 17:45 on Monday 29th May by all participants.

Our friends Jackie, Joannella and Maria from the GWAS Catalog will be presenting their work at ESHG.

  • Maria will be presenting an electronic poster (E-P16.12) with new ideas they have for increasing the speed at which GWAS data are incorporated into the GWAS Catalog.
  • Jackie will be presenting a poster about the steps they are taking to ‘Increase the utility of the NHGRI-EBI genome-wide association study (GWAS) Catalog for users’. Jackie’s poster number is P16.36D and you can come and talk to her between 16:45 and 17:45 on Monday 29th May.

Jackie and Maria would love to hear feedback from GWAS Catalog users during the conference, particularly on the new and proposed functionality presented in their posters. They will also be able to answer any questions on the GWAS Catalog. Any users who would like to talk to them should e-mail the GWAS Catalog team at gwas-info@ebi.ac.uk to arrange a time to meet, or simply come by and meet them at their posters.

And last, but not least, Giselle Kerry will be representing The European Genome-Phenome Archive (EGA) with a poster about their future plans. Giselle’s poster number is P19.44D and you can come and talk to her between 10:15 and 11:15 on Monday 29th May.

Looking forward to seeing you all there!

This is the second instalment of our monthly posts introducing a member of the Ensembl team, and what they do in Ensembl. This time, it’s Will McLaren who works in the Variation team.

What is your job in Ensembl?

I’m the principal developer in the Ensembl Variation team. Our team produces, maintains and supports all of Ensembl’s variation resources. This includes a number of databases as well as the APIs and tools that use them, including the Variant Effect Predictor (VEP).

What do you enjoy about your job?

I love hacking around with code, making new things, taking things apart and fixing them again. Knowing that we’re contributing to advancing science and medicine by doing that is a huge bonus, and the satisfaction I get from that is what I enjoy the most.

I also enjoy interacting with our users, either helping them out over email or face to face at our workshops. Our users really are the inspiration for what we do, so I think it’s really important that we engage with them as much and as productively as we can.

What are you currently working on?

I usually have a number of projects on the go. A lot of my development time is spent working on and supporting VEP, and I’m currently working on improving how we handle RefSeq genes, as well as managing our recent transition to a major VEP update. I’m also spending some time working on a collaborative project with OpenTargets, the aim of which is to help identify links between genomic variation, genes and disease.

What is your typical day?

A typical day for me usually consists of continuing progress on whatever development project or projects I need to prioritise that day, and for me that usually means a terminal, a text editor and a web browser are my best friends. This will be punctuated by various things. I have a daily standup meeting with my team, saying what we’ve done, what we’re going to do, and discussing any issues that come up from that – it might be that a colleague has become stuck on something that someone else has already figured out, so sharing in this way is great for all of us. We also communicate a lot via instant messaging if we can’t talk in person, and this extends across the whole of Ensembl. There are usually a couple of help requests from users, and occasionally this will also involve finding and squashing a bug in our code (something we try to prioritise where we can). If the bug turns up in code someone else is responsible for, we might discuss it with them before working out how to fix it. There may be a pipeline I need to kick off or check on that generates data as part of Ensembl’s ongoing release cycle. I might also have a conference call or a meeting with collaborators; typically this might be to discuss a manuscript in progress or a shared project.

How did you end up here?

I started my academic life doing Biochemistry and Genetics. I pretty quickly realised I didn’t have the manual skills or the patience for lab work, but I loved the discovery aspect of the science. I’d always had a hobby messing with computers, and was delighted to discover that I could combine my hobby and academic interests in this thing we call bioinformatics.

After a year studying bioinformatics, I got my first job working in informatics for pet health. Not the most glamorous, but it started me down the path of working in variation data, and a change of city led me to working at the Sanger Institute (who share a campus with us here at EMBL-EBI) doing statistical genetics for genome-wide association studies (GWAS).

I’d dabbled with Ensembl before, and when I saw a job advertised in Ensembl in a new-ish team that fit my experience and interests and was looking to expand, I jumped at the chance and haven’t looked back!

What surprised you most about Ensembl when you started working here?

What surprised me most was the diversity in the Ensembl team. We really are an international team with representatives from nearly every continent, which is great on both a personal and professional level. As well as this, we have a surprising diversity in people’s educational and career backgrounds. The combination of these means we have a huge breadth and depth of knowledge across the Ensembl team, which allows us to deliver what I consider a staggering array of data and functionality to our users.

What is the coolest tool or data type in Ensembl that you think everybody should know about?

We have a cool web view called the transcript haplotype view. This shows you whole transcript and protein sequences as they would appear in each individual from the 1000 Genomes project, by considering all of the genomic variation across a gene together. We also have a related tool that you can use on your own data called Haplosaurus, and I think this is going to be a really important step towards seeing the real biological picture in sequenced genomes.

We’re already gearing up for Ensembl 89, scheduled for May 2017. It’s a slimline release this time, with just a handful of highlights:

Updated assemblies, gene sets and annotations

  • Human: updated cDNA alignments
  • Mouse: updated cDNA alignments and update to Ensembl-Havana GENCODE gene set

Other updates and highlights

  • Variation and phenotype database updates, including COSMIC version 80.
  • gnomAD frequencies will be available via the website, VEP and APIs.
  • Mapping of array probes to 15 different mouse strains in Ensembl.

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.

The Ensembl regulation resources FTP site saw a facelift in release 87. The directory structures have been modified to make it easier to find files: the file names have become more descriptive, and we now also provide our data in a greater variety of file formats. All data files on our FTP site now adhere to a naming convention, which is described in greater detail here. The filenames include the following information, separated by dots (‘.’):

  • species
  • assembly version
  • cell type (if applicable)
  • feature type (if applicable)
  • analysis name
  • results type
  • data freeze date
  • file format.

E.g.: homo_sapiens.GRCh38.K562.Regulatory_Build.regulatory_activity.20161111.gff.gz
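
As a minimal sketch of how such a filename could be split back into its parts, the Python snippet below works from the single example above. Because the cell type and feature type are optional, it only labels the fields whose position is unambiguous (species and assembly at the start; freeze date, file format and compression suffix at the end) and returns the fields in between as an unlabelled list. The function name `parse_regulation_filename` is hypothetical, and the assumptions that no field contains a dot and that `gz` is the only compression suffix are ours rather than part of the documented convention.

```python
from datetime import datetime


def parse_regulation_filename(filename):
    """Split a dot-separated regulation FTP filename into labelled parts.

    Assumptions (not confirmed by the naming convention): no individual
    field contains a dot, and an optional trailing 'gz' marks compression.
    """
    parts = filename.split(".")

    # Peel fields off the end: compression suffix, file format, freeze date.
    compression = parts.pop() if parts and parts[-1] == "gz" else None
    if len(parts) < 4:
        raise ValueError(f"too few dot-separated fields in {filename!r}")
    file_format = parts.pop()
    # strptime raises ValueError if the field is not a YYYYMMDD date.
    freeze_date = datetime.strptime(parts.pop(), "%Y%m%d").date()

    return {
        "species": parts[0],
        "assembly": parts[1],
        # Cell type, feature type, analysis name and results type; some are
        # optional, so they are left unlabelled rather than guessed.
        "middle_fields": parts[2:],
        "freeze_date": freeze_date,
        "file_format": file_format,
        "compression": compression,
    }


example = parse_regulation_filename(
    "homo_sapiens.GRCh38.K562.Regulatory_Build.regulatory_activity.20161111.gff.gz"
)
print(example["species"], example["assembly"], example["freeze_date"])
```

Leaving the middle fields unlabelled is deliberate: with optional fields, a fixed positional mapping would mislabel files that omit the cell type or feature type.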

The data available on our FTP site include:

Peaks: The set of peaks for transcription factors, histone modifications and variants that are part of our regulatory resources. In previous releases these were collated in one file, called ‘AnnotatedFeatures.gff.gz’, but with our recent expansion to 88 human cell types with ChIP-seq data, the file became too big. Therefore, we split it into separate files by cell and feature type in the ‘Peaks’ subdirectory. The peaks are now available in gff, bed and bigBed formats.

Quality scores: The outcome of our quality checks from processing the ChIP-seq data that yielded the peaks. These are provided in JSON format in the ‘QualityChecks’ subdirectory and include:

  • the number of mapped reads
  • the estimated fragment length, the NSC and RSC values using phantompeakqualtools
  • the proportion of reads in peaks
  • the enrichment of the ChIP over the Input using CHANCE.

Regulatory build: The current set of regulatory features along with their predicted activity in every cell type. We provide one gff file per cell type in the ‘regulatory_features’ subdirectory.

Transcription factor motifs: The transcription factor motifs identified using position weight matrices from JASPAR in enriched regions identified by our ChIP-seq analysis pipeline in gff format.

Due to essential maintenance, the Ensembl helpdesk email will be shut down for approximately 48 hours, beginning at 9 am (GMT) on 25th November. Any emails sent during this time will be held in a queue, and we will respond to them when the system is up and running again, although there may be some delay. You will not receive any confirmation of your email, as this is automatically generated by the system.

You can also post queries to the Ensembl dev list and BioStars (please add “Ensembl” as a tag).

We will update you when the system is back.

We apologise for any inconvenience this may cause.

What’s new?

Ensembl Plants takes centre stage in the release of Ensembl Genomes 33, with a variety of new data available for a number of different species:

  • Incorporation of the Araport 11 gene model annotation for Arabidopsis thaliana
  • Addition of mitochondrial and plastid genome sequences to the current maize (Zea mays) chromosomal assembly (AGPv4)
  • Alignment between the A, B and D genomes of bread wheat (Triticum aestivum) updated to use TGACv1 genome assemblies
  • Whole genome alignment between bread wheat and Brachypodium distachyon

Other News

You can find more details in the release notes.

What’s New in e86:

Mouse strain genomes

In Ensembl 86, you can now view the annotated genome assemblies, variation data and comparative analyses of 16 different mouse strains, produced by the Mouse Genomes Project. While the GRCm38 assembly (produced from Mus musculus strain C57BL/6J) remains the reference assembly, variants and comparative analyses for the other strains can be viewed through the Gene tab and the Location tab.

You can find the gene trees and orthologue/paralogue predictions for the mouse strains through the Strains option in the menu in the Gene tab. The mouse strain gene tree depicts the evolutionary history of genes (left) and protein alignment (right) for the individual mouse strains and rat.

You can find the variants between these mouse strains through the Strain table option in the menu in the Location tab. The strain table displays the alleles identified at variant positions across the 16 mouse strains.

Updated assemblies, gene sets and annotations

In Ensembl 86, there will also be a number of updates to the assemblies and gene sets for a number of different species:

  • Human: updated cDNA alignments and RefSeq import
  • Mouse: updated cDNA alignments and RefSeq import
  • Zebrafish: updated gene set and RefSeq import
  • Chicken: updated to the Galgal_5.0 assembly
  • Mouse lemur: updated to the Mmur_2.0 assembly
  • Macaque: updated to the Mmul_8.0.1 assembly

New lincRNA data

New Mobile Site Views

As of release 86, you can view transcripts on the mobile version of Ensembl. You can also view exon sequence, cDNA sequence and protein sequence by clicking on the left-hand arrow.


The gene sequence is also now available to view on mobile devices. Just go to any gene page, click on the left-hand arrow and then choose sequence.


Other News

  • Variation and phenotype databases updated
  • You can now select ‘Manhattan plot’ as an option when configuring bigWig files

A complete list of the changes can be found on the Ensembl website.

Find out more about the new release and ask the team questions in our free webinar: Tuesday 11th October, 4pm BST. Register here.

Ensembl 86 is scheduled for September 2016. Highlights include:

New mouse strains

  • Annotated genome assemblies, variation data and comparative analyses of 16 different mouse strains will be included in Ensembl 86.

Updated assemblies, gene sets and annotations

  • Human: updated cDNA alignments and RefSeq import
  • Mouse: updated cDNA alignments and RefSeq import
  • Zebrafish: updated gene set and RefSeq import
  • Chicken: updating to the Galgal_5.0 assembly
  • Mouse lemur: updating to the Mmur_2.0 assembly
  • Macaque: updating to the Mmul_8.0.1 assembly

New lincRNA data

New GRCh37 tools converted from 1000 Genomes Project

A number of tools previously developed for use in the 1000 Genomes Project browser have now been converted for use with the GRCh37 assembly in Ensembl:

  • Dataslicer tool: This tool allows you to get a subset of data from a BAM or VCF file.
  • Variation pattern finder tool: This tool allows you to identify variation patterns in a chromosomal region of interest for different individuals.
  • Forge analysis tool: This tool takes a list of variants and analyses their enrichment in functional regions from the ENCODE or Roadmap Epigenomics project on a tissue-specific basis.

Other updates and highlights

  • Variation and phenotype database updates

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.

The CRISPR/Cas9 system has revolutionised scientific research over the last few years, offering an efficient method of genome editing. CRISPR/Cas9 utilises the cellular machinery used by bacteria to recognise and edit the DNA of invading viruses. It is formed of two key components: Cas9, an enzyme that can cut a double DNA strand at a precise point; and CRISPR, a short strand of RNA that guides the Cas9 enzyme to recognise and cleave at specific DNA sites.

Cas9 cuts DNA next to specific Protospacer Adjacent Motifs (PAMs), whose sequence is species-dependent (for example, 5′ NGG 3′ for Streptococcus pyogenes Cas9). Therefore, by coupling Cas9 with a custom guide RNA (gRNA), its cutting activity can be targeted to specific locations in the genome that contain a PAM.
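
To make the PAM constraint concrete, here is a rough Python sketch that scans a DNA sequence on both strands for candidate sites consisting of a protospacer followed by an NGG motif. It is an illustration only and is not the method used by WGE. The 20-base protospacer length is an assumption (a common choice for S. pyogenes Cas9), and the function names are hypothetical.

```python
import re

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_pam_sites(seq, protospacer_len=20):
    """List candidate sites as (strand, start, protospacer, pam) tuples.

    'start' is the 0-based forward-strand coordinate of the leftmost base
    of the whole site (protospacer plus 3-base PAM).
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    site_len = protospacer_len + 3
    sites = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for match in pattern.finditer(strand_seq):
            start = match.start()
            if strand == "-":
                # Convert a reverse-strand offset to a forward-strand coordinate.
                start = len(seq) - (match.start() + site_len)
            sites.append((strand, start, match.group(1), match.group(2)))
    return sites


toy_sequence = "A" * 20 + "TGG"  # made-up sequence, not real genomic data
print(find_pam_sites(toy_sequence))
```

A real design tool would also score each candidate for off-target matches elsewhere in the genome, which this sketch does not attempt.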

The latest release of Ensembl (Ensembl 85, July 2016) now includes annotated CRISPR/Cas9 sites predicted by the Wellcome Trust Sanger Institute Genome Editing (WGE) group for human and mouse genomes.

The WGE group have predicted CRISPR sites and developed an accompanying database to help you design genome editing experiments. You can view these WGE-predicted sites by adding the ‘WGE CRISPR sites’ track to any ‘Region in Detail’ view for human or mouse in Ensembl. Click on the ‘Configure this page’ option from the menu on the left-hand side of the page, then add the track, which can be found in the ‘Other regulatory regions’ category, by clicking the empty box and selecting the track style from the pop-up window.

Below, you can see an example of the WGE-predicted CRISPR site track added (on both the forward and reverse strands) to the genomic region containing the human BRCA2 gene, in the ‘structure’ style. Each CRISPR site is shown as a single green box, which appears as a single vertical line when viewing a large genomic region.

From the example above, we have now zoomed into a specific region of interest. You can see the structure of each CRISPR site, with the filled green box matching up with the PAM motif and the unfilled box representing the potential gRNA binding sequence. Clicking on any of these individual CRISPR sites will open a pop-up window that provides you with more information about the specific genomic co-ordinates of the CRISPR site as well as a link to the WGE database.

You can find more information about the CRISPR site prediction method in the published description of the WGE database.

Continued HapMap variation data access through Ensembl

NCBI have recently announced plans to retire their HapMap interface immediately; however, data from the HapMap Project will continue to be freely accessible through Ensembl. There is lots of help and documentation, as well as video tutorials, to help you learn how to access variant data in Ensembl. This post aims to complement those materials by highlighting the methods for accessing the HapMap Project variant data specifically.

Finding HapMap variants by ID

You can find data from the HapMap Project relating to specific variants by searching for the variant rsID itself. In Ensembl, you can find information related to variants identified in the HapMap Project, which includes population genetics statistics.

However, as you can see from the example above, some of the populations represented in the HapMap Project have two separate entries in the Population Genetics table. This is because the HapMap Project was completed in a number of phases. In the first phase, a number of different groups used different genotyping platforms to type variants from a number of population panels (CEU, YRI, HCB, JPT). In a later phase, a larger set of samples were added to the samples from the initial phase and submitted as HapMap3. The two entries refer to the two submitted phases of the HapMap Project, where the number in brackets next to the allele frequency indicates the number of samples in that population.

It is also possible to view HapMap Project results by gene of interest by searching the Variant Table. The Variant Table can be filtered by ‘Evidence’ type so you can choose to see only HapMap Project variants, for example.

Filtering the Variant Table by ‘Evidence type = HapMap’ will restrict the displayed variants to those identified in the HapMap Project. This will be denoted by the HapMap evidence icon in the Evidence column.

Finding HapMap Project variant data using BioMart 

HapMap SNP data can also be retrieved using BioMart. There is help and documentation and a video tutorial to help you while using BioMart.

When querying the Homo sapiens short variants dataset in BioMart, you can access HapMap variant data specifically by using the ‘Variant Set Name’ filter and selecting the HapMap populations that are relevant for your research.
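
The same kind of query can also be written as BioMart query XML and run without the web interface. The Python sketch below only builds the XML and does not contact any server. The internal names used here (`hsapiens_snp` for the dataset, `variation_set_name` for the filter, the attribute names, and the set name `HapMap - CEU`) are assumptions for illustration; the exact names for your release can be checked with the XML button in the BioMart interface.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart query XML string (no network access is performed)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    dataset_el = ET.SubElement(
        query, "Dataset", {"name": dataset, "interface": "default"}
    )
    # One Filter element per restriction, one Attribute element per column.
    for name, value in filters.items():
        ET.SubElement(dataset_el, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(dataset_el, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE Query>\n' + body


# All names below are assumed, not taken from the post; verify before use.
query_xml = build_biomart_query(
    dataset="hsapiens_snp",
    filters={"variation_set_name": "HapMap - CEU"},
    attributes=["refsnp_id", "chr_name", "chrom_start", "allele"],
)
print(query_xml)
```

The resulting string can then be submitted as the `query` parameter to the BioMart martservice, for example with `wget` or Python's `urllib`.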

Finding HapMap variants using the Ensembl API

It is also possible to access variation data through the Ensembl APIs. Using the Perl API, for example, it is possible to retrieve variation data specifically related to the HapMap Project variant set, either as the whole HapMap variant set, or as individual populations represented in the HapMap Project.
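
For readers who prefer not to install the Perl API, a similar lookup can be sketched in Python against the Ensembl REST API. This swaps in a different interface from the Perl API described above. The sketch uses the REST `variation/:species/:id` endpoint with `pops=1` to request population frequencies, then keeps the entries whose population name mentions HapMap. The case-insensitive "hapmap" match on population names is an assumption, the helper names are hypothetical, and whether HapMap populations are returned depends on the Ensembl release being served.

```python
import json
import urllib.request

REST_SERVER = "https://rest.ensembl.org"


def variation_url(rsid, species="human"):
    """Build the REST URL for one variant, asking for population data."""
    return f"{REST_SERVER}/variation/{species}/{rsid}?pops=1"


def fetch_variation(rsid, species="human"):
    """Fetch one variant record as a dict (performs a network request)."""
    request = urllib.request.Request(
        variation_url(rsid, species),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def hapmap_frequencies(record):
    """Keep only population entries whose name mentions HapMap.

    Assumes HapMap population names contain 'hapmap' in some letter case.
    """
    return [
        entry
        for entry in record.get("populations", [])
        if "hapmap" in entry.get("population", "").lower()
    ]
```

Typical use would be `hapmap_frequencies(fetch_variation(rsid))` for an rsID of interest; the network call is kept in its own function so that the filtering step can be tested offline.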