This is the second instalment of our monthly posts introducing a member of the Ensembl team and what they do in Ensembl. This time, it’s Will McLaren, who works in the Variation team.

What is your job in Ensembl?

I’m the principal developer in the Ensembl Variation team. Our team produces, maintains and supports all of Ensembl’s variation resources. This includes a number of databases as well as the APIs and tools that use them, including the Variant Effect Predictor (VEP).

What do you enjoy about your job?

I love hacking around with code, making new things, taking things apart and fixing them again. Knowing that we’re contributing to advancing science and medicine by doing that is a huge bonus, and the satisfaction I get from that is what I enjoy the most.

I also enjoy interacting with our users, either helping them out over email or face to face at our workshops. Our users really are the inspiration for what we do, so I think it’s really important that we engage with them as much and as productively as we can.

What are you currently working on?

I usually have a number of projects on the go. A lot of my development time is spent working on and supporting VEP, and I’m currently working on improving how we handle RefSeq genes, as well as managing our recent transition to a major VEP update. I’m also spending some time working on a collaborative project with OpenTargets, the aim of which is to help identify links between genomic variation, genes and disease.

What is your typical day?

A typical day for me usually consists of continuing progress on whatever development project or projects I need to prioritise that day, and for me that usually means a terminal, a text editor and a web browser are my best friends. This will be punctuated by various things. I have a daily standup meeting with my team, where we say what we’ve done and what we’re going to do, and discuss any issues that come up from that – it might be that a colleague has become stuck on something that someone else has already figured out, so sharing in this way is great for all of us. We also communicate a lot via instant messaging if we can’t talk in person, and this extends across the whole of Ensembl. There are usually a couple of help requests from users, and occasionally this will also involve finding and squashing a bug in our code (something we try to prioritise where we can). If the bug turns up in code someone else is responsible for, we might discuss it with them before working out how to fix it. There may be a pipeline I need to kick off or check on that generates data as part of Ensembl’s ongoing release cycle. I might also have a conference call or a meeting with collaborators; typically this might be to discuss a manuscript in progress or a shared project.

How did you end up here?

I started my academic life doing Biochemistry and Genetics. I pretty quickly realised I didn’t have the manual skills or the patience for lab work, but I loved the discovery aspect of the science. I’d always had a hobby messing with computers, and was delighted to discover that I could combine my hobby and academic interests in this thing we call bioinformatics.

After a year studying bioinformatics, I got my first job working in informatics for pet health. Not the most glamorous, but it started me down the path of working in variation data, and a change of city led me to working at the Sanger Institute (who share a campus with us here at EMBL-EBI) doing statistical genetics for genome-wide association studies (GWAS).

I’d dabbled with Ensembl before, and when I saw a job advertised in Ensembl in a new-ish team that fit my experience and interests and was looking to expand, I jumped at the chance and haven’t looked back!

What surprised you most about Ensembl when you started working here?

What surprised me most was the diversity in the Ensembl team. We really are an international team with representatives from nearly every continent, which is great on both a personal and professional level. As well as this, we have a surprising diversity in people’s educational and career backgrounds. The combination of these means we have a huge breadth and depth of knowledge across the Ensembl team, which allows us to deliver what I consider a staggering array of data and functionality to our users.

What is the coolest tool or data type in Ensembl that you think everybody should know about?

We have a cool web view called the transcript haplotype view. This shows you whole transcript and protein sequences as they would appear in each individual from the 1000 Genomes project, by considering all of the genomic variation across a gene together. We also have a related tool that you can use on your own data called Haplosaurus, and I think this is going to be a really important step towards seeing the real biological picture in sequenced genomes.

We’re already gearing up for Ensembl 89, scheduled for May 2017. It’s a slimline release this time, with just a handful of highlights:

Updated assemblies, gene sets and annotations

  • Human: updated cDNA alignments
  • Mouse: updated cDNA alignments and update to the Ensembl-Havana GENCODE gene set

Other updates and highlights

  • Variation and phenotype database updates, including COSMIC version 80.
  • GnomAD frequencies will be available via the website, VEP and APIs.
  • Mapping of array probes to 15 different mouse strains in Ensembl.

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.

The Ensembl regulation resources FTP site saw a facelift in release 87. The directory structures have been modified to make it easier to find files: the file names have become more descriptive and we now also provide our data in a greater variety of file formats. All data files on our FTP site now adhere to a naming convention, which is described in greater detail here. The filenames include the following information separated with a dot (‘.’):

  • species
  • assembly version
  • cell type (if applicable)
  • feature type (if applicable)
  • analysis name
  • results type
  • data freeze date
  • file format.

E.g.: homo_sapiens.GRCh38.K562.Regulatory_Build.regulatory_activity.20161111.gff.gz
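Because the cell type and feature type fields are optional, a script cannot label every dot-separated field by position alone. The following is a minimal Python sketch of one way to read such a name, assuming that the data freeze date is always an eight-digit token; the function name and the 'descriptors' key are hypothetical and not part of any Ensembl tool.

```python
import re


def parse_regulation_filename(filename):
    """Split a regulation FTP file name into its fixed and variable fields."""
    parts = filename.split(".")
    # Assumption: the data freeze date is the first eight-digit token
    # found after the species and assembly fields.
    date_index = next(
        i for i, part in enumerate(parts)
        if i >= 2 and re.fullmatch(r"\d{8}", part)
    )
    return {
        "species": parts[0],
        "assembly": parts[1],
        # Cell type, feature type, analysis name and results type. The first
        # two are optional, so they are kept together rather than guessed.
        "descriptors": parts[2:date_index],
        "freeze_date": parts[date_index],
        "file_format": ".".join(parts[date_index + 1:]),
    }


example = "homo_sapiens.GRCh38.K562.Regulatory_Build.regulatory_activity.20161111.gff.gz"
info = parse_regulation_filename(example)
print(info["species"], info["assembly"], info["descriptors"], info["file_format"])
```

Anchoring on the species and assembly at the start and the date and format at the end keeps the sketch usable for files that omit the optional fields.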

The data available on our FTP site include:

Peaks: The set of peaks for transcription factors, histone modifications and variants that are part of our regulatory resources. In previous releases these used to be collated in one file, called ‘AnnotatedFeatures.gff.gz’, but with our recent expansion to 88 human cell types with ChIP-seq data, the file became too big. Therefore, we split it into separate files by cell and feature type in the ‘Peaks’ subdirectory. The peaks are now available in gff, bed and bigBed format.

Quality scores: The outcome of our quality checks from processing the ChIP-seq data that yielded the peaks. They are provided in JSON format in the ‘QualityChecks’ subdirectory and include:

  • the number of mapped reads
  • the estimated fragment length, the NSC and RSC values using phantompeakqualtools
  • the proportion of reads in peaks
  • the enrichment of the ChIP over the Input using CHANCE.

Regulatory build: The current set of regulatory features along with their predicted activity in every cell type. We provide one gff file per cell type in the ‘regulatory_features’ subdirectory.

Transcription factor motifs: The transcription factor motifs identified using position weight matrices from JASPAR within the enriched regions found by our ChIP-seq analysis pipeline. These are provided in gff format.

Due to essential maintenance, the Ensembl helpdesk email will be shut down for approximately 48 hours, beginning at 9 am (GMT) on 25th November. Any emails sent during this time will be held in a queue, and we will respond to them when the system is up and running again, although there may be some delay. You will not receive any confirmation of your email, as this is automatically generated by the system.

You can also post queries to the Ensembl dev list and BioStars (please add “Ensembl” as a tag).

We will update you when the system is back.

We apologise for any inconvenience this may cause.

What’s new?

Ensembl Plants takes centre stage in the release of Ensembl Genomes 33, with a variety of new data available for a number of different species:

Other News

You can find more details in the release notes.

What’s New in e86:

Mouse strain genomes

In Ensembl 86, you can view the annotated genome assemblies, variation data and comparative analyses of 16 different mouse strains, produced by the Mouse Genomes Project. While the GRCm38 assembly (produced from Mus musculus strain C57BL/6J) remains the reference assembly, variants and comparative analyses for the other strains can be viewed through the Gene tab and the Location tab.

You can find the gene trees and orthologue/paralogue predictions for the mouse strains through the Strains option in the menu in the Gene tab. The mouse strain gene tree depicts the evolutionary history of genes (left) and protein alignment (right) for the individual mouse strains and rat.

You can find the variants between these mouse strains through the Strain table option in the menu in the Location tab. The strain table displays the alleles identified at variant positions across the 16 mouse strains.

Updated assemblies, gene sets and annotations

In Ensembl 86, there will also be a number of updates to the assemblies and gene sets for a number of different species:

  • Human: updated cDNA alignments and RefSeq import
  • Mouse: updated cDNA alignments and RefSeq import
  • Zebrafish: updated gene set and RefSeq import
  • Chicken: updated to the Galgal_5.0 assembly
  • Mouse lemur: updated to the Mmur_2.0 assembly
  • Macaque: updated to the Mmul_8.0.1 assembly

New lincRNA data

New Mobile Site Views

As of release 86, you can view transcripts on the mobile version of Ensembl. You can also view exon sequence, cDNA sequence and protein sequence by clicking on the left-hand arrow.

The gene sequence is now also available to view on mobile devices. Just go to any gene page, click on the left-hand arrow and then choose sequence.

Other News

  • Variation and phenotype databases updated
  • You can now select ‘Manhattan plot’ as an option when configuring bigWig files

A complete list of the changes can be found on the Ensembl website.

Find out more about the new release and ask the team questions, in our free webinar: Tuesday 11th October, 4pm BST. Register here.

Ensembl 86 is scheduled for September 2016. Highlights include:

New mouse strains

  • Annotated genome assemblies, variation data and comparative analyses of 16 different mouse strains will be included in Ensembl 86.

Updated assemblies, gene sets and annotations

  • Human: updated cDNA alignments and RefSeq import
  • Mouse: updated cDNA alignments and RefSeq import
  • Zebrafish: updated gene set and RefSeq import
  • Chicken: updating to the Galgal_5.0 assembly
  • Mouse lemur: updating to the Mmur_2.0 assembly
  • Macaque: updating to the Mmul_8.0.1 assembly

New lincRNA data

New GRCh37 tools converted from 1000 Genomes Project

A number of tools previously developed for use in the 1000 Genomes Project browser have now been converted for use with the GRCh37 assembly in Ensembl:

  • Dataslicer tool: This tool allows you to get a subset of data from a BAM or VCF file.
  • Variation pattern finder tool: This tool allows you to identify variation patterns in a chromosomal region of interest for different individuals.
  • Forge analysis tool: This tool takes a list of variants and analyses their enrichment in functional regions from the ENCODE or Roadmap Epigenome project on a tissue-specific basis.

Other updates and highlights

  • Variation and phenotype database updates

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.

The CRISPR/Cas9 system has revolutionised scientific research over the last few years, offering an efficient method of genome editing. CRISPR/Cas9 utilises the cellular machinery used by bacteria to recognise and edit the DNA of invading viruses. It is formed of two key components: Cas9, an enzyme that can cut a double DNA strand at a precise point; and CRISPR, a short strand of RNA that guides the Cas9 enzyme to recognise and cleave at specific DNA sites.

Cas9 cuts DNA next to specific Protospacer Adjacent Motifs (PAMs), which are species-dependent (for example, 5′ NGG 3′ for Streptococcus pyogenes Cas9). Therefore, by coupling Cas9 with a custom guide RNA (gRNA), its cutting activity can be targeted to specific locations in the genome that contain a PAM region.
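To make the role of the PAM concrete, here is a minimal Python sketch that scans a sequence for 5′ NGG 3′ motifs and reports the bases immediately upstream as a candidate gRNA target. It is only an illustration and not the WGE prediction method: the function name is hypothetical, the 20-base target length is an assumption commonly used for Streptococcus pyogenes Cas9, and only the strand supplied is scanned.

```python
def find_ngg_sites(sequence, guide_length=20):
    """Return (start, target, pam) for every NGG PAM with a full-length target upstream."""
    sequence = sequence.upper()
    sites = []
    # i is the position of the 'N' in NGG; the target occupies the
    # guide_length bases directly before it.
    for i in range(guide_length, len(sequence) - 2):
        if sequence[i + 1:i + 3] == "GG":
            target = sequence[i - guide_length:i]
            pam = sequence[i:i + 3]
            sites.append((i - guide_length, target, pam))
    return sites


demo = "ACGTACGTACGTACGTACGT" + "TGG" + "AA"
for start, target, pam in find_ngg_sites(demo):
    print(start, target, pam)
```

Sites on the opposite strand could be found by running the same function on the reverse complement of the sequence.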

The latest release of Ensembl (Ensembl 85, July 2016) now includes annotated CRISPR/Cas9 sites predicted by the Wellcome Trust Sanger Institute Genome Editing (WGE) group for human and mouse genomes.

The WGE group have predicted CRISPR sites and developed an accompanying database to help you design genome editing experiments, and you can view these WGE-predicted sites by adding the ‘WGE CRISPR sites’ track to any ‘Region in Detail’ view for human or mouse in Ensembl. Click on the ‘Configure this page’ option from the menu on the left-hand side of the page, and then add the track, which can be found in the ‘Other regulatory regions’ category, by clicking the empty box and selecting the track style from the pop-up window.

As an example, the WGE-predicted CRISPR site track can be added (on both the forward and reverse strands) to the genomic region containing the human BRCA2 gene in the ‘structure’ style. Each CRISPR site is shown as a single green box, which appears as a single vertical line when viewing a large genomic region.

Zooming in to a specific region of interest reveals the structure of each CRISPR site, with the filled green box matching up with the PAM motif and the unfilled box representing the potential gRNA binding sequence. Clicking on any of these individual CRISPR sites will open a pop-up window that provides you with more information about the specific genomic coordinates of the CRISPR site as well as a link to the WGE database.

You can find more information about the CRISPR site prediction method in the published description of the WGE database.

Continued HapMap variation data access through Ensembl

NCBI have recently announced plans to retire their HapMap interface immediately. However, data from the HapMap Project will continue to be freely accessible through Ensembl. There is lots of help and documentation, as well as video tutorials, to help you learn how to access variant data in Ensembl. This post aims to complement those materials by highlighting the methods for accessing the HapMap Project variant data specifically.

Finding HapMap variants by ID

You can find data from the HapMap Project relating to specific variants by searching for the variant rsID itself. The variant page in Ensembl includes information from the HapMap Project, including population genetics statistics.

However, some of the populations represented in the HapMap Project have two separate entries in the Population Genetics table. This is because the HapMap Project was completed in a number of phases. In the first phase, a number of different groups used different genotyping platforms to type variants from a number of population panels (CEU, YRI, HCB, JPT). In a later phase, a larger set of samples was added to the samples from the initial phase and submitted as HapMap3. The two entries refer to the two submitted phases of the HapMap Project, and the number in brackets next to the allele frequency indicates the number of samples in that population.

It is also possible to view HapMap Project results by gene of interest by searching the Variant Table. The Variant Table can be filtered by ‘Evidence’ type so you can choose to see only HapMap Project variants, for example.

Filtering the Variant Table by ‘Evidence type = HapMap’ will restrict the displayed variants to those identified in the HapMap Project. These are denoted by the HapMap evidence icon in the Evidence column.

Finding HapMap Project variant data using BioMart

HapMap SNP data can also be retrieved using BioMart. There is help and documentation and a video tutorial to help you while using BioMart.

When querying the Homo sapiens short variants dataset in BioMart, you can access HapMap variant data specifically by using the ‘Variant Set Name’ filter and selecting the HapMap populations that are relevant for your research.

Finding HapMap variants using the Ensembl API

It is also possible to access variation data through the Ensembl APIs. Using the Perl API, for example, it is possible to retrieve variation data specifically related to the HapMap Project variant set, either as the whole HapMap variant set, or as individual populations represented in the HapMap Project.
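As a rough illustration of programmatic access, the Python sketch below uses the Ensembl REST API rather than the Perl API described above. It requests a single variant with its population frequencies and keeps only the entries whose population name mentions HapMap. The helper names are hypothetical, the 'populations' and 'population' field names are assumptions about the response layout, and whether HapMap populations are returned depends on the release being queried.

```python
import json
import urllib.request

REST_SERVER = "https://rest.ensembl.org"


def fetch_variant(rs_id, species="human"):
    """Fetch one variant record, asking for population frequencies (pops=1)."""
    url = f"{REST_SERVER}/variation/{species}/{rs_id}?pops=1"
    request = urllib.request.Request(url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def hapmap_frequencies(record):
    """Keep only population entries whose name mentions HapMap.

    Assumption: the record holds a 'populations' list whose items carry a
    'population' name; entries without that field are skipped.
    """
    return [
        entry for entry in record.get("populations", [])
        if "hapmap" in entry.get("population", "").lower()
    ]
```

With network access, `hapmap_frequencies(fetch_variant("rs334"))` would return the matching entries, or an empty list if none are present.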

A mysteriously common debilitating genetic disorder. A deadly tropical disease. One of my favourite stories in the history of genetics weaves together these two elements – it’s a good one and it always deserves a re-telling – that of malaria and sickle cell anaemia.

This story captures my attention and reminds me of the power of scientific observation, curiosity and experiment. I’m sure you are all aware of the details of this well-worn tale: it is used as an example in classrooms and lecture theatres every year to explain Mendelian genetics, haploinsufficiency, physiology, disease and protein structure and function to young scientists. To mark the coincidence of DNA day and Malaria day, we wanted to revisit this ‘historical’ example of how scientific observation and experimental approaches have led to an understanding of how a disease as debilitating as sickle cell anaemia paradoxically persists in the human population.

Molecular biology and bioinformatics have transformed the face of biological research over the last few decades. The speed at which scientists can sequence and analyse DNA means that global collaborations studying thousands of individuals are beginning to shed light on a range of different diseases.

Sickle-cell anaemia is a disease in which red blood cells form an abnormal crescent (or sickle) shape. It is an inherited disorder, and was the first ever to be attributed to a specific genetic variant (rs334, see it here in Ensembl).

In 1949, ‘Sickle Cell Anaemia, a Molecular Disease’ by Pauling et al. identified a difference in electrophoretic mobility between haemoglobin from healthy individuals and haemoglobin from those with sickle-cell anaemia, caused by a change in the molecular structure of haemoglobin that is responsible for the sickling process [1]. The genetic variant (A; reference: T) that causes cell sickling results in the substitution of a conserved glutamic acid residue at position 7 in the beta chain of haemoglobin with a valine [2].

You can find this information in the Genes and regulation section for this variant. In the consequences table, filtered to show only missense variants, the ‘Allele (transcript allele)’ column describes the variant allele (A) and the transcript allele (T, as the HBB gene is located on the reverse strand). You can also see the nature and location of the variant on the transcript in the ‘Position’, ‘Amino acid’ and ‘Codons’ columns. The SIFT and PolyPhen algorithms predict the effect of the amino acid change on protein structure and function. Interestingly, only the SIFT algorithm predicts that the T/A variant would have a deleterious effect on haemoglobin structure and function, a reminder that predictions can never be as accurate as experimental evidence.
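The relationship between the genomic allele, the transcript allele and the amino acid change can be traced in a few lines of Python. This is only a sketch: the helper names are hypothetical, and the GAG and GTG codons are the ones widely reported for this variant rather than values taken from the text above.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
# Only the two codons needed for this example, not a full genetic code.
CODON_TO_AMINO_ACID = {"GAG": "Glu", "GTG": "Val"}


def transcript_allele(genomic_allele, strand):
    """Complement the allele when the gene lies on the reverse strand (-1)."""
    return genomic_allele if strand == 1 else COMPLEMENT[genomic_allele]


def substitute(codon, position, allele):
    """Return the codon with the base at `position` (0-based) replaced."""
    return codon[:position] + allele + codon[position + 1:]


# HBB is on the reverse strand, so genomic T>A reads as A>T in the transcript.
reference = transcript_allele("T", -1)
variant = transcript_allele("A", -1)

reference_codon = "GAG"  # assumed reference codon; its middle base is the variant site
variant_codon = substitute(reference_codon, 1, variant)
print(reference_codon, CODON_TO_AMINO_ACID[reference_codon])
print(variant_codon, CODON_TO_AMINO_ACID[variant_codon])
```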

Only those individuals who are homozygous for the variant allele develop sickle cell anaemia, although heterozygous individuals do have the much more manageable sickle cell trait. If untreated, individuals with sickle cell anaemia have a shorter than normal life expectancy, experiencing lethargy and breathlessness throughout their lives, with increased risk of stroke and pulmonary hypertension, as well as increased vulnerability to infection. Individuals with the milder sickle cell trait can experience problems in low-oxygen conditions or as a result of severe physical exercise, but can mostly be expected to live normal lives.

As such, it would be expected that this variant would be rare in human populations. However, observations made in the mid-20th century revealed that this variant is, in fact, surprisingly common in African, African American and Caribbean populations (you can see this in the 1000 Genomes allele frequencies available under Population genetics in Ensembl). Coincidentally, these were people descended from those who came from areas where malaria is prevalent [3]. Why was this happening?

Individuals carrying just one copy of the variant allele were known not to develop sickle cell anaemia, leading rather normal lives. However, it was found that these same individuals were in fact highly protected against malaria. It turned out that, quite bizarrely, having two different alleles at this locus protected against infection by the malaria parasite while causing only entirely manageable sickle manifestations! Therefore, individuals with one copy of each allele have a greater chance of survival in geographical areas where malaria is endemic, preserving both alleles in the population.
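The way this heterozygote advantage preserves both alleles can be illustrated with a standard one-locus selection model. The Python sketch below is purely illustrative: the selection coefficients are hypothetical values chosen for the example, not estimates from any population. Here s is the fitness cost to individuals with two reference alleles (from malaria) and t is the cost to individuals with two variant alleles (from sickle cell anaemia), with heterozygotes taken as the fittest.

```python
def next_variant_frequency(q, s, t):
    """One generation of selection on the variant allele frequency q."""
    p = 1.0 - q
    # Relative fitnesses: reference homozygote 1 - s, heterozygote 1,
    # variant homozygote 1 - t.
    mean_fitness = p * p * (1 - s) + 2 * p * q + q * q * (1 - t)
    return (p * q + q * q * (1 - t)) / mean_fitness


def simulate(q_start, s, t, generations):
    """Apply selection repeatedly and return the final variant frequency."""
    q = q_start
    for _ in range(generations):
        q = next_variant_frequency(q, s, t)
    return q


s, t = 0.1, 0.8  # hypothetical selection coefficients
final_q = simulate(0.01, s, t, generations=1000)
# Under this model the frequency settles at s / (s + t).
print(round(final_q, 4), round(s / (s + t), 4))
```

In this model the variant allele neither disappears nor takes over: its frequency settles at s / (s + t), a balance between the two costs.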

Understanding this relationship has led to a deeper understanding of the infective lifecycle of the malaria parasite and novel approaches in combating malaria [4-5], but also an appreciation of the genetic factors leading to sickle-cell anaemia.

This story exemplifies how observation, epidemiology and scientific investigation can uncover the mysteries of a human disease and provide important insights for its treatment. Nowadays, this gold standard of studying single genetic disorders has been multiplied and sped up on an unprecedented scale. There are now numerous projects aimed at sequencing the DNA of many individuals with different diseases and using the power of bioinformatics to analyse how genetic variation might lie at the foundation of previously poorly understood diseases.

[1] Pauling L. et al. Sickle cell anemia, a molecular disease. Science, 1949, 110(2865): 543–548

[2] Ingram V.M. et al. Abnormal human haemoglobins. III. The chemical difference between normal and sickle cell haemoglobins. Biochim Biophys Acta, 1959, 36: 543–548

[3] Allison A.C. et al. Protection afforded by sickle-cell trait against subtertian malarial infection. Br Med J, 1954, 1(4857): 290–294

[4] Mounkaila A. et al. Sickle cell trait protects against Plasmodium falciparum infection. American Journal of Epidemiology, 2012, 176: 175–185

[5] LaMonte G. et al. Translocation of sickle cell erythrocyte microRNAs into Plasmodium falciparum inhibits parasite translation and contributes to malaria resistance. Cell Host & Microbe, 2012, 12: 187–199