Project ideas

Ensembl is able to host funded contributors on summer projects or internships at various times throughout the year.

Projects: VGNC

Building a customizable synteny browser for multiple species

Brief Explanation

Synteny describes the conservation of blocks of genes in the same order between different species. This is a key aspect of comparative genomics, as it provides evidence that syntenic regions are derived from the same ancestral genomic region. Looking at synteny across multiple genomes at the same time helps us to identify the equivalent genes in each species, i.e. orthologs, which can then be assigned the same gene name in each species. We would like to develop a tool that enables the user to select a set of genomes and simultaneously view the orthologs of a particular gene of interest, as well as their flanking genes, in multiple genomes. This would help us to name orthologs in several species at the same time, as part of the VGNC (Vertebrate Gene Nomenclature Committee) project.
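
The gene coordinates the browser needs are already exposed by the public Ensembl REST API (https://rest.ensembl.org). Below is a minimal sketch of the data-fetching step, in Python for brevity (the project itself targets JavaScript); the gene ID and flank size are placeholders, and the homology endpoint that would supply the orthologues is documented on the same server:

```python
import requests

SERVER = "https://rest.ensembl.org"

def get_json(endpoint, **params):
    """Fetch a JSON payload from the public Ensembl REST API."""
    r = requests.get(SERVER + endpoint,
                     params=params,
                     headers={"Accept": "application/json"})
    r.raise_for_status()
    return r.json()

# Locate the gene of interest, then pull the genes flanking it so a
# synteny track can be drawn around it.
gene = get_json("/lookup/id/ENSG00000139618")  # human BRCA2, as an example
window = 500_000  # flank size in bp; tune for the desired view
region = f"{gene['seq_region_name']}:{max(1, gene['start'] - window)}-{gene['end'] + window}"
flanking = get_json(f"/overlap/region/human/{region}", feature="gene")

for g in flanking:
    print(g["id"], g.get("external_name", "?"), g["start"], g["end"], g["strand"])
```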

Expected results

  • JavaScript code to generate a synteny map using gene location data fetched from existing REST services
  • A web application allowing users to select the data to include in the synteny map

Required knowledge

  • JavaScript
  • RESTful APIs
  • Web development

Difficulty

Low-medium

Length

175 hours

Mentors

Kristian Gray, Tamsin Jones

Projects: Compara

Ensembl maintains a broad set of cross-species comparisons, ranging from DNA-level whole-genome alignments to protein-level orthologies. These resources are widely used, and Ensembl strives to use state-of-the-art methods. The projects below all aim to increase the quality of the data we provide and to take advantage of the latest advances in the field.

Compact format storage for gene homology relationships

Brief Explanation

Comparisons of gene homology between species matter to biologists because they give information about gene function. At the moment, in Ensembl, homology relationships between genes are stored in a database table in which one entry represents a pairwise homology relation between genes from two species. The space required to store these relations increases quadratically with the number of species: with about 300 species, we already have to store more than a billion rows. This approach scales poorly to thousands of species and puts a lot of strain on our database.

A more efficient approach would be to take advantage of the hierarchical structure of a gene tree, which removes the need to store every pairwise homology relation explicitly. As a proof of concept, we propose developing a new format that stores gene homology relations using the hierarchical structure of gene trees, together with a tool that reads the format to infer the homology relationship between any two genes.
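
To illustrate the idea: in a reconciled gene tree whose internal nodes are labelled as speciation or duplication events, the homology type of any two genes can be read off their last common ancestor, so only the tree needs to be stored. A minimal sketch, assuming a toy in-memory tree (designing the on-disk format is the actual project):

```python
class Node:
    """Gene-tree node: leaves carry gene names, internal nodes carry an
    'event' label ('speciation' or 'duplication')."""
    def __init__(self, name=None, event=None, children=()):
        self.name, self.event, self.parent = name, event, None
        self.children = list(children)
        for child in self.children:
            child.parent = self

def ancestors(node):
    """The node itself plus every ancestor up to the root."""
    while node is not None:
        yield node
        node = node.parent

def homology(a, b):
    """Read the homology type of two genes off their last common ancestor:
    a speciation LCA means orthologs, a duplication LCA means paralogs."""
    seen = set(map(id, ancestors(a)))
    for node in ancestors(b):
        if id(node) in seen:
            return "ortholog" if node.event == "speciation" else "paralog"
    raise ValueError("genes are not in the same tree")

# Toy tree: an ancestral duplication, then a speciation on one branch.
human_a, mouse_a, human_b = Node("human_A"), Node("mouse_A"), Node("human_B")
spec = Node(event="speciation", children=(human_a, mouse_a))
root = Node(event="duplication", children=(spec, human_b))
print(homology(human_a, mouse_a))  # ortholog
print(homology(human_a, human_b))  # paralog
```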

Expected results

  • A format representing the homology relationships of a gene tree for thousands of species

  • Tool parsing the format to provide the homology relation between two genes.

Required knowledge

  • Python

  • Data representation 

Difficulty

Hard

Length

350 hours

Mentor

David Thybert

Deep learning for homology inference

Brief Explanation

In a previous Google Summer of Code project, a deep learning neural network was built to infer orthology relations. While this network predicts the homology relation between a pair of genes from closely related species with good accuracy, performance decreases dramatically when the genes come from distantly related species.

Here, we propose to build on that project and develop a deep learning neural network with improved performance when predicting homology relations between distantly related species.
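
For illustration, a minimal Keras sketch of a pairwise classifier; the feature set and architecture of the existing network are not described here, so the input size, layers and toy data below are placeholders rather than the previous project's design:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder: each gene pair is described by a fixed-length feature vector
# (e.g. alignment statistics, tree distances); the real feature set comes
# from the previous project and is an assumption here.
N_FEATURES = 64

model = keras.Sequential([
    layers.Input(shape=(N_FEATURES,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # P(pair is orthologous)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy data standing in for labelled gene pairs; for the distant-species
# problem, one could stratify training and evaluation by phylogenetic distance.
X = np.random.rand(1000, N_FEATURES).astype("float32")
y = np.random.randint(0, 2, size=1000)
model.fit(X, y, epochs=3, validation_split=0.2, batch_size=32)
```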

Expected results

  • A deep learning network for the prediction of homology relations between distantly related species.

Required knowledge

  • Python, Keras

  • Deep learning 

Difficulty

Medium-hard

Length

350 hours

Mentor

David Thybert

Projects: Biodata NLP, OCR and software development for WormBase.org and Alliancegenome.org

Extract important information from scientific papers (several candidates possible)

Brief Explanation

Our database retrieves all full-text scientific papers about model organisms and, in a largely automated way, extracts information from those papers to add to the database. Your job would be to work on types of data we do not yet capture: to create scripts that extract significant sentences and words from the full-text papers, reformat them for ingestion into the database, and prepare them for further validation by scientific curators. The pipeline for retrieving full-text scientific papers and extracting sentences already exists, so this is a perfect opportunity to think creatively about text mining and choose your own subset of biologically important data to extract, normalise and format.
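
To give a flavour of the task, here is a deliberately simple rule-based baseline that flags sentences mentioning C. elegans-style gene symbols (such as eat-4) near curation cue words; the regex and cue list are illustrative stand-ins, not the production entity recogniser:

```python
import re

# C. elegans gene symbols follow a "three/four letters, dash, number"
# pattern (e.g. eat-4, unc-119); this regex is an illustrative baseline.
GENE_RE = re.compile(r"\b[a-z]{3,4}-\d+\b")
CUE_WORDS = {"expression", "mutant", "phenotype", "allele", "rescue"}

def flag_sentences(sentences):
    """Yield (sentence, genes) pairs worth routing to curators."""
    for s in sentences:
        genes = GENE_RE.findall(s)
        if genes and CUE_WORDS.intersection(s.lower().split()):
            yield s, sorted(set(genes))

sents = [
    "Expression of eat-4 was reduced in unc-119 mutant animals.",
    "The weather was discussed at the conference dinner.",
]
for sentence, genes in flag_sentences(sents):
    print(genes, "->", sentence)
```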

Expected results

  • Script for flagging up curatable information from full-text scientific papers

Skills required/preferred

  • Python

  • Interest in NLP (natural language processing)

  • Interest in biology

Difficulty

Adaptable (easy to hard)

Expected size of project (175 or 350 hours)

Adaptable (175-350 hours)

Mentor

Magdalena Zarowiecki

Extract text from tables in scientific papers (several candidates possible)

Brief Explanation

Our database retrieves all full-text scientific papers about model organisms and, in a largely automated way, extracts information from those papers to add to the database using NER, NN classification and NLP. We would like the contributor to try out a few different ways to programmatically identify tables in PDFs and correctly extract the table content as text. This is super-important, as a lot of the data we want to collect from scientific publications is located in tables, which we currently cannot extract very well. The content of the tables can be symbols or numbers, so the method must give an accurate representation even of table content that is not made of known words, for example eat-4 (a gene name).

Your application has the highest chance of success if you have done something similar before, or can mention which software packages/libraries you'd like to benchmark, together with a detailed breakdown of the different issues you would have to solve to get from the input (PDFs of scientific papers) to the end result: a tab-delimited table containing all the data, extracted correctly and matched up with categories in our data model. You'll be supported by our data scientists, and by curators/testers to evaluate the results.
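
As one possible starting point for such a benchmark, a sketch using pdfplumber (camelot and tabula-py are other candidates); the file names are placeholders, and real papers will need much more robust handling:

```python
import pdfplumber

def pdf_tables_to_tsv(pdf_path, out_path):
    """Extract every detected table from a PDF and write tab-delimited rows."""
    with pdfplumber.open(pdf_path) as pdf, open(out_path, "w") as out:
        for page_no, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                for row in table:
                    # Cells can be None when detection fails for a cell.
                    cells = [(c or "").replace("\n", " ").strip() for c in row]
                    out.write(f"{page_no}\t" + "\t".join(cells) + "\n")

pdf_tables_to_tsv("paper.pdf", "tables.tsv")
```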

Expected results

  • Script for extracting tables, and the text within them from full-text scientific papers

Required knowledge

  • Interest in NLP (natural language processing) and OCR

  • Interest in biology
  • Python skills may help you to integrate your work with the current paper processing pipeline

Difficulty

Adaptable (easy to hard)

Expected size of project (175 or 350 hours)

Adaptable (175-350 hours)

Mentor

Magdalena Zarowiecki

Projects: Software Development for MGnify

MGnify GraphQL

Brief Explanation

MGnify (https://www.ebi.ac.uk/metagenomics) is a freely available hub for the analysis and exploration of metagenomic, metatranscriptomic, amplicon and assembly data. The resource provides rich functional and taxonomic analyses of user-submitted sequences, as well as analysis of publicly available metagenomic datasets held within the European Nucleotide Archive (ENA). 

The public-facing service is a website and a json:api compliant REST API, which serves metagenomic data and associated analyses. There are also micro-services for specific tasks like sequence searches. Many pages of the MGnify website require multiple resources, and therefore multiple API requests, to display. In addition, we are increasingly encouraging our users to programmatically access our data using data analysis scripts talking directly to the API, or via client packages.

The json:api structure is not flexible, and adding relationships to it creates performance issues owing to the large query sets and potentially complex cross-table database joins. Adding a GraphQL API to MGnify would resolve our performance scaling challenges, introduce caching strategies, and better integrate micro-services, both for the website (built in React, so ready to use technologies such as Apollo) and for direct user-API access.
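
For illustration, a minimal sketch of how a GraphQL endpoint could be added to a Django backend using graphene-django; the Study model and its fields are hypothetical stand-ins for MGnify's real models:

```python
import graphene
from graphene_django import DjangoObjectType

from myapp.models import Study  # hypothetical Django model standing in for MGnify data

class StudyType(DjangoObjectType):
    class Meta:
        model = Study
        fields = ("accession", "title", "samples")  # expose only what clients need

class Query(graphene.ObjectType):
    studies = graphene.List(StudyType, search=graphene.String())

    def resolve_studies(root, info, search=None):
        qs = Study.objects.all()
        if search:
            qs = qs.filter(title__icontains=search)
        return qs

schema = graphene.Schema(query=Query)
```

A client could then fetch exactly the fields a page needs in a single request, e.g. the query { studies(search: "gut") { accession title } }, instead of issuing several json:api calls.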

Expected results

  • Implement a GraphQL API for MGnify
  • Investigate caching strategies for the GraphQL API
  • Documentation and code examples for the MGnify GraphQL API

Required knowledge

  • Python
  • Django
  • GraphQL

Difficulty

Adaptable

Expected size of project (175 or 350 hours)

175 hours

Mentors

Martin Beracochea, Alexander Rogers

Workflow Orchestration Monitor

Brief Explanation

MGnify (https://www.ebi.ac.uk/metagenomics) is a freely available hub for the analysis and exploration of metagenomic, metatranscriptomic, amplicon and assembly data. The resource provides rich functional and taxonomic analyses of user-submitted sequences, as well as analysis of publicly available metagenomic datasets held within the European Nucleotide Archive (ENA). This service is hosted by the European Bioinformatics Institute (EMBL-EBI), and runs on the EMBL-EBI High Performance Computing (HPC) cluster. Part of our efforts to scale-up the service in line with increasing data volumes is to enable the MGnify service to run in cloud computing systems. 

We have built a job orchestration system called Orchestra for this purpose. The system allows us to execute our workloads in hybrid environments such as the EMBL-EBI HPC cluster, the EMBL-EBI cloud environment and the Google Cloud Platform.

One feature we aim to develop is a tool that collects diagnostic information about the jobs Orchestra dispatches. This includes resource usage (such as CPU usage, runtime, memory consumption and disk usage) and log files from those jobs. All the collected information will be sent to a centralised search index for efficient querying and visualisation.
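
A minimal sketch of what such an agent might look like, using psutil for metrics and the elasticsearch-py client for shipping; the index name, document shape and cluster address are assumptions, and the real agent would also tail log files:

```python
import socket
import time

import psutil
from elasticsearch import Elasticsearch  # elasticsearch-py 8.x client

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

def sample_job(pid, job_id, interval=10):
    """Periodically ship resource metrics for one dispatched job."""
    proc = psutil.Process(pid)
    while proc.is_running():
        doc = {
            "job_id": job_id,
            "host": socket.gethostname(),
            "timestamp": time.time(),
            "cpu_percent": proc.cpu_percent(interval=1.0),
            "rss_bytes": proc.memory_info().rss,
        }
        es.index(index="orchestra-job-metrics", document=doc)
        time.sleep(interval)
```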

Expected results

  • A monitoring agent to harvest logs and metrics from jobs.

Required knowledge

  • Python
  • Unix HPC environment
  • Knowledge about the ELK stack (Elasticsearch, Logstash, Kibana)

Difficulty

Medium

Expected size of project (175 or 350 hours)

175 hours

Mentors

Martin Beracochea

Projects: Data exploration

Accessing Ensembl data with Presto and AWS Athena

Brief Explanation

Ensembl has always provided a way to download custom reports of genes, transcripts, proteins and other data types, most notably through the BioMart tool. BioMart allows a user to filter data by one or more fields and to restrict the output to one or more columns. Behind the scenes, BioMart holds data as a reverse star schema in a MySQL database and converts user requests into SQL queries. Whilst this functionality is very useful to users, the technology is failing to scale to large data sets and to the number of genomes now hosted in Ensembl. BioMart is now reserved for our most popular species only, meaning it is no longer a consistent offering. This project aims to develop a successor to the BioMart tool using modern database and cloud technologies.

We have identified Presto and AWS Athena as database technologies capable of executing complex queries across a variety of formats, such as text files, JSON, Parquet and ORC; Parquet and ORC appear particularly amenable to the types of queries we wish to perform. The project will first work with the mentors to define a data format and schema suitable for use in Presto and Athena, create files in the required formats from our current MySQL databases, and load them into a Presto and an Athena instance. This will cover the core data model of Ensembl (genes, transcripts, key cross references, key metadata attributes). We would then look to expand this to four genomes of interest and investigate how to adapt the schema to the changing availability of data points across these species (e.g. the sparse availability of transcript quality flags across all species). Finally, we would seek to build a basic REST API and web interface over the database instance of choice (this could be limited to one technology), where a user can filter by a subset of attributes and restrict the returned columns to those of interest.
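
As a sketch of the export step, one way to create Parquet files from an existing MySQL database using pandas and SQLAlchemy; the connection string, database name and query are placeholders, and the real schema would be agreed with the mentors first:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; point it at the Ensembl core database
# being exported.
engine = create_engine("mysql+pymysql://user:pass@host:3306/homo_sapiens_core_107_38")

# Pull a slice of the core gene model; the real columns and joins are
# part of the schema-design work.
query = """
    SELECT g.stable_id, g.biotype, sr.name AS seq_region,
           g.seq_region_start, g.seq_region_end, g.seq_region_strand
    FROM gene g
    JOIN seq_region sr ON g.seq_region_id = sr.seq_region_id
"""
df = pd.read_sql(query, engine)

# Parquet files like this can then be registered as external tables in
# Presto or Athena (requires pyarrow or fastparquet to be installed).
df.to_parquet("genes.parquet", index=False)
```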

Expected results

  • Defining a schema and data format appropriate to be used by Presto and Athena
  • A script for creating parquet representations of identified data from Ensembl databases
  • A Presto instance running with Ensembl data for human, mouse, E. coli and SARS-CoV-2
  • An AWS Athena instance running the same data
  • A basic querying interface capable of selecting data of interest and searching for genes by ID and symbol

Required Knowledge

  • Python
  • SQL and database querying
  • Web API design
  • HTML & JavaScript

Difficulty

Medium

Expected size of project (175 or 350 hours)

350 hours

Mentors

Andy Yates 

Projects: new FAANG backend with Elasticsearch and GraphQL

Brief Explanation

The Functional Annotation of ANimal Genomes (FAANG) project is working to understand the genotype-to-phenotype link in domesticated animals, to help researchers drive sustainable farmed animal production. The FAANG data portal (https://data.faang.org) helps researchers from all over the world identify relevant data for their research. Currently we use a set of Elasticsearch indices to store and retrieve the data through a Python/Django backend. Researchers often need different sets of indices (and columns) to be joined and retrieved as a single table, which is challenging with the current technical setup. GraphQL might be ideal for solving this problem, and this project will add it to our existing technical stack to help the scientific community easily retrieve any combination of information they need.
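
A minimal sketch of the idea, using graphene with the elasticsearch-py 8.x client: a resolver joins a second index on demand, so clients can request any combination of fields in one query. The index names and fields below are hypothetical stand-ins for the portal's real indices:

```python
import graphene
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

class Specimen(graphene.ObjectType):
    biosample_id = graphene.String()
    organism = graphene.String()
    # Served from a second index, joined lazily in the resolver below.
    file_names = graphene.List(graphene.String)

    def resolve_file_names(parent, info):
        hits = es.search(index="files",
                         query={"term": {"specimen.keyword": parent.biosample_id}})
        return [h["_source"]["name"] for h in hits["hits"]["hits"]]

class Query(graphene.ObjectType):
    specimens = graphene.List(Specimen, organism=graphene.String())

    def resolve_specimens(root, info, organism=None):
        q = {"match": {"organism": organism}} if organism else {"match_all": {}}
        hits = es.search(index="specimens", query=q)["hits"]["hits"]
        return [Specimen(biosample_id=h["_source"]["biosampleId"],
                         organism=h["_source"]["organism"]) for h in hits]

schema = graphene.Schema(query=Query)
```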

Expected results

An updated FAANG backend with GraphQL support that allows scientists to retrieve data in any combination of indices/columns they need

Required knowledge

  • Python/Django
  • Elasticsearch
  • GraphQL

Difficulty

Adaptable

Length

350 hours

Mentors

Alexey Sokolov 

Projects: HAVANA

Defining gene boundaries

Brief Explanation

Understanding the impact of genetic variation on disease requires comprehensive gene annotation. Human genes are well characterised following more than two decades of work on their annotation; however, we know that this annotation is not complete, and new experimental methods are generating data to help us towards the goal of complete gene annotation. Long transcriptomic reads allow us to identify and annotate many new features, including the starts and ends of transcripts, which can be combined to define gene boundaries. We would like to develop a pipeline to extract long transcriptomic data from the European Nucleotide Archive (ENA), map it to the human reference genome and extract the terminal coordinates to create a growing collection of transcript start/end positions. These data will support improving the accuracy of annotation of individual transcripts and genes, and give insight into any differences in transcript start and end sites across different tissues.
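
Assuming reads have already been mapped (e.g. with minimap2) into a BAM file, a minimal pysam sketch of the termini-collection step; thresholds for calling a recurrent terminus and handling of soft-clipped ends are left out:

```python
import collections

import pysam

def collect_termini(bam_path):
    """Count aligned read start/end positions as candidate transcript termini."""
    starts, ends = collections.Counter(), collections.Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            chrom = read.reference_name
            # For a minus-strand read the biological 5' end is reference_end.
            if read.is_reverse:
                starts[(chrom, read.reference_end)] += 1
                ends[(chrom, read.reference_start)] += 1
            else:
                starts[(chrom, read.reference_start)] += 1
                ends[(chrom, read.reference_end)] += 1
    return starts, ends

starts, ends = collect_termini("long_reads.bam")  # placeholder file name
print(starts.most_common(5))  # most frequently observed 5' termini
```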

Expected results

  • Code to extract read data from ENA, map it to the genome and calculate termini
  • Database of termini and read metadata
  • Code to extract data from database and format for browser viewing

Required knowledge

  • Transcript mapping (long-read RNA-seq alignment)
  • Workflow manager

Difficulty

Adaptable

Length

350 hours

Mentors

Jonathan Mudge, Jose Manuel Gonzalez Martinez, Adam Frankish

Using machine learning to annotate difficult genes

Brief Explanation

Understanding the impact of genetic variation on disease requires comprehensive gene annotation. Human genes are well characterised following more than two decades of work on their annotation; however, we know that this annotation is not complete, and new experimental methods are generating data to help us towards the goal of complete gene annotation. We have developed an automated workflow that uses long transcriptomic data to add novel alternatively spliced transcripts to our gene annotation. Our method uses very strict thresholds to ensure that no poor-quality models are added to the gene annotation, although as a consequence we reject significant numbers of viable novel transcripts. We want to use machine learning to recover good-quality but rejected transcripts and to improve the setting of initial filters for new datasets.
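
As a flavour of the deliverable, a scikit-learn sketch that trains a classifier on per-transcript features and reports which features drive the decision; the feature names and toy data are placeholders for the real workflow outputs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical per-transcript features; the real set would be derived
# from the annotation workflow (splice-site support, read counts, etc.).
FEATURES = ["read_support", "canonical_splice_frac", "polyA_evidence", "exon_count"]

rng = np.random.default_rng(0)
X = rng.random((5000, len(FEATURES)))      # stand-in for the real feature matrix
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # stand-in labels: accepted vs rejected

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("held-out accuracy:", clf.score(X_te, y_te))
for name, imp in sorted(zip(FEATURES, clf.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")  # the features most relevant to the decision
```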

Expected results

  • Install and learn to use a machine learning package
  • Run it on known gene annotation
  • Deliverable: a simple model that helps to recover valid transcripts, and the set of features most relevant to the decision

Required knowledge

  • Machine learning

Difficulty

Adaptable

Length

350 hours

Mentors

Jonathan Mudge, Jose Manuel Gonzalez Martinez, Adam Frankish

Projects: Genebuild

Using Machine Learning to Identify and Classify Repeat Features

Brief Explanation

A number of tools exist for identifying repeat features, but it remains a problem that the DNA sequence of some genes can be misidentified as repeat sequence. If such sequences are used to mask the genome, genes may be missed in downstream annotation. Assuming that gene sequences carry signatures related to their function, and that repeats carry different signatures, including the repetitive nature of the signal itself, we want to train a classifier to separate repeat sequences from gene sequences.
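
A minimal sketch of both steps, assuming one-hot encoding and a small 1-D convolutional network in PyTorch; the architecture and sizes are illustrative only:

```python
import torch
import torch.nn as nn

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA string as a (4, len) float tensor; Ns become all-zero columns."""
    t = torch.zeros(4, len(seq))
    for i, b in enumerate(seq.upper()):
        if b in BASES:
            t[BASES[b], i] = 1.0
    return t

class RepeatClassifier(nn.Module):
    """Tiny 1-D CNN separating repeat from genic sequence; sizes are illustrative."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(4, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1), nn.Flatten(),
            nn.Linear(16, 1),
        )

    def forward(self, x):          # x: (batch, 4, seq_len)
        return self.net(x)         # logits; apply sigmoid for P(repeat)

model = RepeatClassifier()
batch = torch.stack([one_hot("ACGT" * 50), one_hot("AAAT" * 50)])
print(torch.sigmoid(model(batch)).shape)  # torch.Size([2, 1])
```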

Expected results

  • Create an encoding mechanism for the sequences
  • Train a classifier
  • Create a library of genomic sequences identified as repeats

Required knowledge

  • Python/ PyTorch (preferable)
  • Machine Learning

Difficulty

Medium

Length

350 hours

Mentors

William Stark, Leanne Haggerty

Using Machine Learning to Distinguish Readthroughs from Artificial Joins

Brief Explanation

“Readthrough transcripts”, or “conjoined genes”, are RNA molecules formed by splicing together exons from more than one distinct gene; these are real and more common than we might assume. However, we sometimes find in a gene set examples of a CDS bridged by transcripts that have distinct hits to proteins in UniProt; this is likely the result of genes being incorrectly joined during the annotation process.

We want to train a classifier to identify examples of artificial joins in gene sets.
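
For illustration, a sketch of how candidate artificial joins could be labelled for training: flag gene models in which two non-overlapping transcripts have distinct best protein hits. The transcript tuples and UniProt accessions are placeholders:

```python
def overlaps(a, b):
    """True if two (start, end) intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def candidate_artificial_joins(transcripts):
    """transcripts: list of (transcript_id, (start, end), uniprot_hit).

    Flag pairs of non-overlapping transcripts in the same gene whose best
    protein hits differ: the signature of two genes joined in error.
    """
    flagged = []
    for i in range(len(transcripts)):
        for j in range(i + 1, len(transcripts)):
            t1, span1, hit1 = transcripts[i]
            t2, span2, hit2 = transcripts[j]
            if not overlaps(span1, span2) and hit1 != hit2:
                flagged.append((t1, t2, hit1, hit2))
    return flagged

gene = [("tx1", (100, 5000), "P12345"),
        ("tx2", (7000, 12000), "Q67890"),
        ("tx3", (100, 12000), "P12345")]  # bridging transcript
print(candidate_artificial_joins(gene))   # [('tx1', 'tx2', 'P12345', 'Q67890')]
```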

Expected results

  • Create a set of artificial joins from existing gene sets, based on finding cases of unique protein hits for non-overlapping transcripts in the same gene
  • Train a classifier
  • Call artificial joins vs. readthroughs in a gene set

Required knowledge

  • Python/ PyTorch (preferable)
  • Machine Learning

Difficulty

Medium

Length

350 hours

Mentors

Francesca Tricomi, Jose Perez-Silva

Internship only – not for GSoC

Improve the pairwise orthology predictions using synteny

Brief Explanation

The Ensembl method for inferring orthologues starts by building phylogenetic trees and reconciling them with the species tree in order to call the speciation and duplication events that took place. By definition, orthologues are genes from different species linked via a speciation node. However, gene-tree reconstructions are not error-free, and we need to bend the classical definition of orthology in order to overcome these mis-reconstructions and increase the number of orthologues we predict.

One approach is to use the conservation of local synteny [1]. Since Ensembl version 86, we compute a gene-order conservation (GOC) score for all the orthologues we predict, using a couple of genes on either side. We currently use this score to assess the quality of existing orthology calls, and are now considering using it to make different calls altogether. Given our current orthology predictions, we want to find other homologues that have a higher GOC score than the predicted orthologues, compare other metrics such as sequence similarity, and switch the orthology call when needed. The work will involve implementing:

  1. a method to scan the gene trees, extract the homologues and compute synteny support for all of them, as well as some additional metrics
  2. a decision method to switch some of our orthologues to another homologue, based on the metrics gathered.

[1] Computational methods for Gene Orthology inference. David M. Kristensen et al. Briefings in Bioinformatics, Volume 12, Issue 5, 1 September 2011, Pages 379–391, https://doi.org/10.1093/bib/bbr030
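
To make the GOC idea concrete: score an orthologue pair by how many of its flanking genes are themselves orthologous. A Python sketch of that computation (the production Compara code is Perl), with toy gene IDs:

```python
def goc_score(neighbours_a, neighbours_b, orthologs):
    """Gene-order conservation for one orthologue pair.

    neighbours_a / neighbours_b: the genes flanking the pair in each genome
    (e.g. a couple on each side); orthologs: set of frozensets pairing
    orthologous gene IDs. Returns the percentage of gene A's neighbours
    whose orthologue appears among gene B's neighbours.
    """
    matched = sum(
        1 for ga in neighbours_a
        if any(frozenset((ga, gb)) in orthologs for gb in neighbours_b)
    )
    return 100.0 * matched / len(neighbours_a)

orthologs = {frozenset(p) for p in [("a1", "b1"), ("a2", "b2"),
                                    ("a3", "b9"), ("a4", "b4")]}
print(goc_score(["a1", "a2", "a3", "a4"], ["b1", "b2", "b3", "b4"], orthologs))
# 75.0: a3's orthologue (b9) is not among gene B's neighbours
```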

Expected results

  • Script or workflow to extract the gene-order conservation scores and the other metrics
  • Analysis of the data and definition of a method to switch orthologues using synteny
  • Assessment of the performance of the new method

Required knowledge

  • Perl
  • Unix HPC environment
  • Data analysis, classification

Difficulty

Medium

Mentors

David Thybert