Student Projects

Ensembl is able to host self-funded students on summer projects or internships at various times throughout the year.

Projects: RESTful API Development

Compliance Test Suite for Ensembl Web APIs

Brief Explanation

A clear specification is key to creating good, future-proof APIs. However, there is often a gap between writing a specification and ensuring that an API implementation conforms to it, including the behaviour the specification expects in response to erroneous or nonsensical requests. Conformance is especially important for maintaining the contract of service when new API versions are deployed (and where we allow implementations to diverge from a specification). We envisage a library of compliance tests that can be run periodically to create a report detailing where APIs do not comply. The project would also work with the Global Alliance for Genomics and Health (GA4GH) to reuse the infrastructure they have developed for writing these compliance suites.
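
As a rough illustration of what a single compliance check might look like, the sketch below compares an observed response with what a specification says should happen. The ExpectedResponse class, the field names and the example rules are all hypothetical, and no real Ensembl or GA4GH endpoint is called.

```python
from dataclasses import dataclass, field


@dataclass
class ExpectedResponse:
    """One rule taken from a (hypothetical) API specification."""
    status: int
    required_fields: set = field(default_factory=set)


def check_compliance(expected, status, body):
    """Return a list of human-readable violations (empty if compliant)."""
    violations = []
    if status != expected.status:
        violations.append(f"expected status {expected.status}, got {status}")
    missing = sorted(expected.required_fields - set(body))
    if missing:
        violations.append("missing fields: " + ", ".join(missing))
    return violations


# A well-formed request should succeed and return the documented fields...
ok_rule = ExpectedResponse(status=200, required_fields={"id", "seq"})
# ...and a nonsensical request should fail in the documented way.
bad_request_rule = ExpectedResponse(status=400, required_fields={"error"})

# The observed status codes and bodies below are made up for the example.
report = {
    "valid request": check_compliance(ok_rule, 200, {"id": "X1", "seq": "ACGT"}),
    "nonsense request": check_compliance(bad_request_rule, 500, {}),
}
```

A dictionary like report could then be serialised for machines and rendered as a table for people.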

Expected results

A suite of tests that can be run periodically against a set of REST APIs (both traditional and GraphQL), creates both a machine-readable and a human-readable representation of the results, and publishes these to a web server.

Required knowledge

  • RESTful API usage
  • Python
  • Test development

Difficulty

Medium-hard

Mentors

Bethany Flint, Andy Yates

Fetch nearest feature REST endpoint (not for GSoC)

Brief Explanation

It is not uncommon for researchers to look for the ‘nearest feature’, i.e. a gene or regulatory element close to a region of interest. This search can be restricted to the same strand or include the opposite strand, be upstream or downstream of the region of interest, include only non-overlapping features, be within 5 base pairs or 500… Ensembl already provides this functionality through the Perl API, but we wish to provide a language-agnostic tool for doing this through the REST API.
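
The filters described above could be combined roughly as in the following sketch. It is not the Perl API's implementation: the Feature type, the parameter names and the distance convention (0 for overlap, otherwise the gap between the closest ends) are assumptions made for illustration, and upstream/downstream filtering is left out for brevity.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: int  # +1 or -1


def distance(region_start, region_end, feat):
    """Distance in base pairs between region and feature; 0 if they overlap."""
    if feat.end < region_start:
        return region_start - feat.end
    if feat.start > region_end:
        return feat.start - region_end
    return 0


def nearest_feature(features, region_start, region_end, strand=None,
                    allow_overlap=True, max_distance=None):
    """Return the closest feature that passes the filters, or None."""
    best, best_dist = None, None
    for feat in features:
        if strand is not None and feat.strand != strand:
            continue  # restrict the search to one strand
        d = distance(region_start, region_end, feat)
        if d == 0 and not allow_overlap:
            continue  # only non-overlapping features wanted
        if max_distance is not None and d > max_distance:
            continue  # too far away
        if best_dist is None or d < best_dist:
            best, best_dist = feat, d
    return best


# Made-up features on a single sequence region.
features = [
    Feature("geneA", 100, 200, 1),
    Feature("geneB", 255, 300, -1),
    Feature("geneC", 1000, 1100, 1),
]
closest_any_strand = nearest_feature(features, 210, 250)
closest_forward = nearest_feature(features, 210, 250, strand=1)
```

A REST endpoint would expose the same options as query parameters and return the chosen feature as JSON.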

Expected results

A REST endpoint to retrieve the ‘nearest feature’ to a region of interest. 

Required knowledge

  • RESTful API development
  • Ensembl Perl API
  • (desirable) genomics

Difficulty

Medium – hard

Mentors

Magali Ruffier

Projects: Bioinformatic analysis

Building a containerised transcript alignment environment

Brief explanation

Finding the structure of genes hidden in a species’ DNA can be done in a variety of ways. One common way is to align data such as protein and RNA sequences from the species of interest, or other closely related species, onto the genome being annotated in order to find the exon and intron structures that make up a gene.

There are numerous tools available to align data to a genome and create potential gene models. These include:

  • GenBlast
  • Exonerate
  • GeneWise
  • minimap2
  • Pro-Splign
  • Splign
  • GenomeThreader

In order to better evaluate the different methods, we will build a containerised system that can run these different tools on a standardised set of test data. The results will then be automatically compared to gold-standard, manually curated reference annotations to calculate the sensitivity and specificity of the different tools. We will generate different sets of test data to test the tools under different scenarios and so better understand the impact of using data that is evolutionarily distant from the test species. The system will be designed to be easily deployable in different environments and easily expandable to include new tools.
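
As an example of the comparison step, the sketch below scores predicted exons against reference exons by exact coordinate match. It assumes the convention common in gene-prediction benchmarks, where sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP); the project may choose other definitions or other levels (nucleotide, transcript, gene). The function name and the coordinates are made up.

```python
def exon_level_accuracy(predicted, reference):
    """Compare predicted exons to reference exons by exact coordinates.

    Each exon is a (seq_region, start, end, strand) tuple.
    Returns (sensitivity, specificity).
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    # Fraction of reference exons that the tool recovered.
    sensitivity = true_pos / len(reference) if reference else 0.0
    # Fraction of predicted exons that are correct (assumed convention).
    specificity = true_pos / len(predicted) if predicted else 0.0
    return sensitivity, specificity


reference_exons = [
    ("1", 100, 200, 1), ("1", 300, 400, 1),
    ("1", 500, 600, 1), ("1", 700, 800, 1),
]
predicted_exons = [
    ("1", 100, 200, 1), ("1", 300, 400, 1),
    ("1", 500, 650, 1),  # wrong 3' boundary, so not counted as correct
]
sensitivity, specificity = exon_level_accuracy(predicted_exons, reference_exons)
```

Running the same function over the output of each tool in its container would give directly comparable numbers.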

Expected results

  • A containerised environment for comparing test data to reference annotations under a variety of scenarios
  • An assessment of the suitability of a subset of the currently available tools for gene annotation
  • Documentation of the implementation and test results

Required knowledge

  • Experience with containerisation (Docker and Singularity a plus)
  • Familiarity with designing workflows for testing data
  • Python

Desirable knowledge

  • Any prior experience with sequence alignment and genome annotation

Difficulty

Medium

Mentors

Thibaut Hourlier, Fergal Martin

Modelling of external references (not for GSoC)

Brief Explanation

There are multiple databases of biological data available, sometimes presenting comparable data from a different angle. Ensembl provides links to equivalent data in other databases in the form of external references. Mapping to these resources can be very heterogeneous, via coordinate overlap, sequence alignment, manually curated links or even third-party links. For reproducibility reasons, it is important to be able to report how a link was generated. The information is available in a graph database but we would like to store it in a traditional database.

Expected results

  • Identify what data should be recorded from external sources
  • Describe a relational schema to store all the relevant information
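
One possible starting point for such a schema is sketched below using SQLite from Python. Every table name, column name and value here is hypothetical and chosen only to show how the mapping method could be recorded next to each link; it is not the existing Ensembl schema.

```python
import sqlite3

# Hypothetical schema: one row per link, with the method that produced it.
SCHEMA = """
CREATE TABLE external_db (
    external_db_id INTEGER PRIMARY KEY,
    name           TEXT NOT NULL UNIQUE,
    db_release     TEXT
);
CREATE TABLE mapping_method (
    mapping_method_id INTEGER PRIMARY KEY,
    name              TEXT NOT NULL UNIQUE
);
CREATE TABLE xref (
    xref_id           INTEGER PRIMARY KEY,
    ensembl_stable_id TEXT NOT NULL,
    external_db_id    INTEGER NOT NULL REFERENCES external_db(external_db_id),
    accession         TEXT NOT NULL,
    mapping_method_id INTEGER NOT NULL REFERENCES mapping_method(mapping_method_id),
    score             REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO external_db (name, db_release) VALUES ('ExampleDB', 'r1')")
conn.executemany(
    "INSERT INTO mapping_method (name) VALUES (?)",
    [("coordinate_overlap",), ("sequence_alignment",),
     ("manual_curation",), ("third_party",)],
)
# Placeholder identifiers, not real accessions.
conn.execute(
    """INSERT INTO xref (ensembl_stable_id, external_db_id, accession,
                         mapping_method_id, score)
       VALUES ('GENE0001',
               (SELECT external_db_id FROM external_db WHERE name = 'ExampleDB'),
               'EX12345',
               (SELECT mapping_method_id FROM mapping_method
                 WHERE name = 'sequence_alignment'),
               98.5)"""
)


def how_link_was_made(conn, stable_id):
    """Report (database, accession, method) for every link of a stable ID."""
    return conn.execute(
        """SELECT d.name, x.accession, m.name
             FROM xref x
             JOIN external_db d ON d.external_db_id = x.external_db_id
             JOIN mapping_method m ON m.mapping_method_id = x.mapping_method_id
            WHERE x.ensembl_stable_id = ?""",
        (stable_id,),
    ).fetchall()


links = how_link_was_made(conn, "GENE0001")
```

Keeping the method in its own table makes it possible to ask, for reproducibility, how every link to a given resource was generated.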

Required knowledge

  • RDBMS
  • Scripting
  • (desirable) RDF

Difficulty

Easy – medium

Mentors

Magali Ruffier

Genomic data and flat files (not for GSoC)

Brief Explanation

Genomics data is provided in a number of different file formats with loose specifications (for example GFF3, GTF, EMBL). As a result, the same data can be represented differently depending on who generated the file and when. Additionally, different analysis software packages have different expectations as to the format, requiring end users to repeatedly modify existing files to suit their use case. Ensembl has developed a tool, FileChameleon, that can facilitate these transformations and provide users directly with the end file they require. We would like to identify what transformations are needed and implement them accordingly.
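
To give a feel for the kind of transformation involved, here is a small Python sketch that rewrites the sequence names in GFF3 feature lines so that they carry a "chr" prefix. It is a generic illustration only: it does not use FileChameleon (which is written in Perl) or its configuration format, and the function name is made up.

```python
def add_chr_prefix(gff3_lines, prefix="chr"):
    """Yield GFF3 lines with the prefix added to the seqid column."""
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            # Comments, directives and blank lines pass through unchanged;
            # a fuller version would also rewrite ##sequence-region lines.
            yield line
            continue
        columns = line.rstrip("\n").split("\t")
        if not columns[0].startswith(prefix):
            columns[0] = prefix + columns[0]
        yield "\t".join(columns) + "\n"


example = [
    "##gff-version 3\n",
    "1\tensembl\tgene\t100\t200\t.\t+\t.\tID=gene1\n",
    "chrX\tensembl\tgene\t300\t400\t.\t-\t.\tID=gene2\n",
]
converted = list(add_chr_prefix(example))
```

Because the function is a generator, it can stream large files line by line without loading them into memory.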

Expected results

  • Identify a handful of typical workflows using genomics files
  • Implement additional options for FileChameleon to fulfill those workflows
  • Provide example configurations for complete workflows

Required knowledge

  • Perl
  • Genomics

Difficulty

Easy – medium

Mentors

Magali Ruffier

Projects: Compara

Ensembl maintains a broad set of cross-species comparisons ranging from DNA-level whole-genome alignments to protein-level orthologies. These resources are widely used, and Ensembl strives to use state-of-the-art methods. The projects below all aim to increase the quality of the provided data and to make use of the latest advances in the field.

Improve the pairwise orthology predictions using synteny (internship only – not for GSoC)

Brief Explanation

The Ensembl method to infer orthologues starts with building phylogenetic trees and reconciling with the species-tree in order to call the speciation and duplication events that took place. According to the definition, the orthologues are the genes from different species linked via a speciation node. However, the gene-tree reconstructions are not error-free and we need to bend the classical definition of orthology in order to overcome these mis-reconstructions and increase the number of orthologues we predict.

One approach is to use the conservation of local synteny [1]. Since Ensembl version 86, we have computed a gene-order conservation (GOC) score for all the orthologues we predict, using a couple of genes on both sides. We currently use this score to assess the quality of the existing orthology calls and are now considering using it to make different calls altogether. Given our current orthology predictions, we want to find other homologues that have a higher GOC score than their predicted orthologues, compare other metrics such as sequence similarity, and switch the orthology call when needed. The work will involve implementing:

  1. a method to scan the gene-trees, extract the homologues and compute synteny-support for all of them, as well as some additional metrics
  2. a decision method to switch some of our orthologues to another homologue based on the metrics gathered.

[1] Computational methods for Gene Orthology inference. David M. Kristensen et al. Briefings in Bioinformatics, Volume 12, Issue 5, 1 September 2011, Pages 379–391, https://doi.org/10.1093/bib/bbr030
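
The idea of a gene-order conservation score can be sketched as follows. This is a simplified reading of "a couple of genes on both sides" and not Ensembl's actual implementation: it takes two neighbours on each side of a gene, counts how many of them have an orthologue among the neighbours of the candidate gene in the other species, and reports the result as a percentage. All names and data are made up.

```python
def goc_score(gene_a, gene_b, order_a, order_b, orthologues, flank=2):
    """Percentage of gene_a's neighbours whose orthologue neighbours gene_b.

    order_a / order_b: gene IDs in chromosomal order for each species.
    orthologues: dict mapping a species-A gene to its species-B orthologue.
    """
    def neighbours(gene, order):
        i = order.index(gene)
        return order[max(0, i - flank):i] + order[i + 1:i + 1 + flank]

    near_a = neighbours(gene_a, order_a)
    near_b = set(neighbours(gene_b, order_b))
    conserved = sum(1 for g in near_a if orthologues.get(g) in near_b)
    return 100 * conserved // (2 * flank)


order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b1", "b2", "b3", "bX", "b5"]  # b4 has been lost or moved
orthologues = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4", "a5": "b5"}

score = goc_score("a3", "b3", order_a, order_b, orthologues)
```

Scoring every homologue of a gene this way, and not only its predicted orthologue, is what would let a decision method spot candidates with better synteny support.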

Expected results

  • Script or workflow to extract the gene-order conservation scores and the other metrics
  • Analysis of the data and definition of a method to switch orthologues using synteny
  • Assessment of the performance of the new method

Required knowledge

  • Perl
  • Unix HPC environment
  • Data analysis, classification

Difficulty

Medium

Mentors

David Thybert

Compact format storage for gene homology relationships

Brief Explanation

The comparison of gene homology between species is important for biologists as it gives information about gene function. In Ensembl, the homology relationships between genes are currently stored in a database table where one entry represents a pairwise homology relation between genes of two species. The space required to store these pairwise relations increases quadratically with the number of species. At the moment, with about 300 species, we have to store more than a billion rows. This solution is hardly scalable to thousands of species and puts a lot of strain on our database.

A more efficient way to store the homology relations would be to take advantage of the hierarchical structure of a gene tree, which would avoid storing the details of every pairwise relation. As a proof of concept, we propose to develop a new format for storing gene homology relations using the hierarchical structure of gene trees. In addition, a tool that reads the format to infer the homology relationship between genes needs to be developed.
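
The sketch below illustrates why a tree is enough, assuming each internal node is labelled with the event it represents: two genes are orthologues if their lowest common ancestor is a speciation node and paralogues if it is a duplication node. The Node class and the toy tree are hypothetical and say nothing about the on-disk format the project would define.

```python
class Node:
    def __init__(self, name=None, event=None, children=()):
        self.name = name      # gene ID (leaves only)
        self.event = event    # "speciation" or "duplication" (internal nodes)
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def index_leaves(root):
    """Map each gene ID to its leaf node."""
    leaves, stack = {}, [root]
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(node.children)
        else:
            leaves[node.name] = node
    return leaves


def relationship(leaves, gene_1, gene_2):
    """Infer the homology type of two distinct genes from their
    lowest common ancestor's event."""
    ancestors = set()
    node = leaves[gene_1]
    while node is not None:
        ancestors.add(node)
        node = node.parent
    node = leaves[gene_2].parent
    while node not in ancestors:
        node = node.parent
    return "orthologue" if node.event == "speciation" else "paralogue"


# Toy tree: a duplication in human after the human/mouse speciation.
duplication = Node(event="duplication",
                   children=[Node("human_A1"), Node("human_A2")])
root = Node(event="speciation", children=[duplication, Node("mouse_A")])
leaves = index_leaves(root)
```

A tree with n genes needs on the order of n nodes, whereas the pairwise table needs on the order of n squared rows, which is where the saving would come from.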

Expected results

  • Format representing the homology relationship of a gene tree for thousands of species
  • Tool parsing the format to provide the homology relation between two genes.

Required knowledge

  • Python
  • Data representation 

Difficulty

Hard

Mentor

David Thybert

Deep learning for homology inference

Brief Explanation

In a previous Google Summer of Code, a deep learning neural network was built to infer orthology relations. While this network predicts the homology relation between a pair of genes from closely related species with good accuracy, its performance decreases dramatically when the genes are from distant species.

Here, we propose to build on the previous project and develop a deep learning neural network with improved performance in predicting homology relations between distantly related species.

Expected results

  • Deep learning network for the prediction of homology relations between distantly related species.

Required knowledge

  • Python, Keras
  • Deep learning 

Difficulty

Medium, Hard

Mentor

David Thybert

 

Projects: Biodata NLP and software development for WormBase.org

 

Extract important information from scientific papers (several candidates possible)

Brief Explanation

Our database retrieves all full-text scientific papers about model organisms and, in a largely automated way, extracts information from those papers to add to the database. Your job would be to work on types of data we do not yet capture: creating scripts to extract significant sentences and words from the full-text papers and reformat them for ingestion into the database and further validation by scientific curators. The pipeline for retrieving full-text scientific papers and extracting sentences already exists, so this is a perfect opportunity to think creatively about text mining and choose your own subset of biologically important data to extract, normalise and format.

Expected results

  • Script for flagging up curatable information from full-text scientific papers
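
A very small example of what such a script could do is shown below: it splits text into sentences and flags those that contain a term of interest. The sentence splitter is deliberately naive, the function name is made up, and the single pattern (three or four lowercase letters, a hyphen and a number, as in C. elegans gene names) is only an assumed example of a data type one might target. It does not reflect the existing pipeline.

```python
import re

# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def flag_sentences(text, patterns):
    """Return (sentence, matched_terms) pairs for sentences worth curating."""
    flagged = []
    for sentence in SENTENCE_END.split(text.strip()):
        terms = [m.group(0) for p in patterns for m in re.finditer(p, sentence)]
        if terms:
            flagged.append((sentence, terms))
    return flagged


# Made-up text standing in for a full-text paper.
text = ("We studied daf-16 mutants. Worms were grown at 20 degrees. "
        "Loss of unc-54 caused paralysis.")
gene_name = r"\b[a-z]{3,4}-\d+\b"
flagged = flag_sentences(text, [gene_name])
```

The flagged pairs could then be reformatted for ingestion and passed to curators for validation.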

Required knowledge

  • Python
  • Interest in NLP (natural language processing)
  • Interest in biology

Difficulty

Adaptable

Mentor

Magdalena Zarowiecki

Automation and testing of workflow (several candidates possible)

Brief Explanation

Our database gets rebuilt twice a year, in a workflow containing several hundred steps. Many of these steps currently lack an automated test to validate that the step has run to completion and that the output is complete and correct. This project will teach you the basic principles of software development and testing, which are widely applicable to any type of software development project (scientific or commercial) you might be involved with in the future. You will learn to think logically about how to do “testing” (https://en.wikipedia.org/wiki/Software_testing), the different conceptual and practical approaches one can take, and think creatively about how to design test suites for a range of different cases.

Expected results

  • Create scripts for validating that steps in the workflow have run to completion and that the output is complete and correct. 
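
A validation script for one step might look something like the following sketch, which checks that an output file exists, holds a minimum number of records and has the expected number of tab-separated columns. The file layout, function name and thresholds are assumptions for illustration; real steps would each need their own checks.

```python
import os
import tempfile


def validate_step_output(path, min_records=1, expected_columns=None):
    """Return a list of problems found in a step's output file (empty = OK)."""
    if not os.path.isfile(path):
        return [f"{path}: output file is missing"]
    problems = []
    n_records = 0
    with open(path) as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comment/header lines
            n_records += 1
            n_cols = len(line.rstrip("\n").split("\t"))
            if expected_columns is not None and n_cols != expected_columns:
                problems.append(
                    f"{path}:{line_no}: expected {expected_columns} columns, "
                    f"found {n_cols}")
    if n_records < min_records:
        problems.append(
            f"{path}: expected at least {min_records} records, found {n_records}")
    return problems


# Demonstration on a throwaway file with one truncated record.
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "step_output.tsv")
    with open(out, "w") as handle:
        handle.write("# gene\tbiotype\n")
        handle.write("gene1\tprotein_coding\n")
        handle.write("gene2\n")
    problems = validate_step_output(out, min_records=3, expected_columns=2)
    missing = validate_step_output(os.path.join(tmp, "absent.tsv"))
```

Returning a list of problems, instead of stopping at the first one, makes it easier to produce a full report for each build.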

Required knowledge

  • Interest in learning Perl (Python also possible)
  • Interest in testing, and software development

Difficulty

Medium

Mentor

Magdalena Zarowiecki

 

Projects: Software development for MGnify

Workflow execution orchestration

Brief Explanation

MGnify provides a free-to-use platform for the assembly, analysis and archiving of microbiome data derived from sequencing microbial populations that are present in particular environments. This service is hosted by the European Bioinformatics Institute (EMBL-EBI) and runs on the EMBL-EBI High Performance Computing (HPC) cluster. Part of our effort to scale up the service in line with increasing data volumes is to port the MGnify service to run in the cloud. The aim of this project is to build a job dashboard that coordinates the execution of jobs within a set of distributed compute clusters. This dashboard will be responsible for the orchestration of jobs across environments such as EMBL-EBI HPC, Google Cloud Platform and Embassy. The dashboard will control a Scheduler that schedules jobs and puts them into a GCP Cloud Pub/Sub queue, keeping track of their state in a database. A series of Workers will pick up the messages from the queue and interact with the cluster job scheduler to execute the workflow components. Finally, the Monitoring Server harvests logs and metrics from the jobs and stores them in centralized storage (such as Kibana) for further analysis.
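
The interaction between the Scheduler, the queue and a Worker can be sketched in a few lines of Python. This is only a toy model under stated assumptions: an in-process queue.Queue stands in for the Pub/Sub queue, a dictionary stands in for the state database, and fake_submit is a hypothetical placeholder for submitting to LSF or Slurm.

```python
import queue

job_states = {}            # stand-in for the Scheduler's database
job_queue = queue.Queue()  # stand-in for the Pub/Sub queue


def schedule(job_id, command):
    """Scheduler side: record the job and publish it to the queue."""
    job_states[job_id] = "QUEUED"
    job_queue.put((job_id, command))


def worker(run_on_cluster):
    """Worker side: consume messages until a None sentinel arrives."""
    while True:
        item = job_queue.get()
        if item is None:
            job_queue.task_done()
            break
        job_id, command = item
        job_states[job_id] = "RUNNING"
        try:
            run_on_cluster(command)
            job_states[job_id] = "DONE"
        except Exception:
            job_states[job_id] = "FAILED"
        finally:
            job_queue.task_done()


def fake_submit(command):
    """Hypothetical placeholder for a real cluster submission."""
    if command == "fail":
        raise RuntimeError("job failed on the cluster")


schedule("job-1", "assemble")
schedule("job-2", "fail")
job_queue.put(None)  # tell the worker to stop once the queue is drained
worker(fake_submit)
```

In a real deployment the Workers would run as separate processes on each cluster, and the state updates would be written to the shared database so the dashboard can display them.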

Expected results

  • A Scheduler application to keep track of the jobs
  • A Worker agent that listens for jobs and interacts with an HPC job scheduler (LSF and Slurm)
  • A Monitoring agent that pipes logs from the jobs into a centralized database
  • A web application to interact with the Scheduler

Required knowledge

  • Python
  • JavaScript – CSS / HTML
  • Job scheduler concepts
  • Familiarity with Unix 

Difficulty

Medium

Mentors

Martin Beracochea, Juan Caballero