Student Projects

Ensembl is able to host self-funded students on summer projects or internships at various times throughout the year.

Bioinformatic analysis

Compact format storage for gene homology relationships

Brief Explanation

Comparing gene homology between species is important for biologists because it gives information about gene function. At the moment, in Ensembl, the homology relationships between genes are stored in a database table in which each entry represents one pairwise homology relation between genes. The space required to store these relations increases quadratically with the number of species. With about 300 species, we already have to store more than a billion rows. This solution will hardly scale to thousands of species and puts a lot of strain on our database.

A more efficient way to store the homology relations would be to take advantage of the hierarchical structure of a gene tree, which would avoid storing the details of every pairwise homology relation. As a proof of concept, we propose to develop a new format for storing gene homology relations using the hierarchical structure of gene trees. In addition, a tool that reads the format and infers the homology relationship between genes needs to be developed.
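To illustrate the idea, here is a minimal sketch of how a pairwise relation could be derived from a tree instead of being stored. It assumes that every internal node of the gene tree is labelled as a speciation or a duplication event and that node names are unique; the `Node` class, the `homology` function and the gene names are hypothetical and do not describe an existing Ensembl format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One node of a gene tree: a leaf is a gene, an internal node is an event."""
    name: str
    event: Optional[str] = None  # "speciation" or "duplication" (internal nodes only)
    children: List["Node"] = field(default_factory=list)


def index_parents(root: Node) -> Dict[str, Optional[Node]]:
    """Map every node name to its parent node (None for the root)."""
    parents: Dict[str, Optional[Node]] = {root.name: None}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            parents[child.name] = node
            stack.append(child)
    return parents


def homology(root: Node, gene_a: str, gene_b: str) -> str:
    """Classify two distinct genes by the event at their lowest common ancestor."""
    if gene_a == gene_b:
        raise ValueError("expected two different genes")
    parents = index_parents(root)
    # Collect all ancestors of gene_a ...
    ancestors = set()
    node = parents[gene_a]
    while node is not None:
        ancestors.add(node.name)
        node = parents[node.name]
    # ... then climb from gene_b until the first shared ancestor is reached.
    node = parents[gene_b]
    while node is not None and node.name not in ancestors:
        node = parents[node.name]
    if node is None:
        raise ValueError("the two genes do not share an ancestor")
    return "ortholog" if node.event == "speciation" else "paralog"


# Toy tree (made-up genes): a duplication in the human lineage after the
# human/mouse speciation.
tree = Node("root", "speciation", [
    Node("dup1", "duplication", [Node("human_A1"), Node("human_A2")]),
    Node("mouse_A"),
])

print(homology(tree, "human_A1", "human_A2"))
print(homology(tree, "human_A1", "mouse_A"))
```

A tree with n genes has at most n - 1 internal nodes, whereas the pairwise table needs up to n(n - 1)/2 rows for the same genes, which is where the saving would come from.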

Expected results

  • Format representing the homology relationship of a gene tree for thousands of species
  • Tool parsing the format to provide the homology relation between two genes.

Required knowledge

  • Python
  • Data representation 

Difficulty

Hard

Mentor

David Thybert

Deep learning for homology inference

Brief Explanation

In a previous Google Summer of Code, a deep learning neural network was built to infer orthology relations. While this network predicts with good accuracy the homology relation between a pair of genes from closely related species, its performance decreases dramatically when the genes are from distant species.

Here, we propose to build on the previous project and develop a deep learning neural network with improved performance in predicting homology relations between distantly related species.

Expected results

  • Deep learning network for the prediction of homology relations between distantly related species.

Required knowledge

  • Python, Keras
  • Deep learning 

Difficulty

Medium, Hard

Mentor

David Thybert

Single assay epigenomic annotation

Brief Explanation

The Ensembl Regulatory Build currently annotates candidate functional regions using a collection of epigenomic assays applied to the same sample. To extend this approach to more samples, we could run a simplified annotation process that would compare one assay to the annotation developed from the richer samples. This would in effect allow us to augment the number of tissues in the Ensembl Regulatory Build using existing data.

Expected results

  • Building a prototype script
  • Quality assessment of the results

Required knowledge

  • Scripting (Python/Perl)
  • Basic statistics

Difficulty

Easy

Mentors

Garth Ilsley

Identification of tissues of interest from GWAS results

Brief Explanation

A number of methods currently allow researchers to infer tissues relevant to disease by comparing genome-wide association study (GWAS) results to reference epigenomic marks. This in turn is useful for downstream analyses, as many functional annotations are tissue dependent.
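As a rough illustration of what such a comparison can look like, the sketch below tests whether GWAS hits fall inside a tissue's epigenomic marks more often than expected by chance, using a hypergeometric tail probability. This is only one simple way to frame the problem and is not taken from any specific published method; the function name and the counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(total: int, marked: int, hits: int, hits_marked: int) -> float:
    """Probability of seeing at least `hits_marked` GWAS hits inside the marks.

    total       -- number of variants tested
    marked      -- variants overlapping the tissue's epigenomic marks
    hits        -- variants associated with the disease
    hits_marked -- associated variants that overlap the marks
    """
    upper = min(hits, marked)
    tail = sum(
        comb(marked, i) * comb(total - marked, hits - i)
        for i in range(hits_marked, upper + 1)
    )
    return tail / comb(total, hits)


# Made-up counts for two hypothetical tissues; a smaller value suggests
# a stronger association between the tissue and the disease.
for tissue, marked, hits_marked in [("tissue_1", 200, 12), ("tissue_2", 200, 3)]:
    p = enrichment_pvalue(total=10000, marked=marked, hits=50, hits_marked=hits_marked)
    print(tissue, p)
```

Ranking tissues by this value gives a simple baseline against which more elaborate methods could be benchmarked.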

Expected results

  • Literature research for existing methods
  • Collection of test datasets from GWAS Catalog and Ensembl
  • Benchmark analysis

Required knowledge

  • Scripting (Python/Perl)
  • Basic machine learning

Difficulty

Medium

Mentors

Garth Ilsley

A self-updated pipeline for the annotation of DNA features

Brief Explanation

In Plants and other divisions, repeat features are annotated by combining predictions of three types of repeats: low-complexity regions, tandem repeats and complex repeats (see more at http://ensemblgenomes.org/info/data/repeat_features). For complex repeats, a variety of libraries are scanned. For instance, RepBase and MIPS Repeat are used for plants. We would like to generalize the pipeline to support any features that can be modelled as HMMs or FASTA files with gene/transcript sequences.

Expected results

  • A standalone DNA features pipeline able to integrate multiple custom annotation sources
  • A test to check the quality of an annotation run
  • A mechanism to update signature libraries after a new release

Required knowledge

  • Scripting (Python/Perl)
  • Testing libraries and strategies

Difficulty

Medium

Mentors

Bruno Contreras Moreira, Guy Naamati, Fergal Martin

Automatic naming of non-coding genes

Brief Explanation

In the Google Summer of Code 2019 we investigated the possibility of automatically naming non-coding genes that were annotated both by Ensembl and RefSeq. This year, we wish to automate this process into a robust pipeline.

Expected results

  • File parsers
  • File format checkers
  • Quality control
  • Implementing a designated algorithm
  • Automatic loading of results into an SQL database

Required knowledge

  • SQL
  • Scripting

Difficulty

Easy

Mentors

Daniel Zerbino

eHive workflow management system

Ensembl’s production is powered by the eHive workflow management system. It is responsible for scheduling and executing in excess of 450 CPU years of compute per year.

Support for Common Workflow Language (CWL) in eHive

Brief Explanation

CWL is a recent effort to define a common language to describe workflows. CWL is increasingly supported by workflow management systems, and we would also like to support it in eHive (our own, older workflow management system). Extending the CWL language with the features and capabilities that eHive has is a massive task that needs deep cooperation between both projects. As a first step, we’d like to have two tools: an eHive Runnable that is able to execute a CWL component (delegating this to the reference CWL implementation), and a CWL component to execute an eHive Runnable within a CWL workflow (delegating the actual execution to eHive reference tools).
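The first of these tools could start as a thin wrapper around the reference CWL runner, `cwltool`. The sketch below only shows the delegation step and is not an eHive Runnable; the function names and file names are hypothetical, and it assumes that `cwltool` is installed and prints the workflow's output object as JSON on standard output.

```python
import json
import subprocess
from typing import Any, Dict, List, Optional


def build_cwl_command(workflow: str,
                      job_file: Optional[str] = None,
                      outdir: Optional[str] = None) -> List[str]:
    """Build the cwltool command line for a workflow and its input object."""
    command = ["cwltool"]
    if outdir:
        command += ["--outdir", outdir]
    command.append(workflow)
    if job_file:
        command.append(job_file)
    return command


def run_cwl(workflow: str,
            job_file: Optional[str] = None,
            outdir: Optional[str] = None) -> Dict[str, Any]:
    """Run the workflow with cwltool and return its parsed output object."""
    result = subprocess.run(
        build_cwl_command(workflow, job_file, outdir),
        capture_output=True, text=True, check=True,  # raises if cwltool fails
    )
    return json.loads(result.stdout)


# Only the command is built here, so the example runs without cwltool.
print(build_cwl_command("align.cwl", "align_job.yml", outdir="results"))
```

In a Runnable, the parsed output object would be the natural thing to pass on to the next step of the pipeline.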

Expected results

  • A CWLCmd eHive Runnable to execute a CWL workflow
  • A HiveStandaloneJob CWL component to execute an eHive Runnable

Required knowledge

  • Perl and Python

Difficulty

Medium

Mentors

Matthieu Muffato

Support for OpenStack cloud instances

Brief Explanation

eHive is currently used on physical compute clusters managed by LSF and SGE, which often have a fixed size. On the other hand, cloud providers have custom APIs to manage the size of the cluster. We want to build an OpenStack installation and an OpenStack backend for eHive to be able to 1) run eHive pipelines on an OpenStack cloud and 2) let eHive manage the (de)allocation of nodes. The goal is to directly call the OpenStack API from eHive, removing the need for a job scheduler.

Expected results

  • A recipe to build an OpenStack VM (for instance using packer)
  • A new backend in eHive to control an OpenStack cloud

Required knowledge

  • OpenStack API
  • Perl

Difficulty

Hard

Mentors

Matthieu Muffato

Automated resource allocation in compute workflows

Brief Explanation

To use compute resources as efficiently as possible, eHive requires workflows to declare the resources (memory, CPU, etc.) that each analysis requires. As most grid managers (such as LSF) kill jobs that exceed the resources they have requested, we need a mechanism to resubmit those jobs with more resources. eHive is able to detect those cases but currently requires each analysis to predefine alternative analyses that have more resources, e.g. first try with 500 MB RAM, then with 2 GB, then with 4 GB. There are two drawbacks to this approach: the workflow then features extra copies of all those analyses, making it less readable and more complex to maintain, and jobs are only allowed to grow within the given range.

eHive will be used to orchestrate “big data” genomics workflows in the light of projects such as the Darwin Tree of Life, the Vertebrate Genomes Project or the Earth BioGenome Project, which will sequence the genomes of more than 100,000 species, multiplying by 100 the amount of data being processed by our institute. In genomics, “big data” not only means “more data” but also “more diverse data” and “unexpected data”. These projects will lead to the discovery of previously unheard-of genomic features, and we will inevitably face more and more compute jobs not behaving as we planned. We need a new, fully dynamic and automated system to capture the resource requirements of individual jobs and resubmit them with adequate resources, without having to predefine alternative analyses.
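The behaviour we are after can be sketched in a few lines. In the example below a job that runs out of memory is retried with a larger request computed on the fly, and every request is recorded per job, so no alternative analysis has to be declared in advance. This is a toy model written in Python for brevity, not eHive code; `OutOfMemory`, `run_with_escalation` and the doubling policy are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


class OutOfMemory(Exception):
    """Raised by a job that needed more memory than it was given."""


def run_with_escalation(job: Callable[[int], str],
                        start_mb: int = 500,
                        factor: int = 2,
                        max_mb: int = 16000) -> Tuple[str, List[int]]:
    """Run `job`, multiplying the memory request by `factor` after each failure.

    Returns the job result and the list of memory requests that were tried.
    """
    attempts: List[int] = []
    memory_mb = start_mb
    while memory_mb <= max_mb:
        attempts.append(memory_mb)
        try:
            return job(memory_mb), attempts
        except OutOfMemory:
            memory_mb *= factor  # hypothetical policy: grow geometrically
    raise RuntimeError(f"job still failing at the {max_mb} MB limit")


def toy_job(memory_mb: int) -> str:
    """Stand-in for a real job that happens to need 1500 MB."""
    if memory_mb < 1500:
        raise OutOfMemory
    return "done"


result, attempts = run_with_escalation(toy_job)
print(result, attempts)
```

The recorded attempts are the kind of job-level information that could later feed the scheduler, for example to choose a better starting request for similar jobs.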

Expected results

  • New database model and API to record resources in an easily parsable format
  • New database model and API to track resources at the job level
  • Improved scheduler that is aware of job-specific requirements

Required knowledge

  • Perl
  • MySQL, PostgreSQL or SQLite

Difficulty

Medium

Mentors

Matthieu Muffato

Data visualisation

Exploring evolution with sequence tube maps

Brief Explanation

In a previous Google Summer of Code, we developed a novel and user-friendly visualisation to explore genome sequence alignments, called Tubemaps. We would like to use the intuitive and familiar look and feel of Tubemaps for the benefit of scientific outreach and communication.

As a proof of principle, we propose to develop a small web tool that, given a human gene name of interest, extracts the Ensembl sequence alignment of that gene across species and displays how it differs across species with a sequence tube map. The default Tubemaps display could be enhanced with symbols representing the species. This tool could also provide interesting and educational facts about that gene.

Expected results

  • JavaScript code to generate a Tubemap from a sequence alignment
  • Web interface

Required knowledge

  • JavaScript
  • Web development

Difficulty

Low-medium

Mentors

Daniel Zerbino, David Thybert, Bruno Contreras Moreira

API and webservices

Compliance Test Suite for Ensembl Web APIs

Brief Explanation

API specification is key to the creation of good, future-proof APIs. However, there is often a disconnect between a specification and its implementations, because it is hard to ensure that an API implementation conforms to a specification. This includes the expected behaviour of an implementation in response to erroneous or nonsensical requests. Conformance is especially important for maintaining the contract of service when new API versions are deployed (and where we allow implementations to diverge from a specification). We envisage a library of compliance tests, which can be run periodically to create a report detailing where APIs do not comply. This project would also work with the Global Alliance for Genomics and Health (GA4GH) to reuse infrastructure they have developed to write these compliance suites.

Expected results

A suite of tests that can be run periodically against a set of web APIs (both traditional REST and GraphQL), creates both machine-readable and human-readable representations of the results, and publishes these to a web server.
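A single compliance check could look like the sketch below, which compares one response with what the specification promises and returns a list of problems that can be rendered either as JSON or as a human-readable report. The specification dictionaries, field names and the `check_response` function are hypothetical; a real suite would derive the expectations from the API specification and obtain the responses over HTTP.

```python
from typing import Any, Dict, List

# Hypothetical expectations for a valid request and for a nonsensical one.
LOOKUP_SPEC: Dict[str, Any] = {
    "status": 200,
    "content_type": "application/json",
    "required_fields": {"id": str, "species": str},
}
BAD_REQUEST_SPEC: Dict[str, Any] = {
    "status": 400,
    "content_type": "application/json",
    "required_fields": {"error": str},
}


def check_response(spec: Dict[str, Any], status: int,
                   content_type: str, body: Dict[str, Any]) -> List[str]:
    """Return the ways in which a response deviates from `spec` (empty if none)."""
    problems: List[str] = []
    if status != spec["status"]:
        problems.append(f"expected status {spec['status']}, got {status}")
    if not content_type.startswith(spec["content_type"]):
        problems.append(f"expected content type {spec['content_type']}, got {content_type}")
    for name, expected_type in spec["required_fields"].items():
        if name not in body:
            problems.append(f"missing field '{name}'")
        elif not isinstance(body[name], expected_type):
            problems.append(f"field '{name}' should be of type {expected_type.__name__}")
    return problems


# A compliant answer to a valid request (made-up values) ...
print(check_response(LOOKUP_SPEC, 200, "application/json; charset=utf-8",
                     {"id": "gene0001", "species": "example_species"}))
# ... and a server that answers a nonsensical request with an HTML page.
print(check_response(BAD_REQUEST_SPEC, 200, "text/html", {}))
```

Returning a plain list of problems keeps the check independent of how the report is published.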

Required knowledge

  • RESTful API usage
  • Python
  • Test development

Difficulty

Medium-hard

Mentors

Bethany Flint, Andy Yates