Student Projects

Ensembl can host self-funded students on summer projects or internships. We have a number of ideas for possible projects, but we’re also willing to hear your ideas. Please contact us if you’re interested.

Bioinformatic analysis

Repeat annotation

Brief Explanation

Vertebrate genomes contain repetitive stretches of DNA. This might mean the same element repeated in tandem within a particular region, an element repeated at numerous locations across the genome, or both. Repeat elements often greatly expand the sequence of a genome without adding to its functional repertoire.

The annotation of repeat families is interesting from an evolutionary standpoint, but also important from a computational standpoint. Repeats have a tendency to slow, confuse or completely break software. In addition, they make the search space for functional elements much bigger than it should be, as it is very unusual to find functional elements such as genes within repeats.

As part of the genome annotation pipeline, Ensembl runs popular repeat-finding tools, including RepeatMasker, Dust and TRF, to find potential repeats. This strategy works well for species with a well-characterised repeat library to guide the search of the genome sequence.

For species with unique repeats, or where there is no suitable repeat library, this strategy does not work well. In such cases we use RepeatModeler for ab initio repeat finding, in order to generate a repeat library. One major issue with this strategy is that RepeatModeler sometimes identifies gene families as repeats. Some gene families, for example single-exon genes, are particularly susceptible to this problem. The purpose of this project would be to examine and implement new methods of repeat finding and filtering to improve the quality of repeat annotation in Ensembl.
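For orientation, the masking step itself is simple: once repeat intervals have been found, the corresponding bases are lower-cased ("soft-masking", as RepeatMasker does). A minimal sketch in Python, with a hard-coded interval purely for illustration:

```python
def soft_mask(sequence, repeat_intervals):
    """Soft-mask a DNA sequence: lower-case bases covered by repeat intervals.

    repeat_intervals: list of (start, end) pairs, 0-based half-open.
    """
    masked = list(sequence.upper())
    for start, end in repeat_intervals:
        for i in range(start, min(end, len(masked))):
            masked[i] = masked[i].lower()
    return "".join(masked)

# Example: mask a short tandem repeat reported at positions 4-12
print(soft_mask("ACGTACACACACGGTT", [(4, 12)]))  # ACGTacacacacGGTT
```

The hard part of the project is not this step but deciding which intervals belong to genuine repeat families rather than gene families.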

Expected results

  • Generate a set of high quality repeat libraries for all vertebrates with an available genome sequence
  • Produce repeat-masked copies of these genomes and submit them back to the public archives after analysis
  • Update the Ensembl annotation code-base and pipelines to take advantage of improvements to the repeat annotation process
  • Potential to create a standalone, containerised, continuously deployed pipeline for repeat annotation. This could run on genome sequences as soon as they arrive in the public archives and submit a completed repeat analysis back to the archives

Required knowledge

  • Object-oriented coding
  • Familiarity with some of the following: genomics, repeat annotation, bioinformatics software, Git, containers, deployment strategies, CWL

Difficulty

Medium-hard

Mentors

Osagie Izuogu, Fergal Martin

Circular RNA analytics frontend

Brief Explanation

Circular RNAs (circRNAs) are an established class of highly stable non-coding RNAs, produced by a back-splicing mechanism, and have been identified in various cell lines and tissues across multiple species. Dynamic and divergent expression of circRNAs relative to linear transcripts is associated with various biological processes, and is potentially useful for delineating their functional relevance or for use as disease markers.

At Ensembl, we have developed workflows for accurate computational identification and quantification of circRNAs from high-throughput RNA-seq data. To facilitate circRNA research, this project will produce a responsive web-based analytics dashboard, integrated with an in-house catalogue of circRNAs identified from multiple species.
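The coordinate-based search the dashboard needs can be sketched against an in-memory catalogue. All identifiers and field names below are illustrative, not the actual Ensembl schema:

```python
# Hypothetical slice of a circRNA catalogue (illustrative records only)
CATALOGUE = [
    {"id": "circ_0001", "chrom": "1", "start": 1000, "end": 5000, "host_gene": "BRCA2"},
    {"id": "circ_0002", "chrom": "1", "start": 9000, "end": 12000, "host_gene": "TP53"},
    {"id": "circ_0003", "chrom": "2", "start": 500, "end": 800, "host_gene": "BRCA2"},
]

def search(chrom=None, start=None, end=None, host_gene=None):
    """Return catalogue entry ids overlapping a region and/or matching a host gene."""
    hits = []
    for rec in CATALOGUE:
        if chrom is not None and rec["chrom"] != chrom:
            continue
        if start is not None and rec["end"] < start:      # ends before region
            continue
        if end is not None and rec["start"] > end:        # starts after region
            continue
        if host_gene is not None and rec["host_gene"] != host_gene:
            continue
        hits.append(rec["id"])
    return hits

print(search(chrom="1", start=4000, end=10000))  # ['circ_0001', 'circ_0002']
print(search(host_gene="BRCA2"))                 # ['circ_0001', 'circ_0003']
```

In the real application the same filters would be pushed down to MySQL/NoSQL queries rather than run over a Python list.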

Expected results

  • Design and development of a web application with functionalities for:
    • assessing the distribution and abundance of circRNAs
    • circRNA isoform comparisons across samples
    • a search feature based on genome coordinates and host gene names
    • visual representation of inferred circRNA structures
    • links to the Ensembl genome browser

Required knowledge

  • Python/JavaScript
  • NodeJS (or any modern web development framework)
  • MySQL or NoSQL

Difficulty

Medium

Mentors

Osagie Izuogu, Fergal Martin

Using Deep Learning techniques to enhance orthology calls

Brief Explanation

All Ensembl gene sequences are compared against the TreeFam HMM library in order to classify them into clusters that will be used to produce gene trees, infer homologues and produce gene families. These trees represent the evolutionary history of gene families, which evolved from a common ancestor:
http://www.ensembl.org/info/genome/compara/homology_method.html

Reconciliation of these gene trees against the species trees allows us to distinguish duplication and speciation events, resulting in different types of homologues (orthologues and paralogues): http://www.ensembl.org/info/genome/compara/homology_types.html

This project aims to apply machine learning algorithms such as deep neural networks to validate the homologies predicted with our method, and to infer new ones based on other properties of the data that are not currently considered (such as local synteny, divergence rates, etc.).
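One of the candidate features mentioned above, local synteny, can be expressed as the similarity of the gene families flanking two candidate orthologues. A minimal sketch (the window size, family assignment and hard-coded family names are all illustrative assumptions):

```python
def local_synteny_score(neighbours_a, neighbours_b):
    """Jaccard similarity of gene-family sets flanking two candidate orthologues.

    neighbours_a / neighbours_b: sets of gene-family identifiers found within
    some window around each gene; how the window and families are defined is
    left to the pipeline.
    """
    union = neighbours_a | neighbours_b
    if not union:
        return 0.0
    return len(neighbours_a & neighbours_b) / len(union)

# Two genes whose neighbourhoods share 2 of 4 families
print(local_synteny_score({"fam1", "fam2", "fam3"}, {"fam2", "fam3", "fam4"}))  # 0.5
```

Scores like this would form one column of the feature matrix fed to the learning algorithm, alongside divergence rates and the existing homology calls.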

Expected results

  • Identify which parameters (features) can be used by the learning algorithm
  • Gather the data (orthology calls and properties)
  • Build a model capable of predicting and validating orthologies based on their properties

Required knowledge

  • Machine learning (TensorFlow framework)
  • Python

Difficulty

Hard

Mentors

Mateus Patricio, Matthieu Muffato

Pan-genome gene content analysis

Brief Explanation

For many species, including plants, researchers are now producing not one but several genome assemblies, which are then combined to define what is called a pan-genome. Building on the Compara framework, the student will test different whole-genome alignment strategies and nucleotide gene phylogenies to produce a gene presence-absence variation (PAV) matrix.
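The end product is easy to picture: one row per assembly, one column per gene family, with 1/0 marking presence or absence. A minimal sketch with toy data (assembly and gene-family names are invented for illustration):

```python
def pav_matrix(gene_sets):
    """Build a presence-absence matrix from per-assembly gene-family sets.

    gene_sets: dict mapping assembly name -> set of gene-family identifiers.
    Returns (sorted family list, dict of assembly -> 0/1 vector).
    """
    families = sorted(set().union(*gene_sets.values()))
    return families, {
        assembly: [1 if fam in genes else 0 for fam in families]
        for assembly, genes in gene_sets.items()
    }

# Toy pan-genome of three assemblies
families, matrix = pav_matrix({
    "asm1": {"geneA", "geneB"},
    "asm2": {"geneB", "geneC"},
    "asm3": {"geneA", "geneB", "geneC"},
})
print(families)        # ['geneA', 'geneB', 'geneC']
print(matrix["asm1"])  # [1, 1, 0]
```

The real work of the project is upstream of this step: deciding, via whole-genome alignments and gene phylogenies, which genes across assemblies belong to the same family.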

Expected results

  • A workflow to extract a gene presence-absence matrix from a set of genomes from the same or closely related species and produce a flat file over the REST API
  • A script to produce a simple, high-resolution plot of the PAV matrix

Required knowledge

  • Python/Java/Perl/R
  • Unix HPC environment
  • Data analysis, bioinformatics

Difficulty

Medium

Mentors

Matthieu Muffato, Steve Trevanion, Bruno Contreras

Modelling of external references

Brief Explanation

There are multiple databases of biological data available, sometimes presenting comparable data from a different angle. Ensembl provides links to equivalent data in other databases in the form of external references. Mapping to these resources can be very heterogeneous: via coordinate overlap, sequence alignment, manually curated links or even third-party links. For reproducibility reasons, it is important to be able to report how a link was generated. The information is available in a graph database, but we would like to store it in a traditional relational database.
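A minimal relational sketch of the idea, using SQLite for convenience. The table and column names are invented for illustration and are not the actual Ensembl schema; the point is that every external reference carries a foreign key to the method that produced it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mapping_method (
    method_id   INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,          -- e.g. 'coordinate_overlap'
    description TEXT
);
CREATE TABLE external_reference (
    xref_id          INTEGER PRIMARY KEY,
    ensembl_id       TEXT NOT NULL,     -- Ensembl stable identifier
    source_db        TEXT NOT NULL,     -- external database name
    source_accession TEXT NOT NULL,
    method_id        INTEGER NOT NULL REFERENCES mapping_method(method_id)
);
""")
conn.execute("INSERT INTO mapping_method VALUES (1, 'sequence_alignment', NULL)")
conn.execute(
    "INSERT INTO external_reference VALUES (1, 'ENSG00000139618', 'UniProt', 'P51587', 1)"
)
# For any link, we can now report how it was generated
row = conn.execute("""
    SELECT m.name FROM external_reference x
    JOIN mapping_method m ON m.method_id = x.method_id
    WHERE x.ensembl_id = 'ENSG00000139618'
""").fetchone()
print(row[0])  # sequence_alignment
```

The project would flesh this out to cover all the mapping styles listed above, including multi-step third-party links.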

Expected results

  • Identify what data should be recorded from external sources
  • Describe a relational schema to store all the relevant information

Required knowledge

  • RDBMS
  • Scripting
  • (desirable) RDF

Difficulty

Easy – medium

Mentors

Magali Ruffier

Genomic data and flat files

Brief Explanation

Genomics data is provided in a number of different file formats with loose specifications (for example GFF3, GTF, EMBL). As a result, the same data can be represented differently depending on who generated the file and when. Additionally, different analysis software has different expectations of the format, requiring end users to repeatedly modify existing files to suit their use case. Ensembl has developed a tool, FileChameleon, that can facilitate these transformations and provide users directly with the end file they require. We would like to identify what transformations are needed and implement them accordingly.
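As a flavour of the kind of transformation involved, here is a sketch that rewrites a GTF attribute column (`key "value"; ...`) into GFF3 style (`key=value;...`). It covers only the simple, common case; the many edge cases in real files are exactly what a tool like FileChameleon has to handle:

```python
def gtf_to_gff3_attributes(gtf_attrs):
    """Convert a GTF attribute column to GFF3-style key=value pairs.

    Simplified sketch: assumes well-formed 'key "value";' fields and does
    not escape GFF3 reserved characters.
    """
    pairs = []
    for field in gtf_attrs.strip().strip(";").split(";"):
        key, _, value = field.strip().partition(" ")
        pairs.append(key + "=" + value.strip('"'))
    return ";".join(pairs)

print(gtf_to_gff3_attributes('gene_id "ENSG01"; gene_name "BRCA2";'))
# gene_id=ENSG01;gene_name=BRCA2
```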

Expected results

  • Identify a handful of typical workflows using genomics files
  • Implement additional options for FileChameleon to fulfill those workflows
  • Provide example configurations for complete workflows

Required knowledge

  • Perl
  • Genomics

Difficulty

Easy – medium

Mentors

Magali Ruffier

Automated Processing of Primary Genome Analysis

Brief explanation

Each year, EMBL-EBI’s ENA resource receives genome assembly submissions from around the world. With the Vertebrate Genomes Project and Genome 10K due to deliver ever-increasing numbers of genomes, automated processing of these submitted sequences will be essential. We wish to develop a system where submitted genomes have a number of primary analyses performed upon them, such as GC composition, CpG island detection or repeat analysis. Analyses will be described using the Common Workflow Language (CWL) and conducted using a CWL workflow engine.
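The GC count analysis named in the expected results is straightforward to sketch. A minimal pure-Python version that a CWL tool could wrap (the window size here is arbitrary):

```python
def gc_content(sequence, window=4):
    """GC fraction in non-overlapping windows across a DNA sequence.

    Trailing bases that do not fill a complete window are ignored.
    """
    fractions = []
    for i in range(0, len(sequence) - window + 1, window):
        chunk = sequence[i:i + window].upper()
        fractions.append(sum(base in "GC" for base in chunk) / window)
    return fractions

print(gc_content("ACGTGGCCAATT"))  # [0.5, 1.0, 0.0]
```

Generalising from this single analysis to "run any DNA-based analysis on any submitted sequence" is the core of the project.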

Expected results

  • Retrieval of DNA sequence from ENA (INSDC archive)
  • A CWL analysis workflow to perform a GC count across a DNA sequence
  • A Docker container holding said GC analysis
  • Development of the GC system into a generic framework to allow the running of any DNA-based analysis on a sequence
  • System for storing analysis results in long-term archives

Required knowledge

  • Python
  • Genomics

Difficulty

Medium

Mentors

Andy Yates

Data file search API

Brief explanation

As well as the main websites, the Ensembl and Ensembl Genomes projects provide hundreds of thousands of data files from over 40,000 genomes in a wide variety of different formats. Finding the correct files for a genome or collection of genomes can be challenging, and we’d like to provide a programmatic interface for searching, e.g. for all peptide FASTA files for rodents. This could be used directly by client code for bulk downloads or via a wizard. This project would cover selecting an appropriate indexing technology, implementing a pipeline for updating the public set, and designing and implementing an API that can be used to retrieve files matching a set of criteria. If time permits, a web interface for this service could also be developed.
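The query side of such an API can be sketched against a tiny in-memory index. The entries and field names below are illustrative; real metadata would come from the Ensembl FTP site and genome metadata services:

```python
# Hypothetical index entries (illustrative only)
INDEX = [
    {"species": "mus_musculus", "taxon_lineage": ["Rodentia"], "type": "fasta", "content": "pep"},
    {"species": "rattus_norvegicus", "taxon_lineage": ["Rodentia"], "type": "fasta", "content": "dna"},
    {"species": "homo_sapiens", "taxon_lineage": ["Primates"], "type": "fasta", "content": "pep"},
]

def find_files(lineage=None, file_type=None, content=None):
    """Return species whose files match all the supplied criteria."""
    return [
        entry["species"] for entry in INDEX
        if (lineage is None or lineage in entry["taxon_lineage"])
        and (file_type is None or entry["type"] == file_type)
        and (content is None or entry["content"] == content)
    ]

# e.g. all peptide FASTA files for rodents
print(find_files(lineage="Rodentia", file_type="fasta", content="pep"))  # ['mus_musculus']
```

Choosing the indexing technology that makes these lookups fast at the scale of 40,000+ genomes is one of the project’s main decisions.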

Expected results

  • Index of available files and metadata
  • Update pipeline
  • API
  • (Optional) Web interface

Required knowledge

  • RESTful API development
  • Perl/Python/Java
  • (desirable) web development (CSS / JavaScript / Angular)
  • (desirable) Python framework such as Flask and/or Django

Difficulty

Low-medium

Mentors

Marc Chakiachvili

Fetch nearest feature REST endpoint

Brief Explanation

It is not uncommon for researchers to look for the ‘nearest feature’, i.e. a gene or regulatory element close to a region of interest. This search can be restricted to the same strand or include the opposite strand, be upstream or downstream of the region of interest, include only non-overlapping features, be within 5 base pairs or 500… Ensembl already provides this functionality through the Perl API, but we wish to provide a language-agnostic tool for doing this, through the REST API.
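The underlying lookup is a classic binary-search problem. A simplified sketch, using only sorted feature start coordinates (the Perl API considers full feature extents, strands and overlap rules, which this deliberately ignores):

```python
import bisect

def nearest_feature(feature_starts, position, max_distance=None):
    """Return the start coordinate of the feature nearest to `position`.

    feature_starts: non-empty, sorted list of start coordinates.
    max_distance: optional cap, mirroring 'within 5 base pairs or 500'.
    """
    i = bisect.bisect_left(feature_starts, position)
    # The nearest start is either just before or just at/after the position
    candidates = feature_starts[max(0, i - 1):i + 1]
    best = min(candidates, key=lambda s: abs(s - position))
    if max_distance is not None and abs(best - position) > max_distance:
        return None
    return best

starts = [100, 1500, 3000]
print(nearest_feature(starts, 1400))                   # 1500
print(nearest_feature(starts, 1400, max_distance=50))  # None
```

The REST endpoint would expose the strand, direction and overlap options as query parameters and delegate the actual search to the existing Perl API.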

Expected results

  • A REST endpoint to retrieve the ‘nearest feature’ to a region of interest.

Required knowledge

  • RESTful API development
  • Ensembl Perl API
  • (desirable) genomics

Difficulty

Medium – hard

Mentors

Magali Ruffier

Applying machine learning techniques to characterising and naming lncRNA genes

Brief explanation

Although far less well understood, long non-coding RNA (lncRNA) genes are estimated to far outnumber protein-coding genes. This means that in-depth manual inspection of annotations, as currently performed on protein-coding genes, cannot scale up to lncRNAs. The aim of this project is to examine existing lncRNA annotations produced by RefSeq or Ensembl HAVANA and determine consistent annotations that are therefore worthy of an HGNC (HUGO Gene Nomenclature Committee) approved gene symbol and name. We are currently compiling a hand-curated dataset to serve as training data, so this project will focus on machine learning, although some background knowledge in molecular biology will be useful for feature design.
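One plausible candidate feature, offered here only as an illustration of feature design (it is not the project’s actual feature set), is base-level agreement between two annotations of the same locus:

```python
def annotation_agreement(exons_a, exons_b):
    """Base-level Jaccard agreement between two exon annotations of a locus.

    exons_*: lists of (start, end) pairs, 0-based half-open. A set-based
    sketch; interval arithmetic would be used for real chromosome-scale data.
    """
    def covered(exons):
        bases = set()
        for start, end in exons:
            bases.update(range(start, end))
        return bases

    a, b = covered(exons_a), covered(exons_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Two single-exon annotations overlapping by half their length
print(annotation_agreement([(0, 100)], [(50, 150)]))  # 0.3333333333333333
```

A high score between RefSeq and Ensembl HAVANA models of the same lncRNA would support treating the annotation as consistent, and hence a candidate for an HGNC symbol.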

Expected results

  • Automatic extraction of candidate gene properties
  • Examination of training dataset
  • Running the method genome-wide

Required knowledge

  • Basic molecular biology
  • Machine learning
  • Python/R

Difficulty

Hard

Mentors

Daniel Zerbino

Semantic Spreadsheets

Brief Explanation

Metadata for biological data is complex and varied, and as such requires controlled terminology in the form of ontologies to ease the interpretation and interoperability of this data by machines. Many ontologies have been developed for the biomedical sciences that provide standard terminology for describing everything from gene function, anatomy and development through to disease, phenotype and environment. The tool of choice for wrangling biological metadata is still the spreadsheet, therefore tools that can assist bio-curators in aligning metadata in spreadsheets to ontologies are in high demand. In this project we aim to build extensions for GoogleSheets that support the automated and semi-automated mapping of data to ontologies by connecting GoogleSheets to the ontology APIs provided by the EMBL-EBI Ontology Lookup Service (OLS). Such a service would significantly reduce the time and effort spent mapping data in spreadsheets to ontologies and would significantly improve the quality and utility of data shared in public archives.
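The add-on would mostly be thin glue around the OLS search endpoint. A sketch of building such a request (in Python rather than AppScript, for brevity; the `q` and `ontology` parameters follow the public OLS REST API, but check the current OLS documentation before relying on them):

```python
from urllib.parse import urlencode

def ols_search_url(term, ontologies=None):
    """Build an OLS search URL for a free-text term, optionally
    restricted to a subset of ontologies (e.g. a community standard)."""
    params = {"q": term}
    if ontologies:
        params["ontology"] = ",".join(ontologies)
    return "https://www.ebi.ac.uk/ols/api/search?" + urlencode(params)

print(ols_search_url("liver", ontologies=["uberon", "efo"]))
# https://www.ebi.ac.uk/ols/api/search?q=liver&ontology=uberon%2Cefo
```

In the GoogleSheets add-on, the same request would be issued per cell value and the top-ranked term identifier written back next to the free-text label.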

Expected results

  • Develop a GoogleSheets add-on for searching ontologies directly in GoogleSheets
  • Support the automatic annotation of data in GoogleSheets to ontology terms
  • Allow the service to be configured to restrict ontology selections to a subset of ontologies based on a particular community standard

Required knowledge

  • JavaScript or Google Apps Script
  • Biomedical ontologies

Difficulty

Medium

Mentors

Simon Jupp and Daniel Zerbino

eHive workflow management system

Ensembl’s production is powered by the eHive workflow management system. It is responsible for scheduling and executing in excess of 450 CPU years of compute per year.

Support for Common Workflow Language (CWL) in eHive

Brief Explanation

CWL is a recent effort to define a common language to describe workflows. CWL is increasingly supported by workflow management systems, and we would also like to support it in eHive (our own, older, workflow management system). Extending the CWL language to add the features and capabilities that eHive has is a massive task that needs deep cooperation between both projects. As a first step, we’d like to have two tools: an eHive Runnable that is able to execute a CWL component (delegating this to the reference CWL implementation), and a CWL component to execute an eHive Runnable within a CWL workflow (delegating the actual execution to eHive reference tools).
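For orientation, this is the kind of CWL component a CWLCmd Runnable would hand to the reference runner. A minimal sketch wrapping `wc -l` (not an actual eHive component):

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: wc
arguments: ["-l"]
inputs:
  infile:
    type: File
    inputBinding:
      position: 1
outputs:
  line_count:
    type: stdout
stdout: line_count.txt
```

The eHive Runnable’s job would be to serialise its parameters into a CWL job file, invoke the reference implementation on documents like this one, and map the outputs back onto the eHive blackboard.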

Expected results

  • A CWLCmd eHive Runnable to execute a CWL workflow
  • A HiveStandaloneJob CWL component to execute an eHive Runnable

Required knowledge

  • Perl and Python

Difficulty

Medium

Mentors

Matthieu Muffato

Support for generic job schedulers in eHive

Brief Explanation

eHive has native support for the Platform LSF and Grid Engine job schedulers. We know that some of our users have written modules to support other schedulers such as Condor, Torque, etc. There is in fact an alternative and more generic approach: supporting generic job-scheduler interfaces such as DRMAA or GRAM. The goal of this project is to compare both technologies, decide which one(s) would work better for us and our users, and implement the backend.

Expected results

  • Set up a test environment for the new job schedulers (for instance Docker images)
  • Implement an eHive backend for the selected ones

Required knowledge

  • Linux setup (installation of packages, etc) and Docker
  • Concept of job scheduler
  • Perl

Difficulty

Medium

Mentors

Matthieu Muffato

Next Generation Process Logging in eHive

Brief Description

Central to eHive is the job blackboard, which coordinates the state and assignment of work within a pipeline; seeing the state of the pipeline currently requires querying its MySQL database. Our aim is to enable other methods of responding to the state of a pipeline and to enable extensive cross-pipeline logging of processes. We believe a combination of modern logging toolkits (fluentd/logstash/flume), messaging queues and a log-analysis dashboard (Kibana/Kibi) could provide this. We wish to develop this platform and combine it with our in-house pipeline analysis toolkit, gui-hive.
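Forwarders such as fluentd typically ingest tagged, timestamped JSON events. A sketch of what an eHive job state change might look like in that shape (field names are illustrative, not an agreed schema):

```python
import json
import time

def job_event(pipeline, job_id, status):
    """Format an eHive job state change as a tagged JSON event for a log forwarder."""
    return json.dumps({
        "tag": "ehive." + pipeline,   # routing tag for the forwarder
        "time": int(time.time()),     # Unix timestamp
        "job_id": job_id,
        "status": status,
    })

event = job_event("repeat_annotation", 42, "DONE")
print(json.loads(event)["status"])  # DONE
```

Emitting events like this from eHive, instead of (or alongside) blackboard updates, is what would let a Kibana-style dashboard follow many pipelines at once.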

Expected Results

  • eHive to be able to log events to log forwarder
  • Have an analysis dashboard available to query reading data from eHive pipelines
  • Combine with gui-hive to provide a single interface

Required knowledge

  • Perl
  • Concepts of log-forwarding and log-analysis
  • JavaScript

Difficulty

Hard

Mentors

Matthieu Muffato, Andy Yates

Support for OpenStack cloud instances

Brief Explanation

eHive is currently used on physical compute clusters managed by LSF and SGE, which often have a fixed size. Cloud providers, on the other hand, have custom APIs to manage the size of the cluster. We want to build an OpenStack installation and an OpenStack backend for eHive, in order to 1) run eHive pipelines on an OpenStack cloud and 2) let eHive manage the (de)allocation of nodes. The goal is to call the OpenStack API directly from eHive, removing the need for a job scheduler.

Expected results

  • A recipe to build an OpenStack VM (for instance using packer)
  • A new backend in eHive to control an OpenStack cloud

Required knowledge

  • OpenStack API
  • Perl

Difficulty

Hard

Mentors

Matthieu Muffato

Graphical editor of eHive Workflows as XML documents – Follow up

Brief Explanation

eHive workflows are currently described as specially-structured Perl modules, and we want to move to a more standard structured format like XML. We have a draft schema specification (in RNG) of the XML files we want to represent. We’d now like to have a graphical interface to design workflows following the specification.

As part of last year’s GSoC programme, a student developed an interface based on Google’s Blockly library; see it on GitHub. The interface supports a number of elements from the RNG grammar, allows the creation of a new XML document from scratch, and includes a validation step (the resulting XML is checked against the RNG specification). What’s left to do is the loading of an existing XML document into the interface. This requires parsing the XML file against the RNG specification and/or the Blockly blocks, instantiating new blocks, filling in their values and connecting them.

Expected results

  • Learn to use the Blockly library
  • Write a tool to load an XML file and create the matching diagram

Required knowledge

  • JavaScript

Difficulty

Medium

Mentors

Matthieu Muffato