Google Summer of Code 2018

Ensembl is proud to be part of Google Summer of Code 2018, allowing us to support students and encourage them to contribute to our code base. Below is a list of projects for inspiration, but all suggestions are welcome. If you wish to discuss any of these projects or your own ideas, please email us, and we can help you craft your proposal to Google. The deadline for applications, which should be submitted via the Google Summer of Code website, is 27 March.

Bioinformatic analysis

WiggleTools

Brief Description

The Ensembl genome browsers ( http://www.ensembl.org and http://www.ensemblgenomes.org ) allow users to explore genomes at high resolution. However, the number of annotations and experimental datasets is growing exponentially, and displaying many of them in a browser becomes unwieldy. We therefore developed a set of libraries to compute statistics on large collections of such datasets, either on the fly or in bulk.

We wish to improve the usability of this command-line tool and extend its functionality, possibly in Python or R.
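A thin wrapper around the command-line tool might start from something like the Python sketch below. It assumes the wiggletools executable is on the PATH and that an expression such as "mean a.bw b.bw" writes wiggle-format text to standard output; the function name and file names are illustrative only.

    import shutil
    import subprocess

    def run_wiggletools(expression, binary="wiggletools"):
        """Run a WiggleTools expression (e.g. "mean a.bw b.bw") and return
        the wiggle-format text that the tool writes to stdout."""
        if shutil.which(binary) is None:
            raise RuntimeError("{} not found on PATH".format(binary))
        result = subprocess.run(
            [binary] + expression.split(),
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    # e.g. print(run_wiggletools("mean sample1.bw sample2.bw"))

From such a wrapper, richer return types (parsed intervals, plots, data frames) could then be layered on top.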

Expected Results

  • Design, implement and test a wrapper API for WiggleTools.
  • Extend the API to present the WiggleTools output in more convenient formats (plots, objects, etc.).

Required Knowledge

  • Python or R
  • Plot functions

Difficulty

Simple

Mentors

Daniel Zerbino

Fetch nearest feature REST endpoint

Brief Explanation

It is not uncommon for researchers to look for the ‘nearest feature’, i.e. a gene or regulatory element close to a region of interest. This search can be restricted to the same strand or include the opposite strand, be upstream or downstream of the region of interest, include only non-overlapping features, be within 5 base pairs or 500… Ensembl already provides this functionality through the Perl API, but we wish to provide a language-agnostic tool for doing this through the REST API.
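To make the goal concrete, a client call against such an endpoint might look like the Python sketch below. The path, parameter names and response shape are purely hypothetical (designing the actual endpoint is the point of the project), but the request pattern follows the existing Ensembl REST API conventions.

    import requests

    SERVER = "https://rest.ensembl.org"

    def nearest_feature(region, feature="gene", limit=1, stream="both"):
        """Query a hypothetical /nearest/region endpoint; the path and the
        parameters used here are illustrative only."""
        url = "{}/nearest/region/human/{}".format(SERVER, region)
        params = {"feature": feature, "limit": limit, "stream": stream}
        r = requests.get(url, params=params,
                         headers={"Content-Type": "application/json"})
        r.raise_for_status()
        return r.json()

    # e.g. nearest_feature("7:140424943-140424963", feature="gene")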

Expected results

  • A REST endpoint to retrieve the ‘nearest feature’ to a region of interest.

Required knowledge

  • RESTful API development
  • Ensembl Perl API
  • (desirable) genomics

Difficulty

Medium – hard

Mentors

Magali Ruffier

Transcript Comparisons

Brief Explanation

Importance: Identifying orthologous proteins between two species, for example human and Mus musculus, is important for researchers and pharmaceutical companies who wish to translate findings from one species to another. For example, which protein in mouse is likely to function most similarly to the target protein in human?

Goal: This process requires first identifying the orthologous genes and then identifying the most closely matched isoforms from the pool of transcripts annotated for each gene of the pair. Ensembl has gene and transcript annotations for a large number of species, and we already pre-compute gene-trees for all of these genes. However, the gene-trees are calculated using only one representative protein per gene. We wish to develop a system whereby users can select their transcript of choice from one species and we will identify its closest match in a second species. This proposal therefore aims to compare two protein-coding gene loci annotated as orthologues by Ensembl, and to automate the selection of functionally orthologous proteins.

Why is this difficult? Simple orthologue calls can become quite complex once one sees the number of transcripts annotated at a single locus. Not all isoforms are fully annotated, so they may lack important functional domains or splicing patterns, may not fold into a structure similar to the human target, and may not be expressed in the tissues of interest.
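As a very rough illustration of the kind of decision process the project would need to design, the toy Python sketch below scores candidate transcripts against a query transcript using a few of the criteria listed under the expected results. The attributes, weights and scoring rules are placeholders only; choosing and calibrating the real criteria is the core of the project.

    from dataclasses import dataclass, field

    @dataclass
    class Transcript:
        stable_id: str
        domains: list             # ordered list of protein-domain names
        exon_count: int
        sequence_identity: float  # % identity to the query protein, precomputed
        expressed_tissues: set = field(default_factory=set)

    def score_pair(query, candidate, tissues_of_interest=frozenset()):
        """Toy scoring scheme combining domain presence/order, exon count,
        sequence identity and tissue expression; weights are arbitrary."""
        shared = len(set(query.domains) & set(candidate.domains))
        domain_score = shared / max(len(query.domains), 1)
        order_score = 1.0 if query.domains == candidate.domains else 0.0
        exon_score = max(
            1.0 - abs(query.exon_count - candidate.exon_count) / max(query.exon_count, 1),
            0.0,
        )
        expression_score = 1.0 if tissues_of_interest & candidate.expressed_tissues else 0.0
        return (0.3 * domain_score + 0.2 * order_score + 0.2 * exon_score
                + 0.2 * candidate.sequence_identity / 100.0 + 0.1 * expression_score)

    def rank_candidates(query, candidates, tissues_of_interest=frozenset()):
        """Rank the transcripts of the orthologous gene by descending score."""
        return sorted(candidates,
                      key=lambda t: score_pair(query, t, tissues_of_interest),
                      reverse=True)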

Expected results

  • Design a decision process for identifying orthologous proteins, including protein domain presence, domain order, exon count, sequence identity, and tissue-specific expression
  • For each transcript in a gene A, compare it to all transcripts in its orthologous gene B, using the above criteria
  • Create a table of results for each transcript in gene A, with orthologous proteins listed, scored and ranked
  • Allow further visual inspection of the results in a widget, e.g. sequence alignments, tissue expression

Required knowledge

  • Accessing data with the Ensembl Perl API
  • Running pipelines on a compute cluster
  • RNA-seq expression quantification analysis
  • Javascript/web application development

Difficulty

Medium-hard

Mentors

Fergal Martin

Modelling of external references

Brief Explanation

There are multiple databases of biological data available, sometimes presenting comparable data from a different angle. Ensembl provides links to equivalent data in other databases in the form of external references. Mapping to these resources can be very heterogeneous: via coordinate overlap, sequence alignment, manually curated links or even third-party links. For reproducibility, it is important to be able to report how a link was generated. This information is currently held in a graph database, but we would like to store it in a traditional relational database.
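A minimal sketch of what such a relational model could look like is shown below, using SQLite for illustration. All table and column names here are invented for the example; working out which attributes actually need to be recorded is part of the project.

    import sqlite3

    # Illustrative schema: one table for the external sources, one for the
    # links themselves, and one recording how each link was generated.
    SCHEMA = """
    CREATE TABLE external_db (
        external_db_id INTEGER PRIMARY KEY,
        name           TEXT NOT NULL,
        release        TEXT
    );
    CREATE TABLE xref (
        xref_id        INTEGER PRIMARY KEY,
        external_db_id INTEGER NOT NULL REFERENCES external_db(external_db_id),
        accession      TEXT NOT NULL,
        ensembl_id     TEXT NOT NULL
    );
    CREATE TABLE xref_provenance (
        xref_id        INTEGER NOT NULL REFERENCES xref(xref_id),
        method         TEXT NOT NULL CHECK (method IN
                         ('coordinate_overlap', 'sequence_alignment',
                          'manual_curation', 'third_party')),
        evidence       TEXT,
        mapped_on      TEXT
    );
    """

    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)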

Expected results

  • Identify what data should be recorded from external sources
  • Describe a relational schema to store all the relevant information

Required knowledge

  • RDBMS
  • Scripting
  • (desirable) RDF

Difficulty

Easy – medium

Mentors

Magali Ruffier

Genomic data and flat files

Brief Explanation

Genomics data is provided in a number of different file formats with loose specifications (for example GFF3, GTF, EMBL). As a result, the same data can be represented differently depending on who generated the file and when. Additionally, different analysis software has different expectations of the format, requiring end users to repeatedly modify existing files to suit their use case. Ensembl has developed a tool, FileChameleon, that can facilitate these transformations and provide users directly with the end file they require. We would like to identify what transformations are needed and implement them accordingly.
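The Python sketch below gives the flavour of such a transformation: renaming seq-region names in a GTF file (e.g. "1" to "chr1") so that it matches the naming convention a downstream tool expects. It is a standalone illustration rather than part of FileChameleon itself, and the file names are placeholders.

    import gzip

    def add_chr_prefix(in_gtf, out_gtf):
        """Rewrite a (possibly gzipped) GTF file so every seq-region name
        carries a "chr" prefix; comment lines are copied unchanged."""
        opener = gzip.open if in_gtf.endswith(".gz") else open
        with opener(in_gtf, "rt") as src, open(out_gtf, "w") as dst:
            for line in src:
                if line.startswith("#"):
                    dst.write(line)
                    continue
                fields = line.split("\t")
                if not fields[0].startswith("chr"):
                    fields[0] = "chr" + fields[0]
                dst.write("\t".join(fields))

    # e.g. add_chr_prefix("Homo_sapiens.gtf.gz", "Homo_sapiens.chr.gtf")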

Expected results

  • Identify a handful of typical workflows using genomics files
  • Implement additional options for FileChameleon to fulfill those workflows
  • Provide example configurations for complete workflows

Required knowledge

  • Perl
  • Genomics

Difficulty

Easy – medium

Mentors

Magali Ruffier

Community annotation of gene function

Brief explanation

While some species, such as human, model organisms and key economically important species, benefit from thoroughly curated databases that describe gene function, many other species, in particular plants, still lack such a resource. We propose to use the Wikidata project from the Wikimedia Foundation to create community-managed resources.

Expected results

We wish to set up a space within Wikidata where all known genes would have an automatically created page pre-filled with automatically computed results.
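As an example of how Wikidata can already be queried programmatically, the Python sketch below retrieves a few gene items through the public SPARQL endpoint. The item and property identifiers (Q7187 for "gene", P31 for "instance of") are believed correct but should be double-checked against Wikidata; the project would build automatic page creation and pre-filling on top of queries of this kind.

    import requests

    SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

    # List a handful of items that are an instance of "gene".
    QUERY = """
    SELECT ?gene ?geneLabel WHERE {
      ?gene wdt:P31 wd:Q7187 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    LIMIT 10
    """

    r = requests.get(SPARQL_ENDPOINT, params={"query": QUERY, "format": "json"})
    r.raise_for_status()
    for row in r.json()["results"]["bindings"]:
        print(row["gene"]["value"], row.get("geneLabel", {}).get("value", ""))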

Required knowledge

  • Wikidata
  • MySQL

Difficulty

Easy – Medium

Mentors

Dan Bolser

Automated Processing of Primary Genome Analysis

Brief explanation

Each year EMBL-EBI’s ENA resource receives genome assembly submissions from around the world. With the Vertebrate Genomes Project and Genome 10K due to deliver ever-increasing numbers of genomes, automated processing of these submitted sequences will be essential. We wish to develop a system where submitted genomes have a number of primary analyses performed on them, such as GC composition, CpG island detection or repeat analysis. Analyses will be distributed using the Common Workflow Language (CWL) and executed with a CWL workflow engine.
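The simplest of these primary analyses, overall GC composition, is sketched in Python below as a stand-in. The real analysis would be wrapped as a CWL tool, stream sequence retrieved from ENA and report per-sequence values; the file name here is a placeholder.

    def gc_fraction(fasta_path):
        """Return the overall GC fraction of all sequences in a FASTA file,
        ignoring ambiguous bases."""
        gc = at = 0
        with open(fasta_path) as fh:
            for line in fh:
                if line.startswith(">"):   # skip FASTA headers
                    continue
                seq = line.strip().upper()
                gc += seq.count("G") + seq.count("C")
                at += seq.count("A") + seq.count("T")
        total = gc + at
        return gc / total if total else 0.0

    # e.g. print(gc_fraction("assembly.fa"))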

Expected results

  • Retrieval of DNA sequence from ENA (INSDC archive)
  • A CWL analysis workflow to perform a GC count across a DNA sequence
  • A Docker container holding said GC analysis
  • Development of the GC system into a generic framework allowing any DNA-based analysis to be run on a sequence
  • A system for storing analysis results in long-term archives

Required knowledge

  • Python
  • Genomics

Difficulty

Medium

Mentors

Andy Yates

eHive workflow management system

Ensembl’s production is powered by the eHive workflow management system. It is responsible for scheduling and executing in excess of 450 CPU years of compute per year.

Support for Common Workflow Language (CWL) in eHive

Brief Explanation

CWL is a recent effort to define a common language to describe workflows. CWL is increasingly supported by workflow management systems, and we would like to support it in eHive (our own, older, workflow management system) as well. Extending the CWL language to add the features and capabilities that eHive has is a massive task that needs deep cooperation between both projects. As a first step, we’d like to have two tools: an eHive Runnable that is able to execute a CWL component (delegating this to the reference CWL implementation), and a CWL component to execute an eHive Runnable within a CWL workflow (delegating the actual execution to eHive’s reference tools).
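A minimal sketch of the first of these tools is shown below, written against eHive’s Python guest-language wrapper and shelling out to cwltool, the reference CWL implementation. The class name, parameter names and the assumption that cwltool is installed and prints its output object as JSON on stdout are all part of the sketch, not a finished design.

    import json
    import subprocess

    import eHive  # eHive's Python guest-language wrapper (ensembl-hive/wrappers/python3)

    class CWLCmd(eHive.BaseRunnable):
        """Sketch of a Runnable that delegates one job to cwltool."""

        def run(self):
            cwl_file = self.param_required('cwl_file')   # CWL tool/workflow document
            job_file = self.param_required('job_file')   # YAML/JSON job order
            result = subprocess.run(
                ['cwltool', cwl_file, job_file],
                capture_output=True, text=True, check=True,
            )
            # cwltool prints a JSON description of the outputs on stdout
            self.param('cwl_outputs', json.loads(result.stdout))

        def write_output(self):
            # flow the collected outputs to the next analysis on branch #1
            self.dataflow({'cwl_outputs': self.param('cwl_outputs')}, 1)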

Expected results

  • A CWLCmd eHive Runnable to execute a CWL workflow
  • A HiveStandaloneJob CWL component to execute an eHive Runnable

Required knowledge

  • Perl and Python

Difficulty

Medium

Mentors

Matthieu Muffato

Support for generic job schedulers in eHive

Brief Explanation

eHive has native support for the Platform LSF and Grid Engine job schedulers. We know that some of our users have written modules to support other schedulers such as Condor, Torque, etc. There is in fact an alternative and more generic approach: supporting generic job-scheduling interfaces such as DRMAA or GRAM. The goal of this project is to compare both technologies, decide which one(s) would work better for us and our users, and implement the backend.
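To give a feel for the DRMAA route, the sketch below submits a single job and waits for it using the drmaa Python bindings. It assumes a DRMAA-capable scheduler and its DRMAA library are installed; an eHive meadow backend would wrap calls of this kind (and would be written in Perl, so this is only an illustration of the interface).

    import drmaa  # the "drmaa-python" bindings

    def submit_and_wait(command, args=()):
        """Submit one job through DRMAA, block until it finishes and return
        its exit status."""
        with drmaa.Session() as session:
            jt = session.createJobTemplate()
            jt.remoteCommand = command
            jt.args = list(args)
            job_id = session.runJob(jt)
            info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
            session.deleteJobTemplate(jt)
            return info.exitStatus

    # e.g. submit_and_wait('/bin/sleep', ['10'])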

Expected results

  • Set up a test environment for the new job schedulers (for instance Docker images)
  • Implement an eHive backend for the selected one(s)

Required knowledge

  • Linux setup (installation of packages, etc) and Docker
  • Concepts of job scheduling
  • Perl

Difficulty

Medium

Mentors

Matthieu Muffato

Next Generation Process Logging in eHive

Brief Description

Central to eHive is the job blackboard, which coordinates the state and assignment of work within a pipeline via a MySQL database; seeing the state of a pipeline therefore requires querying that database. Our aim is to enable other methods of responding to the state of a pipeline and to enable extensive cross-pipeline logging of processes. We believe a combination of modern logging toolkits (fluentd/logstash/flume), message queues and a log-analysis dashboard (Kibana/Kibi) could provide this. We wish to develop this platform and combine it with our in-house pipeline analysis toolkit gui-hive.
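The essence of the first step is emitting structured, forwarder-friendly events rather than rows in a database. The Python sketch below shows the idea with the standard logging module, writing one JSON object per event; the field names are illustrative, and the eHive implementation itself would be in Perl.

    import json
    import logging
    import time

    class JsonFormatter(logging.Formatter):
        """Render each event as one JSON object per line, a format that log
        forwarders such as fluentd, logstash or flume can ingest directly."""
        def format(self, record):
            return json.dumps({
                "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S",
                                           time.gmtime(record.created)),
                "level": record.levelname,
                "message": record.getMessage(),
                # extra fields an eHive worker might attach (illustrative names)
                **getattr(record, "ehive", {}),
            })

    logger = logging.getLogger("ehive")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("job finished",
                extra={"ehive": {"pipeline": "demo", "analysis": "gc_count",
                                 "job_id": 42, "status": "DONE"}})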

Expected Results

  • Enable eHive to log events to a log forwarder
  • Provide an analysis dashboard for querying data from eHive pipelines
  • Combine with gui-hive to provide a single interface

Required knowledge

  • Perl
  • Concepts of log-forwarding and log-analysis
  • JavaScript

Difficulty

Hard

Mentors

Matthieu Muffato, Andy Yates

Support for OpenStack cloud instances

Brief Explanation

eHive is currently used on physical compute clusters managed by LSF and SGE, which often have a fixed size. Cloud providers, on the other hand, have custom APIs to manage the size of the cluster. We want to build an OpenStack installation and an OpenStack backend for eHive, to be able to 1) run eHive pipelines on an OpenStack cloud and 2) let eHive manage the (de)allocation of nodes. The goal is to call the OpenStack API directly from eHive, removing the need for a job scheduler.
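The sketch below shows the kind of calls such a backend would need to make, using the openstacksdk Python package to allocate and release worker VMs. The cloud, image, flavour, network and key names are placeholders, and the real backend would live inside eHive (in Perl) with proper tracking of the servers it creates.

    import openstack  # the "openstacksdk" package

    # Credentials are taken from clouds.yaml / the environment; "mycloud" is illustrative.
    conn = openstack.connect(cloud="mycloud")

    def grow_cluster(n, image_name, flavor_name, network_name, key_name):
        """Allocate n worker VMs and wait until they are active."""
        image = conn.compute.find_image(image_name)
        flavor = conn.compute.find_flavor(flavor_name)
        network = conn.network.find_network(network_name)
        servers = []
        for i in range(n):
            server = conn.compute.create_server(
                name="ehive-worker-{}".format(i),
                image_id=image.id,
                flavor_id=flavor.id,
                networks=[{"uuid": network.id}],
                key_name=key_name,
            )
            servers.append(conn.compute.wait_for_server(server))
        return servers

    def shrink_cluster(servers):
        """Release the workers once the pipeline no longer needs them."""
        for server in servers:
            conn.compute.delete_server(server)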

Expected results

  • A recipe to build an OpenStack VM (for instance using packer)
  • A new backend in eHive to control an OpenStack cloud

Required knowledge

  • OpenStack API
  • Perl

Difficulty

Hard

Mentors

Matthieu Muffato

Graphical editor of eHive Workflows as XML documents – Follow up

Brief Explanation

eHive workflows are currently described as specially-structured Perl modules, and we want to move to a more standard structured format like XML. We have a draft schema specification (in RNG) of the XML files we want to represent. We’d now like to have a graphical interface to design workflows following the specification.

As part of last year’s GSoC programme, a student developed an interface based on Google’s Blockly library; see it on GitHub. It supports a number of elements from the RNG grammar, allows the creation of a new XML document from scratch, and includes a validation step (the resulting XML is checked against the RNG specification). What is left to do is the loading of an existing XML document into the interface. This requires parsing the XML file against the RNG specification and/or the Blockly blocks, instantiating new blocks, filling in their values and connecting them.
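The loading step itself would be implemented in JavaScript alongside Blockly, but the validation half of it is easy to prototype separately; the Python sketch below checks a workflow XML document against the draft RNG schema with lxml, using placeholder file names.

    from lxml import etree

    def validate_workflow(xml_path, rng_path):
        """Validate an eHive workflow XML document against the RNG schema and
        return the parsed tree if it conforms."""
        schema = etree.RelaxNG(etree.parse(rng_path))
        doc = etree.parse(xml_path)
        if not schema.validate(doc):
            raise ValueError(str(schema.error_log))
        return doc

    # e.g. doc = validate_workflow("workflow.xml", "ehive_workflow.rng")
    # The parsed tree could then be walked to instantiate the matching Blockly blocks.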

Expected results

  • Learn to use the Blockly library
  • Write a tool to load an XML file and create the matching diagram

Required knowledge

  • JavaScript

Difficulty

Medium

Mentors

Matthieu Muffato