Google Summer of Code ’17

This year sees the return of Google’s Summer of Code project. We are inviting creative students to help contribute new features to the Ensembl Genome Browser and its underlying bioinformatic infrastructure. Specifically, we are looking for volunteers to develop tools and services around our genomic databases. You may invent your own project or build one around any of our suggestions, which centre on:

  • Visualisation
  • RESTful interfaces
  • Bioinformatics
  • Bioinformatics pipelines.

Knowledge of biology is not required for any of the projects, although you will probably pick up basic notions as you go along.

How to Apply?

Start early and pay attention to Google’s student manual and the timeline. The student application deadline is April 3rd, so do not get bitten by timezones or random internet outages.

Take some time to learn more about us: you can click around the Ensembl Genome Browser, check out all our repositories on GitHub, listen to us chatter on our public developers’ mailing list, or drop us an e-mail. Look at the project ideas below to get a feel for the kind of projects you could get involved in.

Are you ready, excited and committed? The ball is now in your court! You will need to apply via Google Summer of Code’s portal and design a proposal. Below are some ideas to get you started. Once again, these are only suggestions; all creative ideas are welcome.

Formatting a proposal

Please structure your proposals as described in the instructions from the Google Summer of Code portal:

  • Name and contact information
  • Title
  • Synopsis
  • Benefits to the community
  • Deliverables
  • Related work
  • Biographical information

You’ll notice that the projects below do not follow this format. This is because doing some research and planning is explicitly part of the experience. Give yourself a reasonable amount of time to write a good proposal. The Google Summer of Code programme offers a lot of excellent advice on how to write good proposals. Don’t hesitate to ask for our opinion ahead of submission; we’re here to help you craft a better proposal. If you submit early enough before the deadline, we can get back to you with comments and questions, so that you can modify your proposal and maximise your chances of being selected. Remember that this is a competitive programme, and more people are turned down than accepted.

Visualisation

Genoverse: large file format support

Brief Explanation

Genoverse is a JavaScript-based genome browser that uses HTML5 and canvas to draw genomic features on a linear genome, developed in collaboration by Ensembl and DECIPHER. Genoverse has been integrated into the Ensembl Genome Browser, where it provides a scrollable region overview. Our aim is to move to client-side rendering for all linear genome representations, including the rendering of large bioinformatic formats such as BAM, BigBed, BigWig and VCF, and next-generation formats such as CRAM and BCF. This allows Ensembl to visualise new and exciting data sets whilst handling the problems of an ever-increasing amount of genome data.
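As a rough illustration of the binary parsing involved, the sketch below guesses a file's format from its first bytes. It is written in Python for brevity (the project itself targets JavaScript), and sniff_format is a hypothetical helper, not part of Genoverse or BioDalliance. It relies on the bigWig and bigBed magic numbers, the gzip signature that BGZF-compressed files such as BAM begin with, and the ##fileformat header line of plain-text VCF.

```python
import struct

BIGWIG_MAGIC = 0x888FFC26  # first four bytes of a bigWig file
BIGBED_MAGIC = 0x8789F2EB  # first four bytes of a bigBed file


def sniff_format(header: bytes) -> str:
    """Guess a file format from the first bytes of the file."""
    if header.startswith(b"##fileformat=VCF"):
        return "VCF"
    if header.startswith(b"\x1f\x8b"):
        # gzip signature; BGZF files (BAM, compressed VCF) start this way,
        # so the payload must be decompressed to tell them apart.
        return "gzip"
    if len(header) < 4:
        return "unknown"
    # The magic number may be stored in either byte order.
    for byte_order in ("<", ">"):
        (magic,) = struct.unpack(byte_order + "I", header[:4])
        if magic == BIGWIG_MAGIC:
            return "bigWig"
        if magic == BIGBED_MAGIC:
            return "bigBed"
    return "unknown"
```

A real reader would go on to parse the rest of the header and the index; this only shows the first dispatch step.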

Expected results

  • Extract the JavaScript large file format file readers from the BioDalliance project into their own project (BAM/BigBed/BigWig/VCF)
  • Integrate this code into Genoverse
  • Render blocks, genetic variation and wiggle data in Genoverse
  • Support customisation of the tracks

Required knowledge

  • JavaScript, HTML5, canvas
  • Binary format file parsing

Difficulty

Medium

Mentors

Andy Yates, Stephen Trevanion, Simon Brent

Ensembl Track Database

Brief Description

Central to our browser is the region view, which displays a one-dimensional map of a chromosomal region (e.g. human chromosome 4, between positions 76306026 and 76311599) and then stacks annotations on top of it. Scientists can configure these images and add or remove tracks at will to highlight biological features of interest.

Currently, the location and graphical configuration of these tracks is held in the same MySQL databases as the data itself. First, we wish to externalise this configuration into its own database to allow a friendly, easy to use web interface to link out to genomic annotations stored in a variety of storage engines (MySQL, custom flat file formats, RESTful APIs). Second, we wish to facet this configuration by common attributes such as species, assembly, data type or data format. Finally, we wish to be able to collate results from multiple other track providers over a common web interface to provide a single way of locating genomic data from multiple resources.
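To give a feel for what faceting a track configuration could look like, here is a minimal Python sketch. The track records, field names and both functions are made up for illustration and do not reflect the current Ensembl schema.

```python
from collections import Counter
from typing import Dict, List

Track = Dict[str, str]

# Hypothetical track records; real entries would point at actual data sources.
TRACKS: List[Track] = [
    {"name": "genes", "species": "human", "assembly": "GRCh38",
     "data_type": "annotation", "format": "MySQL"},
    {"name": "variants", "species": "human", "assembly": "GRCh38",
     "data_type": "variation", "format": "VCF"},
    {"name": "coverage", "species": "mouse", "assembly": "GRCm38",
     "data_type": "alignment", "format": "BigWig"},
]


def filter_tracks(tracks: List[Track], **selected: str) -> List[Track]:
    """Keep only the tracks matching every selected facet value."""
    return [t for t in tracks
            if all(t.get(facet) == value for facet, value in selected.items())]


def facet_counts(tracks: List[Track], facet: str) -> Counter:
    """Count how many tracks fall under each value of one facet."""
    return Counter(t[facet] for t in tracks)
```

A web interface could show facet_counts for each attribute next to the result list and call filter_tracks as the user ticks values.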

Expected Results

  • Store all Ensembl annotation tracks in an external facet aware/searchable resource
  • Have developed an easy to use configuration interface
  • Have designed methods of sharing track information between Ensembl resources

Required Knowledge

  • RESTful APIs
  • JavaScript
  • JavaScript Application frameworks (AngularJS)

Difficulty

Medium

Mentors

Andy Yates, Stephen Trevanion

GenomeStats

Brief Description

The Ensembl genome browsers (http://www.ensembl.org and http://www.ensemblgenomes.org) allow users to explore genomes at high resolution. However, the number of annotations and experimental datasets is growing exponentially, and displaying many on a browser becomes unwieldy. We therefore developed a set of libraries to compute statistics on large collections of such datasets, either on the fly or in bulk. On top of this infrastructure, we prototyped a basic web tool, GenomeStats, to allow users to request such calculations remotely on our collection of datasets, then display or upload the final result.

We wish to improve the usability of the user interface and extend its functionality, possibly connecting the web service to other services such as Galaxy.

Expected Results

  • Design, implement and test a new user interface for GenomeStats.

  • Consolidate the underlying webservice to be robust and secure.

Required Knowledge

  • RESTful APIs

  • JavaScript

Difficulty

Medium

Mentors

Daniel Zerbino

RESTful interfaces

Data file search API

Brief Explanation

As well as the main web sites, the Ensembl and Ensembl Genomes projects provide hundreds of thousands of data files from over 40,000 genomes in a wide variety of formats. Finding the correct files for a genome or collection of genomes can be challenging, and we’d like to provide a programmatic interface for searching through them, e.g. find all peptide FASTA files for rodents. This could be used directly by client code for bulk downloads or via a wizard. This project would cover selecting an appropriate indexing technology, implementing a pipeline for updating the public set, and designing and implementing an API that can be used to retrieve files matching a set of criteria. If time permits, a web interface for this service could also be developed.
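The "peptide FASTA files for rodents" query can be sketched with a tiny in-memory inverted index. This is only an illustration in Python: the record fields, the file paths and the build_index and search functions are all assumptions, and a real service would more likely sit on a dedicated search engine or database.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]
Index = Dict[Tuple[str, str], Set[str]]


def build_index(records: List[Record]) -> Index:
    """Map each (field, value) pair to the set of file paths carrying it."""
    index: Index = defaultdict(set)
    for record in records:
        for field, value in record.items():
            if field != "path":
                index[(field, value)].add(record["path"])
    return index


def search(index: Index, **criteria: str) -> List[str]:
    """Return the sorted paths matching every criterion (none if no criteria)."""
    matches = None
    for field, value in criteria.items():
        paths = index.get((field, value), set())
        matches = set(paths) if matches is None else matches & paths
    return sorted(matches or [])


# Hypothetical records; the paths are placeholders, not real download URLs.
RECORDS = [
    {"path": "example/mouse.pep.fa", "clade": "rodents",
     "data_type": "peptide", "format": "FASTA"},
    {"path": "example/rat.pep.fa", "clade": "rodents",
     "data_type": "peptide", "format": "FASTA"},
    {"path": "example/mouse.gff3", "clade": "rodents",
     "data_type": "genes", "format": "GFF3"},
    {"path": "example/human.pep.fa", "clade": "primates",
     "data_type": "peptide", "format": "FASTA"},
]
```

With this, search(build_index(RECORDS), clade="rodents", data_type="peptide", format="FASTA") returns the two rodent peptide files.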

Expected results

  • Design of a database
  • Automated pipeline to update it
  • API to search for files
  • (time permitting) a web interface

Required knowledge

  • RESTful API development
  • Perl/Python/Java
  • (desirable) web development

Difficulty

Low-medium

Mentors

Dan Staines

Fetch nearest feature REST endpoint

Brief Explanation

Researchers often wish to look for the features, e.g. genes or regulatory elements, nearest to a region of interest. This search can be restricted to the same strand or include the opposite strand, be upstream or downstream of the region of interest, include only non-overlapping features, be within 5 base pairs or 500… Ensembl already provides this functionality through the Perl API but we wish to provide a language-agnostic tool for doing this, through the REST API.
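To make these options concrete, here is a minimal Python sketch of a nearest-feature search over an in-memory list. The Feature class, the nearest_feature function and all of its parameters are hypothetical; they are not the Ensembl Perl API or an existing REST endpoint, and a real implementation would query indexed data rather than scan a list.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Feature:
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: int  # +1 or -1


def gap(feature: Feature, start: int, end: int) -> int:
    """Distance in base pairs between a feature and a region (0 if they overlap)."""
    if feature.end < start:
        return start - feature.end
    if feature.start > end:
        return feature.start - end
    return 0


def nearest_feature(features: Iterable[Feature], start: int, end: int, *,
                    region_strand: int = 1,
                    same_strand_only: bool = False,
                    direction: Optional[str] = None,  # "upstream" or "downstream"
                    allow_overlap: bool = True,
                    max_distance: Optional[int] = None) -> Optional[Feature]:
    """Return the closest feature satisfying the filters, or None."""
    best, best_gap = None, None
    for feature in features:
        if same_strand_only and feature.strand != region_strand:
            continue
        distance = gap(feature, start, end)
        if distance == 0 and not allow_overlap:
            continue
        if max_distance is not None and distance > max_distance:
            continue
        if direction is not None and distance > 0:
            lower_coordinates = feature.end < start
            # Upstream means lower coordinates on the forward strand
            # and higher coordinates on the reverse strand.
            is_upstream = lower_coordinates if region_strand == 1 else not lower_coordinates
            if (direction == "upstream") != is_upstream:
                continue
        if best is None or distance < best_gap:
            best, best_gap = feature, distance
    return best
```

Ties are resolved in favour of the first feature encountered, and overlapping features are not affected by the direction filter; both are arbitrary choices that a use-case listing would need to settle.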

Expected results

  • List all required use cases
  • Implement these use cases in Ensembl’s Perl API
  • Implement a REST endpoint that exposes the Perl code

Required knowledge

  • RESTful API development
  • Ensembl Perl API
  • (desirable) genomics

Difficulty

Medium – hard

Mentors

Magali Ruffier

Global Alliance for Genomics and Health API

Brief Explanation

The purpose of the Global Alliance for Genomics and Health (GA4GH) is to accelerate genomics research and practice by setting up an ecosystem for easier data exchange. Among other things, the Global Alliance has designed APIs around which the community is converging. Ensembl already has a few GA4GH-compliant endpoints, but we wish to design tools that make use of these endpoints. For example, modifying Genoverse to use these endpoints would be a concrete proof of concept.

Expected results

  • Design a useful tool concept
  • Set up an instance of GA4GH’s reference server
  • Prototype the tool on the GA4GH server
  • (Time permitting) Connect the tools to full scale endpoints.

Required knowledge

  • RESTful API development
  • Perl
  • Genomics
  • (desirable) web development

Difficulty

Medium – hard

Mentors

Daniel Zerbino

Bioinformatics

Strain Differentials

Brief Explanation

Across Ensembl, we collect large quantities of data describing genomic differences between individuals, strains or populations with respect to a designated reference genome.  We need efficient tools to tell us the difference between any two such accessions or to quickly identify genomes containing a given mutation.
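The two questions above can be sketched in a few lines of Python if each accession is modelled as a set of variants relative to the reference. The Variant tuple layout and both functions are assumptions made for illustration; at Ensembl scale the interesting part of the project is choosing a data structure that does this without holding everything in memory.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# (chromosome, position, reference allele, alternative allele)
Variant = Tuple[str, int, str, str]


def compare(first: Set[Variant], second: Set[Variant]) -> Dict[str, Set[Variant]]:
    """Split the variants of two accessions into private and shared sets."""
    return {
        "only_first": first - second,
        "only_second": second - first,
        "shared": first & second,
    }


def index_by_variant(accessions: Dict[str, Set[Variant]]) -> Dict[Variant, Set[str]]:
    """Invert the data so each variant points at the accessions carrying it."""
    carriers: Dict[Variant, Set[str]] = defaultdict(set)
    for accession, variants in accessions.items():
        for variant in variants:
            carriers[variant].add(accession)
    return carriers


# Made-up example data.
ACCESSIONS = {
    "strain_1": {("1", 1000, "A", "G"), ("2", 500, "C", "T")},
    "strain_2": {("1", 1000, "A", "G"), ("3", 42, "G", "A")},
}
```

Looking up a variant in the inverted index then answers "which genomes contain this mutation" in a single dictionary access.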

Expected results

  • Design or adopt an appropriate data structure
  • Import data for at least 1 crop species from public archives
  • Construct a web interface to support primary use cases

Required knowledge

  • File indexing
  • JavaScript/web application development

Difficulty

Medium-hard

Mentors

Paul Kersey

Transcript Comparisons

Brief Explanation

Identifying orthologous proteins between two species, i.e. proteins that derive from a common ancestral protein, is important for researchers and pharmaceutical companies who wish to translate findings in one species to another. For example, which protein in mouse is likely to function most similarly to a target protein in human? Ensembl stores gene sequences for a large number of species, and we pre-compute orthology for all of these genes. Our current orthology pipeline only accounts for one protein per gene; however, genes can generate different protein sequences through a process known as alternative splicing. Roughly speaking, a gene can generate different RNA sequences, known as transcript isoforms, and each transcript in turn leads to a specific protein sequence.

We therefore wish to develop a system that, given a transcript from one species, identifies its closest match in a second species. This first involves finding the orthologous gene(s) (as computed by Ensembl) and then automatically selecting the functionally orthologous protein(s). Why is this difficult? Simple orthology calls can become quite complex when one sees the number of transcripts annotated at a single locus. Not all isoforms are fully annotated, so a candidate could lack important functional domains or splicing patterns, could lack a folded structure similar to the human target's, or may not be expressed in the tissues of interest.

Expected results

  • Design a decision process for identifying orthologous proteins, including protein domain presence, domain order, exon count, sequence identity, and tissue-specific expression
  • For each transcript in a gene A, compare it to all transcripts in its orthologous gene B, using the above criteria
  • Create a table of results for each transcript in gene A, with orthologous proteins listed, scored and ranked
  • Allow further visual inspection of the results in a widget, e.g. sequence alignments, tissue expression
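One possible shape for the scoring and ranking step is sketched below in Python. Everything here is an assumption: the Transcript fields, the weights (picked arbitrarily) and the way each criterion is turned into a number. Tissue-specific expression is left out, and sequence identity is assumed to be computed elsewhere and passed in as a value between 0 and 1.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Transcript:
    id: str
    domains: Tuple[str, ...]  # protein domains in N- to C-terminal order
    exon_count: int


def score(query: Transcript, candidate: Transcript, identity: float) -> float:
    """Combine the criteria into a single score between 0 and 1."""
    shared = set(query.domains) & set(candidate.domains)
    union = set(query.domains) | set(candidate.domains)
    domain_presence = len(shared) / len(union) if union else 1.0
    # Domain order: compare the shared domains in the order each transcript lists them.
    query_order = [d for d in query.domains if d in shared]
    candidate_order = [d for d in candidate.domains if d in shared]
    same_order = 1.0 if query_order == candidate_order else 0.0
    largest = max(query.exon_count, candidate.exon_count, 1)
    exon_similarity = 1.0 - abs(query.exon_count - candidate.exon_count) / largest
    # Arbitrary illustrative weights; choosing them is part of the project.
    return (0.3 * domain_presence + 0.2 * same_order
            + 0.1 * exon_similarity + 0.4 * identity)


def rank(query: Transcript, candidates: List[Transcript],
         identities: Dict[str, float]) -> List[Tuple[float, str]]:
    """Return (score, transcript id) pairs, best match first."""
    scored = [(score(query, c, identities[c.id]), c.id) for c in candidates]
    return sorted(scored, reverse=True)
```

With these weights, a candidate that keeps every domain can outrank one with slightly higher sequence identity but a missing domain, which is the kind of trade-off the decision process needs to make explicit.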

Required knowledge

  • Basic molecular biology (esp. transcription and translation)
  • Accessing data with the Ensembl Perl API
  • Running pipelines on a compute cluster
  • JavaScript/web application development

Difficulty

Medium-hard

Mentors

Fergal Martin

Modelling of external references

Brief Explanation

There are many biological databases, often presenting comparable data from a different angle. Ensembl provides links, known as external references, to corresponding data in other databases. This mapping of Ensembl identifiers to external identifiers can rely on heterogeneous evidence such as genomic coordinate overlaps, sequence alignments, manual curation or even third-party links. For reproducibility reasons, it is important to be able to report how a link was generated. The information is available in a graph database but we would like to store it in a traditional database.

Expected results

  • Code to traverse and extract the graph neighbourhood of an identifier (either in RDF or in an RDBMS)
  • A REST endpoint that exposes this code
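Extracting the neighbourhood of an identifier amounts to a bounded graph traversal. The Python sketch below runs a breadth-first search over a plain list of links, each carrying the evidence that produced it. The identifiers, the edge layout and the neighbourhood function are placeholders; in practice the edges would come from RDF or from tables in an RDBMS.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (identifier, identifier, evidence)


def neighbourhood(edges: List[Edge], start: str, max_depth: int = 1) -> List[Edge]:
    """Return the links reachable from start within max_depth steps,
    each reported once, in breadth-first order, with its evidence."""
    adjacency: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for left, right, evidence in edges:
        # Links are treated as undirected.
        adjacency[left].append((right, evidence))
        adjacency[right].append((left, evidence))

    seen = {start}
    queue = deque([(start, 0)])
    links: List[Edge] = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for neighbour, evidence in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                links.append((node, neighbour, evidence))
                queue.append((neighbour, depth + 1))
    return links


# Placeholder identifiers, not real database accessions.
EDGES = [
    ("ensembl:E1", "uniprot:U1", "sequence alignment"),
    ("uniprot:U1", "hgnc:H1", "third-party link"),
    ("ensembl:E1", "refseq:R1", "coordinate overlap"),
]
```

Keeping the evidence on every returned link is what lets a REST endpoint report how each external reference was generated.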

Required knowledge

  • RDBMS
  • Scripting
  • (desirable) RDF

Difficulty

Easy – medium

Mentors

Magali Ruffier

Genomic data and flat files

Brief Explanation

Genomics data is provided in a number of different file formats with loose specifications (for example GFF3, GTF, EMBL). As a result, the same data can be represented differently depending on who generated the file and when. Additionally, different analysis software packages have different expectations as to the format, requiring end users to repeatedly modify existing files to suit their use case. Ensembl has developed a tool, FileChameleon, that can facilitate these transformations and provide users directly with the end file they require. We would like to identify what transformations are needed and implement them accordingly.
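As an example of the kind of transformation involved, GFF3 writes its attribute column as key=value pairs separated by semicolons, whereas GTF writes key "value"; pairs. The Python sketch below converts one into the other. It is not how FileChameleon works (FileChameleon is written in Perl); the function name and the key-renaming map are assumptions, and escaped characters in GFF3 values are not handled.

```python
from typing import Dict, Optional


def gff3_to_gtf_attributes(column: str, rename: Optional[Dict[str, str]] = None) -> str:
    """Rewrite a GFF3 attribute column in GTF syntax, optionally renaming keys."""
    rename = rename or {}
    parts = []
    for pair in column.strip().strip(";").split(";"):
        if not pair.strip():
            continue
        key, _, value = pair.partition("=")
        key = rename.get(key.strip(), key.strip())
        parts.append(f'{key} "{value.strip()}";')
    return " ".join(parts)
```

For instance, gff3_to_gtf_attributes("ID=gene1;Name=BRCA2", {"ID": "gene_id", "Name": "gene_name"}) gives gene_id "gene1"; gene_name "BRCA2"; which is the sort of dialect difference a downstream tool may insist on.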

Expected results

  • Collect use cases, formalise spec around format dialects
  • Extend FileChameleon to handle these use cases.

Required knowledge

  • Perl
  • Genomics

Difficulty

Easy – medium

Mentors

Magali Ruffier

Bioinformatics pipelines

Ensembl’s production is powered by the eHive (https://github.com/Ensembl/ensembl-hive) workflow management system. It is responsible for scheduling and executing in excess of 450 CPU years of compute per year.

Support for generic job schedulers in eHive

Brief Explanation

eHive natively supports the Platform LSF job scheduler, but some of our users have written modules to support other schedulers such as SGE, OpenLava, Condor, etc. An alternative and generic approach consists of supporting standard job scheduler APIs such as DRMAA or GRAM. The goal of this project is to compare both technologies, decide which one would work better for us and our users, and implement the backend.

Expected results

  • Set up a test environment for the new job schedulers
  • Assess both products
  • Implement an eHive backend for the selected one

Required knowledge

  • System administration
  • Grid computing
  • Perl

Difficulty

Hard

Mentors

Matthieu Muffato

Next Generation Process Logging in eHive

Brief Description

Central to eHive is the job blackboard, which coordinates the state and assignment of work within a pipeline by querying a MySQL database. Seeing the state of the pipeline requires querying the database. Our aim is to enable other methods of responding to the state of a pipeline and to enable extensive cross-pipeline logging of processes. We believe that new logging toolkits (fluentd/logstash/flume), combined with messaging queues and a log analysis dashboard (Kibana/Kibi), could provide this. We wish to develop this platform and combine it with our in-house pipeline analysis toolkit, gui-hive.
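Log forwarders of this kind generally work best with structured events, typically one JSON object per line. The sketch below shows the idea using Python's standard logging module; eHive itself is written in Perl, and the logger name, the event fields (pipeline, job_id, status) and the make_logger helper are all made up for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields supplied through the `extra` argument of the logging call.
            "pipeline": getattr(record, "pipeline", None),
            "job_id": getattr(record, "job_id", None),
            "status": getattr(record, "status", None),
        }
        return json.dumps(event, sort_keys=True)


def make_logger(stream) -> logging.Logger:
    """Build a logger that writes JSON events to the given stream."""
    logger = logging.getLogger("pipeline_events")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


# Usage: in production the stream would be a file or socket read by the forwarder.
stream = io.StringIO()
events = make_logger(stream)
events.info("job status changed",
            extra={"pipeline": "example_pipeline", "job_id": 42, "status": "DONE"})
```

Because every event carries the pipeline name, a dashboard can aggregate across pipelines without going back to each pipeline's database.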

Expected Results

  • eHive to be able to log events to a log forwarder
  • Have an analysis dashboard available to query data from eHive pipelines
  • Combine with gui-hive to provide a single interface

Required knowledge

  • Perl
  • Concepts of log-forwarding and log-analysis
  • JavaScript

Difficulty

Hard

Mentors

Matthieu Muffato, Andy Yates

Support for Common Workflow Language (CWL) in eHive

Brief Explanation

CWL is a recent effort to define a common language to describe workflows. CWL is being increasingly supported by workflow management systems and we would also like to support it in eHive (our own, older, workflow management system). Extending the CWL language to add the features and capabilities that eHive has is a massive task that needs deep cooperation between both projects. As a first step, we’d like to have two tools: an eHive Runnable that is able to execute a CWL component (delegating this to the reference CWL implementation), and a CWL component to execute an eHive Runnable within a CWL workflow (delegating the actual execution to eHive reference tools).

Expected results

  • A CWLCmd eHive Runnable to execute a CWL workflow
  • A HiveStandaloneJob CWL component to execute an eHive Runnable

Required knowledge

  • Perl and Python

Difficulty

Medium

Mentors

Matthieu Muffato

Support for OpenStack cloud instances

Brief Explanation

eHive is currently used on physical compute clusters managed by LSF and SGE, which often have a fixed size. On the other hand, cloud providers have custom APIs to manage the size of the cluster. We want to build an OpenStack installation and an OpenStack backend for eHive to be able to 1) run eHive pipelines on an OpenStack cloud and 2) let eHive manage the (de)allocation of nodes. The goal is to directly call the OpenStack API from eHive, removing the need for a job scheduler.

Expected results

  • A recipe to build an OpenStack VM (for instance using packer)
  • A new backend in eHive to control an OpenStack cloud

Required knowledge

  • OpenStack API
  • Perl

Difficulty

Hard

Mentors

Matthieu Muffato

Graphical editor of eHive Workflows as XML documents – Follow up

Brief Explanation

eHive workflows are currently described as specially-structured Perl modules, and we want to move to a more standard structured format like XML. We have a draft schema specification (in RNG) of the XML files we want to represent. We’d now like to have a graphical interface to design workflows following the specification.

As part of last year’s GSoC programme, a student developed an interface based on Google’s Blockly library; see it on GitHub. We support a number of elements from the RNG grammar; the interface allows the creation of a new XML document from scratch and includes a validation step (the resulting XML is checked against the RNG specification). What’s left to do is the loading of an existing XML document into the interface. This requires parsing the XML file against the RNG specification and/or the Blockly blocks, instantiating new blocks, filling in their values and connecting them.
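The loading step boils down to walking the XML tree and turning each element into a block description with its values and children. Here is a minimal Python sketch of that walk; the actual tool would be written in JavaScript against Blockly, and the element and attribute names in the sample document are invented rather than taken from the RNG schema.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict


def element_to_block(element: ET.Element) -> Dict[str, Any]:
    """Describe an XML element as a block: its type, field values and child blocks."""
    return {
        "type": element.tag,
        "fields": dict(element.attrib),
        "children": [element_to_block(child) for child in element],
    }


# Invented sample document, for illustration only.
SAMPLE = """
<pipeline name="demo">
  <analysis name="align"/>
  <analysis name="report"/>
</pipeline>
"""

root_block = element_to_block(ET.fromstring(SAMPLE.strip()))
```

In the real interface, each description would then be used to instantiate the matching Blockly block, fill in its fields and connect it to its parent.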

Expected results

  • Learn to use the Blockly library
  • Write a tool to load an XML file and create the matching diagram

Required knowledge

  • JavaScript

Difficulty

Medium

Mentors

Matthieu Muffato

6 thoughts on “Google Summer of Code ’17”

  1. Hello Devs, I’m Lakshmanan, in the third year of a Computer Science and Engineering degree in India. I would like to work on the “Bioinformatics – Strain Differentials” project during GSoC 2017. I am also interested in remaining an active contributor to this project afterwards.

  2. Hello Ensembl! I am a Biology year Undergrad from BITS Pilani India, and I am interested in working on the “Bioinformatics – Strain Differentials” project for GSoC 2017 and carry it forward actively taking part in it’s development.

    I have experience in Web Development – JS, HTML, CSS and AngularJS (project: https://github.com/ravaan/dashboard/tree/angular) and Data Structures and Algorithms. I have worked as a developer with a game studio (Raincrow Studios), setting up high-performance tile servers on AWS.

    You can view my complete bio here [https://www.linkedin.com/in/ravaan/]

    I wanted to start working on the project looking forward to any suggestions or advices you guys have !

    • Thanks for emailing the helpdesk (helpdesk [at] ensembl.org) about this. We’ll get back to you that way. If anybody else has questions, that’s the best way to get in touch.

  3. Hi ,
    I’m Chanaka from Sri Lanka.
    I’m interested in Ensembl Track Database project. And need to get some information on this. And need to get in touch with Andy Yates, Stephen Trevanion. These are the mentors of that project.
    But when I click on the names, I get this page.
    https://www.ebi.ac.uk/seqdb/confluence/login.action?os_destination=%2Fusers%2Fviewuserprofile.action%3Fusername%3Dayates&permissionViolation=true

    It asks for a username and password. Can we get one from you?
    Or is there any way I can contact mentors ?
