GSOC with Ensembl: introducing our students

For the third year in a row, we’re lucky to have student developers working with us as part of Google Summer of Code. We’ve got three GSOC-ers this year, working on some really exciting projects: Zeyu Tony Yang, working on primary genome analysis; Nabil Ibtehaz, working on transcript-level orthology; and Somesh Chaturvedi, working on retrieving reference sequences with APIs.

GSOC is a project set up by Google that places students in open source projects to take on a short independent coding project, and pays them for it. We have to pass rigorous selection criteria to be allowed to offer projects on GSOC, and the students have to be selected by both Google and us to take part. It means the GSOC-ers are the Top Gun of student developers. We think this is a really great opportunity, both for open source projects like us, who get a fresh pair of eyes to take a look at something that we’ve maybe put on the back-burner, and for the students, who get experience working on a real-world coding project during their university summer break.

The projects are structured, meaning that there are milestones, assessments and mutual evaluations throughout the three months of the project. This ensures that we’re getting something worthwhile out of our students, but also that we’re doing a good job as mentors, supporting our students and their development. Our students have been working on their projects for a few weeks now and are heading for their first assessment, so we’d like to introduce them to you, along with what they’ve been working on.

Zeyu Tony Yang

Student background: Final year chemistry MSci student at Imperial College London. Candidate for the Wellcome Trust 4-Year PhD programme in theoretical systems biology and bioinformatics.

Supervisor: Andy Yates and Leanne Haggerty

Project: Automated Processing of Primary Genome Analysis

Project overview: Since the success of the Human Genome Project in 2003, many novel technologies have been invented to speed up genome sequencing. With the development of new sequencing machines and their much-reduced cost, a large quantity of novel genome data is produced daily, and at an increasing rate. Hence, it is important to develop an automated process that analyses newly sequenced genome data efficiently.

The primary aim of this project is to construct a generic framework that would allow automated processing of new genome sequences. The European Nucleotide Archive (ENA), one of the leading repositories for nucleotide sequence data, has over one billion records and its collection is still expanding exponentially. ENA records a large variety of nucleotide data, from DNA sequencing machine configurations to sequence traces and annotated information. These data, gathered from different sources such as small-scale lab sequencing and European sequencing centres, are made freely available online and can be accessed via ENA’s REST API.

GC content, the percentage of guanine (G) or cytosine (C) bases in a genome, provides vital information about an organism. Guanine (G) and cytosine (C) base pairs stack thermodynamically more favourably than adenine (A) and thymine (T) base pairs, and hence contribute to the DNA’s stability. GC content varies between organisms and can provide evolutionary insights into particular organisms. GC analysis is therefore a meaningful and relatively easy algorithm to implement, and will allow a general framework to be constructed around it.
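To make the GC calculation concrete, here is a minimal Python sketch. It is an illustration only, not the project's actual code: the function names `gc_content` and `read_fasta` are hypothetical, the input is assumed to be FASTA-formatted text already held in memory rather than fetched from ENA, and ambiguous bases such as N are assumed to be excluded from the denominator.

```python
def gc_content(sequence):
    """Return the percentage of G and C bases in a nucleotide sequence."""
    seq = sequence.upper()
    # Assumption: only unambiguous bases are counted, so N does not skew the result.
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / counted


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Hypothetical records, used only to show the calling pattern.
    example = [">seq1", "ATGCGC", "NNAT", ">seq2", "GGCC"]
    for name, seq in read_fasta(example):
        print(name, gc_content(seq))
```

In a workflow setting, a small command-line tool wrapping a function like this would be one step that a workflow description connects to the steps that download and store the sequence.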

Common Workflow Language (CWL) will be used to construct the automation framework. CWL documents, written in JSON or YAML, describe how different command-line tools are connected, and the language was developed to support scientific data analysis. Workflows described according to the CWL specification are easily portable and scalable across different computational environments. Because of the explicit and isolated nature of CWL tasks, CWL workflows can be containerised to allow easy deployment.

Experience so far: I have had great fun doing the coding project so far. My mentors are very approachable and have given me a lot of hands-on guidance. Coming from a non-computing background, I had to pick up many software engineering skills on the fly, but I very much enjoyed learning new things. I greatly enjoyed researching, coding and experimenting for this project. I learnt much more about the command-line interface, version control, software packaging and continuous testing. This project not only has an interesting real-life application but is also good preparation for my future study.

Nabil Ibtehaz

Student background: I am from Bangladesh and am a student at Bangladesh University of Engineering and Technology (BUET). I am pursuing my masters in Computer Science and Engineering and have attended courses on Algorithms, Artificial Intelligence, Bioinformatics, Image Processing, Meta-heuristics and more. My areas of interest are Machine Learning, Computational Biology and Biomedical Engineering. I took a graduate course on Bioinformatics, in which special emphasis was put on Computational Phylogenetics. For that course I did a research project on clustering orthologous genes using machine learning, which led to surprisingly good results. So when I saw this Google Summer of Code project I immediately applied: having previously worked on gene-level orthology, I was very interested to see how this could be done at the transcript level.

Supervisor: Fergal Martin and Kostas Billis

Project: Transcript Comparisons: Exploring Transcript Level Orthologous Relation

Project overview: In comparative genomics we compare an unknown gene with known genes to better infer the biological properties of the unknown gene, which makes identifying gene orthology relationships an important task. Because orthologues are genes that originate from speciation, they tend to preserve similar molecular and biological functions. In contrast, since paralogues are created by gene duplication, they tend to deviate from their ancestral behaviour and functions. Thus, if we can establish an orthology relationship between two genes, we can obtain valuable information about their evolutionary history.

Gene orthology has been well studied in evolutionary biology and is thought to have important implications for functional genome annotation. However, with greater sequencing depth and the expansion of transcriptome data, genes are no longer the proper units for interrogating functional conservation, evolutionary events and expression patterns. Alternative transcripts represent evolution in terms of the complexity (and possibly function) of the gene. Transcript orthologues are interesting as they would allow us to consider the evolution of gene complexity in different lineages. Here we can consider human and mouse, which diverged a reasonably long time ago, and attempt to recover transcripts that were common before that divergence. It would be equally interesting to consider the transcripts in each species that don’t have clear orthologues and then determine why this is. For example, if a mouse transcript does not have a clear orthologue in the corresponding human gene, has that transcript just not been annotated in human, or is it a genuine evolutionary event?

As transcriptomic data accumulates, alternative splicing is being taken into account in the assignment of gene orthologues, and it has been suggested that orthology should also be considered at the transcript level. With either gene or transcript orthology, exons can be seen as the basic units that represent the whole gene structure; however, not much is known about how to build exon-level orthology on a whole-genome scale. Therefore, it is essential to establish a transcription-oriented gene orthology algorithm.

Experience so far: I’ve been having an amazing experience working with Ensembl. I am being supervised by Kostas Billis and Fergal Martin, who are very friendly and are always there to help me. I started off this project by learning Perl and the Ensembl Perl APIs. It was interesting working with the APIs and the documentation was very helpful. I’ve also been studying research papers on similar topics. Now I’m collecting data so that we can apply machine learning to the problem this project deals with.

Somesh Chaturvedi

Student background: I have recently completed my Bachelors in Biotechnology at the Indian Institute of Technology, Roorkee (better known as IIT). My passion for software development and biotech, since the very start of my degree, led me to explore the field of Computational Biotechnology and Bioinformatics over the past couple of years, during which I worked in the Computational Biology and Translational Bioinformatics Lab at my institute and did internships at a healthcare analytics firm and a chatbot company. My current areas of interest are Artificial Intelligence, Software Development and their applications in biology. This is the very reason I found the Global Alliance for Genomics and Health perfectly suited for my first GSoC experience. And yes, I consider myself a Pythonista! 🙂 Find me at somesh0896 on Twitter.

Supervisor: Andy Yates and Matthew Laird

Project: Reference Sequence Retrieval API

Project overview: Somesh will be developing a compliance suite for implementers of the GA4GH Reference Sequence Retrieval API, along with a client library in Python. By the end of the project he will have a framework that can test implementations of the API to ensure compliance with the specifications and create a report of deficiencies. The final component will be a test suite that produces an interoperability matrix of all clients versus servers, so that users of the API can be assured that all servers and clients produce consistent results.

Experience so far: It’s been one month with GA4GH and I would say, without a doubt, it has been an amazing journey full of learning new concepts and technologies. Both of my mentors have been very helpful and cooperative. The tasks before the first evaluation were:

  • Writing compliance documentation for Reference Server
  • Development of test suite for the same

In writing the documentation for the server, the very first and most important task was to understand the server’s APIs, including all the edge cases. It took me some time to get hold of the technicalities and complexity of the API. I referred to RFC 7233, the API specs provided by my mentors and various other resources. I used markdown files for the documentation and finally, as suggested by my mentors, I put the documentation onto Read the Docs, available here.
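As an illustration of the kind of edge case RFC 7233 introduces, the sketch below converts a single HTTP `Range` header value into a Python slice. Byte positions in the header are inclusive at both ends, so `bytes=0-9` means the first ten bytes. The function name `parse_single_byte_range` is hypothetical, only single ranges are handled, and how a reference server maps the error cases to HTTP status codes is deliberately not modelled here.

```python
import re
from typing import Tuple

# Matches 'bytes=FIRST-LAST', where either position may be omitted.
_RANGE_RE = re.compile(r"^bytes=(\d*)-(\d*)$")


def parse_single_byte_range(header: str, length: int) -> Tuple[int, int]:
    """Convert a single-range header into a half-open (start, stop) slice."""
    match = _RANGE_RE.match(header.strip())
    if match is None:
        raise ValueError("unsupported Range header: %r" % header)
    first, last = match.groups()
    if first == "" and last == "":
        raise ValueError("range needs a first or a last position")
    if first == "":
        # Suffix form 'bytes=-N' selects the final N bytes.
        suffix = int(last)
        if suffix == 0:
            raise ValueError("suffix length must be greater than zero")
        return max(length - suffix, 0), length
    start = int(first)
    if start >= length:
        raise ValueError("range starts beyond the end of the data")
    if last == "":
        # Open-ended form 'bytes=N-' runs to the end of the data.
        return start, length
    end = int(last)
    if end < start:
        raise ValueError("last position precedes first position")
    # The last position is inclusive, so add 1 for Python slicing,
    # and clamp it to the available length.
    return start, min(end + 1, length)


if __name__ == "__main__":
    data = "ACGTACGT"
    start, stop = parse_single_byte_range("bytes=2-4", len(data))
    print(data[start:stop])
```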

The second task was to write a test suite for the server. At first this task seemed easy to me, and I decided to go with Python’s pytest framework. Eventually, with all the variables involved in the server implementation, such as the encodings, circular sequence support and redirection support, I realised that this, too, was not a straightforward task. I wrote 60+ tests, covering all the scenarios of the server implementation. The test suite also contains a mock server, which can be used as a standalone server as well. In the process, I honed my API testing skills, gained experience working with pytest and learned a lot about test-driven development.
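To give a flavour of what a circular-sequence test can look like, here is a small self-contained sketch. It is not taken from the actual test suite: `get_subsequence` is a hypothetical stand-in for a server lookup, coordinates are assumed to be 0-based with an exclusive end, and errors are raised as `ValueError` rather than returned as HTTP status codes. The tests are plain `test_` functions of the kind pytest collects automatically.

```python
def get_subsequence(sequence, start, end, circular=False):
    """Return sequence[start:end], wrapping around when the sequence is circular."""
    length = len(sequence)
    if not (0 <= start < length) or not (0 <= end <= length):
        raise ValueError("coordinates out of range")
    if start <= end:
        return sequence[start:end]
    if not circular:
        raise ValueError("start > end is only valid for circular sequences")
    # Wrap around the origin: tail of the sequence, then its head.
    return sequence[start:] + sequence[:end]


def test_linear_slice():
    assert get_subsequence("ACGTACGT", 2, 5) == "GTA"


def test_circular_wraps_around_origin():
    assert get_subsequence("ACGTACGT", 6, 2, circular=True) == "GTAC"


def test_linear_sequence_rejects_wraparound():
    try:
        get_subsequence("ACGTACGT", 6, 2)
    except ValueError:
        return
    raise AssertionError("expected ValueError for start > end on a linear sequence")


if __name__ == "__main__":
    # Run the checks directly when pytest is not available.
    test_linear_slice()
    test_circular_wraps_around_origin()
    test_linear_sequence_rejects_wraparound()
    print("all checks passed")
```

Keeping the lookup as a pure function makes each scenario a one-line assertion, and the same cases could then be replayed against a mock server over HTTP.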

You can find all of my code here.