GSoC with Ensembl: catching up with 2018’s students

In this blog post we catch up with Ensembl’s 2018 Google Summer of Code (GSoC) students and hear about their now-completed projects and their reflections on the experience. You may have already seen our previous blog post, which we published just as they were beginning their projects. Read on to find out how the projects went, what the students learnt and what advice they can pass on to aspiring GSoC students.

Somesh Chaturvedi

Mentors: Andy Yates and Matthew Laird

Twitter: @somesh0896

Project: Reference Sequence Retrieval API

Somesh’s GitHub page for this project can be found here.

 

Project highlights: This project revolved around the reference sequence API specification and consisted primarily of four sub-tasks:

  1. Compliance document, covering all the edge cases of the API specification.
  2. Compliance test suite, an API test suite written using pytest for testing server implementations of the reference API.
  3. Reference Sequence Fetcher (RFC), a Python client library with a command-line interface (CLI) to query a valid server implementation.
  4. Compliance Report Utility, a report-generating system to pinpoint the edge cases where a server implementation is failing.

The project required a lot of exploration of RFCs and a good knowledge of testing, e.g. unit tests, integration tests and API tests. It was a fun project to work on.

Were there any aspects of the project you found particularly challenging, how did you overcome these?
The most challenging aspect, I would say, was the final task: a reporting system which generated human-readable reports in text and tabular format, and machine-readable reports in JSON format. The test cases in the report system were interdependent, so the complete test graph was quite complex. Test cases had to be executed in a structured order, the complete call stack had to be reported whenever a test case failed, and the report had to be written in plain English. Bringing all this together was difficult for me at first. After brainstorming and discussing it with my mentors, I implemented a test graph and a few graph algorithms, such as graph labelling and Breadth First Search (BFS), to successfully complete this task.
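To give a feel for the idea of a test graph walked with BFS, here is a minimal sketch in Python. It is not Somesh's implementation: the test names, the `DEPENDENTS` graph, the `CHECKS` table and the `run_graph` and `text_report` functions are all made up for illustration. The sketch assumes that each test case has a single parent, that a test is skipped when its parent did not pass, and that the chain of tests leading to each result is what ends up in the report.

```python
from collections import deque
import json

# Hypothetical test graph: each test maps to the tests that depend on it.
DEPENDENTS = {
    "service_info": ["metadata", "sequence"],
    "metadata": ["metadata_checksum"],
    "sequence": ["sequence_range", "sequence_circular"],
    "metadata_checksum": [],
    "sequence_range": [],
    "sequence_circular": [],
}

# Hypothetical checks: a real suite would query a server here.
# "sequence" is made to fail so that its dependents get skipped.
CHECKS = {name: (lambda: True) for name in DEPENDENTS}
CHECKS["sequence"] = lambda: False


def run_graph(root, dependents, checks):
    """Run tests breadth-first, skipping tests whose parent did not pass."""
    results = {}
    parent = {root: None}
    queue = deque([root])
    while queue:
        name = queue.popleft()
        up = parent[name]
        if up is not None and results[up]["status"] != "passed":
            status = "skipped"
        else:
            status = "passed" if checks[name]() else "failed"
        # Rebuild the chain of tests from the root down to this one.
        chain = []
        node = name
        while node is not None:
            chain.append(node)
            node = parent[node]
        results[name] = {"status": status, "chain": chain[::-1]}
        # Enqueue dependents that have not been seen yet.
        for child in dependents[name]:
            if child not in parent:
                parent[child] = name
                queue.append(child)
    return results


def text_report(results):
    """Plain-text report: one line per test, with the chain that led to it."""
    return "\n".join(
        f"{name}: {r['status']} (reached via {' -> '.join(r['chain'])})"
        for name, r in results.items()
    )


if __name__ == "__main__":
    results = run_graph("service_info", DEPENDENTS, CHECKS)
    print(text_report(results))           # human-readable report
    print(json.dumps(results, indent=2))  # machine-readable report
```

Because a test is only added to the queue when its parent is taken off it, the parent's result is always known by the time a dependent test runs, which is what makes the skip decision and the reported chain straightforward.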

What did you learn during this project?
During this project I learnt a lot about technical documentation and best practices, and about testing frameworks like pytest, nose and unittest, including their pros and cons. I also learnt a lot about different paradigms of testing, like unit testing, integration testing and API testing. I wrote a Python library from scratch, with integrations to readthedocs-sphinx, coverage and Jenkins, and wrote unit tests for it as well. During the last stage of the project I successfully implemented graph algorithms in the report system.

Do you feel that your study or career direction has changed based on the project?
I developed a great interest in this field during my project. I would like to explore this field in the future as well.

What piece of advice would you pass on to anyone applying to do a GSoC project?
I believe GSoC is one of the best opportunities a student can grab. My advice is that students should not hesitate to connect with mentors and ask questions, even ones that seem stupid. No question is stupid. Mentors and organisations are really helpful and friendly to newcomers. Communication is the key in GSoC.

Nabil Ibtehaz

Mentors: Fergal Martin and Kostas Billis

Project: Transcript Comparisons: Exploring Transcript Level Orthologous Relation

Nabil’s GitHub page for this project can be found here.

 

 

Project highlights: In this project, we tried to identify the most likely orthologues at the transcript level. Alternative splicing accounts for the variations in orthologous genes, and we developed an algorithm modelling this event. We also developed a GUI application so that researchers can use our findings in their research. Moreover, we analysed the similarity of the protein features and obtained satisfactory results.

Were there any aspects of the project you found particularly challenging, how did you overcome these?
At first, I found it quite difficult to use the Ensembl APIs, but my mentors suggested some tutorials and, after following them, I managed to use the APIs with ease. The most challenging part of my project was developing the algorithm. I had previously worked on a research project predicting orthologous relationships between genes, but I was not very familiar with such scenarios at the transcript level. I was also quite unfamiliar with the concept of alternative splicing. I overcame these challenges by studying scientific papers and, of course, with the help and support of my mentors.

What did you learn during this project?
I have learnt a lot from working on this project. First of all, I studied a number of papers, so my understanding of this domain has increased quite a bit. My programming skills have also improved significantly. Furthermore, I learnt how to extract genomic data from Ensembl and how to use tools like InterPro and MUSCLE.

Do you feel that your study or career direction has changed based on the project?
This project aligns remarkably well with my current study and career direction, as I am interested in the fields of computational biology, bioinformatics and genomics. So although this project has not changed my direction, I believe it has strengthened it.

What piece of advice would you pass on to anyone applying to do a GSoC project?
I would advise prospective GSoC students to investigate a project diligently before applying for it. Most of the projects offered in GSoC are quite challenging and require a great deal of background knowledge and relevant experience. Fortunately, the project web pages are very comprehensive in describing the challenges and requirements of each individual project. I would suggest that new students invest ample time in analysing this information and, when they feel ready, contact the mentors and start discussing the project with them.

Zeyu Tony Yang

Mentors: Andy Yates and Leanne Haggerty

Project: Automated Processing of Primary Genome Analysis

Tony’s GitHub page for this project can be found here.

 

 

Project highlights: This GSoC project served as a pilot study for an automated genome processing pipeline, which is Ensembl’s long-term goal. In this project, I developed and deployed some Common Workflow Language (CWL) workflows to conduct genome analysis, as well as a system to monitor the progress of those analyses. In the end, the system would submit new genome assemblies for analysis once they are deposited in the European Nucleotide Archive (ENA) database.

Were there any aspects of the project you found particularly challenging, how did you overcome these?
This was my first real-world coding project where I wrote a fully functional program. I had the chance to learn and use Git for version control and tools for automated testing, and to put some advanced Python features to use. Learning many new skills and concepts quickly can be challenging, but it was extremely rewarding that everything came together in the end.

What did you learn during this project?
Besides all the “hard” coding skills I picked up during this project, I think the soft skills I learnt are equally, or even more, valuable. I learnt more about time management, efficient communication and project management (e.g. setting reachable goals and breaking down large problems into smaller ones). Needless to say, my mentors had a great influence on me and I greatly appreciate their help.

Do you feel that your study or career direction has changed based on the project?
As I am transferring from a synthetic chemistry background to the bioinformatics field, this GSoC project with Ensembl served me particularly well. Not only did I gain an overview of bioinformatics and a grasp of some of its current topics before starting my course, but this experience will also help me choose my future field of study.

What piece of advice would you pass on to anyone applying to do a GSoC project?
My advice for potential applicants would be to thoroughly understand the project and research the area before writing a proposal. More importantly, pick a subject that you are really interested in and start writing your proposal early so that your mentors have time to read through your draft.

 

 

We are applying to host GSoC projects at Ensembl again in 2019. Please keep an eye on the GSoC project pages and our Twitter account, @ensembl, to hear about the upcoming opportunities.