Today’s blog focuses on this year’s Google Summer of Code (GSoC). GSoC is an international program founded by Google in 2005 to bring together open-source organisations and developers interested in contributing to open-source software and gaining “exposure to real-world software development techniques”. Host organisations list project ideas; applicants discuss these ideas directly with the organisations’ mentors and submit a project proposal to Google, which issues a small stipend to successful applicants.
EMBL-EBI’s Genome Assembly and Annotation (GAA) section, which includes Ensembl, has been a GSoC mentor since 2016. The GAA is one of 198 open-source organisations that have undergone a rigorous application and selection process to ensure that GSoC applicants, also referred to as contributors, receive the best possible mentorship for their projects. We are grateful that we have once again had the opportunity to work with Google and help contributors realise their projects. Every year we receive applications from candidates who want to learn more about writing open-source software. Since 2022, Google has welcomed applications not just from students but from anyone over the age of 18 with an interest in open-source software development.
With the GAA GSoC projects now completed, we have talked to two contributors, Yantong and Rohit, who worked with Ensembl colleagues on open-source projects. Read on to find out more about their work, their experiences and what they have learned from the opportunity.
I am a fourth-year genetics and genomics PhD student at the Ocean University of China, and my primary research focus is developing next-generation sequencing (NGS) library construction methods in scallop. Recently, I have also developed a research interest in applying deep learning methods to biology.
Using machine learning to identify and classify repeat features
Several tools already exist to identify repeat features in genomes. A problem that remains is identifying repeat DNA sequences in some genes. If such sequences are used to mask the genome, genes may be missed in the downstream annotation process.
Assuming that gene sequences have various signatures relating to their function, and that repeats have different signatures, including the repetitive nature of the signal itself, we formalise the core of the problem as two subtasks: identifying repeat regions and classifying the corresponding repeat type. A deep-learning method is applied to unify those two subtasks into one. This project contains two different models, each of which can, in theory, solve the repeat identification task. The details of the models can be found in this GitHub repository.
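One way to picture how the two subtasks can be merged is to treat the problem as labelling every base in a sequence, with one label reserved for "not a repeat". The sketch below is only an illustration of that framing and is not taken from the project's models: the label names, the 0-based end-exclusive coordinates and both helper functions are assumptions made for the example.

```python
from itertools import groupby

# Hypothetical label set; index 0 is reserved for "not a repeat".
LABELS = ["none", "LINE", "SINE", "LTR"]


def intervals_to_labels(length, intervals):
    """Turn (start, end, repeat_type) annotations into one label per base.

    Coordinates are assumed to be 0-based and end-exclusive.
    """
    labels = [0] * length
    for start, end, repeat_type in intervals:
        label = LABELS.index(repeat_type)
        for pos in range(start, end):
            labels[pos] = label
    return labels


def labels_to_intervals(labels):
    """Collapse runs of identical non-zero labels back into intervals."""
    intervals = []
    pos = 0
    for label, run in groupby(labels):
        run_length = len(list(run))
        if label != 0:
            intervals.append((pos, pos + run_length, LABELS[label]))
        pos += run_length
    return intervals


# A toy 10-base sequence with two annotated repeats.
annotations = [(2, 5, "LINE"), (7, 9, "SINE")]
per_base = intervals_to_labels(10, annotations)
recovered = labels_to_intervals(per_base)
```

With this encoding, a model that predicts one label per position answers both questions at once: where the non-zero labels sit gives the repeat regions, and which label it is gives the repeat type.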
I learned a lot from this project, including deep learning methods and a better understanding of repeat sequences. When I found a description of this project in the Ensembl GSoC list of ideas, I did not have any specific solution in mind yet, but the project sounded very interesting to me. I drafted an initial proposal and sent it to the project mentors. Their response was very quick and thoughtful. They pointed out that a part of my proposal could be replaced with a deep learning algorithm. That is when I realised that the core of this task can be decomposed into identifying the repeat position and classifying the corresponding type, and that those two parts can be solved by one algorithm. I immediately sent my idea to the mentors. After some discussions with them, I was able to come up with an interesting proposal. During the coding phase, as I implemented all the parts, including data generation and the model, my mentors carefully reviewed my code, which taught me some best practices in coding. Unfortunately, the model did not perform as well as I had hoped. One of the project mentors, William, cheered me up and gave me both technical advice and moral support, which encouraged me a lot. Even though we were not able to get the desired result by the end of the project period, I believe that the project is still worth revisiting, investigating and contributing to.
Rohit is a Software Engineer with three years of full-stack software development experience. He completed his undergraduate studies in 2019 at Sikkim Manipal Institute of Technology in the field of Computer Science and Engineering.
Accessing Ensembl data with Presto and AWS Athena
The goal of this project was to build a next-generation replacement for the BioMart tool, which provides a way to download custom reports of genes, transcripts, proteins and other data types. Given the huge amount of data involved in genomics, the current tool has very limited use cases because of scalability issues. The new tool uses recent technologies such as AWS Athena (based on Presto) and the Parquet/ORC file formats to build a scalable solution. The end solution is full-stack software that demonstrates the feasibility of the proposed system architecture for addressing these scalability issues. It consists of a Python script that migrates genomic data from Ensembl’s MySQL database to Parquet files, which are then stored on AWS S3. The backend provides a user-friendly Application Programming Interface (API) layered over the AWS APIs for requesting the required genomic data. The frontend lets users interact with this system through a graphical user interface (GUI) to fetch genomic data for the desired data type and species, with appropriate filters.
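To make the backend's role a little more concrete, here is a minimal sketch of how a request for a data type, a species and some filters might be turned into a SQL query for a Presto-style engine. It is not the project's actual code: the `SCHEMA` mapping, the table and column names and the `build_query` function are all hypothetical, and the sketch stops at building the query rather than calling any AWS service.

```python
# Hypothetical mapping from a data type to its table and filterable columns.
SCHEMA = {
    "gene": {"table": "gene", "columns": {"biotype", "chromosome"}},
    "transcript": {"table": "transcript", "columns": {"biotype", "gene_id"}},
}


def build_query(data_type, species, filters=None, limit=100):
    """Return (sql, params) for one data type and species.

    Table and column names come only from the allowlist in SCHEMA, while
    user-supplied values are kept out of the SQL text and returned
    separately as positional parameters.
    """
    if data_type not in SCHEMA:
        raise ValueError(f"unknown data type: {data_type}")
    entry = SCHEMA[data_type]

    conditions = ["species = ?"]
    params = [species]
    # Sort the filters so the generated SQL is deterministic.
    for column, value in sorted((filters or {}).items()):
        if column not in entry["columns"]:
            raise ValueError(f"cannot filter {data_type} by {column}")
        conditions.append(f"{column} = ?")
        params.append(value)

    where_clause = " AND ".join(conditions)
    sql = f"SELECT * FROM {entry['table']} WHERE {where_clause} LIMIT {int(limit)}"
    return sql, params


sql, params = build_query("gene", "homo_sapiens", {"biotype": "protein_coding"})
```

Keeping the values separate from the SQL text means the frontend's filter choices never get pasted directly into the query, and the allowlist limits requests to columns the backend knows about.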
Contributing to open source has been on my checklist for a long time, especially in the field of bioinformatics. Ensembl proved to be a great place to cross that item off the list: I worked on a new system to make genomic data accessible. My mentors were really supportive and helpful throughout this journey and helped me navigate the complexities of a project revolving around bioinformatics, a field in which I had no background. I learned a lot along the way, both from hands-on experience and from my mentors. The GSoC experience helped me build significant competency in the full-stack application development lifecycle, as I not only developed the applications using the latest technologies but also managed and deployed them on AWS.