Today’s blog focuses on this year’s Google Summer of Code (GSoC). GSoC is an international program founded by Google in 2005 to bring together open-source organisations and developers interested in contributing to open-source software and gaining exposure to real-world software development techniques. Host organisations list project ideas, and applicants discuss these ideas directly with mentors from the organisations and devise a project proposal for Google, which issues a small stipend to successful applicants.
EMBL-EBI’s Genome Assembly and Annotation (GAA) section, which includes Ensembl, has been a GSoC mentor since 2016. The GAA is one of 168 open-source organisations that have undergone rigorous application and selection processes to ensure GSoC contributors receive the best possible mentorship for their projects. We are grateful that we have once again had the opportunity to work with Google and help contributors realise their projects. Every year we receive applications from candidates who want to learn more about writing open-source software. As of 2022, Google also welcomes applications not just from students, but from anyone over the age of 18 with an interest in open-source software development.
With the GAA GSoC projects now completed, we talked to contributors Kenny Lam, Friederike Biermann and Satya Adda, who worked with Ensembl colleagues on open-source projects. Read on to find out more about their work, their experiences and what they have learned from the opportunity.
Kenny Lam
Data science graduate of the Australian National University and final-year MSc (Bioinformatics) student at the University of Melbourne. Kenny previously interned as a data scientist and is currently working on small-molecule machine learning projects.
Using Machine Learning to Annotate Difficult Genes
Advances in the accuracy of long-read sequencing technology have allowed us to explore novel transcript variants of known genes. Gene annotation is an essential step in understanding the role of genes, and avoiding incorrect transcript and gene annotations is essential to the platform, as the research community may rely on this information to make decisions. An automated workflow has been developed to minimise the time needed to verify and annotate these transcript variants. However, the current workflow uses a very strict rule-set, so many novel transcript variants are rejected. This project aims to address this issue by using machine learning to recover good-quality but rejected transcripts, analysing how the model makes its decisions, and consequently improving the rule-set used in the automated workflow.
Friederike (Frida) Biermann
Second-year bioinformatics PhD student, privileged to work between the labs of Eric Helfrich in Frankfurt, Germany, and Marnix Medema in Wageningen, Netherlands. Frida’s research interest primarily revolves around bacterial genome mining and natural product genomics. Frida is eager to expand her scope to eukaryotes and get more insight into deep learning, as she has mostly worked on simpler machine learning algorithms.
Using Deep Learning to Identify Features of Protein-Coding Genes
Accurate gene annotation in eukaryotes based solely on genomic data has been a significant obstacle in biology since the introduction of next-generation sequencing technologies and the resulting rapid increase in available data. Traditional methods rely on homology searches to map the open reading frames to previously identified protein-coding genes with known additional experimental data, such as transcriptomics and proteomics data. This approach produces potentially inaccurate results if the genome of interest is not at least somewhat related to an already annotated genome. Additionally, gathering transcriptomics and proteomics data is labour-intensive and expensive. For that reason, there is a high demand for models that predict the location of protein-coding genes solely from inherent features of the DNA sequence of the gene. Although this is theoretically possible, methods that use, for example, hidden Markov models to detect protein-coding genes based on known gene features are often inaccurate. In this project, we will train a deep learning Transformer model to extract features of protein-coding genes to gain deeper insight into the exact properties that lead to translation. The whole workflow will include training a Conditional Random Field model to recognise candidate gene regions and then using these as input for a more fine-grained Transformer and Convolutional Neural Network hybrid model. The final pipeline will be tested against a benchmark of gold-standard annotations as well as various test sets to evaluate the influence of different parameters such as genome sequence quality, protein length or gene structure complexity.
Satya Adda
Software engineer (data science) with four years of industry experience in building large-scale data warehouses, ETL pipelines, and ML/DL model training and deployments. Satya completed his Bachelor’s in Engineering at the National Institute of Technology (NIT) Raipur, India. He is a self-taught programmer and is interested in developing his skills in distributed/scalable data architectures.
Expand the species search functionality for beta website
The search engine of any website can be one of the most useful tools for users to easily retrieve the information they want. Ensembl’s current search is based on indexed fields of our databases, which mainly cover key information, e.g. genes, species and proteins, including synonyms for each of them. However, the current search only allows exact name matches and also has limitations when it comes to retrieving synonyms or close matches from the taxonomy graph. The goal of this project is to create a standalone search tool that can handle taxonomy-related requests and address the above limitations. This tool helps to expand the Ensembl beta’s search functionality to support searching based on taxonomic information.
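To make the idea concrete, here is a minimal sketch of how a species search could go beyond exact name matches by checking synonyms, tolerating typos and walking down a taxonomy tree. It is not the project's actual implementation: it swaps in Python's standard-library difflib for a real search engine, and the tiny taxonomy, the synonym table and the function names (resolve, species_under, search) are all hypothetical, chosen only for illustration.

```python
from difflib import get_close_matches

# Hypothetical toy taxonomy (child -> parent); real data would come from
# a taxonomy database rather than a hard-coded dictionary.
PARENT = {
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
    "Pan troglodytes": "Pan",
    "Pan": "Hominidae",
    "Hominidae": "Primates",
    "Mus musculus": "Mus",
    "Mus": "Rodentia",
}

# Hypothetical synonym table (lower-case common name -> scientific name).
SYNONYMS = {
    "human": "Homo sapiens",
    "chimpanzee": "Pan troglodytes",
    "house mouse": "Mus musculus",
    "great apes": "Hominidae",
}


def resolve(query, cutoff=0.8):
    """Map a query to a taxon name: exact match, then synonym, then close match."""
    q = query.strip().lower()
    names = {name.lower(): name for name in set(PARENT) | set(PARENT.values())}
    if q in names:
        return names[q]
    if q in SYNONYMS:
        return SYNONYMS[q]
    # Fall back to fuzzy matching over both scientific names and synonyms.
    close = get_close_matches(q, list(names) + list(SYNONYMS), n=1, cutoff=cutoff)
    if not close:
        return None
    hit = close[0]
    return names.get(hit) or SYNONYMS[hit]


def species_under(taxon):
    """Return every leaf of the toy taxonomy at or below the given taxon."""
    children = {}
    for child, parent in PARENT.items():
        children.setdefault(parent, []).append(child)
    stack, leaves = [taxon], []
    while stack:
        node = stack.pop()
        kids = children.get(node)
        if kids:
            stack.extend(kids)
        else:
            leaves.append(node)
    return sorted(leaves)


def search(query):
    """Resolve the query to a taxon and list the species it contains."""
    taxon = resolve(query)
    return species_under(taxon) if taxon else []
```

In this sketch, a query such as "great apes" resolves through the synonym table to Hominidae and returns the species beneath it, while a misspelt query such as "homo sapeins" is caught by the fuzzy-matching fallback. A production tool would replace the dictionaries and difflib with a proper index and search backend.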
“I applied to GSoC in search of opportunities to make significant open source contributions and Ensembl has proven to be an amazing place to start my journey. I worked on a project to create a standalone Python app using Elasticsearch & Django to expand and improve species search functionality using the NCBI taxonomy database. Throughout the GSoC Application & Coding phase, mentors were super responsive and helped me comprehend the intricacies involved in the taxonomy search. While not having a background in bioinformatics initially appeared to be a difficulty, my mentors assisted me by providing relevant real-world examples of difficulties they were facing and expected results, which helped me speed up the project execution.”
Authors: Aleena Mushtaq, Kenny Lam, Friederike Biermann and Satya Adda