GSoC 2019: Our students and their projects

Google Summer of Code (GSoC) is a programme set up by Google to introduce students to open source software development. It links students with open source organisations such as Ensembl. The students work remotely with their GSoC project mentors during the university summer break and are paid by Google. Both students and organisations go through a rigorous application and selection process, which ensures that the students are among the very best and that the organisations are committed to mentoring them and their projects effectively. We think that GSoC is a great programme for students as well as for Ensembl as an open source organisation, and we are glad that we had the opportunity to be part of it again this year!

We are very happy that we had three great students working on GSoC projects with us this summer. Two of them, Harshit Gupta and Srijan Verma, used Machine Learning to enhance orthology calls and characterise lncRNA genes, respectively. Praduman Goyal worked on a web-based dashboard for circular RNA analytics. GSoC 2019 has just finished, so here you can find out about our students, their projects and how their GSoC experience has been.

Harshit Gupta

Student background:

I am a third-year Computer Science Engineering student at the B.M.S. Institute of Technology, Bengaluru, India. My current interests include Machine Learning, Deep Learning, Graphical Models, Reinforcement Learning and Microeconomics. I have previously worked on projects involving Natural Language Processing and Reinforcement Learning.

Supervisors: Mateus Patricio and Matthieu Muffato

Project: Using Deep Learning techniques to enhance orthology calls

Project overview:

All Ensembl gene sequences are compared against the TreeFam HMM library in order to classify them into clusters, which are then used to build gene trees, infer homologues and produce gene families. These trees represent the evolutionary history of gene families, each of which evolved from a common ancestor. Details of our protein tree pipeline are here. Reconciliation of these gene trees against the species trees allows us to distinguish duplication and speciation events, resulting in different types of homologues (orthologues and paralogues). This project aimed to apply Machine Learning algorithms such as Deep Learning Neural Networks to validate the homologies predicted with our method, and to infer new ones based on other properties of the data that are currently not being considered, such as local synteny or divergence rates.
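
To illustrate how duplication and speciation events translate into paralogues and orthologues, here is a minimal Python sketch. It does not reproduce the Ensembl pipeline or a full reconciliation against a species tree. Instead it uses a simpler species-overlap heuristic: a gene tree node is labelled a duplication if its child subtrees share a species, and a speciation otherwise. The Node class, the function names and the toy genes are all hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional


@dataclass
class Node:
    """Hypothetical gene tree node: leaves carry a gene and a species,
    internal nodes only carry children."""
    gene: Optional[str] = None
    species: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaf_list(node: Node) -> List[Node]:
    """Return all leaves below (or equal to) the given node."""
    if not node.children:
        return [node]
    found: List[Node] = []
    for child in node.children:
        found.extend(leaf_list(child))
    return found


def classify_pairs(root: Node) -> Dict[FrozenSet[str], str]:
    """Label every gene pair by the event at its last common ancestor,
    using the species-overlap heuristic (an assumption of this sketch)."""
    pairs: Dict[FrozenSet[str], str] = {}

    def visit(node: Node) -> None:
        if not node.children:
            return
        groups = [leaf_list(child) for child in node.children]
        species_sets = [{leaf.species for leaf in group} for group in groups]
        # Duplication if any two child subtrees contain the same species.
        is_duplication = any(a & b for a, b in combinations(species_sets, 2))
        relation = "paralogue" if is_duplication else "orthologue"
        # Genes in different child subtrees have this node as their ancestor.
        for left, right in combinations(groups, 2):
            for gene_a in left:
                for gene_b in right:
                    pairs[frozenset((gene_a.gene, gene_b.gene))] = relation
        for child in node.children:
            visit(child)

    visit(root)
    return pairs


def toy_tree() -> Node:
    """((human_A, mouse_A), (human_B, mouse_B)): one duplication at the root,
    followed by a speciation in each copy."""
    copy_a = Node(children=[Node("human_A", "human"), Node("mouse_A", "mouse")])
    copy_b = Node(children=[Node("human_B", "human"), Node("mouse_B", "mouse")])
    return Node(children=[copy_a, copy_b])


if __name__ == "__main__":
    for pair, relation in sorted(classify_pairs(toy_tree()).items(), key=lambda kv: sorted(kv[0])):
        print(sorted(pair), relation)
```

In this toy tree, genes separated by the root duplication come out as paralogues, while genes separated only by a speciation come out as orthologues. Labels of this kind are what a Machine Learning model could be asked to validate.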

Experience:

I come from a Mathematics background, so I never had the chance to study Biology or Bioinformatics in depth. Overcoming the resulting knowledge gaps was a challenge for me. Needless to say, I had really great mentors. They helped me a lot, not only during the course of the project but also during the proposal period, when we were looking for features to be included in the Machine Learning model. The last three months have been amazing for me; I have learned so many different concepts across multiple domains. The GSoC project helped me to recognise and solve problems that can arise while creating and improving datasets, a really important step in any Deep Learning project. Since all the datasets in Comparative Genomics are huge, making this project viable and feasible for those kinds of scenarios was a great learning experience (I finally learned and applied multiprocessing in Python!). We trained several models with different class compositions and checked the accuracy across around 66 different scenarios. The accuracies ranged from 25-99% with the earlier models to 65-99% with the newest models.

GitHub: https://github.com/EnsemblGSOC/compara-deep-learning

Srijan Verma

Student background:

Final-year B.Pharm. student at Birla Institute of Technology and Science, Pilani, Rajasthan, India. Drummer (check out my latest music release here!), coder and basketball enthusiast. Currently pursuing my undergraduate thesis on ‘Drug Metabolism Prediction’ in the Sirimulla Research Group, University of Texas, El Paso, US.

Supervisor: Daniel Zerbino

Project: Applying machine learning techniques to characterising and naming lncRNA genes

Project overview:

Advances in RNA sequencing technologies have revealed the complexity of our genome. Long non-coding RNAs (lncRNAs) make up the majority of the non-coding transcriptome. Understanding the significance of this RNA world is one of the most important challenges faced in biology today, and the lncRNAs within it represent a gold mine of potential new biomarkers and drug targets. Their discovery is still at a preliminary stage, and to date very few lncRNAs have been characterised in detail. However, it is clear that they are important regulators of gene expression; lncRNAs are thought to have a wide range of functions in cellular and developmental processes. There are many specialised databases that include lncRNAs, such as RefSeq, GENCODE, Ensembl, SGD and TAIR. The primary aim of this project was to implement a Machine Learning model (a second pass filter) which would validate credible calls (true positive cases) produced by two of the world’s major gene annotation databases, Ensembl and RefSeq.

Experience:

My experience with Ensembl has been really amazing. I had previously done a couple of projects in Machine Learning which were not related to the discipline I was studying, a Bachelor of Pharmacy. Working at Ensembl combined my love for Machine Learning with the core subjects that I was actively studying at my college. I knew that if I could find an intersection of these two fields, I would be able to use my knowledge of both to the best of my abilities. I also already knew about Ensembl’s projects in Bioinformatics and found them very interesting. The three months working on the GSoC project have been an absolutely life-changing experience for me. I feel extremely grateful towards the whole community of ‘Genes, Genomes and Variation’, especially towards my mentor Daniel Zerbino. He was always very approachable and helped me with coding as well as with the biological aspect of the project. The weekly video sessions with him, which normally happened every Thursday, were the sessions that I looked forward to most. In these sessions, he listened to all my queries patiently and answered them with clarity. There were also teaching sessions, apart from discussions! These 1-to-1 sessions with my mentor have been among the best learning experiences of my GSoC journey. At Ensembl, I learned a lot every day, which truly made the journey a lot of fun.

GitHub: https://github.com/EnsemblGSOC/srijan-gsoc-2019

Praduman Goyal

Student background:

I am a junior-year student of the Integrated MSc in Applied Mathematics at the Indian Institute of Technology, Roorkee, India. My love for computers and various computer languages began in 8th standard, when I was introduced to HTML as part of my curriculum. Since then, I have developed a passion for exploring more in the field of Computer Science and applying its concepts to develop daily-use applications. In my college years, I explored various web technologies and studied ReactJS and Django.

Supervisors: Osagie Izuogu and Fergal Martin

Project: Circular RNA analytics frontend

Project overview:

In this project, we built a prototype for the first-ever circular RNA (circRNA) analytics dashboard named ‘ecircdb’. It contains the following features that were suggested to be of highest priority for the prototype:

  1. Species view: This view contains highlights, plots and export list options based on the selected genome assembly of a species, tissues, minimum NMethods, TPM, etc.
  2. Sample list & Sample view: After selecting a sample from a list of samples for an assembly, this view shows statistics, plot, quality report and export list options for the selected sample.
  3. Location view: This view integrates the Genoverse browser with ecircdb. The browser shows the circRNA track for a given assembly alongside the gene track. It allows users to select from the top-X chromosomes based on circRNAs and to select coordinates for all the circRNA-producing genes.
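
As a rough illustration of the kind of filtering and export that such views rely on, here is a minimal Python sketch. It is not taken from the ecircdb code. The CircRNA record, its field names and the two helper functions are hypothetical, and the sketch assumes that NMethods is the number of detection methods supporting a circRNA, which this post does not state.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class CircRNA:
    """Hypothetical circRNA record; field names are assumptions of this sketch."""
    circ_id: str
    tissue: str
    n_methods: int  # assumed: number of detection methods supporting the call
    tpm: float      # expression in transcripts per million


def filter_circrnas(
    records: Iterable[CircRNA],
    tissues: Optional[Set[str]] = None,
    min_n_methods: int = 1,
    min_tpm: float = 0.0,
) -> List[CircRNA]:
    """Keep records from the selected tissues that meet both minimum thresholds."""
    return [
        record
        for record in records
        if (tissues is None or record.tissue in tissues)
        and record.n_methods >= min_n_methods
        and record.tpm >= min_tpm
    ]


def export_list(records: Iterable[CircRNA]) -> str:
    """Serialise the records as CSV text, as an export list option might."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["circ_id", "tissue", "n_methods", "tpm"])
    for record in records:
        writer.writerow([record.circ_id, record.tissue, record.n_methods, record.tpm])
    return buffer.getvalue()


# Toy data, invented purely for this example.
EXAMPLE_RECORDS = [
    CircRNA("circ_1", "brain", 3, 12.5),
    CircRNA("circ_2", "brain", 1, 40.0),
    CircRNA("circ_3", "liver", 2, 0.4),
    CircRNA("circ_4", "liver", 2, 7.0),
]

if __name__ == "__main__":
    selected = filter_circrnas(EXAMPLE_RECORDS, tissues={"brain", "liver"},
                               min_n_methods=2, min_tpm=1.0)
    print(export_list(selected))
```

With the toy data, requiring at least two methods and a TPM of at least 1.0 keeps only circ_1 and circ_4.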

Experience:

It has been a great experience to work with Ensembl on this project. I was supervised by Fergal Martin and Osagie Izuogu. They were very helpful and approachable, and throughout the whole project they provided guidance on the different aspects involved. It was my first time working on data visualisation, so it was a great learning experience for me. Coming from a non-Biology background, it was a little difficult at the start to brainstorm and plan the app flow, but with good guidance and reference material, I was able to complete the project successfully. In the first phase of the project, we developed the flow of the app. The second phase involved creating a backend and frontend environment where we could add statistics and plots. This was the pure development part and familiar to me. The third phase involved generating the actual plots from the circRNA data. This was quite intensive but fun. We used pandas and PlotlyJS to plot various kinds of data. This phase also included integrating the Genoverse browser with the dashboard, which was challenging. I wrote the documentation for the project so that the prototype can easily be developed further or extended into a better product.

GitHub:

https://github.com/EnsemblGSOC/ecircdb-backend

https://github.com/EnsemblGSOC/ecircdb-frontend

https://github.com/EnsemblGSOC/ecircdb-genoverse