Project ideas

Ensembl is part of the Genome Assembly and Annotation (GAA) section of EMBL-EBI. The GAA section contains other popular services such as MGnify, HGNC/VGNC and WormBase. Below are project ideas related to the activities of the section. GAA is able to host funded contributors on summer projects or internships at various times throughout the year.

Projects: Ensembl Genebuild

Using Deep Learning to Classify Repeat Features

Brief Explanation

Finding and classifying repetitive DNA sequence in eukaryotic genomes is both an important first step ahead of further genome annotation, and also interesting in its own right as repeats frequently drive genome evolution. Repeats in DNA can be broken into a number of different major classes such as LINEs, SINEs and LTRs. Global biodiversity efforts such as Darwin Tree of Life, the European Reference Genome Atlas and the Earth BioGenome Project are producing hundreds and soon thousands of high-quality reference genomes that will all need repeat annotation. Currently we have two potential approaches to annotating repeats. The first is building a repeat library for a species (using RepeatModeler) and then annotating the repeats on the genome (using RepeatMasker). This method both finds and classifies the repeats, including lineage-specific repeats; however, building a repeat library is computationally costly. The second approach is to use an extremely fast k-mer approach (REpeatDetector, aka Red) to mask the genome in a fraction of the time. The downside is that this approach does not classify repeats and so is not very informative for researchers studying repeat evolution.

In this project we want to explore Deep Learning in order to help classify repeats. We have large existing training sets across hundreds of species, spanning billions of classified repeats. As part of this project you would train a neural network to take as input an unclassified repeat sequence and label it according to the class of repeats it belongs to. You will explore the most efficient approach in terms of both preparing the training data and constructing the network. If the training is successful, we will then test the resulting model in terms of both accuracy and compute efficiency: does the model produce results similar to our existing method of classification (building a repeat library for the species and then using it to find and classify repeats), and what is the relative compute cost of each approach?
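To give a flavour of the kind of model involved, below is a minimal sketch of a sequence classifier in PyTorch. The class labels, fixed sequence length and tiny CNN architecture are placeholder assumptions for illustration only, not the project design; the real label set and architecture would come out of the exploration described above.

    import torch
    import torch.nn as nn

    # Placeholder repeat classes; the real label set would come from the training data
    REPEAT_CLASSES = ["LINE", "SINE", "LTR", "DNA", "Unknown"]
    BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

    def one_hot(seq, length=2000):
        """One-hot encode a DNA sequence, padding/truncating to a fixed length."""
        x = torch.zeros(4, length)
        for i, base in enumerate(seq[:length].upper()):
            if base in BASES:
                x[BASES[base], i] = 1.0
        return x

    class RepeatClassifier(nn.Module):
        """Tiny 1D CNN mapping a one-hot DNA sequence to a repeat class."""
        def __init__(self, n_classes=len(REPEAT_CLASSES)):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(4, 32, kernel_size=15, padding=7), nn.ReLU(),
                nn.MaxPool1d(4),
                nn.Conv1d(32, 64, kernel_size=9, padding=4), nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),
            )
            self.fc = nn.Linear(64, n_classes)

        def forward(self, x):
            return self.fc(self.conv(x).squeeze(-1))

    if __name__ == "__main__":
        model = RepeatClassifier()
        batch = torch.stack([one_hot("ATGCGT" * 100), one_hot("TTAGGG" * 80)])
        logits = model(batch)        # shape: (2, n_classes)
        print(logits.argmax(dim=1))  # predicted class indices

Training such a model against our existing labelled repeats, and comparing its per-sequence cost to a full RepeatModeler library build, would be the core of the efficiency comparison.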

Depending on the success and progress related to the above, there may also be the opportunity to take the project a step further, towards generative repeat library construction: given a fast k-mer derived set of repeat sequences and their coordinates on the genome, is it possible to generate a repeat library? This would be highly experimental and only considered after fast and excellent progress on the core project.

Expected results

  • A Deep Learning model for classifying repeat sequences into major classes
  • A comparison of the efficiency of said model to our traditional approach in terms of compute cost

Required knowledge

  • ML frameworks such as PyTorch/Keras or similar
  • Python

Desirable knowledge

  • Understanding of repeat biology and associated software
  • Training on Slurm/LSF

Difficulty

Medium

Length

350h

Mentors

Fergal Martin, Leanne Haggerty

A Nextflow Pipeline for Repeat Annotation

Brief Explanation

Finding and classifying repetitive DNA sequence in eukaryotic genomes is both an important first step ahead of further genome annotation, and also interesting in its own right as repeats frequently drive genome evolution. Global biodiversity efforts such as Darwin Tree of Life, the European Reference Genome Atlas and the Earth BioGenome Project are producing hundreds and soon thousands of high-quality reference genomes that will all need repeat annotation. Currently our repeat annotation pipelines are run via an in-house workflow management system, eHive. eHive is Perl-based and nearing end of life, and as a result we are transitioning much of our infrastructure to other workflow managers such as Nextflow.

In this project you will work together with us to help redesign our repeat annotation pipeline. We will identify all the existing components, decide what to keep and what to remove, and then come up with a final workflow. You will then implement this workflow using Nextflow and test the deployment both locally and with our various cloud partners. Time permitting, we will work on costing the pipeline using a variety of species to come up with a cost per gigabase of sequence to mask repeats. Similarly, if there is additional time, we will look at large scale deployment of the pipeline across our species to build a consistent set of repeat resources for public use.
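As a very small illustration of the costing step mentioned above, the sketch below summarises cost per gigabase from a table of pipeline runs. The CSV column names and the per-CPU-hour rate are invented for the example; real figures would come from the scheduler and cloud billing.

    import csv

    CPU_HOUR_COST = 0.05  # assumed price per CPU hour; purely illustrative

    def cost_per_gigabase(runs_csv):
        """Summarise repeat-masking cost per gigabase from a CSV of pipeline runs.

        Expects columns: species, genome_bp, cpu_hours (a hypothetical layout).
        """
        with open(runs_csv) as fh:
            rows = list(csv.DictReader(fh))
        total_gb = sum(int(r["genome_bp"]) for r in rows) / 1e9
        total_cost = sum(float(r["cpu_hours"]) for r in rows) * CPU_HOUR_COST
        return total_cost / total_gb if total_gb else 0.0

    if __name__ == "__main__":
        print(f"~${cost_per_gigabase('runs.csv'):.2f} per Gb masked")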

Expected results

  • A Nextflow pipeline to generate resources related to repetitive elements in eukaryotic genomes
  • Test deployment in a production environment

Required knowledge

  • Nextflow
  • Python

Desirable knowledge

  • Cloud deployment
  • Containerisation
  • Slurm/LSF

Difficulty

Medium

Length

350h

Mentors

Leanne Haggerty, Thiago Genez, Fergal Martin

Using Deep Learning to Identify Features of Protein-Coding Genes

Brief Explanation

Protein-coding genes form the basis of many scientific analyses. They have direct links to important real-world problems such as human health, food security and ecosystem conservation. Global biodiversity efforts such as Darwin Tree of Life, the European Reference Genome Atlas and the Earth BioGenome Project are producing hundreds and soon thousands of high-quality reference genomes, and these genomes need structural annotation of genes.

Protein-coding genes are made up of many distinct features such as exons, introns, splice sites, codons, transcription/translation initiation and termination sites, UTRs and promoters. These features have associated signals that can be very strong; for example, coding regions almost always start with the sequence ATG, followed by triplets of nucleotides, and end in either TAG, TAA or TGA. Similarly, splice sites usually have a GT donor and an AG acceptor sequence. Other signals such as promoters or transcription initiation/termination sites can be more complex and degenerate. It should be possible to describe the potential set of genes and the coordinates of the underlying features from the DNA sequence of the genome alone; however, existing methods to do this, often based on Hidden Markov Models (HMMs), are generally inaccurate and do not produce a highly detailed annotation of the genes.
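To make the strong signals described above concrete, here is a deliberately naive sketch that scans a forward-strand sequence for ATG...stop open reading frames and canonical GT/AG splice dinucleotides. It is illustrative only, and its simplicity is exactly why signal scanning alone struggles with real, intron-containing genes and degenerate signals.

    import re

    START = "ATG"
    STOPS = {"TAA", "TAG", "TGA"}

    def naive_orfs(dna, min_codons=50):
        """Yield (start, end) of simple open reading frames on the forward strand."""
        dna = dna.upper()
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(dna) - 2, 3):
                codon = dna[i:i + 3]
                if orf_start is None and codon == START:
                    orf_start = i
                elif orf_start is not None and codon in STOPS:
                    if (i + 3 - orf_start) // 3 >= min_codons:
                        yield orf_start, i + 3
                    orf_start = None

    def candidate_splice_sites(dna):
        """Positions of canonical GT donor and AG acceptor dinucleotides."""
        dna = dna.upper()
        donors = [m.start() for m in re.finditer("GT", dna)]
        acceptors = [m.start() for m in re.finditer("AG", dna)]
        return donors, acceptors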

Traditional methods to produce a high-quality annotation generally use RNA or protein data and attempt to match these data against corresponding regions of the genome to identify the genes and the corresponding substructures. The problem with these methods is that the input data are costly and time consuming to generate (in the case of RNA), or may be taken from other species, where the greater the evolutionary distance, the higher the number of errors in the resulting annotation (often the case for protein data). As such, traditional methods will not work for every species, as the appropriate RNA/protein data will not be available in the majority of cases.

In this project we want to explore Deep Learning in order to help accurately identify structures associated with protein-coding genes from the genome of the species. Together we will construct a training set of high-confidence gene annotations using existing annotation from both our data and other sources for hundreds of eukaryotic genomes. We will attempt to first find high level features, i.e. regions of the genome likely to contain protein-coding genes based on their k-mer profiles and absence of repetitive sequence, which we will train a model to recognise. Once this initial model is accurately identifying candidate regions, we will build a second model for fine-grained feature extraction.
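A minimal sketch of the kind of input the high-level model might consume is shown below: fixed-size windows across the genome, each summarised as a normalised k-mer frequency vector. The window size, step and value of k are illustrative assumptions, not the project design.

    from collections import Counter
    from itertools import product

    def kmer_profile(window, k=4):
        """Return a normalised k-mer frequency vector for a genomic window."""
        kmers = ["".join(p) for p in product("ACGT", repeat=k)]
        counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
        total = max(sum(counts[km] for km in kmers), 1)
        return [counts[km] / total for km in kmers]

    def windows(genome_seq, size=5000, step=2500):
        """Slide fixed-size, half-overlapping windows across a sequence."""
        for start in range(0, max(len(genome_seq) - size + 1, 1), step):
            yield start, genome_seq[start:start + size]

    # Each (window start, 256-dimensional profile) pair becomes one training
    # example, labelled by whether the window overlaps a known protein-coding gene.
    examples = [(s, kmer_profile(w.upper())) for s, w in windows("ACGT" * 2000)]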

We will work with you to build the high confidence set of training data to use in the analysis. Your main task will be to implement the high-level and fine-grained models and to test different network architectures and methods of pre-processing the training data. We will work together on analysing the results against gold standard, near-complete annotations, and also against species annotated with a variety of other methods and to varying levels of quality.

Expected results

  • A Deep Learning approach to identifying key features of protein-coding genes
  • A comparison of the results against gold standard reference annotations and other annotation approaches

Required knowledge

  • ML frameworks such as PyTorch/Keras or similar
  • Python

Desirable knowledge

  • Understanding of gene annotation and associated software
  • Training on Slurm/LSF

Difficulty

Very hard

Length

350h

Mentors

Fergal Martin, Leanne Haggerty

Improved Transcript Representation in Non-Model Organisms via Deep Learning

Brief Explanation

Genes form the basis of many scientific analyses. They have direct links to important real-world problems such as human health, food security and ecosystem conservation. Global biodiversity efforts such as Darwin Tree of Life, the European Reference Genome Atlas and the Earth BioGenome Project are producing hundreds and soon thousands of high-quality reference genomes, and these genomes need structural annotation of genes.

Genes are made up of many different features, but exons are arguably the key feature as they represent the blocks of the genome that are transcribed into RNA, which may form functional structures, regulate the expression of other genes or encode proteins. Under certain conditions exons may be included or skipped in the transcribed RNA, sometimes leading to different functional outcomes. A particular permutation of exons that forms a transcribed RNA is known as a transcript. While there is often one particular transcript that represents the normal state of the gene, and is thus most prevalent, it is very common to have alternative transcripts expressed, particularly in higher eukaryotes. These may be expressed in different tissues or at different points in time, or simply expressed continuously but at a lower level than the dominant transcript.

It is important to have as complete a representation of the full set of transcripts in a gene as possible. Short read sequencing is a common method for finding alternative transcript structures; however, the nature of the technology means we cannot be certain that the permutations of exons we infer from short read data actually exist in reality. Long read data allow us to directly observe full-length RNA and thus should allow us to confidently identify alternative transcripts, but the technology is less commonplace and also does not capture as many genes as short read data. Fragmented reads are also frequently present.

The objective of this project will be to examine methods of better representing potential full length transcripts via deep learning. We will perform our tests in mammals, where there are several high-quality reference annotations (human and mouse in particular). We will take genes from mammals where large quantities of long read data are available and identify high confidence sets of alternative transcripts. We will then utilise the union of the exons described in these transcripts to help train a model capable of validating alternative transcripts. There are two approaches we could take. The first would be to find the longest possible exon chain, assume this is the dominant transcript and automatically generate a set of alternative transcripts with exon skipping, where the model would produce a binary output as to whether or not a permutation is valid. This approach would be straightforward, but as some genes can have many exons, it could generate many permutations. The other approach would be to build a generative model, where the input is the union of all unique exons across the input set and the output is a set of transcripts and the exons contained in each. This would be more robust, but would require a more complex model.
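For the first approach, the permutation generation itself is simple to sketch, and the combinatorial growth it warns about is easy to see. The sketch below assumes the longest exon chain has already been identified and is purely illustrative.

    from itertools import combinations

    def exon_skipping_candidates(exons, max_skipped=2):
        """Generate candidate transcripts by skipping internal exons.

        `exons` is an ordered list of (start, end) tuples for the longest exon
        chain (assumed here to be the dominant transcript). Skipping is limited
        to internal exons and capped at `max_skipped`, since the number of
        permutations grows combinatorially with exon count.
        """
        internal = range(1, len(exons) - 1)
        candidates = []
        for n in range(1, max_skipped + 1):
            for skipped in combinations(internal, n):
                candidates.append([e for i, e in enumerate(exons) if i not in skipped])
        return candidates

    # Toy example: a five-exon gene. Each candidate would be passed to the
    # trained model for a binary valid / not valid decision.
    dominant = [(100, 200), (300, 380), (500, 590), (700, 780), (900, 1000)]
    for candidate in exon_skipping_candidates(dominant):
        print([exon[0] for exon in candidate])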

The project will involve you working with us to identify suitable training data from our existing annotations and assessing and implementing a suitable approach to using the data to train a model. Your work will help decide which approach is most viable and you will be responsible for implementing and training the corresponding model.

We will test the resulting model in terms of how accurately it can validate true alternative transcripts in both gold standard and non-model mammalian species. Time permitting, we may consider extending past mammals into other eukaryotes to see how generalisable the approach is.

Expected results

  • A Deep Learning approach to identifying valid alternative transcripts
  • A comparison of the results against gold standard reference annotations and non-model annotations

Required knowledge

  • ML frameworks such as PyTorch/Keras or similar
  • Python

Desirable knowledge

  • Understanding of gene annotation, particularly transcriptomic data and associated software
  • Training on Slurm/LSF

Difficulty

Hard

Length

350h

Mentors

Fergal Martin, Leanne Haggerty, Adam Frankish

Improving the definition of UTR boundaries via Deep Learning

Brief Explanation

Untranslated regions (UTRs) represent the boundaries of protein-coding genes. These regions are important for understanding where one gene ends and a neighbouring gene starts. UTRs sometimes house features that regulate the expression of the gene, and they are also key to analysing gene expression when using single cell data.

Annotating UTRs is difficult. It is clear from long and short read transcriptomic data that there is rarely a precise start/end to the UTRs of a gene. There are usually regions where the transcriptional machinery is more likely to attach or detach. In particular, short read data (which is most frequently available) is naturally imprecise for determining the start/end of the UTR, as each read represents a small fragment of the gene. If the sampling of these small fragments is uneven, it leads to incorrect identification of the start/end. At the same time, the cellular machinery for transcription is able to identify these binding/release regions despite not fundamentally changing across eukaryotes, so it should be possible to identify their approximate locations directly from the genome sequence.

In this project we will explore the use of long read data and high-quality reference annotations to train a model to predict the location of a UTR start or end from the sequence adjacent to a coding region start/end. While it will not be possible to do this for all UTRs, particularly ones that are very long or have large introns contained within, we will be able to train a model to predict simple UTR starts/ends within a fixed window. This will assist with better representation of UTRs, particularly in species lacking transcriptomic data.

We will work together to build a training set consisting of genes where we are confident we have captured representative UTR boundaries. When several possible boundaries are present in one of these genes, we will select the longest UTR boundary, unless it is infrequently observed relative to the number of long reads mapped to the gene (in which case the boundary will instead be set to the longest UTR observed in more than 20 percent of the reads). We will use as much of the sequence of the flanking region as possible, along with the coordinate of the selected boundary, to train the model to predict the boundary coordinates. You will be responsible for building the network and testing different hyperparameters during training. We will then compare to gold standard reference annotations and look at the approximate distance between the predicted and true boundaries to evaluate the model.
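The boundary-selection rule described above could look roughly like the sketch below, which picks the most distal read end unless it falls below the 20 percent support threshold. The input format (a flat list of read end coordinates for one gene, 3' end on the forward strand) is a simplifying assumption; the real rule would also handle strand and 5' boundaries.

    from collections import Counter

    def select_utr_boundary(read_ends, min_fraction=0.2):
        """Pick a representative UTR boundary from long-read end coordinates.

        Uses the most distal end observed, unless it is supported by fewer than
        `min_fraction` of the reads mapped to the gene, in which case the most
        distal end that does clear the threshold is used instead.
        """
        counts = Counter(read_ends)
        total = len(read_ends)
        for boundary in sorted(counts, reverse=True):
            if counts[boundary] / total >= min_fraction:
                return boundary
        return max(read_ends)  # nothing clears the threshold: keep the longest

    # Toy example: one distal end seen once, a shorter end supported by most reads
    print(select_utr_boundary([1530, 1490, 1490, 1490, 1488, 1490]))  # -> 1490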

Expected results

  • A Deep Learning model for predicting the coordinates of 5′ and 3′ UTR boundaries
  • A comparison of the results against gold standard reference annotations

Required knowledge

  • ML frameworks such as PyTorch/Keras or similar
  • Python

Desirable knowledge

  • Understanding of gene structures, particularly UTRs
  • Training on Slurm/LSF

Difficulty

Hard

Length

350h

Mentors

Fergal Martin, Leanne Haggerty, Adam Frankish

Projects: Software Development for MGnify

MGnify Data Visualisations

Brief Explanation

MGnify (https://www.ebi.ac.uk/metagenomics) is a freely available hub for the analysis and exploration of metagenomic, metatranscriptomic, amplicon and assembly data. The resource provides rich functional and taxonomic analyses of user-submitted sequences, as well as analysis of publicly available metagenomic datasets held within the European Nucleotide Archive (ENA).

The public-facing service is a React.js website backed by a Python/Django REST API, which serves metagenomics data and associated analyses via API endpoints and data files. There are also micro-services for specific tasks like sequence searches. In addition to the website, MGnify provides hosted Jupyter Notebooks to cover extra use cases and showcase how the MGnify API-provided data can be used in downstream data analysis tasks (using R and Python).

Together, the website and notebooks include many data visualisations built using various technologies: Highcharts (JavaScript) for website graphics like nucleotide distributions, specialised JavaScript components like the Integrative Genomics Viewer for genome annotations, and matplotlib and ggplot for graphics created in the Jupyter notebooks.

As MGnify approaches the release of our next-generation analysis pipeline, the aim is to develop a reusable framework for managing these visualisations. Specifically, we aim to reuse components and libraries in as many places as possible, and to support FAIR (Findable, Accessible, Interoperable, Reusable) principles by enabling our users to easily build upon the visualisations we provide. An example could be: the MGnify website using a d3.js histogram to display protein annotation information, from where users can jump to an Observable JS Notebook with the required API fetching code and d3 visualisation code ready for them to modify to produce a graphic suitable for their own publication.
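As a rough illustration of the "fetch from the API, then visualise" pattern on the notebook side, the Python sketch below pulls one page of biomes from the public MGnify API and plots sample counts with matplotlib. The endpoint path and attribute names are assumptions for illustration and should be checked against the MGnify API documentation; the website side of the project would use JavaScript libraries such as d3.js instead.

    import requests
    import matplotlib.pyplot as plt

    # Base URL, endpoint and attribute names are assumptions; see the MGnify
    # API browser at https://www.ebi.ac.uk/metagenomics/api/ for the real schema.
    API = "https://www.ebi.ac.uk/metagenomics/api/v1"

    def fetch_biome_sample_counts(page_size=25):
        """Fetch one page of biomes and their sample counts from the MGnify API."""
        resp = requests.get(f"{API}/biomes", params={"page_size": page_size}, timeout=30)
        resp.raise_for_status()
        return {
            item["attributes"]["biome-name"]: item["attributes"]["samples-count"]
            for item in resp.json()["data"]
        }

    if __name__ == "__main__":
        biomes = fetch_biome_sample_counts()
        plt.barh(list(biomes), list(biomes.values()))
        plt.xlabel("Samples")
        plt.tight_layout()
        plt.savefig("biome_sample_counts.png")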

Expected results

  • Propose a rational approach to data visualisations across MGnify frontends
  • Migrate a subset of the existing website visualisations to conform with the new approach
  • Implement user-editable visualisations (e.g. Notebooks)
  • Document and provide code examples for the approaches and libraries used

Required knowledge

  • Python: some data analysis experience required (e.g. Pandas, Matplotlib, Spark)
  • Javascript: some modern front-end work required (e.g. React) and visualisation experience (e.g. d3.js)
  • Ideally some experience with notebook coding: Jupyter or Observable.

Difficulty

Adaptable

Expected size of project (175 or 350 hours)

175 hours

Mentors

Martin Beracochea, Alexander Rogers

Projects: HAVANA

Defining gene boundaries

Brief Explanation

Understanding the impact of genetic variation on disease requires comprehensive gene annotation. Human genes are well characterised following more than two decades of work on their annotation; however, we know that this annotation is not complete, and new experimental methods are generating data to help us towards the goal of complete gene annotation. Long transcriptomic reads allow us to identify and annotate many new features, including the start and end of a transcript, which can be combined to give information at the gene level. We would like to develop a pipeline to extract long transcriptomic data from the European Nucleotide Archive (ENA), map it to the human reference genome and extract the terminal co-ordinates to create a growing collection of transcript start/end positions. These data will support improving the accuracy of annotation of individual transcripts and genes, and give insight into any differences in transcript start and end sites across different tissues.
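A small sketch of the "extract the terminal co-ordinates" step is shown below, assuming the long reads have already been downloaded from ENA and aligned to the reference (for example with minimap2) into a coordinate-sorted, indexed BAM file; it uses pysam to report one terminus pair per primary alignment.

    import pysam

    def transcript_termini(bam_path, min_mapq=20):
        """Yield (chrom, start, end, strand, read_name) for each aligned long read.

        Assumes a coordinate-sorted, indexed BAM of long transcriptomic reads
        aligned to the reference genome.
        """
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            for read in bam.fetch():
                if read.is_unmapped or read.is_secondary or read.is_supplementary:
                    continue
                if read.mapping_quality < min_mapq:
                    continue
                strand = "-" if read.is_reverse else "+"
                yield (read.reference_name, read.reference_start,
                       read.reference_end, strand, read.query_name)

    # Each terminus, together with the read/run metadata from ENA, could then be
    # loaded into the database of start/end positions described above.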

Expected results

  • Code to extract read data from ENA, map to genome and calculate termini
  • Database of termini and read metadata
  • Code to extract data from database and format for browser viewing

Required knowledge

  • Transcript mapping (long-read RNA-seq alignment)
  • Workflow manager

Difficulty

Adaptable

Length

350h

Mentors

Jonathan Mudge, Jose Gonzalez, Adam Frankish

Using machine learning to annotate difficult genes

Brief Explanation

Understanding the impact of genetic variation on disease requires comprehensive gene annotation. Human genes are well characterised following more than two decades of work on their annotation; however, we know that this annotation is not complete, and new experimental methods are generating data to help us towards the goal of complete gene annotation. We have developed an automated workflow to use long transcriptomic data to add novel alternatively spliced transcripts to our gene annotation. Our method uses very strict thresholds to ensure that no poor-quality models are added to the gene annotation, although as a consequence we reject significant numbers of viable novel transcripts. We want to use machine learning to recover good quality but rejected transcripts and to improve the setting of initial filters for new datasets.
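As a minimal sketch of the kind of classifier that might be used to recover rejected transcripts, the example below trains a random forest on a placeholder feature matrix. The features and labels are synthetic stand-ins; in practice they would be per-transcript metrics from the annotation workflow (e.g. read support, splice-site scores) labelled by whether a transcript was ultimately judged valid.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Synthetic placeholder data: 1000 transcripts, 5 hypothetical features
    rng = np.random.default_rng(0)
    X = rng.random((1000, 5))
    y = (X[:, 0] + 0.3 * X[:, 1] > 0.7).astype(int)  # placeholder "valid" label

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

    print(classification_report(y_test, clf.predict(X_test)))
    # Feature importances indicate which inputs drive the accept/reject decision,
    # matching the deliverable of identifying the most relevant features.
    print(clf.feature_importances_)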

Expected results

  • Install and learn to use a machine learning package
  • Run it on known gene annotation
  • Deliverable: simple model that helps to recover valid transcripts; set of most relevant features for decision making

Required knowledge

  • Machine learning

Difficulty

Adaptable

Length

350h

Mentors

Jonathan Mudge, Jose Gonzalez, Adam Frankish

Projects: Metazoa

Develop an automatic system to flag any updated/new species and rank them

Brief Explanation

Ensembl Metazoa plans every release by manually collecting a list of available species from INSDC resources a few months in advance, and then going over their available information (e.g. taxonomic clade, assembly quality, annotation availability/quality, RefSeq availability, etc.) to filter the list down to about 20 species that will be processed and loaded into the next Ensembl release. As an example, taxonomic information is used to highlight species that cover new clades not present in Ensembl, as well as those that bring novel information to existing clades, e.g. new locust genomes in the well-known Neoptera clade.

As part of our plans to expand our Ensembl Metazoa resources, we would like to automate the process described above: checking for new and updated species in INSDC resources, and creating a system that allows us to rank them depending on different criteria. This system should collect the data on a regular basis, e.g. monthly, and provide all the information required to easily ingest it into our production loading system, e.g. GCA accession, species name, strain, common name, taxonomy, etc. Additionally, it would be desirable if the new system could rely on our JIRA tracking system to create and update this information, so it can be fed programmatically into our processing and loading system.
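A sketch of the monthly check against the ENA Portal API is shown below. The query syntax, taxon id restriction and field names are assumptions to be verified against the Portal API documentation; the real system would also de-duplicate against what is already in Ensembl and push the results into JIRA.

    import requests

    ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"

    def new_metazoan_assemblies(since="2024-01-01"):
        """List metazoan assemblies in ENA updated since a given date.

        The query and field names are illustrative and should be checked
        against the ENA Portal API documentation.
        """
        params = {
            "result": "assembly",
            "query": f"tax_tree(33208) AND last_updated>={since}",  # 33208 = Metazoa
            "fields": "accession,scientific_name,tax_id,assembly_level,last_updated",
            "format": "json",
        }
        resp = requests.get(ENA_SEARCH, params=params, timeout=60)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        for record in new_metazoan_assemblies()[:10]:
            print(record["accession"], record["scientific_name"], record["assembly_level"])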

Expected results

  • Automatic system that can run monthly and provide a list of available updates and new species in INSDC

Required knowledge

  • Python + pytest

Desirable knowledge

  • Understanding of taxonomy information
  • Understanding of INSDC resources and their REST end-points, i.e. Entrez, ENA Portal API
  • JIRA

Difficulty

Medium

Length

350h

Mentors

Jorge Alvarez

Expand the species search functionality for beta website

Brief Explanation

The search engine of any website can be one of the most useful tools for users, helping them easily retrieve the information they are looking for. Currently, Ensembl’s search tool works on indexed fields from our databases, which mainly cover key information, e.g. genes, species and proteins, including many synonyms for each of them. As we plan to move to our new beta website by the end of 2023, we want to make our search engine even better so our users can enjoy the experience of using Ensembl even more.

We would like to expand our Ensembl beta’s search functionality to include and support searching based on taxonomic information. In particular, we are interested in providing users with a list of close relatives when a requested species is not (yet) part of Ensembl, returning the list of available species when a taxonomic clade is given instead of a species name, and finding a species even when a (homotypic) synonym is provided instead of its current scientific name. The objective of this project is to create a standalone Elasticsearch tool that can handle taxonomy-related requests.
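A minimal sketch of the standalone Elasticsearch piece is shown below, using the official Python client (v8-style API). The index name, document shape and example species are hypothetical; the real documents would be generated from taxonomy data and kept in sync with what is loaded into Ensembl.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local development instance

    # Hypothetical document: scientific name, homotypic synonyms and lineage
    doc = {
        "scientific_name": "Drosophila melanogaster",
        "synonyms": ["Sophophora melanogaster"],
        "lineage": ["Metazoa", "Arthropoda", "Insecta", "Diptera", "Drosophilidae"],
        "in_ensembl": True,
        "url": "/species/drosophila_melanogaster",
    }
    es.index(index="species_taxonomy", id="7227", document=doc)

    def search_species(text):
        """Match a query against scientific names, synonyms and clade names."""
        return es.search(
            index="species_taxonomy",
            query={
                "multi_match": {
                    "query": text,
                    "fields": ["scientific_name", "synonyms", "lineage"],
                }
            },
        )["hits"]["hits"]

    # e.g. search_species("Sophophora melanogaster") or search_species("Diptera")
    # would both surface the Drosophila melanogaster document.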

Expected results

  • Search tool returns the actual species’ link when the species is in Ensembl, including checking for taxonomy synonyms
  • Search tool returns options for close relatives of the queried species (if any) when that species is not part of Ensembl
  • Search tool returns options for species within the given taxonomy clade (if any)

Required knowledge

  • Python
  • Elasticsearch
  • MySQL

Desirable knowledge

  • Understanding of taxonomy information
  • Django

Difficulty

Medium

Length

350h

Mentors

Sarah Dyer, Jorge Alvarez