Documentation projects

We have several ideas for documentation projects suitable for a technical writer, which we hope will be included in Google Season of Docs.

Ensembl manual: structure and style guide

Brief explanation

Ensembl aims to release a new website in 2020 for exploring genomic data. With it, we’d like to produce a regularly updated manual that is available alongside the website, describing the data available and how to access it. This project aims to identify the technologies needed for such a manual, produce a logical structure for the manual, and create a style guide.

Ensembl holds vast amounts of genomic data of many types, with several methods for accessing it. The documentation should cover what the data is, where it comes from and how we process it, plus how to use the different tools to work with it. This needs to be organised into a logical structure to make it easily accessible for our users.

We release new software and data several times a year, so this manual would need to be dynamic. Writers need to be alerted when changes occur that need to be reflected in the documentation. Previous suggestions for this include using a ghost browser to take screenshots, then producing alerts when image analysis software indicates differences between old and new screenshots.
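As a rough illustration of the screenshot-comparison idea, the sketch below flags a page for review when more than a small fraction of its pixels change between two captures. It is only a minimal sketch under stated assumptions: the function names, the 1% threshold and the representation of a screenshot as rows of RGB tuples are all hypothetical, and capturing and decoding the screenshots is left out entirely.

```python
from typing import List, Tuple

# Hypothetical representation: a screenshot is a list of rows of RGB pixels.
Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def changed_fraction(old: Image, new: Image, tolerance: int = 0) -> float:
    """Return the fraction of pixels that differ between two screenshots."""
    same_shape = len(old) == len(new) and all(
        len(a) == len(b) for a, b in zip(old, new)
    )
    if not same_shape:
        # Assumption: a change in page dimensions counts as a complete change.
        return 1.0
    total = sum(len(row) for row in old)
    if total == 0:
        return 0.0
    changed = 0
    for old_row, new_row in zip(old, new):
        for p, q in zip(old_row, new_row):
            # A pixel counts as changed if any channel moves by more than `tolerance`.
            if any(abs(a - b) > tolerance for a, b in zip(p, q)):
                changed += 1
    return changed / total


def needs_review(old: Image, new: Image, threshold: float = 0.01,
                 tolerance: int = 0) -> bool:
    """Return True when enough pixels changed that a writer should be alerted."""
    return changed_fraction(old, new, tolerance) > threshold
```

A real system would probably need a more forgiving comparison, since small rendering differences between releases could otherwise trigger alerts for pages whose content has not changed.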

The documentation will be written by many individuals within and outside the Ensembl project, with varying technical ability. The structure will guide the writers to ensure they are writing relevant and complete documentation, while the style guide will ensure consistency of voice throughout the documentation.

Expected results

  • Identification of suitable technologies for storing, presenting and updating the documentation.
  • A general structure listing the sections of the manual and noting their approximate content.
  • A style guide for the documentation, including how to use highlighting for different types of elements, diagram labelling and spelling.
  • Sample documentation on topics that will be unchanged in the new website.

Commitment

Full-time

Mentors

Emily Perry, Andy Yates and Andrea Winterbottom

Manual gene annotation documentation

Brief explanation

Ensembl genes are annotated onto the genome by a combination of automatic annotation using a pipeline and manual annotation by skilled annotators. Current documentation describing the process of manual gene annotation is very limited and needs significant expansion.

Manual gene annotation is a very involved process, which uses data from a variety of sources, along with the annotators’ expert knowledge of gene structures, to determine the position of genes. Annotators work from their own annotation guidelines, which may form the basis of any documentation. They have specialised software to carry out the task, which is not available to the public.

This project would aim to produce the documentation that will allow Ensembl users to understand where their genes came from. The source of genes and the reason for any changes in our gene models are some of the most popular topics on our email helpdesk. The documentation would not help people to annotate genes in Ensembl themselves.

Expected results

  • Documentation pages describing the process of manual gene annotation.
  • Images illustrating the process of manual gene annotation.

Commitment

Full-time

Mentors

Erin Haskell, Jonathan Mudge and Jane Loveland

REST API documentation

Brief explanation

The Ensembl REST APIs provide language-agnostic programmatic access to genomic data. The existing endpoints have been added piecemeal over time, so some endpoints are redundant and the endpoints are implemented inconsistently. We are in the process of auditing the endpoints to improve this, and would like to take the opportunity to improve our documentation at the same time.

The REST API’s documentation is split between an auto-generated description of our endpoints and a wiki user guide. Our auto-generated endpoint description uses an in-house custom system and is in need of updating and modernising. Our user guide was first written over five years ago and is due for review to ensure it continues to be informative and relevant to users. During this project, we hope the candidate will investigate alternatives to our in-house systems and begin to replace them, while working with our engineers to produce an improved user guide/manual.

Expected results

  • Evaluation and selection of suitable technology for documenting the REST API endpoints.
  • Trialling the selected technology on a subset of endpoints and then rolling it out to other endpoints.
  • A new user guide/manual.

Commitment

Part-time

Mentors

Astrid Gall and Beth Flint

eHive manual update

Brief explanation

eHive is a system to define and execute workflows. It operates by spawning autonomous workers which carry out parallel tasks in order to complete the larger pipeline. It was originally created for running in-house Ensembl workflows, but is now shipped separately and can be used to run any computational workflow. It is widely used in Ensembl and by other projects: we are aware of several hundred active workflows, totalling thousands of CPU years on several compute clusters. eHive is written in Perl, with Python and Java plugins. It supports various job schedulers and has beta support for Docker clouds.

The existing eHive manual is hosted on Read the Docs; however, there are improvements we wish to make to the documentation, including the addition of a cheat sheet. The overall goal is to make eHive more accessible: to make the first steps easier for new users, whilst allowing users to reach intermediate and advanced levels more quickly.

Expected results

  • Better documentation of advanced features.
  • Cheat sheet.
  • Other miscellaneous improvements to the existing documentation.

Commitment

Part-time

Mentors

Matthieu Muffato and Brandon Walts

Ensembl production manual

Brief explanation

The Ensembl teams have developed many automated pipelines to process and analyse genomics data. These are used to produce the Ensembl databases, and include our gene annotation, gene tree, regulatory build and variation annotation pipelines. Many people who work with species not included in Ensembl, or with confidential data, would like to run our pipelines on their own data and produce their own Ensembl-like databases.

All of the code needed to run an Ensembl pipeline is open source, either from a public provider or written by us and distributed on GitHub. However, getting started involves setting up a complex production environment, and several pipelines require domain-specific knowledge.

Expected results

  • A manual detailing how to set up the production environment and run an Ensembl pipeline.
  • A framework to host and document pipeline-specific information.
  • A manual for each pipeline, written with the help and knowledge of the relevant teams.

Commitment

Full-time

Mentors

Helen Schuilenburg and Nishadi De Silva

VEP documentation

Brief explanation

The Ensembl Variant Effect Predictor (VEP) annotates lists of known and novel genetic variants with the genes they affect. It is available as an online tool, an offline package and REST API endpoints.

The offline package is the most flexible and powerful method of annotating variants. However, the many options are difficult to navigate, and the installation relies on a number of external packages which can be difficult to install.

We would like to rewrite our documentation to make it easier for people to navigate, and improve the installation documentation to include troubleshooting advice.

Expected results

  • Restructure of the VEP documentation.
  • Installation troubleshooting guide.

Commitment

Part-time

Mentors

Andrew Parton and Sarah Hunt