This is the first of our monthly posts introducing a member of the Ensembl team, and what they do in Ensembl. We’ll start with Emily Perry, who runs our Outreach team.

What is your job in Ensembl?

I’m in charge of the Outreach team. There are four of us, including me, and we’re essentially the contact point between all the scientists who use Ensembl and the developers who produce it. We work on things like online help and documentation, delivering training courses, social media, user-testing our new tools and displays, and answering questions via our email helpdesk. We’re always busy, and often have a half-empty office as someone is off delivering a training course somewhere around the world.

What do you enjoy about your job?

I’m one of those strange people who thrive when they have an audience, so I really love the teaching and presenting side of what I do. I’m also fascinated by the data that we have and love learning more about it and what people are using it for. I’m a biologist by trade, not a bioinformatician, but I’ve picked up a bit of coding as part of this job and one of the things I find most satisfying is fixing bugs in people’s code for them, without asking for help from our developers. I’ve also travelled to some amazing places to deliver workshops that I would never otherwise have gone to, such as Malawi, Colombia and South Korea, where I can really feel that I’m benefitting scientific progress.

What are you currently working on?

In Outreach we’re always dipping in and out of lots of projects, as well as all the reactive jobs we need to do like answering help emails or social media. A few of the things I have ongoing at the moment are looking into the documentation on the site, running a seven-week webinar course and preparing a conceptual course on genetic variation for the EBI’s Train Online platform.

What is your typical day?

Ha, there isn’t one. That’s one of the things I love about this job.

How did you end up here?

As I said, I’m a biologist. I studied genetics at undergrad, then went on to do a PhD in molecular biology. I knew about halfway through that I wanted out of the lab, so pursued a lot of science communication opportunities, both in person and written, as I went on, which led me to working for a year delivering science roadshows in secondary schools. This combination of communication skills and scientific expertise qualified me to start working as an Outreach Officer in Ensembl in 2012, then in 2015 I was promoted to lead the team.

What surprised you most about Ensembl when you started working here?

I had used Ensembl during my PhD, and I remember thinking that I knew what it was all about. Within a few days I was overwhelmed with just how much data and how many different types of data are in Ensembl. Turns out that my PhD work had barely scratched the surface of what was there. I hear the same thing all the time when I teach workshops: people who think they know all Ensembl has to offer (often the host of the course who was planning to just hide at the back and do other work during the course) tell me how surprised they are at all the useful things that are available.

What is the coolest tool or data type in Ensembl that you think everybody should know about?

I worked on chromatin for my PhD so my favourite data type is the regulatory build. I would have killed for those data back when I was a researcher. It’s just so useful having all the promoters, enhancers, CTCF binding sites (I love a bit of CTCF) and everything mapped onto the genome with its activity in different cell types. And I find the way that they just trained a computer to recognise patterns and assign functions from ChIP-seq data just amazing.

Following the success of last year’s course, we’re pleased to announce a second Free Ensembl Webinar Course.

This course allows you to learn about Ensembl for free from the comfort of your own office (or bed, no-one’s judging you), with the ability to interact live with the instructors. Perfect for those who can’t attend or host one of our live courses.

What is it?

The Ensembl online training series comprises live webinars, held once a week over seven weeks. Each webinar explores a specific aspect of Ensembl data or tools with a presentation and a demonstration – see the online course for details. You can then practise what you’ve learnt over the following week with online exercises.

Not all of the topics will be useful to you, so you can dip in and out of the webinars. If life gets in the way and you miss one you are keen on, we will post the videos to our YouTube channel (and to YouKu for those of you in China) and embed them in the online course so that you can catch up.

What makes it special is that the course is fully interactive. If you attend the live webinars, you will have an opportunity to ask the instructors questions in real time. Afterwards, while you work on the exercises, you can interact with the instructors and other participants via our dedicated Facebook group. If you prefer not to use Facebook, you can also email us for help. Plus, you’ll be able to re-watch all or part of the videos at your leisure.

When is it?

We start on the 6th April, and will hold seven webinars on Thursday mornings, up until the 18th May. The live webinars will take place at 9 am BST (GMT+1), but if you are unable to attend live, the videos will be posted shortly afterwards. Since last year’s course was held in the afternoons, good for our American friends, we’re hoping that this morning course will be easier to access for anyone in Asia or Oceania.
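
If you want to check what the 9 am BST start means for you, the schedule above can be sketched in a few lines of Python. This is only an illustration: the year 2017 is an assumption (it is not stated here, but 6th April falls on a Thursday in that year), and the two target time zones are arbitrary fixed-offset examples rather than a list of supported regions.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the year is 2017; 6 April 2017 was a Thursday.
BST = timezone(timedelta(hours=1), "BST")  # GMT+1, as stated above

# Example fixed-offset zones without daylight saving (illustrative choices).
BEIJING = timezone(timedelta(hours=8), "CST")
TOKYO = timezone(timedelta(hours=9), "JST")

first = datetime(2017, 4, 6, 9, 0, tzinfo=BST)

# Seven weekly webinars, starting on 6 April.
webinars = [first + timedelta(weeks=i) for i in range(7)]

for start in webinars:
    print(
        start.strftime("%d %b %H:%M %Z"),
        "|", start.astimezone(BEIJING).strftime("%H:%M %Z"),
        "|", start.astimezone(TOKYO).strftime("%H:%M %Z"),
    )
```

Counting seven weekly sessions from 6th April ends on 18th May, which matches the dates given above.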

After the live course finishes, we will leave the full course of recordings and exercises online, so that you can take it independently whenever you choose.

69.7% are very likely to recommend this course, 30.3% are likely to.

Is it any good?

We think so, but don’t take our word for it. Here’s what the attendees from last year had to say:

“Thank you. I really appreciate having access to this course. I’ve learned a lot.”

“Thank you so much for organising this. I really enjoyed!”

“Thank you; the course is very useful. I´m very happy”

How do I sign up?

You can visit the course pages to see what’s going on without signing up. If you want to attend the webinars live, you will need to sign up (or sign up here from China), but there’s no charge for doing so. You may also wish to join the Facebook group.


Transforming file formats has always been a troublesome issue in bioinformatics because of the numerous standards and slight eccentricities in formatting required by some software packages. How many times have you needed to transform chromosome names between 1,2,3 and chr1, chr2, chr3 or vice versa? With the introduction of File Chameleon we hope to somewhat smooth this process for data consumers.

File Chameleon is a web service introduced by Ensembl to transform Ensembl FTP files for easier use across the spectrum of bioinformatics tools. Need UCSC-style chromosome names? Need genes longer than 4 Mbp removed? File Chameleon can do that. From the File Chameleon web interface, simply select the species and which flat file you want to download (individual chromosome GTF, full assembly FASTA, etc.), then select which filters you want to apply. The file will be transformed and ready to download within a few minutes.
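
To make the two example filters concrete, here is a minimal Python sketch of the kind of home-made script that File Chameleon is meant to replace. It is not File Chameleon's own code: the function names are hypothetical, the name mapping is deliberately simplified (it assumes the mitochondrial sequence is called "MT" in the input and "chrM" in UCSC style, and it does not handle scaffold names properly), and the length filter drops only the gene line itself, not its transcripts or exons. The sample records are toy data.

```python
from typing import Iterable, Iterator

MAX_GENE_LENGTH = 4_000_000  # 4 Mbp, the threshold used in the example above


def to_ucsc_name(name: str) -> str:
    """Map an Ensembl-style sequence name to a UCSC-style one (simplified)."""
    if name.startswith("chr"):
        return name
    # Assumption: mitochondrial genome is "MT" in the input, "chrM" in UCSC style.
    return "chrM" if name == "MT" else "chr" + name


def transform_gtf(lines: Iterable[str]) -> Iterator[str]:
    """Rename sequences and drop over-long gene records from GTF lines."""
    for line in lines:
        # Pass header/comment lines and blank lines through untouched.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        start, end = int(fields[3]), int(fields[4])  # 1-based, inclusive
        if fields[2] == "gene" and end - start + 1 > MAX_GENE_LENGTH:
            continue  # skip genes longer than the threshold
        fields[0] = to_ucsc_name(fields[0])
        yield "\t".join(fields) + "\n"


# Toy input: one header line, one short gene, one gene longer than 4 Mbp.
sample = [
    "#!toy-header\n",
    '1\ttoy\tgene\t1000\t5000\t.\t+\t.\tgene_id "toy_gene_1";\n',
    '2\ttoy\tgene\t1\t4500000\t.\t-\t.\tgene_id "toy_gene_2";\n',
]
result = list(transform_gtf(sample))
for out_line in result:
    print(out_line, end="")
```

The function works on any iterable of lines, so the same code can be pointed at an open file handle instead of the toy list.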

Currently File Chameleon only operates on GTF, GFF3 and FASTA formats and has a very limited set of filters for each format; however, we’re committed to expanding the tool over future releases. Please take a look and give us feedback: which of the Ensembl formats would be useful to add and, more importantly, what transformations and filters on the data would make it more useful for you? What is the awk or sed script you run on the files you download that we could run for you, or that others might find helpful?

File Chameleon is also available as a standalone tool and is designed to have easily pluggable filters. If you find the tool useful, you can run it locally and expand it by writing your own plugins to further process files. The package can be downloaded via GitHub along with extensive documentation and examples.

We think the Ensembl workshops that we offer are a brilliant way to familiarise yourself, and other people in your research institute, with Ensembl data and tools. But don’t take our word for it: over 99% of the people who attended our workshops in the first six months of 2016 would recommend them to a colleague.

Pie chart of who would recommend Ensembl workshops.

68% of participants say they are “Very Likely” to recommend our workshops, while 31% say they are “Likely” to.

So if you (and your colleagues) want to get training on Ensembl, the best option is to join the over 50 institutes a year who benefit from hosting an Ensembl browser workshop, training over 1000 people. Send us an email to find out more.

We appreciate, however, that that’s not an option for everybody. Although we do not charge fees for workshops, we ask our hosts to pick up the tab for all expenses incurred, such as travel, accommodation and subsistence. While we do our best to alleviate these costs, such as by tagging together overseas workshops in a series, you may still not have the spare budget for this, or you may want to host a workshop to a different timetable than we can support.

In this case, if you are an experienced Ensembl-user, you may consider teaching an Ensembl workshop of your own. We’d like to help. We want everybody learning about Ensembl to receive the most extensive and up-to-date training possible, and we believe that the second best way to do this (after having us do it) is with our support.

How can I teach a workshop?

All our workshops are hands-on, usually in a computer teaching room, although we can work with people bringing laptops (provided there is suitable WiFi and somewhere to charge them during the course). We recommend a similar set-up.

Our general style for a workshop is that we split it into modules. The modules we usually offer are:

  • Introduction to Ensembl and the Region view
  • Genes and transcripts in Ensembl
  • Data export with BioMart
  • Genetic variation data in Ensembl, including annotating your own variants with the VEP
  • Comparative genomics: homologues and whole genome alignments
  • The Ensembl Regulatory Build: finding features that regulate genes
  • Advanced access to Ensembl data and viewing custom data in Ensembl

Within each module, there are three elements:

  1. Presentation, where we introduce what the data or tool is and where it comes from.
  2. Demonstration, where we have a hands-on walkthrough of finding that data or using that tool. We give out printed booklets containing screenshots that take participants through these walkthroughs. This provides a suitable place to make extra notes, and gives participants something to take away and use later. Some participants choose to join in with the walkthroughs, while others just make notes while we go through it on the screen.
  3. Exercises, where participants can practise using Ensembl to find information. These exercises build on what we do in the demonstration. During the exercises, we circulate around the room, ready to answer any questions that might come up. We also provide answer sheets (usually electronic only) that guide the participants on how to get the answers and what they are.

We find that combining these three elements gives the participants all the information they need, and provides a holistic learning experience that appeals to different kinds of learning style. We think this is why we consistently get such excellent feedback from our course participants.

You can see an example of a course that follows this structure, with the full set of modules, each with three elements, in our webinar course that we held in Spring 2016. You’re free to harvest the presentations (embedded as pdfs), demonstrations and exercises from that course for your own teaching, although under our Creative Commons BY licence, you need to credit us with their creation.

While we would usually have the workshop as one intensive day of learning, the flexibility of being in your home institute might mean that you prefer to have a module a day over a number of days, or one a week.

Where do I get the materials from?

As well as the webinar course I already mentioned, we have a page of walkthroughs and exercises. We use these ourselves in workshop creation, copying and pasting them together to make our courses.

We like to tailor our courses to match our participants’ interests, so will try to use exercises and walkthroughs that feature the species they’re working with. This is why our exercise page has many similar exercises and walkthroughs with different species. We recommend finding suitable exercises and demos to match your group’s interests and skills. You can also copy the process and style of an existing walkthrough or exercise for an example in a new species of interest.

Because we only update these exercises and walkthroughs when we use them, they can get out-of-date. The “Updated” column on the tables shows you for which Ensembl or Ensembl Genomes release they were last updated. If the exercise or walkthrough you’re looking at is not from the current release (check the release news section of this blog to see what’s current – note the different release numbers for Ensembl and Ensembl Genomes), then you might want to check the content to see if it needs editing at all.

If you do any updates, or make any new exercises, we’d love to hear about it. Email us your new material and we’ll add it to the page for other people to use (and maybe steal it for ourselves too).

All of the Outreach team post their course materials online for use during and after the courses, so you can find presentations (in pdf) and see how a whole workshop might fit together. You can see these materials on our training pages.

Remember, if you use any of our materials, do credit us with their creation, as they are distributed under a CC BY licence.

What about website downtime?

Website downtime can be a disaster for a workshop, leaving you floundering with no way to teach. However, downtime is also an occasional necessity when we are running such a huge website and database. If you are planning on running a workshop, we recommend you get in touch to ask if there is any planned downtime. You can usually get around downtime by using one of our mirror sites, but we can advise you on this.

Similarly, if we put out a new release in the days between your preparation and workshop delivery, it can make some of your materials out-of-date. This can be a valuable lesson for your participants in how bioinformatic databases can change, or you can run the workshop from the previous archive site instead.

A release on the day of your workshop means both downtime and changes in the data. If we know you’re having a workshop, we’ll make sure that the archive site from the previous release is up and working before we take the main site down, so that you have something to work with, and you can keep using that even once the new site is up.

Need more help?

The Outreach team are here to support you. Just send us an email if you want practical support on how best to run a workshop, if you have any background questions on our data or tools, or indeed with any other questions or problems you might have with Ensembl.

The Ensembl regulation resources FTP site saw a facelift in release 87. The directory structures have been modified to make it easier to find files: the file names have become more descriptive and we now also provide our data in a greater variety of file formats. All data files on our FTP site now adhere to a naming convention, which is described in greater detail here. The filenames include the following information, separated by dots (‘.’):

  • species
  • assembly version
  • cell type (if applicable)
  • feature type (if applicable)
  • analysis name
  • results type
  • data freeze date
  • file format.

E.g.: homo_sapiens.GRCh38.K562.Regulatory_Build.regulatory_activity.20161111.gff.gz
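
As an illustration of how the convention above can be used in scripts, here is a minimal Python sketch that splits such a filename into its parts. It is a sketch under stated assumptions, not an official parser: the class and function names are hypothetical, a trailing "gz" is treated as part of the file format, assembly names are assumed not to contain dots, and because cell type and feature type are both optional they are returned together without guessing which is which.

```python
from typing import NamedTuple, Tuple


class RegulationFileName(NamedTuple):
    species: str
    assembly: str
    optional: Tuple[str, ...]  # cell type and/or feature type, when present
    analysis: str
    results_type: str
    freeze_date: str
    file_format: str


def parse_name(name: str) -> RegulationFileName:
    """Split a dot-separated regulation FTP filename into named parts."""
    parts = name.split(".")
    # Assumption: a trailing "gz" is a compression suffix of the format.
    if len(parts) >= 2 and parts[-1] == "gz":
        parts[-2:] = [parts[-2] + ".gz"]
    if len(parts) < 6:
        raise ValueError("too few fields in filename: " + name)
    species, assembly = parts[0], parts[1]
    # The last four fields are always present; read them from the end.
    analysis, results_type, freeze_date, file_format = parts[-4:]
    # Whatever is left in the middle is the optional cell/feature type.
    optional = tuple(parts[2:-4])
    return RegulationFileName(
        species, assembly, optional, analysis,
        results_type, freeze_date, file_format,
    )


parsed = parse_name(
    "homo_sapiens.GRCh38.K562.Regulatory_Build.regulatory_activity.20161111.gff.gz"
)
print(parsed)
```

Reading the fixed fields from both ends of the name is what lets the optional middle fields be absent without shifting everything else.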

The data available on our FTP site include:

Peaks: The set of peaks for transcription factors, histone modifications and variants that are part of our regulatory resources. In previous releases these used to be collated in one file, called ‘AnnotatedFeatures.gff.gz’, but with our recent expansion to 88 human cell types with ChIP-seq data, the file became too big. Therefore, we split it into separate files by cell and feature type in the ‘Peaks’ subdirectory. The peaks are now available in gff, bed and bigBed format.
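
One practical point when switching between these formats: GFF coordinates are 1-based and inclusive, while BED coordinates are 0-based and half-open, so the same peak has a different start value in each. The Python sketch below shows the conversion on a toy record. The function names are hypothetical, the choice of the GFF type column as the BED name is an assumption made for illustration, and the column layout of the bed files on the FTP site may differ.

```python
from typing import Tuple


def gff_to_bed_interval(start: int, end: int) -> Tuple[int, int]:
    """Convert 1-based inclusive GFF coordinates to 0-based half-open BED."""
    if start < 1 or end < start:
        raise ValueError("invalid GFF interval: %d-%d" % (start, end))
    # Only the start shifts; the end value is numerically unchanged.
    return start - 1, end


def gff_line_to_bed4(line: str) -> str:
    """Turn one GFF line into a four-column BED line (chrom, start, end, name)."""
    fields = line.rstrip("\n").split("\t")
    seqid, feature_type = fields[0], fields[2]
    bed_start, bed_end = gff_to_bed_interval(int(fields[3]), int(fields[4]))
    # Assumption: use the GFF type column as the BED name.
    return "\t".join([seqid, str(bed_start), str(bed_end), feature_type])


# Toy record, not real Ensembl data.
toy_gff = "1\ttoy_source\ttoy_peak\t101\t200\t.\t.\t.\tID=toy_1\n"
bed_line = gff_line_to_bed4(toy_gff)
print(bed_line)
```

The length of the feature is the same either way: 200 - 101 + 1 in GFF terms and 200 - 100 in BED terms.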

Quality scores: The outcome of our quality checks from processing the ChIP-seq data that yielded the peaks. They are in JSON format in the ‘QualityChecks’ subdirectory and include:

  • the number of mapped reads
  • the estimated fragment length, the NSC and RSC values using phantompeakqualtools
  • the proportion of reads in peaks
  • the enrichment of the ChIP over the Input using CHANCE.

Regulatory build: The current set of regulatory features along with their predicted activity in every cell type. We provide one gff file per cell type in the ‘regulatory_features’ subdirectory.

Transcription factor motifs: Motifs identified using position weight matrices from JASPAR within the enriched regions found by our ChIP-seq analysis pipeline, provided in gff format.

For our latest release (e87) we’ve produced annotations from some new embryonic zebrafish RNA-seq data using the Ensembl genebuild RNA-seq pipeline. The collection of new data we’re providing consists of gene sets and alignments for 18 separate embryonic developmental stages, from the single celled zygote right up until 120 hours post fertilisation. As per usual, these features can be viewed in our browser as separate tracks, or they can be downloaded from our ftp site.

The RNA-seq data we used were produced by the Vertebrate Genetics and Genomics Group at the Sanger Institute. The team collected 96 embryos from each of the 18 stages, examining their morphology so as to ensure every single embryo was at the correct phase of development. Such an undertaking, although extensive, is more achievable in zebrafish than in many other vertebrates due to features such as large clutch size and external fertilisation and development. The team made 5 libraries for each of the developmental stages, each one comprising a pool of 12 embryos. All 90 libraries were made simultaneously by a robot to reduce batch effect and strand-specific sequencing was used to reveal information on genes overlapping on the opposing strand. The data were released to ENA directly after sequencing, to allow public access as early as possible. Variation in gene structure across development can be viewed in Ensembl and the changing expression level can be viewed in Expression Atlas. A manuscript describing the changes in gene structure and expression level across development is currently in preparation.

The alignments and annotations generated from the data are viewable in the Ensembl browser, and the individual tracks can be configured using the RNA-seq tissue matrix. The initial introduction of this matrix was covered in a previous blog post. The new zebrafish entries appear in chronological order under the heading ‘WTSI stranded RNA-seq’. A merged set, which contains all of the new developmental RNA-seq data, is also selectable.

We expect these RNA-seq data will expose new isoforms of previously annotated genes, which may be especially prevalent during, and perhaps even unique to, early embryonic development. The alignments may also reveal interesting expression patterns for specific genes.

We’d like to encourage our users to take full advantage of these exciting new data, and we hope they’ll facilitate some interesting new research.

Please send any questions to our helpdesk.

The Variant Effect Predictor (VEP) is one of Ensembl’s most popular tools. It has grown in 6 years from a simple Perl script with just a couple of hundred lines of code to become a multi-limbed beast with thousands of lines of code and well over 100 configurable options.

VEP is now used by many high-profile projects, institutes and companies around the world. In order to effectively manage this growth and ensure we deliver the most reliable and feature-rich variant annotator out there, we’ve had to go back to basics. Over the past six months the VEP codebase has been totally rewritten, and the new version is now available for download. Users of VEP’s web and REST API interfaces should see virtually no difference with the new version, so if that’s you, you can stop reading now!

For users of our command line tool, you can trial the new VEP by visiting https://github.com/Ensembl/ensembl-vep. The full list of changes to the code can be found in the README on GitHub, but these are the main points of note:

  • Faster: process an individual genome in around 30 minutes.
  • Backward-compatible: all data sources (cache files, databases) and most command line flags from the old code are fully compatible with the new code.
  • More reliable: test-driven development means the new code is covered by more than 1500 unit tests with over 99% statement coverage.

For those tied to the current codebase, it is still available as part of the ensembl-tools GitHub repository, though updates and support for this will cease over time. Ensembl release 87 will be the last for which the ensembl-tools version of VEP will be the “primary” VEP codebase. Of course, the previous code and supporting data will remain available as part of Ensembl’s archiving strategy.

Some other points of note:

  • The documentation at ensembl.org still refers to the old code. From Ensembl release 88 onwards full documentation for the new code will be made available.
  • If possible, please report any issues you may find with the new code as a GitHub Issue.
  • The code that calculates variant consequence types (e.g. missense_variant, stop_gained) remains a part of the ensembl-variation API module and has not been (significantly) updated; it is used by both the old and new code. The ensembl-vep codebase performs the following functions:
    • parsing command line flags
    • parsing input
    • reading data from annotation sources (databases, cache files, flat files)
    • interval alignment of input variants with annotation data
    • writing output
    • monitoring statistics
    • threading
    • data filtering interface
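
To give a flavour of one item in the list above, matching input variants to the annotation intervals they fall in, here is a minimal Python sketch of a generic interval-overlap lookup. It is not the ensembl-vep implementation (which is written in Perl and is not described here); the class name, the 1-based inclusive coordinates and the toy features are all assumptions made for illustration.

```python
from bisect import bisect_right
from typing import Iterable, List, Tuple

Feature = Tuple[int, int, str]  # (start, end, name), 1-based inclusive


class FeatureIndex:
    """Toy index over the features of a single chromosome."""

    def __init__(self, features: Iterable[Feature]) -> None:
        # Sort once so that start positions can be binary-searched.
        self.features: List[Feature] = sorted(features)
        self.starts: List[int] = [f[0] for f in self.features]

    def overlapping(self, var_start: int, var_end: int) -> List[str]:
        """Return names of features overlapping [var_start, var_end]."""
        # Only features that start at or before var_end can overlap.
        hi = bisect_right(self.starts, var_end)
        # Of those, keep the ones that end at or after var_start.
        return [name for _, end, name in self.features[:hi] if end >= var_start]


# Toy annotation, not real Ensembl data.
index = FeatureIndex([
    (100, 200, "toy_transcript_A"),
    (150, 400, "toy_transcript_B"),
    (500, 600, "toy_regulatory_feature"),
])
print(index.overlapping(180, 180))  # a single-base variant at position 180
print(index.overlapping(450, 450))  # a variant between features
```

The binary search only trims the candidates on one side, so this stays linear in the worst case; a real annotator dealing with whole genomes would need a more careful data structure.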