Retirement of archive 58

Please note that the archive websites for Ensembl release 58 (May 2010) will be retired in mid-June when version 72 is released.

This is in accordance with our rolling retirement policy, whereby archives more than three years old are retired unless they include the last instance of the previous assembly from one of our key species (human, mouse and zebrafish).
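As a rough illustration, the policy described above can be written as a small predicate. This is only a sketch under stated assumptions: the function name, the `holds_last_assembly_for` argument and the exact release day (the post only says "May 2010") are invented for the example and are not part of any Ensembl tool.

```python
from datetime import date

# Key species named in the retirement policy.
KEY_SPECIES = {"human", "mouse", "zebrafish"}


def should_retire(released, today, holds_last_assembly_for=(), max_age_years=3):
    """Return True if an archive falls due for retirement under the rolling policy.

    released, today: datetime.date values.
    holds_last_assembly_for: species for which this archive is the last
    instance of the previous assembly (hypothetical input for this sketch).
    """
    # An archive is kept if it preserves the previous assembly of a key species.
    if KEY_SPECIES & set(holds_last_assembly_for):
        return False
    # Otherwise it is retired once it is more than `max_age_years` old.
    # The day is clamped to 28 so that the cutoff date always exists.
    cutoff = date(released.year + max_age_years, released.month, min(released.day, 28))
    return today > cutoff


release_58 = date(2010, 5, 1)  # assumed day within May 2010
print(should_retire(release_58, date(2013, 6, 15)))             # True
print(should_retire(release_58, date(2013, 6, 15), ["human"]))  # False
```

The second call shows the exception: an archive of the same age would be kept if it held the last copy of a key species' previous assembly.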

For more information about how to use archives, please see our previous blog post on the topic; a list of all current archives is available on the main website.

Posted in Web | Leave a comment

What’s coming in Ensembl release 72

Ensembl 72 is scheduled for release in June 2013. We expect this release to include, among other things:

  • Updated patches for the human assembly (GRCh37.p11).
  • Imported HGMD-PUBLIC data from release 2013.1 with regulatory data for human.
  • Import of COSMIC version 64 and update of COSMIC structural variants.
  • Addition of phenotype associations data from OMIM and Orphanet and data from GIANT and MAGIC association studies.
  • Updated HAVANA manual curation for human and mouse.
  • Import of the genotypes from the Mouse Genomes Project, SNP Release Version 3.
  • Addition of variation data for Gibbon.
  • Addition of a form to search for an individual in the “Individual genotypes” page (variation tab), and a new page listing the publications where the variation has been cited.
  • Updated CCDS sets and cDNA alignments for human and mouse.
  • Updated mitochondrial sequence and annotation for several species including alpaca, lamprey, platyfish and xenopus.

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.

Posted in Ensembl, Release News | Leave a comment

3-day Ensembl API workshop, 22-24 May 2013, Hinxton, UK

In May the Ensembl team will again provide a 3-day Ensembl Perl API workshop on the Wellcome Trust Genome Campus in Hinxton, United Kingdom. Although this workshop is primarily meant for campus employees, external participants are also more than welcome to attend.

The workshop itself is free of charge. Note, though, that our campus is located a bit in the middle of nowhere and that you have to make your own arrangements for accommodation and/or travel. A similar workshop will be given again at the end of the year at the University of Cambridge (27-29 November 2013).

For more information and to register please mail me (bert@ebi.ac.uk).

3-DAY ENSEMBL API WORKSHOP
Time: 22-24 May 2013, 9:30-17:00
Place: Teaching room, EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Instructors: Magali Ruffier, Thomas Juettemann, Anja Thormann, Matthieu Muffato, Javier Herrero
Cost: none

The Ensembl project provides a comprehensive and integrated source of annotation of mainly vertebrate genome sequences. This 3-day workshop is aimed at researchers and developers interested in exploring Ensembl beyond the website. The workshop covers the core, compara, variation and functional genomics (regulation) databases and APIs. For each of these, the database schema and the API design, as well as its most important objects and their methods, will be presented. This will be followed by practical sessions in which the participants can put what they have learned into practice by writing their own Perl scripts.

Important
This workshop is NOT intended to teach you either Perl or basic molecular biological and genetic concepts! To be able to attend you should be able to code in Perl and be familiar with basic molecular biology and genetics. A basic knowledge of Ensembl is advantageous.

Posted in API and schemas, Workshops | Leave a comment

Database Port Changes

You may have noticed that in release 71 we made useastdb.ensembl.org accessible on port 3306 alongside the traditional Ensembl DB port 5306. This is in response to comments from users that many institutions and businesses do not allow access to remote resources on non-standard port numbers, and 5306 is such a port. We are beginning a migration which will see Ensembl host current release databases on the default MySQL port 3306 alongside 5306, as detailed in the following table.

Ensembl Databases and their Ports

Database              | Database Releases    | Release 71    | Release 72
ensembldb.ensembl.org | 47 and lower         | 3306 and 4306 | 4306
ensembldb.ensembl.org | 48 plus              | 5306          | 3306 and 5306
useastdb.ensembl.org  | Current and previous | 3306 and 5306 | 3306 and 5306

Prior to release 48 all Ensembl databases were hosted on 3306. Release 48 saw an upgrading of our MySQL deployment platform from v4 to v5 (http://lists.ensembl.org/ensembl-dev/msg03424.html). This necessitated the deployment of two servers; 3306 hosting MySQL v4 databases and 5306 hosting v5 databases. This process sees us returning to our original hosting pattern of making our latest releases available on 3306.
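For scripts that build their own connection settings, the Release 72 column of the table can be encoded as a small lookup. This is a minimal sketch rather than an official client: the function names and the URL layout are invented for illustration, and the `anonymous` user name is an assumption not stated in this post.

```python
def ports_from_release_72(host, release):
    """Ports serving databases of `release` on `host` once release 72 is live.

    Mirrors the "Release 72" column of the table; note that useastdb only
    carries the current and previous releases.
    """
    if host == "useastdb.ensembl.org":
        return (3306, 5306)
    if host == "ensembldb.ensembl.org":
        # Releases 47 and lower stay on the MySQL v4 server, reachable on 4306 only.
        return (4306,) if release <= 47 else (3306, 5306)
    raise ValueError(f"unknown Ensembl database host: {host}")


def preferred_port(host, release):
    """Pick the default MySQL port 3306 when it is offered, else the first listed."""
    ports = ports_from_release_72(host, release)
    return 3306 if 3306 in ports else ports[0]


def connection_url(host, release, user="anonymous"):
    """Build a MySQL-style URL; the user name is an assumption of this sketch."""
    return f"mysql://{user}@{host}:{preferred_port(host, release)}/"


print(connection_url("ensembldb.ensembl.org", 72))
print(connection_url("ensembldb.ensembl.org", 47))
```

Preferring 3306 where it is offered matches the motivation for the change: it is the port most likely to be allowed through institutional firewalls.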

We hope that this move will help more users access our resources. Should you need more information then please do not hesitate to get in touch.

Posted in API and schemas, Cloud, Mirrors | Leave a comment

Rebuilding the zebrafish gene set

The zebrafish (Danio rerio) is pretty much the ideal model for understanding vertebrate development as it combines some of the best attributes from several other model organisms. More importantly, there is extensive similarity between the zebrafish and human genomes. Thus, many human developmental and disease genes have counterparts in zebrafish, and linking such genes is key to elucidating human gene function. Consequently, an array of zebrafish models of human diseases has been produced for the purpose of testing candidate drugs. The Wellcome Trust Sanger Institute recently released a video on the usefulness of zebrafish in research.

Until recently, we at Ensembl have relied mostly on protein and cDNA alignments to produce transcript models. However, the number of known zebrafish proteins and cDNAs is relatively small. With the advent of RNASeq, many more splice variants can be identified and used in the gene building process. Not only do these data provide proof of transcription, but RNASeq also presents us with information on splice sites, UTRs and tissue expression.

The RNA-Seq pipeline
The zebrafish gene set was among the first to be enhanced by the incorporation of RNASeq data. For those of you with some knowledge of model organisms, starting a new genebuild process using zebrafish as the test case may seem counter-intuitive when far simpler eukaryotic genomes are available, such as those of D. melanogaster and C. elegans. Ironically, the complexity of the zebrafish genome, and the predicted difficulty in annotating it, was the main reason for its selection as the new pipeline’s first test subject. At the time, the general consensus throughout the Ensembl genebuild team was that annotating a complex genome was the best way to prepare for future annotations of mammalian genomes.

The following video describes how RNASeq data is used in our gene sets:

Unsurprisingly, the zebrafish RNASeq annotation process did prove to be a difficult endeavour. Some particularly long genes, such as those encoding Nebulin and Titin, were problematic due to the very high number of possible combinations of exons and introns. The models, therefore, had to be simplified. Perhaps most frustratingly, however, the zebrafish assembly was updated from version 8 to version 9 just as the genebuild team were finishing. This meant that after running a full analysis they had to rerun the entire pipeline on the new assembly. Fortunately the hard work paid off in the end. The same pipeline that was developed for zebrafish can now be used for all other eukaryotes. Anole lizard, for example, has already been updated using the RNASeq pipeline and rabbit will soon follow.

The new zebrafish gene models
The RNASeq zebrafish pipeline took data from 5 tissues and 7 developmental stages and assembled them into 25,748 gene models. These elements were then incorporated into the Ensembl genebuild process after careful filtering. This was followed by a merge with the manually curated VEGA gene models to produce a final set of 26,152 genes, represented by 51,569 transcripts.

The different sets of gene models contribute contrasting elements to the final product, and achieving the best possible result is a balance between including correct models while excluding incorrect ones. In this particular case, the RNASeq gene models were used to adjust intron and exon boundaries, confirm expression and improve on the accuracy of the 3′ UTRs. If you would like to know more, the entire genebuild process for the zebrafish is summarised here, and you can read the published paper here.

Viewing the data on our website
If you’d like to, you can view the RNASeq information for a particular species on its Ensembl homepage:

  • Firstly, search for a gene on the homepage and click on it.
  • Then click “Configure this page”, which is to the left of the page.
  • Click “RNASeq models”, also to the left of the page.
  • Click “Enable/disable all RNASeq models”.

New tracks will now appear. These tracks can then be repositioned to facilitate simple comparisons. The video below gives a visual demonstration, as well as a more detailed description, of the steps involved.

In future Ensembl releases more annotated species will be updated with such information. As new species are introduced their genebuilds will also incorporate RNASeq data wherever possible. The final outcome will be better, more accurate UTR and splice site annotations, as well as a clearer picture of gene expression patterns.

Posted in Assembly & Genebuild | Leave a comment

Ensembl Genomes release 18

The Ensembl Genomes project celebrated its fourth anniversary at the beginning of this week, and we are now pleased to announce another milestone: Ensembl Genomes release 18.

In our latest release, we provide several new species across the divisions of Ensembl Genomes: three new genomes in Ensembl Fungi (Yarrowia lipolytica, Cryptococcus neoformans and Trichoderma reesei); one new genome in Ensembl Protists (Guillardia theta); four new genomes in Ensembl Metazoa (Brugia malayi, Loa loa, Megaselia scalaris and Strigamia maritima); and one new species in Ensembl Plants (Medicago truncatula). We have also incorporated the latest versions of prokaryotic genomes from INSDC in Ensembl Bacteria.

NOTE: The public MySQL database will be available on Monday, April 29th 2013.

The detailed features of this new release are:

* Bacteria

  • Addition of cross-references to Rhea in Ensembl Bacteria;
  • New and updated genomes: 6,305 bacterial genomes in total, all deposited in ENA.

* Fungi

* Protists

  • Extended taxonomic coverage with the newly added Guillardia theta genome;
  • Updated DNA alignments and synteny for stramenopiles genomes.

* Metazoa

* Plants

  • Barrel clover (Medicago truncatula) is now available. This is the second legume genome to be included in Ensembl Plants;
  • New visualization of the barley (Hordeum vulgare) physical map anchoring the gene space assembly;
  • Whole genome alignments between barley, rice and brachypodium;
  • Updated maize (Zea mays) assembly to version 3;
  • New variation datasets for barley and Brachypodium distachyon;
  • Alignments of Triticum aestivum (bread wheat) WGS, TSA and EST assemblies against the barley genome are now available. The bread wheat data can also still be visualised in the context of Brachypodium distachyon;
  • Protein domain predictions and cross references have been updated for most plant genomes;
  • New pairwise alignments between several genome sequences.

* Software migration to Ensembl 71

Have fun!

The Ensembl Genomes Team

Posted in EnsemblGenomes, Release News | Leave a comment

New Pre! sites for olive baboon, sheep and common shrew

New Pre! sites have been released for three species: common shrew (Sorex araneus), olive baboon (Papio anubis) and sheep (Ovis aries).

The common shrew assembly SorAra2.0 (GCA_000181275.2) was submitted by the Broad Institute. It consists of 12,845 unplaced scaffolds comprised of 201,798 contigs. A total of 16,210 gene models have been created from alignments of 16,209 human Ensembl translations and 1 shrew-specific protein. In addition, alignments of sequences from UniProt, UniGene and the ENA vertebrate RNA collection are provided. Click here to visit the common shrew Pre! site.

The olive baboon assembly Panu_2.0 (GCA_000264685.1) was submitted by the Baylor College of Medicine. It consists of 20 autosomal chromosomes (1-20) and the X chromosome. There are 72,500 scaffolds (63,229 of which are unplaced) comprised of 198,931 contigs. A total of 20,234 gene models were generated from alignments of known olive baboon proteins and human Ensembl translations. In addition, alignments of sequences from UniProt, UniGene, GenBank (baboon ESTs), RefSeq (baboon cDNAs) and the ENA vertebrate RNA collection are provided. Click here to visit the olive baboon Pre! site.

The sheep assembly Oar_v3.1 (GCA_000298735.1) has 130,765 contigs, 5,697 toplevel sequences and 27 chromosomes including the X chromosome. It was submitted by the International Sheep Genomics Sequencing Consortium. Alignments were created using human and cow Ensembl translations and known sheep proteins from UniProt and RefSeq. These alignments gave 45,972 gene models in total. In addition, alignments of sequences from UniProt, UniGene, GenBank (~935,000 sheep ESTs), RefSeq (~15,700 sheep cDNAs) and the ENA vertebrate RNA collection are provided. Click here to visit the sheep Pre! site.

Posted in Uncategorized | Leave a comment

What’s new for the VEP

It has been quite a while since we’ve blogged about the VEP (Variant Effect Predictor), and in that time we’ve added a whole load of new features, particularly to the downloadable script version.

Structural variants

The VEP now supports finding the consequences of structural variants, with input either in VCF or tab-delimited format. Using the web interface to the VEP you can visualise which transcripts and features your structural variants overlap by clicking through to the Region in Detail view.

The cache

We’ve really pushed the VEP script’s capabilities when using local “caches” (as opposed to using remote databases). Almost every feature of the VEP is now available when using the cache in offline mode. You can use a local FASTA file to quickly retrieve the sequences required to construct HGVS notations. You can even construct your own cache from a GTF file if your species isn’t supported by Ensembl.

Our cache for human now contains allele frequency data from phase 1 of the 1000 Genomes Project, and you can use these frequencies to filter your input (for example, you might want to filter out variants that are common in the combined European (EUR) population). We also now provide SIFT predictions for 8 species - human, mouse, zebrafish, pig, cow, chicken, rat and dog.
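To make the idea of frequency filtering concrete, here is a minimal sketch of dropping variants that are common in one population. It does not reproduce the VEP's own code or options: the record layout, the function name, the 5% threshold and the example variants are all invented for illustration.

```python
def drop_common(variants, population="EUR", max_freq=0.05):
    """Return the variants that are not common in `population`.

    Each variant is a dict with an "id" and a "freqs" mapping of population
    code to allele frequency (a made-up layout for this sketch). Variants
    with no frequency recorded for the population are kept.
    """
    kept = []
    for variant in variants:
        freq = variant.get("freqs", {}).get(population)
        if freq is None or freq <= max_freq:
            kept.append(variant)
    return kept


# Hypothetical records; identifiers and frequencies are invented.
example = [
    {"id": "var_a", "freqs": {"EUR": 0.32, "AFR": 0.01}},
    {"id": "var_b", "freqs": {"EUR": 0.002}},
    {"id": "var_c", "freqs": {}},
]
print([v["id"] for v in drop_common(example)])  # ['var_b', 'var_c']
```

Keeping variants with no recorded frequency is a deliberate choice in this sketch, since a variant absent from the frequency data is more likely to be rare than common.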

Plugins

We’re always trying to add new and useful features to the VEP, but we also recognise that other users have great ideas that they’d like to implement. The VEP script enables the use of plugins; these are bits of code that add extra functionality to the VEP. They can be used to retrieve data from remote sources, run external tools, filter output; pretty much anything you can think of can be accomplished in a plugin!

It’s easy to get started, and a basic plugin can be just a few lines of code – have a look at some of the examples we’ve created.

I recently added a plugin to retrieve data from dbNSFP – this is a great resource created by Liu et al in Houston, TX. They have, for every possible missense substitution in the human genome, pre-calculated pathogenicity scores, frequencies, conservation scores and a plethora of other things, and made all of this available as an easily downloadable file. To use this with the VEP, you just download the file and the plugin, run a couple of commands to get the data into the right format, and away you go – the VEP can now provide you with scores from LRT, MutationAssessor, MutationTaster, FATHMM and more for any missense substitution in your input.

Summary and HTML output

We had a number of requests for the VEP to provide summary statistics at the end of each run, and who are we to disappoint our loyal users?!? The VEP now writes a pretty HTML summary.
You can also view your output as HTML using the --html flag, which allows you to sort, filter and analyse your output on the fly.

Don’t hesitate to get in touch with us about the VEP – dev@ensembl.org is the best place for technical questions, with helpdesk@ensembl.org for everything else.

Posted in Variation | 1 Comment

Open source and open access

As an Ensembl Outreach Officer I get asked a lot of questions. Mostly questions about our data and interfaces but occasionally, just occasionally, something a bit more blue sky.

A couple of weeks ago I was teaching an Ensembl Browser workshop at the Erasmus MC in Rotterdam. I was just explaining that all our data and code were completely free to use, open source and open access, when someone asked me: Why? What’s in it for you?

Why indeed? Why are there forty people dedicated to producing this project? Why do our funders give us all this money to do it? Why do we just give it all away for free?

Why do science at all?

The fundamental answer varies for all of us. Things like improving people’s lives, curiosity, discovery. These are the motivations that got most of us into careers in science in the first place. Ensembl may not be directly doing research, but we’re enabling it.

A tiny portion of the Ensembl farm

The Economic argument

There’s also an economic answer – in terms of time, money and infrastructure. How much does it cost to annotate a genome? To do pairwise sequence comparisons of over a million genes? To annotate variation? To make regulatory data meaningful? How much does it cost to put this into an easily accessible format? How much does it cost to regularly update this with new data? How many terabytes of memory do you need to actually store this stuff?

Even though these are non-trivial costs, infrastructure projects in bioinformatics are about saving money overall. Funders and scientists understand that lots of different labs need the data and the analysis that we produce. However, it would be horribly inefficient if each lab that needs the resources we provide had to produce them itself, repeating work that somebody else has already done, spending money that has already been spent, and spending time that could go on other experiments or other analysis. Therefore, we have a system where we do it for them and put it all up where they can find it. Nothing’s repeated. Plus, our experience, expertise and raw computing power mean that we can do it more cheaply and quickly than most labs can.

Free to be serendipitous

By giving the data away for free, we allow serendipitous discovery. If we charged people to use Ensembl in some kind of per-use manner, then they’d only use Ensembl to look for things they knew they were looking for. Yet we know that much of scientific discovery occurs when people accidentally stumble across things, like Alexander Fleming’s mouldy Staphylococcus plates. By allowing people to browse Ensembl freely, without worrying about costs, they may stumble across the tool or data that will be exactly what they need.

A relatively big group of people work for the project and they don’t work for free. But overall, we save the research community money by enabling science to be built on our foundation.

So, the answer to “what’s in it for me?”: I work for a project that makes science happen as efficiently as possible.

Posted in Overview | Leave a comment

Ensembl 71 has been released!

The latest Ensembl update (e!71) has just gone live!

What’s new in e!71?

New views and web features

We have added a new “clinically associated” summary track for Human showing ClinVar variants.

We have added a new expression view to the set of gene-based displays, listing the available RNASeq data for a given species, with links through to Region in Detail that turn on a set of RNASeq tracks for a given tissue.

A new transcript comparison view allows comparison of transcript sequences for a gene.

A scrollable overview panel (supported by most up-to-date browsers) is now available on the Region in Detail view. If you do not have a compatible browser you will see the static Region in Detail view, and there is an option available to switch between scrollable and static images for supported browsers.

The RNASeq data supporting the introns of RNASeq genes is now shown on the Supporting evidence panels for those transcripts. The display distinguishes between alignments that support canonical and ones that support non-canonical splice sites.

New Assemblies

We are happy to introduce two assembly updates for this release.

The chicken (Gallus gallus) genome, Galgal4, was produced by the International Chicken Genome Sequencing Consortium. In addition to being important agriculturally, the chicken is an important model organism for biomedical research, development, and ageing. The chicken is also one of the primary models for embryology and development, the study of viruses, and cancer.

The WBcel235 assembly of C. elegans (Caenorhabditis elegans) has been imported from the WS235 release of WormBase. C. elegans provides a model for complex organ systems, as well as developmental biology and genetics.

Posted in Ensembl, Release News | 2 Comments