Ensembl 76 is scheduled for release by the end of July 2014. Highlights for the coming release include:

New human assembly – GRCh38

  • Complete Ensembl annotation of the new genome
  • Update to GENCODE release 20, including manual annotation from Havana
  • Full regulatory build for human using the new Ensembl Regulatory Build Annotation pipeline
  • Alignment of Human BodyMap data
  • Updated Ensembl variation data for the new assembly
  • Browser track/variation set of variants for the Illumina Human OmniExpress-12v1 genotyping chip
  • NHLBI ESP data will be updated to version v.0.0.26
  • COSMIC version 69 will be imported

Other features and data sets

  • Mouse GENCODE release M3: manual annotation from Havana will be updated and merged with Ensembl automatic annotation to produce the next gene sets for GENCODE
  • Amazon molly gene set with BAM files and transcript models for 11 tissues
  • Olive baboon with BAM files and transcript models for 14 tissues including thymus, liver, lung, heart and pituitary
  • DGVa data will be updated for cow, dog, horse, human, macaque, mouse, pig, zebrafish
  • Chicken, cow, pig and sheep will be updated to dbSNP build 140

The GRCh38 blog series gives insight into how Ensembl annotates the new human assembly.

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.

What’s new in e!73?

  • Updated human GENCODE gene set including manual annotation from Havana.
  • New assembly patches (GRCh37.p12) have been added and annotated.
  • Import of human PhenCode data.
  • HGMD-PUBLIC data from release 2013.2 with regulatory data for human.
  • Update of human NHLBI GO Exome Sequencing Project (ESP) to EVS-v.0.0.20.
  • Mouse phenotype data from EuroPhenome, International Mouse Phenotyping Consortium and WTSI Mouse Genetics Project.
  • Updated zebrafish gene set including manual annotation from Havana.
  • Updated rabbit gene set models built from RNAseq data. Tissue-specific gene models with indexed BAM files are also provided.
  • COSMIC version 65 and update of COSMIC structural variants.
  • Phenotype data associated with the dbSNP variants from dbGaP.
  • dbSNP Build 138 data for chicken, pig, rat, zebrafish.

New search engine (Solr)

We will be using the Solr search engine, which builds on our existing Lucene search with the following features:

  • Faceted searching – restrict an existing search by species or category
  • Google-style search listings
  • Suggestions of similar terms (in case you mistyped a word)
  • Autocomplete for “real words” (e.g. enzyme names)
  • Preview of top result
  • Downloadable results when in table layout
  • Links to the term in other species if present
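As a rough illustration of faceted searching, the sketch below builds a query URL from standard Solr request parameters (`q`, `fq`, `facet`, `facet.field`, `wt`). The endpoint, the core name and the field names `species` and `feature_type` are assumptions made for this example; they are not taken from the Ensembl search configuration.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and core name, for illustration only.
SOLR_SELECT = "http://localhost:8983/solr/example_core/select"


def faceted_query(term, filters=None, facet_fields=("species", "feature_type")):
    """Build a Solr select URL that asks for facet counts and applies filters.

    `filters` maps a field name to a value; each entry becomes an `fq`
    (filter query) that restricts the existing search, which is what
    clicking on a facet does.
    """
    params = [("q", term), ("wt", "json"), ("facet", "true")]
    params += [("facet.field", field) for field in facet_fields]
    params += [("fq", f"{field}:{value}") for field, value in (filters or {}).items()]
    return SOLR_SELECT + "?" + urlencode(params)


# Search for BRCA2, then restrict the same search to one species.
url = faceted_query("BRCA2", filters={"species": "Human"})
print(url)
```

Keeping the restriction in `fq` rather than in `q` leaves the original search term untouched, so a facet can be added or removed without rewriting the query itself.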

Solr search results for human BRCA2 gene

New species

We are happy to announce annotation for two new species in this release.

Photo: Garth Peacock

The flycatcher (Ficedula albicollis) assembly, FicAlb_1.4 (GCA_000247815.1), is the first assembly of the collared flycatcher genome and was provided by Uppsala University. The flycatcher is a model for understanding species differentiation. We have generated BAM files and RNAseq-based gene models for nine different tissues, including embryonic tissue.

The duck (Anas platyrhynchos) genome, BGI_duck_1.0 (GCA_000355885.1), is a high-coverage assembly produced by the BGI and the duck genome consortium. The duck genome contributes to the understanding of avian flu and provides insights into the mechanisms by which the host and influenza viruses interact.

A complete list of the changes can be found on the Ensembl website.

Ensembl is holding a workshop titled ‘Introduction to automatic
gene annotation’, aimed at developers. The workshop runs on 4-5
September 2012 at the Department of Genetics, University of Cambridge,
UK. Two Ensembl developers will present sessions on how to create
your own core database, including loading a genome assembly
into a database and running simple analyses using the Ensembl
genebuild pipeline.

Participants will be expected to have programming experience and
a background in object-oriented programming. Good familiarity with
Perl, a Unix/Linux environment and MySQL is essential to follow the
workshop and the programming examples. Knowledge of the Ensembl core
API is also essential.

Topics to be presented:

  • Introduction to the GeneBuild pipeline, including data input types, generating protein-coding transcript models, and adding UTR to these models
  • An introduction to assembly structure (toplevel, contigs, scaffolds, chromosomes)
  • Overview of the different Ensembl APIs
  • Obtaining the Ensembl API (cvs checkout)
  • Core database schema
  • Tracking jobs in the pipeline
  • Runnable and RunnableDB modules

Practical sessions:

  • Creating a genebuild database
  • Loading an assembly into the database
  • Running algorithms first on the commandline and then using the pipeline
  • Understanding how the pipeline code interacts with the algorithms and the database
  • Understanding the pipeline’s job tracking system
  • Visualisation of results with Apollo.

Registration for this workshop is free, but participants will need to
cover their own accommodation and meal expenses. Would you like to
join us? Please contact Bert (bert@ebi.ac.uk) for more details or to
register.

Related Wellcome Trust Conference:
Genome Informatics 2012, 6-9 September, Cambridge.

 

We are delighted to inform you that CRYSTAL, University of Malaya will be hosting a workshop on Ensembl gene annotation on the 5th and 6th December, 2011.

This 2-day workshop is aimed at developers and bioinformatics programmers. The workshop will consist of sessions on how to create your own core database, including the loading of a genome assembly into a database and the running of simple analyses using the Ensembl genebuild pipeline.

Prerequisite: Participants will be expected to have experience in writing Perl programs. A background in object-oriented programming techniques and familiarity with databases (MySQL) are essential to follow the workshop. Knowledge of the Ensembl core API is also essential.

Topics to be presented:

  • Introduction to the GeneBuild pipeline, including data input types, generating protein-coding transcript models, and adding UTR to these models
  • An introduction to assembly structure (toplevel, contigs, scaffolds, chromosomes)
  • Overview of the different Ensembl APIs
  • Obtaining the Ensembl API (cvs checkout)
  • Core database schema
  • Tracking jobs in the pipeline
  • Runnable and RunnableDB modules

Practical sessions:

  • Creating a genebuild database
  • Loading an assembly into the database
  • Running algorithms first on the commandline and then using the pipeline
  • Understanding how the pipeline code interacts with the algorithms and the database
  • Understanding the pipeline’s job tracking system
  • Visualisation of results with Apollo.

This workshop will be conducted by Dr Amonida Zadissa and Magali Ruffier of the Ensembl team, Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK.

Registration: To register, please email the information requested on the registration form to crystal_seminar@um.edu.my by 19th November. For workshop details, please contact Dr Amonida Zadissa at amonida@sanger.ac.uk or Dr Giulietta Spudich at gspudich@ebi.ac.uk.

Please note: the number of participants is limited to 15. The organisers shall select the participants based on the information provided in the registration form. Only successful applicants will be notified.

Due to the changes in the web interface, the URLs for a number of pages have changed. In most cases the web code catches the old URLs, but because of the nature of the site the following kinds of request have changed:

  • Configuring the way a page is rendered;
  • Changing the way tracks are rendered;
  • Adding DAS sources via a web address rather than via the web interface;
  • Attaching UCSC-style external resources.

These are now all handled in a similar, systematic way:

  • To change global page settings: add a parameter config=key=value{,key=value}
    e.g. to turn off the top image on Location > Region in detail

    http://www.ensembl.org/Homo_sapiens/Location/View?r=1:1000-2000;config=view_top=off

    e.g. to link directly to the Exon Intron markup panel (Transcript > Exons) and to show full introns and only 60bp flanking sequence AND turn the display to be 60bp wide

    http://www.ensembl.org/Homo_sapiens/Transcript/Exons?t=ENST00000309255;config=flanking=60,seq_cols=60,fullseq=yes

  • To change the configuration of an individual panel: add a parameter referring to the panel (this will be documented shortly on the website), e.g. for Location > Region in detail the two panels are contigviewtop and contigviewbottom; for Location > Region overview it is cytoview. The value is again a comma-separated list, where the left-hand side of each “=” is the name of the track and the right-hand side is the name of the “renderer” to use; the latter depends on the type of track. Additionally, the left-hand side can be used to integrate external data.
    Notes:
    • Tracks are now named systematically, so the names will differ from the values you may be used to. We will shortly publish a list of these; one example is transcript_core_ensembl, the Ensembl genes from the Ensembl database.
    • Renderers depend on the type of track. For transcripts you have the option of “transcript_label”, “transcript_nolabel”, “collapsed_label” and “collapsed_nolabel”; for alignment features (and, at the moment, URL-attached data) “normal”, “half_height”, “stack”, “unlimited” and “ungrouped”; for DAS tracks “labels” (show labels if configured by the source) or “nolabels” (hide labels).
    • At the moment two special parameters can be used:
      das:http://www.mydas.source/das/my_data=render
      – which attaches a DAS source to the session and selects the renderer
      url:http://www.myweb.server/my_data.format=render

    For example:

    http://www.ensembl.org/Homo_sapiens/Location/View?g=ENSG00000012048;config=panel_top=off;contigviewbottom=das:http://www.ensembl.org/das/Homo_sapiens.NCBI36.transcript=nolabels,transcript_core_ensembl=collapsed_nolabel

    This turns on a DAS source (in this case the Ensembl transcripts), collapses the standard Ensembl track down to a single line per gene, and also turns off the top panel.
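The URL scheme described above can be sketched as a small helper. This is an illustration only, not part of the Ensembl web code: the function name `build_view_url` and its arguments are made up for this example, and the values are joined without URL-encoding, as in the example links above.

```python
BASE = "http://www.ensembl.org"


def build_view_url(species, view, query, config=None, panels=None):
    """Assemble a page URL from semicolon-separated parameters.

    query  -- ordinary parameters such as {"r": "1:1000-2000"}
    config -- global page settings, emitted as config=key=value,key=value
    panels -- per-panel settings, e.g. {"contigviewbottom": {track: renderer}},
              emitted as panel=track=renderer,track=renderer
    """
    parts = [f"{key}={value}" for key, value in query.items()]
    if config:
        settings = ",".join(f"{key}={value}" for key, value in config.items())
        parts.append("config=" + settings)
    for panel, tracks in (panels or {}).items():
        track_list = ",".join(f"{track}={renderer}" for track, renderer in tracks.items())
        parts.append(f"{panel}={track_list}")
    return f"{BASE}/{species}/{view}?" + ";".join(parts)


# Turn off the top image and collapse the Ensembl transcript track.
url = build_view_url(
    "Homo_sapiens",
    "Location/View",
    query={"r": "1:1000-2000"},
    config={"view_top": "off"},
    panels={"contigviewbottom": {"transcript_core_ensembl": "collapsed_nolabel"}},
)
print(url)
```

A DAS or URL source would be passed the same way, using the full `das:...` or `url:...` string as the track name on the left-hand side.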


Development for the new Ensembl 50 website is progressing well… some of you may have already seen the test sites when you signed up to be part of our testing team…

One of the complaints about the current site (hardware failures aside) is the performance of the web pages. We are addressing this in a number of ways in the Ensembl 50 web code.

  • Tuning the Apache web server configuration:
    Compressing all HTML/Javascript/CSS files using mod_deflate;
    Minimizing the number and size of Javascript/CSS files by stripping unnecessary white space and comments from the files and merging them together;
    Setting headers to improve the browsers caching of content.
  • Aggressively caching content on the server side using a modified version of memcached (on Linux this requires a 2.6.x kernel, as it uses epoll).
  • Increased use of asynchronous HTTP requests (AJAX), so that the page can respond immediately while other content is still being generated, and so that less content is sent up front (initially hidden content can be retrieved later).
  • Reducing page size: rather than single pages containing lots of disparate information, there will be more pages, each containing a smaller amount of information. This not only helps with page size but also makes it easier to discover content that people do not currently find easily, especially comparative genomics, variation and regulatory information.
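To make the first point concrete, here is a minimal sketch of merging, stripping and compressing CSS files. It is not the Ensembl build code: the regular expressions are deliberately naive (they are not a full CSS parser), and `gzip` stands in for the compression that mod_deflate would apply on the server.

```python
import gzip
import re


def minify_css(source: str) -> str:
    """Strip /* ... */ comments and unnecessary white space from CSS text."""
    no_comments = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    collapsed = re.sub(r"\s+", " ", no_comments)
    # Drop the spaces around braces and semicolons only; other spacing is kept.
    tight = re.sub(r"\s*([{};])\s*", r"\1", collapsed)
    return tight.strip()


def bundle(stylesheets):
    """Merge several stylesheets into one minified string plus its gzipped bytes."""
    minified = minify_css("\n".join(stylesheets))
    return minified, gzip.compress(minified.encode("utf-8"))


sheets = [
    "/* layout */\nbody {\n    margin: 0;\n    padding: 0;\n}\n",
    "/* links */\na {\n    color: blue;\n}\n",
]
minified, compressed = bundle(sheets)
print(minified)
```

Merging reduces the number of requests a browser has to make, while stripping and compression reduce the number of bytes sent per request.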

For those who will be installing local copies, the Ensembl 50 code will additionally:

  • Make configuration easier – the pages will configure most of the tracks directly from the contents of the databases;
  • Make code more pluggable:
    ConfigPacker – the SpeciesDefs database parsing; and
    ImageConfig – replacement for UserConfig;
  • Make caching and AJAX implementation easier.

There are a number of changes to the code, so if you have written your own components or track-drawing code there will be work to do, but in most cases the modifications are easy to implement (e.g. moving code between modules).

Finally, here are some additional system recommendations:

  • Perl 5.8.8 or newer;
  • MySQL 5.0 server;
  • 64 bit architecture;
  • large memory machine;
  • optionally, our modified “memcached” code compiled for your system (on Linux this needs a 2.6.x kernel), which gives a significant speed-up.