I was thinking about the web design process for e50 – our new web interface due out in July (definitely will be late July). We’re at the stage now where Fiona is going to be asking users their preferences for all the “little things” which make no difference to the technical aspects of the web site but make a pretty big difference to its usability. For example, how do we colour our genes? This is a long-standing debate where everyone has an opinion and everyone’s opinion is right – at least for them. (Only 2 colours, and the colours should distinguish manually annotated genes from automatic ones, says one person. No – use the whole spectrum of colours, and make sure we distinguish non-coding RNA genes from pseudogenes from protein-coding genes and indicate which ones have orthologs – to mouse. No, to rat. No – instead of that use GO functional categories to colour genes. Or the number of non-coding SNPs. Or the gene-wide omega value from the dN/dS measurement.)

Sometimes people look at this debate and say that this is a clear area for user-defined colours. Which is sort of true for 10 seconds, but – not really. Firstly, most users are not going to get around to changing options: partly because they have better things to do (like design experiments and run them!), partly because this sort of configuration is just a bit too geeky, and partly because, to be honest, if they are into configuring things we’d like them first off to work out which tracks they would like displayed (more on this below), and colouring genes should be low on their list. Secondly, we want to provide a scheme which feels natural to the greatest number of people. Hence the rather long series of options to choose from currently being proposed.

The same argument goes for default tracks. (I can’t imagine not having SNPs on my display! I can’t imagine not having the ESTs switched on!) Everyone has an opinion and everyone is right. Here it is clear we’ve got to make sensible default decisions (which are also heavily, heavily speed optimised – sadly the new Collections framework won’t be ready for SNPs for 50, which is annoying, as really we want SNP density these days in human, but all the other obvious default tracks are pretty well optimised, including some funky scaling stuff to get the continuous basepair comparative genomics measure to come back sensibly when you are zoomed out). But then our main task is to get the user to explore, in the spirit of “wouldn’t it be nice to see xxxx, I wonder if Ensembl has it”, with a configuration system which is very enticing, but not in the way, and, importantly for the non-expert user, not completely overwhelming. In our e50 design this means more hierarchy in the options so they can be grouped (itself a bit of a pain to handle – we’ve got a lot of tracks), and a nice “light box” effect over the display which reassures you that (a) the thing that you were looking at won’t disappear and (b) the display will come back quickly. I think we’re on the right path here for the configuration, but we still have to decide on the default tracks (for me the only obvious one is “Genes”).

Finally, we’ve got the mundane business of which words to use for each of our “pagelet” displays. (Our new pagelets are very nice, and in our latest round of testing, >50% of the in-the-lab biologists liked not only the pagelets, but a specific layout of them. Less than 10% preferred the current Ensembl display.) So – we need one or two words to describe “A graphical representation of a phylogenetic tree of a gene with duplication nodes marked”. Hmmm. “Gene Tree”. Or “Phylogenetic Tree”? (Phylogenetic is a bit of a long word, and might get in the way of the menu…) What about “a text based alignment of resequenced individuals with the potential to mark up some features of interest”? Is this “resequencing alignment” or “individual alignment” or “individuals”?

If you’d like to take part in this, email survey@ebi.ac.uk (perhaps cc’d to Xose – xose@ebi.ac.uk) to make sure you are on our list. Ideally we’d like you to be wet-lab biologists. We have a lot of in-house or near-in-house opinions from bioinformaticians, and in any case, bioinformaticians are happier to explore configurations etc. It’s the researcher who will be visiting us – say – once or twice a month who we think is the main user to optimise for (again, we hope more frequent users will explore configuration to match things perfectly for them).

More on other e50 topics soon – speed, the importance of chocolate in bribing web developers and the end game for e50!


Development for the new Ensembl 50 website is progressing well… some of you may have already seen the test sites when you signed up to be part of our testing team…

One of the complaints about the current site (hardware failures aside) is the performance of the webpages. We are addressing this in a number of ways in the Ensembl 50 web code.

  • Tuning the Apache web server configuration:
    Compressing all HTML/Javascript/CSS files using mod_deflate;
    Minimizing the number and size of Javascript/CSS files by stripping unnecessary white space and comments from the files and merging them together;
    Setting headers to improve the browsers caching of content.
  • Aggressively caching content on the server side using a modified version of memcached (Linux users will need a 2.6.x kernel, as it relies on epoll).
  • Increased use of asynchronous HTTP requests (AJAX) to allow more immediate responses for the page while generating other content; and to minimize the content that is sent (can retrieve initially hidden content later)
  • Reducing page size: rather than having single pages containing lots of disparate information, we will have more pages each containing a smaller amount of information. This doesn’t just help with the page size; it also increases the discoverability of content on the site which people do not find easily, especially comparative genomics, variation and regulatory information.
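To make the "stripping and merging" step above concrete, here is a minimal sketch in Python. It is not the Ensembl web code (which is Perl, and which uses mod_deflate for compression). The function names `minify_css`, `merge_and_minify` and `gzip_size` are made up for illustration, and the minifier is deliberately naive: for instance, it would break selectors that rely on a space before a colon.

```python
import gzip
import re
from typing import Iterable


def minify_css(text: str) -> str:
    """Naively strip /* ... */ comments and unnecessary white space."""
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)  # drop comments
    text = re.sub(r"\s+", " ", text)                        # collapse white space
    text = re.sub(r"\s*([{};:,])\s*", r"\1", text)          # tighten punctuation
    return text.strip()


def merge_and_minify(sources: Iterable[str]) -> str:
    """Merge several stylesheets into one minified string (one request, not many)."""
    return "".join(minify_css(source) for source in sources)


def gzip_size(text: str) -> int:
    """Size in bytes after gzip, roughly what a compressing server would send."""
    return len(gzip.compress(text.encode("utf-8")))
```

For example, `merge_and_minify(["a { color: red; }", "b { margin: 0 ; }"])` gives one compact string, and comparing `len(...)` with `gzip_size(...)` on a real stylesheet shows how much the compression step adds on top of minification.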

For those who will be implementing local copies of Ensembl 50 code – additionally Ensembl 50 code will:

  • Make configuration easier – the pages will configure most of the tracks directly from the contents of the databases;
  • Make code more pluggable:
    ConfigPacker – the SpeciesDefs database parsing; and
    ImageConfig – replacement for UserConfig;
  • Make caching and AJAX implementation easier.
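The server-side caching idea can be sketched as a "cache-aside" lookup: check the cache first, and only build the page fragment when there is no fresh copy. The sketch below is an illustration in Python, not the Ensembl implementation. A plain dictionary stands in for memcached, and `TTLCache` and `get_or_render` are hypothetical names.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """In-process stand-in for a memcached-style cache with expiry times."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so that expiry can be tested
        self._store: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_render(self, key: str, render: Callable[[], str]) -> str:
        """Return the cached value for key, calling render() only on a miss."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = render()  # the expensive step, e.g. building a page fragment
        self._store[key] = (now + self._ttl, value)
        return value
```

A caller would wrap each expensive fragment in `cache.get_or_render(key, build_fragment)`. The same pattern suits fragments fetched later by an AJAX request, since each fragment can be cached under its own key.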

There are a number of changes to the code, so if you have written your own components or track-drawing code there will be work to be done, but in most cases these modifications are easy to implement (e.g. moving code between modules).

Finally, here are some additional system recommendations:

  • Perl 5.8.8 or newer;
  • MySQL 5.0 server;
  • 64 bit architecture;
  • large memory machine;
  • the ability to compile our modified “memcached” code (e.g. on Linux you will need a 2.6.x kernel) to get a significant speed-up.

For the past two days, Ensembl has been slow or has not returned the page (instead offering an ‘Ensembl is down’ yellow screen).

Be assured we are working on the problem. It is a hardware issue, but should be resolved soon.

From all of us in the Ensembl team, thanks for your patience!

As you know, we are working on a new website design for the Ensembl 50 release. We are currently seeking ‘beta testers’ who would be happy to take part in a survey and help us shape the look and feel of the new website.

If you could spare some time, we would be very grateful if you could send an email to survey@ebi.ac.uk so we can add you to our list of testers.

We are looking forward to hearing from you.
The Ensembl Team

Hello all,

There are a few Ensembl training events taking place this summer:

(2-day) Browser workshop in the Dept. of Genetics, University of Cambridge, UK (5-6 June)

Module in a Wellcome Trust Mini-Open Door Workshop (ODW) for MalariaGEN in Hinxton, UK (20 June)

Module in a Mini-ODW at the ICG in Berlin (12 July)

Programmers’ group at the ISMB meeting in Toronto, Canada (19-23 July)

As ever, email us with any questions (or comments).

Best Wishes,
Helpdesk

For the past few days I was in Barcelona at the European Human Genetics Conference 2008. After giving my presentation on Ensembl in one of the ‘Educational sessions’ and listening to numerous talks about GWAS (genome-wide association studies), I had a look at the posters. Under the impression that the UCSC Genome Browser is the preferred browser amongst (human) geneticists and with Ewan’s experience at the recent ‘Biology of Genomes’ meeting fresh in my mind, I decided to have a closer look at the posters in the Cytogenetics section. Out of 189 posters, I could positively identify 11 with Ensembl screenshots (mostly CytoView and ContigView, but also KaryoView twice), 8 with UCSC Genome Browser screenshots and none with NCBI Map Viewer screenshots. OK, I admit that I can recognise almost any pixel copied from our site and may have missed one or two UCSC screenshots, but all in all I thought this was a very encouraging result! Of course we should keep in mind that this was a European conference, mainly attended by European scientists… I guess I have a bit more screenshot counting to do at the International Congress of Genetics 2008 in Berlin. So, let’s say Ensembl 11 – UCSC 8 is the score at half time… next month I’ll report back with the final result!

Ensembl is now busy with preparations for our next release, Ensembl 50! We’re working hard and we’ll keep you updated on what’s in store for this release. Our biggest new development will be our revamped website. As usual, we have updated some species and provided new data for other species. Keep reading for an outline of what we aim to provide in Ensembl 50.

New web interface:
The most exciting change in Ensembl 50 will be a new web interface: Simpler, Better, Faster is what we’re aiming for. Not only will pages take less time to load, but they will also look a little different. We hope to have improved the navigability and discoverability of the site so that you can make the best possible use of the data we provide. We have taken into account your messages to helpdesk and your comments in courses. Let us know what you think by emailing helpdesk@ensembl.org!

Genebuild team:
In terms of new data for Ensembl 50, we have constructed new gene sets for tetraodon and cow. Vega/Havana (manual annotation) has released new gene sets for human and mouse so these will be displayed on our website alongside Ensembl genes.

For human, you may know that Ensembl and Havana merge identical transcripts. We have improved the Vega/Havana merge using the latest Havana gene set. Because untranslated regions are notoriously difficult to determine, we’ve used ditags when predicting UTRs for human. Finally, we have removed some dodgy-looking gene models that were highlighted by the Alpheus project.

For low-coverage genomes, gene models are predicted by projecting the human gene models down onto the 2x genomes. In this release, cat and pika have been updated by projecting current human gene models onto the existing assembly.

We’ve also updated the gene sets for C. elegans and chimp. Release notes for C. elegans can be found on the WormBase website. Chimp has an updated gene set to include more chimp-specific predictions, and genes projected from human onto chimp are updated.

The horse genomic assembly (EquCab2) has recently been updated such that chromosome 27 has been shortened. This is not a new genebuild as such, but we have modified our data to reflect this change. Zebrafish Agilent V2 Arrays have been mapped to cDNA and genomic sequences.

Canonical transcripts (the longest translations) have been labeled for all species in the database, though this will not appear in the browser. As usual, non-coding RNA genes have also been updated for most species, and cDNA alignments have been redone for human and mouse.
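As an illustration of the "longest translation" rule, the sketch below picks one transcript per gene in Python. It is not the Ensembl pipeline code. The `Transcript` record, its field names and the tie-breaking rule (the first transcript seen wins) are assumptions made for the example.

```python
from typing import Dict, Iterable, NamedTuple


class Transcript(NamedTuple):
    """Hypothetical minimal record; real transcripts carry far more detail."""
    gene_id: str
    transcript_id: str
    translation_length: int  # in amino acids; 0 for non-coding transcripts


def canonical_transcripts(transcripts: Iterable[Transcript]) -> Dict[str, Transcript]:
    """Return, for each gene, the transcript with the longest translation."""
    best: Dict[str, Transcript] = {}
    for transcript in transcripts:
        current = best.get(transcript.gene_id)
        # Strictly greater, so that ties keep the first transcript seen.
        if current is None or transcript.translation_length > current.translation_length:
            best[transcript.gene_id] = transcript
    return best
```

Given two transcripts of the same gene with translations of 120 and 340 residues, the one with 340 would be labelled canonical.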

Variation and Functional Genomics teams:
Our Variation team plans to provide updated single nucleotide polymorphisms (SNPs) for tetraodon, cow, human, chimp and orangutan. Our Functional Genomics team will provide promoter cis-regulatory motifs from here. They will also update the current regulatory build on human.

Comparative Genomics team:
Our Comparative Genomics team is extending their multiple alignments with new species and low-coverage (2x) genomes to include:
* 4-species: catarrhini primates EPO (Enredo-Pecan-Ortheus) alignments (human, chimp, orangutan, macaque)
* 12-species: amniote vertebrates Mercator-Pecan alignments (current 10-species alignments + Pongo pygmaeus and Equus caballus)
* 23-species: eutherian mammals EPO (Enredo-Pecan-Ortheus) alignments (all 2X genomes + current 7-species alignments + Pongo pygmaeus and Equus caballus)

GERP scores (% conservation on a basepair level for the 23-species eutherian mammals alignments) will be released.

The Compara (comparative genomics) team is working hard! They’re also providing new pairwise alignments:
* All the pairwise (between two species), whole-genome alignments (using tBLAT) will be updated using a new pipeline that follows a best-in-genome approach to filter spurious hits.
* The pairwise alignments for more closely related species (using BLASTz-net) will be updated for the following species so that the reference species is human:
. human vs Pongo pygmaeus
. human vs Loxodonta africana
. human vs Echinops telfairi
. human vs Oryctolagus cuniculus
. human vs Dasypus novemcinctus
. human vs Myotis lucifugus
. human vs Bos taurus
. human vs Ochotona princeps
. human vs Felis catus
Sitewise dN/dS values will be provided in our gene trees to detect positions in the alignments that are under different evolutionary pressure.

Web team:
Last but not least, please note that from Release 50 we will no longer be providing the ‘ssaha’ sequence search. If you wish to run your own ‘ssaha’ sequence search you can download the files to generate the search hashes from our FTP site. Alternatively, use BLAT (the BLAST-like Alignment Tool) which is equally fast and also demands exact matches.

That’s it for now! Any questions, just email helpdesk. We will be posting more information as the release date gets closer (we are aiming for end of July!)

I’m in the airline lounge about to head back from “Biology of Genomes” at Cold Spring Harbor Laboratory. As always, it was a great meeting. One highlight for me was seeing the 1,000 genomes data starting to flow – it is clear that the shift in technology is going to change the way we think about population genomics – and for me, the best session was the one on “non-traditional models” – Dogs, Horses and Cows, where the ability to do cost-effective genotyping has completely revolutionised the field. Now the peculiarities of the breeding structures, with Dog breeds being selected for diverse phenotypes, Cows with the elite bulls siring thousands of offspring due to artificial insemination and Horses having obsessive trait fixation over the last 1,000 years, can really bring power to genetics in different ways. Expect a lot more knowledge to come from these organisms and others (chickens, pigs, sheep…) over the coming years.

For my own group, Daniel Zerbino talked about Velvet, our new short read assembler which has also just been published in Genome Research. Velvet is now robust and capable of assembling “lower” eukaryotic genomes – certainly up to 300 Mb from short reads in read pair format. It is also being extensively used by other groups, often for partial, miniature de novo assemblies in regions. It went down well, and Daniel handled some pretty tricky questions in the Q&A afterwards. Next up – we get access to a 1.5TB real memory machine, and put a whole human genome WGS into memory. Alison (Meynert) and Michael (Hoffman) had great posters on cis-regulation and looked completely exhausted at the end of their poster session.

From Ensembl, Javier talked about the Enredo-Pecan-Ortheus pipeline (which we often nickname EPO). As some said to us afterwards, “you’ve really solved the problem, haven’t you” – Javier was able to show clear evidence that each component was working well, better than competing methods, and having an impact on real biological problems, for example, derived allele frequency. Its ability to handle duplications is a key innovation. Javier and Kathryn are currently wrestling the “final” 2x genomes into this framework, from which point we will start to have a truly comprehensive grasp on mammalian DNA alignments. I also like it because Enredo is another “de Bruijn graph”-like mechanism. Currently the joke is that about 10 minutes into any conversation I say “well, the right way to solve this problem is to put the DNA sequence into a de Bruijn graph”.
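For readers who have not met one, here is a toy version of "putting the DNA sequence into a de Bruijn graph", written in Python. It is only the textbook construction: nodes are (k-1)-mers and each k-mer in a read adds an edge from its prefix to its suffix. It is not how Velvet or Enredo are implemented, and the function name is made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    if k < 2:
        raise ValueError("k must be at least 2")
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        # Slide a window of width k along the read; reads shorter than k add nothing.
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)
```

With the overlapping reads "ACGT" and "CGTA" and k = 3, the graph is the single path AC, CG, GT, TA. The shared k-mer CGT collapses into one edge, which is the property that makes these graphs useful for merging overlapping sequence.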

Going to CSHL Biology of Genomes is always a little wince-making though, as this field – high-end genomics – really prefers to use the UCSC Genome Browser (which, as I’ve written before, is a good browser, and I take the use of it to be our challenge to make better interfaces for these users on our side). My informal count of screenshots was > 20 UCSC, 4 Ensembl (sneaking one case of ‘Ensembl genes’ shown in the UCSC browser as a point for each side) and 0 NCBI shots. Well. It just shows the task ahead of us. e50! – our user interface relaunch – is coming together, and we will start focus-group testing soon – time for us to address our failings head on. I’ll be blogging more about this as we start to head towards broader testing.

Lots more to write about potentially – Neanderthals, Francis Collins singing in the NHGRI band (quite an experience), reduced representation libraries with Elliott, genome-wide association studies (of which, I just _love_ the basic phenotype measures, from groups like Manolis Dermitzakis’) and structural variation… but for the moment I’ve got to persuade my body to feel as if it is 11.30 at night and see if I can get a good night’s sleep on the plane.

Things are moving within the rat community as this month’s Nature Genetics issue shows with a special on rat genetics exploring the latest developments.

Featuring:

  • ENU-induced gene targeting in rats;
  • A ‘white paper’ discussing progress and prospects in rat genetics;
  • A brief overview on rat genome resources online;
  • A contribution on the dynamics of CNV in rat and their impact on phenotypes;
  • A survey of genetic variation from The STAR Consortium (over 3 million newly identified SNPs and over 20,000 SNPs genotyped across 167 distinct inbred rat strains);
  • and several papers focusing on the identification of genetic variants associated with rat models of human disease…

The driving force behind these outstanding achievements can be found in a well-interlinked rat community bridging resources across the Atlantic: the collaborations between RGD and the EURATools Consortium (FP6 contract number LSHG-CT-2005-019015) are a good example.

EURATools investigators are developing integrated genome tools (Ensembl is one of the partners of this consortium). Integrating high-throughput sequencing and genotyping with informatics; intensive analysis of phenotypes, gene sequence and gene expression in congenic strains to identify genes and regulatory pathways for a wide range of rat disease phenotypes; and establishing optimised protocols for rat gene targeting are the goals of this ambitious EU funded project.

Hello to our readers, I hope everyone is having a nice April. In the UK we are experiencing a long winter with some rain, but spring seems to be around the corner… as are these upcoming workshops…

Did you know? The EBI has released tutorial videos.

Have a look at the Ensembl browser videos for information and direction to some of its pages! Or, learn more about BioMart, a fast data mining tool.

Upcoming workshops: May

Browser workshop at the WHO in Cairo (12-13 May)
Module in the Open Door Workshop at the Sanger (12-14 May)
Ensembl in China: The Shanghai Center for Bioinformation Technology (14-16 May)
Ensembl in China: Center for Bioinformatics, Beijing (19-21 May)
Browser and API workshops at the GTPB in Oeiras, Portugal (27-30 May)
Presentation at the European Human Genetics Conference in Barcelona (30 May)