Steve posted the news that we’re delaying our new release for at least two more weeks. The message is pasted in here:

Hi all

In our Intentions Summary mail for release 51 we stated that the release was scheduled for early/mid September. The 51 release will include significant updates and improvements to the web interface. We are delaying release while we complete development on these. We are working to get the release out as soon as possible, and are now aiming for end September/early October. I apologise for this delay.

Steve

 

Dr Steve Searle
Ensembl Project Leader, Sanger

It is always so frustrating to delay, but of course it is far more important to have a working site than one that only partly works. Welcome to delivering high-end services.

We took on a lot of things to change in this web refresh. For most users the main thing they will notice is the entirely new web layout. This was driven by our surveys of users, who mainly complained about being buried in too many displays and too much data. We then took around 4 months working with user groups and trialling different layouts (many thanks to those who participated), which in some cases made significant changes to our original designs (we now have a hybrid “tab and left-hand-side” approach, voted as best by ~60% of people, with the other three options splitting the rest of the vote). We’re very excited about this new layout going live, as it just looks cleaner and less cluttered and yet provides more information. The other thing people will notice is that it is just faster. As the saying goes, you can’t be too rich, too thin or have your websites go too fast.

Making a website go faster is harder than it might look. It involves all sorts of things – the bandwidth from your machines to us, the speed of the servers, the connectivity of servers to databases, the speed of the API, the database to disk, the management of the huge number of simultaneous users we have, the size of the html returned and finally the render speed on your browser. All of these contribute to the overall perception of “speed”.

Under the hood we’ve been working on all these aspects. Internally a big change is that our web farm no longer needs a common file system to work off. Previously, when your browser asked for a contigview page, our servers generated html with an image and that image was written to the common disk; the browser parsed the image tag and asked for this image, and – this is the critical bit – that request would in all likelihood be served by a different server in our webfarm. That server then went to the common file system to pick up the file and send it back. Many times a critical bottleneck has been read/write on this shared filesystem. In the new system this has all gone, and the images are stored in a memory-based common store, meaning firstly that we remove this bottleneck (which will be the first big effect) and secondly that we will be able to cache a lot more – the hope is that many of the identical pictures for the common species will be entirely served from memory in the new system.

Another important change has been aggressively slimming our html. Currently all sorts of files – often very small – are pinged on each page load, just to see if they have changed. We’ve consolidated a lot of these files – and compressed them – and then also optimised them for render speed.
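
For the curious, here is a rough sketch of the image-store idea in Python. It is not our actual code: the class names, the key scheme and the fake image bytes are all made up for illustration, and a plain dictionary stands in for the memory-based common store. The point is only that one server can render an image and a different server can hand it back, with no shared disk involved.

```python
import hashlib


class SharedImageStore:
    """Hypothetical stand-in for a memory-based store shared by all servers."""

    def __init__(self):
        self._images = {}

    def put(self, key, image_bytes):
        self._images[key] = image_bytes

    def get(self, key):
        return self._images.get(key)


def image_key(species, region, tracks):
    """Identical views map to the same key, so they can share one image."""
    canonical = "|".join([species, region, ",".join(sorted(tracks))])
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class WebServer:
    """One machine in the web farm (illustrative only)."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.renders = 0  # how many images this server had to draw itself

    def render_page(self, species, region, tracks):
        key = image_key(species, region, tracks)
        if self.store.get(key) is None:
            # Cache miss: draw the image once and put it in the shared store.
            self.renders += 1
            fake_png = ("PNG for " + key).encode("utf-8")
            self.store.put(key, fake_png)
        return '<img src="/img/%s.png">' % key

    def serve_image(self, key):
        # Any server can answer the follow-up image request from memory.
        return self.store.get(key)


store = SharedImageStore()
server_a = WebServer("a", store)
server_b = WebServer("b", store)

html = server_a.render_page("human", "1:1000-2000", ["genes", "snps"])
key = image_key("human", "1:1000-2000", ["snps", "genes"])
image = server_b.serve_image(key)  # served by a different machine
```

Because the key depends only on what is being displayed, a second user asking for the same picture of a common species would hit the store rather than trigger another render.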

There are a variety of things not for this release but coming up end of 2008/early 2009 that are also about speed. Our API has a new concept, collections, which better handles the case of zoomed-out views, where we know the renderers will not be able to draw every object. Instead a collection, which may be rendered as a union or a density or something similar, will be provided. The other thing on the horizon is us setting up a US mirror on the west coast. For the last year we have been extensively monitoring the speed of Ensembl from different sites, and there is a large increase in retrieval time on the north-west coast of the US. We’ve been investigating quite why this is (and learning lots more about the backbone of the internet than we knew before), but it seems as if the simplest way to get good speed on the west coast is to just run a mirror over there. Probably 2009 for that to go live.
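
To give a flavour of what a “density” might mean for a zoomed-out view, here is a minimal Python sketch. It is an assumption about the general idea, not the Collections API itself: the function name, the half-open integer intervals and the equal-width bins are all invented for illustration.

```python
def density(features, region_start, region_end, bins):
    """Count how many features overlap each of `bins` equal-width windows.

    `features` is a list of (start, end) pairs, treated as half-open
    integer intervals [start, end). This is a hypothetical helper.
    """
    if bins <= 0 or region_end <= region_start:
        raise ValueError("need a non-empty region and at least one bin")

    width = (region_end - region_start) / bins
    counts = [0] * bins

    for start, end in features:
        # Clip the feature to the region being displayed.
        start = max(start, region_start)
        end = min(end, region_end)
        if start >= end:
            continue  # feature lies entirely outside the region

        first = int((start - region_start) // width)
        last = int((end - 1 - region_start) // width)
        for i in range(first, min(last, bins - 1) + 1):
            counts[i] += 1

    return counts


# Three features summarised into ten bins across a 100 bp region.
summary = density([(0, 10), (5, 15), (90, 100)], 0, 100, 10)
```

The renderer then only has to draw one value per bin, however many objects sit underneath.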

Back to the website. It looks so much better, and has much better hardware characteristics (our shared file system is … well … rather 2004 technology and needs pretty constant care at the moment), that I can’t wait until it comes out. But there is absolutely no point in having a site crippled in functionality even though we’ve got many of the user interface and technical issues right. The sticking point at the moment is the configuration panel. This comes up as a “modal” box on top of the page, allowing a lot of options to choose from, but not a bewildering set of options on each page. To cope with the 200-odd different tracks to switch on and off, the box has to have tabs and friendly, browseable hierarchies. To get all this to work in a nice, friendly, slick way… that’s a lot of Javascript.

And a lot of Javascript is a lot of browser compatibility headaches. Even using JS libraries – prototype and script.aculo.us (I think – James Smith can tell you the details!) – there are all sorts of details that might not work in quite the same way on IE5 compared to IE6. Or Firefox. Or Safari. And it must degrade at least functionally without JS. And of course work, and render fast. This modal box is the last, complex thing to get sorted.

We’re close. I’ve seen the box come up on James’ screen. I hear Steve has seen it come up and tracks change, and seen the link between tracks and changes. The API for the configuration system was gutted and is much better. But it’s got to work on all main browsers. For all our genomes, in particular Human and Mouse. And this is just tricky, fiddly work.

We’re not quite there yet. We’re really close, and so much is working it is just excruciating. But we need another couple of weeks. James is being shielded from other jobs by Steve and others; Eugene is torture testing memcachedb to stress test the system before it goes live; Xose, Bert and Giulietta are writing help; Beth and Anne are writing the additional pagelets inside the new geneview and transcriptviews. And it all looks really good.

So – apologies – we thought we’d be launching in July. We thought we’d be launching in September. We still might just do that, but then again, it might well be October. If it goes any later I will have no hair.

But it does look really good.

It is definitely worth the wait. Like Guinness.

Ewan

The next release (50) will happen in just under a week’s time. This will retain the old (classic) look, with the Ensembl interface you are all used to! The new interface will be released in August as a publicly accessible beta testing site alongside our usual Ensembl, in order to make sure everything is running smoothly before we switch over completely. This will give us time to collect feedback from you about the new interface, before we completely switch over to the new interface in release 51 (due in September).

What can you expect in release 50?

A new gene set for human, where UTRs (UnTranslated Regions) are based on ditags. An improved merge between the new human Ensembl gene set and the latest manually annotated gene set from Havana will be available. Also, new gene sets for tetraodon (genes from the Ensembl pipeline along with other genes from the genoscope set), C. elegans (WS190), and projection of the new human set against pika and cat.

Cow has a new assembly and geneset! The Ensembl automated pipeline was run on Btau 4.0 for this release.

New variation sets will be available for orangutan, tetraodon, cow and human.

We will keep you posted about the new interface, beta testing surveys, and upcoming organisms and annotation in release 51.

Thanks to all our users.

I was thinking about the web design process for e50 – our new web interface due out in July (definitely will be late July). We’re at the stage now where Fiona is going to be asking users their preferences for all the “little things” which make no difference to technical aspects of the web site but make a pretty big difference to the usability. Like, for example, how do we colour our genes? This is a long-standing debate where everyone has an opinion and everyone’s opinion is right – at least for them. (Only 2 colours, and the colours should distinguish manually annotated genes from automatic, says one person. No – use the whole spectrum of colours, and make sure we distinguish non-coding RNA genes from pseudogenes from protein coding genes and indicate which ones have orthologs – to mouse. No, to rat. No – instead of that use GO functional categories to colour genes. Or the number of non-coding SNPs. Or the gene-wide omega value from the dN/dS measurement.)

Sometimes people look at this debate and say that this is a clear area for user-defined colours. Which is sort of true for 10 seconds, but – not really. Firstly most users are not going to get around to changing options – partly because they have better things to do (like design experiments and run them!), partly because this sort of configuration is just a bit too geeky and partly because, to be honest, if they are into configuring things we’d like them first off to work out which tracks they would like displayed (more on this below), and colouring genes should be low on their list. Secondly we want to provide a scheme which feels natural to the largest number of people. Hence the rather long series of options to choose from currently being proposed.

The same argument goes for default tracks. (I can’t imagine not having SNPs on my display! I can’t imagine not having the ESTs switched on!) Everyone has an opinion and everyone is right. Here it is clear we’ve got to make sensible default decisions (which are also heavily, heavily speed optimised – sadly the new Collections framework won’t be ready for SNPs for 50, which is annoying, as really we want SNP density these days in human, but all the other obvious default tracks are pretty well optimised, including some funky scaling stuff to get the continuous basepair comparative genomics measure to come back sensibly when you are zoomed out). But then our main task is to get the user to explore, as in “wouldn’t it be nice to see xxxx, I wonder if Ensembl has it”, with a configuration system which is very enticing but not in the way and, importantly for the non-expert user, not completely overwhelming. In our e50 design this means more hierarchy in the options so they can be grouped (itself a bit of a pain to handle – we’ve got a lot of tracks), and a nice “light box” effect over the display which reassures you that (a) the thing that you were looking at won’t disappear and (b) the display will come back quickly. I think we’re on the right path here for the configuration, but we still have to decide on the default tracks (for me the only obvious one is “Genes”).

Finally we’ve got the mundane business of which words we use for each of our “pagelet” displays. (Our new pagelets are very nice, and in our latest round of testing, >50% of the in-the-lab biologists liked not only the pagelets, but a specific layout of them. Less than 10% preferred the current Ensembl display.) So – we need one or two words to describe “A graphical representation of a phylogenetic tree of a gene with duplication nodes marked”. Hmmm. “Gene Tree”? Or “Phylogenetic Tree”? (Phylogenetic is a bit of a long word, and might get in the way of the menu…) What about “a text based alignment of resequenced individuals with the potential to mark up some features of interest”? Is this “resequencing alignment” or “individual alignment” or “individuals”?

If you’d like to take part in this, email survey@ebi.ac.uk (perhaps cc’d to Xose – xose@ebi.ac.uk) to make sure you are on our list. Ideally we’d like you to be wet-lab biologists. We have a lot of in-house or near-in-house opinions from bioinformaticians, and in any case, bioinformaticians are happier to explore configurations etc. It’s the researcher who will be visiting us, say, once or twice a month who we think is the main user to optimise for (again, more frequent users we hope will explore configuration to match things perfectly for them).

More on other e50 topics soon – speed, the importance of chocolate in bribing web developers and the end game for e50!


Development for the new Ensembl 50 website is progressing well… some of you may have already seen the test sites when you signed up to be part of our testing team…

One of the complaints about the current site (hardware failures aside) is the performance of the webpages – we are addressing this in a number of ways in the Ensembl 50 web code.

  • Tuning the Apache web server configuration:
    Compressing all HTML/Javascript/CSS files using mod_deflate;
    Minimizing the number and size of Javascript/CSS files by stripping unnecessary white space and comments from the files and merging them together;
    Setting headers to improve the browsers caching of content.
  • Aggressively caching content on the server side using a modified version of memcached (Linux users will need a 2.6.x kernel, as it uses epoll).
  • Increased use of asynchronous HTTP requests (AJAX) to allow more immediate responses for the page while generating other content; and to minimize the content that is sent (can retrieve initially hidden content later)
  • Reducing page size – rather than having single pages containing lots of disparate information, having more pages each containing a smaller amount of information; this doesn’t just help with page size, but also increases the discoverability of content on the site that people do not find easily, especially comparative genomics, variational genomics and regulatory information.
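
As a rough illustration of the merging, stripping and compressing steps in the list above, here is a small Python sketch. It is not the Ensembl build code: the function names are invented, the minifier is deliberately naive (it ignores edge cases such as comment-like text inside strings), and Python's gzip module simply stands in for what mod_deflate does on the server.

```python
import gzip
import re


def minify_css(text):
    """Naive CSS minifier: drop comments and squeeze whitespace."""
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)  # strip comments
    text = re.sub(r"\s+", " ", text)                        # collapse whitespace
    text = re.sub(r"\s*([{};])\s*", r"\1", text)            # tidy around { } ;
    return text.strip()


def bundle(files):
    """Merge several stylesheets into one minified string (one request)."""
    return "".join(minify_css(f) for f in files)


def gzip_bytes(text):
    """Compress the text roughly as a server-side deflate filter would."""
    return gzip.compress(text.encode("utf-8"))


css_a = "/* header */\nbody {\n  margin: 0;\n}\n"
css_b = "h1 {\n  color: red; /* brand */\n}\n"

merged = bundle([css_a, css_b])
payload = gzip_bytes(merged)
```

On files this tiny the gzip step gains little or nothing because of its fixed overhead; the saving shows up on realistically sized stylesheets and scripts, and merging also cuts the number of requests the browser has to make.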

For those who will be implementing local copies of Ensembl 50 code – additionally Ensembl 50 code will:

  • Make configuration easier – the pages will configure most of the tracks directly from the contents of the databases;
  • Make code more pluggable:
    ConfigPacker – the SpeciesDefs database parsing; and
    ImageConfig – replacement for UserConfig;
  • Make caching and AJAX implementation easier.

There are a number of changes to the code – so if you have written your own components or track drawing code there will be work to be done, but in most cases these modifications are easy to implement (e.g. moving code between modules).

Finally, here are some additional system recommendations:

  • Perl 5.8.8 or newer;
  • MySQL 5.0 server;
  • 64 bit architecture;
  • large memory machine;
  • you can compile our modified “memcached” code (e.g. for Linux you will need a 2.6.x kernel) to get a significant speed-up.

Ensembl is now busy with preparations for our next release, Ensembl 50! We’re working hard and we’ll keep you updated on what’s in store for this release. Our biggest new development will be our revamped website. As usual, we have updated some species and provided new data for other species. Keep reading for an outline of what we aim to provide in Ensembl 50.

New web interface:
The most exciting change in Ensembl 50 will be a new web interface: Simpler, Better, Faster is what we’re aiming for. Not only will pages take less time to load, but they will also look a little different. We’re hoping that we will have improved the navigability and discoverability of the site so that you can make the best use possible of the data we provide. We have taken into account your messages at helpdesk and your voices in courses. Let us know what you think by emailing helpdesk@ensembl.org !

Genebuild team:
In terms of new data for Ensembl 50, we have constructed new gene sets for tetraodon and cow. Vega/Havana (manual annotation) has released new gene sets for human and mouse so these will be displayed on our website alongside Ensembl genes.

For human, you may know that Ensembl and Havana merge identical transcripts. We have improved the Vega/Havana merge using the latest Havana gene set. Because untranslated regions are notoriously difficult to determine, we’ve used ditags when predicting UTRs for human. Finally, we have removed some dodgy-looking gene models that were highlighted by the Alpheus project.

For low-coverage genomes, gene models are predicted by projecting the human gene models down onto the 2x genomes. In this release, cat and pika have been updated by projecting current human gene models onto the existing assembly.

We’ve also updated the gene sets for C. elegans and chimp. Release notes for C. elegans can be found on the WormBase website. Chimp has an updated gene set to include more chimp-specific predictions, and genes projected from human onto chimp are updated.

The horse genomic assembly (EquCab2) has recently been updated such that chromosome 27 has been shortened. This is not a new genebuild as such, but we have modified our data to reflect this change. Zebrafish Agilent V2 Arrays have been mapped to cDNA and genomic sequences.

Canonical transcripts (the longest translations) have been labeled for all species in the database, though this will not appear in the browser. As usual, non-coding RNA genes have also been updated for most species, and cDNA alignments have been redone for human and mouse.

Variation and Functional Genomics teams:
Our Variation team plans to provide updated single nucleotide polymorphisms (SNPs) for tetraodon, cow, human, chimp and orangutan. Our Functional Genomics team will provide promoter cis-regulatory motifs from here. They will also update the current regulatory build on human.

Comparative Genomics team:
Our Comparative Genomics team is extending their multiple alignments with new species and low-coverage (2x) genomes to include:
* 4-species: catarrhini primates EPO (Enredo-Pecan-Ortheus) alignments (human, chimp, orangutan, macaque)
* 12-species: amniote vertebrates Mercator-Pecan alignments (current 10-species alignments + Pongo pygmaeus and Equus caballus)
* 23-species: eutherian mammals EPO (Enredo-Pecan-Ortheus) alignments (all 2X genomes + current 7-species alignments + Pongo pygmaeus and Equus caballus)

GERP scores (% conservation on a basepair level for the 23-species eutherian mammals alignments) will be released.

The Compara (comparative genomics) team is working hard! They’re also providing new pairwise alignments:
* All the pairwise (between two species), whole-genome alignments (using tBLAT) will be updated using a new pipeline that follows a best-in-genome approach to filter spurious hits.
* The pairwise alignments for more closely related species (using BLASTz-net) will be updated for the following species so that the reference species is human:
. human vs Pongo pygmaeus
. human vs Loxodonta africana
. human vs Echinops telfairi
. human vs Oryctolagus cuniculus
. human vs Dasypus novemcinctus
. human vs Myotis lucifugus
. human vs Bos taurus
. human vs Ochotona princeps
. human vs Felis catus
Sitewise dN/dS values will be provided in our gene trees to detect positions in the alignments that are under different evolutionary pressure.

Web team:
Last but not least, please note that from Release 50 we will no longer be providing the ‘ssaha’ sequence search. If you wish to run your own ‘ssaha’ sequence search you can download the files to generate the search hashes from our FTP site. Alternatively, use BLAT (the BLAST-like Alignment Tool) which is equally fast and also demands exact matches.

That’s it for now! Any questions, just email helpdesk. We will be posting more information as the release date gets closer (we are aiming for end of July!)

Yesterday, Ensembl released a new version of the browser and database (version 49). Along with new species, homologue predictions, and new code in our API, there have been changes in how the multiple alignments are done on the whole-genome scale. Have a look at the news for more details.

We are looking forward to release 50, as we are working on some new features. Keep your eye out in August for this next release. A reminder: we will not release another version between now and August, and updates may appear in the Pre! site but not in the main site during that time.

Please explore features on release 49 such as BLAST which is now configured to align queries against top-level sequences (i.e. chromosomes and scaffolds), and BLAT, a fast alignment program which is now the default selection.

Paralogues are shown in blue in GeneTreeView to help guide your eye.

Upcoming workshops – April

(March workshops are listed in a previous post)

Browser workshops at the VIB Ghent and Leuven (31 Mar – 2 Apr)
Browser workshop (focus: rat) at the ULB Brussels (EURATools) (16 Apr)
Browser workshop at the BCB UCL/Birkbeck (21 Apr)
Module in the EBI roadshow in Poitiers (23, 24 Apr)
API workshop at the Dept. of Genetics, Cambridge (28, 29, 30 Apr)

Keep your eye out for Release 49, which is due on Tuesday 18 March. The delay is due to the scheduled downtime and maintenance at the Sanger and EBI this weekend, which has caused some trouble. However, Release 49 will soon be visible to the community!

New features in release 49 will include BLAST against top-level sequences on all species, updates on the GeneTreeView page that should make things easier to see, and new Ensembl gene sets for Orangutan, Horse and Takifugu. FlyBase 5.4 will be imported for Fruitfly. For API users, the regulatory features will be moved from the core API to the functional genomics API.

Also, a word of warning to those using our mouse clones under ‘DAS sources’: the MICER clones and the bMQ set (129S7/AB2.2 in the ‘DAS Sources’ menu of ContigView). The clones, originally mapped to NCBIM36, are lifted over to the new assembly (NCBIM37) coordinates. The drawing indicates where the clone lifts over to in the new assembly. However, the pop-up box shows the coordinates of the original mappings. This is indicated in Ensembl by the ‘NCBIM36’ label above the coordinates.

Write to our helpdesk if you are confused!

This new release hosts an updated zebrafish assembly (Zv7) along with newly determined gene sets for zebrafish, platypus and chimpanzee.

SequenceAlignView is a new page allowing sequences to be compared across strains (mouse and rat)/individuals (humans). Variations can be displayed in this view. See the sitemap to find the page.

Finally, three new videos are available in the help. See the video tutorials here:
http://www.ensembl.org/common/Workshops_Online

More news is available on our website. Come find out what’s new!

Giulietta (Ensembl)

The latest update of the Ensembl Genome Browser and associated databases occurred 13 June, 2007 (release 45). This release was coordinated with the publication of the ENCODE project in the journal Nature. The first stage of this project focused on an in-depth view of 1% of the genome, contributing to a set of regulatory features that has now been incorporated into Ensembl, accessible in ContigView.

BioMart 0.6 has just been released, including an improved layout and response times for result viewing. In addition, the query can be exported in Perl API format via the new ‘Perl’ button.

Those are the major updates at this time.