I’m going to be blogging a bit more about the recent Ensembl 61 release and the Ensembl Genomes 8 release – lots and lots of goodies in both these releases – web site tweaks (some of them totally critical for generating good displays), the new “favourite tracks” feature, and impressive content changes.

I’ll start today on content changes, and in Ensembl Genomes 8 there are some important genome additions. Some come from Paul Kersey’s new collaboration with PhytoPathDB – more on that in a later post – but top of my excitement has been the diversity in metazoa. The Ensembl Metazoa team has added Sea Urchin, Sea Anemone, the rather weird primitive animal Trichoplax adhaerens (also called the “carpet” organism) and the blood fluke, Schistosoma mansoni. The motivation for bringing in these organisms is to broaden our phylogenetic tree and the comparisons we can provide across all of life. So for example, for the Drosophila Twist gene one can now see the deep tree for this across metazoa. In particular, there is a deep ortholog in Trichoplax which seems to predate the split of some of these Helix Loop Helix proteins, whereas there are other members of the family which have a paralog in Trichoplax, meaning that there seems to be a fundamental split in this developmentally key transcription factor. This is just one of many interesting gene trees that one can look at using this resource…

Happy browsing/data mining!


Now that Ensembl Genomes has moved onto the 60 code base, all the goodies in Ensembl’s user data site are available across all 304 species in Ensembl Genomes – from tiny bacteria to crazy big plants.

One of these is the upload of BAM, Wig and Bed files in the location-based view. For example, suppose you have the following file specifying a SNP in Drosophila:

2L 356013 356013 C/T

By using the Variant Effect Predictor, accessible from Tools on the top of each page, or the ‘Manage your data’ link at the left of pages, one can get the effect (synonymous, non synonymous, in a UTR etc) for the variant.

(You click on “Variant Effect Predictor”, upload a text file of chr, position, allele and hit go.) This can also be run as a script using the API (the database is internet accessible, so you just need to have an internet connection, Perl, MySQL client libraries and the Ensembl code base installed).
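If you have more than a handful of variants, the upload file can be generated rather than typed by hand. Here is a minimal Python sketch that writes lines in the same whitespace-separated layout as the example above (chromosome, start, end, alleles). The function names `vep_line` and `vep_lines` and the tuple layout are illustrative assumptions of mine, not part of any Ensembl tool.

```python
def vep_line(chrom, pos, ref, alt):
    """Format one SNP as 'chrom start end ref/alt', as in the example above."""
    if pos < 1:
        raise ValueError("positions are expected to be 1-based")
    # For a single-base substitution, start and end are the same position.
    return f"{chrom} {pos} {pos} {ref}/{alt}"


def vep_lines(variants):
    """Format an iterable of (chrom, pos, ref, alt) tuples, one line each."""
    return [vep_line(*variant) for variant in variants]


if __name__ == "__main__":
    # Hypothetical input list; replace with your own variants.
    print("\n".join(vep_lines([("2L", 356013, "C", "T")])))
```

The output can be saved to a text file and uploaded through the form described above.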

In addition, the full set of visualisation tools for your own data is also now accessible for all Ensembl Genomes species. For example, this bedGraph file:

track type=bedGraph name="BedGraph Format" description="BedGraph format" priority=20
2L 21302000 21302300 -1.0
2L 21302300 21302600 -0.75
2L 21302600 21302900 -0.50
2L 21302900 21303200 -0.25
2L 21303200 21303500 0.0
2L 21303500 21303800 0.25
2L 21303800 21304100 0.50
2L 21304100 21304400 0.75

Will render a nice little variable height picture in the main contigview display. Other options are available (like Bed and Wig format) – many of which people will know from UCSC. Come, try it out, and give us feedback at helpdesk@ensembl.org
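A file like the one above is easy to produce from a list of scores. The following Python sketch builds a track line plus one row per fixed-width window; the function name `bedgraph_lines` and its parameters are hypothetical, and the track-line attributes are simply the ones used in the example above.

```python
def bedgraph_lines(chrom, start, width, values,
                   name="BedGraph Format", description="BedGraph format"):
    """Return bedGraph text lines for consecutive windows of equal width."""
    if width < 1:
        raise ValueError("window width must be positive")
    lines = [
        f'track type=bedGraph name="{name}" '
        f'description="{description}" priority=20'
    ]
    for i, value in enumerate(values):
        window_start = start + i * width
        # Each row is: chromosome, window start, window end, score.
        lines.append(f"{chrom} {window_start} {window_start + width} {value}")
    return lines


if __name__ == "__main__":
    scores = [-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75]
    print("\n".join(bedgraph_lines("2L", 21302000, 300, scores)))
```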

The Ensembl 60 release sees two changes in our data upload capabilities.

First off, Ensembl can now “attach” a BAM file. BAM is the compressed form of SAM (Sequence Alignment/Map) files, which have become the dominant way to package up next-generation sequencing data. A BAM (or SAM) file has both the sequence and the alignment of a set of reads in a compact form (BAM makes it even more compact). Critically, you can index a BAM file, allowing programs rapid access to particular “Slices” of the reads by genomic position. Alignment tools such as Maq, BWA and SOAP can produce BAM files; a variety of analysis tools are written around BAM files, and now Ensembl can view BAM files.

To make a BAM file viewable you need to have access to a website where you can put files (like your local web space, perhaps an institutional one). Call it MyGreatExperiment.bam. You then need to index the BAM file using one of the tools – samtools is the usual one to do this – making a MyGreatExperiment.bam.bai (BAM index) precisely alongside it (the Ensembl code is going to make the assumption that the index is called filename.bai). Then go to the “Manage your Data” button on any web page in Ensembl, and go to the “Attach BAM” section. And then browse your RNA-seq, ChIP-seq and exome data to your heart’s content!
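The naming convention is the part that most often trips people up, so here is a small Python sketch of it. `index_url` works out where the index is expected to sit for a given BAM URL, and `build_index` shells out to `samtools index`, which writes the `.bai` next to the BAM file. The function names and the example.org URL are placeholders of mine; the only assumptions about the tools are that samtools is installed and on your PATH.

```python
import subprocess
from urllib.parse import urlsplit, urlunsplit


def index_url(bam_url):
    """Return the URL where the index for bam_url is expected to live."""
    parts = urlsplit(bam_url)
    if not parts.path.endswith(".bam"):
        raise ValueError("expected a URL ending in .bam")
    # The index sits alongside the BAM file, with '.bai' appended.
    return urlunsplit(parts._replace(path=parts.path + ".bai"))


def build_index(bam_path):
    """Run 'samtools index' on a local BAM file and return the index path."""
    subprocess.run(["samtools", "index", bam_path], check=True)
    return bam_path + ".bai"


if __name__ == "__main__":
    # Placeholder URL; substitute the address of your own web space.
    print(index_url("http://example.org/data/MyGreatExperiment.bam"))
```

Both files then need to be copied to the web space, so that the two URLs resolve side by side.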

In addition, we’ve spruced up our functionality and documentation on the UCSC file formats of Bed, BedGraph and Wig. Take a look at the “File Upload” and “Attach URL” forms, and the documentation. Now we precisely indicate what attributes you can use in each of these formats. Our goal is to make Ensembl as useful as possible to as broad a set of users as possible, so let us know if you find something confusing and/or you have a Bed/Bedgraph/Wig file that works for UCSC but doesn’t work on Ensembl.
This is of course available across all 50 species in Ensembl, and in a couple of weeks, when Ensembl Genomes 7 is out, across another 50 eukaryotes from protists to plants and about 250 different bacteria.
Comments are welcome – either on this blog, or by email to our helpdesk.

Release 57 saw the release of a 5-way EPO alignment across the teleost fish – Zebrafish, Stickleback, Medaka, Tetraodon and Fugu. Just recently I’ve spent some time browsing through them. They are very interesting, with the ancestral duplication in fish showing more complex homology relationships than in mammals. Here’s a nice, clean example:

Simple Fish multiple alignments

Here is a far more complex region, where Enredo has clearly picked up the ancestral duplication but is struggling to make it colinear across the entire region:

Complex Fish multiple alignments

One thing which I don’t think most people appreciate is the incredible phylogenetic depth in the teleost lineage. In terms of “millions of years of evolution” or “sequence divergence”, the deepest splits in the teleosts – such as Zebrafish to Stickleback – are actually almost as deep as teleosts to mammals – certainly deeper than birds to mammals. So this is asking a lot to find good, clearly colinear stretches, in particular when you think of the draft nature of these genomes.

Chatting to Javier, it might be much better to also look at a 4-way EPO on the “Stickleback” side of the teleost lineage, in other words, Medaka/Stickleback/Fugu/Tetraodon. This I think will come together better (in fact most of the “nice” regions in the 5-way EPO are actually regions without Zebrafish) and we might be able to look at taking that ancestral chromosome ordering, and perhaps sequence, in comparisons to Zebrafish.


The gene tree images now have little intron “ticks” on them showing how the intron position is placed relative to the protein sequence. An example is shown above. Each tick is a little black line on each side of the green protein bars, on the right. As intron positions have been remarkably stable on the “chordate” side of the metazoan tree (i.e., the deuterostomes), one should expect that the introns line up – if they do, it is good evidence that the alignment is right.

There are some interesting things. Ensembl models small frameshifts to create open reading frames around erroneous data as tiny introns. In this code you cannot distinguish these two classes of introns, but as these errors normally come in patches, a run of intron ticks unique to a genome is probably a set of errors (an example is in Gorilla). I’ve enjoyed browsing around some of my favourite genes to check out that the introns make sense.
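To make the idea of ticks lining up concrete, here is a toy Python sketch. It is not the Ensembl drawing code: it assumes you already have, for each gene, the lengths of its coding exons and its gapped protein sequence from the alignment, and all function names are hypothetical. Each intron is placed at the residue where the next exon starts, that residue is mapped to an alignment column, and ticks found in only one genome are reported.

```python
def intron_residues(coding_exon_lengths):
    """0-based residue index at which each intron falls (one per exon boundary)."""
    residues, cds = [], 0
    for length in coding_exon_lengths[:-1]:
        cds += length
        residues.append(cds // 3)
    return residues


def alignment_column(aligned, residue):
    """Column of the residue-th non-gap character in a gapped sequence."""
    seen = -1
    for column, char in enumerate(aligned):
        if char != "-":
            seen += 1
            if seen == residue:
                return column
    raise IndexError("residue lies beyond the end of the sequence")


def tick_columns(aligned, coding_exon_lengths):
    """Alignment columns at which this gene would get an intron tick."""
    return {alignment_column(aligned, r)
            for r in intron_residues(coding_exon_lengths)}


def unique_ticks(ticks_by_species, species):
    """Ticks seen in one species and in none of the others."""
    others = set().union(
        *(ticks for name, ticks in ticks_by_species.items() if name != species)
    )
    return ticks_by_species[species] - others
```

A run of closely spaced columns coming back from `unique_ticks` for a single genome is the pattern described above as a probable patch of errors.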

There is some more to go here. The fact that the intron ticks disappear on collapsed nodes is a bit frustrating – it would be nice to see “consensus” intron positions (though this is a bit complex to execute underneath).

For about 3 weeks now researchers in the US have had their default Ensembl go to our US west mirror (uswest.ensembl.org) automatically – if you just go to www.ensembl.org your browser gets automatically redirected. The US joined Japan and Canada, which we switched in late 2009.
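In outline, the redirect is a lookup from the visitor’s country to a site. The Python sketch below shows only that decision; how the country is derived from an IP address is assumed to happen elsewhere, the two-letter codes and the function name `mirror_for` are my own choices for illustration, and this is not a description of how the Ensembl servers actually implement it.

```python
# Countries currently sent to the US west mirror, as described above.
USWEST_COUNTRIES = {"US", "JP", "CA"}

MAIN_SITE = "www.ensembl.org"
USWEST_SITE = "uswest.ensembl.org"


def mirror_for(country_code):
    """Pick the default site for a visitor from the given country."""
    if country_code.upper() in USWEST_COUNTRIES:
        return USWEST_SITE
    return MAIN_SITE


if __name__ == "__main__":
    for code in ("US", "GB", "JP"):
        print(code, "->", mirror_for(code))
```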

From our perspective, this is all working fine; the usage of uswest has gone up, and IP tracking on that shows far more US IP addresses (so the redirect is working fine!). We get some 3,000 odd visits a day, with some 50,000 pages delivered from our uswest site – about 20% of our total hits. Brilliant.

What is slightly more surprising is that we’re not getting any queries on this. Given the usage, we think this means the default browser, content and other functionality must be working well (or Americans are very shy about complaining… but that doesn’t sound like a good description of all Americans…). But we’d also like to hear from American users – have you noticed Ensembl “go faster”? Are there any glitches?

Another issue we are unclear about is whether we should automatically shift other users on the Pacific Rim to uswest – in terms of usage, the biggest country would be Australia, but New Zealand, the Philippines and other Pacific Rim countries would also be candidates. It’s quite hard for us to assess whether our Europe (Cambridge UK) based servers or US west servers are best for this – both latency and throughput change on different routes, and the time zone shift makes things complex to assess systematically in an easy way.

So – feedback welcome – either on this post or by email to our helpdesk about what your experience is, either from the US or from the Pacific Rim.

When we changed our look and feel almost a year ago, we “left behind” our two main graphical genome-wide comparative genomics displays (our textual comparative genomics displays remain, as did some of the gene centric ones). These were some of the most complex displays, not only in the graphics layout but also in aspects such as configuration – with comparative genomics tracks with up to 30 species, potentially one has the union of all tracks in each species, and doing this consistently required reworking how we thought about the “same” or “different” tracks across species.

It’s taken longer than we thought it would, but finally in release 56 these displays are back and better than ever. With more aggressive caching of data items as they head to the web (and, in addition, if you are on the west coast of the US or the Pacific Rim, check out the US west mirror at uswest.ensembl.org) they go far faster, making them far more useable.

We have two fundamentally different ways of thinking about genomic alignments.

In “Multi Sequence View”, which works fundamentally as a set of pairwise alignments, we maintain the linear sequence of each genome, and then draw regions which are conserved between them. Check out displays like:

Mouse/Human

And make sure you hit “Configure Page” and in the Comparative Genomics section, switch on blastz. I also like to have genes in “Collapsed, labels” (so alternative splicing doesn’t produce excessive displays) and also switch on Regulatory Features.

Now – you get a nice picture of this region in human and mouse. The orthologous gene (PECI) has conserved exons, and the regulatory feature at the start of this gene is conserved in human and mouse and in both cases classified as a promoter. All as expected.

But a closer look shows that the transcript going by the catchy name of AC123437.5 in mouse, going on the opposite strand, has some of its exons overlapping the human PECI, and human PECI is duplicated into two local genes here. This is perhaps easier to see as one zooms out in this display (notice you can drag-and-select in the upper panels, or use the + and – bars to change in the lower panels).

Zoom Out

In contrast, the alignment (Image) view asks you to choose one species as the co-linear reference, and then the other species are organised specifically by their alignment to that reference. This is ideal in more linear, orthologous regions. I like using the 10-way EPO alignment for visualisation/gene model comparison, although to do things like conservation analysis, you want to use the 31-way mammalian alignment with the low coverage data.

This is a gene that is well conserved across mammals.

Co Linear

We can look at precisely the same alignment from the perspective of Mouse, Rat, Dog, Horse, Human or Pig. In each case, the alignment is unbiased to each species. For example, the Mouse-Rat portion of this multiple alignment still aligns the unique rodent portions.

Here is that same region from the perspective of Cow:

Cow

Notice when you go to human you have a choice of not only 4 different multiple alignments – a 4-way primate alignment, a 10-way mammalian alignment, a 12-way alignment including chicken and a 31-way mammalian alignment – but also 40 odd other individual pairwise alignments.

In each case, you can get the alignment out as text – here’s a 4-way primate alignment:

Text alignment

or the same region in all its 31-way glory:

31-way text alignment

Of course, all this information is also available to download or access through our Perl API. A particularly interesting thing in these alignments is the ability to switch on the ancestral sequence as well (go to the configuration panel).

More on the use and power of comparative genomics later I hope, but for the moment, do enjoy these displays being back, and do both browse around and download/script against them.

Ewan

This year we’ve invested in our own mirror – maintained by us – on the west coast of the US. This was mainly because assessing the web return time for our users showed a consistent additional 3 to 4 seconds if you were lucky enough to live out on the west coast (worse still if you are in Australia!). Although we did a lot last year to improve the general response time of our web pages (for example, compressing our CSS and Javascript down to single files for the whole site, so these are only loaded once and then cached locally), the Ensembl site delivers a lot of dynamic content – and nothing but getting closer to the users can help this.

You can reach the site directly at uswest.ensembl.org or alternatively there is a little “world” icon on the top right of the page which switches to the stars-and-stripes when you’re on the west coast. Having the mirror not only helps our users who are on the west coast but also provides resilience when our main site goes down. As we’re responsible for provisioning it in sync with our main site (it’s part of our release process) this mirror will stay current with the main site.

In some sense the mirror should be low cost “per user” for us – if users go to the mirror, it means less load on the main site, and so it’s really a question of how we distribute the “web farm” that sits behind Ensembl geographically. However, there are overheads, from hiring rack space in the US to making our own release cycle more complex. This means we will need to assess whether running a US mirror makes sense in the long term. Our instinct is yes, but we need hard data on this.

These things need time to pick up, but already we’d be interested in feedback on this – for US users, is this site faster for you – in particular for East coast people who we think are probably still best off on the main site. Does it change with time of day? For Pacific rim users – Japan, Singapore, Korea, Australia – is the west coast site snappier for you? We’ll be putting in place our own monitoring schemes, but user feedback is always good…

Release 55 has lots of goodies – not least the new, coordinated, GRCh37 assembly (more on that later), but one addition is the Martability of Ensembl Regulatory Features. Regulatory features are on by default on Human and Mouse, and each gene has a specific page for the regulatory features (for example http://www.ensembl.org/Homo_sapiens/Gene/Regulation?g=ENSG00000139618). Regulatory Features are developing fast, and the Martability is bringing out the richer information in the functional genomics database – for example, the classification of features into “promoter”, “gene associated” and “unclassified”. Next release we’re hoping to release a more graphical view for each feature, but the presence of the regulatory features in Mart allows the large scale users – from Perl, Java, R or just plain old tab-delimited text – to use them.

We’re expecting a lot of development in this area – the addition of Mouse DNaseI sites has allowed us to develop a Mouse build, and of course, the ENCODE project, which is now on line in production mode, will provide a far richer, deeper dataset to work against.

So – watch this space.

Steve posted the news that we’re delaying our new release for at least two more weeks. The message is pasted in here:

Hi all

In our Intentions Summary mail for release 51 we stated that the release was scheduled for early/mid September. The 51 release will include significant updates and improvements to the web interface. We are delaying release while we complete development on these. We are working to get the release out as soon as possible, and are now aiming for end September/early October. I apologise for this delay.

Steve

 

Dr Steve Searle
Ensembl Project Leader, Sanger

It is always so frustrating to delay, but of course, far more important to have a working site than something only part working. Welcome to delivering high end services.

We took on a lot of things to change in this web refresh. For most users the main thing people will notice is the entirely new web layout. This was driven by our surveys of users, who mainly complained about being buried in too many displays and data. We then took around 4 months working with user groups and trialling different layouts (many thanks to those who participated), which in some cases made significant changes to our original designs (we now have a hybrid “tab and left-hand-side” approach, voted as best by ~60% of people, with the other three options splitting the rest of the vote). We’re very excited about this new layout going live as it just looks cleaner and less cluttered, and yet provides more information. The other thing people will notice is that it is just faster. As the saying goes, you can’t be too rich, too thin or have your websites go too fast.

Making a website go faster is harder than it might look. It involves all sorts of things – the bandwidth from your machines to us, the speed of the servers, the connectivity of servers to databases, the speed of the API, the database to disk, the management of the huge number of simultaneous users we have, and then the size of the HTML returned and finally the render speed on your browser. All of these contribute to the overall perception of “speed”. Under the hood we’ve been working on all these aspects – internally a big change is that we have switched away from needing a common file system for our web farm to work off. Previously, when your browser asked for a contigview page, our servers generated HTML with an image and that image was written to the common disk; the browser parsed the image tag and asked for this image – and this is the critical bit – sending a request which in all likelihood was served by a different server in our web farm. That server then went to the common file system to pick up the file and send it back. Many times a critical bottleneck has been read/write on this shared file system. In the new system this has all gone, and the images are stored in a memory-based common store, meaning both that we remove this bottleneck (which will be the first big effect) and that we will be able to cache a lot more – the hope is that many of the identical pictures for the common species will be entirely served from memory in the new system. Another important change has been aggressively slimming our HTML. Currently all sorts of files – often very small – are pinged on each page load, just to see if they have changed. We’ve consolidated a lot of these files – and compressed them – and then also optimised them for render speed.
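The idea behind the memory-based store can be sketched in a few lines of Python. This is a single-process stand-in written for illustration only: in a real web farm the store would be a shared, networked service rather than a dictionary inside one server, and the class name `ImageCache`, the key layout and the size limit are all assumptions of mine rather than a description of the Ensembl code.

```python
import hashlib
from collections import OrderedDict


class ImageCache:
    """Keep rendered images in memory, evicting the least recently used."""

    def __init__(self, max_items=1000):
        self._items = OrderedDict()
        self._max_items = max_items

    @staticmethod
    def key(species, region, tracks):
        """Identical requests map to the same key, whatever the track order."""
        raw = "|".join([species, region, ",".join(sorted(tracks))])
        return hashlib.sha1(raw.encode("utf-8")).hexdigest()

    def get_or_render(self, species, region, tracks, render):
        """Return the cached image, calling render() only on a miss."""
        k = self.key(species, region, tracks)
        if k in self._items:
            self._items.move_to_end(k)  # mark as recently used
            return self._items[k]
        image = render(species, region, tracks)
        self._items[k] = image
        if len(self._items) > self._max_items:
            self._items.popitem(last=False)  # drop the oldest entry
        return image
```

Because the key depends only on what is being drawn, any server in the farm that can reach the store can answer the follow-up image request, and popular views are rendered once rather than once per visitor.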

There is a variety of things not for this release but coming up end of 2008/early 2009, also on speed. Our API has a new concept, collections, which better handles the case of zoomed out views, where we know the renderers will not be able to render every object. Instead a collection – which may be rendered as a union or density or something – will be provided. The other thing on the horizon is us setting up a US mirror on the west coast. For the last year we have been extensively monitoring the speed of Ensembl from different sites, and there is a large increase in time to retrieve on the north-west coast of the US. We’ve been investigating quite why this is (and learning lots more about the backbone of the internet than we knew before) but it seems as if the simplest way of getting speed to work on the west coast is to just run a mirror over there. Probably 2009 for that to go live.
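As a rough picture of what a “density” rendering of a collection could look like, here is a Python sketch that bins features across a region and counts how many overlap each bin. It is not the API’s collection code; the function name `density`, the inclusive (start, end) feature tuples and the equal-width binning are assumptions made for the example.

```python
def density(features, region_start, region_end, bins):
    """Count the features overlapping each of `bins` equal-width windows.

    features is an iterable of (start, end) pairs with inclusive coordinates.
    """
    if bins < 1 or region_end < region_start:
        raise ValueError("need at least one bin and a non-empty region")
    width = (region_end - region_start + 1) / bins
    counts = [0] * bins
    for start, end in features:
        if end < region_start or start > region_end:
            continue  # feature lies entirely outside the region
        # Clip the feature to the region, then find the bins it touches.
        first = int((max(start, region_start) - region_start) / width)
        last = int((min(end, region_end) - region_start) / width)
        for b in range(first, min(last, bins - 1) + 1):
            counts[b] += 1
    return counts
```

A renderer then only has to draw one bar per bin, however many thousands of features sit underneath the view.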

Back to the website. It looks so much better – and has much better hardware characteristics – (our shared file system is … well … rather 2004 technology and needs pretty constant care at the moment) that I can’t wait until it comes out. But there is absolutely no point in having a site crippled in functionality even though we’ve got many of the user interface and technical issues right. The sticking point at the moment is the configuration panel. This comes up as a “modal” box on top of the page, allowing a lot of options to choose from, but not a bewildering set of options on each page. To cope with the 200 odd different tracks to switch on and off, the box has to have tabs and friendly, browseable hierarchies. To get all this to work in a nice, friendly, slick way… that’s a lot of Javascript.

And a lot of Javascript is a lot of browser compatibility headaches. Even using JS libraries – Prototype and script.aculo.us (I think – James Smith can tell you the details!) – there are all sorts of details that might not work just-quite the same way on IE5 compared to IE6. Or Firefox. Or Safari. And it must degrade at least functionally without JS. And of course work, and render fast. This modal box is the last, complex thing to get sorted.

We’re close. I’ve seen the box come up over James’ screen. I hear Steve has seen it come up, seen tracks change, and seen the link between tracks and changes. The API for the configuration system was gutted and is much better. But it’s got to work on all main browsers. For all our genomes, in particular Human and Mouse. And this is just tricky, fiddly work.

We’re not quite there yet. We’re really close, and so much is working it is just excruciating. But we need another couple of weeks. James is being shielded from other jobs by Steve and others; Eugene is torture testing memcachedb to stress test the system before it goes live; Xose, Bert and Giulietta are writing help; Beth and Anne are writing the additional pagelets inside of the new geneview and transcriptviews. And it all looks really good.

So – apologies – we thought we’d be launching in July. We thought we’d be launching in September. We still might just do that, but then again, it might well be October. If it goes any later I will have no hair.

But it does look really good.

It is definitely worth the wait. Like Guinness.

Ewan