This year we’ve invested in our own mirror, maintained by us, on the west coast of the US. This was mainly because measuring web response times for our users showed a consistent additional 3 to 4 seconds if you were lucky enough to live out on the west coast (worse still if you are in Australia!). Although we did a lot last year to improve the general response time of our web pages (for example, compressing our CSS and JavaScript down to single files for the whole site, so these are only loaded once and then cached locally), the Ensembl site delivers a lot of dynamic content, and nothing but getting closer to the users can help this.

You can reach the site directly at uswest.ensembl.org, or alternatively there is a little “world” icon on the top right of the page which switches to the stars-and-stripes when you’re on the west coast mirror. Having the mirror not only helps our users who are on the west coast but also provides resilience when our main site goes down. As we’re responsible for provisioning it in sync with our main site (it’s part of our release process), this mirror will stay current with the main site.

In some sense the mirror should have a low cost “per user” for us: if users go to the mirror, it means less load on the main site, so it’s really a question of how we distribute the “web farm” that sits behind Ensembl geographically. However, there are overheads, from hiring rack space in the US to making our own release cycle more complex. This means we will need to assess whether running a US mirror makes sense in the long term. Our instinct is yes, but we need hard data on this.

These things need time to pick up, but we’d already be interested in feedback on this. For US users, is this site faster for you? We are particularly interested in hearing from East coast people, who we think are probably still best off on the main site. Does it change with time of day? For Pacific rim users (Japan, Singapore, Korea, Australia), is the west coast site snappier for you? We’ll be putting in place our own monitoring schemes, but user feedback is always good…
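If you would like to send numbers rather than impressions, a rough comparison can be scripted. The following is only a minimal sketch, not our monitoring scheme: it assumes both sites answer plain HTTP requests on their front pages, and the function names and the choice of five repeats are made up for illustration.

```python
import time
import urllib.request

# Front pages of the main site and the west coast mirror (hosts as named in this post).
SITES = {
    "main": "http://www.ensembl.org/",
    "uswest": "http://uswest.ensembl.org/",
}


def time_request(url, opener=urllib.request.urlopen):
    """Return the seconds taken to fetch and fully read one URL."""
    start = time.perf_counter()
    with opener(url) as response:
        response.read()
    return time.perf_counter() - start


def median_times(sites, repeats=5, opener=urllib.request.urlopen):
    """Fetch each site `repeats` times and return the median time per site."""
    results = {}
    for name, url in sites.items():
        samples = sorted(time_request(url, opener) for _ in range(repeats))
        results[name] = samples[len(samples) // 2]
    return results
```

Calling `median_times(SITES)` at a few different times of day gives one number per site to compare. It only times the download of a single page's HTML, so it will understate what a browser does when loading images and dynamic content.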

Ensembl is pleased to announce the release of its West Coast US mirror (uswest.ensembl.org). This is a full mirror of the current Ensembl 54 release. We are providing this mirror to improve performance for users in the US, particularly on the West coast. It includes full search, BioMart and BLAST support (BLAST searching is actually run at Sanger with results passed back to the mirror).

This mirror is managed directly by the Ensembl web team, and we will aim to update it along with the main site, to keep it current. Credit for getting this mirror up goes to James Smith and Eugene Bragin from the web team, with support from the Sanger systems team, particularly Peter Clapham, John Nicholson and Dave Holland.

Future plans: We will improve the mirror in the near future by allowing users to switch between the main and mirror site. Currently, we do not suggest logging in to the mirror, because all user data must be retrieved from the main site at the Wellcome Trust Genome Campus. Speed is best if login is not used; this will be improved in the future.

Many thanks,
The Ensembl Team

Following a recent thread in our ensembl-dev mailing list, we can point our users to a recent post in the Gramene blog (a resource for grass genomes maintained at CSHL). This framework extends Ensembl with a data resource to browse several plant species: maize (Zea mays), rice (Oryza glaberrima and Oryza rufipogon), sorghum (Sorghum bicolor), the model organism Arabidopsis thaliana, grape (Vitis vinifera), and poplar (Populus trichocarpa); with comparative maps for additional species such as wheat (Triticum aestivum), barley (Hordeum vulgare) and oat (Avena sativa).

You can find some sample scripts to load an Ensembl species database from scratch, here.

Thanks to our colleagues at Gramene.

We hope you like the new Ensembl website. We have had quite a lot of feedback about the system, and are digesting this to see how and where we can make the site easier to use.

Missing features

We know there are a number of features which were in the web code prior to the revamped version 51 and are missing from the new site; we are working on restoring these.

Views:

  • AlignSliceView [target e!53]
  • MultiContigView [target e!54]
  • CytoDump [will be released in e!53 as part of the export module]
  • DotterView
  • HistoryView – "ID liftover" [target e!53/4]
  • AssemblyConverter – "location liftover" [target e!53/4]

Components:

  • Drawing code tracks, e.g. rat QTLs, protein co-ordinate based DAS tracks [target e!53]
  • User gene annotations [target e!54]

New developments

We have a number of new "web" developments in the pipeline – some of these are listed below:

  • Extended configuration panel – searching for tracks, show currently active etc [target e!53]
  • Extended configuration panel – re-ordering tracks etc [target e!53]
  • Extended configuration panel – further configuration options – colour, depth, more display options, label options [target e!54/5]
  • New BLAST/BLAT interface [target e!55/6]
  • Re-write of the vertical drawing code to allow high quality PDF/PS/SVG karyotype and chromosome images to be produced.
  • Further work on export – finer configuration of what to export, exporting in multi-regions, integration with "user data"


If you have clicked on the GeneTree link in Ensembl (for example, the gene tree for IL2), you may have noticed that we have a new way of displaying large GeneTrees. This time, if you have a large gene family with lots of genes that you want to look at, you won’t need to ask the Miami Dolphins to let you plug your laptop into their huge screen…


This new feature in EnsemblCompara is called collapsible subtrees and allows for more compact, summarized views of interesting gene families like PAX2/PAX5/PAX8:

http://www.ensembl.org/Homo_sapiens/Gene/Compara_Tree?g=ENSG00000075891

If you check the legend at the bottom, you will see that “blue triangles” correspond to collapsed subtrees that have within-species paralogs of your gene. If you want to see all the within-species paralogs expanded, you can click on the option “View paralogs of current gene”. You can even set that as a default if you want in the “Configure this page” options.

Jalview is a great way to view protein alignments in the tree. And where is my Jalview link now? Click on any internal node (square) in the tree, and you will be able to visualize the alignment (or subalignment) with the new Jalview applet by clicking on the Jalview link. You have to have Java installed though, or the link won’t show. Two Jalview windows pop up: one with the protein alignment and the other with the underlying TreeBeST tree. You can now use Jalview’s sorting feature to sort your sequences according to the tree with: Calculate->Sort->By Tree Order->URL. Having the tree associated with the alignment allows for a more phylo-centric visualization of sequence conservation: if you click at a point in the tree, a red vertical line will appear that divides the alignment into different groups. If you choose Colour->Percentage Identity, the shades of blue will be relative to the subgroups in your tree (e.g., fish versus placental mammals). This is also useful to spot segments in the alignment that don’t look that good, or gaps created in a subpart that can now be collapsed in the subalignment (Edit->Remove Empty Columns), or sequences that stand out as long branches in the alignment (View->Overview Window).


For even more tree funkiness, you can use PhyloWidget to visualize our NHX trees. Use our NHX tree (“Configure this page->Output for normal tree->NHX->Save and Close->Gene Tree(text)”) to copy and paste the representation of the GeneTree into PhyloWidget, with duplication/speciation events (red/blue), bootstrap values (greyscale) and taxonomy levels (“View->Rendering->Show clade labels”). Then try the “Zoom in/Zoom out” features or, after clicking on an internal node, “Tree Edit->collapse”, and especially the “View->Branch lengths [x]” and “View->Layout->Options->Branch Scaling” options.
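The NHX text is also easy to pick apart in a script, since the extra information sits in [&&NHX:...] comments attached to each node. Below is a minimal sketch that pulls out those tags and counts duplication and speciation nodes. It assumes the usual NHX tag names (D=Y/N for duplication, B for bootstrap, S for species), which you should check against your own export; the toy tree and function names are made up for illustration and are not real Ensembl output.

```python
import re

# Matches one NHX comment such as [&&NHX:D=Y:B=90] and captures its contents.
NHX_TAG = re.compile(r"\[&&NHX:([^\]]*)\]")


def nhx_annotations(nhx_text):
    """Return one dict of key/value tags per [&&NHX:...] comment, in order."""
    annotations = []
    for match in NHX_TAG.finditer(nhx_text):
        items = [item for item in match.group(1).split(":") if "=" in item]
        annotations.append(dict(item.split("=", 1) for item in items))
    return annotations


def count_events(nhx_text):
    """Count nodes tagged as duplications (D=Y) and speciations (D=N)."""
    tags = nhx_annotations(nhx_text)
    duplications = sum(1 for tag in tags if tag.get("D") == "Y")
    speciations = sum(1 for tag in tags if tag.get("D") == "N")
    return duplications, speciations


# Toy tree for illustration only: two human paralogs and one mouse gene.
EXAMPLE = (
    "((geneA:0.1[&&NHX:S=Homo_sapiens],geneB:0.2[&&NHX:S=Homo_sapiens])"
    ":0.05[&&NHX:D=Y:B=90],geneC:0.3[&&NHX:S=Mus_musculus])[&&NHX:D=N:B=100];"
)
```

Here `count_events(EXAMPLE)` reports one duplication node and one speciation node. This is a regular-expression shortcut and does not rebuild the tree structure; for anything beyond counting tags, a proper tree parser is the better tool.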


We hope these new features will help you in your research. We have some new ideas that we are currently testing to visualize even more phylogenetic information, and to help you make better judgements on the orthology and paralogy relationships in our EnsemblCompara GeneTrees. Stay tuned for more updates!

We’re happy to announce that Ensembl is one of the launch partners for Amazon’s “Public Data Sets” initiative, so the MySQL data and index files for the current release of Ensembl can be accessed from within Amazon’s Elastic Compute Cloud (EC2) service. From the Amazon website:

AWS Hosted Public Data Sets provide a convenient way to share, access, and use public domain or non-proprietary data within your Amazon EC2 environment. Select public data sets are hosted on AWS for free as an Amazon EBS snapshot. Any Amazon EC2 customer can access this data by creating their own personal Amazon EBS volume from a publicly shared Amazon EBS public data set snapshot. They can then access, modify, and perform computation on these data sets directly using an Amazon EC2 instance and just pay for the compute and storage resources that they use.

Details of how to access the data can be found at http://aws.amazon.com/publicdatasets .

We have plans to make much more use of AWS in the future. Stay tuned!

Due to the changes in the web interface there have been a number of changes to the URLs for pages. In most cases the web code catches these changes, but there are a number of requests which, due to the nature of the site, have changed:

  • Configuring the way a page is rendered;
  • Changing the way tracks are rendered;
  • Adding DAS sources via a web-address and not via the web interface;
  • Attaching UCSC-style external resources.

These are now all handled in a similar, systematic way:

  • To change global page settings: add a parameter config=key=value{,key=val}
    e.g. to turn off the top image on Location > Region in detail

    http://www.ensembl.org/Homo_sapiens/Location/View?r=1:1000-2000;config=view_top=off

    e.g. to link directly to the Exon Intron markup panel (Transcript > Exons) and to show full introns and only 60bp flanking sequence AND turn the display to be 60bp wide

    http://www.ensembl.org/Homo_sapiens/Transcript/Exons?t=ENST00000309255;config=flanking=60,seq_cols=60,fullseq=yes

  • To change configuration for an individual panel, add a parameter referring to the panel (this will be documented shortly on the website), e.g. for Location > Region in detail the two panels are contigviewtop and contigviewbottom, and for Location > Region overview it is cytoview. The value is again a comma separated list, where the left hand side of each “=” is the name of the track, and the right hand side is the name of the “renderer” to use; the latter depends on the type of track. Additionally, the left hand side can be used to integrate external data.
    Notes:
    • Track names are now systematically named so will have changed from the values you may have been used to using – again we will shortly publish a list of these, but examples are: transcript_core_ensembl – the ensembl genes from the ensembl database.
    • Renderers depend on the type of track, but e.g. for transcripts you have the option of “transcript_label”, “transcript_nolabel”, “collapsed_label” and “collapsed_nolabel”, for alignment features (and also url attached data at the moment) “normal”, “half_height”, “stack”, “unlimited” and “ungrouped”, for DAS tracks “labels” (show labels if configured by the source) or “nolabels” – hide labels.
    • At the moment two special parameters can be used:
      das:http://www.mydas.source/das/my_data=render
      – which attaches a DAS source to the session and selects the renderer
      url:http://www.myweb.server/my_data.format=render

    For example:

    http://www.ensembl.org/Homo_sapiens/Location/View?g=ENSG00000012048;config=panel_top=off;contigviewbottom=das:http://www.ensembl.org/das/Homo_sapiens.NCBI36.transcript=nolabels,transcript_core_ensembl=collapsed_nolabel

    Turns on a DAS source (in this case the Ensembl transcripts), collapses the standard Ensembl track down to a single line per gene, and also turns off the top panel!
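If you generate these links from a script, the scheme above is simple to assemble. The following is a minimal sketch, with function and argument names made up for illustration: parameters are joined with “;” and the settings inside config or a panel parameter with “,”. It leaves values unencoded, as in the examples above; whether some characters need escaping for your data is not covered here.

```python
def settings(pairs):
    """Join (key, value) pairs into the comma separated key=value form."""
    return ",".join(f"{key}={value}" for key, value in pairs)


def ensembl_url(species, page, params, config=None, panels=None,
                base="http://www.ensembl.org"):
    """Build a page URL with optional global config and per-panel settings.

    params  -- list of (name, value) pairs such as [("r", "1:1000-2000")]
    config  -- list of (key, value) pairs for the global config parameter
    panels  -- dict mapping a panel name to a list of (track, renderer) pairs
    """
    parts = [f"{name}={value}" for name, value in params]
    if config:
        parts.append("config=" + settings(config))
    for panel, tracks in (panels or {}).items():
        parts.append(f"{panel}=" + settings(tracks))
    return f"{base}/{species}/{page}?" + ";".join(parts)


# Rebuilds the last example above: DAS source on, Ensembl track collapsed, top panel off.
example = ensembl_url(
    "Homo_sapiens",
    "Location/View",
    [("g", "ENSG00000012048")],
    config=[("panel_top", "off")],
    panels={
        "contigviewbottom": [
            ("das:http://www.ensembl.org/das/Homo_sapiens.NCBI36.transcript", "nolabels"),
            ("transcript_core_ensembl", "collapsed_nolabel"),
        ]
    },
)
```

Keeping the settings as lists of pairs rather than dictionaries preserves their order and allows the same key to be repeated, should a panel ever need that.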

Do you know a bit of Perl? Ensembl provides an API (Application Programming Interface) which uses Object-Oriented Perl to extract data from Ensembl databases. This API is public and can be used by anyone to access the data in the Ensembl databases programmatically. We understand that not everyone is used to Object-Oriented code, although people may have basic Perl skills and be interested in using our datasets. For that kind of bioinformaticist, I would recommend a recent short read in O’Reilly’s Broadcast:

Beginners Introduction to Object-Oriented Programming with Perl – O’Reilly Broadcast

And for the more advanced readers, the classic reference book in OO-Perl would be Damian Conway’s Object Oriented Perl, which apart from being very informative, has a really cool cover 🙂

We are always trying to lower the barrier to entry for research communities interested in using the Ensembl database in programmatic ways that make use of all the complexity associated with the generation of our data. That’s why our API is public and well-documented. You can learn about our API by attending one of our API workshops for free (e.g.: 1-3 December – Univ. Cambridge, UK). We are currently trying to smooth things out even more, working on ways to make it even easier to download all that’s needed to use the API and have the example scripts running on your computer with the minimum number of steps. Stay tuned for news on this soon…

Ensembl has begun to incorporate data from genome-wide association studies. These data are being added in coordination with the European Genotype Archive, a new database resource at the EBI designed to provide a permanent archive for human variation data that is not available for unlimited public release because of ethical or individual privacy restrictions. The European Genotype Archive has recently launched with the raw data from the Wellcome Trust Case Control Consortium (WTCCC. 2007. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661-678). In the future the EGA will provide additional array-based genotype data as well as data from re-sequencing and CNV studies. The EGA will also contain phenotype data.

Ensembl is incorporating summary data from genome-wide association studies represented in the EGA. The data generally represent the p-value for the association of each tested SNP (Single Nucleotide Polymorphism) with the given phenotype.

The WTCCC summary data are now available on Ensembl as DAS tracks selectable from the “DAS Sources” menu on the CytoView and ContigView pages. The following menu items provide access to data from bipolar disorder (BD), coronary artery disease (CAD), Crohn’s disease (CD), hypertension (HT), type 1 diabetes (T1D) and type 2 diabetes (T2D):

WTCCC BD
WTCCC CAD
WTCCC CD
WTCCC HT
WTCCC T1D
WTCCC T2D

In future releases, GWAS data will be integrated into the Ensembl variation databases.

We will be adding additional data to both Ensembl and the European Genotype Archive as the data become available. We hope you find these new data resources useful.

I’m in the airline lounge about to head back from “Biology of Genomes” at Cold Spring Harbor Laboratory. As always, it was a great meeting. A highlight for me was seeing the 1,000 genomes data starting to flow (it is clear that the shift in technology is going to change the way we think about population genomics), and the best session was the one on “non-traditional models” (dogs, horses and cows), where the ability to do cost-effective genotyping has completely revolutionised the field. Now the peculiarities of the breeding structures, with dog breeds being selected for diverse phenotypes, cows with the elite bulls siring thousands of offspring due to artificial insemination, and horses having obsessive trait fixation over the last 1,000 years, can really bring power to genetics in different ways. Expect a lot more knowledge to come from these organisms and others (chickens, pigs, sheep…) over the coming years.

For my own group, Daniel Zerbino talked about Velvet, our new short read assembler which has also just been published in Genome Research (link). Velvet is now robust and capable of assembling “lower” eukaryotic genomes, certainly up to 300 Mb, from short reads in read pair format. It is also being extensively used by other groups, often for partial, miniature de novo assemblies in regions. It went down well, and Daniel handled some pretty tricky questions in the Q&A afterwards. Next up: we get access to a machine with 1.5TB of real memory, and put a whole human genome WGS into memory. Alison (Meynert) and Michael (Hoffman) had great posters on cis-regulation and looked completely exhausted at the end of their poster session.

From Ensembl, Javier talked about the Enredo-Pecan-Ortheus pipeline (which we often nickname EPO). As some said to us afterwards, “you’ve really solved the problem, haven’t you”: Javier was able to show clear evidence that each component was working well, better than competing methods, and having an impact on real biological problems, for example, derived allele frequency. Its ability to handle duplications is a key innovation. Javier and Kathryn are currently wrestling the “final” 2x genomes into this framework, from which point we will start to have a truly comprehensive grasp of mammalian DNA alignments. I also like it because Enredo is another “de Bruijn graph”-like mechanism. Currently the joke is that about 10 minutes into any conversation I say “well, the right way to solve this problem is to put the DNA sequence into a de Bruijn graph”.

Going to CSHL Biology of Genomes is always a little wince-making though, as this field, high-end genomics, really prefers to use the UCSC Genome Browser (which, as I’ve written about before, is a good browser, and I take its use to be a challenge for us to make better interfaces for these users on our side). My informal count of screenshots was > 20 UCSC, 4 Ensembl (sneaking in one case of ‘Ensembl genes’ shown in the UCSC browser as a point for each side) and 0 NCBI. Well. It just shows the task ahead of us. e50! – our user interface relaunch – is coming together, and we will start focus-group testing soon; it is time for us to address our failings head on. I’ll be blogging more about this as we start to head towards broader testing.

Lots more to write about potentially – Neanderthals, Francis Collins singing in the NHGRI band (quite an experience), reduced representation libraries with Elliott, genome wide association studies (of which, I just _love_ the basic phenotype measures, from groups like Manolis Dermitzakis) and structural variation… but for the moment I’ve got to persuade my body to feel as if it is 11.30 at night and see if I can get a good night’s sleep on the plane.