Last week I was a co-organiser of a Newton Institute workshop on high dimensional statistics in biology. It was a great meeting and there were lots of interesting discussions, in particular on chip-seq methods and protein-DNA binding array work. I also finally heard Peter Bickel talk about the “Genome Structure Correction” method (GSC), something which he developed for ENCODE statistics, which I now, finally, understand. It is a really important advance in the way we think about statistics on the genome.

The headache for genome analysis is that we know for sure that it is a heterogeneous place – lots of things vary, from gene density to GC content to … nearly anything you name. This means that naive parametric statistical measures, for example assuming everything is Poisson, will completely overestimate the significance. In contrast, naive randomisation experiments, which try to build some empirical null distribution for the genome, can easily lead to over-dispersed null distributions, i.e. they end up underestimating the significance (given a choice it is always better to underestimate).

What’s nice is that Peter has come up with a sampling method to give you the “right” empirical null distribution. This involves a segmented-block-bootstrap method where in effect you create “feasible” miniature genome samples by sampling the existing data. As well as being intuitively correct, Peter can show it is actually correct given only two assumptions: first, that the genome’s heterogeneity is block-y at a suitably larger scale than the items being measured, and second, that the genome has independence of structure once one samples from far enough away, a sort of mixing property. Finally, Peter appeals to the same ergodic theory used in physics to convert a sampling over space into a sampling over time; in other words, by sampling the heterogeneity of the single genome we have, one produces a set of samples of “potential genomes” that evolution could have created. All of these are justifiable, and certainly this is far fewer assumptions than other statistics make.

Using this method, empirical distributions (which in some cases can be safely assumed to be Gaussian, so that far fewer points are needed to get the estimate) can be generated, and test statistics built off these distributions. (Peter prefers confidence limits of a null distribution.)
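To make the block-resampling idea concrete, here is a rough sketch in Python with NumPy. It is not Peter's GSC implementation: it skips the segmentation step entirely and uses a plain circular block bootstrap. The function names, the overlap statistic and the block length are all assumptions chosen for illustration. Two annotations are represented as 0/1 arrays over the same stretch of genome, and the bootstrap resamples contiguous blocks so that local structure is kept within each block.

```python
import numpy as np


def overlap_excess(a, b):
    """Observed overlap fraction minus the fraction expected under independence."""
    return np.mean(a * b) - np.mean(a) * np.mean(b)


def block_bootstrap_z(a, b, block_len, n_boot=200, seed=0):
    """Z-like score for overlap_excess using a circular block bootstrap.

    a, b      : 0/1 arrays of equal length (per-base feature annotations).
    block_len : block size; assumed to be much larger than a single feature.
    Returns (z, bootstrap_statistics).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = a.size
    rng = np.random.default_rng(seed)
    n_blocks = int(np.ceil(n / block_len))
    offsets = np.arange(block_len)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        # Pick random block starts, then expand each into a run of
        # consecutive positions (wrapping around the end of the array).
        starts = rng.integers(0, n, size=n_blocks)
        idx = ((starts[:, None] + offsets[None, :]) % n).ravel()[:n]
        # a and b are resampled with the same indices, so their joint
        # local structure is preserved inside each block.
        stats[i] = overlap_excess(a[idx], b[idx])
    observed = overlap_excess(a, b)
    # The spread of the bootstrap statistics stands in for the standard error.
    return observed / stats.std(ddof=1), stats
```

Calling `block_bootstrap_z(a, b, block_len=500)` on two annotation arrays gives a score that can be read against a Gaussian if that assumption is reasonable; the right block length depends on the scale of the heterogeneity, which this sketch leaves to the user.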

End result – one can control, correctly, for heterogeneity (of certain sorts, but many of the sorts you want to control for, e.g. gene density). Peter is part of the ENCODE DAC group I am putting together, and Peter and his postdoc, Ben Brown, are going to be making Perl, pseudo-code and R routines for this statistic. We in Ensembl will, I think, implement this in a web page, so that everyone can easily use it. Overall… it is a great step forward in handling genome-wide statistics.

It is also about as mathematical as I get.

Yesterday, Ensembl released a new version of the browser and database (version 49). Along with new species, homologue predictions, and new code in our API, there have been changes in how the multiple alignments are done on the whole-genome scale. Have a look at the news for more details.

We are looking forward to release 50, as we are working on some new features. Keep your eye out in August for this next release. A reminder: we will not release another version between now and August, and during that time updates may appear on the Pre! site but not on the main site.

Please explore the features in release 49, such as BLAST, which is now configured to align queries against top-level sequences (i.e. chromosomes and scaffolds), and BLAT, a fast alignment program, which is now the default selection.

Paralogues are shown in blue in GeneTreeView to help guide your eye.

Upcoming Workshops – April

(March workshops are listed in a previous post)

Browser workshops at the VIB Ghent and Leuven (31 Mar – 2 Apr)
Browser workshop (focus: rat) at the ULB Brussels (EURATools) (16 Apr)
Browser workshop at the BCB UCL/Birkbeck (21 Apr)
Module in the EBI roadshow in Poitiers (23, 24 Apr)
API workshop at the Dept. of Genetics, Cambridge (28, 29, 30 Apr)

Keep your eye out for Release 49, which is due on Tuesday 18 March. The delay is due to the scheduled downtime and maintenance at the Sanger and EBI this weekend, which has caused some trouble. However, Release 49 will soon be visible to the community!

New features in release 49 will include BLAST against top-level sequences on all species, updates on the GeneTreeView page that should make things easier to see, and new Ensembl gene sets for Orangutan, Horse and Takifugu. FlyBase 5.4 will be imported for Fruitfly. For API users, the regulatory features will be moved from the core API to the functional genomics API.

Also, a word of warning to those using our mouse clones, namely the MICER clones and the bMQ set (129S7/AB2.2 in the ‘DAS Sources’ menu of ContigView). These clones, originally mapped to NCBIM36, have been lifted over to the new assembly (NCBIM37) coordinates. The drawing indicates where the clone lifts over to in the new assembly. However, the pop-up box shows the coordinates of the original mappings. This is indicated in Ensembl by the ‘NCBIM36’ label above the coordinates.

Write to our helpdesk if you are confused!

After a very successful Ensembl US West Coast Tour last month, the Ensembl Outreach team is presently looking into the possibility of organising a similar tour on the US East Coast in the second half of 2008. At the moment we are mainly thinking of 1-day browser workshops, but if there is interest in an API workshop we can of course also consider this.
The participating institutions would only have to pay the instructor’s expenses and would share the travel costs, but we would not otherwise charge for the workshops. People that are potentially interested in hosting a workshop can contact me for more details.


We are looking forward to Ensembl release 49, which has been delayed to 13 March 2008. This is a result of some downtime planned at the Wellcome Trust Sanger Institute. Users, beware! Ensembl will not be available from Friday 7 March to Sunday 9 March.

On the 13th March Ensembl version 49 will be available. Keep an eye out for:
Drosophila melanogaster (assembly BDGP 5.4)
Horse (first gene build)

Viral genes have been removed in multiple species. ncRNA updates will be ready for Pika and Mouse Lemur, and new variations from dbSNP 128 (mouse, chicken, cow, and zebrafish) and dbSNP 126 (rat) will be available. Have a look at the new pairwise alignments between human and horse genomes.

Upcoming Workshops – March

A series of talks and workshops are happening in March.

Browser Workshop (2-day course at the EBI) 5-7 March
Browser Workshop (part of a 2 day course, MRC London: EBI Roadshow) 11-12 March
Presentation at the Genomes to Systems 2008 conference in Manchester, 17-19 March

Interested in organising a course in Ensembl and BioMart? Contact our helpdesk.

-The Ensembl Outreach team


(you wouldn’t expect orangutans out of the trees, would you?)

Ensembl 49 will contain good news on the comparative genomics side. Apart from the new whole-genome multiple alignments for which we can now handle segmental duplications and infer ancestral sequences (see Ewan’s post on the 20th of January), two new species will be available, namely horse and orangutan.

We are especially excited about the new orangutan genome as it is a key species in the primate lineage, in between the human, chimp and gorilla group and the Old World monkeys. Its inclusion in our gene trees will result in a better resolution of the phylogeny of the primate genes.

We’re pleased to announce that Ensembl now has a mirror at the Beijing Genomics Institute, Shenzhen (BGI-SZ). The mirror can be found at http://ensembl.genomics.org.cn/


Most of the functionality of the main Ensembl site is mirrored; however, we’re still working with our colleagues at the BGI to provide the rest, for example BioMart.


Due to a combination of the volume of data comprising a single Ensembl release (the MySQL data and index files for release 48 take up around 600 GB, and that’s without counting all of the flat-file dumps) and the very slow Internet connection between the UK and China, we’re using a “sneakernet” solution – i.e. dumping the data onto a hard drive and shipping it to China. This has proved to be an interesting challenge but it’s working out pretty well so far.

We hope that this mirror will make life easier for our users in and around China. We’re actively trying to set up mirrors elsewhere around the world to reduce network delays and improve people’s Ensembl experience; we’ll post here as soon as any new mirrors come online.

I would like to thank our colleagues at the BGI-SZ, particularly Lin Fang, for setting this mirror up.

You might think that only our group and team leaders are traveling the globe, but also the members of the Ensembl outreach team (Xose Fernandez, Giulietta Spudich and myself) spend a fair amount of their time on the road (or in the air ….) to spread the word about Ensembl.

I myself, for example, just returned to the UK from a 3-week “Ensembl US West Coast tour”. That means no more Margaritas, motels with ocean view or trendy LA restaurants for me for a while, but also no more lost luggage or cancelled or delayed flights (it’s not all glitz and glamour …. ).

My tour started with a visit to the Plant and Animal Genome XVI Conference in San Diego, where I gave a presentation on Ensembl and spent time, together with other EBI colleagues, in the EBI booth to promote our institute. After that I gave Ensembl browser workshops at City of Hope (see picture), the University of Oregon, UCSF, UCSC (where the audience mainly consisted of genome browser folks!), and UCLA. Numbers of participants ranged from around 15 to over 50, and in all places the workshop was very enthusiastically received. In fact, several of my hosts were already asking when we could repeat this ….

The principal aim of our workshops is of course to teach people how to get the most out of Ensembl, but apart from that it is also a really good way for us to stay in contact with our users. We can see exactly what people use Ensembl for, how they use it and what they like and dislike about it, so we always return home with lots of new ideas and suggestions. One thing that often strikes me, for instance, is that most people are not aware of the existence of our data mining tool BioMart. However, after a short explanation and some hands-on exercises they find it, almost without exception, very useful! So, we still have some work to do to promote this very handy tool.

By the way, we not only offer browser workshops, but also workshops on the use of the various Ensembl Perl APIs. Keep an eye on this blog to see where and when the next workshops will be. Or, even better, host one at your own university or institute! For more information with regard to our workshops you can contact our helpdesk.

I’ve just been visiting CNIO, a great, fancy new(ish) institute in Madrid focusing on cancer. It was a great visit if you ignore the 2 hour delay (thanks Iberia) coming out and currently 1 hour delay (thanks BA…) coming back. They are doing all the things one expects from a high-end molecular biology institute. There are chip-chip guys, moving to chip-seq. There are some classic cell biologists moving into more genome-wide assays (in this case, replication). They have a great prospective sample collection in two cancers, and are about to get into a Genome Wide Association Study (GWAS).

David – the head of bioinformatics service – already is leveraging Ensembl a lot. They script against our databases (Perl API mainly) and have a local mirror set up. They ran courses, bringing over Ensembl people for both an API course and a Browser course (contact helpdesk if you’d like this to happen at your institute…). But even then, discussions with David made us realise that they could use us even more – for the functional genomics schema and the variation schema in particular.

This is what Ensembl is all about. We make it easier for people who want to work genomically to do the sometimes painful data manipulation and plumbing. In particular, Ensembl provides public domain information on a large scale, well organised and ready to be browsed on the web, scripted against in Perl and accessed by clients like Bioconductor. And more than any other group, we help groups like David’s do more for their institutes and worry less about the infrastructure. David was very interested in the “geek for a week” program, in which someone comes to work at Ensembl to help accelerate a project.

Returning to the airline theme, some of the biologists admitted, a little embarrassed, to using the UCSC browser. I responded that it was fine – UCSC is a great browser, with some great tools. Like airlines, we know people have a choice of browsers, and we hope people come “fly ensembl” and enjoy it, but we know the competition is good (and really friendly as well – we like working with those crazy Californians, and have a number of joint projects). If you are a biologist, you should use the best tool for the job at hand. Of course, we know where we’re lacking, in particular in comparison to UCSC, and we are working on getting better. Keep an eye open for changes in Ensembl this year – and do come fly with us even if your “regular browser” is US based.

Finally my plane I think is ready to depart.

(Madrid airport is so big I think I’m half way to the UK already)

Today the 1000 Genomes project was announced. By any measure this is a big deal.
The goal is simple: to create the most comprehensive and medically useful collection of human variation ever assembled by producing approximately 6 terabases of sequence. To put this amount of data in perspective, 6 terabases is more than 60 times the amount of data that is currently available in the DDBJ/GenBank/EMBL Archive, and that took more than 25 years to collect. At the peak production of the 1000 Genomes project more than 8 billion basepairs per day will be sequenced. That is the data output of the entire human genome project every week. All made publicly available.
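For a feel of the scale, the quoted figures can be turned into a quick back-of-the-envelope calculation. This is only a sketch in Python: it takes the round numbers above at face value (6 terabases, 8 billion basepairs per day, a factor of 60), and the variable names are made up for illustration.

```python
# Round figures quoted in the post; real totals will differ.
TOTAL_BASES = 6 * 10**12        # "approximately 6 terabases"
PEAK_BASES_PER_DAY = 8 * 10**9  # "more than 8 billion basepairs per day"
ARCHIVE_RATIO = 60              # "more than 60 times" the existing archive

# Upper bound on the current archive size implied by the factor of 60.
implied_archive_bases = TOTAL_BASES / ARCHIVE_RATIO

# Days needed to produce the full data set if peak rate were sustained.
days_at_peak = TOTAL_BASES / PEAK_BASES_PER_DAY

# Bases produced in one week at peak rate.
bases_per_week = PEAK_BASES_PER_DAY * 7

print(f"Implied archive size: at most {implied_archive_bases:.1e} bases")
print(f"Days at sustained peak rate: {days_at_peak:.0f}")
print(f"Bases per week at peak: {bases_per_week:.1e}")
```

In other words, the existing archive is at most about 100 gigabases, and even at a sustained peak rate the project would take roughly two years of sequencing.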
The data generation rate and the short read length mean that the bioinformatics requirements for the project are equally ambitious (or terrifying depending on your point of view). The EBI and NCBI, working together, are creating a joint DCC (data coordination centre) to collect, organise and provide the data to the world. Steve Sherry at the NCBI and I are eager to take this on.
At Ensembl we’ve been expecting this development and built support for re-sequencing data into our variation database a couple of years ago. So far, we have data for about 6 humans, 5 mouse strains, and a smattering of rat data. Small stuff compared to six months from now, but large enough that we have both experience and confidence dealing with the large-scale resequencing data. We are probably going to need both.
Check out more at http://www.1000genomes.org