I was thinking about the web design process for e50 – our new web interface due out in July (definitely late July). We’re at the stage now where Fiona is going to be asking users their preferences for all the “little things” which make no difference to the technical aspects of the web site but make a pretty big difference to the usability. Like, for example, how do we colour our genes? This is a long-standing debate where everyone has an opinion and everyone’s opinion is right – at least for them. (Only two colours, and the colours should distinguish manually annotated genes from automatic ones, says one person. No – use the whole spectrum of colours, and make sure we distinguish non-coding RNA genes from pseudogenes from protein-coding genes, and indicate which ones have orthologs – to mouse. No, to rat. No – instead of that, use GO functional categories to colour genes. Or the number of non-coding SNPs. Or the gene-wide omega value from the dN/dS measurement.)

Sometimes people look at this debate and say that this is a clear area for user-defined colours. Which is sort of true for 10 seconds, but – not really. Firstly, most users are never going to get around to changing options – partly because they have better things to do (like design experiments and run them!), partly because this sort of configuration is just a bit too geeky, and partly because, to be honest, if they are into configuring things we’d like them first to work out which tracks they would like displayed (more on this below); colouring genes should be low on their list. Secondly, we want to provide a scheme which feels natural to the largest number of people. Hence the rather long list of options currently being proposed for users to choose from.

The same argument goes for default tracks. (I can’t imagine not having SNPs on my display! I can’t imagine not having the ESTs switched on!) Everyone has an opinion and everyone is right. Here it is clear we’ve got to make sensible default decisions (which are also heavily, heavily speed-optimised – sadly the new Collections framework won’t be ready for SNPs for e50, which is annoying, as really we want SNP density these days in human, but all the other obvious default tracks are pretty well optimised, including some funky scaling stuff to get the continuous base-pair comparative genomics measure to come back sensibly when you are zoomed out). But then our main task is to get the user to explore – the “wouldn’t it be nice to see xxxx, I wonder if Ensembl has it” moment – with a configuration system which is very enticing, but not in the way, and, importantly for the non-expert user, not completely overwhelming. In our e50 design this means more hierarchy in the options so they can be grouped (itself a bit of a pain to handle – we’ve got a lot of tracks), and a nice “light box” effect over the display which reassures you that (a) the thing you were looking at won’t disappear and (b) the display will come back quickly. I think we’re on the right path here for the configuration, but we still have to decide on the default tracks (for me the only obvious one is “Genes”).

Finally we’ve got the mundane business of which words we use for each of our “pagelet” displays. (Our new pagelets are very nice, and in our latest round of testing, over 50% of the in-the-lab biologists liked not only the pagelets but a specific layout of them; less than 10% preferred the current Ensembl display.) So – we need one or two words to describe “a graphical representation of a phylogenetic tree of a gene with duplication nodes marked”. Hmmm. “Gene Tree”? Or “Phylogenetic Tree”? (Phylogenetic is a bit of a long word, and might get in the way of the menu…) What about “a text-based alignment of resequenced individuals with the potential to mark up some features of interest”? Is this “resequencing alignment”, or “individual alignment”, or “individuals”?

If you’d like to take part in this, email survey@ebi.ac.uk (perhaps cc’d to Xose – xose@ebi.ac.uk) to make sure you are on our list. Ideally we’d like you to be wet-lab biologists. We have a lot of in-house or near-in-house opinions from bioinformaticians, and in any case, bioinformaticians are happier to explore configurations and the like. It’s the researcher who will be visiting us – say – once or twice a month who we think is the main user to optimise for (again, we hope more frequent users will explore the configuration to match things perfectly for them).

More on other e50 topics soon – speed, the importance of chocolate in bribing web developers and the end game for e50!

I’m in the airline lounge about to head back from “Biology of Genomes” at Cold Spring Harbor Laboratory. As always, it was a great meeting; the highlights for me were seeing the 1,000 Genomes data starting to flow – it is clear that the shift in technology is going to change the way we think about population genomics – and the session on “non-traditional models” – dogs, horses and cows – where the ability to do cost-effective genotyping has completely revolutionised the field. Now the peculiarities of the breeding structures – dog breeds being selected for diverse phenotypes, cows with elite bulls siring thousands of offspring through artificial insemination, and horses having obsessive trait fixation over the last 1,000 years – can really bring power to genetics in different ways. Expect a lot more knowledge to come from these organisms and others (chickens, pigs, sheep…) over the coming years.

For my own group, Daniel Zerbino talked about Velvet, our new short-read assembler, which has also just been published in Genome Research (link). Velvet is now robust and capable of assembling “lower” eukaryotic genomes – certainly up to 300 Mb – from short reads in read-pair format. It is also being extensively used by other groups, often for partial, miniature de novo assemblies of regions. It went down well, and Daniel handled some pretty tricky questions in the Q&A afterwards. Next up – we get access to a 1.5 TB real-memory machine and put a whole human genome WGS into memory. Alison (Meynert) and Michael (Hoffman) had great posters on cis-regulation and looked completely exhausted at the end of their poster session.

From Ensembl, Javier talked about the Enredo-Pecan-Ortheus (which we often nickname EPO) pipeline. As some people said to us afterwards, “you’ve really solved the problem, haven’t you” – Javier was able to show clear evidence that each component was working well, better than competing methods, and having an impact on real biological problems, for example derived allele frequency. Its ability to handle duplications is a key innovation. Javier and Kathryn are currently wrestling the “final” 2x genomes into this framework, from which point we will start to have a truly comprehensive grasp on mammalian DNA alignments. I also like it because Enredo is another “de Bruijn graph”-like mechanism. Currently the joke is that about 10 minutes into any conversation I say “well, the right way to solve this problem is to put the DNA sequence into a de Bruijn graph”.

Going to CSHL Biology of Genomes is always a little wince-making though, as this field – high-end genomics – really prefers to use the UCSC Genome Browser (which, as I’ve written before, is a good browser, and I take its popularity as our challenge to make better interfaces for these users on our side). My informal counting of screenshots was >20 UCSC, 4 Ensembl (sneaking one case of ‘Ensembl genes’ shown in the UCSC browser as a point for each side) and 0 NCBI shots. Well. It just shows the task ahead of us. e50 – our user interface relaunch – is coming together, and we will start focus-group testing soon – time for us to address our failings head-on. I’ll be blogging more about this as we start to head towards broader testing.

Lots more to write about potentially – Neanderthals, Francis Collins singing in the NHGRI band (quite an experience), reduced-representation libraries with Elliott, genome-wide association studies (of which I just _love_ the basic phenotype measures, from groups like Manolis Dermitzakis’s) and structural variation… but for the moment I’ve got to persuade my body to feel as if it is 11.30 at night and see if I can get a good night’s sleep on the plane.

We have four groups on campus interested in human genes: Ensembl; Havana, whose data forms the bulk of the Vega database; HGNC, the human gene nomenclature committee; and finally UniProt, which has a special initiative on human proteins. With all these groups on the Hinxton campus, and with all of them reporting to at least one of myself (Ewan Birney), Rolf Apweiler or Tim Hubbard, who form the three-way coordination body now called the “Hinxton Sequence Forum” (HSF), it should all work out well, right?

Which is sort of true; the main thing that has changed over the last year has been far, far closer coordination between these four groups than there ever was before, meaning we will be achieving an even closer coordination of our data, hopefully leaving the only differences being update cycle and genes which cannot be coordinated fully (e.g. due to gaps in the assembly).

Each of these groups has a unique viewpoint on the problem. Ensembl wants to create the best-possible gene set across the entire human genome, and its genesis back in 2000 was that this had to be largely automatic to be achievable in the time scale desired – months (not years) after the data was present. Havana wants to provide the best possible individual gene calls when they annotate a region, integrating computational evidence, high-throughput data and individual literature references together. UniProt wants to provide maximal functional information on the protein products of genes, using many literature references on protein function which are not directly informative on gene structure. And finally HGNC wants to provide a single, unique symbol for each gene, to provide a framework for discussing genes, in particular between practising scientists.

Three years ago, each group knew of the others’ existence, often discussed things and was friendly enough, but rarely tried to understand in depth why certain data items were causing conflicts as they moved between the different groups. Result: many coordinated genes but a rather persistent set of things which were not coordinated. Result of that: irritated users.

This year, this has already changed, and will change even more over 2008 and 2009. Ensembl is now using full-length Havana genes in the gene build, such that when Havana has integrated the usually complex web of high-throughput cDNAs, ESTs and literature information, these gene structures “lock down” that part of the genome. About one third of the genome has Havana annotation, and because of the ENCODE scale-up award to a consortium headed by Tim Hubbard, this will now both extend across the entire genome and be challenged and refined by some of the leading computational gene finders worldwide (Michael Brent, Mark Diekhans and Manolis Kellis, please take a bow). Previously Ensembl brought in Havana on a one-off basis; now this process has been robustly engineered, and Steve Searle, the co-head of the Gene Build team, is confident it can work on a 4-monthly cycle. This means it seems possible that we can promise a worst-case response time of six months for a bad gene structure to be fixed, with the fixed gene structure appearing far faster on the Vega web site. It also means that the Ensembl “automated” system will be progressively replaced by this expert-led “manual” annotation across the entire genome over the next 3 years.

(An aside. I hate using the words “automated” and “manual” for these two processes. The Ensembl gene build is, in parts, very un-automated, with each gene build being precisely tailored to the genome of interest in a manual manner by the so-called “gene builder”. In contrast, “manual” annotation is an expert curator looking at the results of many computational tools, each usually using different experimental information mapped, in often sophisticated ways, onto the genome. Both use a lot of human expertise and a lot of computational expertise. The “Ensembl” approach is to use human expertise in crafting rules, setting parameters and choosing which evidence is the most reliable in the context of the genome of interest, but to have the final decision executed on those rules systematically; the “Havana” curation approach is to use human expertise gene by gene to make the decision in each case, and to have the computational expertise focus on making that decision-making as efficient as possible. Both will continue as critical parts of what we do, with higher-investment genomes (or gene regions in some genomes) deserving the more human-resource-hungry, per-genome “manual” curation, whereas “automated” systems, which still require considerable human resource, can be scaled across many more genomes easily.)

This joint Havana/Ensembl build will, by construction, be both more correct and more stable over time due to the nature of the Havana annotation process. This means other groups interacting with Havana/Ensembl can work in a smoother, more predictable way. In particular, on campus it provides a route for the UniProt team both to schedule their own curation in a smart way (basically, coming after Havana curation) and to feed back issues noticed during UniProt curation so they can be fixed gene by gene. This coordination also helps drive down the issues with HGNC. HGNC has always had a tight relationship with Havana, providing HGNC names for their structures, but the HGNC naming process did not coordinate so well with the Ensembl models, with gene names in complex cases becoming confused. This can now be untangled at the right levels – when it is an issue with gene structures, prioritise those for the manual route; when it is an issue with transferring the assignment of HGNC names (which primarily attach to individual sequences, with notes to provide disambiguation) to the final Havana/Ensembl gene models, it can be triaged and fixed. HGNC will be providing new classifiers of gene names to deal with complex scenarios where there is just no consistent rule-based way of classifying the difference between “gene”, “locus” and “transcript” that works genome-wide. The most extreme example is the immunoglobulin loci, with a specialised naming scheme for the components of each locus, but there are other oddities in the genome, such as the protocadherin locus, which is… just complex. By having these flags, we can warn users that they are looking at a complex scenario, and give people who want to work only with cases that follow the “simple” rules (one gene, in one location, with multiple transcripts) the ability to work just in that part of the genome, without pretending that these parts of biology don’t exist.

It also improves our relationships with the other groups in this area – in particular NCBI and UCSC (via the CCDS collaboration), NCBI Entrez Gene (via the HGNC collaboration) and other places worldwide: (a) they can work better with us because we’ve got more of our own shop in order, and (b) if we want to change a piece of information or a system, there is only one place we need to change it.

End result: far more synchrony of data, far less confusion for users, far better use of our own resources and better integration with other groups. Everyone’s a winner. Although this is all fiddly, sometimes annoying, detail-oriented work, it really makes me happy to see us on a path where we can see it resolved.

Last week I was a co-organiser of a Newton Institute workshop on high-dimensional statistics in biology. It was a great meeting and there were lots of interesting discussions, in particular on ChIP-seq methods and protein–DNA binding array work. I also finally heard Peter Bickel talk about the “Genome Structure Correction” (GSC) method, something he developed for the ENCODE statistics, which I now, finally, understand. It is a really important advance in the way we think about statistics on the genome.

The headache for genome analysis is that we know for sure that the genome is a heterogeneous place – lots of things vary, from gene density to GC content to… nearly anything you name. This means that naive parametric statistical measures – for example, assuming everything is Poisson – will completely overestimate the significance. In contrast, naive randomisation experiments to build an empirical distribution for the genome can easily lead to over-dispersed null distributions, i.e. end up underestimating the significance (given the choice, it is always better to underestimate). What’s nice is that Peter has come up with a sampling method to give you the “right” empirical null distribution. This involves a segmented block-bootstrap method where, in effect, you create “feasible” miniature genome samples by sampling the existing data. As well as being intuitively correct, Peter can show it is actually correct given only two assumptions: first, that the genome’s heterogeneity is block-y at a suitably larger scale than the items being measured, and second, that the genome has independence of structure once one samples from far enough away – a sort of mixing property. Finally Peter appeals to the same ergodic theory used in physics to convert a sampling over space into a sampling over time; in other words, sampling the heterogeneity of the single genome we have produces a set of samples of “potential genomes” that evolution could have created. All of these assumptions are justifiable, and certainly this is far fewer assumptions than other statistics require. Using this method, empirical distributions can be generated (which in some cases can safely be assumed to be Gaussian, so far fewer points are needed to get the estimate), and test statistics built off these distributions. (Peter prefers confidence limits on a null distribution.)
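Just to fix the idea in my own head, here is a toy sketch of that block-bootstrap logic in Perl. This is purely my own illustration – not Peter and Ben’s routines – and the block size, sample counts and “observed” value are all made-up numbers.

    #!/usr/bin/env perl
    # Toy illustration of the block-bootstrap idea behind GSC: resample the
    # genome in large blocks, so each block keeps its local heterogeneity,
    # and build an empirical null distribution for a hit-density statistic.
    # (My own sketch with made-up numbers, not the real GSC routines.)
    use strict;
    use warnings;
    use List::Util qw(sum);

    my @hits       = map { rand() < 0.01 ? 1 : 0 } 1 .. 100_000;  # toy per-base hits
    my $block_size = 2_000;   # should be large relative to the features measured
    my $n_blocks   = 25;      # blocks concatenated into one "miniature genome"
    my $n_samples  = 500;     # number of bootstrap samples

    my @null;
    for ( 1 .. $n_samples ) {
        my $total = 0;
        for ( 1 .. $n_blocks ) {
            # only the block's location is random; its internal structure is kept
            my $start = int rand( @hits - $block_size );
            $total += sum @hits[ $start .. $start + $block_size - 1 ];
        }
        push @null, $total / ( $n_blocks * $block_size );   # hit density per base
    }

    # Compare an observed hit density (placeholder value) against the null
    my $observed   = 0.013;
    my $as_extreme = grep { $_ >= $observed } @null;
    printf "empirical p-value ~ %.3f\n", ( $as_extreme + 1 ) / ( $n_samples + 1 );

The point of the block structure is exactly the one Peter makes: within a block the real, heterogeneous genome is kept intact, and only between blocks do we assume independence.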

End result – one can control, correctly, for heterogeneity (of certain sorts, but covering many of the kinds you want to, e.g. gene density). Peter is part of the ENCODE DAC group I am putting together, and Peter and his postdoc, Ben Brown, are going to be making Perl, pseudo-code and R routines for this statistic. We in Ensembl will, I think, implement this in a web page so that everyone can use it easily. Overall… it is a great step forward in handling genome-wide statistics.

It is also about as mathematical as I get.

I’ve just been visiting CNIO – a great, fancy new(ish) institute in Madrid focusing on cancer. It was a great visit if you ignore the 2-hour delay (thanks, Iberia) coming out and the current 1-hour delay (thanks, BA…) coming back. They are doing all the things one expects from a high-end molecular biology institute. There are ChIP-chip guys moving to ChIP-seq. There are some classic cell biologists moving into more genome-wide assays (in this case, replication). They have a great prospective sample collection in two cancers, and are about to get into a genome-wide association study (GWAS).

David – the head of the bioinformatics service – is already leveraging Ensembl a lot. They script against our databases (Perl API mainly) and have a local mirror set up. They have run courses, bringing over Ensembl people for both an API course and a browser course (contact helpdesk if you’d like this to happen at your institute…). But even then, discussions with David made us realise that they could use us even more – for the functional genomics schema and the variation schema in particular.

This is what Ensembl is all about. We make it easier for people who want to work genomically to do the sometimes painful data manipulation and plumbing. In particular, Ensembl provides public-domain information at large scale, well organised, ready to be browsed on the web, scripted against in Perl and accessible to clients like Bioconductor. And more than any other group, we help groups like David’s do more for their institutes while worrying less about the infrastructure. David was very interested in the “geek for a week” programme, where someone comes to work with the Ensembl team to help accelerate a project.
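For a flavour of what “scripting against our databases” looks like, here is a minimal example using the public Ensembl Perl API – the species, region and printed fields are arbitrary choices of mine, and a local mirror would simply swap in its own connection details.

    #!/usr/bin/env perl
    # Minimal example of querying the public Ensembl databases via the Perl API.
    # The species, region and output fields are illustrative choices only.
    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    # Connect to the public Ensembl MySQL server (a local mirror would
    # simply use different host/user settings here).
    Bio::EnsEMBL::Registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );

    # Fetch a 1 Mb slice of human chromosome 20 and list the genes on it
    my $slice_adaptor = Bio::EnsEMBL::Registry->get_adaptor( 'Human', 'Core', 'Slice' );
    my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '20', 1_000_000, 2_000_000 );

    foreach my $gene ( @{ $slice->get_all_Genes } ) {
        printf "%s\t%s\t%d-%d\t%s\n",
            $gene->stable_id,
            $gene->external_name || 'unnamed',
            $gene->seq_region_start,
            $gene->seq_region_end,
            $gene->biotype;
    }

A dozen lines like this is exactly the sort of plumbing we want to take off people’s plates.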

Returning to the airline theme, some of the biologists admitted, a little embarrassed, to using the UCSC browser. I responded that it was fine – UCSC is a great browser, with some great tools. Like airlines, we know people have a choice of browsers, and we hope people come “fly Ensembl” and enjoy it, but we know the competition is good (and really friendly as well – we like working with those crazy Californians, and have a number of joint projects). If you are a biologist, you should use the best tool for the job at hand. Of course, we know where we’re lacking, in particular in comparison to UCSC, and we are working on getting better. Keep an eye open for changes in Ensembl this year – and do come fly with us even if your “regular browser” is US-based.

Finally, I think my plane is ready to depart.

(Madrid airport is so big I think I’m half way to the UK already)

Richard posted the next release intentions here:

ensembl-dev archive

Lots of good stuff – orangutan and horse being released, the usual tweaks such as contamination (viral genes) being removed from the gene sets – little details.

But one thing is quite a change. It is from Javier’s Compara team, and it is simply stated as

“Generate the 7-way alignments using the new enredo-pecan-ortheus pipline”

Unpacking this statement, it is a big change in how we’re thinking about comparative genomics alignments. Enredo is a method to produce a set of co-linear regions, sometimes called a “synteny map”, though that is a dreadful term. The key thing is that it handles duplications in the genome, allowing (say) two regions of human to be co-linear with one region of mouse. This is hard to handle on a genome-wide scale in a scalable manner. Pecan is the multiple aligner written by the brilliant Ben Paten (he used to be my student, and wrote Pecan whilst at the EBI; he is now at UCSC with Jim, David and co). Pecan is the best aligner – by both simulation testing and testing via ancient-repeat alignability criteria – with the highest sensitivity of alignment at the same specificity as the next-best aligner. Finally Ortheus, also from Ben, provides (potentially) realignment whilst simultaneously sampling correctly from a probabilistic model of sequence evolution, critically including insertions and deletions, and thus, as a side effect, producing likely ancestral sequences. This has also been stringently tested using a hold-one-out criterion: basically, can we “predict” the marmoset sequence using only the other extant species? (Answer – not completely correctly, but better than any other method, e.g. taking the nearest sequence.)

So – what does this all mean? Basically there are two key things:

  1. Handling lineage-specific duplications. This is a headache, and we have a good solution, providing the alignment of the paralogous and orthologous regions simultaneously (the paralogy is limited to relatively recent paralogy, i.e. within mammals).
  2. We can reliably predict ancestral sequences.

One headache is that some of the things we display, in particular the GERP continuous conservation score, need to be adapted to work on regions which now contain paralogy. There is a fascinating piece of theory to work through here – what is the concept of the “neutral tree” when there has been a lineage-specific duplication? How should one treat paralogs? Currently this is ignored by virtue of the fact that the alignments don’t allow it. Now the alignments do allow it, and we need to do something sensible, as well as stimulate evolutionary theory people to look at the data and work out new methods.

The next headache is what do we do with the ancestral sequences? Dump them? Display them? Gene predict on them? If so, how?

The end result is that release 49 won’t look very different, even in the comparative genomics, but it will have these new alignments, and over 2008 we will be working out how to present, analyse and leverage them more – so if you are interested, please do take them for a spin!

(Release 49 is due to be out sometime in mid-Feb)

At the beginning of this week Paul Flicek and I were in lovely Rotterdam at the kick-off meeting of Gen2Phen, an EU project led by Tony Brookes from Leicester. Like all large European projects, the kick-off meeting is a chance to get to know everyone, have beers (very good ones in Holland) and get a feel for the project.

For me, the exciting thing was getting closer to the locus-specific databases – in the project are Johan den Dunnen (from just down the road in Leiden, Holland) and Andy Devereau (from Manchester), who run locus-specific databases and diagnostic databases respectively. Getting this valuable data coordinated with genome data (and the fiddly bit is about sequence coordinates, at least at first) is going to be a great thing to do.

There’s lots to do in this area – certainly this is something that affects all the big browsers (UCSC, NCBI, ourselves) and has had a long history of complex systems and sociological tensions in getting things sorted. But my sense in this small room hidden away in the Erasmus medical centre was that we had good people there, committed to finding a good solution whilst understanding the complexity of the problem. Next up will be more technical meetings, but it was an excellent start. Don’t expect anything tomorrow, but I think we can expect something towards the end of 2008 or in 2009.

And did I mention the beer was good as well?

We’re going to be experimenting with broader content generated by the Ensembl team on the Ensembl blog – at the very least by myself, Ewan Birney. So you can expect to read more about what we’re doing, the things which are coming up in the pipeline and our thoughts on how genomic infrastructure is going to evolve over time. Ensembl is a big team, with a lot of components, so it is often hard to track what we’re doing and why we’ve made some decisions. Hopefully this blog will keep you up to date with our progress in an informal manner.