Today the 1000 Genomes project was announced. By any measure this is a big deal.
The goal is simple: to create the most comprehensive and medically useful collection of human variation ever assembled by producing approximately 6 terabases of sequence. To put this amount of data in perspective, 6 terabases is more than 60 times the amount of data that is currently available in the DDBJ/GenBank/EMBL Archive, and that took more than 25 years to collect. At the peak production of the 1000 Genomes project more than 8 billion basepairs per day will be sequenced. That is the data output of the entire human genome project every week. All made publicly available.
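As a rough back-of-the-envelope check, the figures above can be turned into a few derived numbers. This is only a sketch: it uses the approximate values quoted in this post (6 terabases, a factor of 60, 8 billion basepairs per day), the variable names are illustrative, and the derived values are implied estimates rather than official project figures.

```python
# Approximate figures quoted in the post.
TOTAL_BASES = 6e12          # ~6 terabases planned for the project
ARCHIVE_RATIO = 60          # "more than 60 times" the current archive
PEAK_BASES_PER_DAY = 8e9    # "more than 8 billion basepairs per day"

# Implied upper bound on the current size of the archive.
implied_archive_bases = TOTAL_BASES / ARCHIVE_RATIO

# Days needed to produce the full data set if the peak rate were sustained.
days_at_peak = TOTAL_BASES / PEAK_BASES_PER_DAY

# Sequence produced in one week at the peak rate.
bases_per_week = PEAK_BASES_PER_DAY * 7

print(f"implied archive size: {implied_archive_bases / 1e9:.0f} gigabases")
print(f"days at peak rate:    {days_at_peak:.0f}")
print(f"bases per week:       {bases_per_week / 1e9:.0f} gigabases")
```

Since the quoted figures are all "more than" or "approximately", the results should be read as orders of magnitude only.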
The data generation rate and the short read length mean that the bioinformatics requirements for the project are equally ambitious (or terrifying, depending on your point of view). The EBI and NCBI, working together, are creating a joint DCC (data coordination centre) to collect, organise and provide the data to the world. Steve Sherry at the NCBI and I are eager to take this on.
At Ensembl we’ve been expecting this development and built support for re-sequencing data into our variation database a couple of years ago. So far, we have data for about 6 humans, 5 mouse strains, and a smattering of rat data. Small stuff compared to six months from now, but large enough that we have both experience and confidence in dealing with large-scale resequencing data. We are probably going to need both.
Lots of good stuff: orangutan and horse being released, the usual tweaks such as contamination (viral genes) being removed from the gene sets, and other little details.

But one thing is quite a change. It is from Javier’s Compara team, and it is simply stated as

“Generate the 7-way alignments using the new enredo-pecan-ortheus pipeline”

Unpacking this statement, it is a big change in how we’re thinking about comparative genomics alignments. Enredo is a method to produce a set of co-linear regions, sometimes called a “synteny map”, though this is a dreadful term. The key thing is that it handles duplications in the genome, allowing (say) two regions of human to be co-linear with one region of mouse. This is hard to handle on a genome-wide scale in a scalable manner. Pecan is the multiple aligner written by the brilliant Ben Paten (used to be my student, and wrote Pecan whilst at the EBI; he is now at UCSC with Jim and David and co). Pecan is the best aligner: by both simulation testing and testing via ancient repeat alignability criteria, it has the highest sensitivity of alignment with the same specificity as the next best aligner. Finally Ortheus, also from Ben, provides (potentially) realignment whilst simultaneously sampling correctly from a probabilistic model of sequence evolution, critically including insertions and deletions, and thus, as a side effect, producing likely ancestral sequences. This also has been stringently tested using a hold-one-out criterion: basically, can we “predict” the marmoset sequence using only other extant species (answer: not completely correctly, but better than any other method, e.g. taking the nearest sequence).
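To make the hold-one-out idea concrete, here is a minimal sketch of how such a comparison could be scored. It is not the actual Ortheus evaluation: the function names are hypothetical, the score is plain per-column identity over aligned rows, and the sequences are made-up toy data rather than real marmoset sequence.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of alignment columns where the two rows agree.

    Both strings must be rows of the same alignment (equal length,
    '-' for gaps). Columns where both rows are gaps are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return matches / len(columns)


def hold_one_out(held_out: str, prediction: str, nearest: str) -> dict:
    """Score a model prediction and the nearest-sequence baseline
    against the row that was held out of the alignment."""
    return {
        "prediction": percent_identity(held_out, prediction),
        "nearest": percent_identity(held_out, nearest),
    }


# Toy, made-up aligned rows (not real data).
held_out   = "ACGT-ACGTA"   # the species left out of the alignment
prediction = "ACGT-ACGTT"   # sequence inferred from the other species
nearest    = "ACGTTACCTT"   # baseline: closest extant relative
scores = hold_one_out(held_out, prediction, nearest)
print(scores)
```

The point of the comparison is the relative ordering: a prediction method is only interesting if it beats the simple baseline of copying the nearest sequence.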

So, what does this all mean? Basically there are two key things:

  1. Handling lineage specific duplications. This is a headache, and we have a good solution, which therefore provides the alignment of the paralogous and orthologous regions simultaneously (the paralogy is limited to relatively recent paralogy, i.e. within mammals)
  2. We can reliably predict ancestral sequences
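The first point can be pictured as a block that is allowed to hold more than one region from the same species. The sketch below is an illustration only, not the Enredo or Compara data model: the class names, fields and coordinates are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    species: str
    chromosome: str
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    strand: int   # +1 or -1


@dataclass
class ColinearBlock:
    """A set of regions treated as co-linear with one another."""
    regions: list = field(default_factory=list)

    def duplicated_species(self) -> list:
        """Species contributing more than one region (paralogy in the block)."""
        counts = Counter(r.species for r in self.regions)
        return sorted(s for s, n in counts.items() if n > 1)


# Two human regions co-linear with one mouse region (made-up coordinates).
block = ColinearBlock([
    Region("human", "chr1", 1_000_000, 1_050_000, +1),
    Region("human", "chr5", 2_300_000, 2_348_000, -1),
    Region("mouse", "chr3", 4_100_000, 4_151_000, +1),
])
print(block.duplicated_species())
```

A one-region-per-species model cannot represent this block at all, which is why downstream tools that assume that model need rethinking.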

One headache is that some of the things we display, in particular the GERP continuous conservation score, need to be adapted to work on regions with paralogy. There is a fascinating piece of theory to work through here: what is the concept of the “neutral tree” when there has been a lineage specific duplication? How should one treat paralogs? Currently this is ignored because the alignments don’t allow it. Now the alignments do allow this, and we need to do something sensible, as well as stimulate evolutionary theory people to look at the data and work out new methods.

The next headache is what do we do with the ancestral sequences? Dump them? Display them? Gene predict on them? If so, how?

The end result is that release 49, even the comparative genomics, won’t look very different, but it will have these new alignments, and over 2008 we will be working out how to present, analyse and leverage them more, so if you are interested, please do take them for a spin!

(Release 49 is due to be out sometime in mid-Feb)

At the beginning of this week Paul Flicek and I were in lovely Rotterdam at the kick-off meeting for Gen2Phen, an EU project led by Tony Brookes from Leicester. As with all large European projects, the kick-off meeting is a chance to get to know everyone, have beers (very good ones in Holland) and get a feel for the project.

For me, the exciting thing was getting closer to the locus specific databases. The project includes Johan den Dunnen (from just down the road in Leiden, Holland) and Andy Devereau (from Manchester), who run locus specific databases and diagnostic databases respectively. Getting this valuable data coordinated with genome data (and the fiddly bit is about sequence coordinates, at least at first) is going to be a great thing to do.

There’s lots to do in this area. Certainly this is something that affects all the big browsers (UCSC, NCBI, ourselves) and has had a long history of complex systems and sociological tensions in getting things sorted. But my sense in this small room hidden away in the Erasmus medical centre was that we had good people in the room, committed to finding a good solution whilst understanding the complexity of the problem. Next up will be more technical meetings, but it was an excellent start. Don’t expect anything tomorrow, but I think we can expect something end of 2008/2009.

And did I mention the beer was good as well?

We’re going to be experimenting with broader content generated by the Ensembl team in the Ensembl blog, at the very least by me, Ewan Birney. So you can expect to read more about what we’re doing, the things which are coming up in the pipeline and our thoughts on how genomic infrastructure is going to evolve over time. Ensembl is a big team, with a lot of components, so it is often hard to track what we’re doing and why we’ve made some decisions. This blog will hopefully keep you up to date with our progress in an informal manner.

Happy New Year from Ensembl!

Ensembl is pleased to announce a new release (version 48) due on 11 December. Featured in the release will be a new gene set for pika and the mouse lemur. These species will also be incorporated into homology calculations and alignments. Further upcoming features include variations from dbSNP128 for mouse and a new rat strain, RNB1 (from Japan). The human variation set will also feature updated data from Dr. Watson’s genome, along with updates from dbSNP128.

Did you know about our new human and mouse databases (since release 41)? The functional genomics database (funcgen) is a first step into the world of annotating promoter and enhancer elements detected in the ENCODE project. See these features in ContigView (‘Regulatory features’ under the ‘Features’ roll-down menu) or access the data with our API.

Finally, a new database integrating tissue expression data and presenting it on the rat genome will be available as EURATMart.

The Barcelona Supercomputing Center, site of an EBI roadshow in November:

Ensembl released version 47. News highlights: New gene sets for mouse, human and C. elegans. A new mouse assembly (m37) is available with a new Ensembl gene set, and a new human gene set has been determined for assembly NCBI 36. WormBase 180 has been imported into the browser for C. elegans. A word of warning: the FTP site has been rearranged, so please check the site for the updated format. Click here for more release news.

New! An updated pig assembly and a new orangutan assembly in our Pre! site.

Have you seen our animated tutorials? Learn how to use BioMart to convert IDs here.

This new release hosts an updated zebrafish assembly (Zv7) along with newly determined gene sets for zebrafish, platypus and chimpanzee.

SequenceAlignView is a new page allowing sequences to be compared across strains (mouse and rat)/individuals (humans). Variations can be displayed in this view. See the sitemap to find the page.

Finally, three new videos are available in the help. See the video tutorials here:

More news is available on our website. Come find out what’s new!

