Ensembl 55

We are currently working on our next release which is due at the end of June 2009 and will contain the following:


Human GRCh37
We will be releasing a new genebuild for human based on the latest assembly, GRCh37, from the Genome Reference Consortium. A preliminary version of this assembly is available now in Ensembl Pre! Because of the new assembly we will have:

  • Updated repeat masking
  • New probeset mappings
  • cDNA update
  • A new ensembl-vega merge delivering a new gene set

Ensembl 55 includes the 2X genome for the tammar wallaby (Macropus eugenii). This will be a projection build, similar to our other 2X species.

C. elegans
We will also include an import of the WormBase release WS200 database for C. elegans.

Anole lizard – A gene patch incorporating the gene set provided by Chris Ponting at Oxford University gives us a new gene set for the green anole lizard (Anolis carolinensis).

Mouse – The mouse cDNA alignments have been updated.

Zebra finch – There will be an updated gene set for the 6X zebra finch genome.

Zebrafish – Non-coding RNAs will be added to the Zv8 zebrafish assembly. There will also be some changes to protein-coding gene models, along with new repeats and expression patterns.


Schema Changes

  • Patch to update versions (patch_54_55_a.sql).
  • Add the missing types to go_xref (patch_54_55_b.sql).
  • Add new table dependent_xref, which will hold the dependencies between xrefs; for example, if an EMBL entry comes from a UniProt entry, that relationship will be recorded in this table (patch_54_55_d.sql).
  • Add new tables for alternative splicing/transcript events (patch_54_55_c.sql).
  • Add new column ‘is_constitutive’ to the exon table (patch_54_55_e.sql)
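
The kind of relationship that dependent_xref is meant to record can be sketched as follows. This is only an illustration using an in-memory SQLite database: the table layout, column names and accessions are assumptions made for the example, not the definitions shipped in patch_54_55_d.sql.

```python
import sqlite3

# Toy in-memory database. Table and column names are illustrative
# assumptions, NOT the schema created by patch_54_55_d.sql.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE xref (
        xref_id   INTEGER PRIMARY KEY,
        source    TEXT NOT NULL,
        accession TEXT NOT NULL
    );
    CREATE TABLE dependent_xref (
        master_xref_id    INTEGER NOT NULL REFERENCES xref(xref_id),
        dependent_xref_id INTEGER NOT NULL REFERENCES xref(xref_id)
    );
""")

# Hypothetical accessions: an EMBL entry that was derived from a UniProt entry.
conn.executemany("INSERT INTO xref VALUES (?, ?, ?)", [
    (1, "UniProt", "UNIPROT_ACC_1"),
    (2, "EMBL", "EMBL_ACC_1"),
])
conn.execute("INSERT INTO dependent_xref VALUES (1, 2)")


def dependents_of(connection, master_accession):
    """Return (source, accession) pairs for xrefs that depend on the given master."""
    return connection.execute(
        """
        SELECT d.source, d.accession
        FROM xref m
        JOIN dependent_xref dx ON dx.master_xref_id = m.xref_id
        JOIN xref d ON d.xref_id = dx.dependent_xref_id
        WHERE m.accession = ?
        ORDER BY d.xref_id
        """,
        (master_accession,),
    ).fetchall()


print(dependents_of(conn, "UNIPROT_ACC_1"))
```

Keeping the dependency in its own link table, rather than as a column on the xref itself, lets one master entry have any number of dependent entries.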

Xrefs will be run for Human, Macaque, Opossum, Chimp, Chicken, Dog and Mouse (including FANTOM xrefs).

Ontology database schema and tools
The ensembl_go_NN databases are no longer being built. They are being replaced by the ensembl_ontology_NN database, which can be accessed through the core API.

Assembly mapping
Some of the databases will contain mapping coordinates between current and previous assemblies:

  • human: mapping from current GRCh37 to NCBI36, NCBI35 and NCBI34
  • mouse: mapping from current NCBIM37 to NCBIM36, NCBIM35 and NCBIM34
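
As a rough illustration of what such mapping coordinates allow, the sketch below converts a position on a current assembly to the matching position on an older one. The block list, the `Block` type and the `map_position` helper are hypothetical and exist only for this example; strand is ignored for brevity, and real mappings are read from the databases rather than hard-coded.

```python
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """One gap-free block shared by two assemblies (1-based, inclusive)."""
    chrom: str
    new_start: int  # start on the current assembly
    new_end: int    # end on the current assembly
    old_start: int  # start of the same block on the previous assembly


# Hypothetical blocks: positions 1001-1500 of the old assembly have no
# counterpart in the new one, so the second block is shifted by 500 bp.
BLOCKS = [
    Block("1", 1, 1000, 1),
    Block("1", 1001, 2000, 1501),
]


def map_position(chrom: str, pos: int, blocks: List[Block]) -> Optional[int]:
    """Map a position from the current assembly to the previous one.

    Returns None when the position falls outside every mapped block.
    """
    for block in blocks:
        if block.chrom == chrom and block.new_start <= pos <= block.new_end:
            return block.old_start + (pos - block.new_start)
    return None


print(map_position("1", 1001, BLOCKS))
```

A position that lies outside every block has no equivalent on the older assembly, which is why the helper returns None instead of guessing.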

Other changes
  • API support for alternative transcripts/splicing events will be added
  • API support for constitutive exons will be added
  • Deprecated API modules will be removed
  • All slices will be created using the new_fast method from the SliceAdaptor to improve performance
  • seq_region seq edit support will be added. Seq_edits can already be stored and retrieved, but they were not used when fetching sequence data. This will change so that “_rna_edit” attributes in the seq_region_attrib table are applied to the sequence that is returned.
  • MySQL and FASTA dumps will be copied to the Amazon Public Datasets project
  • Gene name and xref projections
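
The seq edit behaviour described in the list above can be sketched roughly as follows. The example assumes each “_rna_edit” value is a string of the form "start end replacement" with 1-based inclusive coordinates; that format and the `apply_seq_edits` helper are assumptions for illustration, not the API's own code.

```python
from typing import List


def apply_seq_edits(sequence: str, edits: List[str]) -> str:
    """Apply edits given as 'start end replacement' strings.

    Coordinates are assumed to be 1-based and inclusive. An edit with no
    replacement text deletes the region. Edits are assumed not to overlap.
    """
    parsed = []
    for value in edits:
        start_text, end_text, *rest = value.split()
        replacement = rest[0] if rest else ""
        parsed.append((int(start_text), int(end_text), replacement))

    # Work from the right-hand end so that earlier coordinates are not
    # shifted by edits that change the sequence length.
    for start, end, replacement in sorted(parsed, reverse=True):
        sequence = sequence[:start - 1] + replacement + sequence[end:]
    return sequence


print(apply_seq_edits("ACGTACGT", ["2 3 T", "6 6 AA"]))
```

Applying the edits from right to left is the one design choice that matters here: it keeps every coordinate valid against the original sequence even when an edit inserts or removes bases.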

Mart
  • New functional genomics mart
  • A new Probe section added to Ensembl mart
  • New ontology mart
  • Constitutive exon information will be re-added to Ensembl mart
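
For readers unfamiliar with the term, the sketch below shows one way a constitutive flag could be derived. It assumes the usual definition (an exon is constitutive when it is present in every transcript of its gene); the gene, transcript and exon identifiers are made up, and this is not the code used to populate the is_constitutive column.

```python
from typing import Dict, List, Set


def constitutive_exons(transcripts: Dict[str, List[str]]) -> Set[str]:
    """Return the exons that occur in every transcript of one gene."""
    exon_sets = [set(exons) for exons in transcripts.values()]
    if not exon_sets:
        return set()
    return set.intersection(*exon_sets)


# Hypothetical gene with three transcripts; E2 and E4 are alternatively spliced.
GENE = {
    "T1": ["E1", "E2", "E3"],
    "T2": ["E1", "E3"],
    "T3": ["E1", "E3", "E4"],
}

SHARED = constitutive_exons(GENE)

# One boolean per exon, analogous to a per-exon flag column.
FLAGS = {exon: exon in SHARED for exons in GENE.values() for exon in exons}

print(sorted(SHARED))
```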

Variation
  • There will be a new human variation database generated by mapping NCBI36 coordinates to GRCh37 (using dbSNP 129)
  • Illumina array data for SNP/CNV is to be added
  • Transcript variations for zebrafish and zebra finch will be recalculated to include information from the new gene sets
  • Schema change – added a call to get consequence_type

Functional genomics
  • Human Regulatory Build will be updated using the GRCh37 assembly
  • Probe alignment and transcript annotation for all species will migrate from the core databases to the functional genomics databases. This includes Affymetrix, Illumina, Codelink and Phalanx
  • Schema change: an is_current field is to be added to the coord_system table

Comparative genomics

Alignments – The new human assembly means that the following alignments will be regenerated:

  • 9 eutherian mammals EPO multiple alignments
  • 31 eutherian mammals EPO multiple alignments
  • 12 amniota vertebrates Pecan multiple alignments
  • 4 catarrhini primate EPO multiple alignments
  • Pairwise BLASTZ-NET alignments of human against each of the other 9 and 31 eutherian mammals
  • Additional pairwise BLASTZ-NET alignments will be run for human-opossum, human-platypus, human-chicken and human-wallaby
  • Translated BLAT-NET will be regenerated for human against fugu, X.tropicalis, C.intestinalis, C.savignyi, stickleback, medaka, chicken, zebrafish, tetraodon, zebrafinch and anole lizard

Synteny will be recalculated for: rat vs. human, chicken vs. human, and human vs. macaque, dog, chimpanzee, platypus, opossum, mouse, orangutan, horse and cow

Homologies and families

  • 50 way GeneTrees and homologies with new/updated genebuilds and assemblies
  • Clustering using hcluster_sg
  • Multiple Sequence Alignments using consistency-based MCoffee meta-aligner (mafftgins + muscle + kalign + probcons) and new exon-skipping aware “skipper” algorithm.
  • New ‘putative gene split’ and ‘distant paralog’ homology types
  • Pairwise gene-based dN/dS calculations for high coverage species pairs
  • Updated MCL families including all Ensembl transcript isoforms and newest UniProt Metazoa
  • Multiple sequence alignments with MAFFT
  • Stable IDs for GeneTrees (ENSGT00550NNNNNNNNN) and MCL Families (ENSFM00550NNNNNNNNN).
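
Scripts that handle these identifiers may want to check them against the patterns quoted above. A minimal sketch, assuming each N stands for a single decimal digit; the `parse_stable_id` helper is hypothetical and not part of the Ensembl API.

```python
import re
from typing import Optional, Tuple

# ENSGT00550 or ENSFM00550 followed by nine digits, as quoted in the
# release notes. Treating each N as one decimal digit is an assumption.
STABLE_ID = re.compile(r"ENS(GT|FM)00550(\d{9})")

KINDS = {"GT": "GeneTree", "FM": "MCL family"}


def parse_stable_id(stable_id: str) -> Optional[Tuple[str, str]]:
    """Return (kind, nine-digit suffix), or None if the ID does not match."""
    match = STABLE_ID.fullmatch(stable_id)
    if match is None:
        return None
    return KINDS[match.group(1)], match.group(2)


# Made-up identifier that follows the quoted pattern.
print(parse_stable_id("ENSGT00550" + "000000001"))
```

Using fullmatch rather than search rejects identifiers with extra leading or trailing characters, such as a version suffix.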