The Genome Reference Consortium (GRC) is a collaboration between EMBL-EBI, NCBI, the Wellcome Trust Sanger Institute, and the Genome Institute at Washington University. It is responsible for maintaining the human, mouse and zebrafish reference genome assemblies that you can see in Ensembl, including updates to new assemblies such as the new human assembly GRCh38. The GRC has also been developing methods that allow different sequence paths to be represented at loci where allelic diversity needs to be captured (PLoS Biol. 2011 Jul:9(7):e1001091).

The GRC would like to invite you to a highly technical workshop, which is planned for the morning of Sunday 21st September. The workshop will be chaired by the Wellcome Trust Sanger Institute’s Richard Durbin and Deanna Church from Personalis. Members of the GRC will present and discuss a range of topics including:

  • Alignment/Mapping tools for using the full assembly: distinguishing allelic duplication from paralogous duplication.
  • Representing alignment data in BAM files.
  • Variant calling.
  • Representing variant calls in VCF (or other formats).
  • Reporting results to users in biologically friendly ways.
  • Relationship to parallel interests in the Global Alliance for Genomics and Health (GA4GH) Data Working Group.

The GRC workshop is open to everybody, not just Genome Informatics conference attendees. The workshop is free to attend, but there are limited places so please register if you’d like to come along.

Other events

The 14th Genome Informatics conference will be held at Churchill College, Cambridge, UK, and Ensembl will be there. In addition to the Genome Reference Consortium workshop, we will also be at:

The Ensembl Pre! site has been updated for five species: vervet monkey (Chlorocebus sabaeus), naked mole-rat (Heterocephalus glaber), aardvark (Orycteropus afer), bottlenosed dolphin (Tursiops truncatus) and the American pika (Ochotona princeps).

Vervet monkey, naked mole-rat and aardvark are new species to Ensembl. Our main site already displays earlier, low-coverage assemblies for dolphin and pika.

New species

Vervet monkey

This species is an important non-human primate model for biomedical research into HIV and heritable behavioural phenotypes. The assembly, Chlorocebus_sabeus 1.0 (GCA_000409795.1), was submitted by the Vervet Genomics Consortium and became available in June 2013. It comprises 31 chromosomes and 1432 unplaced scaffolds. The vervet monkey Pre! pages display alignments of UniProt proteins and human Ensembl translations.

Naked mole-rat

These unique rodents can live for longer than 28 years in captivity and are a model for age-related diseases such as cancer and heart disease. Unlike other mammals, naked mole-rats are virtually poikilothermic. They live underground in cooperative colonies where only one queen and several males are reproductively active. The assembly, HetGla_female_1.0 (GCA_000247695.1), was submitted by the Broad Institute and became available in March 2012. It comprises 4229 toplevel sequences, all of which are unplaced scaffolds. The naked mole-rat Pre! pages display alignments of UniProt proteins, and human and mouse Ensembl translations.

Aardvark

The genome has been sequenced in the first phase of a project aiming to identify constrained regions in the human genome by comparing the genomes of ~200 mammals. The assembly, OryAfe1.0 (GCA_000298275.1), was submitted by the Broad Institute and became available in October 2012. It comprises 22508 toplevel sequences, all of which are unplaced scaffolds. The aardvark Pre! pages display alignments of UniProt proteins, human Ensembl translations, and 79 aardvark proteins.

New assemblies

Dolphin

The new assembly, Tru_1.4 (GCA_000151865.2), was submitted by Baylor College of Medicine and became publicly available in January 2012. The dolphin assembly comprises 240900 toplevel sequences, all of which are unplaced scaffolds. The dolphin Pre! pages display alignments of dolphin proteins and cDNAs, UniProt proteins, and human Ensembl translations.

Pika

The new assembly for pika, OchPri3.0 (GCA_000292845.1), was submitted by the Broad Institute and became publicly available in August 2012. The pika assembly comprises 10421 toplevel sequences, all of which are unplaced scaffolds. The pika Pre! pages display alignments of UniProt proteins, and mouse and human Ensembl translations.

 

Pre! sites have been released for two species: Rat (Rattus norvegicus) and Chinese softshell turtle (Pelodiscus sinensis).

Rattus norvegicus

The rat assembly Rnor_5.0 (GCA_000001895.3) was submitted by the Rat Genome Sequencing Consortium. This assembly is composed of 21 chromosomes and 1439 unplaced scaffolds. Click here to go to the rat Pre! site where you can view alignments of rat proteins from UniProt and RefSeq, and alignments of human, rat and mouse translations from Ensembl release 67.

 

Pelodiscus sinensis

The Chinese softshell turtle assembly PelSin_1.0 (GCA_000230535.1) is composed of 19,904 unplaced scaffolds. It has undergone a full Ensembl gene annotation and these results, including RNASeq, will be released in Ensembl 68. Click here to go to the Chinese softshell turtle Pre! site.

 

 

Pre! sites have been released for two species: Western painted turtle (Chrysemys picta bellii) and Spotted gar (Lepisosteus oculatus).

Chrysemys picta bellii

The western painted turtle assembly Chrysemys_picta_bellii-3.0.1 (GCA_000241765.1) was submitted by the International Painted Turtle Genome Sequencing Consortium. The painted turtle is used as a model species for studying anoxia tolerance, freeze tolerance, temperature-dependent sex determination, and vertebrate evolution. This assembly is composed of 80983 unplaced scaffolds. Click here to go to the painted turtle Pre! site where you can view transcripts built from alignments of a few painted turtle proteins (from UniProtKB) as well as from alignments of chicken and green anole lizard translations (from Ensembl release 66).

Lepisosteus oculatus

The spotted gar assembly LepOcu1 (GCA_000242695.1) was submitted by the Broad Institute. The spotted gar is a primitive freshwater fish. It is especially interesting as an outgroup to teleost fishes: the gar lineage diverged from teleosts before the teleost whole-genome duplication. This assembly is composed of 29 chromosomes and 1896 unplaced scaffolds. Click here to go to the spotted gar Pre! site where you can view alignments of zebrafish, stickleback and coelacanth translations from Ensembl release 67.

Pre! sites have been released for five species: cat (Felis catus), chicken (Gallus gallus), dog (Canis lupus familiaris), squirrel monkey (Saimiri boliviensis) and thirteen-lined ground squirrel (Spermophilus tridecemlineatus).

Felis catus

The cat assembly Felis_catus-6.2 (GCA_000181335.2) was submitted by the International Cat Genome Sequencing Consortium. The domestic cat is important as a model organism for human infectious disease and for the conservation of endangered cat species. This assembly is composed of 19 chromosomes, 749 unlocalized scaffolds and 4731 unplaced scaffolds. Click here to go to the cat Pre! site, where you can view cat protein, cDNA and EST alignments, as well as alignments of the Ensembl release 66 human and cat translations. This assembly will undergo full automatic gene annotation in due course.

 

Gallus gallus

The chicken assembly Gallus_gallus-4.0 (GCA_000002315.2) was submitted by the International Chicken Genome Consortium. The chicken is important not only as a food source but also for studies in vertebrate embryology and as a model for other bird species. This assembly is composed of 33 chromosomes, 1805 unlocalized and 14093 unplaced scaffolds. Click here to go to the chicken Pre! site, where you can view chicken protein alignments, as well as alignments of the Ensembl release 65 human, chicken, turkey and zebra finch translations. This assembly will undergo full automatic gene annotation in due course.

 

Canis lupus familiaris

The dog assembly CanFam3.1 (GCA_000002285.2) was submitted by the Dog Genome Sequencing Consortium. The dog is an important model organism for the study of human disease including cancer, heart disease and obsessive-compulsive disorder. This assembly is composed of 39 chromosomes and 3228 unplaced scaffolds. Click here to go to the dog Pre! site, where you can view dog protein, cDNA and EST alignments, as well as alignments of the Ensembl release 65 human and dog translations. Full automatic gene annotation is in progress and will include RNA-seq data.

 

Spermophilus tridecemlineatus

The squirrel assembly SpeTri2.0 (GCA_000236235.1) was submitted by the Broad Institute. The squirrel is a model for mammal hibernation. This assembly is composed of 12483 unplaced scaffolds. Click here to go to the squirrel Pre! site, where you can view alignments of the Ensembl release 66 squirrel, mouse and human translations. Full automatic gene annotation is in progress and will include RNA-seq data.

 

Saimiri boliviensis

The squirrel monkey assembly SaiBol1.0 (GCA_000235385.1) was submitted by the Broad Institute. The squirrel monkey is a model organism for infectious disease, behaviour and reproduction. This assembly is composed of 2685 unplaced scaffolds. Click here to go to the squirrel monkey Pre! site, where you can view alignments of the Ensembl release 65 human translations. This assembly will undergo automatic gene annotation in due course.

Pre! sites have been released for three species: Comorese coelacanth (Latimeria chalumnae), pig (Sus scrofa) and Chinese hamster (Cricetulus griseus).

 

Latimeria chalumnae

Comorese coelacanth. Picture courtesy of Robbie Cada

The Comorese coelacanth assembly LatCha1, provided by the Broad Institute, is in the process of full automatic gene annotation. This interesting fish species is a member of the lobe-finned fishes, a group that had been thought extinct since the Late Cretaceous period. The first living specimen was discovered off the east coast of South Africa in 1938. The coelacanth is an important outgroup to tetrapods. Click here to go to the coelacanth Pre! site, where you can view coelacanth protein and cDNA alignments, as well as alignments of the Ensembl release 64 human, stickleback and zebrafish translations.

 

Sus scrofa

The pig assembly Sscrofa10.2, provided by the Swine Genome Sequencing Consortium, is in the process of full automatic gene annotation that will include RNA-seq data. The pig is important not only for pork production but also as a model organism for research into human health issues such as cardiovascular disease, obesity and immunity. Click here to go to the pig Pre! site, where you can view pig protein, cDNA and EST alignments, as well as alignments of the Ensembl release 64 human and pig translations.

 

Cricetulus griseus

The Chinese hamster assembly CriGri_1.0 is provided by the BGI and published here. The Chinese hamster ovary (CHO) K1 cell line is widely used for the production of biopharmaceutical proteins. There are currently no plans to progress to full genome annotation for this assembly. Click here to go to the Chinese hamster Pre! site.

 

Join us for a workshop titled “Introduction to automatic gene annotation”. This workshop, running 1-2 November at CSHL, is aimed at developers. Two Ensembl developers will present sessions on how to create your own core database, including loading a genome assembly into a database and running simple analyses using the Ensembl genebuild pipeline. The meeting will follow the same format as the 2007-2010 automatic gene annotation workshops.

Participants will be expected to have experience in programming and a background in object-oriented programming. Good familiarity with Perl, a Unix/Linux environment, and MySQL is essential to follow the workshop and the programming examples. Knowledge of the Ensembl core API is also essential. We will be working from a Virtual Machine and participants are expected to bring their own laptops (preferably Mac) to work from – more details will be provided on registration.

Topics to be presented:

  • Introduction to the GeneBuild pipeline, including data input types, generating protein-coding transcript models, and adding UTR to these models
  • An introduction to assembly structure (toplevel, contigs, scaffolds, chromosomes)
  • Overview of the different Ensembl APIs
  • Obtaining the Ensembl API (cvs checkout)
  • Core database schema
  • Tracking jobs in the pipeline
  • Runnable and RunnableDB modules

Practical sessions:

  • Creating a genebuild database
  • Loading an assembly into the database
  • Running algorithms first on the commandline and then using the pipeline
  • Understanding how the pipeline code interacts with the algorithms and the database
  • Understanding the pipeline’s job tracking system
  • Visualisation of results with Apollo.

Would you like to join us? Please contact Bert (bert@ebi.ac.uk) for more details or to register.

 

Pre! sites have been released for three new species: Atlantic cod (Gadus morhua), Nile tilapia (Oreochromis niloticus) and domestic ferret (Mustela putorius furo).

The Atlantic cod annotation will be released on our main site for Ensembl release 65, while Nile tilapia and domestic ferret will be available in later releases.

The Atlantic cod assembly gadMor1, provided by the cod genome consortium, has undergone a full gene annotation. The final gene set, displayed here, comprises 20,095 protein-coding genes, 518 pseudogenes, and 1,541 noncoding RNA genes. Click here to go to the Atlantic cod Pre! site.

The Nile tilapia assembly Orenil1.0, provided by the Broad Institute, has undergone preliminary annotation. This Pre! site displays 20248 raw gene models predicted from alignments of vertebrate proteins in UniProt. Alignments of zebrafish and stickleback Ensembl proteins from release 62 are also available, as are ab initio gene predictions and alignments of sequences from several public databases (e.g. UniGene, EMBL Vertebrate RNA, UniProt). RNA-seq data are expected for the Nile tilapia and we intend to use them in the forthcoming analyses. Click here to go to the Nile tilapia Pre! site.

The domestic ferret assembly MusPutFur1.0 was also provided by the Broad Institute. This Pre! site includes alignments of ferret proteins, cDNAs and ESTs. In addition, ab initio gene predictions and alignments of sequences from several public databases (e.g. UniGene, EMBL Vertebrate RNA, UniProt) are available. Click here to go to the domestic ferret Pre! site.

 

Have you noticed any strange-looking chromosome names when browsing the human data? For example, you might notice sequence region names looking like “Chromosome HSCHR17_2_CTG4: 68,302,419-68,526,413” or “Chromosome HG75_PATCH: 34,442,621-34,976,908”.
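If you want to handle such region names in your own scripts, a location string of this kind can be split into its parts. The following is a minimal Python sketch, not part of Ensembl; the function name `parse_location` and the regular expression are assumptions based only on the two example strings above.

```python
import re
from typing import NamedTuple


class Location(NamedTuple):
    region: str  # sequence region name, e.g. a chromosome or a patch
    start: int   # 1-based start coordinate
    end: int     # end coordinate (inclusive)


# Assumed pattern: optional "Chromosome" prefix, a region name,
# a colon, then two coordinates with thousands separators.
_LOCATION_RE = re.compile(
    r"^(?:Chromosome\s+)?(?P<region>[\w.]+):\s*"
    r"(?P<start>[\d,]+)-(?P<end>[\d,]+)$"
)


def parse_location(text: str) -> Location:
    """Split a browser-style location string into name, start and end."""
    match = _LOCATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised location string: {text!r}")
    # Remove the comma separators before converting to integers.
    start = int(match.group("start").replace(",", ""))
    end = int(match.group("end").replace(",", ""))
    if start > end:
        raise ValueError(f"start is greater than end in {text!r}")
    return Location(match.group("region"), start, end)


if __name__ == "__main__":
    print(parse_location("Chromosome HSCHR17_2_CTG4: 68,302,419-68,526,413"))
    print(parse_location("Chromosome HG75_PATCH: 34,442,621-34,976,908"))
```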

The names refer to genomic sequence that differs from the genomic DNA on the primary assembly. These alternate sequences come in two types: allelic sequence (haplotypes and novel patches) and fix patches. Haplotypes are known variations to the primary assembly, due to variability in the human genome sequence (e.g. the highly variable MHC locus, which contains the haplotypes HSCHR6_MHC_COX, HSCHR6_MHC_SSTO, HSCHR6_MHC_APD, HSCHR6_MHC_DBB, HSCHR6_MHC_MANN, HSCHR6_MHC_MCF, and HSCHR6_MHC_QBL). Novel patches also represent new allelic loci but they are not necessarily haplotypes. Fix patches are where the primary assembly was found to be incorrect, and the patch reflects the corrected sequence. Haplotypes, novel patches and fix patches are determined by the GRC, not by Ensembl.

In the Ensembl browser, as in the figure below, allelic sequence (haplotypic regions and novel patches) is coloured red and fix patches are coloured green. If you have a look at the top image in Region In Detail for chromosome 17, you’ll see examples of both types of alternate sequence.
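The colour scheme can be written down as a small lookup. This Python sketch is illustrative only and is not browser code: the function name `browser_colour` is made up, and the type labels `HAP`, `PATCH_NOVEL`, `PATCH_FIX` and `PAR` are assumed to correspond to the `exc_type` values stored in the `assembly_exception` table.

```python
from typing import Optional

# Allelic sequence: haplotypes and novel patches (shown in red).
ALLELIC_TYPES = {"HAP", "PATCH_NOVEL"}
# Corrections to the primary assembly (shown in green).
FIX_TYPES = {"PATCH_FIX"}


def browser_colour(exc_type: str) -> Optional[str]:
    """Return the colour described in the text for an alternate sequence type.

    Types the text does not describe (for example PAR) return None.
    """
    if exc_type in ALLELIC_TYPES:
        return "red"
    if exc_type in FIX_TYPES:
        return "green"
    return None


if __name__ == "__main__":
    for exc_type in ("HAP", "PATCH_NOVEL", "PATCH_FIX", "PAR"):
        print(exc_type, browser_colour(exc_type))
```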

 

There are several ways to view alternate sequences in Ensembl:

  • If you know the name of the sequence you’re looking for, you can find it by searching in our Search bar.
  • You can view alternate sequence regions in the top image of any Location page, e.g. Region In Detail, Region Overview, Chromosome Summary.
  • Some alternate sequences are available through BioMart.
  • If you’re comfortable using MySQL, you can access the list through the assembly_exception table as follows:

mysql -uanonymous -hensembldb.ensembl.org -P5306 -Dhomo_sapiens_core_62_37g -e "select sr2.name as chr_name, exc_seq_region_start, exc_seq_region_end, exc_type, sr1.name as alternate_seq_name, seq_region_start, seq_region_end from assembly_exception ae, seq_region sr1, seq_region sr2 where sr1.seq_region_id=ae.seq_region_id and sr2.seq_region_id=ae.exc_seq_region_id order by chr_name, exc_seq_region_start"

Click here for the full list of e62 alternate sequences
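To see what the double join on seq_region does without connecting to the public server, the same query can be run against a tiny in-memory database. This is a sketch under stated assumptions: it uses Python's sqlite3 rather than MySQL, the two tables contain only the columns the query touches, and the coordinates are made-up placeholders, not real GRC positions. The two region names are the GRCh37.p1 patches on chromosomes 5 and 9.

```python
import sqlite3

# Same logic as the MySQL query, written with explicit joins.
# sr1 resolves the alternate sequence name, sr2 the chromosome name.
QUERY = """
SELECT sr2.name AS chr_name,
       ae.exc_seq_region_start,
       ae.exc_seq_region_end,
       ae.exc_type,
       sr1.name AS alternate_seq_name,
       ae.seq_region_start,
       ae.seq_region_end
FROM assembly_exception ae
JOIN seq_region sr1 ON sr1.seq_region_id = ae.seq_region_id
JOIN seq_region sr2 ON sr2.seq_region_id = ae.exc_seq_region_id
ORDER BY chr_name, ae.exc_seq_region_start
"""


def build_toy_db() -> sqlite3.Connection:
    """Create a minimal stand-in for the two tables used by the query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE seq_region (
            seq_region_id INTEGER PRIMARY KEY,
            name TEXT
        );
        CREATE TABLE assembly_exception (
            seq_region_id INTEGER,
            seq_region_start INTEGER,
            seq_region_end INTEGER,
            exc_type TEXT,
            exc_seq_region_id INTEGER,
            exc_seq_region_start INTEGER,
            exc_seq_region_end INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO seq_region VALUES (?, ?)",
        [(1, "5"), (2, "9"), (3, "HSCHR5_1_CTG1"), (4, "HG79_PATCH")],
    )
    # Placeholder coordinates, for illustration only.
    conn.executemany(
        "INSERT INTO assembly_exception VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            (4, 500, 900, "PATCH_FIX", 2, 500, 950),
            (3, 1000, 2000, "PATCH_NOVEL", 1, 1000, 2000),
        ],
    )
    return conn


def list_alternate_sequences(conn: sqlite3.Connection) -> list:
    """Return one row per alternate sequence, ordered by chromosome name."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in list_alternate_sequences(build_toy_db()):
        print(row)
```

Each row of assembly_exception points at seq_region twice, once for the alternate sequence and once for the chromosome it maps to, which is why the query needs the table under two aliases.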

If you prefer the Perl API, you can fetch all toplevel sequences, including the alternate ones:

$slices = $slice_adaptor->fetch_all( 'toplevel', undef, 1 );

or

$assembly_exception_features = $assembly_exception_feature_adaptor->fetch_all_by_Slice($slice);

When using the API, the primary assembly is known as the ‘reference’ sequence and the alternate sequences are known as ‘non-reference’ sequence.
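The reference / non-reference distinction behaves like a filter: non-reference sequences are only returned when you ask for them, which is what the third argument of the fetch_all call does. The Python sketch below is a toy model of that behaviour and not the Ensembl API; the `Region` class, the `fetch_all_toplevel` function and the example list are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Region:
    name: str
    is_reference: bool  # True for the primary assembly


def fetch_all_toplevel(
    regions: Iterable[Region], include_non_reference: bool = False
) -> List[Region]:
    """Return reference regions, plus non-reference ones if requested."""
    return [r for r in regions if r.is_reference or include_non_reference]


if __name__ == "__main__":
    # Hypothetical toy data: one chromosome and two alternate sequences.
    regions = [
        Region("17", True),
        Region("HSCHR17_2_CTG4", False),
        Region("HG75_PATCH", False),
    ]
    print([r.name for r in fetch_all_toplevel(regions)])
    print([r.name for r in fetch_all_toplevel(regions, include_non_reference=True)])
```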

Enjoy!

Ensembl release 59 includes the first human assembly patches released by the Genome Reference Consortium (GRC).

The goal of the GRC is to ensure that the human reference assembly is biologically relevant by closing gaps, fixing errors and representing complex variation. Their ongoing efforts are made available to the community via minor releases called patches. The patches do not change the chromosome coordinate system; each provides either a new alternate haplotype (novel patch) or a preview of the chromosome tiling path for that region in the next major release (fix patch).

The patch update GRCh37.p1 affects only two regions of the reference assembly and includes:

Fix Patch: HG79_PATCH (GL339450.1) on chromosome 9, correction for the ABO gene. Fix patches are coloured green on the Chromosome Summary page. Click here for an example region.

Novel Patch: HSCHR5_1_CTG1 (GL339449.1) on chromosome 5. This patch provides an alternative region (haplotype). Novel patches are coloured red on the Chromosome Summary page. Click here for an example region.

The two patched regions have undergone preliminary gene annotation. Human cDNAs with their annotated ORFs were aligned to the genome using the Exonerate cdna2genome model to generate coding transcripts.

We expect future patch releases on a quarterly basis.