We’re excited to announce that the initial annotation of the new human assembly, GRCh38, is now live on the Ensembl Pre! site. A new human assembly is always an exciting (and scary) prospect for both Ensembl and the wider scientific community. Since the release of GRCh38 we’ve been working to produce annotation on the new assembly, and the completion of the human Pre! site is the first major milestone on that journey.

 

Human GRCh38 assembly information:
GRCh38 was produced by the Genome Reference Consortium and released in December 2013. It consists of 24 chromosomes (1-22, X and Y), 127 unplaced scaffolds and 42 unlocalized scaffolds. In addition, there are 261 haplotypes and for the first time centromeres are represented in the assembly.

Preliminary annotation of GRCh38:
Ensembl Pre! sites are designed to offer users an early glimpse at the data, through a range of analyses that include repeat finding, gene prediction, nucleotide and protein alignments, and the construction of an initial set of gene and transcript models using the aforementioned analyses. We have imported or mapped pre-existing models from RefSeq, CCDS and GENCODE 19. The GRCh38 Pre! site includes:

  • RepeatMasker and Dust analyses, masking 50.46% of the genome
  • Centromeres covering 2% of the genome
  • Total Genscan predictions: 50,117
  • Distinct UniGene BLAST hits: 348,921
  • Distinct UniProt BLAST hits: 451,431
  • GENCODE 19 models mapped from e!75: 61,349
  • CCDS models mapped from e!75: 28,781
  • RefSeq annotation release 106 models: 26,670
  • Human-specific cDNA and EST alignments

We have now begun construction of a new set of Ensembl gene models for GRCh38. Human-specific data are key to building a high-quality final set of Ensembl gene models. For the Pre! site, just over 80,000 uniquely accessioned, human-specific proteins were aligned to the genome to create preliminary gene models. It is important to note that these are an initial set of raw gene models; during the full annotation process for release 76 we will create models from multiple data sources and apply a filtering step to remove low-quality or redundant models. This will bring the final gene count in line with the previous assembly.
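To give a feel for what a filtering step of this kind can look like, here is a minimal Python sketch. It is not the Ensembl genebuild code: the `Model` record, its `score` field and the `min_score` threshold are all hypothetical, and "redundant" is taken here to mean simply "identical exon structure on the same strand of the same sequence".

```python
from typing import NamedTuple


class Model(NamedTuple):
    """Hypothetical raw gene model (not an Ensembl API object)."""
    name: str
    seq_region: str
    strand: int
    exons: tuple   # tuple of (start, end) pairs, sorted by start
    score: float   # hypothetical alignment quality score


def filter_models(models, min_score=0.0):
    """Drop low-scoring models, then keep the best model per exon structure."""
    best = {}
    for m in models:
        if m.score < min_score:
            continue  # low quality: discard
        key = (m.seq_region, m.strand, m.exons)
        # Redundant models share a key; keep only the highest-scoring one.
        if key not in best or m.score > best[key].score:
            best[key] = m
    return sorted(best.values(),
                  key=lambda m: (m.seq_region, m.exons[0][0], m.name))


# Illustrative input only; these models are made up.
models = [
    Model("a", "1", 1, ((100, 200), (300, 400)), 90.0),
    Model("b", "1", 1, ((100, 200), (300, 400)), 95.0),  # same structure as "a"
    Model("c", "1", 1, ((500, 600),), 40.0),             # below threshold
    Model("d", "2", -1, ((50, 80),), 75.0),
]
kept = filter_models(models, min_score=50.0)
print([m.name for m in kept])
```

In this toy input, "a" is dropped as redundant with the better-scoring "b", and "c" is dropped for its low score. A real pipeline would use far richer evidence than a single score.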

Variation data for GRCh38:
As human is one of our flagship species, we have also prepared a preliminary view of variation data on the new assembly.

The variation team has mapped all variant locations from GRCh37 to GRCh38 using a method developed by the Ensembl core team to project features between assemblies. Using this method we successfully projected 98.5% of our variants from release 75 to GRCh38. Reasons why a projection might have failed are explained in the assembly mapping blog post.
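The general idea of projecting a feature between assemblies can be sketched as follows. This is a simplified illustration, not the Ensembl core team's method: it assumes a hypothetical list of ungapped `Segment` records that each map a block of old-assembly coordinates to the new assembly, and it treats any feature that does not sit entirely inside one block as a failed projection.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    """Hypothetical ungapped mapping block between two assemblies."""
    old_start: int
    old_end: int
    new_start: int
    strand: int  # 1 = same orientation, -1 = reversed in the new assembly


def project(start: int, end: int, segments) -> Optional[tuple]:
    """Return (new_start, new_end), or None if the projection fails."""
    for seg in segments:
        if seg.old_start <= start and end <= seg.old_end:
            if seg.strand == 1:
                offset = seg.new_start - seg.old_start
                return (start + offset, end + offset)
            # Reversed block: old_end lines up with new_start.
            return (seg.new_start + (seg.old_end - end),
                    seg.new_start + (seg.old_end - start))
    return None  # in a gap, or spanning a block boundary


# Illustrative blocks and variant positions only; all values are made up.
segments = [Segment(1, 1000, 101, 1), Segment(2001, 3000, 5001, -1)]
variants = [(500, 500), (2001, 2001), (1500, 1500), (990, 1010)]

results = [project(s, e, segments) for s, e in variants]
placed = sum(r is not None for r in results)
print(results)
print(f"{100 * placed / len(variants):.1f}% projected")
```

Here the third variant falls in a region with no mapping and the fourth straddles the end of a block, so both fail. These are toy stand-ins for the kinds of failure described in the assembly mapping blog post.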

For the Pre! site we also provide variant consequence information. Using the projected Ensembl gene set, we computed the consequences of each variant that overlaps an Ensembl transcript. You can view these data on the Pre! site or download them from our Pre! FTP site in GVF or VCF format. Additionally, we computed consequences for variants overlapping NCBI’s RefSeq transcripts. These are also available as GVF and VCF files from our Pre! FTP site.
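If you download the VCF files, a data line can be split into its standard fixed columns with a few lines of Python. The sketch below relies only on the generic VCF layout (tab-separated CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); the sample line and the `EXAMPLE_KEY` entry are placeholders, and the actual INFO keys used for consequences in the Pre! files are not assumed here.

```python
def parse_vcf_line(line: str) -> dict:
    """Parse the eight fixed columns of a single VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, _qual, _filter, info = fields[:8]
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            # Keys without "=" are flags; store them as True.
            info_dict[key] = value if sep else True
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),
        "info": info_dict,
    }


# Placeholder line for illustration; not taken from a real file.
sample = "1\t10583\tEXAMPLE_ID\tG\tA,T\t.\t.\tEXAMPLE_KEY=example_value;EXAMPLE_FLAG"
record = parse_vcf_line(sample)
print(record["chrom"], record["pos"], record["alt"], record["info"])
```

Header lines (those starting with `#`) should be skipped before calling this function. For anything beyond a quick look, a dedicated VCF library is a safer choice.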

The variation team is now working on the rest of our variation data for Ensembl release 76, including genotype information, phenotype data, LD data, structural variations and mappings of variants to all alternative loci that are new in GRCh38.

What’s next for GRCh38:
Over the following months we will continue to annotate GRCh38. This includes producing the GENCODE 20 gene set that will be released in Ensembl 76, as well as pairwise genome alignments between human and other species in Ensembl, and an updated Regulatory Build.

Keep an eye out for further GRCh38 annotation updates by checking the blog category GRCh38. We’ll be running a free 20-minute webinar on Wed 16 April at 4PM BST. Register here.

Other species added to our Pre! site:
In addition to human we’re pleased to announce that we also have Pre! sites available for Amazon molly, crab-eating macaque and hedgehog.

Ensembl is holding a workshop titled ‘Introduction to automatic gene annotation’, aimed at developers. The workshop runs on 29-30 October 2013 at Cold Spring Harbor Laboratory, New York.

Registration for this workshop is free, but participants will need to cover their own accommodation and meal expenses. Please contact Bert (bert@ebi.ac.uk) for more details or to register.

Two Ensembl developers will present sessions on how to create your own core database, including the loading of a genome assembly into a database and the running of simple analyses using the Ensembl genebuild system.

Participants will be expected to have programming experience and a background in object-oriented programming. Good familiarity with Perl, a Unix/Linux environment and MySQL is essential to follow the workshop and the programming examples. Knowledge of the Ensembl core API is also essential.

Topics to be presented:

  • Introduction to the Ensembl genebuild system, including data input types, generating protein-coding transcript models, and adding UTR to these models
  • An introduction to assembly structure (toplevel, contigs, scaffolds, chromosomes)
  • Overview of the Ensembl Analysis and Pipeline APIs
  • Obtaining the Ensembl API (cvs checkout)
  • Core database schema
  • Tracking jobs in the system
  • Runnable and RunnableDB modules

Practical sessions:

  • Creating a genebuild database
  • Loading an assembly into the database
  • Running algorithms first on the command line and then using the pipeline
  • Understanding how the pipeline code interacts with the algorithms and the database
  • Understanding the pipeline’s job tracking system
  • Visualisation of results with Apollo

Would you like to join us? Please contact Bert (bert@ebi.ac.uk) for more details or to register.

Related Cold Spring Harbor Conference:
Genome Informatics 2013, 30 October to 2nd November, Cold Spring Harbor, New York. Please click here for full details.

New Pre! sites have been released for three species: common shrew (Sorex araneus), olive baboon (Papio anubis) and sheep (Ovis aries).

The common shrew assembly SorAra2.0 (GCA_000181275.2) was submitted by the Broad Institute. It consists of 12,845 unplaced scaffolds comprising 201,798 contigs. A total of 16,210 gene models have been created from alignments of 16,209 human Ensembl translations and one shrew-specific protein. In addition, alignments of sequences from UniProt, UniGene and the ENA vertebrate RNA collection are provided. Click here to visit the common shrew Pre! site.

The olive baboon assembly Panu_2.0 (GCA_000264685.1) was submitted by the Baylor College of Medicine. It consists of 20 autosomal chromosomes (1-20) and the X chromosome. There are 72,500 scaffolds (63,229 of which are unplaced) comprising 198,931 contigs. A total of 20,234 gene models were generated from alignments of known olive baboon proteins and human Ensembl translations. In addition, alignments of sequences from UniProt, UniGene, GenBank (baboon ESTs), RefSeq (baboon cDNAs) and the ENA vertebrate RNA collection are provided. Click here to visit the olive baboon Pre! site.

The sheep assembly Oar_v3.1 (GCA_000298735.1) was submitted by the International Sheep Genomics Sequencing Consortium. It has 130,765 contigs, 5,697 toplevel sequences and 27 chromosomes including the X chromosome. Alignments were created using human and cow Ensembl translations and known sheep proteins from UniProt and RefSeq. These alignments gave 45,972 gene models in total. In addition, alignments of sequences from UniProt, UniGene, GenBank (~935,000 sheep ESTs), RefSeq (~15,700 sheep cDNAs) and the ENA vertebrate RNA collection are provided. Click here to visit the sheep Pre! site.