I’d like to introduce you to an exciting new data set that we’ve added in Ensembl release 62: RNASeq data from Illumina’s Human BodyMap 2.0 project. The data, generated on HiSeq 2000 instruments in 2010, cover 16 human tissue types: adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells. Raw reads are available for download here. For each tissue, we have aligned the raw reads to the genome and then linked exons into tissue-specific transcript models using the reads that span an exon-exon boundary.

You can view these data in the Region in Detail view. Click on ‘Configure this page’ and choose ‘RNA-Seq’ at the left of the main panel. Enable any or all of the 32 tracks and then close the configuration panel. Out of 32 possible tracks you can draw, 16 are tissue ‘gene model’ tracks, and 16 are ‘intron’ tracks.

The ‘gene model’ track shows you a transcript model. The ‘intron’ track shows you how many raw reads aligned across an exon-exon junction. The higher the intron block, the more highly expressed the transcript isoform is.
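To make the idea behind the ‘intron’ track concrete, here is a minimal Python sketch of counting reads that are split across an exon-exon junction. It is only an illustration, not the Ensembl pipeline: it assumes each aligned read is given as a leftmost position plus a SAM-style CIGAR string, in which an `N` operation marks the skipped intron. The reads and coordinates are made up.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def junctions(start, cigar):
    """Yield (intron_start, intron_end) for every N operation in a CIGAR.

    `start` is the 1-based leftmost reference position of the alignment.
    """
    pos = start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # The skipped region covers pos .. pos + length - 1.
            yield (pos, pos + length - 1)
        if op in "MDN=X":
            # Only these operations consume reference bases.
            pos += length

def count_junction_reads(reads):
    """Count how many reads support each intron (exon-exon junction)."""
    counts = Counter()
    for start, cigar in reads:
        counts.update(junctions(start, cigar))
    return counts

# Hypothetical reads: (leftmost position, CIGAR string).
reads = [
    (100, "20M500N30M"),  # split read across the intron 120..619
    (110, "10M500N40M"),  # a second read supporting the same junction
    (90, "50M"),          # unspliced read, supports no junction
]
counts = count_junction_reads(reads)
for (intron_start, intron_end), n in sorted(counts.items()):
    print(f"intron {intron_start}-{intron_end}: {n} split reads")
```

The count per junction is what the height of an intron block represents: more split reads means more support for that splice.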


In this example, the kidney gene model track shows a transcript (dark blue) with an exon structure that matches the gold-coloured Ensembl transcript AQP6-001. The kidney transcript model includes coding and noncoding exons (in the example above, the empty box is UTR, and the filled boxes are coding exons).
Click on the kidney intron track to see that 192 raw reads were split between the first and second exons.

This example is interesting because it shows a gene with high expression in kidney tissue, and almost no expression in any other tissue.

The high read coverage for kidney means that the transcript’s exon-intron structure produced for the gene track has a good chance of being correct. When read coverage is very low, it is not always possible to build a full-length transcript model: look at the colon and brain intron tracks to see that two colon reads and three brain reads have aligned across the transcript’s middle exon-exon junction. Although this read coverage is low, our pipeline has generated a transcript model for brain tissue. The pipeline, however, was not able to predict the two splice junctions on either side because no raw reads from brain aligned across them.

Below is a nice example of a gene that seems to be expressed in all 16 tissues, spermidine synthase (SRM).

Try dump_transcripts.pl as an example script to access the RNAseq-based transcript models. Have fun with these new data!

Have you noticed any strange-looking chromosome names when browsing the human data? For example, you might notice sequence region names looking like “Chromosome HSCHR17_2_CTG4: 68,302,419-68,526,413” or “Chromosome HG75_PATCH: 34,442,621-34,976,908”.

The names refer to genomic sequence that differs from the genomic DNA on the primary assembly. These alternate sequences come in two types: allelic sequences (haplotypes and novel patches) and fix patches. Haplotypes are known variations to the primary assembly, due to variability in the human genome sequence (e.g. the highly variable MHC locus containing haplotypes HSCHR6_MHC_COX, HSCHR6_MHC_SSTO, HSCHR6_MHC_APD, HSCHR6_MHC_DBB, HSCHR6_MHC_MANN, HSCHR6_MHC_MCF, and HSCHR6_MHC_QBL). Novel patches also represent new allelic loci but they are not necessarily haplotypes. Fix patches are where the primary assembly was found to be incorrect, and the patch reflects the corrected sequence. Haplotypes, novel patches and fix patches are determined by the GRC, not by Ensembl.

In the Ensembl browser, as in the figure below, allelic sequences (haplotypic regions and novel patches) are coloured red and fix patches are coloured green. If you have a look at the top image in Region In Detail for chromosome 17, you’ll see examples of both types of alternate sequence.

There are several ways to view alternate sequences in Ensembl:

  • If you know the name of the sequence you’re looking for, you can find it by searching in our Search bar.
  • You can view alternate sequence regions in the top image of any Location page, e.g. Region In Detail, Region Overview, Chromosome Summary.
  • Some alternate sequences are available through BioMart.
  • If you’re comfortable using MySQL, you can access the list through the assembly_exception table as follows:

mysql -uanonymous -hensembldb.ensembl.org -P5306 -Dhomo_sapiens_core_62_37g -e "select sr2.name as chr_name, exc_seq_region_start, exc_seq_region_end, exc_type, sr1.name as alternate_seq_name, seq_region_start, seq_region_end from assembly_exception ae, seq_region sr1, seq_region sr2 where sr1.seq_region_id=ae.seq_region_id and sr2.seq_region_id=ae.exc_seq_region_id order by chr_name, exc_seq_region_start"

Click here for the full list of e62 alternate sequences
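If you would like to see what that join returns without connecting to the public server, the sketch below runs the same query against a tiny in-memory SQLite database using Python. The two tables are cut down to just the columns the query uses, and the single row (including all coordinates and the exception type) is invented for illustration; only the sequence name HSCHR17_2_CTG4 comes from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, illustrative versions of the two tables used by the query.
conn.executescript("""
CREATE TABLE seq_region (seq_region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE assembly_exception (
    seq_region_id INTEGER,         -- the alternate sequence
    seq_region_start INTEGER,
    seq_region_end INTEGER,
    exc_type TEXT,
    exc_seq_region_id INTEGER,     -- the chromosome it is an exception to
    exc_seq_region_start INTEGER,
    exc_seq_region_end INTEGER
);
""")
conn.executemany(
    "INSERT INTO seq_region VALUES (?, ?)",
    [(1, "17"), (2, "HSCHR17_2_CTG4")],
)
# Toy row: every number here is made up.
conn.execute(
    "INSERT INTO assembly_exception VALUES (2, 1000, 2000, 'HAP', 1, 1000, 1900)"
)

query = """
SELECT sr2.name AS chr_name, exc_seq_region_start, exc_seq_region_end, exc_type,
       sr1.name AS alternate_seq_name, seq_region_start, seq_region_end
FROM assembly_exception ae, seq_region sr1, seq_region sr2
WHERE sr1.seq_region_id = ae.seq_region_id
  AND sr2.seq_region_id = ae.exc_seq_region_id
ORDER BY chr_name, exc_seq_region_start
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each result row pairs a chromosome region with the alternate sequence that replaces it: sr2 resolves the chromosome name, sr1 the alternate sequence name.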

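If you use the Perl API, you can fetch all top-level slices, including the non-reference ones (the third argument switches these on):

$slices = $slice_adaptor->fetch_all( 'toplevel', undef, 1 );

or fetch the assembly exception features that overlap a slice:

$assembly_exception_features = $assembly_exception_feature_adaptor->fetch_all_by_Slice($slice);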

When using the API, the primary assembly is known as the ‘reference’ sequence and the alternate sequences are known as ‘non-reference’ sequences.

Enjoy!

Ensembl provides annotations indicating regions in the genome that are experimentally verified to be bound by transcription factors (from ChIP-Seq experiments). Within these regions, we now also provide precise transcription factor binding sites. To generate these binding sites, we make use of publicly available Position Weight Matrices (PWM) from Jaspar.

Transcription factor binding sites can be seen as black boxes in the Regulatory Features track. If you click on a Regulatory Feature you can see information regarding the binding sites contained within that regulatory feature. This includes the binding matrix used and a binding score representing how well a particular site matches the binding matrix. Clicking on a specific black box within the regulatory feature will highlight the corresponding information on the menu (the darker blue line in the figure showing information for a CTCF binding site). Transcription factor binding sites are also displayed as evidence for a regulatory feature (as ‘Core PWM’ entries).

To generate these PWM matches we take Jaspar matrices and find matches throughout the genome. Then, we use experimental binding data to stringently choose high confidence binding sites that fall within regions enriched in ChIP-Seq experiments for the corresponding factor. More details on this process can be found here.
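As a rough illustration of what “matching a matrix” means, here is a small Python sketch that turns a count matrix into log-odds weights and scans a sequence for the best-scoring site. This is a generic textbook-style scoring scheme, not necessarily the one used in our pipeline: the matrix, the uniform background and the pseudocount are all assumptions made for the example, and the matrix is not a real Jaspar entry.

```python
import math

# Hypothetical 4-column count matrix (one dict per motif position).
COUNTS = [
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 0, "G": 1, "T": 9},
]
BACKGROUND = 0.25   # assumed uniform base frequency
PSEUDOCOUNT = 1.0   # avoids log(0) for unseen bases

def log_odds(counts):
    """Convert per-position counts into log2-odds weights."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * PSEUDOCOUNT
        pwm.append({
            base: math.log2(((column[base] + PSEUDOCOUNT) / total) / BACKGROUND)
            for base in "ACGT"
        })
    return pwm

def score(pwm, site):
    """Sum the weights of the bases in `site`, one per motif position."""
    return sum(column[base] for column, base in zip(pwm, site))

def best_hit(pwm, sequence):
    """Return (score, offset) of the highest-scoring window in `sequence`."""
    width = len(pwm)
    hits = [
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(hits)

pwm = log_odds(COUNTS)
best_score, offset = best_hit(pwm, "TTACGTAA")
print(f"best site at offset {offset} with score {best_score:.2f}")
```

In the process described above, candidate sites like this would then be kept only when they fall inside a ChIP-Seq enriched region for the corresponding factor.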

The ENCODE project is generating an immense amount of data. ChIP-Seq, ChIP-chip, and RNA-Seq experiments are yielding an exciting number of sequences that may be involved in gene regulation. Looking at these data individually would be a lot of work, and less powerful than viewing them in combination. The Ensembl Regulatory Build takes data such as CTCF binding sites, DNaseI hypersensitive sites, transcription factor binding sites, and histone modification profiles, and integrates them into ‘regulatory features’. The result is that, for any given regulatory feature in a specific region of the human genome, our users can learn 1) whether there are data supporting that sequence as being involved in gene regulation (for example, is it promoter associated?) and 2) how much data supports the sequence as a regulatory feature.

For example, if we zoom in to BRCA2 in the Ensembl region in detail view, the regulatory feature at the 5’ end of the BRCA2-003 transcript is supported by data from numerous experiments (shown in the figure below).  Click on the grey block in the Reg. Feats. track in the region in detail view to see them.

You can even view regulatory features for a specific cell type.  Click on the Configure this page tool button at the left of the region in detail view to do so.  Or, try the regulation tab (get there by clicking on a regulatory feature’s stable ID, such as ENSR00000054736 in the image above).

Questions or comments about these data or views are welcome.  Reply to this post, or email our helpdesk by using the Contact us link at the top right of this blog.

The Ensembl release 61 mouse gene set (released on 1 February 2011) includes updated automatic annotation for the mouse NCBIM37 assembly. The automatically annotated transcript models match 94.6% of RefSeq proteins exactly.

We merged the results of this automated annotation with Havana’s manual annotation to produce the final gene set that you see displayed on our website. (We are currently merging in Havana’s mouse annotation every other release.)  The final gene set has also been used as an input set for a new round of CCDS comparisons. As a result, the number of genes in the mouse CCDS set has increased.

Following feedback from a recent survey, we decided to post some pointers to functionality that many of our users may not have discovered. One specific case is viewing syntenic regions calculated by the Ensembl comparative genomics team (Compara).

How can we view these regions in the Ensembl genome browser?

There are two ways.  The first way is to go to the dedicated Synteny view.  This view, found in the location tab, offers a comparison of syntenic regions between two species.

In the above figure, a chromosome of interest (human chromosome 6)  is depicted in the centre, and any mouse chromosomes with synteny to human chromosome 6 are drawn as smaller chromosomes along the sides.  The colours of the syntenic blocks represent specific mouse chromosomes.

Compare with a different species, or change the central chromosome, using the menus beneath the figure.  A gene list comparing genes within the regions of synteny in both species can also be found below the image.  Read the page help for more.

The second way is to draw syntenic blocks along the chromosome in the Location tab, Region overview.  Click the Configure this page tool button at the left of the view to add synteny for one or multiple species.

If you have a topic you would like to see in a ‘navigation tip’ post, please leave us a comment, or email our helpdesk.


Ensembl is in the process of moving its site search to the open source Apache Lucene framework. This change should bring several advantages, not only to us but to all users, the main one being added flexibility. In the short term it will have little impact on web site users, except for making life easier for those maintaining local instances.

From Ensembl release 62 (due out this spring) we will incorporate more data into the search (for example help and documentation) and start to improve how we display results. For developers, note that whilst we are not releasing the webcode for Lucene immediately, we are aiming to do so for release 62.

This powerful platform allows searching of over 3 million genes and gene symbols, over 6 million oligo probes, and over 67 million variations! Our implementation utilises software designed and developed by our colleagues at the European Bioinformatics Institute (used in the EB-eye) which has proven to be fast and flexible.

Lucene is open-source technology that has also been implemented to provide searches of our mailing lists (i.e. announce and dev), thanks to our colleagues at the Wellcome Trust Sanger Institute.

We hope these improvements will help make browsing Ensembl a more user-friendly experience. Please give your feedback at helpdesk@ensembl.org.


Now that Ensembl Genomes has moved onto the 60 code base, all the goodies in Ensembl’s user data site are available across all 304 species in Ensembl Genomes, from tiny bacteria to crazy big plants.

One of these is the upload of BAM, Wig and Bed files in the location-based views. For example, suppose you have the following file specifying a SNP in Drosophila:

2L 356013 356013 C/T

By using the Variant Effect Predictor, accessible from Tools on the top of each page, or the ‘Manage your data’ link at the left of pages, one can get the effect (synonymous, non synonymous, in a UTR etc) for the variant.

(Click on “Variant Effect Predictor”, upload a text file of chr, position and allele, and hit go.) This can also be run as a script using the API (the database is internet accessible, so you just need an internet connection, Perl, the MySQL client libraries and the Ensembl code base installed).
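If you build such input files programmatically, it helps to be clear about what each column means. The following Python sketch parses the example line into named fields. It assumes whitespace-separated columns of chromosome, start, end and an allele string with the reference allele written before the slash. The `Variant` type and `parse_variant` function are made up for this example and are not part of the Ensembl API.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str
    alt: str

def parse_variant(line):
    """Parse one 'chrom start end ref/alt' line; extra columns are ignored."""
    chrom, start, end, alleles = line.split()[:4]
    # Split only on the first slash, so 'C/T/G' keeps 'T/G' as the alt string.
    ref, alt = alleles.split("/", 1)
    return Variant(chrom, int(start), int(end), ref, alt)

variant = parse_variant("2L 356013 356013 C/T")
print(variant)
```

For a SNP the start and end are the same position, as in the example line.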

In addition, the full set of visualisation tools for your own data is also now accessible for all Ensembl Genomes species. For example, this bedGraph file:

track type=bedGraph name="BedGraph Format" description="BedGraph format" priority=20
2L 21302000 21302300 -1.0
2L 21302300 21302600 -0.75
2L 21302600 21302900 -0.50
2L 21302900 21303200 -0.25
2L 21303200 21303500 0.0
2L 21303500 21303800 0.25
2L 21303800 21304100 0.50
2L 21304100 21304400 0.75

This file will render as a nice little variable-height picture in the main contigview display. Other options are available (like Bed and Wig format), many of which people will know from UCSC. Come, try it out, and give us feedback at helpdesk@ensembl.org.
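Before uploading, it can be handy to sanity-check a bedGraph file. Here is a minimal Python sketch that reads the example above, skips the track line, and reports the value range that the variable-height display would have to cover. The `parse_bedgraph` function is an illustration only and has nothing to do with how the website itself reads the file.

```python
BEDGRAPH = """\
track type=bedGraph name="BedGraph Format" description="BedGraph format" priority=20
2L 21302000 21302300 -1.0
2L 21302300 21302600 -0.75
2L 21302600 21302900 -0.50
2L 21302900 21303200 -0.25
2L 21303200 21303500 0.0
2L 21303500 21303800 0.25
2L 21303800 21304100 0.50
2L 21304100 21304400 0.75
"""

def parse_bedgraph(text):
    """Return a list of (chrom, start, end, value) tuples from bedGraph text."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and track/browser header lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()
        intervals.append((chrom, int(start), int(end), float(value)))
    return intervals

intervals = parse_bedgraph(BEDGRAPH)
values = [value for _, _, _, value in intervals]
print(f"{len(intervals)} intervals, values from {min(values)} to {max(values)}")
```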

The Ensembl 60 release sees two changes in our data upload capabilities.

First off, Ensembl can now “attach” a BAM file. BAM is the compressed binary form of SAM (Sequence Alignment/Map), which has become the dominant way to package up next-generation sequencing data. A BAM (or SAM) file holds both the sequence and the alignment of a set of reads in a compact form (BAM makes it even more compact). Critically, you can index a BAM file, allowing programs rapid access to particular “slices” of the reads by genomic position. Alignment tools such as Maq, BWA and SOAP can produce BAM files; a variety of analysis tools are written around BAM files, and now Ensembl can view BAM files.

To make a BAM file viewable, you need access to a web server where you can put files (such as your local web space, perhaps provided by your institution). Call it MyGreatExperiment.bam. You then need to index the BAM file using one of the tools (samtools is the usual one), making MyGreatExperiment.bam.bai (the BAM index) right alongside it. (The Ensembl code assumes that the index is called filename.bai.) Then go to the “Manage your Data” button on any web page in Ensembl, and go to the “Attach BAM” section. Then browse your RNA-seq, ChIP-seq and exome data to your heart’s content!
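The naming rule for the index is easy to get wrong, so here is a tiny Python sketch of it. The `bam_index_url` helper and the example.org address are made up for illustration. The only thing taken from the description above is that the index sits next to the BAM file under the same name with .bai appended, which is also the name `samtools index MyGreatExperiment.bam` writes by default.

```python
def bam_index_url(bam_url):
    """Return the expected index location for a BAM URL: the same URL plus '.bai'."""
    if not bam_url.endswith(".bam"):
        raise ValueError("expected a URL ending in .bam")
    return bam_url + ".bai"

# Hypothetical location of an uploaded BAM file.
bam_url = "http://example.org/data/MyGreatExperiment.bam"
index_url = bam_index_url(bam_url)
print(index_url)
```

Both URLs need to be reachable from the web before you attach the file.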

In addition, we’ve spruced up our functionality and documentation on the UCSC file formats of Bed, BedGraph and Wig. Take a look at the “File Upload” and “Attach URL” forms, and the documentation. Now we precisely indicate what attributes you can use in each of these formats. Our goal is to make Ensembl as useful as possible to as broad a set of users as possible, so let us know if you find something confusing and/or you have a Bed/Bedgraph/Wig file that works for UCSC but doesn’t work on Ensembl.
This is of course available across all 50 species in Ensembl, and in a couple of weeks, when Ensembl Genomes 7 is out, across another 50 eukaryotes from protists to plants and about 250 different bacteria.
Comments are welcome, either on this blog or by email to our helpdesk.


Ensembl is always extending the variation pages to include more information. Did you know that the latest data from SNPedia is now available?

SNPedia is a wiki-style resource for human genetics with public annotation of over 11,000 SNPs, released under a Creative Commons style license. We have integrated it into Ensembl, so you can view these SNP reports along with our other information including variations, genotype and allele frequencies from dbSNP, and SNPs from other sources including UniProt, Affymetrix and Illumina chipsets and phenotype annotations from several genome-wide association studies.

You need to configure the page to view SNPedia. From the variation page, e.g. rs1333049, click on “Configure this page” and then click on “External Data” to select SNPedia to appear in the left hand side menu of all variation pages via DAS. As this information comes directly from SNPedia via DAS it is always up-to-date.