The ENCODE project is generating an immense amount of data. ChIP-Seq, ChIP-ChIP, and RNA-Seq experiments are yielding an exciting number of sequences that may be involved in gene regulation. Looking at these data sets one at a time would be more work, and less powerful, than viewing them in combination. The Ensembl Regulatory Build takes data such as CTCF binding sites, DNaseI hypersensitive sites, transcription factor binding sites, and histone modification profiles, and integrates them into ‘regulatory features’. As a result, for any regulatory feature in a given region of the human genome, our users can learn 1) whether there is data supporting that sequence as being involved in gene regulation (for example, whether it is promoter associated) and 2) how much data supports the sequence as a regulatory feature.
For example, if we zoom in to BRCA2 in the Ensembl region in detail view, the regulatory feature at the 5’ end of the BRCA2-003 transcript is supported by data from numerous experiments (shown in the figure below). Click on the grey block in the Reg. Feats. track in the region in detail view to see them.
You can even view regulatory features for a specific cell type. Click on the Configure this page tool button at the left of the region in detail view to do so. Or, try the regulation tab (get there by clicking on a regulatory feature’s stable ID, such as ENSR00000054736 in the image above).
Questions or comments about these data or views are welcome. Reply to this post, or email our helpdesk by using the Contact us link at the top right of this blog.
The Ensembl release 61 mouse gene set (released on 1 February 2011) includes updated automatic annotation for the mouse NCBIM37 assembly. The automatically annotated transcript models match 94.6% of RefSeq proteins exactly.
We merged the results of this automated annotation with Havana’s manual annotation to produce the final gene set that you see displayed on our website. (We are currently merging in Havana’s mouse annotation every other release.) The final gene set has also been used as an input set for a new round of CCDS comparisons. As a result, the number of genes in the mouse CCDS set has increased.
Due to feedback on a recent survey, we decided to post some pointers to functionality that may not have been discovered by many of our users. One specific case is viewing syntenic regions calculated by the Ensembl comparative genomics team (Compara).
How can we view these regions in the Ensembl genome browser?
There are two ways. The first way is to go to the dedicated Synteny view. This view, found in the location tab, offers a comparison of syntenic regions between two species.
In the above figure, a chromosome of interest (human chromosome 6) is depicted in the centre, and any mouse chromosomes with synteny to human chromosome 6 are drawn as smaller chromosomes along the sides. The colours of the syntenic blocks represent specific mouse chromosomes.
Compare with a different species, or change the central chromosome, using the menus beneath the figure. A gene list comparing genes within the regions of synteny in both species can also be found below the image. Read the page help for more.
The second way is to draw syntenic blocks along the chromosome in the Location tab, Region overview. Click the Configure this page tool button at the left of the view to add synteny for one or multiple species.
If you have a topic you would like to see in a ‘navigation tip’ post, please leave us a comment, or email our helpdesk.
Ensembl is in the process of moving its site search to the open source Apache Lucene framework. This change should bring several advantages, not only to us but to all users, the main one being added flexibility. In the short term it will have little impact on web site users, except for making life easier for those maintaining local instances.
From Ensembl release 62 (due out this spring) we will incorporate more data into the search (for example help and documentation) and start to improve how we display results. For developers, note that whilst we are not releasing the webcode for Lucene immediately, we are aiming to do so for release 62.
This powerful platform allows searching of over 3 million genes and gene symbols, over 6 million oligo probes, and over 67 million variations! Our implementation utilises software designed and developed by our colleagues at the European Bioinformatics Institute (used in the EB-eye) which has proven to be fast and flexible.
Lucene is open-source technology that has also been implemented to provide searches of our mailing lists (i.e. announce and dev), thanks to our colleagues at the Wellcome Trust Sanger Institute.
We hope these improvements will help make browsing Ensembl a more user-friendly experience. Please give your feedback at email@example.com.
Now that Ensembl Genomes has moved onto the 60 code base, all the goodies in Ensembl’s user data site are available across all 304 species in Ensembl Genomes – from tiny bacteria to crazy big plants.
One of these is the upload of BAM, Wig and Bed files in the location-based views. For example, if you have the following file to specify a SNP in Drosophila
By using the Variant Effect Predictor, accessible from Tools on the top of each page, or the ‘Manage your data’ link at the left of pages, one can get the effect (synonymous, non synonymous, in a UTR etc) for the variant.
(Click on “Variant Effect Predictor”, upload a text file of chr, position, allele and hit go.) This can also be run as a script using the API (the database is internet accessible, so you just need an internet connection, Perl, MySQL client libraries and the Ensembl code base installed).
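For those who prefer to prepare the upload file with a script, here is a minimal Python sketch. The column layout used (chromosome, start, end, alleles as ref/alt, strand) is an assumption for illustration, as is the example SNP, so check the tool's own documentation for the exact format it expects.

```python
# Sketch: write a small variant file for upload to the Variant Effect Predictor.
# ASSUMPTION: a whitespace-separated layout of chromosome, start, end,
# allele string (ref/alt) and strand. This post only says "chr, position, allele".

def format_variant(chrom, position, ref, alt, strand="+"):
    """Return one line describing a single-base substitution."""
    # For a SNP, start and end are the same position.
    return f"{chrom} {position} {position} {ref}/{alt} {strand}"


def write_variants(path, variants):
    """Write one line per (chrom, position, ref, alt) tuple to a text file."""
    with open(path, "w") as handle:
        for chrom, position, ref, alt in variants:
            handle.write(format_variant(chrom, position, ref, alt) + "\n")


# Hypothetical Drosophila SNP, invented for illustration only.
example_variants = [("2L", 21302150, "A", "G")]
```

Calling write_variants("my_snps.txt", example_variants) would produce a one-line text file ready to upload (the file name is just a placeholder).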
In addition, the full set of visualisation tools for your own data is also now accessible for all Ensembl Genomes species. For example, this bedGraph file:
track type=bedGraph name="BedGraph Format" description="BedGraph format" priority=20
2L 21302000 21302300 -1.0
2L 21302300 21302600 -0.75
2L 21302600 21302900 -0.50
2L 21302900 21303200 -0.25
2L 21303200 21303500 0.0
2L 21303500 21303800 0.25
2L 21303800 21304100 0.50
2L 21304100 21304400 0.75
Will render a nice little variable height picture in the main contigview display. Other options are available (like Bed and Wig format) – many of which people will know from UCSC. Come, try it out, and give us feedback at firstname.lastname@example.org
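If your scores come out of an analysis script, a file like the one above can be generated rather than typed by hand. The following Python sketch rebuilds the same eight windows; the function name and its parameters are our own invention for illustration, not part of any Ensembl tool.

```python
# Sketch: generate bedGraph lines for fixed-width windows along one chromosome.
# The helper name and its arguments are hypothetical, chosen for this example.

def bedgraph_lines(chrom, start, step, scores,
                   name="BedGraph Format",
                   description="BedGraph format",
                   priority=20):
    """Return a track header line followed by one data line per score."""
    header = (f'track type=bedGraph name="{name}" '
              f'description="{description}" priority={priority}')
    lines = [header]
    for i, score in enumerate(scores):
        window_start = start + i * step
        window_end = window_start + step
        lines.append(f"{chrom} {window_start} {window_end} {score}")
    return lines


# The same eight 300 bp windows on 2L as in the example file above.
scores = [-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75]
lines = bedgraph_lines("2L", 21302000, 300, scores)
```

Joining the lines with newline characters and saving them to a text file gives something you can upload or attach by URL.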
The Ensembl 60 release sees two changes in our data upload capabilities.
First off, Ensembl can now “attach” a BAM file. BAM is the compressed, binary form of SAM (Sequence Alignment/Map), which has become the dominant way to package up next-generation sequencing data. A BAM (or SAM) file holds both the sequence and the alignment of a set of reads in a compact form (BAM makes it even more compact). Critically, you can index a BAM file, giving programs rapid access to particular “slices” of the reads by genomic position. Alignment tools such as Maq, BWA and SOAP can produce BAM files, a variety of analysis tools are written around BAM files, and now Ensembl can view BAM files.
To make a BAM file viewable, you need access to a website where you can put files (such as your local web space, perhaps provided by your institution). Call the file, say, MyGreatExperiment.bam. You then need to index the BAM file using one of the tools (samtools is the usual one), making a MyGreatExperiment.bam.bai (BAM index) file right alongside it (the Ensembl code assumes that the index is called filename.bai). Then go to the “Manage your Data” button on any web page in Ensembl, and go to the “Attach BAM” section. And then browse your RNA-seq, ChIP-seq and exome data to your heart’s content!
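The indexing step can be scripted. Here is a minimal Python sketch that shells out to samtools index, assuming samtools is installed and on your PATH. The helper names are our own, and copying both files to your web space is left to you.

```python
# Sketch: index a BAM file with samtools so the .bai sits next to the .bam.
# ASSUMPTION: the samtools executable is installed and on the PATH.
import shutil
import subprocess


def index_path(bam_path):
    """Name of the index file expected alongside the BAM (filename.bai)."""
    return bam_path + ".bai"


def build_index_command(bam_path):
    """Command line for indexing a coordinate-sorted BAM file."""
    return ["samtools", "index", bam_path]


def index_bam(bam_path):
    """Run samtools index and return the path of the resulting index."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools was not found on the PATH")
    subprocess.run(build_index_command(bam_path), check=True)
    return index_path(bam_path)
```

For example, index_bam("MyGreatExperiment.bam") should leave MyGreatExperiment.bam.bai in the same directory. Upload both files to the same web location before attaching the BAM URL.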
In addition, we’ve spruced up our functionality and documentation on the UCSC file formats of Bed, BedGraph and Wig. Take a look at the “File Upload” and “Attach URL” forms, and the documentation. Now we precisely indicate what attributes you can use in each of these formats. Our goal is to make Ensembl as useful as possible to as broad a set of users as possible, so let us know if you find something confusing and/or you have a Bed/Bedgraph/Wig file that works for UCSC but doesn’t work on Ensembl.
This is of course available across all 50 species in Ensembl, and in a couple of weeks, when Ensembl Genomes 7 is out, across another 50 eukaryotes from protists to plants and about 250 different bacteria.
Ensembl is always extending the variation pages to include more information. Did you know that the latest data from SNPedia is now available?
SNPedia is a wiki-style resource for human genetics with public annotation of over 11,000 SNPs, released under a Creative Commons style license. We have integrated it into Ensembl, so you can view these SNP reports along with our other information including variations, genotype and allele frequencies from dbSNP, and SNPs from other sources including UniProt, Affymetrix and Illumina chipsets and phenotype annotations from several genome-wide association studies.
You need to configure the page to view SNPedia. From the variation page, e.g. rs1333049, click on “Configure this page” and then click on “External Data” to select SNPedia to appear in the left hand side menu of all variation pages via DAS. As this information comes directly from SNPedia via DAS it is always up-to-date.
Want to be sure you are going to the right version of Ensembl? Whilst our archives can be accessed through the link at the bottom of each page, if you want to cite a particular version or access it directly, you previously needed to know the month and year of release to find the archive site (e.g. may2009.archive.ensembl.org).
Now, for the convenience of our users, we have introduced shortcuts that include the version number instead of the date. For example, typing:
into your browser will redirect you to the same May 2009 archive.
We have put these redirects in as far back as e30 – if the archive no longer exists, you will be directed to the next most recent one (unless doing so would mean a change of assembly, in which case you are redirected back to the last archive on your chosen assembly, if available).
P.S. Don’t forget the ‘e’ at the beginning – we can’t use plain numbers as it causes problems with DNS servers
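For the curious, the fallback rule described above can be sketched in a few lines of Python. This is our reading of the rule rather than the code running on our servers, and the release numbers and assembly names in the example are invented.

```python
# Sketch of the archive fallback rule, as we read it from the description above.
# All data below is hypothetical; this is not the production redirect code.

def resolve_archive(requested, assembly_of, available):
    """Pick the archive release to serve for a requested release number.

    requested   -- release number asked for (an int)
    assembly_of -- dict mapping every release number to its assembly name
    available   -- set of release numbers whose archive site still exists
    Returns a release number, or None if no suitable archive exists.
    """
    if requested in available:
        return requested

    wanted_assembly = assembly_of[requested]

    # First choice: the next most recent archive that still exists.
    later = sorted(r for r in available if r > requested)
    if later and assembly_of[later[0]] == wanted_assembly:
        return later[0]

    # That would change assembly (or nothing later exists), so fall back
    # to the last archive still on the requested assembly, if any.
    same_assembly = [r for r in available if assembly_of[r] == wanted_assembly]
    return max(same_assembly) if same_assembly else None


# Invented example: releases 30-32 on "asmA", 33-34 on "asmB";
# only the archives for releases 31 and 34 still exist.
assembly_of = {30: "asmA", 31: "asmA", 32: "asmA", 33: "asmB", 34: "asmB"}
available = {31, 34}
```

With this made-up data, a request for release 30 resolves to 31 (the next existing archive on the same assembly), while a request for release 32 also resolves to 31, because jumping forward to 34 would change the assembly.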
We have added a little trick for orthology-lovers. Starting from the orthologues page, you can choose to switch to the GeneTree. This will highlight the orthologue of interest, as well as the ancestral node that relates both genes.
Another useful feature added in Ensembl 57 is the possibility to display a set of genes (up to 10) using the Multi-Species view. Click on an internal node and select the “Jump to Multi-species view” option. This will show each of these genes in their respective genomic location, with genomic alignments when available.
Ensembl 57 includes the turkey genome, the third bird in Ensembl. We are now providing a 3-way avian multiple alignment (chicken, turkey and zebra finch) together with GERP constraint analysis. The image shows amniote and bird constrained elements on the chicken genome.
We have also added a new set of fish multiple alignments (stickleback, medaka, takifugu, tetraodon and zebrafish). GERP constraint analysis is available on fish genomes as well.