Human BodyMap 2.0 data from Illumina

I’d like to introduce you to an exciting new data set that we’ve added in Ensembl release 62: RNASeq data from Illumina’s Human BodyMap 2.0 project. The data, generated on HiSeq 2000 instruments in 2010, cover 16 human tissue types: adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells. Raw reads are available for download here. For each tissue, we have aligned the raw reads to the genome and then linked exons into tissue-specific transcript models using the reads that span an exon-exon boundary.

You can view these data in the Region in Detail view. Click on ‘Configure this page’ and choose ‘RNA-Seq’ at the left of the main panel. Enable any or all of the 32 tracks and then close the configuration panel. Of these 32 tracks, 16 are tissue ‘gene model’ tracks and 16 are ‘intron’ tracks.

The ‘gene model’ track shows you a transcript model. The ‘intron’ track shows you how many raw reads aligned across an exon-exon junction. The taller the intron block, the more reads support that junction, which suggests that the transcript isoform is more highly expressed.


In this example, the kidney gene model track shows a transcript (dark blue) with an exon structure that matches the gold-coloured Ensembl transcript AQP6-001. The kidney transcript model includes coding and noncoding exons (the empty box is UTR, and the filled boxes are coding exons).
Click on the kidney intron track to see that 192 raw reads were split between the first and second exons.

This example is interesting because it shows a gene with high expression in kidney tissue, and almost no expression in any other tissue.

The high read coverage for kidney means that the transcript’s exon-intron structure produced for the gene track has a good chance of being correct. When read coverage is very low, it is not always possible to build a full-length transcript model: look at the colon and brain intron tracks to see that two colon reads and three brain reads have aligned across the transcript’s middle exon-exon junction. Although this read coverage is low, our pipeline has generated a transcript model for brain tissue. The pipeline, however, was not able to predict the two splice junctions on either side because no raw reads from brain aligned across them.

Below is a nice example of a gene that seems to be expressed in all 16 tissues, spermidine synthase (SRM).

Try dump_transcripts.pl as an example script to access the RNAseq-based transcript models. Have fun with these new data!

37 thoughts on “Human BodyMap 2.0 data from Illumina”

  1. Dear Vinay,
    I just tried all the links and they work for me. Can you try it again and give us the error if you can’t get the data?
    Also the data is stored at the EBI, so if you can access the EBI web page but can’t download the file, you will have to email them because we have no access to this data.

    Regards

  2. Dear NoaR,
    It is not possible to download the relative expressions per tissue per gene.

    But you can get the number of spanning reads for each intron of each model created using the API and the RNA-Seq database homo_sapiens_rnaseq_65_37 available on the public MySQL server ensembldb.ensembl.org. Use the Bio::EnsEMBL::DBSQL::DnaAlignFeatureAdaptor class and then retrieve the score from the DnaAlignFeature object.

    Regards

  3. Dear Thibaut,

    I would appreciate it if you could add some information about the pipeline used to produce the ‘intron’ and ‘gene model’ tracks.
    Did you use only paired-end reads as the initial data for the pipeline?

    Thanks!

  4. Hi Thibaut,

    Do you have any resources where we can find out more about Illumina’s body map project?

    Thank you!

    • Hi Melissa,
      If you follow the link to the raw sequences, here, you will have more information on the project.
      If it’s not enough, the best thing would be to email Gary Schroth who is the contact for the project.

      Hope this will help!

  5. Hi Thibaut,

    Is it possible to export the ‘intron’ tracks from Ensembl?
    I couldn’t find this option using the ‘Export data’ link.

    Thanks!

  6. Hi,
    Is there an FTP site for the BAM files so I can load them into the Ensembl browser? I’ve only found FASTQ files.

    • Hi Sebastian,
      At the moment we have no knowledge of BAM files for the Human Body Map. We are planning to generate BAM files but there is no release date.

      Regards

  7. Hi Thibaut,
    You mentioned above that, “For each tissue, we have aligned the raw reads to the genome and then linked exons into tissue-specific transcript models using the reads that span an exon-exon boundary.” How can I download the raw data for these exon junctions or boundaries, the raw counts, and the RPKM for each of these 16 tissues? I read some docs about using the Perl API to access data from the Ensembl core schema. However, I couldn’t find information on which classes or methods I should use to extract the type of data I’m looking for. Any advice is greatly appreciated.

  8. Hi Thibaut,
    By raw data, I mean the coordinates of the exon boundaries. Any suggestion on how I can retrieve that from the rnaseq database?

    Thanks again and I apologize for the confusion.

    • Hi Jerry,
      The exon boundaries you will be able to get are only the boundaries of our models. We can’t guarantee that they are the right ones, but they are the ones with the most support.
      To get the coordinates of the exon boundaries you will need to use the module Bio::EnsEMBL::Exon.
      Use the example from my previous reply, and then you can get your exons via the Bio::EnsEMBL::Transcript module:
      # Print the start and end of each exon in the transcript model
      foreach my $exon (@{$transcript->get_all_Exons}) {
        print "Start: ", $exon->start, " End: ", $exon->end, "\n";
      }
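
      The junctions shown in the ‘intron’ tracks lie between consecutive exons, so their coordinates follow from the exon boundaries. Here is a minimal sketch in plain Perl, assuming the exons are given as [start, end] pairs in the 1-based inclusive coordinates that $exon->start and $exon->end return. The name introns_from_exons and the coordinates are made up for illustration and are not part of the Ensembl API.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper (not part of the Ensembl API): given exons as
      # [start, end] pairs in 1-based inclusive genomic coordinates, return
      # the introns between consecutive exons as [start, end] pairs.
      sub introns_from_exons {
          my @exons = sort { $a->[0] <=> $b->[0] } @_;
          my @introns;
          for my $i (0 .. $#exons - 1) {
              my $intron_start = $exons[$i][1] + 1;
              my $intron_end   = $exons[$i + 1][0] - 1;
              # Touching or overlapping exons leave no intron between them.
              next if $intron_start > $intron_end;
              push @introns, [$intron_start, $intron_end];
          }
          return @introns;
      }

      # Made-up coordinates, purely for illustration.
      my @exons = ([1000, 1200], [1500, 1650], [2000, 2300]);
      for my $intron (introns_from_exons(@exons)) {
          print "Intron start: $intron->[0] end: $intron->[1]\n";
      }
      ```

      Sorting by start position means the same code works whichever order the exons are returned in, for example for a transcript on the reverse strand.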

      I suggest you subscribe to the Ensembl Dev mailing list; the entire Ensembl community will be able to help you when needed: http://lists.ensembl.org/mailman/listinfo/dev

      Thibaut

  9. I ran the script dump_transcripts.pl listed above and got a whole bunch of errors such as those listed below. Could the new Ensembl release 68 have something to do with it, if modules were changed? I re-installed the Ensembl API but it still doesn’t work. Perhaps the names of the modules changed, so they need to be changed in the script as well? Please advise. Thank you!

    Error Output:
    -------------------- WARNING ----------------------
    MSG: 'Bio::EnsEMBL::DBSQL::GeneAdaptor' cannot be found.
    Exception Can't locate Bio/PrimarySeqI.pm in @INC (

    BEGIN failed--compilation aborted at /home/jli/src/ensembl/modules/Bio/EnsEMBL/Slice.pm line 62.
    Compilation failed in require at /home/jli/src/ensembl/modules/Bio/EnsEMBL/DBSQL/SliceAdaptor.pm line 99.
    BEGIN failed--compilation aborted at /home/jli/src/ensembl/modules/Bio/EnsEMBL/DBSQL/SliceAdaptor.pm line 99.
    Compilation failed in require at /home/jli/src/ensembl/modules/Bio/EnsEMBL/DBSQL/GeneAdaptor.pm line 67.
    BEGIN failed--compilation aborted at /home/jli/src/ensembl/modules/Bio/EnsEMBL/DBSQL/GeneAdaptor.pm line 67.
    Compilation failed in require at (eval 8) line 3.

    FILE: Bio/EnsEMBL/Registry.pm LINE: 1015
    CALLED BY: dump_transcripts.pl LINE: 88
    Date (localtime) = Thu Aug 23 16:54:56 2012
    Ensembl API version = 68

    • Hi Jerry,

      Your question is a technical one and a response to it could benefit others trying to use the BodyMap data. It has now been answered on the Ensembl developers mailing list (http://lists.ensembl.org/mailman/listinfo/dev); please refer to the reply to your email there.

      Best regards,
      Amonida

  10. I’d like to download the FPKM values or the read counts for each of the samples. Is there a place I can obtain these?

    Thank you,
    Teja

    • Hi Teja,
      We don’t provide FPKM values.
      But we provide the BAM files containing the raw alignments of the reads on the Ensembl FTP site. You can then use samtools or any other SAM/BAM tool to retrieve the data you want.
      You can also look at the intron supporting evidence in the RNASeq database (Public databases) via the API. We store the number of intron-spanning reads for each model generated.

      Hope this helps
      Thibaut

      • I would like to know whether the BAM files were generated from a single-read experiment or from a paired-end experiment.

        Thanks

        Cycy

        • Hi Cycy,
          All the reads for the pooled set are 100bp single end.
          For each tissue we had 50bp paired end reads and 75bp single reads.

          Hope this helps,
          Thibaut

          • Hi Thibaut,

            How were these BAM files generated, then? With Bowtie, BWA, or another aligner? Thanks!

          • Hi Daofeng,
            We used BWA 0.5.9 to align the reads onto the genome.

            Regards
            Thibaut

  11. Is it possible to put in a sequence and have it mapped to the RNAseq data to determine how frequently the query sequence has been detected in the database?

    • Hi Robert,
      No, it is not possible. We only store the number of intron-spanning reads in the database, so there is no easy way.

      What you can do, although it requires some knowledge of writing scripts, is to map your query sequence onto the genome and then query the BAM files to get the number of reads in the region where your query sequence aligned. You will have to filter the reads depending on how many mismatches you want to allow in your count.
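
      The counting and filtering step could look roughly like the sketch below. It assumes the alignments for the region have already been exported as SAM text (for example with samtools view on a BAM file and a region), and it uses the NM tag as the filter. Note that NM is the edit distance, so it counts inserted and deleted bases as well as mismatches. The helper names reference_span and count_reads, and the SAM records, are invented for illustration.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: reference length covered by an alignment, from its
      # CIGAR string. M, D, N, = and X consume reference bases; I, S, H, P do not.
      sub reference_span {
          my ($cigar) = @_;
          my $span = 0;
          while ($cigar =~ /(\d+)([MIDNSHP=X])/g) {
              $span += $1 if index('MDN=X', $2) >= 0;
          }
          return $span;
      }

      # Hypothetical helper: count SAM records that overlap [start, end] on
      # $chr and whose NM tag (edit distance) is at most $max_nm.
      sub count_reads {
          my ($lines, $chr, $start, $end, $max_nm) = @_;
          my $count = 0;
          for my $line (@$lines) {
              next if $line =~ /^@/;                      # skip header lines
              my @f = split /\t/, $line;
              my ($flag, $rname, $pos, $cigar) = @f[1, 2, 3, 5];
              next if $flag & 4;                          # unmapped read
              next if $rname ne $chr || $cigar eq '*';
              my $read_end = $pos + reference_span($cigar) - 1;
              next if $read_end < $start || $pos > $end;  # no overlap
              my ($nm) = $line =~ /\tNM:i:(\d+)/;
              next unless defined $nm && $nm <= $max_nm;  # too many differences
              $count++;
          }
          return $count;
      }

      # Made-up SAM records, purely for illustration.
      my @sam = (
          join("\t", 'r1', 0,  '12', 100, 37, '50M', '*', 0, 0, 'A' x 50, 'I' x 50, 'NM:i:0'),
          join("\t", 'r2', 16, '12', 120, 37, '50M', '*', 0, 0, 'A' x 50, 'I' x 50, 'NM:i:1'),
          join("\t", 'r3', 0,  '12', 130, 37, '50M', '*', 0, 0, 'A' x 50, 'I' x 50, 'NM:i:4'),
          join("\t", 'r4', 0,  '12', 900, 37, '50M', '*', 0, 0, 'A' x 50, 'I' x 50, 'NM:i:0'),
      );
      # Reads overlapping 12:140-160 with an edit distance of at most 2.
      print count_reads(\@sam, '12', 140, 160, 2), "\n";
      ```

      With these invented records the script prints 2: r3 overlaps the region but has too many differences, and r4 lies outside it.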

      The BAM files contain all the reads that we used. You can also create a blast database from the BAM files.

      Hope this helps
      Thibaut

  12. I downloaded the raw data and processed it with FastQC. The first bases of the paired-end samples are abnormal, and the last bases of the single-end samples are abnormal too. I suspect adapter, linker or primer contamination. From E-MTAB-513.sdrf.txt I got the linker information for the single-end samples, which seems to explain the abnormal bases at the end of those reads, but for the paired-end samples I have no idea. Is there any information about the samples’ adapters, linkers or primers?

    • Hi Yuting,
      We did not produce these data. You should contact Gary P Schroth at Illumina, gschroth@illumina.com, who should be able to give you some answers.

      Regards
      Thibaut

  13. Hello,

    Is the data from diseased or normal human tissue? It seems to be assumed normal, but I see no information confirming this. Can someone confirm this, or point to where the tissues originated from (e.g. healthy victims of accidents)?

    thanks

  14. How can I download all the fastq files? The web links are useful, but they are really slow when I want to download all the fastq files.

    Thank you,
    Teja

    • Hi Teja,
      The fastq files are hosted on the ArrayExpress website. You will need to contact them to find a better/faster way to download the fastq files.

      Hope this helps
      Thibaut

    • Hi Varun,
      We received the Illumina BodyMap data directly from Illumina. ArrayExpress has given different names to the reads. If you look in the fastq files, you will see two different names for each read. The first one is the one you have in your BAM files; the second one is the one that will be in the BAM files provided by Ensembl, which use the names in the original BAM files.
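
      To match reads between the two naming schemes, the two names can be pulled out of each fastq header line. This is a minimal sketch, assuming the header holds the first name, then whitespace, then the second name. The helper split_read_names and the header shown are invented for illustration, and real read names will look different.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: split a FASTQ header line into its two read
      # names (first token after '@', then the next whitespace-separated token).
      sub split_read_names {
          my ($header) = @_;
          $header =~ s/\s+\z//;    # trim trailing newline/whitespace
          return unless $header =~ /^@(\S+)\s+(\S+)/;
          return ($1, $2);
      }

      # Invented header, purely for illustration; real names will differ.
      my ($first, $second) = split_read_names("\@RUN0001.1 MACHINE_0001:1:1:1000:2000/1\n");
      print "$first => $second\n";
      ```

      Collecting these pairs into a hash keyed on the first name would let you translate the read names in your own BAM files into the names used in the Ensembl ones.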

      Hope this helps
      Thibaut

  15. Dear Sir/Madam,

    I am using these RNA-seq data. How can I cite them in my work, please?

    Please let me know.

    Thanks