Custom data upload: creating URLs for large files

Did you know you can upload your own data for display alongside the reference genomes in Ensembl? For some file types, and for any file larger than 20MB, you will need to attach the data by URL rather than uploading it from your local directory. It’s not difficult to create these URLs, but there are quite a few steps, so read on to find out how!

Which file formats require a URL to attach?

We support these file types:

Those with an asterisk (*) next to their name can only be attached by URL. All the others can either be uploaded from your directory or attached from a URL, depending on the file size.
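As a rough illustration of the size rule, the sketch below decides between a direct upload and a URL attachment. It assumes that "20MB" means 20 × 1024 × 1024 bytes, and the function names are made up for this example. It does not cover the formats that must always be attached by URL.

```shell
#!/bin/sh
# Sketch: decide whether a file can be uploaded directly or needs a URL.
# ASSUMPTION: the 20MB limit is treated as 20 * 1024 * 1024 bytes.
LIMIT_BYTES=$((20 * 1024 * 1024))

# Print the size of a file in bytes (wc -c reads the file from stdin).
size_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Print "url" if the given size in bytes is over the limit, else "upload".
attach_method() {
    if [ "$1" -gt "$LIMIT_BYTES" ]; then
        echo "url"
    else
        echo "upload"
    fi
}

# Example (my_track.bed is a hypothetical file name):
#   attach_method "$(size_bytes my_track.bed)"
```

The size lookup and the decision are kept separate so that the threshold is easy to change.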

Where to host your URLs & files

CyVerse is a web-based resource developed with the aim of creating a ‘cyber infrastructure’ for life sciences, providing a platform for storing data, hosting bioinformatics tools, cloud services, and more. Importantly, it is a free service which makes creating URLs for attaching to genome browsers very simple.

There are other resources, such as FigShare, that you may prefer to work with (GitHub is not reliable for URL attachment). We will explore how to create URLs with CyVerse and Cyberduck. All software is free to use.

Unfortunately, attaching files by URL is not a simple one-click process. Before you can attach your file, you need to do some fairly standard housekeeping, such as zipping, sorting and indexing.

Step 1: Sorting and Indexing files

Depending on your file type, you may need to create a sorted and indexed version of the file before you can upload it. This requires the command line (Terminal on a Mac, or Ubuntu on Windows). It is necessary for attaching larger files by URL, particularly BAM, BigWig, BigBed or large VCF files (further details).

You will need to have the following installed:

After running these you should have two files: a sorted and compressed file, and an index for that file. Here are the commands you will need for VCF files for example:

  1. bgzip file_to_upload.vcf
  2. vcf-sort file_to_upload.vcf.gz | bgzip -c > file_to_upload.sorted.vcf.gz #sorting is not needed for non-VCFs
  3. tabix -p vcf file_to_upload.sorted.vcf.gz

Figure 1: An example of the process of zipping, sorting and indexing a file.
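For BAM files the equivalent steps might look like the sketch below. It assumes samtools is installed, which this post does not state, and the helper names and the DRY_RUN switch are made up for illustration. With DRY_RUN=1 the commands are printed instead of run, so you can check them first.

```shell
#!/bin/sh
# Sketch: coordinate-sort and index a BAM file.
# ASSUMPTION: samtools is installed and on your PATH.

# Run a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# prepare_bam reads.bam  ->  reads.sorted.bam and reads.sorted.bam.bai
prepare_bam() {
    in="$1"
    sorted="${in%.bam}.sorted.bam"
    run samtools sort -o "$sorted" "$in"   # sort by coordinate
    run samtools index "$sorted"           # write the .bai index
}

# Example (reads.bam is a hypothetical file name):
#   DRY_RUN=1; prepare_bam reads.bam
```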

Step 2: Getting started with CyVerse

Once you have your indexed files then you’re able to upload to CyVerse.

Firstly, you will need to create a (free) account with CyVerse: https://user.cyverse.org/register. Once logged in, you will see two options, Data Commons and Discovery Environment. Click LAUNCH on the Discovery Environment option (Figure 2). You may be prompted to log in again to the Discovery Environment. This is normal: simply click on the orange login button and enter your details for the CyVerse site.

Figure 2: Enter the ‘Discovery Environment’!

There should be a folder in your directory (named after your username) containing a folder called analyses. If you have multiple analyses, you can create a new folder by clicking on File > New Folder…

Figure 3: Discovery environment in CyVerse

 

To upload your file, click on the Upload button at the top-left of the pop-up. There are two ways to upload files, and again the method depends on the size of the file.

 

Step 2a: Small file (<1.9GB) upload

Click on Simple Upload from Desktop. Click on Choose file and upload your file, and the index file if required (you will find out at the next step whether it is or not!).

Figure 4: Simple data upload for files smaller than 1.9GB in CyVerse

Click the Upload button to close the pop-up and upload your files. If your files do not appear in the folder straight away, click the Refresh button at the middle-top of the pop-up and they should appear.
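Before uploading, it can help to check that the index sits next to the data file. The sketch below assumes the default index names written by samtools index (.bai) and tabix (.tbi), and that BigWig and BigBed files need no separate index. The function names are made up for this example.

```shell
#!/bin/sh
# Sketch: check that a data file and its index are both present.
# ASSUMPTION: indexes use the default names from samtools index and tabix.

# Print the expected index file name, or nothing if no separate index
# is expected (e.g. BigWig, BigBed) or the format is not recognised.
index_name() {
    case "$1" in
        *.bam)    echo "$1.bai" ;;
        *.vcf.gz) echo "$1.tbi" ;;
        *)        : ;;
    esac
}

# Return 0 if the file, and its index when one is expected, both exist.
ready_to_upload() {
    [ -f "$1" ] || { echo "missing: $1" >&2; return 1; }
    idx=$(index_name "$1")
    if [ -n "$idx" ] && [ ! -f "$idx" ]; then
        echo "missing index: $idx" >&2
        return 1
    fi
    return 0
}

# Example (hypothetical file name):
#   ready_to_upload file_to_upload.sorted.vcf.gz && echo "ready"
```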

Step 2b: Large (>1.9GB) file upload with Cyberduck

If your files are larger than 1.9GB then you will need a file transfer client. We’re using Cyberduck, another free piece of software. Full instructions on how to download and install Cyberduck can be found on the CyVerse wiki page. Once installed, carry out the following steps:

  1. Click on the download link from the CyVerse wiki to download the portal for Cyberduck.
  2. Open Cyberduck and click on the file (iPlant Data Store.cyberduckprofile) in your downloads folder.
  3. A pop-up will appear asking you to configure the profile. The server and port fields should already be populated. If not, fill them in (look at the CyVerse wiki page or copy the image below). You need to add your CyVerse username (e.g. ehaskell).
  4. The profile will now appear in the Cyberduck application (Figure 5).
  5. Double click on the CyVerse profile. You will be prompted to log in with your CyVerse account and password. Make sure it is only your username, and not the server name also (e.g. username and not username@data.cyverse.org).
  6. Your profile on the CyVerse Discovery Environment will now appear here. You can click and drag files from your local documents into the Cyberduck window, and they will appear in the CyVerse Discovery Environment.
  7. Be patient: If you have large files this might take a while!

 

Step 3: Generate the URL in CyVerse for import to Ensembl

This process is the same regardless of how you imported the files to CyVerse (i.e. by simple upload or upload by Cyberduck). Now we can generate the URL to use in Ensembl (or other genome browsers). Click on the tick box to select your file (the file, not the index). Click on Share > View in Genome Browser.

Figure 6: Get your URL for import into Ensembl

 

You can just copy the URL and go directly to the homepage of your species of interest in Ensembl. The link from CyVerse to Ensembl takes you to the species list at www.ensemblgenomes.org. This is a sister site to Ensembl that covers bacteria, fungi, metazoa, plants and protists, but it does not have all vertebrate species and you can only search using the Latin name. Either way, you need to get to the homepage of your species of interest.

From the homepage www.ensembl.org, find the link to View full list of Ensembl species just below the main search box. Use the search box in the top right of the table to find your species. Click on the link in the first column to go to the homepage. Click the Display your data in Ensembl link and a pop-up will appear. Paste your URL in the data box. The ‘Data format’ section should automatically detect the file type and update accordingly. If it doesn’t, you can click on the drop-down menu and choose a different file type.

 

Figure 7: Where to upload your files in Ensembl

 

You should get a message to tell you that your file has attached successfully. For some file types there may also be a link to take you to an example region where there is data.

Once you have gone to the Location tab and to a genomic region, your data will appear as a track, just like any other type of data in Ensembl. If you click on Configure this page, your track will appear by default in the ‘Active tracks’ section. You can also click the Personal data tab at the top of the Configure this page pop-up to view and manage your custom tracks.

Figure 9: How to configure and manage your data tracks once uploaded into Ensembl

Different files will have different styles. BigWig files may require some configuration of the y-axis limits, which you can read more about in an earlier blog post.

Step 4: Give it a go!

If you want to have a go, attach the BAM from this CyVerse URL:

https://de.cyverse.org/anon-files/iplant/home/ehaskell/analyses/Demo%20BAM%20files/GRCh38.20.illumina.merged.1.bam

Go to the region for the gene CYP24A1, 20:54153449-54173973. Zoom in to view two or three exons of this gene.
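If you would rather jump straight to the region, the sketch below works out its length and builds a link to the Location view. It assumes the species path is Homo_sapiens (the demo file name mentions GRCh38), and the function names are made up for this example.

```shell
#!/bin/sh
# Sketch: work with a region string of the form chromosome:start-end.

# Print the length of the region in base pairs (both ends inclusive).
region_length() {
    range=${1#*:}        # drop the "chromosome:" prefix
    start=${range%-*}
    end=${range#*-}
    echo $((end - start + 1))
}

# Print an Ensembl Location view URL for a species and a region.
# ASSUMPTION: the species path for this demo is Homo_sapiens.
location_url() {
    echo "https://www.ensembl.org/$1/Location/View?r=$2"
}

# Example:
#   region_length 20:54153449-54173973
#   location_url Homo_sapiens 20:54153449-54173973
```

The region above spans 20,525 bp, so you will need to zoom in to see individual reads over two or three exons.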

Figure 10: How our BAM demo track looks in the region in detail view

 

Warning! If you have attached several large files, it is recommended to disconnect or turn off the tracks when you’re not looking at that data, otherwise the load times may be quite long.