Accessing alternate sequences in human

Have you noticed any strange-looking chromosome names when browsing the human data? For example, you might notice sequence region names looking like “Chromosome HSCHR17_2_CTG4: 68,302,419-68,526,413” or “Chromosome HG75_PATCH: 34,442,621-34,976,908”.

These names refer to genomic sequence that differs from the genomic DNA on the primary assembly. Alternate sequences come in two types: allelic sequence (haplotypes and novel patches) and fix patches. Haplotypes are known variations of the primary assembly that reflect variability in the human genome sequence (e.g. the highly variable MHC locus, which contains the haplotypes HSCHR6_MHC_COX, HSCHR6_MHC_SSTO, HSCHR6_MHC_APD, HSCHR6_MHC_DBB, HSCHR6_MHC_MANN, HSCHR6_MHC_MCF and HSCHR6_MHC_QBL). Novel patches also represent new allelic loci, but they are not necessarily haplotypes. Fix patches cover regions where the primary assembly was found to be incorrect, and the patch holds the corrected sequence. Haplotypes, novel patches and fix patches are all determined by the Genome Reference Consortium (GRC), not by Ensembl.

In the Ensembl browser, as in the figure below, allelic sequences (haplotypic regions and novel patches) are coloured red and fix patches are coloured green. If you look at the top image in Region In Detail for chromosome 17, you’ll see examples of both types of alternate sequence.

 

There are several ways to view alternate sequences in Ensembl:

  • If you know the name of the sequence you’re looking for, you can find it by searching in our Search bar.
  • You can view alternate sequence regions in the top image of any Location page, e.g. Region In Detail, Region Overview or Chromosome Summary.
  • Some alternate sequences are available through BioMart.
  • If you’re comfortable using MySQL, you can access the list through the assembly_exception table as follows:

mysql -uanonymous -hensembldb.ensembl.org -P5306 -Dhomo_sapiens_core_62_37g -e "
  select sr2.name as chr_name, exc_seq_region_start, exc_seq_region_end, exc_type,
         sr1.name as alternate_seq_name, seq_region_start, seq_region_end
  from assembly_exception ae, seq_region sr1, seq_region sr2
  where sr1.seq_region_id = ae.seq_region_id
    and sr2.seq_region_id = ae.exc_seq_region_id
  order by chr_name, exc_seq_region_start"

Click here for the full list of e62 alternate sequences
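
The exc_type column in that query output tells you which kind of region each row describes. As a rough sketch, the codes can be mapped onto the categories used in this post. The four codes below (HAP, PATCH_NOVEL, PATCH_FIX, PAR) are what we understand the core schema to use, so treat them as an assumption and check them against your own query output. PAR rows describe pseudoautosomal regions rather than the alternate sequences discussed here, and classify_exc_type is just an illustrative helper name.

```perl
use strict;
use warnings;

# Assumed exc_type codes, mapped onto the categories described in this post.
my %category_for = (
    HAP         => 'allelic sequence (haplotype)',
    PATCH_NOVEL => 'allelic sequence (novel patch)',
    PATCH_FIX   => 'fix patch',
    PAR         => 'pseudoautosomal region',
);

# Return the category for an exc_type code, or 'unrecognised' for anything else.
sub classify_exc_type {
    my ($exc_type) = @_;
    return exists $category_for{$exc_type}
        ? $category_for{$exc_type}
        : 'unrecognised';
}

# Print the mapping for each assumed code.
foreach my $exc_type (qw(HAP PATCH_NOVEL PATCH_FIX PAR)) {
    printf "%-12s %s\n", $exc_type, classify_exc_type($exc_type);
}
```

To apply this to real data, pass the fourth column of each row returned by the query above to classify_exc_type.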

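  • If you use the Perl API, you can fetch every toplevel sequence, alternate sequences included, by passing a true value as the third argument of fetch_all:

$slices = $slice_adaptor->fetch_all( 'toplevel', undef, 1 );

or fetch the assembly exceptions that overlap a slice you already have:

$assembly_exception_features = $assembly_exception_feature_adaptor->fetch_all_by_Slice($slice);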

When using the API, the primary assembly is known as the ‘reference’ sequence and the alternate sequences are known as ‘non-reference’ sequences.
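
For anyone who wants to try the API calls end to end, here is a minimal sketch of a complete script. It assumes the Ensembl Perl core API is installed and that you connect anonymously to the public server through the registry. The species alias 'Human', the choice of chromosome 17 and the variable names are ours for illustration. The registry connects to the release that matches your installed API, so the region names you see may differ from the e62 names above.

```perl
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# Connect to the public Ensembl database server (assumed settings).
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
);

my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );

# Third argument = 1: include non-reference (alternate) sequences.
my $slices = $slice_adaptor->fetch_all( 'toplevel', undef, 1 );

# List only the non-reference toplevel sequences.
foreach my $slice ( @{$slices} ) {
    next if $slice->is_reference();
    printf "%s\t%d\n", $slice->seq_region_name(), $slice->length();
}

# List the assembly exceptions that overlap chromosome 17.
my $aef_adaptor =
    $registry->get_adaptor( 'Human', 'Core', 'AssemblyExceptionFeature' );
my $chr17 = $slice_adaptor->fetch_by_region( 'chromosome', '17' );

foreach my $aef ( @{ $aef_adaptor->fetch_all_by_Slice($chr17) } ) {
    printf "%s\t%d\t%d\t%s\n",
        $aef->type(), $aef->start(), $aef->end(),
        $aef->alternate_slice()->seq_region_name();
}
```

The first loop uses is_reference() to skip the primary assembly, so only haplotypes and patches are printed. The second loop prints the type and coordinates of each exception on chromosome 17 together with the name of the alternate sequence.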

Enjoy!

3 thoughts on “Accessing alternate sequences in human”

  1. Pingback: non-reference sequences in UCSC and Ensembl | The OpenHelix Blog

  2. I used the “Tables” feature of UCSC to generate a bed file for a list of targeted genes I want to examine on Ion Torrent. However, I keep getting the “apd_hap” chromosome IDs showing up when I try to import into AmpliSeq Designer by LTI. So it errors out.

    So if I do NOT want the haplotype IDs, how might I get rid of them?

  3. Hi MS,

    It sounds like you would like to exclude results on the APD haplotype (in the major histocompatibility complex region) of human. Haplotypes are alternate sequences to the primary assembly and show variation in the population.

    If you would like to ignore haplotypes and you are using Ensembl BioMart, you’ll need to go into Filters -> REGION and select only the chromosomes and scaffolds that you are interested in.

    If you are using our API and would like to fetch only the primary assembly then you just need to modify the third argument for SliceAdaptor (see documentation here: http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1DBSQL_1_1SliceAdaptor.html#a065dc6181872885e5056882a9e1a8567).

    To fetch only the primary assembly (i.e. chromosomes and unplaced scaffolds), do:
    $slices = $slice_adaptor->fetch_all( 'toplevel', undef, 0 );
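
    If you already have the bed file, another option is to filter it afterwards. Below is a rough sketch in Perl. It assumes that the haplotype chromosome names in your file end in “_hap” followed by a number (e.g. chr6_apd_hap1), so please check that against your own file. The helper names and the sample records are made up for illustration.

    ```perl
    use strict;
    use warnings;

    # True if a chromosome name looks like a haplotype, e.g. chr6_apd_hap1.
    sub is_haplotype_chrom {
        my ($chrom) = @_;
        return 0 unless defined $chrom;
        return $chrom =~ /_hap\d+$/ ? 1 : 0;
    }

    # Keep track, browser and comment lines, plus every record
    # whose first column is not a haplotype chromosome name.
    sub drop_haplotype_records {
        my @lines = @_;
        return grep {
            my ($chrom) = split /\t/, $_;
            /^(?:track|browser|#)/ || !is_haplotype_chrom($chrom);
        } @lines;
    }

    # Made-up bed records, for illustration only.
    my @bed = (
        "chr6\t1000\t2000\tgeneA",
        "chr6_apd_hap1\t1000\t2000\tgeneA",
        "chr17\t5000\t6000\tgeneB",
    );

    print "$_\n" for drop_haplotype_records(@bed);
    ```

    To use it on a real file, read the lines of the file into the list instead of the sample records and write the result to a new bed file.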

    Hope that helps.