Accessing alternate sequences in human

Have you noticed any strange-looking chromosome names when browsing the human data? For example, you might notice sequence region names looking like “Chromosome HSCHR17_2_CTG4: 68,302,419-68,526,413” or “Chromosome HG75_PATCH: 34,442,621-34,976,908”.

The names refer to genomic sequence that differs from the genomic DNA on the primary assembly. These alternate sequences come in two types: Allelic sequence (haplotypes and novel patches) and fix patches. Haplotypes are known variations to the primary assembly, due to variability in the human genome sequence (eg. the highly variable MHC locus containing halpotypes HSCHR6_MHC_COX, HSCHR6_MHC_SSTO, HSCHR6_MHC_APD, HSCHR6_MHC_DBB, HSCHR6_MHC_MANN, HSCHR6_MHC_MCF, and HSCHR6_MHC_QBL). Novel patches also represent new allelic loci but they are not necessarily haplotypes. Fix patches are where the primary assembly was found to be incorrect, and the patch reflects the corrected sequence. Haplotypes, novel patches and fix patches are determined by the GRC, not by Ensembl.

In the Ensembl browser, as in the figure below, the allelic sequence (haplotypic regions and novel patches) are coloured red and the fix patches are coloured green. If you have a look at the top image in Region In Detail for chromosome 17, you’ll see examples of both types of alternate sequence.

There are several ways to view alternate sequences in Ensembl:

If you know the name of the sequence you’re looking for, you can find it by searching in our Search bar.
You can view alternate sequence regions in the top image of any Location page eg. Region In Detail, Region Overview, Chromosome Summary.
Some alternate sequences are available through BioMart.
If you’re comfortable using MySQL, you can access the list through the assembly_exception table as follows:

mysql -uanonymous -hensembldb.ensembl.org -P5306 -Dhomo_sapiens_core_62_37g -e “select sr2.name as chr_name, exc_seq_region_start,exc_seq_region_end,exc_type,sr1.name as alternate_seq_name,seq_region_start, seq_region_end from assembly_exception ae, seq_region sr1, seq_region sr2 where sr1.seq_region_id=ae.seq_region_id and sr2.seq_region_id=ae.exc_seq_region_id order by chr_name,exc_seq_region_start”

Click here for the full list of e62 alternate sequences

It’s also easy to fetch the alternate sequences using our API. Click here for an example script.

$slices = $slice_adaptor->fetch_all( ‘toplevel’, undef, 1 );

$assembly_exception_features = $assembly_exception_feature_adaptor->fetch_all_by_Slice($slice);

When using the API, the primary assembly is known as the ‘reference’ sequence and the alternate sequences are know as ‘non-reference’ sequence.

Enjoy!

Ensembl Blog

News about the Ensembl Project and its genome browser

Accessing alternate sequences in human