Known Bugs

Ensembl strives to deliver the highest quality resources for the research community. However, there are times that we discover errors in our released databases either due to our own mistakes or errors and inconsistencies in our input data sources. We list these bugs here as they are discovered. In every case, we correct these bugs as soon as they are discovered and normally provide these corrections in the next Ensembl release.

Gene names with an additional semi-colon in release 78

Some semi-colons (‘;’) have been left in gene names assigned using UniProt gene names.
The display has been fixed on the website, but the semi-colons remain in BioMart results.
This affects 16 species: human, mouse, marmoset, guinea pig, cat, cod, turkey, ferret, microbat, pig, tetraodon, platyfish, orangutan, nile tilapia, gibbon and spotted gar.
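
Until the BioMart data are corrected, exported gene names can be cleaned locally. The following is a minimal sketch, assuming the stray semi-colon carries no meaning and can simply be dropped; the function name and the example names are hypothetical and are not taken from a real BioMart export.

```python
def strip_semicolons(name: str) -> str:
    """Remove stray semi-colons and surrounding whitespace from a gene name."""
    return name.replace(";", "").strip()


# Hypothetical gene names as they might appear in an exported result file.
exported_names = ["GENE1;", "GENE2", "; GENE3"]
cleaned_names = [strip_semicolons(n) for n in exported_names]
print(cleaned_names)
```

Names without a semi-colon pass through unchanged, so the function can be applied to a whole column.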

Ensembl mart mouse Affy Moex 1 0 st v1 probeset ids in e78

The mouse Affy Moex 1 0 st v1 probeset ids in Ensembl mart 78 contain an extra semi-colon. This issue will be fixed in release 79.

VEP cache incomplete in chromosome Y in e77

We have corrected a bug in the VEP version 77 cache files for human that were released at the start of October 2014. The October files were missing some transcripts on the Y chromosome and so VEP requests for variants on Y that fell within some genes were erroneously called as ‘intergenic’. As of November 18th 2014 this is fixed for the websites, off-line script and REST API.

For script users, please update your cache files with these new ones from here:
ftp://ftp.ensembl.org/pub/current_variation/VEP/homo_sapiens_vep_77_GRCh37.tar.gz
ftp://ftp.ensembl.org/pub/current_variation/VEP/homo_sapiens_vep_77_GRCh38.tar.gz

EPO alignments in e76/77

The coverage of the EPO alignments on the cat (Felis catus) genome has decreased from 89.58% base pair coverage (in release 75) to 58.20% base pair coverage (in releases 76 and 77). This was caused by the use of an old set of anchor sequences, which was missing cat-specific sequences (anchor sequences are used in the first stage of generating the EPO alignments). This will be rectified in the next EPO alignment build.

LRG genes missing from Ensembl Families, release 76

This is due to a lack of synchronisation between different pipelines. The issue will be addressed in future releases, but the data will be missing in e76.

Mis-assigned HGNC names in human, release 76

Due to a bug in our external references mapping pipeline, 2,373 HGNC symbols have been mis-assigned, corresponding to 2,570 genes.
Another 34,520 HGNC symbols have been correctly assigned to 31,775 genes.
This issue will be fixed in release 77.
If in doubt regarding an assigned HGNC symbol, please check whether other external references, for example EntrezGene or UniProt, confirm that symbol.
These erroneous entries can be identified in our database as having the info_text ‘Generated via ccds’.
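
The info_text check can be applied to any set of external reference records that have already been retrieved. This is a minimal sketch: only the info_text value ‘Generated via ccds’ comes from the description above; the record layout, the display_label field name and the example symbols are assumptions made for illustration.

```python
SUSPECT_INFO_TEXT = "Generated via ccds"


def suspect_symbols(xrefs):
    """Return the labels of records whose info_text marks them as affected."""
    return [x["display_label"] for x in xrefs
            if x.get("info_text") == SUSPECT_INFO_TEXT]


# Hypothetical records; real ones would come from your own database query.
records = [
    {"display_label": "SYMBOL_A", "info_text": "Generated via ccds"},
    {"display_label": "SYMBOL_B", "info_text": ""},
]
print(suspect_symbols(records))
```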

 

Gene gain/loss trees, release 75

Due to a bug in our gene gain/loss analysis pipeline, the predicted numbers of ancestral genes are all set to 0. We advise using the data from Ensembl 74 if you have to stick to the GRCh37 assembly of the human genome, or switching to a more recent release otherwise.

Incorrect TarBase data, release 75

The coordinates of the TarBase data from mouse are largely incorrect due to a problem with a projection between assemblies.
The TarBase data from Human contains some duplicate entries and the features are not ordered in ascending coordinates. This affects only queries through BioMart or the API fetch_all function.
These issues will be corrected in release 76.

Individual genomes data in the Location Resequencing View, release 74

The differences between the reference sequence and the genome sequences of James Watson and Craig Venter are not available in the Location: Resequencing View in Ensembl release 74. These data have not changed and are available in release 73 in the archive site. They will be re-instated in release 75.

Incorrect sequence for chicken chr Z – Updated

Since the new chicken assembly was released in April 2013 (release 71), there has been a problem with chromosome Z. In particular, we have incorrectly used contig AC186840.3 instead of AC186840.2 for scaffold JH375087.1. This will be fixed in the next release (e74) in November 2013. Chicken chromosome Z was also incorrect on our Pre! site from January 2012 – April 2013.

The correct chromosome Z is now available on our FTP site: ftp://ftp.ensembl.org/pub/release-73/fasta/gallus_gallus/dna/. Apologies for the inconvenience.

Problems with ENCODE WGBS data

The ENCODE whole-genome bisulfite sequencing data (GEO ref: GSE 40832) have been flagged as erroneous by their producers: the strand column contains errors. New data files are expected to be deposited soon in GEO.

Missing human ncRNA genes

In release: 72

Due to an error in the ncRNAs import process, there are 99 ncRNA genes which are missing from the human gene set:

ENSG00000194647, ENSG00000199438, ENSG00000199537, ENSG00000199789, ENSG00000200000, ENSG00000200280, ENSG00000200285, ENSG00000200654, ENSG00000200837, ENSG00000201061, ENSG00000201103, ENSG00000201686, ENSG00000201784, ENSG00000201976, ENSG00000202181, ENSG00000202294, ENSG00000202323, ENSG00000202641, ENSG00000206696, ENSG00000206753, ENSG00000206830, ENSG00000207040, ENSG00000207315, ENSG00000207427, ENSG00000207447, ENSG00000207498, ENSG00000207718, ENSG00000207787, ENSG00000207793, ENSG00000207809, ENSG00000208007, ENSG00000208011, ENSG00000208342, ENSG00000211521, ENSG00000212300, ENSG00000215944, ENSG00000216036, ENSG00000221311, ENSG00000222234, ENSG00000222417, ENSG00000222687, ENSG00000222944, ENSG00000222946, ENSG00000223010, ENSG00000223279, ENSG00000223292, ENSG00000238353, ENSG00000238439, ENSG00000238461, ENSG00000238505, ENSG00000238547, ENSG00000238636, ENSG00000238682, ENSG00000238779, ENSG00000238944, ENSG00000238994, ENSG00000239062, ENSG00000239071, ENSG00000239088, ENSG00000239187, ENSG00000239337, ENSG00000239421, ENSG00000239688, ENSG00000239800, ENSG00000240379, ENSG00000240620, ENSG00000242926, ENSG00000243133, ENSG00000243835, ENSG00000243922, ENSG00000244684, ENSG00000251736, ENSG00000251820, ENSG00000251845, ENSG00000252360, ENSG00000252384, ENSG00000252527, ENSG00000252564, ENSG00000252665, ENSG00000262405, ENSG00000263454, ENSG00000264132, ENSG00000264152, ENSG00000264394, ENSG00000264460, ENSG00000264581, ENSG00000264854, ENSG00000265276, ENSG00000265701, ENSG00000265805, ENSG00000266031, ENSG00000266067, ENSG00000266351, ENSG00000266506, ENSG00000266623, ENSG00000266661, ENSG00000266742, ENSG00000266752

Stable ID ENSG00000199654 in Ensembl release 72 has been wrongly assigned to the corresponding gene on the patch rather than the parent gene, which is missing.

Polyphen predictions, release 71.

PolyPhen predictions are not available for human variants or proteins which are novel to release 71. No PolyPhen data are available through BioMart for this release. PolyPhen predictions are available through the Ensembl variation 70 mart here: http://jan2013.archive.ensembl.org/biomart/martview/

Updated information will be available in release 72.

Drosophila funcgen DB release 70

The gene set in the fruitfly core database was updated from FlyBase version 5.39 to 5.46, but the regulation database was not updated correspondingly. Many transcript IDs remained the same (22,659/23,657 = 96%) but some were removed (998/23,657 = 4%), and others were added (4,257).  Consequently, some mappings between probesets and FlyBase transcripts (2,532/51,711 = 5%) refer to transcript IDs that are no longer current, and mappings for new transcript IDs do not exist (10,773/59,952 = 18%).  A small number of mappings between REDfly annotations and FlyBase genes refer to gene IDs that have been replaced (3/339 = 1%). BioTIFFIN regulation features have no explicit links to FlyBase data, so are unaffected.

A regulation database with up-to-date mappings between probes and transcripts will be available in EnsemblGenomes release 17, scheduled for 29 Jan 2013, and in Ensembl release 71.
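
The percentages quoted above are rounded to the nearest whole number. The short check below recomputes them from the counts given in the text; the labels are descriptive names chosen here, not database identifiers.

```python
def percent(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Counts copied from the description of the Drosophila funcgen issue.
figures = {
    "transcript IDs unchanged": (22659, 23657),
    "transcript IDs removed": (998, 23657),
    "probeset mappings to outdated transcript IDs": (2532, 51711),
    "new transcript IDs without mappings": (10773, 59952),
    "REDfly mappings to replaced gene IDs": (3, 339),
}

for label, (part, whole) in figures.items():
    print(f"{label}: {part}/{whole} = {percent(part, whole)}%")
```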

BioMart release 70: missing mouse strains

In release 70 Ensembl variation mart

Changes made to the variation sample table mean that approximately half of the mouse strains are missing from the variation mart database. You can see the available strains here:

Filter-> GENERAL VARIATION FILTERS-> Limit to variations from strain(s)

As an alternative, please use the Ensembl variation 69 mart here:

http://oct2012.archive.ensembl.org/biomart/martview/

The strains will be added again for release 71.

Rat eQTLs

In Release 70:

The eQTL data is not available for rat for this release. These data were not available for us to download for the new assembly, Rnor_5.0 at the time we prepared our databases.

Ensembl read-through transcripts

In releases: up to and including 72

We have identified 15 Ensembl read-through transcripts which have not been assigned the correct gene due to a bug in our Ensembl-HAVANA merge code.

There are 11 human genes affected whose HGNC names are: CFB, ARHGAP8, AKAP2, CFHR4, C20orf141, FDX1L, APOC2, TNNI3K, C1QTNF5, APITD1 and TMEM189. The Ensembl read-through transcripts within these genes should have been annotated as part of other neighbouring genes.

We are currently working on a fix for this issue.

Regulation: Chr Y blacklist filtering

In Releases: 64-69

Peak calls based on ChIP-Seq and DNase1 data are filtered using a list of blacklist regions curated by the ENCODE project. In releases 64-69, the addition of filtering support for the Y pseudo-autosomal regions (PARs) in human introduced a bug. This resulted in all blacklist regions on the human Y chromosome (including the PARs) being omitted from the filtering. The effect of this is twofold: PAR regions appear to have duplicate data at a given location, as data from the corresponding X PAR is projected across; and a small number of low-quality regulatory features (~150-200) and associated supporting evidence have not been filtered out. This will be rectified in release 70.

BioMart release 69 bugs.

In Release: 69

1) Ferret (Mustela putorius furo) is missing Orthologs, possible Orthologs and Paralogs in the filter and attribute sections.
2) Ferret (Mustela putorius furo) and Platyfish (Xiphophorus maculatus) are missing the following id list limit filters in the gene section: ensembl_gene_id, ensembl_transcript_id, ensembl_protein_id and ensembl_exon_id.
3) The Homologs attributes section is not working when using a second dataset.
These issues will be fixed for release 70.

Ensembl-annotated lincRNA genes

In Release: 68

The mouse gene set for Ensembl release 68 is missing approximately 700 Ensembl-annotated lincRNA genes. These genes will be incorporated in the gene set as part of the standard Ensembl annotation of mouse for e70.

Mis-assignment of Canonical Transcripts in Mouse

In Release: 68

There are [700] mouse genes that have not been assigned the correct canonical transcript due to the CCDS transcripts not being prioritised over other transcripts. Side-effects include a reduced number of orthologs to other species. The issue will be fixed in Ensembl 70.

Missing stable ids for mouse estgene exons

In Release: 68

For the EST alignment based gene models, stable ids are missing on the exon level. This issue will be fixed in Ensembl 68.

Mis-assignment of Canonical Transcripts in Human

In Release: 68

The number of human genes not using the same canonical transcript as was declared in Ensembl 67 has risen by 5%. Side-effects include a reduced number of orthologs receiving annotations (display names and GO terms) from human genes. The issue will be fixed in Ensembl 69.

Ensembl Gene mart UTR start and end coordinate error has been fixed.

In Release: 67

There was a bug in the Exon.pm module that led to the miscalculation of the 5′ and 3′ UTR coordinates for the Ensembl Gene mart in release 67 (9th May). This issue has now been fixed in the API and the Ensembl Gene mart has been patched. The fixed database has been pushed to the live site, the public mysql database and the FTP site (25th May 2012). The BioMart central portal (www.biomart.org) has been made aware of the issue and will update the version on their portal as soon as possible. If you are using the biomaRt package from BioConductor, please set your host to www.ensembl.org to get the most up-to-date version until the fix has been made live on the BioMart central portal.

BioMart SIFT and PolyPhen scores

In Release: 67

The SIFT and PolyPhen scores are not available in the filters and attributes section in the variation mart for this release.

As a workaround, you can get these scores in the variation attributes section of the Ensembl mart. This issue will be fixed for release 68.

Ensembl-annotated lincRNA genes

In Release: 66

The Human gene set for Ensembl release 66 is missing approximately 300 Ensembl-annotated lincRNA genes. These genes will be incorporated in the gene set as part of the standard Ensembl annotation of human for e67.

BioMart release 65 bug

In Release: 65

There is an issue with the retrieval of UniProt/TrEMBL Accession(s) and UniProt/Swissprot Accession(s) in the filters and attributes for Drosophila melanogaster in the Ensembl Gene mart. As a workaround, one can use the Ensembl Genomes’ metazoa mart (http://metazoa.ensembl.org/biomart/martview/) where a query against UniProt does not result in the same error. This issue will be fixed for release 66.

Regulation GFF dumps

In Releases: 63 & 64

Start and end loci in the RegulatoryFeature and AnnotatedFeature GFF dumps are truncated to the nearest megabase. This was due to the implementation of a Slice Iterator in the dump script, which erroneously used the local coordinates of the 1MB slices used to perform the dumps. These have now been corrected on the FTP site.
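
To illustrate the difference between slice-local and chromosome coordinates, here is a minimal sketch of the conversion. It assumes 1-based, inclusive coordinates and a slice described only by its start position on the chromosome; the function name and the example values are hypothetical and are not taken from the dump script.

```python
def to_chromosome_coord(slice_start: int, local_pos: int) -> int:
    """Convert a 1-based position within a slice to a 1-based chromosome position."""
    return slice_start + local_pos - 1


# A feature at local positions 250-900 on a slice that starts at 3,000,001.
slice_start = 3_000_001
feature_start = to_chromosome_coord(slice_start, 250)
feature_end = to_chromosome_coord(slice_start, 900)
print(feature_start, feature_end)
```

Writing out the local values (250 and 900 here) instead of the converted ones loses the position of the slice on the chromosome.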

BioMart “ID list limit” filter issues

In Release: 64

There is an issue with the “ID list limit” filters in the Ensembl Gene mart for RefSeq mRNA, RefSeq mRNA predicted, RefSeq ncRNA and RefSeq ncRNA predicted. If one inputs a RefSeq accession for one of these categories, it throws a “table does not exist” error.

As a workaround, users can still use the “Limit to genes” filter and download a list of all genes that have, for example, RefSeq mRNA external references and then filter for their NM_* accessions of interest. This bug will be rectified for release 65.

The PFAM IDs have also had versions added to the ID during the running of the protein annotation pipeline (e.g. PF07654.8). This makes it difficult to use the “ID list limit” filter in BioMart. This version will be removed for release 65.
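
If you need to match versioned and unversioned PFAM IDs in your own lists, the version suffix can be stripped locally. This is a minimal sketch, assuming the version always appears as a dot followed by digits, as in the PF07654.8 example above; the function name is hypothetical.

```python
import re

# "PF" followed by digits, then a dot and a numeric version.
_VERSIONED_PFAM = re.compile(r"^(PF\d+)\.\d+$")


def strip_pfam_version(accession: str) -> str:
    """Return the PFAM ID without its version suffix, if one is present."""
    match = _VERSIONED_PFAM.match(accession)
    return match.group(1) if match else accession


print(strip_pfam_version("PF07654.8"))
```

IDs that do not match the pattern are returned unchanged.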

Missing HGNC symbols

In Release: 64

The Xref mapping system has failed to map HGNC symbols for 175 human genes, which had symbols in Ensembl 63, none of which are active CCDS entries. This will also affect genes in species that receive projected human display names. Download affected symbols

Missing orthology data in Gorilla

In Release 64.

With the update of the gorilla assembly, several gorilla genes have been misplaced in the gene trees due to a problem while extracting their genomic sequences in our pipeline. This has affected about a thousand genes. As a result, the orthology predictions for these genes are missing or inaccurate. The missing orthology relationships are often found in the set of ‘possible orthologs’. However, we recommend using Ensembl 63 for gorilla orthologs.

Rat Codelink alignments and transcript annotations

Up to and including release 63.

During the array import and mapping process for the rat Codelink array, a fasta file was erroneously truncated. This did not affect the import of the array design (i.e. probes), hence passed our current health checks. However, it did impact on the genomic and transcript alignment steps, resulting in only 30% of the probes being aligned to the genome rather than ~90%. In turn this impacted on the transcript annotation step which assigned xrefs to only ~15% rather than ~50% of the Codelink probes. A new health check will be added to the array mapping pipeline to prevent this in future. Updated MySQL dumps are available here:

ftp://ftp.ensembl.org/pub/misc/codelink_fix_rattus_norvegicus_funcgen_63_34.tar.gz

Protein alignments in GeneTrees

In Release: 62

We have been using an experimental extension of M-Coffee, the exon-disaligner (AKA decaf module), in the last few releases. In short, we inform M-Coffee about the exon boundaries in order to reduce the amount of over-alignments spanning exon boundaries. The exon-disaligner module has been disaligning too much sequence in the alignments in e!62. As a result, dN/dS values and similarity stats are unreliable in the affected alignments.

Read more on the GeneTree pipeline

Human RegulatoryFeature Stable IDs

In Release: 62

The stable ID mapping procedure produced some erroneous results for the Human Regulatory Feature sets. Approximately 150k out of a total of ~445k ‘MultiCell’ Regulatory Features were erroneously assigned new stable IDs rather than being projected from the previous Regulatory Build (v61).

Transcript names for human, mouse and zebrafish

In Release: 62

Transcript names in human, mouse and zebrafish are suffixed with a number starting with either ‘0’ or ‘2’. If the number starts with ‘0’ then it is a merged or manually curated transcript from Havana/Vega. If the number starts with ‘2’ then it is an automatically annotated transcript from Ensembl. For release 62, some of the transcript numbers have been set incorrectly. To know whether a transcript is merged, from Ensembl or from Havana, see the “Prediction Method” line on the Transcript Summary page.
2257 merged or Havana human transcripts have the transcript number starting ‘2’.
1029 Ensembl human transcripts have the transcript number starting ‘0’.
980 merged mouse transcripts have the transcript number starting ‘2’.
19 Ensembl mouse transcripts have the transcript number starting ‘0’.
324 merged or Havana zebrafish transcripts have the transcript number starting ‘2’.
234 Ensembl zebrafish transcripts have the transcript number starting ‘0’.
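
The naming convention described above can be expressed as a small classifier. This is a sketch only: it assumes the number is separated from the rest of the name by a hyphen, and the example names are hypothetical. Because some numbers are set incorrectly in release 62, the result reflects the naming convention and not necessarily the true origin of a transcript.

```python
def transcript_source(name: str) -> str:
    """Classify a transcript name by the first digit of its numeric suffix."""
    _, separator, suffix = name.rpartition("-")
    if not separator or not suffix.isdigit():
        return "unknown"
    if suffix.startswith("0"):
        return "havana_or_merged"
    if suffix.startswith("2"):
        return "ensembl"
    return "unknown"


for example in ("GENE1-001", "GENE1-201", "GENE1"):
    print(example, transcript_source(example))
```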

Gene set on human haplotypes

In Release: 62

The Human gene set for Ensembl release 62 is missing gene annotation from the Ensembl automatic pipeline on the haplotypes. All but two of these haplotypes still contain gene annotation as imported directly from Havana. We plan to generate Ensembl annotation for all haplotypes for e63.
The following haplotype regions have annotation from Havana only: HSCHR6_MHC_APD, HSCHR6_MHC_COX, HSCHR6_MHC_DBB, HSCHR6_MHC_MANN, HSCHR6_MHC_MCF, HSCHR6_MHC_QBL, HSCHR6_MHC_SSTO.
The following haplotype regions have no gene annotation: HSCHR17_1, HSCHR4_1.

Canonical transcripts and gene trees (Human and mouse)

In release: 61

A subset of genes in human and mouse have their canonical transcript set incorrectly. The canonical transcripts are used by the Comparative Genomics team to generate gene trees and so this bug has also caused some gene trees to be incorrect. For human, 3393 of 53515 genes (6.34%) have an incorrect canonical transcript. For mouse, 2072 of 36817 genes (5.63%) have an incorrect canonical transcript. This will be fixed for e62.

Missing SNP status (Human)

In release: 61

We are missing information about validation status for most human variations due to this data being unavailable from dbSNP at the time of import. The validation status for each rsId (exported from dbSNP on 2011-01-20) is available via FTP as a tab-separated file.

Consequence Types in Mart (all species)

In release: 58

The attribute “Consequence Type (Transcript Variation)” is missing from the Ensembl Mart due to a mart building bug. The correct transcript consequence for a SNP may be found by either using the Variation API or by using the Variation Mart.

Missing UniProtKB/Swiss-Prot secondary accessions in human (Human)

In release: 57

There are no UniProtKB/Swiss-Prot secondary accessions for human, due to a change in the way we obtain Uniprot-Ensembl mappings. Previously these were stored as synonyms of the primary accessions, and, although they were not visible on the website they were searchable and available via BioMart. The UniProtKB/Swiss-Prot secondary accessions will be restored for Ensembl release 58. Users with a need to use the secondary accessions are advised to use Ensembl release 56 until Ensembl 58 is released.

Variation flanking sequence (Human, Mouse, Zebrafish, Rat, Cow)

In releases: Up to and including 57

For a small number of variations, the flanking sequences displayed in the variation property tab may contain sequence permutations. This situation arises when the flanking sequence is a composite of a sequence which has been determined by an experimental assay and sequence extracted from e.g. a genomic database. The number of affected variations in Ensembl release 57 for respective species is shown in the table below.

Species      Affected variations
Human        11,237
Cow          363
Rat          11,536
Mouse        101
Zebrafish    6,156

Incorrect consequence type (Human, Mouse, Rat)

In releases: Up to and including 57

146,593 variants in human, 739 in mouse and 26,947 in rat have the wrong consequence type. They should have the consequence type “INTERGENIC” because they fall in an area with no transcript. For example, rs7298705 is reported as non-synonymous but should be intergenic.

SNP flanking sequence (Human)

In release: 56

It has come to our attention that some code operating on the variation database flanking_sequence table failed for 1,421,205 SNPs which originally mapped to the reverse strand. Although the website reports the SNPs as being on the forward strand, the displayed flanking sequences are from the reverse strand. Only the flanking sequence is affected; the genotypes and alleles are correct.
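
Converting a pair of flanking sequences from one strand to the other means reverse-complementing each flank and swapping the two. The sketch below shows that operation in general terms; it is not the code that operates on the flanking_sequence table, the function names are hypothetical, and the set of affected SNPs has to be determined separately.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_flanks(five_prime: str, three_prime: str) -> tuple:
    """Express a pair of flanks on the opposite strand.

    The reverse complement of the 3' flank becomes the new 5' flank,
    and the reverse complement of the 5' flank becomes the new 3' flank.
    """
    return reverse_complement(three_prime), reverse_complement(five_prime)


print(flip_flanks("AACG", "TTGA"))
```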

Source name misspelling (Human)

In release: 56

Watson’s entry in the source table is misspelled as “ENSENBL:Watson”.

rsIDs not merged (Human)

In release: 56

In Ensembl 56, rsIDs were not merged, leaving ~25,000 extra rsIDs in variation/variation_feature that should be in variation_synonym.

Catarrhini primates EPO alignments (Human, Chimp, Macaque, Orangutan, Gorilla)

In releases: 55 – 56

Due to a bug in Ortheus, all the internal gaps in these alignments are shifted by 1 position.

Tetraodon BLAST indices corrupted (Tetraodon)

In releases: 53 – 56

The Tetraodon blast index is corrupted; between release 53 and release 56, the Tetraodon genome was only partially indexed – the following chromosomes were ABSENT from the blast-db: 4, 6, 7, 8, 9, 11, 14, 15, 16, 18, 19, 20, 21, MT.

Eutherian mammals EPO alignments (multiple species)

In releases: 49 – 56

Due to a bug in Ortheus, all the internal gaps in these alignments are shifted by 1 position. This will have also affected the GERP constraint elements we derive from these alignments.

Eutherian mammals alignment (multiple species)

In release: 55

In Ensembl 55, the web interface incorrectly lists “10-way eutherian mammals EPO”, which is not actually present in the database. A 9-way eutherian alignment is available in the database and via API. You can also download the EMF files from our FTP server at ftp://ftp.ensembl.org/pub/release-55/emf/ensembl-compara/epo_9_eutherian/

Mouse Regulatory Features (Mouse)

In releases: 54 – 55

In releases 54 and 55 there were a number (~3.7%) of duplicate RegulatoryFeatures. These were present on the later parts of chromosomes 1 and 17. The duplicates were removed in release 56.

ENSEMBL:Sanger SNPs (Mouse)

In releases: Up to and including 53

Up to and including release 53, all SNP data with sample name “ENSEMBL:Sanger” should be on mouse strain “C3HeB/FeJ” and not “C3H/HeJ”.

LD calculations (all species)

In releases: Up to and including 53

Up to and including release 53, there was an error in the linkage disequilibrium calculation script that caused the values of r2 and D’ to be incorrect in some cases and about 5% of the set of tag SNPs to be miscalculated.
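
For reference, r2 and D’ can be computed from haplotype and allele frequencies using the standard textbook definitions. The sketch below follows those definitions and is not the Ensembl calculation script; the function and parameter names are chosen here for illustration, and D’ is returned as an absolute value.

```python
def ld_stats(p_ab: float, p_a: float, p_b: float) -> tuple:
    """Return (r2, |D'|) for two biallelic loci.

    p_ab is the frequency of the haplotype carrying allele A at the first
    locus and allele B at the second; p_a and p_b are the allele frequencies.
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denominator if denominator > 0 else 0.0
    return r2, d_prime


print(ld_stats(0.5, 0.5, 0.5))   # alleles always inherited together
print(ld_stats(0.25, 0.5, 0.5))  # alleles independent of each other
```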

Probeset Annotations (Human, Mouse)

In releases: 43 – 49

Between releases 43 and 49 the probeset transcript annotation method contained a bug where in some instances probes were being assigned to transcripts on the wrong strand. The effect on the final transcript annotations varied depending on the species in question, with human and mouse having approximately 10% of annotations affected.

Errors in human sequence near haplotypic regions

In releases: 38-46

We found a problem in the human genome sequence for version 46, such that there are only N’s in the region Chr5:70946027-71169807. This appears to be a mistake in the mapper that was used to position the nearby haplotype (c5_H2, at positions 68965368-70760237). These Ns are not a result of repeat masking. Similar errors may be present in previous releases of this assembly, but the sequence is correct from release 47 onwards.

We therefore recommend downloading NCBI36 sequence data from release 54, the last Ensembl release with this assembly.