Changes to paralogy in release 94

We’ve heard from a number of you about missing paralogues in release 94. We have lost some paralogy relationships and we’re looking to restore them in a future release. We’re sorry for any problems this has caused.

For release 94, we launched a major change to our gene tree pipeline, which we use to infer orthologues and paralogues. In our previous pipeline, we used BLAST to cluster the genes, but for this release we switched to using hidden Markov models (HMMs) instead. There are three major benefits of using HMMs.

The first benefit is that we are able to add new genes into a cluster without re-running the clustering step. This means that when we get new genomes (which we’re now doing at a much faster rate), or new genes in our existing genomes, the old gene trees will remain intact, just with additional genes. We believe, and have heard from you, that accurate and stable gene trees are valuable, so this is a great improvement.
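To illustrate the idea of adding a gene to an existing cluster without re-clustering, here is a minimal Python sketch. It is not our pipeline code: the k-mer overlap score is a toy stand-in for a real HMM score, and the names toy_score, assign_gene, the cluster names and the threshold of 0.5 are all hypothetical.

```python
def kmers(sequence, k=3):
    """Return the set of k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def toy_score(sequence, profile, k=3):
    """Toy stand-in for an HMM score (assumption, not a real HMM):
    the fraction of the sequence's k-mers that the profile contains."""
    query = kmers(sequence, k)
    if not query:
        return 0.0
    return len(query & profile) / len(query)


def assign_gene(gene_id, sequence, clusters, profiles, threshold=0.5):
    """Add gene_id to the best-scoring existing cluster.

    The existing clusters are never rebuilt: the new gene is only
    appended, so earlier members stay where they were. Returns the
    chosen cluster name, or None if no profile scores high enough.
    """
    best_name, best_score = None, 0.0
    for name, profile in profiles.items():
        score = toy_score(sequence, profile)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None or best_score < threshold:
        return None
    clusters[best_name].append(gene_id)
    return best_name


# Hypothetical example data: two families, each with one profile.
profiles = {"famA": kmers("MKTAYIAKQR"), "famB": kmers("GGSGGSGGSG")}
clusters = {"famA": ["geneA1"], "famB": ["geneB1"]}
print(assign_gene("new1", "MKTAYIAKQ", clusters, profiles))
print(clusters)
```

The point of the sketch is that the profiles are fixed once built, so a new gene only needs to be scored against them. With BLAST-based clustering, by contrast, adding genes meant re-running the clustering step.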

The second benefit is that we can manually move genes between trees if we believe that annotation is correct.

The third benefit is that we believe the HMMs are stricter than the old BLAST clustering methods. This means that some larger gene trees with questionable duplication nodes are broken up. An additional payoff is that the smaller trees give us more scope to add genomes and genes, because the compute power needed to align and cluster the sequences in the larger trees was so great. Indeed, the relationships between the smaller sub-trees were unstable between releases because the large alignments were so difficult.

Since the release, we have discovered that some large gene trees with many real duplication nodes were inappropriately broken up, which means that we have lost some of the paralogy relationships.

We are now looking into how to fix this. The idea is to keep the smaller sub-trees at their current size and link them together, so that the super-tree can be displayed if needed. This would mean that the paralogues in different sub-trees would be linked back together.
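As a rough illustration of how linking sub-trees could bring back the lost relationships, here is a small Python sketch. It is an assumption about one possible data model, not our planned implementation: the function name, the dictionaries and the gene identifiers are hypothetical, and treating every same-species gene in a family as a paralogue is a simplification (real paralogues are inferred from duplication nodes in the tree).

```python
def paralogues(gene, subtrees, supertrees, include_linked=True):
    """List same-species genes that share a family with `gene`.

    subtrees:   sub-tree name -> {gene_id: species}
    supertrees: super-tree name -> list of linked sub-tree names
    With include_linked=False only the gene's own sub-tree is searched.
    """
    # Find the sub-tree that holds the query gene.
    home = next(name for name, genes in subtrees.items() if gene in genes)
    species = subtrees[home][gene]

    # Start from the home sub-tree, then widen to its super-tree if asked.
    names = [home]
    if include_linked:
        for linked in supertrees.values():
            if home in linked:
                names = list(linked)

    found = set()
    for name in names:
        for other, other_species in subtrees[name].items():
            if other != gene and other_species == species:
                found.add(other)
    return sorted(found)


# Hypothetical family that has been split into two sub-trees.
subtrees = {
    "sub1": {"hsA1": "human", "hsA2": "human", "mmA1": "mouse"},
    "sub2": {"hsA3": "human", "mmA2": "mouse"},
}
supertrees = {"super1": ["sub1", "sub2"]}

print(paralogues("hsA1", subtrees, supertrees, include_linked=False))
print(paralogues("hsA1", subtrees, supertrees))
```

In this toy data, searching only the home sub-tree misses hsA3, while following the super-tree link recovers it. The sub-trees themselves are left untouched, so they stay small and stable.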

We apologise to anyone who is working with paralogues and has found data missing. Please let us know if your favourite gene families have been affected and we will add them to our list of test genes. You can still get the old paralogy relationships using our archive site for release 93.