A standard method in phylogenetic reconstruction for representing variation in substitution rates between sites in the genome is the discrete Gamma model (DGM). Relative rates are assumed to be distributed according to a discretised Gamma distribution, where the probabilities that a site is included in each class are equal. Here, we identify a serious bias in the branch lengths of reconstructed phylogenies when the DGM is used, with the magnitude of the effect varying with the number of sequences in the alignment. We demonstrate the existence of the bias, using both simulated datasets and real HIV-1 sequences; in both cases branch lengths are overestimated. The phenomenon is exacerbated by increasing the number of discrete rate categories, is only very slightly mitigated by the addition of an invariant sites category, and happens regardless of the software package used for reconstruction. We show that the alternative "FreeRate" model, which assumes no parametric distribution and allows the class probabilities to vary, is not subject to the issue. We further establish that the reason for the behaviour is the equal size of the class probabilities in the discretisation, not simply the fact that a continuous distribution has been discretised. We explore the mathematics of the phenomenon, showing how maximum likelihood branch lengths under the DGM may differ from the true ones used to generate the tree, and how the magnitude of this difference is equal to the departure of the mean maximum likelihood substitution rate across all sites in the genome from 1. We recommend that the DGM be retired from general use. While FreeRate is an immediately available replacement, it is known to be difficult to fit, and thus there is scope for innovation in rate heterogeneity models.
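To make the equal-probability discretisation described above concrete, the following is a minimal sketch, not the authors' code. It assumes the common mean-per-category construction for a Gamma distribution with shape `alpha` and mean 1, in which each of `k` classes carries probability 1/k and is represented by the mean rate within its quantile interval. The function names (`reg_lower_gamma`, `gamma_quantile`, `discrete_gamma_rates`) and the example values `alpha = 0.5`, `k = 4` are illustrative assumptions, and the incomplete gamma function is approximated with a simple series that is only intended for moderate arguments.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma P(a, x), via its power series.

    Adequate for the moderate x values used here; not a general-purpose routine.
    """
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total) and n < 10000:
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def gamma_quantile(p, alpha):
    """Quantile of Gamma(shape=alpha, rate=alpha), found by bisection."""
    lo, hi = 0.0, 1.0
    # Grow the upper bracket until the CDF exceeds p.
    while reg_lower_gamma(alpha, alpha * hi) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(alpha, alpha * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discrete_gamma_rates(alpha, k):
    """Mean relative rate in each of k equal-probability Gamma classes."""
    # Class boundaries are the i/k quantiles; the last boundary is infinity.
    bounds = [0.0] + [gamma_quantile(i / k, alpha) for i in range(1, k)]
    # For a mean-1 Gamma, the partial first moment up to b is P(alpha + 1, alpha * b).
    cum = [reg_lower_gamma(alpha + 1.0, alpha * b) for b in bounds] + [1.0]
    # Dividing each class's partial mean by its probability 1/k gives the class rate.
    return [k * (cum[i + 1] - cum[i]) for i in range(k)]


# Illustrative values only (assumed, not taken from the study).
rates = discrete_gamma_rates(alpha=0.5, k=4)
mean_rate = sum(rates) / len(rates)
print(rates, mean_rate)
```

Under this construction the equally weighted mean of the class rates is 1 by design. The sketch shows only how the classes are built; it does not reproduce the branch-length bias reported above, which concerns the rates obtained when the model is fitted by maximum likelihood.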

DOI

10.1093/sysbio/syag037

Type

Journal article

Publication Date

2026-05-20

Keywords

Statistical phylogenetics, biased inference, maximum likelihood, mutation rates