Elsevier

Molecular Phylogenetics and Evolution

Volume 78, September 2014, Pages 277-289

The impact of calibration and clock-model choice on molecular estimates of divergence times

https://doi.org/10.1016/j.ympev.2014.05.032

Highlights

  • Molecular clocks can be calibrated with probability distributions of node ages.

  • We use simulations to investigate the effect of the age and number of calibrations.

  • Molecular clock model misspecification is an important source of estimation error.

  • The best strategy is to include multiple calibrations and to prefer those at deep nodes.

  • Effective calibrations minimise estimation error due to clock model misspecification.

Abstract

Phylogenetic estimates of evolutionary timescales can be obtained from nucleotide sequence data using the molecular clock. These estimates are important for our understanding of evolutionary processes across all taxonomic levels. The molecular clock needs to be calibrated with an independent source of information, such as fossil evidence, to allow absolute ages to be inferred. Calibration typically involves fixing or constraining the age of at least one node in the phylogeny, enabling the ages of the remaining nodes to be estimated. We conducted an extensive simulation study to investigate the effects of the position and number of calibrations on the resulting estimate of the timescale. Our analyses focused on Bayesian estimates obtained using relaxed molecular clocks. Our findings suggest that an effective strategy is to include multiple calibrations and to prefer those that are close to the root of the phylogeny. Under these conditions, we found that evolutionary timescales could be estimated accurately even when the relaxed-clock model was misspecified and when the sequence data were relatively uninformative. We tested these findings in a case study of simian foamy virus, where we found that shallow calibrations caused the overall timescale to be underestimated by up to three orders of magnitude. Finally, we provide some recommendations for improving the practice of molecular-clock calibration.

Introduction

Our understanding of the tempo and mode of evolution has been transformed by the study of molecular data. One of the most illuminating fields of research has been the use of molecular clocks to estimate evolutionary rates and timescales. There has been much progress in this area, with sophisticated methods being able to handle large, multilocus data sets and to model various patterns of rate variation among lineages (Dos Reis and Yang, 2011; Drummond et al., 2006; Rannala and Yang, 2007). However, all molecular clocks need to be calibrated so that estimates of rates and timescales are given in units of absolute time. Accordingly, identifying and dealing with sources of error in calibrations is a crucial component of molecular-clock analyses (Ho and Phillips, 2009; Inoue et al., 2010; Parham et al., 2012).

The most common method for calibrating molecular clocks is to use independent information to constrain the age of one or more nodes in the phylogenetic tree. We refer to these as the ‘calibrating nodes’ throughout this article. Calibrations are often based on a biogeographic event or on fossil evidence that can provide an estimate of when two lineages last shared a common ancestor. In the tree in Fig. 1, for example, a paleontological estimate of the divergence time of species 1 and 2 can be used to calibrate node A. By analysing the DNA sequences of these two species, we can estimate the absolute rate of molecular evolution along the two lineages descending from node A. The ages of other nodes in the tree can then be inferred by assuming some relationship among the substitution rates along different branches. A common strategy is to use several calibrating nodes, but this is only possible in taxonomic groups with a sufficient paleontological or biogeographic record. Although calibrations are often specified as point values, it is more appropriate to take into account their associated uncertainty (Ho and Phillips, 2009).
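The arithmetic behind a single point calibration can be sketched in a few lines of Python. This is only an illustration under a strict clock with point values; the function names and all numbers are hypothetical and are not taken from the study.

```python
def rate_from_calibration(pairwise_distance, node_age):
    """Rate (substitutions/site/time unit) implied by one calibrated node.

    Two lineages descend from the node, so the pairwise distance has
    accumulated over a total of 2 * node_age time units.
    """
    if node_age <= 0:
        raise ValueError("node_age must be positive")
    return pairwise_distance / (2.0 * node_age)


def node_age_from_rate(pairwise_distance, rate):
    """Age of an uncalibrated node, assuming the same rate applies to it."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return pairwise_distance / (2.0 * rate)


# Hypothetical values: species 1 and 2 differ at 2% of sites and the
# fossil record places their divergence (node A) at 10 Myr ago.
distance_node_a = 0.02
age_node_a = 10.0
rate = rate_from_calibration(distance_node_a, age_node_a)

# A second, uncalibrated pair of species differs at 5% of sites.
distance_other = 0.05
age_other = node_age_from_rate(distance_other, rate)
print(f"rate = {rate:.4f} subs/site/Myr, other node age = {age_other:.1f} Myr")
```

In a Bayesian analysis the calibration would be a probability distribution rather than a point value, and the assumption of a shared rate would be replaced by a clock model.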

In all molecular-clock analyses, the strongest assumption about the substitution rate is that it is homogeneous across the tree, which is known as a ‘strict’ molecular clock (Zuckerkandl and Pauling, 1962). However, many empirical data sets fail to meet this assumption, with important consequences for estimates of divergence times (Yoder and Yang, 2000). As a response, various methods that can account for rate variation among lineages have been implemented (see reviews by Rutschmann, 2006; Welch and Bromham, 2005). These can be broadly classified as either uncorrelated or autocorrelated relaxed-clock models. In uncorrelated models, the rate along each branch of the phylogeny is an independent sample from a chosen probability distribution (Drummond et al., 2006; Rannala and Yang, 2007). The autocorrelated models assume that rates vary gradually throughout the phylogeny, so that the rates along neighbouring branches have some degree of correlation (Kishino et al., 2001; Sanderson, 1997, 2002; Thorne et al., 1998). The inclusion of calibrations can have an important impact on clock-model selection. In particular, informative calibration(s) can allow the pattern of rate variation among lineages to be resolved more reliably (Brandley et al., 2011; Lukoschek et al., 2012).
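The difference between the two families of relaxed-clock models can be illustrated with a small simulation. The sketch below is a deliberate simplification and not the implementation of any particular program: it assumes lognormally distributed rates, treats the tree as a single chain of ancestor-descendant branches, and ignores branch durations. The function names and parameter values are hypothetical.

```python
import math
import random


def uncorrelated_rates(n_branches, mean_rate, sigma, rng):
    """Each branch rate is an independent lognormal draw.

    mu is shifted so that the expected rate equals mean_rate.
    """
    mu = math.log(mean_rate) - 0.5 * sigma ** 2
    return [rng.lognormvariate(mu, sigma) for _ in range(n_branches)]


def autocorrelated_rates(n_branches, root_rate, sigma, rng):
    """Each branch's log rate is centred on the log rate of its parent.

    Simplification: branches form one ancestor-descendant chain and the
    variance does not scale with branch duration.
    """
    rates = []
    log_rate = math.log(root_rate)
    for _ in range(n_branches):
        log_rate = rng.gauss(log_rate, sigma)
        rates.append(math.exp(log_rate))
    return rates


rng = random.Random(42)  # fixed seed so the sketch is reproducible
print(uncorrelated_rates(5, 0.01, 0.3, rng))
print(autocorrelated_rates(5, 0.01, 0.3, rng))
```

With the same value of sigma, neighbouring branches differ less under the autocorrelated sketch than under the uncorrelated one, which is the pattern described above.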

Molecular-clock estimates can be sensitive to the positions of the calibrations in the phylogenetic tree, especially when only a single or very few calibrations are available (Lee, 1999; Near and Sanderson, 2004). In general, calibrations at the root (node B in Fig. 1) or at deeper nodes are preferred over those at shallower nodes (e.g., nodes A and D in Fig. 1) (Hug and Roger, 2007; Sauquet et al., 2012; Van Tuinen and Hedges, 2004). The estimate of the substitution rate is primarily based on the branches that lie between the calibrating nodes and the tips, so that deeper calibrations capture a larger proportion of the overall genetic variation.

Studies of various data sets have shown that analyses using multiple calibrations tend to produce more reliable estimates than those based on a single or few calibrations (Conroy and Van Tuinen, 2003; Smith and Peterson, 2002; Soltis et al., 2002). A possible explanation for this pattern is that the inclusion of only a small number of calibrations can lead to a biased estimate of the substitution rate if there is substantial among-lineage rate variation. Additionally, the use of multiple calibrations reduces the average genetic distance between the calibrating nodes and the nodes that are not calibrated (Marshall, 2008; Rutschmann et al., 2007). Another benefit of multiple calibrations is that they can improve the accuracy of date estimates in the presence of taxon undersampling (Linder et al., 2005).

In Bayesian molecular-clock analyses, calibrations can be specified in the form of prior probability densities for node ages (Drummond et al., 2006; Yang and Rannala, 2006). In some Bayesian implementations of relaxed clocks, these calibration priors, chosen by the user, interact with each other and with the prior distribution of the tree to give the marginal priors for the node ages (Heled and Drummond, 2012; Ho and Phillips, 2009; Kishino et al., 2001). This can lead to differences between the user-specified and marginal calibration priors, with unexpected impacts on the resulting estimates of divergence times (Heled and Drummond, 2012; Warnock et al., 2012). In practice, one can evaluate the extent of the problem by comparing the marginal and the user-specified priors, which is typically done by running a Bayesian analysis without sequence data. There are ongoing efforts to provide a more direct solution to this problem (Heled and Drummond, 2013).
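A toy example may help to show why marginal priors can differ from user-specified priors. The sketch below is not how any Bayesian dating program is implemented: it assumes only two calibrated nodes with uniform priors, ignores the tree prior entirely, and uses rejection sampling to enforce the single constraint that a parent node must be older than its descendant. All names and bounds are hypothetical.

```python
import random


def sample_joint_prior(n_samples, parent_bounds, child_bounds, rng):
    """Draw node ages from the user-specified uniform priors and keep
    only draws in which the parent node is older than the child node."""
    kept = []
    while len(kept) < n_samples:
        parent = rng.uniform(*parent_bounds)
        child = rng.uniform(*child_bounds)
        if parent > child:  # the tree requires parent older than child
            kept.append((parent, child))
    return kept


rng = random.Random(1)  # fixed seed for reproducibility
samples = sample_joint_prior(20000, (10.0, 30.0), (5.0, 25.0), rng)

# Means of the user-specified priors: midpoints of the uniform bounds.
user_parent_mean = 20.0
user_child_mean = 15.0

# Means of the marginal priors after the age constraint is applied.
marginal_parent_mean = sum(p for p, _ in samples) / len(samples)
marginal_child_mean = sum(c for _, c in samples) / len(samples)

print(f"parent: specified {user_parent_mean:.1f}, marginal {marginal_parent_mean:.2f}")
print(f"child:  specified {user_child_mean:.1f}, marginal {marginal_child_mean:.2f}")
```

Because the two uniform priors overlap, the constraint pushes the marginal prior of the parent towards older ages and that of the child towards younger ages, even though no sequence data are involved.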

Most research into molecular-clock calibrations has focussed on empirical data. A potential limitation of these studies is that the true divergence times and rates of evolution are unknown, making it impossible to assess the accuracy of the phylogenetic estimates. Here we perform an extensive simulation study to assess the impact of different calibration practices on the estimation of evolutionary timescales. By analysing data that were generated under known conditions, we are able to measure the error in the estimates of divergence times and substitution rates. We evaluate the impact of the number and position of calibrations, and investigate how these effects vary with sequence length, substitution rate, and misspecification of the molecular-clock model. We also test whether the correct distribution of rates among branches can be recovered using a Bayesian model-averaging approach. Finally, we examine the interactions among calibrations that lead to differences between the user-specified and marginal calibration priors. Our study provides insights into the effects of using different calibration strategies and offers a number of guidelines for future studies of evolutionary timescales.
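When the true values are known, error can be summarised directly. The following sketch shows one possible way to do so, using relative error and the proportion of credibility intervals that contain the true value; these particular metrics, the function names, and the numbers are assumptions made for illustration and are not necessarily those used in the study.

```python
def relative_error(estimate, truth):
    """Absolute difference from the true value, as a fraction of the truth."""
    return abs(estimate - truth) / truth


def interval_covers(lower, upper, truth):
    """True if the interval [lower, upper] contains the true value."""
    return lower <= truth <= upper


def summarise(replicates):
    """Summarise replicates given as (estimate, lower, upper, truth) tuples.

    Returns the mean relative error and the proportion of intervals
    that contain the true value.
    """
    n = len(replicates)
    mean_err = sum(relative_error(e, t) for e, _, _, t in replicates) / n
    coverage = sum(interval_covers(lo, hi, t) for _, lo, hi, t in replicates) / n
    return mean_err, coverage


# Hypothetical node-age estimates (Myr) from two simulated data sets,
# each with a true node age of 50 Myr.
replicates = [
    (52.0, 45.0, 60.0, 50.0),  # close estimate, interval contains the truth
    (40.0, 35.0, 46.0, 50.0),  # underestimate, interval misses the truth
]
mean_err, coverage = summarise(replicates)
print(f"mean relative error = {mean_err:.2f}, coverage = {coverage:.2f}")
```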

Materials and methods

We simulated nucleotide sequence evolution to produce a large number of datasets, which we used to test hypotheses about calibration practices. The main advantage of using simulated data is that we have complete knowledge of the evolutionary parameters, including the phylogenetic tree, the node ages, the pattern of rate variation among lineages, and the substitution model. Therefore, assessing the impact of different assumptions in the analysis is much easier than with empirical data. However,

Position of calibrations

Our study of the impact of the position of calibrating nodes considered various values for substitution rates, among-lineage rate variation, sequence lengths, and relaxed-clock models. We found that misspecification of the relaxed-clock model always had a negative impact on the error and precision of estimates of rates and node times. Other treatment variables, such as among-lineage rate variation and sequence length, did not have a consistent impact on error and precision. For instance,

Discussion

Our simulation study reveals a number of patterns that are informative for phylogenetic studies of evolutionary timescales. Importantly, we found that sequence length did not considerably affect the molecular-clock estimates. This is due to the nature of our simulations, which resulted in very informative sequence data. Previous studies have found that sequence length is an important factor with shallow phylogenies or uninformative sequences (Brown and Yang, 2010). As sequence length increases,

Conclusions

Overall, our study provides a number of insights into the impact of different calibration schemes on the estimates of evolutionary rates and timescales. The availability of reliable calibrations varies considerably among taxonomic groups. Some taxa have poor fossil and biogeographic records, leading to unreliable calibrations for molecular-clock analyses. In these cases, understanding the impact of different calibration strategies can help to improve estimates of evolutionary rates and

Author contributions

S.D., S.Y.W.H., and R.L. designed the research. S.D. collected and analysed the data. S.D., S.Y.W.H., and R.L. wrote the paper.

Acknowledgments

We thank Joseph Heled and two anonymous reviewers for their helpful comments on our study. SD was supported by a Francisco José de Caldas Scholarship from the Colombian government and by a University of Sydney World Scholars Award. SYWH was supported by the Australian Research Council.

References (76)

  • M.J. Benton et al.

    Paleontological evidence to date the tree of life

    Mol. Biol. Evol.

    (2007)
  • Bloom, J.D., 2014. An experimentally determined evolutionary model dramatically improves phylogenetic fit. bioRxiv....
  • M.C. Brandley et al.

    Accommodating heterogenous rates of evolution in molecular divergence dating methods: an example using intercontinental dispersal of Plestiodon (Eumeces) lizards

    Syst. Biol.

    (2011)
  • R.P. Brown et al.

    Bayesian dating of shallow phylogenies with a relaxed clock

    Syst. Biol.

    (2010)
  • R. Brown et al.

    Rate variation and estimation of divergence times using strict and relaxed clocks

    BMC Evol. Biol.

    (2011)
  • J.W. Brown et al.

    Strong mitochondrial DNA support for a Cretaceous origin of modern avian lineages

    BMC Biol.

    (2008)
  • C.J. Conroy et al.

    Extracting time from phylogenies: positive interplay between fossil and genetic data

    J. Mammal.

    (2003)
  • M. Dos Reis et al.

    Approximate likelihood calculation on a phylogeny for Bayesian estimation of divergence times

    Mol. Biol. Evol.

    (2011)
  • M. Dos Reis et al.

    The unbearable uncertainty of Bayesian divergence time estimation

    J. Syst. Evol.

    (2013)
  • A.J. Drummond et al.

    BEAST: Bayesian evolutionary analysis by sampling trees

    BMC Evol. Biol.

    (2007)
  • A.J. Drummond et al.

    Relaxed phylogenetics and dating with confidence

    PLoS Biol.

    (2006)
  • A.J. Drummond et al.

    Bayesian phylogenetics with BEAUti and the BEAST 1.7

    Mol. Biol. Evol.

    (2012)
  • M.A. Gandolfo et al.

    Selection of fossils for calibration of molecular dating models

    Ann. Missouri Bot. Gard.

    (2008)
  • L.J. Harmon et al.

    GEIGER: investigating evolutionary radiations

    Bioinformatics

    (2008)
  • T.A. Heath

    A hierarchical Bayesian model for calibrating estimates of species divergence times

    Syst. Biol.

    (2012)
  • T.A. Heath et al.

    A Dirichlet process prior for estimating lineage-specific substitution rates

    Mol. Biol. Evol.

    (2012)
  • J. Heled et al.

    Calibrated tree priors for relaxed phylogenetics and divergence time estimation

    Syst. Biol.

    (2012)
  • Heled, J., Drummond, A.J., 2013. Calibrated birth-death phylogenetic time-tree priors for Bayesian inference. arXiv...
  • S.Y.W. Ho

    Calibrating molecular estimates of substitution rates and divergence times in birds

    J. Avian Biol.

    (2007)
  • S.Y.W. Ho

    An examination of phylogenetic models of substitution rate variation among lineages

    Biol. Lett.

    (2009)
  • S.Y.W. Ho et al.

    Accounting for calibration uncertainty in phylogenetic estimation of evolutionary divergence times

    Syst. Biol.

    (2009)
  • S.Y.W. Ho et al.

    Time dependency of molecular rate estimates and systematic overestimation of recent divergence times

    Mol. Biol. Evol.

    (2005)
  • S.Y.W. Ho et al.

    Accuracy of rate estimation using relaxed-clock models with a critical focus on the early metazoan radiation

    Mol. Biol. Evol.

    (2005)
  • E.C. Holmes

    Molecular clocks and the puzzle of RNA virus origins

    J. Virol.

    (2003)
  • L.A. Hug et al.

    The impact of fossils and taxon sampling on ancient molecular dating analyses

    Mol. Biol. Evol.

    (2007)
  • J. Inoue et al.

    The impact of the representation of fossil calibrations on Bayesian estimation of species divergence times

    Syst. Biol.

    (2010)
  • H. Kishino et al.

    Performance of a divergence time estimation method under a probabilistic model of rate evolution

    Mol. Biol. Evol.

    (2001)
  • M.A. Larkin et al.

    Clustal W and Clustal X version 2.0

    Bioinformatics

    (2007)