Abstract
We introduce a low dimensional function of the site frequency spectrum that is tailor-made for distinguishing coalescent models with multiple mergers from Kingman coalescent models with population growth, and use this function to construct a hypothesis test between these model classes. The null and alternative sampling distributions of the statistic are intractable, but its low dimensionality renders them amenable to Monte Carlo estimation. We construct kernel density estimates of the sampling distributions based on simulated data, and show that the resulting hypothesis test dramatically improves on the statistical power of a current state-of-the-art method. A key reason for this improvement is the use of multi-locus data, in particular averaging observed site frequency spectra across unlinked loci to reduce sampling variance. We also demonstrate the robustness of our method to nuisance and tuning parameters. Finally we show that the same kernel density estimates can be used to conduct parameter estimation, and argue that our method is readily generalisable for applications in model selection, parameter inference and experimental design.
Acknowledgement
The author is grateful to Bjarki Eldon, Jochen Blath and Matthias Birkner for their help and suggestions concerning summary statistics for multiple merger detection, and was supported by Deutsche Forschungsgemeinschaft (DFG) grant BL 1105/3-2. Part of this work was completed while visiting Matthias Birkner at the Johannes Gutenberg Universität Mainz, funded by DFG grant BI 1058/2-2. Both grants are part of SPP Priority Programme 1590.
A Appendix
A Λ-coalescent for L unlinked loci with ni(t) lineages at locus i ∈ [L] can be simulated naively by setting
selecting K lineages uniformly at random without replacement from among the N possible segments, merging all groups of lineages which occupy the same locus, and incrementing time by a step drawn from an exponential distribution with rate given by the sum of the weights (6). The remaining sample size N can then be updated, and the process iterated until the most recent common ancestor (MRCA) is reached, at which point the algorithm terminates. Diploid, biparental Ξ-coalescent models developed in (Birkner, Blath & Eldon , 2013a) can also be incorporated by further randomly assigning all lineages into 4 groups, and only merging lineages which occupy the same locus and the same group.
However, normalising the weights (6) is an O(N) operation, which is expensive when the sample size and number of loci are large. Moreover, the resulting merger is unlikely to sample two or more lineages occupying the same locus unless Λ(dx) is concentrated near 1 (resulting in very large mergers), so that most steps of the algorithm result in no progress towards the MRCA. In practice, this naive method was infeasibly slow for our simulation-based model selection. Here we present a faster alternative based on rejection sampling.
The following is a modification of the Poisson construction of Ξ-coalescents given in (Birkner et al., 2009, Section 1.4) to the present multi-locus, biparental, diploid Ξ-coalescents. Consider a Poisson point process Υ on [0, ∞) × [0, 1] × [0, 1]ℕ × L with intensity
At each point
that is, at the time of the ith point ti, each of the ni(t) ≤ ni active lineages at each locus checks whether ui(j, k) ≤ yi for each k ∈ [L], j ∈ [ni(t)]. All lineages for which this is the case take part in a merger within the locus. These merging lineages are the uniformly at random assigned into one of four groups, and any groups with more than one member are merged into a common ancestor.
A technical problem from the simulation point of view is that the intensity measure of our Poisson point process Υ is infinite for Λ(dx) corresponding to a Beta(β1, β2)-coalescent, and hence also for a Beta(2 − α, α)-Ξ-coalescent for any α ≥ 0. However, if we restrict Υ to the set where ui(j, k) ≤ yi for at least one j ∈ [L] and two distinct ks from [nj(t)], the restricted intensity is finite, and given by
where
This restricted intensity corresponds to thinning Υ by only retaining those events in which there is at least one locus in which two lineages successfully participate in the event.
An alternative simulation mechanism is thus as follows: Given n1(t), …, nL(t), consider a new thinned Poisson point process
where
within-locus pairs. Further mergers or merging lineages can then be added according to independent uniform coin flips with success probability yi. As before, all merging lineages within a locus have to be grouped into four uniform classes in the diploid case, and only lineages within the same locus and class can merge. If no class contains at least two lineages, then nothing happens.
It remains to specify a way of sampling the pairs (ti, yi) with intensity (8). Note that f(x; n1(t), …, nL(t)) is maximised at x = 0+ for fixed (n1(t), …, nL(t)), where it takes the value c(t) from (9). Thus, pairs with the correct intensity can be generated by sampling a time T(1) ∼ Exp(c(t)), an independent U(1) ∼ U[0, 1] and a further independent Y(1) ∼ Λ, and by checking whether
If yes, accept (T(1), Y(1)) for (t1, y1). If not, repeat this procedure, generating independent tuples (T(2), U(2), Y(2)) etc. If the first success requires k steps, accept (T(1) + … + T(k), Y(k)) for (t1, y1), at which point the merger event can be simulated and n1(t1), …, nL(t1), f(⋅; n1(t), … nL(t)) and c(t) updated accordingly. This procedure is then repeated to obtain (t2, y2), and so on, until the most recent common ancestor is reached.
Remark 4
This method is straightforward to adapt to the case of variable, deterministic population size by first generating the next coalescence time under a constant population size, and then adjusting it to the variable population size scenario using the method of Donnelly and Tavaré (1995).
Remark 5
This procedure also covers the Kingman coalescent case Λ = δ0, in which Y(1) ≡ 0 and f(0, n1(t), …, nL(t)) ≡ c(t). This means that only one pair of lineages ever merges at any given time, the coin flips for all other lineages need not be simulated, and all proposed mergers are accepted. More general atoms at 0 can also be handled similarly by first sampling Y(1) ∼ Λ, and then setting f(0; n1(t), …, nL(t)) = c(t) if Y(1) = 0.
For x close to 0, the function f(x; n1(t), …, nL(t)) is numerically unstable because of the very small number in the denominator in (7), rendering the procedure outlined above unreliable. Note also that it is in fact a polynomial in x and for small arguments, readily represented via
as can be verified as follows. For n ∈ ℕ we have
where we have used the fact that the terms for k = 1 and for j = 0 in the first line cancel, then replaced j = k − 1 in the second sum in the first line and used
in the second line. Thus
and subtracting from 1 before dividing by x2 to obtain f(x; n1(t), …, nL(t)) yields (10).
Numerical experiments were performed to determine that the polynomial approximation (10) needed to be used whenever x ≤ 10−4. The form of higher order terms in (11) shows that the constant of proportionality of the O(xk) term in (10) contains factors that are O(nk). Hence, the algorithm cannot remain accurate for samples of size n = 104, as then approximation errors are O(1) for a polynomial approximation of any order, short of the trivial case of using the full, finite expansion which recovers the function f(x; n1(t), …, nL(t)) exactly. We remark that using the full polynomial expansion of f(x; n1(t), …, nL(t)) to compute it exactly is also prone to numerical instability due to alternating signs. However, the first order truncation shown in (10) was found to be fast, accurate and stable for the sample sizes considered in this paper, which we have also shown to be sufficient for our aim of distinguishing multiple merger coalescents from exponential or algebraic population growth models.
We conclude this appendix by specifying how the summary statistic (2) can be computed from data sets, as well as simulated realisations of the coalescent tree. The form of (2) as an average across loci implies that it is sufficient to consider a single locus in both cases.
First, consider a SNP data set of n individuals sequenced at a single locus. We assume the data is consistent with the infinite alleles model of mutation, and associate to individual i the list xi = (li(1), …, li(ki)), where li(q) is the position of the qth derived allele from the left on individual i, and ki is the number of mutations carried by that individual. For an location l on the locus, we say l ∈ xi if l = li(q) for some q ∈ {1, …, ki}, and l ≠ xi otherwise. Then the pair
Now consider a simulated ancestral tree for n individuals. The tree consist of q ∈ {n, …, 2n} branches, which we assume are labelled with mutations. We impose an arbitrary ordering on the branches, and denote the number of mutations on branch i by mi, the set of indices of the immediate child branches of branch i by ci, and the number of leaves subtended by branch i by si. The si’s can be computed recursively for a given tree by setting
Once the si’s are computed, the summary statistic can be expressed as
References
Achaz, G. (2008): “Testing for neutrality in samples with sequencing errors,” Genetics, 179, 1409–1424.10.1534/genetics.107.082198Search in Google Scholar PubMed PubMed Central
Árnason, E. (2004): “Mitochondrial cytochrome b variation in the high-fecundity Atlantic cod: trans-Atlantic clines and shallow gene genealogy.” Genetics, 166, 1871–1885.10.1093/genetics/166.4.1871Search in Google Scholar
Beaumont, M. A. (2010): “Approximate Bayesian computation in evolution and ecology,” Annu. Rev. Ecol. Evol. Syst., 41, 379–406.10.1146/annurev-ecolsys-102209-144621Search in Google Scholar
Beckenbach, A. T. (1994): “Mitochondrial haplotype frequencies in oysters: neutral alternatives to selection models,” In: Golding, B. (Ed.), Non-neutral evolution. New York: Chapman & Hall, pp. 188–198.10.1007/978-1-4615-2383-3_15Search in Google Scholar
Birkner, M. and J. Blath (2008): “Computing likelihoods for coalescents with multiple collisions in the infinitely many sites model,” J. Math. Biol., 57, 435–465.10.1007/s00285-008-0170-6Search in Google Scholar PubMed
Birkner, M., J. Blath, M. Möhle, M. Steinrücken, and J. Tams (2009): “A modified lookdown construction for the Xi-Fleming-Viot process with mutation and populations with recurrent bottlenecks,” ALEA Lat. Am. J. Probab. Math. Stat., 6, 25–61.Search in Google Scholar
Birkner, M., J. Blath, and M. Steinrücken (2011): “Importance sampling for Lambda-coalescents in the infinitely many sites model,” Theor. Popul. Biol., 79, 155–173.10.1016/j.tpb.2011.01.005Search in Google Scholar PubMed PubMed Central
Birkner, M., J. Blath, and B. Eldon (2013a): “An ancestral recombination graph for diploid populations with skewed offspring distribution,” Genetics, 193, 255–290.10.1534/genetics.112.144329Search in Google Scholar PubMed PubMed Central
Birkner, M., J. Blath, and B. Eldon (2013b): “Statistical properties of the site-frequency spectrum associated with Lambda-coalescents,” Genetics, 195, 1037–1053.10.1534/genetics.113.156612Search in Google Scholar PubMed PubMed Central
Birkner, M., H. Liu, and A. Sturm (2017): “A note on coalescent results for diploid exchangeable population models,” Preprint, arXiv:1709.02563v2.10.1214/18-EJP175Search in Google Scholar
Blath, J., M. C. Cronjäger, B. Eldon, and M. Hammer (2016): “The site-frequency spectrum associated with Ξ-coalescents,” Theor. Popul. Biol., 110, 36–50.10.1016/j.tpb.2016.04.002Search in Google Scholar PubMed
Depaulis, F. and M. Veuille (1998): “Neutrality tests based on the distribution of haplotypes under an infinite-site model,” Mol. Biol. Evol., 15, 1788.10.1093/oxfordjournals.molbev.a025905Search in Google Scholar PubMed
Diggle, P. J. and R. J. Gratton (1984): “Monte Carlo methods of inference for implicit statistical models,” J. R. Stat. Soc. B, 46, 193–227.10.1111/j.2517-6161.1984.tb01290.xSearch in Google Scholar
Donnelly, P. and T. G. Kurtz (1999): “Particle representations for measure-valued population models,” Ann. Probab., 27, 166–205.10.1214/aop/1022677258Search in Google Scholar
Donnelly, P. and S. Tavaré (1995): “Coalescents and genealogical structure under neutrality,” Annu. Rev. Genet., 29, 401–421.10.1146/annurev.ge.29.120195.002153Search in Google Scholar PubMed
Duong, T. and M. L. Hazelton (2003): “Plug-in bandwidth matrices for bivariate kernel density estimation,” J. Nonparametr Stat., 15, 17–30.10.1080/10485250306039Search in Google Scholar
Durrett, R. and J. Schweinsberg (2005): “A coalescent model for the effect of advantageous mutations on the genealogy of a population,” Stoch. Proc. Appl., 115, 1628–1657.10.1016/j.spa.2005.04.009Search in Google Scholar
Eldon, B. (2011): “Estimation of parameters in large offspring number models and ratios of coalescence times,” Theor. Popul. Biol., 80, 16–28.10.1016/j.tpb.2011.04.002Search in Google Scholar PubMed
Eldon, B. and J. Wakeley (2006): “Coalescent processes when the distribution of offspring number among individuals is highly skewed,” Genetics, 172, 2621–2633.10.1534/genetics.105.052175Search in Google Scholar PubMed PubMed Central
Eldon, B. and J. Wakeley (2009): “Coalescence times and FST under a skewed offspring distribution among individuals in a population,” Genetics, 181, 615–629.10.1534/genetics.108.094342Search in Google Scholar PubMed PubMed Central
Eldon, B., M. Birkner, J. Blath, and F. Freund (2015): “Can the site frequency spectrum distinguish exponential population growth from multiple-merger coalescents,” Genetics, 199, 841–856.10.1534/genetics.114.173807Search in Google Scholar PubMed PubMed Central
Fay, J. C. and C.-I. Wu (2000): “Hitchhiking under positive Darwinian selection,” Genetics, 155, 1405–1413.10.1093/genetics/155.3.1405Search in Google Scholar
Fu, Y. X. (1995): “Statistical properties of segregating sites,” Theor. Popul. Biol., 48, 172–197.10.1006/tpbi.1995.1025Search in Google Scholar
Fu, Y. X. and W. H. Li (1993): “Statistical tests of neutrality of mutations,” Genetics, 133, 693–709.10.1093/genetics/133.3.693Search in Google Scholar
Hedgecock, D. and A. I. Pudovkin (2011): “Sweepstakes reproductive success in highly fecund marine fish and shellfish: a review and commentary,” Bull. Mar. Sci., 87, 971–1002.10.5343/bms.2010.1051Search in Google Scholar
Hein, J., M. H. Schierup, and C. Wiuf (2005): Gene genealogies, variation and evolution. Oxford, UK: Oxford University Press.Search in Google Scholar
Hudson, R. R. (1983a): “Properties of a neutral allele model with intragenic recombination,” Theor. Popul. Biol., 23, 183–201.10.1016/0040-5809(83)90013-8Search in Google Scholar
Hudson, R. R. (1983b): “Testing the constant-rate neutral allele model with protein sequence data,” Evolution, 37, 203–217.10.1111/j.1558-5646.1983.tb05528.xSearch in Google Scholar
Hudson, R. R. (1990): “Gene genealogies and the coalescent process,” In: Futuyma, D. J., Antonovics, J. (Eds.), Oxford surveys in evolutionary biology, Vol. 7. Oxford: Oxford University Press, pp. 1–44.Search in Google Scholar
Kingman, J. F. C. (1982a): “The coalescent,” Stoch. Proc. Appl., 13, 235–248.10.1016/0304-4149(82)90011-4Search in Google Scholar
Kingman, J. F. C. (1982b): “Exchangeability and the evolution of large populations,” In: Koch, G., Spizzichino, F., (Eds.), Exchangeability in probability and statistics. Amsterdam: North-Holland, pp. 97–112.Search in Google Scholar
Kingman, J. F. C. (1982c): “On the genealogy of large populations,” J. Appl. Probab., 19A, 27–43.10.1017/S0021900200034446Search in Google Scholar
Koskela, J., P. Jenkins, and D. Spanò (2015): “Computational inference beyond Kingman’s coalescent,” J. Appl. Probab., 52, 519–537.10.1017/S0021900200012614Search in Google Scholar
Koskela, J., P. Jenkins, and D. Spanò (2018): “Bayesian non-parametric inference for Λ-coalescents: posterior consistency and a parametric method,” Bernoulli, 24, 2122–2153.10.3150/16-BEJ923Search in Google Scholar
Möhle, M. (1998): “Robustness results for the coalescent,” J. Appl. Probab., 35, 438–447.10.1239/jap/1032192859Search in Google Scholar
Nordborg, M. (2001): “Coalescent theory,” In: Balding, D. J., Bishop, M. J., Cannings, C. (Eds.), Handbook of statistical genetics, chapter 25, 2nd edn. Chichester, UK: John Wiley & Sons, pp. 179–212.Search in Google Scholar
Pitman, J. (1999): “Coalescents with multiple collisions,” Ann. Probab., 27, 1870–1902.10.1214/aop/1022874819Search in Google Scholar
Ramos-Onsins, S. E. and J. Rozas (2002): “Statistical properties of new neutrality tests against population growth,” Mol. Biol. Evol., 19, 2092–2100.10.1093/oxfordjournals.molbev.a004034Search in Google Scholar
Sagitov, S. (1999): “The general coalescent with asynchronous mergers of ancestral lines,” J. Appl. Probab., 36, 1116–1125.10.1239/jap/1032374759Search in Google Scholar
Sargsyan, O. and J. Wakeley (2008): “A coalescent process with simultaneous multiple mergers for approximating the gene genealogies of many marine organisms,” Theor. Popul. Biol., 74, 104–114.10.1016/j.tpb.2008.04.009Search in Google Scholar
Schweinsberg, J. (2003): “Coalescent processes obtained from supercritical Galton-Watson processes,” Stoch. Proc. Appl., 106, 107–139.10.1016/S0304-4149(03)00028-0Search in Google Scholar
Scott, D. W. (1992): Multivariate density estimation: theory, practice and visualization. New York: John Wiley & Sons.10.1002/9780470316849Search in Google Scholar
Steinrücken, M., M. Birkner, and J. Blath (2013): “Analysis of DNA sequence variation within marine species using beta-coalescents,” Theor. Popul. Biol., 87, 15–24.10.1016/j.tpb.2013.01.007Search in Google Scholar PubMed PubMed Central
Tajima, F. (1983): “Evolutionary relationship of DNA sequences in finite populations,” Genetics, 105, 437–460.10.1093/genetics/105.2.437Search in Google Scholar PubMed PubMed Central
Tajima, F. (1989): “The effect of change in population size on DNA polymorphism,” Genetics, 123, 597–601.10.1093/genetics/123.3.597Search in Google Scholar PubMed PubMed Central
Tellier, A. and C. Lemaire (2014): “Coalescence 2.0: a multiple branching of recent theoretical developments and their applications,” Mol. Ecol., 23, 2637–2652.10.1111/mec.12755Search in Google Scholar
Tørresen, O. K., B. Star, S. Jentoft, W. B. Reinar, H. Grove, J. R. Miller, B. P. Walenz, J. Knight, J. M. Ekholm, P. Peluso, R. B. Edvardsen, A. Tooming-Klunderud, M. Skage, S. Lien, K. S. Jakobsen, and A. J. Nederbragt (2017): “An improved genome assembly uncovers prolific tandem repeats in Atlantic cod,” BMC Genomics, 18, 95.10.1186/s12864-016-3448-xSearch in Google Scholar
Wakeley, J. (2007): Coalescent theory. Greenwood Village: Roberts & Co.Search in Google Scholar
Watterson, G. A. (1975): “On the number of segregating sites in genetical models without recombination,” Theor. Pop. Biol., 7, 1539–1546.10.1016/0040-5809(75)90020-9Search in Google Scholar
Zhu, S., J. H. Degnan, S. J. Goldstein, and B. Eldon (2015): “Hybrid-Lambda: simulation of multiple merger and Kingman gene genealogies in species networks and species trees,” BMC Bioinformatics, 16.10.1186/s12859-015-0721-ySearch in Google Scholar PubMed PubMed Central
©2018 Walter de Gruyter GmbH, Berlin/Boston