Multi-locus data distinguishes between population growth and multiple merger coalescents

Jere Koskela

doi:10.1515/sagmb-2017-0011

Published by De Gruyter June 13, 2018

Multi-locus data distinguishes between population growth and multiple merger coalescents

Jere Koskela

From the journal Statistical Applications in Genetics and Molecular Biology

https://doi.org/10.1515/sagmb-2017-0011

Showing a limited preview of this publication:

Abstract

We introduce a low dimensional function of the site frequency spectrum that is tailor-made for distinguishing coalescent models with multiple mergers from Kingman coalescent models with population growth, and use this function to construct a hypothesis test between these model classes. The null and alternative sampling distributions of the statistic are intractable, but its low dimensionality renders them amenable to Monte Carlo estimation. We construct kernel density estimates of the sampling distributions based on simulated data, and show that the resulting hypothesis test dramatically improves on the statistical power of a current state-of-the-art method. A key reason for this improvement is the use of multi-locus data, in particular averaging observed site frequency spectra across unlinked loci to reduce sampling variance. We also demonstrate the robustness of our method to nuisance and tuning parameters. Finally we show that the same kernel density estimates can be used to conduct parameter estimation, and argue that our method is readily generalisable for applications in model selection, parameter inference and experimental design.

Keywords: kernel density estimation; multiple merger coalescent; population growth; site frequency spectrum; statistical power

MSC 2010: Primary: 92D10; Secondary: 62M02; 62F03

Acknowledgement

The author is grateful to Bjarki Eldon, Jochen Blath and Matthias Birkner for their help and suggestions concerning summary statistics for multiple merger detection, and was supported by Deutsche Forschungsgemeinschaft (DFG) grant BL 1105/3-2. Part of this work was completed while visiting Matthias Birkner at the Johannes Gutenberg Universität Mainz, funded by DFG grant BI 1058/2-2. Both grants are part of SPP Priority Programme 1590.

A Appendix

A Λ-coalescent for L unlinked loci with n_i(t) lineages at locus i ∈ [L] can be simulated naively by setting N:=∑i=1Lni(t), sampling a merger size K from {2, …, N} proportional to weights

(6)(Nj)∫01xj−2(1−x)N−jΛ(dx),

selecting K lineages uniformly at random without replacement from among the N possible segments, merging all groups of lineages which occupy the same locus, and incrementing time by a step drawn from an exponential distribution with rate given by the sum of the weights (6). The remaining sample size N can then be updated, and the process iterated until the most recent common ancestor (MRCA) is reached, at which point the algorithm terminates. Diploid, biparental Ξ-coalescent models developed in (Birkner, Blath & Eldon , 2013a) can also be incorporated by further randomly assigning all lineages into 4 groups, and only merging lineages which occupy the same locus and the same group.

However, normalising the weights (6) is an O(N) operation, which is expensive when the sample size and number of loci are large. Moreover, the resulting merger is unlikely to sample two or more lineages occupying the same locus unless Λ(dx) is concentrated near 1 (resulting in very large mergers), so that most steps of the algorithm result in no progress towards the MRCA. In practice, this naive method was infeasibly slow for our simulation-based model selection. Here we present a faster alternative based on rejection sampling.

The following is a modification of the Poisson construction of Ξ-coalescents given in (Birkner et al., 2009, Section 1.4) to the present multi-locus, biparental, diploid Ξ-coalescents. Consider a Poisson point process Υ on [0, ∞) × [0, 1] × [0, 1]^{ℕ × L} with intensity

dt⊗Λ(dx)x2⊗U[0,1]⊗(L×N).

At each point

(ti,yi,ui)∈Υ,

that is, at the time of the i^th point t_i, each of the n_i(t) ≤ n_i active lineages at each locus checks whether u_i(j, k) ≤ y_i for each k ∈ [L], j ∈ [n_i(t)]. All lineages for which this is the case take part in a merger within the locus. These merging lineages are the uniformly at random assigned into one of four groups, and any groups with more than one member are merged into a common ancestor.

A technical problem from the simulation point of view is that the intensity measure of our Poisson point process Υ is infinite for Λ(dx) corresponding to a Beta(β₁, β₂)-coalescent, and hence also for a Beta(2 − α, α)-Ξ-coalescent for any α ≥ 0. However, if we restrict Υ to the set where u_i(j, k) ≤ y_i for at least one j ∈ [L] and two distinct ks from [n_j(t)], the restricted intensity is finite, and given by

∫01f(x;n1(t),…,nL(t))Λ(dx),

where

(7)f(x;n1(t),…,nL(t)):=1−∏j=1L((1−x)nj(t)+nj(t)x(1−x)nj(t)−1)x2.

This restricted intensity corresponds to thinning Υ by only retaining those events in which there is at least one locus in which two lineages successfully participate in the event.

An alternative simulation mechanism is thus as follows: Given n₁(t), …, n_L(t), consider a new thinned Poisson point process Υ~ with intensity

(8)dt⊗Λ~(dx),

where Λ~(dx):=f(x;n1(t),…,nL(t))Λ(dx) determines the merging times and success probabilities (t_i, y_i). Given (t_i, y_i), which now appear with finite rate, independently and uniformly choose a pair of lineages within a locus to merge out of the

(9)(n1(t)2)+…+(nL(t)2)=:c(t)

within-locus pairs. Further mergers or merging lineages can then be added according to independent uniform coin flips with success probability y_i. As before, all merging lineages within a locus have to be grouped into four uniform classes in the diploid case, and only lineages within the same locus and class can merge. If no class contains at least two lineages, then nothing happens.

It remains to specify a way of sampling the pairs (t_i, y_i) with intensity (8). Note that f(x; n₁(t), …, n_L(t)) is maximised at x = 0+ for fixed (n₁(t), …, n_L(t)), where it takes the value c(t) from (9). Thus, pairs with the correct intensity can be generated by sampling a time T⁽¹⁾ ∼ Exp(c(t)), an independent U⁽¹⁾ ∼ U[0, 1] and a further independent Y⁽¹⁾ ∼ Λ, and by checking whether

U(1)≤f(Y(1);n1(t),…,nL(t))c(t).

If yes, accept (T⁽¹⁾, Y⁽¹⁾) for (t₁, y₁). If not, repeat this procedure, generating independent tuples (T⁽²⁾, U⁽²⁾, Y⁽²⁾) etc. If the first success requires k steps, accept (T⁽¹⁾ + … + T^(k), Y^(k)) for (t₁, y₁), at which point the merger event can be simulated and n₁(t₁), …, n_L(t₁), f(⋅; n₁(t), … n_L(t)) and c(t) updated accordingly. This procedure is then repeated to obtain (t₂, y₂), and so on, until the most recent common ancestor is reached.

Remark 4

This method is straightforward to adapt to the case of variable, deterministic population size by first generating the next coalescence time under a constant population size, and then adjusting it to the variable population size scenario using the method of Donnelly and Tavaré (1995).

Remark 5

This procedure also covers the Kingman coalescent case Λ = δ₀, in which Y⁽¹⁾ ≡ 0 and f(0, n₁(t), …, n_L(t)) ≡ c(t). This means that only one pair of lineages ever merges at any given time, the coin flips for all other lineages need not be simulated, and all proposed mergers are accepted. More general atoms at 0 can also be handled similarly by first sampling Y⁽¹⁾ ∼ Λ, and then setting f(0; n₁(t), …, n_L(t)) = c(t) if Y⁽¹⁾ = 0.

For x close to 0, the function f(x; n₁(t), …, n_L(t)) is numerically unstable because of the very small number in the denominator in (7), rendering the procedure outlined above unreliable. Note also that it is in fact a polynomial in x and for small arguments, readily represented via

(10)f(x;n1(t),…,nL(t))=∑j=1L(nj(t)2)−2∑j=1m(nj(t)3)x+O(x2),

as can be verified as follows. For n ∈ ℕ we have

(11)(1−x)n+nx(1−x)n−1=1+∑k=1n(nk)(−x)k+∑j=0n−1n(n−1j)(−1)jxj+1=1+∑k=2n(−x)k[(nk)−n(n−1k−1)]=1−∑k=2n(k−1)(nk)(−x)k

where we have used the fact that the terms for k = 1 and for j = 0 in the first line cancel, then replaced j = k − 1 in the second sum in the first line and used

(nk)−n(n−1k−1)=n!k!(n−k)!−n(n−1)!(k−1)!(n−k)!=(nk)(1−k)

in the second line. Thus

∏j=1L((1−x)nj(t)+nj(t)x(1−x)nj(t)−1)=∏j=1L(1−(nj(t)2)x2+2(nj(t)3)x3+O(x4))=1−∑j=1L(nj(t)2)x2+2∑j=1L(nj(t)3)x3+O(x4)

and subtracting from 1 before dividing by x² to obtain f(x; n₁(t), …, n_L(t)) yields (10).

Numerical experiments were performed to determine that the polynomial approximation (10) needed to be used whenever x ≤ 10⁻⁴. The form of higher order terms in (11) shows that the constant of proportionality of the O(x^k) term in (10) contains factors that are O(n^k). Hence, the algorithm cannot remain accurate for samples of size n = 10⁴, as then approximation errors are O(1) for a polynomial approximation of any order, short of the trivial case of using the full, finite expansion which recovers the function f(x; n₁(t), …, n_L(t)) exactly. We remark that using the full polynomial expansion of f(x; n₁(t), …, n_L(t)) to compute it exactly is also prone to numerical instability due to alternating signs. However, the first order truncation shown in (10) was found to be fast, accurate and stable for the sample sizes considered in this paper, which we have also shown to be sufficient for our aim of distinguishing multiple merger coalescents from exponential or algebraic population growth models.

We conclude this appendix by specifying how the summary statistic (2) can be computed from data sets, as well as simulated realisations of the coalescent tree. The form of (2) as an average across loci implies that it is sufficient to consider a single locus in both cases.

First, consider a SNP data set of n individuals sequenced at a single locus. We assume the data is consistent with the infinite alleles model of mutation, and associate to individual i the list x_i = (l_i(1), …, l_i(k_i)), where l_i(q) is the position of the q^th derived allele from the left on individual i, and k_i is the number of mutations carried by that individual. For an location l on the locus, we say l ∈ x_i if l = l_i(q) for some q ∈ {1, …, k_i}, and l ≠ x_i otherwise. Then the pair (ζ1(n),ζ¯k(n)) can be computed as

Now consider a simulated ancestral tree for n individuals. The tree consist of q ∈ {n, …, 2n} branches, which we assume are labelled with mutations. We impose an arbitrary ordering on the branches, and denote the number of mutations on branch i by m_i, the set of indices of the immediate child branches of branch i by c_i, and the number of leaves subtended by branch i by s_i. The s_i’s can be computed recursively for a given tree by setting

si={1 if ci=∅,∑j∈cisj otherwise.

Once the s_i’s are computed, the summary statistic can be expressed as

References

Achaz, G. (2008): “Testing for neutrality in samples with sequencing errors,” Genetics, 179, 1409–1424.10.1534/genetics.107.082198Search in Google Scholar PubMed PubMed Central

Árnason, E. (2004): “Mitochondrial cytochrome b variation in the high-fecundity Atlantic cod: trans-Atlantic clines and shallow gene genealogy.” Genetics, 166, 1871–1885.10.1093/genetics/166.4.1871Search in Google Scholar

Beaumont, M. A. (2010): “Approximate Bayesian computation in evolution and ecology,” Annu. Rev. Ecol. Evol. Syst., 41, 379–406.10.1146/annurev-ecolsys-102209-144621Search in Google Scholar

Beckenbach, A. T. (1994): “Mitochondrial haplotype frequencies in oysters: neutral alternatives to selection models,” In: Golding, B. (Ed.), Non-neutral evolution. New York: Chapman & Hall, pp. 188–198.10.1007/978-1-4615-2383-3_15Search in Google Scholar

Birkner, M. and J. Blath (2008): “Computing likelihoods for coalescents with multiple collisions in the infinitely many sites model,” J. Math. Biol., 57, 435–465.10.1007/s00285-008-0170-6Search in Google Scholar PubMed

Birkner, M., J. Blath, M. Möhle, M. Steinrücken, and J. Tams (2009): “A modified lookdown construction for the Xi-Fleming-Viot process with mutation and populations with recurrent bottlenecks,” ALEA Lat. Am. J. Probab. Math. Stat., 6, 25–61.Search in Google Scholar

Birkner, M., J. Blath, and M. Steinrücken (2011): “Importance sampling for Lambda-coalescents in the infinitely many sites model,” Theor. Popul. Biol., 79, 155–173.10.1016/j.tpb.2011.01.005Search in Google Scholar PubMed PubMed Central

Birkner, M., J. Blath, and B. Eldon (2013a): “An ancestral recombination graph for diploid populations with skewed offspring distribution,” Genetics, 193, 255–290.10.1534/genetics.112.144329Search in Google Scholar PubMed PubMed Central

Birkner, M., J. Blath, and B. Eldon (2013b): “Statistical properties of the site-frequency spectrum associated with Lambda-coalescents,” Genetics, 195, 1037–1053.10.1534/genetics.113.156612Search in Google Scholar PubMed PubMed Central

Birkner, M., H. Liu, and A. Sturm (2017): “A note on coalescent results for diploid exchangeable population models,” Preprint, arXiv:1709.02563v2.10.1214/18-EJP175Search in Google Scholar

Blath, J., M. C. Cronjäger, B. Eldon, and M. Hammer (2016): “The site-frequency spectrum associated with Ξ-coalescents,” Theor. Popul. Biol., 110, 36–50.10.1016/j.tpb.2016.04.002Search in Google Scholar PubMed

Depaulis, F. and M. Veuille (1998): “Neutrality tests based on the distribution of haplotypes under an infinite-site model,” Mol. Biol. Evol., 15, 1788.10.1093/oxfordjournals.molbev.a025905Search in Google Scholar PubMed

Diggle, P. J. and R. J. Gratton (1984): “Monte Carlo methods of inference for implicit statistical models,” J. R. Stat. Soc. B, 46, 193–227.10.1111/j.2517-6161.1984.tb01290.xSearch in Google Scholar

Donnelly, P. and T. G. Kurtz (1999): “Particle representations for measure-valued population models,” Ann. Probab., 27, 166–205.10.1214/aop/1022677258Search in Google Scholar

Donnelly, P. and S. Tavaré (1995): “Coalescents and genealogical structure under neutrality,” Annu. Rev. Genet., 29, 401–421.10.1146/annurev.ge.29.120195.002153Search in Google Scholar PubMed

Duong, T. and M. L. Hazelton (2003): “Plug-in bandwidth matrices for bivariate kernel density estimation,” J. Nonparametr Stat., 15, 17–30.10.1080/10485250306039Search in Google Scholar

Durrett, R. and J. Schweinsberg (2005): “A coalescent model for the effect of advantageous mutations on the genealogy of a population,” Stoch. Proc. Appl., 115, 1628–1657.10.1016/j.spa.2005.04.009Search in Google Scholar

Eldon, B. (2011): “Estimation of parameters in large offspring number models and ratios of coalescence times,” Theor. Popul. Biol., 80, 16–28.10.1016/j.tpb.2011.04.002Search in Google Scholar PubMed

Eldon, B. and J. Wakeley (2006): “Coalescent processes when the distribution of offspring number among individuals is highly skewed,” Genetics, 172, 2621–2633.10.1534/genetics.105.052175Search in Google Scholar PubMed PubMed Central

Eldon, B. and J. Wakeley (2009): “Coalescence times and F_ST under a skewed offspring distribution among individuals in a population,” Genetics, 181, 615–629.10.1534/genetics.108.094342Search in Google Scholar PubMed PubMed Central

Eldon, B., M. Birkner, J. Blath, and F. Freund (2015): “Can the site frequency spectrum distinguish exponential population growth from multiple-merger coalescents,” Genetics, 199, 841–856.10.1534/genetics.114.173807Search in Google Scholar PubMed PubMed Central

Fay, J. C. and C.-I. Wu (2000): “Hitchhiking under positive Darwinian selection,” Genetics, 155, 1405–1413.10.1093/genetics/155.3.1405Search in Google Scholar

Fu, Y. X. (1995): “Statistical properties of segregating sites,” Theor. Popul. Biol., 48, 172–197.10.1006/tpbi.1995.1025Search in Google Scholar

Fu, Y. X. and W. H. Li (1993): “Statistical tests of neutrality of mutations,” Genetics, 133, 693–709.10.1093/genetics/133.3.693Search in Google Scholar

Hedgecock, D. and A. I. Pudovkin (2011): “Sweepstakes reproductive success in highly fecund marine fish and shellfish: a review and commentary,” Bull. Mar. Sci., 87, 971–1002.10.5343/bms.2010.1051Search in Google Scholar

Hein, J., M. H. Schierup, and C. Wiuf (2005): Gene genealogies, variation and evolution. Oxford, UK: Oxford University Press.Search in Google Scholar

Hudson, R. R. (1983a): “Properties of a neutral allele model with intragenic recombination,” Theor. Popul. Biol., 23, 183–201.10.1016/0040-5809(83)90013-8Search in Google Scholar

Hudson, R. R. (1983b): “Testing the constant-rate neutral allele model with protein sequence data,” Evolution, 37, 203–217.10.1111/j.1558-5646.1983.tb05528.xSearch in Google Scholar

Hudson, R. R. (1990): “Gene genealogies and the coalescent process,” In: Futuyma, D. J., Antonovics, J. (Eds.), Oxford surveys in evolutionary biology, Vol. 7. Oxford: Oxford University Press, pp. 1–44.Search in Google Scholar

Kingman, J. F. C. (1982a): “The coalescent,” Stoch. Proc. Appl., 13, 235–248.10.1016/0304-4149(82)90011-4Search in Google Scholar

Kingman, J. F. C. (1982b): “Exchangeability and the evolution of large populations,” In: Koch, G., Spizzichino, F., (Eds.), Exchangeability in probability and statistics. Amsterdam: North-Holland, pp. 97–112.Search in Google Scholar

Kingman, J. F. C. (1982c): “On the genealogy of large populations,” J. Appl. Probab., 19A, 27–43.10.1017/S0021900200034446Search in Google Scholar

Koskela, J., P. Jenkins, and D. Spanò (2015): “Computational inference beyond Kingman’s coalescent,” J. Appl. Probab., 52, 519–537.10.1017/S0021900200012614Search in Google Scholar

Koskela, J., P. Jenkins, and D. Spanò (2018): “Bayesian non-parametric inference for Λ-coalescents: posterior consistency and a parametric method,” Bernoulli, 24, 2122–2153.10.3150/16-BEJ923Search in Google Scholar

Möhle, M. (1998): “Robustness results for the coalescent,” J. Appl. Probab., 35, 438–447.10.1239/jap/1032192859Search in Google Scholar

Nordborg, M. (2001): “Coalescent theory,” In: Balding, D. J., Bishop, M. J., Cannings, C. (Eds.), Handbook of statistical genetics, chapter 25, 2nd edn. Chichester, UK: John Wiley & Sons, pp. 179–212.Search in Google Scholar

Pitman, J. (1999): “Coalescents with multiple collisions,” Ann. Probab., 27, 1870–1902.10.1214/aop/1022874819Search in Google Scholar

Ramos-Onsins, S. E. and J. Rozas (2002): “Statistical properties of new neutrality tests against population growth,” Mol. Biol. Evol., 19, 2092–2100.10.1093/oxfordjournals.molbev.a004034Search in Google Scholar

Sagitov, S. (1999): “The general coalescent with asynchronous mergers of ancestral lines,” J. Appl. Probab., 36, 1116–1125.10.1239/jap/1032374759Search in Google Scholar

Sargsyan, O. and J. Wakeley (2008): “A coalescent process with simultaneous multiple mergers for approximating the gene genealogies of many marine organisms,” Theor. Popul. Biol., 74, 104–114.10.1016/j.tpb.2008.04.009Search in Google Scholar

Schweinsberg, J. (2003): “Coalescent processes obtained from supercritical Galton-Watson processes,” Stoch. Proc. Appl., 106, 107–139.10.1016/S0304-4149(03)00028-0Search in Google Scholar

Scott, D. W. (1992): Multivariate density estimation: theory, practice and visualization. New York: John Wiley & Sons.10.1002/9780470316849Search in Google Scholar

Steinrücken, M., M. Birkner, and J. Blath (2013): “Analysis of DNA sequence variation within marine species using beta-coalescents,” Theor. Popul. Biol., 87, 15–24.10.1016/j.tpb.2013.01.007Search in Google Scholar PubMed PubMed Central

Tajima, F. (1983): “Evolutionary relationship of DNA sequences in finite populations,” Genetics, 105, 437–460.10.1093/genetics/105.2.437Search in Google Scholar PubMed PubMed Central

Tajima, F. (1989): “The effect of change in population size on DNA polymorphism,” Genetics, 123, 597–601.10.1093/genetics/123.3.597Search in Google Scholar PubMed PubMed Central

Tellier, A. and C. Lemaire (2014): “Coalescence 2.0: a multiple branching of recent theoretical developments and their applications,” Mol. Ecol., 23, 2637–2652.10.1111/mec.12755Search in Google Scholar

Tørresen, O. K., B. Star, S. Jentoft, W. B. Reinar, H. Grove, J. R. Miller, B. P. Walenz, J. Knight, J. M. Ekholm, P. Peluso, R. B. Edvardsen, A. Tooming-Klunderud, M. Skage, S. Lien, K. S. Jakobsen, and A. J. Nederbragt (2017): “An improved genome assembly uncovers prolific tandem repeats in Atlantic cod,” BMC Genomics, 18, 95.10.1186/s12864-016-3448-xSearch in Google Scholar

Wakeley, J. (2007): Coalescent theory. Greenwood Village: Roberts & Co.Search in Google Scholar

Watterson, G. A. (1975): “On the number of segregating sites in genetical models without recombination,” Theor. Pop. Biol., 7, 1539–1546.10.1016/0040-5809(75)90020-9Search in Google Scholar

Zhu, S., J. H. Degnan, S. J. Goldstein, and B. Eldon (2015): “Hybrid-Lambda: simulation of multiple merger and Kingman gene genealogies in species networks and species trees,” BMC Bioinformatics, 16.10.1186/s12859-015-0721-ySearch in Google Scholar PubMed PubMed Central

Published Online: 2018-6-13

Multi-locus data distinguishes between population growth and multiple merger coalescents

Abstract

Acknowledgement

A Appendix

Remark 4

Remark 5

References

Journal and Issue

Articles in the same Issue