A general framework for moment-based analysis of genetic data

Speed, Maria Simonsen; Balding, David Joseph; Hobolth, Asger

doi:10.1007/s00285-018-01325-0

A general framework for moment-based analysis of genetic data

Published: 28 January 2019

Volume 78, pages 1727–1769, (2019)
Cite this article

Journal of Mathematical Biology Aims and scope Submit manuscript

666 Accesses
3 Citations
27 Altmetric
1 Mention
Explore all metrics

Abstract

In population genetics, the Dirichlet (also called the Balding–Nichols) model has for 20 years been considered the key model to approximate the distribution of allele fractions within populations in a multi-allelic setting. It has often been noted that the Dirichlet assumption is approximate because positive correlations among alleles cannot be accommodated under the Dirichlet model. However, the validity of the Dirichlet distribution has never been systematically investigated in a general framework. This paper attempts to address this problem by providing a general overview of how allele fraction data under the most common multi-allelic mutational structures should be modeled. The Dirichlet and alternative models are investigated by simulating allele fractions from a diffusion approximation of the multi-allelic Wright–Fisher process with mutation, and applying a moment-based analysis method. The study shows that the optimal modeling strategy for the distribution of allele fractions depends on the specific mutation process. The Dirichlet model is only an exceptionally good approximation for the pure drift, Jukes–Cantor and parent-independent mutation processes with small mutation rates. Alternative models are required and proposed for the other mutation processes, such as a Beta–Dirichlet model for the infinite alleles mutation process, and a Hierarchical Beta model for the Kimura, Hasegawa–Kishino–Yano and Tamura–Nei processes. Finally, a novel Hierarchical Beta approximation is developed, a Pyramidal Hierarchical Beta model, for the generalized time-reversible and single-step mutation processes.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 2

The IICR and the non-stationary structured coalescent: towards demographic inference with arbitrary changes in population structure

Article Open access 07 October 2018

Inference with selection, varying population size, and evolving population structure: application of ABC to a forward–backward coalescent process with interactions

Article 30 October 2020

Bayesian estimation of scaled mutation rate under the coalescent: a sequential Monte Carlo approach

Article Open access 08 December 2017

References

1000 Genomes Project Consortium et al (2015) A global reference for human genetic variation. Nature 526(7571):68–74
Article Google Scholar
Aitchison J (1986) The statistical analysis of compositional data. Chapman and Hall, Boca Raton
Book MATH Google Scholar
Balding DJ, Nichols RA (1995) A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96:3–12
Article Google Scholar
Balding DJ, Nichols RA (1997) Significant genetic correlations among Caucasians at forensic DNA loci. Heredity 78(6):583–589
Article Google Scholar
Balding DJ, Steele CD (2015) Weight-of-evidence for forensic DNA profiles, 2nd edn. Wiley, Woolloongabba
Book Google Scholar
Crow JF, Kimura M (1970) An introduction to population genetics theory. Harper & Row, Publishers, New York
MATH Google Scholar
De Maio N, Schrempf D, Kosiol C (2015) PoMo: an allele frequency-based approach for species tree estimation. Syst Biol 64(6):1018–1031
Article Google Scholar
Etheridge A (2012) Some mathematical models from population genetics. Springer, Berlin
MATH Google Scholar
Ewens WJ (2004) Mathematical population genetics 1: I. Theoretical introduction, vol 27. Springer, New York
Book MATH Google Scholar
Felsenstein J (2004) Inferring phylogenies. Sinauer Associates, Sunderland
Google Scholar
Gautier M, Vitalis R (2013) Inferring population histories using genome-wide allele frequency data. Mol Biol Evol 30(3):654–68
Article Google Scholar
Griffiths RC, Spanò D (2010) Diffusion processes and coalescent trees. In: Bingham NH, Goldie CM (eds) Probability and mathematical genetics: papers in honour of Sir John Kingman. Cambridge University Press, Cambridge, pp 358–379
Chapter Google Scholar
Hasegawa M, Kishino H, Yano T (1985) Dating of human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22(2):160–174
Article Google Scholar
Hobolth A, Sirén J (2016) The multivariate Wright–Fisher process with mutation: moment-based analysis and inference using a hierarchical Beta model. Theor Popul Biol 108:36–50
Article MATH Google Scholar
Hodgkinson A, Eyre-Walker A (2010) Human triallelic sites: Evidence for a new mutational mechanism? Genetics 184(1):233–241
Article Google Scholar
Jenkins PA, Mueller JW, Song YS (2014) General triallelic frequency spectrum under demographic models with variable population size. Genetics 196:295–311
Article Google Scholar
Jukes TH, Cantor CR (1969) Evolution of protein molecules. Academic Press, New York, pp 21–132
Google Scholar
Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16(2):111–120
Article Google Scholar
Kimura M, Ohta T (1978) Stepwise mutation model and distribution of allele frequencies in a finite population. Proc Natl Acad Sci 75(6):2868–2872
Article MATH Google Scholar
Motoo K (1955a) Random genetic drift in multi-allelic locus. Evolution 9(4):419–435
Article Google Scholar
Motoo K (1955b) Solution of a process of random genetic drift with a continuous model. Proc Natl Acad Sci U S A 41(3):144
Article MATH Google Scholar
Nicholson G, Smith AV, Jónsson F, Gustafsson Ó, Stefánsson K, Donnelly P (2002) Assessing population differentiation and isolation from single-nucleotide polymorphism data. J R Stat Soc Ser B (Stat Methodol) 64(4):695–715
Article MathSciNet MATH Google Scholar
Ongora A, Migliorati S, Monti GS (2008) A new distribution on the simplex containing the Dirichlet family. In: Proceedings of the 3rd compositional data analysis workshop, 27–30 May. University of Girona
Pickrell JK, Pritchard JK (2012) Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet 8(11):e1002967
Article Google Scholar
Ross SM (1996) Stochastic processes, 2nd edn. Wiley, Hoboken
MATH Google Scholar
Sirén J, Marttinen P, Corander J (2011) Reconstructing population histories from single nucleotide polymorphism data. Mol Biol Evol 28:673–683
Article Google Scholar
Sirén J, Hanage WP, Corander J (2013) Inference on population histories by approximating infinite alleles diffusion. Mol Biol Evol 30(2):457–468
Article Google Scholar
Swofford DL, Olsen GJ, Waddell PJ, Hillis DM (1996) Phylogenetic inference. Sinauer Associates, Sunderland
Google Scholar
Tamura K, Nei M (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol 10(3):512–526
Google Scholar
Tataru P, Bataillon T, Hobolth A (2015) Inference under a Wright–Fisher model using an accurate Beta approximation. Genetics 201:1133–1141
Article Google Scholar
Tataru P, Simonsen M, Bataillon T, Hobolth A (2016) Statistical inference in the Wright–Fisher model using allele frequency data. Syst Biol 66:e30–e46
Google Scholar
Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Stat Assoc 101:1566–1581
Article MathSciNet MATH Google Scholar
Wong TT (2010) Parameter estimation for generalized Dirichlet distributions from the sample estimates of the first and the second moments of random variables. Comput Stat Data Anal 54(7):1756–1765
Article MathSciNet MATH Google Scholar

Download references

Acknowledgements

We are grateful to the associate editor and two anonymous reviewers for helpful comments and suggestions. This work is funded through a Grant from the Danish Research Council (DFF 4002-00382) awarded to Asger Hobolth.

Author information

Authors and Affiliations

Bioinformatics Research Centre, Aarhus University, 8000, Aarhus, Denmark
Maria Simonsen Speed & Asger Hobolth
Melbourne Integrative Genomics, School of BioSciences and School of Mathematics & Statistics, University of Melbourne, Melbourne, VIC 3010, Australia
David Joseph Balding
Department of Affective Disorders, Aarhus University Hospital, 8000, Aarhus, Denmark
Maria Simonsen Speed

Authors

Maria Simonsen Speed
View author publications
You can also search for this author in PubMed Google Scholar
David Joseph Balding
View author publications
You can also search for this author in PubMed Google Scholar
Asger Hobolth
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Maria Simonsen Speed.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

(Co)variance for PIM

In order to calculate the matrix exponential $e^{Qt}=I+\sum \nolimits _{k=1}^{\infty } (Qt)^k/k!$ we need to find an expression for $Q^k$, where $Q=\mathbf {e}'\beta -\beta _0 I$ for $\beta =(\beta _1,\dots ,\beta _K)$. By induction and using that $\mathbf {e}'\beta \mathbf {e}'=\beta _0 \mathbf {e}'$, we generally get $Q^k = (-\beta _0)^{k-1}(\mathbf {e}'\beta - \beta _0 I)$. This implies that the matrix exponential is given by

$$\begin{aligned} e^{Qt}= I + (\mathbf {e}'\beta -\beta _0 I) \sum _{k=1}^{\infty }(-\beta _0)^{k-1}t^k\frac{1}{k!} =a(t)(I-\mathbf {e}'\overline{\beta })+\mathbf {e}'\overline{\beta } \end{aligned}$$

where $a(t)=e^{-\beta _0 t}$ and $\overline{\beta }=\beta /\beta _0$. The mean is therefore specified by

$$\begin{aligned} {{\,\mathrm{E}\,}}[x(t)\vert {x}(0)]= x(0)e^{Qt}= x(0) \left\{ a(t)\left( I-\mathbf {e}{'}\overline{\beta }\right) +\mathbf {e}{'}\overline{\beta }\right\} . \end{aligned}$$

We ought to calculate the (co)variance specified in Theorem 2.1 for the WF PIM process. By using that $x(0)\mathbf {e}{'}=\mathbf {e}x(0){'}=1$ and $\overline{\beta }\mathbf {e}{'}=\mathbf {e}\overline{\beta }{'}=1$, the last term in the (co)variance is given by

$$\begin{aligned}&\left( e^{Qt}\right) {'}x(0){'}x(0)e^{Qt}(1-e^{-t})\\&\quad =\left( 1-e^{-t}\right) \left\{ e^{-\beta _{0}t}\left( I-\overline{\beta }{'}\mathbf {e}\right) +\overline{\beta }{'} \mathbf {e} \right\} x(0){'}x(0) \left\{ e^{-\beta _{0}t}\left( I-\mathbf {e}{'}\overline{\beta }\right) +\mathbf {e}{'} \overline{\beta } \right\} \\&\quad =\left( 1-e^{-t}\right) \Bigl \lbrace e^{-2\beta _0 t}\left( x(0)-\overline{\beta }\right) ^{'}\left( x(0) -\overline{\beta }\right) +\overline{\beta }{'}\overline{\beta }\\&\qquad +\,e^{-\beta _0 t}\left\{ (x(0)-\overline{\beta })^{'}\overline{\beta } +\overline{\beta }{'}(x(0)-\overline{\beta })\right\} \Bigr \rbrace . \end{aligned}$$

The first term of the (co)variance involves an integral and is more tedious to calculate

$$\begin{aligned}&\int _0^{t}e^{-s}(e^{Qs}){'}{{\,\mathrm{diag}\,}}{\left( x(0)e^{Q(t-s)} \right) } e^{Qs}\;ds\\&\quad = \int _0^{t}e^{-s} \left\{ a(s)\left( I-\overline{\beta }{'}\mathbf {e}\right) +\overline{\beta }{'}\mathbf {e} \right\} \left\{ a(s)\left( I-\mathbf {e}{'}\overline{\beta }\right) +\mathbf {e}{'} \overline{\beta }\right\} \\&\qquad \times \,\left\{ a(t-s){{\,\mathrm{diag}\,}}\left( x(0)-\overline{\beta }\right) + {{\,\mathrm{diag}\,}}\left( \overline{\beta }\right) \right\} \;ds\;\\&\quad = \int _0^t e^{-s}a(s)^2 a(t-s)\;ds\; (I-\overline{\beta }{'}\mathbf {e}) {{\,\mathrm{diag}\,}}(x(0)-\overline{\beta })(I-\mathbf {e}{'}\overline{\beta })\\&\qquad + \,\int _0^t e^{-s}a(s)a(t-s)\;ds\; (I-\overline{\beta }{'}\mathbf {e}){{\,\mathrm{diag}\,}}(x(0)-\overline{\beta }) \mathbf {e}{'}\overline{\beta }\\&\qquad + \,\int _0^t e^{-s}a(s)^2\;ds\; (I-\overline{\beta }{'}\mathbf {e}){{\,\mathrm{diag}\,}}(\overline{\beta })(I-\mathbf {e}{'} \overline{\beta })\\&\qquad +\, \int _0^t e^{-s}a(s)\;ds\; (I-\overline{\beta }{'}\mathbf {e}){{\,\mathrm{diag}\,}}(\overline{\beta })\mathbf {e}{'}\overline{\beta }\\&\qquad + \,\int _0^t e^{-s}a(t-s)a(s)\;ds\; \overline{\beta }{'}\mathbf {e}{{\,\mathrm{diag}\,}}(x(0)-\overline{\beta })(I-\mathbf {e}{'} \overline{\beta })\\&\qquad + \,\int _0^t e^{-s}a(t-s)\;ds\; \overline{\beta }{'}\mathbf {e}{{\,\mathrm{diag}\,}}(x(0) -\overline{\beta })\mathbf {e}{'}\overline{\beta }\\&\qquad +\, \int _0^t e^{-s}a(s)\;ds\; \overline{\beta }{'}\mathbf {e}{{\,\mathrm{diag}\,}}(\overline{\beta }) (I-\mathbf {e}{'}\overline{\beta })\\&\qquad +\, \int _0^{t} e^{-s}\;ds\; \overline{\beta }{'}\mathbf {e}{{\,\mathrm{diag}\,}}(\overline{\beta }) \mathbf {e}{'}\overline{\beta }\\&\quad =\left( 1-e^{-t}\right) \overline{\beta }{'}\overline{\beta }+e^{-\beta _0 t} \frac{1}{1+\beta _0} \left( 1-e^{-(1+\beta _0)t}\right) \\&\qquad \times \left\{ {{\,\mathrm{diag}\,}}\left( x(0)-\overline{\beta }\right) -\left( x(0) -\overline{\beta }\right) {'}\overline{\beta } -\overline{\beta }{'}\left( x(0)-\overline{\beta }\right) \right\} \\&\qquad +\, e^{-\beta _0 t}\left( 1-e^{-t}\right) \left\{ (x(0)-\overline{\beta }){'}\overline{\beta }{'}+\overline{\beta }{'}(x(0) -\overline{\beta })\right\} \\&\qquad +\,\frac{1}{1+2\beta _0}\left( 1-e^{-(1+2\beta _0)t}\right) \left\{ {{\,\mathrm{diag}\,}}\left( \overline{\beta }\right) -\overline{\beta }{'} \overline{\beta }\right\} \end{aligned}$$

by using $\mathbf {e}{{\,\mathrm{diag}\,}}(\overline{\beta })=\overline{\beta }$, ${{\,\mathrm{diag}\,}}(x(0))\mathbf {e}{'}=x(0){'}$, $\mathbf {e}{{\,\mathrm{diag}\,}}(x(0))=x(0)$, assuming $\beta _0\ne 1$, and

$$\begin{aligned} \begin{aligned}&\int _0^t e^{-s}\;ds\; = 1-e^{-t}&\quad&\int _0^t e^{-s}a(s)\;ds\; = \frac{1}{\beta _0+1}\left( 1-e^{-(\beta _0+1) t}\right) \\&\int _0^t e^{-s}a(t-s)\;ds\; = \frac{1}{\beta _0-1}\left( e^{-t}-e^{-\beta _0 t}\right)&\quad&\int _0^t e^{-s}a(t-s)a(s)\;ds\;=e^{-\beta _0 t}\left( 1-e^{-t}\right) \\&\int _0^t e^{-s}a(s)^2\;ds\; =\frac{1}{1+2\beta _0}\left( 1-e^{-(1+2\beta _0)t}\right)&\quad&\int _0^t e^{-s}a(s)^2a(t-s)\;ds\; =e^{-\beta _0 t}\frac{1}{1+\beta _0}\left( 1-e^{-(1+\beta _0)t}\right) . \end{aligned} \end{aligned}$$

The final expression for the (co)variance becomes

$$\begin{aligned}&{{\,\mathrm{Var}\,}}(x(t)\vert {x}(0))\\&\quad = \frac{1}{1+2\beta _{0}} \left( 1-e^{-(1+2\beta _{0})t}\right) \left\{ {{\,\mathrm{diag}\,}}\left( \overline{\beta }\right) -\overline{\beta }{'}\overline{\beta }\right\} \\&\qquad -\,\left( 1-e^{-t}\right) e^{-2\beta _{0}t}\left( x(0)-\overline{\beta }\right) {'} \left( x(0)-\overline{\beta }\right) + e^{-\beta _{0}t} \frac{1}{1+\beta _{0}}\left( 1+e^{-(1+\beta _{0})t}\right) \\&\qquad \times \, \left\{ {{\,\mathrm{diag}\,}}\left( x(0)-\overline{\beta }\right) -\left( x(0) -\overline{\beta }\right) {'}\overline{\beta }-\overline{\beta }{'}\left( x(0)-\overline{\beta }\right) \right\} . \end{aligned}$$

(Co)variance for IAM

This section provides more elegant calculations to obtain the (co)variance expression for the IAM process compared to the computations obtained by Sirén et al. (2013). By induction we obtain the following expression in matrix notation of the matrix exponential for the IAM process

$$\begin{aligned} e^{Qt}=e^{-tm} \left( I-\mathbf {e}{'}\mathbf {e}_K\right) +\mathbf {e}{'}\mathbf {e}_K. \end{aligned}$$

Let ${\tilde{x}}(0)=(x_1(0),\dots ,x_{K-1}(0),x_K(0)-1)$ and ${\tilde{\mathbf {1}}}=(0,\dots ,0,1)$ be $(1\times K)$-dimensional vectors. We provide a proof of the expression for the (co)variance in Eq. (1) in Corollary 3.3, where the last term is given by

$$\begin{aligned} (e^{Qt}){'}x(0)&{'}x(0)e^{Qt}\left( 1-e^{-t}\right) \\&\quad = \left( 1-e^{-t}\right) \Big \lbrace \left( e^{tm}{\tilde{x}(0)}{'}+{\tilde{\mathbf {1}}}{'}\right) \Big (e^{-tm}{\tilde{x}(0)}+{\tilde{\mathbf {1}}}\Big )\Big \rbrace \\&\quad = \left( 1-e^{-t}\right) \left\{ e^{-2tm}\left( x(0){'} x(0) - \begin{pmatrix} 0 &{} \quad \dots &{} \quad 0 &{} \quad x_1(0)\\ \vdots &{} \quad \ddots &{} \quad \vdots &{} \quad \vdots \\ 0 &{} \quad \dots &{} \quad 0 &{} \quad x_{K-1}(0)\\ x_1(0) &{} \quad \dots &{} \quad x_{K-1}(0) &{} \quad -1+2x_{K}(0) \end{pmatrix} \right) \right. \\&\left. \qquad +\, e^{-tm}\left\{ \begin{pmatrix} 0 &{} \quad \dots &{} \quad x_{1}(0)\\ \vdots &{} \quad \ddots &{} \quad \vdots \\ 0 &{} \quad \dots &{} \quad x_{K}(0)-1\\ \end{pmatrix} + \begin{pmatrix} 0 &{} \quad \dots &{} \quad 0\\ \vdots &{} \quad \ddots &{} \quad \vdots \\ x_1(0) &{} \quad \dots &{} \quad x_{K}(0)-1\\ \end{pmatrix}\right\} + {{\,\mathrm{diag}\,}}\left( {\tilde{\mathbf {1}}}\right) \right\} . \end{aligned}$$

The first term of the (co)variance becomes

$$\begin{aligned} \int _0^{t}&e^{-s}(e^{Qs}){'}{{\,\mathrm{diag}\,}}{\left( x(0)e^{Q(t-s)} \right) } e^{Qs}\;ds\;\\ =&\int _0^{t}e^{-s}\left\{ e^{-sm}\begin{pmatrix} 1 &{} \quad 0 &{} \quad \dots &{} \quad 0\\ 0 &{} \quad \ddots &{} \quad &{} \quad \vdots \\ \vdots &{} \quad &{} \quad 1 &{} \quad 0\\ -1 &{} \quad \dots &{} \quad -1 &{} \quad 0 \end{pmatrix} + \begin{pmatrix} 0 &{} \quad 0 &{} \quad \dots &{} \quad 0\\ 0 &{} \quad \ddots &{} \quad &{} \quad \vdots \\ \vdots &{} \quad &{} \quad 0 &{} \quad 0\\ 1 &{} \quad \dots &{} \quad 1 &{} \quad 1 \end{pmatrix}\right\} \left\{ e^{-(t-s)m}{{\,\mathrm{diag}\,}}(\widetilde{{x}}(0)) + {{\,\mathrm{diag}\,}}\left( {\tilde{\mathbf {1}}}\right) \right\} \\&\times \left\{ e^{-sm}\begin{pmatrix} 1 &{} \quad 0 &{} \quad \dots &{} \quad -1\\ 0 &{} \quad \ddots &{} \quad &{} \quad \vdots \\ \vdots &{} \quad &{} \quad 1 &{} \quad -1\\ 0 &{} \quad \dots &{} \quad 0 &{} \quad 0 \end{pmatrix} + \begin{pmatrix} 0 &{} \quad 0 &{} \quad \dots &{} \quad 1\\ 0 &{} \quad \ddots &{} \quad \vdots &{} \quad 1\\ \vdots &{} \quad \dots &{} \quad 0 &{} \quad 1\\ 0 &{} \quad \dots &{} \quad 0 &{} \quad 1 \end{pmatrix}\right\} \;ds\;\\&= e^{-tm} \begin{pmatrix} x_1(0) &{} \quad 0 &{} \quad \dots &{} \quad -x_1(0)\\ 0 &{} \quad \ddots &{} \quad &{} \quad \vdots \\ \vdots &{} \quad &{} \quad x_{K-1}(0) &{} \quad -x_{K-1}(0)\\ -x_1(0) &{} \quad \dots &{} \quad -x_{K-1}(0) &{} \quad 1-x_K(0) \end{pmatrix} \int _0^t e^{-(1+m)s}\;ds\; +\left( 1-e^{-t}\right) {{\,\mathrm{diag}\,}}\left( {\tilde{\mathbf {1}}}\right) \\&\quad +\, e^{-tm}\left( 1-e^{-t}\right) \begin{pmatrix} 0 &{} \quad 0 &{} \quad \dots &{} \quad x_1(0)\\ 0 &{} \quad \ddots &{} \quad &{} \quad \vdots \\ \vdots &{} \quad &{} \quad 0 &{} \quad x_{K-1}(0)\\ x_1(0) &{} \quad \dots &{} \quad x_{K-1}(0) &{} \quad 2(x_K(0)-1) \end{pmatrix}. \end{aligned}$$

The final expression for the (co)variance becomes by letting $\widehat{x}(0)=(x_1(0),\dots ,x_{K-1}(0))$

$$\begin{aligned}&{{\,\mathrm{Var}\,}}(x(t)\vert {\widehat{x}}(0),x_K(0))\\&\quad =e^{-tm}\frac{1}{m+1}\left( 1-e^{-(1+m)t}\right) \begin{pmatrix} {{\,\mathrm{diag}\,}}(\widehat{x}(0)) &{}\quad -\widehat{x}(0){'}\\ -\widehat{x}(0) &{}\quad 1-x_K(0) \end{pmatrix}\\&\qquad -\,\left( 1-e^{-t}\right) e^{-2tm} \begin{pmatrix} \widehat{x}(0){'}\widehat{x}(0) &{} \quad -\widehat{x}(0){'}(1-x_K(0))\\ -\widehat{x}(0)(1-x_K(0)) &{} \quad (1-x_K(0))(1-x_K(0)) \end{pmatrix}\\&\quad =F\big (t,m,x_K(0)\big )\begin{pmatrix} {{\,\mathrm{diag}\,}}\left( \frac{\widehat{x}}{1-x_K(0)}\right) &{} \quad 0\\ 0 &{} \quad 0 \end{pmatrix} - G\big (t,m,x_K(0)\big )\begin{pmatrix} \frac{\widehat{x}{'}(0)}{1-x_K(0)}\frac{\widehat{x}(0)}{1-x_K(0)} &{} \quad 0 \\ 0 &{} \quad 0 \end{pmatrix}\\&\qquad +\, \Big (F\big (t,m,x_K(0)\big )-G\big (t,m,x_K(0)\big )\Big )\begin{pmatrix} 0 &{} \quad -\,\frac{\widehat{x}{'}(0)}{1-x_K(0)}\\ -\,\frac{\widehat{x}(0)}{1-x_K(0)} &{} \quad 1 \end{pmatrix} \end{aligned}$$

where

$$\begin{aligned} F(t,m,x_K(0))= & {} e^{-tm}\frac{1}{m+1}\left( 1-e^{-(1+m)t}\right) (1-x_K(0))\\ G(t,m,x_K(0))= & {} \left( 1-e^{-t}\right) e^{-2tm}(1-x_K(0))^2. \end{aligned}$$

(Co)variance for HKY, TN and SSM

This section provides a proof of the expression for the (co)variance for the TN process in Eq. (3) in Corollary 3.5 (and for the HKY and SSM processes). The matrix exponential can be written on the form $e^{Qt}=\sum _{i=1}^4 a_ie^{-b_i t}A_i$, which makes the last term in the (co)variance very easy to calculate

$$\begin{aligned} (e^{Qt}){'}x(0){'}x(0)e^{Qt}\left( 1-e^{-t}\right)= & {} \left( 1-e^{-t}\right) \sum _{i=1}^4\sum _{j=1}^4a_ia_je^{-(b_i+b_j) t}A_i{'}x(0){'}x(0)A_j. \end{aligned}$$

The first term of the (co)variance is given by

$$\begin{aligned}&\int _0^{t}e^{-s}\left( e^{Qs}\right) {'}{{\,\mathrm{diag}\,}}{\left( x(0)e^{Q(t-s)} \right) } e^{Qs}\;ds\;\\&\quad = \int _0^{t}e^{-s}\sum _{i=1}^4 a_ie^{-b_i s}A_i{'} {{\,\mathrm{diag}\,}}{\left( x(0)\sum _{j=1}^4 a_je^{-b_j (t-s)}A_j\right) }\\&\qquad \times \sum _{k=1}^4 a_k e^{-b_k s}A_k\;ds\;\\&\quad = \sum _{i=1}^4\sum _{j=1}^4\sum _{k=1}^4 a_ia_ja_kg_{ijk}(t)A_i{'} {{\,\mathrm{diag}\,}}{\left( x(0)A_j\right) }A_k\\ g_{ijk}(t)= & {} \int _0^t e^{-s}e^{-b_i s} e^{-b_j(t-s)}e^{-b_k s}\;ds\;\\= & {} {\left\{ \begin{array}{ll} te^{-b_j t} &{}\quad \text {if }1+b_i+b_k=b_j\\ \frac{e^{-b_j t}-e^{-(1+b_i+b_k)t}}{1+b_i+b_k-b_j} &{}\quad \text {otherwise} \end{array}\right. }\qquad i,j,k = 1,2,3,4. \end{aligned}$$

Moment-based parameter estimation

This section outlines how to estimate the free parameters in the four models: Dirichlet, Beta–Dirichlet, Hierarchical Beta and Pyramidal Hierachical Beta, using the moment-based method.

1.1 Dirichlet

Consider a K-dimensional Dirichlet distributed allele fraction data $x(t)=(x_1(t),\dots ,x_K(t)) \sim {{\,\mathrm{Dirichlet}\,}}(\alpha )$ with parameter $\alpha =(\alpha _1,\dots ,\alpha _K)$. The mean and (co)variance under the Dirichlet model are given by

$$\begin{aligned} {{\,\mathrm{E}\,}}[x(t)]=\alpha /\alpha _0\quad \text {and}\quad {{\,\mathrm{Var}\,}}(x(t))=1/(\alpha _0+1) \left\{ \alpha /\alpha _0 -\left( \alpha /\alpha _0\right) ^2\right\} ,\quad \alpha _0=\sum _{i=1}^{K}\alpha _i. \end{aligned}$$

We ought to estimate $\alpha $ and $\alpha _0$ from the Dirichlet means and (co)variances displayed in Table 1. An estimate of $\alpha $ is obtained from the mean: $\widehat{\alpha }={{\,\mathrm{E}\,}}[x(t)]\alpha _0$. The parameter $\alpha _0$ is estimated by summing the known (co)variance over the number of allelic types

$$\begin{aligned} \sum _{k=1}^K{{\,\mathrm{Var}\,}}(x_k(t))= & {} \frac{1}{\alpha _0+1} \left\{ \sum _{k=1}^K{{\,\mathrm{E}\,}}[x_k(t)] -\sum _{k=1}^K {{\,\mathrm{E}\,}}[x_k(t)]^2\right\} \\= & {} \frac{1}{\alpha _0+1} \left\{ 1 -\sum _{k=1}^K {{\,\mathrm{E}\,}}[x_k(t)]^2\right\} \end{aligned}$$

which implies

$$\begin{aligned} \widehat{\alpha }_0=\frac{1 -\sum _{k=1}^K {{\,\mathrm{E}\,}}[x_k(t)]^2-\sum _{k=1}^K{{\,\mathrm{Var}\,}}(x_k(t))}{\sum _{k=1}^K{{\,\mathrm{Var}\,}}(x_k(t))}. \end{aligned}$$

We then estimate the mean and variance of the stochastic variables as follows for $k=1,\dots ,K-1$

$$\begin{aligned} \widehat{\mu }_{\omega _k} = \frac{\widehat{\alpha }_k}{\widehat{\alpha }_0-\sum _{l=1}^{k-1} \widehat{\alpha }_l}\quad \text {and}\quad \widehat{V}_{\omega _k} = \frac{\widehat{\alpha }_k(\widehat{\alpha }_0-\sum _{l=1}^k \widehat{\alpha }_l)}{\left( \widehat{\alpha }_0-\sum _{l=1}^{k-1}\widehat{\alpha }_l\right) ^2 \left( \widehat{\alpha }_0-\sum _{l=1}^{k-1}\widehat{\alpha }_l+1\right) }. \end{aligned}$$

1.2 Beta–Dirichlet

Consider a K-dimensional Beta–Dirichlet distributed allele fraction data $x(t)=((1-x_K(t))y,x_K(t))$, where

$$\begin{aligned} x_K(t)\sim {{\,\mathrm{Beta}\,}}(\phi \mu ,\phi (1-\mu ))\quad \text {and}\quad y\sim {{\,\mathrm{Dirichlet}\,}}(\alpha ), \quad \alpha =(\alpha _1,\dots ,\alpha _{K-1}) \end{aligned}$$

are independent. The mean and (co)variance under the Beta–Dirichlet model are given by

$$\begin{aligned} {{\,\mathrm{E}\,}}[x(t)]= & {} {{\,\mathrm{E}\,}}[{{\,\mathrm{E}\,}}[x(t)\vert {x}_K(t)]]={{\,\mathrm{E}\,}}\left[ \left( (1-x_K(t)) \tfrac{\alpha }{\alpha _0},x_K(t)\right) \right] =\left( (1-\mu )\tfrac{\alpha }{\alpha _0},\mu \right) \\ {{\,\mathrm{Var}\,}}(x(t))= & {} {{\,\mathrm{Var}\,}}({{\,\mathrm{E}\,}}[x(t)\vert {x}_{K}(t)])+{{\,\mathrm{E}\,}}[{{\,\mathrm{Var}\,}}(x(t)\vert {x}_K(t))]\\= & {} {{\,\mathrm{Var}\,}}\left( ((1-x_K(t))\tfrac{\alpha }{\alpha _0},x_K(t)\right) +{{\,\mathrm{E}\,}}\left[ (1-x_K(t))^2\right] \\&\quad \begin{pmatrix} \frac{1}{\alpha _0+1}\left\{ {{\,\mathrm{diag}\,}}\left( \frac{\alpha }{\alpha _0}\right) -\left( \frac{\alpha }{\alpha _0}\right) {'}\frac{\alpha }{\alpha _0}\right\} &{} \quad 0\\ 0 &{} \quad 0 \end{pmatrix}\\= & {} \frac{\mu (1-\mu )}{\phi +1}\begin{pmatrix} \left( \frac{\alpha }{\alpha _0}\right) {'}\frac{\alpha }{\alpha _0} &{} \quad -\left( \frac{\alpha }{\alpha _0}\right) {'}\\ -\frac{\alpha }{\alpha _0} &{} \quad 1 \end{pmatrix}\\&+ \left\{ (1-\mu )^2+\frac{\mu (1-\mu )}{\phi +1}\right\} \frac{1}{\alpha _0+1}\begin{pmatrix} {{\,\mathrm{diag}\,}}\left( \frac{\alpha }{\alpha _0}\right) -\left( \frac{\alpha }{\alpha _0}\right) {'}\frac{\alpha }{\alpha _0} &{} \quad 0\\ 0 &{} \quad 0 \end{pmatrix}\\= & {} A(\mu ,\phi ,\alpha _0)\begin{pmatrix} {{\,\mathrm{diag}\,}}\left( \frac{\alpha }{\alpha _0}\right) &{} \quad 0\\ 0 &{} \quad 0 \end{pmatrix} +B(\mu ,\phi )\begin{pmatrix} 0 &{} \quad -\left( \frac{\alpha }{\alpha _0}\right) {'}\\ -\frac{\alpha }{\alpha _0} &{} \quad 1 \end{pmatrix}\\&+ \left( B(\mu ,\phi )-A(\mu ,\phi ,\alpha _0)\right) \begin{pmatrix} \left( \frac{\alpha }{\alpha _0}\right) {'}\frac{\alpha }{\alpha _0} &{} \quad 0\\ 0 &{} \quad 0 \end{pmatrix} \end{aligned}$$

where

$$\begin{aligned} A(\mu ,\phi ,\alpha _0)= & {} \left\{ (1-\mu )^2+\frac{\mu (1-\mu )}{\phi +1}\right\} \frac{1}{\alpha _0+1},\\ B(\mu ,\phi )= & {} \frac{\mu (1-\mu )}{\phi +1} \quad \text {and}\quad \alpha _0=\sum _{i=1}^{K-1}\alpha _i. \end{aligned}$$

The Beta–Dirichlet model has $K+1$ free parameter: $\mu ,\phi ,\alpha =(\alpha _1,\dots ,\alpha _{K-1})$. Having calculated the means and (co)variances of the model displayed in Table 2, we now discuss how these parameters are estimated from the $K+1$ parameters $t,m,x(0)=(x_1(0),\dots ,x_K(0))$ in the IAM process. Let the expressions for the means in the Beta–Dirichlet model and the IAM process for $x_K(t)$ and $(x_1(t),\dots ,x_{K-1}(t))=(1-x_K(t))y$, respectively, be equal

$$\begin{aligned} \mu = 1-e^{-mt}(1-x_K(t))\quad \text {and}\quad (1-\mu )\frac{\alpha _j}{\alpha _0}=e^{-mt}x_j(0), \quad j=1,\dots ,K-1.\nonumber \\ \end{aligned}$$

(8)

The first part of Eq. (8) estimates the parameter $\widehat{\mu }$. The parameter $\phi $ is estimated by letting the expressions for the (co)variance of $x_K(t)$ in the two models be equal and applying Eq. (8)

$$\begin{aligned} \frac{\mu (1-\mu )}{\phi }= & {} e^{-mt}\frac{1-e^{-(1+m)t}}{m+1}(1-x_K(t))-(1-e^{-t})e^{-2mt}(1-x_K(t))^2\\= & {} (1-\mu )\frac{1-e^{-(1+m)t}}{m+1}-(1-\mu )^2(1-e^{-t}) \end{aligned}$$

which is equivalent to

$$\begin{aligned} \widehat{\phi } = \mu \left( \frac{1-e^{-(1+m)t}}{m+1}-(1-\mu )(1-e^{-t})\right) ^{-1}-1. \end{aligned}$$

We observe that the two expressions $A(\mu ,\phi ,\alpha _0)$ and $F(t,m,x_K(0))$ are equal

$$\begin{aligned} \left\{ (1-\mu )^2+\frac{\mu (1-\mu )}{\phi +1}\right\} \frac{1}{\alpha _0+1}= & {} e^{-mt}\frac{1}{1+m}\left( 1-e^{-(1+m)t}\right) (1-x_K(0)) \end{aligned}$$

and can therefore estimate the parameter $\alpha _0$ from this equality by using Eq. (8)

$$\begin{aligned} \widehat{\alpha }_0= & {} \frac{(1+m)\left\{ (1-\mu )^2+\frac{\mu (1-\mu )}{\phi +1} \right\} }{(1-\mu )(1-e^{-(1+m)t})}-1 =\frac{1+m}{1-e^{-(1+m)t}}\left\{ 1-\mu +\frac{\mu }{\phi +1}\right\} -1\\= & {} \frac{1+m}{1-e^{-(1+m)t}}\left\{ 1-\mu +\frac{1-e^{-(1+m)t}}{1+m}-(1-e^{-t})(1-\mu )\right\} -1\\= & {} \frac{1+m}{1-e^{-(1+m)t}}(1-\mu )e^{-t}. \end{aligned}$$

The second part of Eq. (8) estimates $\widehat{\alpha }_j=(\alpha _0e^{-mt}x_j(0))/(1-\mu )$, $j=1,\dots ,K-1$. We then estimate the mean and variance of the stochastic variables as follows for $k=1,\dots ,K-2$

$$\begin{aligned} \widehat{\mu }_{\xi }= & {} \widehat{\mu }, \qquad \widehat{V}_{\xi } = \frac{\widehat{\mu }(1-\widehat{\mu })}{\widehat{\phi }+1}, \qquad \widehat{\mu }_{\omega _k} = \frac{\widehat{\alpha }_k}{\widehat{\alpha }_0-\sum _{l=1}^{k-1}\widehat{\alpha }_l},\\ \widehat{V}_{\omega _k}= & {} \frac{\widehat{\alpha }_k\left( \widehat{\alpha }_0-\sum _{l=1}^k\widehat{\alpha }_l\right) }{\left( \widehat{\alpha }_0-\sum _{l=1}^{k-1}\widehat{\alpha }_l)^2(\widehat{\alpha }_0-\sum _{l=1}^{k-1}\widehat{\alpha }_l+1\right) }. \end{aligned}$$

Table 7 Definition of the Pyramidal Hierarchical Beta model with ${K=4}$ alleles

Full size table

1.3 Hierarchical Beta

Consider a 4-dimensional Hierarchical Beta distributed allele fraction data $x(t)=(x_1(t),x_2(t),x_3(t),x_4(t))$. The Hierarchical Beta model has 6 free parameter: 3 means $(\mu _{\omega }, \mu _{\eta _1}, \mu _{\eta _2})$ and 3 variances $(V_{\omega }, V_{\eta _1}, V_{\eta _2})$. Hobolth and Sirén (2016) provide the calculations of the means and (co)variances displayed in Table 3.

The mean of $(x_1, x_2, x_3, x_4)$ is a bijective function of $(\mu _{\omega }, \mu _{\eta _1}, \mu _{\eta _2})$. The means $(\mu _{\omega }, \mu _{\eta _1}, \mu _{\eta _2})$ are therefore completely determined from the means of the allele fractions

$$\begin{aligned} \widehat{\mu }_{\omega } = {{\,\mathrm{E}\,}}[x_1]+{{\,\mathrm{E}\,}}[x_2],\qquad \widehat{\mu }_{\eta _1} = {{\,\mathrm{E}\,}}[x_1]\mu _{\omega }^{-1},\qquad \widehat{\mu }_{\eta _2} = {{\,\mathrm{E}\,}}[x_3](1-\mu _{\omega })^{-1}. \end{aligned}$$

The variances are determined from the following three equations by isolating the variable of interest

$$\begin{aligned} \widehat{V}_{\omega }= {{\,\mathrm{Var}\,}}(x_1+x_2),\qquad \widehat{V}_{\eta _1}= f_1({{\,\mathrm{Var}\,}}(x_1)+{{\,\mathrm{Var}\,}}(x_2)) ,\qquad \widehat{V}_{\eta _2}= f_2({{\,\mathrm{Var}\,}}(x_3)+{{\,\mathrm{Var}\,}}(x_4)) \end{aligned}$$

where $f_i: \mathbf {R} \rightarrow \mathbf {R}$, $i=1,2$, are functions of the variance of interest.

1.4 Pyramidal hierarchical Beta

The Pyramidal model has $(K-1)(K+2)/2$ free parameters ($K-1$ means: $\mu _{\eta _k}$, and $K(K-1)/2$ variances: $V_{\omega _{ij}}$, $i=1,\dots ,K-2$, $j=1,\dots ,i$, and $V_{\eta _k}$, $k=1,\dots ,K$). The calculated means of the model are displayed in Table 4 for $K=4$, while the variances and covariances are depicted in Table 7. We discuss how the 9 free parameters [3 means: $(\mu _{\eta _{1}},\mu _{\eta _2},\mu _{\eta _3})$ and 6 variances: $(V_{\omega _{11}},V_{\omega _{21}},V_{\omega _{22}}, V_{\eta _1},V_{\eta _2},V_{\eta _3})$] are estimated.

The mean of $(x_1,x_2,x_3,x_4)$ is a bijective function of $(\mu _{\eta _1},\mu _{\eta _2},\mu _{\eta _3})$. The means $(\mu _{\eta _{1}},\mu _{\eta _2},\mu _{\eta _3})$ are therefore completely determined from the means as follows

$$\begin{aligned} \widehat{\mu }_{\eta _1}= & {} {{\,\mathrm{E}\,}}[x_1](\mu _{\omega _{11}}\mu _{\omega _{21}})^{-1}\\ \widehat{\mu }_{\eta _2}= & {} \left( {{\,\mathrm{E}\,}}[x_1]+{{\,\mathrm{E}\,}}[x_2]-\mu _{\omega _{11}}\mu _{\omega _{21}}\right) \left( \mu _{\omega _{11}}(1-\mu _{\omega _{21}})+(1-\mu _{\omega _{11}})\mu _{\omega _{22}}\right) ^{-1}\\ \widehat{\mu }_{\eta _3}= & {} 1-{{\,\mathrm{E}\,}}[x_4]\left( (1-\mu _{\omega _{11}})(1-\mu _{\omega _{22}})\right) ^{-1}. \end{aligned}$$

The remaining variances are determined from the following (co)variances by isolating the variable of interest

$$\begin{aligned} \begin{aligned} \widehat{V}_{\omega _{11}}&= f_{11}({{\,\mathrm{Cov}\,}}(x_1,x_4)))&\qquad \widehat{V}_{\eta _3}&= f_{3}({{\,\mathrm{Var}\,}}(x_4) )&\qquad \widehat{V}_{\eta _2}&= f_{2}({{\,\mathrm{Var}\,}}(x_2) ) \\ \widehat{V}_{\omega _{22}}&= f_{22}({{\,\mathrm{Cov}\,}}(x_2,x_4) )&\qquad \widehat{V}_{\omega _{21}}&= f_{21}({{\,\mathrm{Var}\,}}(x_3)&\qquad \widehat{V}_{\eta _1}&= f_{1}({{\,\mathrm{Var}\,}}(x_1)) \end{aligned} \end{aligned}$$

where $f_{i}: \mathbf {R}\rightarrow \mathbf {R}$, $i=1,2,3,11,21,22$, is a function of the variances or covariances provided in Table 7. By isolating the parameter of interest, we obtain an expression for the variance parameters under the Pyramid model. A similar extended method is used to in the general case with a K-dimensional allele fraction vector $x=(x_1,\dots ,x_K)$.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Speed, M.S., Balding, D.J. & Hobolth, A. A general framework for moment-based analysis of genetic data. J. Math. Biol. 78, 1727–1769 (2019). https://doi.org/10.1007/s00285-018-01325-0

Download citation

Received: 21 March 2018
Revised: 18 December 2018
Published: 28 January 2019
Issue Date: 01 May 2019
DOI: https://doi.org/10.1007/s00285-018-01325-0

Keywords

Mathematics Subject Classification

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A general framework for moment-based analysis of genetic data

Abstract

Access this article

Similar content being viewed by others

The IICR and the non-stationary structured coalescent: towards demographic inference with arbitrary changes in population structure

Inference with selection, varying population size, and evolving population structure: application of ABC to a forward–backward coalescent process with interactions

Bayesian estimation of scaled mutation rate under the coalescent: a sequential Monte Carlo approach

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendices

(Co)variance for PIM

(Co)variance for IAM

(Co)variance for HKY, TN and SSM

Moment-based parameter estimation

1.1 Dirichlet

1.2 Beta–Dirichlet

1.3 Hierarchical Beta

1.4 Pyramidal hierarchical Beta

Rights and permissions

About this article

Cite this article

Keywords

Mathematics Subject Classification

Navigation

A general framework for moment-based analysis of genetic data

Abstract

Access this article

Similar content being viewed by others

The IICR and the non-stationary structured coalescent: towards demographic inference with arbitrary changes in population structure

Inference with selection, varying population size, and evolving population structure: application of ABC to a forward–backward coalescent process with interactions

Bayesian estimation of scaled mutation rate under the coalescent: a sequential Monte Carlo approach

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendices

(Co)variance for PIM

(Co)variance for IAM

(Co)variance for HKY, TN and SSM

Moment-based parameter estimation

1.1 Dirichlet

1.2 Beta–Dirichlet

1.3 Hierarchical Beta

1.4 Pyramidal hierarchical Beta

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation