A general framework for moment-based analysis of genetic data

Abstract

In population genetics, the Dirichlet (also called the Balding–Nichols) model has for 20 years been considered the key model to approximate the distribution of allele fractions within populations in a multi-allelic setting. It has often been noted that the Dirichlet assumption is approximate because positive correlations among alleles cannot be accommodated under the Dirichlet model. However, the validity of the Dirichlet distribution has never been systematically investigated in a general framework. This paper attempts to address this problem by providing a general overview of how allele fraction data under the most common multi-allelic mutational structures should be modeled. The Dirichlet and alternative models are investigated by simulating allele fractions from a diffusion approximation of the multi-allelic Wright–Fisher process with mutation, and applying a moment-based analysis method. The study shows that the optimal modeling strategy for the distribution of allele fractions depends on the specific mutation process. The Dirichlet model is an exceptionally good approximation only for the pure drift, Jukes–Cantor and parent-independent mutation processes with small mutation rates. Alternative models are required and proposed for the other mutation processes: a Beta–Dirichlet model for the infinite alleles mutation process, and a Hierarchical Beta model for the Kimura, Hasegawa–Kishino–Yano and Tamura–Nei processes. Finally, a novel Hierarchical Beta approximation, the Pyramidal Hierarchical Beta model, is developed for the generalized time-reversible and single-step mutation processes.

References

  • 1000 Genomes Project Consortium et al (2015) A global reference for human genetic variation. Nature 526(7571):68–74

  • Aitchison J (1986) The statistical analysis of compositional data. Chapman and Hall, Boca Raton

  • Balding DJ, Nichols RA (1995) A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96:3–12

  • Balding DJ, Nichols RA (1997) Significant genetic correlations among Caucasians at forensic DNA loci. Heredity 78(6):583–589

  • Balding DJ, Steele CD (2015) Weight-of-evidence for forensic DNA profiles, 2nd edn. Wiley, Woolloongabba

  • Crow JF, Kimura M (1970) An introduction to population genetics theory. Harper & Row, Publishers, New York

  • De Maio N, Schrempf D, Kosiol C (2015) PoMo: an allele frequency-based approach for species tree estimation. Syst Biol 64(6):1018–1031

  • Etheridge A (2012) Some mathematical models from population genetics. Springer, Berlin

  • Ewens WJ (2004) Mathematical population genetics I: theoretical introduction, vol 27. Springer, New York

  • Felsenstein J (2004) Inferring phylogenies. Sinauer Associates, Sunderland

  • Gautier M, Vitalis R (2013) Inferring population histories using genome-wide allele frequency data. Mol Biol Evol 30(3):654–68

  • Griffiths RC, Spanò D (2010) Diffusion processes and coalescent trees. In: Bingham NH, Goldie CM (eds) Probability and mathematical genetics: papers in honour of Sir John Kingman. Cambridge University Press, Cambridge, pp 358–379

  • Hasegawa M, Kishino H, Yano T (1985) Dating of human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22(2):160–174

  • Hobolth A, Sirén J (2016) The multivariate Wright–Fisher process with mutation: moment-based analysis and inference using a hierarchical Beta model. Theor Popul Biol 108:36–50

  • Hodgkinson A, Eyre-Walker A (2010) Human triallelic sites: Evidence for a new mutational mechanism? Genetics 184(1):233–241

  • Jenkins PA, Mueller JW, Song YS (2014) General triallelic frequency spectrum under demographic models with variable population size. Genetics 196:295–311

  • Jukes TH, Cantor CR (1969) Evolution of protein molecules. Academic Press, New York, pp 21–132

  • Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16(2):111–120

  • Kimura M, Ohta T (1978) Stepwise mutation model and distribution of allele frequencies in a finite population. Proc Natl Acad Sci 75(6):2868–2872

  • Kimura M (1955a) Random genetic drift in multi-allelic locus. Evolution 9(4):419–435

  • Kimura M (1955b) Solution of a process of random genetic drift with a continuous model. Proc Natl Acad Sci U S A 41(3):144

  • Nicholson G, Smith AV, Jónsson F, Gustafsson Ó, Stefánsson K, Donnelly P (2002) Assessing population differentiation and isolation from single-nucleotide polymorphism data. J R Stat Soc Ser B (Stat Methodol) 64(4):695–715

  • Ongora A, Migliorati S, Monti GS (2008) A new distribution on the simplex containing the Dirichlet family. In: Proceedings of the 3rd compositional data analysis workshop, 27–30 May. University of Girona

  • Pickrell JK, Pritchard JK (2012) Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet 8(11):e1002967

  • Ross SM (1996) Stochastic processes, 2nd edn. Wiley, Hoboken

  • Sirén J, Marttinen P, Corander J (2011) Reconstructing population histories from single nucleotide polymorphism data. Mol Biol Evol 28:673–683

  • Sirén J, Hanage WP, Corander J (2013) Inference on population histories by approximating infinite alleles diffusion. Mol Biol Evol 30(2):457–468

  • Swofford DL, Olsen GJ, Waddell PJ, Hillis DM (1996) Phylogenetic inference. Sinauer Associates, Sunderland

  • Tamura K, Nei M (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol 10(3):512–526

  • Tataru P, Bataillon T, Hobolth A (2015) Inference under a Wright–Fisher model using an accurate Beta approximation. Genetics 201:1133–1141

  • Tataru P, Simonsen M, Bataillon T, Hobolth A (2016) Statistical inference in the Wright–Fisher model using allele frequency data. Syst Biol 66:e30–e46

  • Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Stat Assoc 101:1566–1581

  • Wong TT (2010) Parameter estimation for generalized Dirichlet distributions from the sample estimates of the first and the second moments of random variables. Comput Stat Data Anal 54(7):1756–1765

Acknowledgements

We are grateful to the associate editor and two anonymous reviewers for helpful comments and suggestions. This work is funded through a Grant from the Danish Research Council (DFF 4002-00382) awarded to Asger Hobolth.

Author information

Corresponding author

Correspondence to Maria Simonsen Speed.

Appendices

(Co)variance for PIM

In order to calculate the matrix exponential \(e^{Qt}=I+\sum \nolimits _{k=1}^{\infty } (Qt)^k/k!\) we need to find an expression for \(Q^k\), where \(Q=\mathbf {e}'\beta -\beta _0 I\) for \(\beta =(\beta _1,\dots ,\beta _K)\) and \(\beta _0=\sum _{i=1}^K\beta _i\). By induction, and using that \(\mathbf {e}'\beta \mathbf {e}'=\beta _0 \mathbf {e}'\), we get \(Q^k = (-\beta _0)^{k-1}(\mathbf {e}'\beta - \beta _0 I)\) for \(k\ge 1\). This implies that the matrix exponential is given by

$$\begin{aligned} e^{Qt}= I + (\mathbf {e}'\beta -\beta _0 I) \sum _{k=1}^{\infty }(-\beta _0)^{k-1}t^k\frac{1}{k!} =a(t)(I-\mathbf {e}'\overline{\beta })+\mathbf {e}'\overline{\beta } \end{aligned}$$

where \(a(t)=e^{-\beta _0 t}\) and \(\overline{\beta }=\beta /\beta _0\). The mean is therefore specified by

$$\begin{aligned} {{\,\mathrm{E}\,}}[x(t)\vert {x}(0)]= x(0)e^{Qt}= x(0) \left\{ a(t)\left( I-\mathbf {e}{'}\overline{\beta }\right) +\mathbf {e}{'}\overline{\beta }\right\} . \end{aligned}$$
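As a numerical sanity check, the closed form for \(e^{Qt}\) can be compared with a direct matrix exponential. The following Python/NumPy sketch uses arbitrary illustrative values for \(\beta \), \(x(0)\) and \(t\) (not taken from the paper):

import numpy as np
from scipy.linalg import expm

# Hypothetical example values, for illustration only
beta = np.array([0.3, 0.1, 0.2, 0.4])          # beta_1, ..., beta_K
beta0 = beta.sum()
K = len(beta)
ones = np.ones(K)                               # the row vector e of ones

Q = np.outer(ones, beta) - beta0 * np.eye(K)    # Q = e' beta - beta_0 I

t = 0.7
bbar = beta / beta0
P_closed = np.exp(-beta0 * t) * (np.eye(K) - np.outer(ones, bbar)) + np.outer(ones, bbar)
assert np.allclose(P_closed, expm(Q * t))       # closed form agrees with expm(Qt)

x0 = np.array([0.5, 0.2, 0.2, 0.1])             # initial allele fractions x(0)
mean_t = x0 @ P_closed                          # E[x(t) | x(0)] = x(0) e^{Qt}
assert np.isclose(mean_t.sum(), 1.0)            # fractions still sum to one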

We now calculate the (co)variance specified in Theorem 2.1 for the WF PIM process. By using that \(x(0)\mathbf {e}{'}=\mathbf {e}x(0){'}=1\) and \(\overline{\beta }\mathbf {e}{'}=\mathbf {e}\overline{\beta }{'}=1\), the last term in the (co)variance is given by

$$\begin{aligned}&\left( e^{Qt}\right) {'}x(0){'}x(0)e^{Qt}(1-e^{-t})\\&\quad =\left( 1-e^{-t}\right) \left\{ e^{-\beta _{0}t}\left( I-\overline{\beta }{'}\mathbf {e}\right) +\overline{\beta }{'} \mathbf {e} \right\} x(0){'}x(0) \left\{ e^{-\beta _{0}t}\left( I-\mathbf {e}{'}\overline{\beta }\right) +\mathbf {e}{'} \overline{\beta } \right\} \\&\quad =\left( 1-e^{-t}\right) \Bigl \lbrace e^{-2\beta _0 t}\left( x(0)-\overline{\beta }\right) ^{'}\left( x(0) -\overline{\beta }\right) +\overline{\beta }{'}\overline{\beta }\\&\qquad +\,e^{-\beta _0 t}\left\{ (x(0)-\overline{\beta })^{'}\overline{\beta } +\overline{\beta }{'}(x(0)-\overline{\beta })\right\} \Bigr \rbrace . \end{aligned}$$

The first term of the (co)variance involves an integral and is more tedious to calculate

$$\begin{aligned}&\int _0^{t}e^{-s}(e^{Qs}){'}{{\,\mathrm{diag}\,}}{\left( x(0)e^{Q(t-s)} \right) } e^{Qs}\;ds\\&\quad = \int _0^{t}e^{-s} \left\{ a(s)\left( I-\overline{\beta }{'}\mathbf {e}\right) +\overline{\beta }{'}\mathbf {e} \right\} \left\{ a(t-s){{\,\mathrm{diag}\,}}\left( x(0)-\overline{\beta }\right) + {{\,\mathrm{diag}\,}}\left( \overline{\beta }\right) \right\} \\&\qquad \times \,\left\{ a(s)\left( I-\mathbf {e}{'}\overline{\beta }\right) +\mathbf {e}{'} \overline{\beta }\right\} \;ds\;\\&\quad = \int _0^t e^{-s}a(s)^2 a(t-s)\;ds\; (I-\overline{\beta }{'}\mathbf {e}) {{\,\mathrm{diag}\,}}(x(0)-\overline{\beta })(I-\mathbf {e}{'}\overline{\beta })\\&\qquad + \,\int _0^t e^{-s}a(s)a(t-s)\;ds\; (I-\overline{\beta }{'}\mathbf {e}){{\,\mathrm{diag}\,}}(x(0)-\overline{\beta }) \mathbf {e}{'}\overline{\beta }\\&\qquad + \,\int _0^t e^{-s}a(s)^2\;ds\; (I-\overline{\beta }{'}\mathbf {e}){{\,\mathrm{diag}\,}}(\overline{\beta })(I-\mathbf {e}{'} \overline{\beta })\\&\qquad +\, \int _0^t e^{-s}a(s)\;ds\; (I-\overline{\beta }{'}\mathbf {e}){{\,\mathrm{diag}\,}}(\overline{\beta })\mathbf {e}{'}\overline{\beta }\\&\qquad + \,\int _0^t e^{-s}a(t-s)a(s)\;ds\; \overline{\beta }{'}\mathbf {e}{{\,\mathrm{diag}\,}}(x(0)-\overline{\beta })(I-\mathbf {e}{'} \overline{\beta })\\&\qquad + \,\int _0^t e^{-s}a(t-s)\;ds\; \overline{\beta }{'}\mathbf {e}{{\,\mathrm{diag}\,}}(x(0) -\overline{\beta })\mathbf {e}{'}\overline{\beta }\\&\qquad +\, \int _0^t e^{-s}a(s)\;ds\; \overline{\beta }{'}\mathbf {e}{{\,\mathrm{diag}\,}}(\overline{\beta }) (I-\mathbf {e}{'}\overline{\beta })\\&\qquad +\, \int _0^{t} e^{-s}\;ds\; \overline{\beta }{'}\mathbf {e}{{\,\mathrm{diag}\,}}(\overline{\beta }) \mathbf {e}{'}\overline{\beta }\\&\quad =\left( 1-e^{-t}\right) \overline{\beta }{'}\overline{\beta }+e^{-\beta _0 t} \frac{1}{1+\beta _0} \left( 1-e^{-(1+\beta _0)t}\right) \\&\qquad \times \left\{ {{\,\mathrm{diag}\,}}\left( x(0)-\overline{\beta }\right) -\left( x(0) -\overline{\beta }\right) {'}\overline{\beta } -\overline{\beta }{'}\left( x(0)-\overline{\beta }\right) \right\} \\&\qquad +\, e^{-\beta _0 t}\left( 1-e^{-t}\right) \left\{ (x(0)-\overline{\beta }){'}\overline{\beta }+\overline{\beta }{'}(x(0) -\overline{\beta })\right\} \\&\qquad +\,\frac{1}{1+2\beta _0}\left( 1-e^{-(1+2\beta _0)t}\right) \left\{ {{\,\mathrm{diag}\,}}\left( \overline{\beta }\right) -\overline{\beta }{'} \overline{\beta }\right\} \end{aligned}$$

by using \(\mathbf {e}{{\,\mathrm{diag}\,}}(\overline{\beta })=\overline{\beta }\), \({{\,\mathrm{diag}\,}}(x(0))\mathbf {e}{'}=x(0){'}\), \(\mathbf {e}{{\,\mathrm{diag}\,}}(x(0))=x(0)\), assuming \(\beta _0\ne 1\), and

$$\begin{aligned} \begin{aligned}&\int _0^t e^{-s}\;ds\; = 1-e^{-t}&\quad&\int _0^t e^{-s}a(s)\;ds\; = \frac{1}{\beta _0+1}\left( 1-e^{-(\beta _0+1) t}\right) \\&\int _0^t e^{-s}a(t-s)\;ds\; = \frac{1}{\beta _0-1}\left( e^{-t}-e^{-\beta _0 t}\right)&\quad&\int _0^t e^{-s}a(t-s)a(s)\;ds\;=e^{-\beta _0 t}\left( 1-e^{-t}\right) \\&\int _0^t e^{-s}a(s)^2\;ds\; =\frac{1}{1+2\beta _0}\left( 1-e^{-(1+2\beta _0)t}\right)&\quad&\int _0^t e^{-s}a(s)^2a(t-s)\;ds\; =e^{-\beta _0 t}\frac{1}{1+\beta _0}\left( 1-e^{-(1+\beta _0)t}\right) . \end{aligned} \end{aligned}$$

The final expression for the (co)variance becomes

$$\begin{aligned}&{{\,\mathrm{Var}\,}}(x(t)\vert {x}(0))\\&\quad = \frac{1}{1+2\beta _{0}} \left( 1-e^{-(1+2\beta _{0})t}\right) \left\{ {{\,\mathrm{diag}\,}}\left( \overline{\beta }\right) -\overline{\beta }{'}\overline{\beta }\right\} \\&\qquad -\,\left( 1-e^{-t}\right) e^{-2\beta _{0}t}\left( x(0)-\overline{\beta }\right) {'} \left( x(0)-\overline{\beta }\right) + e^{-\beta _{0}t} \frac{1}{1+\beta _{0}}\left( 1-e^{-(1+\beta _{0})t}\right) \\&\qquad \times \, \left\{ {{\,\mathrm{diag}\,}}\left( x(0)-\overline{\beta }\right) -\left( x(0) -\overline{\beta }\right) {'}\overline{\beta }-\overline{\beta }{'}\left( x(0)-\overline{\beta }\right) \right\} . \end{aligned}$$
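The final expression can be transcribed directly into code. The sketch below (Python/NumPy, with hypothetical example values for \(x(0)\), \(\beta \) and \(t\)) implements it and checks that each row of the (co)variance matrix sums to zero, as it must since the allele fractions sum to one, and that the (co)variance vanishes at \(t=0\):

import numpy as np

def pim_covariance(x0, beta, t):
    # Var(x(t) | x(0)) for the WF PIM diffusion, transcribing the expression above
    x0, beta = np.asarray(x0), np.asarray(beta)
    b0 = beta.sum()
    bbar = beta / b0
    d = x0 - bbar
    term1 = (1 - np.exp(-(1 + 2 * b0) * t)) / (1 + 2 * b0) * (np.diag(bbar) - np.outer(bbar, bbar))
    term2 = (1 - np.exp(-t)) * np.exp(-2 * b0 * t) * np.outer(d, d)
    term3 = (np.exp(-b0 * t) * (1 - np.exp(-(1 + b0) * t)) / (1 + b0)
             * (np.diag(d) - np.outer(d, bbar) - np.outer(bbar, d)))
    return term1 - term2 + term3

V = pim_covariance([0.5, 0.2, 0.2, 0.1], [0.3, 0.1, 0.2, 0.4], t=0.7)
assert np.allclose(V.sum(axis=1), 0)                                       # rows sum to zero
assert np.allclose(pim_covariance([0.5, 0.2, 0.2, 0.1], [0.3, 0.1, 0.2, 0.4], t=0.0), 0)  # no variance at t = 0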

(Co)variance for IAM

This section provides a more elegant derivation of the (co)variance expression for the IAM process than the computations in Sirén et al. (2013). By induction we obtain the following matrix expression for the matrix exponential of the IAM process

$$\begin{aligned} e^{Qt}=e^{-tm} \left( I-\mathbf {e}{'}\mathbf {e}_K\right) +\mathbf {e}{'}\mathbf {e}_K. \end{aligned}$$

Let \({\tilde{x}}(0)=(x_1(0),\dots ,x_{K-1}(0),x_K(0)-1)\) and \({\tilde{\mathbf {1}}}=(0,\dots ,0,1)\) be \((1\times K)\)-dimensional vectors. We provide a proof of the expression for the (co)variance in Eq. (1) in Corollary 3.3, where the last term is given by

$$\begin{aligned} (e^{Qt}){'}x(0)&{'}x(0)e^{Qt}\left( 1-e^{-t}\right) \\&\quad = \left( 1-e^{-t}\right) \Big \lbrace \left( e^{tm}{\tilde{x}(0)}{'}+{\tilde{\mathbf {1}}}{'}\right) \Big (e^{-tm}{\tilde{x}(0)}+{\tilde{\mathbf {1}}}\Big )\Big \rbrace \\&\quad = \left( 1-e^{-t}\right) \left\{ e^{-2tm}\left( x(0){'} x(0) - \begin{pmatrix} 0 &{} \quad \dots &{} \quad 0 &{} \quad x_1(0)\\ \vdots &{} \quad \ddots &{} \quad \vdots &{} \quad \vdots \\ 0 &{} \quad \dots &{} \quad 0 &{} \quad x_{K-1}(0)\\ x_1(0) &{} \quad \dots &{} \quad x_{K-1}(0) &{} \quad -1+2x_{K}(0) \end{pmatrix} \right) \right. \\&\left. \qquad +\, e^{-tm}\left\{ \begin{pmatrix} 0 &{} \quad \dots &{} \quad x_{1}(0)\\ \vdots &{} \quad \ddots &{} \quad \vdots \\ 0 &{} \quad \dots &{} \quad x_{K}(0)-1\\ \end{pmatrix} + \begin{pmatrix} 0 &{} \quad \dots &{} \quad 0\\ \vdots &{} \quad \ddots &{} \quad \vdots \\ x_1(0) &{} \quad \dots &{} \quad x_{K}(0)-1\\ \end{pmatrix}\right\} + {{\,\mathrm{diag}\,}}\left( {\tilde{\mathbf {1}}}\right) \right\} . \end{aligned}$$

The first term of the (co)variance becomes

$$\begin{aligned} \int _0^{t}&e^{-s}(e^{Qs}){'}{{\,\mathrm{diag}\,}}{\left( x(0)e^{Q(t-s)} \right) } e^{Qs}\;ds\;\\ =&\int _0^{t}e^{-s}\left\{ e^{-sm}\begin{pmatrix} 1 &{} \quad 0 &{} \quad \dots &{} \quad 0\\ 0 &{} \quad \ddots &{} \quad &{} \quad \vdots \\ \vdots &{} \quad &{} \quad 1 &{} \quad 0\\ -1 &{} \quad \dots &{} \quad -1 &{} \quad 0 \end{pmatrix} + \begin{pmatrix} 0 &{} \quad 0 &{} \quad \dots &{} \quad 0\\ 0 &{} \quad \ddots &{} \quad &{} \quad \vdots \\ \vdots &{} \quad &{} \quad 0 &{} \quad 0\\ 1 &{} \quad \dots &{} \quad 1 &{} \quad 1 \end{pmatrix}\right\} \left\{ e^{-(t-s)m}{{\,\mathrm{diag}\,}}(\widetilde{{x}}(0)) + {{\,\mathrm{diag}\,}}\left( {\tilde{\mathbf {1}}}\right) \right\} \\&\times \left\{ e^{-sm}\begin{pmatrix} 1 &{} \quad 0 &{} \quad \dots &{} \quad -1\\ 0 &{} \quad \ddots &{} \quad &{} \quad \vdots \\ \vdots &{} \quad &{} \quad 1 &{} \quad -1\\ 0 &{} \quad \dots &{} \quad 0 &{} \quad 0 \end{pmatrix} + \begin{pmatrix} 0 &{} \quad 0 &{} \quad \dots &{} \quad 1\\ 0 &{} \quad \ddots &{} \quad \vdots &{} \quad 1\\ \vdots &{} \quad \dots &{} \quad 0 &{} \quad 1\\ 0 &{} \quad \dots &{} \quad 0 &{} \quad 1 \end{pmatrix}\right\} \;ds\;\\&= e^{-tm} \begin{pmatrix} x_1(0) &{} \quad 0 &{} \quad \dots &{} \quad -x_1(0)\\ 0 &{} \quad \ddots &{} \quad &{} \quad \vdots \\ \vdots &{} \quad &{} \quad x_{K-1}(0) &{} \quad -x_{K-1}(0)\\ -x_1(0) &{} \quad \dots &{} \quad -x_{K-1}(0) &{} \quad 1-x_K(0) \end{pmatrix} \int _0^t e^{-(1+m)s}\;ds\; +\left( 1-e^{-t}\right) {{\,\mathrm{diag}\,}}\left( {\tilde{\mathbf {1}}}\right) \\&\quad +\, e^{-tm}\left( 1-e^{-t}\right) \begin{pmatrix} 0 &{} \quad 0 &{} \quad \dots &{} \quad x_1(0)\\ 0 &{} \quad \ddots &{} \quad &{} \quad \vdots \\ \vdots &{} \quad &{} \quad 0 &{} \quad x_{K-1}(0)\\ x_1(0) &{} \quad \dots &{} \quad x_{K-1}(0) &{} \quad 2(x_K(0)-1) \end{pmatrix}. \end{aligned}$$

Letting \(\widehat{x}(0)=(x_1(0),\dots ,x_{K-1}(0))\), the final expression for the (co)variance becomes

$$\begin{aligned}&{{\,\mathrm{Var}\,}}(x(t)\vert {\widehat{x}}(0),x_K(0))\\&\quad =e^{-tm}\frac{1}{m+1}\left( 1-e^{-(1+m)t}\right) \begin{pmatrix} {{\,\mathrm{diag}\,}}(\widehat{x}(0)) &{}\quad -\widehat{x}(0){'}\\ -\widehat{x}(0) &{}\quad 1-x_K(0) \end{pmatrix}\\&\qquad -\,\left( 1-e^{-t}\right) e^{-2tm} \begin{pmatrix} \widehat{x}(0){'}\widehat{x}(0) &{} \quad -\widehat{x}(0){'}(1-x_K(0))\\ -\widehat{x}(0)(1-x_K(0)) &{} \quad (1-x_K(0))(1-x_K(0)) \end{pmatrix}\\&\quad =F\big (t,m,x_K(0)\big )\begin{pmatrix} {{\,\mathrm{diag}\,}}\left( \frac{\widehat{x}}{1-x_K(0)}\right) &{} \quad 0\\ 0 &{} \quad 0 \end{pmatrix} - G\big (t,m,x_K(0)\big )\begin{pmatrix} \frac{\widehat{x}{'}(0)}{1-x_K(0)}\frac{\widehat{x}(0)}{1-x_K(0)} &{} \quad 0 \\ 0 &{} \quad 0 \end{pmatrix}\\&\qquad +\, \Big (F\big (t,m,x_K(0)\big )-G\big (t,m,x_K(0)\big )\Big )\begin{pmatrix} 0 &{} \quad -\,\frac{\widehat{x}{'}(0)}{1-x_K(0)}\\ -\,\frac{\widehat{x}(0)}{1-x_K(0)} &{} \quad 1 \end{pmatrix} \end{aligned}$$

where

$$\begin{aligned} F(t,m,x_K(0))= & {} e^{-tm}\frac{1}{m+1}\left( 1-e^{-(1+m)t}\right) (1-x_K(0))\\ G(t,m,x_K(0))= & {} \left( 1-e^{-t}\right) e^{-2tm}(1-x_K(0))^2. \end{aligned}$$
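In block form the result is straightforward to implement. The following Python/NumPy sketch (with hypothetical values for \(x(0)\), m and t) builds \({{\,\mathrm{Var}\,}}(x(t)\vert x(0))\) from the block formula above and checks that its rows sum to zero:

import numpy as np

def iam_covariance(x0, m, t):
    # Var(x(t) | x(0)) for the IAM process, using the block formula above
    x0 = np.asarray(x0)
    xhat, xK = x0[:-1], x0[-1]
    c1 = np.exp(-t * m) * (1 - np.exp(-(1 + m) * t)) / (m + 1)
    c2 = (1 - np.exp(-t)) * np.exp(-2 * t * m)
    K = len(x0)
    V = np.zeros((K, K))
    V[:-1, :-1] = c1 * np.diag(xhat) - c2 * np.outer(xhat, xhat)
    V[:-1, -1] = -c1 * xhat + c2 * xhat * (1 - xK)
    V[-1, :-1] = V[:-1, -1]
    V[-1, -1] = c1 * (1 - xK) - c2 * (1 - xK) ** 2
    return V

V = iam_covariance([0.4, 0.3, 0.2, 0.1], m=0.05, t=1.0)
assert np.allclose(V.sum(axis=1), 0)            # fractions sum to one, so rows sum to zero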

(Co)variance for HKY, TN and SSM

This section provides a proof of the expression for the (co)variance for the TN process in Eq. (3) in Corollary 3.5 (and for the HKY and SSM processes). The matrix exponential can be written in the form \(e^{Qt}=\sum _{i=1}^4 a_ie^{-b_i t}A_i\), which makes the last term in the (co)variance straightforward to calculate

$$\begin{aligned} (e^{Qt}){'}x(0){'}x(0)e^{Qt}\left( 1-e^{-t}\right)= & {} \left( 1-e^{-t}\right) \sum _{i=1}^4\sum _{j=1}^4a_ia_je^{-(b_i+b_j) t}A_i{'}x(0){'}x(0)A_j. \end{aligned}$$

The first term of the (co)variance is given by

$$\begin{aligned}&\int _0^{t}e^{-s}\left( e^{Qs}\right) {'}{{\,\mathrm{diag}\,}}{\left( x(0)e^{Q(t-s)} \right) } e^{Qs}\;ds\;\\&\quad = \int _0^{t}e^{-s}\sum _{i=1}^4 a_ie^{-b_i s}A_i{'} {{\,\mathrm{diag}\,}}{\left( x(0)\sum _{j=1}^4 a_je^{-b_j (t-s)}A_j\right) } \sum _{k=1}^4 a_k e^{-b_k s}A_k\;ds\;\\&\quad = \sum _{i=1}^4\sum _{j=1}^4\sum _{k=1}^4 a_ia_ja_kg_{ijk}(t)A_i{'} {{\,\mathrm{diag}\,}}{\left( x(0)A_j\right) }A_k \end{aligned}$$

where

$$\begin{aligned} g_{ijk}(t)= & {} \int _0^t e^{-s}e^{-b_i s} e^{-b_j(t-s)}e^{-b_k s}\;ds\;\\= & {} {\left\{ \begin{array}{ll} te^{-b_j t} &{}\quad \text {if }1+b_i+b_k=b_j\\ \frac{e^{-b_j t}-e^{-(1+b_i+b_k)t}}{1+b_i+b_k-b_j} &{}\quad \text {otherwise} \end{array}\right. }\qquad i,j,k = 1,2,3,4. \end{aligned}$$
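The case distinction in \(g_{ijk}(t)\) is needed because the antiderivative degenerates when \(1+b_i+b_k=b_j\). A minimal Python sketch, with arbitrary example rates and a cross-check against numerical quadrature:

import numpy as np
from scipy.integrate import quad

def g(t, b_i, b_j, b_k):
    # g_ijk(t) as defined above, with the special case 1 + b_i + b_k = b_j
    c = 1.0 + b_i + b_k
    if np.isclose(c, b_j):
        return t * np.exp(-b_j * t)
    return (np.exp(-b_j * t) - np.exp(-c * t)) / (c - b_j)

t, b_i, b_j, b_k = 1.2, 0.4, 0.9, 0.1           # hypothetical values
val, _ = quad(lambda s: np.exp(-s - b_i * s - b_j * (t - s) - b_k * s), 0, t)
assert np.isclose(val, g(t, b_i, b_j, b_k))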

Moment-based parameter estimation

This section outlines how to estimate the free parameters in the four models (Dirichlet, Beta–Dirichlet, Hierarchical Beta and Pyramidal Hierarchical Beta) using the moment-based method.

1.1 Dirichlet

Consider K-dimensional Dirichlet-distributed allele fraction data \(x(t)=(x_1(t),\dots ,x_K(t)) \sim {{\,\mathrm{Dirichlet}\,}}(\alpha )\) with parameter \(\alpha =(\alpha _1,\dots ,\alpha _K)\). The mean and (co)variance under the Dirichlet model are given by

$$\begin{aligned} {{\,\mathrm{E}\,}}[x(t)]=\alpha /\alpha _0\quad \text {and}\quad {{\,\mathrm{Var}\,}}(x(t))=1/(\alpha _0+1) \left\{ \alpha /\alpha _0 -\left( \alpha /\alpha _0\right) ^2\right\} ,\quad \alpha _0=\sum _{i=1}^{K}\alpha _i. \end{aligned}$$

We now estimate \(\alpha \) and \(\alpha _0\) from the Dirichlet means and (co)variances displayed in Table 1. An estimate of \(\alpha \) is obtained from the mean: \(\widehat{\alpha }={{\,\mathrm{E}\,}}[x(t)]\widehat{\alpha }_0\). The parameter \(\alpha _0\) is estimated by summing the known variances over the K allelic types

$$\begin{aligned} \sum _{k=1}^K{{\,\mathrm{Var}\,}}(x_k(t))= & {} \frac{1}{\alpha _0+1} \left\{ \sum _{k=1}^K{{\,\mathrm{E}\,}}[x_k(t)] -\sum _{k=1}^K {{\,\mathrm{E}\,}}[x_k(t)]^2\right\} \\= & {} \frac{1}{\alpha _0+1} \left\{ 1 -\sum _{k=1}^K {{\,\mathrm{E}\,}}[x_k(t)]^2\right\} \end{aligned}$$

which implies

$$\begin{aligned} \widehat{\alpha }_0=\frac{1 -\sum _{k=1}^K {{\,\mathrm{E}\,}}[x_k(t)]^2-\sum _{k=1}^K{{\,\mathrm{Var}\,}}(x_k(t))}{\sum _{k=1}^K{{\,\mathrm{Var}\,}}(x_k(t))}. \end{aligned}$$

We then estimate the mean and variance of the stochastic variables as follows for \(k=1,\dots ,K-1\)

$$\begin{aligned} \widehat{\mu }_{\omega _k} = \frac{\widehat{\alpha }_k}{\widehat{\alpha }_0-\sum _{l=1}^{k-1} \widehat{\alpha }_l}\quad \text {and}\quad \widehat{V}_{\omega _k} = \frac{\widehat{\alpha }_k(\widehat{\alpha }_0-\sum _{l=1}^k \widehat{\alpha }_l)}{\left( \widehat{\alpha }_0-\sum _{l=1}^{k-1}\widehat{\alpha }_l\right) ^2 \left( \widehat{\alpha }_0-\sum _{l=1}^{k-1}\widehat{\alpha }_l+1\right) }. \end{aligned}$$
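The estimation steps above can be collected in a short routine. The following Python/NumPy sketch takes the component means and variances as input and returns \(\widehat{\alpha }_0\), \(\widehat{\alpha }\) and the Beta moments of the \(\omega _k\); the round-trip check at the end uses a hypothetical Dirichlet(8, 6, 4, 2) example:

import numpy as np

def dirichlet_moment_fit(mean, var):
    # Moment-based fit of the Dirichlet model from component means and variances
    mean, var = np.asarray(mean), np.asarray(var)
    alpha0 = (1 - np.sum(mean**2) - np.sum(var)) / np.sum(var)
    alpha = mean * alpha0
    # Beta means and variances for omega_1, ..., omega_{K-1}
    tail = alpha0 - np.concatenate(([0.0], np.cumsum(alpha)[:-1]))   # alpha_0 - sum_{l<k} alpha_l
    mu_omega = (alpha / tail)[:-1]
    V_omega = (alpha * (tail - alpha) / (tail**2 * (tail + 1)))[:-1]
    return alpha0, alpha, mu_omega, V_omega

mean = np.array([0.4, 0.3, 0.2, 0.1])            # Dirichlet(8, 6, 4, 2) means, alpha_0 = 20
var = mean * (1 - mean) / 21                     # component variances, alpha_0 + 1 = 21
alpha0_hat, alpha_hat, mu_omega, V_omega = dirichlet_moment_fit(mean, var)
assert np.isclose(alpha0_hat, 20) and np.allclose(alpha_hat, [8, 6, 4, 2])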

1.2 Beta–Dirichlet

Consider K-dimensional Beta–Dirichlet distributed allele fraction data \(x(t)=((1-x_K(t))y,x_K(t))\), where

$$\begin{aligned} x_K(t)\sim {{\,\mathrm{Beta}\,}}(\phi \mu ,\phi (1-\mu ))\quad \text {and}\quad y\sim {{\,\mathrm{Dirichlet}\,}}(\alpha ), \quad \alpha =(\alpha _1,\dots ,\alpha _{K-1}) \end{aligned}$$

are independent. The mean and (co)variance under the Beta–Dirichlet model are given by

$$\begin{aligned} {{\,\mathrm{E}\,}}[x(t)]= & {} {{\,\mathrm{E}\,}}[{{\,\mathrm{E}\,}}[x(t)\vert {x}_K(t)]]={{\,\mathrm{E}\,}}\left[ \left( (1-x_K(t)) \tfrac{\alpha }{\alpha _0},x_K(t)\right) \right] =\left( (1-\mu )\tfrac{\alpha }{\alpha _0},\mu \right) \\ {{\,\mathrm{Var}\,}}(x(t))= & {} {{\,\mathrm{Var}\,}}({{\,\mathrm{E}\,}}[x(t)\vert {x}_{K}(t)])+{{\,\mathrm{E}\,}}[{{\,\mathrm{Var}\,}}(x(t)\vert {x}_K(t))]\\= & {} {{\,\mathrm{Var}\,}}\left( ((1-x_K(t))\tfrac{\alpha }{\alpha _0},x_K(t)\right) +{{\,\mathrm{E}\,}}\left[ (1-x_K(t))^2\right] \\&\quad \begin{pmatrix} \frac{1}{\alpha _0+1}\left\{ {{\,\mathrm{diag}\,}}\left( \frac{\alpha }{\alpha _0}\right) -\left( \frac{\alpha }{\alpha _0}\right) {'}\frac{\alpha }{\alpha _0}\right\} &{} \quad 0\\ 0 &{} \quad 0 \end{pmatrix}\\= & {} \frac{\mu (1-\mu )}{\phi +1}\begin{pmatrix} \left( \frac{\alpha }{\alpha _0}\right) {'}\frac{\alpha }{\alpha _0} &{} \quad -\left( \frac{\alpha }{\alpha _0}\right) {'}\\ -\frac{\alpha }{\alpha _0} &{} \quad 1 \end{pmatrix}\\&+ \left\{ (1-\mu )^2+\frac{\mu (1-\mu )}{\phi +1}\right\} \frac{1}{\alpha _0+1}\begin{pmatrix} {{\,\mathrm{diag}\,}}\left( \frac{\alpha }{\alpha _0}\right) -\left( \frac{\alpha }{\alpha _0}\right) {'}\frac{\alpha }{\alpha _0} &{} \quad 0\\ 0 &{} \quad 0 \end{pmatrix}\\= & {} A(\mu ,\phi ,\alpha _0)\begin{pmatrix} {{\,\mathrm{diag}\,}}\left( \frac{\alpha }{\alpha _0}\right) &{} \quad 0\\ 0 &{} \quad 0 \end{pmatrix} +B(\mu ,\phi )\begin{pmatrix} 0 &{} \quad -\left( \frac{\alpha }{\alpha _0}\right) {'}\\ -\frac{\alpha }{\alpha _0} &{} \quad 1 \end{pmatrix}\\&+ \left( B(\mu ,\phi )-A(\mu ,\phi ,\alpha _0)\right) \begin{pmatrix} \left( \frac{\alpha }{\alpha _0}\right) {'}\frac{\alpha }{\alpha _0} &{} \quad 0\\ 0 &{} \quad 0 \end{pmatrix} \end{aligned}$$

where

$$\begin{aligned} A(\mu ,\phi ,\alpha _0)= & {} \left\{ (1-\mu )^2+\frac{\mu (1-\mu )}{\phi +1}\right\} \frac{1}{\alpha _0+1},\\ B(\mu ,\phi )= & {} \frac{\mu (1-\mu )}{\phi +1} \quad \text {and}\quad \alpha _0=\sum _{i=1}^{K-1}\alpha _i. \end{aligned}$$

The Beta–Dirichlet model has \(K+1\) free parameters: \(\mu ,\phi \) and \(\alpha =(\alpha _1,\dots ,\alpha _{K-1})\). Having calculated the means and (co)variances of the model displayed in Table 2, we now discuss how these parameters are estimated from the \(K+1\) parameters \(t,m,x(0)=(x_1(0),\dots ,x_K(0))\) of the IAM process. Equating the expressions for the means of \(x_K(t)\) and of \((x_1(t),\dots ,x_{K-1}(t))=(1-x_K(t))y\) in the Beta–Dirichlet model and in the IAM process gives

$$\begin{aligned} \mu = 1-e^{-mt}(1-x_K(0))\quad \text {and}\quad (1-\mu )\frac{\alpha _j}{\alpha _0}=e^{-mt}x_j(0), \quad j=1,\dots ,K-1.\nonumber \\ \end{aligned}$$
(8)

The first part of Eq. (8) yields the estimate \(\widehat{\mu }\). The parameter \(\phi \) is estimated by equating the expressions for the variance of \(x_K(t)\) in the two models and applying Eq. (8)

$$\begin{aligned} \frac{\mu (1-\mu )}{\phi +1}= & {} e^{-mt}\frac{1-e^{-(1+m)t}}{m+1}(1-x_K(0))-(1-e^{-t})e^{-2mt}(1-x_K(0))^2\\= & {} (1-\mu )\frac{1-e^{-(1+m)t}}{m+1}-(1-\mu )^2(1-e^{-t}) \end{aligned}$$

which is equivalent to

$$\begin{aligned} \widehat{\phi } = \mu \left( \frac{1-e^{-(1+m)t}}{m+1}-(1-\mu )(1-e^{-t})\right) ^{-1}-1. \end{aligned}$$

We observe that the two expressions \(A(\mu ,\phi ,\alpha _0)\) and \(F(t,m,x_K(0))\) are equal

$$\begin{aligned} \left\{ (1-\mu )^2+\frac{\mu (1-\mu )}{\phi +1}\right\} \frac{1}{\alpha _0+1}= & {} e^{-mt}\frac{1}{1+m}\left( 1-e^{-(1+m)t}\right) (1-x_K(0)) \end{aligned}$$

and can therefore estimate the parameter \(\alpha _0\) from this equality by using Eq. (8)

$$\begin{aligned} \widehat{\alpha }_0= & {} \frac{(1+m)\left\{ (1-\mu )^2+\frac{\mu (1-\mu )}{\phi +1} \right\} }{(1-\mu )(1-e^{-(1+m)t})}-1 =\frac{1+m}{1-e^{-(1+m)t}}\left\{ 1-\mu +\frac{\mu }{\phi +1}\right\} -1\\= & {} \frac{1+m}{1-e^{-(1+m)t}}\left\{ 1-\mu +\frac{1-e^{-(1+m)t}}{1+m}-(1-e^{-t})(1-\mu )\right\} -1\\= & {} \frac{1+m}{1-e^{-(1+m)t}}(1-\mu )e^{-t}. \end{aligned}$$

The second part of Eq. (8) estimates \(\widehat{\alpha }_j=(\alpha _0e^{-mt}x_j(0))/(1-\mu )\), \(j=1,\dots ,K-1\). We then estimate the mean and variance of the stochastic variables as follows for \(k=1,\dots ,K-2\)

$$\begin{aligned} \widehat{\mu }_{\xi }= & {} \widehat{\mu }, \qquad \widehat{V}_{\xi } = \frac{\widehat{\mu }(1-\widehat{\mu })}{\widehat{\phi }+1}, \qquad \widehat{\mu }_{\omega _k} = \frac{\widehat{\alpha }_k}{\widehat{\alpha }_0-\sum _{l=1}^{k-1}\widehat{\alpha }_l},\\ \widehat{V}_{\omega _k}= & {} \frac{\widehat{\alpha }_k\left( \widehat{\alpha }_0-\sum _{l=1}^k\widehat{\alpha }_l\right) }{\left( \widehat{\alpha }_0-\sum _{l=1}^{k-1}\widehat{\alpha }_l\right) ^2\left( \widehat{\alpha }_0-\sum _{l=1}^{k-1}\widehat{\alpha }_l+1\right) }. \end{aligned}$$
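The Beta–Dirichlet estimation steps transcribe directly into code. The Python/NumPy sketch below maps \((t,m,x(0))\) to \((\widehat{\mu },\widehat{\phi },\widehat{\alpha }_0,\widehat{\alpha })\) using Eq. (8) and the expressions above, with hypothetical input values:

import numpy as np

def beta_dirichlet_fit(x0, m, t):
    # Moment matching of the Beta-Dirichlet model to the IAM process
    x0 = np.asarray(x0)
    xK = x0[-1]
    mu = 1 - np.exp(-m * t) * (1 - xK)
    phi = mu / ((1 - np.exp(-(1 + m) * t)) / (m + 1)
                - (1 - mu) * (1 - np.exp(-t))) - 1
    alpha0 = (1 + m) * (1 - mu) * np.exp(-t) / (1 - np.exp(-(1 + m) * t))
    alpha = alpha0 * np.exp(-m * t) * x0[:-1] / (1 - mu)
    return mu, phi, alpha0, alpha

mu, phi, alpha0, alpha = beta_dirichlet_fit([0.5, 0.2, 0.2, 0.1], m=0.05, t=1.0)
assert np.isclose(alpha.sum(), alpha0)           # the alpha_j sum to alpha_0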
Table 7 Definition of the Pyramidal Hierarchical Beta model with \({K=4}\) alleles

1.3 Hierarchical Beta

Consider 4-dimensional Hierarchical Beta distributed allele fraction data \(x(t)=(x_1(t),x_2(t),x_3(t),x_4(t))\). The Hierarchical Beta model has 6 free parameters: 3 means \((\mu _{\omega }, \mu _{\eta _1}, \mu _{\eta _2})\) and 3 variances \((V_{\omega }, V_{\eta _1}, V_{\eta _2})\). Hobolth and Sirén (2016) provide the calculations of the means and (co)variances displayed in Table 3.

The mean of \((x_1, x_2, x_3, x_4)\) is a bijective function of \((\mu _{\omega }, \mu _{\eta _1}, \mu _{\eta _2})\). The means \((\mu _{\omega }, \mu _{\eta _1}, \mu _{\eta _2})\) are therefore completely determined from the means of the allele fractions

$$\begin{aligned} \widehat{\mu }_{\omega } = {{\,\mathrm{E}\,}}[x_1]+{{\,\mathrm{E}\,}}[x_2],\qquad \widehat{\mu }_{\eta _1} = {{\,\mathrm{E}\,}}[x_1]\mu _{\omega }^{-1},\qquad \widehat{\mu }_{\eta _2} = {{\,\mathrm{E}\,}}[x_3](1-\mu _{\omega })^{-1}. \end{aligned}$$
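The mean step is a direct transcription of the three equations above; the variance step requires the expressions in Table 3 and is therefore omitted from this sketch (Python, hypothetical input values):

def hb_means(Ex):
    # Hierarchical Beta means from the allele fraction means E[x_1], ..., E[x_4]
    mu_omega = Ex[0] + Ex[1]
    mu_eta1 = Ex[0] / mu_omega
    mu_eta2 = Ex[2] / (1 - mu_omega)
    return mu_omega, mu_eta1, mu_eta2

mu_omega, mu_eta1, mu_eta2 = hb_means([0.4, 0.3, 0.2, 0.1])   # hypothetical means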

The variances are determined from the following three equations by isolating the variable of interest

$$\begin{aligned} \widehat{V}_{\omega }= {{\,\mathrm{Var}\,}}(x_1+x_2),\qquad \widehat{V}_{\eta _1}= f_1({{\,\mathrm{Var}\,}}(x_1)+{{\,\mathrm{Var}\,}}(x_2)) ,\qquad \widehat{V}_{\eta _2}= f_2({{\,\mathrm{Var}\,}}(x_3)+{{\,\mathrm{Var}\,}}(x_4)) \end{aligned}$$

where \(f_i: \mathbf {R} \rightarrow \mathbf {R}\), \(i=1,2\), are functions of the variance of interest.

1.4 Pyramidal Hierarchical Beta

The Pyramidal model has \((K-1)(K+2)/2\) free parameters (\(K-1\) means: \(\mu _{\eta _k}\), and \(K(K-1)/2\) variances: \(V_{\omega _{ij}}\), \(i=1,\dots ,K-2\), \(j=1,\dots ,i\), and \(V_{\eta _k}\), \(k=1,\dots ,K-1\)). The calculated means of the model are displayed in Table 4 for \(K=4\), while the variances and covariances are depicted in Table 7. We discuss how the 9 free parameters [3 means: \((\mu _{\eta _{1}},\mu _{\eta _2},\mu _{\eta _3})\) and 6 variances: \((V_{\omega _{11}},V_{\omega _{21}},V_{\omega _{22}}, V_{\eta _1},V_{\eta _2},V_{\eta _3})\)] are estimated.

The mean of \((x_1,x_2,x_3,x_4)\) is a bijective function of \((\mu _{\eta _1},\mu _{\eta _2},\mu _{\eta _3})\). The means \((\mu _{\eta _{1}},\mu _{\eta _2},\mu _{\eta _3})\) are therefore completely determined from the means as follows

$$\begin{aligned} \widehat{\mu }_{\eta _1}= & {} {{\,\mathrm{E}\,}}[x_1](\mu _{\omega _{11}}\mu _{\omega _{21}})^{-1}\\ \widehat{\mu }_{\eta _2}= & {} \left( {{\,\mathrm{E}\,}}[x_1]+{{\,\mathrm{E}\,}}[x_2]-\mu _{\omega _{11}}\mu _{\omega _{21}}\right) \left( \mu _{\omega _{11}}(1-\mu _{\omega _{21}})+(1-\mu _{\omega _{11}})\mu _{\omega _{22}}\right) ^{-1}\\ \widehat{\mu }_{\eta _3}= & {} 1-{{\,\mathrm{E}\,}}[x_4]\left( (1-\mu _{\omega _{11}})(1-\mu _{\omega _{22}})\right) ^{-1}. \end{aligned}$$
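These three equations transcribe directly into code. The sketch below (Python, hypothetical inputs) assumes the \(\mu _{\omega }\) values are already available, as in the formulas above:

def pyramid_eta_means(Ex, mu_w11, mu_w21, mu_w22):
    # Pyramidal Hierarchical Beta: eta means from the allele fraction means and omega means
    mu_eta1 = Ex[0] / (mu_w11 * mu_w21)
    mu_eta2 = ((Ex[0] + Ex[1] - mu_w11 * mu_w21)
               / (mu_w11 * (1 - mu_w21) + (1 - mu_w11) * mu_w22))
    mu_eta3 = 1 - Ex[3] / ((1 - mu_w11) * (1 - mu_w22))
    return mu_eta1, mu_eta2, mu_eta3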

The remaining variances are determined from the following (co)variances by isolating the variable of interest

$$\begin{aligned} \begin{aligned} \widehat{V}_{\omega _{11}}&= f_{11}({{\,\mathrm{Cov}\,}}(x_1,x_4))&\qquad \widehat{V}_{\eta _3}&= f_{3}({{\,\mathrm{Var}\,}}(x_4))&\qquad \widehat{V}_{\eta _2}&= f_{2}({{\,\mathrm{Var}\,}}(x_2)) \\ \widehat{V}_{\omega _{22}}&= f_{22}({{\,\mathrm{Cov}\,}}(x_2,x_4))&\qquad \widehat{V}_{\omega _{21}}&= f_{21}({{\,\mathrm{Var}\,}}(x_3))&\qquad \widehat{V}_{\eta _1}&= f_{1}({{\,\mathrm{Var}\,}}(x_1)) \end{aligned} \end{aligned}$$

where \(f_{i}: \mathbf {R}\rightarrow \mathbf {R}\), \(i=1,2,3,11,21,22\), are functions of the variances or covariances provided in Table 7. By isolating the parameter of interest, we obtain expressions for the variance parameters under the Pyramidal model. A similar, extended method is used in the general case with a K-dimensional allele fraction vector \(x=(x_1,\dots ,x_K)\).

About this article

Cite this article

Speed, M.S., Balding, D.J. & Hobolth, A. A general framework for moment-based analysis of genetic data. J. Math. Biol. 78, 1727–1769 (2019). https://doi.org/10.1007/s00285-018-01325-0
