Abstract
Inferring population structure from genotype data requires identifying the actual subpopulations and assigning individuals to them. The source populations are assumed to be in Hardy-Weinberg equilibrium, but their allele frequencies, and even the number of populations present in a sample, are unknown. In this chapter we review some Bayesian parametric and nonparametric models for inference on population structure, with emphasis on model-based clustering methods. Our aim is to show how recent developments in Bayesian nonparametrics have been exploited to introduce natural nonparametric counterparts of some of the most celebrated parametric approaches for inferring population structure. We use data from the 1000 Genomes project (http://www.1000genomes.org/) to provide a brief illustration of some of these nonparametric approaches.
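As a rough illustration of the assignment step mentioned in the abstract, the following Python sketch computes the Hardy-Weinberg likelihood of a diploid genotype under each candidate population and normalises it into assignment probabilities. It is a simplification and not the chapter's method: it assumes the allele frequencies and the number of populations are known, whereas the models reviewed here treat both as unknown. It also assumes independent loci and a uniform prior over populations. The function names, population labels, and frequencies are hypothetical.

```python
import math


def genotype_log_likelihood(genotype, allele_freqs):
    """Log-probability of a diploid multilocus genotype under
    Hardy-Weinberg equilibrium, assuming independent loci.

    genotype: list of (allele, allele) pairs, one pair per locus.
    allele_freqs: list of dicts, one per locus, mapping allele -> frequency.
    """
    total = 0.0
    for (a, b), freqs in zip(genotype, allele_freqs):
        p = freqs[a] * freqs[b]
        if a != b:
            p *= 2.0  # heterozygote: either chromosome may carry either allele
        total += math.log(p)
    return total


def assignment_probabilities(genotype, populations):
    """Posterior probability of each source population, assuming known
    allele frequencies and a uniform prior (both are assumptions)."""
    logs = {name: genotype_log_likelihood(genotype, freqs)
            for name, freqs in populations.items()}
    top = max(logs.values())  # subtract the maximum for numerical stability
    weights = {name: math.exp(v - top) for name, v in logs.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}


# Hypothetical two-locus example with two candidate populations.
populations = {
    "pop1": [{"A": 0.9, "a": 0.1}, {"B": 0.8, "b": 0.2}],
    "pop2": [{"A": 0.2, "a": 0.8}, {"B": 0.3, "b": 0.7}],
}
individual = [("A", "A"), ("B", "b")]
posterior = assignment_probabilities(individual, populations)
```

In this toy example the individual is homozygous for an allele that is common in `pop1` and rare in `pop2`, so most of the posterior mass falls on `pop1`.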
Acknowledgements
We would like to thank Kaustubh Adhikari for kindly providing the phased data and Lloyd Elliott for developing user-friendly MATLAB functions for the linked hierarchical Dirichlet process. Stefano Favaro is supported by the European Research Council (ERC) through StG N-BNP 306406. Yee Whye Teh is supported by the European Research Council (ERC) through the European Union's Seventh Framework Programme (FP7/2007–2013) ERC grant agreement 617411.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
De Iorio, M., Favaro, S., Teh, Y.W. (2015). Bayesian Inference on Population Structure: From Parametric to Nonparametric Modeling. In: Mitra, R., Müller, P. (eds) Nonparametric Bayesian Inference in Biostatistics. Frontiers in Probability and the Statistical Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-19518-6_7
DOI: https://doi.org/10.1007/978-3-319-19518-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19517-9
Online ISBN: 978-3-319-19518-6
eBook Packages: Mathematics and Statistics (R0)