What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis?

  • Original paper
Computational Statistics

Abstract

Cross-validation using randomized subsets of data—known as k-fold cross-validation—is a powerful means of testing the success rate of models used for classification. However, few if any studies have explored how values of k (number of subsets) affect validation results in models tested with data of known statistical properties. Here, we explore conditions of sample size, model structure, and variable dependence affecting validation outcomes in discrete Bayesian networks (BNs). We created 6 variants of a BN model with known properties of variance and collinearity, along with data sets of n = 50, 500, and 5000 samples, and then tested classification success and evaluated CPU computation time with seven levels of folds (k = 2, 5, 10, 20, n − 5, n − 2, and n − 1). Classification error declined with increasing n, particularly in BN models with high multivariate dependence, and declined with increasing k, generally levelling out at k = 10, although k = 5 sufficed with large samples (n = 5000). Our work supports the common use of k = 10 in the literature, although in some cases k = 5 would suffice with BN models having independent variable structures.
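The experimental loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' models, data, or software: a naive Bayes classifier stands in for a discrete BN, the synthetic data generator and its 0.8 agreement probability are invented for the example, and all function names are hypothetical. It runs the seven fold levels from the abstract on a sample of n = 50 and reports classification error and elapsed time for each.

```python
import random
import time
from collections import Counter, defaultdict


def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[i::k] for i in range(k)]


def fit_naive_bayes(rows, labels, n_states, alpha=1.0):
    """Count class frequencies and per-feature state frequencies within each class."""
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)  # (feature index, class) -> state counts
    for x, y in zip(rows, labels):
        for j, v in enumerate(x):
            cond_counts[(j, y)][v] += 1
    return class_counts, cond_counts, n_states, alpha


def predict(model, x):
    """Return the class with the highest Laplace-smoothed posterior score."""
    class_counts, cond_counts, n_states, alpha = model
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, nc in class_counts.items():
        score = nc / total
        for j, v in enumerate(x):
            score *= (cond_counts[(j, c)][v] + alpha) / (nc + alpha * n_states)
        if score > best_score:
            best, best_score = c, score
    return best


def cv_error(rows, labels, k, n_states, seed=0):
    """Pooled misclassification rate over all k held-out folds."""
    folds = k_fold_indices(len(rows), k, random.Random(seed))
    wrong = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(rows)) if i not in held]
        model = fit_naive_bayes(
            [rows[i] for i in train], [labels[i] for i in train], n_states
        )
        wrong += sum(predict(model, rows[i]) != labels[i] for i in held_out)
    return wrong / len(rows)


def make_data(n, seed=1):
    """Hypothetical data: 3 binary features, each matching the class with p = 0.8."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        y = rng.randrange(2)
        rows.append(tuple(y if rng.random() < 0.8 else 1 - y for _ in range(3)))
        labels.append(y)
    return rows, labels


n = 50
rows, labels = make_data(n)
results = {}
for k in (2, 5, 10, 20, n - 5, n - 2, n - 1):  # fold levels from the abstract
    start = time.perf_counter()
    results[k] = cv_error(rows, labels, k, n_states=2)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"k={k:>2}  error={results[k]:.3f}  time={elapsed_ms:.1f} ms")
```

Because each fold requires refitting the model, run time grows roughly in proportion to k, which is why the trade-off between error and computation matters most for the largest fold counts.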

Notes

  1. K-fold (Anguita et al. 2012) is also referred to as V-fold (Arlot and Celisse 2010) and M-fold (Hobbs and Hooten 2015; M here is in reference to the Markov chain Monte Carlo algorithm). We use K-fold as a synonym for all terms.

References

  • Adelin AA, Zhang L (2010) A novel definition of the multivariate coefficient of variation. Biomet J 52(5):667–675

  • Aguilera PA, Fernández A, Reche F, Rumi R (2010) Hybrid Bayesian network classifiers: application to species distribution models. Environ Model Softw 25:1630–1639

  • Anguita D, Ghelardoni L, Ghio A, Oneto L, Ridella S (2012) The ‘K’ in K-fold cross validation. In: Proceedings, ESANN 2012, European symposium on artificial neural networks, computational intelligence and machine learning. Bruges (Belgium), 25–27 Apr 2012, i6doc.com publ. http://www.i6doc.com/en/livre/?GCOI=28001100967420

  • Arlot S, Celisse A (2010) A survey of cross-validation procedures for model selection. Stat Surv 4:40–79

  • Booms TL, Huettmann F, Schempf PF (2010) Gyrfalcon nest distribution in Alaska based on a predictive GIS model. Polar Biol 33:347–358

  • Brady TJ, Monleon VJ, Gray AN (2010) Calibrating vascular plant abundance for detecting future climate changes in Oregon and Washington, USA. Ecol Ind 10:657–667

  • Breiman L, Spector P (1992) Submodel selection and evaluation in regression: the X-random case. Int Stat Rev 291–319

  • Cawley GC, Talbot NLC (2007) Preventing over-fitting during model selection via Bayesian regularisation of the hyper-parameters. J Mach Learn Res 8:841–861

  • Constantinou AC, Fenton N, Marsh W, Radlinski L (2016) From complex questionnaire and interviewing data to intelligent Bayesian network models for medical decision support. Artif Intell Med 67:75–93

  • Cooke RM, Kurowicka D, Hanea AM, Morales O, Ababei DA, Ale B, Roelen A (2007) Continuous/discrete non parametric Bayesian belief nets with UNICORN and UNINET. In: Proceedings of Mathematical Methods in Reliability MMR, 1–4 July 2007, Glasgow, UK

  • Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc 39(Series B):1–38

  • Do CB, Batzoglou S (2008) What is the expectation maximization algorithm? Nat Biotechnol 26:897–899

  • Forio MAE, Landuyt D, Bennetsen E, Lock K, Nguyen THT, Ambarita MND, Musonge PLS, Boets P, Everaert G, Dominguez-Granda L, Goethals PLM (2015) Bayesian belief network models to analyse and predict ecological water quality in rivers. Ecol Model 312:222–238

  • Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Mach Learn 29:131–163

  • Geisser S (1975) The predictive sample reuse method with applications. J Amer Stat Assoc 70:320–328

  • Guyon I, Saffari A, Dror G, Cawley G (2010) Model selection: beyond the Bayesian-Frequentist divide. J Mach Learn Res 11:61–87

  • Hammond TR, Ellis JR (2002) A meta-assessment for elasmobranchs based on dietary data and Bayesian networks. Ecol Ind 1:197–211

  • Hanea AM, Nane GF (2018) The asymptotic distribution of the determinant of a random correlation matrix. Stat Neerl 72:14–33

  • Hartemink AJ (2001) Principled computational methods for the validation and discovery of genetic regulatory networks. PhD Dissertation, Massachusetts Institute of Technology, Cambridge, MA

  • Hastie T, Tibshirani R, Wainwright M (2015) Statistical learning with sparsity: the Lasso and generalizations. Monographs on statistics and applied probability 143. CRC Press, Chapman

  • Hobbs NT, Hooten MB (2015) Bayesian models: a statistical primer for ecologists. Princeton University Press, Princeton

  • Jensen FV, Nielsen TD (2007) Bayesian networks and decision graphs, 2nd edn. Springer, New York

  • Koski T, Noble J (2011) Bayesian networks: an introduction. Wiley, London

  • LaDeau SL, Han BA, Rosi-Marshall EJ, Weathers KC (2017) The next decade of big data in ecosystem science. Ecosystems 20:274–283

  • Last M (2006) The uncertainty principle of cross-validation. In: 2006 IEEE International conference on granular computing, 10–12 May 2006, pp 275–280

  • Lillegard M, Engen S, Saether BE (2005) Bootstrap methods for estimating spatial synchrony of fluctuating populations. Oikos 109:342–350

  • Marcot BG (2007) Étude de cas n°5: gestion de ressources naturelles et analyses de risques (Natural resource assessment and risk management). In: Naim P, Wuillemin P-H, Leray P, Pourret O, Becker A (eds) Réseaux Bayésiens (Bayesian networks; in French). Eyrolles, Paris, pp 293–315

  • Marcot BG (2012) Metrics for evaluating performance and uncertainty of Bayesian network models. Ecol Model 230:50–62

  • Marcot BG, Penman TD (2019) Advances in Bayesian network modelling: integration of modelling technologies. Environ Model Softw 111:386–393

  • Murphy KP (2012) Machine learning: a probabilistic perspective. The MIT Press, Cambridge

  • Pawson SM, Marcot BG, Woodberry O (2017) Predicting forest insect flight activity: a Bayesian network approach. PLoS ONE 12:e0183464

  • Pourret O, Naïm P, Marcot BG (eds) (2008) Bayesian belief networks: a practical guide to applications. Wiley, West Sussex

  • Scutari M (2010) Learning Bayesian networks with the bnlearn R package. J Stat Softw 35(3):1–22

  • Shcheglovitova M, Anderson RP (2013) Estimating optimal complexity for ecological niche models: a jackknife approach for species with small sample sizes. Ecol Model 269:9–17

  • Stow CA, Webster KE, Wagner T, Lottig N, Soranno PA, Cha Y (2018) Small values in big data: the continuing need for appropriate metadata. Eco Inform 45:26–30

  • Van Valen L (2005) The statistics of variation. In: Hallgrímsson B, Hall BK (eds) Variation. Elsevier, Amsterdam, pp 29–47

  • Zhao Y, Hasan YA (2013) Machine learning algorithms for predicting roadside fine particulate matter concentration level in Hong Kong Central. Comput Ecol Softw 3:61–73

Acknowledgements

We thank Clint Epps, Julie Heinrichs, and an anonymous reviewer for helpful comments on the manuscript. Marcot acknowledges support from U.S. Forest Service, Pacific Northwest Research Station, and University of Melbourne, Australia. Mention of commercial or other products does not necessarily imply endorsement by the U.S. Government.

Author information

Corresponding author

Correspondence to Bruce G. Marcot.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Marcot, B.G., Hanea, A.M. What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis? Comput Stat 36, 2009–2031 (2021). https://doi.org/10.1007/s00180-020-00999-9

  • DOI: https://doi.org/10.1007/s00180-020-00999-9
