Improved feature selection with simulation optimization

  • Research Article
  • Published in: Optimization and Engineering

Abstract

Non-informative or redundant features in big data can significantly degrade the performance of any machine learning model. They make model training costly and weaken model interpretability. Traditional feature selection methods, particularly wrapper methods performed with greedy search, are susceptible to suboptimal solutions, selection bias, and high variability due to noise in the data. Our simulation optimization framework seeks the best subset of features by utilizing resamples of the training and test sets, where the random holdout errors provide the simulation outputs. The resulting feature subsets are more reliable because they perform well across several resampled datasets. Our experiments on four real and simulated datasets demonstrate the competitive advantages of the fixed sampling approach on various performance metrics. Further, we develop adaptive sampling strategies for sufficiently large datasets, in which the number of training and test resamples varies with each solution. Adaptive sample sizes reach the same quality of recommended feature subsets as the fixed-sample-size version, but significantly faster.
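To make the objective concrete, the sketch below shows how the simulation output for one candidate feature subset could be computed: its mean holdout error over random train/test resamples, with either a fixed resampling budget or an adaptive rule that stops once the error estimate is precise enough. This is a minimal Python illustration under our own assumptions (a scikit-learn random forest as the learner, a generic standard-error stopping rule, and hypothetical function names), not the authors' implementation.

```python
# Minimal sketch (not the authors' code): the "simulation output" for a
# candidate feature subset is its misclassification error on a random
# train/test resample of the data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def holdout_errors(X, y, subset, n_resamples, test_size=0.3, seed=0):
    """Holdout errors of feature `subset` over random train/test splits."""
    rng = np.random.RandomState(seed)
    errors = []
    for _ in range(n_resamples):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[:, subset], y, test_size=test_size,
            random_state=rng.randint(2**31 - 1))
        model = RandomForestClassifier(n_estimators=100).fit(X_tr, y_tr)
        errors.append(np.mean(model.predict(X_te) != y_te))
    return np.array(errors)


def fixed_sample_objective(X, y, subset, n_resamples=30):
    """Fixed sampling: every candidate subset gets the same budget."""
    return holdout_errors(X, y, subset, n_resamples).mean()


def adaptive_sample_objective(X, y, subset, n_min=10, n_max=200, rel_se=0.05):
    """Adaptive sampling: keep resampling until the standard error of the
    mean error is small relative to the estimate (a generic stopping rule;
    the paper's actual rule may differ)."""
    errors = list(holdout_errors(X, y, subset, n_min))
    while len(errors) < n_max:
        se = np.std(errors, ddof=1) / np.sqrt(len(errors))
        if se <= rel_se * max(np.mean(errors), 1e-8):
            break  # estimate precise enough for this subset
        errors.extend(holdout_errors(X, y, subset, 1, seed=len(errors)))
    return np.mean(errors)
```

An outer simulation optimization search over binary inclusion vectors would then call one of these objectives for each candidate subset; the adaptive version spends fewer resamples on clearly poor subsets and more on promising ones.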

Notes

  1. There is no publicly available code for the nested partitioning method, so we compare our results with those published for nested partitioning solvers on a dataset from the UCI repository (Dua and Graff 2017).

References

  • Abramson MA, Audet C, Chrissis JW, Walston JG (2009) Mesh adaptive direct search algorithms for mixed variable optimization. Optim Lett 3(1):35

  • Almuallim H, Dietterich TG (1994) Learning Boolean concepts in the presence of many irrelevant features. Artif Intell 69(1–2):279–305

  • Audet C, Dennis JE Jr (2002) Analysis of generalized pattern searches. SIAM J Optim 13(3):889–903

  • Audet C, Dennis JE Jr (2006) Mesh adaptive direct search algorithms for constrained optimization. SIAM J Optim 17(1):188–217

  • Bareiss ER, Porter B (1987) Protos: an exemplar-based learning apprentice. In: Proceedings of the 4th international workshop on machine learning, pp 12–23

  • Bergstra J, Bardenet R, Bengio Y, Kégl B (2011) Algorithms for hyper-parameter optimization. In: Proceedings of the 24th international conference on neural information processing systems (NIPS'11). Curran Associates Inc., Red Hook, NY, pp 2546–2554

  • Billingsley P (2012) Probability and measure. Wiley, Hoboken

  • Bischl B, Lang M, Kotthoff L, Schiffner J, Richter J, Studerus E, Casalicchio G, Jones ZM (2016) mlr: machine learning in R. J Mach Learn Res 17(170):1–5

  • Bischl B, Richter J, Bossek J, Horn D, Thomas J, Lang M (2017) mlrMBO: a modular framework for model-based optimization of expensive black-box functions. arXiv preprint arXiv:1703.03373

  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  • Cano JR, Herrera F, Lozano M (2003) Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study. IEEE Trans Evol Comput 7(6):561–575

  • Cardie C (1993) Using decision trees to improve case-based learning. In: Proceedings of the tenth international conference on machine learning, pp 25–32

  • Chen Y-W, Lin C-J (2006) Combining SVMs with various feature selection strategies. In: Guyon I, Gunn S, Nikravesh M, Zadeh LA (eds) Feature extraction. Springer, Berlin, Heidelberg, pp 315–324

  • Cristianini N, Shawe-Taylor J et al (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge

  • Derrac J, García S, Herrera F (2012) A survey on evolutionary instance selection and generation. In: Yin P-Y (ed) Modeling, analysis, and applications in metaheuristic computing: advancements and trends. IGI Global, pp 233–266

  • Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Eckman DJ, Henderson SG, Shashaani S (2021) Evaluating and comparing simulation-optimization algorithms (under review)

  • Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7(1):1–26

  • Efron B, Tibshirani R (1997) Improvements on cross-validation: the 632+ bootstrap method. J Am Stat Assoc 92(438):548–560

  • Fisher A, Rudin C, Dominici F (2018) All models are wrong, but many are useful: learning a variable’s importance by studying an entire class of prediction models simultaneously. arXiv preprint arXiv:1801.01489

  • Fu MC, Hu J-Q, Chen C-H, Xiong X (2004) Optimal computing budget allocation under correlated sampling. In: Proceedings of the 2004 winter simulation conference, Washington, DC, USA, p 603

  • Geisser S (1975) The predictive sample reuse method with applications. J Am Stat Assoc 70(350):320–328

  • Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3(Mar):1157–1182

  • Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46:389–422

  • Haury A-C, Gestraud P, Vert J-P (2011) The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLOS One 6:e28210

  • Hong LJ, Nelson BL (2006) Discrete optimization via simulation using COMPASS. Oper Res 54(1):115–129

  • Hunter SR, Nelson BL (2017) Parallel ranking and selection. In: Tolk A, Fowler J, Shao G, Yücesan E (eds) Advances in modeling and simulation. Springer, Cham, pp 249–275

  • Jones DR, Schonlau M, Welch WJ (1998) Efficient global optimization of expensive black-box functions. J Global Optim 13(4):455–492

  • Jung Y (2018) Multiple predicting k-fold cross-validation for model selection. J Nonparametr Stat 30(1):197–215

  • Kennedy J, Eberhart R (1995) Particle swarm optimization. In: Proceedings of ICNN’95—international conference on neural networks, vol 4, pp 1942–1948

  • Kennedy J, Eberhart RC (1997) A discrete binary version of the particle swarm algorithm. In: 1997 IEEE international conference on systems, man, and cybernetics. Computational cybernetics and simulation, vol 5, pp 4104–4108

  • Kepplinger D, Filzmoser P, Varmuza K (2017) Variable selection with genetic algorithms using repeated cross-validation of PLS regression models as fitness measure. arXiv preprint arXiv:1711.06695

  • Kim S, Pasupathy R, Henderson SG (2015) A guide to sample average approximation. In: Fu MC (ed) Handbook of simulation optimization. Springer, pp 207–243

  • Kira K, Rendell LA (1992) A practical approach to feature selection. In: Machine learning proceedings. Elsevier, pp 249–256

  • Kleijnen JP (2009) Factor screening in simulation experiments: review of sequential bifurcation. In: Alexopoulos C, Goldsman D, Wilson JR (eds) Advancing the frontiers of simulation. Springer, pp 153–167

  • Kolda TG, Lewis RM, Torczon V (2003) Optimization by direct search: new perspectives on some classical and modern methods. SIAM Rev 45(3):385–482

  • Koumi F, Aldasht M, Tamimi H (2019) Efficient feature selection using particle swarm optimization: a hybrid filters-wrapper approach. In: 2019 10th international conference on information and communication systems (ICICS), pp 122–127

  • Kudo M, Sklansky J (2000) Comparison of algorithms that select features for pattern classifiers. Pattern Recogn 33(1):25–41

  • Kuhn M (2008) Building predictive models in R using the caret package. J Stat Softw 28(5):1–26

  • Kuhn M, Johnson K (2013) Applied predictive modeling, vol 26. Springer, Berlin

  • Le Digabel S (2011) Algorithm 909: NOMAD: nonlinear optimization with the MADS algorithm. ACM Trans Math Softw 37(4):1–15

  • Li R, Lu J, Zhang Y, Zhao T (2010) Dynamic adaboost learning with feature selection based on parallel genetic algorithm for image annotation. Knowl-Based Syst 23(3):195–201

  • Liu W, Wang J (2019) A brief survey on nature-inspired metaheuristics for feature selection in classification in this decade. In: 2019 IEEE 16th international conference on networking, sensing and control (ICNSC), pp 424–429

  • Mak W-K, Morton DP, Wood RK (1999) Monte Carlo bounding techniques for determining solution quality in stochastic programs. Oper Res Lett 24(1–2):47–56

  • Marill T, Green D (1963) On the effectiveness of receptors in recognition systems. IEEE Trans Inf Theory 9(1):11–17

  • Muni DP, Pal NR, Das J (2006) Genetic programming for simultaneous feature selection and classifier design. IEEE Trans Syst Man Cybern B (Cybern) 36(1):106–117

  • Musavi M, Ahmed W, Chan K, Faris K, Hummels D (1992) On the training of radial basis function classifiers. Neural Netw 5(4):595–603

  • Muthukrishnan R, Rohini R (2016) Lasso: a feature selection technique in predictive modeling for machine learning. In: 2016 IEEE international conference on advances in computer applications (ICACA), Coimbatore, pp 18–20

  • Nazzal D, Mollaghasemi M, Hedlund H, Bozorgi A (2012) Using genetic algorithms and an indifference-zone ranking and selection procedure under common random numbers for simulation optimisation. J Simul 6(1):56–66

  • Nelder JA, Mead R (1965) A simplex method for function minimization. Comput J 7(4):308–313

  • Nelder JA, Wedderburn RW (1972) Generalized linear models. J R Stat Soc Ser A (Gen) 135(3):370–384

  • Ni EC, Ciocan DF, Henderson SG, Hunter SR (2017) Efficient ranking and selection in parallel computing environments. Oper Res 65(3):821–836

  • Ólafsson S (2004) Two-stage nested partitions method for stochastic optimization. Methodol Comput Appl Probab 6(1):5–27

  • Ólafsson S, Yang J (2005) Intelligent partitioning for feature selection. INFORMS J Comput 17(3):339–355

  • Pei L, Nelson BL, Hunter SR (2020) Evaluation of bi-PASS for parallel simulation optimization. In: Proceedings of the 2020 winter simulation conference. IEEE, pp 2960–2971

  • Porcelli M, Toint PL (2017) BFO, a trainable derivative-free brute force optimizer for nonlinear bound-constrained optimization and equilibrium computations with continuous and discrete variables. ACM Trans Math Softw (TOMS) 44(1):6

  • Redmond MA, Baveja A (2002) A data-driven software tool for enabling cooperative information sharing among police departments. Eur J Oper Res 141:660–678

  • Sanz-Garcia A, Fernandez-Ceniceros J, Antonanzas-Torres F, Pernia-Espinoza A, de Pison FM (2015) GA-parsimony: a GA-SVR approach with feature selection and parameter optimization to obtain parsimonious solutions for predicting temperature settings in a continuous annealing furnace. Appl Soft Comput 35:13–28

  • Sapp S, van der Laan MJ, Canny J (2014) Subsemble: an ensemble method for combining subset-specific algorithm fits. J Appl Stat 41(6):1247–1259

  • Shashaani S, Hashemi FS, Pasupathy R (2018) ASTRO-DF: a class of adaptive sampling trust-region algorithms for derivative-free stochastic optimization. SIAM J Optim 28(4):3145–3176

  • Singh DAAG, Appavu S, Leavline EJ (2016) Literature review on feature selection methods for high-dimensional data. Int J Comput Appl 136(1)

  • Sinha A, Malo P, Kuosmanen T (2015) A multiobjective exploratory procedure for regression model selection. J Comput Graph Stat 24(1):154–182

  • Snoek J, Larochelle H, Adams RP (2012) Practical Bayesian optimization of machine learning algorithms. In: Pereira F, Burges CJ, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems, pp 2951–2959

  • Song E, Nelson BL, Staum J (2016) Shapley effects for global sensitivity analysis: theory and computation. SIAM/ASA J Uncertainty Quant 4(1):1060–1083

  • Song E, Nelson BL, Hong LJ (2015) Input uncertainty and indifference-zone ranking and selection. In: Winter simulation conference (WSC) 2015, pp 414–424

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol) 58(1):267–288

  • Urraca R, Sodupe-Ortega E, Antonanzas J, Antonanzas-Torres F, de Pison FM (2018) Evaluation of a novel GA-based methodology for model structure selection: The GA-parsimony. Neurocomputing 271:9–17

  • Vahdat K, Shashaani S (2020) Simulation optimization based feature selection, a study on data-driven optimization with input uncertainty. In: Proceedings of the 2020 winter simulation conference. IEEE, pp 2149–2160

  • Vahdat K, Shashaani S (2021) Non-parametric uncertainty bias and variance estimation via nested bootstrapping and influence functions. In: Kim S, Feng B, Masoud S, Zheng Z, Loper M (eds) Proceedings of the 2021 winter simulation conference. Institute of Electrical and Electronics Engineers, Inc, Savannah

  • van der Laan MJ, Polley EC, Hubbard AE (2007) Super learner. Stat Appl Genet Mol Biol 6(1), Article 25, pp 1–23. Walter de Gruyter, Berlin/Boston

  • Vasquez D, Shashaani S, Pasupathy R (2021) The complexity of adaptive sampling trust-region methods for nonconvex stochastic optimization. Working paper

  • Wang H, Pasupathy R, Schmeiser BW (2013) Integer-ordered simulation optimization using R-SPLINE: retrospective search with piecewise-linear interpolation and neighborhood enumeration. ACM Trans Model Comput Simul (TOMACS) 23(3):17

  • Whitney AW (1971) A direct method of nonparametric measurement selection. IEEE Trans Comput 100(9):1100–1103

  • Xu J, Nelson BL, Hong LJ (2013) An adaptive hyperbox algorithm for high-dimensional discrete optimization via simulation problems. INFORMS J Comput 25(1):133–146

  • Xue B, Zhang M, Browne WN (2012) Particle swarm optimization for feature selection in classification: a multi-objective approach. IEEE Trans Cybern 43(6):1656–1671

  • Xue B, Zhang M, Browne WN, Yao X (2016) A survey on evolutionary computation approaches to feature selection. IEEE Trans Evol Comput 20(4):606–626

  • Yang J, Ólafsson S (2006) Optimization-based feature selection with adaptive instance sampling. Comput Oper Res 33(11):3088–3106

  • Yusta SC (2009) Different metaheuristic strategies to solve the feature selection problem. Pattern Recogn Lett 30(5):525–534

  • Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading, MA

  • Zeng X, Chen Y-W, Tao C, van Alphen D (2009) Feature selection using recursive feature elimination for handwritten digit recognition. In: 2009 fifth international conference on intelligent information hiding and multimedia signal processing, Kyoto, pp 1205–1208

  • Zhou Q, Zhou H, Zhou Q, Yang F, Luo L (2014) Structure damage detection based on random forest recursive feature elimination. Mech Syst Signal Process 46:82–90

Acknowledgements

The authors would like to thank Seth Guikema for the original discussions, Reha Uzsoy and Anton Panchishin for helpful comments, and the reviewer for excellent questions that helped us improve this work.

Author information

Corresponding author

Correspondence to Sara Shashaani.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Shashaani, S., Vahdat, K. Improved feature selection with simulation optimization. Optim Eng 24, 1183–1223 (2023). https://doi.org/10.1007/s11081-022-09726-3
