Skip to main content

Feature Selection in High-Dimensional Data

  • Chapter
  • First Online:
Optimization, Learning, and Control for Interdependent Complex Networks

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 1123))

Abstract

Today, with the increase of data dimensions, many challenges are faced in many contexts including machine learning, informatics, and medicine. However, reducing data dimension can be considered as a basic method in handling high-dimensional data, because by reducing dimensions, applying many of the existing operations on data is facilitated.

Microarray data are derived from tissues and cells considering differences in the gene, which can be useful for diagnosing disease and tumors. Due to the large number of features (genes) and small number of samples in microarray datasets, selecting the most salient genes is a difficult task. Among the many techniques of machine learning, feature selection and data classification play a very important and widespread role in enhancing human life, from detecting voice emotion to detecting illness in the body. In medicine, an effective gene selection can greatly enhance the process of prediction and diagnosis of cancer. After selecting effective genes, the duty of a specific classifier is usually to discriminate healthy people from patients that are suffering from cancer based on their expression of the selected genes.

A vast body of feature selection methods has been proposed for high-dimensional microarray data. Traditionally, these methods fall into three categories including filter, wrapper, and hybrid approaches. Furthermore, new techniques such as ensemble methods have recently been developed to improve the process of feature selection and classification.

This chapter presents an overview of the most popular feature selection methods to deal with high-dimensional data and analyze their performance under different conditions. The chapter starts with a global overview of the high-dimensional data and feature selection (Sects. 5.2 and 5.3). Then, in Sect. 5.4 we review the state-of-the-art methods on filter algorithms. In the next three Sects. (5.5, 5.6 and 5.7) we describe the wrapper, hybrid, and embedded methods and in each section, an overview of several works performed on these methods is discussed. Sect. 5.8 describes the ensemble techniques recently considered by the researchers and summarizes the works done based on these techniques. In Sect. 5.9, we present the experimental results of the most significant methods on high-dimensional data. Finally, Sect. 5.10 summarizes this chapter.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 54.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. C.E. Crangle, R. Wang, M. Perreau-Guimaraes, M.U. Nguyen, D.T. Nguyen, P. Suppes, Machine learning for the recognition of emotion in the speech of couples in psychotherapy using the Stanford Suppes Brain Lab Psychotherapy Dataset. arXiv preprint arXiv:1901.04110 (2019)

    Google Scholar 

  2. A. Rouhi, M. Spitale, F. Catania, G. Cosentino, M. Gelsomini, F. Garzotto, Emotify: emotional game for children with autism spectrum disorder based-on machine learning, in Proceedings of the 24th International Conference on Intelligent User Interfaces: Companion (ACM, New York, 2019), pp. 31–32

    Google Scholar 

  3. U. Shruthi, V. Nagaveni, B. Raghavendra, A review on machine learning classification techniques for plant disease detection, in 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS), (IEEE, Piscataway, 2019), pp. 281–284

    Google Scholar 

  4. R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification (Wiley, Hoboken, 2012)

    MATH  Google Scholar 

  5. M. Fernandes, A. Canito, V. Bolón-Canedo, L. Conceição, I. Praça, G. Marreiros, Data analysis and feature selection for predictive maintenance: A case-study in the metallurgic industry. Int. J. Inf. Manag. 46, 252–262 (2019)

    Article  Google Scholar 

  6. H. Liu, H. Motoda, Feature Selection for Knowledge Discovery and Data Mining (Springer, Berlin, 2012)

    MATH  Google Scholar 

  7. H. Handels, T. Roß, J. Kreusch, H.H. Wolff, S.J. Poeppl, Feature selection for optimized skin tumor recognition using genetic algorithms. Artif. Intell. Med. 16(3), 283–297 (1999)

    Article  Google Scholar 

  8. B. Nikpour, H. Nezamabadi-pour, HTSS: a hyper-heuristic training set selection method for imbalanced data sets. Iran J. Comput. Sci. 1(2), 109–128 (2018)

    Article  Google Scholar 

  9. K. Borowska, J. Stepaniuk, A rough–granular approach to the imbalanced data classification problem. Appl. Soft Comput. 83, 105607 (2019)

    Article  Google Scholar 

  10. A. Reyes-Nava, H. Cruz-Reyes, R. Alejo, E. Rendón-Lara, A. Flores-Fuentes, and E. Granda-Gutiérrez, Using deep learning to classify class imbalanced gene-expression microarrays datasets, in Iberoamerican Congress on Pattern Recognition (Springer, Berlin, 2018), pp. 46–54

    Chapter  Google Scholar 

  11. P.B. andLuis Torgo, R. Ribeiro, A survey of predictive modeling under imbalanced distributions. ACM Comput. Surv. 49(2), 1–31 (2016)

    Google Scholar 

  12. H. He, E.A. Garcia, Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 9, 1263–1284 (2008)

    Google Scholar 

  13. J. Błaszczyński, J. Stefanowski, Improving bagging ensembles for class imbalanced data by active learning, in Advances in Feature Selection for Data and Pattern Recognition, (Springer, Berlin, 2018), pp. 25–52

    Chapter  Google Scholar 

  14. R.J. Hickey, Noise modelling and evaluating learning from examples. Artif. Intell. 82(1–2), 157–179 (1996)

    Article  MathSciNet  Google Scholar 

  15. Z. Sun, Q. Song, X. Zhu, H. Sun, B. Xu, Y. Zhou, A novel ensemble method for classifying imbalanced data. Pattern Recogn. 48(5), 1623–1637 (2015)

    Article  Google Scholar 

  16. C.E. Brodley, M.A. Friedl, Identifying mislabeled training data. J. Artif. Intell. Res. 11, 131–167 (1999)

    Article  MATH  Google Scholar 

  17. B. Frénay, A. Kabán, A comprehensive introduction to label noise, in ESANN (2014)

    Google Scholar 

  18. F. Barani, M. Mirhosseini, H. Nezamabadi-Pour, Application of binary quantum-inspired gravitational search algorithm in feature subset selection. Appl. Intell. 47(2), 304–318 (2017)

    Article  Google Scholar 

  19. A.P. Dawid, A.M. Skene, Maximum likelihood estimation of observer error-rates using the EM algorithm. J. R. Stat. Soc. Ser. C Appl. Stat. 28(1), 20–28 (1979)

    Google Scholar 

  20. T.R. Golub et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439), 531–537 (1999)

    Article  Google Scholar 

  21. I. Kamkar, S.K. Gupta, D. Phung, S. Venkatesh, Stable feature selection for clinical prediction: exploiting ICD tree structure using tree-lasso. J. Biomed. Inform. 53, 277–290 (2015)

    Article  Google Scholar 

  22. A. Rouhi and H. Nezamabadi-Pour, A hybrid feature selection approach based on ensemble method for high-dimensional data, in 2017 2nd Conference on Swarm Intelligence and Evolutionary Computation (CSIEC) (IEEE, Piscataway, 2017), pp. 16–20

    Google Scholar 

  23. S. Tabakhi, A. Najafi, R. Ranjbar, P. Moradi, Gene selection for microarray data classification using a novel ant colony optimization. Neurocomputing 168, 1024–1036 (2015)

    Article  Google Scholar 

  24. M.K. Ebrahimpour, H. Nezamabadi-Pour, M. Eftekhari, CCFS: a cooperating coevolution technique for large scale feature selection on microarray datasets. Comput. Biol. Chem. 73, 171–178 (2018)

    Article  Google Scholar 

  25. A. Rouhi and H. Nezamabadi-Pour, Filter-based feature selection for microarray data using improved binary gravitational search algorithm, in 2018 3rd Conference on Swarm Intelligence and Evolutionary Computation (CSIEC) (IEEE, Piscataway, 2018), pp. 1–6

    Google Scholar 

  26. J.R. Quinlan, Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)

    Google Scholar 

  27. Y.-W. Chen, C.-J. Lin, Combining SVMs with various feature selection strategies, in Feature Extraction, (Springer, Berlin, 2006), pp. 315–324

    Chapter  Google Scholar 

  28. Q. Gu, Z. Li, J. Han, Generalized fisher score for feature selection. arXiv preprint arXiv:1202.3725, 2012

    Google Scholar 

  29. I. Kononenko, Estimating attributes: analysis and extensions of RELIEF, in European Conference on Machine Learning (Springer, Berlin, 1994), pp. 171–182

    Chapter  Google Scholar 

  30. L. Yu, H. Liu, Feature selection for high-dimensional data: a fast correlation-based filter solution, in Proceedings of the 20th International Conference on Machine Learning (ICML-03) (2003), pp. 856–863

    Google Scholar 

  31. H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 8, 1226–1238 (2005)

    Article  Google Scholar 

  32. M. A. Hall, Correlation-based feature selection for machine learning (1999)

    Google Scholar 

  33. J. Li et al., Feature selection: a data perspective. ACM Comput. Sur. (CSUR) 50(6), 94 (2018)

    Google Scholar 

  34. A. Rouhi and H. Nezamabadi-Pour, A hybrid method for dimensionality reduction in microarray data based on advanced binary ant colony algorithm, in 2016 1st Conference on Swarm Intelligence and Evolutionary Computation (CSIEC) (IEEE, Piscataway, 2016), pp. 70–75

    Google Scholar 

  35. N. Taheri, H. Nezamabadi-Pour, A hybrid feature selection method for high-dimensional data, in 2014 4th International Conference on Computer and Knowledge Engineering (ICCKE) (IEEE, Piscataway, 2014), pp. 141–145

    Google Scholar 

  36. X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, in Advances in Neural Information Processing Systems, (ACM, New York, 2006), pp. 507–514

    Google Scholar 

  37. M.A. Hall, L.A. Smith, Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper, in FLAIRS Conference, vol. 1999 (1999), pp. 235–239

    Google Scholar 

  38. W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical recipes in C++. Art Sci. Comput. 2, 1002 (1992)

    MATH  Google Scholar 

  39. J.C. Davis, R.J. Sampson, Statistics and Data Analysis in Geology (Wiley, New York, 1986)

    Google Scholar 

  40. H. Lee et al., Feature selection practice for unsupervised learning of credit card fraud detection. J. Theor. Appl. Inf. Technol. 96(2), 408–417 (2018)

    Google Scholar 

  41. Y. Saeys, I. Inza, P. Larrañaga, A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007)

    Article  Google Scholar 

  42. A. Rouhi, H. Nezamabadi-pour, A hybrid-ensemble based framework for microarray data gene selection. Int. J. Data Min. Bioinform. 19(3), 221–242 (2017)

    Article  Google Scholar 

  43. S. Kashef, H. Nezamabadi-pour, B. Nikpour, Multilabel feature selection: a comprehensive review and guiding experiments. Wiley Interdiscip. Rev. Data Min. Knowl. Disc. 8(2), e1240 (2018)

    Article  Google Scholar 

  44. M. Dowlatshahi, V. Derhami, H. Nezamabadi-Pour, Ensemble of filter-based rankers to guide an epsilon-greedy swarm optimizer for high-dimensional feature subset selection. Information 8(4), 152 (2017)

    Article  Google Scholar 

  45. M. Dorigo, G. di Caro, Ant colony optimization: a new meta-heuristic, in Proceedings of the 1999 Congress on Evolutionary Computation-CEC99 (Cat. No. 99TH8406), vol. 2 (IEEE, Piscataway, 1999), pp. 1470–1477

    Google Scholar 

  46. S. Kashef, H. Nezamabadi-pour, An advanced ACO algorithm for feature subset selection. Neurocomputing 147, 271–279 (2015)

    Article  Google Scholar 

  47. J. Kennedy, Particle swarm optimization. Enc. Mach. Learn., 760–766 (2010)

    Google Scholar 

  48. E. Rashedi, H. Nezamabadi-Pour, S. Saryazdi, GSA: a gravitational search algorithm. Inf. Sci. 179(13), 2232–2248 (2009)

    Article  MATH  Google Scholar 

  49. A. Mahanipour, H. Nezamabadi-Pour, A multiple feature construction method based on gravitational search algorithm. Expert Syst. Appl. 127, 199–209 (2019)

    Article  Google Scholar 

  50. E. Rashedi, H. Nezamabadi-Pour, S. Saryazdi, BGSA: binary gravitational search algorithm. Nat. Comput. 9(3), 727–745 (2010)

    Article  MathSciNet  MATH  Google Scholar 

  51. E. Rashedi, H. Nezamabadi-pour, Feature subset selection using improved binary gravitational search algorithm. J. Intell. Fuzzy Syst. 26(3), 1211–1221 (2014)

    Article  Google Scholar 

  52. A. Rouhi, P.H. Nezamabadi, A Hybrid-Based Feature Selection Method for High-Dimensional Data Using Ensemble Methods (2018)

    Google Scholar 

  53. V. Bolón-Canedo, N. Sánchez-Marono, A. Alonso-Betanzos, J.M. Benítez, F. Herrera, A review of microarray datasets and applied feature selection methods. Inf. Sci. 282, 111–135 (2014)

    Article  Google Scholar 

  54. P.A. Mundra, J.C. Rajapakse, SVM-RFE with MRMR filter for gene selection. IEEE Trans. Nanobioscience 9(1), 31–37 (2009)

    Article  Google Scholar 

  55. H. Uğuz, A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Knowl.-Based Syst. 24(7), 1024–1032 (2011)

    Article  Google Scholar 

  56. L.-Y. Chuang, C.-H. Yang, K.-C. Wu, C.-H. Yang, A hybrid feature selection method for DNA microarray data. Comput. Biol. Med. 41(4), 228–237 (2011)

    Article  Google Scholar 

  57. C.-P. Lee, Y. Leu, A novel hybrid feature selection method for microarray data analysis. Appl. Soft Comput. 11(1), 208–213 (2011)

    Article  Google Scholar 

  58. S.S. Shreem, S. Abdullah, M.Z.A. Nazri, M. Alzaqebah, Hybridizing ReliefF, MRMR filters and GA wrapper approaches for gene selection. J. Theor. Appl. Inf. Technol. 46(2), 1034–1039 (2012)

    Google Scholar 

  59. J. Apolloni, G. Leguizamón, E. Alba, Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments. Appl. Soft Comput. 38, 922–932 (2016)

    Article  Google Scholar 

  60. B. Venkatesh, J. Anuradha, A hybrid feature selection approach for handling a high-dimensional data, in Innovations in Computer Science and Engineering, (Springer, Berlin, 2019), pp. 365–373

    Chapter  Google Scholar 

  61. Z. Manbari, F. AkhlaghianTab, C. Salavati, Hybrid fast unsupervised feature selection for high-dimensional data. Expert Syst. Appl. 124, 97–118 (2019)

    Article  Google Scholar 

  62. C. Yan, J. Liang, M. Zhao, X. Zhang, T. Zhang, H. Li, A novel hybrid feature selection strategy in quantitative analysis of laser-induced breakdown spectroscopy. Anal. Chim. Acta 1080, 35–42 (2019)

    Article  Google Scholar 

  63. T. Gangavarapu, N. Patil, A novel filter-wrapper hybrid greedy ensemble approach optimized using the genetic algorithm to reduce the dimensionality of high-dimensional biomedical datasets. Appl. Soft Comput. 81, 105538 (2019)

    Article  Google Scholar 

  64. L. Sun, X. Kong, J. Xu, R. Zhai, S. Zhang, A hybrid gene selection method based on ReliefF and ant colony optimization algorithm for tumor classification. Sci. Rep. 9(1), 8978 (2019)

    Article  Google Scholar 

  65. W. You, Z. Yang, G. Ji, PLS-based recursive feature elimination for high-dimensional small sample. Knowl.-Based Syst. 55, 15–28 (2014)

    Article  Google Scholar 

  66. T. Prasartvit, A. Banharnsakun, B. Kaewkamnerdpong, T. Achalakul, Reducing bioinformatics data dimension with ABC-kNN. Neurocomputing 116, 367–381 (2013)

    Article  Google Scholar 

  67. I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines. Mach. Learn. 46(1–3), 389–422 (2002)

    Article  MATH  Google Scholar 

  68. S. Maldonado, R. Weber, J. Basak, Simultaneous feature selection and classification using kernel-penalized support vector machines. Inf. Sci. 181(1), 115–128 (2011)

    Article  Google Scholar 

  69. J. Canul-Reich, L.O. Hall, D.B. Goldgof, J.N. Korecki, S. Eschrich, Iterative feature perturbation as a gene selector for microarray data. Int. J. Pattern Recognit. Artif. Intell. 26(05), 1260003 (2012)

    Article  MathSciNet  Google Scholar 

  70. S. Maldonado, J. López, Dealing with high-dimensional class-imbalanced datasets: embedded feature selection for SVM classification. Appl. Soft Comput. 67, 94–105 (2018)

    Article  Google Scholar 

  71. H. Liu, M. Zhou, Q. Liu, An embedded feature selection method for imbalanced data classification. IEEE/CAA J. Autom. Sin. 6(3), 703–715 (2019)

    Article  Google Scholar 

  72. C. Peng, X. Wu, W. Yuan, X. Zhang, Y. Li, MGRFE: multilayer recursive feature elimination based on an embedded genetic algorithm for cancer classification. IEEE/ACM Trans. Comput. Biol. Bioinform. (2019). https://doi.org/10.1109/TCBB.2019.2921961

  73. A.B. Brahim, M. Limam, Robust ensemble feature selection for high dimensional data sets, in 2013 International Conference on High Performance Computing & Simulation (HPCS) (IEEE, Piscataway, 2013), pp. 151–157

    Google Scholar 

  74. V. Bolón-Canedo, N. Sánchez-Marono, A. Alonso-Betanzos, Data classification using an ensemble of filters. Neurocomputing 135, 13–20 (2014)

    Article  Google Scholar 

  75. F. Yang, K. Mao, Robust feature selection for microarray data based on multicriterion fusion. IEEE/ACM Trans. Comput. Biol. Bioinform. 8(4), 1080–1092 (2010)

    Article  Google Scholar 

  76. V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos, An ensemble of filters and classifiers for microarray data classification. Pattern Recogn. 45(1), 531–539 (2012)

    Article  Google Scholar 

  77. S. Sayed, M. Nassef, A. Badr, I. Farag, A nested genetic algorithm for feature selection in high-dimensional cancer microarray datasets. Expert Syst. Appl. 121, 233–243 (2019)

    Article  Google Scholar 

  78. B. Pes, Ensemble feature selection for high-dimensional data: a stability analysis across multiple domains. Neural Comput. Applic., 1–23 (2019)

    Google Scholar 

  79. B. Singh, K. Kumar, S. Mohan, R. Ahmad, Ensemble of clustering approaches for feature selection of high dimensional data. Available at SSRN 3349018 (2019)

    Google Scholar 

  80. J. Wang, J. Xu, C. Zhao, Y. Peng, H. Wang, An ensemble feature selection method for high-dimensional data based on sort aggregation. Syst. Sci. Control Eng. 7(2), 32–39 (2019)

    Article  Google Scholar 

  81. X. Song, L.R. Waitman, Y. Hu, A.S. Yu, D. Robins, M. Liu, Robust clinical marker identification for diabetic kidney disease with ensemble feature selection. J. Am. Med. Inform. Assoc. 26(3), 242–253 (2019)

    Article  Google Scholar 

  82. V.P. Singh, D.J. Kalita, S. Tripathi, Classifying gene expression data of cancer using multistage ensemble of neural networks. Available at SSRN 3349578 (2019)

    Google Scholar 

  83. Feature Selection at Arizona State University. http://featureselection.asu.edu/datasets.php

  84. B. Institute. Cancer Program Data Sets. http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Amirreza Rouhi .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Check for updates. Verify currency and authenticity via CrossMark

Cite this chapter

Rouhi, A., Nezamabadi-Pour, H. (2020). Feature Selection in High-Dimensional Data. In: Amini, M. (eds) Optimization, Learning, and Control for Interdependent Complex Networks. Advances in Intelligent Systems and Computing, vol 1123. Springer, Cham. https://doi.org/10.1007/978-3-030-34094-0_5

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-34094-0_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-34093-3

  • Online ISBN: 978-3-030-34094-0

  • eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics