Feature Selection in High-Dimensional Data

Rouhi, Amirreza; Nezamabadi-Pour, Hossein

doi:10.1007/978-3-030-34094-0_5

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1123)

Abstract

As data dimensionality grows, many fields, including machine learning, informatics, and medicine, face new challenges. Dimensionality reduction is a basic approach to handling high-dimensional data, because many existing operations become easier to apply once the number of dimensions is reduced.

Microarray data are derived from tissues and cells and reflect differences in gene expression, which can be useful for diagnosing diseases and tumors. Because microarray datasets contain a large number of features (genes) but only a small number of samples, selecting the most salient genes is a difficult task. Among machine learning techniques, feature selection and data classification play an important and widespread role in improving human life, from detecting emotion in speech to detecting illness in the body. In medicine, effective gene selection can greatly improve the prediction and diagnosis of cancer. Once the effective genes are selected, a classifier is typically used to discriminate healthy people from cancer patients based on the expression of those genes.
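The select-then-classify pipeline described above can be illustrated with a minimal sketch. It is not the chapter's method: the Fisher-style score, the nearest-centroid classifier, the function names, and the toy expression matrix are all assumptions chosen only to keep the example short and dependency-free.

```python
from statistics import mean, pvariance

def fisher_score(values, labels):
    """Two-class Fisher-style score: squared mean gap over summed class variances."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    gap = (mean(pos) - mean(neg)) ** 2
    spread = pvariance(pos) + pvariance(neg)
    return gap / (spread + 1e-12)  # small constant avoids division by zero

def select_top_genes(X, y, k):
    """Score every gene (column of X) and return the indices of the k best."""
    n_genes = len(X[0])
    scores = [fisher_score([row[j] for row in X], y) for j in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

def nearest_centroid_predict(X_train, y_train, x, genes):
    """Assign x to the class whose mean profile is closest on the selected genes."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, t in zip(X_train, y_train) if t == label]
        centroid = [mean(r[j] for r in rows) for j in genes]
        dist = sum((x[j] - c) ** 2 for j, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical toy data: 6 samples x 4 genes; only gene 2 separates the classes.
X = [
    [5.1, 0.2, 1.0, 3.3],
    [4.9, 0.8, 1.2, 3.1],
    [5.0, 0.5, 0.9, 3.4],
    [5.2, 0.3, 4.1, 3.2],
    [4.8, 0.7, 3.9, 3.3],
    [5.0, 0.6, 4.0, 3.1],
]
y = [0, 0, 0, 1, 1, 1]  # 0 = healthy, 1 = patient (illustrative labels)

selected = select_top_genes(X, y, k=1)
prediction = nearest_centroid_predict(X, y, [5.0, 0.4, 3.8, 3.2], selected)
```

On real microarray data the matrix would have thousands of gene columns and only tens of samples, so the scoring step would normally be evaluated inside cross-validation rather than on the full dataset as done here.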

A vast body of feature selection methods has been proposed for high-dimensional microarray data. Traditionally, these methods fall into three categories: filter, wrapper, and hybrid approaches. More recently, techniques such as ensemble methods have been developed to improve feature selection and classification.
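To make the filter, wrapper, and hybrid categories concrete, the following is a rough sketch under stated assumptions. The filter ranks features by a classifier-independent statistic, the wrapper scores candidate subsets by the accuracy of a classifier, and the hybrid uses the filter to shrink the search space before running the wrapper. The mean-gap criterion, the 1-nearest-neighbour classifier, the greedy forward search, and all names and data are illustrative choices, not taken from the chapter.

```python
from statistics import mean

def filter_rank(X, y):
    """Filter stage: rank features by the absolute gap between class means."""
    def gap(j):
        pos = [r[j] for r, t in zip(X, y) if t == 1]
        neg = [r[j] for r, t in zip(X, y) if t == 0]
        return abs(mean(pos) - mean(neg))
    return sorted(range(len(X[0])), key=gap, reverse=True)

def loo_accuracy(X, y, features):
    """Wrapper criterion: leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    hits = 0
    for i, (xi, yi) in enumerate(zip(X, y)):
        nearest = min(
            (k for k in range(len(X)) if k != i),
            key=lambda k: sum((xi[j] - X[k][j]) ** 2 for j in features),
        )
        hits += y[nearest] == yi
    return hits / len(X)

def hybrid_select(X, y, n_candidates):
    """Hybrid: the filter keeps n_candidates features, then a greedy forward wrapper runs."""
    candidates = filter_rank(X, y)[:n_candidates]
    chosen, best = [], 0.0
    improved = True
    while improved:
        improved = False
        for j in candidates:
            if j in chosen:
                continue
            acc = loo_accuracy(X, y, chosen + [j])
            if acc > best:  # keep only strict improvements
                best, pick, improved = acc, j, True
        if improved:
            chosen.append(pick)
    return chosen, best

# Hypothetical toy data: 6 samples x 4 features; feature 2 carries the class signal.
X = [
    [5.1, 0.2, 1.0, 3.3],
    [4.9, 0.8, 1.2, 3.1],
    [5.0, 0.5, 0.9, 3.4],
    [5.2, 0.3, 4.1, 3.2],
    [4.8, 0.7, 3.9, 3.3],
    [5.0, 0.6, 4.0, 3.1],
]
y = [0, 0, 0, 1, 1, 1]

chosen, accuracy = hybrid_select(X, y, n_candidates=2)
```

The design point is cost: the wrapper retrains or re-evaluates a classifier for every candidate subset, so restricting it to the filter's shortlist keeps the search tractable when there are thousands of features.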

This chapter presents an overview of the most popular feature selection methods for high-dimensional data and analyzes their performance under different conditions. The chapter starts with a global overview of high-dimensional data and feature selection (Sects. 5.2 and 5.3). Sect. 5.4 then reviews state-of-the-art filter algorithms. The next three sections (Sects. 5.5, 5.6, and 5.7) describe the wrapper, hybrid, and embedded methods, each with an overview of several works based on them. Sect. 5.8 describes the ensemble techniques that researchers have recently considered and summarizes the works based on them. Sect. 5.9 presents experimental results of the most significant methods on high-dimensional data. Finally, Sect. 5.10 summarizes the chapter.

Author information

Authors and Affiliations

Data Science and Bioinformatics Laboratory, Politecnico di Milano, Department of Electronics, Information and Bioengineering, Milan, Italy
Amirreza Rouhi
Intelligent Data Processing Laboratory (IDPL), Department of Electrical Engineering, Shahid Bahonar University of Kerman, Kerman, Iran
Amirreza Rouhi & Hossein Nezamabadi-Pour

Corresponding author

Correspondence to Amirreza Rouhi.

Editor information

Editors and Affiliations

School of Computing and Information Sciences, Florida International University, Miami, FL, USA; Sustainability, Optimization, and Learning for InterDependent Networks Laboratory (solid lab), Florida International University, Miami, FL, USA
M. Hadi Amini

About this chapter

Cite this chapter

Rouhi, A., Nezamabadi-Pour, H. (2020). Feature Selection in High-Dimensional Data. In: Amini, M. (eds) Optimization, Learning, and Control for Interdependent Complex Networks. Advances in Intelligent Systems and Computing, vol 1123. Springer, Cham. https://doi.org/10.1007/978-3-030-34094-0_5

DOI: https://doi.org/10.1007/978-3-030-34094-0_5
Published: 23 February 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34093-3
Online ISBN: 978-3-030-34094-0
eBook Packages: Engineering, Engineering (R0)
