Abstract
Feature selection in binary datasets is an important task in many real-world machine learning applications such as document classification, genomic data analysis, and image recognition. Although many algorithms are available, selecting features that distinguish all classes from one another in a multiclass binary dataset remains a challenge. Furthermore, many existing feature selection methods incur unnecessary computational costs on binary data because they are not specifically designed for it. We show that, by exploiting the symmetry and feature value imbalance of binary datasets, it is possible to develop more efficient feature selection measures that better distinguish the classes in multiclass binary datasets. Using these measures, we propose a greedy feature selection algorithm, CovSkew, for multiclass binary data. We show that CovSkew achieves large accuracy gains over baseline methods, up to \(\sim\)40%, especially when the selected feature subset is small. We also show that CovSkew has a low computational cost compared with most of the baselines.
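To make the idea of greedily selecting features that separate every pair of classes concrete, the following is a minimal sketch in Python. It is not CovSkew: the abstract does not define the paper's measures, so the separation criterion used here (the fraction of ones a feature takes in two classes differs by at least `threshold`) and the names `class_rates`, `greedy_pairwise_select`, and `threshold` are assumptions made purely for illustration.

```python
from itertools import combinations


def class_rates(X, y):
    """Return {class: [fraction of ones per feature]} for binary rows X."""
    rates = {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        rates[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return rates


def greedy_pairwise_select(X, y, k, threshold=0.5):
    """Greedily pick up to k features so that each class pair is separated.

    Assumed criterion (illustrative only, not the paper's measure): feature j
    separates classes a and b when |rate_a[j] - rate_b[j]| >= threshold.
    """
    rates = class_rates(X, y)
    n_features = len(X[0])
    pairs = list(combinations(sorted(rates), 2))
    # covers[j] is the set of class pairs that feature j separates.
    covers = [
        {(a, b) for a, b in pairs if abs(rates[a][j] - rates[b][j]) >= threshold}
        for j in range(n_features)
    ]
    uncovered = set(pairs)
    selected = []
    while len(selected) < k and uncovered:
        # Pick the feature separating the most still-unseparated class pairs.
        best = max(range(n_features), key=lambda j: len(covers[j] & uncovered))
        if not covers[best] & uncovered:
            break  # no remaining feature separates any uncovered pair
        selected.append(best)
        uncovered -= covers[best]
    return selected


# Toy data: three classes, four binary features (hypothetical values).
X = [
    [1, 0, 0, 1], [1, 0, 0, 1],  # class 0
    [0, 1, 0, 1], [0, 1, 0, 1],  # class 1
    [0, 0, 0, 1], [0, 0, 0, 1],  # class 2
]
y = [0, 0, 1, 1, 2, 2]
selected = greedy_pairwise_select(X, y, k=2)
```

On this toy data, features 0 and 1 together separate all three class pairs, while features 2 and 3 are constant across classes and are never chosen. A pairwise-coverage objective of this kind is one simple way to favour small feature subsets that still distinguish every class from every other.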
Acknowledgements
This work is supported by the Australian Government under the Australian Postgraduate Award.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Perera, K., Chan, J., Karunasekera, S. (2018). Feature Selection for Multiclass Binary Data. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10939. Springer, Cham. https://doi.org/10.1007/978-3-319-93040-4_5
DOI: https://doi.org/10.1007/978-3-319-93040-4_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93039-8
Online ISBN: 978-3-319-93040-4
eBook Packages: Computer Science, Computer Science (R0)