Feature Selection for Multiclass Binary Data

Perera, Kushani; Chan, Jeffrey; Karunasekera, Shanika

doi:10.1007/978-3-319-93040-4_5

Kushani Perera, Jeffrey Chan & Shanika Karunasekera

Conference paper
First Online: 17 June 2018

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10939)

Abstract

Feature selection in binary datasets is an important task in many real-world machine learning applications such as document classification, genomic data analysis, and image recognition. Despite the many algorithms available, selecting features that distinguish all classes from one another in a multiclass binary dataset remains a challenge. Furthermore, many existing feature selection methods incur unnecessary computation costs on binary data, as they are not specifically designed for it. We show that, by exploiting the symmetry and feature value imbalance of binary datasets, more efficient feature selection measures that better distinguish the classes in multiclass binary datasets can be developed. Using these measures, we propose a greedy feature selection algorithm, CovSkew, for multiclass binary data. We show that CovSkew achieves a high accuracy gain over baseline methods, up to \(\sim\)40%, especially when the selected feature subset is small. We also show that CovSkew has low computational costs compared with most of the baselines.

Notes

1. https://sites.google.com/view/kushani/publications

Acknowledgements

This work is supported by the Australian Government under the Australian Postgraduate Award.

Author information

Authors and Affiliations

University of Melbourne, Melbourne, VIC, 3010, Australia
Kushani Perera & Shanika Karunasekera
RMIT University, Melbourne, VIC, 3000, Australia
Jeffrey Chan

Authors

Kushani Perera
Jeffrey Chan
Shanika Karunasekera

Corresponding author

Correspondence to Kushani Perera.

Editor information

Editors and Affiliations

Deakin University, Geelong, Victoria, Australia
Dinh Phung
National Chiao Tung University, Hsinchu City, Taiwan
Vincent S. Tseng
Monash University, Clayton, Victoria, Australia
Geoffrey I. Webb
Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
Bao Ho
University of Melbourne, Melbourne, Victoria, Australia
Mohadeseh Ganji
University of Melbourne, Melbourne, Victoria, Australia
Lida Rashidi

About this paper

Cite this paper

Perera, K., Chan, J., Karunasekera, S. (2018). Feature Selection for Multiclass Binary Data. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10939. Springer, Cham. https://doi.org/10.1007/978-3-319-93040-4_5

DOI: https://doi.org/10.1007/978-3-319-93040-4_5
Published: 17 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93039-8
Online ISBN: 978-3-319-93040-4
eBook Packages: Computer Science, Computer Science (R0)
