Feature Selection for Multiclass Binary Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10939)

Abstract

Feature selection in binary datasets is an important task in many real-world machine learning applications such as document classification, genomic data analysis, and image recognition. Despite the many algorithms available, selecting features that distinguish all classes from one another in a multiclass binary dataset remains a challenge. Furthermore, many existing feature selection methods are not designed specifically for binary data and therefore incur unnecessary computational costs on it. We show that, by exploiting the symmetry and feature value imbalance of binary datasets, one can develop more efficient feature selection measures that better distinguish the classes in multiclass binary datasets. Using these measures, we propose a greedy feature selection algorithm, CovSkew, for multiclass binary data. We show that CovSkew achieves a high accuracy gain over baseline methods, up to \(\sim\)40%, especially when the selected feature subset is small. We also show that CovSkew has low computational costs compared with most of the baselines.
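
The abstract does not give CovSkew's measures, so the following is only a minimal sketch of the general idea of greedy selection that tries to distinguish every pair of classes in binary data. It is not the authors' algorithm. The function name `greedy_pairwise_select`, the threshold `min_gap`, and the stand-in score (the gap between two classes' proportions of ones for a feature) are all assumptions made for illustration, and the sketch assumes at least two classes.

```python
import numpy as np
from itertools import combinations

def greedy_pairwise_select(X, y, k, min_gap=0.5):
    """Greedily pick up to k binary features (hypothetical stand-in, not CovSkew).

    Each step takes the feature that separates the most class pairs
    that the features chosen so far have not yet separated.
    """
    classes = np.unique(y)
    # p[c, j]: fraction of samples of class c whose feature j equals 1
    p = np.vstack([X[y == c].mean(axis=0) for c in classes])
    pairs = list(combinations(range(len(classes)), 2))
    # gap[i, j]: |p[a, j] - p[b, j]| for the i-th class pair (a, b)
    gap = np.array([np.abs(p[a] - p[b]) for a, b in pairs])
    separates = gap >= min_gap                 # assumed separation criterion
    uncovered = np.ones(len(pairs), dtype=bool)
    taken = np.zeros(X.shape[1], dtype=bool)
    selected = []
    for _ in range(min(k, X.shape[1])):
        # Count newly separated pairs; break ties by mean gap (always < 1)
        score = separates[uncovered].sum(axis=0) + 0.5 * gap.mean(axis=0)
        score[taken] = -np.inf                 # never pick a feature twice
        best = int(np.argmax(score))
        selected.append(best)
        taken[best] = True
        uncovered &= ~separates[:, best]
        if not uncovered.any():
            uncovered[:] = True                # all pairs covered: start a new round
    return selected

# Toy data: 3 classes, 2 samples each; feature 3 duplicates feature 0,
# feature 2 is constant and carries no class information.
X = np.array([[1, 0, 1, 1], [1, 0, 1, 1],
              [0, 0, 1, 0], [0, 0, 1, 0],
              [0, 1, 1, 0], [0, 1, 1, 0]])
y = np.array([0, 0, 1, 1, 2, 2])
print(greedy_pairwise_select(X, y, k=2))
```

On this toy data, the second pick is feature 1 rather than the redundant feature 3, because only the pair of classes 1 and 2 is still unseparated after feature 0 is chosen. Tracking unseparated pairs is one simple way to favor small subsets that cover all classes; the paper's own measures may differ substantially.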

Notes

  1. https://sites.google.com/view/kushani/publications.

Acknowledgements

This work is supported by the Australian Government under the Australian Postgraduate Award.

Author information

Corresponding author

Correspondence to Kushani Perera.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Perera, K., Chan, J., Karunasekera, S. (2018). Feature Selection for Multiclass Binary Data. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10939. Springer, Cham. https://doi.org/10.1007/978-3-319-93040-4_5

  • DOI: https://doi.org/10.1007/978-3-319-93040-4_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93039-8

  • Online ISBN: 978-3-319-93040-4

  • eBook Packages: Computer Science, Computer Science (R0)
