
Optimal feature subset selection using hybrid binary Jaya optimization algorithm for text classification


Abstract

Feature selection is an important task in high-dimensional text classification. Most modern feature selection methods exploit optimization algorithms to select an optimal feature subset from the high-dimensional feature space. An optimal feature subset reduces computation cost and increases the accuracy of the text classifier. In this paper, we propose a new hybrid feature selection method based on the normalized difference measure and the binary Jaya optimization algorithm (NDM-BJO) to obtain an appropriate subset of optimal features from a text corpus. We use the error rate as the minimizing objective function to measure the fitness of a solution. The nominated optimal feature subsets are evaluated using Naive Bayes and Support Vector Machine classifiers on several popular benchmark text corpora. The results confirm that the proposed NDM-BJO shows promising improvements over existing work.
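
The abstract describes the pipeline only at a high level, so the following is a minimal sketch of the wrapper stage under several assumptions not stated above: the NDM filter has already reduced X to the top-ranked features, fitness is the cross-validated error rate of a Naive Bayes classifier, and a sigmoid transfer function binarizes the continuous Jaya update. The names error_rate and binary_jaya and all parameter defaults are illustrative, not taken from the paper.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

def error_rate(bits, X, y):
    # Fitness of one candidate subset: cross-validated error rate
    # (1 - accuracy) of Naive Bayes on the selected feature columns.
    mask = bits.astype(bool)
    if not mask.any():                     # empty subset gets the worst fitness
        return 1.0
    return 1.0 - cross_val_score(MultinomialNB(), X[:, mask], y, cv=3).mean()

def binary_jaya(X, y, pop_size=20, t_max=50, seed=None):
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = (rng.random((pop_size, n)) < 0.5).astype(float)  # random 0/1 masks
    fit = np.array([error_rate(p, X, y) for p in pop])
    for _ in range(t_max):
        best = pop[fit.argmin()].copy()    # copies, so in-place row updates
        worst = pop[fit.argmax()].copy()   # cannot change them mid-iteration
        for i in range(pop_size):
            alpha, beta = rng.random(n), rng.random(n)
            # Jaya move: toward the best solution and away from the worst.
            v = pop[i] + alpha * (best - np.abs(pop[i])) \
                       - beta * (worst - np.abs(pop[i]))
            # Sigmoid transfer function turns the continuous position back
            # into a 0/1 mask (an assumed binarization scheme).
            cand = (rng.random(n) < 1.0 / (1.0 + np.exp(-v))).astype(float)
            f = error_rate(cand, X, y)
            if f < fit[i]:                 # keep the candidate only if better
                pop[i], fit[i] = cand, f
    return pop[fit.argmin()].astype(bool), fit.min()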



Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments and advice in improving this work. We would also like to thank the Management and Principal of Mepco Schlenk Engineering College (Autonomous), Sivakasi, for providing the state-of-the-art facilities to carry out this research in the Mepco Research Centre in collaboration with Anna University Chennai, Tamil Nadu, India.

Author information


Correspondence to K Thirumoorthy.

Nomenclature

i : solution index
j : position index
t : iteration/generation index
f : number of features to be selected
\(T_{max}\) : maximum number of iterations/generations
\(\Psi _{i}\) : \(i^{\mathrm{th}}\) solution
\(\Psi _{i,j}\) : \(j^{\mathrm{th}}\) position of solution \(\Psi _{i}\)
\(\Psi _{i}^{(t)}\) : \(i^{\mathrm{th}}\) solution \(\Psi _{i}\) at iteration/generation t
\(\Psi _{i}^{fitness}\) : fitness value of solution \(\Psi _{i}\)
\(\Psi _{best}\) : best solution
\(\Psi _{best}^{fitness}\) : fitness value of the best solution
\(\Psi _{worst}\) : worst solution
\(\Psi _{worst}^{fitness}\) : fitness value of the worst solution
\(\alpha , \beta \) : random numbers in [0, 1]
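
For orientation, the symbols above are those of the original Jaya update rule; written in this nomenclature it reads as below. This is a reconstruction from the published Jaya algorithm, not an equation copied from this paper, and the binary variant is assumed to map the updated position back to \(\{0,1\}\) with a transfer function.

\[ \Psi _{i,j}^{(t+1)} = \Psi _{i,j}^{(t)} + \alpha \left( \Psi _{best,j}^{(t)} - \left| \Psi _{i,j}^{(t)} \right| \right) - \beta \left( \Psi _{worst,j}^{(t)} - \left| \Psi _{i,j}^{(t)} \right| \right) \]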


Cite this article

Thirumoorthy, K., Muneeswaran, K. Optimal feature subset selection using hybrid binary Jaya optimization algorithm for text classification. Sādhanā 45, 201 (2020). https://doi.org/10.1007/s12046-020-01443-w
