Abstract
Feature selection is an important task in high-dimensional text classification. Most current feature selection methods employ an optimization algorithm to select an optimal feature subset from the high-dimensional feature space. An optimal feature subset reduces the computation cost and increases the accuracy of the text classifier. In this paper, we propose a new hybrid feature selection method based on the normalized difference measure and a binary Jaya optimization algorithm (NDM-BJO) to obtain an appropriate subset of optimal features from a text corpus. We use the classification error rate as the objective function to be minimized when measuring the fitness of a solution. The nominated optimal feature subsets are evaluated using Naive Bayes and Support Vector Machine classifiers on several popular benchmark text corpora. The results confirm that the proposed NDM-BJO method shows promising improvements over existing work.
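To illustrate the optimization loop the abstract describes, below is a minimal Python sketch of binary Jaya feature selection with the error rate as the fitness to be minimized. This is an assumption-laden sketch, not the paper's implementation: the function names (`binary_jaya`, `error_rate`), the sigmoid transfer step that binarizes the continuous Jaya update, and the 3-fold Naive Bayes fitness estimate are our choices, and the normalized-difference-measure pre-filtering stage of NDM-BJO is omitted.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

def error_rate(solution, X, y):
    """Fitness: cross-validated Naive Bayes error rate on the selected features."""
    mask = solution.astype(bool)
    if not mask.any():                        # empty subset: worst possible fitness
        return 1.0
    acc = cross_val_score(MultinomialNB(), X[:, mask], y, cv=3).mean()
    return 1.0 - acc                          # the search minimizes this value

def binary_jaya(X, y, pop_size=20, t_max=50, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = (rng.random((pop_size, n)) < 0.5).astype(float)   # random 0/1 solutions
    fit = np.array([error_rate(s, X, y) for s in pop])
    for _ in range(t_max):
        best = pop[fit.argmin()].copy()       # copy so in-place updates don't alias
        worst = pop[fit.argmax()].copy()
        for i in range(pop_size):
            alpha, beta = rng.random(n), rng.random(n)
            # Jaya update: move toward the best solution, away from the worst
            v = pop[i] + alpha * (best - np.abs(pop[i])) \
                       - beta * (worst - np.abs(pop[i]))
            # sigmoid transfer function maps the continuous update back to {0, 1}
            cand = (rng.random(n) < 1.0 / (1.0 + np.exp(-v))).astype(float)
            f = error_rate(cand, X, y)
            if f < fit[i]:                    # greedy replacement
                pop[i], fit[i] = cand, f
    return pop[fit.argmin()].astype(bool), fit.min()
```

Here `X` is assumed to be a nonnegative document-term matrix (e.g., term counts) and `y` the class labels; the returned boolean mask marks the selected feature subset.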
Acknowledgements
We would like to thank the anonymous reviewers for their helpful comments and advice in improving this work. We also thank the Management and Principal of Mepco Schlenk Engineering College (Autonomous), Sivakasi, for providing the state-of-the-art facilities to carry out this research in the Mepco Research Centre in collaboration with Anna University, Chennai, Tamil Nadu, India.
Nomenclature
- \(i\): solution index
- \(j\): position index
- \(t\): iteration/generation index
- \(f\): number of features to be selected
- \(T_{max}\): maximum number of iterations/generations
- \(\Psi_{i}\): \(i^{\mathrm{th}}\) solution
- \(\Psi_{i,j}\): \(j^{\mathrm{th}}\) position of solution \(\Psi_{i}\)
- \(\Psi_{i}^{(t)}\): solution \(\Psi_{i}\) at iteration/generation \(t\)
- \(\Psi_{i}^{fitness}\): fitness value of solution \(\Psi_{i}\)
- \(\Psi_{best}\): best solution
- \(\Psi_{best}^{fitness}\): fitness value of the best solution
- \(\Psi_{worst}\): worst solution
- \(\Psi_{worst}^{fitness}\): fitness value of the worst solution
- \(\alpha, \beta\): random numbers in \([0, 1]\)
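For orientation, the symbols above combine in the canonical Jaya update rule of Rao (2016), sketched here under the assumption that the paper follows the standard form; the binary variant then maps the updated position back to \(\{0,1\}\) via a transfer function:

\[
\Psi_{i,j}^{(t+1)} = \Psi_{i,j}^{(t)} + \alpha\left(\Psi_{best,j}^{(t)} - \left|\Psi_{i,j}^{(t)}\right|\right) - \beta\left(\Psi_{worst,j}^{(t)} - \left|\Psi_{i,j}^{(t)}\right|\right)
\]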
Cite this article
Thirumoorthy, K., Muneeswaran, K. Optimal feature subset selection using hybrid binary Jaya optimization algorithm for text classification. Sādhanā 45, 201 (2020). https://doi.org/10.1007/s12046-020-01443-w