ABSTRACT
Several machine learning algorithms have recently been used for text categorization and filtering. In particular, boosting methods such as AdaBoost have shown good performance when applied to real text data. However, most existing boosting algorithms are based on classifiers that use binary-valued features, so they do not make full use of the weight information provided by standard term weighting methods. In this paper, we present a boosting-based learning method for text filtering that uses naive Bayes classifiers as weak learners. The use of naive Bayes allows the boosting algorithm to utilize term frequency information while maintaining probabilistically accurate confidence ratios. Applied to TREC-7 and TREC-8 filtering track documents, the proposed method obtained a significant improvement in the LF1, LF2, F1 and F3 measures compared to the best results submitted by other TREC entries.
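The general idea of boosting naive Bayes weak learners on term frequencies can be illustrated with a small sketch. It is not the paper's algorithm: it assumes a confidence-rated AdaBoost update in which each weak hypothesis outputs half the log posterior odds of a weighted multinomial naive Bayes model. The function names, the Laplace smoothing constant, the clipping bound, and the toy corpus are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def train_weighted_nb(docs, labels, weights, vocab, smooth=1.0):
    """Fit a multinomial naive Bayes model on weighted documents.

    docs are Counters mapping term -> term frequency; labels are +1 or -1.
    Each document contributes its boosting weight times its term frequencies.
    """
    prior = {+1: 0.0, -1: 0.0}
    term_mass = {+1: Counter(), -1: Counter()}
    for doc, y, w in zip(docs, labels, weights):
        prior[y] += w
        for term, tf in doc.items():
            term_mass[y][term] += w * tf
    model = {}
    for y in (+1, -1):
        total = sum(term_mass[y].values()) + smooth * len(vocab)
        log_like = {t: math.log((term_mass[y][t] + smooth) / total) for t in vocab}
        model[y] = (math.log(max(prior[y], 1e-12)), log_like)
    return model


def confidence(model, doc, clip=5.0):
    """Real-valued prediction: half the log posterior odds, clipped (assumption)."""
    score = {}
    for y in (+1, -1):
        log_prior, log_like = model[y]
        score[y] = log_prior + sum(
            tf * log_like[t] for t, tf in doc.items() if t in log_like
        )
    half_log_odds = 0.5 * (score[+1] - score[-1])
    return max(-clip, min(clip, half_log_odds))


def boost_nb(docs, labels, rounds=10):
    """Run confidence-rated boosting with naive Bayes as the weak learner."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    weights = [1.0 / n] * n  # start from a uniform distribution
    ensemble = []
    for _ in range(rounds):
        model = train_weighted_nb(docs, labels, weights, vocab)
        ensemble.append(model)
        # Shrink the weight of confidently correct documents, grow the rest.
        weights = [
            w * math.exp(-y * confidence(model, d))
            for w, d, y in zip(weights, docs, labels)
        ]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble


def predict(ensemble, doc):
    """Classify by the sign of the summed confidences."""
    return 1 if sum(confidence(m, doc) for m in ensemble) > 0 else -1


# Hypothetical toy corpus: +1 = relevant to the filtering topic, -1 = not.
toy_docs = [
    Counter("boosting boosting bayes filter".split()),
    Counter("bayes text filter filter".split()),
    Counter("football match goal goal".split()),
    Counter("goal match team".split()),
]
toy_labels = [1, 1, -1, -1]
toy_ensemble = boost_nb(toy_docs, toy_labels, rounds=5)
```

Because the weak learner consumes weighted term counts directly, repeated terms influence each round, which a binary-feature weak learner would ignore. A real filtering system would also need a decision threshold tuned to the utility measure being optimized, which this sketch leaves out.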
Index Terms
- Text filtering by boosting naive Bayes classifiers