Text filtering by boosting naive Bayes classifiers

Authors:
Yu-Hwan Kim, Shang-Yoon Hahn, Byoung-Tak Zhang

Artificial Intelligence Lab (SCAI), School of Computer Science and Engineering, Seoul National University, Seoul 151-742, Korea

SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, July 2000, Pages 168–175
https://doi.org/10.1145/345508.345572

Published: 01 July 2000

ABSTRACT

Several machine learning algorithms have recently been applied to text categorization and filtering. In particular, boosting methods such as AdaBoost have performed well on real text data. However, most existing boosting algorithms are built on classifiers that use binary-valued features, so they do not fully exploit the weight information provided by standard term weighting methods. In this paper, we present a boosting-based learning method for text filtering that uses naive Bayes classifiers as weak learners. Using naive Bayes lets the boosting algorithm exploit term frequency information while maintaining a probabilistically accurate confidence ratio. Applied to TREC-7 and TREC-8 filtering track documents, the proposed method obtained significant improvements in the LF1, LF2, F1 and F3 measures over the best results submitted by other TREC entries.
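
The general idea of boosting naive Bayes weak learners over term frequencies can be illustrated with a small sketch. This is not the authors' algorithm, whose details are not given in the abstract: it assumes plain discrete AdaBoost, a multinomial naive Bayes weak learner with add-one smoothing, and documents represented as term-to-count dictionaries. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def train_weighted_nb(docs, labels, weights, vocab):
    """Fit a multinomial naive Bayes model on weighted term-frequency documents.

    Assumption: boosting weights enter by scaling each document's counts.
    """
    n = len(docs)
    class_mass = {1: 0.0, -1: 0.0}
    term_mass = {1: Counter(), -1: Counter()}
    for doc, y, w in zip(docs, labels, weights):
        scaled = w * n  # uniform weights (1/n) reproduce plain counts
        class_mass[y] += scaled
        for term, tf in doc.items():
            term_mass[y][term] += scaled * tf
    totals = {y: sum(term_mass[y].values()) for y in (1, -1)}

    def log_likelihood(y, term):
        # Add-one smoothing over the vocabulary.
        return math.log((term_mass[y][term] + 1.0) / (totals[y] + len(vocab)))

    bias = math.log((class_mass[1] + 1.0) / (class_mass[-1] + 1.0))
    delta = {t: log_likelihood(1, t) - log_likelihood(-1, t) for t in vocab}
    return bias, delta


def log_odds(model, doc):
    """Log-odds of the relevant class; terms outside the vocabulary are ignored."""
    bias, delta = model
    return bias + sum(tf * delta.get(term, 0.0) for term, tf in doc.items())


def boost_nb(docs, labels, rounds=5):
    """Discrete AdaBoost with naive Bayes weak learners (illustrative only)."""
    n = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        model = train_weighted_nb(docs, labels, weights, vocab)
        preds = [1 if log_odds(model, d) >= 0 else -1 for d in docs]
        err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
        if err >= 0.5:  # weak learner no better than chance: stop
            break
        err = max(err, 1e-10)  # avoid division by zero for a perfect learner
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, model))
        if err <= 1e-10:
            break
        # Increase the weight of misclassified documents, then renormalize.
        weights = [w * math.exp(-alpha * y * p)
                   for w, p, y in zip(weights, preds, labels)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble


def classify(ensemble, doc):
    """Weighted vote of the weak learners; returns +1 (relevant) or -1."""
    vote = sum(a * (1 if log_odds(m, doc) >= 0 else -1) for a, m in ensemble)
    return 1 if vote >= 0 else -1


# Toy data (hypothetical): term frequencies, +1 = relevant, -1 = not relevant.
train_docs = [
    {"boost": 3, "bayes": 2},
    {"boost": 2, "filter": 1},
    {"stock": 3, "market": 2},
    {"market": 2, "filter": 1},
]
train_labels = [1, 1, -1, -1]
ensemble = boost_nb(train_docs, train_labels)
```

Because the weak learner works on raw counts, a term that occurs three times in a document contributes three times as much to the log-odds as a term that occurs once, which a binary-feature weak learner cannot express.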


Index Terms

Text filtering by boosting naive Bayes classifiers
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published in
SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
July 2000
396 pages
ISBN:1581132263
DOI:10.1145/345508
Chairmen:
Emmanuel Yannakoudakis, Athens Univ. of Economics and Business, Greece
Nicholas J. Belkin, Rutgers Univ.
Mun-Kew Leong, Kent Ridge Digital Labs
Peter Ingwersen, Royal School of Library and Information Science
Copyright © 2000 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%
Article Metrics
- Total Citations: 35
- Total Downloads: 199
- Downloads (Last 12 months): 62
- Downloads (Last 6 weeks): 8