Abstract
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available.
We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and uses it to probabilistically label the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
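To make the procedure concrete, the sketch below implements the basic EM loop with a multinomial naive Bayes model in Python. It is an illustrative reconstruction from the abstract's description, not the authors' code: the smoothing constant, the fixed iteration count, and all variable and function names are assumptions, and the lam parameter stands in for the weighting factor on unlabeled data described as the first extension.

    # Illustrative sketch of semi-supervised EM with multinomial naive Bayes.
    # X matrices are dense document-term count arrays; y_lab holds integer
    # class labels for the labeled documents. The smoothing value (alpha) and
    # the iteration count are assumed, not taken from the paper.
    import numpy as np

    def m_step(X, P, alpha=1.0):
        """Estimate class log-priors and per-class word log-probabilities
        from (possibly fractional, possibly weighted) memberships P."""
        priors = (P.sum(axis=0) + alpha) / (P.sum() + alpha * P.shape[1])
        word_counts = P.T @ X + alpha              # expected word counts per class
        word_probs = word_counts / word_counts.sum(axis=1, keepdims=True)
        return np.log(priors), np.log(word_probs)

    def e_step(X, log_priors, log_word_probs):
        """Posterior class memberships P(class | document) under naive Bayes."""
        log_post = X @ log_word_probs.T + log_priors
        log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, lam=1.0, n_iter=20):
        """Train on labeled documents, probabilistically label the unlabeled
        documents, retrain on all documents, and iterate. lam in [0, 1]
        down-weights the unlabeled documents' contribution to the counts."""
        P_lab = np.eye(n_classes)[y_lab]                   # hard labels, kept fixed
        log_priors, log_wp = m_step(X_lab, P_lab)          # initial classifier
        for _ in range(n_iter):
            P_unl = e_step(X_unl, log_priors, log_wp)      # E-step on unlabeled docs
            X_all = np.vstack([X_lab, X_unl])
            P_all = np.vstack([P_lab, lam * P_unl])        # weight unlabeled counts
            log_priors, log_wp = m_step(X_all, P_all)      # M-step on all documents
        return log_priors, log_wp

With lam = 1 this reduces to the basic EM procedure; values between 0 and 1 correspond to the weighting-factor extension, and the second extension would replace the single component per class above with several mixture components per class.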
Cite this article
Nigam, K., McCallum, A.K., Thrun, S. et al. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39, 103–134 (2000). https://doi.org/10.1023/A:1007692713085