Abstract
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available.
We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and uses it to probabilistically label the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
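To make the procedure concrete, the sketch below implements the basic EM loop with a multinomial naive Bayes model in Python. It is an illustrative reconstruction from the abstract's description, not the authors' code: the smoothing constant, the fixed iteration count, and all variable and function names are assumptions, and the lam parameter stands in for the weighting factor on unlabeled data described as the first extension.

    # Illustrative sketch of semi-supervised EM with multinomial naive Bayes.
    # X matrices are dense document-term count arrays; y_lab holds integer
    # class labels for the labeled documents. The smoothing value (alpha) and
    # the iteration count are assumed, not taken from the paper.
    import numpy as np

    def m_step(X, P, alpha=1.0):
        """Estimate class log-priors and per-class word log-probabilities
        from (possibly fractional, possibly weighted) memberships P."""
        priors = (P.sum(axis=0) + alpha) / (P.sum() + alpha * P.shape[1])
        word_counts = P.T @ X + alpha              # expected word counts per class
        word_probs = word_counts / word_counts.sum(axis=1, keepdims=True)
        return np.log(priors), np.log(word_probs)

    def e_step(X, log_priors, log_word_probs):
        """Posterior class memberships P(class | document) under naive Bayes."""
        log_post = X @ log_word_probs.T + log_priors
        log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, lam=1.0, n_iter=20):
        """Train on labeled documents, probabilistically label the unlabeled
        documents, retrain on all documents, and iterate. lam in [0, 1]
        down-weights the unlabeled documents' contribution to the counts."""
        P_lab = np.eye(n_classes)[y_lab]                   # hard labels, kept fixed
        log_priors, log_wp = m_step(X_lab, P_lab)          # initial classifier
        for _ in range(n_iter):
            P_unl = e_step(X_unl, log_priors, log_wp)      # E-step on unlabeled docs
            X_all = np.vstack([X_lab, X_unl])
            P_all = np.vstack([P_lab, lam * P_unl])        # weight unlabeled counts
            log_priors, log_wp = m_step(X_all, P_all)      # M-step on all documents
        return log_priors, log_wp

With lam = 1 this reduces to the basic EM procedure; values between 0 and 1 correspond to the weighting-factor extension, and the second extension would replace the single component per class above with several mixture components per class.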
Cite this article
Nigam, K., McCallum, A.K., Thrun, S. et al. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39, 103–134 (2000). https://doi.org/10.1023/A:1007692713085