DOI: 10.1145/1014052.1014150
Article

A cross-collection mixture model for comparative text mining

Published: 22 August 2004

ABSTRACT

In this paper, we define and study a novel text mining problem, which we refer to as Comparative Text Mining (CTM). Given a set of comparable text collections, the task of comparative text mining is to discover any latent common themes across all collections as well as to summarize the similarities and differences of these collections along each common theme. This general problem subsumes many interesting applications, including business intelligence and opinion summarization. We propose a generative probabilistic mixture model for comparative text mining. The model simultaneously performs cross-collection clustering and within-collection clustering, and can be applied to an arbitrary set of comparable text collections. The model can be estimated efficiently using the Expectation-Maximization (EM) algorithm. We evaluate the model on two different text data sets (i.e., a news article data set and a laptop review data set), and compare it with a baseline clustering method also based on a mixture model. Experimental results show that the model is quite effective in discovering the latent common themes across collections and performs significantly better than our baseline mixture model.
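The abstract does not give the model's equations, so the following is only a minimal sketch of how a mixture model of this kind could be estimated with EM. It assumes each word in a document is drawn either from a fixed background distribution or from one of k themes, and that each theme, as seen from a given collection, blends a word distribution shared by all collections with a collection-specific one. The function name `fit_cross_collection_mixture`, the fixed weights `lam_b` and `lam_c`, and their default values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fit_cross_collection_mixture(collections, k, lam_b=0.9, lam_c=0.5,
                                 n_iter=50, seed=0):
    """EM for an assumed cross-collection mixture (illustrative sketch).

    collections: list of (n_docs_i, vocab) term-count arrays, one per
    collection, sharing one vocabulary; every document must be non-empty.
    Returns common themes (k, V), a list of collection-specific themes
    (k, V), a list of document-theme weights, and the log-likelihood trace.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-12  # guards against division by zero
    vocab = collections[0].shape[1]
    # Fixed background model: pooled word frequencies over all collections.
    total = sum(x.sum(axis=0) for x in collections).astype(float)
    p_bg = total / total.sum()

    def random_rows(*shape):
        m = rng.random(shape) + 0.1
        return m / m.sum(axis=-1, keepdims=True)

    common = random_rows(k, vocab)
    specific = [random_rows(k, vocab) for _ in collections]
    pi = [random_rows(x.shape[0], k) for x in collections]
    trace = []

    for _ in range(n_iter):
        common_acc = np.zeros((k, vocab))
        loglik = 0.0
        for i, x in enumerate(collections):
            # Theme j as seen from collection i: shared part plus specific part.
            mix = lam_c * common + (1 - lam_c) * specific[i]        # (k, V)
            joint = pi[i][:, :, None] * mix[None, :, :]             # (D, k, V)
            theme_mass = joint.sum(axis=1)                          # (D, V)
            word_prob = lam_b * p_bg + (1 - lam_b) * theme_mass     # (D, V)
            loglik += float((x * np.log(word_prob + eps)).sum())

            # E-step: posteriors over the hidden choices for each word.
            post_theme = joint / (theme_mass[:, None, :] + eps)     # p(theme j | d, w)
            post_not_bg = (1 - lam_b) * theme_mass / (word_prob + eps)
            post_common = lam_c * common / (mix + eps)              # p(shared | j, w)

            # M-step: re-estimate from expected counts.
            weighted = (x * post_not_bg)[:, None, :] * post_theme   # (D, k, V)
            doc_theme = weighted.sum(axis=2)
            pi[i] = doc_theme / (doc_theme.sum(axis=1, keepdims=True) + eps)
            theme_word = weighted.sum(axis=0)                       # (k, V)
            common_acc += theme_word * post_common
            spec = theme_word * (1 - post_common)
            specific[i] = spec / (spec.sum(axis=1, keepdims=True) + eps)
        # Shared themes pool evidence from every collection.
        common = common_acc / (common_acc.sum(axis=1, keepdims=True) + eps)
        trace.append(loglik)
    return common, specific, pi, trace
```

Under these assumptions, a call such as `fit_cross_collection_mixture([news_counts, review_counts], k=5)` (with hypothetical count matrices over a shared vocabulary) would return one shared word distribution per theme plus one collection-specific distribution per theme and collection. Holding `lam_b` and `lam_c` fixed treats them as hyperparameters, which keeps each EM iteration from decreasing the log-likelihood recorded in `trace`.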

    • Published in

      KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
      August 2004
      874 pages
      ISBN: 1581138881
      DOI: 10.1145/1014052

      Copyright © 2004 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
