ABSTRACT
In this paper, we define and study a novel text mining problem, which we refer to as Comparative Text Mining (CTM). Given a set of comparable text collections, the task of comparative text mining is to discover any latent common themes across all collections as well as summarize the similarities and differences of these collections along each common theme. This general problem subsumes many interesting applications, including business intelligence and opinion summarization. We propose a generative probabilistic mixture model for comparative text mining. The model simultaneously performs cross-collection clustering and within-collection clustering, and can be applied to an arbitrary set of comparable text collections. The model can be estimated efficiently using the Expectation-Maximization (EM) algorithm. We evaluate the model on two different text data sets (i.e., a news article data set and a laptop review data set), and compare it with a baseline clustering method also based on a mixture model. Experimental results show that the model is quite effective in discovering the latent common themes across collections and performs significantly better than our baseline mixture model.
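To make the idea concrete, the following is a minimal sketch of a cross-collection mixture model estimated with EM. It is a simplified illustration inspired by the abstract, not the authors' exact formulation: it assumes a fixed mixing weight `lam` between a shared (cross-collection) theme and its collection-specific counterparts, and it omits the background model and any smoothing. All names (`ccmix_em`, `counts`, `coll`) are illustrative.

```python
import numpy as np

def ccmix_em(counts, coll, k=2, lam=0.5, iters=200, seed=0):
    """Simplified cross-collection mixture model fit by EM (a sketch,
    not the paper's exact model).

    counts : (n_docs, vocab) term-count matrix
    coll   : length-n_docs array of collection ids in {0..n_coll-1}
    k      : number of latent themes
    lam    : probability a theme word is drawn from the shared
             (cross-collection) word distribution rather than the
             collection-specific one
    """
    rng = np.random.default_rng(seed)
    n_docs, vocab = counts.shape
    n_coll = int(coll.max()) + 1

    pi = rng.dirichlet(np.ones(k), size=n_docs)              # doc-theme weights
    common = rng.dirichlet(np.ones(vocab), size=k)           # shared themes
    spec = rng.dirichlet(np.ones(vocab), size=(k, n_coll))   # per-collection themes

    for _ in range(iters):
        new_pi = np.zeros_like(pi)
        new_common = np.zeros_like(common)
        new_spec = np.zeros_like(spec)
        for d in range(n_docs):
            c = coll[d]
            # E-step: per-theme word probability mixes shared and specific parts
            p_w = lam * common + (1 - lam) * spec[:, c]      # (k, vocab)
            joint = pi[d][:, None] * p_w
            r = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)  # p(theme|d,w)
            weighted = r * counts[d]                         # expected counts
            new_pi[d] = weighted.sum(axis=1)
            # split each theme's expected counts between shared/specific parts
            share = lam * common / (p_w + 1e-12)             # p(shared | theme, w)
            new_common += weighted * share
            new_spec[:, c] += weighted * (1 - share)
        # M-step: renormalize all multinomials
        pi = new_pi / (new_pi.sum(axis=1, keepdims=True) + 1e-12)
        common = new_common / (new_common.sum(axis=1, keepdims=True) + 1e-12)
        spec = new_spec / (new_spec.sum(axis=2, keepdims=True) + 1e-12)
    return pi, common, spec
```

On a toy corpus where two collections share some words but each also has its own, the shared theme distributions concentrate on the common vocabulary while the collection-specific distributions absorb the rest, which is the separation the CTM task asks for.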