ABSTRACT
In this paper, we define and study a novel text mining problem, which we refer to as Comparative Text Mining (CTM). Given a set of comparable text collections, the task of comparative text mining is to discover any latent common themes across all collections as well as summarize the similarities and differences of these collections along each common theme. This general problem subsumes many interesting applications, including business intelligence and opinion summarization. We propose a generative probabilistic mixture model for comparative text mining. The model simultaneously performs cross-collection clustering and within-collection clustering, and can be applied to an arbitrary set of comparable text collections. The model can be estimated efficiently using the Expectation-Maximization (EM) algorithm. We evaluate the model on two different text data sets (i.e., a news article data set and a laptop review data set), and compare it with a baseline clustering method also based on a mixture model. Experimental results show that the model is quite effective in discovering the latent common themes across collections and performs significantly better than our baseline mixture model.
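To make the idea concrete, the following is a minimal sketch of a cross-collection mixture model estimated with EM. It is a simplified illustration inspired by the abstract, not the authors' exact formulation: it assumes a fixed mixing weight `lam` between a shared (cross-collection) theme and its collection-specific counterparts, and it omits the background model and any smoothing. All names (`ccmix_em`, `counts`, `coll`) are illustrative.

```python
import numpy as np

def ccmix_em(counts, coll, k=2, lam=0.5, iters=200, seed=0):
    """Simplified cross-collection mixture model fit by EM (a sketch,
    not the paper's exact model).

    counts : (n_docs, vocab) term-count matrix
    coll   : length-n_docs array of collection ids in {0..n_coll-1}
    k      : number of latent themes
    lam    : probability a theme word is drawn from the shared
             (cross-collection) word distribution rather than the
             collection-specific one
    """
    rng = np.random.default_rng(seed)
    n_docs, vocab = counts.shape
    n_coll = int(coll.max()) + 1

    pi = rng.dirichlet(np.ones(k), size=n_docs)              # doc-theme weights
    common = rng.dirichlet(np.ones(vocab), size=k)           # shared themes
    spec = rng.dirichlet(np.ones(vocab), size=(k, n_coll))   # per-collection themes

    for _ in range(iters):
        new_pi = np.zeros_like(pi)
        new_common = np.zeros_like(common)
        new_spec = np.zeros_like(spec)
        for d in range(n_docs):
            c = coll[d]
            # E-step: per-theme word probability mixes shared and specific parts
            p_w = lam * common + (1 - lam) * spec[:, c]      # (k, vocab)
            joint = pi[d][:, None] * p_w
            r = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)  # p(theme|d,w)
            weighted = r * counts[d]                         # expected counts
            new_pi[d] = weighted.sum(axis=1)
            # split each theme's expected counts between shared/specific parts
            share = lam * common / (p_w + 1e-12)             # p(shared | theme, w)
            new_common += weighted * share
            new_spec[:, c] += weighted * (1 - share)
        # M-step: renormalize all multinomials
        pi = new_pi / (new_pi.sum(axis=1, keepdims=True) + 1e-12)
        common = new_common / (new_common.sum(axis=1, keepdims=True) + 1e-12)
        spec = new_spec / (new_spec.sum(axis=2, keepdims=True) + 1e-12)
    return pi, common, spec
```

On a toy corpus where two collections share some words but each also has its own, the shared theme distributions concentrate on the common vocabulary while the collection-specific distributions absorb the rest, which is the separation the CTM task asks for.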