ABSTRACT
Many different topic models have been used successfully for a variety of applications. However, even state-of-the-art topic models suffer from the important flaw that they do not capture the tendency of words to appear in bursts; it is a fundamental property of language that if a word is used once in a document, it is more likely to be used again. We introduce a topic model that uses Dirichlet compound multinomial (DCM) distributions to model this burstiness phenomenon. On both text and non-text datasets, the new model achieves better held-out likelihood than standard latent Dirichlet allocation (LDA). It is straightforward to incorporate the DCM extension into topic models that are more complex than LDA.
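The burstiness property the abstract describes can be illustrated directly: a Dirichlet compound multinomial draws a multinomial parameter vector from a Dirichlet prior per document, and small Dirichlet parameters make repeated use of the same word far more likely than under a plain multinomial. The sketch below is illustrative only (the vocabulary size, document length, and `alpha` values are arbitrary assumptions, not from the paper) and compares per-word count dispersion under the two models.

```python
import numpy as np

def sample_dcm(alpha, n, rng):
    """Draw one document's word counts from a Dirichlet compound multinomial.

    A DCM draw is a multinomial whose parameter vector is itself drawn
    from Dirichlet(alpha); small alpha entries yield bursty counts.
    """
    theta = rng.dirichlet(alpha)
    return rng.multinomial(n, theta)

rng = np.random.default_rng(0)
vocab = 50        # assumed vocabulary size (illustrative)
n_words = 100     # assumed document length (illustrative)
alpha = np.full(vocab, 0.05)  # small alpha -> strong burstiness

# 2000 simulated documents from each model, with identical expected counts.
dcm_counts = np.array([sample_dcm(alpha, n_words, rng) for _ in range(2000)])
mult_counts = rng.multinomial(n_words, np.full(vocab, 1.0 / vocab), size=2000)

# Burstiness shows up as a much heavier-tailed per-word count distribution:
# once a word appears, it tends to appear again in the same document.
print("DCM max count:", dcm_counts.max())
print("Multinomial max count:", mult_counts.max())
print("DCM variance:", dcm_counts.var())
print("Multinomial variance:", mult_counts.var())
```

Under a plain multinomial every word's count stays close to its expectation, while the DCM routinely concentrates a large fraction of a document's tokens on a few words, which is exactly the tendency the abstract says standard LDA fails to capture.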
Index Terms: Accounting for burstiness in topic models