ABSTRACT
In this paper we introduce a novel collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model. Our new method results in significant speedups on real-world text corpora. Conventional Gibbs sampling schemes for LDA require O(K) operations per sample, where K is the number of topics in the model. Our proposed method, FastLDA, draws equivalent samples but requires on average significantly fewer than K operations per sample. On real-world corpora FastLDA can be as much as 8 times faster than the standard collapsed Gibbs sampler for LDA. No approximations are necessary, and we show that our fast sampling scheme produces exactly the same results as the standard (but slower) sampling scheme. Experiments on four real-world data sets demonstrate speedups for a wide range of collection sizes. For the PubMed collection of over 8 million documents, where LDA requires 6 CPU months of computation, our speedup of 5.7 can save 5 CPU months.
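The baseline the abstract refers to is the standard collapsed Gibbs sampler for LDA, which for every token computes the unnormalized full-conditional probability of all K topics before drawing a sample, hence O(K) work per token. The sketch below illustrates only that baseline, not FastLDA itself; all variable names, hyperparameter values, and the toy corpus are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def collapsed_gibbs_step(doc_topic, topic_word, topic_sum, z, docs,
                         alpha, beta, V, rng):
    """One full sweep of standard collapsed Gibbs sampling for LDA.

    O(K) work per token: the unnormalized probability of every topic
    is computed before each sample is drawn (the cost FastLDA reduces).
    """
    K = topic_sum.shape[0]
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d][i]
            # remove the current assignment from the count matrices
            doc_topic[d, k] -= 1
            topic_word[k, w] -= 1
            topic_sum[k] -= 1
            # full conditional p(z_di = k | rest), unnormalized -- O(K)
            p = (doc_topic[d] + alpha) * (topic_word[:, w] + beta) \
                / (topic_sum + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            # add the new assignment back
            doc_topic[d, k] += 1
            topic_word[k, w] += 1
            topic_sum[k] += 1

# toy corpus: word ids over a vocabulary of V = 5 (illustrative only)
rng = np.random.default_rng(0)
docs = [[0, 1, 1, 2], [2, 3, 4, 4, 0]]
K, V, alpha, beta = 3, 5, 0.1, 0.01
z = [list(rng.integers(K, size=len(d))) for d in docs]
doc_topic = np.zeros((len(docs), K), dtype=int)
topic_word = np.zeros((K, V), dtype=int)
topic_sum = np.zeros(K, dtype=int)
for d, words in enumerate(docs):
    for i, w in enumerate(words):
        doc_topic[d, z[d][i]] += 1
        topic_word[z[d][i], w] += 1
        topic_sum[z[d][i]] += 1
for _ in range(10):
    collapsed_gibbs_step(doc_topic, topic_word, topic_sum, z, docs,
                         alpha, beta, V, rng)
```

The invariant to note is that the count matrices always stay consistent with the assignment vector z; FastLDA's contribution, per the abstract, is drawing from the identical distribution while touching fewer than K topics on average.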