DOI: 10.1145/1401890.1401960 · KDD Conference Proceedings · Research article

Fast collapsed Gibbs sampling for latent Dirichlet allocation

Published: 24 August 2008

ABSTRACT

In this paper we introduce a novel collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model. Our new method results in significant speedups on real-world text corpora. Conventional Gibbs sampling schemes for LDA require O(K) operations per sample, where K is the number of topics in the model. Our proposed method draws equivalent samples but requires, on average, significantly fewer than K operations per sample. On real-world corpora FastLDA can be as much as 8 times faster than the standard collapsed Gibbs sampler for LDA. No approximations are necessary, and we show that our fast sampling scheme produces exactly the same results as the standard (but slower) sampling scheme. Experiments on four real-world data sets demonstrate speedups for a wide range of collection sizes. For the PubMed collection of over 8 million documents, which requires 6 CPU months of computation for LDA, our speedup of 5.7 can save 5 CPU months of computation.
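The O(K) cost the abstract refers to is the per-token conditional distribution: the standard collapsed Gibbs sampler must score every topic before it can draw. A minimal illustrative sketch of that baseline update (not the authors' implementation; the function and count-array names here are hypothetical):

```python
import random

def sample_topic(w, d, n_wk, n_dk, n_k, alpha, beta, V):
    """Standard collapsed Gibbs update for one token: O(K) work.

    n_wk[w][k]: count of word w assigned to topic k
    n_dk[d][k]: count of tokens in document d assigned to topic k
    n_k[k]:     total count of tokens assigned to topic k
    """
    K = len(n_k)
    cumulative = []
    total = 0.0
    for k in range(K):  # the O(K) loop the paper targets
        p = (n_wk[w][k] + beta) / (n_k[k] + V * beta) * (n_dk[d][k] + alpha)
        total += p
        cumulative.append(total)  # running cumulative sum
    u = random.random() * total
    for k, c in enumerate(cumulative):  # invert the CDF
        if u < c:
            return k
    return K - 1
```

FastLDA draws from exactly this distribution but, roughly speaking, avoids scoring all K topics for every token by visiting high-mass topics first and stopping once an upper bound shows the remaining topics cannot be selected; the per-sample cost then averages well below K.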


Published in

KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2008, 1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi

          Copyright © 2008 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

KDD '08 paper acceptance rate: 118 of 593 submissions (20%). Overall acceptance rate: 1,133 of 8,635 submissions (13%).
