ABSTRACT
In this paper we introduce a novel collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model. Our new method results in significant speedups on real-world text corpora. Conventional Gibbs sampling schemes for LDA require O(K) operations per sample, where K is the number of topics in the model. Our proposed method, FastLDA, draws equivalent samples but requires on average significantly fewer than K operations per sample. On real-world corpora FastLDA can be as much as 8 times faster than the standard collapsed Gibbs sampler for LDA. No approximations are necessary, and we show that our fast sampling scheme produces exactly the same results as the standard (but slower) sampling scheme. Experiments on four real-world data sets demonstrate speedups for a wide range of collection sizes. For the PubMed collection of over 8 million documents, where LDA requires 6 CPU months of computation, our speedup of 5.7 can save 5 CPU months.
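The baseline the abstract refers to is the standard collapsed Gibbs sampler for LDA, which for every token computes the unnormalized full-conditional probability of all K topics before drawing a sample, hence O(K) work per token. The sketch below illustrates only that baseline, not FastLDA itself; all variable names, hyperparameter values, and the toy corpus are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def collapsed_gibbs_step(doc_topic, topic_word, topic_sum, z, docs,
                         alpha, beta, V, rng):
    """One full sweep of standard collapsed Gibbs sampling for LDA.

    O(K) work per token: the unnormalized probability of every topic
    is computed before each sample is drawn (the cost FastLDA reduces).
    """
    K = topic_sum.shape[0]
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d][i]
            # remove the current assignment from the count matrices
            doc_topic[d, k] -= 1
            topic_word[k, w] -= 1
            topic_sum[k] -= 1
            # full conditional p(z_di = k | rest), unnormalized -- O(K)
            p = (doc_topic[d] + alpha) * (topic_word[:, w] + beta) \
                / (topic_sum + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            # add the new assignment back
            doc_topic[d, k] += 1
            topic_word[k, w] += 1
            topic_sum[k] += 1

# toy corpus: word ids over a vocabulary of V = 5 (illustrative only)
rng = np.random.default_rng(0)
docs = [[0, 1, 1, 2], [2, 3, 4, 4, 0]]
K, V, alpha, beta = 3, 5, 0.1, 0.01
z = [list(rng.integers(K, size=len(d))) for d in docs]
doc_topic = np.zeros((len(docs), K), dtype=int)
topic_word = np.zeros((K, V), dtype=int)
topic_sum = np.zeros(K, dtype=int)
for d, words in enumerate(docs):
    for i, w in enumerate(words):
        doc_topic[d, z[d][i]] += 1
        topic_word[z[d][i], w] += 1
        topic_sum[z[d][i]] += 1
for _ in range(10):
    collapsed_gibbs_step(doc_topic, topic_word, topic_sum, z, docs,
                         alpha, beta, V, rng)
```

The invariant to note is that the count matrices always stay consistent with the assignment vector z; FastLDA's contribution, per the abstract, is drawing from the identical distribution while touching fewer than K topics on average.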