Abstract
Latent Dirichlet allocation (LDA) is a popular topic modeling technique in academia but less so in industry, especially in large-scale applications such as search engine and online advertising systems. A main underlying reason is that the topic models used have been too small in scale to be useful; for example, some of the largest LDA models reported in the literature have up to 10³ topics, which can hardly cover the long-tail semantic word sets. In this article, we show that the number of topics is a key factor that can significantly boost the utility of topic-modeling systems. In particular, we show that a “big” LDA model with at least 10⁵ topics inferred from 10⁹ search queries can achieve significant improvements on industrial search engine and online advertising systems, both of which serve hundreds of millions of users. We develop a novel distributed system called Peacock to learn big LDA models from big data. The main features of Peacock include hierarchical distributed architecture, real-time prediction, and topic de-duplication. We empirically demonstrate that the Peacock system is capable of providing significant benefits via highly scalable LDA topic models for several industrial applications.
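The abstract's central object is an LDA model inferred from text. As a concrete illustration of what inferring an LDA model involves, below is a minimal collapsed Gibbs sampler on a toy corpus. This is a sketch only: the function name, toy documents, and hyperparameter values are illustrative and not from the paper, and Peacock's actual distributed, hierarchical implementation is far more elaborate.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (single machine, dense counts)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    # z[d][i]: topic assignment of the i-th token of document d
    z = [[rng.randrange(num_topics) for _ in d] for d in docs]
    ndk = [[0] * num_topics for _ in docs]   # document-topic counts
    nkw = defaultdict(int)                   # (topic, word) counts
    nk = [0] * num_topics                    # per-topic token totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[(k, w)] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # remove the token's current assignment from the counts
                k = z[d][i]
                ndk[d][k] -= 1; nkw[(k, w)] -= 1; nk[k] -= 1
                # full conditional p(z = k | all other assignments)
                weights = [(ndk[d][t] + alpha) * (nkw[(t, w)] + beta)
                           / (nk[t] + V * beta) for t in range(num_topics)]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[(k, w)] += 1; nk[k] += 1
    return ndk, nkw

docs = [["apple", "banana", "apple"], ["cpu", "gpu", "cpu"],
        ["banana", "apple"], ["gpu", "cpu", "gpu"]]
ndk, nkw = lda_gibbs(docs, num_topics=2)
```

The sampler's inner loop is what the distributed systems cited by the paper (AD-LDA, PLDA, and Peacock itself) parallelize; the per-token cost scales with the number of topics, which is why reaching 10⁵ topics requires the sparsity and partitioning tricks the article describes.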
Index Terms
- Peacock: Learning Long-Tail Topic Features for Industrial Applications