Peacock: Learning Long-Tail Topic Features for Industrial Applications

Published: 15 July 2015

Abstract

Latent Dirichlet allocation (LDA) is a popular topic modeling technique in academia but less so in industry, especially in large-scale applications involving search engines and online advertising systems. A main underlying reason is that the topic models used have been too small in scale to be useful; for example, some of the largest LDA models reported in the literature have up to 10^3 topics, which can hardly cover long-tail semantic word sets. In this article, we show that the number of topics is a key factor that can significantly boost the utility of topic-modeling systems. In particular, we show that a “big” LDA model with at least 10^5 topics inferred from 10^9 search queries can achieve a significant improvement on industrial search engine and online advertising systems, both of which serve hundreds of millions of users. We develop a novel distributed system called Peacock to learn big LDA models from big data. The main features of Peacock include a hierarchical distributed architecture, real-time prediction, and topic de-duplication. We empirically demonstrate that the Peacock system is capable of providing significant benefits via highly scalable LDA topic models for several industrial applications.
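As background for readers unfamiliar with LDA inference, the following is a minimal single-machine sketch of collapsed Gibbs sampling for LDA, the standard inference family for such models. This is an illustrative toy, not Peacock's distributed implementation; all names and parameter values are assumptions chosen for clarity.

```python
import random

def lda_gibbs(docs, num_words, K=2, alpha=0.1, beta=0.01, iters=300, seed=0):
    """Collapsed Gibbs sampling for LDA on a toy corpus.

    docs: list of documents, each a list of word ids in [0, num_words).
    Returns the posterior-mean document-topic distributions (theta).
    """
    rng = random.Random(seed)
    ndk = [[0] * K for _ in docs]              # doc-topic counts
    nkw = [[0] * num_words for _ in range(K)]  # topic-word counts
    nk = [0] * K                               # tokens per topic
    z = []                                     # topic assignment per token
    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    # Gibbs sweeps: resample each token from its full conditional.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                probs = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) /
                         (nk[j] + num_words * beta) for j in range(K)]
                r = rng.random() * sum(probs)
                for j in range(K):
                    r -= probs[j]
                    if r <= 0:
                        k = j
                        break
                else:
                    k = K - 1  # guard against floating-point underflow
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # Smoothed doc-topic distributions; each row sums to 1.
    return [[(ndk[d][j] + alpha) / (len(docs[d]) + K * alpha)
             for j in range(K)] for d in range(len(docs))]

# Toy corpus: two disjoint vocabularies -> two recoverable topics.
docs = [[0, 1, 0, 1, 0, 1], [2, 3, 2, 3, 2, 3]]
theta = lda_gibbs(docs, num_words=4, K=2)
```

Peacock's contribution is precisely that this per-token sampling loop does not scale to 10^5 topics and 10^9 documents on one machine, motivating the hierarchical distributed architecture described in the article.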


• Published in

  ACM Transactions on Intelligent Systems and Technology, Volume 6, Issue 4: Regular Papers and Special Section on Intelligent Healthcare Informatics, August 2015, 419 pages.
  ISSN: 2157-6904
  EISSN: 2157-6912
  DOI: 10.1145/2801030
  Editor: Yu Zheng

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 15 July 2015
      • Accepted: 1 December 2014
      • Revised: 1 October 2014
      • Received: 1 May 2014
Published in TIST Volume 6, Issue 4

Permissions

Request permissions for this article.

      Qualifiers

      • research-article
      • Research
      • Refereed
