DOI: 10.1145/3357384.3358045
research-article
Open Access

AIBox: CTR Prediction Model Training on a Single Node

Published: 03 November 2019

ABSTRACT

Baidu, one of the major search engines in the world, has used deep neural network (DNN) models for ad click-through rate (CTR) prediction in its Sponsored Search since as early as 2013. The input features used by Baidu's online advertising system (a.k.a. "Phoenix Nest") are extremely high-dimensional (e.g., hundreds or even thousands of billions of features) and also extremely sparse. The size of the CTR models used by Baidu's production system can well exceed 10TB. This imposes tremendous challenges for training, updating, and using such models in production. For Baidu's Ads system, it is important to keep the model training process highly efficient so that engineers and researchers can quickly refine and test new models or new features. Moreover, because billions of user ad-click history entries arrive every day and CTR prediction is an extremely time-sensitive task, the models have to be re-trained rapidly. Baidu's current CTR models are trained on MPI (Message Passing Interface) clusters, which require high fault tolerance and synchronization that incur expensive communication and computation costs. The maintenance costs for clusters are also substantial. This paper presents AIBox, a centralized system to train CTR models with tens-of-terabytes-scale parameters by employing solid-state drives (SSDs) and GPUs. Due to the memory limitation on GPUs, we carefully partition the CTR model into two parts: one suitable for CPUs and the other for GPUs. We further introduce a bi-level cache management system over SSDs to store the 10TB parameters while providing low-latency access. Extensive experiments on production data reveal the effectiveness of the new system. AIBox has training performance comparable to that of a large MPI cluster, while requiring only a small fraction of the cluster's cost.

Published in

CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management
November 2019
3373 pages
ISBN: 9781450369763
DOI: 10.1145/3357384

Copyright © 2019 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

CIKM '19 Paper Acceptance Rate: 202 of 1,031 submissions, 20%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
