AIBox: CTR Prediction Model Training on a Single Node

Authors:
Weijie Zhao (Baidu Research USA, Sunnyvale, CA, USA)
Jingyuan Zhang (Baidu Research USA, Sunnyvale, CA, USA)
Deping Xie (Baidu Inc., Beijing, China)
Yulei Qian (Baidu Inc., Beijing, China)
Ronglai Jia (Baidu Inc., Beijing, China)
Ping Li (Baidu Research USA, Bellevue, WA, USA)

CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, November 2019, Pages 319–328
https://doi.org/10.1145/3357384.3358045

Published: 03 November 2019

ABSTRACT

As one of the major search engines in the world, Baidu's Sponsored Search has used deep neural network (DNN) models for ad click-through rate (CTR) prediction since as early as 2013. The input features used by Baidu's online advertising system (a.k.a. "Phoenix Nest") are extremely high-dimensional (e.g., hundreds or even thousands of billions of features) and also extremely sparse. The size of the CTR models used by Baidu's production system can well exceed 10TB. This imposes tremendous challenges for training, updating, and using such models in production. For Baidu's Ads system, it is important to keep the model training process highly efficient so that engineers and researchers can quickly refine and test new models or new features. Moreover, because billions of user ad-click history entries arrive every day and CTR prediction is an extremely time-sensitive task, the models have to be re-trained rapidly. Baidu's current CTR models are trained on MPI (Message Passing Interface) clusters, which require high fault tolerance and synchronization that incur expensive communication and computation costs. The maintenance costs for such clusters are also substantial. This paper presents AIBox, a centralized system that trains CTR models with tens-of-terabytes-scale parameters by employing solid-state drives (SSDs) and GPUs. Because of the memory limitation on GPUs, we carefully partition the CTR model into two parts: one suited to CPUs and the other to GPUs. We further introduce a bi-level cache management system over SSDs to store the 10TB parameters while providing low-latency access. Extensive experiments on production data demonstrate the effectiveness of the new system. AIBox achieves training performance comparable to that of a large MPI cluster while requiring only a small fraction of the cluster's cost.
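
The abstract does not describe how the bi-level cache is organized, so the following is only a minimal sketch of the general idea of placing a small, fast level in front of a large, slow one. It is not AIBox's design. It assumes an LRU policy for the in-memory level and uses a plain dictionary as a stand-in for SSD-resident storage; the names `TwoLevelCache`, `mem_capacity`, and `slow_lookups` are hypothetical.

```python
from collections import OrderedDict


class TwoLevelCache:
    """Hypothetical two-level parameter store: a small LRU level in
    memory in front of a large, slow level (a dict standing in for SSD)."""

    def __init__(self, mem_capacity):
        self.mem_capacity = mem_capacity
        self.mem = OrderedDict()  # fast level, least recently used first
        self.ssd = {}             # stand-in for SSD-resident parameters
        self.slow_lookups = 0     # number of times the slow level was consulted

    def get(self, key, default=0.0):
        """Return the parameter for key, promoting it to the fast level."""
        if key in self.mem:
            self.mem.move_to_end(key)  # mark as most recently used
            return self.mem[key]
        # Miss in the fast level: consult the slow level.
        self.slow_lookups += 1
        value = self.ssd.get(key, default)
        self._admit(key, value)
        return value

    def put(self, key, value):
        """Write an updated parameter into the fast level."""
        self._admit(key, value)

    def _admit(self, key, value):
        self.mem[key] = value
        self.mem.move_to_end(key)
        # Evict least recently used entries, writing them back to the slow level.
        while len(self.mem) > self.mem_capacity:
            old_key, old_value = self.mem.popitem(last=False)
            self.ssd[old_key] = old_value


cache = TwoLevelCache(mem_capacity=2)
for name, weight in [("f1", 0.1), ("f2", 0.2), ("f3", 0.3)]:
    cache.put(name, weight)       # the third put evicts "f1" to the slow level
print(cache.get("f1"), cache.slow_lookups)
```

With sparse features, only a small share of parameters is touched by any one mini-batch, which is why a small fast level can absorb most accesses in a scheme of this kind.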



Published in
CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management
November 2019
3373 pages
ISBN: 9781450369763
DOI: 10.1145/3357384
General Chairs: Wenwu Zhu (Tsinghua University, China), Dacheng Tao (University of Massachusetts, USA), Xueqi Cheng (Institute of Computing Technology, CAS, China)
Program Chairs: Peng Cui (Tsinghua University, China), Elke Rundensteiner (Worcester Polytechnic Institute, USA), David Carmel (Amazon Research, USA), Qi He (LinkedIn, USA), Jeffrey Xu Yu (Chinese University of Hong Kong, China)
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
gpu computing
sponsored search
ssd cache management
Qualifiers
- research-article
Acceptance Rates
CIKM '19 Paper Acceptance Rate: 202 of 1,031 submissions, 20%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%