ABSTRACT
As one of the major search engines in the world, Baidu's Sponsored Search has used deep neural network (DNN) models for ad click-through rate (CTR) prediction since as early as 2013. The input features used by Baidu's online advertising system (a.k.a. "Phoenix Nest") are extremely high-dimensional (e.g., hundreds or even thousands of billions of features) and also extremely sparse. The size of the CTR models used by Baidu's production system can well exceed 10TB. This imposes tremendous challenges for training, updating, and using such models in production. For Baidu's Ads system, it is important to keep the model training process highly efficient so that engineers and researchers can quickly refine and test new models or new features. Moreover, because billions of user ad-click history entries arrive every day and CTR prediction is an extremely time-sensitive task, the models have to be re-trained rapidly. Baidu's current CTR models are trained on MPI (Message Passing Interface) clusters, which require high fault tolerance and synchronization that incur expensive communication and computation costs. The maintenance costs for such clusters are also substantial. This paper presents AIBox, a centralized system that trains CTR models with tens-of-terabytes-scale parameters by employing solid-state drives (SSDs) and GPUs. Due to the memory limitation on GPUs, we carefully partition the CTR model into two parts: one suitable for CPUs and the other for GPUs. We further introduce a bi-level cache management system over SSDs to store the 10TB of parameters while providing low-latency access. Extensive experiments on production data reveal the effectiveness of the new system. AIBox achieves training performance comparable to that of a large MPI cluster, while requiring only a small fraction of the cluster's cost.
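To make the idea of caching SSD-resident parameters in faster memory concrete, the following is a minimal Python sketch, not the AIBox implementation. The abstract does not specify the cache policy or data layout, so everything here is an assumption for illustration: the class name `TwoLevelParameterStore`, the LRU eviction policy, the write-back behaviour, and the plain `dict` that stands in for the SSD.

```python
from collections import OrderedDict


class TwoLevelParameterStore:
    """Toy two-level store: a small in-memory LRU cache over a slower backing store.

    Assumption: a plain dict stands in for SSD-resident parameters.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity      # maximum number of entries kept in memory
        self.memory = OrderedDict()   # key -> value, ordered from least to most recent
        self.dirty = set()            # keys modified in memory but not yet written back
        self.ssd = {}                 # hypothetical stand-in for SSD storage
        self.hits = 0
        self.misses = 0

    def get(self, key, default=0.0):
        """Return a parameter, loading it from the backing store on a miss."""
        if key in self.memory:
            self.hits += 1
            self.memory.move_to_end(key)  # mark as most recently used
            return self.memory[key]
        self.misses += 1
        value = self.ssd.get(key, default)
        self._admit(key, value)
        return value

    def put(self, key, value):
        """Update a parameter in memory; it is written back on eviction or flush."""
        self.dirty.add(key)
        self._admit(key, value)

    def _admit(self, key, value):
        """Insert or refresh an entry, evicting least recently used entries if full."""
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.capacity:
            old_key, old_value = self.memory.popitem(last=False)
            if old_key in self.dirty:
                self.ssd[old_key] = old_value  # write back before dropping
                self.dirty.discard(old_key)

    def flush(self):
        """Write every modified in-memory entry to the backing store."""
        for key in list(self.dirty):
            self.ssd[key] = self.memory[key]
        self.dirty.clear()
```

In this sketch, frequently touched parameters stay in memory and only evicted or flushed entries reach the backing store, which is the general reason a cache in front of SSDs can keep access latency low for sparse, skewed workloads.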
Index Terms
- AIBox: CTR Prediction Model Training on a Single Node