research-article

Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training

Authors:
Anqi Guo

Boston University, Boston, USA

University of Rochester, Rochester, USA

Boston University, Boston, USA

University of Rochester, Rochester, USA

https://orcid.org/0000-0001-5872-4464
View Profile

,
Yuchen Hao

Meta Platforms, San Francisco, USA

Meta Platforms, San Francisco, USA

https://orcid.org/0009-0005-8513-9566
View Profile

,
Chunshu Wu

Boston University, Boston, USA

Boston University, Boston, USA

https://orcid.org/0009-0006-2039-0853
View Profile

,
Pouya Haghi

Boston University, Boston, USA

Boston University, Boston, USA

https://orcid.org/0000-0003-2893-9194
View Profile

,
Zhenyu Pan

University of Rochester, Rochester, USA

University of Rochester, Rochester, USA

https://orcid.org/0009-0009-2805-6018
View Profile

,
Min Si

Meta Platforms, San Francisco, United States of America

Meta Platforms, San Francisco, United States of America

https://orcid.org/0000-0002-0208-096X
View Profile

,
Dingwen Tao

Indiana University, Bloomington, United States of America

Indiana University, Bloomington, United States of America

https://orcid.org/0000-0001-5422-4497
View Profile

,
Ang Li

Pacific Northwest National Laboratory, Richland, United States of America

Pacific Northwest National Laboratory, Richland, United States of America

https://orcid.org/0000-0003-3734-9137
View Profile

,
Martin Herbordt

Boston University, Boston, United States of America

Boston University, Boston, United States of America

https://orcid.org/0000-0002-3443-9113
View Profile

,
Tong Geng

University of Rochester, Rochester, United States of America

University of Rochester, Rochester, United States of America

https://orcid.org/0000-0002-3644-2922
View Profile

ICS '23: Proceedings of the 37th International Conference on SupercomputingJune 2023Pages 336–347https://doi.org/10.1145/3577193.3593724

Published:21 June 2023Publication History

ICS '23: Proceedings of the 37th International Conference on Supercomputing

Pages 336–347

ABSTRACT

Deep Learning Recommendation Models (DLRMs) are important applications in various domains and have evolved into one of the largest and most important machine learning applications. With their trillions of parameters necessarily exceeding the high bandwidth memory (HBM) capacity of GPUs, ever more massive DLRMs require large-scale multi-node systems for distributed training and inference. However, these all suffer from the all-to-all communication bottleneck, which limits scalability.

SmartNICs couple computation and communication capabilities to provide powerful network-facing heterogeneous devices that reduce communication overhead. There has not, however, been a distributed system design that fully leverages SmartNIC resources to address scalability of DLRMs.

We propose a software-hardware co-design of a heterogeneous SmartNIC system that overcomes the communication bottleneck of distributed DLRMs, mitigates the pressure on memory bandwidth, and improves computation efficiency. We provide a set of SmartNIC designs of cache systems (including local cache and remote cache) and SmartNIC computation kernels that reduce data movement, relieve memory lookup intensity, and improve the GPU's computation efficiency. In addition, we propose a graph algorithm that improves the data locality of queries within batches and optimizes the overall system performance with higher data reuse. Our evaluation shows that the system achieves 2.1× latency speedup for inference and 1.6× throughput speedup for training.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. Google ScholarCross Ref
C. Bobda, J. Mandebi, P. Chow, M. Ewais, N. Tarafdar, J.C. Vega, K. Eguro, D. Koch, S. Handagala, M. Leeser, M.C. Herbordt, H. Shahzad, P. Hofstee, B. Ringlein, J. Szefer, A. Sanaullah, and R. Tessier. 2022. The Future of FPGA Acceleration in Datacenters and the Cloud. ACM Transactions on Reconfigurable Technology and Systems 15, 3 (2022), 1--42. Google ScholarDigital Library
Broadcom. 2019. Stingray PS250 2x50-Gb High-Performance Data Center Smart-NIC. https://docs.broadcom.com/doc/PS250-PBGoogle Scholar
Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A cloud-scale acceleration architecture. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1--13. Google ScholarCross Ref
Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. New York, NY, USA.Google ScholarDigital Library
Assaf Eisenman, Maxim Naumov, Darryl Gardner, Misha Smelyanskiy, Sergey Pupyrev, Kim Hazelwood, Asaf Cidon, and Sachin Katti. 2018. Bandana: Using Non-volatile Memory for Storing Deep Learning Models. Google ScholarCross Ref
facebookresearch. 2019. Deep Learning Recommendation Model for Personalization and Recommendation Systems. https://github.com/facebookresearch/dlrmGoogle Scholar
Carlos A. Gomez-Uribe and Neil Hunt. 2016. The Netflix Recommender System: Algorithms, Business Value, and Innovation. ACM Trans. Manage. Inf. Syst. 6, 4, Article 13 (dec 2016), 19 pages. Google ScholarDigital Library
Anqi Guo, Tong Geng, Yongan Zhang, Pouya Haghi, Chunshu Wu, Cheng Tan, Yingyan Lin, Ang Li, and Martin Herbordt. 2022. FCsN: A FPGA-Centric Smart-NIC Framework for Neural Networks. In 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 1--2. Google ScholarCross Ref
Anqi Guo, Tong Geng, Yongan Zhang, Pouya Haghi, Chunshu Wu, Cheng Tan, Yingyan Lin, Ang Li, and Martin Herbordt. 2022. A Framework for Neural Network Inference on FPGA-Centric SmartNICs. In 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). 01--08. Google ScholarCross Ref
Udit Gupta, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Bill Jia, Hsien-Hsin S. Lee, Andrey Malevich, Dheevatsa Mudigere, Mikhail Smelyanskiy, Liang Xiong, and Xuan Zhang. 2019. The Architectural Implications of Facebook's DNN-based Personalized Recommendation. Google ScholarCross Ref
P. Haghi, A. Guo, Q. Xiong, C. Yang, T. Geng, J.T. Broaddus, R. Marshall, D. Schafer, A. Skjellum, and M.C. Herbordt. 2022. Reconfigurable switches for high performance and flexible MPI collectives. Concurrency and Computation: Practice and Experience 34, 2 (2022). Google ScholarCross Ref
P. Haghi, W. Krska, C. Tan, T. Geng, P.H. Chen, C. Greenwood, A. Guo, T. Hines, C. Wu, A. Li, A. Skjellum, and M.C. Herbordt. 2023. FLASH: FPGA-Accelerated Smart Switches with GCN Case Study. In ICS 2023: International Conference on Supercomputing.Google Scholar
Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, Pieter Noordhuis, Misha Smelyanskiy, Liang Xiong, and Xiaodong Wang. 2018. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). 620--629. Google ScholarCross Ref
Torsten Hoefler, Salvatore Di Girolamo, Konstantin Taranov, Ryan E. Grant, and Ron Brightwell. 2017. SPIN: High-Performance Streaming Processing In the Network. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, Colorado) (SC '17). Association for Computing Machinery, New York, NY, USA, Article 59, 16 pages. Google ScholarDigital Library
Intel. 2021. Intel® Infrastructure Processing Unit (Intel® IPU). https://www.intel.com/content/www/us/en/products/network-io/smartnic.htmlGoogle Scholar
Intel. 2022. Intel® FPGA SmartNIC. https://www.intel.com/content/www/us/en/products/details/fpga/platforms/smartnic.htmlGoogle Scholar
R.G. Jaganathan, K.D. Underwood, and R. Sass. 2003. A configurable network protocol for cluster based communications using modular hardware primitives on an intelligent NIC. In 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2003. FCCM 2003. 286--287. Google ScholarCross Ref
Wenqi Jiang, Zhenhao He, Shuai Zhang, Thomas B. Preußer, Kai Zeng, Liang Feng, Jiansong Zhang, Tongxuan Liu, Yong Li, Jingren Zhou, Ce Zhang, and Gustavo Alonso. 2020. MicroRec: Efficient Recommendation Inference by Hardware and Data Structure Solutions. Google ScholarCross Ref
Jie. 2020. Training Deep Learning Recommendation Model with Quantized Collective Communications.Google Scholar
Venkata Krishnan, Olivier Serres, and Michael Blocksome. 2020. COnfigurable Network Protocol Accelerator (COPA) † : An Integrated Networking/Accelerator Hardware/Software Framework. In 2020 IEEE Symposium on High-Performance Interconnects (HOTI). 17--24. Google ScholarCross Ref
Youngeun Kwon and Minsoo Rhu. 2022. Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward Not Backwards. In Proceedings of the 49th Annual International Symposium on Computer Architecture (New York, New York) (ISCA '22). Association for Computing Machinery, New York, NY, USA, 860--873. Google ScholarDigital Library
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. Google ScholarCross Ref
Hyeontaek Lim, David G. Andersen, and Michael Kaminsky. 2018. 3LC: Lightweight and Effective Traffic Compression for Distributed Machine Learning. Google ScholarCross Ref
Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. 2017. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. (2017). Google ScholarCross Ref
Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Zhihao Jia, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, Jie (Amy) Yang, Leon Gao, Dmytro Ivchenko, Aarti Basant, Yuxi Hu, Jiyan Yang, Ehsan K. Ardestani, Xiaodong Wang, Rakesh Komuravelli, Ching-Hsiang Chu, Serhat Yilmaz, Huayu Li, Jiyuan Qian, Zhuobo Feng, Yinbin Ma, Junjie Yang, Ellie Wen, Hong Li, Lin Yang, Chonglin Sun, Whitney Zhao, Dimitry Melts, Krishna Dhulipala, KR Kishore, Tyler Graf, Assaf Eisenman, Kiran Kumar Matam, Adi Gangidi, Guoqiang Jerry Chen, Manoj Krishnan, Avinash Nayak, Krishnakumar Nair, Bharath Muthiah, Mahmoud khorashadi, Pallab Bhattacharya, Petr Lapukhov, Maxim Naumov, Ajit Mathews, Lin Qiao, Mikhail Smelyanskiy, Bill Jia, and Vijay Rao. 2022. Software-Hardware Co-Design for Fast and Scalable Training of Deep Learning Recommendation Models. In Proceedings of the 49th Annual International Symposium on Computer Architecture (New York, New York) (ISCA '22). Association for Computing Machinery, New York, NY, USA, 993--1011. Google ScholarDigital Library
Maxim Naumov, John Kim, Dheevatsa Mudigere, Srinivas Sridharan, Xiaodong Wang, Whitney Zhao, Serhat Yilmaz, Changkyu Kim, Hector Yuen, Mustafa Ozdal, Krishnakumar Nair, Isabel Gao, Bor-Yiing Su, Jiyan Yang, and Mikhail Smelyanskiy. 2020. Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems. Google ScholarCross Ref
Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. 2019. Deep Learning Recommendation Model for Personalization and Recommendation Systems. Google ScholarCross Ref
Nvidia. 2021. NVIDIA BLUEFIELD-2 DPU. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-2-dpu.pdfGoogle Scholar
Jongsoo Park, Maxim Naumov, Protonu Basu, Summer Deng, Aravind Kalaiah, Daya Khudia, James Law, Parth Malani, Andrey Malevich, Satish Nadathur, Juan Pino, Martin Schatz, Alexander Sidorov, Viswanath Sivakumar, Andrew Tulloch, Xiaodong Wang, Yiming Wu, Hector Yuen, Utku Diril, Dmytro Dzhulgakov, Kim Hazelwood, Bill Jia, Yangqing Jia, Lin Qiao, Vijay Rao, Nadav Rotem, Sungjoo Yoo, and Mikhail Smelyanskiy. 2018. Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications. Google ScholarCross Ref
Whit Schonbein, Ryan E. Grant, Matthew G. F. Dosanjh, and Dorian Arnold. 2019. INCA: In-Network Compute Assistance. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, Colorado) (SC '19). Association for Computing Machinery, New York, NY, USA, Article 54, 13 pages. Google ScholarDigital Library
Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. Google ScholarCross Ref
Geet Sethi, Bilge Acun, Niket Agarwal, Christos Kozyrakis, Caroline Trippel, and Carole-Jean Wu. 2022. RecShard: Statistical Feature-Based Memory Optimization for Industry-Scale Neural Recommendation. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS '22). Association for Computing Machinery, New York, NY, USA, 344--358. Google ScholarDigital Library
H. Shahzad, A. Sanaullah, and M.C. Herbordt. 2021. Survey and Future Trends for FPGA Cloud Architectures. In IEEE High Performance Extreme Computing Conference. Google ScholarCross Ref
Hao-Jun Michael Shi, Dheevatsa Mudigere, Maxim Naumov, and Jiyan Yang. 2020. Compositional Embeddings Using Complementary Partitions for Memory-Efficient Recommendation Systems. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery amp; Data Mining (Virtual Event, CA, USA) (KDD '20). Association for Computing Machinery, New York, NY, USA, 165--175. Google ScholarDigital Library
Brent Smith and Greg Linden. 2017. Two Decades of Recommender Systems at Amazon.com. IEEE Internet Computing 21, 3 (2017), 12--18. Google ScholarDigital Library
Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford, and Alex Smola. 2009. Feature Hashing for Large Scale Multitask Learning. Google ScholarCross Ref
Mark Wilkening, Udit Gupta, Samuel Hsia, Caroline Trippel, Carole-Jean Wu, David Brooks, and Gu-Yeon Wei. 2021. RecSSD: Near Data Processing for Solid State Drive Based Recommendation Inference. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Virtual, USA) (ASPLOS '21). Association for Computing Machinery, New York, NY, USA, 717--729. Google ScholarDigital Library
Xilinx. 2020. Alveo U25 SmartNIC Accelerator Card. https://www.xilinx.com/products/boards-and-kits/alveo/u25.htmlGoogle Scholar
Xilinx. 2022. The Industry's First SmartNIC With Composable Hardware. https://www.xilinx.com/applications/data-center/network-acceleration/alveo-sn1000.htmlGoogle Scholar
Jie Amy Yang, Jianyu Huang, Jongsoo Park, Ping Tak Peter Tang, and Andrew Tulloch. 2020. Mixed-Precision Embedding Using a Cache. Google ScholarCross Ref
Chunxing Yin, Bilge Acun, Xing Liu, and Carole-Jean Wu. 2021. TT-Rec: Tensor Train Compression for Deep Learning Recommendation Models. Google ScholarCross Ref
Weijie Zhao, Deping Xie, Ronglai Jia, Yulei Qian, Ruiquan Ding, Mingming Sun, and Ping Li. 2020. Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems. Google ScholarCross Ref
Yu Zhu, Zhenhao He, Wenqi Jiang, Kai Zeng, Jingren Zhou, and Gustavo Alonso. 2021. Distributed Recommendation Inference on FPGA Clusters. In 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). 279--285. Google ScholarCross Ref

Index Terms

Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
      2. Neural networks

Recommendations

hKVS: a framework for designing a high throughput heterogeneous key-value store with SmartNIC and RDMA
RACS '22: Proceedings of the Conference on Research in Adaptive and Convergent Systems

In-memory key-value store (KVS) is a crucial component of data center applications. Since DRAM provides high bandwidth and low latency, the major performance bottleneck of common in-memory KVS lies in the network stack. Prior works have attempted to ...
Read More
Buffer Filter: A Last-Level Cache Management Policy for CPU-GPGPU Heterogeneous System
HPCC-CSS-ICESS '15: Proceedings of the 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conf on Embedded Software and Systems

There is a growing trend towards heterogeneous systems, which contain CPUs and GPGPUs in a single chip. Managing those various on-chip resources shared between CPUs and GPGPUs, however, is a big issue and the last-level cache (LLC) is one of the most ...
Read More
Rapid analysis of interprocessor communications on heterogeneous system architectures via parallel cache emulation
RACS '15: Proceedings of the 2015 Conference on research in adaptive and convergent systems

The recently proposed heterogeneous system architecture (HSA) specifications enable shared-memory-based interprocessor communications between CPU cores and GPU cores via a flat coherent address space and memory-based signals to reduce explicit data copy ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
ICS '23: Proceedings of the 37th International Conference on Supercomputing
June 2023
505 pages
ISBN:9798400700569
DOI:10.1145/3577193
Chair:
Kyle Gallivan,
Co-chair:
Efstratios Gallopoulos,
Program Co-chairs:
Dimitrios S. Nikolopoulos,
Ramon Beivide
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 June 2023
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
recommendation system
SmartNIC
heterogeneous system
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate584of2,055submissions,28%
Upcoming Conference
ICS '24

Sponsor:

sigarch

2024 International Conference on Supercomputing

June 4 - 7, 2024

Kyoto , Japan
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 454
  Total Downloads
- Downloads (Last 12 months)454
- Downloads (Last 6 weeks)39
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training

ICS '23: Proceedings of the 37th International Conference on Supercomputing

ABSTRACT

References

Cited By

Index Terms

Recommendations

hKVS: a framework for designing a high throughput heterogeneous key-value store with SmartNIC and RDMA

Buffer Filter: A Last-Level Cache Management Policy for CPU-GPGPU Heterogeneous System

Rapid analysis of interprocessor communications on heterogeneous system architectures via parallel cache emulation

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training

ICS '23: Proceedings of the 37th International Conference on Supercomputing

ABSTRACT

References

Cited By

Index Terms

Recommendations

hKVS: a framework for designing a high throughput heterogeneous key-value store with SmartNIC and RDMA

Buffer Filter: A Last-Level Cache Management Policy for CPU-GPGPU Heterogeneous System

Rapid analysis of interprocessor communications on heterogeneous system architectures via parallel cache emulation

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media