ABSTRACT
Some communication switches, e.g., the Mellanox SHArP and those in the IBM BlueGene clusters, are augmented to process packets at the application level with fixed-function collectives. This approach, however, lacks flexibility, which limits its applicability to diverse and dynamic workloads. Recently, a new type of programmable packet processor, programmed in high-level languages such as P4, has emerged as a possible candidate. P4-based switches, however, fall short in certain applications, including machine learning, that need capabilities P4 does not currently support. These include more complex calculations, such as sparse computation and fused multiply-accumulate, data-intensive floating-point operations, data reuse, and significant memory. The problem addressed here is that such a switch augmentation needs to support a large amount of state, significant and flexible compute capability, and ease of programming, all while preserving the full functionality of the switch, including high throughput, and demonstrating utility.
In this work, we propose a programmable look-aside-type accelerator that can be embedded into, or attached to, existing communication switch pipelines and that is capable of processing packets at line rate. The proposed in-switch accelerator combines an ISA (a subset of RISC-V instructions) with dataflow graphs of the kind found in CGRAs. To improve performance, vector instructions are also supported. To facilitate usability, we have developed a complete toolchain that compiles user-provided C/C++ code into the back-end instructions that configure the accelerator. While this approach is flexible enough to support various workloads, in this paper we consider Graph Convolutional Networks (GCNs) as a case study. Experimental results show that this approach considerably improves the performance of distributed GCN applications.
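To make the target workload concrete, the following is a minimal C++ sketch of the kind of kernel a GCN layer relies on: a sparse neighbor aggregation (sparse matrix times dense feature matrix) built from fused multiply-accumulate operations. It is an illustration only and not the authors' code; the `CsrMatrix` type, the `aggregate` function, and the CSR layout are assumptions introduced here, and nothing in it reflects the actual toolchain input format or accelerator instructions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CSR (compressed sparse row) adjacency matrix. The names and
// layout are illustrative assumptions, not taken from the paper.
struct CsrMatrix {
    std::size_t num_rows;
    std::vector<std::size_t> row_ptr;  // size num_rows + 1
    std::vector<std::size_t> col_idx;  // neighbor index of each nonzero
    std::vector<float> values;         // edge weight of each nonzero
};

// Computes out = A * X, where X is a dense row-major matrix with feat_dim
// columns. Each output row accumulates the weighted features of its
// neighbors using a fused multiply-accumulate.
std::vector<float> aggregate(const CsrMatrix& a,
                             const std::vector<float>& x,
                             std::size_t feat_dim) {
    assert(a.row_ptr.size() == a.num_rows + 1);
    assert(a.col_idx.size() == a.values.size());
    std::vector<float> out(a.num_rows * feat_dim, 0.0f);
    for (std::size_t r = 0; r < a.num_rows; ++r) {
        // Visit only the stored nonzeros of row r (the sparse part).
        for (std::size_t k = a.row_ptr[r]; k < a.row_ptr[r + 1]; ++k) {
            const std::size_t c = a.col_idx[k];
            const float w = a.values[k];
            for (std::size_t f = 0; f < feat_dim; ++f) {
                // out[r][f] += w * x[c][f], evaluated as a single fused op.
                out[r * feat_dim + f] =
                    std::fma(w, x[c * feat_dim + f], out[r * feat_dim + f]);
            }
        }
    }
    return out;
}
```

The innermost loop runs over the feature dimension with unit stride, which is the sort of loop that vector instructions could plausibly cover, while the irregular, data-dependent accesses through `col_idx` illustrate why data reuse and significant memory matter for this workload.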
Index Terms
- FLASH: FPGA-Accelerated Smart Switches with GCN Case Study