DOI: 10.1145/3577193.3593739
research-article

FLASH: FPGA-Accelerated Smart Switches with GCN Case Study

Published: 21 June 2023

ABSTRACT

Some communication switches, e.g., the Mellanox SHArP and those in the IBM BlueGene clusters, are augmented to process packets at the application level with fixed-function collectives. This approach, however, lacks flexibility, which limits its applicability to diverse and dynamic workloads. Recently, a new type of programmable packet processor, which is programmed in high-level languages such as P4, has emerged as a possible candidate. P4-based switches, however, fall short in certain applications, including machine learning, where capabilities not currently supported by P4 are needed. These include more complex calculations, such as sparse computation and fused multiply-accumulate, data-intensive floating point operations, data reuse, and significant memory. The problem addressed here is that such a switch augmentation must support a large amount of state, significant and flexible compute capability, and ease of programming, all while maintaining full switch functionality, including high throughput, and demonstrating utility.

In this work, we propose a programmable look-aside-type accelerator that can be embedded into, or attached to, existing communication switch pipelines and that is capable of processing packets at line rate. The proposed in-switch accelerator combines an ISA (a subset of RISC-V instructions) with dataflow graphs (as found in CGRAs). To improve performance, vector instructions are also supported. To facilitate usability, we have developed a complete toolchain that compiles user-provided C/C++ code to the appropriate back-end instructions for configuring the accelerator. While this approach is flexible enough to support various workloads, in this paper we consider Graph Convolutional Networks (GCNs) as a case study. Experimental results show that this approach considerably improves the performance of distributed GCN applications.
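To make the workload concrete, the following is a minimal sketch of the kind of sparse, multiply-accumulate-heavy kernel that a GCN layer contains: the neighbor aggregation step, computed as a sparse adjacency matrix times a dense feature matrix. It is not taken from the paper and is not the authors' implementation; the names `CsrMatrix`, `Features`, and `aggregate` are hypothetical, and the CSR layout is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only (not from the paper).
// Sparse adjacency matrix in compressed sparse row (CSR) form.
struct CsrMatrix {
    std::size_t num_rows;
    std::vector<std::size_t> row_ptr;  // size num_rows + 1
    std::vector<std::size_t> col_idx;  // neighbor index of each nonzero
    std::vector<float> values;         // edge weight of each nonzero
};

// Dense feature matrix stored row-major: one row of num_cols values per vertex.
using Features = std::vector<float>;

// Computes out = A * X. Each vertex accumulates the weighted feature
// vectors of its neighbors using a fused multiply-add per element.
Features aggregate(const CsrMatrix& a, const Features& x, std::size_t num_cols) {
    assert(a.row_ptr.size() == a.num_rows + 1);
    assert(a.col_idx.size() == a.values.size());
    Features out(a.num_rows * num_cols, 0.0f);
    for (std::size_t v = 0; v < a.num_rows; ++v) {
        for (std::size_t e = a.row_ptr[v]; e < a.row_ptr[v + 1]; ++e) {
            const std::size_t u = a.col_idx[e];  // neighbor vertex
            const float w = a.values[e];         // edge weight
            for (std::size_t f = 0; f < num_cols; ++f) {
                out[v * num_cols + f] =
                    std::fma(w, x[u * num_cols + f], out[v * num_cols + f]);
            }
        }
    }
    return out;
}
```

The inner loop over `f` is a regular, fixed-stride multiply-accumulate, while the loop over `e` follows the irregular sparsity pattern of the graph. This mix of sparse indexing and dense floating point work is what the abstract lists as missing from P4.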

        Published in

        ICS '23: Proceedings of the 37th International Conference on Supercomputing
        June 2023
        505 pages
        ISBN: 9798400700569
        DOI: 10.1145/3577193

        Copyright © 2023 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 584 of 2,055 submissions, 28%
