research-article

FLASH: FPGA-Accelerated Smart Switches with GCN Case Study

Authors:
Pouya Haghi

Boston University, Boston, USA

Boston University, Boston, USA

https://orcid.org/0000-0003-2893-9194
View Profile

,
William Krska

Boston University, Boston, USA

Boston University, Boston, USA

https://orcid.org/0009-0000-7259-5961
View Profile

,
Cheng Tan

Microsoft, Bellevue, USA

Microsoft, Bellevue, USA

https://orcid.org/0000-0003-3727-2889
View Profile

,
Tong Geng

University of Rochester, Rochester, USA

University of Rochester, Rochester, USA

https://orcid.org/0000-0002-3644-2922
View Profile

,
Po Hao Chen

Boston University, Boston, USA

Boston University, Boston, USA

https://orcid.org/0009-0005-8164-7061
View Profile

,
Connor Greenwood

Boston University, Boston, USA

Boston University, Boston, USA

https://orcid.org/0009-0005-4862-1169
View Profile

,
Anqi Guo

Boston University, Boston, USA

Boston University, Boston, USA

https://orcid.org/0000-0001-5872-4464
View Profile

,
Thomas Hines

University of Tennessee at Chattanooga, Chattanooga, USA

University of Tennessee at Chattanooga, Chattanooga, USA

https://orcid.org/0000-0001-9675-0399
View Profile

,
Chunshu Wu

Boston University, Boston, USA

Boston University, Boston, USA

https://orcid.org/0009-0006-2039-0853
View Profile

,
Ang Li

Pacific Northwest National Laboratory, Richland, United States of America

Pacific Northwest National Laboratory, Richland, United States of America

https://orcid.org/0000-0003-3734-9137
View Profile

,
Anthony Skjellum

University of Tennessee at Chattanooga, Chattanooga, United States of America

University of Tennessee at Chattanooga, Chattanooga, United States of America

https://orcid.org/0000-0001-5252-6600
View Profile

,
Martin Herbordt

Boston University, Boston, United States of America

Boston University, Boston, United States of America

https://orcid.org/0000-0002-3443-9113
View Profile

ICS '23: Proceedings of the 37th International Conference on SupercomputingJune 2023Pages 450–462https://doi.org/10.1145/3577193.3593739

Published:21 June 2023Publication History

ICS '23: Proceedings of the 37th International Conference on Supercomputing

Pages 450–462

ABSTRACT

Some communication switches, e.g., the Mellanox SHArP and those in the IBM BlueGene clusters, are augmented to process packets at the application level with fixed-function collectives. This approach, however, lacks flexibility, which limits their applicability in diverse and dynamic workloads. Recently, a new type of programmable packet processor, which uses high-level languages, e.g., P4, has emerged as a possible candidate. P4-based switches, however, fall short in certain applications, including machine learning, where capabilities not currently supported by P4 are needed. These include more complex calculation, such as sparse computation and fused multiply-accumulate, data-intensive floating point operations, data reuse, and significant memory. The problem addressed here is that such a switch augmentation needs to support: a large amount of state, significant flexible compute capability, and ease of programming, all while maintaining full functionality, including ensuring high throughput, and demonstrating utility.

In this work, we propose a programmable look-aside-type accelerator that can be embedded into, or attached to, existing communication switch pipelines and that is capable of processing packets at line rate. The proposed in-switch accelerator is based on mixing an ISA (subset of RISC-V instructions) with dataflow graphs (found in CGRAs). To augment performance, vector instructions are also supported. To facilitate usability, we have developed a complete toolchain to compile user-provided C/C++ codes to appropriate back-end instructions for configuring the accelerator. While this approach is flexible enough to support various workloads, in this paper, we consider Graph Convolutional Networks (GCNs) as a case study. Experimental results show that this approach considerably improves the performance of distributed GCN applications.

References

O. Arap and M. Swany. 2016. Offloading Collective Operations to Programmable Logic on a Zynq Cluster. In 2016 IEEE 24th Annual Symposium on High-Performance Interconnects (HOTI). 76--83.Google Scholar
Arista. 2023. 7130 FPGA-enabled Network Switches - Quick Look. www.arista.com/en/products/7130-fpga-enabled-network-switches-quick-look.Google Scholar
AWS. 2019. Deliver high performance ML inference with AWS Inferentia. https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Deliver_high_performance_ML_inference_with_AWS_Inferentia_CMP324-R1.pdf.Google Scholar
M. Bayatpour, N. Sarkauskas, H. Subramoni, J. Maqbool Hashmi, and D. K. Panda. 2021. BluesMPI: Efficient MPI Non-blocking Alltoall Offloading Designs on Modern BlueField Smart NICs. In High Performance Computing: 36th International Conference, ISC High Performance 2021. Springer, 18--37.Google Scholar
Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, and David Walker. 2014. P4: Programming Protocol-Independent Packet Processors. SIGCOMM Comput. Commun. Rev. 44, 3 (jul 2014), 87--95. Google ScholarDigital Library
Y. Chen, J. Emer, and V. Sze. 2017. Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators. IEEE Micro 37, 3 (2017), 12--21. Google ScholarDigital Library
D. De Sensi, S. Di Girolamo, S. Ashkboos, S. Li, and T. Hoefler. 2021. Flare: Flexible In-Network Allreduce. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1--16.Google Scholar
A. Faraj, S. Kumar, B. Smith, A. Mamidala, and J. Gunnels. 2009. MPI Collective Communications on the Blue Gene/P Supercomputer: Algorithms and Optimizations. 2009 17th IEEE Symposium on High Performance Interconnects (2009), 63--72.Google ScholarDigital Library
J. Gasteiger, C. Qian, and S. Günnemann. 2022. Influence-Based Mini-Batching for Graph Neural Networks. arXiv preprint arXiv:2212.09083 (2022).Google Scholar
T. Geng, A. Li, R. Shi, C. Wu, T. Wang, Y. Li, P. Haghi, A. Tumeo, S. Che, S. Reinhardt, and M.C. Herbordt. 2020. AWB-GCN: A Graph Convolutional Network Accelerator with Runtime Workload Rebalancing. In 53rd IEEE/ACM International Symposium on Microarchitecture (MICRO).Google Scholar
T. Geng, C. Wu, Y. Zhang, C. Tan, C. Xie, H. You, M.C. Herbordt, Y. Lin, and A. Li. 2021. I-GCN: A Graph Convolutional Network Accelerator with Runtime Locality Enhancement Through Islandization. In 54th IEEE/ACM International Symposium on Microarchitecture (MICRO). Google ScholarDigital Library
R. L. Graham et al. 2010. Overlapping Computation and Communication: Barrier Algorithms and ConnectX-2 CORE-Direct Capabilities. In 2010 IEEE International Symposium on Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW). 1--8.Google ScholarCross Ref
R. L. Graham et al. 2016. Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction. In 2016 First International Workshop on Communication Optimizations in HPC (COMHPC). 1--10.Google ScholarCross Ref
Richard L. Graham, Lion Levi, Devendar Burredy, Gil Bloch, Gilad Shainer, David Cho, George Elias, Daniel Klein, Joshua Ladd, Ophir Maor, Ami Marelli, Valentin Petrov, Evyatar Romlet, Yong Qin, and Ido Zemah. 2020. Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)TM Streaming-Aggregation Hardware Design and Evaluation. In High Performance Computing, Ponnuswamy Sadayappan, Bradford L. Chamberlain, Guido Juckeland, and Hatem Ltaief (Eds.). Springer International Publishing, Cham, 41--59.Google Scholar
A. Guo, T. Geng, Y. Zhang, P. Haghi, C. Wu, C. Tan, Y. Lin, A. Li, and M.C. Herbordt. 2022. A Framework for Neural Network Inference on FPGA-Centric SmartNICs. In International Conference on Field-Programmable Logic and Applications (FPL).Google Scholar
A. Guo, Y. Hao, C. Wu, P. Haghi, Z. Pan, M. Si, D. Tao, A. Li, M.C. Herbordt, and T. Geng. 2023. Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training. In ICS 2023: International Conference on Supercomputing.Google Scholar
P. Haghi, A. Guo, T. Geng, A. Skjellum, and M.C. Herbordt. 2021. Workload Imbalance in HPC Applications: Effect on Performance of In-Network Processing. In IEEE High Performance Extreme Computing Conference. Google ScholarCross Ref
P. Haghi, A. Guo, Q. Xiong, R. Patel, C. Yang, T. Geng, J.T. Broaddus, R. Marshall, A. Skjellum, and M.C. Herbordt. 2020. FPGAs in the Network and Novel Communicator Support Accelerate MPI Collectives. In IEEE High Performance Extreme Computing Conference.Google Scholar
P. Haghi, A. Guo, Q. Xiong, C. Yang, T. Geng, J.T. Broaddus, R. Marshall, D. Schafer, A. Skjellum, and M.C. Herbordt. 2022. Reconfigurable switches for high performance and flexible MPI collectives. Concurrency and Computation: Practice and Experience 34, 2 (2022). Google ScholarCross Ref
S. Handagala, M.C. Herbordt, and M. Leeser. 2021. OCT: The Open Cloud FPGA Testbed. In 31st International Conference on Field Programmable Logic and Applications (FPL).Google Scholar
S. Handagala, M. Leeser, K. Patle, and M. Zink. 2022. Network Attached FPGAs in the Open Cloud Testbed (OCT). In IEEE INFOCOM 2022 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). 1--6.Google Scholar
F. Hauser et al. 2021. A Survey on Data Plane Programming with P4: Fundamentals, Advances, and Applied Research. arXiv preprint arXiv:2101.10632 (2021).Google Scholar
Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems 33 (2020), 22118--22133.Google Scholar
Z. Jia, S. Lin, M. Gao, M. Zaharia, and A. Aiken. 2020. Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc. In Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2--4, 2020, I.S. Dhillon, D.S. Papailiopoulos, and V. Sze (Eds.). mlsys.org. https://proceedings.mlsys.org/book/300.pdfGoogle Scholar
M. Karunaratne, A. K. Mohite, T. Mitra, and L. Peh. 2017. HyCUBE: A CGRA with Reconfigurable Single-Cycle Multi-hop Interconnect. In 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC). 1--6. Google ScholarDigital Library
E. F. Kfoury, J. Crichigno, and E. Bou-Harb. 2021. An Exhaustive Survey on P4 Programmable Data Plane Switches: Taxonomy, Applications, Challenges, and Future Trends. IEEE Access 9 (2021), 87094--87155.Google ScholarCross Ref
V. Krishnan, O. Serres, and M. Blocksome. 2020. COnfigurable Network Protocol Accelerator (COPA): An Integrated Networking/Accelerator Hardware/Software Framework. In 2020 IEEE Symposium on High-Performance Interconnects (HOTI). 17--24. Google ScholarCross Ref
C. Lattner and V. Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation. In International Symposium on Code Generation and Optimization, CGO. 75--86. Google ScholarCross Ref
A. Li, T. Geng, T. Wang, M.C. Herbordt, S. Song, and K. Barker. 2019. BSTC: A Novel Binarized-Soft-Tensor-Core Design for Accelerating Bit-Based Approximated Neural Nets. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). Google ScholarDigital Library
Youjie Li and et al. 2019. Accelerating Distributed Reinforcement learning with In-Switch Computing. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA). 279--291.Google Scholar
B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins. 2003. ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix. In Field Programmable Logic and Application (FPL). 61--70.Google Scholar
J. Naous, G. Gibb, S. Bolouki, and N. McKeown. 2008. NetFPGA: Reusable Router Architecture for Experimental Research. In Association for Computing Machinery PRESTO (Seattle, WA, USA). New York, NY, USA, 1--7. Google ScholarDigital Library
New Wave DV. 2023. 32-Port Programmable Switch. https://newwavedv.com/products/appliances/32-port-programmable-switch/.Google Scholar
J. Park, M. Smelyanskiy, U. M. Yang, D. Mudigere, and P. Dubey. 2015. High-performance algebraic multigrid solver optimized for multi-core based distributed parallel systems. In SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1--12.Google Scholar
R. Prabhakar et al. 2017. Plasticine: A Reconfigurable Architecture for Parallel Patterns. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 389--402. Google ScholarDigital Library
S. Qiao, C. Hu, G. Brebner, J. Zou, and X. Guan. 2020. Adaptable Switch: A Heterogeneous Switch Architecture for Network-Centric Computing. IEEE Communications Magazine 58, 12 (2020), 64--69. Google ScholarCross Ref
A. L. G. Rios, K. Bekshentayeva, M. Singh, S. Haeri, and L. Trajkovic. 2021. Virtual Network Embedding for Switch-Centric Data Center Networks. In 2021 IEEE International Symposium on Circuits and Systems (ISCAS). 1--5. Google ScholarCross Ref
RISC-V. 2023. RISC-V Specifications. https://riscv.org/technical/specifications/.Google Scholar
RISC-V. 2023. RISC-V 'V' Vector Specifications. https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc.Google Scholar
G. Sankaran, J. Chung, and R. Kettimuthu. 2021. Leveraging In-Network Computing and Programmable Switches for Streaming Analysis of Scientific Data. In 2021 IEEE 7th International Conference on Network Softwarization (NetSoft). 293--297. Google ScholarCross Ref
A. Sapio et al. 2021. Scaling Distributed Machine Learning with In-Network Aggregation. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). 785--808. https://www.usenix.org/conference/nsdi21/presentation/sapioGoogle Scholar
J. Sheng, Q. Xiong, C. Yang, and M.C. Herbordt. 2017. Collective Communication on FPGA Clusters with Static Scheduling. ACM SIGARCH Computer Architecture News 44, 4 (2017). Google ScholarDigital Library
G. Siracusano and R. Bifulco. 2018. In-Network Neural Networks. arXiv preprint arXiv:1801.05731 (2018).Google Scholar
D. Stanzione et al. 2017. Stampede 2: The Evolution of an XSEDE Supercomputer. In Proceedings of the Practice and Experience in Advanced Research Computing on Sustainability, Success and Impact (PEARC17). Article 15, 8 pages. Google ScholarDigital Library
J. Stern, Q. Xiong, J. Sheng, A. Skjellum, and M.C. Herbordt. 2017. Accelerating MPI_Reduce with FPGAs in the Network. In Workshop on Exascale MPI.Google Scholar
J. Stern, Q. Xiong, A. Skjellum, and M.C. Herbordt. 2018. A Novel Approach to Supporting Communicators for In-Switch Processing of MPI Collectives. In Workshop on Exascale MPI.Google Scholar
T. Swamy, A. Rucker, M. Shahbaz, I. Gaur, and K. Olukotun. 2022. Taurus: a Data Plane Architecture for Per-Packet ML. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '22). 1099--1114.Google Scholar
I. Taras and J. H. Anderson. 2019. Impact of FPGA Architecture on Area and Performance of CGRA Overlays. In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 87--95. Google ScholarCross Ref
A. Tripathy, K. Yelick, and A. Buluç. 2020. Reducing Communication in Graph Neural Network Training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20). 1--14. Google ScholarCross Ref
H. Wang et al. 2017. P4FPGA: A Rapid Prototyping Framework for P4. In Proceedings of the Symposium on SDN Research (SOSR '17). 122--135.Google ScholarDigital Library
Andrew Waterman and Krste Asanovic. 2017. The RISC-V Instruction Set Manual Volume I: User-Level ISA, Document Version 2.2. https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf.Google Scholar
Xilinx. 2023. AXI Reference Guide, Vivado Design Suite. https://docs.xilinx.com/v/u/en-US/ug1037-vivado-axi-reference-guide.Google Scholar
Xilinx. 2023. Xilinx Runtime Library (XRT). https://www.xilinx.com/products/design-tools/vitis/xrt.html.Google Scholar
Xilinx. 2023. XUP Vitis Network Example (VNx). https://github.com/Xilinx/xup_vitis_network_example.Google Scholar
Z. Xiong and N. Zilberman. 2019. Do Switches Dream of Machine Learning? Toward In-Network Classification. In Proceedings of the 18th ACM Workshop on Hot Topics in Networks. 25--33.Google Scholar
B. Zhang, R. Kannan, and V. Prasanna. 2021. BoostGCN: A Framework for Optimizing GCN Inference on FPGA. In 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 29--39.Google Scholar

Index Terms

FLASH: FPGA-Accelerated Smart Switches with GCN Case Study
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Reconfigurable computing
2. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Neural networks

Recommendations

Conjoining soft-core FPGA processors
ICCAD '06: Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design

Soft-core programmable processors on field-programmable gate arrays (FPGAs) can be custom synthesized to instantiate only those hardware units, such as multipliers and floating-point units, that an application requires to meet performance demands, thus ...
Read More
Accelerated Embedded AKAZE Feature Detection Algorithm on FPGA
HEART '17: Proceedings of the 8th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies

Feature detection is a major operation in various computer vision systems. The KAZE algorithm and its improved version, Accelerated-KAZE (AKAZE), are considered as the first algorithms to detect features by building a scale space using nonlinear ...
Read More
Minimization of the reconfiguration latency for the mapping of applications on FPGA-based systems
CODES+ISSS '09: Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis

Field-Programmable Gate Arrays (FPGAs) have become promising mapping fabric for the implementation of System-on-Chip (SoC) platforms, due to their large capacity and their enhanced support for dynamic and partial reconfigurability. Design automation ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
ICS '23: Proceedings of the 37th International Conference on Supercomputing
June 2023
505 pages
ISBN:9798400700569
DOI:10.1145/3577193
Chair:
Kyle Gallivan,
Co-chair:
Efstratios Gallopoulos,
Program Co-chairs:
Dimitrios S. Nikolopoulos,
Ramon Beivide
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 June 2023
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
in-switch computing
FPGAs
high performance computing
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate584of2,055submissions,28%
Upcoming Conference
ICS '24

Sponsor:

sigarch

2024 International Conference on Supercomputing

June 4 - 7, 2024

Kyoto , Japan
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 1
  Total Citations
  View Citations
- 285
  Total Downloads
- Downloads (Last 12 months)285
- Downloads (Last 6 weeks)21
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

FLASH: FPGA-Accelerated Smart Switches with GCN Case Study

ICS '23: Proceedings of the 37th International Conference on Supercomputing

ABSTRACT

References

Cited By

Index Terms

Recommendations

Conjoining soft-core FPGA processors

Accelerated Embedded AKAZE Feature Detection Algorithm on FPGA

Minimization of the reconfiguration latency for the mapping of applications on FPGA-based systems

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

FLASH: FPGA-Accelerated Smart Switches with GCN Case Study

ICS '23: Proceedings of the 37th International Conference on Supercomputing

ABSTRACT

References

Cited By

Index Terms

Recommendations

Conjoining soft-core FPGA processors

Accelerated Embedded AKAZE Feature Detection Algorithm on FPGA

Minimization of the reconfiguration latency for the mapping of applications on FPGA-based systems

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media