research-article

Streaming message interface: high-performance distributed memory programming on reconfigurable hardware

Authors:
Tiziano De Matteis, ETH Zurich
Johannes de Fine Licht, ETH Zurich
Jakub Beránek, Technical University of Ostrava
Torsten Hoefler, ETH Zurich

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019, Article No.: 82, Pages 1–33
https://doi.org/10.1145/3295500.3356201

Published: 17 November 2019

ABSTRACT

Distributed memory programming is the established paradigm used in high-performance computing (HPC) systems, requiring explicit communication between nodes and devices. When FPGAs are deployed in distributed settings, communication is typically handled either by going through the host machine, sacrificing performance, or by streaming across fixed device-to-device connections, sacrificing flexibility. We present Streaming Message Interface (SMI), a communication model and API that unifies explicit message passing with a hardware-oriented programming model, facilitating minimal-overhead, flexible, and productive inter-FPGA communication. Instead of bulk transmission, messages are streamed across the network during computation, allowing communication to be seamlessly integrated into pipelined designs. We present a high-level synthesis implementation of SMI targeting a dedicated FPGA interconnect, exposing runtime-configurable routing with support for arbitrary network topologies, and implement a set of distributed memory benchmarks. Using SMI, programmers can implement distributed, scalable HPC programs on reconfigurable hardware, without deviating from best practices for hardware design.
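The contrast the abstract draws between bulk transmission and streaming can be illustrated with a small software analogy. The sketch below is not the SMI API and does not model FPGA hardware; the names `stream_send`, `stream_receive`, and `run`, the bounded queue standing in for a network channel, and the `None` end-of-message marker are all assumptions made for illustration. It only shows the general idea that the sender pushes each element as soon as it is computed, while the receiver consumes elements one at a time inside its own computation loop.

```python
import queue
import threading


def stream_send(channel, n):
    """Producer: compute each element and push it as soon as it is ready."""
    for i in range(n):
        channel.put(i * i)  # one element at a time, no bulk buffer is assembled
    channel.put(None)  # hypothetical end-of-message marker


def stream_receive(channel):
    """Consumer: pop elements one at a time and fold them into a running sum."""
    total = 0
    while True:
        item = channel.get()
        if item is None:
            break
        total += item
    return total


def run(n, depth=4):
    """Stream n squared integers through a bounded FIFO and return their sum."""
    # The bounded queue is a stand-in for a channel with limited buffering:
    # the sender blocks when it is full, the receiver blocks when it is empty.
    channel = queue.Queue(maxsize=depth)
    sender = threading.Thread(target=stream_send, args=(channel, n))
    sender.start()
    result = stream_receive(channel)
    sender.join()
    return result
```

Calling `run(5)` streams the values 0, 1, 4, 9, 16 and returns their sum. Because the channel is bounded, neither side ever holds the whole message in memory, which is the property that lets communication overlap with computation in a pipelined design.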

Published in
SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN: 9781450362290
DOI: 10.1145/3295500
General Chair: Michela Taufer
Program Chairs: Pavan Balaji, Antonio J. Peña
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Artifacts Available
Author Tags
distributed memory programming
high-level synthesis tools
reconfigurable computing
Acceptance Rates
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%