ABSTRACT
Distributed memory programming is the established paradigm used in high-performance computing (HPC) systems, requiring explicit communication between nodes and devices. When FPGAs are deployed in distributed settings, communication is typically handled either by going through the host machine, sacrificing performance, or by streaming across fixed device-to-device connections, sacrificing flexibility. We present Streaming Message Interface (SMI), a communication model and API that unifies explicit message passing with a hardware-oriented programming model, facilitating minimal-overhead, flexible, and productive inter-FPGA communication. Instead of bulk transmission, messages are streamed across the network during computation, allowing communication to be seamlessly integrated into pipelined designs. We present a high-level synthesis implementation of SMI targeting a dedicated FPGA interconnect, exposing runtime-configurable routing with support for arbitrary network topologies, and implement a set of distributed memory benchmarks. Using SMI, programmers can implement distributed, scalable HPC programs on reconfigurable hardware, without deviating from best practices for hardware design.
- A. M. Aji, L. S. Panwar, F. Ji, K. Murthy, M. Chabbi, P. Balaji, K. R. Bisset, J. Dinan, W. Feng, J. Mellor-Crummey, X. Ma, and R. Thakur. 2016. MPI-ACC: Accelerator-Aware MPI for Scientific Applications. IEEE Transactions on Parallel and Distributed Systems 27, 5 (May 2016), 1401--1414. Google ScholarDigital Library
- Susan Blackford, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, Michael Heroux, Linda Kaufman, Andrew Lumsdaine, Antoine Petitet, Roldan Pozo, Karin Remington, and Clint Whaley. 2002. An Updated Set of Basic Linear Algebra Subprograms (BLAS). ACM Trans. Math. Softw. 28, 2 (June 2002), 135--151.Google ScholarDigital Library
- E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, M. Abeydeera, L. Adams, H. Angepat, C. Boehn, D. Chiou, O. Firestein, A. Forin, K. S. Gatlin, M. Ghandi, S. Heil, K. Holohan, A. El Husseini, T. Juhasz, K. Kagi, R. Kovvuri, S. Lanka, F. van Megen, D. Mukhortov, P. Patel, B. Perez, A. Rapsang, S. Reinhardt, B. Rouhani, A. Sapek, R. Seera, S. Shekar, B. Sridharan, G. Weisz, L. Woods, P. Yi Xiao, D. Zhang, R. Zhao, and D. Burger. 2018. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (Mar 2018), 8--20. Google ScholarCross Ref
- J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang. 2011. High-Level Synthesis for FPGAs: From Prototyping to Deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30, 4 (April 2011), 473--491. Google ScholarDigital Library
- Tomasz S Czajkowski, Utku Aydonat, Dmitry Denisenko, John Freeman, Michael Kinsner, David Neto, Jason Wong, Peter Yiannacouras, and Deshanand P Singh. 2012. From OpenCL to high-performance hardware on FPGAs. In 22nd international conference on field programmable logic and applications (FPL). IEEE, 531--534.Google ScholarCross Ref
- Johannes de Fine Licht, Simon Meierhans, and Torsten Hoefler. 2018. Transformations of High-Level Synthesis Codes for High-Performance Computing. CoRR abs/1805.08288 (2018). arXiv:1805.08288 http://arxiv.org/abs/1805.08288Google Scholar
- Rob Dimond, Sébastien Racaniere, and Oliver Pell. 2011. Accelerating large-scale HPC Applications using FPGAs. In 2011 IEEE 20th Symposium on Computer Arithmetic. IEEE, 191--192.Google ScholarDigital Library
- J. Domke, T. Hoefler, and W. E. Nagel. 2011. Deadlock-Free Oblivious Routing for Arbitrary Topologies. In 2011 IEEE International Parallel Distributed Processing Symposium. 616--627. Google ScholarDigital Library
- Nariman Eskandari, Naif Tarafdar, Daniel Ly-Ma, and Paul Chow. 2019. A Modular Heterogeneous Stack for Deploying FPGAs and CPUs in the Data Center. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '19). ACM, New York, NY, USA, 262--271. Google ScholarDigital Library
- T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Xu, R. Patel, and M. Herbordt. 2018. FPDeep: Acceleration and Load Balancing of CNN Training on FPGA Clusters. In 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 81--84.Google Scholar
- A. D. George, M. C. Herbordt, H. Lam, A. G. Lawande, J. Sheng, and C. Yang. 2016. Novo-G#: Large-scale reconfigurable computing with direct and programmable interconnects. In 2016 IEEE High Performance Extreme Computing Conference (HPEC). 1--7.Google Scholar
- T. Gysi, J. BÃd'r, and T. Hoefler. 2016. dCUDA: Hardware Supported Overlap of Computation and Communication. In SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 609--620. Google ScholarCross Ref
- Amazon EC2 F1 instances. [n. d.]. https://aws.amazon.com/ec2/instance-types/f1/.Google Scholar
- Kaan Kara, Dan Alistarh, Gustavo Alonso, Onur Mutlu, and Ce Zhang. 2017. FPGA-accelerated dense linear machine learning: A precision-convergence trade-off. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 160--167.Google ScholarCross Ref
- Jungwon Kim, Seyong Lee, and Jeffrey S. Vetter. 2016. IMPACC: A Tightly Integrated MPI+OpenACC Framework Exploiting Shared Memory Parallelism. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC '16). ACM, New York, NY, USA, 189--201. Google ScholarDigital Library
- Ryohei Kobayashi, Yuma Oobata, Norihisa Fujita, Yoshiki Yamaguchi, and Taisuke Boku. 2018. OpenCL-ready High Speed FPGA Network for Reconfigurable High Performance Computing. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2018). ACM, New York, NY, USA, 192--201. Google ScholarDigital Library
- A. Lawande, A. D. George, and H. Lam. 2016. An OpenCL Framework for Distributed Apps on a Multidimensional Network of FPGAs. In 2016 6th Workshop on Irregular Applications: Architecture and Algorithms (IA3). 42--49. Google ScholarCross Ref
- Tiziano De Matteis, Johannes de Fine Licht, and Torsten Hoefler. 2019. FBLAS: Streaming Linear Algebra on FPGA. CoRR (Aug. 2019).Google Scholar
- Message Passing Interface Forum. 2015. MPI: A Message-Passing Interface Standard, Version 3.1. Specification. https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdfGoogle Scholar
- M. Owaida and G. Alonso. 2018. Application Partitioning on FPGA Clusters: Inference over Decision Tree Ensembles. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL). 295--2955. Google ScholarCross Ref
- Manuel Saldaña, Arun Patel, Christopher Madill, Daniel Nunes, Danyao Wang, Paul Chow, Ralph Wittig, Henry Styles, and Andrew Putnam. 2010. MPI As a Programming Model for High-Performance Reconfigurable Computers. ACM Trans. Reconfigurable Technol. Syst. 3, 4, Article 22 (Nov. 2010), 29 pages. Google ScholarDigital Library
- Kentaro Sano, Yoshiaki Hatsuda, and Satoru Yamamoto. 2014. Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth. IEEE Transactions on Parallel and Distributed Systems 25, 3 (2014), 695--705.Google ScholarDigital Library
- Ran Shu, Peng Cheng, Guo Chen, Zhiyuan Guo, Lei Qu, Yongqiang Xiong, Derek Chiou, and Thomas Moscibroda. 2019. Direct Universal Access: Making Data Center Resources Available to FPGA. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). USENIX Association, Boston, MA, 127--140. https://www.usenix.org/conference/nsdi19/presentation/shuGoogle Scholar
- Stratix 10 GX/SX Product Table. [n. d.]. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/pt/stratix-10-product-table.pdf.Google Scholar
- Versal ACAP AI Core Series Product Table. [n. d.]. https://www.xilinx.com/support/documentation/selection-guides/versal-ai-core-product-selection-guide.pdf.Google Scholar
- Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. 2017. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). ACM, New York, NY, USA, 65--74. Google ScholarDigital Library
- Chen Zhang, Di Wu, Jiayu Sun, Guangyu Sun, Guojie Luo, and Jason Cong. 2016. Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design (ISLPED '16). ACM, New York, NY, USA, 326--331. Google ScholarDigital Library
- Hamid Reza Zohouri, Artur Podobas, and Satoshi Matsuoka. 2018. Combined spatial and temporal blocking for high-performance stencil computation on FPGAs using OpenCL. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 153--162.Google ScholarDigital Library
Recommendations
A Mixed-Grained Reconfigurable Computing Platform for Multiple-Standard Video Decoding (Abstract Only)
FPGA '15: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate ArraysA mixed-grained reconfigurable computing platform targeting multiple-standard video decoding is proposed in this paper. The platform integrates eight coarse-grained Reconfigurable Processing Units (RPUs), each of which consists of 16×16 multi-functional ...
Function-level multitasking interface design in an embedded operating system with reconfigurable hardware
EUC'07: Proceedings of the 2007 international conference on Embedded and ubiquitous computingReconfigurable architecture provides a high performance computing paradigm. We can implement the compute-intensive functions into reconfigurable devices to optimize the application performance. In current reconfigurable hardware designs, the function-...
Comments