ABSTRACT
Current proposals for in-network data processing operate on data as it streams through a network switch or endpoint. Since compute resources must be available when data arrives, these approaches provide deadline-based models of execution. This paper introduces a deadline-free general compute model for network endpoints called INCA: In-Network Compute Assistance. INCA builds upon contemporary NIC offload capabilities to provide on-NIC, deadline-free, general-purpose compute capacities that can be utilized when the network is inactive. We demonstrate that INCA is Turing complete and provide a detailed design for extending existing hardware to support this model. We evaluate runtimes for a selection of kernels, including several optimizations, and show that INCA can provide up to an 11% speedup for applications with minimal code modifications and between 25% and 37% when applications are optimized for INCA.