Abstract
The design of network architectures has become increasingly complex as the chips connected by inter-node networks have emerged as distributed systems in their own right, complete with their own on-chip networks. In Anton 2, a massively parallel special-purpose supercomputer for molecular dynamics simulations, we managed this complexity by reusing the on-chip network as a switch for inter-node traffic. This unified network approach introduces several design challenges. Maintaining fairness within the inter-node network is difficult, as each hop becomes a sequence of many on-chip routing decisions. We addressed this problem with an inverse-weighted arbiter that ensures fairness at low implementation cost. Balancing the load of inter-node traffic across the on-chip network is also critical, and we adopted an optimization approach to design an appropriate routing algorithm. Finally, the on-chip routers carry inter-node traffic, so they must implement inter-node virtual channels to avoid deadlock. To keep the routers small and fast, we developed a deadlock-free routing algorithm that reduces the number of virtual channels by one-third relative to previous approaches. The resulting Anton 2 network implementation efficiently utilizes its inter-node channels and provides low messaging latency, while occupying a modest amount of silicon area.
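The abstract names an inverse-weighted arbiter but does not describe how it works. As a rough illustration only, the sketch below shows one generic way to get weighted fairness: each grant charges the winning input by the inverse of its weight, and the least-charged requester wins next. The class name, the `weights` parameter, and the tie-breaking rule are all hypothetical, and this is not the Anton 2 hardware design.

```python
from fractions import Fraction


class InverseWeightedArbiter:
    """Toy weighted-fair arbiter (an assumption, not the Anton 2 circuit)."""

    def __init__(self, weights):
        # weights[i] is a hypothetical expected traffic share for input i.
        self.inverse = [Fraction(1, w) for w in weights]
        self.charge = [Fraction(0)] * len(weights)

    def grant(self, requests):
        """Return the index of the granted input, or None if none request."""
        candidates = [i for i, wants in enumerate(requests) if wants]
        if not candidates:
            return None
        # The least-charged requester wins; ties go to the lowest index.
        winner = min(candidates, key=lambda i: (self.charge[i], i))
        # Charging 1/weight makes heavily weighted inputs win more often.
        self.charge[winner] += self.inverse[winner]
        return winner


def grant_counts(weights, cycles):
    """Count grants per input when every input requests on every cycle."""
    arbiter = InverseWeightedArbiter(weights)
    counts = [0] * len(weights)
    for _ in range(cycles):
        counts[arbiter.grant([True] * len(weights))] += 1
    return counts
```

With both inputs always requesting, `grant_counts([3, 1], 400)` splits the grants three to one. A real arbiter would also need to bound the accumulated charge so that a long-idle input cannot monopolize the output later; the sketch leaves that out.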
Unifying on-chip and inter-node switching within the Anton 2 network
ISCA '14: Proceedings of the 41st Annual International Symposium on Computer Architecture