research-article | Public Access
DOI: 10.1145/3307650.3322230

MGPUSim: enabling multi-GPU performance modeling and optimization

Published: 22 June 2019

ABSTRACT

The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in the raw computational power of Graphics Processing Units (GPUs). As single-GPU platforms struggle to satisfy these performance demands, multi-GPU platforms have started to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabric, runtime libraries, and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework to evaluate next-generation multi-GPU system designs.

In this work, we present MGPUSim, a cycle-accurate, extensively validated multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. MGPUSim comes with built-in support for multi-threaded execution to enable fast, parallelized, and accurate simulation. In terms of performance accuracy, MGPUSim differs by only 5.5% on average from actual GPU hardware. We also achieve a 3.5× and a 2.5× average speedup running functional emulation and detailed timing simulation, respectively, on a 4-core CPU, while delivering the same accuracy as serial simulation.
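The speedup numbers above rest on multi-threaded, event-driven simulation: events that fire at the same simulated timestamp can be executed concurrently, with a barrier before the clock advances, which is why parallel runs can match the accuracy of serial ones. The Go sketch below (MGPUSim itself is implemented in Go) illustrates that general pattern under the assumption that same-timestamp events are independent; the `Event` interface, `eventQueue`, and `Run` loop are illustrative names and structure, not the simulator's actual engine API.

```go
// Minimal sketch of a same-timestamp-parallel event engine,
// assuming events at one timestamp are independent. Names and
// structure are illustrative, not MGPUSim's real engine.
package main

import (
	"container/heap"
	"fmt"
	"sync"
)

// Event is an action scheduled at a point in simulated time.
type Event interface {
	Time() float64 // simulated time, in seconds
	Happen()       // apply the event's effect
}

// eventQueue is a min-heap of events ordered by time.
type eventQueue []Event

func (q eventQueue) Len() int            { return len(q) }
func (q eventQueue) Less(i, j int) bool  { return q[i].Time() < q[j].Time() }
func (q eventQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *eventQueue) Push(x interface{}) { *q = append(*q, x.(Event)) }
func (q *eventQueue) Pop() interface{} {
	old := *q
	e := old[len(old)-1]
	*q = old[:len(old)-1]
	return e
}

// Run executes all events sharing the earliest timestamp in
// parallel goroutines, then waits (the barrier) before advancing.
func Run(q *eventQueue) {
	for q.Len() > 0 {
		now := (*q)[0].Time()
		var wg sync.WaitGroup
		for q.Len() > 0 && (*q)[0].Time() == now {
			e := heap.Pop(q).(Event)
			wg.Add(1)
			go func() {
				defer wg.Done()
				e.Happen()
			}()
		}
		wg.Wait() // no later event starts before time `now` completes
	}
}

// tick is a trivial event standing in for a compute-unit cycle.
type tick struct {
	t    float64
	name string
}

func (e tick) Time() float64 { return e.t }
func (e tick) Happen()       { fmt.Printf("t=%.0f: %s ticks\n", e.t, e.name) }

func main() {
	q := &eventQueue{}
	heap.Init(q)
	for t := 0.0; t < 3; t++ {
		heap.Push(q, tick{t, "CU-0"})
		heap.Push(q, tick{t, "CU-1"})
	}
	Run(q)
}
```

A real engine must additionally let a running event schedule follow-on events, which requires synchronized access to the queue; the sketch omits that for brevity.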

We illustrate the flexibility and capability of the simulator through two concrete design studies. In the first, we propose the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in the multi-GPU memory. In the second, we propose Progressive Page Splitting Migration (PASI), a customized multi-GPU memory management system that enables the hardware to progressively improve data placement. For a discrete 4-GPU system, we observe that the Locality API can speed up the system by 1.6× (geometric mean), and PASI can improve system performance by 2.6× (geometric mean) across all benchmarks, compared to a unified 4-GPU platform.
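To make the first design study concrete, the hypothetical Go sketch below shows the flavor of interface a Locality API implies: the program is written as if for one GPU, with a single allocation and an unchanged kernel launch, plus explicit hints that pin slices of the buffer to specific GPUs. The `Driver` type and every method name here are assumptions for illustration, not MGPUSim's actual driver package.

```go
// Hypothetical Locality-API-style interface. All names below are
// illustrative assumptions, not MGPUSim's real driver API.
package main

type GPUID int

// Driver presents several discrete GPUs as one logical device
// while accepting explicit data-placement hints.
type Driver struct{ numGPUs int }

// Allocate reserves `bytes` bytes in the unified address space
// spanning all GPUs. (Body elided in this sketch.)
func (d *Driver) Allocate(bytes uint64) uintptr { return 0 }

// Place pins [addr, addr+bytes) to one GPU's local memory;
// accesses from other GPUs cross the inter-GPU interconnect.
func (d *Driver) Place(addr uintptr, bytes uint64, gpu GPUID) {}

func main() {
	d := &Driver{numGPUs: 4}

	const n = 1 << 20        // 1M float32 elements
	buf := d.Allocate(4 * n) // one allocation, as in single-GPU code

	// Pin to each GPU the quarter of the buffer its work-groups
	// touch most, so most accesses stay in local memory.
	chunk := uint64(4*n) / uint64(d.numGPUs)
	for g := 0; g < d.numGPUs; g++ {
		d.Place(buf+uintptr(uint64(g)*chunk), chunk, GPUID(g))
	}

	// The kernel launch itself is unchanged from single-GPU code.
	// d.LaunchKernel(kernel, args...) // elided
}
```

PASI, by contrast, needs no programmer hints: per the abstract, the hardware improves placement progressively, which can be pictured as splitting coarse pages whose remote-access counts cross a threshold and migrating the smallest pieces to the GPU that actually uses them. The toy policy below sketches that idea; the page sizes, threshold, and bookkeeping are assumptions, not the paper's exact mechanism.

```go
// Toy model of progressive page splitting and migration. Sizes,
// threshold, and structures are illustrative assumptions.
package main

import "fmt"

const (
	largePage   = 2 << 20 // 2 MiB initial page size
	smallPage   = 4 << 10 // 4 KiB smallest granularity
	remoteLimit = 64      // remote accesses tolerated before acting
)

type page struct {
	size         uint64
	owner        int    // GPU whose local memory holds the page
	remoteAccess [4]int // per-GPU remote-access counters
}

// onAccess models the hardware's reaction to one memory access.
// It returns replacement pages when the page is split.
func (p *page) onAccess(gpu int) []*page {
	if gpu == p.owner {
		return nil // local access: nothing to do
	}
	p.remoteAccess[gpu]++
	if p.remoteAccess[gpu] < remoteLimit {
		return nil
	}
	if p.size > smallPage {
		// Split in half so hot and cold portions can be placed
		// independently; children start with fresh counters.
		half := p.size / 2
		return []*page{{size: half, owner: p.owner}, {size: half, owner: p.owner}}
	}
	// Smallest granularity reached: migrate to the hot GPU.
	p.owner = gpu
	p.remoteAccess = [4]int{}
	return nil
}

func main() {
	p := &page{size: largePage, owner: 0}
	// GPU 1 repeatedly touches a page owned by GPU 0.
	for i := 0; i <= remoteLimit; i++ {
		if children := p.onAccess(1); children != nil {
			fmt.Printf("split into two %d-byte pages\n", children[0].size)
			break
		}
	}
}
```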


Published in

ISCA '19: Proceedings of the 46th International Symposium on Computer Architecture
June 2019, 849 pages
ISBN: 978-1-4503-6669-4
DOI: 10.1145/3307650

        Copyright © 2019 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

ISCA '19 paper acceptance rate: 62 of 365 submissions, 17%. Overall acceptance rate: 543 of 3,203 submissions, 17%.
