MGPUSim: Enabling Multi-GPU Performance Modeling and Optimization

ABSTRACT
The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in raw computational power of Graphics Processing Units (GPUs). As single-GPU platforms struggle to satisfy these performance demands, multi-GPU platforms have started to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabric, runtime libraries, and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework to evaluate next-generation multi-GPU system designs.
In this work, we present MGPUSim, a cycle-accurate, extensively validated multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. MGPUSim comes with built-in support for multi-threaded execution to enable fast, parallelized, and accurate simulation. In terms of performance accuracy, MGPUSim differs by only 5.5% on average from actual GPU hardware. We also achieve a 3.5× and a 2.5× average speedup running functional emulation and detailed timing simulation, respectively, on a 4-core CPU, while delivering the same accuracy as serial simulation.
We illustrate the flexibility and capability of the simulator through two concrete design studies. In the first, we propose the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in the multi-GPU memory. In the second design study, we propose Progressive Page Splitting Migration (PASI), a customized multi-GPU memory management system enabling the hardware to progressively improve data placement. For a discrete 4-GPU system, we observe that the Locality API can speed up the system by 1.6× (geometric mean), and PASI can improve the system performance by 2.6× (geometric mean) across all benchmarks, compared to a unified 4-GPU platform.
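To make the Locality API idea concrete, the following is a minimal sketch in Go (the language MGPUSim itself is written in) of what programmer-controlled data placement could look like. All names here (`Driver`, `AllocWithLocality`, `Buffer`) are illustrative assumptions for exposition, not the simulator's actual API: the point is that the programmer keeps a familiar single-GPU allocation model while pinning each buffer to a chosen GPU's memory.

```go
package main

import "fmt"

// Buffer is a hypothetical handle to device memory pinned to one GPU.
type Buffer struct {
	gpuID int // which GPU's local memory holds this buffer
	size  int // size in bytes
}

// Driver is a hypothetical multi-GPU runtime that tracks placements.
type Driver struct {
	numGPUs int
	placed  map[*Buffer]int
}

func NewDriver(numGPUs int) *Driver {
	return &Driver{numGPUs: numGPUs, placed: map[*Buffer]int{}}
}

// AllocWithLocality allocates size bytes in the memory of the GPU
// identified by gpuID (wrapping around the GPU count), instead of
// letting the runtime pick a location transparently.
func (d *Driver) AllocWithLocality(size, gpuID int) *Buffer {
	b := &Buffer{gpuID: gpuID % d.numGPUs, size: size}
	d.placed[b] = b.gpuID
	return b
}

func main() {
	d := NewDriver(4) // a discrete 4-GPU system, as in the design study
	n := 1 << 20      // total input size in bytes

	// Split the input into four equal chunks, one per GPU, so each
	// GPU's kernels read mostly local memory.
	for gpu := 0; gpu < 4; gpu++ {
		b := d.AllocWithLocality(n/4, gpu)
		fmt.Printf("chunk %d -> GPU %d (%d bytes)\n", gpu, b.gpuID, b.size)
	}
}
```

The design intent is that explicit placement like this avoids the remote-access penalties a unified multi-GPU memory incurs when the runtime places data obliviously.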