Quality of Service Support for Fine-Grained Sharing on GPUs

Authors:
Zhenning Wang, Department of Computer Science, Shanghai Jiao Tong University
Jun Yang, Electrical and Computer Engineering Department, University of Pittsburgh
Rami Melhem, Department of Computer Science, University of Pittsburgh
Bruce Childers, Department of Computer Science, University of Pittsburgh
Youtao Zhang, Department of Computer Science, University of Pittsburgh
Minyi Guo, Department of Computer Science, Shanghai Jiao Tong University

ACM SIGARCH Computer Architecture News, Volume 45, Issue 2, May 2017, pp. 269–281. https://doi.org/10.1145/3140659.3080203

Published: 24 June 2017

Abstract

GPUs have been widely adopted in data centers to provide acceleration services to many applications. Sharing a GPU is increasingly important for better processing throughput and energy efficiency. However, quality of service (QoS) among concurrent applications is minimally supported. Previous efforts are too coarse-grained and do not scale with increasing QoS requirements. We propose QoS mechanisms for a fine-grained form of GPU sharing. Our QoS support can provide control over the progress of kernels on a per-cycle basis and over the amount of thread-level parallelism of each kernel. Due to accurate resource management, our QoS support has significantly better scalability compared with previous best efforts. Evaluations show that, when the GPU is shared by three kernels, two of which have QoS goals, the proposed techniques achieve QoS goals 43.8% more often than previous techniques and have 20.5% higher throughput.

Index Terms

1. Applied computing
  1. Enterprise computing
    1. Enterprise information systems
      1. Data centers
2. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data

Published in
ACM SIGARCH Computer Architecture News, Volume 45, Issue 2 (ISCA '17)
May 2017, 715 pages
ISSN: 0163-5964
DOI: 10.1145/3140659
Editor: Babak Falsafi (Interim)
ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture
June 2017, 736 pages
ISBN: 9781450348928
DOI: 10.1145/3079856
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GPU
Quality of Service
Qualifiers
- tutorial
- Research
- Refereed limited