Abstract
GPUs have been widely adopted in data centers to provide acceleration services to many applications. Sharing a GPU is increasingly important for better processing throughput and energy efficiency. However, quality of service (QoS) among concurrent applications is minimally supported. Previous efforts are too coarse-grained and do not scale with increasing QoS requirements. We propose QoS mechanisms for a fine-grained form of GPU sharing. Our QoS support can control the progress of kernels on a per-cycle basis and the amount of thread-level parallelism of each kernel. Because it manages resources accurately, our QoS support scales significantly better than previous best efforts. Evaluations show that, when the GPU is shared by three kernels, two of which have QoS goals, the proposed techniques achieve QoS goals 43.8% more often than previous techniques and have 20.5% higher throughput.
Quality of Service Support for Fine-Grained Sharing on GPUs
ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture