Quality of Service Support for Fine-Grained Sharing on GPUs

Published: 24 June 2017

Abstract

GPUs have been widely adopted in data centers to provide acceleration services to many applications. Sharing a GPU is increasingly important for better processing throughput and energy efficiency. However, quality of service (QoS) among concurrent applications is minimally supported. Previous efforts are too coarse-grained and do not scale with increasing QoS requirements. We propose QoS mechanisms for a fine-grained form of GPU sharing. Our QoS support can provide control over the progress of kernels on a per-cycle basis and over the amount of thread-level parallelism of each kernel. Due to accurate resource management, our QoS support has significantly better scalability than previous best efforts. Evaluations show that, when the GPU is shared by three kernels, two of which have QoS goals, the proposed techniques achieve QoS goals 43.8% more often than previous techniques and have 20.5% higher throughput.
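To make the idea of per-cycle progress control concrete, the following is a minimal toy sketch in Python. It is not taken from the paper: the epoch-based quota policy, the single shared issue slot, and the names `Kernel`, `refill_quotas`, `pick_kernel`, and `simulate` are all assumptions made for illustration. It models three kernels sharing one issue slot, two with instructions-per-cycle (IPC) goals and one best-effort kernel that receives whatever cycles are left.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Kernel:
    name: str
    ipc_goal: Optional[float]  # None marks a best-effort kernel (no QoS goal)
    quota: float = 0.0         # instructions the kernel may still issue this epoch
    issued: int = 0            # total instructions issued so far


def refill_quotas(kernels: List[Kernel], epoch_cycles: int) -> None:
    """At the start of an epoch, grant each QoS kernel goal * epoch_cycles instructions."""
    for k in kernels:
        if k.ipc_goal is not None:
            # Unused quota carries over so a kernel that fell behind can catch up.
            k.quota = max(k.quota, 0.0) + k.ipc_goal * epoch_cycles


def pick_kernel(kernels: List[Kernel]) -> Optional[Kernel]:
    """Per-cycle decision: QoS kernels with quota left go first, then best-effort."""
    with_quota = [k for k in kernels if k.ipc_goal is not None and k.quota >= 1.0]
    if with_quota:
        # Favor the QoS kernel that is furthest from exhausting its quota.
        return max(with_quota, key=lambda k: k.quota)
    best_effort = [k for k in kernels if k.ipc_goal is None]
    return best_effort[0] if best_effort else None


def simulate(kernels: List[Kernel], epochs: int, epoch_cycles: int) -> Dict[str, float]:
    """Run the toy scheduler and return the achieved IPC of every kernel."""
    for _ in range(epochs):
        refill_quotas(kernels, epoch_cycles)
        for _ in range(epoch_cycles):
            k = pick_kernel(kernels)
            if k is None:
                continue  # nothing eligible this cycle
            k.issued += 1
            if k.ipc_goal is not None:
                k.quota -= 1.0
    total_cycles = epochs * epoch_cycles
    return {k.name: k.issued / total_cycles for k in kernels}


if __name__ == "__main__":
    shared = [
        Kernel("qos_a", ipc_goal=0.5),
        Kernel("qos_b", ipc_goal=0.25),
        Kernel("best_effort", ipc_goal=None),
    ]
    print(simulate(shared, epochs=10, epoch_cycles=100))
```

In this toy setting the two QoS kernels reach their goals of 0.5 and 0.25 IPC, and the best-effort kernel gets the remaining 0.25. A real GPU would also have to account for memory stalls, multiple streaming multiprocessors, and the thread-level parallelism limits mentioned in the abstract, none of which this sketch models.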

      Published in

        • ACM SIGARCH Computer Architecture News, Volume 45, Issue 2 (ISCA '17)
          May 2017
          715 pages
          ISSN: 0163-5964
          DOI: 10.1145/3140659
        • ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture
          June 2017
          736 pages
          ISBN: 9781450348928
          DOI: 10.1145/3079856

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • tutorial
        • Research
        • Refereed limited
