Abstract
GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to meet key multiprogrammed workload requirements, such as responsiveness, fairness, or quality of service.
In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
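The scheduling policy described above distributes GPU cores (streaming multiprocessors, SMs) among concurrent kernels according to their priorities. As a rough illustration only, the following sketch shows one plausible priority-proportional partitioning scheme; the function name, the tie-breaking rule, and the one-SM-minimum guarantee are assumptions for this example, not the paper's actual mechanism.

```python
# Hypothetical sketch (not the paper's implementation): split a GPU's SMs
# among concurrent kernels in proportion to their integer priorities,
# guaranteeing each runnable kernel at least one SM.

def partition_sms(num_sms, priorities):
    """Return a list of SM counts, one per kernel, proportional to priority."""
    if not priorities:
        return []
    assert num_sms >= len(priorities), "need at least one SM per kernel"
    total = sum(priorities)
    # Initial proportional share, rounded down, but never below one SM.
    shares = [max(1, (num_sms * p) // total) for p in priorities]
    # Hand out SMs lost to rounding, highest-priority kernels first.
    leftover = num_sms - sum(shares)
    order = sorted(range(len(priorities)), key=lambda i: -priorities[i])
    i = 0
    while leftover > 0:
        shares[order[i % len(order)]] += 1
        leftover -= 1
        i += 1
    return shares

# Example: 14 SMs (as on a GK110) split among three kernels.
print(partition_sms(14, [4, 2, 1]))  # → [8, 4, 2]
```

When priorities change at runtime, a policy like this would be re-evaluated and the preemption mechanisms used to reclaim SMs from kernels whose share shrank.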
Enabling preemptive multiprogramming on GPUs
ISCA '14: Proceedings of the 41st Annual International Symposium on Computer Architecture