Abstract
GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to meet key multiprogrammed workload requirements, such as responsiveness, fairness, or quality of service.
In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
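The scheduling policy described above distributes GPU cores (streaming multiprocessors, SMs) among concurrent kernels according to their priorities. As a rough illustration only, the following sketch shows one plausible priority-proportional partitioning scheme; the function name, the tie-breaking rule, and the one-SM-minimum guarantee are assumptions for this example, not the paper's actual mechanism.

```python
# Hypothetical sketch (not the paper's implementation): split a GPU's SMs
# among concurrent kernels in proportion to their integer priorities,
# guaranteeing each runnable kernel at least one SM.

def partition_sms(num_sms, priorities):
    """Return a list of SM counts, one per kernel, proportional to priority."""
    if not priorities:
        return []
    assert num_sms >= len(priorities), "need at least one SM per kernel"
    total = sum(priorities)
    # Initial proportional share, rounded down, but never below one SM.
    shares = [max(1, (num_sms * p) // total) for p in priorities]
    # Hand out SMs lost to rounding, highest-priority kernels first.
    leftover = num_sms - sum(shares)
    order = sorted(range(len(priorities)), key=lambda i: -priorities[i])
    i = 0
    while leftover > 0:
        shares[order[i % len(order)]] += 1
        leftover -= 1
        i += 1
    return shares

# Example: 14 SMs (as on a GK110) split among three kernels.
print(partition_sms(14, [4, 2, 1]))  # → [8, 4, 2]
```

When priorities change at runtime, a policy like this would be re-evaluated and the preemption mechanisms used to reclaim SMs from kernels whose share shrank.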
Enabling preemptive multiprogramming on GPUs
ISCA '14: Proceedings of the 41st Annual International Symposium on Computer Architecture