Enabling preemptive multiprogramming on GPUs

Published: 14 June 2014
Abstract

GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to meet key multiprogrammed workload requirements, such as responsiveness, fairness, or quality of service.

In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, the average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
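To make the kind of policy described above concrete, the following is a minimal illustrative sketch (not the paper's actual hardware mechanism) of one plausible way to distribute a fixed pool of GPU cores (streaming multiprocessors, SMs) among concurrent kernels proportionally to their priorities. The function name, the round-robin handling of rounding leftovers, and the guarantee of at least one SM per kernel are all assumptions made for illustration.

```python
def partition_sms(total_sms, priorities):
    """Split total_sms among kernels proportionally to their priorities.

    Every kernel receives at least one SM; rounding leftovers are handed
    out round-robin so the shares always sum to total_sms exactly.
    This is an illustrative model only, not the paper's hardware design.
    """
    if not priorities or total_sms < len(priorities):
        raise ValueError("need at least one SM per kernel")

    weight = sum(priorities)
    # Proportional integer share, floored, but never below one SM.
    shares = [max(1, (total_sms * p) // weight) for p in priorities]

    # Distribute any SMs lost to flooring, one at a time, round-robin.
    i = 0
    while sum(shares) < total_sms:
        shares[i % len(shares)] += 1
        i += 1
    # If the >=1 floor over-allocated, trim from the largest share.
    while sum(shares) > total_sms:
        shares[shares.index(max(shares))] -= 1
    return shares

# Example: 14 SMs shared by a priority-3 and a priority-1 kernel.
print(partition_sms(14, [3, 1]))
```

A dynamic scheduler of this kind would recompute the partition whenever a kernel arrives or finishes, then use a preemption mechanism to reclaim SMs from lower-priority kernels.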



  • Published in

    ACM SIGARCH Computer Architecture News, Volume 42, Issue 3 (ISCA '14)
    June 2014, 552 pages
    ISSN: 0163-5964
    DOI: 10.1145/2678373

    Also in ISCA '14: Proceedings of the 41st Annual International Symposium on Computer Architecture
    June 2014, 566 pages
    ISBN: 9781479943944

    Copyright © 2014 IEEE

    Publisher: Association for Computing Machinery, New York, NY, United States
