ABSTRACT
Despite the growing popularity of GPGPU programming, there is not yet a portable and formally-specified barrier that one can use to synchronise across workgroups. Moreover, the occupancy-bound execution model of GPUs breaks assumptions inherent in traditional software execution barriers, exposing them to deadlock. We present an occupancy discovery protocol that dynamically discovers a safe estimate of the occupancy for a given GPU and kernel, allowing for a starvation-free (and hence, deadlock-free) inter-workgroup barrier by restricting the number of workgroups according to this estimate. We implement this idea by adapting an existing, previously non-portable, GPU inter-workgroup barrier to use OpenCL 2.0 atomic operations, and prove that the barrier meets its natural specification in terms of synchronisation.
We assess the portability of our approach over eight GPUs spanning four vendors, comparing the performance of our method against alternative methods. Our key findings include: (1) the recall of our discovery protocol is nearly 100%; (2) runtime comparisons vary substantially across GPUs and applications; and (3) our method provides portable and safe inter-workgroup synchronisation across the applications we study.
- J. Alglave, M. Batty, A. F. Donaldson, G. Gopalakrishnan, J. Ketema, D. Poetzl, T. Sorensen, and J. Wickerson. GPU concurrency: Weak behaviours and programming assumptions. In ASPLOS, pages 577–591. ACM, 2015. Google ScholarDigital Library
- M. Batty, M. Dodds, and A. Gotsman. Library abstraction for C/C++ concurrency. In POPL, pages 235–248. ACM, 2013. Google ScholarDigital Library
- M. Batty, A. F. Donaldson, and J. Wickerson. Overhauling SC atomics in C11 and OpenCL. In POPL, pages 634–648. ACM, 2016. Google ScholarDigital Library
- A. Betts, N. Chong, A. F. Donaldson, J. Ketema, S. Qadeer, P. Thomson, and J. Wickerson. The design and implementation of a verification technique for GPU kernels. ACM Trans. Program. Lang. Syst., 37(3):10, 2015. Google ScholarDigital Library
- M. Burtscher, R. Nasre, and K. Pingali. A quantitative study of irregular programs on GPUs. In IISWC, pages 141–151. IEEE, 2012. Google ScholarDigital Library
- D. Cederman and P. Tsigas. On dynamic load balancing on graphics processors. In SIGGRAPH, pages 57–64. Eurographics Association, 2008. Google ScholarDigital Library
- S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron. Pannotia: Understanding irregular GPGPU graph applications. In IISWC, pages 185–195. IEEE, 2013.Google Scholar
- P. Collingbourne, A. F. Donaldson, J. Ketema, and S. Qadeer. Interleaving and lock-step semantics for analysis and verification of GPU kernels. In ESOP, pages 270–289. Springer, 2013. Google ScholarDigital Library
- B. Gaster. A look at the OpenCL 2.0 execution model. In IWOCL, pages 2:1–2:1. ACM, 2015. Google ScholarDigital Library
- B. R. Gaster, D. Hower, and L. Howes. HRF-relaxed: Adapting HRF to the complexities of industrial heterogeneous memory models. Trans. Archit. Code Optim., 2015. Google ScholarDigital Library
- K. Gupta, J. Stuart, and J. D. Owens. A study of persistent threads style GPU programming for GPGPU workloads. In Proceedings of Innovative Parallel Computing, InPar, pages 1–14. IEEE, 2012.Google Scholar
- M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., 2008. Google ScholarDigital Library
- D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood. Heterogeneousrace-free memory models. In ASPLOS, pages 427–440. ACM, 2014. Google ScholarDigital Library
- Intel. The compute architecture of Intel processor graphics gen9, version 1.0, Aug. 2015.Google Scholar
- ISO/IEC. Standard for programming language C++, 2012.Google Scholar
- Khronos Group. The OpenCL C specification version: 2.0. https://www.khronos.org/registry/cl/ specs/opencl-2.0-openclc.pdf.Google Scholar
- Khronos Group. The OpenCL specification version: 2.0 (rev. 29), July 2015.Google Scholar
- https://www.khronos.org/ registry/cl/specs/opencl-2.0.pdf.Google Scholar
- G. Li, P. Li, G. Sawaya, G. Gopalakrishnan, I. Ghosh, and S. P. Rajan. GKLEE: concolic verification and test generation for GPUs. In PPoPP, pages 215–224. ACM, 2012. Google ScholarDigital Library
- S. Maleki, A. Yang, and M. Burtscher. Higher-order and tuplebased massively-parallel prefix sums. In PLDI, pages 539– 552. ACM, 2016. Google ScholarDigital Library
- D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU graph traversal. In PPoPP, pages 117–128. ACM, 2012. Google ScholarDigital Library
- M. Mrozek and Z. Zdanowicz. GPU daemon: Road to zero cost submission. In IWOCL, pages 11:1–11:4. ACM, 2016. Google ScholarDigital Library
- Nvidia. CUB, April 2015. http://nvlabs.github. io/cub/.Google Scholar
- Nvidia. CUDA C programming guide, version 7, March 2015. http://docs.nvidia.com/cuda/pdf/ CUDA_C_Programming_Guide.pdf.Google Scholar
- OpenMP Architecture Review Board. OpenMP application programming interface version 4.5, November 2015.Google Scholar
- M. S. Orr, S. Che, A. Yilmazer, B. M. Beckmann, M. D. Hill, and D. A. Wood. Synchronization using remote-scope promotion. In ASPLOS, pages 73–86. ACM, 2015. Google ScholarDigital Library
- S. Pai, M. J. Thazhuthaveetil, and R. Govindarajan. Improving GPGPU concurrency with elastic kernels. In ASPLOS, pages 407–418. ACM, 2013. Google ScholarDigital Library
- Y. Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Publishing, 2009.Google Scholar
- T. Sorensen and A. F. Donaldson. The hitchhiker’s guide to cross-platform OpenCL application development. IWOCL, pages 2:1–2:12. ACM, 2016. Google ScholarDigital Library
- Y. Torres, A. Gonzalez-Escribano, and D. Llanos. Understanding the impact of CUDA tuning techniques for Fermi. In High Performance Computing and Simulation (HPCS), pages 631–639, 2011.Google ScholarCross Ref
- S. Tzeng, A. Patney, and J. D. Owens. Task management for irregular-parallel workloads on the GPU. In HPG, pages 29– 37, 2010. Google ScholarDigital Library
- B. Wu, G. Chen, D. Li, X. Shen, and J. Vetter. Enabling and exploiting flexible task assignment on GPU through SMcentric program transformations. In ICS, pages 119–130. ACM, 2015. Google ScholarDigital Library
- S. Xiao and W. Feng. Inter-block GPU communication via fast barrier synchronization. In IPDPS, pages 1–12. IEEE, 2010.Google Scholar
Index Terms
- Portable inter-workgroup barrier synchronisation for GPUs
Recommendations
Portable inter-workgroup barrier synchronisation for GPUs
OOPSLA '16Despite the growing popularity of GPGPU programming, there is not yet a portable and formally-specified barrier that one can use to synchronise across workgroups. Moreover, the occupancy-bound execution model of GPUs breaks assumptions inherent in ...
Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs
SCC '12: Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and AnalysisOpenCL (Open Computing Language) is a framework for general-purpose parallel programming. Programs written in OpenCL are functionally portable across multiple processors including CPUs, GPUs, and also FPGAs. Using an auto-tuning technique makes ...
An OpenCL micro-benchmark suite for GPUs and CPUs
Open computing language (OpenCL) is a new industry standard for task-parallel and data-parallel heterogeneous computing on a variety of modern CPUs, GPUs, DSPs, and other microprocessor designs. OpenCL is vendor independent and hence not specialized for ...
Comments