DOI: 10.1145/2983990.2984032
research-article

Portable inter-workgroup barrier synchronisation for GPUs

Published: 19 October 2016

ABSTRACT

Despite the growing popularity of GPGPU programming, there is not yet a portable and formally-specified barrier that one can use to synchronise across workgroups. Moreover, the occupancy-bound execution model of GPUs breaks assumptions inherent in traditional software execution barriers, exposing them to deadlock. We present an occupancy discovery protocol that dynamically discovers a safe estimate of the occupancy for a given GPU and kernel, allowing for a starvation-free (and hence, deadlock-free) inter-workgroup barrier by restricting the number of workgroups according to this estimate. We implement this idea by adapting an existing, previously non-portable, GPU inter-workgroup barrier to use OpenCL 2.0 atomic operations, and prove that the barrier meets its natural specification in terms of synchronisation.
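The idea of restricting a barrier to a discovered set of co-resident workgroups can be illustrated with a small host-side simulation. This is a sketch under stated assumptions, not the authors' OpenCL implementation: threads stand in for workgroups, a semaphore models the GPU's limited number of resident workgroup slots, and the poll scheme (workgroups register under a lock until the first registrant closes the poll) is an assumed protocol, since the abstract does not spell out the details. The names `run_kernel`, `slots`, `poll`, and `barrier` are hypothetical, and the counter-based barrier is a stand-in for the adapted barrier the abstract mentions.

```python
import threading


def run_kernel(num_workgroups, occupancy, rounds=3):
    """Simulate occupancy discovery followed by barrier rounds.

    Returns (number of participating workgroups, log of (id, round)).
    """
    slots = threading.Semaphore(occupancy)  # models resident workgroup slots
    lock = threading.Lock()                 # protects the poll and the log
    poll = {"open": True, "count": 0}
    cv = threading.Condition()
    bar = {"arrived": 0, "generation": 0}
    passed = []

    def barrier(n):
        # Counter-based barrier over the n discovered participants.
        with cv:
            gen = bar["generation"]
            bar["arrived"] += 1
            if bar["arrived"] == n:
                bar["arrived"] = 0
                bar["generation"] += 1
                cv.notify_all()
            else:
                while bar["generation"] == gen:
                    cv.wait()

    def workgroup():
        with slots:  # a workgroup only runs once it occupies a slot
            with lock:
                if not poll["open"]:
                    return  # arrived after the poll closed: do not participate
                pid = poll["count"]
                poll["count"] += 1
            with lock:
                poll["open"] = False  # first registrant to get here closes the poll
            n = poll["count"]  # fixed from now on, since the poll is closed
            for r in range(rounds):
                barrier(n)
                with lock:
                    passed.append((pid, r))

    threads = [threading.Thread(target=workgroup) for _ in range(num_workgroups)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return poll["count"], passed
```

Calling `run_kernel(num_workgroups=16, occupancy=4)` launches more workgroups than can be resident at once. Every participant registered while holding a slot and keeps it until the barrier rounds finish, so the discovered count can never exceed the modelled occupancy and the barrier never waits on a workgroup that has not been scheduled. The estimate is safe but not necessarily tight: in this simulation it may well be smaller than the true occupancy.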

We assess the portability of our approach over eight GPUs spanning four vendors, comparing the performance of our method against alternative methods. Our key findings include: (1) the recall of our discovery protocol is nearly 100%; (2) runtime comparisons vary substantially across GPUs and applications; and (3) our method provides portable and safe inter-workgroup synchronisation across the applications we study.


        Published in

        OOPSLA 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications
        October 2016
        915 pages
        ISBN: 9781450344449
        DOI: 10.1145/2983990

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        Overall Acceptance Rate: 268 of 1,244 submissions, 22%
