Portable inter-workgroup barrier synchronisation for GPUs

Authors:
Tyler Sorensen (Imperial College London, UK)
Alastair F. Donaldson (Imperial College London, UK)
Mark Batty (University of Kent, UK)
Ganesh Gopalakrishnan (University of Utah, USA)
Zvonimir Rakamarić (University of Utah, USA)

OOPSLA 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 2016, Pages 39–58. https://doi.org/10.1145/2983990.2984032

Published: 19 October 2016


ABSTRACT

Despite the growing popularity of GPGPU programming, there is not yet a portable and formally-specified barrier that one can use to synchronise across workgroups. Moreover, the occupancy-bound execution model of GPUs breaks assumptions inherent in traditional software execution barriers, exposing them to deadlock. We present an occupancy discovery protocol that dynamically discovers a safe estimate of the occupancy for a given GPU and kernel, allowing for a starvation-free (and hence, deadlock-free) inter-workgroup barrier by restricting the number of workgroups according to this estimate. We implement this idea by adapting an existing, previously non-portable, GPU inter-workgroup barrier to use OpenCL 2.0 atomic operations, and prove that the barrier meets its natural specification in terms of synchronisation.

We assess the portability of our approach over eight GPUs spanning four vendors, comparing the performance of our method against alternative methods. Our key findings include: (1) the recall of our discovery protocol is nearly 100%; (2) runtime comparisons vary substantially across GPUs and applications; and (3) our method provides portable and safe inter-workgroup synchronisation across the applications we study.
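
The abstract does not spell out how the discovery protocol works, so the following is only a minimal CPU-side sketch of the general idea: count the workgroups that are demonstrably running, then restrict barrier participation to that count. It assumes a lock-protected poll that workgroups join until one of them closes it. Python threads stand in for workgroups, and the names `OccupancyPoll`, `CounterBarrier`, and `run_kernel` are hypothetical. A plain counter barrier is swapped in for the barrier the paper adapts, and a condition variable replaces the atomic spinning a GPU version would use.

```python
import threading


class OccupancyPoll:
    """Lock-protected poll (hypothetical model, not the paper's OpenCL code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._open = True
        self._count = 0

    def join(self):
        """Return a participant id, or None if the poll has already closed."""
        with self._lock:
            if not self._open:
                return None
            pid = self._count
            self._count += 1
            return pid

    def close(self):
        """Close the poll (idempotent) and return the participant count."""
        with self._lock:
            self._open = False
            return self._count


class CounterBarrier:
    """Reusable counter barrier; a GPU version would spin on atomics instead."""

    def __init__(self):
        self._cond = threading.Condition()
        self._arrived = 0
        self._generation = 0

    def wait(self, n):
        with self._cond:
            generation = self._generation
            self._arrived += 1
            if self._arrived == n:
                # Last arrival resets the counter and releases the others.
                self._arrived = 0
                self._generation += 1
                self._cond.notify_all()
            else:
                self._cond.wait_for(lambda: self._generation != generation)


def run_kernel(num_workgroups, num_items):
    """Run a two-phase computation; return (participant count, result)."""
    poll = OccupancyPoll()
    barrier = CounterBarrier()
    stage1 = [0] * num_items
    stage2 = [0] * num_items

    def workgroup():
        pid = poll.join()
        if pid is None:
            return  # Arrived after the poll closed: exit without ever waiting.
        n = poll.close()  # Every participant sees the same final count.
        # Phase 1: participants share all items in strides of n.
        for i in range(pid, num_items, n):
            stage1[i] = i * i
        barrier.wait(n)  # Only the n discovered participants synchronise.
        # Phase 2: reads a neighbour's phase-1 value, hence the barrier.
        for i in range(pid, num_items, n):
            stage2[i] = stage1[i] + stage1[(i + 1) % num_items]

    threads = [threading.Thread(target=workgroup) for _ in range(num_workgroups)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return poll.close(), stage2


if __name__ == "__main__":
    participants, result = run_kernel(num_workgroups=8, num_items=20)
    print(participants, result)
```

Because the work is strided over however many participants were discovered, the result is the same whether one workgroup or all eight join the poll. In this CPU model the count is often small, since threads start one after another; that is an artefact of the simulation and says nothing about the recall reported for real GPUs.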


Index Terms

1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Concurrent programming languages

Published in
OOPSLA 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications
October 2016, 915 pages
ISBN: 9781450344449
DOI: 10.1145/2983990
General Chair: Eelco Visser (Delft University of Technology, Netherlands)
Program Chair: Yannis Smaragdakis (University of Athens, Greece)

ACM SIGPLAN Notices, Volume 51, Issue 10 (OOPSLA '16)
October 2016, 915 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3022671
Editor: Matthew Fluet
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GPU
OpenCL
barrier
portability
synchronisation
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 268 of 1,244 submissions, 22%