DawnCC: Automatic Annotation for Data Parallelism and Offloading

Authors:
Gleison Mendonça, UFMG
Breno Guimarães, UFMG
Péricles Alves, UFMG
Márcio Pereira, Unicamp
Guido Araújo, Unicamp
Fernando Magno Quintão Pereira, UFMG

ACM Transactions on Architecture and Code Optimization, Volume 14, Issue 2, Article No. 13, pp. 1–25. https://doi.org/10.1145/3084540

Published: 26 May 2017

Abstract

Directive-based programming models, such as OpenACC and OpenMP, allow developers to convert a sequential program into a parallel one with minimal human intervention. However, inserting pragmas into production code is a difficult and error-prone task that often requires familiarity with the target program. This difficulty restricts the ability of developers to annotate code that they have not written themselves. This article provides a suite of compiler-related methods to mitigate this problem. These techniques rely on symbolic range analysis, a well-known static technique, to achieve two purposes: to populate source code with data transfer primitives and to disambiguate pointers that could hinder automatic parallelization due to aliasing. We have materialized our ideas into a tool, DawnCC, which can be used stand-alone or through an online interface. To demonstrate its effectiveness, we show how DawnCC can annotate the programs available in PolyBench without any intervention from users. Such annotations lead to speedups of over 100× on an Nvidia architecture and over 50× on an ARM architecture.
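To make the two uses of range analysis concrete, the following is a hand-written sketch of the kind of annotation the abstract describes. It is not output produced by DawnCC, and the names `scale_add` and `ranges_overlap` are hypothetical. It assumes an analysis has inferred that the loop touches `a[0..n-1]` and `b[0..n-1]`; those bounds size the OpenACC data clauses and feed a runtime overlap check that guards the offloaded region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when the n-element regions starting at
 * p and q share at least one byte, 0 otherwise. */
static int ranges_overlap(const float *p, const float *q, size_t n) {
    uintptr_t pl = (uintptr_t)p;
    uintptr_t ql = (uintptr_t)q;
    uintptr_t bytes = (uintptr_t)(n * sizeof(float));
    return pl < ql + bytes && ql < pl + bytes;
}

/* Hypothetical kernel: a[i] += 2 * b[i] for i in [0, n).
 * The bounds [0, n) stand in for what a symbolic range analysis
 * could infer from the loop's induction variable. */
void scale_add(float *a, const float *b, int n) {
    assert(n >= 0);

    /* Pointer disambiguation: offload only if the regions are disjoint. */
    int may_alias = ranges_overlap(a, b, (size_t)n);

    /* Data transfer: copy a in and out, copy b in, sized by the bounds.
     * When may_alias is nonzero, the if clauses keep execution on the
     * host, where the loop runs sequentially. */
    #pragma acc data copy(a[0:n]) copyin(b[0:n]) if(!may_alias)
    {
        #pragma acc kernels if(!may_alias)
        {
            #pragma acc loop independent
            for (int i = 0; i < n; i++) {
                a[i] = a[i] + 2.0f * b[i];
            }
        }
    }
}
```

A compiler without OpenACC support ignores the pragmas, so the function still produces the sequential result. With disjoint arrays the region is eligible for offloading; with overlapping arguments such as `scale_add(c + 2, c, 4)` the guard forces the host path.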

Index Terms

DawnCC: Automatic Annotation for Data Parallelism and Offloading
1. Computing methodologies
  1. Parallel computing methodologies
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published in
ACM Transactions on Architecture and Code Optimization, Volume 14, Issue 2
June 2017
259 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3086564
Editor: Koen De Bosschere, Ghent University
Copyright © 2017 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Publication History
- Published: 26 May 2017
- Accepted: 1 April 2017
- Revised: 1 March 2017
- Received: 1 November 2016
Author Tags
Automatic parallelization
static analysis
Qualifiers
- research-article
- Research
- Refereed