Abstract
Directive-based programming models, such as OpenACC and OpenMP, allow developers to convert a sequential program into a parallel one with minimal human intervention. However, inserting pragmas into production code is a difficult and error-prone task that often requires familiarity with the target program. This difficulty restricts the ability of developers to annotate code that they have not written themselves. This article provides a suite of compiler-related methods to mitigate this problem. These techniques rely on symbolic range analysis, a well-known static technique, to achieve two purposes: to populate source code with data transfer primitives and to disambiguate pointers that could hinder automatic parallelization due to aliasing. We have materialized our ideas into a tool, DawnCC, which can be used stand-alone or through an online interface. To demonstrate its effectiveness, we show how DawnCC can annotate the programs available in PolyBench without any intervention from users. Such annotations lead to speedups of over 100× on an Nvidia architecture and over 50× on an ARM architecture.
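To make the two purposes concrete, the following is a minimal sketch of the kind of annotation such a tool could produce for a simple loop. It is an illustration under stated assumptions, not DawnCC's actual output: the function `saxpy`, the helper `ranges_disjoint`, and the flag `no_alias` are hypothetical names. The array bounds `[0:n]` in the data clauses stand in for what a symbolic range analysis could infer from the loop, and the overlap test stands in for a pointer disambiguation check.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper (not part of DawnCC): returns 1 when the regions
 * [a, a+n) and [b, b+n) do not overlap, and 0 otherwise. */
static int ranges_disjoint(const float *a, const float *b, size_t n) {
    uintptr_t lo_a = (uintptr_t)a, hi_a = (uintptr_t)(a + n);
    uintptr_t lo_b = (uintptr_t)b, hi_b = (uintptr_t)(b + n);
    return hi_a <= lo_b || hi_b <= lo_a;
}

/* Computes y[i] = alpha * x[i] + y[i] for i in [0, n).
 * The loop reads x[0..n-1] and reads/writes y[0..n-1]; those symbolic
 * bounds are what the data clauses below encode. */
void saxpy(float *x, float *y, float alpha, int n) {
    /* Assumed runtime check: offload only when x and y cannot alias. */
    int no_alias = ranges_disjoint(x, y, (size_t)n);

    /* Data transfer: copy x to the device; copy y in and back out. */
    #pragma acc data copyin(x[0:n]) copy(y[0:n]) if(no_alias)
    #pragma acc kernels if(no_alias)
    #pragma acc loop independent
    for (int i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

A compiler without OpenACC support ignores the pragmas and runs the loop sequentially, so the sketch computes the same result either way. When the check fails, the `if` clauses keep execution on the host, where the sequential loop remains valid even if the regions overlap.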
Index Terms
- DawnCC: Automatic Annotation for Data Parallelism and Offloading