DawnCC: Automatic Annotation for Data Parallelism and Offloading

Published: 26 May 2017

Abstract

Directive-based programming models, such as OpenACC and OpenMP, allow developers to convert a sequential program into a parallel one with minimal human intervention. However, inserting pragmas into production code is a difficult and error-prone task that often requires familiarity with the target program. This difficulty restricts the ability of developers to annotate code that they have not written themselves. This article provides a suite of compiler-related methods to mitigate this problem. These techniques rely on symbolic range analysis, a well-known static technique, to achieve two purposes: to populate source code with data transfer primitives and to disambiguate pointers that could hinder automatic parallelization due to aliasing. We have materialized our ideas into a tool, DawnCC, which can be used stand-alone or through an online interface. To demonstrate its effectiveness, we show how DawnCC can annotate the programs available in PolyBench without any intervention from users. Such annotations lead to speedups of over 100× on an Nvidia architecture and over 50× on an ARM architecture.
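To make the two purposes concrete, the following is a minimal hand-written sketch in C. It illustrates the general idea and is not output produced by DawnCC. It assumes that a range analysis has determined that the loop touches `x[0..n-1]` and `y[0..n-1]`. Those symbolic bounds are used twice: once to size the OpenACC data transfer clauses, and once to build a run-time overlap test that guards the parallel region. The names `ranges_overlap`, `scale_add`, and `no_alias` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the byte ranges [a, a+na) and
 * [b, b+nb) share at least one byte, and 0 otherwise. */
static int ranges_overlap(const void *a, size_t na, const void *b, size_t nb) {
    uintptr_t a0 = (uintptr_t)a;
    uintptr_t b0 = (uintptr_t)b;
    return a0 < b0 + nb && b0 < a0 + na;
}

/* Computes y[i] += k * x[i] for i in [0, n). */
void scale_add(float *x, float *y, int n, float k) {
    assert(n >= 0);
    size_t bytes = (size_t)n * sizeof(float);

    /* Assumed symbolic bounds: the loop reads x[0..n-1] and
     * reads and writes y[0..n-1]. If the two regions do not overlap,
     * the iterations are independent of each other. */
    int no_alias = !ranges_overlap(x, bytes, y, bytes);

    /* The same bounds size the transfers: x is only read (copyin),
     * y is read and written (copy). When no_alias is false, the
     * if clauses keep the loop on the host. */
    #pragma acc data copyin(x[0:n]) copy(y[0:n]) if(no_alias)
    #pragma acc kernels if(no_alias)
    #pragma acc loop independent
    for (int i = 0; i < n; i++) {
        y[i] = y[i] + k * x[i];
    }
}
```

A compiler without OpenACC support ignores the pragmas, so the function still computes the same result sequentially. Calling `scale_add(x, y, 4, 2.0f)` on two distinct four-element arrays takes the guarded path, whereas passing overlapping pointers such as `x` and `x + 1` makes `no_alias` false and leaves the loop on the host.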

      • Published in

        ACM Transactions on Architecture and Code Optimization  Volume 14, Issue 2
        June 2017
        259 pages
        ISSN:1544-3566
        EISSN:1544-3973
        DOI:10.1145/3086564

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 26 May 2017
        • Accepted: 1 April 2017
        • Revised: 1 March 2017
        • Received: 1 November 2016

        Qualifiers

        • research-article
        • Research
        • Refereed