DOI: 10.1145/2832105.2832112

Experiences in extending Parallware to support OpenACC

Published: 15 November 2015

ABSTRACT

Porting scientific codes to accelerator-based computers using OpenACC and OpenMP is an important topic for the HPC community. Programmability, performance portability and developer productivity are key issues for the widespread use of these systems. In the scope of general-purpose parallel computing, Parallware is a new commercial OpenMP-enabling source-to-source compiler that automatically adds OpenMP capabilities to scientific programs. Thus, extending Parallware with OpenACC or OpenMP 4.x support would help improve programmability and developer productivity. However, the performance portability of such an approach needs to be demonstrated in practice. This paper presents a preliminary study on extending Parallware with OpenACC support for GPU devices. A simple benchmark suite has been designed to mimic important features and computational patterns of real scientific applications. Hand-coded OpenACC versions are compared to OpenMP versions automatically generated by Parallware. Performance is evaluated with the PGI OpenACC compiler on systems accelerated with NVIDIA GPUs.
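
To make the kind of code under discussion concrete, the following sketch shows the same simple SAXPY-style loop annotated two ways: with an OpenMP directive, roughly what an auto-parallelizing source-to-source tool such as Parallware could emit for multicore CPUs, and with a hand-written OpenACC directive that offloads the loop to a GPU. The kernel, function names and data clauses are illustrative assumptions for this sketch, not code taken from the paper or its benchmark suite.

/* Illustrative sketch only; names and clauses are assumptions, not the
 * paper's benchmarks. Build, for example, with the PGI compilers:
 *   pgcc -mp  saxpy.c   (OpenMP, multicore CPU)
 *   pgcc -acc saxpy.c   (OpenACC, NVIDIA GPU)
 */
#include <stdio.h>
#include <stdlib.h>

/* OpenMP version: loop-level parallelism across CPU threads, similar in
 * spirit to the directives an OpenMP-enabling tool could insert. */
void saxpy_omp(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Hand-coded OpenACC version: the loop is offloaded to an accelerator and
 * the data clauses make host-device transfers explicit. */
void saxpy_acc(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    enum { N = 1 << 20 };
    float *x = malloc(N * sizeof *x);
    float *y = malloc(N * sizeof *y);
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy_omp(N, 3.0f, x, y);        /* y[i] = 3*1 + 2 = 5 */
    saxpy_acc(N, 3.0f, x, y);        /* y[i] = 3*1 + 5 = 8 */
    printf("y[0] = %.1f\n", y[0]);   /* expected: 8.0 */

    free(x);
    free(y);
    return 0;
}

The two versions express the same loop-level parallelism, but the OpenACC one additionally has to describe data movement between host and device memory, which is one aspect of the programmability and performance-portability trade-off the paper examines.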


Published in

WACCPD '15: Proceedings of the Second Workshop on Accelerator Programming using Directives
November 2015, 68 pages
ISBN: 9781450340144
DOI: 10.1145/2832105

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

WACCPD '15 paper acceptance rate: 7 of 14 submissions (50%). Overall acceptance rate: 7 of 14 submissions (50%).
