ABSTRACT
Porting scientific codes to accelerator-based computers using OpenACC and OpenMP is an important topic for the HPC community. Programmability, performance portability and developer productivity are key issues for the widespread adoption of these systems. In the scope of general-purpose parallel computing, Parallware is a new commercial OpenMP-enabling source-to-source compiler that automatically adds OpenMP directives to scientific programs. Extending Parallware with OpenACC or OpenMP 4.x support would thus help to improve programmability and developer productivity. The performance portability of such an approach, however, still needs to be demonstrated in practice. This paper presents a preliminary study towards extending Parallware with OpenACC support for GPU devices. A simple benchmark suite has been designed to mimic important features and computational patterns of real scientific applications. Hand-coded OpenACC versions are compared to OpenMP versions automatically generated by Parallware. Performance is evaluated with the PGI OpenACC compiler on systems accelerated with NVIDIA GPUs.
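To illustrate the kind of comparison the study describes, the following SAXPY-style sketch (not taken from the paper or its benchmark suite) contrasts the two directive-based models: the OpenMP directive mirrors what an auto-parallelizing tool such as Parallware could emit for a multicore CPU, while the OpenACC directive is a hand-coded GPU-offload counterpart. Function names and clauses are illustrative assumptions only.

/*
 * Hedged sketch: one parallel loop expressed with OpenMP 3.x (CPU) and
 * with OpenACC (GPU offload). Names are hypothetical, not from the paper.
 */
void saxpy_openmp(int n, float a, const float *x, float *y)
{
    /* Shared-memory CPU version: a single directive on the parallel loop. */
    #pragma omp parallel for default(none) shared(n, a, x, y)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

void saxpy_openacc(int n, float a, const float *x, float *y)
{
    /* GPU offload version: data clauses make host-device transfers explicit. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

The hand-coded OpenACC variant must also state which arrays move between host and device memory, which is part of the programmability gap the paper examines.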