Designing a Tunable Nested Data-Parallel Programming System

Abstract
This article describes Surge, a nested data-parallel programming system designed to simplify the porting and tuning of parallel applications to multiple target architectures. Surge decouples the high-level specification of computations, expressed through a C++ programming interface, from low-level implementation details using two first-class constructs: schedules and policies. Schedules describe the valid ways in which data-parallel operators may be implemented, while policies encapsulate a set of parameters that govern platform-specific code generation. These two mechanisms are used to implement a code generation system that analyzes computations and automatically generates a search space of valid platform-specific implementations. An input- and architecture-adaptive autotuning system then explores this search space to find optimized implementations. We express five real-world benchmarks from domains such as machine learning and sparse linear algebra in Surge, and from these high-level specifications, Surge automatically generates CPU and GPU implementations that perform on par with or better than manually optimized versions.