Designing a Tunable Nested Data-Parallel Programming System

Abstract
This article describes Surge, a nested data-parallel programming system designed to simplify the porting and tuning of parallel applications to multiple target architectures. Surge decouples the high-level specification of computations, expressed through a C++ programming interface, from low-level implementation details using two first-class constructs: schedules and policies. Schedules describe the valid ways in which data-parallel operators may be implemented, while policies encapsulate a set of parameters that govern platform-specific code generation. These two mechanisms are used to implement a code generation system that analyzes computations and automatically generates a search space of valid platform-specific implementations. An input- and architecture-adaptive autotuning system then explores this search space to find optimized implementations. We express five real-world benchmarks from domains such as machine learning and sparse linear algebra in Surge, and from these high-level specifications, Surge automatically generates CPU and GPU implementations that perform on par with or better than manually optimized versions.