
Designing a Tunable Nested Data-Parallel Programming System

Published: 28 December 2016

Abstract

This article describes Surge, a nested data-parallel programming system designed to simplify the porting and tuning of parallel applications to multiple target architectures. Surge decouples the high-level specification of computations, expressed using a C++ programming interface, from low-level implementation details using two first-class constructs: schedules and policies. Schedules describe the valid ways in which data-parallel operators may be implemented, while policies encapsulate a set of parameters that govern platform-specific code generation. These two mechanisms are used to implement a code generation system that analyzes computations and automatically generates a search space of valid platform-specific implementations. An input- and architecture-adaptive autotuning system then explores this search space to find optimized implementations. We express five real-world benchmarks from domains such as machine learning and sparse linear algebra in Surge, and from these high-level specifications, Surge automatically generates CPU and GPU implementations that perform on par with or better than manually optimized versions.



• Published in

ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 4
December 2016, 648 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3012405

Copyright © 2016 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 28 December 2016
• Accepted: 1 October 2016
• Revised: 1 September 2016
• Received: 1 June 2016


        Qualifiers

        • research-article
        • Research
        • Refereed
