DOI: 10.1145/2491956.2462176
research-article

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Published: 16 June 2013

ABSTRACT

Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values.
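The tension described above can be illustrated with a toy two-stage 1-D blur in plain Python (not Halide). This is a minimal sketch under stated assumptions: the 3-tap box filter, the function names, and the evaluation counters are hypothetical and do not come from the paper. Computing the first stage in full before the second does the least work but needs a full intermediate buffer; inlining the first stage into the second needs almost no storage but recomputes values that neighbouring outputs share.

```python
def blur_stage(src, i):
    """3-tap box filter at index i (reads src[i], src[i+1], src[i+2])."""
    return (src[i] + src[i + 1] + src[i + 2]) / 3.0


def breadth_first(inp):
    """Compute all of stage 1, store it, then compute stage 2 from the buffer."""
    n1 = len(inp) - 2
    stage1 = [blur_stage(inp, i) for i in range(n1)]  # full intermediate buffer
    out = [blur_stage(stage1, i) for i in range(n1 - 2)]
    # Returns: result, stage-1 evaluations, intermediate values held at once.
    return out, n1, len(stage1)


def inlined(inp):
    """Recompute the three stage-1 values needed by each stage-2 output."""
    n_out = len(inp) - 4
    evals = 0
    out = []
    for i in range(n_out):
        window = []
        for k in range(3):
            window.append(blur_stage(inp, i + k))  # shared values are recomputed
            evals += 1
        out.append(blur_stage(window, 0))
    # Only a 3-value window is ever live.
    return out, evals, 3
```

For a 10-element input both strategies produce the same 6 outputs. The breadth-first version evaluates stage 1 eight times and holds eight intermediate values, while the inlined version evaluates stage 1 eighteen times and holds three.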

We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.
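The idea of keeping the algorithm fixed while searching over schedules can be sketched in plain Python. Everything here is an illustrative assumption rather than the paper's method: the schedule is reduced to a single hypothetical `tile` parameter, `toy_cost` is an invented cost model with an arbitrary weight, and uniform random sampling stands in for the stochastic search mentioned in the abstract.

```python
import random


def blur3(src, i):
    """3-tap box filter at index i."""
    return (src[i] + src[i + 1] + src[i + 2]) / 3.0


def run_tiled(inp, tile):
    """Two-stage blur. `tile` (the schedule) sets how many outputs share one
    stage-1 buffer; the result does not depend on it."""
    n_out = len(inp) - 4
    out, evals, peak = [], 0, 0
    for start in range(0, n_out, tile):
        width = min(tile, n_out - start)
        # Stage 1 for this tile only, including 2 values that overlap the next tile.
        stage1 = [blur3(inp, start + k) for k in range(width + 2)]
        evals += len(stage1)
        peak = max(peak, len(stage1))
        out.extend(blur3(stage1, k) for k in range(width))
    return out, evals, peak


def toy_cost(evals, peak, storage_weight=4.0):
    """Hypothetical cost: recomputation plus a weighted storage penalty."""
    return evals + storage_weight * peak


def random_search(inp, trials=50, seed=0):
    """Sample tile sizes at random and keep the cheapest under toy_cost."""
    rng = random.Random(seed)
    n_out = len(inp) - 4
    best = None
    for _ in range(trials):
        tile = rng.randint(1, n_out)
        _, evals, peak = run_tiled(inp, tile)
        cost = toy_cost(evals, peak)
        if best is None or cost < best[0]:
            best = (cost, tile)
    return best[1], best[0]
```

Small tiles recompute more overlapping stage-1 values but keep the buffer short, while one large tile does minimal work with a long buffer. In a real setting the cost would presumably come from measured run time; the weighted sum here is only a stand-in.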

      Published in

      PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation
      June 2013, 546 pages
      ISBN: 9781450320146
      DOI: 10.1145/2491956

      ACM SIGPLAN Notices, Volume 48, Issue 6 (PLDI '13)
      June 2013, 515 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/2499370

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      PLDI '13 Paper Acceptance Rate: 46 of 267 submissions, 17%
      Overall Acceptance Rate: 406 of 2,067 submissions, 20%
