DOI: 10.1145/3123939.3123978
research-article

VersaPipe: A Versatile Programming Framework for Pipelined Computing on GPU

Published: 14 October 2017

ABSTRACT

Pipelining is an important programming pattern, but GPUs, designed mostly for data-level parallel execution, lack an efficient mechanism to support pipeline programming and execution. This paper systematically examines existing pipeline execution models on GPU and analyzes their strengths and weaknesses. To address their shortcomings, it then proposes three new execution models with much improved controllability, including a hybrid model that combines the strengths of all of them. These insights lead to a software programming framework named VersaPipe. With VersaPipe, users only need to write the operations for each pipeline stage; VersaPipe then automatically assembles the stages into a hybrid execution model and configures it to achieve the best performance. Experiments on a set of pipeline benchmarks and a real-world face detection application show that VersaPipe achieves speedups of up to 6.90X (2.88X on average) over the original manual implementations.


Published in

MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture
October 2017
850 pages
ISBN: 9781450349529
DOI: 10.1145/3123939

Copyright © 2017 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 484 of 2,242 submissions, 22%
