VersaPipe: a versatile programming framework for pipelined computing on GPU

Authors:
Zhen Zheng, Tsinghua University
Chanyoung Oh, University of Seoul
Jidong Zhai, Tsinghua University
Xipeng Shen, North Carolina State University
Youngmin Yi, University of Seoul
Wenguang Chen, Tsinghua University

MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, October 2017, Pages 587–599
https://doi.org/10.1145/3123939.3123978

Published: 14 October 2017

ABSTRACT

Pipelining is an important programming pattern, yet GPUs, designed mostly for data-level parallel execution, lack an efficient mechanism to support pipeline programming and execution. This paper systematically examines existing pipeline execution models on GPU and analyzes their strengths and weaknesses. To address their shortcomings, it proposes three new execution models with much improved controllability, including a hybrid model that can combine the strengths of all of them. These insights lead to a software programming framework named VersaPipe. With VersaPipe, users only need to write the operations for each pipeline stage; VersaPipe then automatically assembles the stages into a hybrid execution model and configures it to achieve the best performance. Experiments on a set of pipeline benchmarks and a real-world face detection application show that VersaPipe produces speedups of up to 6.90X (2.88X on average) over the original manual implementations.
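
To make the "write only the per-stage operations" idea concrete, the following is a minimal CPU-only sketch in Python. It is not VersaPipe's API and does not reproduce any of the GPU execution models the paper studies. It assumes that each stage is a plain function mapping one input item to zero or more output items, and that a single FIFO task queue drives all stages. The names run_pipeline, split, square, and keep_even are hypothetical and exist only for illustration.

```python
from collections import deque


def run_pipeline(stages, inputs):
    """Run items through a linear pipeline using one shared task queue.

    Each task is a (stage_index, item) pair. A stage returns an iterable
    of outputs, so it may filter (zero outputs) or fan out (several).
    """
    tasks = deque((0, item) for item in inputs)
    results = []
    while tasks:
        stage_idx, item = tasks.popleft()
        for out in stages[stage_idx](item):
            if stage_idx + 1 < len(stages):
                # Hand the output to the next stage.
                tasks.append((stage_idx + 1, out))
            else:
                # Output of the last stage is a final result.
                results.append(out)
    return results


# Hypothetical user-written stages: only per-stage logic, no scheduling.
def split(n):
    """Stage 0: fan one item out into n items."""
    return range(n)


def square(x):
    """Stage 1: one input, one output."""
    return [x * x]


def keep_even(x):
    """Stage 2: drop odd values."""
    return [x] if x % 2 == 0 else []


if __name__ == "__main__":
    print(run_pipeline([split, square, keep_even], [3, 4]))  # prints [0, 4, 0, 4]
```

The stage functions contain no scheduling logic; the driver decides when and in what order tasks run. That separation is what would let a framework swap in a different execution strategy without changing the user-written stages.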

Published in
MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture
October 2017
850 pages
ISBN: 9781450349529
DOI: 10.1145/3123939
General Chairs: Hillery Hunter (IBM Research), Jaime Moreno (IBM Research)
Program Chairs: Joel Emer (NVIDIA and MIT), Daniel Sanchez (MIT)
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GPU
pipelined computing
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 484 of 2,242 submissions, 22%