ABSTRACT
Pipelining is an important programming pattern, but GPUs, designed mostly for data-level parallel execution, lack an efficient mechanism to support pipeline programming and execution. This paper systematically examines existing pipeline execution models on GPUs and analyzes their strengths and weaknesses. To address their shortcomings, it proposes three new execution models with much improved controllability, including a hybrid model that combines the strengths of all of them. These insights lead to the development of a software programming framework named VersaPipe. With VersaPipe, users only need to write the operations for each pipeline stage; VersaPipe then automatically assembles the stages into a hybrid execution model and configures it for the best performance. Experiments on a set of pipeline benchmarks and a real-world face detection application show that VersaPipe produces speedups of up to 6.90X (2.88X on average) over the original manual implementations.
Index Terms
- Versapipe: a versatile programming framework for pipelined computing on GPU