Abstract
Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in today's desktop and notebook computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture, and programs using these instructions may experience reduced performance due to the way branch execution is supported in hardware. One solution is to add a stack that allows different SIMD processing elements to execute distinct program paths after a branch instruction. With this approach, however, diverging branch outcomes across processing elements significantly degrade performance. In this article, we propose dynamic warp formation and scheduling, a mechanism for more efficient SIMD branch execution on GPUs. It dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes. We show that a realistic hardware implementation of this mechanism improves performance by 13%, on average, with 256 threads per core, by 24% with 512 threads, and by 47% with 768 threads, for an estimated area increase of 8%.
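The divergence problem and the proposed remedy can be made concrete with a small simulation. The Python sketch below is illustrative only: the paper proposes a hardware mechanism, and the names used here (`form_warps`, `WARP_SIZE`, the thread records) are invented for the example. It shows threads that reach a divergent branch being regrouped into new warps by their next program counter, so that each issued warp keeps its SIMD lanes full.

```python
# Minimal sketch of the idea behind dynamic warp formation.
# Illustrative only -- the paper describes a hardware mechanism;
# all names here are hypothetical.

WARP_SIZE = 4

def form_warps(threads):
    """Group ready threads into SIMD warps by their next PC."""
    by_pc = {}
    for t in threads:
        by_pc.setdefault(t["pc"], []).append(t)
    warps = []
    for pc, group in by_pc.items():
        # Pack threads at the same PC into warps of up to WARP_SIZE lanes.
        for i in range(0, len(group), WARP_SIZE):
            warps.append(group[i:i + WARP_SIZE])
    return warps

# Eight threads reach a divergent branch: even thread IDs take the
# branch (pc = 200), odd thread IDs fall through (pc = 100).
threads = [{"tid": i, "pc": 200 if i % 2 == 0 else 100} for i in range(8)]

for w in form_warps(threads):
    print([t["tid"] for t in w], "at pc", w[0]["pc"])
```

Under a baseline reconvergence stack, each original warp would execute both paths serially with half its lanes masked off on every issue; in the sketch above, regrouping by PC instead yields two full warps, one per path.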