Abstract
Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in today's desktop and notebook computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture, and programs using these instructions may experience reduced performance due to the way branch execution is supported in hardware. One solution is to add a stack that allows different SIMD processing elements to execute distinct program paths after a branch instruction. With this approach, however, diverging branch outcomes across processing elements significantly degrade performance. In this article, we propose dynamic warp formation and scheduling, a mechanism for more efficient SIMD branch execution on GPUs. It dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes. We show that a realistic hardware implementation of this mechanism improves performance by 13%, on average, with 256 threads per core, by 24% with 512 threads, and by 47% with 768 threads, for an estimated area increase of 8%.
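The divergence problem and the proposed remedy can be made concrete with a small simulation. The Python sketch below is illustrative only: the paper proposes a hardware mechanism, and the names used here (`form_warps`, `WARP_SIZE`, the thread records) are invented for the example. It shows threads that reach a divergent branch being regrouped into new warps by their next program counter, so that each issued warp keeps its SIMD lanes full.

```python
# Minimal sketch of the idea behind dynamic warp formation.
# Illustrative only -- the paper describes a hardware mechanism;
# all names here are hypothetical.

WARP_SIZE = 4

def form_warps(threads):
    """Group ready threads into SIMD warps by their next PC."""
    by_pc = {}
    for t in threads:
        by_pc.setdefault(t["pc"], []).append(t)
    warps = []
    for pc, group in by_pc.items():
        # Pack threads at the same PC into warps of up to WARP_SIZE lanes.
        for i in range(0, len(group), WARP_SIZE):
            warps.append(group[i:i + WARP_SIZE])
    return warps

# Eight threads reach a divergent branch: even thread IDs take the
# branch (pc = 200), odd thread IDs fall through (pc = 100).
threads = [{"tid": i, "pc": 200 if i % 2 == 0 else 100} for i in range(8)]

for w in form_warps(threads):
    print([t["tid"] for t in w], "at pc", w[0]["pc"])
```

Under a baseline reconvergence stack, each original warp would execute both paths serially with half its lanes masked off on every issue; in the sketch above, regrouping by PC instead yields two full warps, one per path.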