
Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware

Published: 06 July 2009

Abstract

Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in today's desktop and notebook computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture and programs using these instructions may experience reduced performance due to the way branch execution is supported in hardware. One solution is to add a stack to allow different SIMD processing elements to execute distinct program paths after a branch instruction. The occurrence of diverging branch outcomes for different processing elements significantly degrades performance using this approach. In this article, we propose dynamic warp formation and scheduling, a mechanism for more efficient SIMD branch execution on GPUs. It dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes. We show that a realistic hardware implementation of this mechanism improves performance by 13%, on average, with 256 threads per core, 24% with 512 threads, and 47% with 768 threads for an estimated area increase of 8%.
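The contrast the abstract draws between stack-based branch handling and regrouping threads after divergence can be illustrated with a toy counting model. The sketch below is not the hardware mechanism evaluated in the article. It assumes a single two-way branch, a fixed warp size, and no hardware constraints on which threads may share a warp. The function names `issue_slots_stack` and `issue_slots_regrouped` are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from math import ceil


def issue_slots_stack(outcomes, warp_size):
    """Count warp issue slots when each fixed warp runs its divergent paths one after another.

    Toy assumption: a warp needs one pass per distinct branch outcome among its threads.
    """
    slots = 0
    for start in range(0, len(outcomes), warp_size):
        warp = outcomes[start:start + warp_size]
        slots += len(set(warp))  # one serialized pass per path present in this warp
    return slots


def issue_slots_regrouped(outcomes, warp_size):
    """Count warp issue slots when threads on the same path are packed into new warps.

    Toy assumption: any threads with the same outcome may be grouped together.
    """
    per_path = Counter(outcomes)
    return sum(ceil(count / warp_size) for count in per_path.values())


if __name__ == "__main__":
    # 16 threads whose branch outcomes alternate, so every fixed warp of 4 diverges.
    outcomes = ["taken", "not_taken"] * 8
    print("fixed warps:    ", issue_slots_stack(outcomes, warp_size=4))
    print("regrouped warps:", issue_slots_regrouped(outcomes, warp_size=4))
```

In this example every fixed warp holds threads on both paths, so it is issued twice with half of its lanes idle, while regrouping fills each issued warp with threads on a single path. When no warp diverges, the two counts are equal, which matches the abstract's point that the benefit comes from divergent branches.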

    • Published in

      ACM Transactions on Architecture and Code Optimization  Volume 6, Issue 2
      June 2009
      137 pages
      ISSN:1544-3566
      EISSN:1544-3973
      DOI:10.1145/1543753

      Copyright © 2009 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 6 July 2009
      • Accepted: 1 December 2008
      • Revised: 1 November 2008
      • Received: 1 April 2008


      Qualifiers

      • research-article
      • Research
      • Refereed
