DOI: 10.1145/2155620.2155655
research-article

Hardware transactional memory for GPU architectures

Published: 03 December 2011

ABSTRACT

Graphics processor units (GPUs) are designed to efficiently exploit thread level parallelism (TLP), multiplexing execution of 1000s of concurrent threads on a relatively smaller set of single-instruction, multiple-thread (SIMT) cores to hide various long latency operations. While threads within a CUDA block/OpenCL workgroup can communicate efficiently through an intra-core scratchpad memory, threads in different blocks can only communicate via global memory accesses. Programmers wishing to exploit such communication have to consider data-races that may occur when multiple threads modify the same memory location. Recent GPUs provide a form of inter-block communication through atomic operations for single 32-bit/64-bit words. Although fine-grained locks can be constructed from these atomic operations, synchronization using locks is prone to deadlock. In this paper, we propose to solve these problems by extending GPUs to support transactional memory (TM). Major challenges include supporting 1000s of concurrent transactions and committing non-conflicting transactions in parallel. We propose KILO TM, a novel hardware TM design for GPUs that scales to 1000s of concurrent transactions. Without cache coherency hardware to depend on, it uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead. It employs speculative validation using a novel bloom filter organization to increase transaction commit parallelism. For a set of TM-enhanced GPU applications, KILO TM captures 59% of the performance of fine-grained locking, and is on average 128x faster than executing all transactions serially, for an estimated hardware area overhead of 0.5% of a commercial GPU.
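The word-level, value-based conflict detection mentioned in the abstract can be illustrated with a small software sketch. This is not KILO TM's hardware design: it is a single-threaded Python toy, and the names `Transaction`, `read_log`, `write_log`, and `transfer` are hypothetical. It assumes that validation and write-back at commit happen atomically, which holds trivially here because nothing runs concurrently. Each transaction records the value it observed for every word it read, buffers its writes, and at commit time compares the logged values against memory; any mismatch means another transaction changed a word it depended on, so it must retry.

```python
class Transaction:
    """Toy transaction with value-based validation at commit time."""

    def __init__(self, memory):
        self.memory = memory   # shared word-addressed memory (a dict in this sketch)
        self.read_log = {}     # address -> value observed on first read
        self.write_log = {}    # address -> value to publish at commit

    def read(self, addr):
        if addr in self.write_log:          # a transaction sees its own writes
            return self.write_log[addr]
        if addr not in self.read_log:       # log the first observed value
            self.read_log[addr] = self.memory[addr]
        return self.read_log[addr]

    def write(self, addr, value):
        self.write_log[addr] = value        # buffered, not yet visible to others

    def commit(self):
        # Validation: every word read must still hold the value that was seen.
        for addr, seen in self.read_log.items():
            if self.memory[addr] != seen:
                return False                # conflict detected: caller retries
        # No conflict: publish the buffered writes.
        self.memory.update(self.write_log)
        return True


def transfer(memory, src, dst, amount):
    """Re-execute the transaction until it commits."""
    while True:
        tx = Transaction(memory)
        tx.write(src, tx.read(src) - amount)
        tx.write(dst, tx.read(dst) + amount)
        if tx.commit():
            return
```

In this sketch, two transactions that both read and update the same word conflict: whichever commits first changes the value, and the second fails validation and has to re-execute. Comparing values rather than tracking ownership is what lets a scheme like this work without coherence hardware or broadcasts, at the cost of re-reading the logged words at commit.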

Published in

MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
December 2011
519 pages
ISBN: 9781450310536
DOI: 10.1145/2155620
Conference Chair: Carlo Galuzzi
General Chair: Luigi Carro
Program Chairs: Andreas Moshovos, Milos Prvulovic

Copyright © 2011 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 484 of 2,242 submissions (22%)
