ABSTRACT
Graphics processing units (GPUs) are designed to efficiently exploit thread-level parallelism (TLP), multiplexing the execution of thousands of concurrent threads onto a relatively small set of single-instruction, multiple-thread (SIMT) cores to hide various long-latency operations. While threads within a CUDA block/OpenCL workgroup can communicate efficiently through an intra-core scratchpad memory, threads in different blocks can only communicate via global memory accesses. Programmers wishing to exploit such communication have to consider data races that may occur when multiple threads modify the same memory location. Recent GPUs provide a form of inter-block communication through atomic operations on single 32-bit/64-bit words. Although fine-grained locks can be constructed from these atomic operations, synchronization using locks is prone to deadlock. In this paper, we propose to solve these problems by extending GPUs to support transactional memory (TM). Major challenges include supporting thousands of concurrent transactions and committing non-conflicting transactions in parallel. We propose KILO TM, a novel hardware TM design for GPUs that scales to thousands of concurrent transactions. Without cache coherency hardware to depend on, it uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead. It employs speculative validation using a novel Bloom filter organization to increase transaction commit parallelism. For a set of TM-enhanced GPU applications, KILO TM captures 59% of the performance of fine-grained locking and is on average 128x faster than executing all transactions serially, for an estimated hardware area overhead of 0.5% of a commercial GPU.
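To make the idea of word-level, value-based conflict detection concrete, the following is a minimal software sketch in Python. It is not the KILO TM hardware design: the `Transaction` class, the `transfer` helper, and the dictionary standing in for global memory are all hypothetical names introduced here for illustration. The sketch assumes that `commit` runs atomically and models neither parallel commits nor Bloom filters.

```python
# Toy model of word-level, value-based conflict detection (illustrative only).
# All names here are assumptions for this sketch, not part of KILO TM.

class Transaction:
    def __init__(self, memory):
        self.memory = memory   # shared "global memory": address -> word
        self.read_log = {}     # address -> value first observed by this transaction
        self.write_log = {}    # address -> buffered new value (not yet visible)

    def read(self, addr):
        if addr in self.write_log:             # see our own buffered write
            return self.write_log[addr]
        value = self.memory[addr]
        self.read_log.setdefault(addr, value)  # remember the first value seen
        return value

    def write(self, addr, value):
        self.write_log[addr] = value           # buffer until commit

    def commit(self):
        # Validation: every word we read must still hold the value we saw.
        for addr, seen in self.read_log.items():
            if self.memory[addr] != seen:
                return False                   # conflict detected: abort
        self.memory.update(self.write_log)     # publish buffered writes
        return True


def transfer(memory, src, dst, amount):
    """Build (but do not commit) a transaction that moves `amount` words."""
    tx = Transaction(memory)
    tx.write(src, tx.read(src) - amount)
    tx.write(dst, tx.read(dst) + amount)
    return tx


memory = {0: 100, 1: 50}
t1 = transfer(memory, 0, 1, 10)   # both transactions read the same old values
t2 = transfer(memory, 0, 1, 20)
ok1 = t1.commit()                 # succeeds: nothing changed since t1 read
ok2 = t2.commit()                 # fails: word 0 no longer holds the value t2 read
t3 = transfer(memory, 0, 1, 20)   # retry the aborted work from scratch
ok3 = t3.commit()
```

In this sketch a conflict is detected purely by comparing logged values against current memory contents, so no ownership metadata or invalidation broadcast is needed. The aborted transaction is simply re-executed against the updated state.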