Hardware transactional memory for GPU architectures

Authors:
Wilson W. L. Fung, University of British Columbia
Inderpreet Singh, University of British Columbia
Andrew Brownsword, University of British Columbia
Tor M. Aamodt, University of British Columbia

MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, December 2011, Pages 296–307
https://doi.org/10.1145/2155620.2155655

Published: 3 December 2011

ABSTRACT

Graphics processor units (GPUs) are designed to efficiently exploit thread-level parallelism (TLP), multiplexing execution of 1000s of concurrent threads on a relatively smaller set of single-instruction, multiple-thread (SIMT) cores to hide various long-latency operations. While threads within a CUDA block/OpenCL workgroup can communicate efficiently through an intra-core scratchpad memory, threads in different blocks can only communicate via global memory accesses. Programmers wishing to exploit such communication have to consider data races that may occur when multiple threads modify the same memory location. Recent GPUs provide a form of inter-block communication through atomic operations for single 32-bit/64-bit words. Although fine-grained locks can be constructed from these atomic operations, synchronization using locks is prone to deadlock. In this paper, we propose to solve these problems by extending GPUs to support transactional memory (TM). Major challenges include supporting 1000s of concurrent transactions and committing non-conflicting transactions in parallel. We propose KILO TM, a novel hardware TM design for GPUs that scales to 1000s of concurrent transactions. Without cache coherency hardware to depend on, it uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead. It employs speculative validation using a novel Bloom filter organization to increase transaction commit parallelism. For a set of TM-enhanced GPU applications, KILO TM captures 59% of the performance of fine-grained locking, and is on average 128x faster than executing all transactions serially, for an estimated hardware area overhead of 0.5% of a commercial GPU.
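The idea of word-level, value-based conflict detection can be illustrated with a small sequential model. The sketch below is not the KILO TM hardware design. It only shows the general technique: a transaction logs the value of each word it reads, buffers its writes, and at commit time re-reads the logged words and aborts if any value has changed. The names `Transaction`, `read_log`, `write_log`, and `transfer` are hypothetical, and the model assumes that validation and write-back happen as one atomic step, which a real system would have to enforce.

```python
class Transaction:
    """Buffers reads and writes against a shared, word-addressed memory."""

    def __init__(self, memory):
        self.memory = memory   # shared dict: address -> word value
        self.read_log = {}     # address -> value first observed
        self.write_log = {}    # address -> value to publish at commit

    def read(self, addr):
        # A transaction sees its own pending writes first.
        if addr in self.write_log:
            return self.write_log[addr]
        value = self.memory[addr]
        # Keep only the first observed value; it is checked at commit.
        self.read_log.setdefault(addr, value)
        return value

    def write(self, addr, value):
        # Writes stay private until the transaction commits.
        self.write_log[addr] = value

    def commit(self):
        # Value-based validation: compare each logged value with memory.
        # Assumption: this check and the write-back below are atomic.
        for addr, seen in self.read_log.items():
            if self.memory[addr] != seen:
                return False   # conflict detected: abort, publish nothing
        self.memory.update(self.write_log)
        return True


def transfer(memory, src, dst, amount):
    """Build (but do not commit) a transaction moving `amount` words."""
    tx = Transaction(memory)
    tx.write(src, tx.read(src) - amount)
    tx.write(dst, tx.read(dst) + amount)
    return tx


memory = {0: 100, 1: 50, 2: 0}
tx_a = transfer(memory, 0, 1, 10)   # reads words 0 and 1
tx_b = transfer(memory, 1, 2, 5)    # reads words 1 and 2, overlapping tx_a

committed_a = tx_a.commit()         # validates and publishes its writes
committed_b = tx_b.commit()         # word 1 changed since tx_b read it

# An aborted transaction is simply re-executed against current memory.
retry_ok = transfer(memory, 1, 2, 5).commit()
```

In this model a conflict is detected purely by comparing values, so no per-word ownership or version metadata is kept, and nothing is broadcast to other transactions. The cost is that every committing transaction must re-read its read set.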

Published in
MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
December 2011
519 pages
ISBN: 9781450310536
DOI: 10.1145/2155620
Conference Chair: Carlo Galuzzi, Technische Universiteit Delft, The Netherlands
General Chair: Luigi Carro, Universidade Federal do Rio Grande do Sul, Brazil
Program Chairs: Andreas Moshovos, University of Toronto, Canada; Milos Prvulovic, Georgia Institute of Technology, United States
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 484 of 2,242 submissions, 22%