Energy efficient GPU transactional memory via space-time optimizations

Authors:
Wilson W. L. Fung, University of British Columbia
Tor M. Aamodt, University of British Columbia

MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, December 2013, Pages 408–420
https://doi.org/10.1145/2540708.2540743

Published: 07 December 2013

ABSTRACT

Many applications with regular parallelism have been shown to benefit from using Graphics Processing Units (GPUs). However, employing GPUs for applications with irregular parallelism tends to be a risky process, involving significant effort from the programmer. One major, non-trivial effort/risk is to expose the available parallelism in the application as 1000s of concurrent threads without introducing data races or deadlocks via fine-grained data synchronization. To reduce this effort, prior work has proposed supporting transactional memory on GPU architectures. One hardware proposal, Kilo TM, can scale to 1000s of concurrent transactions. However, the performance and energy overhead of Kilo TM may deter GPU vendors from incorporating it into future designs.

In this paper, we analyze the performance and energy efficiency of Kilo TM and propose two enhancements: (1) Warp-level transaction management allows transactions within a warp to be managed as a group. This aggregates protocol messages to reduce communication overhead and captures spatial locality from multiple transactions to increase memory subsystem utility. (2) Temporal conflict detection uses globally synchronized timers to detect conflicts in read-only transactions with low overhead. Our evaluation shows that combining the two enhancements can improve the overall performance and energy efficiency of Kilo TM by 65% and 34%, respectively. Kilo TM with the above two enhancements achieves 66% of the performance of fine-grained locking with 34% energy overhead.
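
The two enhancements can be illustrated with a small software model. The sketch below is not the hardware design from the paper; it is a toy Python model written under stated assumptions. It assumes that memory keeps a last-written timestamp per address, that a single counter stands in for the globally synchronized timers, and that a read-only transaction conflicts when it reads a location written after its first load. The names `TimestampedMemory`, `ReadOnlyTransaction`, and `aggregate_by_partition`, as well as the partition count and line size, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TimestampedMemory:
    """Toy memory that remembers when each address was last written (assumed model)."""
    clock: int = 0  # stands in for the globally synchronized timers
    data: dict = field(default_factory=dict)
    last_written: dict = field(default_factory=dict)

    def tick(self) -> int:
        self.clock += 1
        return self.clock

    def write(self, addr, value):
        self.data[addr] = value
        self.last_written[addr] = self.tick()

    def read(self, addr):
        # Returns (value, time of last write); unwritten addresses report time 0.
        return self.data.get(addr, 0), self.last_written.get(addr, 0)


class ReadOnlyTransaction:
    """Read-only transaction that detects conflicts by comparing timestamps."""

    def __init__(self, mem):
        self.mem = mem
        self.start_time = None  # set at the first load
        self.conflict = False

    def load(self, addr):
        if self.start_time is None:
            self.start_time = self.mem.tick()
        value, written_at = self.mem.read(addr)
        # A write that happened after the transaction started breaks its snapshot.
        if written_at > self.start_time:
            self.conflict = True
        return value

    def commit(self):
        # No value-based validation is needed: the timestamps already decided.
        return not self.conflict


def aggregate_by_partition(thread_addrs, num_partitions=4, line_bytes=128):
    """Group a warp's per-thread addresses into one message per memory partition.

    The address-to-partition mapping here is an assumption for illustration.
    """
    messages = {}
    for lane, addr in enumerate(thread_addrs):
        partition = (addr // line_bytes) % num_partitions
        messages.setdefault(partition, []).append((lane, addr))
    return messages
```

In this model a read-only transaction commits without sending any validation traffic when every location it read was last written before its first load. The aggregation helper shows the other idea: six threads touching addresses `[0, 4, 8, 128, 132, 512]` produce two grouped messages rather than six individual ones.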

Published in
MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
December 2013
498 pages
ISBN: 9781450326384
DOI: 10.1145/2540708
General Chair: Matthew Farrens, UC Davis
Program Chair: Christos Kozyrakis, Stanford University
Copyright © 2013 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags: GPU, transactional memory
Acceptance Rates
MICRO-46 Paper Acceptance Rate: 39 of 239 submissions, 16%
Overall Acceptance Rate: 484 of 2,242 submissions, 22%