ABSTRACT
Many applications with regular parallelism have been shown to benefit from using Graphics Processing Units (GPUs). However, employing GPUs for applications with irregular parallelism tends to be a risky process, involving significant effort from the programmer. One major, non-trivial effort/risk is to expose the available parallelism in the application as 1000s of concurrent threads without introducing data races or deadlocks via fine-grained data synchronization. To reduce this effort, prior work has proposed supporting transactional memory on GPU architectures. One hardware proposal, Kilo TM, can scale to 1000s of concurrent transactions. However, the performance and energy overhead of Kilo TM may deter GPU vendors from incorporating it into future designs.
In this paper, we analyze the performance and energy efficiency of Kilo TM and propose two enhancements: (1) Warp-level transaction management allows transactions within a warp to be managed as a group. This aggregates protocol messages to reduce communication overhead and captures spatial locality from multiple transactions to increase memory subsystem utility. (2) Temporal conflict detection uses globally synchronized timers to detect conflicts in read-only transactions with low overhead. Our evaluation shows that combining the two enhancements can improve the overall performance and energy efficiency of Kilo TM by 65% and 34%, respectively. Kilo TM with the above two enhancements achieves 66% of the performance of fine-grained locking with 34% energy overhead.
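To make the idea behind temporal conflict detection more concrete, the following is a minimal software sketch of one way a timer-based check for read-only transactions could work. It is an illustration under stated assumptions, not the hardware mechanism of the paper: the names `Memory`, `ReadOnlyTransaction`, `last_written`, and `start_time` are hypothetical, a single integer counter stands in for the globally synchronized timers, and the rule used here (flag a conflict when a loaded word was written after the transaction's first load) is an assumed policy chosen for clarity.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Word-addressed memory that records when each word was last written.

    `clock` is a hypothetical stand-in for a globally synchronized timer.
    """
    clock: int = 0
    values: dict = field(default_factory=dict)
    last_written: dict = field(default_factory=dict)

    def tick(self) -> int:
        # Advance the shared timer and return the new timestamp.
        self.clock += 1
        return self.clock

    def write(self, addr, value) -> None:
        # Store the value and stamp the word with the current time.
        self.values[addr] = value
        self.last_written[addr] = self.tick()

    def read(self, addr):
        # Return the value together with its last-written timestamp.
        return self.values.get(addr, 0), self.last_written.get(addr, 0)


class ReadOnlyTransaction:
    """Read-only transaction validated purely by comparing timestamps."""

    def __init__(self, mem: Memory):
        self.mem = mem
        self.start_time = None  # set at the first load
        self.conflict = False

    def load(self, addr):
        if self.start_time is None:
            self.start_time = self.mem.tick()
        value, written_at = self.mem.read(addr)
        # Assumed policy: a word written after our first load means the
        # values we observed may not form a consistent snapshot.
        if written_at > self.start_time:
            self.conflict = True
        return value

    def commit(self) -> bool:
        # No write set to publish: succeed if no load saw a newer write.
        return not self.conflict
```

In this sketch, a transaction that loads `a`, then loads `b` after another thread has overwritten `b`, fails at `commit()` and would be retried, while a transaction that starts after the write commits without any further validation. The appeal of this style of check is that a read-only transaction only ever compares timestamps it receives along with its loads.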
Index Terms
- Energy efficient GPU transactional memory via space-time optimizations