DOI: 10.1145/2540708.2540743
research-article

Energy efficient GPU transactional memory via space-time optimizations

Published: 7 December 2013

ABSTRACT

Many applications with regular parallelism have been shown to benefit from using Graphics Processing Units (GPUs). However, employing GPUs for applications with irregular parallelism tends to be a risky process that demands significant effort from the programmer. One major, non-trivial source of effort and risk is exposing the available parallelism in the application as 1000s of concurrent threads without introducing data races or deadlocks via fine-grained data synchronization. To reduce this effort, prior work has proposed supporting transactional memory on GPU architectures. One hardware proposal, Kilo TM, can scale to 1000s of concurrent transactions. However, the performance and energy overhead of Kilo TM may deter GPU vendors from incorporating it into future designs.

In this paper, we analyze the performance and energy efficiency of Kilo TM and propose two enhancements: (1) Warp-level transaction management allows transactions within a warp to be managed as a group. This aggregates protocol messages to reduce communication overhead and captures spatial locality from multiple transactions to increase memory subsystem utility. (2) Temporal conflict detection uses globally synchronized timers to detect conflicts in read-only transactions with low overhead. Our evaluation shows that combining the two enhancements improves the overall performance and energy efficiency of Kilo TM by 65% and 34%, respectively. Kilo TM with the above two enhancements achieves 66% of the performance of fine-grained locking with 34% energy overhead.
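The abstract does not spell out how temporal conflict detection works, so the following is only a minimal sketch of one plausible reading: every memory word remembers the time of its last write, and a read-only transaction aborts if it observes a word written after the transaction began. The names `TimestampedMemory`, `run_read_only`, and `between_reads` are hypothetical, and a simple counter stands in for the globally synchronized timer.

```python
from dataclasses import dataclass, field


@dataclass
class TimestampedMemory:
    """Toy memory in which each word records the time of its last write."""
    now: int = 0  # assumption: a counter stands in for the global timer
    values: dict = field(default_factory=dict)
    last_write: dict = field(default_factory=dict)

    def write(self, addr, value):
        # Advance the timer and stamp the word with the new time.
        self.now += 1
        self.values[addr] = value
        self.last_write[addr] = self.now

    def read(self, addr):
        # Return the value together with its last-write timestamp.
        return self.values.get(addr, 0), self.last_write.get(addr, 0)


def run_read_only(mem, addrs, between_reads=None):
    """Run a read-only transaction; return (committed, values).

    The transaction samples the timer once at its start. Any word whose
    last write is later than that sample signals a conflict, so the
    transaction aborts without needing a separate validation pass.
    """
    start = mem.now
    seen = []
    for i, addr in enumerate(addrs):
        value, written_at = mem.read(addr)
        if written_at > start:
            return False, []  # conflict: word changed after we started
        seen.append(value)
        # Hypothetical hook that simulates a concurrent writer after
        # the first read.
        if between_reads is not None and i == 0:
            between_reads()
    return True, seen


mem = TimestampedMemory()
mem.write("a", 1)
mem.write("b", 2)
clean = run_read_only(mem, ["a", "b"])
raced = run_read_only(mem, ["a", "b"], between_reads=lambda: mem.write("b", 9))
```

In this sketch the first transaction commits because both words were written before it started, while the second aborts because `b` is overwritten between its two reads. A hardware design would attach timestamps to memory in some bounded form; that detail is outside what the abstract states.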


Published in

MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
December 2013
498 pages
ISBN: 9781450326384
DOI: 10.1145/2540708

          Copyright © 2013 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

MICRO-46 paper acceptance rate: 39 of 239 submissions, 16%
Overall acceptance rate: 484 of 2,242 submissions, 22%
