Abstract
The rise of general-purpose computing on GPUs has driven architectural innovation in GPU design; the introduction of an on-chip cache hierarchy is one such innovation. High L1 miss rates on GPUs, however, indicate inefficient cache usage caused by factors such as cache thrashing and extensive multithreading. These high L1 miss rates in turn place heavy demands on the shared L2 bandwidth, and the resulting congestion in the L2 access path leads to high memory access latencies. In memory-intensive applications, these latencies become exposed because too few active compute threads remain to mask them.
In this article, we aim to reduce the pressure on the shared L2 bandwidth, thereby reducing the memory access latencies that lie in the critical path. We identify significant replication of data among private L1 caches, presenting an opportunity to reuse data among them. We further show how this reuse can be exploited via an L1 Cooperative Caching Network (CCN), thereby reducing the bandwidth demand on L2. In the proposed architecture, we connect the L1 caches with a lightweight ring network to facilitate intercore communication of shared data. We show that this technique reduces traffic to the L2 cache by 29% on average, freeing up bandwidth for other accesses. We also show that the CCN reduces the average memory latency by 24%, thereby reducing core stall cycles by 26% on average. This translates into an overall performance improvement of 14.7% on average (and up to 49%) for applications that exhibit reuse across L1 caches. The CCN incurs a nominal area and energy overhead of 1.3% and 2.5%, respectively. Notably, the performance improvement delivered by the CCN exceeds, by up to 34%, the improvement achieved by simply doubling the number of L2 banks.
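The lookup path the abstract describes can be illustrated with a small behavioral model: on an L1 miss, the requesting core probes its peers' L1 caches over the ring before consuming shared L2 bandwidth. The sketch below is illustrative only and is not the paper's implementation; the class names (`L1Cache`, `RingCCN`), the fully associative LRU organization, and the sequential ring walk are all simplifying assumptions.

```python
from collections import OrderedDict

class L1Cache:
    """Tiny fully associative LRU cache holding block addresses (an assumed model)."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block address -> present

    def lookup(self, addr):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)  # refresh LRU position
            return True
        return False

    def insert(self, addr):
        self.blocks[addr] = True
        self.blocks.move_to_end(addr)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the LRU block

class RingCCN:
    """Private L1s joined in a ring; a miss probes peer L1s before going to L2."""
    def __init__(self, num_cores=4, l1_capacity=4):
        self.l1s = [L1Cache(l1_capacity) for _ in range(num_cores)]
        self.l2_accesses = 0  # proxy for demand on shared L2 bandwidth

    def access(self, core, addr, use_ccn=True):
        if self.l1s[core].lookup(addr):
            return "L1 hit"
        if use_ccn:
            # Walk the ring: probe each peer L1 in hop order.
            n = len(self.l1s)
            for hop in range(1, n):
                peer = (core + hop) % n
                if self.l1s[peer].lookup(addr):
                    self.l1s[core].insert(addr)  # fill from the peer, not from L2
                    return f"CCN hit at core {peer}"
        self.l2_accesses += 1  # only now is L2 bandwidth consumed
        self.l1s[core].insert(addr)
        return "L2 access"
```

In this toy model, a block fetched by one core and later requested by another is served over the ring, so the second request never reaches L2; this is the mechanism by which inter-L1 reuse translates into reduced L2 traffic.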
Index Terms
- Cooperative Caching for GPUs