ABSTRACT
Modern GPUs suffer from cache contention due to the limited cache size that is shared across tens of concurrently running warps. To increase the per-warp cache size prior techniques proposed warp throttling which limits the number of active warps. Warp throttling leaves several registers to be dynamically unused whenever a warp is throttled. Given the stringent cache size limitation in GPUs this work proposes a new cache management technique named Linebacker (LB) that improves GPU performance by utilizing idle register file space as victim cache space. Whenever a CTA becomes inactive, linebacker backs up the registers of the throttled CTA to the off-chip memory. Then, linebacker utilizes the corresponding register file space as victim cache space. If any load instruction finds data in the victim cache line, the data is directly copied to the destination register through a simple register-register move operation. To further improve the efficiency of victim cache linebacker allocates victim cache space only to a select few load instructions that exhibit high data locality. Through a careful design of victim cache indexing and management scheme linebacker provides 29.0% of speedup compared to the previously proposed warp throttling techniques.
- N. Agarwal, D. Nellans, M. O'Connor, S. W. Keckler, and T. F. Wenisch. 2015. Unlocking bandwidth for GPUs in CC-NUMA systems. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).Google Scholar
- Neha Agarwal, David Nellans, Mark Stephenson, Mike O'Connor, and Stephen W. Keckler. 2015. Page Placement Strategies for GPUs Within Heterogeneous Memory Systems. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). Google ScholarDigital Library
- Akhil Arunkumar, Evgeny Bolotin, Benjamin Cho, Ugljesa Milic, Eiman Ebrahimi, Oreste Villa, Aamer Jaleel, Carole-Jean Wu, and David Nellans. 2017. MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). Google ScholarDigital Library
- Hodjat Asghari Esfeden, Farzad Khorasani, Hyeran Jeon, Daniel Wong, and Nael Abu-Ghazaleh. 2019. CORF: Coalescing Operand Register File for GPUs. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '19). ACM, New York, NY, USA, 701--714.Google ScholarDigital Library
- A. Bakhoda, G.L. Yuan, W.W.L. Fung, H. Wong, and T.M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS '09).Google Scholar
- N. Chatterjee, M. O'Connor, G.H. Loh, N. Jayasena, and R. Balasubramonia. 2014. Managing DRAM Latency Divergence in Irregular GPGPU Applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14). Google ScholarDigital Library
- Shuai Che, M. Boyer, Jiayuan Meng, D. Tarjan, J.W. Sheaffer, Sang-Ha Lee, and K. Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In 2009. IEEE International Symposium on Workload Characterization (IISWC '09). Google ScholarDigital Library
- Xuhao Chen, Li-Wen Chang, C.I. Rodrigues, Jie Lv, Zhiying Wang, and Wen-Mei Hwu. 2014. Adaptive Cache Management for Energy-Efficient GPU Computing. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47).Google ScholarDigital Library
- Ahmed ElTantawy and Tor M. Aamodt. 2018. Warp Scheduling for Fine-Grained Synchronization. In Proceedings of the 2018 IEEE 24th International Symposium on High Performance Computer Architecture (HPCA '18).Google Scholar
- J. Gaur, A. R. Alameldeen, and S. Subramoney. 2016. Base-Victim Compression: An Opportunistic Cache Compression Architecture. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). Google ScholarDigital Library
- Mark Gebhart, Stephen W. Keckler, Brucek Khailany, Ronny Krashinsky, and William J. Dally. 2012. Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). Google ScholarDigital Library
- S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In 2012 Innovative Parallel Computing (InPar).Google Scholar
- Sunpyo Hong and Hyesoon Kim. 2010. An Integrated GPU Power and Performance Model. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). Google ScholarDigital Library
- K. Hsieh, E. Ebrahim, G. Kim, N. Chatterjee, M. O'Connor, N. Vijaykumar, O. Mutlu, and S. W. Keckler. 2016. Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA '16).Google Scholar
- Hyeran Jeon, Gokul Subramanian Ravi, Nam Sung Kim, and Murali Annavaram. 2015. GPU Register File Virtualization. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48).Google ScholarDigital Library
- Wenhao Jia, Kelly A. Shaw, and Margaret Martonosi. 2012. Characterizing and Improving the Use of Demand-fetched Caches in GPUs. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS '12). Google ScholarDigital Library
- W. Jia, K. A. Shaw, and M. Martonosi. 2014. MRPB: Memory request prioritization for massively parallel processors. In Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA '14).Google Scholar
- N. Jing, J. Wang, F. Fan, W. Yu, L. Jiang, C. Li, and X. Liang. 2016. Cache-emulated register file: An integrated on-chip memory architecture for high performance GPGPUs. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-49). Google ScholarDigital Library
- Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '13). Google ScholarDigital Library
- Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. Orchestrated Scheduling and Prefetching for GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13).Google Scholar
- Norman P. Jouppi. 1990. Improving Direct-mapped Cache Performance by the Addition of a Small Fully-associative Cache and Prefetch Buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA '90). Google ScholarDigital Library
- O. Kayiran, A. Jog, M.T. Kandemir, and C.R. Das. 2013. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT '13). Google ScholarDigital Library
- Samira M. Khan, Daniel A. Jiménez, Doug Burger, and Babak Falsafi. 2010. Using Dead Blocks As a Virtual Victim Cache. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT '10). Google ScholarDigital Library
- F. Khorasani, H. Asghari Esfeden, A. Farmahini-Farahani, N. Jayasena, and V. Sarkar. 2018. RegMutex: Inter-Warp GPU Register Time-Sharing. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA '18). Google ScholarDigital Library
- Jungrae Kim, Michael Sullivan, Esha Choukse, and Mattan Erez. 2016. Bit-plane Compression: Transforming Data for Better Compression in Many-core Architectures. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16). Google ScholarDigital Library
- K. Kim, S. Lee, M. K. Yoon, G. Koo, W. W. Ro, and M. Annavaram. 2016. Warpedpreexecution: A GPU pre-execution approach for improving latency hiding. In Proceedings of the 2016 IEEE 22nd International Symposium on High Performance Computer Architecture (HPCA '16).Google Scholar
- John Kloosterman, Jonathan Beaumont, D. Anoushe Jamshidi, Jonathan Bailey, Trevor Mudge, and Scott Mahlke. 2017. Regless: Just-in-time Operand Staging for GPUs. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50). Google ScholarDigital Library
- Rakesh Komuravelli, Matthew D. Sinclair, Johnathan Alsop, Muhammad Huzaifa, Maria Kotsifakou, Prakalp Srivastava, Sarita V. Adve, and Vikram S. Adve. 2015. Stash: Have Your Scratchpad and Cache It Too. In Proceedings of the 42Nd Annual International Symposium on Computer Architecture (ISCA '15). Google ScholarDigital Library
- G. Koo, H. Jeon, and M. Annavaram. 2015. Revealing Critical Loads and Hidden Data Locality in GPGPU Applications. In 2015 IEEE International Symposium on Workload Characterization (IISWC '15). Google ScholarDigital Library
- Gunjae Koo, Yunho Oh, Won Woo Ro, and Murali Annavaram. 2017. Access Pattern-Aware Cache Management for Improving Data Utilization in GPU. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). Google ScholarDigital Library
- Sangpil Lee, Keunsoo Kim, Gunjae Koo, Hyeran Jeon, Won Woo Ro, and Murali Annavaram. 2015. Warped-compression: Enabling Power Efficient GPUs Through Register Compression. In Proceedings of the 42Nd Annual International Symposium on Computer Architecture (ISCA '15). Google ScholarDigital Library
- Shin-Ying Lee, Akhil Arunkumar, and Carole-Jean Wu. 2015. CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). Google ScholarDigital Library
- Shin-Ying Lee and Carole-Jean Wu. 2014. CAWS: Criticality-aware Warp Scheduling for GPGPU Workloads. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT '14). Google ScholarDigital Library
- Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, and Vijay Janapa Reddi. 2013. GPUWattch: Enabling Energy Optimizations in GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13).Google ScholarDigital Library
- Ang Li, Shuaiwen Leon Song, Weifeng Liu, Xu Liu, Akash Kumar, and Henk Corporaal. 2017. Locality-Aware CTA Clustering for Modern GPUs. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17). ACM, 297--311. Google ScholarDigital Library
- Ang Li, Gert-Jan van den Braak, Akash Kumar, and Henk Corporaal. 2015. Adaptive and Transparent Cache Bypassing for GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). Google ScholarDigital Library
- Chao Li, Shuaiwen Leon Song, Hongwen Dai, Albert Sidelnik, Siva Kumar Sastry Hari, and Huiyang Zhou. 2015. Locality-Driven Dynamic GPU Cache Bypassing. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). Google ScholarDigital Library
- D. Li, M. Rhu, D. R. Johnson, M. O'Connor, M. Erez, D. Burger, D. S. Fussell, and S. W. Keckler. 2015. Priority-based cache allocation in throughput processors. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA '15).Google Scholar
- Lingda Li, Ari B. Hayes, Shuaiwen Leon Song, and Eddy Z. Zhang. 2016. Tag-Split Cache for Efficient GPGPU Cache Utilization. In Proceedings of the 2016 International Conference on Supercomputing (ICS '16).Google Scholar
- Jieun Lim, Nagesh B. Lakshminarayana, Hyesoon Kim, William Song, Sudhakar Yalamanchili, and Wonyong Sung. 2014. Power Modeling for GPU Architectures Using McPAT. ACM Trans. Des. Autom. Electron. Syst. 19, 3 (June 2014).Google ScholarDigital Library
- Mengjie Mao, Jingtong Hu, Yiran Chen, and Hai Li. 2015. VWS: A versatile warp scheduler for exploring diverse cache localities of GPGPU applications. In Design Automation Conference, 2015 52nd ACM/EDAC/IEEE (DAC '15). Google ScholarDigital Library
- Ugljesa Milic, Oreste Villa, Evgeny Bolotin, Akhil Arunkumar, Eiman Ebrahimi, Aamer Jaleel, Alex Ramirez, and David Nellans. 2017. Beyond the Socket: NUMA-aware GPUs. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50). Google ScholarDigital Library
- NVIDIA. 2014. NVIDIA GeForce GTX 980: Featuring Maxwell, The Most Advanced GPU Ever Made.Google Scholar
- NVIDIA. 2016. NVIDIA CUDA SDK Code Sample 4.0.Google Scholar
- NVIDIA. 2016. NVIDIA GeForce GTX 1080: Gaming Perfected.Google Scholar
- NVIDIA. 2016. NVIDIA Tesla P100: The Most Advanced Datacenter Accelerator Ever Built.Google Scholar
- NVIDIA. 2017. NVIDIA TESLA V100 GPU ARCHITECTURE.Google Scholar
- NVIDIA. 2018. NVIDIA Turing GPU Architecture: Graphics reinvented.Google Scholar
- Yunho Oh, Keunsoo Kim, Myung Kuk Yoon, Jong Hyun Park, Yongjun Park, Won Woo Ro, and Murali Annavaram. 2016. APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16). Google ScholarDigital Library
- Yunho Oh, Myung Kuk Yoon, William J. Song, and Won Woo Ro. 2018. FineReg: Fine-Grained Register File Management for Augmenting GPU Throughput. In 2018 Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-51).Google Scholar
- Sreepathi Pai, R. Govindarajan, and Matthew J. Thazhuthaveetil. 2014. Preemptive Thread Block Scheduling with Online Structural Runtime Prediction for Concurrent GPGPU Kernels. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT '14). Google ScholarDigital Library
- Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke. 2015. Chimera: Collaborative Preemption for Multitasking on a Shared GPU. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). 593--606. Google ScholarDigital Library
- G. Pekhimenko, E. Bolotin, N. Vijaykumar, O. Mutlu, T. C. Mowry, and S. W. Keckler. 2016. A case for toggle-aware compression for GPU systems. In Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA '16).Google Scholar
- Minsoo Rhu, Michael Sullivan, Jingwen Leng, and Mattan Erez. 2013. A Locality-aware Memory Hierarchy for Energy-efficient GPU Architectures. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46).Google ScholarDigital Library
- Timothy G. Rogers, Mike O'Connor, and Tor M. Aamodt. 2012. Cache-Conscious Wavefront Scheduling. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). Google ScholarDigital Library
- Timothy G. Rogers, Mike O'Connor, and Tor M. Aamodt. 2013. Divergence-aware Warp Scheduling. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). Google ScholarDigital Library
- Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, Hamid Sarbazi-Azad, Mario Drumond, Babak Falsafi, Rachata Ausavarungnirun, and Onur Mutlu. 2018. LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18).Google ScholarDigital Library
- A. Sethia, D. A. Jamshidi, and S. Mahlke. 2015. Mascar: Speeding up GPU warps by reducing memory pitstops. In Proceedings of the 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA '15).Google Scholar
- D. Stiliadis and A. Varma. 1997. Selective victim caching: a method to improve the performance of direct-mapped caches. IEEE Trans. Comput. 46, 5 (1997). Google ScholarDigital Library
- J. A. Stratton, C. Rodrigues, I. J. Sung, N. Obeid, L. W. Chang, N. Anssari, G. D. Liu, and W. W. Hwu. 2012. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing (2012).Google Scholar
- I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, and M. Valero. 2014. Enabling preemptive multiprogramming on GPUs. In 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA). 193--204. Google ScholarDigital Library
- Shyamkumar Thoziyoor, Naveen Muralimanohar, and Jung Ho Ahn. 2008. Cacti 5.1. Technical Report. Hewlett-Packard Laboratories.Google Scholar
- Nandita Vijaykumar, Eiman Ebrahimi, Kevin Hsieh, Phillip B. Gibbons, and Onur Mutlu. 2018. The Locality Descriptor: A Holistic Cross-Layer Abstraction to Express Data Locality in GPUs. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA '18). Google ScholarDigital Library
- Bin Wang, Weikuan Yu, Xian-He Sun, and Xinning Wang. 2015. DaCache: Memory Divergence-Aware GPU Cache Management. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15).Google ScholarDigital Library
- Bin Wang, Yue Zhu, and Weikuan Yu. 2016. OAWS: Memory Occlusion Aware Warp Scheduling. In Proceedings of the 25th International Conference on Parallel Architectures and Compilation Techniques (PACT '16).Google ScholarDigital Library
- Haonan Wang, Fan Leon Luo, Mohamed Ibrahim, Onur Kayiran, and Adwait Jog. 2018. Efficient and Fair Multi-programming in GPUs via Effective Bandwidth Management. In Proceedings of the 2018 IEEE 24th International Symposium on High Performance Computer Architecture (HPCA '18).Google ScholarCross Ref
- Xiaolong Xie, Yun Liang, Xiuhong Li, Yudong Wu, Guangyu Sun, Tao Wang, and Dongrui Fan. 2015. Enabling Coordinated Register Allocation and Thread-level Parallelism Optimization for GPUs. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). Google ScholarDigital Library
- X. Xie, Y. Liang, Y. Wang, G. Sun, and T. Wang. 2015. Coordinated static and dynamic cache bypassing for GPUs. In Proceedings of the 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA '15).Google Scholar
- Q. Xu, H. Jeon, K. Kim, W. W. Ro, and M. Annavaram. 2016. Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 230--242.Google Scholar
- Myung Kuk Yoon, Keunsoo Kim, Sangpil Lee, Won Woo Ro, and Murali Annavaram. 2016. Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16). Google ScholarDigital Library
- Michael Zhang and Krste Asanovic. 2005. Victim Replication: Maximizing Capacity While Hiding Wire Delay in Tiled Chip Multiprocessors. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA '05). Google ScholarDigital Library
Recommendations
Virtual-Cache: A cache-line borrowing technique for efficient GPU cache architectures
AbstractGPUs provide megabytes of registers and shared memories to maintain the contexts for thousands of threads and enable fast data sharing amongst threads of a thread block, respectively. Besides, GPUs employ L1 cache to provide the high ...
REMOC: efficient request managements for on-chip memories of GPUs
CF '22: Proceedings of the 19th ACM International Conference on Computing FrontiersThe on-chip memories of GPUs, including the register file, shared memory and L1 cache, can provide high bandwidth and low latency access for the temporary storage of data. The capacity of L1 cache can be increased by using the registers/shared memory ...
Exploring cache bypassing and partitioning for multi-tasking on GPUs
ICCAD '17: Proceedings of the 36th International Conference on Computer-Aided DesignGraphics Processing Units (GPUs) computing has become ubiquitous for embedded system, evidenced by its wide adoption for various general purpose applications. As more and more applications are accelerated by GPUs, multi-tasking scenario starts to ...
Comments