Abstract
In this paper, we evaluate the compressibility of the L1 data caches and the L2 cache in general-purpose graphics processing units (GPGPUs) and propose a cache compression scheme aimed at improving their performance and power consumption. GPGPUs are throughput-oriented devices that execute thousands of threads simultaneously. To handle the working set of this massive number of threads, modern GPGPUs employ several levels of caches, and the design trend shows that cache sizes continue to grow to support even more thread-level parallelism. We propose using cache compression to increase effective cache capacity, improve performance, and reduce power consumption in GPGPUs. Our work is motivated by the observation that the values within a cache block are similar, i.e., the arithmetic difference between two successive values within a cache block is small. To reduce data redundancy in the L1 data caches and the L2 cache, we use the low-cost, implementation-efficient base-delta-immediate (BDI) algorithm. BDI replaces a cache block with a base and an array of deltas such that the combined size of the base and deltas is smaller than the original cache block. We also study the locality of fields in integer and floating-point numbers and find that the entropy of fields varies across data types. Based on entropy, we offer different BDI compression schemes for integer and floating-point numbers. We add a simple yet effective predictor that determines the type of values dynamically in hardware, without the help of a compiler or a programmer. Evaluation results show that, on average, cache compression improves performance by 8% and reduces cache energy by 9%.
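The base-plus-deltas idea described in the abstract can be illustrated with a minimal software sketch. This is not the hardware design evaluated in the paper. It assumes the first word of the block serves as the base, that words are little-endian unsigned integers, and that deltas are stored as signed values. It also omits the "immediate" (implicit zero base) part of BDI. The function names `base_delta_compress`, `base_delta_decompress`, and `compressed_size`, and the 32-byte example block, are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple


def base_delta_compress(block: bytes, word_size: int = 4,
                        delta_size: int = 1) -> Optional[Tuple[int, List[int]]]:
    """Return (base, deltas) if every delta fits in delta_size bytes, else None."""
    if not block or len(block) % word_size != 0:
        raise ValueError("block must be a non-empty multiple of word_size")
    # Split the block into fixed-width little-endian words (assumption).
    words = [int.from_bytes(block[i:i + word_size], "little")
             for i in range(0, len(block), word_size)]
    base = words[0]  # assumption: the first word is used as the base
    # Range of a signed integer that is delta_size bytes wide.
    lo = -(1 << (8 * delta_size - 1))
    hi = (1 << (8 * delta_size - 1)) - 1
    deltas = [w - base for w in words]
    if all(lo <= d <= hi for d in deltas):
        return base, deltas
    return None  # block is not compressible with this word/delta width


def base_delta_decompress(base: int, deltas: List[int],
                          word_size: int = 4) -> bytes:
    """Rebuild the original block by adding each delta back to the base."""
    return b"".join((base + d).to_bytes(word_size, "little") for d in deltas)


def compressed_size(num_words: int, word_size: int, delta_size: int) -> int:
    """Bytes needed for one base plus one delta per word."""
    return word_size + num_words * delta_size


# Eight 4-byte values with small successive differences (illustrative data).
similar_block = b"".join(v.to_bytes(4, "little") for v in range(1000, 1032, 4))
result = base_delta_compress(similar_block)
```

In this example the 32-byte block is represented by a 4-byte base and eight 1-byte deltas, which is 12 bytes in total. A block whose values differ by more than a 1-byte delta can hold makes the function return `None`, which corresponds to leaving the block uncompressed or trying a wider delta.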
Acknowledgements
This work was supported by the Natural Sciences and Engineering Research Council of Canada.
Cite this article
Atoofian, E., Rea, S. Data-type specific cache compression in GPGPUs. J Supercomput 74, 1609–1635 (2018). https://doi.org/10.1007/s11227-017-2185-5
DOI: https://doi.org/10.1007/s11227-017-2185-5