Data-type specific cache compression in GPGPUs

Atoofian, Ehsan; Rea, Sean

doi:10.1007/s11227-017-2185-5

Published: 09 November 2017

Volume 74, pages 1609–1635, (2018)
The Journal of Supercomputing

Abstract

In this paper, we evaluate the compressibility of the L1 data caches and the L2 cache in general-purpose graphics processing units (GPGPUs). Our proposed scheme aims to improve the performance and power consumption of GPGPUs through cache compression. GPGPUs are throughput-oriented devices that execute thousands of threads simultaneously. To handle the working set of this massive number of threads, modern GPGPUs employ several levels of caches. GPGPU design trends show that cache sizes continue to grow to support even more thread-level parallelism. We propose using cache compression to increase effective cache capacity, improve performance, and reduce power consumption in GPGPUs. Our work is motivated by the observation that the values within a cache block are similar, i.e., the arithmetic difference between two successive values within a cache block is small. To reduce data redundancy in the L1 data caches and the L2 cache, we use the low-cost and implementation-efficient base-delta-immediate (BDI) algorithm. BDI replaces a cache block with a base and an array of deltas, where the combined size of the base and deltas is less than that of the original cache block. We also study the locality of fields in integer and floating-point numbers. We found that the entropy of fields varies across data types. Based on entropy, we offer different BDI compression schemes for integer and floating-point numbers. We add a simple yet effective predictor that determines the type of values dynamically in hardware, without the help of a compiler or a programmer. Evaluation results show that, on average, cache compression improves performance by 8% and reduces cache energy by 9%.
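
The base-plus-deltas idea described in the abstract can be illustrated with a minimal software sketch. This is not the authors' hardware design: it assumes a single base (the first word of the block), unsigned little-endian words, and one fixed delta width, whereas full BDI tries several base and delta sizes. The function names `bd_compress`, `bd_decompress`, and `compressed_size` are hypothetical, and the data-type specific schemes and the type predictor are not modeled.

```python
import struct


def bd_compress(block: bytes, word_size: int = 4, delta_size: int = 1):
    """Try to encode `block` as one base word plus fixed-width signed deltas.

    Returns (base, deltas) on success, or None if any delta does not fit
    in `delta_size` bytes. Simplifying assumption: the base is always the
    first word of the block.
    """
    if len(block) % word_size != 0:
        raise ValueError("block length must be a multiple of word_size")
    # Split the block into unsigned little-endian words.
    words = [int.from_bytes(block[i:i + word_size], "little")
             for i in range(0, len(block), word_size)]
    base = words[0]
    limit = 1 << (8 * delta_size - 1)  # signed range is [-limit, limit - 1]
    deltas = [w - base for w in words]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas
    return None  # block is not compressible with this configuration


def bd_decompress(base: int, deltas, word_size: int = 4) -> bytes:
    """Rebuild the original block by adding each delta back to the base."""
    return b"".join((base + d).to_bytes(word_size, "little") for d in deltas)


def compressed_size(n_words: int, word_size: int = 4, delta_size: int = 1) -> int:
    """Bytes needed for one base word plus one delta per word."""
    return word_size + n_words * delta_size


# Eight 32-bit values spaced 4 apart, e.g. consecutive array indices or addresses.
block = struct.pack("<8I", *range(1000, 1032, 4))
result = bd_compress(block)
```

In this example the 32-byte block is represented by a 4-byte base and eight 1-byte deltas, 12 bytes in total. A block whose values differ widely (for instance 0 and 1,000,000) fails the range check and would be stored uncompressed.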

Acknowledgements

This work was supported by the Natural Sciences and Engineering Research Council of Canada.

Author information

Authors and Affiliations

Electrical Engineering Department, Lakehead University, Thunder Bay, Canada
Ehsan Atoofian & Sean Rea

Corresponding author

Correspondence to Ehsan Atoofian.

About this article

Cite this article

Atoofian, E., Rea, S. Data-type specific cache compression in GPGPUs. J Supercomput 74, 1609–1635 (2018). https://doi.org/10.1007/s11227-017-2185-5

Issue Date: April 2018
