Data-type specific cache compression in GPGPUs

The Journal of Supercomputing

Abstract

In this paper, we evaluate the compressibility of the L1 data caches and the L2 cache in general-purpose graphics processing units (GPGPUs). Our proposed scheme aims to improve the performance and power consumption of GPGPUs through cache compression. GPGPUs are throughput-oriented devices that execute thousands of threads simultaneously. To handle the working set of this massive number of threads, modern GPGPUs employ several levels of caches. The GPGPU design trend shows that cache sizes continue to grow to support even more thread-level parallelism. We propose using cache compression to increase effective cache capacity, improve performance, and reduce power consumption in GPGPUs. Our work is motivated by the observation that the values within a cache block are similar, i.e., the arithmetic difference between two successive values within a cache block is small. To reduce data redundancy in the L1 data caches and the L2 cache, we use the low-cost, implementation-efficient base-delta-immediate (BDI) algorithm. BDI replaces a cache block with a base and an array of deltas, where the combined size of the base and deltas is less than that of the original cache block. We also study the locality of fields in integer and floating-point numbers and find that the entropy of fields varies across data types. Based on entropy, we offer different BDI compression schemes for integer and floating-point numbers. We augment the design with a simple yet effective predictor that determines the type of values dynamically in hardware, without the help of a compiler or a programmer. Evaluation results show that, on average, cache compression improves performance by 8% and reduces cache energy by 9%.
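
To make the base-plus-deltas idea and the notion of per-field entropy concrete, the following is a minimal software sketch in Python, not the hardware scheme evaluated in the paper. It assumes 4-byte words, a 1-byte signed delta, and the first word of the block as the base; the function names (`bdi_compress`, `bdi_decompress`, `compressed_size`, `byte_entropy`) are hypothetical and chosen only for illustration. The published BDI design covers further cases (such as additional base and delta widths) that are omitted here.

```python
import math
import struct
from collections import Counter


def bdi_compress(block, word_size=4, delta_size=1):
    """Return (base, deltas) if every word is base + a small signed delta, else None."""
    words = [int.from_bytes(block[i:i + word_size], "little")
             for i in range(0, len(block), word_size)]
    base = words[0]  # assumption: the first word serves as the base
    limit = 1 << (8 * delta_size - 1)  # signed delta range is [-limit, limit)
    deltas = [w - base for w in words]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas
    return None  # block is not compressible with this base/delta width


def bdi_decompress(base, deltas, word_size=4):
    """Rebuild the original block by adding each delta back to the base."""
    return b"".join((base + d).to_bytes(word_size, "little") for d in deltas)


def compressed_size(num_words, word_size=4, delta_size=1):
    """Bytes needed for one base plus one delta per word (metadata ignored)."""
    return word_size + num_words * delta_size


def byte_entropy(values, position):
    """Shannon entropy (bits) of the byte at `position` across packed values."""
    counts = Counter(v[position] for v in values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # A 32-byte block of eight successive integers compresses to 12 bytes.
    block = struct.pack("<8I", *range(1000, 1008))
    base, deltas = bdi_compress(block)
    print(base, deltas, compressed_size(len(deltas)), "bytes instead of", len(block))
    assert bdi_decompress(base, deltas) == block

    # Big-endian floats: byte 0 (sign and upper exponent bits) is constant,
    # while byte 1 (lower exponent bit and upper mantissa bits) varies.
    floats = [struct.pack(">f", x) for x in (1.0, 1.5, 1.25, 1.75)]
    print(byte_entropy(floats, 0), byte_entropy(floats, 1))
```

In this toy data set the leading byte of the floating-point values has zero entropy while the next byte does not, which illustrates why a scheme might treat the fields of integers and floating-point numbers differently.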

Acknowledgements

This work was supported by the Natural Sciences and Engineering Research Council of Canada.

Author information

Corresponding author

Correspondence to Ehsan Atoofian.

About this article

Cite this article

Atoofian, E., Rea, S. Data-type specific cache compression in GPGPUs. J Supercomput 74, 1609–1635 (2018). https://doi.org/10.1007/s11227-017-2185-5

  • DOI: https://doi.org/10.1007/s11227-017-2185-5
