Abstract
Recent proposals present compression as a cost-effective technique for increasing cache and memory capacity and bandwidth. While these proposals demonstrate the potential of compression, several open questions must be answered before they can be adopted in real systems, including the following: (1) Do these techniques work for real-world workloads running for long periods? (2) Which application domains would benefit the most from compression? (3) At which level of the memory hierarchy should we apply compression: caches, main memory, or both?
In this article, our goal is to shed light on the main questions about the applicability of compression. We evaluate compression in the memory hierarchy for selected examples from different application classes, analyzing real applications with real data as well as complete runs of several benchmarks. While simulators provide a fairly accurate framework for studying the potential performance and energy impact of new ideas, they mostly limit us to a small range of workloads with short runtimes. To enable the study of real workloads, we introduce a fast and simple methodology for capturing samples of the memory and cache contents of a real machine (a desktop or a server). Compared to a cycle-accurate simulator, our methodology allows us to study real workloads as well as benchmarks. Our toolset is not a replacement for simulators but rather complements them: while a simulator can measure the performance and energy impact of a particular compression proposal, our methodology can assess the potential of compression with long-running workloads in the early stages of design.
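The kind of analysis such memory snapshots enable can be sketched as follows. This is a minimal illustration, not the authors' actual toolset: it assumes a raw snapshot file and uses zlib as a stand-in compressor to compute per-block compression ratios at cache-block (64-byte) granularity.

```python
import zlib

BLOCK = 64  # cache-block granularity, in bytes


def block_compressibility(snapshot_path):
    """Compute a per-block compression ratio over a raw memory snapshot.

    Each 64-byte block is compressed independently (as hardware
    compressors typically would), and the ratio original/compressed
    is recorded. Ratios > 1 mean the block is compressible.
    """
    ratios = []
    with open(snapshot_path, "rb") as f:
        while True:
            block = f.read(BLOCK)
            if len(block) < BLOCK:
                break  # ignore a trailing partial block
            compressed = zlib.compress(block)
            ratios.append(BLOCK / len(compressed))
    return ratios
```

A zero-filled block (common in freshly allocated memory) yields a high ratio, while a block of high-entropy data yields a ratio near or below 1; aggregating these ratios over a snapshot approximates the workload's overall compressibility.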
Using our toolset, we evaluate a collection of workloads from different domains, such as the web server of the CS department at UW–Madison (traced for 24 hours), Google Chrome (watching a one-hour movie on YouTube), and Linux games (played for about an hour). We also use several benchmarks from different domains, including SPEC, mobile, and big data, and run them to completion.
Using these workloads and our toolset, we analyze different compression properties for both real applications and benchmarks. We focus on eight main hypotheses about compression, derived from previous work. These properties (Table 2) form the foundation of several compression proposals, so the performance of those proposals depends heavily on whether the properties hold.
Overall, our results suggest that compression could be of general use in both main memory and caches. On average, the compression ratio is ≥2 for 64% and 54% of workloads for memory and cache data, respectively. Our evaluation indicates significant potential for both cache and memory compression, with higher compressibility in memory due to the abundance of zero blocks. Among the application domains we studied, servers show the highest average compressibility, while our mobile benchmarks show the lowest.
Comparing benchmarks with real workloads, we show that (1) it is critical to run benchmarks to completion, or for considerably long runtimes, to avoid biased conclusions, and (2) SPEC benchmarks are good representatives of real desktop applications in terms of the compressibility of their datasets. However, this does not hold for all compression properties. For example, SPEC benchmarks have much better compression locality (i.e., neighboring blocks have similar compressibility) than real workloads. Thus, it is critical for designers to consider a wider range of workloads, including real applications, when evaluating their compression techniques.
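Compression locality, as parenthetically defined above, can be quantified in several ways; one simple, hypothetical metric (not the paper's definition) is the fraction of adjacent block pairs whose compressed sizes fall within a small tolerance of each other:

```python
import zlib


def compression_locality(blocks, tolerance=0.1):
    """Fraction of adjacent block pairs whose compressed sizes are
    within `tolerance` (relative) of each other -- one simple way to
    quantify 'neighboring blocks compress similarly'.

    `blocks` is a list of equal-sized byte strings in address order.
    """
    sizes = [len(zlib.compress(b)) for b in blocks]
    if len(sizes) < 2:
        return 0.0
    similar = sum(
        1
        for a, b in zip(sizes, sizes[1:])
        if abs(a - b) <= tolerance * max(a, b)
    )
    return similar / (len(sizes) - 1)
```

Under a metric like this, a workload whose zero blocks and incompressible blocks are interleaved scores low, while one whose compressible regions are spatially clustered scores high, which is what makes locality exploitable by designs that pack neighboring compressed blocks together.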
Supplemental Material
A slide deck associated with this paper is available for download.
Index Terms
- Could Compression Be of General Use? Evaluating Memory Compression across Domains