ABSTRACT
Die-stacked memory technology enables gigascale DRAM caches that operate at 4x-8x the bandwidth of commodity DRAM. Such caches can improve system performance by servicing requests at this higher rate whenever the requested data is found in the cache, potentially increasing the effective memory bandwidth of the system by 4x-8x. Unfortunately, a DRAM cache spends its bandwidth not only on data transfer for cache hits, but also on secondary operations: miss detection, fill on a cache miss, and writeback lookup and content update on dirty evictions from the last-level on-chip cache. Ideally, the bandwidth consumed by such secondary operations would be negligible, leaving almost all of it available for transferring useful data from the DRAM cache to the processor.
We evaluate a 1GB DRAM cache architected as an Alloy Cache and show that even this, the most bandwidth-efficient DRAM cache proposal, consumes 3.8x the bandwidth of an idealized DRAM cache that spends no bandwidth on secondary operations. We also show that redesigning the DRAM cache to minimize the bandwidth consumed by secondary operations can potentially improve system performance by 22%. To that end, this paper proposes a Bandwidth Efficient ARchitecture (BEAR) for DRAM caches. BEAR integrates three components, one each for reducing the bandwidth consumed by miss detection, miss fill, and writeback probes. BEAR reduces DRAM cache bandwidth consumption by 32%, which reduces cache hit latency by 24% and increases overall system performance by 10%. With negligible overhead, BEAR outperforms an idealized SRAM Tag-Store design that incurs an unacceptable 64-megabyte overhead, as well as Sector Cache designs that incur a 6-megabyte SRAM storage overhead.
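The bandwidth accounting above can be made concrete with a back-of-envelope model. The sketch below is purely illustrative: the 64B line size, the 72B tag-and-data (TAD) burst, and the per-operation accounting are assumptions for exposition, not the paper's evaluation methodology, and the rates passed in are hypothetical.

```python
# Illustrative model of DRAM-cache bandwidth bloat relative to an
# idealized cache. All sizes and rates are assumptions, not figures
# taken from the paper's evaluation.

LINE = 64   # useful data transferred per cache hit (bytes, assumed)
TAD  = 72   # tag-and-data burst read on every lookup (bytes, assumed)

def relative_bandwidth(hit_rate, dirty_rate):
    """Bandwidth consumed per access, normalized to an idealized
    DRAM cache that spends bandwidth only on hit data transfer.

    hit_rate   -- fraction of accesses that hit in the DRAM cache
    dirty_rate -- writeback probes per access from dirty LLC evictions
    """
    lookup   = TAD                    # miss detection: every access reads a TAD
    fill     = (1 - hit_rate) * LINE  # miss fill: install the line fetched from memory
    wb_probe = dirty_rate * TAD       # writeback lookup + content update
    total    = lookup + fill + wb_probe
    ideal    = hit_rate * LINE        # idealized cache: useful hit data only
    return total / ideal
```

Under this toy accounting, even a perfect hit rate with no writebacks leaves the cache spending more than 1x the idealized bandwidth (the tag bits ride along with every data burst), and plausible hit and dirty rates push the ratio into the same multi-x regime the paper measures.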
BEAR: techniques for mitigating bandwidth bloat in gigascale DRAM caches (ISCA'15)