research-article

Optimal bypass monitor for high performance last-level caches

Authors:
Lingda Li, Peking University, Beijing, China
Dong Tong, Peking University, Beijing, China
Zichao Xie, Peking University, Beijing, China
Junlin Lu, Peking University, Beijing, China
Xu Cheng, Peking University, Beijing, China

PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, September 2012, pages 315–324. https://doi.org/10.1145/2370816.2370862

Published: 19 September 2012

ABSTRACT

In the last-level cache, a large number of blocks have reuse distances greater than the available cache capacity. Cache performance and efficiency can be improved if some subset of these distant-reuse blocks can reside in the cache longer. Bypass is an effective and attractive technique for this, because it prevents harmful blocks from being inserted into the cache.
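To make "reuse distance greater than the available cache capacity" concrete, the following Python sketch (an illustration, not code from the paper) assumes the common definition of reuse distance as the number of distinct other blocks referenced since the previous access to the same block, i.e., the LRU stack distance. Under that assumption, an access to a fully associative LRU cache of `capacity_blocks` blocks misses exactly when its reuse distance is at least the capacity. The trace format and function names are hypothetical.

```python
from typing import Hashable, List, Optional


def reuse_distances(trace: List[Hashable]) -> List[Optional[int]]:
    """For each access, count the distinct other blocks referenced since the
    previous access to the same block (None for a first access)."""
    last_seen = {}
    result: List[Optional[int]] = []
    for i, block in enumerate(trace):
        if block in last_seen:
            # Distinct blocks touched strictly between the two accesses.
            between = set(trace[last_seen[block] + 1 : i])
            result.append(len(between))
        else:
            result.append(None)
        last_seen[block] = i
    return result


def distant_reuses(trace: List[Hashable], capacity_blocks: int) -> List[int]:
    """Indices of reuses whose distance is >= capacity, i.e. reuses that miss
    in a fully associative LRU cache holding `capacity_blocks` blocks."""
    return [i for i, d in enumerate(reuse_distances(trace))
            if d is not None and d >= capacity_blocks]
```

For example, with the trace A B C A B D A and a 2-block cache, each of the three reuses has distance 2 and therefore misses; with a 3-block cache, none of them does.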

Our analysis shows that bypass can contribute a significant performance improvement, and that optimal bypass achieves performance similar to OPT+B, the theoretically optimal replacement policy. We therefore propose a bypass technique called the Optimal Bypass Monitor (OBM), which makes bypass decisions by learning and predicting the behavior of optimal bypass. OBM keeps a short global track of incoming-victim block pairs. By detecting which block in each pair is reused first, OBM can infer how optimal bypass would have behaved on the track and uses this to guide its bypass choices.
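The abstract describes OBM only at a high level, so the Python sketch below is one hypothetical reading of the incoming-victim track, not the paper's hardware. It assumes: the track is a short FIFO of (incoming, victim) pairs; if a pair's victim is referenced before its incoming block, optimal bypass would have kept the victim and bypassed the incoming block (just as Belady-style optimal replacement keeps whichever of two blocks is reused sooner), and the reverse outcome means insertion was the better choice; each resolved pair nudges a single global saturating counter that drives the bypass decision. The class name, track length, counter width and threshold are illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PairEntry:
    incoming: int  # block that was (or would have been) inserted
    victim: int    # block that the insertion would evict


class BypassMonitorSketch:
    """Hypothetical OBM-style monitor: a short global track of incoming-victim
    pairs feeding one saturating counter (assumed, not the paper's design)."""

    def __init__(self, track_len: int = 8, counter_bits: int = 3):
        self.track = deque(maxlen=track_len)  # oldest unresolved pairs fall off
        self.max_ctr = (1 << counter_bits) - 1
        self.ctr = self.max_ctr // 2  # start just below the bypass threshold

    def record(self, incoming: int, victim: int) -> None:
        """Call on every miss that would evict `victim` to make room."""
        self.track.append(PairEntry(incoming, victim))

    def observe_access(self, addr: int) -> None:
        """Call on every LLC access; resolves pairs whose block is reused."""
        remaining = deque(maxlen=self.track.maxlen)
        for entry in self.track:
            if addr == entry.victim:
                # Victim reused first: optimal bypass keeps the victim,
                # i.e. it would have bypassed the incoming block.
                self.ctr = min(self.ctr + 1, self.max_ctr)
            elif addr == entry.incoming:
                # Incoming reused first: inserting it was the better choice.
                self.ctr = max(self.ctr - 1, 0)
            else:
                remaining.append(entry)
        self.track = remaining

    def should_bypass(self) -> bool:
        """Bypass while the counter sits in its upper half."""
        return self.ctr > self.max_ctr // 2
```

Pairs that fall off the track before either block is reused leave the counter unchanged in this sketch, which is another simplifying assumption.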

Any existing replacement policy can be extended with OBM with negligible design modification. Our experimental results show that, using less than 1.5KB of extra storage, OBM with the NRU replacement policy outperforms LRU by 9.7% and 8.9% for single-thread and multi-programmed workloads, respectively. Compared with other state-of-the-art proposals such as DRRIP and SDBP, it achieves superior performance with less storage overhead.
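To illustrate what extending a replacement policy with a bypass decision might look like, here is a Python sketch of a single cache set under NRU. It uses the common formulation in which each way has one NRU bit (0 = recently used): a hit clears the bit, and a miss evicts the first way whose bit is 1, first resetting all bits to 1 if none is set. The `should_bypass(incoming, victim)` hook is a hypothetical interface, not the paper's; it is consulted only when a valid block would be evicted, and the NRU logic itself is left unchanged.

```python
from typing import Callable, List, Optional


class NRUSet:
    """One cache set with NRU replacement plus an optional bypass hook."""

    def __init__(self, ways: int,
                 should_bypass: Optional[Callable[[int, int], bool]] = None):
        self.tags: List[Optional[int]] = [None] * ways
        self.nru: List[int] = [1] * ways  # 1 = not recently used
        self.should_bypass = should_bypass

    def _pick_victim(self) -> int:
        # Prefer an empty way; otherwise the first way whose NRU bit is 1.
        if None in self.tags:
            return self.tags.index(None)
        if 1 not in self.nru:
            self.nru = [1] * len(self.nru)  # every way recently used: reset
        return self.nru.index(1)

    def access(self, tag: int) -> bool:
        """Return True on a hit. On a miss, insert unless the hook says bypass."""
        if tag in self.tags:
            self.nru[self.tags.index(tag)] = 0
            return True
        way = self._pick_victim()
        victim = self.tags[way]
        # Filling an empty way evicts nothing, so the hook is skipped.
        if victim is not None and self.should_bypass and self.should_bypass(tag, victim):
            return False  # bypass: the incoming block is not inserted
        self.tags[way] = tag
        self.nru[way] = 0
        return False
```

Any predictor that maps an (incoming, victim) pair to a yes/no decision can be passed as `should_bypass`; leaving it as `None` gives plain NRU.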

References

S. Bansal and D. S. Modha. CAR: Clock with adaptive replacement. In FAST-3, 2004.
L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78–101, 1966.
S. Borkar and A. A. Chien. The future of microprocessors. Commun. ACM, 54:67–77, 2011.
M. Chaudhuri. Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches. In MICRO-42, 2009.
C.-H. Chi and H. Dietz. Improving cache performance by selective cache bypass. In HICSS-22, 1989.
H. Gao and C. Wilkerson. A dueling segmented LRU replacement algorithm with adaptive bypassing. In JWAC-1, 2010.
J. Gaur, M. Chaudhuri, and S. Subramoney. Bypass and insertion algorithms for exclusive last-level caches. In ISCA-38, 2011.
A. González, C. Aliagas, and M. Valero. A data cache with multiple caching strategies tuned to different types of locality. In ICS-9, 1995.
J. L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34:1–17, 2006.
Z. Hu, S. Kaxiras, and M. Martonosi. Timekeeping in the memory system: predicting and optimizing memory behavior. In ISCA-29, 2002.
Intel. Intel Core i7 processor. http://www.intel.com/products/processor/corei7/.
A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer. High performance cache replacement using re-reference interval prediction (RRIP). In ISCA-37, 2010.
J. Jalminger and P. Stenström. A novel approach to cache block reuse predictions. In ICPP '03, 2003.
L. John and A. Subramanian. Design and performance evaluation of a cache assist to implement selective caching. In ICCD '97, 1997.
T. Johnson, D. Connors, M. Merten, and W.-M. Hwu. Run-time cache bypassing. IEEE Transactions on Computers, 48(12):1338–1354, 1999.
N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In ISCA-17, 1990.
G. Keramidas, P. Petoumenos, and S. Kaxiras. Cache replacement based on reuse-distance prediction. In ICCD-25, 2007.
S. M. Khan, D. A. Jiménez, D. Burger, and B. Falsafi. Using dead blocks as a virtual victim cache. In PACT-19, 2010.
S. M. Khan, Y. Tian, and D. A. Jiménez. Sampling dead block prediction for last-level caches. In MICRO-43, 2010.
M. Kharbutli and Y. Solihin. Counter-based cache replacement and bypassing algorithms. IEEE Transactions on Computers, 57(4):433–447, 2008.
A.-C. Lai, C. Fide, and B. Falsafi. Dead-block prediction & dead-block correlating prefetchers. In ISCA-28, 2001.
H. Liu, M. Ferdman, J. Huh, and D. Burger. Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. In MICRO-41, 2008.
R. Manikantan, K. Rajan, and R. Govindarajan. NUcache: An efficient multicore cache organization based on next-use distance. In HPCA-17, 2011.
N. Megiddo and D. S. Modha. ARC: A self-tuning, low overhead replacement cache. In FAST-2, 2003.
N. Muralimanohar, R. Balasubramonian, and N. Jouppi. CACTI 6.0: A tool to understand large caches. HP Research Report, 2007.
H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi. Pinpointing representative portions of large Intel Itanium programs with dynamic instrumentation. In MICRO-37, 2004.
M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. In ISCA-34, 2007.
M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt. A case for MLP-aware cache replacement. In ISCA-33, 2006.
K. Rajan and G. Ramaswamy. Emulating optimal replacement with a shepherd cache. In MICRO-40, 2007.
J. Rivers and E. Davidson. Reducing conflicts in direct-mapped caches with a temporality-based design. In ICPP '96, 1996.
J. A. Rivers, E. S. Tam, G. S. Tyson, E. S. Davidson, and M. Farrens. Utilizing reuse information in data cache management. In ICS-12, 1998.
R. Subramanian, Y. Smaragdakis, and G. H. Loh. Adaptive caches: Effective shaping of cache behavior to workloads. In MICRO-39, 2006.
G. Tyson, M. Farrens, J. Matthews, and A. R. Pleszkun. A modified approach to data cache management. In MICRO-28, 1995.
S. Walsh and J. Board. Pollution control caching. In ICCD '95, 1995.
C. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. Steely, Jr., and J. Emer. SHiP: Signature-based hit predictor for high performance caching. In MICRO-44, 2011.
C. Wu, A. Jaleel, M. Martonosi, S. Steely, Jr., and J. Emer. PACMan: Prefetch-aware cache management for high performance caching. In MICRO-44, 2011.
L. Xiang, T. Chen, Q. Shi, and W. Hu. Less reused filter: improving L2 cache performance via filtering less reused lines. In ICS-23, 2009.
Y. Xie and G. H. Loh. PIPP: Promotion/insertion pseudo-partitioning of multi-core shared caches. In ISCA-36, 2009.

Index Terms

Hardware → Integrated circuits → Semiconductor memory → Dynamic memory


Published in
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques
September 2012, 512 pages
ISBN: 9781450311823
DOI: 10.1145/2370816
General Chairs: Pen-Chung Yew (University of Minnesota), Sangyeun Cho (University of Pittsburgh)
Program Chairs: Luiz DeRose (Cray, Inc.), David J. Lilja (University of Minnesota)
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
last-level cache
optimal bypass
replacement
Acceptance Rates
Overall acceptance rate: 121 of 471 submissions (26%)
Article Metrics
- 28 total citations
- 449 total downloads
- 10 downloads (last 12 months)
- 0 downloads (last 6 weeks)
