Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks

Abstract
Many modern high-performance processors prefetch blocks into the on-chip cache. Prefetched blocks can potentially pollute the cache by evicting more useful blocks. In this work, we observe that both accurate and inaccurate prefetches lead to cache pollution, and propose a comprehensive mechanism to mitigate prefetcher-caused cache pollution.
First, we observe that, in secondary caches, over 95% of useful prefetched blocks across a wide variety of applications are not reused after the first demand hit. Based on this observation, our first mechanism simply demotes a prefetched block to the lowest priority on a demand hit. Second, to address pollution caused by inaccurate prefetches, we propose a self-tuning prefetch accuracy predictor that predicts whether each prefetch is accurate. Only predicted-accurate prefetches are inserted into the cache with high priority.
Evaluations show that our final mechanism, which combines these two ideas, significantly improves performance compared to both the baseline LRU policy and two state-of-the-art approaches to mitigating prefetcher-caused cache pollution (up to 49%, and 6% on average for 157 two-core multiprogrammed workloads). The performance improvement is consistent across a wide variety of system configurations.
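The two ideas above can be illustrated with a minimal sketch of one cache set using simple recency priorities (index 0 = highest priority, last index = eviction victim). All names here (`CacheSet`, `insert_prefetch`, `demand_access`) are illustrative assumptions, not the paper's actual implementation; the accuracy prediction itself is taken as an external input.

```python
# Sketch of the abstract's two policies on a single cache set.
# Assumed model: a priority-ordered list, not the paper's real hardware design.

class Line:
    def __init__(self, tag, prefetched):
        self.tag = tag
        self.prefetched = prefetched  # was this block inserted by the prefetcher?

class CacheSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.lines = []  # index 0 = highest priority, end = lowest (victim)

    def insert_prefetch(self, tag, predicted_accurate):
        # Idea 2: only predicted-accurate prefetches are inserted with high
        # priority; predicted-inaccurate ones go in at the lowest priority,
        # so they are evicted first if they are never used.
        if len(self.lines) == self.ways:
            self.lines.pop()  # evict the lowest-priority line
        line = Line(tag, prefetched=True)
        if predicted_accurate:
            self.lines.insert(0, line)
        else:
            self.lines.append(line)

    def demand_access(self, tag):
        for i, line in enumerate(self.lines):
            if line.tag == tag:
                self.lines.pop(i)
                if line.prefetched:
                    # Idea 1: most useful prefetches are dead after their
                    # first demand hit, so demote instead of promoting.
                    line.prefetched = False
                    self.lines.append(line)
                else:
                    self.lines.insert(0, line)  # normal MRU promotion
                return True
        # Demand miss: insert the block at the highest priority.
        if len(self.lines) == self.ways:
            self.lines.pop()
        self.lines.insert(0, Line(tag, prefetched=False))
        return False
```

For example, a prefetched block that takes one demand hit immediately becomes the set's eviction victim rather than lingering at MRU, which is the mechanism by which dead prefetched blocks stop displacing more useful demand blocks.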