Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks

Abstract
Many modern high-performance processors prefetch blocks into the on-chip cache. Prefetched blocks can potentially pollute the cache by evicting more useful blocks. In this work, we observe that both accurate and inaccurate prefetches lead to cache pollution, and propose a comprehensive mechanism to mitigate prefetcher-caused cache pollution.
First, we observe that, in secondary caches, over 95% of useful prefetched blocks across a wide variety of applications are not reused after the first demand hit. Based on this observation, our first mechanism simply demotes a prefetched block to the lowest priority on a demand hit. Second, to address pollution caused by inaccurate prefetches, we propose a self-tuning prefetch accuracy predictor that predicts whether each prefetch is accurate. Only predicted-accurate prefetches are inserted into the cache with high priority.
Evaluations show that our final mechanism, which combines these two ideas, significantly improves performance compared to both the baseline LRU policy and two state-of-the-art approaches to mitigating prefetcher-caused cache pollution (up to 49%, and 6% on average for 157 two-core multiprogrammed workloads). The performance improvement is consistent across a wide variety of system configurations.
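The two ideas above can be illustrated with a minimal sketch of one cache set using simple recency priorities (index 0 = highest priority, last index = eviction victim). All names here (`CacheSet`, `insert_prefetch`, `demand_access`) are illustrative assumptions, not the paper's actual implementation; the accuracy prediction itself is taken as an external input.

```python
# Sketch of the abstract's two policies on a single cache set.
# Assumed model: a priority-ordered list, not the paper's real hardware design.

class Line:
    def __init__(self, tag, prefetched):
        self.tag = tag
        self.prefetched = prefetched  # was this block inserted by the prefetcher?

class CacheSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.lines = []  # index 0 = highest priority, end = lowest (victim)

    def insert_prefetch(self, tag, predicted_accurate):
        # Idea 2: only predicted-accurate prefetches are inserted with high
        # priority; predicted-inaccurate ones go in at the lowest priority,
        # so they are evicted first if they are never used.
        if len(self.lines) == self.ways:
            self.lines.pop()  # evict the lowest-priority line
        line = Line(tag, prefetched=True)
        if predicted_accurate:
            self.lines.insert(0, line)
        else:
            self.lines.append(line)

    def demand_access(self, tag):
        for i, line in enumerate(self.lines):
            if line.tag == tag:
                self.lines.pop(i)
                if line.prefetched:
                    # Idea 1: most useful prefetches are dead after their
                    # first demand hit, so demote instead of promoting.
                    line.prefetched = False
                    self.lines.append(line)
                else:
                    self.lines.insert(0, line)  # normal MRU promotion
                return True
        # Demand miss: insert the block at the highest priority.
        if len(self.lines) == self.ways:
            self.lines.pop()
        self.lines.insert(0, Line(tag, prefetched=False))
        return False
```

For example, a prefetched block that takes one demand hit immediately becomes the set's eviction victim rather than lingering at MRU, which is the mechanism by which dead prefetched blocks stop displacing more useful demand blocks.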