
Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks

Published: 09 January 2015

Abstract

Many modern high-performance processors prefetch blocks into the on-chip cache. Prefetched blocks can potentially pollute the cache by evicting more useful blocks. In this work, we observe that both accurate and inaccurate prefetches lead to cache pollution, and propose a comprehensive mechanism to mitigate prefetcher-caused cache pollution.

First, we observe that over 95% of useful prefetches in a wide variety of applications are not reused after the first demand hit (in secondary caches). Based on this observation, our first mechanism simply demotes a prefetched block to the lowest priority on a demand hit. Second, to address pollution caused by inaccurate prefetches, we propose a self-tuning prefetch accuracy predictor to predict if a prefetch is accurate or inaccurate. Only predicted-accurate prefetches are inserted into the cache with a high priority.
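The two ideas above can be illustrated with a minimal sketch of a single cache set managed as an LRU stack. This is not the authors' implementation: the `CacheSet` class, the `predict_accurate` callback (standing in for the self-tuning accuracy predictor, whose internals the abstract does not describe), and the choice to insert predicted-inaccurate prefetches at the lowest priority are all assumptions made for illustration.

```python
from collections import OrderedDict


class CacheSet:
    """One set of a set-associative cache, ordered from LRU (first) to MRU (last)."""

    def __init__(self, ways, predict_accurate):
        self.ways = ways
        # Hypothetical predictor hook: returns True if a prefetch is predicted accurate.
        self.predict_accurate = predict_accurate
        # Maps block address -> True while the block is prefetched and not yet demand-hit.
        self.blocks = OrderedDict()

    def _insert(self, addr, prefetched, high_priority):
        if addr in self.blocks:
            return
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)  # evict the lowest-priority (LRU) block
        self.blocks[addr] = prefetched  # new entries land at the MRU end
        if not high_priority:
            # Assumption: low-priority insertion means the LRU position.
            self.blocks.move_to_end(addr, last=False)

    def prefetch(self, addr):
        """Insert a prefetched block; only predicted-accurate ones get high priority."""
        self._insert(addr, prefetched=True, high_priority=self.predict_accurate(addr))

    def demand_access(self, addr):
        """Return True on a hit, False on a miss (the block is then filled at MRU)."""
        if addr not in self.blocks:
            self._insert(addr, prefetched=False, high_priority=True)
            return False
        if self.blocks[addr]:
            # First demand hit to a prefetched block: demote it to lowest priority.
            self.blocks[addr] = False
            self.blocks.move_to_end(addr, last=False)
        else:
            self.blocks.move_to_end(addr)  # ordinary LRU promotion
        return True

    def eviction_order(self):
        """Block addresses from next-to-evict to most protected."""
        return list(self.blocks)
```

In this sketch, a prefetched block that receives its first demand hit moves to the front of `eviction_order()`, so it is the next block evicted unless it is demanded again. That mirrors the observation that most useful prefetches are not reused after the first demand hit.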

Evaluations show that our final mechanism, which combines these two ideas, significantly improves performance compared to both the baseline LRU policy and two state-of-the-art approaches to mitigating prefetcher-caused cache pollution (up to 49%, and 6% on average for 157 two-core multiprogrammed workloads). The performance improvement is consistent across a wide variety of system configurations.

    • Published in

      ACM Transactions on Architecture and Code Optimization, Volume 11, Issue 4
      January 2015
      797 pages
      ISSN: 1544-3566
      EISSN: 1544-3973
      DOI: 10.1145/2695583

      Copyright © 2015 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 9 January 2015
      • Revised: 1 October 2014
      • Accepted: 1 October 2014
      • Received: 1 February 2014


      Qualifiers

      • research-article
      • Research
      • Refereed
