ABSTRACT
Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been assumed to be expensive when done in software and consequently, their usage for online optimizations has been limited. To address these problems and opportunities, we have developed a low-overhead software technique to obtain L2 MRCs online on current processors, exploiting features available in their performance monitoring units so that no changes to the application source code or binaries are required. Our technique, called RapidMRC, requires a single probing period of roughly 221 million processor cycles (147 ms), and subsequently 124 million cycles (83 ms) to process the data. We demonstrate its accuracy by comparing the obtained MRCs to the actual L2 MRCs of 30 applications taken from SPECcpu2006, SPECcpu2000, and SPECjbb2000. We show that RapidMRC can be applied to sizing cache partitions, helping to achieve performance improvements of up to 27%.
- D. Albonesi. Selective cache ways: on-demand cache resource allocation. In MICRO, pages 248--259, 1999. Google ScholarDigital Library
- C. Antonopoulos, D. Nikolopoulos, and T. Papatheodorou. Scheduling algorithms with bus bandwidth considerations for SMPs. In ICPP, pages 547--554, 2003.Google ScholarCross Ref
- R. Azimi, L. Soares, M. Stumm, T. Walsh, and A. Demke Brown. PATH: page access tracking to improve memory management. In ISMM, pages 31--42, 2007. Google ScholarDigital Library
- R. Azimi, M. Stumm, and R. Wisniewski. Online performance analysis by statistical sampling of microprocessor performance counters. In ICS, pages 101--110, 2005. Google ScholarDigital Library
- R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures. In MICRO, pages 245--257, 2000. Google ScholarDigital Library
- E. Berg and E. Hagersten. StatCache: A probabilistic approach to efficient and accurate data locality analysis. In ISPASS, pages 20--27, 2004. Google ScholarDigital Library
- E. Berg and E. Hagersten. Fast data-locality profiling of native execution. In SIGMETRICS, pages 169--180, 2005. Google ScholarDigital Library
- E. Berg, H. Zeffer, and E. Hagersten. A statistical multiprocessor cache model. In ISPASS, pages 89--99, 2006.Google ScholarCross Ref
- D. Bruening, E. Duesterwald, and S. Amarasinghe. Design and implementation of a dynamic optimization framework for Windows. In FDDO, 2001.Google Scholar
- B. Buck and J. Hollingsworth. An API for runtime code patching. J. of High Performance Computing Applications, 14(4):317--329, 2000. Google ScholarDigital Library
- D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In HPCA, pages 340--351, 2005. Google ScholarDigital Library
- S. Cho and L Jin. Managing distributed, shared L2 caches through OS-level page allocation. In MICRO, pages 455--468, 2006. Google ScholarDigital Library
- J. Edler and M. Hill. Dinero IV trace-driven uniprocessor cache simulator. URL http://www.cs.wisc.edu/~markhill/DineroIV.Google Scholar
- A. Fedorova, M. Seltzer, C. Small, and D. Nussbaum. Performance of multithreaded chip multiprocessors and implications for operating system design. In USENIX ATC, pages 26--26, 2005. Google ScholarDigital Library
- F. Guo and Y. Solihin. An analytical model for cache replacement policy performance. In SIGMETRICS, pages 228--239, 2006. Google ScholarDigital Library
- R. Iyer. CQoS: a framework for enabling QoS in shared caches of CMP platforms. In ICS, pages 257--266, 2004. Google ScholarDigital Library
- R. Iyer, L. Zhao, F. Guo, R. Illikkal, D. Newell, Y. Solihin, L. Hsu, and S. Reinhardt. QoS policies and architecture for cache/memory in CMP platforms. In SIGMETRICS, pages 25--36, 2007. Google ScholarDigital Library
- J. Kim, J. Choi, J. Kim, S. Noh, S. Min, Y. Cho, and C. Kim. A low-overhead high-performance unified buffer management scheme that exploits sequential and looping references. In OSDI, pages 119--34, 2000. Google ScholarDigital Library
- S. Kim, D. Chandra, and Y. Solihin. Fair cache sharing and partitioning in a chip multiprocessor architecture. In PACT, pages 111--122, 2004. Google ScholarDigital Library
- Y. Kim, M. Hill, and D. Wood. Implementing stack simulation for highlyassociative memories. In SIGMETRICS, pages 212--213, 1991. Google ScholarDigital Library
- J. Liedtke, H. Hartig, and M. Hohmuth. OS-controlled cache predictability for real-time systems. In RTAS, pages 213--227, 1997. Google ScholarDigital Library
- J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In HPCA, pages 367--378, 2008.Google Scholar
- C. Liu, A. Sivasubramaniam, and M. Kandemir. Organizing the last line of defense before hitting the memory wall for CMPs. In HPCA, pages 176--185, 2004. Google ScholarDigital Library
- C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI, pages 190--200, 2005. Google ScholarDigital Library
- R. Mattson, J. Gecsei, D. Slutz, and I. Traiger. Evaluation techniques and storage hierarchies. IBM Systems J., 9(2):78--117, 1970.Google ScholarDigital Library
- K. Meng, R. Joseph, R. Dick, and L. Shang. Multi-optimization power management for chip multiprocessors. In PACT, pages 177--186, 2008. Google ScholarDigital Library
- M. Olszewski, K. Mierle, A. Czajkowski, and A. Demke Brown. JIT instrumentation: a novel approach to dynamically instrument operating systems. In EuroSys, pages 3--16, 2007. Google ScholarDigital Library
- R. Patterson, G. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka. Informed prefetching and caching. In SOSP, pages 79--95, 1995. Google ScholarDigital Library
- M. Qureshi and Y. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO, pages 423--432, 2006. Google ScholarDigital Library
- N. Rafique, W. Lim, and M. Thottethodi. Architectural support for operating system-driven CMP cache management. In PACT, pages 2--12, 2006. Google ScholarDigital Library
- R. Rajkumar, C. Lee, J. Lehoczky, and D. Siewiorek. A resource allocation model for QoS management. In RTSS, pages 298--308, 1997. Google ScholarDigital Library
- A. Settle, J. Kihm, A. Janiszewski, and D. Connors. Architectural support for enhanced SMT job scheduling. In PACT, 2004. Google ScholarDigital Library
- J. Seward and N. Nethercote. Using Valgrind to detect undefined value errors with bit-precision. In USENIX ATC, pages 17--30, 2005. Google ScholarDigital Library
- X. Shen, J. Shaw, B. Meeker, and C. Ding. Locality approximation using time. In POPL, pages 55--61, 2007. Google ScholarDigital Library
- T. Sherwood, S. Sair, and B. Calder. Phase tracking and prediction. In ISCA, pages 336--349, 2003. Google ScholarDigital Library
- A. Snavely and D. Tullsen. Symbiotic jobscheduling for a simultaneous multithreading processor. In ASPLOS, pages 234--244, 2000. Google ScholarDigital Library
- L. Soares, D. Tam, and M. Stumm. Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer. In MICRO, 2008. Google ScholarDigital Library
- G. Soundararajan, J. Chen, M. Sharaf, and C. Amza. Dynamic partitioning of the cache hierarchy in shared data centers. In VLDB, pages 635--646, 2008. Google ScholarDigital Library
- S. Srikantaiah, M. Kandemir, and M. Irwin. Adaptive set pinning: Managing shared caches in chip multiprocessors. In ASPLOS, pages 135--144, 2008. Google ScholarDigital Library
- H. Stone, J. Turek, and J. Wolf. Optimal partitioning of cache memory. IEEE TOC, 41(9):1054--1068, 1992. Google ScholarDigital Library
- E. Suh, L Rudolph, and S. Devadas. Dynamic partitioning of shared cache memory. The J. of Supercomputing, 28(1):7--26, Apr. 2004. Google ScholarDigital Library
- D. Tam, R. Azimi, L. Soares, and M. Stumm. Managing shared L2 caches on multicore systems in software. In WIOSCA, pages 26--33, 2007.Google Scholar
- D. Tam, R. Azimi, and M. Stumm. Thread clustering: Sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In EuroSys, pages 47--58, 2007. Google ScholarDigital Library
- D. Thiebaut, H. Stone, and J.Wolf. Improving disk cache hit-ratios through cache partitioning. IEEE TOC, 41(6):665--676, 1992. Google ScholarDigital Library
- T. Yang, E. Berger, S. Kaplan, and J. Moss. CRAMM: virtual memory support for garbage-collected applications. In OSDI, pages 103--116, 2006. Google ScholarDigital Library
- Q. Zhao, R. Rabbah, S. Amarasinghe, L. Rudolph, and W.-F. Wong. Ubiquitous memory introspection. In CGO, pages 299--311, 2007. Google ScholarDigital Library
- P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar. Dynamic tracking of page miss ratio curve for memory management. In ASPLOS, pages 177--188, 2004. Google ScholarDigital Library
- Y. Zhou, J. Philbin, and K. Li. The multi-queue replacement algorithm for second level buffer caches. In USENIX ATC, pages 91--104, 2001. Google ScholarDigital Library
Index Terms
- RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations
Recommendations
RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations
ASPLOS 2009Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been ...
RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations
ASPLOS 2009Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been ...
Reactive NUCA: near-optimal block placement and replication in distributed caches
ISCA '09: Proceedings of the 36th annual international symposium on Computer architectureIncreases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the ...
Comments