Abstract
This paper presents CMP Cooperative Caching, a unified framework to manage a CMP's aggregate on-chip cache resources. Cooperative caching combines the strengths of private and shared cache organizations by forming an aggregate "shared" cache through cooperation among private caches. Locally active data are attracted to the private caches by their accessing processors to reduce remote on-chip references, while globally active data are cooperatively identified and kept in the aggregate cache to reduce off-chip accesses. Examples of cooperation include cache-to-cache transfers of clean data, replication-aware data replacement, and global replacement of inactive data. These policies can be implemented by modifying an existing cache replacement policy and cache coherence protocol, or by the new implementation of a directory-based protocol presented in this paper. Our evaluation using full-system simulation shows that cooperative caching achieves an off-chip miss rate similar to that of a shared cache and a local cache hit rate similar to that of private caches. Cooperative caching performs robustly over a range of system sizes, cache sizes, and memory latencies. For an 8-core CMP with 1MB L2 cache per core, the best cooperative caching scheme improves the performance of multithreaded commercial workloads by 5-11% compared with a shared cache and 4-38% compared with private caches. For a 4-core CMP running multiprogrammed SPEC2000 workloads, cooperative caching is on average 11% and 6% faster than shared and private cache organizations, respectively.
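The abstract names replication-aware data replacement without giving details. As a rough illustration only, the sketch below assumes that each cached block carries a flag saying whether another on-chip cache also holds a copy, and that the victim is the least recently used replicated block when one exists. The names `Block`, `replicated`, `last_used`, and `choose_victim` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    tag: int
    last_used: int    # larger value = more recently used
    replicated: bool  # assumed flag: another on-chip cache also holds this block


def choose_victim(cache_set: List[Block]) -> Block:
    """Pick a victim from one cache set (illustrative policy, not the paper's exact one).

    Prefer the least recently used block that has a replica elsewhere on chip,
    since evicting it does not remove the data from the aggregate cache.
    If no block is replicated, fall back to plain LRU.
    """
    replicas = [b for b in cache_set if b.replicated]
    candidates = replicas if replicas else cache_set
    return min(candidates, key=lambda b: b.last_used)
```

Under these assumptions, a set holding an old unreplicated block and two newer replicated blocks would give up the older of the two replicas, keeping the only on-chip copy of the unreplicated block.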
Cooperative Caching for Chip Multiprocessors
ISCA '06: Proceedings of the 33rd Annual International Symposium on Computer Architecture