
Cooperative Caching for Chip Multiprocessors

Published: 01 May 2006
Abstract

This paper presents CMP Cooperative Caching, a unified framework to manage a CMP's aggregate on-chip cache resources. Cooperative caching combines the strengths of private and shared cache organizations by forming an aggregate "shared" cache through cooperation among private caches. Locally active data are attracted to the private caches by their accessing processors to reduce remote on-chip references, while globally active data are cooperatively identified and kept in the aggregate cache to reduce off-chip accesses. Examples of cooperation include cache-to-cache transfers of clean data, replication-aware data replacement, and global replacement of inactive data. These policies can be implemented by modifying an existing cache replacement policy and cache coherence protocol, or by the new implementation of a directory-based protocol presented in this paper. Our evaluation using full-system simulation shows that cooperative caching achieves an off-chip miss rate similar to that of a shared cache, and a local cache hit rate similar to that of using private caches. Cooperative caching performs robustly over a range of system/cache sizes and memory latencies. For an 8-core CMP with 1MB L2 cache per core, the best cooperative caching scheme improves the performance of multithreaded commercial workloads by 5-11% compared with a shared cache and 4-38% compared with private caches. For a 4-core CMP running multiprogrammed SPEC2000 workloads, cooperative caching is on average 11% and 6% faster than shared and private cache organizations, respectively.
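
To make the idea of replication-aware data replacement concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not spell out the policy, so the rule used here (prefer evicting a block that is assumed to have another on-chip copy, and fall back to plain LRU otherwise) is an illustrative assumption, as are the names `Block`, `replicated`, and `pick_victim`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    tag: int
    last_used: int    # larger value = more recently used
    replicated: bool  # assumption: True if another on-chip cache holds a copy


def pick_victim(cache_set: List[Block]) -> Block:
    """Choose a block to evict from one set of a private cache.

    Hypothetical policy: evict the least recently used replicated block
    if any exists, so that blocks with a single on-chip copy stay on chip;
    otherwise evict the least recently used block overall.
    """
    replicas = [b for b in cache_set if b.replicated]
    candidates = replicas if replicas else cache_set
    return min(candidates, key=lambda b: b.last_used)


# Usage: block 2 is the only replica, so it is chosen even though
# block 3 is older.
example_set = [Block(1, 5, False), Block(2, 9, True), Block(3, 1, False)]
victim = pick_victim(example_set)
```

The design choice in this sketch trades a possible extra remote on-chip reference (the evicted replica may have to be fetched from another core's cache later) for a lower chance of an off-chip miss, which matches the goal the abstract states for globally active data.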

        • Published in

          ACM SIGARCH Computer Architecture News, Volume 34, Issue 2
          May 2006
          383 pages
          ISSN: 0163-5964
          DOI: 10.1145/1150019
          • ISCA '06: Proceedings of the 33rd annual international symposium on Computer Architecture
            June 2006
            383 pages
            ISBN: 076952608X

          Copyright © 2006 Authors

          Publisher

          Association for Computing Machinery

          New York, NY, United States
