DOI: 10.1145/1669112.1669166
research-article

A tagless coherence directory

Published: 12 December 2009

ABSTRACT

A key challenge in architecting a CMP with many cores is maintaining cache coherence in an efficient manner. Directory-based protocols avoid the bandwidth overhead of snoop-based protocols, and therefore scale to a large number of cores. Unfortunately, conventional directory structures incur significant area overheads in larger CMPs.

The Tagless Coherence Directory (TL) is a scalable coherence solution that uses an implicit, conservative representation of sharing information. Conceptually, TL consists of a grid of small Bloom filters. The grid has one column per core and one row per cache set. TL uses 48% less area, 57% less leakage power, and 44% less dynamic energy than a conventional coherence directory for a 16-core CMP with 1MB private L2 caches. Simulations of commercial and scientific workloads indicate that TL has no statistically significant impact on performance, and incurs only a 2.5% increase in bandwidth utilization. Analytical modelling predicts that TL continues to scale well up to at least 1024 cores.
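The grid organization described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the class name `TaglessDirectorySketch`, the filter width, the number of hash functions, and the multiplicative hashes are all assumptions made for illustration, since the abstract does not specify them. The sketch also omits removal of blocks on eviction, which a plain Bloom filter cannot support and which the abstract does not describe.

```python
from typing import List, Set

# Hypothetical odd multipliers used to derive hash bits; not taken from the paper.
_MULTIPLIERS = (0x9E3779B1, 0x85EBCA6B, 0xC2B2AE35, 0x27D4EB2F)


class TaglessDirectorySketch:
    """Grid of small Bloom filters: one row per cache set, one column per core."""

    def __init__(self, num_cores: int, num_sets: int,
                 filter_bits: int = 64, num_hashes: int = 2) -> None:
        if not 1 <= num_hashes <= len(_MULTIPLIERS):
            raise ValueError("num_hashes out of range for this sketch")
        self.num_cores = num_cores
        self.num_sets = num_sets
        self.filter_bits = filter_bits
        self.num_hashes = num_hashes
        # grid[set_index][core] is one Bloom filter stored as an int bit vector.
        self.grid: List[List[int]] = [[0] * num_cores for _ in range(num_sets)]

    def _row(self, block_addr: int) -> int:
        # Assumed mapping: low-order block address bits select the cache set.
        return block_addr % self.num_sets

    def _mask(self, block_addr: int) -> int:
        # Hash the remaining (tag) bits to a few bit positions in the filter.
        tag = block_addr // self.num_sets
        mask = 0
        for m in _MULTIPLIERS[:self.num_hashes]:
            bit = (((tag + 1) * m) >> 8) % self.filter_bits
            mask |= 1 << bit
        return mask

    def record_fill(self, core: int, block_addr: int) -> None:
        """Note that `core` now caches the block."""
        self.grid[self._row(block_addr)][core] |= self._mask(block_addr)

    def possible_sharers(self, block_addr: int) -> Set[int]:
        """Return a conservative superset of the cores caching the block."""
        mask = self._mask(block_addr)
        row = self.grid[self._row(block_addr)]
        return {core for core, bits in enumerate(row) if bits & mask == mask}
```

A lookup reads one row of the grid and tests every core's filter in that row. Because a Bloom filter has no false negatives, every true sharer is reported. False positives can add extra cores to the result, which is why the representation is conservative rather than exact.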

    • Published in

      MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
      December 2009
      601 pages
      ISBN:9781605587981
      DOI:10.1145/1669112

      Copyright © 2009 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 484 of 2,242 submissions, 22%
