research-article

A tagless coherence directory

Authors:
Jason Zebchuk (University of Toronto)
Vijayalakshmi Srinivasan (T.J. Watson Research Center, IBM)
Moinuddin K. Qureshi (T.J. Watson Research Center, IBM)
Andreas Moshovos (University of Toronto)

MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, December 2009, Pages 423–434
https://doi.org/10.1145/1669112.1669166

Published: 12 December 2009

ABSTRACT

A key challenge in architecting a chip multiprocessor (CMP) with many cores is maintaining cache coherence efficiently. Directory-based protocols avoid the bandwidth overhead of snoop-based protocols, and therefore scale to a large number of cores. Unfortunately, conventional directory structures incur significant area overheads in larger CMPs.

The Tagless Coherence Directory (TL) is a scalable coherence solution that uses an implicit, conservative representation of sharing information. Conceptually, TL consists of a grid of small Bloom filters. The grid has one column per core and one row per cache set. TL uses 48% less area, 57% less leakage power, and 44% less dynamic energy than a conventional coherence directory for a 16-core CMP with 1MB private L2 caches. Simulations of commercial and scientific workloads indicate that TL has no statistically significant impact on performance, and incurs only a 2.5% increase in bandwidth utilization. Analytical modelling predicts that TL continues to scale well up to at least 1024 cores.
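The grid organization described above can be illustrated with a small Python sketch. This is only a rough model under stated assumptions, not the authors' design: the class name `TaglessDirectorySketch`, the filter width, the number of hash functions, and the multiplicative hash constants are all hypothetical, since the abstract does not specify them. The sketch shows why the representation is conservative: a lookup never misses a core that recorded the block, but may name extra cores.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical odd multipliers for illustrative hash functions
# (an assumption; the abstract does not describe the real hashes).
_HASH_MULTIPLIERS = (0x9E3779B1, 0x85EBCA6B, 0xC2B2AE35)


@dataclass
class TaglessDirectorySketch:
    """Grid of small Bloom filters: one row per cache set, one column per core."""

    num_cores: int
    num_sets: int
    filter_bits: int = 64  # assumed filter width
    num_hashes: int = 2    # assumed number of hash functions (1 to 3 here)
    grid: List[List[int]] = field(init=False)

    def __post_init__(self) -> None:
        if not 1 <= self.num_hashes <= len(_HASH_MULTIPLIERS):
            raise ValueError("num_hashes must be between 1 and 3 in this sketch")
        # grid[row][core] holds one Bloom filter, stored as an integer bitmask.
        self.grid = [[0] * self.num_cores for _ in range(self.num_sets)]

    def _row(self, block_addr: int) -> int:
        # The cache set index selects the row of the grid.
        return block_addr % self.num_sets

    def _mask(self, block_addr: int) -> int:
        # Hash the remaining address bits to the filter bits for this block.
        tag = block_addr // self.num_sets
        mask = 0
        for mult in _HASH_MULTIPLIERS[: self.num_hashes]:
            mask |= 1 << (((tag * mult) >> 16) % self.filter_bits)
        return mask

    def record_fill(self, core: int, block_addr: int) -> None:
        """Note that `core` now caches `block_addr`."""
        self.grid[self._row(block_addr)][core] |= self._mask(block_addr)

    def possible_sharers(self, block_addr: int) -> List[int]:
        """Return a superset of the cores that recorded `block_addr`."""
        mask = self._mask(block_addr)
        row = self.grid[self._row(block_addr)]
        return [core for core, bits in enumerate(row) if (bits & mask) == mask]


directory = TaglessDirectorySketch(num_cores=4, num_sets=8)
directory.record_fill(core=1, block_addr=0x40)
directory.record_fill(core=3, block_addr=0x40)
print(directory.possible_sharers(0x40))
```

Because no address tags are stored, each grid entry costs only `filter_bits` bits regardless of address width. The sketch omits clearing filter bits when a block is evicted, which a plain Bloom filter cannot do on its own; how the actual design handles that is not covered by the abstract.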

Index Terms

A tagless coherence directory
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Published in
MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
December 2009
601 pages
ISBN: 9781605587981
DOI: 10.1145/1669112
General Chairs: David Albonesi (Cornell), Margaret Martonosi (Princeton)
Program Chairs: David August (Princeton/Parakinetics), José Martínez (Cornell)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Bloom filters
cache coherence
directory coherence
Acceptance Rates
Overall Acceptance Rate: 484 of 2,242 submissions, 22%
Article Metrics
- Total Citations: 112
- Total Downloads: 749
- Downloads (Last 12 months): 48
- Downloads (Last 6 weeks): 11