ABSTRACT
A key challenge in architecting a CMP with many cores is maintaining cache coherence in an efficient manner. Directory-based protocols avoid the bandwidth overhead of snoop-based protocols, and therefore scale to a large number of cores. Unfortunately, conventional directory structures incur significant area overheads in larger CMPs.
The Tagless Coherence Directory (TL) is a scalable coherence solution that uses an implicit, conservative representation of sharing information. Conceptually, TL consists of a grid of small Bloom filters. The grid has one column per core and one row per cache set. TL uses 48% less area, 57% less leakage power, and 44% less dynamic energy than a conventional coherence directory for a 16-core CMP with 1MB private L2 caches. Simulations of commercial and scientific workloads indicate that TL has no statistically significant impact on performance, and incurs only a 2.5% increase in bandwidth utilization. Analytical modelling predicts that TL continues to scale well up to at least 1024 cores.
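The grid organization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `TaglessDirectorySketch`, the filter width, the number of hash functions, and the multiplicative hashes are all assumptions chosen for illustration, and removing a block on eviction is left out. The sketch shows only the conservative property: a lookup may report extra sharers, but never misses a core that actually recorded the block.

```python
from typing import List, Set


class TaglessDirectorySketch:
    """Illustrative grid of small Bloom filters: one row per cache set,
    one column per core. All sizes and hash functions are assumptions."""

    def __init__(self, num_cores: int, num_sets: int,
                 filter_bits: int = 64, num_hashes: int = 2) -> None:
        self.num_cores = num_cores
        self.num_sets = num_sets
        self.filter_bits = filter_bits
        self.num_hashes = num_hashes
        # grid[set_index][core] holds one Bloom filter as an integer bitmask.
        self.grid = [[0] * num_cores for _ in range(num_sets)]

    def _set_index(self, block_addr: int) -> int:
        # The low-order bits of the block address select the row (cache set).
        return block_addr % self.num_sets

    def _bit_positions(self, block_addr: int) -> List[int]:
        # Hypothetical hash functions: multiplicative hashes of the tag
        # with a different odd multiplier per hash.
        tag = block_addr // self.num_sets
        return [((tag * (2 * i + 3) * 0x9E3779B1) >> 7) % self.filter_bits
                for i in range(self.num_hashes)]

    def record_fill(self, core: int, block_addr: int) -> None:
        """Note that `core` now caches `block_addr`."""
        row = self.grid[self._set_index(block_addr)]
        for bit in self._bit_positions(block_addr):
            row[core] |= 1 << bit

    def possible_sharers(self, block_addr: int) -> Set[int]:
        """Return a conservative superset of the cores caching `block_addr`."""
        row = self.grid[self._set_index(block_addr)]
        mask = 0
        for bit in self._bit_positions(block_addr):
            mask |= 1 << bit
        return {core for core, filt in enumerate(row)
                if (filt & mask) == mask}


directory = TaglessDirectorySketch(num_cores=16, num_sets=4)
directory.record_fill(core=3, block_addr=0x1234)
directory.record_fill(core=7, block_addr=0x1234)
print(sorted(directory.possible_sharers(0x1234)))
```

Because no tags are stored, two different blocks that map to the same set and the same filter bits are indistinguishable. That aliasing is the source of the false-positive sharers, and of the modest extra traffic the abstract reports.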
Index Terms
- A tagless coherence directory