ABSTRACT
Conventional directory coherence operates at the finest granularity possible, that of a cache block. While simple, this organization fails to exploit a frequent application behavior: at any given point in time, large, contiguous chunks of memory are often accessed by only a single core.
We take advantage of this behavior and investigate reducing the coherence directory size by tracking coherence at multiple granularities. We show that such a Multi-grain Directory (MGD) can significantly reduce the required number of directory entries across a variety of workloads. Our analysis shows that a simple dual-grain directory (DGD), which tracks individual cache blocks and coarse-grain regions of 1KB to 8KB, obtains the majority of the benefit. We propose a practical DGD design that is transparent to software, requires no changes to the coherence protocol, and incurs no unnecessary bandwidth overhead. This design can reduce the coherence directory size by 41% to 66% with no statistically significant performance loss.
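The entry-count argument can be illustrated with a toy model. The sketch below is not the paper's DGD design: it assumes a 64-byte block and a 1KB region, takes a static snapshot of (core, address) accesses, and ignores evictions, directory capacity, and transitions between grains. The function names and the trace are hypothetical, chosen only to show why one region entry can stand in for many block entries when a region is touched by a single core.

```python
BLOCK_BYTES = 64     # assumed cache block size (not stated in the abstract)
REGION_BYTES = 1024  # assumed coarse grain; the abstract cites 1KB to 8KB


def block_grain_entries(accesses):
    """Entries needed when every distinct block gets its own directory entry."""
    return len({addr // BLOCK_BYTES for _core, addr in accesses})


def dual_grain_entries(accesses):
    """Entries needed in a toy dual-grain scheme.

    A region touched by exactly one core is covered by a single region entry;
    a region touched by several cores falls back to one entry per block.
    """
    cores_by_region = {}
    blocks_by_region = {}
    for core, addr in accesses:
        region = addr // REGION_BYTES
        cores_by_region.setdefault(region, set()).add(core)
        blocks_by_region.setdefault(region, set()).add(addr // BLOCK_BYTES)

    total = 0
    for region, cores in cores_by_region.items():
        if len(cores) == 1:
            total += 1  # private region: one coarse-grain entry
        else:
            total += len(blocks_by_region[region])  # shared: per-block entries
    return total


# Hypothetical snapshot: core 0 privately touches all 16 blocks of region 0,
# while cores 1 and 2 each touch one block of region 1.
trace = [(0, addr) for addr in range(0, REGION_BYTES, BLOCK_BYTES)]
trace += [(1, 1024), (2, 1088)]

print(block_grain_entries(trace))  # 18
print(dual_grain_entries(trace))   # 3
```

In this snapshot the private region collapses from 16 block entries to one region entry, while the shared region still needs its two block entries. Real savings depend on the workload's sharing pattern, which the toy trace does not attempt to model.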
Index Terms
- Multi-grain coherence directories