ABSTRACT
Across a broad range of applications, multicore technology is the most important factor driving today's microprocessor performance improvements. Closely coupled with this is the growing complexity of memory subsystems, whose several cache levels must be exploited efficiently to achieve optimal application performance. Many important implementation details of these memory subsystems are undocumented. We therefore present a set of sophisticated benchmarks that measure latency and bandwidth for accesses to arbitrary locations in the memory subsystem. The benchmarks take the coherency state of cache lines into account, which allows us to analyze the cache coherency protocols and their performance impact. We demonstrate the potential of our approach with an in-depth comparison of ccNUMA multiprocessor systems based on AMD (Shanghai) and Intel (Nehalem-EP) quad-core x86-64 processors, both of which feature integrated memory controllers and coherent point-to-point interconnects. Using our benchmarks, we present fundamental memory performance data and architectural properties of both processors. Our comparison reveals in detail how the microarchitectural differences strongly affect the performance of the memory subsystem.
Index Terms
- Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems