ABSTRACT
The dominant architecture for the next generation of shared-memory multiprocessors is CC-NUMA (cache-coherent non-uniform memory architecture). These machines are attractive as compute servers because they provide transparent access to local and remote memory. However, the access latency to remote memory is 3 to 5 times the latency to local memory. CC-NOW machines bring the benefits of cache coherence to networks of workstations, at the cost of even higher remote access latency. Given the large remote access latencies of these architectures, data locality is potentially the most important performance issue. Using realistic workloads, we study the performance improvements provided by OS-supported dynamic page migration and replication. Analyzing our kernel-based implementation, we provide a detailed breakdown of the costs. We show that sampling of cache misses can be used to reduce cost without compromising performance, and that TLB misses may not be a consistent approximation for cache misses. Finally, our experiments show that dynamic page migration and replication can substantially increase application performance, by as much as 30%, and reduce contention for resources in the NUMA memory system.
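The idea of driving migration and replication decisions from sampled cache misses can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the names (`PageStats`, `record_miss`, `decide`), the sampling rate, and the thresholds are hypothetical and are not taken from the kernel implementation described in the abstract.

```python
import random
from collections import defaultdict

# All constants are illustrative assumptions, not values from the paper.
SAMPLE_RATE = 10   # count roughly 1 in every 10 cache misses
TRIGGER = 8        # sampled remote misses from one node that make a page "hot"
SHARING = 2        # sampled misses from another node that make a page "shared"
WRITE_LIMIT = 1    # sampled writes above which replication is not attempted


class PageStats:
    """Sampled per-page miss counters (hypothetical bookkeeping)."""

    def __init__(self, home):
        self.home = home                 # node whose memory holds the page
        self.misses = defaultdict(int)   # node id -> sampled miss count
        self.writes = 0                  # sampled write misses


def record_miss(stats, node, is_write, rng):
    """Record a miss with probability 1/SAMPLE_RATE; return True if counted."""
    if rng.randrange(SAMPLE_RATE) != 0:
        return False
    stats.misses[node] += 1
    if is_write:
        stats.writes += 1
    return True


def decide(stats):
    """Return 'migrate', 'replicate', or 'none' for one page."""
    remote = {n: c for n, c in stats.misses.items() if n != stats.home}
    if not remote:
        return "none"
    hot_node, hot_count = max(remote.items(), key=lambda kv: kv[1])
    if hot_count < TRIGGER:
        return "none"
    # The page is shared if any node other than the hot one also misses on it.
    shared = any(c >= SHARING for n, c in stats.misses.items() if n != hot_node)
    if not shared:
        return "migrate"      # one dominant remote user: move the page to it
    if stats.writes <= WRITE_LIMIT:
        return "replicate"    # read-mostly and shared: give each user a copy
    return "none"             # write-shared: neither action is likely to help


if __name__ == "__main__":
    rng = random.Random(0)
    page = PageStats(home=0)
    for _ in range(200):
        record_miss(page, node=1, is_write=False, rng=rng)
    print(dict(page.misses), decide(page))
```

Sampling keeps the bookkeeping cost proportional to a fraction of the misses, while the relative ordering of hot pages is largely preserved; the thresholds would need tuning against a real workload.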