Abstract
In a chip-multiprocessor (CMP) system, the DRAM system is shared among cores. In a shared DRAM system, requests from a thread can not only delay requests from other threads by causing bank/bus/row-buffer conflicts but they can also destroy other threads’ DRAM-bank-level parallelism. Requests whose latencies would otherwise have been overlapped could effectively become serialized. As a result both fairness and system throughput degrade, and some threads can starve for long time periods.

This paper proposes a fundamentally new approach to designing a shared DRAM controller that provides quality of service to threads, while also improving system throughput. Our parallelism-aware batch scheduler (PAR-BS) design is based on two key ideas. First, PAR-BS processes DRAM requests in batches to provide fairness and to avoid starvation of requests. Second, to optimize system throughput, PAR-BS employs a parallelism-aware DRAM scheduling policy that aims to process requests from a thread in parallel in the DRAM banks, thereby reducing the memory-related stall-time experienced by the thread. PAR-BS seamlessly incorporates support for system-level thread priorities and can provide different service levels, including purely opportunistic service, to threads with different priorities.

We evaluate the design trade-offs involved in PAR-BS and compare it to four previously proposed DRAM scheduler designs on 4-, 8-, and 16-core systems. Our evaluations show that, averaged over 100 4-core workloads, PAR-BS improves fairness by 1.11X and system throughput by 8.3% compared to the best previous scheduling technique, Stall-Time Fair Memory (STFM) scheduling. Based on simple request prioritization rules, PAR-BS is also simpler to implement than STFM.
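The two key ideas can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how batches are formed or how threads are ordered, so the per-thread, per-bank marking cap (`marking_cap`), the ranking rule in `rank_threads` (lightest load on the busiest bank first), the priority order in `pick_next`, and all names below are assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    thread: int      # issuing thread id
    bank: int        # target DRAM bank
    row: int         # target row within the bank
    arrival: int     # arrival time (smaller = older)
    marked: bool = False  # True once the request belongs to the current batch


def form_batch(queue, marking_cap=5):
    """Batching (assumed rule): mark the oldest requests of each thread in
    each bank, up to marking_cap per (thread, bank) pair."""
    counts = defaultdict(int)
    for req in sorted(queue, key=lambda r: r.arrival):
        key = (req.thread, req.bank)
        if counts[key] < marking_cap:
            req.marked = True
            counts[key] += 1


def rank_threads(queue):
    """Parallelism awareness (assumed rule): rank threads by their marked
    load, so the thread with the fewest requests in its busiest bank comes
    first; ties are broken by total marked requests. Rank 0 is highest."""
    per_bank = defaultdict(lambda: defaultdict(int))
    for req in queue:
        if req.marked:
            per_bank[req.thread][req.bank] += 1

    def load(thread):
        banks = per_bank[thread]
        return (max(banks.values()), sum(banks.values()))

    ordered = sorted(per_bank, key=load)
    return {thread: rank for rank, thread in enumerate(ordered)}


def pick_next(queue, bank, open_row, ranks):
    """Choose the next request for one bank. Assumed priority order:
    marked first, then row-buffer hits, then thread rank, then age."""
    candidates = [r for r in queue if r.bank == bank]
    if not candidates:
        return None
    lowest = len(ranks)  # threads without marked requests rank last
    return min(
        candidates,
        key=lambda r: (
            not r.marked,
            r.row != open_row,
            ranks.get(r.thread, lowest),
            r.arrival,
        ),
    )
```

Because every bank consults the same thread ranking, the requests of a highly ranked thread tend to be serviced in different banks at the same time, which is the overlap the abstract describes. Unmarked requests wait until the marked ones are served, so no thread's request can be bypassed indefinitely by later arrivals.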
Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems
ISCA '08: Proceedings of the 35th Annual International Symposium on Computer Architecture