Abstract
Modern SoCs integrate multiple CPU cores and hardware accelerators (HWAs) that share the same main memory system, causing interference among memory requests from different agents. If this interference is not controlled well, HWAs miss their deadlines and CPU performance suffers. Few previous works have tackled this problem. State-of-the-art mechanisms designed for CPU-GPU systems strive to meet a target frame rate for GPUs by prioritizing the GPU close to the time when it has to complete a frame. We observe two major problems when such an approach is adapted to a heterogeneous CPU-HWA system. First, HWAs miss deadlines because they are prioritized only when close to their deadlines. Second, such an approach does not consider the diverse memory access characteristics of different applications running on CPUs and HWAs, leading to low performance for latency-sensitive CPU applications and deadline misses for some HWAs, including GPUs.
In this article, we propose a Deadline-Aware memory Scheduler for Heterogeneous systems (DASH), which overcomes these problems using three key ideas, with the goal of meeting HWAs’ deadlines while providing high CPU performance. First, DASH prioritizes an HWA whenever it is not on track to meet its deadline at any time during a deadline period, instead of prioritizing it only when it is close to a deadline. Second, DASH prioritizes HWAs over memory-intensive CPU applications, based on the observation that the performance of memory-intensive applications is not sensitive to memory latency. Third, DASH treats short-deadline HWAs differently, as they are more likely to miss their deadlines, and schedules their requests based on worst-case memory access time estimates.
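The three ideas above can be read as a priority classification, sketched below in Python. This is a minimal illustration under stated assumptions, not the scheduler as specified in the article: the `Hwa` record, the proportional progress test in `on_track`, the `worst_case_latency` parameter, and the four-level ordering (in particular, placing latency-sensitive CPU applications between urgent and non-urgent HWAs) are all hypothetical choices made for this example.

```python
from dataclasses import dataclass

# Lower value = higher priority. The relative placement of
# LATENCY_SENSITIVE_CPU is an assumption made for this sketch.
URGENT_HWA, LATENCY_SENSITIVE_CPU, NON_URGENT_HWA, MEMORY_INTENSIVE_CPU = range(4)

@dataclass
class Hwa:
    """Hypothetical per-accelerator bookkeeping for one deadline period."""
    total_requests: int        # requests that must complete within the period
    completed: int             # requests completed so far
    period_cycles: int         # length of the deadline period
    elapsed_cycles: int        # cycles elapsed in the current period
    short_deadline: bool = False

def on_track(h: Hwa) -> bool:
    # Assumed progress test: the completed fraction of work must keep up
    # with the elapsed fraction of the period (cross-multiplied to stay
    # in integer arithmetic).
    return h.completed * h.period_cycles >= h.total_requests * h.elapsed_cycles

def is_urgent(h: Hwa, worst_case_latency: int) -> bool:
    remaining_requests = h.total_requests - h.completed
    if remaining_requests == 0:
        return False
    if h.short_deadline:
        # Idea 3: judge short-deadline HWAs against a worst-case estimate
        # of the time needed to serve the remaining requests.
        remaining_cycles = h.period_cycles - h.elapsed_cycles
        return remaining_cycles <= remaining_requests * worst_case_latency
    # Idea 1: urgent whenever behind schedule, at any point in the period.
    return not on_track(h)

def hwa_priority(h: Hwa, worst_case_latency: int) -> int:
    return URGENT_HWA if is_urgent(h, worst_case_latency) else NON_URGENT_HWA

def cpu_priority(memory_intensive: bool) -> int:
    # Idea 2: memory-intensive CPU applications rank below HWAs.
    return MEMORY_INTENSIVE_CPU if memory_intensive else LATENCY_SENSITIVE_CPU
```

In this sketch, an HWA that falls behind early in its period is promoted immediately rather than only near its deadline, while an HWA that is on track still outranks memory-intensive CPU applications.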
Extensive evaluations across a wide variety of workloads and systems show that DASH achieves significantly better CPU performance than the best previous scheduler while always meeting the deadlines for all HWAs, including GPUs, thereby substantially improving frame rates.
Index Terms
- DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators