ABSTRACT
Chip-multiprocessors (CMPs) must often execute workload mixes with different performance requirements. On one hand, user-facing, latency-critical applications (e.g., web search) need low tail (i.e., worst-case) latencies, often in the millisecond range, and have inherently low utilization. On the other hand, compute-intensive batch applications (e.g., MapReduce) only need high long-term average performance. In current CMPs, latency-critical and batch applications cannot run concurrently due to interference on shared resources. Unfortunately, prior work on quality of service (QoS) in CMPs has focused on guaranteeing average performance, not tail latency.
In this work, we analyze several latency-critical workloads, and show that guaranteeing average performance is insufficient to maintain low tail latency, because microarchitectural resources with state, such as caches or cores, exert inertia on instantaneous workload performance. Last-level caches impart the highest inertia, as workloads take tens of milliseconds to warm them up. When left unmanaged, or when managed with conventional QoS frameworks, shared last-level caches degrade tail latency significantly. Instead, we propose Ubik, a dynamic partitioning technique that predicts and exploits the transient behavior of latency-critical workloads to maintain their tail latency while maximizing the cache space available to batch applications. Using extensive simulations, we show that, while conventional QoS frameworks degrade tail latency by up to 2.3x, Ubik simultaneously maintains the tail latency of latency-critical workloads and significantly improves the performance of batch applications.
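The gap between average performance and tail latency can be illustrated with a toy calculation. The sketch below is not Ubik and is not taken from the paper's evaluation; it assumes a hypothetical latency-critical workload that serves requests in bursts, and all latency values, burst sizes, and function names are made up for illustration. In the "steady" case every request sees the same service time; in the "cold start" case the workload loses its cache contents while idle, so the first requests of each burst run slower until the cache warms up. The numbers are chosen so that both cases have the same mean.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = (len(ordered) * pct + 99) // 100  # ceiling of n * pct / 100
    return ordered[rank - 1]


def burst_latencies_us(burst_len, warm_us, cold_us, cold_requests):
    """Latencies (in microseconds) for one burst of requests.

    The first `cold_requests` requests run on a cold cache and take
    `cold_us`; the rest run warm and take `warm_us`. Hypothetical model.
    """
    return [cold_us if i < cold_requests else warm_us for i in range(burst_len)]


def summarize(latencies):
    """Return (mean, 95th-percentile) latency."""
    return sum(latencies) / len(latencies), percentile(latencies, 95)


# Hypothetical parameters, picked so that both cases share the same mean.
BURSTS = 10
BURST_LEN = 20

# Case 1: cache state is preserved across idle periods.
steady = [1100] * (BURST_LEN * BURSTS)

# Case 2: cache state is lost while idle; 2 of every 20 requests run cold.
cold_start = burst_latencies_us(BURST_LEN, 1000, 2000, 2) * BURSTS

steady_mean, steady_p95 = summarize(steady)
cold_mean, cold_p95 = summarize(cold_start)

print(f"steady:     mean={steady_mean:.0f} us  p95={steady_p95} us")
print(f"cold start: mean={cold_mean:.0f} us  p95={cold_p95} us")
```

Under these assumed numbers, a QoS scheme that only checks the mean cannot tell the two cases apart, while the 95th-percentile latency of the cold-start case is nearly twice that of the steady case.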
Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads
ASPLOS '14