
Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads

Authors:
Harshad Kasture, Massachusetts Institute of Technology, Cambridge, MA, USA
Daniel Sanchez, Massachusetts Institute of Technology, Cambridge, MA, USA

ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, February 2014, Pages 729–742
https://doi.org/10.1145/2541940.2541944

Published: 24 February 2014


ABSTRACT

Chip-multiprocessors (CMPs) must often execute workload mixes with different performance requirements. On one hand, user-facing, latency-critical applications (e.g., web search) need low tail (i.e., worst-case) latencies, often in the millisecond range, and have inherently low utilization. On the other hand, compute-intensive batch applications (e.g., MapReduce) only need high long-term average performance. In current CMPs, latency-critical and batch applications cannot run concurrently due to interference on shared resources. Unfortunately, prior work on quality of service (QoS) in CMPs has focused on guaranteeing average performance, not tail latency.
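To illustrate why a guarantee on average performance does not bound tail latency, the following minimal Python sketch compares two sets of request latencies that have the same mean but very different 99th-percentile latency. The latency values, the helper names (`mean`, `percentile`), and the choice of the 99th percentile are assumptions made for illustration; they are not measurements or definitions taken from the paper.

```python
def mean(samples):
    """Arithmetic mean of a list of latencies."""
    return sum(samples) / len(samples)


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample such that at least p% of all
    samples are less than or equal to it.
    """
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]


# Hypothetical request latencies in microseconds (illustrative only).
isolated = [1000] * 100              # every request takes 1 ms
shared = [900] * 95 + [2900] * 5     # most are faster, a few are much slower

for name, samples in (("isolated", isolated), ("shared", shared)):
    print(name, "mean:", mean(samples), "p99:", percentile(samples, 99))
```

Both hypothetical runs have the same mean latency, so a metric based on averages would treat them as equivalent, yet the 99th-percentile latency of the second run is almost three times higher.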

In this work, we analyze several latency-critical workloads, and show that guaranteeing average performance is insufficient to maintain low tail latency, because microarchitectural resources with state, such as caches or cores, exert inertia on instantaneous workload performance. Last-level caches impart the highest inertia, as workloads take tens of milliseconds to warm them up. When left unmanaged, or when managed with conventional QoS frameworks, shared last-level caches degrade tail latency significantly. Instead, we propose Ubik, a dynamic partitioning technique that predicts and exploits the transient behavior of latency-critical workloads to maintain their tail latency while maximizing the cache space available to batch applications. Using extensive simulations, we show that, while conventional QoS frameworks degrade tail latency by up to 2.3x, Ubik simultaneously maintains the tail latency of latency-critical workloads and significantly improves the performance of batch applications.
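The inertia that cache state exerts on instantaneous performance can be illustrated with a toy model. The sketch below replays the same access stream through an LRU cache twice, once starting cold and once starting warm, and compares hit rates. It is only an illustration of warm-up cost: the LRU policy, the working-set size, the access pattern, and the function name `run_lru` are all assumptions, and the code is neither Ubik's partitioning algorithm nor the paper's simulation methodology.

```python
from collections import OrderedDict


def run_lru(accesses, capacity, cache=None):
    """Replay accesses through an LRU cache.

    Returns (hits, cache), where hits is a list of booleans (one per
    access) and cache is the resulting cache state, which can be passed
    back in to model a request that starts with a warm cache.
    """
    cache = OrderedDict() if cache is None else cache
    hits = []
    for line in accesses:
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
            hits.append(True)
        else:
            hits.append(False)
            cache[line] = None             # fill the line on a miss
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return hits, cache


# Hypothetical request: touches a 64-line working set four times over.
WORKING_SET = 64
request = list(range(WORKING_SET)) * 4

# Cold start: the cache holds none of the request's lines.
cold_hits, warm_cache = run_lru(request, capacity=WORKING_SET)
# Warm start: the same request runs again with its lines still cached.
warm_hits, _ = run_lru(request, capacity=WORKING_SET, cache=warm_cache)

cold_rate = sum(cold_hits) / len(cold_hits)
warm_rate = sum(warm_hits) / len(warm_hits)
print("cold hit rate:", cold_rate)
print("warm hit rate:", warm_rate)
```

In this toy setting the cold request misses on every first touch of its working set, while the warm request hits on every access. A latency-critical request that arrives after its cache space was given to another application would behave like the cold case until its working set is reloaded.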


Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory


Published in

ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
February 2014, 780 pages
ISBN: 9781450323055
DOI: 10.1145/2541940
General Chairs: Rajeev Balasubramonian (University of Utah), Al Davis (University of Utah)
Program Chair: Sarita Adve (University of Illinois at Urbana-Champ)

ACM SIGARCH Computer Architecture News, Volume 42, Issue 1 (ASPLOS '14)
March 2014, 729 pages
ISSN: 0163-5964
DOI: 10.1145/2654822
Editor: Doug DeGroot

ACM SIGPLAN Notices, Volume 49, Issue 4 (ASPLOS '14)
April 2014, 729 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2644865
Editors: Mark W. Bailey (Hamilton College, Clinton, NY), Rajeev Balasubramonian (University of Utah), Al Davis (University of Utah), Sarita Adve (University of Illinois at Urbana-Champ)
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cache partitioning
interference
isolation
multicore
quality of service
resource management
tail latency
Acceptance Rates
ASPLOS '14 Paper Acceptance Rate: 49 of 217 submissions, 23%
Overall Acceptance Rate: 535 of 2,713 submissions, 20%