DOI: 10.1145/2541940.2541944
research-article

Ubik: efficient cache sharing with strict QoS for latency-critical workloads

Published: 24 February 2014

ABSTRACT

Chip-multiprocessors (CMPs) must often execute workload mixes with different performance requirements. On one hand, user-facing, latency-critical applications (e.g., web search) need low tail (i.e., worst-case) latencies, often in the millisecond range, and have inherently low utilization. On the other hand, compute-intensive batch applications (e.g., MapReduce) only need high long-term average performance. In current CMPs, latency-critical and batch applications cannot run concurrently due to interference on shared resources. Unfortunately, prior work on quality of service (QoS) in CMPs has focused on guaranteeing average performance, not tail latency.

In this work, we analyze several latency-critical workloads, and show that guaranteeing average performance is insufficient to maintain low tail latency, because microarchitectural resources with state, such as caches or cores, exert inertia on instantaneous workload performance. Last-level caches impart the highest inertia, as workloads take tens of milliseconds to warm them up. When left unmanaged, or when managed with conventional QoS frameworks, shared last-level caches degrade tail latency significantly. Instead, we propose Ubik, a dynamic partitioning technique that predicts and exploits the transient behavior of latency-critical workloads to maintain their tail latency while maximizing the cache space available to batch applications. Using extensive simulations, we show that, while conventional QoS frameworks degrade tail latency by up to 2.3x, Ubik simultaneously maintains the tail latency of latency-critical workloads and significantly improves the performance of batch applications.
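
The claim that average performance does not bound tail latency can be illustrated with a small numeric sketch. The latency values below are hypothetical and are not taken from the paper, and the helper names `percentile` and `summarize` are introduced here for illustration only. The sketch assumes a nearest-rank 95th percentile as the tail metric and compares two request traces with identical mean latency.

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(latencies_us):
    """Return (mean, 95th-percentile) latency for a list of samples in microseconds."""
    mean = sum(latencies_us) / len(latencies_us)
    return mean, percentile(latencies_us, 95)


# Hypothetical traces of 100 requests each (microseconds); illustrative values only.
# 'warm': every request finds its working set already in the cache.
warm = [1000] * 100
# 'mixed': most requests are slightly faster, but 10 of them start with a cold
# cache and pay a warm-up penalty.
mixed = [900] * 90 + [1900] * 10

if __name__ == "__main__":
    for name, trace in (("warm", warm), ("mixed", mixed)):
        mean, p95 = summarize(trace)
        print(f"{name}: mean = {mean:.0f} us, p95 = {p95} us")
```

Both traces have the same mean, so a QoS scheme that only tracks average performance would treat them as equivalent, yet the 95th-percentile latency of the second trace is nearly twice that of the first.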

Published in

ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
February 2014
780 pages
ISBN: 9781450323055
DOI: 10.1145/2541940

Copyright © 2014 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

ASPLOS '14 paper acceptance rate: 49 of 217 submissions, 23%
Overall acceptance rate: 535 of 2,713 submissions, 20%
