DOI: 10.1145/3545008.3545064
research-article

Themis: Fair Memory Subsystem Resource Sharing with Differentiated QoS in Public Clouds

Published: 13 January 2023

ABSTRACT

To reduce the increasing cost of building and operating cloud data centers, cloud providers are seeking mechanisms that achieve higher resource effectiveness. For example, cloud operators use dynamic resource management techniques to consolidate a higher density of application workloads onto commodity physical servers and thereby maximize server resource utilization. However, higher workload density is a major source of performance interference in multi-tenant clouds. Existing performance isolation techniques, such as dedicating CPU cores to specific workloads, are not enough, because some processor resources (e.g., the last-level cache and memory bandwidth in the memory subsystem) are still shared among all CPUs on the same NUMA node. While prior work has proposed a variety of resource partitioning techniques, two questions remain unexplored: how memory subsystem resource partitioning affects consolidated workloads with different priorities, and what software support is needed to manage memory subsystem resource sharing dynamically and in real time. To bridge this gap, we propose Themis, a feedback-based controller that enables a priority-aware and fairness-aware memory subsystem resource management strategy, guaranteeing the performance of high-priority workloads while maintaining fairness across all colocated workloads in high-density clouds. Themis is evaluated with multiple typical cloud applications in our data center environment. The results show that, compared to existing state-of-the-art work, Themis improves the performance of various workloads by up to 3.15% and fairness in memory subsystem resource allocation by more than 70%.
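The abstract describes Themis only at a high level, as a feedback-based controller that is both priority-aware and fairness-aware. As a rough illustration of that general idea, and not the paper's actual algorithm, the following minimal Python sketch performs one feedback step over a set of colocated workloads. Everything in it is an assumption made for illustration: the `Workload` fields, the `rebalance` function, the use of last-level cache "ways" as the unit of allocation, and the `slo_slowdown` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # All fields are hypothetical and exist only for this illustration.
    name: str
    high_priority: bool
    slowdown: float  # measured slowdown vs. running alone (1.0 = no interference)
    ways: int        # last-level cache ways currently assigned


def rebalance(workloads, slo_slowdown=1.10, min_ways=1):
    """Run one feedback step that moves a single cache way between workloads.

    Illustrative policy (an assumption, not the policy from the paper):
    1. Priority: if a high-priority workload exceeds slo_slowdown, the worst
       such workload receives a way from a low-priority workload.
    2. Fairness: otherwise the most slowed-down workload receives a way from
       a less slowed-down one, which pulls the slowdowns toward each other.
    Returns (donor_name, receiver_name), or None if no move is possible.
    """
    if not workloads:
        return None
    violators = [w for w in workloads
                 if w.high_priority and w.slowdown > slo_slowdown]
    if violators:
        receiver = max(violators, key=lambda w: w.slowdown)
        donors = [w for w in workloads
                  if not w.high_priority and w.ways > min_ways]
    else:
        receiver = max(workloads, key=lambda w: w.slowdown)
        donors = [w for w in workloads
                  if w is not receiver
                  and w.ways > min_ways
                  and w.slowdown < receiver.slowdown]
    if not donors:
        return None
    # Take the way from the workload that is currently suffering the least.
    donor = min(donors, key=lambda w: w.slowdown)
    donor.ways -= 1
    receiver.ways += 1
    return donor.name, receiver.name
```

A real controller would call such a step periodically with fresh measurements and would also need to manage memory bandwidth; how the resulting allocation is applied to the hardware is outside the scope of this sketch.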


          Published in

          ICPP '22: Proceedings of the 51st International Conference on Parallel Processing
          August 2022
          976 pages
          ISBN: 9781450397339
          DOI: 10.1145/3545008

          Copyright © 2022 ACM


          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Qualifiers

          • research-article
          • Research
          • Refereed limited

          Acceptance Rates

          Overall Acceptance Rate: 91 of 313 submissions, 29%