DOI: 10.1145/3326285.3329074 · research-article

Who limits the resource efficiency of my datacenter: an analysis of Alibaba datacenter traces

Published: 24 June 2019

ABSTRACT

Cloud platforms provide great flexibility and cost-efficiency for end-users and cloud operators. However, low resource utilization in modern datacenters wastes hardware resources and infrastructure investment. A straightforward way to improve resource utilization is to co-locate different workloads on the same hardware. To assess resource efficiency and understand the key characteristics of workloads in a co-located cluster, we analyze an 8-day trace from Alibaba's production cluster. We reveal three key findings. First, memory has become the new bottleneck and limits resource efficiency in Alibaba's datacenter. Second, to protect latency-critical applications, batch-processing applications are treated as second-class citizens and restricted to limited resources. Third, more than 90% of latency-critical applications are written in Java. Massive numbers of self-contained JVMs further complicate resource management and limit resource efficiency in datacenters.
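The memory-bottleneck finding comes from comparing aggregate CPU and memory utilization across the trace's machine-usage records. A minimal sketch of that comparison is shown below; the four-row sample is synthetic (not taken from the real trace), and the column layout (machine id, timestamp, CPU %, memory %) is only an assumption loosely modeled on the Alibaba machine_usage table.

```python
import csv
import io
import statistics

# Synthetic stand-in for a machine-usage excerpt; the real Alibaba trace
# is much larger and has additional columns. Columns assumed here:
# machine_id, timestamp, cpu_util_percent, mem_util_percent.
SAMPLE = """m_1,10,30,85
m_1,20,42,88
m_2,10,25,90
m_2,20,35,92
"""

def mean_utilization(csv_text):
    """Return (mean CPU %, mean memory %) over all usage records."""
    cpu, mem = [], []
    for _machine, _ts, c, m in csv.reader(io.StringIO(csv_text)):
        cpu.append(float(c))
        mem.append(float(m))
    return statistics.mean(cpu), statistics.mean(mem)

cpu_avg, mem_avg = mean_utilization(SAMPLE)
print(f"cpu={cpu_avg:.1f}% mem={mem_avg:.1f}%")  # prints "cpu=33.0% mem=88.8%"
```

A gap like the one in this synthetic sample (moderate CPU use against near-saturated memory) is the shape of evidence behind the claim that memory, not CPU, limits further co-location.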


Published in:

IWQoS '19: Proceedings of the International Symposium on Quality of Service
June 2019, 420 pages
ISBN: 9781450367783
DOI: 10.1145/3326285
Copyright © 2019 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
