ABSTRACT
Cloud platforms provide great flexibility and cost-efficiency for both end-users and cloud operators. However, low resource utilization in modern datacenters wastes hardware resources and infrastructure investment. A straightforward way to improve utilization is to co-locate different workloads on the same hardware. To assess resource efficiency and understand the key characteristics of workloads in a co-located cluster, we analyze an 8-day trace from Alibaba's production cluster. We reveal three key findings. First, memory has become the new bottleneck and limits resource efficiency in Alibaba's datacenter. Second, to protect latency-critical applications, batch-processing applications are treated as second-class citizens and are restricted to limited resources. Third, more than 90% of latency-critical applications are written in Java. Massive numbers of self-contained JVMs further complicate resource management and limit resource efficiency in the datacenter.
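To make the machine-level analysis concrete, the sketch below shows one way to compare per-machine CPU and memory utilization in the publicly released Alibaba cluster trace (v2018). This is a minimal sketch, not the authors' analysis pipeline: the file name machine_usage.csv and the column order (machine_id, time_stamp, cpu_util_percent, mem_util_percent) are assumptions based on the published trace schema.

```python
# Minimal sketch: compare machine-level CPU vs. memory utilization in the
# Alibaba cluster trace (v2018). The file name and column positions below are
# assumptions based on the published trace schema, not details from the paper.
import pandas as pd

def load_usage(path="machine_usage.csv"):
    # The raw trace ships without a header row; rename only the columns we use
    # and leave any remaining columns untouched.
    usage = pd.read_csv(path, header=None)
    return usage.rename(columns={0: "machine_id",
                                 1: "time_stamp",
                                 2: "cpu_util_percent",
                                 3: "mem_util_percent"})

def utilization_summary(usage: pd.DataFrame) -> pd.DataFrame:
    """Average CPU and memory utilization per machine over the whole trace."""
    per_machine = (usage
                   .groupby("machine_id")[["cpu_util_percent",
                                           "mem_util_percent"]]
                   .mean())
    # Summary statistics over machines; if memory is the bottleneck, its
    # distribution sits well above the CPU distribution.
    return per_machine.describe(percentiles=[0.5, 0.9, 0.99])

if __name__ == "__main__":
    print(utilization_summary(load_usage()))
```

Restricting the same aggregation to peak hours rather than the full eight days would show whether the memory pressure persists under the highest load.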