ABSTRACT
Cloud platforms provide great flexibility and cost-efficiency for both end-users and cloud operators. However, low resource utilization in modern datacenters wastes hardware resources and infrastructure investment. A straightforward way to improve utilization is to co-locate different workloads on the same hardware. To assess resource efficiency and understand the key characteristics of workloads in a co-located cluster, we analyze an 8-day trace from Alibaba's production cluster. We reveal three key findings. First, memory has become the new bottleneck and limits resource efficiency in Alibaba's datacenter. Second, to protect latency-critical applications, batch-processing applications are treated as second-class citizens and are restricted to limited resources. Third, more than 90% of latency-critical applications are written in Java. Massive numbers of self-contained JVMs further complicate resource management and limit resource efficiency in the datacenter.
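To make the machine-level analysis concrete, the sketch below shows one way to compare per-machine CPU and memory utilization in the publicly released Alibaba cluster trace (v2018). This is a minimal sketch, not the authors' analysis pipeline: the file name machine_usage.csv and the column order (machine_id, time_stamp, cpu_util_percent, mem_util_percent) are assumptions based on the published trace schema.

```python
# Minimal sketch: compare machine-level CPU vs. memory utilization in the
# Alibaba cluster trace (v2018). The file name and column positions below are
# assumptions based on the published trace schema, not details from the paper.
import pandas as pd

def load_usage(path="machine_usage.csv"):
    # The raw trace ships without a header row; rename only the columns we use
    # and leave any remaining columns untouched.
    usage = pd.read_csv(path, header=None)
    return usage.rename(columns={0: "machine_id",
                                 1: "time_stamp",
                                 2: "cpu_util_percent",
                                 3: "mem_util_percent"})

def utilization_summary(usage: pd.DataFrame) -> pd.DataFrame:
    """Average CPU and memory utilization per machine over the whole trace."""
    per_machine = (usage
                   .groupby("machine_id")[["cpu_util_percent",
                                           "mem_util_percent"]]
                   .mean())
    # Summary statistics over machines; if memory is the bottleneck, its
    # distribution sits well above the CPU distribution.
    return per_machine.describe(percentiles=[0.5, 0.9, 0.99])

if __name__ == "__main__":
    print(utilization_summary(load_usage()))
```

Restricting the same aggregation to peak hours rather than the full eight days would show whether the memory pressure persists under the highest load.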