Abstract
This article advocates for the leveraging of machine learning to develop a workload manager that will improve the efficiency of modern data centres. The proposals stem from an existing tool that allows training deep reinforcement agents for this purpose. However, it incorporates several major improvements. It confers the ability to model heterogeneous data centres and then it proposes a novel learning agent that can not only choose the most adequate job for scheduling, but also determines the best compute resources for its execution. The evaluation experiments compare the performance of this learning agent against well known heuristic algorithms, revealing that the former is capable of improving the scheduling.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Bosque, J.L., Perez, L.P.: Theoretical scalability analysis for heterogeneous clusters. In: 4th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2004), Chicago, USA, pp. 285–292. IEEE Computer Society (2004)
Carastan-Santos, D., De Camargo, R.Y.: Obtaining dynamic scheduling policies with simulation and machine learning. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–13 (2017)
Feitelson, D.G., Tsafrir, D., Krakov, D.: Experience with using the parallel workloads archive. J. Parallel Distrib. Comput. 74(10), 2967–2982 (2014)
García-Saiz, D., Zorrilla, M.E., Bosque, J.L.: A clustering-based knowledge discovery process for data Centre infrastructure management. J. Supercomput. 73(1), 215–226 (2017)
Hartigan, J.A., Wong, M.A.: Algorithm AS 136: a K-means clustering algorithm. J. Roy. Stat. Soc. ser. C 28(1), 100–108 (1979)
Herrera, A., Ibáñez, M., Stafford, E., Bosque, J.: A simulator for intelligent workload managers in heterogeneous clusters. In: 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pp. 196–205 (2021)
Leonenkov, S., Zhumatiy, S.: Introducing new backfill-based scheduler for SLURM resource manager. In: Procedia Computer Science, 4th International Young Scientist Conference on Computational Science, vol. 66, pp. 661–669 (2015)
Lublin, U., Feitelson, D.G.: The workload on parallel supercomputers: modeling the characteristics of rigid jobs. J. Parallel Distrib. Comput. 63(11), 1105–1122 (2003)
Mao, H., Alizadeh, M., Menache, I., Kandula, S.: Resource management with deep reinforcement learning. In: Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pp. 50–56 (2016)
Mao, H., Schwarzkopf, M., Venkatakrishnan, S.B., Meng, Z., Alizadeh, M.: Learning scheduling algorithms for data processing clusters. In: Proceedings of the ACM Special Interest Group on Data Communication, p. 270–288. SIGCOMM 2019 (2019)
Pearl, J.: Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley Longman Publishing Co., Inc, Boston (1984)
Pinedo, M.: Scheduling, vol. 29. Springer, Berlin (2012)
Stafford, E., Bosque, J.L.: Improving utilization of heterogeneous clusters. J. Supercomput. 76(11), 8787–8800 (2020). https://doi.org/10.1007/s11227-020-03175-4
Stafford, E., Bosque, J.L.: Performance and energy task migration model for heterogeneous clusters. J. Supercomput. 77(9), 10053–10064 (2021). https://doi.org/10.1007/s11227-021-03663-1
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT press, Cambridge (2018)
Tang, W., Lan, Z., Desai, N., Buettner, D.: Fault-aware, utility-based job scheduling on blue, gene/p systems. In: IEEE International Conference on Cluster Computing and Workshops, pp. 1–10 (2009)
Vazirani, V.V.: Approximation Algorithms. Springer Science & Business Media, Berlin (2013)
Yoo, A.B., Jette, M.A., Grondona, M.: SLURM: simple Linux utility for resource management. In: Feitelson, D., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2003. LNCS, vol. 2862, pp. 44–60. Springer, Heidelberg (2003). https://doi.org/10.1007/10968987_3
Zhang, D., Dai, D., He, Y., Bao, F.S., Xie, B.: RLScheduler: an automated HPC batch job scheduler using reinforcement learning. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–15. IEEE (2020)
Acknowledgment
This work has been supported by the Spanish Science and Technology Commission under contract PID2019-105660RB-C22 and the European HiPEAC Network of Excellence.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fomperosa, J., Ibañez, M., Stafford, E., Bosque, J.L. (2023). Task Scheduler for Heterogeneous Data Centres Based on Deep Reinforcement Learning. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2022. Lecture Notes in Computer Science, vol 13826. Springer, Cham. https://doi.org/10.1007/978-3-031-30442-2_18
Download citation
DOI: https://doi.org/10.1007/978-3-031-30442-2_18
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30441-5
Online ISBN: 978-3-031-30442-2
eBook Packages: Computer ScienceComputer Science (R0)