Task Scheduler for Heterogeneous Data Centres Based on Deep Reinforcement Learning

Fomperosa, Jaime; Ibañez, Mario; Stafford, Esteban; Bosque, Jose Luis

doi:10.1007/978-3-031-30442-2_18

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13826)

Included in the following conference series:

International Conference on Parallel Processing and Applied Mathematics

Abstract

This article advocates leveraging machine learning to develop a workload manager that improves the efficiency of modern data centres. The proposal builds on an existing tool for training deep reinforcement learning agents for this purpose, and adds several major improvements. First, it makes it possible to model heterogeneous data centres. Second, it introduces a novel learning agent that not only chooses the most suitable job to schedule, but also determines the best compute resources for its execution. The evaluation experiments compare this learning agent against well-known heuristic algorithms and show that the agent is capable of improving the scheduling.
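
As a rough illustration of what choosing both a job and its compute resources involves, the following Python sketch enumerates the feasible (job, node) pairs in a small heterogeneous cluster and applies a simple heuristic of the kind a learning agent might be compared against. It is not the authors' agent or simulator: the `Job` and `Node` classes, their fields, and the shortest-estimated-runtime rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Job:
    name: str
    cores: int    # cores requested by the job
    work: float   # abstract work units (hypothetical measure)


@dataclass(frozen=True)
class Node:
    name: str
    free_cores: int
    speed: float  # work units processed per second (hypothetical)


def feasible_actions(jobs, nodes):
    """Return every (job, node) pair where the node has enough free cores."""
    return [(j, n) for j, n in product(jobs, nodes) if j.cores <= n.free_cores]


def shortest_job_fastest_node(jobs, nodes):
    """Heuristic baseline (assumed): pick the pair with the smallest estimated runtime."""
    actions = feasible_actions(jobs, nodes)
    if not actions:
        return None  # nothing can be scheduled right now
    return min(actions, key=lambda a: a[0].work / a[1].speed)


# Toy queue and a two-node heterogeneous cluster (made-up values).
jobs = [Job("a", 4, 100.0), Job("b", 2, 30.0), Job("c", 16, 10.0)]
nodes = [Node("cpu-slow", 8, 1.0), Node("cpu-fast", 4, 2.0)]

choice = shortest_job_fastest_node(jobs, nodes)
if choice is not None:
    job, node = choice
    print(f"schedule job {job.name} on node {node.name}")
```

In this framing, a learning agent would replace the fixed `min` rule with a learned policy that scores the same set of (job, node) pairs.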

Acknowledgment

This work has been supported by the Spanish Science and Technology Commission under contract PID2019-105660RB-C22 and the European HiPEAC Network of Excellence.

Author information

Authors and Affiliations

Dpto. de Ingeniería Informática y Electrónica, Universidad de Cantabria, Santander, Spain
Jaime Fomperosa, Mario Ibañez, Esteban Stafford & Jose Luis Bosque

Corresponding author

Correspondence to Esteban Stafford.

Editor information

Editors and Affiliations

Czestochowa University of Technology, Czestochowa, Poland
Roman Wyrzykowski
University of Tennessee, Knoxville, TN, USA
Jack Dongarra
University of Southern California, Marina del Rey, CA, USA
Ewa Deelman
Czestochowa University of Technology, Czestochowa, Poland
Konrad Karczewski

Cite this paper

Fomperosa, J., Ibañez, M., Stafford, E., Bosque, J.L. (2023). Task Scheduler for Heterogeneous Data Centres Based on Deep Reinforcement Learning. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2022. Lecture Notes in Computer Science, vol 13826. Springer, Cham. https://doi.org/10.1007/978-3-031-30442-2_18

DOI: https://doi.org/10.1007/978-3-031-30442-2_18
Published: 28 April 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30441-5
Online ISBN: 978-3-031-30442-2
eBook Packages: Computer Science, Computer Science (R0)
