Abstract
Computing machines and communication links in heterogeneous distributed computing systems (HDCSs) may fail permanently with nonzero probability, and the results of applications running on these systems (e.g., large-scale parallel image processing and neuroimaging) are expected to deteriorate over time. The reliability and performance of applications on HDCSs therefore remain an important open issue, especially when parallel applications are scheduled on graphics processing unit architectures. It is urgent to tackle the problem of maximizing performance and reliability while accounting for communication and machine failures. This work presents a rigorous probabilistic theory that analytically characterizes the performance and reliability of task scheduling in the presence of processor and communication failures. An optimal communication path search algorithm that considers reliability overhead and a reliability-driven lookahead scheduling algorithm for precedence-constrained tasks are developed. The theoretical model and experimental data, which are based on randomly generated emulation applications represented by directed acyclic graphs, reveal that the proposed algorithms significantly outperform existing scheduling algorithms in terms of expected makespan, reliability, and schedule length ratio. Weaknesses of the algorithms related to the input parameters are also observed.
Acknowledgements
We thank the five reviewers for constructive feedback that has significantly improved the paper. This work was supported by the Education Department of Jiangxi Province (Project no. GJJ170416) and the Natural Science Foundation of Jiangxi Province (Project no. 20171BBH80005).
Cite this article
Wang, H., Wang, Y. Maximizing reliability and performance with reliability-driven task scheduling in heterogeneous distributed computing systems. J Ambient Intell Human Comput (2018). https://doi.org/10.1007/s12652-018-0926-9
DOI: https://doi.org/10.1007/s12652-018-0926-9