
Maximizing reliability and performance with reliability-driven task scheduling in heterogeneous distributed computing systems

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Computing machines and communication links in heterogeneous distributed computing systems (HDCSs) may fail permanently with nonzero probability, so the results of applications running on these systems (e.g., large-scale parallel image processing and neuroimaging) can be expected to deteriorate over time. The reliability and performance of applications on HDCSs therefore remain an important open issue, especially when parallel applications are scheduled on graphics processing unit architectures, and the problem of maximizing performance and reliability under communication and machine failures needs to be addressed. This work presents a rigorous probabilistic theory that analytically characterizes the performance and reliability of a task schedule in the presence of processor and communication failures. Two algorithms are developed: an optimal communication path search algorithm that accounts for reliability overhead, and a reliability-driven lookahead scheduling algorithm for precedence-constrained tasks. The theoretical model and the experimental data, which are based on randomly generated emulation applications represented by directed acyclic graphs, show that the proposed algorithms significantly outperform existing scheduling algorithms in terms of expected makespan, reliability, and schedule length ratio. Weaknesses of the algorithms with respect to the input parameters are also reported.
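
The abstract does not give the details of the communication path search, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes each link fails at a constant rate (an exponential failure model), so a link that is busy for time t with failure rate λ survives with probability exp(−λt). Under that assumption, the most reliable path is the one that minimizes the sum of λt over its links, which a standard Dijkstra search can find. The function name `most_reliable_path`, the input layout, and the sample machine names are hypothetical.

```python
import heapq
import math


def most_reliable_path(links, source, target):
    """Return (path, reliability) for the most reliable route, or (None, 0.0).

    links maps a node to a list of (neighbor, failure_rate, transfer_time).
    Assumption (not from the paper): a link survives a transfer with
    probability exp(-failure_rate * transfer_time).
    """
    # Maximizing a product of exp(-rate * time) terms is the same as
    # minimizing the sum of rate * time, so plain Dijkstra applies.
    cost_to = {source: 0.0}
    previous = {}
    heap = [(0.0, source)]
    settled = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in settled:
            continue
        settled.add(node)
        if node == target:
            break
        for neighbor, rate, time in links.get(node, []):
            new_cost = cost + rate * time
            if new_cost < cost_to.get(neighbor, math.inf):
                cost_to[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    if target not in cost_to:
        return None, 0.0
    # Walk the predecessor chain back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return path, math.exp(-cost_to[target])


# Hypothetical three-machine network: the direct link m1 -> m2 is fast to
# describe but failure-prone, while the detour through m3 is more reliable.
sample_links = {
    "m1": [("m2", 0.01, 2.0), ("m3", 0.001, 1.0)],
    "m3": [("m2", 0.002, 1.5)],
}
path, reliability = most_reliable_path(sample_links, "m1", "m2")
```

In this sample the detour through m3 is selected because its accumulated λt (0.004) is lower than that of the direct link (0.02). A scheduler that also cares about makespan would have to weigh this reliability gain against the extra transfer time, which is the trade-off the abstract refers to.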

Acknowledgements

We thank the five reviewers for constructive feedback that has significantly improved the paper. This work was supported by the Education Department of Jiangxi Province (Project no. GJJ170416) and Natural Science Foundation of Jiangxi Province (Project no. 20171BBH80005).

Author information

Corresponding author

Correspondence to Hui Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, H., Wang, Y. Maximizing reliability and performance with reliability-driven task scheduling in heterogeneous distributed computing systems. J Ambient Intell Human Comput (2018). https://doi.org/10.1007/s12652-018-0926-9

  • DOI: https://doi.org/10.1007/s12652-018-0926-9
