A novel fault-tolerant scheduling algorithm for precedence constrained tasks in real-time heterogeneous systems
Introduction
Heterogeneous systems have been increasingly used for scientific and commercial applications, including real-time safety-critical applications, in which the correctness of the system depends not only on the results of a computation, but also on the time instants at which these results become available. Examples of such applications include aircraft control systems, transportation systems and medical electronics. Scheduling algorithms play an important role in obtaining high performance in real-time heterogeneous systems. While a scheduling algorithm maps real-time tasks to processors in a system such that deadlines and response time requirements are met [29], the system must also guarantee its functional and timing correctness even in the presence of hardware and software faults, especially when applications are safety-critical. To address this important issue and to improve on some existing solutions in the literature, this study investigates a scheduling algorithm with which real-time tasks with precedence constraints can be statically scheduled to tolerate the failure of one processor in a heterogeneous system.
In this paper we comprehensively address the issues of fault-tolerance, reliability, real-time constraints, task precedence constraints, and heterogeneity. We propose an algorithm, referred to as eFRD (efficient fault-tolerant reliability-driven algorithm), that can tolerate the failure of one processor in a heterogeneous system with a fully connected network. Failures considered in our study are of the fail-silent type, and they are detected after a fixed amount of time. To tolerate the permanent failure of any one processor, the algorithm uses a primary/backup technique [9], [10], [11], [17], [21] to allocate two copies of each task to different processors. Thus, the backup copy of a task executes if its primary copy fails because its assigned processor has failed. To improve the quality of schedules, backup copies are allowed to overlap with other backup copies on the same processor, as long as their corresponding primary copies are allocated to different processors [9], [21]. As an added measure of fault-tolerance, the proposed algorithm also takes the reliability of processors into account: tasks are judiciously allocated to processors not only to reduce schedule lengths, but also to improve reliability. In addition, the time for detecting and handling a permanent fault is incorporated into the scheduling scheme, which makes the algorithm more practical. Computational, communication and reliability heterogeneities are also taken into account in the algorithm, as explained in detail in later sections. Various algorithms studied in [1]–[11], [13]–[29] share one or two features with eFRD, in terms of the assumed operational conditions, as explained in Section 2.
However, eFRD is arguably the most comprehensive, in terms of the number of different scheduling issues addressed, and outperforms several quantitatively comparable algorithms in the literature. More specifically, extensive simulation studies carried out by the authors showed that the proposed algorithm significantly outperforms all three relevant algorithms found in the literature, namely, FRCD [24], the one in [10], [11], which we call FGLS (fault-tolerant greedy list scheduling), and the one in [21], called OV by the original authors of that paper.
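The basic overlap condition for backup copies described above can be illustrated with a small sketch. It is not the eFRD implementation: the `Copy` record, its fields, and the function `backup_placement_ok` are hypothetical names introduced here, and the check covers only the rule stated in the Introduction (two backups may share time on a processor only if their primaries are on different processors). Any further conditions the full algorithm imposes are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One scheduled copy (primary or backup) of a task. Hypothetical record."""
    task: str
    processor: int
    start: float
    finish: float


def intervals_overlap(a: Copy, b: Copy) -> bool:
    """True if the two execution intervals share any time."""
    return a.start < b.finish and b.start < a.finish


def backup_placement_ok(primary_a: Copy, backup_a: Copy,
                        primary_b: Copy, backup_b: Copy) -> bool:
    """Check the basic primary/backup rules for two tasks (illustrative only)."""
    # The two copies of a task must be on different processors.
    if primary_a.processor == backup_a.processor:
        return False
    if primary_b.processor == backup_b.processor:
        return False
    # Backups on different processors, or disjoint in time, never conflict.
    if backup_a.processor != backup_b.processor:
        return True
    if not intervals_overlap(backup_a, backup_b):
        return True
    # Overlapping backups on one processor are acceptable only if the
    # primaries are on different processors: with a single processor
    # failure, at most one of the two backups will have to run.
    return primary_a.processor != primary_b.processor


pa = Copy("v1", processor=0, start=0, finish=4)
ba = Copy("v1", processor=2, start=5, finish=9)
pb = Copy("v2", processor=1, start=0, finish=3)
bb = Copy("v2", processor=2, start=6, finish=8)
print(backup_placement_ok(pa, ba, pb, bb))
```

In this example the two backups overlap in time on processor 2, which is allowed because the primaries sit on processors 0 and 1. If both primaries were on processor 0, a failure of that processor would activate both backups at once, so the check would reject the placement.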
In the section that follows, related work in the literature is briefly reviewed to present a background for the proposed algorithm and to contrast eFRD with other algorithms to show its relevance, similarity, and uniqueness. The rest of the paper is organized as follows. Section 3 presents the system characteristics and quantitatively analyzes the reliability of a heterogeneous system. Section 4 describes the eFRD algorithm and the main principles behind it, including theorems used for presenting the algorithm. Performance evaluation is given in Section 5 where three main measures of performance, namely, schedulability, reliability, and performability are described and used for performance assessment of eFRD in comparison with three relevant and quantitatively comparable algorithms. Finally, Section 6 concludes the paper by summarizing the main contributions of this paper and by commenting on future directions for this work.
Section snippets
Related work
Fault-tolerance must be considered in the design of scheduling algorithms, because occurrences of faults are often unpredictable in computer systems [15], [18]. Ahn et al. studied a delayed scheduling algorithm using a passive replica method [2]. Liberato et al. proposed a necessary and sufficient feasibility-check algorithm for fault-tolerant scheduling [16]. Bertossi et al. extended the well-known rate-monotonic first-fit assignment algorithm. In their new algorithm, all task copies were
System model
In parallel and distributed systems, real-time jobs with dependent tasks can be modeled by directed acyclic graphs (DAGs). In this paper, a DAG is defined as T = {V, E}, where V = {v1, v2, … , vn} represents a set of real-time tasks that are assumed to be non-preemptable, and a set of weighted and directed edges E represents communication among tasks. An edge (vi, vj) ∈ E indicates a message transmitted from task vi to vj.
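As an illustration of this task model, the following sketch represents T = {V, E} with a list of task names and a dictionary of weighted edges, and derives an order in which every task appears after all of its predecessors. The representation, the example weights, and the function `topological_order` are assumptions made for this sketch; the ordering uses Kahn's algorithm, a standard technique that is not claimed to be part of eFRD.

```python
from collections import deque


def topological_order(tasks, edges):
    """Return the tasks in an order that respects all precedence edges.

    tasks: list of task names (the set V).
    edges: dict mapping (vi, vj) to the message weight from vi to vj (the set E).
    Raises ValueError if the graph contains a cycle.
    """
    indegree = {v: 0 for v in tasks}
    successors = {v: [] for v in tasks}
    for (vi, vj) in edges:
        successors[vi].append(vj)
        indegree[vj] += 1

    # Start with the tasks that have no predecessors.
    ready = deque(v for v in tasks if indegree[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)

    if len(order) != len(tasks):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order


# Hypothetical four-task job; the edge weights are made-up message sizes.
tasks = ["v1", "v2", "v3", "v4"]
edges = {("v1", "v2"): 5, ("v1", "v3"): 2, ("v2", "v4"): 1, ("v3", "v4"): 4}
order = topological_order(tasks, edges)
print(order)
```

A static scheduler can walk such an order to place each task only after the tasks it depends on have been placed.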
When one processor in a system fails, it takes a certain amount of time, denoted δ, to detect and handle
Scheduling algorithms
In this section, we present eFRD, an efficient fault-tolerant, reliability-cost driven scheduling algorithm for real-time tasks with precedence constraints in a heterogeneous system.
This algorithm schedules real-time jobs with dependent tasks at compile time, by allocating primary and backup copies of tasks to processors in such a way that: (1) total schedule length is reduced so that more tasks can complete before their deadlines; (2) permanent failures in one processor can be tolerated; and
Performance evaluation
In this section, we use extensive simulations to compare the performance of the proposed algorithm with three existing real-time fault-tolerant scheduling algorithms in the literature, namely, OV [21], FGLS [10], [11], and FRCD [24]. For the purpose of comparison, we also simulated a non-fault-tolerant real-time scheduling algorithm (referred to as NFT hereafter) that is unable to tolerate any failure. In this study, we considered a real-world application in addition to synthetic workloads.
Three
Conclusion
In this paper we presented an efficient fault-tolerant scheduling algorithm (eFRD), in which real-time tasks with precedence constraints can tolerate the failure of one processor in a heterogeneous system with a fully connected network. The fault-tolerant capability is incorporated in the algorithm by using a primary/backup (PB) model, where failures are detected after a fixed amount of time. In this PB model, each task is associated with a primary copy and a backup copy that are allocated to two
Acknowledgments
This is a substantially revised and improved version of a preliminary paper [23] that appeared in the Proceedings of the International Conference on Parallel Processing (ICPP2002), pages 360–368, August 2002. The revisions include a detailed reliability analysis, an improved overlapping scheme, consideration of more workload and system parameters, and performance evaluation with a real application. This work was partially supported by NSF under Grant EPS-0091900, New Mexico Institute of Mining and
References (41)
- et al., Combined task and message scheduling in real-time systems, IEEE Transactions on Parallel and Distributed Systems (1999)
- K. Ahn, J. Kim, S. Hong, Fault-tolerant real-time scheduling using passive replicas, in: Proc. Pacific Rim Int....
- R. Al-Omari, A.K. Somani, G. Manimaran, A new fault-tolerant technique for improving the schedulability in...
- N.M. Amato, P. An, Task scheduling and parallel mesh-sweeps in transport computations, Technical Report TR00-009,...
- et al., Fault-tolerant rate-monotonic first-fit scheduling in hard-real-time systems, IEEE Transactions on Parallel and Distributed Systems (1999)
- M. Caccamo, G. Buttazzo, Optimal scheduling for fault-tolerant and firm real-time systems, in: Proc. Int. Conf. on...
- C. Dima, A. Girault, C. Lavarenne, Y. Sorel, Off-line real-time fault-tolerant scheduling, in: Proc. Euromicro Workshop...
- A. Dogan, F. Ozguner, Reliable matching and scheduling of precedence-constrained tasks in heterogeneous computing, in:...
- et al., Fault-tolerance through scheduling of aperiodic tasks in hard real-time multiprocessor systems, IEEE Transactions on Parallel and Distributed Systems (1997)
- A. Girault, C. Lavarenne, M. Sighireanu, Y. Sorel, Fault-tolerant static scheduling for real-time embedded systems, in:...
- Allocation of periodic task modules with precedence and deadline constraints in real-time systems, IEEE Transactions on Computers
- Static scheduling algorithms for allocating directed task graphs to multiprocessors, ACM Computing Surveys
- Tolerance to multiple transient faults for aperiodic tasks in hard real-time systems, IEEE Transactions on Computers
- A fault-tolerant dynamic scheduling algorithm for multiprocessor real-time systems and its analysis, IEEE Transactions on Parallel and Distributed Systems