Abstract
Hard real-time systems are employed in military, aeronautics, and astronautics fields where deployed systems are susceptible to software faults that can result in functional errors. Thus, there is a need to use fault-tolerant (FT) real-time scheduling. Among the various fault-tolerant real-time scheduling techniques, re-execution has been applied widely to existing real-time systems owing to its simplicity and applicability. However, re-execution requires multiple executions of every task, and some tasks miss their deadlines owing to the prolonged execution time; therefore, it has been found to be suitable for only soft real-time systems. In this paper, we propose an FT policy that can be incorporated into most (if not all) existing real-time scheduling algorithms on multiprocessor systems, which improves the reliability of the target system without a tradeoff against schedulability. As a case study, we apply the FT policy to existing fixed-priority scheduling and earliest deadline zero-laxity scheduling, and we demonstrate that it enhances reliability without schedulability loss.
1. Introduction
A computer system is referred to as a real-time system if the correctness of the system depends not only on its logical output but also on the time at which the output is produced. The times by which outputs must be produced are called deadlines, and the requirement to meet these deadlines is referred to as the timing constraint. There are two fundamental problems in designing a real-time system: the design of a real-time scheduling algorithm that assigns task priorities so that deadlines are met, and schedulability analysis that verifies that the timing constraints are satisfied [1].
A hard real-time system requires strict satisfaction of timing constraints; otherwise, such breaches may result in catastrophic consequences such as significant economic loss and threats to human lives. Hard real-time systems have been employed in many fields such as the military, aeronautics, and astronautics in which systems are susceptible to faults that produce a functional error. For example, a satellite system is deployed in a harsh operational environment where the state of the software can be affected by cosmic radiation [2]. In addition, such systems tend to be situated in remote and inaccessible locations, which necessitates the use of fault-tolerant real-time scheduling.
In hard real-time systems, there are several popular fault-tolerant real-time scheduling techniques such as checkpointing with rollback, dual/triple modular redundancy, and re-execution [3,4,5]. The checkpointing with rollback technique saves the state of the system on stable storage at each checkpoint, and the system rolls back to the latest checkpoint if a transient fault is detected. The dual/triple modular redundancy technique executes identical copies of each task simultaneously on multicore platforms, and their results are voted on to produce a single output. The re-execution technique executes the task multiple times and selects a correct output (without a transient fault) from the multiple executions; if no correct output is obtained within the given number of executions, it re-executes the task to improve reliability. Faults can mainly be categorized as permanent and transient [3,6]. Permanent faults normally indicate the malfunction of a part that requires replacement with a spare part to restore system functionality. Transient faults are short-term faults after which the system functionality is restored using a software-based approach such as re-execution.
Although re-execution with respect to a transient fault is an effective fault-tolerant technique for real-time scheduling, it is known to be suitable for only soft real-time systems [3]. This is because the technique’s main aim is to improve reliability, which can be measured based on the metric of the probability of successful executions (in terms of functionality) without any transient faults, and strict conformance to meeting a deadline is not the main requirement. The re-execution technique requires multiple executions of every task, so some deadlines may be missed owing to the prolonged execution time. Thus, studies [3,5] have focused on improving the reliability of mixed-criticality systems or energy-sensitive real-time systems while inevitably sacrificing the schedulability of the systems.
In this paper, we propose a fault-tolerant (FT) policy that can be incorporated into most (if not all) existing real-time scheduling algorithms, which improves the reliability of the target system without sacrificing schedulability. We target identical multiprocessor systems where each processor’s architecture is exactly the same. The FT policy employs the re-execution technique in conjunction with a new deadline-based schedulability analysis proposed in this paper, ensuring that the finishing time of each task’s execution, delayed by re-execution, is never later than its corresponding deadline. The delayed finishing time of each task depends on how many times the task is executed. We refer to this number as the execution count of the task, and we address the problem of assigning execution counts so as to improve reliability while conserving schedulability. As a case study, we apply the FT policy to existing fixed-priority (FP) scheduling and earliest deadline zero-laxity (EDZL) scheduling, and we demonstrate that it enhances reliability without schedulability loss.
In summary, the contributions of this paper are as follows.
- It proposes the FT policy to improve the reliability of a target system scheduled by a given real-time scheduling algorithm without sacrificing schedulability.
- A new deadline-based schedulability analysis designed for the re-execution technique is proposed, which can be incorporated into the FT policy.
- FP and EDZL scheduling incorporating the FT policy are presented as a case study.
- The conducted experiments demonstrate that the FT policy dramatically improves the performance compared to the existing techniques (which use a predetermined execution count) when schedulability and reliability are considered simultaneously.
The remainder of this paper is organized as follows. Section 2 presents the system model, including the task and fault models, and the safety metric. Section 3 introduces the proposed FT scheduling framework, called the FT policy. As a case study, the FT policy is applied to FP scheduling and EDZL scheduling, and its performance is evaluated in Section 4. Section 5 discusses related work. Section 6 concludes the paper.
2. The System Model
In this section, we describe our system and fault models including the task and system reliability, and the system safety for a performance metric.
2.1. The Task Model
We consider a task set following the Liu and Layland model [1], scheduled on m processors in a hard real-time system. Each task in the task set invokes a series of jobs, and the release times of two consecutive jobs of the same task are separated by at least one period (in time units). Each job should complete its worst-case execution within the relative deadline of its task. The q-th job of a task has an absolute deadline equal to its release time plus the relative deadline, and it should finish its execution at or before this absolute deadline to be schedulable. That is, a job is schedulable if its finishing time is smaller than or equal to its absolute deadline. A task is schedulable if every one of its jobs is schedulable, and a task set is schedulable when every task is schedulable. We target a constrained-deadline task system in which the relative deadline of every task is no larger than its period.
We consider a global preemptive work-conserving scheduling algorithm. An algorithm is referred to as global, preemptive, and work-conserving if a job can migrate from one processor to the other one, a lower-priority job’s execution can be hindered by a higher-priority job, and the scheduler always tries to keep the processors busy when there are released jobs with remaining execution. Moreover, a single job cannot execute in parallel. We assume quantum-based time where a time unit describes a quantum length of 1, meaning that all task parameters are specified by multiples of the quantum length.
2.2. The Fault Model
Among two types of faults (i.e., permanent and transient), we consider the transient fault that appears for a short time without damaging the device. Transient faults determine the reliability of a task (called the task reliability of ), which is defined as the probability of its successful execution (in terms of functionality) without any transient fault. An average arrival rate is the expected number of failures occurring per second. Using a given fault arrival rate and an exponential distribution, the task reliability (as a performance metric for fault tolerance) of task is expressed as [5]
For example, the task reliability of for given and is . Thus, the system reliability is defined as the average of the task reliability of tasks in calculated as
We assume that a transient fault can affect the reliability but not change the worst-case execution time of a task .
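The reliability definitions above can be illustrated with a short sketch. It assumes the commonly used exponential model in which the probability of a fault-free execution is exp(-rate × execution time), with the fault arrival rate and the worst-case execution time expressed in the same time unit; the function and parameter names are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def task_reliability(fault_rate: float, wcet: float) -> float:
    """Probability that one execution of length `wcet` sees no transient fault.

    Assumption: faults arrive as a Poisson process with average rate
    `fault_rate`, so the fault-free probability is exp(-fault_rate * wcet).
    """
    return math.exp(-fault_rate * wcet)


def system_reliability(fault_rate: float, wcets: Sequence[float]) -> float:
    """Average of the task reliabilities over all tasks in the task set."""
    return sum(task_reliability(fault_rate, c) for c in wcets) / len(wcets)
```

Under this model, a longer worst-case execution time or a higher fault arrival rate lowers the task reliability, which is why additional executions are needed to recover it.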
When it comes to an FT technique, we adopt re-execution to improve the reliability of the target system suffering from transient faults. In the re-execution technique, the fault (if any) is supposed to be detected at the end of a job execution, and the job is re-executed when the correct output is not obtained. Specifically, each job instance of a task is executed times, and the job is re-executed if the correct output (with no transient fault) is not obtained after executions, thereby resulting in + 1 executions. is the number of times that every job of a task is executed under the re-execution technique. For a given , is calculated by
We suppose that at most one transient fault can occur for a single job instance by following a common assumption [7]. Moreover, each execution over the executions shares the same absolute deadline .
By the definition of reliability, implies the possibility that a job of a task does not successfully execute without any transient fault. Since a job is executed times when the correct output is not obtained over executions in the re-execution technique, the reliability of a task is expressed as follows:
For example, the task reliability of for given , , and is .
The reliability of a hard real-time system should be maintained at a high level, and every single execution of a task should be finished before its corresponding absolute deadline. To support this requirement, we propose a new metric, i.e., system safety, to quantify the system’s reliability and schedulability simultaneously. The system safety of a task set equals its system reliability if the task set is schedulable and 0 otherwise. Thus, the system safety indicates the system reliability of a schedulable task set.
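The effect of re-execution on reliability and the definition of system safety can be sketched as follows. The sketch rests on a stated assumption rather than on the exact expression of the paper: a job with a given execution count runs at most one more time than that count, and each run fails independently with probability one minus the base task reliability. All names are hypothetical.

```python
from typing import Sequence


def reexecution_reliability(base_reliability: float, count: int) -> float:
    """Reliability of a task whose jobs run up to `count` + 1 times.

    Assumption: each of the count + 1 executions fails independently with
    probability (1 - base_reliability); the job fails only if all of them fail.
    """
    return 1.0 - (1.0 - base_reliability) ** (count + 1)


def system_safety(task_reliabilities: Sequence[float], schedulable: bool) -> float:
    """System reliability of a schedulable task set, and 0 otherwise."""
    if not schedulable:
        return 0.0
    return sum(task_reliabilities) / len(task_reliabilities)
```

For instance, under this assumption a base reliability of 0.9 rises to about 0.99 with one additional execution, but the gain is lost entirely in the safety metric if the extra execution makes the task set unschedulable.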
3. The Fault-Tolerant Scheduling Framework
In this section, we present the FT policy that can be incorporated into most (if not all) existing real-time scheduling algorithms, which can improve reliability by exploiting the re-execution technique without sacrificing the schedulability of task sets under the scheduling algorithm. Thus, we perform a schedulability analysis to support the use of the policy.
3.1. The Scheduling Algorithm Incorporating FT Policy
As mentioned in Section 1, we aim at improving the reliability of the target systems without degrading the schedulability. Basically, the re-execution technique increases the number of times that every job of a task is executed. Thus, it inevitably prolongs the finishing time of every job of , and conditionally (depending on the scheduling policy) increases interference in the other tasks. Based on this reasoning, we need to address the following questions:
- Q1: How can the execution count of a task be determined without compromising the schedulability of that task?
- Q2: How can the execution count of a task be determined without compromising the schedulability of the other tasks?
To address both questions (Q1–2), we should guarantee that the increased finishing time (due to of ) of every job (likewise, ) of a task (likewise, ) should be less than or equal to the corresponding absolute deadline (likewise, ). One may argue that the finishing time of a job will be prolonged exactly by (e.g., in the case of the detection of a transient fault) for a given . However, such a phenomenon only occurs when is highest priority so that every job of does not suffer from any interference from the other tasks. Depending on the considered scheduling algorithm (e.g., whether a task-level or job-level priority assignment policy), the increased finishing time can be greater than due to the interference of other tasks while executing for for a given . Therefore, we should ensure an upper bound on the interference from other tasks while executing for , and carefully consider this for determining of to conserve schedulability.
The FT policy effectively assigns the value of using the -assignment algorithm so that the prolonged finishing time never exceeds . With of every task , a task set τ is scheduled according to the base scheduling algorithm. Every job is executed at least times, and once again when the correct output is not obtained.
Algorithm 1 illustrates how the FT-policy-incorporated scheduling algorithm operates. Before the system starts, for every task is assigned by a given -assignment algorithm (Line 1); we will describe how -assignment algorithm operates in Section 3.3. For every time instant t, a job of a task is inserted in a ready queue whenever is released (Lines 3–5). Released jobs in are scheduled according to a given base scheduling algorithm (Line 6). Each job in is executed at least times and once again if a fault is detected (Lines 7–9). Finally, is removed from when the execution of is completed.
| Algorithm 1 The FT-policy-incorporated scheduling algorithm. |
| 1: The execution count of every task is assigned by a given assignment algorithm (Algorithm 2) |
| 2: for every time instant t do |
| 3: if a job is released then |
| 4: Insert the job into the ready queue |
| 5: end if |
| 6: Schedule jobs in the ready queue according to a given base scheduling algorithm |
| 7: if the assigned number of executions is completed for a job, and a fault is detected then |
| 8: Execute the job again. |
| 9: end if |
| 10: if a job finishes its execution then |
| 11: Delete the job from the ready queue |
| 12: end if |
| 13: end for |
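The per-job behavior of the re-execution step in Algorithm 1 can be sketched in isolation from the scheduler. The sketch below is only an illustration under assumptions: `execute` is a hypothetical callable that performs one execution and reports whether a fault was detected at its end, and processor allocation, preemption, and deadlines are deliberately left out.

```python
from typing import Callable, Optional, Tuple


def run_with_reexecution(
    execute: Callable[[], Tuple[bool, Optional[object]]],
    count: int,
) -> Tuple[Optional[object], int]:
    """Run a job `count` times and once more if no correct output was obtained.

    `execute` returns (ok, value), where ok is False when a transient fault
    was detected at the end of that execution.
    Returns (output, executions_used); output is None if every run was faulty.
    """
    output = None
    used = 0
    for _ in range(count):
        ok, value = execute()
        used += 1
        if ok and output is None:
            output = value  # keep the first correct output
    if output is None:
        ok, value = execute()  # the single additional execution
        used += 1
        if ok:
            output = value
    return output, used
```

In the worst case the job consumes one execution more than its assigned count, which is the amount of execution that the schedulability analysis has to account for.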
3.2. Schedulability Analysis
Since our goal is to ensure schedulability while improving reliability, we must be able to judge whether the task set is schedulable with the given values of for every task . To do so, we utilize a deadline-based analysis technique that has been widely used in real-time multiprocessor scheduling [8,9,10] and modify it to support the FT policy.
Deadline-based analysis for multiprocessor systems employs the concept of interference [11]. The interference in in an interval , which is denoted by , is the cumulative length of all sub-intervals in such that a job of cannot be executed due to the execution of other higher-priority jobs even though it is ready to be executed. In addition, the interference of with in , which is denoted by , is the cumulative length of all sub-intervals in such that a job of is executed even though a job of is ready to be executed. Since the execution of a job (in a ready queue) of is hindered when m other jobs are executed at the same time instance, under any global work-conserving can be upper-bounded by [11]
As derived in [11], the relationship between and for any arbitrary positive x is as follows.
We also let be the maximum interference of with in an interval of length between and of any job of , which is expressed as
Any job of is successfully executed before its deadline if the maximum interference in in an interval of length starting from the release time of any job of is strictly less than . The deadline-based schedulability analysis is expressed as follows using Equations (6) and (7).
Lemma 1
(Theorem 5 in [8]). Suppose that a task set τ is scheduled by a global, preemptive, and work-conserving algorithm. Thus, τ is schedulable if the following inequality holds for all .
Proof.
We briefly summarize the proof of Theorem 5 in [8]. To miss a deadline for a job of scheduled on m processors, the job executes in at most time instances. At each time instance, at least m other jobs are required to hinder the execution of a job of . Hence, at least amount of interference of other tasks with is required to miss the job’s deadline. □
We now develop for any work-conserving scheduling algorithm incorporating the FT policy. To upper-bound , we exploit the notion of the workload of a task in an interval of length ℓ, which is defined as the amount of computation time required for in the interval of length ℓ [12]. Figure 1 describes the scenario where the workload of a task is maximized under any preemptive scheduling incorporating the FT policy with a given value of . As seen in Figure 1, the left-most job of starts its execution at the beginning of the interval and finishes at , which executes for without any interference or delay. Thus, the following jobs are released and scheduled as soon as possible. Thus, the workload of a task under any preemptive scheduling incorporating the FT policy with a given value of in an interval of length ℓ is upper-bounded as
where is the number of jobs executed for calculated by
Figure 1.
Worst-case scenario in which the workload of is maximized under any work-conserving scheduling.
Thus, the following theorem is derived.
Theorem 1.
Suppose that a task set τ (which holds that for every ) is scheduled by the FT policy with a given base algorithm. Thus, τ is schedulable if the following inequality holds for all
Proof.
To miss a deadline for a job of under the FT policy with a given base algorithm on m processors, the job executes in at most time instances. At each time instance, at least m other jobs are required to hinder the execution of a job of . Hence, at least amount of interference of other tasks with is required to miss the job’s deadline. □
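A deadline-based test of this kind can be sketched as follows. The sketch is an assumption-laden illustration, not a transcription of Theorem 1: it takes the well-known workload bound and interference test of [8] and replaces the worst-case execution time of each task with an inflated budget of (count + 1) executions, on integer (quantum-based) task parameters. The class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Task:
    period: int
    wcet: int
    deadline: int
    count: int = 0  # hypothetical name for the per-task execution count

    @property
    def budget(self) -> int:
        # Assumed worst case: `count` executions plus one re-execution.
        return (self.count + 1) * self.wcet


def workload(task: Task, length: int) -> int:
    """Upper bound on the execution of `task` in any interval of `length`."""
    window = length + task.deadline - task.budget
    jobs = window // task.period            # jobs that fit entirely
    carry = window - jobs * task.period     # room left for one partial job
    return jobs * task.budget + min(task.budget, carry)


def is_schedulable(tasks: Sequence[Task], m: int) -> bool:
    """Sufficient test: every task tolerates the bounded interference."""
    if any(t.budget > t.deadline for t in tasks):
        return False  # misses its deadline even without interference
    for k, tk in enumerate(tasks):
        slack = tk.deadline - tk.budget + 1
        interference = sum(
            min(workload(ti, tk.deadline), slack)
            for i, ti in enumerate(tasks)
            if i != k
        )
        if interference >= m * slack:
            return False
    return True
```

The test is sufficient only: a task set rejected by it may still be schedulable, but an accepted task set meets all deadlines under the stated assumptions.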
3.3. The -Assignment Algorithm
Under the base scheduling algorithm employing the FT policy, it is guaranteed that the finishing time of any job, prolonged by the execution count assigned to its task, is never later than its absolute deadline. The FT policy assigns such execution counts by exploiting the assignment algorithm described in this subsection.
The -assignment algorithm selects a task of a task set according to a given selection algorithm, and increases the value of one by one while checking that the increased value of does not make the schedulable task set unschedulable with a given schedulability analysis. (Note that we use another task index j to indicate a selected task for avoiding confusion since k indicates the index of an arbitrary task as we presented in Section 2.) It repeats this for every task in . A number of selection algorithms can be applied for this such as highest-priority first (i.e., selected in an order of scheduling priority).
Algorithm 2 presents how the assignment algorithm operates. It first sets the execution count to zero for every task (Line 1). For every task selected by a given selection algorithm (Line 2), it increases the execution count of the task one by one until the task set is deemed unschedulable (Lines 3–5). Note that a task whose total execution exceeds its relative deadline naturally misses its deadline even without any interference, so a task set containing such a task is regarded as unschedulable. It then decreases the execution count by one to make the task set schedulable again (Line 6). Lines 3–6 are repeated for each task selected by a given selection algorithm. The time complexity of Algorithm 2 is obtained as follows. It first initializes the execution count for every task in Line 1, which needs O(n), where n is the number of tasks in the task set. It then considers the tasks one by one in Line 2, which requires O(n). In Line 3, it repeatedly conducts the schedulability analysis proposed in Theorem 1 while the condition in Line 3 holds. Since the calculation of the left-hand side and right-hand side in Equation (11) are done with and (i.e., constant time), the analysis requires in terms of time complexity. Because increases by one at each iteration, Line 3 in Algorithm 2 can be conducted at most times. Lines 4 and 6 are performed in a constant time. As a result, the time complexity of Algorithm 2 is = .
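| Algorithm 2 The execution-count assignment algorithm. |
| 1: Set the execution count of every task to zero |
| 2: for each task, from the first to the last one selected by a given selection algorithm do |
| 3: while the task set is deemed schedulable by Theorem 1, and the total execution of the selected task does not exceed its relative deadline do |
| 4: Increase the execution count of the selected task by one |
| 5: end while |
| 6: Decrease the execution count of the selected task by one |
| 7: end for |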
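The greedy idea of Algorithm 2 can be sketched as below. This is a variant written for illustration, not the exact listing: it commits an increased execution count only after the schedulability test passes, and it treats the test as a hypothetical callable `test` that receives the list of execution counts. The `max_count` guard is an added safeguard against a test that never rejects.

```python
from typing import Callable, List, Sequence


def assign_counts(
    order: Sequence[int],
    n_tasks: int,
    test: Callable[[List[int]], bool],
    max_count: int = 1000,
) -> List[int]:
    """Greedily raise per-task execution counts while `test` still passes.

    `order` lists task indices in the order chosen by the selection
    algorithm (for example, highest priority first).
    """
    counts = [0] * n_tasks
    if not test(counts):
        raise ValueError("the base task set is not deemed schedulable")
    for j in order:
        while counts[j] < max_count:
            counts[j] += 1          # tentatively add one execution
            if not test(counts):
                counts[j] -= 1      # undo the step that broke schedulability
                break
    return counts
```

Because earlier tasks in the selection order consume the available slack first, different selection algorithms can yield different assignments for the same task set.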
4. Case Study
In this section, we apply the FT policy to FP scheduling and EDZL scheduling (we denote them by FT-FP-A and FT-EDZL-A, respectively) as a case study.
4.1. Schedulability Analysis for FT-FP-A
In FP scheduling, a priority is assigned to a task rather than each job. Thus, only a higher-priority task can interfere with a job of a lower-priority task . Well-known FP scheduling algorithms include rate monotonic (RM) [13] and earliest quasi-deadline first (EQDF) [14]; a task whose (likewise ) is smaller than that of other tasks has a higher priority under the RM (likewise EQDF) scheduling algorithm. We denote FP scheduling incorporating the FT policy with -assignment algorithm A employing any sorting algorithm by FT-FP-A. Let be a set of tasks whose priorities are higher than . Thus, Theorem 1 is re-formulated for FP scheduling as follows.
Theorem 2.
Suppose that a task set τ (which holds that for every ) is scheduled by FT-FP-A. Thus, τ is schedulable if the following inequality holds for all
Proof.
To miss a deadline for a job of under FT-FP-A scheduling on m processors, the job executes in at most time instances due to the existence of higher-priority tasks. At each time instance, at least m other jobs are required to hinder the execution of a job of . Hence, at least amount of interference of tasks in with is required to miss the job’s deadline. □
Thus, we schedule a given task set by FT-FP-A (Algorithm 1) exploiting -assignment algorithm A (Algorithm 2 with Theorem 2 instead of Theorem 1 in Line 3).
4.2. Schedulability Analysis for FT-EDZL-A
The EDZL scheduling algorithm assigns a higher priority to a job whose absolute deadline is earlier than that of other jobs, as in earliest deadline first (EDF) scheduling. In addition, it promotes a job’s priority to the highest at the time instant t at which the job’s laxity becomes zero, because the job would miss its deadline otherwise. Here, the laxity of a job at t is defined as the difference between the remaining time until its absolute deadline and its remaining execution time, so zero laxity means that the two are equal.
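The EDZL priority rule can be sketched for a single time instant as follows. The job representation (a tuple of name, absolute deadline, and remaining execution) and the function names are hypothetical, and ties are broken by list order only for the sake of the example.

```python
from typing import List, Sequence, Tuple

Job = Tuple[str, int, int]  # (name, absolute deadline, remaining execution)


def laxity(abs_deadline: int, now: int, remaining: int) -> int:
    """Time left until the deadline minus the execution still required."""
    return (abs_deadline - now) - remaining


def edzl_select(jobs: Sequence[Job], now: int, m: int) -> List[str]:
    """Names of up to m jobs to run at `now` under the EDZL rule.

    Zero-laxity jobs are promoted to the highest priority; the remaining
    jobs are ordered by earliest absolute deadline.
    """
    ranked = sorted(
        jobs,
        key=lambda job: (laxity(job[1], now, job[2]) > 0, job[1]),
    )
    return [job[0] for job in ranked[:m]]
```

For example, with jobs ("a", 10, 2), ("b", 12, 3), and ("c", 20, 15) at time 5 on two processors, job "c" has zero laxity and is selected together with "a", although "b" has an earlier deadline than "c".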
For deadline-based schedulability analysis for FT-EDZL-A, we first upper-bound under work-conserving EDF scheduling. Figure 2 illustrates the worst-case release pattern of higher-priority jobs of in an interval . As shown in Figure 2, the interference from higher-priority jobs to is maximized when their absolute deadlines are aligned because whose is later than cannot interfere with . Thus, the upper bound of the amount of interference from the jobs of to a job of is calculated by as follows:
Figure 2.
Worst-case scenario in which interference of to is maximized under work-conserving EDF scheduling.
Thus, we schedule a given task set by FT-EDZL-A (Algorithm 1) exploiting -assignment algorithm A (Algorithm 2 with Theorem 3 instead of Theorem 1 in Line 3).
Under EDZL scheduling, a job can interfere with even if ’s deadline is later than ’s deadline. This happens only when is in the zero-laxity state and its priority is promoted. Figure 3 illustrates a job (the right-most one in the figure) of is in the zero-laxity state. The key characteristic of such a job is that it finishes its execution at its absolute deadline. Thus, Figure 3 also derives the same upper bound of the amount of interference from the jobs of to a job of with Equation (13).
Figure 3.
Worst-case scenario in which interference of to is maximized under work-conserving EDZL scheduling.
In order for a job to miss its absolute deadline under EDZL scheduling on an m-processor platform, there should be at least zero-laxity jobs at the same time instance. Based on this reasoning, we derive the following schedulability conditions for FT-EDZL-A.
Theorem 3.
Suppose that a task set τ (which holds that for every ) is scheduled by FT-EDZL-A. Thus, τ is schedulable if the following inequality holds for at least tasks
Proof.
To be in the zero-laxity state for a job of under FT-EDZL-A scheduling on m processors, the job is interfered in time instances. At each time instance, at least m other jobs are required to hinder the execution of a job of . Thus, at least amount of interference of higher-priority jobs with is required to be in the zero-laxity state for a job of . Moreover, there should be at least zero-laxity jobs at the same time instance in order for a job to miss its absolute deadline on an m-processor platform. Thus, the theorem holds. □
4.3. Evaluation Environment
In this subsection, we evaluate the performance of the considered scheduling algorithms incorporating the proposed FT policy.
For our evaluation, we randomly generate task sets based on a well-known task set generation framework [8,15,16]. For the input parameters, we consider the number of processors and the individual task utilization () distribution (bimodal or exponential with its input parameter chosen in [17]). For a given bimodal input parameter p, the value is uniformly selected in [0, 0.5) and [0.5, 1) with probability p and 1 − p, respectively. For a given exponential input parameter , the value is selected according to the exponential distribution whose probability density function is . For each task, is uniformly chosen in , is determined by the bimodal or exponential parameter, and is uniformly chosen in . We generate 10,000 task sets for each value of m. We then measure the number of task sets deemed schedulable by the proposed schedulability analysis, as well as the average system safety of task sets (defined as the average of the considered task sets’ system safety), as performance metrics.
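The per-task part of such a generation procedure can be sketched as follows. Several details are assumptions made only for illustration: the exponential parameter is taken to be the mean of the distribution, exponential draws outside (0, 1) are redrawn, the period is drawn from 1 to a hypothetical `max_period`, and the relative deadline is drawn between the execution time and the period so that deadlines are constrained.

```python
import math
import random
from typing import Tuple


def bimodal_utilization(p: float, rng: random.Random) -> float:
    """Uniform in [0, 0.5) with probability p, otherwise uniform in [0.5, 1)."""
    if rng.random() < p:
        return 0.5 * rng.random()
    return 0.5 + 0.5 * rng.random()


def exponential_utilization(mean: float, rng: random.Random) -> float:
    """Exponentially distributed utilization, redrawn until it lies in (0, 1)."""
    while True:
        u = rng.expovariate(1.0 / mean)
        if 0.0 < u < 1.0:
            return u


def make_task(utilization: float, max_period: int,
              rng: random.Random) -> Tuple[int, int, int]:
    """Return (period, wcet, deadline) with integer, quantum-based values."""
    period = rng.randint(1, max_period)
    wcet = max(1, math.ceil(utilization * period))
    deadline = rng.randint(wcet, period)  # constrained deadline (assumed range)
    return period, wcet, deadline
```

A task set would then be built by drawing tasks repeatedly with a seeded `random.Random` instance, which keeps the generated task sets reproducible across runs.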
We consider the following schedulability tests (as well as the corresponding scheduling algorithms for measuring the system safety):
- : for the EDZL scheduling algorithm (Equation (14) with every );
- (also denoted by -): for the RM scheduling algorithm (Equation (12) with every ),
- : for the EQDF scheduling algorithm (Equation (12) with every );
- --: for the EDZL scheduling algorithm incorporating the FT policy in which the -assignment algorithm increases in an order of index k (Equation (14) with determined by a given -assignment algorithm);
- --: for the RM scheduling algorithm incorporating the FT policy in which the -assignment algorithm increases in an order of task priority (Equation (12) with determined by a given -assignment algorithm);
- --: for the EQDF scheduling algorithm incorporating the FT policy in which the -assignment algorithm increases in an order of task priority (Equation (12) with determined by a given -assignment algorithm);
- -: for the RM scheduling algorithm (Equation (12) with every );
- -: for the RM scheduling algorithm (Equation (12) with every ).
4.4. Example of a Task Set: ACSW in Satellite Systems
In this subsection, we illustrate an actual real-time system whose operational characteristics can be specified by the task parameters described in the previous subsection. A reconnaissance satellite system is a compelling example of a real-time system; it is equipped with a reconnaissance antenna to obtain a signal image of the target terrain by transmitting and receiving radio frequency signals. In a reconnaissance satellite system, antenna controller software (ACSW) [18] controls the reconnaissance antenna, and its tasks are scheduled by RM on RTEMS (Real-Time Executive for Multiprocessor Systems) [19], which is used as a space-specific real-time operating system (RTOS). ACSW typically consists of five main tasks named tHigh, tMilbus, tOne, tTwo, and tSync, whose main operations are described at a high level as follows.
- tHigh retrieves a single macro command (MCMD) from an MCMD queue in every period and invokes a job corresponding to the MCMD.
- tMilbus is responsible for receiving MCMDs from the ground station by utilizing the MIL-STD-1553B protocol [20] and verifies the integrity of each MCMD before the MCMD is inserted into an MCMD queue.
- tOne performs internal mode transitions such as turning on/off relevant equipment and transmits internal telemetries via the SpaceWire protocol [21].
- tTwo conducts various operations such as fault detection and the formatting of network packets that will be transferred to the ground station.
- tSync executes a job for the operation preparation whenever there are surplus computing resources.
Table 1 describes the task parameters of the five tasks. The timing parameters of each task are determined by the system designer by considering the operating concept of the ACSW. That is, tHigh takes an MCMD from an MCMD queue every 62.5 ms, tMilbus receives an MCMD from the ground station every 125 ms, an internal mode transition is performed by tOne every 250 ms, and tTwo transmits the system status to the ground station every 500 ms. tSync does not have specified task parameters because it executes without deadlines when the other tasks are inactive. The WCET, best-case execution time (BCET), and average-case execution time (ACET) are measured on a multiprocessor platform equipped with 256 Mbps SDRAM and the FT-Leon3 CPU architecture (80 MHz clock rate). Since the task set generation method described in the previous subsection considers a number of task sets whose parameters are randomly selected, it can cover various real-time embedded systems in which tasks conduct different roles in various operation scenarios.
Table 1.
Task parameters (millisecond base) of antenna controller software (ACSW).
4.5. Evaluation Results
Figure 4a,b plot the number of task sets deemed schedulable under the considered schedulability tests according to varying task set utilization for and , respectively. Note that the FT policy does not compromise the schedulability of the base scheduling algorithm B, so algorithm B in Figure 4a,b also represents B incorporating the FT policy. For example, the number of task sets deemed schedulable by and -- is the same. As shown in Figure 4a,b, (largely) outperforms , which performs better than .
Figure 4.
Evaluation results regarding schedulability and reliability of considered techniques.
Figure 4c,d show the average system safety (i.e., the average system reliability of schedulable task sets) of task sets under the considered techniques for . As shown, the average system safety of the considered techniques decreases with increasing task set utilization because the system safety is zero when the task set is unschedulable. Similar to Figure 4a,b, (largely) is shown to outperform , which also outperforms . This is because a better-performing schedulability analysis finds a higher number of schedulable task sets whose system safety is not zero. Moreover, the -series improves the average system safety for every task set utilization since the FT policy increases (or at least does not decrease) of all tasks, thereby increasing the system reliability.
A comparison of Figure 4e,f to Figure 4c,d, respectively, shows that the average system safety of considered techniques decreases due to a higher error rate (i.e., 0.01 compared to 0.001). However, the performance gap between the schedulability tests of the base algorithms (i.e., , , and ) and those incorporating the FT policy becomes larger as increases. This phenomenon happens because the system reliability is dramatically degraded with an increasing value of (as Equation (2) implies), but assigned by the FT policy makes up such system-safety degradation. Figure 4 excludes the evaluation results regarding the FT policy in which the -assignment algorithm increases in an order of lower task priority (e.g., denoted by --) because the trends demonstrated are similar to those shown in Figure 4c–f.
Figure 5 presents the average system safety of task sets under RM with different assignments for . As shown in Figure 5, a higher number of re-executions (i.e., a greater execution count) dramatically decreases the average system safety. This indicates that schedulability is more important than system reliability for obtaining a high average system safety. That is, a greater execution count improves the system reliability according to Equation (4), but schedulability is not guaranteed when such an increase is not made in conjunction with schedulability analysis. Thus, a higher number of re-executions compromises the average system safety due to the degraded schedulability even though it may improve reliability. Note that the system safety is 0 for an unschedulable task set according to the definition of the system safety presented in Section 2.
Figure 5.
Evaluation results regarding the average system safety of rate monotonic (RM) with different assignments for .
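The interaction between schedulability and reliability can be made concrete with a short sketch of the averaging step. It relies only on the rule stated above (system safety is 0 for an unschedulable task set and equals the system reliability otherwise). The `TaskSetResult` type and all numbers are hypothetical and are not taken from the reported experiments.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TaskSetResult:
    """Outcome of analysing one generated task set (hypothetical record type)."""
    schedulable: bool
    reliability: float  # system reliability if the set were to run


def system_safety(result: TaskSetResult) -> float:
    """System safety: reliability for schedulable sets, 0 otherwise."""
    return result.reliability if result.schedulable else 0.0


def average_system_safety(results: Sequence[TaskSetResult]) -> float:
    """Mean system safety over all task sets (0.0 for an empty sequence)."""
    if not results:
        return 0.0
    return sum(system_safety(r) for r in results) / len(results)


# Illustrative numbers only: few re-executions keep all four sets schedulable,
# whereas many re-executions raise reliability but leave one schedulable set.
few_reexecutions = [TaskSetResult(True, 0.99) for _ in range(4)]
many_reexecutions = [TaskSetResult(True, 0.9999)] + [
    TaskSetResult(False, 0.9999) for _ in range(3)
]

if __name__ == "__main__":
    print(average_system_safety(few_reexecutions))
    print(average_system_safety(many_reexecutions))
```

In this toy example, the configuration with more re-executions has the higher per-set reliability but the lower average system safety, because three of its four task sets contribute zero.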
5. Related Work
While a number of FT techniques have been previously implemented using hardware, hybrid software-based techniques such as checkpointing with rollback and re-execution have been proposed recently [3,4,5]. The former saves the state of the system on stable storage at checkpoints and, in case of a transient fault, restores the state from the latest checkpoint. The latter executes tasks multiple times (e.g., times) and chooses a correct output (if any) obtained over multiple executions. If none of the outputs obtained during the executions is correct, the tasks are re-executed to improve reliability. Because some tasks are executed multiple times under this technique, they may miss their deadlines. Thus, existing studies [4,5] have focused on improving the reliability of mixed-criticality systems or energy-sensitive real-time systems while inevitably sacrificing the schedulability of the systems.
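The re-execution idea can be sketched in a few lines. This is a minimal illustration, not code from any of the cited works: it assumes that a hypothetical acceptance check `is_correct` can detect a faulty output, and it stops at the first correct output instead of always running every execution, which is a simplification. A timing analysis would still have to budget for all `max_executions` runs.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_reexecution(
    task: Callable[[], T],
    is_correct: Callable[[T], bool],
    max_executions: int,
) -> Optional[T]:
    """Run `task` up to `max_executions` times and return the first correct output.

    Returns None when no execution produced an output accepted by `is_correct`.
    Both callables are hypothetical stand-ins for a real task and its check.
    """
    for _ in range(max_executions):
        output = task()
        if is_correct(output):
            return output
    return None


def make_flaky_task(outputs):
    """Build a deterministic demo task that yields the given outputs in order."""
    iterator = iter(outputs)
    return lambda: next(iterator)


if __name__ == "__main__":
    # Negative values model outputs corrupted by a transient fault.
    accept = lambda value: value >= 0
    print(run_with_reexecution(make_flaky_task([-1, -1, 42]), accept, 3))
    print(run_with_reexecution(make_flaky_task([-1, -1, 42]), accept, 2))
```

With three allowed executions the demo task eventually yields an accepted output, whereas with two it does not, which is the reliability benefit that must be weighed against the extra execution time.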
In multiprocessor domains, the additional processors themselves can be used to tolerate faults. One popular technique is the primary-backup approach [22,23,24], in which the backup of a task does not need to be executed if its primary executes successfully. Backup overloading allows backup copies of primary tasks to be scheduled in a time-overlapping manner for efficiency [22]. Backup overloading was later improved by Manimaran and Murthy [23]. Another efficient overloading algorithm for multiprocessors was proposed that uses dynamic logical grouping among copies of tasks [24].
There are other ways to support fault tolerance in multiprocessors [25,26]. Cirinei et al. proposed a dynamic reconfiguration of multiprocessor hardware platforms considering the tradeoff between performance and fault tolerance (through simultaneous replication) [25]. Liberato et al. proposed FT global multiprocessor scheduling by re-executing an instance of a faulty job [26].
6. Conclusions
We proposed an FT policy that can be incorporated into most (if not all) existing real-time scheduling algorithms on multiprocessor systems and that improves the reliability of a target system without sacrificing schedulability. Our study was inspired by the fact that existing re-execution techniques enforce multiple executions of some tasks to improve system reliability, which can cause otherwise schedulable task sets to become unschedulable. Our proposed FT policy employs the re-execution technique in conjunction with deadline-based schedulability analysis while ensuring that schedulable task sets under the FT policy never become unschedulable. As a case study, we applied the FT policy to existing FP scheduling and EDZL scheduling and evaluated its performance regarding schedulability and reliability. In the future, we plan to extend our work to mixed-criticality systems and to apply better schedulability analysis techniques, such as response-time analysis, to improve the analytical capability of the FT policy.
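The core idea summarized above, adding re-executions only while a schedulability test still passes, can be sketched as a greedy loop. This is an illustrative sketch under stated assumptions and not the algorithm evaluated in the paper: the `Task` type, the round-robin order, and the cap `max_executions` are hypothetical, and the stand-in test is the well-known utilization bound for global EDF with implicit deadlines, swapped in for the deadline-based analysis used for FP and EDZL. Each extra execution is assumed to add one full worst-case execution time.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    wcet: float          # worst-case execution time of a single run
    period: float        # period, also the relative deadline (implicit deadlines)
    executions: int = 1  # number of executions budgeted per job

    @property
    def utilization(self) -> float:
        # Assumption: k executions cost k times the single-run WCET.
        return self.executions * self.wcet / self.period


def edf_utilization_test(tasks: List[Task], processors: int) -> bool:
    """Stand-in sufficient test for global EDF: U_sum <= m * (1 - U_max) + U_max."""
    total = sum(t.utilization for t in tasks)
    largest = max(t.utilization for t in tasks)
    return total <= processors * (1.0 - largest) + largest


def assign_executions(
    tasks: List[Task],
    processors: int,
    is_schedulable: Callable[[List[Task], int], bool] = edf_utilization_test,
    max_executions: int = 3,
) -> bool:
    """Greedily add executions while the task set stays deemed schedulable.

    Returns False (leaving the tasks unchanged) if the base set fails the test.
    """
    if not is_schedulable(tasks, processors):
        return False
    changed = True
    while changed:
        changed = False
        for task in tasks:
            if task.executions >= max_executions:
                continue
            task.executions += 1
            if is_schedulable(tasks, processors):
                changed = True
            else:
                task.executions -= 1  # roll back: never give up schedulability
    return True


if __name__ == "__main__":
    demo = [Task(1, 10), Task(2, 10), Task(3, 10)]
    assign_executions(demo, processors=2)
    print([t.executions for t in demo])
```

Because every increment is rolled back when the test fails, a task set that passes the test before the loop also passes it afterwards, which mirrors the guarantee that schedulable task sets never become unschedulable.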
Author Contributions
Conceptualization: J.L. and H.B.; software: H.B.; data curation: J.L. and H.B.; writing—original draft preparation: J.L. and H.B.; writing—review and editing: J.L. and H.B.; supervision: J.L.; project administration: J.L.; funding acquisition: J.L.
Funding
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (No. 2018R1C1B5083050). This research was also supported by the Chung-Ang University Research Grants in 2018.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Liu, C.; Layland, J. Scheduling Algorithms for Multi-programming in A Hard-Real-Time Environment. J. ACM 1973, 20, 46–61.
- Ekpo, S.; George, D. A system-based design methodology and architecture for highly adaptive small satellites. In Proceedings of the IEEE International Systems Conference, San Diego, CA, USA, 5–8 April 2010; pp. 516–519.
- Malhotra, S.; Narkhede, P.; Shah, K.; Makaraju, S.; Shanmugasundaram, M. A review of fault tolerant scheduling in multicore systems. Int. J. Sci. Technol. Res. 2015, 4, 132–136.
- Yu, X.B.; Zhao, J.S.; Zheng, C.W.; Hu, X.H. A Fault-Tolerant Scheduling Algorithm using Hybrid Overloading Technology for Dynamic Grouping based Multiprocessor Systems. Int. J. Comput. Commun. Control 2012, 7, 990–999.
- Zhou, J.; Yin, M.; Li, Z.; Cao, K.; Yan, J.; Wei, T.; Chen, M. Fault-Tolerant Task Scheduling for Mixed-Criticality Real-Time Systems. J. Circuits Syst. Comput. 2017, 26, 1750016.
- Kang, S.; Yang, H.; Kim, S.; Bacivarov, I.; Ha, S.; Thiele, L. Static mapping of mixed critical applications for fault-tolerant MPSoCs. In Proceedings of the IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 1–5 June 2014; pp. 1–6.
- Aminzadeh, S.; Ejlali, A. A comparative study of system-level energy management methods for fault-tolerant hard real-time systems. IEEE Trans. Comput. 2011, 60, 1228–1299.
- Bertogna, M.; Cirinei, M.; Lipari, G. Schedulability Analysis of Global Scheduling Algorithms on Multiprocessor Platforms. IEEE Trans. Parallel Distrib. Syst. 2009, 20, 553–566.
- Baker, T.P.; Cirinei, M.; Bertogna, M. EDZL Scheduling Analysis. Real-Time Syst. 2008, 40, 264–289.
- Lee, J.; Easwaran, A.; Shin, I. LLF Schedulability Analysis on Multiprocessor Platforms. In Proceedings of the Real-Time Systems Symposium, San Diego, CA, USA, 30 November–3 December 2010; pp. 25–36.
- Bertogna, M.; Cirinei, M.; Lipari, G. Improved Schedulability Analysis of EDF on Multiprocessor Platforms. In Proceedings of the Euromicro Conference on Real-Time Systems (ECRTS), Balearic Islands, Spain, 6–8 July 2005; pp. 209–218.
- Bertogna, M.; Cirinei, M. Response-Time Analysis for globally scheduled Symmetric Multiprocessor Platforms. In Proceedings of the IEEE Real-Time Systems Symposium (RTSS), Tucson, AZ, USA, 3–6 December 2007.
- Bini, E.; Buttazzo, G.C. The space of rate monotonic schedulability. In Proceedings of the IEEE Real-Time Systems Symposium (RTSS), Austin, TX, USA, 3–5 December 2002.
- Back, H.; Chwa, H.S.; Shin, I. Schedulability Analysis and Priority Assignment for Global Job-level Fixed-Priority Multiprocessor Scheduling. In Proceedings of the Real Time and Embedded Technology and Applications Symposium, Beijing, China, 16–19 April 2012; pp. 297–306.
- Baker, T.P. Comparison of Empirical Success Rates of Global vs. Partitioned Fixed-Priority EDF Scheduling for Hard Real-Time; Technical Report TR–050601; Department of Computer Science, Florida State University: Tallahassee, FL, USA, 2005.
- Andersson, B.; Bletsas, K.; Baruah, S. Scheduling Arbitrary-Deadline Sporadic Task Systems on Multiprocessor. In Proceedings of the IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, Barcelona, Spain, 30 November–3 December 2008; pp. 197–206.
- Lee, J.; Easwaran, A.; Shin, I. Contention-Free Executions for Real-Time Multiprocessor Scheduling. ACM Trans. Embed. Comput. Syst. 2014, 13, 1–69.
- Baek, H.; Lee, H.; Lee, H.; Lee, J.; Kim, S. Improved Schedulability Analysis for Fault-Tolerant Space-Borne SAR System. In Proceedings of the Conference on Korea Institute of Military Science and Technology (KIIT), Daejeon, Korea, 7–8 June 2018; pp. 1231–1232.
- RTEMS Community. RTEMS Real-Time Operating System. Available online: https://www.rtems.org (accessed on 9 May 2019).
- Excalibur Systems. MIL-STD-1553B. Available online: https://www.mil-1553.com (accessed on 9 May 2019).
- European Space Agency. SpaceWire. Available online: http://spacewire.esa.int (accessed on 9 May 2019).
- Ghosh, S.; Melhem, R.; Mosse, D. Fault-tolerance through scheduling of aperiodic tasks in hard real-time multiprocessor systems. IEEE Trans. Parallel Distrib. Syst. 1997, 8, 272–284.
- Manimaran, G.; Murthy, C.S.R. A fault-tolerant dynamic scheduling algorithm for multiprocessor real-time systems and its analysis. IEEE Trans. Parallel Distrib. Syst. 1998, 9, 1137–1152.
- Al-Omari, R.; Somani, A.K.; Manimaran, G. Efficient overloading techniques for primary-backup scheduling in real-time systems. J. Parallel Distrib. Comput. 2004, 64, 629–648.
- Cirinei, M.; Bini, E.; Lipari, G.; Ferrari, A. A Flexible Scheme for Scheduling Fault-Tolerant Real-Time Tasks on Multiprocessors. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, Rome, Italy, 26–30 March 2007; pp. 1–8.
- Liberato, F.; Lauzac, S.; Melhem, R.; Mosse, D. Fault tolerant real-time global scheduling on multiprocessors. In Proceedings of the Euromicro Conference on Real-Time Systems (ECRTS), York, UK, 9–11 June 1999; pp. 252–259.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).