Task-Level Re-Execution Framework for Improving Fault Tolerance on Symmetry Multiprocessors

Abstract: Hard real-time systems are employed in military, aeronautics, and astronautics fields where deployed systems are susceptible to software faults that can result in functional errors. Thus, there is a need to use fault-tolerant (FT) real-time scheduling. Among the various fault-tolerant real-time scheduling techniques, re-execution has been applied widely to existing real-time systems owing to its simplicity and applicability. However, re-execution requires multiple executions of every task, and some tasks miss their deadlines owing to the prolonged execution time; therefore, it has been found to be suitable for only soft real-time systems. In this paper, we propose an FT policy that can be incorporated into most (if not all) existing real-time scheduling algorithms on multiprocessor systems, which improves the reliability of the target system without a tradeoff against schedulability. As a case study, we apply the FT policy to existing fixed-priority scheduling and earliest deadline zero-laxity scheduling, and we demonstrate that it enhances reliability without schedulability loss.


Introduction
A computer system is referred to as a real-time system if the correctness of the system depends on not only its logical output, but also the time when the output is produced. The times by which outputs must be produced are called deadlines, and the requirement to meet these deadlines is referred to as the timing constraint. There are two fundamental problems in designing a real-time system: the design of a real-time scheduling algorithm that assigns task priorities to meet deadlines, and schedulability analysis that verifies whether the timing constraints are satisfied [1].
A hard real-time system requires strict satisfaction of timing constraints; otherwise, such breaches may result in catastrophic consequences such as significant economic loss and threats to human lives. Hard real-time systems have been employed in many fields such as the military, aeronautics, and astronautics in which systems are susceptible to faults that produce a functional error. For example, a satellite system is deployed in a harsh operational environment where the state of the software can be affected by cosmic radiation [2]. In addition, such systems tend to be situated in remote and inaccessible locations, which necessitates the use of fault-tolerant real-time scheduling.
In hard real-time systems, there are several popular fault-tolerant real-time scheduling techniques such as checkpointing with rollback, dual/triple modular redundancy, and re-execution [3][4][5]. The checkpointing with rollback technique saves the state of the system on stable storage at each checkpoint, and the system rolls back to the latest checkpoint if a transient fault is detected. The dual/triple modular redundancy technique executes identical copies of each task simultaneously on multicore platforms, and their results are voted on to produce a single output. The re-execution technique executes the task multiple times and selects a correct output (without a transient fault) from the multiple executions. To improve reliability, it re-executes the task when the correct output is not obtained within the given number of executions. Faults can mainly be categorized as permanent and transient [3,6]. Permanent faults normally indicate the malfunction of a part that must be replaced with a spare part to restore system functionality. Transient faults are short-term faults after which the system functionality is restored using a software-based approach such as re-execution.
Although re-execution with respect to a transient fault is an effective fault-tolerant technique for real-time scheduling, it is known to be suitable for only soft real-time systems [3]. This is because the technique's main aim is to improve reliability, which can be measured based on the metric of the probability of successful executions (in terms of functionality) without any transient faults, and strict conformance to meeting a deadline is not the main requirement. The re-execution technique requires multiple executions of every task, so some deadlines may be missed owing to the prolonged execution time. Thus, studies [3,5] have focused on improving the reliability of mixed-criticality systems or energy-sensitive real-time systems while inevitably sacrificing the schedulability of the systems.
In this paper, we propose a fault-tolerant (FT) policy that can be incorporated into most existing (if not all) real-time scheduling algorithms, which improves the reliability of the target system without sacrificing schedulability. We target identical multiprocessor systems where each processor's architecture is exactly the same. The FT policy employs the re-execution technique in conjunction with a new deadline-based schedulability analysis proposed in this paper for the re-execution technique while ensuring that the delayed finishing time of each task's execution due to re-execution is never later than its corresponding deadline. The delayed finishing time of each task is dependent on how many times each task is executed. Here, the λ k of a task is the execution count, and the λ k -assignment problem is addressed to improve reliability while conserving schedulability. As a case study, we apply the FT policy to existing fixed-priority (FP) scheduling and earliest deadline zero-laxity (EDZL) scheduling, and we demonstrate that it enhances reliability without schedulability loss.
In summary, the contributions of this paper are as follows.
• It proposes the FT policy to improve the reliability of the target system scheduled by a given real-time scheduling algorithm without sacrificing schedulability.
• A new deadline-based schedulability analysis designed for the re-execution technique is proposed, which can be incorporated into the FT policy.
• The FT policy incorporated into FP and EDZL scheduling is proposed as a case study.
• The conducted experiments demonstrate that the FT policy dramatically improves the performance compared to the existing techniques (utilizing the predetermined λ k ) when we consider the schedulability and reliability simultaneously.
The remainder of this paper is organized as follows. Section 2 presents the system model, including the task and fault models, and the safety metric. Section 3 introduces the proposed FT scheduling framework, called the FT policy. As a case study, the FT policy is applied to FP scheduling and EDZL scheduling, and its performance is evaluated in Section 4. Section 5 discusses related work. Section 6 concludes the paper.

The System Model
In this section, we describe our system and fault models including the task and system reliability, and the system safety for a performance metric.

The Task Model
We consider a task set τ following the Liu and Layland model [1], scheduled on m processors in a hard real-time system. A task τ k = (T k , C k , D k ) in a task set τ invokes a series of jobs, where the interval between the release times of two consecutive jobs is at least T k time units. Each job should complete its execution, which takes at most the worst-case execution time C k , within the relative deadline D k . The q-th job J q k of a task τ k is released at r q k and has the absolute deadline d q k , meaning that J q k should finish its execution before or at d q k to be schedulable. The finishing time of a job J q k is denoted by f q k . A job J q k is said to be schedulable if f q k is smaller than or equal to d q k . Thus, a task τ k is schedulable if every job of τ k is schedulable, and a task set τ is schedulable when every task τ k is schedulable. We target a constrained-deadline task system in which C k ≤ D k ≤ T k holds for every task τ k ∈ τ.
We consider a global preemptive work-conserving scheduling algorithm. An algorithm is referred to as global, preemptive, and work-conserving if a job can migrate from one processor to the other one, a lower-priority job's execution can be hindered by a higher-priority job, and the scheduler always tries to keep the processors busy when there are released jobs with remaining execution. Moreover, a single job cannot execute in parallel. We assume quantum-based time where a time unit describes a quantum length of 1, meaning that all task parameters are specified by multiples of the quantum length.

The Fault Model
Among the two types of faults (i.e., permanent and transient), we consider the transient fault, which appears for a short time without damaging the device. Transient faults determine the reliability of a task τ k (called the task reliability of τ k ), which is defined as the probability of its successful execution (in terms of functionality) without any transient fault. The average fault arrival rate γ is the expected number of failures occurring per second. Using a given fault arrival rate γ and an exponential distribution, the task reliability R k of task τ k (as a performance metric for fault tolerance) is expressed as [5]

$$R_k = e^{-\gamma C_k}. \tag{1}$$

For example, the task reliability R k of τ k for given γ = 0.001 and C k = 300 is e^{−0.001·300} ≈ 0.7408. The system reliability R(τ) is defined as the average of the task reliability of the tasks in τ, calculated as

$$R(\tau) = \frac{1}{|\tau|} \sum_{\tau_k \in \tau} R_k. \tag{2}$$

We assume that a transient fault can affect the reliability but does not change the worst-case execution time C k of a task τ k . As an FT technique, we adopt re-execution to improve the reliability of the target system suffering from transient faults. In the re-execution technique, the fault (if any) is supposed to be detected at the end of a job execution, and the job is re-executed when the correct output is not obtained. Specifically, each job instance of a task τ k is executed N k times, and the job is re-executed if the correct output (with no transient fault) is not obtained after N k executions, thereby resulting in N k + 1 executions. λ k is the number of times that every job of a task τ k is executed under the re-execution technique. For a given N k , λ k is calculated by

$$\lambda_k = N_k + 1. \tag{3}$$

We suppose that at most one transient fault can occur for a single job instance by following a common assumption [7]. Moreover, each execution over the λ k executions shares the same absolute deadline d q k . By the definition of reliability, 1 − R k is the probability that a job of a task τ k does not successfully execute without any transient fault. Since a job is executed λ k times when the correct output is not obtained over N k executions in the re-execution technique, the reliability of a task τ k is expressed as follows:

$$R_k = 1 - \left(1 - e^{-\gamma C_k}\right)^{\lambda_k}. \tag{4}$$

For example, the task reliability R k of τ k for given γ = 0.001, C k = 300, and λ k = 3 is 1 − (1 − e^{−0.001·300})^3 ≈ 0.9826.

The reliability of a hard real-time system should be maintained at a high level, and every single execution of a task should be finished before its corresponding absolute deadline. To support this requirement, we propose a new metric, i.e., system safety, to quantify the system's reliability and schedulability simultaneously.
The system safety S(τ) is given by R(τ) (i.e., R(τ) · 1) if τ is schedulable and 0 (i.e., R(τ) · 0) otherwise. Thus, the system safety indicates the system reliability of a schedulable task set.
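The metrics of this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the re-execution reliability is assumed to take the form 1 − (1 − e^{−γ·C k})^{λ k}, i.e., a job fails only if all λ k executions are faulty.

```python
import math

def task_reliability(gamma, wcet, executions=1):
    """Probability that at least one of `executions` runs is fault-free.

    Assumed form: 1 - (1 - e^{-gamma * C_k}) ** lambda_k.
    """
    single_run = math.exp(-gamma * wcet)      # reliability of one execution
    return 1.0 - (1.0 - single_run) ** executions

def system_reliability(gamma, tasks):
    """Average task reliability; `tasks` is a list of (C_k, lambda_k) pairs."""
    return sum(task_reliability(gamma, c, lam) for c, lam in tasks) / len(tasks)

def system_safety(gamma, tasks, schedulable):
    """System reliability if the task set is schedulable, 0 otherwise."""
    return system_reliability(gamma, tasks) if schedulable else 0.0

print(round(task_reliability(0.001, 300), 4))                # 0.7408
print(round(task_reliability(0.001, 300, executions=3), 4))  # 0.9826
```

The two printed values correspond to the examples with γ = 0.001 and C k = 300 for λ k = 1 and λ k = 3.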

The Fault-Tolerant Scheduling Framework
In this section, we present the FT policy that can be incorporated into most (if not all) existing real-time scheduling algorithms, which can improve reliability by exploiting the re-execution technique without sacrificing the schedulability of task sets under the scheduling algorithm. Thus, we perform a schedulability analysis to support the use of the policy.

The Scheduling Algorithm Incorporating FT Policy
As mentioned in Section 1, we aim at improving the reliability of the target systems without degrading the schedulability. Basically, the re-execution technique increases the number of times that every job of a task τ k is executed. Thus, it inevitably prolongs the finishing time of every job of τ k , and conditionally (depending on the scheduling policy) increases interference in the other tasks. Based on this reasoning, we need to address the following questions:
Q1 How can λ k of τ k be determined without compromising the schedulability of τ k ?
Q2 How can λ k of τ k be determined without compromising the schedulability of the other tasks τ i ?
To address both questions (Q1-2), we should guarantee that the increased finishing time f q k (likewise, f q i ) due to λ k is less than or equal to the corresponding absolute deadline d q k (likewise, d q i ). One may argue that the finishing time f q k of a job J q k will be prolonged exactly by λ k · C k (e.g., in the case of the detection of a transient fault) for a given λ k . However, such a phenomenon only occurs when τ k has the highest priority so that every job J q k of τ k does not suffer from any interference from the other tasks. Depending on the considered scheduling algorithm (e.g., whether it uses a task-level or job-level priority assignment policy), the increased finishing time f q k can be greater than λ k · C k due to the interference of other tasks while executing for λ k · C k for a given λ k . Therefore, we should derive an upper bound on the interference from other tasks while executing for λ k · C k , and carefully consider this bound when determining λ k of τ k to conserve schedulability.
The FT policy effectively assigns the value of λ k using the λ k -assignment algorithm so that the prolonged finishing time f q k never exceeds d q k . With λ k of every task τ k , a task set τ is scheduled according to the base scheduling algorithm. Every job J q k is executed at least λ k − 1 times, and once again when the correct output is not obtained.
Algorithm 1 illustrates how the FT-policy-incorporated scheduling algorithm operates. Before the system starts, λ k for every task τ k is assigned by a given λ k -assignment algorithm (Line 1); we will describe how λ k -assignment algorithm operates in Section 3.3. For every time instant t, a job J q k of a task τ k is inserted in a ready queue Q ready whenever J q k is released (Lines 3-5). Released jobs in Q ready are scheduled according to a given base scheduling algorithm (Line 6). Each job in Q ready is executed at least λ k − 1 times and once again if a fault is detected (Lines 7-9). Finally, J q k is removed from Q ready when the execution of J q k is completed.

Algorithm 1
The FT-policy-incorporated scheduling algorithm.
1: λ k for every τ k is assigned by a given λ k -assignment algorithm (Algorithm 2)
2: for every time instant t do
3:     if J q k is released by τ k then
4:         Insert J q k into Q ready
5:     end if
6:     Schedule jobs in Q ready according to a given base scheduling algorithm
7:     if λ k − 1 executions are completed for J q k , and a fault is detected then
8:         Execute J q k again.
9:     end if
10:     if J q k finishes its execution then
11:         Delete J q k from Q ready
12:     end if
13: end for
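The per-job rule in Lines 7-9 of Algorithm 1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation; `run_job` and the callback `execute_once` are hypothetical names, and fault detection is assumed to happen at the end of each execution.

```python
def run_job(lam, execute_once):
    """Run one job under the FT policy and return (executions, succeeded).

    `execute_once` is a hypothetical callback that performs one execution
    and returns True when no transient fault is detected at its end.
    """
    executions = 0
    correct = False
    for _ in range(max(lam - 1, 0)):      # the lam - 1 mandatory executions
        executions += 1
        correct = execute_once() or correct
    if not correct and lam >= 1:          # one more execution if a fault was detected
        executions += 1
        correct = execute_once()
    return executions, correct

# Two faulty executions followed by a correct one, with lam = 3.
outcomes = iter([False, False, True])
print(run_job(3, lambda: next(outcomes)))   # (3, True)
```

In the worst case the job therefore occupies a processor for λ k · C k time units, which is the quantity the schedulability analysis has to account for.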

Schedulability Analysis
Since our goal is to ensure schedulability while improving reliability, we must be able to judge whether the task set τ is schedulable with the given values of λ k for every task τ k . To do so, we utilize a deadline-based analysis technique that has been widely used in real-time multiprocessor scheduling [8][9][10] and modify it to support the FT policy.
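Deadline-based analysis for multiprocessor systems employs the concept of interference [11]. The interference in τ k in an interval [a, b), which is denoted by I(τ k , a, b), is the cumulative length of all sub-intervals in [a, b) such that a job of τ k cannot be executed due to the execution of other higher-priority jobs even though it is ready to be executed. In addition, the interference of τ i with τ k in an interval [a, b), which is denoted by I(τ k ← τ i , a, b), is the cumulative length of all sub-intervals in [a, b) such that a job of τ i is executed even though a job of τ k is ready to be executed. Since the execution of a job (in a ready queue) of τ k is hindered when m other jobs are executed at the same time instance, I(τ k , a, b) under any global work-conserving algorithm can be upper-bounded by [11]

$$I(\tau_k, a, b) \le \frac{1}{m} \sum_{\tau_i \in \tau \setminus \{\tau_k\}} I(\tau_k \leftarrow \tau_i, a, b). \tag{5}$$

As derived in [11], the relationship between I(τ k , a, b) and I(τ k ← τ i , a, b) for any arbitrary positive x is as follows.

$$I(\tau_k, a, b) \ge x \iff \sum_{\tau_i \in \tau \setminus \{\tau_k\}} \min\bigl(I(\tau_k \leftarrow \tau_i, a, b),\, x\bigr) \ge m \cdot x. \tag{6}$$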
We also let I(τ k ← τ i ) be the maximum interference of τ i ∈ τ \ {τ k } with τ k in an interval of length D k starting from the release time r q k of any job J q k of τ k , which is expressed as

$$I(\tau_k \leftarrow \tau_i) = \max_{q} I(\tau_k \leftarrow \tau_i,\, r_k^q,\, r_k^q + D_k). \tag{7}$$

Any job of τ k is successfully executed before its deadline if the maximum interference in τ k in an interval of length D k starting from the release time of any job of τ k is strictly less than D k − C k + 1. The deadline-based schedulability analysis is expressed as follows using Equations (6) and (7).

Lemma 1 (Theorem 5 in [8]). Suppose that a task set τ is scheduled by a global, preemptive, and work-conserving algorithm. Then, τ is schedulable if the following inequality holds for all τ k ∈ τ:

$$\sum_{\tau_i \in \tau \setminus \{\tau_k\}} \min\bigl(I(\tau_k \leftarrow \tau_i),\, D_k - C_k + 1\bigr) < m \cdot (D_k - C_k + 1). \tag{8}$$
Proof. We briefly summarize the proof of Theorem 5 in [8]. To miss a deadline for a job of τ k scheduled on m processors, the job executes in at most C k − 1 time instances. At each time instance, at least m other jobs are required to hinder the execution of a job of τ k . Hence, at least m · (D k − (C k − 1)) amount of interference of other tasks with τ k is required to miss the job's deadline.
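As a small worked check of the condition in Lemma 1, assume it takes the standard form of Theorem 5 in [8], i.e., the sum over the other tasks of min(I(τ k ← τ i ), D k − C k + 1) must be strictly less than m · (D k − C k + 1). The numbers below are hypothetical and chosen only for illustration: m = 2 processors, a task with D k = 10 and C k = 4, and three other tasks whose maximum interference values are 5, 9, and 1.

```latex
\begin{align*}
D_k - C_k + 1 &= 10 - 4 + 1 = 7,\\
\sum_{\tau_i \neq \tau_k} \min\bigl(I(\tau_k \leftarrow \tau_i),\, 7\bigr)
  &= \min(5, 7) + \min(9, 7) + \min(1, 7) = 5 + 7 + 1 = 13,\\
m \cdot (D_k - C_k + 1) &= 2 \cdot 7 = 14 > 13 .
\end{align*}
```

Since 13 < 14, the condition holds for τ k in this example. Note how the cap at 7 limits the contribution of the task with interference 9: interference beyond D k − C k + 1 cannot further delay a job that would already have missed its deadline.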
We now develop I(τ k ← τ i ) for any work-conserving scheduling algorithm incorporating the FT policy. To upper-bound I(τ k ← τ i ), we exploit the notion of the workload of a task τ i in an interval of length ℓ, which is defined as the amount of computation time required for τ i in the interval of length ℓ [12]. Figure 1 describes the scenario where the workload of a task τ i is maximized under any preemptive scheduling incorporating the FT policy with a given value of λ i . As seen in Figure 1, the left-most job of τ i starts its execution at the beginning of the interval and finishes at d q i , executing for λ i · C i without any interference or delay. The following jobs are then released and scheduled as soon as possible. Thus, the workload W i (ℓ) of a task τ i under any preemptive scheduling incorporating the FT policy with a given value of λ i in an interval of length ℓ is upper-bounded as

$$W_i(\ell) \le F_i(\ell) \cdot \lambda_i \cdot C_i + \min\bigl(\lambda_i \cdot C_i,\; \ell + D_i - \lambda_i \cdot C_i - F_i(\ell) \cdot T_i\bigr), \tag{9}$$

where F i (ℓ) is the number of jobs executed for λ i · C i , calculated by

$$F_i(\ell) = \left\lfloor \frac{\ell + D_i - \lambda_i \cdot C_i}{T_i} \right\rfloor. \tag{10}$$

Thus, the following theorem is derived.
Theorem 1. Suppose that a task set τ (which holds that λ k · C k ≤ D k for every τ k ∈ τ) is scheduled by the FT policy with a given base algorithm. Then, τ is schedulable if the following inequality holds for all τ k ∈ τ:

$$\sum_{\tau_i \in \tau \setminus \{\tau_k\}} \min\bigl(W_i(D_k),\, D_k - \lambda_k \cdot C_k + 1\bigr) < m \cdot (D_k - \lambda_k \cdot C_k + 1). \tag{11}$$

Proof. To miss a deadline for a job of τ k under the FT policy with a given base algorithm on m processors, the job executes in at most λ k · C k − 1 time instances. At each time instance, at least m other jobs are required to hinder the execution of a job of τ k . Hence, at least m · (D k − (λ k · C k − 1)) amount of interference of other tasks with τ k is required to miss the job's deadline.
Figure 1. Worst-case scenario in which the workload of τ i is maximized under any work-conserving scheduling.
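The test of Theorem 1 can be sketched in Python as follows. The sketch rests on assumptions that should be stated plainly: the workload bound is taken to be the standard bound of [12] with C i replaced by λ i · C i , that is, F i (ℓ) = ⌊(ℓ + D i − λ i C i )/T i ⌋ and W i (ℓ) = F i (ℓ) · λ i C i + min(λ i C i , ℓ + D i − λ i C i − F i (ℓ) · T i ), and the condition compares the capped workloads against m · (D k − λ k C k + 1). The names `Task`, `workload`, and `schedulable` are hypothetical.

```python
from typing import List, NamedTuple

class Task(NamedTuple):
    T: int      # minimum inter-release time
    C: int      # worst-case execution time of one execution
    D: int      # relative deadline
    lam: int    # execution count (lambda) under the FT policy

def workload(task: Task, length: int) -> int:
    """Assumed upper bound on the workload of `task` in an interval of `length`."""
    c = task.lam * task.C                          # each job runs for lam * C
    full_jobs = (length + task.D - c) // task.T    # F_i(length)
    carry = length + task.D - c - full_jobs * task.T
    return full_jobs * c + min(c, carry)

def schedulable(tasks: List[Task], m: int) -> bool:
    """Sufficient test in the style of Theorem 1 (assumed form)."""
    for k, tk in enumerate(tasks):
        if tk.lam * tk.C > tk.D:                   # precondition of the theorem
            return False
        slack = tk.D - tk.lam * tk.C + 1
        interference = sum(min(workload(ti, tk.D), slack)
                           for i, ti in enumerate(tasks) if i != k)
        if interference >= m * slack:
            return False
    return True

base = [Task(10, 2, 10, 1), Task(20, 4, 20, 1), Task(40, 8, 40, 1)]
doubled = [t._replace(lam=2) for t in base]
print(schedulable(base, m=2), schedulable(doubled, m=2))   # True False
```

In this hypothetical task set, setting every λ k to 2 makes the test fail for the first task (the capped workloads sum to 14, which is not below 2 · 7), which illustrates why λ k should be assigned per task together with the schedulability analysis rather than fixed in advance.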

The λ k -Assignment Algorithm
Under the base scheduling algorithm employing the FT policy, it is guaranteed that the increased finishing time of any job J q k due to a given λ k of τ k is never later than its absolute deadline d q k . The FT policy assigns such λ k by exploiting the λ k -assignment algorithm, which is described in this subsection.
The λ k -assignment algorithm selects a task τ j of a task set τ according to a given selection algorithm, and increases the value of λ j one by one while checking that the increased value of λ j does not make the schedulable task set unschedulable with a given schedulability analysis. (Note that we use another task index j to indicate a selected task τ j for avoiding confusion since k indicates the index of an arbitrary task as we presented in Section 2.) It repeats this for every task τ j in τ. A number of selection algorithms can be applied for this such as highest-priority first (i.e., selected in an order of scheduling priority).
Algorithm 2 presents how the λ k -assignment algorithm operates. It first sets λ k to zero for every task τ k (Line 1). For every task τ j selected by a given selection algorithm (Line 2), it increases the value of λ j of a task τ j ∈ τ one by one until τ is deemed unschedulable (Lines 3-5). Note that a task τ j that holds λ j · C j > D j naturally misses its deadline without any interference, so we assume that τ containing such τ j is unschedulable. Then, it decreases λ j by one to make τ schedulable (Line 6). Lines 3-6 are repeated for each task τ j selected by a given selection algorithm.

The time complexity of Algorithm 2 is obtained as follows. It first initializes λ k for every task τ k in Line 1, which needs O(n) where n is the number of tasks in a task set τ. Then, it considers a task τ j ∈ τ one by one in Line 2, which requires O(n). In Line 3, it repeatedly conducts the schedulability analysis proposed in Theorem 1 while the condition in Line 3 holds. Since the calculation of the left-hand side and right-hand side in Equation (11)

Algorithm 2
λ k -Assignment Algorithm.
1: λ k ← 0 for all tasks τ k ∈ τ
2: for τ j from the first task to the last one selected by a given selection algorithm do
3:     while τ is deemed schedulable by Theorem 1, and λ j · C j ≤ D j holds do
4:         λ j ← λ j + 1
5:     end while
6:     λ j ← λ j − 1
7: end for
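The greedy structure of Algorithm 2 can be sketched as follows. The function name and its parameters are hypothetical, and the schedulability analysis is passed in as a callable so that any test (e.g., the one of Theorem 1) can be plugged in. The usage example deliberately uses a toy stand-in (a cap on the total execution demand) that is not a real schedulability test.

```python
from typing import Callable, List, Sequence

def assign_lambdas(wcets: Sequence[int], deadlines: Sequence[int],
                   order: Sequence[int],
                   is_schedulable: Callable[[List[int]], bool]) -> List[int]:
    """Greedy lambda-assignment following the structure of Algorithm 2."""
    lams = [0] * len(wcets)                      # Line 1
    for j in order:                              # Line 2: selection order
        # Lines 3-5: increase lambda_j while the task set stays schedulable
        while is_schedulable(lams) and lams[j] * wcets[j] <= deadlines[j]:
            lams[j] += 1
        lams[j] -= 1                             # Line 6: undo the failing step
    return lams

# Toy stand-in for the analysis (NOT a schedulability test): total demand <= 24.
wcets, deadlines = [2, 4, 8], [10, 20, 40]
toy_test = lambda lams: sum(l * c for l, c in zip(lams, wcets)) <= 24
print(assign_lambdas(wcets, deadlines, order=[0, 1, 2], is_schedulable=toy_test))
# [5, 3, 0]
```

In the example, the first task stops at λ = 5 because a sixth execution would exceed its deadline (6 · 2 > 10), whereas the second and third tasks stop because the stand-in test fails. Tasks selected earlier obtain larger λ values, which is why the selection order matters.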

Case Study
In this section, we apply the FT policy to FP scheduling and EDZL scheduling (we denote it by FT-FP-A and FT-EDZL-A, respectively) as a case study.

Schedulability Analysis for FT-FP-A
In FP scheduling, a priority is assigned to a task rather than each job. Thus, only a higher-priority task τ i can interfere with a job J q k of a lower-priority task τ k . Well-known FP scheduling algorithms include rate monotonic (RM) [13] and earliest quasi-deadline first (EQDF) [14]; a task whose T k (likewise D k − C k ) is smaller than that of other tasks has a higher priority under the RM (likewise EQDF) scheduling algorithm. We denote FP scheduling incorporating the FT policy with λ k -assignment algorithm A employing any sorting algorithm by FT-FP-A. Let HP(τ k ) be a set of tasks whose priorities are higher than τ k . Thus, Theorem 1 is re-formulated for FP scheduling as follows.
Theorem 2. Suppose that a task set τ (which holds that λ k · C k ≤ D k for every τ k ∈ τ) is scheduled by FT-FP-A. Then, τ is schedulable if the following inequality holds for all τ k ∈ τ:

$$\sum_{\tau_i \in HP(\tau_k)} \min\bigl(W_i(D_k),\, D_k - \lambda_k \cdot C_k + 1\bigr) < m \cdot (D_k - \lambda_k \cdot C_k + 1). \tag{12}$$

Proof. To miss a deadline for a job of τ k under FT-FP-A scheduling on m processors, the job executes in at most λ k · C k − 1 time instances due to the existence of higher-priority tasks. At each time instance, at least m other jobs are required to hinder the execution of a job of τ k . Hence, at least m · (D k − (λ k · C k − 1)) amount of interference of tasks in HP(τ k ) with τ k is required to miss the job's deadline.
Thus, we schedule a given task set τ by FT-FP-A (Algorithm 1) exploiting λ k -assignment algorithm A (Algorithm 2 with Theorem 2 instead of Theorem 1 in Line 3).

Figure 2. Worst-case scenario in which interference of τ i to τ k is maximized under work-conserving EDF scheduling.

Schedulability Analysis for FT-EDZL-A
The EDZL scheduling algorithm assigns a higher priority to a job J q k whose absolute deadline d q k is earlier than that of other jobs, as in earliest deadline first (EDF) scheduling. In addition, it promotes the job's priority (to the highest) at the time instance t at which the job's laxity is zero (i.e., d q k − t is equal to the remaining execution time of the job) because the job would miss its deadline otherwise. Here, the laxity of a job J q k is defined as the difference between d q k − t (i.e., the remaining time instances up to d q k ) and the remaining execution of J q k .
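The priority rule of EDZL can be illustrated with a short sketch. The function names and the job tuples are hypothetical. Under the FT policy, the remaining execution of a job would also include the executions still pending out of its λ k executions, which is an assumption of this sketch rather than a statement from the text.

```python
def laxity(deadline, now, remaining):
    """Laxity of a job: time left until its deadline minus remaining execution."""
    return (deadline - now) - remaining

def edzl_key(deadline, now, remaining):
    """Sort key: zero-laxity jobs first, then earliest absolute deadline."""
    promoted = laxity(deadline, now, remaining) <= 0
    return (0 if promoted else 1, deadline)

# (name, absolute deadline, remaining execution) at time t = 4
jobs = [("J1", 20, 5), ("J2", 30, 26)]
now = 4
ordered = sorted(jobs, key=lambda job: edzl_key(job[1], now, job[2]))
print([name for name, _, _ in ordered])   # ['J2', 'J1']
```

Here J2 has the later deadline but zero laxity (30 − 4 − 26 = 0), so it is promoted ahead of J1, whereas plain EDF would run J1 first.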
For deadline-based schedulability analysis for FT-EDZL-A, we first upper-bound I(τ k ← τ i ) under work-conserving EDF scheduling. Figure 2 illustrates the worst-case release pattern of higher-priority jobs of τ i in an interval D k . As shown in Figure 2, the interference from higher-priority jobs to J q k is maximized when their absolute deadlines are aligned because J q i whose d q i is later than d q k cannot interfere with J q k . Thus, the upper bound of the amount of interference from the jobs of τ i to a job of τ k is calculated by E(D k ) as follows: Thus, we schedule a given task set τ by FT-EDZL-A (Algorithm 1) exploiting λ k -assignment algorithm A (Algorithm 2 with Theorem 3 instead of Theorem 1 in Line 3).
Under EDZL scheduling, a job J q i can interfere with J q k even if the deadline of J q i is later than that of J q k . This happens only when J q i is in the zero-laxity state and its priority is promoted. Figure 3 illustrates a case in which a job of τ i (the right-most one in the figure) is in the zero-laxity state. The key characteristic of such a job is that it finishes its execution at its absolute deadline. Thus, the scenario in Figure 3 yields the same upper bound on the amount of interference from the jobs of τ i to a job of τ k as Equation (13).
In order for a job to miss its absolute deadline under EDZL scheduling on an m-processor platform, there should be at least m + 1 zero-laxity jobs at the same time instance. Based on this reasoning, we derive the following schedulability conditions for FT-EDZL-A.

Theorem 3. Suppose that a task set τ (which holds that λ k · C k ≤ D k for every τ k ∈ τ) is scheduled by FT-EDZL-A. Then, τ is schedulable if the following inequality holds for at least |τ| − m tasks τ k ∈ τ. (14)

Proof. For a job of τ k to be in the zero-laxity state under FT-EDZL-A scheduling on m processors, the job must suffer interference in D k − λ k · C k time instances. At each time instance, at least m other jobs are required to hinder the execution of a job of τ k . Thus, at least m · (D k − λ k · C k ) amount of interference of higher-priority jobs with τ k is required for a job of τ k to be in the zero-laxity state. Moreover, there should be at least m + 1 zero-laxity jobs at the same time instance in order for a job to miss its absolute deadline on an m-processor platform. Thus, the theorem holds.

Evaluation Environment
In this subsection, we evaluate the performance of the considered scheduling algorithms incorporating the proposed FT policy.
For our evaluation, we randomly generate task sets based on a well-known task set generation framework [8,15,16]. For the input parameters, we consider the number of processors m ∈ {2, 4, 8, 16} and the individual task utilization (C i /T i ) distribution (bimodal or exponential with its input parameter chosen in {0.1, 0.3, 0.5, 0.7, 0.9} [17]). For a given bimodal input parameter p, the C i /T i value is uniformly selected in [0, 0.5) with probability p and in [0.5, 1) with probability 1 − p. For a given exponential input parameter 1/β, the value is selected according to the exponential distribution whose probability density function is β · exp(−β · x). For each task, T i is uniformly chosen in [1, 1000], C i is determined by the bimodal or exponential parameter, and D i is uniformly chosen in [C i , T i ]. We generate 10,000 task sets for each value of m. We then measure the number of task sets deemed schedulable by the proposed schedulability analysis, as well as the average system safety of task sets (defined as the average of the considered task sets' system safety), as performance metrics.
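The per-task parameter selection described above can be sketched as follows. Several details are assumptions of this sketch and are not stated in the text: the function names, the rounding of C i to an integer number of quanta (at least 1), and the redrawing of exponential samples that are not below 1. How tasks are accumulated into task sets by the framework of [8,15,16] is outside the scope of the sketch.

```python
import random

def draw_utilization(rng, dist, param):
    """Draw one C_i/T_i value for the given distribution and input parameter."""
    if dist == "bimodal":
        # [0, 0.5) with probability param, [0.5, 1) otherwise
        low, high = (0.0, 0.5) if rng.random() < param else (0.5, 1.0)
        return rng.uniform(low, high)
    if dist == "exponential":
        while True:                            # assumption: redraw values >= 1
            u = rng.expovariate(1.0 / param)   # rate beta, mean param = 1/beta
            if u < 1.0:
                return u
    raise ValueError(f"unknown distribution: {dist}")

def generate_task(rng, dist, param):
    """Return one task (T, C, D) with C <= D <= T on an integer time grid."""
    period = rng.randint(1, 1000)
    utilization = draw_utilization(rng, dist, param)
    wcet = min(period, max(1, round(utilization * period)))   # assumption: rounding
    deadline = rng.randint(wcet, period)
    return period, wcet, deadline

rng = random.Random(0)
print([generate_task(rng, "bimodal", 0.3) for _ in range(3)])
```

Passing an explicit `random.Random` instance keeps the generation reproducible for a fixed seed.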
We consider the following schedulability tests (as well as the corresponding scheduling algorithms for measuring the system safety):
• EDZL: for the EDZL scheduling algorithm (Equation (14) with every λ k = 1);
• RM (also denoted by RM-1): for the RM scheduling algorithm (Equation (12) with every λ k = 1);
• EQDF: for the EQDF scheduling algorithm (Equation (12) with every λ k = 1);
• FT-EDZL-Any: for the EDZL scheduling algorithm incorporating the FT policy in which the λ k -assignment algorithm increases λ k in an order of index k (Equation (14) with λ k determined by a given λ k -assignment algorithm);
• FT-RM-Inverse: for the RM scheduling algorithm incorporating the FT policy in which the λ k -assignment algorithm increases λ k in an order of task priority (Equation (12) with λ k determined by a given λ k -assignment algorithm);
• FT-EQDF-Inverse: for the EQDF scheduling algorithm incorporating the FT policy in which the λ k -assignment algorithm increases λ k in an order of task priority (Equation (12) with λ k determined by a given λ k -assignment algorithm);
• RM-2: for the RM scheduling algorithm (Equation (12) with every λ k = 2);
• RM-3: for the RM scheduling algorithm (Equation (12) with every λ k = 3).

Example of a Task Set: ACSW in Satellite Systems
In this subsection, we illustrate an actual real-time system whose operational characteristic can be specified by task parameters described in the previous subsection. A reconnaissance satellite system is a compelling example of a real-time system, which is equipped with a reconnaissance antenna to obtain a signal image of the target terrain by transmitting and receiving radio frequency signals. In a reconnaissance satellite system, antenna controller software (ACSW) [18] controls a reconnaissance antenna, of which tasks are scheduled by RM on RTEMS (real-time executive for multi-processor systems) [19] as a space-specific RTOS (real-time operating system). ACSW typically consists of five main tasks named tHigh, tMilbus, tOne, tTwo, and tSync, respectively, whose high-level description of main operation is described as follows.
• tHigh retrieves a single macro command (MCMD) from an MCMD queue in every period and invokes a job corresponding to the MCMD.
• tMilbus is responsible for receiving MCMDs from the ground station by utilizing the MIL-STD-1553B protocol [20] and verifies the integrity of each MCMD before the MCMD is inserted into an MCMD queue.
• tOne performs internal mode transitions such as turning on/off relevant equipment and transmits internal telemetries via the SpaceWire protocol [21].
• tTwo conducts various executions such as fault detection and formatting network packets that will be transferred to the ground station.
• tSync executes a job for the operation preparation whenever there are surplus computing resources.

Table 1 describes task parameters of the five tasks. T i and D i are determined by the system designer by considering the operating concept of the ACSW. That is, tHigh takes an MCMD from an MCMD queue every 62.5 ms, tMilbus receives an MCMD from the ground station every 125 ms, an internal mode transition occurs by tOne every 250 ms, and tTwo transmits the system status to the ground station every 500 ms. tSync does not have specified task parameters because it executes without deadlines when the other tasks are inactive. The WCET, best-case execution time (BCET), and average-case execution time (ACET) are measured on a multiprocessor platform equipped with 256 Mbps SDRAM and the FT-Leon3 CPU architecture (80 MHz clock rate). Since the task set generation method described in the previous subsection considers a number of task sets whose parameters are randomly selected, it can cover various real-time embedded systems in which tasks conduct different roles in various operation scenarios.

Figure 4a,b plot the number of task sets deemed schedulable under the considered schedulability tests according to varying task set utilization (∑ τ k ∈τ C k /T k ) for m = 4 and m = 16, respectively.
Note that the FT policy does not compromise the schedulability of the base scheduling algorithm B, so algorithm B in Figure 4a,b also represents B incorporating the FT policy. For example, the number of task sets deemed schedulable by EDZL and FT-EDZL-Any is the same. As shown in Figure 4a,b, EQDF (largely) outperforms EDZL, which performs better than RM. Figure 4c,d show the average system safety (i.e., the average system reliability of schedulable task sets) of task sets under the considered techniques for γ = 0.001. As shown, the average system safety of the considered techniques decreases with increasing task set utilization because the system safety is zero when the task set τ is unschedulable. Similar to Figure 4a,b, EQDF (largely) is shown to outperform EDZL, which also outperforms RM. This is because a better-performing schedulability analysis finds a higher number of schedulable task sets whose system safety is not zero. Moreover, the FT-series improves the average system safety for every task set utilization since the FT policy increases (or at least does not decrease) λ k of all tasks, thereby increasing the system reliability.

[Figure 4. Axes recovered from the plots: task set utilization (x-axis) and the number of task sets deemed schedulable (y-axis).]

Compared with Figure 4c,d, respectively, Figure 4e,f shows that the average system safety of the considered techniques decreases due to a higher error rate γ (i.e., 0.01 compared to 0.001). However, the performance gap between the schedulability tests of the base algorithms (i.e., EQDF, EDZL, and RM) and those incorporating the FT policy becomes larger as γ increases. This phenomenon happens because the system reliability is dramatically degraded with an increasing value of γ (as Equation (2) implies), but λ k assigned by the FT policy compensates for such system-safety degradation.
Figure 4 excludes the evaluation results for the FT policy variant in which the $\lambda_k$-assignment algorithm increases $\lambda_k$ in order from lower to higher task priority (e.g., denoted by FT-RM-Reverse), because the trends are similar to those shown in Figure 4c-f. Figure 5 presents the average system safety of task sets under RM with different $\lambda_k$ assignments for $\gamma = 0.01$. As shown in Figure 5, a higher number of re-executions (i.e., a greater value of $\lambda_k$) dramatically decreases the average system safety. This indicates that schedulability matters more than system reliability for achieving a high average system safety. That is, a greater value of $N_k$ improves the system reliability according to Equation (4), but schedulability is not guaranteed when $N_k$ is increased without an accompanying schedulability analysis. Thus, a higher number of re-executions compromises the average system safety through degraded schedulability, even though it may improve reliability. Note that the system safety is 0 for an unschedulable task set according to the definition of system safety presented in Section 2.
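The point that the number of executions should only be raised in conjunction with a schedulability analysis can be illustrated with a greedy sketch. This is not the paper's $\lambda_k$-assignment algorithm; it only shows the general pattern of accepting an increment when a caller-supplied test still passes. The names `assign_reexecutions`, `toy_capacity_check`, and `max_lambda` are hypothetical, and the toy check is merely a capacity condition standing in for a real deadline-based schedulability test.

```python
def assign_reexecutions(tasks, is_schedulable, max_lambda=3):
    """Greedily raise per-task execution counts while the test still passes.

    tasks: list of dicts with 'wcet' and 'period', highest priority first.
    is_schedulable: callable(tasks, lambdas) -> bool, supplied by the caller.
    Returns the list of execution counts, or None if the base set fails.
    """
    lambdas = [1] * len(tasks)
    if not is_schedulable(tasks, lambdas):
        return None  # the base task set is already unschedulable
    progress = True
    while progress:
        progress = False
        for k in range(len(tasks)):
            if lambdas[k] >= max_lambda:
                continue
            lambdas[k] += 1  # tentatively add one execution
            if is_schedulable(tasks, lambdas):
                progress = True
            else:
                lambdas[k] -= 1  # revert: the increment breaks schedulability
    return lambdas


def toy_capacity_check(tasks, lambdas, m=1):
    """Placeholder only: a capacity condition, NOT a valid schedulability test."""
    inflated = [l * t["wcet"] for t, l in zip(tasks, lambdas)]
    fits = all(c <= t["period"] for c, t in zip(inflated, tasks))
    total = sum(c / t["period"] for c, t in zip(inflated, tasks))
    return fits and total <= m


demo_tasks = [{"wcet": 1, "period": 4}, {"wcet": 2, "period": 8}]
lambdas = assign_reexecutions(demo_tasks, toy_capacity_check)
print(lambdas)  # [2, 2]
```

Because every increment is reverted when the test fails, a task set that passes the test before the assignment still passes it afterwards, which is the property the discussion above relies on.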

Related Work
While a number of FT techniques have previously been implemented in hardware, hybrid software-based techniques such as checkpointing with rollback and re-execution have been proposed recently [3][4][5]. Checkpointing with rollback manages checkpoints at which the state of the system is saved to stable storage; in case of a transient fault, the state is restored from the latest checkpoint. Re-execution executes tasks multiple times (e.g., $\lambda_k - 1$ times) and chooses a correct output (if any) obtained over the multiple executions. If none of the outputs of the $\lambda_k - 1$ executions is correct, the tasks are re-executed to improve reliability. Because some tasks are executed multiple times under this technique, they may miss their deadlines. Thus, existing studies [4,5] have focused on improving the reliability of mixed-criticality systems or energy-sensitive real-time systems while inevitably sacrificing the schedulability of the systems.
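The re-execution idea can be sketched in a few lines. This minimal example assumes that an error-detection mechanism exists and is exposed as a predicate (`is_correct`, a hypothetical name), and it simplifies the technique by stopping at the first correct output instead of collecting all outputs. The helper `make_flaky_job` only simulates transient faults for the demonstration.

```python
def run_with_reexecution(job, is_correct, max_executions):
    """Execute job up to max_executions times.

    Returns (output, attempts) for the first correct output, or
    (None, max_executions) if every execution produced an incorrect output.
    """
    for attempt in range(1, max_executions + 1):
        output = job()
        if is_correct(output):
            return output, attempt
    return None, max_executions


def make_flaky_job(failures_before_success):
    """Simulated job whose first few executions are hit by a transient fault."""
    state = {"calls": 0}

    def job():
        state["calls"] += 1
        return "ok" if state["calls"] > failures_before_success else "corrupted"

    return job


result = run_with_reexecution(make_flaky_job(2), lambda out: out == "ok", 3)
exhausted = run_with_reexecution(make_flaky_job(5), lambda out: out == "ok", 3)
print(result)     # ('ok', 3)
print(exhausted)  # (None, 3)
```

Each additional execution consumes processor time, which is why the number of executions has to be weighed against the deadline of the task.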
In multiprocessor domains, the availability of multiple processors can be exploited to tolerate faults. One popular approach is the primary-backup approach [22][23][24], in which the backup of a task does not need to be executed if its primary executes successfully. Backup overloading allows backup copies of primary tasks to be scheduled in a time-overlapping manner for efficiency [22]. Backup overloading was improved by Manimaran and Murthy [23]. Another efficient overloading algorithm on multiprocessors was proposed with dynamic logical grouping among copies of tasks [24].
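The basic primary-backup rule (the backup runs only if the primary fails) can be sketched as below. This is a sequential simplification under stated assumptions: it ignores processor placement and backup overloading, and the names `run_primary_backup` and `succeeded` are hypothetical.

```python
def run_primary_backup(primary, backup, succeeded):
    """Run the primary copy; run the backup only if the primary fails."""
    output = primary()
    if succeeded(output):
        return output, "primary"  # the backup is skipped entirely
    return backup(), "backup"


backup_calls = []


def backup():
    """Backup copy that records each invocation for the demonstration."""
    backup_calls.append(1)
    return "result"


ok_case = run_primary_backup(lambda: "result", backup, lambda out: out is not None)
fault_case = run_primary_backup(lambda: None, backup, lambda out: out is not None)
print(ok_case, fault_case, len(backup_calls))
```

In the fault-free case the time reserved for the backup is never used, which is the slack that overloading schemes try to share among several backups.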
There are other ways to support fault tolerance in multiprocessors [25,26]. Cirinei et al. proposed a dynamic reconfiguration of multiprocessor hardware platforms considering the tradeoff between performance and fault tolerance (through simultaneous replication) [25]. Liberato et al. proposed FT global multiprocessor scheduling by re-executing an instance of a faulty job [26].

Conclusions
We proposed an FT policy that can be incorporated into most (if not all) existing real-time scheduling algorithms on multiprocessor systems and that improves the reliability of a target system without sacrificing schedulability. Our study was inspired by the fact that existing re-execution techniques enforce multiple executions of some tasks to improve system reliability, which can cause otherwise schedulable tasks to become unschedulable. Our proposed FT policy employs the re-execution technique in conjunction with deadline-based schedulability analysis while ensuring that schedulable task sets never become unschedulable under the FT policy. As a case study, we applied the FT policy to existing FP scheduling and EDZL scheduling and evaluated its performance regarding schedulability and reliability. In the future, we plan to extend our work to mixed-criticality systems and to apply better schedulability analysis techniques, such as response-time analysis, to improve the analytical capability of the FT policy.