A Novel System Reliability Modeling of Hardware, Software, and Interactions of Hardware and Software

Abstract: In the past few decades, a great number of hardware and software reliability models have been proposed to address hardware failures in hardware subsystems and software failures in software subsystems, respectively. The interactions between hardware and software subsystems are often neglected in order to simplify reliability modeling, and hence most existing reliability models assume that hardware subsystems and software subsystems are independent of each other. However, this may not be true in reality. In this study, system failures are classified into three categories: hardware failures, software failures, and hardware-software interaction failures. The main contribution of our research is that we further classify hardware-software interaction failures into two groups: software-induced hardware failures and hardware-induced software failures. A Markov-based unified system reliability model incorporating all three categories of system failures is developed in this research, which provides a novel and practical perspective to define system failures and further improve reliability prediction accuracy. The numerical examples compare system reliability estimates from the reliability models with and without hardware-software interactions, and illustrate how system reliability prediction changes as the transition parameters change.


Introduction
To model system reliability, most studies have considered only part of the system, either the hardware subsystem or the software subsystem [1][2][3][4][5][6][7][8][9][10][11][12]. In the past few decades, a great number of hardware reliability models [1][2][3][4][5][6][7] and software reliability models [8][9][10][11][12] have been proposed to address, respectively, hardware failures in hardware subsystems and software failures in software subsystems, from various perspectives and for many critical applications. The interactions between hardware and software subsystems are often neglected for the sake of simplifying the mathematical formulation, and hence hardware and software subsystems are assumed to be independent in most studies. However, this assumption may not be true in reality.
Several studies have shown that interactions exist between the hardware and software subsystems in modern complex system applications [13][14][15][16][17][18][19][20][21]. The health status of hardware components, e.g., their degradation and failure, is one of the critical factors affecting the performance of software subsystems [16,19,20]. Accordingly, one significant cause of software failure is the malfunction or failure of the hardware platform on which the software runs [21]. From the software error studies on the existing Multiple Virtual Storage operating system, [...] one transition parameter at a time while keeping the others the same. Section 4 concludes this research and discusses future research directions.

System Failures Classification
A complex system [28] is a system consisting of many components that may interact with each other. Many critical modern applications, such as communication systems and computing systems, are composed of many hardware and software components. In general, failure of the whole system may be caused by the failure of one or more components. In this study, system failures are classified into three categories:
(1) Hardware failures [22,29,30] occur when a hardware component stops performing its designed function. Hardware failures cannot be recovered by, for example, restarting the system after a certain period. Hardware failures are further classified as either total or partial hardware failures. Total hardware failures, called hard failures in some studies [29,31], are catastrophic failures that cause complete cessation of the designed function. Partial hardware failures, or soft failures [29,31], refer to the partial loss of the designed function: the hardware component may continue performing its designed function, but the system will be in a degradation state. Overall, the system may continue working in a degradation state with partial hardware failures, but not with total hardware failures.
(2) Software failures [32] refer to the occurrence of an incorrect output triggered by a specific input because of latent faults left in the software program, e.g., design errors, that are unrelated to hardware components.
(3) Hardware-software interaction failures. The main contribution of this paper is that we classify hardware-software interaction failures into two categories: software-induced hardware failures and hardware-induced software failures. Software-induced hardware failures are defined as hardware failures induced by the execution of the embedded software system [15]. For example, the electronic stress induced by software execution may lead to physical damage of hardware components. Hardware-induced software failures are defined as software failures resulting from a change in hardware configuration, which causes the software to operate in an operational environment different from the testing environment [22].
Transient hardware failures are also defined as one type of system failure in the literature [15,22,29]. Generally, transient hardware failures are recovered by restarting the system, since the disruption of the function is usually caused by the operating environment, such as high temperature or strong electromagnetic fluctuation. In this study, we do not incorporate transient hardware failures in the reliability model.

Model Formulation
As discussed in the introduction, the interactions between hardware and software subsystems are often neglected. The novel Markov-based unified system reliability model developed in this research therefore takes into account hardware failures, software failures, and hardware-software interaction failures, including software-induced hardware failures and hardware-induced software failures. The proposed Markov-based unified system reliability model has the following assumptions:
(1) The system will fail if any failure happens, including hardware failures, software failures, and hardware-software interaction failures.
(2) The three categories of system failures are independent of each other.
(4) The software fault detection process follows a non-homogeneous Poisson process [32]. We consider the time to remove detected software faults to be negligible in this study.
Hence, the Markov-based unified system reliability model is proposed as follows:

$$R_{System}(t) = R_{Hardware}(t) \, R_{Software}(t) \, R_{H\text{-}S\ Interactions}(t) \qquad (1)$$

where $R_{Hardware}(t)$, $R_{Software}(t)$, and $R_{H\text{-}S\ Interactions}(t)$ represent the reliability functions of hardware subsystems, software subsystems, and hardware-software interactions, respectively. In this study, the main focus is on developing the reliability model of hardware-software interactions. We employ a Markov process to represent the state transitions of hardware-software interactions, as illustrated in Figure 1. Three main states (full working state, degradation states, and failure states) and eight sub-states {0, 1a, 1b, 1c, 2a, 2b, 2c, 2d} are defined for hardware-software interactions. State (0) represents the full working state, which means the system is under perfect working condition. Degradation state (1a) signifies that a partial hardware failure is detected but cannot be recovered by software. Degradation state (1b) signifies that a partial hardware failure is detected and can be recovered by software. Degradation state (1c) signifies that a partial hardware failure is not detected. Failure state (2a) signifies execution abortion.

The transition parameters described in Figure 1 are stated as follows. Hardware components can transit to a degradation state with degradation rate $\lambda_1$. The probability of detecting the partial hardware failures is $p_1$. Hence, the probability of not detecting the partially failed hardware is $q_1$, in which $q_1 = 1 - p_1$. The probability of fixing the partial hardware failures through software is $p_2$. Hence, the probability that the partially failed hardware cannot be recovered through software is $q_2$, in which $q_2 = 1 - p_2$.

The rate of fixing the partially failed hardware components, from state (1a) to the full working state (0), through replacement is $\mu_1$. The rate of fixing the partially failed hardware components, from state (1b) to the full working state (0), through replacement is $\mu_2$. No replacement will be performed if no failure is detected.

The partially failed hardware can further transit from state (1a) to execution abortion, state (2a), with rate $\lambda_2$; to hardware failures, state (2b), with rate $\lambda_3$; and to software-induced hardware failures, state (2c), with rate $\lambda_4$. The partially failed hardware can further transit from state (1b) to hardware [...]
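The state description above can be turned into a small numerical sketch. The following Python code is only an illustration under stated assumptions: the transition structure (state 0 feeding states 1a, 1b, and 1c with rates $\lambda_1 p_1 q_2$, $\lambda_1 p_1 p_2$, and $\lambda_1 q_1$) is inferred from the parameter descriptions rather than copied from Figure 1, the rates into execution abortion and hardware failure states are set to zero as a simplification, and the function name `interaction_reliability` and the parameter values in the call are illustrative. It integrates the working-state probabilities with a classical fourth-order Runge-Kutta scheme and returns their sum.

```python
# Sketch of the hardware-software interaction Markov model (working states only).
# ASSUMPTION: the transition structure is inferred from the text, not taken
# from Figure 1; rates into states (2a) and (2b) are neglected (set to zero).

def interaction_reliability(t_end, lam1, lam4, lam6, lam8, lam9,
                            mu1, mu2, p1, p2, steps=10000):
    """Return Q0 + Q1a + Q1b + Q1c at time t_end via RK4 integration."""
    q1, q2 = 1.0 - p1, 1.0 - p2

    def deriv(q):
        q0, qa, qb, qc = q
        return (
            -lam1 * q0 + mu1 * qa + mu2 * qb,         # state (0)
            lam1 * p1 * q2 * q0 - (mu1 + lam4) * qa,  # state (1a)
            lam1 * p1 * p2 * q0 - (mu2 + lam6) * qb,  # state (1b)
            lam1 * q1 * q0 - (lam8 + lam9) * qc,      # state (1c)
        )

    def shifted(q, k, scale):
        return tuple(qi + scale * ki for qi, ki in zip(q, k))

    q = (1.0, 0.0, 0.0, 0.0)  # Q0(0) = 1, all other states start at 0
    h = t_end / steps
    for _ in range(steps):
        k1 = deriv(q)
        k2 = deriv(shifted(q, k1, h / 2))
        k3 = deriv(shifted(q, k2, h / 2))
        k4 = deriv(shifted(q, k3, h))
        q = tuple(qi + h / 6 * (a + 2 * b + 2 * c + d)
                  for qi, a, b, c, d in zip(q, k1, k2, k3, k4))
    return sum(q)


if __name__ == "__main__":
    r = interaction_reliability(100.0, lam1=0.07, lam4=0.02, lam6=0.01,
                                lam8=0.03, lam9=0.04, mu1=0.05, mu2=0.06,
                                p1=0.3, p2=0.2)
    print(f"Interaction reliability at t = 100: {r:.4f}")
```

Numerical integration is used here instead of a closed-form solution so that the sketch does not depend on any particular form of the analytical solution.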
According to the model assumptions, the differential equations based on the Markov process [30,33,34], in which $Q_i(t)$ denotes the probability of the system being in state $i$ at time $t$, are given as Equations (2)-(9). At time $t = 0$, the probability of the system being in the full working state (0) is 1, and the probabilities of being in the degradation states (1a), (1b), and (1c) and in the failure states (2a), (2b), (2c), and (2d) are 0. Thus, the initial conditions of the differential Equations (2)-(9) are $Q_0(0) = 1$ and $Q_i(0) = 0$, $i$ = 1a, 1b, 1c, 2a, 2b, 2c, 2d, and the solutions of Equations (2)-(9) for hardware-software interaction failures are given as Equation (10).

The working states include the full working state (0) and the degradation states (1a), (1b), and (1c). Hence, the reliability of hardware-software interactions is obtained as follows:

$$R_{H\text{-}S\ Interactions}(t) = Q_0(t) + Q_{1a}(t) + Q_{1b}(t) + Q_{1c}(t) \qquad (11)$$

where $Q_0(t)$, $Q_{1a}(t)$, $Q_{1b}(t)$, and $Q_{1c}(t)$ are given in Equation (10). According to the model assumptions, hardware reliability is described by the Weibull model, Equation (12), where $\lambda$ and $\beta$ are the parameters of the Weibull distribution. The software fault detection and removal process is considered a non-homogeneous Poisson process. We also consider the time that the software tester spends on removing detected software faults to be negligible. In particular, the software fault detection rate and the total number of software faults in the software program are considered constants in this study; thus, the G-O model [32,35] is employed to estimate the expected number of detected software failures up to time $t$. The G-O model [32,35] is shown below:

$$m(t) = a \left(1 - e^{-bt}\right) \qquad (13)$$

where $m(t)$ denotes the expected number of software failures up to time $t$. The constants $a$ and $b$ denote the total number of software faults in the program and the software fault detection rate, respectively.
Applying the G-O model given in Equation (13), the software reliability can be estimated for a time period $t$ given software startup time $x$ as follows:

$$R_{Software}(t) = e^{-[m(t + x) - m(x)]} \qquad (14)$$

Substituting Equations (11), (12), and (14) into Equation (1) yields the proposed Markov-based unified system reliability model. Note that this research mainly contributes to reliability model development for hardware-software interactions. We do not develop new reliability models of hardware and software subsystems in this study. Hence, as an example, the Weibull distribution and the G-O model are employed for modeling hardware reliability and software reliability, respectively, to compare the entire system reliability with and without considering hardware-software interactions.
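As a brief illustration of the software part of the model, the following Python sketch evaluates the standard G-O mean value function $m(t) = a(1 - e^{-bt})$ and the resulting NHPP reliability over a period $t$ after startup time $x$. The function names `go_mean_failures` and `software_reliability` are hypothetical, and the values in the call are illustrative, with $a$ taken as the total number of faults and $b$ as the fault detection rate.

```python
import math


def go_mean_failures(t, a, b):
    """Expected number of software failures detected by time t (G-O model)."""
    return a * (1.0 - math.exp(-b * t))


def software_reliability(t, x, a, b):
    """Probability of no software failure in (x, x + t] under the NHPP."""
    expected = go_mean_failures(t + x, a, b) - go_mean_failures(x, a, b)
    return math.exp(-expected)


if __name__ == "__main__":
    # Illustrative values: a = total faults, b = detection rate, x = startup time.
    print(software_reliability(t=100.0, x=10.0, a=30.0, b=0.001))
```

Because the expected number of failures in the interval grows with $t$, the computed reliability decreases monotonically from 1 at $t = 0$.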

Numerical Examples
The state transition diagram of hardware-software interaction failures is shown in Figure 1. We expect that the transitions from the degradation states to the execution abortion and hardware failure states can be neglected, since $\min\{\lambda_i\} \gg \lambda_j$, $i = 1, 4, 6, 8, 9$, $j = 2, 3, 5, 7$. Hence, we simplify the problem by taking $\lambda_j = 0$. The transition parameters shown in Figure 1 are unknown. In practice, after collecting the testing/operation failure data, parameter estimation methods such as maximum likelihood estimation and the least squares method can be employed to estimate these unknown parameters. However, we do not have failure data from practice, since the proposed transition diagram includes many degradation and failure states, which makes the transition rates very complicated to measure in industry. It is nevertheless vital to demonstrate the importance of considering such interactions between hardware and software subsystems (software-induced hardware failures and hardware-induced software failures) and the impact on system reliability prediction as the transition parameters change. Following the numerical examples in references [1,22,27], we initially set the parameters of the proposed Markov-based unified system reliability model to $\lambda = 0.006$, $\beta = 1.09$, $a = 30$, $b = 0.001$, $x = 10$, $\lambda_1 = 0.07$, $\lambda_4 = 0.02$, $\lambda_6 = 0.01$, $\lambda_8 = 0.03$, $\lambda_9 = 0.04$, $\mu_1 = 0.05$, $\mu_2 = 0.06$, $p_1 = 0.3$, and $p_2 = 0.2$, and then obtain the numerical values of $A_1$, $A_2$, $A_3$, $c_1$, $c_2$, and $c_3$ that appear in $Q_i(t)$, as described in Equation (10).
Given the initial parameter set, we are interested in the impact on system reliability prediction as the transition parameters change. Time $t$ is set to 100 in the numerical examples. The initial parameter set is $\Theta_0$, as seen in Table 1. The model proposed by Teng et al. [22] was the basis of the proposed Markov-based unified system reliability model in this study. Therefore, we first present the comparison of system reliability prediction results obtained from the proposed model and from Teng et al. [22], as seen in Figure 2. The proposed model tends to predict higher reliability in the early operation stage and slightly lower reliability in the late operation stage, compared with the model proposed by Teng et al. [22]. Figure 3 illustrates the system reliability prediction with and without considering hardware-software interactions. The system reliability model without hardware-software interactions tends to predict higher reliability than the proposed model with hardware-software interactions. Table 1 lists the different parameter sets $\Theta_{s_i - k}$, where $s_i \in \{\lambda_1, \lambda_4, \lambda_6, \lambda_8, \lambda_9, \mu_1, \mu_2, p_1, p_2\}$ and $k \in \{1, 2\}$. Compared with the initial parameter set $\Theta_0$, each parameter set $\Theta_{s_i - k}$ represents an increase or decrease of transition parameter $s_i$ while the others are kept the same. For example, parameter sets $\Theta_{\lambda_1 - 1}$ and $\Theta_{\lambda_1 - 2}$ represent, respectively, the increase and decrease of the degradation rate $\lambda_1$ while the other transition parameters stay unchanged.

We are interested in the system reliability comparison between the parameter set $\Theta_{s_i - k}$ and the initial parameter set $\Theta_0$. For instance, if the hardware degradation rate $\lambda_1$ increases, the system reliability may increase or decrease, as illustrated in Figure 4. If the failure rate $\lambda_4$ increases, which means the transition rate from the partially failed hardware (detected but not recovered by software) to software-induced hardware failures increases, the system reliability may increase or decrease, as illustrated in Figure 5. If the failure rate $\lambda_6$ increases, which means the transition rate from the partially failed hardware (detected and recovered by software) to software-induced hardware failures increases, the system reliability may increase or decrease, as illustrated in Figure 6. Figure 7 shows that the system reliability may decrease as the failure rate $\lambda_8$, the transition parameter from the partially failed hardware (not detected) to software-induced hardware failures, increases. Figure 8 shows that the system reliability may decrease as the failure rate $\lambda_9$, the transition parameter from the partially failed hardware (not detected) to hardware-induced software failures, increases.

Figure 9 illustrates that the system reliability may increase or decrease as the repair rate $\mu_1$, the transition parameter from the partially failed hardware (detected but not recovered by software) to the full working state, increases. Figure 10 illustrates that the system reliability may increase or decrease as the repair rate $\mu_2$, the transition parameter from the partially failed hardware (detected and recovered by software) to the full working state, increases. As the probability $p_1$ of detecting hardware failures increases, the system reliability may increase or decrease, as seen in Figure 11. Figure 12 shows that the system reliability may increase or decrease as the probability $p_2$ of fixing the partially failed hardware through software increases.
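The one-parameter-at-a-time comparison described above could be scripted along the following lines. This is a hedged sketch, not the procedure behind Table 1 or Figures 4-12: it covers only the hardware-software interaction reliability, the transition structure is an assumption inferred from the parameter descriptions with $\lambda_2 = \lambda_3 = \lambda_5 = \lambda_7 = 0$, the names `BASE`, `interaction_reliability`, and `one_at_a_time` are hypothetical, and the 1.5 scaling factor is an arbitrary illustrative perturbation.

```python
# One-at-a-time sensitivity sketch for the interaction reliability.
# ASSUMPTION: the Markov structure below is inferred from the text, and the
# perturbation factor is illustrative rather than taken from Table 1.

BASE = {"lam1": 0.07, "lam4": 0.02, "lam6": 0.01, "lam8": 0.03, "lam9": 0.04,
        "mu1": 0.05, "mu2": 0.06, "p1": 0.3, "p2": 0.2}


def interaction_reliability(t_end, prm, steps=4000):
    """Sum of working-state probabilities at t_end (RK4 integration)."""
    q1, q2 = 1.0 - prm["p1"], 1.0 - prm["p2"]

    def deriv(q):
        q0, qa, qb, qc = q
        lam1 = prm["lam1"]
        return (
            -lam1 * q0 + prm["mu1"] * qa + prm["mu2"] * qb,
            lam1 * prm["p1"] * q2 * q0 - (prm["mu1"] + prm["lam4"]) * qa,
            lam1 * prm["p1"] * prm["p2"] * q0 - (prm["mu2"] + prm["lam6"]) * qb,
            lam1 * q1 * q0 - (prm["lam8"] + prm["lam9"]) * qc,
        )

    def shifted(q, k, scale):
        return tuple(qi + scale * ki for qi, ki in zip(q, k))

    q = (1.0, 0.0, 0.0, 0.0)
    h = t_end / steps
    for _ in range(steps):
        k1 = deriv(q)
        k2 = deriv(shifted(q, k1, h / 2))
        k3 = deriv(shifted(q, k2, h / 2))
        k4 = deriv(shifted(q, k3, h))
        q = tuple(qi + h / 6 * (a + 2 * b + 2 * c + d)
                  for qi, a, b, c, d in zip(q, k1, k2, k3, k4))
    return sum(q)


def one_at_a_time(t_end, base, factor):
    """Scale each parameter by `factor` in turn; return the change in reliability."""
    r0 = interaction_reliability(t_end, base)
    changes = {}
    for name in base:
        trial = dict(base)          # copy so the other parameters stay fixed
        trial[name] = base[name] * factor
        changes[name] = interaction_reliability(t_end, trial) - r0
    return changes


if __name__ == "__main__":
    for name, delta in one_at_a_time(100.0, BASE, 1.5).items():
        print(f"{name}: change in reliability = {delta:+.5f}")
```

Each entry of the returned dictionary isolates the effect of a single parameter, which mirrors the construction of the parameter sets in which one transition parameter changes while the others are kept the same.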

Conclusions
The interactions between hardware subsystems and software subsystems are neglected in most existing system reliability models. A few system reliability models do include hardware-software interaction failures, but they interpret such interactions, for instance, as hardware-related software failures, and the impact of software execution on the hardware platform is not well addressed. Thus, we incorporated three types of system failures in this research: hardware failures, software failures, and hardware-software interaction failures. The main contribution of our research was that we further classified hardware-software interaction failures into two groups: software-induced hardware failures and hardware-induced software failures. A Markov-based unified system reliability model was proposed that incorporates the three main failure categories: hardware failures, software failures, and hardware-software interaction failures (software-induced hardware failures and hardware-induced software failures). It provides a novel and practical perspective to define system failures and further improve reliability prediction accuracy. The dependence among system failures can be further investigated.
