Availability Analysis of Software Systems with Rejuvenation and Checkpointing

In software reliability engineering, software rejuvenation and checkpointing are widely used to enhance system reliability and strengthen data protection. In this paper, a stochastic framework composed of a composite stochastic Petri reward net and its resulting non-Markovian availability model is presented to capture the dynamic behavior of an operational software system in which time-based software rejuvenation and checkpointing are both conducted aperiodically. In particular, apart from the software-aging problem that may cause the system to fail, human-error factors (i.e., a system operator’s misoperations) during checkpointing are also considered. The non-Markovian availability model is derived from the reachability graph of the stochastic Petri reward nets and is not one of the trivial stochastic models such as the semi-Markov process or the Markov regenerative process; to obtain its stationary solution, the phase-expansion approach is used. In numerical experiments, we illustrate steady-state system availability and find optimal software-rejuvenation policies that maximize it. The effects of human-error factors on both steady-state system availability and the optimal software-rejuvenation trigger timing are also evaluated. Numerical results show that human errors during checkpointing both decrease system availability and significantly affect the optimal rejuvenation-trigger timing, so they should not be overlooked during system modeling.


Introduction
In software reliability engineering, various software fault-tolerance techniques such as software rejuvenation and checkpointing are widely used for enhancing system reliability and strengthening data protection. Software rejuvenation is a countermeasure against software aging, which refers to the phenomenon that the performance or dependability of software systems degrades with time, caused by aging-related bugs [1,2], eventually resulting in system failures. In 1995, Huang et al. [3] first reported the aging phenomenon in real telecommunication billing applications where the application experienced a crash or a hang failure over time. The software-aging phenomenon exists in the real world and is inevitable, but can nevertheless be controlled or even reversed [1,2,4]. Software rejuvenation plays a central role in counteracting aging issues by refreshing the system's internal states. However, as pointed out by Alonso et al. [5], the software rejuvenation can address aging issues well, but typically involves an overhead since the system becomes unavailable during rejuvenation. That is to say, it is necessary and important to determine an optimal rejuvenation schedule for achieving the best trade-off between target performance or dependability and the associated overhead. To date, there are a number of works devoted to solving such optimization problems [6][7][8][9][10]. For example, Vaidyanathan and Trivedi [6] presented a semi-Markov reward model for a UNIX operating system, and used this model to derive optimal software-rejuvenation schedules in terms of system availability or downtime cost. Dohi et al. [9] considered two basic software-rejuvenation models described by Markov regenerative processes (MRGPs), and provided transient solutions using Laplace-Stieltjes transform (LST) and their numerical inversion. In [9], an optimal software-rejuvenation policy that maximized interval system reliability was numerically determined. 
Wang and Liu [10] recently offered a real-time decision method for optimal software-rejuvenation timing through simulating and modeling the state-transition process of software aging and constructing the rejuvenation decision function using an analytic hierarchy process.
In the context of data protection, a typical technique is checkpointing, which saves current data from main memory to secondary storage and is an efficient method for reducing re-execution time in the presence of faults [11]. Checkpointing is easy to conduct and has been widely studied for decades [12][13][14][15][16]. For example, Fukumoto et al. [12] and Dohi et al. [13] introduced different checkpointing schemes for database systems, and Ranganathan and Upadhyaya [14] considered the temporal behavior related to database system states from a macroscopic viewpoint. Some of the literature also considered software rejuvenation and checkpointing together [17][18][19][20]. Okamura and Dohi [17] focused on two kinds of maintenance policies for a software system, and adopted a dynamic programming approach to comprehensively evaluate aperiodic checkpointing and rejuvenation schemes in the system. In [19], the authors introduced a stochastic reward Petri net (SRN) [21] to model a software system whose state moves to the execution process immediately after a rollback recovery. In particular, a non-Markovian state-transition diagram was derived from the SRN analysis. More recently, a system similar to, but somewhat different from, that of [19] was considered in [20], in which the system executes checkpointing immediately after a rollback recovery in order to update the starting point of the recovery operation from the past to the current time. In these previous works, the systems underwent both aperiodic checkpointing and software rejuvenation, and their transition diagrams are not one of the trivial stochastic models such as the semi-Markov process (SMP) and the MRGP. This means that common approaches such as the LST and embedded Markov chain techniques cannot be directly applied.
To solve these complex non-Markovian transition diagrams, the phase (PH) expansion approach [22,23], an approximation technique based on phase-type (PH) distributions, was utilized and worked well in different contexts. Moreover, in [19,20], it was assumed that system failures are caused only by aging problems; in fact, human error is inescapable [24], and the system operator's misoperations during checkpointing cannot be ignored [25].
In this paper, we consider a software system different from those in [19,20], in which both aperiodic checkpointing and software rejuvenation are executed, and system failure occurs due to both software aging and human errors in checkpointing. A stochastic framework composed of a composite SRN and its resulting non-Markovian availability model is presented to capture the dynamics of the system from a macroscopic point of view. More specifically, the non-Markovian availability model is derived from the reachability graph of the composite SRN model. On the basis of the non-Markovian availability model, which, as in [19,20], is a nontrivial model including multiple competitive events, we formulate the steady-state availability of the system by means of PH expansion, and then determine the optimal software-rejuvenation schedule that maximizes steady-state system availability. The effects of human-error factors on both steady-state system availability and the optimal software-rejuvenation schedule are investigated. The main differences between this work and previous ones [19,20] are that we (i) consider both aging-related and human-error-related system failures, of which the latter was overlooked in previous works; and (ii) investigate the effect of human-error factors on system availability and software rejuvenation. In brief, the main contributions of this paper are twofold:
• stochastic modeling of software systems that undergo both software rejuvenation and checkpointing, and may fail due to both the aging problem and human errors in checkpointing;
• investigation of the effects of human-error factors on both steady-state system availability and optimal software-rejuvenation trigger timing, by comparing the cases with and without human-error-related system failures.
The remainder of this paper is organized as follows. Section 2 introduces a stochastic framework composed of a composite SRN and its corresponding non-Markovian state-transition diagram for an operational software system with software rejuvenation and checkpointing. In particular, a reachability graph is generated from the composite SRN, and on its basis, a non-Markovian state-transition diagram is obtained. Section 3 first defines the continuous PH distribution and then presents an approach to formulate the steady-state system availability of the non-Markovian model by using its underlying approximate continuous-time Markov chain (CTMC), which is derived by replacing all general distributions with their corresponding PH distributions. Section 4 describes numerical experiments that evaluate system availability, determine the optimal software-rejuvenation trigger timing, and quantify the effects of human-error factors. Lastly, Section 5 concludes this paper with some remarks.

Macroscopic System Model
In this section, we first introduce the system assumptions and then present a stochastic framework consisting of a composite SRN and its resulting non-Markovian transition diagram to model operational software systems from a macroscopic point of view. More specifically, the non-Markovian transition diagram was derived on the basis of a reachability graph, which was generated from analysis of the composite SRN.

System Assumptions
Consider an operational software system that aperiodically executes checkpointing to save current data from main memory to secondary storage. Without loss of generality, it is assumed that the system suffers from software aging, so that it may fail due to aging-related bugs, such as memory leaks and the accumulation of round-off errors. On the other hand, system failure might also be caused by incorrect operation by the operator during the execution of checkpointing. Once a system failure occurs, a series of recovery operations, including checkpointed data loading and rollback recovery, is conducted to recover the system. In addition, software rejuvenation is adopted to counteract the aging problem. A few other assumptions are made:
• the checkpointing operation just saves the current data and does not refresh system aging;
• the clock of the rejuvenation trigger is not reset and continues to accumulate even while the system executes checkpointing;
• when a rejuvenation point is reached while the system is under checkpointing, the rejuvenation waits until the checkpointing is completed;
• the system is regarded as being as good as new after either rollback recovery or rejuvenation.

Stochastic Reward Nets
On the basis of the above assumptions, the dynamics of the system are described by a composite SRN, as shown in Figures 1 and 2. Concretely, the composite SRN contains three submodels: a clock model for system aging (Figure 1a), a clock model for software rejuvenation (Figure 1b), and an SRN model for system behavior (Figure 2). In these SRNs, transitions are divided into three types: (i) immediate (IMM) transitions (represented by a thin black bar), which fire in zero time; (ii) exponential (EXP) transitions (represented by a white rectangle), whose firing times are exponentially distributed; and (iii) general (GEN) transitions (represented by a thick black bar), whose firing times are generally distributed. The places are defined as follows:
• P_fclock: software aging accumulates as time passes.
• P_fsignal: it is time for an aging-related system failure to occur.
• P_rclock: time is accumulated to trigger a rejuvenation.
• P_rsignal: a rejuvenation point was reached.
• P_normal: the system waits for checkpointing and rejuvenation in the normal execution process.
• P_checkpointing: the system is under checkpointing.
• P_rejuvenation: the system is under rejuvenation.
• P_failure: the system fails due to either aging-related bugs or human-error factors, and checkpointed data are loaded for rollback recovery.
• P_recovery: rollback recovery is executed to recover the failed system.
• P_completed: the system becomes as good as new after the completion of either rejuvenation or rollback recovery.
Transitions T_cint, T_trigger, and T_fail1 correspond to the trigger interval of checkpointing, the trigger interval of rejuvenation, and the system lifetime, respectively. Transitions T_checkpointing, T_rejuvenation, T_load, and T_recovery represent the operations of checkpointing, rejuvenation, loading of checkpointed data, and rollback recovery, respectively. Transitions T_fail2 and T_fail3 are both EXP transitions, representing failures caused by incorrect operations by the operators. Once IMM transition t_rej fires with its guard function G_rej satisfied, the system is immediately rejuvenated. If a token appears in place P_fsignal, either transition t_fail1 or transition t_fail2 fires because the lifetime is exhausted. Transitions t_freset and t_rreset represent the reset of the clocks, and t_normal means that the system becomes normal again at the same time as the clocks are reset. The details of the guard functions are shown in Table 1.
Table 1. Guard functions.

Guard
Guard Function

Reachability Graph
A Petri net's reachability graph is a directed graph composed of nodes and edges, which represent reachable markings and transitions between reachable markings, respectively. From the analysis of the composite SRN described in Section 2.2, a reachability graph starting with the initial marking {P_normal: 1, P_fclock: 1, P_rclock: 1} (places without tokens are omitted for brevity) is generated and depicted in Figure 3. The descriptions of the nodes in the graph are summarized in Table 2. For example, node GEN (T_cint → enable, T_fail1 → enable, T_trigger → enable) is the initial marking and represents the normal execution state of the system, in which transitions T_cint, T_fail1, and T_trigger are all enabled. Nodes GEN (T_checkpointing → enable, T_fail1 → enable, T_trigger → enable) and GEN (T_checkpointing → enable, T_fail1 → enable) both correspond to checkpointing execution states; the difference between them is whether a rejuvenation point has been reached. Node GEN (T_load → enable) means that the system has failed and the loading of checkpointed data is being executed. The graph has two edges from each of the nodes GEN (T_checkpointing → enable, T_fail1 → enable, T_trigger → enable) and GEN (T_checkpointing → enable, T_fail1 → enable) to node GEN (T_load → enable). This is because, during checkpointing, the system may fail due to either aging-related bugs or human-error factors; of the two edges, one represents the GEN transition T_fail1 and the other corresponds to the EXP transition T_fail3.
Table 2. Nodes in reachability graph.

Node
Description
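The construction of a reachability graph can be sketched as a breadth-first search over markings. The following is a minimal illustration on a hypothetical four-transition net, not the composite SRN of Figures 1 and 2: guard functions and firing-time distributions are ignored, and the function name `reachability_graph` as well as the toy places and transitions are assumptions made for this example.

```python
from collections import deque


def reachability_graph(initial, transitions):
    """Enumerate reachable markings by breadth-first search.

    initial: dict place -> token count.
    transitions: dict name -> (inputs, outputs), each a dict place -> token count.
    Returns (nodes, edges); a node is a sorted tuple of (place, tokens) pairs,
    with empty places omitted, and an edge is (source, transition, target).
    """
    start = tuple(sorted((p, n) for p, n in initial.items() if n > 0))
    nodes, edges, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        marking = dict(node)
        for name, (inputs, outputs) in transitions.items():
            # A transition is enabled when every input place holds enough tokens.
            if all(marking.get(p, 0) >= n for p, n in inputs.items()):
                nxt = dict(marking)
                for p, n in inputs.items():
                    nxt[p] -= n
                for p, n in outputs.items():
                    nxt[p] = nxt.get(p, 0) + n
                key = tuple(sorted((p, n) for p, n in nxt.items() if n > 0))
                edges.append((node, name, key))
                if key not in nodes:
                    nodes.add(key)
                    queue.append(key)
    return nodes, edges


# Hypothetical toy net: normal execution, checkpointing, and failure only.
toy_transitions = {
    "T_cint": ({"P_normal": 1}, {"P_checkpointing": 1}),
    "T_checkpointing": ({"P_checkpointing": 1}, {"P_normal": 1}),
    "T_fail": ({"P_normal": 1}, {"P_failure": 1}),
    "T_recovery": ({"P_failure": 1}, {"P_normal": 1}),
}
nodes, edges = reachability_graph({"P_normal": 1}, toy_transitions)
print(len(nodes), "markings,", len(edges), "edges")
```

In the composite SRN, the same search would additionally have to evaluate the guard functions of Table 1 before treating a transition as enabled.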

Non-Markovian State-Transition Diagram
From the reachability graph in Section 2.3, a non-Markovian state-transition diagram was derived, as shown in Figure 4. This model consists of seven states: Normal, Checkpointing, Checkpointing', Rejuvenation, Failure1, Recovery, and Failure2. State Normal is the initial state, in which the system is in the normal execution process in the main memory and waits for checkpointing and rejuvenation. Once a checkpoint is reached prior to the rejuvenation point, the system state becomes Checkpointing, in which data in the main memory are saved to secondary storage. Since the checkpointing operation does not reset the clock of the rejuvenation trigger, a rejuvenation point may be reached during checkpointing. In such a case, the system enters state Checkpointing', which represents checkpoint execution with enabled rejuvenation. After the completion of checkpointing, the system transitions from state Checkpointing' to state Rejuvenation. If a rejuvenation point is reached prior to the checkpoint, the system immediately executes rejuvenation and enters state Rejuvenation from state Normal. As mentioned in Section 2.1, system failure may occur due to aging-related bugs and human-error factors. Thus, two failure states, Failure1 and Failure2, are defined to distinguish the two kinds of system failures. When the system fails, a series of recovery operations, including checkpointed data loading and rollback recovery, is conducted to recover the system from failure. Lastly, the system returns to state Normal from state Recovery. Of course, the system may fail before both checkpointing and rejuvenation. The details of the state notation are given in Table 3. Table 4 summarizes the cumulative distribution functions (CDFs) of the corresponding transitions in the state-transition diagram. In this table, GEN represents a general distribution, and EXP means an exponential distribution. The reasons for making these distributional assumptions can be found in [20].
The checkpoint interval is assumed to follow a general distribution G_intv(t), and the CDF of the time needed for checkpointing is given by G_cp(t). The time for an aging-related failure to occur follows a general distribution G_fail(t) with increasing failure rate (IFR), while the time distributions for failures occurring during rollback recovery and during checkpointing due to incorrect operations by operators are given by F_fail1(t) and F_fail2(t) with constant failure rates (CFRs) λ_fail1 and λ_fail2, respectively. Similarly, the rejuvenation-trigger interval distribution is described by G_trig(t), and its relevant overhead distribution is represented by G_rej(t). The probability distributions of the loading time of checkpointed data and of the time needed for rollback recovery are given by G_load(t) and G_rc(t), respectively.

State | Description
Normal | Normal execution process in the main memory
Checkpointing | Checkpointing execution with a disabled rejuvenation
Checkpointing' | Checkpointing execution with an enabled rejuvenation
Failure1 | Aging-related system failure
Failure2 | Human-error-related system failure
Recovery | Rollback recovery to recover from system failure
Rejuvenation | Software-rejuvenation execution to refresh system's internal states

In Figure 4, states Normal and Checkpointing are enclosed by a dashed rectangle labeled with G_fail(t) and G_trig(t), indicating that the GEN transitions with distributions G_fail(t) and G_trig(t) are enabled and may fire in either the Normal or the Checkpointing state. In the same way, the dashed rectangle around Checkpointing and Checkpointing' indicates the possible firings of the GEN and EXP transitions with distributions G_fail(t), G_cp(t), and F_fail2(t). This implies that the non-Markovian state-transition diagram under consideration is neither an SMP nor an MRGP, which makes numerical analysis difficult. To cope with this issue, we consider the PH expansion approach [22], which has proved to be efficient for solving this kind of non-Markovian state-transition model [19,20,26].

System Availability Analysis
This section first introduces the well-known continuous PH distribution [22] and then derives the underlying approximate CTMC for the non-Markovian state-transition diagram in Figure 4 via the PH expansion approach, whose essential idea is to replace each general distribution with a PH distribution that approximates it with high accuracy. Lastly, the stationary solution for the model in Figure 4 is obtained through CTMC analysis. The measure of interest is steady-state system availability, which is defined as the probability that the system is operational in the steady state.

Continuous PH Distribution
Continuous PH distribution is defined as the probability distribution of the absorbing time in a finite CTMC with absorbing states, and it is widely applied in various fields, such as reliability assessment [26], queueing systems [27], and random telegraph noise analysis [28]. Without loss of generality, we define Q as the infinitesimal generator matrix of a CTMC that has m transient states and one absorbing state, and then partition Q into four parts as below:

$$Q = \begin{pmatrix} T & \xi \\ \mathbf{0} & 0 \end{pmatrix}.$$

In the above, T and ξ represent the transition rates among transient states and the exit rates from transient states to the absorbing state, respectively. Defining α as an initial probability vector over the transient states, we have the CDF and probability density function (PDF) of the continuous PH distribution:

$$F(t) = 1 - \alpha \exp(Tt)\,\mathbf{1}, \qquad f(t) = \alpha \exp(Tt)\,\xi,$$

where 1 is a column vector of ones. Exit vector ξ is given by ξ = −T1. Transient states are generally called phases. Continuous PH distribution can be categorized into several subclasses according to the structure of T [29]. When the phase transitions are acyclic, the corresponding PH distribution is called an acyclic PH distribution (APH). The APH is the widest class among mathematically tractable PH distributions, and it can be converted into the canonical form (CF), which is the minimal representation of APH with the smallest number of free parameters [30]. The APH and its CF are important from the viewpoint of practical applications because they cover some well-known probability distributions, such as the exponential distribution, the Erlang distribution, and their mixtures. In particular, canonical form 1 (CF1) is usually considered and defined by

$$\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m), \qquad T = \begin{pmatrix} -\beta_1 & \beta_1 & & \\ & -\beta_2 & \ddots & \\ & & \ddots & \beta_{m-1} \\ & & & -\beta_m \end{pmatrix}, \qquad \xi = (0, \ldots, 0, \beta_m)^{\top},$$

where α_i ≥ 0, ∑_i α_i = 1, and 0 < β_1 ≤ · · · ≤ β_m for m phases.
In this paper, continuous PH distribution is applied to approximate all general distributions in the non-Markovian state-transition diagram; that is, for each target distribution we determine a PH distribution with parameters (α, T, ξ) that fits it well by means of the maximum likelihood estimation (MLE) approach [22].
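To make the definition concrete, the sketch below builds a CF1 representation and evaluates the standard PH CDF, F(t) = 1 − α exp(Tt) 1, by uniformization. This is only an illustration under stated assumptions: the helper names `cf1` and `ph_cdf` are hypothetical, the example parameters are chosen for the demonstration, and the series is suitable only for moderate values of the largest rate times t. It is not the fitting procedure of [22].

```python
import math


def cf1(alpha, beta):
    """Build (alpha, T, xi) for canonical form 1 with rates beta[0] <= ... <= beta[m-1]."""
    m = len(beta)
    T = [[0.0] * m for _ in range(m)]
    for i in range(m):
        T[i][i] = -beta[i]
        if i + 1 < m:
            T[i][i + 1] = beta[i]          # phase i moves only to phase i + 1
    xi = [0.0] * (m - 1) + [beta[-1]]      # xi = -T 1: only the last phase exits
    return list(alpha), T, xi


def ph_cdf(alpha, T, t, tol=1e-12, max_terms=100000):
    """Evaluate F(t) = 1 - alpha exp(T t) 1 by uniformization."""
    m = len(alpha)
    q = max(-T[i][i] for i in range(m))    # uniformization rate
    # P = I + T / q is a substochastic matrix.
    P = [[(1.0 if i == j else 0.0) + T[i][j] / q for j in range(m)] for i in range(m)]
    v = list(alpha)                        # v holds alpha P^k
    weight = math.exp(-q * t)              # Poisson probability of k = 0 jumps
    survival, mass, k = 0.0, 0.0, 0
    while mass < 1.0 - tol and k < max_terms:
        survival += weight * sum(v)
        mass += weight
        k += 1
        weight *= q * t / k
        v = [sum(v[i] * P[i][j] for i in range(m)) for j in range(m)]
    return 1.0 - survival


# Example: a two-phase Erlang distribution with rate 1 written in CF1.
alpha, T, xi = cf1([1.0, 0.0], [1.0, 1.0])
print(ph_cdf(alpha, T, 1.0))               # compare with 1 - 2 * exp(-1)
```

Uniformization is used here instead of a general matrix exponential because it needs only nonnegative terms, which keeps the example dependency-free and numerically stable for small models.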

PH-Expanded CTMC
According to the definition of PH distribution in Section 3.1, we approximate the general distributions in Table 4 by PH distributions with appropriate numbers of phases as follows:

$$G_x(t) \approx 1 - \alpha_x \exp(T_x t)\,\mathbf{1}, \qquad x \in \{intv, fail, cp, load, rc, trig, rej\}.$$

Here, PH parameters (α_x, T_x, ξ_x), x ∈ {intv, fail, cp, load, rc, trig, rej}, are estimated on the basis of MLE using an expectation-maximization (EM) algorithm [22,31]. Using the estimated PH distributions to replace the general distributions, the non-Markovian transition diagram is expanded into an approximate CTMC, alternatively called the PH-expanded CTMC, whose infinitesimal generator matrix Q is given by Equation (13). The infinitesimal generator matrix is derived on the basis of the Kronecker representation [23], and the order of states is {Normal, Checkpointing, Checkpointing', Rejuvenation, Failure1, Recovery, Failure2}. In Equation (13), ⊗ and ⊕ are the Kronecker product and sum [32], I is an identity matrix, and 1/λ_fail1 and 1/λ_fail2 are the mean values of EXP distributions F_fail1(t) and F_fail2(t), that is, the mean times to failure during rollback recovery and checkpointing, respectively. Entry (ξ_intv α_cp ⊗ I ⊗ I) shows that the clock of the rejuvenation trigger is not reset and continues to accumulate, even while the system executes checkpointing. Since the checkpointing operation just saves the current data and does not refresh system aging, entry (ξ_cp α_intv) ⊗ I ⊗ I indicates that only the clock of the checkpointing trigger is reset. When a rejuvenation point is reached while the system is under checkpointing, rejuvenation waits until checkpointing is completed; in such a case, the system transits from Checkpointing to Checkpointing' with entry I ⊗ I ⊗ ξ_trig.
Entries (1_intv ⊗ ξ_fail ⊗ 1_trig)α_load, (1_cp ⊗ 1_trig ⊗ ξ_fail)α_load, and (ξ_fail ⊗ 1_cp)α_load indicate aging-related failures in both normal and checkpointing states, while entries (1_cp ⊗ 1_trig ⊗ 1_fail ⊗ λ_fail2)α_load and (1_fail ⊗ 1_cp ⊗ λ_fail2)α_load represent human-error-related failures during checkpointing. In addition, the system is regarded as being as good as new after either rollback recovery or rejuvenation, so the corresponding transitions are represented by entries ξ_rej(α_intv ⊗ α_fail ⊗ α_trig) and ξ_rc(α_intv ⊗ α_fail ⊗ α_trig), where (α_intv ⊗ α_fail ⊗ α_trig) implies that the clocks of the checkpointing trigger, system aging, and rejuvenation trigger are refreshed at the same time.
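The role of the Kronecker operators can be illustrated on two PH clocks that run concurrently and independently: their joint phase process has sub-generator T_1 ⊕ T_2 = T_1 ⊗ I + I ⊗ T_2. The sketch below uses plain Python lists; the helper names and the two small example matrices are assumptions for illustration only and are far smaller than the blocks of the actual PH-expanded CTMC.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def identity(n):
    """n-by-n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_add(A, B):
    """Entrywise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def kron_sum(A, B):
    """Kronecker sum A (+) B = A (x) I + I (x) B for square A and B."""
    return mat_add(kron(A, identity(len(B))), kron(identity(len(A)), B))


# Hypothetical clocks: a two-phase Erlang aging clock and an exponential trigger clock.
T_fail = [[-2.0, 2.0], [0.0, -2.0]]
T_trig = [[-0.5]]
joint = kron_sum(T_fail, T_trig)   # sub-generator while both clocks keep running
for row in joint:
    print(row)
```

Each diagonal entry of the joint matrix is the sum of the two clocks' diagonal rates, which reflects that either clock may advance first while the other keeps its current phase.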

Steady-State System Availability
Steady-state system availability gives the probability that the system is operational in the steady state, and thus provides significant insight into the long-term performance of a repairable system. Let A_ss denote the steady-state system availability. Then, we can obtain it by

$$A_{ss} = \pi_{ss}\, r,$$

where π_ss is the steady-state probability vector of the PH-expanded CTMC with generator Q, and can be computed by solving the following linear equations [33]:

$$\pi_{ss} Q = \mathbf{0}, \qquad \pi_{ss} \mathbf{1} = 1,$$

and r is the reward (column) vector of the PH-expanded CTMC, given by

$$r = (\mathbf{1}^{\top}, \mathbf{0}^{\top}, \mathbf{0}^{\top}, \mathbf{0}^{\top}, \mathbf{0}^{\top}, \mathbf{0}^{\top}, \mathbf{0}^{\top})^{\top},$$

whose blocks follow the state order of the PH-expanded CTMC, so that only the phases of state Normal receive reward 1. That is, the system is available only in the normal execution process state. In this paper, one problem of interest is to determine the optimal software-rejuvenation timing that maximizes steady-state system availability.
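The computation can be sketched on a small example, assuming the standard stationary equations π Q = 0 with π 1 = 1 and a 0/1 reward vector. The three-state generator and its rates below are hypothetical and serve only to show the mechanics; the PH-expanded CTMC of this paper is far larger and would call for a sparse solver.

```python
def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 by Gaussian elimination with partial pivoting."""
    n = len(Q)
    # Work on Q transposed, replacing the last balance equation by normalization.
    M = [[Q[j][i] for j in range(n)] for i in range(n)]
    M[n - 1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        b[c], b[p] = b[p], b[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n):
                M[r][k] -= f * M[c][k]
            b[r] -= f * b[c]
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        pi[r] = (b[r] - sum(M[r][k] * pi[k] for k in range(r + 1, n))) / M[r][r]
    return pi


def availability(Q, reward):
    """A_ss = pi r, with reward 1 for operational states and 0 otherwise."""
    return sum(p * r for p, r in zip(steady_state(Q), reward))


# Hypothetical states: Normal, Rejuvenation, Failure (rates per hour).
Q_toy = [
    [-0.11, 0.10, 0.01],   # Normal -> Rejuvenation at 0.10, Normal -> Failure at 0.01
    [2.00, -2.00, 0.00],   # Rejuvenation -> Normal at 2.00
    [0.50, 0.00, -0.50],   # Failure -> Normal at 0.50
]
a_toy = availability(Q_toy, [1.0, 0.0, 0.0])
print(a_toy)
```

Replacing one balance equation by the normalization condition removes the rank deficiency of Q, which is why a direct solver can be applied.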

Numerical Illustration
This section is devoted to the numerical illustration of the model presented in Figure 4 by means of phase expansion. Model parameters are summarized in Table 5, where all values are taken from the related literature [13,20,34]. All general distributions were accurately approximated by PH distributions with appropriate numbers of phases, that is, 100 phases for each of G_intv(t), G_cp(t), G_load(t), G_rc(t), G_trig(t), and G_rej(t), and 10 phases for G_fail(t) (see [20] for more details); as a result, we obtained a large approximate CTMC consisting of 201,400 PH-expanded states. Similar to [20], in order to evaluate the effects of the checkpoint interval and the rejuvenation-trigger interval on system availability, the mean checkpoint interval (MCI) was varied from 1 to 10 h, and the mean rejuvenation-trigger interval (MRTI) was varied from 5 to 35 h. In addition, the model was evaluated both with and without human-error-related system failures, in order to quantify the effects of human-error factors on both system availability and optimal software-rejuvenation timing.
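The size of the PH-expanded state space can be checked with simple arithmetic. The sketch below assumes that each macro state carries the product of the phase counts of the clocks active in it; the assignment of clocks to states is inferred from the Kronecker entries described in Section 3.2 and is an assumption of this illustration, as are the variable names.

```python
import math

# Phase counts reported for the PH approximations.
phases = {"intv": 100, "cp": 100, "load": 100, "rc": 100,
          "trig": 100, "rej": 100, "fail": 10}

# Clocks assumed to be active in each macro state (inferred, not stated explicitly).
active_clocks = {
    "Normal": ("intv", "fail", "trig"),
    "Checkpointing": ("cp", "fail", "trig"),
    "Checkpointing'": ("cp", "fail"),
    "Rejuvenation": ("rej",),
    "Failure1": ("load",),
    "Recovery": ("rc",),
    "Failure2": ("load",),
}

block_sizes = {state: math.prod(phases[c] for c in clocks)
               for state, clocks in active_clocks.items()}
total_states = sum(block_sizes.values())
print(block_sizes)
print(total_states)   # 201400
```

Under this assumption, the two blocks with three concurrent clocks account for 200,000 of the states, which shows how quickly concurrency inflates the expanded chain.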

Steady-State System Availability
Here, we show the steady-state availability of a system that may fail due to human error in checkpointing under different combinations of MRTI and MCI. The corresponding results are given in Table 6, which shows that steady-state system availability increases as the MCI increases for each MRTI. This means that too-frequent checkpointing decreases system availability because the system becomes unavailable during checkpointing. We next examine the effect of MRTI on system availability. For each MCI, steady-state system availability first increases and subsequently decreases with increasing MRTI, implying that an optimal MRTI that maximizes steady-state system availability may exist. Moreover, comparing the results in Table 6 with those in Table 7, which gives the steady-state system availability without human-error-related system failures, shows that human-error factors significantly decrease system availability, especially when the MCI is small. In other words, although frequent checkpointing can save data in a timely manner, it also brings a higher risk of system failure caused by incorrect operations. Therefore, it is crucial to determine a suitable checkpointing frequency to satisfy the target system availability. For example, given a target steady-state system availability of 0.9 and an MRTI of 10 h, an MCI equal to or larger than 5 h is a good choice.

Optimal Rejuvenation-Trigger Timing
This subsection discusses the optimal software-rejuvenation timing that maximizes steady-state system availability. Figure 5 illustrates the sensitivity of steady-state system availability with respect to the mean rejuvenation-trigger interval in the cases of MCI = 2, 4, 6, 8, and 10. The figure plots unimodal curves of steady-state system availability, which reveals the existence of an optimal rejuvenation-trigger timing that maximizes steady-state system availability in each case. Specifically, the overhead incurred by frequent rejuvenation (i.e., short MRTI) strongly affects system availability. Conversely, downtime due to system failures caused by less frequent rejuvenation decreases system availability more gradually. Optimal rejuvenation-trigger timings and their corresponding maximal steady-state system availabilities in all cases are presented in Table 8, both with and without human-error-related system failures. Optimal MRTIs for all cases of MCI were very similar, which means that the optimal rejuvenation-trigger timing is not very sensitive to the checkpoint interval. Optimal MRTIs in the case where human-error-related system failures were not considered were slightly smaller than those in the case with human-error-related failures when the MCI was small, and vice versa when the MCI was large, for example, MCI = 9, 10.
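Because the availability curves are unimodal in the MRTI, a one-dimensional bracketing search is one way to locate the maximizer. The sketch below uses golden-section search, which is a technique chosen for this illustration and is not stated to be the method used for Table 8. The function `toy_availability` and its constants are hypothetical stand-ins for the model-based availability, chosen only so that the curve has a single peak.

```python
import math


def golden_section_max(f, lo, hi, tol=1e-6):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:                 # the maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                       # the maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0


def toy_availability(mrti, c_rej=0.5, c_fail=0.005):
    """Hypothetical curve: rejuvenation overhead falls with MRTI, failure downtime grows."""
    return 1.0 / (1.0 + c_rej / mrti + c_fail * mrti)


best_mrti = golden_section_max(toy_availability, 5.0, 35.0)
print(best_mrti, toy_availability(best_mrti))
```

In practice each function evaluation would require solving the PH-expanded CTMC for the given MRTI, so a search that needs few evaluations is attractive.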

Conclusions
In this paper, we presented a composite stochastic Petri reward net and its resulting non-Markovian availability model for operational software systems in which both checkpointing and software rejuvenation are adopted to protect data and to enhance system availability, and in which the system may fail due to both the aging problem and human errors during checkpointing. More specifically, the non-Markovian availability model was derived on the basis of a reachability graph that was generated from the original SRNs. In particular, the PH expansion approach was applied to obtain the stationary solution of the non-Markovian availability model, since the model is not one of the trivial stochastic models such as the SMP and the MRGP, so common approaches such as the LST and embedded Markov chain techniques do not work. Numerical results showed that human-error factors both decreased steady-state system availability and had a significant effect on optimal rejuvenation-trigger timing, which means that human-error factors should not be overlooked during system modeling.
The model presented in this paper is based on a macroscopic view, providing a fundamental idea of how to model a software system that undergoes both checkpointing and software rejuvenation and that behaves with multiple competitive events. The system's actual behavior is very complex, and more possible events need to be considered, for example, software environment upgrades and time-scope limitations of the library versions in use. Although such an extension may vastly increase the difficulty of numerical analysis, it is important to take a microscopic look at system behavior, which will be one of our future directions. This paper considered only aperiodic checkpointing and software rejuvenation, but there exist various kinds of checkpointing [35] and rejuvenation techniques [8]. In the future, we aim to extend this work to more complicated software systems with different rejuvenation and checkpointing schemes.

Conflicts of Interest:
The authors declare no conflict of interest.