Software Reliability Model with Dependent Failures and SPRT

Abstract: Software reliability and quality are crucial in several fields. Related studies have focused on software reliability growth models (SRGMs). Herein, we propose a new SRGM that assumes interdependent software failures. We conduct experiments on real-world datasets to compare the goodness-of-fit of the proposed model with the results of previous nonhomogeneous Poisson process SRGMs using several evaluation criteria. In addition, we determine software reliability using Wald's sequential probability ratio test (SPRT), which is more efficient than the classical hypothesis test (the latter requires substantially more data and time because the test is performed only after data collection is completed). The experimental results demonstrate the superiority of the proposed model and the effectiveness of the SPRT.


Introduction
Software performance and reliability are essential in various fields, such as the Internet of Things. If software reliability is not guaranteed, economic and human losses may occur. Hence, various studies have been conducted to improve software reliability and prevent loss.
Software reliability growth models (SRGMs) are tools that estimate the quality and reliability of software products. SRGMs not only provide information regarding software reliability for developers and consumers, but also help establish an optimal release policy.
Most SRGMs assume that the mean value function, m(t), is a nonhomogeneous Poisson process (NHPP). The m(t) of each model is a single function that reflects the failure intensity, failure detection, number of remaining failures, and the environment. In particular, assumptions regarding the test and development environments are essential; furthermore, the parameters and the form of m(t) depend on the assumptions regarding the environment.
Recent studies have combined deep learning and machine learning with software reliability [25][26][27][28]. Wang et al. discussed an optimal release policy and the selection of the best software reliability model [28]. Lee et al. proposed an SRGM that considers the actual test time instead of the designed test time [29]. Minamino et al. [30] discussed software reliability and release policies that considered the change-points of data based on the theory of multiproperty utilities. Rani et al. [31] developed a hazard rate model that introduced an imperfect debugging parameter and a single change-point parameter.
Some studies suggest statistical techniques to predict software reliability. Typically, parameters of reliability models are estimated using the least-square estimator (LSE) method. Additionally, the maximum likelihood estimator (MLE) and Bayesian methods have been used [32,33]. However, software reliability models have complex structures and numerous parameters, rendering it difficult to apply the MLE and Bayesian methods; hence, we used the LSE to estimate the parameters in this study.
We applied the sequential probability ratio test (SPRT), a statistical test technique, to efficiently determine software reliability. The SPRT was designed for military and naval equipment development by Wald [34]. The main advantage of a sequential test is that it requires a shorter test time compared with that required by the classical test, which requires a fixed sample size. In addition, the SPRT can instantly determine software reliability whenever new faults occur. Hence, reliability can be assessed based on the change-points of data. This procedure is described in Section 2.
Software reliability studies using Wald's SPRT procedure have been conducted continually. Stieber applied the SPRT to an NHPP SRGM for the first time [35]. This method has been applied to various SRGMs in many studies [36][37][38][39][40]. Furthermore, the authors of [38] performed the SPRT using order statistics.
Herein, we propose a new SRGM that assumes interdependent software failures. In Section 2, we discuss the efficiency of the SPRT. The proposed model is described in Section 3. In Section 4, we introduce the criteria and data used in the experiments. Subsequently, we discuss the results in Section 5. Finally, Section 6 concludes the paper.

Wald's SPRT
Wald described the efficiency of the SPRT [34]. The probability ratio p_1/p_0 is used as the test statistic of the SPRT for testing the null hypothesis H_0 against an alternative hypothesis H_1. The test is continued (until the next decision) while B < p_1/p_0 < A, where A and B are constants used to decide the rejection and acceptance of the null hypothesis H_0, respectively. If p_1/p_0 >= A, then H_0 is rejected. If p_1/p_0 <= B, then H_0 is accepted. Moreover, A and B depend on α and β, as shown in Equations (2)-(5). Here, α and β are the type-1 and type-2 errors, respectively. In other words, α is the producer's risk, whereas β is the consumer's risk.
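As a sketch of this decision rule (not taken from the paper: it uses Wald's well-known approximations A ≈ (1 − β)/α and B ≈ β/(1 − α), which the paper's Equations (2)-(5) present; the function names are illustrative):

```python
def wald_bounds(alpha, beta):
    """Wald's approximate decision constants:
    A ~ (1 - beta)/alpha (rejection), B ~ beta/(1 - alpha) (acceptance)."""
    return (1 - beta) / alpha, beta / (1 - alpha)

def sprt_decision(ratio, A, B):
    """Classify the current probability ratio p1/p0."""
    if ratio >= A:
        return "reject H0"
    if ratio <= B:
        return "accept H0"
    return "continue"

A, B = wald_bounds(alpha=0.05, beta=0.05)   # A = 19.0, B ~ 0.0526
```

With α = β = 0.05, any ratio between roughly 0.0526 and 19 leaves the test in the "continue" state, which is what allows the SPRT to stop early on decisive evidence.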
The SPRT can be expressed visually. Figure 1 shows the method to assess software reliability using the SPRT. In Figure 1, N_L(t) and N_U(t) denote the lower and upper limits of the region in which the software is judged reliable, respectively. Here, N(t) is the cumulative number of failures observed by time t for the NHPP. If N(t) lies in the reliable region, then the software is reliable.
Stieber defined the probability ratio used as the SPRT statistic for NHPP SRGMs as in Equation (1), where m(t) indicates the expected number of failures detected until time t. Using Equations (3), (5), and (11), we can rewrite Equation (1) as Equation (12), in which the left-hand term is the constant B and the right-hand term is the constant A; the middle term, expressed through N(t), takes the place of the probability ratio p_1/p_0.

New SRGM
A general SRGM that follows the NHPP, as shown in Equation (9), can be obtained by solving the differential equation in Equation (13), where b(t) is the fault detection rate function; the form of b(t) varies with the underlying assumptions. Such models assume independent failures. The proposed model is different: for example, if a fault occurs because of a syntax error and a developer cannot fix it completely, new faults will be affected. Consequently, failures may depend on one another (Figure 2). Hence, we assume that the failures are dependent. The SRGM based on this assumption can be obtained from the differential equation expressed in Equation (14).
where a(t) and b(t) are the total failure content rate and failure detection rate functions, respectively. Given the forms assumed for a(t) and b(t) in the proposed model, we can calculate its m(t), with initial value m(0) = h. Table 1 summarizes the m(t) of existing NHPP SRGMs. Table 1. Mean value functions of software reliability growth models (SRGMs).

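Because the closed forms of Equations (13) and (14) are not reproduced in this text, a small numerical sketch can still show how an m(t) is obtained from such a differential equation. The sketch below (not from the paper) integrates the classical independent-failure equation dm/dt = b(a − m), i.e., the Goel-Okumoto case, chosen only because its closed form m(t) = a(1 − e^(−bt)) allows a check; the parameter values are illustrative:

```python
import math

def solve_mvf(f, m0, t_end, dt=1e-3):
    """Integrate dm/dt = f(t, m) with classical RK4; returns m(t_end)."""
    t, m = 0.0, m0
    for _ in range(int(t_end / dt)):
        k1 = f(t, m)
        k2 = f(t + dt / 2, m + dt * k1 / 2)
        k3 = f(t + dt / 2, m + dt * k2 / 2)
        k4 = f(t + dt, m + dt * k3)
        m += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += dt
    return m

# Independent-failure (Goel-Okumoto) check: dm/dt = b*(a - m)
# has the closed form m(t) = a*(1 - exp(-b*t)).
a, b = 100.0, 0.3   # illustrative values, not the paper's estimates
m_num = solve_mvf(lambda t, m: b * (a - m), m0=0.0, t_end=5.0)
m_exact = a * (1 - math.exp(-b * 5.0))
```

The same integrator applies unchanged to a dependent-failure right-hand side such as Equation (14); only the lambda passed as `f` changes.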

Criteria
In this section, we describe the eight criteria used to compare the NHPP SRGMs on two real-world datasets.
(1) The mean squared error (MSE) [42] measures the distance between the estimated and actual data, considering the number of observations and the number of parameters in the model. The MSE is defined as follows: MSE = ∑(m̂(t_i) − y_i)² / (n − N), summing over i = 1, …, n, where n is the number of total cumulative failures, N is the number of model parameters, and y_i is the number of cumulative failures at t_i, obtained from the dataset.
(2) The predictive ratio risk (PRR) [42] indicates the distance of the model estimates from the actual data with respect to the model estimates. It is calculated as: PRR = ∑[(m̂(t_i) − y_i)/m̂(t_i)]². (3) The predictive power (PP) [43] measures the distance of the actual data from the estimates with regard to the actual data. It is defined as: PP = ∑[(m̂(t_i) − y_i)/y_i]². (4) R-square (R²) [44] is the correlation index of the regression curve equation. It is used to explain the fitting power of the SRGMs and is described as follows: R² = 1 − ∑(m̂(t_i) − y_i)² / ∑(y_i − ȳ)².
(5) Akaike's information criterion (AIC) [45] is based on the maximized likelihood function. It can be considered an approximate distance from the true probability model: AIC = −2 ln L + 2N, where N is the degree of freedom and L is the likelihood function. (6) The sum of absolute errors (SAE) [11] measures the distance between the predicted number of failures and the observed data: SAE = ∑|m̂(t_i) − y_i|. (7) The variation [46] is the standard deviation of the prediction bias: Variation = sqrt(∑(bias_i − Bias)² / (n − 1)), where bias_i = m̂(t_i) − y_i and Bias = ∑ bias_i / n. (8) The root-mean-square prediction error (RMSPE) [46] estimates the closeness of the predicted values to the actual observations.

RMSPE = sqrt(Bias² + Variation²)
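The eight criteria above can be sketched compactly as follows (the function name and example values are illustrative, not the paper's datasets):

```python
import math

def fit_criteria(y, m_hat, n_params):
    """Goodness-of-fit criteria (1)-(8) for an SRGM, following the
    definitions above; y and m_hat are the observed and estimated
    cumulative failure counts at the same time points."""
    n = len(y)
    resid = [mh - yi for mh, yi in zip(m_hat, y)]
    y_bar = sum(y) / n
    bias = sum(resid) / n
    variation = math.sqrt(sum((r - bias) ** 2 for r in resid) / (n - 1))
    return {
        "MSE": sum(r * r for r in resid) / (n - n_params),
        "PRR": sum((r / mh) ** 2 for r, mh in zip(resid, m_hat)),
        "PP": sum((r / yi) ** 2 for r, yi in zip(resid, y)),
        "R2": 1 - sum(r * r for r in resid)
                 / sum((yi - y_bar) ** 2 for yi in y),
        "SAE": sum(abs(r) for r in resid),
        "Bias": bias,
        "Variation": variation,
        "RMSPE": math.sqrt(bias ** 2 + variation ** 2),
    }

# Illustrative numbers only:
crit = fit_criteria(y=[1, 2, 3, 4], m_hat=[1.1, 2.1, 2.9, 4.1], n_params=2)
```

Note the MSE denominator n − N penalizes models with more parameters, so two models with identical residuals do not tie on MSE.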
The closer the value of R² is to 1, the better the goodness-of-fit to the dataset. For the other criteria, a smaller value indicates a better fit. Table 2 shows the two datasets used in the experiments. Dataset 1 was collected weekly from a telecommunication system that manages the radio access of wireless systems [42,47]. It is a test dataset corresponding to two releases of the software. For dataset 1, the numbers of cumulative failures are {1, 1, …, 26} at t = 1, 2, …, 21, and the total number of cumulative failures is 26. Dataset 2 includes one of the three releases of weekly medical record system test data that correspond to 188 software tools [42,48]. For dataset 2, the numbers of cumulative failures are {90, 107, …, 204} at t = 1, 2, …, 17.

Results of Parameter Estimation and Goodness-of-Fit
Before comparing the fits and applying the SPRT, we estimated the parameters of all models listed in Table 1 using the LSE method. Table 3 summarizes the estimated parameters for both datasets. Table 3. Estimation of parameters for datasets. As shown in Table 4, the proposed model achieved the best results for all criteria except the PP and AIC; even so, its PP and AIC were the third and fifth smallest values, respectively. In general, the proposed model outperformed the other models.
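As a crude stand-in for the LSE step (the paper's models and optimizer are not reproduced here), the sketch below fits the simple Goel-Okumoto mean value function m(t) = a(1 − e^(−bt)) by minimizing the sum of squared errors over a small grid; all names and values are illustrative:

```python
import math

def lse_fit_go(t, y, a_grid, b_grid):
    """Least-squares fit of m(t) = a*(1 - exp(-b*t)) by exhaustive
    grid search over candidate (a, b) pairs."""
    best = None
    for a in a_grid:
        for b in b_grid:
            sse = sum((a * (1 - math.exp(-b * ti)) - yi) ** 2
                      for ti, yi in zip(t, y))
            if best is None or sse < best[0]:
                best = (sse, a, b)
    return best[1], best[2]

# Sanity check on synthetic data generated from a = 50, b = 0.2:
t = list(range(1, 11))
y = [50 * (1 - math.exp(-0.2 * ti)) for ti in t]
a_hat, b_hat = lse_fit_go(t, y, a_grid=[40, 45, 50, 55],
                          b_grid=[0.1, 0.15, 0.2, 0.25])
```

In practice a continuous optimizer (e.g., nonlinear least squares) replaces the grid, but the objective being minimized is the same.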

As shown in Table 5, the proposed model yielded the best results for most criteria (all except the AIC, variation, and RMSPE); its variation and RMSPE were the second smallest values. In general, the proposed model fit dataset 2 better than the other models did. However, because m(t) converged for t = {14, 15, 16, 17}, the AIC was calculated as not-a-number (NaN). If the AIC of the proposed model is instead calculated over the period before convergence, t = {1, 2, …, 13}, its value is 83.51093, which would be the smallest.

Confidence Interval
The two-sided confidence limits [42] of NHPP SRGMs are defined as: m̂(t) ± z_{α/2} · sqrt(m̂(t)), where z_{α/2} is the 100(1 − α/2) percentile of the standard normal distribution. Table 8 and Figures 7 and 8 show the lower-limit confidence interval (LC) and the upper-limit confidence interval (UC) of the proposed model for datasets 1 and 2 at t_i.
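These limits follow from the NHPP property that the variance of the failure count equals its mean, so a one-line sketch suffices (the function name and the example value m̂(t) = 25 are illustrative):

```python
import math

def nhpp_confidence_limits(m_hat, z=1.959964):
    """Two-sided limits m_hat +/- z*sqrt(m_hat) for an NHPP mean value
    function; z = z_{alpha/2} (about 1.96 gives a 95% interval)."""
    half = z * math.sqrt(m_hat)
    return m_hat - half, m_hat + half

lc, uc = nhpp_confidence_limits(25.0)   # hypothetical m_hat(t) = 25
```

Evaluating the pair at each t_i traces out the LC and UC curves shown in the figures.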

Results of the SPRT for Datasets
We determined the software reliability by applying the SPRT based on the parameters of the proposed model. In particular, a sensitivity analysis showed that a and b were more sensitive than the other parameters of the proposed model; therefore, we focused on a and b. The related assumptions are listed in Table 9. Subsequently, we set a_0 = a − δ and a_1 = a + δ for Case 1, and b_0 = b − δ and b_1 = b + δ for Case 2. Substituting a_0 and a_1 for a in the m(t) of the proposed model, we obtained m_0(t) and m_1(t), respectively. Likewise, m_0(t) and m_1(t) were obtained by substituting b_0 and b_1 into m(t), respectively. Substituting the resulting m_0(t) and m_1(t) into Equation (12) yields the SPRT procedure. Table 10 shows the results of the SPRT for parameter a for both datasets. Judging the reliability using the proposed model, we conclude that software testing should continue because the software reliability cannot yet be determined. Table 11 shows the SPRT results for parameter b for both datasets. When t = 7 in dataset 1, the test was accepted. Using the proposed model for dataset 2, we conclude that software testing should continue because the software reliability cannot yet be determined. However, as m(t) converged for t = {14, 15, 16, 17}, the result was "Inf".
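This per-time-point procedure can be sketched as follows. Since Equation (12) is not reproduced in this text, the bounds below are derived directly from the log of the Poisson likelihood ratio in the usual Stieber construction (testing continues while the observed count N(t) lies between the two limits); the values of m_0(t) and m_1(t) are hypothetical:

```python
import math

def sprt_region(m0_t, m1_t, alpha=0.05, beta=0.05):
    """Limits on the observed cumulative failure count N(t), obtained by
    taking logs of the Poisson likelihood ratio; assumes m1_t > m0_t."""
    log_a = math.log((1 - beta) / alpha)
    log_b = math.log(beta / (1 - alpha))
    denom = math.log(m1_t / m0_t)
    lower = (log_b + m1_t - m0_t) / denom   # accept H0 at or below this
    upper = (log_a + m1_t - m0_t) / denom   # reject H0 at or above this
    return lower, upper

def sprt_step(n_failures, m0_t, m1_t, alpha=0.05, beta=0.05):
    """One SPRT decision at a single test time."""
    lower, upper = sprt_region(m0_t, m1_t, alpha, beta)
    if n_failures <= lower:
        return "accept"      # software judged reliable
    if n_failures >= upper:
        return "reject"
    return "continue"        # keep testing

# Hypothetical m0(t) = 8 and m1(t) = 12 at some test time t:
decision = sprt_step(n_failures=10, m0_t=8.0, m1_t=12.0)
```

Running `sprt_step` at every failure time reproduces the row-by-row decisions of the kind tabulated in Tables 10 and 11, with "continue" corresponding to an undetermined result.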

Conclusions and Remarks
Herein, we proposed a new SRGM. General SRGMs assume independent failures; in the proposed model, software failures depend on one another. We presented experiments on real-world datasets using eight evaluation criteria. The proposed model achieved the best goodness-of-fit on both datasets. In addition, we evaluated the software reliability using the SPRT, which is more efficient than the classical hypothesis test. As shown in Table 10, the test based on parameter a should be continued for both datasets because no reliable judgment was obtained. The test based on parameter b was accepted, as shown in Table 11, when t = 7 for dataset 1. However, the test should be continued for dataset 2 because no reliable judgment was obtained.
In the future, experiments based on more recent data should be performed to further validate the superiority of the proposed model. In addition, after estimating the parameters of software reliability models using the MLE and Bayesian techniques, we plan to discuss software reliability by applying the SPRT.