Application of EM Algorithm to NHPP-Based Software Reliability Assessment with Generalized Failure Count Data

: Software reliability models (SRMs) are widely used for quantitative evaluation of software reliability by estimating model parameters from failure data observed in the testing phase. In particular, non-homogeneous Poisson process (NHPP)-based SRMs are the most popular because of their mathematical tractability. In this paper, we focus on the parameter estimation algorithm for NHPP-based SRMs and discuss the EM algorithm for generalized fault count data. The presented algorithm can be applied for failure time data, failure count data, and their mixture. The paper derives the EM-step formulas for basic 12 NHPP-based SRMs and demonstrate a numerical experiment to present the convergence property of our algorithms. The developed algorithms are suitable for an automatic tool for software reliability evaluation.


Introduction
Software reliability models (SRMs) are used to assess quantitative reliability and to control the quality of software products. Since Jelinski and Moranda [1], and Goel and Okumoto [2] presented SRMs based on stochastic processes, numerous SRMs have been proposed [3][4][5][6][7][8]. In particular, non-homogeneous Poisson process (NHPP)-based SRMs have become popular in representing the dynamics of failure occurrence processes in a variety of situations [9][10][11][12][13]. By using an NHPP-based SRM, we predict the future behavior of software failure, i.e, the number of failures experienced in future, and estimate the quantitative measure of software reliability.
The advantage of NHPP-based SRMs is simplifying the stochastic analysis. NHPPs are generally dominated by mean value functions. The mean value function indicates the expected number of failures experienced at arbitrary testing time. By choosing appropriate mean value functions, NHPP-based SRMs can fit any observed failure data. The NHPPbased SRMs and the mean value functions have a one-to-one correspondence.
The Goel-Okumoto model [2]; Goel model [2]; Musa-Okumoto model [14]; Ohba [15,16]; Yamada, Ohba, and Osaki model [17]; Zhao and Xie model [18] are early NHPP-based SRMs. They were constructed with the deterministic debugging scenarios of mean value functions. Pham [19] solved a generalized differential equation by which the mean value function in the NHPP-based SRM is governed and proposed a generalized SRM with many redundant parameters.
Apart from such a deterministic modeling framework, almost all NHPP-based SRMs can be characterized as Markov processes. Shantikumar [20] discussed a modeling framework to integrate time-homogeneous Markov processes and NHPPs by using a binomialtype stochastic point process. Langberg and Singpurwalla [21] presented a unified modeling framework for almost all NHPP-based SRMs. Chen and Singpurwalla [22] also discussed the framework with a self-exciting point process. Miller [23] introduced the concept of exponential order statistics and drastically extended Langberg and Singpurwalla's idea. In fact, the realizations of NHPP-based SRM can be described by either the general order statistics or record value statistics of the underlying software fault data, where the fault-detection times are assumed to be independent and identically distributed (i.i.d.) random variables. Specifically, the general order statistics are based on the order of all the fault detection times, and the record value statistics focus on their maximum detection time.
In this paper, we focus on the parameter estimation problem of NHPP-based SRMs. In general, there are three steps to evaluating the software reliability with NHPP-based SRM. (i) Collect the failure data such as the number of detected bugs in the testing phase, (ii) estimate the model parameters of NHPP-based SRM to fit it to the collected data, and (iii) compute reliability measures from the NHPP-based SRM with the estimated parameters. Based on quantitative measures, we control the software development process. As a typical usage of NHPP-based SRM, we estimate the number of failures that will be experienced in the future and decide whether to continue testing the software or the software can be released. In other words, the parameter estimation of NHPP-based SRM is frequently executed in the software development phase. The computation cost of the estimation should be small in practice.
Therefore, many authors have concerns about the parameter estimation problem on NHPP-based SRMs. Nevertheless, in actual software reliability assessments, a few NHPPbased SRMs and familiar maximum likelihood (ML) estimation methods are still used conventionally. The main reason for this is that the practitioners wish to use intuitively simple statistical methods, which exclude empirically based tuning parameters for a few SRMs that have survived a long history of software reliability engineering. In fact, the Bayesian estimation methods are still minor in software engineering practice, although its theoretical benefit is recognized. On the other hand, ML estimation is based on the maximization of likelihood function with software failure data and possesses several rational properties such as asymptotic efficiency. Hossain and Dahiya [24] derived necessary and sufficient conditions that the maximum likelihood estimates (MLEs), which satisfy the non-linear likelihood equations, exist in Goel and Okumoto SRM [2]. Knafl and Morgan [25] presented a method to systematically solve the likelihood equations with two model parameters. Joe [26] also discussed the confidence interval of MLEs. Zhao and Xie [18] provided the MLEs for an extended Goel and Okumoto SRM. Jeske and Pham [27] discovered empirically that the MLEs in Goel and Okumoto SRM are not statistically consistent. It should be noted, however, that ML estimation is always possible for all NHPP-based SRMs. Even if the likelihood functions are strictly concave in model parameters, it is difficult to solve the likelihood equations analytically. For instance, in the cases where the likelihood functions are not concave and where there exists no solution for the likelihood equations inside the parameter space, the conventional methods to calculate the MLEs cannot be used. Usually, the Newton method and the Nelder-Mead method are used to solve the maximization problem in the existing literature. From the recent development of computational ability, it is becoming easier to handle a large-scale complex optimization problem.
On the other hand, it is known that the local convergence property of the Newton method is a weakness for practical application. The local convergence property means that the convergence radius of algorithm is limited, and thus, it may fail to obtain a result if we set unsuitable initial guesses. For example, when we develop the application to automatically obtain parameter estimates from given data, the local convergence property becomes troublesome when choosing the initial guesses. Therefore, the Newton method is not suitable for this purpose. Additionally, the Nelder-Mead method is one of the direct search methods. The convergence property of the Nelder-Mead method is improved from the Newton method. However, some design parameters should be provided appropriately for the Nelder-Mead method. Even if we use the Nelder-Mead algorithm, the convergence of algorithm is not always guaranteed for any given data.
The EM algorithm is an algorithm that finds maximum likelihood estimates for a statistical model with incomplete data. The idea behind our EM algorithms is to find the incomplete data structure of NHPP-based SRMs. Concretely, in NHPP-based SRMs, we assume that the number of failures is finite due to a finite number of software bugs, but all of them cannot be observed, i.e., the number of reaming software failures can be regarded as missing data. From this insight, the EM algorithm for an individual NHPP-based SRM is developed. Although the convergence speed of EM algorithm is generally slower than other general-purpose numerical methods such as the Newton method, it has a global convergence property. This property allows us to reduce efforts in choosing good initial guesses for the model parameters and is suitable for automating the estimation procedure. In our past work [49], we summarized EM algorithms for 12 NHPP-based SRMs when the failure data were time data. The failure time data consisted of a set of exact failure times experienced. In practice, it is difficult to obtain exact failure times. Generally, we record failure count data consisting of the number of failures experienced for time intervals. For example, it is reasonable to record the number of failures per working day. From this reason, this paper presents the EM algorithms for 12 basic NHPP-based SRMs when the failure count data are given. In particular, we consider the generalized failure count data that involve both failure time and count data formats, and thus, the developed EM algorithms can be applied to either failure time data, failure count data, or their mixture.
We highlight our contributions here: (i) we derive the EM-step formula for NHPPbased SRMs with a finite number of failures under generalized fault count data, (ii) we derive concrete EM-step formulas for 12 basic NHPP-based SRMs, and (iii) we demonstrate the performance on the convergence property of the presented algorithms with real software failure data. To our best knowledge, this is the first paper that presents the EM algorithm for the generalized fault count data in 12 basic NHPP-based SRMs.
This paper is organized as follows. In Section 2, we introduce NHPP-based SRMs that are considered in this paper. In particular, NHPP-based SRMs are classified by failure time distribution and present the relationship between basic 12 NHPP-based SRMs and their failure time distributions. In Section 3, we derive the EM-step formulas for 12 basic NHPPbased SRMs. Section 4 is devoted to a numerical example to compare the convergence properties of EM algorithm, the Nelder-Mead method, and the quasi-Newton method. Finally, we conclude the paper with remarks in Section 5.

Model Description
Let {X(t), t ≥ 0} denote the number of software failures experienced before time t.
We make the following model assumptions [21]: Here, F(t) and N are the cumulative distribution function of the failure time and the number of inherent faults. Then, the probability mass function of the cumulative number of failures experienced by time t is where F(·) = 1 − F(·). This is often called the framework of generalized order statistics [21]. For instance, when the failure distribution is an exponential distribution, the corresponding SRM, the so-called exponential order statistics model, is the same as the Jelinski-Moranda SRM [1]. Most NHPP-based SRMs are advanced models of the generalized order statistics models. We make an additional model assumption [21]: • Assumption C: The number of inherent faults is unknown, but prior information is given by a Poisson distribution.
When the expected number of inherent faults is ω, the cumulative number of software failures at time t has the following probability mass function: Equation (2) is equivalent to the probability mass function of NHPP with mean value function ωF(t). In this modeling framework, the failure time distribution F(t) specifies an NHPP-based SRM.
Since the NHPP-based SRM is characterized by the failure time distribution, there have been a number of NHPP-based SRMs that change the failure time distribution. In this paper, we propose basic NHPP-based SRMs using well-known statistical distributions as the failure time distribution. Table 1 shows 11 basic NHPP-based SRMs and their failure time distributions. In the table, most of the basic NHPP-based SRMs correspond to the existing traditional NHPP-based SRMs. 'exp' is the so-called Goel and Okumoto model [2], 'gamma' is a generalized delayed S-shaped model [17,18], 'pareto' is a modified Duane model [50], 'tlogis' is an inflection S-shaped model [15], and 'lxvmin' is the Goel (Weibull) model [51].

Parameter Estimation
As mentioned before, the model parameters of NHPP-based SRMs should be estimated from software failure data to predict the future tendency of a software failure. The most commonly used technique for parameter estimation is maximum likelihood (ML) estimation. In the context of ML estimation, we found model parameters that maximize the log-likelihood function (LLF). Since the LLF depends on the failure data experienced, the ML estimation of NHPP-based SRMs has been discussed for two types of data: failure time data and count data. The failure time data is a set of exact times in which a software failure occurs in the testing phase. The count data, equivalently called grouped data, consists of the number of failures experienced for time intervals. The estimation problems for these two data structures have been discussed separately. This paper deals with a generalized data structure to express both failure time and count data. Our data structure is D : In addition, if u i = 1, an additional failure occurs at the end of the ith time interval, i.e, at time x i . Otherwise, if u i = 0, no failure occurs at the instant. If u i = 0 for all time intervals, the data turns out the failure count data. If x i = 0 and u i = 1 for all i, D is the failure time data.
Based on the generalized data, the LLF for NHPP-based SRMs is written in the following form: Then, the problem is to find the optimal (ω, θ), so-called maximum likelihood estimates (MLEs), maximizing LLF(ω, θ). However, it is noted that we cannot derive the closed form solution of MLEs. That is, we need to utilize numerical optimization techniques such as the Newton method, quasi-Newton method, and Nelder-Mead method.
Although conventional methods such as the Newton method and the Nelder-Mead method may be occasionally useful in computing MLEs of the NHPP-based SRMs, it is worth noting that these aim to solve unconstrained optimization problems in ML estimation. However, in many cases, we have to cope with constrained optimization problems because almost all of the model parameters of NHPP-based SRMs are implicitly constrained, such as positive constraint.

EM Algorithms for NHPP-Based SRMs
This paper develops numerical procedures to compute MLEs for NHPP-based SRMs with generalized data. The proposed estimation algorithms are based on the EM principle. The EM algorithm is one of the statistical approaches to compute the MLEs for incomplete data and is numerically stable because of its global convergence property. Moreover, the proposed EM algorithms for NHPP-based SRMs are based on the closed forms of MLEs for an arbitrary fault-detection time distribution and are capable of solving constrained optimization problems. Although we have already developed EM algorithms for failure time data and failure count data for several basic NHPP-based SRMs, this paper revisits their EM algorithm when generalized data are given.

EM Algorithm
The EM algorithm is an iterative method for computing ML estimates with incomplete data [28,29]. Let D and U be observable and unobservable data vectors, respectively, and θ be a model parameter vector θ to be estimated from only the observable data. In the ML estimation, we find a parameter vector by maximizing the following log-likelihood function (LLF) L(θ; D): where p(·) is any probability density or mass function and thus p(D, U ; θ) denotes the likelihood function for complete data (D, U ). Let Q(θ|θ ) denote the conditional expected LLF with respect to the complete data vector (D, U ) using the posterior distribution for unobservable data vector with a given parameter vector θ : Then, the EM algorithm consists of an E-step and an M-step. The E-step computes the conditional expected LLF with respect to the complete data vector (D, U ) using the posterior distribution for unobservable data vector with provisional parameter vector θ , i.e., Q(θ|θ ). In the M-step, we find a new parameter vector θ that maximizes the expected LLF: and θ becomes a provisional parameter vector at the next E-and M-steps. These steps surely increase the marginal LLF. The E-and M-steps are repeatedly executed until the parameters converge.

EM Algorithm for NHPP-Based SRMs
Consider the complete data in NHPP-based SRMs, T 1 < T 2 < . . . < T N , where T i is the ith failure time and N is the number of all the failures. It is worth noting that the number of all the failures in software is unobserved. Since N is the Poisson-distributed random variable and T i obeys F(·; θ), the complete LLF is given by From the standard argument of MLEs, the MLEs of ω and θ can be derived as and respectively. These imply that the estimation problem of NHPP-based SRMs under complete data can be decomposed into separate problems for two distribution functions: Poisson distribution and the failure time distribution. Since the number of failures and the exact failure time in intervals are unobserved, the generalized data D := {(t 1 , x 1 , u 1 ), . . . , (t k , x k , u k )} are incomplete data. By applying the EM algorithm, we have the following EM-step formulas for NHPP-based SRMs with the generalized data: and Additionally, we obtain the following formula to compute the expected values. For any measurable function h(·), the expected value with the generalized data is expressed as where f (z; θ) is a probability density function (p.d.f.) of failure time provided that the parameter vector is θ. The derivation of this formula is given in Appendix A.
exp: 'exp' is the model where the failure time distribution is an exponential distribution. This model is exactly the same as the Goel-Okumoto model [2]. Define the c.d.f. of failure time as F(t; β) = 1 − exp(−βt), β > 0.
Since the MLE of an ordinary exponential distribution is given by a closed from, the EM-step formulas for exp are directly derived from Equations (10) and (11); By applying the formula for the expected value, we have where gamma: The failure time distribution becomes the following gamma distribution: where α and β are shape and scale parameters, respectively. When α = 2 is fixed, the model reduces the delayed S-shaped SRM [17]. Similar to exp, the EM-step formulas are given using Equations (10) and (11): where ψ(·) is the digamma function, i.e., ψ(α) = d log γ(α)/dα. Additionally, we use the updated α to compute β. Note that Equation (21) can easily be solved with the nonlinear equation solver such as a bisection method. In addition, E[N|D; ω , α , β ] and E[∑ N i=1 T i |D; ω , α , β ] are obtained as follows: where F(t; α, β) is the complementary c.d.f. of gamma distribution with parameters α and β.
On the other hand, we need the numerical integration to obtain E[∑ N i=1 log T i |D; ω , α , β ]. It should be noted that, if the shape parameter α is fixed, then the computation algorithm becomes simple because we ignore solving the nonlinear equation and computing the pareto: 'pareto' is the SRM where the failure time distribution is the Pareto distribution of the second kind. The Pareto distribution of the second kind is called Lomax distribution: This model was proposed as the modified Duane model [50].
Since the Pareto distribution of the second kind is a mixture of exponential distribution, the EM algorithm for 'pareto' is constructed using this property. In general, the mixture distribution is defined as a superposition of original statistical distributions with mixture ratio. Let G(ξ; θ) be the c.d.f. of mixture ratio distribution for the parameter ξ. Then, the mixture distribution is given by ; θ). (27) The Pareto distribution of the second kind is a mixture of exponential distribution when the mixture ratio distribution is a gamma distribution. That is, the failure time distribution is written in the following form: For the EM algorithm of mixture-type SRMs, we also define the fault detection rate for each fault as a hidden variable.
Let (T 1 , Ξ 1 ), . . . , (T N , Ξ N ) be a set of failure time and its associated fault detection rate for all the failures. The complete LLF is given by Similar to gamma, we have the following EM-step formula from the MLEs of gamma distributions: On the other hand, the formula for the expected value is given by where h(·) is an arbitrary measurable function and f (z; θ ) = f (z; ξ)dG(ξ; θ ). (35) tnorm, lnorm: 'tnorm' and 'lnorm' are SRMs whose failure time distributions are truncated and log normal distributions, respectively. The failure time distributions for tnorm and lnorm are tnorm: lnorm: where Φ(·) is the c.d.f. of the standard normal distribution. Since the EM algorithms for both models with failure time and count data were introduced in detail in the literature [34], this paper provides the EM-step formulas with the generalized data.
• EM-step formula for tnorm: • EM-step formula for lnorm: In the above formulas, Φ(z) is the complementary c.d.f. of the standard normal distribution, and Φ (1) (z) and Φ (2) (z) are expressed with the p.d.f. of the standard normal distribution φ(z): In addition, after the convergence, we take ω =ωΦ(z 0 ) to obtain the ML estimate for ω in the case of tnorm. tlogis, llogis tlogis and llogis are the SRMs with truncated and log logistic distributions, respectively. In particular, 'tlogis' is equivalent to the inflection S-shaped model [15]. The failure time distribution of tlogis is given by where Ψ(·) is the c.d.f. of standard logistic distribution By taking into account the exponential of logistic distribution, we have the following failure time distribution of llogis: Since logistic distribution does not belongs to the exponential family of distributions, neither expectation nor maximization can be expressed as simple formulas. To construct the algorithm, we consider only one assumption; the number of all failures is not observed. Then, the EM-step formulas become

•
The EM-step formula for tnorm • The EM-step formula for lnorm t k ; θ)) .
The second equations in both formulas indicate that θ is updated by the MLEs when the number of all the failures is given byω and ω . These algorithm are also stable if there exists a unique solution maximizing the right-hand side of the second term. Note that, after the convergence, the model parameter ω in tlogis can be obtained as ω =ωF(0; θ). txvmax, lxvmax, txvmin, lxvmin Suppose that the failure time caused by each failure follows an extreme value type I distribution. The extreme value type I distribution is called Gumbel distribution, and its definition is based on the limitation of the maximum value of random variables. Here, the c.d.f. of a standard Gumbel distribution is defined as Similar to tnorm, lnorm, tlogis, and llogis, we consider the truncation and logarithm of the extreme value distribution. In addition, since the extreme value distribution is not symmetric, we also consider the case of negative samples, i.e., the minimum value of random variables.
The failure time distributions of txvmax and lxvmax are, respectively, Similarly, the failure time distributions of txvmin and lxvmin are given by where Θ(t) = 1 − Θ(−t) corresponds to the c.d.f. of a standard extreme value type I distribution of the minimum. From Equation (63), we find that lxvmin is equivalent to the Weibull distribution.
Since the extreme value distribution is not an exponential family, we consider only one assumption; the number of all the failures is not observed. Then, the EM-step formulas are given by • The EM-step formula for txvmax and txvmiñ • The EM-step formula for lxvmax and lxvmin

Numerical Example
We investigated the numerical characteristics of the presented EM algorithms. Here, we compare the convergence property with the Nelder-Mead method and the quasi-Newton method (BFGS method). First, we check the trace of model parameters until they converge to MLE for the proposed method (EM algorithm), the Nelder-Mead method, and the BFGS method. In this experiment, we used the fault count data, which were collected from real software projects [53]. The statistics of fault count data are given as follow. For the above failure count data, we estimated the parameters of 'exp'. The MLEs of exp areω = 354.75 andβ = 0.00251, and the maximum LLF is LLF(ω,β) = −180.79. Figures 1-3 show the trace of model parameters for EM algorithm, the Nelder-Mead method, and the BFGS method when the initial guesses are ω = 100 and β = 0.1. We use the 'optim' function in R for the Nelder-Mead and BFGS methods. From these figures, we find that the EM algorithm stably updates the model parameters and converges to close parameters to the MLEs. However, the convergence speed is not fast, since the update becomes smaller as the parameters are close to the MLEs. The Nelder-Mead method provides the MLEs, but the trace of the algorithm is not stable. In particular, this algorithm sometimes takes invalid values that violate the parameter constraints, i.e., ω < 0 or β < 0, while searching for the parameters. Figure 3 depicts the trace of parameters for the BFGS method. The convergence property is the worst among the three methods. Additionally, the algorithm fails to obtain the MLE.  Next, we present the convergence properties for the proposed EM algorithm, the Nelder-Mead method, and the BFGS method quantitatively. Here, we use two additional fault count data that were collected from real software projects [53] as well as SS1A. For three data sets-SS1A, SS1B, and SS1C-we applied the proposed EM algorithm, the Nelder-Mead method, and the BFGS method for 12 basic NHPP-based SRMs with 100 different initial parameters. In the experiment, the initial parameters were selected by random numbers. Tables 2-4 present the number of converged estimations, i.e., the number of times that the model parameters are successfully estimated for each NHPP-based SRMs and methods. If this value is 100, the method succeeded in obtaining the MLE for all of the initial parameters. On the other hand, if this value is 0, the estimation method fails to obtain the MLE for all of the initial parameters due to numerical computation errors such as overflow and underflow.   exp  100  7  7  gamma  100  7  6  pareto  100  18  18  tnorm  100  98  98  lnorm  87  87  87  tlogis  100  100  100  llogis  87  87  87  txvmax  99  99  99  lxvmax  0  0  0  txvmin  89  89  89  lxvmin  100  41  41   Table 4. The number of converged estimations (SS1C). exp  100  9  8  gamma  100  13  12  pareto  100  47  47  tnorm  100  98  98  lnorm  96  97  97  tlogis  100  100  100  llogis  96  96  96  txvmax  99  99  99  lxvmax  0  0  0  txvmin  89  89  89  lxvmin  100  43  43 From these results, we find that the convergence rates of the proposed EM algorithms are 100% in the cases of exp, gamma, pareto, tnorm, and tlogis. Since the number of converged estimations of the Nelder-Mead is almost the same as that of BFGS for all cases, the convergence properties of their methods are the same if we use the 'optim' function in R. Additionally, since lxvmax did not fit SS1B and SS1C, all of the estimation methods failed to obtain the MLE. Furthermore, it was found that the numbers of converged estimates in the Nelder-Mead and BFGS methods are worse than that of the EM algorithm, specifically in the cases of exp, gamma, and pareto. Additionally, in the cases of tnorm and lnorm, the convergence property of EM is slightly superior to the other two methods. In exp, gamma, pareto, tnorm, and lnorm, the failure time distributions belong to the exponential family, and thus, their EM-step formulas do not include the numerical optimization step. That is, these EM-step formulas are 'pure' EM-step formulas. Therefore, the convergence properties outperform those of the Nelder-Mead and BFGS methods. On the other hand, in the cases of tlogis, llogis, txvmax, lxvmax, txvmin, and lvxmin, the failure time distributions are not in the exponential family, and we should use the numerical optimization step in their EM-step formulas. This is the reason why the convergence property of the presented EM algorithm is same as that in the Nelder-Mead and BFGS methods.

Conclusions
This paper derived EM-step formulas for 12 basic NHPP-based SRMs when the generalized failure count data are given. Since the generalized fault count data involve both time and count data formats, the presented EM algorithms can be applied to failure data experienced in practice. In addition, the convergence property of EM algorithm is better than or equivalent to other ordinary methods such as the Nelder-Mead and BFGS methods for practical software fault data. Thus, the presented algorithms are suitable for implementation in the automatic tool for software reliability evaluation. In fact, our research group has developed an AddIn of Microsoft Excel to estimate software reliability [54].
In the future, we will develop a reliability assessment tool by integrating a software repository such as GitHub, a bug tracking system, and a continuous integration system, and the tool will continuously monitor the reliability of software.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Derivation of Equation (12)
For convenient, ω and θ are written as ω and θ, respectively, and E[·|D; ω , θ ] is simplified as E[·|D]. Here we have where s i = ∑ i j=1 (x j + b j ). According to the order statistics of failure times, the first term of the right-hand side of the above can be rewritten as follows. (A2)