Analysis of Faults in Software Systems Using Tsallis Distribution: A Unified Approach

Abstract: The identification of the appropriate distribution of faults is important for ensuring the reliability of a software system and its maintenance. It has been observed that different distributions explain faults in different types of software. Faults in large and complex software systems are best represented by the Pareto distribution, whereas the Weibull distribution fits enterprise software well. An analysis of faults in open-source software endorses the generalized Pareto distribution. This paper presents a model, called the Tsallis distribution, derived using the maximum-entropy principle, which explains faults in many diverse software systems. The effectiveness of the Tsallis distribution is ascertained by carrying out experiments on many real data sets from enterprise and open-source software systems. It is found that the Tsallis distribution describes software faults better and more precisely than the Weibull and generalized Pareto distributions for both types of software. The applications of the Tsallis distribution in (i) software fault-prediction using the Bayesian inference method, and (ii) the Goel and Okumoto software-reliability model are discussed.


Introduction
The study of faults in software systems is important as it has a direct impact not only on a system's quality and reliability but also on its overall management. The terms fault, bug, and defect are used interchangeably in the field of software engineering. A software fault is a problem leading either to a crash or to undesirable output [1]. Thus, what should be considered a fault varies and depends primarily on the requirements and standards of the software product. No software is free from faults, and hence their analysis has been an active area of research in software engineering [2]. One important aspect relates to the probability distribution of faults over the modules of a software system, known as the fault distribution. A module can be a class in an object-oriented system, a function in procedural languages, or a file in Python. Identifying the fault distribution helps in prioritizing and analyzing those modules that have more impact on the overall quality of the software system. Additionally, knowledge of the fault distribution helps developers identify error-prone modules as early as possible during development, in order to optimize resources and test effectively [3]. Moreover, it also helps in reducing post-delivery software management costs, because fixing a fault is far more economical during the earlier phases of the software development life cycle [4,5]. Knowledge of the underlying probability distribution of faults can be used to predict faults in future releases of the software as well [6].

Motivation
Finding an appropriate mathematical model that explains the empirical data of faults in a software system has a long history [3]. Some earlier investigations of faults in software systems favored exponential and logistic models. Later, the Pareto model became widely accepted, particularly for large [7] and complex software [8,9]; it essentially means that 80% of the faults are contained in 20% of the modules. A replicated study establishes that the Weibull model describes software faults more effectively than the Pareto model [10]. However, another such analysis on complex software systems endorses the Pareto model [11]. The application of the bounded generalized Pareto model to open-source software is evaluated in [3]. The feasibility of variants of the Pareto model, such as the double Pareto, has also been explored [12].
In spite of the availability of numerous models for fault distributions, there is no single accepted model that can explain faults in a variety of software. It has been asserted that the distribution of faults in software systems depends on the environment and that only the replicated studies can validate a model [13]. Hence, there is a clear requirement for a generic mathematical model that can describe the underlying fault distribution across a variety of software systems viz. enterprise, open source, large, and complex. Such a model will eliminate the existing diversity in the domain of analysis of software faults and will help the software developers and management to focus their efforts on quality assurance rather than investing time in finding appropriate models.

Contributions
The paper makes the following contributions:
• A generalized mathematical model, called the Tsallis distribution, is derived using the maximum-entropy principle.
• The Tsallis distribution is fit to fault data sets of enterprise and open-source software and is found to be a generic model.
• Applications of the Tsallis distribution in software fault-prediction and in a software-reliability model are also outlined.
The paper is organized into six sections. The related work on fault distributions is presented in Section 2. The methodology of the study, along with the data sets, is described in Section 3; the Tsallis distribution is derived there using a maximum-entropy framework, and a procedure to estimate its parameters is developed. Section 4 contains the results of the experiments conducted on real data sets to validate the efficacy of the Tsallis distribution in describing software faults, together with a discussion of the results and the applications of the analysis. Threats to the validity of the analysis are discussed in Section 5. Section 6 contains the conclusion.

Related Work
The study of the distribution of faults has been a topic of importance in software engineering, primarily because of its role in predicting faults and hence in ensuring the quality and reliability of software [14]. One of the techniques adopted for software fault-prediction is Probit regression, which is a binary classification model [15]. Harter et al. [16] have applied Probit analysis to study the severity of faults with respect to the software improvement process. Many other research studies agree on the applicability of the Pareto principle in explaining faults in software. This principle, also known as the 80-20 law, implies that the majority of the faults reside in a small number of modules. After analyzing the fault data of very large telecom software at Ericsson, Fenton and Ohlsson [8] suggest that the empirical distribution of faults there obeys the Pareto principle. Later on, Andersson and Runeson [17] replicate this study and validate the applicability of the Pareto principle. The suitability of the Pareto principle in another similar study, on Motorola's telecom software, is verified in [18]. In a large inventory software system, Ostrand and Weyuker [7] establish the appearance of the Pareto distribution in the number of faults. Another replicated analysis of the fault distribution in a complex software system endorses the Pareto principle [13]. A comparison of the Pareto, lognormal, Weibull, double Pareto, and Yule-Simon distributions in fitting the empirical distribution of faults in a proprietary complex system is carried out in [11]. Their study finds the double Pareto distribution to be more efficient than the others in explaining faults.
In the enterprise Eclipse software system, Zhang [10] observes that the Weibull distribution provides a better fit to faults than Pareto, both pre-release and post-release. Concas et al. [19] investigate the distribution of faults in six releases of Eclipse using the Yule-Simon, double Pareto, lognormal, and Weibull distributions and conclude that the generative model of the Yule-Simon distribution describes faults better than the others. The Weibull distribution is also found to perform well in explaining faults in four releases of the Windows operating system [20]. In the case of open-source software, Hunt and Johnson [21], after analyzing faults in projects at SourceForge, endorse the Pareto distribution. However, Kuo and Huang [3] find that the bounded generalized Pareto distribution (BGPD) provides the best fit to the empirical fault distribution of many open-source software systems. In a recent study on fault distribution, Sriram et al. [12] explore variants of the Pareto model such as the double Pareto.
It can be noted here that no single distribution can explain the faults in different types of software. In particular, it can easily be observed that the Pareto principle applies to large and complex software, whereas the Weibull distribution is useful for enterprise software; in contrast to both of these, BGPD is better for open-source software. A pertinent question then arises as to whether it is possible to explain faults in these diverse software systems in a uniform way. If so, then a single model can be developed for predicting faults. This is the main motivation for this study.

Methodology
This section presents details of the methodology adopted in this study. The first step is the collection of the fault data of various types of software. The data-collection process and data sets are described in the following subsection.

Data Collection
The first data set used is from the Eclipse 2.0, 2.1, and 3.0 pre-releases and post-releases. This data set was first presented in [22], and it was later shown in [10] that the distribution of software faults in this data set is better explained by the Weibull distribution. Eclipse is enterprise software, and data from three consecutive releases of the same software help in checking the persistence of the results. The details of the Eclipse data set are presented in Table 1. Besides this, fault data of the Equinox and KAA enterprise software and of the open-source software gcc, samba, Python, and Firefox are included in this study, as given in Table 2. The data have been gathered from [23-28] and are up to 18 February 2020. The status of all the included faults is resolved and fixed. These faults are reported by users and thus correspond to the user utilization phase of the software life cycle.
The next step is to identify the candidate probability distributions to be included in this study. The next subsection provides details of them.

Generalized Pareto Distribution
The 2-parameter generalized Pareto distribution has been employed to analyze the fault distribution of open-source software in [29]. If the random variable X represents the number of faults, then the probability density function (pdf) of the 2-parameter generalized Pareto distribution is given by

f(x) = (1/b)(1 + ax/b)^(−(1 + 1/a)), x ≥ 0, (1)

where a is the shape parameter and b is the scale parameter.

Weibull Distribution
The importance of the Weibull distribution in modeling faults was first highlighted in [10] for the enterprise Eclipse software. Thereafter, it has been part of many empirical studies [11,12,20]. The pdf of the Weibull distribution is

f(x) = (µ/λ)(x/λ)^(µ−1) e^(−(x/λ)^µ), x ≥ 0, (2)

where µ is the shape parameter and λ is the scale parameter.
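For reference, both candidate distributions can be fitted by maximum likelihood and scored with the KS test using SciPy. The snippet below is a self-contained sketch on synthetic data: the `faults` array and all parameter values are illustrative stand-ins, not data from this study.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for per-module fault counts (illustrative only).
rng = np.random.default_rng(42)
faults = rng.weibull(1.2, size=500) * 3.0

# Weibull fit: shape mu and scale lam, with the location pinned at zero.
mu, w_loc, lam = stats.weibull_min.fit(faults, floc=0)

# Generalized Pareto fit: shape a and scale b, with the location pinned at zero.
a, g_loc, b = stats.genpareto.fit(faults, floc=0)

# Kolmogorov-Smirnov statistic and p-value for each fitted candidate.
d_w, p_w = stats.kstest(faults, "weibull_min", args=(mu, w_loc, lam))
d_g, p_g = stats.kstest(faults, "genpareto", args=(a, g_loc, b))
print(f"Weibull: D={d_w:.3f}, p={p_w:.3f}")
print(f"GPD:     D={d_g:.3f}, p={p_g:.3f}")
```

Since the synthetic data here are themselves Weibull, the Weibull fit is expected to score well; on real fault data the comparison between candidates is the point of the exercise.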

Maximum Entropy Tsallis Distribution
The notion of entropy is linked to the theory of statistical mechanics in physical systems. However, Shannon [30] developed a measure of randomness or uncertainty of a system in the context of communication theory, mathematically similar to the one in statistical mechanics, and called it Shannon entropy. For a system S with states n = 0, 1, 2, … occurring with probabilities p_n, the Shannon entropy is defined as

S = −Σ_n p_n ln p_n. (3)

In the context of non-extensive dynamical systems, a generalized measure of entropy known as Tsallis entropy was proposed [31],

S_q = (1 − Σ_n p_n^q)/(q − 1), (4)

with the parameter q measuring the degree of non-extensivity in the system. Tsallis entropy reduces to Shannon entropy in the limit,

lim_{q→1} S_q = S, (5)

and it has found applications in various domains [31]. It is to be noted that a software system can also be treated as a physical system [32].
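The q → 1 limit can be checked numerically. The sketch below is standalone; the probability vector is arbitrary and chosen only for illustration.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy S = -sum_n p_n ln p_n (terms with p_n = 0 dropped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def tsallis_entropy(p, q):
    """Tsallis entropy S_q = (1 - sum_n p_n^q) / (q - 1), q != 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

# As q -> 1, the Tsallis entropy approaches the Shannon entropy.
p = np.array([0.5, 0.25, 0.125, 0.125])
for q in (1.5, 1.1, 1.01, 1.001):
    print(f"q={q}: S_q={tsallis_entropy(p, q):.6f}")
print(f"Shannon: {shannon_entropy(p):.6f}")
```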
It is imperative to mention that one of the ways to obtain a probability distribution when some prior information is available, usually in the form of moments, is through Jaynes maximum-entropy principle (MEP) [33]. MEP with Shannon entropy has been used to model component-size distribution in software systems [32,34]. MEP with Tsallis entropy has been applied to communication networks as well [35][36][37][38]. Recently, Sharma and Pendharkar [39] have employed Tsallis entropy to study software-component sizes.
In this section, a closed form of the Tsallis distribution in terms of the Hurwitz zeta function is derived. Let p_n denote the probability of n faults in a software system S, with the number of faults ranging from 0 to ∞. The maximum-entropy problem can then be formulated as

maximize S_q = (1 − Σ_{n=0}^∞ p_n^q)/(q − 1) (6)

subject to the mean number of faults

Σ_{n=0}^∞ n p_n = ⟨n⟩ (7)

and the normalization condition

Σ_{n=0}^∞ p_n = 1 (8)

as constraints. Defining the Lagrangian function

φ_q = S_q − α(Σ_{n=0}^∞ p_n − 1) − β(Σ_{n=0}^∞ n p_n − ⟨n⟩), (9)

where α and β are the Lagrange multipliers corresponding to the normalization and mean-number-of-faults constraints, and differentiating φ_q with respect to p_n and equating to zero results in

p_n = [((1 − q)/q)(α + βn)]^(−1/(1−q)). (10)

The finiteness of the normalization constant in (10) requires 1/(1−q) > 1, i.e., 0 < q < 1. Imposing the normalization constraint, the probability distribution of the number of faults given by (10) can be rewritten as

p_n = (n + 1/(β(1−q)))^(−1/(1−q)) / ζ(1/(1−q), 1/(β(1−q))), 0 < q < 1, n = 0, 1, 2, …, (11)
where

ζ(s, c) = Σ_{n=0}^∞ (n + c)^(−s), s > 1, c > 0, (12)

denotes the Hurwitz zeta function, here with s = 1/(1−q) and c = 1/(β(1−q)). The fault distribution given in (11) is regarded as a Tsallis distribution. The parameter β in (11) can be estimated from the constraint given in (7) as

⟨n⟩ = [ζ(s − 1, c) − c ζ(s, c)] / ζ(s, c). (13)

Cumulative distribution of faults: The cumulative distribution of faults can be obtained from (11) as

F(m) = Σ_{n=0}^m p_n = 1 − ζ(s, m + 1 + c)/ζ(s, c), m = 0, 1, 2, …. (14)

Estimation of parameters:
For a given data set of faults, one can obtain appropriate values of the Tsallis parameter q and the Lagrange parameter β so that the fault distribution (11) can be evaluated. For this, q is first varied over a range, and for each specific value of q the parameter β is computed from (7) by a numerical method. Writing A for the arithmetic mean of the observed fault counts, the constraint (7) can be expressed as

Σ_{n=0}^∞ n p_n = A. (15)

Using (11) and (15), with s = 1/(1−q) and c = 1/(β(1−q)), it can be rewritten as

f(β) = ζ(s − 1, c) − (c + A) ζ(s, c) = 0. (16)

Equation (16) is of the form f(β) = 0 and can be solved through the Newton-Raphson method, starting from an initial approximation of β and replacing it at each iteration by β + ∆β, where

∆β = −f(β)/f′(β), (17)

which, using the identity ∂ζ(s, c)/∂c = −s ζ(s + 1, c) together with dc/dβ = −c/β, can be expressed as

∆β = −β [ζ(s − 1, c) − (c + A) ζ(s, c)] / (s c [ζ(s, c) − (c + A) ζ(s + 1, c)]), (18)

till the sequence of iterates converges. Once the value of β is approximated for a value of q, the fault distribution can be completely evaluated. Afterwards, the Kolmogorov-Smirnov (KS) statistic [40],

D = sup_x |F_n(x) − F(x)|, (19)

where F_n(x) is the empirical cumulative distribution, F(x) is the fitted cumulative distribution (14), and sup_x is the supremum over x, is used to compute the difference between the empirical and fitted cumulative distributions of faults. Then, the value of q and the corresponding β that give the minimum value of D are chosen. A similar method is proposed by Clauset et al. [41] for fitting power-law distributions. The procedure is outlined in Algorithm 1.

Algorithm 1 Algorithm for Fitting Tsallis Distribution to Empirical Dataset of Software Faults
Require: Empirical data
Ensure: Estimated values of q and β
Compute the arithmetic mean A from the data
Compute the empirical cumulative distribution of faults
Initialize the Tsallis entropy parameter q
Give an initial value to the parameter β
while q < 1 do
    repeat
        compute ∆β using (18)
        β ← β + ∆β
    until β converges
    compute the cumulative distribution of faults using (14)
    compute the KS statistic
    increment q
end while
Choose the minimum value of KS and the corresponding q and β

The next section describes the results of the experiments conducted to validate the Tsallis distribution in modelling software faults.
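Algorithm 1 can be sketched in Python as below. This is a minimal illustration, not the authors' implementation: it uses SciPy's `zeta` for the Hurwitz zeta function and, for robustness, replaces the Newton-Raphson update (18) by bracketed root-finding on the equivalent constraint in c = 1/(β(1−q)). The q grid (restricted to q > 1/2 so that the mean in (13) is finite) and the bracket for c are assumptions.

```python
import numpy as np
from scipy.special import zeta          # zeta(s, c) is the Hurwitz zeta function
from scipy.optimize import brentq

def tsallis_pmf(n, q, c):
    """Tsallis fault distribution (11): p_n = (n + c)^(-s) / zeta(s, c),
    with s = 1/(1-q) and c = 1/(beta*(1-q))."""
    s = 1.0 / (1.0 - q)
    return (n + c) ** (-s) / zeta(s, c)

def tsallis_mean(q, c):
    """Mean number of faults, Equation (13)."""
    s = 1.0 / (1.0 - q)
    return (zeta(s - 1.0, c) - c * zeta(s, c)) / zeta(s, c)

def fit_tsallis(data, q_grid=None):
    """For each q in the grid, solve the mean constraint (16) for c (hence beta),
    then keep the (q, beta) pair that minimizes the KS statistic (19)."""
    data = np.asarray(data, dtype=float)
    A = data.mean()
    ns = np.arange(int(data.max()) + 1)
    ecdf = np.searchsorted(np.sort(data), ns, side="right") / len(data)
    if q_grid is None:
        # A finite mean requires 1/(1-q) > 2, i.e. q > 1/2.
        q_grid = np.arange(0.55, 0.99, 0.01)
    best_ks, best_q, best_beta = np.inf, None, None
    for q in q_grid:
        s = 1.0 / (1.0 - q)
        try:
            c = brentq(lambda cc: tsallis_mean(q, cc) - A, 1e-6, 1e6)
        except ValueError:
            continue                     # no root in the bracket for this q
        F = 1.0 - zeta(s, ns + 1.0 + c) / zeta(s, c)   # CDF, Equation (14)
        ks = np.max(np.abs(F - ecdf))
        if ks < best_ks:
            best_ks, best_q, best_beta = ks, q, 1.0 / (c * (1.0 - q))
    return best_ks, best_q, best_beta
```

Usage is then, e.g., `ks, q, beta = fit_tsallis(fault_counts)` on a vector of per-module fault counts.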

Results and Discussion
The model developed, viz. the Tsallis distribution, is validated by running experiments on several data sets using the procedure described in Section 3.4. For comparative analysis, the generalized Pareto and Weibull distributions are also fitted. The goodness of fit in all cases is checked by the KS test [40]. The results of the experiments for the enterprise software data sets are presented in Tables 3 and 4. In these two tables, an h value of zero indicates that the given distribution fits the empirical data well, whereas a value of one implies that the specified distribution is not a good fit. It can be observed that the Tsallis distribution provides the lowest value of the KS statistic in all cases. At the same time, the p value also needs to be given importance when deciding about the goodness of fit [41]. It can be noted that, for the Tsallis distribution, the p value remains close to 1 for the Eclipse pre-release faults and is 1 for the Eclipse post-release faults. Thus, it can be concluded that the Tsallis distribution is a better fit than the generalized Pareto and Weibull distributions for the Eclipse data set.
To justify the efficacy of the Tsallis distribution further, the experiments are run on the fault data of open-source software as given in Table 2. The results of the experiments are shown in Table 5. It can again be observed that the Tsallis distribution provides the same or better results than both the generalized Pareto and Weibull distributions in this case as well. The results of the analysis are summarized in Table 6.
Table 6. Comparative analysis of research work on distribution of faults.

Software Type | Pareto and Its Variants | Weibull | Tsallis
It can easily be noted that the Tsallis distribution successfully describes faults in both types of software, viz. enterprise and open source. Therefore, the Tsallis distribution can be considered a unified model that can explain faults in diverse types of software. One reason for this is the fact that the maximum-entropy Tsallis distribution is the best distribution satisfying the available information about the mean number of faults in software. As pointed out by Hatton [32], software systems can be treated as physical systems; thus, the theory of the evolution of physical systems can be applied to them. Following this, software systems too tend to move towards configurations that have maximum entropy. This may be another reason behind the applicability of the Tsallis distribution in modeling faults in diverse types of software. Further, the Tsallis entropy parameter governs the interactions within the physical system. Identical or close values of the parameter q signify that the underlying dynamics of these software systems are similar even if they belong to diverse types.
Precise knowledge of the fault distribution helps in predicting faults during the early phases of the software life cycle in new, similar projects. A framework based on the Bayesian inference method has been proposed by Rana et al. [42], where the fault distribution from previous projects is used for prior fault prediction. The basis of the Bayesian inference method is the formula

f(κ | data) = C f(data | κ) f(κ), (20)

which relates the distribution of the number of faults f(κ | data) that needs to be estimated from the available 'data' of the new project, also called the posterior distribution, with the likelihood f(data | κ) and the prior f(κ). Here, C is the normalization constant. The prior f(κ) now follows the Tsallis distribution. The details of the method are presented in [42].
Another application of this work is in software reliability, specifically in describing software failures. One of the pioneering works in software reliability is the Goel and Okumoto model [43], which deals with modeling the number of failures observed in a time interval. Modeling the cumulative number of failures at time t, N(t), as a non-homogeneous Poisson process (NHPP), Goel and Okumoto [43] derive the probability of observing k failures by time t as

P(N(t) = k) = (m(t)^k / k!) e^(−m(t)), k = 0, 1, 2, …, (21)

where

m(t) = a(1 − e^(−bt)). (22)

Here, a is a random variable representing the number of faults to be detected in the software, and b is the fault-occurrence rate. The probability distribution of a can now be given by (11); thus, taking the expectation of (22) over a results in

m(t) = ⟨a⟩(1 − e^(−bt)), (23)

which, after substituting the mean of the Tsallis distribution from (13), gives

m(t) = [(ζ(s − 1, c) − c ζ(s, c))/ζ(s, c)] (1 − e^(−bt)), s = 1/(1−q), c = 1/(β(1−q)). (24)

The quantity m(t), the expected number of failures observed by time t, is thus expressed in terms of the Tsallis distribution parameters. For a given fault-occurrence rate, the expected number of failures can now be easily computed. These results will be very useful to software management.
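Under this reading of the modified Goel-Okumoto model (the mean failure function scaled by the Tsallis mean of the fault count), the computation can be sketched as follows. The helper names and the parameter values in the test are hypothetical; the expression for the mean follows Equation (13).

```python
import numpy as np
from scipy.special import zeta, gammaln   # zeta(s, c) is the Hurwitz zeta function

def tsallis_fault_mean(q, beta):
    """Mean fault count <a> under the Tsallis distribution (11),
    with s = 1/(1-q) and c = 1/(beta*(1-q)); Equation (13)."""
    s = 1.0 / (1.0 - q)
    c = 1.0 / (beta * (1.0 - q))
    return (zeta(s - 1.0, c) - c * zeta(s, c)) / zeta(s, c)

def expected_failures(t, q, beta, b):
    """m(t) = <a>(1 - exp(-b t)): expected failures by time t when the
    total fault count a follows the Tsallis distribution."""
    return tsallis_fault_mean(q, beta) * (1.0 - np.exp(-b * t))

def prob_failures(k, t, q, beta, b):
    """NHPP probability (21): P(N(t) = k) = m(t)^k e^{-m(t)} / k!,
    evaluated in log space for numerical stability."""
    m = expected_failures(t, q, beta, b)
    return np.exp(k * np.log(m) - m - gammaln(k + 1.0))
```

For a given fault-occurrence rate b, `expected_failures` rises from 0 towards the Tsallis mean as t grows, as (23) prescribes.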

Threats to Validity
This section addresses various threats to the validity of this study. The first is internal validity, which concerns the causal relationship between two variables [11]. There are three possible threats to internal validity, as in many other empirical studies on fault distribution [3,8,11,19]. The data-collection process is one threat, especially for open-source software. Such software is free, and sometimes faults are not reported and documented properly; therefore, the fault data for open-source software may be incomplete and imprecise. The second threat to internal validity is that the good fit of the Tsallis distribution may be due to chance. There are two reasons to reject this threat in our study: first, the analysis has been carried out on different types of software, and the Tsallis distribution is found to be the best fit with statistically high p values; second, the Tsallis distribution is a maximum-entropy distribution, which is the most plausible one given the information about the number of faults. The last threat to internal validity relates to the fact that the Tsallis distribution may not be the generative model of faults even if it is a good fit. This is a complex question that has not been handled even in other such studies on fault distribution in the past [3,8,11,19].
The second threat to validity is construct validity, which concerns whether the findings from one software instance are sufficient to ascertain the behavior across all instances. For this, a series of versions of the same Eclipse software is included in this study and analyzed to check the consistency of the Tsallis distribution. The analysis confirms the consistent performance of the Tsallis distribution, with a high p value, in both the pre-release and post-release fault data of Eclipse.
The last threat to validity is external validity, which concerns whether the results of the analysis are applicable across other software. It is believed that the Tsallis distribution is generic enough to explain faults in other large and complex, enterprise, and open-source software. However, only further replications of this study can confirm this.

Conclusions
Despite the extensive research carried out on the analysis of the distribution of faults in computer programs, there is no single accepted model that can explain faults in different types of software. Using the maximum Tsallis entropy principle, with information about the mean number of faults available as a constraint, the distribution of faults in a software system is derived. A procedure to estimate the distribution parameters is also presented. The performance of the Tsallis distribution in describing faults in many types of enterprise and open-source software is compared with the popular generalized Pareto and Weibull distributions. The Tsallis distribution is found to perform the same or better than the other two, thus making it useful for various types of software, including open-source software.
Two applications of precise knowledge of the fault distribution are discussed. The first relates to predicting faults in new, similar projects during the early phase of the software life cycle, when limited fault data are available, using the Bayesian inference method. The Tsallis distribution serves as the prior there, and investigating its impact on the accuracy of fault prediction is an area of future study.
Expressing the probability distribution of the random variable representing the number of faults as Tsallis in the famous software-reliability model by Goel and Okumoto, a closed-form expression for the expected number of failures by time t has been derived in terms of the Tsallis distribution parameters. Applying these new results to real data and analyzing the performance of the modified Goel-Okumoto model are tasks for future work.