A Novel Approach for Software Defect Prediction Based on the Power Law Function

Abstract: The power law describes a common behavior in which a few factors play decisive roles in an outcome. Most software defects occur in very few instances. In this study, we propose a novel approach that adopts the characteristics of the power law function for software defect prediction. The first step of this approach is to establish a power law function for the majority of metrics in a software system. Next, the value at the power law function's point of maximal curvature is applied as the threshold for identifying higher metric values. The total number of higher metric values is then counted in each instance. Finally, the statistical data are clustered into defect-free and defect-prone categories. Case studies and a comparison were conducted on twelve public datasets from Promise, SoftLab, and ReLink using five different algorithms. The results indicate that the precision, recall, and F-measure values obtained by the proposed approach are the best among the five tested algorithms; the average values of recall and F-measure were improved by 14.3% and 6.0%, respectively. Furthermore, the complexity of the proposed approach based on the power law function is O(2n), the lowest among the five tested algorithms. The proposed approach is thus demonstrated to be feasible and highly efficient at software defect prediction with unlabeled datasets.


Introduction
Software defects play a crucial role in affecting software system quality [1]. Many studies in the software research field have focused on predicting software defects with the purpose of improving software system quality [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19]. Nevertheless, most of these studies have been based on supervised prediction methods [3][4][5], which must use supervised learning with labeled datasets to build a prediction model before performing defect prediction on unlabeled instances. It is neither easy nor practical to obtain a large amount of labeled data for defect prediction with a newly developed software system. Under these circumstances, software defect prediction based on supervised learning is difficult to perform directly because of the absence of labeled defect information. Another method, known as cross-project defect prediction, utilizes labeled datasets from a source project to train the prediction model when the target project lacks labeled datasets [5][6][7][8]. Nevertheless, excessive dependence on source project quality and the extensive attention that must be paid to variations between the source and target projects make prediction using the cross-project technique relatively cumbersome. Unsupervised prediction technology has the advantage of predicting software defects without needing labeled datasets and can directly predict defect instances in newly developed software systems. Therefore, unsupervised prediction approaches, which do not require labeled datasets, have attracted considerable attention in the software defect prediction research field because of their high efficiency and low cost.
Unsupervised defect prediction methods in the literature have mostly been implemented based on the clustering of metrics. Metrics are one of the factors affecting software defects and can be used to depict software system features to some extent. Software metrics have been used as software fault-proneness indicators and to support defect prediction [9,10]. Catal et al. [11] proposed a software fault prediction approach based on metrics thresholds and clustering. Threshold values were determined using expert assistance and libraries of historical defects, and the method obtained reasonable prediction results with respect to x-means, c-means, and k-means clustering [12]. Based on the same threshold-deriving method, Abaei et al. [13], Yang et al. [14], and Bishnu and Bhattacherjee [15] conducted software defect predictions using unsupervised learning methods based on a self-organizing map, an affinity propagation clustering algorithm, and k-medoids, respectively. Bishnu and Bhattacherjee [16] utilized a quad tree-based k-means algorithm to predict program module faults by initializing the cluster centers. Zhong et al. [17] applied clustering methods (k-means and neural-gas) to cluster instances, selected typical instances from each cluster, and provided experts with auxiliary statistical information such as the mean, maximum, minimum, and median values of the metrics. Finally, the clusters were labeled by experts [17]. In summary, these studies on software defect prediction have all relied on expert assistance. In contrast, Park et al. [18] solved the problem of determining the optimal number of clusters using clustering algorithms such as EM (Expectation Maximization) and x-means. In addition, the CLA (Clustering and Labeling Approach) method proposed by Nam et al. [4] sets each metric's median value as the threshold value for determining whether instances are defect-free or defect-prone.
The statistical data are then used to cluster and label instances to perform defect prediction. The PCLA (Probabilistic Clustering and Labeling Approach) method is an extension of the CLA algorithm [19]. Clearly, clustering methods are widely used in unsupervised software defect prediction with unlabeled datasets.
The power law is a phenomenon that has been found to be effective in characterizing data in many scientific fields such as physics, biology, economics, earth science, and computer science [20]. The power law describes a common behavior in which a few factors play decisive roles in an outcome, i.e., a small number of samples holds a large influence. At the same time, it is known that in software defects, a small number of instances are responsible for many faults or defects. Software metrics show fat tails in the distributions of software defect data, and such skewness and fat tails are properties of the power law function. Therefore, a correlation between the power law function and software metrics may exist. However, to the best of our knowledge, few researchers have considered the power law function in predicting software defects.
In this study, we propose a novel software defect prediction approach based on power law functions. Section 2 presents fundamental characteristics with respect to the power law function and software defects and the details of the proposed software defect prediction approach. In Section 3, we conduct experimental studies including case studies and comparisons by imposing different datasets and algorithms. Furthermore, we present a complexity analysis in Section 4. Lastly, research conclusions and future work are generalized in Section 5.

Correlation and Characteristics
Various power law phenomena can be observed in nature and society, and the study of these phenomena has continued for more than a century. Even now, the power law phenomenon is still a research hotspot in many disciplines. Harvard University linguistics expert George Zipf studied the frequency of English words and found a simple inverse relationship between each word's frequency and the constant power of its rank in order from large to small. This distribution, known as Zipf's law, shows that only a few words in English are frequently used and that most words are seldom used [21]. Italian economist Vilfredo Pareto studied the statistical distribution of personal income and found that the income of a few people was much higher than that of the majority. Thus, he put forward the famous 80/20 rule, commonly known as Pareto's law, in which 20% of the population occupied 80% of the social wealth [22]. Zipf's law and Pareto's law are simple power law function patterns. In fact, power law distribution exists in many fields, such as physics, earth and planetary science, computer science, biology, ecology, demography and social science, and economy and finance, and has various manifestations [23][24][25][26][27][28][29][30][31].
Furthermore, Shatnawi and Althebyan [32] validated the effects power laws have on software metrics interpretations and found that many metrics demonstrate power law behavior; threshold values with respect to instances were derived from power law function properties. Additionally, Wheeldon and Counsell [33] found that a power law implied that smaller values were commonplace, whereas larger values were extremely rare. Meanwhile, Andersson and Runeson [34] quantitatively analyzed distributions of defects in three different projects and determined through graphical analysis that a small number of instances (20%) were correlated with 63%-70% of prerelease defects.
The general formula of the power law function can be written as

y = c x^(−γ), (1)

where c and γ are both constants greater than zero. Taking logarithms of both sides of Equation (1), ln y and ln x satisfy a linear relationship. In other words, in double logarithmic coordinates, the power law function is a straight line whose negative slope equals the power exponent. This linear relationship is the basis for judging whether the random variable in each instance satisfies the power law function [35].
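This linearity test can be illustrated with a short sketch. The helper below (a hypothetical name, not the authors' code) fits a line to ln y versus ln x by ordinary least squares; a slope close to −γ together with a strong correlation coefficient indicates power law behavior:

```python
import math

def loglog_fit(xs, ys):
    """Least-squares fit of ln(y) = ln(c) - gamma * ln(x).

    Returns (c, gamma, r): the power law constants and the
    correlation coefficient of the log-log regression.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    syy = sum((b - my) ** 2 for b in ly)
    slope = sxy / sxx                 # equals -gamma for an exact power law
    intercept = my - slope * mx       # equals ln(c)
    r = sxy / math.sqrt(sxx * syy)
    return math.exp(intercept), -slope, r

# Exact power law data y = 5 * x^(-1.5): the fit recovers c and gamma,
# and |r| is 1 (r is negative because the slope is negative).
xs = [1, 2, 3, 4, 5, 10, 20]
ys = [5.0 * x ** -1.5 for x in xs]
c, gamma, r = loglog_fit(xs, ys)
```

In practice, metric data only approximate a power law, so |r| is used as a goodness-of-fit check rather than expected to equal one.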
The power law is also a sign of transitioning from steady to chaotic states in the chaotic edge of self-organized critical systems [36]. The power law can be used to predict phases and phase transitions for such systems. Most software defect data have unbalanced distribution characteristics [37], i.e., defect-free instances are more common than defective instances. Many software system defects are concentrated in a small part of the instances, which is characteristic of a power law function. Figure 1 shows the distributions of the NOC (number of children of a given class in an inheritance tree) and RFC (number of distinct methods invoked by code in a given class) metrics from an example using the Camel1.6 dataset in the Promise database.

As seen in Figure 1, metric values are distributed in the form of a power law function, which has also been validated by Shatnawi [21]. Therefore, there is a close correlation between the power law function and software system metrics.

Power Law Function Curvature
Before using the power law distribution function to predict software defects, the concept of curvature should be introduced. Curvature refers to the bending degree of a curve [38]. It can be described by the ratio of the angle turned by the curve to the arc length over which the turn occurs. Figure 2 presents a sketch map of power law function curvature. Let M and M′ be two points on the curve, and let the tangents of the curve at M and M′ make angles α and α + ∆α, respectively, with the positive x-axis. When the point moves from M along the curve to M′, the tangent angle changes by ∆α, and the distance over which this change occurs is the arc length ∆s = arc MM′. Curvature is therefore defined by a differential quotient indicating the degree to which the curve deviates from a straight line: the greater the curvature, the greater the curve's bending and its deviation from the straight line. The curvature at point M can be stated as

K = lim_{∆s→0} |∆α/∆s| = |dα/ds|, (2)

where ∆s tends to zero. The reciprocal of the curvature is the radius of curvature, with a larger radius yielding a smoother arc. Taking the limit as the angle and the arc length approach zero simultaneously gives the standard curvature definition for a smooth curve of arbitrary shape.
Let the curve equation be y = f(x). Because tan α = y′, we have α = arctan y′; differentiating gives dα = [y″/(1 + y′²)] dx. Combining this with the arc length differential ds = (1 + y′²)^(1/2) dx and letting ∆x → 0, the curvature of a curve can thus be expressed as

K = |y″| / (1 + y′²)^(3/2). (3)
If the metric's power law function is expressed as m(x) = αx^(−γ), its first and second derivatives are

m′(x) = −αγ x^(−(γ+1)), m″(x) = αγ(γ+1) x^(−(γ+2)).

By substituting them into Equation (3), the curvature function corresponding to m(x) can be obtained as

k(x) = αγ(γ+1) x^(−(γ+2)) / [1 + α²γ² x^(−2(γ+1))]^(3/2).
As shown in Figure 2, the closer the curve is to the Y axis, the smaller its curvature; as the curve moves away from the Y axis, it becomes increasingly bent and its curvature increases. At a certain point, the curve becomes essentially parallel to the X axis and the curvature then decreases. This indicates that the power law function curve's curvature goes from small to large and then from large to small. Thus, a maximum curvature point exists, which may be the transformation point of the metric from defective to defect-free. In other words, the maximum curvature point of a power law function can be taken as the demarcation point between defective and defect-free instances. Therefore, if this transition point is calculated, it can be used as a threshold value for estimating whether an entity in the metric has defects. Consequently, the transformation point divides the metric values into two parts, defective and non-defective.
A curve's maximum curvature point should be a point with a first-order derivative of zero for the curvature function k(x). The first-order derivative of k(x) can be obtained as follows.
For x ≠ 0, dividing k′(x) by the nonzero factor αγ(γ+1) x^(−(γ+3)) [1 + α²γ² x^(−2(γ+1))]^(1/2) and setting the result to zero yields

α²γ²(2γ+1) x^(−2(γ+1)) − (γ+2) = 0.

That is, the X coordinate of the maximum curvature point of the power law function is

x̄ = [α²γ²(2γ+1)/(γ+2)]^(1/(2(γ+1))). (13)

Consequently, the transformation point in the corresponding metric can be obtained as

m(x̄) = α x̄^(−γ). (14)

That is to say, entities within a metric are classified into the defective tendency group when their values are greater than m(x̄), because for most metrics, software entities containing defects generally have larger values than those without defects [39][40][41][42]. In contrast, the other entities, whose values are less than m(x̄), are classified into the defect-free tendency group.
Taking the function y = x^(−2) as an example, its graph is shown in Figure 3. By Equation (13), the maximum curvature point of this power law function lies at x̄ = 1.31 on the x-axis, and the corresponding y value is calculated as 0.5848. The transformation point is thus obtained as shown in Figure 3. Therefore, this study chooses the value m(x̄) at the point of maximal curvature as the threshold value for evaluating whether each entity in the metric is defect-free or defective.
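Equations (13) and (14) can be checked numerically. The sketch below (the helper name is illustrative, not the authors' implementation) computes the maximum-curvature abscissa and the resulting threshold for m(x) = αx^(−γ); for y = x^(−2) it reproduces the values x̄ ≈ 1.31 and m(x̄) ≈ 0.5848 quoted above:

```python
def curvature_threshold(alpha, gamma):
    """Return (x_bar, threshold) for m(x) = alpha * x**(-gamma).

    x_bar is the abscissa of maximal curvature from Equation (13);
    the threshold is m(x_bar) from Equation (14). Assumes alpha, gamma > 0.
    """
    x_bar = (alpha ** 2 * gamma ** 2 * (2 * gamma + 1)
             / (gamma + 2)) ** (1 / (2 * (gamma + 1)))
    return x_bar, alpha * x_bar ** -gamma

# y = x^(-2), i.e. alpha = 1, gamma = 2: x_bar = 5**(1/6) ≈ 1.3077,
# threshold = 5**(-1/3) ≈ 0.5848, matching Figure 3.
x_bar, thr = curvature_threshold(1.0, 2.0)
```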

Approach for Software Defect Prediction Based on Power Law Function
In this section, we present the approach for detecting software defects based on the power law function. The general algorithm of the proposed model is expressed in Algorithm 1, and Figure 4 shows the overall process of this approach. The kernel of our approach is to use the power law function to describe the metrics distribution and to set the transformation value at the point of maximal curvature, given by Equation (14), as the threshold value for labeling the metrics.

The process is stated in detail as follows.

Establish power law functions for each metric
A power law function is used to establish the linear regression model for each metric by sorting parameter values from large to small and numbering corresponding software modules from 1 to n (n is the number of software instances). This is a straight line in the double logarithmic coordinate system.

Calculate each metric's threshold value
The threshold value is the critical value, which can be used to judge whether a software system fault exists under the metrics. Therefore, Equation (13) is used to calculate the position of the maximum curvature value of the obtained power law function. The value obtained from Equation (14) at this position is the approximate boundary between the defective and defect-free metrics.

Identify the defective tendency of metrics
Identifying the defective tendency of instances under a certain metric in software defect prediction is generally based on the assumption that a defective instance's parameter value tends to be higher than that of non-defective instances [39][40][41][42]. Therefore, entities with parameters larger than the threshold value are identified as defective and set as 1. In contrast, other entities are identified as defect-free and set as 0.

Label the instances by clustering
Instances are clustered into a top half and a bottom half, with instances in the top half labeled as defective and others labeled as defect-free.
Based on the proposed approach, we can predict software defects without relying on labeled datasets. The following case studies and comparisons have been conducted to validate the feasibility of the new approach.
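The four steps above can be sketched end-to-end as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and toy data are hypothetical, the per-metric power law is fitted over value ranks, and a simple median split of the per-instance counts stands in for the two-group clustering.

```python
import math

def fit_power_law(values):
    """Fit m(rank) = alpha * rank**(-gamma) to positive values sorted descending."""
    ys = sorted((v for v in values if v > 0), reverse=True)
    lx = [math.log(i + 1) for i in range(len(ys))]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    slope = sxy / sxx
    return math.exp(my - slope * mx), -slope      # alpha, gamma

def threshold(alpha, gamma):
    """Metric threshold at the maximal curvature point (Equations (13)-(14))."""
    x_bar = (alpha ** 2 * gamma ** 2 * (2 * gamma + 1)
             / (gamma + 2)) ** (1 / (2 * (gamma + 1)))
    return alpha * x_bar ** -gamma

def predict(instances):
    """instances: list of metric vectors. Returns 1 (defect-prone) / 0 per instance."""
    n_metrics = len(instances[0])
    thresholds = []
    for j in range(n_metrics):
        alpha, gamma = fit_power_law([row[j] for row in instances])
        thresholds.append(threshold(alpha, gamma))
    # Count over-threshold metrics per instance.
    counts = [sum(v > t for v, t in zip(row, thresholds)) for row in instances]
    # Median split approximates the top-half / bottom-half clustering.
    median = sorted(counts)[len(counts) // 2]
    return [1 if c > median else 0 for c in counts]

# Toy data: two metrics, ten instances; instance 0 has extreme values,
# instances 7 and 8 have one elevated metric each.
rows = [[100, 50], [1, 1], [1, 2], [2, 1], [1, 1],
        [2, 2], [1, 1], [3, 1], [1, 3], [1, 1]]
labels = predict(rows)
```

On this toy data the instances with elevated metric values (0, 7, and 8) are labeled defect-prone and the rest defect-free.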

Case Study
To conduct a case study, we selected the public Camel1.6 dataset. The Camel1.6 dataset has 965 instances, of which 188 were defective and 777 were defect-free [43]. Defective instances accounted for 19.48% of the total instances. Each instance had 20 metric parameters, and the label of each instance was removed in advance. Subsequently, a power law function was established for each metric and the correlation coefficients were calculated.
The power law function, correlation coefficient, and transformation value in the maximal curvature point are shown in Table 1.
Boldface indicates correlation coefficients smaller than 0.3, which did not exhibit the characteristics of a power law function.
Most of the metric parameters can be described well by power law functions. The correlation coefficients were mostly larger than 0.7, indicating that the power law function effectively described the distribution of the metric parameters. Although two metrics were not well described by power law functions, their proportion was relatively small. Figure 5 shows comparisons between the actual metric data scatter and the corresponding power law functions; good consistency was obtained for the different metrics. Based on the transformation value calculated at each power law function's point of maximal curvature, each entity with a value larger than the transformation value was identified as defective and set to 1, and the others were identified as defect-free and set to 0. Following this, the total number of defective entities in each instance was calculated. Finally, instances were labeled by clustering into two groups, a top half and a bottom half; instances in the top half were labeled as defective and the others as defect-free. The number of defects predicted by the proposed approach was 94, while the number of actual defects was 188. Therefore, the precision of the proposed approach with respect to the Camel1.6 dataset was 0.276.

Performance Evaluation
Performance evaluations of software defect prediction are based on the confusion matrix, as shown in Table 2, which includes the measures of precision, recall, and F-measure [44]. True positive (TP) is the number of defective entities predicted as defective.
False negative (FN) is the number of defective entities predicted as defect-free. False positive (FP) is the number of defect-free entities predicted as defective.
True negative (TN) is the number of defect-free entities predicted as defect-free. In this study, the predictive performance measures are computed as follows:

Precision = TP / (TP + FP), Recall = TP / (TP + FN), F-measure = 2 × Precision × Recall / (Precision + Recall).

Precision represents the proportion of correctly predicted defective entities among all entities predicted as defective. Recall represents the proportion of correctly predicted defective entities among all entities that are actually defective. The F-measure is the harmonic mean of precision and recall, with higher F-measure values corresponding to better prediction performance.
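These measures follow directly from the confusion-matrix counts. The sketch below uses illustrative counts, not results from the paper:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Illustrative counts: 60 true positives, 40 false positives, 20 false negatives.
# precision = 60/100 = 0.6, recall = 60/80 = 0.75, F-measure = 0.9/1.35 = 2/3.
p, r, f = prf(60, 40, 20)
```

Note that the F-measure, being a harmonic mean, is pulled toward the smaller of precision and recall, which is why it is a stricter summary than the arithmetic mean.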

Comparative Experiment
To verify the performance of the proposed software defect prediction approach, we selected four classic algorithms used for defect prediction with unlabeled datasets: k-means [16,17], CLA and CLAMI (Clustering and Labeling Approach based on Metric Instances) [4], and x-means [18]. These traditional methods are based on classical data clustering methods such as k-means and x-means. We compared the proposed approach with these algorithms on twelve randomly selected public datasets [43][44][45]. These public datasets are produced from real software systems of different sizes and types and can thus serve as representatives of different software systems.
Details of the datasets used in the comparative experiment (Section 3.3) are shown in Table 3. The selected algorithms are common unsupervised software defect prediction algorithms. The proposed approach was generally superior to the other traditional methods in terms of precision, recall, and F-measure. Although the average precision value obtained by the proposed approach was not the best, the approach performed the best on 6 of the 12 datasets in terms of precision, giving it the largest number of optimal results among the tested methods. Furthermore, the average precision values of the proposed approach and the k-means algorithm were relatively similar.
For the recall value, the proposed approach's advantage was much more apparent. It performed the best in 10 of the 12 datasets and had an average value of 0.790, which was the highest among the tested methods. It was obvious that the proposed approach performed significantly better than traditional defect prediction models on almost all datasets. At the same time, some researchers have indicated that prediction models with low precision and high recall were more useful in many industrial situations [46]. The proposed approach obtained a better recall value than the precision value and thus would perform better in these areas.
For the F-measure value, the proposed approach performed the best in 7 of the 12 datasets and had the largest average value of 0.501 among the tested methods. It was clear that the proposed approach obtained the best F-measure value and thus demonstrated the proposed approach's effectiveness at software defect prediction with unlabeled datasets.
Therefore, based on these public datasets, it can be seen that the proposed approach obtains the best prediction results among the five tested algorithms. Nevertheless, considering the limitations of data timeliness and scale, these datasets cannot fully represent real software defects, including newly emerging ones. The proposed approach can be further validated when applied to real software systems.

Complexity Analysis
Algorithm complexities of the other algorithms tested in this study are shown in Table 7. The complexity of k-means, CLA, and CLAMI were obtained from references [4,17]. The proposed approach has a relatively lower complexity than the other algorithms, indicating a low cost in software defect predictions. Therefore, from the perspective of algorithm complexity, our proposed approach possesses some advantages over the other tested algorithms to some extent.

Conclusions
In this study, we proposed a novel approach that adopts the characteristics of the power law function for software defect prediction. The kernel of our approach is to use the power law function to describe metric distributions and to set the transformation value at the point of maximal curvature of each power law function curve as the threshold value for labeling metrics. In our empirical studies, we found that the proposed approach performed significantly better than the four other commonly used algorithms across the twelve evaluated datasets. The average values of recall and F-measure were improved by 14.3% and 6.0%, respectively. Furthermore, the proposed approach had a complexity of O(2n), which was the lowest among the five tested algorithms. Therefore, we demonstrated that our proposed approach is feasible and highly efficient at defect prediction with unlabeled datasets.
In summary, the proposed approach offers a viable choice for software defect prediction on unlabeled datasets. However, similar to any other method, there are some issues to handle in future work. The precision of the prediction should be analyzed in-depth and improved in the future. At the same time, future work should attempt to use the proposed approach to rank predictions.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The metrics and their descriptions for the Camel1.6 dataset utilized in the case study.

Metrics Description
wmc	Weighted methods per class
dit	The maximum distance from a given class to the root of an inheritance tree
noc	Number