Credibility Based Imbalance Boosting Method for Software Defect Proneness Prediction

Abstract: Imbalanced data are a major factor degrading the performance of software defect prediction models. Software defect datasets are imbalanced in nature, i.e., the number of non-defect-prone modules is far greater than that of defect-prone ones, which biases classifiers toward the majority class. In this paper, we propose a novel credibility-based imbalance boosting (CIB) method to address the class-imbalance problem in software defect proneness prediction. The method measures the credibility of synthetic samples based on their distribution by introducing a credit factor for every synthetic sample, and proposes a weight updating scheme that makes the base classifiers focus on real samples and on synthetic samples with high credibility. Experiments are performed on 11 NASA datasets and nine PROMISE datasets, comparing CIB with MAHAKIL, AdaC2, AdaBoost, SMOTE, RUS, and no sampling in terms of four performance measures: area under the curve (AUC), F1, adjusted F-measure (AGF), and Matthews correlation coefficient (MCC). The Wilcoxon signed-rank test and Cliff's δ are used to perform statistical tests and to calculate effect sizes, respectively. The experimental results show that CIB is a more promising alternative for addressing the class-imbalance problem in software defect proneness prediction compared with previous methods.


Introduction
Software defect prediction has been an important research topic in the field of software engineering for more than three decades [1]. Software defect prediction models help to reasonably allocate limited test resources and improve test efficiency by identifying defective modules before software testing, which has drawn increasing attention from both the academic and industrial communities [2][3][4][5][6][7][8]. Software defect prediction can be regarded as a binary classification problem, where software modules are classified as defect-prone or non-defect-prone. By mining historical defect datasets with statistical or machine learning techniques, software defect proneness prediction models are built to establish the relationship between software metrics (the independent variables) and the defect proneness of software modules (such as methods, classes, and files), and are then used to predict the labels (defect-prone or non-defect-prone) of new software modules. As the independent variables of SDP models, many software metrics [9][10][11][12] have been proposed in previous studies.

RQ1: How effective is CIB? How much improvement can it obtain compared with the baselines?
We compare CIB with the baselines on 11 cleaned NASA datasets. On average, CIB always obtains the best performance in terms of F1, AGF, MCC, and AUC, and improves F1/AGF/MCC/AUC over the baselines by at least 9%/0.2%/3.7%/1.2%.

RQ2: How is the generalization ability of CIB?
To answer this question, we further perform CIB on nine PROMISE datasets [43]. On average, CIB still outperforms all the baselines in terms of F1, AGF, AUC, and MCC, which indicates that CIB has good generalization ability.

RQ3: How much time does it take for CIB to run?
We find that the training process of CIB is relatively time-consuming compared with the baselines, but the testing time of all methods is very small. The average training and testing time of CIB on the 11 NASA datasets is 223.25 s and 0.168 s, respectively.

This paper mainly makes the following contributions: (1) We propose a novel class-imbalance learning method, named credibility-based imbalance boosting (CIB), for software defect proneness prediction. Differing from existing oversampling methods, CIB treats the synthetic minority samples and the real minority samples differently by using the proposed concept of a credit factor. To the best of our knowledge, we are the first to propose the concept of a credit factor for synthetic minority class samples.
(2) Extensive experiments are conducted on 11 NASA datasets to compare CIB with existing state-of-the-art baselines in terms of F1, AGF, AUC, and MCC. The experimental results show that CIB is a more promising alternative for addressing the class-imbalance problem in software defect proneness prediction. We also perform experiments on nine PROMISE datasets to investigate the generalization ability of CIB.
(3) To increase the reliability of this study, the benchmark datasets and the source code of both CIB and the baselines are publicly available on GitHub (https://github.com/THN-BUAA/CIB-master.git).
The rest of this paper is organized as follows: Section 2 briefly reviews the related work on software defect prediction and class-imbalance learning. Section 3 presents the details of our proposed method. Section 4 introduces our experimental design. Section 5 presents the experimental results. Section 6 discusses the working mechanism of CIB, the meaning of our findings, and the limitations of our study. Finally, our research is summarized in Section 7.

Class-Imbalance Learning
In the machine learning field, class-imbalance learning methods can be divided into two categories [68]: data-level methods and algorithm-level methods. Data-level methods are characterized by their easy implementation and independence from the model training process, whereas algorithm-level methods need to modify the classifiers.

Data-Level Methods
Data-level methods pre-process the data by increasing or decreasing the number of samples to balance the imbalanced data; they include under-sampling methods (e.g., random under-sampling, RUS) and oversampling methods (e.g., the synthetic minority oversampling technique, SMOTE [40]). Under-sampling methods such as RUS remove majority class samples, while oversampling methods add synthetic minority class samples to make the data balanced.
SMOTE expands the discriminant boundary of the classifier from the minority class toward the majority class. However, since SMOTE randomly selects the minority class samples used to generate synthetic minority class samples, it may cause overlapping or over-fitting problems. To deal with this problem, many researchers have proposed new methods based on SMOTE, such as Borderline-SMOTE [41], MWMOTE (Majority Weighted Minority Oversampling TEchnique) [42], and ADASYN (Adaptive Synthetic) [69]. Borderline-SMOTE only over-samples the minority class samples on the boundary, based on the assumption that samples on the boundary are more likely to be misclassified by the classifier than samples away from it. MWMOTE selects minority class samples on the border according to their weighting factors: the larger the weight, the greater the probability of being selected; it then over-samples the selected minority class samples. ADASYN generates synthetic samples for minority class samples that are easily misclassified; the larger the weight of a minority class sample on the boundary, the more synthetic samples are generated for it. These over-sampling methods try to avoid generating noisy samples; however, the numbers and regions of synthetic samples to generate are limited, since synthetic samples are only allowed to be introduced near the centre of minority sample clusters or near the boundaries.
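The core interpolation step of SMOTE can be sketched as follows. This is a minimal, self-contained illustration on toy 2-D points (the function name smote_sketch and the toy data are ours), not the WEKA implementation used later in this paper:

```python
import random

def smote_sketch(minority, k=2, n_new=4, seed=0):
    """Generate synthetic minority samples by interpolating between a
    randomly chosen sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x (excluding x itself)
        neighbors = sorted(
            (p for p in minority if p != x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_samples = smote_sketch(minority)
print(len(new_samples))  # 4 synthetic samples on segments between minority points
```

Because each synthetic point lies on a line segment between two minority samples, no synthetic point can fall outside the convex hull of the minority class, which is exactly why boundary-aware variants such as Borderline-SMOTE were proposed.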

Algorithm-Level Methods
Algorithm-level methods modify the classification algorithm to alleviate the effect of class imbalance on the minority class samples; they include cost-sensitive methods [70,71], ensemble methods [46,72], and hybrid methods, such as cost-sensitive boosting methods [45,73] and oversampling boosting methods [74,75].
Cost-sensitive methods take the misclassification cost into consideration when building the prediction model and try to develop a classifier with the lowest cost. Before using cost-sensitive learning methods, users must define a cost matrix. Let C(i, j) denote the cost of classifying a sample of class i as class j. In a binary classification task, C(0, 1) represents the cost of misclassifying a negative sample as a positive sample, and C(1, 0) represents the cost of misclassifying a positive sample as a negative sample. In software defect proneness prediction, cost-sensitive learning methods assign a larger cost to false negatives (i.e., C(1, 0)) than to false positives (C(0, 1)), resulting in a performance improvement on the minority (i.e., positive) class samples.
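The effect of an asymmetric cost matrix on the decision rule can be sketched as follows: a module is flagged as defect-prone when the expected cost of predicting negative exceeds that of predicting positive. This is a generic illustration of the principle (function name is ours), not any specific cost-sensitive algorithm from the paper:

```python
def min_cost_label(p_positive, c_fn, c_fp):
    """Pick the label with the lower expected misclassification cost.
    c_fn = C(1, 0): cost of missing a defect-prone module (false negative)
    c_fp = C(0, 1): cost of a false alarm (false positive)."""
    cost_if_predict_neg = p_positive * c_fn        # we would miss true positives
    cost_if_predict_pos = (1 - p_positive) * c_fp  # we would raise false alarms
    return 1 if cost_if_predict_neg > cost_if_predict_pos else 0

# With equal costs, a module with predicted probability 0.3 stays negative ...
print(min_cost_label(0.3, c_fn=1, c_fp=1))  # 0
# ... but with C(1,0) = 5, the same module is flagged as defect-prone.
print(min_cost_label(0.3, c_fn=5, c_fp=1))  # 1
```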
Ensemble methods combine multiple base learners (such as decision trees, logistic regression, and support vector machines) to construct a strong model. There are two famous ensemble learning frameworks: bagging, exemplified by random forests [76], and boosting, exemplified by AdaBoost [46]. AdaBoost is an effective ensemble method proposed by Freund and Schapire. It combines a series of base classifiers into a stronger classifier over several learning iterations. In each iteration, misclassified samples are given higher weights to make the next classifier focus on these samples. The learning process tries to minimize the classification error over all the samples.
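A single AdaBoost round's weight update can be illustrated with the following sketch (a toy example with a precomputed correctness mask standing in for a trained weak learner; function and variable names are ours):

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost iteration: compute the weighted error, the model
    weight alpha, and the renormalized sample weights."""
    eps = sum(w for w, c in zip(weights, correct) if not c)
    alpha = 0.5 * math.log((1 - eps) / eps)
    new_w = [w * math.exp(-alpha if c else alpha)
             for w, c in zip(weights, correct)]
    z = sum(new_w)  # normalization factor
    return eps, alpha, [w / z for w in new_w]

# Five samples with uniform weights; the weak learner misclassifies one.
weights = [0.2] * 5
correct = [True, True, True, True, False]
eps, alpha, new_w = adaboost_round(weights, correct)
print(round(eps, 2))        # 0.2
print(new_w[4] > new_w[0])  # True: the misclassified sample gains weight
```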
Some algorithm-level methods combine the boosting algorithm and the cost-sensitive algorithm, such as AdaC2 [45] and AdaCost (Cost-Sensitive Boosting) [73]. Meanwhile, other methods combine boosting with oversampling, e.g., RAMOBoost (Ranked Minority Oversampling in Boosting) [74]. RAMOBoost assigns weights to misclassified minority class samples based on the number of neighboring majority class samples and selects the misclassified minority class samples with higher weights to generate synthetic samples.

Problem Formulation
Given a labeled and class-imbalanced training dataset S_tr = {(x_i, y_i)}_{i=1}^{n}, y_i ∈ {0, 1}, and an unlabeled test dataset {x_j}_{j=1}^{m} having the same distribution as the training dataset, where x_i, x_j ∈ R^d respectively represent the i-th and j-th instance (i.e., software module) having d features (i.e., metrics) in the training and test datasets, n and m denote the numbers of samples in the training and test datasets, y_i = 1 denotes that sample x_i belongs to the positive class (i.e., the defect-prone module), and y_i = 0 denotes that x_i belongs to the negative class (i.e., the non-defect-prone module).
The objective of this study is to build a software defect proneness prediction model using the training data S_tr and then predict the labels (i.e., defect-prone or non-defect-prone) of the unlabeled test samples.


Proposed Credibility-Based Imbalance Boosting Method

The boosting procedure of CIB (steps 9-19 of Algorithm 1) proceeds as follows:
9: Initialize the distribution weight of S_com: each real sample receives weight 1 and the j-th synthetic sample receives its credibility factor cf_{j−n}; normalize the weights so that they sum to 1, yielding D_1.
10: for t = 1 → T do
11: Provide S_com with distribution weights D_t to the learner, then get back a hypothesis h_t : x → y.
12: Calculate the weighted error ε_t of h_t on S_com.
13: if ε_t ≥ 0.5 then
14: End this iteration.
15: end if
16: Set α_t = (1/2) ln((1 − ε_t)/ε_t).
17: Update the distribution weight to obtain D_{t+1}.
18: end for
19: Output the final hypothesis.

Calculate the Credibility Factors of Synthetic Samples
Given the training data S_tr = S_tr^pos ∪ S_tr^neg, where S_tr^pos and S_tr^neg represent the positive (minority) class samples and the negative (majority) class samples, respectively, we first perform SMOTE [40] on S_tr^pos and denote the synthetic minority class samples generated by SMOTE as S_syn, containing n_syn samples. SMOTE has two important parameters, P and k, where P denotes how many synthetic minority class samples will be generated (e.g., P = 100% means that the number of synthetic minority class samples equals the number of real minority class samples) and k is the number of nearest neighbors. For each synthetic sample, we determine its K (default 10) nearest neighbors among the real samples and then calculate the ratio of the number of minority class samples to that of majority class samples within these K nearest neighbors:

r_j = n_j^min / n_j^maj, j = 1, 2, ..., n_syn,

where n_j^min and n_j^maj denote the number of minority class samples and the number of majority class samples among the K nearest neighbors of the j-th synthetic minority class sample, and n_syn is the number of samples in S_syn.
A high value of r means that a synthetic sample is surrounded by more minority class samples than majority class samples. A low value of r means that the synthetic sample may have been introduced in a region dense with majority class samples, or in a sparse region with no minority class samples around. A synthetic sample with a high value of r is more reliable and should be assigned a higher credibility factor. We define the credibility factor cf_j of a synthetic sample as a nonlinear, monotonically increasing function of r_j (Equation (2)), where β controls the steepness of the nonlinear curve.
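The neighborhood ratio can be computed as sketched below. Because the exact functional form of the credibility factor (Equation (2)) is not reproduced here, the sketch uses an assumed saturating form cf(r) = 1 − exp(−β·r); the true form in the paper may differ, but this one matches the described behavior (monotonically increasing in r, steepness controlled by β, cf(2/(K−2)) = 0.9 under the paper's parameter setting):

```python
import math

def neighbor_ratio(synthetic, real_min, real_maj, K=10):
    """r_j = (# minority neighbors) / (# majority neighbors) among the
    K nearest real samples of a synthetic sample."""
    labeled = [(p, 1) for p in real_min] + [(p, 0) for p in real_maj]
    labeled.sort(key=lambda t: sum((a - b) ** 2 for a, b in zip(synthetic, t[0])))
    top = [lab for _, lab in labeled[:K]]
    n_min, n_maj = top.count(1), top.count(0)
    return n_min / n_maj if n_maj else float('inf')

def credit_factor(r, beta):
    # Assumed saturating form: cf -> 1 as r grows; beta sets the steepness.
    return 1 - math.exp(-beta * r)

K = 10
beta = math.log(10) / (2 / (K - 2))  # makes cf(2/(K-2)) = 0.9
print(round(credit_factor(2 / (K - 2), beta), 2))  # 0.9
```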

Update the Weights of Synthetic Samples in Boosting Iterations
After calculating the credibility factors of all the synthetic samples, a new training dataset (denoted as S_com) consisting of both synthetic samples and real samples is obtained. The initial weight of each synthetic sample is its credibility factor, and the weight of each real sample is set to 1 (i.e., the largest credibility factor). That is to say, real samples have complete credibility compared with the synthetic samples. We denote the initialized distribution weights of S_com as D_1, which are normalized so that the sum of all weights equals 1. For the t-th (t = 1, 2, ..., T) iteration, we first train the given base classifier on S_com with distribution weights D_t and denote the trained classifier as h_t. We next calculate the weighted misclassification error of h_t on S_com as

ε_t = Σ_{i=1}^{N} D_t(i) · I(h_t(x_i) ≠ y_i),

where N denotes the number of samples in S_com and I(·) is the indicator function. Based on the error ε_t, the distribution weight is updated as

D_{t+1}(i) = (D_t(i)/Z_t) · exp(α_t) if h_t(x_i) ≠ y_i, and D_{t+1}(i) = (D_t(i)/Z_t) · exp(−α_t) otherwise,

where Z_t is a normalization factor ensuring that D_{t+1} sums to 1. In this weight updating scheme, the weights of the synthetic samples depend on the weights of the nearest real minority class samples among their k neighbors and on the classification results of the weak classifiers. In other words, in each iteration, if a real minority class sample is misclassified, then its weight, as well as the weights of the synthetic samples close to it, is increased. The decision boundary is thus extended toward misclassified real minority class samples by increasing the weights of the synthetic samples around them. Meanwhile, the weights of synthetic samples that have no real minority class samples around them are suppressed by their low credit factors, as such samples should be neglected to avoid incorrectly shifting the decision boundary toward the minority class, which may sacrifice accuracy on the majority class samples.
When the iteration stop condition is satisfied, we obtain the final hypothesis, i.e., the learned classifier h_f(x):

h_f(x) = arg max_{y ∈ {0,1}} Σ_{t=1}^{T} α_t · I(h_t(x) = y),

where α_t is the weight of the t-th trained base learner h_t(x).
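Putting the pieces together, the credibility-aware weighting scheme can be sketched as follows. This is a simplified illustration with precomputed per-round correctness masks standing in for trained base classifiers (function and variable names are ours), not the full CIB implementation:

```python
import math

def cib_boost(credits, n_real, rounds_correct):
    """Sketch of CIB's weighting: real samples start with weight 1,
    synthetic samples start with their credit factors; each round then
    reweights as in AdaBoost. rounds_correct holds one mask per round."""
    w = [1.0] * n_real + list(credits)      # step 9: credibility-aware init
    s = sum(w)
    w = [x / s for x in w]                  # normalize to a distribution
    alphas = []
    for correct in rounds_correct:          # steps 10-18
        eps = sum(wi for wi, c in zip(w, correct) if not c)
        if eps >= 0.5:                      # step 13: skip useless rounds
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        w = [wi * math.exp(-alpha if c else alpha) for wi, c in zip(w, correct)]
        z = sum(w)
        w = [wi / z for wi in w]
        alphas.append(alpha)
    return w, alphas

# Three real samples plus two synthetic ones (credits 0.9 and 0.2);
# one round in which only the last (low-credit) synthetic sample is wrong.
w, alphas = cib_boost([0.9, 0.2], n_real=3, rounds_correct=[[True] * 4 + [False]])
print(len(alphas))  # 1
```

Note how the low-credit synthetic sample contributes little to ε_t at initialization, so an unreliable synthetic sample cannot pull the decision boundary as strongly as a real one.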

Benchmark Datasets
The NASA datasets have been widely used in previous studies [3,27,67]. Because the original NASA datasets have data quality problems [5,78], the cleaned version from the tera-PROMISE repository [43] is used as the benchmark. Table 1 shows the statistics of the eleven cleaned NASA datasets. For each dataset, the statistical information includes the project name, the number of software metrics (# Met., for short), the number of instances (# Ins.), the number of defective instances (# Def. Ins.), and the defect ratio (i.e., the proportion of defective instances among all instances). Each data instance (or sample) includes two kinds of information: the metrics of one software module (e.g., a function) and the corresponding class label indicating whether this module contains defects. The module metrics include Halstead metrics [10], McCabe metrics [9], and other metrics. The details of the software metrics used in these datasets can be seen in [3]. We can see that all of the datasets are very imbalanced, especially MC1 and PC2. With varying sample sizes, metrics, and defect ratios, these datasets provide an extensive scenario for the evaluation of different class-imbalance learning methods.

Baseline Methods
CIB is compared with five state-of-the-art class-imbalance learning methods and with no class-imbalance learning (None, for short). A brief introduction of these methods follows. • MAHAKIL. A novel synthetic oversampling approach based on the chromosomal theory of inheritance for software defect prediction, proposed by Bennin et al. [44] in 2017. MAHAKIL utilizes the features of two parent instances to generate a new synthetic instance, which ensures that the artificial sample falls within the decision boundary of any classification algorithm. • AdaBoost. One of the most well-known and commonly used ensemble learning algorithms, proposed by Freund and Schapire [46] in 1995. AdaBoost iteratively generates a series of base classifiers. In each iteration, the classifier is trained on the training dataset with specific distribution weights over the instances and is assigned a model weight according to the training error. The distribution weights of the training instances are then updated: misclassified training instances get a higher weight and correctly classified ones get a smaller weight, which ensures that the decision boundary will be adjusted toward the misclassified instances. AdaBoost was identified as one of the top ten most influential data mining algorithms [79]. • AdaC2. Proposed by Sun et al. [45], AdaC2 combines the advantages of cost-sensitive learning and AdaBoost. Sun et al. [45] argued that AdaBoost treats samples of different classes equally, which is inconsistent with the common situation that the minority class samples are usually more important than the majority class ones. AdaC2 introduces cost items into the weight update formula of AdaBoost, placing them outside the exponent. • SMOTE. Proposed by Chawla et al. [40], SMOTE is the most famous and widely used oversampling approach for addressing the class-imbalance problem. SMOTE tries to alleviate the imbalance of the original dataset by generating synthetic minority class samples in the region of the original minority class samples. • RUS. RUS decreases the number of majority class samples (i.e., non-defective modules) by randomly removing existing majority class samples, such that the two classes have the same number of samples. RUS has been used in previous studies [6,35].

Performance Measures
In this paper, four overall performance measures, namely AUC, F1, adjusted F-measure (AGF), and MCC, are used to evaluate the prediction performance. According to [80,81], some performance measures, such as Accuracy and Precision, are not suitable for evaluating prediction performance on highly imbalanced datasets. Therefore, the four overall measures AUC, F1, MCC, and AGF are used in this study; they have been widely used in previous studies [3,82,83]. The larger these measures are, the better the prediction performance. These measures can be calculated from the confusion matrix shown in Table 2. In the SDP field, a defect-prone module is generally regarded as a positive class (i.e., minority class) sample and a non-defect-prone one as a negative class (majority class) sample.
F1 is the weighted harmonic mean of Precision and PD, defined as follows:

F1 = 2 × Precision × PD / (Precision + PD),

where Precision = TP/(TP + FP) and PD = TP/(TP + FN).
Adjusted F-measure (AGF) [84] is an adjusted version of the F-measure, which suffers from the well-known problem of not considering the TNR. According to Maratea et al. [84], AGF is defined as

AGF = sqrt(F2 × InvF0.5),

where F2 is calculated according to F_α = (1 + α²) × Precision × PD / (α² × Precision + PD) with α = 2 and Precision = TP/(TP + FP); InvF0.5 is calculated as the standard F0.5 after switching the class labels of the samples to construct a new confusion matrix.
AUC is the area under the receiver operating characteristic (ROC) curve and ranges from 0 to 1; it was first used to evaluate machine learning algorithms by Bradley [85]. The ROC curve is obtained by plotting PF on the X-axis and PD on the Y-axis. AUC is a widely used measure because it is rarely affected by class imbalance [82]. The AUC value of a random prediction model is 0.5. Given a testing dataset of 20 samples, the actual labels and the scores, i.e., the predicted probabilities of being the positive class, are shown in Table 3. In the table, P and N represent the positive and negative classes, respectively. We first sort the scores in descending order and take each score as the threshold in turn. For a sample, if its score is not smaller than the threshold, its label is predicted as the positive class (i.e., P). For example, when the score of the 3rd sample is used as the threshold, the 1st and 2nd samples are classified as the positive class and the remaining samples are predicted as the negative class. For each threshold, we can construct a confusion matrix, as shown in Table 2, and then obtain a two-tuple (PF, PD). Because the testing dataset has 20 samples, we obtain 20 such tuples. We then construct a two-dimensional (2-D) coordinate system whose x-axis and y-axis represent PF and PD, respectively. Plotting the 20 points yields the ROC curve shown in Figure 1; the area under this curve, i.e., the AUC, equals the sum of the areas of the rectangles under the curve. In this study, we directly use the MATLAB built-in function perfcurve to calculate AUC.

MCC, proposed by Matthews [86], is an overall performance measure that takes TP, TN, FP, and FN into consideration. MCC has been widely used in previous SDP studies [24,25,83,87], since it can be utilized even if the data are imbalanced [88]. The definition of MCC, as described in [86], is

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)).
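The measures above can be computed directly from the confusion matrix and the predicted scores. The following sketch (our own helper functions, not the MATLAB perfcurve implementation) illustrates F_α, MCC, and a rank-based AUC that equals the area under the ROC curve:

```python
import math

def f_alpha(tp, fp, fn, alpha=1.0):
    """F_alpha from a confusion matrix (alpha=1 gives F1, alpha=2 gives F2)."""
    precision = tp / (tp + fp)
    pd = tp / (tp + fn)  # recall, a.k.a. probability of detection
    return (1 + alpha ** 2) * precision * pd / (alpha ** 2 * precision + pd)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom

def auc(labels, scores):
    """Rank-based AUC: probability that a random positive outscores a
    random negative (ties count 0.5); equals the area under the ROC curve."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(round(f_alpha(tp=8, fp=2, fn=2), 2))      # 0.8
print(round(mcc(tp=8, tn=8, fp=2, fn=2), 2))    # 0.6
print(auc([1, 1, 0, 0], [0.9, 0.8, 0.7, 0.1]))  # 1.0
```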

Statistical Test
In this study, we use the Wilcoxon signed-rank test (at a significance level of 0.05) to determine whether our proposed CIB statistically outperforms the baselines on each dataset. The Wilcoxon signed-rank test is a non-parametric alternative to the paired t-test; unlike the paired t-test, it does not assume that the samples are normally distributed. The Wilcoxon signed-rank test has been widely used in previous SDP studies [21,89].
Moreover, Cliff's δ [90], a non-parametric effect size measure, is used to further quantify the magnitude of the difference. The magnitude is often assessed according to the thresholds provided by Romano et al. [91]: negligible (|δ| < 0.147), small (0.147 ≤ |δ| < 0.330), medium (0.330 ≤ |δ| < 0.474), and large (|δ| ≥ 0.474). Cliff's δ has also been widely used in previous studies [13,87] and can be computed according to the method proposed by Macbeth et al. [92]. In this study, we use a dark gray background cell to show that our CIB significantly outperforms the baseline with a large effect size, a light gray cell to show a significant improvement with a medium effect size, and a silvery gray cell to show a significant improvement with a small effect size.
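Cliff's δ and the Romano et al. thresholds can be sketched as follows (helper names are ours):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#(x > y) - #(x < y)) / (len(xs) * len(ys)),
    ranging from -1 (all y dominate) to +1 (all x dominate)."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(delta):
    """Interpretation thresholds from Romano et al. [91]."""
    d = abs(delta)
    if d < 0.147:
        return 'negligible'
    if d < 0.330:
        return 'small'
    if d < 0.474:
        return 'medium'
    return 'large'

# e.g., per-run F1 values of two methods over three repetitions:
d = cliffs_delta([0.6, 0.7, 0.8], [0.5, 0.55, 0.65])
print(round(d, 2), magnitude(d))  # 0.78 large
```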

Validation Method
The k-fold cross-validation (CV) [93] is the most commonly used method for validating the prediction performance of SDP models in previous studies [60,94,95]. In this paper, 5-fold CV with 30 runs is used. Five-fold CV is used for two main reasons: (1) software defect datasets are usually very imbalanced, and too many folds are likely to leave the testing data with few defective samples; (2) it has been used in previous studies [60,95]. To alleviate possible sampling bias in the random splits of CV, we repeat the five-fold CV process 30 times; 30 is used because it is the minimum number of samples needed to satisfy the central limit theorem, as done in [63].
Specifically, for a defect dataset, we randomly split it into five folds of approximately the same size and the same defect ratio. Then, four folds are used as the training data and the remaining fold is used as the test data; in this way, each fold has a chance to be taken as the test data. We take the average performance on the five different test folds as the performance of one five-fold CV. Finally, we repeat the above process 30 times and report the average performance on each dataset.
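The stratified splitting step can be sketched as follows (a minimal index-based illustration; the function name is ours):

```python
import random

def stratified_kfold(labels, k=5, seed=0):
    """Split sample indices into k folds with approximately equal size
    and equal defect ratio, by distributing each class round-robin."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

labels = [1] * 10 + [0] * 40  # 20% defect ratio
folds = stratified_kfold(labels)
print([len(f) for f in folds])                     # [10, 10, 10, 10, 10]
print([sum(labels[i] for i in f) for f in folds])  # [2, 2, 2, 2, 2]
```

Each fold keeps the original 20% defect ratio, so even the held-out test fold always contains defective samples.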

Parameter Settings
We set the maximum number of iterations T in Algorithm 1 to 20, the same as for the boosting-related baselines, i.e., AdaC2 and AdaBoost. Because SMOTE is one step of our CIB, in this paper we use the WEKA data mining toolkit [96] to implement SMOTE, which has two key parameters: the percentage of synthetic minority class samples to generate, P, and the number of nearest neighbors, k. We let k = 5 and P = (floor(n_neg/n_pos) − 1) × 100, where n_neg and n_pos denote the number of majority class samples and the number of minority class samples in a given training dataset. The reason for setting k = 5 is twofold: (1) it is the default value in WEKA, and (2) it was used by Chawla et al. [40] when they proposed the SMOTE algorithm. The reason for setting P = (floor(n_neg/n_pos) − 1) × 100 rather than a fixed value is that P is then automatically adjusted according to the defect ratio of each defect dataset. The parameter β in Equation (2) is set so that the value of cf equals 0.9 when r = 2/(K − 2), which makes the credit factors of synthetic samples having two or more real minority class samples among their neighbors exceed 0.9.
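The adaptive setting of P can be sketched as follows (the function name is ours; the formula is the one given above):

```python
import math

def smote_percentage(n_neg, n_pos):
    """P = (floor(n_neg / n_pos) - 1) * 100: after adding P% synthetic
    minority samples, the two classes are roughly balanced."""
    return (math.floor(n_neg / n_pos) - 1) * 100

# e.g., 400 non-defective vs. 50 defective modules:
print(smote_percentage(400, 50))  # 700
```

With P = 700%, SMOTE adds 7 × 50 = 350 synthetic samples, bringing the minority class to 400, the same as the majority class.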
With respect to the baselines, the settings follow the descriptions in the original studies where available; otherwise, the default settings in the toolkit are used. The cost factor of AdaC2 is set to the imbalance ratio. AdaBoost is implemented with WEKA in this paper, and its maximum number of iterations T is also set to 20; a value that is too small may fail to take advantage of ensemble learning, while a value that is too large may cause over-fitting and be time-consuming. For the baseline SMOTE, we also use the WEKA data mining toolkit [96] with the same settings as in CIB. The number of majority class samples removed by RUS equals the difference between the number of majority class samples and that of minority class samples.

Base Classifier
In this paper, we consider C4.5 decision tree as the base classifier and implement it using the WEKA data mining toolkit [96].
J48 is a Java implementation of the C4.5 algorithm [97]. It uses a greedy technique for classification and generates decision trees whose nodes evaluate the existence or significance of individual features. Leaves in the tree structure represent classifications and branches represent the judging rules, so a tree is easily transformed into a set of production rules. One of the biggest advantages of decision trees is their good comprehensibility. Decision trees have been used as the base classifier in many previous studies [3,67].

RQ1: How Effective Is CIB? How Much Improvement Can It Obtain Compared with the Baselines?
Tables 4-7 respectively show the comparison results of F1, AGF, MCC, and AUC for our CIB and the baselines on each of the 11 benchmark datasets. In each table, the best value on each dataset is highlighted in bold. The row 'Average' presents the average performance of each model across all benchmark datasets. The row 'Win/Tie/Lose' shows the comparison results of the Wilcoxon signed-rank test (5%) when comparing our CIB with each baseline; specifically, Win/Tie/Lose represents the number of datasets on which our CIB beats, ties with, or loses to the corresponding baseline, respectively. We use different background colors to show the results of the effect size test (i.e., Cliff's δ): deep gray, light gray, and silvery gray indicate that the proposed CIB significantly outperforms the corresponding baseline with large, medium, and small effect size, respectively. If the baseline significantly outperforms our CIB or the effect size is negligible, the corresponding table cell is marked with a white background. A checkmark represents that the corresponding method has no significant difference compared with the best method.

With respect to F1 (see Table 4), we notice that, on average, CIB obtains the best F1 of 0.285 across all datasets, which improves the performance over the baselines by at least 9% (see AdaC2 [45]). The F1 of CIB ranges from 0.148 (on dataset PC2) to 0.621 (on dataset PC4) across all 11 datasets. According to the 'Win/Tie/Lose' results and the background of the table cells, our CIB nearly always significantly outperforms AdaBoost and None with large effect size on all 11 datasets. Moreover, on most datasets, CIB significantly outperforms MAHAKIL, AdaC2, AdaBoost, and SMOTE with at least medium effect size and performs significantly worse than the baselines on at most one dataset.
With respect to AGF (see Table 5), we notice that, on average, CIB obtains the best AGF across all datasets, closely followed by SMOTE and MAHAKIL. The AGF of CIB ranges from 0.033 (on dataset MC1) to 0.356 (on MC2) across all 11 datasets. According to the results of the Wilcoxon signed-rank test (see the row 'Win/Tie/Lose') and the background of the table cells, on most datasets our CIB significantly outperforms AdaC2, AdaBoost, and None with large effect size. Moreover, CIB has similar performance to SMOTE, MAHAKIL, and RUS.

Table 5. The comparison results of AGF for all methods in the form of mean ± standard deviation. The best value is in boldface.

With respect to MCC (see Table 6), we can see that, on average, CIB obtains the best MCC of 0.285, which improves the performance over the baselines by at least 3.7% (see AdaC2 [45]). The MCC of CIB ranges from 0.137 to 0.559 across all 11 datasets. According to the 'Win/Tie/Lose' results and the background of the table cells, CIB nearly always significantly outperforms None with a large effect size on all 11 datasets. On most datasets, our CIB significantly outperforms the remaining baselines, including MAHAKIL, AdaC2, AdaBoost, SMOTE, and RUS, with at least a medium effect size and performs significantly worse than the baselines on at most two datasets.
With respect to AUC (see Table 7), we can see that CIB achieves the largest AUC of 0.769 on average, which improves the performance over the baselines by at least 1.2% (see AdaBoost [46]). The AUC of CIB ranges from 0.672 to 0.923 across all 11 datasets. According to the 'Win/Tie/Lose' results and the background of the table cells, our CIB always significantly outperforms MAHAKIL, SMOTE, RUS, and None with large effect size on each of the 11 datasets. On at least half of the 11 datasets, CIB significantly outperforms the remaining two baselines, i.e., AdaC2 and AdaBoost, with non-negligible effect size, and has no significant difference with them on four out of 11 datasets.

Table 6. The comparison results of MCC for all methods in the form of mean ± standard deviation. The best value is in boldface.

Deep gray, light gray, and silvery gray backgrounds indicate that our CIB method significantly outperforms the corresponding baseline with large, medium, and small effect size, respectively.

Table 7. The comparison results of AUC for our CIB and the baselines in the form of mean ± standard deviation. The best value is in boldface.

RQ2: How Is the Generalization Ability of CIB?
According to the experimental results on the NASA datasets, we find that CIB is more promising for dealing with the class-imbalance problem in SDP than existing methods. Here, we want to investigate whether these findings can be generalized to other scenarios, e.g., to other defect datasets. To this end, we compared CIB with the existing methods (see Section 4.2) on the PROMISE datasets [43]. The PROMISE repository includes many defect datasets collected from open-source Java projects, such as camel, ant, etc. The details of the nine PROMISE datasets are shown in Table 12. We adopt the same settings as in Section 4.5. Due to space limitations, we report the experimental results of F1, AGF, AUC, and MCC with box plots, as shown in Figure 3.
From the figures, we can see that CIB still outperforms all baselines in terms of F1, AGF, AUC, and MCC. Therefore, we conclude that the findings generalize well.

RQ3: How Much Time Does It Take for CIB to Run?
To answer this question, we measure the training and testing time of CIB and the baselines. The training time includes the time taken for data preprocessing and model training. The testing time is the time that is taken to process the testing dataset until the value of performance measures is obtained.
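The timing protocol described above can be sketched as follows with `time.perf_counter`. A `DecisionTreeClassifier` on synthetic data stands in for the paper's actual models and datasets.

```python
# Sketch: measuring training and testing time separately, as in the
# protocol above. DecisionTreeClassifier is a stand-in for J48.
import time
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# imbalanced toy dataset: ~80% majority, ~20% minority
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

t0 = time.perf_counter()
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_time = time.perf_counter() - t0          # data prep + model fitting

t0 = time.perf_counter()
acc = model.score(X_test, y_test)              # process the test set
test_time = time.perf_counter() - t0

print(f"train: {train_time:.4f}s  test: {test_time:.4f}s")
```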
Tables 13 and 14 separately show the training and testing time of our CIB and the baselines. From the tables, we can see that (1) CIB needs the most training time; on average, training CIB takes about 223.25 s; and (2) the testing time of all methods is very small; on average, CIB takes 0.168 s for testing. Therefore, we suggest training CIB offline.
Table 13. Training time of credibility based imbalance boosting (CIB) and the baselines (in seconds).

The experimental results provide substantial empirical support that taking the credibility of synthetic samples into consideration helps to deal with the class-imbalance problem in software defect prediction. The performance of CIB can be attributed to three aspects: oversampling, the boosting framework, and the proposed credibility of synthetic samples. As Menzies [3] argued, a good defect prediction model should have a high PD but a low PF. We further empirically discuss how CIB works in terms of PD and PF. Figure 4 presents the bar plots of PD and PF for CIB and the baselines across all 11 benchmark datasets. From the figure, we can see that RUS obtained the largest PD at the cost of an extremely high PF. The reason for the high PF is that too much useful information about non-defective samples is discarded when many non-defective samples are randomly removed, especially when the defect dataset is highly imbalanced. We also notice that CIB achieved the second largest PD (0.388), closely followed by SMOTE (PD = 0.386) and MAHAKIL (PD = 0.375), at the cost of a smaller PF than theirs. The reason is that some synthetic minority class samples generated by SMOTE are noisy, which damages the classification of actual majority class samples. Compared with AdaBoost, AdaC2, and None, CIB largely improved PD at the cost of an acceptable PF (PF = 0.101), considering the PF of AdaBoost (0.068), AdaC2 (0.069), and None (0.075).
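The two quantities discussed above follow directly from the confusion matrix: PD (probability of detection, i.e., recall on the defect-prone class) and PF (probability of false alarm on the non-defect-prone class). A small sketch with toy predictions:

```python
# PD and PF from a confusion matrix; toy predictions for illustration.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = defect-prone
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
pd_ = tp / (tp + fn)   # fraction of defect-prone modules correctly flagged
pf  = fp / (fp + tn)   # fraction of non-defect-prone modules wrongly flagged
print(round(pd_, 3), round(pf, 3))  # -> 0.75 0.167
```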
In summary, each of CIB's three components, i.e., oversampling, the boosting framework, and the proposed credibility of synthetic samples, is essential to ensure the performance of CIB.

Class-Imbalance Learning Is Necessary When Building Software Defect Prediction Models
The experimental results show that all class-imbalance learning methods help to improve defect prediction performance in terms of F1, AGF, MCC, and AUC, which is consistent with the results of previous studies [40,[44][45][46]64]. We also notice that, among all class-imbalance learning methods, CIB always makes the largest improvement over using no class-imbalance learning method (i.e., None) in terms of F1, MCC, and AUC. Moreover, although RUS is the simplest class-imbalance learning method, it makes the smallest improvement over None in terms of F1 and MCC compared with the other class-imbalance learning methods. Therefore, users who need to select a class-imbalance learning method should consider both ease of implementation and actual performance. To sum up, CIB is a more promising alternative for addressing the class-imbalance problem.
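To make the two baseline sampling strategies concrete, the following is a minimal numpy sketch of SMOTE-style oversampling (interpolating between a minority sample and its nearest minority neighbour) and random undersampling (RUS); it is a simplified illustration, not the reference implementations of [40] or [35].

```python
# Minimal sketch of the two sampling baselines: SMOTE-style interpolation
# for the minority class, and random undersampling of the majority class.
import numpy as np

rng = np.random.default_rng(0)

def smote_like(X_min, n_new):
    """Generate n_new synthetic minority samples by interpolation."""
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        j = np.argmin(d)                   # nearest minority neighbour
        gap = rng.random()                 # random point on the segment
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)

def rus(X_maj, n_keep):
    """Randomly keep n_keep majority samples (random undersampling)."""
    idx = rng.choice(len(X_maj), size=n_keep, replace=False)
    return X_maj[idx]

X_min = rng.normal(0, 1, size=(10, 3))   # 10 minority samples
X_maj = rng.normal(3, 1, size=(50, 3))   # 50 majority samples
synth = smote_like(X_min, 40)            # oversample minority to 50
kept  = rus(X_maj, 10)                   # or undersample majority to 10
print(synth.shape, kept.shape)
```

Note that `smote_like` generates every synthetic point inside the convex hull spanned by minority pairs, which is exactly why some synthetic samples can fall into noisy regions near the class boundary, the problem that CIB's credit factor is designed to down-weight.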

Is CIB Comparable to Rule-Based Classifiers?
In this section, we provide a comparison between CIB and some rule-based classifiers, including OneR [98], Bagging (C4.5, i.e., J48, is used as the base classifier in this study) [99], and RIPPER [100]. We implement OneR, Bagging, and RIPPER with WEKA using the default parameters. We perform experiments on the NASA datasets and use five-fold CV (see Section 4.5) to evaluate the prediction performance. On each dataset, we run each model 30 times with five-fold CV and report the average performance. Owing to space limitations, we only report the experimental results of F1, AGF, AUC, and MCC with box plots, as shown in Figure 5.
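The repeated cross-validation protocol above can be sketched with scikit-learn, using `RepeatedStratifiedKFold` for 30 repetitions of stratified five-fold CV and a bagged decision tree as a stand-in for WEKA's Bagging over J48:

```python
# Sketch: 30 x five-fold stratified CV of a bagged decision tree,
# a stand-in for WEKA's Bagging(J48), on an imbalanced toy dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=30, random_state=0)
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                          random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print(len(scores), round(scores.mean(), 3))   # 150 fold scores, mean F1
```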
From the figures, we can notice that (1) with respect to F1, AGF, and MCC, CIB clearly outperforms the baselines, i.e., OneR, Bagging, and RIPPER; and (2) with respect to AUC, CIB performs similarly to Bagging and clearly outperforms OneR and RIPPER.

Is CIB Applicable for Cross-Project Defect Prediction?
We do not suggest directly using CIB for cross-project defect prediction (CPDP), because CIB is designed to address the class-imbalance problem in within-project defect prediction (WPDP). WPDP differs fundamentally from CPDP, so methods designed for WPDP are not directly suitable for the CPDP task. The biggest challenge in CPDP is the distribution difference between the source and target datasets. CIB should therefore be modified before being used for cross-project defect prediction. To this end, for example, we could modify the method for calculating the credit factor of synthetic minority class samples.
The more similar a synthetic sample is to the target dataset, the larger its credit factor should be. We plan to design a specific method to implement this idea in future work.
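One way the idea above could look is sketched below. This is purely illustrative future-work material, not the paper's CIB credit factor: it scores each synthetic sample by the distance to its nearest target-dataset sample, so that credit decreases monotonically with distance.

```python
# Illustrative sketch (not the paper's method): a credit factor that grows
# with a synthetic sample's similarity to the target dataset, measured by
# distance to the nearest target sample.
import numpy as np

def credit_factors(synthetic, target, eps=1e-12):
    """Credit in (0, 1): closer to the target data -> larger credit."""
    d = np.array([np.min(np.linalg.norm(target - s, axis=1))
                  for s in synthetic])
    return 1.0 / (1.0 + d + eps)   # monotone decreasing in distance

rng = np.random.default_rng(0)
target = rng.normal(0, 1, size=(30, 4))
# three synthetics near the target data, three far away from it
synthetic = np.vstack([target[:3] + 0.05, rng.normal(5, 1, size=(3, 4))])

cf = credit_factors(synthetic, target)
print(cf.round(3))   # near-target synthetics receive larger credit
```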

Threats to Validity
External validity. In this paper, 11 NASA datasets are used as the benchmark datasets, which have been used in previous studies [3,27,67]. Moreover, we further discussed the performance of CIB on the PROMISE datasets and obtained similar findings to those on the NASA datasets. Nevertheless, we still cannot claim that the same experimental conclusions would be obtained on other defect datasets. To alleviate this threat, the proposed CIB method should be applied to more defect datasets.
Construct validity. According to [101], different model validation techniques may affect model performance. Although k-fold CV is one of the most commonly used model validation techniques in previous studies [60,94,95], we still cannot ensure that our experimental results, which are based on the 30 × 10-fold CV, would be the same under other validation techniques, e.g., the hold-out validation technique. This could be a threat to the validity of our research.
Internal validity. Threats to internal validity mainly result from the re-implementation of the baseline methods and parameter tuning. Although we have carefully implemented the baselines and double-checked our source code, there may still be some mistakes in the code. To ensure the reliability of our research, the source code of our CIB and all baselines has been made publicly available on GitHub (https://github.com/THN-BUAA/CIB.git).

Conclusions
Class-imbalanced data are a major factor that lowers the performance of software defect prediction models [31,32]. Many class-imbalance learning methods have been proposed to address this problem in previous studies, especially synthetic-based oversampling methods, e.g., SMOTE [40]. Since previous synthetic-based oversampling methods treat the artificial minority class samples equally with the real minority class samples, unreliable synthetic minority samples may interfere with the learning on real samples and shift the decision boundaries in incorrect directions. To address this limitation, we propose a credibility based imbalance boosting (CIB) method for software defect proneness prediction. To demonstrate the effectiveness of CIB, experiments are conducted on 11 cleaned NASA datasets and nine PROMISE datasets. We compare CIB with previous class-imbalance learning methods, i.e., MAHAKIL [44], AdaC2 [45], AdaBoost [46], SMOTE [40], and RUS [35], in terms of four performance measures, including AUC, F1, AGF, and MCC, based on the well-known J48 decision tree classifier. The experimental results show that CIB is more effective than the baselines and is very promising for addressing the class-imbalance problem in software defect proneness prediction.
In the future, we plan to further demonstrate the effectiveness of CIB on more defect datasets and to extend CIB for use in cross-project defect prediction.