Class Imbalance Reduction (CIR): A Novel Approach to Software Defect Prediction in the Presence of Class Imbalance

Abstract: Software defect prediction (SDP) is a technique used to predict the occurrence of defects in the early stages of the software development process. Early prediction of defects reduces the overall cost of software and increases its reliability. Most of the defect prediction methods proposed in the literature suffer from the class imbalance problem. In this paper, a novel class imbalance reduction (CIR) algorithm is proposed to create symmetry between the defect and non-defect records in imbalanced datasets by considering the distribution properties of the datasets. It is compared with SMOTE (synthetic minority oversampling technique), a built-in package of many machine learning tools that is considered a benchmark in handling class imbalance problems, and with K-Means SMOTE. We conducted the experiment on forty open source software defect datasets from the PRedictOr Models In Software Engineering (PROMISE) repository using eight different classifiers and evaluated the results with six performance measures. The results show that the proposed CIR method improves performance over SMOTE and K-Means SMOTE.


Introduction
The most important activity in the testing phase of the software development process is software defect prediction (SDP) [1]. SDP identifies defect-prone modules, which need rigorous testing. By identifying defect-prone modules well in advance, testing engineers can use testing resources efficiently without violating project constraints. Although SDP is most useful in the testing phase, it is not always easy to predict the defect-prone modules, and several issues hinder both the performance and the use of defect prediction methods. The quality of software is directly correlated with the number of defects in its modules, so defect prediction is a considerable part of measuring software quality. Minimizing the number of defects requires thorough testing, which is expensive in terms of man-hours. Accurate identification of defective modules in the initial phases of testing can decrease the overall testing time. Machine learning offers several algorithms for model building; users can choose an appropriate algorithm for a regression or classification problem and calculate the accuracy of the model. However, most of the defect prediction methods proposed in the literature suffer from the problem of class imbalance. Machine learning algorithms tend to perform inconsistently when the datasets are imbalanced, which leads to misleading accuracies. In an imbalanced dataset, one class contains far fewer samples than the other; the former is termed the minority class and the latter the majority class. When the dataset is imbalanced, the classification algorithm does not have sufficient information about the minority class to make accurate predictions. It is therefore advantageous to apply classification algorithms to balanced datasets with a symmetry of defect and non-defect records.
Datasets with a disparity in the dependent variable are called imbalanced. Classification or prediction with imbalanced datasets is a supervised learning problem in which the proportion of one class differs greatly from that of the other. This problem occurs frequently in binary classification. With imbalanced datasets, machine learning algorithms yield poor accuracy, because the values of the class label attribute are unevenly distributed. This biases the classifier towards the majority class. To balance imbalanced data, various sampling methods have been proposed in the literature. Balancing the datasets improves the accuracy of classification. The main approaches used to handle imbalanced data are undersampling, oversampling, and synthetic data generation.
In the undersampling method, some samples of the majority class are removed to balance the data. Undersampling can be random or informative. In random undersampling, the samples to be deleted are chosen randomly. Informative undersampling uses a pre-specified condition to select the samples from the majority class. EasyEnsemble [2] and BalanceCascade [2] are popular algorithms for informative undersampling.
EasyEnsemble extracts numerous independent subsets from the majority class (sampled with replacement) and creates multiple classifiers by combining every subset with the minority class, whereas BalanceCascade is a supervised method that creates an ensemble of classifiers and systematically selects which majority class samples to include in the ensemble. With undersampling, however, valuable information relating to the majority class is lost.
In oversampling, samples of the minority class are replicated to create symmetry between the number of defect and non-defect records. Oversampling is of two types: (i) informative oversampling and (ii) random oversampling. Informative oversampling uses a pre-specified criterion to synthetically generate minority class samples. Random oversampling balances the data by randomly replicating samples of the minority class. Oversampling avoids the problem of information loss, but it suffers from replication of data. The synthetic data generation method generates synthetic data in the minority class. The synthetic minority oversampling technique (SMOTE) [3] is a powerful and widely used technique of this kind. It creates a random set of samples to balance the minority class: new synthetic samples are generated between a randomly chosen minority class sample and its nearest neighbor samples. SMOTE is considered a benchmark in learning from imbalanced datasets. Chawla et al. [4] discussed the current research progress using SMOTE and its applications in different fields. Many researchers have proposed variants of SMOTE, and their implementations in Python are discussed in [5]. K-Means SMOTE [6] is a variant of SMOTE in which the data samples are first divided into k clusters by using the K-Means algorithm. Next, the clusters to be oversampled are identified by a filtering method, and finally SMOTE is applied to the identified clusters to balance the samples.
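The SMOTE interpolation step described above can be illustrated with a minimal sketch. It is not the reference implementation from [3]; the function name `smote_sample`, the neighbor count `k`, and the `rng` argument are assumptions introduced here for illustration only.

```python
import math
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point between a random minority sample
    and one of its k nearest minority neighbours."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Rank the remaining minority samples by Euclidean distance to the base sample.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    # The synthetic point lies on the segment joining base and neighbour.
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
```

Because each synthetic point lies on a segment between two existing minority samples, it always stays inside the region already spanned by the minority class.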
In the present work, we propose a novel class imbalance reduction (CIR) technique that reduces the imbalance between defective and non-defective samples to achieve improved accuracy in software defect prediction. Our technique is compared with the baseline method SMOTE and its recent variant K-Means SMOTE. The paper is organized as follows. In Section 2, the classification methods used in this work are explained. Related work is discussed in Section 3, and Section 4 describes the proposed algorithm. In Section 5, the experimental results are outlined. Section 6 concludes the paper and indicates the possible scope for future enhancements.

Background
Many classifiers are used in the literature for prediction. In this work, we used the classifiers described below to test our new approach.

AdaBoost
AdaBoost [7] is an ensemble boosting classification method that merges multiple weak classifiers to create a strong classifier with high accuracy. The AdaBoost algorithm works as follows:

1. AdaBoost randomly selects a subset of the training data.
2. It trains the chosen machine learning model iteratively, selecting the training set based on the accuracy of the predictions in the previous round.
3. It assigns weights to the samples such that wrongly classified samples get a higher weight than correctly classified samples. As a result, the wrongly classified samples receive a higher probability of being classified in the next iteration.
4. In every iteration, the algorithm assigns a weight to the classifier based on its accuracy, so that a more accurate classifier receives a higher weight.
5. The process terminates when all the training data are classified correctly or when the specified maximum number of estimators is reached.
6. Finally, it performs a "vote" among all of the learning algorithms built.
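The reweighting in steps 3 and 4 can be sketched as follows. The text above describes the weights only qualitatively; the exponential update used here is the common textbook formulation of discrete AdaBoost and is an assumption, as are the function and parameter names.

```python
import math


def adaboost_round(weights, correct):
    """One reweighting round of discrete AdaBoost (textbook form, assumed here).

    weights: current sample weights, summing to 1
    correct: booleans, True where the weak classifier was right
    Returns (classifier_weight, new_sample_weights).
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # The classifier weight grows as the weighted error shrinks (requires 0 < error < 1).
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Raise the weight of misclassified samples and lower the weight of the rest.
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]
```

With four equally weighted samples and one misclassification, the misclassified sample ends the round holding half of the total weight, which is what pushes the next weak classifier towards it.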

Decision Tree
The decision tree [8] is a tree structure in which each non-terminal (non-leaf) node represents an attribute, each branch of a non-terminal node represents an outcome of the condition on that node, and each terminal (leaf) node represents a class label. The tree is constructed by identifying the best splitting attribute as the root node. Each possible value of the splitting attribute leads to one branch of the tree. This process is repeated recursively to identify the attributes at the next levels of the tree and terminates when all the attributes have been added to the tree. Decision trees handle high-dimensional data with good accuracy. A new instance is classified by traversing the decision tree from the root to a leaf node. At each level, the new instance is tested against the attribute of the current node, and the branch corresponding to the instance's value for that attribute is followed. This procedure is repeated until the search reaches a leaf node.

Extra Tree
The extra tree [9] is an ensemble learning technique in which the classification result aggregates the results of different de-correlated decision trees. It is largely analogous to the random forest technique but differs in how the decision trees in the forest are constructed. Each decision tree in the extra trees ensemble is constructed from all the training samples. At each test node, every tree is provided with a random sample of n features and selects the best feature among them. The data are then partitioned by using a mathematical criterion. In this way, multiple de-correlated decision trees are created from the random samples of features.

Gradient Boosting
Gradient boosting [10] generates a classification model from a collection of weak prediction models. The method builds the model in a stage-wise manner and generalizes it by allowing the optimization of an arbitrary differentiable loss function. It trains a number of models in an additive, gradual, and sequential fashion. Gradient boosting uses the gradients of the loss function (y = ax + b + e, where e is the error term) to identify weak classifiers. The loss function indicates how well the model's coefficients fit the underlying data.

KNN
The KNN algorithm [11] works on the assumption that similar samples exist in close proximity. KNN considers every training sample, with its associated class label, as a vector in a multidimensional space. While training the model, KNN stores the feature vectors and their class labels. While classifying, a class label is assigned to a new instance by taking the majority class label among the k nearest samples of that instance (k is a user parameter).
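The classification rule can be written in a few lines. The sketch below assumes Euclidean distance, which the description above does not fix, and the name `knn_predict` is illustrative.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority class label among the k training samples closest to query."""
    # "Training" is just storage: rank the stored vectors by distance at query time.
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

For instance, with stored samples (0,0), (0,1), (1,0) labelled 0 and (5,5), (6,5) labelled 1, a query at (0.2, 0.2) with k = 3 is assigned label 0 because all three of its nearest neighbours carry that label.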

Logistic Regression
Logistic regression [12] is a statistical method used to analyze a dataset in which one or more independent variables (features) determine the value of a dependent variable (class label). The dependent variable is binary (either zero or one), that is, dichotomous.
Logistic regression aims to find the best-fitting model that describes the association between the class label and the set of independent variables. It generates the coefficients shown in Equation (1), which predict a logit transformation of the probability of presence of the characteristic of interest:
$$\operatorname{logit}(p) = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_m X_m \tag{1}$$
where p is the probability of presence of the characteristic of interest. The logit transformation is defined as the logged odds, as given by Equations (2) and (3):
$$\text{odds} = \frac{p}{1 - p} \tag{2}$$
and
$$\operatorname{logit}(p) = \ln\left(\frac{p}{1 - p}\right) \tag{3}$$
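As a small illustration of the logit transformation and its inverse, the following sketch maps a probability to its logged odds and maps a linear combination of features back to a probability. The function names and the example coefficients are hypothetical and are not fitted to any dataset.

```python
import math


def logit(p):
    """Logged odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def predict_probability(coefficients, intercept, features):
    """Invert the logit: map intercept + sum(b_i * x_i) back to a probability."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))
```

Applying `logit` to the output of `predict_probability` returns the linear combination itself, which is why the fitted coefficients can be read as changes in the logged odds.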

Naïve Bayes Classifier
The Naïve Bayesian classifier [13] is built on Bayes' theorem and assumes that the predictor attributes are independent of one another given the class, which is called class conditional independence. According to Bayes' theorem, the class label (c) of a data instance (x) is identified by calculating the posterior probability P(c|x) as given in Equation (4):
$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \tag{4}$$
where P(c|x) is the posterior probability of class c conditioned on data instance x, P(x|c) is the likelihood of data instance x conditioned on class c, P(c) is the prior probability of class c, and P(x) is the prior probability of data instance x.
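The posterior computation under class conditional independence can be sketched for categorical attributes as follows. The data layout (dictionaries of priors and per-attribute likelihood tables) and the name `posterior` are assumptions made for this illustration.

```python
def posterior(priors, likelihoods, instance):
    """Posterior P(c | x) for each class under class conditional independence.

    priors: {class: P(c)}
    likelihoods: {class: [{value: P(x_j = value | c)} for each attribute j]}
    instance: tuple of attribute values
    """
    joint = {}
    for c, prior in priors.items():
        score = prior
        for j, value in enumerate(instance):
            # Independence assumption: multiply the per-attribute terms.
            score *= likelihoods[c][j][value]
        joint[c] = score
    evidence = sum(joint.values())  # P(x), the normalising constant
    return {c: score / evidence for c, score in joint.items()}
```

The predicted class label is the key with the largest posterior value in the returned dictionary.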

Random Forest
The random forest [14] model is a collection of many decision trees. The algorithm draws a random sample from the training data while constructing each tree and a random subset of the features while splitting each node. During training, each tree in a random forest (RF) learns from randomly selected training data points. The sampling technique used is random sampling with replacement, so the same sample can be used several times in a single tree. During testing, the prediction of each tree is taken, and the average of these predicted values is considered the final prediction. The other key concept in RF is that only some of the features (the square root of the number of features n) are considered for splitting each node in each decision tree.
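The two sources of randomness described above, sampling records with replacement and restricting each split to about sqrt(n) features, can be sketched as follows. The helper names are illustrative, and rounding the square root down is an assumption.

```python
import math
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement, so a row may repeat."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def candidate_features(n_features, rng):
    """Pick about sqrt(n_features) distinct feature indices for one node split."""
    size = max(1, int(math.sqrt(n_features)))
    return rng.sample(range(n_features), size)
```

With the twenty software metrics used in this work, `candidate_features(20, rng)` would offer four candidate features at each split under this rounding choice.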

Related Work
The class imbalance problem in software defect prediction has been addressed by various methods at the data and algorithm levels [15,16]. Data-level methods address the issue by means of re-sampling techniques, which balance datasets by deleting majority class samples or by replicating minority class samples. Methods like random undersampling, random oversampling, SMOTE [3], and their variants [17-22] are widely used in the literature. However, these methods carry the risk of discarding useful data or duplicating existing data.
Ensemble learning and cost-sensitive learning are examples of algorithm-level methods. Bagging and boosting are classic ensemble learning techniques that have been shown to handle the class imbalance problem effectively [23]. Variants of bagging and boosting have been proposed to address this problem in SDP [15,24-26]. Cost-sensitive (CS) learning assigns a large misclassification cost to defective instances and a small misclassification cost to non-defective instances. Khoshgoftaar et al. [27] introduced CS learning into SDP and proposed a cost-boosting method. Zheng [28] proposed cost-sensitive boosting neural networks for SDP. Similarly, a CS neural network was studied by Arar and Ayan [29]. Liu et al. [30] proposed two-stage CS learning for SDP, which includes CS feature selection and a CS neural network classifier. Li et al. [31] used three-way decision-based CS learning for SDP. Some researchers combined CS learning with other machine learning methods, such as dictionary learning and random forest [18,20]. Furthermore, CS learning has also been used in the cross-project defect prediction (CPDP) scenario [16,32]. However, how to set suitable cost values is still an unsolved problem for cost-sensitive learning. Tomar et al. [33] developed an SDP system using a weighted least squares twin support vector machine (WLSTSVM), in which a high misclassification cost is assigned to the defective class samples and a low cost to the non-defective class samples. Gong et al. [34] proposed the KMFOS method, which generates new samples that spread diversely in the defective space. KMFOS applies K-means clustering to divide the defective samples into K clusters. New instances are then generated by interpolating between instances belonging to each pair of clusters. Finally, it uses the CLNI filtering technique to remove noisy instances. Sohan et al. [35] assessed the effect of imbalance learning on CPDP by using eight different classifiers.
Song et al. [36] conducted experiments that explore the effect of the presence of imbalanced data, its nature, and the use of different classifiers, with software metrics as input. They evaluated twenty-seven datasets using seven classifiers, seven types of input metrics, and various imbalanced learning methods, and concluded that imbalanced learning could be considered only for moderately or highly imbalanced software defect prediction datasets. Sohan et al. [37] conducted a study of the inconsistency in performance between imbalanced and balanced datasets. In this study, eight public datasets were examined with seven classification methods; the authors concluded that the imbalance between the defective and non-defective classes plays a major role in SDP and that, among the seven classifiers, voting is the best performer. Huda et al. [38] proposed two novel hybrid SDP models that choose significant attributes by combining wrapper and filter methods.

Proposed Method
Our proposed approach, class imbalance reduction (CIR), is based on calculating the centroid of all attributes of the minority class samples for synthesizing new samples. In our experiments, the approach outperforms the popular SMOTE oversampling technique on six widely used performance measures. The general framework for implementing our approach is as follows. Let the imbalanced defect dataset be $DS_i = \{r_1, r_2, r_3, \ldots, r_n\}$, where $r_i$ ($1 \le i \le n$) is the ith record, representing the ith module in the project. Each $r_i$ contains m attributes, each of which is a software metric, and one additional class label attribute. The value of the class label represents the number of bugs that occurred in that module: a value of zero represents a non-defective module and a value greater than zero represents a defective module. Because a binary classifier requires class label values of zero or one, we changed the class label values to either zero or one during pre-processing. The proposed framework is depicted in Figure 1.

Algorithm for Class Imbalance Reduction (CIR)
The input dataset $DS_i$ is divided into two groups based on the value of the class label. The group with fewer samples is designated the minority class and the group with more samples the majority class, as shown in Figure 2. In our approach, as described in Algorithm 1, synthetic data is generated to increase the number of minority class samples until it matches the number of majority class samples. The centroid (C) of the minority class samples is computed, and the minority sample nearest to it is identified. A new sample is generated by multiplying the centroid by a random number in the range 0 to 1 and adding the result to this nearest neighbor. The generated synthetic sample is appended to the minority class samples. Generation of new samples terminates when the minority and majority classes are balanced, which creates symmetry between the number of defect and non-defect records. For example, consider three data samples (1,2), (2,5), and (3,5). The centroid (C) of these samples is (2,4), and the nearest neighbor to the centroid is (2,5). The algorithm generates synthetic data samples as (2,5) + (random number between 0 and 1) * C.
As shown in Figure 1, a random 70% of the records of the balanced data are used as training data and the remaining 30% as test data. The classification models are generated and tested by using 10-fold cross validation on the training and test datasets with the following classifiers: AdaBoost (AB), decision tree (DT), extra tree (ET), GradientBoost (GB), K-nearest neighbor (KNN), logistic regression (LR), Naïve Bayes (NB), and random forest (RF).

Algorithm 1: Class Imbalance Reduction (CIR)
Input: Imbalanced dataset ($DS_i$) with attributes $X_1, X_2, X_3, \ldots, X_m$, which represent features (software metrics), a class label, and records $r_1, r_2, r_3, \ldots, r_n$
Output: Balanced dataset (BD) with symmetry between the number of defect and non-defect records
Step 1: Divide $DS_i$ into two groups based on the class label value, representing the defect and non-defect classes
Step 2: Denote the class that contains fewer records as the minority class ($D_o$)
Step 3: Denote the class that contains more records as the majority class ($D_j$)
Step 4: Calculate the centroid (C) of $D_o$ using $C = \{\mathrm{mean}(X_1), \mathrm{mean}(X_2), \mathrm{mean}(X_3), \ldots, \mathrm{mean}(X_m)\}$
Step 5: For each record $r_i$ in $D_o$
    Step 5.1: Calculate the distance $\mathrm{dist}(r_i, C)$ using the Euclidean distance
Step 6: Sort the records in increasing order of their distances
Step 7: Choose the record with the minimum distance ($D_{min}$)
Step 8: Generate N random numbers $k_1, k_2, \ldots, k_N$ between 0 and 1, where $N = |D_j| - |D_o|$, so that symmetry is created between the number of defect and non-defect records
    Step 8.1: For each random number $k_j$
        Step 8.1.1: Generate a new record as $D_{min} + k_j \cdot C$
        Step 8.1.2: Append the new record to $D_o$
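The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: records are numeric feature tuples with the class label already removed, features are used unscaled, and the names `cir_oversample`, `majority_size`, and `rng` are introduced here for illustration.

```python
import math
import random


def cir_oversample(minority, majority_size, rng=None):
    """Balance the minority class following the steps of Algorithm 1.

    minority: list of numeric feature tuples (class label excluded)
    majority_size: number of records in the majority class
    Returns the original minority records followed by the synthetic ones.
    """
    rng = rng or random.Random()
    m = len(minority[0])
    # Step 4: centroid = per-attribute mean of the minority records.
    centroid = tuple(sum(r[j] for r in minority) / len(minority) for j in range(m))
    # Steps 5-7: minority record closest to the centroid (Euclidean distance).
    d_min = min(minority, key=lambda r: math.dist(r, centroid))
    balanced = list(minority)
    # Step 8: one synthetic record per missing minority sample.
    for _ in range(majority_size - len(minority)):
        k = rng.random()  # random number in [0, 1)
        balanced.append(tuple(d + k * c for d, c in zip(d_min, centroid)))
    return balanced
```

For the three samples (1,2), (2,5), (3,5) and a majority class of five records, the sketch computes the centroid (2,4), selects (2,5) as the nearest record, and appends two synthetic records of the form (2,5) + k * (2,4).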

Experimentation and Results
For experimentation, we considered forty open source defect prediction datasets from the tera-PROMISE repository [39]. The datasets and their class imbalance percentages are listed in Table 1. All the datasets contain data on twenty software metrics, such as McCabe's cyclomatic complexity, weighted methods per class, and others. The description of each metric is given in Table 2.

Table 2 (excerpt). Description of the software metrics.
Metric | Description
… | No. of methods and attributes, and access to those methods on another class
loc | Lines of code
dam | Ratio of the no. of private and protected attributes to the total number of attributes
moa | Extent of the part-whole relationship, realized by using attributes
mfa | No. of methods inherited by a class per number of methods accessible by its methods
cam | Relatedness among methods of a class based upon the parameter list of the methods
ic | No. of parent classes to which a given class is coupled
cbm | No. of new and redefined methods to which all the inherited methods are coupled
amc | Average method size for each class
max_cc | Max value for Cyclomatic Complexity metric
avg_cc | Average value for Cyclomatic Complexity metric

Performance Measures
Several classifier performance measures have been proposed in the literature; those used here are given by Equations (5)-(10). Sensitivity, or recall, measures the proportion of actual positives that are correctly classified. Specificity is the ability of the test to correctly identify true negatives. The geometric mean combines the true negative rate and the true positive rate at a specific threshold. Precision measures the proportion of predicted positives that are actually positive. F-measure is the harmonic mean of precision and recall. Accuracy measures the proportion of true results over the total number of cases. The confusion matrix used to compute these performance measures is shown in Table 3.
$$\text{Sensitivity} = \frac{\text{True Pos}}{\text{True Pos} + \text{False Neg}} \tag{5}$$
$$\text{Specificity} = \frac{\text{True Neg}}{\text{True Neg} + \text{False Pos}} \tag{6}$$
$$\text{Geometric Mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}} \tag{7}$$
$$\text{Precision} = \frac{\text{True Pos}}{\text{True Pos} + \text{False Pos}} \tag{8}$$
$$\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{9}$$
$$\text{Accuracy} = \frac{\text{True Pos} + \text{True Neg}}{\text{True Pos} + \text{True Neg} + \text{False Pos} + \text{False Neg}} \tag{10}$$
The comparative analysis of our approach with SMOTE and K-Means SMOTE on the above performance measures, using the AdaBoost (AB), decision tree (DT), extra tree (ET), GradientBoost (GB), K-nearest neighbor (KNN), logistic regression (LR), Naïve Bayes (NB), and random forest (RF) classifiers, is shown in Table 4 in the format "Mean ± Standard Deviation (SD)". The analysis shows that the proposed CIR method improves over SMOTE as well as K-Means SMOTE on all the performance measures (the best values are shown in bold). CIR performs well relative to SMOTE with all eight of these frequently used machine learning algorithms. Table 5 compares CIR with SMOTE for the eight classifiers and six performance measures. Each row in Table 5 shows the number of datasets in which CIR improves performance over SMOTE. For example, the first row shows that, for the AdaBoost classifier, CIR improves accuracy over SMOTE in 27 of the 40 datasets and matches it in 6 datasets. Table 6 compares CIR with K-Means SMOTE in the same way. Each row in Table 6 shows the number of datasets in which CIR improves performance over K-Means SMOTE. For example, the first row shows that, for the AdaBoost classifier, CIR improves accuracy over K-Means SMOTE in 26 of the 40 datasets and matches it in 5 datasets. Table 7 compares the performance improvement of CIR over SMOTE and K-Means SMOTE for the different classifiers; every row shows the improvement for one classifier. KNN performs better than the other classifiers in accuracy, precision, and specificity, whereas logistic regression performs well in recall, F-measure, and geometric mean. Overall, logistic regression performs close to or better than all the other classifiers.
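For reference, the six performance measures used in this section can be computed from the four confusion-matrix counts as in the sketch below. The function name and the choice of the defective class as the positive class are assumptions for illustration; the sketch also assumes that no denominator is zero.

```python
import math


def performance_measures(tp, tn, fp, fn):
    """Six measures from binary confusion-matrix counts (defective = positive)."""
    sensitivity = tp / (tp + fn)   # recall: share of actual positives found
    specificity = tn / (tn + fp)   # share of actual negatives found
    precision = tp / (tp + fp)     # share of predicted positives that are correct
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "geometric_mean": math.sqrt(sensitivity * specificity),
        "precision": precision,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```

Reporting the geometric mean alongside accuracy is useful on imbalanced test data, because a classifier that ignores the minority class can still reach a high accuracy while its sensitivity, and therefore its geometric mean, falls to zero.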

Post hoc Analysis
The post hoc analysis was carried out with the SPSS 20.0 tool [40], using single-factor ANOVA with multiple comparisons to compare the three algorithms: SMOTE, K-Means SMOTE, and the proposed CIR algorithm. In the post hoc analysis, the proposed CIR algorithm showed significance in precision, recall, and specificity with the AdaBoost classifier and in geometric mean with the extra tree classifier. The K-nearest neighbors, logistic regression, and Naïve Bayes classifiers showed high significance in all six performance measures, and the random forest classifier showed high significance in precision, F-measure, and specificity when compared with SMOTE and K-Means SMOTE; the p-values are significant at the 0.05 level. The results are shown in Table 8.

Conclusions
In this paper, we proposed a novel technique, class imbalance reduction (CIR), to handle class imbalance in software defect prediction by considering the distribution properties of the dataset. The proposed method uses a centroid and nearest neighbor based approach to generate synthetic data. Several experiments were conducted by applying the proposed approach to forty open source datasets, and the results were compared with those obtained by applying SMOTE, which is a benchmark model for reducing class imbalance, and K-Means SMOTE. Our experimental results show that the proposed CIR approach outperforms SMOTE and K-Means SMOTE in terms of six standard prediction measures when applied with eight frequently used machine learning algorithms: AdaBoost, decision tree, extra tree, gradient boost, K-nearest neighbors, logistic regression, Naïve Bayes, and random forest. KNN performs better than the other classifiers in accuracy, precision, and specificity, whereas logistic regression performs well in recall, F-measure, and geometric mean. Overall, logistic regression performs close to or better than all the other classifiers.
The post hoc analysis was carried out with the SPSS 20.0 tool, using single-factor ANOVA with multiple comparisons to compare the three algorithms: SMOTE, K-Means SMOTE, and the proposed CIR algorithm. In the post hoc analysis, the proposed CIR algorithm showed significance in precision, recall, and specificity with the AdaBoost classifier and in geometric mean with the extra tree classifier. The K-nearest neighbors, logistic regression, and Naïve Bayes classifiers showed high significance in all six performance measures, and the random forest classifier showed high significance in precision, F-measure, and specificity when compared with SMOTE and K-Means SMOTE, with p-values significant at the 0.05 level. The proposed work can be extended to cross-project defect prediction (CPDP) and can also be integrated with optimization techniques such as ant colony optimization.
Author Contributions: K.K.B. has implemented CIR algorithm, collected data sets and verified results. J.G. has proposed CIR algorithm along with other authors and given suggestions to improve the paper. N.G. has helped in validating the results of CIR algorithm by comparing it with SMOTE technique. All authors have read and agreed to the published version of the manuscript.