LIMCR: Less-Informative Majorities Cleaning Rule Based on Naïve Bayes for Imbalance Learning in Software Defect Prediction

Abstract: Software defect prediction (SDP) is an effective technique for lowering software module testing costs. However, an imbalanced class distribution exists in almost all SDP datasets and restricts the accuracy of defect prediction. In order to rebalance the data distribution reasonably, we propose a novel resampling method, LIMCR, based on Naïve Bayes, to optimize and improve SDP performance. The main idea of LIMCR is to remove less-informative majority samples to rebalance the data distribution, after evaluating how informative every sample from the majority class is. We employ 29 SDP datasets from the PROMISE and NASA repositories and divide them into two parts: small datasets (fewer than 1100 samples) and large datasets (1100 samples or more). We then conduct experiments comparing combinations of classifiers and imbalance learning methods on the small and large datasets, respectively. The results show the effectiveness of LIMCR: LIMCR+GNB performs better than other methods on small datasets, while it is less competitive on large datasets.


Introduction
Software defect prediction (SDP) is an effective technique for lowering software module testing costs. It can efficiently identify defect-prone software modules by learning from defect datasets of previous releases. Existing SDP studies can be divided into four categories: (1) classification, (2) regression, (3) mining association rules, and (4) ranking [1]. Studies in the first category use classification algorithms (also called classifiers) as the prediction algorithms to classify software modules into defect-prone classes (positive or minority class) and non-defective classes (negative or majority class), or into various levels of defect severity. The imbalance learning we focus on in this paper is based on binary classification.
Commonly in software defect datasets, the number of samples (samples usually refer to software modules in SDP) in the defect-prone class is naturally smaller than the number of samples in the non-defective class. However, most prediction algorithms assume that the classes are equally balanced. This contradiction makes prediction algorithms trained on imbalanced software defect datasets generally biased towards samples in the non-defect-prone class while ignoring samples in the defect-prone class; i.e., many defect-prone samples may be classified as non-defect-prone by prediction algorithms trained on imbalanced datasets. This problem occurs widely in SDP, and it has been shown that reducing the influence of the imbalance problem can improve prediction performance efficiently.
Numerous methods [2,3] have been proposed for tackling imbalance problems in SDP. In imbalance learning research [4,5], methods are divided into two categories: the data level and the algorithm level. Methods at the data level mainly consist of various data resampling techniques. A resampling technique rebalances a dataset by adding minority samples (over-sampling) or removing majority samples (under-sampling). For instance, SMOTE [6] is an over-sampling method that generates synthetic samples in the minority class, while NCL [7] is an under-sampling method that removes samples from the majority class. Methods at the algorithm level modify existing classification algorithms so that they are no longer biased towards majority-class samples at the expense of minority-class samples. Cost-sensitive methods combine both algorithm- and data-level ideas: they consider the different misclassification costs of samples in different classes. For instance, RAMOBoost [8] is an improvement of the Boosting algorithm [9], NBBag [5] is an improved algorithm based on the Bagging algorithm [10], and AdaCost [11] modifies the weight update by adding a cost adjustment function to the AdaBoost algorithm [12]. Boosting, bagging, and other ensemble classifiers are frequently selected as the base classification algorithms to improve because of their high classification performance [4,5]. A proper base prediction algorithm can perform better on an imbalanced dataset after being improved by imbalance learning. Obviously, base classifier selection is one of the most important steps for imbalance learning methods at the algorithm level.
Different from algorithm-level methods, data-level methods can choose and change classifiers flexibly. With the increasing number of imbalance learning studies, researchers have noticed the influence of classifier selection. Numerous empirical studies compare the performance of different techniques to find rules for classifier selection; the influence factors considered include the researcher group [13], the level of class imbalance [14], diversity [15,16], and others [17,18].
Most empirical studies focus on comparisons between resampling methods and their influence factors, while paying less attention to the applicability of resampling methods and their connection with classifiers. In addition, there is almost no resampling method that quantifies sample information. Motivated by this, we aim to investigate how resampling methods work on datasets with different sample sizes, and how they cooperate with various classifiers. Moreover, we aim to propose a novel and effective resampling method that removes less-informative majority samples to rebalance the data distribution. The main contributions of this paper are the following three aspects: 1. We perform an empirical study to investigate the influence of dataset sample size on popular common classifiers. 2. We present a novel resampling method, LIMCR, based on Naïve Bayes to solve the class imbalance problem in SDP datasets. The new method outperforms other resampling methods on datasets with small sample sizes. 3. We evaluate and compare the proposed method with existing well-performing imbalance learning methods from both the data level and the algorithm level. The experiments present the performance of the compared methods on different datasets, respectively.
The remainder of this paper is organized as follows. Section 2 summarizes related work in the area of imbalance learning. In Section 3, we describe the methodology and procedure of our LIMCR. In Section 4, the experimental setup and results are explained, respectively. Finally, the discussion and conclusion are presented in Sections 5 and 6.

Imbalanced Learning Methods
A large number of methods have been developed to address imbalance problems, and these methods fall into two basic categories: the data level and the algorithm level. Methods at the data level mainly study the effect of changing the class distribution to deal with imbalanced datasets. It has been empirically proved that applying a preprocessing step to rebalance the class distribution is usually a positive solution [19]. The main advantage of data-level methods is that they are independent of the classifier [4]. Moreover, data-level methods can easily be embedded in ensemble learning algorithms, as algorithm-level methods are. Representative imbalanced learning methods are introduced in the remainder of this section.
Among data resampling methods, random over-sampling and random under-sampling are the simplest for rebalancing datasets [20]. Although random sampling methods have some drawbacks, they do improve classifier performance. To avoid the drawbacks of randomness, researchers have attempted to generate new synthetic samples from the original dataset, with great success. SMOTE [6] is one of the most classical synthetic over-sampling methods, and numerous methods have been proposed based on it, such as ADASYN [21], Borderline-SMOTE [22], MWMOTE [23], and Safe-Level-SMOTE [24]. The generated samples add essential information to the original dataset, so the additional bias to the classifier is alleviated and the overfitting that random over-sampling can cause is avoided [25].
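As an illustration of SMOTE's core idea (not the library implementation), the following minimal NumPy sketch interpolates between a minority sample and one of its k nearest minority neighbours; the function name, the value of k, and the example points are our own choices:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between a randomly chosen minority sample and one of its k
    nearest minority neighbours (the core idea of SMOTE)."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to all minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                   # interpolation factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

# four minority samples at the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote_sketch(X_min, n_new=4, rng=0)
# every synthetic point lies on a segment between two minority samples
```

Because each synthetic point is a convex combination of two existing minority samples, over-sampling never leaves the region the minority class already occupies.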
On the other side, the resampling methods introduced above have been shown to perform remarkably well when embedded in an ensemble algorithm [26,27]. Researchers therefore integrate an over-sampling method with an appropriate ensemble method to obtain a stronger approach for solving class imbalance problems. The most widely used ensemble learning algorithms are AdaBoost and Bagging, which are usually combined with resampling methods to form new algorithms such as SMOTEBoost [28], SMOTEBagging [15], and RAMOBoost [8], all of which perform well on imbalanced datasets.
Under-sampling methods are also widely used in imbalance learning, especially in SDP; research [29] has shown that static code features have limited information content and that under-sampling performs better than other approaches. In earlier studies, researchers preferred identifying redundant samples by clustering or K-nearest-neighbor algorithms, for instance, the Condensed Nearest Neighbor Rule (CNN) [30], Tomek links [31], the Edited Nearest Neighbor Rule (ENN) [32], One-Sided Selection (OSS) [33], and the Neighborhood Cleaning Rule (NCL) [7]. As more distribution problems have been found in datasets, stronger under-sampling methods have been proposed. Research [34] proposes a set of sample hardness measurements to understand why some samples are harder to classify correctly and removes samples suspected to be hard to learn; a similar study [35] has also proved effective for imbalance learning. Under-sampling can also be embedded in ensemble algorithms: EasyEnsemble and BalanceCascade [36] were proposed to preserve information to the maximum degree while reducing data complexity for efficient computation.

Software Defect Prediction
The classification problem in SDP is a typical learning problem. Boehm and Basili pointed out that in most cases, 20% of the modules account for 80% of the software defects [37], which means software defect data has a naturally imbalanced distribution.
SDP research starts with the selection of software defect metrics. The original defect data are obtained using specified static software metrics [38]. For instance, the McCabe [38] and Halstead [39] metrics are widely used, and Chidamber and Kemerer's (CK) metrics were proposed to fit the demands of object-oriented (OO) software. Many empirical studies have addressed the imbalance problem in SDP. A comprehensive experiment on the effect of imbalance learning in SDP emphasizes the importance of method selection [40], and the results of study [41] advocate resampling methods for effective imbalance learning. Meanwhile, many new imbalance learning methods have been proposed for SDP. L. Chen et al. [2] consider the class imbalance problem together with class overlap and integrate neighborhood cleaning learning (NCL) and ensemble random under-sampling (ERUS) into a novel approach for SDP. H. N. Tong et al. [1] propose a novel ensemble learning approach for the imbalance and overfitting problems, combine it with a deep learning algorithm, and address the imbalance problem and high dimensionality simultaneously. S. Kim et al. [42] propose an approach to detect and eliminate noise in defect data. N. Limsettho et al. [3] propose an approach named Class Distribution Estimation with Synthetic Minority Oversampling Technique (CDE-SMOTE), which modifies the distribution of the training data towards a balanced distribution.

Classification Algorithms for Class Imbalance
Classification is a form of data analysis used to build a model that minimizes the number of classification errors on a training dataset [43]. Some classifiers are commonly used because of their outstanding performance, e.g., Naïve Bayes [44], the multilayer perceptron [45], K-nearest neighbors [46], logistic regression [47], decision trees [48], support vector machines [49], and backpropagation neural networks [50]. However, it has been confirmed that ensembles of a few weak classifiers outperform a single common classifier [4,16] when the training dataset has a class imbalance problem. Random forest is a frequently used ensemble method in machine learning, which ensembles a number of decision trees for classification; however, it is still negatively influenced by an imbalanced class distribution [51]. Facing the imbalance problem, F. Herrera et al. [52] evaluate the performance of diverse approaches for imbalanced classification and use the MapReduce framework to solve the imbalance problem in big data. J. Xiao et al. [51] propose a dynamic classifier ensemble method for imbalanced data (DCEID). This method combines ensemble learning with cost-sensitive learning, which improves classification accuracy effectively.
All of these methods have been proved to improve classifier performance efficiently, but the sample size of SDP defect data and its relationship with classifiers remain unexplored. Moreover, the cooperation between resampling methods and classifiers has received little attention. Therefore, in this paper we first empirically study the influence of sample size on classifiers and resampling methods; then we investigate the cooperation between resampling methods and classifiers. Finally, based on the results of the empirical study, we propose a novel resampling method for imbalanced learning in software defect prediction, which can improve prediction results on SDP datasets.

Overall Structure
In order to solve the class imbalance problem rationally and effectively, we choose to remove less-informative samples of the majority class, instead of deleting samples randomly, to rebalance the data distribution. Furthermore, we define the informative degree of a sample by measuring the difference between the conditional probabilities of its feature values in the defective and non-defective classes, which is the main idea of LIMCR. The proposed LIMCR involves three key phases. In the first phase, LIMCR defines the rule for calculating sample information on one feature based on Naïve Bayes. In the second phase, LIMCR aggregates the per-feature informative variable and proposes a new variable describing the sample's informative degree. In the third phase, LIMCR analyzes the relationship between this variable and the sample distribution, and gives the definition of less-informative majorities. The structure of the proposed method LIMCR is shown in Figure 1.

Assumptions of the Proposed Method
In order to make the calculation of LIMCR more efficient and applicable to more datasets, the proposed method is based on the following assumptions: 1. All features are independent given the class label; 2. All features are continuous variables and the likelihood of each feature is assumed to be Gaussian; 3. There is only one majority class in a dataset.

Variable of Sample Information for One Feature
A sample E is represented by a set of feature values X = (x_1, x_2, ..., x_m) and a class label Y, where Y can only be 1 or 0. According to Bayes' theorem, the posterior probability of Y is

p(Y = y \mid X_i) = \frac{p(X_i \mid Y = y)\, p(Y = y)}{p(X_i)}.

Because of the assumption that all features are independent given the class label, the conditional probability of a sample X_i with m features is

p(X_i \mid Y = y) = \prod_{j=1}^{m} p(x_{ij} \mid Y = y).

The Naïve Bayes classifier can then be expressed as the posterior odds

f_b(X_i) = \frac{p(Y = 1) \prod_{j=1}^{m} p(x_{ij} \mid Y = 1)}{p(Y = 0) \prod_{j=1}^{m} p(x_{ij} \mid Y = 0)},

where the priors are estimated from the n training samples and their ratio is determined by the imbalance ratio IR. The class label Y of a sample with features X_i is then predicted as 1 if f_b(X_i) > 1 and 0 otherwise. For a single feature, the bigger the gap between p(x_{ij} | Y = 1) and p(x_{ij} | Y = 0), the easier the sample X_i can be correctly classified by the Naïve Bayes classifier. Correspondingly, the more easily a sample is misclassified, the more informative it is. Generally, the conditional probabilities p(x_{ij} | Y = 0) and p(x_{ij} | Y = 1) are estimated from the samples in the dataset, and the likelihood of each feature is assumed to be Gaussian. When there is only one feature, the conditional probabilities are calculated as

p(x_i \mid Y = y) = \frac{1}{\sqrt{2\pi \sigma_y^2}} \exp\!\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right),

where \mu_y and \sigma_y^2 are the mean and variance of the feature in class y. If a sample with one feature can be precisely classified into its corresponding class (for instance y = 0), the conditional probability p(x_i | y = 0) should be close to 1 and p(x_i | y = 1) close to 0, so the difference between them should be close to 1. Such samples cannot provide much effective information to the classifier, apart from enlarging the sample variance. Figure 2 presents two curves in each subfigure, which are the probability density functions of the two conditional distributions in one dimension. It is known that samples in the overlapping area are hard to classify correctly and may disturb model training.
The data distribution of the original dataset in one dimension looks like the curves in Figure 2a: the distribution of the majority class is dispersive and creates a large overlapping area with the minority class. After removing less-informative majority samples, the variance of the majority class becomes smaller and the overlapping area shrinks accordingly, as in Figure 2b. The two figures illustrate that an increase in the sample variance of one class enlarges the overlapping area and makes the learning phase harder. Considering that, in imbalanced datasets, the number of majority samples is larger than the number of minority samples, we define an informative variable D to evaluate how informative a majority sample is for one feature. It is defined as the difference between the conditional probabilities, D_ik = p(x_ik | y_i = 0) − p(x_ik | y_i = 1), where i indexes the sample and k indexes the feature.
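The per-feature variable D_ik can be computed directly from the Gaussian likelihoods defined above. A minimal sketch (the function names and the example means and variances are illustrative, not from the paper):

```python
import math

def gaussian_pdf(x, mu, var):
    """Gaussian likelihood of a single feature value."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def informative_D(x_ik, mu0, var0, mu1, var1):
    """D_ik = p(x_ik | y=0) - p(x_ik | y=1), with Gaussian likelihoods
    estimated from the majority (y=0) and minority (y=1) classes."""
    return gaussian_pdf(x_ik, mu0, var0) - gaussian_pdf(x_ik, mu1, var1)

# a majority sample far from the minority mean is easy to classify:
# large positive D, hence less informative on this feature
d_easy = informative_D(0.0, mu0=0.0, var0=1.0, mu1=5.0, var1=1.0)

# a sample in the overlap region has D close to zero: more informative
d_hard = informative_D(2.5, mu0=0.0, var0=1.0, mu1=5.0, var1=1.0)
```

The sample at the majority mean yields D near the peak density, while the sample midway between the two class means yields D near zero, matching the intuition that overlap-region samples carry the most classification information.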

Variable of Sample Informative Degree
In the feature space, the variable D can only describe the distribution of one feature and cannot capture the distribution characteristics of a whole sample. Since the Naïve Bayes algorithm assumes that all features are independent given the class label, relationships between features are not involved in our method. Under this assumption, we propose a rule that aggregates the informative variable D of each feature to obtain the informative degree SUM_D of each majority sample.
The construction of the informative degree SUM_D mainly considers two aspects. One is that the difference between the two conditional probabilities p(X | Y = 0) and p(X | Y = 1) might be too small to separate samples with different labels. The other is that D values from different features might cancel each other out after summation. To avoid both problems, we sum the rank values of D instead of the variable itself. The detailed calculation steps are given in Algorithm 1. SUM_D(X_i) quantifies how informative a sample is; in particular, when the classifier is Naïve Bayes, this variable denotes how difficult it is for a Naïve Bayes classifier to learn classification information from the sample. The rank values recorded in Rank_vector clearly distinguish the D values of different features, and the product of Rank_vector and SIGN_vector efficiently avoids cancellation of D values across features.
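The rank-and-sign aggregation can be sketched as follows; `scipy.stats.rankdata` assigns average ranks to ties, matching the rule described above, and the matrix D of example values is hypothetical:

```python
import numpy as np
from scipy.stats import rankdata

def sum_d(D):
    """D: (n_majority, n_features) matrix of D_ik values.
    For each feature, rank |D_ik| across samples (average ranks for
    ties), restore the sign of D_ik, and sum over features."""
    ranks = rankdata(np.abs(D), axis=0)   # ranks within each feature column
    return (ranks * np.sign(D)).sum(axis=1)

D = np.array([[ 0.40,  0.35],    # clearly majority-like on both features
              [ 0.05, -0.02],    # overlap sample: small / negative D
              [ 0.30,  0.25]])
scores = sum_d(D)
# the first sample gets the largest SUM_D (least informative);
# the second gets the smallest (overlap or noise)
```

Using ranks instead of raw D values keeps one feature with large densities from dominating the sum, and the sign factor preserves the direction of each per-feature difference.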

Finding the Less Informative Majorities
Generally, the bigger SUM_D is, the less informative the majority sample is, so we try to find and remove majority samples with large SUM_D values. However, there is another situation to note: when the SUM_D value is negative, the majority sample lies in the overlapping area or even inside the minority class area. Such samples are overlapping samples or noise, and both can harm classification performance. Summarizing the rules above, we give the definition of less-informative majorities. Definition 2. Majority samples with a too large or too small SUM_D value are defined as less-informative samples.
We order the majority samples by SUM_D and remove a specified number of samples from the beginning and the end of the sequence. After removal, we recalculate the data distribution variables and repeat the procedure introduced above until the imbalance problem is solved. The main components of LIMCR are described in Algorithm 1.
Algorithm 1: LIMCR
Input: Original training set S_o; iteration steps step_a and step_b.
Output: Resampled dataset S_new.
1. Split S_o into the majority class set S_Maj and the minority class set S_Min.
2. Calculate the prior probabilities of the majority and minority classes, Prior_P0 and Prior_P1.
3. Calculate the mean X̄_Majk and variance S²_Majk of feature k over S_Maj (k = 1, 2, ..., m; m = number of features).
4. Calculate the mean X̄_Mink and variance S²_Mink of feature k over S_Min (k = 1, 2, ..., m).
5. For each majority sample X_i, compute D_ik for every feature k, and record ABS_vector(i) = abs(D_i) together with the signs in SIGN_vector.
6. Sort the elements of ABS_vector from smallest to largest and record the rank values in Rank_vector; elements with equal values receive the average of their ranks.
7. Compute SUM_D_i as the sum over features of Rank_vector × SIGN_vector.
8. Determine the number of samples to remove in this iteration (N_a and N_b, from step_a and step_b), and sort X_i ∈ S_Maj by SUM_D_i from largest to smallest.
9. Remove the first N_b and the last N_a samples from S_Maj.
10. Renew the training set S_tempnew = S_Maj + S_Min, update the distribution statistics, and repeat from step 3 until the imbalance problem is solved; output S_new.
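The whole loop can be sketched in a self-contained way as follows. Variable names follow Algorithm 1; the stopping rule via a target imbalance ratio (`target_ir`), the variance smoothing term, and the tie-free ranking are our own simplifications, not the paper's exact implementation:

```python
import numpy as np

def limcr(X, y, step_a=1, step_b=1, target_ir=1.5):
    """Sketch of LIMCR: iteratively drop the majority samples whose
    SUM_D is largest (least informative) or smallest (overlap/noise)
    until the imbalance ratio falls to target_ir."""
    X, y = np.asarray(X, float), np.asarray(y)
    maj, minr = X[y == 0].copy(), X[y == 1]
    while len(maj) / max(len(minr), 1) > target_ir and len(maj) > step_a + step_b:
        # Gaussian likelihood parameters of both classes, per feature
        mu0, var0 = maj.mean(0), maj.var(0) + 1e-9
        mu1, var1 = minr.mean(0), minr.var(0) + 1e-9
        p0 = np.exp(-(maj - mu0) ** 2 / (2 * var0)) / np.sqrt(2 * np.pi * var0)
        p1 = np.exp(-(maj - mu1) ** 2 / (2 * var1)) / np.sqrt(2 * np.pi * var1)
        D = p0 - p1
        # rank |D| within each feature column, restore signs, sum
        order = np.argsort(np.abs(D), axis=0)
        ranks = np.empty_like(D)
        for k in range(D.shape[1]):
            ranks[order[:, k], k] = np.arange(1, len(maj) + 1)
        sum_d = (ranks * np.sign(D)).sum(1)
        idx = np.argsort(-sum_d)               # largest SUM_D first
        keep = idx[step_b:len(idx) - step_a]   # drop first N_b, last N_a
        maj = maj[keep]
    X_new = np.vstack([maj, minr])
    y_new = np.r_[np.zeros(len(maj)), np.ones(len(minr))]
    return X_new, y_new

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 2)),   # 40 majority samples
               rng.normal(3.0, 1.0, (8, 2))])   # 8 minority samples
y = np.r_[np.zeros(40), np.ones(8)]
X_new, y_new = limcr(X, y)
# the majority class shrinks until IR <= 1.5; the minority is untouched
```

Only majority samples are ever removed, so all minority information is preserved, which is the point of the cleaning-rule design.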

Benchmark Datasets
The datasets we choose in this research as benchmark data are software defect datasets from Marian Jureczko [53], NASA MDP datasets from Tim Menzies [54], and the Eclipse bug datasets from Thomas Zimmermann [55]. The datasets of the first two studies can be obtained from the website (https://zenodo.org/search?page=1&size=20&q=software%20defect%20predictio) and the Eclipse bug datasets are downloaded from the Eclipse bug repository (https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/). We investigate the sample size and IR value of each dataset (108 in total); statistical results are shown in the pie chart in Figure 3. Two basic distribution characteristics are as follows: 1. Most IR (imbalanced ratio) values of SDP datasets range from 2 to 100. 2. The sample sizes of SDP datasets from different projects differ hugely: some small datasets have fewer than 100 samples, while some large datasets have more than 10,000.
In the experiments, we choose 29 of the SDP datasets investigated above as benchmark data. Information on the selected datasets is presented in Table 1. To obtain a binary classification problem, we regard all samples whose label (the number of bugs) is 1 or greater as the same class and redefine their label as "1"; samples with the label "0" are unchanged.
The imbalanced ratio is defined as the ratio of the number of negative samples to the number of positive samples, and the sample size is the number of samples in a dataset.
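The relabeling and the imbalanced-ratio definition can be illustrated with a small hypothetical vector of bug counts:

```python
import numpy as np

# bug counts per module, as recorded in the benchmark datasets
bug_counts = np.array([0, 0, 3, 0, 1, 0, 0, 2, 0, 0])

# binarize: any module with at least one bug becomes class "1"
y = (bug_counts >= 1).astype(int)

# imbalanced ratio = number of negative samples / number of positive samples
ir = (y == 0).sum() / (y == 1).sum()
# here IR = 7 / 3
```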

Performance Metrics
In the experiments, we exploit four common performance metrics: recall, G-mean, AUC, and Balanced Accuracy Score (balancedscore) [56]. The larger the value of each metric, the better the classifier's performance. All these metrics are based on the confusion matrix (Table 2), where defective modules are regarded as buggy (or positive) samples and non-defective modules as clean (or negative) samples. According to the confusion matrix, PD (the probability of detection, also called recall or TPR), PF (the probability of false alarm, also called FPR), and precision are defined as PD = recall = TP/(TP + FN), PF = FP/(FP + TN), and precision = TP/(TP + FP).
Recall and G-mean, where G-mean = √(PD · (1 − PF)), have been proved more suitable for imbalanced learning [20].
AUC measures the area under the ROC curve, which describes the trade-off between PD and PF. It can be calculated as

AUC = \frac{\sum_{buggy_i} rank(buggy_i) - M(M + 1)/2}{M \cdot N},

where \sum_{buggy_i} rank(buggy_i) is the sum of the ranks of all buggy (positive) samples, and M and N are the numbers of buggy and clean samples, respectively. The Balanced Accuracy Score (balancedscore), another accuracy metric, is defined as the average recall obtained on each class; this metric avoids the inflated accuracy that an imbalanced class distribution can cause. Assume that y_i is the true label of the i-th sample and ω_i is the corresponding sample weight. We adjust the sample weight to \hat{ω}_i = ω_i / \sum_j 1(y_j = y_i) ω_j, where 1(·) is the indicator function. Given the predicted label \hat{y}_i, balancedscore is defined as

balancedscore(y, \hat{y}, \hat{ω}) = \frac{1}{\sum_i \hat{ω}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{ω}_i.
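For unweighted binary labels, the four definitions above reduce to simple functions of the confusion matrix. A minimal sketch (the function name and the example counts are illustrative):

```python
import math

def metrics(tp, fn, fp, tn):
    """Recall (PD), PF, G-mean, and balanced accuracy from a binary
    confusion matrix, following the definitions in the text."""
    recall = tp / (tp + fn)        # PD / TPR
    pf = fp / (fp + tn)            # FPR
    tnr = tn / (tn + fp)           # recall of the clean class
    g_mean = math.sqrt(recall * tnr)
    balanced = (recall + tnr) / 2  # average recall over both classes
    return recall, pf, g_mean, balanced

r, pf, g, b = metrics(tp=30, fn=10, fp=20, tn=140)
# recall = 0.75, PF = 0.125, balancedscore = 0.8125
```

Note how plain accuracy on this example would be (30 + 140)/200 = 0.85, inflated by the large clean class, while balancedscore stays at 0.8125.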

RQ1: Which baseline classifiers do we choose to match the imbalance learning methods at different sample sizes?
Motivation: The classification effect can be affected by the classifier, the imbalance learning method, the sample size, and the number of features. To improve the efficiency of the experiments, we perform an empirical study to give priority to classifiers that perform well on SDP datasets. On the other hand, we need to explore the impact of different classifiers on different sample sizes.
Approach: We first perform a preliminary experiment to show which baseline classifier performs best without any resampling method on the 29 benchmark datasets. We choose nine baseline classifiers, whose parameters are listed in Table 3. All baseline classifiers [57] are implemented in scikit-learn v0.20.3 [58]. The parameters of each classifier are decided by pre-experiments; parameters that have no influence on classification performance are kept at their default values. Results: Table 4 summarizes the results of the nine classifiers on the 29 datasets. The best result is highlighted in bold-face type. The differences between the results of the classifiers are analyzed using the Friedman and Wilcoxon statistical tests [59].
The performance of the classifiers is measured by recall, G-mean, and AUC; the results for the three metrics are quite similar, so we present the recall values in Table 4. The average value of each algorithm is listed after the results of the individual datasets, followed by the average rank calculated in the Friedman test; the lower the average rank, the better the classifier. From the recall values we can see clearly that GNB performs better than the other basic classifiers on most datasets when the dataset size is around 100 to 1100, while ABC and DTC perform better on most datasets when the sample size is larger than 1100. Moreover, from the average results in the last two rows of Table 4, GNB attains the highest average recall over all datasets, but ABC obtains the best Friedman rank among the nine classifiers. We present the G-mean and AUC results together with recall in Figures 4 and 5; the figures show a similar trend for the three metrics.
For more detail, we divide the datasets into two parts according to sample size and analyze the differences among the classifiers on each part. Datasets with a sample size smaller than 1100 are called small datasets; otherwise, we call them large datasets. The average values and Friedman ranks are recalculated for the two parts in Table 5.
From Table 5 we can clearly see that GNB performs best among the nine selected classifiers when the sample size of a dataset is small (the boundary between small and large sample sizes being 1100), while ABC and DTC perform best on datasets with a large sample size; the differences among the classifiers' average results are analyzed in Table 6. From the results on small datasets, in the Friedman test we reject the null hypothesis that the nine classifiers have no significant difference (the p-values of the three metrics are all smaller than 0.00001). The Nemenyi post hoc analysis (critical difference of the three metrics CD = 3.003, α = 0.05) shows that ABC, Bgg, GNB, LR, and DTC are significantly better than the others. According to the average ranks and the per-dataset results, ABC and GNB seem slightly better than the others, and GNB intuitively slightly better than ABC. To verify whether GNB is significantly better than ABC, we perform a further paired Wilcoxon test whose null hypothesis is that there is no significant difference between ABC and GNB. The p-values for the three metrics are 0.017, 0.053, and 0.088, respectively. According to the Wilcoxon test, for the defective-sample-detection metric (recall) GNB performs significantly better than ABC, and for the overall accuracy metrics (G-mean and AUC) GNB performs slightly better than ABC. Considering that the cost of a false negative is much higher than that of a false positive, we attach more importance to recall. Therefore, we regard GNB as the best of the nine basic classifiers on datasets with small sample sizes. However, we also observe the poor performance of all nine classifiers; even GNB has a relatively low AUC on small datasets. This indicates that classifier performance is restricted by the class imbalance problem, and that there is great room for improvement once the class imbalance problem is reasonably overcome.
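The statistical procedure used throughout (Friedman test across datasets, then a paired Wilcoxon test between the two best candidates) can be reproduced with SciPy. The recall values below are illustrative numbers, not the paper's Table 4:

```python
from scipy.stats import friedmanchisquare, wilcoxon

# recall of three classifiers on eight datasets (hypothetical values)
gnb = [0.71, 0.68, 0.74, 0.70, 0.66, 0.73, 0.69, 0.72]
abc = [0.65, 0.66, 0.70, 0.64, 0.63, 0.69, 0.62, 0.67]
dtc = [0.60, 0.58, 0.66, 0.61, 0.59, 0.64, 0.57, 0.62]

# Friedman test: do the classifiers differ significantly across datasets?
stat, p_friedman = friedmanchisquare(gnb, abc, dtc)

# paired Wilcoxon signed-rank test between the two best candidates
w_stat, p_wilcoxon = wilcoxon(gnb, abc)
```

With GNB beating ABC on every dataset in this toy example, both tests reject their null hypotheses at α = 0.05; a post hoc procedure such as the Nemenyi test would then be applied to locate the pairwise differences.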
From the results on large datasets, we observe that GNB performs worse than DTC and ABC. In the Friedman test, all p-values for the three metrics are smaller than 0.00001, which shows a significant difference among the nine classifiers; the Nemenyi post hoc analysis (critical difference of the three metrics CD = 3.332, α = 0.05) then shows that LR, MLP, RF, and SVC perform significantly worse than the other classifiers and are thus unsuitable for SDP datasets with large sample sizes. The paired Wilcoxon test shows that the differences between ABC and DTC are not significant, because the null hypothesis of no significant difference between ABC and DTC cannot be rejected (the p-values for recall, G-mean, and AUC are 0.433, 0.396, and 0.753, respectively). Furthermore, combined with the average ranks, we find that ABC performs well on both small and large datasets (ranking second on small datasets and first on large datasets); the Wilcoxon rank-sum test supports the null hypothesis of no significant difference between large and small datasets for ABC (the p-values for recall, G-mean, and AUC are 0.558, 0.661, and 0.539, respectively). This reflects that the performance of ABC is not affected by the sample size of the datasets.
RQ2: How does LIMCR perform compared with other imbalance learning methods on datasets with small sample sizes?
Motivation: We selected the baseline classifier GNB for small sample sizes in RQ1. Now we need to validate the effectiveness of our proposed LIMCR.
Approach: To solve the class imbalance problem, researchers usually use two kinds of methods: resampling methods at the data level and classification methods at the algorithm level. Resampling methods are usually sorted into three categories: over-sampling, under-sampling, and combinations of over- and under-sampling. The question of which method is more suitable for the class imbalance problem has been discussed in many studies [4,60,61]. To evaluate the effectiveness of LIMCR, we employ six baseline imbalance learning methods, whose parameters are listed in Table 7. All these baseline methods are implemented in the Imbalanced-learn module in Python [62]. In this experiment, we compare the performance metrics (balancedscore and G-mean) of our LIMCR with the baseline imbalance learning methods on the small benchmark datasets. All imbalance learning methods, including LIMCR, are combined with the baseline classifier GNB.
Results: Tables 8 and 9 present the results of our LIMCR and the baseline imbalance learning methods on the small datasets in terms of balancedscore and G-mean, respectively. The average balancedscore and G-mean of LIMCR are 0.701 and 0.69, better than the other baseline imbalance learning methods. In further study, the p-values of the Friedman test on these two performance metrics are all smaller than 0.00001, which shows that significant differences exist among the seven methods. In the Nemenyi post hoc test (CD = 2.326, α < 0.05), we underline the average ranks that are significantly worse than our LIMCR. The results reflect that the ensemble algorithms perform significantly worse than the resampling methods combined with GNB.
In order to find out whether there is any significant difference between the resampling methods, we perform the Wilcoxon signed-rank test between our LIMCR and the other resampling methods; the p-values of each test are listed in Table 10. From Table 10 we can learn that SMOE shows no significant difference from LIMCR, but in terms of the Friedman average ranks, LIMCR performs slightly better than SMOE. On the overall metrics balancedscore and G-mean, IHT is significantly worse than LIMCR. The other two methods, B-SMO and NCL, perform significantly worse than LIMCR on most datasets for balancedscore and G-mean. Therefore, we conclude that LIMCR performs better than most other imbalance learning methods.
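The significance tests used in this comparison are available in SciPy. The sketch below uses made-up scores (`limcr_scores` and the others are hypothetical, not the paper's measurements) purely to show the calls involved.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical G-mean scores of three methods over ten datasets.
rng = np.random.default_rng(1)
base = rng.uniform(0.5, 0.8, size=10)
limcr_scores = base + 0.05   # consistently best
smoe_scores = base + 0.04    # close second
iht_scores = base - 0.05     # consistently worst

# Friedman test: is there any difference among the three methods overall?
print(friedmanchisquare(limcr_scores, smoe_scores, iht_scores).pvalue)

# Paired Wilcoxon signed-rank test between two specific methods.
print(wilcoxon(limcr_scores, smoe_scores).pvalue)
```

As in the paper, a small Friedman p-value only licenses the pairwise follow-up tests; the Wilcoxon test then decides whether a specific pair of methods differs.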
RQ3: How does LIMCR work with other classifiers?
Motivation: In principle, our LIMCR is based on Bayesian probability, so GNB is the most suitable classifier for it. However, we expect LIMCR to still perform well when it is combined with other classifiers.
Approach: In this experiment, we choose two other well-performing classifiers, ABC and DTC, and combine the three classifiers with the three resampling methods LIMCR, SMOE and IHT on the small datasets, comparing them in terms of balancedscore and G-mean.
Results: Tables 11 and 12 show the results of matching the three classifiers with the three resampling methods on the small datasets in terms of balancedscore and G-mean. We notice that the average balancedscore of LIMCR+GNB is 0.701, which is equal to that of IHT+ABC and higher than those of the other combinations. Meanwhile, the average G-mean of LIMCR+GNB is 0.69, while that of IHT+ABC is 0.696. From the average scores and the Friedman average ranks, we see that the combination of LIMCR and GNB still performs better than the others except on G-mean, where IHT+ABC ranks first and LIMCR+GNB is slightly worse.

Friedman test results are shown in Table 13. The column named Total is the Friedman test among all nine methods; the p-value of metric G-mean is 0.123, larger than 0.05, which means there is no significant difference among all nine methods, i.e., for G-mean, LIMCR with other classifiers can perform as well as it does with GNB and as well as the other methods. The result of the Nemenyi post hoc test (CD = 3.12, α < 0.05) supports this. The columns LIMCR, SMOE and IHT represent the Friedman tests among classifiers combined with the same resampling method; for instance, the p-value in column LIMCR is the result of the Friedman test among GNB, DTC and ABC, each combined with LIMCR. For LIMCR, the p-value on balancedscore is smaller than 0.05, which suggests that the combination LIMCR+GNB performs significantly better than the other combinations with LIMCR. For the other resampling methods, the p-values of balancedscore and G-mean are all larger than 0.05, which suggests that for SMOE and IHT there is no significant difference among their combinations with different classifiers from the perspective of balancedscore and G-mean. We can draw a conclusion from this experiment: LIMCR, SMOE and IHT all show no significant difference on some metrics when combined with different classifiers, but the opposite on other metrics. We pay more attention to the difference: the performance of a data resampling method may change with the classifier used in imbalance learning; therefore, when the dataset has a small sample size, it is necessary to choose GNB or Naïve Bayes as the basic classifier in imbalance learning.

RQ4: How does the number of features influence the performance of LIMCR?

Approach: In this experiment, we retain the k (k = 4, 8, 12, 16 and 20) highest-scoring features to observe the variation in the performance of LIMCR. We exploit a feature selection method named SelectKBest from the feature selection module in scikit-learn v0.20.3 [58], where the dependence between each feature and the class label is measured by the chi-square score [57].
The main reason we choose this method is its convenience for selecting a given number of features in the experiment; moreover, it removes features by univariate analysis, which suits our datasets.
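A minimal sketch of this selection step with scikit-learn's SelectKBest and the chi-square score, on synthetic non-negative data standing in for software metrics (the dataset below is generated, not one of the benchmark datasets):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for an SDP dataset with 20 features. chi2 requires
# non-negative inputs, as is the case for typical software metrics.
X, y = make_classification(n_samples=150, n_features=20, random_state=0)
X = np.abs(X)

# Keep the k highest-scoring features, as in the RQ4 experiment.
for k in (4, 8, 12, 16, 20):
    X_k = SelectKBest(chi2, k=k).fit_transform(X, y)
    print(k, X_k.shape)
```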
Results: The results are shown in Tables 14 and 15. We notice that the average balancedscore values vary from 0.636 to 0.662 and the average G-mean values from 0.534 to 0.631. When the number of features is 16, the average balancedscore and G-mean are highest. From the p-values of the Friedman test in Table 16, we know that for metric balancedscore there is no significant difference among the five levels of the number of features. According to the Nemenyi post hoc test (CD = 1.575), the settings whose Friedman average ranks exceed that of the best setting by more than 1.575 are significantly worse. For metric G-mean, only the datasets with four features perform significantly worse than the best setting; the others show no significant difference. As the number of features declines, the precision score increases while the overall performance declines. In summary, it can be believed that the number of features has no influence on the performance of LIMCR unless it falls below a certain value, which in this experiment is 4 or 5.
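For reference, both metrics reported in these tables can be derived from the per-class recalls of a confusion matrix. The sketch below uses hypothetical predictions to show that balancedscore is the arithmetic mean of the two recalls, while G-mean is their geometric mean.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

# Hypothetical predictions on a small imbalanced test set
# (8 non-defective samples labeled 0, 2 defect-prone samples labeled 1).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall_pos = tp / (tp + fn)   # sensitivity on the defect-prone class
recall_neg = tn / (tn + fp)   # specificity on the non-defective class

# Balancedscore = arithmetic mean; G-mean = geometric mean of the recalls.
print(round((recall_pos + recall_neg) / 2, 3))   # → 0.625
print(round(np.sqrt(recall_pos * recall_neg), 3))  # → 0.612
```

The first value coincides with scikit-learn's `balanced_accuracy_score`, which is why the paper can use it as the balancedscore implementation.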

Table 16. p-values of the Friedman test on different numbers of features.

Metric    Balancedscore   G-Mean
p-value   0.054           <0.001

RQ5: How does LIMCR work with datasets with large sample sizes?
Motivation: From RQ1 we know that the sample size has a great influence on classifier selection in imbalance learning, and our proposed LIMCR has been proved to perform well on small datasets. However, how LIMCR performs on datasets with a large sample size also needs to be known.
Approach: In this experiment, we combine LIMCR with three classifiers: GNB, ABC and DTC. IHT combined with the same classifiers is used for comparison. The datasets are the large datasets introduced in RQ1.
Results: The results of the comparison between the six combined methods are listed in Tables 17 and 18. From Tables 17 and 18, for both metrics balancedscore and G-mean, IHT obtains higher results than LIMCR when using the same classifier. In other words, the resampling method IHT performs better than our LIMCR according to the average scores and Friedman average ranks. Meanwhile, the Nemenyi post hoc test (CD = 2.015) shows that all classifiers combined with LIMCR perform significantly worse than the best performer on balancedscore. From this we conclude that the proposed LIMCR performs well when the sample size is small (generally smaller than 1100) but becomes worse than IHT when the sample size increases (generally larger than 1100). Thus, LIMCR can achieve better performance with a typical classifier when the sample size is smaller.

Why Keep 3 Digits for Informative Variable D?
As mentioned in Section 3.3, the rank values of the informative variable D on the features of a sample have a great effect on estimating how informative the sample is; moreover, the precision of variable D affects the rank values directly. Therefore, it is necessary to discuss a proper value of this parameter (the precision of variable D). The aim of this discussion is to present the effects of different precisions of D on the performance of the proposed LIMCR. Considering the space limitation, we randomly select the results of three datasets with different sample sizes (the sample sizes of synapse-1.0, PC4 and prop2 are 157, 1270 and 23,014, respectively) and the average result of the 29 datasets introduced in Section 3.1. We choose GNB as the classifier and evaluate the performance with balancedscore, precision, recall and G-mean. The precision of D is varied from 0 to 5 in increments of 1. The experimental results are presented in Figure 6. From the figures we notice that when this parameter equals 3, LIMCR performs stably and better than with most other values on the average result. All metrics except precision show an increasing trend as the parameter value increases; inversely, precision decreases as the parameter value increases. In order to obtain the globally optimal performance, we choose 3 as the generic value of this parameter.
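The effect of this rounding can be sketched as follows. The scores in `D` are hypothetical, and SciPy's `rankdata` merely illustrates how lower precision merges near-equal values of D into tied ranks; this is an illustration of the rounding step only, not the full LIMCR procedure.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical informative values D for five majority-class samples.
D = np.array([0.51237, 0.51242, 0.49871, 0.73310, 0.49875])

# Rounding D to p decimal digits before ranking merges near-equal values
# into ties, which stabilizes the resulting rank values.
for p in (1, 3, 5):
    ranks = rankdata(np.round(D, p), method="min").astype(int)
    print(p, ranks.tolist())
# 1 [1, 1, 1, 5, 1]   (too coarse: almost everything ties)
# 3 [3, 3, 1, 5, 1]   (near-equal pairs tie, distinct values separate)
# 5 [3, 4, 1, 5, 2]   (full precision: no ties at all)
```

Keeping 3 digits sits between the two extremes: genuinely close values of D still tie, while clearly different values keep distinct ranks.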

Threats to Validity
There are still several potential limitations in our study, as follows.

1. The quality and quantity of the datasets for the empirical study might be insufficient. Although we have collected more than 100 datasets to illustrate the distribution of sample size and imbalance ratio in most SDP datasets, and 29 datasets for the investigations in the empirical study, it is still hard to confirm whether these datasets are typical enough to reflect the characteristics of SDP data.
2. The generalization of our method might be limited. The method we propose focuses on binary classification; it improves the performance of predicting whether a sample (software module) has any defects but cannot predict the number of defects in it. More types of defect datasets should be considered in the future to reduce this threat.
3. The performance evaluation metrics we selected might be partial. Many metrics, such as PD and MCC, have been used in binary classification for SDP research. F1 is also widely used in SDP, but we do not employ it because it has been proved to be biased and unreliable [63]. Although we selected the evaluation metrics from two aspects, overall performance and one-class accuracy, the limited number of metrics still poses some threats to construct validity.
4. The practical significance of LIMCR in software engineering might be extended. Project members can obtain information on possible defect-prone modules of the software before failures occur by using defect prediction techniques, but LIMCR has not yet been applied to predict defect classes/severities [64]. In addition, it is worth studying the performance of LIMCR with different prediction models (within a single project, cross-project predictions) [65]. Meanwhile, how to cooperate with the instance deletion, missing-value replacement and normalization issues mentioned in [66], as well as with defect prediction cost effectiveness [67], also needs further research.

Conclusions
The performance of a defect prediction model is influenced by the sample size of the dataset, the selection of classifiers and the data resampling methods. In our empirical study, we compared the performance of nine popular classifiers on 29 software datasets with sample sizes ranging from 100 to 20,000 to study the influence of sample size and classifier. The major conclusion of this part is that GNB performs well on small datasets, but its performance deteriorates when the sample size grows beyond 1100. Another classifier, ABC, performs stably across different sample sizes and obtains relatively better results on large datasets than the other classifiers. On this basis, in order to make an expected matching on small datasets, we proposed a new resampling method, LIMCR, motivated by the good performance of GNB. LIMCR is intended for SDP datasets with a small sample size and is designed as the best resampling method to cooperate with the classifier GNB. The results of the comparison experiments confirm that LIMCR performs better than the other resampling methods, and that the matching of GNB and LIMCR is the best solution for the imbalance problem in SDP datasets with small sample sizes. Besides, we also designed experiments to examine how LIMCR performs with other classifiers, with feature selection and on data with a large sample size. The results can be summarized as follows.
1. LIMCR together with the classifier GNB is a better solution for the imbalance problem on SDP datasets with small sample sizes, being slightly better than SMOE+GNB.
2. In terms of G-mean, LIMCR performs equally well when cooperating with other classifiers. In terms of balancedscore, when cooperating with LIMCR, GNB performs significantly better than the other classifiers.
3. The number of features in a dataset has no influence on LIMCR, but the performance turns significantly worse when the number of features is less than 5.
4. When the sample size is bigger than 1100, the performance of LIMCR is worse than that of IHT, so in that case IHT is recommended as the best imbalance learning method for SDP.
Although our proposed LIMCR cannot outperform the alternatives on all datasets, the result of our research emphasizes the influence of the datasets themselves. There is no all-purpose imbalance learning method, so choosing methods appropriately is also important. In the future, we plan to extend our research to cover other data distribution problems, such as the overlapping problem and high dimensionality. We will update LIMCR to solve more combined problems and to suit more SDP datasets.

Conflicts of Interest:
The authors declare no conflict of interest.