Article

Improving Software Defect Prediction in Noisy Imbalanced Datasets

1 School of Reliability and Systems Engineering, Beihang University, Beijing 100191, China
2 China Electronic Product Reliability and Environmental Testing Research Institute, Guangzhou 510610, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(18), 10466; https://doi.org/10.3390/app131810466
Submission received: 3 August 2023 / Revised: 8 September 2023 / Accepted: 15 September 2023 / Published: 19 September 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Software defect prediction is a popular method for optimizing software testing and improving software quality and reliability. However, software defect datasets usually have quality problems, such as class imbalance and data noise. Oversampling by generating minority class samples is one of the most well-known methods for improving the quality of such datasets; however, it often introduces overfitting noise. To better improve the quality of these datasets, this paper proposes a method called US-PONR, which uses undersampling to remove duplicate samples from version iterations and then uses oversampling through propensity score matching to reduce class imbalance and noise samples in datasets. The effectiveness of this method was validated in a software defect prediction experiment involving 24 versions of software data from 11 projects in PROMISE, in noisy environments that varied from 0% to 30% noise level. The experiments showed a significant improvement in the quality of datasets pre-processed by US-PONR for noisy imbalanced datasets, especially the noisiest ones, compared with 12 other advanced dataset processing methods. The experiments also demonstrated that the US-PONR method can effectively identify label noise samples and remove them.

1. Introduction

Defects in software systems often lead to software errors and losses [1,2]. With the development of software technology, the scale and complexity of software have grown rapidly, and software testing to discover and correct software defects has become increasingly expensive. To optimize software testing schemes, software defect prediction (SDP) is used to identify modules in software systems that may contain defects and to improve the efficiency and performance of software testing. In recent years, with the rapid development of machine learning technology, many researchers have applied machine learning to software defect prediction [3] and have achieved remarkable results [4,5]. For machine learning to predict software defects, software defect metrics first need to be defined in order to characterize the distribution of defects. At present, CK metrics [6], process metrics [7,8,9,10], and network metrics [11,12,13,14,15] are widely used for this task.
The quality of the software defect dataset is one of the most important factors that affect the performance of SDP. In real-world software projects, defects usually exist in very few modules, so defect samples in the software defect dataset are far outnumbered by clean samples, resulting in serious class imbalance problems and reducing the performance of defect prediction [16]. Many scholars have proposed methods to solve the class imbalance problem, but these rebalancing methods are still limited in their application and performance [17]. Most of the existing dataset rebalancing methods are based on undersampling and oversampling [18]. Undersampling can effectively solve the class imbalance problem, but the instances discarded during undersampling may contain information useful or important for predicting defects [19]. In SDP, oversampling is often more effective than undersampling, and minority class sample generation methods such as SMOTE and MAHAKIL [20,21] can effectively alleviate the class imbalance problem, but these oversampling methods generate overfitting noise that can degrade the predictive performance of the model [22].
Another problem with software defect datasets involves the way labels on software samples are manually assigned by testers or developers [23]. Samples may be erroneously labeled due to insufficient knowledge of the defects, the time difference between defect introduction and defect discovery, or new defects caused by defect correction [24]; this mislabeling introduces noise into the dataset, which deteriorates its quality and reduces the performance of defect prediction. At present, there are few studies on the noise of software defect datasets in the SDP field [25,26]. To reduce noise in the dataset, propensity score matching (PSM), a noise reduction method used in machine learning [27], may be a good candidate: it is a statistical method that reduces noise by addressing the difference between the distribution of the observed data and the overall distribution.
This paper proposes a novel method for predicting software defects using data pre-processing. Named US-PONR—for Undersampling, Propensity-Score-Matching-Based Oversampling and Noise Reduction—this method aims to improve the quality of software defect datasets by addressing deficiencies caused by class imbalance and dataset noise. The method uses undersampling in data pre-processing to remove duplicate samples that result from multiple versions of the data. It then uses propensity score matching (PSM) to oversample and reduce noise in the dataset, which alleviates the introduction of overfitting noise. To predict defects, an aggregated multi-classifier trained under a cross-validation (CV) scoring method was built to construct the predictor. We conducted experiments that used US-PONR to predict defects under different noise environments; the results show that the proposed method is better than the benchmark and SOTA data pre-processing methods, and they further demonstrate the ability of the proposed method to identify and remove label noise samples.
This paper demonstrates how US-PONR can make the following contributions to the field of software defect prediction:
(1)
US-PONR offers a new PSM-based method for data oversampling and noise reduction, which reduces the introduction of overfitting noise caused by minority class sample generation.
(2)
It also offers a new method of data pre-processing for software defect prediction that uses a combination of undersampling, oversampling, and noise reduction.
(3)
US-PONR additionally achieves SOTA performance in software defect prediction experiments under different noise environment settings, and the experiments demonstrate that the method can effectively identify label noise samples and remove them.
The rest of this paper is organized as follows: Section 2 presents background information about software defect prediction; Section 3 presents the methodology the authors used to develop US-PONR; Section 4 describes the experiment the authors conducted to test the method, including the research questions and experiment settings; Section 5 reports the experiment’s results; Section 6 discusses the method’s limitations; and Section 7 spells out major conclusions.

2. Background

2.1. Software Defect Prediction (SDP) and Metrics

SDP is an increasingly important subject in software reliability research. Using historical data to link software metrics and defects, SDP supports the designing and testing of software by determining the defect tendency of software modules [28]. At present, SDP mainly uses machine learning algorithms to make binary classification judgments on whether a module is defective, and its workflow covers several stages, including data acquisition, data processing, model training, and model evaluation. A major focus of data collection is characterizing software features through metrics.
The earliest software metrics applied to SDP measured software code and its complexity. In 1991, Chidamber and Kemerer proposed the famous CK metric suite [6], which became one of the standard metric tuples in the SDP field. From the perspective of object-oriented design, the CK metric tuple comprehensively considers the factors affecting software, such as the number of code lines, the degree of class cohesion, and the relationships between classes. Researchers have also proposed applying metrics of the code development process [7,8,9,10] to SDP in order to address the macro-integrity of a software program and how its elements interact. Researchers have further provided network metrics to measure code and thus build a software network that can characterize software features; many studies on network metrics have been conducted [11,12,13,14,15]. Later, Jin proposed a distance metric based on cost-sensitive learning to reduce class imbalance and better differentiate defective from clean samples [29].
Besides software metrics, many studies have applied machine learning to software defect prediction. Goyal removed clean samples surrounding defect samples using a filtering technique to enhance the performance of predicting software defects with SVMs [30]. Xu represented the source code as an augmented code property graph and trained a software defect predictor with a graph neural network [31]. Hanif trained a code pre-trained language model using defect datasets as the corpus for code defect domain tasks [32].

2.2. Class Imbalance

In machine learning, the quality of the dataset is one of the most important factors affecting performance. In the SDP field, dataset quality is mainly compromised by widespread serious imbalance and the introduction of noise caused by various factors.
Class imbalance is a key problem in machine learning and data mining [33]. It refers to an imbalance in the proportions of a dataset’s different classes of instances. Since imbalanced defect datasets reduce the model’s ability to learn defects [1], dataset rebalancing methods are often used to process the original dataset. There are various methods for rebalancing datasets.
One method is undersampling, which balances the dataset by reducing samples from the majority class. The simplest form of undersampling is random undersampling, which balances the dataset by randomly discarding samples in the majority class. Guzmán-Ponce et al. proposed a two-stage undersampling algorithm combining the DBSCAN clustering algorithm with a graph-based procedure to address class imbalance [34]. The disadvantage of undersampling is that it may cause the training data to lose important information from the majority class samples [19].
Another method for rebalancing datasets is oversampling, which balances a dataset by adding minority class samples. Random oversampling (ROS) randomly replicates minority class samples, but this can lead to severe overfitting. Most current research instead balances the dataset by generating minority class samples. This form of oversampling often develops sample generation strategies based on the common assumption in the machine learning community that closer instances are more similar than more distant ones [35]. One sample generation strategy, MAHAKIL, uses the Mahalanobis distance to generate samples [21], but this method does not work when the number of minority class instances is less than their dimensionality, as it generates overly diverse data that reduces the model’s ability to find defects [36]. Another strategy, COSTE, generates samples based on complexity [37], but further investigation is needed to validate its assumption that complex samples carry less information about defects. Ochal et al. discussed the impact of class imbalance in few-shot learning and pointed out the effectiveness of oversampling for rebalancing datasets [38].
The SMOTE algorithm proposed by Chawla et al. in 2002 is one of the most commonly used oversampling methods in academia [20]; it generates minority class samples using k-NN spatial distance. However, this method may increase the risk of overfitting and also increase the false positive rate of prediction results [39]. Researchers have proposed variants of SMOTE to minimize these drawbacks [40]. Soltanzadeh and Hashemzadeh proposed a range-controlled SMOTE to alleviate the increased overlap between classes around the class boundaries caused by SMOTE [41]. Batista et al. [42] proposed combining SMOTE with TomekLinks (SMOTE_TomekLinks) and with the edited nearest neighbor rule (SMOTE_ENN). Han et al. [43] proposed Borderline-SMOTE. He et al. [44] added different weights for different minority instances (ADASYN). Douzas et al. applied k-means clustering [45] and self-organizing maps [46] to the SMOTE method (kmeans-SMOTE and SOMO). Lee et al. [47] added Gaussian random variables to the SMOTE sample synthesis process (Gaussian-SMOTE). Barua et al. [48] weighted each minority sample based on its distance to the nearest majority sample (MWMOTE). Recently, Agrawal et al. [36] proposed an algorithm to automatically optimize the parameter combination in SMOTE: k = number of neighbors, m = number of synthetic examples to create, and r = power parameter for the Minkowski distance (SMOTUNED).
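As a minimal illustration (not part of the original study), the snippet below rebalances a synthetic imbalanced dataset with the imbalanced-learn implementation of SMOTE; the k_neighbors parameter corresponds to the k that SMOTUNED tunes automatically, and the synthetic data here merely stand in for a defect dataset.

# Minimal SMOTE illustration with imbalanced-learn on a synthetic 9:1 dataset.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=42)
print("before:", Counter(y))   # roughly 900 majority vs. 100 minority samples

# Generate synthetic minority samples along lines between k-nearest minority neighbors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))  # classes rebalanced to approximately 1:1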

2.3. Noise Reduction

There are two main types of noise in software defect prediction: noise existing in the dataset itself and noise introduced into the dataset by data pre-processing techniques such as oversampling. Reducing both kinds of noise has become an important area of research [49]. Some researchers perform noise reduction by adding filters after oversampling or by changing the sample generation method. Hu et al. [50] changed the strategy of selecting nearest neighbor samples, which produced a denoising effect (MSMOTE). Sáez et al. [51] combined IPF filtering with SMOTE (SMOTE-IPF). Koziarski et al. [52] cleaned up the decision boundary and guided the synthesis of oversampled samples (CCR). In 2012, Ramentol et al. [53] applied rough set theory to SMOTE (SMOTE-RSB). In 2016, Ramentol et al. [54] applied fuzzy rough set theory to SMOTE (SMOTE-FRST-2T). Khoshgoftaar et al. [55] applied filters to SDP for noise reduction. In 2017, Rivera [27] introduced propensity score matching (PSM) into balancing and noise reduction methods.

3. Materials and Methods

3.1. Framework

This section introduces the framework of US-PONR (Undersampling, Propensity Score Matching-Based Oversampling, and Noise Reduction), which is illustrated in Figure 1. US-PONR consists of three main steps: undersampling, PSM-based oversampling and noise reduction, and predictor construction. An overview of each step appears below, while a more in-depth discussion of each step follows in Section 3.2, Section 3.3 and Section 3.4.
In our method, after obtaining the defect datasets, the CK [6], process [7], and network metrics [11,13] were selected to characterize the source code samples, comprising 80 metrics in total. Table 1, Table 2 and Table 3 show some of the metrics used, while the entire meta dataset can be found at https://github.com/buaaSoftwareReliabilityGroup/US-PONR (accessed on 18 September 2023). The first step in US-PONR is undersampling (US). After the source code is represented using metrics, the software data are undersampled to pre-adjust the degree of imbalance in the dataset, which alleviates the class imbalance and overfitting problems by reducing the number of repeated non-defective samples in the dataset. At this point in the process, the undersampling ratio also needs to be determined. A more detailed account of undersampling in US-PONR is provided in Section 3.2.
The second step in US-PONR is PSM-based oversampling and noise reduction (PONR). Both of these tasks are performed using the PSM technique to obtain the final dataset. PONR is performed on the undersampled dataset by calculating the propensity score of each sample in the dataset and then by obtaining the nearest neighbor sample set of each sample based on propensity score matching. According to the distribution of the PSM-based nearest neighbor sample set, oversample synthesis is performed to balance the dataset. After oversampling, PSM is performed again to judge whether a sample is noise and to exclude noise samples. Section 3.3 covers the specific calculation for the propensity score of each sample and the detailed steps of PONR.
The third step in US-PONR is predictor construction. To obtain the defect predictor, an ensemble model aggregating multiple ML classifiers was trained using the cross-validation (CV) method, in order to select the most appropriate ML model for different datasets. Section 3.4 describes in detail how to perform cross-validation and aggregate the models.

3.2. Undersampling

In datasets, the number of non-defective samples often far exceeds the number of defective samples. These datasets typically include duplicate non-defective samples introduced by non-defective code files that remain unchanged across versions.
In such cases, the dataset needs to be undersampled in order to reduce the degree of class imbalance and to reduce the overfitting of the model caused by repeated samples.
The initial dataset can be expressed by the following formula:
$X_{\mathrm{origin}} = \{ (x_i, y_i) \mid i \in (1, n),\ y_i \in \{0, 1\},\ x_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,m}]^T \}$
where xi is the feature vector of the i-th sample, yi is the defect label of the i-th sample, n is the number of samples, and m is the feature dimension.
In the method we are proposing, undersampling is introduced to alleviate the imbalance in a dataset caused by duplicate codes in multiple versions. The process of undersampling is represented by Algorithm 1:
Algorithm 1: Undersampling
Input: original dataset X_origin, imbalance degree r_max = |y_negative| / |y_positive|, step s_us
Output: undersampled dataset collection X_us_all[ ][ ], containing each X_us with its r_us
1.  r_us: undersampling ratio, initialized to 1
2.  X_max, X_d, X_min: initialized to null
3.  for x_i in X_origin
4.      if y_i = 0
5.          if x_i in X_max
6.              then X_d ← x_i
7.          X_max ← x_i
8.          end if
9.      else
10.         X_min ← x_i
11.     end if
12. end for
13. do {
14.     r = (number of samples in X_max) / (number of samples in X_min)
15.     while r ≤ r_us
16.         X_max ← x_j (randomly selected from X_d)
17.     end while
18.     X_us = X_max + X_min
19.     X_us_all ← X_us and r_us
20.     r_us += s_us
21. } while (r_us ≤ r_max)
22. return X_us_all
where r_max is the imbalance degree of the dataset, X_max contains the samples with defect-free labels in X_origin, X_min contains the samples with defect labels in X_origin, X_d contains the duplicate defect-free samples, the undersampling parameter r_us is the ratio of non-defective samples to defective samples after undersampling, X_us is the dataset undersampled under a certain r_us, and s_us is the search step for the undersampling ratio.
The algorithm searches in steps over [1, r_max] and deletes the duplicated data in the dataset under each r_us. Setting different r_us values alleviates the class imbalance problem to varying degrees, but setting a small r_us leaves few samples and loses important data information. Therefore, it is necessary to optimize r_us. The optimization method for r_us is introduced in Section 3.4.
After undersampling, the undersampled dataset X_us_all is obtained, which contains the different r_us values and their corresponding undersampled datasets X_us.
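For illustration, a minimal Python sketch of this undersampling step is given below. It is not the authors' implementation: the pandas-based duplicate handling, the function names, and the strategy of adding duplicates back at random until the requested ratio is reached are assumptions based on the description above.

# Sketch of the undersampling step (Sec. 3.2), assuming a pandas DataFrame with
# feature columns and a binary "bug" label.
import numpy as np
import pandas as pd

def undersample(df: pd.DataFrame, r_us: float, label: str = "bug", seed: int = 42) -> pd.DataFrame:
    """Undersample clean samples so that |clean| / |defective| is approximately r_us."""
    rng = np.random.default_rng(seed)
    defective = df[df[label] == 1]
    clean = df[df[label] == 0]
    feature_cols = [c for c in df.columns if c != label]
    unique_clean = clean.drop_duplicates(subset=feature_cols)   # X_max (unique clean samples)
    duplicates = clean[clean.duplicated(subset=feature_cols)]   # X_d (duplicates from version iterations)
    # Add duplicates back at random until the clean-to-defective ratio reaches r_us.
    n_target = int(r_us * len(defective))
    n_extra = max(0, min(len(duplicates), n_target - len(unique_clean)))
    if n_extra > 0:
        extra = duplicates.loc[rng.choice(duplicates.index.to_numpy(), size=n_extra, replace=False)]
    else:
        extra = duplicates.iloc[0:0]
    return pd.concat([unique_clean, extra, defective])

def undersample_all(df: pd.DataFrame, r_max: float, step: float = 0.2, label: str = "bug") -> dict:
    """Return {r_us: undersampled dataset} for r_us in [1, r_max] (the X_us_all collection)."""
    return {round(r, 2): undersample(df, r, label) for r in np.arange(1.0, r_max + 1e-9, step)}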

3.3. PONR

After undersampling, a dataset with repeated samples removed is obtained. However, undersampling greatly reduces the sample size of the dataset. To keep the data volume sufficient, the dataset cannot be balanced by undersampling alone.
At present, the most popular dataset balancing methods in machine learning are SMOTE and its derivatives, but their synthetic data strategy creates the problems of overfitting and noise amplification. To overcome overfitting and dataset noise, propensity score matching (PSM) is introduced to generate synthetic samples and reduce dataset noise. The theoretical basis of PSM is to address the systematic deviation between the distribution of the observed sampling data and the overall distribution, using the propensity score to measure differences between high-dimensional samples in the feature space.
In our method, PSM is used for nearest neighbor search, which supports both the generation of synthetic samples and the reduction of dataset noise. Using the one-dimensional propensity score also reduces the complexity and cost of the algorithm: the dataset can be denoised and minority samples synthesized by oversampling the samples closest to the center of the minority class, thereby improving the quality of the software defect dataset.
This subsection calculates the propensity score of each sample in X_us by solving the weight vector of X_us and adds it as an additional feature to X_us, providing a basis for the subsequent PSM-based noise removal and oversampling process.
First, use the logistic function to characterize the distribution of data:
$f(x) = \dfrac{1}{1 + e^{-x}}$
Next, define an m-dimensional weight vector β (where m is the feature dimension) to characterize the similarity of samples. Also, define a constant β_0 and perform a minimal initialization, allowing each element to be a random decimal value close to 0 to ensure that $x_i^T \beta + \beta_0 \geq 0$.
Now, define the log-likelihood of the dataset as
$\ln L = \sum_{i=1}^{n} \left( y_i \ln\!\left(f(x_i^T \beta)\right) + (1 - y_i)\ln\!\left(1 - f(x_i^T \beta)\right) \right)$
Solve for the β that maximizes the log-likelihood of the dataset:
$\arg\max_{\beta} \ln L = \arg\max_{\beta} \sum_{i=1}^{n} \left( y_i \ln\!\left(f(x_i^T \beta)\right) + (1 - y_i)\ln\!\left(1 - f(x_i^T \beta)\right) \right)$
Substitute the obtained weight vector β and the feature vector of each sample into the logistic function to obtain the propensity score $f_\beta(x_i)$ of each sample:
$f_\beta(x_i) = \dfrac{1}{1 + e^{-(x_i^T \beta + \beta_0)}}$
Add the propensity score of each sample to X_us as the (m + 1)-th dimensional feature of the sample to obtain the propensity score dataset X_β. For each sample x_i, traverse the other samples in X_β to calculate the Euclidean distance between each pair of samples:
$d_{ij} = \left( \sum_{z=1}^{m+1} |x_{iz} - x_{jz}|^2 \right)^{1/2} \quad \text{iff}\ x_i \neq x_j\ \text{and}\ x_i, x_j \in X_\beta$
where x_iz is the z-th dimension feature of x_i. Then, find the k samples whose Euclidean distance to x_i is smallest to form the k-nearest-neighbor sample set XK_i of x_i:
$XK_i = \{ (x_{ij}, y_{ij}) \mid j \neq i,\ j \in (1, k+1),\ \min(d_{ij})\ \text{in}\ X_\beta \}$
Then, the data are oversampled; the number of synthetic samples to be generated is
$N_{\mathrm{new}} = r_{os} \times (\mathrm{num}_{y_i = 0} - \mathrm{num}_{y_i = 1}), \quad (x_i, y_i) \in X_{us}$
where the parameter r_os is the weight of the number of samples generated and is set to 1 by default. Traverse each defect sample x_i in the dataset, randomly select one defect sample x_ij from its XK_i, and randomly synthesize a new defect sample on the line between the two defect samples in the feature space by (9):
$x_{\mathrm{new}} = x_i + c(x_{ij} - x_i), \quad y_{\mathrm{new}} = 1, \quad x_{ij} \in XK_i$
where c represents a randomly generated constant between 0 and 1. Repeat the above process until the number of synthesized instances reaches N_new. Add all synthesized defective instances to X_β to obtain X_os. After oversampling, the dataset still needs noise reduction. For each x_i, count the number of different-class samples in XK_i by (10):
$\mathrm{diff} = \sum_{j=1}^{k} |y_i - y_{ij}|$
If diff ≥ the noise discrimination threshold t, consider x_i to be a noise sample and remove it from X_os. In this method, the nearest neighbor number parameter is set to k = 5 and the noise discrimination threshold to t = 3 for PONR; the value of k is set according to the density of the dataset used, and the value of t is set so that samples for which more than half of the k neighbors belong to a different class are treated as noise. Repeat the above traversal process until the entire X_os has been traversed without any noise being discriminated; the noise-reduced dataset X_final is then obtained. At this point, the complete data processing flow is finished.
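To make the PONR step concrete, the following Python sketch follows the notation above: propensity scores from a logistic regression, k-nearest neighbors on the score-augmented features, interpolation-based oversampling, and neighbor-vote noise removal in a single pass. It is an illustrative approximation, not the authors' code; the defaults k = 5, t = 3, and r_os = 1 come from the text, while the scikit-learn-based implementation details and the single-pass filter are assumptions.

# Sketch of PONR (Sec. 3.3) on NumPy arrays X (features) and y (binary labels).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def ponr(X: np.ndarray, y: np.ndarray, k: int = 5, t: int = 3, r_os: float = 1.0, seed: int = 42):
    rng = np.random.default_rng(seed)
    # Propensity score f_beta(x_i), appended as the (m+1)-th feature -> X_beta.
    ps = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    Xb = np.hstack([X, ps.reshape(-1, 1)])
    # Oversampling: interpolate between each defect sample and a random defect neighbor in XK_i.
    defect_idx = np.flatnonzero(y == 1)
    n_new = int(r_os * (np.sum(y == 0) - np.sum(y == 1)))
    synth = []
    if n_new > 0 and len(defect_idx) > 1:
        nn_def = NearestNeighbors(n_neighbors=min(k + 1, len(defect_idx))).fit(Xb[defect_idx])
        while len(synth) < n_new:
            i = rng.choice(defect_idx)
            neigh = nn_def.kneighbors(Xb[i][None, :], return_distance=False)[0][1:]
            j = defect_idx[rng.choice(neigh)]
            c = rng.random()
            synth.append(Xb[i] + c * (Xb[j] - Xb[i]))      # x_new = x_i + c (x_ij - x_i)
    X_os = np.vstack([Xb, np.array(synth)]) if synth else Xb
    y_os = np.concatenate([y, np.ones(len(synth), dtype=int)])
    # Noise reduction (single pass here; the paper repeats until no further noise is found):
    # a sample is treated as noise if at least t of its k nearest neighbors have a different label.
    neigh_all = NearestNeighbors(n_neighbors=k + 1).fit(X_os).kneighbors(X_os, return_distance=False)[:, 1:]
    diff = np.abs(y_os[neigh_all] - y_os[:, None]).sum(axis=1)
    keep = diff < t
    return X_os[keep], y_os[keep]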

3.4. Predictor Construction

The third step of the US-PONR method involves building a software defect predictor that uses a simple method of machine learning aggregation based on cross-validation (CV) scoring. It also provides a scheme for simple balanced parameter optimization.
In the SDP field, data distribution is significantly diverse: different software projects, different metrics, and different types of defects all exhibit different data distribution characteristics. Therefore, it is difficult for a single ML algorithm to comprehensively predict software defects. Inspired by the ensemble learning used in SDP [56], in this paper, multiple ML models are aggregated and trained through a CV scoring method to build an aggregated defect prediction model. The procedure is as follows:
First, set the parameter N_k of the StratifiedKFold random grouping and the model score error threshold T_dev. StratifiedKFold random grouping refers to randomly dividing the dataset into N_k groups while keeping the proportion of defective and non-defective samples in each group equal to that of the original dataset. The X_final obtained in the previous step is randomly grouped with StratifiedKFold to obtain a set of N_k dataset groups: {X_final,k | k = 1 … N_k}. In each loop, select one group of the dataset as X_validate and the remaining groups as X_train:
$\{ X_{\mathrm{validate}} = X_{\mathrm{final},i},\ X_{\mathrm{train}} = \mathrm{set}(X_{\mathrm{final},j}),\ i \in (1, N_k),\ j \neq i \in (1, N_k) \}$
Then, the CV-based algorithm shown in Algorithm 2 was used to predict results:
Algorithm 2: CV-based aggregation
Input: StratifiedKFold grouped dataset {X_final,k | k = 1 … N_k}, undersampling ratio r_us, Classifiers, model score error threshold T_dev
Output: defect prediction result, best undersampling ratio r_us_best
1.  for r_us in X_us_all:
2.      for X_final,k in {X_final,k | k = 1 … N_k}:
3.          X_train ← X_final,k
4.          X_validate ← {X_final,j | j ∈ (1, N_k), j ≠ k}
5.          for classifier_i in Classifiers
6.              if classifier_i needs hyperparameter or threshold optimization:
7.                  optimize the hyperparameters or threshold
8.              end if
9.              AUC_i ← train the predictor with X_train and evaluate it on X_validate
10.             if AUC_max < AUC_i:
11.                 AUC_max = AUC_i
12.             end if
13.             AUC_classifiers ← AUC_i
14.         end for
15.         AUC_max_Nk += AUC_max
16.         AUC_classifiers_Nk += each AUC_i summed separately in AUC_classifiers
17.     end for
18.     avg(AUC_max) = AUC_max_Nk / N_k; avg(AUC_classifiers) = AUC_classifiers_Nk / N_k
19.     record avg(AUC_max) and avg(AUC_classifiers) corresponding to each different r_us
20. end for
21. find the maximum of avg(AUC_max) calculated over the different r_us
22. r_us_best = the r_us with the maximum avg(AUC_max)
23. AUC_classifiers_best = the avg(AUC_classifiers) associated with the maximum avg(AUC_max)
24. for result in AUC_classifiers_best
25.     if |avg(AUC_max) − result| < T_dev
26.         add the classifier (with its prediction result) to C_CV
27. end for
28. defect prediction result = the best prediction result by the classifiers in C_CV
29. return the defect prediction result and r_us_best
where {X_final,k | k = 1 … N_k} is the dataset obtained by applying PONR and StratifiedKFold grouping to X_us, the undersampled dataset corresponding to the particular undersampling ratio r_us in X_us_all. Classifiers is the set of ML classifiers used in this method, shown in Table 4. AUC_max is the highest AUC among the results predicted by the Classifiers, AUC_classifiers is the AUC of each classifier in Classifiers, C_CV is the set of classifiers filtered by the CV scoring method, and r_us_best is the optimal parameter obtained by the CV-based aggregation model.
Before training the different ML classifiers on X_train, some classifiers need hyperparameter or threshold optimization to obtain their best performance. In this method, for RR and LAR, the RandomizedSearchCV algorithm is used for tuning. For KNN, the GridSearchCV algorithm is used to determine the best n_neighbors. For LiR, KR, and other classifiers that need threshold optimization, the initial threshold is set to the median of the prediction results; bisection is then used to iterate, and the threshold with the maximum score is selected as the optimal threshold. The classifiers are trained to obtain ML models, which are then used to predict X_validate, and the results are scored using AUC. AUC indicates the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample by the trained classifier. The formula for AUC is
$\mathrm{AUC} = \dfrac{\sum_{i \in \mathrm{positiveClass}} \mathrm{rank}_i - \frac{M(1+M)}{2}}{M \times N}$
where M and N are the numbers of positive and negative samples, respectively, and rank_i is the rank of sample i when all samples are sorted by predicted score.
Repeat the above operation until each group of data has served as X_validate once. Calculate the average AUC of each machine learning model over the N_k experiments, and take the model algorithm with the largest average AUC, avg(AUC_max), together with the model algorithms whose average AUC satisfies |AUC − avg(AUC_max)| ≤ T_dev, as the final selected algorithms in C_CV.
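As an illustrative sanity check (not part of the paper), the rank-based formula above can be verified against scikit-learn's roc_auc_score on a small synthetic example:

# Verify the rank-based AUC formula against scikit-learn (ties handled via average ranks).
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1])                   # M = 4 positives, N = 4 negatives
scores = np.array([0.9, 0.4, 0.7, 0.5, 0.3, 0.6, 0.2, 0.8])   # classifier scores

ranks = rankdata(scores)                                       # rank_i over all samples
M, N = np.sum(y_true == 1), np.sum(y_true == 0)
auc_rank = (ranks[y_true == 1].sum() - M * (1 + M) / 2) / (M * N)

assert np.isclose(auc_rank, roc_auc_score(y_true, scores))
print(auc_rank)   # 0.9375: one of the 16 positive-negative pairs is ranked incorrectly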
Due to the different class imbalance, data distribution, and other characteristics of different software datasets, the balance parameter of the undersampling process should be optimized while constructing the predictor. Here, a simple scheme is proposed to optimize r_us within the CV aggregation process described above. For a dataset with class imbalance ratio r, experiment with different r_us values using a given step size over [1, r]. Take the r_us with the highest avg(AUC_max) as r_us_best, as shown in Algorithm 2. Then, take r_us_best as the selected undersampling ratio to obtain X_final during the actual prediction.
Train the selected aggregated ML model using X_final to obtain the defect prediction model. Measure the metrics of the target software’s test set and then perform the prediction. Multiple ML models in the aggregated model were selected in the previous step, and their prediction results may differ, so the prediction with the highest AUC should be selected as the final prediction result. However, because it costs much less to test for a software defect than to repair it afterwards, the prediction results should include as many defective modules as possible so that more defects can be found during testing. In practice, the final defect prediction result should combine the defective modules predicted by each model.
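For illustration, the sketch below implements a simplified version of this CV-based aggregation with scikit-learn: several candidate classifiers are scored with stratified K-fold AUC, and every classifier within T_dev of the best average AUC is kept. The three candidate classifiers and their parameter grids are placeholders rather than the full 14-model setup of Table 4.

# Simplified CV-based aggregation: keep every classifier within t_dev of the best mean AUC.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def cv_aggregate(X: np.ndarray, y: np.ndarray, n_splits: int = 5, t_dev: float = 0.02, seed: int = 42):
    candidates = {
        "logreg": LogisticRegression(max_iter=1000),
        "rf": RandomForestClassifier(random_state=seed),
        # KNN tuned with GridSearchCV over n_neighbors, as described in the text.
        "knn": GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9]}, scoring="roc_auc"),
    }
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = {name: [] for name in candidates}
    for train_idx, val_idx in skf.split(X, y):
        for name, clf in candidates.items():
            clf.fit(X[train_idx], y[train_idx])
            proba = clf.predict_proba(X[val_idx])[:, 1]
            aucs[name].append(roc_auc_score(y[val_idx], proba))
    mean_auc = {name: float(np.mean(v)) for name, v in aucs.items()}
    best = max(mean_auc.values())
    # Keep every classifier whose average AUC is within t_dev of the best (the C_CV set).
    selected = [name for name, auc in mean_auc.items() if best - auc < t_dev]
    return selected, mean_auc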

4. Experiment

We conducted three experiments to answer the three research questions raised in Section 4.1 separately:
  • defect prediction in SDP datasets without noise (for RQ1);
  • defect prediction in SDP datasets in different noise environments (for RQ2);
  • validation of the proportion of labeled noisy samples in the dataset before and after the use of the data pre-processing method (for RQ3).
The datasets used in the different experiments are described in Section 4.2, the ML algorithms selected for the aggregated model and the experiment environment are shown in Section 4.3, and the SOTA methods compared with the proposed method in the different experiments are described in Section 4.4.

4.1. Research Questions

The experiment sought to answer the following three research questions:
  • RQ1: Is US-PONR effective in SDP datasets?
  • RQ2: Can US-PONR perform better compared to the benchmark methods in unbalanced SDP datasets with noise?
  • RQ3: Is US-PONR especially good at eliminating label noise samples?
RQ1 aims to validate the effectiveness of the proposed method in SDP datasets. RQ2 focuses on the effectiveness of US-PONR in noisy environments. And RQ3 further validates the effectiveness of the proposed method in noise reduction.

4.2. Datasets

4.2.1. PROMISE Public Dataset

To create the US-PONR method, we used the data of 24 versions of software from 11 different software projects in the PROMISE public dataset [57]. Table 5 provides an overview of these projects and their software, which represent the original dataset before any pre-processing. The selected datasets cover different scales of data and different degrees of class imbalance. The CK and process metrics used in the experiment were obtained directly from the dataset. The network metrics were obtained by converting the dataset into a network diagram through the ISEE platform developed by our laboratory and by integrating measurement elements proposed by Yang et al. [11] and by Zimmermann and Nagappan [13]. The PROMISE datasets, comprising software donations from NASA and other sources, have become one of the main public datasets in the SDP field, and they have been modified by researchers to mitigate various problems [58]. As Baljinder et al. pointed out, the quality of the PROMISE dataset is now better than that of the noise-reduced NASA dataset [59]. Therefore, this article assumes that the PROMISE dataset used is clean and noise-free.

4.2.2. Noise Dataset Generation

In order to generate datasets with label noise, 18 researchers in our lab were selected to re-generate labels for the PROMISE dataset. Labels that differed from the original PROMISE labels were then treated as label noise samples.
We separately replaced 10%, 20%, and 30% of the training set samples with noise samples, generating three training sets with different numbers of noise samples to test the performance of the model in different noise environments. Meanwhile, the validation and test sets were kept the same as in the PROMISE dataset. The generated noisy datasets were used to answer RQ2 and RQ3.
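For readers who want to reproduce a comparable setup, the snippet below shows one simple way to inject label noise at a given level by flipping training labels at random; this random flipping is only an approximation, since the noise in our study came from manual relabeling rather than random perturbation.

# Illustrative label-noise injection: flip a fixed fraction of training labels.
import numpy as np

def inject_label_noise(y_train: np.ndarray, noise_level: float, seed: int = 42):
    """Return a noisy copy of y_train plus the indices of the flipped samples."""
    rng = np.random.default_rng(seed)
    n_noisy = int(noise_level * len(y_train))
    flip_idx = rng.choice(len(y_train), size=n_noisy, replace=False)
    y_noisy = y_train.copy()
    y_noisy[flip_idx] = 1 - y_noisy[flip_idx]   # flip 0 <-> 1
    return y_noisy, flip_idx

# e.g., y_10, idx_10 = inject_label_noise(y_train, 0.10)  # 10% noise environment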

4.3. Experiment Settings

As shown in Table 4 above, 14 machine learning algorithms were selected as candidates for model aggregation. Python’s scikit-learn was then used for the algorithms that required hyperparameter optimization and feature selection.
The experiments were carried out using Ubuntu 20.04 and Python 3.8 on an Intel E5-2620 CPU with 32 GB of memory and an NVIDIA GeForce RTX 2080 with 16 GB of RAM.

4.4. Comparison Methods

For RQ1, the SOTA oversampling method SMOTUNED [36] was used to verify the effectiveness of US-PONR in non-noisy SDP datasets.
For RQ2 and RQ3, to compare US-PONR with current mainstream methods of oversampling and noise removal, 12 methods were selected to join US-PONR in the noise experiment following undersampling, including SOMO [46], MWMOTE [48], SMOTE-IPF [51], SMOTE-RSB [53], SMOTE-FRST-2T [54], SMOTE [20], SMOTE-TomekLinks [42], Borderline-SMOTE [43], ADASYN [44], MSMOTE [50], Gaussian-SMOTE [47], and CCR [52].
Scikit-learn and the open-source library “smote_variants” [60] on GitHub were used to implement these methods.

5. Results

5.1. Answer to RQ1: Is US-PONR Effective in SDP Datasets?

In order to answer RQ1, three small experiments were conducted:
  • Determining the optimization parameter.
  • Comparing the results of US-PONR with the results of just US (undersampling) or PONR alone.
  • Comparing US-PONR with the SOTA data pre-processing method.
The first experiment determined the optimal parameter of each dataset. These parameters were then used for the other two experiments. The second experiment sought to prove the necessity of using both US and PONR in the proposed method. The third experiment sought to validate the effectiveness of US-PONR. The dataset used in these experiments was the original PROMISE dataset without any introduced noise (shown in Table 5).

5.1.1. Determining the Optimization Parameter

To obtain the optimal parameter, different steps were taken to process undersampling on the selected datasets; step size was set at 1 between [1, r], where r is the degree of imbalance in the dataset. We trained the CV-based aggregation model under different conditions and then compared with the average AUC obtained by cross-validation; the undersampling ratio with the largest average AUC was selected as the optimal parameter.
When conducting this experiment, we discovered that the search interval was usually too large, so we recommend first using step = 1 to locate the interval where the optimal parameter may lie and then using step = 0.2 to find the optimal parameter within it. Our test results indicate that the value intervals should be set as follows: ant [3, 4], camel [3, 3.9], poi [5, 6], JDT [2.6, 3.4], synapse [2, 3], jedit [3.6, 4.4], velocity [1, 1.8], xerces [4.4, 5.2], log4j [1.6, 2.4], PDE [4.6, 5.4], and mylyn [4.6, 5.4].
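The coarse-to-fine search can be sketched as follows; evaluate_r_us is a hypothetical callback standing in for "pre-process the dataset at this ratio, train the CV aggregated model, and return its average AUC", and the fine interval of plus/minus one coarse step is an assumption made for illustration.

# Coarse-to-fine search for the undersampling ratio r_us (Sec. 5.1.1), illustrative only.
import numpy as np

def coarse_to_fine_search(evaluate_r_us, r_max: float, coarse_step: float = 1.0, fine_step: float = 0.2) -> float:
    # Coarse pass: step = 1 over [1, r_max] to locate a promising interval.
    coarse = np.arange(1.0, r_max + 1e-9, coarse_step)
    best_coarse = max(coarse, key=evaluate_r_us)
    lo, hi = max(1.0, best_coarse - coarse_step), min(r_max, best_coarse + coarse_step)
    # Fine pass: step = 0.2 inside the promising interval.
    fine = np.arange(lo, hi + 1e-9, fine_step)
    return float(max(fine, key=evaluate_r_us))

# e.g., r_us_best = coarse_to_fine_search(lambda r: avg_auc_for_ratio(r), r_max=6.0),
# where avg_auc_for_ratio is a hypothetical evaluation function.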
The results for parameter optimization are shown in Table 6:
These results show that the optimal parameters of each dataset are as follows: ant (3.8), camel (3.9), JDT (3.2), jedit (4.2), log4j (2.2), mylyn (5.4), PDE (5), poi (5.8), synapse (2.4), velocity (1.8), and xerces (4.8).

5.1.2. Comparing US-PONR with US and PONR

After the best parameters were obtained for each dataset, the datasets pre-processed with US-PONR, with only US, or with only PONR were each used to train a CV-based aggregation model, and the AUC of each of the three methods was compared as the output result.
Figure 2 shows how the AUC of US-PONR under the optimal parameters compared with the AUC of US and PONR, individually. US-PONR performed better than both US and PONR in most datasets but performed less well than PONR alone in the ant, log4j, and synapse datasets. The reason for the latter results may be that some important features of the dataset were deleted during undersampling, and the problem was then magnified during oversampling. Overall, the results show that US-PONR is a significant improvement over either US or PONR alone.

5.1.3. Comparing US-PONR with the SOTA Data Pre-Processing Method—SMOTUNED

Table 7 shows how US-PONR compared with SMOTUNED, the alternative SOTA data pre-processing method. The results show that each of the two methods achieved the higher AUC on some datasets and the lower AUC on others; nonetheless, SMOTUNED spent a great deal more time on each dataset than US-PONR. The reason for the latter result is that SMOTUNED carried out many iterations to obtain the best parameter value, and the larger the dataset, the more time it spent. US-PONR obtained the best parameter value by interval search, which greatly reduced its processing time.

5.2. Answer to RQ2: Can US-PONR Perform Better Compared to the Benchmark Methods in Unbalanced SDP Datasets with Noise?

The purpose of this experiment was to explore how well US-PONR predicts defects in unbalanced datasets that contain different proportions of label noise. The proposed method was compared with other oversampling and filtering methods after undersampling.
In order to test datasets containing noise samples of different proportions, training datasets were generated with 10%, 20%, and 30% noise samples added to each set (see Section 4.2.2). Data pre-processing was then performed on these datasets to train the aggregate SDP model.
A total of 12 methods of oversampling and noise reduction were tested, alongside our proposed method, on the datasets under the three noise levels. Following the same experiment settings used to optimize dataset parameters (Section 5.1.1), the settings of each dataset were as follows (for 10%, 20%, 30% noise): ant (1.6, 1.6, 1.4), camel (1.4, 1.4, 1.4), JDT (2.8, 1.5, 1.5), jedit (2, 2, 2), log4j (1.2, 1.2, 1.2), mylyn (3.5, 2.7, 2), PDE (2, 2, 2.2), poi (1.8, 1.8, 1.2), synapse (2.2, 1.8, 1.2), velocity (1.6, 1.2, 1.2), and xerces (2, 2, 2).
Table 8 shows the AUC for each method when applied to every dataset. The results indicate that no matter which dataset or method was used, the introduction of noise always reduced a method’s prediction performance. However, under most noise ratios of most datasets, the prediction performance of US-PONR was the best, while its decrease in performance as the noise ratio rose was minimal. When the US-PONR method was used on the poi dataset, the score at the low noise level was lower than the score at the high noise level. The reason for this anomaly is that the undersampling step size was set too large: to save computation time, the step size was set to 0.2. For the same dataset at different noise levels, the high noise level dataset therefore appears to have reached the optimum while the low noise level dataset has not.
The experiment results show that at the 30% noise level, our proposed method had a very significant advantage over all other methods for most datasets, especially over Gaussian-SMOTE and CCR (which outperform US-PONR at low noise levels in some datasets). This suggests that US-PONR is the best overall method in a noisy environment, except on the mylyn and PDE datasets. The reason for the method’s poorer performance on these datasets may be that they have heavy-tailed distributions (such as the Lévy distribution), which make it impossible to distinguish good from bad samples in the feature space using PSM, leading to improper oversampling or noise reduction. The reason for US-PONR’s superior performance lies in the way it reduces noise using the PSM score, which can effectively capture noise samples.

5.3. Answer to RQ3: Is US-PONR Especially Good at Eliminating Label Noise Samples?

In order to verify the effectiveness of US-PONR in eliminating label noise, we designed an experiment to discover how many introduced noise samples were found and eliminated by each method in the 10%, 20%, and 30% noise environments.
We marked each introduced label noise sample and then checked the residual noise ratio in the datasets after using each method. The residual noise ratio is the number of the remaining marked noise samples divided by the total number of samples after using the data processing method. The residual noise ratio in the dataset is used to verify the effectiveness of the noise reduction method: the smaller the proportion of remaining marked noise samples in the datasets, the more effective the noise-reduction method. The setting of dataset parameters in this experiment was the same as that used to test RQ2.
Table 9 shows the proportion of residual marked label noise samples after using each method in the 10%, 20%, and 30% noise environments. For example, in the ant dataset at the 10% noise level, the residual noise sample ratio recorded for US-PONR was 1.77%, which means that US-PONR identified and removed noise samples amounting to 8.23% of the dataset. The table shows that in each dataset and at each noise level, the residual noise ratio was the lowest after using US-PONR (maximum: −22.82% in Xerces at the 30% noise level; minimum: −15.4% in Velocity at the 30% noise level), which proves that the proposed method is effective in removing noise samples and improving the quality of datasets.

6. Threats to Validity

6.1. Datasets

In our SDP experiment, we tested US-PONR on 24 versions of software data from 11 projects in the PROMISE dataset, and our proposed method performed admirably. However, before being put into practice, US-PONR needs to be tested against actual commercial projects. Moving forward, we plan to apply this method to some commercial projects to verify its validity.
Although our team members manually labeled the samples to simulate as much as possible the process of generating label noise in real data (see Section 4.2.2), our test methodology inevitably brought some artificial features to the dataset. In the future, US-PONR’s effectiveness will also need to be verified in other noisy datasets.
Finally, the CK, process, and network metrics were chosen to be used in the experiment as feature vectors for representing software defects. There are other metrics that can be used to characterize software defects, such as the Hausdorff metric [61]. Validation of the method against these metrics also needs to be carried out in future work.

6.2. Model Hyperparameters

US-PONR and other methods for detecting software defects have difficulty controlling every variable and every model parameter, such as the n_neighbors k, the noise discrimination threshold t, or the weight of the number of samples generated. Therefore, in our experiment, we used the parameters recommended by their authors, and we performed no further tuning of them. Fine-tuning these parameters may improve their performance slightly.

6.3. Evaluation

In our study, AUC was used to evaluate the prediction performance of the models. Other indicators such as MCC were also calculated in the experiment, but due to the limited length of the article, these indicators are not fully addressed in the paper. Readers who are interested in detailed results should contact the authors by email.

7. Discussion and Conclusions

In the past two decades, software defect prediction (SDP) has developed rapidly as an important way to improve software quality and reliability. However, class imbalance and noise are often present in SDP datasets. With the continuous improvement of SDP algorithms based on machine learning, improving the quality of software defect data offers a further significant way to improve SDP performance. Oversampling methods such as SMOTE are among the most well-known ways to balance datasets, but they can introduce overfitting noise. Thus, this article attempts to solve the data imbalance in SDP datasets and to reduce the noise in them.
In this paper, US-PONR is proposed as a dataset pre-processing method to improve SDP dataset quality. To alleviate imbalance in a dataset, US-PONR first uses undersampling to remove redundant samples caused by version iterations, and then it uses PSM-based oversampling and noise reduction, treating the propensity score as an additional dimension of the feature vector, to rebalance the dataset and remove potential noise samples. A total of 24 versions of software data from 11 projects were used in experiments to test the proposed method. In an SDP experiment conducted in different noise environments, US-PONR demonstrated its ability to improve the quality of SDP datasets. And in experiments on removing label noise samples, the new method demonstrated that it can effectively identify label noise samples and remove them.
However, a few limitations related to the datasets still need to be addressed. For our data, the experiments only used the CK, process, and network metrics extracted from the PROMISE datasets. In the future, we hope to test our method against commercial datasets and more metrics that represent software defects. In addition, artificial characteristics may have been introduced when generating the noise datasets used in the experiments, so the ability of the method to identify and remove label noise samples needs further validation in real software projects.

Author Contributions

Conceptualization, H.S.; methodology, H.S.; software, H.S.; validation, H.S. and J.X.; formal analysis, H.S.; investigation, H.S. and J.X.; resources, J.X.; data curation, J.X.; writing—original draft preparation, H.S.; writing—review and editing, H.S. and J.L.; visualization, H.S.; supervision, J.A.; project administration, J.A.; funding acquisition, No. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The entire meta dataset can be found at https://github.com/buaaSoftwareReliabilityGroup/US-PONR (accessed on 18 September 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wong, W.E.; Li, X.; Laplante, P.A. Be more familiar with our enemies and pave the way forward: A review of the roles bugs played in software failures. J. Syst. Softw. 2017, 133, 68–94. [Google Scholar] [CrossRef]
  2. Wong, W.E.; Debroy, V.; Surampudi, A.; Kim, H.; Siok, M.F. Recent catastrophic accidents: Investigating how software was responsible. In Proceedings of the SSIRI 2010—4th IEEE International Conference on Secure Software Integration and Reliability Improvement, Singapore, 9–11 June 2010; pp. 14–22. [Google Scholar] [CrossRef]
  3. Aleem, S.; Capretz, L.F.; Ahmed, F. Benchmarking Machine Learning Techniques for Software Defect Detection. Int. J. Softw. Eng. Appl. 2015, 6, 11–23. [Google Scholar] [CrossRef]
  4. Alsaeedi, A.; Khan, M.Z. Software Defect Prediction Using Supervised Machine Learning and Ensemble Techniques: A Comparative Study. J. Softw. Eng. Appl. 2019, 12, 85–100. [Google Scholar] [CrossRef]
  5. Prasad, M.; Florence, L.F.; Arya, A. A Study on Software Metrics based Software Defect Prediction using Data Mining and Machine Learning Techniques. Int. J. Database Theory Appl. 2015, 8, 179–190. [Google Scholar] [CrossRef]
  6. Chidamber, S.; Kemerer, C.F. A Metric suite for object oriented design. IEEE Trans. Softw. Eng. 1994, 20, 476–493. [Google Scholar] [CrossRef]
  7. Nagappan, N.; Ball, T. Use of relative code churn measures to predict system defect density. In Proceedings of the 27th International Conference on Software Engineering, ICSE05, St. Louis, MO, USA, 15–21 May 2005; pp. 284–292. [Google Scholar] [CrossRef]
  8. Khoshgoftaar, T.; Allen, E.; Goel, N.; Nandi, A.; McMullan, J. Detection of software modules with high debug code churn in a very large legacy system. In Proceedings of the ISSRE ‘96: 7th International Symposium on Software Reliability Engineering, White Plains, NY, USA, 30 October–2 November 1996. [Google Scholar] [CrossRef]
  9. Nikora, A.P.; Munson, J.C. Developing fault predictors for evolving software systems. In Proceedings of the 5th International Workshop on Enterprise Networking and Computing in Healthcare Industry, Sydney, Australia, 5 September 2004. [Google Scholar] [CrossRef]
  10. Hassan, A.E. Predicting faults using the complexity of code changes. In Proceedings of the International Conference on Software Engineering, Vancouver, BC, Canada, 16–24 May 2009; pp. 78–88. [Google Scholar] [CrossRef]
  11. Yang, Y.; Ai, J.; Wang, F. Defect Prediction Based on the Characteristics of Multilayer Structure of Software Network. In Proceedings of the 2018 IEEE International Conference on Software Quality, Reliability, and Security Companion (QRS-C), Lisbon, Portugal, 16–20 July 2018; pp. 27–34. [Google Scholar] [CrossRef]
  12. Ai, J.; Su, W.; Zhang, S.; Yang, Y. A Software Network Model for Software Structure and Faults Distribution Analysis. IEEE Trans. Reliab. 2019, 68, 844–858. [Google Scholar] [CrossRef]
  13. Zimmermann, T.; Nagappan, N. Predicting defects using network analysis on dependency graphs. In Proceedings of the International Conference on Software Engineering, Leipzig, Germany, 10–18 May 2008; pp. 531–540. [Google Scholar] [CrossRef]
  14. Zhang, S.; Ai, J.; Li, X. Correlation between the Distribution of Software Bugs and Network Motifs. In Proceedings of the 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), Vienna, Austria, 1–3 August 2016. [Google Scholar] [CrossRef]
  15. Li, Y.; Wong, W.E.; Lee, S.-Y.; Wotawa, F. Using Tri-Relation Networks for Effective Software Fault-Proneness Prediction. IEEE Access 2019, 7, 63066–63080. [Google Scholar] [CrossRef]
  16. Yu, X.; Liu, J.; Keung, J.W.; Li, Q.; Bennin, K.E.; Xu, Z.; Wang, J.; Cui, X. Improving Ranking-Oriented Defect Prediction Using a Cost-Sensitive Ranking SVM. IEEE Trans. Reliab. 2019, 69, 139–153. [Google Scholar] [CrossRef]
  17. Gong, L.; Jiang, S.; Jiang, L. Tackling Class Imbalance Problem in Software Defect Prediction through Cluster-Based Over-Sampling with Filtering. IEEE Access 2019, 7, 145725–145737. [Google Scholar] [CrossRef]
  18. Zhang, X.; Song, Q.; Wang, G. A dissimilarity-based imbalance data classification algorithm. Appl. Intell. 2015, 42, 544–565. [Google Scholar] [CrossRef]
  19. Zhou, L. Performance of corporate bankruptcy prediction models on imbalanced dataset: The effect of sampling methods. Knowl. Based Syst. 2013, 41, 16–25. [Google Scholar] [CrossRef]
  20. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  21. Bennin, K.E.; Keung, J.; Phannachitta, P.; Monden, A.; Mensah, S. Mahakil: Diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction. IEEE Trans. Softw. Eng. 2017, 44, 534–550. [Google Scholar] [CrossRef]
  22. Riquelme, J.C.; Ruiz, R.; Rodríguez, D.; Moreno, J. Finding defective modules from highly unbalanced datasets. Actas De Los Talleres Las Jorn. Ing. Del Softw. Bases Datos 2008, 2, 67–74. [Google Scholar]
  23. Pandey; Kumar, S.; Tripathi, A.K. An empirical study toward dealing with noise and class imbalance issues in software defect prediction. Soft Comput. 2021, 25, 13465–13492. [Google Scholar] [CrossRef]
  24. Li, Z.; Jing, X.-Y.; Zhu, X. Progress on approaches to software defect prediction. IET Softw. 2018, 12, 161–175. [Google Scholar] [CrossRef]
  25. Kim, H.; Just, S.; Zeller, A. It’s not a bug, it’s a feature: How misclassification impacts bug prediction. In Proceedings of the 2013 35th International Conference on Software Engineering (ICSE), San Francisco, CA, USA, 18–26 May 2013. [Google Scholar]
  26. Kim, H.; Just, S.; Zeller, A. The impact of tangled code changes on defect prediction models. Empir. Softw. Eng. 2016, 21, 303–336. [Google Scholar]
  27. Rivera, W.A. Noise Reduction A Priori Synthetic Over-Sampling for class imbalanced data sets. Inf. Sci. 2017, 408, 146–161. [Google Scholar] [CrossRef]
  28. Song, Q.; Jia, Z.; Shepperd, M.; Ying, S.; Liu, J. A general software defect-proneness prediction framework. IEEE Trans. Softw. Eng. 2011, 37, 356–370. [Google Scholar] [CrossRef]
  29. Jin, C. Software defect prediction model based on distance metric learning. Soft Comput. 2021, 25, 447–461. [Google Scholar] [CrossRef]
  30. Goyal, S. Effective software defect prediction using support vector machines (SVMs). Int. J. Syst. Assur. Eng. Manag. 2022, 13, 681–696. [Google Scholar] [CrossRef]
  31. Xu, J.; Ai, J.; Liu, J.; Shi, T. ACGDP: An Augmented Code Graph-Based System for Software Defect Prediction. IEEE Trans. Reliab. 2022, 71, 850–864. [Google Scholar] [CrossRef]
  32. Hanif, H.; Maffeis, S. Vulberta: Simplified source code pre-training for vulnerability detection. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–8. [Google Scholar]
  33. Weyuker, E.J.; Ostrand, T.J.; Bell, R.M. Do too many cooks spoil the broth? Using the number of developers to enhance defect prediction models. Empir. Softw. Eng. 2008, 13, 539–559. [Google Scholar] [CrossRef]
  34. Guzmán-Ponce, A.; Sánchez, J.S.; Valdovinos, R.M.; Marcial-Romero, J.R. DBIG-US: A two-stage under-sampling algorithm to face the class imbalance problem. Expert Syst. Appl. 2021, 168, 114301. [Google Scholar] [CrossRef]
  35. Tax, D.M.J. One-Class Classification: Concept Learning in the Absence of Counter-Examples; Netherlands Participating Organizations: Leidschendam, The Netherlands, 2002; p. 584. [Google Scholar]
  36. Agrawal, A.; Menzies, T. Is ‘better data’ better than ‘better data miners’?: On the benefits of tuning SMOTE for defect prediction. In Proceedings of the International Conference on Software Engineering, Gothenburg, Sweden, 27 May–3 June 2018; pp. 1050–1061. [Google Scholar] [CrossRef]
  37. Feng, S.; Keung, J.; Yu, X.; Xiao, Y.; Bennin, K.E.; Kabir, A.; Zhang, M. COSTE: Complexity-based OverSampling TEchnique to alleviate the class imbalance problem in software defect prediction. Inf. Softw. Technol. 2021, 129, 106432. [Google Scholar] [CrossRef]
  38. Ochal, M.; Patacchiola, M.; Vazquez, J.; Storkey, A.; Wang, S. Few-shot learning with class imbalance. IEEE Trans. Artif. Intell. 2023. [Google Scholar] [CrossRef]
  39. Bennin, K.E.; Keung, J.; Phannachitta, P.; Mensah, S. The significant effects of data sampling approaches on software defect prioritization and classification. In Proceedings of the 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), Toronto, ON, Canada, 9–10 November 2017. [Google Scholar]
  40. Feng, S.; Keung, J.; Yu, X.; Xiao, Y.; Zhang, M. Investigation on the stability of SMOTE-based oversampling techniques in software defect prediction. Inf. Softw. Technol. 2021, 139, 106662. [Google Scholar] [CrossRef]
  41. Soltanzadeh, P.; Hashemzadeh, M. RCSMOTE: Range-Controlled synthetic minority over-sampling technique for handling the class imbalance problem. Inf. Sci. 2021, 542, 92–111. [Google Scholar] [CrossRef]
  42. Batista, G.E.A.P.A.; Prati, R.C.; Monard, M.C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29. [Google Scholar] [CrossRef]
  43. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. Lect. Notes Comput. Sci. 2005, 3644, 878–887. [Google Scholar] [CrossRef]
  44. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the International Joint Conference on Neural Networks, Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar] [CrossRef]
  45. Douzas, G.; Bacao, F.; Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf. Sci. 2018, 465, 1–20. [Google Scholar] [CrossRef]
  46. Douzas, G.; Bacao, F. Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning. Expert Syst. Appl. 2017, 82, 40–52. [Google Scholar] [CrossRef]
  47. Lee, H.; Kim, J.; Kim, S. Gaussian-based SMOTE algorithm for solving skewed class distributions. Int. J. Fuzzy Log. Intell. Syst. 2017, 17, 229–234. [Google Scholar] [CrossRef]
  48. Barua, S.; Islam, M.; Yao, X.; Murase, K. MWMOTE—Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 2014, 26, 405–425. [Google Scholar] [CrossRef]
  49. Ahluwalia, A.; Falessi, D.; Di Penta, M. Snoring: A noise in defect prediction datasets. In Proceedings of the IEEE International Working Conference on Mining Software Repositories, Montreal, QC, Canada, 25–31 May 2019; pp. 63–67. [Google Scholar] [CrossRef]
  50. Hu, S.; Liang, Y.; Ma, L.; He, Y. MSMOTE: Improving classification performance when training data is imbalanced. In Proceedings of the 2nd International Workshop on Computer Science and Engineering, WCSE 2009, Qingdao, China, 28–30 October 2009; Volume 2, pp. 13–17. [Google Scholar] [CrossRef]
  51. Sáez, J.A.; Luengo, J.; Stefanowski, J.; Herrera, F. SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf. Sci. 2015, 291, 184–203. [Google Scholar] [CrossRef]
  52. Koziarski, M.; Wożniak, M. CCR: A combined cleaning and resampling algorithm for imbalanced data classification. Int. J. Appl. Math. Comput. Sci. 2017, 27, 727–736. [Google Scholar] [CrossRef]
  53. Ramentol, E.; Caballero, Y.; Bello, R.; Herrera, F. SMOTE-RSB *: A hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl. Inf. Syst 2012, 33, 245–265. [Google Scholar] [CrossRef]
  54. Ramentol, E.; Gondres, I.; Lajes, S.; Bello, R.; Caballero, Y.; Cornelis, C.; Herrera, F. Fuzzy-rough imbalanced learning for the diagnosis of High Voltage Circuit Breaker maintenance: The SMOTE-FRST-2T algorithm. Eng. Appl. Artif. Intell. 2016, 48, 134–139. [Google Scholar] [CrossRef]
  55. Khoshgoftaar, T.M.; Rebours, P. Improving software quality prediction by noise filtering techniques. J. Comput. Sci. Technol. 2007, 22, 387–396. [Google Scholar] [CrossRef]
  56. Matloob, F.; Ghazal, T.M.; Taleb, N.; Aftab, S.; Ahmad, M.; Abbas, S.; Khan, M.A.; Soomro, T.R. Software defect prediction using ensemble learning: A systematic literature review. IEEE Access 2021, 9, 98754–98771. [Google Scholar] [CrossRef]
  57. Menzies, T.; Caglayan, B.; Kocaguneli, E.; Krall, J.; Peters, F.; Turhan, B. The Promise Repository of Empirical Software Engineering Data. Available online: http://promise.site.uottawa.ca/SERepository/ (accessed on 31 December 2007).
  58. Cheikhi, L.; Abran, A. PROMISE and ISBSG software engineering data repositories: A survey. In Proceedings of the Joint Conference of the 23rd International Workshop on Software Measurement and the 8th International Conference on Software Process and Product Measurement, IWSM-MENSURA 2013, Ankara, Turkey, 23–26 October 2013; pp. 17–24. [Google Scholar] [CrossRef]
  59. Ghotra, B.; McIntosh, S.; Hassan, A.E. Revisiting the impact of classification techniques on the performance of defect prediction models. In Proceedings of the International Conference on Software Engineering, Florence, Italy, 16–24 May 2015; Volume 1, pp. 789–800. [Google Scholar] [CrossRef]
  60. Kovács, G. Smote-variants: A python implementation of 85 minority oversampling techniques. Neurocomputing 2019, 366, 352–354. [Google Scholar] [CrossRef]
  61. Kyurkchiev, N.; Markov, S. On the Hausdorff distance between the Heaviside step function and Verhulst logistic function. J. Math. Chem. 2016, 54, 109–119. [Google Scholar] [CrossRef]
Figure 1. Framework of US-PONR.
Figure 2. The AUC of prediction results using datasets pre-processed by US-PONR, by US only, or by PONR only.
Table 1. CK metrics.
Name | Description
WMC | Weighted methods per class
DIT | Depth of inheritance tree
NOC | Number of children
CBO | Coupling between object classes
RFC | Response for a class
LCOM | Lack of cohesion in methods
LOC | Lines of code
Table 2. Process metrics.
Name | Description
REVISIONS | Number of revisions of a file.
AUTHORS | Number of distinct authors that checked a file into the repository.
LOC_ADDED | Sum over all revisions of the lines of code added to a file.
MAX_LOC_ADDED | Maximum number of lines of code added for all revisions.
AVE_LOC_ADDED | Average lines of code added per revision.
LOC_DELETED | Sum over all revisions of the lines of code deleted from a file.
MAX_LOC_DELETED | Maximum number of lines of code deleted for all revisions.
AVE_LOC_DELETED | Average lines of code deleted per revision.
CODECHURN | Sum of (added lines of code - deleted lines of code) over all revisions.
MAX_CODECHURN | Maximum CODECHURN for all revisions.
AVE_CODECHURN | Average CODECHURN per revision.
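As an illustration of how the revision-based metrics in Table 2 aggregate per-revision line counts, the following sketch computes them for a single file. It is illustrative only and is not the tooling used in this study; the per-revision added/deleted line counts and author names are assumed to be already extracted from a version-control log.

```python
# Illustrative sketch (not the study's tooling): aggregate the Table 2 process
# metrics from per-revision (lines_added, lines_deleted) pairs for one file.
def process_metrics(revisions, authors):
    """revisions: list of (lines_added, lines_deleted) per revision of a file;
    authors: list of author names, one per revision (hypothetical inputs)."""
    added = [a for a, _ in revisions]
    deleted = [d for _, d in revisions]
    churn = [a - d for a, d in revisions]  # per-revision CODECHURN contribution
    n = len(revisions)
    return {
        "REVISIONS": n,
        "AUTHORS": len(set(authors)),
        "LOC_ADDED": sum(added),
        "MAX_LOC_ADDED": max(added),
        "AVE_LOC_ADDED": sum(added) / n,
        "LOC_DELETED": sum(deleted),
        "MAX_LOC_DELETED": max(deleted),
        "AVE_LOC_DELETED": sum(deleted) / n,
        "CODECHURN": sum(churn),
        "MAX_CODECHURN": max(churn),
        "AVE_CODECHURN": sum(churn) / n,
    }

# Example: three revisions of one file by two authors
print(process_metrics([(120, 10), (35, 60), (5, 0)], ["alice", "bob", "alice"]))
```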
Table 3. Network metrics.
Name | Description
Funcount | The number of internal functions of the class node.
Indegree | The total number of connections from other nodes pointing to it.
Outdegree | The total number of connections it points to other nodes.
Insidelinks | The total number of connections within the internal functions of the node.
Out_degree_centrality | The fraction of nodes its outgoing edges are connected to.
In_degree_centrality | The fraction of nodes its incoming edges are connected to.
Degree_centrality | The fraction of nodes it is connected to.
Closeness_centrality | The reciprocal of the sum of the shortest path distances from v to all other nodes.
Betweenness_centrality | The sum of the fraction of all-pairs shortest paths that pass through node v.
Eccentricity | The maximum distance from v to all other nodes in G.
Communicability_centrality | A broader measure of connectivity, which assumes that information could flow along all possible paths between two nodes.
Katz_centrality | The relative influence of a node within a network.
Load_centrality | The fraction of all shortest paths that pass through that node.
PageRank | A ranking of the nodes in the graph G based on the structure of the incoming links.
Average_neighbor_degree | The average degree of the neighborhood of each node.
Number_of_cliques | The number of maximal cliques containing each node.
Core_number | The largest value k of a k-core containing that node.
Brokerage | The number of pairs of its neighbors that are not directly connected.
EffSize | Effective size of the node's ego network.
Constraint | Measures how strongly a module is constrained by its neighbors.
Hierarchy | Measures how the constraint measure is distributed across neighbors.
TwoStepReach | The percentage of nodes that are two steps away.
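Most of the node-level measures in Table 3 correspond to functions available in the networkx library. The sketch below only illustrates computing a subset of them on a small, hypothetical class-dependency graph; it does not reproduce the metric-extraction pipeline used in this study.

```python
# Illustrative sketch: a subset of the Table 3 network metrics computed with
# networkx on a small directed class-dependency graph (hypothetical input).
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "A"), ("A", "C"), ("D", "A")])

metrics = {
    "Indegree": dict(G.in_degree()),            # edges pointing to the node
    "Outdegree": dict(G.out_degree()),          # edges the node points to
    "In_degree_centrality": nx.in_degree_centrality(G),
    "Out_degree_centrality": nx.out_degree_centrality(G),
    "Degree_centrality": nx.degree_centrality(G),
    "Closeness_centrality": nx.closeness_centrality(G),
    "Betweenness_centrality": nx.betweenness_centrality(G),
    "Katz_centrality": nx.katz_centrality(G),
    "Load_centrality": nx.load_centrality(G),
    "PageRank": nx.pagerank(G),
    "Average_neighbor_degree": nx.average_neighbor_degree(G),
    "Core_number": nx.core_number(G),
}

for name, values in metrics.items():
    print(name, values)
```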
Table 4. ML algorithms.
Name | Description
LiR | Linear regression
RR | Ridge regression
LoR | Logistic regression
LDA | Linear discriminant analysis
QDA | Quadratic discriminant analysis
KR | Kernel ridge
SVC | C-support vector classification
SGDC | Linear classifier (SVM) trained with stochastic gradient descent
KNN | K-nearest neighbors vote classifier
GNB | Gaussian naïve Bayes
DT | Decision tree classifier
RF | Random forest classifier
ET | Extra trees classifier
AB | AdaBoost classifier
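All of the learners in Table 4 are available in scikit-learn. The following sketch shows one way the abbreviations could be mapped to estimators; the default hyperparameters are an assumption, as the table does not specify them.

```python
# Illustrative sketch: the Table 4 learners as scikit-learn estimators
# (default hyperparameters are an assumption; the table does not fix them).
from sklearn.linear_model import LinearRegression, Ridge, LogisticRegression, SGDClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier

models = {
    "LiR": LinearRegression(),
    "RR": Ridge(),
    "LoR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KR": KernelRidge(),
    "SVC": SVC(probability=True),
    "SGDC": SGDClassifier(),          # hinge loss by default, i.e., a linear SVM
    "KNN": KNeighborsClassifier(),
    "GNB": GaussianNB(),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "ET": ExtraTreesClassifier(),
    "AB": AdaBoostClassifier(),
}
```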
Table 5. Datasets.
Project | Versions | Total Instances | Defective Instances | Unbalance Ratio r
ant | 1.3, 1.4, 1.5, 1.6 | 947 | 184 | 4.147
camel | 1.0, 1.2, 1.4, 1.6 | 2784 | 562 | 3.954
poi | 2 | 314 | 37 | 7.486
synapse | 1.0, 1.1, 1.2 | 635 | 162 | 2.920
log4j | 1 | 138 | 34 | 2.971
jedit | 3.2, 4.0, 4.1, 4.2, 4.3 | 1749 | 303 | 4.772
PDE | 1 | 1497 | 209 | 6.163
JDT | 1 | 997 | 206 | 3.840
velocity | 1.6 | 229 | 78 | 1.936
xerces | 1.2, 1.3 | 893 | 140 | 5.379
mylyn | 1 | 1862 | 245 | 6.600
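The unbalance ratio r in Table 5 corresponds to the number of clean (non-defective) instances divided by the number of defective instances, r = (total instances - defective instances) / defective instances; for ant, for example, r = (947 - 184) / 184 ≈ 4.147.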
Table 6. Optimized undersampling ratio parameter (r_us) and the corresponding AUC for US-PONR.
Project | r_us | AUC
ant | 3.8 | 0.8218
camel | 3.9 | 0.8612
JDT | 3.4 | 0.9257
jedit | 4.4 | 0.9381
log4j | 2.2 | 0.8716
mylyn | 5.4 | 0.9116
PDE | 5 | 0.9292
poi | 5.8 | 0.9416
synapse | 2.4 | 0.7811
velocity | 1.9 | 0.7842
xerces | 4.8 | 0.9442
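Table 6 reports, per project, the undersampling ratio that yielded the best AUC. A generic way to tune such a ratio is a grid search over candidate values. The sketch below is not the procedure used by US-PONR; it only illustrates ratio tuning against cross-validated AUC, with random undersampling from imbalanced-learn, a random forest, and the candidate grid all used as hypothetical stand-ins.

```python
# Illustrative sketch (simplified; not the US-PONR procedure): grid-search an
# undersampling ratio r_us by cross-validated AUC on the resampled data.
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def tune_undersampling_ratio(X, y, candidate_ratios=np.arange(1.5, 6.0, 0.1)):
    """Return the clean-to-defective ratio with the best cross-validated AUC.
    Assumes numpy inputs with y coded 0 = clean, 1 = defective."""
    n_defective = int(np.sum(y == 1))
    n_clean = int(np.sum(y == 0))
    best_ratio, best_auc = None, -np.inf
    for r_us in candidate_ratios:
        if r_us > n_clean / n_defective:
            continue  # undersampling can only lower the clean-to-defective ratio
        # sampling_strategy is the minority/majority ratio after resampling,
        # so a clean-to-defective ratio r_us corresponds to 1 / r_us.
        sampler = RandomUnderSampler(sampling_strategy=1.0 / r_us, random_state=0)
        X_res, y_res = sampler.fit_resample(X, y)
        auc = cross_val_score(RandomForestClassifier(random_state=0),
                              X_res, y_res, cv=5, scoring="roc_auc").mean()
        if auc > best_auc:
            best_ratio, best_auc = round(float(r_us), 1), auc
    return best_ratio, best_auc
```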
Table 7. Comparison results of US-PONR and SMOTUNED in AUC and processing time.
Table 7. Comparison results of US-PONR and SMOTUNED in AUC and processing time.
AUCTime
US-PONRSMOTUNEDUS-PONRSMOTUNED
ant0.8218 0.889611.12 s378 s
camel0.8612 0.929037.63 s6364 s
JDT0.92570.8631 9.96 s585 s
jedit0.9381 0.944522.77 s1635 s
log4j0.8716 0.87661.04 s82 s
mylyn0.9116 0.9003 23.25 s1594 s
PDE0.92920.9061 17.52 s3029 s
poi0.94160.9346 2.95 s111 s
synapse0.7811 0.83155.71 s598 s
velocity0.7826 0.83991.64 s150 s
xerces0.94420.9325 9.09 s502 s
Table 8. AUC of each method for different datasets at 10%, 20%, and 30% noise levels.
Noise Level | US-PONR | SOMO | MWMOTE | SMOTE_IPF | SMOTE_RSB | SMOTE_FRST_2T | SMOTE | SMOTE_TomekLinks | Borderline_SMOTE2 | ADASYN | MSMOTE | Gaussian_SMOTE | CCR
ant
10% noise | 0.8676 | 0.8624 | 0.8571 | 0.8630 | 0.8630 | 0.8716 | 0.7355 | 0.7481 | 0.6997 | 0.7201 | 0.7338 | 0.7628 | 0.8355
20% noise | 0.7952 | 0.7012 | 0.7338 | 0.7287 | 0.7015 | 0.7080 | 0.6980 | 0.7290 | 0.6792 | 0.7133 | 0.7116 | 0.7457 | 0.7780
30% noise | 0.7717 | 0.7359 | 0.7461 | 0.7185 | 0.7438 | 0.7314 | 0.7461 | 0.7344 | 0.7008 | 0.6929 | 0.7185 | 0.6945 | 0.7618
camel
10% noise | 0.8522 | 0.8549 | 0.8508 | 0.8470 | 0.8528 | 0.8433 | 0.7653 | 0.7678 | 0.7443 | 0.7730 | 0.7353 | 0.7864 | 0.7523
20% noise | 0.7985 | 0.7354 | 0.7423 | 0.7610 | 0.7459 | 0.7112 | 0.7251 | 0.7678 | 0.7232 | 0.7366 | 0.7398 | 0.7264 | 0.7021
30% noise | 0.7774 | 0.7155 | 0.7366 | 0.7508 | 0.7208 | 0.6633 | 0.6990 | 0.7389 | 0.6849 | 0.7309 | 0.7105 | 0.7105 | 0.6809
JDT
10% noise | 0.8960 | 0.7035 | 0.8271 | 0.7648 | 0.7091 | 0.7596 | 0.7989 | 0.8139 | 0.7406 | 0.8004 | 0.7775 | 0.8766 | 0.8859
20% noise | 0.8166 | 0.6599 | 0.7896 | 0.7251 | 0.7402 | 0.7452 | 0.7993 | 0.7962 | 0.7220 | 0.7685 | 0.7540 | 0.8253 | 0.7100
30% noise | 0.8020 | 0.6447 | 0.7643 | 0.7181 | 0.7091 | 0.7201 | 0.7738 | 0.7858 | 0.7207 | 0.7758 | 0.7410 | 0.8027 | 0.7028
jedit
10% noise | 0.8865 | 0.8073 | 0.8829 | 0.8762 | 0.8364 | 0.8443 | 0.8720 | 0.8771 | 0.8437 | 0.8824 | 0.8308 | 0.9016 | 0.9061
20% noise | 0.8882 | 0.8169 | 0.8856 | 0.8479 | 0.8273 | 0.8173 | 0.8558 | 0.8730 | 0.8525 | 0.8724 | 0.8155 | 0.8720 | 0.8816
30% noise | 0.8804 | 0.8201 | 0.8622 | 0.8383 | 0.7960 | 0.8470 | 0.8350 | 0.8670 | 0.8086 | 0.8375 | 0.7995 | 0.8614 | 0.8589
log4j
10% noise | 0.8684 | 0.7838 | 0.8354 | 0.7949 | 0.7792 | 0.7838 | 0.7532 | 0.8519 | 0.8052 | 0.7586 | 0.7805 | 0.7805 | 0.7887
20% noise | 0.8108 | 0.8101 | 0.8205 | 0.7397 | 0.7632 | 0.7576 | 0.7342 | 0.8000 | 0.7442 | 0.7750 | 0.7317 | 0.7436 | 0.7619
30% noise | 0.7945 | 0.7619 | 0.7805 | 0.7536 | 0.7654 | 0.7805 | 0.7686 | 0.7632 | 0.7541 | 0.7394 | 0.7541 | 0.7541 | 0.7448
mylyn
10% noise | 0.8538 | 0.7403 | 0.7889 | 0.7314 | 0.6851 | 0.6731 | 0.8018 | 0.8067 | 0.7526 | 0.7632 | 0.7287 | 0.8688 | 0.8848
20% noise | 0.7542 | 0.6035 | 0.6871 | 0.6870 | 0.5676 | 0.6823 | 0.7504 | 0.7795 | 0.7277 | 0.7651 | 0.6651 | 0.8293 | 0.8472
30% noise | 0.7508 | 0.5664 | 0.7027 | 0.6454 | 0.5589 | 0.6546 | 0.6588 | 0.6254 | 0.6196 | 0.7123 | 0.6639 | 0.7699 | 0.8211
PDE
10% noise | 0.7800 | 0.7001 | 0.8040 | 0.7598 | 0.7373 | 0.6843 | 0.8148 | 0.8099 | 0.7282 | 0.7693 | 0.7168 | 0.8900 | 0.8862
20% noise | 0.7438 | 0.7211 | 0.7776 | 0.7251 | 0.7336 | 0.6602 | 0.7696 | 0.8044 | 0.7509 | 0.7891 | 0.7171 | 0.8529 | 0.8416
30% noise | 0.7282 | 0.7202 | 0.7522 | 0.7052 | 0.7144 | 0.6699 | 0.7642 | 0.7681 | 0.7402 | 0.7773 | 0.6932 | 0.7798 | 0.8035
poi
10% noise | 0.7259 | 0.7087 | 0.6812 | 0.6777 | 0.6557 | 0.7867 | 0.7717 | 0.7626 | 0.7134 | 0.6479 | 0.6667 | 0.7671 | 0.8056
20% noise | 0.6612 | 0.6825 | 0.7107 | 0.6614 | 0.5912 | 0.6897 | 0.7813 | 0.7143 | 0.7273 | 0.7412 | 0.6013 | 0.7432 | 0.7285
30% noise | 0.6947 | 0.6914 | 0.5882 | 0.6301 | 0.6739 | 0.6582 | 0.7647 | 0.6747 | 0.6452 | 0.6905 | 0.6897 | 0.6667 | 0.6410
synapse
10% noise | 0.8037 | 0.5630 | 0.7304 | 0.7009 | 0.6281 | 0.7139 | 0.7295 | 0.7358 | 0.7176 | 0.7586 | 0.7107 | 0.7715 | 0.7988
20% noise | 0.7866 | 0.6379 | 0.7361 | 0.6988 | 0.6490 | 0.6910 | 0.7330 | 0.6845 | 0.7077 | 0.7431 | 0.6833 | 0.7539 | 0.7818
30% noise | 0.7510 | 0.6140 | 0.7190 | 0.7335 | 0.6199 | 0.7224 | 0.6846 | 0.6792 | 0.6585 | 0.6732 | 0.6815 | 0.6254 | 0.6649
velocity
10% noise | 0.8496 | 0.7757 | 0.8333 | 0.7724 | 0.7628 | 0.7439 | 0.8171 | 0.7987 | 0.7642 | 0.7561 | 0.7561 | 0.8272 | 0.8067
20% noise | 0.8190 | 0.7686 | 0.7716 | 0.7371 | 0.7353 | 0.7672 | 0.7457 | 0.7959 | 0.7371 | 0.7802 | 0.7802 | 0.7489 | 0.7974
30% noise | 0.8163 | 0.7600 | 0.7908 | 0.7398 | 0.7245 | 0.7806 | 0.7398 | 0.7911 | 0.7500 | 0.7194 | 0.7653 | 0.7199 | 0.7449
xerces
10% noise | 0.8616 | 0.7928 | 0.8544 | 0.8087 | 0.7309 | 0.7814 | 0.8477 | 0.8542 | 0.8108 | 0.8298 | 0.8033 | 0.8793 | 0.8734
20% noise | 0.8535 | 0.7667 | 0.8242 | 0.7849 | 0.7330 | 0.7790 | 0.8316 | 0.8398 | 0.7993 | 0.8379 | 0.7824 | 0.8711 | 0.8700
30% noise | 0.8448 | 0.6484 | 0.7942 | 0.7928 | 0.7225 | 0.7647 | 0.8321 | 0.8318 | 0.7880 | 0.8100 | 0.7818 | 0.8303 | 0.8441
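Table 8 rests on an evaluation in which label noise is injected into the data at a fixed rate before rebalancing and classification. The sketch below illustrates one such evaluation loop; SMOTE from imbalanced-learn and a random forest are stand-ins, and the split, classifiers, and noise-injection details of the actual experiments are not reproduced here.

```python
# Illustrative sketch (assumed stand-ins: imbalanced-learn's SMOTE, a random
# forest): inject label noise at a given rate, rebalance the training split,
# and score AUC on the held-out split.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_under_label_noise(X, y, noise_level=0.10, seed=0):
    """Assumes numpy arrays with labels coded in {0, 1}."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    flip = rng.choice(len(y), size=int(noise_level * len(y)), replace=False)
    y_noisy[flip] = 1 - y_noisy[flip]      # flip labels of the selected samples

    # For simplicity, noise is injected before the train/test split.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y_noisy, test_size=0.3, stratify=y_noisy, random_state=seed)
    X_bal, y_bal = SMOTE(random_state=seed).fit_resample(X_tr, y_tr)

    clf = RandomForestClassifier(random_state=seed).fit(X_bal, y_bal)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```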
Table 9. Residual noise sample ratio of each method for different datasets at 10%, 20%, and 30% noise levels.
Noise Level | US-PONR | SOMO | MWMOTE | SMOTE_IPF | SMOTE_RSB | SMOTE_FRST_2T | SMOTE | SMOTE_TomekLinks | Borderline_SMOTE2 | ADASYN | MSMOTE | Gaussian_SMOTE | CCR
ant
10% noise | 0.0177 | 0.0995 | 0.0675 | 0.0670 | 0.0867 | 0.0662 | 0.0644 | 0.0600 | 0.0643 | 0.0644 | 0.0647 | 0.0643 | 0.0672
20% noise | 0.0435 | 0.1996 | 0.1479 | 0.1457 | 0.1679 | 0.1448 | 0.1417 | 0.1340 | 0.1415 | 0.1415 | 0.1408 | 0.1413 | 0.1530
30% noise | 0.0926 | 0.2844 | 0.2416 | 0.2407 | 0.2639 | 0.2405 | 0.2355 | 0.2180 | 0.2355 | 0.2339 | 0.2347 | 0.2354 | 0.2193
camel
10% noise | 0.0185 | 0.0996 | 0.0677 | 0.0650 | 0.0884 | 0.0685 | 0.0682 | 0.0597 | 0.0678 | 0.0682 | 0.0683 | 0.0675 | 0.0637
20% noise | 0.0496 | 0.1987 | 0.1470 | 0.1439 | 0.1894 | 0.1510 | 0.1480 | 0.1318 | 0.1476 | 0.1480 | 0.1499 | 0.1468 | 0.1468
30% noise | 0.1002 | 0.2998 | 0.2420 | 0.2390 | 0.2538 | 0.2512 | 0.2432 | 0.2222 | 0.2426 | 0.2406 | 0.2447 | 0.2418 | 0.2132
JDT
10% noise | 0.0181 | 0.0939 | 0.0682 | 0.0678 | 0.0985 | 0.0671 | 0.0682 | 0.0614 | 0.0677 | 0.0683 | 0.0682 | 0.0678 | 0.0650
20% noise | 0.0472 | 0.1890 | 0.1483 | 0.1475 | 0.1986 | 0.1475 | 0.1220 | 0.1317 | 0.1475 | 0.1478 | 0.1479 | 0.1481 | 0.1510
30% noise | 0.0929 | 0.2851 | 0.2435 | 0.2419 | 0.2982 | 0.2416 | 0.2431 | 0.2249 | 0.2425 | 0.2422 | 0.2422 | 0.2435 | 0.2170
jedit
10% noise | 0.0156 | 0.0998 | 0.0657 | 0.0653 | 0.0993 | 0.0665 | 0.0658 | 0.0580 | 0.0658 | 0.0658 | 0.0657 | 0.0659 | 0.0674
20% noise | 0.0399 | 0.1997 | 0.1434 | 0.1434 | 0.1773 | 0.1445 | 0.1438 | 0.1312 | 0.1440 | 0.1441 | 0.1452 | 0.1433 | 0.1531
30% noise | 0.0872 | 0.3000 | 0.2381 | 0.2369 | 0.2869 | 0.2396 | 0.2371 | 0.2164 | 0.2380 | 0.2375 | 0.2388 | 0.2378 | 0.2188
log4j
10% noise | 0.0198 | 0.0873 | 0.0750 | 0.0725 | 0.0879 | 0.0689 | 0.0737 | 0.0654 | 0.0729 | 0.0747 | 0.0703 | 0.0680 | 0.0829
20% noise | 0.0435 | 0.1996 | 0.1479 | 0.1457 | 0.1679 | 0.1448 | 0.1417 | 0.1340 | 0.1415 | 0.1415 | 0.1408 | 0.1413 | 0.1530
30% noise | 0.0926 | 0.2844 | 0.2416 | 0.2407 | 0.2639 | 0.2405 | 0.2355 | 0.2180 | 0.2355 | 0.2339 | 0.2347 | 0.2354 | 0.2193
mylyn
10% noise | 0.0119 | 0.0776 | 0.0634 | 0.0617 | 0.0995 | 0.0694 | 0.0630 | 0.0596 | 0.0633 | 0.0628 | 0.0631 | 0.0629 | 0.0612
20% noise | 0.0343 | 0.1632 | 0.1386 | 0.1369 | 0.1997 | 0.1531 | 0.1388 | 0.1289 | 0.1385 | 0.1384 | 0.1389 | 0.1383 | 0.1277
30% noise | 0.0782 | 0.2722 | 0.2317 | 0.2295 | 0.2988 | 0.2571 | 0.2326 | 0.2159 | 0.2321 | 0.2311 | 0.2329 | 0.2318 | 0.2205
PDE
10% noise | 0.0109 | 0.0999 | 0.0642 | 0.0626 | 0.0993 | 0.0624 | 0.0635 | 0.0592 | 0.0634 | 0.0633 | 0.0636 | 0.0634 | 0.0612
20% noise | 0.0331 | 0.1953 | 0.1403 | 0.1384 | 0.1989 | 0.1395 | 0.1398 | 0.1330 | 0.1395 | 0.1398 | 0.1398 | 0.1396 | 0.1387
30% noise | 0.0753 | 0.2956 | 0.2339 | 0.2312 | 0.2989 | 0.2339 | 0.2334 | 0.2216 | 0.2332 | 0.2326 | 0.2330 | 0.2326 | 0.2208
poi
10% noise | 0.0084 | 0.0963 | 0.0681 | 0.0572 | 0.0920 | 0.0639 | 0.0630 | 0.0551 | 0.0690 | 0.0632 | 0.0646 | 0.0595 | 0.0608
20% noise | 0.0284 | 0.1927 | 0.1414 | 0.1338 | 0.1921 | 0.1457 | 0.1385 | 0.1219 | 0.1432 | 0.1424 | 0.1379 | 0.1374 | 0.1295
30% noise | 0.0718 | 0.2898 | 0.2295 | 0.2167 | 0.2895 | 0.2492 | 0.2359 | 0.2143 | 0.2242 | 0.2306 | 0.2363 | 0.2329 | 0.2192
synapse
10% noise | 0.0263 | 0.0988 | 0.0715 | 0.0716 | 0.0938 | 0.0704 | 0.0722 | 0.0556 | 0.0718 | 0.0714 | 0.0727 | 0.0721 | 0.0756
20% noise | 0.0614 | 0.1954 | 0.1551 | 0.1526 | 0.1715 | 0.1539 | 0.1548 | 0.1375 | 0.1539 | 0.1547 | 0.1550 | 0.1540 | 0.1473
30% noise | 0.1148 | 0.2930 | 0.2505 | 0.2497 | 0.2704 | 0.2506 | 0.2493 | 0.2222 | 0.2503 | 0.2507 | 0.2498 | 0.2502 | 0.2838
velocity
10% noise | 0.0365 | 0.0964 | 0.0790 | 0.0791 | 0.0948 | 0.0784 | 0.0775 | 0.0662 | 0.0790 | 0.0791 | 0.0794 | 0.0775 | 0.0695
20% noise | 0.0885 | 0.1891 | 0.1685 | 0.1614 | 0.1927 | 0.1693 | 0.1694 | 0.1465 | 0.1669 | 0.1719 | 0.1642 | 0.1642 | 0.1925
30% noise | 0.1460 | 0.2933 | 0.2649 | 0.2595 | 0.2915 | 0.2743 | 0.2669 | 0.2470 | 0.2645 | 0.2665 | 0.2665 | 0.2627 | 0.3000
xerces
10% noise | 0.0134 | 0.0818 | 0.0649 | 0.0641 | 0.0924 | 0.0718 | 0.0643 | 0.0435 | 0.0645 | 0.0640 | 0.0647 | 0.0648 | 0.0634
20% noise | 0.0367 | 0.1670 | 0.1437 | 0.1409 | 0.1878 | 0.1641 | 0.1419 | 0.1062 | 0.1418 | 0.1420 | 0.1418 | 0.1417 | 0.1412
30% noise | 0.0718 | 0.2553 | 0.2357 | 0.2340 | 0.2724 | 0.2654 | 0.2349 | 0.1903 | 0.2356 | 0.2354 | 0.2352 | 0.2355 | 0.2222
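The residual noise sample ratio in Table 9 can be read as the share of injected noisy samples that survive pre-processing, relative to the size of the processed dataset. The sketch below shows one plausible way to measure it; the preprocess interface (returning the indices of retained original samples, with synthetic samples marked as -1) is hypothetical, and the exact accounting used in the study may differ.

```python
# Illustrative sketch (one plausible reading of the metric, not the study's
# exact accounting): flip labels at known indices, run a pre-processing method,
# then count how many flipped samples remain in the processed dataset.
# `preprocess` is a hypothetical callable returning (X_out, y_out, kept_indices),
# where kept_indices maps retained original samples to their original positions
# and synthetic samples are marked with -1.
import numpy as np

def residual_noise_ratio(preprocess, X, y, noise_level=0.10, seed=0):
    """Assumes numpy arrays with labels coded in {0, 1}."""
    rng = np.random.default_rng(seed)
    noisy_idx = rng.choice(len(y), size=int(noise_level * len(y)), replace=False)
    y_noisy = y.copy()
    y_noisy[noisy_idx] = 1 - y_noisy[noisy_idx]   # inject label noise

    X_out, y_out, kept_indices = preprocess(X, y_noisy)
    surviving_noise = np.intersect1d(kept_indices, noisy_idx).size
    return surviving_noise / len(y_out)
```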
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
