Article

Improving Software Defect Prediction in Noisy Imbalanced Datasets

1 School of Reliability and Systems Engineering, Beihang University, Beijing 100191, China
2 China Electronic Product Reliability and Environmental Testing Research Institute, Guangzhou 510610, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(18), 10466; https://doi.org/10.3390/app131810466
Submission received: 3 August 2023 / Revised: 8 September 2023 / Accepted: 15 September 2023 / Published: 19 September 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Software defect prediction is a popular method for optimizing software testing and improving software quality and reliability. However, software defect datasets usually have quality problems, such as class imbalance and data noise. Oversampling by generating minority class samples is one of the most well-known methods for improving the quality of such datasets; however, it often introduces overfitting noise. To better improve the quality of these datasets, this paper proposes a method called US-PONR, which uses undersampling to remove duplicate samples from version iterations and then uses oversampling through propensity score matching to reduce class imbalance and noise samples in datasets. The effectiveness of this method was validated in a software defect prediction experiment involving 24 versions of software data from 11 projects in PROMISE, in noisy environments that varied from 0% to 30% noise level. The experiments showed a significant improvement in the quality of datasets pre-processed by US-PONR for noisy imbalanced datasets, especially the noisiest ones, compared with 12 other advanced dataset processing methods. The experiments also demonstrated that the US-PONR method can effectively identify label noise samples and remove them.

1. Introduction

Defects in software systems often lead to software errors and losses [1,2]. With the development of software technology, the scale and complexity of software have grown rapidly, and software testing to discover and correct software defects has become increasingly expensive. To optimize software testing schemes, software defect prediction (SDP) is used to identify modules in software systems that may contain defects and to improve the efficiency and performance of software testing. In recent years, with the rapid development of machine learning technology, many researchers have applied machine learning to software defect prediction [3] and have achieved remarkable results [4,5]. For machine learning to predict software defects, software defect metrics first need to be defined in order to characterize the distribution of defects. At present, CK metrics [6], process metrics [7,8,9,10], and network metrics [11,12,13,14,15] are widely used for this task.
The quality of the software defect dataset is one of the most important factors that affect the performance of SDP. In real-world software projects, defects usually exist in very few modules, so defect samples in the software defect dataset are far outnumbered by clean samples, resulting in serious class imbalance problems and reducing the performance of defect prediction [16]. Many scholars have proposed methods to solve the class imbalance problem, but these rebalancing methods are still limited in their application and performance [17]. Most of the existing dataset rebalancing methods are based on undersampling and oversampling [18]. Undersampling can effectively solve the class imbalance problem, but the instances discarded during undersampling may contain information useful or important for predicting defects [19]. In SDP, oversampling is often more effective than undersampling, and minority class sample generation methods such as SMOTE and MAHAKIL [20,21] can effectively alleviate the class imbalance problem, but these oversampling methods generate overfitting noise that can degrade the predictive performance of the model [22].
Another problem with software defect datasets involves the way labels on software samples are manually assigned by testers or developers [23]. Samples may be erroneously labeled due to insufficient knowledge of the defects, the time difference between defect introduction and defect discovery, or new defects caused by defect correction [24]; this mislabeling introduces noise into the dataset, which deteriorates its quality and reduces the performance of defect prediction. At present, there are few studies on the noise of software defect datasets in the SDP field [25,26]. To reduce noise in the dataset, propensity score matching (PSM), a noise reduction method used in machine learning [27], may be a good candidate: it is a statistical method that reduces noise by addressing the difference between the distribution of the observed data and the overall distribution.
This paper proposes a novel method for predicting software defects using data pre-processing. Named US-PONR—for Undersampling, Propensity-Score-Matching-Based Oversampling and Noise Reduction—this method aims to improve the quality of software defect datasets by addressing deficiencies caused by class imbalance and dataset noise. The method uses undersampling in data pre-processing to remove duplicate samples that result from multiple versions of the data. It then uses propensity score matching (PSM) to oversample and reduce noise in the dataset, which alleviates the introduction of overfitting noise. To predict defects, an aggregated multi-classifier trained under a cross-validation (CV) scoring method was built to construct the predictor. We conducted experiments that used US-PONR to predict defects under different noise environments; the results show that the proposed method is better than the benchmark and SOTA data pre-processing methods, and they further demonstrate the ability of the proposed method to identify and remove label noise samples.
This paper demonstrates how US-PONR can make the following contributions to the field of software defect prediction:
(1)
US-PONR offers a new PSM-based method for data oversampling and noise reduction, which reduces the introduction of overfitting noise caused by minority class sample generation.
(2)
It also offers a new method of data pre-processing for software defect prediction that uses a combination of undersampling, oversampling, and noise reduction.
(3)
US-PONR additionally achieves SOTA performance in software defect prediction experiments under different noise environment settings, and the experiments demonstrate that the method can effectively identify label noise samples and remove them.
The rest of this paper is organized as follows: Section 2 presents background information about software defect prediction; Section 3 presents the methodology the authors used to develop US-PONR; Section 4 describes the experiment the authors conducted to test the method, including the research questions and experiment settings; Section 5 reports the experiment’s results; Section 6 discusses the method’s limitations; and Section 7 spells out major conclusions.

2. Background

2.1. Software Defect Prediction (SDP) and Metrics

SDP is an increasingly important subject in software reliability research. Using historical data to link software metrics and defects, SDP supports the designing and testing of software by determining the defect tendency of software modules [28]. At present, SDP mainly uses machine learning algorithms to make binary classification judgments on whether a module is defective, and its workflow covers several stages, including data acquisition, data processing, model training, and model evaluation. A major focus of data collection is characterizing software features through metrics.
The earliest software metrics applied to SDP measured software code and its complexity. In 1991, Chidamber and Kemerer proposed the famous CK metric suite [6], which became one of the standard metric tuples in the SDP field. From the perspective of object-oriented design, the CK metric tuple comprehensively considers the factors affecting software, such as the number of code lines, the degree of class cohesion, and the relationships between classes. Researchers have also proposed applying metrics of the code development process [7,8,9,10] to SDP in order to address the macro-integrity of a software program and how its elements interact. Researchers have further provided network metrics to measure code and thus build a software network that can characterize software features; many studies on network metrics have been conducted [11,12,13,14,15]. Later, Jin proposed a distance metric based on cost-sensitive learning to reduce class imbalance and better differentiate defective from clean samples [29].
Besides software metrics, many studies have applied machine learning to software defect prediction. Goyal removed clean samples surrounding defect samples using a filtering technique to enhance the performance of predicting software defects with SVMs [30]. Xu represented the source code as an augmented code property graph and trained a software defect predictor with a graph neural network [31]. Hanif trained a code pre-trained language model using defect datasets as the corpus for code defect domain tasks [32].

2.2. Class Imbalance

In machine learning, the quality of the dataset is one of the most important factors affecting performance. In the SDP field, dataset quality is mainly compromised by widespread serious imbalance and the introduction of noise caused by various factors.
Class imbalance is a key problem in machine learning and data mining [33]. It refers to an imbalance in the proportions of a dataset’s different classes of instances. Since imbalanced defect datasets reduce the model’s ability to learn defects [1], dataset rebalancing methods are often used to process the original dataset. There are various methods for rebalancing datasets.
One method is undersampling, which balances the dataset by reducing samples from the majority class. The simplest form of undersampling is random undersampling, which balances the dataset by randomly discarding samples in the majority class. Guzmán-Ponce et al. proposed a two-stage undersampling algorithm combining the DBSCAN clustering algorithm with a graph-based procedure to address class imbalance [34]. The disadvantage of undersampling is that it may cause the training data to lose important information from the majority class samples [19].
Another method for rebalancing datasets is oversampling, which balances a dataset by adding minority class samples. Random oversampling (ROS) randomly replicates minority class samples, but this can lead to severe overfitting. Most current research instead balances the dataset by generating minority class samples. This form of oversampling often develops sample generation strategies based on the common assumption in the machine learning community that closer instances are more similar than more distant ones [35]. One sample generation strategy, MAHAKIL, uses the Mahalanobis distance to generate samples [21], but this method does not work when the number of minority class instances is less than their dimensionality, as it generates overly diverse data that reduces the model’s ability to find defects [36]. Another strategy, COSTE, generates samples based on complexity [37], but further investigation is needed to validate its assumption that complex samples carry less information about defects. Ochal et al. discussed the impact of class imbalance in few-shot learning and pointed out the effectiveness of oversampling for rebalancing datasets [38].
The SMOTE algorithm proposed by Chawla et al. in 2002 is one of the most commonly used oversampling methods in academia [20]; it generates minority class samples using k-NN spatial distance. However, this method may increase the risk of overfitting and also increase the false positive rate of prediction results [39]. Researchers have proposed variants of SMOTE to minimize these drawbacks [40]. Soltanzadeh and Hashemzadeh proposed a range-controlled SMOTE to alleviate the increased overlap between classes around the class boundaries caused by SMOTE [41]. Batista et al. [42] proposed combining SMOTE with TomekLinks (SMOTE_TomekLinks) and with the edited nearest neighbor rule (SMOTE_ENN). Han et al. [43] proposed Borderline-SMOTE. He et al. [44] added different weights for different minority instances (ADASYN). Douzas et al. applied k-means clustering [45] and self-organizing maps [46] to the SMOTE method (kmeans-SMOTE and SOMO). Lee et al. [47] added Gaussian random variables to the SMOTE sample synthesis process (Gaussian-SMOTE). Barua et al. [48] weighted each minority sample based on its distance to the nearest majority sample (MWMOTE). Recently, Agrawal et al. [36] proposed an algorithm to automatically optimize the parameter combination in SMOTE: k = number of neighbors, m = number of synthetic examples to create, and r = power parameter for the Minkowski distance (SMOTUNED).
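As a minimal illustration (not part of the original study), the snippet below rebalances a synthetic imbalanced dataset with the imbalanced-learn implementation of SMOTE; the k_neighbors parameter corresponds to the k that SMOTUNED tunes automatically, and the synthetic data here merely stand in for a defect dataset.

# Minimal SMOTE illustration with imbalanced-learn on a synthetic 9:1 dataset.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=42)
print("before:", Counter(y))   # roughly 900 majority vs. 100 minority samples

# Generate synthetic minority samples along lines between k-nearest minority neighbors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))  # classes rebalanced to approximately 1:1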

2.3. Noise Reduction

There are two main types of noise in software defect prediction: noise existing in the dataset itself and noise introduced into the dataset by data pre-processing techniques such as oversampling. Reducing both kinds of noise has become an important area of research [49]. Some researchers perform noise reduction by adding filters after oversampling or by changing the sample generation method. Hu et al. [50] changed the strategy of selecting nearest neighbor samples, which produced a denoising effect (MSMOTE). Sáez et al. [51] combined IPF filtering with SMOTE (SMOTE-IPF). Koziarski et al. [52] cleaned up the decision boundary and guided the synthesis of oversampled samples (CCR). In 2012, Ramentol et al. [53] applied rough set theory to SMOTE (SMOTE-RSB). In 2016, Ramentol et al. [54] applied fuzzy rough set theory to SMOTE (SMOTE-FRST-2T). Khoshgoftaar et al. [55] applied filters to SDP for noise reduction. In 2017, Rivera [27] introduced propensity score matching (PSM) into balancing and noise reduction methods.

3. Materials and Methods

3.1. Framework

This section introduces the framework of US-PONR (Undersampling, Propensity Score Matching-Based Oversampling, and Noise Reduction), which is illustrated in Figure 1. US-PONR consists of three main steps: undersampling, PSM-based oversampling and noise reduction, and predictor construction. An overview of each step appears below, while a more in-depth discussion of each step follows in Section 3.2, Section 3.3 and Section 3.4.
In our method, after obtaining the defect datasets, the CK [6], process [7], and network metrics [11,13] were selected to characterize the source code samples, comprising 80 metrics in total. Table 1, Table 2 and Table 3 show some of the metrics used, while the entire meta dataset can be found at https://github.com/buaaSoftwareReliabilityGroup/US-PONR (accessed on 18 September 2023). The first step in US-PONR is undersampling (US). After the source code is represented using metrics, the software data are undersampled to pre-adjust the degree of imbalance in the dataset, which alleviates the class imbalance and overfitting problems by reducing the number of repeated non-defective samples in the dataset. At this point in the process, the undersampling ratio also needs to be determined. A more detailed account of undersampling in US-PONR is provided in Section 3.2.
The second step in US-PONR is PSM-based oversampling and noise reduction (PONR). Both of these tasks are performed using the PSM technique to obtain the final dataset. PONR is performed on the undersampled dataset by calculating the propensity score of each sample in the dataset and then by obtaining the nearest neighbor sample set of each sample based on propensity score matching. According to the distribution of the PSM-based nearest neighbor sample set, oversample synthesis is performed to balance the dataset. After oversampling, PSM is performed again to judge whether a sample is noise and to exclude noise samples. Section 3.3 covers the specific calculation for the propensity score of each sample and the detailed steps of PONR.
The third step in US-PONR is predictor construction. To obtain the defect predictor, an ensemble model aggregating multiple ML classifiers was trained using the cross-validation (CV) method, in order to select the most appropriate ML model for different datasets. Section 3.4 describes in detail how to perform cross-validation and aggregate the models.

3.2. Undersampling

In datasets, the number of non-defective samples often far exceeds the number of defective samples. These datasets typically include duplicate non-defective samples introduced by non-defective code files that remain unchanged across versions.
In such cases, the dataset needs to be undersampled in order to reduce the degree of class imbalance and to reduce the overfitting of the model caused by repeated samples.
The initial dataset can be expressed by the following formula:
$X_{\mathrm{origin}} = \{ (x_i, y_i) \mid i \in (1, n),\ y_i \in \{0, 1\},\ x_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,m}]^T \}$
where xi is the feature vector of the i-th sample, yi is the defect label of the i-th sample, n is the number of samples, and m is the feature dimension.
In the method we are proposing, undersampling is introduced to alleviate the imbalance in a dataset caused by duplicate codes in multiple versions. The process of undersampling is represented by Algorithm 1:
Algorithm 1: Undersampling
Input: original dataset X_origin, imbalance degree r_max = |y_negative| / |y_positive|, step s_us
Output: undersampled dataset collection X_us_all[ ][ ], containing each X_us with its r_us
1.  r_us: undersampling ratio, initialized to 1
2.  X_max, X_d, X_min: initialized to null
3.  for x_i in X_origin
4.      if y_i = 0
5.          if x_i in X_max
6.              then X_d ← x_i
7.          X_max ← x_i
8.          end if
9.      else
10.         X_min ← x_i
11.     end if
12. end for
13. do {
14.     r = (number of samples in X_max) / (number of samples in X_min)
15.     while r ≤ r_us
16.         X_max ← x_j (randomly selected from X_d)
17.     end while
18.     X_us = X_max + X_min
19.     X_us_all ← X_us and r_us
20.     r_us += s_us
21. } while (r_us ≤ r_max)
22. return X_us_all
where r_max is the imbalance degree of the dataset, X_max contains the samples with defect-free labels in X_origin, X_min contains the samples with defect labels in X_origin, X_d contains the duplicate defect-free samples, the undersampling parameter r_us is the ratio of non-defective samples to defective samples after undersampling, X_us is the dataset undersampled under a certain r_us, and s_us is the search step for the undersampling ratio.
The algorithm searches in steps over [1, r_max] and deletes the duplicated data in the dataset under each r_us. Setting different r_us values alleviates the class imbalance problem to varying degrees, but setting a small r_us leaves few samples and loses important data information. Therefore, it is necessary to optimize r_us. The optimization method for r_us is introduced in Section 3.4.
After undersampling, the undersampled dataset X_us_all is obtained, which contains the different r_us values and their corresponding undersampled datasets X_us.
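For illustration, a minimal Python sketch of this undersampling step is given below. It is not the authors' implementation: the pandas-based duplicate handling, the function names, and the strategy of adding duplicates back at random until the requested ratio is reached are assumptions based on the description above.

# Sketch of the undersampling step (Sec. 3.2), assuming a pandas DataFrame with
# feature columns and a binary "bug" label.
import numpy as np
import pandas as pd

def undersample(df: pd.DataFrame, r_us: float, label: str = "bug", seed: int = 42) -> pd.DataFrame:
    """Undersample clean samples so that |clean| / |defective| is approximately r_us."""
    rng = np.random.default_rng(seed)
    defective = df[df[label] == 1]
    clean = df[df[label] == 0]
    feature_cols = [c for c in df.columns if c != label]
    unique_clean = clean.drop_duplicates(subset=feature_cols)   # X_max (unique clean samples)
    duplicates = clean[clean.duplicated(subset=feature_cols)]   # X_d (duplicates from version iterations)
    # Add duplicates back at random until the clean-to-defective ratio reaches r_us.
    n_target = int(r_us * len(defective))
    n_extra = max(0, min(len(duplicates), n_target - len(unique_clean)))
    if n_extra > 0:
        extra = duplicates.loc[rng.choice(duplicates.index.to_numpy(), size=n_extra, replace=False)]
    else:
        extra = duplicates.iloc[0:0]
    return pd.concat([unique_clean, extra, defective])

def undersample_all(df: pd.DataFrame, r_max: float, step: float = 0.2, label: str = "bug") -> dict:
    """Return {r_us: undersampled dataset} for r_us in [1, r_max] (the X_us_all collection)."""
    return {round(r, 2): undersample(df, r, label) for r in np.arange(1.0, r_max + 1e-9, step)}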

3.3. PONR

After undersampling, a dataset with repeated samples removed is obtained. However, undersampling greatly reduces the sample size of the dataset. To keep the data volume sufficient, the dataset cannot be balanced by undersampling alone.
At present, the most popular dataset balancing methods in machine learning are SMOTE and its derivatives, but their synthetic data strategy creates the problems of overfitting and noise amplification. To overcome overfitting and dataset noise, propensity score matching (PSM) is introduced to generate synthetic samples and reduce dataset noise. The theoretical basis of PSM is to address the systematic deviation between the distribution of the observed sampling data and the overall distribution, using the propensity score to measure differences between high-dimensional samples in the feature space.
In our method, PSM is used for nearest neighbor search, which supports both the generation of synthetic samples and the reduction of dataset noise. Using the one-dimensional propensity score also reduces the complexity and cost of the algorithm: the dataset can be denoised and minority samples synthesized by oversampling the samples closest to the center of the minority class, thereby improving the quality of the software defect dataset.
This subsection calculates the propensity score of each sample in X_us by solving the weight vector of X_us and adds it as an additional feature to X_us, providing a basis for the subsequent PSM-based noise removal and oversampling process.
First, use the logistic function to characterize the distribution of data:
$f(x) = \dfrac{1}{1 + e^{-x}}$
Next, define an m-dimensional weight vector β (where m is the feature dimension) to characterize the similarity of samples. Also, define a constant β_0 and perform a minimal initialization, allowing each element to be a random decimal value close to 0 to ensure that $x_i^T \beta + \beta_0 \geq 0$.
Now, define the log-likelihood of the dataset as
$\ln L = \sum_{i=1}^{n} \left( y_i \ln\!\left(f(x_i^T \beta)\right) + (1 - y_i)\ln\!\left(1 - f(x_i^T \beta)\right) \right)$
Solve for the β that maximizes the log-likelihood of the dataset:
$\arg\max_{\beta} \ln L = \arg\max_{\beta} \sum_{i=1}^{n} \left( y_i \ln\!\left(f(x_i^T \beta)\right) + (1 - y_i)\ln\!\left(1 - f(x_i^T \beta)\right) \right)$
Substitute the obtained weight vector β and the feature vector of each sample into the logistic function to obtain the propensity score $f_\beta(x_i)$ of each sample:
$f_\beta(x_i) = \dfrac{1}{1 + e^{-(x_i^T \beta + \beta_0)}}$
Add the propensity score of each sample to X_us as the (m + 1)-th dimensional feature of the sample to obtain the propensity score dataset X_β. For each sample x_i, traverse the other samples in X_β to calculate the Euclidean distance between each pair of samples:
$d_{ij} = \left( \sum_{z=1}^{m+1} |x_{iz} - x_{jz}|^2 \right)^{1/2} \quad \text{iff}\ x_i \neq x_j\ \text{and}\ x_i, x_j \in X_\beta$
where x_iz is the z-th dimension feature of x_i. Then, find the k samples whose Euclidean distance to x_i is smallest to form the k-nearest-neighbor sample set XK_i of x_i:
$XK_i = \{ (x_{ij}, y_{ij}) \mid j \neq i,\ j \in (1, k+1),\ \min(d_{ij})\ \text{in}\ X_\beta \}$
Then, the data are oversampled; the number of synthetic samples to be generated is
$N_{\mathrm{new}} = r_{os} \times (\mathrm{num}_{y_i = 0} - \mathrm{num}_{y_i = 1}), \quad (x_i, y_i) \in X_{us}$
where the parameter r_os is the weight of the number of samples generated and is set to 1 by default. Traverse each defect sample x_i in the dataset, randomly select one defect sample x_ij from its XK_i, and randomly synthesize a new defect sample on the line between the two defect samples in the feature space by (9):
$x_{\mathrm{new}} = x_i + c(x_{ij} - x_i), \quad y_{\mathrm{new}} = 1, \quad x_{ij} \in XK_i$
where c represents a randomly generated constant between 0 and 1. Repeat the above process until the number of synthesized instances reaches N_new. Add all synthesized defective instances to X_β to obtain X_os. After oversampling, the dataset still needs noise reduction. For each x_i, count the number of different-class samples in XK_i by (10):
$\mathrm{diff} = \sum_{j=1}^{k} |y_i - y_{ij}|$
If diff ≥ the noise discrimination threshold t, consider x_i to be a noise sample and remove it from X_os. In this method, the nearest neighbor number parameter is set to k = 5 and the noise discrimination threshold to t = 3 for PONR; the value of k is set according to the density of the dataset used, and the value of t is set so that samples for which more than half of the k neighbors belong to a different class are treated as noise. Repeat the above traversal process until the entire X_os has been traversed without any noise being discriminated; the noise-reduced dataset X_final is then obtained. At this point, the complete data processing flow is finished.
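To make the PONR step concrete, the following Python sketch follows the notation above: propensity scores from a logistic regression, k-nearest neighbors on the score-augmented features, interpolation-based oversampling, and neighbor-vote noise removal in a single pass. It is an illustrative approximation, not the authors' code; the defaults k = 5, t = 3, and r_os = 1 come from the text, while the scikit-learn-based implementation details and the single-pass filter are assumptions.

# Sketch of PONR (Sec. 3.3) on NumPy arrays X (features) and y (binary labels).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def ponr(X: np.ndarray, y: np.ndarray, k: int = 5, t: int = 3, r_os: float = 1.0, seed: int = 42):
    rng = np.random.default_rng(seed)
    # Propensity score f_beta(x_i), appended as the (m+1)-th feature -> X_beta.
    ps = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    Xb = np.hstack([X, ps.reshape(-1, 1)])
    # Oversampling: interpolate between each defect sample and a random defect neighbor in XK_i.
    defect_idx = np.flatnonzero(y == 1)
    n_new = int(r_os * (np.sum(y == 0) - np.sum(y == 1)))
    synth = []
    if n_new > 0 and len(defect_idx) > 1:
        nn_def = NearestNeighbors(n_neighbors=min(k + 1, len(defect_idx))).fit(Xb[defect_idx])
        while len(synth) < n_new:
            i = rng.choice(defect_idx)
            neigh = nn_def.kneighbors(Xb[i][None, :], return_distance=False)[0][1:]
            j = defect_idx[rng.choice(neigh)]
            c = rng.random()
            synth.append(Xb[i] + c * (Xb[j] - Xb[i]))      # x_new = x_i + c (x_ij - x_i)
    X_os = np.vstack([Xb, np.array(synth)]) if synth else Xb
    y_os = np.concatenate([y, np.ones(len(synth), dtype=int)])
    # Noise reduction (single pass here; the paper repeats until no further noise is found):
    # a sample is treated as noise if at least t of its k nearest neighbors have a different label.
    neigh_all = NearestNeighbors(n_neighbors=k + 1).fit(X_os).kneighbors(X_os, return_distance=False)[:, 1:]
    diff = np.abs(y_os[neigh_all] - y_os[:, None]).sum(axis=1)
    keep = diff < t
    return X_os[keep], y_os[keep]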

3.4. Predictor Construction

The third step of the US-PONR method involves building a software defect predictor that uses a simple method of machine learning aggregation based on cross-validation (CV) scoring. It also provides a scheme for simple balanced parameter optimization.
In the SDP field, data distribution is significantly diverse: different software projects, different metrics, and different types of defects all exhibit different data distribution characteristics. Therefore, it is difficult for a single ML algorithm to comprehensively predict software defects. Inspired by the ensemble learning used in SDP [56], in this paper, multiple ML models are aggregated and trained through a CV scoring method to build an aggregated defect prediction model. The procedure is as follows:
First, set the parameter N_k of the StratifiedKFold random grouping and the model score error threshold T_dev. StratifiedKFold random grouping refers to randomly dividing the dataset into N_k groups while keeping the proportion of defective and non-defective samples in each group equal to that of the original dataset. The X_final obtained in the previous step is randomly grouped with StratifiedKFold to obtain a set of N_k dataset groups: {X_final,k | k = 1 … N_k}. In each loop, select one group of the dataset as X_validate and the remaining groups as X_train:
$\{ X_{\mathrm{validate}} = X_{\mathrm{final},i},\ X_{\mathrm{train}} = \mathrm{set}(X_{\mathrm{final},j}),\ i \in (1, N_k),\ j \neq i \in (1, N_k) \}$
Then, the CV-based algorithm shown in Algorithm 2 was used to predict results:
Algorithm 2: CV-based aggregation
Input: StratifiedKFold grouped dataset {X_final,k | k = 1 … N_k}, undersampling ratio r_us, Classifiers, model score error threshold T_dev
Output: defect prediction result, best undersampling ratio r_us_best
1.  for r_us in X_us_all:
2.      for X_final,k in {X_final,k | k = 1 … N_k}:
3.          X_train ← X_final,k
4.          X_validate ← {X_final,j | j ∈ (1, N_k), j ≠ k}
5.          for classifier_i in Classifiers
6.              if classifier_i needs hyperparameter or threshold optimization:
7.                  optimize the hyperparameters or threshold
8.              end if
9.              AUC_i ← train the predictor with X_train and evaluate it on X_validate
10.             if AUC_max < AUC_i:
11.                 AUC_max = AUC_i
12.             end if
13.             AUC_classifiers ← AUC_i
14.         end for
15.         AUC_max_Nk += AUC_max
16.         AUC_classifiers_Nk += each AUC_i summed separately in AUC_classifiers
17.     end for
18.     avg(AUC_max) = AUC_max_Nk / N_k; avg(AUC_classifiers) = AUC_classifiers_Nk / N_k
19.     record avg(AUC_max) and avg(AUC_classifiers) corresponding to each different r_us
20. end for
21. find the maximum of avg(AUC_max) calculated over the different r_us
22. r_us_best = the r_us with the maximum avg(AUC_max)
23. AUC_classifiers_best = the avg(AUC_classifiers) associated with the maximum avg(AUC_max)
24. for result in AUC_classifiers_best
25.     if |avg(AUC_max) − result| < T_dev
26.         add the classifier (with its prediction result) to C_CV
27. end for
28. defect prediction result = the best prediction result by the classifiers in C_CV
29. return the defect prediction result and r_us_best
where {X_final,k | k = 1 … N_k} is the dataset obtained by applying PONR and StratifiedKFold grouping to X_us, the undersampled dataset corresponding to the particular undersampling ratio r_us in X_us_all. Classifiers is the set of ML classifiers used in this method, shown in Table 4. AUC_max is the highest AUC among the results predicted by the Classifiers, AUC_classifiers is the AUC of each classifier in Classifiers, C_CV is the set of classifiers filtered by the CV scoring method, and r_us_best is the optimal parameter obtained by the CV-based aggregation model.
Before training the different ML classifiers on X_train, some classifiers need hyperparameter or threshold optimization to obtain their best performance. In this method, for RR and LAR, the RandomizedSearchCV algorithm is used for tuning. For KNN, the GridSearchCV algorithm is used to determine the best n_neighbors. For LiR, KR, and other classifiers that need threshold optimization, the initial threshold is set to the median of the prediction results; bisection is then used to iterate, and the threshold with the maximum score is selected as the optimal threshold. The classifiers are trained to obtain ML models, which are then used to predict X_validate, and the results are scored using AUC. AUC indicates the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample by the trained classifier. The formula for AUC is
$\mathrm{AUC} = \dfrac{\sum_{i \in \mathrm{positiveClass}} \mathrm{rank}_i - \frac{M(1+M)}{2}}{M \times N}$
where M and N are the numbers of positive and negative samples, respectively, and rank_i is the rank of sample i when all samples are sorted by predicted score.
Repeat the above operation until each group of data has served as X_validate once. Calculate the average AUC of each machine learning model over the N_k experiments, and take the model algorithm with the largest average AUC, avg(AUC_max), together with the model algorithms whose average AUC satisfies |AUC − avg(AUC_max)| ≤ T_dev, as the final selected algorithms in C_CV.
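As an illustrative sanity check (not part of the paper), the rank-based formula above can be verified against scikit-learn's roc_auc_score on a small synthetic example:

# Verify the rank-based AUC formula against scikit-learn (ties handled via average ranks).
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1])                   # M = 4 positives, N = 4 negatives
scores = np.array([0.9, 0.4, 0.7, 0.5, 0.3, 0.6, 0.2, 0.8])   # classifier scores

ranks = rankdata(scores)                                       # rank_i over all samples
M, N = np.sum(y_true == 1), np.sum(y_true == 0)
auc_rank = (ranks[y_true == 1].sum() - M * (1 + M) / 2) / (M * N)

assert np.isclose(auc_rank, roc_auc_score(y_true, scores))
print(auc_rank)   # 0.9375: one of the 16 positive-negative pairs is ranked incorrectly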
Due to the different class imbalance, data distribution, and other characteristics of different software datasets, the balance parameter of the undersampling process should be optimized while constructing the predictor. Here, a simple scheme is proposed to optimize r_us within the CV aggregation process described above. For a dataset with class imbalance ratio r, experiment with different r_us values using a given step size over [1, r]. Take the r_us with the highest avg(AUC_max) as r_us_best, as shown in Algorithm 2. Then, take r_us_best as the selected undersampling ratio to obtain X_final during the actual prediction.
Train the selected aggregated ML model using X_final to obtain the defect prediction model. Measure the metrics of the target software’s test set and then perform the prediction. Multiple ML models in the aggregated model were selected in the previous step, and their prediction results may differ, so the prediction with the highest AUC should be selected as the final prediction result. However, because it costs much less to test for a software defect than to repair it afterwards, the prediction results should include as many defective modules as possible so that more defects can be found during testing. In practice, the final defect prediction result should combine the defective modules predicted by each model.
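For illustration, the sketch below implements a simplified version of this CV-based aggregation with scikit-learn: several candidate classifiers are scored with stratified K-fold AUC, and every classifier within T_dev of the best average AUC is kept. The three candidate classifiers and their parameter grids are placeholders rather than the full 14-model setup of Table 4.

# Simplified CV-based aggregation: keep every classifier within t_dev of the best mean AUC.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def cv_aggregate(X: np.ndarray, y: np.ndarray, n_splits: int = 5, t_dev: float = 0.02, seed: int = 42):
    candidates = {
        "logreg": LogisticRegression(max_iter=1000),
        "rf": RandomForestClassifier(random_state=seed),
        # KNN tuned with GridSearchCV over n_neighbors, as described in the text.
        "knn": GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9]}, scoring="roc_auc"),
    }
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = {name: [] for name in candidates}
    for train_idx, val_idx in skf.split(X, y):
        for name, clf in candidates.items():
            clf.fit(X[train_idx], y[train_idx])
            proba = clf.predict_proba(X[val_idx])[:, 1]
            aucs[name].append(roc_auc_score(y[val_idx], proba))
    mean_auc = {name: float(np.mean(v)) for name, v in aucs.items()}
    best = max(mean_auc.values())
    # Keep every classifier whose average AUC is within t_dev of the best (the C_CV set).
    selected = [name for name, auc in mean_auc.items() if best - auc < t_dev]
    return selected, mean_auc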

4. Experiment

We conducted three experiments to answer the three research questions raised in Section 4.1 separately:
  • defect prediction in SDP datasets without noise (for RQ1);
  • defect prediction in SDP datasets in different noise environments (for RQ2);
  • validation of the proportion of labeled noisy samples in the dataset before and after the use of the data pre-processing method (for RQ3).
The datasets used in the different experiments are described in Section 4.2, the ML algorithms selected for the aggregated model and the experiment environment are shown in Section 4.3, and the SOTA methods compared with the proposed method in the different experiments are described in Section 4.4.

4.1. Research Questions

The experiment sought to answer the following three research questions:
  • RQ1: Is US-PONR effective in SDP datasets?
  • RQ2: Can US-PONR perform better compared to the benchmark methods in unbalanced SDP datasets with noise?
  • RQ3: Is US-PONR especially good at eliminating label noise samples?
RQ1 aims to validate the effectiveness of the proposed method in SDP datasets. RQ2 focuses on the effectiveness of US-PONR in noisy environments. And RQ3 further validates the effectiveness of the proposed method in noise reduction.

4.2. Datasets

4.2.1. PROMISE Public Dataset

To create the US-PONR method, we used the data of 24 versions of software from 11 different software projects in the PROMISE public dataset [57]. Table 5 provides an overview of these projects and their software, which represent the original dataset before any pre-processing. The selected datasets cover different scales of data and different degrees of class imbalance. The CK and process metrics used in the experiment were obtained directly from the dataset. The network metrics were obtained by converting the dataset into a network diagram through the ISEE platform developed by our laboratory and by integrating measurement elements proposed by Yang et al. [11] and by Zimmermann and Nagappan [13]. The PROMISE datasets, comprising software donations from NASA and other sources, have become one of the main public datasets in the SDP field, and they have been modified by researchers to mitigate various problems [58]. As Baljinder et al. pointed out, the quality of the PROMISE dataset is now better than that of the noise-reduced NASA dataset [59]. Therefore, this article assumes that the PROMISE dataset used is clean and noise-free.

4.2.2. Noise Dataset Generation

In order to generate datasets with label noise, 18 researchers in our lab were selected to re-generate labels for the PROMISE dataset. Labels that differed from the original PROMISE labels were then treated as label noise samples.
We separately replaced 10%, 20%, and 30% of the training set samples with noise samples, generating three training sets with different numbers of noise samples to test the performance of the model in different noise environments. Meanwhile, the validation and test sets were kept the same as in the PROMISE dataset. The generated noisy datasets were used to answer RQ2 and RQ3.
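For readers who want to reproduce a comparable setup, the snippet below shows one simple way to inject label noise at a given level by flipping training labels at random; this random flipping is only an approximation, since the noise in our study came from manual relabeling rather than random perturbation.

# Illustrative label-noise injection: flip a fixed fraction of training labels.
import numpy as np

def inject_label_noise(y_train: np.ndarray, noise_level: float, seed: int = 42):
    """Return a noisy copy of y_train plus the indices of the flipped samples."""
    rng = np.random.default_rng(seed)
    n_noisy = int(noise_level * len(y_train))
    flip_idx = rng.choice(len(y_train), size=n_noisy, replace=False)
    y_noisy = y_train.copy()
    y_noisy[flip_idx] = 1 - y_noisy[flip_idx]   # flip 0 <-> 1
    return y_noisy, flip_idx

# e.g., y_10, idx_10 = inject_label_noise(y_train, 0.10)  # 10% noise environment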

4.3. Experiment Settings

As shown in Table 4 above, 14 machine learning algorithms were selected as candidates for model aggregation. Python’s scikit-learn was then used for the algorithms that required hyperparameter optimization and feature selection.
The experiments were carried out using Ubuntu 20.04 and Python 3.8 on an Intel E5-2620 CPU with 32 GB of memory and an NVIDIA GeForce RTX 2080 with 16 GB of RAM.

4.4. Comparison Methods

For RQ1, the SOTA oversampling method SMOTUNED [36] was used to verify the effectiveness of US-PONR in non-noisy SDP datasets.
For RQ2 and RQ3, to compare US-PONR with current mainstream methods of oversampling and noise removal, 12 methods were selected to join US-PONR in the noise experiment following undersampling, including SOMO [46], MWMOTE [48], SMOTE-IPF [51], SMOTE-RSB [53], SMOTE-FRST-2T [54], SMOTE [20], SMOTE-TomekLinks [42], Borderline-SMOTE [43], ADASYN [44], MSMOTE [50], Gaussian-SMOTE [47], and CCR [52].
Scikit-learn and the open-source library “smote_variants” [60] on GitHub were used to implement these methods.

5. Results

5.1. Answer to RQ1: Is US-PONR Effective in SDP Datasets?

In order to answer RQ1, three small experiments were conducted:
  • Determining the optimization parameter.
  • Comparing the results of US-PONR with the results of just US (undersampling) or PONR alone.
  • Comparing US-PONR with the SOTA data pre-processing method.
The first experiment determined the optimal parameter of each dataset. These parameters were then used for the other two experiments. The second experiment sought to prove the necessity of using both US and PONR in the proposed method. The third experiment sought to validate the effectiveness of US-PONR. The dataset used in these experiments was the original PROMISE dataset without any introduced noise (shown in Table 5).

5.1.1. Determining the Optimization Parameter

To obtain the optimal parameter, different steps were taken to process undersampling on the selected datasets; step size was set at 1 between [1, r], where r is the degree of imbalance in the dataset. We trained the CV-based aggregation model under different conditions and then compared with the average AUC obtained by cross-validation; the undersampling ratio with the largest average AUC was selected as the optimal parameter.
When conducting this experiment, we discovered that the search interval was usually too large, so we recommend first using step = 1 to locate the interval where the optimal parameter may lie and then using step = 0.2 to find the optimal parameter within it. Our test results indicate that the value intervals should be set as follows: ant [3, 4], camel [3, 3.9], poi [5, 6], JDT [2.6, 3.4], synapse [2, 3], jedit [3.6, 4.4], velocity [1, 1.8], xerces [4.4, 5.2], log4j [1.6, 2.4], PDE [4.6, 5.4], and mylyn [4.6, 5.4].
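The coarse-to-fine search can be sketched as follows; evaluate_r_us is a hypothetical callback standing in for "pre-process the dataset at this ratio, train the CV aggregated model, and return its average AUC", and the fine interval of plus/minus one coarse step is an assumption made for illustration.

# Coarse-to-fine search for the undersampling ratio r_us (Sec. 5.1.1), illustrative only.
import numpy as np

def coarse_to_fine_search(evaluate_r_us, r_max: float, coarse_step: float = 1.0, fine_step: float = 0.2) -> float:
    # Coarse pass: step = 1 over [1, r_max] to locate a promising interval.
    coarse = np.arange(1.0, r_max + 1e-9, coarse_step)
    best_coarse = max(coarse, key=evaluate_r_us)
    lo, hi = max(1.0, best_coarse - coarse_step), min(r_max, best_coarse + coarse_step)
    # Fine pass: step = 0.2 inside the promising interval.
    fine = np.arange(lo, hi + 1e-9, fine_step)
    return float(max(fine, key=evaluate_r_us))

# e.g., r_us_best = coarse_to_fine_search(lambda r: avg_auc_for_ratio(r), r_max=6.0),
# where avg_auc_for_ratio is a hypothetical evaluation function.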
The results for parameter optimization are shown in Table 6:
These results show that the optimal parameters of each dataset are as follows: ant (3.8), camel (3.9), JDT (3.2), jedit (4.2), log4j (2.2), mylyn (5.4), PDE (5), poi (5.8), synapse (2.4), velocity (1.8), and xerces (4.8).

5.1.2. Comparing US-PONR with US and PONR

After the best parameters were obtained for each dataset, the datasets pre-processed with US-PONR, with only US, or with only PONR were each used to train a CV-based aggregation model, and the AUC of each of the three methods was compared as the output result.
Figure 2 shows how the AUC of US-PONR under the optimal parameters compared with the AUC of US and PONR, individually. US-PONR performed better than both US and PONR in most datasets but performed less well than PONR alone in the ant, log4j, and synapse datasets. The reason for the latter results may be that some important features of the dataset were deleted during undersampling, and the problem was then magnified during oversampling. Overall, the results show that US-PONR is a significant improvement over either US or PONR alone.

5.1.3. Comparing US-PONR with the SOTA Data Pre-Processing Method—SMOTUNED

Table 7 shows how US-PONR compared with SMOTUNED, the alternative SOTA data pre-processing method. The results show that each of the two methods achieved the higher AUC on some datasets and the lower AUC on others; nonetheless, SMOTUNED spent a great deal more time on each dataset than US-PONR. The reason for the latter result is that SMOTUNED carried out many iterations to obtain the best parameter value, and the larger the dataset, the more time it spent. US-PONR obtained the best parameter value by interval search, which greatly reduced its processing time.

5.2. Answer to RQ2: Can US-PONR Perform Better Compared to the Benchmark Methods in Unbalanced SDP Datasets with Noise?

The purpose of this experiment was to explore how well US-PONR predicts defects in unbalanced datasets that contain different proportions of label noise. The proposed method was compared with other oversampling and filtering methods after undersampling.
In order to test datasets containing noise samples of different proportions, training datasets were generated with 10%, 20%, and 30% noise samples added to each set (see Section 4.2.2). Data pre-processing was then performed on these datasets to train the aggregate SDP model.
A total of 12 methods of oversampling and noise reduction were tested, alongside our proposed method, on the datasets under the three noise levels. Following the same experiment settings used to optimize dataset parameters (Section 5.1.1), the settings of each dataset were as follows (for 10%, 20%, 30% noise): ant (1.6, 1.6, 1.4), camel (1.4, 1.4, 1.4), JDT (2.8, 1.5, 1.5), jedit (2, 2, 2), log4j (1.2, 1.2, 1.2), mylyn (3.5, 2.7, 2), PDE (2, 2, 2.2), poi (1.8, 1.8, 1.2), synapse (2.2, 1.8, 1.2), velocity (1.6, 1.2, 1.2), and xerces (2, 2, 2).
Table 8 shows the AUC for each method when applied to every dataset. The results indicate that no matter which dataset or method was used, the introduction of noise always reduced a method’s prediction performance. However, under most noise ratios of most datasets, the prediction performance of US-PONR was the best, while its decrease in performance as the noise ratio rose was minimal. When the US-PONR method was used on the poi dataset, the score at the low noise level was lower than the score at the high noise level. The reason for this anomaly is that the undersampling step size was set too large: to save computation time, the step size was set to 0.2. For the same dataset at different noise levels, the high noise level dataset therefore appears to have reached the optimum while the low noise level dataset has not.
The experiment results show that at the 30% noise level, our proposed method had a very significant advantage over all other methods for most datasets, especially over Gaussian-SMOTE and CCR (which outperform US-PONR at low noise levels in some datasets). This suggests that US-PONR is the best overall method in a noisy environment, except on the mylyn and PDE datasets. The reason for the method’s poorer performance on these datasets may be that they have heavy-tailed distributions (such as the Lévy distribution), which make it impossible to distinguish good from bad samples in the feature space using PSM, leading to improper oversampling or noise reduction. The reason for US-PONR’s superior performance lies in the way it reduces noise using the PSM score, which can effectively capture noise samples.

5.3. Answer to RQ3: Is US-PONR Especially Good at Eliminating Label Noise Samples?

In order to verify the effectiveness of US-PONR in eliminating label noise, we designed an experiment to discover how many introduced noise samples were found and eliminated by each method in the 10%, 20%, and 30% noise environments.
We marked each introduced label noise sample and then checked the residual noise ratio in the datasets after using each method. The residual noise ratio is the number of the remaining marked noise samples divided by the total number of samples after using the data processing method. The residual noise ratio in the dataset is used to verify the effectiveness of the noise reduction method: the smaller the proportion of remaining marked noise samples in the datasets, the more effective the noise-reduction method. The setting of dataset parameters in this experiment was the same as that used to test RQ2.
Table 9 shows the proportion of residual marked label noise samples after using each method in the 10%, 20%, and 30% noise environments. For example, in the ant dataset at the 10% noise level, the residual noise sample ratio recorded for US-PONR was 1.77%, which means that US-PONR identified and removed noise samples amounting to 8.23% of the dataset. The table shows that in each dataset and at each noise level, the residual noise ratio was the lowest after using US-PONR (maximum: −22.82% in Xerces at the 30% noise level; minimum: −15.4% in Velocity at the 30% noise level), which proves that the proposed method is effective in removing noise samples and improving the quality of datasets.

6. Threats to Validity

6.1. Datasets

In our SDP experiment, we tested US-PONR on 24 versions of software data from 11 projects in the PROMISE dataset, and our proposed method performed admirably. However, before being put into practice, US-PONR needs to be tested against actual commercial projects. Moving forward, we plan to apply this method to some commercial projects to verify its validity.
Although our team members manually labeled the samples to simulate as much as possible the process of generating label noise in real data (see Section 4.2.2), our test methodology inevitably brought some artificial features to the dataset. In the future, US-PONR’s effectiveness will also need to be verified in other noisy datasets.
Finally, the CK, process, and network metrics were chosen to be used in the experiment as feature vectors for representing software defects. There are other metrics that can be used to characterize software defects, such as the Hausdorff metric [61]. Validation of the method against these metrics also needs to be carried out in future work.

6.2. Model Hyperparameters

US-PONR and other methods for detecting software defects have difficulty controlling every variable and every model parameter, such as the n_neighbors k, the noise discrimination threshold t, or the weight of the number of samples generated. Therefore, in our experiment, we used the parameters recommended by their authors, and we performed no further tuning of them. Fine-tuning these parameters may improve their performance slightly.

6.3. Evaluation

In our study, AUC was used to evaluate the prediction performance of the models. Other indicators such as MCC were also calculated in the experiment, but due to the limited length of the article, these indicators are not fully addressed in the paper. Readers who are interested in detailed results should contact the authors by email.

7. Discussion and Conclusions

In the past two decades, software defect prediction (SDP) has developed rapidly as an important way to improve software quality and reliability. However, class imbalance and noise are often present in SDP datasets. With the continuous improvement of SDP algorithms based on machine learning, improving the quality of software defect data offers a further significant way to improve SDP performance. Oversampling methods such as SMOTE are among the most well-known ways to balance datasets, but they can introduce overfitting noise. Thus, this article attempts to solve the data imbalance in SDP datasets and to reduce the noise in them.
In this paper, US-PONR is proposed as a dataset pre-processing method to improve SDP dataset quality. To alleviate imbalance in a dataset, US-PONR first uses undersampling to remove redundant samples caused by version iterations, and then it uses PSM-based oversampling and noise reduction, treating the propensity score as an additional dimension of the feature vector, to rebalance the dataset and remove potential noise samples. A total of 24 versions of software data from 11 projects were used in experiments to test the proposed method. In an SDP experiment conducted in different noise environments, US-PONR demonstrated its ability to improve the quality of SDP datasets. And in experiments on removing label noise samples, the new method demonstrated that it can effectively identify label noise samples and remove them.
However, a few limitations related to the datasets still need to be addressed. For our data, the experiments only used the CK, process, and network metrics extracted from the PROMISE datasets. In the future, we hope to test our method against commercial datasets and more metrics that represent software defects. In addition, artificial characteristics may have been introduced when generating the noise datasets used in the experiments, so the ability of the method to identify and remove label noise samples needs further validation in real software projects.

Author Contributions

Conceptualization, H.S.; methodology, H.S.; software, H.S.; validation, H.S. and J.X.; formal analysis, H.S.; investigation, H.S. and J.X.; resources, J.X.; data curation, J.X.; writing—original draft preparation, H.S.; writing—review and editing, H.S. and J.L.; visualization, H.S.; supervision, J.A.; project administration, J.A.; funding acquisition, No. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The entire meta dataset can be found at https://github.com/buaaSoftwareReliabilityGroup/US-PONR (accessed on 18 September 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wong, W.E.; Li, X.; Laplante, P.A. Be more familiar with our enemies and pave the way forward: A review of the roles bugs played in software failures. J. Syst. Softw. 2017, 133, 68–94. [Google Scholar] [CrossRef]
  2. Wong, W.E.; Debroy, V.; Surampudi, A.; Kim, H.; Siok, M.F. Recent catastrophic accidents: Investigating how software was responsible. In Proceedings of the SSIRI 2010—4th IEEE International Conference on Secure Software Integration and Reliability Improvement, Singapore, 9–11 June 2010; pp. 14–22. [Google Scholar] [CrossRef]
  3. Aleem, S.; Capretz, L.F.; Ahmed, F. Benchmarking Machine Learning Techniques for Software Defect Detection. Int. J. Softw. Eng. Appl. 2015, 6, 11–23. [Google Scholar] [CrossRef]
  4. Alsaeedi, A.; Khan, M.Z. Software Defect Prediction Using Supervised Machine Learning and Ensemble Techniques: A Comparative Study. J. Softw. Eng. Appl. 2019, 12, 85–100. [Google Scholar] [CrossRef]
  5. Prasad, M.; Florence, L.F.; Arya, A. A Study on Software Metrics based Software Defect Prediction using Data Mining and Machine Learning Techniques. Int. J. Database Theory Appl. 2015, 8, 179–190. [Google Scholar] [CrossRef]
  6. Chidamber, S.; Kemerer, C.F. A Metric suite for object oriented design. IEEE Trans. Softw. Eng. 1994, 20, 476–493. [Google Scholar] [CrossRef]
  7. Nagappan, N.; Ball, T. Use of relative code churn measures to predict system defect density. In Proceedings of the 27th International Conference on Software Engineering, ICSE05, St. Louis, MO, USA, 15–21 May 2005; pp. 284–292. [Google Scholar] [CrossRef]
  8. Khoshgoftaar, T.; Allen, E.; Goel, N.; Nandi, A.; McMullan, J. Detection of software modules with high debug code churn in a very large legacy system. In Proceedings of the ISSRE ‘96: 7th International Symposium on Software Reliability Engineering, White Plains, NY, USA, 30 October–2 November 1996. [Google Scholar] [CrossRef]
  9. Nikora, A.P.; Munson, J.C. Developing fault predictors for evolving software systems. In Proceedings of the 5th International Workshop on Enterprise Networking and Computing in Healthcare Industry, Sydney, Australia, 5 September 2004. [Google Scholar] [CrossRef]
  10. Hassan, A.E. Predicting faults using the complexity of code changes. In Proceedings of the International Conference on Software Engineering, Vancouver, BC, Canada, 16–24 May 2009; pp. 78–88. [Google Scholar] [CrossRef]
  11. Yang, Y.; Ai, J.; Wang, F. Defect Prediction Based on the Characteristics of Multilayer Structure of Software Network. In Proceedings of the 2018 IEEE International Conference on Software Quality, Reliability, and Security Companion (QRS-C), Lisbon, Portugal, 16–20 July 2018; pp. 27–34. [Google Scholar] [CrossRef]
  12. Ai, J.; Su, W.; Zhang, S.; Yang, Y. A Software Network Model for Software Structure and Faults Distribution Analysis. IEEE Trans. Reliab. 2019, 68, 844–858. [Google Scholar] [CrossRef]
  13. Zimmermann, T.; Nagappan, N. Predicting defects using network analysis on dependency graphs. In Proceedings of the International Conference on Software Engineering, Leipzig, Germany, 10–18 May 2008; pp. 531–540. [Google Scholar] [CrossRef]
  14. Zhang, S.; Ai, J.; Li, X. Correlation between the Distribution of Software Bugs and Network Motifs. In Proceedings of the 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), Vienna, Austria, 1–3 August 2016. [Google Scholar] [CrossRef]
  15. Li, Y.; Wong, W.E.; Lee, S.-Y.; Wotawa, F. Using Tri-Relation Networks for Effective Software Fault-Proneness Prediction. IEEE Access 2019, 7, 63066–63080. [Google Scholar] [CrossRef]
  16. Yu, X.; Liu, J.; Keung, J.W.; Li, Q.; Bennin, K.E.; Xu, Z.; Wang, J.; Cui, X. Improving Ranking-Oriented Defect Prediction Using a Cost-Sensitive Ranking SVM. IEEE Trans. Reliab. 2019, 69, 139–153. [Google Scholar] [CrossRef]
  17. Gong, L.; Jiang, S.; Jiang, L. Tackling Class Imbalance Problem in Software Defect Prediction through Cluster-Based Over-Sampling with Filtering. IEEE Access 2019, 7, 145725–145737. [Google Scholar] [CrossRef]
  18. Zhang, X.; Song, Q.; Wang, G. A dissimilarity-based imbalance data classification algorithm. Appl. Intell. 2015, 42, 544–565. [Google Scholar] [CrossRef]
  19. Zhou, L. Performance of corporate bankruptcy prediction models on imbalanced dataset: The effect of sampling methods. Knowl. Based Syst. 2013, 41, 16–25. [Google Scholar] [CrossRef]
  20. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  21. Bennin, K.E.; Keung, J.; Phannachitta, P.; Monden, A.; Mensah, S. Mahakil: Diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction. IEEE Trans. Softw. Eng. 2017, 44, 534–550. [Google Scholar] [CrossRef]
  22. Riquelme, J.C.; Ruiz, R.; Rodríguez, D.; Moreno, J. Finding defective modules from highly unbalanced datasets. Actas De Los Talleres Las Jorn. Ing. Del Softw. Bases Datos 2008, 2, 67–74. [Google Scholar]
  23. Pandey; Kumar, S.; Tripathi, A.K. An empirical study toward dealing with noise and class imbalance issues in software defect prediction. Soft Comput. 2021, 25, 13465–13492. [Google Scholar] [CrossRef]
  24. Li, Z.; Jing, X.-Y.; Zhu, X. Progress on approaches to software defect prediction. IET Softw. 2018, 12, 161–175. [Google Scholar] [CrossRef]
  25. Kim, H.; Just, S.; Zeller, A. It’s not a bug, it’s a feature: How misclassification impacts bug prediction. In Proceedings of the 2013 35th International Conference on Software Engineering (ICSE), San Francisco, CA, USA, 18–26 May 2013. [Google Scholar]
  26. Kim, H.; Just, S.; Zeller, A. The impact of tangled code changes on defect prediction models. Empir. Softw. Eng. 2016, 21, 303–336. [Google Scholar]
  27. Rivera, W.A. Noise Reduction A Priori Synthetic Over-Sampling for class imbalanced data sets. Inf. Sci. 2017, 408, 146–161. [Google Scholar] [CrossRef]
  28. Song, Q.; Jia, Z.; Shepperd, M.; Ying, S.; Liu, J. A general software defect-proneness prediction framework. IEEE Trans. Softw. Eng. 2011, 37, 356–370. [Google Scholar] [CrossRef]
  29. Jin, C. Software defect prediction model based on distance metric learning. Soft Comput. 2021, 25, 447–461. [Google Scholar] [CrossRef]
  30. Goyal, S. Effective software defect prediction using support vector machines (SVMs). Int. J. Syst. Assur. Eng. Manag. 2022, 13, 681–696. [Google Scholar] [CrossRef]
  31. Xu, J.; Ai, J.; Liu, J.; Shi, T. ACGDP: An Augmented Code Graph-Based System for Software Defect Prediction. IEEE Trans. Reliab. 2022, 71, 850–864. [Google Scholar] [CrossRef]
  32. Hanif, H.; Maffeis, S. Vulberta: Simplified source code pre-training for vulnerability detection. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–8. [Google Scholar]
  33. Weyuker, E.J.; Ostrand, T.J.; Bell, R.M. Do too many cooks spoil the broth? Using the number of developers to enhance defect prediction models. Empir. Softw. Eng. 2008, 13, 539–559. [Google Scholar] [CrossRef]
  34. Guzmán-Ponce, A.; Sánchez, J.S.; Valdovinos, R.M.; Marcial-Romero, J.R. DBIG-US: A two-stage under-sampling algorithm to face the class imbalance problem. Expert Syst. Appl. 2021, 168, 114301. [Google Scholar] [CrossRef]
  35. Tax, D.M.J. One-Class Classification: Concept Learning in the Absence of Counter-Examples; Netherlands Participating Organizations: Leidschendam, The Netherlands, 2002; p. 584. [Google Scholar]
  36. Agrawal, A.; Menzies, T. Is ‘better data’ better than ‘better data miners’?: On the benefits of tuning SMOTE for defect prediction. In Proceedings of the International Conference on Software Engineering, Gothenburg, Sweden, 27 May–3 June 2018; pp. 1050–1061. [Google Scholar] [CrossRef]
  37. Feng, S.; Keung, J.; Yu, X.; Xiao, Y.; Bennin, K.E.; Kabir, A.; Zhang, M. COSTE: Complexity-based OverSampling TEchnique to alleviate the class imbalance problem in software defect prediction. Inf. Softw. Technol. 2021, 129, 106432. [Google Scholar] [CrossRef]
  38. Ochal, M.; Patacchiola, M.; Vazquez, J.; Storkey, A.; Wang, S. Few-shot learning with class imbalance. IEEE Trans. Artif. Intell. 2023. [Google Scholar] [CrossRef]
  39. Bennin, K.E.; Keung, J.; Phannachitta, P.; Mensah, S. The significant effects of data sampling approaches on software defect prioritization and classification. In Proceedings of the 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), Toronto, ON, Canada, 9–10 November 2017. [Google Scholar]
  40. Feng, S.; Keung, J.; Yu, X.; Xiao, Y.; Zhang, M. Investigation on the stability of SMOTE-based oversampling techniques in software defect prediction. Inf. Softw. Technol. 2021, 139, 106662. [Google Scholar] [CrossRef]
  41. Soltanzadeh, P.; Hashemzadeh, M. RCSMOTE: Range-Controlled synthetic minority over-sampling technique for handling the class imbalance problem. Inf. Sci. 2021, 542, 92–111. [Google Scholar] [CrossRef]
  42. Batista, G.E.A.P.A.; Prati, R.C.; Monard, M.C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29. [Google Scholar] [CrossRef]
  43. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. Lect. Notes Comput. Sci. 2005, 3644, 878–887. [Google Scholar] [CrossRef]
  44. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the International Joint Conference on Neural Networks, Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar] [CrossRef]
  45. Douzas, G.; Bacao, F.; Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf. Sci. 2018, 465, 1–20. [Google Scholar] [CrossRef]
  46. Douzas, G.; Bacao, F. Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning. Expert Syst. Appl. 2017, 82, 40–52. [Google Scholar] [CrossRef]
  47. Lee, H.; Kim, J.; Kim, S. Gaussian-based SMOTE algorithm for solving skewed class distributions. Int. J. Fuzzy Log. Intell. Syst. 2017, 17, 229–234. [Google Scholar] [CrossRef]
  48. Barua, S.; Islam, M.; Yao, X.; Murase, K. MWMOTE—Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 2014, 26, 405–425. [Google Scholar] [CrossRef]
  49. Ahluwalia, A.; Falessi, D.; Di Penta, M. Snoring: A noise in defect prediction datasets. In Proceedings of the IEEE International Working Conference on Mining Software Repositories, Montreal, QC, Canada, 25–31 May 2019; pp. 63–67. [Google Scholar] [CrossRef]
  50. Hu, S.; Liang, Y.; Ma, L.; He, Y. MSMOTE: Improving classification performance when training data is imbalanced. In Proceedings of the 2nd International Workshop on Computer Science and Engineering, WCSE 2009, Qingdao, China, 28–30 October 2009; Volume 2, pp. 13–17. [Google Scholar] [CrossRef]
  51. Sáez, J.A.; Luengo, J.; Stefanowski, J.; Herrera, F. SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf. Sci. 2015, 291, 184–203. [Google Scholar] [CrossRef]
  52. Koziarski, M.; Wożniak, M. CCR: A combined cleaning and resampling algorithm for imbalanced data classification. Int. J. Appl. Math. Comput. Sci. 2017, 27, 727–736. [Google Scholar] [CrossRef]
  53. Ramentol, E.; Caballero, Y.; Bello, R.; Herrera, F. SMOTE-RSB *: A hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl. Inf. Syst 2012, 33, 245–265. [Google Scholar] [CrossRef]
  54. Ramentol, E.; Gondres, I.; Lajes, S.; Bello, R.; Caballero, Y.; Cornelis, C.; Herrera, F. Fuzzy-rough imbalanced learning for the diagnosis of High Voltage Circuit Breaker maintenance: The SMOTE-FRST-2T algorithm. Eng. Appl. Artif. Intell. 2016, 48, 134–139. [Google Scholar] [CrossRef]
  55. Khoshgoftaar, T.M.; Rebours, P. Improving software quality prediction by noise filtering techniques. J. Comput. Sci. Technol. 2007, 22, 387–396. [Google Scholar] [CrossRef]
  56. Matloob, F.; Ghazal, T.M.; Taleb, N.; Aftab, S.; Ahmad, M.; Abbas, S.; Khan, M.A.; Soomro, T.R. Software defect prediction using ensemble learning: A systematic literature review. IEEE Access 2021, 9, 98754–98771. [Google Scholar] [CrossRef]
  57. Menzies, T.; Caglayan, B.; Kocaguneli, E.; Krall, J.; Peters, F.; Turhan, B. The Promise Repository of Empirical Software Engineering Data. Available online: http://promise.site.uottawa.ca/SERepository/ (accessed on 31 December 2007).
  58. Cheikhi, L.; Abran, A. PROMISE and ISBSG software engineering data repositories: A survey. In Proceedings of the Joint Conference of the 23rd International Workshop on Software Measurement and the 8th International Conference on Software Process and Product Measurement, IWSM-MENSURA 2013, Ankara, Turkey, 23–26 October 2013; pp. 17–24. [Google Scholar] [CrossRef]
  59. Ghotra, B.; McIntosh, S.; Hassan, A.E. Revisiting the impact of classification techniques on the performance of defect prediction models. In Proceedings of the International Conference on Software Engineering, Florence, Italy, 16–24 May 2015; Volume 1, pp. 789–800. [Google Scholar] [CrossRef]
  60. Kovács, G. Smote-variants: A python implementation of 85 minority oversampling techniques. Neurocomputing 2019, 366, 352–354. [Google Scholar] [CrossRef]
  61. Kyurkchiev, N.; Markov, S. On the Hausdorff distance between the Heaviside step function and Verhulst logistic function. J. Math. Chem. 2016, 54, 109–119. [Google Scholar] [CrossRef]
Figure 1. Framework of US-PONR.
Figure 2. The AUC of prediction results using datasets pre-processed by US-PONR, by US only, or by PONR only.
Table 1. CK metrics.
Name | Description
WMC | Weighted methods per class
DIT | Depth of inheritance tree
NOC | Number of children
CBO | Coupling between object classes
RFC | Response for a class
LCOM | Lack of cohesion in methods
LOC | Lines of code
Table 2. Process metrics.
Name | Description
REVISIONS | Number of revisions of a file.
AUTHORS | Number of distinct authors that checked a file into the repository.
LOC_ADDED | Sum over all revisions of the lines of code added to a file.
MAX_LOC_ADDED | Maximum number of lines of code added for all revisions.
AVE_LOC_ADDED | Average lines of code added per revision.
LOC_DELETED | Sum over all revisions of the lines of code deleted from a file.
MAX_LOC_DELETED | Maximum number of lines of code deleted for all revisions.
AVE_LOC_DELETED | Average lines of code deleted per revision.
CODECHURN | Sum of (added lines of code - deleted lines of code) over all revisions.
MAX_CODECHURN | Maximum CODECHURN for all revisions.
AVE_CODECHURN | Average CODECHURN per revision.
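As an illustration of how the revision-based metrics in Table 2 aggregate per-revision line counts, the following sketch computes them for a single file. It is illustrative only and is not the tooling used in this study; the per-revision added/deleted line counts and author names are assumed to be already extracted from a version-control log.

```python
# Illustrative sketch (not the study's tooling): aggregate the Table 2 process
# metrics from per-revision (lines_added, lines_deleted) pairs for one file.
def process_metrics(revisions, authors):
    """revisions: list of (lines_added, lines_deleted) per revision of a file;
    authors: list of author names, one per revision (hypothetical inputs)."""
    added = [a for a, _ in revisions]
    deleted = [d for _, d in revisions]
    churn = [a - d for a, d in revisions]  # per-revision CODECHURN contribution
    n = len(revisions)
    return {
        "REVISIONS": n,
        "AUTHORS": len(set(authors)),
        "LOC_ADDED": sum(added),
        "MAX_LOC_ADDED": max(added),
        "AVE_LOC_ADDED": sum(added) / n,
        "LOC_DELETED": sum(deleted),
        "MAX_LOC_DELETED": max(deleted),
        "AVE_LOC_DELETED": sum(deleted) / n,
        "CODECHURN": sum(churn),
        "MAX_CODECHURN": max(churn),
        "AVE_CODECHURN": sum(churn) / n,
    }

# Example: three revisions of one file by two authors
print(process_metrics([(120, 10), (35, 60), (5, 0)], ["alice", "bob", "alice"]))
```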
Table 3. Network metrics.
Name | Description
Funcount | The number of internal functions of the class node.
Indegree | The total number of connections from other nodes pointing to it.
Outdegree | The total number of connections it points to other nodes.
Insidelinks | The total number of connections within the internal functions of the node.
Out_degree_centrality | The fraction of nodes its outgoing edges are connected to.
In_degree_centrality | The fraction of nodes its incoming edges are connected to.
Degree_centrality | The fraction of nodes it is connected to.
Closeness_centrality | The reciprocal of the sum of the shortest path distances from v to all other nodes.
Betweenness_centrality | The sum of the fraction of all-pairs shortest paths that pass through node v.
Eccentricity | The maximum distance from v to all other nodes in G.
Communicability_centrality | A broader measure of connectivity, which assumes that information could flow along all possible paths between two nodes.
Katz_centrality | The relative influence of a node within a network.
Load_centrality | The fraction of all shortest paths that pass through that node.
PageRank | A ranking of the nodes in the graph G based on the structure of the incoming links.
Average_neighbor_degree | The average degree of the neighborhood of each node.
Number_of_cliques | The number of maximal cliques containing each node.
Core_number | The largest value k of a k-core containing that node.
Brokerage | The number of pairs of its neighbors that are not directly connected.
EffSize | Effective size of the node's ego network.
Constraint | Measures how strongly a module is constrained by its neighbors.
Hierarchy | Measures how the constraint measure is distributed across neighbors.
TwoStepReach | The percentage of nodes that are two steps away.
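Most of the node-level measures in Table 3 correspond to functions available in the networkx library. The sketch below only illustrates computing a subset of them on a small, hypothetical class-dependency graph; it does not reproduce the metric-extraction pipeline used in this study.

```python
# Illustrative sketch: a subset of the Table 3 network metrics computed with
# networkx on a small directed class-dependency graph (hypothetical input).
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "A"), ("A", "C"), ("D", "A")])

metrics = {
    "Indegree": dict(G.in_degree()),            # edges pointing to the node
    "Outdegree": dict(G.out_degree()),          # edges the node points to
    "In_degree_centrality": nx.in_degree_centrality(G),
    "Out_degree_centrality": nx.out_degree_centrality(G),
    "Degree_centrality": nx.degree_centrality(G),
    "Closeness_centrality": nx.closeness_centrality(G),
    "Betweenness_centrality": nx.betweenness_centrality(G),
    "Katz_centrality": nx.katz_centrality(G),
    "Load_centrality": nx.load_centrality(G),
    "PageRank": nx.pagerank(G),
    "Average_neighbor_degree": nx.average_neighbor_degree(G),
    "Core_number": nx.core_number(G),
}

for name, values in metrics.items():
    print(name, values)
```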
Table 4. ML algorithms.
Name | Description
LiR | Linear regression
RR | Ridge regression
LoR | Logistic regression
LDA | Linear discriminant analysis
QDA | Quadratic discriminant analysis
KR | Kernel ridge
SVC | C-support vector classification
SGDC | Linear classifier (SVM) trained with stochastic gradient descent
KNN | K-nearest neighbors vote classifier
GNB | Gaussian naïve Bayes
DT | Decision tree classifier
RF | Random forest classifier
ET | Extra trees classifier
AB | AdaBoost classifier
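All of the learners in Table 4 are available in scikit-learn. The following sketch shows one way the abbreviations could be mapped to estimators; the default hyperparameters are an assumption, as the table does not specify them.

```python
# Illustrative sketch: the Table 4 learners as scikit-learn estimators
# (default hyperparameters are an assumption; the table does not fix them).
from sklearn.linear_model import LinearRegression, Ridge, LogisticRegression, SGDClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier

models = {
    "LiR": LinearRegression(),
    "RR": Ridge(),
    "LoR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KR": KernelRidge(),
    "SVC": SVC(probability=True),
    "SGDC": SGDClassifier(),          # hinge loss by default, i.e., a linear SVM
    "KNN": KNeighborsClassifier(),
    "GNB": GaussianNB(),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "ET": ExtraTreesClassifier(),
    "AB": AdaBoostClassifier(),
}
```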
Table 5. Datasets.
Project | Versions | Total Instances | Defective Instances | Unbalance Ratio r
ant | 1.3, 1.4, 1.5, 1.6 | 947 | 184 | 4.147
camel | 1.0, 1.2, 1.4, 1.6 | 2784 | 562 | 3.954
poi | 2 | 314 | 37 | 7.486
synapse | 1.0, 1.1, 1.2 | 635 | 162 | 2.920
log4j | 1 | 138 | 34 | 2.971
jedit | 3.2, 4.0, 4.1, 4.2, 4.3 | 1749 | 303 | 4.772
PDE | 1 | 1497 | 209 | 6.163
JDT | 1 | 997 | 206 | 3.840
velocity | 1.6 | 229 | 78 | 1.936
xerces | 1.2, 1.3 | 893 | 140 | 5.379
mylyn | 1 | 1862 | 245 | 6.600
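The unbalance ratio r in Table 5 corresponds to the number of clean (non-defective) instances divided by the number of defective instances, r = (total instances - defective instances) / defective instances; for ant, for example, r = (947 - 184) / 184 ≈ 4.147.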
Table 6. Optimized undersampling ratio parameter (r_us) and the corresponding AUC for US-PONR.
Project | r_us | AUC
ant | 3.8 | 0.8218
camel | 3.9 | 0.8612
JDT | 3.4 | 0.9257
jedit | 4.4 | 0.9381
log4j | 2.2 | 0.8716
mylyn | 5.4 | 0.9116
PDE | 5 | 0.9292
poi | 5.8 | 0.9416
synapse | 2.4 | 0.7811
velocity | 1.9 | 0.7842
xerces | 4.8 | 0.9442
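Table 6 reports, per project, the undersampling ratio that yielded the best AUC. A generic way to tune such a ratio is a grid search over candidate values. The sketch below is not the procedure used by US-PONR; it only illustrates ratio tuning against cross-validated AUC, with random undersampling from imbalanced-learn, a random forest, and the candidate grid all used as hypothetical stand-ins.

```python
# Illustrative sketch (simplified; not the US-PONR procedure): grid-search an
# undersampling ratio r_us by cross-validated AUC on the resampled data.
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def tune_undersampling_ratio(X, y, candidate_ratios=np.arange(1.5, 6.0, 0.1)):
    """Return the clean-to-defective ratio with the best cross-validated AUC.
    Assumes numpy inputs with y coded 0 = clean, 1 = defective."""
    n_defective = int(np.sum(y == 1))
    n_clean = int(np.sum(y == 0))
    best_ratio, best_auc = None, -np.inf
    for r_us in candidate_ratios:
        if r_us > n_clean / n_defective:
            continue  # undersampling can only lower the clean-to-defective ratio
        # sampling_strategy is the minority/majority ratio after resampling,
        # so a clean-to-defective ratio r_us corresponds to 1 / r_us.
        sampler = RandomUnderSampler(sampling_strategy=1.0 / r_us, random_state=0)
        X_res, y_res = sampler.fit_resample(X, y)
        auc = cross_val_score(RandomForestClassifier(random_state=0),
                              X_res, y_res, cv=5, scoring="roc_auc").mean()
        if auc > best_auc:
            best_ratio, best_auc = round(float(r_us), 1), auc
    return best_ratio, best_auc
```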
Table 7. Comparison results of US-PONR and SMOTUNED in AUC and processing time.
Table 7. Comparison results of US-PONR and SMOTUNED in AUC and processing time.
AUCTime
US-PONRSMOTUNEDUS-PONRSMOTUNED
ant0.8218 0.889611.12 s378 s
camel0.8612 0.929037.63 s6364 s
JDT0.92570.8631 9.96 s585 s
jedit0.9381 0.944522.77 s1635 s
log4j0.8716 0.87661.04 s82 s
mylyn0.9116 0.9003 23.25 s1594 s
PDE0.92920.9061 17.52 s3029 s
poi0.94160.9346 2.95 s111 s
synapse0.7811 0.83155.71 s598 s
velocity0.7826 0.83991.64 s150 s
xerces0.94420.9325 9.09 s502 s
Table 8. AUC of each method for different datasets at 10%, 20%, and 30% noise levels.
Noise Level | US-PONR | SOMO | MWMOTE | SMOTE_IPF | SMOTE_RSB | SMOTE_FRST_2T | SMOTE | SMOTE_TomekLinks | Borderline_SMOTE2 | ADASYN | MSMOTE | Gaussian_SMOTE | CCR
ant
10% noise | 0.8676 | 0.8624 | 0.8571 | 0.8630 | 0.8630 | 0.8716 | 0.7355 | 0.7481 | 0.6997 | 0.7201 | 0.7338 | 0.7628 | 0.8355
20% noise | 0.7952 | 0.7012 | 0.7338 | 0.7287 | 0.7015 | 0.7080 | 0.6980 | 0.7290 | 0.6792 | 0.7133 | 0.7116 | 0.7457 | 0.7780
30% noise | 0.7717 | 0.7359 | 0.7461 | 0.7185 | 0.7438 | 0.7314 | 0.7461 | 0.7344 | 0.7008 | 0.6929 | 0.7185 | 0.6945 | 0.7618
camel
10% noise | 0.8522 | 0.8549 | 0.8508 | 0.8470 | 0.8528 | 0.8433 | 0.7653 | 0.7678 | 0.7443 | 0.7730 | 0.7353 | 0.7864 | 0.7523
20% noise | 0.7985 | 0.7354 | 0.7423 | 0.7610 | 0.7459 | 0.7112 | 0.7251 | 0.7678 | 0.7232 | 0.7366 | 0.7398 | 0.7264 | 0.7021
30% noise | 0.7774 | 0.7155 | 0.7366 | 0.7508 | 0.7208 | 0.6633 | 0.6990 | 0.7389 | 0.6849 | 0.7309 | 0.7105 | 0.7105 | 0.6809
JDT
10% noise | 0.8960 | 0.7035 | 0.8271 | 0.7648 | 0.7091 | 0.7596 | 0.7989 | 0.8139 | 0.7406 | 0.8004 | 0.7775 | 0.8766 | 0.8859
20% noise | 0.8166 | 0.6599 | 0.7896 | 0.7251 | 0.7402 | 0.7452 | 0.7993 | 0.7962 | 0.7220 | 0.7685 | 0.7540 | 0.8253 | 0.7100
30% noise | 0.8020 | 0.6447 | 0.7643 | 0.7181 | 0.7091 | 0.7201 | 0.7738 | 0.7858 | 0.7207 | 0.7758 | 0.7410 | 0.8027 | 0.7028
jedit
10% noise | 0.8865 | 0.8073 | 0.8829 | 0.8762 | 0.8364 | 0.8443 | 0.8720 | 0.8771 | 0.8437 | 0.8824 | 0.8308 | 0.9016 | 0.9061
20% noise | 0.8882 | 0.8169 | 0.8856 | 0.8479 | 0.8273 | 0.8173 | 0.8558 | 0.8730 | 0.8525 | 0.8724 | 0.8155 | 0.8720 | 0.8816
30% noise | 0.8804 | 0.8201 | 0.8622 | 0.8383 | 0.7960 | 0.8470 | 0.8350 | 0.8670 | 0.8086 | 0.8375 | 0.7995 | 0.8614 | 0.8589
log4j
10% noise | 0.8684 | 0.7838 | 0.8354 | 0.7949 | 0.7792 | 0.7838 | 0.7532 | 0.8519 | 0.8052 | 0.7586 | 0.7805 | 0.7805 | 0.7887
20% noise | 0.8108 | 0.8101 | 0.8205 | 0.7397 | 0.7632 | 0.7576 | 0.7342 | 0.8000 | 0.7442 | 0.7750 | 0.7317 | 0.7436 | 0.7619
30% noise | 0.7945 | 0.7619 | 0.7805 | 0.7536 | 0.7654 | 0.7805 | 0.7686 | 0.7632 | 0.7541 | 0.7394 | 0.7541 | 0.7541 | 0.7448
mylyn
10% noise | 0.8538 | 0.7403 | 0.7889 | 0.7314 | 0.6851 | 0.6731 | 0.8018 | 0.8067 | 0.7526 | 0.7632 | 0.7287 | 0.8688 | 0.8848
20% noise | 0.7542 | 0.6035 | 0.6871 | 0.6870 | 0.5676 | 0.6823 | 0.7504 | 0.7795 | 0.7277 | 0.7651 | 0.6651 | 0.8293 | 0.8472
30% noise | 0.7508 | 0.5664 | 0.7027 | 0.6454 | 0.5589 | 0.6546 | 0.6588 | 0.6254 | 0.6196 | 0.7123 | 0.6639 | 0.7699 | 0.8211
PDE
10% noise | 0.7800 | 0.7001 | 0.8040 | 0.7598 | 0.7373 | 0.6843 | 0.8148 | 0.8099 | 0.7282 | 0.7693 | 0.7168 | 0.8900 | 0.8862
20% noise | 0.7438 | 0.7211 | 0.7776 | 0.7251 | 0.7336 | 0.6602 | 0.7696 | 0.8044 | 0.7509 | 0.7891 | 0.7171 | 0.8529 | 0.8416
30% noise | 0.7282 | 0.7202 | 0.7522 | 0.7052 | 0.7144 | 0.6699 | 0.7642 | 0.7681 | 0.7402 | 0.7773 | 0.6932 | 0.7798 | 0.8035
poi
10% noise | 0.7259 | 0.7087 | 0.6812 | 0.6777 | 0.6557 | 0.7867 | 0.7717 | 0.7626 | 0.7134 | 0.6479 | 0.6667 | 0.7671 | 0.8056
20% noise | 0.6612 | 0.6825 | 0.7107 | 0.6614 | 0.5912 | 0.6897 | 0.7813 | 0.7143 | 0.7273 | 0.7412 | 0.6013 | 0.7432 | 0.7285
30% noise | 0.6947 | 0.6914 | 0.5882 | 0.6301 | 0.6739 | 0.6582 | 0.7647 | 0.6747 | 0.6452 | 0.6905 | 0.6897 | 0.6667 | 0.6410
synapse
10% noise | 0.8037 | 0.5630 | 0.7304 | 0.7009 | 0.6281 | 0.7139 | 0.7295 | 0.7358 | 0.7176 | 0.7586 | 0.7107 | 0.7715 | 0.7988
20% noise | 0.7866 | 0.6379 | 0.7361 | 0.6988 | 0.6490 | 0.6910 | 0.7330 | 0.6845 | 0.7077 | 0.7431 | 0.6833 | 0.7539 | 0.7818
30% noise | 0.7510 | 0.6140 | 0.7190 | 0.7335 | 0.6199 | 0.7224 | 0.6846 | 0.6792 | 0.6585 | 0.6732 | 0.6815 | 0.6254 | 0.6649
velocity
10% noise | 0.8496 | 0.7757 | 0.8333 | 0.7724 | 0.7628 | 0.7439 | 0.8171 | 0.7987 | 0.7642 | 0.7561 | 0.7561 | 0.8272 | 0.8067
20% noise | 0.8190 | 0.7686 | 0.7716 | 0.7371 | 0.7353 | 0.7672 | 0.7457 | 0.7959 | 0.7371 | 0.7802 | 0.7802 | 0.7489 | 0.7974
30% noise | 0.8163 | 0.7600 | 0.7908 | 0.7398 | 0.7245 | 0.7806 | 0.7398 | 0.7911 | 0.7500 | 0.7194 | 0.7653 | 0.7199 | 0.7449
xerces
10% noise | 0.8616 | 0.7928 | 0.8544 | 0.8087 | 0.7309 | 0.7814 | 0.8477 | 0.8542 | 0.8108 | 0.8298 | 0.8033 | 0.8793 | 0.8734
20% noise | 0.8535 | 0.7667 | 0.8242 | 0.7849 | 0.7330 | 0.7790 | 0.8316 | 0.8398 | 0.7993 | 0.8379 | 0.7824 | 0.8711 | 0.8700
30% noise | 0.8448 | 0.6484 | 0.7942 | 0.7928 | 0.7225 | 0.7647 | 0.8321 | 0.8318 | 0.7880 | 0.8100 | 0.7818 | 0.8303 | 0.8441
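Table 8 rests on an evaluation in which label noise is injected into the data at a fixed rate before rebalancing and classification. The sketch below illustrates one such evaluation loop; SMOTE from imbalanced-learn and a random forest are stand-ins, and the split, classifiers, and noise-injection details of the actual experiments are not reproduced here.

```python
# Illustrative sketch (assumed stand-ins: imbalanced-learn's SMOTE, a random
# forest): inject label noise at a given rate, rebalance the training split,
# and score AUC on the held-out split.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_under_label_noise(X, y, noise_level=0.10, seed=0):
    """Assumes numpy arrays with labels coded in {0, 1}."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    flip = rng.choice(len(y), size=int(noise_level * len(y)), replace=False)
    y_noisy[flip] = 1 - y_noisy[flip]      # flip labels of the selected samples

    # For simplicity, noise is injected before the train/test split.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y_noisy, test_size=0.3, stratify=y_noisy, random_state=seed)
    X_bal, y_bal = SMOTE(random_state=seed).fit_resample(X_tr, y_tr)

    clf = RandomForestClassifier(random_state=seed).fit(X_bal, y_bal)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```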
Table 9. Residual noise sample ratio of each method for different datasets at 10%, 20%, and 30% noise levels.
Noise Level | US-PONR | SOMO | MWMOTE | SMOTE_IPF | SMOTE_RSB | SMOTE_FRST_2T | SMOTE | SMOTE_TomekLinks | Borderline_SMOTE2 | ADASYN | MSMOTE | Gaussian_SMOTE | CCR
ant
10% noise | 0.0177 | 0.0995 | 0.0675 | 0.0670 | 0.0867 | 0.0662 | 0.0644 | 0.0600 | 0.0643 | 0.0644 | 0.0647 | 0.0643 | 0.0672
20% noise | 0.0435 | 0.1996 | 0.1479 | 0.1457 | 0.1679 | 0.1448 | 0.1417 | 0.1340 | 0.1415 | 0.1415 | 0.1408 | 0.1413 | 0.1530
30% noise | 0.0926 | 0.2844 | 0.2416 | 0.2407 | 0.2639 | 0.2405 | 0.2355 | 0.2180 | 0.2355 | 0.2339 | 0.2347 | 0.2354 | 0.2193
camel
10% noise | 0.0185 | 0.0996 | 0.0677 | 0.0650 | 0.0884 | 0.0685 | 0.0682 | 0.0597 | 0.0678 | 0.0682 | 0.0683 | 0.0675 | 0.0637
20% noise | 0.0496 | 0.1987 | 0.1470 | 0.1439 | 0.1894 | 0.1510 | 0.1480 | 0.1318 | 0.1476 | 0.1480 | 0.1499 | 0.1468 | 0.1468
30% noise | 0.1002 | 0.2998 | 0.2420 | 0.2390 | 0.2538 | 0.2512 | 0.2432 | 0.2222 | 0.2426 | 0.2406 | 0.2447 | 0.2418 | 0.2132
JDT
10% noise | 0.0181 | 0.0939 | 0.0682 | 0.0678 | 0.0985 | 0.0671 | 0.0682 | 0.0614 | 0.0677 | 0.0683 | 0.0682 | 0.0678 | 0.0650
20% noise | 0.0472 | 0.1890 | 0.1483 | 0.1475 | 0.1986 | 0.1475 | 0.1220 | 0.1317 | 0.1475 | 0.1478 | 0.1479 | 0.1481 | 0.1510
30% noise | 0.0929 | 0.2851 | 0.2435 | 0.2419 | 0.2982 | 0.2416 | 0.2431 | 0.2249 | 0.2425 | 0.2422 | 0.2422 | 0.2435 | 0.2170
jedit
10% noise | 0.0156 | 0.0998 | 0.0657 | 0.0653 | 0.0993 | 0.0665 | 0.0658 | 0.0580 | 0.0658 | 0.0658 | 0.0657 | 0.0659 | 0.0674
20% noise | 0.0399 | 0.1997 | 0.1434 | 0.1434 | 0.1773 | 0.1445 | 0.1438 | 0.1312 | 0.1440 | 0.1441 | 0.1452 | 0.1433 | 0.1531
30% noise | 0.0872 | 0.3000 | 0.2381 | 0.2369 | 0.2869 | 0.2396 | 0.2371 | 0.2164 | 0.2380 | 0.2375 | 0.2388 | 0.2378 | 0.2188
log4j
10% noise | 0.0198 | 0.0873 | 0.0750 | 0.0725 | 0.0879 | 0.0689 | 0.0737 | 0.0654 | 0.0729 | 0.0747 | 0.0703 | 0.0680 | 0.0829
20% noise | 0.0435 | 0.1996 | 0.1479 | 0.1457 | 0.1679 | 0.1448 | 0.1417 | 0.1340 | 0.1415 | 0.1415 | 0.1408 | 0.1413 | 0.1530
30% noise | 0.0926 | 0.2844 | 0.2416 | 0.2407 | 0.2639 | 0.2405 | 0.2355 | 0.2180 | 0.2355 | 0.2339 | 0.2347 | 0.2354 | 0.2193
mylyn
10% noise | 0.0119 | 0.0776 | 0.0634 | 0.0617 | 0.0995 | 0.0694 | 0.0630 | 0.0596 | 0.0633 | 0.0628 | 0.0631 | 0.0629 | 0.0612
20% noise | 0.0343 | 0.1632 | 0.1386 | 0.1369 | 0.1997 | 0.1531 | 0.1388 | 0.1289 | 0.1385 | 0.1384 | 0.1389 | 0.1383 | 0.1277
30% noise | 0.0782 | 0.2722 | 0.2317 | 0.2295 | 0.2988 | 0.2571 | 0.2326 | 0.2159 | 0.2321 | 0.2311 | 0.2329 | 0.2318 | 0.2205
PDE
10% noise | 0.0109 | 0.0999 | 0.0642 | 0.0626 | 0.0993 | 0.0624 | 0.0635 | 0.0592 | 0.0634 | 0.0633 | 0.0636 | 0.0634 | 0.0612
20% noise | 0.0331 | 0.1953 | 0.1403 | 0.1384 | 0.1989 | 0.1395 | 0.1398 | 0.1330 | 0.1395 | 0.1398 | 0.1398 | 0.1396 | 0.1387
30% noise | 0.0753 | 0.2956 | 0.2339 | 0.2312 | 0.2989 | 0.2339 | 0.2334 | 0.2216 | 0.2332 | 0.2326 | 0.2330 | 0.2326 | 0.2208
poi
10% noise | 0.0084 | 0.0963 | 0.0681 | 0.0572 | 0.0920 | 0.0639 | 0.0630 | 0.0551 | 0.0690 | 0.0632 | 0.0646 | 0.0595 | 0.0608
20% noise | 0.0284 | 0.1927 | 0.1414 | 0.1338 | 0.1921 | 0.1457 | 0.1385 | 0.1219 | 0.1432 | 0.1424 | 0.1379 | 0.1374 | 0.1295
30% noise | 0.0718 | 0.2898 | 0.2295 | 0.2167 | 0.2895 | 0.2492 | 0.2359 | 0.2143 | 0.2242 | 0.2306 | 0.2363 | 0.2329 | 0.2192
synapse
10% noise | 0.0263 | 0.0988 | 0.0715 | 0.0716 | 0.0938 | 0.0704 | 0.0722 | 0.0556 | 0.0718 | 0.0714 | 0.0727 | 0.0721 | 0.0756
20% noise | 0.0614 | 0.1954 | 0.1551 | 0.1526 | 0.1715 | 0.1539 | 0.1548 | 0.1375 | 0.1539 | 0.1547 | 0.1550 | 0.1540 | 0.1473
30% noise | 0.1148 | 0.2930 | 0.2505 | 0.2497 | 0.2704 | 0.2506 | 0.2493 | 0.2222 | 0.2503 | 0.2507 | 0.2498 | 0.2502 | 0.2838
velocity
10% noise | 0.0365 | 0.0964 | 0.0790 | 0.0791 | 0.0948 | 0.0784 | 0.0775 | 0.0662 | 0.0790 | 0.0791 | 0.0794 | 0.0775 | 0.0695
20% noise | 0.0885 | 0.1891 | 0.1685 | 0.1614 | 0.1927 | 0.1693 | 0.1694 | 0.1465 | 0.1669 | 0.1719 | 0.1642 | 0.1642 | 0.1925
30% noise | 0.1460 | 0.2933 | 0.2649 | 0.2595 | 0.2915 | 0.2743 | 0.2669 | 0.2470 | 0.2645 | 0.2665 | 0.2665 | 0.2627 | 0.3000
xerces
10% noise | 0.0134 | 0.0818 | 0.0649 | 0.0641 | 0.0924 | 0.0718 | 0.0643 | 0.0435 | 0.0645 | 0.0640 | 0.0647 | 0.0648 | 0.0634
20% noise | 0.0367 | 0.1670 | 0.1437 | 0.1409 | 0.1878 | 0.1641 | 0.1419 | 0.1062 | 0.1418 | 0.1420 | 0.1418 | 0.1417 | 0.1412
30% noise | 0.0718 | 0.2553 | 0.2357 | 0.2340 | 0.2724 | 0.2654 | 0.2349 | 0.1903 | 0.2356 | 0.2354 | 0.2352 | 0.2355 | 0.2222
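The residual noise sample ratio in Table 9 can be read as the share of injected noisy samples that survive pre-processing, relative to the size of the processed dataset. The sketch below shows one plausible way to measure it; the preprocess interface (returning the indices of retained original samples, with synthetic samples marked as -1) is hypothetical, and the exact accounting used in the study may differ.

```python
# Illustrative sketch (one plausible reading of the metric, not the study's
# exact accounting): flip labels at known indices, run a pre-processing method,
# then count how many flipped samples remain in the processed dataset.
# `preprocess` is a hypothetical callable returning (X_out, y_out, kept_indices),
# where kept_indices maps retained original samples to their original positions
# and synthetic samples are marked with -1.
import numpy as np

def residual_noise_ratio(preprocess, X, y, noise_level=0.10, seed=0):
    """Assumes numpy arrays with labels coded in {0, 1}."""
    rng = np.random.default_rng(seed)
    noisy_idx = rng.choice(len(y), size=int(noise_level * len(y)), replace=False)
    y_noisy = y.copy()
    y_noisy[noisy_idx] = 1 - y_noisy[noisy_idx]   # inject label noise

    X_out, y_out, kept_indices = preprocess(X, y_noisy)
    surviving_noise = np.intersect1d(kept_indices, noisy_idx).size
    return surviving_noise / len(y_out)
```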
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
