Class Imbalance Reduction (CIR): A Novel Approach to Software Defect Prediction in the Presence of Class Imbalance
Abstract
1. Introduction
2. Background
2.1. AdaBoost
- AdaBoost begins by randomly selecting a subset of the training data.
- It then trains the chosen machine-learning model iteratively, re-selecting the training data according to how accurately the previous round predicted it.
- It assigns weights to the samples such that wrongly classified samples receive higher weights than correctly classified ones; as a result, the wrongly classified samples have the highest probability of being selected in the next iteration.
- In every iteration, the algorithm also assigns a weight to the classifier itself based on its accuracy, so that the most accurate classifiers carry the most weight.
- The process terminates when all training data are classified correctly or when the specified maximum number of estimators is reached.
- Finally, AdaBoost predicts by a weighted "vote" among all of the weak learners it has built (a minimal usage sketch follows this list).
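
The following is a minimal sketch of boosting on an artificially imbalanced dataset with scikit-learn's AdaBoostClassifier; the synthetic dataset, parameter values, and variable names are illustrative only and are not taken from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced defect dataset (about 10% "defective").
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 50 weak learners; each round re-weights misclassified samples as described above.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
print("Test accuracy:", clf.score(X_te, y_te))
```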
2.2. Decision Tree
2.3. Extra Tree
2.4. Gradient Boosting
2.5. KNN
2.6. Logistic Regression
2.7. Naïve Bayes Classifier
2.8. Random Forest
3. Related Work
4. Proposed Method
Algorithm for Class Imbalance Reduction (CIR)
Algorithm 1: Class Imbalance Reduction (CIR)

Input: Imbalanced dataset (DSi) with attributes X1, X2, X3, …, Xm, which represent the features (software metrics) together with the class label, and records r1, r2, r3, …, rn
Output: Balanced dataset (BD) with symmetry between the numbers of defect and non-defect records

Step-1: Divide DSi into two groups based on the class-label value, representing the defect and non-defect classes.
Step-2: Denote the class containing fewer records as the minority class (Do).
Step-3: Denote the class containing more records as the majority class (Dj).
Step-4: Calculate the centroid (C) of Do as C = {mean(X1), mean(X2), mean(X3), …, mean(Xm)}.
Step-5: For each record ri in Do:
  Step-5.1: Calculate the distance dist(ri, C) using the Euclidean distance.
Step-6: Sort the records of Do in increasing order of their distances.
Step-7: Choose the record with the minimum distance (Dmin).
Step-8: Generate n random numbers k1, k2, …, kn between 0 and 1, where n = |Dj| − |Do|, so that symmetry is created between the numbers of defect and non-defect records.
  Step-8.1: For each random number kj:
    Step-8.1.1: Generate a new record as Dmin + kj * C.
    Step-8.1.2: Append the new record to Do.

A Python sketch of these steps is given after the listing.
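
As an implementation aid, here is a minimal Python sketch of Algorithm 1 under the steps listed above; the function name cir_oversample, the NumPy representation, and the fixed random seed are our own illustrative choices, not part of the published method.

```python
import numpy as np

def cir_oversample(X, y, seed=0):
    """Class Imbalance Reduction (CIR) sketch following Algorithm 1.

    X: 2-D array of software metrics; y: binary class labels.
    Returns a balanced dataset with synthetic minority records appended.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    labels, counts = np.unique(y, return_counts=True)        # Steps 1-3
    minority = labels[np.argmin(counts)]
    n_new = counts.max() - counts.min()                      # n = |Dj| - |Do|

    D_o = X[y == minority]
    centroid = D_o.mean(axis=0)                              # Step 4
    dists = np.linalg.norm(D_o - centroid, axis=1)           # Step 5 (Euclidean)
    d_min = D_o[np.argmin(dists)]                            # Steps 6-7

    k = np.random.default_rng(seed).random(n_new)            # Step 8: k_j in (0, 1)
    synthetic = d_min + k[:, None] * centroid                # Step 8.1.1
    X_bal = np.vstack([X, synthetic])                        # Step 8.1.2
    y_bal = np.concatenate([y, np.full(n_new, minority)])
    return X_bal, y_bal
```

A call such as X_bal, y_bal = cir_oversample(X_train, y_train) would then feed the balanced data to any of the classifiers described in Section 2.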
5. Experimentation and Results
5.1. Performance Measures
5.2. Statistical Significance
Post hoc Analysis
6. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References

Dataset | Non-Defect Records | Defect Records | Minority-to-Majority Ratio (%) |
---|---|---|---|
Arc | 207 | 27 | 13.04 |
camel-1.0 | 326 | 13 | 3.99 |
camel-1.2 | 392 | 216 | 55.10 |
camel-1.4 | 727 | 145 | 19.94 |
camel-1.6 | 777 | 188 | 24.20 |
ivy-1.1 | 48 | 63 | 76.19 |
ivy-1.4 | 225 | 16 | 7.11 |
ivy-2.0 | 312 | 40 | 12.82 |
ant-1.7 | 579 | 166 | 28.67 |
jedit-3.2 | 182 | 90 | 49.45 |
jedit-4.0 | 231 | 75 | 32.47 |
jedit-4.1 | 233 | 79 | 33.91 |
jedit-4.2 | 319 | 48 | 15.05 |
jedit-4.3 | 481 | 11 | 2.29 |
log4j-1.0 | 101 | 34 | 33.66 |
log4j-1.1 | 72 | 37 | 51.39 |
log4j-1.2 | 16 | 189 | 8.47 |
lucene-2.0 | 104 | 91 | 87.50 |
lucene-2.2 | 103 | 144 | 71.53 |
lucene-2.4 | 137 | 203 | 67.49 |
poi-1.5 | 96 | 141 | 68.09 |
poi-2.0 | 277 | 37 | 13.36 |
poi-2.5 | 137 | 248 | 55.24 |
poi-3.0 | 161 | 281 | 57.30 |
Redactor | 149 | 27 | 18.12 |
synapse-1.0 | 141 | 16 | 11.35 |
synapse-1.1 | 162 | 60 | 37.04 |
synapse-1.2 | 170 | 86 | 50.59 |
Tomcat | 781 | 77 | 9.86 |
velocity-1.4 | 49 | 147 | 33.33 |
velocity-1.5 | 72 | 142 | 50.70 |
velocity-1.6 | 151 | 78 | 51.66 |
xerces-1.2 | 369 | 71 | 19.24 |
xerces-1.3 | 384 | 69 | 17.97 |
xerces-1.4 | 151 | 437 | 34.55 |
xerces-init | 85 | 77 | 90.59 |
xalan-2.4 | 613 | 110 | 17.94 |
xalan-2.5 | 416 | 387 | 93.03 |
xalan-2.6 | 474 | 411 | 86.71 |
xalan-2.7 | 11 | 898 | 1.22 |

Metric | Description |
---|---|
wmc | No. of methods in a class |
dit | Depth of inheritance levels from top |
noc | No. of immediate subclasses of the class |
cbo | Coupling between the objects of different classes |
rfc | No. of methods executed when an object receives a message |
lcom | No. of pairs of methods in a class that are not related through the sharing of any class fields |
ca | No. of other classes which depend on the class |
ce | No. of other classes on which the class depends |
npm | No. of public methods in a class |
lcom3 | Lack-of-cohesion variant computed from the no. of methods and attributes of the class and the methods' accesses to those attributes |
loc | Lines of code |
dam | Ratio of the no. of private and protected attributes to the total number of attributes |
moa | Extent of the part-whole relationship, realized by using attributes |
mfa | Ratio of the no. of methods inherited by a class to the no. of methods accessible by member methods of the class |
cam | Relatedness among methods of a class based upon the parameter list of the methods |
ic | No. of parent classes to which a given class is coupled |
cbm | No. of new and redefined methods to which all the inherited methods are coupled |
amc | Average method size for each class |
max_cc | Max value for Cyclomatic Complexity metric |
avg_cc | Average value for Cyclomatic Complexity metric |

 | Predicted Positive | Predicted Negative |
---|---|---|
Actual Positive | True Positive (TP) | False Negative (FN) |
Actual Negative | False Positive (FP) | True Negative (TN) |
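
For reference, these are the standard definitions of the measures reported in Section 5.1, expressed as a small Python helper computed from the confusion-matrix counts above; the function name and argument order are illustrative.

```python
import math

def defect_prediction_measures(tp, fn, fp, tn):
    """Standard measures computed from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)                   # also called sensitivity
    f_measure   = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    g_mean      = math.sqrt(recall * specificity)  # geometric mean
    return accuracy, precision, recall, f_measure, specificity, g_mean
```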

Classifier | Accuracy (SMOTE) | Accuracy (K-Means SMOTE) | Accuracy (CIR) | Precision (SMOTE) | Precision (K-Means SMOTE) | Precision (CIR) |
---|---|---|---|---|---|---|
AdaBoost | 0.81 ± 0.10 | 0.83 ± 0.10 | 0.85 ± 0.09 | 0.80 ± 0.11 | 0.82 ± 0.10 | 0.86 ± 0.10 |
Decision Tree | 0.80 ± 0.09 | 0.82 ± 0.09 | 0.83 ± 0.09 | 0.79 ± 0.10 | 0.82 ± 0.10 | 0.83 ± 0.09 |
Extra Tree | 0.78 ± 0.10 | 0.80 ± 0.11 | 0.83 ± 0.09 | 0.78 ± 0.09 | 0.80 ± 0.11 | 0.82 ± 0.10 |
Gradient Boost | 0.82 ± 0.10 | 0.84 ± 0.10 | 0.85 ± 0.09 | 0.82 ± 0.10 | 0.84 ± 0.10 | 0.85 ± 0.10 |
K-Nearest Neighbors | 0.74 ± 0.08 | 0.77 ± 0.09 | 0.83 ± 0.10 | 0.73 ± 0.08 | 0.78 ± 0.12 | 0.85 ± 0.09 |
Logistic Regression | 0.76 ± 0.09 | 0.79 ± 0.10 | 0.85 ± 0.10 | 0.77 ± 0.10 | 0.81 ± 0.12 | 0.87 ± 0.10 |
Naïve Bayes | 0.69 ± 0.09 | 0.74 ± 0.11 | 0.82 ± 0.13 | 0.77 ± 0.10 | 0.81 ± 0.11 | 0.88 ± 0.10 |
Random Forest | 0.85 ± 0.10 | 0.86 ± 0.10 | 0.88 ± 0.07 | 0.84 ± 0.10 | 0.87 ± 0.90 | 0.90 ± 0.08 |

Classifier | Recall (SMOTE) | Recall (K-Means SMOTE) | Recall (CIR) | F-Measure (SMOTE) | F-Measure (K-Means SMOTE) | F-Measure (CIR) |
---|---|---|---|---|---|---|
AdaBoost | 0.83 ± 0.10 | 0.83 ± 0.12 | 0.83 ± 0.10 | 0.81 ± 0.10 | 0.82 ± 0.10 | 0.85 ± 0.09 |
Decision Tree | 0.82 ± 0.12 | 0.82 ± 0.11 | 0.83 ± 0.10 | 0.80 ± 0.10 | 0.82 ± 0.10 | 0.83 ± 0.09 |
Extra Tree | 0.79 ± 0.13 | 0.80 ± 0.12 | 0.84 ± 0.09 | 0.78 ± 0.11 | 0.80 ± 0.11 | 0.83 ± 0.09 |
Gradient Boost | 0.84 ± 0.12 | 0.83 ± 0.13 | 0.85 ± 0.09 | 0.83 ± 0.10 | 0.83 ± 0.11 | 0.85 ± 0.09 |
K-Nearest Neighbors | 0.78 ± 0.13 | 0.77 ± 0.14 | 0.80 ± 0.13 | 0.75 ± 0.09 | 0.77 ± 0.10 | 0.82 ± 0.10 |
Logistic Regression | 0.74 ± 0.11 | 0.77 ± 0.12 | 0.82 ± 0.12 | 0.75 ± 0.10 | 0.79 ± 0.11 | 0.84 ± 0.10 |
Naïve Bayes | 0.55 ± 0.21 | 0.62 ± 0.21 | 0.73 ± 0.23 | 0.62 ± 0.14 | 0.69 ± 0.16 | 0.78 ± 0.19 |
Random Forest | 0.86 ± 0.12 | 0.84 ± 0.12 | 0.85 ± 0.09 | 0.85 ± 0.10 | 0.85 ± 0.10 | 0.88 ± 0.08 |

Classifier | Specificity (SMOTE) | Specificity (K-Means SMOTE) | Specificity (CIR) | Geometric Mean (SMOTE) | Geometric Mean (K-Means SMOTE) | Geometric Mean (CIR) |
---|---|---|---|---|---|---|
AdaBoost | 0.79 ± 0.11 | 0.82 ± 0.11 | 0.87 ± 0.11 | 0.81 ± 0.10 | 0.83 ± 0.10 | 0.85 ± 0.09 |
Decision Tree | 0.79 ± 0.10 | 0.82 ± 0.10 | 0.83 ± 0.10 | 0.80 ± 0.10 | 0.82 ± 0.10 | 0.83 ± 0.09 |
Extra Tree | 0.78 ± 0.10 | 0.80 ± 0.12 | 0.82 ± 0.10 | 0.78 ± 0.11 | 0.80 ± 0.11 | 0.83 ± 0.09 |
Gradient Boost | 0.82 ± 0.10 | 0.85 ± 0.09 | 0.85 ± 0.11 | 0.83 ± 0.10 | 0.83 ± 0.11 | 0.85 ± 0.09 |
K-Nearest Neighbors | 0.71 ± 0.10 | 0.78 ± 0.13 | 0.86 ± 0.08 | 0.75 ± 0.09 | 0.77 ± 0.10 | 0.83 ± 0.10 |
Logistic Regression | 0.79 ± 0.10 | 0.81 ± 0.11 | 0.88 ± 0.09 | 0.75 ± 0.10 | 0.79 ± 0.11 | 0.85 ± 0.13 |
Naïve Bayes | 0.83 ± 0.13 | 0.85 ± 0.13 | 0.90 ± 0.11 | 0.64 ± 0.13 | 0.70 ± 0.15 | 0.79 ± 0.17 |
Random Forest | 0.83 ± 0.10 | 0.87 ± 0.09 | 0.90 ± 0.08 | 0.85 ± 0.10 | 0.85 ± 0.10 | 0.88 ± 0.07 |
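
For readers who want to reproduce the two baselines compared against CIR in the tables above, the following is a minimal sketch assuming the imbalanced-learn library; the synthetic dataset and parameter values are illustrative, and the paper does not specify this particular toolkit.

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, KMeansSMOTE

# Synthetic stand-in for one software-metric dataset (about 13% defective).
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.87, 0.13], random_state=0)

X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
# K-Means SMOTE clusters the data before oversampling; on small or extremely
# imbalanced datasets the cluster balance threshold may need tuning.
X_km, y_km = KMeansSMOTE(random_state=0,
                         cluster_balance_threshold=0.1).fit_resample(X, y)
```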

Measures | AdaBoost X | AdaBoost Y | AdaBoost % | Decision Tree X | Decision Tree Y | Decision Tree % | Extra Tree X | Extra Tree Y | Extra Tree % | Gradient Boost X | Gradient Boost Y | Gradient Boost % | KNN X | KNN Y | KNN % | Logistic Regression X | Logistic Regression Y | Logistic Regression % | Naïve Bayes X | Naïve Bayes Y | Naïve Bayes % | Random Forest X | Random Forest Y | Random Forest % |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Accuracy | 27 | 6 | 82.5 | 32 | 3 | 87.5 | 32 | 2 | 85 | 25 | 2 | 67.5 | 36 | 1 | 92.5 | 35 | 1 | 90 | 34 | 1 | 87.5 | 30 | 3 | 82.5 | |
Precision | 30 | 3 | 82.5 | 31 | 3 | 85 | 29 | 1 | 75 | 21 | 2 | 57.5 | 35 | 1 | 90 | 30 | 4 | 85 | 35 | 0 | 87.5 | 33 | 2 | 87.5 | |
Recall | 18 | 1 | 47.5 | 22 | 2 | 60 | 29 | 2 | 77.5 | 17 | 2 | 47.5 | 20 | 1 | 52.5 | 33 | 2 | 87.5 | 29 | 1 | 75 | 16 | 1 | 42.5 | |
F-Measure | 29 | 3 | 80 | 19 | 5 | 60 | 30 | 3 | 82.5 | 25 | 2 | 67.5 | 33 | 1 | 85 | 35 | 1 | 90 | 33 | 0 | 82.5 | 25 | 3 | 70 | |
Specificity | 28 | 3 | 77.5 | 28 | 3 | 77.5 | 31 | 0 | 77.5 | 21 | 3 | 60 | 36 | 1 | 92.5 | 31 | 0 | 77.5 | 30 | 3 | 82.5 | 34 | 1 | 87.5 | |
Geometric Mean | 29 | 3 | 80 | 30 | 4 | 85 | 29 | 3 | 80 | 25 | 2 | 67.5 | 33 | 1 | 85 | 35 | 1 | 90 | 33 | 0 | 82.5 | 25 | 3 | 70 |

Measures | AdaBoost X | AdaBoost Y | AdaBoost % | Decision Tree X | Decision Tree Y | Decision Tree % | Extra Tree X | Extra Tree Y | Extra Tree % | Gradient Boost X | Gradient Boost Y | Gradient Boost % | KNN X | KNN Y | KNN % | Logistic Regression X | Logistic Regression Y | Logistic Regression % | Naïve Bayes X | Naïve Bayes Y | Naïve Bayes % | Random Forest X | Random Forest Y | Random Forest % |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Accuracy | 26 | 5 | 77.5 | 21 | 3 | 60 | 27 | 2 | 72.5 | 20 | 5 | 62.5 | 29 | 3 | 80 | 28 | 0 | 70 | 30 | 0 | 75 | 27 | 1 | 70 | |
Precision | 30 | 2 | 80 | 21 | 4 | 62.5 | 20 | 5 | 62.5 | 19 | 3 | 55 | 25 | 3 | 70 | 23 | 1 | 60 | 24 | 1 | 62.5 | 18 | 2 | 50 | |
Recall | 20 | 1 | 52.5 | 20 | 3 | 57.5 | 25 | 4 | 72.5 | 20 | 2 | 55 | 25 | 1 | 65 | 28 | 4 | 80 | 27 | 1 | 70 | 18 | 6 | 60 | |
F-Measure | 25 | 4 | 72.5 | 22 | 4 | 65 | 26 | 4 | 75 | 21 | 5 | 65 | 27 | 3 | 75 | 27 | 1 | 70 | 25 | 3 | 70 | 26 | 1 | 67.5 | |
Specificity | 28 | 3 | 77.5 | 20 | 2 | 55 | 22 | 4 | 65 | 22 | 0 | 55 | 25 | 1 | 65 | 25 | 2 | 67.5 | 25 | 1 | 65 | 26 | 4 | 75 | |
Geometric Mean | 25 | 5 | 75 | 22 | 5 | 67.5 | 26 | 4 | 75 | 21 | 5 | 65 | 27 | 3 | 75 | 27 | 1 | 70 | 25 | 2 | 67.5 | 24 | 3 | 67.5 |

Classifier | Accuracy (SMOTE) | Accuracy (K-Means SMOTE) | Precision (SMOTE) | Precision (K-Means SMOTE) | Recall (SMOTE) | Recall (K-Means SMOTE) | F-Measure (SMOTE) | F-Measure (K-Means SMOTE) | Specificity (SMOTE) | Specificity (K-Means SMOTE) | Geometric Mean (SMOTE) | Geometric Mean (K-Means SMOTE) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
AB | 67.5 | 70 | 75 | 75 | 45 | 52.5 | 72.5 | 65 | 75 | 72.5 | 72.5 | 67.5 |
DT | 80 | 55 | 77.5 | 60 | 55 | 52.5 | 47.5 | 62.5 | 77.5 | 52.5 | 75 | 62.5 |
ET | 80 | 67.5 | 72.5 | 60 | 72.5 | 67.5 | 75 | 72.5 | 72.5 | 57.5 | 72.5 | 72.5 |
GB | 62.5 | 57.5 | 52.5 | 52.5 | 42.5 | 52.5 | 62.5 | 62.5 | 52.5 | 55 | 62.5 | 62.5 |
KNN | 90 | 75 | 87.5 | 67.5 | 50 | 65 | 82.5 | 67.5 | 90 | 65 | 82.5 | 67.5 |
LR | 87.5 | 70 | 75 | 60 | 82.5 | 77.5 | 87.5 | 70 | 75 | 65 | 87.5 | 70 |
NB | 85 | 75 | 87.5 | 62.5 | 72.5 | 67.5 | 82.5 | 67.5 | 87.5 | 65 | 82.5 | 65 |
RF | 75 | 70 | 82.5 | 65 | 40 | 55 | 62.5 | 67.5 | 82.5 | 72.5 | 62.5 | 67.5 |

Classifier | Accuracy F-Value | Accuracy p-Value | Precision F-Value | Precision p-Value | Recall F-Value | Recall p-Value | F-Measure F-Value | F-Measure p-Value | Specificity F-Value | Specificity p-Value | Geometric Mean F-Value | Geometric Mean p-Value |
---|---|---|---|---|---|---|---|---|---|---|---|---|
AB | 1.94 | 0.14 | 4.03 | 0.02 * | 3.03 | 0.048 * | 1.31 | 0.35 | 3.66 | 0.045 * | 1.7 | 0.19 |
DT | 1.28 | 0.43 | 1.77 | 0.17 | 2.12 | 0.12 | 1.62 | 0.49 | 1.63 | 0.17 | 1.03 | 0.49 |
ET | 2.16 | 0.12 | 2.34 | 0.1 | 2.16 | 0.11 | 2.34 | 0.1 | 2.45 | 0.1 | 0.04 * | 0.05 |
GB | 2 | 0.14 | 1.017 | 0.36 | 1.27 | 0.35 | 1.64 | 0.39 | 1.44 | 0.36 | 1.372 | 0.59 |
KNN | 9.949 | 0.000 * | 16.165 | 0.000 * | 15.42 | 0.000 * | 11.52 | 0.000 * | 14.56 | 0.000 * | 7.291 | 0.001 * |
LR | 7.7111 | 0.000 * | 9.467 | 0.000 * | 8.465 | 0.000 * | 10.25 | 0.000 * | 8.46 | 0.000 * | 9.526 | 0.000 * |
NB | 13.449 | 0.000 * | 11.510 | 0.000 * | 11.510 | 0.000 * | 12.630 | 0.000 * | 10.250 | 0.000 * | 16.268 | 0.000 * |
RF | 1.229 | 0.229 | 5.127 | 0.000 * | 1.12 | 0.18 | 6.235 | 0.000 * | 6.98 | 0.000 * | 1.75 | 0.54 |
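
As context for the F-values and p-values above, this is a hedged sketch of a one-way ANOVA across the three balancing methods using SciPy; the score arrays are random placeholders rather than the study's actual per-dataset results.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Placeholder per-dataset scores (e.g., accuracy on each of the 40 datasets)
# for one classifier under each balancing method -- NOT the study's real values.
smote_scores  = rng.normal(0.76, 0.09, size=40)
kmeans_scores = rng.normal(0.79, 0.10, size=40)
cir_scores    = rng.normal(0.85, 0.10, size=40)

f_value, p_value = f_oneway(smote_scores, kmeans_scores, cir_scores)
print(f"F = {f_value:.3f}, p = {p_value:.3f}")   # significant when p < 0.05
```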