A Partition-Based Hybrid Algorithm for Effective Imbalanced Classification
Abstract
1. Introduction
- Minority instances overlapping with the majority class,
- Minority instances in distinct regions,
- Majority instances overlapping with the minority class,
- Majority instances in distinct regions.
2. Related Works
3. Motivation
4. Designed Algorithms
- Set 0: Original (all parts combined),
- Set 1: Minority Overlap vs. Majority Non-Overlap,
- Set 2: Majority Overlap vs. Minority Non-Overlap,
- Set 3: Minority Overlap vs. Majority Overlap,
- Set 4: Minority Non-Overlap vs. Majority Non-Overlap.
4.1. Data Characterization
- Maj_non: Majority class data points that do not overlap with the minority class.
- Min_non: Minority class data points that do not overlap with the majority class.
- Maj_ov: Majority class data points that overlap with the minority class.
- Min_ov: Minority class data points that overlap with the majority class.
Algorithm 1: Data characterization
Input: The original dataset.
Pseudo code:
1. Separate the dataset by class into two sets: Maj (majority) and Min (minority).
2. Calculate the radius and minimum neighbors for minority class instances: r ← radius calculation (Algorithm 2); k ← minimum neighbor calculation (Algorithm 3).
3. Execute the data-typing function (Algorithm 4) to label overlapping instances.
4. Return the updated feature matrices and overlapping instances.
Return: Maj_non (non-overlapping instances of the majority class), Min_non (non-overlapping instances of the minority class), Maj_ov (overlapping instances of the majority class), Min_ov (overlapping instances of the minority class).
Algorithm 2: Radius calculation
Input: Min (minority class instances), Maj (majority class instances).
Pseudo code:
1. Compute all pairwise distances between instances of the minority and majority classes using a distance metric (e.g., Euclidean distance): D ← dist(Min, Maj).
2. For each minority instance, calculate the percentile distance (e.g., the 75th percentile) over its row of D and take it as that instance's radius (r).
Return: Radius values (r) for the minority class instances.
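The percentile step maps directly onto array operations. The following is a minimal sketch of Algorithm 2, assuming Euclidean distance and an illustrative 75th percentile; the function name and defaults are ours, not the paper's:

```python
import numpy as np
from scipy.spatial.distance import cdist

def radius_calculation(X_min, X_maj, q=75):
    """Per-instance radius for the minority class: the q-th percentile
    of each minority instance's distances to the majority class."""
    D = cdist(X_min, X_maj, metric="euclidean")  # pairwise distance matrix
    return np.percentile(D, q, axis=1)           # one radius per minority instance
```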
Algorithm 3: Minimum neighbor calculation
Input: Min (minority class instances), Maj (majority class instances), radius values (r).
Pseudo code:
1. For each minority instance, count the neighboring instances that fall within its radius (r).
2. Derive the minimum number of neighbors (k) from these counts.
Return: Minimum number of neighbors (k) for the minority class instances.
Algorithm 4: Data typing for overlapping instances
Input: Min, Maj, radius values (r).
Pseudo code:
1. Initialize empty sets for non-overlapping (Min_non, Maj_non) and overlapping (Min_ov, Maj_ov) instances.
2. For each minority instance x:
   - Identify the instances in the majority class that lie within half of the radius (r/2) of x.
   - If any exist, x is considered an overlap: add x to Min_ov and add its close majority neighbors to Maj_ov.
   - Otherwise, add x to Min_non.
3. Add every majority instance not in Maj_ov to Maj_non.
4. Return the updated feature matrices for the minority and majority classes and the sets of overlapping instances.
Return: Maj_non (non-overlapping instances of the majority class), Min_non (non-overlapping instances of the minority class), Maj_ov (overlapping instances of the majority class), Min_ov (overlapping instances of the minority class).
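Combining Algorithms 2 and 4 gives a compact picture of the whole characterization step. The sketch below builds on the radius snippet above; it assumes Euclidean distance, applies the half-radius rule of Algorithm 4, and omits the minimum-neighbor criterion of Algorithm 3 for brevity, so it approximates the paper's procedure rather than reproducing it:

```python
import numpy as np
from scipy.spatial.distance import cdist

def characterize(X_maj, X_min, q=75):
    """Split both classes into overlapping and non-overlapping subsets."""
    D = cdist(X_min, X_maj)               # minority-to-majority distances
    radius = np.percentile(D, q, axis=1)  # Algorithm 2: per-instance radius
    min_ov, min_non, maj_ov = [], [], set()
    for i in range(len(X_min)):
        close = np.where(D[i] < radius[i] / 2)[0]  # majority points within half the radius
        if close.size > 0:                 # the minority instance is an overlap...
            min_ov.append(i)
            maj_ov.update(close.tolist())  # ...and so are its close majority neighbors
        else:
            min_non.append(i)
    maj_ov = sorted(maj_ov)
    maj_non = [j for j in range(len(X_maj)) if j not in set(maj_ov)]
    return X_maj[maj_non], X_min[min_non], X_maj[maj_ov], X_min[min_ov]
```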
4.2. Data Matching
- Baseline Model: The first model serves as the baseline, trained using the original dataset (Set 0) without any resampling.
- Overlap Differentiation Models: The second and third models focus on distinguishing overlapping from non-overlapping subsets, specifically minority overlap versus majority non-overlap (Set 1) and majority overlap versus minority non-overlap (Set 2).
- In-depth Overlap Analysis Model: The fourth model is dedicated to an in-depth analysis of the overlapping subsets, specifically minority overlap versus majority overlap (Set 3).
- Non-overlapping Subset Model: The fifth model examines the dataset after excluding the overlapping elements, focusing on minority non-overlap versus majority non-overlap (Set 4).
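With the four partitions in hand, assembling Sets 0–4 is mechanical. The sketch below labels minority-derived instances as the positive class; that convention, and the plain scikit-learn random forest standing in for the paper's tuned resampler/classifier pairs, are our assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build_sets(maj_non, min_non, maj_ov, min_ov):
    """Assemble Sets 0-4 as (X, y) pairs; minority-side instances get label 1."""
    def pair(minority, majority):
        X = np.vstack([minority, majority])
        y = np.hstack([np.ones(len(minority)), np.zeros(len(majority))])
        return X, y

    min_all = np.vstack([min_non, min_ov])  # Set 0 reassembles the original data
    maj_all = np.vstack([maj_non, maj_ov])
    return [pair(min_all, maj_all),   # Set 0: original
            pair(min_ov, maj_non),    # Set 1: minority overlap vs. majority non-overlap
            pair(min_non, maj_ov),    # Set 2: majority overlap vs. minority non-overlap
            pair(min_ov, maj_ov),     # Set 3: minority overlap vs. majority overlap
            pair(min_non, maj_non)]   # Set 4: minority non-overlap vs. majority non-overlap

# Usage (with the partitions from the characterization sketch above):
# models = [RandomForestClassifier(random_state=0).fit(X, y)
#           for X, y in build_sets(maj_non, min_non, maj_ov, min_ov)]
```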
- Non-weighted (Non–W): A simple majority vote treats all models equally, which works well when no single metric is prioritized and the models perform at similar levels.
- Weighted by Recall (W–R): Models with higher Recall receive greater influence. This approach is effective for imbalanced datasets, where detecting rare cases is crucial.
- Weighted by G-Mean (W–G): Models with higher G-Mean contribute more to the final decision. This method balances sensitivity and specificity, ensuring both classes are well represented in the final prediction.
- Weighted by AUC (W–A): Models with higher AUC scores have stronger voting power. This enhances class distinction across various thresholds.
- Weighted by Average (W–avg): The model’s influence is determined by the average of its Recall, AUC, and G-Mean scores. This method provides a balanced approach by considering multiple performance metrics.
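All five schemes reduce to the same voting routine with a different scalar weight per model. A minimal sketch, assuming scikit-learn-style predict outputs in {0, 1} and a 0.5 decision threshold (our assumption):

```python
import numpy as np

def weighted_vote(models, weights, X):
    """Weighted majority vote over binary predictions (1 = minority class).
    weights: one scalar per model -- all 1.0 for Non-W, or each model's
    Recall (W-R), G-Mean (W-G), AUC (W-A), or their average (W-avg),
    measured on validation data."""
    votes = np.array([m.predict(X) for m in models])  # shape: (n_models, n_samples)
    w = np.asarray(weights, dtype=float)[:, None]
    score = (votes * w).sum(axis=0) / w.sum()         # weighted share of minority votes
    return (score >= 0.5).astype(int)
```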
5. Experimental Results and Discussion
5.1. Experimental Design
5.2. Datasets
5.3. Results of Proposed Algorithm
5.4. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
- Nasrollahpour, H.; Isildak, I.; Rashidi, M.-R.; Hashemi, E.A.; Naseri, A.; Khalilzadeh, B. Ultrasensitive bioassaying of HER-2 protein for diagnosis of breast cancer using reduced graphene oxide/chitosan as a nanobiocompatible platform. Cancer Nanotechnol. 2021, 12, 10.
- Guo, K.; Wang, Y.; Kang, J.; Zhang, J.; Cao, R. Core dataset extraction from unlabeled medical big data for lesion localization. Big Data Res. 2021, 24, 100185.
- Cheng, S.; Wu, Y.; Li, Y.; Yao, F.; Min, F. TWD-SFNN: Three-way decisions with a single hidden layer feedforward neural network. Inf. Sci. 2021, 579, 15–32.
- Wu, C.; Luo, C.; Xiong, N.; Zhang, W.; Kim, T.-H. A greedy deep learning method for medical disease analysis. IEEE Access 2018, 6, 20021–20030.
- Wei, W.; Li, J.; Cao, L.; Ou, Y.; Chen, J. Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web-Internet Web Inf. Syst. 2013, 16, 449–475.
- Niu, K.; Zhang, Z.; Liu, Y.; Li, R. Resampling ensemble model based on data distribution for imbalanced credit risk evaluation in P2P lending. Inf. Sci. 2020, 536, 120–134.
- Daliri, S. Using harmony search algorithm in neural networks to improve fraud detection in the banking system. Comput. Intell. Neurosci. 2020, 2020, 6503459.
- Cui, L.; Bai, L.; Wang, Y.; Jin, X.; Hancock, E.R. Internet financing credit risk evaluation using multiple structural interacting elastic net feature selection. Pattern Recognit. 2021, 114, 107835.
- Yang, J.; Xiong, N.; Vasilakos, A.V.; Fang, Z.; Park, D.; Xu, X.; Yoon, S.; Xie, S.; Yang, Y. A fingerprint recognition scheme based on assembling invariant moments for cloud computing communications. IEEE Syst. J. 2011, 5, 574–583.
- Xia, F.; Hao, R.; Li, J.; Xiong, N.; Yang, L.T.; Zhang, Y. Adaptive GTS allocation in IEEE 802.15.4 for real-time wireless sensor networks. J. Syst. Archit. 2013, 59 Pt D, 1231–1242.
- Rezvani, S.; Wang, X. A broad review on class imbalance learning techniques. Appl. Soft Comput. 2023, 143, 110415.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- Devi, D.; Biswas, S.K.; Purkayastha, B. Redundancy-driven modified Tomek-link based undersampling: A solution to class imbalance. Pattern Recognit. Lett. 2017, 93, 3–12.
- Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
- Chawla, N.V.; Lazarevic, A.; Hall, L.O.; Bowyer, K.W. SMOTEBoost: Improving Prediction of the Minority Class in Boosting. In Knowledge Discovery in Databases: PKDD 2003, Proceedings of the European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, 22–26 September 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 107–119.
- Batuwita, R.; Palade, V. Efficient resampling methods for training support vector machines with imbalanced datasets. In Proceedings of the International Joint Conference on Neural Networks 2010, Barcelona, Spain, 18–23 July 2010; pp. 1–8.
- Estabrooks, A.; Jo, T.; Japkowicz, N. A multiple resampling method for learning from imbalanced datasets. Comput. Intell. 2004, 20, 18–36.
- Lin, W.-C.; Tsai, C.-F.; Hu, Y.-H.; Jhang, J.-S. Clustering-based undersampling in class-imbalanced data. Inf. Sci. 2017, 409–410, 17–26.
- Fernandez, A.; Garcia, S.; del Jesus, M.J.; Herrera, F. A study of the behaviour of linguistic fuzzy rule-based classification systems in the framework of imbalanced datasets. Fuzzy Sets Syst. 2008, 159, 2378–2398.
- Fernandez, A.; del Jesus, M.J.; Herrera, F. On the 2-tuples based genetic tuning performance for fuzzy rule-based classification systems in imbalanced datasets. Inf. Sci. 2010, 180, 1268–1291.
- Qian, Y.; Liang, Y.; Li, M.; Feng, G.; Shi, X. A Resampling Ensemble Algorithm for Classification of Imbalance Problems. Neurocomputing 2014, 143, 57–67.
- Batista, G.; Bazzan, A.; Monard, M.C. Balancing Training Data for Automated Annotation of Keywords: A Case Study. In Proceedings of the II Brazilian Workshop on Bioinformatics, São Paulo, Brazil, 3–5 December 2003; pp. 10–18.
- Kumar, P.; Kumar, R.; Srivastava, G.; Gupta, G.P.; Tripathi, R.; Gadekallu, T.R.; Xiong, N.N. PPSF: A Privacy-Preserving and Secure Framework Using Blockchain-Based Machine Learning for IoT-Driven Smart Cities. IEEE Trans. Netw. Sci. Eng. 2021, 8, 2326–2341.
- Elkan, C. The Foundations of Cost-Sensitive Learning. In Proceedings of the IJCAI International Joint Conference on Artificial Intelligence, Seattle, WA, USA, 4–10 August 2001; pp. 973–978.
- Barandela, R.; Sánchez, J.S.; Valdovinos, R.M. New Applications of Ensembles of Classifiers. Pattern Anal. Appl. 2003, 6, 245–256.
- Chen, C.; Liaw, A.; Breiman, L. Using Random Forest to Learn Imbalanced Data; Technical Report; University of California: Berkeley, CA, USA, 2004.
- Yang, X.; Song, Q.; Cao, A. Weighted Support Vector Machine for Data Classification. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; pp. 859–864.
- Seiffert, C.; Khoshgoftaar, T.M.; Hulse, J.V.; Napolitano, A. RUSBoost: A Hybrid Approach to Alleviating Class Imbalance. IEEE Trans. Syst. Man Cybern.—Part A Syst. Hum. 2010, 40, 185–197.
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
- Han, H.; Wang, W.-Y.; Mao, B.-H. Borderline-SMOTE: A new oversampling method in imbalanced datasets learning. In Advances in Intelligent Computing, Proceedings of the ICIC 2005, Hefei, China, 23–26 August 2005; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3644, pp. 878–887.
- Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalance problem. In Advances in Knowledge Discovery and Data Mining, Proceedings of the PAKDD 2009, Bangkok, Thailand, 27–30 April 2009; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5476, pp. 475–482.
- Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27.
- Chen, W.; Yang, K.; Yu, Z.; Shi, Y.; Chen, C.L.P. A survey on imbalanced learning: Latest research, applications and future directions. Artif. Intell. Rev. 2024, 57, 137.
- Shi, S.; Li, J.; Zhu, D.; Yang, F.; Xu, Y. A Hybrid Imbalanced Classification Model Based on Data Density. Inf. Sci. 2023, 624, 50–67.
- Huang, Z.; Gao, X.; Chen, W.; Cheng, Y.; Xue, B.; Meng, Z.; Zhang, G.; Fu, S. An Imbalanced Binary Classification Method via Space Mapping Using Normalizing Flows with Class Discrepancy Constraints. Inf. Sci. 2023, 623, 493–523.
- Mayabadi, S.; Saadatfar, H. Two Density-Based Sampling Approaches for Imbalanced and Overlapping Data. Knowl.-Based Syst. 2022, 241, 108217.
- Tao, X.; Guo, X.; Zheng, Y.; Zhang, X.; Chen, Z. Self-Adaptive Oversampling Method Based on the Complexity of Minority Data in Imbalanced Datasets Classification. Knowl.-Based Syst. 2023, 277, 110795.
- Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
- Nakai, K. Yeast. In UCI Machine Learning Repository; University of California, Irvine, School of Information and Computer Sciences: Irvine, CA, USA, 1991.
- Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, T.; Reis, J. Modeling Wine Preferences by Data Mining from Physicochemical Properties. Decis. Support Syst. 2009, 47, 547–553.
- Alcalá-Fdez, J.; Fernandez, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; Herrera, F. KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. J. Mult.-Valued Log. Soft Comput. 2011, 17, 255–287.
- Fedesoriano. Stroke Prediction Dataset. Kaggle. Available online: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data (accessed on 15 November 2024).
- Mssmartypants. Water Quality Dataset. Kaggle. Available online: https://www.kaggle.com/datasets/mssmartypants/water-quality (accessed on 15 November 2024).
- Sudhanshu. Microcalcification Classification Dataset. Kaggle. Available online: https://www.kaggle.com/datasets/sudhanshu2198/microcalcification-classification/data (accessed on 15 November 2024).
- Mathew, J.; Pang, C.K.; Luo, M.; Leong, W.H. Classification of Imbalanced Data by Oversampling in Kernel Space of Support Vector Machines. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 4065–4076.
- Zhao, J.; Jin, J.; Chen, S.; Zhang, R.; Yu, B.; Liu, Q. A Weighted Hybrid Ensemble Method for Classifying Imbalanced Data. Knowl.-Based Syst. 2020, 203, 106087.
- Guo, J.; Wu, H.; Chen, X.; Lin, W. Adaptive SV-Borderline SMOTE-SVM Algorithm for Imbalanced Data Classification. Appl. Soft Comput. 2024, 150, 110986.
- Li, F.; Wang, B.; Shen, Y.; Wang, P.; Li, Y. An Overlapping Oriented Imbalanced Ensemble Learning Algorithm with Weighted Projection Clustering Grouping and Consistent Fuzzy Sample Transformation. Inf. Sci. 2023, 637, 118955.
Techniques | Advantage | Disadvantage
---|---|---
Data-level Techniques | |
Algorithm-level Techniques | |
Hybrid Techniques | |
Notation | Description
---|---
Maj | Majority class instances.
Min | Minority class instances.
r | Radius threshold for minority class instances (calculated via percentile distances).
k | Minimum number of neighbors required for minority class instances.
Maj_ov | Majority class instances overlapping with the minority class.
Min_ov | Minority class instances overlapping with the majority class.
Maj_non | Majority class instances in non-overlapping regions.
Min_non | Minority class instances in non-overlapping regions.
Set 0–Set 4 | The five training sets constructed from the partitions (Set 0: original; Sets 1–4: partition pairings).
D | Pairwise distance matrix between minority and majority instances.
Name | Dataset | Attributes | Instances | Class Distribution | Imbalance Ratio
---|---|---|---|---|---
Yeast 143 | Yeast | 8 | 459 | 429/30 | 14.3
Yeast 246 | Yeast | 8 | 1484 | 1055/429 | 2.46
Yeast 508 | Yeast | 8 | 1484 | 1240/244 | 5.08
Yeast 908 | Yeast | 8 | 514 | 463/51 | 9.08
Yeast 912 | Yeast | 8 | 506 | 456/50 | 9.12
Yeast 914 | Yeast | 8 | 1004 | 905/99 | 9.14
Yeast 935 | Yeast | 8 | 528 | 477/51 | 9.35
Yeast 1225 | Yeast | 8 | 464 | 429/35 | 12.25
Yeast 3057 | Yeast | 8 | 947 | 917/30 | 30.57
Yeast 3273 | Yeast | 8 | 1484 | 1440/44 | 32.73
WineRedQ4 | Wine Quality | 11 | 1599 | 53/1546 | 29.17
WineRedQ5 | Wine Quality | 11 | 1599 | 681/918 | 1.35
WineRedQ6 | Wine Quality | 11 | 1599 | 638/961 | 1.51
WineRedQ7 | Wine Quality | 11 | 1599 | 199/1400 | 7.04
WineWhiteQ4 | Wine Quality | 11 | 4898 | 163/4735 | 29.05
WineWhiteQ5 | Wine Quality | 11 | 4898 | 1457/3441 | 2.36
WineWhiteQ6 | Wine Quality | 11 | 4898 | 2198/2700 | 1.23
WineWhiteQ7 | Wine Quality | 11 | 4898 | 880/4018 | 4.57
WineWhiteQ8 | Wine Quality | 11 | 4898 | 175/4723 | 26.99
Stroke | Stroke | 10 | 4908 | 209/4699 | 22.48
Microcal | Microcalcification | 6 | 11,183 | 260/10,923 | 42.01
Water | Water Quality | 20 | 7996 | 912/7084 | 7.77
Datasets | Algorithm | Best Recall | Detail | Best G-Mean | Detail | Best AUC | Detail
---|---|---|---|---|---|---|---
Yeast 143 | Baseline | 0.86667 | S-SMOTE, Boosting SVC | 0.80803 | ROS, SVC | 0.80853 | ROS, SVC
Yeast 143 | Proposed | 0.83333 | S-SMOTE, Boosting SVC, W–R | 0.82228 | S-SMOTE, SVC, Non–W | 0.82248 | S-SMOTE, SVC, Non–W
Yeast 246 | Baseline | 0.81177 | B-SMOTE, SVC | 0.70388 | ROS, Bagging RF | 0.71012 | ROS, Bagging RF
Yeast 246 | Proposed | 0.96235 * (0.0042) | ROS, Boosting RF, W–R | 0.70828 | ROS, Bagging RF, Non–W | 0.71410 | ROS, Bagging RF, W–G
Yeast 508 | Baseline | 0.81923 | B-SMOTE, Boosting SVC | 0.79436 | ROS, Boosting SVC | 0.80102 | ROS, Boosting SVC
Yeast 508 | Proposed | 0.87692 | ROS, Bagging RF, W–R | 0.80338 | ROS, SVC, Non–W | 0.80361 | ROS, SVC, Non–W
Yeast 908 | Baseline | 0.88333 | B-SMOTE, Boosting SVC | 0.91206 | S-SMOTE, Bagging SVC | 0.91355 | S-SMOTE, SVC
Yeast 908 | Proposed | 0.93333 | ROS, SVC, Non–W | 0.94212 * (0.04187) | ROS, SVC, Non–W | 0.94249 * (0.04206) | ROS, SVC, Non–W
Yeast 912 | Baseline | 0.85714 | ROS, Bagging SVC | 0.78421 | ROS, Bagging SVC | 0.78752 | ROS, Bagging SVC
Yeast 912 | Proposed | 0.91429 | B-SMOTE, Boosting SVC, W–R | 0.78160 | B-SMOTE, SVC, W–G | 0.78541 | B-SMOTE, SVC, W–A
Yeast 914 | Baseline | 0.76191 | SMOTE, Bagging SVC | 0.81985 | SMOTE, SVC | 0.82206 | SMOTE, SVC
Yeast 914 | Proposed | 0.96190 * (0.00036) | ROS, Boosting RF, W–R | 0.81359 | SMOTE, Bagging RF, Non–W | 0.82429 | SMOTE, Bagging RF, Non–W
Yeast 935 | Baseline | 0.71429 | SMOTE, Bagging SVC | 0.76379 | S-SMOTE, Bagging RF | 0.77994 | S-SMOTE, Bagging RF
Yeast 935 | Proposed | 0.74286 | B-SMOTE, SVC, Non–W | 0.78540 | B-SMOTE, RF, W–G | 0.79928 | SMOTE, Boosting RF, W–A
Yeast 1225 | Baseline | 0.83333 | SMOTE, Boosting SVC | 0.81556 | B-SMOTE, RF | 0.83218 | B-SMOTE, RF
Yeast 1225 | Proposed | 0.83333 | B-SMOTE, Boosting SVC, Non–W | 0.85685 | ROS, SVC, W–G | 0.86379 | ROS, SVC, W–A
Yeast 3057 | Baseline | 0.84000 | ROS, Boosting SVC | 0.66553 | ROS, Bagging SVC | 0.67838 | ROS, Bagging SVC
Yeast 3057 | Proposed | 0.92000 | SMOTE, Boosting SVC, W–R | 0.69037 | ROS, SVC, W–G | 0.69730 | ROS, SVC, W–G
Yeast 3273 | Baseline | 1.00000 | ROS, Boosting SVC | 0.97441 | SMOTE, SVC | 0.97474 | SMOTE, SVC
Yeast 3273 | Proposed | 1.00000 | SMOTE, Boosting SVC, W–R | 0.98711 * (0.00025) | SMOTE, SVC, Non–W | 0.98720 * (0.00025) | SMOTE, SVC, Non–W
Datasets | Algorithm | Best Recall | Detail | Best G-Mean | Detail | Best AUC | Detail
---|---|---|---|---|---|---|---
WineRedQ4 | Baseline | 0.90000 | SMOTE, SVC | 0.80205 | SMOTE, SVC | 0.80807 | SMOTE, SVC
WineRedQ4 | Proposed | 0.90000 | ROS, Bagging SVC, Non–W | 0.79468 | ROS, Bagging SVC, Non–W | 0.80097 | ROS, Bagging SVC, Non–W
WineRedQ5 | Baseline | 0.89692 | SMOTE, Boosting SVC | 0.77443 | SMOTE, Boosting RF | 0.77591 | SMOTE, Boosting RF
WineRedQ5 | Proposed | 0.97846 * (0.00374) | B-SMOTE, Bagging RF, W–R | 0.77267 | SMOTE, Boosting RF, Non–W | 0.77320 | SMOTE, Boosting RF, Non–W
WineRedQ6 | Baseline | 0.67879 | S-SMOTE, SVC | 0.70162 | B-SMOTE, Bagging RF | 0.70293 | B-SMOTE, Bagging RF
WineRedQ6 | Proposed | 0.96667 * (0.00003) | ROS, RF, W–R | 0.70804 | B-SMOTE, Bagging RF, W–G | 0.70835 | B-SMOTE, Bagging RF, W–G
WineRedQ7 | Baseline | 0.89524 | S-SMOTE, Boosting SVC | 0.81897 | B-SMOTE, Bagging SVC | 0.82141 | B-SMOTE, Bagging SVC
WineRedQ7 | Proposed | 0.98095 * (0.00605) | SMOTE, Boosting RF, W–R | 0.82385 | ROS, SVC, Non–W | 0.82653 | ROS, SVC, Non–W
Datasets | Algorithm | Best Recall | Detail | Best G-Mean | Detail | Best AUC | Detail
---|---|---|---|---|---|---|---
WineWhiteQ4 | Baseline | 0.78400 | SMOTE, SVC | 0.78848 | S-SMOTE, SVC | 0.78886 | S-SMOTE, SVC
WineWhiteQ4 | Proposed | 0.80800 | S-SMOTE, Boosting SVC, W–R | 0.78799 | ROS, SVC, Non–W | 0.78831 | ROS, SVC, Non–W
WineWhiteQ5 | Baseline | 0.86804 | B-SMOTE, Boosting SVC | 0.76996 | B-SMOTE, Bagging RF | 0.77319 | B-SMOTE, Bagging RF
WineWhiteQ5 | Proposed | 0.97113 * (0.00005) | B-SMOTE, Boosting RF, W–R | 0.77350 | B-SMOTE, Bagging RF, W–G | 0.77548 | B-SMOTE, Bagging RF, W–G
WineWhiteQ6 | Baseline | 0.92500 | S-SMOTE, Boosting SVC | 0.71466 | ROS, Bagging RF | 0.71504 | ROS, Bagging RF
WineWhiteQ6 | Proposed | 0.99583 * (0.2256) | ROS, Boosting RF, W–R | 0.71982 | B-SMOTE, Bagging RF, W–G | 0.72000 | B-SMOTE, Bagging RF, W–G
WineWhiteQ7 | Baseline | 0.79375 | B-SMOTE, Bagging SVC | 0.74659 | ROS, Bagging RF | 0.76218 | ROS, Bagging RF
WineWhiteQ7 | Proposed | 0.96979 * (0.000003) | ROS, Bagging RF, W–R | 0.75287 * (0.02169) | B-SMOTE, Bagging RF, W–G | 0.76653 * (0.03713) | ROS, Bagging RF, W–G
WineWhiteQ8 | Baseline | 0.73714 | SMOTE, Boosting SVC | 0.70885 | ROS, Boosting SVC | 0.71270 | S-SMOTE, Bagging RF
WineWhiteQ8 | Proposed | 0.88571 * (0.00045) | S-SMOTE, Bagging RF, W–R | 0.78596 * (0.00735) | B-SMOTE, Boosting RF, W–R | 0.78815 * (0.00232) | B-SMOTE, Boosting RF, W–R
Datasets | Algorithm | Best Recall | Detail | Best G-Mean | Detail | Best AUC | Detail
---|---|---|---|---|---|---|---
Stroke | Baseline | 0.94717 | SMOTE, Boosting SVC | 0.73554 | ROS, Boosting SVC | 0.75056 | ROS, Boosting SVC
Stroke | Proposed | 0.93585 | ROS, SVC, W–R | 0.75936 * (0.00173) | S-SMOTE, Boosting SVC, Non–W | 0.76684 * (0.01101) | S-SMOTE, Boosting SVC, Non–W
Microcal | Baseline | 0.95200 | B-SMOTE, Boosting SVC | 0.91004 | ROS, Bagging SVC | 0.91038 | ROS, Bagging SVC
Microcal | Proposed | 0.96000 | B-SMOTE, Boosting SVC, Non–W | 0.91360 | SMOTE, SVC, Non–W | 0.91364 | SMOTE, SVC, Non–W
Water | Baseline | 0.70400 | B-SMOTE, Bagging SVC | 0.74155 | ROS, Bagging RF | 0.74486 | ROS, Bagging RF
Water | Proposed | 0.96400 * (0.00001) | SMOTE, RF, W–R | 0.79700 * (0.0023) | ROS, SVC, Non–W | 0.79764 * (0.0002) | ROS, SVC, Non–W
Datasets | CW [35] | CW [36] | CW [37] | CW [38] | CW [46] | CW [47] | CW [48] | CW [49] | Proposed |
---|---|---|---|---|---|---|---|---|---|
Yeast 143 | 0.6555 | 0.7593 | - | - | 0.76 | - | 0.697 | 0.7573 | 0.82228 |
Yeast 246 | - | - | 0.718 | 0.743 | 0.72 | - | 0.683 | - | 0.70828 |
Yeast 908 | 0.8674 | 0.8494 | 0.954 | 0.937 | 0.9 | - | 0.863 | 0.9231 | 0.94212 |
Yeast 912 | - | - | - | - | 0.74 | - | 0.627 | 0.7263 | 0.7816 |
Yeast 914 | - | - | - | - | 0.8 | 0.7642 | 0.76 | - | 0.81359 |
Yeast 935 | - | 0.789 | - | - | 0.81 | - | 0.779 | - | 0.7854 |
Yeast 3057 | - | - | - | - | 0.73 | 0.6605 | 0.657 | - | 0.69037 |
Yeast 3273 | 0.9601 | - | - | 0.962 | 0.96 | 0.939 | 0.948 | - | 0.98711 |