Handling the Imbalanced Problem in Agri-Food Data Analysis
Abstract
1. Introduction
2. Related Works
Experimental Chicken Egg Fertility Data
3. Methodologies for Handling the Imbalanced Problem
3.1. Data Preprocessing (Resampling)
3.2. Feature Selection
3.3. Recognition-Based Approach
3.4. Cost-Sensitive Learning
3.5. Ensemble Methods
3.6. Deep Learning Architecture
4. Evaluation Metrics for Imbalanced Data Analysis
4.1. ROC Analysis
4.2. Area Under ROC Curve (AUC)
4.3. Precision-Recall (PR) Curve
4.4. Comparisons of Evaluation Metrics for Handling Imbalanced Data
5. Conclusions and Future Research Direction
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Chawla, N.V. Data Mining for Imbalanced Datasets: An Overview. In Data Mining and Knowledge Discovery Handbook; Springer: Berlin/Heidelberg, Germany, 2009; pp. 875–886. [Google Scholar]
- Thabtah, F.; Hammoud, S.; Kamalov, F.; Gonsalves, A. Data Imbalance in Classification: Experimental Evaluation. Inf. Sci. 2020, 513, 429–441. [Google Scholar] [CrossRef]
- Artís, M.; Ayuso, M.; Guillén, M. Detection of automobile insurance fraud with discrete choice models and misclassified claims. J. Risk Insur. 2002, 69, 325–340. [Google Scholar] [CrossRef]
- He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar]
- Sun, Y.; Wong, A.; Kamel, M. Classification of imbalanced data: A review. Int. J. Pattern Recognit. Artif. Intell. 2009, 23, 687–719. [Google Scholar] [CrossRef]
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
- Schapire, R.E. The Boosting Approach to Machine Learning: An Overview. In Nonlinear Estimation and Classification; Springer: Berlin/Heidelberg, Germany, 2003; pp. 149–171. [Google Scholar]
- Adegbenjo, A.O.; Liu, L.; Ngadi, M.O. Non-Destructive Assessment of Chicken Egg Fertility. Sensors 2020, 20, 5546. [Google Scholar] [CrossRef]
- Ahmed, H.A.; Hameed, A.; Bawany, N.Z. Network Intrusion Detection Using Oversampling Technique and Machine Learning Algorithms. PeerJ Comput. Sci. 2022, 8, e820. [Google Scholar] [CrossRef]
- Almarshdi, R.; Nassef, L.; Fadel, E.; Alowidi, N. Hybrid Deep Learning Based Attack Detection for Imbalanced Data Classification. Intell. Autom. Soft Comput. 2023, 35, 297–320. [Google Scholar] [CrossRef]
- Al-Qarni, E.A.; Al-Asmari, G.A. Addressing Imbalanced Data in Network Intrusion Detection: A Review and Survey. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 136–143. [Google Scholar] [CrossRef]
- Kuhn, M.; Johnson, K. Remedies for severe class imbalance. In Applied Predictive Modeling; Springer: Berlin/Heidelberg, Germany, 2016; p. 427. [Google Scholar]
- Li, L.; Wang, Q.; Weng, F.; Yuan, C. Non-destructive Visual Inspection Method of Double-Yolked Duck Egg. Int. J. Pattern Recognit. Artif. Intell. 2019, 33, 1955006. [Google Scholar] [CrossRef]
- Devasena, C.L.; Sumathi, T.; Gomathi, V.; Hemalatha, M. Effectiveness Evaluation of Rule Based Classifiers for the Classification of Iris Data Set. Bonfring Int. J. Man Mach. Interface 2011, 1, 5. [Google Scholar]
- Brownlee, J. Machine Learning Mastery with Weka: Analyse Data, Develop Models and Work through Projects; Machine Learning Mastery: Vermont, VIC, Australia, 2016; pp. 110–113. [Google Scholar]
- Panigrahi, R.; Borah, S. A detailed analysis of CICIDS2017 dataset for designing Intrusion Detection Systems. Int. J. Eng. Technol. 2018, 3, 479–482. [Google Scholar]
- Choudhary, S.; Kesswani, N. Analysis of KDD-Cup’99, NSL-KDD and UNSW-NB15 Datasets Using Deep Learning in IoT. Procedia Comput. Sci. 2020, 167, 1561–1573. [Google Scholar] [CrossRef]
- Alzughaibi, S.; El Khediri, S. A Cloud Intrusion Detection Systems Based on DNN Using Backpropagation and PSO on the CSE-CIC-IDS2018 Dataset. Appl. Sci. 2023, 13, 2276. [Google Scholar] [CrossRef]
- Liu, J.; Gao, Y.; Hu, F. A Fast Network Intrusion Detection System Using Adaptive Synthetic Oversampling and LightGBM. Comput. Secur. 2021, 106, 102289. [Google Scholar] [CrossRef]
- Yulianto, A.; Sukarno, P.; Suwastika, N.A. Improving Adaboost-Based Intrusion Detection System (IDS) Performance on CIC IDS 2017 Dataset; IOP Publishing: Bristol, UK, 2019; Volume 1192, p. 012018. [Google Scholar]
- Meliboev, A.; Alikhanov, J.; Kim, W. Performance Evaluation of Deep Learning Based Network Intrusion Detection System across Multiple Balanced and Imbalanced Datasets. Electronics 2022, 11, 515. [Google Scholar] [CrossRef]
- Karatas, G.; Demir, O.; Sahingoz, O.K. Increasing the Performance of Machine Learning-Based IDSs on an Imbalanced and up-to-Date Dataset. IEEE Access 2020, 8, 32150–32162. [Google Scholar] [CrossRef]
- Dale, L.M.; Thewis, A.; Boudry, C.; Rotar, I.; Dardenne, P.; Baeten, V.; Pierna, J.A.F. Hyperspectral imaging applications in agriculture and agro-food product quality and safety control: A review. Appl. Spectrosc. Rev. 2013, 48, 142–159. [Google Scholar] [CrossRef]
- Del Fiore, A.; Reverberi, M.; Ricelli, A.; Pinzari, F.; Serranti, S.; Fabbri, A.; Fanelli, C. Early detection of toxigenic fungi on maize by hyperspectral imaging analysis. Int. J. Food Microbiol. 2010, 144, 64–71. [Google Scholar] [CrossRef]
- Zhang, M.; Qin, Z.; Liu, X.; Ustin, S.L. Detection of stress in tomatoes induced by late blight disease in California, USA, using hyperspectral remote sensing. Int. J. Appl. Earth Obs. Geoinf. 2003, 4, 295–310. [Google Scholar] [CrossRef]
- Ariana, D.P.; Lu, R. Detection of internal defect in pickling cucumbers using hyperspectral transmittance imaging. Trans. ASABE 2008, 51, 705–713. [Google Scholar] [CrossRef]
- Ariana, D.P.; Lu, R. Hyperspectral imaging for defect detection of pickling cucumbers. Hyperspectral Imaging Food Qual. Anal. Control. 2010, 431–447. [Google Scholar] [CrossRef]
- Ariana, D.P.; Lu, R. Hyperspectral waveband selection for internal defect detection of pickling cucumbers and whole pickles. Comput. Electron. Agric. 2010, 74, 137–144. [Google Scholar] [CrossRef]
- Wang, N.; ElMasry, G. Bruise detection of apples using hyperspectral imaging. Hyperspectral Imaging Food Qual. Anal. Control. 2010, 295–320. [Google Scholar] [CrossRef]
- Senthilkumar, T.; Jayas, D.; White, N.; Fields, P.; Gräfenhan, T. Detection of fungal infection and Ochratoxin A contamination in stored wheat using near-infrared hyperspectral imaging. J. Stored Prod. Res. 2016, 65, 30–39. [Google Scholar] [CrossRef]
- Senthilkumar, T.; Singh, C.; Jayas, D.; White, N. Detection of fungal infection in canola using near-infrared hyperspectral imaging. J. Agric. Eng. 2012, 49, 21–27. [Google Scholar] [CrossRef]
- Adegbenjo, A.O.; Liu, L.; Ngadi, M.O. An Adaptive Partial Least-Squares Regression Approach for Classifying Chicken Egg Fertility by Hyperspectral Imaging. Sensors 2024, 24, 1485. [Google Scholar] [CrossRef]
- Liu, L.; Ngadi, M. Detecting fertility and early embryo development of chicken eggs using near-infrared hyperspectral imaging. Food Bioprocess Technol. 2013, 6, 2503–2513. [Google Scholar] [CrossRef]
- Smith, D.; Lawrence, K.; Heitschmidt, G. Fertility and embryo development of broiler hatching eggs evaluated with a hyperspectral imaging and predictive modeling system. Int. J. Poult. Sci. 2008, 7, 1001–1004. [Google Scholar]
- Hu, G.; Xi, T.; Mohammed, F.; Miao, H. Classification of Wine Quality with Imbalanced Data. In Proceedings of the IEEE International Conference on Industrial Technology (ICIT), Taipei, Taiwan, 14–17 March 2016; pp. 1712–1717. [Google Scholar]
- Weller, D.L.; Love, T.M.; Wiedmann, M. Comparison of Resampling Algorithms to Address Class Imbalance When Developing Machine Learning Models to Predict Foodborne Pathogen Presence in Agricultural Water. Front. Environ. Sci. 2021, 9, 701288. [Google Scholar] [CrossRef]
- Yang, H.; Xu, J.; Xiao, Y.; Hu, L. SPE-ACGAN: A Resampling Approach for Class Imbalance Problem in Network Intrusion Detection Systems. Electronics 2023, 12, 3323. [Google Scholar] [CrossRef]
- Rani, M.; Gagandeep. Effective Network Intrusion Detection by Addressing Class Imbalance with Deep Neural Networks. Multimed. Tools Appl. 2022, 81, 8499–8518. [Google Scholar] [CrossRef]
- Phoungphol, P. A Classification Framework for Imbalanced Data. Ph.D. Thesis, Georgia State University, Atlanta, GA, USA, 2013. [Google Scholar]
- Nguyen, G.; Bouzerdoum, A.; Phung, S. Learning pattern classification tasks with imbalanced data sets. In Pattern Recognition; Yin, P., Ed.; Elsevier: Amsterdam, The Netherlands, 2009; pp. 193–208. [Google Scholar]
- Seiffert, C.; Khoshgoftaar, T.M.; Van Hulse, J.; Napolitano, A. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern. Part A: Syst. Hum. 2010, 40, 185–197. [Google Scholar] [CrossRef]
- López, V.; Fernández, A.; García, S.; Palade, V.; Herrera, F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 2013, 250, 113–141. [Google Scholar] [CrossRef]
- Liao, T.W. Classification of weld flaws with imbalanced class data. Expert Syst. Appl. 2008, 35, 1041–1052. [Google Scholar] [CrossRef]
- Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002, 6, 429–449. [Google Scholar] [CrossRef]
- Chawla, N.V.; Japkowicz, N.; Kotcz, A. Editorial: Special issue on learning from imbalanced data sets. ACM SIGKDD Explor. Newsl. 2004, 6, 1–6. [Google Scholar] [CrossRef]
- Han, H.; Wang, W.-Y.; Mao, B.-H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005. [Google Scholar] [CrossRef]
- He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar]
- Stefanowski, J.; Wilk, S. Selective pre-processing of imbalanced data for improving classification performance. In Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, Turin, Italy, 2–5 September 2008. [Google Scholar]
- Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. Safe-level-smote: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Bangkok, Thailand, 27–30 April 2009. [Google Scholar]
- Wilson, D.L. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 1972, 3, 408–421. [Google Scholar] [CrossRef]
- Tomek, I. Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 1976, 6, 769–772. [Google Scholar]
- Kubat, M.; Matwin, S. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the ICML 1997, Nashville, TN, USA, 8–12 July 1997. [Google Scholar]
- Laurikkala, J. Improving Identification of Difficult Small Classes by Balancing Class Distribution. In Proceedings of the Conference on Artificial Intelligence in Medicine, Cascais, Portugal, 1–4 July 2001; pp. 63–66. [Google Scholar]
- Mani, I.; Zhang, I. KNN Approach to Unbalanced Data Distributions: A Case Study involving Information Extraction. In Proceedings of the ICML’03 Workshop on Learning from Imbalanced Data Sets, Washington, DC, USA, 21 August 2003; Volume 126, pp. 1–7. [Google Scholar]
- Kumar, A.; Singh, D.; Yadav, R.S. Entropy and Improved K-nearest Neighbor Search Based Under-sampling (ENU) Method to Handle Class Overlap in Imbalanced Datasets. Concurr. Comput. Pract. Exp. 2024, 36, e7894. [Google Scholar] [CrossRef]
- Leng, Q.; Guo, J.; Tao, J.; Meng, X.; Wang, C. OBMI: Oversampling Borderline Minority Instances by a Two-Stage Tomek Link-Finding Procedure for Class Imbalance Problem. Complex Intell. Syst. 2024, 10, 4775–4792. [Google Scholar] [CrossRef]
- Batista, G.E.; Prati, R.C.; Monard, M.C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29. [Google Scholar] [CrossRef]
- Chawla, N.V.; Cieslak, D.A.; Hall, L.O.; Joshi, A. Automatically countering imbalance and its empirical relationship to cost. Data Min. Knowl. Discov. 2008, 17, 225–252. [Google Scholar] [CrossRef]
- Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. DBSMOTE: Density-based synthetic minority over-sampling technique. Appl. Intell. 2012, 36, 664–684. [Google Scholar] [CrossRef]
- Cohen, G.; Hilario, M.; Sax, H.; Hugonnet, S.; Geissbuhler, A. Learning from imbalanced data in surveillance of nosocomial infection. Artif. Intell. Med. 2006, 37, 7–18. [Google Scholar] [CrossRef]
- Jo, T.; Japkowicz, N. Class imbalances versus small disjuncts. ACM SIGKDD Explor. Newsl. 2004, 6, 40–49. [Google Scholar] [CrossRef]
- Yen, S.-J.; Lee, Y.-S. Under-sampling approaches for improving prediction of the minority class in an imbalanced dataset. In Intelligent Control and Automation; Springer: Berlin/Heidelberg, Germany, 2006; pp. 731–740. [Google Scholar]
- Yen, S.-J.; Lee, Y.-S. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst. Appl. 2009, 36, 5718–5727. [Google Scholar] [CrossRef]
- Yoon, K.; Kwek, S. An unsupervised learning approach to resolving the data imbalanced issue in supervised learning problems in functional genomics. In Proceedings of the Fifth International Conference on Hybrid Intelligent Systems (HIS), Rio de Janeiro, Brazil, 6–9 November 2005. [Google Scholar]
- Yoon, K.; Kwek, S. A data reduction approach for resolving the imbalanced data issue in functional genomics. Neural Comput. Appl. 2007, 16, 295–306. [Google Scholar] [CrossRef]
- Yang, P.; Xu, L.; Zhou, B.B.; Zhang, Z.; Zomaya, A.Y. A particle swarm-based hybrid system for imbalanced medical data sampling. In Proceedings of the Eighth International Conference on Bioinformatics (InCoB2009): Computational Biology, Singapore, 7–11 September 2009. [Google Scholar]
- Saha, D.; Annamalai, M. Machine learning techniques for analysis of hyperspectral images to determine quality of food products: A review. Curr. Res. Food Sci. 2021, 4, 28–44. [Google Scholar]
- Kamalov, F.; Thabtah, F.; Leung, H.H. Feature Selection in Imbalanced Data. Ann. Data Sci. 2023, 10, 1527–1541. [Google Scholar]
- Forman, G. An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 2003, 3, 1289–1305. [Google Scholar]
- Zheng, Z.; Wu, X.; Srihari, R. Feature selection for text categorization on imbalanced data. ACM SIGKDD Explor. Newsl. 2004, 6, 80–89. [Google Scholar] [CrossRef]
- Lê Cao, K.-A.; Bonnet, A.; Gadat, S. Multiclass classification and gene selection with a stochastic algorithm. Comput. Stat. Data Anal. 2009, 53, 3601–3615. [Google Scholar] [CrossRef]
- Wasikowski, M.; Chen, X.-w. Combating the small sample class imbalance problem using feature selection. IEEE Trans. Knowl. Data Eng. 2010, 22, 1388–1400. [Google Scholar] [CrossRef]
- Liu, D.; Sun, D.-W.; Zeng, X.-A. Recent advances in wavelength selection techniques for hyperspectral image processing in the food industry. Food Bioprocess Technol. 2014, 7, 307–323. [Google Scholar] [CrossRef]
- Chong, J.; Wishart, D.S.; Xia, J. Using MetaboAnalyst 4.0 for Comprehensive and Integrative Metabolomics Data Analysis. Curr. Protoc. Bioinform. 2019, 68, e86. [Google Scholar] [CrossRef]
- Ladha, L.; Deepa, T. Feature selection methods and algorithms. Int. J. Comput. Sci. Eng. 2011, 3, 1787–1797. [Google Scholar]
- Saeys, Y.; Inza, I.; Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23, 2507–2517. [Google Scholar] [CrossRef]
- Hall, M.A. Correlation-Based Feature Selection for Machine Learning. Doctoral Dissertation, The University of Waikato, Hamilton, New Zealand, 1999. [Google Scholar]
- Hukerikar, S.; Tumma, A.; Nikam, A.; Attar, V. SkewBoost: An algorithm for classifying imbalanced datasets. In Proceedings of the 2nd International Conference on Computer and Communication Technology (ICCCT), Allahabad, India, 15–17 September 2011. [Google Scholar]
- Longadge, R.; Dongre, S. Class Imbalance Problem in Data Mining Review. arXiv 2013, arXiv:1305.1707. [Google Scholar]
- Eavis, T.; Japkowicz, N. A recognition-based alternative to discrimination-based multi-layer perceptrons. In Advances in Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2000; pp. 280–292. [Google Scholar]
- Raskutti, B.; Kowalczyk, A. Extreme re-balancing for SVMs: A case study. ACM SIGKDD Explor. Newsl. 2004, 6, 60–69. [Google Scholar] [CrossRef]
- Spinosa, E.J.; de Carvalho, A.C. Combining one-class classifiers for robust novelty detection in gene expression data. In Advances in Bioinformatics and Computational Biology; Springer: Berlin/Heidelberg, Germany, 2005; pp. 54–64. [Google Scholar]
- Yu, M.; Naqvi, S.M.; Rhuma, A.; Chambers, J. Fall detection in a smart room by using a fuzzy one class support vector machine and imperfect training data. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011. [Google Scholar]
- Manevitz, L.; Yousef, M. One-class SVMs for document classification. J. Mach. Learn. Res. 2002, 2, 139–154. [Google Scholar]
- Manevitz, L.; Yousef, M. One-class document classification via neural networks. Neurocomputing 2007, 70, 1466–1481. [Google Scholar] [CrossRef]
- Hayashi, T.; Fujita, H. One-Class Ensemble Classifier for Data Imbalance Problems. Appl. Intell. 2022, 52, 17073–17089. [Google Scholar] [CrossRef]
- Elkan, C. The foundations of cost-sensitive learning. In Proceedings of the International Joint Conference on Artificial Intelligence, Seattle, WA, USA, 4–10 August 2001. [Google Scholar]
- El-Amir, S.; El-Henawy, I. An Improved Model Using Oversampling Technique and Cost-Sensitive Learning for Imbalanced Data Problem. Inf. Sci. Appl. 2024, 2, 33–50. [Google Scholar] [CrossRef]
- Alejo, R.; García, V.; Sotoca, J.M.; Mollineda, R.A.; Sánchez, J.S. Improving the performance of the RBF neural networks trained with imbalanced samples. In Proceedings of the Computational and Ambient Intelligence, San Sebastián, Spain, 20–22 June 2007; pp. 162–169. [Google Scholar]
- Ling, C.X.; Yang, Q.; Wang, J.; Zhang, S. Decision trees with minimal costs. In Proceedings of the Twenty-First International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004. [Google Scholar]
- Nguyen, C.; Ho, T. An imbalanced data rule learner. In Knowledge Discovery in Databases: PKDD 2005, Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, 3–7 October 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 617–624. [Google Scholar]
- Zhou, Z.-H.; Liu, X.-Y. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans. Knowl. Data Eng. 2006, 18, 63–77. [Google Scholar] [CrossRef]
- Weiss, G.M. Mining with rarity: A unifying framework. ACM SIGKDD Explor. Newsl. 2004, 6, 7–19. [Google Scholar] [CrossRef]
- Li, S.; Song, L.; Wu, X.; Hu, Z.; Cheung, Y.; Yao, X. Multi-Class Imbalance Classification Based on Data Distribution and Adaptive Weights. IEEE Trans. Knowl. Data Eng. 2024, 5265–5279. [Google Scholar] [CrossRef]
- Polikar, R. Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 2006, 6, 21–45. [Google Scholar] [CrossRef]
- Kuncheva, L.I.; Rodríguez, J.J. A weighted voting framework for classifiers ensembles. Knowl. Inf. Syst. 2014, 38, 259–275. [Google Scholar] [CrossRef]
- Liu, X.; Wu, J.; Zhou, Z. Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2009, 39, 539–550. [Google Scholar]
- Wang, S.; Yao, X. Relationships between diversity of classification ensembles and single-class performance measures. IEEE Trans. Knowl. Data Eng. 2013, 25, 206–219. [Google Scholar] [CrossRef]
- Sun, Y.; Kamel, M.S.; Wong, A.K.; Wang, Y. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit. 2007, 40, 3358–3378. [Google Scholar] [CrossRef]
- Van Hulse, J.; Khoshgoftaar, T.M.; Napolitano, A. An empirical comparison of repetitive undersampling techniques. In Proceedings of the IEEE International Conference on Information Reuse & Integration IRI’09, Las Vegas, NV, USA, 10–12 August 2009. [Google Scholar]
- Breiman, L. Stacked regressions. Mach. Learn. 1996, 24, 49–64. [Google Scholar] [CrossRef]
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Schapire, R.E. The strength of weak learnability. Mach. Learn. 1990, 5, 197–227. [Google Scholar] [CrossRef]
- Chawla, N.V.; Lazarevic, A.; Hall, L.O.; Bowyer, K.W. SMOTEBoost: Improving prediction of the minority class in boosting. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery, Dubrovnik, Croatia, 22–26 September 2003. [Google Scholar]
- Tang, Y.; Zhang, Y.-Q.; Chawla, N.V.; Krasser, S. SVMs modeling for highly imbalanced classification. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2009, 39, 281–288. [Google Scholar] [CrossRef]
- Galar, M.; Fernandez, A.; Barrenechea, E.; Bustince, H.; Herrera, F. A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2012, 42, 463–484. [Google Scholar] [CrossRef]
- Barandela, R.; Valdovinos, R.M.; Sánchez, J.S. New applications of ensembles of classifiers. Pattern Anal. Appl. 2003, 6, 245–256. [Google Scholar] [CrossRef]
- Vidyarthi, S.K.; Singh, S.K.; Tiwari, R.; Xiao, H.W.; Rai, R. Classification of first quality fancy cashew kernels using four deep convolutional neural network models. J. Food Process Eng. 2020, 43, e13552. [Google Scholar] [CrossRef]
- Weng, S.; Tang, P.; Yuan, H.; Guo, B.; Yu, S.; Huang, L.; Xu, C. Hyperspectral imaging for accurate determination of rice variety using a deep learning network with multi-feature fusion. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2020, 234, 118237. [Google Scholar] [CrossRef]
- Geng, L.; Yan, T.; Xiao, Z.; Xi, J.; Li, Y. Hatching eggs classification based on deep learning. Multimed. Tools Appl. 2018, 77, 22071–22082. [Google Scholar] [CrossRef]
- Huang, L.; He, A.; Zhai, M.; Wang, Y.; Bai, R.; Nie, X. A Multi-Feature Fusion Based on Transfer Learning for Chicken Embryo Eggs Classification. Symmetry 2019, 11, 606. [Google Scholar] [CrossRef]
- Tsai, C.-F.; Lin, W.-C.; Hu, Y.-H.; Yao, G.-T. Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Inf. Sci. 2018, 477, 47–54. [Google Scholar] [CrossRef]
- Yan, Y.; Zhu, Y.; Liu, R.; Zhang, Y.; Zhang, Y.; Zhang, L. Spatial Distribution-based Imbalanced Undersampling. IEEE Trans. Knowl. Data Eng. 2022, 6376–6391. [Google Scholar] [CrossRef]
- Sun, Y.; Cai, L.; Liao, B.; Zhu, W.; Xu, J. A Robust Oversampling Approach for Class Imbalance Problem with Small Disjuncts. IEEE Trans. Knowl. Data Eng. 2022, 5550–5562. [Google Scholar] [CrossRef]
- Han, M.; Guo, H.; Li, J.; Wang, W. Global-local information based oversampling for multi-class imbalanced data. Int. J. Mach. Learn. Cybern. 2022, 14, 2071–2086. [Google Scholar] [CrossRef]
- Fan, S.; Zhang, X.; Song, Z. Imbalanced Sample Selection with Deep Reinforcement Learning for Fault Diagnosis. IEEE Trans. Ind. Informatics 2021, 18, 2518–2527. [Google Scholar] [CrossRef]
- Sahani, M.; Dash, P.K. FPGA-Based Online Power Quality Disturbances Monitoring Using Reduced-Sample HHT and Class-Specific Weighted RVFLN. IEEE Trans. Ind. Informatics 2019, 15, 4614–4623. [Google Scholar] [CrossRef]
- Cao, B.; Liu, Y.; Hou, C.; Fan, J.; Zheng, B.; Yin, J. Expediting the Accuracy-Improving Process of SVMs for Class Imbalance Learning. IEEE Trans. Knowl. Data Eng. 2020, 33, 3550–3567. [Google Scholar] [CrossRef]
- Lu, Y.; Cheung, Y.-M.; Tang, Y.Y. Adaptive Chunk-Based Dynamic Weighted Majority for Imbalanced Data Streams with Concept Drift. IEEE Trans. Neural Networks Learn. Syst. 2019, 31, 2764–2778. [Google Scholar] [CrossRef]
- Yang, K.; Yu, Z.; Chen, C.P.; Cao, W.; You, J.; Wong, H.-S. Incremental weighted ensemble broad learning system (BLS) for imbalanced data. IEEE Trans. Knowl. Data Eng. 2021, 34, 5809–5824. [Google Scholar] [CrossRef]
- Pan, T.; Zhao, J.; Wu, W.; Yang, J. Learning imbalanced datasets based on SMOTE and Gaussian distribution. Inf. Sci. 2020, 512, 1214–1233. [Google Scholar] [CrossRef]
- Saglam, F.; Cengiz, M.A. A novel SMOTE-based resampling technique through noise detection and the boosting procedure. Expert Syst. Appl. 2022, 200, 117023. [Google Scholar] [CrossRef]
- Razavi-Far, R.; Farajzadeh-Zanjani, M.; Wang, B.; Saif, M.; Chakrabarti, S. Imputation-based Ensemble Techniques for Class Imbalance Learning. IEEE Trans. Knowl. Data Eng. 2019, 33, 1988–2001. [Google Scholar] [CrossRef]
- Dixit, A.; Mani, A. Sampling technique for noisy and borderline examples problem in imbalanced classification. Appl. Soft Comput. 2023, 142, 110361. [Google Scholar] [CrossRef]
- Chen, W.; Yang, K.; Yu, Z.; Shi, Y.; Chen, C. A Survey on Imbalanced Learning: Latest Research, Applications and Future Directions. Artif. Intell. Rev. 2024, 57, 1–51. [Google Scholar] [CrossRef]
- François, D. Binary classification performances measure cheat sheet. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
- Soleymani, R.; Granger, E.; Fumera, G. F-Measure Curves: A Tool to Visualize Classifier Performance under Imbalance. Pattern Recognit. 2020, 100, 107146. [Google Scholar] [CrossRef]
- Kubat, M.; Holte, R.C.; Matwin, S. Machine learning for the detection of oil spills in satellite radar images. Mach. Learn. 1998, 30, 195–215. [Google Scholar] [CrossRef]
- Japkowicz, N. Assessment Metrics for Imbalanced Learning. In Imbalanced Learning: Foundations, Algorithms, and Applications; IEEE: New York, NY, USA, 2013; pp. 187–206. [Google Scholar]
- Egan, J. Signal detection theory and ROC analysis. In Series in Cognition and Perception; Academic Press: New York, NY, USA, 1975. [Google Scholar]
- Swets, J.A.; Dawes, R.M.; Monahan, J. Better decisions through science. Sci. Am. 2000, 283, 82–87. [Google Scholar] [CrossRef]
- Swets, J.A. Measuring the accuracy of diagnostic systems. Science 1988, 240, 1285–1293. [Google Scholar] [CrossRef]
- Ghosal, S. Impact of Methodological Assumptions and Covariates on the Cutoff Estimation in ROC Analysis. arXiv 2024, arXiv:2404.13284. [Google Scholar]
- Spackman, K.A. Signal detection theory: Valuable tools for evaluating inductive learning. In Proceedings of the Sixth International Workshop on Machine Learning; Springer: Berlin/Heidelberg, Germany, 1989. [Google Scholar]
- Provost, F.J.; Fawcett, T. Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In Proceedings of the KDD, Newport Beach, CA, USA, 14–17 August 1997. [Google Scholar]
- Provost, F.J.; Fawcett, T.; Kohavi, R. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the ICML, Madison, WI, USA, 24–27 July 1998. [Google Scholar]
- Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
- Brown, C.D.; Davis, H.T. Receiver operating characteristics curves and related decision measures: A tutorial. Chemom. Intell. Lab. Syst. 2006, 80, 24–38. [Google Scholar] [CrossRef]
- Ozcan, E.C.; Görgülü, B.; Baydogan, M.G. Column Generation-Based Prototype Learning for Optimizing Area under the Receiver Operating Characteristic Curve. Eur. J. Oper. Res. 2024, 314, 297–307. [Google Scholar] [CrossRef]
- Aguilar-Ruiz, J.S. Beyond the ROC Curve: The IMCP Curve. Analytics 2024, 3, 221–224. [Google Scholar] [CrossRef]
- Bradley, A.P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997, 30, 1145–1159. [Google Scholar] [CrossRef]
- Xia, J.; Broadhurst, D.I.; Wilson, M.; Wishart, D.S. Translational biomarker discovery in clinical metabolomics: An introductory tutorial. Metabolomics 2013, 9, 280–299. [Google Scholar] [CrossRef]
- Davis, J.; Goadrich, M. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006. [Google Scholar]
- Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef]
- Riyanto, S.; Imas, S.S.; Djatna, T.; Atikah, T.D. Comparative Analysis Using Various Performance Metrics in Imbalanced Data for Multi-Class Text Classification. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 1082–1090. [Google Scholar] [CrossRef]
- Hand, D.J. Measuring Classifier Performance: A Coherent Alternative to the Area under the ROC Curve. Mach. Learn. 2009, 77, 103–123. [Google Scholar] [CrossRef]
- Ferri, C.; Hernández-Orallo, J.; Flach, P.A. A Coherent Interpretation of AUC as a Measure of Aggregated Classification Performance. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), Bellevue, WA, USA, 28 June 2011–2 July 2011; pp. 657–664. [Google Scholar]
- Cárdenas, A.A.; Baras, J.S. B-ROC Curves for the Assessment of Classifiers over Imbalanced Data Sets. In Proceedings of the National Conference on Artificial Intelligence, Boston, MA, USA, 16–20 July 2006; Volume 21, p. 1581. [Google Scholar]
- Ranawana, R.; Palade, V. Optimized Precision-a New Measure for Classifier Performance Evaluation. In Proceedings of the IEEE International Conference on Evolutionary Computation, Vancouver, BC, Canada, 16–21 July 2006; pp. 2254–2261. [Google Scholar]
- Batuwita, R.; Palade, V. A New Performance Measure for Class Imbalance Learning: Application to Bioinformatics Problems. In Proceedings of the IEEE International Conference on Machine Learning and Applications, Miami, FL, USA, 13–15 December 2009; pp. 545–550. [Google Scholar]
Data | Majority | Minority | Ratio | ZeroR Acc | Ref. |
---|---|---|---|---|---|
CICIDS2017 | 2,800,000 | 14 | 1:200,000 | 99.9% | [16] |
UNSW-NB15 | 2,540,044 | 9 | 1:282,000 | 99.9% | [17] |
KDD99 | 4,898,430 | 4 | 1:1,200,000 | 99.9% | [17] |
CSE-CIC-IDS2018 | 16,200,000 | 6 | 1:2,700,000 | 99.9% | [18]
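The ZeroR column reports the accuracy of a trivial majority-class baseline, which follows directly from the class counts alone; a minimal sketch (the function name is hypothetical, for illustration only):

```python
# ZeroR (majority-class) baseline: predict the majority class for every
# sample; its accuracy equals the majority-class share of the data.
def zeror_accuracy(n_majority: int, n_minority: int) -> float:
    return n_majority / (n_majority + n_minority)

# CICIDS2017-style counts from the table above (2.8M majority vs. 14 minority)
acc = zeror_accuracy(2_800_000, 14)
print(f"{acc:.4%}")  # 99.9995% accuracy while detecting no minority instances
```

This is why plain accuracy is reported as ~99.9% for all four datasets even though such a baseline has zero minority-class recall.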
Data | Classifier | Accuracy PH, AH (%) | F1-Score PH, AH (%) | Recall PH, AH (%) | Precision PH, AH (%) | Ref. |
---|---|---|---|---|---|---|
CICIDS2017 | LightGBM | 99.86, 99.91 | -, - | -, - | -, - | [19] |
CICIDS2017 | AdaBoost | -, 81.83 | -, 90.01 | -, 100.00 | -, 81.83 | [20] |
UNSW-NB15 | KNN | 84.00, 95.10 | 53.30, 95.10 | 51.30, 95.70 | 57.80, 94.80 | [9]
UNSW-NB15 | RF | 84.00, 95.10 | 53.30, 95.10 | 51.30, 95.70 | 57.80, 94.80 | [9]
UNSW-NB15 | GRU | 57.00, 77.90 | 71.30, 79.00 | 97.30, 83.20 | 56.30, 75.30 | [21]
CSE-CIC-IDS2018 | KNN | 98.52, 98.80 | 98.89, 98.00 | 98.52, 98.08 | 99.28, 97.92 | [22]
CSE-CIC-IDS2018 | AdaBoost | 99.69, 99.60 | 99.70, 99.60 | 99.69, 99.61 | 99.70, 99.60 | [22]
KDD99 | CNN | 92.30, 95.20 | 95.20, 94.90 | 91.00, 90.70 | 99.80, 99.50 | [21]
KDD99 | LSTM | 91.80, 95.40 | 94.70, 95.10 | 91.10, 91.40 | 98.60, 99.40 | [21]
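The reported scores can be cross-checked for internal consistency: F1 is the harmonic mean of precision and recall, so any two of the three determine the third. A quick check against the CICIDS2017 AdaBoost row (illustrative only):

```python
# F1 as the harmonic mean of precision and recall.
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# AdaBoost on CICIDS2017: precision 81.83%, recall 100.00%
f1 = f1_score(0.8183, 1.0)
print(f"{f1:.4f}")  # 0.9001, matching the reported F1 of 90.01%
```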
Data | Majority | Minority | Ratio | ZeroR | SEN | SPE | AUC | F1-Score |
---|---|---|---|---|---|---|---|---|
S1 | 8807 | 400 | 1:22 | 95.70% | 99.50% | 0.30% | 50.90% | 49.00% |
S1 + S2 (S2) | 17,614 | 800 | 1:22 | 95.70% | 99.10% | 2.90% | 62.50% | 51.10% |
S1 + SMOTE | 8807 | 6400 | 1:1.4 | 57.90% | 93.10% | 97.10% | 97.90% | 94.70% |
S2 + SMOTE | 17,614 | 12,800 | 1:1.4 | 57.90% | 93.00% | 98.10% | 98.10% | 95.10% |
S1 + SMOTE + Ru | 6400 | 6400 | 1:1 | 50.00% | 90.30% | 97.50% | 97.50% | 93.90% |
S2 + SMOTE + Ru | 12,800 | 12,800 | 1:1 | 50.00% | 90.10% | 98.30% | 97.50% | 94.20% |
S1 + SMOTE + RAND | 8807 | 6400 | 1:1.4 | 57.90% | 93.20% | 97.10% | 97.90% | 94.80% |
S2 + SMOTE + RAND | 17,614 | 12,800 | 1:1.4 | 57.90% | 93.10% | 98.10% | 98.10% | 95.10% |
S1 + SMOTE + RAND + Ru | 6400 | 6400 | 1:1 | 50.00% | 90.10% | 97.50% | 97.40% | 93.80% |
S2 + SMOTE + RAND + Ru | 12,800 | 12,800 | 1:1 | 50.00% | 90.30% | 98.30% | 97.50% | 94.30% |
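The SMOTE columns above would, in practice, be produced with a standard implementation such as imbalanced-learn. The following is only a minimal, self-contained sketch of the core SMOTE idea — generating a synthetic sample by interpolating between a minority sample and one of its k nearest minority-class neighbours — with all data and parameter values purely illustrative:

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen minority sample and one of its k nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x (squared Euclidean distance)
        neighbours = sorted(
            (p for p in minority if p != x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=8, k=2)
print(len(new_points))  # 8 synthetic samples inside the minority region
```

Because every synthetic point lies on a segment between two genuine minority samples, SMOTE enlarges the minority region without the exact duplication that makes simple random oversampling prone to overfitting.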
Category | Advantages | Disadvantages | Ref. |
---|---|---|---|
Data | Classifier-independent; preserves class instance information | Prone to overfitting due to exact duplication in simple oversampling | [42,73] |
| Performance improvement for ensemble classifiers (sampling + clustering + feature selection) | Computational complexity and data dependency | [112] |
| Selection of representative prevalent samples (spatial undersampling) | Does not fit the entire distribution of the original unclassified data | [113] |
| Solves the small-disjuncts problem in oversampling | Difficulty handling big datasets and multi-class imbalance | [114] |
| Considers selective generation strategies and intra-class data variation (global-local-based oversampling) | Model complexity and data dependency | [115] |
| Performance improvement via autonomous sample selection (sampling strategies + reinforcement learning) | Reliance on data feedback and algorithm complexity | [116] |
| Efficient in high-dimensional, noisy imbalanced feature spaces (one-class) | Recognition-based learning is algorithm-specific | [39] |
| Avoids the use of fake and synthetic data (one-class learning) | The learned class must be large enough for the algorithm to recognize its inherent discriminating features | |
Algorithm | Versatile in practical situations where misclassification costs are paramount (cost-sensitive) | Prone to overfitting; also assumes misclassification costs are known | [39,93] |
| Very effective when large amounts of data are available (deep learning) | Sufficiently large training and validation data are often unavailable in practice | [66] |
| Immune to high noise, with low computational complexity | Vulnerable to shifts in data distribution | [117] |
| Improved F1-score and AUC on a rare class (cost-sensitive SVM) | Cannot handle multiclass imbalance or big datasets | [118] |
| Able to handle dynamic imbalance problems | Model complexity and long run times | [119] |
Hybrid | Low complexity; simple and efficient (adaptive weighted ensemble BLS) | Cannot tackle multi-class imbalance | [120] |
| Effective categorization of extreme data imbalance | Complex models and long run times | [121] |
| Noise detection in imbalanced distributions | Poor adaptability to multiclass imbalance | [122] |
| Considers base classifiers' diversity | Cannot tackle multiclass imbalance | [123] |
| Effective resolution of boundary samples and noise | Long run times | [124] |
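For the cost-sensitive entries above, a common fallback when misclassification costs are unknown is inverse-frequency class weighting (the "balanced" heuristic popularized by scikit-learn's class_weight option). A minimal sketch, with label names and counts purely illustrative:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights w_c = N / (k * n_c): errors on rarer
    classes are penalised proportionally more during training."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

# Illustrative 22:1 imbalance, matching the S1 ratio reported earlier
labels = ["majority"] * 22 + ["minority"] * 1
weights = balanced_class_weights(labels)
# the minority class is weighted ~22x more heavily than the majority class
```

This turns the imbalance ratio itself into a surrogate cost matrix, which is precisely why the approach rests on the assumption, flagged in the table, that the chosen costs reflect reality.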
True Class \ Prediction Class | Predicted as Positive | Predicted as Negative |
---|---|---|
Actual Positive | TPR (True Positive Rate) | FNR (False Negative Rate) |
Actual Negative | FPR (False Positive Rate) | TNR (True Negative Rate) |
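The four rates in the matrix follow directly from the raw confusion-matrix counts; a minimal sketch with illustrative counts:

```python
def confusion_rates(tp, fn, fp, tn):
    """Row-normalised rates from raw confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # sensitivity / recall
        "FNR": fn / (tp + fn),
        "FPR": fp / (fp + tn),
        "TNR": tn / (fp + tn),  # specificity
    }

# Illustrative counts: 100 actual positives and 100 actual negatives
rates = confusion_rates(tp=90, fn=10, fp=5, tn=95)
print(rates["TPR"], rates["FPR"])  # 0.9 0.05
```

Since each row of the matrix is normalised by its own class size (TPR + FNR = 1 and FPR + TNR = 1), these rates are unaffected by how skewed the class distribution is — the property the ROC analysis in Section 4.1 relies on.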
Category | Advantages | Disadvantages |
---|---|---|
Ranking | Performance in each category is broken down into two separate measures, with multi-class interpretability at a glance (ROC) | The validity of ROC analysis depends on the false and true positive rates being invariant to class skewness; under changing class distributions the analysis may not be fully trusted, and the precision-recall curve fares better |
| Can evaluate eventualities in various situations, including when the domain imbalance ratio is not exactly known (ROC) | Difficult to read when the cost/imbalance ratio is known |
| Good single-number summary statistic (AUC) | Loses cogent information on algorithm performance over the whole operating range |
Threshold | Not affected by class imbalance, since the correctly classified proportions of the classes are identified separately (sensitivity/specificity) | More difficult to process as a measure for the combined classes than for each single class |
| A good measure over all predicted instances assigned to a given class (sensitivity/specificity) | Misses the proportion of instances assigned to a given class that actually belong to that class |
| Precision identifies the proportion of predicted instances assigned to a given class (usually the positive class) that actually belong to that class | Must be considered together with recall to understand overall classifier performance on the positive class; gives no information on the negative class |
| Single metric considering classifier performance on both the positive and the negative classes (G-mean) | Usage limited to classes assumed to be of equal importance |
| Single metric combining precision and recall (F-measure) | Applicable to a single class (usually the positive class) at a time |
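The G-mean and F-measure rows reduce to one-line formulas. The sketch below applies them to the S1 row of the earlier SMOTE table (SEN 99.5%, SPE 0.3%) to show how the G-mean exposes a classifier that accuracy alone would flatter; the precision/recall values in the F-measure call are purely illustrative:

```python
from math import sqrt

def g_mean(sensitivity, specificity):
    """Geometric mean of the two per-class accuracies."""
    return sqrt(sensitivity * specificity)

def f_measure(precision, recall, beta=1.0):
    """F-beta score for the positive class; beta = 1 gives the usual F1."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# S1 row of the SMOTE table: SEN = 99.5%, SPE = 0.3%
print(round(g_mean(0.995, 0.003), 3))  # 0.055 -- despite ~95.7% accuracy
print(round(f_measure(0.9, 0.8), 3))   # 0.847
```

Because the geometric mean collapses toward zero when either class accuracy does, a G-mean of about 0.055 makes the near-total failure on the minority class impossible to overlook.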
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Adegbenjo, A.O.; Ngadi, M.O. Handling the Imbalanced Problem in Agri-Food Data Analysis. Foods 2024, 13, 3300. https://doi.org/10.3390/foods13203300