Enhanced Input-Doubling Method Leveraging Response Surface Linearization to Improve Classification Accuracy in Small Medical Data Processing
Abstract
1. Introduction
- We improved the input-doubling method for classification tasks on limited data by linearizing the response surface: the number of independent attributes in the augmented dataset is expanded with the probabilities of each vector belonging to each class of the task.
- We developed an algorithmic implementation of the improved input-doubling method using two Naïve Bayes classifiers with different inference procedures and provided flowcharts of its training and application algorithms.
- We investigated the effectiveness of the improved input-doubling method using both different Naïve Bayes algorithms and various combinations of them. By comparing it with machine learning methods from different classes, we demonstrated that it achieves the highest performance on the heart attack risk assessment task when processing a small dataset.
2. Materials and Methods
2.1. Classical Input-Doubling Method
2.2. Enhanced Input-Doubling Classifier
2.2.1. Augmentation Procedure
- Compute the probabilities of each vector in the training dataset belonging to each class using the first Naïve Bayes classifier.
- Increase the number of independent features for each vector in the original training dataset by adding the probabilities obtained in the previous step.
- Create an augmented dataset by concatenating all possible pairs of vectors using the expanded features from the previous step.
- Form the dependent attribute for each expanded vector in the augmented dataset as the difference between the dependent variables of the two vectors in each concatenated pair (a code sketch of these four steps follows the list).
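To make the four steps concrete, here is a minimal sketch of the augmentation procedure in NumPy/scikit-learn. It is an illustration under stated assumptions, not the paper's reference implementation: GaussianNB stands in for the first Naïve Bayes classifier (the paper evaluates several variants), and the names `build_augmented_dataset`, `X_aug`, and `y_aug` are ours. The expanded training matrix `X_ext` is returned as well, since it is reused at inference time.

```python
# Minimal sketch of the augmentation procedure, assuming GaussianNB as the
# first classifier, a NumPy feature matrix X, and a label vector y.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def build_augmented_dataset(X, y):
    nb1 = GaussianNB().fit(X, y)               # first Naive Bayes classifier
    P = nb1.predict_proba(X)                   # step 1: class-membership probabilities
    X_ext = np.hstack([X, P])                  # step 2: expanded feature vectors

    n = X_ext.shape[0]
    idx = np.arange(n)
    i, j = np.repeat(idx, n), np.tile(idx, n)  # all n^2 ordered index pairs

    X_aug = np.hstack([X_ext[i], X_ext[j]])    # step 3: concatenated vector pairs
    y_aug = y[i] - y[j]                        # step 4: differences of dependent variables
    return nb1, X_ext, X_aug, y_aug
```

For a binary 0/1 target, the differences take values in {−1, 0, 1}, so the second classifier is trained on `(X_aug, y_aug)` as a three-class problem.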
2.2.2. Naïve Bayes Classifiers
- Simple mathematical model.
- Clear and interpretable workings of the method.
- Fast processing capabilities.
- Low memory and computational requirements.
- Effective results with relatively small datasets.
- Ability to provide results both as a set of probabilities for each class (required for the first classifier in the enhanced method) and as class labels (needed for the second classifier in the enhanced input-doubling method); a brief illustration of these two output modes follows the list.
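For readers less familiar with the scikit-learn Naïve Bayes API, the following toy example (data and values are ours, purely illustrative) shows the two output modes the enhanced method relies on:

```python
# Illustrative toy example of the two output modes used by the enhanced method:
# probabilities (first classifier) versus hard class labels (second classifier).
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[0.1, 1.2], [0.3, 0.9], [2.1, 0.2], [1.9, 0.4]])
y = np.array([0, 0, 1, 1])

nb = GaussianNB().fit(X, y)
print(nb.predict_proba(X[:1]))  # per-class probabilities, e.g. [[0.99 0.01]]
print(nb.predict(X[:1]))        # hard class label, e.g. [0]
```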
2.2.3. Application Procedure
- Compute the probabilities of the current vector with an unknown dependent attribute belonging to each class of the task using the first pre-trained Naïve Bayes classifier. Note that this step is performed on the initial (non-augmented) dataset.
- Expand the number of independent attributes of the current vector by adding the two probabilities found in the previous step (two because this is a binary classification task; a task with more classes would add correspondingly more attributes).
- Concatenate the expanded current vector with all probability-expanded vectors from the training dataset to form a temporary dataset.
- Apply the second Naïve Bayes classifier, pre-trained on the augmented dataset from Figure 3, to obtain the output signals for the temporary dataset from the previous step.
- For each pair, sum the predicted output signal with the corresponding known dependent attribute from the initial training dataset, then average these sums to form the desired result (see the sketch after this list).
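A minimal inference sketch matching these five steps, reusing the names from the training sketch above: `nb2` is the second Naïve Bayes classifier, assumed pre-trained on `(X_aug, y_aug)`, and the final rounding to a class label is our assumption for a binary 0/1 task.

```python
# Minimal sketch of the application procedure; names follow the training sketch.
import numpy as np

def predict_one(x_new, nb1, nb2, X_ext, y_train):
    p = nb1.predict_proba(x_new.reshape(1, -1))         # step 1: probabilities
    x_ext = np.hstack([x_new, p.ravel()])               # step 2: expand the vector
    temp = np.hstack([np.tile(x_ext, (len(X_ext), 1)),  # step 3: pair it with every
                      X_ext])                           #   expanded training vector
    deltas = nb2.predict(temp)                          # step 4: predicted differences
    sums = deltas + y_train                             # step 5: sum with known labels,
    return int(round(float(np.mean(sums))))             #   average, map to a class label
```

Because each training pair was labeled with the difference of dependent variables, a predicted difference plus the known training label estimates the unknown label; averaging these estimates over all training vectors yields the result.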
3. Modeling and Results
3.1. Dataset Description
3.2. Performance Indicators
- Precision.
- Recall.
- F1-score.
- Cohen’s kappa.
- Matthews correlation coefficient (MCC); a short computation sketch for all five indicators follows this list.
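All five indicators are available in scikit-learn; the following minimal sketch uses toy labels of our own, purely for illustration:

```python
# Sketch: computing the five reported performance indicators with scikit-learn.
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             cohen_kappa_score, matthews_corrcoef)

y_true = [1, 0, 1, 1, 0, 1]  # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1]  # toy predictions

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
print("Kappa:    ", cohen_kappa_score(y_true, y_pred))
print("MCC:      ", matthews_corrcoef(y_true, y_pred))
```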
3.3. Optimal Parameters Selection
3.3.1. Investigation of the Effectiveness of Using Identical Naïve Bayes Classifiers at Both Stages of the Enhanced Method
3.3.2. Investigation of the Effectiveness of Using Different Naïve Bayes Classifiers at Both Stages of the Enhanced Method
3.4. Results
4. Discussion
- AdaBoost;
- Gradient Boosting;
- Decision Tree;
- Bagged Decision Tree;
- Random Forest;
- Gaussian Naive Bayes;
- Bagged K-Nearest Neighbors;
- K-Nearest Neighbors;
- XGBoost;
- Support Vector Machine (with RBF kernel); an instantiation sketch for these baselines follows the list.
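For reproducibility, the ten baselines can be instantiated as follows. This is our sketch using scikit-learn/XGBoost defaults; the hyperparameter settings used in the paper may differ.

```python
# Sketch: the ten baseline classifiers from the comparison, with default
# hyperparameters (the paper's tuned settings may differ).
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              BaggingClassifier, RandomForestClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

baselines = {
    "AdaBoost": AdaBoostClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Bagged Decision Tree": BaggingClassifier(DecisionTreeClassifier()),
    "Random Forest": RandomForestClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Bagged K-Nearest Neighbors": BaggingClassifier(KNeighborsClassifier()),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "XGBoost": XGBClassifier(),
    "Support Vector Machine (RBF)": SVC(kernel="rbf"),
}
```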
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
| Gaussian NB | Multinomial NB | Complement NB | Bernoulli NB |
| --- | --- | --- | --- |
| It operates based on a Gaussian probability distribution for each class, calculating the mean and standard deviation of each feature within each class. This method is commonly used for classification tasks with numerical features. | It employs a multinomial distribution to compute class probabilities. This technique is particularly suited to discrete data and is frequently used in text classification. | An adaptation of the Multinomial Naive Bayes (MNB) classifier, specifically designed for handling imbalanced datasets. It incorporates statistics with class augmentation to compute model weights and is also commonly used for text classification. | It utilizes a multivariate Bernoulli distribution, which requires the data to be represented as feature vectors with binary values. If the input data is of a different type, it can convert it into a binary format. |
| Advantages: 1. Performs well with numerical features. 2. Fast in both training and prediction. Disadvantages: 1. Struggles with categorical features. | Advantages: 1. Highly effective for text and categorical data. Disadvantages: 1. Requires data to be in a discrete format. | Advantages: 1. Effective with imbalanced datasets. 2. Often outperforms standard MNB in text classification tasks. | Advantages: 1. Performs well with small datasets. Disadvantages: 1. May lose information, as it only works with binary features. |
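As a complement to the table, the four variants are available in scikit-learn under the names below. The toy data is ours; the features are kept non-negative because MultinomialNB and ComplementNB reject negative input values.

```python
# Sketch: the four Naive Bayes variants from the table, as implemented in
# scikit-learn, fitted on toy data for illustration only.
import numpy as np
from sklearn.naive_bayes import BernoulliNB, ComplementNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
X = rng.random((20, 4))              # non-negative toy features in [0, 1)
y = np.array([0] * 10 + [1] * 10)

for model in (GaussianNB(), MultinomialNB(), ComplementNB(), BernoulliNB(binarize=0.5)):
    model.fit(X, y)
    print(type(model).__name__, model.predict_proba(X[:1]).round(3))
```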
| Classifier | The Gaussian Naive Bayes Classifier | | The Bernoulli Naive Bayes Classifier | | The Multinomial Naive Bayes Classifier | | The Complement Naive Bayes Classifier | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mode | Train | Test | Train | Test | Train | Test | Train | Test |
| Precision | 0.792 | 0.797 | 0.647 | 0.648 | 0.634 | 0.637 | 0.691 | 0.693 |
| Recall | 0.931 | 0.926 | 0.994 | 0.994 | 0.997 | 0.994 | 0.980 | 0.975 |
| F1-score | 0.856 | 0.855 | 0.784 | 0.783 | 0.775 | 0.775 | 0.810 | 0.808 |
| Cohen’s kappa | 0.651 | 0.644 | 0.368 | 0.354 | 0.329 | 0.320 | 0.477 | 0.467 |
| MCC | 0.664 | 0.659 | 0.468 | 0.458 | 0.440 | 0.429 | 0.543 | 0.538 |
| Classifier 1 | The Gaussian Naive Bayes Classifier | | The Bernoulli Naive Bayes Classifier | | The Multinomial Naive Bayes Classifier | | The Complement Naive Bayes Classifier | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mode | Train | Test | Train | Test | Train | Test | Train | Test |
| Precision | 0.792 | 0.797 | 0.732 | 0.735 | 0.726 | 0.722 | 0.726 | 0.729 |
| Recall | 0.931 | 0.926 | 0.973 | 0.969 | 0.971 | 0.963 | 0.971 | 0.963 |
| F1-score | 0.856 | 0.855 | 0.835 | 0.834 | 0.831 | 0.823 | 0.831 | 0.828 |
| Cohen’s kappa | 0.651 | 0.644 | 0.566 | 0.558 | 0.552 | 0.527 | 0.552 | 0.542 |
| MCC | 0.664 | 0.659 | 0.610 | 0.601 | 0.597 | 0.576 | 0.597 | 0.588 |
| Performance Indicators | Train | Test |
| --- | --- | --- |
| Precision | 0.8 | 0.8 |
| Recall | 0.93 | 0.93 |
| F1-score | 0.86 | 0.86 |
| Cohen’s kappa | 0.65 | 0.64 |
| MCC | 0.66 | 0.66 |