Improvement of the ANN-Based Prediction Technology for Extremely Small Biomedical Data Analysis
Abstract
1. Introduction
- We improved the neural network-based technology without training for solving regression tasks in the case of the analysis of extremely small datasets using response surface linearization principles, which significantly increased prediction accuracy.
- We optimized the parameters of the improved technology using the Differential Evolution method, which substantially reduced both its operation time and approximation errors in solving the stated tasks.
- We validated the improved prediction technology on two extremely small datasets in the field of biomedical engineering and, through comparison with well-known methods, established that it achieves the highest accuracy.
2. State of the Art
- Excessive data augmentation can lead to the creation of a dataset that appears larger but actually contains a significant number of duplicate or similar records, which can increase the risk of model overfitting.
- Augmented data may not reflect the real data distribution. In particular, randomly adding or removing values can create inadequate or non-existent correlations, leading to bias in analysis and deterioration of model performance.
- Augmentation methods can generate inconsistent data that do not correspond to real situations (generated samples may not adhere to real physical or logical constraints), which can mislead the model.
- Data augmentation can significantly increase the volume of data, resulting in higher time and resource costs required for model training, which can be critical for systems with limited computational capabilities.
- Methods such as adding noise or altering values may create samples that are very similar to existing ones, significantly reducing the effectiveness of training since the model learns from practically identical examples.
- Increasing the data volume through augmentation can lead to the model becoming too specific to the training dataset and poorly transferring to new data or other domains.
- Tabular data often contain complex dependencies between features, and augmentation may disrupt these dependencies, leading to incorrect data interpretation by the model and, consequently, decreased prediction accuracy.
- During augmentation, errors or anomalies can be introduced into the data, which can lead to incorrect model training and reduced accuracy.
- There are many data augmentation methods, and selecting the best one for a specific dataset can be challenging. Improper selection of the method can reduce data quality and model efficiency.
- There are no universally accepted standards or criteria for evaluating the quality of augmented data, making it difficult to assess their impact on model performance.
1. Form a new augmented support dataset from the existing vectors whose output signals are known. Perform the data augmentation procedure by pairwise concatenating all possible pairs of vectors, including each vector with itself; the output signal of each new vector is the difference between the outputs of the two concatenated vectors.
2. For a test vector with an unknown output signal, create a new temporary dataset by concatenating it with every vector from the initial training set.
3. Predict the output signals for this temporary dataset using a General Regression Neural Network (GRNN), which has the highest generalization properties among existing artificial neural network (ANN) topologies [24]. Use the augmented dataset from Step 1 as the support dataset.
4. Make the final prediction by adding the sum of the predictions obtained in the previous step to the sum of all output signals from the initial training set, and then dividing this total by the number of vectors in the initial training set. This follows ensemble principles, specifically averaging of results, to improve prediction accuracy.
5. Repeat Steps 2–4 for all vectors with unknown output signals.
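The five steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes a single GRNN smoothing parameter `sigma`, and all function names are ours.

```python
import numpy as np

def grnn_predict(X_support, y_support, X_query, sigma=0.5):
    """Minimal GRNN: Gaussian-kernel-weighted average of support outputs."""
    preds = []
    for q in X_query:
        d2 = np.sum((X_support - q) ** 2, axis=1)      # squared Euclidean distances
        w = np.exp(-d2 / (2.0 * sigma ** 2))           # Gaussian kernel weights
        preds.append(np.dot(w, y_support) / (np.sum(w) + 1e-12))
    return np.array(preds)

def predict_without_training(X_train, y_train, X_test, sigma=0.5):
    n = len(X_train)
    # Step 1: augmented support set -- all ordered pairs (i, j),
    # inputs concatenated, output = y_i - y_j.
    X_aug = np.array([np.concatenate([X_train[i], X_train[j]])
                      for i in range(n) for j in range(n)])
    y_aug = np.array([y_train[i] - y_train[j]
                      for i in range(n) for j in range(n)])

    y_pred = []
    for x in X_test:                                   # Step 5: loop over test vectors
        # Step 2: temporary set -- test vector concatenated with each training vector.
        X_tmp = np.array([np.concatenate([x, xi]) for xi in X_train])
        # Step 3: GRNN prediction on the temporary set, supported by the augmented set.
        deltas = grnn_predict(X_aug, y_aug, X_tmp, sigma)
        # Step 4: averaging -- (sum of predicted differences + sum of training outputs) / n.
        y_pred.append((deltas.sum() + y_train.sum()) / n)
    return np.array(y_pred)
```

Note that Step 4 is an average of n single estimates of the form (predicted difference + training output), which is where the ensemble-averaging interpretation comes from.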
3. Materials and Methods
3.1. Data Augmentation Procedure of the Improved ANN-Based Technology without Training
3.2. General Regression Neural Network
- It possesses the highest generalization properties among existing ANN topologies.
- It does not require a training procedure.
- Only one hyperparameter needs to be tuned.
- It operates with high speed.
- It is straightforward to implement.
- Calculation of Euclidean distances between the current vector and all vectors in the reference training set;
- Computation of Gaussian kernel values (weights) from the Euclidean distances;
- Prediction of the desired value as the kernel-weighted average of the reference outputs.
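These three computational stages can be expressed compactly for a single query vector. This is a sketch under our own naming, with `sigma` as the one hyperparameter mentioned above; the small constant in the denominator is our guard against division by zero, not part of the original formulation.

```python
import numpy as np

def grnn(X_ref, y_ref, x, sigma=0.3):
    # 1. Euclidean distances between the current vector and all reference vectors.
    dist = np.linalg.norm(X_ref - x, axis=1)
    # 2. Gaussian kernel values computed from the Euclidean distances.
    gauss = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
    # 3. Desired value: kernel-weighted average of the reference outputs.
    return np.dot(gauss, y_ref) / (np.sum(gauss) + 1e-12)
```

Queries near a reference vector are dominated by that vector's output, which is how the GRNN interpolates without any training procedure.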
3.3. Application Procedure of the Improved ANN-Based Technology without Training
4. Modeling and Results
4.1. Dataset Descriptions
4.2. Performance Indicators
- ME (Maximum residual error): $\mathrm{ME} = \max_i \lvert y_i - \hat{y}_i \rvert$
- MedAE (Median absolute error): $\mathrm{MedAE} = \operatorname{median}_i \lvert y_i - \hat{y}_i \rvert$
- MAE (Mean absolute error): $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$
- MSE (Mean squared error): $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
- MAPE (Mean absolute percentage error): $\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert$
- RMSE (Root mean squared error): $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
- R2 (Coefficient of determination): $R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$
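Under their standard definitions, all seven indicators can be computed with NumPy as follows. This is a sketch; the paper's exact implementation may differ in details such as whether MAPE is reported as a fraction or a percentage.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Standard definitions of the seven indicators listed above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {
        "ME": np.max(np.abs(err)),              # maximum residual error
        "MedAE": np.median(np.abs(err)),        # median absolute error
        "MAE": np.mean(np.abs(err)),            # mean absolute error
        "MSE": mse,                             # mean squared error
        "MAPE": np.mean(np.abs(err / y_true)),  # as a fraction, not a percentage
        "RMSE": np.sqrt(mse),                   # root mean squared error
        "R2": 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2),
    }
```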
4.3. Results
4.3.1. Optimal Parameter Selection
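As stated in the contributions, the parameters of the improved technology (the two smoothing parameters σ1 and σ2 reported in the results table) were selected with the Differential Evolution method. A minimal DE/rand/1/bin sketch is shown below; the stand-in loss function, bounds, and control parameters (F, CR, population size, generations) are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np

def de_minimize(loss, bounds, pop_size=20, gens=60, F=0.8, CR=0.9, seed=0):
    """Minimal DE/rand/1/bin minimizer over box-constrained parameters."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    dim = len(bounds)
    pop = rng.uniform(lo, hi, size=(pop_size, dim))    # random initial population
    fit = np.array([loss(p) for p in pop])
    for _ in range(gens):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            idx = rng.choice([j for j in range(pop_size) if j != i], 3, replace=False)
            a, b, c = pop[idx]
            mutant = np.clip(a + F * (b - c), lo, hi)
            # Binomial crossover with at least one mutated coordinate.
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True
            trial = np.where(cross, mutant, pop[i])
            # Greedy selection.
            f = loss(trial)
            if f < fit[i]:
                pop[i], fit[i] = trial, f
    return pop[np.argmin(fit)], fit.min()

# Hypothetical usage: tuning (sigma1, sigma2). In practice the loss would be a
# validation error of the technology; here it is a stand-in quadratic.
best, best_loss = de_minimize(
    lambda s: (s[0] - 0.06) ** 2 + (s[1] - 3.75) ** 2,
    bounds=[(1e-4, 1.0), (1e-4, 5.0)])
```

In practice the same search could be run with `scipy.optimize.differential_evolution`; the hand-rolled loop above only makes the mutation/crossover/selection mechanics explicit.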
4.3.2. Results
5. Comparison and Discussion
- Formation of the final class label after applying the improved technology using ensemble learning principles, specifically majority voting (rather than averaging of results, as performed in this study).
- The principle of response surface linearization, which underpins the improved technology in this paper, can be implemented differently when solving classification tasks. In the first case, one can expand the initial extremely small dataset with the predicted class label to which the corresponding data vector belongs (as in this study). In the second, more interesting case, the expansion can instead use the set of predicted probabilities of belonging to each class. This approach provides the model with more information, especially when more than three classes exist, which can further enhance the accuracy of express diagnostic tasks on extremely small datasets. Such an implementation is possible using a Probabilistic Neural Network [11].
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Tranquillo, J.V.; Goldberg, J.; Allen, R. Biomedical Engineering Design; Academic Press Series in Biomedical Engineering; Academic Press: London, UK, 2023; ISBN 978-0-12-816444-0. [Google Scholar]
- Berezsky, O.; Pitsun, O.; Liashchynskyi, P.; Derysh, B.; Batryn, N. Computational Intelligence in Medicine. In Lecture Notes in Data Engineering, Computational Intelligence, and Decision Making; Babichev, S., Lytvynenko, V., Eds.; Lecture Notes on Data Engineering and Communications Technologies; Springer International Publishing: Cham, Switzerland, 2023; Volume 149, pp. 488–510. ISBN 978-3-031-16202-2. [Google Scholar]
- Babichev, S.; Škvor, J. Technique of Gene Expression Profiles Extraction Based on the Complex Use of Clustering and Classification Methods. Diagnostics 2020, 10, 584. [Google Scholar] [CrossRef] [PubMed]
- Bodyanskiy, Y.; Vynokurova, O.; Savvo, V.; Tverdokhlib, T.; Mulesa, P. Hybrid Clustering-Classification Neural Network in the Medical Diagnostics of the Reactive Arthritis. IJISA 2016, 8, 1–9. [Google Scholar] [CrossRef]
- Hekler, E.B.; Klasnja, P.; Chevance, G.; Golaszewski, N.M.; Lewis, D.; Sim, I. Why We Need a Small Data Paradigm. BMC Med. 2019, 17, 133. [Google Scholar] [CrossRef] [PubMed]
- Babichev, S. An Evaluation of the Information Technology of Gene Expression Profiles Processing Stability for Different Levels of Noise Components. Data 2018, 3, 48. [Google Scholar] [CrossRef]
- Voronenko, M.A.; Zhunissova, U.M.; Smailova, S.S.; Lytvynenko, L.N.; Savina, N.B.; Mulesa, P.P.; Lytvynenko, V.I. Using Bayesian Methods in the Task of Modeling the Patients’ Pharmacoresistance Development. IAPGOS 2022, 12, 77–82. [Google Scholar] [CrossRef]
- Huang, S.; Deng, H. Data Analytics: A Small Data Approach, 1st ed.; CRC Press: Boca Raton, FL, USA, 2021; ISBN 978-0-367-60950-4. [Google Scholar]
- Shaikhina, T.; Khovanova, N.A. Handling Limited Datasets with Neural Networks in Medical Applications: A Small-Data Approach. Artif. Intell. Med. 2017, 75, 51–63. [Google Scholar] [CrossRef] [PubMed]
- Izonin, I.; Tkachenko, R. Universal Intraensemble Method Using Nonlinear AI Techniques for Regression Modeling of Small Medical Data Sets. In Cognitive and Soft Computing Techniques for the Analysis of Healthcare Data; Elsevier: Amsterdam, The Netherlands, 2022; pp. 123–150. ISBN 978-0-323-85751-2. [Google Scholar]
- Havryliuk, M.; Hovdysh, N.; Tolstyak, Y.; Chopyak, V.; Kustra, N. Investigation of PNN Optimization Methods to Improve Classification Performance in Transplantation Medicine. In Proceedings of the 6th International Conference on Informatics & Data-Driven Medicine, Bratislava, Slovakia, 17–19 November 2023; pp. 338–345. [Google Scholar]
- Mulesa, O.; Geche, F.; Batyuk, A.; Buchok, V. Development of combined information technology for time series prediction. In Advances in Intelligent Systems and Computing II; Shakhovska, N., Stepashko, V., Eds.; Springer International Publishing: Cham, Switzerland, 2018; Volume 689, pp. 361–373. ISBN 978-3-319-70580-4. [Google Scholar]
- Tolstyak, Y.; Chopyak, V.; Havryliuk, M. An Investigation of the Primary Immunosuppressive Therapy’s Influence on Kidney Transplant Survival at One Month after Transplantation. Transpl. Immunol. 2023, 78, 101832. [Google Scholar] [CrossRef] [PubMed]
- Bodyanskiy, Y.; Chala, O.; Kasatkina, N.; Pliss, I. Modified Generalized Neo-Fuzzy System with Combined Online Fast Learning in Medical Diagnostic Task for Situations of Information Deficit. MBE 2022, 19, 8003–8018. [Google Scholar] [CrossRef] [PubMed]
- Mumuni, A.; Mumuni, F. Data Augmentation: A Comprehensive Survey of Modern Approaches. Array 2022, 16, 100258. [Google Scholar] [CrossRef]
- Snow, D. DeltaPy: A Framework for Tabular Data Augmentation in Python; Social Science Research Network: Rochester, NY, USA, 2020. [Google Scholar]
- Deep Learning for Tabular Data Augmentation. Available online: https://lschmiddey.github.io/fastpages_/2021/04/10/DeepLearning_TabularDataAugmentation.html (accessed on 16 May 2021).
- Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling tabular data using conditional GAN. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, 8–14 December 2019. [Google Scholar]
- Izonin, I.; Tkachenko, R.; Pidkostelnyi, R.; Pavliuk, O.; Khavalko, V.; Batyuk, A. Experimental Evaluation of the Effectiveness of ANN-Based Numerical Data Augmentation Methods for Diagnostics Tasks. In Proceedings of the 4th International Conference on Informatics & Data-Driven Medicine, CEUR Workshop Proceedings 2021, 3038, Valencia, Spain, 19–21 November 2021; pp. 223–232. [Google Scholar]
- Pima Indians Diabetes Database. Available online: https://kaggle.com/uciml/pima-indians-diabetes-database (accessed on 16 May 2021).
- Arora, A.; Shoeibi, N.; Sati, V.; González-Briones, A.; Chamoso, P.; Corchado, E. Data Augmentation Using Gaussian Mixture Model on CSV Files. In Proceedings of the Distributed Computing and Artificial Intelligence, 17th International Conference, L’Aquila, Italy, 17–19 June 2020; Springer: Cham, Switzerland, 2020; pp. 258–265. [Google Scholar]
- Guilhaumon, C.; Hascoët, N.; Chinesta, F.; Lavarde, M.; Daim, F. Data Augmentation for Regression Machine Learning Problems in High Dimensions. Computation 2024, 12, 24. [Google Scholar] [CrossRef]
- Izonin, I.; Tkachenko, R.; Vitynskyi, P.; Zub, K.; Tkachenko, P.; Dronyuk, I. Stacking-Based GRNN-SGTM Ensemble Model for Prediction Tasks. In Proceedings of the 2020 International Conference on Decision Aid Sciences and Application (DASA), Sakheer, Bahrain, 8–9 November 2020; pp. 326–330. [Google Scholar]
- Izonin, I.; Tkachenko, R.; Gregus, M.M.; Zub, K.; Tkachenko, P. A GRNN-Based Approach towards Prediction from Small Datasets in Medical Application. Procedia Comput. Sci. 2021, 184, 242–249. [Google Scholar] [CrossRef]
- Bodyanskiy, Y.V.; Deineko, A.O.; Kutsenko, Y.V. On-Line Kernel Clustering Based on the General Regression Neural Network and T. Kohonen’s Self-Organizing Map. Autom. Control. Comput. Sci. 2017, 51, 55–62. [Google Scholar] [CrossRef]
- Qiao, L.; Liu, Y.; Zhu, J. Application of Generalized Regression Neural Network Optimized by Fruit Fly Optimization Algorithm for Fracture Toughness in a Pearlitic Steel. Eng. Fract. Mech. 2020, 235, 107105. [Google Scholar] [CrossRef]
- Bani-Hani, D.; Khasawneh, M. A Recursive General Regression Neural Network (R-GRNN) Oracle for Classification Problems. Expert Syst. Appl. 2019, 135, 273–286. [Google Scholar] [CrossRef]
- Body Fat Percentage of Women. Available online: https://www.kaggle.com/datasets/vishweshsalodkar/body-fat-percentage (accessed on 23 July 2023).
- Specht, D.F. A General Regression Neural Network. IEEE Trans. Neural Netw. 1991, 2, 568–576. [Google Scholar] [CrossRef] [PubMed]
Dataset Number | Stated Task | Instances | Features | Reference |
---|---|---|---|---|
1 | Prediction of trabecular bone strength in severe osteoarthritis | 35 | 5 | [9] |
2 | Prediction of body fat percentage of women | 24 | 7 | [28] |
Parameter/Indicator | Dataset 1 | Dataset 2 |
---|---|---|
σ1 | 0.06129 | 0.01668 |
σ2 | 3.75372 | 2.59505 |
MAPE | 0.20 | 0.07 |
RMSE | 3.15 | 1.04 |
MAE | 2.30 | 1.00 |
R2 | 0.77 | 0.86 |
MSE | 9.91 | 1.09 |
ME | 5.53 | 1.56 |
MedAE | 1.98 | 0.93 |
Application time, seconds | 136.9 | 66.02 |
Izonin, I.; Tkachenko, R.; Berezsky, O.; Krak, I.; Kováč, M.; Fedorchuk, M. Improvement of the ANN-Based Prediction Technology for Extremely Small Biomedical Data Analysis. Technologies 2024, 12, 112. https://doi.org/10.3390/technologies12070112