Synthetic Data Augmentation for Robust Classification of Diabetic vs. Non-Diabetic Blood FTIR Spectra

Fadlelmoula, Ahmed; Boldyrev, Kirill N.; Gonçalves, Margarida; Torres, Helena; Catarino, Susana O.; Minas, Graça; Carvalho, Vitor

doi:10.3390/info17070638

This is an early access version; the complete PDF, HTML, and XML versions will be available soon.

Open Access Article

by Ahmed Fadlelmoula ¹,², Kirill N. Boldyrev ³, Margarida Gonçalves ¹, Helena Torres ², Susana O. Catarino ¹,⁴, Graça Minas ¹,⁴ and Vitor Carvalho ²,*

¹ Center for MicroElectromechanical Systems (CMEMS-UMinho), University of Minho, 4800-058 Guimaraes, Portugal
² 2Ai, School of Technology, Polytechnic University of Cávado and Ave, 4750-810 Barcelos, Portugal
³ Beijing Institute of Technology (BIT), Zhuhai BIT, Zhuhai 519088, China
⁴ LABBELS–Associate Laboratory, 4710-057 Braga, Portugal
* Author to whom correspondence should be addressed.

Information 2026, 17(7), 638; https://doi.org/10.3390/info17070638

Submission received: 22 May 2026 / Revised: 26 June 2026 / Accepted: 26 June 2026 / Published: 29 June 2026

(This article belongs to the Special Issue Innovative Machine Learning Technologies and Applications)

Abstract

Early detection of diabetes mellitus (DM) is essential for preventing disease progression and improving clinical outcomes. However, developing robust machine learning (ML) models for diabetes diagnosis is often constrained by limited data availability, privacy regulations, and challenges with data sharing. This study investigates a privacy-preserving synthetic data augmentation framework for classifying diabetic and non-diabetic blood serum samples using Fourier Transform Infrared (FTIR) spectroscopy. Two deep generative approaches, Autoencoders (AEs) and Generative Adversarial Networks (GANs), were evaluated for their ability to generate realistic synthetic FTIR spectra while preserving the statistical and biochemical characteristics of the original dataset. Synthetic datasets generated by the AE and GAN models were assessed using six ML classifiers: Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbors (KNN), Gradient Boosting (GB), Logistic Regression (LoR), and Decision Tree (DT). Model performance was evaluated using accuracy, precision, recall, F1-score, Receiver Operating Characteristic (ROC) curves, and Area Under the Curve (AUC). Results showed that AE-generated spectra retained stronger discriminative characteristics and were more easily distinguished from the original spectra, whereas GAN-generated spectra exhibited lower classifier separability, suggesting closer alignment with the original data distribution and greater realism for privacy-oriented data augmentation. Correlation analysis demonstrated high spectral fidelity for both approaches. Compared with the original spectra, AE-generated spectra achieved r = 0.9990 and R² = 0.9999, whereas GAN-generated spectra achieved r = 0.9982 and R² = 0.9965. The most prominent diabetes-related spectral variations were observed in the carbohydrate (1000–1200 cm⁻¹), Amide I (~1650 cm⁻¹), and lipid-associated (3000–3500 cm⁻¹) regions.
To explore the transferability of the proposed framework, a preliminary experimental feasibility study was conducted using independently acquired whole-blood FTIR spectra. The generated spectra showed strong agreement with the measured whole-blood spectra, demonstrating the potential applicability of the framework under alternative sampling conditions. Because the experimental cohort included only one diabetic volunteer, this analysis was intended solely as a proof-of-concept assessment of spectral feasibility and methodological transferability, rather than as a validation of diabetes classification performance. Overall, the findings demonstrate that synthetic data generation can effectively augment limited FTIR datasets while preserving privacy and key spectral characteristics. The proposed framework provides a promising foundation for privacy-aware biomedical data augmentation and future development of robust FTIR-based diabetes screening systems. The results should be interpreted as methodological evidence of feasibility and synthetic data utility rather than as evidence of clinical diagnostic readiness, as the serum dataset remains modest in size and the independent whole-blood experiment was intentionally exploratory.
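
The abstract reports spectral fidelity as r and R² but does not state how either value was computed. The following is a minimal sketch of one common reading, assuming that r is the Pearson correlation between the mean original and mean synthetic spectra and that R² is the coefficient of determination, 1 - SS_res / SS_tot. The function names and the toy absorbance values are hypothetical and are not taken from the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_squared(reference, estimate):
    """Coefficient of determination of `estimate` against `reference`."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot

def mean_spectrum(spectra):
    """Average a list of spectra point by point (one value per wavenumber)."""
    return [sum(column) / len(column) for column in zip(*spectra)]

# Toy data (hypothetical): absorbance at five wavenumbers, two spectra per set.
original = [[0.10, 0.40, 0.90, 0.50, 0.20],
            [0.12, 0.42, 0.88, 0.52, 0.18]]
synthetic = [[0.11, 0.41, 0.87, 0.50, 0.21],
             [0.10, 0.43, 0.91, 0.53, 0.19]]

orig_mean = mean_spectrum(original)
synth_mean = mean_spectrum(synthetic)
r = pearson_r(orig_mean, synth_mean)
r2 = r_squared(orig_mean, synth_mean)
print(f"r = {r:.4f}, R^2 = {r2:.4f}")
```

Comparing mean spectra is only one option; the same two functions could be applied to individual original/synthetic pairs or to a restricted band such as 1000–1200 cm⁻¹ if region-specific fidelity is of interest.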

Keywords: Autoencoders; diabetes classification; FTIR spectroscopy; generative adversarial networks; synthetic data generation
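
The evaluation metrics named in the abstract (accuracy, precision, recall, F1-score, and AUC) can be illustrated with a short standard-library sketch. It assumes binary labels with 1 as the positive class, which is a convention chosen here rather than one stated in the source, and it computes AUC as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. All names and the toy labels and scores are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical labels and classifier scores).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, f"AUC = {auc:.3f}")
```

The same AUC function applies when the two classes are "original" and "synthetic" rather than diabetic and non-diabetic: a value near 0.5 would then mean a classifier cannot tell synthetic spectra from original ones, which is the sense of "lower classifier separability" in the abstract.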
