Non-Invasive Multiclass Diabetes Classification Using Breath Biomarkers and Machine Learning with Explainable AI

Gudiño-Ochoa, Alberto; García-Rodríguez, Julio Alberto; Ochoa-Ornelas, Raquel; Ruiz-Velazquez, Eduardo; Uribe-Toscano, Sofia; Cuevas-Chávez, Jorge Ivan; Sánchez-Arias, Daniel Alejandro

doi:10.3390/diabetology6060051


by Alberto Gudiño-Ochoa ¹,², Julio Alberto García-Rodríguez ¹,³,*, Raquel Ochoa-Ornelas ⁴, Eduardo Ruiz-Velazquez ², Sofia Uribe-Toscano ⁵, Jorge Ivan Cuevas-Chávez ² and Daniel Alejandro Sánchez-Arias ²

¹ Electronics Department, Tecnológico Nacional de México/Instituto Tecnológico de Ciudad Guzmán, Ciudad Guzmán 49100, Jalisco, Mexico

² Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Electronics and Computing Division, Universidad de Guadalajara, Guadalajara 44430, Jalisco, Mexico

³ Centro Universitario del Sur (CUSUR), Departamento de Ciencias Computacionales e Innovación Tecnológica, Universidad de Guadalajara, Ciudad Guzmán 49000, Jalisco, Mexico

⁴ Systems and Computation Department, Tecnológico Nacional de México/Instituto Tecnológico de Ciudad Guzmán, Ciudad Guzmán 49100, Jalisco, Mexico

⁵ Centro Universitario del Sur (CUSUR), Departamento de Ciencias Clínicas, División de Ciencias de la Salud, Universidad de Guadalajara, Av. Enrique Arreola Silva No. 883, Colón, Ciudad Guzmán 49000, Jalisco, Mexico

* Author to whom correspondence should be addressed.

Diabetology 2025, 6(6), 51; https://doi.org/10.3390/diabetology6060051

Submission received: 3 April 2025 / Revised: 20 May 2025 / Accepted: 26 May 2025 / Published: 4 June 2025


Abstract

Background/Objectives: The increasing prevalence of diabetes underscores the urgent need for non-invasive, rapid, and cost-effective diagnostic alternatives. This study presents a breath-based multiclass diabetes classification system leveraging only three gas sensors (CO, alcohol, and acetone) to analyze exhaled breath composition. Methods: Breath samples were collected from 58 participants (22 healthy, 7 prediabetic, and 29 diabetic), with blood glucose levels serving as the reference metric. To enhance classification performance, we introduced a novel biomarker, the alcohol-to-acetone ratio, through a feature engineering approach. Class imbalance was addressed using the Synthetic Minority Over-Sampling Technique (SMOTE), ensuring a balanced dataset for model training. A nested cross-validation framework with 3 outer and 3 inner folds was implemented. Multiple machine learning classifiers were evaluated, with Random Forest and Gradient Boosting emerging as the top-performing models. Results: An ensemble combining both yielded the highest overall performance, achieving an average accuracy of 98.86%, precision of 99.07%, recall of 98.81% and F1 score of 98.87%. These findings highlight the potential of gas sensor-based breath analysis as a highly accurate, scalable, and non-invasive method for diabetes screening. Conclusions: The proposed system offers a promising alternative to blood-based diagnostic approaches, paving the way for real-world applications in point-of-care diagnostics and continuous health monitoring.

Keywords: breath acetone; diabetes classification; machine learning; breath biomarkers; medical expert systems; exhaled breath analysis

1. Introduction

Diabetes mellitus (DM) is a chronic metabolic disorder characterized by abnormal blood glucose levels (BGL), posing a significant global health challenge [1,2]. According to the World Health Organization and the American Diabetes Association, glycemic status is classified into three categories based on fasting BGL: healthy (<100 mg/dL), prediabetic (100–125 mg/dL), and diabetic (≥126 mg/dL) [3,4]. With over 530 million individuals currently affected worldwide, projections estimate that 783 million people will be living with diabetes by 2045, driving healthcare costs beyond $1.054 trillion [5]. The increasing prevalence and economic burden of DM highlight the urgent need for accurate, rapid, and non-invasive diagnostic methods to enable early detection and monitoring [6,7,8].

Conventional diagnostic methods rely on finger-prick blood tests and continuous glucose monitoring (CGM), which, despite their accuracy, are invasive, painful, and costly due to single-use test strips and sensor replacements. Moreover, invasive procedures increase the risk of infections and reduce patient adherence, particularly in low-resource settings [8,9].

In response, non-invasive glucose monitoring techniques have gained considerable attention, including saliva, sweat, tear, and urine analysis. However, these approaches face severe limitations such as low biomarker concentrations, high variability, and inconsistent correlations with blood glucose levels [10,11,12]. Optical and radiofrequency-based methods have also been explored, but high implementation costs and low portability restrict their widespread adoption [13].

A promising alternative is exhaled breath analysis, which enables non-invasive, real-time monitoring of volatile organic compounds (VOCs), specifically acetone, a well-established biomarker for diabetes. Elevated acetone levels in breath correlate with increased ketone production, a direct consequence of impaired glucose metabolism [14,15]. Despite its potential, traditional breath analysis techniques such as Gas Chromatography-Mass Spectrometry (GC-MS), Selected Ion Flow Tube Mass Spectrometry (SIFT-MS), and Proton Transfer Reaction Mass Spectrometry (PTR-MS), although highly sensitive, are expensive, require specialized facilities, and lack portability for real-world applications [16,17,18,19,20].

Electronic noses (e-noses) provide a portable, low-cost, and scalable solution for VOC-based breath analysis, offering rapid response times and ease of integration with machine learning (ML) algorithms [21,22,23]. E-noses utilize gas sensors, such as metal-oxide semiconductor (MOS) sensors, to detect specific VOCs in human breath, translating chemical signals into digital data for classification. Recent studies have demonstrated the effectiveness of e-noses in detecting acetone as a key diabetes biomarker [24,25]. However, previous works have predominantly focused on binary classification (healthy vs. diabetic) and have overlooked the importance of multiclass classification for distinguishing healthy, prediabetic, and diabetic individuals based on standard medical guidelines. Moreover, traditional models fail to leverage feature engineering techniques that could enhance classification accuracy [22,26].

Recent advances in ML and deep learning (DL) have significantly improved the performance of diagnostic models [27,28,29]. Algorithms such as Random Forest, LightGBM, XGBoost, Deep Neural Networks, and Convolutional Neural Networks (CNNs) have achieved high accuracy in VOC-based disease detection, enabling rapid and automated classification [21,22,23,26,28,30].

To address the challenges of multiclass diabetes classification, this study proposes a novel machine learning-based e-nose system leveraging only three MOS gas sensors (CO, alcohol, and acetone) and a feature engineering approach introducing the alcohol-to-acetone ratio. This ratio has been identified as a potential biomarker that enhances classification performance, as recent findings suggest that alcohol metabolism can influence acetone concentrations in exhaled breath [22,28,31].

This study introduces a novel breath-based multiclass diabetes classification system, contributing to the field in the following ways:

Introduces the alcohol-to-acetone ratio as a novel biomarker, using only three MOS sensors to achieve an optimal balance between cost-efficiency and classification accuracy, leveraging feature engineering to improve multiclass discrimination (healthy, prediabetic, diabetic).
Evaluates and compares multiple ML classifiers, employing a nested cross-validation strategy with 3 inner and 3 outer folds to ensure robust hyperparameter tuning and unbiased performance estimation.
Demonstrates that an ensemble model (Random Forest + Gradient Boosting) achieves high classification performance across multiple metrics, with strong generalization ability.

The remainder of this paper is organized as follows: Section 2 details the materials and methods, including the experimental setup, data collection, and machine learning pipeline. Section 3 presents the classification results, highlighting the impact of the alcohol-to-acetone ratio and ensemble learning models. Section 4 discusses limitations and potential improvements, and Section 5 concludes with future directions for real-world deployment.

2. Materials and Methods

This section outlines the dataset, sensor system, and analytical methods used in the development of the proposed breath-based diabetes classification framework. The study involved breath analysis from a cohort comprising young and adult individuals diagnosed with type 1 and type 2 diabetes mellitus (T1DM and T2DM), as well as healthy controls [21]. The breath collection system, based on an electronic nose (e-nose) architecture, is described in detail, including the specific gas sensors employed. Subsequent subsections address the data preprocessing pipeline, which includes the application of SMOTE to address class imbalance [26]. The classification categories—healthy (blood glucose level < 100 mg/dL), prediabetic (100–125 mg/dL), and diabetic (≥126 mg/dL)—were established based on standard blood glucose thresholds, which were used as reference labels [32,33].

To explore the underlying structure of the dataset, t-distributed Stochastic Neighbor Embedding (t-SNE) was employed for dimensionality reduction and visualization, providing insight into the separability of classes. Feature selection and feature engineering techniques were applied to enhance model interpretability and performance [28]. Notably, a novel biomarker—the alcohol-to-acetone ratio—was introduced to improve classification accuracy and highlight biologically relevant interactions in breath composition [22]. These methodologies collectively enabled the effective training and evaluation of machine learning models discussed in the subsequent Results section.

2.1. Dataset

The dataset used in this study originates from prior investigations conducted by our research group. Detailed information regarding the original experimental design and patient physiological profiles can be found in [21,28]. Breath samples were obtained from individuals previously diagnosed with type 1 or type 2 diabetes mellitus (T1DM or T2DM), as well as from healthy controls, with all participants providing informed consent prior to sample collection.

The dataset comprises real measurements from 58 participants: 29 healthy individuals and 29 diabetic patients (including both T1DM and T2DM), covering a range of ages, blood glucose levels, and sampling times. For each subject, the dataset includes the average sensor readings per breath sample, the participant ID, and reference BGL measured using a commercial glucometer. The raw data were acquired using a set of seven sensors (Waveshare, Shenzhen, China): six gas sensors, namely MQ-2 (carbon monoxide), MQ-3 (alcohol), MQ-7 (carbon monoxide), MQ-135 (acetone), MQ-138 (benzene), and MICS-5524 (VOCs), plus a DHT-22 sensor for temperature and relative humidity [22,28]. Table 1 provides a summary of the sensors used.

Breath samples were collected using standard 1-L Tedlar bags (CEL Scientific Corporation, Alameda, CA, USA), a widely adopted method in human breath analysis literature [15,22,24,28,30]. The e-nose system was composed of an Arduino Nano 33 BLE Sense board, which offers a 12-bit analog-to-digital converter (ADC) and a 32-bit ARM Cortex-M4 processor (Arduino S.r.l., Monza, Italy), and a suite of MQ-series gas sensors. These sensors detect key breath biomarkers, including carbon monoxide, alcohols, ketones, and VOCs, and monitor environmental variables such as temperature and relative humidity. Prior to data acquisition, all sensors were pre-heated and calibrated for 24–48 h to ensure stability and optimal sensitivity [21,28].

For the purposes of this study—and to enhance model interpretability while reducing complexity—we selected a subset of three gas sensors: MQ-2, MQ-3, and MQ-135. These were chosen based on their relevance to breath biomarkers associated with diabetes. The DHT-22 sensor was used exclusively for monitoring and controlling temperature and relative humidity (RH) within the sampling chamber [21,28]. RH is a critical factor in breath analysis, as elevated moisture levels can interfere with sensor response, leading to signal distortion or false positives [34]. Therefore, a desiccation protocol was implemented between each measurement, using an internal dehumidifier to restore the chamber to optimal conditions [21,35].

Although the dehumidifier substantially improves measurement accuracy, it introduces operational challenges, such as the need for routine maintenance and calibration to prevent over-drying, which could diminish sensor responsiveness. Moreover, dehumidification prolongs the sample preparation process, as the system must equilibrate before each session. Despite these limitations, humidity regulation remains essential for ensuring measurement reliability and reproducibility [21,34].

The full system architecture is illustrated in Figure 1. A certified medical professional instructed each participant to exhale fully into the Tedlar bag until approximately 90–100% capacity was reached. The collected breath sample was manually transferred to the e-nose sampling chamber, which had been pre-heated for five minutes to stabilize the sensor array. Breath samples were allowed to cool for 5 to 10 min before analysis. The average relative humidity during collection was approximately 70%, and the ambient temperature was around 33 °C.

Data acquisition was conducted using Python 3.13 via a serial communication protocol. All sensor readings were recorded in CSV format and later analyzed in Jupyter Notebook. Once the average sensor values for each individual were obtained, the dataset was used to train and validate the machine learning models.

2.2. Data Preprocessing

During the 90-s measurement period, approximately 10,000 data points were collected. Due to noise introduced by variations in breath temperature and humidity [13], a Discrete Wavelet Transform (DWT) with a low-pass filter was applied to denoise the acquired signals. The e-nose system relies on resistive metal-oxide semiconductor (MOS) gas sensors, which produce Rs/Ro values, where Rs represents the sensor resistance in the presence of VOCs, and Ro denotes the baseline resistance under clean air conditions. There is an inverse relationship between Rs/Ro and the concentration of key biomarkers such as acetone and alcohol: higher Rs/Ro values indicate lower concentrations, and vice versa. These sensor responses are essential for estimating blood glucose levels and for determining whether a subject is healthy or diabetic [21,28].
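As a rough illustration of how an Rs/Ro value can be derived from a raw reading, the sketch below assumes the common voltage-divider wiring of MQ-series modules, in which the sensor sits in series with a load resistor. The supply voltage, ADC reference, load resistance, and the two ADC counts are hypothetical values chosen for the example, not the settings of this study.

```python
def adc_to_rs(adc_count, adc_bits=12, v_ref=3.3, v_supply=3.3, r_load=10_000.0):
    """Convert a raw ADC count to sensor resistance Rs (ohms) for a voltage divider.

    All default values are illustrative assumptions, not the study's settings.
    """
    v_out = adc_count / (2 ** adc_bits - 1) * v_ref
    if v_out <= 0:
        raise ValueError("ADC reading must be positive to compute Rs")
    # Voltage divider: Rs = R_load * (V_supply - V_out) / V_out
    return r_load * (v_supply - v_out) / v_out


def rs_ro_ratio(adc_count, ro, **kwargs):
    """Return Rs/Ro, where Ro is the baseline resistance measured in clean air."""
    return adc_to_rs(adc_count, **kwargs) / ro


ro_clean_air = adc_to_rs(800)             # hypothetical clean-air reading
ratio = rs_ro_ratio(1600, ro_clean_air)   # hypothetical breath reading
print(round(ratio, 3))
```

A higher output voltage (more target gas) gives a lower Rs and hence an Rs/Ro below 1, which matches the inverse relationship described above.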

To further refine the signal quality, MinMax scaling was applied after DWT denoising. This normalization approach helps mitigate artifacts caused by unstable voltage levels and environmental fluctuations, including changes in temperature and humidity [34]. The normalization ensures that each feature contributes proportionally and that no single attribute dominates the model’s learning process. The DWT used to isolate and remove high-frequency noise is defined by the following formula [21,30,34]:

DWT(f, a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} f(t) \, ψ\left(\frac{t - b}{a}\right) dt

(1)

where f is the original signal, ψ represents the mother wavelet function, and a and b are the scale and translation parameters, respectively. Following the denoising step, each sensor reading was scaled using the MinMax Scaler as defined below [36]:

X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}

(2)

where X is the original reading, and X_{min} and X_{max} are the minimum and maximum observed values, respectively. This transformation standardizes the range of the features to [0, 1], ensuring consistency across sensors and compatibility with downstream ML algorithms.
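Equation (2) can be sketched in a few lines. The readings below are hypothetical Rs/Ro values, and returning zeros for a constant signal is a convention assumed here rather than something stated in the text.

```python
def min_max_scale(values):
    """Scale a sequence of readings to [0, 1] following Equation (2)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Constant signal: the ratio is undefined, so zeros are returned (an assumption).
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]


readings = [0.82, 0.75, 0.61, 0.90]   # hypothetical denoised Rs/Ro values
scaled = min_max_scale(readings)
print([round(v, 3) for v in scaled])
```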

Figure 2 illustrates the Rs/Ro response of the MQ-135 (acetone) sensor, comparing raw and denoised signals across three different patient classes: healthy, T1DM, and T2DM. The differences in sensor behavior among these groups are clearly distinguishable, demonstrating the effectiveness of the DWT in enhancing signal interpretability.
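The low-pass DWT denoising compared in Figure 2 can be illustrated with a minimal pure-Python sketch. It assumes a Haar wavelet and two decomposition levels, neither of which is specified in the text, and it simply discards the detail (high-frequency) coefficients before reconstructing the signal.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(signal):
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild the signal from both coefficient sets."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out


def lowpass_denoise(signal, levels=2):
    """Zero the detail coefficients at `levels` scales and reconstruct."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(signal)
    detail_sizes = []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        detail_sizes.append(len(detail))
    for size in reversed(detail_sizes):
        approx = haar_idwt(approx, [0.0] * size)   # details replaced by zeros
    return approx


# Hypothetical Rs/Ro trace: slow downward drift plus alternating noise.
noisy = [1.0 - 0.01 * i + (0.05 if i % 2 else -0.05) for i in range(16)]
denoised = lowpass_denoise(noisy, levels=2)
print([round(v, 3) for v in denoised])
```

With these settings each output sample is the mean of a block of four inputs, so the alternating noise cancels while the slow drift is preserved.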

2.3. Dataset Class Balance with SMOTE

To address class imbalance in the dataset, the SMOTE technique was employed. Classification labels were assigned based on fasting BGL, following the criteria set by the American Diabetes Association and the World Health Organization [37,38]. Participants were grouped into three categories:

Healthy: BGL < 100 mg/dL
Prediabetes: 100 ≤ BGL < 126 mg/dL
Diabetes: BGL ≥ 126 mg/dL
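The labeling rule above maps directly to a small function. This is a sketch of the thresholds as listed; the function name and the example values are illustrative.

```python
def label_from_bgl(bgl_mg_dl):
    """Assign a class label from a fasting blood glucose level in mg/dL."""
    if bgl_mg_dl < 100:
        return "healthy"
    if bgl_mg_dl < 126:
        return "prediabetes"
    return "diabetes"


for bgl in (92, 100, 125, 126, 180):   # hypothetical glucometer readings
    print(bgl, label_from_bgl(bgl))
```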

Figure 3 illustrates the class distribution before and after applying SMOTE. Initially, the healthy (22) and prediabetic (7) classes were underrepresented relative to the diabetic class (29). After synthetic resampling, all three classes were balanced at 29 samples each (87 in total), minimizing model bias toward the majority class and ensuring robust training of classification algorithms.
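The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched without any external library. This is a simplified illustration rather than the implementation used in the study; the feature vectors, the neighbour count k, and the random seed are all hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points on segments joining minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()   # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Seven hypothetical prediabetic samples (CO, alcohol, acetone), scaled to [0, 1].
prediabetic = [
    (0.40, 0.52, 0.31), (0.45, 0.50, 0.35), (0.38, 0.55, 0.30), (0.42, 0.49, 0.33),
    (0.47, 0.53, 0.36), (0.41, 0.51, 0.29), (0.44, 0.54, 0.34),
]
new_points = smote_like_oversample(prediabetic, n_new=29 - len(prediabetic))
print(len(prediabetic) + len(new_points))   # class size after balancing
```

Because every synthetic point is a convex combination of two real samples, it always lies inside the range already spanned by the minority class.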

To evaluate whether the samples belonging to different classes could be effectively separated, t-SNE was applied. This nonlinear dimensionality reduction technique maps high-dimensional data into a lower-dimensional space (typically 2D) while preserving local structure. It achieves this by minimizing the Kullback–Leibler (KL) divergence between two probability distributions: one in the high-dimensional space and another in the low-dimensional space. The cost function minimized by t-SNE is given by [38]:

KL(P ∥ Q) = \sum_{i \neq j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right)

(3)

where p_{ij} represents the pairwise similarity between data points i and j in the high-dimensional space (measured using a Gaussian distribution), and q_{ij} represents the similarity in the low-dimensional space, modeled using a Student-t distribution with one degree of freedom (a Cauchy distribution), which provides heavier tails and helps separate dissimilar points more effectively. Figure 4 shows the t-SNE projection, where distinct clusters can be observed for healthy, prediabetic, and diabetic individuals. This separation confirms that the selected features contain meaningful information to support multiclass discrimination.
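For intuition, the cost in Equation (3) can be evaluated on a toy example. The two similarity matrices below are hypothetical; they only need to be symmetric with off-diagonal entries summing to 1.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) summed over off-diagonal entries, as in Equation (3)."""
    n = len(p)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and p[i][j] > 0:
                total += p[i][j] * math.log(p[i][j] / q[i][j])
    return total


# Hypothetical high-dimensional similarities for three points.
P = [[0.00, 0.30, 0.15],
     [0.30, 0.00, 0.05],
     [0.15, 0.05, 0.00]]
# Hypothetical low-dimensional similarities: all pairs equally similar.
sixth = 1.0 / 6.0
Q = [[0.0, sixth, sixth],
     [sixth, 0.0, sixth],
     [sixth, sixth, 0.0]]

print(kl_divergence(P, Q))   # positive: the embedding distorts the similarities
print(kl_divergence(P, P))   # zero: a perfect match has no cost
```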

It should be noted that the clear class separability observed in this study stems from the controlled conditions of data collection and the distinct glucose level ranges among participants. However, in real-world scenarios with greater inter-patient variability—such as lifestyle, medication, and other metabolic factors—classification boundaries may be less distinct. Therefore, further validation on broader and more heterogeneous populations is necessary for clinical translation [8,11,21,28,34].

2.4. Feature Selection

Feature engineering played a key role in enhancing the classification performance and interpretability of the models. To reduce dimensionality and improve robustness, only a subset of sensors, MQ-2 (CO), MQ-3 (Alcohol), and MQ-135 (Acetone), was selected based on their relevance to diabetes-related breath biomarkers. These compounds are consistently reported in the literature as reliable indicators of altered metabolism in individuals with diabetes [22,28,31]. To further enrich the dataset and highlight underlying biological interactions, a new feature was created, defined mathematically as:

Alcohol/Acetone = \frac{Alcohol}{Acetone + ε}

(4)

where ε = 1 \times 10^{-8} is a small constant added to avoid division by zero. This ratio captures the relative abundance of alcohol and acetone, two key VOCs in diabetic breath, and enhances model sensitivity to metabolic shifts.
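Equation (4) amounts to one extra column per sample. In the sketch below the dictionary keys and the sensor values are hypothetical; only the formula and the value of ε come from the text.

```python
EPSILON = 1e-8   # small constant from Equation (4)


def alcohol_to_acetone_ratio(alcohol, acetone, eps=EPSILON):
    """Engineered feature: alcohol reading divided by (acetone reading + eps)."""
    return alcohol / (acetone + eps)


# Hypothetical scaled sensor readings for two breath samples.
samples = [
    {"co": 0.62, "alcohol": 0.48, "acetone": 0.30},
    {"co": 0.20, "alcohol": 0.15, "acetone": 0.00},   # eps prevents division by zero
]
for sample in samples:
    sample["alcohol_acetone_ratio"] = alcohol_to_acetone_ratio(
        sample["alcohol"], sample["acetone"]
    )
    print(sample)
```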

Figure 5 presents the Pearson correlation matrix between the selected features and the engineered biomarker. Strong positive correlations were observed between CO and acetone (r = 0.95), and between alcohol and the alcohol-to-acetone ratio (r = 0.89). Notably, the alcohol-to-acetone ratio showed weak or negative correlations with CO and acetone individually, reinforcing its role as a composite indicator that captures a distinct metabolic signature.
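Each entry of a Pearson correlation matrix such as the one in Figure 5 is computed pairwise. The sketch below shows that computation for a single pair; the alcohol and acetone readings are hypothetical and will not reproduce the reported coefficients.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scaled readings for five breath samples.
alcohol = [0.30, 0.45, 0.50, 0.62, 0.70]
acetone = [0.40, 0.42, 0.38, 0.41, 0.39]
ratio = [a / (k + 1e-8) for a, k in zip(alcohol, acetone)]

print(round(pearson_r(alcohol, ratio), 3))
print(round(pearson_r(acetone, ratio), 3))
```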

Previous studies have demonstrated that the acetone sensor response (Rs/Ro) is inversely correlated with blood glucose levels. Specifically, lower Rs/Ro values, which indicate higher acetone concentrations, were associated with higher blood glucose concentrations and a greater likelihood of diabetes. A similar pattern was observed for alcohol. In healthy individuals, Rs/Ro values tend to be higher, reflecting lower VOC concentrations. In contrast, diabetic individuals typically exhibit lower Rs/Ro values for both acetone and alcohol, indicating elevated levels of these biomarkers in breath samples [14,22,28].

The engineered alcohol-to-acetone ratio feature thus serves a dual purpose: (1) it encapsulates clinically relevant interactions between two key biomarkers, and (2) it enhances the discriminatory power of the model without requiring additional sensors. This balance between simplicity and predictive capacity makes the proposed feature a valuable addition for real-time breath-based diabetes screening applications [8,14,21,34,39].

3. Results

This section presents the results of the multiclass classification task for diabetes diagnosis using exhaled breath biomarkers. Several ML classifiers were evaluated to discriminate between healthy, prediabetic, and diabetic individuals. A detailed comparative analysis was performed, including classification reports and precision-recall metrics. The selection of classifiers was guided by prior benchmarking studies demonstrating their robust performance in medical and non-invasive diagnostic tasks [21,23,26,28,30]. Tree-based and boosting models were emphasized, including Random Forest, Gradient Boosting, Extra Trees, AdaBoost, XGBoost, LightGBM, and CatBoost. Classical classifiers such as Support Vector Machines (SVM), k-Nearest Neighbors (KNN), and Logistic Regression were also included to provide a comprehensive baseline.

To ensure rigorous evaluation and reduce the risk of overfitting, a nested cross-validation framework was implemented, consisting of a double validation structure: an internal 3-fold loop for hyperparameter optimization using GridSearchCV, and an external 3-fold loop for performance estimation. Model performance was assessed by averaging the results obtained on the outer test folds, providing an unbiased estimate of generalization capability. The top-performing models were further combined into an ensemble classifier, leveraging complementary decision boundaries to enhance performance. Additionally, to interpret model predictions and validate feature contributions, explainable artificial intelligence (XAI) techniques were applied using SHapley Additive exPlanations (SHAP). The SHAP analysis confirmed that the three selected gas sensors are sufficient for effective diabetes classification via breath analysis [22,28].

3.1. Multiclass Classification: Healthy, Prediabetic, and Diabetic

Due to the limited sample size (N = 87 after SMOTE balancing), splitting the dataset into separate training and testing sets was not statistically appropriate. To mitigate this constraint and prevent overfitting or biased performance estimation, a nested cross-validation strategy was employed. The framework consisted of a double cross-validation loop: an outer 3-fold stratified k-fold split to evaluate model performance, and an inner 3-fold loop within each training fold to optimize hyperparameters using Grid Search.
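The 3 × 3 nested structure can be sketched with scikit-learn by passing a GridSearchCV object (inner loop) to cross_val_score (outer loop). The data here are synthetic stand-ins generated with make_classification, and the hyperparameter grid, random seeds, and scoring choices are assumptions made for illustration, not the settings of this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the dataset: 87 samples, 4 features, 3 classes.
X, y = make_classification(
    n_samples=87, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)   # tuning
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)   # evaluation

# Hypothetical grid; the actual search space is not given in the text.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=inner_cv,
    scoring="f1_macro",
)

# Each outer training fold runs its own grid search; the outer test fold
# is never seen during hyperparameter selection.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print(f"accuracy: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```

Reporting the mean and standard deviation over the outer folds mirrors the "mean ± standard deviation" format used for the results.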

Stratification ensured proportional representation of the three classes—healthy, prediabetic, and diabetic—in each fold. This approach enabled each instance to participate in both training and testing across iterations, reducing variance and bias in performance estimates [21,23,26]. Model performance was assessed using six standard metrics: accuracy, precision, recall, F1-score, ROC AUC (multiclass, one-vs-rest strategy), and MCC.

These metrics provide a comprehensive view of model performance. In particular, the F1-score is critical in medical diagnostics: as the harmonic mean of precision and recall, it penalizes both false positives and false negatives. This is especially important when differentiating among healthy, prediabetic, and diabetic individuals, where misclassification could lead to unnecessary stress, delayed treatment, or missed early intervention.

The evaluation metrics were computed using the following formulas [21,29]:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(5)

Precision = \frac{TP}{TP + FP}

(6)

Recall = \frac{TP}{TP + FN}

(7)

F_{1}\text{-score} = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}

(8)

ROC AUC = \int_{0}^{1} TPR(FPR) \, d(FPR)

(9)

MCC = \frac{(TP \cdot TN) - (FP \cdot FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

(10)

where TP denotes true positives, TN true negatives, FP false positives, and FN false negatives; TPR is the true positive rate and FPR is the false positive rate.
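As a worked example of Equations (5) to (8) and (10), the sketch below evaluates the formulas for one class treated one-vs-rest. The confusion counts are hypothetical and are not taken from the reported results; the function assumes no denominator is zero.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1, and MCC from one-vs-rest confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical counts for one class out of 87 samples.
metrics = binary_metrics(tp=28, tn=57, fp=1, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For multiclass reporting, these per-class values would then be averaged across the three classes.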

3.2. Comparative Model Performance

Tree-based ensemble methods consistently demonstrated superior generalization ability and discriminative capacity compared to linear and distance-based classifiers in the multiclass identification of metabolic states from exhaled breath biomarkers.

Random Forest, Gradient Boosting, and AdaBoost demonstrated outstanding individual performance, with nearly identical average scores (all above 98.8% accuracy and 98% F1-score), highlighting the strength of boosting and bagging approaches in metabolomic classification. Importantly, these models maintained minimal variation across folds, reinforcing their stability and reliability in small-sample biomedical contexts.

Models like LightGBM, SVM, and CatBoost showed strong performance but with slightly higher variability across folds. For instance, LightGBM achieved 97.70 ± 3.25% accuracy, with a corresponding MCC of 96.55 ± 4.88%, indicating occasional inconsistencies in classifying borderline cases, particularly within the prediabetic category—a known challenge due to overlapping biomarker profiles. These fluctuations emphasize the clinical complexity of early metabolic dysregulation and its manifestation in VOCs.

Extra Trees and KNN performed moderately, achieving accuracies of 94.25 ± 1.63%, but exhibited slightly lower MCC values and increased dispersion across metrics, especially in minority class recall. XGBoost, while maintaining high AUC, showed larger standard deviations (e.g., ±6.70 in F1-score), reflecting reduced stability compared to other tree-based models. As shown in Figure 6, this instability appears as additional off-diagonal elements in the confusion matrices, particularly for XGBoost.

Linear classifiers such as Logistic Regression displayed the weakest performance, with an accuracy of 90.80 ± 1.63% and the lowest MCC (86.90 ± 2.43%), confirming their limited capacity to model non-linear VOC–glucose associations. Similarly, SVM and KNN did not match the consistency and precision of ensemble methods.

To enhance stability and clinical applicability, a soft voting ensemble model integrating Random Forest and Gradient Boosting was constructed. This combination leveraged the complementary strengths of both algorithms (Random Forest’s robustness to noise and overfitting, and Gradient Boosting’s ability to iteratively correct residual errors), resulting in a statistically reliable and high-performing classifier. These models were selected based on their superior F1-scores during individual evaluation, a metric particularly relevant in medical diagnostics, where minimizing both false positives and false negatives is essential.

Like the individual classifiers, the ensemble was evaluated under the nested cross-validation framework, with an inner loop for hyperparameter optimization and an outer loop for unbiased performance estimation. This double validation approach mitigates the risk of overfitting and provides a more accurate reflection of the model’s generalization capability, a critical concern in studies with limited sample sizes.

The ensemble yielded the highest performance across all metrics, with 98.86 ± 1.97% accuracy, 99.07 ± 1.60% precision, 98.81 ± 2.06% recall, 98.87 ± 1.96% F1-score, 98.36 ± 2.84% MCC, and a ROC AUC of 1.0000 ± 0.0. Its low variance across folds supports its potential for real-world deployment in point-of-care diagnostics and clinical screening workflows, where reliability and reproducibility are essential [34,35]. These results indicate the model’s robustness in capturing complex, non-linear interactions among VOCs, while maintaining excellent generalization under a nested cross-validation scheme [21,28,34].
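Soft voting combines models by averaging their predicted class probabilities and selecting the class with the highest average. The sketch below shows that rule in isolation; the two probability vectors are hypothetical outputs for a single breath sample, and equal weights are assumed.

```python
def soft_vote(prob_sets, weights=None):
    """Average class-probability vectors from several models and pick the argmax."""
    if weights is None:
        weights = [1.0] * len(prob_sets)   # equal weighting (an assumption)
    total = sum(weights)
    n_classes = len(prob_sets[0])
    averaged = [
        sum(w * probs[c] for w, probs in zip(weights, prob_sets)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return predicted, averaged


CLASSES = ["healthy", "prediabetic", "diabetic"]
rf_probs = [0.10, 0.55, 0.35]   # hypothetical Random Forest output
gb_probs = [0.05, 0.40, 0.55]   # hypothetical Gradient Boosting output

idx, avg = soft_vote([rf_probs, gb_probs])
print(CLASSES[idx], [round(p, 3) for p in avg])
```

In this example the two models disagree on the most likely class, and the averaged probabilities settle the borderline case.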

Table 2 provides a comprehensive summary of the average performance metrics (mean ± standard deviation) for each evaluated model, computed across the outer folds of the nested cross-validation framework.

Figure 6 illustrates the classification behavior of each model on the final fold (Fold 3) of the outer nested cross-validation loop. The ensemble (f), which combines Random Forest and Gradient Boosting, shows strong diagonal dominance and achieves perfect classification across all classes on this fold. Random Forest and LightGBM (b), as well as Gradient Boosting and AdaBoost (e), exhibit minimal off-diagonal misclassifications, reflecting high robustness. CatBoost (a), Extra Trees, SVM, KNN, and Logistic Regression (c) also exhibit relatively low misclassification rates, with minor confusion in the prediabetes group. In contrast, XGBoost (d) shows greater variability, misclassifying four prediabetic instances as either healthy or diabetic.

Figure 7 displays a visual comparison of all global metrics, clearly positioning ensemble methods at the top of every metric. Such consistency across performance dimensions—accuracy, discrimination (AUC), and balanced error (MCC)—reinforces the ensemble model as the most promising architecture for breath-based diabetes screening.

Finally, Table 3 presents the global per-class classification performance aggregated across all outer folds of the nested cross-validation, offering detailed insight into intra-class behavior for each model. The ensemble model combining Random Forest and Gradient Boosting attained near-perfect classification for all three metabolic states, with an F1-score of 98.41% in the prediabetes and diabetes classes, and 100% in the healthy class. While several models, including Random Forest, Gradient Boosting, and AdaBoost, matched this high performance in the healthy class, the ensemble exhibited the most consistent behavior across folds, particularly in the challenging prediabetic category. Models such as LightGBM and SVM also performed robustly but showed slightly reduced recall for prediabetes, which affected their overall F1-scores. In contrast, classifiers like Logistic Regression and XGBoost demonstrated increased variability and more frequent misclassifications, especially in intermediate metabolic states.

Collectively, these results demonstrate that accurate, non-invasive multiclass classification of metabolic conditions is feasible using only three low-cost gas sensors [22,28]. When integrated with optimized ensemble learning methods, the system achieves performance levels suitable for real-world implementation in screening or point-of-care diagnostics.

3.3. Explainability Analysis with SHAP

To enhance transparency and interpretability in clinical applications, SHAP was employed to examine the feature contributions and pairwise interactions in the Gradient Boosting and Random Forest models—the two base learners of the final ensemble. SHAP provides a principled framework to quantify how each feature influences the model output, enabling robust interpretation of complex machine learning predictions [40,41].
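To make the idea behind SHAP concrete, the sketch below computes exact Shapley values for a three-feature toy scoring function by enumerating all feature coalitions. Everything in it is an illustrative assumption: `toy_model`, the feature values, and the all-zero baseline are hypothetical, and replacing absent features with a single baseline is a simplification of what tree-specific SHAP explainers do.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at instance `x` relative to `baseline`."""
    names = list(x)
    n = len(names)

    def value(subset):
        # Features in `subset` take the instance value; the rest take the baseline.
        point = {k: (x[k] if k in subset else baseline[k]) for k in names}
        return model(point)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                gain = value(set(subset) | {name}) - value(set(subset))
                total += weight * gain
        phi[name] = total
    return phi


def toy_model(p):
    # Hypothetical score, NOT the paper's model: acetone dominates and
    # interacts with alcohol; CO has a small additive effect.
    return (2.0 * p["acetone"] + 0.5 * p["alcohol"]
            + 1.0 * p["acetone"] * p["alcohol"] + 0.1 * p["co"])


x = {"co": 1.0, "alcohol": 1.0, "acetone": 1.0}
baseline = {"co": 0.0, "alcohol": 0.0, "acetone": 0.0}
phi = shapley_values(toy_model, x, baseline)
for name, contribution in phi.items():
    print(f"{name:8s} {contribution:+.3f}")
```

The contributions sum to the difference between the model output at `x` and at the baseline, and the alcohol-acetone product term is split equally between the two features involved.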

Figure 8 presents the SHAP interaction values for Gradient Boosting and Random Forest. Both models independently identified acetone and alcohol as the most relevant biomarkers, with acetone consistently showing the highest contribution to the prediction of metabolic states. This aligns with established biochemical evidence, as elevated levels of exhaled acetone are associated with increased ketone body production, a common metabolic signature in diabetic and prediabetic individuals [22,28].

Moreover, SHAP interaction plots reveal strong synergistic effects between alcohol and acetone, suggesting that the combined presence of these volatiles is more predictive than their individual contributions. This interaction is especially pronounced in the Gradient Boosting model, likely due to its adaptive nature and sensitivity to hard-to-classify cases. In contrast, the Random Forest model exhibited more uniform and concentrated SHAP value distributions, reflecting its ensemble structure and lower variance. While CO exhibited moderate individual impact, its interactions—particularly with alcohol—contributed meaningfully to the classification performance. These findings support the hypothesis that CO reflects oxidative stress or inflammation, which may be indirectly linked to metabolic dysregulation [8,14,39,42].

Importantly, although SHAP explanations are not natively supported for soft-voting ensembles, the high agreement between the SHAP outputs of Gradient Boosting and Random Forest suggests that the ensemble model integrates the most informative aspects of each base classifier. This convergence in interpretability may help explain the ensemble’s strong performance, which was near-perfect across all reported metrics (Table 2). The ensemble likely benefits from Random Forest’s robustness to noise and Gradient Boosting’s focus on borderline instances, leading to enhanced generalizability and decision stability.
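For readers unfamiliar with soft voting, a minimal sketch of the combination rule is given below. The probability vectors are hypothetical, and equal weights are an assumption; the weights used in the study are not stated in this section.

```python
def soft_vote(prob_a, prob_b, weights=(0.5, 0.5)):
    """Average two class-probability vectors; return (winning index, averaged probs)."""
    wa, wb = weights
    avg = [wa * a + wb * b for a, b in zip(prob_a, prob_b)]
    best = max(range(len(avg)), key=avg.__getitem__)
    return best, avg


classes = ["Healthy", "Prediabetes", "Diabetes"]
rf_probs = [0.10, 0.45, 0.45]  # hypothetical Random Forest output (a tie)
gb_probs = [0.05, 0.70, 0.25]  # hypothetical Gradient Boosting output

idx, avg = soft_vote(rf_probs, gb_probs)
print(classes[idx], [round(p, 3) for p in avg])
```

In this made-up case the first model is undecided between prediabetes and diabetes, and the averaged probabilities resolve the tie in favor of the class the second model is confident about.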

In summary, SHAP analysis confirmed that acetone and alcohol are the most influential features in the classification of healthy, prediabetic, and diabetic states using breath-based biomarkers. The reproducibility of these findings across multiple models strengthens their reliability as non-invasive indicators and validates the ensemble approach not only in performance but also in interpretability—a crucial aspect for translational deployment in clinical and screening environments [8,14,15,21,34,35].

4. Discussion

This study demonstrates the feasibility of accurately stratifying metabolic states—healthy, prediabetic, and diabetic—through a non-invasive, low-cost breath analysis system leveraging only three metal-oxide gas sensors. Among all evaluated classifiers, the ensemble model combining Random Forest and Gradient Boosting consistently exhibited superior generalization and discriminative capacity across cross-validation folds, effectively capturing nonlinear interactions in VOC profiles and resolving subtle physiological distinctions between metabolic phenotypes.

Tree-based methods outperformed linear and distance-based classifiers, with the ensemble exhibiting synergistic behavior: Random Forest contributed robustness to noisy features, while Gradient Boosting adaptively focused on misclassified samples. Furthermore, the engineered alcohol-to-acetone ratio significantly enhanced interclass separability, underscoring the value of feature engineering grounded in biochemical rationale [22,31].
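The engineered feature itself is simple to express. The following sketch assumes each sample is a dictionary of sensor-derived readings; the key names, the example values, and the small `eps` guard against division by zero are illustrative assumptions rather than details taken from the study.

```python
def alcohol_to_acetone_ratio(alcohol, acetone, eps=1e-9):
    """Ratio of the alcohol reading to the acetone reading.

    `eps` is a hypothetical safeguard for acetone readings of zero.
    """
    return alcohol / (acetone + eps)


def add_ratio(sample):
    """Return a copy of `sample` with the engineered ratio appended."""
    enriched = dict(sample)
    enriched["alcohol_acetone_ratio"] = alcohol_to_acetone_ratio(
        sample["alcohol"], sample["acetone"]
    )
    return enriched


# Hypothetical readings in arbitrary but consistent units.
sample = {"co": 3.2, "alcohol": 0.8, "acetone": 1.6}
features = add_ratio(sample)
print(features)
```

Because the ratio is a deterministic function of two existing features, it adds no new measurement; its value lies in exposing a relationship that tree splits on single features would otherwise need several steps to approximate.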

The inclusion of a desiccation system proved essential given the humidity sensitivity of metal-oxide sensors. Moisture not only distorts sensor responses but accelerates material degradation, compromising long-term stability. Although physical dehumidification stabilized measurements and improved reliability, it introduced operational complexity and hindered portability [21,30,34]. As highlighted by Paleczek et al. [34], no standard algorithmic alternative currently exists to fully mitigate humidity artifacts. Thus, future research should prioritize the development of humidity-resilient sensor technologies, real-time compensation algorithms, and standardized breath sampling protocols to facilitate scalable deployment in clinical and point-of-care settings.

While the application of SMOTE effectively addressed class imbalance, it may lead to overestimated performance in small datasets. Our prior work using CTGAN-based synthetic data generation (14,000 samples) demonstrated model consistency under expanded conditions [28]; however, this study emphasizes real-sample classification, particularly of prediabetic individuals, a clinically critical but diagnostically elusive group [26]. Despite the strong average performance, occasional misclassification of borderline cases highlights the need for more nuanced modeling strategies. Cost-sensitive learning, class-specific calibration, and probabilistic inference may offer pathways to improving early-stage detection sensitivity, thereby enhancing clinical utility.
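The core step of SMOTE can be sketched in a few lines: each synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbors. The code below is a simplified illustration with made-up feature vectors, not the implementation used in the study; the function name `smote_like`, `k=3`, and the fixed seed are assumptions. The target of 22 new points reflects raising the 7 prediabetic samples to the 29 of the largest class.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Generate `n_new` synthetic points by interpolating between neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to nb
        synthetic.append([a + u * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Hypothetical prediabetic samples: [CO, alcohol, acetone].
prediabetic = [
    [3.1, 0.8, 1.5], [2.9, 0.7, 1.7], [3.4, 0.9, 1.6], [3.0, 0.6, 1.8],
    [3.3, 0.8, 1.4], [2.8, 0.7, 1.6], [3.2, 0.9, 1.9],
]
synthetic = smote_like(prediabetic, n_new=22)
print(len(prediabetic) + len(synthetic))  # 29 samples after oversampling
```

Because every synthetic point lies between two real samples, it closely resembles them, which is one reason scores obtained with oversampled data on a small cohort should be interpreted cautiously.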

All breath samples were collected under fasting conditions to minimize confounding metabolic fluctuations [21,22,23,34,35], reinforcing the importance of protocol standardization for temporal, dietary, and activity-related factors. Nevertheless, the observed performance is intrinsically tied to the controlled and homogeneous nature of the current cohort. Human breath is highly individualized, functioning as a metabolic fingerprint shaped by genetics, diet, microbiome, comorbidities, and environmental exposures [21,22,23,34,35]. Consequently, validation on larger real-world breath datasets from more heterogeneous populations is imperative for clinical translation.

Therefore, larger-scale studies involving more heterogeneous populations—with broader age ranges, ethnic backgrounds, comorbidities, and glycemic profiles—are essential to validate and generalize the system’s performance in real-world clinical settings [22,28]. Future iterations should integrate contextual lifestyle and clinical variables—such as dietary intake, sleep, physical activity, and pharmacotherapy—given their established impact on metabolic state and breath chemistry [24,26]. The use of a separate hold-out validation cohort is also recommended, alongside the systematic reporting of 95% confidence intervals to ensure statistically rigorous assessment of model generalizability [29].
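One common way to report a 95% confidence interval for an accuracy estimate is the Wilson score interval for a binomial proportion. The sketch below is an illustration under stated assumptions: the counts (86 correct out of 87) are hypothetical, and the choice of the Wilson method is ours, not one prescribed by the study.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts, for illustration only.
low, high = wilson_interval(correct=86, total=87)
print(f"accuracy = {86 / 87:.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

With fewer than a hundred samples the interval remains noticeably wide even for a near-perfect point estimate, which illustrates why interval reporting matters for cohorts of this size.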

A notable limitation of this study is the lack of repeated longitudinal measurements, precluding assessment of intra-individual variability. Future research should adopt longitudinal designs to evaluate biomarker stability and classifier robustness across temporal scales, which is particularly relevant for real-world continuous monitoring applications [26].

Table 4 situates the present study within the broader context of e-nose-based diabetes classification. To our knowledge, this is the first work to achieve multiclass discrimination (healthy, prediabetic, diabetic) using real exhaled breath data with near-perfect performance. Prior studies predominantly address binary classification or rely on simulated data. In contrast, our ensemble model surpassed all prior methods in F1-score—a metric of paramount relevance in medical diagnostics—within a nested cross-validation framework. This dual-loop structure, involving internal hyperparameter tuning and external unbiased performance estimation, mitigates overfitting and yields more reliable generalization, an often-overlooked necessity in biomedical ML studies.
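The dual-loop structure can be sketched as follows. This is a schematic of nested cross-validation with 3 outer and 3 inner folds, not the study's code: `fit_and_score` is a hypothetical caller-supplied callback standing in for model training and evaluation, the folds are not stratified, and the candidate values are placeholders for a hyperparameter grid.

```python
def kfold_splits(indices, k):
    """Yield (train, test) index lists for k interleaved folds."""
    for i in range(k):
        test = indices[i::k]
        held_out = set(test)
        train = [j for j in indices if j not in held_out]
        yield train, test


def nested_cv(indices, candidates, fit_and_score, outer_k=3, inner_k=3):
    """Return one unbiased score per outer fold."""
    outer_scores = []
    for outer_train, outer_test in kfold_splits(indices, outer_k):
        # Inner loop: pick the hyperparameter using the outer training part only.
        def inner_mean(param):
            scores = [
                fit_and_score(param, tr, va)
                for tr, va in kfold_splits(outer_train, inner_k)
            ]
            return sum(scores) / len(scores)

        best = max(candidates, key=inner_mean)
        # Outer loop: refit on the outer training part, score once on the held-out fold.
        outer_scores.append(fit_and_score(best, outer_train, outer_test))
    return outer_scores


def dummy_fit_and_score(param, train_idx, test_idx):
    # Stand-in for "train with `param` on train_idx, evaluate on test_idx".
    assert not set(train_idx) & set(test_idx)  # no sample is in both parts
    return 1.0 - 0.1 * abs(param - 2)


scores = nested_cv(list(range(87)), candidates=[1, 2, 3],
                   fit_and_score=dummy_fit_and_score)
print(scores)
```

The key property is that the held-out outer fold never influences hyperparameter selection; in practice, any resampling or scaling step would normally also be fitted on the training part of each split only.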

Finally, future work should prioritize the deployment of lightweight, interpretable models compatible with microcontrollers for real-time embedded inference. The integration of explainable machine learning within TinyML frameworks opens a promising avenue for portable, low-power diagnostic systems. Such systems could enable point-of-care screening, ambulatory monitoring, and integration into wearable platforms—bridging the gap between proof-of-concept studies and practical clinical application [21,28].

5. Conclusions

This work presents a novel, breath-based, non-invasive system for multiclass classification of diabetes, leveraging only three gas sensors and machine learning. The proposed ensemble model achieved near-perfect diagnostic performance, highlighting the feasibility of using VOC profiles in exhaled breath as robust metabolic biomarkers. Beyond high classification accuracy, the proposed framework offers practical clinical benefits—affordability, scalability, and compatibility with real-time, point-of-care deployment—marking a decisive advance toward accessible and patient-friendly screening tools [8,14,15,21,34,35].

Future research should aim to expand population diversity, including broader age groups, ethnicities, and comorbidity profiles, to enhance generalizability. Integrating contextual variables (such as lifestyle, dietary habits, sleep patterns, and pharmacological treatments) may further strengthen model performance and personalize risk stratification. Additionally, deploying compact, explainable (XAI) models on embedded systems represents a key step toward translational applications, enabling continuous, real-time health monitoring in both clinical and ambulatory settings [24,28,34].

Ultimately, transforming a single exhalation into clinically actionable insight may revolutionize how diabetes is detected, monitored, and prevented—empowering earlier intervention and improving outcomes for millions worldwide [21].

6. Patents

A Utility Model application has been submitted to the Mexican Institute of Industrial Property (IMPI). This application has successfully passed the formal examination, complying with the requirements established by the Federal Law on Industrial Property and the Regulations of the Industrial Property Law in Mexico. The Utility Model has been published in the IMPI database, SIGA 2.0, as of 15 February 2024, under application number MX/u/2023/000465. The authors associated with this patent are Alberto Gudiño-Ochoa, Julio Alberto García-Rodríguez, Jorge Ivan Cuevas-Chávez, Raquel Ochoa-Ornelas, and Daniel Alejandro Sánchez-Arias.

Author Contributions

Conceptualization, A.G.-O., J.A.G.-R., J.I.C.-C. and R.O.-O.; methodology, A.G.-O., J.A.G.-R., R.O.-O., S.U.-T., J.I.C.-C. and D.A.S.-A.; software, A.G.-O., J.I.C.-C. and R.O.-O.; validation, A.G.-O., E.R.-V. and J.A.G.-R.; formal analysis, A.G.-O. and J.A.G.-R.; investigation, A.G.-O., J.I.C.-C. and D.A.S.-A.; resources, J.A.G.-R. and S.U.-T.; data curation, A.G.-O., J.I.C.-C. and R.O.-O.; writing – original draft preparation, A.G.-O., J.A.G.-R. and J.I.C.-C.; writing – review and editing, J.A.G.-R., S.U.-T. and E.R.-V.; visualization, A.G.-O., J.I.C.-C., R.O.-O. and D.A.S.-A.; supervision, A.G.-O.; project administration, J.A.G.-R., S.U.-T. and D.A.S.-A.; funding acquisition, J.A.G.-R., S.U.-T. and E.R.-V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The breath samples used in this study were obtained under informed consent as part of a previously approved clinical protocol. Ethical approval and data collection procedures were conducted in accordance with the Declaration of Helsinki and are detailed in earlier publications by the authors [21,28]. Ethical approval was granted by the Comité de Ética del Centro Universitario del Sur, Universidad de Guadalajara, v.24.09.21, 6 February 2024.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original data presented in this study are available upon request from the corresponding author due to ethical and privacy considerations. The synthetic data generated during this study, which replicates the statistical characteristics of the original data, are openly available at https://github.com/AlbertoGudinoOchoa/breath-diabetes-synthetic-data/ under the MIT License (accessed on 2 April 2025).

Acknowledgments

This manuscript was prepared with resources from the PROSNII-2025 program and the Department of Computer Science and Technological Innovation at Centro Universitario del Sur, University of Guadalajara. The authors express their gratitude to Instituto Tecnológico de Ciudad Guzmán for the support provided.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DM	Diabetes mellitus
SMOTE	Synthetic Minority Over-Sampling Technique
CO	Carbon monoxide
BGL	Blood glucose levels
CGM	Continuous glucose monitoring
VOCs	Volatile organic compounds
GC-MS	Gas Chromatography-Mass Spectrometry
SIFT-MS	Selected Ion Flow Tube Mass Spectrometry
PTR-MS	Proton Transfer Reaction Mass Spectrometry
E-nose	Electronic nose
ML	Machine learning
DL	Deep learning
MOS	Metal-oxide semiconductor
CNNs	Convolutional neural networks
T2DM	Type 2 diabetes mellitus
T1DM	Type 1 diabetes mellitus
ADC	Analog-to-digital converter
RH	Relative humidity
CSV	Comma-separated values
DWT	Discrete wavelet transform
t-SNE	t-distributed Stochastic Neighbor Embedding
KL	Kullback–Leibler
SVM	Support Vector Machines
KNN	k-Nearest Neighbors
XAI	Explainable artificial intelligence
SHAP	Shapley additive explanations
MCC	Matthews Correlation Coefficient

References

1. Gregg, E.W.; Buckley, J.; Ali, M.K.; Davies, J.; Flood, D.; Mehta, R.; Zhumadilov, Z. Improving health outcomes of people with diabetes: Target setting for the WHO Global Diabetes Compact. Lancet 2023, 401, 1302–1312.
2. Hossain, M.J.; Al-Mamun, M.; Islam, M.R. Diabetes mellitus, the fastest growing global public health concern: Early detection should be focused. Health Sci. Rep. 2024, 7, e2004.
3. Menke, A.; Knowler, W.C.; Cowie, C.C. Physical and metabolic characteristics of persons with diabetes and prediabetes. In Diabetes in America, 3rd ed.; National Institute of Diabetes and Digestive and Kidney Diseases: Bethesda, MD, USA, 2018.
4. Jasim, O.H.; Mahmood, M.M.; Ad’hiah, A.H. Significance of lipid profile parameters in predicting pre-diabetes. Arch. Razi Inst. 2022, 77, 277–285.
5. Pradeepa, R.; Mohan, V. Epidemiology of chronic complications of diabetes: A global perspective. In Chronic Complications of Diabetes Mellitus; Academic Press: Cambridge, MA, USA, 2024; pp. 11–23.
6. Lindner, N.; Kuwabara, A.; Holt, T. Non-invasive and minimally invasive glucose monitoring devices: A systematic review and meta-analysis on diagnostic accuracy of hypoglycaemia detection. Syst. Rev. 2021, 10, 145.
7. Ortiz-Martínez, M.; González-González, M.; Martagón, A.J.; Hlavinka, V.; Willson, R.C.; Rito-Palomares, M. Recent developments in biomarkers for diagnosis and screening of type 2 diabetes mellitus. Curr. Diab. Rep. 2022, 22, 95–115.
8. Jain, P.; Joshi, A.M.; Mohanty, S.P.; Cenkeramaddi, L.R. Non-invasive glucose measurement technologies: Recent advancements and future challenges. IEEE Access 2024, 12, 61907–61936.
9. Gouveri, E.; Papanas, N. The emerging role of continuous glucose monitoring in the management of diabetic peripheral neuropathy: A narrative review. Diabetes Ther. 2022, 13, 931–952.
10. Jang, S.; Wang, Y.; Jang, A. Review of emerging approaches utilizing alternative physiological human body fluids in non- or minimally invasive glucose monitoring. In Advanced Bioscience and Biosystems for Detection and Management of Diabetes; Springer International Publishing: Cham, Switzerland, 2022; pp. 9–26.
11. Li, Y.; Chen, Y. Review of noninvasive continuous glucose monitoring in diabetics. ACS Sens. 2023, 8, 3659–3679.
12. Fiedorova, K.; Augustynek, M.; Kubicek, J.; Kudrna, P.; Bibbo, D. Review of present method of glucose from human blood and body fluids assessment. Biosens. Bioelectron. 2022, 211, 114348.
13. Chowdhury, M.H.; Shuzan, M.N.I.; Chowdhury, M.E.; Mahbub, Z.B.; Uddin, M.M.; Khandakar, A.; Reaz, M.B.I. Estimating blood pressure from the photoplethysmogram signal and demographic features using machine learning techniques. Sensors 2020, 20, 3127.
14. Liu, H.; Liu, W.; Sun, C.; Huang, W.; Cui, X. A review of non-invasive blood glucose monitoring through breath acetone and body surface. Sens. Actuators A Phys. 2024, 359, 115500.
15. Jadhav, M.R.; Wankhede, P.R.; Srivastava, S.; Bhargaw, H.N.; Singh, S. Breath-based biosensors and system development for noninvasive detection of diabetes: A review. Diabetes Metab. Syndr. Clin. Res. Rev. 2024, 18, 102931.
16. Xu, W.; Zou, X.; Ding, H.; Ding, Y.; Zhang, J.; Liu, W.; Chu, Y. Rapid and non-invasive diagnosis of type 2 diabetes through sniffing urinary acetone by a proton transfer reaction mass spectrometry. Talanta 2023, 256, 124265.
17. Nicolier, C.; Künzler, J.; Lizoain, A.; Kerber, D.; Hossmann, S.; Rothenbühler, M.; Witthauer, L. Detection of hypoglycaemia in type 1 diabetes through breath volatile organic compound profiling using gas chromatography–ion mobility spectrometry. Diabetes Obes. Metab. 2024, 26, 5737–5744.
18. Hu, B. Mass spectrometric analysis of exhaled breath: Recent advances and future perspectives. TrAC Trends Anal. Chem. 2023, 168, 117320.
19. Mahnoor, M.; Shah, A.A.; Inam, A. Acetone detection using various techniques for diagnosis of diabetes mellitus from human exhaled breath: A review. In Proceedings of the AIP Conference, Kuala Lumpur, Malaysia, 28–30 August 2024; Volume 3125, p. 1.
20. Zhang, X.; Frankevich, V.; Ding, J.; Ma, Y.; Chingin, K.; Chen, H. Direct mass spectrometry analysis of exhaled human breath in real-time. Mass Spectrom. Rev. 2025, 44, 43–61.
21. Gudiño-Ochoa, A.; García-Rodríguez, J.A.; Ochoa-Ornelas, R.; Cuevas-Chávez, J.I.; Sánchez-Arias, D.A. Noninvasive diabetes detection through human breath using TinyML-powered E-nose. Sensors 2024, 24, 1294.
22. Paleczek, A.; Rydosz, A. The effect of high ethanol concentration on E-nose response for diabetes detection in exhaled breath: Laboratory studies. Sens. Actuators B Chem. 2024, 408, 135550.
23. Paleczek, A.; Grochala, D.; Rydosz, A. Artificial breath classification using XGBoost algorithm for diabetes detection. Sensors 2021, 21, 4187.
24. Zaim, O.; Bouchikhi, B.; Motia, S.; Abelló, S.; Llobet, E.; El Bari, N. Discrimination of diabetes mellitus patients and healthy individuals based on volatile organic compounds (VOCs): Analysis of exhaled breath and urine samples by using e-nose and VE-tongue. Chemosensors 2023, 11, 350.
25. Zhu, H.; Liu, C.; Zheng, Y.; Zhao, J.; Li, L. A hybrid machine learning algorithm for detection of simulated expiratory markers of diabetic patients based on gas sensor array. IEEE Sens. J. 2022, 23, 2940–2947.
26. Kapur, R.; Kumar, Y.; Sharma, R.; Singh, E.; Rohilla, D.; Kanwar, V.; Dutt, V. GlucoBreath: An IoT, ML, and breath-based non-invasive glucose meter. IEEE Access 2024, 12, 59346–59360.
27. Lekha, S.; Suchetha, M. Real-time non-invasive detection and classification of diabetes using modified convolution neural network. IEEE J. Biomed. Health Inform. 2017, 22, 1630–1636.
28. Gudiño-Ochoa, A.; García-Rodríguez, J.A.; Cuevas-Chávez, J.I.; Ochoa-Ornelas, R.; Navarrete-Guzmán, A.; Vidrios-Serrano, C.; Sánchez-Arias, D.A. Enhanced diabetes detection and blood glucose prediction using TinyML-integrated E-nose and breath analysis: A novel approach combining synthetic and real-world data. Bioengineering 2024, 11, 1065.
29. Ochoa-Ornelas, R.; Gudiño-Ochoa, A.; García-Rodríguez, J.A. A hybrid deep learning and machine learning approach with Mobile-EfficientNet and Grey Wolf Optimizer for lung and colon cancer histopathology classification. Cancers 2024, 16, 3791.
30. Ye, Z.; Wang, J.; Hua, H.; Zhou, X.; Li, Q. Precise detection and quantitative prediction of blood glucose level with an electronic nose system. IEEE Sens. J. 2022, 22, 12452–12459.
31. Sha, M.S.; Maurya, M.R.; Shafath, S.; Cabibihan, J.J.; Al-Ali, A.; Malik, R.A.; Sadasivuni, K.K. Breath analysis for the in vivo detection of diabetic ketoacidosis. ACS Omega 2022, 7, 4257–4266.
32. Grundy, S.M. Pre-diabetes, metabolic syndrome, and cardiovascular risk. J. Am. Coll. Cardiol. 2012, 59, 635–643.
33. Rodríguez-Fonseca, L.; Llorente-Pendás, S.; García-Pola, M. Risk of prediabetes and diabetes in oral lichen planus: A case–control study according to current diagnostic criteria. Diagnostics 2023, 13, 1586.
34. Paleczek, A.; Rydosz, A. Review of the algorithms used in exhaled breath analysis for the detection of diabetes. J. Breath Res. 2022, 16, 026003.
35. Lekha, S.; Suchetha, M. Recent advancements and future prospects on e-nose sensors technology and machine learning approaches for non-invasive diabetes diagnosis: A review. IEEE Rev. Biomed. Eng. 2020, 14, 127–138.
36. Ahsan, M.M.; Mahmud, M.P.; Saha, P.K.; Gupta, K.D.; Siddique, Z. Effect of data scaling methods on machine learning algorithms and model performance. Technologies 2021, 9, 52.
37. ACE/ADA Task Force on Inpatient Diabetes. American College of Endocrinology and American Diabetes Association consensus statement on inpatient diabetes and glycemic control: A call to action. Diabetes Care 2006, 29, 1955–1962.
38. Van De Ruit, M.; Billeter, M.; Eisemann, E. An efficient dual-hierarchy t-SNE minimization. IEEE Trans. Vis. Comput. Graph. 2021, 28, 614–622.
39. Yousef, H.; Khandoker, A.H.; Feng, S.F.; Helf, C.; Jelinek, H.F. Inflammation, oxidative stress and mitochondrial dysfunction in the progression of type II diabetes mellitus with coexisting hypertension. Front. Endocrinol. 2023, 14, 1173402.
40. Chen, T.C.T.; Wu, H.C.; Chiu, M.C. A deep neural network with modified random forest incremental interpretation approach for diagnosing diabetes in smart healthcare. Appl. Soft Comput. 2024, 152, 111183.
41. Wang, Y.C.; Chen, T.C.T.; Chiu, M.C. A systematic approach to enhance the explainability of artificial intelligence in healthcare with application to diagnosis of diabetes. Healthc. Anal. 2023, 3, 100183.
42. Oguntibeju, O.O. Type 2 diabetes mellitus, oxidative stress and inflammation: Examining the links. Int. J. Physiol. Pathophysiol. Pharmacol. 2019, 11, 45–52.
43. Lekha, S.; Suchetha, M. A novel 1-D convolution neural network with SVM architecture for real-time detection applications. IEEE Sens. J. 2017, 18, 724–731.
44. Weng, X.; Li, G.; Liu, Z.; Liu, R.; Liu, Z.; Wang, S.; Chang, Z. A preliminary screening system for diabetes based on in-car electronic nose. Endocr. Connect. 2023, 12, e220437.
45. Bhaskar, N.; Bairagi, V.; Boonchieng, E.; Munot, M.V. Automated detection of diabetes from exhaled human breath using deep hybrid architecture. IEEE Access 2023, 11, 51712–51722.

Figure 1. Non-invasive diabetes detection system using an electronic nose.

Figure 2. Response of MQ-135 sensor signals for healthy, T1DM, and T2DM patients, original and denoised with DWT.

Figure 3. Class balance before and after applying the SMOTE technique.

Figure 4. Two-dimensional t-SNE projection of the dataset after SMOTE-based class balancing.

Figure 5. Correlation matrix between selected breath biomarkers (CO, alcohol, and acetone) and the engineered feature alcohol-to-acetone ratio.

Figure 6. Confusion matrices illustrating the classification performance on the final fold (Fold 3) of the outer nested cross-validation loop for different classifiers: (a) CatBoost; (b) Random Forest, LightGBM; (c) Extra Trees, SVM, KNN, Logistic Regression; (d) XGBoost; (e) Gradient Boosting, AdaBoost; (f) Ensemble model combining Random Forest and Gradient Boosting.

Figure 7. Overall classification performance (mean ± standard deviation) for all models across all outer folds of the nested cross-validation, including Accuracy, Precision, Recall, F1 Score, ROC AUC, and MCC.

Figure 8. SHAP interaction plots for feature pairs: (a) Gradient Boosting model; (b) Random Forest model. Each dot represents a SHAP interaction value for a specific instance. The color of each dot reflects the feature value of the main feature on the y-axis: blue indicates low values, while red indicates high values.

Table 1. Sensors used in the developed E-nose system [21,28].

Sensor	Target Gases	Detection Range of Target Gas	Operating Conditions
MQ-2	H₂, LPG, CH₄, CO, Alcohol, Propane, Air	200–10,000 ppm CO	Temperature: −10–50 °C; RH: less than 95%; standard detecting condition: 20 °C ± 2 °C temperature, 65 ± 5% humidity
MQ-3	Alcohol, Benzine, CH₄, Hexane, LPG, CO, Air	0.1–10 mg/L Alcohol
MQ-7	H₂, CO, LPG, CH₄, Alcohol, Air	50–4000 ppm CO
MQ-135	CO₂, Alcohol, Air, NH₄, Toluene, Acetone, CO	0–200 ppm Acetone
MQ-138	Benzene, CO, CH₄, n-Hexane, Alcohol, Propane, Air	200–10,000 ppm Benzene
DHT-22	Temperature, Relative Humidity	−40 °C–80 °C Temperature, 0–100% Relative Humidity	Temperature: 0–50 °C; RH: 0–100%
MICS-5524	CO, VOCs, C₂H₅OH, H₂, NH₃, CH₄	1–1000 ppm VOCs	Temperature: 23 °C ± 5 °C; RH: less than 95%

Table 2. Overall performance metrics (mean ± standard deviation) for each evaluated model across all outer folds of the nested cross-validation. Metrics include Accuracy, Precision, Recall, F1 Score, ROC AUC, and MCC.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	ROC AUC	MCC (%)
Ensemble model (Random Forest + Gradient Boosting)	98.86 ± 1.97	99.07 ± 1.60	98.81 ± 2.06	98.87 ± 1.96	1.000 ± 0.0	98.36 ± 2.84
Random Forest	98.85 ± 1.63	98.99 ± 1.43	98.77 ± 1.75	98.82 ± 1.67	1.000 ± 0.0	98.33 ± 2.37
Gradient Boosting	98.85 ± 1.63	98.99 ± 1.43	98.77 ± 1.75	98.82 ± 1.67	1.000 ± 0.0	98.33 ± 2.37
AdaBoost	98.85 ± 1.63	98.99 ± 1.43	98.77 ± 1.75	98.82 ± 1.67	0.9938 ± 0.0087	98.33 ± 2.37
LightGBM	97.70 ± 3.25	97.65 ± 3.32	97.65 ± 3.32	97.65 ± 3.32	0.9988 ± 0.0017	96.55 ± 4.88
SVM	96.55 ± 2.82	97.14 ± 2.27	96.42 ± 3.03	96.49 ± 2.94	0.9988 ± 0.0017	95.08 ± 3.98
CatBoost	95.40 ± 4.30	96.13 ± 3.56	95.06 ± 4.62	95.06 ± 4.71	1.000 ± 0.0	93.48 ± 6.03
Extra Trees	94.25 ± 1.63	95.29 ± 1.19	94.07 ± 1.60	94.17 ± 1.62	1.000 ± 0.0	91.84 ± 2.22
KNN	94.25 ± 1.63	95.29 ± 1.19	94.07 ± 1.60	94.17 ± 1.62	0.9731 ± 0.0227	91.84 ± 2.22
XGBoost	94.25 ± 5.86	95.29 ± 4.69	93.95 ± 6.35	93.69 ± 6.70	1.000 ± 0.0	91.97 ± 8.08
Logistic Regression	90.80 ± 1.63	92.17 ± 1.91	90.62 ± 1.43	90.64 ± 1.54	0.9896 ± 0.005	86.90 ± 2.43

Table 3. Global classification report per class (Healthy, Prediabetes, Diabetes), including precision, recall, and F1-score for each evaluated model. Results are aggregated across all outer folds of the nested cross-validation.

Model	Class	Precision (%)	Recall (%)	F1-Score (%)
Random Forest + Gradient Boosting	Healthy	100 ± 0.00	100 ± 0.00	100 ± 0.00
	Prediabetes	100 ± 0.00	97.55 ± 0.05	98.41 ± 0.02
	Diabetes	97.55 ± 0.05	100 ± 0.00	98.41 ± 0.02
Random Forest	Healthy	100 ± 0.00	100 ± 0.00	100 ± 0.00
	Prediabetes	100 ± 0.00	96.30 ± 5.24	98.04 ± 2.77
	Diabetes	96.97 ± 4.29	100 ± 0.00	98.41 ± 2.24
Gradient Boosting	Healthy	100 ± 0.00	100 ± 0.00	100 ± 0.00
	Prediabetes	100 ± 0.00	96.30 ± 5.24	98.04 ± 2.77
	Diabetes	96.97 ± 4.29	100 ± 0.00	98.41 ± 2.24
AdaBoost	Healthy	100 ± 0.00	100 ± 0.00	100 ± 0.00
	Prediabetes	100 ± 0.00	96.30 ± 5.24	98.04 ± 2.77
	Diabetes	96.97 ± 4.29	100 ± 0.00	98.41 ± 2.24
LightGBM	Healthy	100 ± 0.00	100 ± 0.00	100 ± 0.00
	Prediabetes	96.67 ± 4.71	96.67 ± 4.71	96.67 ± 4.71
	Diabetes	96.30 ± 5.24	96.30 ± 5.24	96.30 ± 5.24
SVM	Healthy	94.44 ± 7.86	100 ± 0.00	96.97 ± 4.29
	Prediabetes	96.97 ± 4.29	92.59 ± 10.48	94.25 ± 5.15
	Diabetes	100 ± 0.00	96.67 ± 4.71	98.25 ± 2.48
CatBoost	Healthy	94.44 ± 7.86	100 ± 0.00	96.97 ± 4.29
	Prediabetes	96.97 ± 4.29	88.89 ± 15.71	91.75 ± 8.53
	Diabetes	96.97 ± 4.29	96.30 ± 5.24	96.45 ± 2.55
Extra Trees	Healthy	94.44 ± 7.86	100 ± 0.00	96.97 ± 4.29
	Prediabetes	91.41 ± 6.81	92.59 ± 10.48	91.22 ± 3.17
	Diabetes	100 ± 0.00	89.63 ± 8.18	94.34 ± 4.54
KNN	Healthy	94.44 ± 7.86	100 ± 0.00	96.97 ± 4.29
	Prediabetes	91.41 ± 6.81	92.59 ± 10.48	91.22 ± 3.17
	Diabetes	100 ± 0.00	89.63 ± 8.18	94.34 ± 4.54
XGBoost	Healthy	94.44 ± 7.86	96.67 ± 4.71	95.22 ± 3.73
	Prediabetes	96.97 ± 4.29	85.19 ± 2.09	88.89 ± 1.25
	Diabetes	94.44 ± 7.86	100 ± 0.00	96.97 ± 4.29
Logistic Regression	Healthy	94.44 ± 7.86	100 ± 0.00	96.97 ± 4.29
	Prediabetes	86.25 ± 9.93	89.26 ± 9.09	86.72 ± 0.75
	Diabetes	95.83 ± 5.89	82.59 ± 12.71	88.24 ± 8.32

Table 4. Comparison of best classifier models from recent studies on DM detection using E-nose data.

Study	Best Classifier Model	Dataset Type (Real or Artificial)	Year	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	Multiclass
Lekha S. et al. [43]	1D-CNN with SVM	Real: 26 individuals	2018	98	98	99	98	No
Paleczek A. et al. [23]	XGBoost	Artificial breath simulations	2021	99	97.9	100	97.4	No
Weng X. et al. [44]	Random Forest	Real: 240 individuals	2023	93.33	97.05	89.9	92.8	No
Zaim O. et al. [24]	SVM-DFA	Real: 60 individuals	2023	93.75	-	-	-	No
Bhaskar N. et al. [45]	CORNN with SVM	Real: 152 individuals	2023	98	97	98.5	97.8	No
Gudiño-Ochoa A. et al. [21]	XGBoost	Real: 44 individuals	2024	95	95	95	95	No
Kapur R. et al. [26]	GBoost-XGBoost (Ensemble)	Real: 492 individuals	2024	95.8	96.9	-	96.1	No
Gudiño-Ochoa A. et al. [28]	Random Forest	Artificial: 14,000 samples (from 58 individuals)	2024	94	93	92.5	91	No
This study	Random Forest + Gradient Boosting (Ensemble)	Real: 58 individuals (87 w/SMOTE)	2025	98.86	99.07	98.81	98.87	Yes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
