Machine Learning Approach to Metabolomic Data Predicts Type 2 Diabetes Mellitus Incidence

Metabolomics, with its wealth of data, offers a valuable avenue for enhancing predictions and decision-making in diabetes. This observational study aimed to leverage machine learning (ML) algorithms to predict the 4-year risk of developing type 2 diabetes mellitus (T2DM) using targeted quantitative metabolomics data. A cohort of 279 cardiovascular risk patients who underwent coronary angiography and who were initially free of T2DM according to American Diabetes Association (ADA) criteria was characterized at baseline by anthropometric data and by targeted metabolomics using liquid chromatography (LC)–mass spectrometry (MS) and flow injection analysis (FIA)–MS. All patients were followed for four years. During this time, 11.5% of the patients developed T2DM. After data preprocessing, 362 variables were used for ML, employing the Caret package in R. The dataset was divided into training and test sets (75:25 ratio), and we used an oversampling approach to address the class imbalance of T2DM incidence. After an additional recursive feature elimination step, which identified a set of 77 variables that were the most valuable for model generation, a Support Vector Machine (SVM) model with a linear kernel demonstrated the most promising predictive capabilities, exhibiting an F1 score of 50%, a specificity of 93%, and balanced and unbalanced accuracies of 72% and 88%, respectively. The top-ranked features were bile acids, ceramides, amino acids, and hexoses, whereas anthropometric features such as age, sex, waist circumference, and body mass index made no relevant contribution. In conclusion, ML analysis of metabolomics data is a promising tool for identifying individuals at risk of developing T2DM and opens avenues for personalized and early intervention strategies.


Introduction
Type 2 diabetes mellitus (T2DM) poses a significant and escalating global public health challenge, with its incidence continually increasing. Addressing this growing epidemic requires innovative strategies for early detection and intervention to prevent long-term complications such as cardiovascular diseases, nephropathy, neuropathy, and retinopathy.
Metabolomics, the detailed analysis of small-molecule metabolites within biological systems, stands out as a potent tool for uncovering metabolic changes and disease phenotypes. By mapping the entire metabolite spectrum in biological samples, it offers a detailed picture of an individual's metabolic state.

Table 1. Dichotomous data are given as proportions, and continuous data (all not normally distributed) are given as median and interquartile range [IQR]. Differences between patients who developed T2DM during the four-year follow-up and patients who did not develop T2DM during follow-up were tested with Chi-squared tests for categorical and Jonckheere-Terpstra tests for continuous variables. Anthropometric variables included in ML model generation are highlighted by asterisks.

Feature Selection
In the preliminary stage of our analysis, we employed recursive feature elimination (RFE) to identify both crucial and less significant variables. Table 2 presents a summary of the outcomes from various RFE models, including Random Forest, TreeBag, Naïve Bayes, and the Caret default. Based on the results from the Random Forest-based RFE model, we opted for a narrowed selection of variables (n = 77) and, as an alternative based on the results from both the TreeBag and Caret default models, also for the whole set of variables (n = 362) for further in-depth analysis. Regarding the plots depicted in Figure 1, there was no distinct cutoff, confirming our approach of proceeding with both the selectively reduced and the full sets of variables. Notably, across both selections, ceramides and hexoses (six-carbon sugars) emerged as the most important features according to the top five rankings, highlighting their potential significance in predicting T2DM incidence.

ML Model Comparison
Finally, ML was applied using a set of ten models as specified in Section 4. Table 3 outlines the performance metrics for the evaluated models. As the number of positive outcomes in our study was limited (n = 32), sensitivity and accuracy may be misleading and were not ideal performance measures. We thus prioritized the F1 score and balanced accuracy as the most critical metrics, given the outcome imbalance observed in our dataset. We tested all models with both the selectively reduced set of variables (n = 77) and the full set (n = 362). The results table gives an overview of the applied models and their respective performance metrics. The F1 score (2 × (precision × sensitivity)/(precision + sensitivity)) was selected as the most important metric.
Using the smaller set of variables, the "svmLinear2" model, a Support Vector Machine with a linear kernel, emerged as the top performer, yielding the highest F1 score (50%). This model also demonstrated notable balanced accuracy (72%) and precision (50%), effectively identifying 50% of patients who developed T2DM (sensitivity) and 93% of those who did not (specificity), leading to an (unbalanced) accuracy of 88%. The Multi-Layer Perceptron ("mlp") model followed, marking the second-best performance with an F1 score of 40% and a balanced accuracy of 66%.
Utilizing the full set of variables (n = 362), model performance varied, with no consistent trend of improvement or decline compared to the reduced set of variables (n = 77). Notably, the Rotation Forest model stood out, ranking third overall in performance with an F1 score of 40% and a balanced accuracy of 63%. A comparison of ROC curves for all models is presented in Supplementary Figure S2.

Important Features Ranking
The twenty most important features of the top-ranked "svmLinear2" model according to the "varImp()" function are depicted in Figure 2. The top features were hexoses, amino acids (glycine, isoleucine, and the amino acid derivative kynurenine), and bile acids (chenodeoxycholic acid, CDCA, and deoxycholic acid, DCA). A comparison of these features with respect to their levels in patients who developed T2DM (positive outcome) and those who did not (negative outcome) illustrates that nearly all of them differed significantly between the two outcomes (Supplementary Table S1) and correlated with the outcome (Supplementary Table S2). However, some features, including certain ceramides and glycerophospholipids, demonstrated high collinearity (Supplementary Table S2). Furthermore, taking into account all top 20 ranked features of each model (generated by "varImp()"), we calculated a feature importance score (FIS) based on in-model ranking and frequency (Table 4); the FIS sums up the ranking score of the top 20 features (= 20/rank) across all models described in Table 3. Hexoses were again the top-ranked feature (FIS = 354). When referring only to single metabolites, however, the amino acid glycine was top-ranked, followed by another amino acid, isoleucine; the bile acids CDCA, glycoursodeoxycholic acid (GUDCA), and DCA; and a ceramide (N-C15:1-Cer, containing a pentadecenoic acid). When referring to metabolite classes instead, bile acids had the highest count (FIS = 1012), followed by ceramides (FIS = 928), amino acids (FIS = 766), and hexoses (FIS = 354).
Moreover, the Shapley Additive Explanation (SHAP) method offered an alternative, model-agnostic view of variable significance (Figure 3). Comparing results from SHAP and the "varImp()" function, there were some similarities regarding the inclusion of hexoses and ceramides among the top-ranked features, but also some discrepancies, particularly the high ranking of lactic acid and several acylcarnitines (tiglylcarnitine, decenoylcarnitine, hexadecenoylcarnitine, tetradecadienylcarnitine, and carnitine).
Notably, the anthropometric variables age, sex, waist circumference, waist-hip ratio, and body mass index (BMI) were almost completely ignored by the applied ML algorithms and did not show up in either the SHAP or the "varImp()"-generated list of top-ranked features (Figures 2 and 3), suggesting a limited impact.
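The FIS described above (summing 20/rank for each appearance of a feature in a model's top-20 list) can be sketched in a few lines. The rankings below are hypothetical illustrations, not the study's actual model output:

```python
from collections import defaultdict

def feature_importance_score(model_rankings):
    """Sum 20/rank over every model in which a feature appears
    among the top-ranked features, as described in the text."""
    fis = defaultdict(float)
    for ranking in model_rankings:  # one ordered top-k list per model
        for rank, feature in enumerate(ranking, start=1):
            fis[feature] += 20 / rank
    return dict(fis)

# Hypothetical top-3 lists from two models (illustration only)
rankings = [["hexoses", "glycine", "CDCA"],
            ["glycine", "hexoses", "DCA"]]
scores = feature_importance_score(rankings)
# hexoses: 20/1 + 20/2 = 30.0; glycine: 20/2 + 20/1 = 30.0; CDCA: 20/3
```

A feature ranked first in many models thus accumulates a high FIS, while a feature appearing once near rank 20 contributes only about 1 point.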

Validation Model Output with a Linear Regression Approach
For comparison, we applied a more traditional statistical modeling approach known for its linear nature and interpretability: Lasso Regression. This method utilized the same preselected set of variables generated by RFE (n = 77). As a result, Lasso Regression excluded 66 variables and used 11 variables for prediction (Supplementary Table S3). Overall, 10 out of these 11 variables were also part of the top 20 most important variables identified by the varImp() function (Table 3) or the SHAP method (Figure 3). Of note, no ceramides or glycerophospholipids (both featuring high collinearity) were part of the Lasso Regression model. Given this difference compared to the more complex ML models, the predictive performance of the Lasso Regression approach was worse (F1 score = 19%, balanced accuracy = 45%).
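Lasso's exclusion of variables stems from its soft-thresholding update: in coordinate descent, coefficients whose least-squares estimate falls below the penalty are set exactly to zero. The sketch below illustrates the operator with hypothetical numbers; it is not the glmnet implementation:

```python
def soft_threshold(z, lam):
    """Lasso soft-thresholding: shrink the coefficient estimate z
    toward zero by the penalty lam; estimates with |z| <= lam become
    exactly zero, which is how Lasso 'excludes' weak or collinear
    variables from the model."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical coefficient estimates (illustration only)
weak = soft_threshold(0.3, 0.5)    # -> 0.0 (variable excluded)
strong = soft_threshold(2.0, 0.5)  # -> 1.5 (variable kept, shrunk)
```

Among a group of highly collinear variables, Lasso typically keeps one representative and zeroes the rest, which is consistent with the absence of ceramides and glycerophospholipids from the final model.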

Main Findings
In our investigation, ten distinct ML algorithms were employed to evaluate a comprehensive metabolomic dataset, augmented with anthropometric data, for the prediction of T2DM incidence among cardiovascular risk patients undergoing coronary angiography over a four-year period. Among the models analyzed, the "svmLinear2" model, a Support Vector Machine with a linear kernel, stood out, delivering the highest F1 score (50%) and balanced accuracy (72%). This model demonstrated a 50% sensitivity rate, accurately identifying half of the patients who developed T2DM. Furthermore, it achieved a precision of 50%, indicating that half of the T2DM onset predictions were correct. Remarkably, the model correctly classified 93% of the individuals who did not develop T2DM during the follow-up.
The most important variables among these metabolites were bile acids, ceramides, amino acids, and hexoses. By contrast, anthropometric measures, despite being part of the analysis, did not significantly contribute to the predictive accuracy of the model.

The Role of Metabolites in Predicting New-Onset Diabetes
Remarkably, ML was capable of predicting T2DM incidence within four years, although patients who did develop T2DM and those who did not were not strikingly different in terms of their anthropometric or HbA1c data. A recent study highlighted the challenge of classifying HbA1c levels regarding the risk of developing diabetes, noting that many individuals are already classified as high-risk under current HbA1c thresholds [12]. The authors suggested that raising the threshold to an HbA1c value of 6.0% could improve the positive predictive value (= precision) to 12.4%. In our study, the difference in HbA1c levels between patients who developed T2DM during the follow-up period and those who did not was minimal and not statistically significant (5.85% vs. 5.70%, p = 0.052), remaining in any case below the suggested thresholds. Unlike the low predictive performance observed with HbA1c alone, our findings demonstrate that for a clinically relevant patient group, already at high cardiovascular risk and undergoing coronary angiography, a more comprehensive risk prediction is possible when incorporating a broader dataset, particularly metabolomics data.
Machine learning, in contrast, identified a set of metabolites, including bile acids, ceramides, amino acids, and hexoses, that were indeed significantly different in patients who developed T2DM. Notably, these metabolites were identified by the complex ML models and, in part, by the simpler Lasso Regression. Bile acids play a pivotal role in lipid metabolism. Recent studies have uncovered that they act as signaling molecules, influencing glucose metabolism and insulin sensitivity, and thus have an impact on the development of diabetes [8,13,14]. Similarly, ceramides, molecules composed of a fatty acid linked to a sphingosine backbone, have emerged as potent risk predictors [15,16]. Initially investigated for their association with cardiovascular disease [17], recent studies have illuminated their significant influence on energy metabolism and diabetes [18,19]. However, we noticed that, in contrast to complex ML models, which handle collinearity and imply feature interactions, ceramides as well as glycerophospholipids were excluded from the Lasso Regression model, given their high collinearity. In a previous prospective study involving 2776 patients, Liu et al. [3] demonstrated, using targeted metabolomics and Lasso Regression, that certain metabolic features, including isoleucine, tyrosine, and lactate, which were also among our top 20 features, significantly outperformed glucose in predicting T2DM in the majority of patients. A recent ML-based study identified kynurenine (also among our top 20 features) as one of the most important predictors of T2DM incidence in Chinese patients [8].
Our ML study has now demonstrated for the first time that, alongside hexoses, these metabolite classes can predict the onset of diabetes in non-diabetic cardiovascular risk patients.
That said, it is worth mentioning that a previous ML analysis by Shojaee-Mend et al. identified age, waist-hip ratio, and BMI as the most important features predicting diabetes [20]. BMI, especially at values >40, contributes to a high risk of diabetes [21]. The BMI in our study was much lower, with an IQR of 25-31. Thus, neither BMI nor any further anthropometric measures emerged as top features in our model predictions. This divergence suggests that when metabolomic data are included, these anthropometric features, provided they are not extreme, may play a less significant role, underscoring the potentially superior predictive power of metabolomic markers in forecasting diabetes onset. Conversely, it has recently been demonstrated that molecular markers, including 19 metabolites, improved the prognostic performance of ML models beyond that of classical risk factors [22]. This accentuates the findings of our study: metabolomic features could offer a more nuanced understanding of diabetes progression. It is important to note, however, that while certain metabolomic features are identified or corroborated here as top predictors, interpreting these features and understanding their biological significance will require further investigation. Validating these features and understanding their interplay could thus mark a significant step forward in diabetes prevention and management strategies.

The Role of ML Models in Predicting New-Onset Diabetes
In the present study, we reached the highest prediction performance when applying a Support Vector Machine algorithm to a dataset that had been scaled down by recursive elimination of the dataset's variables (in our setting, from 362 to 77). Compared with Lasso Regression, which represents a more traditional linear regression approach that is robust and easier to interpret but excludes features with high collinearity, machine learning models, particularly Support Vector Machines, effectively manage collinearity and capture feature interactions. This may explain the much higher predictive performance demonstrated in our study. In a recent study revisiting the Pima Indian diabetes dataset, Support Vector Machine algorithms were likewise identified as having the highest accuracy in identifying existing diabetes, though other algorithms, including Multi-Layer Perceptron, Random Forest, decision trees, and Naïve Bayes, also showed accuracies around 70% [23]. A different study applying the same Pima Indian diabetes dataset generated an improved artificial Neural Network model attaining the highest accuracy [24].
These examples underline the complexity of comparing ML models, especially across diverse datasets (age, race, disease, etc.). Apart from that, we must mention that, in contrast to the high balanced accuracy in our study, the F1 score is not very high (50.0% for the top model), but it is still in range with a comparable study reporting F1 scores of 44.9-47.1% [20]. However, that study [20], like the majority of others, reported the prediction of existing diabetes. Our focus, in contrast, was on predicting the onset of newly developed diabetes in non-diabetic individuals over four years, adding a layer of challenge distinct from identifying existing diabetes. Nonetheless, our findings suggest the SVM algorithm's robustness in diabetes prediction, particularly within datasets rich in metabolomics data.
Interestingly, our RFE process revealed that reduced feature sets, in particular the one identified through Random Forest, optimized SVM performance. Yet, not all models showed improved performance with fewer features, echoing previous research indicating Random Forest's limitations in handling omics data with many correlated variables [25]. This suggests that while SVM stands out in our study, future investigations might uncover algorithms better suited to varying datasets. Moreover, we anticipate that expanding the dataset, both in patient numbers and in balancing class ratios, could significantly enhance prediction accuracy and generalizability.

Strengths and Limitations
This study has strengths and limitations. A particular strength is its metabolomics dataset, which initially comprised over 500 metabolites. An additional strength is the very well-characterized cohort of cardiovascular disease patients, who were all enrolled under identical prerequisites at the same tertiary care center. Moreover, we analyzed prospective data regarding the incidence of diabetes. Since the diagnosis of existing diabetes is routinely conducted using glucose marker testing, predicting the risk of new-onset diabetes in originally non-diabetic patients with a high cardiovascular risk is more ambitious and of high clinical relevance.
One limitation of our study is that we selected only Caucasian patients with an elevated cardiovascular risk undergoing coronary angiography. The results are therefore not representative of the general population, nor necessarily applicable to other patient groups or ethnicities, which might impact model performance. In addition, although we validated the training-set-based prediction on test set data randomly drawn from the whole dataset, validation in a different cohort, ideally characterized by comparable metabolomics, would be desirable. Further, we did not use HbA1c or formal prediabetes status for predicting T2DM incidence, as both variables are closely linked to or defined by glucose values, which are already part of the metabolomic profile in terms of hexoses. Furthermore, apart from the metabolomic differences between non-diabetic patients who developed T2DM and those who did not, as investigated in the present study, we suppose that patients who develop T2DM during a shorter follow-up may exhibit an even more distinctive metabolomic profile than those who develop T2DM later. Unfortunately, the incidence rate in our cohort of cardiovascular risk patients is too limited to conduct such an analysis. Finally, we have data on the prescription of drugs but not on adherence to the respective medical treatment, which may also impact the outcome.

Study Subjects and Patient Selection for Metabolomic Analysis
Patients were selected from a coronary angiographically characterized Caucasian cohort, recruited between 2005 and 2009, as previously detailed [26]. They were consecutively enrolled for angiography at our tertiary care hospital (Academic Teaching Hospital Feldkirch) to evaluate established or suspected coronary artery disease (CAD), excluding those with acute coronary syndromes or type 1 diabetes. Basic clinical measurements and laboratory analyses were conducted at the Central Medical Laboratories Feldkirch, as thoroughly described elsewhere [27]. Venous blood samples, taken after a 12 h fast, underwent immediate basic laboratory testing. Serum samples were then aliquoted and frozen at −80 °C to facilitate metabolomic analysis, safeguarding against repeated freeze-thaw cycles.
For the metabolomics assay, we randomly selected 407 baseline patient samples from the above-described total study population, provided they had completed the four-year follow-up visit. We applied a targeted quantitative metabolomic approach to analyze the stored serum samples (BIOCRATES Life Sciences AG, Innsbruck, Austria) using liquid chromatography (LC)-mass spectrometry (MS) and flow injection analysis (FIA)-MS, as described previously [28]. In total, 535 compounds were analyzed. Alongside these metabolomics features, we also included the anthropometric features age, sex, BMI, waist circumference, and waist-to-hip ratio in the dataset. Patients with known T2DM or who were diagnosed with T2DM at baseline were excluded (n = 128) from further analysis. Therefore, metabolomics data for 279 non-diabetic patients were available. The selection of patients as well as the data preprocessing and analysis scheme is depicted in Supplementary Figure S1.

Preprocessing and Recursive Feature Elimination
Data preprocessing was performed using R version 4.2.2 and the Caret package [11]. Initially, we removed 179 metabolites from the dataset due to missing values exceeding 30%. The dataset was then randomly divided (75:25) into training (209 samples) and testing (70 samples) groups using the "createDataPartition()" function. The training set was utilized for model development, while the testing set helped evaluate the model's performance in predicting T2DM incidence. Missing values in the training set were addressed using the "knnImpute" method of the "preProcess()" function. Categorical variables underwent one-hot encoding with "dummyVars()", and numerical variables were normalized via the range method of the same function.
Prior to feeding the preprocessed data to the ML algorithms, we assessed variable importance through recursive feature elimination (RFE). RFE, a wrapper method, iteratively builds models to evaluate and discard the least important features. It thereby allows for the assessment of feature importance in the context of the model and the data, and identifies the features (variables included in the metabolomic dataset) that are most relevant for predicting the target variable (T2DM incidence), resulting in enhanced predictive accuracy of the final model. Utilizing the "rfeControl()" and "rfe()" functions, we explored four distinct algorithm sets within the RFE framework, Random Forest (rfFuncs), Tree Bagging (treebagFuncs), Naive Bayes (nbFuncs), and the Caret default (caretFuncs), to ensure comprehensive and reliable feature selection. Each set underwent 10-fold cross-validation repeated 5 times via the "repeatedcv" method, enhancing the robustness of feature selection. The outcomes of RFE, illustrating the pivotal features, are depicted in Table 2 and Figure 1.
Following RFE, the testing dataset was processed using the same imputation and normalization techniques as the training set to maintain consistency across data preparation stages.
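The backward-elimination idea behind RFE can be illustrated with a minimal sketch. Caret's "rfe()" re-fits a model (e.g., Random Forest) at each step and ranks features by model-based importance; the toy version below uses absolute correlation with the outcome as the importance proxy, and the data are hypothetical:

```python
import statistics

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def rfe(features, y, keep):
    """Minimal recursive feature elimination: repeatedly drop the
    feature with the weakest (absolute) association to the outcome
    until `keep` features remain. A model-based RFE would instead
    re-fit and re-rank at every elimination step."""
    remaining = dict(features)
    while len(remaining) > keep:
        weakest = min(remaining, key=lambda f: abs(pearson(remaining[f], y)))
        del remaining[weakest]
    return sorted(remaining)

# Toy data: x1 tracks the outcome, x2 is noise-like (illustration only)
feats = {"x1": [1, 2, 3, 4, 5], "x2": [2, 1, 2, 1, 2]}
y = [1.1, 2.0, 2.9, 4.2, 5.0]
selected = rfe(feats, y, keep=1)  # the informative feature survives
```

In the study's setting, the analogous process reduced 362 candidate variables to the 77 retained for model building.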

Statistical Analysis
Differences in baseline characteristics were tested for statistical significance with Chi-squared tests for categorical and Jonckheere-Terpstra tests for continuous variables. Correlation analyses were performed by calculating non-parametric Spearman rank correlation coefficients. These analyses were performed with SPSS 28.0.0.0 for Windows (SPSS, Inc., Chicago, IL, USA). Multicollinearity analysis was performed by calculating the Variance Inflation Factor using the "car" package in R [29]. Lasso (Least Absolute Shrinkage and Selection Operator) Regression was performed using the "glmnet" package [30].
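Spearman's rank correlation, as used for the correlation analyses above, is simply the Pearson correlation computed on ranks (with ties receiving their average rank). A minimal self-contained sketch:

```python
def rank(values):
    """Average ranks: tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den if den else 0.0

rho = spearman([1, 2, 3, 4], [1, 3, 2, 4])  # -> 0.8
```

Because it works on ranks, Spearman's rho is robust to the skewed, non-normally distributed metabolite concentrations described in the baseline table.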

Managing Data Imbalance
The initial dataset exhibited a significant imbalance in outcomes, with only 32 subjects developing diabetes compared to 247 who did not. Recognizing that such an imbalance can adversely affect the performance of many machine learning algorithms [31], we addressed this issue by increasing the representation of the minority class. This was achieved through the "up" sampling method, integrated within the "trainControl()" function of the Caret package in R. This approach balanced the dataset by augmenting the number of records for subjects who newly developed diabetes, thereby enhancing the potential for more accurate and equitable model performance.
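The idea behind caret's "up" sampling option can be sketched as plain random oversampling with replacement; the code below is an illustrative sketch, not caret's implementation, applied to the study's 32-vs-247 class split:

```python
import random

def upsample_minority(rows, labels, seed=42):
    """Random oversampling of the minority class: minority-class rows
    are resampled with replacement until all classes have as many
    records as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, lab in zip(rows, labels):
        by_class.setdefault(lab, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for lab, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(lab)
    return out_rows, out_labels

# 32 positives vs 247 negatives, as in the study cohort
rows = [[i] for i in range(279)]
labels = [1] * 32 + [0] * 247
bal_rows, bal_labels = upsample_minority(rows, labels)
# both classes now contain 247 records
```

Note that, as in caret, oversampling is applied only inside the training folds, so duplicated minority records never leak into the held-out test set.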
To enhance model accuracy and minimize fitting errors, we leveraged the "trainControl()" function for meticulous training protocol definition, including cross-validation strategies, evaluation metrics, and data preprocessing methods. Specifically, we adopted a 5-fold cross-validation strategy, activated class probability estimation for binary classification, and implemented up-sampling for the minority class. While default parameters were maintained for general model training aspects, hyperparameter optimization was streamlined through the Caret package's capability to automatically adjust hyperparameters via a predefined grid for each algorithm. Post-optimization, the "expand.grid()" function allowed for the fine-tuning of model parameters on the enriched training dataset, leading to a rigorous evaluation of model performance.
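Grid-based hyperparameter tuning, as performed automatically by caret over a predefined grid, amounts to evaluating every parameter combination and keeping the best. A minimal sketch with a hypothetical scoring function (the parameter name "C" and the scorer are assumptions for illustration):

```python
from itertools import product

def grid_search(train_eval, grid):
    """Minimal hyperparameter grid search: evaluate every combination
    in the grid with the supplied cross-validated scorer and keep the
    best-scoring parameter set."""
    best_params, best_score = None, float("-inf")
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = train_eval(params)  # e.g., mean CV balanced accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scorer that peaks at cost C = 1.0 (illustration only)
scorer = lambda p: -abs(p["C"] - 1.0)
best, score = grid_search(scorer, {"C": [0.1, 1.0, 10.0]})
# -> best == {"C": 1.0}
```

In practice the scorer would run the full cross-validation loop (including up-sampling inside each fold) for every candidate parameter set.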

ML Model Performance Evaluation
The evaluation metrics for the various models drew from true positive (TP), false positive (FP), true negative (TN), and false negative (FN) values. These metrics encompassed sensitivity (also known as recall; TP/(TP + FN)), specificity (TN/(TN + FP)), precision (or positive predictive value; TP/(TP + FP)), accuracy ((TP + TN)/(TP + TN + FP + FN)), balanced accuracy ((sensitivity + specificity)/2), and the F1 score (the harmonic mean of precision and sensitivity; 2 × (precision × sensitivity)/(precision + sensitivity)). These metrics were generated utilizing the "confusionMatrix()" function in the Caret package.
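The formulas above translate directly into code. The confusion-matrix counts below are hypothetical, chosen only so that the resulting metrics roughly match the reported svmLinear2 test-set performance (50% sensitivity and precision, ~93% specificity, 88% accuracy, 72% balanced accuracy, 50% F1); the study's actual test-set counts are not given in the text:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the study's evaluation metrics from confusion-matrix
    counts, following the formulas given in the text."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "balanced_accuracy": balanced_accuracy, "f1": f1}

# Hypothetical counts for a 70-sample test set (illustration only)
m = classification_metrics(tp=4, fp=4, tn=58, fn=4)
```

This example also shows why balanced accuracy and F1 are preferred under class imbalance: with only 8 positives among 70 samples, the plain accuracy (about 89% here) is dominated by the majority class.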
To assess the significance of the various variables, we employed the "varImp()" function within the Caret package. This analysis enabled us to rank variables by their impact, offering insights into the overall variable importance within each model.
Another method for interpreting ML model outputs is the SHAP approach [42]. For this, we utilized both the kernelshap and shapviz packages in R [43]. The "kernelshap()" function, a model-agnostic method for computing SHAP values, employs weighted linear regression to estimate each feature's contribution to individual predictions. SHAP value visualization was achieved through the "shapviz()" function, enhancing the understanding of the model's decision-making processes.
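What "kernelshap()" approximates by weighted regression is the exact Shapley value: a weighted average of a feature's marginal contribution over all coalitions of the other features. For a small number of features this can be computed exactly by enumeration; the "model" below is a hypothetical linear function used only to illustrate the definition:

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, features, baseline):
    """Exact Shapley values by enumerating all feature coalitions.
    `predict` maps a full feature vector to a prediction; features
    absent from a coalition are replaced by their baseline value.
    kernelshap() estimates this weighted average with a regression
    instead of full enumeration."""
    n = len(features)
    phi = [0.0] * n
    idx = list(range(n))
    for i in idx:
        others = [j for j in idx if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                # Shapley coalition weight: |S|!(n-|S|-1)!/n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [features[j] if (j in subset or j == i) else baseline[j]
                          for j in idx]
                without_i = [features[j] if j in subset else baseline[j]
                             for j in idx]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi

# Hypothetical additive model: each Shapley value equals the feature's
# own contribution relative to the baseline.
model = lambda x: 2 * x[0] + 3 * x[1]
phi = shapley_values(model, features=[1.0, 1.0], baseline=[0.0, 0.0])
# -> [2.0, 3.0]
```

Exact enumeration scales as 2^n coalitions, which is why a regression-based approximation is needed for the hundreds of metabolomic features used here.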

Conclusions
In conclusion, metabolomic data outperformed anthropometric data in our cohort, and applying ML to their analysis is a promising tool for identifying individuals at risk of developing T2DM, opening avenues for personalized and early intervention strategies.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms25105331/s1.

Funding: The VIVIT research institute was supported by the Vorarlberger Landesregierung (Bregenz, Austria) and by Peter Prast and the Emotion Foundation (Vaduz, Liechtenstein), which, however, exerted no influence on the present work in any way. Apart from that, the present study did not receive any financial support or grant from funding agencies in the public, commercial, or not-for-profit sectors.

Institutional Review Board Statement:
The present study conforms to the ethical guidelines of the 1975 Declaration of Helsinki and has been approved by the Ethics Committee of Vorarlberg, Austria (EK-2-22013/0008).
Informed Consent Statement: All participants gave written informed consent. The authors affirm that all participants provided informed consent for the publication of anonymized data.

Figure 1. Identifying important variables by recursive feature elimination. Recursive feature elimination helps to identify important and less important variables and to define the optimal size of ML models, as summarized in Table 2. The plot represents the output of an RFE process generating different models (black dots). It depicts the relation between the different feature subset sizes (= number of available variables (1-362)) used for modelling and the resulting performance metric (accuracy = (true positive + true negative)/(true positive + false positive + true negative + false negative)). Using Random Forest (left) and TreeBag (right) algorithms as functions in RFE, the best models (highlighted as blue dots) were calculated to have 77 and 362 variables, respectively. The process involves repeated cross-validation (method = repeatedcv, 10-fold repeated 5 times) to evaluate the performance of feature subsets. The plots were generated by ggplot using the Caret package in R (CRAN, R [11]).

Figure 3. SHAP diagram of feature importance. The beeswarm plot illustrates the most important features (variables) and the contribution of these individual features to the model's output using Shapley Additive Explanation (SHAP) values. Each dot represents a SHAP value for a feature and a specific data point, indicating the magnitude and direction of the feature's impact on the model's prediction relative to the baseline. The y-axis shows the variable names, in order of importance from top to bottom, and the x-axis the SHAP value scale, indicating how large the impact of the respective variable is on the model output (T2DM incidence). The gradient color indicates the original value of that variable. C5:1 represents tiglylcarnitine, C10:1 decenoylcarnitine, C16 hexadecanoylcarnitine, C14:2 tetradecadienylcarnitine, GUDCA glycoursodeoxycholic acid, CA carnitine, PS aa C:34:2 a phosphatidylserine with a diacyl bond, and N-C13:0-Cer(2H) a dihydroceramide.


Author Contributions: A.L. planned the study; C.H.S. and H.D. were responsible for patient recruitment; A.M. (Axel Muendlein), C.H.S., P.F. and H.D. organized basic and metabolomic analyses; A.L. and S.M. wrote the manuscript; A.L., S.M. and A.M. (Arthur Mader) discussed the results; A.M. (Axel Muendlein), S.M., A.M. (Arthur Mader), C.H.S., A.F. and P.F. helped with data interpretation; P.F. and H.D. carefully revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Table 3. Comparison between models and metrics.

Table 4. Overall ranking of model features.
