Appraisal of Clinical Explanatory Variables in Subtyping of Type 2 Diabetes Using Machine Learning Models

Khamis, Amar H.; Abdul, Fatima; Dsouza, Stafny; Sulaiman, Fatima; Khyreim, Costerwell; Siddig, Mohammed E.; Bayoumi, Riad

doi:10.3390/jcm14186548

by Amar H. Khamis ¹,*, Fatima Abdul ², Stafny Dsouza ², Fatima Sulaiman ², Costerwell Khyreim ², Mohammed E. Siddig ³ and Riad Bayoumi ²,*

¹ Hamdan Bin Mohammed College of Dental Medicine, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai P.O. Box 505055, United Arab Emirates

² College of Medicine, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai P.O. Box 505055, United Arab Emirates

³ College of Science, University of Gezira, Wad Madani P.O. Box 20, Gezira, Sudan

* Authors to whom correspondence should be addressed.

J. Clin. Med. 2025, 14(18), 6548; https://doi.org/10.3390/jcm14186548

Submission received: 6 August 2025 / Revised: 15 September 2025 / Accepted: 16 September 2025 / Published: 17 September 2025

(This article belongs to the Section Endocrinology & Metabolism)

Abstract

Background: Clustering type 2 diabetes (T2D) remains a challenge due to its clinical heterogeneity and multifactorial nature. We aimed to evaluate the validity and robustness of the clinical variables in defining T2D subtypes using a discovery-to-prediction design. Methods: Five explanatory clinical aetiology variables (fasting serum insulin, fasting blood glucose, body mass index, age at diagnosis and HbA1c) were assessed for clustering T2D subtypes using two independent patient datasets. Clustering was performed using the IBM-Modeler Auto-Cluster. The resulting cluster validity was tested by multinomial logistic regression. The variables’ validity for direct unsupervised clustering was compared with machine learning (ML) predictive models. Results: Five distinct subtypes were consistently identified: severe insulin-resistant diabetes (SIRD), severe insulin-deficient diabetes (SIDD), mild obesity-related diabetes (MOD), mild age-related diabetes (MARD), and mild early-onset diabetes (MEOD). Using all five variables yielded the highest concordance between clustering methods. Concordance was strongest for SIRD and SIDD, reflecting their distinct clinical signatures in contrast to that in MARD, MOD and MEOD. Conclusions: These findings support the robustness of clinically defined T2D subtypes and demonstrate the value of probabilistic clustering combined with ML for advancing precision diabetes care.

Keywords:

artificial intelligence; type 2 diabetes; subtyping; aetiology; logistic regression

1. Introduction

Type 2 diabetes (T2D), the 8th leading cause of global disease burden affecting more than 500 million individuals, is estimated to become the second leading cause by 2050 [1]. However, it remains difficult to characterize due to its heterogeneity, arising from diverse pathophysiological mechanisms—including varying degrees of insulin resistance, beta-cell dysfunction, and metabolic disturbances. This variability is further complicated by overlapping comorbidities such as obesity, hypertension, and cardiovascular disease. Although these factors make it difficult to clearly distinguish subtypes, precise classification remains essential for enabling personalized treatment strategies and improving the prediction of long-term complications.

Clustering approaches that use clinical and genetic variables have been central to efforts to dissect this heterogeneity. Ahlqvist et al. [2] identified five reproducible clusters of diabetes by k-means based on six clinical variables: age at diagnosis, body mass index (BMI), glycated haemoglobin (HbA1c), homeostatic model assessment for beta cell dysfunction (HOMA2-B), homeostatic model assessment for insulin resistance (HOMA2-IR), and glutamic acid decarboxylase antibody (GADA). These clusters demonstrated differing risks for complications and clinical outcomes and have since been replicated wholly or partially across multiple populations, including cohorts in Europe [2,3,4,5,6,7,8], Asia [9,10,11,12,13,14] and the Middle East [15,16]. We have previously replicated four clusters [17], termed severe insulin-resistant diabetes (SIRD), severe insulin-deficient diabetes (SIDD), mild obesity-related diabetes (MOD), and mild age-related diabetes (MARD), along with a novel South Asian-specific mild early-onset diabetes (MEOD). Genetic clustering [18,19] has added further granularity to T2D stratification. An extension of Udler’s soft clustering framework [19], designed to identify genetically informed clusters that highlight causal pathways of T2D heterogeneity, identified 12 genetic clusters that aligned with clinical stratification across diverse populations and highlighted ancestry-specific differences [20]. Other studies have confirmed the reproducibility of these clusters and have reported that new subgroups emerge when additional biomarkers such as lipids or inflammatory markers are incorporated [21,22].

Despite these advances, important gaps persist. Traditional clustering methods such as k-means often perform poorly in high-dimensional or noisy datasets, and hard assignment of individuals to a single cluster may not reflect the reality of overlapping or shared pathophysiological processes [17]. Only a few studies have identified overlap between the T2D subtypes in populations of European [19,23,24,25] and Arab [17] ancestry. Movement of patients between clusters over time has also been suggested, with outcomes that vary across populations [6,25]. While probabilistic and soft clustering approaches have been proposed, they remain underutilized in clinical research, particularly in integrating both clinical and genetic information. Moreover, most studies to date have emphasized cluster discovery rather than validating the explanatory power of individual clinical variables or comparing the performance of unsupervised clustering with cluster-based supervised predictive models. Population-specific variability further complicates generalizability, with clusters defined in European cohorts not always transferring effectively to other ancestries. These limitations highlight the need for more sophisticated, flexible, and biologically informed models to better capture the complexity of T2D.

In this study, our aim was to evaluate the validity and robustness of five key variables (fasting serum insulin, fasting blood glucose, body mass index, age at diagnosis and HbA1c) in defining T2D subtypes while addressing key methodological gaps. We compared the explanatory power of these variables using direct unsupervised clustering and supervised machine learning approaches [7,12,26,27], which, unlike traditional k-means, accommodate soft membership and capture overlapping disease processes [17,19,23,25]. Extending beyond prior within-cohort studies, we tested reproducibility across two distinct cohorts differing in ethnicity, disease duration, and clinical context. We further evaluated how different combinations of clinical variables influenced clustering stability, and quantified reproducibility using both the Adjusted Rand Index and the Fowlkes–Mallows Index, which provide two complementary measures of stability. By examining patterns of overlap and cluster membership under these conditions, our study offers new insights into the continuum of T2D phenotypes and lays the foundation for subtype-specific strategies in precision medicine.

2. Materials and Methods

2.1. Study Design

The study was carried out in two phases. The first phase focused on examining the use and impact of explanatory variables for clustering of type 2 diabetes (T2D). In the second phase, ML was used to test the difference between direct clustering versus cluster-based classification using predictive models employing the same sets of variables.

2.2. Patients

The study was conducted in selected healthcare facilities of Dubai Health, an integrated academic health system recently established in Dubai, United Arab Emirates (UAE). It included two tertiary healthcare centers (THC), namely Dubai Hospital and the Dubai Diabetes Centre (DDC), together with 26 primary healthcare centers (PHCs) and Mohammed Bin Rashid University (MBRU). Two independent datasets, stratified by disease duration, were generated. For each patient, the clinical and laboratory data were extracted from the SALAMA electronic health record system, implemented across all Dubai Health facilities. Only participants with no missing data were included.

Dataset 1 (training cohort): 348 Emirati patients with long-standing T2D (mean duration 14 years), each with ≥2 complications, 3–4 medications, recruited from tertiary centres between January 2020 and December 2022 [17].

Dataset 2 (validation cohort): 586 multi-ethnic patients with newly diagnosed T2D (mean duration 4 years), no comorbidities or complications, 0–2 medications, enrolled at PHCs between January 2022 and December 2023.

Models were developed in Dataset 1 and externally validated in Dataset 2, which differed by ethnicity, disease duration, complication burden, and care setting. This design ensured that generalizability was tested across heterogeneous patient populations and clinical environments.

2.3. Clustering Techniques

Clustering was performed using the Auto-Cluster procedure in IBM SPSS Modeler (version 18.0; IBM North America, New York, NY, USA), a data science and machine learning (ML) platform incorporating deep learning and artificial neural networks. The Auto-Cluster procedure integrates multiple algorithms, including two-step clustering (data condensation followed by hierarchical clustering), k-means clustering, and the Kohonen self-organizing neural network [17]. Models were generated using the target variable “cluster” and combinations of selected “explanatory variables” and were evaluated against multiple fit indices, with the Silhouette index serving as the primary criterion for cluster separation. Logistic regression was used to validate the clustering methods as well as to demonstrate the overlap between clusters. This approach yielded probabilistic (soft) assignments of each patient to clusters, avoiding forced hard classifications and better capturing the heterogeneity of T2D.
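As an illustration of the Silhouette index used as the primary separation criterion, the following is a minimal sketch in plain Python. It is not the IBM SPSS Modeler implementation; the function name `mean_silhouette`, the Euclidean distance, and the toy one-dimensional points are assumptions chosen only to show how the index rewards compact, well-separated clusters.

```python
from math import dist


def mean_silhouette(points, labels):
    """Average silhouette width over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point itself contributes a distance of 0 to the sum)
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Toy, made-up data: two tight groups that lie far apart.
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
toy_labels = [0, 0, 1, 1]
toy_score = mean_silhouette(toy_points, toy_labels)
```

Values close to 1 indicate that points sit much nearer to their own cluster than to the next nearest one; values near 0 or below indicate overlapping clusters.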

Model performance was further evaluated with multinomial logistic regression, treating cluster membership as the outcome and five predictors (age at diagnosis, body mass index (BMI), fasting blood glucose (FBG), fasting serum insulin (FSI), and glycated haemoglobin (HbA1c)) as independent variables. This quantified the relative contribution of each predictor to distinguishing clusters, providing an additional measure of construct validity for the clustering approach. The SPSS Modeler software (version 18.0; IBM North America, New York, NY, USA), widely applied in prior studies [17,27,28,29,30,31], sequentially tests up to 14 potential predictive models [C5.0 Decision Tree Algorithm (C5), Logistic Regression (LR), Bayesian Network (BN), Linear Discriminant Analysis (D), Linear Support Vector Machine (LSVM), Random Tree (RT), Extreme Gradient Boost Linear (XGBL), Extreme Gradient Boost Tree (XGBT), Chi-Square Automatic Interaction Detection (CHAID), Quick Unbiased Efficient Statistical Tree (QUEST), Classification and Regression Tree (C&R Tree), Neural Network (NN), Decision List (DL), Tree AS (Tree AS)] and automatically selects the top five to six based on area under the curve (AUC), silhouette index, and percent accuracy. This automated procedure offers advantages over manual model selection: it handles large-scale datasets efficiently, standardizes the workflow, reduces subjectivity, and enhances reproducibility across analyses.
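To make the idea of probabilistic (soft) assignment concrete, the sketch below shows how a multinomial logistic model turns linear scores into per-cluster membership probabilities through the softmax function. The coefficients, intercepts, and standardized patient values are entirely hypothetical and are not estimates from this study; the names `softmax` and `cluster_probabilities` are illustrative.

```python
from math import exp


def softmax(scores):
    """Convert a dict of linear scores into probabilities that sum to 1."""
    shift = max(scores.values())  # subtract the maximum for numerical stability
    exps = {name: exp(value - shift) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


def cluster_probabilities(features, coefficients, intercepts):
    """Soft cluster membership from a multinomial logistic model."""
    scores = {
        cluster: intercepts[cluster]
        + sum(weight * features[name] for name, weight in weights.items())
        for cluster, weights in coefficients.items()
    }
    return softmax(scores)


# Hypothetical coefficients on standardized predictors (NOT fitted values).
# MARD acts as the reference category, so its coefficients are all zero.
coefficients = {
    "SIRD": {"FSI": 2.0, "HbA1c": 0.0},
    "SIDD": {"FSI": -1.0, "HbA1c": 2.0},
    "MARD": {"FSI": 0.0, "HbA1c": 0.0},
}
intercepts = {"SIRD": 0.0, "SIDD": 0.0, "MARD": 0.0}

# Hypothetical patient with high fasting insulin and slightly low HbA1c.
patient = {"FSI": 1.5, "HbA1c": -0.5}
probs = cluster_probabilities(patient, coefficients, intercepts)
most_likely = max(probs, key=probs.get)
```

Because every patient receives a probability for each cluster rather than a single label, a patient with two comparably large probabilities can be recognized as lying in an overlap zone.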

The resulting clusters were defined based on their clinical characteristics as described in Bayoumi et al. [17]. Briefly, the cluster with highest HOMA-IR, obesity and no β-cell deficiency was labelled as severe insulin-resistant diabetes (SIRD); the cluster with lowest HOMA-B and highest HbA1c as severe insulin-deficient diabetes (SIDD); the cluster with highest BMI as mild obesity-related diabetes (MOD); the cluster with highest age at diagnosis, moderate insulin resistance and no beta-cell deficiency as mild age-related diabetes (MARD); and the cluster with lower age of diagnosis, normal/overweight BMI, moderate insulin resistance and beta-cell deficiency as mild early-onset diabetes (MEOD).

Validation of Explanatory Variables

To validate the selection of explanatory variables for clustering, we examined four scenarios:

Scenario 1: Included only the three fundamental etiological variables (FBG, FSI, and BMI). Unlike most previous studies, we excluded HOMA-IR and HOMA-B, as both are derived directly from FBG and FSI, thereby introducing redundancy and potential multicollinearity. This decision was supported by statistical evidence. Specifically, we evaluated collinearity using the Variance Inflation Factor (VIF) and examined the correlation structure. The VIF was calculated as

\mathrm{VIF}_{i} = \frac{1}{1 - R_{i}^{2}}

where R_{i}^{2} is the coefficient of determination obtained by regressing the i-th explanatory variable on the remaining explanatory variables.

The VIF analysis demonstrated that HOMA-IR introduces severe multicollinearity when used alongside FBG and FSI, with VIF = 14.1 (Supplementary Method S1). A VIF of 14.1 corresponds to R² = 1 − 1/14.1 ≈ 0.93, so over 90% of the variance in HOMA-IR is explained by these two variables and HOMA-IR provides no independent explanatory value. Including it in clustering models would give excess weight to glycaemic and insulin measures, bias cluster separation, and reduce interpretability.
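The link between the reported VIF and the share of variance explained can be checked with a short calculation. The sketch below simply inverts the VIF formula given above; the function names are illustrative, and only the value 14.1 comes from the text.

```python
def r_squared_from_vif(vif: float) -> float:
    """Invert VIF = 1 / (1 - R^2) to recover R^2."""
    if vif < 1.0:
        raise ValueError("a VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


def vif_from_r_squared(r_squared: float) -> float:
    """Compute the VIF from a coefficient of determination."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


# VIF reported for HOMA-IR when entered alongside FBG and FSI.
r2_homa_ir = r_squared_from_vif(14.1)
print(f"R^2 = {r2_homa_ir:.3f}")  # prints "R^2 = 0.929"
```

By the same formula, the commonly quoted rule-of-thumb threshold of VIF = 10 corresponds to R² = 0.90.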

Scenario 2: Added age at diagnosis (a proxy for disease chronicity) to the three basic variables.

Scenario 3: Added HbA1c (a marker of disease severity) to the three basic variables.

Scenario 4: Included all five variables—FSI, FBG, BMI, age at diagnosis, and HbA1c.
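The four scenarios amount to four nested sets of input columns. A minimal sketch of how they might be encoded is shown below; the column names, the `select_columns` helper, and the example patient values are hypothetical and are not taken from the study data.

```python
# Three basic aetiological variables shared by every scenario.
BASE_VARIABLES = ("FSI", "FBG", "BMI")

# Scenario number -> ordered tuple of explanatory variables.
SCENARIOS = {
    1: BASE_VARIABLES,
    2: BASE_VARIABLES + ("age_at_diagnosis",),
    3: BASE_VARIABLES + ("HbA1c",),
    4: BASE_VARIABLES + ("age_at_diagnosis", "HbA1c"),
}


def select_columns(record: dict, scenario: int) -> list:
    """Return the values of one patient record used in a given scenario."""
    return [record[name] for name in SCENARIOS[scenario]]


# Hypothetical patient record (made-up values, for illustration only).
example_patient = {
    "FSI": 12.0,
    "FBG": 8.1,
    "BMI": 31.4,
    "age_at_diagnosis": 47,
    "HbA1c": 7.9,
}
scenario_2_inputs = select_columns(example_patient, 2)
```

Keeping the scenarios in one mapping makes it straightforward to run the same clustering and prediction steps over each variable set in a loop.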

Analysis was conducted as outlined in Figure 1. Briefly, the ML models were trained on previously reported direct clusters produced with Dataset 1 [17] and applied to Dataset 2 for supervised cluster-based prediction across all four scenarios. For comparison, Dataset 2 was also directly clustered by unsupervised methods under the same four scenarios.
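The train-then-predict flow can be sketched with a deliberately simple stand-in classifier. The example below uses a nearest-centroid rule, which is not one of the SPSS Modeler algorithms used in the study; it only illustrates learning from cluster labels in one dataset and assigning clusters in another. All numbers are made-up toy values on an assumed standardized scale, and the two columns are hypothetical.

```python
def fit_centroids(rows, labels):
    """Training step: compute the mean vector of each labelled cluster."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in grouped.items()
    }


def predict(rows, centroids):
    """Prediction step: assign each row to the cluster with the nearest centroid."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(centroids, key=lambda label: squared_distance(row, centroids[label]))
        for row in rows
    ]


# Toy "training" data: hypothetical standardized (FSI, HbA1c) pairs together
# with cluster labels that came from an earlier unsupervised step.
train_rows = [(2.0, 0.1), (1.8, -0.2), (-0.5, 1.9), (-0.3, 2.2)]
train_labels = ["SIRD", "SIRD", "SIDD", "SIDD"]

# Toy "prediction" data from a second, independent dataset.
new_rows = [(1.5, 0.0), (0.0, 2.0)]

centroids = fit_centroids(train_rows, train_labels)
predicted = predict(new_rows, centroids)
```

The predicted labels for the second dataset can then be cross-tabulated against labels obtained by clustering that dataset directly, which is the comparison described in Section 2.4.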

2.4. Testing Similarity Between Clusters

Two methods were employed to assess the similarity between clusters generated by direct clustering versus cluster-based prediction using ML models. Two sets of clusters were generated for the prediction dataset: one using an unsupervised direct clustering approach and the other using a supervised cluster-based classification approach based on the previously identified unsupervised clusters. To assess the degree of similarity between the resulting cluster assignments, we calculated the Adjusted Rand Index (ARI) and the Fowlkes–Mallows Index (FMI). Both metrics quantify the agreement between two clustering results, with ARI adjusting for chance agreement and FMI evaluating the balance between precision and recall of cluster pair assignments. This analysis focuses on comparing the outputs of the two clustering approaches, rather than evaluating the intrinsic performance of the methods themselves.

2.4.1. The Adjusted Rand Index (ARI)

The ARI is a metric that measures the similarity between two clusterings of the same data. It is commonly applied in statistics to assess the agreement between a predicted clustering and a ground-truth clustering within the same dataset, and is calculated as

\mathrm{ARI} = \frac{\mathrm{RI} - E}{\mathrm{Max\,RI} - E}

where RI is the Rand index value, E is the expected value of the Rand index for random clusters, and Max RI is the maximum achievable value of the Rand index. Stepwise calculations of ARI are shown in Supplementary Method S2.
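A minimal sketch of the ARI in its standard pair-counting form is given below. It is not the authors' R code or the procedure in Supplementary Method S2. In this form the agreeing pair count plays the role of RI, `expected` plays the role of E, and `maximum` plays the role of Max RI; the function name and the toy labels are illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two items are required")

    # Pairs placed together in both clusterings (from the contingency table).
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each clustering separately.
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2          # largest achievable agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in one cluster in both
    return (pairs_both - expected) / (maximum - expected)


# Toy labelings of four items (made-up values).
toy_ari = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
```

The index equals 1 for identical partitions regardless of how the clusters are named, is close to 0 for chance-level agreement, and can be negative when agreement is worse than chance.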

2.4.2. The Fowlkes–Mallows Index (FMI)

The FMI is a statistical measure used to evaluate the similarity between two clusterings. It is widely applied in clustering validation to compare the agreement between a predicted clustering and a ground-truth clustering within the same dataset. The FMI assesses similarity between two clusterings as the geometric mean of pairwise precision and recall:

\mathrm{FMI} = \frac{\mathrm{TP}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})}}

where TP is True Positive (pairs of points that are in the same cluster in both clusterings), FP is False Positive (pairs of points that are in the same cluster in one clustering but in different clusters in the other) and FN is False Negative (pairs of points that are in different clusters in one clustering but in the same cluster in the other).
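The pair counts in the FMI formula can be obtained directly from label counts, as in the minimal sketch below. It is an illustration rather than the authors' R code; the function name and the toy labels are assumptions.

```python
from collections import Counter
from math import comb, sqrt


def fowlkes_mallows_index(labels_a, labels_b):
    """Fowlkes-Mallows Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")

    # TP: pairs that share a cluster in both clusterings.
    tp = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # TP + FP: pairs that share a cluster in the first clustering.
    same_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    # TP + FN: pairs that share a cluster in the second clustering.
    same_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    if same_in_a == 0 or same_in_b == 0:
        return 0.0  # no co-clustered pairs in at least one clustering
    return tp / sqrt(same_in_a * same_in_b)


# Toy labelings of four items (made-up values).
toy_fmi = fowlkes_mallows_index([0, 0, 1, 2], [0, 0, 1, 1])
```

Unlike the ARI, the FMI is bounded between 0 and 1 and is not corrected for chance, which is one reason the two indices are usually reported together.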

The 95% confidence intervals (Supplementary Method S3) for both ARI and FMI were estimated using custom R code that transformed each index to z-space and applied the formula:

z_{95\%\,\mathrm{CI}} = z \pm 1.96\,\mathrm{SE}_{z}

where z is the value of the index after transformation to z-space and SE_z is its standard error in z-space.
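The exact transformation and standard error are described in Supplementary Method S3 and are not reproduced here. The sketch below therefore rests on two assumptions: it uses the Fisher z-transformation (the inverse hyperbolic tangent), which is a common choice for indices bounded by 1, and it takes the standard error in z-space as a user-supplied input, for example from a bootstrap. The function name and the input values are hypothetical.

```python
from math import atanh, tanh


def z_space_interval(index_value, se_z, z_crit=1.96):
    """Confidence interval built in z-space and mapped back to the index scale.

    ASSUMPTION: Fisher z-transformation; se_z is supplied by the caller.
    """
    if not -1.0 < index_value < 1.0:
        raise ValueError("the index must lie strictly between -1 and 1")
    if se_z < 0:
        raise ValueError("the standard error cannot be negative")

    z = atanh(index_value)       # transform the index to z-space
    lower_z = z - z_crit * se_z  # z - 1.96 * SE_z
    upper_z = z + z_crit * se_z  # z + 1.96 * SE_z
    # Back-transform so the limits are on the original index scale.
    return tanh(lower_z), tanh(upper_z)


# Hypothetical inputs: an index of 0.90 with a z-space standard error of 0.10.
lower, upper = z_space_interval(0.90, 0.10)
```

Working in z-space keeps the back-transformed limits inside the admissible range of the index and yields an interval that is asymmetric around the point estimate.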

A contingency table was used to compare the assignment of data points between the two sets of cluster assignments obtained by the two different methods from the same dataset. Furthermore, purity measures were employed to assess the homogeneity of the clusters, specifically to determine whether the data points within a cluster predominantly belong to a single true class.
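A contingency table and the purity measure can be computed in a few lines. The sketch below is illustrative: the function names and toy labels are assumptions, and purity is taken in its usual form, the fraction of items that fall in the majority reference class of their cluster.

```python
from collections import Counter


def contingency_table(cluster_labels, reference_labels):
    """Count the items for each (cluster, reference class) combination."""
    if len(cluster_labels) != len(reference_labels):
        raise ValueError("both labelings must cover the same items")
    return Counter(zip(cluster_labels, reference_labels))


def purity(cluster_labels, reference_labels):
    """Share of items belonging to the majority reference class of their cluster."""
    table = contingency_table(cluster_labels, reference_labels)
    largest_cell = {}
    for (cluster, _reference), count in table.items():
        largest_cell[cluster] = max(largest_cell.get(cluster, 0), count)
    return sum(largest_cell.values()) / len(cluster_labels)


# Toy example: five items, two clusters, two reference classes (made-up labels).
clusters = [0, 0, 1, 1, 1]
reference = ["a", "a", "a", "b", "b"]
table = contingency_table(clusters, reference)
toy_purity = purity(clusters, reference)
```

In the toy example cluster 0 is perfectly homogeneous while cluster 1 contains one item from the other class, giving a purity of 4/5.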

2.5. Statistics

Data were analysed using IBM SPSS for Windows (version 29.0; SPSS Inc., Chicago, IL, USA). Continuous variables (FSI, FBG, BMI, etc.) were described using measures of central tendency and dispersion or medians and interquartile ranges, depending on the data distribution. Categorical variables, such as clusters and overlap status, were presented as frequencies and proportions. The Kolmogorov–Smirnov test was used to assess the normality of continuous variables. The Mann–Whitney U test was used to compare continuous variables between two groups. The chi-square test was used to assess the association between categorical variables (cluster and overlap). R [32] was used to calculate the Adjusted Rand Index (ARI) and Fowlkes–Mallows Index (FMI). A p-value of less than 0.05 was considered significant in all the statistical analyses.

3. Results

The key pathophysiological characteristics of the training and prediction datasets are shown in Table 1. The average primary and secondary clinical variables used for clustering differed significantly between the two datasets, except for BMI.

The direct clustering of the prediction dataset (n = 586 T2D patients) with all five explanatory variables resulted in five subtypes of T2D similar to those reported previously in the training dataset [17], but in differing proportions (Figure 2). For SIRD, the proportion of patients remained low for both methods (6.3% by direct clustering, 10.6% by supervised prediction), consistent with the no-overlap analysis. In SIDD, 16.4% of patients were assigned to the cluster by unsupervised clustering compared with 7.7% by supervised prediction. The supervised method identified fewer patients in MARD (17.7%) when overlaps were allowed. By contrast, supervised prediction captured more MOD patients when overlap was permitted (13.5% vs. 4.3%). MEOD showed relatively similar proportions between methods, with more patients having overlapping features. Although a substantial proportion of patients were assigned to multiple clusters (43.3% by direct clustering and 42.8% by supervised prediction), the proportion of cluster overlap did not differ significantly between the two methods (p = 0.99; Figure 2B).

Supervised cluster-based prediction with ML generated the same five clusters in three scenarios (Table 2) except Scenario 3 (FSI, FBG, BMI, and HbA1c), which resulted in a single mixed cluster of mild forms (Table 3). It is apparent that the more variables used in clustering, the higher the concordance.

The concordance seen in the cross-tabulation of clusters was mirrored in the ARI and FMI similarity indices (Figure 3). The ARI showed high values (Supplementary Table S1) for the SIRD and SIDD clusters (>0.90) and low values for the MARD, MOD and MEOD clusters. While the MARD cluster showed weaker consistency (0.42–0.63), the MEOD cluster demonstrated unstable performance, with values ranging from poor concordance (negative ARI) to moderate (0.79). Similar results were obtained for FMI (Supplementary Table S2). Inclusion of age at diagnosis and HbA1c improved classification stability for most clusters, particularly SIRD and MARD.

4. Discussion

In this study, we examined the rationale for selecting clinical variables and their robustness in identifying T2D subtypes using both unsupervised and supervised machine learning (ML) clustering methods in two ethnically distinct cohorts of patients with T2D. We adopted a discovery-to-prediction design, evaluating the external validity of subtype assignment across cohorts differing in ethnicity, disease duration, and clinical context—a more stringent test than the within-cohort replications typically reported in earlier studies.

A central methodological choice in our study was the use of five key variables: fasting serum insulin (FSI), fasting blood glucose (FBG), body mass index (BMI), HbA1c, and age at diagnosis. Each was selected for its etiological relevance: FSI and FBG directly capture insulin resistance and β-cell dysfunction [33], BMI reflects adiposity-driven metabolic burden [34], HbA1c captures disease severity and chronicity [35], and age at diagnosis distinguishes early-onset from late-onset disease trajectories [36].

We deliberately excluded composite indices such as HOMA-IR and HOMA-B, despite their popularity in prior clustering studies [2,13,19]. Both indices are algebraic transforms of FBG and FSI, introducing redundancy and multicollinearity. Variance Inflation Factor analyses confirmed problematic inflation when HOMA indices were included [37,38]. By retaining the primary measures (FSI and FBG) and excluding HOMA, we improved model parsimony, stability, and biological interpretability. Importantly, once subtypes were established, HOMA indices remained useful descriptors of underlying metabolic differences, consistent with prior observations [39]. Other variables such as duration of diabetes, lipid parameters, and inflammatory markers have also been used in clustering but are either downstream consequences of disease or not directly aetiological and are therefore unsuitable for clustering inputs [13,40,41,42,43].

Using all five explanatory variables yielded the highest concordance between direct clustering and ML-based classification. We also quantified reproducibility using the Adjusted Rand Index (ARI) and the Fowlkes–Mallows Index (FMI). Strong concordance was observed for severe insulin-resistant diabetes (SIRD) and severe insulin-deficient diabetes (SIDD), whereas concordance for mild obesity-related diabetes (MOD), mild age-related diabetes (MARD) and mild early-onset diabetes (MEOD) was weaker and at times approached randomness (with near-zero or negative ARI values). Quantifying concordance is critical, as prior studies have largely inferred it qualitatively.

The severe T2D subtypes—SIDD and SIRD—emerged as the most stable clinical entities, exhibiting distinct clinical features, high concordance across methods, and minimal overlap with other subtypes, particularly in unsupervised clustering [17,25]. Their distinct metabolic signatures likely explain why these subtypes consistently replicate across populations and methods. SIDD is characterised by profound β-cell dysfunction, earlier age of onset, and poor glycaemic control [2,5,6,13,14], while SIRD is characterised by marked insulin resistance and strong associations with obesity, nephropathy, fatty liver, and cardiovascular disease [2,5,6,13,14,44,45].

By contrast, the mild forms of T2D—MARD, MOD, and MEOD—displayed weaker reproducibility, with lower concordance across methods and greater overlap with other subtypes, particularly in unsupervised clustering [17,25]. MARD, defined primarily by older age and modest dysglycaemia, was particularly unstable, with considerable overlap with MOD and MEOD. This instability suggests that MARD may encompass multiple overlapping phenotypes, consistent with evidence that older adults with T2D often exhibit blended features of insulin resistance and age-related β-cell decline [36]. MOD and MEOD also showed low concordance between methods, reflecting heterogeneous clinical presentations. Notably, early-onset diabetes has been associated with more aggressive disease progression and complications [45], underscoring the clinical importance of refining this cluster beyond age alone. These findings highlight that severe forms have strong pathophysiological anchors, whereas mild forms are more diffuse and may represent transitional or overlapping states influenced by environmental, lifestyle, and demographic factors.

The successful replication of severe T2D subtypes underscores the translational potential of clustering for patient stratification at diagnosis. Subtype assignment can guide pharmacotherapy: SIDD patients benefit from early insulin or insulin secretagogues; SIRD patients from GLP-1 receptor agonists or SGLT2 inhibitors; MOD patients from weight-centric therapies, including incretin-based treatments or bariatric surgery; MARD patients from simple regimens such as metformin with low hypoglycaemia risk; and MEOD patients from intensive early lifestyle or incretin-based interventions [46,47,48]. Subtyping can also inform complication surveillance, as SIDD carries heightened microvascular risk while SIRD is linked to nephropathy and cardiovascular complications [2,44,49]. Clinically, patients within overlap zones may face uncertainty in therapeutic allocation, reinforcing the need to integrate molecular, genetic, or longitudinal biomarkers to sharpen cluster boundaries [8,18].

Our study has several strengths. Unlike prior clustering work that validated subtypes within the same cohort [2] or derived mechanistic axes from genetic data [19], we directly tested the portability of clinical subtypes across independent cohorts. By applying supervised ML models to predict cluster membership in a mixed-ethnicity, newly diagnosed cohort, we assessed the real-world feasibility of subtype classification at diagnosis. Furthermore, reproducibility was quantified using ARI and FMI, providing a rigorous measure of similarity across clustering approaches. Using only five key variables (FBG, FSI, HbA1c, BMI, age at diagnosis) allowed clustering to focus on etiological influences, and despite differences in ethnicity, disease duration, complications, and medication use between cohorts, the same five T2D subtypes emerged, with variation only in individual cluster membership. The study also has limitations. Although sex-specific differences in insulin sensitivity and fat distribution are well documented [50,51], sex had no effect in the training dataset [17]; other untested confounders nevertheless remain a limitation. Our training dataset was smaller than the multi-ethnic prediction dataset and restricted to a homogeneous Emirati population, which may lead to underfitting in ML models. Reliance on cross-sectional measures such as FSI and FBG may not capture temporal dynamics in insulin resistance and β-cell function, although exploratory analyses suggest that snapshot HOMA values correlate with longitudinal averages in steady states.

The instability of mild clusters suggests that clinical variables alone may be insufficient for robust patient stratification. Multi-omic integration—incorporating genetics, proteomics, and metabolomics—has shown promise in refining T2D subtypes [8,18,23]. Polygenic partitioning, for example, improves the mechanistic resolution of T2D heterogeneity [19]. Environmental and lifestyle factors, including diet, adiposity, and physical activity, should also be integrated to account for non-genetic influences on phenotype [8,18,52,53]. Longitudinal designs will be essential to distinguish transitional from stable phenotypes [54] and to evaluate whether cluster assignment predicts treatment response and complication trajectories.

5. Conclusions

In summary, our study confirms that severe T2D subtypes (SIRD and SIDD) are reproducible across cohorts and clustering methods, whereas mild forms (MARD, MOD, and MEOD) exhibit weaker stability due to phenotypic overlap. This underscores both the promise and the limitations of clustering based on clinical variables alone. Improving reproducibility—particularly for mild clusters—will require integration of genetic, molecular, and longitudinal data. Ultimately, refining subtype classification has the potential to advance precision endocrinology by guiding therapeutic choices, monitoring complications, and tailoring lifestyle interventions.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/jcm14186548/s1, Method S1: Statistical basis for exclusion of HOMA indices as explanatory variables; Method S2: Calculation of the Adjusted Rand Index (ARI); Method S3: Calculation of Confidence intervals; Table S1: The Adjusted Rand Index (ARI) between T2D clusters generated by direct clustering versus cluster-based classification using machine learning predictive models in prediction dataset, employing various combinations of exploratory variables; Table S2: The Fowlkes-Mallows Index (FMI) between T2D clusters generated by direct clustering versus cluster-based classification using machine learning predictive models in prediction dataset, employing combinations of exploratory variables; Table S3: Comparison of characteristics between direct and predicted clusters using five exploratory variables.

Author Contributions

Conceptualization, R.B.; Methodology, R.B., A.H.K., C.K. and M.E.S.; Formal analysis, A.H.K.; Data curation, A.H.K., F.A., S.D. and F.S.; Writing—Original Draft Preparation, A.H.K.; Writing—review and editing, R.B. and S.D.; Visualization, S.D.; Funding acquisition, R.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the College of Medicine, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, UAE, grant number MBRU-CM-RG2019-06, and Sandooq Al Watan LLC, Abu Dhabi, UAE, grant number SWARD-F22-013 awarded to R.B.

Institutional Review Board Statement

This study was approved by the Dubai Scientific Research Ethics Committee of Dubai Health Authority, UAE [Approval No. DSREC-12/2019-05 issued on 23 January 2020 and approval No. DSREC-022023_11 issued on 23 March 2023]. Clinical data were acquired by a registered healthcare practitioner through structured face-to-face questionnaires and supplemented by data downloaded from the electronic SALAMA Health Information System, in adherence to Dubai Health regulations and guidelines and in conformity with the provisions of the Declaration of Helsinki (as revised in Fortaleza, Brazil, October 2013). Data were fully anonymized in compliance with data redistribution policies.

Informed Consent Statement

Written informed consent was obtained from all study participants prior to data collection. All patient data was anonymised prior to analysis to maintain confidentiality.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

We are thankful to the Mohammed Bin Rashid University of Medicine and Health Sciences (MBRU), Dubai, for their support of our research and to Dubai Health for granting access to patients and their records. The cooperation and contributions of the medical and nursing staff at the primary healthcare centers, Dubai Hospital and Dubai Diabetes Center are highly appreciated. The authors would also like to thank MBRU for the financial support towards the article processing fee.

Conflicts of Interest

The authors declare no conflicts of interest. The sponsors had no role in the design, execution, interpretation, or writing of the study.

Figure 1. Data analysis outline.

Figure 2. Comparison of the distribution of clusters obtained by direct unsupervised clustering and by supervised cluster-based prediction, without cluster overlap (A) and with cluster overlap allowed (B). Clusters with overlap were identified by logistic regression. The bar plots show the percentage of the patient cohort (y axis) in each of the five T2D subtypes (x axis). The clusters are labelled as severe insulin-deficient diabetes (SIDD), severe insulin-resistant diabetes (SIRD), mild age-related diabetes (MARD), mild obesity-related diabetes (MOD) and mild early-onset diabetes (MEOD).

Figure 3. Comparison of clustering performance between direct unsupervised clustering and supervised machine learning-based clustering with different explanatory variables. Performance was assessed for the type 2 diabetes subtypes across four scenarios of input variables (drawn from fasting blood glucose, fasting serum insulin, body mass index, glycated haemoglobin and age at diagnosis) using (A) the Adjusted Rand Index (ARI) and (B) the Fowlkes–Mallows Index (FMI). Bars represent mean indices, with error bars showing 95% confidence intervals. Higher values indicate stronger agreement with reference subtype assignments, reflecting greater clustering reproducibility and accuracy.

Table 1. The pathophysiological characteristics of the training and prediction datasets.

Variables	Training Dataset (N = 348)	Prediction Dataset (N = 586)	p-Value
Age at diagnosis (years)	41.92 (10.65)	46.32 (9.27)	<0.001
Body mass index (kg/m²)	31.27 (5.7)	31.54 (6.04)	0.247
Fasting blood glucose (mg/dL)	146.86 (52.47)	133 (44.16)	<0.001
HbA1c (%)	7.63 (1.72)	7 (1.33)	<0.001
Fasting serum insulin (µIU/nmol)	14.04 (10.67)	18.04 (10.74)	<0.001
Duration of type 2 diabetes (years)	14.42 (8.14)	3.67 (2.9)	<0.001

Data are shown as mean (± standard deviation). p-values were determined using the Mann–Whitney U test; p < 0.05 was considered significant.

Table 2. Crosstabulation between T2D subtypes generated by direct clustering versus cluster-based classification using ML predictive models in the prediction dataset (N = 586) for three scenarios (1, 2 and 4) of explanatory variables.

		Supervised Cluster-Based Classification Using ML Predictive Models
		SIRD	SIDD	MARD	MOD	MEOD	Total
Direct unsupervised clustering	Scenario 1: FSI, FBG, and BMI
	SIRD	46	2	28	5	0	81
	SIDD	0	17	0	1	0	18
	MARD	0	15	8	0	38	61
	MOD	0	1	36	96	6	139
	MEOD	0	0	78	0	209	287
	Total	46	35	150	102	253	586
	Scenario 2: FSI, FBG, BMI and age at diagnosis
	SIRD	56	0	0	2	0	58
	SIDD	3	28	1	1	0	33
	MARD	1	3	185	0	21	210
	MOD	20	0	66	39	0	125
	MEOD	5	1	17	43	94	160
	Total	85	32	269	85	115	586
	Scenario 4: FSI, FBG, BMI, HbA1c and age at diagnosis
	SIRD	62	0	2	0	0	64
	SIDD	4	38	1	1	1	45
	MARD	4	1	104	30	94	233
	MOD	7	1	4	79	0	91
	MEOD	0	2	90	9	52	153
	Total	77	42	201	119	147	586

FSI: fasting serum insulin; FBG: fasting blood glucose; BMI: body mass index; SIRD: severe insulin-resistant diabetes; SIDD: severe insulin-deficient diabetes; MARD: mild age-related diabetes; MOD: mild obesity-related diabetes; MEOD: mild early-onset diabetes. The scenarios are highlighted in Grey.
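
The agreement indices reported in Figure 3 can be illustrated directly from the counts in Table 2. The following is a minimal Python sketch, not the authors' implementation: it assumes the standard pair-counting definitions of the ARI and FMI and treats each scenario block as a contingency table (rows: direct unsupervised clustering; columns: ML-based classification; subtype order SIRD, SIDD, MARD, MOD, MEOD). The function names are hypothetical, and the resulting point estimates need not coincide with the mean values plotted in Figure 3.

```python
from math import comb, sqrt

# Counts copied from Table 2 (rows: direct clustering, columns: ML classification).
# Subtype order in both dimensions: SIRD, SIDD, MARD, MOD, MEOD.
TABLE2 = {
    "Scenario 1": [
        [46, 2, 28, 5, 0],
        [0, 17, 0, 1, 0],
        [0, 15, 8, 0, 38],
        [0, 1, 36, 96, 6],
        [0, 0, 78, 0, 209],
    ],
    "Scenario 2": [
        [56, 0, 0, 2, 0],
        [3, 28, 1, 1, 0],
        [1, 3, 185, 0, 21],
        [20, 0, 66, 39, 0],
        [5, 1, 17, 43, 94],
    ],
    "Scenario 4": [
        [62, 0, 2, 0, 0],
        [4, 38, 1, 1, 1],
        [4, 1, 104, 30, 94],
        [7, 1, 4, 79, 0],
        [0, 2, 90, 9, 52],
    ],
}


def pair_counts(table):
    """Count patient pairs grouped together by both methods, by rows, by columns, and overall."""
    n = sum(sum(row) for row in table)
    both = sum(comb(cell, 2) for row in table for cell in row)
    by_rows = sum(comb(sum(row), 2) for row in table)
    by_cols = sum(comb(sum(col), 2) for col in zip(*table))
    return both, by_rows, by_cols, comb(n, 2)


def adjusted_rand_index(table):
    """ARI: pair agreement corrected for the agreement expected by chance."""
    both, by_rows, by_cols, total = pair_counts(table)
    expected = by_rows * by_cols / total
    maximum = (by_rows + by_cols) / 2
    return (both - expected) / (maximum - expected)


def fowlkes_mallows_index(table):
    """FMI: geometric mean of pairwise precision and recall."""
    both, by_rows, by_cols, _ = pair_counts(table)
    return both / sqrt(by_rows * by_cols)


for name, table in TABLE2.items():
    ari = adjusted_rand_index(table)
    fmi = fowlkes_mallows_index(table)
    print(f"{name}: ARI = {ari:.3f}, FMI = {fmi:.3f}")
```

Working from the crosstabulation rather than from patient-level labels keeps the sketch self-contained; with individual labels available, an established library routine would be the more usual choice.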

Table 3. Crosstabulation between T2D subtypes generated by direct clustering versus cluster-based classification using ML predictive models in the prediction dataset (N = 586) for Scenario 3 (FSI, FBG, BMI and HbA1c).

	Supervised Cluster-Based Classification Using ML Predictive Models
Direct unsupervised clustering		SIRD	SIDD	Mixed	Total
	SIDD	6	51	1	58
	SIRD	136	1	81	218
	Mixed	0	6	304	310
	Total	142	58	386	586

SIRD: severe insulin-resistant diabetes; SIDD: severe insulin-deficient diabetes; MARD: mild age-related diabetes; MOD: mild obesity-related diabetes; MEOD: mild early-onset diabetes.
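
As a worked illustration of how Table 3 can be read, the sketch below computes a simple row-wise percentage agreement: the share of patients in each directly clustered group that the ML models assigned to the same group. This is an assumption-laden illustration rather than a metric reported by the authors. The counts are keyed by label because the row order (SIDD, SIRD, Mixed) differs from the column order (SIRD, SIDD, Mixed), and the function names are hypothetical.

```python
# Counts copied from Table 3. Outer keys: direct unsupervised clustering;
# inner keys: supervised ML-based classification.
TABLE3 = {
    "SIDD": {"SIRD": 6, "SIDD": 51, "Mixed": 1},
    "SIRD": {"SIRD": 136, "SIDD": 1, "Mixed": 81},
    "Mixed": {"SIRD": 0, "SIDD": 6, "Mixed": 304},
}


def per_subtype_agreement(table):
    """Fraction of each directly clustered group given the same label by the ML models."""
    return {label: row[label] / sum(row.values()) for label, row in table.items()}


def overall_agreement(table):
    """Fraction of all patients given the same label by both approaches."""
    matched = sum(row[label] for label, row in table.items())
    total = sum(sum(row.values()) for row in table.values())
    return matched / total


for label, share in per_subtype_agreement(TABLE3).items():
    print(f"{label}: {share:.1%} assigned to the same group")
print(f"Overall: {overall_agreement(TABLE3):.1%}")
```

Raw percentage agreement is not corrected for chance, so it complements rather than replaces chance-adjusted indices such as the ARI.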

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
