Data-Driven Machine-Learning Methods for Diabetes Risk Prediction

Diabetes mellitus is a chronic condition characterized by a disturbance in the metabolism of carbohydrates, fats and proteins. The most characteristic disorder in all forms of diabetes is hyperglycemia, i.e., elevated blood sugar levels. The modern way of life has significantly increased the incidence of diabetes; therefore, early diagnosis of the disease is a necessity. Machine Learning (ML) has gained great popularity among healthcare providers and physicians due to its high potential in developing efficient tools for risk prediction, prognosis, treatment and the management of various conditions. In this study, a supervised learning methodology is described that aims to create highly efficient risk prediction tools for type 2 diabetes occurrence. A feature analysis is conducted to evaluate the features' importance and explore their association with diabetes. These features are the most common symptoms that often develop slowly with diabetes, and they are utilized to train and test several ML models. The ML models are evaluated in terms of the Precision, Recall, F-Measure, Accuracy and AUC metrics and compared under 10-fold cross-validation and data splitting. Both validation methods highlighted Random Forest and K-NN as the best-performing models in comparison to the others.


Introduction
Diabetes mellitus is a common metabolic disease characterized by high blood glucose levels. In diabetes, the body produces little or no insulin, or uses it inefficiently. Increased blood sugar (hyperglycemia) and impaired glucose metabolism occur either as a result of decreased insulin secretion or due to decreased sensitivity of the body cells to the action of this hormone (insulin) [1]. Depending on the insulin disorder, diabetes is classified into the following types:
• Type 1 diabetes or juvenile diabetes: In this type, insulin-producing pancreatic cells are destroyed by an autoimmune mechanism (that is, by antibodies produced by the body itself). It mainly affects young people, insulin is completely absent, and the patient requires insulin therapy from the beginning [2].
• Type 2 diabetes: It is characterized by increased resistance of the body to insulin, with the result that the insulin produced is not sufficient to meet the metabolic needs of the body. Type 2 diabetes is the most common form of diabetes in adults. An important predisposing factor for its development is obesity; other predisposing factors are age and family history. If necessary, anti-diabetic drugs are used, and if this treatment fails, insulin administration is recommended for these patients as well [3].
• Gestational diabetes: A type of diabetes that first appears during pregnancy (excluding women with pre-pregnancy diabetes). This type is similar to type 2 diabetes. Obese women are more likely to develop gestational diabetes. Gestational diabetes is reversible and resolves after childbirth but can cause perinatal complications and maternal and neonatal health problems [4].
The remainder of this paper is organized as follows. Section 2 presents related work and Section 3 describes the materials and methods, while Section 4 discusses the acquired research results. Finally, our conclusions and future directions are outlined in Section 5.

Related Work
Currently, researchers have paid great attention to the development of AI-based tools and methods suitable for chronic conditions monitoring and control. Specifically, ML models have been widely utilized to quantify the risk of a disease occurrence assuming various features or risk factors. In the context of this section, our purpose is to present relevant works concerning diabetes.
First, the authors in [26] proposed a framework for diabetes prediction consisting of different machine-learning classifiers, such as K-Nearest Neighbor, Decision Trees, Random Forest, AdaBoost, Naive Bayes, XGBoost and Multilayer Perceptron neural networks. Their proposed ensemble classifier is the best-performing one, achieving sensitivity, specificity, false omission rate, diagnostic odds ratio and AUC of 0.789, 0.934, 0.092, 66.234 and 0.950, respectively. Moreover, in [27], the authors utilized machine-learning techniques on the Pima Indian diabetes dataset to develop trends and detect patterns with risk factors using the R data manipulation tool. They applied supervised machine-learning algorithms, such as linear kernel Support Vector Machine (SVM-linear), radial basis function SVM, K-Nearest Neighbor, Artificial Neural Network and Multifactor Dimensionality Reduction, in order to classify the patients as diabetic or non-diabetic. The SVM-linear model provided the best accuracy of 0.89 and precision of 0.88, while the K-NN model provided the best recall and F1 score of 0.90 and 0.88, respectively.
In addition, the authors in [28] compared machine-learning-based models, such as Glmnet, Random Forest, XGBoost and LightGBM, to commonly used regression models for the prediction of undiagnosed type 2 diabetes. With six months of data available, a simple regression model performed with the lowest average Root Mean Square Error of 0.838, followed by Random Forest (0.842), LightGBM (0.846), Glmnet (0.859) and XGBoost (0.881). When more data were added, Glmnet improved with the highest rate (+3.4%).
Logistic Regression, K-Nearest Neighbor, Support Vector Machine, Naïve Bayes, Decision Tree and Random Forest were applied in [29]. 10-fold cross-validation was also applied to test the effectiveness of the different models. The experimental results showed that Random Forest achieved an accuracy of 94.10%, outperforming the other models.
Additionally, in [30], Logistic Regression was used to identify the risk factors for diabetes based on the p-value and odds ratio (OR). Naïve Bayes, Decision Tree, AdaBoost and Random Forest were applied to predict diabetic patients. Furthermore, three types of partition protocols (K2, K5 and K10) were considered and repeated in 20 trials. The overall accuracy (ACC) of the ML-based system is 90.62%. The combination of Logistic Regression-based feature selection and a Random Forest-based classifier gives 94.25% ACC and 0.95 AUC for the K10 protocol.
Furthermore, in [31], dataset creation, features selection and classification using different supervised machine-learning models, such as Naïve Bayes, Decision Trees, Random Forests and Logistic Regression, were considered. The ensemble Weighted-Voting-Logistic Regression-Random Forest ML model was proposed to improve the prediction of diabetes, scoring an Area Under the ROC Curve (AUC) of 0.884.
Finally, the published works [32][33][34][35] are based on the dataset of [36]. Specifically, in [32], the authors relied on the Naive Bayes, Logistic Regression and Random Forest algorithms; after applying 10-fold cross-validation and percentage split (80:20) evaluation techniques, Random Forest was found to have the best accuracy for predicting diabetes in both cases. In [33], the authors applied Bayes Network, Naïve Bayes, J48, Random Tree, Random Forest, K-Nearest Neighbor and Support Vector Machine; after applying 10-fold cross-validation, K-Nearest Neighbor achieved the highest accuracy with 98.07%.
In [34], Naive Bayes, Random Forest, Support Vector Machine and Multilayer Perceptron were applied. The results showed that Random Forest provides the highest value of 0.975 for precision, recall and F-measure. The Multilayer Perceptron also performs well, with a precision of 0.96, a recall of 0.963 and an F-measure of 0.964. Last, in [35], the authors relied on an Artificial Neural Network and Random Forest; after applying 10-fold cross-validation, Random Forest outperformed with an accuracy of 97.88%. To sum up, Table 1 summarizes the aforementioned related works.

Materials and Methods
In this section, our analysis will focus on the dataset description, the adopted methodology (i.e., data preprocessing, feature ranking and analysis in terms of the target classes), the risk prediction models and the evaluation metrics.

Dataset Description
Our experimental results were based on the dataset of [36]. No specific processing was performed on this dataset, as there were no missing or extreme values. The number of participants is 520, and the attributes (16 as input to the machine-learning models and 1 for the target class) are analyzed as follows:
• Diabetes: This feature refers to whether the participant has been diagnosed with type 2 diabetes or not. The percentage of participants who suffer from type 2 diabetes is 61.5%.
All the attributes are nominal except for age, which is numerical.

Diabetes Risk Prediction
Machine-learning models, more than ever, constitute an important tool for physicians, clinicians and health carers, as they allow them to automate the risk assessment of a disease occurrence based on several risk factors. Here, the long-term risk of diabetes development is formulated as a classification task with two target classes, c = "Diabetes" (diabetes occurrence) or c = "Non-Diabetes" (non-occurrence of diabetes). The trained ML models will be able to predict the class of an unlabeled instance as either Diabetes or Non-Diabetes based on the input features' values, and thus assess the risk of developing diabetes. The main steps of the adopted methodology include data preprocessing, feature ranking, classification model training and performance evaluation.

Data Preprocessing
For the development of efficient models capable of accurately identifying Diabetes and Non-Diabetes instances, the non-uniform class distribution was tackled by employing SMOTE [52]. SMOTE, based on the 5 nearest neighbors of each minority-class sample, was used to create synthetic data by oversampling the minority class (i.e., Non-Diabetes) by 60%, such that the instances in the two classes are equally distributed (i.e., 50%-50%). This technique avoids overfitting, as it creates new synthetic data that are similar to, but not duplicates or replicates of, the existing minority-class data. The synthetic instances are then added to the original dataset.
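To make the oversampling idea concrete, the following is a minimal numpy sketch of SMOTE-style interpolation (not the exact Weka implementation the paper uses): each synthetic point is placed on the segment between a minority-class sample and one of its k nearest minority-class neighbours. The toy data and function name are illustrative.

```python
import random
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority-class samples by interpolating
    between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(X_min))
        # distances from sample i to every other minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]          # skip the sample itself
        j = neighbours[rng.randrange(len(neighbours))]
        gap = rng.random()                            # position along the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# toy minority class: 8 points; generating 8 synthetic ones balances 16/16
X_minority = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
                       [2, 0], [0, 2], [2, 1], [1, 2]], dtype=float)
X_syn = smote_oversample(X_minority, n_new=8)
print(X_syn.shape)  # (8, 2)
```

Because each synthetic point is an interpolation, it always lies inside the region spanned by the original minority samples, which is why the new instances are similar to, but not copies of, existing ones.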

Features Importance
Four ranking methods were applied to evaluate the contribution of each feature to the target class. Their results are summarized in Table 2. The first method, namely the Pearson correlation coefficient [53], is used to infer the strength and direction of the association between the features and the target class and varies between −1 and 1. More specifically, we observe that a strong correlation of 0.7046 is captured between diabetes and the symptom of polyuria. Furthermore, moderate correlations of 0.6969, 0.5017 and 0.4922 are noted between diabetes and polydipsia, sudden weight loss and gender, respectively. The same holds for the partial paresis feature, with a correlation of 0.4757 with diabetes. A weaker association with diabetes is shown for the features of polyphagia, irritability, alopecia, visual blurring and weakness, while the correlation is practically absent for the remaining features, where it is lower than 0.2.
The Gain Ratio (GR) method [54] was also employed, which is calculated as GR(c, x) = (H(c) − H(c|x))/H(x), where H(x) = −∑ p_x log2(p_x) (with p_x denoting the probability of each value of feature x), H(c) = −∑ p_c log2(p_c) (with p_c being the probability of selecting an instance in class c) and H(c|x) are the entropy of feature x, the entropy of class c and the conditional entropy of class c given feature x, respectively. The gain ratio is used to determine the relevance of a feature, choosing the ones that achieve the maximal gain ratio considering the probability of each feature value. The gain ratio, also known as the Uncertainty Coefficient, normalizes the information gain (H(c) − H(c|x)) of a feature against the feature's own entropy.
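The entropies above can be computed directly from label counts; the following sketch evaluates the gain ratio of a nominal feature on toy data (the symptom values are hypothetical, for illustration only):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum p log2(p) over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, class_labels):
    """(H(c) - H(c|x)) / H(x): information gain normalized by feature entropy."""
    n = len(class_labels)
    h_c = entropy(class_labels)
    # conditional entropy H(c|x): class entropy within each feature value, weighted
    h_c_given_x = 0.0
    for v in set(feature_values):
        subset = [c for f, c in zip(feature_values, class_labels) if f == v]
        h_c_given_x += (len(subset) / n) * entropy(subset)
    h_x = entropy(feature_values)
    return (h_c - h_c_given_x) / h_x if h_x > 0 else 0.0

# a feature perfectly aligned with the class has gain ratio 1
polyuria = ["yes", "yes", "no", "no"]
diabetes = ["pos", "pos", "neg", "neg"]
print(gain_ratio(polyuria, diabetes))  # 1.0
```

A feature that is independent of the class (e.g., values ["a", "b", "a", "b"] against the same labels) yields a gain ratio of 0, matching the intuition that it carries no class information.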
Furthermore, the Naive Bayes and Random Forest classifiers were selected to measure the importance of the features. Random Forest creates a forest of trees, and per tree measures a candidate feature's ability to optimally split the instances into two classes using the Gini impurity [55]. Naive Bayes calculates the probability of each feature p(x|c) in order to evaluate their performance at predicting the output variable.
We observe that Naive Bayes and Pearson correlation coefficients assigned the same order of importance in all features except for the age and genital thrush, which are presented in reverse order. Although these methods compute the importance differently, they result in the same ordering outcomes. The same order may relate to the fact that (i) Naive Bayes supposes features independence, as their correlation may harm its performance and (ii) the correlation coefficient measures the strength of each feature's relationship with the target class [56].
The features of polydipsia and polyuria are unanimously ranked first, while the features of muscle stiffness, obesity, delayed healing and itching are ranked last by all methods. For the remaining features, we observe similarities in the ranking order between the different methods. In conclusion, since all features are among the most common symptoms used for diabetes screening by physicians (alongside the blood test for verification), the models' training and validation will be based on all of them.
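A Pearson-based ranking such as the one in Table 2 can be sketched in a few lines of numpy; the feature names and toy 0/1 data below are illustrative, not the paper's dataset:

```python
import numpy as np

def pearson_rank(X, y, names):
    """Rank 0/1-encoded (or numeric) features by |Pearson correlation| with the target."""
    scores = {}
    for j, name in enumerate(names):
        # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal entry
        scores[name] = float(np.corrcoef(X[:, j], y)[0, 1])
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# toy data: 0/1-encoded symptoms against a 0/1 diabetes label
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
X = np.column_stack([
    [1, 1, 1, 0, 0, 0, 1, 0],   # tracks the class perfectly
    [1, 0, 1, 0, 1, 0, 1, 0],   # weakly related
])
ranking = pearson_rank(X, y, ["polyuria", "itching"])
print(ranking[0][0])  # 'polyuria' ranks first
```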

Features Exploration
In this section, we aim to present the diabetes prevalence in terms of the involved features. The selected features are among the signs of diabetic patients. The mean age of participants is 47.7 years, and its standard deviation is 12.2.
In Figure 1, we show the participants' distribution from both the age group and the gender perspective. We see that most of the involved women are diabetic (27%), while 22% of the participants are men with diabetes. In Figure 2, the participants' distribution is shown in terms of the features that capture the signs of polyuria and polydipsia. A total of 38% and 35% of participants who suffer from diabetes exhibit these symptoms, respectively. Furthermore, small percentages of 3.28% and 1.6%, respectively, mentioned these signs although they were not diabetic. In Figure 3, we demonstrate the participants' distribution in terms of the features that represent sudden weight loss and weakness. A total of 29% and 34% of participants were diagnosed with diabetes and noted the manifestation of these symptoms, respectively. Furthermore, a percentage of 5.47% and a higher portion of 21.41%, respectively, referred to these signs although they were not diabetic. Figure 4 illustrates the participants' distribution in terms of the features that denote polyphagia and obesity. A total of 29.53% and 9.53% of participants are diabetic and declared an increase in appetite and obesity, respectively. In addition, a moderate percentage of 12.50% and a small portion of 6.56% mentioned excessive hunger and obesity, respectively, although they are not diabetic. In the following, Figure 5 depicts the irritability and alopecia signs in terms of the involved classes. We see that irritability and alopecia coexist with diabetes in 17.19% and 12.19% of the participants, respectively. However, an important portion of 25.63% noted the occurrence of alopecia although they were not diabetic. Moreover, Figure 6 presents the occurrence of the genital thrush and itching signs in terms of the two classes. We see that these features coexist with diabetes in 12.97% and 24.06% of the participants, respectively.
However, an important portion of 24.84% noted the occurrence of itching, while 7.19% had genital thrush, although they were not diabetic. Furthermore, Figure 7 focuses on two other diabetes-related symptoms, specifically partial paresis and muscle stiffness. It is observed that 30% and 21% of the participants are diabetic and manifested these signs, respectively. Finally, Figure 8 shows the prevalence of diabetes in terms of the features that capture the occurrence of delayed healing and visual blurring. A total of 50% of those who have been diagnosed with diabetes (or 25% of the total participants) exhibit visual blurring, which is attributed to the quick change of blood sugar levels from normal to high. Similar outcomes hold for the coexistence of diabetes and delayed wound healing, which relates to problems with immune system activation.

Machine-Learning Models
This subsection provides a brief description of the ML classification models we relied on for the topic under consideration. Specifically, Naive Bayes, Bayesian Network, Support Vector Machine, Logistic Regression, Artificial Neural Network, K-Nearest Neighbors, J48, Logistic Model Tree, Random Forest, Random Tree, Reduced Error Pruning Tree, Rotation Forest, AdaBoostM1 and Stochastic Gradient Descent were selected in order to evaluate their prediction performance. Here, we note that each instance i in the dataset is represented by a feature vector x_i = (x_i1, x_i2, x_i3, . . . , x_in)^T, where n is the number of features.

Naive Bayes
Naive Bayes (NB) [57] classifies an instance x_i to the class c for which P(c|x_i1, . . . , x_in) is maximized, under the assumption that the features are conditionally independent given the class. The conditional probability is defined as P(c|x_i1, . . . , x_in) = P(x_i1, . . . , x_in|c)P(c)/P(x_i1, . . . , x_in), where P(x_i1, . . . , x_in|c) = ∏_{j=1}^{n} P(x_ij|c) is the probability of the features given the class and P(x_i1, . . . , x_in), P(c) are the prior probabilities of the features and the class, respectively. The estimated class is derived by maximizing P(c) ∏_{j=1}^{n} P(x_ij|c), where c ∈ {Diabetes, Non-Diabetes}.
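For binary symptom features, this decision rule reduces to a Bernoulli Naive Bayes; the following is a minimal from-scratch sketch (toy data, Laplace smoothing added for stability), not the Weka implementation used in the experiments:

```python
from math import log

def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate P(c) and P(x_j = 1 | c) with Laplace smoothing for 0/1 features."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == c]
        prior = len(rows) / len(X)
        # smoothed per-feature probability of a 1, given class c
        probs = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                 for j in range(len(X[0]))]
        model[c] = (prior, probs)
    return model

def predict(model, x):
    """Pick argmax_c  log P(c) + sum_j log P(x_j | c)."""
    def score(c):
        prior, probs = model[c]
        return log(prior) + sum(log(p if v else 1 - p) for v, p in zip(x, probs))
    return max(model, key=score)

# toy 0/1 features (e.g., polyuria, itching) against the two classes
X = [[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]]
y = ["Diabetes", "Diabetes", "Diabetes",
     "Non-Diabetes", "Non-Diabetes", "Non-Diabetes"]
nb = train_bernoulli_nb(X, y)
print(predict(nb, [1, 0]))  # Diabetes
```

Working in log-space avoids underflow from multiplying many small probabilities, while changing nothing about which class maximizes the product.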

Bayesian Network
Bayesian networks (BayesNet) [58] are a widely-used class of probabilistic graphical models. They consist of two parts: a structure and parameters. The structure is a directed acyclic graph (DAG) over a set of features U that expresses conditional independencies and dependencies among random variables associated with nodes. The parameters consist of conditional probability distributions associated with each node. A Bayesian network classifier calculates arg max_c P(c|x) using pa(x) (the set of parents of x ∈ U) and the distribution P(U) represented by the Bayesian network, based on P(c|x) = P(U)/P(x) ∝ P(U) = ∏_{x∈U} p(x|pa(x)).

Support Vector Machine
Support Vector Machine (SVM) [59] is used for classification as well as regression problems; primarily, however, it is used for classification problems in ML. The goal of the SVM algorithm is to find the best decision boundary that segregates the n-dimensional feature space into classes, so that new data points can easily be placed in the correct category in the future. This best decision boundary is called a hyperplane, and SVM finds the hyperplane that optimally separates the instances into two classes. The most characteristic kernel functions are the linear, polynomial, radial basis and quadratic ones. An instance x can be classified based on the function f(x) = sign(∑_{i=1}^{M} α_i c_i K(x_i, x) + b), where M is the number of training instances, x_i, c_i are a training instance's feature vector and its class label, respectively, α_i are the learned coefficients, b is a bias, c_i ∈ {1, −1} and K(x_i, x) is the kernel function, which maps the input vectors into an expanded feature space.
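The kernelized decision function can be sketched directly; in the toy example below, the support vectors, coefficients α_i and bias b are made up for illustration (in practice they come out of the SVM training procedure):

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Radial basis kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """f(x) = sign(sum_i alpha_i * c_i * K(x_i, x) + b), with c_i in {1, -1}."""
    s = sum(a * c * kernel(sv, x)
            for a, c, sv in zip(alphas, labels, support_vectors))
    return 1 if s + b >= 0 else -1

# hypothetical support vectors for a 2-feature problem
svs = np.array([[0.0, 0.0], [2.0, 2.0]])
labels = [-1, 1]          # -1 = Non-Diabetes, +1 = Diabetes
alphas = [1.0, 1.0]
print(svm_decision(np.array([1.9, 1.9]), svs, labels, alphas, b=0.0))  # 1
```

Points near the (2, 2) support vector get a large positive kernel contribution and are classified as +1, while points near (0, 0) are pulled to −1.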

Logistic Regression
Logistic Regression (LR) [60] is one of the most popular ML algorithms and comes under the supervised learning technique. It is used for predicting a categorical dependent variable using a given set of independent features, and it predicts the class output, which can be either Yes or No (0 or 1). Let p be the probability that an instance belongs to the Diabetes class; thus, 1 − p is the probability of the instance belonging to the Non-Diabetes class. The relationship of the log-odds with base b and model parameters β_j is written as log_b(p/(1 − p)) = β_0 + β_1 x_1 + · · · + β_n x_n.
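Inverting the log-odds relation (with the natural base) gives the familiar sigmoid; the sketch below uses made-up coefficients for two 0/1 symptoms purely to illustrate the mapping from the linear predictor to a class probability:

```python
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def diabetes_probability(x, betas):
    """p = sigmoid(beta_0 + sum_j beta_j * x_j); the log-odds log(p / (1 - p))
    are linear in the features."""
    z = betas[0] + sum(b * v for b, v in zip(betas[1:], x))
    return sigmoid(z)

# hypothetical coefficients: intercept, polyuria, polydipsia
betas = [-2.0, 2.5, 1.5]
p = diabetes_probability([1, 1], betas)
print(round(p, 3))  # 0.881
# the log-odds recover the linear predictor z = -2.0 + 2.5 + 1.5 = 2.0
assert abs(log(p / (1 - p)) - 2.0) < 1e-9
```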

Artificial Neural Network
A fully connected multi-layer neural network is called a Multilayer Perceptron (MLP) [61]. It consists of three types of layers: the input layer, the hidden layer(s) and the output layer.
MLPs are designed to approximate any continuous function and can solve problems that are not linearly separable. Furthermore, they can use arbitrary activation functions.

K-Nearest Neighbors
The K-nearest neighbors algorithm (KNN) [62] is a non-parametric, supervised learning classifier that uses proximity to make classifications or predictions about the grouping of an individual data point.
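A minimal sketch of the KNN decision rule (Euclidean proximity, majority vote over the k closest training points; the toy coordinates are illustrative):

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

def knn_predict(X_train, y_train, x, k=3):
    """Return the majority class among the k training points nearest to x."""
    neighbours = sorted(zip(X_train, y_train), key=lambda p: dist(p[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# two well-separated toy clusters
X_train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y_train = ["Non-Diabetes"] * 3 + ["Diabetes"] * 3
print(knn_predict(X_train, y_train, (5.2, 5.1)))  # Diabetes
```

Because KNN stores the training set and defers all computation to prediction time, it is non-parametric: no model parameters are fitted in advance.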

J48
J48 [63] is a machine-learning decision tree classification algorithm that handles both categorical and continuous attributes. It deals with the problems of numeric attributes, missing values, pruning, error rate estimation, the complexity of decision tree induction and generating rules from trees.

Logistic Model Tree
A Logistic Model Tree (LMT) [64] consists of a standard decision tree structure with logistic regression functions f(x_i) = β_0 + ∑_{j=1}^{n} β_j x_ij at the leaves. LMT produces a single tree containing binary splits on numeric attributes, multiway splits on nominal ones and logistic regression models at the leaves, and the algorithm ensures that only relevant attributes are included in the latter.

Random Forest
Random Forest (RF) [65] is a popular ML algorithm that belongs to the supervised learning technique. It is used in classification and regression problems. It builds decision trees on different samples and takes their majority vote for classification and average in case of regression.

Reduced Error Pruning Tree
Reduced Error Pruning Tree (RepTree) [66] is a fast decision tree learner that builds a decision/regression tree using information gain as the splitting criterion and prunes it using a reduced error pruning algorithm.

Random Trees
Random Tree (RT) [67] is an ensemble of multiple decision trees. The Random Trees node is built on the Classification and Regression Tree methodology. It splits the training records (through recursive partitioning) into segments with similar values of the output feature. The node initially examines the available input features in order to find the best split by evaluating an impurity index. All splits are binary.

Rotation Forest
Rotation Forest (RotF) [68] is a method for generating classifier ensembles based on feature extraction. In order to create the training data for a base classifier, the feature set is randomly split into subsets, and principal component analysis (PCA) is applied to each subset.

AdaBoostM1
Let G_m(x_i), for m = 1, 2, . . . , M, be a sequence of weak classifiers. The objective is to build G(x) = sign(∑_{m=1}^{M} α_m G_m(x)). The final prediction is a combination of the predictions from all classifiers through a weighted majority vote. At the first step, m = 1, the weights are initialized uniformly, w_i = 1/N. The coefficients α_m are computed by the boosting algorithm and weight the contribution of each respective G_m, giving higher influence to the more accurate classifiers in the sequence. At each boosting step, the data are modified by applying weights w_1, w_2, . . . , w_N to each training observation. At step m, the observations that were misclassified in the previous step have their weights increased [69].

Stochastic Gradient Descent
Stochastic gradient descent (SGD) [70] is an efficient approach to fitting linear classifiers and regressors under convex loss functions, such as linear SVM and LR. The SGD has been successfully applied to large-scale and sparse machine learning problems.

Stacking
Stacking is a common approach utilized to acquire more accurate predictions than those of single models. It uses the predicted class labels of the base models as input features to train a meta-classifier that undertakes to find the final class label [71].
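The data flow of stacking is simple to sketch: base-model predictions become the meta-model's feature vector. The lambdas below are stand-ins for trained classifiers (in this paper's setup the base models are RF and KNN with an LR meta-model); they are not real trained models:

```python
def stacking_predict(base_models, meta_model, x):
    """Feed the base models' predicted labels to the meta-classifier."""
    meta_features = [model(x) for model in base_models]
    return meta_model(meta_features)

# stand-in base models: threshold rules in place of trained RF / KNN
rf_like = lambda x: 1 if x[0] > 0.5 else 0
knn_like = lambda x: 1 if x[1] > 0.5 else 0
# stand-in meta-model: predicts 1 if any base model does (in place of trained LR)
meta_lr_like = lambda preds: 1 if sum(preds) >= 1 else 0

print(stacking_predict([rf_like, knn_like], meta_lr_like, [0.9, 0.2]))  # 1
```

In practice, the meta-classifier is trained on out-of-fold base-model predictions so that it learns how to weight and combine them, rather than being hand-coded as here.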

Evaluation Metrics
In this research work, various metrics, such as accuracy, precision, recall, F-Measure and AUC [72], are examined in order to evaluate the performance of the machine-learning models. Each metric helps us to identify the strengths and weaknesses of the models. The desired metrics are calculated with the help of the confusion matrix, which consists of the elements true positive (TP), true negative (TN), false positive (FP) and false negative (FN). The performance metrics are defined as Precision = TP/(TP + FP), Recall = TP/(TP + FN), F-Measure = 2 × Precision × Recall/(Precision + Recall) and Accuracy = (TP + TN)/(TP + TN + FP + FN). Precision indicates how many of those who are labeled as diabetic actually belong to this class. Recall shows how many of those who are diabetic are correctly predicted. F-Measure is the harmonic mean of precision and recall and captures the predictive performance of a model. Accuracy illustrates the proportion of the total number of predictions that were correct.
To evaluate the distinguishability of a model, the Area Under the Curve (AUC) is exploited. It is a metric that varies in [0, 1]. The closer to one, the better the ML model's performance in distinguishing diabetes from non-diabetes instances. If the AUC equals one, the ML model can perfectly separate the instance distributions of the two classes. In the special case where all non-diabetes (diabetes) instances are classified as diabetes (non-diabetes), the AUC equals 0.
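The confusion-matrix metrics above can be computed in a few lines; the counts below are an arbitrary toy example, not results from the paper:

```python
def metrics(tp, tn, fp, fn):
    """Precision, Recall, F-Measure and Accuracy from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy

# toy confusion matrix: 90 TP, 80 TN, 10 FP, 20 FN
p, r, f, acc = metrics(tp=90, tn=80, fp=10, fn=20)
print(round(p, 2), round(r, 3), round(acc, 2))  # 0.9 0.818 0.85
```

Note how precision and recall answer different questions: with these counts, 90% of positive predictions are correct, but only about 82% of the actual diabetics are caught.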

Experiments Setup
The machine-learning models' performance is evaluated in the Waikato Environment for Knowledge Analysis (Weka) [73]. It was developed at the University of Waikato, New Zealand, and is free software. Furthermore, it provides a library of various models for data preprocessing, classification, clustering, forecasting, visualization, etc. The computing system on which the experiments were conducted has the following characteristics: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz, 16 GB RAM, Windows 11 Home, 64-bit operating system and x64-based processor. For our experiments, 10-fold cross-validation and percentage split (80:20) were applied to measure the models' efficiency on the balanced dataset of 640 instances. In Table 3, the parameters' settings of the considered models are shown.
In Table 4, we illustrate the performance of the models under consideration after applying SMOTE with 10-fold cross-validation. From the experimental results, we can see that the KNN and RF models present the best prediction accuracy of 98.59% compared to the other considered models. Furthermore, the RotF and RF models have an AUC of 99.9%. It should be noted that, with SMOTE and 10-fold cross-validation, all our models have an accuracy greater than 88.75% (BayesNet) and an AUC greater than 94.2% (SGD).

The Stacking scheme used RF and KNN as the base models and LR as the meta-model. Moreover, in Table 5, we summarize related works based on the dataset of [36] after applying 10-fold cross-validation on the same features we relied on, but without SMOTE. As shown in Table 5, our proposed models, after SMOTE and 10-fold cross-validation, showed better performance in terms of accuracy compared to the related works.
In addition, in Table 6, we depict the performance of the ML models in terms of accuracy, recall, precision, F-measure and AUC after applying SMOTE and percentage split (80:20). In this case too, KNN and RF achieved the best performance in relation to the rest of the models, with an accuracy of 99.22%. Furthermore, the RF model and the Stacking method achieved an AUC of 100%. Our proposed models have excellent AUC rates, greater than 93.7% (SGD), and accuracy greater than 88.28% (BayesNet).
Furthermore, in Table 7, we outline the accuracy of our proposed models, namely NB, LR, J48 and RF, after applying SMOTE and percentage split (80:20). The same table shows the results of the work in [32] after applying a percentage split (80:20) on the same features we relied on, but without SMOTE. We observe that our proposed models showed better accuracy, albeit with a small percentage gap of 0.22-1.97%. Finally, we note a limitation of this research work. This study was based on a publicly available dataset. The dataset we relied on does not come from a hospital unit or institute, which could have given us richer information, such as biochemical measurements that record a detailed health profile of the participants. Acquiring access to such data is time-consuming and difficult for privacy reasons.

Conclusions
The habits and lifestyle of the modern world are among the causes of the growing incidence of diabetes. Medical professionals now have the opportunity, with the contribution of machine-learning techniques, to assess the relative risk and provide appropriate guidelines and interventions for the management, treatment or prevention of diabetes.
In this research article, we applied several machine-learning models in order to identify individuals at risk of diabetes based on specific risk factors. Data exploration through risk factor analysis could help to identify associations between the features and diabetes. Performance analysis showed that data pre-processing is a major step in the design of efficient and accurate models for diabetes occurrence.
Specifically, after applying SMOTE with 10-fold cross-validation, Random Forest and KNN outperformed the other models with an accuracy of 98.59%. Similarly, after applying SMOTE with a percentage split (80:20), Random Forest and KNN outperformed the other models with an accuracy of 99.22%. In both cases, after applying SMOTE, our proposed models were superior, in terms of accuracy, to the related published research works that are based on the dataset of [36] and the same features we relied on.
In future work, we aim to extend the machine-learning framework through the use of deep-learning methods, applying a Long Short-Term Memory (LSTM) algorithm and Convolutional Neural Networks (CNNs) to the same dataset and comparing the results in terms of accuracy with relevant published works.
Author Contributions: E.D. and M.T. conceived the idea, designed and performed the experiments, analyzed the results, drafted the initial manuscript and revised the final manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.