Comparative Machine-Learning Approach: A Follow-Up Study on Type 2 Diabetes Predictions by Cross-Validation Methods

(1) Background: Diabetes is a common chronic disease and a leading cause of death. Early diagnosis gives patients with diabetes the opportunity to improve their dietary habits and lifestyle and manage the disease successfully. Several studies have explored the use of machine learning (ML) techniques to predict and diagnose this disease. In this study, we conducted experiments to predict diabetes in Pima Indian females with particular ML classifiers. (2) Method: A Pima Indian diabetes dataset (PIDD) with 768 female patients was considered for this study. Different data mining operations were performed to conduct a comparative analysis of four different ML classifiers: Naïve Bayes (NB), J48, Logistic Regression (LR), and Random Forest (RF). These models were analyzed with different cross-validation values (K = 5, 10, 15, and 20), and the performance measures of accuracy, precision, F-score, recall, and AUC were calculated for each model. (3) Results: LR was found to have the highest accuracy (0.77) for all 'k' values. When k = 5, the accuracies of J48, NB, and RF were 0.71, 0.76, and 0.75, respectively. For k = 10, the accuracies of J48, NB, and RF were 0.73, 0.76, and 0.74, while for k = 15 and 20, the accuracy of NB was 0.76. The accuracy of J48 and RF was 0.76 when k = 15, and 0.75 when k = 20. Other parameters, such as precision, F-score, recall, and AUC, were also considered in the evaluations to rank the algorithms. (4) Conclusion: The present study on the PIDD sought to identify an optimized ML model using cross-validation methods. The AUC was 0.83 for LR, 0.82 for RF, and 0.81 for NB. These three were ranked as the best models for predicting whether a patient is diabetic or not.


Introduction
Diabetes is a common chronic disease that occurs when the pancreas does not produce enough insulin (Type 1 diabetes) or when the patient's body does not effectively utilize the insulin it produces (Type 2 diabetes). Hyperglycemia, or raised blood sugar, is a common consequence of uncontrolled diabetes. Over time, diabetes can cause severe damage to nerves and blood vessels [1]. Advanced diabetes is complicated by coronary illness, visual impairment, and kidney failure [1,2]. Early detection of the disease can give patients the opportunity to make the necessary lifestyle changes and can therefore improve their life expectancy [3].
Machine learning (ML) is an application of artificial intelligence (AI) that enables computers to self-learn and perform statistical analysis without human interaction [4]. ML algorithms and models are extensively used and have been found reliable for a variety of applications. Researchers have been adopting ML in medicine, especially for diagnosis, disease prediction [5], drug discovery, and clinical trials [6].
The machine learning process starts with structured or unstructured data from different sources. The next step is data preparation or preprocessing, which involves data selection through a data mining method in which original or raw data are converted into an understandable format [7]. Once the data are ready, the model is tested on different trained datasets to calculate accuracy or perform statistical algorithms; this is known as model validation [8]. Model optimization or improvement is done by hyperparameter tuning for final validation to perform prediction and classification (Figure 1).
In healthcare systems, large amounts of patient data and medical knowledge are stored in databases, and new tools and technologies for data analysis and classification are needed to exploit this information. Currently, ML algorithms are used for the automatic analysis of high-dimensional medical data. Dementia forecasting [9], cancer tumor identification [10], diabetes prediction [11], and radiotherapy [12] are some examples of ML in medicine.
According to World Health Organization (WHO) reports, there are 425 million people in the world with diabetes [13]. Extensive studies on the diagnosis and early prediction of diabetes have shown that the risk factors associated with Type 2 diabetes include family history, hypertension, an unhealthy diet, lack of physical activity, and being overweight. Females have a higher tendency to become diabetic (especially during pregnancy), due to low insulin absorption, high cholesterol levels, or a rise in blood pressure [13,14]. Studies have shown that cost-effective and efficient techniques for diagnosing diabetes could be developed by employing computing skills and data mining algorithms.
Several studies have conducted prediction analyses using data-mining algorithms to diagnose diabetes. For example, in [15], researchers utilized support vector machines (SVM) for the diagnosis of diabetes mellitus and achieved a prediction accuracy of about 94%. Another work used J48 decision trees, RF, and neural networks, and found that RF provides the highest accuracy (80.4%) in diabetic patient classification [16]. Another paper proposed a model to forecast the likelihood of diabetes; this study concluded that Naïve Bayes (NB) had 76.3% accuracy, higher than J48 and SVM [17]. An accuracy analysis was conducted on different ways of preprocessing the data, and parameter modification was performed to improve model precision [18]. These results revealed that deep neural networks (DNN) with cross-validation (K = 10) achieved 77.86% accuracy in diabetes identification.
In this study, we developed a classification model for Type 2 diabetes in Pima Indian females. Four ML classification algorithms were used to detect diabetes in female patients: J48 decision trees, NB, RF, and Logistic Regression (LR). Cross-validation (CV) techniques were employed to train the different ML models on varying test datasets. The ranking of each algorithm was decided based on the performance parameters of accuracy, precision, recall, and F-score.

Methods and Materials
A Pima Indian diabetes dataset (PIDD) with 768 female patients was considered. This dataset, owned by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), contains records labeled tested positive (class variable: 1) and tested negative (class variable: 0), together with eight risk factors (Table 1). Data investigation was undertaken using WEKA 3.8 [19], an open-source tool that can perform various data-mining operations. First, the PIDD was subjected to data preprocessing steps to handle the unbalanced dataset (Figure 2).


Data Sampling
Two data sampling techniques were used to convert the imbalanced dataset into a balanced one: oversampling (of the minority-class instances) and under-sampling (of the majority-class instances). Different forms of the PIDD, with statistical values for each attribute, are presented in Table 2.
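The two sampling strategies above can be sketched in a few lines. This is a minimal illustration, not the study's actual WEKA procedure; the function names and the use of simple random resampling are assumptions. The class sizes (268 positive, 500 negative) are taken from Table 1.

```python
import random

def oversample(minority, target_size, seed=0):
    """Randomly duplicate minority-class rows until the class reaches target_size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def undersample(majority, target_size, seed=0):
    """Randomly keep only target_size rows from the majority class."""
    rng = random.Random(seed)
    return rng.sample(majority, target_size)

# PIDD class sizes: 268 tested-positive (minority), 500 tested-negative (majority)
positive = [{"label": 1}] * 268
negative = [{"label": 0}] * 500

balanced_up = oversample(positive, 500) + negative      # 500 + 500 rows
balanced_down = positive + undersample(negative, 268)   # 268 + 268 rows
print(len(balanced_up), len(balanced_down))  # 1000 536
```

Either route yields equal class counts; which one is preferable depends on how much data loss (under-sampling) or duplication (oversampling) the experiment can tolerate.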

Cross-Validation
Cross-validation (CV) is a model training method that can assess prediction accuracy [20]. The biggest challenge in ML is validating the model with trained data. To ensure the adopted model produces noise-free model patterns [21], data scientists use CV techniques. Compared with other methods, CV offers the easiest way to estimate low-bias models, and is therefore one of the most popular techniques for ML algorithms.
In this study, four ML classifiers were employed to conduct different cross-validations, using the k-fold CV technique for model validation. The original data were randomly separated into 'k' folds (k_1, k_2, ..., k_i); in each iteration, one fold served as the test data, the remaining 'k−1' folds were combined to form the training data, and model testing was thus performed 'k' times. For example, in the first iteration, if subset k_1 served as the test data, then the remaining subsets (k_2, ..., k_i) were combined to conduct model training, and this process was repeated for the rest of the 'k' values. Many studies report that, to avoid issues associated with imbalanced datasets, the optimal value for 'k' is 5 or 10. With higher 'k' values, the difference between the trained and sampled datasets tends to become small. In the present study, model validation was conducted with k = 5, 10, 15, and 20.
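The k-fold procedure described above can be sketched as follows. This is an illustrative outline only (the study itself used WEKA's built-in cross-validation); the function names are assumptions.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and split them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, k):
    """Yield (train, test) index lists: each fold serves as the test set once,
    and the remaining k-1 folds are merged into the training set."""
    folds = k_fold_indices(n, k)
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# 768 PIDD rows, k = 5 (as in the study; k = 10, 15, and 20 work the same way)
for train, test in cross_validate(768, 5):
    assert len(train) + len(test) == 768
```

Each of the k iterations produces one accuracy estimate; averaging them gives the reported cross-validated performance.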

Naïve Bayes (NB)
Naïve Bayes (NB) is a probability-based ML method that can be used as a classification technique. Based on feature extraction, NB produces the probability of each target group in classification [22]. This algorithm quickly and easily predicts the test data and produces better performance values in multi-class predictions. Compared with numerical inputs, NB predicts categorical input values more accurately. Bayes' theorem is represented in Equation (1) below:

P(c|X) = P(X|c) P(c) / P(X) (1)

i.e., the probability that class 'c' occurs, given that 'X' has occurred.
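Equation (1) can be checked numerically. In the sketch below, the prior P(c) is taken from the PIDD class sizes (268 of 768 tested positive), but the likelihood and evidence values are purely illustrative assumptions, not quantities from the study.

```python
def posterior(p_x_given_c, p_c, p_x):
    """Bayes' theorem, Equation (1): P(c|X) = P(X|c) * P(c) / P(X)."""
    return p_x_given_c * p_c / p_x

# Illustrative numbers: prior P(diabetic) from the PIDD class sizes (268/768);
# the likelihood and evidence below are assumed values for demonstration.
p_c = 268 / 768          # prior probability of the 'tested positive' class
p_x_given_c = 0.60       # assumed P(feature pattern X | diabetic)
p_x = 0.40               # assumed overall P(feature pattern X)
print(round(posterior(p_x_given_c, p_c, p_x), 3))  # 0.523
```

A full NB classifier multiplies such per-feature likelihoods under the "naïve" independence assumption before normalizing.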


Logistic Regression (LR)
LR is a classification algorithm used to allocate observations to a discrete set of classes. It can be classified into binary, multinomial, and ordinal types. LR does not assume a linear relationship between non-continuous attributes, but allows the prediction of discrete variables [23]. It is very easy to implement and quite efficient for training the model.
Logistic regression is mathematically written by passing a multiple linear regression function through the logistic (sigmoid) function, Equation (2):

P(y = 1 | X) = 1 / (1 + e^(−W)) (2)

The following example represents a simple logistic binary function. As discussed, the two target diabetic groups (tested positive, '1', or tested negative, '0') were tested with the hypothesis

W = AX + B (4)

If 'W' approaches positive infinity, the prediction becomes positive, and if 'W' approaches negative infinity, the prediction becomes negative (Figure 3).
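The sigmoid decision rule above can be sketched directly. This is a minimal illustration of Equations (2) and (4), with arbitrary assumed coefficients A and B; it is not the trained model from the study.

```python
import math

def sigmoid(w):
    """Logistic function: maps the linear score W to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-w))

def predict(a, x, b, threshold=0.5):
    """Hypothesis W = AX + B (Equation (4)); class 1 if sigmoid(W) >= threshold."""
    w = a * x + b
    return 1 if sigmoid(w) >= threshold else 0

# As W grows large and positive the output approaches 1 ('tested positive');
# as W grows large and negative it approaches 0 ('tested negative').
print(round(sigmoid(10), 4), round(sigmoid(-10), 4))  # 1.0 0.0
```

Training amounts to choosing A and B so that the predicted probabilities match the observed labels, typically by maximizing the log-likelihood.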

Random Forest (RF)
When feature selection methods are used, the RF algorithm quickly learns to produce high classification accuracy on large databases, because of the tree-based systems used. Generally, these trees are positioned to improve the purity of the tree nodes, measured by the Gini impurity [24]. In RF, feature extraction is conducted from the test data. Thereafter, the test features are validated by the randomly generated decision trees (Figure 4). In the PIDD example, if the model generated 50 random trees, the trees could predict two different outcomes for the same test group. If 30 trees predicted 'tested positive' and 20 trees predicted 'tested negative', then the RF algorithm would return 'tested positive' as the predicted target.
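The majority-vote step in the paper's 50-tree example can be sketched as follows; the function name is an assumption, and only the voting stage (not tree construction) is shown.

```python
from collections import Counter

def forest_predict(tree_votes):
    """Return the majority class across the forest's individual tree predictions."""
    return Counter(tree_votes).most_common(1)[0][0]

# The paper's example: 50 random trees, 30 voting 'tested positive' (1)
# and 20 voting 'tested negative' (0).
votes = [1] * 30 + [0] * 20
print(forest_predict(votes))  # 1 -> 'tested positive'
```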


J48 (Decision Tree Algorithm)
The J48 decision tree algorithm calculates the feature behavior of different test groups. With J48, it is easy to understand the explanatory distribution of instances. This can help in identifying missing attributes, and therefore works as a precision tool in cases where overfitting occurs [25]. The major challenge associated with decision trees is the identification of the root-node attribute. This attribute selection can be done by two methods: information gain, Equation (5), and the Gini index, Equation (6).
Information gain is written as

Gain(X, A) = Entropy(X) − Σ_{v ∈ values(A)} (|X_v| / |X|) Entropy(X_v) (5)

Here, X is the set of instances, A is an attribute, X_v is the subset of X for which A = v, and values(A) is the set of all possible values of A. The Gini index (GI) is a parameter that measures how often a randomly selected instance would be incorrectly classified:

GI = 1 − Σ_i p_i² (6)
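Equations (5) and (6) can be sketched and sanity-checked on a toy label set. This is an illustrative implementation of the standard formulas, not WEKA's internal code; the function names are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(X) = -sum(p_i * log2(p_i)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Gain(X, A) = Entropy(X) - sum(|X_v|/|X| * Entropy(X_v)), Equation (5),
    where `groups` are the subsets X_v produced by splitting on attribute A."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

def gini_index(labels):
    """GI = 1 - sum(p_i^2), Equation (6)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# Toy binary labels: a perfectly pure split yields the maximal gain of 1 bit.
parent = [1, 1, 0, 0]
print(information_gain(parent, [[1, 1], [0, 0]]))  # 1.0
print(gini_index(parent))                          # 0.5
```

J48 chooses, at each node, the attribute whose split maximizes the gain; pure child nodes drive both the entropy and the Gini index to zero.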

Performance Measures
Model performance was decided on the basis of accuracy, precision, recall, and F-score. The performance measures, with their formulations and definitions, are provided in Table 3.

F-Measure
Used to measure the accuracy of the experiment: F = 2 × (P × R) / (P + R), where P is precision and R is recall.
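The Table 3 formulas can be sketched from confusion-matrix counts. The counts used below are purely illustrative assumptions, not the study's actual confusion matrix.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F-score from confusion-matrix counts,
    following the formulations in Table 3 (TP, TN, FP, FN)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Illustrative counts only (not from the study's experiments).
acc, p, r, f = confusion_metrics(tp=60, tn=80, fp=20, fn=40)
print(round(acc, 2), round(p, 2), round(r, 2), round(f, 2))  # 0.7 0.75 0.6 0.67
```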

Results
Due to concerns about model overfitting, the over-sampled and under-sampled PIDD datasets were excluded from the experiments.

Pruned Decision Tree
The J48 model classifier was applied to the remaining dataset (after removal of missing instances) to generate a pruned decision tree. The output pruned decision tree, with the plasma value as the central node, is represented in Figure 5. It is evident that plasma glucose concentration has the highest information gain, and could therefore be considered the highest risk factor for diabetes. Other risk factors, such as multiple pregnancies, release of high levels of insulin, and the pedigree function, also increased the chances of having diabetes. Generally, pregnant women who do not take much physical exercise have higher chances of gaining weight, which in turn increases the likelihood of having Type 2 diabetes.
These results clearly show that the four classifiers had similar prediction accuracy, with small differences and margins of error. However, LR was the most accurate and J48 was the least accurate. Ultimately, LR, NB, and RF were deemed to be the three best models for predicting whether a patient is diabetic or not. Furthermore, for K = 5, 10, and 20, the NB values for accuracy, precision, recall, and F-score were higher than those of RF. However, for K = 15, the RF precision and F-score were higher than those of NB. Accuracy is not the only parameter that can be used in assessing model optimization. The main limitation of using accuracy as the key performance metric is that it does not work well on datasets with class imbalance. The PIDD (Table 1) contains 500 women who tested negative for diabetes and 268 women who tested positive, so the imbalance ratio is 1.87. Hence, along with accuracy, it is also important to consider the AUC values (Figure 6). The AUC values of NB (Figure 6.1) and LR (Figure 6.2) were 0.81 and 0.83, respectively, and for RF (Figure 6.3) it was 0.82. However, J48 produced a lower AUC value (0.72) than the others (Figure 6.4). When each classifier is ranked according to these performance values, the ordering of the optimized models is LR > RF > NB > J48.
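The imbalance argument above can be made concrete with the Table 1 counts: a trivial classifier that always predicts the majority class already achieves about 0.65 accuracy on the PIDD, which is why AUC is reported alongside accuracy. The short check below only restates this arithmetic.

```python
# PIDD class counts from Table 1
negative, positive = 500, 268
total = negative + positive

imbalance_ratio = negative / positive
majority_baseline = negative / total  # accuracy of always predicting 'tested negative'

print(round(imbalance_ratio, 2))    # 1.87
print(round(majority_baseline, 2))  # 0.65
```

Against this 0.65 baseline, the reported accuracies of 0.71 to 0.77 are a real but modest improvement, and the AUC values separate the models more clearly.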

Conclusions
Diabetes is one of the most critical chronic diseases today, and early diagnosis can greatly improve a patient's chances of managing it well. The latest developments in machine intelligence can be exploited to improve our understanding of the factors causing the onset of this disease. We developed four binary classifier models, NB, J48, LR, and RF, and each model was analyzed using different CV methods (subject to different 'k' values). Performance assessment was conducted with the parameters of accuracy, precision, recall, F-score, and AUC. Preliminary outcomes suggested that all the models investigated achieved good results, with the LR model showing the greatest accuracy (0.77) and J48 showing relatively low accuracy compared to the others. A ranking conducted by considering not only accuracy but also the other parameters indicated that LR, NB, and RF are the three best models for predicting whether a patient is diabetic or not.
The main limitation of this study is that only conventional ML classifiers were considered. Since the results provide an improvement on existing methods for predicting diabetes, it would be worthwhile in future studies to explore these models alongside unsupervised machine learning and deep learning techniques as well.


Figure 1. The primary mechanisms of machine learning.


Figure 6. Area under the curve (AUC) of the four different classifiers.

Author Contributions:
We certify that the manuscript is not under review by any other journal. All authors have read and validated the final copy of this manuscript. GB*: designed and performed the experiments, analyzed the methods and results, and wrote the manuscript. CN & GS: contributed to the literature review. SKT & FA: conclusion and final manuscript revision. Funding: This research received no external funding.


Table 1. Statistical report of the Pima Indian diabetes dataset (PIDD).


Table 2. Statistics of the original and different trained sets (SD: standard deviation; BMI: Body Mass Index).

Table 3. Definitions and formulations of the accuracy measures (TP: true positive; TN: true negative; FP: false positive; FN: false negative).