Comparing Different Machine Learning Techniques in Predicting Diabetes on Early Stage †

Abstract: Diabetes mellitus is a steadily spreading disease that is estimated to cause a significant number of deaths worldwide. It is diagnosed from the level of glucose, a sugar molecule, in the blood. A variety of methods have been used to predict the likelihood of this disease. To forecast diabetes at an early stage, adequate and clear data on diabetic individuals are needed. In this study, 520 records from a hospital in Bangladesh, each with 16 attributes, were used to make predictions. The dataset is publicly available from the UCI repository. After feature selection, we applied the Random Forest, AdaBoost, KNN, and Bagging algorithms. The Random Forest method performed best, with a test accuracy of 97.03% and a 10-fold cross-validation score of 95.03%.


Introduction
Diabetes is one of the fastest-growing illnesses today. The World Health Organization estimates that 422 million people worldwide have diabetes. It also states that non-communicable diseases account for almost 41 million preventable deaths annually, or nearly 71% of all fatalities worldwide. By 2030, non-communicable illnesses will be responsible for 52 million annual deaths if the problem is not addressed. The most common non-communicable diseases are diabetes and hypertension, which account for around 46.2% and 4% of all mortality, respectively [1]. These conditions are frequently caused by an excess of blood glucose, a sugar molecule derived from carbohydrates. Food is broken down into its smallest molecules and nutrients, such as glucose, which cells then take up, with the help of the hormone insulin produced by the pancreas, to create energy. When the body does not produce enough insulin, cells cannot absorb glucose [2].
It is difficult to diagnose this disease in its early stages because it is mostly a lifestyle-related condition. Because it is usually advanced by the time it is found, it can then only be treated with medication, and some patients also require insulin injections to regulate their blood sugar levels. Long-term uncontrolled blood sugar levels can cause serious organ damage, including diabetic retinopathy, which impairs vision; diabetic neuropathy, which harms the nerves; diabetic foot; and harm to the heart, pancreas, kidneys, and many other important organs [3]. A balanced diet and lifestyle can help a person manage blood sugar levels. Since excessive blood sugar can gravely affect the body, those who have been diagnosed with diabetes must maintain a healthy lifestyle in addition to taking medicine to control their blood sugar. Regular health checks for unexpected changes in blood sugar levels are the best way to manage a chronic condition like diabetes. Even with all these precautions, diabetes can be challenging to identify in its early stages, and it can be difficult to predict when it will first appear [4].
Making a medical diagnosis is a labor-intensive and often difficult process. Machine learning (ML) in healthcare systems is not used only for diagnosis; it can also be used to forecast drug effects, manage medical data, support doctors, and aid decision-making, among other things. Healthcare systems built on machine learning can help clinicians obtain results very quickly. Healthcare professionals and information technology experts are now working together on ML-based healthcare systems to accelerate processing and provide better results [5]. The term "machine learning" refers to a variety of statistical methods that let computers learn from experience without being explicitly programmed. Applications of ML are profoundly changing the healthcare industry.
This study presents a general framework for predicting diabetes using the UCI dataset; four different machine learning algorithms are implemented and compared to determine which one achieves the highest accuracy.
The paper is organized as follows: background information on diabetes is presented in Section 1, and previous ML models that have been used to predict diabetes are presented in Section 2. The proposed method, the dataset description, and the preprocessing used in this study are described in Section 3. Section 4 presents the experimental results and comparisons with pertinent literature. Section 5 presents the conclusion.

Related Work
Malini M et al. [6] use the 768-record Pima Indians Diabetes Dataset and a range of machine learning algorithms to predict diabetes. The suggested technique for classification and ensemble learning uses SVM, KNN, Random Forest, decision tree, logistic regression, and gradient boosting classifiers. The highest classification accuracy, 78%, was achieved using logistic regression.
The technique proposed by M Asiful Huda et al. [7] outperforms earlier findings in terms of recall and precision. The suggested method applies classification algorithms to a few features from a dataset on diabetic retinopathy, such as optical disc diameter, lesion-specific features (microaneurysms, exudates), or the presence of hemorrhages. The features are collected and used in the final decision-making procedure to establish the presence or absence of diabetic retinopathy. The decision is made using the support vector machine, logistic regression, and decision tree algorithms. At 88%, the model's accuracy is substantially higher than in past experiments.
Smitha S Prem and Umesh A.C [8] suggest a computer-aided classification system for exudates that can help ophthalmologists forecast diabetic retinopathy (DR). The classification is based on wavelet decomposition coefficients and local binary patterns (LBP), which provide the frequency information and texture information in an image, respectively. The effectiveness of the classifiers is assessed using a variety of supervised classification techniques. On the DIARETDB1 dataset of 89 images, the KNN classifier achieved an accuracy of 94% with LBP features, while ANN achieved 100% accuracy with wavelet features.
Salliah Shafi Bhat and Gufran Ahmed Ansari [9] use machine learning to detect diabetes and recommend a healthy diet for diabetic patients through a diet recommendation system (DRS). Several machine learning approaches, namely the probabilistic naive Bayes (NB), the function-based multilayer perceptron (MLP), and the decision tree-based Random Forest (RF), are used to develop the model for the diagnosis of diabetes. Random Forest, the classifier with the best accuracy, achieves 93%.
Usama Ahmed et al. [10] present a fused machine learning model for predicting diabetes. The conceptual framework is based on support vector machine (SVM) and artificial neural network (ANN) models. These models analyze the dataset to determine whether a diabetes diagnosis is positive or negative. Their outputs serve as the input membership functions of a fuzzy model, which ultimately determines whether or not a diabetes diagnosis is made. With a prediction accuracy of 94.87%, the suggested fused ML model exceeds previously published methods.
Sarra Samet [11] employed six supervised machine learning classification approaches to diagnose diabetes early and constructed a hybrid model from the three best-performing ones. The research uses the Pima Indians Diabetes Database, which is accessible through UCI's machine learning repository. All models are evaluated on a variety of metrics. With a 90.62% accuracy rate, the hybrid model stands out against other cutting-edge methods.
Minhaz Uddin Emon [12] used feature extraction to identify features for predicting diabetic retinopathy. The data for this investigation were taken from the UCI machine learning repository. The dataset was explored using several machine learning (ML) methods, which were assessed in terms of sensitivity, selectivity, true positives (TP), false negatives (FN), and receiver operating characteristic (ROC) curves. The techniques used include naive Bayes, sequential minimal optimization (SMO), logistic regression, stochastic gradient descent (SGD), the Bagging classifier, the J48 classifier, the decision tree classifier, and the random forest classifier. The best-performing model overall is logistic regression.
S. Jyotheeswar and K.V. Kanimozhi [13] presented a study that compared an innovative decision tree (DT) with a support vector machine (SVM) for detecting diabetic retinopathy (DR). The decision tree (N = 10) and support vector machine (N = 10) algorithms were applied to more than 50,000 digitized retinal images from the Kaggle fundus image dataset. The support vector machine achieved an accuracy of only 85.2%, whereas the innovative decision tree achieved 92.8%; the difference between DT and SVM is statistically significant (p = 0.03). The innovative decision tree method therefore outperforms the support vector machine for detecting diabetic retinopathy.
M. Paliwal and P. Saraswat [14] apply supervised machine learning techniques to real data from 520 diabetic and potentially diabetic patients aged sixteen to ninety. The techniques include the naive Bayes classifier, LightGBM, and the support vector machine (SVM). In the comparison of classification and recognition accuracy, the support vector machine achieves the highest accuracy.

Methodology
Figure 1 shows the layout of the proposed system. The proposed method uses an early-stage diabetes risk prediction dataset of patient records. Random Forest, AdaBoost, KNN, and Bagging are applied to the dataset to identify the most effective technique.

A. Data Collection
Islam et al. [15] created the early-stage diabetes risk prediction dataset (UCI Machine Learning Repository, 2020). The information was gathered from patient files at the Sylhet Diabetes Hospital in Sylhet, Bangladesh. The dataset contains 520 instances and 16 attributes: one continuous attribute and fifteen categorical attributes. Table 1 gives the dataset description, and Table 2 describes the attributes.

Table 1. Dataset description.

Dataset                  Number of Attributes   Number of Instances
Diabetic Patients Data   16                     520
Table 2. Description of attributes [16].

S_No.   Features   Features Name   Values   Missing   Non-Numeric

Applied Models
Machine learning algorithms have shown great success in diabetes prediction. This study explains how a machine learning system can be used to identify diabetic patients. The system makes use of the 16 attributes of the dataset in the UCI Machine Learning Repository, which is openly accessible. Four different algorithms, Random Forest, AdaBoost, KNN, and Bagging, are examined as the core of our research. Each is described below.
• Random Forest: The Random Forest algorithm, first proposed by Breiman [18,19], consists of a number of independent tree-structured classifiers, each of which makes a classification prediction. The output is the class that receives the most votes across the classifiers. Accuracy generally improves and then levels off as trees are added to the forest, and combining many trees reduces overfitting problems [20]. This simple machine learning technique typically yields excellent results without hyperparameter tuning. With enough trees in the forest, the classifier is unlikely to overfit, a severe problem that can otherwise sabotage results. The Random Forest classifier can also cope with missing data and works well with categorical values [21].
• KNN: The term K Nearest Neighbors (KNN) refers to a sample's K closest neighbors. The idea behind the approach is that, to discover the category of an unknown instance, one can look at the categories of the K known instances that are closest to it [22]. The category that makes up the greatest share of the K instances is taken to be the category of the unknown instance. The choice of K has a significant impact on classification. To determine the classification, the distance between each known instance and the sample point must be computed. The sample is then assigned to the class that is most common among its closest neighbors [24].
• AdaBoost: The AdaBoost (adaptive boosting) technique was created by Yoav Freund and Robert Schapire in 1995 to build a strong classifier out of a collection of weak classifiers, and it belongs to the Boosting family of algorithms [25]. During training, this kind of learner focuses more on incorrectly classified samples, modifies the sample distribution, and repeats the process until a predetermined number of weak classifiers have been trained, at which point learning is complete [26]. By maintaining a set of weights over the training data and adaptively adjusting them after each weak learning cycle, the AdaBoost algorithm produces a series of weak learners. The weights of the training samples misclassified by the current weak learner are increased, while the weights of the samples it classifies correctly are decreased [27].
• Bagging: Bagging is another ensemble technique in which a group of weak learners is combined to produce a strong learner that performs better than any single one. It is a meta-estimator that fits base classifiers on random subsets of the original dataset and then aggregates their individual predictions to provide a final prediction [20].
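The voting idea shared by Bagging and Random Forest can be illustrated with a minimal sketch. It is not the implementation used in this study: the one-feature decision stump, the helper names `fit_stump`, `fit_bagging`, and `predict_bagging`, and the toy data are all assumptions made for illustration, and only the Python standard library is used.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Weak learner: choose the (feature, value) rule with the fewest training errors."""
    best = None
    for f in range(len(X[0])):
        for v in (0, 1):
            # Candidate rule: predict 1 when feature f equals v, otherwise 0.
            errors = sum((1 if x[f] == v else 0) != t for x, t in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, f, v)
    _, f, v = best
    return lambda x: 1 if x[f] == v else 0

def fit_bagging(X, y, n_estimators=25, seed=0):
    """Fit each weak learner on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bagging(models, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical, not the UCI dataset): two binary features,
# and the label equals the first feature.
X = [(1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (0, 1), (1, 1), (0, 0)]
y = [1, 1, 0, 0, 1, 0, 1, 0]
models = fit_bagging(X, y)
prediction = predict_bagging(models, (1, 0))
```

On this noise-free toy data every stump learns the same rule; on real data the bootstrap samples differ, and the vote averages out the individual errors. Random Forest adds random feature selection at each split on top of the same bootstrap-and-vote scheme.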

Result
Applying four separate algorithms (Random Forest, KNN, AdaBoost, and Bagging) to the processed data of the UCI Machine Learning Repository dataset produced test accuracy values of 97.03%, 92.19%, 91.11%, and 94.03%, respectively. The best result, 97.03%, was achieved with the Random Forest model. Table 3 shows the accuracy obtained.

A. 10-Fold Cross-Validation Analysis
We used 10-fold cross-validation to assess the performance of our models on data that had not been used during training. The average cross-validation score for each method is displayed in Table 4. The Random Forest classification model obtained a 10-fold cross-validation score of 95.03%.
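The procedure can be sketched as follows: the samples are shuffled and split into 10 folds, each fold is held out once while the model is fitted on the other nine, and the held-out accuracies are averaged. This is a minimal standard-library sketch under stated assumptions, not the code used in the study; `k_fold_indices`, `cross_val_accuracy`, the stand-in model `threshold_rule`, and the synthetic data are all hypothetical.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_accuracy(X, y, fit_predict, k=10, seed=0):
    """Average accuracy over k folds; every fold is held out exactly once."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in held_out])
        correct = sum(p == y[i] for p, i in zip(preds, held_out))
        scores.append(correct / len(held_out))
    return mean(scores)

def threshold_rule(train_X, train_y, test_X):
    """Stand-in model (assumption): predict 1 when the feature is >= the training mean."""
    cut = mean(x[0] for x in train_X)
    return [1 if x[0] >= cut else 0 for x in test_X]

# Synthetic data (not the UCI dataset): the label is 1 when the feature is >= 50.
X = [(v,) for v in range(100)]
y = [1 if v >= 50 else 0 for v in range(100)]
score = cross_val_accuracy(X, y, threshold_rule, k=10)
```

Any classifier can be passed in place of `threshold_rule`, provided it is fitted only on the training part of each split.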

B. Confusion Matrix Analysis
The confusion matrix gives a thorough picture of both the successes and the failures of a classification model. The performance of a classification model is assessed using an N × N matrix, where N is the number of target classes; for the binary diabetic/non-diabetic target used here, N = 2. There are 16 attributes overall in the proposed system. The matrix contrasts the true target values with those predicted by the machine learning model, and so shows how the model's classification of each class differs from the actual classification [28]. The resulting four pieces of information are:
• True Positive (TP): diabetic patients correctly identified as diabetic.
• False Positive (FP): healthy patients misclassified as unhealthy.
• True Negative (TN): healthy patients correctly identified as healthy.
• False Negative (FN): diabetic patients misclassified as healthy.
Other variations of the comparison parameter formulas are discussed in [28].
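As an illustration, the four counts and the metrics commonly derived from them can be computed as in the sketch below. It assumes that the positive class (label 1) means diabetic; the function name `confusion_counts` and the toy labels are hypothetical and are not results from this study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, and FN for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1   # predicted diabetic, actually diabetic
            else:
                fp += 1   # predicted diabetic, actually healthy
        else:
            if truth == positive:
                fn += 1   # predicted healthy, actually diabetic
            else:
                tn += 1   # predicted healthy, actually healthy
    return tp, fp, tn, fn

# Toy labels for illustration only (1 = diabetic, 0 = healthy).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + tn + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
```

For these toy labels the counts are TP = 3, FP = 1, TN = 3, and FN = 1, so accuracy, precision, and recall all equal 0.75.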

Conclusions
People of all ages are becoming more likely to develop diabetes. According to studies, predicting diabetes in its early stages can be a crucial first step in treating the condition. The machine learning methods we have used improve the early prediction of this disease. Earlier prediction may also reduce the cost of treatment. The major goal of this study was to find the most effective prediction method for this UCI dataset. Random Forest and Bagging gave the best results. In the future, we plan to develop a web application that offers end users a free service for the early-stage prediction of this disease.


1. Feature Importance: Graphs have been drawn and analyzed to determine which features play an important role in prediction. They describe the feature importance graphically. "Feature importance" is a technique that scores input features according to how useful they are for predicting the target label. Feature importance scores play a significant role in predictive modeling projects: they provide insight into the data and into the model, give a basis for feature selection, and can thereby improve the reliability and effectiveness of a predictive model in practice.
2. Filter Method for Feature Selection: A filter method has been applied to the dataset to remove redundant data. Instead of using classification algorithms, filter approaches evaluate features based on properties of the data. Information theory, correlation, distance, consistency, fuzzy sets, and rough sets can all serve as the foundation for filter measures. In the first stage, features are ranked, either individually and independently of the feature space (univariate case) or in batches (multivariate case, which handles redundancy naturally). In the second stage, the highest-ranked features are chosen using a performance criterion [17].
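A univariate, correlation-based filter of the kind described above can be sketched as follows. The study does not state which filter measure was used, so the absolute Pearson correlation with the label is an assumption chosen for illustration, as are the helper names, the feature names `f0` to `f2`, and the toy data.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_a * var_b)

def rank_features(X, y, names):
    """Stage 1: score each feature from the data alone, with no classifier involved."""
    scores = {name: abs(pearson([row[j] for row in X], y))
              for j, name in enumerate(names)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def select_top_k(ranking, k):
    """Stage 2: keep the k highest-ranked features."""
    return [name for name, _ in ranking[:k]]

# Toy binary data (hypothetical): f0 matches the label, f1 partly, f2 not at all.
y = [1, 1, 1, 1, 0, 0, 0, 0]
f0 = [1, 1, 1, 1, 0, 0, 0, 0]
f1 = [1, 1, 1, 0, 0, 0, 0, 1]
f2 = [1, 0, 1, 0, 1, 0, 1, 0]
X = list(zip(f0, f1, f2))

ranking = rank_features(X, y, ["f0", "f1", "f2"])
selected = select_top_k(ranking, k=2)
```

Here the scores are 1.0, 0.5, and 0.0, so `selected` contains `f0` and `f1`. Because each feature is scored independently, this univariate filter does not detect redundancy between features.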
The KNN implementation consists of three steps: first, locate the sample's K closest neighbors; second, calculate the proportion of each category among those neighbors; third, choose the category with the highest proportion among the closest neighbors as the classification category [23]. KNN algorithms categorize new data elements entirely on the basis of similarity metrics.
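The three steps can be sketched in a few lines of Python. This is an illustrative sketch only: Euclidean distance is an assumed similarity metric, and the function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` from the categories of its k nearest training points."""
    # Step 1: locate the k training points closest to the query.
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    neighbors = [train_y[i] for i in order[:k]]
    # Step 2: compute the proportion of each category among the neighbors.
    proportions = {label: count / k
                   for label, count in Counter(neighbors).items()}
    # Step 3: return the category with the highest proportion.
    return max(proportions, key=proportions.get)

# Toy data (hypothetical): two binary symptom flags per patient, label 1 = diabetic.
train_X = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]
train_y = [0, 0, 1, 1, 1]
prediction = knn_predict(train_X, train_y, query=(1, 1), k=3)
```

For the query `(1, 1)` the three nearest neighbors carry the labels 1, 1, and 0, so the predicted category is 1. An odd K avoids ties in binary problems.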
Table 4 shows the various comparison parameters.

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP}$$