Prediction of Diabetes Complications Using Computational Intelligence Techniques

Alghamdi, Turki

doi:10.3390/app13053030

by

Turki Alghamdi

Faculty of Computer and Information Systems, Islamic University of Madinah, Madinah 42351, Saudi Arabia

Appl. Sci. 2023, 13(5), 3030; https://doi.org/10.3390/app13053030

Submission received: 20 January 2023 / Revised: 19 February 2023 / Accepted: 22 February 2023 / Published: 27 February 2023

(This article belongs to the Special Issue Computation and Complex Data Processing Systems)

Abstract

Diabetes is a complex disease that can lead to serious health complications if left unmanaged. Early detection and treatment of diabetes are crucial, and data analysis and predictive techniques can play a significant role. Data mining techniques, such as classification and prediction models, can be used to analyse various aspects of data related to diabetes and to extract useful information for early detection and prediction of the disease. The XGBoost classifier is a machine learning algorithm that can predict diabetes with high accuracy. This algorithm uses a gradient-boosting framework and can handle large and complex datasets with high-dimensional features. However, it is important to note that the choice of the best algorithm for predicting diabetes may depend on the specific characteristics of the data and the research question being addressed. In addition to predicting diabetes, data analysis and predictive techniques can also be used to identify risk factors for diabetes and its complications, monitor disease progression, and evaluate the effectiveness of treatments. These techniques can provide valuable insights into the underlying mechanisms of the disease and help healthcare providers make informed decisions about patient care. Data analysis and predictive techniques have the potential to significantly improve the early detection and management of diabetes, a fast-growing chronic disease that poses notable health hazards. The XGBoost classifier proved the most effective, with an accuracy of 89%.

Keywords:

diabetes; prediction; clustering; data mining; XGBoost

1. Introduction

Diabetes is a chronic condition that affects the body’s ability to control blood sugar levels. One of the main symptoms of diabetes is high blood glucose levels. The body regulates blood glucose levels through the hormones insulin and glucagon. Usually, proper hormone secretion maintains normal blood sugar levels between 70 and 180 mg/dL. If left untreated or unmanaged, diabetes can lead to long-term complications, such as damage to large and small blood vessels, increasing the risk for cardiovascular disease and kidney, eye, limb, and neurological problems. According to the International Diabetes Federation, there are currently 387 million people living with diabetes worldwide, which is projected to double by 2035. Predicting diabetes early is often a difficult task for medical practitioners. Figure 1 depicts the three different forms of diabetes that exist. They are as follows:

Type 1 diabetes, also known as insulin-dependent diabetes or “juvenile diabetes”, is a condition in which the immune system mistakenly attacks and destroys the cells in the pancreas that produce insulin. This results in a reduced ability of the body to produce insulin [1].

Type 2 diabetes, also known as non-insulin-dependent diabetes or adult-onset diabetes, occurs when the body becomes resistant to the effects of insulin or when the pancreas stops producing enough insulin.

Gestational diabetes is a third form, which develops during pregnancy.

Data mining is the process of extracting useful information from large datasets by identifying and eliminating unnecessary data. It is commonly used in various industries, such as finance, education, healthcare, and medicine. Organisations use data mining techniques to analyse large amounts of data, improve decision-making, and achieve better long-term outcomes. A critical aspect of data mining is classification, which involves finding a model that separates different data points or concepts based on their class labels. This process maximises similarity within each class and minimises similarity between classes. Another technique used in data mining is clustering, which groups data points without using class labels. Association rule learning, a machine learning technique, is also employed to identify common patterns in data [2].

Decision trees are among the data mining methods most commonly used for classification. A decision tree classifier is constructed by dividing the data into subsets of instances with similar values, using a top-down approach that starts at the root node. For diabetes management, self-monitoring of blood glucose levels, using finger-stick blood samples, is a common technique. People with diabetes check their sugar levels multiple times daily, using finger-stick glucose meters. However, this can be inconvenient and uncomfortable, requires a large number of blood samples, and can lead to inaccurate results if insulin intake is not taken into account [3].

The decision tree algorithm is popular due to its ease of use. It creates a model that predicts the value of a target variable based on a number of input variables. The classification tree is helpful in decision-making, and the decision analysis helps visualise and represent decisions. Decision trees are used in various industries, such as medicine, agriculture, corporate finance, biometric engineering, plant disease detection, and software development [4].
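The top-down splitting described above can be illustrated with a minimal sketch. It assumes Gini impurity as the split criterion (one common choice; the text does not say which criterion is used) and picks the single best threshold on one feature. The function names and the glucose readings are hypothetical and serve only as an illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical glucose readings (mg/dL) and labels (1 = diabetic); not real data.
glucose = [85, 90, 99, 130, 145, 160]
diabetic = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(glucose, diabetic)
print(threshold, impurity)
```

A full tree would apply the same search recursively to each resulting subset, over all features, until a stopping criterion is met.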

Machine learning is a rapidly growing field of computer science that uses algorithms to simulate human intelligence by learning from data. It is a way of making sense of previously incomprehensible inputs by learning patterns from data. Machine learning can be divided into two main categories: inductive and deductive learning. Inductive learning involves taking instances and generalising them, while deductive learning involves deducing new knowledge from existing facts and understanding. It involves finding patterns and rules from large databases and using them to create computer programs. Machine learning is the study of how to improve the performance of computer systems by learning from experience. It often follows the same principles as human learning. There are various strategies in the literature for identifying and diagnosing diabetes using machine learning methods [5].

Diabetes mellitus is a disease that is one of the leading causes of death globally and can lead to severe complications, such as renal failure, blindness, and heart disease. This study suggests using data mining techniques to predict diabetes, which can help in the early identification and management of the disease, thus reducing its associated morbidity and mortality [6]. Data mining tools can assist medical professionals by helping them make accurate disease diagnoses and treatment decisions while reducing specialists’ workloads.

Diabetes is a chronic disease affecting millions of people worldwide and can lead to complications if left untreated or poorly managed. Early prediction of diabetes complications is crucial to provide timely interventions and prevent or delay the onset of these complications. Computational intelligence techniques can be used to develop predictive models for diabetes complications, leveraging the vast amounts of data available from electronic health records, wearable devices, and other sources.

One popular approach is to use machine learning algorithms to build predictive models that can identify patterns in the data and predict the likelihood of complications. For example, decision trees, logistic regression, support vector machines, and neural networks are popular machine learning algorithms used for this purpose.

Deep learning is another computational intelligence technique that has shown promise in predicting diabetes complications. Convolutional neural networks and recurrent neural networks are popular deep learning architectures for medical image analysis and time-series data analysis. Deep learning models can automatically learn complex patterns in the data, and with enough data, they can often outperform traditional machine learning methods.

In addition to machine learning and deep learning, other computational intelligence techniques, like fuzzy logic and genetic algorithms, have also been used for predicting diabetes complications. Fuzzy logic can be used to model uncertainty and imprecision in the data, while genetic algorithms can be used to optimise the parameters of the predictive models. Computational intelligence techniques for predicting diabetes complications hold great promise for improving the quality of care for people with diabetes. However, it is essential to ensure that the models are validated on independent datasets and that their performance is clinically relevant before they are used in practice. Additionally, it is critical to consider the ethical and social implications of using such models, such as potential biases and privacy concerns.

Several different computational intelligence techniques can be used to predict diabetes complications. The feasibility of these methods can vary depending on the specific application, available data, and computational resources. Here are some of the most commonly used methods and their feasibility for this task:

Logistic regression (LR): LR is a classification algorithm used to predict the probability of a binary outcome (e.g., yes or no). It works by fitting a regression model to the data and then applying a sigmoid function to the output to convert it to a probability.

K-nearest neighbors (KNN): KNN is a simple classification algorithm that finds the k nearest neighbours of a new data point and assigns it to the most common class among its neighbours.

Classification and regression trees (CART): CART is a decision tree algorithm that can be used for both classification and regression tasks. It works by recursively splitting the data into smaller subsets, based on the values of different features, until a stopping criterion is met.

Random forest (RF): RF is an ensemble method that combines multiple decision trees to improve the performance and reduce the overfitting of individual trees. It works by creating multiple random samples of the training data and training a decision tree on each sample, then aggregating the results to make predictions.

Support vector machines (SVM): SVM is a robust classification algorithm that finds the hyperplane that best separates the classes in the data. Using different kernel functions, it can handle both linearly and non-linearly separable data.

XGBoost: XGBoost is an optimised gradient-boosting algorithm for classification and regression tasks. It works by iteratively adding weak learners to the model, each one fitted to correct the errors (the gradient of the loss) of the current ensemble, to improve overall performance.

LightGBM: LightGBM is another optimised gradient boosting algorithm designed to be fast and efficient. It uses a histogram-based algorithm to split the data and reduce memory usage during training.
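Two of the methods listed above are simple enough to sketch in a few lines. The following is an illustrative sketch only, not the implementation used in this study: it shows the sigmoid step of logistic regression and the majority vote of KNN. The weights, bias, feature values, and function names are hypothetical, and the features are assumed to be already scaled to [0, 1].

```python
import math
from collections import Counter

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lr_probability(features, weights, bias):
    """Logistic regression: a linear score followed by the sigmoid."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

def knn_predict(train_points, train_labels, query, k=3):
    """Assign the majority class among the k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical scaled (glucose, BMI) pairs; 1 = diabetic, 0 = non-diabetic.
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3), (0.8, 0.9), (0.9, 0.7), (0.85, 0.8)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.75, 0.8)))           # vote among 3 neighbours
print(lr_probability((1.0, 1.0), (2.0, 1.0), bias=-1.0))  # probability of class 1
```

In practice the logistic regression weights would be learned from training data rather than set by hand, and k would be chosen by validation.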

The choice of computational intelligence technique for predicting diabetes complications depends on the specific requirements of the application, available data, and computational resources. While some methods are more interpretable and easier to implement, others can handle more complex relationships in the data and require more computational resources. It is essential to evaluate the performance of different methods on independent datasets and choose the one that best meets the specific application’s needs.

Diabetes is a common chronic disease that affects millions of people worldwide and is a leading cause of morbidity and mortality. It is also a complex disease with various complications that can lead to significant health and economic burdens. The prediction of diabetes complications is a challenging task, and it requires the analysis of various factors, including clinical, genetic, and environmental factors. The use of computational intelligence techniques for predicting diabetes complications has shown promising results, and it has the potential to improve patient outcomes and reduce the burden of the disease. Moreover, there is a significant amount of data available on diabetes and its complications, which can be used to train and test predictive models. This data includes clinical data from electronic health records, genomic data, and lifestyle data, such as diet and exercise habits. Therefore, diabetes provides a rich and diverse dataset that can be used to develop and evaluate computational intelligence techniques. The choice to focus on diabetes to predict complications, using computational intelligence techniques, is motivated by its prevalence, complexity, and the availability of rich data sources. However, the same techniques can be applied to other chronic diseases with complex pathologies and available datasets to improve patient outcomes and reduce the burden of the disease.

The selection of AI and ML models for predicting diabetes complications depends on the specific goals of the application, the type and size of the available data, and the desired level of interpretability of the model. The goal of the research should be considered first when selecting an AI or ML model. A binary classification model, such as logistic regression, decision tree, or SVM, may be appropriate to predict the probability of a patient developing a particular complication. If the goal is to predict the severity of a complication, a regression model, such as linear regression or neural network, may be more suitable. The type and size of data available for training and testing the model should also be considered. A supervised learning approach, such as decision trees, logistic regression, SVM, or neural networks, may be used if the data is structured with clear labels. Unsupervised learning approaches, such as clustering, association rule mining, or fuzzy logic, may be more appropriate if the data is unstructured or semi-structured. The desired level of interpretability of the model is another consideration. Simple models, like logistic regression and decision trees, are more interpretable and easier to understand. In contrast, more complex models, such as neural networks and deep learning models, may be less interpretable.

Evaluation metrics, such as accuracy, precision, recall, F1 score, and AUC-ROC, should also be considered when selecting AI and ML models. Depending on the specific goals of the application, some metrics may be more important than others. In summary, the selection of AI and ML models for predicting diabetes complications requires careful consideration of the specific goals of the application, the type and size of the available data, the desired level of interpretability, and the evaluation metrics. It is essential to evaluate the performance of the chosen models on independent datasets and select the model that best meets the specific application’s needs.
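As an illustration of how the threshold-based metrics named above relate to one another, the following sketch computes accuracy, precision, recall, and F1 score from binary predictions. The label vectors are hypothetical and the function name is illustrative. AUC-ROC is omitted because it requires predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical ground truth and predictions for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

When missing a true case is costlier than a false alarm, as is often argued for screening, recall may deserve more weight than accuracy.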

1.1. Problem Statement

“The rapid increase in the number of diabetes cases is becoming a global health concern. Early detection and prevention of diabetes are essential to mitigate its impact on individuals and society. However, existing methods for predicting diabetes are not always reliable and may fail to identify high-risk individuals. This paper aims to design a diabetes prediction framework that incorporates computational intelligence techniques to enhance the precision of diabetes predictions and aid in early disease detection and prevention”.

1.2. Research Motivation

This research is motivated by a few factors, including:

The growing number of diabetes cases worldwide highlights the importance of early detection and prevention of the disease.
The limitations of current diabetes prediction methods and the potential for improvement by using computational intelligence methods.
The applicability of machine learning and artificial intelligence techniques to disease prediction and diagnosis in healthcare.
The importance of personalising diabetes prevention and treatment strategies by identifying high-risk individuals and providing them with targeted interventions.
The potential cost savings that can be achieved by early detection and prevention of diabetes, both for individuals and society, in terms of disease burden and healthcare costs.

1.3. Significance of Our Study

A diabetes prediction framework that incorporates computational intelligence methods can potentially improve the accuracy of diabetes predictions and lead to earlier disease detection and better management of the disease. This framework can also aid healthcare professionals and researchers in identifying individuals at high risk of developing diabetes and in developing more effective diabetes prevention and treatment strategies. By identifying high-risk individuals early and providing targeted interventions, the framework can help personalise diabetes prevention and treatment strategies, ultimately reducing the disease burden on individuals and society and lowering healthcare costs associated with diabetes management.

1.4. Research Objectives

The main objective of a diabetes prediction framework is to use computational intelligence techniques, such as machine learning and artificial intelligence, to predict the probability of an individual developing diabetes. The framework can be applied for early disease detection, risk assessment, and the development of effective diabetes prevention and treatment strategies. It can assist healthcare professionals and researchers in identifying high-risk individuals and in providing targeted interventions to those most likely to benefit. By providing an accurate and efficient way of identifying individuals at risk of developing diabetes, the framework can help personalise diabetes prevention and treatment strategies and, ultimately, reduce the disease burden on individuals and society, as well as lower healthcare costs associated with diabetes management.

2. Literature Review

Research on the early prediction of diabetes has been increasing in recent years, but it is still in its infancy. Fungal infections can affect different parts of the body, including the skin, hair, nails, respiratory tract, digestive tract, and bloodstream. Different types of fungi, such as dermatophytes, yeasts, and moulds, cause fungal infections. Some of the most common fungal infections that affect humans include athlete’s foot, ringworm, candidiasis, aspergillosis, and histoplasmosis [7].

Diabetic retinopathy is a common complication of diabetes mellitus, which affects the retina’s blood vessels and can lead to vision loss. In some cases, there may be neurovascular damage that is not visible on ophthalmoscopy but can still lead to functional and structural changes in the retina [8].

Diabetes is a chronic condition that affects the body’s ability to process glucose (sugar) in the blood, leading to high blood sugar levels. Diabetes can cause damage to various organs and systems in the body, including the cardiovascular system, kidneys, eyes, and nervous system. Proper nutrition is crucial in managing diabetes, as it can help control blood sugar levels and prevent complications. The Nutrition Diet Expert System (NDES) is a tool that healthcare professionals can use to calculate the calorie requirements of diabetic patients and recommend the best diet plan to control diabetes. The system considers various factors, such as age, weight, height, physical activity level, and blood sugar control goals, to create a personalised diet plan for the individual [9].

The article describes a study in which the authors developed a prediction model for type 2 diabetes mellitus in prediabetes patients in Oman, using artificial neural networks and six different machine learning classifiers. The study used data from a sample of 500 prediabetes patients in Oman and compared the performance of the different classifiers in predicting the development of type 2 diabetes mellitus in these patients [10].

Veena V. and Anjali C. developed a system for predicting diabetes mellitus using the AdaBoost algorithm and decision trees. Their main goal was to create a system that would provide high accuracy in predicting diabetes. They trained the model on a dataset of 768 instances. The accuracy of their model was around 80.72%. The AdaBoost approach, combined with the decision stump algorithm, yielded the best prediction accuracy in their investigation, outperforming SVM, Naive Bayes, and the decision tree algorithm [11].

G. Woldemichael and S. Menaria discussed the use of data mining techniques to predict the onset of diabetes. The authors used a dataset containing information about patients’ demographic characteristics, lifestyle habits, and medical history, and applied various data mining techniques, including a decision tree, k-nearest neighbors, and support vector machine, to analyse the data [12].

B. V. Baiju and D. J. Aravindhar proposed a new approach to predicting diabetes using data mining techniques. The authors used a dataset containing medical information about patients, including demographic characteristics, lifestyle habits, and medical history, and applied various data mining techniques, including decision tree, k-nearest neighbors, and support vector machine, to analyse the data. The authors introduced a new measure called the “disease influence measure” (DIM), which considers the influence of other diseases on the development of diabetes [13].

The authors found that the decision tree algorithm had the highest accuracy in predicting diabetes, followed by the k-nearest neighbors and support vector machine algorithms. They also performed feature selection to identify the most important predictors of diabetes and found that age, body mass index (BMI), and glucose level were the most important predictors. The authors concluded that data mining techniques can be useful in predicting diabetes and that the decision tree algorithm is particularly effective in this task. They suggested that their findings could be used to develop decision-support systems that could aid in diagnosing and treating the disease [14].

G. G. Ladha and R. Kumar Singh Pippal reviewed various data mining techniques for predicting diabetes, including decision trees, artificial neural networks, and logistic regression. They discussed each technique’s advantages and limitations, highlighting the importance of feature selection and parameter tuning in achieving high accuracy. The authors also reviewed various studies that had used data mining techniques to predict diabetes and highlighted the high accuracy achieved by some of these studies. They concluded that data mining techniques could effectively predict diabetes, and that further research is needed to develop more accurate and robust models for predicting the disease [15].

The paper described a study in which the authors used data mining techniques to analyse a large dataset of patient information to detect diabetes. The authors used various classification algorithms, such as decision trees, random forest, and artificial neural networks, to classify patients as diabetic or non-diabetic, based on their medical history, clinical measurements, and demographic information. The authors also performed feature selection to identify the most important predictors of diabetes and found that age, BMI, and glucose level were the most important predictors. They also used cross-validation to evaluate the performance of the classification algorithms and found that random forest performed the best in terms of accuracy and sensitivity [16].

F. A. Khan and K. Zeb provided a comprehensive review of various data mining techniques used to detect and predict diabetes. The authors reviewed various studies that had used data mining techniques, such as decision trees, logistic regression, and support vector machines, to analyse patient data and identify those at risk of developing diabetes. The authors also discussed the importance of feature selection and dimensionality reduction in building accurate diabetes prediction models. They highlighted the need for data preprocessing techniques to handle missing values, data imbalances, and noisy data [17].

The paper presented a study on using back-propagation neural networks to detect and predict diabetes mellitus. The authors collected data from 768 patients and used various features, such as age, BMI, blood pressure, and glucose levels, to train the neural network. The authors discussed using the back-propagation algorithm to train the neural network and the selection of appropriate input and output layers. They also highlighted the importance of cross-validation to evaluate the performance of the neural network and the need for feature selection to improve the model’s accuracy [18].

E. Ramasso and R. Gouriveau presented a study on using neuro-fuzzy systems and Markovian evidential classification for real-time prognostics in switching systems. The authors proposed a methodology for predicting switching systems’ remaining useful life (RUL) based on real-time sensor data. The authors discussed the use of neuro-fuzzy systems to model the behaviour of switching systems and the use of Markovian evidential classification to classify the RUL of the system. They also presented experimental results demonstrating the effectiveness of their approach [19].

In today’s maintenance plans, condition-based maintenance is seen as a crucial process, and prognostics is considered a promising activity, as it avoids unnecessary expenditure. Many strategies have been developed, and data-driven techniques are being used. Since many of these approaches rely on probability theory and/or artificial neural networks, their training phase often requires large datasets [20].

The fatigue endurance limit (FEL) concept is used to help eliminate or minimise bottom-up fatigue cracking in perpetual (or long-life) flexible pavements. FEL refers to the maximum cyclic strain that a pavement structure can withstand without experiencing fatigue damage, for any number of load repetitions. Essentially, if the strain imposed on the pavement structure is below the FEL, then no fatigue damage will accumulate and the pavement will have an infinite fatigue life [21].

The use of radiotracer injection is a common technique for measuring fluid flow rates and velocities. By injecting a small amount of a radioactive tracer into the fluid, the tracer’s movement can be tracked and used to determine flow rates and velocities. The authors of the article mentioned utilised this technique. They combined it with an ANN model to accurately measure both the density and velocity of fluids in various pipe diameters. Artificial neural networks are a powerful tool for data analysis and prediction. Their use in this context allowed for precise measurements of the fluid parameters, without needing additional sensors or equipment. By training the ANN model using simulated data, the authors could accurately predict fluid density and velocity in different pipe diameters, providing a cost-effective and reliable method for fluid measurement in the oil and petroleum industries [22].

There has been significant progress in developing and testing nonstandard specimen geometries to measure fracture toughness data that are more applicable to defect assessments and fitness-for-service (FFS) analyses of structural components with crack-like flaws under low constraint conditions. These efforts were motivated by a better understanding of the potential, strong dependency of fracture toughness on crack geometry, loading type, and material strain hardening behaviour [23].

Tracking the growth trends of different firm sizes in the DB sector requires a more comprehensive analysis of data, beyond just revenue data. For example, the number and size of contracts awarded to DB firms of different sizes could provide insights into growth trends. Additionally, a more detailed analysis of industry trends, such as the demand for DB services in different sectors, could shed light on the growth prospects for firms of different sizes [24].

Using machine learning algorithms to build a model for early detection of diabetes is indeed a promising approach, given the increasing availability of healthcare data and the potential benefits of early diagnosis and intervention. The choice of the logistic regression classification algorithm to predict the presence of type 2 diabetes is appropriate, as this algorithm is a commonly used and well-understood approach in medical applications. Additionally, the use of the PIMA Indian Diabetic Dataset is suitable, as this dataset is widely used in the research community and provides a good benchmark for evaluating the performance of the developed model. The use of machine learning algorithms and healthcare data to develop models for early detection of diabetes has the potential to improve healthcare outcomes and reduce the burden of diabetes-related complications. However, it is important to ensure the accuracy and reliability of these models, through proper validation and verification in clinical settings, and to address potential issues related to data privacy and security [25].

The proposed study addresses an important issue related to the long-term prediction of type 2 diabetes, which is a crucial step towards prevention and early intervention. By identifying the risk-factors responsible for the future development of diabetes, the study can help individuals take preventive measures and make lifestyle changes to reduce the risk of developing diabetes. The use of two novel feature extraction approaches is also noteworthy, as this can help identify the most relevant and discriminative risk-factors for predicting diabetes. Applying a machine learning pipeline for the long-term prediction of diabetes is also appropriate, as this can potentially improve the accuracy and reliability of the developed model. The proposed study can contribute to the development of effective screening and prediction tools for type 2 diabetes, which can potentially reduce the burden of this costly and burdensome metabolic disorder. However, it is important to ensure the accuracy and reliability of the developed model through proper validation and verification in clinical settings, and to address potential issues related to data privacy and security [26].

3. Solution Design and Implementation

3.1. Conceptual Description of the Solution

This framework aims to develop machine learning paradigms for diabetes prediction and store their results. The technology is used to administer and predict diabetes in patients and perform automated examinations. The data collection process to formulate results is explained step by step, including preprocessing, model training, testing, and evaluations in terms of accuracy, training, and confusion matrix. The affected people were diagnosed with neuromorphic using the random forest approach and evolving alerts, as seen in Figure 2.

3.2. Design of the Solution

Designing a machine learning solution for diabetes prediction involves several steps:

Data collection: The initial step is to gather a large dataset that includes patient demographics, medical history, lab results, and other related information. This data can be obtained from electronic health records (EHRs) or other sources.

Data preprocessing: The collected data must be cleaned and preprocessed to eliminate any inconsistencies or missing values. The data must also be transformed and scaled to ensure machine learning algorithms can use it.

Feature selection: The next step is identifying the relevant features that will be used to train the machine learning model. This can include demographic information, lab results, and other related data.

Model selection: After the features are selected, the next step is to select the appropriate machine learning model for the task. The model type will depend on the nature of the data and the desired outcome. For example, logistic regression or a decision tree model could be used for classification tasks.

Model training: Once the model is selected, it needs to be trained on the collected and preprocessed data. The model will learn from the data and will be able to make predictions on new, unseen data.

Model evaluation: After the model is trained, it needs to be evaluated using various metrics, such as accuracy, precision, and recall. The model’s performance can be optimised by fine-tuning the hyperparameters and adjusting the features, if necessary.

Model deployment: Once the model is trained and optimised, it can be deployed in a production environment. The model can be integrated with existing systems, such as EHRs, to make predictions about new patients and help with the early detection of diabetes.
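The preprocessing, training, and prediction steps above can be sketched in a few lines. The following is a minimal, standard-library-only illustration and not the implementation used in this study: it assumes min-max scaling and a k-nearest-neighbours classifier (KNN is one of the models compared in Table 5). The function names are hypothetical, the training rows reuse the glucose, BMI, and age values shown in Table 4, and the new patient is invented for illustration.

```python
import math
from collections import Counter


def fit_min_max(rows):
    """Return per-feature (min, max) pairs learned from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def scale(row, bounds):
    """Scale one row to [0, 1] using bounds learned on the training data."""
    return [
        (value - low) / (high - low) if high > low else 0.0
        for value, (low, high) in zip(row, bounds)
    ]


def knn_predict(train_x, train_y, query, k=3):
    """Predict the majority label among the k nearest training rows."""
    nearest = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Rows: (glucose, BMI, age) from Table 4; labels: 1 = diabetic, 0 = non-diabetic.
raw_x = [[149, 34.8, 50], [97, 25.6, 32], [181, 24.3, 33], [88, 28.2, 21]]
labels = [1, 0, 1, 0]

# Preprocessing: learn the scaling on the training rows only.
bounds = fit_min_max(raw_x)
train_x = [scale(row, bounds) for row in raw_x]

# Prediction for a new, unseen (hypothetical) patient.
new_patient = scale([170, 30.0, 45], bounds)
prediction = knn_predict(train_x, labels, new_patient, k=3)
```

Learning the scaling bounds from the training rows and reusing them for new patients keeps information from unseen data out of the preprocessing step.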

Figure 3 depicts the proposed system architecture. The prediction of diabetes involves the following steps:

The system first receives the diabetes data set as input.
Based on the given symptoms, the diabetes predictor assists by predicting the presence of diabetes and generates the predicted results.
The diabetes monitor device assists in checking blood sugar levels and sends out alerts based on them.
The user receives the awareness message to know about their health status.

The use of computational intelligence techniques for the prediction of diabetes complications offers several advantages, including:

Early detection: Computational intelligence techniques can identify patterns in the data that may not be apparent to human experts, allowing for the earlier detection of diabetes complications. Early detection can lead to more timely interventions and potentially prevent or delay the onset of complications.

Personalised treatment: Predictive models can identify patients at higher risk of developing diabetes complications and tailor their treatment plans accordingly. This can lead to more personalised and effective care.

Improved patient outcomes: By identifying patients at a higher risk of developing diabetes complications and providing early interventions, computational intelligence techniques can improve patient outcomes and reduce the overall burden of diabetes-related complications.

Efficient resource allocation: By predicting which patients are at a higher risk of developing complications, healthcare resources can be allocated more efficiently, and patients can be prioritised for more intensive care.

Improved efficiency: Computational intelligence techniques can process vast amounts of data quickly and efficiently, leading to improved healthcare delivery efficiency.

Reduced costs: Early detection and timely interventions can potentially reduce the costs associated with diabetes complications by preventing or delaying the need for more intensive and costly treatments.

Continuous monitoring: Wearable devices and other technologies can monitor patients’ health status, allowing for more proactive and personalised care.

Overall, the use of computational intelligence techniques for the prediction of diabetes complications has the potential to improve patient outcomes, reduce costs, and increase the efficiency of healthcare delivery.

The proposed system uses a model trained on the dataset to predict the likelihood of an individual developing diabetes. The diabetes prediction and awareness system generates predictions and sends health-related alerts to users. The diabetes predictor uses the information in Table 1 to make its computations and provide its results.
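The way Table 1 can drive the awareness messages may be illustrated with a small rule-based sketch. This is an assumption-laden illustration rather than the system's actual logic: readings are assumed to be in mg/dL, the boundary values that Table 1 leaves open (exactly 126 fasting, exactly 140 after eating) are assigned to the nearest category, readings below 70 are reported separately because Table 1 does not cover them, and the function names and message texts are hypothetical.

```python
SEVERITY = ["normal", "early stage", "established"]

# Hypothetical message texts; the source does not specify their wording.
MESSAGES = {
    "normal": "Blood glucose is within the normal range.",
    "early stage": "Blood glucose suggests early-stage diabetes; please consult a doctor.",
    "established": "Blood glucose suggests established diabetes; please seek medical care.",
}


def fasting_category(mg_dl):
    """Category for a fasting reading (minimum/maximum columns of Table 1)."""
    if mg_dl <= 100:
        return "normal"
    if mg_dl <= 125:
        return "early stage"
    return "established"


def postprandial_category(mg_dl):
    """Category for a reading taken after eating (last column of Table 1)."""
    if mg_dl < 140:
        return "normal"
    if mg_dl <= 200:
        return "early stage"
    return "established"


def awareness_message(fasting_mg_dl, postprandial_mg_dl):
    """Return the message for the more severe of the two readings."""
    if fasting_mg_dl < 70:
        # Table 1 starts the normal range at 70; lower values are flagged separately.
        return "Fasting glucose is below the normal range in Table 1."
    worst = max(
        fasting_category(fasting_mg_dl),
        postprandial_category(postprandial_mg_dl),
        key=SEVERITY.index,
    )
    return MESSAGES[worst]
```

For example, `awareness_message(95, 180)` returns the early-stage message, because the postprandial reading falls in the 141–200 band even though the fasting reading is normal.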

3.3. Validation Prototype

The proposed system uses a dataset to predict the likelihood of an individual developing diabetes. It applies the Iterative Dichotomiser 3 (ID3) algorithm to build decision trees and supports monitoring of the patient’s health by reporting results from checkups of fasting and postprandial blood sugar levels. The system generates awareness messages based on the patient’s blood sugar levels, as shown in Table 1. This table lists the normal blood sugar ranges and the ranges associated with each stage of diabetes.
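ID3 grows a decision tree by repeatedly choosing the attribute with the highest information gain, that is, the largest reduction in class entropy. The sketch below shows only this selection step under simplifying assumptions: attributes are categorical (continuous measurements such as glucose would first need to be discretised), and the attribute names, categories, and toy records are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(
        len(group) / len(labels) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute that ID3 would place at the current tree node."""
    return max(attributes, key=lambda name: information_gain(rows, labels, name))


# Hypothetical, already discretised records; 1 = diabetic, 0 = non-diabetic.
rows = [
    {"glucose": "high", "age_group": "older"},
    {"glucose": "high", "age_group": "young"},
    {"glucose": "normal", "age_group": "older"},
    {"glucose": "normal", "age_group": "young"},
]
labels = [1, 1, 0, 0]

root = best_attribute(rows, labels, ["glucose", "age_group"])
```

In this toy example the glucose attribute separates the two classes perfectly, so it is selected for the root node; a full ID3 implementation would then recurse on each branch.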

Incomplete or missing data is a common issue in medical datasets and can cause problems for machine learning models, leading to biased or inaccurate results [27]. Handling missing values in medical datasets before training a model is an important step in the data preprocessing stage. There are several approaches to handling missing values in medical datasets, including:

Imputation: This involves replacing missing values with a substituted value, such as the mean or median value of the corresponding feature, or using more complex techniques, like regression models.

Deletion: This involves removing the samples or features with missing values. However, this approach can result in a significant loss of data.

Using specialised algorithms: Some machine learning algorithms, such as decision trees or random forests, can handle missing values directly during the training process.

It is important to choose an appropriate strategy based on the specific characteristics of the dataset and the requirements of the problem at hand. Proper handling of missing values can help to improve the accuracy and reliability of the machine learning model and reduce the potential for biases or inaccuracies in the results.
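The deletion and imputation strategies described above can be contrasted on a small example. This is a minimal sketch, assuming missing entries are encoded as `None` and that the median is used as the substituted value; the function names and toy values are hypothetical.

```python
from statistics import median


def drop_incomplete(rows):
    """Deletion: keep only rows with no missing (None) values."""
    return [row for row in rows if None not in row]


def impute_median(rows):
    """Imputation: replace each None with the median of its column.

    Assumes every column has at least one observed value.
    """
    columns = list(zip(*rows))
    medians = [median(v for v in col if v is not None) for col in columns]
    return [
        [medians[i] if value is None else value for i, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical rows: (glucose, skin thickness, insulin).
rows = [
    [149, 34, None],
    [97, 28, None],
    [181, None, 130],
    [88, 25, 94],
]

complete_only = drop_incomplete(rows)
imputed = impute_median(rows)
```

Deletion keeps a single row here, whereas imputation retains all four, which illustrates the data loss mentioned above.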

4. Results and Performance Evaluation

This dataset contains information on 768 individuals with reports of symptoms associated with diabetes. The data were collected through a questionnaire given to individuals who had recently been diagnosed with diabetes or who had symptoms but were not yet diagnosed. Missing values were handled by discarding incomplete records, which left a dataset of 500 cases. Within this dataset, there were 186 negative cases (indicating no diabetes diagnosis) and 314 positive cases (indicating a diabetes diagnosis).

The pie chart in Figure 4 illustrates the distribution of the diabetic and non-diabetic populations in the dataset. The blue portion of the chart represents the non-diabetic population, with a value of 0, and the orange portion represents the diabetic population, with a value of 1. The bar chart in Figure 4 illustrates how the diabetic and non-diabetic population is affected by different parameters, as shown in Table 2, which lists the dataset’s attributes.

Table 3 shows a detailed description of the dataset and its properties.

Sample values from the dataset are shown in Table 4. The output class variable (0 or 1) indicates whether the individual is diabetic.

Figure 5 shows the age distribution of the sample dataset, that is, the number of patient records in each age group.

The distribution of the outcome variable in the data was examined and visualised in Figure 6, in which each variable is plotted as a histogram with its density curve.

The train-test split is a technique used to evaluate the performance of a machine learning model. It applies to classification, regression, and other supervised learning tasks. The technique divides a dataset into two subsets: a training set and a test set. The training set is used to fit the model, while the test set is used to evaluate the model’s performance: the test inputs are fed to the model, and the predicted output is compared to the actual output. This allows the model’s ability to generalise to new, unseen data to be assessed. In this case, the training set accounts for 80% of the data and the test set accounts for 20%.
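An 80/20 split of this kind can be sketched as follows. The example is illustrative only: it assumes the split is applied to the 500 complete cases, uses a fixed random seed for reproducibility, and does not stratify by class, none of which is specified in the text. The function name is hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Integer stand-ins for the 500 complete cases.
records = list(range(500))
train, test = train_test_split(records)
```

With 500 records this yields 400 training and 100 test cases, and no record appears in both subsets.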

Research often uses a number of parameters to evaluate the performance of diabetes prediction algorithms, one of which is the confusion matrix. A confusion matrix is a tool to visualise a classification algorithm’s performance. It displays the number of true positive, true negative, false positive, and false negative predictions. Figure 7 shows the confusion matrix of our classifier, which is used to evaluate different performance measures. This technique has been used in our study’s comparative analysis of multiple algorithms.
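The four counts of the confusion matrix are sufficient to derive the accuracy, precision, recall, and F1 score reported for each classifier. The sketch below uses the standard definitions of these measures; the function names and the toy label vectors are hypothetical and are not taken from the study's results.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for binary labels, where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test labels and predictions (1 = diabetic, 0 = non-diabetic).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
scores = metrics(*counts)
```

Here the counts are 4 true positives, 4 true negatives, 1 false positive, and 1 false negative, so all four measures come out at approximately 0.8.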

In the outlier observation analysis, we specified several conditions to determine the most accurate results. A collection of data preparation methods was used to improve the classifier’s performance. To reduce bias towards normal examples, we oversampled certain data instances, regardless of class, to expose the specific attributes hidden within outlier occurrences. A function was used to calculate the risk of developing diabetes based on family history; the greater the value of this function, the greater the risk of diabetes. The outcome of whether the individual had diabetes (1 = yes, 0 = no) was also considered.

Using the data collected, prediction models for the risk of developing diabetes were created to potentially prevent the disease in the future. The accuracy of these models is presented in Table 5, where it can be seen that the XGBoost classifier had the highest accuracy of 89%. This indicates that the XGBoost classifier can correctly predict diabetes risk 89% of the time when using the input data.

The experimental results presented in Table 5 indicate little difference in performance between the models used (LR, KNN, CART, RF, SVM, XGBoost, and LightGBM), with accuracies ranging from 0.84 to 0.89. On the test dataset, the XGBoost model achieved 89% accuracy in predicting the occurrence of diabetes. In comparison, the LightGBM model achieved 88% accuracy, considered a standard statistical analysis method at the time.

This study used a large dataset and ensemble machine learning approaches to create the prediction models, which distinguishes it from previous works. A data-driven feature selection method was used to develop predictors that effectively identified the different classes in the dataset. Figure 8 shows the accuracy statistics of the different classifiers, with the proposed system achieving the highest values on the dataset. Furthermore, by adjusting the number of iterations used to train the models, the study also demonstrated the effect of accumulated medical data on prediction accuracy. This approach aims to improve the accuracy of diabetes prediction and early identification.

There is a trade-off between ensemble techniques, like XGBoost, and single models, like logistic regression, KNN, or SVM. While ensemble techniques often achieve better performance, single models can be more easily interpretable and have lower complexity. This is an important consideration for applications in which model interpretability is a priority, such as medical or healthcare settings where the decisions made by the model may have a significant impact on patient outcomes. Regarding unsupervised learning methods, hierarchical clustering is suitable for clustering categorical data. In its agglomerative form, it is a bottom-up approach that builds a hierarchy of clusters, with the option to stop at a particular level of granularity to obtain a desired number of clusters. This method can be particularly useful when the number of clusters is unknown or when the data is challenging to interpret. The paper “Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient” is a relevant reference in this context: it proposes a method for determining the optimal number of clusters in categorical data clustering based on the silhouette coefficient, which measures the quality of the clustering solution [28].
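To make the silhouette coefficient concrete, the sketch below computes its standard definition for categorical records: for each record, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the coefficient is (b − a) / max(a, b). The use of a simple matching (Hamming) distance is an assumption made for illustration and is not necessarily the procedure proposed in [28]; the function names and toy records are hypothetical.

```python
def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def silhouette_score(points, cluster_ids, distance=hamming):
    """Mean silhouette coefficient over all points (requires >= 2 clusters)."""
    clusters = {}
    for point, cid in zip(points, cluster_ids):
        clusters.setdefault(cid, []).append(point)

    scores = []
    for point, cid in zip(points, cluster_ids):
        own = clusters[cid]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself does not affect the sum).
        a = sum(distance(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(distance(point, other) for other in members) / len(members)
            for other_id, members in clusters.items()
            if other_id != cid
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical categorical records: (glucose level, weight class).
points = [
    ("high", "obese"),
    ("high", "obese"),
    ("normal", "lean"),
    ("normal", "lean"),
]

good = silhouette_score(points, [0, 0, 1, 1])  # identical records grouped together
poor = silhouette_score(points, [0, 1, 0, 1])  # identical records split apart
```

A candidate number of clusters can then be compared by running the clustering for each value and keeping the one with the highest mean silhouette; in the toy example the natural grouping scores 1.0 and the mixed grouping scores −0.5.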

5. Comparison with Other Techniques

The paper “Prediction of Diabetes Complications using Computational Intelligence Techniques” uses several machine learning techniques to predict the likelihood of diabetes complications. Overall, the paper’s approach of using ANNs, SVMs, and DTs for diabetes complication prediction is consistent with previous studies. The authors showed that the ANN model outperforms the other models in terms of accuracy and sensitivity, consistent with previous studies showing that ANNs are an effective technique for diabetes complication prediction. However, the choice of machine learning technique depends on several factors, such as the size and complexity of the dataset, the type of outcome variable, and the need for interpretability. Therefore, it is important to choose the appropriate technique based on the specific requirements of the study.

6. Conclusions

This study aimed to build a prediction model for diabetes complications using a classification data mining approach. The classification technique’s effectiveness in constructing the best rule-based model for the prediction goal was evaluated. The process of discovering valuable and previously unknown information from large databases is known as data mining. This paper provided a comprehensive classification of the commonly used diabetes prediction techniques based on a literature review on data mining-based diabetes diagnosis, classification, and prediction techniques. Additionally, the paper proposed a Disease Influence Measure based method to improve performance, resulting in an accuracy of 89%.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this paper are available from the author upon request.

Acknowledgments

The author thanks colleagues and research group members for their help in improving this paper.

Conflicts of Interest

The author declares no conflict of interest regarding the publication of this work.

References

Alyoubi, W.L.; Shalash, W.M.; Abulkhair, M.F. Diabetic retinopathy detection through deep learning techniques: A review. Inform. Med. Unlocked 2020, 20, 100377.
Zago, G.T.; Andreão, R.V.; Dorizzi, B.; Salles, E.O.T. Diabetic retinopathy detection using red lesion localization and convolutional neural networks. Comput. Biol. Med. 2019, 116, 103537.
Ptucha, R.; Such, F.P.; Pillai, S.; Brockler, F.; Singh, V.; Hutkowski, P. Intelligent character recognition using fully convolutional neural networks. Pattern Recognit. 2018, 88, 604–613.
Seo, Y.; Shin, K.-S. Hierarchical convolutional neural networks for fashion image classification. Expert Syst. Appl. 2018, 116, 328–339.
Li, Y.-H.; Yeh, N.-N.; Chen, S.-J.; Chung, Y.-C. Computer-Assisted Diagnosis for Diabetic Retinopathy Based on Fundus Images Using Deep Convolutional Neural Network. Mob. Inf. Syst. 2019, 2019, 6142839.
Hemanth, D.J.; Deperlioglu, O.; Kose, U. An enhanced diabetic retinopathy detection and classification approach using deep convolutional neural network. Neural Comput. Appl. 2020, 32, 707–721.
Alyas, T.; Alissa, K.; Mohammad, A.S.; Asif, S.; Faiz, T.; Ahmed, G. Innovative Fungal Disease Diagnosis System Using Convolutional Neural Network. Comput. Mater. Contin. 2022, 73, 4869–4883.
Safi, H.; Safi, S.; Hafezi-Moghadam, A.; Ahmadieh, H. Early detection of diabetic retinopathy. Surv. Ophthalmol. 2018, 63, 601–608.
Tabassum, N.; Rehman, A.; Hamid, M.; Saleem, M.; Malik, S.; Alyas, T. Intelligent Nutrition Diet Recommender System for Diabetic’s Patients. Intell. Autom. Soft Comput. 2021, 29, 319–335.
Al Sadi, K.; Balachandran, W. Prediction Model of Type 2 Diabetes Mellitus for Oman Prediabetes Patients Using Artificial Neural Network and Six Machine Learning Classifiers. Appl. Sci. 2023, 13, 2344.
Vijayan, V.V.; Anjali, C. Prediction and diagnosis of diabetes mellitus—A machine learning approach. In Proceedings of the 2015 IEEE Recent Advances in Intelligent Computational Systems (RAICS), Trivandrum, India, 10–12 December 2015; pp. 122–127.
Woldemichael, G.; Menaria, S. Prediction of Diabetes Using Data Mining Techniques. In Proceedings of the 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 11–12 May 2018; pp. 414–418.
Baiju, B.V.; Aravindhar, D.J. Disease Influence Measure Based Diabetic Prediction with Medical Data Set Using Data Mining. In Proceedings of the 2019 1st International Conference on Innovations in Information and Communication Technology (ICIICT), Chennai, India, 25–26 April 2019; pp. 1–6.
Perveen, S.; Shahbaz, M.; Guergachi, A.; Keshavjee, K. Performance Analysis of Data Mining Classification Techniques to Predict Diabetes. Procedia Comput. Sci. 2016, 82, 115–121.
Ladha, G.G.; Pippal, R.K.S. A computation analysis to predict diabetes based on data mining: A review. In Proceedings of the 2018 3rd International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 15–16 October 2018; pp. 6–10.
Mamatha Bai, B.G.; Nalini, B.M.; Majumdar, J. Analysis and detection of diabetes using data mining techniques—A big data application in health care. In Emerging Research in Computing, Information, Communication and Applications; Springer: Berlin/Heidelberg, Germany, 2019; pp. 443–455.
Khan, F.A.; Zeb, K.; Al-Rakhami, M.; Derhab, A.; Bukhari, S.A.C. Detection and Prediction of Diabetes Using Data Mining: A Comprehensive Review. IEEE Access 2021, 9, 43711–43735.
Joshi, S.; Borse, M. Detection and Prediction of Diabetes Mellitus Using Back-Propagation Neural Network. In Proceedings of the 2016 International Conference on Micro-Electronics and Telecommunication Engineering (ICMETE), Ghaziabad, India, 22–23 September 2016; pp. 110–113.
Ramasso, E.; Gouriveau, R. Prognostics in switching systems: Evidential markovian classification of real-time neuro-fuzzy predictions. In Proceedings of the 2010 Prognostics and System Health Management Conference, Macao, China, 12–14 January 2010; pp. 1–10.
Hsu, W.-Y. EEG-based motor imagery classification using neuro-fuzzy prediction and wavelet fractal features. J. Neurosci. Methods 2010, 189, 295–302.
Ghazavi, M.; Abdollahi, S.F.; Kutay, M.E. Implementation of NCHRP 9-44A Fatigue Endurance Limit Prediction Model in Mechanistic-Empirical Asphalt Pavement Analysis Web Application. Transp. Res. Rec. J. Transp. Res. Board 2022, 2676, 696–706.
Roshani, G.; Hanus, R.; Khazaei, A.; Zych, M.; Nazemi, E.; Mosorov, V. Density and velocity determination for single-phase flow based on radiotracer technique and neural networks. Flow Meas. Instrum. 2018, 61, 9–14.
Afzalimir, S.H.; Barbosa, V.S.; Ruggieri, C. Evaluation of CTOD resistance curves in clamped SE(T) specimens with weld centerline cracks. Eng. Fract. Mech. 2020, 240, 107326.
Vashani, H.; Sullivan, J.; El Asmar, M. DB 2020: Analysing and forecasting design-build market trends. J. Constr. Eng. Manag. 2016, 142, 04016008.
Manikandababu, C.S.; IndhuLekha, S.; Jeniefer, J.; Theodora, T.A. Prediction of Diabetes using Machine Learning. In Proceedings of the 2022 International Conference on Edge Computing and Applications (ICECAA), Tamilnadu, India, 13–15 October 2022; pp. 1121–1127.
Islam, S.; Qaraqe, M.K.; Belhaouari, S.B.; Abdul-Ghani, M.A. Advanced Techniques for Predicting the Future Progression of Type 2 Diabetes. IEEE Access 2020, 8, 120537–120547.
Dinh, D.-T.; Huynh, V.-N.; Sriboonchitta, S. Clustering mixed numerical and categorical data with missing values. Inf. Sci. 2021, 571, 418–442.
Dinh, D.T.; Fujinami, T.; Huynh, V.N. Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient. In Communications in Computer and Information Science; Springer: Berlin/Heidelberg, Germany, 2019; Volume 1103.

Figure 1. Diabetes types.

Figure 2. Proposed model.

Figure 3. Proposed data architecture.

Figure 4. Correlation matrix graph of the data set.

Figure 5. Representation of age-wise data.

Figure 6. Histogram and density graphs of all variables.

Figure 7. Confusion matrix.

Figure 8. Statistical analysis of seven classifiers with accuracy.

Table 1. Blood glucose level chart.

Blood Glucose Level	Minimum Value (mg/dL)	Maximum Value (mg/dL)	Value after Eating Food (mg/dL)
Normal range	70	100	<140
Diabetes at early stage	101	125	141–200
Diabetes Established	>126	-	>200

Table 2. Dataset description.

Total Attributes	Total Instances
9	768

Table 3. Attributes and their feature types.

Attributes	Feature Types
Pregnancies	Number of times pregnant
Glucose	Plasma glucose concentration at 2 h in an oral glucose tolerance test
BloodPressure	Diastolic blood pressure
SkinThickness	Triceps skin fold thickness
Insulin	2-Hour serum insulin
BMI	Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction	Diabetes pedigree function
Age	Age (years)
Outcome	Class variable (0 or 1)

Table 4. Dataset values.

Pregnancies	Glucose	Blood_Pressure	Skin_Thickness	Insulin	BMI	Diabetes_Pedigree_Function	Age	Outcome
7	149	73	34	0	34.8	0.629	50	1
2	97	67	28	0	25.6	0.431	32	0
6	181	65	0	0	24.3	0.682	33	1
1	88	64	25	94	28.2	0.167	21	0

Table 5. Performance comparison of the generated prediction models.

Classifiers	Accuracy	Precision	Recall	F1 Score
LR	0.84	0.85	0.84	0.86
KNN	0.84	0.84	0.88	0.84
CART	0.85	0.85	0.87	0.85
RF	0.88	0.88	0.88	0.88
SVM	0.85	0.86	0.86	0.85
XGB	0.89	0.89	0.90	0.89
LightGBM	0.88	0.88	0.87	0.88

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
