1. Introduction
Diabetes mellitus is a chronic disease that occurs when the pancreas does not produce enough insulin, or when the body cannot effectively use the insulin it produces. Insulin is a hormone that moves glucose from the bloodstream into the body's cells, where it is used as energy. If it is not taken up by the cells, the excess sugar in the blood can lead to serious health problems [1]. Diabetes and its complications have a significant economic impact on individuals and their families, health systems and national economies. In 2019, global expenditure on diabetes treatment was estimated at 760 billion U.S. dollars and is expected to rise to 845 billion U.S. dollars by 2045 [2]. If it is not treated, diabetes can have life-threatening consequences for the cardiovascular, renal, and nervous systems. In 2019, an estimated 463 million people had diabetes worldwide, and the prevalence is predicted to reach 578 million by 2030 and 700 million by 2045 [3].
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the virus responsible for the coronavirus disease 2019 (COVID-19). On 30 January 2020, the World Health Organization declared the outbreak a Public Health Emergency of International Concern and, on 11 March 2020, a pandemic. The virus was first reported in Wuhan, China, in December 2019 and quickly spread worldwide. As of 3 January 2021, 84,985,054 confirmed cases and 1,841,077 deaths had been reported by the Center for Systems Science and Engineering at Johns Hopkins University. In general, people with diabetes are more likely to have severe symptoms and complications when infected with any virus. When diabetes is combined with other chronic conditions such as heart disease, the risk of severe complications from COVID-19 increases further [4]. Recent studies have shown that “Elevated glucose levels increase SARS-CoV-2 replication, glycolysis sustains SARS-CoV-2 replication via the production of mitochondrial reactive oxygen species and activation of hypoxia-inducible factor 1”. Patients with diabetes mellitus typically fall into higher categories of SARS-CoV-2 infection severity than those without, and poor glycaemic control predicts an increased need for medication and hospitalization, as well as increased mortality [5].
Data Mining methods search for patterns and trends in large-scale data, using advanced mathematical algorithms to partition the data and estimate the probability of future events. Data Mining draws on several disciplines, such as statistics, probability, machine learning and artificial intelligence [6]. Digital databases, combined with the ability of fast computers to apply computationally intensive statistical methods to these data, have increased the number of applications of Data Mining in several domains. The healthcare industry is constantly generating and storing new data [7]. The effective use of these data can assist professionals in providing a fast and accurate diagnosis [8]. One in two people living with diabetes is unaware of it [3]. Their condition is therefore not monitored or kept under control. Consequently, they are more likely to experience additional complications if infected by the coronavirus.
This study aims to analyse how different classification algorithms behave when applied to a training data set. The application of Data Mining can save lives in the future by providing a tool for the early diagnosis of diabetes. The experiment was conducted in Orange Data Mining, a machine learning and data mining suite that supports data analysis through Python scripting. The main contribution is to present the results of six different machine learning methods for the early diagnosis of diabetes. Moreover, the results recommend the use of Neural Networks for this task. Finally, all the model configuration details are described, and the results are compared with similar research in this field. The authors use a public data set to test the proposed methods. The main reason for using a public data set is to overcome the current challenges of data collection under the General Data Protection Regulation (GDPR) applied in Europe.
3. Results
In this section, we present the experimental results. We conducted six experiments, each using a different machine learning method: Naive Bayes, Neural Network, AdaBoost, kNN, Random Forest, and SVM. The confusion matrix of each experiment is presented so that readers can calculate other performance metrics, if necessary.
In the first experiment, we applied the Naive Bayes classifier, which correctly predicted 452 instances out of 520, a success rate of 86.92%.
Table 3 presents 178 true negatives against 46 false negatives and 274 true positives against 22 false positives.
In the second experiment, we applied the Neural Network classifier, which correctly predicted 510 instances out of 520, a success rate of 98.08%.
Table 4 presents 195 true negatives against 5 false negatives and 315 true positives against 5 false positives.
In the third experiment, we applied the AdaBoost classifier, which correctly predicted 506 instances out of 520, a success rate of 97.31%.
Table 5 presents 194 true negatives against 8 false negatives and 312 true positives against 6 false positives.
In the fourth experiment, we applied the kNN classifier, which correctly predicted 506 instances out of 520, a success rate of 97.31%.
Table 6 presents 199 true negatives against 13 false negatives and 307 true positives against 1 false positive.
In the fifth experiment, we applied the Random Forest classifier, which correctly predicted 504 instances out of 520, a success rate of 96.92%.
Table 7 presents 194 true negatives against 10 false negatives and 310 true positives against 6 false positives.
In the last experiment, we applied the SVM classifier, which correctly predicted 505 instances out of 520, a success rate of 97.12%.
Table 8 presents 192 true negatives against 7 false negatives and 313 true positives against 8 false positives.
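The success rates quoted above follow directly from the confusion-matrix counts. The short Python sketch below recomputes them and adds specificity as one example of an "other" metric; the counts are copied from the text, while the dictionary layout and function names are illustrative assumptions and not part of the original Orange workflow.

```python
# Confusion-matrix counts as quoted in the text for Tables 3-8,
# stored as (true negatives, false negatives, true positives, false positives).
CONFUSION = {
    "Naive Bayes":    (178, 46, 274, 22),
    "Neural Network": (195, 5, 315, 5),
    "AdaBoost":       (194, 8, 312, 6),
    "kNN":            (199, 13, 307, 1),
    "Random Forest":  (194, 10, 310, 6),
    "SVM":            (192, 7, 313, 8),
}


def accuracy(tn, fn, tp, fp):
    """Share of all instances that were classified correctly."""
    return (tn + tp) / (tn + fn + tp + fp)


def specificity(tn, fn, tp, fp):
    """Share of truly negative instances that were classified as negative."""
    return tn / (tn + fp)


for name, counts in CONFUSION.items():
    print(f"{name:<15} n={sum(counts)} "
          f"accuracy={accuracy(*counts):.2%} "
          f"specificity={specificity(*counts):.2%}")
```

Every row sums to the 520 instances of the data set, and the recomputed accuracies agree with the percentages reported for each experiment.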
According to the results in Table 9 and Table 10, the Neural Network presents the best classification accuracy, 98.1%, while Naive Bayes scored the lowest, 86.9%. Moreover, the F1-Score of the proposed Neural Network is 98.4%.
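As a worked check, the F1-Score can be recovered from the Neural Network confusion matrix (315 true positives, 5 false positives, 5 false negatives). The derivation below assumes that the reported score refers to the positive class; under that assumption it reproduces the 98.4% figure.

```latex
\begin{align*}
P   &= \frac{TP}{TP + FP} = \frac{315}{315 + 5} \approx 0.984, \\
R   &= \frac{TP}{TP + FN} = \frac{315}{315 + 5} \approx 0.984, \\
F_1 &= \frac{2PR}{P + R} \approx 0.984 = 98.4\%.
\end{align*}
```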
4. Discussion
When dealing with diseases such as diabetes, it is essential to provide an early and accurate diagnosis. A delayed diagnosis of diabetes can lead to severe health consequences. Therefore, data mining techniques used to predict someone’s condition must be as accurate as possible. The cost of a false negative is far higher than that of a false positive: a subject who is wrongly told they do not have diabetes might take a relaxed posture without knowing their real condition, and this may lead to severe health concerns.
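Because false negatives are the costlier error, it can be informative to rank the six classifiers by false-negative rate rather than by accuracy. The following Python sketch does so from the counts reported in the Results section; the ranking criterion and all names are illustrative assumptions, not an analysis performed in the original study.

```python
# (false negatives, true positives) per classifier, copied from the
# confusion-matrix counts reported in the Results section.
POSITIVE_CLASS_COUNTS = {
    "Naive Bayes":    (46, 274),
    "Neural Network": (5, 315),
    "AdaBoost":       (8, 312),
    "kNN":            (13, 307),
    "Random Forest":  (10, 310),
    "SVM":            (7, 313),
}


def miss_rate(fn, tp):
    """False-negative rate: share of positive subjects classified as negative."""
    return fn / (fn + tp)


# Classifiers ordered from fewest to most missed positive cases.
ranking = sorted(
    POSITIVE_CLASS_COUNTS,
    key=lambda name: miss_rate(*POSITIVE_CLASS_COUNTS[name]),
)

for name in ranking:
    fn, tp = POSITIVE_CLASS_COUNTS[name]
    print(f"{name:<15} missed {fn:>2} of {fn + tp} positives "
          f"({miss_rate(fn, tp):.2%})")
```

With these counts, AdaBoost and kNN tie on accuracy, yet kNN misses 13 positive cases against 8 for AdaBoost, a difference that accuracy alone hides.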
Several similar works are available in the literature. In this section, the results of this study are compared with the state of the art. In 2019, M. M. Faniqul Islam et al. analysed a diabetes data set, implementing Naive Bayes, Logistic Regression, and Random Forest algorithms with 10-fold Cross-Validation. Their highest accuracy, 97.40%, was obtained with the Random Forest classifier [9]. In 2020, K. Alpan and G. S. İlgi presented a study comparing data mining classification techniques for diabetes using the WEKA tool. Seven algorithms were used: Bayes Network, Naive Bayes, Decision Tree (J48), Random Tree, Random Forest, kNN and SVM. The authors registered 98.07% accuracy with 10-fold Cross-Validation, kNN being the classifier with the best performance [10]. H. Naz and S. Ahuja applied data mining techniques to the PIMA diabetes data set. The authors compared Deep Learning, ANN, Naive Bayes and Decision Tree. According to their results, Deep Learning provided the best performance, with 98.07% accuracy [11]. Shuffled sampling with an 80/20 split into training and validation sets was used. In 2020, N. Pradhan, G. Rani, V. S. Dhaka, and R. C. Poonia analysed a diabetes data set and proposed a model based on ANNs. Their study reported a highest accuracy of 88% [12]. In 2018, M. Peker, O. Özkaraca, and A. Sasar conducted a study using the Orange tool to analyse a diabetes data set. The authors implemented and compared Random Forest, Feed-Forward Artificial Neural Networks, kNN, SVM and Decision Tree. The results revealed ANNs to be the algorithm with the highest accuracy, correctly predicting 93.85% of cases [13].
Table 11 presents the comparative results between studies.
The proposed Neural Network classifier presented better results than any of the studies discussed above. K. Alpan and G. S. İlgi reported their highest accuracy, 98.07%, using the kNN classifier [10], whereas our study achieved 98.08% with the Neural Network; S. Malik et al. report 98.62% [14]. Our results show an AUC of 98.3%. The ROC curves for the negative and positive classes are presented in Scalable Vector Graphics (SVG) format as Supplementary Files 2 and 3, respectively. The kNN classifier also presents outstanding results. It achieved 97.31% accuracy; however, K. Alpan and G. S. İlgi [10] and S. Malik et al., who report 98.62% [14], presented better results. With 97.12% accuracy, the SVM algorithm is the third most accurate classifier. For the same algorithm, K. Alpan and G. S. İlgi reported 92.11% [10], and M. Peker, O. Özkaraca, and A. Sasar reported 78.51% [13]. Random Forest also presented reliable results when compared with the other studies. It achieved 96.92% accuracy, a better result than the 91.34% reported by M. Peker, O. Özkaraca, and A. Sasar [13]. However, the authors of [9,10,14] reported 97.40%, 97.50% and 98.8%, respectively. The Naive Bayes classifier presented the lowest accuracy, 86.92%. However, it performed better than in the H. Naz and S. Ahuja study, which reported 76.33% accuracy [11]. The studies by K. Alpan and G. S. İlgi [10] and by M. M. F. Islam, R. Ferdousi, S. Rahman, and H. Y. Bushra [9] outperform our result, reporting 87.11% and 87.4%, respectively. Finally, the proposed AdaBoost method provides 97.3% accuracy, which is higher than the result reported by S. Malik et al. [14] (77.16%).
Table 11 includes studies that have used different data sets, which limits the comparability of the results. We use a recent data set, and the number of studies that use the same data source is limited. Therefore, we have compared our results with works that use a different data source, following the approach suggested by [14].
Using 10-fold Cross-Validation, we ensure that all the data has been used for both training and testing, which helps to guard against over-fitting and under-fitting. It is critical to note that some of the cited studies do not clearly explain the parameters used. Consequently, we could not reproduce their results. Nevertheless, all studies have their own limitations. After a systematic evaluation, we can confirm that the Neural Network is the classification technique that performed best. However, while it may fit this data set best, this might not be the case for another. The data set is also limited: it does not consider family history of diabetes, consumption of certain prescription drugs, smoking, or sleep deprivation. Although the volume of data collected grows each day, European regulation on general data protection poses a challenge when obtaining data within EU countries. Hence, we could only use a publicly available data set. We believe that making data available for such experiments can revolutionize modern medicine.
The results presented will support future research activities in this domain. The use of machine learning for disease diagnosis is even more essential at the present time. SARS-CoV-2 represents a threat to people living with diabetes. The use of these methods may protect lives in a pandemic situation such as COVID-19. A quick determination of someone’s condition can help prevent further complications and even loss of life. Furthermore, healthcare facilities are under high pressure due to COVID-19 patients, and access to the diagnosis and treatment of other diseases is compromised. Therefore, this study suggests that machine learning methods are effective for the early diagnosis of diabetes. However, machine learning or other computer-aided systems will never replace human care, since personal contact and medical experience play a crucial role in clinical decision making.
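The 10-fold Cross-Validation scheme referred to in this discussion can be sketched in plain Python as follows. This is a minimal illustration only: the experiments were run in Orange, whose exact sampling settings are not given here, so the shuffling, the seed, the absence of stratification and the helper name `k_fold_indices` are all assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Hypothetical helper: indices are shuffled once with a fixed seed and
    dealt into k folds whose sizes differ by at most one.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Training set: every index that is not in the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 520 instances, as in the data set used in this study.
splits = list(k_fold_indices(520, k=10))
tested = sorted(idx for _, test in splits for idx in test)

for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(train)} training, {len(test)} test instances")
```

Each instance is held out for testing exactly once and used for training in the remaining nine folds, which is the property the discussion relies on.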