A Deep Learning Model for Estimation of Patients with Undiagnosed Diabetes

A screening model for undiagnosed diabetes mellitus (DM) is important for early medical care. Little research has been carried out on developing a screening model for undiagnosed DM using machine learning techniques. Thus, the primary objective of this study was to develop a screening model for patients with undiagnosed DM using a deep neural network. We conducted a cross-sectional study using data from the Korean National Health and Nutrition Examination Survey (KNHANES) 2013–2016. After excluding those with diagnosed DM, an age < 20 years, or missing data, 11,456 participants from KNHANES 2013–2015 were used as a training dataset and analyzed to develop a deep learning model (DLM) for undiagnosed DM. The DLM was evaluated with 4444 participants who were surveyed in the 2016 KNHANES. The DLM was constructed using seven non-invasive variables (NIV): age, waist circumference, body mass index, gender, smoking status, hypertension, and family history of diabetes. The model showed an appropriate performance (area under curve (AUC): 80.11) compared with previous screening models. The DLM developed in this study for patients with undiagnosed diabetes could contribute to early medical care.


Introduction
Globally, an estimated 422 million adults are suffering from diabetes mellitus (DM), according to the World Health Organization Global Report on Diabetes. This number is significantly higher than that of 1980 (108 million) [1]. However, an estimated 30-80 percent of diabetes cases are undiagnosed [2]. Diabetes without clinical care is significantly linked to serious complications, which can add a considerable burden to the public health system. The prevalence of diabetes is expected to increase rapidly in the future due to the prevalence of obesity, aging of the population, and other cardiovascular risk factors [3].
Complications of diabetes mellitus, such as cardiovascular disease and kidney damage, should be prevented at an early stage [4]. However, diabetes is usually asymptomatic [5,6]. People with undiagnosed diabetes are more likely to be diagnosed with complications than those who are aware of their diabetes status. Although fasting plasma glucose (FPG), the oral glucose tolerance test (OGTT), and hemoglobin A1c (HbA1c) are well-established determinants in diabetes diagnosis [7], these invasive tests are impractical for screening a large population [8].
Risk screening systems for patients with undiagnosed diabetes have been developed [3][8][9][10][11][12]. Lee et al. developed a self-assessment score for diabetes risk in Korean adults [3]. Zhou et al. proposed a diabetes screening model for middle-aged rural Chinese [8]. Aekplakorn et al. developed a prediction risk score for people at high risk of diabetes in Thailand [9]. Nanri et al. developed a model to predict the three-year incidence of type 2 diabetes in a Japanese population [10]. Gao et al. constructed a diabetes risk score for screening undiagnosed diabetes and validated it using Chinese adults [11]. Baan et al. developed a predictive model to identify individuals who had an increased risk of undiagnosed diabetes [12]. These systems are used to prevent diabetes through changes in lifestyle and intervention with pharmaceutical treatments [13]. However, research using machine learning technology to develop screening tools for undiagnosed diabetes has been insufficient.
Previous studies have used machine learning techniques to introduce predictive models for diseases such as diabetic retinopathy, skin cancer, lung disease, heart failure, and chronic kidney disease [14][15][16][17][18][19][20]. These studies use deep learning techniques, which have made major advances in solving problems that had resisted the best attempts of the artificial intelligence community [21]. Although previous studies have developed predictive models based on machine learning algorithms, it is unclear whether these models can be properly used for estimating undiagnosed diabetes [22][23][24]. Consequently, the objective of the present study was to develop a deep learning model (DLM) for patients with undiagnosed diabetes. The remainder of this paper is organized as follows. Section 2 describes the proposed framework, study design, and methods. Section 3 shows the results of the experiment, and Section 4 presents the conclusion and discussion.

Materials and Methods
The construction of a DLM for undiagnosed diabetes consists of four steps, as shown in Figure 1. In the first step, Korean National Health and Nutrition Examination Survey (KNHANES) datasets collected from 2013 to 2016 were combined and the consistency of variables was explored. A variable was included in the present study only if its measurement scale did not change during the study period. In the second step, the combined dataset was pre-processed to obtain a reliable experimental dataset. In the third step, basic characteristics were analyzed for each group: a normal glucose group (NG), an impaired fasting glucose group (IFG), and an undiagnosed diabetes group (UDG). Significant non-invasive variables (NIV) were selected based on bivariate analysis using logistic regression (LR). These NIVs were used to optimize machine learning models. Finally, the model with the best performance was selected and compared with other screening models published in previous studies on undiagnosed diabetes.

Study Design
Data from the KNHANES 2013-2016 dataset collected by the Korea Centers for Disease Control and Prevention (KCDC) were used to perform analysis and construct a DLM for predicting undiagnosed diabetes. The KCDC assesses trends in health risk factors and nutrition status. It also conducts surveillance of infectious and chronic diseases. Records collected from the surveillance system are analyzed for the development and evaluation of health policies [25].
There were 31,098 subjects in the KNHANES 2013-2016 dataset, which is a non-duplicate sample. Subjects aged ≤ 19 years (n = 7003), those with null or unknown responses (n = 6864), and those with any record of diabetes diagnosis, abnormal insulin level, or antidiabetic treatment prescription (n = 1331) were excluded, leaving 15,900 subjects. The study population was then divided into a development group (n = 11,456; 2013-2015) and a validation group (n = 4444; 2016) according to survey year, as shown in Figure 2. The diagnosis of diabetes was based on the diagnostic criteria in the World Health Organization (WHO) Classification of Diabetes Mellitus 2019 [2]. Undiagnosed diabetes was defined as a fasting plasma glucose (FPG) ≥ 126 mg/dL in subjects who, according to the health interview survey, had no previous diagnosis of diabetes made by a healthcare professional and were not taking insulin or oral antidiabetic agents [3]. Impaired fasting glucose was defined as an FPG of 100-125 mg/dL under the same conditions. Subjects were classified into two categories in the model architecture. The primary dependent group was UDG, and NG and IFG were combined into a comparison group.
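The cohort arithmetic and the grouping rule above can be checked with a short sketch. The counts and FPG cut-offs are taken from the text; the function and variable names are hypothetical, and the sketch assumes each subject has already passed the exclusion criteria (no prior diagnosis and no antidiabetic treatment).

```python
# Counts reported in the text for KNHANES 2013-2016.
TOTAL_SUBJECTS = 31_098
EXCLUDED = {
    "age_19_or_younger": 7_003,
    "null_or_unknown_response": 6_864,
    "diagnosed_or_treated_diabetes": 1_331,
}
DEVELOPMENT_N = 11_456  # surveyed 2013-2015
VALIDATION_N = 4_444    # surveyed 2016

# Subjects left after applying the three exclusion criteria.
remaining = TOTAL_SUBJECTS - sum(EXCLUDED.values())


def glucose_group(fpg_mg_dl: float) -> str:
    """Assign an eligible subject to NG, IFG, or UDG by fasting plasma glucose.

    Assumption: the subject has no previous diabetes diagnosis and takes no
    insulin or oral antidiabetic agents (already filtered out upstream).
    """
    if fpg_mg_dl >= 126:
        return "UDG"  # undiagnosed diabetes group
    if fpg_mg_dl >= 100:
        return "IFG"  # impaired fasting glucose group
    return "NG"       # normal glucose group


def binary_label(group: str) -> int:
    """Model target: UDG is the positive class, NG and IFG are combined."""
    return 1 if group == "UDG" else 0


if __name__ == "__main__":
    print(remaining, DEVELOPMENT_N + VALIDATION_N)
    for fpg in (92, 100, 125, 126):
        group = glucose_group(fpg)
        print(fpg, group, binary_label(group))
```

The subjects remaining after exclusion equal the sum of the development and validation groups, which confirms that the reported counts are internally consistent.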

Analysis Methods
Descriptive analysis was conducted for the study groups (NG, IFG, and UDG) to compare their basic characteristics, as shown in Table 1 [26]. A logistic regression model was used to analyze potential associations between variables and to select candidate attributes for the generation of a deep neural network model (Table 2). All reported p-values are two-sided, and significance was set at a p-value of < 0.05. Logistic regression (LR) can be used to discover a linear relationship between independent variables X and a binary dependent variable Y [27,28]. LR transforms log-odds to probability using the logistic function. Maximum likelihood is used to estimate the regression coefficients. At each data point we have a predictor x and a binary dependent variable y, and the probability of the observed outcome is either p(x) (if y = 1) or 1 − p(x) (if y = 0). In model generation, we used L2 regularization based on the scikit-learn package.
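The logistic regression description above corresponds to the standard formulation written out below. The symbols \beta_0, \beta, n, and \lambda are introduced here for illustration and do not appear in the original text; scikit-learn expresses the penalty strength through an inverse parameter C rather than \lambda, so the exact scaling of the penalty term is an assumption.

```latex
% Logistic function: probability of the positive class (y = 1) given predictors x
p(x) = \frac{1}{1 + \exp\left(-(\beta_0 + \beta^{\top} x)\right)}

% Log-likelihood over n data points (x_i, y_i), with y_i \in \{0, 1\}:
% each point contributes \log p(x_i) if y_i = 1 and \log(1 - p(x_i)) if y_i = 0
\ell(\beta_0, \beta) = \sum_{i=1}^{n} \left[ y_i \log p(x_i) + (1 - y_i) \log\left(1 - p(x_i)\right) \right]

% L2-regularized maximum likelihood estimate, with assumed penalty weight \lambda > 0
(\hat{\beta}_0, \hat{\beta}) = \arg\max_{\beta_0, \beta} \left[ \ell(\beta_0, \beta) - \lambda \lVert \beta \rVert_2^2 \right]
```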
In this work, we focus on developing deep neural networks [31] because of their efficiency in deep representation learning. Each layer contains a given number of nodes with an activation function, and nodes in neighboring layers are linked by weights. We used a grid search algorithm to find optimal hyperparameters, including the number of layers, number of hidden nodes, learning rate, batch size, and epoch number. We applied the Adam optimization algorithm [35], which is considered to give some of the best results and to be faster than other optimizers [28,36]. To avoid overfitting, we used a dropout regularization technique that can avoid learning spurious features at hidden nodes. It has been shown that this method can provide a significant improvement in the generalization performance of the artificial neural network model, and that it is computationally cheap [31].
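A grid search of the kind described above can be sketched as an exhaustive loop over all hyperparameter combinations. The candidate values below are hypothetical, since the text does not list the grid that was searched, and `placeholder_score` is only a stand-in for training the network and measuring its validation performance.

```python
from itertools import product

# Hypothetical candidate values; the searched grid is not given in the text.
GRID = {
    "hidden_layers": [1, 2, 3],
    "hidden_nodes": [50, 100],
    "learning_rate": [0.001, 0.01],
    "batch_size": [32, 64],
    "epochs": [50, 100],
}


def iter_configs(grid):
    """Yield every combination of the grid as a dict of hyperparameters."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def grid_search(grid, score_fn):
    """Return the configuration with the highest score and that score.

    score_fn is expected to train a model with the given configuration and
    return a validation metric where larger is better (for example, AUC).
    Ties are resolved in favor of the first configuration encountered.
    """
    best_config, best_score = None, float("-inf")
    for config in iter_configs(grid):
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def placeholder_score(config):
    """Toy stand-in for 'train and validate'; NOT a real model evaluation."""
    layer_gap = abs(config["hidden_layers"] - 2)
    node_gap = abs(config["hidden_nodes"] - 100) / 100
    return -(layer_gap + node_gap)


if __name__ == "__main__":
    best, score = grid_search(GRID, placeholder_score)
    print(best, score)
```

In practice `score_fn` would fit the network on the development data and score it on held-out data; the cost grows with the product of the candidate list lengths, which is why grids are usually kept small.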
The area under the curve (AUC) is suitable for performance evaluation on imbalanced clinical data. In conjunction with the Neyman-Pearson method, AUC has long been used in signal detection theory [37]. AUC was used to verify the DLM performance in the present study.
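As an illustration of the metric, the AUC of a receiver operating characteristic curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The minimal sketch below computes it directly from that pairwise definition; the function name and the example labels and scores are made up for illustration, and a library routine would normally be used on real data.

```python
def roc_auc(labels, scores):
    """Compute ROC AUC from the pairwise (Mann-Whitney) definition.

    labels: iterable of 0/1 class labels (1 = positive, e.g. UDG)
    scores: iterable of model scores, higher meaning more likely positive
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0   # positive case ranked above negative case
            elif pos == neg:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Made-up example: 3 of the 4 positive/negative pairs are ordered correctly.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because the value depends only on how positives are ranked relative to negatives, it is not affected by the ratio of cases to non-cases, which is the property that makes it useful when the positive class is rare.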

Comparison of Basic Characteristics among NG, IFG, and UDG
The basic characteristics of each group are shown in Table 1. In the development dataset, the IFG and UDG contained more older and male participants than the NG. Also, the IFG and UDG participants had higher systolic blood pressure (SBP), diastolic blood pressure (DBP), weight, body mass index (BMI), waist circumference (WC), fasting plasma glucose (FPG), total cholesterol (TC), and triglycerides (TGs). Family history of diabetes (FHD), smoking, and hypertension were more frequent in the UDG and IFG than in the NG. High-density lipoprotein (HDL) levels were lower in the IFG and UDG than in the NG. Of these variables, seven NIVs (age, WC, BMI, gender, smoking status, hypertension, and family history of diabetes) were selected as candidate variables for learning.
Results of the bivariate analysis for evaluating the deep learning method are summarized in Table 2.

Discussion
In the present study, we developed various screening models for undiagnosed diabetes and compared our models with each other as well as with models from other studies. The DLM had a higher AUC than any other model. Previous studies have predicted undiagnosed diabetes patients using various screening models, resulting in adequate goodness of fit and AUC [3][8][9][10][11][12]. Zhou et al. established a simple and effective risk score for type 2 diabetes mellitus in middle-aged rural Chinese [8]. Aekplakorn et al. developed a risk score model for predicting diabetes in the Thai population that does not require laboratory tests [9]. Nanri et al. generated a simple risk model based on a non-invasive and an invasive model for type 2 diabetes [10]. Gao et al. constructed a diabetes risk score for screening undiagnosed diabetes in Chinese adults and compared it with fasting capillary blood glucose (FCG) and glycated hemoglobin A1c (HbA1c) [11]. Baan et al. developed a predictive model to identify individuals with an increased risk of undiagnosed diabetes based on the Rotterdam Study [12]. Lee et al. proposed a self-assessment score for diabetes risk based on the Korea National Health and Nutrition Examination Survey (KNHANES) 2001-2005 that showed good discrimination in comparison with non-Asian models [3].
Our data included fewer subjects than the data used in previous studies, and were collected only from Korean citizens. Nevertheless, our study focused on developing a deep neural network model to improve screening performance. Several previous studies have developed predictive models using machine learning algorithms. Mercaldo et al. analyzed the performance of machine learning algorithms that can classify diabetes patients using the Pima Indians diabetes data, a standard dataset from the University of California, Irvine (UCI) Machine Learning Repository [22]. A deep artificial neural network architecture was proposed that uses 19 clinical features to automatically determine the health status of patients [23]. Soltani et al. developed a diagnostic model for type 2 diabetes based on an artificial neural network using the Pima Indians dataset [24]. The primary goal of such studies was to predict the diabetes status of subjects whose status had not been identified. Furthermore, invasive variables are unsuitable for universal application, and can cause a financial burden to some people or countries. Thus, the development of disease prediction or screening models should be user-centric. Pei et al. developed a diabetes prediction model using non-invasive variables based on machine learning algorithms (decision tree, AdaBoost, support vector machine, Bayesian network, naïve Bayesian) that showed an appropriate performance [38].
The DLM included seven NIVs (age, male gender, hypertension, family history of diabetes, smoking status, BMI, and waist circumference) that would be convenient for a layperson to use as a self-assessment of diabetes risk in the real world. These variables are highly correlated with undiagnosed diabetes in other studies [38][39][40][41][42][43][44][45]. Although blood analyses (including FPG and the oral glucose tolerance test) are required to diagnose at-risk individuals based on the guidelines, our model can provide users with an estimate of their diabetes status without a medical diagnosis. In addition, our model allows people to carry out self-screening for diabetes. It can also support services that recommend scheduling appointments with health care practitioners to people at high risk.
In the DLM building step, we used continuous measurements for variables such as age, BMI, and waist circumference and did not convert them into categorical values. Previous studies have applied a discretization approach that presents easily understandable information for screening diabetes risk. However, discretizing data into intervals may introduce bias in the learning step. Our model relies on hyper-parameter optimization to prevent overfitting and improve model performance. Therefore, we compared and analyzed hyper-parameters for the DLM using the validation dataset. Comparing results across epoch numbers with different activation functions, tanh gave the best performance (Figure 3). The minimal validation loss for the DLM was 0.13 (Figure 4). The best DLM used the following hyper-parameters: 50 epochs, a batch size of 32, two hidden layers with 100 neurons each, a tanh activation function, and a dropout rate of 0.1.
Table 3. Aekplakorn et al. [9]; Age, sex, BMI, waist circumference, hypertension, history of diabetes in parent or sibling; 75.33
Our model was evaluated on the validation dataset (KNHANES 2016), which was assembled at a different time than the development dataset (KNHANES 2013–2015), to assess its reliability. First, we compared the DLM with other machine learning algorithms, among which the DLM performed best, as shown in Figure 5. Second, the DLM was compared with previous screening models. As a result, our DLM showed good performance in screening undiagnosed diabetes, as shown in Table 3. Through this experiment, we demonstrated the effectiveness and usefulness of the DLM for patients with undiagnosed diabetes.
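To make the reported best configuration concrete (seven inputs, two hidden layers of 100 tanh units, dropout of 0.1), the following sketch runs a forward pass through a network of that shape using only the Python standard library. Several details are assumptions not stated in the text: the sigmoid output unit, the Glorot-style random initialization, the placement of dropout after each hidden layer, and the standardized example input. The weights are random and untrained, so the output is not a meaningful risk estimate.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Create weights (n_out x n_in) and zero biases; Glorot-uniform is assumed."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, weights, biases):
    """Affine transform: one weighted sum plus bias per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def dropout(values, rate, rng, training):
    """Inverted dropout: zero units at random and rescale the rest in training."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def forward(x, layers, rng, training=False, rate=0.1):
    """Hidden layers use tanh then dropout; the output is an assumed sigmoid unit."""
    hidden = x
    for weights, biases in layers[:-1]:
        hidden = [math.tanh(z) for z in dense(hidden, weights, biases)]
        hidden = dropout(hidden, rate, rng, training)
    out_weights, out_biases = layers[-1]
    z = dense(hidden, out_weights, out_biases)[0]
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
# 7 non-invasive inputs -> 100 -> 100 -> 1 output
layers = [init_layer(7, 100, rng), init_layer(100, 100, rng), init_layer(100, 1, rng)]

if __name__ == "__main__":
    # Hypothetical standardized inputs: age, WC, BMI, gender, smoking,
    # hypertension, family history of diabetes.
    example = [0.5, 1.2, 0.8, 1.0, 0.0, 1.0, 1.0]
    print(forward(example, layers, rng))
```

Dropout is active only when `training=True`; at inference the layer passes values through unchanged, so repeated predictions for the same input are identical.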

Conclusions
Undiagnosed diabetes is continuously increasing due to a lack of specific symptoms and limited financial resources in the public health care system. To overcome this problem, previous studies have proposed screening models for undiagnosed patients. However, to date, there has been insufficient research on identifying undiagnosed diabetes using deep neural networks. Therefore, our study proposed a deep learning model for patients with undiagnosed diabetes that could contribute to self-assessment. Our model could help decrease the financial burden on the national health care system, and future work should validate it in different populations.