Deep Neural Network for Predicting Diabetic Retinopathy from Risk Factors

Extracting information from individual risk factors provides an effective way to identify diabetes risk and associated complications, such as retinopathy, at an early stage. Deep learning and machine learning algorithms are increasingly used to extract this information and improve early-stage diagnosis. This study proposes a deep neural network (DNN) combined with recursive feature elimination (RFE) for early prediction of diabetic retinopathy (DR) from individual risk factors. The proposed model uses RFE to remove irrelevant features and a DNN to classify the disease. A publicly available dataset was used to predict early-stage DR with the proposed model and several current best-practice models. The proposed model achieved 82.033% prediction accuracy, significantly outperforming the comparison models. Thus, important risk factors for retinopathy can be successfully extracted using RFE. In addition, to evaluate the robustness and generalization of the proposed prediction model, we compared it with other machine learning models on additional datasets (nephropathy and hypertension-diabetes). The proposed prediction model will help improve early-stage retinopathy diagnosis based on individual risk factors.


Introduction
Diabetes is a chronic disease associated with abnormal blood glucose (BG) levels. Patients with type 1 diabetes (T1D) cannot control their BG naturally due to a lack of insulin secretion, while in type 2 diabetes (T2D), the body cannot effectively utilize the insulin it produces [1,2]. Thus, T1D patients must administer insulin via injection or an insulin pump to achieve near-normal glucose metabolism [3]. For T2D patients, a healthy diet, physical exercise, and drug administration are suggested to control BG levels and prevent complications. Diabetes patients commonly develop acute complications, such as hypoglycemia (BG < 70 mg/dL) and hyperglycemia (BG > 180 mg/dL), if they fail to carefully self-manage BG levels [4]. Excessive BG (hyperglycemia) can result in long-term complications, e.g., retinopathy [5,6], nephropathy or kidney disease [7,8], and cardiovascular disease [9,10]. Diabetic retinopathy (DR) is the most common ocular complication of diabetes and a leading cause of visual impairment. A previous study predicted DR based on several risk factors for T2D patients, using a dataset gathered from a private hospital in Northern Taiwan incorporating 430 normal and 106 DR patients; the SVM-based model was superior to the other algorithms considered, achieving accuracy = 79.5% and AUC = 0.839.

Deep Neural Network
Deep neural network models have been used in many previous studies to improve prediction accuracy compared with other models. For diabetes prediction, DNNs have so far only been applied to predict T2D. Kim et al. [22] used phenotype and genotype data from the Nurses' Health Study (NHS) and the Health Professionals Follow-up Study (HPFS) to evaluate prediction model performance. Their DNN outperformed LR, with AUC = 0.931 and 0.928 for male and female patients, respectively. Stacked autoencoders in a DNN were also applied for T2D classification [23]; applied to the Pima Indian dataset, the model achieved classification accuracy = 86.26%.
Diabetic retinopathy is a complication of diabetes that damages the blood vessels in the retina, leading to vision impairment. Therefore, an accurate prediction model for pre-diagnosing retinopathy would be very useful to improve patient health outcomes. DNNs have shown good performance diagnosing retinopathy from retinal fundus images, including datasets from Otago and Messidor [24] and three clinical departments of Sichuan Provincial People's Hospital [25]. Parmar et al. [26] employed a convolutional neural network to detect DR from retinal images, and their model outperformed the others considered. Furthermore, ResNet architectures have been used to detect DR from fundus images with excellent classification accuracy [27,28]. Finally, Gadekallu et al. [29] employed a DNN with grey wolf optimization (GWO) and principal component analysis (PCA) to optimize parameters and reduce dimensionality, respectively, predicting DR from features extracted from retinal imaging. However, these studies only diagnosed retinopathy from retinal fundus images, and to our best knowledge, no previous study has considered DNNs for retinopathy prediction based on risk factors.

Recursive Feature Elimination
Many previous studies employed feature selection to improve model prediction accuracy. Guyon et al. [30] introduced RFE to select the most significant gene(s) for cancer classification and, hence, improve classification model accuracy. The algorithm calculates a rank score and eliminates the lowest-ranking features. Previous studies showed significant performance improvements by employing RFE, including predicting mental states (brain activity) [31,32], Parkinson's disease [33], skin disease [34], autism [35], Alzheimer's disease [36], and T2D [37]. These studies showed that SVM-RFE achieved superior performance over several comparison methods. In addition, previous studies demonstrated DNN accuracy improvements from integrating RFE as the feature selection algorithm [38,39]; integrating SVM-RFE into DNN pipelines achieved the best prediction accuracy compared with other methods.
To our best knowledge, only Kumar et al. [37] have considered RFE for diabetes prediction, using SVM-RFE to identify the most discriminatory gene targets for T2D. These significant genes could then be investigated as potential drug targets. However, although SVM-RFE was employed to extract significant features for T2D, it was not applied to a DR dataset. Similarly, previous DR studies used DNNs to classify disease from retinal fundus images only, and risk factors were not utilized as DNN input features. Therefore, the current study proposes SVM-RFE with a DNN to improve DR prediction accuracy from individual risk factors. To our best knowledge, this is the first time SVM-RFE and a DNN using individual risk factors have been employed to improve DR prediction accuracy.

Datasets
Previous studies established that T1D or T2D patients tend to develop complications, such as retinopathy, nephropathy, cardiovascular disease (CVD), etc. Therefore, we proposed a DNN-based model to predict whether T1D or T2D patients will later develop DR. The dataset was collected by Khodadadi et al. [40] and relates to diabetes complications in the Lur and Lak populations of Iran. Informed consent was obtained from patients, and the dataset was made publicly available by the original authors (https://data.mendeley.com/datasets/k62fdsnwkg/1). The dataset was gathered from 133 diabetic patients covering known risk factors for neuropathy, nephropathy, diabetic retinopathy (DR), peripheral vessel disease (PVD), CVD, foot ulcer history, and the dawn effect. Originally, the dataset consisted of 24 attributes gathered from diabetic patients (T1D and T2D). We removed irrelevant features, leaving 14 potentially DR-relevant risk factors, as shown in Table 1. The class label (retinopathy) was assigned when the subject had symptomatic disease with a history of laser or surgical therapy. The objective of our study was to classify whether a diabetic patient will develop diabetic retinopathy (DR) in the future.
Figure 1 shows the proposed DNN model to predict DR diagnosis from several risk factors, based on the public DR dataset. Data pre-processing removed inappropriate and inconsistent data. During the pre-processing stage, data normalization was applied by rescaling real-valued numeric attributes into [0,1]. Missing values in numeric and nominal attributes were replaced by the mean and mode, respectively. Then RFE removed irrelevant features, and a DNN-based prediction model was developed using the grid search algorithm to optimize the model hyperparameters and, hence, maximize DNN performance. Performance was evaluated by comparing the proposed model with other best-practice machine learning models from previous studies. We used stratified 10-fold cross-validation (CV), a variation of k-fold CV, for the proposed and comparison machine learning models.
In k-fold CV, the dataset is split into k subsets of equal size, and the instances for each subset or fold are randomly selected. Each subset, in turn, is used for testing and the remainder for training. The model is evaluated k times such that each subset is used once as the test set. In stratified k-fold CV, each subset additionally contains approximately the same proportion of class labels as the original dataset. This procedure reduces the variance among the estimates, making the average error estimate more reliable. Furthermore, our dataset is imbalanced, with 32% of subjects classified as DR patients. A previous study demonstrated that stratified k-fold CV is generally superior to regular CV, particularly for unbalanced datasets [41]. Figure 2 shows an overview of model validation based on CV and stratified CV applied to the two-class dataset.
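The stratified splitting described above can be sketched with scikit-learn's StratifiedKFold; the feature matrix and labels below are synthetic stand-ins for the DR dataset (roughly 32% positive class, as in the study), not the actual data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(133, 14))            # 133 patients, 14 risk factors
y = (rng.random(133) < 0.32).astype(int)  # ~32% DR-positive labels

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold preserves approximately the original class proportion,
    # unlike plain k-fold CV, where fold proportions can drift on small data.
    fold_pos_rate = y[test_idx].mean()
```

Each of the 10 test folds can then be scored in turn, and the fold metrics averaged as described above.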

Recursive Feature Elimination (RFE)
Feature selection removes redundant and irrelevant features to improve machine learning quality and efficiency. This study applied RFE with a linear SVM kernel to evaluate feature significance for the DR dataset [30]. SVM-RFE works by first training the SVM classifier on the dataset. Next, the ranking weights for all features are computed. Finally, the feature with the smallest weight is deleted. This process is repeated until no features are left, with later-eliminated features receiving higher ranks. The bottom-ranked features are the least informative and are removed in the first iterations. Thus, irrelevant features are gradually eliminated and important features retained for classification. The process is summarized in Algorithm 1.


Algorithm 1. SVM-RFE pseudocode
Input: training samples X0 and class labels y; initialize the subset of remaining features s = [1, ..., n] and the feature ranking list r = [].
Repeat while s is not empty:
1. Restrict the training samples to the remaining features: X = X0(:, s).
2. Train the classifier: α = SVM_train(X, y).
3. Compute the weight vector w and the ranking criterion ci for each feature in s.
4. Find the feature f with the smallest ranking criterion.
5. Update the sorted feature list r with f, and remove f from s.
Output: feature ranking list r.
Users may stop the iteration before s is empty, so that a desired number of features is retained.
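As a minimal sketch, Algorithm 1 corresponds closely to scikit-learn's RFE wrapper around a linear SVM; the data below are synthetic stand-ins for the 14 risk-factor features, with the first feature made informative so the ranking has something to find:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(133, 14))
# Synthetic labels driven by feature 0, plus a little noise.
y = (X[:, 0] + 0.1 * rng.normal(size=133) > 0).astype(int)

# Eliminate one feature per iteration until 13 remain (k = 13 in the paper).
selector = RFE(estimator=LinearSVC(dual=False), n_features_to_select=13, step=1)
selector.fit(X, y)

# ranking_ assigns 1 to retained features; eliminated features get higher
# ranks, with earlier-eliminated features ranked higher (less informative).
print(selector.ranking_)
```

In the paper, the retained top-k features are then fed to the DNN as inputs.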
In Algorithm 1, SVM_train (linear SVM) is utilized to learn from the set of paired inputs X and outputs y. The linear SVM classifies training data by mapping the original data onto a high-dimensional feature space and finding the optimal linear hyperplane to separate instances of each class from the others [42]. For the case of separating training vectors belonging to two classes, the training set is {(x_1, y_1), ..., (x_m, y_m)}, where x_i is a real-valued n-dimensional input vector and y_i ∈ {−1, +1} is the class label associated with the training vector. The separating hyperplane is determined by an orthogonal vector w and bias b, which identify the points x that satisfy w·x + b = 0. Thus, the classification mechanism for the linear SVM can be expressed as f(x) = sign(Σ_i α_i y_i (x_i·x) + b), with constraints 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0, where α is the parameter vector for the classifier hyperplane, and C is a penalty parameter to control the number of misclassifications. Figure 3 shows the attribute ranking for the DR dataset. We investigated the impact of the top k features on DNN accuracy: SVM-RFE was executed to remove irrelevant features, and the top k features were used as the input to the DNN. This process was repeated for all possible values of k. Finally, we found that the top 13 DR features maximized DNN accuracy; hence, only systolic blood pressure (Sys BP) was removed from the DR dataset. Section 4.2 discusses optimizing the number of features for the dataset.
We also compared SVM-RFE with other feature selection methods, such as chi-squared, ANOVA, and extra trees, in terms of improving DNN accuracy. The feature selection results for the dataset are shown in Table 2. The methods differed in which features they extracted as most important; a higher score, F-value, or Gini importance indicates a more important feature. The impact of the different feature selection methods on DNN accuracy is presented in Section 4.2. The proposed SVM-RFE ranked DM duration, FBS, HDL, Age, and A1c as the top five significant risk factors for retinopathy. The other feature selection methods generated different subsets of important features. Chi-squared identified DM duration, Sex, Sys BP, Dias BP, and Age as the top five features. ANOVA identified DM duration, Age, Dias BP, Sys BP, and FBS as the top five, while extra trees produced a result similar to ANOVA except that A1c replaced Dias BP in the top five. These top five features can be utilized as input features for the deep neural network to improve classification accuracy.
Finally, SVM-RFE identified Sys BP as an irrelevant feature, while chi-squared and ANOVA recognized Statin as a less important feature, and extra trees identified DM type as unimportant. These irrelevant features should be removed from the classifier input; excluding them is expected to improve DNN performance. A more detailed discussion of the significant risk factors for retinopathy and a comparison with results from previous studies are presented in Section 4.3.
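A brief sketch of the three comparison scoring criteria (chi-squared score, ANOVA F-value, and extra-trees Gini importance), assuming the scikit-learn implementations; the data are synthetic stand-ins, with feature 0 made informative:

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
X = rng.random((133, 14))   # chi2 requires non-negative inputs
y = (X[:, 0] + 0.1 * rng.normal(size=133) > 0.5).astype(int)

chi2_scores, _ = chi2(X, y)       # chi-squared statistic per feature
f_scores, _ = f_classif(X, y)     # ANOVA F-value per feature
gini = ExtraTreesClassifier(random_state=0).fit(X, y).feature_importances_

# Higher values indicate more important features under each criterion.
top5 = {
    "chi2": np.argsort(chi2_scores)[::-1][:5],
    "anova": np.argsort(f_scores)[::-1][:5],
    "extra_trees": np.argsort(gini)[::-1][:5],
}
```

As in Table 2, the three criteria need not agree on the ranking, since each measures feature relevance differently.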

Proposed Deep Neural Network
We employed a DNN model to predict DR from the risk-factor dataset. A DNN is a class of ANN with multiple hidden layers between the input and output layers and has recently become highly successful for classification. A DNN is fully connected; hence, each unit receives connections from all units in the previous layer, and each unit has its own bias and a weight for every pair of units in two consecutive layers. The net input is calculated by multiplying each input by its corresponding weight and then summing. Each unit in a hidden layer takes the net input and applies an activation function. Thus, the computation for unit i in the lth hidden layer can be expressed as h_i^(l) = ϕ(Σ_j w_ij^(l) h_j^(l−1) + b_i^(l)), where w_ij^(l) are the weights, b_i^(l) is the bias, and ϕ is the activation function. In our study, we used and evaluated several activation functions, such as sigmoid, hyperbolic tangent, and rectified linear unit (ReLU), which are presented in detail in Equations (9)-(11), respectively.
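The three activation functions (Equations (9)-(11)) and the fully connected layer computation described above can be sketched in NumPy:

```python
import numpy as np

def sigmoid(z):
    # Equation (9): squashes the net input into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Equation (10): squashes the net input into (-1, 1).
    return np.tanh(z)

def relu(z):
    # Equation (11): zero for negative inputs, identity otherwise.
    return np.maximum(0.0, z)

def dense_layer(h_prev, W, b, phi):
    # Net input = weighted sum of the previous layer's outputs plus bias,
    # followed by the activation function phi.
    return phi(W @ h_prev + b)

# Example: one unit with weights [0.5, 0.5], bias 0, previous outputs [1, 2].
h = dense_layer(np.array([1.0, 2.0]), np.array([[0.5, 0.5]]),
                np.array([0.0]), relu)
```

Stacking several `dense_layer` calls, each feeding the next, gives the multi-hidden-layer computation the DNN uses.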
The DNN was trained using the back-propagation (BP) algorithm [43], which compares the prediction with the target value and modifies the weights for each training tuple to minimize the error between prediction and target. A loss function is required to measure how well the model predicts the expected outcome. Our study focused on binary classification, where the number of classes is two, so the cross-entropy loss can be calculated as L = −Σ_i [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)], where y_i is the true probability and ŷ_i is the predicted probability. This process is iterated to produce optimal weights, providing optimal predictions for the test data. Figure 4 shows the proposed DNN model to predict DR from individuals' risk factors. We utilized the grid search algorithm [44] to automatically select the best parameters for the proposed DNN model, applying grid search to the training set and measuring cross-validation performance to obtain the best prediction model, as shown in Figure 5. The objective was to select the hyperparameters for the proposed DNN model that achieve the highest accuracy for DR. We found a five-hidden-layer network with a different number of neurons per layer, ReLU as the activation function, and SGD as the weight optimizer to be the best parameters for the DNN. Table 3 shows the optimized hyperparameter details for the proposed DNN. Finally, we applied these parameters to the proposed DNN model.
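A hedged sketch of the hyperparameter search, with scikit-learn's MLPClassifier standing in for the DNN; the grid values here are illustrative (they include the paper's best architecture but are not the paper's exact grid), the data are synthetic stand-ins, and 3-fold CV is used only to keep the sketch fast:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(133, 13))   # 13 features after RFE
y = (X[:, 0] > 0).astype(int)

# Illustrative grid: activation, optimizer, and hidden-layer architecture.
param_grid = {
    "activation": ["relu", "tanh"],
    "solver": ["sgd", "adam"],
    "hidden_layer_sizes": [(100, 64, 128, 64, 32), (64, 32)],
}
search = GridSearchCV(
    MLPClassifier(max_iter=100, random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)   # best_params_ holds the winning configuration
```

Each grid point is scored by cross-validation on the training set, and the configuration with the highest mean accuracy is refit on the full training data.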

Experimental Setup
Machine-learning models were applied to distinguish DR from the public dataset. Classification and feature selection methods were implemented in Python V3.7.3, utilizing the Scikit-learn V0.22.2 library [45]. We used Scikit-learn default parameters to simplify implementing other models. We performed the experiments on an Intel Core i5-4590 computer with 8 GB RAM running Windows 7 64 bit. Average performance metrics, such as accuracy (%), precision (%), sensitivity or recall (%), specificity (%), F1 score (%), and AUC were obtained by conducting 10 runs of stratified 10-fold CV.
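The averaged metrics listed above can be collected in one pass with scikit-learn's cross_validate; since specificity has no built-in scorer, it is expressed below as recall of the negative class. The data are synthetic stand-ins, and the random forest is just one of the comparison models:

```python
import numpy as np
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(133, 14))
y = (X[:, 0] > 0.5).astype(int)   # imbalanced, roughly 30% positive

scoring = {
    "accuracy": "accuracy",
    "precision": "precision",
    "recall": "recall",           # sensitivity
    "f1": "f1",
    "roc_auc": "roc_auc",
    # Specificity = recall of the negative class.
    "specificity": make_scorer(recall_score, pos_label=0),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(RandomForestClassifier(random_state=0), X, y,
                        cv=cv, scoring=scoring)
mean_acc = scores["test_accuracy"].mean() * 100   # report as a percentage
```

Repeating this with different shuffle seeds gives the "10 runs of stratified 10-fold CV" averages reported in the paper.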


Results and Discussion
This section considers the proposed DNN model performance and feature selection model impacts and discusses DR risk factors. We also verified a good generalization capability by applying our results to other public datasets (nephropathy and hypertension-diabetes).

Prediction Model Performances
We applied the proposed DNN with RFE to predict DR using known risk factors and compared the outcomes with several current best-practice data-driven models with wide acceptance and a proven track record for accuracy and efficiency: k-nearest neighbor (KNN), C4.5 decision tree (DT), support vector machine (SVM), naïve Bayes (NB), and random forest (RF). Table 4 compares model performance metrics, averaged from 10 runs of stratified 10-fold CV. The proposed DNN with RFE was superior to the traditional models with respect to accuracy, sensitivity, specificity, F1 score, and AUC, achieving 82.033%, 76.000%, 80.389%, 71.820%, and 0.804, respectively. In terms of detecting positive cases, the KNN model achieved the highest precision (79.714%); however, our proposed model generated the highest recall (76.000%). Precision is the ratio of correctly predicted positive cases to the total predicted positive cases. Recall, on the other hand, indicates the model's accuracy in predicting positive cases among the cases where the actual condition is positive. Therefore, to identify positive (retinopathy) cases accurately, the model should focus more on recall than on precision. In the medical field, it is common to focus on sensitivity (recall) and specificity to evaluate medical tests [46]. It is also important to detect positive cases accurately, since failing to detect retinopathy can lead to blindness. Finally, the proposed model achieved a 5.308% accuracy improvement over the current best-practice DR models. Accuracy is the most common metric for classifier performance; however, class distribution must be considered for unbalanced datasets using specific classifier metrics. ROC analysis is a useful tool to provide evaluation criteria for unbalanced datasets [47].
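The precision, recall (sensitivity), and specificity definitions above can be computed directly from a confusion matrix; the counts below are illustrative, not the study's actual results:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels: 4 actual positives, 6 actual negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision   = tp / (tp + fp)   # of predicted positives, how many are true
recall      = tp / (tp + fn)   # sensitivity: of actual positives, how many found
specificity = tn / (tn + fp)   # of actual negatives, how many found
f1 = 2 * precision * recall / (precision + recall)
```

Here a missed retinopathy case is a false negative, which lowers recall; this is why the discussion above weights recall over precision for this task.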
The ROC curve contrasts false positive and false negative outcomes, as shown in Figure 6, where the AUC indicates overall classification performance [48], with the best model having AUC ≈ 1. Figure 6 shows the ROC curve analysis for the proposed and other classification models on the DR dataset. The proposed model achieved the highest AUC = 0.80. Thus, the proposed model achieved significantly improved metrics compared with current best-practice classification methods. The specific impacts of RFE and other feature selection methods on accuracy are presented in Section 4.2.


Feature Selection Impacts
The optimal number of features must be determined to implement RFE. Therefore, we investigated the impact of the top k features on DNN accuracy. The full dataset included 14 features, and RFE was expected to remove irrelevant ones. Figure 7a shows the impact of the best k features, as defined by RFE, on DNN model accuracy. The optimum was k = 13 for the DR dataset; maximum DNN accuracy was achieved when only these features were included. The results showed that removing a large number of features reduced accuracy; the highest accuracy was achieved by removing a small number of features. Based on the search strategy, feature selection can be categorized into wrapper, filter, and embedded methods [49,50]. Filter methods measure the relevance of features by their correlation with the target variable, while wrapper methods use a learning machine to measure the usefulness of a feature subset according to its predictive power. Embedded methods perform feature selection during training based on a specific learning machine. In our study, SVM-RFE served as the wrapper method; ANOVA and chi-squared were the filter methods; and the extra trees algorithm [51] was the embedded method used to extract relevant features. Figure 7b shows the impact of the feature selection method on DNN model accuracy. The extra-trees-based feature selection automatically selected an optimal number of 7 features.
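The top-k sweep described above can be sketched as follows, with a logistic regression standing in for the DNN to keep the example fast; the data are synthetic stand-ins for the 14 risk factors:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(133, 14))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=133) > 0).astype(int)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = {}
for k in range(1, 15):
    # Keep the top k features ranked by SVM-RFE, then score the classifier.
    X_k = RFE(LinearSVC(dual=False), n_features_to_select=k).fit_transform(X, y)
    accuracies[k] = cross_val_score(LogisticRegression(), X_k, y, cv=cv).mean()

best_k = max(accuracies, key=accuracies.get)   # the k that maximizes CV accuracy
```

Plotting `accuracies` against k reproduces the shape of a curve like Figure 7a: accuracy typically drops when too many features are removed.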
We also found that the optimal numbers of features for ANOVA and chi-squared were the same (13 features). RFE generated superior accuracy, up to 9.07% better for the DR dataset than outcomes without feature selection. The other feature selection methods performed poorly, with only slight accuracy improvements for the DR dataset using chi-squared, ANOVA, and extra trees. Thus, RFE was the best choice for the DNN, providing the maximum DR prediction accuracy. Previous models [17] and Lasso [14] achieved superior AUC for DR.
Most previous studies used holdout for model validation, whereas we used stratified 10-fold CV to avoid overfitting. Risk factors and their relative importance vary across the world, and previous studies showed that important predictors can be retrieved for DR. Hosseini et al. [13] used age, BMI, sex, diabetes duration, and HbA1c, whereas Oh et al. [14] identified fasting BG, triglyceride, low BMI, and insulin therapy as strong predictors. Other studies also identified insulin use [17] and A1c [16] as important risk factors. Our proposed study used RFE to select the best variables and identified the top five risk factors as diabetes duration, fasting BG, HDL, age, and A1c, which are largely consistent with the risk factors selected in previous studies.

Risk Factors and Previous Studies
Directly comparing these results is inappropriate, since they were derived from different datasets, pre-processing methods, and validation methods. Therefore, Table 5 should not be taken as strong evidence of relative model performance, but it provides a general comparison between the proposed model and previous approaches. The public dataset used in the current study was limited to a small population in Iran. Benchmarking machine learning models will become fairer as more datasets become publicly available.

Other Diabetes Datasets
To evaluate the robustness and generalization of the proposed prediction model, we compared it with other machine learning models on additional datasets (diabetic nephropathy and hypertension-diabetes). Diabetic nephropathy (DN) is a serious kidney-related complication (kidney disease) commonly developed by both T1D and T2D patients. The DN dataset we employed, provided by Khodadadi et al. [40], shares the input features of the DR dataset (Table 1). It was gathered from 133 diabetic patients, 73 of whom developed DN. Thus, we used the same 14 input features as in Table 1 but with a different output class, i.e., DN. A description of the DN dataset is given in Appendix A (Table A1). The objective for this dataset was to classify whether a diabetic patient would develop DN.
We also gathered a dataset from the National Health Insurance Sharing Service (NHISS), Korea, comprising applicants' general health data for 2013-2014 [52]. The original input variables were age group (BTH_G), systolic blood pressure (BP), diastolic BP, fasting BG, BMI, and sex. The dataset included four classes: hypertension, diabetes (T1D or T2D), both hypertension and diabetes, or healthy (no history of diabetes or hypertension). We converted this multiclass problem into a binary classification problem (healthy vs. disease, i.e., hypertension and/or diabetes) and removed fasting BG, since this variable is closely related to the diabetes diagnosis. We randomly selected 1000 individuals from approximately 1 million; hence, the final dataset comprised 761 healthy and 239 disease patients. A description of the NHISS dataset is given in Appendix A (Table A2). The objective for this dataset was to classify whether a subject would develop disease (i.e., hypertension or diabetes).

Table 6 compares classification performance for the proposed DNN + RFE model against other machine learning models (KNN, DT, SVM, NB, and RF). Average accuracy and AUC were calculated over 10 runs of stratified 10-fold CV. The proposed model achieved the highest accuracy and AUC for both datasets: 84.121% and 0.839 for the DN dataset, and 81.600% and 0.702 for the NHISS dataset, respectively. Applying RFE, we selected 13 features for the DN dataset and 3 for the NHISS dataset (BTH_G, systolic BP, and BMI), and we used grid search to optimize the DNN hyperparameters. The optimal DNN design for the DN dataset had five hidden layers (100, 64, 128, 64, 32), the ReLU activation function, and a maximum of 100 iterations. For the NHISS dataset, a DNN with five hidden layers (100, 64, 128, 64, 32) and the tanh activation function achieved the highest accuracy.
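The multiclass-to-binary conversion and the hyperparameter grid search described above can be sketched with scikit-learn's MLPClassifier on synthetic stand-in data; the grid covers the activations and solvers named in the text, while the data and everything else are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the NHISS data: four classes over a few
# health indicators (healthy, hypertension, diabetes, both)
X, y4 = make_classification(n_samples=500, n_features=6, n_informative=4,
                            n_classes=4, random_state=0)

# Collapse to binary: class 0 = healthy, everything else = disease
y = (y4 != 0).astype(int)

# Grid over the hyperparameters tuned in the text, with the architecture
# fixed at five hidden layers and a maximum of 100 iterations
grid = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(100, 64, 128, 64, 32),
                  max_iter=100, random_state=0),
    param_grid={"activation": ["relu", "tanh"],
                "solver": ["lbfgs", "adam"]},
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_)
```

On the real data, the winning combination differed per dataset (ReLU + LBFGS for DN, tanh + Adam for NHISS), which is exactly the kind of outcome this search surfaces.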
The best optimization algorithms were LBFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno) for the DN dataset and Adam for the NHISS dataset. These additional experiments confirmed the proposed model's robustness across different healthcare domains.

Conclusions and Future Work
The current study combined a DNN with RFE to predict DR. The model is expected to help individuals assess their DR risk based on risk factors during the initial disease phases. A public dataset incorporating DR risk factors was used, and the proposed model's performance was compared with previous best-practice KNN, DT, SVM, NB, and RF models. The proposed model outperformed the conventional classification models and most previous models, achieving accuracy = 82.033%.
We applied RFE to identify significant features in all datasets, i.e., extracting the strongest DR risk factors. Combining RFE and DNN improved prediction accuracy compared with all other considered feature selection methods (chi-squared, ANOVA, and extra trees). The proposed DNN + RFE model improved accuracy by 9.07% compared with the DNN without feature selection. Thus, machine learning combined with feature selection can effectively detect DR. This offers increased cost-effectiveness for healthcare systems, where decision support based on the proposed prediction model could provide a second opinion. We hope this study will help reduce DR risk for diabetic patients, as DR is the major cause of blindness.
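A minimal sketch of the combined approach, assuming scikit-learn: RFE (with a linear SVM as the ranking estimator) feeds an MLP stand-in for the DNN, and wrapping both in a Pipeline keeps feature selection inside each CV fold; all sizes and data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the 14-feature DR dataset
X, y = make_classification(n_samples=300, n_features=14, n_informative=6,
                           random_state=0)

# Stage 1 prunes irrelevant risk factors; stage 2 classifies with a DNN
model = Pipeline([
    ("rfe", RFE(SVC(kernel="linear"), n_features_to_select=13)),
    ("dnn", MLPClassifier(hidden_layer_sizes=(100, 64, 128, 64, 32),
                          activation="relu", max_iter=100, random_state=0)),
])

# Stratified 10-fold CV, as used for all reported results
scores = cross_val_score(model, X, y,
                         cv=StratifiedKFold(n_splits=10, shuffle=True,
                                            random_state=0))
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Running selection inside the pipeline, rather than once on the full dataset, avoids leaking test-fold information into the chosen feature subset.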
The dataset used here was from a relatively small and quite specific population; hence, the prediction model outcomes cannot simply be generalized for broader application. Similarly, the identified important risk factors might not be appropriate for other populations. Thus, as it stands, the proposed model would be unsuitable for clinical trials due to these dataset limitations. Therefore, the proposed approach should be extended to other clinical datasets and compared more widely with other prediction and feature selection models. Once model validation is extended to broader datasets, other risk factors affecting DR could be identified.

Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.