Development of a Prediction Model for Colorectal Cancer among Patients with Type 2 Diabetes Mellitus Using a Deep Neural Network

Objectives: Observational studies suggested that patients with type 2 diabetes mellitus (T2DM) presented a higher risk of developing colorectal cancer (CRC). The current study aims to create a deep neural network (DNN) to predict the onset of CRC for patients with T2DM. Methods: We employed the national health insurance database of Taiwan to create predictive models for detecting an increased risk of subsequent CRC development in T2DM patients in Taiwan. We identified a total of 1,349,640 patients between 2000 and 2012 with newly diagnosed T2DM. All the available possible risk factors for CRC were also included in the analyses. The data were split into training and test sets with 97.5% of the patients in the training set and 2.5% of the patients in the test set. The deep neural network (DNN) model was optimized using Adam with Nesterov’s accelerated gradient descent. The recall, precision, F1 values, and the area under the receiver operating characteristic (ROC) curve were used to evaluate predictor performance. Results: The F1, precision, and recall values of the DNN model across all data were 0.931, 0.982, and 0.889, respectively. The area under the ROC curve of the DNN model across all data was 0.738, compared to the ideal value of 1. The metrics indicate that the DNN model appropriately predicted CRC. In contrast, a single variable predictor using adapted the Diabetes Complication Severity Index showed poorer performance compared to the DNN model. Conclusions: Our results indicated that the DNN model is an appropriate tool to predict CRC risk in patients with T2DM in Taiwan.


Introduction
Diabetes mellitus (DM) is a common chronic disease worldwide. According to the Global Report on Diabetes from the World Health Organization, the prevalence of DM has been steadily rising for the past three decades, becoming a major public health issue. In 2014, 422 million people in the world had diabetes-8.5% of the adult population [1]. In Taiwan, the standardized incidence rate of DM is in accordance with the global trend, with a near constancy (1.043% in 2000 and 1.160% in 2009 among age 20-79 residents). However, its prevalence has steadily increased (4.31% in 2000 and 6.38% in 2009 among age 20-79 residents), suggesting a possibility of better DM care that leads to a decrease in mortality rates of patients with DM [2].
As the number of patients with chronic DM increases, certain diseases related to DM become concerns among these patients. Studies have suggested that patients with DM are at a higher risk to develop cancer overall and several individual cancers compared to the general population [3][4][5][6][7][8][9]. The risk of developing colorectal cancer (TaCRC) among patients with DM was revealed by earlier reports [10][11][12][13].
Cancer has been Taiwan's leading cause of mortality since 1982 and CRC has been the most common type of malignancy recorded in the country since 2006 [14]. In 2015, the age-adjusted incidence rate of CRC in Taiwan was 43.58/100,000 people, an increase from 2005 of 20.9% and 8.3% for men and women, respectively [15]. CRC has also been ranked as the third leading cause of cancer-related death in Taiwan, from 2013 to 2017, for both men and women, as well as both combined. Consequently, cancer continues to be a challenge for the public health field of Taiwan. It has come to our government's attention, resulting in population-based investigations regarding early diagnosis and cancer-preventive epidemiology. Based on this concern and the suggestion of a possible link between CRC risk and type 2 DM (T2DM) by previous researchers, we proposed this study aimed at creating a deep neural network (DNN) to predict the onset of CRC among patients with T2DM in Taiwan.

Data Source and Sampled Participants
This study was approved by the Research Ethics Committee of the China Medical University and Hospital in Taiwan (CMUH104-REC2-115-CR3). For the present study, we used a subset of data from the National Health Insurance Research Database (NHIRD) and the Longitudinal Cohort of Diabetes Patients (LHDB), which contains health data of 1,700,000 patients with newly diagnosed T2DM (International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) code 250.x0 and 250.x2) from 2000-2012. Subjects that had at least two diagnoses of T2DM within a year were eligible for inclusion in the LHDB. The first diagnosis date was defined as the index date of T2DM. T2DM patients with a CRC history (ICD-9-CM 153, 154) before the index date, aged less than 20 years and with incomplete information on demographics were excluded.

Outcome Measurements, Comorbidities, and Medications
All study subjects were followed from the index date to the date of CRC diagnoses, date of withdrawal from the insurance program, or the end of 2013; whichever came first. The baseline comorbidities considered in this study included hypertension, hyperlipidemia, stroke, congestive heart failure, colorectal polyps, obesity, chronic obstructive pulmonary disease (COPD), coronary artery disease (CAD), asthma, smoking (stop-smoking clinic), inflammatory bowel disease, irritable bowel syndrome, alcohol-related illness, and chronic kidney disease. The adapted Diabetes Complication Severity Index (aDCSI) score consists of scores from 7 complication categories including retinopathy, nephropathy, neuropathy, cerebrovascular, cardiovascular, peripheral vascular disease and metabolic; it ranges from 0 to 13 [16,17]. Medications that may be associated with CRC were also evaluated, including statin, insulin, sulfonylureas, metformin, thiazolidinedione (TZD), and other antidiabetic drugs.

Constructing Training and Data Sets
The data comprised of 1,349,640 data points, each representing one patient. The data had 37 input features and 2 output features. The positive output class represented the diagnosis of CRC, while the negative output class represented no diagnosis. The data were split into training and test sets: 97.5% of the data were used as the training set and 2.5% were used as the test set. This ratio was chosen both to have a sufficient number of data points for validation and to use the majority of the dataset for training. The data points were randomly allocated to each set. Table 1 shows the allocation between the training and test sets.

Algorithm and Training
The average k-fold cross-validation accuracy, with a k-value of 10, was used as the metric to determine the best hyperparameters, optimizers, and loss functions of predictors.
The DNN model is a multilayer perceptron deep neural network. The model used here consisted of one input layer of 37 dimensions, three hidden layers of 30 dimensions, and a scalar output layer. The number of dimensions corresponds to the number of artificial neurons in each layer. Each layer was densely connected, meaning that the neurons of each layer were connected to neurons of the preceding and successive layers. The model was trained using a stochastic gradient descent, an iterative algorithm used to optimize the weights of neurons in the network, with a mini-batch size of 128. The model was optimized using Adam with Nesterov's accelerated gradient descent [18][19][20]. The input and hidden layers used a Rectified Linear Unit (ReLU) activation function [21], while the output layer used the Softmax activation function. These activation functions were applied to the output of each neuron. A dropout, a regularization technique used to prevent overfitting, of 20% was applied to the input layer and each hidden layer [22]. The categorical cross entropy function was used as the loss function. The neuron weights were initialized using normalized He initialization [23].
A non-diagnosis of CRC was prevalent in the data set. Each data point in the positive class was weighted approximately 40 times greater than each data point in the negative class to ensure that the output of the prediction was not unbalanced towards the dominant class.
The software was implemented using Python (version 3.7.0) [24]. The DNN model was created and trained with the Tensorflow framework (version 1.9.0) [25].

Statistical Analyses
Distributions of socio-demographic factors, including age, gender, urbanization level, occupation, underlying disease, diabetes complication, and medications of the patient with CRC and without CRC were compared. A Chi-square test and a t-test were used to test the difference between categorical and continuous variables, respectively, between the two groups.
Accuracy was not a reliable measurement of predictor performance due to the unbalanced data distribution [26]. Instead, we used the weighted average recall (sensitivity), precision (positive predictive value), and F 1 (harmonic mean of sensitivity and precision) values to evaluate predictor performance. These three metrics all have ideal values of one. The F 1 , precision, and recall values were calculated for the training set, test set, and all data using the scikit-learn library.
Additionally, the receiver operating characteristic (ROC) curve was used as a metric to measure predictor performance. The ROC was calculated between the outcome and the predicted probability of the outcome by the DNN model. The ROC curve was also computed using aDCSI as the sole predictor. The area under the ROC curve (AUROC) for the DNN model and aDCSI were compared to determine the performance of the DNN model and both values were also compared to the ideal value of 1 and to the null hypothesis area of 0.5 [27]. The ROC curve was calculated using IBM SPSS. Data management was performed using the SAS 9.4 software (SAS Institute, Cary, NC, USA). All P-values were 2 tailed and a p-value < 0.05 was considered significant.

Demographic Features of Patients
Eligible study participants consisted of 1,349,640 T2DM patients, 14,867 of whom were with CRC and 1,334,773 of whom were without CRC ( Table 2). The mean age was 63.7 years (SD = 11.2 years) for the CRC group and was 56.2 years (SD = 14.2 years) for the non-CRC group. There were more men than women. The two groups preferred to reside in urbanized areas (58.9% vs. 58.8%). Most of the occupations in both groups were white-collar jobs (45.0% vs. 48.1%). The comorbidities of hypertension, congestive heart failure, colorectal polyps, COPD, CAD, and irritable bowel syndrome were significantly higher in the CRC group than in the non-CRC group. The T2DM-related cardiovascular complication was more prevalent in the CRC group than in the non-CRC group. The mean aDCSI score at the end of the follow-up was 2.75 (SD = 2.15) in the CRC group and 3.03 (SD = 2.35) in the non-CRC group. All medications listed in Table 2 had higher proportions in the non-CRC group than in the CRC group. The mean follow-up periods were 4.73 (SD = 3.33) years in the CRC group and 6.86 (SD = 3.87) years in the non-CRC group.

Evaluation of Predictor Performance
The F 1 , precision, the recall values and the AUROC of the DNN model across all data are outlined in Table 3. The AUROC of aDCSI is outlined in Table 4. Figure 1 shows the ROC curve of the DNN model and aDCSI model for predicting CRC.
The AUROC of the aDCSI across all three datasets was close to the null hypothesis area of 0.5, which showed that aDCSI alone cannot be used as a predictor for CRC. This signified a necessity for a multivariate prediction model that takes into account all variables. The AUROC of the DNN model across all three datasets was significantly greater than the null hypothesis area and the AUROC of the aDCSI.
The DNN model had a high precision value across the test set (0.980), which indicated that the DNN model had a relatively low false positive rate. While the recall value across the test set was lower than the precision value, the recall value was also relatively high (0.886), which signified a low false negative rate.

Discussion
This national population-based study demonstrated that the DNN model appropriately predicted CRC. In contrast, a single variable predictor using aDCSI showed poorer performance compared to the DNN model.
Earlier studies suggested that compared to the general population, patients with DM are at a higher risk to develop CRC [10][11][12][13]. Although DM and cancer share several risk factors, such as obesity, aging, unhealthy food and physical inactivity [6], the association between DM and the risk of CRC is biologically plausible based on the findings of previous studies. The potential mechanisms contributing to the development of diabetes-associated CRC may include insulin resistance and associated hyperglycemia, hyperinsulinemia, oxidative stress, and chronic inflammation [6,11,28]. Insulin stimulates cell proliferation and most cancer cells express the insulin-like growth factor (IGF) receptor [28,29]. The IGF system is a potent growth regulator closely linked with carcinogenesis [30] and several observational studies and reviews have revealed a linkage between elevated IGF levels and the risks of adenomatous polyps or CRC [31][32][33][34].
In Taiwan, by using the NHIRD, several authors used traditionally statistical methods to assess the cancer risk among patients with T2DM and anti-diabetic therapies [9,[35][36][37]. Hsieh applied logistic regression models to test the risk of T2DM and antidiabetic drugs on cancers. They found that there was a significantly higher risk for CRC (adjusted odds ratio = 1.206, 95% confidence interval = 1.142-1.274) in patients with T2DM [9]. Chiu employed a Cox proportional hazards regression analysis to evaluate T2DM and antidiabetic drugs with the risk of gastrointestinal malignancy. They indicated that T2DM was significantly associated with an increased risk of CRC (adjusted hazard ratio: 1.58, 95% confidence interval = 1.28-1.94) [35]. Tseng created multivariable Cox regression models to calculate the adjusted relative risk of T2DM on CRC and concluded that a significantly higher risk was detected [36]. Our team carried out Cox regression analyses several years ago to determine if TZD can decrease cancer risk in T2DM patients and highlighted that no significant difference was observed for the risk of CRC in the TZD group relative to the non-TZD group [37]. To further clarify this issue, the current study attempted to use DNN to develop a prediction model for

Discussion
This national population-based study demonstrated that the DNN model appropriately predicted CRC. In contrast, a single variable predictor using aDCSI showed poorer performance compared to the DNN model.
Earlier studies suggested that compared to the general population, patients with DM are at a higher risk to develop CRC [10][11][12][13]. Although DM and cancer share several risk factors, such as obesity, aging, unhealthy food and physical inactivity [6], the association between DM and the risk of CRC is biologically plausible based on the findings of previous studies. The potential mechanisms contributing to the development of diabetes-associated CRC may include insulin resistance and associated hyperglycemia, hyperinsulinemia, oxidative stress, and chronic inflammation [6,11,28]. Insulin stimulates cell proliferation and most cancer cells express the insulin-like growth factor (IGF) receptor [28,29]. The IGF system is a potent growth regulator closely linked with carcinogenesis [30] and several observational studies and reviews have revealed a linkage between elevated IGF levels and the risks of adenomatous polyps or CRC [31][32][33][34].
In Taiwan, by using the NHIRD, several authors used traditionally statistical methods to assess the cancer risk among patients with T2DM and anti-diabetic therapies [9,[35][36][37]. Hsieh applied logistic regression models to test the risk of T2DM and antidiabetic drugs on cancers. They found that there was a significantly higher risk for CRC (adjusted odds ratio = 1.206, 95% confidence interval = 1.142-1.274) in patients with T2DM [9]. Chiu employed a Cox proportional hazards regression analysis to evaluate T2DM and antidiabetic drugs with the risk of gastrointestinal malignancy. They indicated that T2DM was significantly associated with an increased risk of CRC (adjusted hazard ratio: 1.58, 95% confidence interval = 1.28-1.94) [35]. Tseng created multivariable Cox regression models to calculate the adjusted relative risk of T2DM on CRC and concluded that a significantly higher risk was detected [36]. Our team carried out Cox regression analyses several years ago to determine if TZD can decrease cancer risk in T2DM patients and highlighted that no significant difference was observed for the risk of CRC in the TZD group relative to the non-TZD group [37]. To further clarify this issue, the current study attempted to use DNN to develop a prediction model for CRC among patients with T2DM. In addition, the drug effects were also considered for the analyses. A DNN is an artificial neural network with multilayer perceptron and uses sophisticated mathematical modelling to process data in complex ways [38]. It can be used for prediction, forecasting, diagnosis, and decision making in different fields including the healthcare field.
We found that the AUROC of the DNN was significantly greater than that of using only aDCSI. We used aDCSI as a single variable predictor and the AUROC of aDCSI in predicting CRC was close to the null hypothesis area of 0.5, showing that using aDCSI to predict CRC was not effective. In contrast, DNN was designed for a multivariate predictor and the AUROC of the DNN model was significantly greater than 0.5, indicating that a DNN model can be an effective prediction model for CRC. Our government has already launched a nationwide screening program of CRC since 2004 and a free fecal immunochemical test is offered biennially to individuals aged 50 to 75 [39]. According to our findings, the program may extend to cover patients with T2DM beyond this age range. Table 2 indicates that T2DM patients with hypertension, congestive heart failure, colorectal polyps, COPD, CAD, and irritable bowel syndrome had a significantly higher risk of CRC compared with T2DM patients without the corresponding underlying diseases. On the contrary, T2DM patients with hyperlipidemia, obesity, smoking, alcohol-related illness and CKD had a significantly lower risk of CRC compared with T2DM patients without the corresponding underlying diseases. As all the listed comorbidities were suggested risk factors for either T2DM or CRC or both, the findings presented here may represent the effects of intricate mechanisms among risk factors, T2DM and CRC.
The main strength of this study lies in a use of a population-based cohort with a large and nationally representative sample, which increases its generalizability in Taiwan. However, we have to acknowledge several limitations as below. Firstly, detection bias could exist because patients with T2DM are supposed to have more clinical visits than the general population and Lewis found that diabetic patients receiving medications, particularly in the first year of diagnosis, are more likely to undergo lower endoscopic examinations [40] and we can expect an overestimation of the incidence of CRC among these groups. However, Taylor stated that diabetic patients have a significantly poorer response to colonoscopy bowel cleansing preparations than nondiabetic patients [41] and it might lead to the decreased detection of CRC, resulting in an underestimation of the relationship between T2DM and CRC [11]. Secondly, inherent limitations of NHIRD hinder our ability to get some information related to the T2DM and CRC, such as smoking habits, alcohol consumption, body mass index (BMI), family history of T2DM or CRC, diet, and physical activity. To deal with this, we tried to use the comorbidities as surrogates for some risk factors of CRC, such as COPD and a stop-smoking clinic for smoking, alcohol-related illness for alcohol, and obesity for BMI. However, we should admit that the use of these comorbidities as surrogate risk factors for CRC did not allow any interpretation of data. Finally, unlike the traditional Cox proportional hazard model, our predictive models could not provide valued levels (e.g., 95% confidence intervals and p-values) to evaluate statistical significance; instead, we used recall, precision, F 1 , and AUROC to determine the predictor performance. The AUROC is 0.74 for the DNN, which is acceptable. Although these limitations suggest cautious interpretation of the results of the study, the diagnosis of T2DM and CRC is highly reliable which makes our results more convincing.

Conclusions
In conclusion, based on the DNN predictive model, our findings suggest that Taiwanese patients with T2DM were at an increased risk of developing CRC. Although we have a relatively successful screening policy for CRC, our findings might encourage the government to consider a slight policy modification regarding screening guidelines to include patients with T2DM with a wider age range. Memorial Funds, Japan. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. No additional external funding was received for this study.