Development of a Deep Learning Model for Malignant Small Bowel Tumors Survival: A SEER-Based Study

Background: This study aimed to apply a deep learning (DL) algorithm to develop a prognostic model and perform survival analyses in patients with malignant small bowel tumors (SBTs). Methods: The demographic and clinical features of patients with SBTs were extracted from the Surveillance, Epidemiology and End Results (SEER) database. Patients were randomly split into a training set and a validation set at a ratio of 7:3. Cox proportional hazards (Cox-PH) analysis and the DeepSurv algorithm were used to develop the models. The performance of the Cox-PH and DeepSurv models was evaluated using receiver operating characteristic curves, calibration curves, C-statistics and decision-curve analysis (DCA). Kaplan–Meier (K–M) survival analysis was performed to further examine the prognostic effect of the factors identified by the Cox-PH model. Results: Multivariate analysis showed that seven variables were associated with cancer-specific survival (CSS) (all p < 0.05). The DeepSurv model performed better than the Cox-PH model (C-index: 0.871 vs. 0.866). The calibration curves indicated that both models were well calibrated, and DCA showed similar net benefits for the two models. Moreover, according to the K–M analysis, surgery did not significantly improve survival in patients with ileal malignancy or N2 stage disease. Conclusions: This study developed a DeepSurv model that performed well in predicting CSS in patients with SBTs. It might offer insights for future research exploring other DL algorithms in cohort studies.


Background
Previous studies indicate that small bowel tumors (SBTs) are rare, with estimated incidence rates of 2.6 per 100,000 for males and 2.0 per 100,000 for females in 2019 [1]. Although the small intestine makes up most of the gastrointestinal tract, SBTs account for only 1-2% of all gastrointestinal tumors [2]. Nevertheless, the incidence of SBTs has been rising for decades, with an average annual increase of 1.8%, owing to remarkable progress in endoscopy and imaging [1]. Malignancy is common among SBTs, and the most frequent histological diagnoses are adenocarcinoma, neuroendocrine tumors, gastrointestinal stromal tumors (GISTs) and lymphoma [3]. SBTs are characterized by a long asymptomatic period, nonspecific symptoms and a high degree of malignancy [1,4]. As a result, studies focusing on the prognosis of SBTs remain limited.
Artificial intelligence (AI), including machine learning (ML) and deep learning (DL), has recently been applied in medicine [5][6][7][8][9]. Using an ensemble of ML models, Levi et al. [8] developed a model to identify patients with gastrointestinal bleeding who required transfusion. Skrede et al. [5] trained a DL model, based on ten-layer convolutional neural networks (CNNs), for risk stratification to support treatment decisions in colorectal cancer. AI-aided systems have thus proven useful in various clinical settings, e.g., risk stratification, personalized treatment and disease classification based on endoscopic images [10][11][12]. For survival analysis, Katzman et al. [13] proposed DeepSurv, a method based on deep neural networks, in 2018. They showed that DeepSurv outperformed state-of-the-art survival methods such as Cox-PH and random survival forests in prognostic prediction and could provide effective treatment recommendations. Since then, DeepSurv has been used in a variety of fields [14][15][16], e.g., the prognosis of lung cancer [17], the therapeutic efficacy of chemoradiation in head and neck cancer [18] and the detection of cognitive decline [19], all of which showed better performance than traditional Cox-PH models.
To date, several survival models have been proposed for predicting overall survival (OS) and cancer-specific survival (CSS) in SBTs [20][21][22][23][24]. However, these models all used the traditional Cox proportional hazards (Cox-PH) method, which is time-consuming because it relies on manual feature selection. Furthermore, the Cox-PH model is semiparametric and assumes that the log-hazard is a linear function of the covariates, even though the relationships between covariates and survival outcomes are often complex and nonlinear. Therefore, our study aimed to evaluate the feasibility of DL models for survival analysis in malignant SBTs, based on the Surveillance, Epidemiology and End Results (SEER) dataset [13].
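For reference, the two models differ only in how the log-risk is computed. Writing $h_0(t)$ for the unspecified baseline hazard, $\mathbf{x}_i$ for the covariates of patient $i$, $\boldsymbol{\beta}$ for the Cox coefficients and $g_{\theta}$ for a neural network with weights $\theta$ (notation introduced here for illustration):

```latex
\begin{aligned}
\text{Cox-PH:}\quad   & h(t \mid \mathbf{x}_i) = h_0(t)\,\exp\!\left(\boldsymbol{\beta}^{\top}\mathbf{x}_i\right) \\
\text{DeepSurv:}\quad & h(t \mid \mathbf{x}_i) = h_0(t)\,\exp\!\left(g_{\theta}(\mathbf{x}_i)\right)
\end{aligned}
```

Both models keep the proportional hazards assumption; DeepSurv relaxes only the requirement that the log-risk be linear in the covariates.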

Cohort Dataset
This study was based on the SEER database, which provides cancer statistics, including demographic and clinical data, in an effort to reduce the cancer burden [10,11,13,14]. SBTs were identified using the third edition of the International Classification of Diseases for Oncology (ICD-O-3), with site codes C17.0-C17.9. For cases diagnosed in 2018 and later, the SEER program applied Extent of Disease (EOD), a set of three items covering the primary tumor, lymph nodes and metastasis. Accordingly, patients diagnosed with SBTs in 2018 were included in this study. All patients were selected from the "SEER Incidence Dataset", which collects cancer incidence data from population-based cancer registries covering approximately 47.9% of the U.S. population, using SEER*Stat software (version 3.8.9.2). The exclusion criteria were as follows: (1) age under 18 years; (2) unknown survival duration; and (3) incomplete demographic data. We collected the following variables from the SEER database: age, sex, race, income, region, T stage, N stage, M stage, tumor size, primary tumor site, histological type and the number of lymph nodes removed. The primary endpoint was CSS, defined as the time from initial diagnosis to death from SBT. Figure 1 shows the flowchart of this study.
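The selection rules above can be expressed as a simple filter. The sketch below is illustrative only: the `Record` fields are hypothetical names chosen for this example (SEER*Stat exports use their own column labels and coding), and the demo records are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical record layout: field names are illustrative, not SEER*Stat column labels.
@dataclass
class Record:
    site_code: str                    # ICD-O-3 topography code, e.g. "C17.0"
    year_of_diagnosis: int
    age: int
    survival_months: Optional[float]  # None when survival duration is unknown
    sex: Optional[str]
    race: Optional[str]
    income: Optional[str]
    region: Optional[str]


# Topography codes in the range C17.0-C17.9 (small intestine)
SMALL_BOWEL_SITES = {f"C17.{d}" for d in range(10)}


def is_eligible(r: Record) -> bool:
    """Apply the inclusion and exclusion criteria described in the text."""
    if r.site_code not in SMALL_BOWEL_SITES or r.year_of_diagnosis != 2018:
        return False  # inclusion: small bowel primary diagnosed in 2018
    if r.age < 18:
        return False  # exclusion (1): patients under 18
    if r.survival_months is None:
        return False  # exclusion (2): unclear survival duration
    if any(v is None for v in (r.sex, r.race, r.income, r.region)):
        return False  # exclusion (3): incomplete demographic data
    return True


def select_cohort(records: List[Record]) -> List[Record]:
    return [r for r in records if is_eligible(r)]


# Usage with made-up records: only the first one passes every rule.
demo = [
    Record("C17.0", 2018, 64, 10.0, "Male", "White", "high", "West"),
    Record("C18.7", 2018, 64, 10.0, "Male", "White", "high", "West"),     # colon site
    Record("C17.2", 2018, 17, 10.0, "Female", "Black", "low", "South"),   # under 18
    Record("C17.1", 2018, 70, None, "Female", "Asian", "mid", "Northeast"),  # no survival time
    Record("C17.9", 2018, 55, 3.0, "Male", None, "mid", "West"),          # missing race
]
cohort = select_cohort(demo)
```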

Data Preprocessing
A total of 1987 patients were included in our study and were randomly split into the training set (n = 1398) and the validation set (n = 589) at a ratio of 7:3. Missing data, assumed to be missing at random, were handled by multiple imputation using the random forest method of the 'mice' package in R [25]. The imputed variables were tumor size (missing in 21.67% of the training set and 21.05% of the validation set) and the number of lymph nodes removed (4.94% and 2.89%, respectively).
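A minimal sketch of a reproducible random split, assuming patients are addressed by row index and using arbitrary seeds. Cutting exactly 70% of 1987 patients gives 1391/596 rather than the reported 1398/589, so the authors' exact procedure may have differed (for example, per-patient random assignment); this sketch only illustrates the idea.

```python
import random
from typing import List, Tuple


def split_indices(n: int, train_fraction: float, seed: int) -> Tuple[List[int], List[int]]:
    """Randomly partition patient indices 0..n-1 into two disjoint lists."""
    rng = random.Random(seed)            # local RNG so the split is reproducible
    idx = list(range(n))
    rng.shuffle(idx)
    n_first = round(n * train_fraction)  # size of the first (training) partition
    return sorted(idx[:n_first]), sorted(idx[n_first:])


# 7:3 split of the full cohort (seed value is arbitrary)
train_idx, valid_idx = split_indices(1987, 0.7, seed=42)

# The same helper covers the 8:2 internal split of the training set used for DeepSurv tuning
inner_train_pos, inner_valid_pos = split_indices(len(train_idx), 0.8, seed=7)
internal_train = [train_idx[i] for i in inner_train_pos]
internal_valid = [train_idx[i] for i in inner_valid_pos]
```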

Development of the Cox-PH Model
Univariate and multivariate Cox regression analyses were used to select important features and develop a prognostic model. Variables with a two-sided p-value of less than 0.05 in the univariate analysis were entered into the multivariate analysis. The performance of the model for 3-, 6- and 9-month survival was evaluated with time-dependent receiver operating characteristic (ROC) curves, calibration curves, Harrell's C-statistic and decision-curve analysis (DCA) in the training and validation sets.
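Harrell's C-index, used for both models, is the proportion of comparable patient pairs whose predicted risks are correctly ordered. A minimal pure-Python sketch follows; it assumes a higher score means higher risk (for the Cox-PH model this would be the linear predictor, for DeepSurv the network output) and, as a simplification, skips pairs with tied survival times.

```python
from typing import Sequence


def harrell_c_index(times: Sequence[float], events: Sequence[int],
                    risk_scores: Sequence[float]) -> float:
    """Harrell's concordance index.

    A pair (i, j) is comparable when patient i had an observed event and
    patient j survived longer (times[j] > times[i]). The pair is concordant
    when patient i also has the higher predicted risk; tied risks count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier event of a pair
        for j in range(n):
            if times[j] > times[i]:  # pairs with tied times are skipped here
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of risks.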

Development of the DeepSurv Model
DeepSurv is a feed-forward neural network-based extension of the Cox proportional hazards model, consisting of an input layer, multiple hidden layers and an output layer, which predicts survival outcomes. Patients' baseline data are fed into the network through the input layer. Each hidden layer consists of a fully connected layer of nodes followed by a dropout layer. The output layer is a single node that produces each patient's estimated log-risk, which takes the place of the linear predictor of the Cox-PH model. To develop the DL model, a four-layer neural network, we randomly split the training cohort into an internal training set and an internal validation set at a ratio of 8:2, and we used the validation cohort to evaluate the model's efficacy. During training, we tuned the hyperparameters and related settings, including input standardization, the Rectified Linear Unit (ReLU) activation function, the Adaptive Moment Estimation (Adam) optimizer and learning-rate scheduling [13]. Early stopping was used to prevent overfitting. Model performance was evaluated on the validation set using Harrell's C-index, ROC curves, calibration curves and DCA.
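DeepSurv is trained by minimizing the negative Cox log partial likelihood, with the network output taking the place of the linear predictor. The sketch below computes this loss in plain Python for illustration only (Breslow-style risk sets, no L2 penalty, no gradients or batching); to our understanding, pycox's `CoxPH` model, which the package describes as DeepSurv, optimizes a loss of this form in PyTorch.

```python
import math
from typing import Sequence


def neg_log_partial_likelihood(times: Sequence[float], events: Sequence[int],
                               log_risks: Sequence[float]) -> float:
    """Average negative Cox log partial likelihood.

    log_risks[i] is the network output for patient i. For each observed event,
    the risk set contains every patient still under observation (times[j] >= times[i]).
    """
    n_events = sum(1 for e in events if e)
    if n_events == 0:
        raise ValueError("need at least one observed event")
    total = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored patients enter only through the risk sets
        risk_set = [log_risks[j] for j, t_j in enumerate(times) if t_j >= t_i]
        m = max(risk_set)
        # log-sum-exp trick for numerical stability
        log_sum = m + math.log(sum(math.exp(h - m) for h in risk_set))
        total += log_risks[i] - log_sum
    return -total / n_events
```

Assigning higher log-risks to patients who die earlier lowers the loss, which is what drives the network toward a correct risk ordering.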

Statistical Analysis
Continuous variables were presented as the mean ± standard deviation (SD) if normally distributed and as the median otherwise. Categorical variables were described as frequencies. Baseline characteristics were compared using Student's t-test or nonparametric tests for continuous variables and chi-square tests for categorical variables. Kaplan-Meier (K-M) survival analysis with the log-rank test was used to evaluate the prognostic effect of the independent factors identified by multivariate Cox regression analysis. Statistical analyses were performed using R (version 4.0) with the 'MASS' package and Python (version 3.9) with the 'pycox' package. A two-sided p < 0.05 was considered statistically significant.
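The product-limit (K-M) estimator and the two-group log-rank test can be sketched in plain Python as follows. This is an illustrative implementation, not the routines used in the study; the p-value uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal, so its upper tail equals erfc(sqrt(x / 2)).

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Product-limit estimate S(t), reported at each distinct event time."""
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # events plus censorings at each time
    at_risk = len(times)
    surv, curve = 1.0, []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # patients censored at t still count as at risk at t
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= leaving[t]
    return curve


def log_rank_test(times_a, events_a, times_b, events_b) -> Tuple[float, float]:
    """Two-group log-rank test; returns (chi-square statistic with 1 df, p-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({ti for ti, e, _ in data if e})
    observed_minus_expected = 0.0  # for group A
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)
        n = sum(1 for ti, _, _ in data if ti >= t)
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d = sum(1 for ti, e, _ in data if ti == t and e)
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # hypergeometric variance of d_a at this event time
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank variance is zero")
    chi2 = observed_minus_expected ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value
```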

Performance of the Cox-PH Model
The dataset was randomly split into the training set (n = 1398) and the validation set (n = 589) at a ratio of 7:3. Multivariate Cox regression analysis identified independent prognostic factors for CSS, including age, race, primary tumor site, histological type, N stage, surgery and the number of retrieved lymph nodes (Figure 2). Asian race, tumor location in the ileum and N2 stage disease were associated with poor prognosis (p < 0.05). Histological type had the greatest impact on prognosis, with adenocarcinoma and other types (neuroendocrine carcinoma, squamous cell carcinoma, signet ring carcinoma, etc.) contributing the most (p < 0.001). ROC curves (Figure 3A) were constructed to evaluate the discrimination of this model on the validation set, with areas under the ROC curve (AUCs) of 0.874, 0.922 and 0.908 for 3-, 6- and 9-month survival probability, respectively. The calibration curves for 3-, 6- and 9-month CSS in the validation set showed minor deviations from the reference line (Figure 4A-C), with Brier scores of 0.282, 0.265 and 0.265, respectively. The C-index, AUCs and Brier scores for the training set are presented in Table 2. DCA curves of the Cox-PH model for 3-, 6- and 9-month CSS (Figure 5A-C) indicated a net benefit of approximately 8-10%.
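DCA compares the net benefit of acting on a model's predictions at each risk threshold pt, where net benefit = TP/n - (FP/n) * pt / (1 - pt). The sketch below assumes a binary outcome at a fixed horizon (for example, cancer-specific death by 9 months) with every patient's status known; survival-based DCA usually adjusts for censoring with K-M estimates, which this simplification omits.

```python
from typing import Sequence


def net_benefit(predicted_risk: Sequence[float], had_event: Sequence[int],
                threshold: float) -> float:
    """Net benefit of treating patients whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be in (0, 1)")
    n = len(predicted_risk)
    tp = sum(1 for p, y in zip(predicted_risk, had_event) if p >= threshold and y)
    fp = sum(1 for p, y in zip(predicted_risk, had_event) if p >= threshold and not y)
    # false positives are penalized by the odds of the threshold probability
    return tp / n - fp / n * threshold / (1.0 - threshold)


def treat_all_net_benefit(had_event: Sequence[int], threshold: float) -> float:
    """Reference strategy that treats every patient (treat-none has net benefit 0)."""
    prevalence = sum(had_event) / len(had_event)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

A DCA curve is obtained by evaluating these functions over a grid of thresholds and plotting the model against the treat-all and treat-none references.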

Performance of the DeepSurv Model
The aforementioned training set (n = 1398) was randomly split into an "internal training" set (n = 1129) and an "internal validation" set (n = 269) at a ratio of 8:2 to identify the best set of hyperparameters. The validation set (n = 589) served as the "testing" set for evaluating the performance of the final DeepSurv model. We tuned DeepSurv's hyperparameters by referring to previous studies [7,13], as described in Table S1. With the best-performing hyperparameters (four layers with 32, 64, 128 and 256 nodes, a dropout rate of 0.2 and a learning rate of 0.020092), the final model was evaluated on the validation set (n = 589). The C-index was 0.871, and the AUCs were 0.878, 0.891 and 0.891 for 3-, 6- and 9-month CSS, respectively; the 6- and 9-month AUCs were lower than those of the Cox-PH model, as shown in Table 2 and Figure 3B. Furthermore, the calibration curves of the DeepSurv model for 3-, 6- and 9-month CSS showed close agreement between predicted and observed survival in the validation cohort, with Brier scores of 0.058, 0.070 and 0.080, respectively, indicating a high degree of reliability (Figure 4D-F).
Figure 4. Calibration curves of the Cox-PH (A-C) and DeepSurv (D-F) models for 3-, 6- and 9-month cancer-specific survival.
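The Brier score is the mean squared difference between the predicted survival probability at a horizon t and the observed status. With censored data it is commonly computed with inverse probability of censoring weights (IPCW), using a K-M estimate G of the censoring distribution. The following is a minimal sketch of that estimator; the paper does not state which Brier-score variant was used, and pycox's `EvalSurv` class offers a library implementation.

```python
from typing import List, Sequence, Tuple


def censoring_km(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Kaplan-Meier steps of the censoring distribution G (censoring treated as the event)."""
    steps, g = [], 1.0
    for u in sorted(set(times)):
        censored = sum(1 for t, e in zip(times, events) if t == u and not e)
        if censored:
            at_risk = sum(1 for t in times if t >= u)
            g *= 1.0 - censored / at_risk
            steps.append((u, g))
    return steps


def g_at(steps: List[Tuple[float, float]], u: float, left_limit: bool = False) -> float:
    """Evaluate G(u), or its left limit G(u-) when left_limit is True."""
    g = 1.0
    for s, value in steps:
        if s < u or (s == u and not left_limit):
            g = value
        else:
            break
    return g


def ipcw_brier_score(times: Sequence[float], events: Sequence[int],
                     surv_at_t: Sequence[float], t: float) -> float:
    """IPCW Brier score at horizon t; surv_at_t[i] is the predicted S(t | x_i)."""
    steps = censoring_km(times, events)
    g_t = g_at(steps, t)
    total = 0.0
    for t_i, e_i, s_i in zip(times, events, surv_at_t):
        if t_i <= t and e_i:      # died of the disease by t: observed status is 0
            total += s_i ** 2 / g_at(steps, t_i, left_limit=True)
        elif t_i > t:             # still event-free at t: observed status is 1
            total += (1.0 - s_i) ** 2 / g_t
        # censored at or before t: weight 0, compensated by the IPCW denominators
    return total / len(times)
```

Without censoring, G stays at 1 and the estimator reduces to the ordinary mean squared error between predicted survival and observed status.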

K-M Survival Analysis Based on the Cox-PH Model
Survival analysis showed that older patients and patients with adenocarcinoma had poorer prognosis (p < 0.0001), whereas patients with tumors located in the ileum had relatively good prognosis (p < 0.001) (Figure S1A-C). We further conducted a subgroup analysis of the effect of surgery. Surgery was associated with significantly improved CSS regardless of histological type (Figure S1D, p < 0.0001). However, for patients with tumors located in the ileum and for those with N2 stage disease, surgery showed no significant benefit for prognosis (Figure S1E,F, p > 0.05).

Discussion
Owing to the low prevalence of SBTs, studies on their management are scarce. Effective methods for risk stratification and personalized therapy are therefore essential for improving survival in patients with SBTs.
The American Joint Committee on Cancer (AJCC) staging system is currently used to predict the survival of patients with SBTs. However, this system considers only TNM stage when predicting cancer risk, neglecting other important clinical features and demographic information. To improve predictive accuracy, recent studies have constructed various prognostic models for small bowel adenocarcinoma and neuroendocrine carcinoma using Cox regression analysis [20][21][22][24][26]. Wang et al. [20] built and validated nomograms to predict OS and CSS for small intestine adenocarcinoma (SIA), with higher sensitivity and specificity than the AJCC staging system. That study highlighted ten variables, including demographic data (age and marital status) and therapeutic regimens (surgery and chemotherapy). Zheng et al. [21] also fitted a Cox-PH model to predict the CSS of SIA, with a C-index of 0.728 in internal validation. Cox regression analysis achieves high accuracy and has become a popular method for prognostic prediction. Nevertheless, because it assumes a linear relationship between the covariates and the log-hazard, it may fail to capture complex nonlinear relationships, which require more flexible methods such as ML algorithms or deep neural networks.
The remarkable progress in AI facilitates the development of complex nonlinear models based on large clinical datasets for survival analysis. Adeoye et al. [7] found that the DeepSurv model had a higher C-index than traditional Cox regression analysis (0.95 vs. 0.83) for assessing the risk of malignant transformation in patients with oral leukoplakia and lichenoid lesions. As mentioned in the Background section, it is feasible to apply the DeepSurv algorithm to fitting prognostic models. However, to our knowledge, research using DL algorithms for prognostic prediction in SBTs is limited. Therefore, this study developed a DeepSurv model for predicting the CSS of patients with SBTs and compared the performance of the Cox-PH and DeepSurv models.
Our study found that the DeepSurv model performed better than the Cox-PH model in predicting survival for patients with SBTs, based on their C-indices. Additionally, the calibration curves showed that the risk estimated by DeepSurv was closer to the observed risk than that estimated by Cox-PH analysis, with lower Brier scores. DCA showed that the two models performed similarly in clinical practice, with net benefits of 8-10%. Taken together, the predictions of this DeepSurv model showed good discrimination and calibration. We attribute the advantages of the DeepSurv model to automated feature extraction from the inputs and its ability to analyze complex, large datasets [17,27,28].
The K-M analysis indicated that surgery was associated with improved prognosis in patients with SBTs. However, the subgroup analyses raised questions about the benefit of surgery in some patients. First, patients with ileal tumors had better CSS than those with duodenal or jejunal cancer in our study, consistent with the conclusion Nicholl et al. [29] drew from analyzing 1444 patients with small intestine adenocarcinoma. However, further analysis showed that surgery did not significantly improve survival in patients with ileal malignancy according to the K-M survival analysis. Second, surgery showed no significant benefit in patients with N2 stage disease, who were classified as stage IIIB-IV in our study. Several studies have shown that palliative chemotherapy can prolong survival in stage IV patients, regardless of whether surgery is performed [30,31]. Therefore, chemotherapy might be an alternative for patients with advanced SBT. Further research is needed to identify more effective and precise treatments for different subgroups of SBT patients.
Our study has some limitations. First, this was a retrospective study based on the SEER database, and the enrolled patients may be heterogeneous; external validation is needed in future research. Second, additional risk factors could be included in future research to better guide therapeutic regimens for SBT patients.

Conclusions
In conclusion, our study found that the DeepSurv model performed better than the traditional Cox-PH model in predicting the survival of patients with SBTs. The DL model showed feasibility and promise for clinical practice and might offer insights for future research exploring other DL algorithms in prognostic analysis.