Deep Learning with Multimodal Integration for Predicting Recurrence in Patients with Non-Small Cell Lung Cancer

Due to high recurrence rates in patients with non-small cell lung cancer (NSCLC), medical professionals need extremely accurate diagnostic methods to prevent bleak prognoses. However, even the most commonly used diagnostic method, the TNM staging system, which describes the tumor-size, nodal-involvement, and presence of metastasis, is often inaccurate in predicting NSCLC recurrence. These limitations make it difficult for clinicians to tailor treatments to individual patients. Here, we propose a novel approach, which applies deep learning to an ensemble-based method that exploits patient-derived, multi-modal data. This will aid clinicians in successfully identifying patients at high risk of recurrence and improve treatment planning.


Introduction
Non-small cell lung cancer (NSCLC), which accounts for 85% of lung cancer cases [1], is one of the most common and fatal cancer types worldwide [2]. While the most common treatment strategy for NSCLC patients is surgery, a high probability of tumor recurrence following surgical resection typically results in a bleak prognosis [3]. In fact, early-stage NSCLC patients (Stage I, II, IIIA) who underwent resection surgery have recurrence rates of 40% (stage I), 66% (stage II) and 75% (stage IIIA) [4][5][6]. Therefore, the ability to accurately predict NSCLC relapse is extremely critical considering that, as cancer progresses, the five-year survival rates decrease sharply, from 40% of stage I patients to only 1% of stage IV patients [7,8]. Thus, the ability to identify patients with a high risk for recurrence following surgical resection is critically important because it allows clinicians to determine which patients may benefit from adjuvant therapies [9], and thus create more effective personalized treatments.
However, for patients with NSCLC, crafting treatment strategies that include adjuvant therapies is difficult [10,11]. Of the several common factors known to be associated with NSCLC relapse in early NSCLC, such as tumor size (T-stage), nodule involvement (N-stage) and smoking history [12][13][14], the tumor node (TN) staging system is traditionally used as a postoperative prognostic factor [15,16]. The TN staging system provides medical professionals with a framework for understanding the prognosis, treatment options, and value of new interventions allowing for the best possible care for lung cancer patients [17]. However, because tumor size is determined qualitatively, accurate TN staging can be limited due to human error [18]. The TN staging system is further limited in predicting clinical outcomes due to the lack of a clear rank order by stage [19]. Thus, the current staging system for NSCLC is insufficient for predicting treatment outcomes, illustrating the • Neural network-based recurrence prediction models exploiting (1) clinical information, (2) radiomics, and (3) deep learning-based features, respectively, were developed. • The performance of ensemble models using various combinations of the three patientderived multi-modal features was evaluated. • The ensemble model using all three features showed the best performance. • A first recurrence prediction model for early NSCLC through the integration of the aforementioned three features was proposed.

Patient Data Acquisition
Our patient data was provided by two separate institutions. One is the Veterans Health Service (VHS) Medical Center, a hospital that treats ordinary citizens to national merit. The other institution we obtained data from was the Cancer Imaging Archive (TCIA) database [30]. Clinical data and CT images were collected from VHS NSCLC patients under an Institutional Review Board (IRB)-approved protocol and merged with new NCSLC patient data from the TCIA database that includes lung CT images and clinical information of patients [31]. Since the TCIA is a publicly available database without patient identifiers, IRB approval is not required. Among the two TCIA datasets, R01 and AMC, the AMC dataset was excluded because it did not have pertinent clinical information such as the TN stage. All subjects from all datasets were pathologically confirmed to have lung squamous cell carcinoma (LUSC) or lung adenocarcinoma (LUAD). Patients with a history of multiple surgeries, with missing information related to the research, or referred to as "inadequate cases" by the experts were excluded. The demographic and clinical characteristics of all 326 NSCLC patients are provided in Table 1, including the age, TN stage, and recurrence.

Image Processing
Lung CT images of the dataset utilized in this study were acquired with various multidetector CT (MDCT) scanners. To prevent problematic spatial information lacking uniformity, when using convolution layers for deep learning, CT scans were preprocessed as summarized in Figure 1. First, CT scans were standardized to Hounsfield Units (HU), which is a quantit indicator of the x-ray attenuation degree for each pixel in a CT image. The intensity image is related to tissue density and can be measured in HU, allowing for the direct parison of images from different sources [20]. After standardizing the pixel values o  First, CT scans were standardized to Hounsfield Units (HU), which is a quantitative indicator of the X-ray attenuation degree for each pixel in a CT image. The intensity in CT image is related to tissue density and can be measured in HU, allowing for the direct comparison of images from different sources [20]. After standardizing the pixel values of the target images into HU, a defined HU value was selected to display the images in grayscale. In our study, the window level and window width were set to −600 and 1500 HU, respectively, the values regularly used for general lung CT image display [32]. Furthermore, isotropic voxel resampling was performed to compensate for CT slide thickness variation and ensure uniform isotropic resolution of all CT images. Finally, regions of interest (ROI) images to be used for CNN analysis were generated based on the tumor segmentation labels, and all non-tumor pixels in the ROI images were set to zero.

Feature Extraction and Selection
Clinical and pathologic characteristics/features including clinical characteristics, laboratory results, and pathologic conditions were collected from the patient records. Of the features available for examination, only the clinical features routinely obtained from various institutions were considered. Additionally, we excluded features with a missing value in >25% of cases at either institution. Among the remaining pathologic features, we utilized some traditionally accepted variables with well-known prognostic power; these included histology, TN stage, age, and overall stage [33]. Specifically, visceral pleural invasion and lymphovascular invasion have been found to be two critical indicators for risk of recurrence [29,34]. The final clinical features utilized in our predictive model are listed in Table 2. Table 2. Clinical variable used in the first neural network model.

Clinical Features
LUAD/LUSC, Age, Overall stage, T stage (T1, T2, T3), N stage (N0, N1, N2), Pathology-visceral pleural (+/−), Pathology-lymphovascular invasion (+/−) Next, to extract HCR features, volumetric tumor contours in lung CT images from VHS medical center and TCIA were manually segmented by two experienced radiologists (with more than 10 years of experience in lung diagnosis) of VHS medical center and Ewha Womans University Seoul Hospital, respectively, following the RTOG 1106 contouring guideline [35,36]. A total of 1668 HCR features were extracted from 3D ROI using Pyradiomics [37], which is an open-source package in Python. HCR features can be divided into four categories, shape, first-order statistics, second-order statistics, and high-order statistics, which are obtained after applying filters or mathematical transforms to the images. Attempting to use all the numerous extracted HCR features makes developing effective classification models very challenging. This is due to a high probability that many features are redundant and/or highly correlated, which can lead to over-fitting the results and affect the final performance of the network. Furthermore, Cox-proportional hazard regression was employed to identify the subset of HCR features with the most discriminating power for a given task and the greatest reduction in the data dimension. This analysis was done in a univariate way by applying a 0.05 p-value threshold. As such, we excluded variables with high correlation (>0.95) and near-zero variance, as they did not provide advantageous information to our model. As a result, 157 out of 1668 HCR features were selected as training variables. Table 3 shows the details of the selected HCR features by category.
Lastly, DLR features were not explicitly designed like HCR features. However, convolutional neural networks trained on CT images to perform a pre-defined task of predicting cancer recurrence were able to learn filters that function as edge detectors in the early layers. In the input, deeper layers respond to more complex patterns that resemble texture, shapes, or compositions of earlier features. The DLR features obtained from deep layers contain more abstract predictive patterns of CT images, compared with the clinical and

Neural Network-Based Cancer Prediction Model
To obtain a well-organized ensemble model, we initially pretrained three neural network-based models with clinical data, HCR, and DLR, respectively, as shown in Figure 2. The clinical neural network was trained with the clinical features listed in Table 2, and the HCR neural network was learned from the selected HCR features shown in Table 3. The two neural network architectures mainly consisted of dense layers, batch normalization layers, dropout layers, and rectified linear units (ReLU), excluding the last layer, which underwent sigmoid optimization with Adam. While the clinical neural network had three pairs of dense-batch normalization-activation-dropout with the shape of increasing and decreasing the number of nodes, the HCR neural network had four pairs without batch normalization layer. The DLR convolutional neural network model input was a 2D ROI image on a representative axial slice, where tumor pixels were the largest within the entire volume of the CT image of each patient. Using this network, a large diversity of features was learned, especially through non-linearity operations, as images were passing through convolutional blocks. The DLR convolutional neural network contained four convolution layers and two fully connected layers. The Max-pooling layer, parametric PReLU as an activation function, and batch normalization layer followed after convolution layers, and 0.5 of dropout probability was included between the fully connected layers to avoid overfitting. A detailed composition is shown in Figure 3. The training was optimized with stochastic gradient descent and began with an initial learning rate of 0.01 and a momentum of 0.5. This learning rate fell 0.8×, to at most 0.0005 when the validation loss stopped diminution for a time. The model provided automatically augmented images from the original input images. These images were generated in simple ways including rotating (~20°), flipping (vertically or horizontally), or shifting. The DLR convolutional neural network model input was a 2D ROI image on a representative axial slice, where tumor pixels were the largest within the entire volume of the CT image of each patient. Using this network, a large diversity of features was learned, especially through non-linearity operations, as images were passing through convolutional blocks. The DLR convolutional neural network contained four convolution layers and two fully connected layers. The Max-pooling layer, parametric PReLU as an activation function, and batch normalization layer followed after convolution layers, and 0.5 of dropout probability was included between the fully connected layers to avoid overfitting. A detailed composition is shown in Figure 3. The training was optimized with stochastic gradient descent and began with an initial learning rate of 0.01 and a momentum of 0.5. This learning rate fell 0.8×, to at most 0.0005 when the validation loss stopped diminution for a time. The model provided automatically augmented images from the original input images. These images were generated in simple ways including rotating (~20 • ), flipping (vertically or horizontally), or shifting.
CT image of each patient. Using this network, a large diversity of features was learned, especially through non-linearity operations, as images were passing through convolutional blocks. The DLR convolutional neural network contained four convolution layers and two fully connected layers. The Max-pooling layer, parametric PReLU as an activation function, and batch normalization layer followed after convolution layers, and 0.5 of dropout probability was included between the fully connected layers to avoid overfitting. A detailed composition is shown in Figure 3. The training was optimized with stochastic gradient descent and began with an initial learning rate of 0.01 and a momentum of 0.5. This learning rate fell 0.8×, to at most 0.0005 when the validation loss stopped diminution for a time. The model provided automatically augmented images from the original input images. These images were generated in simple ways including rotating (~20°), flipping (vertically or horizontally), or shifting. In the end, a machine learning-based analyzer was employed to combine the predictive outputs (cancer recurrence probability [0, 1]) of each neural network-based model to improve the accuracy of prior models (i.e., ensemble learning). The combined outputs went through a different machine learning-based ensemble analyzer for making the final In the end, a machine learning-based analyzer was employed to combine the predictive outputs (cancer recurrence probability [0, 1]) of each neural network-based model to improve the accuracy of prior models (i.e., ensemble learning). The combined outputs went through a different machine learning-based ensemble analyzer for making the final decision. The final analyzer was determined based on the performance of several representative analyzers. These tested analyzers included random forests, logistic regression, support vector machine, Gaussian naive Bayes, linear discriminant analysis, quadratic discriminant analysis, and hard and soft voting classifier. The research workflow is demonstrated in Figure 2.

Model Assessment
We applied a 5-fold cross-validation method with shuffled training data and evaluated the predictive abilities of each designed neural network with clinical features, HCR, and DLR, as well as the final proposed ensemble model. Moreover, we compared the performance of our proposed three neural network models with pre-existing models. The Cox proportional hazards (PH) model [38], traditionally used to predict the clinical outcomes or hazard functions corresponding to specific time units [39], was employed to compare predictive power using clinical variables. Selected HCR and DLR of CT images were utilized by more recent machine learning-based prediction models, random survival forest [40] and convolutional neural network [41], respectively. In addition, a neural network model with TN staging was developed to use as a baseline for performance evaluation.
The performance of the baseline model (TN staging), individual data feature models (clinical data, HCR, and DLR), and the all three data feature model (clinical data, HCR, and DLR) were evaluated using quantitative metrics including F1 score, precision, recall, and accuracy from the confusion matrix. Precision was defined as the ratio of the actual positive of the model determined to be positive. Recall was determined as the ratio of how many positive observations were found in the actual positive observations. Thus, in general, because the precision and recall values of the algorithms are inversely related, they should be considered simultaneously for evaluating the classification performance. Moreover, the F1 score is the weighted average of precision and recall, thereby making it the single numeric representation of the algorithm's precision and recall. Finally, accuracy refers to the ability of the algorithm to precisely judge positive and negative.
The results of the 5-fold cross-validation were also used to draw an ROC curve with an AUC value for each model under the same condition. An ROC curve is a plot of the true positive rate against the false-positive rate at various threshold settings. The ROC curve is crucial for future medical applications because it evaluates the diagnostic ability of tests to discriminate a subject's true state and finds optimal cut off values [42][43][44][45].
To determine if there was a significant difference between the patients classified by the model, we generated Kaplan-Meier curves with a log-rank test to analyze using one of the 5-fold results. The Kaplan-Meier estimates are one of the best methods used to calculate the proportion of patients who do not relapse or survive a certain period of time after treatment [46]. Lower p-values of log-rank test are correlated with less Kaplan-Meier curve overlaps, and clearer distinction.

Quantitative Measurements
The 5-fold cross-validation quantitative results including F1-score, precision, recall, and accuracy are displayed in Table 4. Our results show that our proposed ensemble model that incorporated clinical, HCR, and DLR performed better than all other current methods. Notably, the conventional method, the Cox PH model, underperformed the most extensively. Interestingly, clinical and HCR features had better prognostic power of recurrence than DLR features. Of the machine learning algorithms tested, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and logistic regression (LR) were most successful in predicting recurrence. Table 4. Mean and standard deviation values of the 5-fold cross validation in terms of F1 score, precision, recall and accuracy. The first three rows reported the implementation results of other competitive methods using each of our dataset. Type of machine learning model was recorded if the model of the row was ensembled. The number of trainable parameters is displayed in thousands in the rightmost column. The models that shows significant difference with the TN stage (baseline) for all evaluation metrics are marked with an asterisk (*).

ROC Curve with AUC Value
Based on the ROC curve with AUC value, experimental results with the 5-fold crossvalidation indicate that the proposed ensemble algorithm had the highest predictive power when the clinical data, HCR, and DLR were included together, as opposed to the use of individual features (Figure 4). The proposed HCR neural network model showed the next best performance with a slightly lower value of AUC, compared to the ensemble model. Figure 5 shows the Kaplan-Meier curves derived from each model's lung cancer recurrence risk. The predicted to recur group indicates a high risk of recurrence, whereas the predicted not to recur group reveals a low or intermediate-risk. Of the 65-66 test data in one-fold, an average of 9.8 cases (SD 1.7, 15.1% of one test set) of the incorrect predictions from the TN stage neural network (baseline), 7.0 cases (SD 0.9, 10.8% of one test set) of the incorrect predictions from the Clinical neural network, 3.2 cases (SD 1.9, 4.9% of one test set) of the incorrect predictions from the HCR neural network, and 6.6 cases (SD 2.41, 10.2% of done test set) of the incorrect predictions from the DLR convolutional neural network were correctly classified by our proposed ensemble model.   Figure 5 shows the Kaplan-Meier curves derived from each model's lung cancer recurrence risk. The predicted to recur group indicates a high risk of recurrence, whereas the predicted not to recur group reveals a low or intermediate-risk. Of the 65-66 test data in one-fold, an average of 9.8 cases (SD 1.7, 15.1% of one test set) of the incorrect predictions from the TN stage neural network (baseline), 7.0 cases (SD 0.9, 10.8% of one test set) of the incorrect predictions from the Clinical neural network, 3.2 cases (SD 1.9, 4.9% of one test set) of the incorrect predictions from the HCR neural network, and 6.6 cases (SD 2.41, 10.2% of done test set) of the incorrect predictions from the DLR convolutional neural network were correctly classified by our proposed ensemble model.

Discussion
Despite surgical resection being the major treatment for NSCLC, adjuvant chemotherapy is regularly employed as standard post-surgical care. Adjuvant chemotherapy is critical due to the high probability of occult micro-metastases present at the time of surgical intervention [47]. Importantly, post-surgery chemotherapy attempts to eliminate any metastasized cancer cells and prevent recurrence. Although chemotherapy is correlated with an increase in overall survival in early-stage NSCLC patients [48][49][50], questions

Discussion
Despite surgical resection being the major treatment for NSCLC, adjuvant chemotherapy is regularly employed as standard post-surgical care. Adjuvant chemotherapy is critical due to the high probability of occult micro-metastases present at the time of surgical intervention [47]. Importantly, post-surgery chemotherapy attempts to eliminate any metastasized cancer cells and prevent recurrence. Although chemotherapy is correlated with an increase in overall survival in early-stage NSCLC patients [48][49][50], questions about whether adjuvant chemotherapy should be accompanied in cases of patients with high predictive risk of recurrence still remains.
Considering that the recurrence of NSCLC typically occurs at sites distal to the original tumor [51,52], patients at high risk of recurrence should be treated similarly to advanced NSCLC patients. As such, the ability to predict recurrence is very critical in early-stage (i.e., IA) patients. In these cases, patients with high probabilities of recurrence can be treated with adjuvant therapies as prevention measures. Conversely, with high powered predictive models, patients with a low predictive risk of recurrence can avoid unnecessary adjuvant therapies. Therefore, it is critical to have powerful predictive models to differentiating high risk and low-risk patients. With this information, patients can make an informed decision about their treatment options. The ensemble model we proposed here showed an 11.69% higher accuracy than the TN staging-based algorithm (baseline) as shown in Table 4. Additionally, the proposed model was able to accurately predict 15.1% of cases where the baseline algorithm failed (see Section 3.3). The high accuracy of our algorithm in predicting recurrence can aid clinicians in guiding appropriate treatment strategies for each patient.
Considering that the TN staging has limitations such as patient to patient recurrence rate variations [29], this study attempted to develop and validate an NSCLC recurrence prediction model for patients with surgically resected lungs. Using the same data sets, the three proposed neural network-based models (Clinical neural network, HCR neural network, and DLR convolutional neural network) were more accurate in predicting recurrence than existing representative models [38,40,41].
Moreover, we hypothesized that deep learning models learned using valid clinical variables, features extracted explicitly (i.e., HCR) and automatically induced (i.e., DLR) from medical images, could be applied back to the ensemble model to produce more accurate and refined results. Our results demonstrate that the proposed ensemble model that utilized all data (clinical data, HCR, and DLR) outperformed proposed models that used individual data. For the case of this study, the proposed final ensemble-based model was able to accurately predict the recurrence for the cases of 4.9% to 15.1%, where the models trained with single data-type features failed to predict. Therefore, the proposed method can help more accurately identify patients with a high risk of recurrence, and thus can significantly reduce erroneous postoperative cancer treatment decisions due to incorrect recurrence prediction. According to Table 4, the computational complexity of DLR networks is much higher than the other single-modality and TN stage (baseline) models. HCR networks utilize more features and therefore have slightly higher computational complexity than TN (baseline) and clinical models. In addition, the proposed algorithm integrates multimodal features through a machine learning-based analyzer, thus exhibiting computational complexity similar to that of DLR model. Although our proposed ensemble models have higher model complexity compared to the baseline, the inference time per image with increasing complexity is several milliseconds. Given that it can take up to minutes for a clinician to review a patient's multimodal information and draw a comprehensive conclusion, the increase of several milliseconds is considered negligible.
Proper cancer prognosis and treatment is a very difficult and complex task for oncologists. Oncologists must choose from a wide range of treatment options to fit a patient's clinical and pathological data [53]. Therefore, the algorithm for predicting relapse using only TN stage information or CT images is far removed from actual clinical decision making; however, the proposed ensemble model that utilizes all three available features (clinical data, HCR, and DLR) is more clinically relevant.
Several limitations of our study need to be addressed. First, we did not account for the possibility of overlapping among heterogeneous features. While we removed features with a high correlation within the HCR features, the correlation between each network input feature was not analyzed. This is likely to impede the final model performance and involves a process that is less efficient. Second, the DLR convolutional neural network model used a representative 2D slice rather than a full 3D volumetric image, indicating a possible loss of information. However, the effective through-plane (z)-resolution is typically much lower than the in-plane resolution of the axial slice, thus, the use of 2D slices may still produce better results [41]. Several studies have attempted to improve the z-axis resolution [54][55][56]; therefore, as research in this field develops, the algorithm can be improved upon by incorporating a 3D CT image. Thus, in future studies, we plan to develop a more robust model that utilizes a complete 3D CT image, as well as eliminating redundancy between heterogeneous data. Furthermore, we plan to develop a model that utilizes all three different neural networks simultaneously.

Conclusions
Here, we present a machine learning-based ensemble methodology with various NSCLC patients' clinical and pathological data that predicts cancer recurrence after surgical resection. Specifically, we utilized clinical variables and radiomic features, which included both human-designed (HCR) and automatically learned and extracted by the DLR convolutional neural networks features. Each of these showed meaningfully significant predictive power. The three proposed models for clinical data, HCR, and DLR more accurately predicted recurrence than existing representative models using the same individual data. Each neural network-based model produced outcomes that were combined and subsequently put through a different machine learning-based ensemble analyzer to further improve accuracy and making the final decision. The overall performance of our suggested ensemble model was greater than models that did not use all three data types (clinical data, HCR, and DLR). Moreover, the proposed model was both efficient and robust, as evidenced by the k-fold cross-validation. We believe the better prediction capabilities of our system can improve the care of NSCLC patients and allow for more efficient decision making for treatment.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: NSCLC non-small cell lung cancer LUAD lung adenocarcinoma LUSC lung squamous cell carcinoma TN tumor node HCR handcrafted radiomics DLR deep learning-based radiomics CNN convolutional neural network PH proportional hazards SD standard deviation