Next Article in Journal
Placental Mesenchymal Dysplasia and Beckwith–Wiedemann Syndrome
Previous Article in Journal
Clinical Efficacy of Hypofractionated Proton Beam Therapy for Intrahepatic Cholangiocarcinoma
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Development and Validation of Novel Deep-Learning Models Using Multiple Data Types for Lung Cancer Survival

1
Clinical Data Center, Office of Data Science, Taipei Medical University, Taipei 110, Taiwan
2
Clinical Big Data Research Center, Taipei Medical University Hospital, Taipei Medical University, Taipei 110, Taiwan
3
Research Center of Health Care Industry Data Science, College of Management, Taipei Medical University, Taipei 110, Taiwan
4
International Ph.D. Program in Biotech and Healthcare Management, College of Management, Taipei Medical University, Taipei 110, Taiwan
5
Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, 250 Wu-Hsing Str., Xinyi Dist., Taipei 110, Taiwan
6
Office of Data Science, Taipei Medical University, Taipei 110, Taiwan
7
Graduate Institute of Data Science, College of Management, Taipei Medical University, Taipei 110, Taiwan
8
Department of Pathology, National Taiwan University Hospital, Taipei 100, Taiwan
9
Graduate Institute of Pathology, College of Medicine, National Taiwan University, Taipei 100, Taiwan
10
Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan
11
Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 110, Taiwan
12
Department of Radiology, College of Medicine, Taipei Medical University, 250 Wu-Hsing Str., Xinyi Dist., Taipei 110, Taiwan
*
Authors to whom correspondence should be addressed.
Cancers 2022, 14(22), 5562; https://doi.org/10.3390/cancers14225562
Submission received: 21 September 2022 / Revised: 3 November 2022 / Accepted: 10 November 2022 / Published: 12 November 2022
(This article belongs to the Topic Artificial Intelligence in Cancer Diagnosis and Therapy)

Abstract

:

Simple Summary

Previous survival-prediction studies have had several limitations, such as a lack of comprehensive clinical data types, testing only in limited machine-learning algorithms, or a lack of a sufficient external testing set. This lung-cancer-survival-prediction model is based on multiple data types, multiple novel machine-learning algorithms, and external testing. This predicted model demonstrated a higher performance (ANN, AUC, 0.89; accuracy, 0.82; precision, 0.91) than previous similar studies.

Abstract

A well-established lung-cancer-survival-prediction model that relies on multiple data types, multiple novel machine-learning algorithms, and external testing is absent in the literature. This study aims to address this gap and determine the critical factors of lung cancer survival. We selected non-small-cell lung cancer patients from a retrospective dataset of the Taipei Medical University Clinical Research Database and Taiwan Cancer Registry between January 2008 and December 2018. All patients were monitored from the index date of cancer diagnosis until the event of death. Variables, including demographics, comorbidities, medications, laboratories, and patient gene tests, were used. Nine machine-learning algorithms with various modes were used. The performance of the algorithms was measured by the area under the receiver operating characteristic curve (AUC). In total, 3714 patients were included. The best performance of the artificial neural network (ANN) model was achieved when integrating all variables with the AUC, accuracy, precision, recall, and F1-score of 0.89, 0.82, 0.91, 0.75, and 0.65, respectively. The most important features were cancer stage, cancer size, age of diagnosis, smoking, drinking status, EGFR gene, and body mass index. Overall, the ANN model improved predictive performance when integrating different data types.

1. Introduction

Lung cancer is the leading cause of cancer deaths worldwide [1]. Globally, there were around 2.21 million new cases of lung cancer and 1.80 million fatalities in 2020 [2]. One study reported that lung cancer incidence and mortality rates were 22.2 and 18.0 per 100,000 people in 2020, respectively [3,4]. Lung cancer can be divided clinically into two types based on histological features: non-small-cell lung cancer (NSCLC) and small-cell lung cancer (SCLC). NSCLC is the most common among them, accounting for 80–90% of lung cancers [5]. Cell deterioration and metastasis are slower in NSCLC than in SCLC. Around 70% of patients are diagnosed at an advanced stage, making surgical resection and complete treatment challenging [6,7].
Artificial intelligence (AI) has been increasingly used in medical research and clinical practice [8,9]. The accurate prediction of disease prognosis and the outcome of drug treatment, which may serve as a reference for treatment decision-making and drug selection, has become an essential topic in the clinical medicine [9,10]. Developing disease-risk and prognosis-prediction models using machine-learning or deep-learning algorithms with big data is a major area of AI-based academic research in the medical field [10,11]. Studies have used machine-learning and/or deep-learning algorithms to develop lung cancer risk and prognosis-prediction models [12,13,14,15]. Among them, Lai et al. [16] used 15 biomarkers with clinical data (including gene expression) from 614 patients to develop a deep neural network to predict the five-year overall survival of NSCLC patients.
This study aimed to develop survival-prediction models for lung cancer patients using a large number of samples, different data types, various machine-learning algorithms, and external testing. In addition to the basic clinical data (including demographic information, disease condition, comorbidity, and current medication), we examined the role of laboratory and genomic test results, which are generally not easy to obtain in predicting lung cancer survival. Moreover, we also explored the important predictors for developing prediction models.

2. Methods

2.1. Study Design and Data Source

We conducted a retrospective study in which we obtained data from the Taiwan Cancer Registry (TCR) database and the Taipei Medical University Clinical Research Database (TMUCRD). The TCR database was established in 1979 and is managed by Taiwan’s Health Promotion Administration, Ministry of Health and Welfare. It covers 98% of Taiwanese cancer patients and includes diagnosis and other related information. The TMUCRD retrieved data from various electronic medical records (EHR) of three hospitals, Taipei Medical University Hospital (TMUH), Wan-Fang Hospital (WFH), and Shuang-Ho Hospital (SHH). The database contains the electronic medical record data of 3.8 million people from 1998 to 2020, including structured data (e.g., basic information of patients, medical information, test reports, diagnosis results, treatment process, surgery, and medication history) and unstructured data (e.g., progress notes, pathology reports, and medical imaging reports) [17]. This study has been approved by the Joint Institute Review Board of Taipei Medical University (TMU-JIRB), Taipei, Taiwan (approval number N202101080). All the data were anonymous before conducting analysis.

2.2. Cohort Selection

This study selected patients with lung cancer (ICD-O-3 code: C33, C34) from 2008 to 2018 in the TCR database. Exclusion criteria included individuals under 20 years old, SCLC patients, and patients who did not have any medical history in the three hospitals (TMUH, WFH, SHH). Thus, a total of 3714 patients were included in this study, including 960 patients from TMUH, 1320 from WFH, and 1434 from SHH (Figure S1 in the Supplementary Materials).

2.3. Outcome Measurement

We ascertained the study outcomes using TMUCRD EHR and vital status data from the Taiwan Death Registry (TDR) [18]. We used the diagnosis date of NSCLC as the index date, and the outcome of this study was death within two years following diagnosis. Data were censored at the date of death or loss to follow-up, insurance termination, or the study’s end on 31 December 2018.

2.4. Feature Selection

Based on a literature review and consultation with clinicians, we selected features that may lead to the mortality of NSCLC patients to build prediction models. These features consisted of:
  • Demographic information: age, gender, body mass index (BMI), smoking, drinking;
  • Cancer conditions: tumor size and cancer stage;
  • Comorbidities: cardiovascular problems (i.e., myocardial infarction (MI), congestive heart failure (CHF), peripheral vascular disease (PVD), and cardiovascular disease (CVD)), dementia, chronic obstructive pulmonary disease (COPD), rheumatic disease, peptic ulcer disease (PUD), renal disease, liver disease, diabetes, anemia, depression, hyperlipidemia, hypertension, Parkinson’s disease, and Charlson Comorbidity Index (CCI) score. These conditions were considered if they were diagnosed in at least two outpatient claims or one hospitalization over a year before the cancer diagnosis date.
  • Medications: alimentary tract and metabolism, blood and blood-forming organs, cardiovascular system, genitourinary system and hormones, musculoskeletal system, nervous system, and respiratory system. We measured patients who had used medications by receiving them for more than a month (i.e., 30 days) during a year (i.e., 360 days) before the index date.
  • Laboratory tests: basophil, blood urea nitrogen (BUN), calcium, cholesterol, chloride, creatinine, eosinophil, ferritin, glucose AC, HbA1c, HCT, HGB, potassium, lymphocyte, MCH, MCHC, MCV, monocyte, sodium, neutrophil, platelet (PLT), RBC, triglyceride, and WBC. We only selected laboratory tests with a missing rate of less than 70% values a year before or a month after the index date.
  • Genomic tests: ALK, EGFR, KRAS, PDL1, and ROS1. We collected genomic tests if patients had ever taken one a month after the cancer diagnosis date.

2.5. Development of the Algorithms

This study established prediction models based on four modes and different algorithms:
  • The primary mode (e.g., Mode 1) included demographic information, cancer conditions, comorbidities, and medications.
  • The second mode (Mode 2) included the data from Mode 1 and the laboratory tests.
  • The third mode (Mode 3) included the data from Mode 1 and genomic tests.
  • The fourth mode (Mode 4) considered all the above features.
This study aims to predict the survival of lung cancer patients; therefore, the problem can be formulated as a classification model as it could occur in the same patients. We used possible machine-learning techniques such as logistic regression (LR), linear discriminant analysis (LDA), light gradient-boosting machine (LGBM), gradient-boosting machine (GBM), extreme gradient boosting (XGBoost), random forest (RF), AdaBoost, support vector machine (SVC), and artificial neural network (ANN). These methods are briefly introduced below.
Logistic Regression (LR): This is a discrete choice model that models the relationship between a response and multiple explanatory variables and is based on the concept of probability [19]. It is widely used and more practical in fields such as biostatistics, clinical medicine, and quantitative psychology. Its Equation (1) is:
y = e ( b 0 + b 1 X ) 1 + e ( b 0 + b 1 X )  
where x is the input value, y is the predicted output, b0 is the bias or intercept term, and b1 is the coefficient for input (x). In this study, we used the LR function with the parameter C (inverse of regularization strength) of 0.0001 to reduce the model’s overfitting.
Linear Discriminant Analysis (LDA): This is generally used to classify patterns between two classes; however, it can be extended to multiple patterns. LDA assumes that all classes are linearly separable, and according to the multiple linear discrimination functions representing several hyperplanes in the feature space are created to distinguish the classes [20]. In this study, we set the parameters’ shrinkage to ‘0’ and the solver to ‘lsqr’ to improve estimation and classification accuracy.
Light Gradient-Boosting Machine (LGBM): This is a gradient-boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages: faster training speed and higher efficiency; lower memory usage; better accuracy; support of parallel, distributed, and GPU learning; and capability to handle large-scale data [21]. The model’s class_weight parameter was set as ‘balanced’, which uses the output’s value to automatically adjust weights inversely proportional to class frequencies in the input data. The learning_rate, l1 regularization—reg_alpha, and l2 regularization—reg_lambda parameters were set as 0.05, 0.1, and 0.1, respectively.
Gradient-Boosting Machine (GBM): Gradient-boosting regression trees produce competitive, highly robust, and interpretable procedures for regression and classification. The ability of TreeBoost procedures to give a quick indication of potential predictability, coupled with their extreme robustness, makes them a useful preprocessing tool that can be applied to imperfect data [22]. The default parameters were used in this model.
Extreme Gradient Boosting (XGBoost): XGBoost, an efficient and scalable implementation of the gradient-boosting framework, is a machine-learning system for tree boosting. The scalability of XGBoost is attributed to several critical systems and algorithmic optimizations. These innovations include a novel tree-learning algorithm for handling sparse data; a theoretically justified weighted quantile sketch procedure allows the handling of instance weights in approximate tree learning [23]. The default parameters were used in this model.
Random Forest (RF): RF is an ensemble-learning method that operates by constructing many small scales of classification modules (most often decision trees) at the training time. The model outputs the class that combines the result of the individual modules based on some voting algorithms [24]. In this study, we set the parameters as follows: n_estimators (the number of trees) of 500, max_depth of 10, min_samples_split of 400, and class_weight of 0.5 for each class.
AdaBoost: The AdaBoost algorithm is an iterative procedure that combines several weak classifiers to approximate the Bayes classifier C∗(x). AdaBoost builds a classifier, e.g., a classification tree that produces class labels, starting with the unweighted training sample. If a training data point is misclassified, the weight of that data point is increased (boosted). A second classifier is built using the new weights, which are no longer equal. Again, misclassified training data have their weights boosted, and the procedure is repeated [25]. The number of estimators (n_estimators) used was 100.
Support Vector Machine (SVC): This is a machine-learning algorithm that can be applied to linear and nonlinear data. SVC transforms the original data to a higher dimension, from which it can use the super vectors in the training data set to find the hyperplane for categorizing the data. An SVC mainly identifies the hyperplane with the most significant margin, e.g., the maximum marginal hyperplane, to achieve higher accuracy [26]. The SVC can be represented by the following Equation (2):
f ( x ) = i = 1 N ( α i * α i ) K ( x , x i ) + Β
where K ( x , x i ) is the kernel function, α i , α i * 0 are the Lagrange multipliers, and Β is a bias term. In this study, we used a linear kernel for computations.
Artificial Neural Network (ANN): This is a learning algorithm vaguely inspired by biological neural networks. Computations are structured in terms of an interconnected group of artificial neurons, and these neutrons process information using a connectionist approach to computation. They are usually used to model complex relationships between inputs and outputs, find patterns in data, or capture the statistical structure [27]. The number of hidden layers with the number of neurons in each layer was set at 3 and 16, respectively. Additionally, for each layer, the l2 regularization of 0.01 and the ‘relu’ activation were used in the study. We set the ‘softmax’ activation for the output layer. We also used the ‘Adam’ optimizer, a highly performant stochastic gradient descent algorithm, and ‘binary_crossentropy’ as the binary classification outcome for the loss function.

2.6. Evaluating the Algorithms

The training dataset contained the data of patients from TMUH and WFH. The stratified 5-fold cross-validation was applied in the training set to assess the different machine-learning models’ performance and general errors. In other words, patients in the training set were divided into five groups, each used repeatedly as the internal validation set. We recruited data from SHH and used it for the external testing dataset to generalize the model.
The performance of the algorithms was measured by the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity (recall), specificity, positive predictive value (PPV, precision), negative predictive value (NPV), and F1-score. We defined the best model using the highest AUC by comparing various models based on the external testing set. Furthermore, we analyzed the feature’s contribution (i.e., the feature’s importance) of the best model using SHAP values (SHapley Additive exPlanations) [28].
All the data processing was performed using MSSQL server 2017 (Redmond, WA, USA), and the model training and testing were performed using Python version 3.8 (Wilmington, DE, USA) with scikit-learn version 1.1 (Paris, France) [29].

3. Results

3.1. Baseline Characteristics of Patients

We identified 3714 eligible lung cancer patients diagnosed for the first time and registered at the TCR. Among those patients, 2280 patients were included in the training dataset, whereas 1434 were in the testing dataset. Demographic characteristics, comorbidities, tumor size, tumor stage, genomic tests, medication uses, and laboratory tests are presented in Table 1. The mean (standard deviation, SD) ages and BMI of cohort patients were 68 (13.7) and 23.4 (4.33), respectively. Most of the patients were male (57.5%) with late-stage lung cancer (i.e., stage IV, 54.8%), and patients were less likely to smoke (26.7%) or drink (11%). The cohort of patients had comorbidities related to hypertension (19.8%), hyperlipidemia (13.9%), COPD (16.1%), and CVD problems (11.6%). The follow-up durations for the cohort patients were a mean (SD) of 2.25 (2.47) years and a median (interquartile range (IQR)) of 1.41 [0.46–3.04] years. Detailed information is shown in Table S1 in the Supplementary Materials.

3.2. The Performances of Different Prediction Models

The performances of different prediction models are shown in Table 2. In Mode 1, the highest AUC of 0.88 was observed for the ANN model (i.e., accuracy, 0.82; precision, 0.90; recall, 0.75; and F1-score, 0.64), followed by the GBM and RF models with an AUC of 0.83 and 0.82, respectively. In Mode 3, the best performance was found with an AUC of 0.89 for the ANN model (i.e., accuracy, 0.83; precision, 0.89; recall, 0.81; and F1-score, 0.64). The following AUCs were observed 0.85 for LGBM, GBM, and 0.84 for RF models. Moreover, when considering all features in Mode 4, we found that the best model was the ANN model with an AUC of 0.89 (i.e., accuracy, 0.82; precision, 0.91; recall, 0.75; and F1-score, 0.65). Figure 1 and Figure 2 show the ROC curves of different prediction models in four modes. Detailed information on the various models’ measurements (i.e., sensitivity, specificity, PPV, NPV, accuracy, and F1-score) is shown in Table S2 in the Supplementary Materials.
Figure 3 shows the top 20 important features of the ANN model in Mode 4. The most important features were the cancer stage, size, age of diagnosis, smoking, and EGFR gene. In other words, patients with advanced cancer stage, large cancer size, older age, and smoking behavior had a higher risk of death within two years. The SHAP value presented the important features of the GBM model in Mode 4 and was consistent with the ANN model, such as cancer stage, age at diagnosis, cancer size, and smoking status (Figure S2 in the Supplementary Materials).

4. Discussion

In recent years, the prediction of cancer patients’ survival has attracted the medical community’s attention in various countries because it can facilitate medical decision making, strengthen the relationship between doctors and patients, and improve the quality of medical care. Rapid progress in the development of AI based on machine learning has led to more diversified applications of AI in the field of precision medicine. Based on previously published studies on machine-learning algorithms to build prediction models for the survival of lung cancer patients [12,14,15,16], this study further compared the performance of various novel machine-learning algorithms. In addition, we also analyzed the relationship between the diversity of features and the accuracy of prediction results and determined the most important features affecting lung cancer survival.
Studies using multiple data types and multiple novel machine-learning algorithms simultaneously are limited. In previous studies on lung cancer prediction, most of them used a single machine-learning (e.g., RF [30]) or deep-learning (e.g., NN [14,15,16]) algorithm or a few basic machine-learning algorithms (e.g., LR, SVM, decision tree, RF, GBM [12,31]) to develop prediction models. Our results showed that the ANN model had the highest AUC value (it was the most suitable tool for survival prediction). In contrast, the AUC value of the traditional LR algorithm exhibited the lowest performance (it had the lowest predictive ability). Lai Y.H. et al. [16] presented a deep neural network to predict the overall survival of NSCLC patients. They obtained a good predictive performance (AUC = 0.82, accuracy = 75.4%) by integrating microarray and clinical data. While only using basic clinical data (demographics, comorbidities, and medications), our predicted model demonstrated a higher performance (ANN, AUC, 0.88; accuracy, 0.82; precision, 0.90, recall, 0.75, and F1-score, 0.64). Furthermore, when combining other variables, such as laboratory and genomic tests, the AUC values of the predicted model were better (based on the external testing, the AUCs of the ANN model in Mode 1 and Mode 4 were 0.88 and 0.89, respectively; the AUCs of LGBM model in Mode 1 and Mode 4 were 0.81 and 0.86, respectively; the AUCs of the RF model in Mode 1 and Mode 4 were 0.82 and 0.85, respectively).
In this study, we explored the variables that might affect the predictive performance of the survival model. As expected, these variables were highly correlated to the mortality of lung cancer patients, such as advanced cancer stage, tumor size, age at diagnosis, and smoking and drinking status [32]. Our findings also showed that lymphocytes, platelets, and neutrophils tests were associated with the likelihood of lung cancer survival [33]. Thus, lymphocytes play an essential role in producing cytokines, inhibiting the proliferation of cancer cells, and provoking cytotoxic cell death [34]. In words, a decrease in lymphocyte count may predict worse survival in cancer patients. Neutrophils are recruited with cytokines released by the tumor microenvironment, enhancing carcinogenesis and cancer progression [35]. Platelets modulate the tumor microenvironment by releasing factors contributing to tumor growth, invasion, and angiogenesis [36]. Another study by Wang J. et al. [37] reported that lung cancer patients with a higher BMI have prolonged survival compared to those with a lower BMI. The same was true for our study’s results, which may be due to the poor nutrition and weight loss caused by respiratory diseases [38], such as COPD.
There are limitations to this study. First, although the study used data from various clinical settings (e.g., TMUH and WFH for establishing the prediction model and SHH for conducting an external test) located in the north of Taiwan, the results may not directly apply to lung cancer patients in other regions. Future studies may need to consider validating the model using data from other areas. Second, this study used retrospective data for development and validation. Further experiments with a prospective study design in clinical settings are needed. Third, to obtain a highly accurate prediction, we developed the machine-learning algorithms with binary outcomes (i.e., survival and death) rather than expected continuous outcomes (i.e., length of survival) for the NSCLC patients. Further studies should be conducted with larger sample sizes to deal with continuous outcomes for lung cancer survival.

5. Conclusions

In summary, to observe the expected survival of NSCLC patients during a two-year period, we designed an artificial neural network model with high AUC, precision, and recall. Moreover, integrating different data types (especially laboratory and genomic data) led to better predictive performance. Further research is necessary to determine the feasibility of applying the algorithm in the clinical setting and explore whether this tool could improve care and outcomes.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers14225562/s1, Figure S1: Cohort Selection Process; Figure S2: Feature Importance of the GBM Prediction Model of Mode 4; Table S1: Detailed Demographic Characteristics of Cohort Patients; Table S2: Detailed Performance of various Prediction Models by Modes.

Author Contributions

T.-H.C., P.-A.N. and J.C.H. conceptualized and designed the study. P.-A.N., P.T.P. and T.-C.L. collected the data, performed the analysis, and drafted the manuscript. C.-Y.C. and T.-H.C. provided suggestions for the research design and article content. M.-H.H., M.-S.H., N.Q.K.L., C.-T.C. and J.C.H. reviewed all data and revised the manuscript critically for intellectual content. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by Taiwan Ministry of Science and Technology grants (grant numbers: MOST109-2321-B-038-004; MOST110-2321-B-038-004). The funders had no role in the study design, data collection and analysis, publication decision, or manuscript preparation.

Institutional Review Board Statement

This study has been approved by the TMU-Joint Institutional Review Board (Project number: TMU-JIRB N202101080).

Informed Consent Statement

Not applicable.

Data Availability Statement

The authors obtained data from the Taiwan Cancer Registry (TCR) database and the Taipei Medical University Clinical Research Database (TMUCRD).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

NSCLCNon-small cell lung cancer
SCLCSmall cell lung cancer
AIArtificial intelligence
TCRTaiwan Cancer Registry
TDRTaiwan Death Registry
TMUCRDTaipei Medical University Clinical Research Database
TMUHTaipei Medical University Hospital
WFHWan-Fang Hospital
SHHShuang-Ho Hospital
BMIBody mass index
MIMyocardial infarction
CHFCongestive heart failure
PVDPeripheral vascular disease
CVDCardiovascular disease
COPDChronic obstructive pulmonary disease
PUDPeptic ulcer disease
CCICharlson Comorbidity Index
BUNBlood urea nitrogen
PLTPlatelet
LRLogistic regression
LDALinear discriminant analysis
LGBMLight gradient boosting machine
GBMGradient boosting machine
XGBoostExtreme gradient boosting
RFRandom forest
SVCSupport vector machine
ANNArtificial neural network
AUCThe area under the receiver operating characteristic curve
PPVPositive predictive value
NPVNegative predictive value
SHAPShapley additive explanations

References

  1. Bray, F.; Ferlay, J.; Soerjomataram, I.; Siegel, R.L.; Torre, L.A.; Jemal, A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2018, 68, 394–424. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. World Health Organization. Cancer Fact Sheets. Available online: https://www.who.int/news-room/fact-sheets/detail/cancer (accessed on 1 November 2022).
  3. Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J. Clin. 2021, 71, 209–249. [Google Scholar] [CrossRef]
  4. World Health Organization. Lung Cancer Statistics. Available online: https://www.wcrf.org/cancer-trends/lung-cancer-statistics/ (accessed on 1 November 2022).
  5. Siddiqui, F.; Vaqar, S.; Siddiqui, A.H. Lung Cancer. In StatPearls; StatPearls Publishing LLC.: Treasure Island, FL, USA, 2022. [Google Scholar]
  6. Testa, U.; Castelli, G.; Pelosi, E. Lung Cancers: Molecular Characterization, Clonal Heterogeneity and Evolution, and Cancer Stem Cells. Cancers 2018, 10, 248. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  7. Ryan, C.; Burke, L. Pathology of lung tumours. Surgery 2017, 35, 234–242. [Google Scholar] [CrossRef]
  8. Liang, C.-W.; Yang, H.-C.; Islam, M.M.; Nguyen, P.A.A.; Feng, Y.-T.; Hou, Z.Y.; Huang, C.-W.; Poly, T.N.; Li, Y.-C.J. Predicting Hepatocellular Carcinoma With Minimal Features From Electronic Health Records: Development of a Deep Learning Model. JMIR Cancer 2021, 7, e19812. [Google Scholar] [CrossRef] [PubMed]
  9. Poly, T.N.; Islam, M.M.; Muhtar, M.S.; Yang, H.-C.; Nguyen, P.A.; Li, Y.-C. Machine Learning Approach to Reduce Alert Fatigue Using a Disease Medication–Related Clinical Decision Support System: Model Development and Validation. JMIR Med Inform. 2020, 8, e19489. [Google Scholar] [CrossRef] [PubMed]
  10. Le, N.Q.K.; Ho, Q.-T. Deep transformers and convolutional neural network in identifying DNA N6-methyladenine sites in cross-species genomes. Methods 2022, 204, 199–206.
  11. Dang, H.H.; Ta, H.D.K.; Nguyen, T.T.T.; Anuraga, G.; Wang, C.-Y.; Lee, K.-H.; Le, N.Q.K. Prospective role and immunotherapeutic targets of sideroflexin protein family in lung adenocarcinoma: Evidence from bioinformatics validation. Funct. Integr. Genom. 2022, 22, 1057–1072.
  12. Lynch, C.M.; Abdollahi, B.; Fuqua, J.D.; de Carlo, A.R.; Bartholomai, J.A.; Balgemann, R.N.; van Berkel, V.H.; Frieboes, H.B. Prediction of lung cancer patient survival via supervised machine learning classification techniques. Int. J. Med. Inform. 2017, 108, 1–8.
  13. Siah, K.W.; Khozin, S.; Wong, C.H.; Lo, A.W. Machine-Learning and Stochastic Tumor Growth Models for Predicting Outcomes in Patients With Advanced Non-Small-Cell Lung Cancer. JCO Clin. Cancer Inform. 2019, 3, 1–11.
  14. Cui, L.; Li, H.; Hui, W.; Chen, S.; Yang, L.; Kang, Y.; Bo, Q.; Feng, J. A deep learning-based framework for lung cancer survival analysis with biomarker interpretation. BMC Bioinform. 2020, 21, 112.
  15. She, Y.; Jin, Z.; Wu, J.; Deng, J.; Zhang, L.; Su, H.; Jiang, G.; Liu, H.; Xie, D.; Cao, N.; et al. Development and Validation of a Deep Learning Model for Non–Small Cell Lung Cancer Survival. JAMA Netw. Open 2020, 3, e205842.
  16. Lai, Y.-H.; Chen, W.-N.; Hsu, T.-C.; Lin, C.; Tsao, Y.; Wu, S. Overall survival prediction of non-small cell lung cancer by integrating microarray and clinical data with deep learning. Sci. Rep. 2020, 10, 4679.
  17. Lu, Y.; Van Zandt, M.; Liu, Y.; Li, J.; Wang, X.; Chen, Y.; Chen, Z.; Cho, J.; Dorajoo, S.R.; Feng, M.; et al. Analysis of Dual Combination Therapies Used in Treatment of Hypertension in a Multinational Cohort. JAMA Netw. Open 2022, 5, e223877.
  18. Nguyen, P.-A.; Chang, C.-C.; Galvin, C.J.; Wang, Y.-C.; An, S.Y.; Huang, C.-W.; Wang, Y.-H.; Hsu, M.-H.; Li, Y.-C.; Yang, H.-C. Statins use and its impact in EGFR-TKIs resistance to prolong the survival of lung cancer patients: A Cancer registry cohort study in Taiwan. Cancer Sci. 2020, 111, 2965–2973.
  19. Dreiseitl, S.; Ohno-Machado, L. Logistic regression and artificial neural network classification models: A methodology review. J. Biomed. Inform. 2002, 35, 352–359.
  20. Izenman, A.J. Linear discriminant analysis. In Modern Multivariate Statistical Techniques; Springer: Berlin/Heidelberg, Germany, 2013; pp. 237–280.
  21. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. Lightgbm: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3149–3157.
  22. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
  23. Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K. Xgboost: Extreme gradient boosting. R Package Version 0.4-2 2015, 1, 1–4.
  24. Ho, T.K. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, IEEE, Montreal, QC, Canada, 14–16 August 1995; pp. 278–282.
  25. Hastie, T.; Rosset, S.; Zhu, J.; Zou, H. Multi-class adaboost. Stat. Its Interface 2009, 2, 349–360.
  26. Gunn, S.R. Support vector machines for classification and regression. ISIS Tech. Rep. 1998, 14, 5–16.
  27. Agatonovic-Kustrin, S.; Beresford, R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J. Pharm. Biomed. Anal. 2000, 22, 717–727.
  28. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4768–4777.
  29. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  30. He, J.; Zhang, J.X.; Chen, C.T.; Ma, Y.; De Guzman, R.; Meng, J.; Pu, Y. The Relative Importance of Clinical and Socio-demographic Variables in Prognostic Prediction in Non-Small Cell Lung Cancer: A Variable Importance Approach. Med. Care 2020, 58, 461–467.
  31. Bartholomai, J.A.; Frieboes, H.B. Lung Cancer Survival Prediction via Machine Learning Regression, Classification, and Statistical Techniques. In Proceedings of the 2018 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Louisville, KY, USA, 6–8 December 2018; Volume 2018, pp. 632–637.
  32. Goussault, H.; Gendarme, S.; Assié, J.B.; Bylicki, O.; Chouaïd, C. Factors associated with early lung cancer mortality: A systematic review. Expert Rev. Anticancer Ther. 2021, 21, 1125–1133.
  33. Kang, J.; Chang, Y.; Ahn, J.; Oh, S.; Koo, D.H.; Lee, Y.G.; Shin, H.; Ryu, S. Neutrophil-to-lymphocyte ratio and risk of lung cancer mortality in a low-risk population: A cohort study. Int. J. Cancer 2019, 145, 3267–3275.
  34. Spicer, J.D.; McDonald, B.; Cools-Lartigue, J.J.; Chow, S.C.; Giannias, B.; Kubes, P.; Ferri, L.E. Neutrophils promote liver metastasis via Mac-1-mediated interactions with circulating tumor cells. Cancer Res. 2012, 72, 3919–3927.
  35. Powell, D.R.; Huttenlocher, A. Neutrophils in the Tumor Microenvironment. Trends Immunol. 2016, 37, 41–52.
  36. Contursi, A.; Grande, R.; Dovizio, M.; Bruno, A.; Fullone, R.; Patrignani, P. Platelets in cancer development and diagnosis. Biochem. Soc. Trans. 2018, 46, 1517–1527.
  37. Wang, J.; Xu, H.; Zhou, S.; Wang, D.; Zhu, L.; Hou, J.; Tang, J.; Zhao, J.; Zhong, S. Body mass index and mortality in lung cancer patients: A systematic review and meta-analysis. Eur. J. Clin. Nutr. 2018, 72, 4–17.
  38. Nakagawa, T.; Toyazaki, T.; Chiba, N.; Ueda, Y.; Gotoh, M. Prognostic value of body mass index and change in body weight in postoperative outcomes of lung cancer surgery. Interact. Cardiovasc. Thorac. Surg. 2016, 23, 560–566.
Figure 1. The Performance of the Prediction Models in the Testing dataset by different Modes. Note: (A), Mode 1; (B), Mode 2; (C), Mode 3; (D), Mode 4.
Figure 2. The Performance of the ANN Prediction Models in the Testing dataset by different Modes. Note: (A), Mode 1; (B), Mode 2; (C), Mode 3; (D), Mode 4.
Figure 3. Feature Importance of the ANN Prediction Model in Mode 4. Note: BMI, Body mass index; EGFR, Epidermal growth factor receptor; WBC, White blood cell; PD-L1, Programmed death-ligand 1; COPD, Chronic obstructive pulmonary disease; CCI, Charlson comorbidity index.
Table 1. Basic Characteristics of the Study Cohort.
| Features | Overall (n = 3714) | Training Set a (n = 2280) | Testing Set b (n = 1434) |
|---|---|---|---|
| Male, N (%) | 2136 (57.5) | 1258 (55.2) | 878 (61.2) |
| Age, Mean (SD), yrs. | 68.0 (13.7) | 67.9 (13.8) | 68.0 (13.4) |
| BMI, Mean (SD), kg/m² | 23.4 (4.33) | 23.4 (3.93) | 23.4 (4.81) |
| Smoking, N (%) |  |  |  |
| No | 1170 (31.5) | 710 (31.1) | 460 (32.1) |
| Yes | 993 (26.7) | 523 (22.9) | 470 (32.8) |
| Unknown | 1551 (41.8) | 1047 (45.9) | 504 (35.1) |
| Drinking, N (%) |  |  |  |
| No | 1750 (47.1) | 983 (43.1) | 767 (53.5) |
| Yes | 408 (11.0) | 247 (10.8) | 161 (11.2) |
| Unknown | 1556 (41.9) | 1050 (46.1) | 506 (35.3) |
| Tumor size, cm |  |  |  |
| Mean (SD) | 4.23 (2.45) | 4.11 (2.39) | 4.46 (2.55) |
| Median [IQR] | 3.8 [2.4–5.5] | 3.6 [2.3–5.5] | 4.0 [2.5–5.7] |
| Cancer stage, N (%) |  |  |  |
| 0 | 11 (0.3) | 10 (0.4) | 1 (0.1) |
| I | 533 (14.4) | 348 (15.3) | 185 (12.9) |
| II | 139 (3.7) | 88 (3.9) | 51 (3.6) |
| III | 527 (14.1) | 330 (14.5) | 197 (13.7) |
| IV | 2034 (54.8) | 1207 (52.9) | 827 (57.7) |
| Missing | 470 (12.7) | 297 (13.0) | 173 (12.1) |
| Genomic Test |  |  |  |
| ALK, N (%) |  |  |  |
| Negative | 681 (18.3) | 457 (20.0) | 224 (15.6) |
| Positive | 39 (1.1) | 21 (0.9) | 18 (1.3) |
| Unknown | 2994 (80.6) | 1802 (79.0) | 1192 (83.1) |
| EGFR, N (%) |  |  |  |
| Negative | 842 (22.7) | 473 (20.7) | 369 (25.7) |
| Positive | 787 (21.2) | 467 (20.5) | 320 (22.3) |
| Unknown | 2085 (56.1) | 1340 (58.8) | 745 (52.0) |
| KRAS, N (%) |  |  |  |
| Negative | 45 (1.2) | 32 (1.4) | 13 (0.9) |
| Positive | 5 (0.1) | 2 (0.1) | 3 (0.2) |
| Unknown | 3664 (98.7) | 2246 (98.5) | 1418 (98.9) |
| PDL1, N (%) |  |  |  |
| Negative | 269 (7.2) | 149 (6.5) | 120 (8.4) |
| Positive | 66 (1.8) | 42 (1.8) | 24 (1.7) |
| Unknown | 3379 (91.0) | 2089 (91.6) | 1290 (90.0) |
| ROS1, N (%) |  |  |  |
| Negative | 288 (7.8) | 287 (12.6) | 1 (0.1) |
| Positive | 29 (0.8) | 27 (1.2) | 2 (0.1) |
| Unknown | 3397 (91.4) | 1966 (86.2) | 1431 (99.8) |
| Comorbidity, N (%) |  |  |  |
| CVD problems | 432 (11.6) | 296 (13.0) | 136 (9.5) |
| Dementia | 124 (3.3) | 71 (3.1) | 53 (3.7) |
| COPD | 599 (16.1) | 391 (17.1) | 208 (14.5) |
| Rheumatic disease | 28 (0.75) | 16 (0.7) | 12 (0.8) |
| PUD | 365 (9.8) | 246 (10.8) | 119 (8.3) |
| Renal disease | 128 (3.4) | 92 (4.0) | 31 (2.2) |
| Liver disease | 211 (5.7) | 147 (6.4) | 64 (4.5) |
| DM | 372 (10.0) | 248 (10.9) | 124 (8.6) |
| Anemia | 107 (2.9) | 76 (3.3) | 31 (2.2) |
| Depression | 245 (6.6) | 175 (7.7) | 70 (4.9) |
| Hyperlipidemia | 516 (13.9) | 385 (16.9) | 131 (9.1) |
| Hypertension | 736 (19.8) | 503 (22.1) | 233 (16.2) |
| Parkinson’s disease | 50 (1.3) | 29 (1.3) | 21 (1.5) |
| Charlson Comorbidity Index (CCI) |  |  |  |
| Mean (SD) | 3.08 (2.07) | 3.13 (2.19) | 2.97 (1.86) |
| Median [IQR] | 3.0 [2.0–4.0] | 3.0 [2.0–4.0] | 3.0 [2.0–4.0] |
| Follow-up, yrs. |  |  |  |
| Mean (SD) | 2.25 (2.47) | 2.44 (2.61) | 1.96 (2.19) |
| Median [IQR] | 1.41 [0.46–3.04] | 1.51 [0.53–3.36] | 1.24 [0.38–2.64] |
| Medications, N (%) |  |  |  |
| Alimentary tract and metabolism | 591 (15.9) | 394 (17.3) | 197 (14.7) |
| Blood and blood-forming organs | 446 (12.0) | 293 (12.9) | 153 (11.3) |
| Cardiovascular system | 675 (18.2) | 448 (19.6) | 227 (16.9) |
| Genitourinary system and hormones | 132 (3.6) | 74 (3.2) | 58 (4.3) |
| Musculoskeletal system | 252 (6.8) | 141 (6.2) | 111 (8.3) |
| Nervous system | 391 (10.5) | 254 (11.1) | 137 (10.2) |
| Respiratory system | 319 (8.6) | 226 (9.9) | 93 (6.9) |
| Laboratory Test, Mean (SD) |  |  |  |
| Basophil | 0.50 (0.40) | 0.53 (0.42) | 0.48 (0.39) |
| BUN | 19.4 (14.9) | 18.8 (13.1) | 20.5 (17.6) |
| Creatinine | 1.05 (0.98) | 1.02 (0.90) | 1.10 (1.07) |
| Eosinophil | 1.89 (2.31) | 2.03 (2.59) | 1.76 (1.97) |
| HCT | 38.3 (5.69) | 38.5 (5.61) | 37.9 (5.80) |
| HGB | 12.9 (1.97) | 13.0 (1.91) | 12.7 (2.05) |
| K | 3.99 (0.56) | 4.02 (0.53) | 3.95 (0.60) |
| Lymphocyte | 18.7 (9.98) | 19.6 (9.55) | 17.8 (10.3) |
| MCH | 29.9 (3.02) | 29.9 (3.03) | 29.8 (3.00) |
| MCHC | 33.6 (0.95) | 33.7 (0.96) | 33.6 (0.94) |
| MCV | 88.6 (7.61) | 88.5 (7.64) | 88.7 (7.57) |
| Monocyte | 7.45 (2.90) | 7.42 (2.93) | 7.48 (2.87) |
| Na | 137 (4.46) | 137 (4.39) | 137 (4.53) |
| Neutrophil | 71.3 (11.9) | 70.2 (11.4) | 72.3 (12.2) |
| PLT | 263 (109) | 258 (100) | 269 (121) |
| RBC | 4.35 (0.68) | 4.38 (0.67) | 4.29 (0.69) |
| WBC | 9.72 (5.38) | 9.16 (4.16) | 10.6 (6.80) |
Note: SD, Standard deviation; yrs., Years; IQR, Interquartile range; BMI, Body mass index; COPD, Chronic obstructive pulmonary disease; PUD, Peptic ulcer disease; CVD, Cardiovascular; DM, Diabetes mellitus; BUN, Blood urea nitrogen; HCT, Hematocrit; HGB, Hemoglobin; K, Potassium; MCH, Mean corpuscular hemoglobin; MCHC, Mean corpuscular hemoglobin concentration; MCV, Mean corpuscular volume; Na, Sodium; PLT, Platelet; RBC, Red blood cell count; WBC, White blood cell count; a The training set included data from Taipei Medical University and Wan-Fang hospitals; b The testing set included data from Shuang Ho hospital.
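As notes a and b indicate, the cohort was split by institution rather than randomly, so the testing set serves as an external validation cohort. A minimal sketch of such an institution-based split (the record structure and field names are hypothetical, not the authors' code):

```python
# Hypothetical patient records; only the hospital field matters for the split.
records = [
    {"id": 1, "hospital": "Taipei Medical University Hospital"},
    {"id": 2, "hospital": "Wan-Fang Hospital"},
    {"id": 3, "hospital": "Shuang Ho Hospital"},
    {"id": 4, "hospital": "Taipei Medical University Hospital"},
]

# Training sites per note a; everything else (note b) is held out for testing.
TRAIN_SITES = {"Taipei Medical University Hospital", "Wan-Fang Hospital"}

train = [r for r in records if r["hospital"] in TRAIN_SITES]
test = [r for r in records if r["hospital"] not in TRAIN_SITES]
```

Because no patient appears in both subsets and the split follows institutional boundaries, performance on `test` estimates generalization to an unseen hospital.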
Table 2. Performance of various Prediction Models by Modes.
| Mode | Model | AUC Training | AUC Testing | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|---|
| Mode 1 | LR | 0.70 | 0.72 | 0.65 | 0.88 | 0.64 | 0.75 |
|  | LDA | 0.78 | 0.78 | 0.71 | 0.90 | 0.70 | 0.80 |
|  | LGBM | 0.98 | 0.81 | 0.73 | 0.92 | 0.72 | 0.81 |
|  | GBM | 0.96 | 0.83 | 0.75 | 0.91 | 0.76 | 0.84 |
|  | XGBoost | 0.99 | 0.80 | 0.75 | 0.90 | 0.77 | 0.84 |
|  | RF | 0.90 | 0.82 | 0.72 | 0.92 | 0.70 | 0.80 |
|  | AdaBoost | 0.94 | 0.81 | 0.73 | 0.91 | 0.72 | 0.81 |
|  | SVC | 0.78 | 0.78 | 0.71 | 0.89 | 0.72 | 0.79 |
|  | ANN * | 0.89 | 0.88 | 0.82 | 0.90 | 0.75 | 0.64 |
| Mode 2 | LR | 0.74 | 0.75 | 0.60 | 0.93 | 0.53 | 0.67 |
|  | LDA | 0.81 | 0.79 | 0.71 | 0.90 | 0.70 | 0.80 |
|  | LGBM | 0.99 | 0.83 | 0.78 | 0.91 | 0.79 | 0.86 |
|  | GBM | 0.96 | 0.84 | 0.78 | 0.91 | 0.80 | 0.87 |
|  | XGBoost | 1.00 | 0.81 | 0.78 | 0.90 | 0.81 | 0.86 |
|  | RF | 0.92 | 0.83 | 0.69 | 0.94 | 0.64 | 0.76 |
|  | AdaBoost | 0.95 | 0.80 | 0.74 | 0.90 | 0.76 | 0.83 |
|  | SVC | 0.81 | 0.79 | 0.70 | 0.91 | 0.68 | 0.78 |
|  | ANN * | 0.89 | 0.89 | 0.80 | 0.91 | 0.75 | 0.64 |
| Mode 3 | LR | 0.70 | 0.73 | 0.65 | 0.88 | 0.63 | 0.74 |
|  | LDA | 0.80 | 0.81 | 0.75 | 0.91 | 0.76 | 0.83 |
|  | LGBM | 0.98 | 0.85 | 0.80 | 0.92 | 0.81 | 0.87 |
|  | GBM | 0.96 | 0.85 | 0.79 | 0.92 | 0.79 | 0.86 |
|  | XGBoost | 1.00 | 0.83 | 0.79 | 0.91 | 0.80 | 0.86 |
|  | RF | 0.91 | 0.84 | 0.72 | 0.93 | 0.69 | 0.80 |
|  | AdaBoost | 0.95 | 0.83 | 0.79 | 0.91 | 0.80 | 0.86 |
|  | SVC | 0.80 | 0.81 | 0.75 | 0.90 | 0.75 | 0.83 |
|  | ANN * | 0.89 | 0.89 | 0.83 | 0.89 | 0.81 | 0.64 |
| Mode 4 | LR | 0.74 | 0.75 | 0.61 | 0.93 | 0.53 | 0.67 |
|  | LDA | 0.83 | 0.82 | 0.76 | 0.90 | 0.77 | 0.84 |
|  | LGBM | 0.99 | 0.86 | 0.81 | 0.92 | 0.83 | 0.88 |
|  | GBM | 0.97 | 0.85 | 0.79 | 0.92 | 0.81 | 0.87 |
|  | XGBoost | 1.00 | 0.84 | 0.77 | 0.92 | 0.77 | 0.85 |
|  | RF | 0.93 | 0.85 | 0.75 | 0.93 | 0.73 | 0.82 |
|  | AdaBoost | 0.96 | 0.83 | 0.76 | 0.92 | 0.75 | 0.83 |
|  | SVC | 0.83 | 0.81 | 0.75 | 0.90 | 0.76 | 0.84 |
|  | ANN * | 0.89 | 0.89 | 0.82 | 0.91 | 0.75 | 0.65 |
Note: LR, Logistic Regression; LDA, Linear Discriminant Analysis; LGBM, Light Gradient Boosting Machine; GBM, Gradient Boosting Machine; XGBoost, Extreme Gradient Boosting; RF, Random Forest; SVC, Support Vector Machine; ANN, Artificial Neural Network; *, Best model based on AUC values.
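The metrics reported in Table 2 all derive from predicted probabilities and a thresholded confusion matrix. A minimal plain-Python sketch with hypothetical labels and scores (not the authors' data or code; the study itself used scikit-learn [29], whose `metrics` functions compute the same quantities):

```python
# Hypothetical binary outcomes (1 = event) and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.9, 0.2, 0.7, 0.35, 0.6, 0.8, 0.3, 0.55]
y_pred = [1 if p >= 0.5 else 0 for p in y_score]  # 0.5 decision threshold

# AUC: fraction of (positive, negative) pairs whose scores are ranked
# correctly; it depends on the scores only, not on the threshold.
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

# Confusion-matrix counts at the chosen threshold.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(auc, 2), accuracy, precision, recall, round(f1, 2))
# prints: 0.87 0.75 0.8 0.8 0.8
```

Because AUC is threshold-free while accuracy, precision, recall, and F1 are not, a model can rank patients well (high AUC) yet look weaker on the thresholded metrics, which is worth keeping in mind when comparing rows of Table 2.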
Share and Cite

Hsu, J.C.; Nguyen, P.-A.; Phuc, P.T.; Lo, T.-C.; Hsu, M.-H.; Hsieh, M.-S.; Le, N.Q.K.; Cheng, C.-T.; Chang, T.-H.; Chen, C.-Y. Development and Validation of Novel Deep-Learning Models Using Multiple Data Types for Lung Cancer Survival. Cancers 2022, 14, 5562. https://doi.org/10.3390/cancers14225562