An Interpretable Two-Phase Modeling Approach for Lung Cancer Survivability Prediction

Although lung cancer survival status and survival length predictions have primarily been studied individually, a scheme that leverages both fields in an interpretable way for physicians remains elusive. We propose a two-phase data analytic framework that is capable of classifying survival status for 0.5-, 1-, 1.5-, 2-, 2.5-, and 3-year time-points (phase I) and predicting the number of survival months within 3 years (phase II) using recent Surveillance, Epidemiology, and End Results data from 2010 to 2017. In this study, we employ three analytical models (general linear model, extreme gradient boosting, and artificial neural networks), five data balancing techniques (synthetic minority oversampling technique (SMOTE), relocating safe level SMOTE, borderline SMOTE, adaptive synthetic sampling, and majority weighted minority oversampling technique), two feature selection methods (least absolute shrinkage and selection operator (LASSO) and random forest), and the one-hot encoding approach. By implementing a comprehensive data preparation phase, we demonstrate that a computationally efficient and interpretable method such as GLM performs comparably to more complex models. Moreover, we quantify the effects of individual features in phase I and II by exploiting GLM coefficients. To the best of our knowledge, this study is the first to (a) implement a comprehensive data processing approach to develop performant, computationally efficient, and interpretable methods in comparison to black-box models, (b) visualize top factors impacting survival odds by utilizing the change in odds ratio, and (c) comprehensively explore short-term lung cancer survival using a two-phase approach.


Introduction
Lung cancer is the leading cause of cancer-related deaths worldwide [1]. According to the World Health Organization (WHO) [1], there were 2.09 million new cases and 1.76 million deaths due to lung cancer globally in 2018. The American Cancer Society [2] estimates that approximately 236,740 people (117,910 men and 118,830 women) will be diagnosed with lung cancer and that approximately 130,180 (68,820 men and 61,360 women) will die from it in 2022. According to Lemjabbar-Alaoui et al. [3], the prognosis of lung cancer is generally poor despite all the advancements in diagnostics and therapeutics. Data mining methods make it possible to further analyze cancer patient data and predict survivability outcomes, and combining machine learning methods with physician expertise can help inform cancer treatment options. The Surveillance, Epidemiology, and End Results (SEER) [4] program is currently the most comprehensive repository that contains clinical data for approximately 34.6% of the US population with cancer. We believe that the literature on lung cancer survivability using SEER data can be classified within two main research groups. The first group [5-12] focuses on using statistical methods (e.g., Cox regression, Kaplan-Meier methods, and chi-squared test) for survival analysis as well as finding significant prognostic features (e.g., tumor size, performing surgery, and positive lymph node ratio) that influence survival. The second group focuses on using machine learning methods for survival prediction (see Table 1). In this study, we focus predominantly on the second group of research, while identifying significant prognostic features. Survival status prediction, length of survival estimation, and cancer patient clustering are the primary topics found in the machine learning literature that utilizes the SEER dataset, where focus is placed on model accuracy.
Moreover, common classification, clustering, and regression models employed within the second group of research include artificial neural networks (ANNs), support vector machines (SVMs), Naïve Bayes (NB), decision trees (DTs), random forest (RF), ensemble methods, K-means, and bidirectional data partitioning (BDP) [7,13-22]. Despite the great strides made in lung cancer prediction research, several challenges remain:
• Although most studies explore survival status classification [7,14,15,17,18,21,22] and survival length prediction [19] individually, a scheme that leverages both remains elusive [13,16].
• Data used in lung cancer survivability predictions suffer from the class imbalance problem, which produces algorithm bias in favor of the majority class. This issue is scarcely addressed in cancer-related studies [14].
• Most features in the SEER data are categorical (e.g., grade, stage, and race). Many studies adopt integer encoding [14-16,21,22] to transform categorical features, which can introduce improper hierarchical order in feature levels. Alternatively, several studies [7,13,17,22] apply one-hot encoding to remedy non-ordinal relationships; however, most of these studies omit feature interpretation in favor of model performance.
• An interpretable yet effective model for predicting lung cancer survivability or survival length, which can assist a practitioner in their decision-making process, remains missing.
This paper lays out a two-phase data analytic framework, where phase I predicts the 6-month (0.5-year), 1-, 1.5-, 2-, 2.5-, and 3-year survival status of patients diagnosed with lung cancer while phase II predicts the number of survival months for patients who succumb to lung cancer within 3 years. In phase I, we use three analytical models (general linear model (GLM), extreme gradient boosting (XGB), and ANN) along with five data balancing techniques (synthetic minority oversampling technique (SMOTE), relocating safe level SMOTE (RSLSMOTE), borderline SMOTE (BLSMOTE), adaptive synthetic sampling (ADASYN), and majority weighted minority oversampling technique (MWMOTE)), and two feature selection methods (least absolute shrinkage and selection operator (LASSO) and RF), while using the one-hot encoding approach to encode the categorical features. In phase II, we employ similar models used in phase I (GLM, XGB, and ANN) along with two feature selection methods (LASSO and RF) to predict the number of survival months for deceased patients within 3 years. We extract and interpret significant predictors based on regression coefficients for phase I (using odds ratio) and phase II. Through our proposed data analytic framework, we address the four challenges mentioned above. Furthermore, by implementing a comprehensive data preparation phase, we demonstrate that a statistical approach such as GLM performs comparably to the more complex models (e.g., XGB and ANN) at a considerably lower computational cost. The remaining parts of the paper are organized as follows: our proposed data analytic framework is discussed in detail within Section 2, results are presented and discussed in Section 3, and concluding remarks, including research limitations and future outlook, are given in Section 4. Figure 1 presents a diagram illustrating our proposed two-phase data analytic framework for lung cancer survivability prediction, where each phase encompasses several steps. 
In the following subsections, various phases are described in detail.

Material and Methods
The data preparation phase comprises five main steps: (1) data collection and data understanding, (2) data cleaning, (3) survival status labeling, (4) feature encoding, and (5) outlier detection, which are discussed in the following paragraphs.
This study incorporates de-identified records of diagnosed lung cancer from the SEER dataset (November 2020 Submission) spanning January 2010 to December 2017; additional features were introduced to the SEER database in 2010, and several of these features were later omitted in 2018. As shown in Table 2, various criteria/filters are applied during data collection, resulting in 183 features and 129,756 records.
[Table 2 (recoverable rows only): Behavior recode for analysis = "Malignant"; (10) Behavior code ICD-O-3 = "Malignant"; (11) Sequence Number = "One Primary Only".]
Due to a large number of unknown/missing values, duplicate variables, and correlated features, the filtered lung cancer dataset requires extensive data cleansing. In this step, features are identified and removed when (a) variables are discontinued or lack longitudinality, (b) variables possess more than 80% missing values, (c) variables are repetitive, and (d) variables contain constant input. Rather than arbitrarily removing records that contain NA values, the unknown and NA levels in categorical features are combined, which reduces data dimensionality while preserving statistical power. Similar to a past study [23], categorical levels with frequencies of less than 5% are regrouped in order to avoid overfitting and introducing bias to a model. Additionally, in order to mitigate gratuitous model bias from imputation, records with unknown values for the total number of benign tumors and regional nodes examined are removed, maintaining feature distributions. After the data cleaning step, 22 features and 125,498 records remain. All features and their types are listed in Table 3, where 4 features are numerical and 18 are categorical.
In the third step of the data preparation phase, the survival status (response variable) is generated for each patient at each time-point using the survival months feature. For example, if survival months < 6, then the survival status is denoted as 0, which indicates death within 6 months. Otherwise, the survival status is assigned as 1, which indicates that the patient survived for 6 months or longer.
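The labeling rule in the third step can be sketched as follows. This is a minimal illustration rather than the authors' code: the names `TIME_POINTS_MONTHS`, `survival_status`, and `label_all_time_points` are hypothetical, and the sketch assumes every record has a known survival months value (censoring is not handled).

```python
# Phase I time-points expressed in months (0.5, 1, 1.5, 2, 2.5, and 3 years).
TIME_POINTS_MONTHS = (6, 12, 18, 24, 30, 36)


def survival_status(survival_months, cutoff_months):
    """Return 0 if the patient died before the cutoff, otherwise 1."""
    return 0 if survival_months < cutoff_months else 1


def label_all_time_points(survival_months):
    """Build one binary response per time-point for a single patient."""
    return {
        cutoff: survival_status(survival_months, cutoff)
        for cutoff in TIME_POINTS_MONTHS
    }


if __name__ == "__main__":
    # A patient with 14 survival months survives the 6- and 12-month cutoffs only.
    print(label_all_time_points(14))
```

Each time-point therefore yields its own binary response column, which is why a separate classifier is trained per time-point in phase I.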
In the fourth step, categorical features are transformed to ensure that the dataset is prepared in a format applicable to analytical models. As shown in Table 3, most features are categorical; however, machine learning models rely on numerical features for input. Many researchers [14-16,21,22] employ integer encoding to re-code categorical features, where the levels in each categorical feature are assigned integer values (e.g., denoting Grade I, Grade II, and Grade III as 1, 2, and 3). Instead, we opt for the one-hot encoding approach, which circumvents improper hierarchical order and encodes a categorical feature with m levels into m − 1 dummy variables to avoid multicollinearity [24]. After feature encoding, the dataset contains 60 features.
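To make the m − 1 encoding concrete, the sketch below encodes a single categorical column in plain Python. The function name `one_hot_drop_first`, the column naming scheme, and the choice of the alphabetically first level as the reference level are assumptions for illustration only; the source does not state which level is dropped.

```python
def one_hot_drop_first(name, values):
    """Encode a categorical column with m levels into m - 1 dummy columns.

    The first level (in sorted order) is treated as the reference level and
    receives no column of its own; it is represented by all-zero dummies.
    """
    levels = sorted(set(values))
    reference, kept = levels[0], levels[1:]
    rows = [
        {f"{name}: {level}": int(value == level) for level in kept}
        for value in values
    ]
    return reference, rows


if __name__ == "__main__":
    grades = ["Grade I", "Grade II", "Grade III", "Grade II"]
    reference, encoded = one_hot_drop_first("Grade", grades)
    print("reference level:", reference)
    for row in encoded:
        print(row)
```

Because each level has its own 0/1 column, a model coefficient attaches to one level at a time rather than to an artificial 1, 2, 3 ordering.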
In the fifth and final step, we utilize Cook's distance [25] to remove outlying observations from our dataset. Cook's distance is one of the most popular approaches for detecting outliers [26], and it offered modest refinements in our preliminary analyses. For each observation, Cook's distance is determined by comparing the model's fitted values with and without that data point. Observations with high Cook's distances are considered influential or outliers. We adopt a threshold of 4/n, where n is the number of observations, for outlier detection, a standard threshold in the literature [27].
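The 4/n rule can be illustrated on a toy problem. The sketch below is an assumption-laden simplification: it computes Cook's distance for an ordinary least-squares line with a single predictor, using the closed-form leverage, whereas the study applies the idea to its own fitted model. The function names `cooks_distances` and `flag_outliers` are hypothetical.

```python
def cooks_distances(x, y):
    """Cook's distance for each point of a simple least-squares line y = a + b*x."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    distances = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - x_bar) ** 2 / sxx
        distances.append((e ** 2 / (p * mse)) * leverage / (1 - leverage) ** 2)
    return distances


def flag_outliers(x, y):
    """Return indices whose Cook's distance exceeds the 4/n threshold."""
    threshold = 4 / len(x)
    return [i for i, d in enumerate(cooks_distances(x, y)) if d > threshold]


if __name__ == "__main__":
    x = list(range(10))
    y = [2 * xi + 1 for xi in x]
    y[9] = 60  # corrupt the last observation
    print(flag_outliers(x, y))
```

In this toy data set only the corrupted observation exceeds 4/n = 0.4, so it alone would be removed before modeling.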

Modeling
In this phase, the dataset is randomly split into training (70%) and testing (30%) sets. In addition to using 5-fold cross-validation, bootstrapping is utilized during model training to mitigate overfitting and reduce model variance. Due to a disproportionate number of survival and deceased class instances existing for each time-point, class distributions within the training set are adjusted to address the class imbalance problem. Based on the superiority of synthetic sampling demonstrated in previous studies [28,29], we explore five re-sampling approaches: SMOTE [30], RSLSMOTE [31], BLSMOTE [32], ADASYN [33], and MWMOTE [34].
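For readers unfamiliar with synthetic oversampling, the core SMOTE step can be sketched in a few lines: a new minority record is interpolated between a minority record and one of its nearest minority neighbours. This is an illustrative sketch only, not the implementation used in the study; the function name `smote`, the default `k`, and the fixed random seed are assumptions, and the four variants (RSLSMOTE, BLSMOTE, ADASYN, and MWMOTE) differ in how they choose which records to interpolate.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between neighbours.

    `minority` is a list of numeric tuples (at least two records).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base record (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


if __name__ == "__main__":
    minority_class = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for point in smote(minority_class, n_synthetic=3, k=2):
        print(point)
```

As stated above, resampling is applied to the training set only, so the testing set keeps its original class distribution.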
It is important to note that the one-hot encoding approach increases the number of features, which increases the complexity of model development and the training process. Traditional feature selection methods (such as forward/backward selection and recursive feature elimination) are not practical for high-dimensional data. Therefore, two popular embedded feature selection methods, namely, LASSO [35] and RF, are used to reduce the dimension of the input features. Both of these methods are widely used in the literature to extract important features from high-dimensional data [36,37]. Feature selection is used to decrease the complexity while increasing the generalizability of the analytical model.
Next, three popular models (GLM, XGB, and ANN) from three analytical groups, (a) statistical models, (b) ensemble models, and (c) deep learning models, are used for model development. Statistical models are simple, computationally efficient, and more interpretable compared to ensemble and deep learning models. Ensemble models typically offer high prediction performance by leveraging a majority voting approach, where the results of many lesser classifiers are combined. ANN models also offer high prediction performance through variable transformations; however, their required computational time notably increases as the dimensionality of a dataset increases [38]. These three models, drawn from three common analytical groups, are carefully selected in order to gauge how prediction performance varies from a simpler (GLM) to a more complex analytical model (ANN). In terms of complexity, the three models can be categorized as less complex (GLM), mid-complex (XGB), and complex (ANN). Furthermore, we examine whether comprehensive data preprocessing allows simpler models (GLM) to substitute for complex models (XGB and ANN). For further details regarding these data mining methods, we refer the readers to [39-41]. To evaluate model prediction performance, we compute five metrics: (a) sensitivity: the reliability of survival status prediction, (b) specificity: the reliability of decease status prediction, (c) accuracy: a measure of the overall survival and decease status prediction performance, (d) G-mean: the combined reliability of survival and decease status prediction (pertinent to imbalanced datasets), and (e) area under the receiver operating characteristic (ROC) curve (AUC): a measure of the diagnostic accuracy for survival and decease status prediction. Note that we use G-mean as our primary criterion for model selection, where a model with a higher G-mean value is more reliable in simultaneously predicting survival and decease statuses.
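The first four metrics are defined in Equations (1)-(4), with survival treated as the positive class:

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{1}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{2}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

$$\text{G-mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}} \tag{4}$$

where TP, TN, FP, and FN refer to the number of true positives, true negatives, false positives, and false negatives, respectively.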
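The confusion-matrix metrics can be computed directly from the four counts. The sketch below uses the standard definitions of sensitivity, specificity, accuracy, and G-mean; the function name `classification_metrics` and the example counts are hypothetical and chosen only to show why G-mean is preferred over accuracy on imbalanced data.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity, accuracy, and G-mean from counts."""
    sensitivity = tp / (tp + fn)  # reliability on the positive (survival) class
    specificity = tn / (tn + fp)  # reliability on the negative (decease) class
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "g_mean": g_mean,
    }


if __name__ == "__main__":
    # Hypothetical imbalanced test set: 100 survivors and 10 deceased patients.
    for name, value in classification_metrics(tp=90, tn=2, fp=8, fn=10).items():
        print(f"{name}: {value:.3f}")
```

In this hypothetical case accuracy stays high because the majority class dominates, while G-mean drops sharply because the minority class is predicted poorly.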

Phase II: Prediction of the Number of Survival Months
The goal of phase II is to predict the number of survival months for patients predicted to die within 3 years. The initial (full featured and unbalanced) 3-year survival dataset utilized in phase I is used to construct the training dataset in phase II, where we additionally include the number of survival months feature. The testing dataset for phase II is the correctly predicted output from phase I. Moreover, the model development phase is similar to phase I. In addition to using LASSO and RF methods for feature selection, we employ GLM, XGB, and ANN to predict the number of survival months. To gauge the performance of each prediction model, we calculate the root mean squared error (RMSE) and mean absolute error (MAE). These metrics are defined in Equations (5) and (6), where $Y_i$ is the actual number of survival months, $\hat{Y}_i$ is the predicted number of survival months, and $m$ is the number of records:

$$\text{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(Y_i - \hat{Y}_i\right)^2} \tag{5}$$

$$\text{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|Y_i - \hat{Y}_i\right| \tag{6}$$
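Both regression metrics are straightforward to compute from paired actual and predicted values. The following sketch uses the standard definitions of RMSE and MAE; the function names and the three example patients are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted survival months."""
    m = len(actual)
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / m)


def mae(actual, predicted):
    """Mean absolute error between actual and predicted survival months."""
    m = len(actual)
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / m


if __name__ == "__main__":
    actual_months = [3, 10, 20]      # hypothetical observed survival months
    predicted_months = [5, 10, 17]   # hypothetical model predictions
    print("RMSE:", round(rmse(actual_months, predicted_months), 3))
    print("MAE:", round(mae(actual_months, predicted_months), 3))
```

RMSE penalizes large errors more heavily than MAE, so reporting both indicates whether a model's error is driven by a few badly predicted patients.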

Results and Discussion
In this section, we present the prediction results for phases I and II. For phase I, due to the computational cost of the XGB and ANN models, we prune the total number of model combinations, which span two feature selection methods and five resampling approaches. First, we develop all model combinations for 1-year survival prediction and identify the best feature selection method and data balancing technique for the GLM, XGB, and ANN models (Table 4). These initial model benchmarks are based on 1-year survival data, which contain the largest number of observations compared to other time-points, with the exception of six-month survival. In addition to having a substantial sample size (high reliability), 1-year survival is one of the most commonly reported time-points in the literature [7,15,17]. We use these benchmark results to identify the best feature selection and data balancing methods. Next, we combine the top feature selection and data balancing techniques found in Table 4 with GLM, XGB, and ANN for 0.5-, 1.5-, 2-, 2.5-, and 3-year survival prediction (Table 5).
Table 4 presents the classification results for 1-year survival prediction. First, LASSO feature selection performs marginally better than RF feature selection across all models and all data balancing techniques using G-mean as a criterion. The G-mean values range between 0.847-0.870 and 0.846-0.858 for all models using LASSO and RF feature selection, respectively. LASSO is also more computationally efficient than RF feature selection. Second, the use of ADASYN for data balancing provides equal or higher G-mean values (0.855-0.870) across all models compared to the remaining four data balancing techniques. Models utilizing balancing techniques such as SMOTE and MWMOTE are among the top-performing models, just below the ADASYN method.
The best-performing GLM, XGB, and ANN models based on the G-mean metric (marked in bold in Table 4) are used in 0.5-, 1.5-, 2-, 2.5-, and 3-year survival predictions. Table 5 presents the classification results for 0.5-, 1-, 1.5-, 2-, 2.5-, and 3-year survival predictions using GLM, XGB, and ANN, along with LASSO feature selection and the ADASYN data balancing technique. The highest-performing models for each of the six time-points are marked in bold using the G-mean value as a criterion. Based on Table 5, GLM is the top model for 0.5-year survival prediction, with a G-mean value of 0.887, while ANN is the top-performing model for 1-, 1.5-, 2-, 2.5-, and 3-year survival prediction. Although ANN models exhibit higher performance compared to GLM and XGB for 1-, 1.5-, 2-, 2.5-, and 3-year survival prediction, the G-mean values for GLM and XGB are nearly on par with those offered by ANN models. Additionally, ROC curves for all models listed in Table 5 are plotted in Figure 2, which visually demonstrates the comparable performance of each technique. By incorporating a thorough data preparation scheme within our model framework, we demonstrate that simple models such as GLM can perform comparably to more complex models such as XGB and ANN.

Important Features for Survival Prediction
We use the GLM-LASSO-ADASYN models to extract the most significant survival predictors for all time-points (see supplementary materials at https://github.com/zahrame/LungCancerPrediction for a list of GLM equations). Besides their interpretability, GLM models provide relatively high classification results (see Table 5) at low computational cost. We define the odds ratio (OR = p/(1 − p), where p is the probability of survival) and calculate the relative change in OR (∆OR) to quantify the impact of each important feature based on its respective GLM coefficient:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{j}\beta_j x_j \tag{7}$$

$$\Delta OR = \frac{OR_{new} - OR_{old}}{OR_{old}} = e^{\beta_j} - 1 \tag{8}$$

where $\beta_0$ is the intercept and $\beta_j$ is the GLM coefficient of feature $x_j$. When an individual feature increases by one unit ($x_j + 1$) with all other features held constant, the log-odds in Equation (7) change by $\beta_j$; exponentiating both sides gives $OR_{new}/OR_{old} = e^{\beta_j}$, which decouples each feature's effect on the odds of survival from the logarithmic function of Equation (7). By subtracting one from the result, we obtain the effective change in the odds ratio (Equation (8)) due to an individual feature [23].
Summary stage: Regional is a highly significant and consistent feature that positively impacts (∆OR > 0) a patient's odds of survival across all time-points. If the spread of lung cancer (Summary stage) in a patient is categorized as Regional, the odds of survival are 37.99%, 27.56%, 21.64%, 17.92%, 14.80%, and 16.38% higher on average (holding other features constant) for 0.5-, 1-, 1.5-, 2-, 2.5-, and 3-year survival time-points, respectively. Similarly, Summary stage: Localized is a significant feature that positively affects a patient's survival status, particularly for early time-points. If the spread of lung cancer is categorized as Localized, the odds of survival are 57.92%, 36.45%, 21.64%, 13.89%, and 11.47% higher on average (holding other features constant) for 0.5-, 1-, 1.5-, 2-, and 3-year time-points, respectively.
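As a numerical illustration of the relative change in OR used in this subsection, the sketch below converts a logistic-regression coefficient into a percentage change in the odds of survival. The function names and the coefficient value 0.322 are hypothetical (the fitted coefficients are given in the supplementary materials, not here); the value is chosen only because it produces a change of roughly the same size as those reported above.

```python
import math


def odds(p):
    """Odds of survival for a survival probability p (0 < p < 1)."""
    return p / (1.0 - p)


def delta_odds_ratio(beta):
    """Relative change in the odds when a feature increases by one unit."""
    return math.exp(beta) - 1.0


if __name__ == "__main__":
    beta = 0.322  # hypothetical coefficient of a one-hot encoded level
    change = delta_odds_ratio(beta)
    print(f"Relative change in odds: {100 * change:.2f}%")

    # Consistency check: applying the change to any baseline odds reproduces
    # the multiplicative effect exp(beta) on the odds.
    old_odds = odds(0.40)
    new_odds = old_odds * math.exp(beta)
    print(f"(new - old) / old = {(new_odds - old_odds) / old_odds:.4f}")
```

For a one-hot encoded level, a one-unit increase means switching from the reference level to that level, so a positive coefficient raises the odds and a negative coefficient lowers them, independent of the baseline probability.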
Another prominent feature that positively contributes to patient survival is RX Summ Surg Prim Site: Yes, a feature that documents if a surgery procedure is performed on the primary cancer site. Figure 3 shows that if surgery is performed on a primary site, a patient's odds of survival are 22.47%, 27.57%, 27.30%, 27.29%, and 20.92% higher on average (holding other features constant) for 1-, 1.5-, 2-, 2.5-, and 3-year survival time-points, respectively. Regarding primary cancer sites, Primary site: Upper lobe lung is attributed to higher odds of survival for several time-points. If the primary cancer site of a patient is Upper lobe lung, the patient's odds of survival are 14.47%, 13.75%, 12.75%, and 11.25% higher on average (holding other features constant) for 1-, 1.5-, 2-, and 2.5-year survival time-points, respectively.
In contrast, CS site specific factor 1: Unknown is one of the most significant and consistent features that negatively impacts (∆OR < 0) a patient's odds of survival across all timepoints. If the existence of separate tumor nodules (CS site specific factor 1) cannot be assessed in a patient's ipsilateral lung, the odds of survival are 12.21%, 17.14%, 15.53%, 15.99%, 17.20%, and 14.5% lower on average (holding other features constant) for 0.5-, 1-, 1.5-, 2-, 2.5-, and 3-year survival time-points, respectively. Note that the presence of separate tumor nodules in the ipsilateral lung (CS site specific factor 1: 10, 20, 30, and 40) is highly significant, which negatively impacts (∆OR < 0) a patient's survival status for 1-, 1.5-, 2-, 2.5-, and 3-year survival time-points.
Mets at DX-liver: Yes is another significant and consistent feature that negatively affects a patient's odds of survival. If a patient experiences a distant metastatic involvement of the liver, the odds of survival are 17.40%, 14.67%, 11.56%, 10.86%, 11.81%, and 11.82% lower on average (holding other features constant) for 0.5-, 1-, 1.5-, 2-, 2.5-, and 3-year survival time-points, respectively. Moreover, Regional nodes examined is a vital feature that negatively affects a patient's odds of survival. If the number of removed and examined regional lymph nodes for a patient increases by one node, the patient's odds of survival are 11.83%, 12.99%, 12.99%, 13.7%, and 13.23% lower (holding other features constant) for 1-, 1.5-, 2-, 2.5-, and 3-year survival time-points, respectively.

Phase II: Regression
Table 6 presents the number of survival months prediction results for deceased patients within 3 years, where the best models are marked in bold. Similar to phase I, LASSO outperforms RF feature selection with marginally smaller values of RMSE and MAE for each model methodology. The GLM and XGB models offer similar survival month prediction performance with an MAE of ∼5.6 months. Even though ANN is a more complex model compared to GLM and XGB, the MAE values for ANN using LASSO and RF feature selection are ∼6.7 and ∼7.1 months, respectively. These findings illustrate that although ANN outperforms GLM and XGB in classification problems (phase I), ANN is not guaranteed to outperform the simpler models in regression problems (phase II).
Figure 4 visualizes the top 18 contributing features, those with absolute coefficient values greater than 1.00, that predict the number of survival months. The 13 (5) features with positive green (negative red) bars are attributed to an increase (decrease) in the number of survival months. Histology: >8083 is the most significant feature that positively impacts the number of survival months. If a patient (predicted to perish) is assigned a histology code greater than 8083, the patient is expected to survive 7.07 months longer on average (holding other features constant). Note that a patient (predicted to perish) assigned a histology code, regardless of carcinoma group type, is expected to live several months longer on average compared to a patient who was not or could not be assigned a code (holding other features constant).
Summary stage: Localized and Summary stage: Regional are the next important features that positively contribute to the number of survival months. If the spread of lung cancer in a patient (predicted to perish) is localized or regional, the patient is expected to survive 4.71 or 2.59 months longer on average (holding other features constant), respectively. Additionally, Regional nodes examined and RX Summ Surg Prim Site: Yes are significant features in predicting the number of survival months of a lung cancer patient. If a patient (predicted to perish) has an additional lymph node removed and examined or has surgery performed on a primary cancer site, the patient is expected to live 1.91 or 1.34 months longer on average (holding other features constant), respectively. Note that a higher number of examined regional lymph nodes implies a decrease in a patient's odds of survival (phase I); yet, with the removal and examination of additional lymph nodes, the survival length of a patient expected to perish may be prolonged (holding other features constant).

Contrarily, Mets at DX-liver: Yes is the top significant feature that negatively affects the number of survival months. If distant liver metastases have formed in a patient (predicted to perish), the patient is expected to live 1.83 months less on average (holding other features constant). Moreover, if distant brain (Mets at DX-brain: Yes) or bone (Mets at DX-bone: Yes) metastases have formed in a patient (predicted to perish), the patient is expected to live 1.22 or 1.02 months less on average (holding other features constant), respectively. For every additional year in age (Age at diagnosis), a patient (predicted to perish) is expected to live 1.24 months less on average (holding other features constant). Lastly, if a patient (predicted to perish) is diagnosed with Grade III lung cancer (Grade: Poorly differentiated (Grade III)), the patient is expected to live 1.18 months less on average (holding other features constant). Similar to phase I, the use of one-hot encoding enables us to not only extract significant categorical levels but to interpret the individual levels.

Recent Literature Comparison
A proper one-to-one comparison between our research and prior lung cancer data mining studies is not possible due to variations in dataset time ranges, feature availability, data collection criteria, data preprocessing techniques, modeling approaches, and prediction time-points; nevertheless, we highlight some similarities and differences to provide a synopsis. In a recent study, Doppalapudi et al. [13] yielded AUC values as high as 0.83, 0.86, and 0.92 for 0.5-, 0.5-2-, and >2-year survival prediction, respectively, based on 2004-2016 SEER data using CNN. Our data and approach yield AUC values as high as 0.97, 0.94, 0.94, 0.94, 0.93, and 0.92 for 0.5-, 1-, 1.5-, 2-, 2.5-, and 3-year time-points, respectively (Figure 2). Similar to our study, Doppalapudi et al. found that Histology, Age at diagnosis, Summary stage, and Primary site are important lung cancer survival predictors. Unlike our results, Doppalapudi et al. found that Registry information, Sex, Number of radiation rounds, and two discontinued variables (Number of lymph nodes and Derived AJCC TNM) in the SEER dataset are important features. Although Doppalapudi et al. report the relative importance of various contributing features in survival prediction, the effect of each feature is not quantified.
In another recent study, Wang et al. [7] achieved accuracies (AUC was not reported) of 0.93, 0.78, and 0.72 for 1-, 3-, and 5-year survival prediction, respectively, based on 2010-2015 SEER data using XGB and LR. Our study yields accuracies as high as 0.89, 0.86, 0.87, 0.86, 0.85, and 0.84 for 0.5-, 1-, 1.5-, 2-, 2.5-, and 3-year time-points, respectively. The important predictors Surgery, Grade, Histology, Age at diagnosis, and Race found by Wang et al. are consistent with our results; however, Laterality, Sex, Marital status, and Derived AJCC TNM (a discontinued variable in SEER data) are not. In addition, Jonson et al. [14] yielded an AUC value of 0.94 for 5-year survival prediction based on 1975-2015 SEER data using RF and AdaBoost models. Although Jonson et al. explored intermediate-term survival, they found that Age at diagnosis, Histology, Surgery on primary site, and Summary stage are important features for survival prediction, similarly found in our study for short-term survival. Jonson et al. also found that Sequence Number (used as one of our criteria for data collection) and two discontinued variables (Number of lymph nodes and Extent of disease) are important predictors, which differ from our study. Again, the impact of each feature on lung cancer survival is not quantified in the latter two studies.

Conclusions
Based on the results obtained in this study, we make three main contributions, previously unexplored in lung cancer data mining research. First, we developed a two-phase data analytic framework that is capable of (1) predicting the survival status of a patient with lung cancer for 0.5-, 1-, 1.5-, 2-, 2.5-, and 3-year time-points and (2) predicting the number of survival months for patients who were predicted and labeled as deceased within 3 years. Second, by incorporating a comprehensive data preprocessing step, we showed that a computationally efficient and interpretable model such as GLM can perform comparably to complex models such as XGB and ANN. Moreover, the data preparation steps outlined in phases I and II facilitate data reproducibility. Third, we used GLM-LASSO-ADASYN models to extract important numerical and encoded categorical features (using one-hot encoding), where we interpreted the effect of individual features on the odds of survival in phase I. Similarly, in phase II, we used the GLM-LASSO model to extract important numerical and individual categorical features (using one-hot encoding) that influence the number of predicted survival months. Although the performance of the proposed framework in practice remains a challenge, since other potential factors such as a patient's lifestyle (e.g., diet and smoking behavior) or prior medical/drug history may impact lung cancer survivability, our simple yet interpretable GLM models (phases I and II) may assist physicians in better decision-making by prioritizing the most important factors related to lung cancer survivability.