A Machine Learning-Based Investigation of Gender-Specific Prognosis of Lung Cancers

Background and Objective: Primary lung cancer is a lethal and rapidly-developing cancer type and is one of the leading causes of cancer deaths. Materials and Methods: Statistical methods such as Cox regression are usually used to detect the prognosis factors of a disease. This study investigated survival prediction using machine learning algorithms. The clinical data of 28,458 patients with primary lung cancers were collected from the Surveillance, Epidemiology, and End Results (SEER) database. Results: This study indicated that the survival rate of women with primary lung cancer was often higher than that of men (p < 0.001). Seven popular machine learning algorithms were utilized to evaluate one-year, three-year, and five-year survival prediction. The two classifiers extreme gradient boosting (XGB) and logistic regression (LR) achieved the best prediction accuracies. The feature importances of the trained XGB models suggested that surgical removal (feature “Surgery”) made the largest contribution to the one-year survival prediction models, while the metastatic status (feature “N” stage) of the regional lymph nodes was the most important contributor to three-year and five-year survival prediction. The female patients’ three-year prognosis model achieved a prediction accuracy of 0.8297 on the independent future samples, while the male model only achieved an accuracy of 0.7329. Conclusions: These data suggested that male patients may have more complicated factors in lung cancer than females, and it is necessary to develop gender-specific diagnosis and prognosis models.


Introduction
Lung cancer is one of the leading causes of cancer deaths, and it is estimated to have caused 142,670 deaths in 2019 alone [1]. The incidence rate of lung cancer in females is lower than in males [2][3][4]. The worldwide incidence rate of female lung cancers is still increasing [5]. However, the mortality rate of male lung cancer patients is almost twice that of females [6,7].
Inherent genetic factors and living environments induce apparent gender-specific biological mechanisms and prognostic responses [8]. Non-small cell lung cancer (NSCLC) patients demonstrate large gender-specific survival differences [9]. Zang et al. suggested that the gender variation cannot be explained by the differences in the factors of baseline exposure, smoking history, or body size, but may be caused by the higher susceptibility to tobacco carcinogens in females [10]. That is to say, females are more susceptible to tobacco-induced carcinogenesis than males [11]. Gasperino's study showed that, after taking into account the number of cigarettes smoked, females have a three-times higher risk of lung cancer than males [12]. Another study suggested that females have a better survival rate than males after considering confounders like smoking [13]. Gender differences have also been observed in other cancers. In urothelial carcinoma of the bladder (UCB), males had a higher incidence, but females tended to have worse outcomes [14]. Gender was also found to be a risk factor for the prognosis of head and neck cancer (HNC), and females had a better prognosis than males [15]. The gender specificity of lung cancers is therefore attracting more researchers to work on this topic.
Cancer prognosis is mainly predicted based on the clinician's professional experience or the nomogram calculation. The corresponding scores on the upper dots of each variable graph in the nomogram are added up to obtain the total score, and then a straight line is drawn at the bottom of the chart to estimate the probability of death [16,17]. The nomogram becomes very complicated if multiple variables are included [18]. A nomogram is a statistical model for the probability calculation of a single event like death or recurrence and has been widely used to predict the probabilities of cancer metastasis and prognosis [19][20][21][22][23]. The non-negligible proportions of the available nomograms shared similar end-points and intrinsic complexity, which may limit their applications [24]. Both nomogram and machine learning techniques can be used to estimate the overall survival rate of cancer patients. However, studies have shown that machine learning models outperformed the nomograms in estimating the individual patient's prognosis [25]. The machine learning model may provide complementary information for this challenge by fully utilizing the inter-variable interactions [26].
This study hypothesizes that the gender disparity should be taken into account when a prognosis model is optimized. Multiple classification algorithms were utilized to build binary classification models of whether a patient may survive one year, three years, or five years after the clinical data were collected. In this case, we used the Cox regression model to analyze the prognosis of lung cancer patients. The classification models were built for females and males separately.

Data Sources
The clinical data used in this study were retrieved from the Surveillance, Epidemiology, and End Results (SEER) database (AYA site recode/WHO 2008 8.3 Carcinoma of trachea, bronchus, and lung) [27,28]. AYA is a site/histology recode used to analyze data on adolescents and young adults. The recode was applied to all cases regardless of age so that age comparisons can be made with these groupings. For more information, see http://www.seer.cancer.gov/ayarecode/index.html. The SEER database is an authoritative cancer statistical database in the United States, and global cancer researchers may obtain data through application. A signed SEER data usage agreement is required to obtain the fields and variables in the SEER database. Researchers may make scientific investigations into the SEER data and publish research articles based on the analysis of this data. The database is available at https://seer.cancer.gov/. This study included patients diagnosed with primary lung cancer from 2010 to 2015 and followed up until December 2016. Only the cases with primary tumors that were actively followed (excluding "necropsy only" and "death certificate only") were used. The cases with a follow-up time of 0 may indicate in-hospital deaths and were also excluded. All the objective characteristics of the included patients were collected, including marital status, race, gender, age, histological type, tumor-specific death status, survival time, primary site surgery, tumor grade, lymph node examined, lymph node positive, laterality, year of diagnosis, T stage, M stage, and N stage.
This study collected a cohort of 28,458 patients with primary lung cancers from the SEER database, and slightly more than half of the patients were women (15,000, 52.71%). Table 1 summarizes the baseline characteristics of the primary lung cancer patients. Statistically significant gender differences were observed for several features, including histological type, marital status, race, primary site surgery, grade, T stage, N stage, and M stage. This dataset consisted mostly of the white population, and the observations of this study may serve as a good reflection of the gender disparity in white lung cancer patients. Table 1 shows that 83.20% of male and 82.50% of female patients were white.
It is also interesting to observe that female lung cancer patients tended to be diagnosed at an earlier stage than male patients, since more female patients were diagnosed at the smallest T (Tumor)/N (Node)/M (Metastasis) stages than male patients. 45.30% of the female patients were diagnosed at the T1 stage, compared with 36.60% of the male patients. Among the female patients, 74.20% and 95.20% were diagnosed at the N0 and M0 stages, compared with 68.90% and 94.00% of the male patients, respectively. Moreover, all the T/N/M stages demonstrated statistically significant differences between females and males.
Diagnosed numbers of lung cancer patients were on the rise between 2010 and 2015. However, the relative percentages of females and males were similar and the differential analysis did not show a statistical significance (p-value = 0.682).

Preprocessing of the SEER Database
A preprocessing step was carried out on the lung cancer dataset from the SEER database. This study excluded the unknown data entries from the patients' clinical measurements. The T/N/M stages were annotated and extracted from the SEER database [29,30]. The log odds of the positive lymph nodes (LODDS) were defined as the logarithm of the ratio between the probability of being a positive lymph node and the probability of being a negative lymph node. The formula is LODDS = log[(P + 0.5)/(T − P + 0.5)], where T is the number of total nodes and P is the number of positive nodes [31]. Marital status was grouped as married and unmarried, where unmarried persons consisted of single, separated, divorced, widowed, and other cases. Before the data were loaded to the machine learning algorithms, the unordered features were encoded by the one-hot encoding strategy, and the ordered features and numerical features were normalized. One-hot encoding converts categorical variables into numeric variables, e.g., encoding the categorical variable {A, B, C} as the binary variable {100, 010, 001} [32]. Patients were excluded if they had died from causes other than lung cancer, in order to ensure that the predicted survival focused on lung cancers. A supervised classification study needs to know the category labels of the samples, so this study chose to investigate whether the lung cancer patients survived one year, three years, and five years, respectively. For the one-year survival prediction problem, patients were excluded from the dataset if their follow-up lengths were shorter than one year and they were still alive, since we could not know whether such patients died or were still alive one year after the diagnosis. Similar exclusion rules were carried out for the three-year and five-year survival prediction problems.
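The preprocessing rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's actual code: the function names, the use of the natural logarithm, the follow-up time expressed in months, and the convention that a follow-up equal to the horizon counts as survival are all assumptions made for the example.

```python
import math

def lodds(positive_nodes, total_nodes):
    """LODDS = log[(P + 0.5) / (T - P + 0.5)]; natural log is an assumption."""
    if positive_nodes < 0 or total_nodes < positive_nodes:
        raise ValueError("expected 0 <= P <= T")
    return math.log((positive_nodes + 0.5) / (total_nodes - positive_nodes + 0.5))

def one_hot(value, categories):
    """Encode one unordered categorical value as a 0/1 vector."""
    return [1 if value == c else 0 for c in categories]

def survival_label(follow_up_months, died_of_lung_cancer, horizon_years):
    """Return 1 (survived the horizon), 0 (died within it) or None (excluded).

    Hypothetical helper: follow-up in months and '>=' at the boundary are
    assumptions, not details stated in the text.
    """
    horizon_months = 12 * horizon_years
    if follow_up_months >= horizon_months:
        return 1      # observed beyond the horizon, so alive at the horizon
    if died_of_lung_cancer:
        return 0      # death from lung cancer before the horizon
    return None       # alive but not followed long enough: status unknown

# Example: the categorical variable {A, B, C} from the text.
encoded_b = one_hot("B", ["A", "B", "C"])   # [0, 1, 0]
```

Returning None for the ambiguous cases keeps the exclusion rule explicit, so the same helper can be reused for the one-year, three-year, and five-year problems by changing only horizon_years.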

Binary Classification Algorithms
Seven popular classification algorithms were utilized to build the binary classification models for the one-year, three-year, and five-year survival prediction problems.
Naïve Bayes (NBayes) assumes inter-feature independence and calculates the class-specific prediction model using Bayes' theorem. Although NBayes makes a strong assumption of inter-feature independence, it has been widely used to build accurate prediction models [33,34].
Three tree-based classifiers were evaluated for the survival prediction problems in this study. The decision tree (DTree) is a fast and popular classifier, and its trained tree structure is easy to interpret [35]. Random forest (RF) is an ensemble algorithm based on multiple random trees as the base classifiers, and has demonstrated its efficiency for biomedical prediction problems [36]. Extreme gradient boosting (XGB, implemented as XGBoost) is a popular and fast classifier that handles various biomedical data types, including the spectroscopy spectrum and bioelectrical data [37,38].
The simple but effective classifier k-nearest neighbor (KNN) is a supervised learning algorithm based on the voting results of the training samples most similar to the query sample [39].
Logistic regression (LR) calculates the probability of a specific event such as the survival of a lung cancer patient, and is popular for building biomedical prediction models [40,41]. Support vector machine (SVM) tries to find a hyperplane with the largest distance or margin to the two classes of samples in the multi-dimensional space [42].

Evaluation Metrics of Binary Classification Algorithms
The machine learning models were evaluated by four binary classification performance metrics, i.e., sensitivity (Sn), specificity (Sp), accuracy (Acc), and F1 score (F1). Sn and Sp are defined as the percentages of positive and negative samples that were predicted correctly, respectively. Acc refers to the ratio of the number of correctly predicted samples to the total number of all the samples. F1 is the harmonic mean of the precision and recall of a binary classification model, where recall is the same quantity as Sn. All four performance metrics range from 0 to 1, with higher values indicating better classification performance.
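As a concrete reading of these definitions, the following sketch computes the four metrics from true and predicted labels. It uses the standard definition of F1 as the harmonic mean of precision and recall. Treating label 1 (for example, "survived") as the positive class is an assumption of this example, and the function name is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Compute Sn, Sp, Acc and F1 for 0/1 labels (1 = positive class, assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sn = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity = recall
    sp = tn / (tn + fp) if (tn + fp) else 0.0          # specificity
    acc = (tp + tn) / len(pairs) if pairs else 0.0     # accuracy
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sn / (precision + sn)) if (precision + sn) else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "F1": f1}

# Toy labels, not data from the study.
example = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

Because the survival classes are imbalanced (most patients survive one year), Sn and Sp are worth reporting next to Acc, which alone can look high for a model that mostly predicts the majority class.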

Data Analysis Procedure
The data analysis process is shown in Figure 1. The baseline data analysis formulated the categorical data as percentages and the continuous data as averaged values and standard deviations. Two statistical tests, the Chi-squared test (Chi2test) and the t-test (Ttest), were used to evaluate the associations between the categorical or continuous features and gender. The log-rank test was used to compare the Kaplan-Meier (KM) survival curves of the two genders [43]. The Cox regression was used to perform univariate survival analysis of the individual features. All the prognostic factors derived from the univariate analysis were collected for the multivariate analysis. We also used the nomogram analysis method based on the Cox proportional hazards model to make predictions. For the nomogram, a predicted one-year, three-year, or five-year survival probability below 0.5 was defined as death, and a probability above 0.5 as alive.
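The KM survival curves compared by the log-rank test are commonly obtained with the Kaplan-Meier product-limit estimator. The sketch below is a plain-Python illustration of that estimator under the usual conventions (event = 1 for an observed death, 0 for a censored patient). It is not the SPSS procedure used in the study, and the function name is hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times  : follow-up time of each patient
    events : 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after time t.
        n_at_risk -= leaving
    return curve

# Toy follow-up data, not from the SEER cohort.
toy_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Computing one such curve per gender gives the two step functions that the log-rank test compares.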
The classification models were evaluated by the stratified three-fold cross-validation strategy (S3FCV). In S3FCV, the positive and negative samples were each randomly split into three equally-sized sub-groups. One positive subset and one negative subset were used as the test dataset and the other samples were used to train the classification model. This process was repeated until every subset had been used once as the test dataset. The overall prediction performance metrics were calculated over these iterations [44]. The parameters of all the utilized machine learning models are shown in Supplementary Table S3. All the calculations and experiments were performed using SPSS software version 24.0 and Python version 3.6. The machine learning algorithms were implemented in the Python module scikit-learn version 0.19.1. The multivariate survival analysis was conducted using the Cox regression model.
Figure 1. Flow chart of this study. The male and female samples were grouped as the datasets "Male" and "Female", respectively. All the samples constitute the overall dataset "All". The sample numbers are provided in parentheses.
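The S3FCV splitting described above can be sketched without any external library. The version below is an illustrative assumption of how the folds may be formed (shuffle each class, then deal its samples round-robin into three folds). In practice, scikit-learn's StratifiedKFold offers equivalent functionality, and the study's exact splitting code is not given in the text.

```python
import random

def stratified_k_fold(labels, k=3, seed=0):
    """Yield (train_indices, test_indices) for stratified k-fold cross-validation."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        # Split each class separately so every fold keeps the class ratio.
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx

# Toy labels: six positive and three negative samples.
toy_labels = [1] * 6 + [0] * 3
toy_splits = list(stratified_k_fold(toy_labels, k=3, seed=0))
```

Each sample appears in exactly one test fold, so pooling the three sets of test predictions yields one set of overall performance metrics.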

Gender Disparities in the Prognosis of Primary Lung Cancers
The experimental data suggested that lung cancer patients' survival probabilities are significantly different between the two genders. We used the log-rank test to measure the difference in survival rates between female and male patients with primary lung cancer (p < 0.001). The female patients with primary lung cancer tended to have a better survival rate than male patients (Figure 2). The one-year survival rate of male patients was 85.9%, while the female patients had a survival rate of 92.4%. The female patients had better survival rates of 79.1% and 70.5% at 3 years and 5 years, compared with 68.0% and 59.0% for the male lung cancer patients. The Cox regression was used to evaluate the prognostic value of the investigated features, as shown in Table 2. The univariate analysis suggested that gender, age, and race had significant effects on prognosis. Among the clinicopathological factors, LODDS, histological type, grade, T, N, M, and primary site surgery were all prognostic factors affecting lung cancer (Table 2). Laterality had no effect on the prognosis of lung cancer patients (HR (Hazard Ratio) = 1.023, p-value = 0.322). The above-mentioned prognosis-related factors were included in the multivariate analysis. Gender was identified as an independent prognostic factor with an adjusted hazard ratio of 0.698 (95% CI 0.666-0.731). The risk of death without surgery at the primary site was 1.631 times that with surgery (p-value < 0.001). In contrast, marriage was a protective factor for primary lung cancer (HR = 0.865, p-value < 0.001). T stage, N stage, and M stage were independent risk factors for prognosis; in particular, the risk of death at M = 1 was 2.137 times that at M = 0 (p-value < 0.001).

Machine Learning-Based Prediction of Survival Status
Seven popular binary classifiers were utilized to predict whether a lung cancer patient survived for one, three, and five years, as shown in Figure 3. Figure 3A illustrates the machine learning models of the survival prediction problems for all the samples. The classifier XGB achieved the highest Acc for the one-year survival prediction problem (Acc = 0.9075). Another classifier LR achieved the highest Acc values for the three-year (Acc = 0.7565) and five-year (Acc = 0.7179) survival prediction problems. The machine learning models performed slightly differently on the gender-specific datasets, as shown in Figure 3B,C. The classifier XGB achieved the highest Acc = 0.8786 for the one-year survival prediction problem for the male samples. Similar to the above, the classifier LR achieved the highest Acc for the three-year (Acc = 0.7243) and five-year (Acc = 0.7352) survival prediction problems, as shown in Figure 3C. For the female samples, the classifier XGB achieved the highest Acc for the one-year (Acc = 0.9300) and three-year (Acc = 0.7849) survival prediction problems, and the classifier LR achieved the highest Acc for the five-year (Acc = 0.7212) survival prediction problem, as shown in Figure 3B.
The nomogram in the survival analysis was based on the Cox proportional hazards model. Therefore, we used the Cox model to predict survival at one, three, and five years. On the dataset of all the samples (denoted as "Total"), XGB achieved better results than the Cox-based nomogram model in the one-year and three-year survival predictions, and LR had a higher accuracy in the five-year survival prediction, as shown in Figure 3 and Supplementary Table S1.

Feature Contributions of the XGB Models
The feature contribution was measured by the importance of each feature returned by the trained XGB models for the one-year, three-year, and five-year survival prediction models (Table 3). Except for the feature "Gender" for the dataset of all the samples (denoted as "All"), all the features were ranked in descending order of their XGB model importance measurements. The results for the datasets of male and female samples were denoted as "Male" and "Female".
We further evaluated how much each feature contributed to the classification models, as shown in Figure 4. The above sections demonstrated that the two classifiers XGB and LR usually performed best on the one-year, three-year, and five-year survival prediction problems. However, the classifier LR did not generate a measurement of feature importance, so the measurement of feature importance from the trained XGB models was used to describe each feature's contribution to the prediction models. Figure 4A suggests that the feature "Surgery" is the most important factor for one-year survival, while the spreading of the lung cancers to the regional lymph nodes (the N stage) played an essential role in determining whether a lung cancer patient may survive for three and five years, as shown in Figure 4B,C. The T/N/M stages (features "T", "N" and "M") and the grade (feature "Grade") were consistently ranked after the feature "Surgery" for all the three datasets "All", "Male" and "Female" for the one-year survival prediction problem. The top-five ranked features were the same for the three-year and five-year survival prediction problems, as shown in Figure 4B,C. The XGB models were constructed to predict the survival status of patients with primary lung cancers and visualized for easy inspection by the reader (Supplementary Figure S1).
The LR classifier was used to calculate the accuracies of the one-year, three-year, and five-year survival prediction problems, as shown in Figure 5. The features were incrementally added to the feature subsets by their ranks calculated in the previous section. Since the datasets did not have consistent ranks for the features, Figure 5 only gives the feature ranks on the horizontal axis.
We may observe the overall trend that more features may achieve better prediction accuracies, as shown in Figure 5. However, the inclusion of some features may decrease the model performance in some cases. For example, the best LR-Female model achieved Acc = 0.9309 using only six features, and the model accuracy decreased to Acc = 0.9296 after five more features were added. The survival prediction models usually achieved a very good prediction accuracy using about six features, and the inclusion of more features achieved only minor accuracy improvements.
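The incremental procedure described here (add features one at a time in rank order and record the accuracy of each subset) can be sketched as follows. The function names are hypothetical, and the evaluate callback stands in for whatever is used to score a feature subset, for example the cross-validated accuracy of an LR model. The scores in the example are made-up toy values, not results from the study.

```python
def incremental_feature_curve(ranked_features, evaluate):
    """Score the top-1, top-2, ..., top-n feature subsets in rank order.

    ranked_features : feature names sorted from most to least important
    evaluate        : callable mapping a list of features to an accuracy
    """
    curve = []
    for n in range(1, len(ranked_features) + 1):
        subset = ranked_features[:n]
        curve.append((n, evaluate(subset)))
    return curve

def best_subset_size(curve):
    """Return the smallest subset size that reaches the highest accuracy."""
    return max(curve, key=lambda item: item[1])[0]

# Toy illustration only: accuracies are invented for the example.
toy_scores = {1: 0.70, 2: 0.80, 3: 0.79}
toy_curve = incremental_feature_curve(
    ["Surgery", "N", "T"], lambda features: toy_scores[len(features)]
)
toy_best = best_subset_size(toy_curve)
```

Plotting such a curve for each dataset shows where extra features stop helping, or even reduce the accuracy.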

Feature Contributions to the RF Models
The machine learning algorithm RF also calculates an importance measurement for each feature, and this measurement was used to describe each feature's contribution to the one-year, three-year, and five-year survival prediction models, as shown in Table 4. The results showed that LODDS and Age were always the most important factors for survival status at one, three, and five years. This observation was slightly different from the features with large contributions in the XGB models. We again added the features incrementally to the feature subsets, this time following the RF ranks. We found that the best LR-Female model reached Acc = 0.9315 using only nine features, and adding two more features reduced the accuracy of the model to Acc = 0.9296 (Figure 5). Studies have shown that there are multiple optimal solutions for the same problem [45]. In the future, different combinations of features can be considered when looking for cancer markers.

Independent Validation of the Models Using Future Samples
This study validated only the three-year survival prediction models due to the limited number of diagnosis years in the SEER database. The samples from the diagnosis years 2010-2011 were used as the training samples, and the samples from 2012-2013 were used as the independent validation samples. The survival statuses of the samples diagnosed in 2013 were determined by the follow-up data in the years 2014-2016, where the 2016 data were also retrieved from the SEER database. The LR model trained on the 2010-2011 samples achieved Acc = 0.7841 on the "All" dataset from the diagnosis years 2012-2013. The female samples demonstrated an even better consistency in the three-year survival prediction by the independent validation with Acc = 0.8297. The male samples received a slightly worse model with Acc = 0.7329.
We found that most pairs of investigated baseline characteristics did not have statistically significant and strong correlations (correlation coefficient > 0.300 and p < 0.05), except for a few cases. The variables N and LODDS were observed to have a correlation coefficient of 0.630 and p < 0.001. The variable Surgery was correlated with N (correlation coefficient = 0.323, p < 0.001) and M (correlation coefficient = 0.348, p < 0.001). The variable Stage was correlated with T (correlation coefficient = 0.653, p < 0.001), LODDS (correlation coefficient = 0.371, p < 0.001), N (correlation coefficient = 0.579, p < 0.001), and M (correlation coefficient = 0.415, p < 0.001). The other pairs of these baseline characteristics were either uncorrelated or only weakly correlated (Supplementary Table S2). The machine learning models developed in the above sections suggested that the removal of some features may improve the prognosis prediction models.
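The temporal validation used in this section (train on earlier diagnosis years, validate on later ones) amounts to a split by year of diagnosis. A minimal sketch is given below. The record layout and the field name "year_of_diagnosis" are hypothetical, chosen only for illustration.

```python
def temporal_split(records, train_years=(2010, 2011), valid_years=(2012, 2013)):
    """Split patient records into training and future validation sets by year."""
    train = [r for r in records if r["year_of_diagnosis"] in train_years]
    valid = [r for r in records if r["year_of_diagnosis"] in valid_years]
    return train, valid

# Toy records, not SEER data. Years outside both ranges are simply left out.
toy_records = [
    {"id": 1, "year_of_diagnosis": 2010},
    {"id": 2, "year_of_diagnosis": 2012},
    {"id": 3, "year_of_diagnosis": 2015},
    {"id": 4, "year_of_diagnosis": 2011},
    {"id": 5, "year_of_diagnosis": 2013},
]
toy_train, toy_valid = temporal_split(toy_records)
```

Unlike a random split, this keeps all validation patients strictly later than the training patients, which is closer to how a prognosis model would be applied to newly diagnosed cases.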

Discussion
Gender disparity has been observed in lung cancer incidence, mortality, prognosis, and treatment responses. The prognosis analysis in this study suggested that female lung cancer patients had a better prognosis than male patients. Radkiewicz et al. found that the prognosis of male patients with non-small cell lung cancers (NSCLC) was poor, even after careful adjustments for various prognostic factors [46]. This study confirmed this observation and observed the gender-specific developmental T/N/M stages on diagnosis. Biological differences may explain females' better treatment responses and improved survival rates [47]. Kinoshita et al. also pointed out that gender differences in the histological types and developmental stages on diagnosis may partially explain the better prognosis of female lung cancer patients [48]. Studies of both surgery and systemic therapeutic treatments suggested that, besides the never smokers, female lung cancer patients overall experienced a better prognosis than males [49].
Both inherent genetic factors and gene-environment interactions play important roles in regulating the prognosis of lung cancers [7,50,51]. Various factors may induce the poor prognosis of male lung cancer patients. Firstly, female lung cancer patients benefited more from the anti-programmed cell death-1/programmed death ligand-1 (anti-PD-1/PD-L1) chemotherapy than male patients [52]. Immune checkpoint inhibitors were more effective in male patients than female patients, while immune checkpoint inhibitors combined with chemotherapy were often more effective in female patients [53]. Secondly, male lung cancer patients tended to have more aggressive tumor developmental behaviors, e.g., faster growth and higher metastatic potentials [46].
Machine learning algorithms are becoming a popular technique to predict survival status. Compared with nomograms, machine learning models are not easy to interpret, but clinical doctors do not have to manually calculate the risk scores as with nomograms. A machine learning model may deliver prediction results rapidly and conveniently to the users once the input variables are loaded. The survival prediction results in this study showed better prediction performances for female patients with primary lung cancers than male patients. The two classifiers XGB and LR performed similarly well on the binary classifications of the one-year, three-year, and five-year survival prediction problems.
The feature contributions to the prediction models were evaluated by the feature importances of the trained XGB models. One-year survival status heavily relied on the variable "Surgery" for both genders and their mixture, while the most important contributions came from the variable "N" stage for the three-year and five-year survival prediction problems. So, surgical removal is important for the one-year survival of primary lung cancer patients. A long-term goal for primary lung cancer patients is then to pay special attention to monitoring the regional lymph nodes for possible metastatic tumor development.
This study has the following limitations, which may be overcome by more comprehensive data sources. The SEER database provides a limited set of clinical variables for the patient and it is anticipated that more clinical data, e.g., on smoking and drinking, will improve the model performances. Imaging and molecular data, together with information on the patient's mental health, genetics, and living environment, are also expected to increase the models' prediction performance. This study selected seven commonly-used machine learning algorithms to investigate the gender-specific prognosis of lung cancers. This gender disparity may be further evaluated by more machine learning algorithms. Novel machine learning algorithms may also be developed in future studies to deliver gender-independent prognosis models with similar prediction performances to the gender-specific models. For example, a combination of nomogram and machine learning prediction models can be considered for analysis. Recently, multiple instance learning has demonstrated its superiority in various applications including tumor imaging analysis [54][55][56][57]. The deployment of multiple instance learning methods may significantly improve prognosis prediction for cancer patients. Spherical separation surface-based approaches were observed to perform better than classifiers based on linear separation surfaces on the binary classification problems of two similar class labels, and may be utilized in future investigations [58,59].
Gender-specific prognosis may be investigated using biomedical imaging data and other data types in future studies.

Conclusions
In conclusion, this study provided an exploratory investigation of gender disparity in the prognosis of primary lung cancers using regular statistical methods and machine learning prediction models. Both techniques consistently supported the prognosis observations derived from each other. Due to the limited availability of the primary lung cancer dataset with long-term follow-ups, the observations were not validated by an independent dataset.
Supplementary Materials: The following are available online at https://www.mdpi.com/1010-660X/57/2/99/s1, Figure S1: Visualization of using XGB model to predict the survival status of patients with primary lung cancer, Table S1: The nomogram predicts the prediction results of 1 year, 3 years, and 5 years of survival, Table S2: Correlation coefficient analysis on the pairs of baseline characteristics, Table S3: Parameters of the 7 machine learning methods.
Institutional Review Board Statement: Patient consent was waived because the data are from the publicly available SEER database. A SEER Research Data Agreement was obtained to enable access to the SEER data. This study required no additional ethical approval as it involved no interaction with human participants or personal identification of participants.
Informed Consent Statement: Informed consent was not applicable for this study.

Data Availability Statement: The data are from the SEER database. A SEER Research Data Agreement was obtained to enable access to the SEER data. The database is available at https://seer.cancer.gov/.