A Random Forest Model for Post-Treatment Survival Prediction in Patients with Non-Squamous Cell Carcinoma of the Head and Neck

Background: Compared to squamous cell carcinoma, head and neck non-squamous cell carcinoma (HNnSCC) is rarer. Integrated survival prediction tools are lacking. Methods: 4458 patients of HNnSCC were collected from the SEER database. The endpoints were overall survivals (OSs) and disease-specific survivals (DSSs) of 3 and 5 years. Cases were stratified–randomly divided into the train & validation (70%) and test cohorts (30%). Tenfold cross validation was used in establishment of the model. The performance was evaluated with the test cohort by the receiver operating characteristic, calibration, and decision curves. Results: The prognostic factors found with multivariate analyses were used to establish the prediction model. The area under the curve (AUC) is 0.866 (95%CI: 0.844–0.888) for 3-year OS, 0.862 (95%CI: 0.842–0.882) for 5-year OS, 0.902 (95%CI: 0.888–0.916) for 3-year DSS, and 0.903 (95%CI: 0.881–0.925) for 5-year DSS. The net benefit of this model is greater than that of the traditional prediction methods. Among predictors, pathology, involved cervical nodes level, and tumor size are found contributing the most variance to the prediction. The model was then deployed online for easy use. Conclusions: The present study incorporated the clinical, pathological, and therapeutic features comprehensively and established a clinically effective survival prediction model for post-treatment HNnSCC patients.


Introduction
Head and neck cancer (HNC) is common, and there are about 932,000 new cases and 467,000 deaths worldwide each year, according to GLOBOCAN 2020 [1]. The incidence and mortality are the seventh highest compared to other cancer sites [1,2]. According to an epidemiological study based on the Surveillance, Epidemiology, and End Results (SEER) database, the incidence of HNC is still increasing, with an average annual percent change of 0.6% [3]. The histological type is mainly head and neck squamous cell carcinoma (HNSCC) derived from mucosal epithelium, accounting for over 90% [4]. HNSCC studies are relatively sufficient, thanks to which the 5-year survival for HNSCC has increased from 55% during 1992-1996 to 66% during 2002-2006, according to SEER [5]. Even many unique risk factors of HNSCC have been identified, such as Epstein-Barr virus (EBV), human papillomavirus (HPV), and areca nut, which are not common risk factors in other cancers, unlike tobacco or alcohol. Additionally, different with head and neck non-squamous cell carcinoma (HNnSCC), most HNSCCs (excluding those in the oral cavity) are treated with radiation therapy primarily [6]. Although there are enormous differences between HNSCC and HNnSCC, studies of HNnSCC are still very limited since it is uncommon and heterogeneous. Therefore, the present study aimed to provide more insight into HNnSCC.
Survival and recurrence evaluation is essential for patients and physicians, and can influence various clinical decisions or choices. However, there is still no appropriate survival prediction model for HNnSCC, especially an integrated tool for heterogenous primary sites and pathological subtypes. There are only a limited number of survivalprediction nomograms for squamous cell carcinoma [7][8][9], small-cell carcinoma [10], soft tissue sarcoma of the head and neck [11], and carcinoma of salivary glands [12,13]. The limitation of the nomogram is that it is the visualization of linear regression, which is easy to use, but not appropriate for some complex problems. Thus, to establish a survival prediction tool for HNnSCC with highly heterogeneous sites and pathological types, nonlinear ensemble learning algorithms were utilized in this study, which is expected to have higher accuracy [14]. Meanwhile, the SEER database is also appropriate for filtering out sufficient data for uncommon HNnSCC, which is essential for machine learning algorithms.
Thus, this is the first study using large-scale data to develop a survival prediction tool of HNnSCC with the help of machine learning and the SEER database. The developed prediction model was also deployed on a website. Physicians worldwide can evaluate the survivals of HNnSCC using this website easily.

Data Source
This study is reported in accordance with the TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) statement.
Data used in the present study are all from the SEER database. Data during 2000-2019 (17 registries) are downloaded. The International Classification of Diseases for Oncology, Third Edition (ICD-O-3), site codes C01-C14 and C30-C32 are used to filter out all head and neck cancer cases. Then, data with the histological codes 8000-8084 (not otherwise specified (NOS) and squamous neoplasm) and 9590-9992 (blood system-related cancers) are excluded. Exclusion criteria also include (1) inclusion of incomplete clinical or pathological data; (2) multiple primary tumors; (3) follow-up periods less than 6 months; (4)

Endpoint and Variables
There are two primary endpoints, overall survival (OS) and disease-specific surviva (DSS, also called cancer-specific survival here). The variables of the SEER database "Survival months", "SEER cause-specific death classification", and "Vital status recode"

Endpoint and Variables
There are two primary endpoints, overall survival (OS) and disease-specific survival (DSS, also called cancer-specific survival here). The variables of the SEER database, "Survival months", "SEER cause-specific death classification", and "Vital status recode", are used to decide the endpoints.
Sex, age, race, marital status at diagnosis, living area, tumor sites, histopathological type and grade, TNM staging, tumor size, involved lymph nodes, and treatment information are included in the present study for analyses. Three subgroups of marital status are identified: married, never married, and others (widowed and divorced, mainly). The data about living area includes rural/urban, and median household income, which has been adjusted to 2019 according to inflation by the SEER database. Site-specific factors are converted to involved cervical lymph node regions. Treatment variables are whether chemotherapy, radiotherapy, or surgery was received. It is worth noting that all these treatment variables are collected after treatments, which means the analyses and prediction models in the present study can only be used after physicians' treatment, rather than interfering with their judgement.

Prediction Model Establishment
Nominal categorical variables are changed to dummy variables. Hazard ratios (HRs) of OS and DSS are separately calculated with Cox regression analysis. The statistically significant variables are input in the prediction model establishment. TNM staging is highly overlapped by variables of tumor size, involved lymph nodes, and distant metastases, and thus, not included in the regression analysis and establishment of the prediction model. The random forest model is established with the "scikit-learn" package on Python. The patients are stratified-randomly divided into the train & validation cohort (70%) and test cohorts (30%). Tenfold cross validation are utilized in the process of choosing the most appropriate hyper-parameter values. Established models are exported as "*.pkl" files with the "joblib" package.

Prediction Model Evaluation
In the present study, the model performance is evaluated and exhibited through three types of curves. The receiver operating characteristic (ROC) curves are drawn to display the discrimination ability of the models. The area under curve (AUC) quantifies the results. Secondly, the calibration curves represent the concurrence between the actual survival and the predicted probability. The decision curve analysis (DCA) is used to compare the net benefit between the models and the traditional TNM-based linear prediction model, which is made to simulate traditional prediction methods based on TNM classification with "survival", "rms", and "nomogramEx" packages on R 4.2.2. Curves are plotted with Matplotlib. Performance on predicting 3-year OS, 3-year DSS, 5-year OS, and 5-year DSS are all evaluated with three types of curves.
Although random forest, a decision tree-based machine learning method, can predict with higher accuracy than nomograms, it is not as intuitive as nomograms. To better understand the developed model, the Gini coefficients of the random forest model are shown in a heatmap. A higher Gini coefficient means that the factor is more important and contributes more information/variance to the developed model.

Deployment
Considering the good evaluation results, an interactive website is established to improve the practical value of the present study. Entering clinical information required, 3-year OS, 3-year DSS, 5-year OS, and 5-year DSS can be calculated automatically. Django on Python is utilized for establishing the prediction website.

Statistical Analysis
Categorical variables are described by both frequency and percentage. The 95% confidence interval (CI) of scaled variables is displayed. Kaplan-Meier analysis is performed for survival analyses. Cox regression analysis is performed for investigating prognostic factors and calculating HRs. The AUCs between models and the TNM-based Cox regression model are compared with Delong's test. The p value is considered statistically significant when <0.05. The statistical analysis of the present study is performed mainly with SPSS Statistics 26.

Characteristics and Regression Analysis
Characteristics of the train & validation cohort, test cohort, and comprehensive data are summarized in Table 1. A total of 4458 cases were included in the present study, divided into the train & validation cohort with 3104 cases and the test cohort with 1354 cases. The median age is 45-59, and 51.1% are female patients. In total, 56.6% of patients were married, and 22.8% were never-married patients. The most frequently involved site of HNnSCC is the salivary glands (67.4%), followed by the oral cavity (24.4%). The common pathological types include "adenomas and adenocarcinomas" (28.4%), "mucoepidermoid neoplasms" (44.4%), and "acinar cell neoplasms" (8.3%). A total of 42.8% cases were evaluated to be Grade II (moderately differentiation), which is more than other grades. When diagnosed, 39.8% of patients are Stage I. Overall, 21.4% are Stage II, and 14.5% are Stage III. In total, 48.3% of tumors are smaller than 20 mm, while 13.7% are larger than 40 mm. Most involved lymph nodes are Level II (12.3%), followed by Level I (8.9%) and Level III (6.5%). Referring to the treatments, 95.5% of patients received surgery. A total of 49.6% received radiotherapy, and 10.7% received chemotherapy. HRs of OS were calculated and are listed in Table 2, and those of DSS are in Table 3. The results show that younger patients have better OS and DSS. Male (HR = 1.221-1.281) patients have shorter survival than female patients. Marital status at diagnosis has no significant effect on DSS, while married patients have a better OS than widowed/separated patients (HR = 1.291). Areas with lower household incomes tend to have poor OS and DSS (HR = 1.039-1.720; compared to the richest subgroup). The race and whether living in urban or rural areas shows no effect on either OS or DSS in multivariate analyses. Large tumor size (>20 mm, ≤40 mm HR = 1.625-1.818; >40 mm HR = 2.648-3.101) and Involved level I, II, and IV cervical lymph nodes (HR = 1.247-1.548) were associated with both poor OS and DSS, while Involved level III (HR = 1.429) was associated with only poor OS. Distant metastases also indicate poor OS and DSS (HR = 3.406-3.758). Compared to salivary gland tumors, tumors of the larynx/hypopharynx (HR = 2.209-2.608) had both poorer OS and DSS. The prognosis of tumors of the oral cavity, nasal cavity/paranasal sinus, nasopharynx, or oropharynx is similar to that of salivary gland tumors. In analyses of pathological subtypes, adenomas and adenocarcinomas (HR = 1.480-1.782) and acinar cell neoplasms (HR = 1.867, only significant for DSS) have a poorer prognosis than other subtypes. Pathological grades show significant association with both OS (HR = 1.589-3.836) and DSS (HR = 3.040-9.006). As for treatments, all surgery (HR = 0.668-0.675) and chemotherapy (HR = 1.212-1.271) both influence OS and DSS, while radiotherapy (HR = 1.343) only influences DSS. However, it is important to note that these treatment variables also included potential information about patients' performance status and clinical stages of cancer, since early-stage patients with better performance status tend to receive surgery, and advanced patients need more chemotherapy and radiotherapy. Therefore, all these treatments variables should be collected and used for survival prediction or analyses after treatments. Survivals of different treatment combinations cannot be compared here and should not be used for treatment choice.

Model Evaluation
Prediction models using random forest were established with tumor size, involved cervical lymph nodes, and distant metastases, instead of TNM classification, because of information overlapping. The performance of the prediction model was evaluated with the 1354 cases in the test cohort. The model performance and comparison with traditional TNMbased Cox regression methods are displayed in Table 4. The TNM-based Cox regression model is used to simulate traditional prediction methods. The AUCs of random forest range from 0.862 to 0.903, which are all significantly better (p < 0.001) than the regression prediction with TNM stages. The AUCs of random forest can also be seen graphically in Figure 2a. The calibration curves are all close to the diagonal (Figure 2b), which indicates that the predicted probabilities are similar to the actual proportions. The results of DCA (Figure 3) show that no matter whether predicting 3-year OS, 3-year DSS, 5-year OS, or 5-year DSS, the random forest models are all better than the traditional TNM-based regression model, which is to simulate traditional prediction methods.      Figure 4 shows the importance of each variable (Gini coefficients) included in the prediction models. The pathology (including both type and grade), involved nodes levels, tumor size, and age have the most critical influence on survival prediction. Age is apparently more important in OS prediction than in DSS prediction.

Prediction Website
An interactive website able to be easily used was deployed online. It can be accessed with "http://42.192.80.13:5001/ (accessed on 25 July 2023)". With the needed variables filled, the predicted results can be computed automatically. Figure 5 shows examples of the online prediction tool.

Prediction Website
An interactive website able to be easily used was deployed online. It can be accessed with "http://42.192.80.13:5001/ (accessed on 25 July 2023)". With the needed variables filled, the predicted results can be computed automatically. Figure 5 shows examples of the online prediction tool.

Discussion
Although there are a larger number of HNC studies, the studies of HNnSCC are still limited and need to be improved because of its rarity and heterogeneity. Existing studies and many guidelines for head and neck mainly focus on HNSCC, and the prognosis prediction tool for HNnSCC is insufficient. Also, the high heterogeneity of HNnSCC indeed brings difficulties for establishing an integrated prediction tool. For example, the 5-year OS of small-cell carcinoma of the head and neck is 27% [15], while the 5-year OS of giant-cell sarcoma of the head and neck is 54.6% [16] and that of Ewing sarcoma of the head and neck could be 87% [17]. Traditional approaches, like nomograms based on linear fitting, might be inappropriate and not as accurate as machine learning. To date, no prediction tool can be used on heterogeneous HNnSCC and perform well. Thus, the present study is aimed at exploring predicting OS and DSS of HNnSCC with an ensemble learning algorithm and large-size clinical data.
In the present study, many variables were found to be prognosis predictors of OS or

Discussion
Although there are a larger number of HNC studies, the studies of HNnSCC are still limited and need to be improved because of its rarity and heterogeneity. Existing studies and many guidelines for head and neck mainly focus on HNSCC, and the prognosis prediction tool for HNnSCC is insufficient. Also, the high heterogeneity of HNnSCC indeed brings difficulties for establishing an integrated prediction tool. For example, the 5-year OS of small-cell carcinoma of the head and neck is 27% [15], while the 5-year OS of giant-cell sarcoma of the head and neck is 54.6% [16] and that of Ewing sarcoma of the head and neck could be 87% [17]. Traditional approaches, like nomograms based on linear fitting, might be inappropriate and not as accurate as machine learning. To date, no prediction tool can be used on heterogeneous HNnSCC and perform well. Thus, the present study is aimed at exploring predicting OS and DSS of HNnSCC with an ensemble learning algorithm and large-size clinical data.
In the present study, many variables were found to be prognosis predictors of OS or DSS in multivariate regression analysis (Tables 2 and 3). Tumor size and involved levels of cervical lymph nodes replace T and N stages in the multivariate analysis of this study. Involved node levels I, II, and IV were found to be independent predictors of both OS and DSS, but the node level III was not found statistically significant in analyses of DSS. In many HNC studies, level IV is regarded as a prognostic factor [18,19], while there are studies mentioning levels I and III, but not as commonly as level IV [20,21]. However, the effect of the involved node levels on HNnSCC survival is still limited. Surgery and chemotherapy are significant for both of DSS and OS, while radiotherapy is only significant for DSS. Furthermore, radiotherapy and chemotherapy have higher HRs, which might be because they tend to be used in patients with more advanced stages. Meanwhile, the surgery choice made by surgeons must indicate earlier stages and better potential performance status. These also indicate that the treatment variables in the prediction of the present study can only be used after the treatment choice is decided. In comparison, many studies reported prognostic predictors of HNSCC. The role of age as a predictor of HNSCC prognosis has been controversial, with some studies considering it a predictor [22,23], while others not [24,25]. Other factors, such as marital status, primary site, tumor size, involved lymph nodes, distant metastasis, and treatments, can also be independent prognostic factors of OS in HNSCC [26], similar to the predictors found in this study on HNnSCC.
Many non-medical factors need to be acknowledged as potentially influencing the prognosis of tumors, including socioeconomic factors, marital status, and personal habits. According to available data of the SEER database, this study includes marital status at diagnosis and the economic level of the residence area. Among these variables, whether the residence is in rural or urban areas does not affect the prognosis, but the economic level does. Groups with a lower median household income tend to have worse prognoses. Similar results have also been mentioned in various studies [27]. However, there is also a study analyzing both the income and education level, reporting that survival is related to education rather than household income [28]. As for the marital status at diagnosis, the widowed/divorced/separated subgroup always tends to have a potentially higher HR than the married and never-married subgroups. This phenomenon has also been reported by other studies dividing marital status into three or more subgroups, which is possibly because of a large alteration of living status; more psychological stress; and the start of worse bad habits, like alcohol abuse [29][30][31][32]. Furthermore, according to the systematic review by Fugmann et al., the stability of the marital status of cancer patients is different from the general population [33], especially for younger patients [34]. Therefore, investigating the effect of divorce, as well as being widowed or separated, might be more important for tumor patients. Our exploration of socioeconomic and marital factors is preliminary and needs further studies to provide us with more insights.
Prognosis predictors were then input for the development of machine learning prediction models. Machine learning is the process in which the computation imitates humans to recognize the patterns of data. Machine learning algorithms have been utilized in many fields of clinical medicine [35], like disease diagnosis [36], survival prediction [36], and molecule screening [37]. The performance of a machine learning model is constantly improving with more sample cases input. Since HNnSCC is rare compared to HNSCC, only the databases with a considerable amount of data can have sufficient sample sizes. Thus, the present study chose the data from the SEER database because most databases contain no rare tumor data, such as The Cancer Genome Atlas (TCGA). Machine learning and the SEER database are the appropriate combination to explore things about HNnSCC. The random forest model is also selected among machine learning algorithms in the present study because it was reported that random forest is the best among 179 classifiers, which is evaluated with 121 real-world datasets [38]. And, this is also consistent with our experience.
The ROC, calibration, and decision curves (Figures 2 and 3) used for evaluating the developed prediction models show that the performance is good. The AUCs range from 0.862 to 0.903. There are several nomograms of HNnSCC. For example, the nomogram for small-cell carcinoma has a 5-year-DSS AUC of 0.75 [10], and the results of nomograms for minor salivary gland carcinoma predicting 3-year or 5-year OS and DSS are 0.837-0.871 [13]. In comparison, the performance of the prediction model of this study is still good, and there are even tumors of various pathological types from various sites of the head and neck. It might be because of the merits of ensemble machine learning, rather than a linear fitting.
However, one problem of ensemble machine learning algorithms is that the models are not as intuitive as nomograms, thus, the importance of input variables in the models displayed in a heatmap (Figure 4). (The feature importance here is the Gini coefficients.) The higher Gini coefficients mean the enormous contribution of the variable to the prediction. The pathology (including both of type and grade), involved nodes levels, tumor size, and age are the most important variables in the prediction models, which means that they might not be the most correlated to survival, but provide the most information for the prediction. Distant metastasis, surgery, and chemotherapy are also important. In contrast, radiotherapy is found to be much less useful for prediction, also indicating the difference between HNSCC and HNnSCC. Although sites, sex, marital status at diagnosis, and household income are found to be prognosis predictors, the importance is less than the variables mentioned above.
Another problem of the machine learning prediction models is that if there is no interactive interface, they will be difficult to use in daily work compared to nomograms. Thus, the models were deployed on the server and can be used through the website developed (http://42.192.80.13:5001/ (accessed on 25 July 2023)), which expanded the application value of the present study. An example of the interface and result can be seen in Figure 5.
However, there are still limitations in the present study. Firstly, the data are from the SEER database, which means it is real-world population-based data, but retrospective, with unavoidable selection bias. Meanwhile, retrospective treatment variables also limit the usage of this prediction model to only the post-treatment situation because potential information is implied, such as performance status information implied in the surgery variable. Secondly, most variables of the SEER database were included in analyses of the present study, but there is no information about molecular or genetic test results, as well as habits, and more detailed information about socioeconomic and marital factors, which limited further exploration. Most cancer databases that contain genetic or expression data and smoking data, like TCGA, do not include rare tumors, like HNnSCC. If there are variables in the SEER database possible to screen out potential molecular or smokingrelated predictors, the prediction model performance must be much better. Thirdly, all data used in the present study are from a single database. Data from diverse sources are needed to validate the generalization ability further.
Considering the above limitations, which cannot be addressed in the short term, we reiterate the urgent need for larger and more comprehensive databases, coupled with increasingly affordable technologies, such as genetic and transcriptomic sequencing, to improve the performance of existing prediction models. Additionally, we should focus more on rare tumors, as common tumors are often found in multiple databases with significant overlap, while rare tumors are equally important for patients. Such efforts will be critical for improving the accuracy of prediction and, ultimately, enhancing patient care for individuals with HNnSCC. We will continue to monitor progress in this field and update the existing prediction model if more data become available.

Conclusions
The present study incorporated clinical, pathological, and therapeutic features comprehensively and established a clinically effective survival prediction model for post-treatment HNnSCC patients.