A Prospectively Validated Prognostic Model for Patients with Locally Advanced Squamous Cell Carcinoma of the Head and Neck Based on Radiomics of Computed Tomography Images

Simple Summary Patients that suffer from advanced head and neck cancer have a low average survival chance. Improving prognosis could improve this survival rate as it may help in clinical decision making. Radiomics features calculated from images of the tumour describe tumour size, shape, and pattern. These characteristics may be linked to patient survival, which is investigated in this paper. We combined radiomics features with other biomarkers of survival of 809 patients to make a prognosis before treatment. We then compared the predicted prognosis with the actual outcome to see how well our model performs. Our model was able to make three distinct risk groups of low-, medium-, and high-survival patients. With these findings, doctors may make a better judgement of treatment and follow-up per patient, which might improve clinical outcomes. Abstract Background: Locoregionally advanced head and neck squamous cell carcinoma (HNSCC) patients have high relapse and mortality rates. Imaging-based decision support may improve outcomes by optimising personalised treatment, and support patient risk stratification. We propose a multifactorial prognostic model including radiomics features to improve risk stratification for advanced HNSCC, compared to TNM eighth edition, the gold standard. Patient and methods: Data of 666 retrospective- and 143 prospective-stage III-IVA/B HNSCC patients were collected. A multivariable Cox proportional-hazards model was trained to predict overall survival (OS) using diagnostic CT-based radiomics features extracted from the primary tumour. Separate analyses were performed using TNM8, tumour volume, clinical and biological variables, and combinations thereof with radiomics features. Patient risk stratification in three groups was assessed through Kaplan–Meier (KM) curves. A log-rank test was performed for significance (p-value < 0.05). The prognostic accuracy was reported through the concordance index (CI). Results: A model combining an 11-feature radiomics signature, clinical and biological variables, TNM8, and volume could significantly stratify the validation cohort into three risk groups (p < 0∙01, CI of 0.79 as validation). Conclusion: A combination of radiomics features with other predictors can predict OS very accurately for advanced HNSCC patients and improves on the current gold standard of TNM8.


Introduction
Head and neck squamous cell carcinomas (HNSCC) are cancerous tumours that typically grow in the oral cavity (OC), larynx, and pharynx. In Europe, 140,000 new cases are diagnosed yearly leading to approximately 70,000 deaths [1]. Despite advances in treatment, 3-years overall survival (OS) for locoregionally advanced HNSCC remained 40-50% [2][3][4]. Management of HNSCC patients starts with a diagnostic workup of the tumour, lymph node metastases, and distant metastases (TNM) to stage the tumour. Furthermore, immunostaining determined p16 protein expression, acting as a surrogate marker of HPV infection, has been included as an important factor in the American Joint Committee on Cancer (AJCC) 8th edition for staging of oropharyngeal cancer, which introduced separate staging systems for p16-positive and p16-negative oropharyngeal carcinomas [5]. Besides the TNM stage, prognosis depends on clinical (e.g., patients' comorbidities, performance status) and biological (e.g., invasive growth or gene expression) factors, and for patients treated with surgery, on microscopic examination of the resection specimen [4]. RNA and DNA profiling have identified molecular subtypes of HNSCC with different prognoses [6]. Some of these subtypes may include primary tumours with high heterogeneity which may react differently to treatment [7]. Defining a robust and clinically viable method to determine these subtypes is therefore essential for the effective treatment of HNSCC patients.
Routine pre-treatment radiological imaging provides a source of non-invasively acquired information of the primary tumor that could be investigated for the ability to determine clinically relevant subtypes. Advanced image analysis methods such as radiomics allow for the analysis of radiographic medical images by extracting large amounts of socalled features using mathematical algorithms and finding correlations with biological and/or clinical outcomes using machine learning techniques. Previous studies have shown that radiomics in computed tomography (CT) imaging can improve the prediction of prognosis of HNSCC . Radiomics on CT HNSCC imaging has been used to predict HPV status [8,9], overall survival [10][11][12][13][14], progression-free survival [10,[12][13][14], local tumour control [8,12,[15][16][17][18][19][20][21], tumour grade [9,22], lymph node response [23,24], tumour invasiveness [9,25], xerostomia [26][27][28], tumour resectability [29], and classifying molecular subtypes [30,31]. While the survival studies show that radiomics on CT data can signifi-cantly stratify patients in multiple survival groups, performance expressed through Harrell's C index ranged widely, from 0.58 to 0.9. An explanation of these discrepancies is that radiomics studies are commonly limited in data, with patient numbers for HNSCC regularly using around 100-200 patients combined for training and validation. Furthermore, the data are usually collected from two centres-one for training and one for validation. To create radiomics models which have sufficient predictive power and that are generalisable across different patient populations, large datasets from multiple institutes are needed.
We hypothesise that the multicentric 'Big Data and Models for Personalised Head and Neck Cancer Decision Support' project (BD2Decide) [32,33] dataset provides the necessary breadth to create statistically significant, high-quality models that can add complementary information to other well-known but under-utilised clinical and biological factors [34][35][36]. In addition, we hypothesise that the international multicentric nature of the data will, compared to many contemporary HNSCC radiomics-based studies, give us the necessary variation in the dataset to generalise the model across different patient populations. Similar to the inclusion of HPV status to TNM8, we believe that combining these factors may improve the prediction of patient prognosis instead of using them independently. We also hypothesise that a multifactorial machine learning model, including radiomics features derived from the primary tumour, can outperform the current gold standard (TNM8) in stratifying locally advanced HNSCC patients into OS risk groups. This new signature of radiomics features was compared against an existing signature. Furthermore, mixed models containing TNM, tumour volume, radiomics features, clinical variables, and biological variables were developed and validated.

Patient Characteristics
Protocol details were registered on Open Science Framework (DOI number: 10.17605/OSF.IO/H4DFB). The study population was composed of locoregionally advanced HNSCC patients (TNM7 stage III-IVA/B (M0)) receiving curative treatment between 2008 and 2017, collected within the framework of the BD2Decide project (http://www.bd2decide.eu/, accessed on 13 May 2021, H2020-PHC30-689715, IRB P-number P0125, ClinicalTrials.gov Identifier: NCT02832102) [32,33]. The collected patient population was originally staged at diagnosis of the TNM7 staging system. During the BD2Decide project, these patients were re-staged to I-IVA/B (M0) using the newly developed TNM8 staging system. The ethical approval statement and an overview of the inclusion criteria can be found in Supplementary Materials. Patients' data were collected both retrospectively (diagnosis between 2008 and 2014) and prospectively (diagnosis between 2015 and 2017). The retrospective and the prospective datasets were assigned as the training and validation datasets, respectively. OS was established as the period between the primary diagnosis and the date of death or last follow-up, with at least three years of follow-up performed. Patients alive with less than 2-year follow-up were excluded and defined as 'lost to follow-up'. Median follow-up times were determined separately for training and validation datasets through the reverse Kaplan-Meier (KM) estimate [37]. The similarity in patient characteristics between cohorts was assessed through two-proportion z-tests to test whether there is a difference in a categorical variable, or unpaired two-sample t-tests to test whether there is a difference in a numerical variable. For the latter, the assumptions of the data having a normal distribution and possessing the same variance in both cohorts were tested through Shapiro-Wilk's test and f-test, respectively. The significance level was set to 5%.

CT Acquisition Parameters
CT images were acquired at each centre with scanners, acquisition protocols, and reconstruction protocols according to standard operating procedures (SOPs) at the respective centres for diagnostic imaging. All CT images were either diagnostic or radiotherapy treatment planning images of comparable diagnostic quality, all with an intravenous contrast injection protocol. All CT scans within the framework of the BD2Decide project had a 3 mm slice thickness or less. Any CT scan that had imaging artifacts in more than 50% of the slices with primary tumour mass present was excluded. For patients who received radiotherapy, the primary gross tumour volume (GTV) was delineated at each centre according to local delineation guidelines by experienced radiation oncologists. The GTV was defined as the visual extent of gross tumour volume, as described in the radiology report and, if needed, adapted based on the report of the physical examination. Figure 1 gives an example of a CT with the primary tumour delineated. For patients who did not receive radiation treatment, the primary tumour volume was delineated locally by or supervised by expert radiologists according to local delineation protocols. This delineation was conducted by a single person per centre, directly on the contrast-enhanced (CE)-CT. CE-CT has shown to have lower interobserver variability for HNSCC delineation, compared to just CT, or PET-CT [38]. Additionally, for all patients treated with radiotherapy from Maastro, VUmc, and the University of Brescia, all contours were delineated on CT in conjunction with PET/MRI, which has also been proven to greatly decrease interobserver variation [39,40]. All contours were additionally peer-reviewed by radiation oncologists based on diagnostic information. Lastly, all delineations were visually judged by a single observer in the BD2Decide consortium for deficiencies. Supplementary Materials Table S1 provides an overview of the treatment received per participating centre.

Feature Extraction
Radiomics features were obtained from the delineated primary tumour volume of the preprocessed images. A full list of software packages used in the present study is shown in Table S2 of the Supplementary Materials. Feature extraction was performed in python 3.6.10, with the package PyRadiomics version 2.2.0 [41]. To lessen the impact of heterogeneity in the imaging data caused by differences in scanners and imaging protocols, preprocessing of the images and postprocessing of the extracted features were performed. An overview of pre-and postprocessing techniques applied to the data has been described in Supplementary Materials. Both International Biomarker Standardisation Initiative (IBSI)-compliant [42,43] and a non-IBSI compliant feature were extracted. Features extracted through PyRadiomics contain a single first-order feature, first-order kurtosis, which differs from the IBSI definition. A description of the features is provided in Supplementary Materials. The PyRadiomics documentation [44] provides a complete overview of all radiomics features.

Feature Selection
Unsupervised and supervised feature selection was performed on the training dataset to reduce data dimensionality and the chance of overfitting on the training data. Highly correlated features were assumed to contain overlapping information about the outcome and are therefore considered redundant; thus, for each correlating feature pair, one was selected, and the other was removed. Through absolute pairwise Spearman rank correlation, highly correlating features (>0.85) were determined. The feature with the highest mean absolute correlation with the rest of the dataset was then excluded. Univariate feature selection was performed by fitting a univariate cox model for each individual feature. Afterwards, we selected features based on the individual feature's association with survival. This was performed by choosing features with a testing association p-value (Wald test) lower than the threshold of 0.05 [45]. A false discovery rate (FDR) adjustment was performed on the p-values to correct for multiple testing [46]. A 100-repeat 10-fold cross-validation was performed to determine the most prognostic features on average.

Radiomics Model
A multivariable Cox model was trained on the training dataset using the selected features. Afterwards, the model's prognostic performance was assessed through external validation on the validation dataset. This was performed according to the principles and methods described by Royston and Altman (2013) [47], described in Supplementary Materials. Model discrimination performance was determined through CI. A CI of 0.5 means the predictions are achieved completely randomly, while values near 1 indicate almost perfect discriminative performance. Risk-stratified KM curves were generated for each model, which allowed for visual comparison between models and provided the opportunity to determine how well the cohort could be stratified into risk groups. Three risk groups were determined using threshold values at the 33rd and 66th percentile of the calculated prognostic index (PI). Two log-rank tests were performed to determine the significance of the split of the low-vs. the medium-risk groups, and the medium-vs. the highrisk groups. Predicted survival curves for each risk group were determined. The individual survival curves were estimated using the PI of each patient, which were then averaged over the entire risk group. The observed survival curves and predicted survival curves aligning indicates that the model fits correctly to the data.

Staging, Volume, and Clinical Models
Risk stratification based on TNM8, primary tumour volume, and a model developed from clinical and biological features were compared to the radiomics model's results. The radiomics feature 'original_shape_VoxelVolume' was used as a surrogate for tumour volume. This feature was added to the list of selected features and used to create a separate model [48]. The clinical and biological model was built from a list of known predictors of survival in HNSCC, which can be found in Supplementary Materials. All features had less than 10% of values missing. For any missing values imputation was performed using the 'missForest' package in R [49]. This imputation method trains a random forest (RF) model on the existing data to predict the missing values. Imputation was performed separately for the training and validation datasets. Feature selection on the clinical and biological covariates was performed through univariate Cox modelling, selecting univariate significant covariates through chi-square test p-values after correcting for multiple testing (FDR) [46]. The significant features were added to the list of radiomics features and used to create separate models. In addition, a combined model using radiomics, tumour volume, and clinical/biological variables was created and validated.

Validation of Existing Radiomics Signatures
Aerts et al. reported on a radiomics signature to predict survival in lung cancer patients which they validated on HNSCC cohorts [50]. This signature was evaluated both on our validation and the full cohort (training and validation), and its performance was compared to the newly proposed signature created. After the necessary preprocessing steps, the four features used by Aerts to establish the signature were extracted from the primary tumour volume. The feature values were multiplied with the β coefficients reported in the article to calculate the linear predictor. To stratify the patients into low-and high-risk groups, the authors used a single cut-off value based on the linear predictor's median. We applied these cut-offs in order to determine two risk groups and compared these to risk stratification using the median of the linear predictor estimated by our novel models.

Radiomics Quality Score and TRIPOD
For quality assurance, the radiomics quality score (RQS) [51,52] was calculated and transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) [53] recommendations were followed. A description of these statements and the results can be found in Supplementary Materials Tables S5 and S6.

Clinical, Biological, and Imaging Characteristics
In total, 666 retrospective and 143 prospective patients were collected and analysed in this study. Table 1 provides an overview of the patient characteristics for both cohorts. The median follow-up of patients in the training and validation cohort was 63 (49-79 95% CI) and 32 (26-37 95% CI) months, respectively. Two-year survival in the training and validation cohort was 78% and 75%, respectively. A log-rank test between survival curves shows that the difference between cohorts is not significant (p = 0.29). KM plots of the cohorts are shown in Supplementary Materials Figure S1. As oropharyngeal carcinoma constituted a significant portion of the dataset (43%/n = 294 for training, 36% n = 51 for validation), we decided to build separate models for this group of patients (including both p16+ and p16−). A description of this model, along with the results, can be found in Supplementary Materials. Supplementary Materials Figure S2 shows an overview of the different parameters used for image acquisition and reconstruction in the training and validation datasets.

Model Results
We extracted 1198 radiomics features from the primary tumour volume on all CT images. After unsupervised feature selection, 204 features remained. In total, 11 features were selected by supervised selection as being the most predictive of OS in the training cohort. The first two features were kurtosis, a first-order statistics feature, and sphericity, a shape feature. The next four features are all LoG-filtered texture features consisting of GLSZM gray level non-uniformity, GLDM entropy, GLRLM run entropy, and GLDM low gray-level emphasis. Finally, five wavelet-filtered texture features were included: four differently wavelet-filtered GLSZM zone entropy features and GLRLM low gray-level-run emphasis. All selected features were IBSI compliant, except for the first-order statistics feature. Supplementary Materials Table S3 shows an overview of the feature names. The slope of the PI in validation was 1.35, and a log-rank test indicated this slope was not significantly different from a slope of 1 (p-value of 0.38). This indicates the model calibrates well, meaning the predicted and the expected outcome proportions for a certain testing population match. The joint test of all predictors with the offset of the PI gives a pvalue of 0.86, indicating there is no evidence of a lack of fit on the validation cohort.
Supplementary Materials Figure S3 depicts KM survival curves for the combined training and validation cohort after stratification in two risk groups (p < 0.01) using the Aerts et al. (2014) signature [50], with a CI of 0.66. For some patients, one or more of the required features failed to extract due to the small size of the volume. Therefore, the calculation of the signature was not possible in all available patients, resulting in 633 patients in the training cohort and 139 patients in the validation cohort. The performance of the signature in this study is similar to the reported validation performance on the lung dataset (CI of 0.65) but slightly lower than the performance on the two H&N datasets (both CI of 0.69). Figure 2 shows KM survival graphs of the validation cohort split using the previously created signature [50] and the radiomics-only signature from this study. While the CI values of the model performances are similar (0.66 and 0.67, respectively), the split and hazard ratios are significantly better using the newly created signature (p = 0.22 vs. p < 0.01).   Figure S4. However, the shape of the KM curve shows tumour volume is very poor in discerning three distinct risk groups. It is significantly lower than stratification based on TNM8 (CI of 0.74), shown in Supplementary Materials Figure S5. An overview of the clinical and biological features selected is shown in Table 2. The clinical features selected through univariate feature selection were TNM8 (higher stage has worse prognosis), age at diagnosis (higher age has worse prognosis), ACE-27 comorbidity score (higher score has worse prognosis), smoking pack-years (higher pack-years has worse prognosis), and alcohol consumption at the time of diagnosis (current has worst prognosis), and the biological features were p16-status (p16-negative has worse prognosis) and clinical Hb level at baseline (lower Hb level has worse prognosis). Supplementary Materials Figure S6 shows the KM curve stratified based on these clinical and biological features, with a CI of 0.73. Figure 4 shows KM survival curves of the validation cohort after stratification based on tumour volume, the selected clinical and biological parameters, and the selected radiomics features, with a CI of 0.71 and 0.79 in training and validation, respectively. The p-value of the log-rank test of the low and medium, and medium and high split were both <0.01 in both training and validation.   For the oropharynx patient cohort, eight features were selected as being the most predictive of OS, consisting of one first-order statistics feature, two shape features, three wavelet-filtered texture features, and two LoG-filtered texture features. All selected features were IBSI compliant. Supplementary Materials Table S4 shows an overview of the features. The slope of the PI in the validation was 3.01. A log-rank test indicates with certainty the slope in the validation is larger than unity (p-value of 0.04). The p-value for the joint test of all predictors with the PI offset is 0.12. This indicates that there is no proof of a lack of fit on the validation cohort. Kaplan-Meier survival curves of the prospective oropharynx cohort split based on radiomics features are shown in Supplementary Materials Figure S7.
A full overview of the different combinations of models, with discrimination performance and hazard ratios for each model, is provided in Table 3. In addition, Figure 5 provides an overview of the CI indices of the validation results. Table 3. Performance overview of all trained and/or validated models, showing Harrell's CI and HR values for each model. The left side shows the models for the full patient cohort, both training (n = 666) and validation (n = 143), the right the oropharynx patient cohort, both training (n = 294) and validation (n = 51). * indicates no HR could be calculated, as the lowrisk group did not have any events recorded.  Figure 5. Bar plot of the various models validated on the validation cohort (n = 143). The y-axis indicates CI value, while the coloured bars above the bar show significant differences between models, with an indent meaning the model is significantly different, and no indent meaning no significant difference was found.
From Table 3 and Figure 5, it can be observed that in the prospective cohort radiomics alone did not perform better than TNM8 (CI of 0.67 and 0.74, respectively, p < 0.01). Combining TNM8 and radiomics resulted in higher performance than both separately, with a CI of 0.77. In combination with both clinical parameters and tumour volume, the highest discrimination performance was found (CI of 0.79). Similarly, oropharynx radiomics did not perform better than TNM8 (CI of 0.82 vs. 0.86, p < 0.01), but when combining both radiomics and TNM8, the highest performance in the validation cohort was achieved (CI of 0.90).

Discussion
For advanced tumours such as those investigated in this study, being able to discern groups of poor versus good performing patients is key for personalised decision making. In this international, multicentre study, we created a multifactorial prediction model, including radiomics features derived from the primary tumour volume that can significantly stratify advanced HNSCC patients in good, average, and poor prognostic groups, with a CI of 0.79 as validation on a prospective cohort. These groups could be used in clinical decision making and for selecting patients for (de-)escalation trials and/or adjuvant treatment. While radiomics alone was not able to improve on TNM8, adding radiomics features to a model including TNM8, clinical, and biological variables improved the prognostic performance, significantly increasing CI from 0.73 to 0.79. We can therefore recommend adding these variables to the current clinical implementation of TNM8.
These results coincide with other works reporting on the complementary value of radiomics in predictive modelling for HNSCC [10,54]. The performance of the model based solely on radiomics, with a CI of 0.67, matches those of similar studies which investigate OS [10][11][12][13][14]. However, compared to these studies, this study investigates over 800 patients from multiple centres, whose data were partially collected prospectively. The largest discrepancy is with the study by Cozzi et al. (2019), which found a high CI of 0.90 in validation [12]. While their methodology is sound, as the writers explain themselves the number of patients (n = 110) from a single centre does make these results less significant. Haider et al. (2020) had the largest cohort of 306 patients with a CI of 0.58 in validation [14]. This result was found on an external cohort, and while lower, it is more in line with the result we found in external validation.
In total, 11 radiomics features were selected as being univariately most predictive of OS. The first two selected features were kurtosis, a first-order statistics feature that measures the 'peakness' of the distribution of pixel intensity values, and sphericity, a shape feature that measures the likeness of the ROI to a sphere. Sphericity being selected implies fewer spherical tumours may have a worse prognosis. The next four features are all LoG-filtered texture features consisting of GLSZM gray level non-uniformity, a feature which measures the variability of gray-level intensity values, GLDM entropy, and GLRLM run entropy, which both measure the heterogeneity in texture patterns, and GLDM low gray-level emphasis, which measures the concentration of low-intensity values. Finally, five wavelet-filtered texture features were included: four differently waveletfiltered GLSZM zone entropy features, which measure the heterogeneity in texture patterns, and GLRLM low gray-level-run emphasis, which measures the concentration of low-intensity values. Most of these features are linked to heterogeneity, reinforcing the theory that tumour heterogeneity correlates with a worse prognosis [55,56].
For most tested models, we found a higher validation accuracy than training accuracy. This may be caused by the relatively smaller size of the validation dataset, which means the result is more prone to variance, which is reflected in the larger confidence intervals, especially for the smaller oropharyngeal analysis. Another contributing factor could be that the training dataset contains relatively more 'hard' cases than the validation dataset. In this paper, we chose to validate on a prospectively collected dataset, which is for data splitting purposes an arbitrary reason. In a more balanced dataset with more similar patient datasets, the discrepancy between training and validation may be lower.
Instead of using radiotherapy planning images only, which is conventional for radiomics studies, this study used diagnostic CT images as well, which are made routinely for any patient showing a locally advanced HNSCC. From these images radiomics features can be extracted in a semi-automatic fashion, making clinical application easy. In addition, the combined model was made using simple variables that are routinely determined in a clinical setting for every patient (TNM8 stage, ACE-27 comorbidity status, smoking, and alcohol habits). As a result, it would be relatively easy to implement the presented models in a clinical environment. For the next step, the created model could be tested in a clinical trial. However, as differences in scanners, scan settings, and acquisition settings have proven to significantly affect feature reproducibility, a prospective study where these variables are controlled may be required to further validate model performance.
Radiomics performs an estimation of the tumour volume using a 3D segmentation, as opposed to conventional methods of measuring tumour volume to predict survival. This single feature was found to be significantly predictive of OS, albeit with lower performance, compared to TNM8 or the model based on radiomics features but was not chosen in the multivariable model. The main reason for this is the interaction with other features in the correlation dimensionality reduction step. Volume has a high correlation with other features, mostly shape features, and is therefore removed from the feature dataset before univariate selection is performed, revealing a shortcoming of this feature reduction step. However, the information provided by this feature should be retained in the remaining uncorrelated features.
The radiomics model in this study shows better performance in stratifying patients in risk groups in the validation dataset when compared to the previously created and validated signature [50]. One large discrepancy between these models is the risk stratification: the previously developed signature was created with two risk groups instead of three. Most importantly, it was built on lung cancer. The difference in performance on different tumour sites demonstrates that prognostic models should be developed on specific tumour sites and stages, and with relevant clinical risk groups in mind.
While the amount of data used in this study was higher than most published radiomics studies, this was partially achieved by pooling data from different HNSCC sites. Separating these regions resulted in very small datasets in either or both training and validation sets. While we had sufficient data to train an oropharynx model and found a relatively high performance of the model using radiomics features of 0.82 CI in validation, the number of patients, and particularly the number of events, of the validation dataset was relatively limited. Collecting more data from an individual tumour site would most likely result in more representative models. In addition, the patients in this study received different treatments. This significantly affects survival chance and is therefore a major limitation. Similar to tumour region, separate models according to treatment would be preferred. However, treatment is heavily linked to the region of the tumour, as, for example, the majority of surgeries were performed for oral cavity patients.
Compared to extracting radiomics features from just the primary tumour volume, TNM8 staging takes information from the primary tumour (T-stage), involvement in lymph nodes (N-stage), and the extent of metastisation (M-stage) into consideration. In addition, depending on the tumour region, additional information such as p16-status as a surrogate for HPV involvement, depth of invasion in surrounding tissues, and presence of extranodal extension are important. The addition of radiomics features derived from lymph node metastases can potentially improve the results. This would require a multifactorial model with a binary condition for the lymph node stage and would only incorporate features of those patients who have lymph node metastases.
Imaging artefacts caused by dental implants may have affected the performance of the radiomics model. The artefacts make segmentation difficult but also affect the radiomics features extracted from these images. While there was a limit on the number of artefacts allowed on images during patient selection, methods to reduce the artefacts may be considered for future studies. In addition, variability caused by the manual segmentation of tumours by different experts at each institute may have also affected model performance. Previous research has shown that inter-and intraobserver variability can possibly cause large differences in delineated volumes [57]. For shape and size radiomics features, this can cause a large decrease in their use and may affect other features to a lesser degree. The repeatability of deep-learning-based automatic segmentation methods will be able to negate interobserver variabilities in the future [58].
To compensate for interobserver variability in the current project, each centre performed delineations either directly by, or under the supervision of, expert radiologists or radiation oncologists. Additionally, although delineations were performed according to local protocols, European guidelines are largely aligned, limiting the interobserver effects on the delineated structures. Conversely, in a clinical application of the proposed model at different institutes, interobserver variabilities will be an inevitability. The discriminative performance the model has shown despite these issues strengthens the potential of application in a clinical setting.

Conclusions
A multifactorial prognostic model for stage III-IVB HNSCC (TNM7th edition) based on simple variables available for every patient and including CT radiomics features is able to very accurately predict OS and to significantly discern different risk groups. The multifactorial model was found to have higher predictive performance than the current gold standard of TNM8. This could be useful in treatment (de-)escalation trials and clinical decision support.

Institutional Review Board Statement:
The study procedures of the BD2Decide project were approved in accordance with the Declaration of Helsinki, the European and local ethical conventions and legal aspects, as well as the European General Data Protection Regulation. The management and exchange of data, specimens, and imaging information were regulated between the partners through data and material transfer agreements and standard operating procedures. Central data, imaging, and material were anonymized by the centers prior to aggregation, and data were stored in a secured and locked information technology surrounding according to General Data Protection Regulation (GDPR).

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to an agreement within the BD2Decide consortium that the data will be closed access for a defined period to allow the participants of the consortium to utilise the data for publication purposes first. , and KWF project number 12079/2018-2. Philippe Lambin reports, within and outside the submitted work, grants/sponsored research agreements from Radiomics SA, ptTheragnostic/DNAmito, Health Innovation Ventures. He received an advisor/presenter fee and/or reimbursement of travel costs/consultancy fee and/or in kind manpower contribution from Radiomics SA, BHV, Merck, Varian, Elekta, ptTheragnostic, BMS and Convert pharmaceuticals. Lambin has minority shares in the company Radiomics SA, Convert pharmaceuticals, Comunicare Solutions and LivingMed Biotech, he is co-inventor of two issued patents with royalties on radiomics (PCT/NL2014/050248, PCT/NL2014/050728) licensed to Radiomics SA and one issued patent on mtDNA (PCT/EP2014/059089) licensed to ptTheragnostic/DNAmito, one non issued patent on LSRT (PCT/ P126537PC00) licensed to Varian Medical, three non-patented invention (softwares) licensed to ptTheragnostic/DNAmito, Radiomics SA and Health Innovation Ventures and three non-issues, non licensed patents on Deep & handcrafted Radiomics (US P125078US00, PCT/NL/2020/050794, n° N2028271). He confirms that none of the above entities or funding was involved in the preparation of this paper. Lisa Licitra further acknowledges grant/research support from AstraZeneca, BMS, Boehringer Ingelheim, Celgene International, Debiopharm International SA, Eisai, Exelixis Inc, Hoffmann-La Roche Ltd, IRX Therapeutics inc, Medpace Inc, Merck-Serono, MSD, Novartis, Pfizer, Roche, honoraria/consultation fees from AstraZeneca, Bayer, BMS, Eisai, MSD, Merck-Serono, Boehringer Ingelheim, Novartis, Roche, Debiopharm International SA, Sobi, Ipsen, Incyte Biosciences Italy srl, Doxa Pharma, Amgen, Nanobiotics Sa, and GSK, and fees for public speaking/teaching from AccMed, Medical Science Foundation G. Lorenzini, Associazione Sinapsi, Think 2 IT, Aiom Servizi, Prime Oncology, WMA Congress Education, Fasi, DueCi promotion Srl, MI&T, Net Congress & Education, PRMA Consulting, Kura Oncology, Health & Life srl, Immuno-Oncology Hub. Henry C. Woodruff has (minority) shares in the company Oncoradiomics. Ralph T.H. Leijenaar is a salaried employee of the company Oncoradiomics, has shares in the company Oncoradiomics and is co-inventor of an issued patent with royalties on radiomics (PCT/NL2014/050728) licensed to Oncoradiomics. C. René Leemans is an advisory board member at Merk & Co., Inc. and Rakuten Medical, and has received a research grant from Bristol Myers-Squibb. RH Brakenhoff PhD, received research grants from GenMab, Bristol Myers-Squibb, and InteRNA BV, and has collaborated with MSD.

Conflicts of Interest:
The authors declare no conflict of interest.