From Head and Neck Tumour and Lymph Node Segmentation to Survival Prediction on PET/CT: An End-to-End Framework Featuring Uncertainty, Fairness, and Multi-Region Multi-Modal Radiomics

Simple Summary Automatic delineation and detection of the primary tumour and lymph nodes using PET and CT in head and neck cancer can be helpful for diagnosis, prognosis, and monitoring the disease. However, these algorithms can suffer from silent failures, limiting their trust. In this research work, we estimate the confidence of the predicted segmentation and use it to reduce the number of false predictions. We also investigate the prognostic potential of quantitative image features extracted from the primary tumour and lymph nodes. We combine these features with clinical characteristics to predict recurrence-free survival and stratify patients into three groups of low, medium, and high-risk patients. We gain insight into the decision-making process of the model using explainability methods and correlate it to clinical knowledge. We also evaluate if the models are impacted by different biases. Our proposed framework can aid clinicians in the detection of head and neck cancer and patient risk stratification. Abstract Automatic delineation and detection of the primary tumour (GTVp) and lymph nodes (GTVn) using PET and CT in head and neck cancer and recurrence-free survival prediction can be useful for diagnosis and patient risk stratification. We used data from nine different centres, with 524 and 359 cases used for training and testing, respectively. We utilised posterior sampling of the weight space in the proposed segmentation model to estimate the uncertainty for false positive reduction. We explored the prognostic potential of radiomics features extracted from the predicted GTVp and GTVn in PET and CT for recurrence-free survival prediction and used SHAP analysis for explainability. We evaluated the bias of models with respect to age, gender, chemotherapy, HPV status, and lesion size. We achieved an aggregate Dice score of 0.774 and 0.760 on the test set for GTVp and GTVn, respectively. We observed a per image false positive reduction of 19.5% and 7.14% using the uncertainty threshold for GTVp and GTVn, respectively. Radiomics features extracted from GTVn in PET and from both GTVp and GTVn in CT are the most prognostic, and our model achieves a C-index of 0.672 on the test set. Our framework incorporates uncertainty estimation, fairness, and explainability, demonstrating the potential for accurate detection and risk stratification.


Introduction
Head and neck (H&N) cancer is the seventh most common cancer worldwide with more than 0.66 million new cases and 0.3 million deaths every year [1]. The incidence rate of H&N cancer is increasing worldwide, and the five-year survival rate for H&N cancer varies from 85.1% in localized cancer cases to 40.1% in distant cancer cases [2,3]. Research has shown a significant association between surgical delay and increased mortality rates in various populations. Patients with stage I or II cancer can achieve at least a 70% five-year survival rate if they receive appropriate treatment [4]. Early diagnosis and staging are important for improving the prognosis of H&N cancer. Computed tomography (CT) and 18F-fluorodeoxyglucose (FDG) positron-emission tomography (PET) are the most commonly used imaging modalities for the initial diagnosis, staging, and followup of H&N cancers, as they provide synergistic information related to metabolism and morphology [5][6][7]. With an increasing population, especially in developing countries, an increasing number of images acquired, and an increasing incidence rate for H&N cancer, there is a need to develop automatic segmentation and detection tools for H&N cancer to aid clinicians and reduce their workload. Automatic segmentation models cannot only help clinicians in tumour detection and delineation but also reduce inter-and intra-observer variability [8].
Convolutional neural networks (CNNs), a type of deep learning algorithm adapted to accept images as inputs, are becoming increasingly popular for medical image segmentation tasks. The U-Net architecture is one of the widely used CNNs for medical image segmentation [9,10]. In recent years, U-Net has been used as the backbone for head and neck segmentation [11,12]. No-new-U-Net (nnU-Net) has shown state-of-the-art performance in many medical image segmentation tasks [13]. The nnU-Net has streamlined many design choices related to pre-processing, network architecture, and hyperparameter selection. These design choices in nnU-Net are configured automatically based on the model hyperparameters. However, the success of nnU-Net and other segmentation methods in achieving high scores on segmentation tasks does not ensure the reliability of the segmentation results, as they can fail silently without any notice [14]. This problem is particularly acute when a lot of heterogeneity is present in the medical data. Incorporating an uncertainty estimation can help quantify the reliability and robustness of the segmentation outcomes, as there is a positive correlation between uncertainty and segmentation error [15]. Therefore, an uncertainty estimation is critical for the clinical deployment of the segmentation models [16].
Radiomics is a quantitative image analysis technique that can be divided into handcrafted radiomics (HCR) and deep learning (DL). Handcrafted radiomics extracts imaging features from radiographic medical images and correlates them with clinical and biological information using machine learning methods [17][18][19]. Deep learning involves training artificial neural networks to learn representative features for outcome prediction from amounts of data, and deep learning-based models have been developed to predict progression-free survival for head and neck squamous cell carcinoma patients using clinical and PET/CT imaging data [20][21][22]. Several studies have shown that radiomics in CT has the potential to improve the prediction of the prognosis of H&N cancer [23][24][25]. Some studies have also investigated the use of radiomics in both CT and PET for survival analysis for H&N cancer [26,27]. While these studies investigate the prognostic potential of CT and PET radiomics based on primary tumour delineation, there is a need to quantify the prognostic potential of radiomics features extracted from different regions of interest (ROI) such as the primary tumour and lymph nodes. Furthermore, it is also necessary to evaluate the predictive potential of radiomics features from different imaging modalities of CT and PET. The radiomics features can be extracted from the primary tumour and lymph nodes independently from both PET and CT images.
Handcrafted radiomics and deep learning has the potential to revolutionize the field of radiology. However, AI algorithms can lead to biased tools that replicate and amplify health inequalities between different cohorts, such as gender, age, and income [28,29]. The fairness principle of FUTURE-AI emphasises that AI algorithms should be impartial to individual and group differences, and they should demonstrate similar performance irrespective of the disparities [30]. An investigation into the fairness of the algorithm for abnormalities detection in chest X-rays revealed that the algorithm shows biased performance with respect to sex, age, and ethnicity [31]. A study on cardiac segmentation performance with respect to fairness revealed that a dataset which is balanced with respect to gender but imbalanced with respect to ethnicity does not perform consistently with respect to ethnicity [32]. Therefore, it is important to evaluate the performance of the algorithm with respect to possible biases to identify if there are any significant differences in the performance of the algorithm.
In this paper, we propose an end-to-end framework for survival analysis based on the automatic detection and segmentation of tumours and lymph nodes as well as uncertainty estimation. The key contributions of this work are as follows:

•
We evaluated a 3D segmentation framework for primary tumour and lymph node segmentation.

•
We implemented a method for uncertainty estimation to calculate the model confidence for the primary tumour and lymph nodes segmentation to minimise the risk of the model failing silently. We applied the uncertainty score for the false positive reduction in lymph nodes and tumours.

•
We extracted handcrafted radiomics features both from the primary tumour and lymph nodes, separately from CT and PET images, and investigated their prognostic potential. We explored different combinations of these regions of interest in these two modalities to guide future research.

•
We evaluated the performances of the segmentation model and the radiomics model for fairness with respect to relevant clinical characteristics such as age, gender, HPV status, and chemotherapy status, as well as lesion size.

Dataset
Our work is based on the HECKTOR 2022 challenge dataset (https://hecktor.grandchallenge.org/Timeline/, accessed on 20 August 2022), which includes 883 H&N cancer patients from 9 different centres [33,34]. The training set consisted of 524 patients acquired from 7 centres, and the test set consisted of 359 patients from 3 centres. All patients have histologically proven oropharyngeal H&N cancer and have been treated with radiotherapy and/or chemotherapy. For all patients, PET/CT scans were acquired, and some patient information such as institution, age, sex, weight, tobacco, alcohol use, performance status, human papillomavirus (HPV) status, and treatment (radiation only or adjuvant chemo and/or surgery) were collected.
For the segmentation task, the annotation of primary gross tumour volume (GTVt) and nodal gross tumour volume (GTVn) of all 524 patients are provided as ground-truth masks in the training dataset. For the survival prediction task, recurrence-free survival (RFS), including time-to-event in days and censoring, is provided as ground truth labels of 489 patients in the training set. Due to the availability of data, 359 patients and 339 patients in the test set were used for the segmentation task and the survival prediction task, respectively.

Segmentation
The multicentric data contained CT and PET images with varying resolutions. The median resolution of CT images in the training dataset was 0.98 × 0.98 ×3.27 mm 3 . We resampled the PET and CT to a common resolution of 1 × 1 × 3 mm 3 . The CT intensity values were clipped at 0.5 and 99.5 percentiles. We applied z-score normalization and min-max normalization for CT and PET images respectively in accordance with nnUNet guidelines [13]. We applied Otsu thresholding to the PET scan and found the first four axial slices containing the brain region starting from the top. We found the centre I c−axial of the thresholded region in the axial plane. We cropped a region of 260 × 260 × 104 voxels (or 260 mm × 260 mm × 312 mm) centred around I c−axial , starting from the top of the detection brain region. This cropping step reduced the image size, which decreased the computation cost and allowed the model to focus on the relevant region for head and neck cancer detection. The size of 260 × 260 × 104 voxels was selected after ensuring that all the ground truth tumours and lymph nodes were encapsulated within this region and no loss of information had occurred. Figure 1 summarizes all the components present in our proposed framework. The proposed segmentation model has been shown in the first step. We modified the 3D nnU-Net [35] to include residual skip connections [36], squeeze-and-excitation channel-wise attention mechanisms (SE) at each layer [37], and grid attention gates (GA) at each skip connection [38]. The SE layer aids in learning useful features by fusing spatial and channelwise features. Grid attention gates can aid in reducing the number of false positives by allowing the model to focus on the relevant parts of the image. The patch size for training was set at 192× 192 × 64 voxels. Random rotation, random scaling, random elastic deformation, mirroring, the addition of gaussian noise, and gamma correction were applied to each patch for augmentation. Multi-class Dice loss [39] and cross-entropy loss were used to train the model with deep supervision to allow the loss to be calculated at each stage of the decoder. We trained the model for 1200 epochs to allow for convergence and used a stochastic gradient descent (SGD) optimizer for training with a learning rate of 0.01 and a momentum of 0.9. We performed fivefold cross-validation on the training dataset and use the ensemble of the five models from the fivefold for prediction on the test set. median resolution of CT images in the training dataset was 0.98 × 0.98 ×3.27 mm . We resampled the PET and CT to a common resolution of 1 × 1 × 3 mm . The CT intensity values were clipped at 0.5 and 99.5 percentiles. We applied z-score normalization and min-max normalization for CT and PET images respectively in accordance with nnUNet guidelines [13]. We applied Otsu thresholding to the PET scan and found the first four axial slices containing the brain region starting from the top. We found the centre of the thresholded region in the axial plane. We cropped a region of 260 × 260 × 104 voxels (or 260 mm × 260 mm × 312 mm) centred around , starting from the top of the detection brain region. This cropping step reduced the image size, which decreased the computation cost and allowed the model to focus on the relevant region for head and neck cancer detection. The size of 260 × 260 × 104 voxels was selected after ensuring that all the ground truth tumours and lymph nodes were encapsulated within this region and no loss of information had occurred. Figure 1 summarizes all the components present in our proposed framework. The proposed segmentation model has been shown in the first step. We modified the 3D nnU-Net [35] to include residual skip connections [36], squeeze-and-excitation channel-wise attention mechanisms (SE) at each layer [37], and grid attention gates (GA) at each skip connection [38]. The SE layer aids in learning useful features by fusing spatial and channel-wise features. Grid attention gates can aid in reducing the number of false positives by allowing the model to focus on the relevant parts of the image. The patch size for training was set at 192× 192 × 64 voxels. Random rotation, random scaling, random elastic deformation, mirroring, the addition of gaussian noise, and gamma correction were applied to each patch for augmentation. Multi-class Dice loss [39] and cross-entropy loss were used to train the model with deep supervision to allow the loss to be calculated at each stage of the decoder. We trained the model for 1200 epochs to allow for convergence and used a stochastic gradient descent (SGD) optimizer for training with a learning rate of 0.01 and a momentum of 0.9. We performed fivefold cross-validation on the training dataset and use the ensemble of the five models from the fivefold for prediction on the test set. Figure 1. End-to-end primary tumour and lymph nodes segmentation and survival prediction framework with segmentation uncertainty estimation, exploration of the prognostic potential of Figure 1. End-to-end primary tumour and lymph nodes segmentation and survival prediction framework with segmentation uncertainty estimation, exploration of the prognostic potential of lymph nodes, tumour regions of interest for radiomics feature extraction for survival prediction, and fairness evaluation.

Uncertainty Estimation
We incorporated uncertainty estimation in the segmentation model to avoid silent failures and estimated the model's confidence for the predicted segmentation mask. We employed posterior sampling of the weight space to estimate uncertainty by saving posterior models at appropriate moments during stochastic gradient descent (SGD) training as proposed in [15]. When the number of epochs increases, the polynomial learning rate results in model convergence as the learning rate diminishes to zero. When trained for long , where x i represents the training images and y i represents the corresponding segmentation masks.
For a test image x and predicted segmentation mask y, the prediction can be given by: p(y|x, w i ) w i represents the saved checkpoint weights when the SGD training stabilises and γ training epochs have elapsed. The predicted posterior p(y|x, D) is the average of the prediction of N models with weights w i where i = 1 to N.
We utilised the multi-model posterior weight sampling around multiple local optima as proposed in [15] to make use of the diversity of weight samples. We used a cyclic learning rate for exploring multiple local optima. The total training budget T max is set at 1200 to allow for convergence within each training cycle. T max is divided into N cycles = 3. Each cycle consist consists of T cycle = T max N cycles = 400 epochs. We kept the gradient-step constant after γ = 80% of the training budget in the cycle T cycle has elapsed. At any point in the cycle, the learning rate follows the following equation: represented the number of epochs elapsed within the cycle. At t mod = 0, we set a high learning rate of 0.1 for one epoch at the start of the cycle to get out of the local optimum. For t mod > 0, we employed polynomial learning rate decay within each cycle to reach model convergence. After 320 epochs within each cycle, the gradient step is constant, and we randomly save 10 checkpoints per cycle. 30 checkpoints are saved from 3 different cycles, constituting the multi-model posterior samples in the weight space. We quantified the uncertainty by calculating the variance of the predictions I uncertain made by 30 different models. High values of variance correlate with high uncertainty and demonstrate that the model is unsure about the prediction. Furthermore, the correlation of false positives with high uncertainty can be exploited for reducing false positives per image. The uncertainty density t uncertain of the lesion l i is calculated as follows: I uncertain (x, y, z) represent the uncertainty value of a pixel at the coordinate (x, y, z) in the image and l i (x, y, z) represented a binary lesion mask. We set t uncertain at various thresholds to reduce false positives and investigated the impact on both the false positives per image and the sensitivity of detecting tumour and lymph nodes.

Handcrafted Radiomics
The segmentation masks predicted by the model were used for extracting radiomics features. A total of 107 radiomics features were extracted using PyRadiomics including first-order features, shape features, and texture features (Appendix A, Table A1). These radiomics features were extracted from five different ROIs from CT and PET images separately. The bin width for intensity discretisation was set at 25 and 0.5 for CT images and PET images, respectively, to ensure that the bin count is greater than 30 for reproducibility and better performance [40]. All the radiomics features were extracted from 3D region of interests (ROI). CT_all and PET_all represented radiomics features extracted from combined primary tumour and lymph node regions from CT and PET images respectively. CT_only_tumour and PET_only_tumour represented radiomics features extracted from the tumour region only from CT and PET images respectively. The segmentation model can predict multiple tumours and lymph nodes. CT_largest_tumour and PET_largest_tumour represented radiomics features extracted from the largest tumour region from CT and PET images respectively. CT_only_lymph and PET_only_lymph represented radiomics features extracted from the lymph nodes region only from CT and PET images, respectively. CT_largest_lymph and PET_largest_lymph represented radiomics features extracted from the largest lymph node region from CT and PET images, respectively. The radiomics features also included 6 volume-related features i.e., total tumour volume, largest tumour volume, total lymph nodes volume, largest lymph node volume, the total number of lymph nodes and the total number of tumours. All 9 available clinical features, namely, gender, age, weight, chemotherapy, tobacco, alcohol, surgery, HPV status, and performance status, were included as clinical features.
Feature selection is performed in two steps. Univariate analysis is performed to select the top 10 features with the highest C-index in a Cox proportional hazards model in the first step. These features are then aggregated one by one in the order of C-index to find the best combination. A model based on the XGBoost classifier with a regression tree base learner is trained for survival prediction. The ten sets of radiomics features from CT and PET images along with volume and clinical features are explored individually for survival prediction. Feature selection is performed to evaluate the performance of radiomics features extracted from a specific modality and region of interest. All the selected features are aggregated along with clinical features, and feature selection is performed again to form the radiomics_clinical model. Fivefold cross-validation is performed on the training set for hyperparameter tuning and evaluation. The final radiomics_clinical model is tested on the unseen external test dataset. Shapley additive explanations (SHAP) is a post-hoc interpretability method that helps in measuring the importance of each individual feature on the model's prediction in terms of SHAP values. SHAP global summary plots were used to study the impact of features in the radiomics_clinical model.

Fairness
The performance of artificial intelligence algorithms, in particular deep neural networks, is highly correlated with the quality and distribution of the training data [41]. For equitable AI, we need to ensure that the performance of AI algorithms remained the same when they are applied to various groups of individuals with varying characteristics [30]. The problem of equitable and fair AI is aggravated in the case of unbalanced datasets. We evaluated the performance of the proposed segmentation model and the ra-diomics_clinical model with respect to age, gender, chemotherapy, and HPV status during fivefold cross-validation.
2.6. Evaluation Metrics 2.6.1. Dice Coefficient Dice coefficient (DSC) measures the spatial overlap between the ground truth mask and the predicted mask. The formula for DSC is as follows: where I refers to the ground truth mask, andÎ refers to the predicted mask.

Aggregated Dice
Aggregated Dice (DSC agg ) is derived from the aggregated Jaccard index. All the intersections and unions between the ground tumour volumes (GTVs) and their respective predicted volumes are accumulated across all images [33]. The aggregated intersection is divided by the aggregated union. The formula is as follows: where I i,k is the ground truth for voxel k and image i, andÎ i,k is the prediction. DSC agg of the primary tumour is denoted by DSC agg GTVp, and DSC agg of lymph nodes is denoted by DSC agg GTVn.

Sensitivity
The threshold of Dice coefficient DSC for detection is set at 0.1. If DSC for a lesion was less than 0.1, it was not detected. This was in line with previous studies that considered the non-overlapping nature of 3D lesions [42][43][44]. The ground truth lesions that have DSC > 0.1 with predicted lesions are counted as true positives (TP), and the ground truth lesions with DSC < 0.1 are counted as false negatives (FN). The sensitivity is given by: The C-index generalises the area under the receiver operator characteristic (ROC) curve by quantifying the model's ability to provide a good split between the survival curves. C-index is based on the computation of individual patient risk scores while accounting for censored data. This statistical tool provides a global evaluation of the model's discrimination power.

Statistical Tests
The Mann-Whitney test and Chi-squared test were used to compare the variables between the training and test set. The Mann-Whitney test was used to check for statistically significant differences in Dice score and sensitivity between each subgroup and the total reported statistics. A two-sided permutation test was used to check for statistical significance for C-index for survival prediction and aggregated Dice score. For survival prediction, three risk groups were determined using threshold values at the 33rd and 66th percentile of the calculated risk score. The survival curves were generated using the Kaplan-Meier method. Two log-rank tests were performed to determine the significance of the split of the low-vs. the medium-risk groups, and the medium-vs. the high-risk groups.

Patient Characteristics
The clinical characteristics of both the training and test datasets are shown in Table 1. In the training dataset, the gender ratio was 95/429 (female/male), and the median age was 61.0 (IQR 54.0-67.0) years. In the test dataset, the gender ratio was 63/296 (female/male), and the median age was 59.0 (IQR 53.2-66.0) years. There were no significant differences in gender, weight, and chemotherapy between the training and test datasets (all p > 0.05). There is a significant difference in age between the training and test datasets (p = 0.024). We evaluated the influence of age on performance by dividing age into four equal quartiles: 0 to 50, 50 to 60, 60 to 70, and more than 70 years. Similarly, we also evaluated the performance of the segmentation with respect to lesion size by dividing the size into four equal quartiles: 0 to 4.03 cm 3 (small), 4.03 to 7.70 cm 3 (medium), 7.70 to 17.3 cm 3 (large), and over 17.3 cm 3 (extra large).

Segmentation
The segmentation results are reported in Table 2. For fivefold cross-validation, the model achieved a mean sensitivity of 0.964 for tumours and 0.878 for lymph nodes, a mean Dice score of 0.725 for tumours and 0.658 for lymph nodes, and a mean DSC agg GTVp of 0.808 and a mean DSC agg GTVn of 0.780. On the unseen external dataset, the segmentation model achieved DSC agg GTVp of 0.774 and DSC agg GTVn of 0.760. Dice score DSC, aggregated Dice score DSC agg , and detection sensitivity for tumours and lymph nodes showed no statistically significant difference in performance in different age groups and different genders when compared with the overall performance of the segmentation model (all p > 0.05). There was a statistically significant difference in performance for DSC agg and sensitivity for patients who have not had chemotherapy (p < 0.05). The detection sensitivity was significantly higher for HPV-negative patients (p < 0.05). DSC and DSC agg were significantly lower for small tumours and lymph nodes and higher when tumours and lymph nodes were larger when compared to the overall performance (p < 0.05). Figure 2 shows the impact of varying t uncertain from 0 to 0.5 in steps of 0.025 on false positive per image and sensitivity. As the threshold for detecting false positives using t uncertain decreases, false positives decrease with a decrease in sensitivity. Figure 2A,B show the impact of per image false positive reduction and sensitivity with t uncertain for the primary tumour. Similarly, Figure 2D,E show the impact of per image false positive reduction and sensitivity with t uncertain for the lymph nodes. Figure 2C,F show the plot of per image false positives and the corresponding sensitivity for primary tumour and lymph nodes, respectively. The green point in Figure 2 showed the ideal operating point that corresponds to t uncertain = 0.1. At this operating point, there is a per image false positive detection reduction of 7.14% and 19.5% at 0.62% and 0.21% decrease in detection sensitivity for lymph nodes and tumours, respectively.   Figure 3 shows the qualitative results for uncertainty estimation using posterior weight sampling for tumour and lymph node segmentation. The first row in Figure 3 showed that the model wrongly predicted two lymph nodes. The uncertainty of one of these lymph nodes is high and, therefore, t uncertain is high. After applying t uncertain to the threshold, one of the false positives is removed. The uncertainty of the tumour region was high around the boundary. High values of t uncertain were also utilised for the false positive reduction in the second and fourth row of examples in Figure 3. In the third row, the model wrongly predicted a lymph node as a tumour. The uncertainty for both tumour and lymph nodes was high for this region.

Recurrence-Free Survival Prediction
Our results of RFS prediction are reported in Table 3 and Figure 4A. For fivefold crossvalidation, the average C-index and standard deviation are reported. We first evaluated the radiomics prediction performance of using different ROIs (all regions including tumour and lymph node, only tumour, only lymph node, largest tumour, and largest lymph node) on PET or CT separately. The results showed that in CT, the radiomics model of all regions performed best with a C-index of 0.659 ± 0.063. In PET, the radiomics model of the largest lymph node performed best with a C-index of 0.644 ± 0.084. Radiomics models based on CT performed better than PET models if only the tumour region is considered. However, radiomics models based on PET performed better than models based on CT if the only lymph node is considered. Radiomics models based on the primary tumour performed similarly to models based on lymph nodes if only CT was considered. However, radiomics models based on lymph nodes performed better than models based on the primary tumour if only PET is considered. The radiomics_clinical model, which combined the radiomics features from different regions in both imaging modality as well as the volume and clinical features, led to the highest C-index of 0.682 ± 0.083 in fivefold cross-validation. Figure 4B showed a Kaplan-Meier survival plot in which all patients were stratified by the predicted risk scores in fivefold cross-validation. The high-risk group showed significantly worse survival than the low-risk group (p < 0.001) while the low-risk group and median-risk group did not show a significant difference (p = 0.114). The model showed a C-index of 0.672 in the test set. niformity from both tumour and lymph nodes in CT as the top five most important features for RFS prediction. Among these features, glszm_GrayLevelNonUniformity from lymph nodes in PET, firstorder_Range from tumours in CT, and glrlm_GrayLevelNonUniformity from both tumour and lymph nodes in CT had a similar trend that a higher feature value resulted in a higher positive SHAP value and higher risk score. While glszm_largeAreaLowGrayLevelEmphasis from lymph nodes in PET and HPV status had a negative trend in which a higher feature value resulted in a lower negative SHAP value and lower risk score. The definitions of all the radiomics features used in radiomics_clinical model are presented in Appendix A, Table A2.   Table 4. There was no statistically significant difference in model performance for different subgroups of age, gender and lymph node size (all p > 0.05) compared with the overall population. However, the model showed significantly lower performance in subgroups without chemotherapy (C-index: 0.555, p = 0.022), without HPV infection (C-index: 0.551, p = 0.006), and the large tumour size subgroup (C-index: 0.561, p = 0.037). A global SHAP summary plot ( Figure 5) identified glszm_GrayLevelNonUniformity from lymph nodes in PET, firstorder_Range from tumours in CT, glszm_largeAreaLowGray LevelEmphasis from lymph nodes in PET, HPV status, and glrlm_GrayLevelNonUniformity from both tumour and lymph nodes in CT as the top five most important features for RFS prediction. Among these features, glszm_GrayLevelNonUniformity from lymph nodes in PET, firstorder_Range from tumours in CT, and glrlm_GrayLevelNonUniformity from both tumour and lymph nodes in CT had a similar trend that a higher feature value resulted in a higher positive SHAP value and higher risk score. While glszm_largeAreaLowGrayLevelEmphasis from lymph nodes in PET and HPV status had a negative trend in which a higher feature value resulted in a lower negative SHAP value and lower risk score. The definitions of all the radiomics features used in radiomics_clinical model are presented in Appendix A, Table A2.

Discussion
In this study, our proposed segmentation model could detect tumours and lymph nodes during fivefold cross-validation with a detection rate of 96.4% and 87.8%, respectively. The aggregated Dice scores for tumours and lymph nodes were found to be 0.774 and 0.760 in the unseen external test set, respectively. We utilised posterior weight space sampling for uncertainty estimation of tumour and lymph node segmentation to avoid silent failures and measure model confidence. By applying an uncertainty density threshold, there was a reduction of 7.14% in false positives per image for lymph nodes and 19.5% for tumours, with only a minimal decrease in sensitivity. The combination of radiomics features from multiple regions in multi-modality imaging (PET and CT), volume features, and clinical features allowed us to achieve a C-index of 0.672 on the unseen test dataset. The evaluation of fairness showed that the factors such as chemotherapy, HPV status, lesion size, and lymph node size can have an impact on the segmentation results and on the survival prediction.
The reliability of segmentation models could be increased by incorporating an uncertainty estimation to highlight the uncertain areas of segmentation. This can help in identifying false positives as well as warn clinicians. The qualitative analysis of the predicted segmentation along with uncertainty maps for tumours and lymph nodes showed that false positives of lymph nodes have high uncertainty associated with them. Furthermore, the correct segmentations have high uncertainty around the boundary, which correlates well with the fact that the inter-observer variability is also high around the margins of the lesion. We also observed that when the model mistakes the lymph node as a tumour, the uncertainty for both the lymph node and the tumour was high. This showed that uncertainty estimation could also be used for identifying when the model is unsure about the type of lesion. We could obtain a significant decrease in the false positive rate with minimal loss of sensitivity showing that uncertainty is highly correlated with false positives.
The prediction of outcomes was crucial for patients with head and neck (H&N) cancer as it helps clinicians in making informed therapeutic decisions and improving treatment outcomes. Many studies have explored the use of handcrafted radiomics and deep learning in predicting RFS in H&N cancer. Meng et al. [45] developed an outcome prediction model that combines the Deep Multi-task Survival model (DeepMTS), radiomics features, and clinical indicators, achieving a C-index of 0.681 in the test set. Rebaud et al. [46] proposed a binary-weighted radiomics model that resulted in a C-index of 0.682. Wang et al. [47] introduced a prediction framework that integrates the predicted risk scores from separate clinical feature models, PET radiomics models, and CT radiomics models, achieving a C-index of 0.673. In comparison, our focus was mainly on the radiomics prediction performance in different ROIs and modalities. The results showed that the radiomics models based on CT perform better when only the primary tumour region is considered, while radiomics models based on PET perform better when only the lymph node is considered. These results are valuable for further radiomics outcome prediction research, suggesting that the tumour in CT and the lymph node in PET might provide more prognostic information. The results are consistent with Carvalho's research [48] which showed higher prognostic value for the radiomics model in metastatic lymph nodes than the primary tumour alone in PET. The reason could be that lymph nodes contain tumour cells that are more aggressive and have therefore more impact on the prognosis in PET. However, lymph node staging is difficult in CT because of low soft-tissue contrast [49], which may be the explanation for the lower performance of lymph nodes in CT. The combination of multi-region multi-modal radiomics features, volume features, as well as clinical features led to the best prediction performance. The radiomics_clinical model is able to stratify patients into three groups, which helps clinical decision-making and selecting patients for (de-)escalation trials and/or adjuvant treatment.
Researchers have studied the fairness of algorithms to ensure that the AI algorithms were equitable and fair. Newton et al.'s study [50] revealed a correlation between a model's accuracy and an individual's skin tone classification due to the underrepresentation of individuals with darker skin in benchmark datasets. Statistical analysis of the segmentation result with respect to various subgroups revealed that the performance of the segmentation model was not affected by age and gender. For patients who did not receive chemotherapy, the DSC agg is lower, and sensitivity is higher than overall performance. This means that segmentation quality is lower for such patients, but the algorithm is able to detect them with a DSC threshold of 0.1. The segmentation performance for HPV-positive patients was lower and for HPV-negative patients was higher than the overall performance. The segmentation performance is impacted by variations in texture features and morphological characteristics. AK Tahari et al. conducted a study to evaluate the differences in morphological features in primary tumours and regional lymph nodes between HPV-positive and HPV-negative patients. The findings revealed that HPV-negative patients had larger primary tumours, higher heterogeneity, and a higher standardized uptake value [51]. Furthermore, Fujita et al. compared texture features between HPV-positive and HPV-negative non-OPC patients on CT scans and found that numerous texture analysis features demonstrated statistically significant differences between the two groups [52]. These findings highlight the need to consider HPV status in the texture analysis and segmentation of head and neck cancer. Furthermore, small tumours and lymph nodes had lower performance, and large tumours and lymph nodes had higher segmentation performance. This could be explained due to the fact that the non-overlapping of a few pixels for small tumours had a significant impact on the DSC and large tumours are generally easier to segment. While the size of the lesion had an impact on the segmentation performance, there was no statistically significant difference in sensitivity. It is crucial to inform the end users whether the segmentation model was influenced by the presence or absence of clinical features. The statistical analysis of the radiomics_clinical survival prediction model for several subcohorts revealed that performance was not impacted except for those who have not received chemotherapy or are HPV-negative, as well as for large tumour size. For these three subgroups, the performance of the model was significantly lower than the overall performance, indicating that the survival prediction model should be used with caution in certain cases.
Additionally, a SHAP analysis was conducted to better understand the contributions of each feature to the prediction. The global SHAP analysis highlighted glszm_GrayLevelNon Uniformity from lymph nodes in PET, firstorder_Range from tumours in CT, and glszm_large AreaLowGrayLevelEmphasis from lymph nodes in PET as the most important radiomics features. This result emphasises the significance of texture features and aligns with previous findings that tumour features in CT and lymph node features in PET are more informative. The SHAP analysis also revealed that HPV infection was associated with a favourable prognosis and consistent with other studies that show HPV-related H&N cancers are often found in younger, healthier patients with high economic status and high-risk sexual behaviour, leading to improved prognosis [53].
There were a few limitations in this study. We observed that the performance of the segmentation model and survival prediction model was impacted by chemotherapy and HPV status. There is a need to investigate bias mitigation strategies in the future to see if we can ensure that the algorithm is unbiased with respect to these identified biases. We have limited data on radiation type and chemotherapy dosage to investigate its impact on the model performance. The non-availability of the clinical data such as HPV status can hamper the fairness analysis of the segmentation and survival prediction models. While we have extensively validated our segmentation and survival prediction model on unseen datasets, there is still a need for prospective validation to study the usability and clinical utility of the models. There is a need to quantitatively study uncertainty estimation with inter-observer variability to further instil confidence in the confidence scores. There is also a need to evaluate the explainability of the proposed clinical_radiomics model in clinical practice.
In this paper, we proposed an end-to-end framework for tumours and lymph detection, segmentation, and prediction. We increase the reliability of the segmentations by identifying silent failures and reducing false positives by using uncertainty estimates. We demonstrated by qualitative and quantitative analysis that the uncertainty maps can identify salient failures and can also aid in false positive reduction. We also explored the prognostic potential of radiomics features extracted from tumours and lymph nodes in PET and CT. We also performed a quantitative analysis of the segmentation and survival prediction models to ensure fairness and equitable performance with respect to age, gender, chemotherapy status, HPV status, and lesion size. We aimed to draw the attention of healthcare professionals to the performance outliers of the models. Furthermore, we employed SHAP analysis for the interpretability of the model and correlated the insights with clinical knowledge. In future, prospective in silico studies that evaluate the performance of the clinicians with the aid of these segmentation and survival prediction models as well as quantify the usefulness of uncertainty estimation and explainability need to be carried out. Furthermore, fair models that mitigate the identified biases may also be developed in the future.

Conclusions
An end-to-end segmentation and recurrence-free survival prediction framework has been proposed that incorporates uncertainty estimation, fairness, and explainability. The uncertainty estimates have been used to identify failure cases and for per image false positive reduction. The recurrence-free survival prediction model can be used to stratify patients into distinct risk groups for treatment (de-)escalation trials and clinical decision support. The fairness of models has been evaluated to identify biases and inform the end-users of performance outliers.
Author Contributions: Conceptualization, Z.S., Y.C., H.C.W., N.M.R. and P.L.; methodology, Z.S., Y.C. and X.Z.; software, Z.S., Y.C. and X.Z.; validation, Z.S., Y.C. and X.Z.; formal analysis, Z.S., Y.C. and X.Z.; investigation, Z.S., Y.C. and X.Z.; writing-original draft preparation, Z.S., Y.C., X.Z. and S.A.M.; writing-review and editing, Z.S., Y.C., X.Z., S.A.M., H.C.W. and P.L.; supervision, H.C.W. and P.L.; funding acquisition, H.C.W. and P.L. All authors have read and agreed to the published version of the manuscript. Conflicts of Interest: Philippe Lambin reports, within and outside the submitted work, grants/sponsored research agreements from Radiomics SA, ptTheragnostic/DNAmito, and Health Innovation Ventures. He received an advisor/presenter fee and/or reimbursement of travel costs/consultancy fee and/or in-kind manpower contribution from Radiomics SA, BHV, Merck, Varian, Elekta, ptTheragnostic, BMS, and Convert pharmaceuticals. Lambin has minority shares in the companies Radiomics SA, Convert pharmaceuticals, Comunicare, and LivingMed Biotech. He is co-inventor of two issued patents with royalties on radiomics (PCT/NL2014/050248 and PCT/NL2014/050728) licensed to Radiomics SA; one issue patent on mtDNA (PCT/EP2014/059089) licensed to ptTheragnostic/DNAmito; three non-patented inventions (software) licensed to ptTheragnostic/DNAmito, Radiomics SA, and Health Innovation Ventures; and three non-issues, non-licensed patents on deep learning radiomics and LSRT (N2024482, N2024889, N2024889). He confirms that none of the above entities or funding was involved in the preparation of this paper. Henry C. Woodruff has minority shares in the company Radiomics SA. The rest of co-authors declare no competing interest. Table A1. Description on the type of radiomics features that are extracted from each region of interest from PET and CT images. More information on the type of features is present in PyRadiomics documentation (https://pyradiomics.readthedocs.io/en/latest/features.html (accessed on 20 March 2023)).

Radiomics Feature Description
Gray level co-occurrence matrix (GLCM) Gray level co-occurrence matrix features describe the second order joint probability function of the voxel intensities. These features include measures such as contrast, correlation, energy, and homogeneity. In this study, 24 GLCM features were extracted using PyRadiomics.
Gray level difference matrix (GLDM) Gray level difference matrix features describe the distribution of gray level differences within the ROI. These features include measures such as coarseness, contrast, and busyness. In this study, 14 GLDM features were extracted using PyRadiomics.

Gray level run length matrix (GLRLM)
Gray level run length matrix features describe the length of runs of consecutive voxels with the same gray level. These features include measures such as short-run emphasis, long-run emphasis, and run percentage. In this study, 16 GLRLM features were extracted using PyRadiomics.
Gray level size zone matrix (GLSZM) Gray level size zone matrix features describe the size of zones of consecutive voxels with the same gray level. These features include measures such as zone size, zone percentage, and zone entropy. In this study, 16 GLSZM features were extracted using PyRadiomics.
Neighbouring gray tone difference matrix (NGTDM) Neighbouring gray tone difference matrix features describe the distribution of voxel-level texture primitive patterns. These features include measures such as coarseness and contrast. In this study, 5 NGTDM features were extracted using PyRadiomics.

First order statistics
First order statistics describe the distribution of voxel intensities within the ROI. These features include measures such as mean, median, skewness, and kurtosis. In this study, 18 first order features were extracted using PyRadiomics.

Shape-based (3D)
Shape features describe the shape and size of the ROI. These features include measures such as volume, surface area, sphericity, and compactness. In this study, 14 shape features were extracted using PyRadiomics. Table A2. Description of the radiomics features used in the proposed radiomics_clinical model for recurrence-free survival prediction.

Radiomics Feature Definition
First order: First order statistics describe the distribution of voxel intensities within the image region defined by the mask through commonly used and basic metrics.

Original_firstorder_Range
The range of first order value in the ROI

Original_firstorder_10Percentile
The 10th percentile of first order value.

Original_firstorder_ TotalEnergy
Total Energy is the value of the Energy feature scaled by the volume of the voxel in cubic mm

Original_firstorder_ InterquartileRange
The different value 25th and 75th percentile of the first order value.