Context-Aware Saliency Guided Radiomics: Application to Prediction of Outcome and HPV-Status from Multi-Center PET/CT Images of Head and Neck Cancer

Simple Summary This study investigated the ability of context-aware saliency-guided PET/CT radiomics in the prediction of outcome and HPV status for head and neck cancer. In total, 806 HNC patients (training vs. validation vs. external testing: 500 vs. 97 vs. 209) from 9 centers were collected from The Cancer Imaging Archive (TCIA). Saliency-guided radiomics showed enhanced performance for both outcome and HPV-status predictions relative to conventional radiomics. The radiomics-predicted HPV status also showed complementary prognostic value. This multi-center study highlights the feasibility of saliency-guided PET/CT radiomics in outcome predictions of head and neck cancer, confirming that certain regions are more relevant to tumor aggressiveness and prognosis. Abstract Purpose: This multi-center study aims to investigate the prognostic value of context-aware saliency-guided radiomics in 18F-FDG PET/CT images of head and neck cancer (HNC). Methods: 806 HNC patients (training vs. validation vs. external testing: 500 vs. 97 vs. 209) from 9 centers were collected from The Cancer Imaging Archive (TCIA). There were 100/384 and 60/123 oropharyngeal carcinoma (OPC) patients with human papillomavirus (HPV) status in training and testing cohorts, respectively. Six types of images were used for radiomics feature extraction and further model construction, namely (i) the original image (Origin), (ii) a context-aware saliency map (SalMap), (iii, iv) high- or low-saliency regions in the original image (highSal or lowSal), (v) a saliency-weighted image (SalxImg), and finally, (vi) a fused PET-CT image (FusedImg). Four outcomes were evaluated, i.e., recurrence-free survival (RFS), metastasis-free survival (MFS), overall survival (OS), and disease-free survival (DFS), respectively. Multivariate Cox analysis and logistic regression were adopted to construct radiomics scores for the prediction of outcome (Rad_Ocm) and HPV-status (Rad_HPV), respectively. Besides, the prognostic value of their integration (Rad_Ocm_HPV) was also investigated. Results: In the external testing cohort, compared with the Origin model, SalMap and SalxImg achieved the highest C-indices for RFS (0.621 vs. 0.559) and MFS (0.785 vs. 0.739) predictions, respectively, while FusedImg performed the best for both OS (0.685 vs. 0.659) and DFS (0.641 vs. 0.582) predictions. In the OPC HPV testing cohort, FusedImg showed higher AUC for HPV-status prediction compared with the Origin model (0.653 vs. 0.484). In the OPC testing cohort, compared with Rad_Ocm or Rad_HPV alone, Rad_Ocm_HPV performed the best for OS and DFS predictions with C-indices of 0.702 (p = 0.002) and 0.684 (p = 0.006), respectively. Conclusion: Saliency-guided radiomics showed enhanced performance for both outcome and HPV-status predictions relative to conventional radiomics. The radiomics-predicted HPV status also showed complementary prognostic value.

regions with distinctive patterns but also capture non-redundant background regions by using multi-scale information and achieves good performance in image retargeting and summarization.
As such, we set out to investigate the prognostic value of context-aware saliencyguided PET/CT radiomics in a large cohort multi-center setting (806 HNC patients from 9 centers). A total of six prognostic models were constructed based on (1) original PET or CT images, (2) the saliency-map itself, sub-regions with (3) higher vs. (4) lower saliency, (5) saliency-weighted images, and finally, (6) the fused PET/CT images for outcome prediction. Meanwhile, considering the fact that HPV+ OPC patients show more favorable prognoses [24] compared with HPV-OPC patients, we built six additional models and derived radiomics score for HPV-status prediction, and subsequently, the radiomics score was introduced into the aforementioned six prognostic models and the complementary prognostic value of the radiomics score of HPV-status for OPC patients was further evaluated. Besides, since model generalizability can be affected by centers, manufacturers, and voxel size [25], ComBat-based feature harmonization [26] among batches divided by these three factors was implemented, and the impact of harmonization on model performance was further evaluated.

Multi-Center Data Set
We utilized multi-center HNC data from The Cancer Imaging Archive (TCIA). Inclusion criteria included (1) histologically proven HNC, (2) the absence of distant metastases at initial diagnosis, (3) the existence of complete follow-up information, and (4) the availability of pre-treatment 18 F-FDG PET/CT images. Exclusion criteria included (1) receiving anti-cancer therapy before PET/CT imaging or (2) having prior HNC or other kinds of cancer. This study abides by the TCIA Data Usage Policy; all datasets were de-identified, thus did not require institutional review board approval, and patient informed consent was waived.
ing cohorts, respectively, whose HPV status was available, were used for HPV-status prediction model construction and testing, and the prognostic value of the radiomics score for HPV-status prediction (Rad_HPV) was evaluated on the aforementioned 123 OPC patients, comparing its usage (Rad_Ocm_HPV) vs. not using it (i.e., Rad_Ocm only).   Biograph 40). The inplane resolution of PET images varied between 1.95 and 5.5 mm, while it was 0.48-1.95 mm for CT images; slice thicknesses were in the ranges of 2-5 mm and 1.5-5 mm for PET and CT images, respectively.

Clinical Parameters and Outcome Information
Patients' characteristics and follow-up information in txt or DICOM structured reporting (SR) files were collected and harmonized. Clinical information included Age, Sex, T-stage, N-stage, TNM stage, Site, Therapy, Smoking, HPV status, and Grade. The primary endpoint was disease-free survival (DFS), which was defined as the time from diagnosis to the first confirmed event of recurrence, metastasis, death, or their combination, whichever occurred first. The other three endpoints, including recurrence-free survival (RFS), metastasis-free survival (MFS), and overall survival (OS), were also evaluated.   Biograph 40). The in-plane resolution of PET images varied between 1.95 and 5.5 mm, while it was 0.48-1.95 mm for CT images; slice thicknesses were in the ranges of 2-5 mm and 1.5-5 mm for PET and CT images, respectively.

Clinical Parameters and Outcome Information
Patients' characteristics and follow-up information in txt or DICOM structured reporting (SR) files were collected and harmonized. Clinical information included Age, Sex, T-stage, N-stage, TNM stage, Site, Therapy, Smoking, HPV status, and Grade. The primary endpoint was disease-free survival (DFS), which was defined as the time from diagnosis to the first confirmed event of recurrence, metastasis, death, or their combination, whichever occurred first. The other three endpoints, including recurrence-free survival (RFS), metastasis-free survival (MFS), and overall survival (OS), were also evaluated.

Tumor Segmentation
For patients with available DICOM segmentations (SEG) or radiotherapy structure set (RTSTRUCT) files, the gross tumor volume of the primary tumor and lymph node (GTV and GTVnd) were extracted as CT labels and were propagated to generate PET labels. Otherwise, PET labels were first semi-automatically delineated by a radiologist (H. Z.) with 4 years of experience in head and neck nuclear medicine using the "level tracing" module in 3D Slicer 4.10.2 (https://slicer.org/, accessed on 3 November 2019), and CT labels were generated following registration.

Context-Aware Saliency Map Generation
A context-aware saliency map was generated from the smallest boxes of PET and CT images that covered the entire tumor. PET images were converted to SUV. CT intensity was truncated to (-200, 300) to mitigate the interference of air and bone. As illustrated in Figure 2, the first step is to characterize the dissimilarity among patches both locally and globally. Given a patch p i centered at pixel i, the dissimilarity between patch p i and any patch p j can be characterized by their distance d(p i , p j ) as follows: where d intensity (p i , p j ) and d position (p i , p j ) denote the intensity and position of Euclidean distances between patch p i and patch p j , with c =3 controlling the ratio of these two distances as suggested by [23]. This dissimilarity measure is proportional to the difference in intensity and inversely proportional to the positional distance. For patients with available DICOM segmentations (SEG) or radiotherapy structure set (RTSTRUCT) files, the gross tumor volume of the primary tumor and lymph node (GTV and GTVnd) were extracted as CT labels and were propagated to generate PET labels. Otherwise, PET labels were first semi-automatically delineated by a radiologist (H. Z.) with 4 years of experience in head and neck nuclear medicine using the "level tracing" module in 3D Slicer 4.10.2 (https://slicer.org/, accessed on 3 November 2019), and CT labels were generated following registration.

Context-Aware Saliency Map Generation
A context-aware saliency map was generated from the smallest boxes of PET and CT images that covered the entire tumor. PET images were converted to SUV. CT intensity was truncated to (-200, 300) to mitigate the interference of air and bone. As illustrated in Figure 2, the first step is to characterize the dissimilarity among patches both locally and globally. Given a patch i p centered at pixel i, the dissimilarity between patch i p and any patch j p can be characterized by their distance i j d p p    as follows: where int ensity i j d p p    and position i j d p p    denote the intensity and position of Euclidean distances between patch i p and patch j p , with c= controlling the ratio of these two distances as suggested by [23]. This dissimilarity measure is proportional to the difference in intensity and inversely proportional to the positional distance. Since the background is likely to have similar patches at multiple scales while the dominant object is prone to having similar patches at only a few scales, in order to suppress the background, the distance calculation was set in a multi-scale manner (as also shown in Figure 2). Patch i p was selected from the scaled image with a scaling ratio r relative to the original image size, while patch j p was selected from the further scaled image with a scaling ratio s relative to the scaled image size used for patch i p selection.
Consistent with [23], we considered 4 image scales for patch i p : , with corresponding patch sizes set to 7 × 7, 5 × 5, 3 × 3, and 3 × 3, and 3 image scales for patch j p : s r r r         . Thus, the multi-scale distance can be denoted as Since the background is likely to have similar patches at multiple scales while the dominant object is prone to having similar patches at only a few scales, in order to suppress the background, the distance calculation was set in a multi-scale manner (as also shown in Figure 2). Patch p i was selected from the scaled image with a scaling ratio r relative to the original image size, while patch p j was selected from the further scaled image with a scaling ratio s relative to the scaled image size used for patch p i selection. Consistent with [23], we considered 4 image scales for patch p i : r ∈ {1, 0 .8, 0.5, 0 .3}, with corresponding patch sizes set to 7 × 7, 5 × 5, 3 × 3, and 3 × 3, and 3 image scales for patch p j : s ∈ r, 1 2 r, 1 4 r . Thus, the multi-scale distance can be denoted as d(p r i , p s j ). To highlight the salient regions and suppress the non-salient ones, a salient pixel is required to be different from all the other patches, but for computational simplicity, it suffices to only consider the K most similar patches. The saliency of pixel i at scale r can be defined as: where K = 64, and the mean salience among different scales was calculated for each pixel i: where N = 4, i.e., aforementioned 4 scales r ∈ {1, 0 .8, 0.5, 0 .3}. In order to include the immediate context, regions with mean saliency higher than 0.8 were regarded as the foci of attention, and the final saliency of each pixel was weighted by its Euclidean distance to the closest foci pixel: As saliency maps assign a range of 0-1 weights to the pixels, we generated two subregions by setting a threshold of 0.5, i.e., high-and low-saliency regions, denoted as highSal and lowSal. Besides, saliency-weighted PET and CT images were also generated, and their summation was used as a fused PET-CT image to integrate important anatomical and metabolic information.

Feature Extraction
All images were interpolated to an isotropic voxel size of 1 × 1 × 1 mm 3 , and intensity was discretized to 64 bins before feature extraction. We utilized the Standardized Environment for Radiomics Analysis (SERA; Ver 2.1, https://github.com/ashrafinia/SERA, accessed on 30 November 2020) for feature extraction, which is in compliance with the Image Biomarkers Standardization Initiative (IBSI) [9], and further validated elsewhere [29,30]. Furthermore, 497 radiomics features were extracted from each type of image, i.e., the entire tumor region in the original PET or CT image (Origin), the saliency map itself (SalMap), the high-saliency region (highSal) or low-saliency region (lowSal) in PET or CT images according to saliency higher or lower than 0.5, and the entire tumor region in the saliency-weighted image (SalxImg) and fused image (FusedImg). The overview of the saliency-guided radiomics analysis workflow is shown in Figure 3.

Reproducibility Analysis
To investigate the impact of inter-and intra-observer segmentation variations on feature reproducibility, 52 HNC patients in dataset QIN1 were segmented twice by both manual and semi-automatic methods by three experienced readers. A two-way randomeffects, single-rater, absolute agreement intra-class correlation coefficient (ICC) [31] was calculated for each feature to assess inter-observer variation, and a two-way mixed-effect, single-measurement, absolute agreement ICC was calculated to evaluate intra-observer variation with the same calculation equation. Features were required to be reproducible (ICC > 0.8) for both intra-and inter-observer variations. Reproducibility was evaluated for PET and CT images, separately. For fused images, only features that were reproducible on both PET and CT images remained for subsequent analyses. Cancers 2022, 14, x 7 of 18 Figure 3. Overview of the investigated saliency-guided radiomics analysis workflow, including saliency-guided image generation (Origin, SalMap, highSal, lowSal, SalxImg, and FusedImg), radiomics feature extraction, reproducibility analysis, feature harmonization, model construction (outcome prediction and HPV status prediction), and saliency distribution.

Reproducibility Analysis
To investigate the impact of inter-and intra-observer segmentation variations on feature reproducibility, 52 HNC patients in dataset QIN1 were segmented twice by both manual and semi-automatic methods by three experienced readers. A two-way random- Overview of the investigated saliency-guided radiomics analysis workflow, including saliency-guided image generation (Origin, SalMap, highSal, lowSal, SalxImg, and FusedImg), radiomics feature extraction, reproducibility analysis, feature harmonization, model construction (outcome prediction and HPV status prediction), and saliency distribution.
In each model configuration, univariate Cox analysis was first used to evaluate the prognostic performance of each feature in the training cohort, i.e., calculating 4 C-indices for each feature when predicting 4 endpoints (RFS, MFS, OS, and DFS). Thus, the mean C-index was used to characterize the overall performance of each feature, and the top 20 non-redundant features with the highest mean C-index values were selected as candidate features for subsequent multivariate Cox model construction. Starting with the first feature, the multivariate Cox model was built via stepwise addition of the remaining features, i.e., 19 models were constructed, and it was only conducted for primary endpoint DFS prediction (i.e., only one C-index per model) in the training cohort instead of for each endpoint to avoid overfitting. Then, for each constructed model, 3 C-indices for 3 secondary endpoints (RFS, MFS, and OS) predictions in the training cohort and 4 C-indices for all endpoints predictions in the validation cohort were calculated. As such, the model showing the highest mean C-index (averaging among these 8 C-indices) in both training and validation cohorts for all four endpoint prediction tasks was selected. Features retained in this model weighted by their corresponding coefficients were regarded as the radiomics score for outcome prediction (Rad_Ocm). Therefore, only six selected models (one model per image type) were tested in the independent external testing set, consistent with AI algorithm development best-practice guidelines that models should be tested on an unseen holdout dataset [32].
Furthermore, considering that HPV status was only available for 100 OPC patients in the training cohort, by using the same configurations of these selected 6 prognostic models, 6 multivariate logistic regression models were constructed for HPV status prediction. Similar to the outcome prediction model construction, features were first sorted by their AUC values in descending order, the top 2-20 non-redundant features were then fed into multivariate logistic models, and finally, the model with higher AUC and lower Bayesian information criterion (BIC) values was selected (by comparing the summation of ranks of descending AUC and ascending BIC). Thus, the radiomics score for HPV status prediction (Rad_HPV) can be obtained in the same way. Next, 60 OPC patients from the external testing cohort whose HPV status was available were used to test the performance of the HPV-status prediction model. Since HPV status is well-known to be relevant in the prognosis of OPC, the complementary prognostic value of Rad_HPV was further evaluated in 123 OPC patients from the testing cohort, comparing its usage (Rad_Ocm_HPV) vs. not using it (i.e., Rad_Ocm only) (Figure 1).
The model performance was reported by the C-index and AUC with a 95% confidence interval using 100 bootstraps, and was further evaluated regarding Kaplan-Meier curves, the log-rank test (the median value of radiomics score from the training cohort was used for separating high-or low-risk group), and the receiver operating characteristic curve (ROC) when appropriate.

Feature Screening
The mean (range) inter-and intra-observer ICCs for the 497 PET features were 0.83

Performance of Saliency Guided Prognostic Model
The C-indices of the six selected optimal models in external testing cohorts are listed in Table 2, and the corresponding model configurations (image type, modality, and harmonization method) and the number of features are also provided. Compared with the Origin model, SalMap and SalxImg models achieved the highest C-indices for RFS (0.621 vs. 0.559) and MFS (0.785 vs. 0.739) predictions, respectively; the FusedImg model performed the best for both OS (0.685 vs. 0.659) and DFS (0.641 vs. 0.582) predictions. The performance of these six selected models in training and validation cohorts is provided in Supplementary  Table S1.
Among the six selected optimal models listed in Table 2, none of them were constructed by the center-based harmonization method, while four models achieved their best performance under manufacturer-or voxel size-based harmonization methods, indicating the batch effect may not be prominently expressed over different centers but different manufacturers and voxel sizes, thus suppressing the effect of center-based harmonization on performance improvement. Figure 4 shows the K-M curves of the Origin model and selected representative saliency-guided models in the external testing cohort. All saliency-related models showed higher hazard ratio (HR) values than the Origin model for the four outcome predictions (1.60-6.77 vs. 1.25-4.36). High-risk patients identified by the SalxImg model showed a 5-year MFS value of 61% (Figure 4f), while this value was 67% in the Origin model (Figure 4b). The FusedImg model achieved significant separation between high-and low-risk groups for DFS prediction with a p-value of 0.003 (Figure 4h), while it was 0.048 for the Origin model (Figure 4d).

Complementary Prognostic Value of Radiomics Score of HPV Status
Under the same model configuration, the radiomics score for HPV-status prediction (Rad_HPV) was built based on 100 OPC patients from the training cohort. As shown in Figure 5  Among the six selected optimal models listed in Table 2, none of them were constructed by the center-based harmonization method, while four models achieved their best performance under manufacturer-or voxel size-based harmonization methods, indicating the batch effect may not be prominently expressed over different centers but different manufacturers and voxel sizes, thus suppressing the effect of center-based harmonization on performance improvement. Figure 4 shows the K-M curves of the Origin model and selected representative saliency-guided models in the external testing cohort. All saliency-related models showed higher hazard ratio (HR) values than the Origin model for the four outcome predictions (1.60-6.77 vs. 1. 25-4.36). High-risk patients identified by the SalxImg model showed a 5year MFS value of 61% (Figure 4f), while this value was 67% in the Origin model ( Figure  4b). The FusedImg model achieved significant separation between high-and low-risk groups for DFS prediction with a p-value of 0.003 (Figure 4h), while it was 0.048 for the Origin model (Figure 4d). Under the same model configuration, the radiomics score for HPV-status prediction (Rad_HPV) was built based on 100 OPC patients from the training cohort. As shown in Figure 5, Rad_HPV of the Origin model and the FusedImg model showed similar AUC values of 0.706 and 0.702 for HPV-status prediction in the training set, while the FusedImg model showed a higher AUC value of 0.653 compared with the Origin model with an AUC value of 0.484 in the testing set. The prognostic value of Rad_HPV was further evaluated on 123 OPC patients from the external testing cohort (Table 3 and Supplementary Table S2). Compared with the radiomics score for outcome prediction alone (Rad_Ocm) or their combination (Rad_Ocm_HPV), Rad_HPV values of FusedImg and SalxImg models showed the highest C-indices of 0.641 and 0.687 for RFS and MFS predictions, respectively, while Rad_Ocm_HPV of the FusedImg model performed the best for OS and DFS predictions with C-indices of 0.702 and 0.684 and log-rank p-values of 0.002 and 0.006, respectively ( Figure 6). The features retained in the SalMap, SalxImg, and FusedImg models and the corresponding coefficients are reported in Supplementary Table S3.  The prognostic value of Rad_HPV was further evaluated on 123 OPC patients from the external testing cohort (Table 3 and Supplementary Table S2). Compared with the radiomics score for outcome prediction alone (Rad_Ocm) or their combination (Rad_Ocm_HPV), Rad_HPV values of FusedImg and SalxImg models showed the highest C-indices of 0.641 and 0.687 for RFS and MFS predictions, respectively, while Rad_Ocm_HPV of the FusedImg model performed the best for OS and DFS predictions with C-indices of 0.702 and 0.684 and log-rank p-values of 0.002 and 0.006, respectively ( Figure 6). The features retained in the SalMap, SalxImg, and FusedImg models and the corresponding coefficients are reported in Supplementary Table S3. Table 3. The highest C-index (95% confidence interval) of Rad_Ocm, Rad_HPV, and Rad_Ocm_HPV for outcome prediction in 123 OPC patients from testing cohort; models were selected from Supplementary Table S2.

Saliency Distribution Localization
For a better understanding of the contribution of regions with different saliency values, we calculated the normalized distance between the highest saliency voxel and the centroid in PET and CT images. A distance of 0 indicates the highest saliency located at the centroid, while 1 represents its location at the farthest edge. As shown in Figure 7, the highest saliency of more than half of the patients was located far from the centroid with distances ranging from 0.6-0.9, while a small portion of patients showed higher saliency close to the centroid.  Table 3.  Table 3.
For a better understanding of the contribution of regions with different saliency val-ues, we calculated the normalized distance between the highest saliency voxel and the centroid in PET and CT images. A distance of 0 indicates the highest saliency located at the centroid, while 1 represents its location at the farthest edge. As shown in Figure 7, the highest saliency of more than half of the patients was located far from the centroid with distances ranging from 0.6-0.9, while a small portion of patients showed higher saliency close to the centroid.

Discussion
This study investigated the prognostic value of saliency-guided PET/CT radiomics (SalMap, highSal, lowSal, SalxImg, and FusedImg) in HNC, and models were developed and validated on 806 patients from 9 centers. Our results demonstrate the great prognostic potential of saliency radiomics (SalMap, SalxImg, and FusedImg) compared with the conventional radiomics model (Origin), especially for MFS, OS, and DFS prediction. This is analogous with previous study findings that showed fused PET/CT radiomics outperformed PET or CT alone [27,33], e.g., when adopting wavelet-based fusion. Besides, saliency radiomics also outperformed conventional radiomics for HPV status prediction, and integrating Rad_HPV with Rad_Ocm showed a complementary prognostic value compared with only involving Rad_Ocm in a sub-cohort of OPC patients.
Vuong et al. [34] created a radiomics feature activation map in CT images of nonsmall-cell lung cancer by comparing whether the feature value from a small patch was higher than the population median feature value from the entire tumor or not, aiming to track the spatial location of regions responsible for signature activation. They found that the texture feature GLSZM_zone size non-uniformity normalized was more activated on the adjacent region of the tumor. By contrast, the saliency map in our present study does not need population information and can be generated for any new patient due to the internal comparison mechanism. The saliency map provides the visual perception of the prognostic contribution of each region, and high saliency was found to be more frequently located at the remoter region relative to the centroid for both PET and CT images in our study, indicating the aggressiveness and special biological behavior of frontier area. Nevertheless, more intuitive comparisons of different regions may be obtained when both preand post-therapy or follow-up images are available, and thus shrinkage and resistance or recurrence regions can be identified [35,36].
HPV status has been documented as an independent prognostic factor in OPC [24], and several studies have demonstrated the predictive ability of PET/CT radiomics features in HPV-status prediction [37][38][39]. This has motivated us to speculate that radiomics-predicted HPV status might also be prognostic for outcomes. In this study, given the limited

Discussion
This study investigated the prognostic value of saliency-guided PET/CT radiomics (SalMap, highSal, lowSal, SalxImg, and FusedImg) in HNC, and models were developed and validated on 806 patients from 9 centers. Our results demonstrate the great prognostic potential of saliency radiomics (SalMap, SalxImg, and FusedImg) compared with the conventional radiomics model (Origin), especially for MFS, OS, and DFS prediction. This is analogous with previous study findings that showed fused PET/CT radiomics outperformed PET or CT alone [27,33], e.g., when adopting wavelet-based fusion. Besides, saliency radiomics also outperformed conventional radiomics for HPV status prediction, and integrating Rad_HPV with Rad_Ocm showed a complementary prognostic value compared with only involving Rad_Ocm in a sub-cohort of OPC patients.
Vuong et al. [34] created a radiomics feature activation map in CT images of nonsmall-cell lung cancer by comparing whether the feature value from a small patch was higher than the population median feature value from the entire tumor or not, aiming to track the spatial location of regions responsible for signature activation. They found that the texture feature GLSZM_zone size non-uniformity normalized was more activated on the adjacent region of the tumor. By contrast, the saliency map in our present study does not need population information and can be generated for any new patient due to the internal comparison mechanism. The saliency map provides the visual perception of the prognostic contribution of each region, and high saliency was found to be more frequently located at the remoter region relative to the centroid for both PET and CT images in our study, indicating the aggressiveness and special biological behavior of frontier area. Nevertheless, more intuitive comparisons of different regions may be obtained when both pre-and post-therapy or follow-up images are available, and thus shrinkage and resistance or recurrence regions can be identified [35,36].
HPV status has been documented as an independent prognostic factor in OPC [24], and several studies have demonstrated the predictive ability of PET/CT radiomics features in HPV-status prediction [37][38][39]. This has motivated us to speculate that radiomicspredicted HPV status might also be prognostic for outcomes. In this study, given the limited number of patients who are available regarding HPV status, it was not feasible to construct prognostic models for the entirety of the model by incorporating HPV status as a predictor directly, while abandoning the HPV status arbitrarily means sacrificing important information. As such, prognostic information of HPV was carried by a radiomics score that was predictive of HPV status, and Rad_HPV was further combined with Rad_Ocm for outcome prediction. Better prognostic performance was achieved in the OPC testing cohort, confirming the effectiveness of this strategy.
In a multi-center study, an unavoidable problem is the so-called batch effect where radiomics features are sensitive to the manufacturer, acquisition protocol, and reconstruction algorithm. A compensation method was originally proposed in genomics [40] and extensively applied in radiomics based on different batch divisions [41,42]. Orlhac et al. [41] found that ComBat can remove the batch effect caused by different reconstruction algorithms, kernels, and slice thicknesses in CT images with lung cancer, and also found ComBat to enhance the identification performance of triple-negative breast cancer in PET images obtained from two different centers [42]. In the present study, different manufacturers and voxel sizes were observed in the same center (Supplementary Figure S1). As such, different batch divisions can be generated based on centers, manufacturers, and voxel size to fully explore the effect of feature harmonization (ComBat) on model generalization. Our results ( Table 2) showed that different batch divisions result in varying effects on different outcome predictions and model configurations; in fact, for Origin and SalMap models, an even lower performance was obtained after harmonization compared with no harmonization. One possible reason is that ComBat assumes additive and multiplicative batch effects satisfying Gaussian and inverse Gamma distributions [40], while, due to the existence of complex confounders, this strict assumption does not hold in some circumstances, especially for a variety of radiomics features. More effective batch-division strategies are warranted to obtain the prominent factors of variability and ensure the prerequisite of ComBat. Other advanced harmonization strategies [43] may allow for better generalization, e.g., Modified ComBat (M-ComBat) [44] by harmonizing features to the mean and standard deviation of a selected reference batch instead of to the global mean and standard deviation of all batches, Bootstrapped ComBat (B-ComBat) by estimating parameters in a bootstrapped manner to improve the robustness, or their combination (BM-ComBat). Deep-learning-based harmonization [45] in the image domain may also provide another solution.
The present study has some limitations. First, the dataset was retrospectively collected from the TCIA public platform, wherein each dataset may be originally collected for different purposes, and patient inclusion bias may have existed; thus, prospective cohorts can be studied for further validation in the future. Second, we used the default parameter setting instead of extensive tuning when constructing the saliency map to avoid overfitting, though parameter tuning may result in better performance. Third, the saliency map was generated slice by slice, while 3D methods may produce more precise saliency information and still need to be investigated. Fourth, investigating other saliency detection methods or designing a dedicated saliency detection method that is more appropriate to characterize the importance of the tumor sub-region in PET/CT HNC is of interest in the future. Finally, HPV-status prediction was only conducted on OPC patients, and the prognostic value of predicted HPV status on tumors located at other sites (e.g., larynx and hypopharynx) remains to be explored.

Conclusions
Our multi-center study highlights the feasibility of saliency-guided PET/CT radiomics in outcome prediction of head and neck cancer, confirming that certain regions are more relevant to tumor aggressiveness and prognosis. The radiomics score as a surrogate of HPV status also conveyed prognostic information. Feature harmonization among batches divided by the manufacturer and voxel size showed better performance improvement compared with that divided by centers.