Over the past decades, the incidence of oropharyngeal squamous cell carcinoma (OPSCC) has continuously increased, which is attributed to a marked rise in the prevalence of sustained high-risk human papillomavirus (HPV) infection in the oropharynx [1]. Despite arising from the same pharyngeal site, HPV-associated and HPV-negative OPSCC are considered separate cancer entities with diverging demographic, biologic, and, most notably, prognostic characteristics. HPV-positive cancers are associated with longer overall survival (OS) and progression-free survival (PFS) and a more favorable treatment response than the HPV-negative form [5]. Consequently, the 8th edition of the American Joint Committee on Cancer (AJCC)/Union for International Cancer Control (UICC) staging manuals adopted separate staging schemes for survival risk-stratification and prognostication of HPV-associated and HPV-negative OPSCC [9].
Advancements in high-throughput computing and machine learning led to the emergence of the “-omics” concept, referring to the collective characterization and quantification of pools of biologic information, such as genomics, proteomics, or metabolomics. Radiomics refers to the automated extraction of high-dimensional, quantitative descriptor (“feature”) sets from medical images for various applications, including survival modelling, treatment guidance, and biomarker design [13]. Such features correlate with clinical outcome and convey medically meaningful information describing tumor heterogeneity, microenvironment, pathophysiology, and mutational burden [13]. While prior studies demonstrated the prognostic value of radiomics biomarkers in head and neck cancers [15], none have incorporated or compared the AJCC 8th edition staging scheme in OPSCC survival modelling and stratification. In this study, we explored the potential added value of radiomics biomarkers, beyond the AJCC staging scheme, for prognostication of PFS and OS in a multi-institutional cohort.
[18F]Fluorodeoxyglucose positron emission tomography (PET) and computed tomography (CT) are a mainstay of OPSCC staging, treatment planning, and surveillance. We applied machine-learning algorithms to devise prognostic radiomics biomarkers for OPSCC using baseline PET and/or CT scans from a multi-institutional cohort. Then, we compared the radiomic biomarkers’ performance with AJCC staging for prognostication and risk-stratification of PFS and OS in HPV-associated and HPV-negative subgroups.
Currently, pre-treatment imaging of head and neck cancers serves to evaluate primary tumor dimensions, anatomical extent, and involvement of regional lymph nodes and to detect distant metastases, which constitute the main components of AJCC/UICC staging. However, our results suggest quantitative imaging biodata reflecting tissue density, texture patterns, lesion shape, and metabolic activity of primary tumors and metastatic cervical nodes may encode valuable information pertaining to tumor behavior with potential prognostic relevance. In both HPV-associated and HPV-negative OPSCC, we observed trends suggesting radiomic analysis may provide complementary value for prognostication and risk-stratification beyond AJCC staging. Statistical significance was consistently attained for PFS prognostication and risk-stratification in HPV-associated OPSCC. Additionally, radiomics-based OS risk-stratification outperformed AJCC staging variables in HPV-associated OPSCC, with similar trends in HPV-negative patients. Notably, models utilizing PET radiomics or combined PET and CT feature sets predominantly outperformed CT-based survival prognostication in the HPV-associated subgroup; additionally, consensus volume of interest (VOI) models utilizing radiomics information from both primary tumor and metastatic nodes were usually superior.
Our methodology may be applied in future larger cohorts to generate uniformly applicable and objective imaging biomarkers for prognostic risk-stratification of OPSCC. Moreover, our approach allows further prognostic variables to be included in PFS and OS models for risk-stratification of head and neck cancer subgroups [29].
To enhance generalizability and model robustness against heterogeneity in imaging and reconstruction protocols and scanning equipment, we acquired a multi-institutional dataset provided by cancer centers in the United States and Canada. Overall, AJCC models had modest prognostic accuracy in HPV-positive and HPV-negative subsets, achieving an averaged C-index ± SD of up to 0.55 ± 0.08 (p = 0.34, OS analysis of HPV-positive patients), which is likely attributable to low event rates and relatively small cohort sizes. On the other hand, in the HPV-associated subgroup, a PET/CT radiomics model using the full set of consensus VOI features for PFS prognostication produced an averaged C-index ± SD of 0.62 ± 0.05 (p = 0.02), and a PET model using consensus features for OS prediction achieved 0.63 ± 0.08 (p = 0.06). We observed similar trends in HPV-negative OPSCC, despite using a smaller cohort.
To illustrate the models’ prognostic abilities throughout the follow-up period, we plotted performance curves (Figure 2). The curves again reflect the heatmap findings, with radiomics or combined models predominantly outperforming AJCC models. The differences between models were more notable in the early years of follow-up, which could be related to data sparsity in later years. It is likely feasible to train machine-learning models with improved long-term prognostication using larger cohorts with longer follow-up.
Most prior OPSCC radiomics studies relied on generalizations of linear models to examine radiomics features and predict survival [22]. We applied a random forest machine-learning algorithm specifically designed to handle right-censored survival data (“random survival forest”) [30], with proven superiority in utilizing the full prognostic capability of radiomics data [28]. Decision tree growing, which is repeatedly performed in random forest training, resembles the decision-making that physicians may apply in clinical practice: with multiple variables present, the algorithm may first select the most prognostic one (e.g., HPV-status) to stratify cases. Thereafter, further variables (e.g., AJCC-staging, radiomics features) are incorporated in growing decision trees to sub-stratify patients and refine survival prognostication [30].
Moreover, radiomics-based stratification generated high-risk and low-risk groups with significantly different PFS and OS in HPV-associated OPSCC for the 3-, 4-, and 5-year follow-up endpoints (Figure 3). In comparison, AJCC 8th edition overall-, T-, and N-staging exhibited modest abilities in risk-stratification (Figure 3 and Figure S2), suggesting complementary value of radiomic features for OPSCC risk-stratification in addition to HPV-status and AJCC 8th edition staging.
It should be noted that the C-indices reported for AJCC models in our study were averaged across validation folds in repeated cross-validation analysis utilizing overall stage and T-/N-stage as prognostic variables, which is methodologically different from some prior studies [33]. Dissimilarities in analysis methodology, sample size, length of follow-up, and numbers of events may have contributed to the differences between AJCC model C-indices in our study and prior reports.
Despite using a multi-institutional cohort, the sample size and length of follow-up might not suffice for training radiomics models for long-term prognostication. Our study was also limited by its lack of fully independent validation in external cohorts and of adjustment for other OPSCC outcome predictors. Regional metastatic spread for segmentation was determined by expert read of PET/CT scans, but without tissue sampling from all nodes. HPV-status in The Cancer Imaging Archive (TCIA) cohorts was ascertained according to institutional standards, with a heterogeneous array of testing methods.
4. Materials and Methods
4.1. Data Acquisition
We retrospectively acquired clinical and imaging data from (1) Yale’s Smilow Hospital cancer registry from 2009–2019, and (2) two publicly available TCIA collections [36]: the “Head-Neck-PET-CT” collection from four Canadian institutions (“Canadian” cohort) [37] and the “Data from Head and Neck Cancer CT Atlas” collection from MD Anderson Cancer Center (“MD Anderson” cohort) [39]. Institutional review board approval was obtained from the Yale University ethics committee (IRB protocol #2000024295) and informed consent was waived, given the retrospective study design. TCIA datasets are de-identified and providing entities ensure ethical compliance.
Patients with (1) histopathologically confirmed OPSCC, (2) known HPV-status, (3) pre-treatment PET and non-contrast CT scans, and (4) complete follow-up information were included. We excluded patients (1) presenting with distant metastases upon initial staging, (2) receiving palliative therapy and/or declining treatment, (3) with recurrent OPSCC at presentation, (4) with CT artifacts affecting >50% of the primary gross tumor volume on visual evaluation [41], and (5) with uneventful follow-up <18 months. Biopsies or cytologic sampling prior to imaging were permissible.
Patients from Yale were regularly followed up for cancer surveillance with physical examinations, endoscopy, and imaging; additional tissue sampling was performed at oncologists’ discretion. Disease progression or recurrence was ascertained by biopsies or unequivocal imaging evidence; the latter was confirmed by additional tissue sampling or documented response to anticancer therapy. For TCIA cohorts, annotations provided within the datasets were utilized to determine study endpoints. HPV association was determined by high-risk HPV-specific [42] testing and/or p16-immunohistochemistry, and “Yale” test results were interpreted following the Guideline from the College of American Pathologists [42]. An overall HPV status is provided in TCIA for the “Canadian” cohort, reflecting institutional testing and interpretation, and a high-risk HPV in situ hybridization status was available for the “MD Anderson” dataset. PET/CT imaging and reconstruction were performed at the source institutions utilizing standard clinical protocols.
4.2. Lesion Segmentation
To facilitate radiomics feature extraction, we defined separate PET and CT VOI for primary tumors and individual metastatic cervical lymph nodes. Each lesion was manually contoured (“segmented”) on PET axial slices, and segmentations were transferred to the co-registered CT and adapted to exclude uninvolved bone, air, and preserved fat planes. CT axial images with streak artifacts affecting the VOI were excluded from analysis on the basis of visual assessment, and lymph nodes with artifacts in >50% of the VOI were entirely disregarded [41]. Segmentations were verified and adjusted by experienced neuroradiologists, who additionally performed cancer staging according to the 8th edition AJCC Manual [9]. We utilized 3D-Slicer version 4.10.1 for image review and VOI segmentation [43]. Figure 4 summarizes the segmentation and feature extraction pipeline.
4.3. Radiomics Feature Extraction
An automated image pre-processing pipeline facilitated homogenized radiomics analysis [19]. As detailed in the supplementary methods, we performed PET grey scale normalization, PET/CT voxel size homogenization, CT re-segmentation, generation of 10 derivative images per original scan, and grey scale discretization prior to radiomics feature extraction.
Subsequently, we extracted 1037 PET and 1037 CT radiomics features per primary tumor or lymph node: 18 first-order and 75 texture-matrix features from each lesion’s representation in the original and derived PET and CT images, and 14 volumetric shape features from the original series (Table S3 includes a comprehensive list of features). We customized a Pyradiomics version 2.1.2 pipeline for image pre-processing and feature extraction [45].
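The per-lesion feature count can be checked with simple arithmetic. The sketch below assumes that first-order and texture features were computed on 11 image versions per modality (the original plus the 10 derivative images) and that shape features were computed once; the study itself used Pyradiomics, and the Python names here are illustrative only.

```python
# Feature-count check; the counts come from the text, the names are illustrative.
FIRST_ORDER = 18   # first-order features per image version
TEXTURE = 75       # texture-matrix features per image version
SHAPE = 14         # volumetric shape features, original series only
N_DERIVED = 10     # derivative images generated per original scan


def features_per_lesion(first_order=FIRST_ORDER, texture=TEXTURE,
                        shape=SHAPE, n_derived=N_DERIVED):
    """Count features per lesion and modality under the stated assumption."""
    n_image_versions = 1 + n_derived  # assumption: original + derivatives
    return (first_order + texture) * n_image_versions + shape


print(features_per_lesion())  # 1037
```

Under this assumption the count reproduces the 1037 features reported per modality and lesion.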
Given the variable robustness of individual radiomics features to inter- and intra-observer segmentation inconsistencies [47], we determined feature stability, retaining only stable features for analysis; the methodology and results are reported in the supplementary methods and Table S4.
4.4. Survival Study Arms and Cohorts
Survival was defined as the time interval from OPSCC diagnosis to the first event in a study arm, with censoring applied at loss to follow-up. Events in the PFS study arm were defined as locoregional recurrence or progression, distant metastasis, or death from any cause, and events in the OS study arm were deaths from any cause. Patients with uneventful follow-up <18 months were excluded from each respective study arm. This approach allows training the prognostic algorithm on survival data with adequate event density in early follow-up, avoiding performance deterioration related to event sparsity. Survival analysis in each study arm was separately performed for the HPV-associated and HPV-negative study cohorts.
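The endpoint definitions above can be expressed as a small rule set. The following Python sketch is an illustration, not the study code (which was written in R): the `Patient` record, its field names, and the use of months as the time unit are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MIN_UNEVENTFUL_FOLLOW_UP = 18.0  # months, as stated in the text


@dataclass
class Patient:
    # Hypothetical record: times in months from diagnosis, None = did not occur.
    last_follow_up: float
    death: Optional[float] = None
    progression: Optional[float] = None  # recurrence, progression, or metastasis


def endpoint(patient: Patient, arm: str) -> Optional[Tuple[float, bool]]:
    """Return (time, event) for arm 'PFS' or 'OS', or None if excluded."""
    if arm == "OS":
        candidates = [patient.death]
    elif arm == "PFS":
        candidates = [patient.progression, patient.death]
    else:
        raise ValueError("arm must be 'PFS' or 'OS'")
    event_times = [t for t in candidates if t is not None]
    if event_times:
        return (min(event_times), True)   # first event in this arm
    if patient.last_follow_up < MIN_UNEVENTFUL_FOLLOW_UP:
        return None                       # uneventful follow-up < 18 months
    return (patient.last_follow_up, False)  # censored at last follow-up


p = Patient(last_follow_up=40.0, progression=10.0, death=30.0)
print(endpoint(p, "PFS"), endpoint(p, "OS"))
```

Because exclusion is evaluated per arm, a patient can be retained in the PFS arm (early progression) yet excluded from the OS arm (alive with short follow-up).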
4.5. Survival Modelling
For each study arm, we generated survival models using (1) clinical “AJCC” staging, i.e., overall stage, T-stage, and N-stage; (2) “radiomics” signatures; and (3) “combined” models using AJCC staging and radiomics signatures. Survival models were fitted on the combined dataset including all subjects and were evaluated in HPV-associated and HPV-negative study cohorts. AJCC features were concatenated with HPV status and were included as categorical variables, including overall stage with seven levels (I–IV and I–III in HPV-negative and HPV-associated cancers, respectively), and T- and N-stage with eight levels each (T1–T4 and N0–N3 in HPV-negative and HPV-associated cancers, respectively). Since patients with distant metastasis were excluded from analysis, no HPV-associated stage IV patients were included.
We compared several approaches to generate optimized radiomics signatures for radiomics and combined models. The prognostic performance of three feature dimensionality reduction techniques was compared with that of the full feature set (details in the supplementary methods; abbreviations in Figure 1). Feature sets were derived from two VOI sources of radiomics input: (a) the primary tumor lesion, and (b) consensus of the primary tumor and all metastatic cervical nodes (i.e., “virtual” consensus VOI as described by Yu et al. [53]). Feature sets from three imaging modalities (PET, CT, PET and CT) were utilized for model development. All 24 methodological combinations (4 feature-set options, i.e., 3 dimensionality reduction techniques plus the full feature set, × 2 VOI sources × 3 imaging modalities) were applied in each study arm.
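The 24 combinations form a full factorial grid. A minimal Python sketch of its enumeration follows; the reduction technique names are placeholders, since the actual techniques are described only in the supplementary methods.

```python
from itertools import product

# Placeholder labels: the three reduction techniques are not named in this
# section, so "reduction_A/B/C" are hypothetical stand-ins.
FEATURE_SETS = ["full", "reduction_A", "reduction_B", "reduction_C"]
VOI_SOURCES = ["primary_tumor", "consensus"]
MODALITIES = ["PET", "CT", "PET+CT"]

# One entry per methodological combination evaluated in a study arm.
combinations = list(product(FEATURE_SETS, VOI_SOURCES, MODALITIES))

print(len(combinations))  # 24
```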
R version 3.6.0 was utilized for statistical analysis [54]. Using the predictor sets (AJCC, radiomics and combined) described above, we trained and evaluated random survival forest (RSF) [30] models using the “ranger” package (version 0.12.1) [32] configured to grow 1000 trees per forest using a C-index split rule [31]. Other parameters were set according to default package recommendations.
4.6. Cross-Validation and Performance Evaluation of Survival Models
We devised a framework applying 33 repeats of threefold stratified cross-validation to assess prognostic model performance, with the event/nonevent groups, HPV-status groups, and time to event/censoring as strata. In each cross-validation iteration, consensus VOI generation (if applicable), radiomics feature standardization, dimensionality reduction, and RSF training were consecutively performed on the training folds, and RSF performance was evaluated in the validation fold. This strategy yields accurate estimates of models’ prognostic capability in new cohorts, as “information leakage” between folds is rigorously avoided.
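The cross-validation scheme can be sketched as follows. This Python illustration is not the study's R implementation: it stratifies on a single combined label (here event status and HPV-status; time to event could be binned into the same label), and feature standardization stands in for all pre-processing steps that must be fitted on training folds only. All function names are hypothetical.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(strata, n_splits=3, n_repeats=33, seed=0):
    """Yield (training, validation) index lists; strata[i] labels sample i."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for index, label in enumerate(strata):
        by_stratum[label].append(index)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        offset = 0  # rotates the starting fold so fold sizes stay balanced
        for members in by_stratum.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            for position, index in enumerate(shuffled):
                folds[(offset + position) % n_splits].append(index)
            offset += len(shuffled)
        for k in range(n_splits):
            validation = sorted(folds[k])
            training = sorted(i for j, fold in enumerate(folds)
                              if j != k for i in fold)
            yield training, validation


def fit_standardizer(training_values):
    """Learn mean and SD on training data only, to avoid information leakage."""
    mean = sum(training_values) / len(training_values)
    sd = (sum((v - mean) ** 2 for v in training_values)
          / len(training_values)) ** 0.5
    sd = sd or 1.0  # guard against constant features
    return lambda value: (value - mean) / sd


# Toy cohort: 24 patients, strata = (event, HPV-status), one feature each.
strata = [(event, hpv) for event in (0, 1) for hpv in (0, 1) for _ in range(6)]
feature = [float(i) for i in range(len(strata))]
splits = list(repeated_stratified_kfold(strata))
train_idx, val_idx = splits[0]
scale = fit_standardizer([feature[i] for i in train_idx])
val_scaled = [scale(feature[i]) for i in val_idx]  # validation data only transformed
print(len(splits))  # 99
```

With 33 repeats of three folds, the generator yields 99 training/validation pairs, matching the number of validation folds over which model scores are averaged.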
Harrell’s C-index [31] quantified model performance in validation folds, and each model’s score was averaged across all 99 cross-validation permutations. We selected the radiomics and combined models yielding the highest average C-index for each combination of study cohort (HPV-associated and HPV-negative) and study arm (PFS and OS) for further evaluation in those respective datasets.
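For readers unfamiliar with the metric, the following is a minimal Python sketch of Harrell's C-index for right-censored data. It is a simplified illustration, not the “Hmisc” implementation used in the study: a pair is treated as comparable only when the patient with the strictly shorter time had an event, ties in predicted risk count as half-concordant, and higher scores are assumed to mean higher risk.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the earlier event has higher risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(harrell_c_index(times, events, [0.9, 0.7, 0.4, 0.1]))  # 1.0
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which places the averaged C-indices reported in this study in context.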
A corrected paired t-test (“corrected repeated k-fold cv test” [57]) was applied to compare selected models’ C-index distribution across validation folds against random predictions (i.e., the same fitted models applied in validation folds with randomly resampled survival outcome).
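As an illustration of such a corrected test, the sketch below computes the variance-corrected t statistic in the form commonly attributed to Nadeau and Bengio and to Bouckaert and Frank. Whether the study's implementation matches this form exactly is an assumption, and the function name is hypothetical. The correction inflates the variance term by the test-to-training size ratio to account for overlapping training sets across folds.

```python
import math
from statistics import mean, variance


def corrected_repeated_kfold_t(differences, n_train, n_test):
    """Return (t statistic, degrees of freedom) for per-fold score differences.

    differences: one value per validation fold (k folds x r repeats), e.g.
    C-index of the model minus C-index of the random-prediction baseline.
    """
    m = len(differences)                  # k * r validation folds
    sample_var = variance(differences)    # unbiased sample variance
    if sample_var == 0:
        raise ValueError("zero variance across folds")
    t = mean(differences) / math.sqrt((1.0 / m + n_test / n_train) * sample_var)
    return t, m - 1


# Toy example with three folds; with threefold CV, n_test / n_train = 1 / 2.
t_stat, dof = corrected_repeated_kfold_t([0.1, 0.2, 0.3], n_train=2, n_test=1)
print(round(t_stat, 3), dof)
```

The resulting statistic is referred to a t-distribution with the returned degrees of freedom; it is always smaller in magnitude than the uncorrected paired t statistic, making the test more conservative.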
Uno’s estimator of cumulative/dynamic area under the curve (AUC) for right-censored survival data [59] was computed in each validation fold to track model performance throughout follow-up, and was averaged across 33 cross-validation repeats. The resulting time-dependent performance curves were plotted for 5 years of follow-up. The radiomics data of selected models were utilized in risk-stratification analysis.
We used the R “Hmisc” package [61] for C-index calculation, the “survAUC” package [62] to compute Uno’s AUC estimator, and the “geom_smooth” function implemented in “ggplot2” (version 3.2.1) [63] to apply “LOESS” smoothing [64] on performance curves.
4.7. Risk-Stratification and Kaplan–Meier Analysis
To investigate the potential of quantitative imaging for risk-stratification in HPV-associated and HPV-negative OPSCC, we utilized radiomics features for binary classification (high-risk vs. low-risk) and subsequently subjected cohorts to Kaplan–Meier analysis. For risk-stratification, we used random classification forest (RCF) models (“ranger” package version 0.12.1) [32] configured to grow 1000 trees per forest with the remaining parameters in default setting. The framework applying 33 repeats of threefold stratified cross-validation was adapted, with the event/nonevent groups as strata. Each patient’s RCF output (probability of experiencing an event) was averaged across validation folds to generate risk scores. A cutoff was selected by maximizing Youden’s index in receiver operating characteristic analysis, and patients with risk scores greater than the cutoff were assigned to the “radiomics” high-risk group.
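The cutoff selection can be illustrated with a short Python sketch. Youden's index is J = sensitivity + specificity − 1; the sketch below evaluates each observed risk score as a candidate cutoff and applies the rule "score greater than cutoff = high-risk". The function name and the choice of candidate cutoffs are assumptions for this example, not details taken from the study.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing Youden's index for the rule score > cutoff."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if y and s > cutoff)
        false_pos = sum(1 for s, y in zip(scores, labels) if not y and s > cutoff)
        # J = sensitivity + specificity - 1 = TPR - FPR
        j = true_pos / n_pos - false_pos / n_neg
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]   # averaged risk scores (toy values)
labels = [0, 0, 0, 1, 1, 1]               # 1 = event before the follow-up cutoff
cutoff, j = youden_cutoff(scores, labels)
high_risk = [s > cutoff for s in scores]
print(cutoff, j)  # 0.3 1.0
```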
All classified patients were subjected to Kaplan–Meier analysis with their radiomics risk group, and a log-rank test ascertained statistical significance defined as p < 0.05. For comparison, AJCC overall stage groups, T-stage, and N-stage were applied for risk-stratification.
The radiomics-only datasets of survival models selected for further evaluation were used as RCF input without feature reduction applied, and risk-stratification models were trained and evaluated separately in each study cohort and study arm. To label subjects for Kaplan–Meier analysis, cutoffs corresponding to 2, 3, 4, and 5 years of follow-up were used; patients experiencing events before a given cutoff were labeled as positive instances, subjects lost to follow-up before a cutoff were excluded, and all remaining patients were labeled negative and censored at the cutoff. RCF models were trained for each cutoff, and separate Kaplan–Meier plots were generated using radiomics risk groups and AJCC variables for risk-stratification. This approach allows supplying “dense” survival data to RCF algorithms (i.e., no censoring) while enabling accurate comparison with AJCC stratification.
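The labeling rule for a given follow-up cutoff can be sketched in Python as follows. The function name and return format are hypothetical, and the treatment of an event occurring exactly at the cutoff (counted here as not "before" the cutoff) is an assumption.

```python
def label_at_cutoff(time_years, event, cutoff_years):
    """Return (label, km_time, km_event), or None if the patient is excluded.

    label: 1 = event before the cutoff (positive), 0 = negative instance.
    km_time / km_event: values passed on to Kaplan-Meier analysis.
    """
    if time_years < cutoff_years:
        if event:
            return (1, time_years, True)   # event before the cutoff
        return None                        # lost to follow-up before the cutoff
    # Event-free at the cutoff: negative, administratively censored there.
    return (0, cutoff_years, False)


for cutoff in (2, 3, 4, 5):
    print(cutoff, label_at_cutoff(3.5, True, cutoff))
```

In the loop above, the same patient (event at 3.5 years) is a negative instance for the 2- and 3-year cutoffs and a positive instance for the 4- and 5-year cutoffs, which is why a separate classifier is trained per cutoff.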