The Effect of Magnetic Resonance Imaging-Based Radiomics Models in Discriminating Stage I–II and III–IVa Nasopharyngeal Carcinoma

Background: Nasopharyngeal carcinoma (NPC) is a common tumor in China. Accurate staging of NPC is crucial for treatment. We therefore aimed to develop radiomics models for discriminating early-stage (I–II) and advanced-stage (III–IVa) NPC based on MR images. Methods: In total, 329 NPC patients were enrolled and randomly divided into a training cohort (n = 229) and a validation cohort (n = 100). Features were extracted from axial contrast-enhanced T1-weighted images (CE-T1WI), T1WI, and T2-weighted images (T2WI). The least absolute shrinkage and selection operator (LASSO) was used to build radiomics signatures. Seven radiomics models were constructed with logistic regression. The AUC value was used to assess classification performance. The DeLong test was used to compare the AUCs of the different radiomics models and of visual assessment. Results: Models A, B, C, D, E, F, and G were constructed with 13, 9, 7, 9, 10, 7, and 6 features, respectively. All radiomics models showed better classification performance than visual assessment. Model A (CE-T1WI + T1WI + T2WI) showed the best classification performance (AUC: 0.847) in the training cohort. CE-T1WI was the most significant sequence for staging NPC. Conclusion: Radiomics models can effectively distinguish early-stage from advanced-stage NPC patients, and Model A (CE-T1WI + T1WI + T2WI) showed the best classification performance.


Introduction
Nasopharyngeal carcinoma (NPC) is a common cancer of the head and neck with an endemic distribution, especially in southeastern China [1]. Among people who have emigrated from this region to other countries, the incidence of NPC decreases but remains higher than that of the native population. Thus, although the pathogenesis remains unknown, it may be related to a combination of genetic, ethnic, and environmental factors [2].
Fortunately, NPC is highly sensitive to radiotherapy [3]. With the application of treatment paradigms involving intensity-modulated radiotherapy, the five-year incidence of local recurrence and distant metastasis has significantly decreased [2,4,5]. Treatment options are closely related to the clinical stage of NPC [2,3]; therefore, accurate clinical staging of NPC is crucial for treatment. At present, the staging of NPC mainly depends on the Union for International Cancer Control/American Joint Committee on Cancer (UICC/AJCC) tumor-node-metastasis (TNM) staging system [2]. Magnetic resonance imaging (MRI) is widely used in diagnosing diseases in various body organs [6]. With better visibility than other existing imaging methods, it is considered the optimal approach for staging [7–11]. Clinical stages of NPC are assessed by radiologists based on MR images and clinical data according to the TNM staging system. However, for patients at the same TNM stage, local recurrence and distant metastasis still occur in some of them under current staging methods, even when they are treated with the same regimens.

Patient Restaging and Human Visual Assessment
All MR images and clinical data were separately reviewed by 2 experienced radiologists (with 20 years and 30 years of head and neck radiology experience, respectively). They re-evaluated the clinical stages into early-stage (I–II) and advanced-stage (III–IVa) according to the eighth edition of the UICC/AJCC TNM staging system [2]. Any differences were resolved through consensus.
In addition, another 2 radiologists (reader 1 and reader 2 with 5 years and 6 years of experience in head and neck MRI, respectively) were recruited to separately stage NPC based on MR images, and were all blinded to the patients' clinical data. They then worked together to resolve differences by consensus.

Image Segmentation and Feature Extraction
MR images were all anonymously retrieved from the picture archiving and communication system (PACS). Image segmentation was performed by reader 1. The three-dimensional volume of interest containing the whole primary tumor was obtained by stacking up the regions of interest (ROIs). These were manually delineated slice by slice around the outermost boundary of the tumor on axial sequences (T1WI, T2WI, and CE-T1WI separately) using 3D Slicer (version 4.10.2; http://www.slicer.org, accessed on 17 May 2019). To ensure the segmentation contained only tumor tissue, the ROI was contracted inward by 3 mm using automated dilation and shrinkage operations.
Feature extraction was performed with the open-source Pyradiomics package (version 3.0.1; https://pyradiomics.readthedocs.io/en/latest/changes.html#pyradiomics-3-0-1, accessed on 3 June 2021). To standardize the voxel spacing, images were resampled to a voxel size of 1 × 1 × 1 mm³. After that, seven classes of radiomics features were extracted from the original images, including shape, first-order, gray-level co-occurrence matrix (GLCM), gray-level dependence matrix (GLDM), gray-level run length matrix (GLRLM), gray-level size zone matrix (GLSZM), and neighboring gray tone difference matrix (NGTDM) features. The same first-order and textural features were then extracted after applying wavelet filtering (with 3 directions of wavelet decomposition: x, y, z) and Laplacian of Gaussian (LoG) filtering (with sigma values of 0.5 mm, 1.0 mm, 1.5 mm, and 2.0 mm) to the original images, respectively. Ultimately, 3669 features were extracted (1223 from each sequence). The detailed radiomics features are listed in Supplementary Materials File A.
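The extraction settings described above can be captured in a Pyradiomics parameter file. The following is an illustrative sketch, not the study's actual configuration; key names follow the Pyradiomics documentation, and values not stated in the text (e.g., bin width) are omitted rather than guessed:

```yaml
# Illustrative Pyradiomics parameter file reflecting the settings described
# in the text (resampling, filters, feature classes). Not the study's file.
imageType:
  Original: {}
  Wavelet: {}                       # 3 decomposition directions: x, y, z
  LoG:
    sigma: [0.5, 1.0, 1.5, 2.0]     # sigma values in mm
featureClass:
  shape:
  firstorder:
  glcm:
  gldm:
  glrlm:
  glszm:
  ngtdm:
setting:
  resampledPixelSpacing: [1, 1, 1]  # isotropic 1 mm voxels
```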

Interobserver and Intraobserver Agreement
Forty patients (20 early-stage and 20 advanced-stage) were randomly chosen for repeated tumor ROI segmentations, which were performed by reader 1 and reader 2 to explore interobserver stability. The same procedure was repeated by reader 1 after a 2-week interval to evaluate intraobserver reproducibility. The intraclass correlation coefficient (ICC) was used to evaluate intraobserver and interobserver agreement, with ICC > 0.75 indicating satisfactory agreement. Only features with both intra- and interobserver ICC > 0.75 were chosen for further analysis and were standardized with z-score normalization.
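The agreement filter described above can be sketched as follows. The text does not specify which ICC form was used, so this illustration assumes a two-way, single-measure consistency ICC(3,1); the 0.75 threshold is from the text.

```python
import numpy as np

def icc_3_1(ratings):
    """Two-way mixed, single-measure, consistency ICC(3,1).

    ratings: array of shape (n_subjects, n_raters) holding one radiomics
    feature's values across repeated segmentations.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    ss_subjects = k * ((ratings.mean(axis=1) - grand) ** 2).sum()
    ss_raters = n * ((ratings.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_error = ss_total - ss_subjects - ss_raters       # residual sum of squares
    ms_subjects = ss_subjects / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)

def feature_is_stable(intra, inter, threshold=0.75):
    """Keep a feature only if both intra- and interobserver ICC exceed 0.75."""
    return icc_3_1(intra) > threshold and icc_3_1(inter) > threshold
```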

Dimensionality Reduction and Radiomics Feature Selection
To reduce potential overfitting of the radiomics features and avoid the curse of dimensionality when modeling, two steps were applied to select radiomics features in the training cohort. First, the independent samples t test or the Mann-Whitney U test was used to select potentially important features. Second, features with p < 0.05 from the first step were kept and input to the least absolute shrinkage and selection operator (LASSO) classifier, with penalty parameter tuning conducted by 10-fold cross-validation, and features with non-zero coefficients were selected to build radiomics signatures [31,32]. The radiomics score (Rad-score) for each patient was calculated using a linear combination of selected features that were weighted by their respective LASSO coefficients.
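The two-step selection and Rad-score computation can be sketched with SciPy and scikit-learn on synthetic stand-in data (the study's feature matrix is not available here); `LassoCV` performs the 10-fold cross-validated penalty tuning described above.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LassoCV

# Synthetic stand-in for a z-scored radiomics feature matrix.
rng = np.random.default_rng(42)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Step 1: univariate screening (Mann-Whitney U here); keep features with p < 0.05.
keep = [j for j in range(p)
        if mannwhitneyu(X[y == 0, j], X[y == 1, j],
                        alternative='two-sided').pvalue < 0.05]

# Step 2: LASSO with 10-fold cross-validated penalty; keep non-zero coefficients.
lasso = LassoCV(cv=10).fit(X[:, keep], y)
selected = [keep[j] for j, c in enumerate(lasso.coef_) if c != 0]

# Rad-score: linear combination of the retained features weighted by their
# LASSO coefficients (plus the intercept).
rad_score = lasso.intercept_ + X[:, keep] @ lasso.coef_
```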

Construction of the Radiomics Model
Logistic regression, a classical machine learning method, was used to construct seven radiomics models (named A, B, C, D, E, F, and G) for staging NPC patients. Models A, B, C, and D were built with radiomics signatures selected from combined sequences (CE-T1WI + T1WI + T2WI, CE-T1WI + T1WI, T1WI + T2WI, and CE-T1WI + T2WI, respectively). Models E, F, and G were built with radiomics signatures selected from single sequences (CE-T1WI, T1WI, and T2WI, respectively).
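As a minimal sketch of the modeling step, a logistic regression classifier can be fitted on a model's signature matrix and scored by AUC. The data below are synthetic stand-ins mirroring the cohort sizes (229/100) and Model A's 13 features, not the study's data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 329 patients, 13 signature features (as in Model A).
rng = np.random.default_rng(7)
X = rng.normal(size=(329, 13))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.8, size=329) > 0).astype(int)

# Split into training (n = 229) and validation (n = 100) cohorts.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=100, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc_tr = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
auc_va = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
```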
Accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC) values were used to evaluate the models' performance. ROC curves were drawn to display and compare the performance of different models. The DeLong test was used to analyze significant differences between models. The workflow of model construction and evaluation is shown in Figure 2.
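The DeLong comparison can be implemented directly. The sketch below follows the standard paired-DeLong construction (placement values and their covariance) and is our illustration, not the study's code:

```python
import numpy as np
from scipy.stats import norm

def delong_test(y, s1, s2):
    """Paired DeLong test for the difference between two correlated AUCs.

    y: binary labels; s1, s2: scores from two models on the same cases.
    Returns (auc1, auc2, two-sided p value).
    """
    pos1, neg1 = s1[y == 1], s1[y == 0]
    pos2, neg2 = s2[y == 1], s2[y == 0]
    m, n = len(pos1), len(neg1)

    def placements(pos, neg):
        # V10[i]: fraction of negatives ranked below positive i (ties count 0.5)
        v10 = np.array([(np.sum(p > neg) + 0.5 * np.sum(p == neg)) / n for p in pos])
        v01 = np.array([(np.sum(pos > q) + 0.5 * np.sum(pos == q)) / m for q in neg])
        return v10, v01

    v10_1, v01_1 = placements(pos1, neg1)
    v10_2, v01_2 = placements(pos2, neg2)
    auc1, auc2 = v10_1.mean(), v10_2.mean()
    s10 = np.cov(np.vstack([v10_1, v10_2]))   # covariance over positives
    s01 = np.cov(np.vstack([v01_1, v01_2]))   # covariance over negatives
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n
    z = (auc1 - auc2) / np.sqrt(var)
    return auc1, auc2, 2 * norm.sf(abs(z))
```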
Statistical analysis of demographic features was performed with IBM SPSS software (version 21.0). Continuous variables were compared using Mann-Whitney U tests, and categorical variables were compared using chi-square tests. For all tests, a two-sided p value less than 0.05 was considered statistically significant.
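An open-source equivalent of these SPSS comparisons can be run with SciPy; the numbers below are hypothetical illustrations, not the study's data:

```python
import numpy as np
from scipy.stats import mannwhitneyu, chi2_contingency

# Continuous variable (e.g., age): Mann-Whitney U test between cohorts.
rng = np.random.default_rng(3)
age_train = rng.normal(50, 10, 229)   # hypothetical ages, training cohort
age_valid = rng.normal(50, 10, 100)   # hypothetical ages, validation cohort
p_age = mannwhitneyu(age_train, age_valid, alternative='two-sided').pvalue

# Categorical variable (e.g., gender): chi-square on a 2x2 contingency table.
table = np.array([[160, 69],          # hypothetical male/female counts, training
                  [70, 30]])          # hypothetical counts, validation
chi2, p_gender, dof, expected = chi2_contingency(table)

# A two-sided p >= 0.05 indicates no significant difference between cohorts.
```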

Patient Characteristics
Patients' demographic information is recorded in Table 1. No significant differences were observed between the training and validation cohorts in terms of age, gender, smoking, drinking, T stage, N stage, or clinical stage.

Interobserver and Intraobserver Agreement
Only 407 features from CE-T1WI, 390 features from T1WI, and 338 features from T2WI, whose ICC scores were all greater than 0.75, were selected. In total, 1135 radiomics features were selected for the following analysis, listed in Supplementary Materials File B.

Dimensionality Reduction and Radiomics Feature Selection
All 1135 radiomics features showed significant differences (p < 0.05) when tested by independent samples t tests or Mann-Whitney U tests and were used for LASSO regression. This was followed by the selection of 13, 9, 7, and 9 features derived from combined sequences (CE-T1WI + T1WI + T2WI, CE-T1WI + T1WI, T1WI + T2WI, and CE-T1WI + T2WI) to construct Models A, B, C, and D, respectively. Ten, seven, and six features derived from single sequences (CE-T1WI, T1WI, and T2WI) were selected to construct Models E, F, and G, respectively. The selected features for each model and the calculation formulas for the Rad-scores are shown in Table 2.

Performance of Different Models and Radiologists
Median values and interquartile ranges of the Rad-scores in the training and validation cohorts are listed in Table 3. All models showed potential for differentiating stage I–II from stage III–IVa in both the training (all p < 0.001) and validation (all p < 0.001) cohorts, with Rad-scores in the advanced-stage group being markedly higher. The Rad-scores of the seven models for each patient in the training and validation cohorts, regarding the classification of stage I–II and stage III–IVa NPC, are depicted in Figure 3.
The classification performance of the readers and the seven radiomics models is listed in Table 4. The ROC curves of the different radiomics models and readers are shown in Figure 4. When comparing human visual assessment with all radiomics models using the DeLong test, the readers (with an AUC, accuracy, specificity, and sensitivity of 0.721, 0.716, 0.738, and 0.703, respectively) showed the worst performance. When comparing Model A with the other radiomics models using the DeLong test, Model A (with an AUC, accuracy, specificity, and sensitivity of 0.847, 0.729, 0.571, and 0.820, respectively) showed the best performance in the training cohort. However, there were no significant differences in the validation cohort, nor when comparing the other models in pairs.
Among the models constructed with features derived from combined sequences, although Model A showed the best classification performance, there was no significant difference between Models A and D (CE-T1WI + T2WI; AUC, accuracy, specificity, and sensitivity in the training cohort of 0.826, 0.751, 0.679, and 0.793, respectively) in the training cohort. Model C (T1WI + T2WI) showed the worst classification performance, with the lowest AUC value of 0.812 in the training cohort (accuracy, specificity, and sensitivity of 0.703, 0.595, and 0.766, respectively).
Among the models constructed with features derived from a single sequence, Model E (CE-T1WI) showed the best classification performance in the training cohort (AUC, accuracy, specificity, and sensitivity of 0.839, 0.764, 0.667, and 0.821, respectively), ranking second only to Model A. However, there were no significant differences between Models A and E in either the training cohort or the validation cohort. Model G (T2WI) showed the worst classification performance in the training cohort (with an AUC, accuracy, specificity, and sensitivity of 0.791, 0.716, 0.631, and 0.766, respectively).

Discussion
In this retrospective study, the median Rad-scores in advanced-stage NPC were all higher than those in early-stage NPC, indicating the great potential of radiomics features for differentiating early-stage from advanced-stage NPC. Although the underlying reason is unknown, it may be that every model included two or more features related to lesion shape, and most advanced-stage lesions were larger than early-stage ones, which would drive the Rad-scores of advanced-stage patients higher.
As expected, all radiomics models showed better classification performance than visual assessment by the less experienced radiologists, with Model A (CE-T1WI + T1WI + T2WI) performing best (AUC: 0.847 in the training cohort and 0.824 in the validation cohort). This may be due to the complicated anatomy of the head and neck, ill-defined involvement of surrounding tissues, and ambiguous lymph node metastasis, which make it difficult for young radiologists to accurately identify clinical stages. However, there were no significant differences in the validation cohort when comparing the classification performance of visual assessment with that of the radiomics models. Adjacent invaded tissues and metastatic neck lymph nodes (LNs) are highly relevant to the TNM stage of NPC patients, so MR images of these regions contain much information related to clinical stage. In our study, however, the ROIs did not cover the adjacent invaded tissues or metastatic neck LNs; omitting this information may explain the lack of significant differences in the validation cohort. Nevertheless, given that most AUC values of the radiomics models were higher than that of visual assessment, the radiomics models can still be considered to have shown better classification performance.
By extracting quantitative parameters from MR images, machine learning classifiers can minimize the influence of radiologists' differing experience and accurately characterize intratumoral heterogeneity [13]. However, no significant differences were observed between any of the models in the validation cohort, which may be due to its small sample size and the imbalance between stages (the number of advanced-stage patients was almost twice that of early-stage patients). As shown in Table 2, many identical features were selected across models, especially for Models A, B, D, and E; this overlap likely explains the absence of significant differences between these models. Features related to tumor shape were selected in all models, presumably because most advanced-stage lesions were larger than early-stage ones, and they were considered the most significant features for clinical staging.
Furthermore, CE-T1WI was considered the most significant sequence for staging NPC. Model E (CE-T1WI) showed the best classification performance among the models constructed from a single sequence, ranking second only to Model A (with no significant difference between them). This may be due to the contrast material concentrated in CE-T1WI, which reveals the blood supply of the tumor, so that CE-T1WI can provide more information for the clinical staging of NPC. However, features derived from T1WI and T2WI should not be ignored, since they helped Model A achieve the best classification performance. We attribute this to different MR sequences reflecting different characteristics of the tumor tissue, so that radiomics can extract complementary features from different sequences, which played an important role in accurate staging [6].
One study reached conclusions similar to ours: CE-T1WI showed better classification performance than T2WI, and signatures from CE-T1WI + T2WI showed the best performance (AUC: 0.850 in the training cohort and 0.849 in the validation cohort) [33]. In our study, however, the classification performance of CE-T1WI + T2WI showed no significant difference from that of CE-T1WI, which may be owing to the different subjects and smaller sample size (127 head and neck squamous cell carcinoma patients) of that study.
To our knowledge, our study is the first to differentiate early-stage NPC patients from advanced-stage NPC patients using radiomics. A previous study established a weakly supervised deep learning network, trained on 1138 cases of images (T1WI, T2WI, and CE-T1WI), which achieved good performance in automated T staging of NPC (average AUC across T stages: 0.943) and showed no significant differences in progression-free survival or overall survival compared with the TNM staging system [30]. Although deep learning showed great potential in automatic T staging, an accurate clinical stage of NPC was still not achieved because the N stage was unknown in that study. Moreover, large numbers of MR images are usually needed to train deep learning models, which makes such models difficult to replicate and verify.
Another previous study established five convolutional neural network models combined with transfer learning to differentiate the advanced stages (III and IV) of NPC. Patches of CE-T1WI and T2WI images containing tumors, metastatic lymph nodes, and their adjacent tissues were input for training, and the predicted stages were obtained by voting; the combined model showed better classification performance (accuracy: 0.81) than the TNM staging system and a traditional radiomics model under the same experimental conditions [29]. However, only approximately 200 patients were enrolled in that study, which may have compromised the model's accuracy, and early-stage patients were not included. In addition, deep learning is still considered a black-box technique and should be interpretable for critical applications. In our study, features were directly extracted from primary tumors to predict the overall clinical stage, which reduced the influence of interobserver differences and of the unknown N stage on accurate staging. The model was a classical machine learning method with explicit algorithms and achieved good performance (AUC: 0.847) in differentiating early-stage from advanced-stage NPC, highlighting the great potential of radiomics in predicting the clinical stage of NPC.
Although good performance was achieved in our study, there were still inaccurately staged cases. Possible reasons are as follows: (1) Sample sizes were imbalanced between the two groups. (2) Involvement of the parapharyngeal space has been shown to be related to the T stage and prognosis of NPC [34]; because only tumor tissue was segmented, excluding the related tissues and metastatic lymph nodes may have decreased accuracy. (3) This study directly predicted clinical stages instead of T and N stages, which may ignore the specific sites of invasion and lead to inaccuracy. (4) Images in the coronal and sagittal planes were not included in model construction, which may lead to the loss of some imaging information.
There are some limitations in this study: (1) There was no external cohort to verify these radiomics models. (2) Only a two-class staging framework was evaluated; the ability of radiomics to differentiate more detailed clinical stages was not assessed. (3) A slice thickness of 5 mm may miss minor invasion and lead to inaccurate staging. (4) Only patients admitted from 2013 to 2016 were enrolled, because this was the first part of our research; a 5-year follow-up is needed for the remainder. We plan to enlarge and balance the sample sizes of each stage to further investigate the ability of radiomics to predict prognosis and differentiate more detailed clinical stages of NPC.

Conclusions
In conclusion, radiomics models showed great potential in distinguishing early-stage (I–II) from advanced-stage (III–IVa) NPC patients, and Model A (CE-T1WI + T1WI + T2WI) performed the best. Furthermore, CE-T1WI was the most significant sequence for staging NPC. However, imbalanced sample sizes, the omission of adjacent tissues and LNs, and the use of a single machine learning classifier may all have led to inaccurately staged cases, and may also explain the absence of significant differences between the models and visual assessment in the validation cohort. We therefore plan to enlarge and balance our sample, extend the ROI to include adjacent tissues and LNs, and compare the classification performance of different machine learning classifiers and deep learning methods, to find an accurate staging method for NPC.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/diagnostics13020300/s1, Supplement A: All detailed original radiomics features. Supplement B: All radiomics features with ICC > 0.75 for both intra- and interobserver agreement.
Author Contributions: Q.L., T.L. and J.P. designed the study. Q.L., B.G., Y.N. and X.C. collected and assembled all data. Q.L., J.G. and Q.Y. performed data analysis. Q.L. wrote the manuscript. T.L., J.P. and F.L. revised the manuscript. All authors contributed to the article and approved the final submitted version. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: The study was approved by the Human Ethics Committee of the first affiliated hospital of Chongqing Medical University (No.: 2020080, 14 January 2020).
Informed Consent Statement: Patient informed consent was waived owing to the retrospective nature of the study.

Data Availability Statement:
The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.