Application of Machine Learning Methods to Improve the Performance of Ultrasound in Head and Neck Oncology: A Literature Review

Simple Summary

Ultrasound (US) is a non-invasive imaging method that is routinely utilized in head and neck cancer patients to assess the anatomic extent of tumors and of nodal and non-nodal neck masses, and for imaging the salivary glands. In this review, we summarize the present evidence on whether the application of machine learning (ML) methods can improve the performance of US in head and neck cancer patients. We found that the published clinical literature on ML methods applied to US datasets was limited but showed evidence of improved diagnostic and prognostic performance. However, a majority of these studies were retrospective and conducted at a single center with a limited number of datasets. Multi-center studies could help better validate the performance of ML-based US radiomics and facilitate the integration of these approaches into routine clinical practice.

Abstract

Radiomics is a rapidly growing area of research within radiology that involves the extraction and modeling of high-dimensional quantitative imaging features using machine learning/artificial intelligence (ML/AI) methods. In this review, we describe the published clinical evidence on the application of ML methods to improve the performance of ultrasound (US) in head and neck oncology. A systematic search of electronic databases (MEDLINE, PubMed, clinicaltrials.gov) was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Of 15,080 articles initially identified, 34 studies were selected for in-depth analysis. Twenty-five of the 34 studies (74%) focused on diagnostic applications of US radiomics, while 6 (18%) focused on response assessment and 3 (8%) utilized US radiomics for modeling normal tissue toxicity. Support vector machine (SVM) was the most commonly employed ML method (47%), followed by multivariate logistic regression (24%) and k-nearest neighbor analysis (21%). Only 11/34 (~32%) of the studies included an independent validation set. A majority of studies were retrospective in nature (76%) and based on single-center evaluation (85%), with variable numbers of patients (12–1609) and imaging datasets (32–1624). Despite these limitations, the application of ML methods resulted in improved diagnostic and prognostic performance of US, highlighting the potential clinical utility of this approach.


Introduction
The term 'head and neck cancer' is used to describe a heterogeneous group of neoplasms that can arise in multiple sites within this anatomic region including the mucosal epithelium of the oral and nasal cavities, larynx, pharynx, thyroid, and salivary glands [1]. Ultrasound (US) is a noninvasive imaging modality that is routinely employed in head and neck cancer patients to determine the anatomic extent and vascularity of tumors and nodal masses [2][3][4]. US methods are also utilized for diagnostic evaluation of suspicious thyroid nodules [5], and the differential diagnosis of salivary gland neoplasms [6]. These applications rely on the ability of US to detect changes in size, shape, margins, structure (e.g., echogenicity, presence of calcifications), and vascularity (blood flow and velocity). Although useful, these imaging characteristics measured from US images are limited in number and to a large extent, ignore the wealth of information that is captured within the individual pixels (2D) or voxels (3D) that make up the image.
An alternative approach that overcomes these limitations is "radiomics" and refers to the extraction and modeling of multi-dimensional quantitative descriptors (textural features) from imaging datasets [7]. In radiomics, textural features are calculated based on the shape, distribution of voxel intensity, and spatial relationships between neighboring voxels and subsequently modeled through statistical algorithms ("machine learning"). Given this ability of radiomics to extract quantitative, high-throughput, multi-dimensional information, it is assumed that such radiomic "phenotypes" may better capture the heterogeneity within the tumor microenvironment and as a consequence, exhibit improved performance for diagnostic and prognostic applications compared to traditional radiologic or clinical assessment criteria used in head and neck oncology [8].
In this review, we summarize the published evidence on the clinical application of US radiomics in head and neck oncology. We describe the radiomic approaches employed, the quality of the evidence presented, and the performance of these methods for the intended application. The translational barriers and potential solutions for the successful integration of radiomics into routine clinical practice in head and neck oncology are discussed.

Literature Search Strategy
A review of the literature was performed following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [9]. A database search was conducted in MEDLINE/PubMed (National Center for Biotechnology Information, NCBI) and clinicaltrials.gov, ending on 31 March 2021. The databases were searched with the following terms: ("head and neck cancer" OR "oral cancer" OR "tongue cancer" OR "thyroid cancer" OR "lymph node" OR "salivary gland" OR "parotid gland") AND "ultrasound" AND ("radiomics" OR "texture analysis").

Study Screening and Selection Criteria
The principal inclusion criteria were studies utilizing radiomics or textural analysis to analyze US datasets in head and neck cancer patients. Conference abstracts, reviews, and articles not focused on head and neck cancer and studies that did not employ US were excluded. Duplicate articles were removed. Articles not published in the last 10 years, not written in English, or focused on the application of radiomic methods to non-imaging data (e.g., genomic or histopathologic data) were also removed. Finally, studies that utilized deep learning techniques, such as neural networks, were excluded from our analysis.

Data Extraction and Reporting
All relevant data from the selected articles were extracted, stratified, and tabulated in an Excel database (Microsoft Corporation, Redmond, WA, USA). Graphical display of the data was performed using GraphPad Prism version 9.0.0 for Windows (GraphPad Software, San Diego, CA, USA).

Radiomics Quality Score (RQS)
The RQS is based on 16 criteria that encompass multiple domains for the evaluation of the quality of radiomic studies [10]. Briefly, these criteria include image protocol quality, segmentation, feature selection, clinical or biological correlates, statistical methods and performance metrics, validation, comparison to the gold standard, cost-effectiveness, and open science. The RQS uses a point-based system to award or penalize points depending on the fulfillment of the criteria, with a maximum score of 36 (100%). Training, validation, and test sets were defined as follows: (1) a training dataset is a sample of data used to fit the model, (2) a validation set is a cohort that provides an unbiased evaluation of a model fit on the training set, and (3) a test set is an external or independent sample of data on which performance should be reported only once the algorithm has been trained and validated [11,12].
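The three-way data partition defined above can be sketched with scikit-learn. This is an illustrative example on synthetic "radiomic feature" data, not code from any reviewed study; the cohort size and split fractions are arbitrary.

```python
# Training / validation / test split on synthetic radiomic feature data.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))    # 200 patients, 10 radiomic features
y = rng.integers(0, 2, size=200)  # benign (0) vs. malignant (1) labels

# First carve out the held-out test set, then split the remainder
# into training and validation cohorts.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 120 40 40
```

The model is fit on the training cohort, tuned against the validation cohort, and scored exactly once on the untouched test cohort.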

Search Results
The initial search of the electronic databases retrieved 15,080 candidate articles that were screened for eligibility and subsequent evaluation. Application of our pre-defined criteria resulted in the exclusion of 10,039 articles. The remaining 5041 articles underwent a comprehensive assessment based on a full-text examination to confirm eligibility. Following a second screen, an additional 5007 articles were excluded based on the imaging modality studied, lack of radiomic analysis of image datasets, cancer sites, molecular or genomic studies, and the use of deep learning methods. This resulted in a final total of 34 articles that were included for in-depth assessment (Figure 1). Figure 2A summarizes the three published clinical applications of US radiomics in head and neck cancer patients. A majority of the studies (n = 25/34; 74%) were focused on evaluating the potential of US radiomics for disease diagnosis (Figure 2B). Of these studies, 56% examined the utility of US radiomics to distinguish benign and malignant thyroid nodules (n = 14/25) and 44% were focused on the identification of metastatic lymph nodes (n = 11/25). The second clinical application of US radiomics was to prognosticate response to standard-of-care chemotherapy or radiation (n = 6/34, 18%). A relatively small number of studies (n = 3/34, 8%) examined the ability of US radiomics to assess toxicity, specifically, to predict xerostomia (severe dry mouth) following radiation injury to salivary glands in head and neck cancer patients.

Characteristics of US Radiomics Studies in Head and Neck Cancer
The data presented in Figure 3 summarize the design characteristics and head and neck cancer sites of US radiomics studies reported in the literature. A majority of the studies were based on retrospective analysis (n = 26, 76%; Figure 3A) and conducted at a single center (n = 29, 85%; Figure 3B). In clinical practice, US is predominantly used for the evaluation and staging of superficial structures in the head and neck region. Therefore, not surprisingly, a large proportion of published studies were focused on the examination of the thyroid (n = 15, 44%) or lymph nodes (n = 13, 38%) (Figure 3C). A relatively small number of studies (<10%) utilized US radiomics to study the salivary glands and primary tumors in the pharynx and larynx. The number of patients evaluated in these studies varied widely from 12 to 1609 (median value 137), with a majority of studies reporting findings based on analysis of data from over 100 patients (Figure 3D).

Imaging Datasets in US Radiomics Studies
It is well recognized that the size of the imaging dataset used for training (i.e., the number of datasets used to develop the model) critically influences the performance of ML models. Subsequent examination of the performance of a developed ML model using an independent validation dataset (a cohort that provides an unbiased evaluation of the model) is also essential to assess its true utility. Finally, a test set is an external or independent sample of data that is used to report the performance of the algorithm once it has been trained and validated [11,12]. We, therefore, examined the number of imaging datasets employed for training and validation in published US radiomics studies (Figure 4A). This analysis revealed a considerable degree of variability in the number of images used in the training datasets (range 32–1299) for developing the ML models. Notably, less than half of the studies included an independent validation dataset (n = 11, 32%). The majority of studies (Figure 4B) either did not include a validation method to evaluate the performance of their model (n = 12, 35%) or relied on cross-validation in their analyses (n = 11, 32%). Only a single study employed both a validation set and an independent test set. In studies that utilized a validation set, the number of images in the validation dataset was considerably lower than in the training dataset (range 0–368). An asterisk (*) in front of a study name in Figure 4A indicates that the authors used cross-validation as a means of validating their training model.
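The cross-validation strategy that roughly a third of the reviewed studies relied on in place of an independent validation set can be sketched as follows; the data, classifier, and fold count here are illustrative, not taken from any reviewed study.

```python
# k-fold cross-validation on synthetic texture-feature data: each image
# is scored by a model that never saw it during training.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 8))  # 120 images, 8 texture features
# Synthetic label driven mainly by the first feature, plus noise.
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

# 5-fold CV: the 120 images are split into 5 folds; each fold is held
# out once while the SVM trains on the other 4.
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Cross-validation estimates generalization from the training cohort alone; it does not replace evaluation on an external test set.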





Machine Learning Methods Employed in US Radiomics Studies
Next, we examined the ML methods employed in published US radiomics studies independent of the clinical application or the outcome measures studied (Figure 5). This analysis revealed that several studies employed more than one ML method in their analyses. The most commonly employed ML technique for classification was the support vector machine (SVM), followed by multivariate logistic regression and k-nearest neighbor (kNN) analysis.
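The three classifier families named above can be compared side by side. The sketch below fits each to the same synthetic feature matrix; the data, hyperparameters, and accuracies are illustrative only and do not reproduce any reviewed study.

```python
# SVM, multivariate logistic regression, and kNN fitted to the same
# synthetic radiomic feature matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = (X[:, :2].sum(axis=1) > 0).astype(int)  # signal in first two features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "SVM": SVC(),                                 # RBF-kernel SVM
    "LogReg": LogisticRegression(),               # multivariate logistic regression
    "kNN": KNeighborsClassifier(n_neighbors=5),   # k-nearest neighbor
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: accuracy = {acc:.2f}")
```

In practice, the best-performing family depends on the feature set and cohort size, which is why several reviewed studies benchmarked more than one method.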

Reporting Completeness and Methodologic Quality Appraisal
We assessed the reporting completeness and the quality of the methodological aspects of model development and validation using the RQS [10] (Supplementary Table S1). Nearly all of the publications included an adequate description of the imaging protocol (at minimum, the type of transducer and the central frequency). Additionally, most of the studies were awarded points for including discrimination statistics, model performance metrics, and feature reduction methods. However, over half of the studies (n = 21, 62%) lacked a robust description of segmentation methods. Additionally, many of the studies did not include multivariate analysis of radiomic features with non-radiomic features, such as clinical or histologic data (n = 19, 56%). Studies were not penalized for using cross-validation to corroborate their training set, but 5 points were deducted if there was no validation method at all (n = 12, 35%). Surprisingly, 94% of the studies (n = 32) did not provide or use open-source code or software.
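The RQS tally described above is a simple point sum against a 36-point maximum. The toy sketch below illustrates the bookkeeping only; the criterion names and point values shown are illustrative placeholders, not the official RQS weights.

```python
# Toy RQS-style tally: points per fulfilled criterion as a percentage
# of the 36-point maximum. Criterion names and point values are
# illustrative, not the official RQS assignments.
points = {
    "image_protocol_documented": 1,
    "multiple_segmentations": 1,
    "feature_reduction": 3,
    "internal_validation": 2,       # negative points if no validation at all
    "comparison_to_gold_standard": 2,
    "open_science": 0,              # most reviewed studies scored 0 here
}
rqs = sum(points.values())
print(f"RQS: {rqs}/36 ({100 * rqs / 36:.0f}%)")
```

A study fulfilling only these hypothetical items would score well under 50%, mirroring the scores observed across the reviewed literature.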



Performance of US Radiomic Models for Diagnostic Applications in Head and Neck Oncology
As stated earlier, the principal diagnostic applications of US radiomics in head and neck oncology include the classification of benign and malignant thyroid nodules and the diagnosis of lymph node metastases. Table 1 summarizes the radiomics approach, methodology, and principal findings of the 25 studies focused on the evaluation of US radiomics for diagnostic applications in head and neck oncology. A high proportion of studies (23/25) examining the ability of US radiomics to predict thyroid malignancy or lymph node metastases in head and neck cancer patients achieved an AUC or c-index above 0.7. Most of the ML models achieved their highest diagnostic performance when combining clinical and radiomic features. Additionally, most of the authors compared their models to the current clinical paradigm or "gold standard" method to demonstrate the added value of radiomics. Park et al. [31] developed a US radiomics model to differentiate benign and malignant thyroid nodules and investigated its potential as a diagnostic adjunct to improve the performance of existing risk stratification guidelines. This retrospective study included a training set (n = 1299) and a validation set (n = 325). Thyroid nodules were manually segmented on US images and radiomic features were extracted using an in-house developed algorithm in MATLAB 2018b (The MathWorks, Inc., Natick, MA, USA). Based on the training set, the least absolute shrinkage and selection operator (LASSO) feature selection method reduced the number of features from 730 to 66. A radiomics score reflecting the risk of malignancy was calculated for each nodule. The performance of the American College of Radiology (ACR) and American Thyroid Association (ATA) guidelines was compared with the combined diagnostic performance of the guidelines and the radiomics score.
When ACR or ATA guidelines were combined with the radiomics score (using a 5% predicted malignant risk cutoff, Rad_5%), radiologists showed an increased specificity (72.8%, p < 0.001), accuracy (73.2%, p < 0.001) and positive predictive value (42.1%, p < 0.001) with decreased unnecessary fine needle aspiration (FNA) biopsy rates (57.9%, p < 0.001) and no change in sensitivity (75.0%, p > 0.99). This study highlights the potential for US radiomics to act as a complementary tool that could assist in the discrimination of benign and malignant thyroid nodules.
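The LASSO step underlying models such as the one in Park et al. can be sketched as follows. This is an illustrative reconstruction on synthetic data (the feature count, penalty strength, and signal structure are arbitrary assumptions), not the authors' MATLAB pipeline.

```python
# LASSO feature selection on synthetic radiomic features: the L1 penalty
# zeroes most coefficients, and the surviving linear combination serves
# as a per-nodule "radiomics score".
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = StandardScaler().fit_transform(rng.normal(size=(150, 50)))
# Synthetic outcome driven by only two of the 50 candidate features.
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=150)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)          # features with nonzero weight
rad_score = X @ lasso.coef_ + lasso.intercept_  # linear radiomics score
print(len(selected), "features retained of", X.shape[1])
```

The score can then be thresholded (e.g., at a predicted-risk cutoff, as in the Rad_5% rule above) or fed into a downstream model alongside clinical variables.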
Histopathologic evidence of lymph node involvement is critical in surgical planning and determining the extent of neck dissection in head and neck cancer patients. Depending on the number and levels of involved lymph nodes on imaging, advanced neck dissection techniques can enable staging of the patient and potentially allow for less invasive approaches to managing tumor and nodal disease [47]. In this context, Jiang et al. [21] developed a nomogram based on multivariate logistic regression that incorporated US radiomic features and clinicopathologic characteristics of papillary thyroid carcinoma patients for the preoperative diagnosis of lymph node metastasis (LNM). This retrospective study included a training cohort (n = 147) to build the model and a validation cohort (n = 90) to assess its performance, collecting imaging data from two separate institutions. Nodules were manually delineated, and 310 B-mode US radiomic features and 209 shear wave elastography (SWE) radiomic features were automatically extracted using the open-source software Pyradiomics [48]. A two-step feature selection method was then applied: a minimum redundancy maximum relevance (mRMR) algorithm was first used to reduce the feature space, and a LASSO logistic regression was then applied to select the most meaningful features from the training dataset (final number of features = 6). Finally, a US radiomics nomogram was developed incorporating the radiomics score of the features and the independent clinical variables (e.g., multifocality). For comparison, an independent model was developed using the clinical variables alone. The nomogram, built for the preoperative classification of LNM, showed superior diagnostic performance compared to the clinical model in both the training (AUC 0.851 vs. 0.800, p = 0.034) and validation datasets (AUC 0.832 vs. 0.783, p = 0.048).
Their nomogram showed that patients with a score of 0.574 or greater were high-risk patients likely to have LNM (sensitivity of 86.84% and positive predictive value of 70.21% in the validation cohort). This study demonstrates the usefulness of US radiomics for the preoperative diagnosis of LNM, which has the potential to guide clinicians' decisions regarding FNA biopsies.
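The two-step feature selection used by Jiang et al. can be approximated in scikit-learn. This sketch substitutes a univariate mutual-information filter for mRMR (scikit-learn has no built-in mRMR) and uses arbitrary synthetic data and hyperparameters, so it is a structural illustration rather than a reproduction of the study.

```python
# Two-step feature selection: an information-based filter (standing in
# for mRMR) shrinks the feature space, then L1-penalized logistic
# regression picks the final features.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(147, 100))  # 147 patients, 100 candidate features
# Synthetic LNM label driven by the first two features plus noise.
y = (X[:, 0] + X[:, 1] + rng.normal(size=147) > 0).astype(int)

# Step 1: keep the 20 features most informative about the label.
filt = SelectKBest(mutual_info_classif, k=20).fit(X, y)
X_small = filt.transform(X)

# Step 2: the L1 penalty zeroes most of the remaining coefficients.
lasso_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.3)
lasso_lr.fit(X_small, y)
n_final = int(np.sum(lasso_lr.coef_ != 0))
print(n_final, "features in the final model")
```

The coefficients of the final model define the radiomics score that a nomogram combines with clinical variables such as multifocality.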
In another study, Li et al. [25] compared the clinical value of conventional US characteristics of thyroid nodules to US radiomic features for diagnosing LNM prior to surgery. A total of 126 thyroid cancer patients with preoperative US images were retrospectively analyzed. Thyroid nodules were delineated by two radiologists, and a total of 1079 radiomic features were extracted using closed-source software. LASSO and principal component analysis (PCA) were then used to reduce the multi-dimensional feature space to 91 features. Hypothesis testing and bagging were used to build the model based on these radiomic features. The diagnostic performance of the radiomics model was compared to that of US characteristics associated with LNM, including shape, calcification, composition, and echogenicity. The analysis of US characteristics suggested that irregular shape and microcalcification were predictors of LNM in thyroid cancer patients (p = 0.025 and p = 0.004, respectively). However, ROC curve analysis for the diagnosis of LNM revealed that the radiomics model had markedly improved performance compared to the US characteristics (AUC of 0.76 and 0.80 in the training and validation cohorts of the radiomics model, respectively, vs. AUC of 0.591 and 0.629 for irregular shape and microcalcification, respectively). This study demonstrates the added value of radiomics analysis in screening meaningful US features for the preoperative prediction of LNM in thyroid cancer patients.
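The kind of AUC comparison reported by Li et al. (multi-feature radiomics model vs. a single conventional US characteristic) can be sketched as follows; the data here are synthetic and the AUC values are not intended to match the study's.

```python
# ROC AUC of a multi-feature logistic model vs. a single weak feature,
# on synthetic data where each individual feature carries modest signal.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
n = 400
y = rng.integers(0, 2, size=n)               # LNM yes/no
single = y + rng.normal(scale=2.0, size=n)   # one noisy US characteristic
# Five moderately informative radiomic features.
X = np.column_stack([y + rng.normal(scale=0.8, size=n) for _ in range(5)])

model = LogisticRegression().fit(X, y)
auc_radiomics = roc_auc_score(y, model.predict_proba(X)[:, 1])
auc_single = roc_auc_score(y, single)
print(f"radiomics AUC = {auc_radiomics:.2f}, single feature AUC = {auc_single:.2f}")
```

Pooling several individually weak features is exactly how a radiomics model can outperform any single semantic characteristic such as shape or microcalcification.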
While several studies have demonstrated improved diagnostic performance with the use of ML methods, a small number of studies reported a lack of added value with radiomics compared to traditional radiologic measures. Kim et al. [23] investigated optimal subsets of first- and second-order textural features (histogram and co-occurrence matrix features) for the discrimination of malignant and benign thyroid nodules in comparison to conventional radiologic characteristics (echogenicity, calcifications, shape). To evaluate the differences in both radiologic characteristics and textural features between malignant (n = 444) and benign (n = 189) thyroid nodules confirmed by US-guided FNA, diagnostic performance was assessed by calculating sensitivity, specificity, accuracy, positive predictive value (PPV), and negative predictive value (NPV). The authors reported that gray-scale ultrasound characteristics displayed the highest diagnostic performance, with 91% sensitivity, 91% specificity, 77% accuracy, 57% PPV, and 95% NPV. Among the first- and second-order textural features, the histogram parameter of mean intensity had the highest overall diagnostic performance, with 70% sensitivity, 65% specificity, 67% accuracy, 46% PPV, and 64% NPV. However, the overall performance of the mean intensity was significantly lower than that of the gray-scale US characteristics (p < 0.001).
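The five diagnostic metrics used throughout these comparisons follow directly from the four cells of a confusion matrix. The helper below is a generic illustration (the example labels are made up), not data from any reviewed study.

```python
# Sensitivity, specificity, accuracy, PPV, and NPV from a confusion matrix.
import numpy as np

def diagnostic_metrics(y_true, y_pred):
    """Five standard diagnostic metrics for binary labels (1 = malignant)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Toy example: 8 nodules, one missed malignancy and one false alarm.
m = diagnostic_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
print(m)  # every metric is 0.75 in this balanced toy example
```

Note that accuracy is a prevalence-weighted average of sensitivity and specificity, which is why PPV and NPV can diverge sharply from them in imbalanced cohorts like Kim et al.'s.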

Performance of US Radiomics for Response Prediction in Head and Neck Cancer
The performance of US radiomics models for prognostic applications, including the prediction of treatment response, locoregional recurrence, and treatment outcomes, is summarized in Table 2. Dasgupta et al. [38] conducted a prospective study to investigate the role of US radiomics for the pretreatment prediction of recurrence in head and neck cancer patients (clinicaltrials.gov identifier NCT03908684). A total of 51 head and neck cancer patients with a primary site of oropharynx, hypopharynx, or larynx were enrolled. Patients were treated with intensity-modulated or image-guided radiation therapy (70 Gy in 33 fractions over 6–7 weeks), with or without concurrent chemotherapy. All patients were followed for a median of 38 months. For patients with confirmed recurrence, cross-sectional imaging and tissue diagnosis via biopsy were undertaken. Pretreatment US images were acquired up to 1 week prior to the start of radiation therapy, and the largest lymph node was selected for imaging. Three different data modeling classifiers were built using a maximum of 3 features per model. The algorithms were tested using leave-one-out cross-validation (LOOCV). Recurrence (local, regional, or distant) served as the endpoint and was observed in 17 of the patients. The k-nearest neighbor (kNN) classifier demonstrated the highest sensitivity, specificity, and accuracy for predicting recurrence or no recurrence (76%, 71%, and 75%, respectively). This pilot clinical study is the first to demonstrate the potential clinical utility of US radiomics for the prediction of locoregional recurrence in head and neck cancer patients. Park et al. [41] developed a US radiomics signature to predict disease-free survival (DFS) in papillary thyroid carcinoma (PTC) patients and assessed the value of radiomics in comparison to clinicopathologic risk factors (including pathological tumor size and gross extrathyroidal extension).
In their retrospective analysis, a total of 768 PTC patients who underwent preoperative US examination were enrolled. Tumors were manually segmented by a single radiologist, and 730 candidate features were extracted using in-house radiomics analysis software. The top 40 most useful radiomic features were selected using LASSO. A radiomics model integrating the clinicopathologic features improved the c-index compared to the clinical model alone (0.777 vs. 0.721) and demonstrated better performance in the estimation of DFS in PTC patients. While this study highlights the potential of US radiomics for risk stratification in patients with PTC, further validation on an independent dataset is required.
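The LOOCV scheme used by Dasgupta et al. (51 patients, kNN, at most 3 features per model) can be sketched as follows; the feature values and label rule below are synthetic stand-ins, not the trial's data.

```python
# Leave-one-out cross-validation with a kNN classifier on a small cohort:
# each patient is held out once while the model trains on the other 50.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(51, 3))       # 51 patients, max 3 features per model
y = (X[:, 0] > 0.3).astype(int)    # synthetic recurrence yes/no label

scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y,
                         cv=LeaveOneOut())
print(f"LOOCV accuracy: {scores.mean():.2f}")
```

LOOCV wastes no data, which matters with 51 patients, but it only characterizes the training cohort; an external test set would still be needed to confirm generalization.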

Performance of US Radiomics for Xerostomia Evaluation
Xerostomia (severe dry mouth) is one of the most common side effects of radiation therapy in head and neck cancer patients and results from radiation damage to the salivary glands [49]. The three studies shown in Table 3 created models using features extracted from US images to distinguish between irradiated and healthy parotid glands. In a prospective study, Yang et al. [44] imaged the parotid glands of 12 patients post-RT and 7 healthy volunteers. Using an in-house developed MATLAB algorithm, they extracted 6 sonographic textural features. They observed that healthy volunteers showed tissue homogeneity and soft tissue echogenicity, while irradiated glands showed greater heterogeneity. Significant differences (p < 0.05) were found in 4 of the 6 sonographic features, with the most significant differences in the area and width of the high-intensity portions of the histogram. In a complementary analysis [45], Yang's group also extracted gray-level co-occurrence matrix (GLCM) features from the parotid glands of 12 RT-treated patients and 7 healthy volunteers. This texture analysis again showed trends of heterogeneity in irradiated glands, in addition to hyperechoic and hypoechoic lines and spots, decreased correlation, increased entropy, and lack of symmetry. Histological analysis performed on the parotid gland tissue provided a biological explanation for the radiomics findings. While normal parotid glands show homogeneous regions of densely packed serous acinar cells, post-radiation many acini are lost, leaving remaining cells that appear larger and more disorganized. Together with patches of inflammatory infiltrates, these biological findings explain the heterogeneity of the radiation-damaged tissue. Additionally, post-radiation fibrosis may help explain the hypoechoic spots and lines. In this prospective study, significant differences were observed for all 8 features of post-RT glands compared to normal glands (p < 0.05).
A larger follow-up retrospective study was also conducted by Yang et al. [46], in which both parotid glands of 34 post-radiation patients and 13 healthy volunteers were imaged with US. RT-treated patients were further stratified into two groups: early toxicity, occurring within the first 3 months after treatment completion, and late toxicity, occurring beyond 3 months. Extraction of echo-intensity histograms and sonographic features showed that 4 of the 6 features achieved high diagnostic accuracy in classifying acute toxicity vs. normal glands and late toxicity vs. normal glands; peak intensity, the area under the high-intensity portion of the histogram, and the width of the high-intensity portion were particularly useful (AUC ≥ 0.90). The study demonstrated that radiomics could successfully differentiate between acute and late toxicity, a distinction the participating radiologist was unable to make based on visual interpretation. These limited findings provide encouraging evidence on the potential of US radiomics in the assessment of radiation toxicity in head and neck cancer patients. It is important to recognize that although US is not routinely used in radiation treatment planning, radiomics using CT and MRI is being extensively investigated for its potential to improve the accuracy of tumor segmentation, dose escalation, and prediction of tumor response to radiation therapy [50].
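The GLCM features central to these parotid-gland studies can be computed with a few lines of NumPy. The sketch below builds a co-occurrence matrix for horizontally adjacent pixels and derives contrast and entropy; the images are synthetic, and real pipelines would average over multiple offsets and angles.

```python
# Minimal gray-level co-occurrence matrix (GLCM) features in NumPy.
import numpy as np

def glcm_features(img, levels=4):
    """Contrast and entropy from a horizontal-offset GLCM."""
    glcm = np.zeros((levels, levels))
    # Count co-occurrences of horizontally adjacent gray levels.
    for a, b in zip(img[:, :-1].ravel(), img[:, 1:].ravel()):
        glcm[a, b] += 1
    p = glcm / glcm.sum()                # normalize to joint probabilities
    i, j = np.indices(p.shape)
    contrast = np.sum(p * (i - j) ** 2)  # local intensity variation
    nz = p[p > 0]
    entropy = -np.sum(nz * np.log2(nz))  # texture randomness
    return contrast, entropy

# A homogeneous "gland" vs. a speckled one: the speckled image should
# show higher contrast and entropy, mirroring the irradiated-gland trend.
rng = np.random.default_rng(6)
smooth = np.ones((32, 32), dtype=int)
speckled = rng.integers(0, 4, size=(32, 32))
print(glcm_features(smooth), glcm_features(speckled))
```

A perfectly uniform image yields zero contrast and zero entropy, so any increase in these features quantifies the heterogeneity that radiologists describe qualitatively in irradiated glands.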

Discussion
This review summarizes the current literature on US radiomics in head and neck cancer. Our objectives were to evaluate and compare the textural and ML methodologies used for current US radiomic applications and to assess the added value of US radiomics for diagnostic and prognostic applications in head and neck cancer patients.
US plays a key role in the clinical workup of head and neck cancer patients for diagnosis and staging. Traditionally, qualitative or semantic features observed on the images are used to describe tumor shape, size, and location. However, this conventional radiologic assessment is constrained by subjective scoring methods and limited sensitivity of the human eye to detect subtle variations in the signal. With radiomics, features that are mined from the images describe recurrent patterns or textures and the statistical interrelationships between neighboring voxels quantitatively and objectively.
The generation of robust and high-quality evidence on the performance of radiomics is critical for successful integration into clinical practice. Our review of the radiomics literature revealed that the majority of published studies of US radiomics are based on retrospective analysis of imaging datasets. Moreover, a majority of these studies were conducted at a single center with a limited number of datasets. To understand the strength of the published evidence on the potential utility of US radiomics in head and neck oncology, we used the RQS. The RQS has 16 main components that assess the rigor of the studies, from image acquisition to feature extraction/reduction, model validation, biological associations, and clinical significance. The majority of the studies included in this review received a score of less than 50%. Several systematic reviews investigating the utility of radiomics in breast and lung cancers have also reported low RQS (less than 50%), with a lack of external validation or prospective studies [51][52][53]. Many of the studies in this review did not include a validation set, or relied only on cross-validation because of insufficient sample sizes. Validation is an indispensable component of radiomics analysis and must not be omitted: validation techniques lend credibility to the trained model and assess its overall performance. A test set must be used only once a model has been properly trained and validated. Finally, open-access reporting of data and code is vital for radiomics to be accepted in both the scientific and clinical communities.
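As a minimal illustration of the cross-validation practice discussed above, the sketch below generates disjoint train/validation index splits for k-fold cross-validation. In a full radiomics pipeline these splits would wrap feature selection and model fitting together, and a held-out test set would be evaluated only once, after training and validation are complete. The function is a generic sketch, not taken from any of the reviewed studies.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold
    cross-validation over n samples, after a seeded shuffle."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)  # k near-equal validation folds
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val
```

Each sample appears in exactly one validation fold, so averaging performance over the k folds gives a less optimistic estimate than training-set performance alone, which matters when sample sizes are as small as those in the reviewed studies.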
The conduct of multi-center studies in the future could further validate the performance of ML-based US radiomics and facilitate the integration of these approaches into routine clinical practice. In this regard, it is important to consider the standardization of imaging methods using phantoms. The studies included in this review employed different US probe frequencies and, therefore, different resolution limits. Technical fluctuations within and between scanners, differences in imaging parameters, and operator variability can also substantially influence radiomics features. It is therefore imperative that such studies adhere to guidelines that have been developed, including the Image Biomarker Standardization Initiative (IBSI) [54]. Clinical trials aimed at evaluating interventions using radiomic approaches should also adhere to the recently developed AI extensions of the Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT-AI) and Consolidated Standards of Reporting Trials (CONSORT-AI) guidelines [55]. Adherence to these standards will enable the generation of high-quality evidence on the performance of radiomics for specific clinical applications through structured data pipelines and facilitate the timely implementation of such pipelines for routine clinical use. Such efforts will likely require sustained engagement and partnership between academia and industry.
Finally, demonstrating the benefit or superior performance of radiomics over standard radiologic or clinical measures is critical to the widespread adoption of ML/AI. It is therefore important to recognize that not all clinical tasks require or may benefit from the use of radiomics. In this regard, recent studies have reported on the knowledge and attitude of radiologists towards the clinical adoption of AI [56][57][58]. A survey of over 1000 radiologists revealed that a majority (77%) of participants had a positive attitude towards the adoption of AI. The perceived benefits of applying AI to routine radiologic practice reported by the participants included lower diagnostic error rates (73% of respondents) and optimization of radiologist workload (68% of respondents), while the main concerns were the risk of diminished professional reputation, increased costs, and the higher workload for AI system maintenance and data analysis (39% of respondents) [56]. Similarly, a qualitative study by Chen et al. observed that radiologists believe that AI has the potential to assist in performing repetitive tasks, which in turn could allow them to focus on challenging clinical scenarios or reads [57]. The study also showed that awareness and knowledge about AI varied greatly among professionals in radiology. Consistent with this finding, a multi-institutional survey conducted with over 1000 radiologists and radiology residents found that limited AI-specific knowledge was associated with intimidation (OR 1.56, 95% CI 1.10-2.21, p = 0.01), while intermediate or advanced AI-specific knowledge was associated with optimism (OR 0.40, 95% CI 0.20-0.80, p = 0.01) [58]. Developing a standardized framework for the education and training of medical professionals and trainees on ML/AI methods is, therefore, critical [59].
While the overall attitude and perception of AI among radiologists and radiation oncologists are generally positive [55][56][57][58][60], there is skepticism among physicians in other specialties (e.g., surgeons) on the claims of AI and concerns regarding the high risk of bias and lack of transparency in published studies [61,62]. It is therefore important to engage all stakeholders including radiologists, medical and radiation oncologists, surgeons, and patients during the development of AI studies to ensure meaningful and successful adoption of AI-based approaches in the clinical setting.

Conclusions
The published body of evidence on US radiomics in head and neck oncology is limited. However, findings from clinical studies in the literature provide encouraging lines of evidence on the potential of ML methods to improve the diagnostic and prognostic performance of US in head and neck oncology.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/cancers14030665/s1, Table S1: Performance of US radiomics methods for studies included in this review determined using radiomics quality score (RQS). Funding: This work was supported by grants from the National Cancer Institute R01CA204636, R01CA243456, National Institute of Dental and Craniofacial Research 1F31DE031184 and utilized shared resources supported by Roswell Park Cancer Institute Cancer Center Support Grant from the National Cancer Institute P30CA016056. The funding sponsors had no role in the design of the study, collection, analyses, or interpretation of data, writing of the manuscript, and in the decision to publish the results.

Conflicts of Interest:
The authors declare no conflict of interest.