Radiomic Detection of Malignancy within Thyroid Nodules Using Ultrasonography—A Systematic Review and Meta-Analysis

Background: Despite investigation, 95% of thyroid nodules are ultimately benign. Radiomics is a field that uses radiological features to inform individualized patient care. We aimed to evaluate the diagnostic utility of radiomics in classifying undetermined thyroid nodules into benign and malignant using ultrasonography (US). Methods: A diagnostic test accuracy systematic review and meta-analysis was performed in accordance with PRISMA guidelines. Sensitivity, specificity, and area under curve (AUC) delineating benign and malignant lesions were recorded. Results: Seventy-five studies including 26,373 patients and 46,175 thyroid nodules met inclusion criteria. Males accounted for 24.6% of patients, while 75.4% of patients were female. Radiomics provided a pooled sensitivity of 0.87 (95% CI: 0.86–0.87) and a pooled specificity of 0.84 (95% CI: 0.84–0.85) for characterizing benign and malignant lesions. Using convolutional neural network (CNN) methods, pooled sensitivity was 0.85 (95% CI: 0.84–0.86) and pooled specificity was 0.82 (95% CI: 0.82–0.83); significantly lower than studies using non-CNN: sensitivity 0.90 (95% CI: 0.89–0.90) and specificity 0.88 (95% CI: 0.87–0.89) (p < 0.05). The diagnostic ability of radiologists and radiomics were comparable for both sensitivity (OR 0.98) and specificity (OR 0.95). Conclusions: Radiomic analysis using US provides a reproducible, reliable evaluation of undetermined thyroid nodules when compared to current best practice.


Introduction
Thyroid nodules are common in the general population, with studies suggesting a prevalence of 20-67% and an increased propensity in females and the elderly [1,2]. Increased access to healthcare and the availability of modern imaging techniques such as ultrasonography (US) have led to markedly increased detection of thyroid nodules [3]. The American Thyroid Association (ATA), British Thyroid Association (BTA), and European Society for Medical Oncology (ESMO) guidelines recommend US as the primary imaging modality for the assessment of thyroid nodules [4-6]. Several classification systems (e.g., ATA, BTA, and Thyroid Imaging Reporting and Data System (TIRADS)) are used by radiologists to stratify the risk of malignancy for each thyroid nodule based on US features [4,5,7]. These systems classify lesions on a scale ranging from benign to malignant based on sonographic parameters such as size, echogenicity, degree of margin

Materials and Methods
A systematic review was performed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [16] and in accordance with the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy [17]. Local institutional ethical approval was not required.

Population, Intervention, Comparison, Outcomes (PICO)
Population: Patients who have undergone preoperative US and definitive thyroid nodule diagnosis as benign or malignant.
Intervention: Radiomic analyses applied to preoperative US used to inform whether thyroid nodules are benign or malignant.
Comparison: The discriminative ability of radiomics compared to confirmation of benign and malignant nodules. Nodules were determined as benign by either cytological or histological means, while malignancy was confirmed by histological analysis only.
Outcomes: The primary outcome was the clinical utility of preoperative US imaging in stratifying thyroid nodules as either benign or malignant, represented by pooled sensitivity, specificity, and receiver operating characteristic (ROC) curve analyses. Secondary outcomes were a comparison of the ability of different radiomic methods to differentiate such nodules and a comparison of radiologists and radiomics in correctly discriminating benign from malignant thyroid nodules.

Search Strategy
An electronic search of the PubMed Medline, EMBASE, and Scopus databases was performed on 16 January 2021 for relevant studies. This search was performed for the following headings: (Thyroid Cancer) and (Radiomics) linked using the boolean operator "AND". Included studies were limited to those published in the English language and were not restricted based on the year of publication. All duplicate studies were manually removed before titles were screened, and studies deemed appropriate had their abstracts reviewed. Studies remaining had their full texts reviewed for eligibility.

Inclusion and Exclusion Criteria
Studies meeting the following inclusion criteria were included: (1) studies with thyroid nodules confirmed as benign or malignant following US imaging; (2) imaging of tumors performed pre-diagnosis; and (3) either stated numbers of true positives, true negatives, false positives, and false negatives, sensitivity, specificity, or accuracy data in relation to radiomic tests, or the ability to calculate these figures from study data. In some cases, sensitivity and specificity were calculated from ROC curve analyses. Studies comparing the diagnostic ability of radiologists with radiomics were also included. Studies meeting any of the following exclusion criteria were excluded: (1) studies not providing radiomic validation or "test" data; (2) studies outlining the diagnostic ability of radiomics in differentiating benign and malignant lesions in other cancers (e.g., breast carcinoma, skin cancers); (3) studies with no full English text; (4) review articles; (5) case reports or series including fewer than five patients; and (6) editorial articles.
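The third inclusion criterion relies on the standard relationship between 2x2 counts and diagnostic accuracy measures. A minimal sketch of that arithmetic follows; the function names and all counts are hypothetical illustrations, not data from any included study, and back-calculating counts from rounded rates is only approximate.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from 2x2 counts."""
    sensitivity = tp / (tp + fn)  # malignant nodules correctly flagged
    specificity = tn / (tn + fp)  # benign nodules correctly cleared
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy


def counts_from_rates(sensitivity, specificity, n_malignant, n_benign):
    """Approximate (tp, fp, tn, fn) when only rates and group sizes are reported."""
    tp = round(sensitivity * n_malignant)
    tn = round(specificity * n_benign)
    return tp, n_benign - tn, tn, n_malignant - tp


# Hypothetical study: 100 malignant and 200 benign nodules.
sens, spec, acc = diagnostic_metrics(tp=87, fp=32, tn=168, fn=13)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
# prints: sensitivity=0.87 specificity=0.84 accuracy=0.85
```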

Data Extraction and Quality Assessment
The literature search was performed by two independent reviewers (E.F.C. and S.O.) using the aforementioned search strategy. Where the reviewers disagreed, a third reviewer (M.G.D.) was asked to arbitrate. As described, duplicate studies were removed. Both reviewers reviewed all retrieved manuscripts to ensure all inclusion criteria were met before extracting the following data: (1) first author name, (2) year of publication, (3) study design, (4) country, (5) level of evidence, (6) study title, (7) number of patients, (8) number of benign and malignant nodules confirmed through cytologic or histopathologic analysis, (9) sensitivity, specificity, and area under the curve (AUC) scores from the ROC curve analyses obtained from radiomic "test" data, and (10) sensitivity, specificity, and AUC scores from the ROC analyses of radiologists within studies, where available. Sensitivity and specificity were extracted directly from tables and study text. When not provided as discrete data in tables or the text, estimates of sensitivity and specificity were calculated from ROC curves, with the most accurate and appropriate sensitivity prioritized. Where studies tested the diagnostic ability of multiple radiomic methods (e.g., CNN, ML), only data for the best-performing radiomic method within that study were extracted. Similarly, where studies detailed data on multiple radiologists' ability to discriminate benign from malignant nodules, data from the best-performing radiologist in that study were included. The quality of each study was appraised using the radiomics quality score (RQS), as outlined previously by Lambin et al. [18].

Statistical Analysis
Statistical analysis was performed according to the Cochrane guidelines. Pooled sensitivity and specificity and summary ROC analysis were calculated for included studies to convey the diagnostic test performance of radiomics in differentiating malignant thyroid nodules from benign thyroid nodules. We then compared studies using CNNs (incorporating both CNNs and other deep learning methods) with those using either ML or radiomic AI analyses (together termed non-CNNs). For comparing radiologist and radiomic diagnostic test accuracy, sensitivity and specificity data were expressed as dichotomous data and reported as odds ratios (ORs) with 95% confidence intervals (CIs) following estimation using the Mantel-Haenszel method with random effects. The symmetry of funnel plots was used to assess publication bias. Statistical heterogeneity was determined using I² statistics. Statistical significance was set at p < 0.05. Statistical analysis was performed using Review Manager (RevMan), Version 5.4 (Nordic Cochrane Centre, Copenhagen, Denmark).
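To make the random-effects odds ratio and the I² statistic concrete, the sketch below pools 2x2 tables with the DerSimonian-Laird estimator. This is an illustrative assumption: it uses inverse-variance weights throughout, so it is a simplification and may not reproduce RevMan's Mantel-Haenszel-based calculation. The function name and all counts are hypothetical, and tables with zero cells are assumed absent (no continuity correction is applied).

```python
import math


def random_effects_or(tables):
    """Pool 2x2 tables (a, b, c, d) into a random-effects odds ratio.

    For a sensitivity comparison, a/b could be malignant nodules
    detected/missed by radiomics and c/d those detected/missed by
    radiologists. Returns (pooled_or, ci_low, ci_high, i_squared_percent).
    """
    # Per-study log odds ratios and their approximate variances.
    y = [math.log((a * d) / (b * c)) for a, b, c, d in tables]
    v = [1 / a + 1 / b + 1 / c + 1 / d for a, b, c, d in tables]

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1 / vi for vi in v]
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    df = len(tables) - 1

    # I^2: share of total variation attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # DerSimonian-Laird between-study variance (tau^2).
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights, pooled estimate, and 95% CI on the log scale.
    w_re = [1 / (vi + tau2) for vi in v]
    y_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return (math.exp(y_re), math.exp(y_re - 1.96 * se),
            math.exp(y_re + 1.96 * se), i2)


# Hypothetical counts for three studies (not taken from the review).
example = [(90, 10, 88, 12), (45, 15, 50, 10), (120, 30, 118, 32)]
pooled, low, high, i2 = random_effects_or(example)
print(f"OR {pooled:.2f} (95% CI {low:.2f}-{high:.2f}), I2 = {i2:.0f}%")
```

An OR whose confidence interval spans 1 is read as no detectable difference between the two groups being compared.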

Literature Search
The initial search of PubMed, Scopus, and EMBASE identified a total of 537 studies. Following the removal of duplicates, 488 studies remained. These studies were then screened by title and abstract for relevance, after which 119 studies remained, all of which had their full text analyzed for eligibility. Finally, 75 studies remained for inclusion in the analysis, as depicted in Figure 1.

Diagnostic Ability of Radiomics
The mean AUC calculated from independent ROC curve analyses within included studies was 0.88 (range: 0.61-1.00). Individual study sensitivity and specificity for determining malignant versus benign thyroid nodules are shown in Figure 2A. Pooled sensitivity for radiomics in distinguishing thyroid nodules was 0.87 (95% CI: 0.86-0.87), and pooled specificity was 0.84 (95% CI: 0.84-0.85). A combined ROC curve for radiomics of thyroid nodules on ultrasonography is shown in Figure 2B.
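One simple way to arrive at a pooled proportion with a confidence interval is to sum the counts across studies and apply a Wilson score interval, sketched below. This is an assumption made for illustration rather than a description of the software's exact procedure: summing counts ignores between-study heterogeneity and the correlation between sensitivity and specificity. The function name and the per-study counts are hypothetical.

```python
import math


def pooled_proportion(events_and_misses, z=1.96):
    """Pool (events, misses) pairs and return (p, ci_low, ci_high).

    For sensitivity, each pair is (true positives, false negatives);
    for specificity, (true negatives, false positives).
    The interval is a Wilson score interval at the given z value.
    """
    events = sum(e for e, _ in events_and_misses)
    n = sum(e + m for e, m in events_and_misses)
    p = events / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return p, center - half, center + half


# Hypothetical (TP, FN) counts from three studies.
studies = [(45, 5), (80, 20), (136, 14)]
p, low, high = pooled_proportion(studies)
print(f"pooled sensitivity {p:.2f} (95% CI {low:.2f}-{high:.2f})")
```

With tens of thousands of nodules in the denominator, an interval computed this way becomes very narrow, which is consistent with the tight intervals reported above.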

Comparison of Radiomic Analysis of Thyroid Nodule US versus Radiologists Analysis of Thyroid Nodule US
Within the studies included in the meta-analysis, 35 studies provided a comparison between radiologists and radiomics in differentiating malignant versus benign thyroid nodules using thyroid US. Radiomics demonstrated similar sensitivity for detection of malignancy within a given thyroid nodule (OR 0.98, 95% CI 0.76-1.26) when compared with radiologists (Figure 4A). Radiomics also demonstrated similar specificity (OR 0.93, 95% CI 0.72-1.20) when compared with radiologists for this purpose (Figure 4B).

Discussion
To the best of our knowledge, the current systematic review and meta-analysis is the first to evaluate the diagnostic test accuracy of radiomic imaging analysis in differentiating malignant from benign thyroid nodules using US. Given the increasing prevalence of thyroid nodules now detected within the general population and the rising incidence of thyroid malignancy (which has tripled since 1975), accurate risk stratification is paramount to improving clinical outcomes [3]. The most important finding in this analysis of over 26,000 patients with over 46,000 thyroid nodules is the data supporting the utility of radiomic analysis in correctly stratifying undetermined thyroid nodules into benign and malignant lesions (sensitivity: 0.87, specificity: 0.84). This is promising as we look to enhance diagnostics in this field of oncology while promoting minimally invasive techniques in order to reduce morbidity and mortality for prospective patients. These results are timely given the promotion of precision oncology as a rapidly evolving field, which uses individual patient, cancer, or disease process characteristics to develop personalized diagnosis, prognosis, and treatment strategies [12]. Data from this analysis support radiomic imaging analysis using US as a means of quantifying malignancy in thyroid nodules without exposing patients to the risks associated with invasive FNAC sampling or surgical specimen assessment. For some patients, the use of radiomics could possibly circumvent the need for FNAC and surgical resection, providing a potentially more cost- and time-efficient assessment of thyroid nodules than current practice [20,94].
Results of this analysis indicate that radiomics is a novel avenue worth exploring in the differentiation of benign and malignant thyroid lesions. CNN provided a pooled sensitivity of 85% and specificity of 82% compared to a pooled sensitivity of 90% and pooled specificity of 88% in non-CNN. A CNN is designed to learn spatial hierarchies of features automatically and adaptively through backpropagation, using multiple building blocks: convolution layers, pooling layers, and fully connected layers [95]. CNNs have powerful pattern recognition capabilities because they can approximate any continuous function, given an appropriate network structure [96]. In neural networks, high variance gives a network the ability to learn complex patterns, although it also carries the risk of overfitting, since the model will learn peculiarities, or noise, from a data set [96]. Noise of this kind introduces features into the model that are not generalizable outside of the training set [95]. The model then appears to perform well in training but fails to perform in a true clinical environment. Such overfitting in the setting of CNN has been noted in studies evaluating papillary thyroid nodules on US [44]. The margin between benign thyroidal tissue and malignant tissue may be unclear or blurry on US imaging, with significant overlap between cancerous and normal or benign regions. It is therefore challenging for a CNN model to perform accurate textural feature extraction of the malignant tissue, possibly contributing to poor model performance [44]. Ideally, a CNN should have a large training set to mitigate the risk of overfitting, but this is not always feasible due to cost, time, and other factors limiting available data [95]. Non-CNN incorporates a number of methods such as support vector machines (SVM), random forest (RF), k-nearest neighbor (k-nn), and Bayesian classifiers [97]. Each method has its own strengths and weaknesses.
For example, SVM classifiers are based on decision planes that define decision boundaries. SVM is often chosen for its principle of structural risk minimization, which allows robust analysis of test data without the need for a large training set through margin maximization [98]. Another popular ML method is RF, which consists of a large collection of individual decision trees that allows for ensemble learning, providing the benefit of human-readable data and the ability to adjust the classifiers' decision trees where appropriate [97]. Ultimately, the randomness of this model makes it robust, generalizable, and less prone to overfitting, although large numbers of decision trees make this approach more time-consuming. Within our analysis, small data sets and overfitting within individual studies may have contributed to the overall worse performance of CNN versus non-CNN. Based on the results of this meta-analysis, non-CNN radiomic methods should be preferred for evaluating the risk of malignancy in an undetermined thyroid nodule using US.
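As a concrete illustration of one of the non-CNN classifiers named above, the following is a minimal k-nearest neighbor sketch. The two features (for example, a texture score and a margin-irregularity score), their values, and the function name are hypothetical and assumed to be already normalized; real radiomic pipelines extract many more features and are validated on held-out data.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up, normalized feature vectors for six labeled nodules.
train = [
    ((0.10, 0.20), "benign"),
    ((0.15, 0.10), "benign"),
    ((0.20, 0.25), "benign"),
    ((0.80, 0.90), "malignant"),
    ((0.85, 0.75), "malignant"),
    ((0.90, 0.80), "malignant"),
]
print(knn_predict(train, (0.82, 0.85)))  # prints: malignant
```

Because the prediction depends only on stored examples and a distance measure, there are no learned weights to overfit, but performance is sensitive to feature scaling and to the choice of k.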
For detecting malignancy within a given thyroid nodule, radiomic methods had similar sensitivity (OR 0.98, 95% CI 0.76-1.26) and specificity (OR 0.93, 95% CI 0.72-1.20) when compared with radiologists. However, the strengths that radiologists retain over radiomics should be acknowledged. At present, radiomic models are dependent on high-quality image acquisition and segmentation by radiologists. Without good imaging data to analyze, the radiomic model is unable to correctly stratify nodules [99]. Radiologists also retain the innate ability to incorporate the global context of patients and to draw subjective associations based on experience, which current radiomic models are unable to do. Radiomics can face issues with model fitting, poor input data, and subsequent suboptimal performance [100]. However, human assessment of medical imaging and, in particular, US suffers from significant inter-observer variability [101,102]. Radiomics, on the other hand, provides the benefit of an objective, quick, and reproducible analysis with the ability to analyze both features of the nodule that are visible to the radiologist and textural features occult to human perception [13,14]. Studies have attempted to blend the strengths of both radiologists and radiomic models to form computer-assisted diagnosis (CAD) tools. While CAD was not evaluated within the confines of the current meta-analysis, CAD has been shown to be of benefit in the evaluation of thyroid nodules within the literature [39].
The present analysis is subject to a number of limitations. Firstly, radiomics involves a broad spectrum of analysis methods, ranging from radiomic AI methods to deep learning techniques; we have included all of these under the umbrella term "radiomics" despite variance in their reproducibility of data [103]. Secondly, the authors wish to highlight the inter-user variability of US, as image acquisition is operator dependent. Radiomic analyses depend on high-quality images of thyroid nodules being obtained and nodules being correctly selected by ultrasonographers. Thirdly, when extracting data, we selected the highest-performing radiomic method within any given study. This may have led to overestimation of overall sensitivity and specificity for radiomic evaluation of thyroid nodules on US as a whole. To counter this potential bias when comparing radiomics to radiologists, we selected data for the highest-performing radiologist. Finally, prospective validation evaluating the utility of AI in the field of radiological diagnostics typically necessitates buy-in from large, international corporations in order to finance the development of the evidence base in this field.

Conclusions
In conclusion, this meta-analysis of current evidence demonstrates that radiomic imaging analysis applied to US detects malignancy within undetermined thyroid nodules with almost 90% sensitivity (pooled sensitivity 0.87, pooled specificity 0.84). At present, radiomic analyses demonstrate diagnostic sensitivity and specificity comparable to those of radiologists in identifying malignant lesions. Within the field of radiomics, non-CNN methods may currently be considered the preferred means of classifying thyroid nodules. Based on this meta-analysis, AI offers promising results as an avenue to be explored as we look to enhance the diagnostic accuracy and risk stratification of thyroid nodules in the era of personalized medical and oncological patient care. We advocate for rigorous experimentation in this field, given the potential for this technology to bolster diagnostic workflows, enhance clinical outcomes, and minimize patient morbidity, all while mitigating associated healthcare costs.