Artificial Intelligence for Thyroid Nodule Characterization: Where Are We Standing?

Simple Summary
This review provides an up-to-date summary of the state of the art of artificial intelligence (AI) implementation for thyroid nodule characterization and cancer. Opinion on the real effectiveness of AI systems remains divided. Taking into consideration the largest and most scientifically valid studies, it is possible to state that AI provides results that are comparable or inferior to those of expert ultrasound specialists and radiologists. Promising data support the use of AI as a support tool and simultaneously highlight the need for a radiologist supervisory framework for AI-provided results. Therefore, current solutions might be more suitable for educational purposes.

Abstract
Machine learning (ML) is an interdisciplinary subfield of artificial intelligence (AI) that creates systems to set up logical connections using algorithms, and thus offers predictions for complex data analysis. This review provides an up-to-date summary of the current state of the art regarding ML and AI implementation for thyroid nodule ultrasound characterization and cancer, highlighting controversies over AI application as well as possible benefits of ML, for example for training purposes. There is evidence that AI increases diagnostic accuracy and significantly limits inter-observer variability by using standardized mathematical algorithms. It could also be of aid in practice settings with limited sub-specialty expertise, offering a second opinion by means of radiomics and computer-assisted diagnosis. The introduction of AI represents a revolutionary event in thyroid nodule evaluation, but key issues for further implementation include integration with radiologist expertise, impact on workflow and efficiency, and performance monitoring.


Introduction
For thyroid nodule management, the current diagnostic goal is early identification of the malignant thyroid nodules: although the incidence of the disease is high (incidence rate of 3.4/100,000 in men and 11.5/100,000 in women [1]), more than half of newly diagnosed thyroid cancers have a low risk of persistence or recurrence [2,3]. It is therefore necessary to develop a diagnostic tool that improves interobserver agreement in the risk stratification of thyroid nodules to provide an objective assessment of utility for the clinical and surgical management phases that follow [4], given that even molecular biology is not specific and does not accurately predict prognosis after surgery [5,6].
In the last two decades, medical imaging has grown exponentially, shifting from the traditional use of images for visual interpretation to their conversion into quantitative features that can be analyzed to extrapolate data and thus improve clinical decision-making. This approach is usually called "Radiomics" [7,8]. Radiomics takes advantage of extraction algorithms to derive several quantitative features from radiological images. Several recent works underline how these data may be used by machine learning (ML) systems.
ML is an interdisciplinary subfield of artificial intelligence (AI) dealing with the creation of systems that set up logical connections via algorithms to make predictions on data [9] (Figure 1). The most interesting application of ML in the medical field is the discernment of patterns based on the examination and analysis of extensive datasets coming from various sources (clinical databases, laboratory results, and imaging data) [10,11]. In particular, ML techniques are divided into supervised and unsupervised learning methods. Supervised ML uses dataset inputs linked to labeled dataset outputs to identify a function between the two, while unsupervised ML uses non-labeled input datasets to identify and separate subsets with similar characteristics [12]. Deep learning (DL) is a subset of ML approaches that uses neural networks arranged in layers to extract higher-level features from input data and automatically learn their discriminative features, which allows approximation of non-linear relationships with excellent performance.
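The distinction between supervised and unsupervised learning can be illustrated with a minimal Python sketch on a toy one-dimensional feature. The feature values, the labels, and the function names are hypothetical and chosen purely for illustration; they are not taken from any study cited in this review, and the two methods shown (a midpoint threshold and a one-dimensional two-means clustering) are deliberately simple stand-ins for real ML models.

```python
from statistics import mean


def fit_supervised_threshold(values, labels):
    """Supervised: use labeled examples to learn a decision threshold.

    The threshold is the midpoint between the mean of class 0 and class 1.
    """
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    return (mean(class0) + mean(class1)) / 2


def cluster_unsupervised(values, iterations=10):
    """Unsupervised: split unlabeled values into two groups (1-D two-means).

    Assumes the values are not all identical, so both groups stay non-empty.
    """
    c_low, c_high = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values if abs(v - c_low) <= abs(v - c_high)]
        high = [v for v in values if abs(v - c_low) > abs(v - c_high)]
        c_low, c_high = mean(low), mean(high)
    return sorted(low), sorted(high)


# Hypothetical feature values (arbitrary units) and labels (0 or 1).
values = [0.2, 0.3, 0.35, 0.8, 0.9, 1.0]
labels = [0, 0, 0, 1, 1, 1]

threshold = fit_supervised_threshold(values, labels)   # needs the labels
low_group, high_group = cluster_unsupervised(values)   # ignores the labels
```

In the supervised case the labels drive the fitted threshold, whereas in the unsupervised case the same values are grouped by similarity alone; whether the resulting groups correspond to clinically meaningful classes must be checked separately.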
These technologies may ultimately be transferred to software used directly by clinicians: computer-aided diagnosis (CAD). Such software can be stand-alone or integrated in sonographic equipment and helps in the detection and evaluation of thyroid nodules, one of the most common endocrine diseases, which are often found incidentally on ultrasound (US) examination, especially in patients over 65 years of age [13].

Materials and Methods
The study only considered articles published in the last decade (2012-2022), since the literature concerning AI application in radiology has developed extensively only in recent years. Among these, only large retrospective and prospective studies, systematic reviews, and meta-analyses were selected, as overall they have greater statistical significance. The search was carried out by querying the PubMed and Google Scholar online databases using the MeSH terms "thyroid nodule and artificial intelligence", with the MeSH terms present in the titles or abstracts. Only human studies were selected. The search identified 166 studies from January 2012 to April 2022; of these, 63 were further considered. After a full-text read, 30 studies were included in the final review; they are all listed in Table 1.
The AI-TIRADS is an optimization of ACR TIRADS generated by "genetic algorithms", a subgroup of AI methods that focus on algorithms inspired by "natural selection".

Radiomics
Medical radiomics employs high-throughput automated extraction algorithms to obtain a large number of quantitative characteristics from image datasets and is able to identify measurable information that clinical evaluation alone cannot detect [12,43].
Two of the first radiomics approaches in thyroid nodule characterization were texture analysis and US echo-intensity evaluation [44]. The latter is affected by several factors, such as gain, dynamics, operator dependency, and probe variability, as well as by the US equipment performance. The diagnostic value of echo-intensity obtained by direct measurement is limited; however, the echo intensity of the nodule and that of the surrounding tissues increase or decrease simultaneously when these factors change [45]. Therefore, the echo intensity of the thyroid nodule can be indirectly quantified by measuring the grayscale ratio of the nodule to the surrounding thyroid tissues, which is more objective than subjective assessment [44-46]. A pivotal single-center study demonstrated that the ratio was significantly lower in malignant nodules than in benign ones [46], while the ratio of the nodule to the strap muscle was influenced by gender and was less clinically discriminant. The inter-rater agreement was fair (k = 0.40) for hypo-echogenicity, whereas it was substantial for the ratio (k = 0.74), confirming the reduction in variability. This approach was subsequently replicated by other groups, showing that the ratio may distinguish anechoic and markedly hypoechoic nodules [47], that it can be applied to different nodule sizes [48], and that software can differentiate between benign and malignant nodules [49], even in different settings [45]. One of the most significant examples is the multicenter study conducted by Liang et al., in which a radiomic score was compared with a score based on the ACR TI-RADS criteria (which take into account, in addition to the difference in echogenicity, characteristics such as composition, shape, margin, and echogenic foci), showing a close correlation between the latter and the assessment carried out by the AI [50].
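The grayscale ratio described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the method of any cited study: the pixel values and the function name `grayscale_ratio` are hypothetical, the regions of interest are given as flat lists of gray levels, and a change in gain is modeled as a simple multiplicative factor, which is a simplification of real US signal processing.

```python
from statistics import mean


def grayscale_ratio(nodule_pixels, tissue_pixels):
    """Mean gray level of the nodule ROI divided by the mean gray level
    of the surrounding thyroid tissue ROI."""
    tissue_mean = mean(tissue_pixels)
    if tissue_mean == 0:
        raise ValueError("surrounding-tissue ROI has zero mean intensity")
    return mean(nodule_pixels) / tissue_mean


# Hypothetical 8-bit gray levels sampled from two regions of interest.
nodule = [40, 44, 38, 42]
tissue = [80, 84, 78, 86]

ratio = grayscale_ratio(nodule, tissue)

# Assumed simplification: a gain change scales every gray level by the
# same factor, so the absolute intensities change but the ratio does not.
gain = 1.5
ratio_after_gain = grayscale_ratio(
    [p * gain for p in nodule],
    [p * gain for p in tissue],
)
```

Under this assumption the absolute echo intensity of the nodule changes with the gain setting while the ratio stays the same, which is the reason the ratio is regarded as a more objective quantity than the directly measured intensity.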
Radiomics approaches using grayscale histograms and other more complex image analyses have furthermore been shown to predict BRAF mutational status [51], lateral lymph node metastasis [52], and disease-free survival.

Deep Learning and Machine Learning and TIRADS Systems
Deep learning (DL) is one ML method that relies on networks of computational units (i.e., neural units arranged in layers that gradually extract higher-level features from input data and automatically learn discriminative features from data) that allow approximation of complex non-linear relationships with outstanding performance. DL can achieve diagnosis automation, avoiding human intervention. In medical applications, DL algorithms are implemented for detection and characterization of tissue lesions as well as for the analysis of disease progression [12].
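The idea of layered units that approximate a non-linear relationship can be shown with a very small Python sketch. The network below has one hidden layer with two units and hand-picked weights; it is an illustrative assumption only, since real DL systems have many more layers and learn their weights from data. The target relationship (output 1 when exactly one of two binary inputs is 1) is chosen because no single linear unit can represent it.

```python
def relu(x):
    """Rectified linear unit, a common non-linearity between layers."""
    return max(0.0, x)


def forward(x1, x2):
    """Forward pass through a two-layer network with hand-picked weights."""
    # Hidden layer: each unit is a weighted sum followed by a non-linearity.
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)
    # Output layer: linear combination of the hidden features.
    return 1.0 * h1 - 2.0 * h2


# Evaluate the network on all four binary input combinations.
outputs = {(a, b): forward(a, b) for a in (0, 1) for b in (0, 1)}
```

The hidden layer re-expresses the inputs as intermediate features, and the output layer combines them; stacking more such layers is what allows DL models to extract progressively higher-level features from images.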
AI has already been widely used in thyroid imaging [11,53]. Several AI and ML approaches were implemented for the classification of thyroid nodules and the early detection of cancers, including modifications to the American College of Radiology Thyroid Imaging Reporting and Data System (TIRADS) systems that may be manually applied. Furthermore, a convolutional neural-network-based CAD program may help in predicting the BRAF V600E genetic mutation [54-56].
Use of the ML approach may also identify nodules with high-risk mutations on molecular testing [57]. Another important advantage of AI systems is the possibility to obtain more systematized results, which could reduce inter-observer variability and tend to standardize the results obtained through the application of different TIRADS classification systems, whose major limit to date is represented by highly variable predictive capacity, high heterogeneity in grading, and the absence of reliable data in small nodules (<10 mm) [3,58,59] (Figures 2 and 3). A recent TIRADS model showed higher accuracy than a model based on training according to the nodule status, i.e., benign and malignant; additionally, the specificity of the abovementioned model was higher than that of both experienced and junior radiologists [60].
Comparisons between different imaging modalities are shown in Figures 2 and 3, where DL-based software confirms the suspicion raised by B-mode US imaging.

Computer-Assisted Diagnosis (CAD)
These approaches may produce new knowledge by identifying new patterns and features to be applied in a more traditional way and by generating computer-assisted diagnosis (CAD) systems, i.e., software able to analyze data through the application of machine-learning principles to provide clinicians with a "second opinion". AI-based thyroid CADs may further improve diagnostic performance and reliability, reaching an accuracy similar to that obtained by an expert radiologist [10,11], with potential implications for the training of less-experienced operators and the reduction of intra- and inter-observer variability [11].

CAD systems are already available as commercial applications or embedded in US equipment. A recent meta-analysis [61] confirmed that their performance in evaluating malignant thyroid nodules is comparable to that of radiologists. Specifically, the sensitivity was reported to be similar to that of experienced radiologists, while specificity and diagnostic odds ratio were lower [39]. While these systems did not outperform experienced specialists, they are able to guide the training of less-skilled examiners, thus reducing variability when clinicians' judgements show significant disagreement. However, it is difficult to eliminate all possible sources of inter-observer variability: it is in fact possible that radiologists with different degrees of experience select images with more or less relevant characteristics of suspicion. The homogeneity of the image segmentation process also plays a fundamental role in reducing the impact of selection bias.
The segmentation process involves a manual selection of the area of interest (which should correspond to the nodule), but in this phase portions of the image that contain non-informative areas may be selected, compromising the training process of the AI system. To address the problem, some studies have adopted a two-step fully automated classification system, specifically trained both to autonomously select the area of interest and to predict the final pathology of the selected area [62]. Furthermore, models generated from images obtained on different machines may not be universally generalizable, which can limit the sampling phases and the standardization of software. An accurate evaluation and selection phase is therefore required before the adoption of any AI system [11]. Table 2 summarizes the main advantages and disadvantages of artificial intelligence over conventional imaging.
Table 2. Advantages and disadvantages of artificial intelligence over conventional imaging.

Main Advantages of AI:
It is based on models for the interpretation of thyroid nodules that are able to match the performance characteristics of radiologists and pathologists.
Usable software for thyroid nodule risk stratification is already commercially available.
Main Disadvantages of AI:
Too little experience at the moment; prospective multicenter trials on a wide population will be needed to improve the utility of artificial intelligence for the interpretation of thyroid nodules.

Discussion
The TIRADS system was developed to improve the diagnostic accuracy of conventional US in thyroid nodule characterization [63]. However, its clinical use is still very limited and diverse; in particular, there are various types of TIRADS, and their application is very subjective; therefore, it is significantly affected by inter-observer variability [64].
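Inter-observer variability of this kind is commonly quantified with Cohen's kappa, the statistic reported for the echogenicity assessments in the Radiomics section. The following minimal Python sketch computes it for two readers; the category assignments are hypothetical TIRADS-like levels invented for illustration, and the function name `cohen_kappa` is likewise an assumption of this example.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: fraction of cases given the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal category frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical category levels assigned by two readers to ten nodules.
reader_1 = [3, 3, 4, 4, 5, 5, 4, 3, 4, 5]
reader_2 = [3, 4, 4, 4, 5, 4, 4, 3, 3, 5]

kappa = cohen_kappa(reader_1, reader_2)
```

In this invented example the readers agree on 7 of 10 nodules, but because 35% agreement would be expected by chance alone, kappa is about 0.54, lower than the raw agreement suggests.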
AI could increase US accuracy and significantly limit inter-observer variability by using standardized mathematical algorithms. In the world of DL, many authors are focusing on convolutional neural networks (CNNs), introduced by LeCun [65,66]. Before their diagnostic accuracy can be assessed, CNNs are trained on algorithm-segmented US images of thyroid nodules with a known histological diagnosis; at the end of the learning phase, the CNNs are able to analyze images of thyroid nodules and to suggest a risk stratification of these nodules corresponding to a specific TI-RADS level [16]. Most of the existing literature evaluates the diagnostic accuracy of various types of properly trained CNNs by comparing it with that of radiologists with variable degrees of experience. All the evaluated studies showed a high overall diagnostic accuracy of CNNs, above 90%, which does not differ much from that of expert radiologists. In particular, most of the studies demonstrate a comparable diagnostic accuracy, such as Watkins et al., Bai et al., Ye et al., Koh et al., and Fresilli et al. [4,16,20,30,40]. Approximately equal numbers of studies demonstrate a higher diagnostic accuracy of AI systems compared to that of expert radiologists (e.g., Sun et al., Peng et al., and Zhou et al.) [15,22,23] or, vice versa, a superior diagnostic accuracy of expert radiologists compared to that of AI systems (e.g., Zhang et al. and Han et al.) [32,33]. Despite these conflicting results, the meta-analysis conducted by Zhao et al. suggests that the sensitivity of the CAD system is similar to that of experienced radiologists, but the CAD system has lower specificity and a lower diagnostic odds ratio than experienced radiologists [39].
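The performance measures used in these comparisons (sensitivity, specificity, accuracy, and diagnostic odds ratio) all derive from the same two-by-two table of test results against the histological reference. The Python sketch below shows the calculation; the counts are hypothetical and were chosen only to reproduce the qualitative pattern described above (equal sensitivity, lower specificity and diagnostic odds ratio for the CAD system), not the figures of any cited study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic accuracy measures from a 2x2 confusion table.

    tp/fn: malignant nodules called malignant/benign by the test.
    fp/tn: benign nodules called malignant/benign by the test.
    """
    if fp == 0 or fn == 0:
        raise ValueError("diagnostic odds ratio is undefined when fp or fn is 0")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "diagnostic_odds_ratio": (tp * tn) / (fp * fn),
    }


# Hypothetical counts for 100 malignant and 100 benign nodules.
cad = diagnostic_metrics(tp=90, fp=40, fn=10, tn=60)
radiologist = diagnostic_metrics(tp=90, fp=20, fn=10, tn=80)
```

With these invented counts both readers detect 90% of malignant nodules, but the CAD system labels more benign nodules as suspicious, so its specificity (0.60 versus 0.80) and diagnostic odds ratio (13.5 versus 36) are lower.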
On the other hand, almost all the studies included in this review show that CNNs obtain a better result than junior radiologists with less than 5 years of experience in US evaluation of thyroid nodules [4,23,34,40], especially with regard to specificity [60]. These studies therefore agree in suggesting that CAD systems may be an effective support tool to increase the diagnostic efficacy of thyroid nodule evaluations by less-experienced radiologists [25]. Furthermore, some studies, such as the one by Zhao et al., show that the diagnostic accuracy of senior radiologists assisted by CAD systems is higher than that of radiologists alone and CAD systems alone [39].
It is therefore not yet clear from the literature analysis which of the specific AI systems has the best diagnostic accuracy. Wang et al. compare the effectiveness of only a few CNNs [25], while most studies analyze specific systems individually, showing high specificity, especially if they are based on a TIRADS system algorithm rather than on differentiation between benign and malignant nodules with a surgical histopathological reference [60]. In absolute terms, the CAD system used by Zhou et al., a CNN-based transfer learning method named DLRT (deep-learning radiomics of thyroid), appears to be one of those with greater diagnostic accuracy (AUC 0.97) [23], although this type of comparison between AI systems has no real statistical significance, as they were analyzed on retrospective datasets.
In addition, a variety of AI technologies have been evaluated on thyroid cytology specimens. Unfortunately, no application has been demonstrated to be robust enough for clinical use in FNAB result analysis, an issue which is related to the multi-layered, multidimensional, complex interpretation process and the lack of standardized algorithms [66,67]. However, Ippolito et al. [68] present data combining cytology and US; they integrated microscopic pathology characteristics, clinical data, and imaging features into a combined algorithm to triage indeterminate and follicular lesions into high- or low-risk categories using a CNN framework that demonstrated a sensitivity of 85.7% and a low specificity of 58.8%. As emerged from the present review, key issues in AI implementation include integration with radiologist interpretation, impact on workflow and efficiency, and performance monitoring. AI output can be translated into an automated structured report for integration into a radiology report. Sensitivity settings for different features can be adjusted and customized; validation by an experienced radiologist co-reader is warranted [69].
AI tools may be useful in practice settings with limited subspecialty expertise: in settings with minimal radiology support, AI solutions with a high negative predictive value may reassure clinicians that benign findings need no follow-up, although this should be approached with caution. Because they depend on the institutional cohorts, AI results cannot be generalized, as it is assumed that AI would underperform in specialized centers with higher malignancy rates in comparison to the average population [69]. Regarding the legal framework, review of AI-generated conclusions by board-certified radiologists or US practitioners, regardless of their specialty, is mandatory. Several authors suggest the use of AI results as a second opinion, although this has a negative impact on workflow speed [10,11,69]. US practices, in conjunction with vendors, should implement AI performance and quality control protocols in order to assess the reliability of the tool.
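The dependence of the negative predictive value on the malignancy rate of the cohort follows directly from Bayes' rule and can be sketched in Python as below. The sensitivity, specificity, and prevalence figures are hypothetical values chosen for illustration and do not come from the cited studies; the function name is likewise an assumption of this example.

```python
def negative_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a nodule is truly benign given a negative test result."""
    true_negatives = specificity * (1 - prevalence)
    false_negatives = (1 - sensitivity) * prevalence
    return true_negatives / (true_negatives + false_negatives)


# Same hypothetical test (sensitivity 0.90, specificity 0.80) in two settings.
npv_general = negative_predictive_value(0.90, 0.80, prevalence=0.05)
npv_referral = negative_predictive_value(0.90, 0.80, prevalence=0.40)
```

With identical sensitivity and specificity, the negative predictive value falls from about 0.99 at a 5% malignancy rate to about 0.92 at a 40% malignancy rate, which illustrates why reassurance from a negative AI result in a general population cannot simply be carried over to a specialized center.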
Finally, a limitation of AI should be noted: thyroid US scanning includes comprehensive neck soft tissue assessment, including lymph nodes and parathyroid glands, but currently, AI solutions address only one aspect of this complex examination.

Conclusions
The introduction of AI was a revolutionary event in thyroid nodule assessment. Not only ultrasound, but also other imaging methods such as CT and MRI, use it effectively [70-72]. In some cases, it is even possible to predict the immunohistochemistry of the thyroid nodule simply through the evaluation of segmented image datasets by AI systems [73]. Moreover, the use of CAD in daily clinical practice does not have a significant impact on workflow, as it increases the examination time by approximately 2-3 min [4]. However, the real effectiveness of AI systems remains controversial; taking into consideration the largest and most scientifically valid studies, it is possible to state that AI provides results that are comparable or in any case inferior to those of expert radiologists. Furthermore, it is necessary to consider the relevant heterogeneity of sensitivity and specificity between studies, due to the diversity in methodology and to the differences among the patients included [39].
AI systems still have a long way to go to replace experienced radiologists in the process of improving accuracy and reducing time consumption, and larger studies meeting uniformity criteria are necessary to evaluate the diagnostic performance of these systems further. Nevertheless, the current CAD systems offer support for radiologists in thyroid nodule assessment and increase the overall accuracy in routine thyroid US [10,11,39].
AI solutions with CAD should be implemented in the teaching process of junior specialists. Deep-learning algorithms would benefit from follow-up US imaging data of the same thyroid nodules in combination with TIRADS classification, rather than dichotomous prediction, to increase their repeatability, reliability, and accuracy.
Regarding the legal framework, AI-generated conclusions should be reviewed by board-certified radiologists or US practitioners as mandatory practice, such that AI results may be provided only as a second opinion.