False-Positive Malignant Diagnosis of Nodule Mimicking Lesions by Computer-Aided Thyroid Nodule Analysis in Clinical Ultrasonography Practice

This study aims to test computer-aided diagnosis (CAD) for thyroid nodules in clinical ultrasonography (US) practice with a focus towards identifying thyroid entities associated with CAD system misdiagnoses. Two-hundred patients referred to thyroid US were prospectively enrolled. An experienced radiologist evaluated the thyroid nodules and saved axial images for further offline blinded analysis using a commercially available CAD system. To represent clinical practice, not only true nodules, but mimicking lesions were also included. Fine needle aspiration biopsy (FNAB) was performed according to present guidelines. US features and thyroid entities significantly associated with CAD system misdiagnosis were identified along with the diagnostic accuracy of the radiologist and the CAD system. Diagnostic specificity regarding the radiologist was significantly (p < 0.05) higher than when compared with the CAD system (88.1% vs. 40.5%) while no significant difference was found in the sensitivity (88.6% vs. 80%). Focal inhomogeneities and true nodules in thyroiditis, nodules with coarse calcification and inspissated colloid cystic nodules were significantly (p < 0.05) associated with CAD system misdiagnosis as false-positives. The commercially available CAD system is promising when used to exclude thyroid malignancies, however, it currently may not be able to reduce unnecessary FNABs, mainly due to the false-positive diagnoses of nodule mimicking lesions.


Introduction
Thyroid nodules are present in 4-68% of the global population [1,2]. Thyroid cancer is the most common malignancy in the endocrine system and is associated with a continuously increasing rate of incidence [3,4].
Fine needle aspiration biopsy (FNAB) is the primary diagnostic tool used to detect malignancies [5,6]. Ultrasonography (US) plays a major role in indicating FNAB [7,8]. US morphological features indicative of All patients who underwent FNAB signed the general institutional informed consent regarding the advantages and risks of FNAB and the possible use of anonymized data for research purposes. The institutional review board approved the use of anonymized patient data in support of this study and waived the need for additional informed consent, since the study did not burden patients with other additional procedures than necessary based on the present clinical recommendations [8,11,45] (Code: No. 7751-PTE 2019. date: 10 January 2019).

Ultrasonography (US) Examination and Fine Needle Aspiration Biopsy (FNAB), Diagnosis Definitions
All US examinations were performed using a high-end, real-time US system (RS85 A; Samsung Medison Co. Ltd., Seoul, Korea) and a 3-12 MHz linear probe at a fixed frequency of 10 MHz. Patients were examined by K.M., a radiologist specializing in head and neck radiology with over ten years of experience regarding diagnostic and interventional thyroid US.
Standard thyroid US examination was performed with all patients in a supine position and their neck in hyperextension. Neck lymph node regions, major vessels of the neck, and major salivary glands were also scanned; however, their findings were not included in the study.
If a thyroid nodule was present, the radiologist evaluated its US morphological features presented in Table 1. Based on these features, a K-TIRADS score (Korean Thyroid Imaging Reporting and Data System) was assigned (2 = benign, 3 = low suspicion, 4 = intermediate suspicion, and 5 = high suspicion) [8]. The reason for applying this score system was that since it is integrated in the applied CAD (see CAD analysis), direct comparisons could be performed. A nodule was regarded possibly benign with a K-TIRADS score of 2 or 3, while possibly malignant when associated with a score of 4 or 5.

Ultrasonography (US) Examination and Fine Needle Aspiration Biopsy (FNAB), Diagnosis Definitions
All US examinations were performed using a high-end, real-time US system (RS85 A; Samsung Medison Co. Ltd., Seoul, Korea) and a 3-12 MHz linear probe at a fixed frequency of 10 MHz. Patients were examined by K.M., a radiologist specializing in head and neck radiology with over ten years of experience regarding diagnostic and interventional thyroid US.
Standard thyroid US examination was performed with all patients in a supine position and their neck in hyperextension. Neck lymph node regions, major vessels of the neck, and major salivary glands were also scanned; however, their findings were not included in the study.
If a thyroid nodule was present, the radiologist evaluated its US morphological features presented in Table 1. Based on these features, a K-TIRADS score (Korean Thyroid Imaging Reporting and Data System) was assigned (2 = benign, 3 = low suspicion, 4 = intermediate suspicion, and 5 = high suspicion) [8]. The reason for applying this score system was that since it is integrated in the applied CAD (see CAD analysis), direct comparisons could be performed. A nodule was regarded possibly benign with a K-TIRADS score of 2 or 3, while possibly malignant when associated with a score of 4 or 5. Additional nodule features were asserted, such as "coarse calcification," if macrocalcification was present with the largest diameter exceeding 50% of the nodule's largest diameter, and "inspissated colloid cystic nodule" for well-circumscribed, completely avascular, not entirely anechoic nodules with colloid particles producing a comet tail artifact, which were completely evacuated during aspiration.
If a patient was afflicted with more than one nodule, one showing the highest risk of malignancy, or in the case of more nodules with the same malignancy risk, the largest nodule was included in the study. An axial plane B-mode image of all included nodules at their largest diameters was saved for further CAD analysis.
FNAB indication was based on the present international guidelines [8,11,45] and performed by K.M. In case of discrepancy, the guideline indicating FNAB was applied. US-guided FNAB was performed using the parallel needle to probe technique with a 22 G needle using 10 mL syringe and Cameco biopsy gun; the nodules were panned across to sample their possibly largest portion. The aspirated material was rapidly expressed onto two glass slides, and two smears were created using the one-step smear method. One slide was fixed in 95% ethanol for H&E staining, and one was air-dried for May-Grünwald Giemsa staining. The rest of the obtained material was rinsed in formaldehyde solution for processing as a cell block. Aspiration was repeated if the material macroscopically appeared to be scanty or bloody. The cytological specimen was submitted to the cytopathology laboratory along with all relevant clinical and US information. The cytological analysis was performed by a cytopathologist (E.K.) with over 20 years of experience in cytopathology. Results were classified according to the Bethesda system regarding reporting thyroid cytopathology [44].
In patients suffering from thyroiditis, the following categories were specified: (a) focal inhomogeneity, proven not to be a nodule by biopsy or if it was completely unchanged compared with previous examinations within at least a 2-year timespan and was consistently regarded to be thyroiditis related focal inhomogeneity by the examiner; (b) pseudonodule substantiated through biopsy or was completely unchanged compared with previous examinations of at least a 2-year timespan and was consistently regarded as a pseudonodule by the examiner; (c) true nodule in addition to thyroiditis, which was managed in the same way as nodules without thyroiditis. In reference to focal in homogeneities and pseudonodules, the examiner assigned a K-TIRADS score of 1 (no nodule) for further statistical analyses.
An axial plane B-mode image representing these entities at their largest diameters were also saved for further CAD analysis to assess the accuracy of nodule detection in thyroiditis.
Nodules were regarded malignant or requiring surgery (referred to as "malignant/surgery" in further texts) if the cytological result was suspicious regarding malignancy (Thy 5), or malignant (Thy 6), and/or malignancy was evident in the surgical specimen. A benign nodule was diagnosed when any of the following criteria were met: (i) confirmation of benign status in a surgical specimen; (ii) benign or cystic cytology of an FNAB (Thy 1c or Thy 2); (iii) benign traits including spongiform or partially cystic nodules with comet tail artifacts, or pure cysts evident on US; (iv) low suspicion (K-TIRADS 3) nodules under 15 mm diameter, which were completely unchanged compared with previous examinations of at least a 2-year timespan, and no clinical poor prognostic factors were present, and therefore, FNAB was not indicated.

Computer-Aided Diagnosis (CAD) Analysis
S-Detect 2 for thyroid (Samsung Medison Co., Ltd.), which is a commercially available CAD tool integrated into the real-time US system (Samsung RS85 A) designed to detect and classify thyroid lesions was used in the study. S-Detect 2 for thyroid is based on convolutional neural network-based deep learning techniques. S-Detect evaluations were performed offline, so the primary ultrasonography examiner (K.M.) was blinded to the CAD outcomes. The CAD evaluation was performed by consensus by O.G. and A.T., a radiologist with over 5 years of experience in thyroid imaging and a resident with 32 months of supervised experience in thyroid imaging, respectively, blinded both to the findings of the primary radiological evaluation and the cytopathological results. The analysis was run on the axial plane images of the nodules, focal inhomogeneities related to thyroiditis, pseudonodules in thyroiditis, and true nodules besides thyroiditis stored and marked by the primary examiner (K.M.). The CAD data were obtained by manually setting a rectangular region of interest around the lesion. The CAD system suggested four different possible margins for the detected nodule; however, the default one was always used. The software automatically evaluated the US features of the nodule presented in Table 1. The system is able to incorporate nodule elasticity and vascularity upon user selection, but these features were omitted in this study. This system can be set up to provide a simple output as "possibly benign" or "possibly malignant" or to provide a K-TIRADS score of the lesions. The latter option was used to achieve a more detailed evaluation. A lesion was regarded possibly CAD benign if the provided K-TIRADS score was 2 or 3, while a nodule with provided K-TIRADS score of 4 to 5 was regarded as possibly CAD malignant.
First, we aimed to identify US features and entities associated with CAD system misdiagnosis; therefore, we selected cases in which the radiologist's diagnosis was correct and created two subgroups: Diagnostics 2020, 10, 378 6 of 17 one in which the CAD was correct and another in which the CAD was incorrect. Between these groups (CAD correct vs. CAD incorrect), the rates of entities such as focal inhomogeneity related to thyroiditis and pseudonodule related to thyroiditis (as defined earlier) were statistically compared using the comparison of the two rates tool [50]. Next, only cases of true nodules were kept in the CAD correct and CAD incorrect groups (focal inhomogeneity related to thyroiditis and pseudonodule related to thyroiditis cases were excluded), and the rates of nodule US features assured by the radiologist (coarse calcification, macrocalcification without coarse calcification, inspissated colloid cystic nodule, true nodule related to thyroiditis, composition, echogenicity, orientation, margin, spongiform state, shape, and microcalcification) were statistically compared including the comparison of the two rate tools [50].
Secondly, to assess the effect of these entities and US features related to CAD system misdiagnosis regarding the overall diagnostic performance, the receiver operating characteristic (ROC) curves with K-TIRADS scores provided by the examiner or CAD as variables and benign or malignant/surgery diagnosis as a classification variable in the following groups were compared using the comparison of independent ROC curves with the methodology by DeLong et al. [51][52][53][54][55]: (a) total cohort, human rating, (b) total cohort, CAD rating, (c) a subgroup derived from the total cohort by excluding all cases in which the entities or US features identified to be significantly associated with CAD system misdiagnosis were present (="screened subgroup"), CAD rating, and (d) the same screened subgroup, human rating.
Third, among these four groups, the sensitivity, specificity, and accuracy were compared using the McNemar test in reference to dependent sample comparisons and Pearson's Chi-squared test for independent samples.
All these steps regarding statistical analysis were additionally run in the group including only those patients who had an FNAB (FNAB-only group).
As an ancillary step, the number of cases in which the radiologist's diagnosis was incorrect but the CAD system diagnosis was correct was calculated.
Tests resulting with a p-value of <0.05 were considered statistically significant. Table 2 presents the occurrence of cases and diagnoses.

US Features or Entities Associated with CAD System Misdiagnosis Including Mimicking Lesions
In 176 out of the 200 cases, the radiologist made a correct diagnosis. Out of these 176 cases, the CAD was correct in 83 and incorrect in 93 cases. Focal inhomogeneities related to thyroiditis were in a significantly higher rate present in the CAD incorrect group; the CAD system identified these lesions as nodules and assigned them a median K-TIRADS score of 5 (see Table 3). Figure 2 shows representative cases of focal inhomogeneity related to CAD system misdiagnosis.

US Features or Entities Associated with CAD System Misdiagnosis Excluding Mimicking Lesions
CAD was correct in 78 case and incorrect in 64 cases, regarding true nodules within the group in which a correct diagnosis was made by the radiologist (n = 142). True nodules related to thyroiditis, coarse macrocalcifications, and inspissated colloid cystic nodules were in a significantly higher rate present in the CAD incorrect group vs. the CAD correct group, with median CAD system K-TIRADS scores of 4, 5, and 4, respectively, while only one truly malignant case was present within these groups (Table 3). Figure 3 shows representative cases of these US features related to CAD system misdiagnosis. Non-parallel orientation, ill-defined margin, and irregular shape were in a significantly higher rate present in the CAD correct group and they were all malignant/surgery cases (Table 3). A representative example is shown in Figure 4. Results of the same tests run in the FNAB only group (n = 121) are presented in Supplementary Materials Table S1.
Out of the 24 cases where the radiologist's diagnosis was incorrect, the CAD was correct in four cases. In all of them the radiologist gave a TIRADS score of 4, while the CAD gave a TIRADS score of 3, and these cases were proven to be benign by cytology.

Comparison of Human and CAD System Diagnostic Performance in the Total and in the Screened Subgroup
Regarding all cases, human specificity (88.1%) and accuracy (88%) in detecting malignancies were significantly higher than when compared with those of the CAD (40.5% and 43.5%, respectively). There was no significant difference in sensitivity (human sensitivity = 88.6%, CAD sensitivity = 80%). ROC curve comparison showed a significant difference in areas under the curves (AUROCs), which were 0.937 for the human detections and 0.656 for the CAD detections.
The exclusion of cases of US features and entities identified to be related to CAD system misdiagnosis (cases of focal inhomogeneity in thyroiditis, true nodule in thyroiditis, coarse macrocalcification, and inspissated colloid cystic nodule) from the study population resulted in a "screened" subgroup including 148 cases. In this group, a significant improvement in the specificity of CAD compared to its specificity achieved in all cases could be detected; specificity in the screened subgroup increased to 55.9%. However, no significant change was observed regarding sensitivity, accuracy, and AUROC. The comparison of human and CAD diagnostic performance in the screened subgroup showed similar results as in the group of all patients since the difference in specificity, accuracy, and AUROC remained significant, while sensitivity was not significantly different.
Neither diagnostic parameters showed a significant difference among the total and the screened subgroup human detections. Table 4 shows details of diagnostic parameters of human and CAD detections in the total population and in the screened subgroup, including their comparisons. Figure 5 shows ROC curves yielded by human and CAD in the total population and in the screened subgroup.
Results of the same tests conducted in the FNAB-only group (n = 121) are presented in Supplementary Materials Table S2.
In 176 out of the 200 cases, the radiologist made a correct diagnosis. Out of these 176 cases, the CAD was correct in 83 and incorrect in 93 cases. Focal inhomogeneities related to thyroiditis were in a significantly higher rate present in the CAD incorrect group; the CAD system identified these lesions as nodules and assigned them a median K-TIRADS score of 5 (see Table 3). Figure 2 shows representative cases of focal inhomogeneity related to CAD system misdiagnosis.

Figure 2.
Representative images of computer-aided diagnosis (CAD) system false-positive misdiagnoses of focal inhomogeneities related to thyroiditis. (a) (B-mode thyroid US, axial) A 26-yearold female patient with clinically obvious Hashimoto thyroiditis. Surrounded by diffuse hypoechogenic inhomogeneity of the thyroid gland, a more circumscribed inhomogeneity is present in the ventral part of the right lobe. This appearance went unchanged for over three years of follow up in our department, rated as no nodule (K-TIRADS 1) by the radiologist. The region of interest for CAD analysis was placed over the circumscribed inhomogeneity. (b) (CAD output image) The CAD system interpreted the lesion as a nodule and rated possibly malignancy and a K-TIRADS 4 score. (c) (B-mode thyroid US, axial) A 31-year-old female patient also with clinically obvious Hashimoto thyroiditis. The thyroid appears diffusely hypoechogenic and a thyroid septum is visible in the right lobe causing the posterior part of the lobe mimicking a nodule. This appearance was unchanged for over 4 years of follow up in our department, rated as no nodule (K-TIRADS 1) by the radiologist. The region of interest for CAD analysis was placed over the posterior part of the right lobe encased by the septum. (d) (CAD output image) The CAD system interpreted the lesion as a nodule and rated possibly malignancy and a K-TIRADS 5 score. This appearance went unchanged for over three years of follow up in our department, rated as no nodule (K-TIRADS 1) by the radiologist. The region of interest for CAD analysis was placed over the circumscribed inhomogeneity. (b) (CAD output image) The CAD system interpreted the lesion as a nodule and rated possibly malignancy and a K-TIRADS 4 score. (c) (B-mode thyroid US, axial) A 31-year-old female patient also with clinically obvious Hashimoto thyroiditis. The thyroid appears diffusely hypoechogenic and a thyroid septum is visible in the right lobe causing the posterior part of the lobe mimicking a nodule. This rappearance was unchanged for over 4 years of follow up in our department, rated as no nodule (K-TIRADS 1) by the radiologist. The region of interest for CAD analysis was placed over the posterior part of the right lobe encased by the septum. (d) (CAD output image) The CAD system interpreted the lesion as a nodule and rated possibly malignancy and a K-TIRADS 5 score.

US Features or Entities Associated with CAD System Misdiagnosis Excluding Mimicking Lesions
in which a correct diagnosis was made by the radiologist (n = 142). True nodules related to thyroiditis, coarse macrocalcifications, and inspissated colloid cystic nodules were in a significantly higher rate present in the CAD incorrect group vs. the CAD correct group, with median CAD system K-TIRADS scores of 4, 5, and 4, respectively, while only one truly malignant case was present within these groups (Table 3). Figure 3 shows representative cases of these US features related to CAD system misdiagnosis. Non-parallel orientation, ill-defined margin, and irregular shape were in a significantly higher rate present in the CAD correct group and they were all malignant/surgery cases (Table 3). A representative example is shown in Figure 4.  In addition to focal inhomogeneities due to thyroiditis, a true nodule can be depicted in the right lobe, regarded benign and K-TIRADS 3 by the radiologist. FNAB was performed and yielded a benign result (Thy 2). (b) (CAD output image) The CAD system provided a high suspicion for malignancy (K-TIRADS 5) diagnosis. (c) (B-mode thyroid US, axial) A 63-year-old male patient with confluent well circumscribed isoechoic, partially cystic nodules in the left thyroid lobe with a coarse macrocalcification, scored K-TIRADS 3 nodule by the radiologist. FNAB provided benign diagnosis (Thy 2). (d) (CAD output image) The CAD system yielded a result of high suspicion for malignancy (K-TIRADS 5). (e) (B-mode thyroid US, axial) A 71-year-old male patient presenting with several clinically and radiologically pathological lymph nodes in right cervical lymph node regions and a nodule in the right thyroid lobe, which was well circumscribed, completely avascular, contained echogenic foci with comet tail artefacts, and was hypoechogenic. The radiologist diagnosed an inspissated colloid cystic nodule (K-TIRADS 2), yet performed FNAB due to the presence of pathological lymph nodes. (f) (B-mode thyroid US, axial, insert) During FNAB, the fluid content of the nodule was completely removed. The pathological lymph nodes were proved to be squamous cell carcinoma metastases, while the thyroid nodule was diagnosed benign (Thy 1c) by cytology. (g) (CAD output image) The CAD system rated the nodule to be possibly malignant, with an intermediate suspicion for malignancy (K-TIRADS 4). echogenic foci with comet tail artefacts, and was hypoechogenic. The radiologist diagnosed an inspissated colloid cystic nodule (K-TIRADS 2), yet performed FNAB due to the presence of pathological lymph nodes. (f) (B-mode thyroid US, axial, insert) During FNAB, the fluid content of the nodule was completely removed. The pathological lymph nodes were proved to be squamous cell carcinoma metastases, while the thyroid nodule was diagnosed benign (Thy 1c) by cytology. (g) (CAD output image) The CAD system rated the nodule to be possibly malignant, with an intermediate suspicion for malignancy (K-TIRADS 4).    Figure 5. Comparison of ROC curves of the radiologist and CAD system in the total cohort and the screened subgroup (in which cases of thyroid entities and nodule features identified to be associated with CAD system misdiagnosis were excluded) for detecting malignancy and nodules requiring surgery.
Results of the same tests conducted in the FNAB-only group (n = 121) are presented in Supplementary Material Table S2. Figure 5. Comparison of ROC curves of the radiologist and CAD system in the total cohort and the screened subgroup (in which cases of thyroid entities and nodule features identified to be associated with CAD system misdiagnosis were excluded) for detecting malignancy and nodules requiring surgery.

Discussion
To the best of our knowledge, no previous study has considered the importance of nodule mimicking lesions when applying CAD systems for thyroid. CAD systems were trained on true nodules and often shown to even outperform humans. In our opinion, such results may be very misleading regarding the actual feasibility of CAD systems, since they do not represent clinical practice, in which thyroid lesion differentiation (nodule vs. mimicking lesion) is of utmost importance. Such differentiation might be challenging, especially for less experienced users-the ones most probably willing to apply CAD. To test the significance of this problem, the present study focused not only on true nodules but mimicking lesions as well.
The overall diagnostic performance (AUROC and accuracy) regarding the experienced radiologist was comparable to previous studies [30,31,33], and was significantly higher when compared with the CAD system. The most substantial difference was found in specificity and positive predictive value, which were, respectively, roughly two and four times higher for the radiologist's detections. However, there was no significant difference in sensitivities, and negative predictive values were also very close. In the study by Kim et al. [32], who applied the same commercial CAD system S-Detect 2, CAD sensitivity (81.4%) and negative predictive value (84.9%) vs. radiologist (sensitivity = 84.9%, negative predictive value = 90.7%) were also similar, and CAD specificity (68.2%) was also significantly lower than that of the radiologist (96.2%). However, in our study, CAD specificity was even lower (40.5%). This is most likely due to the fact that in our study, not only cases of true nodules pre-selected by an expert were included, but also cases posing differential diagnostic problem for nodules such as focal inhomogeneities as well. Another important difference regarding the populations of the two studies is that Kim et al. included patients who were prior to scheduled surgery, and almost half of the analyzed nodules were malignant, while in the present study most patients underwent US for the first time or returned for a check-up of a benign thyroid entity, which resulted in a lower proportion of malignancies.
Thyroiditis related focal inhomogeneity appeared to be a differential diagnostic entity related to systematic CAD system misdiagnosis, i.e., false-positive detection, since the CAD system appreciated them as a nodule and almost always assigned them a K-TIRADS score of 5 due to "ill-defined borders" and "hypoechogenicity". In clinical practice, especially with less experienced users, such false-positive misdiagnoses may lead to high rates of unnecessary FNAB indications, keeping the high incidence of chronic thyroiditis in mind [42,43].
When considering true nodules, CAD system misdiagnosis was most strikingly related to nodules associated with coarse macrocalcifications. All of them were diagnosed false-positively as possibly malignant and were mostly assigned a K-TIRADS score of 5. We assume this is due to the acoustic shadow being assessed as solid hypoechogenicity with ill-defined borders. Inspissated colloid cystic nodules, proved by the aspiration of their fluid content and cytology, were again diagnosed as possibly malignant and were assigned a K-TIRADS score of 4 or 5 nodules by the CAD, mostly diagnosed with solid hypoechoic composition and microcalcification instead of colloid particles. The likelihood of CAD being inaccurate while evaluating microcalcification was also presented by Kim et al. [32].
US features such as ill-defined contour, non-parallel orientation, and irregular shape were, in turn, significantly associated with correct diagnosis by the CAD system. This is aligned with the finding regarding high CAD sensitivity, since all of these features are known to be associated with the risk of malignancy [8] and seem to be accurately picked up by the CAD.
In contrast with our results, several studies of non-commercial, yet offline applicable algorithms presented artificial diagnostic performance being as good or even better than radiologists' performance [19,20,29,34,56]. These studies included a very high (approximately 30-50%) rate of malignancies compared to "real life" thyroid malignancy incidence and malignant nodule rate [3,4,6]. Furthermore, in all these studies, the validation sets included only true nodules pre-selected by experts. Such nodule pre-selection and the acquisition of the most representative slice performed by humans is obviously a diagnostic procedure possibly significant in helping CAD systems to achieve their promising diagnostic performance results. This is underscored by our result in which the exclusion of cases posing thyroid nodule differential diagnosis (see results related to the exclusion of the "screened subgroup") significantly improves CAD outcomes. Moreover, Jeong et al. [36] showed how CAD outcomes are significantly operator dependent, even if operators run the analysis on exactly the same images of pre-selected nodules; however, they may differently position the nodule region of interest and select nodule contours.
In this study, the radiologist had the possibility to consider clinical data and scan the entire lesions for lesion differentiation and scoring. This might have constituted an advantage regarding diagnostic performance versus the CAD system, which could not rely on clinical data, and analyzed the lesions based on single slice images. However, the aim of this study was to make a comparison under true clinical circumstances and to find CAD limitations. To overcome such possible CAD shortcomings, the authors speculate that future CAD systems should have the option of including clinical data and the option of analyzing 3D inputs. Some attempts towards 3D nodule analysis have been already done [57,58].
It is important to note that there are certain differences among the different TIRADS systems (such as ACR and EU TIRADS) compared to the presently applied K-TIRADS; for a comparison of these systems, see the review by Chiara et al. [59].
This study has certain limitations. First, not all patients underwent FNAB. In these cases, however, the chance of malignancy was firmly ruled out according to strict criteria. The authors alleged including these cases to be relevant in effectively evaluating CAD performance in routine clinical practice and not only on human-selected nodules requiring FNAB. Still, the inclusion of non-FNAB cases carries the possibility of bias since the examination was performed by a single expert. Therefore, we ran all statistics in the FNAB-only group as well (see Supplementary Materials), which did not affect the main messages of the study. Second, in this study not only histologically confirmed malignancies or cytological Thy 6 cases were included as positives (malignant/surgery), but cytological Thy 5 cases as well, since we believe that the aim of thyroid US is primarily to detect nodules that require surgery and not substituting FNAB by attempting to provide final diagnosis. The number of Thy 5 cases without final histological diagnosis was, however, low (4 cases out of 200, 2%), and in all except of one of these cases, the CAD outcome (possibly malignant) was correct and was in agreement with the human outcome; therefore, including these cases did not affect the main findings of the present study related to false positive CAD results. Third, no correction for multiple comparisons was performed when identifying US entities and features possibly related to CAD system misdiagnosis. Nevertheless, the fact that all cases in these groups (except one truly malignant nodule related to thyroiditis) were false positive, and the exclusion of these cases significantly affected diagnostic specificity, is reassuring regarding the validity of this finding. Fourth, only a single CAD system was tested in this study; however, the results regarding misdiagnosis of nodule mimicking lesions most probably apply to all other thyroid US CAD systems, since to date none of them were reported to consider mimicking lesions. Fifth, it was not possible to analyze the causes of false-negative CAD detections because of the very low number of these cases (n = 3).

Conclusions
In a routine clinical thyroid US population, the commercially available CAD seems to be applicable for screening patients with the aim of excluding thyroid malignancies. However, certain nodule types, and especially mimicking lesions, resulted in systematic false-positive malignant diagnoses for this CAD system. Therefore, this system (and probably any system trained on true nodules only) may not be entirely effective in reducing unnecessary FNABs, especially when used by inexperienced users for whom the diagnosis of the above-mentioned entities may also prove daunting.
Future CAD systems regarding thyroid may be most useful in clinical practice if mimicking lesions were added to their training sets.