Comparative Analysis of Computer-Aided Diagnosis and Computer-Assisted Subjective Assessment in Thyroid Ultrasound

The value of computer-aided diagnosis (CAD) and computer-assisted techniques equipped with different TIRADS remains ambiguous. The parallel diagnostic performance of computer-assisted subjective assessments and CAD was compared based on AACE, ATA, EU, and KSThR TIRADS. CAD software computed the diagnosis of 162 thyroid nodule sonograms. Two raters (R1 and R2) independently rated the sonographic features of the nodules using an online risk calculator while blinded to pathology results. Diagnostic efficiency measures were calculated based on the final pathology results. R1 had higher diagnostic performance outcomes than CAD, with a similar pattern for KSThR (SEN: 90.3% vs. 83.9%, p = 0.57; SPEC: 51% vs. 46%, p = 0.21; AUROC: 0.76 vs. 0.67, p = 0.02) and EU (SEN: 85.5% vs. 79%, p = 0.82; SPEC: 62% vs. 55%, p = 0.27; AUROC: 0.74 vs. 0.67, p = 0.06). Similarly, R2 had higher AUROC and specificity but lower sensitivity than CAD (KSThR-AUROC: 0.74 vs. 0.67, p = 0.13; SPEC: 61% vs. 46%, p = 0.02 and SEN: 75.8% vs. 83.9%, p = 0.31, and EU-AUROC: 0.69 vs. 0.67, p = 0.57, SPEC: 64% vs. 55%, p = 0.19, and SEN: 71% vs. 79%, p = 0.51, respectively). CAD had higher sensitivity but lower specificity than both R1 and R2 with AACE for 114 specified nodules (SEN: 92.5% vs. 88.7%, p = 0.50; 92.5% vs. 79.3%, p = 0.02, and SPEC: 26.2% vs. 54.1%, p = 0.001; 26.2% vs. 62.3%, p < 0.001, respectively). All diagnostic performance outcomes were comparable for ATA with 96 specified nodules. Computer-assisted subjective interpretation using KSThR is better suited than CAD to ruling out papillary thyroid carcinomas. Larger multi-center and multi-rater prospective studies with a diverse representation of thyroid cancers are necessary to validate these findings.


Introduction
Ultrasound is the primary imaging modality in the assessment of thyroid nodules. Technological advancements have contributed to the increased use of ultrasound in thyroid cancer diagnosis, which has subsequently resulted in overdiagnosis [1,2]. Diverse malignancy risk stratification systems or Thyroid Imaging Reporting and Data Systems (TIRADS) have been developed to improve consistency in subjective interpretation and limit inter-observer variabilities [3-6]. Nevertheless, variability amongst different TIRADS still exists due to different malignancy risk estimation criteria for suspicious sonographic features [7]. Hence, there is currently no universal standard regarding the best TIRADS to use.
In recent years, an online-based multiple TIRADS malignancy risk scoring calculator based on subjective interpretation of sonographic features was developed [8]. The TIRADS outputs available with this online risk calculator are the American Association of Clinical Endocrinologists/American College of Endocrinology/Associazione Medici Endocrinologi (AACE/ACE/AME, referred to as AACE from hereon), American Thyroid Association (ATA), Korean Society of Thyroid Radiology (KSThR), and French TIRADS. This predictive model-based risk calculator has been evaluated in comparison with other similar subjective interpretation-based models and found to be highly accurate and reliable in thyroid nodule differentiation [9]. On the other hand, with artificial intelligence (AI) evolving, computer-aided diagnosis (CAD) systems have emerged and are suggested as an objective method of thyroid nodule diagnosis. One globally-approved commercial thyroid CAD software with multiple TIRADS computations is AmCAD-UT (AmCad Biomed, Taipei, Taiwan). This CAD software has been evaluated in comparison with human interpreters that were mostly using a single specific TIRADS. Some studies suggested that the CAD software has comparable diagnostic performance to that of experienced clinicians and could potentially improve that of less experienced ones [10,11]. Although both the online risk calculator and the CAD software offer automatic computation of a suggested diagnosis based on multiple TIRADS, there is presently a lack of comparative evaluation of their diagnostic performance for matched multiple TIRADS to inform clinical adoption. Studies evaluating CAD performance versus that of subjective interpreters have largely focused on the unaided qualitative or quantitative human rating of sonographic features.
Moreover, studies on different non-commercialized thyroid CAD technologies have yielded variable results for different TIRADS, thereby leaving the additional value of CAD still ambiguous [12].
The present study aimed to compare the diagnostic performance metrics of computer-assisted subjective analysis by two raters using the online risk calculator, and the commercially available CAD system, based on analogous outputs of different risk-stratification systems. Since CAD software has the potential for screening purposes in low human resource health settings that have no in-house radiologists, the diagnostic performance comparisons were between non-radiologists and CAD in this study. The study sought to establish the clinical implications of adopting these two computational thyroid nodule diagnosis aids for routine diagnosis in low-resource settings. The hypothesis was that the CAD system has higher diagnostic efficiency than computer-assisted subjective interpretation for all the TIRADS. The rationale is that CAD is suggested to be more objective and less prone to observer bias than subjective interpretation methods. The study findings showed that the sensitivity of CAD and the computer-assisted subjective raters were all consistently high whereas the specificity of the CAD was lower than that of both subjective raters regardless of the TIRADS used.

Study Type and Data Sources
This retrospective study was approved by the Human Subjects Ethics Subcommittee of The Hong Kong Polytechnic University (Registration Number: HSEARS20190123004) and adhered to the Declaration of Helsinki guidelines. A consecutive case analysis approach was used for the data collection of thyroid nodule ultrasound images. Informed consent was waived for this retrospective study.

Image Selection Criteria
A total of 162 thyroid nodule ultrasound images were eligible for the analysis from the thyroid nodule images of patients that were prospectively scanned by our research group in the period between May 2019 and May 2021 (Figure 1). Standard thyroid ultrasound imaging protocols were observed to acquire the images using a Supersonic Aixplorer ultrasound machine (SuperSonic Imagine, Aix-en-Provence, France) and a 2-10 MHz linear transducer. A diagnostic radiographer with sonography experience and about 3 years of experience in thyroid ultrasound imaging solely performed the thyroid ultrasound scans. Ultrasound-guided fine-needle aspiration cytology (FNAC) was then independently conducted by two thyroid surgeons with extensive experience who later provided the cytological and/or histopathological diagnosis of the thyroid nodules.
The inclusion criteria were diagnostically acceptable thyroid nodule grey-scale ultrasound images from adult patients who had undergone ultrasound evaluation for thyroid cancer suspicion; and nodules ≥ 5 mm with complete size measurements in both transverse and longitudinal planes and taller-than-wide ratio assessment and confirmatory cytological and/or histopathological results. Ultrasound images of nodules that were too large for the ultrasound probe to completely show the lesion and patients without cytology or histopathology results were excluded from the study to not compromise the accuracy of the findings.
Images of nodules < 5 mm were excluded because they were below the size criteria of the web-based malignancy risk assessment system. The reference standard was the conclusive FNAC and/or histopathology results of the nodules.
Transverse plane images demonstrating the most features suggestive of benignity or malignancy were selected and areas clearly demonstrating the nodule and its relationship to the adjacent thyroid parenchyma and surrounding structures were separated from the entire image. The new nodule-specific images were anonymized, coded, and saved in JPEG format.

Analyses of the Thyroid Nodule Images
The analyses were conducted separately with an online malignancy risk assessment system (http://www.gap.kr/xe/Estimation (accessed on 5 August 2021)) and AmCAD-UT version 2.2 (AmCad Biomed, Taipei, Taiwan) for the same nodules. The computer-assisted subjective analyses with the risk calculator were conducted first and the CAD analysis was performed after two weeks.

Computer-Assisted Subjective Risk Assessment
Two raters independently reviewed the same set of ultrasound images and evaluated the ultrasound features of the thyroid nodules using the stipulated rating criteria of the online risk calculator (Figure 2A). Rater 1 (R1) was the radiographer who had performed all the thyroid scans. Since real-time image rating using the risk calculator was not practically feasible at the time of scanning, R1 rated the thyroid nodules at the same time as, but independently of, another rater (Rater 2, R2) for comparison purposes. R2 was a senior sonographer, with over 15 years of experience, who was not involved in the imaging process and first encountered the ultrasound images during the rating process. Both raters were blinded to the cytology and histopathology results during their individual rating process.
The online calculator computed the malignancy risk based on a rater's subjective assessment of the composition, margins, echogenicity, shape, and calcification from the images of each of the thyroid nodules [8]. In this study, a calculated taller-than-wide ratio of >1 was used to determine if a nodule was taller-than-wide in addition to subjective visual assessment [13,14]. The malignancy risk assessment was automatically computed as risk stratification category outputs for AACE, ATA, KTA/KSThR, French TIRADS, and an estimated malignancy risk (EMR) score (Figure 2B). In this study, the risk stratification outputs for the French TIRADS were converted to EU-TIRADS, an updated version of French TIRADS, based on the corresponding malignancy risk estimation percentages, for comparison with the AmCAD-UT EU-TIRADS outputs.

CAD Assessment
The coded JPEG thyroid nodule images were uploaded onto the AmCAD-UT software user interface for analysis. The AmCAD-UT algorithm allows for the selection of automated, semi-automated, or manual outline of the region of interest (ROI) for detecting the nodule for risk stratification. For this study, the default selection for outlining the ROI was the automated nodule segmentation with semi-automated manual correction when necessary. When the automatically segmented ROI demonstrated satisfactory nodule boundary outlines, the computed diagnosis was accepted as valid. When the automated nodule segmentation missed the nodule, or under- or over-estimated the nodule boundaries, manual correction was applied to the automated outline as determined by the more experienced of the two raters. Only 15 out of 162 nodules (9.3%) required manual correction of the automated ROI in the present study. The malignancy risk category output for each TIRADS, taller-than-wide ratio output, and sonographic characteristics outputs for each nodule on CAD (Figure 3) were compared to the corresponding entries for each computer-assisted rater. There was no output for the EMR score in CAD, and hence this was not compared between the different approaches.

Data Analysis and Statistical Analysis
All statistical analyses were performed using the SPSS software package (version 25.0, SPSS Inc., Chicago, IL, USA). Categorical variables were expressed as frequencies and continuous variables were expressed as mean values ± standard deviation. The Chi-square test was used to compare differences in classification data while the Mann-Whitney U test was used to compare continuous variables. Goodman and Kruskal's gamma correlation coefficient (G or γ) was used to measure the ordinal association of the sonographic features coded by the different raters. The gamma coefficient was interpreted as 0.01-0.30 negligible association, 0.31-0.50 low association, 0.51-0.70 moderate association, 0.71-0.90 high association, and 0.91-1.00 very high association [15]. Inter-rater reliability was estimated using Cohen's kappa statistic (κ). Proportions of agreement between paired ratings, based on the different TIRADS category cut-off points for malignancy risk stratification, were also used to determine absolute rater agreement. The kappa result was interpreted as follows: 0.01-0.20 none to slight agreement, 0.21-0.40 fair agreement, 0.41-0.60 moderate agreement, 0.61-0.80 substantial agreement, and 0.81-1.00 almost perfect agreement [16]. The sensitivity (SEN), specificity (SPEC), negative likelihood ratios (NLR), positive likelihood ratios (PLR), diagnostic odds ratio (DOR), and their corresponding 95% confidence intervals (CI) were calculated with reference to final cytology or pathology results. McNemar's test and Cochran's Q test for multiple comparisons were used for the comparative analysis of SEN, SPEC, and diagnostic accuracy (DA), and post-hoc McNemar analyses were employed when the Cochran's Q test was statistically significant. The differences in positive and negative predictive values (PPV and NPV) were evaluated using a two-sample proportion test whereas the differences in LRs and DORs were evaluated based on 95% CIs, where non-overlapping intervals denoted statistical significance.
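The paired comparison of sensitivities or specificities described above can be illustrated with a minimal sketch. It assumes the exact (binomial) form of McNemar's test; the study ran its analyses in SPSS, and which variant SPSS applied is not stated here. The function name `mcnemar_exact` and the discordant-pair counts are hypothetical and are not study data.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b = nodules classified correctly by method A but not by method B
    c = nodules classified correctly by method B but not by method A
    Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Lower tail of Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts among malignant nodules (illustrative only)
p_value = mcnemar_exact(b=10, c=2)
print(f"exact McNemar p = {p_value:.4f}")
```

Because only discordant pairs contribute, two methods can have visibly different sensitivities and still yield a non-significant result when few nodules are classified differently.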
Receiver operating characteristic (ROC) curves were generated to obtain the area under the ROC curve (AUROC), and the SPSS software computed the differences in AUROCs for paired comparisons. The optimal cut-off points for differentiating benign and malignant thyroid nodules were defined by the highest Youden's J index based on the different categorizations of malignancy risk stratification for each of the four TIRADS used.
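The cut-off selection by Youden's J index (J = sensitivity + specificity - 1) can be sketched as follows. This is an illustration under stated assumptions rather than the SPSS procedure used in the study: risk categories are coded as ordered integers, a nodule is called positive when its category is at or above the cut-off, and the function name `youden_cutoff` and the example data are hypothetical.

```python
def youden_cutoff(categories, malignant):
    """Return (best_cutoff, best_j) over all observed category cut-offs.

    categories: ordinal risk category per nodule (higher = more suspicious)
    malignant:  1 if the reference standard is malignant, else 0
    """
    n_pos = sum(malignant)
    n_neg = len(malignant) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one malignant and one benign nodule")
    best = None
    for cutoff in sorted(set(categories)):
        pairs = list(zip(categories, malignant))
        tp = sum(1 for c, m in pairs if m and c >= cutoff)
        tn = sum(1 for c, m in pairs if not m and c < cutoff)
        j = tp / n_pos + tn / n_neg - 1  # sensitivity + specificity - 1
        if best is None or j > best[1]:
            best = (cutoff, j)
    return best


# Hypothetical categories (2 to 5) and pathology labels, illustrative only
cats = [2, 3, 3, 4, 4, 5, 5, 5, 3, 2]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
cutoff, j = youden_cutoff(cats, labels)
print(cutoff, round(j, 3))
```

As the Limitations section notes for the study itself, a cut-off derived this way is data-dependent and needs validation on independent data.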

Nodule Sonographic Feature Classifications by Human Subjective Assessment and CAD
The different raters and the CAD system coded the sonographic features of the thyroid nodules based on echogenicity, calcifications, margins, composition, and shape. The results of the different categorizations are shown in Table 1. There were significant differences between the classifications of all sonographic features of benign and malignant nodules for both human raters (p < 0.05), whereas for the CAD system significant differences were only observed for calcifications (p = 0.001). For both human raters and CAD, a larger proportion of malignant than benign nodules was classified as hypoechoic (R1 = 66.1% vs. 48%, R2 = 72.6% vs. 49% and CAD = 48.4% vs. 34%). All raters classified the majority of the malignant nodules as either predominantly solid or solid (R1 = 16.1% and 79%, R2 = 54.8% and 41.9% and CAD = 38.7% and 61.3%, respectively). The classification of microcalcifications was predominant in malignant nodules when rated by R1 and CAD (46.8% and 51.6%, respectively), but with R2 malignant nodules were interpreted as mostly without calcifications or with microcalcifications (both 45.2%).

Classification Correlation Comparisons between Subjective Ratings and CAD
There was a high association in the rating of echogenicity of malignant nodules between each human rater and CAD (R1 vs. CAD, G = 0.74; R2 vs. CAD, G = 0.73), and a very high association between R1 and R2 (G = 0.91), as shown in Table 2. The human raters had a high association in stratifying calcifications and composition in all nodules and in the separate groups of malignant and benign nodules (G > 0.7). There was a negligible to low association in classifying nodule margins between each of the human raters and CAD for malignant, benign, and all nodules (G < 0.5). The rank correlation association for categorizing the shape of benign nodules was generally high between each human rater and CAD and between the human raters (R1 vs. CAD, G = 0.81; R2 vs. CAD, G = 0.86; and R1 vs. R2, G = 0.85, respectively).
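For readers unfamiliar with the gamma coefficient reported above, the following minimal sketch shows how it is obtained from concordant and discordant pairs of ordinal ratings, with tied pairs ignored. The function name `goodman_kruskal_gamma` and the example ratings are hypothetical; they are not the codes used in Table 2.

```python
from itertools import combinations


def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D) over all pairs of rated nodules.

    x, y: ordinal codes given to the same nodules by two raters.
    A pair is concordant when both raters order the two nodules the
    same way, discordant when they order them oppositely; pairs tied
    on either rating are left out.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical ordinal codes for six nodules (illustrative only)
rater_a = [1, 2, 2, 3, 3, 4]
rater_b = [1, 2, 3, 3, 2, 4]
print(round(goodman_kruskal_gamma(rater_a, rater_b), 2))
```

Because ties are discarded, gamma can be high even when the two raters frequently assign different adjacent categories, which is one reason it is reported alongside kappa rather than instead of it.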

Rater Agreement Based on TIRADS
The rater agreement based on the malignancy cut-off points of the different TIRADS was generally moderate to substantial for malignant, benign, and all nodules between the human raters (0.41 ≤ κ < 0.81), with the highest agreement achieved with ATA TIRADS (κ = 0.77). The results are shown in Table 3. The rater agreement between each of the human raters and the CAD was highest based on ATA TIRADS between R1 and CAD for all nodules (κ = 0.75), and lowest based on AACE for malignant nodules between R1 and CAD (κ = 0.12) and between R2 and CAD (κ = 0.14). There was a fair rate of agreement for classifying benign nodules with AACE (κ = 0.32) between R1 and CAD, and with ATA, EU, and KSThR (κ = 0.40, 0.24, and 0.23, respectively) between R2 and CAD. Proportions of agreement between the different paired raters amongst all TIRADS were generally high in contrast to the moderate kappa values, although the agreement between R2 and CAD was low to moderate for benign nodules with all TIRADS (AACE = 50.8%, ATA = 73.7%, EU = 63%, and KSThR = 61%; Supplementary Table S3).
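The contrast between high proportions of agreement and only moderate kappa values is expected when one category dominates, because kappa discounts the agreement that would occur by chance. A minimal sketch with hypothetical dichotomized ratings (the function name `agreement_and_kappa` and the data are illustrative assumptions, not study data) makes this concrete:

```python
from collections import Counter


def agreement_and_kappa(a, b):
    """Return (proportion of agreement, Cohen's kappa) for two raters."""
    n = len(a)
    observed = sum(1 for i, j in zip(a, b) if i == j) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance from each rater's marginal frequencies
    expected = sum(count_a[k] * count_b[k] for k in count_a) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return observed, (observed - expected) / (1 - expected)


# Hypothetical calls at a TIRADS cut-off (1 = suspicious, 0 = not)
rater_a = [1] * 17 + [1, 0, 0]
rater_b = [1] * 17 + [0, 1, 0]
observed, kappa = agreement_and_kappa(rater_a, rater_b)
print(f"agreement = {observed:.2f}, kappa = {kappa:.2f}")
```

In this example the raters agree on 18 of 20 nodules, yet kappa falls in the moderate band because both raters call most nodules suspicious.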

Diagnostic Performance Assessment of CAD and Computer-Assisted Raters for Matched TIRADS
The diagnostic performance outcomes for the two computer-assisted subjective raters and CAD were assessed for different TIRADS as outlined in Table 4. The best diagnostic performance for the different TIRADS was achieved at high risk (category 3) for AACE, high suspicion (category 5) for ATA and EU, and intermediate suspicion (category 4) for KSThR. EU and KSThR TIRADS were able to specify all nodules regardless of the rater, whereas AACE rating with CAD failed to specify some nodules (39 benign, 9 malignant) while ATA failed to specify some nodules regardless of the rater (CAD: 30 benign, 19 malignant; R1: 16 benign, 10 malignant; and R2: 15 benign, malignant). Overall, the nodules that all raters could specify with AACE and ATA numbered 114 (61 benign, 53 malignant) and 96 (57 benign, 39 malignant), respectively.
Abbreviations used in Table 4: N = total nodules specified, SEN = sensitivity, SPEC = specificity, PLR = positive likelihood ratio, NLR = negative likelihood ratio, DOR = diagnostic odds ratio, AUROC = area under the receiver operating characteristic curve, CI = 95% confidence interval.
Based on the different TIRADS, CAD yielded the highest sensitivity but lowest specificity and AUROC amongst all raters with AACE (92.5%, 26.2%, and 0.59, respectively), all of which differed significantly from those of R2 (79.3%, 62.3%, and 0.72, p = 0.02 and <0.001). For stratifying all 162 nodules, R1 had overall higher diagnostic performance than CAD for all metrics for EU and KSThR. Although the differences were not statistically significant for EU, there was a statistically significant difference in AUROC for KSThR (0.76 vs. 0.67, p = 0.02; Supplementary Table S2). R2 had comparable sensitivity but higher specificity than CAD for KSThR (75.8% vs. 83.9%, and 61% vs. 46%, p = 0.02, respectively). Between the two computer-assisted subjective raters, there were statistically significant differences in sensitivity, but comparable specificity and AUROCs for both EU (85.5% vs. 71%, p = 0.04; 62% vs. 64% and 0.74 vs. 0.69, respectively) and KSThR (90.3% vs. 75.8%, p = 0.01; 51% vs. 61% and 0.76 vs. 0.74, respectively). Overall, CAD generally had lower PLRs, although these were comparable to those of the computer-assisted raters, while the lowest NLR was achieved with computer-assisted rating (R1 using KSThR: 0.19). The highest specificity and PLR across all raters were achieved with ATA, for which sensitivity, specificity, and AUROC were comparable amongst all raters. At its best performance, the computer-assisted approach had a higher DOR (>9) and a higher AUROC (>0.7) than the CAD-based approach on all the TIRADS. Across all TIRADS, all raters yielded high sensitivity and high NPVs, but low-to-moderate SPEC, PPVs, and DAs (Supplementary Table S1).

Discussion
The results of the current study demonstrated that for matched pairs of risk-stratification systems, although the two approaches had comparable diagnostic performance, computer-assisted subjective interpretation using KSThR yielded a higher overall diagnostic accuracy than computer-aided diagnosis.

Interpretation of the Study Findings for Sonographic Feature Ratings between the CAD and Computer-Assisted Approaches
The rank correlation associations of ratings of sonographic features were generally high between the computer-assisted subjective raters for echogenicity, calcifications, and composition and negligible for margin ratings. This implies that the two computer-assisted raters mainly varied in rating nodule margins. Margin characteristics are among the sonographic features highly predictive of malignancy. Therefore, the differences likely influenced the final malignancy-risk computation using the online calculator for AACE, EU, and KSThR TIRADS. Comparatively, for CAD vs. either subjective rater, moderate association existed mostly for echogenicity and shape. While the rater agreement was mostly moderate between R1 and R2, the comparable sensitivities and specificities reflect how the computer-assisted scoring approach accounts for diverse rating criteria in determining a risk category.
The moderate association and fair-to-moderate inter-rater agreement between CAD and each of R1 and R2, despite comparable sensitivities, may be attributed to CAD's reliance on textural and statistical feature analysis based on supervised machine learning [17,18]. While individual sonographic ratings may have been different, CAD outputs are influenced by the detected sonographic features within the automated or selected ROI. The sonographic features that are detected within an ROI depend on how a particular CAD algorithm was trained with different images for malignancy risk stratification. Therefore, while CAD interprets the same sonographic features that a subjective interpreter inputs for computation by a risk-calculator model, image quality can make CAD more prone to flagging misinterpreted features as suspicious. Conversely, an experienced human assessor may still be able to accurately interpret an image with artefacts that CAD is sensitive to.

Interpretation of the Study's Diagnostic Performance Outcomes
In the present study, two raters independently using an online-based risk calculator had similar sensitivity and good diagnostic accuracy based on AUROC, with higher specificity across all TIRADS than the CAD. However, statistically significant differences in specificity were only observed using KSThR and AACE. For all four TIRADS, the PLRs were generally higher for the computer-assisted subjective raters than the CAD and the DOR was highest with R1 using any of the TIRADS (>9). For EU and KSThR, both R1 and R2 had comparable sensitivity with CAD; however, there were statistically significant differences in sensitivity between the two raters. The implication of this is that CAD systems can be an objective second opinion resource in the event of ambiguity with subjective outputs. However, automated web-based risk systems with simultaneous output for multiple TIRADS may potentially overcome challenges with subjective ambiguity and the bias towards high sensitivity but low specificity of commercially-available CAD. Deep learning-based CAD approaches have been suggested to be more accurate and improve specificity; however, current studies on the commercially-available deep learning-based S-Detect 2 (Samsung Medison Co. Ltd., Seoul, Korea) still show low specificity [19-22]. AmCAD-UT also uses a deep-learning analysis approach for automated ROI selection and it similarly resulted in lower specificities than the computer-assisted approach in the present study.
The comparable sensitivity but low specificity of commercial and non-commercial CAD systems to that of experienced clinicians has been established in previous studies [20,23-25]. However, a few studies have shown higher specificity with CAD in comparison with human examiners of variable experience [10,26]. A recent multi-center study on CAD based on KSThR yielded a good AUROC (0.75), with the highest sensitivity (90.5%) but the lowest specificity (49.6%) compared with the radiologists regardless of their experience [11]. However, in the present study, the KSThR TIRADS had the highest AUROC with R1 and R2 (0.76 and 0.74, respectively) with the highest sensitivity achieved by computer-assisted rater R1 (SEN: 90.3%; SPEC: 51%) whereas CAD had a lower AUROC and specificity, but comparable sensitivity (0.67, 46%, and 83.9%, respectively). The multi-center study suggested that CAD KSThR be reserved for large cancer screening with subjective assistance supplemented by another TIRADS to increase specificity. However, the present study's findings, particularly the lowest NLR of 0.19 (0.09; 0.42) by R1 using KSThR, suggest that the computer-assisted approach would be better than CAD for ruling out disease. Nonetheless, the present study had a smaller sample size and fewer raters, and hence there is a need for further validation studies.
Although ATA demonstrated good nodule discriminating ability (AUROC ≥ 0.7) overall for both approaches, this was achieved with 96 nodules due to a high rate of non-specified nodules using CAD (30.2% vs. R1 = 16% vs. R2 = 12.3%). Computer-assisted rating with AACE specified all nodules whereas CAD did not specify 29.6%, thereby suggesting that the risk-calculator model is more efficient than CAD for this TIRADS.

Meaning of the Study and Implications
The computer-assisted subjective assessment approach had comparable diagnostic performance to that of the CAD approach for all four TIRADS. However, the high sensitivity of the CAD is outweighed by a lower specificity, thereby resulting in a lower diagnostic accuracy than that of computer-assisted subjective interpretation using KSThR. Complementary to sensitivity and specificity outcomes, the DOR and PLR outcomes may aid the choice of TIRADS and approach to consider for clinical adoption as they are not prevalence-dependent [27,28]. With a best DOR of almost 10 for computer-assisted rating compared with about 5 for CAD when rating all nodules, computer-assisted rating appears superior to CAD when using KSThR or EU, mainly for detecting papillary thyroid carcinomas. However, both approaches have the potential for clinical diagnostic workflow adoption by non-radiologists experienced in thyroid ultrasound imaging for screening purposes in low-resource settings due to comparable high sensitivities and NPVs. Nevertheless, where parallel use of both approaches can be adopted based on the TIRADS that can stratify all nodules, the best choice for rule-out purposes is likely computer-assisted subjective interpretation using KSThR. Either approach using EU may suffice for rule-in purposes. The use of ATA and AACE is probably best with computer-assisted subjective interpretation due to higher rates of non-specified nodules with CAD.
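The relationship between the reported sensitivities and specificities and the prevalence-independent measures discussed above can be checked with a short sketch. It uses the standard definitions PLR = SEN / (1 - SPEC), NLR = (1 - SEN) / SPEC, and DOR = PLR / NLR, applied to the rounded KSThR values quoted in the text for R1 (90.3%, 51%) and CAD (83.9%, 46%). The function name `likelihood_ratios` is hypothetical, and results computed from rounded percentages only approximate the Table 4 values derived from raw counts.

```python
def likelihood_ratios(sensitivity: float, specificity: float):
    """Return (PLR, NLR, DOR) from sensitivity and specificity (0 to 1)."""
    plr = sensitivity / (1 - specificity)   # how much a positive raises the odds
    nlr = (1 - sensitivity) / specificity   # how much a negative lowers the odds
    return plr, nlr, plr / nlr


# Rounded KSThR values quoted in the text
r1_plr, r1_nlr, r1_dor = likelihood_ratios(0.903, 0.51)
cad_plr, cad_nlr, cad_dor = likelihood_ratios(0.839, 0.46)

print(f"R1:  PLR = {r1_plr:.2f}, NLR = {r1_nlr:.2f}, DOR = {r1_dor:.1f}")
print(f"CAD: PLR = {cad_plr:.2f}, NLR = {cad_nlr:.2f}, DOR = {cad_dor:.1f}")
```

For R1 this reproduces the NLR of 0.19 cited in the Discussion and a DOR just under 10, while the CAD values give a DOR between 4 and 5, in line with the approximate figures stated above.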

Limitations and Directions for Future Research
Limitations of this study include the retrospective nature of the selection of patients' images with FNAC and/or histopathology results which cannot exclude selection bias. Secondly, the optimal cut-off points for the different TIRADS were derived from the data but not pre-determined and therefore require validation. Thirdly, the sample size was small with the malignant nodules being mostly papillary thyroid cancer, thereby limiting the generalization to the general population and other thyroid cancers. The prevalence of malignancy within this study (38%) may not be reflective of the actual prevalence in the general population. The value of real-time, subjectively-assisted CAD compared to retrospective automated CAD analysis and multiple computer-assisted raters of diverse experiences needs to be explored. Therefore, larger standardized prospective studies with a diverse representation of thyroid cancers and multiple raters are warranted to assess the validity and generalizability of the findings.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/life11111148/s1, Table S1: Prevalence-based diagnostic performance outcomes with sensitivity and specificity, Table S2: p-values for SEN, SPEC and AUROC comparisons between subjective raters and CAD per TIRADS.
Data Availability Statement: The clinical and ultrasound data are unavailable publicly due to patient confidentiality and privacy protection reasons.