Assessing Detection Accuracy of Computerized Sonographic Features and Computer-Assisted Reading Performance in Differentiating Thyroid Cancers

For ultrasound imaging of thyroid nodules, medical guidelines all rely on findings of sonographic features to provide clinicians with management recommendations. Owing to recent developments in artificial intelligence and machine learning (AI/ML) technologies, computer-assisted detection (CAD) software devices are now available for clinical use to detect and quantify the sonographic features of thyroid nodules. This study aims to validate the accuracy of the computerized sonographic features (CSF) produced by one CAD software device, AmCAD-UT, and then to assess how the reading performance of clinicians (readers) can be improved when the computerized features are provided. The feature detection accuracy is tested against the ground truth established by a panel of thyroid specialists, and a multiple-reader multiple-case (MRMC) study is performed to assess the sequential reading performance with the assistance of the CSF. Five computerized features, namely anechoic area, hyperechoic foci, hypoechoic pattern, heterogeneous texture, and indistinct margin, were tested, with AUCs ranging from 0.888~0.946, 0.825~0.913, 0.812~0.847, 0.627~0.77, and 0.676~0.766, respectively. With the five CSFs, the sequential reading performance of 18 clinicians improved significantly, with the AUC increasing from 0.720 without CSF to 0.776 with CSF. Our studies show that the computerized features are consistent with the clinicians’ findings and provide additional value in assisting sonographic diagnosis.


Introduction
Thyroid cancer is the most common endocrine cancer, and its incidence has increased dramatically by an average of 4.5% annually [1]. Accurate identification of thyroid cancer is crucial for effective treatment. Ultrasonography is the most common tool for early detection of thyroid cancer because it is readily available and noninvasive. In the past decade, use of high-resolution ultrasound has resulted in improved detection of thyroid nodules [2,3]. Nevertheless, most of the nodules are benign, and thyroid cancers only account for 7-15% of detected nodules [4]. Identification of malignant nodules is critical to avoid unnecessary fine-needle aspiration (FNA) biopsy and surgical procedures. Medical guidelines, most notably the management guidelines by the American Thyroid Association (ATA) [4] and the Thyroid Imaging Reporting and Data System (TI-RADS) by the American College of Radiology (ACR) [5], have been developed to advise clinicians on how to identify and use the sonographic features for differentiation of thyroid cancers. Important features include micro-calcifications, hypo-echogenicity, irregular margins, taller-than-wide shape, etc. [4][6][7][8][9][10]. However, the presence of the sonographic features is determined based on a physician's subjective interpretation, which may be influenced by education and experience. Interpretation discrepancies among clinicians, or even by the same clinician at different times, have become a major issue that hinders the diagnosis and treatment of thyroid nodules [11,12].
Recent development of AI/ML technologies has given rise to the use of CAD software devices in clinical practice [13,14], to assist clinicians in improving their diagnostic accuracy and workflow effectiveness. CAD solutions have been applied to ultrasound imaging of various diseases, such as breast cancer [15,16] and liver lesions [17]. One CAD software device has emerged to give a second opinion on image interpretation [18] and to reduce inter-observer variation in breast images [16,19]. Studies have also reported applications of CAD devices to ultrasound imaging of thyroid nodules [20][21][22][23][24][25][26][27][28]. In particular, the effectiveness of computerized sonographic features (CSF) in differentiating malignant nodules has been demonstrated [20][21][22][23][24]. The CSFs provided by a CAD device are used to assist clinicians in interpreting the images and then making recommendations based on the clinicians' professional judgment and/or medical guidelines. Although both the clinicians' findings and the computerization of sonographic features have been shown to be helpful in thyroid cancer diagnosis, the computerized sonographic features have not yet been validated by comparing them to the clinicians' sonographic feature findings or by studying their effect in assisting clinicians' reading of thyroid sonograms.
In this study, an FDA-cleared CAD software device, AmCAD-UT (AmCad BioMed Co., Taipei, Taiwan) (Figure 1a), is employed. We validate the software's detection and quantification of the sonographic features (Figure 1b) against the ground truth established by a panel of thyroid specialists and then perform an MRMC study to test the readers' performance when sequentially assisted with the computerized features calculated by the CAD device.

Database for Computerized Features Testing
The Institutional Review Board of the National Taiwan University Hospital (NTUH) approved this prospective study (200805039R). Informed consent was obtained from all participants and all patient identifiers were removed from the images used in the study. The database consisted of a collection of thyroid sonograms of patients who underwent a thyroidectomy at NTUH due to suspicious thyroid carcinoma, follicular neoplasm, or symptomatic nodular goiter diagnosed by ultrasound imaging and FNA cytology. Since the quality of the sonograms greatly depended on the ultrasound scanners and might affect both the clinicians' findings and the computerization of the sonographic features, we collected for this study sonograms obtained from different ultrasound scanners. The sonograms were acquired in DICOM format using Philips HDI 5000 (denoted as PH), GE Voluson 730 PRO (denoted as GE), and ALOKA Prosound2 (denoted as AL) ultrasound scanners, with 5-12 MHz linear multifrequency probes under identical imaging settings by certified technicians. The images were ineligible if the nodule size was larger than the width of the probe array or if the single nodule was not separable from another in cases of multinodular goiters. Finally, sonograms of 823 nodules (663 patients) were included in the database (Figure 2a). The major ethnic group was Chinese. Our previous studies investigated the effectiveness of specific computerized sonographic features, namely, calcification, heterogeneity, or echogenicity, in distinguishing malignant from benign nodules [20][21][22][23][24], whereas the current study validates the detection accuracy against the ground truth established by a panel of three thyroid specialists. A set of images was used for algorithm validation, consisting of 170 sonograms (102 benign and 68 malignant). In total, 653 sonograms were used for testing. 
Of these, 352 sonograms were randomly chosen for testing the feature detection accuracy, and another 150 images from the Philips scanner were chosen for reading performance assessment (Table 1a). The pathology results of the nodules and cancer type distribution are shown in Table 1b. Among the 150 sonograms, 20 (10 benign and 10 malignant) were used as the training set for the 18 readers to get familiar with the CAD software user interface and to train themselves using the software to differentiate malignant nodules. The remaining 130 sonograms were used for the MRMC study (Figure 2b).

Testing CSF Detection Accuracy
The AmCAD-UT software device was developed to assist clinicians in analyzing the regions of interest (ROI) on the thyroid sonograms. Figure 1a shows the interface of the AmCAD-UT software. The software produced five computerized sonographic features (CSFs), namely, anechoic areas, hyperechoic foci, hypoechoic patterns, heterogeneous textures, and indistinct margins, providing their quantified values and visualizing them, using colors, to assist clinical differentiation of thyroid nodules. The quantification and visualization algorithms of AmCAD-UT have been disclosed in previous studies [20][21][22][23][24].
To validate the feature detection, we tested the quantified value of each sonographic feature against the ground truth, which indicated the presence or absence of each feature on the sonogram (Table 2). The ground truth was determined by a panel of three thyroid specialists. All three specialists, certified to interpret the sonograms, with an average of 8.6 years (range 8-10 years) of experience and an average of 1366 readings (ranging from 700 to 2400 readings), independently read the sonograms to define the ROIs of the nodules and to determine the presence or absence of each feature. In cases of a discrepancy in interpretation, consensus was achieved by discussion among the panelists. For all sonograms, the CAD device automatically performed detection and quantification of each feature and electronically stored the values in a database for subsequent analysis. The quantified values of nodules with the feature present and those without the feature were compared using a t-test, and a p-value less than 0.05 was considered statistically significant. Since each feature was quantified as a continuous value, with a larger value indicating a higher likelihood of presence, the receiver operating characteristic (ROC) curve for each feature was also generated against the corresponding "ground truth", with the area under the ROC curve (AUC) calculated to represent the detection accuracy. If the detection accuracy was 100%, i.e., AUC = 1.0, it meant that a cut-off point could be found for the CSF to determine the presence of the sonographic feature, such that the findings by the computer and by the specialist panel were in 100% agreement. The higher the AUC value, the higher the agreement between the CSF and the ground truth. The statistical analysis was performed using MedCalc version 10.6.0.0 (MedCalc Software Ltd, Ostend, Belgium).
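The relationship between the AUC and the existence of a cut-off point can be illustrated with a minimal sketch. It is not the MedCalc procedure used in the study: the function names `auc_from_scores` and `best_cutoff`, the toy feature values, and the use of Youden's J to pick the cut-off are all assumptions made for illustration. The AUC is computed empirically as the fraction of (feature-present, feature-absent) pairs in which the feature-present nodule has the larger quantified value.

```python
from typing import Optional, Sequence, Tuple


def auc_from_scores(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Empirical AUC: chance that a feature-present case (label 1) has a
    larger quantified value than a feature-absent case (label 0).
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one present and one absent case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def best_cutoff(scores: Sequence[float],
                labels: Sequence[int]) -> Tuple[Optional[float], float]:
    """Return (cut-off, J) maximizing Youden's J = sens + spec - 1.
    A case is called 'feature present' when its score >= cut-off."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_j, best_c = float("-inf"), None
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= c)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < c)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_j, best_c = j, c
    return best_c, best_j


# Hypothetical quantified values; 1 = panel says the feature is present.
panel = [0, 0, 0, 1, 1, 1]
separable = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]    # classes do not overlap
overlapping = [0.1, 0.6, 0.3, 0.5, 0.8, 0.9]  # one absent case scores high
```

With the non-overlapping values the AUC is 1.0 and a cut-off reproduces the panel's reading exactly; with the overlapping values the AUC drops below 1.0 and no single cut-off agrees with the panel on every case.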

Assessing Diagnosis Performance of Readers Assisted with CSF
To assess whether the assistance of CSF provided by the CAD device can improve the readers' diagnostic accuracy, a multiple-reader multiple-case (MRMC) study [29][30][31] was performed. We recruited 18 clinicians (readers) to read 130 thyroid nodule sonograms. All the clinicians had been licensed to perform ultrasound scans and interpret the sonograms for an average of 8.78 years (ranging from 1 to 25 years of experience), but had no experience in using AmCAD-UT. No reader had foreknowledge of the corresponding pathology results (benign or malignant) of the thyroid nodules. We first trained the readers with 20 training sonograms and the corresponding pathology results (10 benign and 10 malignant) to familiarize the readers with the interface of the CAD software. Then, each of the 18 readers read each of the 130 sonograms first without the CSF and then sequentially read the sonogram with the CSF provided [30]. The sequential reading with the CSF was to mimic how the CAD device would be used in clinical practice, where the CSF information was provided as an integral part of the clinical reading and interpretation of sonograms. In other words, a reader read and scored every sonogram twice, once without CSF and once with CSF, for all 130 images. The order of the 130 sonograms was randomized and different for every reader. The scoring was scaled from 0 to 100 (0 = absolutely benign, 100 = absolutely malignant). We used the DBM MRMC 2.32 software (based on the Dorfman-Berbaum-Metz method) for generation of the ROC curves based on bi-normal models and for estimates of random effects of readers and cases [29][32][33][34]. Paired ROC curves of readers' scoring without and with assistance of CSF were generated against pathology, and the paired AUCs were calculated. We also performed subgroup analysis for readers with different levels of experience. Readers with more than 6 years' experience of interpreting sonograms were referred to as senior readers and the others were called junior readers. We calculated the paired AUC for the two groups separately and compared the paired AUCs between the two groups.
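The bookkeeping of the sequential design can be sketched as follows. This is an illustration under stated assumptions, not the DBM MRMC analysis: the DBM method fits bi-normal ROC curves and tests the difference with an analysis of variance, whereas this sketch uses empirical AUCs and only averages them. The names `reading_order`, `empirical_auc`, and `paired_auc_summary`, the record layout, and the toy scores are hypothetical.

```python
import random
from statistics import mean


def reading_order(case_ids, reader_id, seed=0):
    """Give each reader an independently shuffled order of the cases."""
    rng = random.Random(seed * 100003 + reader_id)  # one stream per reader
    order = list(case_ids)
    rng.shuffle(order)
    return order


def empirical_auc(scores, truth):
    """Fraction of (malignant, benign) pairs ranked correctly; ties = 0.5."""
    pos = [s for s, t in zip(scores, truth) if t == 1]
    neg = [s for s, t in zip(scores, truth) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_auc_summary(readers, truth, senior_years=6):
    """Mean (AUC without CSF, AUC with CSF) for all, senior, junior readers.
    Readers with more than `senior_years` of experience count as senior."""
    rows = []
    for r in readers:
        group = "senior" if r["years"] > senior_years else "junior"
        rows.append((group,
                     empirical_auc(r["without_csf"], truth),
                     empirical_auc(r["with_csf"], truth)))
    summary = {}
    for group in ("all", "senior", "junior"):
        sel = [row for row in rows if group in ("all", row[0])]
        if sel:
            summary[group] = (mean(a0 for _, a0, _ in sel),
                              mean(a1 for _, _, a1 in sel))
    return summary


# Toy data: 4 cases (1 = malignant on pathology), scores on the 0-100 scale.
truth = [0, 0, 1, 1]
readers = [
    {"years": 10, "without_csf": [10, 40, 60, 90], "with_csf": [5, 30, 70, 95]},
    {"years": 2, "without_csf": [20, 60, 50, 80], "with_csf": [20, 40, 50, 80]},
]
summary = paired_auc_summary(readers, truth)
```

Each reader contributes a pair of AUCs from the same cases, which is what makes the comparison paired; the significance of the difference would still have to come from an MRMC method that accounts for reader and case variability.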

Detection Accuracy of Computerized Sonographic Features
The detection accuracy of each computerized feature was tested against the ground truth (Table 2). Regardless of the ultrasound scanner type, the differences between the two groups of quantified values (with or without the presence of anechoic areas, hyperechoic foci, hypoechoic pattern, heterogeneous texture, and indistinct margin) were significant, ranging from a p-value < 0.0001 to a p-value = 0.0347 (Table 3). The lowest agreement (p-value = 0.0045~0.0347) was observed in detecting the heterogeneous texture, and the highest agreement between the computerized values and panel findings was observed for the anechoic area and the hyperechoic foci detection. Similarly, in terms of the AUC against the ground truth, the detection accuracies of the quantified anechoic areas, hyperechoic foci, echogenicity (hypoechoic pattern), heterogeneous texture, and indistinct margin were 0.888-0.946, 0.825-0.913, 0.812-0.847, 0.627-0.77, and 0.676-0.766, respectively, for the various ultrasound scanners (Table 3 and Figure 3). The detection accuracies of the quantified heterogeneous texture and indistinct margin appeared to be lower than those of the quantified anechoic areas, hyperechoic foci, and hypoechoic pattern. These AUCs also indicated good agreement between the computerized feature values and the panel's readings.
Table 3. Test results of the computerized sonographic features' detection accuracy.

Reader Performance Assisted with CSF
The reader performance using the computerized sonographic features generated by the CAD device was tested against the corresponding pathology. The accuracy of the readers in diagnosing malignant thyroid nodule sonograms was evaluated by the AUC based on the readers' scoring. AUCs without and with the computerized sonographic features (CSF) for each reader are shown in Table 4a. The mean AUC with CSF was significantly greater (p-value = 0.0420) than that without CSF using the DBM MRMC test, as shown in Table 4b. Moreover, the statistics showed that the difference was mainly observed for the junior readers (p-value = 0.1462 and 0.0265, respectively, for the senior and junior readers). Figure 4 shows the paired ROC curves. Differences could be observed between the ROCs with and without CSF for all readers, indicating improvement in the readers' performance with the assistance of computerized features. In particular, Figure 4c shows distinct ROC curves for the junior readers, whereas only a slight difference was found in Figure 4b between the pair of ROC curves for senior readers. This suggested that the computerized features were especially beneficial to junior clinicians, supplementing their relatively limited experience in making diagnoses. Though an improvement was also observed for the senior readers assisted by the computerized features, a larger sample of senior readers might be required to show significance.

Population and Features
The distribution and percentage of thyroid cancers were similar to those reported internationally [35][36][37], representing actual clinical practice and the demographic distribution of the studied population. Therefore, the possible effect of the nodule types and their distributions should be considered negligible. In addition, because the CSFs were found to agree closely with the clinicians' findings for all three types of ultrasound scanners included in this study, the feature computerization of the CAD device should be generalizable to thyroid sonograms acquired for a general patient population in practical clinical situations.

CSF Accuracy
The sonographic features evaluated by the CAD device were found to be in significant agreement with the specialists' judgement (ground truth). The heterogeneous texture feature showed lower agreement, possibly because of the extremely unbalanced reading results by the specialist panel, where only 5-8.5% of nodules were not deemed to be heterogeneous. In clinical practice, even a slightly uneven trace within a nodule would lead to a judgement of a heterogeneous nodule. In contrast, the CAD software provided quantified values evaluating the degree of heterogeneity, which was more functional, with a continuous severity rating. Another feature with lower agreement was the indistinct margin. An indistinct margin was supposed to indicate possible infiltration by a malignant nodule. However, determination of indistinct margins was highly subjective and greatly influenced by the clarity of the sonogram and the angle of the ultrasound probe. The readings of indistinct margins thus had a relatively high variability and appeared to be less reliable. Since the CSFs were produced by the CAD software independently without interaction with clinicians, they could serve as second opinions to assist clinical diagnosis.

Reader Performance with CSF
An MRMC analysis was necessary to test the performance of a CAD device, since the radiological device was intended to assist the physicians and had to be shown to be of value to the physicians in making diagnoses. The MRMC study design offered a comprehensive analysis of the role of the device in actual clinical situations where clinicians with different levels of experience might use the device for reading a variety of cases [31]. Both the reader variability and the interactions between readers and cases were considered by the DBM model in the MRMC analysis [34]. The scoring, by rating the malignancy potential of the nodules, reflected the readers' interpretation confidence without and with the device's assistance. In this study, a total of 18 readers were recruited for the test and all readers received training by reading 20 sonograms prior to the test. The test sonograms were randomly selected from a sufficiently large database to avoid possible sampling bias. The analysis results demonstrated that the reading performance of clinicians, in terms of AUC, was significantly improved when assisted with the CSFs produced by the CAD device. We also observed that the greatest improvements were mainly made by the junior readers, although the higher diagnostic accuracy, with an average AUC near 0.8, was achieved by the senior readers. The CSFs appeared to be of value to both junior and senior readers, with the accuracy approaching but not exceeding 0.8. Since the improvement made by the senior readers was only about 5% (from AUC = 0.759 to 0.799), a larger number of senior readers might be needed to prove the statistical significance.
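The size of the improvement can be checked with a short calculation, assuming that "about 5%" refers to the relative change in AUC (the text does not state whether the figure is relative or absolute). The helper `relative_change` is a hypothetical name; the AUC values are those reported in this paper for the senior readers and for all 18 readers.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change of `after` with respect to `before`."""
    return (after - before) / before


# Senior readers: AUC 0.759 without CSF, 0.799 with CSF.
senior_gain = relative_change(0.759, 0.799)   # about 0.053, i.e. ~5%

# All 18 readers: AUC 0.720 without CSF, 0.776 with CSF.
overall_gain = relative_change(0.720, 0.776)  # about 0.078, i.e. ~8%
```

Under this reading, the absolute gain for the senior readers is 0.040 AUC units, which corresponds to the quoted figure of roughly 5% in relative terms.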

CAD Device in Clinical Practice
Major advantages of the CAD device included providing a critical second opinion, reducing time-consuming procedures, and avoiding oversight and interobserver variation [13]. A CAD device could serve as a reliable second opinion because it provided accurate information on the sonographic features in concordance with the ground truth. Independent double reading by another clinician was an alternative way to improve the detection accuracy [38]. However, independent double reading was labor-intensive for clinicians and might not be as efficient and effective as reading assisted by a CAD device [19]. Furthermore, the sonographic reading of a human clinician was greatly dependent on past experience and often subject to momentary impressions. In contrast, the CAD device could provide consistent and objective evaluation regardless of the time and environment settings, and thus could further reduce the interobserver variation [39].
Several CAD algorithms had been proposed for thyroid ultrasound images by other authors [25][26][27][28]. Acharya et al. characterized the thyroid nodules into benign and malignant classes using a combination of texture and discrete wavelet transform [25]. Chang et al. used a support vector machine classifier to differentiate malignant and benign nodules and showed accuracy similar to that obtained via visual inspection by radiologists [26]. Choi et al. suggested that, by using a CAD device, the sensitivity of the junior reader was as good as that of the senior readers, with a lower specificity [27]. Most recently, Wu et al. reported an MRMC study that showed a significant improvement in reading performance with assistance of the CAD device after a washout period [28]. Although the above studies assessed the differentiation of malignant and benign nodules by using the CAD device, this study would be the first to evaluate the CAD device's performance in reading the sonographic features through comparison to readings by a panel of specialists and to assess the CSFs' effects on the reader performance via an MRMC study of sequential readings with and without CSF.

Study Limitation
This study had limitations regarding the sonographic features evaluated. First, hyperechoic foci were not further differentiated into microcalcifications, coarse calcifications, rim-shape calcifications, or colloid [20]. Second, the echogenicity level determined by the panel was not further classified into the four levels commonly used in clinical practice, namely, hyperechoic, isoechoic, mildly hypoechoic, and markedly hypoechoic [23]. Third, though the sequential readings without and with the CSF mimicked how the CAD device would be used in clinical practice, sequential readings without a washout period might result in a recollection bias and distort the measured effect of the assistance by the CAD device.
In conclusion, the CAD device visualized and quantified the thyroid sonographic features in high concordance with a specialist panel and was shown to significantly improve clinicians' reading of the nodule's malignancy risk.