Using Deep Convolutional Neural Networks for Enhanced Ultrasonographic Image Diagnosis of Differentiated Thyroid Cancer

Differentiated thyroid cancer (DTC) from follicular epithelial cells is the most common form of thyroid cancer. Beyond the common papillary thyroid carcinoma (PTC), there are a number of rare but difficult-to-diagnose pathological classifications, such as follicular thyroid carcinoma (FTC). We employed deep convolutional neural networks (CNNs) to facilitate the clinical diagnosis of differentiated thyroid cancers. An image dataset comprising thyroid ultrasound images from 421 patients with DTC and 391 patients with benign nodules was collected. Three CNNs (InceptionV3, ResNet101, and VGG19) were retrained via transfer learning and tested on the classification of malignant and benign thyroid tumors. The enrolled cases were classified as PTC, FTC, follicular variant of PTC (FVPTC), Hürthle cell carcinoma (HCC), or benign. The accuracy of the CNNs was as follows: InceptionV3 (76.5%), ResNet101 (77.6%), and VGG19 (76.1%). The sensitivity was as follows: InceptionV3 (83.7%), ResNet101 (72.5%), and VGG19 (66.2%). The specificity was as follows: InceptionV3 (83.7%), ResNet101 (81.4%), and VGG19 (76.9%). The area under the curve was as follows: InceptionV3 (0.82), ResNet101 (0.83), and VGG19 (0.83). A comparison of the performance of physicians and CNNs showed significantly better outcomes for the latter. Our results demonstrate that retrained deep CNNs can enhance diagnostic accuracy in most DTCs, including follicular cancers.


Introduction
Most thyroid tumors are incidentally discovered via palpation by clinical physicians. It has been estimated that the prevalence of thyroid tumors can reach 65% [1] and that they are more common among females. Fortunately, most tumors are benign thyroid nodules; that is, only a small number of them are malignant [2]. Roughly 5-10% of these tumors are identified as thyroid cancer. In Taiwan, thyroid cancer is becoming increasingly common, with most cases identified in individuals between 40 and 65 years old. It is currently the fourth most prevalent form of cancer among women, as well as the most common cancer of the endocrine system. The Health Promotion Administration of Taiwan has reported a 9.67% annual increase in the number of newly diagnosed cases of thyroid cancer. This may be due in part to advances in ultrasound and imaging technology over the past decade, which have greatly facilitated diagnostic procedures [3], particularly when dealing with tumors measuring less than 1 cm. The prognosis in cases of thyroid cancer is generally good. At present, the long-term prognosis after standard treatment for differentiated thyroid cancers (DTC) is excellent, with a 10-year survival rate of 96% [4].

Artificial intelligence (AI) is a key technology in the ongoing personalization and development of precision medicine. Musko [13] claimed that AI allows doctors and researchers to make predictions of greater accuracy, thereby making it easier to identify the treatment and prevention strategies best suited to a particular disease and/or groups of patients. Deep learning algorithms are increasingly being used to facilitate the diagnosis of tumors. Chi [14] reported that the retrained GoogleNet outperformed conventional machine learning approaches, such as support vector machine (SVM). Using InceptionV3, Song [15] achieved diagnostic performance comparable to that of experienced professional radiologists. Various deep learning models have also been trained to differentiate between malignant and benign thyroid tumors [16][17][18][19][20][21][22][23]. A number of studies have addressed the issue of training AI systems in the analysis of thyroid ultrasound images; however, most of this work has focused on PTC.
Diagnosing other pathological types (FVPTC, FTC, and HCC) is hindered by their rarity in clinical practice and their similarity to benign lesions in ultrasound images. The prognosis for DTCs should be similar to that of PTC as long as they are identified early. Ideally, clinicians should be able to confirm diagnosis prior to surgical intervention.
In this study, transfer learning was used to train a deep convolutional neural network (CNN) [24] for the analysis of ultrasound images with the aim of differentiating between malignant and benign thyroid lesions, and facilitating the identification of other DTCs (e.g., FTC). We anticipate that such CNN models could help to eliminate unnecessary invasive examinations or surgical interventions.

Data Sources
This retrospective study was based on medical records and clinical data from the cancer registry of Chang Gung Memorial Hospital (Linko branch) covering the period from January 2008 to July 2020. Patients aged >20 years with a surgically confirmed diagnosis of thyroid cancer were enrolled. Patients with surgically confirmed benign thyroid nodules between January 2016 and July 2020 were also enrolled. Patients with non-DTC were excluded (Figure 2). Patients who had not undergone an ultrasound examination within 12 months prior to surgical intervention were also excluded, as were patients without recognizable cancer lesions in ultrasound images. This study was approved by the Institutional Review Board of the Chang Gung Medical Foundation (IRB No. 202001440B0, 31 August 2020). The requirement for informed consent was waived due to the retrospective nature of this analysis.

Data Collection
Demographic and clinical data included the age of the patient at the time of diagnosis, gender, lesion location (left, right, both, or isthmus), ultrasound manufacturer (e.g., Aloka, Hitachi, and Siemens) (Table 1), the distribution of pathological groups as a function of ultrasound brand (Table 2), and histopathological data (Table 3). All images were downloaded and stored in TIFF format. Every patient included in the study presented at least one thyroid tumor in ultrasound images (longitudinal or horizontal view) acquired using machines from multiple ultrasound manufacturers. A single ultrasound image could show a nodule in two views if it was saved in double-view mode. After a manual review of the examination data, researchers collected 1791 ultrasound images for analysis. The regions of interest (ROIs) in the ultrasound images were marked by the author as rectangular bounding boxes (Figure 3). Each ROI was meant to include the entire tumor; where the tumor exceeded the image boundary, the bounding box included only the visible part of the tumor. Ultrasound images were subsequently cropped according to the bounding boxes, which resulted in 2308 images of nodules for training (Figure 4). The images were divided into a training set (80%) and a test set (20%). Images from each patient were placed in either the training set or the test set, but not in both. Image data underwent preprocessing to compensate for the relatively small number of images and reduce the likelihood of overfitting. As shown in Figure 5
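The rule that images from one patient never appear in both the training set and the test set can be illustrated with a short sketch. The function name, the record layout, and the choice to apply the 80/20 ratio to patients rather than to images are assumptions made for illustration; the study does not describe its splitting procedure at this level of detail.

```python
import random
from collections import defaultdict


def split_by_patient(image_records, test_fraction=0.2, seed=0):
    """Split (patient_id, image_name) records so no patient spans both sets.

    Assumption: the 80/20 ratio is applied to patients, so the ratio of
    images may deviate slightly from 80/20.
    """
    by_patient = defaultdict(list)
    for patient_id, image_name in image_records:
        by_patient[patient_id].append(image_name)

    # Shuffle patient IDs reproducibly, then reserve a fraction for testing.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])

    train = [(p, img) for p in patient_ids if p not in test_ids
             for img in by_patient[p]]
    test = [(p, img) for p in patient_ids if p in test_ids
            for img in by_patient[p]]
    return train, test


# Hypothetical records: 10 patients with 1 to 3 cropped nodule images each.
records = [(f"P{p:03d}", f"P{p:03d}_img{i}.tif")
           for p in range(10) for i in range(1 + p % 3)]
train_set, test_set = split_by_patient(records)
```

Splitting at the patient level rather than the image level avoids leakage: two crops of the same nodule in different sets would make test accuracy look better than it is.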

Study Design
The diagnoses of all tumors in this study were subject to surgical and pathological confirmation; therefore, training was implemented as supervised learning. Transfer learning and fine-tuning of hyperparameters were implemented on three pretrained CNNs, namely, InceptionV3, ResNet101, and VGG19. Note that the classification accuracy of these CNNs has been demonstrated in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). The MATLAB 2021a platform was used for the retraining of the three CNNs to classify benign and malignant thyroid tumors in ultrasound images. The size of the input images was adjusted according to the CNN settings. Stochastic gradient descent with momentum (SGDM) was applied as the solver. The maximum number of epochs was as follows: InceptionV3 (26), ResNet101 (21), and VGG19 (32). The learning rate was as follows: InceptionV3 (0.001), ResNet101 (0.001), and VGG19 (0.0001). Fivefold cross-validation was used to ensure the stability of the results.
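The training settings and the fivefold cross-validation described above can be summarized in a brief sketch. The study itself used MATLAB 2021a; the Python below is only an illustration. The dictionary layout, the function name, and the decision to form folds over patient IDs (mirroring the patient-level split described under Data Collection) are assumptions, whereas the solver, epoch counts, and learning rates are the values reported in the text.

```python
import random

# Values reported in the text; the dictionary layout itself is illustrative.
TRAINING_OPTIONS = {
    "InceptionV3": {"solver": "sgdm", "max_epochs": 26, "learning_rate": 0.001},
    "ResNet101": {"solver": "sgdm", "max_epochs": 21, "learning_rate": 0.001},
    "VGG19": {"solver": "sgdm", "max_epochs": 32, "learning_rate": 0.0001},
}


def k_fold_patient_splits(patient_ids, k=5, seed=0):
    """Yield (training_ids, validation_ids) pairs for k-fold cross-validation.

    Assumption: folds are formed over patients, so that all images from a
    given patient fall in the same fold.
    """
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield training, validation


# Hypothetical patient identifiers.
patients = [f"P{n:03d}" for n in range(23)]
splits = list(k_fold_patient_splits(patients))
```

Each patient serves as validation data exactly once, so averaging a metric over the five folds gives a sense of how stable the result is across different partitions.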

Statistical Analysis
This study compared the diagnostic capability of CNNs on the test set with that of two endocrinologists, each with over 20 years of experience in performing fine-needle aspiration and interpreting ultrasound images. In estimating the diagnostic performance of physicians, the images were classified as malignant or benign according to sonographic patterns and estimated risk of malignancy, as suggested in the American Thyroid Association (ATA) classification system [25]. The CNNs were assessed in terms of accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), as well as the receiver operating characteristic (ROC) curve, area under the curve (AUC), and confusion matrix. We also assessed accuracy in identifying tumors with various histopathologies. Continuous variables were presented as the mean and standard deviation (SD), as indicated. Categorical data were expressed in terms of actual frequencies and percentages. Statistical analysis was performed using the chi-square test and analysis of variance (ANOVA). p-Values < 0.05 were considered significant. All statistical analysis was conducted using the SAS Suite, version 9.4 (SAS Institute, Cary, NC, USA).
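For reference, the performance measures listed above follow directly from the counts in a binary confusion matrix, with malignant treated as the positive class. The sketch below uses standard definitions; the function names and the example counts and scores are hypothetical and are not results from this study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard measures from confusion-matrix counts.

    tp/fn: malignant cases classified as malignant/benign.
    tn/fp: benign cases classified as benign/malignant.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


def auc_from_scores(labels, scores):
    """AUC as the probability that a malignant case (label 1) receives a
    higher score than a benign case (label 0); ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts and scores, for illustration only.
metrics = diagnostic_metrics(tp=40, fp=10, tn=30, fn=20)
auc = auc_from_scores([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Sensitivity and specificity do not depend on the proportion of malignant cases in the test set, whereas PPV and NPV do, which matters when comparing results across cohorts with different case mixes.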

Study Population
A total of 791 patients were identified in our initial analysis (Figure 1). From this group, 17 patients were excluded due to non-DTC, including anaplastic cancer (n = 6), medullary cancer (n = 7), and metastatic cancer (n = 4). Patients who had not undergone ultrasound examinations within 12 months prior to surgery were also excluded, as were those without recognizable lesions (n = 353). This left 421 DTC patients and 391 patients with benign thyroid nodules who met the enrollment criteria for this study.

Demographics
As shown in Table 1, the patients were divided into a malignant group (comprising a PTC group, FTC group, FVPTC group, and HCC group) and a benign group. The mean age across groups ranged from 44.9 to 54.2 years. The average age at the time of diagnosis was lower in the malignant groups (p < 0.0001). We observed a higher proportion of females in all groups; however, the female-to-male ratio did not differ significantly between groups. We observed statistically significant between-group differences in terms of lesion location (p = 0.0017), with very few instances of bilateral lesions. Among the malignant groups, the PTC group presented the highest proportion of simultaneous bilateral lesions (5.14%), whereas the FTC group presented the highest proportion of isthmus lesions (4.29%). We observed statistically significant between-group differences in terms of ultrasound manufacturer (p < 0.0001). The most common brands in the PTC group were GE Healthcare (37.12%) and Siemens (28.82%), and the most common brand in the other groups was GE Healthcare. Table 2 lists the distribution (percentages) of pathological groups as a function of ultrasound brand. Statistically significant differences were observed between all pathological groups as a function of ultrasound brand (p < 0.001). Table 3 lists the histopathological distribution of tumors among groups. On the basis of ultrasonic features, the malignant group was divided into PTC and FTC subgroups. The PTC subgroup included classic PTC, the diffuse sclerosing variant, the tall cell variant, the cribriform morular variant, and the encapsulated variant. The FTC subgroup included FTC, FVPTC, HCC, and the encapsulated follicular variant of PTC. The benign group included nodular hyperplasia (NH), follicular adenoma (FA), cysts, and HA. Classic PTC and FVPTC were the most common pathology types in the PTC and FTC subgroups, respectively. Nodular hyperplasia was the most common pathology in the benign group.

Discussion
In this study, deep convolutional neural networks (CNNs) were used to classify thyroid tumors as malignant or benign. Note that the accuracy achieved in the current study was slightly lower than in previous studies [15][16][17][21][22]. To the best of our knowledge, this was the first study focusing on the use of CNNs for the classification of DTCs other than PTC (FVPTC, FTC, and HCC). Diagnosing malignant thyroid tumors (e.g., FTC) prior to surgical intervention remains an unresolved problem [26]. A definitive diagnosis of FTC requires surgical intervention. This issue is largely due to similarities between the ultrasonic features of malignant and benign nodules [27]. From a microscopic point of view, the major difference between malignant FTC and benign follicular adenomas (FAs) is the occurrence of vascular or capsular invasion. Furthermore, cytopathological analysis is often inconclusive due to a lack of distinguishing characteristics, such as the "clear cell border" and "pseudo-inclusion body" observed in PTC cells. Note that most of these cases would eventually be classified as follicular neoplasms (FNs) [28]. Technical advances in molecular biology and genetic engineering have revealed a link between BRAF mutations and PTC, as well as a link between RAS mutations and FTC [11]. Unfortunately, the cost of molecular and genetic testing is prohibitive in most cases, and such testing is unavailable except in the best-equipped medical centers. Given the low incidence of malignant thyroid tumors, the inclusion of advanced diagnostics in routine thyroid examinations is unreasonable. More importantly, diagnostic accuracy is strongly influenced by the number of successful punctures in fine-needle aspiration.
In one recent study, machine learning methods proved highly effective in diagnosing FNs [29]. In fact, a thyroid CAD named AmCAD-UT has already been approved by the United States Food and Drug Administration (FDA) and Taiwan Medical Device Marketing Approval for the assessment of thyroid tumors using feature extraction/selection. AI is proving highly effective in overcoming the difficulties associated with the diagnosis of malignant tumors.
The ImageNet project has been instrumental in advancing computer vision and deep learning research. ImageNet provides an image database based on the WordNet hierarchy, and data are freely available to researchers for noncommercial applications [30]. The database contains more than 14 million manually annotated images. Between 2010 and 2017, ImageNet held an annual competition (referred to as ILSVRC) to evaluate algorithms used in object detection and image classification. The CNNs used in this study for transfer learning achieved the highest classification accuracy in 2014 (Inception), the highest detection results in 2014 (VGG), and the highest classification accuracy in 2015 (ResNet). Since 2015, the accuracy of deep learning image classification has exceeded 95%, far exceeding the capabilities of humans. In the intervening years, new CNNs (e.g., SENet) have achieved even higher accuracy; however, the difference is negligible. Most of the previous thyroid cancer imaging studies using Inception, ResNet, and VGG achieved acceptable accuracy; these CNNs were, therefore, deemed suitable for transfer learning in the current study. We discovered that the less complex CNNs (Inception and ResNet) were slightly faster than VGG in terms of training and classification; however, overall classification accuracy was nearly identical.
In identifying cases of PTC, we were unable to achieve the 90% accuracy observed in previous studies [17,22], due primarily to an insufficient number of cases. Note that most of the previous CNN studies on classic PTC included at least 1000 patients. According to the results in previous studies, it appears likely that accuracy in identifying PTC could be increased simply by including a larger number of cases. It is also possible that the performance of the CNN algorithms was hindered by unacceptably low image resolution after cropping. There is also a possibility that the apparatus used for ultrasonic imaging played a role in algorithm performance, due to subtle differences in image fineness, brightness, contrast, and texture output from different ultrasound manufacturers.
In our analysis, accuracy in identifying FVPTC reached 74.6% (ResNet101), which is similar to the results obtained for classic PTC. FVPTC is the most common PTC variant and the second largest group in the current study. The ultrasonic characteristics of FVPTC differ considerably from those of classic PTC and, in many respects, are similar to those of benign tumors [31]. Our results demonstrated that accuracy in identifying FTC was only 63.6-72.7% and accuracy in identifying FA was only 65-80%, regardless of the CNN. Accuracy in identifying HCC was only 60-66.7%, due largely to the small number of cases in the database. The CNNs seemed not to provide much benefit in the identification or diagnosis of FTC or HCC, due largely to a lack of cases resulting from low incidence and prevalence. Note that there is only a slight difference between FTC, HCC, and FA in terms of gross structure and ultrasound features [32,33].
The classification performance of the retrained CNNs was considerably better than that of the participating physicians, especially in the malignant groups. The poor diagnostic performance of physicians in dealing with malignant tumors resulted in poor sensitivity. In clinical practice, the endocrinologist or radiologist usually considers malignant features suggested by the image as a whole, not just the low-resolution cropped area surrounding the tumor. However, with the help of fine-needle aspiration and cytopathological analysis, the sensitivity of physicians may be comparable to that of CNNs retrained on ultrasound images alone. The remarkably low accuracy of physicians in the classification of malignant tumors also indicates the difficulty of clinical diagnosis, particularly in cases of FTC, FVPTC, and HCC. Overall, the diagnostic performance of the CNNs exceeded that of the physicians.
Overall, InceptionV3 achieved the highest sensitivity, whereas ResNet101 and VGG19 achieved higher specificity. The concurrent application of all three CNNs appears to be a viable possibility. InceptionV3 could be used to confirm a diagnosis of malignancy, whereas ResNet101 and VGG19 could be used to confirm that lesions are indeed benign.
There are numerous situations in which CAD could advantageously be implemented in conjunction with AI. For example, many developing countries lack the medical resources, professional radiologists, and endocrinologists required to obtain a reliable diagnosis of thyroid lesions. CAD could be used to screen for potential thyroid cancers for referral to a medical center. Even in medical centers, CAD could be used to facilitate the training of medical students and inexperienced physicians. More importantly, CAD could provide helpful advice in difficult cases with inconclusive cytology results. From the perspective of healthcare and therapeutics, AI has also been shown to play an important role in treatment quality. Fionda [34] reported that the use of AI-based predictive models and decision support systems for radiation oncology and interventional radiotherapy can alleviate many time-consuming repetitive tasks, thereby enabling a corresponding decrease in healthcare costs.
In the current study, image ROIs were manually cropped by the author using a bounding box. This method is precise but time-consuming, and different physicians would no doubt differ in their approach to cropping. Deep learning models with auto-detection or auto-segmentation (e.g., YOLOv3 and R-CNN) could be developed to increase the speed of ROI framing, with the results then subject to manual review and adjustment. In the clinical application of CAD, it is also necessary to establish a graphical user interface (GUI) capable of automating the process of ROI framing and classification.
This study was subject to a number of limitations. Firstly, the retrospective design of this study made selection bias inevitable. Secondly, an extended acquisition period was required to obtain a usable number of samples in the malignant groups. Thus, it was inevitable that the collection period for malignant cases would far exceed that of the benign control group. Thirdly, the small number of cases in our test set may have also had a negative effect on accuracy. Fourthly, the process of image selection was complicated and slow. Unlike computed tomography and magnetic resonance imaging, ultrasound images do not present a unified arrangement and require extensive manual preprocessing. In this study, the author had to review all of the ultrasound images in a search for the target lesions identified surgically. Ultrasound images also tend to vary considerably in terms of size and zooming ratio. This made it impossible to measure the tumor sizes retrospectively, leading to instances of missing data and/or mismatch with pathology reports. Note that this was the reason for the omission of tumor size in this study. Fifthly, we opted not to include the rarer forms of malignant tumor, such as undifferentiated thyroid cancers and metastatic cancers. Note that applying the current CAD in clinical practice would no doubt raise concerns about missed diagnoses. Sixthly, most of the images selected for CNN training presented identifiable single nodules. Thus, the diagnostic power in dealing with multinodular goiters with ill-defined margins remains unclear. Lastly, differences in the output algorithms of ultrasound machines can have a profound effect on training and classification. However, without raw data, there is no simple way to standardize images. Thus, the only viable approach to balancing the datasets is to apply histogram equalization or collect more images from different ultrasound manufacturers.
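Histogram equalization, mentioned above as one way to reduce brightness and contrast differences between ultrasound machines, can be sketched in a few lines. This is a generic textbook version for 8-bit grayscale images held as nested lists; the function name and the example pixel values are hypothetical, and it was not part of the pipeline used in this study.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize a grayscale image.

    image: list of rows, each a list of integer intensities in [0, levels).
    Returns a new image whose intensities are spread over the full range.
    """
    flat = [px for row in image for px in row]

    # Count how often each intensity occurs.
    hist = [0] * levels
    for px in flat:
        hist[px] += 1

    # Cumulative distribution of intensities.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:
        # Constant image: nothing to spread out.
        return [row[:] for row in image]

    # Map each intensity so the cumulative distribution becomes linear.
    lut = [max(0, round((c - cdf_min) * (levels - 1) / (total - cdf_min)))
           for c in cdf]
    return [[lut[px] for px in row] for row in image]


# Hypothetical low-contrast 2x2 patch.
low_contrast = [[100, 101], [102, 103]]
equalized = equalize_histogram(low_contrast)  # [[0, 85], [170, 255]]
```

Equalization standardizes the intensity distribution only; it cannot undo manufacturer-specific differences in texture or resolution, which is why collecting more images from each manufacturer remains the complementary option.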

Conclusions
Advanced deep CNN models that are fine-tuned using transfer learning show considerable potential as a noninvasive approach to the diagnosis of DTCs, including FTC. Clinicians should be able to diagnose thyroid cancer more easily by combining ultrasound with CAD. Anticipated advances in ultrasound technology and larger databases will greatly enhance the efficacy of these methods.