Diagnostic Performance of Kwak, EU, ACR, and Korean TIRADS as Well as ATA Guidelines for the Ultrasound Risk Stratification of Non-Autonomously Functioning Thyroid Nodules in a Region with Long History of Iodine Deficiency: A German Multicenter Trial

Simple Summary
In Germany, thyroid nodules can be detected by ultrasound examinations in over 30% of the adult population, mainly as a result of prolonged nutritive iodine deficiency. Although only a small proportion of the nodules are malignant, it is important to have a reliable examination method that not only detects these few thyroid carcinomas with a high degree of certainty, but also avoids unnecessarily invasive procedures for the much larger number of benign nodules. Ultrasound is the method of choice, and ultrasound-based risk stratification systems are important tools in clinical care. However, many different systems have been introduced within the last decade. The aim of this study was to evaluate the diagnostic accuracy of five common ultrasound risk stratification systems for thyroid nodules from an area with a long history of iodine deficiency.

Abstract
Germany has a long history of insufficient iodine supply, and thyroid nodules occur in over 30% of the adult population, the vast majority of which are benign. Non-invasive diagnostics remain challenging, and ultrasound-based risk stratification systems are essential for selecting lesions requiring further clarification. However, no recommendation can yet be made about which system performs best in iodine deficiency areas. In a German multicenter approach, 1211 thyroid nodules from 849 consecutive patients with cytological or histopathological results were enrolled. Scintigraphically hyperfunctioning lesions were excluded. Ultrasound features were prospectively recorded, and the resulting classifications according to five risk stratification systems were retrospectively determined. In total, 1022 benign and 189 malignant lesions were identified. The diagnostic accuracies were 0.79, 0.78, 0.70, 0.82, and 0.79 for Kwak Thyroid Imaging Reporting and Data System (Kwak-TIRADS), American College of Radiology (ACR) TI-RADS, European Thyroid Association (EU)-TIRADS, Korean-TIRADS, and American Thyroid Association (ATA) Guidelines, respectively. Receiver operating characteristic (ROC) curves revealed areas under the curve (AUC) of 0.803, 0.795, 0.800, 0.805, and 0.801, respectively. According to the ATA Guidelines, 135 thyroid nodules (11.1%) could not be classified. Kwak-TIRADS, ACR TI-RADS, and Korean-TIRADS outperformed EU-TIRADS and ATA Guidelines and can therefore be primarily recommended for non-autonomously functioning lesions in areas with a history of iodine deficiency.


Introduction
Iodine deficiency is a well-known risk factor in the development of nodular thyroid disease [1]. Although nutritive iodine supply in the German population has improved in the recent years, Germany has a long history of iodine deficiency and the requirements of the World Health Organization (WHO) have not yet been fully met [2][3][4][5]. The prevalence of thyroid nodules (TNs) ranges from 12.5% in young men to over 80% in older women [6][7][8][9]. Since the vast majority of the detected TNs are benign, the diagnostic challenge is to reliably detect malignant nodules while avoiding unnecessary interventions for benign lesions [10].
Thyroid ultrasound (US) is a non-invasive, cost-effective, and accurate method for detecting and describing TNs [11]. It is also the method of choice for assessing and selecting TNs for further diagnostic procedures such as fine-needle cytology (FNC) to rule out malignancy [12][13][14]. During the last decade, several international societies have published different US-based risk stratification systems (RSSs; Thyroid Imaging Reporting and Data Systems, TIRADS) based on US features and lesion size. The aim was to improve the diagnostic performance of thyroid US, to reduce unnecessary interventions, and to provide a standardized terminology for physicians [12,13,15-18]. In 2011, Kwak et al. published a TIRADS (Kwak-TIRADS) based on five US features suspicious for malignancy: microcalcifications, solid composition, hypoechogenicity, a taller-than-wide shape, and an irregular/microlobulated margin [19]. In 2016, the Korean Thyroid Association/Korean Society of Thyroid Radiology (KTA/KSThR) proposed a pattern-based RSS (Korean-TIRADS) based on solidity and echogenicity with additional suspicious features (microcalcifications, non-parallel orientation, and spiculated/microlobulated margins) [20]. In 2015, the American Thyroid Association (ATA) announced a pattern-based, five-tier RSS with different risks of malignancy [21]. Similar to the Korean-TIRADS, the European Thyroid Association (ETA) in 2017 proposed a pattern-based five-tier RSS (EU-TIRADS) with US features showing a high probability of malignancy (irregular shape and margins, marked hypoechogenicity, solidity, and microcalcifications) [22]. In the same year, the American College of Radiology (ACR) published the scoring-based ACR TI-RADS [18].
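To illustrate how a count-based system differs from the pattern-based ones, the following minimal sketch assigns a Kwak-TIRADS category from the number of suspicious features listed above. The field names are hypothetical, and the mapping from feature count to category (0 = 3, 1 = 4a, 2 = 4b, 3-4 = 4c, 5 = 5) is the commonly cited one; it is an assumption here and should be verified against the original publication [19].

```python
from dataclasses import dataclass


@dataclass
class NoduleFeatures:
    """Hypothetical container for the five suspicious US features."""
    solid: bool
    hypoechoic: bool
    microcalcifications: bool
    taller_than_wide: bool
    irregular_margin: bool  # irregular or microlobulated margin


def kwak_category(n: NoduleFeatures) -> str:
    """Return a Kwak-TIRADS category from the count of suspicious features.

    Assumed mapping (verify against the original publication):
    0 -> "3", 1 -> "4a", 2 -> "4b", 3 or 4 -> "4c", 5 -> "5".
    """
    count = sum([
        n.solid,
        n.hypoechoic,
        n.microcalcifications,
        n.taller_than_wide,
        n.irregular_margin,
    ])
    if count == 0:
        return "3"
    if count == 1:
        return "4a"
    if count == 2:
        return "4b"
    if count <= 4:
        return "4c"
    return "5"


# Example: a solid, hypoechoic nodule with microcalcifications (three features).
example = NoduleFeatures(True, True, True, False, False)
print(kwak_category(example))
```

Pattern-based systems such as EU-TIRADS or the ATA Guidelines cannot be reduced to a simple count in this way, because the category depends on specific combinations of features.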
Recently, several studies have been carried out to compare the diagnostic performance of different US-based RSSs [13,14,17,23-30]. Although it is known that hyperfunctioning TNs have a very high probability of being benign and need no further diagnosis [31], none of these studies took the functional status of the TNs into account. Furthermore, in a previous study, our group demonstrated that a relevant proportion of hyperfunctioning TNs were classified as intermediate risk or high risk according to US-based risk stratification [32].
The aim of this study was to compare the diagnostic performance of five established US RSSs for non-autonomously functioning TNs in iodine deficiency.

Patients and Ethics
Since 2012, an increasing number of physicians specializing in thyroid diagnostics have been in constant communication regarding the diagnostic assessment of TNs, organized in the "German TIRADS Study Group" (GTSG). In recent years, seven institutions set up a continuously growing multicenter database containing the imaging and clinical data of over 2000 consecutive TNs. US features were recorded prospectively in real time immediately after the US examinations (see Section 2.2). Out of this pool, patients recorded between January 2012 and August 2020 were considered for the study. Their cases were consecutively recorded without influencing the treatment course, which followed guideline-based clinical decisions by the respective sites. From August 2020 onward, the RSS ratings were determined retrospectively from the prospectively documented US features. Observers were blinded to clinical results such as cytological and histopathological findings. Observers continuously discuss difficult cases with one another to reduce interobserver bias [33].
The inclusion criteria consisted of hypofunctioning or indifferent TNs on thyroid scintigraphy and the availability of cytological (FNC) or histopathological (surgery) diagnoses. Bethesda II lesions were considered benign. Scintigraphically hyperfunctioning TNs, TNs without scintigraphy, and TNs with FNC findings other than Bethesda category II but without histopathological evaluation were excluded. Scintigraphy scans were conducted according to the European guideline using technetium-99m pertechnetate [31].
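The inclusion and exclusion rules above can be expressed as a simple filter. The following is a minimal sketch, not the study group's actual database logic; the record type, field names, and string labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NoduleRecord:
    """Hypothetical database record for one thyroid nodule."""
    scintigraphy: Optional[str]  # "hypofunctioning", "indifferent", "hyperfunctioning", or None
    bethesda: Optional[int]      # Bethesda category 1-6, or None if no FNC was performed
    histology: Optional[str]     # "benign", "malignant", or None if not resected


def is_eligible(r: NoduleRecord) -> bool:
    """Apply the inclusion/exclusion criteria described in the text."""
    # Hyperfunctioning nodules and nodules without scintigraphy are excluded.
    if r.scintigraphy not in ("hypofunctioning", "indifferent"):
        return False
    # Any histopathological result is an accepted reference standard.
    if r.histology is not None:
        return True
    # Without histology, only Bethesda II cytology is accepted.
    return r.bethesda == 2


def reference_diagnosis(r: NoduleRecord) -> Optional[str]:
    """Return the reference diagnosis; histology takes precedence over cytology."""
    if r.histology is not None:
        return r.histology
    if r.bethesda == 2:
        return "benign"  # Bethesda II lesions were considered benign
    return None


records = [
    NoduleRecord("hypofunctioning", 2, None),      # eligible, benign by cytology
    NoduleRecord("indifferent", 4, "malignant"),   # eligible, histology available
    NoduleRecord("hyperfunctioning", 2, None),     # excluded: hyperfunctioning
    NoduleRecord("hypofunctioning", 3, None),      # excluded: Bethesda III without histology
    NoduleRecord(None, 2, None),                   # excluded: no scintigraphy
]
eligible = [r for r in records if is_eligible(r)]
print(len(eligible), [reference_diagnosis(r) for r in eligible])
```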
Recorded data comprised institution site, age, gender, number of TNs per patient, lesion size in three dimensions (cranio-caudal, ventral-dorsal, medial-lateral), lesion functionality on scintigram, US features and RSS classifications (see Section 2.2), cytological findings according to the Bethesda System [34], and histopathological results.
The multicentric data collection was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of the Medical Faculty of the University Hospital of Duisburg-Essen, Germany (ID: 16-7022-BO).

Ultrasound Examinations
US examinations were carried out according to the respective local standards with an emphasis on high-resolution, state-of-the-art image quality and acquisition in transverse and sagittal orientation. Accordingly, examination parameters such as patient positioning, frequency, focus number and focus positioning, zoom, depth, gain, virtual convex mode, crossbeam mode, harmonic imaging modes, and breath-hold techniques were adapted to individual patient- and nodule-specific requirements.

Data Analyses and Statistics
Data were recorded in Excel (Version 14.7.3, Microsoft Corporation, Redmond, WA, USA) and transferred to SPSS Statistics (Version 26.0, International Business Machines Corporation, New York, NY, USA) for statistical analyses. Fisher's exact test was conducted to evaluate group differences for ordinal values (e.g., US features). A Student's t test was performed to investigate the differences among groups with normally distributed metric values (e.g., TSH level, lesion size). For each RSS, the cutoff values between benign and malignant for the performance calculations were defined at 4c, TR5, 5, high, and high for Kwak-TIRADS, ACR TI-RADS, EU-TIRADS, Korean-TIRADS, and ATA Guidelines, respectively. For each test, p < 0.05 was considered significant.

Histopathological and cytological results were available for 731 (60.4%) and 776 (64.1%) lesions, respectively. In total, 480 (39.6%) TNs were diagnosed as benign by cytology (Bethesda II) only. For 296 (24.4%) lesions, both cytological and histopathological results were available. In 142 cases, Bethesda III/IV results were found on cytological examination. The rate of malignancy in these TNs was 15.5% (Table 1).

Table 1. Histopathological results of thyroid nodules (TNs) with fine-needle cytology (FNC) and surgery.
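The performance calculations described in the statistics section amount to dichotomizing each ordinal RSS category at its cutoff and deriving the usual measures from the resulting 2x2 table. The sketch below illustrates this in plain Python; it is not the SPSS workflow used in the study, the function names are hypothetical, the TR1-TR5 scale is used as an example, and the five sample nodules are invented toy data.

```python
from typing import Dict, Iterable, List, Tuple


def is_test_positive(category: str, ordered_scale: List[str], cutoff: str) -> bool:
    """A nodule is test-positive if its category is at or above the cutoff."""
    return ordered_scale.index(category) >= ordered_scale.index(cutoff)


def confusion_counts(results: Iterable[Tuple[bool, bool]]) -> Tuple[int, int, int, int]:
    """Count (tp, fp, fn, tn) from (test_positive, malignant) pairs."""
    tp = fp = fn = tn = 0
    for positive, malignant in results:
        if positive and malignant:
            tp += 1
        elif positive and not malignant:
            fp += 1
        elif not positive and malignant:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard measures derived from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Toy data (invented): ACR TI-RADS category and reference diagnosis per nodule.
acr_scale = ["TR1", "TR2", "TR3", "TR4", "TR5"]
toy_nodules = [("TR5", True), ("TR5", False), ("TR4", True), ("TR3", False), ("TR2", False)]

pairs = [(is_test_positive(cat, acr_scale, "TR5"), malignant) for cat, malignant in toy_nodules]
metrics = diagnostic_metrics(*confusion_counts(pairs))
print(metrics)
```

Lowering the cutoff (for example to TR4 in the toy data) trades specificity for sensitivity, which is why cutoffs must be stated whenever performance figures are compared across studies.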

Patient Data and Clinical Characteristics of the Thyroid Nodules
The mean size (largest diameter) of the TNs was 26 ± 13 mm. Since in Germany thyroid scintigraphy is only regularly performed (irrespective of the TSH level) on TNs ≥ 10 mm, only eight (0.7%) TNs measured < 10 mm and 14 (1.1%) lesions showed a size of 10 mm. These were resected along with other lesions, and their RSS classifications as well as scintigraphy findings were retrospectively assessed (with blinded histopathological results). The benign lesions were larger and more frequently hypofunctioning in the present study population (Table 2).

Discussion
One of the most dynamic fields in clinical thyroid research is the sonographic risk stratification of thyroid nodules. US devices are ubiquitous, and the procedure is a patient-friendly, cost-effective, and repeatable approach that has no side effects. Many different RSSs have been published in recent years, and in the present study the diagnostic performances of five important ultrasound-based risk stratification systems (Kwak-TIRADS, ACR TI-RADS, EU-TIRADS, Korean-TIRADS, and ATA Guidelines) were evaluated in a population that has a high prevalence of TNs due to a long history of iodine deficiency [7,8].
Since 2012, the German TIRADS Study Group has been recording consecutive thyroid nodule cases from seven German institutions where there is a growing number of participating members. In this manner, a large database was built. Constant communication regarding difficult cases and the recent literature was conducted to achieve high performance levels in the application of RSSs and to reduce interobserver variability among the operators [33]. With the present multicenter trial, the group reported the first extensive German dataset regarding the diagnostic performance of five US-based RSSs for non-autonomous TNs.
Because the study focused on TNs that had been invasively diagnosed according to the clinical decision of the treating physicians, the preselected lesions (no hyperfunctioning TN, cytology or histopathology demanded) did not accurately represent the underlying patient population of Germany. Thus, malignant lesions were overrepresented: 15.5% in comparison to their natural incidence of <5% [35]. However, the data also contained TNs that had not been referred to the surgeons primarily for histopathological evaluation but had been resected as part of other surgical indications in multinodular goiters. This mitigates the selection bias toward higher RSS classifications.
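The practical consequence of this overrepresentation can be made concrete with Bayes' theorem: for a fixed sensitivity and specificity, the positive predictive value (PPV) falls as the prevalence of malignancy falls. The sketch below uses illustrative values for sensitivity and specificity (both 0.80, an assumption and not a result of this study) together with the two prevalences mentioned above.

```python
from typing import Tuple


def predictive_values(sensitivity: float, specificity: float, prevalence: float) -> Tuple[float, float]:
    """Return (PPV, NPV) for given test characteristics and disease prevalence."""
    tp = sensitivity * prevalence               # true-positive fraction
    fn = (1.0 - sensitivity) * prevalence       # false-negative fraction
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false-positive fraction
    tn = specificity * (1.0 - prevalence)       # true-negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Illustrative test characteristics (assumed, not taken from the study results).
SENSITIVITY = 0.80
SPECIFICITY = 0.80

ppv_study, npv_study = predictive_values(SENSITIVITY, SPECIFICITY, 0.155)  # study cohort
ppv_general, npv_general = predictive_values(SENSITIVITY, SPECIFICITY, 0.05)  # upper bound of natural incidence

print(f"prevalence 15.5%: PPV={ppv_study:.3f}, NPV={npv_study:.3f}")
print(f"prevalence  5.0%: PPV={ppv_general:.3f}, NPV={npv_general:.3f}")
```

Sensitivity and specificity themselves do not depend on prevalence in this calculation, but predictive values observed in a preselected cohort should not be transferred directly to an unselected population.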
The diagnostic accuracy of EU-TIRADS (69.8%) was inferior to that of Kwak-TIRADS (78.6%), ACR TI-RADS (77.9%), or Korean-TIRADS (82.0%) because of the relatively high number of EU5 classifications. ATA Guidelines showed a comparably high accuracy of 79.3%, but a remarkable number of TNs (11.1%) could not be classified (N/A). The ATA Guidelines provide an atlas that is primarily pattern-based and lacks a clear definition for isoechoic TNs with further suspicious US features. This problem has already been described in previous studies [33]. However, N/A TNs were excluded from the diagnostic performance calculations. Based on these results, Kwak-TIRADS, ACR TI-RADS, and Korean-TIRADS outperformed EU-TIRADS and ATA Guidelines in the study population, although the AUC values on ROC analyses of all five RSSs were very similar (between 0.795 and 0.805) without significant differences (N/A TNs excluded). The diagnostic performance parameters were in concordance with the results of current meta-analyses (Table 6). Wei et al. reported a pooled sensitivity of 79% and a pooled specificity of 71% for mixed TIRADS studies. Pooled sensitivity (specificity) values of 98% (55%), 54-82% (53-90%), 66-74% (64-91%), 55-86% (28-95%), and 74-87% (31-88%) were published for Kwak-TIRADS, EU-TIRADS, ACR TI-RADS, Korean-TIRADS, and ATA Guidelines, respectively. However, the cut-off values between benign and malignant lesions were partly different among the respective meta-analyses.

Considering the data from former iodine deficiency areas specifically, Dobruch-Sobczak et al. observed a sensitivity of 93.4% and a specificity of 54.6% for EU-TIRADS with a cut-off for EU5 in a Polish multicenter study containing 842 TNs (229 malignant) [44]. In a smaller study population from Austria (N = 195), EU-TIRADS, Kwak-TIRADS, ATA Guidelines, and French-TIRADS were assessed as suitable for the differentiation between benign and malignant TNs. The authors found a sensitivity of 85% and a specificity of 45% with a cut-off of two or more positive US criteria. However, this was only true for the 45 included PTCs, but not for the eight FTCs [29]. In the present study, a large variety of different malignant lesions were observed, containing 54.0% PTC, 5.3% FTC, 3.7% MTC, 2.6% PDTC, 0.5% ATC, and 1% other cancer types. Therefore, to the best of our knowledge, the current data provide the most comprehensive results from an area with a history of iodine deficiency. In a recently published Italian real-life setting study (single-center, retrospective, observational) that included 6474 cytologically investigated TNs and comprised five different RSSs, inferior sensitivities (50.1-94.5%), PPV (7.7-11.5%), and AUC values in ROC analyses (0.606-0.632) were reported [45]. Among other reasons, such as a different history of iodine supply between Germany and Italy [46], the superior performance of the RSSs in the current study may be due to the exclusion of autonomously functioning (hyperfunctioning) lesions. In a previous study, the GTSG revealed that a relevant number of hyperfunctioning TNs showed high-risk US patterns [32]. Scintigraphically guided preselection can therefore be recommended to improve the US-based risk stratification of TNs.
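Because RSS categories are ordinal, the AUC values compared above can be read as the probability that a randomly chosen malignant nodule receives a higher category than a randomly chosen benign one, with ties counted as one half (the Mann-Whitney interpretation of the AUC). The following minimal sketch computes this quantity directly; it is not the SPSS procedure used in the study, and the numeric category ranks are invented toy data.

```python
from typing import Sequence


def auc_ordinal(scores_malignant: Sequence[float], scores_benign: Sequence[float]) -> float:
    """AUC as P(malignant score > benign score) + 0.5 * P(tie).

    Scores are numeric ranks of the ordinal RSS categories
    (for example 1-5); higher means more suspicious.
    """
    if not scores_malignant or not scores_benign:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for m in scores_malignant:
        for b in scores_benign:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5  # ties contribute one half
    return wins / (len(scores_malignant) * len(scores_benign))


# Toy data (invented): category ranks of three malignant and three benign nodules.
toy_malignant = [5, 4, 5]
toy_benign = [2, 3, 4]
toy_auc = auc_ordinal(toy_malignant, toy_benign)
print(round(toy_auc, 3))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. Unlike accuracy, the AUC does not depend on the chosen cutoff, which is consistent with five systems showing nearly identical AUC values while their accuracies at the predefined cutoffs differ.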
Further clinical examination data revealed larger sizes and a higher frequency of scintigraphically hypofunctioning lesions for benign compared to malignant TNs. However, since the decision for or against cytological or histopathological clarification of a TN was a comprehensive clinical decision that considered several additional findings, such as laboratory results and disease-related symptoms, the data were affected by selection bias. As a result, over 80% of the lesions in the study population were hypofunctioning. The data showed a high sensitivity (75.1%) but a very low specificity (14.9%) of the hypofunctioning feature for detecting malignant lesions. Because of this selection bias (especially the exclusion of hyperfunctioning lesions), these diagnostic parameters do not reflect the findings in clinical routine. However, the majority of the malignant TNs showed up as hypofunctioning on scintigraphy scans, which was in accordance with the literature [47].
The multicentric study design meant that enrolled patients could be managed by different approaches in clinical practice. It must be emphasized that this could have affected the results. Since only TNs that were characterized by scintigraphy were included, less than 1% of the TNs measured < 10 mm. However, it is known that lesions < 10 mm can be detected as hyperfunctioning on scintigraphy and can be reliably assessed by I-124 positron emission tomography (PET)/US fusion imaging even in unfavorable localizations [47][48][49]. Furthermore, TIRADS have been proven to perform well in TNs < 10 mm [50].
So far, no uniform RSS has been established worldwide, although work has recently begun on a new international US-based RSS for TN. With the participation of several scientific societies, the so-called I-TIRADS will be proposed and established internationally as a uniform evidence-based system. Currently, different working groups are investigating individual ultrasound criteria [51]. In addition, promising data already exist regarding the use of artificial intelligence (AI) to identify ultrasound patterns. This technique could significantly reduce interobserver variability and account for regional differences such as site-typical normal findings via variable databases [52]. Another important pillar in the evaluation of TNs is related to the aforementioned topics: the establishment of (automated) structured reporting (SR). It is already well advanced in other diagnostic examination procedures such as mammography or prostate MRI as well as in professional study protocols [53,54]. Concepts for the implementation of AI pattern detection and SR in the field of thyroid US have already been proposed. In particular, the generation of automated findings from manually acquired ultrasound image data has the potential to provide considerable time savings for medical staff and may thus also have health and economic relevance for regions with a high prevalence of thyroid disease [55][56][57].

Conclusions
Kwak-TIRADS, ACR TI-RADS, Korean-TIRADS, and ATA Guidelines revealed high performance levels with diagnostic accuracies of about 80% and AUC values of approximately 0.8 without significant differences. However, over 10% of the TNs were not classifiable according to ATA Guidelines. The diagnostic performance of EU-TIRADS was slightly inferior in comparison with the aforementioned ultrasound risk stratification systems for thyroid nodules. Therefore, Kwak-TIRADS, ACR TI-RADS, and Korean-TIRADS can be preferentially recommended in areas with a history of iodine deficiency. Scintigraphic preselection to exclude hyperfunctioning nodules may improve the performance of ultrasound-based risk stratification systems.