Diagnostic Performance of ACR and Kwak TI-RADS for Benign and Malignant Thyroid Nodules: An Update Systematic Review and Meta-Analysis

Simple Summary
This meta-analysis determined the optimal cut-off value for differentiating benign from malignant thyroid nodules in two risk stratification systems (ACR and Kwak TI-RADS) and compared their diagnostic performance. Both systems showed good diagnostic performance. TR4 and 4B were estimated as the optimal cut-off values for ACR and Kwak TI-RADS, respectively, but the cut-off values can be adjusted in consideration of changes in sensitivity and specificity.

Abstract
(1) Background: To determine the optimal cut-off values of two risk stratification systems for discriminating malignant thyroid nodules and to compare their diagnostic performance; (2) Methods: True and false positive and negative data were collected, and methodological quality was assessed for forty-six studies involving 39,085 patients; (3) Results: The highest areas under the receiver operating characteristic (ROC) curve (AUC) for ACR and Kwak TI-RADS were 0.875 and 0.884, respectively. Based on optimal sensitivity and specificity and on the highest accuracy values from ROC curves or diagnostic odds ratios (DOR), TR4 (moderately suspicious) and 4B were taken as the cut-off values. The sensitivity, specificity, DOR, and AUC by ACR (TR4) and Kwak TI-RADS (4B) for malignancy risk stratification of thyroid nodules were 94.3% and 96.4%; 52.2% and 53.7%; 17.5185 and 31.8051; 0.786 and 0.884, respectively. There were no significant differences in diagnostic accuracy in any of the direct comparisons of the two systems; (4) Conclusions: ACR and Kwak TI-RADS had good diagnostic performance (AUCs > 0.85). Although we determined the best cut-off values in the individual risk stratification systems based on statistical assessment, clinicians can adjust the cut-off value according to the clinical purpose of the ultrasonography, because raising or lowering cut-points leads to reciprocal changes in sensitivity and specificity.


Introduction
Thyroid nodules are relatively common in the general population, and about 10% of thyroid nodules have a risk of malignancy with increasing prevalence [1,2]. Since malignant and benign thyroid nodules differ in treatment and prognosis, early differentiation between benign and malignant thyroid nodules is important [3].
Currently, ultrasound stratification of thyroid nodules is a fast, primary, and easy-to-use diagnostic tool for thyroid nodules; it is also non-invasive and inexpensive [3]. The characteristics of benign and malignant nodules on ultrasound differ, but they can be interpreted differently, or misinterpreted, depending on the experience of the examiner or the image obtained [4]. In most ultrasound examinations, fine needle aspiration [...]
[...] positive, false positive, false negative, and true negative, and it was used to assess diagnostic accuracy with a 95% confidence interval from random-effects models, considering both within- and between-study variation. Higher DOR values (ranging from 0 to infinity) indicated better diagnostic performance. The SROC approach is considered the best method for meta-analysis and generates paired sensitivity and specificity estimates [6,11,. As the discriminatory power of a test increases, the SROC curve more closely approaches the top left corner of the receiver operating characteristic (ROC) space (i.e., the point where sensitivity and specificity both equal 1 (100%)) [56]. AUC is a value between 0 and 1, and a higher AUC value indicates better diagnostic performance. An AUC of >0.9 to 1.0 is considered excellent diagnostic accuracy, >0.8 to 0.9 good, >0.7 to 0.8 fair, and >0.6 to 0.7 poor, and an AUC of ≤0.6 is interpreted as a failed diagnosis [57].
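As an illustration of the two measures described above, the following minimal Python sketch computes a DOR with a 95% confidence interval for a single 2×2 table and grades an AUC using the bands quoted from [57]. It is not the pooled random-effects estimate used in this meta-analysis; the log-odds (Woolf) interval, the 0.5 continuity correction, and the function names `diagnostic_odds_ratio` and `auc_grade` are assumptions made for the example.

```python
import math


def diagnostic_odds_ratio(tp, fp, fn, tn, z=1.96):
    """Return (DOR, lower, upper) for one 2x2 table.

    Assumption: the interval is the log-odds (Woolf) interval for a single
    study, not the random-effects pooled interval reported in the paper.
    """
    # Assumption: add 0.5 to every cell only when some cell is zero,
    # so that the ratio and its standard error stay finite.
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    dor = (tp * tn) / (fp * fn)
    se_log = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    lower = math.exp(math.log(dor) - z * se_log)
    upper = math.exp(math.log(dor) + z * se_log)
    return dor, lower, upper


def auc_grade(auc):
    """Map an AUC in [0, 1] to the qualitative bands quoted in the text."""
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must lie between 0 and 1")
    if auc > 0.9:
        return "excellent"
    if auc > 0.8:
        return "good"
    if auc > 0.7:
        return "fair"
    if auc > 0.6:
        return "poor"
    return "failed"
```

With hypothetical counts, `diagnostic_odds_ratio(90, 40, 10, 60)` evaluates (90 × 60) / (40 × 10) as the point estimate, and `auc_grade(0.884)` falls in the "good" band.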
We extracted the following data from the included studies: number of patients, correlations of scores measured in endoscopy and CT, true positive values, true negative values, false positive values, and false negative values. The Quality Assessment of Diagnostic Accuracy Studies version 2 tool (QUADAS-2) was used to evaluate methodological quality (e.g., risk of bias) [58]. To define true positives and true negatives, a guideline category < cut-off value was regarded as "test negative" and a guideline category ≥ cut-off value as "test positive". Therefore, "benign" lesions classified as < cut-off value were regarded as true negatives, and "non-benign" lesions classified as ≥ cut-off value were regarded as true positives. Accordingly, the sensitivity, specificity, and DOR were calculated with reference to the final results based on pathological examination, FNA cytology, and follow-up. ROC curve analyses and AUC were used to assess the effectiveness of the guidelines in differentiating benign from malignant thyroid nodules [59].
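The dichotomization rule above (category ≥ cut-off is "test positive") can be sketched in Python as follows. The category ordering follows the Kwak TI-RADS scale; the function names and the toy data in the usage note are hypothetical and are not taken from the included studies.

```python
# Ordered Kwak TI-RADS categories, from least to most suspicious.
KWAK_ORDER = ["2", "3", "4a", "4b", "4c", "5"]


def confusion_counts(categories, malignant, cutoff, order=KWAK_ORDER):
    """Count (TP, FP, FN, TN) when category >= cutoff is 'test positive'.

    categories: assigned guideline category per nodule
    malignant:  reference-standard result per nodule (True = non-benign)
    """
    rank = {c: i for i, c in enumerate(order)}
    threshold = rank[cutoff]
    tp = fp = fn = tn = 0
    for cat, is_malignant in zip(categories, malignant):
        positive = rank[cat] >= threshold
        if positive and is_malignant:
            tp += 1
        elif positive:
            fp += 1
        elif is_malignant:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def sensitivity_specificity(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from the four cell counts."""
    return tp / (tp + fn), tn / (tn + fp)
```

For example, calling `confusion_counts(cats, truth, "4b")` and then `confusion_counts(cats, truth, "4a")` on the same toy lists shows how lowering the cut-off moves nodules from the negative to the positive side of the table.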

Statistical Analyses and Outcome Measurements
We used R statistical software version 3.6.1 (R Foundation for Statistical Computing, Vienna, Austria) for the meta-analysis. The Q statistic was used to evaluate heterogeneity. The TI-RADS categories proposed by Kwak et al. classify thyroid nodules as 2 (benign lesions), 3 (no suspicious ultrasound features), 4a (one suspicious ultrasound feature), 4b (two suspicious ultrasound features), 4c (three or four suspicious ultrasound features), or 5 (five suspicious ultrasound features) according to the risk estimates of malignancy [4]. Ultrasound features in the ACR TI-RADS are categorized as benign (TR1, 0 points), not suspicious (TR2, 2 points), mildly suspicious (TR3, 3 points), moderately suspicious (TR4, 4-6 points), or highly suspicious (TR5, ≥7 points) for malignancy [12]. Diagnostic accuracy in the individual risk stratification systems (ACR TI-RADS and Kwak TI-RADS) was assessed using different cut-off values. Forest plots show sensitivity and specificity, and SROC curves are also presented.
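The two category schemes just described can be written as small lookup functions. This is a minimal Python sketch of the mappings as stated in the text, not a clinical implementation; the function names are hypothetical. Two assumptions are flagged in the comments: a total of 1 point is not listed in the ACR scheme quoted above, so the sketch rejects it, and Kwak category 2 (benign lesions) is not derived from a feature count, so it is left out.

```python
def acr_category(points):
    """Map a total ACR TI-RADS point score to TR1..TR5 as quoted in the text."""
    if points < 0:
        raise ValueError("points must be non-negative")
    if points >= 7:
        return "TR5"  # highly suspicious
    if points >= 4:
        return "TR4"  # moderately suspicious (4-6 points)
    if points == 3:
        return "TR3"  # mildly suspicious
    if points == 2:
        return "TR2"  # not suspicious
    if points == 0:
        return "TR1"  # benign
    # Assumption: the quoted scheme does not list a 1-point total.
    raise ValueError("a total of 1 point is not listed in the quoted scheme")


def kwak_category(n_suspicious):
    """Map the number of suspicious ultrasound features (0-5) to a Kwak category.

    Assumption: category 2 (benign lesions) is assigned separately and is
    therefore not produced by this function.
    """
    mapping = {0: "3", 1: "4a", 2: "4b", 3: "4c", 4: "4c", 5: "5"}
    if n_suspicious not in mapping:
        raise ValueError("expected between 0 and 5 suspicious features")
    return mapping[n_suspicious]
```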

Search and Study Selection
Forty-six studies with 39,085 participants were included in the analyses (Figure 1). Study characteristics are shown in Supplementary Table S1, and the results for bias assessment are presented in Supplementary Table S2.

Diagnostic Accuracy in Various Ultrasound Risk Stratification Systems
Diagnostic efficacy and ROC curves for the two guidelines according to the various cut-off values are shown in Tables 1 and 2, respectively.

In Kwak TI-RADS categories, sensitivity ranged from 14% to 99% (highest at 4a), and specificity showed the inverse association (ranging from 27% to 99%; highest at 5) according to the different cut-off values (categories). ROC analysis and DOR both identified 4b as the best diagnostic cut-off value of Kwak TI-RADS. A cut-off in a screening test is chosen to minimize the rate of false negatives rather than to reduce false positives, because this is appropriate for conditions in which misdiagnosing and treating someone as sick is better than missing truly sick individuals [60]. Based on these statistical considerations, the best cut-off value of Kwak TI-RADS was category 4b, with 96.4% sensitivity and 53.7% specificity. However, in practice, if the sensitivity or specificity of a screening test is considered either too high or too low, it can be adjusted by raising or lowering the cut-off value [61]. It has also been suggested that a more appropriate value for both sensitivity and specificity would be approximately 73%, which is similar to the values obtained by other researchers [62,63]. Therefore, 4c could also be a good cut-off value for Kwak TI-RADS, because balanced sensitivity and specificity could be more suitable for screening tests.
In ACR TI-RADS categories, sensitivity ranged from 71.0% to 98.8% (highest at TR3), and specificity showed the inverse association (ranging from 23.7% to 86.9%; highest at TR5) according to the different cut-off values (categories). ROC analysis and DOR showed that the best diagnostic cut-off values of ACR TI-RADS were TR5 and TR4, respectively. Although a test with a high AUC is statistically considered "better" than one with a lower AUC, AUC lacks clinical interpretability because it does not reflect the practical gains and losses to individual patients from diagnostic tests [64]. A cut-off in a screening test is chosen to minimize the rate of false negatives rather than to reduce false positives [1]. Accordingly, the best cut-off value for ACR TI-RADS was category TR4, with 94.3% sensitivity and 52.2% specificity. However, TR5 would also be a good cut-off value for ACR TI-RADS, because balanced sensitivity and specificity could be more suitable for screening tests.
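The trade-off between a balanced cut-off and a sensitivity-first cut-off can be made concrete with a short Python sketch. It uses the ACR TI-RADS sensitivity and specificity values quoted above, pairing the range endpoints with TR3 and TR5 as the "inverse association" implies (an assumption of this example). Youden's J is used here as a common balance criterion, and the 0.90 sensitivity floor is a hypothetical screening requirement; neither is stated to be the criterion applied in this meta-analysis.

```python
# (sensitivity, specificity) at each ACR TI-RADS cut-off, read from the text.
# Assumption: the endpoints 71.0%/86.9% belong to TR5 and 98.8%/23.7% to TR3.
ACR_POINTS = {
    "TR3": (0.988, 0.237),
    "TR4": (0.943, 0.522),
    "TR5": (0.710, 0.869),
}


def youden_j(sensitivity, specificity):
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


def best_by_youden(points):
    """Cut-off with the largest Youden's J (a balance-oriented choice)."""
    return max(points, key=lambda c: youden_j(*points[c]))


def best_with_sensitivity_floor(points, min_sensitivity):
    """Most specific cut-off among those meeting a minimum sensitivity.

    Returns None when no cut-off reaches the floor.
    """
    eligible = {c: v for c, v in points.items() if v[0] >= min_sensitivity}
    if not eligible:
        return None
    return max(eligible, key=lambda c: eligible[c][1])
```

With these inputs, `best_by_youden(ACR_POINTS)` selects TR5, whereas `best_with_sensitivity_floor(ACR_POINTS, 0.90)` selects TR4, which mirrors the distinction drawn in the text between a balanced cut-off and one chosen to minimize false negatives.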

Direct Comparison of Diagnostic Performance for Predicting Thyroid Malignancy with the Two TI-RADS
Only 11 studies that evaluated the diagnostic accuracy of the two guidelines in the same lesions or patients were included for direct comparison. The ROC curves at the cut-off values for Kwak (4b) and ACR TI-RADS (TR4) indicated that there was no significant difference between the two guidelines (Kwak TI-RADS (AUC 0.842) and ACR TI-RADS (AUC 0.846)) in diagnostic performance for thyroid malignancy (Table 3). Additionally, they both had statistically similar and high diagnostic performance for sensitivity (Kwak 0

Discussion
Nodule number, size, calcification, and echo pattern from ultrasound images are considered when classifying thyroid nodules according to TI-RADS [3,7], both to differentiate benign from malignant thyroid nodules and to determine whether FNA is required [3,13]. The ultrasound stratification systems help to avoid unnecessary FNA in cases when the thyroid nodule is too small, too large, or when benign versus malignant status is ambiguous. However, the FNA recommendation threshold is different for each system, and the results of the studies analyzing the diagnostic effect are not consistent [3,4].
Many meta-analyses have compared different ultrasound risk stratification systems for thyroid nodules [3,4,65-68], but our study analyzed the diagnostic effect in more detail, including cut-off values for two risk stratification systems. In addition to identifying the optimal cut-off value, we directly compared the two stratification systems at different cut-off values with an AUC > 0.8 for high specificity and sensitivity.
Other meta-analysis results related to Kwak TI-RADS showed high overall sensitivity and low specificity [65,68], which was associated with a high cut-off of 4a and 4b [65]. In our study, as the cut-off value increased, sensitivity decreased and specificity tended to increase. In other words, a higher cut-off classifies more benign thyroid nodules as test negative and thereby helps to reduce unnecessary surgeries. However, there was no significant difference in the direct comparison between the best cut-offs in Kwak TI-RADS (4b) and ACR TI-RADS (TR4). In the direct comparison of Kwak TI-RADS (4b and 4c) and ACR TI-RADS (TR4 and TR5), all had an AUC greater than 0.8, which theoretically indicates an effective diagnostic tool [69].
In our study, both Kwak TI-RADS (4b and 4c) and ACR TI-RADS (TR4 and TR5) had high diagnostic performance, with no significant difference between them. However, although both are point-based systems that have high accuracy and may require complex analysis and calculations, they are internally different [70]. Because TI-RADS should be able to reduce the subjective effect of ultrasound and provide a standard for diagnosis and treatment, further research is needed for the best diagnostic performance with the best cut-off value. For FNA recommendation, further studies on the size threshold are needed.
ACR TI-RADS has good diagnostic performance at the cut-off values of TR4 and TR5, but the nodule size threshold for FNA was also clinically important in our study. The thresholds for categories 3, 4, and 5 in ACR TI-RADS were 2.5, 1.5, and 1 cm, respectively, which were reported as an effective criterion to reduce unnecessary FNA [71].
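The size thresholds quoted above can be expressed as a simple lookup. The Python sketch below is illustrative only; the function name `fna_indicated`, the use of the maximum diameter, the "greater than or equal to" comparison, and the handling of TR1 and TR2 (no threshold is given in the text, so the sketch returns False) are all assumptions of this example.

```python
# FNA size thresholds (cm) for ACR TI-RADS categories, as quoted in the text.
FNA_THRESHOLD_CM = {"TR3": 2.5, "TR4": 1.5, "TR5": 1.0}


def fna_indicated(category, max_diameter_cm):
    """Return True when the nodule reaches the size threshold for its category.

    Assumption: categories without a quoted threshold (TR1, TR2) never
    trigger an FNA recommendation in this sketch.
    """
    threshold = FNA_THRESHOLD_CM.get(category)
    if threshold is None:
        return False
    return max_diameter_cm >= threshold
```

For instance, a 1.4 cm TR4 nodule falls below its 1.5 cm threshold, while a 1.0 cm TR5 nodule meets its threshold.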
This study has several limitations, some of which may explain the high heterogeneity. First, selection bias for patients may have occurred: some of the included studies contained more malignant than benign thyroid nodules. Second, because thyroid nodule diagnosis is ultimately determined by pathologic or cytologic results, bias may occur depending on clinicians or diagnostic tools. Third, the varying experience of radiologists, as commonly described in the included studies, can induce high heterogeneity. However, difficult cases may be accompanied by the supervision of more experienced radiologists [72]. Additionally, intermediate categories were not tested in ACR TI-RADS, and low agreement in cytologic pathology may be expected [65]. Fourth, most included studies were retrospective. Prospective and multicenter studies need to be included to reduce bias. Lastly, papillary thyroid cancer is related to the BRAF V600 mutation, but the correlation between the ultrasound stratification system and mutation status was not considered. The possibility of mutation is low in intermediate nodules, but it should be considered.

Conclusions
ACR TI-RADS and Kwak TI-RADS are both effective for the differential diagnosis of benign and malignant thyroid nodules, with an AUC of 0.85 or higher. However, since changing the statistically determined optimal cut-off value changes sensitivity and specificity, the cut-off value should be selected according to the clinical situation.