Inter-Reader Agreement of ATA Sonographic Risk in Thyroid Nodules with Bethesda Category III Indeterminate Cytology

Background: Substantial inter-observer variation has been documented in the recognition and description of specific sonographic features as well as for ATA sonographic risk (ASR). This raises the question of whether the risk stratification proposed by the ATA guidelines is reproducible and applicable for nodules with indeterminate cytology. The aim of the study was to determine the inter-reader agreement (IRR) among radiologists using the 2015 ASR stratification in indeterminate thyroid nodules. Methods: Three board-certified radiologists, blinded to clinical data and to each other's interpretations, interpreted the ultrasound findings of 179 nodules that had Bethesda III cytology. The nodules were classified as high suspicion (HS), intermediate suspicion (IS), low suspicion (LS) or very low suspicion (VLS). Echogenicity, composition, taller-than-wide shape, vascularity, type of margins, and presence and type of calcifications were also described. Results: By majority consensus, 28%, 27%, 39% and 5% of nodules were described as high, intermediate, low and very low ASR, respectively. The inter-reader agreement was near perfect (k 0.82 CI 95% (0.77–0.87)). When nodules were paired into higher risk (HS + IS) and lower risk (LS + VLS) categories, agreement was substantial in both (k 0.77 and k 0.7, respectively). Conclusion: A near perfect agreement among readers was observed when stratifying indeterminate cytology nodules for ASR.


Introduction
The sonographic patterns proposed by the 2015 American Thyroid Association guidelines have been used to risk stratify thyroid nodules into five categories with corresponding estimated malignancy risks: High (70-90%), Intermediate (10-20%), Low (5-10%), Very Low (<3%) suspicion and benign (<1%) nodules. Nodules with high and intermediate suspicion patterns are recommended to undergo FNA biopsy at a lower threshold of 1 cm while those with low and very low suspicion patterns have higher thresholds of 1.5 and 2 cm, respectively [1].
Certain sonographic features have a higher association with risk of malignancy. For example, hypoechogenicity alone is enough to place a nodule in an intermediate or high suspicion pattern, whereas a predominantly cystic composition corresponds to lower suspicion patterns. Because the size thresholds for FNA biopsy differ by pattern, accurately describing each sonographic feature and assigning each nodule to an ATA sonographic risk (ASR) pattern carries great clinical significance. Substantial inter-observer variation has been documented in the recognition and description of specific sonographic features as well as for ATA sonographic risk [2,3]. Interestingly, the use of the ATA sonographic risk patterns after obtaining a cytology result has not been outlined clearly. In the case of a malignant or benign FNA cytology result, management is generally clear, with resection or observation recommended, respectively. However, management is more variable for indeterminate cytology nodules, whose probability of malignancy varies widely. For specimens diagnosed as atypia of undetermined significance/follicular lesion of undetermined significance (AUS/FLUS), the low interobserver concordance in interpreting the cytology specimens [4], paired with the wide range of management strategies, from surveillance to diagnostic surgery, makes it challenging for clinicians to determine how aggressive the evaluation and management of these nodules should be.
Recently, several molecular tests have been utilized to further risk stratify indeterminate thyroid nodules (ITNs) and guide management, but their use is limited by cost and availability. A recent study by our group evaluated the association between ASR and the Afirma gene expression classifier (GEC) and determined their individual and combined diagnostic performances [5].
Some authors have recommended the use of sonographic classifications as adjunctive predictors of malignancy for indeterminate cytology nodules [6]. However, ATA sonographic risk stratification of indeterminate cytology nodules has been found to be limited, as shown by low interobserver agreement on the distribution of the sonographic patterns and by a large number of nodules that did not fit into any of the proposed risk patterns [7]. This result raises the question of whether the risk stratification proposed by the ATA guidelines is reproducible and applicable for nodules with indeterminate cytology.
Our objective is to determine the inter-reader agreement among radiologists for individual sonographic features as well as the ATA sonographic risk stratification when applied to thyroid nodules with Bethesda category III indeterminate cytology.

Materials and Methods
A retrospective chart review of patients with Bethesda category III nodules (atypia of undetermined significance/follicular lesion of undetermined significance (AUS/FLUS)) from 1 January 2012 to 31 December 2017 was performed. Patients were 18 years or older and had one or more thyroid nodules confirmed by ultrasound. Three board-certified radiologists at our academic medical center, blinded to clinical data, pathology results and each other's reports, interpreted the ultrasound (US) findings of 179 nodules with Bethesda category III cytology. The ultrasound system used was a Siemens Acuson S3000 with a 12 MHz transducer.
Each radiologist had received the 2015 ATA guidelines for thyroid nodule risk stratification at least one month prior to the beginning of the study. One radiologist had experience with thyroid ultrasound imaging, one had general ultrasound imaging experience and one had neuroimaging experience. All radiologists used a picture archiving and communication system (PACS) to review dedicated thyroid ultrasound images.

This study was approved by the Institutional Review Board of the University of Miami Miller School of Medicine. Informed consent was waived as the study was retrospective in nature.
The nodules were risk stratified and classified into high (HS), intermediate (IS), low (LS), very low (VLS) or non-ATA (NA) risk categories. Hypoechoic nodules exhibiting at least one suspicious feature (microcalcifications, irregular margins, shape taller than wide, presence of suspicious lymph nodes) were classified as HS pattern. Hypoechoic nodules without any suspicious feature were classified as IS pattern. Iso- or hyperechoic solid nodules, or partly cystic nodules with eccentric solid areas, were classified as LS pattern. Spongiform or partly cystic nodules without eccentric solid areas or other suspicious features were classified as VLS pattern. Radiologists assigned the NA category when the characteristics described did not correspond to any risk pattern in the 2015 ATA classification. These nodules included iso- or hyperechoic nodules with at least one high risk feature and mixed echogenicity nodules.
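The pattern rules above can be read as a small decision procedure. The following Python sketch is illustrative only and is not the software used in this study: the `Nodule` fields, the string labels and the order in which the rules are checked are assumptions made here for the example.

```python
from dataclasses import dataclass


@dataclass
class Nodule:
    # Field names and allowed values are hypothetical, chosen for this sketch.
    echogenicity: str  # "hypoechoic", "isoechoic", "hyperechoic" or "mixed"
    composition: str   # "solid", "partly_cystic" or "spongiform"
    eccentric_solid_area: bool = False
    microcalcifications: bool = False
    irregular_margins: bool = False
    taller_than_wide: bool = False
    suspicious_lymph_nodes: bool = False


def classify_asr(n: Nodule) -> str:
    """Return HS, IS, LS, VLS or NA following the rules stated in the text."""
    suspicious = any([
        n.microcalcifications,
        n.irregular_margins,
        n.taller_than_wide,
        n.suspicious_lymph_nodes,
    ])
    # Mixed echogenicity does not fit any 2015 ATA pattern.
    if n.echogenicity == "mixed":
        return "NA"
    # Assumption: echogenicity is checked before composition.
    if n.echogenicity == "hypoechoic":
        return "HS" if suspicious else "IS"
    # From here on the nodule is iso- or hyperechoic.
    if suspicious:
        return "NA"
    if n.composition == "spongiform":
        return "VLS"
    if n.composition == "partly_cystic":
        return "LS" if n.eccentric_solid_area else "VLS"
    return "LS"  # iso- or hyperechoic solid nodule
```

For example, `classify_asr(Nodule("isoechoic", "solid", irregular_margins=True))` returns `"NA"`, which corresponds to the non-ATA description given above.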
In addition, 7 features were described: echogenicity, composition, shape taller than wide, vascularity, margins, calcifications and presence of suspicious lymph nodes. Echogenicity was described as hypoechoic, isoechoic or hyperechoic relative to the normal thyroid parenchyma. Composition was classified into mixed solid-cystic with >50% solid, mixed cystic-solid with >50% cystic, mixed and spongiform. The shape of the nodule was classified as taller than wide or not taller than wide in the transverse view. Vascularity was classified as internal, peripheral, mixed or absent. Nodule margins were classified into regular, irregular spiculated, irregular microlobulated and irregular infiltrative. Calcifications were classified into microcalcifications, macrocalcifications or none. Suspicious lymph nodes were classified as present or absent.
A consensus result was obtained for ATA risk stratification and each sonographic feature when at least 2 of the raters agreed on the result.
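As a minimal sketch of this majority rule, assuming each nodule's readings are held as a list of three labels (the function name and label strings are hypothetical and not taken from the study):

```python
from collections import Counter


def majority_consensus(ratings, min_votes=2):
    """Return the label chosen by at least `min_votes` raters, else None."""
    label, votes = Counter(ratings).most_common(1)[0]
    return label if votes >= min_votes else None
```

With three raters, `majority_consensus(["LS", "IS", "LS"])` gives `"LS"`, while three different labels give `None`, the situation described in this paper as absolute disparity.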
Inter-rater agreement among the 3 observers for the ATA sonographic risk pattern and for each sonographic feature was calculated using Fleiss' kappa statistic with a 95% confidence interval. By the accepted interpretation of kappa (k) values, inter-rater agreement is near perfect if k ranges from 0.81-1, substantial if 0.61-0.8, moderate if 0.41-0.6, fair if 0.21-0.4 and poor if k ≤ 0.2.
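For readers unfamiliar with the statistic, the sketch below shows one way Fleiss' kappa and the interpretation bands above could be computed. It is a plain-Python illustration under stated assumptions, not the statistical software used in this study: the function names and input layout (one list of rater labels per nodule) are hypothetical, the confidence interval is not computed, and the band boundaries are applied to unrounded values.

```python
import math
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-nodule label lists (same rater count each)."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every nodule needs the same number of ratings")
    totals = Counter()      # label counts summed over all nodules
    agreement_sum = 0.0     # running sum of per-nodule agreement P_i
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar = agreement_sum / n_subjects  # mean observed agreement
    p_e = sum((t / (n_subjects * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return math.nan  # only one label was ever used; kappa is undefined
    return (p_bar - p_e) / (1 - p_e)


def interpret_kappa(k):
    """Map a kappa value to the agreement bands listed in the text."""
    if k > 0.8:
        return "near perfect"
    if k > 0.6:
        return "substantial"
    if k > 0.4:
        return "moderate"
    if k > 0.2:
        return "fair"
    return "poor"
```

As a usage example, `interpret_kappa(0.82)` returns `"near perfect"` and `interpret_kappa(0.47)` returns `"moderate"`.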

Results
The population of our study consisted of 179 patients with 179 nodules with Bethesda III cytology. Of the patients, 81% were female and 49% were Hispanic. Mean age was 57.7 years at the time of FNA biopsy. Three per cent of patients had a family history of thyroid cancer, and 12% had a history of hypothyroidism. Mean TSH was 2.26 mU/L (Table 1).
Out of 179 nodules, 178 could be classified into an ASR category by majority consensus: 50 (28%), 48 (27%), 70 (39%) and 10 (5%) were described as high, intermediate, low and very low ASR, respectively (Table 2). One nodule showed absolute disparity in ASR classification among the 3 readers. None of the nodules were classified in the non-ATA risk category by majority consensus. Only rater 3 classified 5 nodules as non-ATA, whereas raters 1 and 2 were able to classify every nodule into one ASR pattern (Table 3). Raters 1 and 2 classified these 5 nodules as low and intermediate ASR patterns.
The overall inter-reader agreement for ASR across all three readers was near perfect (k 0.82 CI 95% (0.77-0.87)). Due to similar clinical management and FNA size thresholds, we paired the HS and IS pattern nodules into a higher risk category and the LS and VLS pattern nodules into a lower risk category. The inter-reader agreement in the higher risk category (k 0.77 CI 95% (0.66-0.87)) as well as the lower risk category (k 0.7 CI 95% (0.6-0.8)) was substantial. Raters disagreed on whether 15 nodules belonged to the higher or the lower risk category. Notably, rater 2 classified the majority of these nodules as intermediate risk whereas raters 1 and 3 classified the majority of them as low risk. Eight of them were ultimately classified as low risk and 6 as intermediate risk, with one nodule showing absolute disparity between raters (Table 4).
Table 4. Nodules with disparity between higher (HS + IS) and lower risk (LS + VLS) ASR categories as interpreted by each rater (n = 15). Columns: ATA Sonographic Risk (ASR), Rater 1, Rater 2, Rater 3.
There were 28 nodules in which majority consensus was achieved but without absolute agreement between raters when classifying their ASR. Among these nodules, the overall agreement rates for individual features were lower for the description of irregular margins (48%) and echogenicity (54%) (Table 5).
Table 5. Overall agreement rates for nodules that achieved majority consensus but no absolute consensus between raters when classifying for ASR (n = 28).
The inter-reader agreement for individual sonographic features was near perfect for composition (k 0.87), presence (k 0.85) and type of calcifications (k 0.82), substantial for echogenicity (k 0.71), margins (k 0.72), shape (k 0.7) and presence of suspicious lymph nodes (k 0.72), and moderate for vascularity (k 0.47) and type of irregular margins (k 0.5) (Table 2).

Discussion
Ultrasonography has an essential role in assessing the malignancy risk of thyroid nodules. Multiple risk stratification systems have been proposed to estimate risk based on a composite of thyroid nodule features [8-11]. The 2015 ATA guidelines are widely used to risk stratify thyroid nodules, offering size cutoffs for FNA biopsy that depend on estimated malignancy risk [1]. These systems have been observed to allow a more reproducible stratification of thyroid nodules, with higher inter-reader agreement than classification based on single suspicious features [3].
However, when applied to indeterminate cytology nodules, the 2015 ATA sonographic risk stratification showed only fair to borderline acceptable inter-reader agreement [2,3,7,12], which raises the question of whether this system is applicable to this specific subset of thyroid nodules. Furthermore, reports of its diagnostic performance in ITNs are conflicting, with some authors suggesting appropriate prediction of malignancy [13-15] and others questioning its adjunctive diagnostic value, including a recent study suggesting it be used only to set the threshold for FNA [16].
In our study of a large cohort of nodules with Bethesda III cytology, we observed that the inter-reader agreement or inter-rater reliability (IRR) for the ATA sonographic risk stratification among 3 radiologists was near perfect. In a recent publication by Lam et al., the greatest source of disagreement was found to be the description of echogenicity, with an IRR of k 0.35 (fair agreement) driven by a substantial number of "heteroechoic nodules" unable to fit into any ASR category, as well as by different thresholds for the interpretation of echogenicity among the observers [7].
Interestingly, in our study, all three readers had the option to elect a non-ATA category; however, only one rater used this option, for a total of five of the 179 nodules analyzed. In comparison, the other two raters were able to classify every single nodule with the ASR system. The observers in this study had different backgrounds in radiology, as we wanted to mimic real-life circumstances, and they were given the ATA sonographic risk guidelines only an average of 30 days before initiating the study. This difference suggests that US interpreters may be able to assign non-ATA thyroid nodules to one of the ATA risk patterns guided by the presence or absence of suspicious features.
In that sense, we observed that in the 28 nodules in which no absolute consensus was achieved when classifying into ASR, the feature in which raters disagreed the most was echogenicity followed by the description of margins. Either feature alone can determine the difference between low, intermediate or high suspicion ASR nodules.
Nonetheless, the overall inter-reader agreement for specific sonographic features of all nodules with Bethesda III cytology across the 3 observers was substantial to near perfect for every feature except vascularity and type of irregular margins, which ultimately do not determine ATA sonographic risk. Echogenicity carries the most weight in the ATA sonographic risk stratification, as hypoechogenicity has been shown to be a strong predictor of malignancy [1,7,17]. In our study, the IRR for echogenicity was substantial, with a kappa value of 0.71. In previous cohorts not restricted to a specific subset of thyroid nodules, echogenicity was consistently among the features with lower IRR (0.33-0.57), showing only fair to moderate agreement [3,18-26]. Conversely, in one study including only nodules with Bethesda III cytology, echogenicity showed the highest IRR among the individual features studied, ahead of the other 5, with a k value of 0.94. Notably, all other features had moderate to substantial agreement as well [26]. Taken together with our results, this could suggest that the disparity is explained by nodules with Bethesda III cytology being a subset with more homogeneous US characteristics. However, the results by Lam et al. would contradict this suggestion [7].
ATA sonographic stratification of nodules with known Bethesda III cytology may not determine clinical management at this time, but some authors recommend its use as an adjunctive predictor of malignancy. In that sense, different sonographic patterns will determine different thresholds for FNA biopsy. The ATA 2015 guidelines propose the same threshold for high and intermediate suspicion nodules and a similar threshold for low and very low suspicion nodules [1]. We decided to pair higher risk categories (HS + IS) and lower risk categories (LS + VLS) and found a substantial IRR in both groups, as expected from an overall near perfect IRR. However, we observed that 15 nodules showed disparity between the higher and lower risk ASR categories as interpreted by each rater. Interestingly, there was a notable predilection for raters 1 and 3 to classify these nodules as low risk ASR, whereas rater 2 classified almost all of them as intermediate risk ASR. This finding suggests different thresholds for echogenicity characterization by each rater, as already suggested by Lam et al., which could potentially impact critical patient management.
Our results indicate that, in contrast to previous results, the inter-reader agreement for the ATA sonographic risk stratification of nodules with Bethesda III cytology was near perfect. None of the nodules were classified in the non-ATA risk category by majority consensus. All individual features proposed by the 2015 ATA guidelines showed substantial to near perfect agreement between the readers. The sonographic features in which more robust agreement was observed were composition and presence of calcifications. The most common features of disagreement were vascularity and description of irregular margins. Nevertheless, these two features do not determine ASR. Echogenicity showed overall substantial agreement; however, it was the most common feature of disagreement in the subset of nodules that did not achieve absolute consensus between all three raters. Disparity between the classification in the lower and higher risk categories was observed in 15 cases out of 179, which could potentially impact critical patient management.
Informed Consent Statement: Patient consent was waived due to the retrospective chart review design of the study.