1. Introduction
A thyroid nodule is defined as a focal lesion, distinct from the surrounding healthy thyroid parenchyma, and can be recognized by imaging or histological sampling [
1,
2].
Most thyroid nodules are benign; however, between 10 and 15% of these are malignant [
3,
4]. According to the 2020 report of the World Cancer Observatory, thyroid cancer is responsible for approximately 586,000 cases worldwide [
5].
For this reason, all appreciable thyroid nodules should be individually and carefully examined to assess the presence or absence of ultrasound features capable of predicting the likelihood that a nodule is benign or malignant, and thus to guide the best diagnostic-therapeutic approach: necessary follow-up, negligible follow-up or, if necessary, minimally invasive in-depth diagnostics.
In recent decades, thanks to improved technology and modern knowledge in diagnostic imaging, there has been an increase in the ability to detect thyroid nodules and consequently thyroid cancers [
6].
Ultrasound is the primary imaging modality used in the study of nodular thyroid disease and allows clinicians to recognize and evaluate some suggestive features of malignancy; on the basis of these, the need for further diagnostic investigations should be assessed, such as fine needle aspiration (FNA) assessment.
Currently, the state of the art in the literature is represented by the TIRADS systems proposed by three different scientific societies: the Korean TIRADS (K-TIRADS), published in 2011 by the Korean Society of Thyroid Radiology (KSThR) and revised in 2016; the ACR-TIRADS, where ACR stands for American College of Radiology, which published its score in 2016; and the EU-TIRADS, where EU stands for European, proposed in 2017 [
7,
8,
9].
Radiomics has recently been gaining a place in thyroid imaging and beyond; it uses extraction algorithms to derive various quantitative features from radiological images, which can then be used by machine learning (ML) systems. ML is a subset of artificial intelligence (AI), and deep learning (DL) is in turn derived from it. These innovative technologies may eventually translate into software used directly by clinicians, namely computer-aided diagnosis (CAD) [
10,
11] (
Figure 1a–c), which is a type of deep learning software.
The aim of this study was to compare, through a retrospective analysis, the performance of the various TIRADS ultrasound systems mentioned above (K-TIRADS, EU-TIRADS and ACR-TIRADS) when each one is used by observers with different levels of experience compared with the actual malignancy rate obtained from a standard cytological/histological examination, including the time required for their application. Finally, we also performed a statistical analysis comparing the diagnostic performance of AI (CAD) with that of a human observer, looking for the possible presence of diagnostic discrepancy depending on the degree of experience of the human observer, to understand how S-Detect software can really help the less experienced radiologist.
2. Methods and Materials
In our retrospective study, which was approved by the local Ethics Committee (Comitato Etico territoriale Lazio Area 1), with approval number 7458, referring to protocol 1011/2023 and meeting minutes 20 December 2023, we included 277 patients, for a total of 334 thyroid nodules selected for fine needle aspiration (FNA), who had previously signed informed consent.
The 277 patients included came to our institute for observation in the period between September 2020 and October 2023, and all selected 334 thyroid nodules were submitted to cytological evaluation by FNA. Nodules that were found to be benign by the FNA cytological evaluation were rechecked at 18 and 36 months to confirm their stability and to consider them as such [
12]. In contrast, the nodules that were found to be malignant or undetermined at cytology were subjected to surgical resection and subsequent histological examination in accordance with the Italian classification of thyroid cytology. Patients with more than three thyroid nodules, cysts and nodules smaller than 5 mm were excluded from our study.
All images related to the included nodules were obtained at diagnosis, at follow-up or during cytologic (FNA) sampling by an experienced radiologist with more than 25 years of experience in thyroid ultrasound, using a high-frequency (14–20 MHz) linear probe. Images were acquired in B-Mode and stored for subsequent evaluation according to the standardization criteria suggested by the TIRADS system. Regarding the quality of the images obtained, standardization was made possible by the processing work of the S-Detect software, which automatically forced us to discard the images that did not meet the quality criteria.
Cytologic analysis of the nodules included in the FNA assessment was obtained within 15–20 days of sample collection. Each specimen was fixed in formalin, specially stored and sent to the Pathologic Anatomy Department of our institute for case analysis. For the evaluation of thyroid cytology, we used the new classification (TIR) published in 2014 by the Italian societies of endocrinology (AIT, AME and SIE) and of pathological anatomy and cytology (SIAPEC-IAP) [
13]. The data obtained were stored in a database for later comparison with the ultrasound findings.
Three observers with different experience levels (low level: <5 years of experience; medium level: between 5 and 15 years of experience; high level: >15 years of experience) were recruited for our retrospective analysis. The three observers were blinded to the patients’ clinical information (except for age), to the readings of the other observers, to the sonographic data extracted by the CAD and to the cytological results of FNA.
Each of the three observers viewed the stored B-Mode images of the 334 thyroid nodules and, for each, applied the three risk stratification systems (ACR-TIRADS, EU-TIRADS and K-TIRADS) by assigning a score for each of five ultrasound characteristics (composition, echogenicity, shape, margins, calcification or targeted echogenic foci) and then obtaining a TIRADS risk category from the sum of these. Later, the same images were processed by CAD software (S-Detect). All data obtained were compared with each other and with the gold standard.
All data and imaging results assigned by each observer and the CAD were stored separately in order to allow subsequent comparison through appropriate statistical analysis with the results of the cytological examination.
Radio/cytological agreement was estimated by comparing the TIRADS score assigned to each nodule by each TIRADS system (EU-TIRADS, K-TIRADS, ACR-TIRADS) with the cytological TIR score obtained from the corresponding FNA sample. On the basis of the TIRADS systems, nodules classified as 2 or 3 were considered in our study to be benign, and those classified as 4 or 5 as malignant. Regarding TIR classification, nodules with cytology scores of TIR2 were considered benign, while the others (TIR3, -4 and -5) were considered malignant; specifically, TIR3 nodules were considered indeterminate for malignancy (TIR3a: low-risk malignancy; TIR3b: high-risk malignancy). The category TIR1 corresponded to samples not valid for FNA assessment. Radio/cytological data, obtained for each thyroid nodule included, were compared with the gold standard to estimate the sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and area under the curve (AUC), each with 95% confidence intervals (CI).
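As an illustration of the dichotomization and of the indices listed above, the following minimal sketch computes sensitivity, specificity, PPV and NPV from paired readings. The function names and the toy data are hypothetical and are not taken from the study; the study itself used SPSS, and confidence intervals are omitted here.

```python
from dataclasses import dataclass


@dataclass
class DiagnosticMetrics:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def dichotomize_tirads(category: int) -> bool:
    """Return True (suspicious) for TIRADS 4-5 and False (benign) for TIRADS 2-3."""
    return category >= 4


def _ratio(numerator: int, denominator: int) -> float:
    # Avoid a ZeroDivisionError when a row or column of the 2x2 table is empty.
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(predicted, truth) -> DiagnosticMetrics:
    """Build the 2x2 table from paired booleans and derive the four indices."""
    tp = fp = tn = fn = 0
    for p, t in zip(predicted, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    return DiagnosticMetrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
    )


# Hypothetical toy data: TIRADS categories and reference-standard malignancy.
tirads = [5, 4, 4, 3, 2, 2, 3, 4]
malignant = [True, True, False, False, False, False, True, False]
metrics = diagnostic_metrics([dichotomize_tirads(c) for c in tirads], malignant)
```

In this toy example, two of the three malignant nodules are flagged (sensitivity 2/3) and three of the five benign nodules are correctly called benign (specificity 3/5).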
The sample size was calculated assuming a Type I error (α) of 5%, an expected prevalence of 10% and a marginal error of 5% to ensure an expected sensitivity of the expert observer of at least 98%. On the basis of these parameters, the required sample size was estimated to be 334 individuals.
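The formula behind this calculation is not stated. As a hedged illustration only, one commonly used approach for sizing a diagnostic accuracy study around an expected sensitivity (often attributed to Buderer) is sketched below; the function name is hypothetical, and this formula is an assumption that is not guaranteed to reproduce the figure of 334 reported above.

```python
import math
from statistics import NormalDist


def sample_size_for_sensitivity(sensitivity: float, prevalence: float,
                                margin: float, alpha: float = 0.05) -> int:
    """Total subjects needed so that the CI half-width for sensitivity is `margin`.

    Assumed formula: n = z^2 * Se * (1 - Se) / (margin^2 * prevalence),
    where z is the two-sided normal quantile for the chosen alpha.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_diseased = z ** 2 * sensitivity * (1 - sensitivity) / margin ** 2
    # Scale up by prevalence, since only a fraction of subjects are diseased.
    return math.ceil(n_diseased / prevalence)


# Inputs quoted in the text: alpha 5%, prevalence 10%, margin 5%, sensitivity 98%.
n_required = sample_size_for_sensitivity(0.98, 0.10, 0.05)
```

Other formulas, continuity corrections or allowances for non-diagnostic samples would yield different totals.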
The results obtained by individual observers and S-Detect, and their comparison with the final cytohistological diagnosis, have been compiled in
Table 1,
Table 2 and
Table 3.
3. Statistical Analysis
An in-depth statistical analysis was carried out to assess inter-observer concordance at different levels.
The primary objective of the study was to evaluate inter-observer concordance for each of the three TIRADS systems included (EU-, K- and ACR-TIRADS) when the TIRADS score was assigned by observers with different levels of experience (low, <5 years; medium, between 5 and 15 years; and high, >15 years). Furthermore, inter-observer agreement was investigated for each of the five sonographic items considered (composition, shape, calcifications, echogenicity and margins). Then, the results obtained by each of the three human observers were compared with those calculated using the CAD software (S-Detect).
Finally, radio/cytological agreement was evaluated for each TIRADS system (K, EU and ACR) when applied by observers with different levels of experience, comparing the TIRADS scores assigned with the TIR (the Italian cytology classification for thyroid nodules) cytological results from FNA.
To confirm the intra-observer agreement, one of the three recruited observers reviewed the same nodules a second time from the first view of each nodule; in this regard, to homogenize the sample, the observer with a medium level of experience was chosen (between 5 and 15 years).
Inter-observer agreement for TIRADS scores was evaluated both among all readers (more than two observers) and between each observer and the S-Detect (two observers) system for ACR-, K- and EU-TIRADS using the multi-user weighted Cohen’s kappa.
In addition, by means of the multi-user Cohen’s kappa, it was also possible to obtain data on the frequency of presentation of each of the five ultrasound items studied (composition, shape, margins, calcifications and echogenicity).
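For the pairwise comparisons (one observer versus S-Detect), a weighted Cohen’s kappa on ordinal TIRADS categories can be sketched as follows. This is a minimal illustration and not the SPSS procedure used in the study: the function name is hypothetical, and the choice of linear weights is an assumption, since the weighting scheme is not specified in the text.

```python
def weighted_kappa(rater_a, rater_b, categories, weighting="linear"):
    """Weighted Cohen's kappa for two raters scoring the same items.

    `categories` lists the ordered categories (e.g. [2, 3, 4, 5]).
    `weighting` is "linear" or "quadratic" (disagreement weights).
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed joint proportions of (rater_a category, rater_b category).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    power = 1 if weighting == "linear" else 2
    weighted_observed = weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # 0 on the diagonal, 1 at the extremes
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j]

    if weighted_expected == 0:
        return 1.0  # both raters used a single identical category
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical readings for four nodules on a two-category scale.
kappa_example = weighted_kappa([0, 0, 1, 1], [0, 1, 1, 1], categories=[0, 1])
```

With only two categories the weights have no effect and the result equals the ordinary Cohen’s kappa; agreement among more than two readers would require a multi-rater statistic, which this sketch does not cover.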
P-values below 0.05 were considered statistically significant. All statistical analyses were carried out with SPSS version 21 statistical software (SPSS Inc., Chicago, IL, USA).
In this regard, it was decided to establish the degree of inter-observer agreement on the basis of the interpretation of Landis and Koch, thus identifying the following degrees of concordance: poor agreement for Cohen’s k values of <0.2, fair agreement for k values of 0.2–0.4, moderate agreement for k values of 0.4–0.6, substantial agreement for k values of 0.6–0.8 and excellent for k values > 0.8 [
14].
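The interpretation scale above can be expressed as a small lookup, shown here as a minimal sketch. The function name is hypothetical, and because the quoted ranges share their end points, assigning a boundary value such as 0.4 to the higher band is an assumption of this sketch.

```python
def agreement_level(kappa: float) -> str:
    """Map a Cohen's k value to the agreement labels used in this study."""
    if kappa < 0.2:
        return "poor"
    if kappa < 0.4:
        return "fair"
    if kappa < 0.6:
        return "moderate"
    if kappa <= 0.8:
        return "substantial"
    return "excellent"  # k > 0.8


# Example with one of the k values reported in the Results.
level_acr_overall = agreement_level(0.624)
```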
In contrast, the inter-observer concordance was not evaluated regarding the nodules’ dimensions because the set of images had been obtained and archived previously and therefore did not have an inter-observer variability parameter of interest.
4. Results
The sample of patients from which we selected the 334 nodules included in our study was composed mainly of female subjects: 234 women and 43 men, for a total of 277 patients. The mean age was 49.2 years (SD = 16.4), with median values ranging from 45.3 years (SD = 10.3) in the female cohort to 59.5 years (SD = 13.8) in the male cohort.
Cytological data on the 334 thyroid nodules sampled showed that 258 nodules were benign (all in Category TIR2; no TIR1 nodules were observed), 33 nodules were suspicious for malignancy (all in Category TIR4; no TIR5), while the remaining 43 nodules were cytologically indeterminate (TIR3, all of which were TIR3a, considered to be at a low risk of malignancy), as shown in
Table 3. The average diameter of the nodules examined was about 26.9 mm (SD = 1.01).
4.1. Agreement Among Human Observers with Different Levels of Experience
The inter-observer concordance of the three human observers in assigning a TIRADS score to each nodule showed different Cohen’s k values depending on the TIRADS system considered (K, ACR and EU). Among all three human readers, inter-observer agreement was substantial for ACR (k = 0.624) and moderate for both EU (k = 0.542) and K (k = 0.496), as shown in
Table 4.
As regards the score assigned to each of the five sonographic characteristics, inter-observer agreement among observers with different levels of experience (high, medium and low) showed extremely variable Cohen’s k values depending on the sonographic parameter considered (
Table 5).
For the parameter composition, agreement was excellent for ACR and EU (k = 0.826 and k = 0.809, respectively) and substantial for K (k = 0.785).
For shape, agreement was substantial for ACR (k = 0.793), EU (k = 0.716) and K (k = 0.687).
Regarding echogenicity, concordance was moderate for ACR (k = 0.498) and EU (k = 0.441) and fair for K (k = 0.389).
For calcifications or targeted echogenic foci, agreement was from moderate to fair with values of k = 0.416 for ACR, k = 0.318 for EU and k = 0.351 for K.
For the last parameter considered, margins, poor agreement was registered for the three TIRADS systems included, with values of k = 0.134 for ACR, k = 0.119 for EU and k = 0.106 for K.
4.2. Agreement Between Human Observers with Different Levels of Experience and S-Detect
Regarding the observer/S-Detect concordance, we also recorded different Cohen’s k values, which varied depending on the level of experience of the observer compared with S-Detect (high, medium or low) and on the TIRADS system applied (K, ACR or EU).
When the ACR-TIRADS system was considered, the observer/S-Detect agreement ranged from substantial to moderate, with values of k = 0.762 for the high-level observer, k = 0.654 for the medium-level observer and k = 0.596 for the low-level observer.
A similar degree of agreement was found for K-TIRADS (from substantial to moderate) with k values of k = 0.679 for the high-level observer, k = 0.603 for the medium-level observer and k = 0.536 for the low-level observer.
For EU-TIRADS, instead, observer/S-Detect agreement ranged from moderate to fair; in particular, the Cohen’s k values recorded were k = 0.417, k = 0.334 and k = 0.295 for the observers with high, medium and low levels of experience, respectively. All the Cohen’s k values above are summarized in
Table 4.
4.3. Radio/Cytological Agreement
Radio/cytological concordance describes the correspondence between the assigned TIRADS ultrasound category (benign or malignant) and the TIR cytological class (benign, malignant or indeterminate for malignancy). It was evaluated with several statistical indices and showed highly variable results, depending on factors such as the type of observer (human or S-Detect), the level of experience of the human observer (high, medium or low) and the TIRADS system applied (EU, K or ACR), as summarised in
Table 2.
When we considered the human observer with a high level of experience, radio/cytological concordance recorded sensitivity (SEN) values of 100% for EU, ACR and K; also, the negative predictive value (NPV) was 100% for all three TIRADS systems. Positive predictive value (PPV) was 50% for ACR, 50% for K and 36.7% for EU. With respect to specificity (SPE), the values were 89% for ACR-TIRADS, 85.7% for K-TIRADS and 75.8% for EU-TIRADS, while the AUC data recorded values of 94.5% for ACR, 87.9% for EU and 92.8% for K.
The observer with a medium experience level recorded SEN values of 63.6% for ACR-, K- and EU-TIRADS. NPV was 95.2% for ACR-, 95.1% for K- and 94.6% for EU-TIRADS. PPV was 24.7% for K- and ACR-TIRADS and 23.1% for EU-TIRADS. Specificity and AUC were, respectively, 78.8% and 71.1% for ACR-TIRADS; 76.7% and 70.2% for K-TIRADS; and 70.4% and 67% for EU-TIRADS.
For the observer with a low experience level, radio/cytological concordance registered SEN values of 60.6% for ACR, K and EU, and NPV values of 94.7% (ACR), 94.6% (K) and 94.4% (EU). PPV values varied between 9.9% for EU and 22.2% for ACR. SPE was 76.7% for ACR-TIRADS, 75.4% for K-TIRADS and 54.4% for EU-TIRADS, while AUC was 68.7% for ACR, 57.5% for EU and 68% for K.
For S-Detect, SEN was the same for K-TIRADS, ACR-TIRADS and EU-TIRADS, at 66.7%. NPV was 96.2% for both EU and K and 96.3% for ACR-TIRADS, while PPV was 50% (EU and K) and 66.7% (ACR). S-Detect had an SPE of 92.7% for EU and K and 96.3% for ACR. Finally, the AUC of S-Detect was 79.7% for both EU and K, and 81.5% for ACR.
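When a test is dichotomized at a single cut-off, as here (TIRADS 2–3 versus 4–5), the empirical ROC curve has a single operating point, and the trapezoidal area under it reduces to the mean of sensitivity and specificity. The minimal sketch below applies this identity; the function name is hypothetical, and whether the AUC values in this study were obtained in this way is an assumption, although the figures reported above are numerically consistent with it.

```python
def single_threshold_auc(sensitivity: float, specificity: float) -> float:
    """Trapezoidal AUC of the ROC polyline (0,0) -> (1 - spec, sens) -> (1,1)."""
    fpr = 1.0 - specificity
    left_triangle = 0.5 * fpr * sensitivity
    right_trapezoid = 0.5 * (1.0 - fpr) * (sensitivity + 1.0)
    return left_triangle + right_trapezoid  # algebraically (sens + spec) / 2


# S-Detect with ACR-TIRADS: SEN 66.7%, SPE 96.3%.
auc_sdetect_acr = single_threshold_auc(0.667, 0.963)   # about 0.815
# High-experience observer with ACR-TIRADS: SEN 100%, SPE 89%.
auc_expert_acr = single_threshold_auc(1.0, 0.89)       # about 0.945
```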
5. Discussion
Statistical analyses carried out on data obtained from 334 nodules included from 277 patients with thyroid disease confirmed the expected results presented to date in the literature [
15,
16], showing a better degree of agreement, from substantial to excellent, for sonographic items such as nodule composition (k = 0.826) and shape (k = 0.783), whereas agreement was moderate to poor for the other three sonographic characteristics: echogenicity (k = 0.498), margins (k = 0.134), and presence or absence of calcifications or targeted echogenic foci (k = 0.396). Nodule margins were therefore the main source of inter-observer discordance, especially in nodules with a negative cytological result (TIR3).
Regarding the concordance between human observers and S-Detect when the three different TIRADS systems were applied, our study has substantially confirmed the data in the literature [
17], showing that observer/S-Detect agreement was best for ACR, slightly higher than for K, with Cohen’s k values of k = 0.762 and k = 0.679, respectively, for the high-level observer, while it was lowest for EU-TIRADS (k = 0.417). In our data, we also observed homogeneously decreasing concordance values for all three TIRADS systems when the comparison was made between S-Detect and human observers with decreasing levels of experience (from high to low).
We also undertook a more in-depth analysis by calculating the degree of observer/S-Detect concordance for each of the five ultrasound parameters used in the three TIRADS systems in order to exclude discrepancies that are more dependent on the ultrasound characteristic considered than on the TIRADS system used. Our results showed excellent to substantial agreement for nodule composition and shape (k = 0.8–0.5), moderate to fair for the margins (k = 0.5–0.4) and fair to poor regarding the echogenicity and calcifications (k = 0.1–0.3).
Comparing the TIR cytological class obtained from FNA for each nodule with the TIRADS scores assigned by each type of observer and TIRADS system, the data on radio/cytological concordance indicate that, when TIRADS systems were applied by human observers, the best match was with ACR-TIRADS, whose diagnostic accuracy was slightly higher than that of the K and EU systems.
Radio/cytological concordance for each of the three TIRADS systems was shown to vary in a way that was closely dependent on the different levels of experience of the three radiologists: the observer with more than 15 years’ experience (high level) showed better accuracy than the less experienced observers with between 5 and 15 years (medium level) or less than 5 years (low level) of experience, who also had lower and almost overlapping accuracy values.
In particular, the observer with a high level of experience recorded values of sensitivity (SEN) of 100% for all three types of TIRADS and, therefore, there were no “false negative” subjects. Consistently, the negative predictive value (NPV) was also 100%, so all patients assumed to be healthy were found to be healthy. The positive predictive value (PPV) was 50% for ACR and K (36.7% for EU), reflecting the low prevalence of malignancy; a patient with a positive ultrasound examination therefore had at most a 50% chance of being ill. The specificity (SPE) values were 89% for ACR-TIRADS, 85.7% for K-TIRADS and 75.8% for EU-TIRADS; this implies a certain proportion of “false positives”, especially with the use of EU-TIRADS (lower SPE). This is associated with a not perfect but still very high diagnostic accuracy, with values of 94.52% for ACR, 87.9% for EU and 92.86% for K.
Observers with medium and low experience levels reported lower SEN values of, respectively, 64.9% and 60.1%, with the values almost comparable for the use of the three TIRADS systems, indicating the presence of a fair proportion of “false negative” patients. Even the NPV, although slightly lower, remains optimal, with values of 91.3% and 88.9%, respectively, indicating that most patients considered healthy were indeed so. There was no disagreement compared with the observer with a high level of experience regarding PPV (that is, the probability that the patients included would be ill), for which the value remained at 50%. The SPE confirmed the better diagnostic certainty of ACR-TIRADS (SPE = 78.8% for the average-level observer and SPE = 76.7% for the low-level observer) compared with K (SPE = 76.7% for the average-level observer and SPE = 75.4% for the low-level observer) and EU (SPE = 70.4% for the average-level observer and SPE = 54.4% for the low-level observer). Diagnostic accuracy was low compared with the observer with a high level of experience or with S-Detect and had values of 84.69% for ACR, 83.64% for EU and 80.95% for K.
Radio/histological concordance among the three TIRADS systems (K, ACR and EU) when applied by S-Detect software did not show substantial differences in the degree of concordance, contrary to the human observers.
Diagnostic accuracy assessed for the results obtained with S-Detect showed, as already mentioned, an almost perfect concordance among the K-TIRADS, ACR-TIRADS and EU-TIRADS systems.
SEN for S-Detect was 66.7% for all three TIRADS systems; this results in a moderate number of “false negative” subjects. NPV was 96.2% for both EU and K and 96.3% for ACR-TIRADS, so the probability that a patient with a negative result is healthy is high. PPV was 50% (EU and K) and 66.7% (ACR). SPE was higher than SEN: 92.7% (EU and K) and 96.3% (ACR). This is associated with fewer “false positives” and with a greater diagnostic certainty than human observers’ measurements, regardless of the different levels of observer experience. Diagnostic accuracy was 79.68% for EU, 81.51% for ACR and 79.68% for K: slightly lower than that of the human observer with a high level of experience but better than that of the observers with medium and low levels of experience.
It emerges, therefore, that S-Detect software can help human observers with medium/low levels of experience in improving specificity for the TIRADS grade awarded; this is also supported in a recent study by Lee S.E. et al., which demonstrated that S-Detect improves the diagnostic accuracy of the youngest or most inexperienced radiologists by using a model based on the reader’s self-learning process [
18].
In conclusion, our study confirmed once more the better diagnostic accuracy of the ACR-TIRADS system compared with K- and EU-TIRADS when applied by each of the three human observers with different experience levels, while no significant discrepancy was observed when the systems were applied by S-Detect software. ACR-TIRADS also showed the best observer/S-Detect agreement (k = 0.7624).
The degree of radio/cytological concordance calculated for each of the three human observers with different experience levels and for the S-Detect software was evaluated at different levels. Regarding diagnostic accuracy, the human observer was better than S-Detect when she/he had a high level of experience, while she/he was worse when we considered the datasets obtained from the other two observers with medium and low levels of experience. We can therefore presume that S-Detect is an innovative software tool, easy to use in the evaluation of nodular thyroid disease, providing excellent reliability and a good degree of concordance in comparison with the human observer, and representing a valid aid for the radiologist, especially if he/she is young and inexperienced; in addition, S-Detect has greater diagnostic specificity compared with the human observer.