Intra- and Inter-Observer Reliability of ChatGPT-4o in Thyroid Nodule Ultrasound Feature Analysis Based on ACR TI-RADS: An Image-Based Study
Abstract
1. Introduction
2. Materials and Methods
2.1. Ethical Statement and Informed Consent
2.2. Image Dataset
2.3. Large Language Model Analysis
“Please assume the role of an experienced ultrasound physician specializing in the diagnosis of thyroid nodules. I will present you with two ultrasound images of a thyroid nodule: the first image is a transverse view, and the second image is a longitudinal view. To ensure your focus is solely on analyzing the nodule’s characteristics, I have removed any non-essential information from the images that might interfere with your judgment, retaining only the nodule and its surrounding thyroid tissue. According to the ACR TI-RADS guidelines, please carefully evaluate and classify the ultrasound features of the nodule, considering the following aspects:
Composition: cystic or almost completely cystic, spongiform, mixed cystic and solid, solid or almost completely solid. Echogenicity: anechoic, hyperechoic or isoechoic, hypoechoic, very hypoechoic. Shape: taller-than-wide, wider-than-tall. Margin: smooth, ill-defined, irregular or lobulated, extrathyroidal extension. Echogenic foci: none, large comet-tail artifacts, macrocalcifications, peripheral or rim calcifications, punctate echogenic foci.”
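The prompt above fixes a closed vocabulary for each of the five ACR TI-RADS feature groups. The sketch below is one possible way to encode that vocabulary and check a transcribed model answer against it. It is an illustration only: the text does not describe how responses were recorded or parsed, and the names `LEXICON` and `validate_assessment` are hypothetical.

```python
from typing import Dict, List

# Category labels copied from the prompt text; the dictionary layout is an assumption.
LEXICON: Dict[str, List[str]] = {
    "Composition": [
        "cystic or almost completely cystic",
        "spongiform",
        "mixed cystic and solid",
        "solid or almost completely solid",
    ],
    "Echogenicity": [
        "anechoic",
        "hyperechoic or isoechoic",
        "hypoechoic",
        "very hypoechoic",
    ],
    "Shape": ["taller-than-wide", "wider-than-tall"],
    "Margin": [
        "smooth",
        "ill-defined",
        "irregular or lobulated",
        "extrathyroidal extension",
    ],
    "Echogenic foci": [
        "none",
        "large comet-tail artifacts",
        "macrocalcifications",
        "peripheral or rim calcifications",
        "punctate echogenic foci",
    ],
}


def validate_assessment(assessment: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means every feature has one allowed label."""
    problems: List[str] = []
    for feature, allowed in LEXICON.items():
        value = assessment.get(feature)
        if value is None:
            problems.append(f"missing feature: {feature}")
        elif value.strip().lower() not in allowed:
            problems.append(f"unrecognized {feature} value: {value!r}")
    # Flag keys that are not part of the five feature groups.
    for feature in assessment:
        if feature not in LEXICON:
            problems.append(f"unexpected feature: {feature}")
    return problems


# Example: a hypothetical transcribed answer for one nodule.
example = {
    "Composition": "solid or almost completely solid",
    "Echogenicity": "hypoechoic",
    "Shape": "wider-than-tall",
    "Margin": "smooth",
    "Echogenic foci": "none",
}
print(validate_assessment(example))
```

A check of this kind would catch answers that drift outside the lexicon (for example, free-text descriptions) before they are tabulated.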
2.4. Benchmark Evaluation
2.5. Statistical Analysis
3. Results
3.1. Baseline Characteristics of the Image Dataset
3.2. Intra- and Inter-Observer Agreement in Ultrasound Feature Assessment
3.3. Concordance Rates Between ChatGPT-4o and Expert Evaluations
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Characteristic | Statistics |
---|---|
Patients | 98 |
Sex (Male/Female) | 19/79 |
Age (years) | 54.26 ± 12.19 |
Nodules (Benign/Malignant) | 100 (70/30) |
Nodule size (cm) | 2.57 ± 1.23 |

Category | ChatGPT-4o 1st Round | ChatGPT-4o 2nd Round | Kappa | Ultrasound Expert | Kappa # | Kappa * |
---|---|---|---|---|---|---|
Composition | | | 0.449 | | 0.092 | 0.075 |
cystic or almost completely cystic | 2 | 3 | | 0 | | |
spongiform | 0 | 0 | | 8 | | |
mixed cystic and solid | 13 | 11 | | 37 | | |
solid or almost completely solid | 85 | 86 | | 55 | | |
Echogenicity | | | 0.795 | | −0.006 | −0.001 |
anechoic | 2 | 3 | | 0 | | |
hyperechoic or isoechoic | 0 | 0 | | 69 | | |
hypoechoic | 98 | 97 | | 30 | | |
very hypoechoic | 0 | 0 | | 1 | | |
Shape | | | −0.051 | | 0.026 | 0.082 |
wider-than-tall | 62 | 68 | | 83 | | |
taller-than-wide | 38 | 32 | | 17 | | |
Margin | | | 0.154 | | 0.096 | 0.092 |
smooth | 52 | 43 | | 26 | | |
ill-defined | 23 | 22 | | 61 | | |
lobulated or irregular | 25 | 35 | | 11 | | |
extra-thyroidal extension | 0 | 0 | | 2 | | |
Echogenic foci | | | 0.404 | | 0.142 | 0.238 |
none | 54 | 52 | | 58 | | |
large comet-tail artifacts | 0 | 0 | | 0 | | |
macrocalcifications | 1 | 0 | | 13 | | |
peripheral (rim) calcifications | 0 | 0 | | 1 | | |
punctate echogenic foci | 45 | 48 | | 28 | | |

Category | 1st Round Concordance Rate | 2nd Round Concordance Rate |
---|---|---|
Overall | 46.6% | 48.2% |
Composition | 56.0% | 55.0% |
Echogenicity | 29.0% | 29.0% |
Shape | 59.0% | 65.0% |
Margin | 37.0% | 35.0% |
Echogenic foci | 52.0% | 57.0% |
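
The kappa values and concordance rates in the two tables above are linked by the usual formula kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's category frequencies. The sketch below recomputes the ChatGPT-4o versus expert kappas from the tabulated counts and concordance rates. It assumes unweighted Cohen's kappa and assumes that "Kappa #" and "Kappa *" refer to the first and second rounds, respectively; the tables as reproduced here do not state either point. Under those assumptions the recomputed values agree with the tabulated ones to three decimals. The function name `kappa_from_marginals` is illustrative.

```python
from typing import Dict, List, Sequence, Tuple


def kappa_from_marginals(
    rater_a: Sequence[int], rater_b: Sequence[int], observed_agreement: float
) -> float:
    """Unweighted Cohen's kappa from per-category counts and the observed agreement rate."""
    n_a, n_b = sum(rater_a), sum(rater_b)
    # Chance agreement: sum over categories of the product of the two marginal proportions.
    p_e = sum((a / n_a) * (b / n_b) for a, b in zip(rater_a, rater_b))
    return (observed_agreement - p_e) / (1.0 - p_e)


# Counts copied from the feature table, in the category order listed there.
EXPERT: Dict[str, List[int]] = {
    "Composition": [0, 8, 37, 55],
    "Echogenicity": [0, 69, 30, 1],
    "Shape": [83, 17],
    "Margin": [26, 61, 11, 2],
    "Echogenic foci": [58, 0, 13, 1, 28],
}
ROUND_1: Dict[str, List[int]] = {
    "Composition": [2, 0, 13, 85],
    "Echogenicity": [2, 0, 98, 0],
    "Shape": [62, 38],
    "Margin": [52, 23, 25, 0],
    "Echogenic foci": [54, 0, 1, 0, 45],
}
ROUND_2: Dict[str, List[int]] = {
    "Composition": [3, 0, 11, 86],
    "Echogenicity": [3, 0, 97, 0],
    "Shape": [68, 32],
    "Margin": [43, 22, 35, 0],
    "Echogenic foci": [52, 0, 0, 0, 48],
}
# Concordance rates (1st round, 2nd round) copied from the concordance table.
CONCORDANCE: Dict[str, Tuple[float, float]] = {
    "Composition": (0.56, 0.55),
    "Echogenicity": (0.29, 0.29),
    "Shape": (0.59, 0.65),
    "Margin": (0.37, 0.35),
    "Echogenic foci": (0.52, 0.57),
}

results: Dict[str, Tuple[float, float]] = {}
for feature, (p_o_1, p_o_2) in CONCORDANCE.items():
    k1 = kappa_from_marginals(ROUND_1[feature], EXPERT[feature], p_o_1)
    k2 = kappa_from_marginals(ROUND_2[feature], EXPERT[feature], p_o_2)
    results[feature] = (round(k1, 3), round(k2, 3))
    print(f"{feature}: round 1 kappa = {k1:.3f}, round 2 kappa = {k2:.3f}")

# The "Overall" row equals the unweighted mean of the five feature-level rates.
overall = tuple(
    round(sum(rates[i] for rates in CONCORDANCE.values()) / len(CONCORDANCE), 3)
    for i in (0, 1)
)
print(f"Overall concordance: {overall[0]:.1%} and {overall[1]:.1%}")
```

Echogenicity illustrates why a concordance rate alone can mislead: ChatGPT-4o labelled nearly every nodule hypoechoic while the expert labelled most hyperechoic or isoechoic, so the 29% observed agreement is almost exactly what chance would produce and kappa is close to zero.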
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Chen, Z.; Chambara, N.; Liu, S.Y.W.; Chow, T.C.M.; Lai, C.M.S.; Ying, M.T.C. Intra- and Inter-Observer Reliability of ChatGPT-4o in Thyroid Nodule Ultrasound Feature Analysis Based on ACR TI-RADS: An Image-Based Study. Diagnostics 2025, 15, 2617. https://doi.org/10.3390/diagnostics15202617