Limitations in Chest X-Ray Interpretation by Vision-Capable Large Language Models, Gemini 1.0, Gemini 1.5 Pro, GPT-4 Turbo, and GPT-4o
Abstract
1. Introduction
2. Materials and Methods
2.1. Image Dataset and LLMs Selection
2.1.1. Chest X-Rays Selection and Randomization
2.1.2. Selection of Vision-Capable LLMs
2.1.3. Test Setting
2.2. How to Describe the Lesions in CXR
2.2.1. Criteria for a Complete Lesion Description
2.2.2. Definitions of Fully Correct, Partially Correct, and Incorrect Answers
2.2.3. Primary Diagnosis and Key Imaging Features Scoring
2.3. Scoring the LLMs
2.4. Statistical Analysis and Software
2.4.1. Statistical Analysis Methods
2.4.2. Software
3. Results
3.1. Results for Primary Diagnosis
3.1.1. Scores of Four LLMs in Primary Diagnosis
3.1.2. Analysis of Partial Correctness in Primary Diagnosis
3.1.3. Scores Across Thirteen Categories
3.1.4. Analysis for Pleural Effusion
3.2. Results for Key Imaging Features
3.2.1. Scores for Key Imaging Features
3.2.2. Analysis of Partial Correctness in Central Venous Catheter Diagnosis
3.3. Statistical Analysis
3.3.1. Statistical Analysis for Five Major Groups
3.3.2. Model Performance Comparison
4. Discussion
Limitations
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| CXRs | Chest X-rays |
| LLMs | Large language models with vision capabilities |
| GPT | Generative pretrained transformer |
| NIHCXRs | National Institutes of Health Chest X-ray Dataset |
| PICC | Peripherally inserted central catheter |
| API | Application programming interface |
| PA/AP | Posterior–anterior or anterior–posterior view |
| CVC | Central venous catheters |







| Category | n | Gemini 1.0 (fully correct) | Gemini 1.0 (partially correct) | Gemini 1.5 Pro (fully correct) | Gemini 1.5 Pro (partially correct) | GPT-4 Turbo (fully correct) | GPT-4 Turbo (partially correct) | GPT-4o (fully correct) | GPT-4o (partially correct) |
|---|---|---|---|---|---|---|---|---|---|
| Acute pulmonary edema | 20 | 7 (35.0%) | 7 (35.0%) | 6 (30.0%) | 11 (55.0%) | 0 | 3 (15.0%) | 6 (30.0%) | 3 (15.0%) |
| Cardiomegaly | 20 | 1 (5.0%) | 4 (20.0%) | 3 (15.0%) | 4 (20.0%) | 1 (5.0%) | 1 (5.0%) | 0 | 0 |
| Lobar pneumonia | 15 | 1 (6.7%) | 9 (60.0%) | 3 (20.0%) | 10 (66.7%) | 0 | 2 (13.3%) | 1 (6.7%) | 7 (46.7%) |
| Pacemaker | 15 | 10 (66.7%) | 4 (26.7%) | 8 (53.3%) | 7 (46.7%) | 13 (86.7%) | 2 (13.3%) | 6 (40.0%) | 9 (60.0%) |
| Port-A-Cath | 5 | 0 | 4 (80.0%) | 0 | 5 (100.0%) | 0 | 5 (100.0%) | 2 (40.0%) | 3 (60.0%) |
| PICC | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Malignancy (single lesion) | 22 | 0 | 4 (18.2%) | 0 | 2 (9.1%) | 0 | 1 (4.5%) | 1 (4.5%) | 1 (4.5%) |
| Malignancy (multiple lesions) | 30 | 4 (13.3%) | 18 (60.0%) | 0 | 20 (66.7%) | 1 (3.3%) | 8 (26.7%) | 5 (16.7%) | 10 (33.3%) |
| Malignancy (central) | 15 | 0 | 0 | 0 | 2 (13.3%) | 0 | 0 | 0 | 0 |
| Hiatal hernia | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Normal findings | 30 | 28 (93.3%) | 0 | 29 (96.7%) | 0 | 24 (80.0%) | 0 | 30 (100.0%) | 0 |
| Diaphragm elevation | 5 | 0 | 1 (20.0%) | 0 | 1 (20.0%) | 0 | 0 | 1 (20.0%) | 0 |
| Pleural effusion (all) | 50 | 3 (6.0%) | 12 (24.0%) | 6 (12.0%) | 9 (18.0%) | 0 | 3 (6.0%) | 4 (8.0%) | 11 (22.0%) |
| Pleural effusion (minimal) | 15 | 1 (6.7%) | 1 (6.7%) | 1 (6.7%) | 0 | 0 | 0 | 0 | 0 |
| Pleural effusion (small) | 10 | 0 | 3 (30.0%) | 2 (20.0%) | 2 (20.0%) | 0 | 0 | 0 | 1 (10.0%) |
| Pleural effusion (moderate) | 15 | 1 (6.7%) | 5 (33.3%) | 0 | 6 (40.0%) | 0 | 1 (6.7%) | 1 (6.7%) | 4 (26.7%) |
| Pleural effusion (massive) | 10 | 1 (10.0%) | 3 (30.0%) | 3 (30.0%) | 1 (10.0%) | 0 | 2 (20.0%) | 3 (30.0%) | 6 (60.0%) |
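
The per-category counts in the table above can be pooled into an overall count per model. The following is a minimal Python sketch of that arithmetic, assuming the thirteen top-level rows are disjoint categories and that the four pleural effusion severity rows are a breakdown of the 50 pleural effusion images (so they are left out to avoid double counting). The names `MODELS`, `ROWS`, `pct`, and `pooled_counts` are illustrative and do not come from the article, and the pooled sums are plain additions of the table cells, not necessarily the summary figures the authors report.

```python
# Models in the same left-to-right order as the table columns.
MODELS = ("Gemini 1.0", "Gemini 1.5 Pro", "GPT-4 Turbo", "GPT-4o")

# category -> (n, [(fully, partial) for each model in MODELS order]).
# Counts are copied from the table; severity sub-rows are omitted.
ROWS = {
    "Acute pulmonary edema": (20, [(7, 7), (6, 11), (0, 3), (6, 3)]),
    "Cardiomegaly": (20, [(1, 4), (3, 4), (1, 1), (0, 0)]),
    "Lobar pneumonia": (15, [(1, 9), (3, 10), (0, 2), (1, 7)]),
    "Pacemaker": (15, [(10, 4), (8, 7), (13, 2), (6, 9)]),
    "Port-A-Cath": (5, [(0, 4), (0, 5), (0, 5), (2, 3)]),
    "PICC": (10, [(0, 0), (0, 0), (0, 0), (0, 0)]),
    "Malignancy (single lesion)": (22, [(0, 4), (0, 2), (0, 1), (1, 1)]),
    "Malignancy (multiple lesions)": (30, [(4, 18), (0, 20), (1, 8), (5, 10)]),
    "Malignancy (central)": (15, [(0, 0), (0, 2), (0, 0), (0, 0)]),
    "Hiatal hernia": (10, [(0, 0), (0, 0), (0, 0), (0, 0)]),
    "Normal findings": (30, [(28, 0), (29, 0), (24, 0), (30, 0)]),
    "Diaphragm elevation": (5, [(0, 1), (0, 1), (0, 0), (1, 0)]),
    "Pleural effusion": (50, [(3, 12), (6, 9), (0, 3), (4, 11)]),
}


def pct(count: int, n: int) -> float:
    """Percentage of n, rounded to one decimal place as in the table."""
    return round(100 * count / n, 1)


def pooled_counts(rows: dict) -> dict:
    """Sum fully and partially correct counts per model across categories."""
    total_n = sum(n for n, _ in rows.values())
    pooled = {}
    for i, model in enumerate(MODELS):
        fully = sum(scores[i][0] for _, scores in rows.values())
        partial = sum(scores[i][1] for _, scores in rows.values())
        pooled[model] = {"n": total_n, "fully": fully, "partial": partial}
    return pooled


if __name__ == "__main__":
    for model, c in pooled_counts(ROWS).items():
        print(
            f"{model}: fully {c['fully']}/{c['n']} ({pct(c['fully'], c['n'])}%), "
            f"partial {c['partial']}/{c['n']} ({pct(c['partial'], c['n'])}%)"
        )
```

Pooling this way weights each category by its image count, so the 30 normal images and 50 pleural effusion images dominate the totals; per-category rates remain the more informative comparison.
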
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Chen, C.-H.; Chen, C.-W.; Hsieh, K.-Y.; Huang, K.-E.; Lai, H.-Y. Limitations in Chest X-Ray Interpretation by Vision-Capable Large Language Models, Gemini 1.0, Gemini 1.5 Pro, GPT-4 Turbo, and GPT-4o. Diagnostics 2026, 16, 376. https://doi.org/10.3390/diagnostics16030376

