Analyzing Diagnostic Reasoning of Vision–Language Models via Zero-Shot Chain-of-Thought Prompting in Medical Visual Question Answering
Abstract
1. Introduction
- Assessed the diagnostic capabilities of state-of-the-art vision–language models (VLMs) in a zero-shot setting with general prompts, evaluating their ability to answer complex medical questions based solely on pre-trained knowledge, without explicit reasoning.
- Introduced a zero-shot chain-of-thought (CoT) prompting framework tailored for MedVQA, which encourages models to engage in structured, multi-step reasoning by generating intermediate explanations before arriving at a final diagnosis. This approach promotes greater transparency and interpretability in model inference.
- Conducted a comprehensive evaluation of three leading VLMs (Gemini 2.5 Pro, Claude 3.5 Sonnet, and GPT-4o Mini) under both standard zero-shot and our proposed zero-shot CoT prompting strategies, providing insights into how each model leverages visual and textual information for clinical reasoning.
- Performed an in-depth comparative analysis between conventional zero-shot prompting and our reasoning-enhanced zero-shot CoT method, demonstrating how structured reasoning significantly improves not only diagnostic accuracy but also the explainability and trustworthiness of model outputs in clinical contexts.
2. Related Works
2.1. Vision and Multimodal Approaches in Medical VQA
2.2. Prompting and Retrieval-Based Strategies in Medical Question Answering
3. Background Study
3.1. Overview of the Vision–Language Models
3.1.1. Gemini 2.5 Pro
3.1.2. Claude 3.5 Sonnet
3.1.3. GPT-4o Mini
3.2. Zero-Shot Chain-of-Thought Framework
3.2.1. Zero-Shot Learning
- A medical image I, such as an X-ray or MRI scan;
- A diagnostic question q, like “What part of the body was the mass located in?”;
- A set of possible answers A = {a_1, a_2, ..., a_n}.
- M is the vision–language model;
- I is the input medical image;
- q is the natural language question;
- a is one of the candidate answers from the set A;
- s_M(a | I, q) denotes the model’s confidence or compatibility score for answer a given image I and question q (a minimal selection sketch follows this list).
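To make the selection rule concrete, the following is a minimal Python sketch of the zero-shot decision a* = argmax_a s_M(a | I, q). The scoring callable is an assumption introduced only for illustration; in practice each VLM is queried through its provider’s API and asked to choose an option directly, so no explicit score is exposed.

```python
from typing import Callable, Sequence

def zero_shot_answer(
    score: Callable[[str, str, str], float],  # hypothetical s_M(a | I, q); higher = more compatible
    image_path: str,
    question: str,
    candidates: Sequence[str],
) -> str:
    """Pick the candidate answer with the highest model score for (image, question)."""
    return max(candidates, key=lambda a: score(image_path, question, a))
```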
3.2.2. Chain-of-Thought Reasoning
- M is the vision–language model;
- I is the medical image (e.g., an X-ray or MRI);
- q is the natural language question (e.g., “Where is the cavitary lesion located?”);
- r = (r_1, r_2, ..., r_k) is the chain of intermediate reasoning steps;
- â is the final predicted answer (a prompting sketch follows this list).
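As an illustration of how the reasoning chain r and the final answer â can be elicited and recovered from a single response, the sketch below builds a generic CoT prompt and parses numbered steps plus a final option letter. The prompt wording and the "Final Answer:" marker are assumptions for this sketch, not the exact prompts used for the three models (those are described in Section 4.4).

```python
import re

COT_PROMPT = (
    "You are assisting with a medical visual question.\n"
    "Question: {question}\nOptions:\n{options}\n"
    "Think step by step: list your reasoning as numbered steps, "
    "then end with 'Final Answer: <option letter>'."
)

def parse_cot_response(text: str):
    """Split a CoT response into (reasoning steps r_1..r_k, final answer letter)."""
    steps = re.findall(r"^\s*\d+\.\s*(.+)$", text, flags=re.MULTILINE)
    match = re.search(r"Final Answer:\s*([A-D])", text, flags=re.IGNORECASE)
    return steps, (match.group(1).upper() if match else None)
```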
3.2.3. Why Apply Chain-of-Thought?
- 1. Enhanced Diagnostic Precision: The stepwise examination of visual features reduces the probability of overlooking critical diagnostic indicators, with each reasoning step contributing incrementally to diagnostic certainty.
- 2. Alignment with Clinical Methodology: The reasoning structure mirrors established clinical diagnostic protocols.
- 3. Interpretability: The explicit elucidation of the reasoning process enhances clinician–AI collaboration through improved transparency.
3.2.4. Zero-Shot Chain-of-Thought Prompting
3.3. Evaluation Metrics Summary
3.3.1. Accuracy
- N is the total number of visual–question pairs;
- â_i is the predicted answer option for the i-th query;
- a_i is the ground-truth answer option for the i-th query;
- 1[·] is the indicator function, which returns 1 if its argument is true and 0 otherwise (a short implementation sketch follows this list).
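A direct implementation of this definition is sketched below; option labels are assumed to be single letters for illustration.

```python
def accuracy(predicted, ground_truth):
    """Accuracy = (1/N) * sum_i 1[a_hat_i == a_i] over all visual-question pairs."""
    if len(predicted) != len(ground_truth):
        raise ValueError("prediction and ground-truth lists must have equal length")
    return sum(p == g for p, g in zip(predicted, ground_truth)) / len(ground_truth)

# Example: 3 of 4 answers correct -> 0.75
print(accuracy(["A", "C", "B", "D"], ["A", "C", "B", "A"]))
```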
3.3.2. Precision
- TP_{c,o} is the number of times option o was correctly predicted for category c;
- FP_{c,o} is the number of times option o was incorrectly predicted for category c (i.e., it was selected but was not the correct answer).
3.3.3. Recall
- FN_{c,o} is the number of false negatives for category c (i.e., instances where option o was the correct answer but the model failed to predict it).
3.3.4. F1-Score
- N_c is the number of question–answer pairs in category c;
- N is the total number of question–answer pairs across all categories (a weighted-F1 computation sketch follows this list).
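The sketch below computes per-option precision, recall, and F1 within each category and then weights each category’s F1 by its share N_c / N. How per-option scores are aggregated within a category (a plain macro average here) is an assumption; the paper’s exact aggregation may differ.

```python
from collections import defaultdict

def weighted_f1(records):
    """records: iterable of (category c, true option, predicted option) triples."""
    by_cat = defaultdict(list)
    for cat, true, pred in records:
        by_cat[cat].append((true, pred))
    total = sum(len(pairs) for pairs in by_cat.values())
    weighted = 0.0
    for pairs in by_cat.values():
        options = {t for t, _ in pairs} | {p for _, p in pairs}
        f1s = []
        for o in options:
            tp = sum(1 for t, p in pairs if p == o and t == o)   # TP_{c,o}
            fp = sum(1 for t, p in pairs if p == o and t != o)   # FP_{c,o}
            fn = sum(1 for t, p in pairs if p != o and t == o)   # FN_{c,o}
            prec = tp / (tp + fp) if (tp + fp) else 0.0
            rec = tp / (tp + fn) if (tp + fn) else 0.0
            f1s.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
        weighted += (len(pairs) / total) * (sum(f1s) / len(f1s) if f1s else 0.0)
    return weighted
```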
3.4. Dataset Description
3.4.1. PMC-VQA Dataset
3.4.2. Dataset Statistics and Structure
4. Proposed Methodology
4.1. Experimental Setup
4.2. Prompt Engineering for Visual Question Answering
4.2.1. Task Definition
4.2.2. Zero-Shot Instruction
4.2.3. Chain-of-Thought Instruction
4.2.4. Model Response Structure and Reasoning Chain
- P(a_i | I, q, r) denotes the probability of selecting answer a_i given the image I, the question q, and the generated reasoning chain r.
- The model chooses the answer with the highest probability as its final prediction.
- Temperature controls the randomness of the output distribution. A lower temperature results in more deterministic outputs, while a higher temperature encourages more diverse responses.
- Top-k limits the sampling pool to the top k most probable tokens, reducing noise from low-probability options.
- Top-p restricts the selection to the smallest set of tokens whose cumulative probability exceeds a threshold p, balancing diversity against confidence (an illustrative configuration follows this list).
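An illustrative decoding configuration showing how these parameters might be set is given below. The values are assumptions for the sketch, not the study’s settings, and parameter names vary across providers (for example, the OpenAI API does not expose a top-k parameter).

```python
# Illustrative only: values are assumptions, not the configuration used in the study.
generation_config = {
    "temperature": 0.2,        # lower -> more deterministic answers
    "top_k": 40,               # restrict sampling to the 40 most probable tokens
    "top_p": 0.9,              # nucleus sampling threshold on cumulative probability
    "max_output_tokens": 512,  # cap on the generated response length
}
```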
4.2.5. Final Answer Format
4.3. Assessment Protocol for Medical VQA Tasks
4.3.1. Evaluation Procedure for Zero-Shot Medical VQA
Algorithm 1: Evaluation procedure for zero-shot Medical VQA.
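The algorithm itself is not reproduced here; the following is a compact sketch of what such a zero-shot evaluation loop could look like, assuming each dataset item carries an image, a question, lettered options, and a ground-truth letter, and that query_model is a hypothetical wrapper around the provider API call.

```python
import re

def evaluate_zero_shot(dataset, query_model):
    """dataset: iterable of dicts with 'image', 'question', 'options' (letter -> text),
    and 'answer' (ground-truth letter). query_model(image, prompt) -> response text."""
    predictions, correct = [], 0
    for item in dataset:
        options_text = "\n".join(f"{k}. {v}" for k, v in sorted(item["options"].items()))
        prompt = (
            f"Question: {item['question']}\nOptions:\n{options_text}\n"
            "Respond with the letter of the correct option only."
        )
        response = query_model(item["image"], prompt)
        match = re.search(r"\b([A-D])\b", response)
        pred = match.group(1) if match else None
        predictions.append(pred)
        correct += int(pred == item["answer"])
    return correct / len(predictions), predictions
```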
4.3.2. Evaluation Procedure for Zero-Shot Chain-of-Thought Reasoning
Algorithm 2: Evaluation procedure for zero-shot chain-of-thought Medical VQA.
4.4. Configuration Details and Cost Analysis
Model | Input Token Cost (per 1M) | Output Token Cost (per 1M) | Provider |
---|---|---|---|
Gemini 2.5 Pro | USD 1.25 | USD 10.00 | Google |
Claude 3.5 Sonnet | USD 3.00 | USD 15.00 | Anthropic |
GPT-4o Mini | USD 0.15 | USD 0.60 | OpenAI |
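As a worked example of how these prices translate into experiment cost, the sketch below multiplies per-query token counts by the per-1M-token rates in the table. The query and token counts are illustrative assumptions, not measurements from the study.

```python
# Rough cost estimate from the per-1M-token prices in the table above.
PRICES = {  # (input USD per 1M tokens, output USD per 1M tokens)
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "GPT-4o Mini": (0.15, 0.60),
}

def estimate_cost(model, n_queries, in_tokens_per_query, out_tokens_per_query):
    in_price, out_price = PRICES[model]
    return n_queries * (
        in_tokens_per_query * in_price + out_tokens_per_query * out_price
    ) / 1_000_000

# Example (assumed workload): 300 questions, ~1,000 input and ~300 output tokens each.
print(f"{estimate_cost('GPT-4o Mini', 300, 1000, 300):.2f} USD")
```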
4.4.1. Zero-Shot Evaluation Prompt Design for Gemini 2.5 Pro
4.4.2. Zero-Shot Evaluation Prompt Design for Claude 3.5 Sonnet
4.4.3. Zero-Shot Evaluation Prompt Design for GPT-4o Mini
4.4.4. Zero-Shot Chain-of-Thought Prompt Design for Gemini 2.5 Pro
4.4.5. Zero-Shot Chain-of-Thought Prompt Design for Claude 3.5 Sonnet
4.4.6. Zero-Shot Chain-of-Thought Prompt Design for GPT-4o Mini
5. Result Analysis
5.1. Zero-Shot Result Analysis for Gemini 2.5 Pro
5.2. Evaluating Zero-Shot Performance of Claude 3.5 Sonnet
5.3. Performance Evaluation of GPT-4o Mini in Zero-Shot Tasks
5.4. Analyzing Zero-Shot Chain-of-Thought Results for GPT-4o Mini
5.5. Analyzing Zero-Shot Chain-of-Thought Results for Claude 3.5 Sonnet
5.6. Analyzing Zero-Shot Chain-of-Thought Results for Gemini 2.5 Pro
5.7. Comparison Between Medical Experts and Chain-of-Thought Reasoning
- Assess the correctness of each model’s answer relative to their own diagnosis.
- Evaluate the clarity, relevance, and clinical soundness of the provided reasoning steps.
- Reflect on how the coherence and transparency of the reasoning affected their trust in the model’s conclusion.
- Comment on the potential role of such vision–language models as diagnostic support tools in clinical workflows.
5.8. Expert Evaluation and Trust Analysis
5.8.1. Answer Concordance
- Gemini 2.5 Pro: 76% agreement with at least one expert, indicating the highest level of diagnostic alignment.
- GPT-4o Mini: 72% agreement, showing competitive performance in matching clinical judgments.
- Claude 3.5 Sonnet: 67% agreement, slightly lower but still demonstrating a reasonable overlap with expert responses.
5.8.2. Reasoning Transparency
5.8.3. Trust and Confidence
5.8.4. Clinical Integration Potential
5.9. Comparison of Prompting Strategies on PMC-VQA Multiple-Choice Task
6. Limitations
7. Future Work
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
References
Zero-shot prompting results:
Model | Accuracy (%) | Precision (%) | Recall (%) | Weighted F1 (%) |
---|---|---|---|---|
Gemini 2.5 Pro | 54.667 | 55.222 | 54.667 | 54.782 |
Claude 3.5 Sonnet | 52.667 | 53.719 | 52.667 | 52.841 |
GPT-4o Mini | 48.000 | 49.198 | 48.000 | 48.230 |
Zero-shot chain-of-thought prompting results:
Model | Accuracy (%) | Precision (%) | Recall (%) | Weighted F1 (%) |
---|---|---|---|---|
Gemini 2.5 Pro | 72.48 | 75.36 | 75.38 | 74.67 |
Claude 3.5 Sonnet | 69.00 | 71.62 | 71.66 | 70.72 |
GPT-4o Mini | 67.33 | 69.24 | 68.32 | 68.81 |