Search Results (4)

Search Parameters:
Keywords = medical visual question answering (Med-VQA)

35 pages, 7934 KiB  
Article
Analyzing Diagnostic Reasoning of Vision–Language Models via Zero-Shot Chain-of-Thought Prompting in Medical Visual Question Answering
by Fatema Tuj Johora Faria, Laith H. Baniata, Ahyoung Choi and Sangwoo Kang
Mathematics 2025, 13(14), 2322; https://doi.org/10.3390/math13142322 - 21 Jul 2025
Viewed by 791
Abstract
Medical Visual Question Answering (MedVQA) lies at the intersection of computer vision, natural language processing, and clinical decision-making, aiming to generate accurate responses from medical images paired with complex inquiries. Despite recent advances in vision–language models (VLMs), their use in healthcare remains limited by a lack of interpretability and a tendency to produce direct, unexplainable outputs. This opacity undermines their reliability in medical settings, where transparency and justification are critically important. To address this limitation, we propose a zero-shot chain-of-thought prompting framework that guides VLMs to perform multi-step reasoning before arriving at an answer. By encouraging the model to break down the problem, analyze both visual and contextual cues, and construct a stepwise explanation, the approach makes the reasoning process explicit and clinically meaningful. We evaluate the framework on the PMC-VQA benchmark, which includes authentic radiological images and expert-level prompts. In a comparative analysis of three leading VLMs, Gemini 2.5 Pro achieved the highest accuracy (72.48%), followed by Claude 3.5 Sonnet (69.00%) and GPT-4o Mini (67.33%). The results demonstrate that chain-of-thought prompting significantly improves both reasoning transparency and performance in MedVQA tasks.
(This article belongs to the Special Issue Mathematical Foundations in NLP: Applications and Challenges)
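As a rough illustration of the zero-shot chain-of-thought setup described in this abstract, the sketch below wraps a multiple-choice MedVQA question and an image into a single step-by-step prompt. It assumes the OpenAI Python client and the gpt-4o-mini model; the instruction wording and the ask_medvqa helper are illustrative, not the paper's actual prompt.

```python
# A minimal sketch of zero-shot chain-of-thought prompting for a multiple-choice
# MedVQA question. Assumes the OpenAI Python client (v1+) and a local image file;
# the instruction wording and the ask_medvqa helper are illustrative only.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_medvqa(image_path: str, question: str, options: list[str]) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    # Zero-shot CoT: no worked examples, just an instruction to reason step by
    # step over the visual findings before committing to one of the options.
    prompt = (
        "You are assisting with a medical visual question answering task.\n"
        f"Question: {question}\n"
        f"Options: {'; '.join(options)}\n"
        "Let's think step by step: describe the relevant visual findings, "
        "relate them to the question, and only then give the final answer "
        "as 'Answer: <option>'."
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```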

19 pages, 704 KiB  
Article
Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion
by Junkai Zhang, Bin Li and Shoujun Zhou
Appl. Sci. 2025, 15(9), 4712; https://doi.org/10.3390/app15094712 - 24 Apr 2025
Viewed by 1188
Abstract
Medical Visual Question Answering (Med-VQA) aims to answer clinical questions accurately by analyzing a medical image together with its corresponding question. Designing effective Med-VQA systems is important for assisting clinical diagnosis and enhancing diagnostic accuracy. Building on this foundation, hierarchical Med-VQA extends the task by organizing medical questions into a hierarchical structure and making level-specific predictions to handle fine-grained distinctions. Recently, many studies have proposed hierarchical Med-VQA tasks and established datasets. However, several issues remain: (1) imperfect hierarchical modeling leads to poor differentiation between question levels, resulting in semantic fragmentation across hierarchies; and (2) Transformer-based cross-modal self-attention fusion relies heavily on implicit learning, which can obscure crucial local semantic correlations in medical scenarios. To address these issues, this study proposes Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion (HiCA-VQA). The hierarchical modeling comprises two modules: hierarchical prompting for fine-grained medical questions and hierarchical answer decoders. The hierarchical prompting module pre-aligns hierarchical text prompts with image features to guide the model toward specific image regions according to question type, while the hierarchical decoders make separate predictions for questions at different levels to improve accuracy across granularities. The framework also incorporates a cross-attention fusion module in which image features serve as queries and text features as keys and values. This design avoids the irrelevant signals introduced by global interactions while achieving lower computational complexity than global self-attention fusion. Experiments on the Rad-Restruct benchmark demonstrate that HiCA-VQA outperforms existing state-of-the-art methods on hierarchical fine-grained questions, notably achieving an 18% improvement in F1 score. This study provides an effective pathway for hierarchical visual question answering systems, advancing medical image understanding.
(This article belongs to the Special Issue New Trends in Natural Language Processing)
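The cross-attention fusion described in this abstract, with image features as queries and text features as keys and values, can be sketched roughly as follows in PyTorch. The dimensions, residual connection, and module name are illustrative assumptions rather than the HiCA-VQA implementation.

```python
# A rough PyTorch sketch of cross-attention fusion with image tokens as queries
# and text tokens as keys/values. Dimensions, the residual connection, and the
# module name are illustrative assumptions, not the HiCA-VQA implementation.
import torch
import torch.nn as nn


class ImageTextCrossAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, N_img, dim) patch/region features -> queries
        # text_tokens:  (B, N_txt, dim) question/prompt features -> keys and values
        fused, _ = self.cross_attn(query=image_tokens,
                                   key=text_tokens,
                                   value=text_tokens)
        return self.norm(image_tokens + fused)  # residual connection


# Example: 49 image patches attending over a 32-token question.
img = torch.randn(2, 49, 768)
txt = torch.randn(2, 32, 768)
out = ImageTextCrossAttention()(img, txt)  # shape (2, 49, 768)
```

Because each image token attends only over the text tokens, the attention cost grows with N_img × N_txt rather than (N_img + N_txt)² as in self-attention over the concatenated sequence, which is where the lower complexity relative to global self-attention fusion comes from.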

29 pages, 549 KiB  
Review
Generative Models in Medical Visual Question Answering: A Survey
by Wenjie Dong, Shuhao Shen, Yuqiang Han, Tao Tan, Jian Wu and Hongxia Xu
Appl. Sci. 2025, 15(6), 2983; https://doi.org/10.3390/app15062983 - 10 Mar 2025
Cited by 1 | Viewed by 4061
Abstract
Medical Visual Question Answering (MedVQA) is a crucial intersection of artificial intelligence and healthcare. It enables systems to interpret medical images, such as X-rays, MRIs, and pathology slides, and respond to clinical queries. Early approaches primarily relied on discriminative models, which select answers from predefined candidates. However, these methods struggle to address open-ended, domain-specific, or complex queries effectively. Recent advances have shifted the focus toward generative models, leveraging autoregressive decoders, large language models (LLMs), and multimodal large language models (MLLMs) to generate more nuanced, free-form answers. This review comprehensively examines the paradigm shift from discriminative to generative systems: it analyzes the model architectures and training processes of generative MedVQA works, summarizes evaluation benchmarks and metrics, and highlights key advances and techniques that propel the development of generative MedVQA, such as concept alignment, instruction tuning, and parameter-efficient fine-tuning (PEFT), alongside strategies for data augmentation and automated dataset creation. Finally, we propose future directions to enhance clinical reasoning and interpretability, build robust evaluation benchmarks and metrics, and employ scalable training strategies and deployment solutions. By analyzing the strengths and limitations of existing generative MedVQA approaches, we aim to provide valuable insights for researchers and practitioners working in this domain.
(This article belongs to the Special Issue Feature Review Papers in "Computing and Artificial Intelligence")
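Of the techniques this survey highlights, parameter-efficient fine-tuning is the easiest to show concretely. The sketch below adds a LoRA-style low-rank adapter to a frozen linear layer; the rank, scaling, and wrapped layer are illustrative assumptions and not tied to any particular system covered in the review.

```python
# A minimal sketch of parameter-efficient fine-tuning via a LoRA-style low-rank
# adapter on a frozen linear layer. Rank, scaling, and the wrapped layer are
# illustrative assumptions, not tied to any specific system in the survey.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # keep pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


# Only the small A/B adapter matrices are updated during fine-tuning.
layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12288 adapter parameters vs. ~590k in the frozen base layer
```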

12 pages, 1359 KiB  
Article
Image to Label to Answer: An Efficient Framework for Enhanced Clinical Applications in Medical Visual Question Answering
by Jianfeng Wang, Kah Phooi Seng, Yi Shen, Li-Minn Ang and Difeng Huang
Electronics 2024, 13(12), 2273; https://doi.org/10.3390/electronics13122273 - 10 Jun 2024
Cited by 2 | Viewed by 1725
Abstract
Medical Visual Question Answering (Med-VQA) faces significant limitations in application development because data acquisition is sparse and challenging. Existing approaches focus on multi-modal learning to equip models with both medical image inference and natural language understanding, but this exacerbates data scarcity in Med-VQA, hindering clinical application and advancement. This paper proposes the ITLTA framework for Med-VQA, designed around field requirements. ITLTA combines multi-label learning on medical images with the language understanding and reasoning capabilities of large language models (LLMs) to achieve zero-shot learning, covering the natural language component without end-to-end training. This approach reduces deployment costs and training data requirements and allows LLMs to function as flexible, plug-and-play modules. To enhance multi-label classification accuracy, the framework uses external medical image data for pretraining, integrated with a joint feature and label attention mechanism. This configuration ensures robust performance and applicability even with limited data. Additionally, the framework clarifies the decision-making process for visual labels and question prompts, enhancing the interpretability of Med-VQA. Validated on the VQA-Med 2019 dataset, our method demonstrates superior effectiveness compared to existing methods, confirming its strong performance for enhanced clinical applications.
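The image-to-label-to-answer idea in this abstract can be sketched roughly as a two-stage pipeline: a multi-label classifier turns the image into findings, and those findings are serialized into a text prompt for an off-the-shelf LLM. The label set, threshold, and dummy backbone below are hypothetical placeholders, not the ITLTA implementation.

```python
# A rough two-stage sketch of the image-to-label-to-answer idea: a multi-label
# classifier turns the image into findings, which are serialized into a text
# prompt for an off-the-shelf LLM. The label set, threshold, and dummy backbone
# are hypothetical placeholders, not the ITLTA implementation.
import torch
import torch.nn as nn

LABELS = ["pneumonia", "pleural effusion", "cardiomegaly", "fracture"]  # hypothetical


class MultiLabelClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_labels: int):
        super().__init__()
        self.backbone = backbone              # e.g. a pretrained image encoder
        self.head = nn.Linear(feat_dim, num_labels)

    def forward(self, image):
        return torch.sigmoid(self.head(self.backbone(image)))  # per-label probabilities


def labels_to_prompt(probs: torch.Tensor, question: str, threshold: float = 0.5) -> str:
    findings = [label for label, p in zip(LABELS, probs.tolist()) if p >= threshold]
    return (
        f"Findings detected in the image: {', '.join(findings) or 'none'}.\n"
        f"Question: {question}\n"
        "Answer the question using only the findings listed above."
    )


# Dummy feature extractor standing in for a pretrained backbone.
dummy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256))
clf = MultiLabelClassifier(dummy, feat_dim=256, num_labels=len(LABELS))
probs = clf(torch.randn(1, 3, 224, 224))[0]
print(labels_to_prompt(probs, "Is there evidence of pleural effusion?"))
# The resulting prompt is then sent to any LLM; no end-to-end VQA training is needed.
```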
