Systematic Analysis of Vision–Language Models for Medical Visual Question Answering
Abstract
1. Introduction
1.1. Motivation
1.2. Context
1.3. Scope
- How well do general-purpose models (like ViLT, BLIP, and MiniCPM-V-2) perform in a strict zero-shot setting on CT-, MRI-, and X-ray-specific Med-VQA subsets constructed from SLAKE and OmniMedVQA data?
- Can supervised fine-tuning on modality-specific training subsets reduce the performance gap between these models and specialised pretrained Med-VQA models?
- Can a simple post-hoc answer-selection strategy, which matches model-generated outputs to multiple-choice options via semantic similarity, further improve reliability?
1.4. Contributions
- (i) We propose a novel benchmark for medical VLMs derived from well-known datasets and covering three imaging modalities: Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and X-ray radiography. The benchmark includes a reproducible curation pipeline that merges SLAKE [5] and OmniMedVQA [6] into aligned CT, MRI, and X-ray subsets, enabling fair cross-modality comparisons under a unified experimental setup.
- (ii) We present a general evaluation methodology and conduct a systematic analysis of the general-purpose MiniCPM, ViLT, and BLIP models on modality-specific Med-VQA subsets. We quantify the performance limitations of non-medical VQA models in both zero-shot and supervised settings. In addition, our methodology introduces and evaluates a semantic answer-matching procedure that converts free-form outputs into robust multiple-choice predictions, bridging the gap between generative and discriminative VQA formulations.
- (iii) Our results reveal that MRI-based questions are more challenging than CT- and X-ray-based questions. They also demonstrate that modality-aware fine-tuning yields substantial gains in both lexical and semantic correctness without requiring large, fully medical foundation models. These findings offer practical guidance on when and how general-purpose VQA architectures can be safely and effectively adapted for medical visual question answering across radiology modalities.
2. Related Works
2.1. Foundation VLMs and Multimodal LLMs
2.2. General-Domain Visual Question Answering
2.3. Medical Visual Question Answering Datasets and Benchmarks
2.4. Medical VQA Models and Multimodal Clinical Assistants
3. Materials and Methods
3.1. General Methodology
- ViLT (dandelin/vilt-b32-finetuned-vqa), a classification-style VQA model over a fixed VQAv2 answer vocabulary;
- BLIP (Salesforce/blip-vqa-base), an encoder–decoder model that generates short free-text answers;
- MiniCPM-V-2 (openbmb/MiniCPM-V-2), a chat-style multimodal large language model using the official model.chat interface.
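To make the zero-shot protocol concrete, the following minimal sketch shows single-example inference with BLIP via Hugging Face transformers. The image path, question, and decoding budget are illustrative placeholders, not items from our benchmark or the exact benchmarking script.

```python
# Minimal zero-shot VQA sketch with BLIP (paths and question are hypothetical).
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to(device).eval()

image = Image.open("example_ct_slice.png").convert("RGB")  # hypothetical file
question = "Which organ is abnormal in this image?"

inputs = processor(image, question, return_tensors="pt").to(device)
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=20)  # short free-text answer
answer = processor.decode(output_ids[0], skip_special_tokens=True)
print(answer)
```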
3.2. Benchmarking Pipeline
3.2.1. Data Source and Modality-Specific Subsets
3.2.2. Model Configurations and Zero-Shot Inference
3.2.3. Evaluation Protocol and Metrics
3.3. Fine-Tuning Pipeline
3.3.1. Train–Test Splits
3.3.2. ViLT: Multi-Label Classification Fine-Tuning
3.3.3. BLIP: Generative Sequence-to-Sequence Fine-Tuning
3.3.4. MiniCPM-V-2: LoRA-Based Instruction Fine-Tuning
- Per-device train and evaluation batch size;
- Gradient accumulation steps (effective batch size);
- Number of epochs;
- Learning rate;
- fp16 = True;
- Evaluation and checkpointing at each epoch with save_total_limit = 1.
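A hedged sketch of how these settings map onto Hugging Face TrainingArguments is shown below. All numeric values are placeholders rather than the exact configuration used in our experiments, and fp16=True assumes a CUDA device is available.

```python
# Hypothetical TrainingArguments mirroring the settings listed above;
# every numeric value here is a placeholder, not the paper's configuration.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="minicpm-v2-lora-ct",   # hypothetical output directory
    per_device_train_batch_size=1,     # placeholder
    per_device_eval_batch_size=1,      # placeholder
    gradient_accumulation_steps=8,     # placeholder; effective batch = 1 x 8
    num_train_epochs=3,                # placeholder
    learning_rate=2e-5,                # placeholder
    fp16=True,                         # requires a CUDA device
    eval_strategy="epoch",             # "evaluation_strategy" in older releases
    save_strategy="epoch",
    save_total_limit=1,
)
```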
3.3.5. Inference and Metrics After Fine-Tuning
- For ViLT, the model produced a logit vector $z \in \mathbb{R}^{K}$ over the modality-specific label set of size $K$. The predicted answer index was $\hat{y} = \arg\max_{i} z_{i}$, which was mapped back to text via the corresponding id2label mapping (see the sketch after this list).
- For BLIP, answers were generated with the same constrained decoding configuration employed during training-time evaluation.
- For MiniCPM-V-2, answers were produced through the chat interface conditioned on both the image and the question, and the resulting text was normalised using the same routine as in the benchmarking stage.
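The ViLT branch of this procedure can be summarised in a few lines. The checkpoint and image below are illustrative stand-ins; in practice the fine-tuned, modality-specific checkpoint would be loaded instead of the base VQAv2 model.

```python
# Sketch of the ViLT prediction step: arg max over the logits, then map the
# index back to an answer string via id2label (checkpoint/file are stand-ins).
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa").eval()

image = Image.open("example_xray.png").convert("RGB")  # hypothetical file
inputs = processor(image, "Is there a fracture?", return_tensors="pt")

with torch.inference_mode():
    logits = model(**inputs).logits            # shape: (1, num_labels)
pred_idx = logits.argmax(dim=-1).item()        # predicted answer index
print(model.config.id2label[pred_idx])         # index -> answer text
```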
3.4. Option Selection After the Fine-Tuning Pipeline
3.4.1. Multiple-Choice Harmonisation of SLAKE and OmniMedVQA
3.4.2. Combined MCQ Dataset per Modality
- Filtered combo by modality (CT, MRI, X-ray), as in previous stages;
- Applied the same stratified random split as in Section 3.2 (typically 80/20 train–test, with TRAIN_PERCENT = TEST_PERCENT = 1.0 for these runs);
- Retained all option columns and the correct_option label in both training and test splits, even though options were not used by the loss during training.
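A minimal sketch of this filtering and splitting logic with Hugging Face datasets follows. The in-memory records carry only a few of the schema's columns, and the seed value is a placeholder.

```python
# Tiny in-memory stand-in for the combined MCQ dataset (real runs also carry
# image paths and the full option columns).
from datasets import Dataset

combo = Dataset.from_dict({
    "question": ["Which organ is shown?"] * 4 + ["Is there a fracture?"],
    "gt_answer": ["liver", "kidney", "brain", "lung", "no"],
    "correct_option": ["A", "B", "C", "D", "B"],
    "modality": ["CT", "CT", "CT", "CT", "X-ray"],
})

ct = combo.filter(lambda ex: ex["modality"] == "CT")   # modality filtering
splits = ct.train_test_split(test_size=0.2, seed=42)   # seed is a placeholder
train_ds, test_ds = splits["train"], splits["test"]
# correct_option stays in both splits; it is ignored by the training loss.
```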
3.4.3. Semantic Option Selection with BERTScore
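A minimal sketch of this selection rule, assuming a free-form generated answer and a per-question option list, is shown below; the example strings are illustrative. Each option is scored against the model output with BERTScore, and the option with the highest F1 is returned as the multiple-choice prediction.

```python
# Semantic option selection sketch: pick the candidate option whose
# BERTScore F1 against the generated answer is highest.
import evaluate

bertscore = evaluate.load("bertscore")

def select_option(generated_answer: str, options: list[str]) -> str:
    scores = bertscore.compute(
        predictions=[generated_answer] * len(options),
        references=options,
        lang="en",
    )
    best = max(range(len(options)), key=lambda i: scores["f1"][i])
    return options[best]

# Illustrative usage with made-up strings:
print(select_option("the liver is enlarged", ["liver", "kidney", "spleen", "heart"]))
```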
3.4.4. Integration with ViLT, BLIP, and MiniCPM-V-2
4. Results
4.1. Baseline Results
4.2. Supervised Fine-Tuning Results
4.2.1. Overall Effect of Fine-Tuning
4.2.2. Comparison Between Models
4.2.3. Modality Trends
- CT emerged as the easiest modality after fine-tuning, with all models achieving their highest accuracies and BERTScore values there.
- MRI remained the most challenging: while ViLT and BLIP both exceeded 73% EM, their other metrics dipped compared with CT, and MiniCPM lagged further behind.
- X-ray performance was intermediate, but again consistently higher for ViLT and BLIP than for MiniCPM.
4.3. Fine-Tuning with Post-Hoc Option Selection
4.4. Stratified Performance Analysis by Answer Type
5. Discussion
5.1. How Well Do General-Purpose VLMs Perform in a Strict Zero-Shot Setting?
5.2. Can Supervised Fine-Tuning on Modality-Specific Training Subsets Reduce the Performance Gap Between General-Purpose VLMs and Specialised Ones?
5.3. Can a Simple Post-Hoc Answer-Selection Strategy Further Improve Reliability on Benchmarks with Predefined Answer Sets?
5.4. Relation to Prior Med-VQA Literature
6. Conclusions and Future Directions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Environment, Dataset Curation and VLM Benchmarking
Appendix A.1. Computational Environment
| Component | Description |
|---|---|
| Programming language | Python |
| Deep learning framework | PyTorch (tensor operations, GPU-accelerated inference, model execution) |
| Model library | Hugging Face transformers |
| Vision–language models | ViLT (dandelin/vilt-b32-finetuned-vqa), BLIP (Salesforce/blip-vqa-base), MiniCPM-V-2 (openbmb/MiniCPM-V-2) |
| Dataset interface | Hugging Face datasets (loading SLAKE and OmniMedVQA-Mini, filtering, mapping, concatenation) |
| Image processing | Pillow (PIL) for image loading and basic pre-processing (e.g., RGB conversion) |
| Evaluation metrics | evaluate (ROUGE-L, BERTScore); nltk.corpus.wordnet (Wu–Palmer similarity) |
| Dataset acquisition | snapshot_download used to obtain immutable local copies of SLAKE JSON annotations and image archives |
| Execution mode | Models moved to GPU via .to(device), set to evaluation mode with .eval(), and run inside torch.no_grad() or torch.inference_mode() |
| Reproducibility | Global random seeds initialised for NumPy and PyTorch at the start of each run |
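A minimal sketch of the seeding step summarised in the table above could look as follows; the seed value is a placeholder.

```python
# Initialise all RNGs used in a run, as described in the environment table.
import random
import numpy as np
import torch

def set_global_seeds(seed: int = 42) -> None:
    """Fix Python, NumPy, and PyTorch seeds (seed value is a placeholder)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines

set_global_seeds()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```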
Appendix A.2. Dataset Curation and Harmonisation
The raw annotation records expose the following fields:
- img_name (image filename);
- question (text);
- answer (reference answer);
- modality (raw modality label).

After harmonisation, each example follows a canonical schema comprising:
- The image field (tensor or path, depending on configuration);
- question (text);
- gt_answer (canonical answer);
- modality (raw modality string).
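A hedged sketch of this harmonisation step is shown below, using a toy in-memory record; the local image directory and the answer normalisation are illustrative assumptions.

```python
# Map raw records (img_name/answer) to the canonical schema (image/gt_answer).
import os
from datasets import Dataset

IMG_DIR = "slake/imgs"  # hypothetical local image directory

def to_canonical(example):
    return {
        "image": os.path.join(IMG_DIR, example["img_name"]),  # image path
        "question": example["question"],
        "gt_answer": str(example["answer"]).strip().lower(),  # canonical answer
        "modality": example["modality"],
    }

raw = Dataset.from_dict({
    "img_name": ["xm_1.png"],
    "question": ["What modality is used?"],
    "answer": ["MRI"],
    "modality": ["MRI"],
})
canonical = raw.map(to_canonical, remove_columns=["img_name", "answer"])
```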
Appendix A.3. Fairness of Modality Splits

Appendix A.4. Model-Specific Inference
Algorithm A1. Model-specific inference for zero-shot benchmarking (ViLT, BLIP, and MiniCPM-V-2).
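For MiniCPM-V-2, inference follows the official model.chat interface published with the openbmb/MiniCPM-V-2 model card; the sketch below assumes a local image and uses an illustrative decoding setting.

```python
# Hedged sketch of MiniCPM-V-2 zero-shot inference via the official
# model.chat interface; the image path is a placeholder.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-2", trust_remote_code=True, torch_dtype=torch.bfloat16
).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-2", trust_remote_code=True)

image = Image.open("example_mri_slice.png").convert("RGB")  # hypothetical file
msgs = [{"role": "user", "content": "Which organ is abnormal in this image?"}]

res, context, _ = model.chat(
    image=image, msgs=msgs, context=None, tokenizer=tokenizer,
    sampling=False,  # deterministic decoding for benchmarking
)
print(res)
```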
Appendix A.5. Metric Computation
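A minimal sketch of the metric stack listed in Appendix A.1 is shown below: ROUGE-L and BERTScore via the evaluate library, and a simplified Wu–Palmer similarity via WordNet. The word-pair helper is illustrative; the WUPS aggregation and thresholding used in the paper are not reproduced here.

```python
# Metric computation sketch (requires the WordNet corpus: nltk.download("wordnet")).
import evaluate
from nltk.corpus import wordnet as wn

preds = ["liver"]
refs = ["the liver"]

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=preds, references=refs)["rougeL"])

bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=preds, references=refs, lang="en")["f1"][0])

def wup(word_a: str, word_b: str) -> float:
    """Max Wu-Palmer similarity over all synset pairs of two words."""
    sa, sb = wn.synsets(word_a), wn.synsets(word_b)
    if not sa or not sb:
        return 0.0
    return max(a.wup_similarity(b) or 0.0 for a in sa for b in sb)

print(wup("liver", "organ"))
```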
Appendix B. VLM Fine-Tuning
Appendix B.1. Overview of the Fine-Tuning Protocol
- ViLT: three multi-label classification models, one per modality, using ViltForQuestionAnswering;
- BLIP: three generative VQA models, one per modality, using BlipForQuestionAnswering;
- MiniCPM-V-2: three LoRA-adapted instruction-tuned models, one per modality, using the official MiniCPM-V training scripts with base model openbmb/MiniCPM-V-2 and a SigLIP-400M visual encoder.
Appendix B.2. ViLT: Multi-Label Classification Fine-Tuning
- config.num_labels = len(label2id);
- config.id2label and config.label2id set to the constructed dictionaries.
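A hedged sketch of this re-heading step is shown below; the answer vocabulary is an illustrative stand-in for the answers collected from a modality's training split.

```python
# Re-head ViLT for a modality-specific label set; the original VQAv2
# classifier is replaced with a freshly initialised head.
from transformers import ViltForQuestionAnswering

answers = ["yes", "no", "liver", "brain"]        # hypothetical label set
label2id = {a: i for i, a in enumerate(answers)}
id2label = {i: a for a, i in label2id.items()}

model = ViltForQuestionAnswering.from_pretrained(
    "dandelin/vilt-b32-finetuned-vqa",
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # swap in a new classification head
)
# ViltForQuestionAnswering trains with BCEWithLogitsLoss [36] over multi-hot
# answer targets, matching the multi-label setup described above.
```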
Appendix B.3. BLIP: Generative VQA Fine-Tuning
Appendix B.4. MiniCPM-V-2: LoRA-Based Instruction Fine-Tuning
Appendix C. Post-Hoc Option Selection
Appendix C.1. Multiple-Choice Harmonisation of SLAKE and OmniMedVQA
Appendix C.2. Combined MCQ Dataset per Modality
- Modality filtering: Combined Dataset is filtered by modality to create modality-specific datasets.
- Train–test split: For each modality, the same random_split utility as in Appendix B.2 is applied with test_size = 0.2 and TRAIN_PERCENT = TEST_PERCENT = 1.0, using a fixed random seed.
- Retention of options: All option columns and correct_option are retained in both splits. These fields are ignored by the training loss and carried forward solely for evaluation.
Appendix C.3. MiniCPM-V-2 SFT JSON with Options
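A hypothetical construction of one SFT record with the MCQ options folded into the user turn is shown below. The field names and conversation layout follow our reading of the official MiniCPM-V training scripts and should be treated as assumptions, as should the identifier, path, and option strings.

```python
# Build one hypothetical MiniCPM-V-2 SFT record with options in the prompt.
import json

record = {
    "id": "ct_000123",                    # hypothetical identifier
    "image": "images/ct_000123.png",      # hypothetical image path
    "conversations": [
        {
            "role": "user",
            "content": "<image>\nWhich organ is abnormal?\n"
                       "Options: (A) liver (B) kidney (C) spleen (D) heart",
        },
        {"role": "assistant", "content": "(A) liver"},
    ],
}
print(json.dumps(record, indent=2))
```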



Appendix C.4. Semantic Option Selection and Model Integration
References
1. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. VQA: Visual question answering. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2425–2433.
2. Hartsock, I.; Rasool, G. Vision-language models for medical report generation and visual question answering: A review. Front. Artif. Intell. 2024, 7, 1430984.
3. Lin, Z.; Zhang, D.; Tao, Q.; Shi, D.; Haffari, G.; Wu, Q.; He, M.; Ge, Z. Medical visual question answering: A survey. Artif. Intell. Med. 2023, 143, 102611.
4. Sunitha, U.; Shastri, H. Visual question answering system. Int. J. Res. Publ. Rev. 2025, 6, 1793–1796.
5. Liu, B.; Zhan, L.-M.; Xu, L.; Ma, L.; Yang, Y.; Wu, X.-M. SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021.
6. Hu, Y.; Li, T.; Lu, Q.; Shao, W.; He, J.; Qiao, Y.; Luo, P. OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 22170–22183.
7. Liu, G.; He, J.; Li, P.; Zhao, Z.; Zhong, S. Cross-modal self-supervised vision language pre-training with multiple objectives for medical visual question answering. J. Biomed. Inform. 2024, 160, 104748.
8. Kim, W.; Son, B.; Kim, I. ViLT: Vision-and-language transformer without convolution or region supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021.
9. Nguyen, D.; Ho, M.K.; Ta, H.; Nguyen, T.T.; Chen, Q.; Rav, K.; Dang, Q.D.; Ramchandre, S.; Phung, S.L.; Liao, Z.; et al. Localizing before answering: A benchmark for grounded medical visual question answering. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Montreal, QC, Canada, 16–22 August 2025.
10. Yi, Z.; Xiao, T.; Albert, M.V. A survey on multimodal large language models in radiology for report generation and visual question answering. Information 2025, 16, 136.
11. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022.
12. Raschka, S. Understanding multimodal LLMs. Blog post. Available online: https://magazine.sebastianraschka.com/p/understanding-multimodal-llms (accessed on 7 December 2025).
13. Hu, S.; Tu, Y.; Han, X.; He, C.; Cui, G.; Long, X.; Zheng, Z.; Fang, Y.; Huang, Y.; Zhao, W.; et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv 2024, arXiv:2404.06395.
14. Moor, M.; Huang, Q.; Wu, S.; Yasunaga, M.; Zakka, C.; Dalmia, Y.; Reis, E.P.; Rajpurkar, P.; Leskovec, J. Med-Flamingo: A multimodal medical few-shot learner. In Proceedings of the Machine Learning for Health Symposium, New Orleans, LA, USA, 10 December 2023.
15. Guo, Y.; Huang, W. LLaVA-NeXT-Med: Medical multimodal large language model. In Proceedings of the 2025 Asia-Europe Conference on Cybersecurity, Internet of Things and Soft Computing (CITSC), Rimini, Italy, 10–12 January 2025; pp. 474–477.
16. Malinowski, M.; Fritz, M. A multi-world approach to question answering about real-world scenes based on uncertain input. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS), Montreal, QC, Canada, 8–13 December 2014.
17. Malinowski, M.; Rohrbach, M.; Fritz, M. Ask your neurons: A deep learning approach to visual question answering. Int. J. Comput. Vis. 2017, 125, 110–135.
18. Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
19. Zakari, R.Y.; Owusu, J.W.; Qin, K.; Sagir, A.M. A transformer-based approach for effective visual question answering. In Proceedings of the 2024 IEEE Smart World Congress (SWC), Denarau Island, Fiji, 2–7 December 2024; pp. 1532–1539.
20. Sharma, H.; Jalal, A.S. A survey of methods, datasets and evaluation metrics for visual question answering. Image Vis. Comput. 2021, 116, 104327.
21. Lu, S.; Liu, M.; Yin, L.; Yin, Z.; Liu, X.; Zheng, W. The multi-modal fusion in visual question answering: A review of attention mechanisms. PeerJ Comput. Sci. 2023, 9, e1400.
22. Bazi, Y.; Rahhal, M.M.; Bashmal, L.; Zuair, M. Vision–language model for visual question answering in medical imagery. Bioengineering 2023, 10, 380.
23. Lau, J.J.; Gayen, S.; Ben Abacha, A.; Demner-Fushman, D. A dataset of clinically generated visual questions and answers about radiology images. Sci. Data 2018, 5, 180251.
24. He, X.; Zhang, Y.; Mou, L.; Xing, E.; Xie, P. PathVQA: 30,000+ questions for medical visual question answering. arXiv 2020, arXiv:2003.10286.
25. Yuan, D. Language bias in visual question answering: A survey and taxonomy. arXiv 2021, arXiv:2111.08531.
26. Zhang, X.; Wu, C.; Zhao, Z.; Lin, W.; Zhang, Y.; Wang, Y.; Xie, W. PMC-VQA: Visual instruction tuning for medical visual question answering. arXiv 2023, arXiv:2305.10415.
27. Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; Gao, J. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. 2023. Available online: https://openreview.net/forum?id=GSuP99u2kR (accessed on 5 December 2025).
28. Chang, T.; Chen, S.; Fan, G.; Feng, Z. A vision-language model based on prompt learner for few-shot medical images diagnosis. In Proceedings of the International Conference on Computer Supported Cooperative Work in Design (CSCWD), Tianjin, China, 8–10 May 2024; pp. 1455–1460.
29. Chen, X.; Lai, Z.; Ruan, K.; Chen, S.; Liu, J.; Liu, Z. R-LLaVA: Improving Med-VQA understanding through visual region of interest. In Proceedings of the 2025 International Joint Conference on Neural Networks (IJCNN), Rome, Italy, 30 June–5 July 2025; pp. 1–10.
30. Zhang, D.; Cao, R.; Wu, S. Information fusion in visual question answering: A survey. Inf. Fusion 2019, 52, 268–280.
31. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016.
32. Mathur, N.; Baldwin, T.; Cohn, T. Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Online, 5–10 July 2020.
33. Barbella, M.; Tortora, G. ROUGE metric evaluation for text summarization techniques. SSRN Electron. J. 2022.
34. Kim, B.S.; Kim, J.; Lee, D.; Jang, B. Visual question answering: A survey of methods, datasets, evaluation, and challenges. ACM Comput. Surv. 2025, 57, 249.
35. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating text generation with BERT. arXiv 2019, arXiv:1904.09675.
36. PyTorch Contributors. BCEWithLogitsLoss. Available online: https://docs.pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html (accessed on 19 January 2026).
37. Zhang, X.; Wu, C.; Zhao, Z.; Lin, W.; Zhang, Y.; Wang, Y.; Xie, W. Development of a large-scale medical visual question-answering dataset. Commun. Med. 2024, 4, 277.

Zero-shot baseline results per modality (all metrics in %; the modality column also gives the number of QA pairs).

| Model / Dataset | Modality (QA Pairs) | Exact Match | Average F1 | ROUGE-L | WUPS | WUPS@0.9 | BERTScore-F1 |
|---|---|---|---|---|---|---|---|
| Salesforce/blip-vqa-base | | | | | | | |
| SLAKE + OmniMedVQA | CT (3571) | 24.56 | 25.02 | 25.07 | 64.87 | 28.48 | 72.42 |
| | MRI (1840) | 18.86 | 19.88 | 20.15 | 50.62 | 23.48 | 65.59 |
| | X-ray (2372) | 23.19 | 23.69 | 26.39 | 65.58 | 28.54 | 70.59 |
| dandelin/vilt-b32-finetuned-vqa | | | | | | | |
| SLAKE + OmniMedVQA | CT (3571) | 23.55 | 23.97 | 23.99 | 55.08 | 24.84 | 68.21 |
| | MRI (1840) | 16.47 | 16.66 | 16.93 | 46.10 | 19.78 | 61.51 |
| | X-ray (2372) | 18.59 | 18.89 | 18.92 | 54.88 | 24.24 | 63.47 |
| openbmb/MiniCPM-V-2 | | | | | | | |
| SLAKE + OmniMedVQA | CT (3571) | 0.00 | 3.16 | 6.88 | 70.47 | 18.96 | 35.49 |
| | MRI (1840) | 0.00 | 1.96 | 7.68 | 58.54 | 13.86 | 36.39 |
| | X-ray (2372) | 0.00 | 2.92 | 7.45 | 67.93 | 21.29 | 35.90 |
Results after supervised fine-tuning on the modality-specific training splits, evaluated on the held-out test sets (all metrics in %; the modality column gives the number of test QA pairs).

| Model / Dataset | Modality (QA Pairs) | Exact Match | Average F1 | ROUGE-L | WUPS | WUPS@0.9 | BERTScore-F1 |
|---|---|---|---|---|---|---|---|
| Salesforce/blip-vqa-base | | | | | | | |
| SLAKE + OmniMedVQA | CT (714) | 81.72 | 84.33 | 84.79 | 91.78 | 83.26 | 93.51 |
| | MRI (368) | 73.55 | 79.39 | 79.96 | 78.43 | 66.67 | 91.31 |
| | X-ray (474) | 71.28 | 75.07 | 82.08 | 88.65 | 78.51 | 92.17 |
| dandelin/vilt-b32-finetuned-vqa | | | | | | | |
| SLAKE + OmniMedVQA | CT (714) | 82.70 | 84.57 | 84.35 | 91.58 | 82.98 | 93.93 |
| | MRI (368) | 76.03 | 80.03 | 80.29 | 77.37 | 64.74 | 91.83 |
| | X-ray (474) | 78.09 | 80.95 | 81.25 | 88.50 | 77.87 | 91.50 |
| openbmb/MiniCPM-V-2 | | | | | | | |
| SLAKE + OmniMedVQA | CT (714) | 79.97 | 81.52 | 81.68 | 89.13 | 78.43 | 93.00 |
| | MRI (368) | 68.48 | 74.30 | 76.23 | 73.36 | 61.14 | 88.73 |
| | X-ray (474) | 59.70 | 65.52 | 70.20 | 78.66 | 64.14 | 85.88 |
Results after supervised fine-tuning combined with post-hoc semantic option selection via BERTScore (all metrics in %; the modality column gives the number of test QA pairs).

| Model / Dataset | Modality (QA Pairs) | Exact Match | Average F1 | ROUGE-L | WUPS | WUPS@0.9 | BERTScore-F1 |
|---|---|---|---|---|---|---|---|
| Salesforce/blip-vqa-base | | | | | | | |
| SLAKE + OmniMedVQA | CT (714) | 90.76 | 91.22 | 91.48 | 93.95 | 87.25 | 96.48 |
| | MRI (368) | 87.77 | 87.77 | 87.77 | 78.04 | 66.58 | 95.65 |
| | X-ray (474) | 93.25 | 93.61 | 93.61 | 91.33 | 81.65 | 97.57 |
| dandelin/vilt-b32-finetuned-vqa | | | | | | | |
| SLAKE + OmniMedVQA | CT (714) | 92.12 | 92.45 | 92.59 | 94.67 | 89.03 | 96.72 |
| | MRI (368) | 92.84 | 92.84 | 92.84 | 79.81 | 70.25 | 96.87 |
| | X-ray (474) | 92.34 | 92.87 | 92.86 | 90.63 | 82.34 | 96.54 |
| openbmb/MiniCPM-V-2 | | | | | | | |
| SLAKE + OmniMedVQA | CT (714) | 80.95 | 82.77 | 82.91 | 90.00 | 79.69 | 92.86 |
| | MRI (368) | 76.63 | 80.47 | 81.35 | 75.71 | 64.95 | 91.27 |
| | X-ray (474) | 73.84 | 78.42 | 78.87 | 85.39 | 73.42 | 89.81 |