Article

Systematic Analysis of Vision–Language Models for Medical Visual Question Answering

by Muhammad Haseeb Shah * and Heriberto Cuayáhuitl *
School of Engineering and Physical Sciences, University of Lincoln, Brayford Pool, Lincoln LN6 7TS, UK
* Authors to whom correspondence should be addressed.
Multimodal Technol. Interact. 2026, 10(2), 16; https://doi.org/10.3390/mti10020016
Submission received: 22 December 2025 / Revised: 29 January 2026 / Accepted: 30 January 2026 / Published: 3 February 2026

Abstract

General-purpose vision–language models (VLMs) are increasingly applied to imaging tasks, yet their reliability on medical visual question answering (Med-VQA) remains unclear. We investigate how three state-of-the-art VLMs—ViLT, BLIP, and MiniCPM-V-2—perform on radiology-focused Med-VQA when evaluated in a modality-aware manner. Using SLAKE and OmniMedVQA-Mini, we construct harmonised subsets for computed tomography (CT), magnetic resonance imaging (MRI), and X-ray, standardising schema and answer processing. We first benchmark all models in a strict zero-shot setting, then perform supervised fine-tuning on modality-specific data splits, and finally add a post-hoc semantic option-selection layer that maps free-text predictions to multiple-choice answers. Zero-shot performance is modest (exact match ≈20% for ViLT/BLIP and 0% for MiniCPM-V-2), confirming that off-the-shelf deployment is inadequate. Fine-tuning substantially improves all models, with ViLT reaching ≈80% exact match and BLIP ≈50%, while MiniCPM-V-2 lags behind. When coupled with option selection, ViLT and BLIP achieve 90–93% exact match and F1 across all modalities, corresponding to 95–97% BERTScore-F1. Our novel results show that (i) modality-specific supervision is essential for Med-VQA, and (ii) post-hoc option selection can transform strong but imperfect generative predictions into highly reliable discrete decisions on harmonised radiology benchmarks. The latter is useful for medical VLMs that combine generative responses with option or sentence selection.

1. Introduction

1.1. Motivation

Advances in artificial intelligence over the past decade have transformed how images and text are processed jointly, from generic captioning to complex visual reasoning. Visual question answering (VQA) was introduced as a benchmark task where models must answer natural-language questions about images, requiring them to combine visual recognition, language understanding, and commonsense reasoning [1]. These vision–language systems have rapidly improved through large-scale pre-training and transformer-based architectures, and are now being deployed in applications that demand high reliability, including healthcare [2].
In medicine, VQA is a key paradigm for building interactive decision-support tools that can answer clinicians’ questions on radiology, pathology, and other diagnostic images. Medical visual question answering (Med-VQA) extends general VQA to clinical images and domain-specific questions, such as “What abnormality is present in this CT scan?” or “Which anatomical structure is enlarged?” [3]. Compared to natural images, medical data present additional challenges: images are often noisy or low contrast, may represent 3D anatomy through 2D slices, and questions contain specialised terminology and implicit clinical context. Recent surveys emphasise that Med-VQA remains substantially more challenging than its general-domain counterpart, with models needing not only accurate perception but also medically grounded reasoning and robust uncertainty handling [3].

1.2. Context

The field has progressed through a series of specialised datasets and architectures. Early Med-VQA benchmarks such as VQA-RAD, PathVQA, and VQA-Med focus on radiology or pathology images with curated question templates and at a relatively small scale [4]. More recent datasets have introduced richer semantics, bilingual annotations, and ontology links. SLAKE is a semantically labelled bilingual Med-VQA dataset that covers CT, MRI, and chest X-ray images with expert-annotated questions, answers, and structured medical concepts [5]. OmniMedVQA, in contrast, aggregates images from 73 public medical datasets into a large, heterogeneous benchmark spanning 12 imaging modalities (including CT, MRI, and X-ray) and over 20 anatomical regions. It is designed to explicitly stress-test Large Vision–Language Models (LVLMs) on realistic, noisy clinical data [6]. Alternative benchmarking frameworks such as BESTMVQA provide unified pipelines for evaluating Med-VQA models across datasets and architectures, highlighting issues of data scarcity and reproducibility.
On the modelling side, most early Med-VQA systems adapt architectures from general VQA, for example via CNN-RNN pipelines or attention-based fusion, and refine them for clinical images and questions. Recent methods introduce asymmetric cross-modal attention networks, graph-based reasoning, contrastive pre-training, and domain-specific pre-training to bridge the semantic gap between medical images and textual questions [7]. In parallel, foundation VLMs such as Vision-and-Language Transformer (ViLT) and Bootstrapping Language-Image Pre-training (BLIP) have shown strong performance on general VQA and related tasks by pre-training on large-scale image–text pairs [8]. These models process images and questions with a unified transformer backbone and can be fine-tuned efficiently for downstream tasks, making them attractive candidates for medical applications where computational resources and deployment constraints matter.
More recently, large multimodal models (LMMs) such as Med-Flamingo, LLaVA-Med, and GPT-4V have been explored for medical imaging tasks, including VQA, report generation, and triaging [2] (whilst VLMs make use of two modalities, images and text, LMMs can handle several). Studies on OmniMedVQA and related benchmarks consistently show that even state-of-the-art LVLMs struggle when faced with real clinical images, multi-step reasoning, and fine-grained modality-specific questions [6]. At the same time, new evaluation suites such as HEAL-MedVQA and hallucination-focused analyses indicate that these models can produce confident but incorrect answers, underscoring reliability and safety concerns for clinical deployment [9].

1.3. Scope

Despite this growing body of work, two important gaps remain. First, much of the recent literature concentrates on either bespoke medical vision–language models or very large LVLMs that require substantial resources and domain-specific pre-training [10]. In contrast, comparatively few studies systematically examine how general-purpose VQA models—pre-trained only on natural images (e.g., VQAv2) and web-scale captions—behave on realistic medical VQA benchmarks when applied “as is” and after lightweight supervised adaptation. Secondly, although datasets like SLAKE and OmniMedVQA explicitly include multiple imaging modalities, most evaluation protocols aggregate performance across modalities or treat modality as a descriptive attribute rather than a central axis of analysis [5]. OmniMedVQA does report modality-wise scores for LVLMs, but primarily in zero-shot or prompt-based settings and without detailed investigation of how small to mid-sized VQA models can be fine-tuned on modality-specific subsets. As a result, there is limited understanding of how imaging modality (CT vs. MRI vs. X-ray) interacts with model architecture, fine-tuning strategy, and answer representation in Med-VQA.
In this study, we address these gaps by focusing specifically on modality-aware medical VQA across CT, MRI, and X-ray radiology images. We build a harmonised benchmark by combining the SLAKE dataset with a curated subset of the OmniMedVQA benchmark, restricting both to questions containing CT, MRI, and X-ray images and standardising annotations into a unified schema. On top of this shared dataset curation pipeline, we systematically evaluate three widely used general-purpose vision–language models: (a) ViLT, a classification-style VQA model pre-trained and fine-tuned on VQAv2; (b) BLIP, a generative vision–language model capable of open-ended VQA, without any medical pre-training; and (c) MiniCPM-V-2, an end-side multimodal large language model built on a SigLIP-based visual encoder and a lightweight MiniCPM language backbone, connected via a perceiver resampler. Our work aims to answer the following research questions:
  • How well do general-purpose models (like ViLT, BLIP, and MiniCPM-V-2) perform in a strict zero-shot setting on CT-, MRI-, and X-ray-specific Med-VQA subsets constructed from SLAKE and OmniMedVQA data?
  • Can supervised fine-tuning on modality-specific training subsets reduce the performance gap between these models and specialised pretrained Med-VQA models?
  • Can a simple post-hoc answer-selection strategy, which matches model-generated outputs to multiple-choice options via semantic similarity, further improve reliability?
To answer the above, we conduct a comprehensive evaluation using a suite of lexical and semantic metrics, including exact match, token-level F1, ROUGE-L, WUPS, and BERTScore, comparing zero-shot, fine-tuned, and post-processed configurations for each modality.

1.4. Contributions

i.
We propose a novel benchmark for medical VLMs derived from well-known datasets and covering three imaging modalities: Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and X-ray radiography. The benchmark includes a reproducible curation pipeline that merges SLAKE [5] and OmniMedVQA [6] into aligned CT, MRI, and X-ray subsets, enabling fair cross-modality comparisons under a unified experimental setup.
ii.
We present a general evaluation methodology and conduct a systematic analysis of general-purpose MiniCPM, ViLT and BLIP models on modality-specific Med-VQA subsets. We quantify the performance limitations of non-medical VQA models in both zero-shot and supervised settings. In addition, our methodology introduces and evaluates a semantic answer-matching procedure that converts free-form outputs into robust multiple-choice predictions, bridging the gap between generative and discriminative VQA formulations.
iii.
Our novel results reveal that MRI-based questions are more challenging than CT and X-ray-based questions. They also demonstrate that modality-aware fine-tuning yields substantial gains in both lexical and semantic correctness without requiring large, fully medical foundation models. These findings offer practical guidance on when and how general-purpose VQA architectures can be safely and effectively adapted for medical visual question answering across radiology modalities.

2. Related Works

This section reviews previous work, starting from general-purpose vision–language models (VLMs) and moving towards specialised medical visual question answering, relevant datasets, and evaluation metrics. We focus on explaining how existing studies handle different imaging modalities, and highlight that systematic, modality-aware analyses across CT, MRI, and X-ray are still limited.

2.1. Foundation VLMs and Multimodal LLMs

Large-scale vision–language pre-training has led to a new generation of foundation models that can be adapted to many downstream tasks, including VQA. ViLT (Vision-and-Language Transformer) is one of the earliest “minimal” vision–language models. It replaces the heavy CNN or region-based visual backbone with a Transformer that operates directly on patch embeddings along with text tokens, achieving up to 60× improvements in speed over previous vision–language pre-training (VLP) models while remaining competitive on VQAv2 (71.3% accuracy on the test set when fine-tuned as a 3129-class classifier) [8]. This efficiency and classification-style formulation make ViLT an attractive baseline for our work, as it can be fine-tuned separately on CT-, MRI-, and X-ray–specific answer vocabularies without prohibitive computational cost.
BLIP (Bootstrapping Language–Image Pre-training) extends this line of work with a multimodal encoder–decoder architecture designed to handle both understanding and generation tasks, including image–text retrieval, captioning, and open-ended VQA [11]. The key idea is to bootstrap cleaner supervision from noisy web image–text pairs using a captioner–filter pipeline. BLIP then achieves state-of-the-art results across several benchmarks, improving VQAv2 performance by around 1.6 VQA points compared with prior models and substantially boosting retrieval and captioning scores. Because BLIP exposes a generative VQA interface, it is well-suited to our setting where free-form answers must later be mapped to multiple-choice options.
A recent narrative review of multimodal large language models (MLLMs) outlines the emerging trend of models that accept text, images, audio, and video as input while generating language outputs [12]. It distinguishes two main architectural families: unified embedding–decoder models, in which image patches are embedded and concatenated with text tokens before being passed through a decoder stack, and cross-modality attention models, in which visual and textual streams interact via dedicated cross-attention layers within the transformer backbone. The review also clarifies how standard vision encoders (e.g., ViT) can convert image patches into representations compatible with language models. Although it is not an empirical study, it serves as a practical guide for practitioners by summarising current MLLM designs (including recent multimodal LLaMA variants) and offering conceptual guidance for designing multimodal pipelines.
Recent multimodal large language models (MLLMs) push these ideas further. MiniCPM-V-2 is a lightweight but strong multimodal LLM built from a SigLIP-400M vision encoder and a 2.4B-parameter MiniCPM language model, connected via a perceiver resampler. It targets efficient “end-side” deployment on consumer GPUs and mobile devices while maintaining competitive performance on benchmarks such as TextVQA and OCRBench [13]. In the biomedical space, Med-Flamingo adapts the Flamingo architecture to medical images, learning from multimodal medical knowledge sources and enabling few-shot generative Med-VQA. On datasets such as VQA-RAD, it improves clinician-rated answer quality by up to 20% over prior baselines [14]. LLaVA-Med follows a different route: starting from a general LLaVA-like model, it performs visual instruction tuning using figure–caption data from PubMed Central and GPT-4–generated instruction-following data, achieving strong performance on multiple Med-VQA datasets in under 15 h of training on eight A100 GPUs [15].

2.2. General-Domain Visual Question Answering

The task of visual question answering (VQA) was formalised by [16], who introduced the VQA dataset with 0.25M images, 0.76M questions, and about 10M answers, combining images from MS COCO and abstract scenes. They framed VQA as an open-ended, natural-language question answering problem over images and showed that solving it requires both visual understanding and language reasoning, beyond what generic captioning systems provide. Early deep learning approaches, such as “Ask Your Neurons”, jointly trained CNN-RNN pipelines to map images and questions to answers, and already highlighted the role of language priors and dataset biases [17].
The later-introduced VQAv2 [18] is a “balanced” version of the dataset, containing 265,016 images and 1.1 million questions. It was constructed so that each question is paired with two similar images that yield different answers. This design explicitly reduces language-only shortcuts and forces models to attend to visual content, becoming the de facto standard benchmark for VQA. Many modern VQA systems—including ViLT and BLIP—treat VQA as a multi-class classification problem over the most frequent 3100 answers, fine-tuning transformer-based vision–language encoders on VQAv2 and often achieving 70–75% accuracy on test-dev. These works establish the baseline practice we adopt in this paper: ViLT is used with a classification head over modality-specific answer vocabularies, while BLIP and MiniCPM-V-2 generate free-form answers that are later mapped to discrete options.
General VQA research has also explored robustness, ambiguity, and reasoning, for example through studies of question rephrasing, compositional generalisation, and causal analysis of language vs. vision contributions [19]. However, most of these analyses are carried out on natural images and everyday questions; how these insights transfer to medical imaging modalities remains underexplored. Ref. [20] categorised attention-based models into single-hop and multi-hop types, considering traditional VQA models as well as models with the capacity to read text in images. Furthermore, Ref. [21] gave a brief description of models with attention mechanisms that showed excellent results on VQA while also allowing different analyses to be visualised. Ref. [22] also explored a transformer-based encoder–decoder framework for medical visual question answering, combining Vision Transformer (ViT) image features with a transformer text encoder to jointly model visual and textual information. The fused representations were decoded autoregressively to generate answers, and the approach was validated on VQA-RAD and PathVQA, demonstrating improved performance over prior methods.

2.3. Medical Visual Question Answering Datasets and Benchmarks

The development of Med-VQA has been tightly coupled with the creation of specialised datasets. VQA-RAD is one of the earliest radiology-oriented datasets, consisting of 315 images and about 3.5K question–answer pairs collected from the MedPix database [23]. Questions were written by radiologists and include both open-ended and yes/no formats, covering tasks such as identifying anatomical structures, describing abnormalities, and interpreting imaging findings. PathVQA extends Med-VQA to pathology by extracting 4998 images from textbooks and online resources and generating 32,799 open-ended questions via a semi-automated NLP pipeline, with every question manually verified [24]. These datasets demonstrated feasibility but are limited in size and modality coverage. Ref. [25] introduced VQA-CP dataset for constrained experimentation on language bias. They categorized the existing methods into three classes and presented the causes of language bias.
SLAKE (Semantically Labelled Knowledge-Enhanced) is a more recent bilingual Med-VQA dataset that explicitly integrates structured medical knowledge. It contains 642 radiology images (282 CT, 181 MRI, 179 X-rays) and around 14,028 English/Chinese question–answer pairs, annotated with semantic labels and linked to a medical knowledge graph [5]. SLAKE covers multiple body regions and question types (e.g., “abnormality”, “organ”, and “colour”) and provides ontology-level annotations, making it a natural choice for evaluating knowledge-aware Med-VQA systems.
PMC-VQA scales Med-VQA to 227K question–answer pairs and 149K images derived from PubMed Central figures, with 80% of images being radiology. The authors also introduce MedVInT, a generative model pre-trained on PMC-VQA and fine-tuned on benchmarks such as VQA-RAD and SLAKE, achieving substantial improvements over prior models in terms of accuracy and BLEU, demonstrating superior performance [26]. OmniMedVQA pushes this idea further by aggregating data from 73 different medical datasets into a single benchmark spanning 12 imaging modalities (including CT, MRI, X-ray) and >20 anatomical regions, and demonstrates that even strong LVLMs struggle on many of the 200K+ VQA pairs [6].
Beyond datasets, Lin et al. [3] provide a comprehensive survey of Med-VQA methods, highlighting challenges such as data scarcity, domain shift, and the need for trustworthy evaluations. More recently, HEAL-MedVQA introduces a 67K-pair benchmark focusing on hallucination and localisation. In addition to VQA pairs, it includes doctor-annotated segmentation masks and new evaluation protocols that test whether models attend to correct regions or rely on spurious text cues [9].
These datasets collectively show that Med-VQA spans diverse modalities and question types. However, most studies report aggregate performance across modalities or focus on a single modality (e.g., pathology or radiology). The modality labels available in SLAKE and OmniMedVQA are rarely used to build modality-specific training and evaluation subsets, which is the central focus of our work.

2.4. Medical VQA Models and Multimodal Clinical Assistants

A first generation of Med-VQA models adapted general VQA architectures (CNN-RNN, attention-based fusion) to medical images, sometimes with additional knowledge graphs or external clinical ontologies. Lin et al. point out that these approaches are limited by small datasets and hand-crafted fusion mechanisms. More recent work explores asymmetric cross-modal attention, graph reasoning over anatomical structures, and domain-specific pre-training to better align questions with image regions and structured knowledge [16].
With the rise of LVLMs, models such as LLaVA-Med and Med-Flamingo have become prominent. LLaVA-Med trains a biomedical vision–language assistant by first aligning a general model with biomedical figure captions and then instruction-tuning it with GPT-4-generated question–answer pairs. On three Med-VQA datasets (including VQA-RAD, SLAKE, and PathVQA), LLaVA-Med matches or surpasses previous supervised baselines in terms of overall accuracy, while supporting conversational interactions [27]. Med-Flamingo takes a few-shot learning perspective, extending Flamingo to medical domains and demonstrating improved clinician-rated answer quality (20% gain) on several benchmarks, including custom VQA splits designed to avoid data leakage [28].
More targeted models such as R-LLaVA incorporate region-of-interest prompts or localisation-aware training to better ground answers in specific image areas. R-LLaVA-7B, for example, reports state-of-the-art accuracies on SLAKE-EN (89.5% for open-ended and 90.1% for closed-ended questions) by explicitly modelling spatial focus [29]. HEAL-MedVQA similarly emphasises grounding and hallucination robustness, showing that many current LMMs can answer correctly while still attending to irrelevant or incorrect regions, thus motivating localisation-aware training protocols [9]. Ref. [30] also presented multiple fusion proposals and an abstract fusion framework that can be used with popular VQA models.
Despite these recent advances in (Med-)VQA, existing approaches rely on large, domain-specialised LVLMs and focus on overall performance and hallucination behaviour rather than systematically dissecting performance per imaging modality. Furthermore, they provide limited guidance on how smaller, general-purpose models can be adapted efficiently to different radiology modalities using modest supervision, which is the problem we address in this work.

3. Materials and Methods

3.1. General Methodology

Dataset harmonisation (input to all pipelines).
We first load the SLAKE dataset, restrict it to English-language questions, and retain the image path, question text, reference answer, and modality. We then load OmniMedVQA-Mini (default test split) and retain the image, question, ground-truth answer, modality, and native multiple-choice options. Modality strings from both datasets are normalised into three canonical classes (CT, MRI, X-ray), and all other modalities are discarded. All records are mapped to a unified schema $\{\textit{imagePath}, \textit{question}, \textit{canonicalAnswer}, \textit{modality}, \textit{datasetId}\}$, followed by a quality-control pass to ensure that every referenced image file exists and is readable. This procedure yields three modality-specific subsets of $|D_{\text{CT}}| = 3571$, $|D_{\text{MRI}}| = 1840$, and $|D_{\text{X-ray}}| = 2372$ image–question pairs, for a total of 7783 supervised examples.
For each modality $m$, we partition $D_m$ into training and test sets using a single deterministic split with test_size = 0.2 and seed = SPLIT_SEED, implemented via Hugging Face datasets. The resulting frozen split is reused across all models and configurations. Approximate train/test sizes are 2857/714 (CT), 1472/368 (MRI), and 1898/474 (X-ray). All answer strings are normalised (lowercasing, trimming, and collapsing whitespace) prior to label construction and metric computation.
Our modality-aware evaluation is defined at the level of image–question pairs rather than patients or studies, as neither SLAKE nor OmniMedVQA-Mini provides patient-level identifiers. Harmonisation performs schema alignment and concatenation only, without generating or duplicating records. The benchmark thus consists solely of original dataset examples grouped by imaging modality (CT, MRI, and X-ray).
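The following sketch illustrates this harmonisation and split logic in Python with Hugging Face datasets; the column names (e.g., img_name), the SPLIT_SEED value, and the helper names are illustrative assumptions rather than the exact implementation:

import os
from datasets import Dataset

MODALITY_MAP = {"ct": "CT", "mri": "MRI", "mr": "MRI",
                "x-ray": "X-ray", "xray": "X-ray", "x_ray": "X-ray"}
SPLIT_SEED = 42  # illustrative; any fixed seed yields a frozen split

def to_unified_schema(example, dataset_id, answer_key="answer"):
    # Map a raw record onto {imagePath, question, canonicalAnswer, modality, datasetId}.
    raw_modality = str(example.get("modality", "")).strip().lower()
    return {
        "imagePath": example["img_name"],
        "question": example["question"],
        # Normalise answers: lowercase, trim, collapse whitespace.
        "canonicalAnswer": " ".join(str(example[answer_key]).lower().split()),
        "modality": MODALITY_MAP.get(raw_modality),  # None -> discarded later
        "datasetId": dataset_id,
    }

def modality_split(ds: Dataset, modality: str):
    # Filter to one canonical modality, run the image QC pass, freeze an 80/20 split.
    sub = ds.filter(lambda ex: ex["modality"] == modality)
    sub = sub.filter(lambda ex: os.path.exists(ex["imagePath"]))
    return sub.train_test_split(test_size=0.2, seed=SPLIT_SEED)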
Zero-shot benchmarking (Section 3.2.1).
For each modality $m \in \{\text{CT}, \text{MRI}, \text{X-ray}\}$, we evaluate the entire subset $D_m$ without any training. We benchmark three general-purpose vision–language models:
  • ViLT (dandelin/vilt-b32-finetuned-vqa), a classification-style VQA model over a fixed VQAv2 answer vocabulary;
  • BLIP (Salesforce/blip-vqa-base), an encoder–decoder model that generates short free-text answers;
  • MiniCPM-V-2 (openbmb/MiniCPM-V-2), a chat-style multimodal large language model using the official model.chat interface.
For every sample, models are run in inference mode (GPU when available, gradients disabled), predictions and reference answers are normalised, and we compute Exact Match (EM), token-level F1, ROUGE-L, WUPS, WUPS@0.9, and BERTScore-F1. These scores form the zero-shot baseline per model and modality.
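As an illustration, a minimal zero-shot inference loop for the generative BLIP baseline can be sketched as follows; the decoding settings mirror those reported for BLIP later in the paper, and the normalisation helper reflects the lowercasing/whitespace rule described above:

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to(device).eval()

def normalise(text):
    # Lowercase, trim, and collapse whitespace, as in the unified schema.
    return " ".join(text.lower().strip().split())

@torch.inference_mode()  # gradients disabled during inference
def predict(image_path, question):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(image, question, return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=16, num_beams=3, do_sample=False)
    return normalise(processor.decode(out[0], skip_special_tokens=True))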
Supervised fine-tuning regimes (Section 3.3).
For each modality-specific training set and its frozen test set, we apply three supervised fine-tuning strategies.
ViLT (multi-label classification). A modality-specific label vocabulary is built from normalised training answers, defining label2id and id2label mappings with config.num_labels $= K_m$. A custom collator produces pixel values, tokenised questions, and a multi-hot label tensor $Y \in \{0,1\}^{B \times K_m}$. The model is optimised end-to-end using binary cross-entropy with logits over all labels (no PEFT).
BLIP (sequence-to-sequence generation). The BLIP processor constructs encoder inputs from image and question. Ground-truth answers are tokenised, an end-of-sequence token is appended, and padded positions are masked with $-100$. Training minimises the standard autoregressive negative log-likelihood over answer tokens. The full encoder–decoder stack is trained with a schedule matched to ViLT (small batch size with gradient accumulation, ∼20 epochs, learning rate $5 \times 10^{-5}$), in full precision (no fp16).
MiniCPM-V-2 (LoRA-based instruction tuning). Training and test splits are converted to JSONL SFT format, where each record contains an image path and a two-message conversation (user: <image> + question; assistant: normalised answer). LoRA adapters are attached only to the language model self-attention projections (q/k/v) with rank $r = 8$, scaling $\alpha = 16$, and dropout 0.05, while the SigLIP-400M vision encoder and base LLM weights remain frozen. Training uses per-device batch size 4, gradient accumulation 16 (effective batch size 64), 10 epochs, learning rate $2 \times 10^{-4}$, and fp16, following the official OpenBMB training loop. After training, each model is evaluated on its modality-specific frozen test set using the same metric stack as in zero-shot benchmarking. For MiniCPM-V-2, we did not use any custom prompt engineering or task-specific templates. All zero-shot evaluations followed the official OpenBMB model.chat(…) interface with the model’s default chat template and decoding settings, using only the image and the raw question as input.
Multiple-choice reformulation (Section 3.4.1 and Section 3.4.2).
For OmniMedVQA-Mini, the native options field is expanded into option_A–option_D, and a correct_option label is derived by matching the normalised ground-truth answer to the normalised options. For SLAKE, synthetic options are generated using only the SLAKE answer distribution by constructing a top-10 high-frequency answer pool and an extended pool of all unique answers. For each question, three unique distractors are sampled, the correct answer is inserted to form four distinct options, and correct_option $\in \{A, B, C, D\}$ is recorded. The augmented SLAKE and OmniMed tables are concatenated, and modality-specific train–test splits are recreated using the same seeds and fractions as above; options are carried along but not used in the loss.
Post-hoc semantic option selection (Section 3.4.3).
For each fine-tuned model and modality, inference is run on the MCQ test split to obtain a normalised free-text prediction $\hat{y}_i$ for each example. We compute BERTScore-F1 between $\hat{y}_i$ and each non-empty option $o_j$, and the option with the highest similarity score is selected and treated as the calibrated prediction. Evaluation metrics (EM, F1, ROUGE-L, WUPS, WUPS@0.9, BERTScore-F1) are computed on the pairs $\{(\hat{o}_i, y_i)\}$, and MCQ accuracy is reported on the subset with a valid correct_option.
All experiments were implemented in Python 3.12.11 using PyTorch 2.10.0 and the Hugging Face ecosystem. We used the datasets library to access SLAKE and OmniMedVQA-Mini, and Transformers to load the following models: ViLT (dandelin/vilt-b32-finetuned-vqa), BLIP (Salesforce/blip-vqa-base), and MiniCPM-V-2 (openbmb/MiniCPM-V-2). All models were evaluated on test data, with gradients disabled during inference.

3.2. Benchmarking Pipeline

Here we describe the zero-shot benchmarking of ViLT, BLIP, and MiniCPM-V-2 on harmonised CT, MRI, and X-ray subsets derived from SLAKE and OmniMedVQA-Mini. Full implementation details (including libraries, exact loading utilities, and metric code) are provided in Appendices A.1–A.4.

3.2.1. Data Source and Modality-Specific Subsets

We used two publicly available Med-VQA datasets: SLAKE and OmniMedVQA-Mini. SLAKE is a bilingual, semantically labelled dataset with CT, MRI, and X-ray images and physician-annotated question–answer pairs. For this study, we restricted SLAKE to English questions and, for each item, retained the image filename, question text, reference answer, and modality label. OmniMedVQA-Mini is a compact multiple-choice subset of OmniMedVQA that provides, for each image–question pair, four answer options, a ground-truth answer, and a modality string.
Both datasets were mapped to a common schema with four core fields: image path, question, canonical reference answer, and modality. For OmniMedVQA-Mini, the gt_answer field was used as the canonical answer; SLAKE answers were kept as supplied. Modality strings were normalised to three labels—CT, MRI, and X-ray—via a lookup table. Any samples whose modality could not be resolved to one of these three categories were discarded. The harmonised SLAKE and OmniMed tables were then concatenated, with a dataset_id flag added. Three disjoint subsets (CT, MRI, X-ray) were created by filtering on modality. A final quality-control pass verified that all referenced image files existed and were readable. For the benchmarking pipeline, these modality-specific subsets were used exclusively as evaluation sets; no training or validation splits were defined at this stage (see Appendix A.2 for the full curation logic).
To assess any data leakage, we analysed the train–test split at the image level by visualising the distribution of image IDs across all QA pairs for each modality. The resulting plots (Appendix A.3) confirm that while the same image may appear in both splits with different questions, no identical image–question pairs are shared between training and testing.

3.2.2. Model Configurations and Zero-Shot Inference

We evaluated three general-purpose vision–language models in a strict zero-shot setting, i.e., without any additional medical pre-training or task-specific fine-tuning:
ViLT (dandelin/vilt-b32-finetuned-vqa). ViLT is a transformer-based vision–language model that jointly encodes image patches and text tokens in a single transformer stack and is fine-tuned on VQAv2 as a multi-class classifier over a fixed answer vocabulary. In our pipeline, each SLAKE or OmniMed sample is processed by the ViLT processor to obtain joint image–question embeddings. The VQA classification head yields logits over the VQAv2 answer set; the highest-scoring label is mapped back to text using the checkpoint’s id2label mapping and used as the prediction. We do not modify this answer vocabulary, so ViLT can only output answers that appear in VQAv2.
BLIP (Salesforce/blip-vqa-base). BLIP is a multimodal encoder–decoder model designed for both understanding and generation. We use the VQA base checkpoint in generative mode. For each image–question pair, the BLIP visual encoder and text encoder produce a fused representation that conditions an autoregressive decoder. The decoder generates an answer token by token up to a short maximum length; the decoded string is then normalised (lowercasing and whitespace cleanup) and treated as the prediction. BLIP is not constrained to a fixed answer list and can produce paraphrases or clinically equivalent variants of the reference answer.
MiniCPM-V-2 (openbmb/MiniCPM-V-2). MiniCPM-V-2 is a multimodal LLM that combines a SigLIP-based vision encoder with a MiniCPM language backbone. We use the official chat interface. Each sample is formatted as a single-turn “assistant” dialogue where the radiology image and question are passed together to the chat method, which returns a short natural-language answer. This answer is normalised in the same way as the BLIP outputs.
All models were run on GPU (NVIDIA A100 40 GB VRAM), set to evaluation mode, and executed with no_grad/inference-mode contexts. Random seeds were fixed at the start of each run to make iterative processes such as dataset shuffling and pre-processing as deterministic as possible. Further model-loading and processor details are given in Appendix A.1 and Appendix A.2.

3.2.3. Evaluation Protocol and Metrics

For each modality-specific subset (CT, MRI, and X-ray) and each model, the pipeline iterated over all samples, produced a normalised prediction y ^ i from the image and question, and paired it with the normalised reference answer y i , for  i = 1 , , N . We then computed the metrics below. It should be noted that all metrics were computed separately for each model and modality, forming a modality-aware zero-shot baseline for the fine-tuning and post hoc option-selection pipelines. The implementation details for metric computation are summarised in Appendix A.
Exact Match (EM).
$$\mathrm{EM} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i] \times 100,$$
where $\mathbb{1}[\cdot]$ denotes the indicator function [31].
Token-level F1. Let $T(\cdot)$ denote the set of tokens in a string. For each prediction–reference pair,
$$P_i = \frac{|T(\hat{y}_i) \cap T(y_i)|}{|T(\hat{y}_i)|}, \qquad R_i = \frac{|T(\hat{y}_i) \cap T(y_i)|}{|T(y_i)|},$$
and
$$F1_i = \begin{cases} \dfrac{2 P_i R_i}{P_i + R_i}, & P_i + R_i > 0, \\[4pt] 0, & \text{otherwise}. \end{cases}$$
We report the mean token-level F1, $\frac{1}{N} \sum_{i=1}^{N} F1_i \times 100$.
ROUGE-L. ROUGE-L measures the longest common subsequence overlap between y ^ i and y i , computed using the evaluate implementation [32,33].
WUPS and WUPS@0.9. Wu–Palmer similarity is computed per answer pair using WordNet synsets, yielding a score in $[0, 1]$. We report the mean WUPS and the proportion of examples with WUPS $\geq 0.9$ (denoted WUPS@0.9) [34].
BERTScore-F1. BERTScore-F1 is a semantic similarity score computed between the contextual embeddings of $\hat{y}_i$ and $y_i$ using a pre-trained transformer encoder [35]. We report the average BERTScore-F1 $\times 100$. Appendix A.5 provides the detailed implementation steps for evaluating the models.
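A minimal sketch of the two lexical metrics, following the definitions above, is given below; ROUGE-L and BERTScore are obtained from the evaluate library as noted in the text:

import evaluate

def exact_match(preds, refs):
    # EM: percentage of normalised predictions identical to their references.
    return 100.0 * sum(p == r for p, r in zip(preds, refs)) / len(refs)

def token_f1(pred, ref):
    # Token-level F1 over token *sets*, per the definition above.
    p, r = set(pred.split()), set(ref.split())
    overlap = len(p & r)
    if overlap == 0:  # covers the P_i + R_i = 0 case
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def mean_token_f1(preds, refs):
    return 100.0 * sum(token_f1(p, r) for p, r in zip(preds, refs)) / len(refs)

rouge = evaluate.load("rouge")
# rouge.compute(predictions=preds, references=refs)["rougeL"] yields ROUGE-L.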

3.3. Fine-Tuning Pipeline

Building directly on the benchmarking pipeline, we next introduced supervised fine-tuning on the harmonised, modality-specific SLAKE + OmniMedVQA subsets. The same curation, CT/MRI/X-ray separation, text normalisation, and evaluation metrics as in the zero-shot experiments were re-used (see Section 3.1 and Section 3.2 and Appendix A). The new elements are: (i) explicit train–test splits per modality, and (ii) model-specific optimisation strategies for ViLT, BLIP, and MiniCPM-V-2. Full implementation details are provided in Appendices B.1–B.4.

3.3.1. Train–Test Splits

After harmonisation, the corpus comprised 7783 image–question pairs across three radiology modalities (3571 CT, 1840 MRI, 2372 X-ray). Each row is a single image–question pair, and because SLAKE and OmniMedVQA contain more QA items than images, the same radiology image can appear in multiple pairs; we therefore treat each pair as an independent supervised example.
For each modality $m \in \{\text{CT}, \text{MRI}, \text{X-ray}\}$, we applied the stratified 80/20 train–test protocol with fixed seed described in Section 3.1. The split behaviour and absence of QA-level leakage are illustrated in Figure 1 and documented in detail in Appendix A.5.

3.3.2. ViLT: Multi-Label Classification Fine-Tuning

For ViLT, we trained separate classification heads for CT, MRI, and X-ray on modality-specific answer vocabularies.
Label space. From each training split, we constructed the set of unique, normalised answers $\mathcal{Y}_m$ and defined the mappings
$$\mathrm{label2id}: \mathcal{Y}_m \to \{0, \ldots, K_m - 1\}, \qquad \mathrm{id2label}: \{0, \ldots, K_m - 1\} \to \mathcal{Y}_m,$$
where $K_m = |\mathcal{Y}_m|$. These mappings were injected into the ViLT configuration via config.num_labels $= K_m$, config.id2label, and config.label2id. To avoid undefined labels at evaluation time, an optional filter (FILTER_OOD_TEST_ANSWERS = True) removed test examples whose answers were not present in the corresponding training label set. The fraction of removed test samples is 0.42% (CT), 1.09% (MRI), and 0.84% (X-ray).
Collator and loss. A custom VQACollator wrapped the ViLT processor, loading images and questions and constructing a multi-hot label tensor $Y \in \{0,1\}^{B \times K_m}$ for each batch, where $y_{ik} = 1$ if example $i$ has answer label $k$. This tensor was passed as labels to ViltForQuestionAnswering, which optimises a binary cross-entropy-with-logits objective over all labels [36]:
$$\mathcal{L}_{\mathrm{ViLT}} = -\frac{1}{B K_m} \sum_{i=1}^{B} \sum_{k=1}^{K_m} \Big[ y_{ik} \log \sigma(z_{ik}) + (1 - y_{ik}) \log\big(1 - \sigma(z_{ik})\big) \Big],$$
where $z_{ik}$ are the logits for example $i$ and label $k$, $\sigma(\cdot)$ denotes the sigmoid function, $B$ is the batch size, and $K_m$ is the number of distinct answer labels for modality $m$.
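The sketch below shows how such a label space and multi-hot collator can be built with the Transformers API; train_examples (the modality-specific training split in the unified schema) and the exact processor arguments are illustrative assumptions:

import torch
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# train_examples: list of unified-schema records for one modality (assumed available).
answers = sorted({ex["canonicalAnswer"] for ex in train_examples})
label2id = {a: i for i, a in enumerate(answers)}
id2label = {i: a for a, i in label2id.items()}

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained(
    "dandelin/vilt-b32-finetuned-vqa",
    num_labels=len(answers), id2label=id2label, label2id=label2id,
    ignore_mismatched_sizes=True,  # fresh classification head over the new label space
)

def vqa_collate(batch):
    images = [Image.open(ex["imagePath"]).convert("RGB") for ex in batch]
    questions = [ex["question"] for ex in batch]
    enc = processor(images, questions, padding=True, truncation=True, return_tensors="pt")
    labels = torch.zeros(len(batch), len(answers))
    for i, ex in enumerate(batch):
        labels[i, label2id[ex["canonicalAnswer"]]] = 1.0  # multi-hot target
    enc["labels"] = labels  # ViltForQuestionAnswering applies BCE-with-logits
    return enc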
Training configuration. For each modality, a fresh ViltForQuestionAnswering model was initialised from the VQAv2-fine-tuned checkpoint and trained end-to-end (without Parameter-Efficient Fine-Tuning (PEFT)). Further elaboration is presented in Appendix B.2.

3.3.3. BLIP: Generative Sequence-to-Sequence Fine-Tuning

BLIP was fine-tuned as an encoder–decoder model that autoregressively generates the ground-truth answer text. A custom BlipVQACollator used the BLIP processor to obtain encoder inputs from the image and question, and tokenised the target answer sequence $y_i = (y_{i,1}, \ldots, y_{i,T_i})$ with an end-of-sequence token. Padded positions were set to $-100$ in the label tensor so that they did not contribute to the loss. The model was optimised using the standard negative log-likelihood objective:
$$\mathcal{L}_{\mathrm{BLIP}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \log p_\theta\big(y_{i,t} \mid y_{i,<t}, x_i, q_i\big),$$
where $x_i$ denotes the image, $N$ is the batch size (the number of image–question–answer examples in the current batch), $q_i$ is the question, $y_{i,t}$ is the $t$-th token in the ground-truth answer sequence of length $T_i$, and $\theta$ denotes the BLIP parameters [13].
For each modality, a fresh BlipForConditionalGeneration model was initialised from the VQA-pretrained checkpoint and trained using the same high-level schedule as ViLT but in full precision (fp16 = False) to avoid numerical instabilities. At inference time, answers were generated with constrained decoding (maximum 16 new tokens, 3 beams, do_sample = False) before normalisation (see Appendix B.3).
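A sketch of such a collator is given below; masking padded label positions with $-100$ excludes them from the negative log-likelihood, and the field names follow the unified schema (assumptions, not the exact implementation):

import torch
from PIL import Image
from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")

def blip_vqa_collate(batch):
    images = [Image.open(ex["imagePath"]).convert("RGB") for ex in batch]
    questions = [ex["question"] for ex in batch]
    answers = [ex["canonicalAnswer"] for ex in batch]
    # Encoder inputs from image and question.
    enc = processor(images, questions, padding=True, truncation=True, return_tensors="pt")
    # Target answer tokens; the tokenizer appends its end-of-sequence marker.
    targets = processor.tokenizer(answers, padding=True, truncation=True, return_tensors="pt")
    labels = targets.input_ids.clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # mask padding from the loss
    enc["labels"] = labels
    return enc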

3.3.4. MiniCPM-V-2: LoRA-Based Instruction Fine-Tuning

For MiniCPM-V-2, we followed the official OpenBMB instruction-tuning recipe and applied parameter-efficient fine-tuning with LoRA on the language backbone.
Conversation-style SFT data. For each modality, a helper make_sft_json_for_split converted the Hugging Face dataset split into JSONL files expected by the MiniCPM training scripts. Each example was represented as a single-turn conversation consisting of: (1) an image field pointing to the radiology image on disk; and (2) a conversations list with two messages: a user message containing the question and an image placeholder, and an assistant message containing the normalised ground-truth answer. The resulting JSON files were validated for image-path existence and schema correctness (see Appendix B.4).
LoRA and training setup. A customised finetune.py script extended the official MiniCPM training loop to attach LoRA adapters to the self-attention projections of the language model. We used LoRA with rank $r = 8$, scaling $\alpha = 16$, dropout 0.05, and targeted modules matching the pattern llm.*.layers.<i>.self_attn.(q_proj|k_proj|v_proj).
The SigLIP-400M vision encoder and the base MiniCPM weights were kept frozen throughout training. The training hyperparameters were as follows (a configuration sketch is given after the list):
- Per-device train and evaluation batch size = 4;
- Gradient accumulation steps = 16 (effective batch size 64);
- Number of epochs = 10;
- Learning rate = $2 \times 10^{-4}$;
- fp16 = True;
- Evaluation and checkpointing at each epoch with save_total_limit = 1.
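The configuration below sketches these settings with the peft and transformers APIs; the target-module regex is reconstructed from the pattern above (an assumption), and argument names may vary slightly across library versions:

from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    # Regex reconstructed from the target-module pattern above (assumption).
    target_modules=r"llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj|v_proj)",
)
# model = get_peft_model(base_model, lora_config)  # vision encoder stays frozen

training_args = TrainingArguments(
    output_dir="minicpm-v2-lora",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=16,
    num_train_epochs=10,
    learning_rate=2e-4,
    fp16=True,
    eval_strategy="epoch",   # "evaluation_strategy" in older transformers releases
    save_strategy="epoch",
    save_total_limit=1,
)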
The supervised fine-tuning objective is defined as
$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{|A_i|} \sum_{t \in A_i} \log p_\theta\big(y_{i,t} \mid y_{i,<t}, v_i\big),$$
where $N$ denotes the number of training examples, $y_{i,1:T_i}$ is the tokenised full conversation sequence for example $i$ (including both the user prompt and the assistant response), and $A_i \subseteq \{1, \ldots, T_i\}$ indexes only the assistant answer tokens (i.e., positions not masked by the ignore index $-100$). The term $v_i$ represents the visual tokens extracted from the input image by the vision encoder and provided to the multimodal model alongside the text, and $p_\theta(\cdot)$ denotes the model’s autoregressive next-token distribution parameterised by $\theta$ [13].
After training, a separate inference script loaded the fine-tuned checkpoint (base model plus LoRA adapters) and generated answers for the modality-specific test JSON files using the MiniCPM chat interface (see Appendix B.4).

3.3.5. Inference and Metrics After Fine-Tuning

Inference and evaluation after fine-tuning were kept identical to the benchmarking stage to enable direct comparison across settings.
  • For ViLT, the model produced logits over the modality-specific label set. The predicted answer index was $\arg\max_k \sigma(z_{ik})$, which was mapped back to text via the corresponding id2label mapping.
  • For BLIP, answers were generated with the same constrained decoding configuration employed during training-time evaluation.
  • For MiniCPM-V-2, answers were produced through the chat interface conditioned on both the image and the question, and the resulting text was normalised using the same routine as in the benchmarking stage.
For each modality-specific test set, we computed Exact Match (EM), token-level F1, ROUGE-L, WUPS, WUPS@0.9, and BERTScore-F1, following the metric definitions in Section 3.2.3 and the implementations described in Appendix A.5.

3.4. Option Selection After the Fine-Tuning Pipeline

In the final stage of our study, we extended the supervised fine-tuning pipeline by recasting both datasets into a unified four-option multiple-choice format and adding a post hoc semantic option-selection layer on top of each model’s free-text prediction. The training objective remained unchanged: all models were still trained to generate the ground-truth answer text, while the multiple-choice structure was used only at evaluation time. These experiments allow us to compare each model’s performance before post hoc option selection, as provided by the fine-tuned pipelines, with its performance after post hoc option selection, whose detailed implementation steps are given in Appendices C.1–C.4.

3.4.1. Multiple-Choice Harmonisation of SLAKE and OmniMedVQA

Starting from the harmonised SLAKE + OmniMedVQA corpus and modality-specific subsets, we attached four candidate options (A–D) and a correct_option label to every question–answer pair.
OmniMedVQA options. For OmniMedVQA-Mini, which already provides an options field, the list of candidates was expanded into explicit columns option_A–option_D, truncating or padding to exactly four entries. A correct_option label was derived by normalising the ground-truth answer and comparing it with the normalised option texts. If the answer matched exactly one option after normalisation, the corresponding label (A–D) was stored; otherwise, correct_option was left undefined, and the example was excluded from MCQ-accuracy computation. This procedure preserved OmniMedVQA’s manually curated distractors while aligning them with our normalisation and scoring pipeline.
Synthetic options for SLAKE. SLAKE does not provide candidate options, so we constructed synthetic distractors using only the SLAKE internal answer distribution. Initially, a global answer pool was built by normalising all answers in the full SLAKE split and counting frequencies. From this distribution, we selected the (TOP_K_SYNTH_ANSWERS = 10) most frequent normalised answers and mapped each normal form to a single canonical surface string observed in the dataset, forming a high-frequency pool. In parallel, we constructed an extended pool containing all distinct normalised answer strings to ensure broader coverage.
For a sample with ground-truth answer $a$, we normalised it to $a_{\text{norm}}$ and invoked sample_unique_distractors($a_{\text{norm}}$, need_k = 3), which operates as follows. First, the high-frequency pool was shuffled and candidates whose normalised form differed from $a_{\text{norm}}$ and from each other were selected. Second, if fewer than three distractors were found, the procedure fell back to the extended pool, always enforcing uniqueness.
To assess the hardness of the synthetic distractors constructed for SLAKE, we measured the semantic similarity between each distractor and the ground-truth answer using BERTScore-F1, where higher similarity indicates greater confusability. Across approximately 6000 distractor instances, the similarity distribution has a mean and median of 0.56 (SD 0.14; IQR 0.48–0.67), with a pronounced upper tail (90th percentile 0.76, maximum 0.85) that reflects the presence of genuinely hard, semantically close alternatives. At the same time, lower-similarity cases (minimum 0.20) correspond to easier but still valid distractors drawn from the dataset’s answer vocabulary. Overall, this distribution shows that the synthetic options span a realistic range of difficulty and are neither trivial nor near-duplicate, yielding a balanced, non-degenerate MCQ setting suitable for post hoc calibration without artificially simplifying the task.
The three distractors were randomly shuffled, the correct answer string was inserted at a random position among four slots, and we asserted that all four normalised options were distinct. The final candidates were stored as option_A–option_D, and the position of the correct answer was encoded as correct_option $\in \{A, B, C, D\}$.
This procedure equipped every SLAKE item with four plausible, dataset-consistent options without relying on external knowledge or model-generated distractors (see Appendix C.1 for detailed implementation).
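A minimal sketch of this sampling routine is shown below; the pool variables correspond to the high-frequency and extended answer pools described above, and the helper names are illustrative:

import random

def sample_unique_distractors(answer_norm, high_freq_pool, extended_pool, need_k=3):
    # Draw distractors from the high-frequency pool first, then fall back
    # to the extended pool, enforcing uniqueness of normalised forms.
    distractors, seen = [], {answer_norm}
    for pool in (high_freq_pool, extended_pool):
        candidates = pool[:]
        random.shuffle(candidates)
        for cand in candidates:
            norm = " ".join(cand.lower().split())
            if norm not in seen:
                distractors.append(cand)
                seen.add(norm)
            if len(distractors) == need_k:
                return distractors
    return distractors  # may fall short only if the answer vocabulary is tiny

def build_options(answer, distractors):
    # Insert the correct answer at a random slot among the four options.
    options = distractors[:]
    options.insert(random.randrange(len(options) + 1), answer)
    assert len({" ".join(o.lower().split()) for o in options}) == len(options)
    return dict(zip("ABCD", options)), "ABCD"[options.index(answer)]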

3.4.2. Combined MCQ Dataset per Modality

For ViLT and BLIP, the augmented SLAKE and OmniMedVQA tables—containing image, question, answer, modality, dataset_id, option_A–option_D, and correct_option—were concatenated into a single dataset (combo). We then:
  • Filtered combo by modality (CT, MRI, X-ray), as in previous stages;
  • Applied the same stratified random split as in Section 3.2 (typically 80/20 train–test, with TRAIN_PERCENT = TEST_PERCENT = 1.0 for these runs);
  • Retained all option columns and the correct_option label in both training and test splits, even though options were not used by the loss during training.
The fine-tuning configuration (loss functions, optimisers, and learning-rate schedules) was identical to that of Section 3.3. The only change was that each example now carried a consistent four-option MCQ scaffold that was later used during evaluation (Appendix C.2).

3.4.3. Semantic Option Selection with BERTScore

After fine-tuning on the MCQ-augmented data, we introduced a post hoc decision layer that maps each model’s free-text prediction to the most semantically similar option.
For each test example, we had the model’s normalised free-text prediction $\hat{y}$, a set of non-empty options $\{o_1, \ldots, o_4\}$, and the dataset-provided correct_option label.
We computed BERTScore-F1 between $\hat{y}$ and each candidate option $o_j$ using a pre-trained bert-base-uncased encoder. Let $s_j = \mathrm{BERTScore\text{-}F1}(\hat{y}, o_j)$. BERTScore derives contextual embeddings for both strings, matches tokens by cosine similarity, and computes precision, recall, and F1 over these soft matches, providing a semantic similarity measure that is robust to paraphrasing.
If the BERTScore computation failed for any reason (for example, because the BERTScore model could not be loaded or executed in the present environment), we fell back to a lexical token-level F1 between the normalised prediction and option texts:
$$F1_{\mathrm{lex}}(p, t) = \frac{2\,|T(p) \cap T(t)|}{|T(p)| + |T(t)|},$$
where $T(\cdot)$ denotes the set of tokens in a string. The selected option was then
$$(\hat{\ell}, \hat{o}) = \arg\max_{j \in \{A, B, C, D\}} s_j,$$
returning both the option label $\hat{\ell}$ and its text $\hat{o}$. If no valid options were available (e.g., all options are empty), we retained the original free-text prediction and marked the MCQ label as undefined; in practice, no sample required this fallback. We also performed an error analysis for post hoc option selection by computing the BERTScore-F1 between each model’s raw free text and the selected option text. The corresponding plots are provided with the generic implementation of this logic in best_option_for_prediction (see Appendix C.3).
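A sketch of best_option_for_prediction under these assumptions (the bert-score package for BERTScore-F1, with the lexical fallback defined above) is as follows:

from bert_score import score as bert_score

def lexical_f1(pred, opt):
    # Fallback: token-level F1 between normalised prediction and option text.
    p, t = set(pred.split()), set(opt.split())
    return 2 * len(p & t) / (len(p) + len(t)) if (p or t) else 0.0

def best_option_for_prediction(prediction, options):
    # options: dict mapping labels "A".."D" to option texts (possibly empty).
    labels = [l for l, o in options.items() if o]
    if not labels:  # no valid options: keep free text, MCQ label undefined
        return None, prediction
    texts = [options[l] for l in labels]
    try:
        _, _, f1 = bert_score([prediction] * len(texts), texts,
                              model_type="bert-base-uncased")
        sims = f1.tolist()
    except Exception:  # BERTScore unavailable: lexical fallback
        sims = [lexical_f1(prediction, t) for t in texts]
    best = max(range(len(labels)), key=sims.__getitem__)
    return labels[best], texts[best]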

3.4.4. Integration with ViLT, BLIP, and MiniCPM-V-2

The option-selection layer is integrated uniformly across all three models, without modifying their respective training objectives.
ViLT (classification-style VQA). For each test example, a raw prediction was obtained by applying the fine-tuned classifier, taking the argmax over the output logits, and decoding the resulting index via id2label. This raw text prediction was passed to best_option_for_prediction together with the four stored options. We retained both the raw prediction and the selected option text. Evaluation metrics (EM, F1, ROUGE-L, WUPS, WUPS@0.9, and BERTScore-F1) were computed between the selected option text and the normalised ground-truth answer, while MCQ accuracy was obtained by comparing the selected label $\hat{\ell}$ with correct_option.
BLIP (generative VQA). A free-text answer was generated using the same constrained decoding configuration as in Section 3.2.3. The decoded and normalised string played the role of y ^ and was processed by the same best_option_for_prediction function. The resulting option texts and labels were then used for metric computation in exactly the same manner as for ViLT.
MiniCPM-V-2 (chat-style LVLM). The LoRA-based instruction fine-tuning remained unchanged. A dedicated evaluation script loaded the fine-tuned checkpoint, iterated over the modality-specific JSON files (containing image paths, conversation fields, options, and correct_option), extracted the image and question, and called the official model.chat interface to produce y ^ . The script then applied best_option_for_prediction to the stored options, recorded both the raw and selected predictions, and computed the semantic metrics (EM, F1, ROUGE-L, WUPS, WUPS@0.9, and BERTScore-F1) together with MCQ accuracy.
Overall, this pipeline preserves a strictly free-text learning objective during training, while evaluation is performed in a multiple-choice space via semantic snapping onto dataset-consistent options. This design enabled us to quantify the performance gains achievable through simple, post hoc calibration on top of fine-tuned medical VQA models.

4. Results

4.1. Baseline Results

In the zero-shot setting, all three models were evaluated directly on the harmonised SLAKE–OmniMedVQA subsets without any medical fine-tuning; see Table 1.
Across modalities, BLIP showed the strongest overall zero-shot performance. Its exact match accuracy was 24.6% on CT, 18.9% on MRI, and 23.2% on X-ray, slightly higher than ViLT (23.6% CT, 16.5% MRI, 18.6% X-ray). Token-level F1 followed the same pattern, with BLIP reaching ≈25% F1 on CT and ≈20–24% F1 on MRI/X-ray, whereas ViLT remained a few points lower on each modality. Both models exhibited only moderate lexical overlap, confirming that generic VQA pre-training alone is insufficient for reliable medical VQA.
Semantic metrics showed a similar trend. BLIP achieved WUPS scores of 64.9% (CT), 50.6% (MRI), and 65.6% (X-ray), and BERTScore-F1 of 72.4%, 65.6%, and 70.6%, respectively. ViLT’s WUPS and BERTScore were consistently lower (e.g., WUPS 55.1% CT, 46.1% MRI, 54.9% X-ray; BERTScore-F1 68.2%, 61.5%, 63.5%), indicating weaker semantic alignment to the gold answers even when predictions were partially correct.
In contrast, MiniCPM-V-2 performed poorly in strict lexical terms under pure zero-shot usage. Exact match was 0% on all three modalities, with F1 in the 2–3% range. Nonetheless, its semantic metrics showed reasonable performance: WUPS reached 70.5% (CT), 58.5% (MRI), and 67.9% (X-ray), and BERTScore-F1 was around 35–36%, suggesting that the model often produced medically related but lexically mismatched answers that failed the strict EM criterion.
Overall, the zero-shot results show that (i) BLIP transfers best among the three general-purpose models, (ii) ViLT retains some useful cross-domain signal but lags behind BLIP in both lexical and semantic alignment, and (iii) MiniCPM-V-2’s raw generations are semantically related but too unconstrained to score well under exact-match evaluation. These findings motivate the subsequent modality-specific fine-tuning and post hoc option-selection pipelines.

4.2. Supervised Fine-Tuning Results

After modality-specific fine-tuning on the harmonised SLAKE–OmniMedVQA subsets, our three models showed substantial gains over their zero-shot performance; see Table 2.

4.2.1. Overall Effect of Fine-Tuning

Exact-match accuracy on CT increased from ≈24% to ≈82–83% for BLIP and ViLT, and from 0% to ≈80% for MiniCPM. On MRI, accuracies rose into the mid-70s for BLIP and ViLT and the high 60s for MiniCPM. On X-ray, BLIP and ViLT reached ≈71–78% EM, while MiniCPM improved to ≈60%. All secondary metrics (F1/ROUGE-L/WUPS/BERTScore) followed the same trend, typically landing in the mid-70s to low 90s after fine-tuning.

4.2.2. Comparison Between Models

ViLT achieved the strongest strict performance overall. CT: EM of 82.7%, with all other metrics in the mid-80s to low 90s. MRI: EM of 76.0%, slightly higher than BLIP (73.6%) and MiniCPM (68.5%). X-ray: EM of 78.1%, again ahead of BLIP (71.3%) and MiniCPM (59.7%).
BLIP was competitive with ViLT, especially in semantic metrics. CT: EM of 81.7%, with BERTScore-F1 of 93.5%. MRI and X-ray: EM of 73.6% and 71.3%, with BERTScore-F1 around 91% across modalities, indicating strong semantic alignment even when exact strings differ.
MiniCPM-V2 (OpenBMB) benefited markedly from fine-tuning but remained behind the VQA-specialised models. CT: EM of 80.0%, with secondary metrics mostly in the low to high 80s and BERTScore-F1 of 93.0%. MRI: EM of 68.5%; X-ray: EM of 59.7%, with corresponding semantic scores below those of BLIP and ViLT.

4.2.3. Modality Trends

  • CT emerged as the easiest modality after fine-tuning, with all models achieving their highest accuracies and BERTScore values there.
  • MRI remained the most challenging: while ViLT and BLIP both exceeded 73% EM, their other metrics dipped compared with CT, and MiniCPM lagged further behind.
  • X-ray performance was intermediate, but again consistently higher for ViLT and BLIP than for MiniCPM.
In summary, modality-specific supervised fine-tuning transforms all three backbones from marginal zero-shot baselines into reasonably strong medical VQA models, with ViLT leading on strict accuracy, BLIP excelling in semantic similarity, and MiniCPM showing solid but clearly lower performance across all modalities.

4.3. Fine-Tuning with Post-Hoc Option Selection

Finally, we retained the fine-tuned models from Section 4.2 but replaced raw free-text evaluation with a post-hoc option-selection step: each generated answer was mapped to the most semantically similar multiple-choice option before scoring. This substantially increased reliability for all three backbones, especially the generative ones; see Table 3.
ViLT (dandelin/vilt-b32-finetuned-vqa) delivered very stable performance across the CT, MRI, and X-ray data.
Compared with the plain fine-tuning stage, option selection consolidated the ViLT predictions and lifted it into the ∼92–93% EM and F1 range on all three modalities, with consistently high BERTScore-F1 (>96%).
BLIP (Salesforce/blip-vqa-base) benefited strongly from option selection, converting its semantically rich generations into reliable discrete answers.
BLIP now matches or slightly exceeds ViLT on X-ray (93.25% EM vs. 92.34%) and is only marginally behind on CT and MRI, with very high BERTScore-F1 across all modalities.
MiniCPM-V2 (OpenBMB) showed clear improvements under the same option-selection regime, but still lagged behind the two VQA-specialised models.
In sum, the option-selection pipeline shows that, once models are fine-tuned on modality-specific data, constraining their outputs to the multiple-choice answer set via semantic matching can raise both Exact Match and F1 above 90% for ViLT and BLIP, while also improving WUPS and BERTScore-F1. MiniCPM-V2 also benefits from this approach but remains behind the VQA-specialised backbones across all metrics and modalities.

4.4. Stratified Performance Analysis by Answer Type

We further report a per-model stratification by answer type, with results weighted across CT, MRI, and X-ray. For BLIP, option selection improved accuracy from 82.39→94.53 on yes/no (n = 494), 71.43→92.00 on number (n = 56), 74.10→88.78 on anatomy-term-like (n = 722), and 44.36→89.09 on other/free-text answers (n = 275), where n is the per–answer-type test set size used for the comparison. MiniCPM showed similar gains: 75.30→92.11 (yes/no), 62.50→91.07 (number), 55.96→72.16 (anatomy-term-like), and 39.27→72.73 (other). For ViLT, gains were smaller where raw exact match was already high—86.44→96.36 (yes/no) and 85.10→92.34 (anatomy-term-like)—but remained substantial for number (80.36→93.00) and other/free-text (56.18→84.27) answers. Overall, the improvements were not limited to short categorical responses; the largest gains frequently occurred for free-text answers, consistent with the post-hoc option selector calibrating semantically correct but non-identical predictions into the discrete option space.

5. Discussion

This study set out to examine how state-of-the-art general-purpose vision–language models behave on radiology-focused medical VQA when (i) used off-the-shelf, (ii) fine-tuned on modality-specific CT, MRI and X-ray subsets derived from SLAKE and OmniMedVQA, and (iii) coupled with a post-hoc option-selection layer. By harmonising two complementary Med-VQA datasets and systematically evaluating ViLT, BLIP and MiniCPM-V-2, we provide a controlled picture of how far generic VQA backbones can be pushed towards reliable performance in this medical setting. Our discussion is centred on answering the research questions posed in Section 1.3.

5.1. How Well Do General-Purpose VLMs Perform in a Strict Zero-Shot Setting?

In the zero-shot regime, all three models performed poorly in strict terms, with exact-match accuracies in the low-20% range for ViLT and BLIP and essentially 0% for MiniCPM-V-2, despite moderate semantic-similarity (WUPS and BERTScore) values. This confirms that simply deploying large VQA or multimodal LLMs “as is” on medical images is inadequate. The OmniMedVQA authors similarly report that both general-domain LVLMs and several medical-specialised LVLMs struggle on their benchmark, often underperforming expectations from natural-image VQA [6]. Our zero-shot results are consistent with that broader observation: pre-training on large natural-image corpora provides only limited semantic signal and is insufficient to meet even moderate accuracy requirements in radiology VQA.
Across models, BLIP was the strongest zero-shot baseline, slightly outperforming ViLT on all three modalities in both exact-match and semantic metrics. This aligns with the BLIP design as a unified understanding-and-generation framework that bootstraps high-quality supervision from noisy web image–text pairs and achieves state-of-the-art natural-image VQA scores. ViLT, in contrast, is a minimalist monolithic transformer that trades some accuracy for architectural simplicity and speed. MiniCPM-V-2 produced semantically related but highly unconstrained answers, reflected in reasonable WUPS and BERTScore scores but near-zero exact matches. Together, these findings reinforce that domain shift, specialised terminology, and a structured answer space remain major obstacles for off-the-shelf deployment of general VLMs in medical VQA.

5.2. Can Supervised Fine-Tuning on Modality-Specific Training Subsets Reduce the Performance Gap Between General-Purpose VLMs and Specialised Ones?

Introducing supervised fine-tuning on the harmonised SLAKE+OmniMedVQA splits dramatically changed this picture. All three backbones showed large gains, but the magnitude and character of the improvement depended on the underlying architecture and training objective.
ViLT benefited directly from fine-tuning. Because it is already trained as a multi-class classifier on VQAv2, re-training its classification head and encoder on the modality-specific medical label spaces lifted exact-match accuracy into the ≈80% range across CT, MRI and X-ray, with F1, ROUGE-L, WUPS and BERTScore all in the high-70s to low-90s. This suggests that, given a well-curated medical label space and sufficient supervised data, a relatively compact, task-aligned VQA backbone can adapt effectively to radiology.
BLIP also improved substantially after fine-tuning but retained a characteristic gap between semantic and lexical metrics: BERTScore-F1 and WUPS were high, yet EM and F1 remained noticeably lower than for ViLT. This behaviour is consistent with the generative nature of BLIP. It is pre-trained for both understanding and generation and tends to produce fluent paraphrases rather than exact strings. In our results, the BLIP answers often captured the correct clinical meaning but failed strict string equality, which is penalised by EM.
MiniCPM-V-2, despite being a modern multimodal LLM, continued to lag behind the specialised VQA backbones after LoRA-based instruction tuning. It did achieve solid improvements compared with its zero-shot baseline, especially on CT, but its EM and F1 scores on MRI and X-ray remained clearly lower than those of ViLT and BLIP. This suggests that scale and generality alone do not guarantee strong performance on Med-VQA subsets; specialised pre-training objectives and architectures still matter.
Our metrics show that, even after fine-tuning, MRI questions yield lower WUPS@0.9 and F1 than CT and X-ray questions for all three backbones. This systematic gap indicates that modality-specific factors, not just overall dataset size, shape how well general VLMs can be adapted to radiology VQA. Box 1 summarises the key take-away from these results.
Box 1. Key finding on modality complexity using SLAKE + OmniMedVQA data.
Across all models and pipelines, a stable modality pattern emerged: CT was consistently the easiest setting, X-ray intermediate, and MRI the most challenging.

5.3. Can a Simple Post-Hoc Answer-Selection Strategy Further Improve Reliability on Benchmarks with Predefined Answer Sets?

The third pipeline shows that a large part of the residual error after fine-tuning is due to mismatches between free-form generation and the discrete answer space. By reframing evaluation as a multiple-choice decision problem and selecting the option that is semantically closest (via BERTScore) to the model’s generated answer, both ViLT and BLIP moved into a near-ceiling regime: exact-match and F1 exceeded 90% on all three modalities, and BERTScore-F1 approached or exceeded 96%.
For ViLT, option selection consolidates an already strong classifier, raising EM from ≈80% to >92% and improving semantic metrics further. For BLIP, the effect is even more pronounced. While fine-tuned BLIP alone lags ViLT in strict EM, the option-selection layer effectively converts its high-quality semantic predictions into reliable discrete choices, yielding performance comparable to—or slightly better than—ViLT on X-ray and only marginally lower on CT and MRI. These observations are in line with the BLIP original design goals of flexible transfer to both understanding and generation tasks.
MiniCPM-V-2 also benefits from option selection, with clear gains in all metrics, particularly on CT and X-ray. However, its scores remain noticeably below those of ViLT and BLIP. This indicates that while post hoc calibration is broadly helpful, it cannot fully compensate for weaker task adaptation in the underlying model.
Importantly, the option-selection pipeline does not change the training objective: the models are still fine-tuned on open-ended answers. The improvements arise purely from re-projecting predictions into the dataset’s answer space at evaluation time. This suggests that similar post hoc calibration layers—using semantic matching, cross-encoders, or retrieval-augmented scoring—may be a practical way to increase the reliability of generative Med-VQA systems without modifying the base training loop.

5.4. Relation to Prior Med-VQA Literature

Our findings sit alongside a growing body of work on medical VQA benchmarks. SLAKE [5] introduced semantically labelled, knowledge-enhanced questions spanning multiple modalities and anatomical regions, explicitly to support robust Med-VQA evaluation. OmniMedVQA extended this idea to a far larger scale and demonstrated that many existing LVLMs—both general and medical-specialised—struggle on realistic, multimodal clinical data. Other datasets such as PMC-VQA likewise report that out-of-the-box VLMs achieve modest scores and require careful adaptation [37].
Against this backdrop, our results show that, on harmonised CT, MRI, and X-ray subsets drawn from SLAKE and OmniMedVQA, relatively lightweight adaptations—modality-specific fine-tuning and post hoc option selection—are sufficient to push ViLT and BLIP into very high accuracy ranges. This does not contradict the earlier benchmarks. Our study focuses on three modalities and a merged dataset with unified answer processing, whereas OmniMedVQA evaluates across 12 modalities and a much broader variety of question types [6]. Rather, it illustrates that when the problem is carefully scoped and curation is controlled, general-purpose VQA backbones can become competitive Med-VQA solvers.

6. Conclusions and Future Directions

In this work, we systematically evaluated three general-purpose vision–language models—ViLT, BLIP, and MiniCPM-V-2—on harmonised Med-VQA subsets derived from SLAKE and OmniMedVQA-Mini, focusing explicitly on CT, MRI, and X-ray questions. Across all experiments, our findings were consistent. First, zero-shot performance was inadequate for medical use, with exact match confined to ≈19–25% for BLIP, ≈16–24% for ViLT, and 0% for MiniCPM-V-2 despite moderate semantic similarity scores. Second, modality-specific supervised fine-tuning was essential: ViLT reached 78–83% exact match across modalities, BLIP improved to 71–82%, and MiniCPM-V-2 to 60–80%, with CT being consistently easier than MRI and X-ray. Finally, adding a simple post hoc option-selection layer that maps free-text predictions to multiple-choice answers yielded the strongest results: ViLT and BLIP achieved ≈88–93% exact match, F1, and ROUGE-L on all three modalities, while MiniCPM-V-2 reached 74–81% exact match with BERTScore-F1 up to ≈93%. Taken together, these results show that carefully curated, modality-aware supervision combined with lightweight semantic calibration can transform generic VQA backbones into highly accurate solvers on radiology-focused Med-VQA benchmarks, while also clarifying the remaining performance gap for current multimodal LLMs.
This work has several practical limitations that suggest clear directions for future research. First, all experiments were constrained due to limited GPU resources, which restricted batch size, training duration, hyperparameter exploration and, in the case of MiniCPM-V-2, the proportion of data used for fine-tuning. Running the same pipelines on more powerful single or multi-GPU setups could enable deeper optimisation and potentially stronger performance. Second, we focused on two Med-VQA datasets (SLAKE and OmniMedVQA-Mini) and three imaging modalities (CT, MRI, and X-ray), so extending the current harmonisation and option-generation pipeline to additional datasets and modalities (e.g., ultrasound, histopathology, dermoscopy, and larger hospital cohorts) would be important to assess generalisability. Third, we evaluated only medium-scale backbones (ViLT-B/32, BLIP-VQA-base, and MiniCPM-V-2). Future work should explore larger, more powerful multimodal models—and medical-specialised LVLMs—within the same modality-aware, post hoc option-selection framework to determine how much additional benefit can be gained from increased parameter counts and more advanced architectures.

Author Contributions

Conceptualisation and methodology by both authors; software and data curation, M.H.S.; validation and formal analysis, both authors; writing of first draft of manuscript, M.H.S.; visualisation of results, both authors; review and editing, H.C.; supervision, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets we used in this paper are publicly available; see Section 3.1 and the following GitHub link: https://github.com/Haseeb-CS/Systematic-Analysis-of-Vision-Language-Models-for-Medical-Visual-Question-Answering.git, accessed on 29 January 2026.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Environment, Dataset Curation and VLM Benchmarking

Appendix A.1. Computational Environment

Table A1. Computational environment and software stack used for benchmarking experiments.
Component | Description
Programming language | Python
Deep learning framework | PyTorch (tensor operations, GPU-accelerated inference, model execution)
Model library | Hugging Face transformers
Vision–language models | ViLT (dandelin/vilt-b32-finetuned-vqa), BLIP (Salesforce/blip-vqa-base), MiniCPM-V-2 (openbmb/MiniCPM-V-2)
Dataset interface | Hugging Face datasets (loading SLAKE and OmniMedVQA-Mini, filtering, mapping, concatenation)
Image processing | Pillow (PIL) for image loading and basic pre-processing (e.g., RGB conversion)
Evaluation metrics | evaluate (ROUGE-L, BERTScore); nltk.corpus.wordnet (Wu–Palmer similarity)
Dataset acquisition | snapshot_download used to obtain immutable local copies of SLAKE JSON annotations and image archives
Execution mode | Models moved to GPU via .to(device), set to evaluation mode with .eval(), and run inside torch.no_grad() or torch.inference_mode()
Reproducibility | Global random seeds initialised for NumPy and PyTorch at the start of each run
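As a concrete illustration of the execution-mode and reproducibility rows above, the following minimal Python sketch shows the setup we assume; the seed value and the set_seed helper name are illustrative rather than taken verbatim from the released notebooks.

import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    # Initialise global random seeds for Python, NumPy, and PyTorch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"
# Models are then moved to the device and frozen for inference, e.g.,
# model.to(device).eval(), with forward passes inside torch.no_grad().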

Appendix A.2. Dataset Curation and Harmonisation

SLAKE annotations were loaded from JSON, and only entries with q_lang == “en” were retained. For each record we extracted:
  • img_name (image filename);
  • question (text);
  • answer (reference answer);
  • modality (raw modality label).
Images were unpacked from imgs.zip into a local directory. A verification loop attempted to open each image with PIL.Image.open; records with missing or unreadable files were discarded.
OmniMedVQA-Mini. The test split of simwit/omni-med-vqa-mini was loaded using datasets.load_dataset. For benchmarking, we retained:
  • The image field (tensor or path, depending on configuration);
  • question (text);
  • gt_answer (canonical answer);
  • modality (raw modality string).
OmniMedVQA-Mini’s multiple-choice options were preserved but not used in the zero-shot evaluation.
Schema alignment and modality mapping.
Both datasets were mapped to a common schema: image_path, question, answer, modality, and dataset_id ∈ {slake, omni}. For SLAKE, the answer field was copied directly; for OmniMedVQA-Mini, gt_answer was used. A mapping table converted raw modality strings into one of three canonical labels: CT, MRI, or X-ray. Records whose modality did not map to these labels were discarded.
The harmonised SLAKE and OmniMedVQA-Mini tables were concatenated, and three subsets were constructed by filtering on CT, MRI, and X-ray. A final quality-control pass again attempted to open each image_path with PIL, removing any records with corrupted or inaccessible images.
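The harmonisation logic can be summarised by the following hedged Python sketch. The MODALITY_MAP entries, the harmonise_record and image_is_readable helper names, and the assumption that SLAKE records already carry an image_path field are illustrative, not a verbatim excerpt from the released code.

from PIL import Image

# Illustrative modality mapping (the full table lives in the released code).
MODALITY_MAP = {
    "ct": "CT", "computed tomography": "CT",
    "mri": "MRI", "t1": "MRI", "t2": "MRI",
    "x-ray": "X-ray", "xray": "X-ray", "chest x-ray": "X-ray",
}

def harmonise_record(rec, dataset_id):
    # Map a raw SLAKE/OmniMedVQA-Mini record onto the common schema, or drop it.
    answer = rec["answer"] if dataset_id == "slake" else rec["gt_answer"]
    modality = MODALITY_MAP.get(str(rec["modality"]).strip().lower())
    if modality is None:  # unmapped modality -> discard
        return None
    return {"image_path": rec["image_path"], "question": rec["question"],
            "answer": answer, "modality": modality, "dataset_id": dataset_id}

def image_is_readable(path):
    # Final quality-control pass: keep only records whose image opens with PIL.
    try:
        with Image.open(path) as img:
            img.convert("RGB")
        return True
    except (OSError, FileNotFoundError):
        return False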

Appendix A.3. Fairness of Modality Splits

Figure A1 visualises the relationship between image IDs and image–question (QA) pairs for the CT, MRI, and X-ray modalities after the stratified train–test split. Each point corresponds to a single QA pair, with the x-axis indexing QA pairs and the y-axis indicating the associated image ID; blue and orange points denote training and test samples, respectively. Horizontal bands represent multiple distinct questions linked to the same underlying image. The presence of both colours within some bands indicates that different questions referring to the same image may appear in both splits. Importantly, no individual QA pair appears in both training and test sets, ensuring that all evaluations are performed on held-out question–answer pairs. This analysis confirms that the protocol avoids direct duplication or label leakage at the QA level, while explicitly characterising the controlled sharing of image content across splits.
Figure A1. Scatter plots for different imaging modality subsets. (a) CT subset; (b) MRI subset; (c) X-ray subset.

Appendix A.4. Model-Specific Inference

Algorithm A1 Model-specific inference for zero-shot benchmarking (ViLT, BLIP, and MiniCPM-V-2)
Require: Dataset D with rows (img_path, q, y); device cuda or cpu
Require: Text normaliser NormalizeText(·)
Ensure: Lists of normalised predictions {ŷ_i} and normalised references {y_i}
 1: device ← “cuda” if available else “cpu”
    (A) ViLT inference (classification-style)
 2: VILT_PROC ← ViltProcessor.from_pretrained(“dandelin/vilt-b32-finetuned-vqa”)
 3: VILT ← ViltForQuestionAnswering.from_pretrained(“dandelin/vilt-b32-finetuned-vqa”)
    (B) BLIP inference (generative)
 4: BLIP_PROC ← BlipProcessor.from_pretrained(“Salesforce/blip-vqa-base”)
 5: BLIP ← BlipForConditionalGeneration.from_pretrained(“Salesforce/blip-vqa-base”)
    (C) MiniCPM-V-2 inference (chat-style LVLM)
 6: Load openbmb/MiniCPM-V-2 with trust_remote_code = True using the official OpenBMB API
 7: Move the models to device; set evaluation mode
 8: Initialise empty lists: PREDS ← [], REFS ← []
 9: for each row (img_path, q, y) in D do
10:   IMG ← LoadImageAsRGB(img_path) ▹ PIL load + RGB conversion if required
11:   y_norm ← NormalizeText(y) ▹ e.g., “Pleural effusion.” → “pleural effusion” (lowercase + trim + spacing/punctuation cleanup)
12:   Append y_norm to REFS
      ViLT prediction
13:   ENC ← VILT_PROC(IMG, q, padding = True, truncation = True, return_tensors = “pt”)
14:   with torch.no_grad() do
15:     OUT ← VILT(ENC) ▹ logits indexed by k over the answer vocabulary
16:     k* ← argmax_k OUT.logits[k] ▹ e.g., if the logits over candidate answers (no, yes, left) are [0.1, 2.3, 0.4], then k* = 1 (“yes”)
17:     ŷ_vilt ← VILT.config.id2label[k*]
18:     ŷ_norm_vilt ← NormalizeText(ŷ_vilt)
      BLIP prediction
19:   ENC ← BLIP_PROC(IMG, q, padding = True, truncation = True, return_tensors = “pt”)
20:   with torch.no_grad() do
21:     GEN ← BLIP.generate(ENC, max_new_tokens = 16, num_beams = 3, do_sample = False)
22:     ŷ_blip ← Decode(GEN) ▹ token IDs → text string
23:     ŷ_norm_blip ← NormalizeText(ŷ_blip)
      MiniCPM-V-2 prediction
24:   Construct a single-turn user message containing q and attach IMG as input
25:   with torch.no_grad() do
26:     ŷ_mcp ← MiniCPMChat(IMG, q) ▹ official model.chat(…) interface
27:     ŷ_norm_mcp ← NormalizeText(ŷ_mcp)
      Store predictions (choose the model being benchmarked)
28:   Append the relevant ŷ_norm (ViLT/BLIP/MiniCPM) to PREDS
29: end for
30: return PREDS, REFS
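For concreteness, the BLIP branch of Algorithm A1 can be written as the short, runnable Python sketch below. Note one deliberate deviation: we load the checkpoint with BlipForQuestionAnswering, the Transformers class conventionally paired with Salesforce/blip-vqa-base, whereas the notebooks referenced above use BlipForConditionalGeneration; treat this as an illustrative variant rather than the exact implementation.

import re

import torch
from PIL import Image
from transformers import BlipForQuestionAnswering, BlipProcessor

def normalize_text(s: str) -> str:
    # Lowercase, trim, and collapse whitespace, mirroring NormalizeText(·).
    return re.sub(r"\s+", " ", s.lower().strip()).strip(" .")

device = "cuda" if torch.cuda.is_available() else "cpu"
proc = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
model.to(device).eval()

@torch.no_grad()
def blip_answer(img_path: str, question: str) -> str:
    image = Image.open(img_path).convert("RGB")
    enc = proc(image, question, return_tensors="pt").to(device)
    gen = model.generate(**enc, max_new_tokens=16, num_beams=3, do_sample=False)
    return normalize_text(proc.decode(gen[0], skip_special_tokens=True))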

Appendix A.5. Metric Computation

All answers were normalised by lowercasing, stripping leading/trailing whitespace, and collapsing multiple spaces. Exact Match and token-level F1 were computed using the formulas given in Section 3.2.3.
ROUGE-L was computed using the evaluate library, and the ROUGE-L F1 component was extracted. WUPS and WUPS@0.9 were implemented via WordNet synsets and Wu–Palmer similarity, reporting the mean score and the proportion of samples with WUPS ≥ 0.9. BERTScore-F1 was computed using the evaluate wrapper with an English base encoder (e.g., bert-base-uncased), and the mean F1 × 100 was reported per modality and model. We adopted Exact Match, token-level F1, ROUGE-L, WUPS/WUPS@0.9, and BERTScore-F1 because they are well-established metrics in the VQA and text-generation literature and together provide complementary views of model performance. EM and token-level F1 capture strict and partial lexical correctness, while ROUGE-L, WUPS, and BERTScore-F1 quantify graded semantic similarity and synonymy. Using this combination allows us to evaluate both exact clinical phrasing and semantically correct but lexically varied answers in a principled and reproducible manner.
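As a hedged sketch of the lexical metrics (the exact formulas are given in Section 3.2.3), Exact Match and token-level F1 over normalised strings can be computed as follows; this is one standard formulation, not a verbatim excerpt from our code.

import re
from collections import Counter

def normalize_text(s: str) -> str:
    return re.sub(r"\s+", " ", s.lower().strip())

def exact_match(pred: str, ref: str) -> float:
    return float(normalize_text(pred) == normalize_text(ref))

def token_f1(pred: str, ref: str) -> float:
    p, t = normalize_text(pred).split(), normalize_text(ref).split()
    if not p or not t:
        return float(p == t)
    overlap = sum((Counter(p) & Counter(t)).values())  # shared tokens (multiset)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(t)
    return 2 * precision * recall / (precision + recall)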
This appendix fully specifies the zero-shot benchmarking implementation and mirrors the logic of the evaluation notebooks used for the main experiments.

Appendix B. VLM Fine-Tuning

Appendix B.1. Overview of the Fine-Tuning Protocol

For each radiology modality (CT, MRI, X-ray), the merged SLAKE+OmniMedVQA subset was partitioned into training and test splits using a stratified random shuffle at the Hugging Face datasets level. The notebooks implement a function random_split(ds, test_size=TEST_SIZE, seed=SPLIT_SEED), which shuffles indices under a fixed NumPy random state and assigns a proportion TEST_SIZE = 0.2 to the test set, leaving 80% of samples for training. This split is performed independently for CT, MRI, and X-ray, yielding three disjoint pairs (train_m, test_m) for each modality m.
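A minimal sketch of this utility, assuming a Hugging Face datasets.Dataset input (the seed value shown is illustrative; the notebooks fix their own SPLIT_SEED):

import numpy as np
from datasets import Dataset

TEST_SIZE = 0.2
SPLIT_SEED = 42  # illustrative value

def random_split(ds: Dataset, test_size: float = TEST_SIZE, seed: int = SPLIT_SEED):
    # Shuffle indices under a fixed NumPy random state and split 80/20.
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(ds))
    n_test = int(round(test_size * len(ds)))
    test_ds = ds.select(idx[:n_test].tolist())
    train_ds = ds.select(idx[n_test:].tolist())
    return train_ds, test_ds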
Fine-tuning is then carried out separately for each model and modality:
  • ViLT: three multi-label classification models, one per modality, using ViltForQuestionAnswering;
  • BLIP: three generative VQA models, one per modality, using BlipForConditionalGeneration;
  • MiniCPM-V-2: three LoRA-adapted instruction-tuned models, one per modality, using the official MiniCPM-V training scripts with base model openbmb/MiniCPM-V-2 and a SigLIP-400M visual encoder.
Across all pipelines, PyTorch handles back-propagation and optimisation, Hugging Face Transformers provide model abstractions and training utilities, and the same evaluation stack (evaluate, nltk.wordnet, BERTScore) as in the benchmarking stage is reused.

Appendix B.2. ViLT: Multi-Label Classification Fine-Tuning

Label space construction.
For each modality, the notebooks define a helper build_vocab(train_ds, test_ds, source = “train”), which normalises each ground-truth answer using the same normalize_text function as in benchmarking (lowercasing, stripping leading and trailing whitespace, and collapsing multiple spaces). Unique answer strings are collected over the training split only, sorted, and mapped to integer identifiers via label2id and id2label. The ViLT configuration is updated as:
  • config.num_labels = len(label2id);
  • config.id2label and config.label2id set to the constructed dictionaries.
To avoid undefined labels at evaluation, the code optionally enables FILTER_OOD_TEST_ANSWERS = True and filters the test split so that only examples whose normalised answers appear in label2id are retained.
We evaluated ViLT as a closed-set classifier with a modality-specific answer vocabulary derived from the training split. Consequently, test instances whose ground-truth answers do not appear in this vocabulary fall outside the model's supported output space and cannot be evaluated in a well-defined manner. To ensure a valid and fair assessment, we exclude such out-of-vocabulary cases from the ViLT evaluation. The proportion of removed samples is minimal (0.42% for CT, 1.09% for MRI, and 0.84% for X-ray), affecting roughly 1% or fewer of the test instances per modality. This filtering prevents undefined scoring while having a negligible impact on the overall evaluation. In contrast, the generative models (BLIP and MiniCPM-V-2) are not constrained by a fixed answer vocabulary and are therefore evaluated on the full test sets without any filtering.
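The label-space construction described above can be sketched as follows; for self-containedness, build_vocab here operates on a plain list of answer strings, whereas the notebook version takes dataset objects and a source argument.

import re

def normalize_text(s: str) -> str:
    return re.sub(r"\s+", " ", s.lower().strip())

def build_vocab(train_answers):
    # Unique, sorted, normalised training answers -> label2id / id2label.
    answers = sorted({normalize_text(a) for a in train_answers})
    label2id = {a: i for i, a in enumerate(answers)}
    id2label = {i: a for a, i in label2id.items()}
    return label2id, id2label

label2id, id2label = build_vocab(["Yes", "No", "Lung", " yes "])
# With FILTER_OOD_TEST_ANSWERS = True, a test example is retained only if
# normalize_text(example["answer"]) in label2id.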
Input collator and labels.
The VQACollator wraps the ViLT processor. For each batch it loads radiology images from disk, collects the free-text questions, and calls the ViLT processor with padding=True and truncation=True to obtain input_ids, attention_mask, and pixel_values. A multi-hot label matrix of shape (B, K) is then constructed, where K = |label2id|. For each example, the entry corresponding to the correct answer label is set to 1.0, with all others set to 0.0. This dense floating-point label tensor is attached to the batch as enc[“labels”], activating the multi-label classification regime in ViltForQuestionAnswering, which internally applies BCEWithLogitsLoss.
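A hedged sketch of this collator is given below; it assumes the normalize_text helper defined earlier and in-vocabulary answers, and it omits the error handling present in the notebooks.

import torch
from PIL import Image

class VQACollator:
    # Wraps the ViLT processor and builds multi-hot labels of shape (B, K).
    def __init__(self, processor, label2id):
        self.processor = processor
        self.label2id = label2id

    def __call__(self, batch):
        images = [Image.open(ex["image_path"]).convert("RGB") for ex in batch]
        questions = [ex["question"] for ex in batch]
        enc = self.processor(images, questions, padding=True,
                             truncation=True, return_tensors="pt")
        labels = torch.zeros(len(batch), len(self.label2id))
        for i, ex in enumerate(batch):
            labels[i, self.label2id[normalize_text(ex["answer"])]] = 1.0
        enc["labels"] = labels  # dense float labels trigger BCEWithLogitsLoss
        return enc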
Optimisation and schedule.
For each modality, a fresh ViltForQuestionAnswering model is initialised from dandelin/vilt-b32-finetuned-vqa and configured with the modality-specific label mapping. Training is executed via the Hugging Face Trainer using VQACollator as the data collator. No PEFT is used; the ViLT encoder and classification head are fully fine-tuned. The complete set of training hyperparameters is reported in Table 1 (left column).

Appendix B.3. BLIP: Generative VQA Fine-Tuning

Supervision format and collator.
Unlike ViLT, BLIP does not rely on a fixed answer vocabulary. The notebooks define a BlipVQACollator that loads images and questions as in benchmarking, calls the BLIP processor with return_tensors=“pt”, padding=True, and truncation=True, and tokenises ground-truth answers using the same tokenizer. An end-of-sequence token is appended, sequences are padded to a uniform length, and padded positions are set to −100 so that they do not contribute to the loss. Under this setup, BLIP is trained using the standard autoregressive negative log-likelihood over answer tokens (Section 3.2.3).
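The following sketch illustrates this supervision format; build_blip_batch is a hypothetical helper (the notebooks wrap the same logic in the BlipVQACollator class), and special-token handling is left to the tokenizer.

def build_blip_batch(processor, images, questions, answers):
    # Encode image-question pairs and tokenise the ground-truth answers.
    enc = processor(images, questions, padding=True,
                    truncation=True, return_tensors="pt")
    tok = processor.tokenizer(answers, padding=True,
                              truncation=True, return_tensors="pt")
    labels = tok.input_ids.clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding
    enc["labels"] = labels
    return enc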
Training configuration.
For each modality, a fresh BlipForConditionalGeneration model is loaded from VQA-pretrained weights. The TrainingArguments mirror those of ViLT, with the exception that training is performed in full precision (fp16 = False) due to numerical instabilities observed with half-precision on some GPUs. Training uses the standard Trainer with BlipVQACollator as the data collator. At inference time, answers are generated using constrained decoding (max_new_tokens = 16, num_beams = 3, do_sample = False), followed by decoding and normalisation.

Appendix B.4. MiniCPM-V-2: LoRA-Based Instruction Fine-Tuning

Conversation-style SFT data.
Each modality’s training and test splits are converted into JSONL files compatible with the official MiniCPM SFT scripts using make_sft_json_for_split(split, modality). For each example, the output includes an “image” field pointing to the image file on disk and a “conversations” list containing a user message with the question and image placeholder, followed by an assistant message with the normalised ground-truth answer. A helper validate_minicpm_json verifies image-path existence and schema correctness.
LoRA configuration and training script.
The notebooks define a modified finetune.py that wraps the official MiniCPM training loop and introduces LoraArguments. LoRA adapters are attached to the language model’s self-attention projections with rank r = 8, scaling α = 16, dropout = 0.05, and target modules llm..*layers.\d+.self_attn.(q_proj|k_proj|v_proj). The underlying MiniCPM-V-2 and SigLIP-400M weights remain frozen. Training is performed with a per-device batch size of 1, gradient accumulation over 16 steps, a single training epoch, a learning rate of 2 × 10⁻⁴, fp16 = True, and evaluation and checkpointing at each epoch.
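Expressed with the peft library, the adapter configuration corresponds roughly to the sketch below; the actual runs go through the official MiniCPM finetune.py wrapper, so this is an approximation for readers familiar with peft (the escaped regex is our rendering of the target-module pattern above).

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                 # adapter rank
    lora_alpha=16,       # scaling factor
    lora_dropout=0.05,
    # Regex targeting the LLM self-attention projections, as in the text.
    target_modules=r"llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj|v_proj)",
    task_type="CAUSAL_LM",
)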
Post-training evaluation.
After training, a separate inference script loads the fine-tuned checkpoint (base model with LoRA adapters), iterates over the modality-specific test JSON files, and produces answers via the MiniCPM chat interface. Generated responses are decoded to plain text and normalised using the same normalize_text function as elsewhere. Predictions are evaluated using the same metric computation routines as in the benchmarking stage (EM, F1, ROUGE-L, WUPS, WUPS@0.9, and BERTScore-F1), enabling direct comparison between zero-shot, fine-tuned, and post-hoc option-selection settings.

Appendix C. Post-Hoc Option Selection

Option Selection after Fine-Tuning: Implementation Details.

Appendix C.1. Multiple-Choice Harmonisation of SLAKE and OmniMedVQA

Starting from the harmonised SLAKE+OmniMedVQA corpus and the CT/MRI/X-ray subsets, we attach four candidate options and a correct_option label to every example.
OmniMedVQA options. For OmniMedVQA-Mini, the existing options field is expanded into explicit columns option_A, option_B, option_C, and option_D. If fewer than four candidates are present, empty strings are used as padding; if more than four are present, only the first four are retained for consistency. The ground-truth answer is taken from gt_answer and normalised using normalize_text, and each option is normalised in the same way. If exactly one option matches the normalised ground-truth answer, the corresponding label (A–D) is stored in correct_option. If the answer matches none or more than one option, correct_option is left undefined and the example is excluded from MCQ-accuracy computation, although it may still contribute to free-text metrics.
Synthetic options for SLAKE. SLAKE does not provide candidate options, so synthetic distractors are constructed using the SLAKE internal answer distribution only.
Global answer pool. Normalised answers are computed over the full SLAKE split, frequencies are counted, and the top K = 10 most frequent normalised answers (TOP_K_SYNTH_ANSWERS = 10) are selected. Each normal form is mapped back to one representative original string, forming a high-frequency pool.
Extended unique pool. An extended pool is constructed containing all distinct normalised answer strings in SLAKE, ensuring coverage of rare answers.
Distractor sampling. For a record with ground-truth answer a, we compute a_norm = normalize_text(a) and call sample_unique_distractors(a_norm, need_k = 3). The procedure randomly shuffles the high-frequency pool, selects candidates whose normalised forms differ from a_norm and from each other, and stops once three distractors are collected or the pool is exhausted. If fewer than three distractors are found, the procedure falls back to the extended pool under the same uniqueness constraints.
Option construction. The three sampled distractors and the original correct answer are combined into a list of four strings, shuffled randomly, and checked to ensure that all four normalised options are distinct. The resulting list is assigned to option_A–option_D, and correct_option is set to the label corresponding to the position of the correct answer. This process equips every SLAKE example with a four-option structure without relying on external knowledge or model-generated distractors.
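An illustrative implementation of the sampling and option-construction steps follows; the pool arguments and the build_options helper are assumptions for this sketch, not a verbatim excerpt.

import random
import re

def normalize_text(s: str) -> str:
    return re.sub(r"\s+", " ", s.lower().strip())

def sample_unique_distractors(a_norm, pool, extended_pool, need_k=3):
    # Draw up to need_k distractors whose normalised forms differ from a_norm.
    chosen, seen = [], {a_norm}
    for source in (pool, extended_pool):  # high-frequency pool, then fallback
        cands = list(source)
        random.shuffle(cands)
        for c in cands:
            c_norm = normalize_text(c)
            if c_norm not in seen:
                chosen.append(c)
                seen.add(c_norm)
            if len(chosen) == need_k:
                return chosen
    return chosen

def build_options(answer, pool, extended_pool):
    a_norm = normalize_text(answer)
    options = sample_unique_distractors(a_norm, pool, extended_pool) + [answer]
    random.shuffle(options)
    correct_option = "ABCD"[options.index(answer)]
    return options, correct_option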

Appendix C.2. Combined MCQ Dataset per Modality

For ViLT and BLIP, the SLAKE and OmniMedVQA tables enriched with option_A–option_D and correct_option are concatenated into a single table combo with columns image_path, question, answer, modality, dataset_id, option_A–option_D, and correct_option. The following steps are applied:
  • Modality filtering: the combined dataset is filtered by modality ∈ {CT, MRI, X-ray} to create modality-specific datasets.
  • Train–test split: For each modality, the same random_split utility as in Appendix B.2 is applied with test_size = 0.2 and TRAIN_PERCENT = TEST_PERCENT = 1.0, using a fixed random seed.
  • Retention of options: All option columns and correct_option are retained in both splits. These fields are ignored by the training loss and carried forward solely for evaluation.
The fine-tuning hyperparameters and loss functions remain identical to those described in Appendix B.2 and Appendix B.3.

Appendix C.3. MiniCPM-V-2 SFT JSON with Options

MiniCPM-V-2 uses JSONL files in the official SFT “cookbook” format. To embed MCQ information without modifying the training loop, the SLAKE synthetic distractor pools are reused. A helper make_options_for_answer(answer_text) generates three distractors using sample_unique_distractors, appends the correct answer, randomly permutes the four candidates, and returns both the option texts and the correct label.
Each record in make_sft_json_for_split(split, modality) is written as
{
  "id": "<unique-id>",
  "image": "<path-to-image>",
  "conversations": [
    {"role": "user", "content": "<image>\n<Question>"},
    {"role": "assistant", "content": "<Answer>"}
  ],
  "options": ["opt_A", "opt_B", "opt_C", "opt_D"],
  "correct_option": "A" | "B" | "C" | "D"
}
SLAKE images are referenced via their img_name fields and a local root directory, while OmniMed images are written as temporary PNG files and referenced by path. The LoRA-based SFT script ignores options and correct_option during training; these fields are used only during evaluation.
The error analyses of the post hoc option-selection results are shown in Figure A2, Figure A3 and Figure A4:
Figure A2. BLIP: error analyses of post hoc option selection results.
Figure A3. ViLT: error analyses of post hoc option selection results.
Figure A4. MiniCPM: error analyses of post hoc option selection results.

Appendix C.4. Semantic Option Selection and Model Integration

BERTScore-based similarity. The function best_option_for_prediction(pred_text, options) implements semantic snapping of free-text predictions to MCQ options. Given a normalised prediction and a list of candidate options, BERTScore-F1 is computed for each non-empty option using a bert-base-uncased encoder on CPU, yielding a score s_j per option.
Fallback to lexical F1. If BERTScore computation fails, a lexical token-level F1 is computed according to

F1_lex(p, t) = 2 |T(p) ∩ T(t)| / (|T(p)| + |T(t)|),

where T(·) denotes the set of tokens in a string.
Selection. The option j* with the maximum score s_j* is selected, returning the corresponding label ℓ̂ ∈ {A, B, C, D}, the option text ô, and the score. If all options are empty, the function returns (None, pred_text, 0.0).
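A simplified sketch of this function, using the evaluate BERTScore wrapper and the token-set F1 fallback above; error handling and empty-option bookkeeping are abbreviated relative to the released code.

import evaluate

bertscore = evaluate.load("bertscore")

def token_set_f1(p: str, t: str) -> float:
    # Lexical fallback: 2|T(p) ∩ T(t)| / (|T(p)| + |T(t)|) over token sets.
    P, T = set(p.split()), set(t.split())
    if not P or not T:
        return 0.0
    inter = len(P & T)
    return 2 * inter / (len(P) + len(T)) if inter else 0.0

def best_option_for_prediction(pred_text, options):
    cands = [(j, o) for j, o in enumerate(options) if o]  # skip empty options
    if not cands:
        return None, pred_text, 0.0
    try:
        res = bertscore.compute(
            predictions=[pred_text] * len(cands),
            references=[o for _, o in cands],
            model_type="bert-base-uncased", device="cpu",
        )
        scores = res["f1"]
    except Exception:  # fall back to lexical token-level F1
        scores = [token_set_f1(pred_text, o) for _, o in cands]
    best = max(range(len(cands)), key=lambda i: scores[i])
    j_star, o_hat = cands[best]
    return "ABCD"[j_star], o_hat, float(scores[best])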
Integration per model. For ViLT, the fine-tuned classifier output is decoded via id2label to obtain a raw prediction, which is passed to best_option_for_prediction. BLIP follows the same procedure using its generated answer. For MiniCPM-V-2, an evaluation script loads the base model with the trained LoRA adapter, produces predictions via the official chat interface, and applies semantic option selection. In all cases, metrics (EM, F1, ROUGE-L, WUPS, WUPS@0.9, and BERTScore-F1) are computed on the selected option texts, and MCQ accuracy is obtained by comparing the selected label with correct_option when defined.
Together with Appendix A and Appendix B, Appendix C fully specifies the implementation of the post hoc multiple-choice option-selection pipeline used to obtain the results reported in the main text.

References

  1. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. VQA: Visual question answering. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2425–2433. [Google Scholar]
  2. Hartsock, I.; Rasool, G. Vision-language models for medical report generation and visual question answering: A Review. Front. Artif. Intell. 2024, 7, 1430984. [Google Scholar] [CrossRef] [PubMed]
  3. Lin, Z.; Zhang, D.; Tao, Q.; Shi, D.; Haffari, G.; Wu, Q.; He, M.; Ge, Z. Medical visual question answering: A survey. Artif. Intell. Med. 2023, 143, 102611. [Google Scholar] [CrossRef] [PubMed]
  4. Sunitha, U.; Shastri, H. Visual question answering system. Int. J. Res. Publ. Rev. 2025, 6, 1793–1796. [Google Scholar] [CrossRef]
  5. Liu, B.; Zhan, L.-M.; Xu, L.; Ma, L.; Yang, Y.; Wu, X.-M. Slake: A semantically-labeled knowledge-enhanced dataset for Medical Visual Question Answering. In Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021. [Google Scholar]
  6. Hu, Y.; Li, T.; Lu, Q.; Shao, W.; He, J.; Qiao, Y.; Luo, P. OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 22170–22183. [Google Scholar]
  7. Liu, G.; He, J.; Li, P.; Zhao, Z.; Zhong, S. Cross-modal self-supervised vision language pre-training with multiple objectives for medical visual question answering. J. Biomed. Inform. 2024, 160, 104748. [Google Scholar] [CrossRef] [PubMed]
  8. Kim, W.; Son, B.; Kim, I. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021. [Google Scholar]
  9. Nguyen, D.; Ho, M.K.; Ta, H.; Nguyen, T.T.; Chen, Q.; Rav, K.; Dang, Q.D.; Ramchandre, S.; Phung, S.L.; Liao, Z.; et al. Localizing before answering: A benchmark for grounded medical visual question answering. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Montreal, QC, Canada, 16–22 August 2025. [Google Scholar]
  10. Yi, Z.; Xiao, T.; Albert, M.V. A survey on multimodal large language models in radiology for report generation and visual question answering. Information 2025, 16, 136. [Google Scholar] [CrossRef]
  11. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
  12. Raschka, S. Understanding Multimodal LLMs. Blog Post. Available online: https://magazine.sebastianraschka.com/p/understanding-multimodal-llms (accessed on 7 December 2025).
  13. Hu, S.; Tu, Y.; Han, X.; He, C.; Cui, G.; Long, X.; Zheng, Z.; Fang, Y.; Huang, Y.; Zhao, W.; et al. MiniCPM: Unveiling the potential of small language models with Scalable Training Strategies. arXiv 2024, arXiv:2404.06395. [Google Scholar] [CrossRef]
  14. Moor, M.; Huang, Q.; Wu, S.; Yasunaga, M.; Zakka, C.; Dalmia, Y.; Reis, E.P.; Rajpurkar, P.; Leskovec, J. Med-Flamingo: A multimodal medical few-shot learner. In Proceedings of the Machine Learning for Health Symposium, New Orleans, LA, USA, 10 December 2023. [Google Scholar]
  15. Guo, Y.; Huang, W. Llava-next-med: Medical multimodal large language model. In Proceedings of the 2025 Asia-Europe Conference on Cybersecurity, Internet of Things and Soft Computing (CITSC), Rimini, Italy, 10–12 January 2025; pp. 474–477. [Google Scholar]
  16. Malinowski, M.; Fritz, M. A multi-world approach to question answering about real-world scenes based on uncertain input. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS), Montreal, QC, USA, 8–13 December 2014. [Google Scholar]
  17. Malinowski, M.; Rohrbach, M.; Fritz, M. Ask your neurons: A deep learning approach to visual question answering. Int. J. Comput. Vis. 2017, 125, 110–135. [Google Scholar] [CrossRef]
  18. Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  19. Zakari, R.Y.; Owusu, J.W.; Qin, K.; Sagir, A.M. A transformer-based approach for effective visual question answering. In Proceedings of the 2024 IEEE Smart World Congress (SWC), Denarau Island, Fiji, 2–7 December 2024; pp. 1532–1539. [Google Scholar]
  20. Sharma, H.; Jalal, A.S. A survey of methods, datasets and evaluation metrics for visual question answering. Image Vis. Comput. 2021, 116, 104327. [Google Scholar] [CrossRef]
  21. Lu, S.; Liu, M.; Yin, L.; Yin, Z.; Liu, X.; Zheng, W. The multi-modal fusion in visual question answering: A review of Attention Mechanisms. PeerJ Comput. Sci. 2023, 9, e1400. [Google Scholar] [CrossRef] [PubMed]
  22. Bazi, Y.; Rahhal, M.M.; Bashmal, L.; Zuair, M. Vision–language model for visual question answering in medical imagery. Bioengineering 2023, 10, 380. [Google Scholar] [CrossRef] [PubMed]
  23. Lau, J.J.; Gayen, S.; Ben Abacha, A.; Demner-Fushman, D. A dataset of clinically generated visual questions and answers about radiology images. Sci. Data 2018, 5, 180251. [Google Scholar] [CrossRef] [PubMed]
  24. He, X.; Zhang, Y.; Mou, L.; Xing, E.; Xie, P. PATHVQA: 30000+ questions for medical visual question answering. arXiv 2020, arXiv:2003.10286. [Google Scholar] [CrossRef]
  25. Yuan, D. Language bias in visual question answering: A survey and taxonomy. arXiv 2021, arXiv:2111.08531. [Google Scholar] [CrossRef]
  26. Zhang, X.; Wu, C.; Zhao, Z.; Lin, W.; Zhang, Y.; Wang, Y.; Xie, W. PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering. arXiv 2023, arXiv:2305.10415. [Google Scholar] [CrossRef]
  27. Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; Gao, J. Llava-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. 2023. Available online: https://openreview.net/forum?id=GSuP99u2kR (accessed on 5 December 2025).
  28. Chang, T.; Chen, S.; Fan, G.; Feng, Z. A vision-language model based on prompt learner for few-shot medical images diagnosis. In Proceedings of the International Conference on Computer Supported Cooperative Work in Design (CSCWD), Tianjin, China, 8–10 May 2024; pp. 1455–1460. [Google Scholar]
  29. Chen, X.; Lai, Z.; Ruan, K.; Chen, S.; Liu, J.; Liu, Z. R-llava: Improving med-vqa understanding through visual region of interest. In Proceedings of the 2025 International Joint Conference on Neural Networks (IJCNN), Rome, Italy, 30 June–5 July 2025; pp. 1–10. [Google Scholar]
  30. Zhang, D.; Cao, R.; Wu, S. Information fusion in visual question answering: A survey. Inf. Fusion 2019, 52, 268–280. [Google Scholar] [CrossRef]
  31. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016. [Google Scholar]
  32. Mathur, N.; Baldwin, T.; Cohn, T. Tangled up in Bleu: Reevaluating the evaluation of Automatic Machine Translation Evaluation Metrics. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Online, 5–10 July 2020. [Google Scholar]
  33. Barbella, M.; Tortora, G. Rouge metric evaluation for text summarization techniques. SSRN Electron. J. 2022. [Google Scholar] [CrossRef]
  34. Kim, B.S.; Kim, J.; Lee, D.; Jang, B. Visual question answering: A survey of methods, datasets, evaluation, and challenges. ACM Comput. Surv. 2025, 57, 249. [Google Scholar] [CrossRef]
  35. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating text generation with BERT. arXiv 2019, arXiv:1904.09675. [Google Scholar]
  36. PyTorch Contributors. BCEWithLogitsLoss. Available online: https://docs.pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html (accessed on 19 January 2026).
  37. Zhang, X.; Wu, C.; Zhao, Z.; Lin, W.; Zhang, Y.; Wang, Y.; Xie, W. Development of a large-scale medical visual question-answering dataset. Commun. Med. 2024, 4, 277. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Workflow of the proposed pipeline (benchmarking, fine-tuning, and post hoc option selection).
Table 1. Modality-wise benchmark results on SLAKE+OmniMedVQA (zero-shot).
Modality (N) | Exact Match | Average F1 | ROUGE-L | WUPS | WUPS@0.9 | BERTScore-F1
Salesforce/blip-vqa-base
CT (3571) | 24.56 | 25.02 | 25.07 | 64.87 | 28.48 | 72.42
MRI (1840) | 18.86 | 19.88 | 20.15 | 50.62 | 23.48 | 65.59
X-ray (2372) | 23.19 | 23.69 | 26.39 | 65.58 | 28.54 | 70.59
dandelin/vilt-b32-finetuned-vqa
CT (3571) | 23.55 | 23.97 | 23.99 | 55.08 | 24.84 | 68.21
MRI (1840) | 16.47 | 16.66 | 16.93 | 46.10 | 19.78 | 61.51
X-ray (2372) | 18.59 | 18.89 | 18.92 | 54.88 | 24.24 | 63.47
MiniCPM-V2 (OpenBMB)
CT (3571) | 0.00 | 3.16 | 6.88 | 70.47 | 18.96 | 35.49
MRI (1840) | 0.00 | 1.96 | 7.68 | 58.54 | 13.86 | 36.39
X-ray (2372) | 0.00 | 2.92 | 7.45 | 67.93 | 21.29 | 35.90
Table 2. Modality-wise results on SLAKE+OmniMedVQA after fine-tuning.
Modality (N) | Exact Match | Average F1 | ROUGE-L | WUPS | WUPS@0.9 | BERTScore-F1
Salesforce/blip-vqa-base
CT (714) | 81.72 | 84.33 | 84.79 | 91.78 | 83.26 | 93.51
MRI (368) | 73.55 | 79.39 | 79.96 | 78.43 | 66.67 | 91.31
X-ray (474) | 71.28 | 75.07 | 82.08 | 88.65 | 78.51 | 92.17
dandelin/vilt-b32-finetuned-vqa
CT (714) | 82.70 | 84.57 | 84.35 | 91.58 | 82.98 | 93.93
MRI (368) | 76.03 | 80.03 | 80.29 | 77.37 | 64.74 | 91.83
X-ray (474) | 78.09 | 80.95 | 81.25 | 88.50 | 77.87 | 91.50
MiniCPM-V2 (OpenBMB)
CT (714) | 79.97 | 81.52 | 81.68 | 89.13 | 78.43 | 93.00
MRI (368) | 68.48 | 74.30 | 76.23 | 73.36 | 61.14 | 88.73
X-ray (474) | 59.70 | 65.52 | 70.20 | 78.66 | 64.14 | 85.88
Table 3. Modality-wise fine-tuning with post-hoc option selection on SLAKE+OmniMedVQA.
Modality (N) | Exact Match | Average F1 | ROUGE-L | WUPS | WUPS@0.9 | BERTScore-F1
Salesforce/blip-vqa-base
CT (714) | 90.76 | 91.22 | 91.48 | 93.95 | 87.25 | 96.48
MRI (368) | 87.77 | 87.77 | 87.77 | 78.04 | 66.58 | 95.65
X-ray (474) | 93.25 | 93.61 | 93.61 | 91.33 | 81.65 | 97.57
dandelin/vilt-b32-finetuned-vqa
CT (714) | 92.12 | 92.45 | 92.59 | 94.67 | 89.03 | 96.72
MRI (368) | 92.84 | 92.84 | 92.84 | 79.81 | 70.25 | 96.87
X-ray (474) | 92.34 | 92.87 | 92.86 | 90.63 | 82.34 | 96.54
MiniCPM-V2 (OpenBMB)
CT (714) | 80.95 | 82.77 | 82.91 | 90.00 | 79.69 | 92.86
MRI (368) | 76.63 | 80.47 | 81.35 | 75.71 | 64.95 | 91.27
X-ray (474) | 73.84 | 78.42 | 78.87 | 85.39 | 73.42 | 89.81
