Article

Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion

1 School of Artificial Intelligence, Tiangong University, Tianjin 300387, China
2 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 4712; https://doi.org/10.3390/app15094712
Submission received: 2 April 2025 / Revised: 20 April 2025 / Accepted: 22 April 2025 / Published: 24 April 2025
(This article belongs to the Special Issue New Trends in Natural Language Processing)

Abstract

Medical Visual Question Answering (Med-VQA) aims to answer clinical questions accurately by analyzing a medical image together with its corresponding question. Designing Med-VQA systems is of profound importance for assisting clinical diagnosis and enhancing diagnostic accuracy. Building upon this foundation, hierarchical Med-VQA extends Med-VQA by organizing medical questions into a hierarchical structure and making level-specific predictions to handle fine-grained distinctions. Recently, many studies have proposed hierarchical Med-VQA tasks and established datasets. However, several issues remain: (1) imperfect hierarchical modeling leads to poor differentiation between question levels, resulting in semantic fragmentation across hierarchies; (2) excessive reliance on implicit learning in Transformer-based cross-modal self-attention fusion can obscure crucial local semantic correlations in medical scenarios. To address these issues, this study proposes HiCA-VQA, a Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion method. Specifically, the hierarchical modeling comprises two modules: hierarchical prompting for fine-grained medical questions and hierarchical answer decoders. The hierarchical prompting module pre-aligns hierarchical text prompts with image features to guide the model in focusing on specific image regions according to question type, while the hierarchical decoders make separate predictions for questions at different levels to improve accuracy across granularities. The framework also incorporates a cross-attention fusion module in which images serve as queries and text as key-value pairs. This approach avoids the irrelevant signals introduced by global interactions while achieving lower computational complexity than global self-attention fusion. Experiments on the Rad-ReStruct benchmark demonstrate that HiCA-VQA outperforms existing state-of-the-art methods in answering hierarchical fine-grained questions, achieving in particular an 18 percent improvement in F1 score. This study provides an effective pathway for hierarchical visual question answering systems, advancing medical image understanding.

1. Introduction

Medical Visual Question Answering (Med-VQA) aims to generate accurate diagnostic answers from candidate options by jointly analyzing medical images and clinical questions expressed in natural language [1,2]. This task provides essential technical support for AI-assisted diagnosis, medical education, and clinical decision-making [3,4]. In recent years, Med-VQA has demonstrated potential in applications such as pneumonia detection [5] and tumor classification [6], with current research focusing on precise multi-modal feature fusion and emulation of physicians’ hierarchical diagnostic logic [7] to better align with clinical workflows [8]. Traditional Med-VQA approaches typically treat complex medical questions as single-granularity tasks, neglecting the progressive reasoning pathway of “Topic Existence → Element Existence → Attributes” inherent to clinical diagnosis. To address this limitation, the Hierarchical Medical Visual Question Answering (Hierarchical Med-VQA) task has been proposed, as illustrated in Figure 1 [9]. These models simulate clinical diagnostic pathways by decomposing structured report questions into hierarchical levels and using autoregressive methods to sequentially predict answers to the multi-level medical questions associated with a single medical image. While demonstrating advantages over conventional image–question Med-VQA models in clinical relevance and automated hierarchical report generation, these models retain critical limitations: (1) The continued use of single-layer answer encoding architectures from traditional Med-VQA leads to hierarchical semantic fragmentation, stemming from feature-space coupling and gradient optimization conflicts across hierarchical fine-grained question tasks. Specifically, shared visual representations must simultaneously satisfy divergent semantic requirements across hierarchy levels, leading to attention competition and representational confusion during feature learning; loss gradients from the subtasks exhibit directional conflicts during backpropagation, exacerbating optimization instability. This fragmentation is amplified at the data distribution level: anatomical regions with long-tailed distributions suffer representational suppression due to gradient dominance from high-frequency tasks, resulting in imbalanced performance across hierarchical question levels. In short, current hierarchical Med-VQA implementations achieve question hierarchy only at the dataset level, without a corresponding architectural hierarchy. (2) The conventional Transformer-based concatenated self-attention fusion method [10] in such models shows critical limitations: its implicit, coarse-grained cross-modal interaction mechanism fails to meet medical scenarios’ demands for precise alignment and robustness. By simply concatenating image–text embeddings before self-attention layers, existing methods achieve global feature mixing but fundamentally rely on implicit cross-modal relationship learning, thereby obscuring crucial local semantic correlations in medical contexts.
To address these challenges, this study proposes HiCA-VQA, a framework incorporating hierarchical modeling with cross-attention fusion. Within this architecture, questions at different fine-grained levels are processed through dedicated answer decoders. Specifically, we introduce a hierarchical prompting module and hierarchical answer decoders. Based on the fine-grained level of the current question, the hierarchical prompting module introduces distinct prompts to pre-align text prompts with corresponding image features before fusing them with question features for decoding. The primary objective is to guide the model in progressively shifting attention from global to local image regions, thereby completing stepwise reasoning from screening to detailed analysis. We introduce a cross-attention fusion module [11] where image features serve as queries and text features as key-value pairs. This image-to-text directed retrieval mechanism dynamically retrieves the most relevant textual semantic clues for the current diagnostic task, establishing explicit mapping relationships between images and questions while reducing cross-modal noise interference.
Our results demonstrate that HiCA-VQA offers clear advantages and strong adaptability in hierarchical medical visual question answering tasks. It dynamically adjusts the depth of the answer-decoder hierarchy according to the question hierarchy in different visual question answering tasks, providing a better solution for hierarchical VQA. Our main contributions include the following:
  • We introduce a hierarchical prompting module and hierarchical answer decoders that provide different context prompts based on varying levels of question-image sample pairs to guide the model’s attention to distinct image regions.
  • We incorporate cross-attention into multi-modal feature fusion to utilize attention mechanisms for emphasizing critical components, establishing precise associations between anatomical regions and diagnostic terminology through directed alignment, and outputting a final embedding reflecting inter-modal interactions. This achieves accurate mapping between local lesions and textual terms while enhancing robustness against cross-modal noise.
  • Experimental results show that our strategy outperforms baseline methods and current state-of-the-art approaches on the Rad-ReStruct [9] dataset, achieving new state-of-the-art performance.

2. Related Work

2.1. Pretrained Models in Medical Visual Question Answering

Recent progress in large-scale pretrained models has significantly advanced medical visual question answering (Med-VQA) through cross-modal learning paradigms. In vision-language joint modeling [12,13], domain-specific pretrained models achieve semantic alignment between medical images and text via contrastive learning strategies [14]. PubMedCLIP [15], a medical variant of CLIP [16], establishes cross-modal shared semantic spaces through contrastive pretraining on millions of medical image–text pairs. Its visual encoder extracts clinically relevant, anatomically aware features that provide high-quality image representations for downstream VQA tasks. For textual modeling, RadBERT [17] implements domain-adaptive pretraining on radiology report corpora, enhancing radiological term comprehension through masked language modeling and contrastive learning tasks. This model dynamically encodes hierarchical semantic structures in medical questions while preserving contextual dependencies. Notably, MedFuse [18] addresses medical data scarcity through hierarchical feature fusion architectures, where pretrained EfficientNet image features interact with BioClinicalBERT text embeddings via gated fusion mechanisms, demonstrating superior performance in pneumonia detection compared to general-purpose models. Benchmark studies on VQA-Rad [19] further validate the advantages of medical-specific pretraining over generic models like ViLBERT [20], particularly in fine-grained reasoning, where medical pretrained models better capture image–text correlations. However, existing approaches predominantly employ single-layer decoding architectures to process hierarchical questions, failing to combine the semantic advantages of pretrained models with hierarchical diagnostic logic. This limitation results in high-level semantic conflicts and low-level feature confusion, creating an opportunity for our hierarchical prompting and decoding framework design.

2.2. Hierarchical Medical Visual Question Answering

Although medical VQA has advanced in single-question reasoning, hierarchical semantic relationship modeling remains underdeveloped. Traditional methods like the VQA-Rad baseline model [19] treat questions as isolated tasks, leading to logical disconnections in structured report generation. Early structured report studies, such as the unstructured label retrieval by Syeda-Mahmood et al. [21] and single-disease attribute prediction by Bhalodia et al. [22], failed to systematically organize multi-level diagnostic elements. While generic hierarchical reasoning studies (e.g., Kovaleva et al.’s stochastic history-sampling dialogue model [23]) provide inspiration, their hierarchy construction logic fundamentally differs from the tree-structured semantics of medical diagnosis. Pellegrini et al.’s Hi-VQA [9] pioneered modeling radiology report generation as an autoregressive hierarchical VQA task: explicitly constructing tree-like dependency chains from system-level anomaly detection to lesion-specific attribute description, enabling progressive reasoning. Their multi-modal Transformer self-attention fusion innovatively integrates image features with hierarchical text semantics through spatial position encoding, enhancing report interpretability via hierarchical consistency constraints during inference. Compared to domain-specific pretrained models like MedFuse [18], Hi-VQA achieves comparable performance using the general-purpose RadBERT, demonstrating the knowledge transfer benefit of the hierarchical architecture. The Rad-ReStruct dataset further bridges this gap with its three-tier diagnostic annotation system, surpassing flat datasets like PathVQA [24] and Slake [25] and providing a standardized benchmark for hierarchical reasoning. Despite these advancements, existing work only achieves dataset-level hierarchy without exploiting the potential of an architectural hierarchy.

2.3. Context Alignment Enhancements

Since the inception of medical VQA, precise image–text semantic alignment has been critical for performance improvement. Early studies attempted to transfer general VQA attention mechanisms (e.g., BioGPT [26], BLIP-2 [27]) or enhanced medical image representations via Mixed Enhanced Visual Features (MEVF) [28], yet they remained limited by modality gaps and medical data scarcity. Recent work has focused on pre-trained vision-language models (e.g., PubMedCLIP [15]) and autoregressive history modeling, but still relies on self-attention fusion. Diverging from existing approaches, Arsalane et al. [29] first proposed leveraging medical reports as contextual enhancement signals. Their trainable cross-modal alignment module uses stacked multi-head attention layers to pre-align image features with report semantics, followed by multi-modal fusion with the medical questions. While this approach implicitly establishes vision–text correlations during training for data augmentation, its design, like Hi-VQA’s, employs single-layer answer decoders rather than hierarchical architectures, which limits the enhancement effectiveness.

3. Methodology

Figure 2 illustrates the overview of our proposed hierarchical medical visual question answering model. First, we clarify that each medical image corresponds to a complete medical report. The medical image and a question extracted from the report form each sample, where the question is categorized into three hierarchical levels based on granularity. The method primarily leverages the varying granularity levels of medical questions to prompt the medical image, then hierarchically inputs the questions from the current sample into distinct answer decoders for prediction. This enhances the visual reference space, guiding the model to focus on specific image regions for more accurate answers.
To achieve this goal, we first encode the image from an image–question pair using an image encoder to obtain $V_I$. The hierarchical prompts corresponding to the medical questions of each image are then encoded by a text encoder. These prompts, $p_1$, $p_2$, or $p_3$ depending on the question’s hierarchical level, generate prompt embeddings $V_p$. The image encoding $V_I$ and prompt encoding $V_p$ are fed into an attention-based alignment module for preliminary fusion, producing image-prompt features $F_p$ to strengthen the visual reference space. The medical question from the sample is input into the same text encoder to obtain question features $V_q$. Finally, the question features $V_q$ and image-prompt features $F_p$ are processed by hierarchical cross-attention answer decoders for multi-modal fusion and final answer prediction. The subsequent sections provide detailed descriptions of each module.
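To make the data flow concrete, the following PyTorch-style sketch mirrors the forward pass described above. It is a minimal illustration under our own naming assumptions (the class, attribute, and argument names are not from the released implementation), and it assumes the encoders, alignment module, and level-specific decoders are constructed elsewhere.

```python
import torch
import torch.nn as nn

class HiCAVQASketch(nn.Module):
    """Illustrative sketch of the HiCA-VQA forward pass, not the released implementation."""

    def __init__(self, image_encoder, text_encoder, align, decoders, num_answers, dim=768):
        super().__init__()
        self.image_encoder = image_encoder   # frozen PubMedCLIP-style encoder -> V_I
        self.text_encoder = text_encoder     # frozen RadBERT-style encoder -> V_p, V_q
        self.align = align                   # attention-based alignment module -> F_p
        self.decoders = decoders             # nn.ModuleList with one decoder per question level
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, image, prompt_tokens, question_tokens, level):
        # The encoders are frozen (Section 3.3), so no gradients are needed here.
        with torch.no_grad():
            v_i = self.image_encoder(image)            # (B, N_img, dim)
            v_p = self.text_encoder(prompt_tokens)     # (B, N_prompt, dim)
            v_q = self.text_encoder(question_tokens)   # (B, N_question, dim)
        f_p = self.align(v_i, v_p)                     # image-prompt features F_p
        fused = self.decoders[level - 1](f_p, v_q)     # level-specific answer decoder (levels 1-3)
        return self.classifier(fused.mean(dim=1))      # logits over the answer candidates
```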

3.1. Hierarchical Prompting Module

The questions in structured medical VQA are typically hierarchical, such as those in the Rad-ReStruct dataset shown in Figure 3. The question hierarchy consists of three levels. The highest level asks about general findings, such as the presence of signs, diseases, abnormal regions, or objects. The second level focuses on specific elements, such as a particular object or disease. The lowest level asks about specific attributes. Current models often struggle to differentiate the fine-grained visual details required by the questions because they lack clear hierarchical reasoning guidance. Our approach addresses this by using a hierarchical prompt module to guide the model’s attention to image regions of different granularities. One point that needs clarification is that, in the Rad-ReStruct dataset we used, each patient sample corresponds to a medical image, and each image is linked to a pre-filled structured report template. In the hierarchical prompt module, we employ the Deepseek API (https://platform.deepseek.com/api_keys, accessed on 8 January 2025) with a different prompt for each level to summarize these structured report templates. Figure 4 illustrates this process.
As shown in Figure 4, for low-granularity tasks, we provide the prompt: “Summarize the low-granularity findings, directing the model’s attention to the overall image”. This corresponds to Level 1 questions. For mid-granularity tasks, we give the prompt: “Summarize the mid-granularity findings, focusing on disease elements”. This corresponds to Level 2 questions. For high-granularity tasks, we provide the prompt: “Summarize the high-granularity findings, guiding the model’s focus on specific organ lesions and other detailed aspects”. Deepseek outputs disease findings at three levels, each with its own focus. These three levels of disease findings are then passed into a text encoder for encoding, followed by subsequent feature alignment and feature fusion modules.
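As an illustration of this prompt-generation step, the sketch below calls an OpenAI-compatible chat endpoint to produce one summary per level. The base URL, model name, and exact prompt wording are assumptions made for the example; they are not the authors' settings.

```python
from openai import OpenAI

# Assumed OpenAI-compatible DeepSeek endpoint; replace the key and model as appropriate.
client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

LEVEL_PROMPTS = {
    1: "Summarize the low-granularity findings of this structured report template.",
    2: "Summarize the mid-granularity findings, focusing on disease elements.",
    3: "Summarize the high-granularity findings, focusing on specific organ lesions.",
}

def summarize_report(template_text: str, level: int) -> str:
    """Return the level-specific summary used as the hierarchical prompt for that level."""
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user",
                   "content": f"{LEVEL_PROMPTS[level]}\n\n{template_text}"}],
    )
    return response.choices[0].message.content
```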

3.2. Image Encoder

For the input image $I$, we use the image encoder $E_{image}$ to extract visual features. The image encoder employs PubMedCLIP [15], a variant of the CLIP model [16] specifically designed for medical visual question answering. It is a contrastive vision-language pre-training model tailored to the medical domain. During the dual-modal pretraining phase, it takes medical images and corresponding professional texts as input and outputs cross-modal aligned joint feature vectors through contrastive learning. In the single-modal application phase, the model can serve as a dedicated visual encoder: when fed a raw medical image, it outputs a 768-dimensional visual feature vector that integrates clinical semantics, directly supporting the visual reasoning of downstream Med-VQA models. Specifically, we input a single medical image $I$ and obtain a high-dimensional visual representation $V_I$ that fuses anatomical features and clinical semantics as the feature of the medical image:
$V_I = E_{image}(I)$

3.3. Text Encoder

For the text encoder, we employ RadBERT, a domain-adaptive pre-trained language model optimized for radiology reports. It takes unstructured clinical narratives from radiology free-text reports, such as CT/MRI descriptions, as input. After RadLex encoding of anatomical locations and pathological features and subword tokenization, it generates a sequence of tokens, and the output is a 768-dimensional context-aware dynamic semantic vector. The model is trained in two stages on the PubMed [30] radiology literature and clinical report corpus: masked language modeling to reconstruct masked medical terms, and contrastive learning to align the semantic associations between image descriptions and diagnostic conclusions. This results in improved performance over the general BERT model [31] in tasks such as automatic encoding of radiology reports and extraction of key information (e.g., generating lesion location–attribute triples); in particular, when dealing with ambiguous descriptions, it exhibits clinically interpretable feature-space distributions. To encode the text input, we utilize the pre-trained embeddings of RadBERT, which capture domain-specific semantic and contextual information. The encoder is frozen, retaining its pretrained weights to prevent further parameter modification. This not only saves computational resources but also allows us to focus on the subsequent alignment and fusion between image and text embeddings. Specifically, for a sample prompt $p_i$ and medical question $q_i$, through the text encoder $E_{text}$ we obtain the prompt feature $V_{p_i}$ and the question feature $V_{q_i}$ as follows:
$V_{q_i} = E_{text}(q_i), \quad V_{p_i} = E_{text}(p_i).$
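A minimal sketch of this frozen feature extraction is given below. The Hugging Face checkpoint identifiers are assumptions standing in for PubMedCLIP and RadBERT weights and may differ from the checkpoints actually used in the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor, CLIPVisionModel

# Hypothetical checkpoint names for illustration only.
VISION_CKPT = "flaviagiammarino/pubmed-clip-vit-base-patch32"
TEXT_CKPT = "StanfordAIMI/RadBERT"

vision = CLIPVisionModel.from_pretrained(VISION_CKPT)
processor = CLIPImageProcessor.from_pretrained(VISION_CKPT)
text_model = AutoModel.from_pretrained(TEXT_CKPT)
tokenizer = AutoTokenizer.from_pretrained(TEXT_CKPT)

# Both encoders stay frozen, as described above.
for module in (vision, text_model):
    module.eval()
    for param in module.parameters():
        param.requires_grad = False

@torch.no_grad()
def encode(image, prompt: str, question: str):
    """Return V_I, V_p, V_q as token-level 768-dimensional features."""
    pixels = processor(images=image, return_tensors="pt").pixel_values
    v_i = vision(pixel_values=pixels).last_hidden_state
    v_p = text_model(**tokenizer(prompt, return_tensors="pt")).last_hidden_state
    v_q = text_model(**tokenizer(question, return_tensors="pt")).last_hidden_state
    return v_i, v_p, v_q
```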

3.4. Alignment Module

The main purpose of the Alignment Module is to make the model pay more attention to specific regions of the image by aligning the medical image with the hierarchical prompt features. The module stacks two multi-head attention layers [32] and realizes fine-grained semantic alignment of medical images and text through a hierarchical cross-modal interaction mechanism. The first multi-head attention layer takes the previously extracted medical image feature $V_I$ as the query and the prompt feature $V_{p_i}$ as the key-value pair, computing an attention weight matrix to model global semantic relationships between image regions and medical text.
$V_{F_1} = \mathrm{Attention}_1(V_I, V_{p_i})$
In the second attention layer, the attention feature $V_{F_1}$ output by the first layer is used as the new query, and the original text prompt feature $V_{p_i}$ again serves as the key-value pair. The second-level cross-modal attention weights are computed with an independent parameter matrix, so no additional mechanisms need to be introduced. Progressive focusing of medical semantics is achieved through pure attention stacking: the first layer captures global image–text associations, and the second layer deepens local semantic alignment within the same key-value space. The final output of the module is the aligned feature $V_F$:
$V_F = \mathrm{Attention}_2(V_{F_1}, V_{p_i})$
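The sketch below shows one way to realize this two-layer alignment with standard PyTorch attention layers; dimensions and head counts are illustrative assumptions rather than the paper's exact hyperparameters.

```python
import torch.nn as nn

class AlignmentModule(nn.Module):
    """Two stacked multi-head attention layers: image features query the prompt features."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, v_i, v_p):
        # First layer: V_F1 = Attention_1(V_I, V_p), global image-prompt associations.
        v_f1, _ = self.attn1(query=v_i, key=v_p, value=v_p)
        # Second layer: V_F = Attention_2(V_F1, V_p), refined alignment in the same key-value space.
        v_f, _ = self.attn2(query=v_f1, key=v_p, value=v_p)
        return v_f  # the image-prompt features F_p passed to the answer decoders
```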

3.5. Hierarchical Answer Decoders

The hierarchical answer decoders employ cross-modal multi-head cross-attention followed by a residual FFN block. The cross-attention layer aims to establish a dynamic correlation mapping between visual and text features: we use the medical image embedding as the query and the text embedding as the key and value. This design follows from a systematic analysis of the two modalities: medical reports have strong semantic coherence and high information density, making them suitable as semantic anchors in the attention mechanism, while medical images contain complex spatial distributions and ambiguous pathological appearances, making them more suitable as query subjects that trigger semantic retrieval. By setting the text as the key/value pair, the model can effectively use the precise semantics of the diagnostic text to guide the semantic focus of image features: the attention mechanism reconstructs the visual representation as a weighted aggregation of text features (values) by computing the similarity between local image regions (queries) and textual semantic units (keys). In particular, since the attention output is essentially a probability-weighted combination of values, using text with higher information density as the value carrier maximizes the effectiveness of semantic fusion. For example, using the embedding of a precise term such as “pneumonia” as the value input makes the output features reflect the key pathological signs in the medical image more accurately. Since questions extracted from structured medical reports have higher information density than ordinary questions, and the medical image features have already been prompted according to the hierarchy, this design better exploits local image regions and the contextual information of the question. It keeps cross-modal interaction efficient, fits the characteristics of precise text and complex images in the medical field, and provides a strong multi-modal representation basis for subsequent disease classification and localization. The cross-attention is computed as follows:
First, the image-prompt embedding $F_p$ is projected to the query $Q$, and the text embedding $V_q$ is projected to the key $K$ and value $V$:
$Q = F_p W_Q, \quad K = V_q W_K, \quad V = V_q W_V$
$W_Q$, $W_K$, and $W_V \in \mathbb{R}^{d \times d_k}$ are learnable weight matrices. The inner product of the query and key is computed, scaled, and normalized:
$F = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right)$
where F denotes the attention weight between image and text tokens. The attention score is applied to the value to obtain an output that combines the image and text embeddings:
$E = F V,$
where $E$ is the combined embedding. A feedforward network is applied to the output of the cross-attention, and a nonlinear transformation is performed, further enhancing the feature representation. The fused features are then used to perform multi-label classification over all answer candidates; however, we only consider outputs that are valid for the current question as the correct answer. For single-choice questions, we predict a single label by applying a softmax function over all valid answers. For multiple-choice questions, we predict multiple labels using a sigmoid function. Across all question categories, the imbalanced class distribution within the dataset prevents the model from adequately learning relevant features for certain medical questions. To address this, we employ a weighted masked cross-entropy loss function (Algorithm 1), where $w$ represents a class-specific weight matrix. Classes with fewer samples are assigned higher weights to mitigate data imbalance. Specifically, for level 1 and level 2 questions, we introduce a mask matrix $M$ that restricts the binary cross-entropy-with-logits loss to the “yes” and “no” candidates by masking irrelevant answer choices. Conversely, for level 3 questions, the mask is applied to the “yes” and “no” candidates while the loss is computed over the other available choices. This hierarchical masking strategy enables task-specific optimization across the different question hierarchies.
Algorithm 1: Weighted masked cross-entropy loss
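To make the decoder and the loss concrete, the following is a minimal PyTorch-style sketch of one hierarchical answer decoder block and of a weighted masked loss of the kind summarized in Algorithm 1. Layer sizes, the masking convention, and the weighting scheme are assumptions made for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerDecoder(nn.Module):
    """One hierarchical answer decoder: cross-attention followed by a residual FFN block."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, f_p, v_q):
        # Image-prompt features F_p are the query; question features V_q are key and value.
        attended, _ = self.cross_attn(query=f_p, key=v_q, value=v_q)
        x = self.norm1(f_p + attended)      # residual connection around cross-attention
        return self.norm2(x + self.ffn(x))  # residual feedforward block


def weighted_masked_bce(logits, targets, valid_mask, class_weights):
    """BCE-with-logits restricted to the answer candidates valid for the current question.

    logits, targets, valid_mask: (batch, num_answers); class_weights: (num_answers,).
    Candidates irrelevant to the current level (e.g. non yes/no options for level 1/2
    questions) are zeroed out by the mask before averaging.
    """
    per_candidate = F.binary_cross_entropy_with_logits(
        logits, targets, weight=class_weights, reduction="none")
    per_candidate = per_candidate * valid_mask
    return per_candidate.sum() / valid_mask.sum().clamp(min=1)
```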

4. Experiments

4.1. Dataset

Our experiments use Rad-ReStruct, the first benchmark dataset for structured radiology report generation, constructed from the semi-structured coding of the IU-Xray [33] dataset. By systematically pairing 3720 standardized chest X-ray images with 3597 reports, it forms a medical knowledge base containing over 180,000 fine-grained question–answer pairs (i.e., over 180,000 samples). The dataset construction adopts a two-layer coding architecture. First, based on the semi-structured findings annotated by medical experts (using 178 controlled vocabulary terms from the MeSH [34] medical subject headings and RadLex [35] radiology terms, covering anatomy, pathological signs, foreign bodies, and attribute descriptions), the term combinations in the original unstructured reports (e.g., “infiltration/lung/upper lobe/left/patchy/mild”) were parsed. Then, through full-patient data mining, a three-level decision-tree-style report template was constructed, with the top level determining the existence of abnormalities (e.g., “Is there any opacity in the lungs?”), the middle level locating anatomical and pathological features (e.g., “Are there any signs of pneumonia in the lungs?”), and the bottom level describing morphological attributes (e.g., “What are the boundary characteristics of the abnormal area?”); term combinations never observed were removed to form a streamlined template. This template introduces a clinical logic constraint mechanism containing 96 medical entity categories, dynamically marking questions as single-choice or multiple-choice while retaining a “no selection” option to simulate real diagnostic scenarios. The data were split following an 80-10-10 stratified strategy, with patient ID hash mapping ensuring data isolation across subsets. As the only dataset currently providing medical images, structured reports, and hierarchical question–answer triples, Rad-ReStruct establishes a new benchmark for explainability in radiology report generation through clinical logic constraints and strict evaluation protocols, and it is ideally suited to our experimental design.

4.2. Training and Evaluation

During training, we employ the teacher-forcing strategy [36], feeding the current question together with the previous level’s question and answer as context. Given that level 3 questions are multiple-choice, we implement a weighted masked cross-entropy loss function, calculating the loss only for the labels relevant to the current question. The AdamW optimizer is used with a learning rate of $10^{-5}$, and end-to-end training is conducted on an NVIDIA RTX 4090 GPU (NVIDIA, Beijing, China) using the PyTorch Lightning 1.8.3 framework. The number of epochs is determined dynamically based on validation-set performance. Additionally, we adopt data augmentation strategies [14,37], including random dropping [38] and reordering of questions at the same level, to prevent overfitting [39]. The model supports a multi-task output mechanism, using softmax for single-choice classification and sigmoid for multi-label classification in multiple-choice scenarios, while strictly constraining the valid answer space. The training process emphasizes hierarchical dependencies. Each training sample is written as $D = (p_i, q_i, I, y_i)$, where $p_i$ is the prompt corresponding to the question, $q_i$ is the question, $I$ is the medical image, and $y_i$ is the ground-truth answer label. Each medical image corresponds to a complete structured report, a structured report includes three levels of medical questions, and each sample pairs the medical image with one of these questions. We freeze the image encoder and text encoder, training only the alignment module, hierarchical answer decoders, and MLP classifier.
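The sketch below illustrates this setup: only the alignment module, decoders, and classifier receive gradients, and the teacher-forcing context prepends the higher-level ground-truth question–answer pairs to the current question. The attribute names and the <SEP>-separated context format are assumptions for illustration (the latter follows the convention described for the baseline in Section 4.3).

```python
import itertools
import torch

def configure_training(model):
    """Freeze the encoders and build an AdamW optimizer over the trainable modules only."""
    for param in itertools.chain(model.image_encoder.parameters(),
                                 model.text_encoder.parameters()):
        param.requires_grad = False
    trainable = itertools.chain(model.align.parameters(),
                                model.decoders.parameters(),
                                model.classifier.parameters())
    return torch.optim.AdamW(trainable, lr=1e-5)

def build_context(question: str, history: list) -> str:
    """Teacher forcing: prepend ground-truth higher-level (question, answer) pairs."""
    prefix = " ".join(f"{q} <SEP> {a}" for q, a in history)
    return f"{prefix} {question}".strip()
```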
For fair and convenient comparison, we adopt the same evaluation method as Hi-VQA and context-VQA [29]. The evaluation uses macro-averaged precision, recall, and F1 over all possible paths of hierarchical questions, and simultaneously calculates the accuracy of the complete report. Report-level accuracy is the proportion of paths on which every question is predicted correctly; if a higher-level question on a path is predicted correctly but a lower-level question is not, the path counts as a wrong prediction. Evaluation is autoregressive, so the model uses the preceding questions and their predicted answers as historical context. The evaluation also enforces hierarchical consistency constraints: in hierarchical visual question answering tasks such as HiCA-VQA, if the model predicts “no” for a higher-level question, path reasoning is interrupted and the lower-level sub-questions in the hierarchy are automatically set to “no” or “not selected” (for level 3 questions), enforcing consistency in the predictions. This matches clinical practice and ensures that the generated reports are consistent and coherent: if a medical expert determines from a global view that there are no abnormalities in the image, more detailed, fine-grained investigation is unnecessary. Finally, since an object, sign, or pathology may appear multiple times in a patient, when the model predicts “yes” it iteratively asks about further occurrences (e.g., “Are there any other opaque areas in the lungs?”). The model limits the number of follow-up questions based on the maximum number of occurrences per patient in the data to ensure data consistency. Because the order of occurrences is not well defined, instance matching is applied during metric calculation to achieve the highest F1 score for each finding.
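For clarity, the helpers below sketch the consistency rule and the report-level (path) accuracy just described; the data structures and the simplified "no" default for all levels are assumptions made for the example.

```python
def enforce_consistency(path_predictions):
    """Once a higher-level answer is negative, force its descendants to the default answer.
    (For level 3 the default would be "not selected" rather than "no"; simplified here.)"""
    blocked, constrained = False, []
    for pred in path_predictions:          # ordered from level 1 down to level 3
        constrained.append("no" if blocked else pred)
        blocked = blocked or constrained[-1] == "no"
    return constrained

def report_accuracy(paths):
    """paths: iterable of question paths; each path is a list of (prediction, gold) pairs
    ordered from level 1 to level 3. A path counts only if every answer on it is correct."""
    paths = list(paths)
    correct = sum(all(pred == gold for pred, gold in path) for path in paths)
    return correct / max(len(paths), 1)
```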

4.3. Baseline and SOTA

We compare our method with the baseline Hi-VQA, which also uses the Rad-ReStruct dataset. The difference between our method and Hi-VQA is that Hi-VQA does not employ any prompt module. It adopts a conventional VQA architecture whose core pipeline consists of two stages: feature extraction and fusion. In the feature extraction stage, a pre-trained EfficientNet-b5 [40] image encoder extracts global and spatially aware local features, while a domain-specific RadBERT text encoder processes hierarchical text inputs, including historical question–answer pairs and the current question concatenated in the format <Question> <SEP> <Answer>. In the feature fusion stage, the image encoding, the RadBERT-encoded history, and the current question text are concatenated in the order <image> <history> <question> and injected with hybrid position encodings (2D sinusoidal encoding for the image part to retain spatial coordinates, and 1D absolute position encoding for the text part) and four types of token-type embeddings (distinguishing image, historical question, historical answer, and current question). The fusion module is a single-layer Transformer that performs cross-modal interaction with traditional multi-head self-attention, simultaneously capturing fine-grained associations between visual regions and medical terms (e.g., “upper lobe of the lung” and the corresponding image region) and the semantic constraints of historical answers on the current question (e.g., activating the “degree” attribute prediction when “pneumonia exists”) [41]. Finally, based on the question type, softmax single-label classification or sigmoid multi-label classification is used to generate predictions within the restricted answer space, and an autoregressive mechanism feeds high-level predictions as historical context for lower-level questions, ensuring the clinical plausibility of structured reports through logical consistency. This provides a baseline for medical hierarchical VQA.
The SOTA method, context-VQA, builds upon the Hi-VQA architecture by first using a free-text report summarized by a GPT [42] model as additional context. After passing through the text encoder, this context is aligned with the image features through an attention-based alignment module, then fused with the question features in the same single-layer Transformer, and finally fed into an MLP for answer prediction. Compared to Hi-VQA, it shows a performance improvement, but using GPT incurs additional cost and time, and using a large general-purpose GPT model instead of a medical pre-trained model may lose important medical information in the free-text report.
In summary, both Hi-VQA and context-VQA are traditional multi-modal fusion methods using Transformer, where features of different modalities are concatenated and then input into the self-attention module. This is different from the cross-attention-based fusion method of HiCA-VQA. Moreover, HiCA-VQA uses a hierarchical prompt approach rather than introducing additional context, saving time and cost.

4.4. Comparative Results

We tested our method on the VQA-Rad dataset as a comparative experiment and compared it with previous strong general-purpose methods that also use the VQA-Rad dataset (Table 1). VQA-Rad is a medical visual question answering benchmark containing 315 radiology images and 3515 questions; the task is to classify answers from a set of 2248 possible options. In VQA-Rad, multiple questions are posed for a single image, and in prior work these questions have always been answered separately. We therefore measured the accuracy of correctly answering questions for both previous methods and our method on VQA-Rad. As can be seen, our model achieves competitive results compared with these strong general-purpose methods. This demonstrates that, in medical visual question answering tasks, our method can enhance the model’s ability to predict answers to questions of different granularities, which aligns with the conclusions from our experiments.

4.5. Experimental Results

Table 2 shows the comparison of our proposed method on the Rad-ReStruct dataset against the baseline model Hi-VQA and the state-of-the-art context-VQA method. For fair comparison, we adopted the same evaluation method as Hi-VQA and context-VQA, as introduced above, and report accuracy, F1 score, precision, and recall.
In terms of report-level accuracy, HiCA-VQA, our hierarchical prompting and cross-attention visual question answering framework, achieves the highest value, improving by about 20 percent over Hi-VQA and surpassing the SOTA context-VQA, which reflects its improved ability to predict complete reports. At the same time, our F1 score is 18% higher than those of the other methods.
Table 3 shows the metrics of Hi-VQA, context-VQA, and HiCA-VQA at each question level. HiCA-VQA improves F1 at every level, and the F1 score for third-level fine-grained complex questions increases by more than 20 percent. This hierarchical performance gain verifies the effectiveness of the hierarchical prompt mechanism and the cross-modal cross-attention module proposed in this paper. The hierarchical decoding strategy guides the model to progressively focus on visual semantic features of different granularities, and the region-aligned attention weight allocation enhances the model’s collaborative perception of fine-grained multi-modal features, especially in complex reasoning tasks that require combining image spatial features with textual semantic constraints, where it shows stronger feature decoupling and fusion capabilities. This demonstrates the effectiveness of our method in complex medical image question answering and shows that hierarchical prompts help the model learn deeper and more complex questions. Report accuracy can be understood as path accuracy in a decision tree: if all three questions along a given path are predicted correctly, the path is considered correct; otherwise, it is considered incorrect. We use path accuracy instead of regular accuracy to encourage the model to make correct predictions along an entire diagnostic path. From the results above, our model achieves competitive report accuracy, though the improvement is not substantial; however, there is a significant improvement in the F1 score. Despite the large scale of the Rad-ReStruct dataset, which includes 3720 images matched to 3597 structured patient reports and involves over 180,000 questions, the dataset may contain relatively few positive samples, especially for level 3 questions, because level 3 questions are more complex and have more answer options. As shown above, our method demonstrates a significant improvement in predicting level 3 questions.

4.6. Ablation Experiments

We conducted ablation experiments to evaluate the impact of the hierarchical answer decoders and the cross-attention fusion module on the predictive ability of our HiCA-VQA model. Table 4 reports the effect of each component on performance in terms of accuracy, F1 score, precision, and recall. First, we introduce our hierarchical answer decoders into the Hi-VQA framework and, conversely, remove the hierarchical answer decoders from our architecture. The results show that performance drops without the hierarchical answer decoders, indicating the effectiveness of the hierarchical design and yielding an improvement over the baseline model. We further examine the importance of the cross-attention fusion module. Both Hi-VQA and context-VQA use a single-layer Transformer fusion module with a self-attention mechanism. We first introduce our cross-attention fusion module into Hi-VQA, using the image as the query and the text features as the key-value pair without the alignment step; the resulting performance improvement shows the effectiveness of the cross-attention fusion module. Similarly, we replace our fusion module with a single-layer Transformer fusion module. The experiments demonstrate that simple feature concatenation fails to model cross-modal relationships effectively, whereas our attention mechanism enables superior hierarchical fusion. The ablation studies confirm these advantages in visual QA tasks.

4.7. Qualitative Analysis

Figure 5 presents qualitative prediction examples comparing HiCA-VQA with Hi-VQA. The questions are arranged from left to right in the hierarchical order of their granularity to illustrate their dependencies. In the first case, Hi-VQA generates a negative response to the initial question, which propagates to subsequent questions and results in cascading negative predictions. In the second and third examples, our method predicts the lower-level questions more accurately than Hi-VQA. This observation underscores that hierarchical answer decoding enables contextually adaptive predictions for questions of varying granularities, thereby enhancing overall prediction accuracy, consistent with prior studies.

5. Discussion

In this study, we introduce HiCA-VQA, a novel framework for hierarchical visual question answering. Experimental results show superior performance over conventional approaches, confirming its ability to leverage the multimodal complementarity between medical images and diagnostic reports through comprehensive cross-modal processing. In traditional medical visual question answering systems, the image and text are simply combined and fed into the fusion module; in our method, cross-attention models the correlation between image and text directly. From a practical point of view, people often observe images guided by the text content in their minds, so our method is reasonable both experimentally and practically. Although our method improves the metrics and achieves good results, it is still far from fully accurate prediction, and in medical diagnosis even small mistakes are unacceptable. In hierarchical VQA, one challenge is error propagation: once a high-level question is predicted incorrectly, the low-level questions are not predicted, so the error of the high-level question is transmitted to the lower levels. Addressing this may require adjusting the metrics or greatly improving the prediction accuracy of high-level questions, which calls for extensive follow-up experiments and a dataset with a reasonable sample distribution. In conclusion, our methodology exhibits three principal limitations: (1) hierarchical medical QA systems inherently pose error-propagation risks, where prediction deviations in upper-tier questions may induce cascading failures in downstream tiers, necessitating mitigation through optimized hierarchical metric weighting and sample distribution calibration; (2) contemporary medical datasets remain constrained by prohibitive annotation costs and dependence on expert-crafted templates, and generalization to rare diseases and fine-grained attribute prediction remains suboptimal; (3) prevailing evaluation frameworks rely heavily on macro-averaged F1 scores while clinical decision causality chains remain under-validated, compounded by persistent barriers to multi-institutional data sharing due to privacy constraints and annotation standard discrepancies.

6. Conclusions

In this study, we introduce HiCA-VQA, a method for medical hierarchical visual question answering that provides different prompts according to the level of the current question and makes more effective use of the interaction between images and texts. Given the increasing number of datasets with medical reports and medical images, this work paves the way for further exploration of medical multi-modal information fusion methods to enhance the capabilities of medical VQA systems and improve medical AI systems, assisting medical experts in diagnosing diseases and improving the efficiency of medical work. Future work may focus on developing dynamically adjustable, self-adaptive hierarchical architecture optimization mechanisms through the integration of multi-modal medical data streams and attention-guided hierarchical pathway generation algorithms, while exploring synergistic enhancement of cross-modal semantic alignment and fine-grained reasoning. Concurrently, efforts should prioritize constructing explainable decision traceability systems grounded in medical ontology knowledge graphs, leveraging the deep coupling between visual attention mapping and clinical diagnostic logic to advance trustworthy AI diagnostic frameworks compliant with medical regulatory standards. At the same time, future work may further improve the accuracy of fine-grained questions at each level, introduce hierarchical weighted loss functions and confidence filtering, or design dynamic reasoning paths to further mitigate error propagation. We may also explore learnable prompt vector embeddings for the hierarchical alignment modules or utilize medical knowledge graphs to dynamically generate context-aware prompts. These improvements would preserve the hierarchical reasoning paradigm while enabling adaptive prompt–text interactions, enhancing robustness across various medical VQA scenarios. In addition, future work should explore the potential of agentic AI in medical VQA [49], such as integrating dynamic clinical decision frameworks for real-time image–text reasoning with fewer annotations. Iterative self-correction and uncertainty modeling could also address visual ambiguity and contextual complexity, advancing systems from passive QA to active collaboration.

Author Contributions

Methodology, J.Z.; software, J.Z. and B.L.; validation, J.Z. and B.L.; formal analysis, J.Z.; investigation, J.Z. and B.L.; data curation, J.Z. and B.L.; writing-original draft preparation: J.Z. and B.L.; writing-review and editing: J.Z., B.L. and S.Z.; visualization, J.Z.; supervision, B.L. and S.Z.; Project administration, B.L. and S.Z.; funding acquisition, B.L. and S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Guangdong Province (No. 2023A1515010673), in part by the Shenzhen Science and Technology Innovation Bureau key project (No. JSGG20220831110400001, No. CJGJZD20230724093303007, No. KJZD20240903101259001), in part by Shenzhen Medical Research Fund (No. D2404001), in part by Shenzhen Engineering Laboratory for Diagnosis & Treatment Key Technologies of Interventional Surgical Robots (XMHT20220104009), and the Key Laboratory of Biomedical Imaging Science and System, CAS, for the Research platform support.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used and analyzed during the current study are available in: https://github.com/ChantalMP/Rad-ReStruct (accessed on 5 January 2025).

Acknowledgments

We thank Yue Du for his assistance in data validation and funding coordination. During the preparation of this manuscript/study, the authors used the Deepseek API (https://platform.deepseek.com/api_keys, accessed on 8 January 2025) to summarize structured reports. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors confirm that there are no conflicts of interest, and the research was carried out without any involvement of commercial or financial relationships.

References

  1. Ben Abacha, A.; Hasan, S.A.; Datla, V.V.; Demner-Fushman, D.; Müller, H. Vqa-med: Overview of the medical visual question answering task at imageclef 2019. In Proceedings of the CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes, Lugano, Switzerland, 9–12 September 2019. [Google Scholar]
  2. Li, S.; Li, B.; Sun, B.; Weng, Y. Towards Visual-Prompt Temporal Answer Grounding in Instructional Video. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8836–8853. [Google Scholar] [CrossRef] [PubMed]
  3. Lin, Z.; Zhang, D.; Tao, Q.; Shi, D.; Haffari, G.; Wu, Q.; He, M.; Ge, Z. Medical visual question answering: A survey. Artif. Intell. Med. 2023, 143, 102611. [Google Scholar] [CrossRef] [PubMed]
  4. Xie, R.; Jiang, L.; He, X.; Pan, Y.; Cai, Y. A Weakly Supervised and Globally Explainable Learning Framework for Brain Tumor Segmentation. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 5–19 July 2024; pp. 1–6. [Google Scholar]
  5. Gabruseva, T.; Poplavskiy, D.; Kalinin, A. Deep learning for automatic pneumonia detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 350–351. [Google Scholar]
  6. Xiang, J.; Wang, X.; Zhang, X.; Xi, Y.; Eweje, F.; Chen, Y.; Li, Y.; Bergstrom, C.; Gopaulchan, M.; Kim, T.; et al. A vision–language foundation model for precision oncology. Nature 2025, 638, 769–778. [Google Scholar] [CrossRef] [PubMed]
  7. Liu, J.; Wang, Y.; Du, J.; Zhou, J.; Liu, Z. MedCoT: Medical Chain of Thought via Hierarchical Expert. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 17371–17389. [Google Scholar]
  8. Xie, R.; Chen, J.; Jiang, L.; Xiao, R.; Pan, Y.; Cai, Y. Accurate Explanation Model for Image Classifiers using Class Association Embedding. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The Netherlands, 13–17 May 2024; pp. 2271–2284. [Google Scholar]
  9. Pellegrini, C.; Keicher, M.; Özsoy, E.; Navab, N. Rad-restruct: A novel vqa benchmark and method for structured radiology reporting. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023; pp. 409–419. [Google Scholar]
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  11. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
  12. Diao, X.; Zhang, C.; Wu, W.; Ouyang, Z.; Qing, P.; Cheng, M.; Vosoughi, S.; Gui, J. Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding. arXiv 2025, arXiv:2502.06020. [Google Scholar]
  13. Diao, X.; Cheng, M.; Barrios, W.; Jin, S. FT2TF: First-Person Statement Text-To-Talking Face Generation. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 28 February–4 March 2025; pp. 4821–4830. [Google Scholar]
  14. Wu, W.; Dai, T.; Huang, X.; Ma, F.; Xiao, J. Image Augmentation with Controlled Diffusion for Weakly-Supervised Semantic Segmentation. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 6175–6179. [Google Scholar]
  15. Eslami, S.; Meinel, C.; De Melo, G. Pubmedclip: How much does clip benefit visual question answering in the medical domain? In Proceedings of the Findings of the Association for Computational Linguistics: EACL 2023, Dubrovnik, Croatia, 2–6 May 2023; pp. 1181–1193. [Google Scholar]
  16. Hafner, M.; Katsantoni, M.; Köster, T.; Marks, J.; Mukherjee, J.; Staiger, D.; Ule, J.; Zavolan, M. CLIP and complementary methods. Nat. Rev. Methods Prim. 2021, 1, 20. [Google Scholar] [CrossRef]
  17. Yan, A.; McAuley, J.; Lu, X.; Du, J.; Chang, E.Y.; Gentili, A.; Hsu, C.N. RadBERT: Adapting transformer-based language models to radiology. Radiol. Artif. Intell. 2022, 4, e210258. [Google Scholar] [CrossRef] [PubMed]
  18. Hayat, N.; Geras, K.J.; Shamout, F.E. MedFuse: Multi-modal fusion with clinical time-series data and chest X-ray images. In Proceedings of the Machine Learning for Healthcare Conference, PMLR, Durham, NC, USA, 5–6 August 2022; pp. 479–503. [Google Scholar]
  19. Lau, J.J.; Gayen, S.; Ben Abacha, A.; Demner-Fushman, D. A dataset of clinically generated visual questions and answers about radiology images. Sci. Data 2018, 5, 180251. [Google Scholar] [CrossRef] [PubMed]
  20. Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  21. Syeda-Mahmood, T.; Wong, K.C.; Gur, Y.; Wu, J.T.; Jadhav, A.; Kashyap, S.; Karargyris, A.; Pillai, A.; Sharma, A.; Syed, A.B.; et al. Chest x-ray report generation through fine-grained label learning. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, 4–8 October 2020; Proceedings, Part II 23. Springer: Berlin/Heidelberg, Germany, 2020; pp. 561–571. [Google Scholar]
  22. Bhalodia, R.; Hatamizadeh, A.; Tam, L.; Xu, Z.; Wang, X.; Turkbey, E.; Xu, D. Improving pneumonia localization via cross-attention on medical images and reports. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part II 24. Springer: Berlin/Heidelberg, Germany, 2021; pp. 571–581. [Google Scholar]
  23. Kovaleva, O.; Shivade, C.; Kashyap, S.; Kanjaria, K.; Wu, J.; Ballah, D.; Coy, A.; Karargyris, A.; Guo, Y.; Beymer, D.B.; et al. Towards visual dialog for radiology. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, Online, 9 July 2020; pp. 60–69. [Google Scholar]
  24. He, X.; Zhang, Y.; Mou, L.; Xing, E.; Xie, P. Pathvqa: 30000+ questions for medical visual question answering. arXiv 2020, arXiv:2003.10286. [Google Scholar]
  25. Liu, B.; Zhan, L.M.; Xu, L.; Ma, L.; Yang, Y.; Wu, X.M. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021; pp. 1650–1654. [Google Scholar]
  26. Luo, R.; Sun, L.; Xia, Y.; Qin, T.; Zhang, S.; Poon, H.; Liu, T.Y. BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Briefings Bioinform. 2022, 23, bbac409. [Google Scholar] [CrossRef]
  27. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
  28. Fidder, H.; Chowers, Y.; Ackerman, Z.; Pollak, R.D.; Crusius, J.B.A.; Livneh, A.; Bar-Meir, S.; Avidan, B.; Shinhar, Y. The familial Mediterranean fever (MEVF) gene as a modifier of Crohn’s disease. Off. J. Am. Coll. Gastroenterol.|ACG 2005, 100, 338–343. [Google Scholar] [CrossRef]
  29. Arsalane, W.; Chikontwe, P.; Luna, M.; Kang, M.; Park, S.H. Context-Guided Medical Visual Question Answering. In Medical Information Computing; Springer: Berlin/Heidelberg, Germany, 2024; pp. 245–255. [Google Scholar] [CrossRef]
  30. White, J. PubMed 2.0. Med. Ref. Serv. Q. 2020, 39, 382–387. [Google Scholar] [CrossRef] [PubMed]
  31. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
32. Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 5797–5808. [Google Scholar]
  33. Demner-Fushman, D.; Kohli, M.D.; Rosenman, M.B.; Shooshan, S.E.; Rodriguez, L.; Antani, S.; Thoma, G.R.; McDonald, C.J. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 2016, 23, 304–310. [Google Scholar] [CrossRef] [PubMed]
  34. Lipscomb, C.E. Medical subject headings (MeSH). Bull. Med. Libr. Assoc. 2000, 88, 265. [Google Scholar] [PubMed]
  35. Langlotz, C. RadLex: A new method for indexing online educational materials. Radiographics 2006, 26, 1595–1597. [Google Scholar] [CrossRef] [PubMed]
  36. Di, Y.; Shi, H.; Ma, R.; Gao, H.; Liu, Y.; Wang, W. FedRL: A reinforcement learning federated recommender system for efficient communication using reinforcement selector and hypernet generator. ACM Trans. Recomm. Syst. 2024. [Google Scholar] [CrossRef]
  37. Wu, W.; Qiu, X.; Song, S.; Chen, Z.; Huang, X.; Ma, F.; Xiao, J. Image Augmentation Agent for Weakly Supervised Semantic Segmentation. arXiv 2024, arXiv:2412.20439. [Google Scholar]
  38. Wu, W.; Dai, T.; Chen, Z.; Huang, X.; Xiao, J.; Ma, F.; Ouyang, R. Adaptive Patch Contrast for Weakly Supervised Semantic Segmentation. Eng. Appl. Artif. Intell. 2025, 139, 109626. [Google Scholar] [CrossRef]
  39. Di, Y.; Wang, X.; Shi, H.; Fan, C.; Zhou, R.; Ma, R.; Liu, Y. Personalized Consumer Federated Recommender System Using Fine-grained Transformation and Hybrid Information Sharing. IEEE Trans. Consum. Electron. 2025; early access. [Google Scholar] [CrossRef]
  40. Koonce, B. EfficientNet. In Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Springer: Berlin/Heidelberg, Germany, 2021; pp. 109–123. [Google Scholar]
  41. Li, B.; Sun, B.; Li, S.; Chen, E.; Liu, H.; Weng, Y.; Bai, Y.; Hu, M. Distinct but correct: Generating diversified and entity-revised medical response. Sci. China Inf. Sci. 2024, 67, 132106. [Google Scholar] [CrossRef]
  42. Floridi, L.; Chiriatti, M. GPT-3: Its nature, scope, limits, and consequences. Minds Mach. 2020, 30, 681–694. [Google Scholar] [CrossRef]
  43. Nguyen, B.D.; Do, T.T.; Nguyen, B.X.; Do, T.; Tjiputra, E.; Tran, Q.D. Overcoming data limitation in medical visual question answering. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, 13–17 October 2019; Proceedings, Part IV 22. Springer: Berlin/Heidelberg, Germany, 2019; pp. 522–530. [Google Scholar]
  44. Do, T.; Nguyen, B.X.; Tjiputra, E.; Tran, M.; Tran, Q.D.; Nguyen, A. Multiple meta-model quantifying for medical visual question answering. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part V 24. Springer: Berlin/Heidelberg, Germany, 2021; pp. 64–74. [Google Scholar]
  45. Khare, Y.; Bagal, V.; Mathew, M.; Devi, A.; Priyakumar, U.D.; Jawahar, C. Mmbert: Multimodal bert pretraining for improved medical vqa. In Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021; pp. 1033–1036. [Google Scholar]
46. Liu, B.; Zhan, L.M.; Wu, X.M. Contrastive pre-training and representation distillation for medical visual question answering based on radiology images. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part II 24. Springer: Berlin/Heidelberg, Germany, 2021; pp. 210–220. [Google Scholar]
  47. Tanwani, A.K.; Barral, J.; Freedman, D. Repsnet: Combining vision with language for automated medical reports. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 714–724. [Google Scholar]
  48. Chen, Z.; Du, Y.; Hu, J.; Liu, Y.; Li, G.; Wan, X.; Chang, T.H. Multi-modal masked autoencoders for medical vision-and-language pre-training. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 679–689. [Google Scholar]
  49. Shavit, Y.; Agarwal, S.; Brundage, M.; Adler, S.; O’Keefe, C.; Campbell, R.; Lee, T.; Mishkin, P.; Eloundou, T.; Hickey, A.; et al. Practices for Governing Agentic AI Systems. Research Paper, OpenAI. 2023. Available online: https://cdn.openai.com/papers/practices-for-governing-agentic-ai-systems.pdf (accessed on 5 January 2025).
Figure 1. A schematic diagram of a traditional hierarchical medical visual question answering framework. Medical images and fine-grained hierarchical medical questions are fed into an image encoder and a text encoder. The encoded features are then input into a Transformer-based fusion module for multi-modal feature integration, and finally, an MLP classification layer is employed to predict the answer candidates for the corresponding medical question.
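As a point of reference for the pipeline sketched in Figure 1, the following minimal example illustrates concatenated self-attention fusion followed by an MLP answer head. It is an illustrative sketch rather than the authors' implementation; the hidden size, layer count, and number of answer candidates are assumed values.

```python
import torch
import torch.nn as nn

class ConcatSelfAttentionVQA(nn.Module):
    """Illustrative baseline: concatenate image and question tokens,
    fuse them with Transformer self-attention, classify the answer."""
    def __init__(self, dim=768, num_answers=96, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls_head = nn.Sequential(nn.LayerNorm(dim),
                                      nn.Linear(dim, num_answers))

    def forward(self, image_tokens, question_tokens):
        # image_tokens: (B, N_img, dim), question_tokens: (B, N_txt, dim)
        fused = self.fusion(torch.cat([image_tokens, question_tokens], dim=1))
        pooled = fused.mean(dim=1)      # simple mean pooling over all tokens
        return self.cls_head(pooled)    # logits over answer candidates

# Random features stand in for encoder outputs in this toy example.
model = ConcatSelfAttentionVQA()
logits = model(torch.randn(2, 49, 768), torch.randn(2, 16, 768))
print(logits.shape)  # torch.Size([2, 96])
```

Note that in this layout the two modalities interact only inside the shared self-attention layers, since the tokens are merely concatenated before fusion.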
Figure 2. Overview of the HiCA-VQA architecture. It comprises (1) a hierarchical prompting module that generates prompts for questions at different levels, (2) an image encoder, (3) a text encoder, (4) an alignment module that aligns image and prompt features, and (5) hierarchical answer decoders that fuse multi-modal features for the final prediction.
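To make the fusion stage in Figure 2 concrete, here is a minimal sketch of a cross-attention block in which image tokens serve as queries and text tokens as keys and values. The dimensions, head count, and residual/feed-forward layout are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Image-as-query, text-as-key/value cross-attention fusion (sketch)."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, image_tokens, text_tokens):
        # Queries come from the image; keys/values come from the text.
        attended, _ = self.cross_attn(query=image_tokens,
                                      key=text_tokens, value=text_tokens)
        x = self.norm1(image_tokens + attended)   # residual + norm
        return self.norm2(x + self.ffn(x))        # position-wise FFN

fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 49, 768), torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 49, 768])
```

With N_img image tokens and N_txt text tokens, the attention map in this block is N_img × N_txt, smaller than the (N_img + N_txt)² map produced by self-attention over concatenated tokens.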
Figure 3. Hierarchical questions overview: The questions are organized into three levels, representing a stepwise refinement of inquiries regarding the patient’s medical imaging condition. The first two levels employ binary “Yes” or “No” response candidates, while the final level contains multiple-choice candidates primarily describing pathological attributes.
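The three-level structure in Figure 3 can be pictured as a simple tree. The hypothetical snippet below shows one possible encoding and a level-by-level traversal; the questions and answer candidates are placeholders, not items taken from the Rad-Restruct dataset.

```python
# Hypothetical three-level question hierarchy (placeholder content only).
hierarchy = {
    "Are there signs of disease in the lung?": {          # Level 1: topic existence
        "answers": ["yes", "no"],
        "children": {
            "Is there an opacity?": {                      # Level 2: element existence
                "answers": ["yes", "no"],
                "children": {
                    "Which attributes describe it?": {     # Level 3: attributes
                        "answers": ["left", "right", "upper", "lower", "diffuse"],
                        "children": {},
                    }
                },
            }
        },
    }
}

def iterate_levels(node, level=1):
    """Yield (level, question, candidate answers) in top-down order."""
    for question, spec in node.items():
        yield level, question, spec["answers"]
        yield from iterate_levels(spec["children"], level + 1)

for level, question, answers in iterate_levels(hierarchy):
    print(level, question, answers)
```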
Figure 4. The process of feeding level-specific prompts and structured reports into the DeepSeek model for summarization, ultimately generating disease findings at different granularities.
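The summarization step in Figure 4 amounts to pairing a level-specific instruction with the structured report before querying the language model. The sketch below only assembles such prompts; the template wording is invented for illustration, and the commented-out `call_deepseek` helper is a hypothetical placeholder rather than a real API.

```python
# Hypothetical prompt templates for each question level (not the paper's exact wording).
LEVEL_PROMPTS = {
    1: "Summarize whether the following structured report mentions any finding topics.",
    2: "Summarize which anatomical elements are reported as present for each topic.",
    3: "Summarize the attributes (location, degree, ...) of each reported element.",
}

def build_prompt(level: int, structured_report: str) -> str:
    """Combine a level-specific instruction with the structured report text."""
    return f"{LEVEL_PROMPTS[level]}\n\nReport:\n{structured_report}"

report = "lung: opacity present, right lower lobe, mild"   # toy structured report
for level in (1, 2, 3):
    prompt = build_prompt(level, report)
    # summary = call_deepseek(prompt)   # hypothetical LLM call, omitted here
    print(prompt[:60], "...")
```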
Figure 5. A schematic diagram of a hierarchical medical visual question answering framework. Medical images and fine-grained hierarchical medical questions are fed into an image encoder and a text encoder. The encoded features are then input into a Transformer-based fusion module for multi-modal feature integration, and finally, an MLP classification layer is employed to predict the answer candidates for the corresponding medical question.
Table 1. Comparative Experiments on Previous General-Purpose Datasets.

Model              Acc
MEVF [43]          66.1
MMQ [44]           67.0
MM-BERT [45]       72.0
CRPD [46]          72.7
RepsNet [47]       73.5
M3AE [48]          77.0
Hi-VQA [9]         76.3
HiCA-VQA (Ours)    79.6
Table 2. Performance comparison on the Rad-Restruct dataset. We compare three methods: Hi-VQA, con-VQA [29], and our HiCA-VQA; the best score in each column is marked with an asterisk (*).

Model              Report Accuracy   F1      Prec    Recall
Hi-VQA [9]         32.6              31.9    59.9    34.1
con-VQA [29]       39.7              31.0    90.4*   33.6
HiCA-VQA (Ours)    39.9*             49.1*   69.8    34.2*
Table 3. A comparison of Hi-VQA [9], Context-VQA [29], and HiCA-VQA for each question level. Each cell lists Acc / F1 / Pre / Rec for the corresponding model.

Level                Hi-VQA [9]                  Context-VQA [29]            HiCA-VQA (Ours)
Level 1              33.6 / 64.3 / 81.0 / 64.5   34.7 / 67.2 / 80.7 / 61.2   33.7 / 68.5 / 81.1 / 64.6
Level 2 (all)        31.0 / 71.6 / 85.2 / 72.0   32.9 / 71.8 / 88.9 / 70.8   31.0 / 78.3 / 86.0 / 72.0
Level 2 (diseases)   48.1 / 73.5 / 83.8 / 71.3   52.1 / 72.8 / 89.6 / 72.7   48.2 / 81.1 / 84.5 / 74.1
Level 2 (signs)      71.9 / 74.2 / 93.1 / 74.4   74.4 / 73.7 / 90.6 / 73.7   71.9 / 77.1 / 93.1 / 74.2
Level 2 (objects)    87.4 / 67.0 / 77.1 / 67.5   91.4 / 67.2 / 85.0 / 68.6   87.7 / 84.6 / 77.5 / 67.9
Level 2 (regions)    52.4 / 68.1 / 82.1 / 69.5   61.2 / 68.7 / 85.4 / 68.3   52.4 / 72.5 / 84.1 / 69.6
Level 3              30.2 / 4.1 / 49.9 / 6.2     32.5 / 3.2 / 68.7 / 4.2     29.6 / 29.0 / 58.5 / 7.9
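For readers who want to report scores in the format of Table 3, the toy snippet below computes accuracy, F1, precision, and recall per level with scikit-learn. The labels are made up and macro averaging is an assumption; the paper's exact evaluation protocol is not reproduced here.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy per-level ground truth and predictions (binary candidates for levels 1-2,
# attribute choices for level 3).
levels = {
    "Level 1": (["yes", "no", "yes", "yes"], ["yes", "no", "no", "yes"]),
    "Level 2": (["yes", "yes", "no", "no"], ["yes", "no", "no", "no"]),
    "Level 3": (["left", "right", "upper", "left"], ["left", "upper", "upper", "left"]),
}

for name, (y_true, y_pred) in levels.items():
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    print(f"{name}: Acc={acc:.3f} F1={f1:.3f} Prec={prec:.3f} Rec={rec:.3f}")
```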
Table 4. Ablation results. SF denotes the self-attention fusion module, CF denotes cross-attention fusion, AL denotes the alignment module, and HD denotes the hierarchical answer decoders; × marks a module that is not used in the corresponding variant.

Method             Variant   Modules     Acc    F1     Pre    Rec
Hi-VQA [9]         (a)       × × ×       32.6   31.7   70.7   32.1
Hi-VQA [9]         (b)       × ×         33.7   29.5   80.4   30.7
con-VQA [29]       (a)       × × ×       32.6   28.7   80.0   28.8
con-VQA [29]       (b)       × ×         39.7   31.0   90.4   33.6
HiCA-VQA (Ours)    (a)       × × ×       38.0   33.0   67.7   32.2
HiCA-VQA (Ours)    (b)       ×           36.8   32.7   68.2   32.3
HiCA-VQA (Ours)    (c)       ×           39.9   49.1   69.8   34.3
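To illustrate the HD (hierarchical answer decoders) component ablated in Table 4, here is a minimal sketch with one lightweight classification head per question level on top of pooled fused features. The hidden size and per-level answer counts are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HierarchicalAnswerDecoders(nn.Module):
    """One answer head per question level (sketch of the HD module)."""
    def __init__(self, dim=768, answers_per_level=(2, 2, 32)):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, n))
             for n in answers_per_level]
        )

    def forward(self, fused_tokens, level):
        # fused_tokens: (B, N, dim); level is 1-indexed as in Figure 3.
        pooled = fused_tokens.mean(dim=1)
        return self.heads[level - 1](pooled)   # logits for that level's candidates

decoders = HierarchicalAnswerDecoders()
logits_l3 = decoders(torch.randn(2, 49, 768), level=3)
print(logits_l3.shape)  # torch.Size([2, 32])
```

Separating the heads in this way lets each level keep its own output space instead of sharing a single classifier across all granularities.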
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
