Vision–Language Model for Visual Question Answering in Medical Imagery

In the clinical and healthcare domains, medical images play a critical role. A mature medical visual question answering (VQA) system can improve diagnosis by answering clinical questions posed about a medical image. Despite its enormous potential for healthcare services, this technology is still in its infancy and far from practical use. This paper introduces an approach based on a transformer encoder-decoder architecture. Specifically, we extract image features using a vision transformer (ViT) model and embed the question with a textual encoder transformer. We then concatenate the resulting visual and textual representations and feed them into a multi-modal decoder that generates the answer in an autoregressive way. In the experiments, we validate the proposed model on two medical VQA datasets, VQA-RAD (radiology) and PathVQA (pathology). The model shows promising results compared to existing solutions, yielding closed-ended and open-ended accuracies of 84.99% and 72.97%, respectively, on VQA-RAD, and 83.86% and 62.37%, respectively, on PathVQA. Other metrics, such as the BLEU score measuring the alignment between the predicted and ground-truth answer sentences, are also reported.


Introduction
Assessing the physical state of the human body requires a variety of non-invasive sensory data. In particular, the use of medical imaging in disease diagnosis, screening, and surgical operations has prompted significant advances in this area [1,2]. Further, multi-modal data enable physicians to minimize diagnosis errors significantly [3,4]. One of the recent research topics in contemporary computer-aided diagnosis (CAD) is medical visual question answering (VQA) [5][6][7]. It involves both computer vision (CV) and natural language processing (NLP).
Typical medical image analysis tasks include detection [8] and segmentation [9], which deal with spatial information about objects and/or specific regions of interest in the image at hand. However, the need for richer image analysis gave rise to more sophisticated paradigms such as image captioning [10,11], which consists of producing an informative textual caption for an image. For instance, whereas object detection merely outputs the spatial coordinates of objects in the image, image captioning goes a step further by describing the relative positions of objects, their colors/textures, and the ongoing dynamics in the scene (e.g., a dog jumps over a hurdle, a kid is wearing a red hat, a guy is playing basketball).
In this regard, recent trends in computer vision push image captioning toward an even finer-grained task, namely visual question answering (VQA), by querying detailed specifics of the image [12] (e.g., how many buildings are there at the top right of the image? Are there cars parked on the roadside?). Therefore, unlike the aforementioned image analysis approaches that require only an image as input, VQA takes both the image and a relevant question. Although VQA has a solid research basis in general computer vision, thanks to the abundance of object annotations, it is still relatively lagging behind in medical imaging, and there is much room for improvement.
A VQA pipeline normally consists of four blocks, namely (i) a feature extractor to derive visual features from the query image, (ii) a textual feature extractor from the query question, (iii) an embedding module that incorporates the former two modalities, and (iv) a prediction head.
In this respect, the majority of existing medical VQA solutions rely on deep learning techniques such as recurrent neural networks (RNNs) [13] for text embedding and feature extraction, convolutional neural networks (CNNs) for visual feature extraction, and advanced techniques such as attention mechanisms. More recently, transformers have been applied effectively to the medical VQA task. Transformers were first utilized in NLP tasks such as machine translation [14] and speech recognition [15]. Their encoder-decoder architecture relies solely on the self-attention mechanism. Unlike an RNN, which processes sequence items recursively and attends mainly to short-term context, a transformer can learn long-range associations between sequence elements by attending to the whole sequence at once. In particular, the bidirectional encoder representations from transformers (BERT) model [16] has become the most widely used model for representing textual information.
BERT is a language model that employs a bidirectional attention mechanism trained on large-scale unsupervised corpora to provide a context-sensitive representation of each word in a phrase.
The success of transformer networks in NLP has attracted a lot of interest in computer vision. Vision transformers were first proposed in [17] and have since been successfully used in many tasks [18][19][20][21][22]. They apply a conventional transformer to images directly by dividing each image into patches that are treated like tokens in NLP applications [23][24][25].
However, despite the achievements on vision-only and language-only tasks, there has been comparatively little effort to address multi-modal tasks with transformers [26][27][28][29][30][31]. Among these efforts, the authors of [26] introduced a UNiversal Image-TExt Representation (UNITER) that uses a large-scale pre-trained model for joint multi-modal embedding. Hu et al. [27] proposed a unified transformer (UniT) that encodes each input modality with a dedicated encoder and uses a joint decoder to make predictions for multiple tasks, including VQA. In contrast, VQA in the medical domain has not yet been addressed with multi-modal transformers.
Therefore, in this study we exploit transformer models for VQA in medical images. We devise a full transformer encoder-decoder architecture that takes medical images and relevant text as inputs and jointly uses them for training. This model consists of two encoders to encode each input modality. The extracted features from both modalities are fused using concatenation, followed by a decoder to draw the final answer predictions. With respect to previous literature on multi-modal learning with transformers, to the best of our knowledge, our work is the first that operates on medical images. Compared to the existing solutions, our model includes a decoder part for generating the open-ended answers in an autoregressive way.
The remainder of this paper is organized as follows: Section 2 reviews related works on VQA and the multi-modal transformer. Section 3 introduces the proposed VQA methodology. Section 4 presents detailed experimental analysis and presents comparisons with state-of-the-art methods. Finally, Section 5 draws conclusions and discusses future developments.

Related Work
In this section, we first provide an overview of the related works for VQA in the general computer vision and medical domains. Then, we discuss the multi-modal transformer architectures proposed in the literature.

Visual Question Answering
With the success of deep learning, VQA in the general domain has received increasing attention in recent years and has made tangible progress, mainly by exploiting deep CNN and RNN pipelines. For instance, many works rely on recurrent models such as LSTM [32], together with word embeddings such as word2vec [33] and GloVe [34], to capture the relationships between words and extract textual features. CNN architectures such as VGG16 [35], ResNet [36], and Faster R-CNN [37] were adopted for extracting low- and high-level visual cues. The two feature representations were then combined/fused to answer the query question by using more advanced attention techniques such as stacked attention networks (SANs) [38] and hierarchical co-attention [39].
VQA in the medical domain is derived from natural-image VQA. Although some methods designed for natural images do not work effectively on medical images because of their different characteristics, many ideas and strategies from natural images are still beneficial to the medical field. However, progress in medical VQA has been much slower due to the lack of large-scale labeled training data, which requires professional medical staff and considerable time to carefully select and label high-quality relevant data. To date, only a few medical VQA datasets [40,41] have been proposed. The VQA-RAD dataset [41] is one of the few datasets providing manually annotated data for medical VQA, and many studies [6][42][43][44][45] developed their methods based on it. For instance, the authors of [6] proposed a novel framework for learning reasoning skills conditioned on both question information and task type information. The authors of [43] utilized unlabeled radiology images to train teacher models via contrastive learning; the knowledge of the teacher models is distilled into a lightweight student model that is fine-tuned for medical VQA. Do et al. [43] presented a multiple meta-model quantifying method that learns meta-annotations and deals with noisy labels to provide robust features for medical VQA tasks. Pan et al. [44] proposed a multi-view attention-based model that attends to the question from two views, word-to-text and image-to-question, which helps to fuse the high-level semantics of images on the basis of text. The work in [45] adopted ResNet-34 [36] as a backbone, jointly pre-trained on an image understanding task and a question-image compatibility task; the model uses a cross-modal self-attention module for visual and linguistic feature fusion. PathVQA [40] is another dataset, containing more questions than other medical VQA datasets. The work in [40] proposed bilinear attention networks (BANs) based on gated recurrent units (GRUs) [46] and a Faster R-CNN network for feature extraction. In [47], a model pre-trained with cross-modal self-supervised learning was devised to extract visual features; the model uses a learning-by-ignoring method to remove problematic training samples. In [48], an encoder-decoder architecture with a three-level optimization framework relying on cross-modal self-supervised learning was developed to improve performance. Sharma et al. [49] proposed a model based on ResNet and BERT with attention modules that focus on the relevant parts of the medical images and questions; the model predicts the answer with either a classification or a generation head, depending on the type of question. Finally, in [50], a bi-branched model was proposed in which the first branch answers closed-ended questions with a transformer architecture, and the second branch answers open-ended questions with image retrieval that gives the most similar answer to the test image.
Despite the potential of these studies, medical VQA research still lags behind general computer vision VQA in terms of methodology. Furthermore, the approaches proposed to date for medical VQA show limited performance, and there is still much room for improvement.

Multi-Modal Transformer
Transformers were first applied in NLP tasks [51][52][53][54] and subsequently achieved satisfactory results in computer vision tasks. Most of the previous efforts on multi-modal learning focused on specific domains or single modalities. For instance, in [55] the authors applied an encoder-decoder architecture based on the transformer's multi-head attention mechanism for image classification, machine translation, and image captioning. The work in [56] opts for a long-short transformer (Transformer-LS) with an efficient self-attention mechanism for both language and vision tasks. Tan et al. [30] utilized a method denoted as learning cross-modality encoder representations from transformers (LXMERT) that uses three transformer encoders with co-attention and only pre-trained the model with in-domain data. Lu et al. [29] used vision-and-language BERT (ViLBERT) with the same architecture but with more complex co-attention, and pre-trained with out-of-domain data. Despite the above-mentioned efforts, there are no existing works that attempt to tailor multi-modal transformers for medical VQA.

Methodology
Let us consider a set $D = \{X_i, q_i, y_i\}_{i=1}^{N}$ composed of $N$ triplets of a medical image, a question, and the corresponding answer. The goal is to teach the model to generate the correct answer $y_i$ to the question $q_i$ about the given medical image $X_i$. Figure 1 shows the overall framework of our medical VQA model, which is composed of an image encoder, a question encoder, and an answer decoder.
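As a concrete illustration of the triplet structure of $D$, the following minimal PyTorch sketch wraps (image, question, answer) records in a Dataset. The annotation file layout and the keys "image", "question", and "answer" are hypothetical placeholders, not the actual VQA-RAD or PathVQA formats.

```python
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class MedVQATriplets(Dataset):
    """Wraps (image, question, answer) triplets for VQA-RAD / PathVQA-style data.

    Expects a JSON list of records with hypothetical keys "image", "question",
    and "answer"; adapt to the actual annotation format of each dataset.
    """

    def __init__(self, annotation_file, image_dir, transform=None):
        self.records = json.loads(Path(annotation_file).read_text())
        self.image_dir = Path(image_dir)
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        rec = self.records[i]
        image = Image.open(self.image_dir / rec["image"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, rec["question"], rec["answer"]
```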
Our model consists of a separate encoder for each input modality followed by a decoder. The first encoder is a transformer-based model that is used to extract visual features from the input medical image. The second is a language encoder that is used to generate a language representation from the input question. After encoding the two input modalities, the two feature representations are concatenated to generate a multi-modal representation from both the image and the question. Finally, the concatenated features go through several layers of the decoder to generate the proper answer. In the following, we describe the architecture of our framework in more detail.

Image Encoder
Figure 2 illustrates the detailed architecture of the image encoder. The medical image $X_i$ is fed into the image encoder to generate the corresponding visual representation. The image is first resized to 224 × 224 × 3 pixels. Then, it is partitioned into 49 non-overlapping patches with a spatial dimension of 32 × 32 pixels. These patches are flattened into one-dimensional vectors and mapped with an image embedding layer into dimension 768 to match the encoder dimension. The positional encoding is combined with the patch representations and passed to the image encoder.
The adopted image encoder is a ViT32 model, which is a variant of the vision transformer proposed in [17]. Typically, this model is composed of 12 identical layers.
Each layer comprises multi-headed self-attention (MSA) and feed-forward network (FFN) blocks that work together to generate visual features. Each block is preceded by a normalization layer [57] and a residual connection to the next block.
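As a minimal sketch of the patching and embedding step described above (224 × 224 input, 49 non-overlapping 32 × 32 patches, 768-dimensional embeddings with a learnable positional encoding), one could write the following PyTorch module. This is an illustrative re-implementation, not the authors' code; the class token used by the standard ViT is omitted for brevity.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Splits a 224x224x3 image into 49 non-overlapping 32x32 patches and
    projects each flattened patch to the 768-dimensional encoder width."""

    def __init__(self, image_size=224, patch_size=32, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2                     # 49
        # A strided convolution is the standard way to flatten + project patches.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 7, 7)
        x = x.flatten(2).transpose(1, 2)       # (B, 49, 768)
        return x + self.pos_embed              # add positional encoding
```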

The MSA block in the encoder employs the self-attention mechanism, which is utilized to find correlations between different patches of the medical image. To determine this correlation, the embedded representation of the input image is transformed into three distinct matrices by using three linear layers. These resultant matrices are the query Q, the key K, and the value V. The dot product is calculated between the Q and the K matrices, and the resulting value is divided by the square root of the dimension of K. The score is passed through a SoftMax operation to obtain the attention weights. Finally, V is multiplied by the output of the SoftMax to obtain the weighted input. This operation is expressed in the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

Multiple independent self-attention heads compute the scaled dot-product attention in the MSA block. The results of all the attention heads are concatenated together and then passed to the FFN block. The FFN consists of two fully connected layers with a Gaussian error linear unit (GELU) activation function applied in between [58]. The encoded image representation obtained from the image encoder is subsequently projected into a vector of dimension 512 to match the dimension of the question representation. Thus, the resultant representation $f_{X_i}$ has a dimension of 49 × 512.
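The scaled dot-product attention above can be implemented in a few lines. The sketch below is a generic illustration of the formula rather than the exact MSA block of the ViT32 backbone.

```python
import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, i.e., the scaled dot-product attention above."""
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., L_q, L_k) similarity scores
    weights = F.softmax(scores, dim=-1)                  # attention weights
    return weights @ v                                   # weighted values
```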

Question Encoder
The question encoder uses a BERT-like architecture to generate the question's textual features [59]. Similar to the image encoder, the question encoder consists of a stack of 12 identical layers. As shown in Figure 3, the first step in encoding the question is tokenization, in which the question is split into a sequence of word tokens. Two special tokens, <CLS> and <SEP>, are appended to the sequence to mark its beginning and end, respectively. The encoder uses a fixed sequence length of 77 tokens and a vocabulary of 49,408 words. The word embedding layer embeds the sequence of question tokens into features of dimension 512. A learnable positional embedding is added to the sequence to provide information about the order of each word. The final representation is generated by feeding the initial representation through the 12 layers of the encoder. Analogously to the image encoder, the question encoder employs the MSA block to capture dependencies within the question tokens. The model also uses normalization layers and skip connections, but unlike the image encoder, the normalization layers come after the MSA and FFN blocks. The output of the question encoder is the question feature representation of size 77 × 512. This representation holds information about the semantics of the question and the relationships between words.
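The 77-token context length, 49,408-word vocabulary, and 512-dimensional token features match the CLIP text tower, so a hedged sketch of the question encoding using the Hugging Face transformers CLIP classes could look as follows. Using these particular classes and checkpoint is an assumption for illustration, not a statement of the authors' implementation, and the example question is hypothetical.

```python
from transformers import CLIPTokenizer, CLIPTextModel

# Assumed backbone: the CLIP ViT-B/32 text tower (context length 77,
# vocabulary 49,408, hidden size 512), matching the settings described above.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

question = "Is there evidence of a pneumothorax?"        # hypothetical example question
tokens = tokenizer(question, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")
f_q = text_encoder(**tokens).last_hidden_state           # (1, 77, 512) token features
```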

Multi-Modal Representations
Our VQA model is supposed to receive a question and look at the given image to find relevant information for generating the correct answer. To model this, the image features $f_{X_i} \in \mathbb{R}^{49\times512}$ obtained from the image encoder and the question features $f_{q_i} \in \mathbb{R}^{77\times512}$ obtained from the question encoder are concatenated to form the joint representation

$$f_i = f_{X_i} \oplus f_{q_i}$$

Here, $\oplus$ is the concatenation operator. The representation $f_i$, which aggregates the relevant information from the two modalities, is supplied as input to the answer generator, which decodes it into an answer.
Since the VQA task requires encoding both the question and the image, we leverage the rich semantic embedding of the contrastive language-image pre-training (CLIP) model [52] and use it as a backbone. CLIP is built on dual transformers that have been optimized by contrastive learning to match a large batch of image-text pairs. Specifically, CLIP learns a multi-modal embedding space by jointly training an image encoder and a text encoder on a corpus of 400 M image-text pairs. The contrastive learning used by the CLIP model aims at maximizing the similarity of truly corresponding image-text pairs while minimizing the similarity of mismatched image-text pairs.
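A corresponding sketch of the image side and of the concatenation described above is given below. Using the CLIP ViT-B/32 vision tower via Hugging Face transformers, dropping its class token so that 49 patch features remain, and adding a learned 768-to-512 projection are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

# Assumed vision backbone (CLIP ViT-B/32) and an assumed projection layer
# mapping its 768-dim patch tokens to the 512-dim question feature space.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
proj = nn.Linear(768, 512)


def fuse(pixel_values, f_q):
    """Concatenate projected image patch features with question token features."""
    tokens = vision_encoder(pixel_values=pixel_values).last_hidden_state  # (B, 50, 768)
    f_x = proj(tokens[:, 1:, :])           # drop the class token -> (B, 49, 512)
    return torch.cat([f_x, f_q], dim=1)    # joint representation f_i: (B, 126, 512)
```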

Answer Decoder
The decoder is modeled as a generative model: it generates the answer one word at a time in an autoregressive manner. When a word is predicted, it is added to the input sequence, which then serves as the model's new input at the next time step. The decoder architecture consists of two identical layers, and Figure 4 shows the internal architecture of a single layer. Similar to the question encoding, the input answer is first tokenized into words and trimmed or padded to the maximum length of 77 tokens. The two special tokens <CLS> and <SEP> are appended to the sequence, and each word is represented as a word embedding. The positional information is added, and the result is fed into the first layer of the decoder. The decoder layer is composed of the same MSA and FFN blocks present in the encoder. However, the decoder uses a masked self-attention block that learns the dependencies within the answer tokens without considering future tokens. This helps the model to predict the next word based on the sequence of previous tokens. Another difference in the decoder is the multi-head cross-attention block, which is designed to capture the interdependencies between two different inputs, as opposed to the self-attention mechanism employed by the image and question encoders, which derives Q, K, and V from the same modality. As shown in Figure 4, the cross-attention mechanism in each decoder layer uses Q derived from the multi-modal representation, and K and V derived from the answer. This helps the model to detect the correlation between the different data modalities involved in the VQA task.
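A hedged sketch of one such decoder layer is shown below. Note that it follows the conventional arrangement in which the cross-attention queries come from the answer tokens and the keys/values from the joint image-question representation, whereas the text above describes the reverse mapping; the head count (8), the post-norm ordering, and the FFN width are assumptions.

```python
import torch
import torch.nn as nn


class AnswerDecoderLayer(nn.Module):
    """One decoder layer: masked self-attention over answer tokens, then
    cross-attention to the joint image-question representation, then an FFN.
    Illustrative sketch only, with conventional query/key/value roles."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, answer_tokens, joint_repr):
        L = answer_tokens.size(1)
        # Causal mask: each position may only attend to itself and earlier tokens.
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool,
                                       device=answer_tokens.device), diagonal=1)
        x = answer_tokens
        x = self.norm1(x + self.self_attn(x, x, x, attn_mask=causal, need_weights=False)[0])
        x = self.norm2(x + self.cross_attn(x, joint_repr, joint_repr, need_weights=False)[0])
        return self.norm3(x + self.ffn(x))
```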

Network Optimization
Let us consider $f_{X_i}$ and $f_{q_i}$ as the visual and textual representations of the image $X_i$ and the question $q_i$, respectively, with $f_{X_i} \in \mathbb{R}^{49\times512}$ and $f_{q_i} \in \mathbb{R}^{77\times512}$. Here, 49 and 77 represent the lengths of the image and question sequences, respectively, and 512 represents the feature dimension. The two features are generated by fixing the weights of the CLIP encoders. The goal of training is to fine-tune the parameters of the answer decoder by optimizing the autoregressive language modeling objective conditioned on the multi-modal joint feature representation $f_i$ and the previously generated tokens of the answer. Formally, the objective function is defined as:

$$\mathcal{L} = -\sum_{k=1}^{K} \log p\left(a_{i,k} \mid f_i, a_{i,0}, \ldots, a_{i,k-1}\right)$$

where $f_i$ is the joint feature representation of the image $X_i$ and the question $q_i$, $a_{i,0}, \ldots, a_{i,k-1}, a_{i,k}$ are the words composing the answer $y_i$, and $K$ is the number of words in the answer $y_i$.
At the inference phase, the test image and the question are given as input to the image encoder and the question encoder, respectively, to obtain the joint feature representation. The decoding begins by reading this representation and the start token to generate the first word of the answer. The model generates a word at each step by sampling the word with the highest posterior probability over the vocabulary $V$. The predicted words constitute the answer to the question about the test image. The answer generation is terminated when either the end-of-sequence token is predicted or the maximum sequence length is reached.
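The objective above corresponds to a standard token-level cross-entropy, and the inference procedure to greedy decoding. The following sketch illustrates both; the decoder call signature and the padding/start/end token ids are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F


def autoregressive_loss(logits, answer_ids, pad_id=0):
    """Average negative log-likelihood of each answer token given the joint
    features and all previous tokens. logits: (B, K, |V|) next-token predictions;
    answer_ids: (B, K+1) including the start token. pad_id is a placeholder."""
    targets = answer_ids[:, 1:]                              # a_{i,1} ... a_{i,K}
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=pad_id)


@torch.no_grad()
def greedy_decode(decoder, joint_repr, start_id, end_id, max_len=77):
    """Generate the answer word by word, keeping the most probable token each step."""
    seq = torch.full((joint_repr.size(0), 1), start_id, dtype=torch.long,
                     device=joint_repr.device)
    for _ in range(max_len - 1):
        logits = decoder(seq, joint_repr)                    # hypothetical signature
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_id], dim=1)
        if (next_id == end_id).all():                        # stop at end-of-sequence token
            break
    return seq
```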

Experimental Results
In this section, we first introduce the datasets and then explain the evaluation metrics utilized in this work and the experimental setup. Lastly, we present results and analysis related to our experiments.

Dataset Description
We trained and evaluated our model on two medical VQA datasets, namely VQA-RAD and PathVQA. The characteristics of these datasets are listed in Table 1, and sample images with their corresponding question-answer pairs from the training sets are shown in Figures 5 and 6.


Evaluation Measures
To quantitatively validate the proposed methodology and compare our results with other state-of-the-art methods, we used the metrics proposed in [60], which are commonly used evaluation metrics in VQA. We adopted strict accuracy, which measures the ratio between correctly predicted observations and total observations. The BiLingual Evaluation Understudy (BLEU) score [61] is another automatic evaluation metric that measures the similarity between predicted answers and the ground truth by matching n-grams, as expressed below:

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log P_n\right)$$

where $BP$ is the brevity penalty that penalizes short answers, $w_n$ is a weight between 0 and 1 with $\sum_{n=1}^{N} w_n = 1$, $P_n$ is the modified n-gram precision, and $N$ is the maximum n-gram length. N-grams up to length 4 are used here.
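As an illustration, strict accuracy and BLEU can be computed as follows. The use of NLTK's sentence_bleu and of a smoothing function is an assumption, since the paper does not name a specific toolkit.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def strict_accuracy(predictions, references):
    """Exact-match accuracy between predicted and ground-truth answers."""
    matches = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return matches / len(references)


def bleu_n(prediction, reference, n=4):
    """BLEU with uniform weights over 1..n-grams and a smoothing function."""
    weights = tuple(1.0 / n for _ in range(n))
    return sentence_bleu([reference.split()], prediction.split(),
                         weights=weights,
                         smoothing_function=SmoothingFunction().method1)
```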

Experimental Setup
For the experiments, we trained our model with PyTorch [62] on a machine with an Intel Core i9-7920X CPU @ 2.9 GHz, 32 GB of RAM, and an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of GDDR5X memory. During our experiments, we defined the following hyperparameters. The input and output dimensions of the multi-head attention are 512, the dropout in all fully connected layers is 0.1, and we set two transformer blocks in the encoder and decoder. For training, we used the Adam optimizer [63] with a base learning rate of 0.001. The batch size was set to 50, and the number of training epochs was set to 50. The images were shuffled at each epoch and randomly flipped horizontally with a probability of 0.2. We used the BERT-base-uncased tokenizer for text inputs with a vocabulary size of 49,408.
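A hedged sketch of this training configuration is given below; train_dataset, decoder, and compute_loss are hypothetical placeholders standing in for the components described in the previous sections.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import transforms

# Augmentation and loading reflecting the setup above.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.2),   # random left-right flip with p = 0.2
    transforms.ToTensor(),
])

# train_dataset, decoder, and compute_loss are placeholders for the dataset,
# answer decoder, and loss described earlier in the paper.
train_loader = DataLoader(train_dataset, batch_size=50, shuffle=True)   # reshuffled every epoch
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)             # base learning rate 0.001

for epoch in range(50):
    for images, questions, answers in train_loader:
        optimizer.zero_grad()
        loss = compute_loss(images, questions, answers)                 # hypothetical helper
        loss.backward()
        optimizer.step()
```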

Results
The evaluation results of the proposed VQA model on the VQA-RAD and PathVQA datasets are presented in Table 2. Several metrics are used to assess the model's performance, including BLEU scores and both open-ended and closed-ended accuracies, obtained with two decoder layers. For the VQA-RAD dataset, BLEU-1 yields the highest score of 71.03%, while BLEU-4 yields the lowest score of 64.43%. The closed-ended accuracy is 82.47%, the open-ended accuracy is 71.49%, and the overall accuracy is 75.41%. The results on the PathVQA dataset are also shown in Table 2. Here, BLEU-1 gives the highest score of 61.78%, while BLEU-4 gives the lowest score of 58.19%. The closed-ended accuracy is 84.63%, the open-ended accuracy is 58.29%, and the overall accuracy is 67.05%.
Overall, the results show that the model performs well on both datasets. The closed-ended accuracy metric yields the highest results, demonstrating the model's ability to choose the right response from a pre-defined set of options. The lower open-ended accuracy indicates that the model may have more trouble producing free-form replies, and the lower BLEU scores imply that there may be less overlap between the model's predictions and the ground-truth answers.
Table 3 shows the effect of using different numbers of layers in the answer decoder on the two datasets, VQA-RAD and PathVQA. The results on the VQA-RAD dataset exhibit a consistent improvement in all measures as the number of layers increases. For the PathVQA dataset, the highest results are obtained with three decoder layers, while for the VQA-RAD dataset the best results are obtained with four decoder layers.
To quantify the capability of the proposed model, we contrast our results against several models. Table 4 compares the results for open-ended questions with those of the work that introduced the PathVQA dataset; we note that these metrics are not used in most recent works. The results are reported in terms of BLEU scores (1, 2, and 3) as well as the F1 score. Our approach yields superior outcomes thanks to its autoregressive decoder, which is able to generate answers in the form of sentences.
In Table 5, we also compare our results with several existing state-of-the-art approaches [6,7,43,45,46,64,65] in terms of open-ended and closed-ended accuracies. For instance, in [6] the authors suggest a framework that uses conditional reasoning to automatically learn how to reason effectively for different VQA-RAD tasks. Their approach involves a reasoning module that considers the question being asked and uses this information to select the most important features from a fusion of different types of data. The authors of [7] suggest a system that tackles the issue of limited labeled data by means of an unsupervised denoising autoencoder and supervised meta-learning. The denoising autoencoder makes use of a large number of unlabeled images, while meta-learning learns meta-weights that can quickly adapt to VQA problems with limited labeled data. On the other hand, the authors of [45] propose a cross-modal self-attention (CMSA) module that enhances the fusion of visual and linguistic features by capturing long-range contextual relevance.
This module helps to better incorporate information from both visual and linguistic modalities by emphasizing their important and relevant features. In another work [43], the authors introduce a method for quantifying multiple meta-models that utilizes meaningful features and learns meta-annotations. Finally, in [64], the authors suggest a data augmentation method called VQAMix to address the data limitation problem in VQA. Compared to all these models, our method shows improvements in closed-ended, open-ended, and overall accuracy. Specifically, we observed improvements of about 2.5%, 11%, and 5% on VQA-RAD, and 2%, 58%, and 18% on PathVQA, respectively. We also noticed that our model's improvements were more significant on open-ended questions than on yes/no questions.

Discussions
For further analysis, we present in Figures 7 and 8 the attention maps of questions and images obtained for samples from both datasets. These maps are useful to understand the regions of interest of an image and the corresponding question. They can also help radiologists and other medical professionals to identify key regions of an image, and to improve the performance of automated diagnostic systems.
Considering the VQA-RAD dataset samples displayed in Figure 7, for the first two samples the model provides the correct answers, and the attention maps highlight the "arterial" region in the first image and the "mesenteric arteries" in the second image. The third example also shows a correctly predicted answer, with the corresponding attention map placing emphasis on the lung area. This is consistent with the question, which concerns a general query about lung condition. The last sample shows a failure case in which the model could not predict the correct answer about the "peritoneal cavity".
Figure 8 shows four samples of questions answered by our model for images from the PathVQA dataset. The first sample shows that the model correctly predicts the answer and that the attention spans the relevant regions in the image. In the second example, although the model cannot provide the correct answer, it can still highlight related regions in the image. The third sample asks about the condition of the "mitral valve"; the question is correctly answered by our model, and the corresponding region in the image is highlighted. Finally, the question asked in the fourth example is an open-ended question regarding the "lumen" present in the image. It can be seen that the model could not obtain the correct answer because open-ended questions are more challenging and require further developments.

Conclusions and Future Work
In this work, we have proposed a VQA model for medical datasets. The model is composed of image and text encoders for encoding the medical image and the corresponding question, with a decoder mounted on top to generate an answer in an autoregressive way. The complete architecture is end-to-end learnable via the backpropagation algorithm. In the experiments, we validated our model on two well-known medical VQA datasets, namely VQA-RAD and PathVQA. The obtained results confirm its promising capabilities compared to recent works. For future developments, we will look for more effective data augmentation methods to achieve better performance. In addition, we will investigate other types of multi-modal transformers to learn better representations for both images and textual questions.

Data Availability Statement:
The data presented in this study are available on request.