Goal-Driven Visual Question Generation from Radiology Images

Abstract: Visual Question Generation (VQG) from images is a rising research topic in both natural language processing and computer vision. Although there are some recent efforts towards generating questions from images in the open domain, the VQG task in the medical domain has not been well-studied so far due to the lack of labeled data. In this paper, we introduce a goal-driven VQG approach for radiology images called VQGRaD that generates questions targeting specific image aspects such as modality and abnormality. In particular, we study generating natural language questions based on the visual content of the image and on additional information such as the image caption and the question category. VQGRaD encodes the dense vectors of different inputs into two latent spaces, which allows generating, for a specific question category, relevant questions about the images, with or without their captions. We also explore the impact of domain knowledge incorporation (e.g., medical entities and semantic types) and data augmentation techniques on visual question generation in the medical domain. Experiments performed on the VQA-RAD dataset of clinical visual questions showed that VQGRaD achieves a BLEU-1 score of 61.86% and outperforms strong baselines. We also performed a blinded human evaluation of the grammaticality, fluency, and relevance of the generated questions. The human evaluation demonstrated the better quality of VQGRaD outputs and showed that incorporating medical entities improves the quality of the generated questions. Using the test data and evaluation process of the ImageCLEF 2020 VQA-Med challenge, we found that the proposed data augmentation technique, which generates new training samples by applying different kinds of transformations, can mitigate the lack of data, avoid overfitting, and bring a substantial improvement in medical VQG.

Generating natural language questions for image understanding is a rising research topic in both the fields of natural language processing and computer vision [21,22]. The task, known as Visual Question Generation (VQG), has two main motivations. First, it supports creating large-scale collections of Visual Question Answering (VQA) pairs at low cost since VQG could automatically generate questions about an image. Second, it can also play a role in improving the efficiency of human annotation for VQA datasets construction [23]. VQG combines natural language processing that provides the ability to generate the question, and computer vision techniques that allow the understanding of the image's content.
In contrast to answering visual questions about images, generating questions has received little attention so far. A few recent works have attempted to generate questions from images in the open domain [24][25][26]. However, the task of VQG in the medical domain has not been well-studied. In addition to the two main motivations mentioned above, VQG could benefit both doctors and patients. For example, patients could use the questions provided by VQG systems to start a conversation with their doctors and better understand their medical images. Moreover, such VQG systems could support medical education and clinical decision making by understanding medical images and generating questions related to their content [27].
The VQG task, as shown in Figure 1, consists of three main phases: (1) generating a representation of the image; (2) producing the embeddings with a neural network; and then (3) generating the question. One major problem with medical VQG is the lack of large-scale labeled training data, which usually requires huge efforts to build, especially in the medical domain where domain experts are needed for data construction. Although deep learning models have achieved remarkable success in computer vision and natural language processing tasks, their performance often depends on the size and quality of the available training data, which is often tedious to collect [17,28,29]. Usually, to avoid overfitting, neural networks need access to more training data. However, many tasks, such as medical VQG and VQA, lack access to large amounts of data.
Recently, we presented a VQG system that is able to generate questions when shown radiology images [30]. However, this approach is not goal-driven, as it does not guarantee that the generated questions will address a specific aspect of the image: it generates natural language questions that are relevant to radiology images without any constraints on their types. Developing a method capable of asking goal-oriented questions about images is a challenging research problem. Towards this end, this work aims to develop an approach to generating questions that ask for specific information about radiology images. As shown in Figure 2, by specifying the type of the expected question, different questions can be generated for a given image, such as "What is the condition seen in this image" for abnormality and "Does this have contrast" for modality. The goal-driven generation process thus makes it possible to specify, for a given image, the category of the expected question. Such an approach allows better control of the questions generated from radiology images when they involve multiple topics of interest, which often occurs in medical images.
In this paper, we introduce VQGRaD, a goal-driven VQG system for generating natural language questions about radiology images. VQGRaD is tasked with generating a natural language question when provided with an image (with or without its caption) and the question category. In summary, this paper makes the following contributions:
1. To the best of our knowledge, this work is the first attempt to generate visual questions about medical images that will result in a specific type of answer when provided with an initial indication of the question category/type.
2. To overcome the data limitation of VQG in closed domains, we propose a new data augmentation method for natural language questions.
3. VQGRaD is designed to work with or without an image caption and only requires the image and the question category as minimal inputs.
4. We study the impact of domain knowledge incorporation, such as named entities and semantic types, in the proposed VQGRaD approach.
5. VQGRaD is evaluated using the VQA-RAD dataset of clinical questions and radiology images. Experimental results show that VQGRaD performs better than strong baselines, with a BLEU-1 score of 61.86%. We also report the VQG models' results in our participation at the VQA-Med 2020 challenge.
6. We perform a manual evaluation to study the grammaticality, fluency, and relevance of the questions generated from radiology images.

Figure 2.
Examples of goal-driven questions about a radiology image. The possible questions should be relevant to the given category, also known as question type. By specifying the type of the expected question, different questions such as (a-c) can be generated from the same radiology image.
The remainder of the paper is organized as follows. First, related work concerning the main visual question generation systems is reviewed in Section 2. Then, the proposed methods are presented in Section 3. Several comprehensive experiments are performed to evaluate the effectiveness of the proposed methods in Section 4 where experimental settings, evaluation metrics, benchmark datasets, and results are presented. Conclusions and future work are finally presented in Section 5.

Related Work
Question generation (QG) is the task of automatically creating natural language questions from a range of inputs, such as natural language text [31][32][33], structured data [34] and images [21,24]. While many natural language processing and computer vision problems involve extracting information from the texts and images such as VQA [19,35,36], VQG, which can be considered as a complementary task of VQA, is a multi-modal problem involving image understanding and natural language generation, especially using generative methods.
VQG in the open domain has benefited from the available large-scale annotated datasets [37][38][39]. These large-scale datasets have enabled a variety of work studying generative models and continuous latent spaces for generating visual questions in the open domain [22,40]. Early work on goal-oriented visual question generation focused primarily on reinforcement learning settings [41,42]. Recent VQG approaches have used autoencoder architectures to generate questions from images and some additional inputs such as answers and categories of questions [23,24,43]. The successes of these systems have primarily been a result of variational autoencoders [44].
In this work, we are interested in generating questions in closed domains. Visual question generation in closed domains, such as the medical field, is a challenging task [45][46][47][48] that is still understudied. For instance, Lau et al. [27] created the first VQG dataset in the medical domain (VQA-RAD), where each radiology image was manually annotated with several questions. However, this dataset is too small for training efficient VQG models. Recently, we have developed a VQG system that is able to generate questions from radiology images [30]. However, this approach is not goal-driven as it does not guarantee that the generated question will address a specific aspect of the image.
Inspired by the aforementioned open-domain research, we present in this paper VQGRaD, a goal-driven visual question generation system for radiology images based on the variational autoencoders architecture. Our work extends our previous method [30] by tackling visual question generation as a process that considers the question's category, the image and its caption.
VQGRaD is able to generate questions about four main categories: abnormality, modality, plane, and organ. These are the most frequent question categories in the VQA-RAD dataset. The questions can be generated from either (i) the image and the question category or (ii) the image, its caption, and the question category.
In addition, to overcome the data limitation problem in the medical domain, we propose a text-based augmentation method to automatically create new training questions. Data augmentation, the application of one or more deformations to labeled data which result in new, additional training data, is a promising solution to handle the data insufficiency problem [49,50].
Automatic data augmentation based on images is commonly used in computer vision [17,28,29] and can help train deep learning models, particularly when using smaller datasets. Simply flipping or shifting images can help the models to better learn by increasing the number of training images. However, the lack of available training images for VQG in the medical domain makes image-based data augmentation alone insufficient for boosting performance on the visual question generation task. Thus, training our supervised models on the augmented natural language data can allow them to become more invariant to these deformations and generalize better to unseen data.
Our text data augmentation method can also be used in open-domain and restricted-domain NLP tasks, such as text classification and question answering, as it relies on general morpho-syntactic features to replace relevant target words in the original text with words that have a high contextual similarity. The following section will present our proposed methods in detail.

Methods
The first goal of this study is to generate natural language questions that ask about specific topics such as modality, abnormality, plane, and organ.
To address this challenge, we present VQGRaD, a goal-driven visual question generation system for radiology images that aims to generate relevant questions based on the visual content of the image.

Problem Modeling
Given a pair (C, I), where C is the question category accompanied by a medical image I, VQGRaD is tasked with generating the appropriate question Q that will result in a specific type of answer. Mathematically, the VQG task can be formulated as:

$$Q = f(C, I; \alpha)$$

where f is the question generation function and α denotes the parameters of the model. The possible categories C are modality, abnormality, plane, and organ. In the following sections, we will provide a detailed description of our proposed methods.

VQGRaD
The VQGRaD model is based on the variational autoencoder (VAE) architecture [44]. It first encodes the image and its caption along with the category before generating the question. VAEs comprise two neural network components, known as the encoder and the decoder, for learning the probability distribution of the data p(x). The encoder maps the raw data x to a latent variable z in the latent space (z-space). In other words, the encoder compresses the data from the initial space to the encoded space, also called the latent space. The decoder, on the other hand, aims at recovering x using z extracted from the latent space. The training of the encoder and decoder proceeds by maximizing the marginal likelihood log p(x). Finding the Evidence Lower Bound (ELBO) yields:

$$\log p(x) \geq \mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] - KL\left(q(z|x) \,\|\, p(z)\right)$$

where q(z|x) and p(x|z) are the probability distributions of the encoder and the decoder, respectively. The loss function that is minimized when training VAEs is the negative log-likelihood with a regularizer. It consists of a reconstruction loss and a regularisation loss (on the latent layer). The reconstruction loss aims to make the encoding-decoding scheme as accurate as possible, whereas the regularisation loss regularises the organisation of the latent space by making the distributions returned by the encoder close to a standard normal distribution. The loss function $l_i$ for datapoint $x_i$ is:

$$l_i(\theta, \phi) = -\mathbb{E}_{z \sim q_\theta(z|x_i)}\left[\log p_\phi(x_i|z)\right] + KL\left(q_\theta(z|x_i) \,\|\, p(z)\right)$$

where the first term is the reconstruction error and $KL(q_\theta(z|x_i) \,\|\, p(z))$ is the Kullback-Leibler divergence regularization between the returned distribution and a standard Gaussian. φ and θ are the parameters of the decoder distribution $p_\phi(x|z)$ and the encoder distribution $q_\theta(z|x)$, respectively.

In VQGRaD, as shown in Figure 3, a Convolutional Neural Network (CNN) is used to obtain the image feature map v and a Long Short-Term Memory network (LSTM) [51] is used to generate the embedded caption features c. The categories of the questions are represented as a one-hot vector a.
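The regularisation term of the VAE loss, the Kullback-Leibler divergence between the encoder's distribution and a standard Gaussian, has a well-known closed form when the encoder outputs a diagonal Gaussian. The following minimal sketch illustrates that term and the resulting per-datapoint loss in plain Python. The diagonal-Gaussian assumption and the function names `kl_to_standard_normal` and `vae_loss` are ours for illustration; they are not taken from the VQGRaD source code.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian.

    Assumption: the encoder returns a mean and a standard deviation
    per latent dimension (hypothetical interface, for illustration).
    """
    return 0.5 * sum(
        m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma)
    )


def vae_loss(reconstruction_error, mu, sigma):
    """Per-datapoint VAE loss: reconstruction error plus KL regulariser."""
    return reconstruction_error + kl_to_standard_normal(mu, sigma)
```

The KL term vanishes exactly when the encoder returns a standard Gaussian (zero mean, unit standard deviation) and grows as the returned distribution drifts away from it, which is what pushes the latent space towards a regular organisation.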
VQGRaD then encodes the dense vectors $h_c$, $h_a$ and $h_v$ of the caption, the category, and the image, respectively, into a continuous, dense, latent z-space. It also encodes the dense vectors $h_a$ and $h_v$ into another continuous, dense, latent t-space based on the continuous latent space introduced in [24] for regularization. This allows our system to maximize the mutual information MI(.) between the encoded features (i.e., the image, the caption, and the category) and the latent space. MI(.) measures how much knowing one of the predefined features reduces uncertainty about the other. For example, if $h_a$ and $h_v$ are independent, then knowing $h_v$ does not give any information about $h_a$ and vice versa, so their mutual information is zero. In our case, the optimization objective is a weighted combination of MI(.) terms, in which $\lambda_1$, $\lambda_2$, $\lambda_3$ are hyperparameters that set the relative weights of the MI(.) terms, and $p_\phi(q|z)$ is the learned mapping, parameterized by φ, from the image, the caption, and the category to this latent space.

Then, VQGRaD reconstructs the inputs from the z-space using a simple Multi-Layer Perceptron (MLP), which is a neural network with fully connected layers. It generates the reconstructed image, caption, and category features $\hat{h}_v$, $\hat{h}_c$, $\hat{h}_a$, and optimizes the model by minimizing the following $l_2$ losses:

$$L_v = \lVert h_v - \hat{h}_v \rVert_2, \quad L_c = \lVert h_c - \hat{h}_c \rVert_2, \quad L_a = \lVert h_a - \hat{h}_a \rVert_2$$

On the other hand, VQGRaD trains the t-space by minimizing the KL-divergence with the z-space:

$$L_t = KL\left(p_\phi(z \mid h_v, h_c, h_a) \,\|\, p_\theta(t \mid h_v, h_a)\right)$$

where φ and θ are the parameters used to embed into the z-space and the t-space, respectively. We used the reparameterization trick [44]: the encoder generates means $\mu_z$ and standard deviations $\sigma_z$, which are combined with sampled unit Gaussian noise $\epsilon$ to generate:

$$z = \mu_z + \sigma_z \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

In VQGRaD, the t-space is not only used for regularization but also to generate questions from only the image and the question category.
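To make the sampling and reconstruction steps concrete, the sketch below shows the reparameterization trick (a latent sample built from means, standard deviations, and unit Gaussian noise) and an l2 reconstruction loss between a feature vector and its reconstruction. It is a plain-Python illustration under our own assumptions: the names `reparameterize` and `l2_loss` are hypothetical, vectors are ordinary lists, and the actual system is implemented with PyTorch tensors.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps, with eps sampled from a unit Gaussian."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def l2_loss(h, h_hat):
    """Euclidean distance between a feature vector and its reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h, h_hat)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# With zero standard deviation the noise has no effect and z equals mu.
z = reparameterize([0.5, -1.0], [0.0, 0.0], rng)
```

Because the randomness is isolated in the noise term, the sample remains a deterministic function of the means and standard deviations, which is what allows gradients to flow back to the encoder during training.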
Finally, VQGRaD uses an LSTM decoder to generate the question $\hat{q}$ from either the z-space or the t-space. The decoder takes a sample from the latent z-space and uses it as input to output the question $\hat{q}$. It receives a "start" symbol and proceeds to output a question word by word until it produces an "end" symbol. We used the cross-entropy loss function to evaluate the neural network's quality and minimize the error $L_g$ between the generated question $\hat{q}$ and the ground-truth question q. The generation of each word of the question can be written as:

$$\hat{w}_t = \arg\max_{w \in W} p(w \mid z, w_0, \ldots, w_{t-1})$$

where $\hat{w}_t$ is the predicted word at step t, W denotes the word vocabulary, and $w_i$ represents the i-th ground-truth word. The final loss of VQGRaD is a weighted sum of the loss terms introduced above, where KL is the Kullback-Leibler divergence, $\lambda_1$, $\lambda_2$, $\lambda_3$ have already been introduced, and $\lambda_4$, $\lambda_5$, $\lambda_6$ are hyperparameters that control the variational loss, the question generation loss, and the amount of regularization used in our model, respectively.
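The word-by-word generation loop and the cross-entropy error can be sketched as follows. This is an illustrative toy, not the VQGRaD decoder: the real model is an LSTM conditioned on a latent sample and on its hidden state, whereas here a hypothetical lookup table `TOY_DECODER` returns a next-word distribution given only the previous word. The function names and the default length limit of 20 words are likewise our assumptions.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability assigned to the ground-truth word."""
    return -math.log(probs[target])


def greedy_decode(next_word_probs, max_len=20):
    """Emit the most probable word at each step until "end" or max_len words."""
    words, prev = [], "start"
    while len(words) < max_len:
        probs = next_word_probs(prev)
        prev = max(probs, key=probs.get)  # arg max over the vocabulary
        if prev == "end":
            break
        words.append(prev)
    return words


# Hypothetical stand-in for the LSTM decoder: fixed next-word distributions.
TOY_DECODER = {
    "start": {"is": 0.9, "end": 0.1},
    "is": {"this": 0.8, "end": 0.2},
    "this": {"normal": 0.7, "end": 0.3},
    "normal": {"end": 1.0},
}
question = greedy_decode(TOY_DECODER.__getitem__)
```

Summing `cross_entropy` over the positions of a ground-truth question gives a generation error of the kind denoted $L_g$ in the text; the length limit guards against a decoder that never emits the "end" symbol.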

Data Augmentation
Questions. For a given medical question q, we generate a set of new questions. During the augmenting process, we use the whole training data $D = \{q_i\}_{i=1}^{n}$, where n is the number of training questions. We expand each training question $q_i$ into a set of instances $q_i^k$, where k is the number of derived instances for each training question. To do so, we first select nouns and verbs as candidate words based on their part-of-speech tags (we used NLTK [52] to perform part-of-speech tagging). Each candidate word is then replaced by contextually similar words using the Wiki-PubMed-PMC embeddings, which were trained using four million English Wikipedia, PubMed, and PMC articles. Similar words for a given word are retrieved from the word embedding space using cosine similarity: we compute the cosine similarity between the vector of the given word $w_i$ in the question and the vector of each word $w_j$ in the pre-trained word embeddings, and we use the top k similar words. We carried out several experiments with k ∈ {5, 10, 15, 20, 30} and found that the best result is achieved with k = 10. Figure 4 presents some examples of created questions for the input question "Are the kidneys normal?".

Images. We also generate new training instances based on image augmentation techniques. To do so, we apply flipping, rotation, shifting, and blurring to all VQA-RAD training images. Figure 5 presents some examples of created images.
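The question augmentation procedure can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the candidate words are passed in as a set (standing in for the nouns and verbs selected by NLTK part-of-speech tagging), `TOY_EMB` holds made-up two-dimensional vectors in place of the Wiki-PubMed-PMC embeddings, and we assume that one candidate word is replaced at a time. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(word, embeddings, k):
    """Return the k vocabulary words closest to `word` by cosine similarity."""
    scored = [
        (cosine(embeddings[word], vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    return [other for _, other in sorted(scored, reverse=True)[:k]]


def augment(question, candidates, embeddings, k):
    """Create new questions by replacing one candidate word at a time."""
    tokens = question.split()
    new_questions = []
    for i, token in enumerate(tokens):
        if token in candidates and token in embeddings:
            for substitute in top_k_similar(token, embeddings, k):
                new_questions.append(
                    " ".join(tokens[:i] + [substitute] + tokens[i + 1:])
                )
    return new_questions


# Toy vectors standing in for the pre-trained embeddings (illustrative only).
TOY_EMB = {
    "kidneys": [1.0, 0.1],
    "lungs": [0.9, 0.2],
    "liver": [0.8, 0.4],
    "contrast": [0.0, 1.0],
}
new_questions = augment("are the kidneys normal ?", {"kidneys"}, TOY_EMB, k=2)
```

With these toy vectors, "lungs" and "liver" are the two nearest neighbours of "kidneys", so two new questions are produced. The second one ("are the liver normal ?") is no longer grammatical, which illustrates that substitution-based augmentation trades some noise for a larger training set.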

Experimental Settings and Results
In this section, we present our VQG results and conduct a comprehensive ablation analysis. As mentioned above, the proposed method is evaluated on the VQA-RAD and VQA-Med 2020 datasets.

Datasets
In this study, we used the VQA-RAD [27] dataset of clinical visual questions to evaluate our VQG system. The dataset contains 315 images and 3515 corresponding questions. Figure 6 presents sample images and questions. Each image is associated with more than one question, each of which is accompanied by its category. In this work, we are particularly interested in five categories of questions: "Modality", "Abnormality", "Organ", "Plane" and "Other". Table 1 presents the number of questions and images associated with each of these categories before and after data augmentation. The test set contains 100 reference questions with associated categories and images.
We also used the datasets provided by the VQA-Med 2020 challenge at ImageCLEF 2020 during our participation. Given a radiology image, the VQG task consists of generating a natural language question based on the image's content. The dataset used in VQA-Med 2020 consists of 780 radiology images with 2156 associated questions as training data, 141 radiology images with 164 questions as validation data, and 80 radiology images as test data. There are 1942 unique questions among the 2156 training questions. Some questions are associated with more than one image (up to 8 images). After applying data augmentation, our final training set consists of 161,348 questions. Figure 7 shows examples from the VQA-Med 2020 VQG data.

Evaluation Metrics
To investigate the performance of our visual question generation model, we make use of both automatic and manual evaluations.

Automatic Evaluation
VQG is a sequence generation problem. Therefore, in the automatic evaluation, we used various language generation evaluation metrics, namely BLEU, ROUGE, METEOR, and CIDEr, to measure the similarity between the system-generated questions and the ground-truth questions in the test set. We used the evaluation package published by [53]. BLEU-{1-4} measures the quality of the generated question by counting the {1-4}-grams in the generated question that match the {1-4}-grams in the reference question, respectively. METEOR compares the generated question with the reference question in terms of exact, stem, synonym, and paraphrase matches between words and phrases. ROUGE-L assesses the generated question based on the longest common subsequence shared by the candidate and the reference question. CIDEr measures consensus in questions by performing a Term Frequency-Inverse Document Frequency (TF-IDF) weighting for each n-gram.
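As an illustration of how n-gram matching works in the simplest case, the sketch below computes a sentence-level BLEU-1 score against a single reference: the clipped unigram precision multiplied by the standard brevity penalty. It is a minimal sketch under our own assumptions (whitespace tokenisation, one reference, the hypothetical function name `bleu1`), and it is not the evaluation package of [53], whose tokenisation and corpus-level aggregation may give different numbers.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """Clipped unigram precision times the brevity penalty (single reference)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate word is credited at most as often as it occurs in the reference.
    clipped = sum(min(n, ref_counts[w]) for w, n in Counter(cand).items())
    precision = clipped / len(cand)
    # Candidates that are not longer than the reference are penalised.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(ref) / len(cand))
    return brevity_penalty * precision
```

A generated question identical to its reference scores 1.0, while a question that conveys the same meaning with different words scores low, which is one reason exact-match metrics can underestimate the quality of generated questions.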

Human Evaluation
We also performed a human evaluation to measure the quality of the questions generated by our system and the baseline. To do so, we followed the standard approach in evaluating text generation systems [54], as used for question generation by [55,56]. We manually checked the generated questions and rated them in terms of relevancy, grammaticality, and fluency. The relevancy of a question is determined by the relationship between the question, the image, and the category. Grammaticality refers to the conformity of a question to the rules of grammar. Fluency and common sense (readability) refer to the way individual words sound together within a question. Two experts at the U.S. National Institutes of Health (NIH) performed the manual evaluation. For each measure, the assessors were required to give a rating on a scale from 1 to 3 (1 = Incorrect, 2 = Average (minor errors), 3 = Correct).

Implementation Details
VQGRaD. Our VQGRaD is implemented using PyTorch. We used an ImageNet-pretrained ResNet-50 [57] without fine-tuning its weights. Since the model expects an input of dimension 224 × 224, we resized the input images to that dimension. The z-space and t-space have 100 dimensions. We used the Adam optimiser [58] with a learning rate of 0.0001, a batch size of 32, and a maximum output sequence length of 20 tokens. All models were trained for 40 epochs using single P100 GPUs (16 GB VRAM) on a shared cluster, and the best results were used as final results. We optimized the hyperparameters such that λ 1 = 0.001, λ 2 = 0.005, λ 3 = 0.001, λ 4 = 0.0001, λ 5 = 0.001 and λ 6 = 0.001 for a total of 20 epochs. The source code is publicly available on GitHub at https://github.com/sarrouti/vqgrad, accessed on 15 August 2021.
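For convenience, the reported settings can be collected in a single configuration object, as in the sketch below. The class name `VQGRaDConfig` and its field names are hypothetical and do not come from the released source code; only the values are taken from the text. The number of epochs is set to the 40 training epochs reported above (the text also mentions 20 epochs in connection with hyperparameter optimisation).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VQGRaDConfig:
    """Reported training settings; field names are illustrative only."""

    image_size: int = 224          # ResNet-50 input resolution (224 x 224)
    latent_dim: int = 100          # dimensionality of the z-space and t-space
    learning_rate: float = 1e-4    # Adam learning rate
    batch_size: int = 32
    max_question_len: int = 20     # maximum output length in tokens
    epochs: int = 40               # training epochs reported in the text
    # Loss weights lambda_1 ... lambda_6, in the order given in the text.
    lambdas: tuple = (0.001, 0.005, 0.001, 0.0001, 0.001, 0.001)


cfg = VQGRaDConfig()
```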
VQG baseline. We used our recent VQG system named VQGR [30] as a baseline. This model is based on the variational autoencoder architecture that takes an image as input and generates a question. In our implementation, we used ImageNet-pretrained ResNet-50 [57] provided by PyTorch without fine-tuning its weights as the image encoder and an LSTM decoder for generating questions. The source code is publicly available on GitHub at https://github.com/sarrouti/vqgr, accessed on 15 August 2021.

Experiments and Results
In order to study the task of visual question generation about radiology images and explore the impact of domain knowledge incorporation such as medical entities and UMLS semantic types, we perform several experiments with different settings as shown in Table 2:
• VQGRaD is our full model that can generate questions from either the caption latent space z (image, caption, and category) or the category latent space t (image and category).
• VQGRaD w_t includes another LSTM encoder to encode the image titles.
• VQGRaD w_st includes another LSTM encoder to encode the UMLS semantic types extracted from the image captions.
• VQGRaD w_e uses only UMLS entities instead of using all words in captions. PyMetamap (https://github.com/AnthonyMRios/pymetamap, accessed on 15 August 2021), a Python wrapper for MetaMap [59], has been used for extracting UMLS entities and semantic types.
• In VQGRaD cap_or_c , the z-space contains only image and caption features.
All systems can generate questions from either the caption latent space z or the category latent space t. Table 2 also presents a comparison of our proposed models and the baseline systems:
• The VQGR baseline system is trained on the VQA-RAD dataset without data augmentation.
• VQGR w_im_aug is trained on the dataset generated by augmenting the images.
• VQGR w_our_aug is trained on the dataset generated by our data augmentation technique.
By comparing VQGR, VQGR w_im_aug and VQGR w_our_aug , we can see that our data augmentation technique considerably improved the results. The best BLEU-1 score among the baselines, 55.05%, was achieved using our data augmentation technique. Furthermore, VQGRaD performs best, outperforming the baseline systems on all evaluation metrics. Moreover, all of our VQGRaD models outperform the baseline system by a significant margin. This supports our hypothesis that the task of visual question generation can be goal-driven. VQGRaD achieved consistently better scores than the other ablations when the questions were generated from the t-space, which contains the image and the question category features.
When adding the UMLS semantic types extracted from the captions in VQGRaD w_st or the image titles in VQGRaD w_t , the models' performance consistently improved on most metrics when the questions were generated from the z-space (caption latent space). This is likely because the questions were generated from a latent space that encodes more features (including images, question category, caption or title, and UMLS semantic types) than in the VQGRaD system. However, the best results were obtained by VQGRaD when the questions were generated from the t-space (category latent space). Thus, building end-to-end VQG models that consider the question type is feasible and effective.
For additional evaluation, we used our VQG system during our participation in the VQG task of the VQA-Med challenge at ImageCLEF 2020 [60]. Given a radiology image, the VQG task consists of generating a natural language question based on the image's content. As the questions were all about abnormality, we only used our recent VQGR system based on VAEs and data augmentation, which takes an image as input and generates a natural language question as output. Table 3 shows our official results on the validation set at the VQG task of the VQA-Med challenge.
The official results of ImageCLEF 2020 VQA-Med showed that using a sequence generation model to solve VQG in the medical domain is complicated by the scarcity of labeled data. Hence, the participating systems used image classification approaches [48] to solve the VQG task. Small datasets might require models that have low complexity, whereas sequence generation models require a large amount of training data, as they try to deeply learn the underlying data distribution of the input in order to output new sequences. The available training data for VQG in the medical domain is not large or varied enough for training a seq2seq model. However, once we increased the size and variance of the dataset through the proposed augmentations, the performance of the proposed VQG system increased significantly, yielding BLEU scores of 39.74% and 11.6% on the validation set and the test set, respectively.
An additional manual evaluation of the VQG models' outputs was performed by two experts in medical informatics. Table 4 presents the results of the manual evaluation. Twenty (question, image, category) triples from the test set were randomly selected for the manual evaluation. Detailed guidelines for the raters are listed in Section 4.2.2. Inter-rater reliability was calculated on each of the 3 measures. The F1-score for each measure is presented in Table 5. Most of the reliability scores are close to 0.50, which is considered satisfactory reliability [61].
Table 4. Results of the manual evaluation of the best VQGRaD models and the VQG baseline system. "Relevancy", "Fluency", and "Grammaticality" are rated on a 1-3 scale (3 for the best). "Score" is the average of relevancy, fluency, and grammaticality scores. All numbers are normalized (divided by 60). The perfect score is 100.
Table 5. Inter-rater reliability. We used F1-score to compute the inter-annotator agreement [62].
The human evaluation showed that our models achieved the highest scores by generating more relevant and correct questions. This also demonstrates that the image caption and the question category features contribute to generating better questions. Furthermore, the results showed that adding medical entities as an additional input improves the quality of the generated questions.

Overall, VQGRaD provides an improved approach to generating visual questions by targeting specific types of natural language questions about radiology images. Table 6 provides example questions generated by [27] (ground truth questions) and the VQGRaD model. These examples show that the questions generated by our model are more consistent with the reference questions.
Table 6. Example image along with the question category, the automatically generated questions, and the ground truth question. The generated questions by our VQGRaD model and the baseline system are shown in blue and red, respectively. We manually selected the baseline's question from its outputs as the baseline system does not recognize the question category and generates a random question for each image.

Category | Generated and Ground Truth Questions
Abnormality | is a ring enhancing lesion present in the right lobe of the liver? / is a ring enhancing lesion present in the right lobe of the liver? / is the liver normal?
Modality | was this mri taken with or without contrast? / which ventricle is compressed by the t2-hyperintense? / was this mri taken with or without contrast?
Organ | is this a typical liver? / are these normal laughed kidneys? / Is this a study of the brain?
Plane | what plane is this image obtained? / what plane is this image blood-samples? / Is this image of a saggital plane?
The manual evaluation scores are much higher than the automatic ones. This is because the system, as shown in Table 6, generates questions whose words are semantically comparable to, but not exactly the same as, the words of the ground-truth question. Indeed, we believe that the existing automatic evaluation metrics are not sufficient to accurately evaluate text/question generation tasks. Further efforts are needed to investigate a better evaluation strategy for the VQG task.

Conclusions and Future Work
In this paper, we presented a goal-driven visual question generation approach called VQGRaD that can generate a question that is relevant to the image and a specified category. In particular, we were interested in questions about Abnormality, Modality, Organ, and Plane of radiology images. The generated questions are evaluated using automatic and manual evaluations and are found to outperform the baseline systems. The manual evaluation showed that the generated questions appear comparable in quality to the human-generated questions. The results also showed that our data augmentation technique can boost performance on the VQG task.
Although there are several categories of questions about radiology images, the proposed method can handle only four categories (i.e., abnormality, modality, organ, plane). In future work, we will study additional question categories. Future work will also include the creation of larger and more varied VQG datasets as well as the use of VQG models to create VQA data. In addition, we will investigate the use of the attention mechanism to focus on specific regions instead of the whole image. We also plan to investigate better evaluation strategies/metrics for the VQG task.