Goal-Driven Visual Question Generation from Radiology Images

Sarrouti, Mourad; Ben Abacha, Asma; Demner-Fushman, Dina

doi:10.3390/info12080334


by Mourad Sarrouti *, Asma Ben Abacha * and Dina Demner-Fushman

U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

* Authors to whom correspondence should be addressed.

Information 2021, 12(8), 334; https://doi.org/10.3390/info12080334

Submission received: 15 July 2021 / Revised: 14 August 2021 / Accepted: 16 August 2021 / Published: 20 August 2021

(This article belongs to the Special Issue Neural Natural Language Generation)


Abstract

Visual Question Generation (VQG) from images is a rising research topic in both natural language processing and computer vision. Although there are some recent efforts towards generating questions from images in the open domain, the VQG task in the medical domain has not been well studied so far due to the lack of labeled data. In this paper, we introduce a goal-driven VQG approach for radiology images called VQGRaD that generates questions targeting specific image aspects such as modality and abnormality. In particular, we study generating natural language questions based on the visual content of the image and on additional information such as the image caption and the question category. VQGRaD encodes the dense vectors of different inputs into two latent spaces, which allows generating, for a specific question category, relevant questions about the images, with or without their captions. We also explore the impact of domain knowledge incorporation (e.g., medical entities and semantic types) and data augmentation techniques on visual question generation in the medical domain. Experiments performed on the VQA-RAD dataset of clinical visual questions showed that VQGRaD achieves a BLEU-1 score of 61.86% and outperforms strong baselines. We also performed a blinded human evaluation of the grammaticality, fluency, and relevance of the generated questions. The human evaluation demonstrated the better quality of VQGRaD outputs and showed that incorporating medical entities improves the quality of the generated questions. Using the test data and evaluation process of the ImageCLEF 2020 VQA-Med challenge, we found that relying on the proposed data augmentation technique to generate new training samples by applying different kinds of transformations can mitigate the lack of data, avoid overfitting, and bring a substantial improvement in medical VQG.

Keywords: visual question generation; visual question answering; variational autoencoders; radiology images; domain knowledge; unified medical language system; data augmentation; computer vision; natural language processing; artificial intelligence; medical domain

1. Introduction

Recent advancements in computer vision [1,2,3,4,5], natural language processing [6,7,8,9,10,11] and deep learning [12,13,14,15,16] research have enabled enormous progress in many medical image interpretation technologies that support clinical decision making and improve patient engagement [17,18,19,20].

Generating natural language questions for image understanding is a rising research topic in both the fields of natural language processing and computer vision [21,22]. The task, known as Visual Question Generation (VQG), has two main motivations. First, it supports creating large-scale collections of Visual Question Answering (VQA) pairs at low cost since VQG could automatically generate questions about an image. Second, it can also play a role in improving the efficiency of human annotation for VQA datasets construction [23]. VQG combines natural language processing that provides the ability to generate the question, and computer vision techniques that allow the understanding of the image’s content.

In contrast to answering visual questions about images, generating questions has received little attention so far. A few recent works have attempted to generate questions from images in the open domain [24,25,26]. However, the task of VQG in the medical domain has not been well studied. In addition to the two main motivations mentioned above, VQG could benefit both doctors and patients. For example, patients could use the questions provided by VQG systems to start a conversation with their doctors and better understand their medical images. Moreover, such VQG systems could support medical education and clinical decision making by understanding medical images and generating questions related to their content [27].

The VQG task, as shown in Figure 1, consists of three main phases: (1) generating a representation of the image; (2) producing the embeddings with a neural network; and then (3) generating the question.

One major problem with medical VQG is the lack of large-scale labeled training data, which usually requires huge efforts to build, especially in the medical domain where domain experts are needed for data construction. Although deep learning models have achieved a remarkable success in computer vision and natural language processing tasks, the performance often depends on the size and quality of available training data, which is often tedious to collect [17,28,29]. Usually, to avoid the overfitting problem, the neural networks have to access more training data. However, many tasks lack access to large amounts of data, such as medical VQG and VQA.

Recently, we presented a VQG system that is able to generate questions when shown radiology images [30]. However, this approach is not goal-driven as it does not guarantee that the generated questions will address a specific aspect of the image. Our previous approach tackled generating natural language questions that are relevant to radiology images without any constraints on the types of the generated questions. Developing a method capable of asking goal-oriented questions about images is a challenging research problem. Towards this end, this work aims to develop an approach to generating questions that ask for specific information about radiology images. As shown in Figure 2, by specifying the type of the expected questions, different questions can be generated for a given image, such as “What is the condition seen in this image” for abnormality and “Does this have contrast” for modality. The goal-driven question generation process allows the category of the expected question to be specified for a given image. Such an approach will allow better control of the questions generated from radiology images when they involve multiple topics of interest, which occurs often in medical images.

In this paper, we introduce VQGRaD, a goal-driven VQG system for generating natural language questions about radiology images. VQGRaD is tasked with generating a natural language question when provided with an image (with or without its caption) and the question category. In summary, this paper makes the following contributions:

To the best of our knowledge, this work is the first attempt to generate visual questions about medical images that will result in a specific type of answer when provided with an initial indication of the question category/type.
To overcome the data limitation of VQG in closed domains, we propose a new data augmentation method for natural language questions.
VQGRaD is designed to work with or without an image caption and only requires the image and the question category as minimal inputs.
We study the impact of domain knowledge incorporation, such as named entities and semantic types, in the proposed VQGRaD approach.
VQGRaD is evaluated using the VQA-RAD dataset of clinical questions and radiology images. Experimental results show that VQGRaD performs better than strong baselines, with a BLEU-1 score of 61.86%. We also report the VQG models’ results in our participation at the VQA-Med 2020 challenge.
We perform a manual evaluation to study the grammaticality, fluency, and relevance of the questions generated from radiology images.

The remainder of the paper is organized as follows. First, related work concerning the main visual question generation systems is reviewed in Section 2. Then, the proposed methods are presented in Section 3. Several comprehensive experiments are performed to evaluate the effectiveness of the proposed methods in Section 4 where experimental settings, evaluation metrics, benchmark datasets, and results are presented. Conclusions and future work are finally presented in Section 5.

2. Related Work

Question generation (QG) is the task of automatically creating natural language questions from a range of inputs, such as natural language text [31,32,33], structured data [34] and images [21,24]. While many natural language processing and computer vision problems involve extracting information from the texts and images such as VQA [19,35,36], VQG, which can be considered as a complementary task of VQA, is a multi-modal problem involving image understanding and natural language generation, especially using generative methods.

VQG in the open domain has benefited from the available large-scale annotated datasets [37,38,39]. These large-scale datasets allow for a variety of work studying generative models and continuous latent spaces for generating visual questions in the open domain [22,40]. Early work on goal-oriented visual question generation focused primarily on reinforcement learning settings [41,42]. Recent VQG approaches have used autoencoder architectures to generate questions from images and some additional inputs such as answers and categories of questions [23,24,43]. The successes of these systems have primarily been a result of variational autoencoders [44].

In this work, we are interested in generating questions in closed domains. Visual question generation in closed domains, such as the medical field, is a challenging task [45,46,47,48] that is still understudied. For instance, Lau et al. [27] created the first VQG dataset in the medical domain (VQA-RAD), where each radiology image was manually annotated with several questions. However, this dataset is too small for training efficient VQG models. Recently, we have developed a VQG system that is able to generate questions from radiology images [30]. However, this approach is not goal-driven as it does not guarantee that the generated question will address a specific aspect of the image.

Inspired by the aforementioned open-domain research, we present in this paper VQGRaD, a goal-driven visual question generation system for radiology images based on the variational autoencoders architecture. Our work extends our previous method [30] by tackling visual question generation as a process that considers the question’s category, the image and its caption.

VQGRaD is able to generate questions about four main categories: abnormality, modality, plane, and organ. These are the most frequent question categories in the VQA-RAD dataset. The questions can be generated from either (i) the image and the question category or (ii) the image, its caption, and the question category.

In addition, to overcome the data limitation problem in the medical domain, we propose a text-based augmentation method to automatically create new training questions. Data augmentation, the application of one or more deformations to labeled data which result in new, additional training data, is a promising solution to handle the data insufficiency problem [49,50].

Automatic data augmentation based on images is commonly used in computer vision [17,28,29] and can help train deep learning models, particularly when using smaller datasets. Simply flipping or shifting images can help the models to better learn by increasing the number of training images. However, the lack of available training images for VQG in the medical domain makes image-based data augmentation alone insufficient for boosting performance on the visual question generation task. Thus, training our supervised models on the augmented natural language data can allow them to become more invariant to these deformations and generalize better to unseen data.

Our text data augmentation method can also be used in open-domain and restricted-domain NLP tasks, such as text classification and question answering, as it relies on general morpho-syntactic features to replace relevant target words in the original text with words that have a high contextual similarity. The following section will present our proposed methods in detail.

3. Methods

The first goal of this study is to generate natural language questions that ask about specific topics such as modality, abnormality, plane, and organ.

To address this challenge, we present VQGRaD, a goal-driven visual question generation system for radiology images that aims to generate relevant questions based on the visual content of the image.

3.1. Problem Modeling

Given a pair $(C, I)$, where $C$ is the question category accompanied by a medical image $I$, VQGRaD is tasked with generating the appropriate question $Q$ that will result in a specific type of answer. Mathematically, the VQG task can be formulated as:

$$Q = f(C, I, \alpha) \tag{1}$$

where $f$ is the question generation function and $\alpha$ denotes the parameters of the model. Categories in $C$ are modality, abnormality, plane, and organ.

In the following sections, we will provide a detailed description of our proposed methods.

3.2. VQGRaD

The VQGRaD model is based on the variational autoencoder (VAE) architecture [44]. It first encodes the image and its caption along with the category before generating the question. VAEs comprise two neural network components, known as the encoder and the decoder, for learning the probability distribution of the data $p(x)$. The encoder maps the raw data $x$ to a latent variable $z$ in the latent space (z-space). In other words, the encoder compresses the data from the initial space to the encoded space, also called the latent space. The decoder, on the other hand, aims at recovering $x$ using $z$ extracted from the latent space. The encoder and the decoder are trained by maximizing the marginal log-likelihood $\log p(x)$. Since this quantity is generally intractable, training maximizes its Evidence Lower Bound (ELBO) instead:

$$\log p(x) \geq \mathbb{E}_{z \sim q_{\theta}(z \mid x)}\left[\log p_{\phi}(x \mid z)\right] - \mathrm{KL}\left(q_{\theta}(z \mid x) \,\|\, p(z)\right) = \mathrm{ELBO} \tag{2}$$

where $q_{\theta}(z \mid x)$ and $p_{\phi}(x \mid z)$ are the probability distributions of the encoder and the decoder, respectively.
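For readers who want the intermediate step, the bound in Equation (2) follows from Jensen's inequality. The derivation below is a standard textbook sketch, not taken from the paper; it reuses the same symbols, with $q_{\theta}$ for the encoder, $p_{\phi}$ for the decoder, and $p(z)$ for the prior:

```latex
\begin{aligned}
\log p(x) &= \log \int p_{\phi}(x \mid z)\, p(z)\, dz \\
&= \log \mathbb{E}_{z \sim q_{\theta}(z \mid x)}\left[\frac{p_{\phi}(x \mid z)\, p(z)}{q_{\theta}(z \mid x)}\right] \\
&\geq \mathbb{E}_{z \sim q_{\theta}(z \mid x)}\left[\log \frac{p_{\phi}(x \mid z)\, p(z)}{q_{\theta}(z \mid x)}\right] \\
&= \mathbb{E}_{z \sim q_{\theta}(z \mid x)}\left[\log p_{\phi}(x \mid z)\right] - \mathrm{KL}\left(q_{\theta}(z \mid x) \,\|\, p(z)\right)
\end{aligned}
```

The inequality is tight exactly when $q_{\theta}(z \mid x)$ equals the true posterior $p(z \mid x)$.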

The loss function that is minimized when training VAEs is the negative log-likelihood with a regularizer. It consists of a reconstruction loss and a regularization loss (on the latent layer). The reconstruction loss makes the encoding-decoding scheme as accurate as possible, whereas the regularization loss organizes the latent space by keeping the distributions returned by the encoder close to a standard normal distribution. The loss function $l_{i}$ for datapoint $x_{i}$ is:

$$l_{i}(\phi, \theta) = -\mathbb{E}_{z \sim q_{\theta}(z \mid x_{i})}\left[\log p_{\phi}(x_{i} \mid z)\right] + \mathrm{KL}\left(q_{\theta}(z \mid x_{i}) \,\|\, p(z)\right) \tag{3}$$

where $\mathbb{E}_{z \sim q_{\theta}(z \mid x_{i})}\left[\log p_{\phi}(x_{i} \mid z)\right]$ is the reconstruction term and $\mathrm{KL}\left(q_{\theta}(z \mid x_{i}) \,\|\, p(z)\right)$ is the Kullback–Leibler divergence regularization between the returned distribution and a standard Gaussian. $\phi$ and $\theta$ are the parameters of the decoder distribution $p_{\phi}(x \mid z)$ and the encoder distribution $q_{\theta}(z \mid x)$, respectively.
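As a minimal sketch of the loss in Equation (3), the snippet below assumes a diagonal Gaussian encoder, for which the KL term to a standard normal prior has a well-known closed form. The function names and the single-sample reconstruction estimate are illustrative assumptions, not the authors' implementation.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian encoder."""
    return 0.5 * sum(s * s + m * m - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def vae_loss(reconstruction_log_likelihood, mu, sigma):
    """Per-datapoint loss: -E[log p(x|z)] + KL(q || p).

    `reconstruction_log_likelihood` is assumed to be a sample-based
    estimate of E[log p(x|z)] supplied by the caller.
    """
    return -reconstruction_log_likelihood + kl_to_standard_normal(mu, sigma)
```

When the encoder already outputs a standard normal (`mu = 0`, `sigma = 1`), the KL term vanishes and the loss reduces to the reconstruction error alone.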

In VQGRaD, as shown in Figure 3, a Convolutional Neural Network (CNN) is used to obtain the image feature map $v$ and a Long Short-Term Memory network (LSTM) [51] is used to generate the embedded caption features $c$. The categories of the questions are represented as a one-hot vector $a$. VQGRaD then encodes the dense vectors $h_{c}$, $h_{a}$ and $h_{v}$ of the caption, the category, and the image, respectively, into a continuous, dense, latent z-space. It also encodes the dense vectors $h_{a}$ and $h_{v}$ into another continuous, dense, latent t-space based on the continuous latent space introduced in [24] for regularization. This allows our system to maximize the mutual information $MI(\cdot)$ between the encoded features, i.e., the image, the caption, the category, and the latent space. $MI(\cdot)$ measures how much knowing one of the predefined features reduces uncertainty about the other. For example, if $h_{a}$ and $h_{v}$ are independent, then knowing $h_{v}$ does not give any information about $h_{a}$ and vice versa, so their mutual information is zero.
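The independence example can be checked numerically. The sketch below computes mutual information for a small discrete joint distribution; this is a toy stand-in, since the features in VQGRaD are continuous learned vectors, and the function name is an assumption made for illustration.

```python
import math

def mutual_information(joint):
    """MI in nats of a discrete joint distribution given as a 2-D list of probabilities."""
    px = [sum(row) for row in joint]            # marginal over rows
    py = [sum(col) for col in zip(*joint)]      # marginal over columns
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:                       # 0 * log(0) is taken as 0
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

# Two independent fair coins versus two perfectly correlated ones.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
```

For the independent table every joint entry equals the product of its marginals, so each log term is zero; for the perfectly correlated table the result is log 2.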

In our case, the optimization is computed as follows:

$$\begin{matrix} \max_{\phi} \; MI(q, z \mid c, a, v) + \lambda_{1} MI(c, z) + \lambda_{2} MI(a, z) + \lambda_{3} MI(v, z) \\ \text{s.t.} \; |z| = p_{\phi}(q \mid z) \end{matrix} \tag{4}$$

where $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$ are hyperparameters that relatively weight the $MI(\cdot)$ terms in the optimization. $p_{\phi}(q \mid z)$ is the learned mapping, parameterized by $\phi$, from the image, the caption, and the category to this latent space.

Then, VQGRaD reconstructs the inputs from the z-space using a simple Multi-Layer Perceptron (MLP), which is a neural network with fully connected layers. It generates the reconstructed image, caption, and category features $\hat{h}_{v}$, $\hat{h}_{c}$, $\hat{h}_{a}$, and optimizes the model by minimizing the following $\ell_{2}$ losses:

$$\begin{matrix} L_{v} = \| h_{v} - \hat{h}_{v} \|_{2} \\ L_{c} = \| h_{c} - \hat{h}_{c} \|_{2} \\ L_{a} = \| h_{a} - \hat{h}_{a} \|_{2} \end{matrix} \tag{5}$$

On the other hand, VQGRaD trains the t-space by minimizing the KL divergence with the z-space:

$$\begin{aligned} L_{t} &= \mathrm{KL}\left(p_{\phi}(z \mid c, a, v),\; p_{\theta}(t \mid a, v)\right) \\ &= \log \sigma_{t} - \log \sigma_{z} + \frac{\sigma_{z}^{2} + (\mu_{t} - \mu_{z})^{2}}{2 \sigma_{t}^{2}} - 0.5 \end{aligned} \tag{6}$$

where $\phi$ and $\theta$ are the parameters used to embed into the z-space and the t-space, respectively. We used the reparameterization trick [44] to generate the means $\mu_{z}$ and standard deviations $\sigma_{z}$ and combine them with sampled unit Gaussian noise $\epsilon$ to generate:

$$z = \mu_{z} + \epsilon \sigma_{z} \tag{7}$$
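The two operations just described can be sketched in a few lines of plain Python. The sampling step follows Equation (7); the KL term uses the standard closed form for two Gaussians parameterized by means and standard deviations, summed over dimensions. Function names are hypothetical, and a PyTorch version would operate on tensors rather than lists.

```python
import math
import random

def reparameterize(mu, sigma, rng=random):
    """Draw z = mu + eps * sigma with eps ~ N(0, 1), element-wise."""
    return [m + rng.gauss(0.0, 1.0) * s for m, s in zip(mu, sigma)]

def gaussian_kl(mu_z, sigma_z, mu_t, sigma_t):
    """KL( N(mu_z, sigma_z^2) || N(mu_t, sigma_t^2) ), summed over dimensions."""
    total = 0.0
    for mz, sz, mt, st in zip(mu_z, sigma_z, mu_t, sigma_t):
        total += (math.log(st) - math.log(sz)
                  + (sz * sz + (mt - mz) ** 2) / (2.0 * st * st) - 0.5)
    return total
```

Sampling through `mu + eps * sigma` keeps the noise outside the learned parameters, which is what allows gradients to flow to the mean and standard deviation during training.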

In VQGRaD, the t-space is not only used for regularization but also to generate questions from only the image and the question category.

Finally, VQGRaD uses an LSTM decoder to generate the question $\hat{q}$ from either the z-space or the t-space. The decoder takes a sample from the latent z-space and uses it as input to output the question $\hat{q}$. It receives a “start” symbol and proceeds to output a question word by word until it produces an “end” symbol. We used the cross-entropy loss function to measure and minimize the error $L_{g}$ between the generated question $\hat{q}$ and the ground truth question $q$. The generation of each word of the question can be written as:

$$\hat{w}_{t} = \arg\max_{w \in W} p(w \mid v, w_{0}, \dots, w_{t - 1}) \tag{8}$$

where $\hat{w}_{t}$ is the predicted word at step $t$, $W$ denotes the word vocabulary, and $w_{i}$ represents the i-th ground-truth word.
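The word-by-word generation rule of Equation (8) amounts to greedy decoding. The sketch below illustrates the loop with a toy lookup table standing in for the LSTM decoder; `TOY_MODEL`, `toy_next_word_probs`, and the symbol strings are hypothetical, and the 20-token limit mirrors the maximum output length reported in Section 4.3.

```python
def greedy_decode(next_word_probs, max_len=20, start="<start>", end="<end>"):
    """Generate words one at a time, always taking the most probable next word."""
    words = [start]
    for _ in range(max_len):
        probs = next_word_probs(words)       # dict: word -> probability
        best = max(probs, key=probs.get)     # arg max over the vocabulary
        if best == end:
            break
        words.append(best)
    return words[1:]                         # drop the start symbol

# Toy stand-in for the decoder: the distribution depends only on the last word.
TOY_MODEL = {
    "<start>": {"what": 0.7, "is": 0.3},
    "what": {"plane": 0.6, "is": 0.4},
    "plane": {"is": 0.9, "<end>": 0.1},
    "is": {"this": 0.8, "<end>": 0.2},
    "this": {"<end>": 1.0},
}

def toy_next_word_probs(words):
    return TOY_MODEL[words[-1]]
```

A real decoder would condition on the full history and on the latent sample rather than on the last word alone.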

The final loss of VQGRaD is as follows:

$$L_{vqgrad} = \lambda_{5} L_{g} + \lambda_{4} \mathrm{KL} + \lambda_{3} L_{v} + \lambda_{2} L_{a} + \lambda_{1} L_{c} + \lambda_{6} L_{t} \tag{9}$$

where $\mathrm{KL}$ is the Kullback–Leibler divergence, $\lambda_{1}, \lambda_{2}, \lambda_{3}$ have already been introduced, and $\lambda_{4}, \lambda_{5}, \lambda_{6}$ are hyperparameters that control the variational loss, the question generation loss, and the amount of regularization used in our model, respectively.
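To make the weighting in Equation (9) concrete, the following sketch combines the individual loss terms using the weights reported in Section 4.3. The function names and the dictionary layout are assumptions made for illustration; the individual losses would come from the model during training.

```python
import math

# Weights lambda_1 ... lambda_6 as reported in Section 4.3.
LAMBDAS = {1: 0.001, 2: 0.005, 3: 0.001, 4: 0.0001, 5: 0.001, 6: 0.001}

def l2_distance(h, h_hat):
    """Euclidean distance between a feature vector and its reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h, h_hat)))

def vqgrad_loss(l_g, kl, l_v, l_a, l_c, l_t, lambdas=LAMBDAS):
    """Weighted sum of generation, variational, reconstruction, and t-space losses."""
    return (lambdas[5] * l_g + lambdas[4] * kl + lambdas[3] * l_v
            + lambdas[2] * l_a + lambdas[1] * l_c + lambdas[6] * l_t)
```

With these values the category reconstruction term carries the largest weight and the variational term the smallest.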

3.3. Data Augmentation

Questions. For a given medical question $q$, we generate a set of new questions. During the augmenting process, we use the whole training data $D = \{q_{i}\}_{i = 1}^{n}$, where $n$ is the number of training questions. We expand each training question $q_{i}$ into a set of instances $q_{i}^{k}$, where $k$ is the number of derived pairs for each training question. To do so, we first select nouns and verbs as candidate words, using the following part-of-speech tags (we used NLTK [52] to perform part-of-speech tagging):

NN: Noun, singular or mass.
NNS: Noun, plural.
NNP: Proper noun, singular.
NNPS: Proper noun, plural.
VBD: Verb, past tense.
VBP: Verb, non-3rd person singular present.
VBN: Verb, past participle.
VBG: Verb, gerund or present participle.
VBZ: Verb, 3rd person singular present.
VB: Verb, base form.

Each candidate word is then replaced by contextually similar words using the Wiki-PubMed-PMC embeddings, which were trained on four million English Wikipedia, PubMed, and PMC articles. Similar words for a given word are retrieved from the word embedding space using cosine similarity. We compute the cosine similarity between the vector of the given word $w_{i}$ in the question and the vector of each word $w_{j}$ in the pre-trained word embeddings. We use the top $k$ similar words according to the cosine similarity. We carried out several experiments with $k \in \{5, 10, 15, 20, 30\}$ and found that the best results were achieved with $k = 10$. Figure 4 presents some examples of created questions for the input question “Are the kidneys normal?”.
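The question augmentation procedure can be sketched as follows. This is an illustration under stated assumptions: the part-of-speech tags are supplied by hand rather than produced by NLTK, and `TOY_EMBEDDINGS` is a hypothetical two-dimensional stand-in for the Wiki-PubMed-PMC vectors. Each new question replaces exactly one candidate word with one of its top-k neighbours.

```python
import math

# Noun and verb tags used to select candidate words.
CANDIDATE_TAGS = {"NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k_similar(word, embeddings, k):
    """Return the k vocabulary words closest to `word` by cosine similarity."""
    scored = [(cosine(embeddings[word], vec), other)
              for other, vec in embeddings.items() if other != word]
    return [other for _, other in sorted(scored, reverse=True)[:k]]

def augment(tagged_question, embeddings, k):
    """Build new questions, each replacing one noun or verb with a similar word."""
    tokens = [tok for tok, _ in tagged_question]
    new_questions = []
    for i, (tok, tag) in enumerate(tagged_question):
        # Words missing from the embedding vocabulary are left untouched.
        if tag in CANDIDATE_TAGS and tok in embeddings:
            for substitute in top_k_similar(tok, embeddings, k):
                new_questions.append(" ".join(tokens[:i] + [substitute] + tokens[i + 1:]))
    return new_questions

# Hypothetical embeddings and hand-assigned tags, for illustration only.
TOY_EMBEDDINGS = {
    "kidneys": [1.0, 0.1], "kidney": [0.9, 0.1],
    "liver": [0.8, 0.5], "contrast": [0.0, 1.0],
}
TAGGED = [("are", "VBP"), ("the", "DT"), ("kidneys", "NNS"), ("normal", "JJ")]
```

Calling `augment(TAGGED, TOY_EMBEDDINGS, k=2)` produces one variant per neighbour of "kidneys"; with real embeddings and k = 10, each candidate word would contribute up to ten new questions.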

Images. We also generate new training instances based on image augmentation techniques. To do so, we apply flipping, rotation, shifting, and blurring to all VQA-RAD training images. Figure 5 presents some examples of created images.
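The four transformations can be illustrated on a tiny grayscale image stored as a list of rows. The paper does not give rotation angles, shift sizes, or blur kernels, so the choices below (90-degree rotation, zero-padded shift, 3-pixel moving average) are assumptions made only to show the idea.

```python
def flip_horizontal(image):
    """Mirror each row of a 2-D image (list of rows)."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate a 2-D image by 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]

def shift_right(image, pixels, fill=0):
    """Shift every row to the right, padding the left edge with `fill`."""
    width = len(image[0])
    return [([fill] * pixels + row)[:width] for row in image]

def box_blur_row(row):
    """Blur one row with a 3-pixel moving average (edges use the available neighbours)."""
    out = []
    for i in range(len(row)):
        window = row[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out
```

Each transformed image keeps the questions of the original, which is why such augmentation multiplies the number of (image, question) training pairs.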

4. Experimental Settings and Results

In this section, we present our VQG results and conduct a comprehensive ablation analysis. As mentioned above, the proposed method is evaluated on the VQA-RAD and VQA-Med 2020 datasets.

4.1. Datasets

In this study, we used the VQA-RAD [27] dataset of clinical visual questions to evaluate our VQG system. The dataset contains 315 images and 3515 corresponding questions. Figure 6 presents sample images and questions. Each image is associated with more than one question, each of which is accompanied by its category. In this work, we are particularly interested in five categories of questions: “Modality”, “Abnormality”, “Organ”, “Plane” and “Other”. Table 1 presents the number of questions and images associated with each of these categories before and after data augmentation. The test set contains 100 reference questions with associated categories and images.

We have also used the datasets provided by the VQA-Med 2020 challenge at ImageCLEF 2020 during our participation. Given a radiology image, the VQG task consists of generating a natural language question based on the image’s content. The dataset used in VQA-Med 2020 consists of 780 radiology images with 2156 associated questions as training data, 141 radiology images with 164 questions as validation data, and 80 radiology images as test data. There are 1942 unique questions in the 2156 training questions. Some questions are associated with more than one image (up to 8 images). After applying data augmentation, our final training set consists of 161,348 questions. Figure 7 shows examples from VQA-Med 2020 VQG data.

4.2. Evaluation Metrics

To investigate the performance of our visual question generation model, we make use of both automatic and manual evaluations.

4.2.1. Automatic Evaluation

VQG is a sequence generation problem. Therefore, in the automatic evaluation, we used several text generation metrics, namely BLEU, ROUGE, METEOR, and CIDEr, to measure the similarity between the system-generated questions and the ground truth questions in the test set. We used the evaluation package published by [53]. BLEU-{1-4} measures the quality of the generated question by counting the matching {1-4}-grams in the generated question to the {1-4}-grams in the reference question, respectively. METEOR compares the generated question with the reference question in terms of exact, stem, synonym, and paraphrase matches between words and phrases. ROUGE-L assesses the generated question based on the longest common subsequence shared by both the candidate and the reference question. CIDEr measures consensus in questions by performing a Term Frequency Inverse Document Frequency (TF-IDF) weighting for each n-gram.
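As a rough illustration of what BLEU-1 measures, the sketch below computes clipped unigram precision multiplied by the brevity penalty for a single candidate and a single reference. It is a simplified sentence-level version with whitespace tokenization, not the evaluation package [53] used for the reported scores, which works at the corpus level.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Clipped unigram precision times the brevity penalty, for one reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate word is credited at most as often as it occurs in the reference.
    matches = sum(min(count, ref_counts[word]) for word, count in Counter(cand).items())
    precision = matches / len(cand)
    # Candidates shorter than the reference are penalized.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return brevity * precision
```

A generated question that swaps one of four words, such as "is the kidney normal" against the reference "is the liver normal", scores 0.75 even though it asks about a different organ, which hints at why n-gram overlap alone is a limited proxy for question quality.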

4.2.2. Human Evaluation

We also performed a human evaluation to measure the quality of the questions generated by our system and the baseline. To do so, we followed the standard approach in evaluating text generation systems [54], as used for question generation by [55,56]. We manually checked the generated questions and rated them in terms of relevancy, grammaticality, and fluency. The relevancy of a question is determined by the relationship between the question, the image, and the category. Grammaticality refers to the conformity of a question to the rules of grammar. Fluency and common sense (readability) refer to how well the individual words read together within a question. Two experts at the U.S. National Institutes of Health (NIH) performed the manual evaluation. For each measure, the assessors were required to give a rating on a scale from 1 to 3 (1 = Incorrect, 2 = Average (minor errors), 3 = Correct).

4.3. Implementation Details

VQGRaD. Our VQGRaD is implemented using PyTorch. We used ImageNet-pretrained ResNet-50 [57] without fine-tuning its weights. Since the model expects an input of dimension $224 \times 224$, we resized the input images to suit that dimension. The z-space and t-space have 100 dimensions. We used the Adam optimizer [58] with a learning rate of 0.0001, a batch size of 32, and a maximum output sequence length of 20 tokens. All models were trained for 40 epochs using single P100 GPUs (16 GB VRAM) on a shared cluster, and the best results were used as final results. We optimized the hyperparameters such that $\lambda_{1}$ = 0.001, $\lambda_{2}$ = 0.005, $\lambda_{3}$ = 0.001, $\lambda_{4}$ = 0.0001, $\lambda_{5}$ = 0.001 and $\lambda_{6}$ = 0.001 for a total of 20 epochs. The source code is publicly available on GitHub at https://github.com/sarrouti/vqgrad, accessed on 15 August 2021.

VQG baseline. We used our recent VQG system named VQGR [30] as a baseline. This model is based on the variational autoencoder architecture that takes an image as input and generates a question. In our implementation, we used ImageNet-pretrained ResNet-50 [57] provided by PyTorch without fine-tuning its weights as the image encoder and an LSTM decoder for generating questions. The source code is publicly available on GitHub at https://github.com/sarrouti/vqgr, accessed on 15 August 2021.

4.4. Experiments and Results

In order to study the task of visual question generation about radiology images and explore the impact of domain knowledge incorporation such as medical entities and UMLS semantic types, we perform several experiments with different settings as shown in Table 2:

VQGRaD is our full model that can generate questions from either the caption latent space z (image, caption, and category) or the category latent space t (image and category).
VQGRaD $_{w_t}$ includes another LSTM encoder to encode the image titles.
VQGRaD $_{w_st}$ includes another LSTM encoder to encode the UMLS semantic types extracted from the image captions.
VQGRaD $_{w_e}$ uses only UMLS entities instead of using all words in captions. PyMetamap (https://github.com/AnthonyMRios/pymetamap, accessed on 15 August 2021), a python wrapper for MetaMap [59], has been used for extracting UMLS entities and semantic types.
In VQGRaD $_{cap_or_c}$, the z-space contains only image and caption features.

All systems can generate questions from either the caption latent space z or the category latent space t.

Table 2 also presents a comparison of our proposed models and the baseline systems:

The VQGR baseline system is trained on the VQA-RAD dataset without data augmentation.
VQGR $_{w_im_aug}$ is trained on the dataset generated by augmenting the images.
VQGR $_{w_our_aug}$ is trained on the dataset generated by our data augmentation technique.

By comparing VQGR, VQGR $_{w_im_aug}$ and VQGR $_{w_our_aug}$, we can see that our data augmentation technique considerably improved the results. The best BLEU-1 score, 55.05%, was achieved using our data augmentation technique.

Furthermore, it is interesting to see that VQGRaD performs the best over the baseline systems and on all evaluation metrics. Moreover, all of our VQGRaD models outperform the baseline system by a significant margin. This confirms our hypothesis that the task of visual question generation can be goal-driven. VQGRaD achieved consistently better scores among other ablations when the questions were generated from the t-space, which contains the image and the question category features.

When adding the UMLS semantic types extracted from the captions in VQGRaD $_{w_st}$ or the image titles in VQGRaD $_{w_t}$, the models’ performance improved in most metrics when the questions were generated from the z-space (caption latent space). This is likely because the questions were generated from a latent space that encodes more features (including images, question category, caption or title, and UMLS semantic types) than in the VQGRaD system. However, the best results were obtained by VQGRaD when the questions were generated from the t-space (category latent space). Thus, building end-to-end VQG models that consider the question type is a feasible and efficient task.

For additional evaluation, we used our VQG system during our participation in the VQG task of the VQA-Med challenge at ImageCLEF 2020 [60]. Given a radiology image, the VQG task consists of generating a natural language question based on the image’s content. As the questions were all about abnormality, we only used our recent VQGR system based on VAEs and data augmentation, which takes an image as input and generates a natural language question as output. Table 3 shows our official results on the validation set at the VQG task of the VQA-Med challenge.

The official results of ImageCLEF 2020 VQA-Med showed that using a sequence generation model to solve VQG in the medical domain is complicated by the scarcity of labeled data. Hence, the participating systems used image classification approaches [48] to solve the VQG task. Small datasets might require models that have low complexity, whereas sequence generation models require a large amount of training data as they try to learn the underlying data distribution of the input in order to output new sequences. The available training data for VQG in the medical domain is not large or varied enough for training a seq2seq model. However, once we increased the size and variance of the dataset through the proposed augmentations, the performance of the proposed VQG system increased significantly, yielding BLEU scores of 39.74% and 11.6% on the validation set and the test set, respectively.

An additional manual evaluation of the VQG models’ outputs was performed by two experts in medical informatics. Table 4 presents the results of the manual evaluation. Twenty (question, image, category) triples from the test set were randomly selected for the manual evaluation. Detailed guidelines for the raters are listed in Section 4.2.2. Inter-rater reliability was calculated on each of the 3 measures. F1-score for each measure is presented in Table 5. Most of the reliability scores are close to 0.50, which is considered satisfactory reliability [61].

The human evaluation showed that our models achieved the highest scores by generating more relevant and correct questions. This also demonstrates that the image caption and the question category features contribute to generating better questions. Furthermore, the results showed that adding medical entities as an additional input improves the quality of the generated questions.

Overall, VQGRaD provides an improved approach to generating visual questions by targeting specific types of natural language questions about radiology images. Table 6 provides example questions generated by [27] (ground truth questions) and the VQGRaD model. These examples show that the questions generated by our model are more consistent with the reference questions.

The manual evaluation scores are much higher than the automatic ones. This is because the system, as shown in Table 6, generates question words that are semantically comparable to, but not exactly the same as, the words in the ground-truth question. Indeed, we believe that the existing automatic evaluation metrics are not enough to accurately evaluate text/question generation tasks. Further efforts are needed to investigate a better evaluation strategy for the VQG task.

5. Conclusions and Future Work

In this paper, we presented a goal-driven visual question generation approach called VQGRaD that generates a question relevant to both the image and a specified category. In particular, we were interested in questions about the Abnormality, Modality, Organ, and Plane of radiology images. The generated questions were assessed with automatic and manual evaluations, in which VQGRaD outperformed the baseline systems. The manual evaluation showed that the generated questions appear comparable in quality to the human-generated questions. The results also showed that our data augmentation technique can boost performance on the VQG task.

Although there are several categories of questions about radiology images, the proposed method handles only four of them (abnormality, modality, organ, and plane). In future work, we will study additional question categories, create larger and more varied VQG datasets, and use VQG models to create VQA data. In addition, we will investigate the use of an attention mechanism to focus on specific image regions instead of the whole image. We also plan to investigate better evaluation strategies and metrics for the VQG task.

Author Contributions

M.S. developed the VQG systems, carried out the experiments, and wrote the first draft of the manuscript. A.B.A. and M.S. designed the goal-oriented VQG model and the knowledge incorporation and data augmentation methods. A.B.A. and D.D.-F. performed the manual evaluation of the VQG systems, supervised the project, and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

This work was supported by the intramural research program at the U.S. National Library of Medicine, National Institutes of Health.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Li, X.; Grandvalet, Y.; Davoine, F.; Cheng, J.; Cui, Y.; Zhang, H.; Belongie, S.; Tsai, Y.H.; Yang, M.H. Transfer learning in computer vision tasks: Remember where you come from. Image Vis. Comput. 2020, 93, 103853.
2. Guo, J.; He, H.; He, T.; Lausen, L.; Li, M.; Lin, H.; Shi, X.; Wang, C.; Xie, J.; Zha, S.; et al. GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing. J. Mach. Learn. Res. 2020, 21, 1–7.
3. Pelka, O.; Friedrich, C.M.; García Seco de Herrera, A.; Müller, H. Overview of the ImageCLEFmed 2020 concept prediction task: Medical image understanding. In Proceedings of the CLEF 2020—Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, 22–25 September 2020.
4. Elharrouss, O.; Almaadeed, N.; Al-Maadeed, S.; Bouridane, A. Gait recognition for person re-identification. J. Supercomput. 2021, 77, 3653–3672.
5. Elharrouss, O.; Almaadeed, N.; Al-Maadeed, S. Mhad: Multi-human action dataset. In Fourth International Congress on Information and Communication Technology; Springer: Singapore, 2020; pp. 333–341.
6. Sarrouti, M.; Alaoui, S.O.E. SemBioNLQA: A semantic biomedical question answering system for retrieving exact and ideal answers to natural language questions. Artif. Intell. Med. 2020, 102, 101767.
7. Ruder, S.; Peters, M.E.; Swayamdipta, S.; Wolf, T. Transfer Learning in Natural Language Processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, Minneapolis, MN, USA, 2–7 June 2019; pp. 15–18.
8. El-allaly, E.-d.; Sarrouti, M.; En-Nahnahi, N.; Alaoui, S.O.E. An adverse drug effect mentions extraction method based on weighted online recurrent extreme learning machine. Comput. Methods Programs Biomed. 2019, 176, 33–41.
9. Sarrouti, M.; Alaoui, S.O.E. A Yes/No Answer Generator Based on Sentiment-Word Scores in Biomedical Question Answering. Int. J. Healthc. Inf. Syst. Inform. 2017, 12, 62–74.
10. Sarrouti, M.; Lachkar, A. A new and efficient method based on syntactic dependency relations features for ad hoc clinical question classification. Int. J. Bioinform. Res. Appl. 2017, 13, 161.
11. Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; Hon, H.W. Unified Language Model Pre-training for Natural Language Understanding and Generation. arXiv 2019, arXiv:1905.03197.
12. Moen, E.; Bannon, D.; Kudo, T.; Graf, W.; Covert, M.; Van Valen, D. Deep learning for cellular image analysis. Nat. Methods 2019, 16, 1233–1246.
13. El-allaly, E.-d.; Sarrouti, M.; En-Nahnahi, N.; Alaoui, S.O.E. DeepCADRME: A deep neural model for complex adverse drug reaction mentions extraction. Pattern Recognit. Lett. 2021, 143, 27–35.
14. El-allaly, E.-d.; Sarrouti, M.; En-Nahnahi, N.; Alaoui, S.O.E. MTTLADE: A multi-task transfer learning-based method for adverse drug events extraction. Inf. Process. Manag. 2021, 58, 102473.
15. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805.
16. Sarrouti, M.; Ben Abacha, A.; Demner-Fushman, D. Multi-task transfer learning with data augmentation for recognizing question entailment in the medical domain. In Proceedings of the 2021 IEEE International Conference on Healthcare Informatics (ICHI), Victoria, BC, Canada, 9–12 August 2021.
17. Ionescu, B.; Müller, H.; Villegas, M.; de Herrera, A.G.S.; Eickhoff, C.; Andrearczyk, V.; Cid, Y.D.; Liauchuk, V.; Kovalev, V.; Hasan, S.A.; et al. Overview of ImageCLEF 2018: Challenges, datasets and evaluation. In Proceedings of the International Conference of the Cross-Language Evaluation Forum for European Languages, Avignon, France, 10–14 September 2018; pp. 309–334.
18. Pelka, O.; Friedrich, C.M.; Seco De Herrera, A.; Müller, H. Overview of the ImageCLEFmed 2019 concept detection task. In Proceedings of the CLEF 2019—Conference and Labs of the Evaluation Forum, Lugano, Switzerland, 9–12 September 2019.
19. Ben Abacha, A.; Datla, V.V.; Hasan, S.A.; Demner-Fushman, D.; Müller, H. Overview of the VQA-Med Task at ImageCLEF 2020: Visual Question Answering and Generation in the Medical Domain. In Proceedings of the CLEF 2020—Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, 22–25 September 2020.
20. Gupta, D.; Suman, S.; Ekbal, A. Hierarchical deep multi-modal network for medical visual question answering. Expert Syst. Appl. 2021, 164, 113993.
21. Mostafazadeh, N.; Misra, I.; Devlin, J.; Mitchell, M.; He, X.; Vanderwende, L. Generating Natural Questions about an Image. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Berlin, Germany, 2016; pp. 1802–1813.
22. Zhang, S.; Qu, L.; You, S.; Yang, Z.; Zhang, J. Automatic Generation of Grounded Visual Questions. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), Melbourne, Australia, 19–25 August 2017; pp. 4235–4243.
23. Li, Y.; Duan, N.; Zhou, B.; Chu, X.; Ouyang, W.; Wang, X. Visual Question Generation as Dual Task of Visual Question Answering. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6116–6124.
24. Krishna, R.; Bernstein, M.; Fei-Fei, L. Information Maximizing Visual Question Generation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 2008–2018.
25. Patro, B.N.; Kurmi, V.K.; Kumar, S.; Namboodiri, V.P. Deep Bayesian Network for Visual Question Generation. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 1555–1565.
26. Patil, C.; Patwardhan, M. Visual Question Generation: The State of the Art. ACM Comput. Surv. 2020, 53, 1–22.
27. Lau, J.J.; Gayen, S.; Ben Abacha, A.; Demner-Fushman, D. A dataset of clinically generated visual questions and answers about radiology images. Sci. Data 2018, 5, 180251.
28. Perez, L.; Wang, J. The Effectiveness of Data Augmentation in Image Classification using Deep Learning. arXiv 2017, arXiv:1712.04621.
29. Inés, A.; Domínguez, C.; Heras, J.; Mata, E.; Pascual, V. Biomedical image classification made easier thanks to transfer and semi-supervised learning. Comput. Methods Programs Biomed. 2021, 198, 105782.
30. Sarrouti, M.; Ben Abacha, A.; Demner-Fushman, D. Visual Question Generation from Radiology Images. In Proceedings of the First Workshop on Advances in Language and Vision Research, Online, 6–8 July 2020; pp. 12–18.
31. Kalady, S.; Elikkottil, A.; Das, R. Natural language question generation using syntax and keywords. In Proceedings of the QG2010: The Third Workshop on Question Generation, Pittsburgh, PA, USA, 14–18 June 2010; Volume 2, pp. 5–14.
32. Kim, Y.; Lee, H.; Shin, J.; Jung, K. Improving neural question generation using answer separation. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6602–6609.
33. Li, J.; Gao, Y.; Bing, L.; King, I.; Lyu, M.R. Improving Question Generation With to the Point Context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 3–7 November 2019; Volume 33, pp. 3216–3226.
34. Serban, I.V.; García-Durán, A.; Gulcehre, C.; Ahn, S.; Chandar, S.; Courville, A.; Bengio, Y. Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; pp. 588–598.
35. Kafle, K.; Kanan, C. Visual Question Answering: Datasets, Algorithms, and Future Challenges. Comput. Vis. Image Underst. 2017, 163, 3–20.
36. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086.
37. Agrawal, A.; Lu, J.; Antol, S.; Mitchell, M.; Zitnick, C.L.; Batra, D.; Parikh, D. VQA: Visual Question Answering. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2425–2433.
38. Goyal, Y.; Khot, T.; Agrawal, A.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. Int. J. Comput. Vis. 2019, 127, 398–414.
39. Johnson, J.; Hariharan, B.; van der Maaten, L.; Fei-Fei, L.; Zitnick, C.L.; Girshick, R. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1988–1997.
40. Masuda-Mora, I.; Pascual-deLaPuente, S.; Giro-i-Nieto, X. Towards Automatic Generation of Question Answer Pairs from Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016.
41. Zhang, J.; Wu, Q.; Shen, C.; Zhang, J.; Lu, J.; van den Hengel, A. Goal-Oriented Visual Question Generation via Intermediate Rewards. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 189–204.
42. Yang, J.; Lu, J.; Lee, S.; Batra, D.; Parikh, D. Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition. arXiv 2018, arXiv:1810.00912.
43. Jain, U.; Zhang, Z.; Schwing, A. Creativity: Generating diverse questions using variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5415–5424.
44. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
45. Hasan, S.A.; Ling, Y.; Farri, O.; Liu, J.; Müller, H.; Lungren, M.P. Overview of ImageCLEF 2018 Medical Domain Visual Question Answering Task. In Working Notes of CLEF 2018, Proceedings of the Conference and Labs of the Evaluation Forum, Avignon, France, 10–14 September 2018; Cappellato, L., Ferro, N., Nie, J., Soulier, L., Eds.; CEUR-WS: Aachen, Germany, 2018; Volume 2125.
46. Ben Abacha, A.; Gayen, S.; Lau, J.J.; Rajaraman, S.; Demner-Fushman, D. NLM at ImageCLEF 2018 Visual Question Answering in the Medical Domain. In Working Notes of CLEF 2018, Proceedings of the Conference and Labs of the Evaluation Forum, Avignon, France, 10–14 September 2018; Cappellato, L., Ferro, N., Nie, J., Soulier, L., Eds.; CEUR-WS: Aachen, Germany, 2018; Volume 2125.
47. Ben Abacha, A.; Hasan, S.A.; Datla, V.V.; Liu, J.; Demner-Fushman, D.; Müller, H. VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019. In Working Notes of CLEF 2019, Proceedings of the Conference and Labs of the Evaluation Forum, Lugano, Switzerland, 9–12 September 2019; Cappellato, L., Ferro, N., Losada, D.E., Müller, H., Eds.; CEUR-WS: Aachen, Germany, 2019; Volume 2380.
48. Al-Sadi, A.; Al-Theiabat, H.; Al-Ayyoub, M. The Inception Team at VQA-Med 2020: Pretrained VGG with Data Augmentation for Medical VQA and VQG. In Working Notes of CLEF 2020, Proceedings of the Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, 22–25 September 2020; Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A., Eds.; CEUR-WS: Aachen, Germany, 2020; Volume 2696.
49. Kobayashi, S. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, LA, USA, 1–6 June 2018.
50. Şahin, G.G.; Steedman, M. Data Augmentation via Dependency Tree Morphing for Low-Resource Languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018.
51. Schmidhuber, J.; Hochreiter, S. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
52. Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2009.
53. Chen, X.; Fang, H.; Lin, T.Y.; Vedantam, R.; Gupta, S.; Dollár, P.; Zitnick, C.L. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv 2015, arXiv:1504.00325.
54. Koehn, P.; Monz, C. Manual and automatic evaluation of machine translation between European languages. In Proceedings of the Workshop on Statistical Machine Translation—StatMT’06, New York, NY, USA, 8–9 June 2006.
55. Du, X.; Cardie, C. Harvesting Paragraph-level Question-Answer Pairs from Wikipedia. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 1907–1917.
56. Hosking, T.; Riedel, S. Evaluating Rewards for Question Generation Models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019.
57. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
58. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2015, arXiv:1412.6980.
59. Aronson, A.R. Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. In Proceedings of the AMIA Symposium, Washington, DC, USA, 3–7 November 2001; p. 17.
60. Sarrouti, M. NLM at VQA-Med 2020: Visual Question Answering and Generation in the Medical Domain. In Working Notes of CLEF 2020, Proceedings of the Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, 22–25 September 2020; Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A., Eds.; CEUR-WS: Aachen, Germany, 2020; Volume 2696.
61. Viera, A.J.; Garrett, J.M. Understanding interobserver agreement: The kappa statistic. Fam. Med. 2005, 37, 360–363.
62. Hripcsak, G. Agreement, the F-Measure, and Reliability in Information Retrieval. J. Am. Med. Inform. Assoc. 2005, 12, 296–298.

Figure 1. The VQG pipeline: image is taken as input to generate the question.

Figure 2. Examples of goal-driven questions about a radiology image. The possible questions should be relevant to the given category, also known as question type. By specifying the type of the expected question, different questions such as (a–c) can be generated from the same radiology image.

Figure 3. Overview of VQGRaD, a visual question generation system for radiology images. Input captions are encoded by an LSTM, and images are encoded by a CNN. The t-space contains image and category features, whereas the z-space contains image, caption, and category features. Questions can be generated from either the caption latent space z or the category latent space t.

Figure 4. Contextual augmentation: the clinical question “Are the kidneys normal?” is augmented by replacing only selected words with similar words retrieved using the cosine similarity between pretrained word embeddings.
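The word-replacement idea in Figure 4 can be sketched as follows. This is an illustrative sketch only: the three-dimensional toy vectors, the vocabulary, the `replaceable` word set, and the function names are all hypothetical, whereas a real system would load pretrained word embeddings.

```python
import numpy as np

# Hypothetical toy embeddings; a real system would load pretrained vectors.
EMBEDDINGS = {
    "kidneys": np.array([0.9, 0.1, 0.0]),
    "lungs": np.array([0.8, 0.3, 0.1]),
    "normal": np.array([0.1, 0.9, 0.2]),
    "healthy": np.array([0.2, 0.8, 0.3]),
    "contrast": np.array([0.0, 0.1, 0.9]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def most_similar(word, top_k=1):
    """Return the top_k vocabulary words closest to `word`, excluding itself."""
    target = EMBEDDINGS[word]
    scored = [(cosine(target, vec), other)
              for other, vec in EMBEDDINGS.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_k]]


def augment(question, replaceable):
    """Create one new question per replaceable word found in the question."""
    tokens = question.rstrip("?").split()
    variants = []
    for i, token in enumerate(tokens):
        key = token.lower()
        if key in replaceable and key in EMBEDDINGS:
            for substitute in most_similar(key):
                new_tokens = tokens[:i] + [substitute] + tokens[i + 1:]
                variants.append(" ".join(new_tokens) + "?")
    return variants


variants = augment("Are the kidneys normal?", replaceable={"kidneys", "normal"})
# With these toy vectors: ["Are the lungs normal?", "Are the kidneys healthy?"]
```

Only the selected words are replaced, one at a time, so each variant stays close to the original question. Which words count as replaceable is a design choice that this sketch leaves to the caller.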

Figure 5. Examples of created images from the original image (a): (b) is the rotated image, (c) the blurred image, (d) the horizontally flipped image, (e) the vertically flipped image, (f) the shifted image, and (g) the noisy image.
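The six transformations named in Figure 5 can be sketched with NumPy as below. The rotation angle, blur kernel, shift distance, and noise level are assumptions chosen for illustration, since the caption does not specify them, and the function names are hypothetical. Keeping the original image alongside its six variants gives seven samples per image, which would be consistent with the image counts in Table 1 (for example, 239 originals and 1673 images after augmentation).

```python
import numpy as np


def box_blur(image):
    """3x3 mean filter with edge padding (a simple stand-in for blurring)."""
    padded = np.pad(image.astype(float), 1, mode="edge")
    height, width = image.shape
    total = np.zeros((height, width), dtype=float)
    for dy in range(3):
        for dx in range(3):
            total += padded[dy:dy + height, dx:dx + width]
    return total / 9.0


def augment_image(image, rng):
    """Return six transformed copies of a 2-D grayscale image."""
    # Additive Gaussian noise with standard deviation 10 (assumption).
    noisy = image.astype(float) + rng.normal(0.0, 10.0, size=image.shape)
    return {
        "rotated": np.rot90(image),                  # 90-degree rotation (assumption)
        "blurred": box_blur(image),
        "flipped_horizontally": np.fliplr(image),
        "flipped_vertically": np.flipud(image),
        "shifted": np.roll(image, shift=5, axis=1),  # 5-pixel wrap-around shift (assumption)
        "noisy": np.clip(noisy, 0, 255),
    }


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64))  # random placeholder for a radiology image
augmented = augment_image(image, rng)
samples_per_image = 1 + len(augmented)  # the original plus six variants
```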

Figure 6. Sample radiology images and the associated questions from the VQA-RAD dataset.

Figure 7. Example of radiology images and the associated questions from the VQG training set of ImageCLEF 2020 VQA-Med.

Table 1. The number of questions and images associated with each category. The values after “/” represent the number of questions and images created by our data augmentation techniques.

Category	#Questions	#Images
Abnormality	397/18,642	112/784
Modality	288/5534	54/378
Organ	73/16,408	135/945
Plane	163/9216	99/693
Other	348/19,798	81/567
Total	1269/69,598	239/1673

Table 2. Ablation study and comparison of VQGR (baseline) and VQGRaD systems.

Latent Space	Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr
-	VQGR	31.45	14.60	7.82	3.27	10.43	38.80	21.19
-	VQGR $_{w\_im\_aug}$	44.83	30.10	23.62	19.81	18.98	24.30	23.43
-	VQGR $_{w\_our\_aug}$	55.05	43.39	37.98	34.54	29.35	56.34	31.18
t-space	VQGRaD $_{cap\_or\_c}$	58.69	48.82	44.08	40.92	31.71	59.70	35.74
t-space	VQGRaD $_{w\_t}$	57.12	46.62	41.49	37.96	29.61	58.93	34.99
t-space	VQGRaD $_{w\_st}$	56.86	45.73	40.58	37.25	29.99	58.42	36.21
t-space	VQGRaD $_{w\_e}$	61.81	51.69	46.69	43.35	33.94	63.62	40.33
t-space	VQGRaD	61.86	51.65	46.70	43.40	33.88	63.75	41.13
z-space	VQGRaD $_{cap\_or\_c}$	59.31	49.31	44.73	41.75	32.54	60.90	36.38
z-space	VQGRaD $_{w\_t}$	58.49	48.26	43.87	41.21	31.79	58.81	36.03
z-space	VQGRaD $_{w\_st}$	60.74	50.06	45.00	41.90	32.81	61.36	36.89
z-space	VQGRaD $_{w\_e}$	60.52	50.61	46.31	43.68	33.07	61.29	37.86
z-space	VQGRaD	59.11	47.44	41.78	37.85	31.00	60.39	36.79

Table 3. Evaluation results on the validation set of the VQG dataset provided in the ImageCLEF 2020 VQA-Med challenge. VQGR was trained on the original training dataset. VQGR $_{w\_our\_aug}$ was trained on the augmented data obtained by our data augmentation technique.

Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr
VQGR	33.32	23.18	5.86	2.64	11.59	35.38	20.69
VQGR $_{w\_our\_aug}$	39.74	23.18	14.91	11.52	16.56	41.23	50.03

Table 4. Results of the manual evaluation of the best VQGRaD models and the VQG baseline system. “Relevancy”, “Fluency”, and “Grammaticality” are rated on a 1–3 scale (3 for the best). “Score” is the average of relevancy, fluency, and grammaticality scores. All numbers are normalized (divided by 60). The perfect score is 100.

Model	Relevancy	Grammaticality	Fluency	Score
VQGR	78.3	93.3	80.0	83.3
VQGRaD $_{w\_st}$ (z-space)	83.3	92.5	91.6	89.16
VQGRaD $_{w\_e}$ (z-space)	81.6	92.5	93.3	89.16
VQGRaD $_{w\_e}$ (t-space)	96.6	96.6	97.5	96.9
VQGRaD (t-space)	86.6	96.6	92.5	91.9

Table 5. Inter-rater reliability. We used F1-score to compute the inter-annotator agreement [62].

Model	Relevancy	Grammaticality	Fluency
VQG	0.42	0.27	0.51
VQGRaD $_{w\_st}$ (z-space)	0.32	0.64	0.40
VQGRaD $_{w\_e}$ (z-space)	0.40	0.64	0.48
VQGRaD $_{w\_e}$ (t-space)	0.32	0.72	0.33
VQGRaD (t-space)	0.35	0.72	0.43

Table 6. Example image along with the question category, the automatically generated questions, and the ground truth question. The generated questions by our VQGRaD model and the baseline system are shown in blue and red, respectively. We manually selected the baseline’s question from its outputs as the baseline system does not recognize the question category and generates a random question for each image.

Image	Category	Generated and Ground Truth Questions
	Abnormality	is a ring enhancing lesion present in the right lobe of the liver? is a ring enhancing lesion present in the right lobe of the liver? is the liver normal?
	Modality	was this mri taken with or without contrast? which ventricle is compressed by the t2-hyperintense? was this mri taken with or without contrast?
	Organ	is this a typical liver? are these normal laughed kidneys? Is this a study of the brain?
	Plane	what plane is this image obtained? what plane is this image blood-samples? Is this image of a saggital plane?

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

