Article

Multi-View Attention Network for Visual Dialog

1 Department of Computer Science and Engineering, Korea University, 145, Anam-ro, Seongbuk-gu, Seoul 02841, Korea
2 Wisenut Inc., 49, Daewangpangyo-ro 644beon-gil, Bundang-gu, Seongnam-si 13493, Gyeonggi-do, Korea
3 Electronics and Telecommunications Research Institute (ETRI), 161, Gajeong-dong, Yuseong-gu, Daejeon 34129, Korea
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Work performed while at Korea University.
Appl. Sci. 2021, 11(7), 3009; https://doi.org/10.3390/app11073009
Submission received: 19 February 2021 / Revised: 24 March 2021 / Accepted: 25 March 2021 / Published: 27 March 2021
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Visual dialog is a challenging vision–language task in which a series of questions grounded in a given image must be answered. To resolve the visual dialog task, a high-level understanding of various multimodal inputs (e.g., question, dialog history, and image) is required. Specifically, an agent needs to (1) determine the semantic intent of the question and (2) align question-relevant textual and visual content among the heterogeneous modality inputs. In this paper, we propose the Multi-View Attention Network (MVAN), which leverages multiple views of the heterogeneous inputs based on attention mechanisms. MVAN effectively captures question-relevant information from the dialog history with two complementary modules (i.e., Topic Aggregation and Context Matching) and builds multimodal representations through sequential alignment processes (i.e., Modality Alignment). Experimental results on the VisDial v1.0 dataset show the effectiveness of our proposed model, which outperforms previous state-of-the-art methods under both single-model and ensemble settings.

1. Introduction

As a part of the interdisciplinary research that combines natural language processing with computer vision, a wide variety of vision–language tasks (e.g., Visual Question Answering (VQA), image captioning, referring expressions, etc.) have been introduced in recent years. Considerable efforts in this field have advanced the capabilities of artificial intelligence agents a step further, but the agent’s comprehension of multimodal information is still far from human-level reasoning and cognitive ability [1,2].
The visual dialog task is similar to VQA in that it requires the agent to answer a question that is guided by an image, but this task differs in that the agent needs to answer a series of questions focusing on previous dialog as well as a given image. It is more challenging than other vision–language tasks because the agent is asked to selectively ground the visual contents related to the question topics, which change as the dialog proceeds. For example, Figure 1 illustrates how the question topics change during each dialog turn (e.g., “household goods”, “people”, and “food”). Furthermore, answering some questions that contain ambiguous expressions (e.g., question 5 (Q5): “how old do they look?”) can be more difficult because the agent should consider which entity “they” refers to (i.e., “kids” and “boys” in caption and Q4–A4 pair) and then ground the referent in the image. To address these issues, the dialog agent should be capable of determining the semantic intent of the question, clarifying referential ambiguities, and then leveraging grounded visual contents and identified semantic information.
Several recent studies have addressed the visual dialog task from the perspective of visual co-reference resolution [3,4,5,6]. However, resolving visual co-reference does not always lead to a complete understanding of the semantic intent and topic of the question. For example, when answering Q6 in Figure 1, the agent is required not only to resolve the visual co-reference but also to explicitly understand the semantic intent of the question, which asks whether other “snacks” exist (i.e., the model should focus on “snacks” rather than “they”). These observations suggest that capturing the topic of the question is crucial to accurately determining its semantic intent.
To this end, this paper proposes the Multi-View Attention Network (MVAN), which leverages question-guided contextual information and clues from the dialog history and then effectively learns semantic alignments between visual and textual representations through sequential alignment processes. MVAN consists of three main modules. First, the Context Matching module represents the contextual information of the dialog history that is relevant to the question at the sentence level. This reflects the fact that, in general, the semantic intent of a sentence tends to be determined by the context of the entire sequence as well as by particular words that are directly connected to the topic. Second, the Topic Aggregation module captures topic-guided clues from the dialog history at the word level, taking advantage of the fact that a topic-related representation is well constructed by directly using the original embeddings of each word. Both modules adaptively propagate textual information that interacts with the semantic intent of the current question via attention mechanisms and gate functions. Lastly, the Modality Alignment module performs two sequential alignment processes to learn semantic alignments between the textual outputs of the previous modules and the visual features. Since the alignment between contextual representations and visual contents can be implicit and noisy, the Modality Alignment module first learns a soft mapping between the heterogeneous inputs based on the topic-guided clues and then sequentially realigns them with the surrounding contextual information.
The main contributions of this paper are as follows. (1) We propose MVAN, which builds two complementary views of the dialog and combines them with the visual contents through multiple alignment steps. (2) Experimental results on VisDial v1.0 show that our proposed model outperforms previous state-of-the-art methods. (3) Extensive experiments, including multi-task learning and fine-tuning on dense annotations, demonstrate the superiority of our model. (4) We show that MVAN explicitly understands the semantic intent of the question through visualization of the reasoning process of each module, which leads to reasonable interpretations of how the heterogeneous multimodal inputs are employed.

2. Related Work

Visual Dialog

Visual dialog is a task proposed by Das et al. [7] that requires the dialog agent to answer the current question by exploiting both the image and the dialog history. Das et al. [7] also introduced encoder–decoder models such as late fusion, a hierarchical recurrent network, and a memory network as baseline methods. Most previous approaches are based predominantly on attention mechanisms to fuse representations of the given multimodal inputs [8,9,10,11]. Another line of research, inspired by graph networks, focuses on learning the inherent relations among the image, dialog history, and question [12,13,14].
On the other hand, several approaches explicitly resolve ambiguous references through visual co-reference resolution. Kottur et al. [4] used neural module networks [15] to effectively link references and ground relevant visual contents at the word level. Kang et al. [5] adapted a self-attention mechanism [16] based on sentence-level representations to resolve referential ambiguity. Niu et al. [6] proposed a recursive attention mechanism to capture question-relevant dialog history and ground related visual contents in the image. Most existing works [4,5] that use only word or sentence representations have limitations in identifying the semantic intent of the question. In contrast, our proposed model considers both topic-related clues and contextual information to effectively capture the semantic intent of the question. In addition, MVAN adaptively integrates the dialog history and visual contents by performing sequential alignment steps rather than exploiting only the dialog history and visual contents that meet specific recursion conditions [6]. More recently, Murahari et al. [17] and Wang et al. [18] introduced fine-tuning methods based on pre-trained models, as pre-trained language model architectures (e.g., BERT [19]) have also been shown to perform effectively on vision–language tasks [20]. In addition, Qi et al. [21] proposed causal intervention algorithms that are applicable to other visual dialog models, and Agarwal et al. [22] proposed curriculum fine-tuning inspired by the work of Bengio et al. [23].
The curriculum fine-tuning approach is adopted in several existing works [17,18,22,24] to boost performance on a specific metric, Normalized Discounted Cumulative Gain (NDCG), but it significantly deteriorates the remaining metrics, such as Mean Reciprocal Rank (MRR), recall, and mean rank. Many studies have sought to improve performance only on NDCG by fine-tuning the model on dense annotations, since previous visual dialog challenges (https://visualdialog.org/challenge/2018 (accessed on 26 March 2021)) selected the challenge winners by NDCG ranking. However, the most recent challenge (https://visualdialog.org/challenge/2020 (accessed on 26 March 2021)) picked winners based on both the NDCG and MRR evaluation metrics, so in this work we focus on building a model that improves performance with respect to all evaluation metrics.

3. Model

In the visual dialog task [7], a dialog agent is given a set of multimodal inputs for each dialog turn $t$. This input set consists of an image $I$, a current question $Q_t$, the dialog history set $H_t = \{C, (Q_1, A_1^{gt}), \ldots, (Q_{t-1}, A_{t-1}^{gt})\}$, which contains the image caption $C$ and $t-1$ consecutive question–answer pairs, and a set of answer candidates $A_t = \{A_t^1, A_t^2, \ldots, A_t^{100}\}$. The agent is then required to answer the question by either discriminating or generating a correct answer.

3.1. Multimodal Representation

3.1.1. Visual Features

We employ a bottom-up attention mechanism [25] to represent the objects appearing in an image. Visual features of object regions $\{v_k\}_{k=1}^{n_v} \in \mathbb{R}^{d_v \times n_v}$, where $n_v$ is the number of detected objects (ranging from 10 to 100), are adaptively extracted from a Faster R-CNN [26] that is pre-trained on Visual Genome [27].
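As an illustration of how such pre-extracted object features can be handled, the following sketch pads a variable number of per-image region features (10 to 100 objects) into a fixed-size batch tensor with a validity mask. The feature dimensionality of 2048 and the function name are assumptions of this sketch, not details taken from the released code.

```python
import torch

def pad_object_features(features_list, d_v=2048, max_objects=100):
    """Pad per-image bottom-up features (10-100 objects each, d_v dims each)
    into a single batch tensor plus a boolean mask over real objects.
    A minimal sketch; d_v=2048 is an assumed feature size."""
    batch = torch.zeros(len(features_list), max_objects, d_v)
    mask = torch.zeros(len(features_list), max_objects, dtype=torch.bool)
    for b, feats in enumerate(features_list):
        n = feats.shape[0]                                  # number of detected objects (10-100)
        batch[b, :n] = torch.as_tensor(feats, dtype=torch.float32)
        mask[b, :n] = True                                  # mark valid object slots
    return batch, mask
```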

3.1.2. Language Features

We first embed three different text inputs: the current question, the dialog history, and the answer candidates. The word embedding layer is initialized with pre-trained GloVe embeddings [28]. We then feed the word embeddings into a Bi-Directional Long Short-Term Memory (BiLSTM) network to encode a sequential representation of each input. Specifically, each word in question $Q$ is embedded as $\{w_i^q\}_{i=1}^{n_q} \in \mathbb{R}^{d_w \times n_q}$, where $n_q$ is the number of words in the question. Each word embedding is fed into the BiLSTM layer as follows:
$$\overrightarrow{u}_i^q = \overrightarrow{\mathrm{LSTM}}(w_i^q, \overrightarrow{u}_{i-1}^q)$$
$$\overleftarrow{u}_i^q = \overleftarrow{\mathrm{LSTM}}(w_i^q, \overleftarrow{u}_{i+1}^q).$$
The sequential representation of each token is constructed by concatenating the hidden states of the forward and backward LSTMs, denoted as $u_i^q = [\overrightarrow{u}_i^q, \overleftarrow{u}_i^q]$. Meanwhile, the sequential representation of each dialog history round, $u_r^h = \{u_{r,j}^h\}_{j=1}^{n_{h_r}}$ $(0 \le r \le t-1)$, is constructed by the same procedure with different BiLSTM layers. The maximum sequence lengths of the question and dialog history are set to $n_q = 20$ and $n_{h_r} = 40$, respectively. For the answer candidates, we use a separate uni-directional LSTM because their sequences are shorter than those of the questions.
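The following sketch illustrates the language encoder described above: GloVe-initialized word embeddings followed by a BiLSTM, returning the word embeddings, the per-token bidirectional states, and the sentence representation built from the last forward and backward hidden states. The class and argument names are illustrative; the hidden size of 512 follows Section 4.1.3.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Minimal sketch of the language encoder: GloVe-initialized word embeddings
    fed to a single-layer BiLSTM. Illustrative re-implementation, not the
    authors' released code."""
    def __init__(self, glove_weights, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.bilstm = nn.LSTM(glove_weights.size(1), hidden_size,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        w = self.embed(token_ids)              # word-level embeddings w_i
        u, (h_n, _) = self.bilstm(w)           # u: per-token [forward; backward] states
        # sentence-level representation s = [last forward state, last backward state]
        s = torch.cat([h_n[0], h_n[1]], dim=-1)
        return w, u, s
```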

3.2. Multi-View Attention Network

We propose MVAN, which considers the semantic intent and topic of the question simultaneously and effectively aligns the textual and visual information through multiple alignment processes. Figure 2 describes the MVAN architecture, which consists of three components: (1) Context Matching, (2) Topic Aggregation, and (3) Modality Alignment.

3.2.1. Context Matching Module

Generally, the semantic intent of a sentence not only relies on particular words that implicitly point to the topic of the sentence but also tends to be determined by the context of the entire sequence. Therefore, we build the Context Matching module, which adaptively integrates the question and its relevant history at the sentence level. The contextual representation is constructed by concatenating the last hidden states of the forward and backward LSTMs for the question and dialog history, denoted as $s^q = [\overleftarrow{u}_1^q, \overrightarrow{u}_{n_q}^q]$ and $s_r^h = [\overleftarrow{u}_{r,1}^h, \overrightarrow{u}_{r,n_{h_r}}^h]$, respectively. We then apply an attention mechanism to focus on question-relevant history. The Context Matching module takes the contextual representation of the question $s^q \in \mathbb{R}^{d_s \times 1}$ and of the dialog history $s^h = \{s_0^h, s_1^h, \ldots, s_{t-1}^h\} \in \mathbb{R}^{d_s \times t}$ and outputs question-relevant history features as follows:
$$z_r^S = W(f_q^S(s^q) \circ f_h^S(s_r^h)) + b$$
$$a_r^S = \mathrm{softmax}(z_r^S)$$
$$\tilde{s}^h = \sum_{r=0}^{t-1} a_r^S\, s_r^h,$$
where $\circ$ denotes element-wise multiplication, $W \in \mathbb{R}^{d_f \times 1}$ is a projection matrix, and $f_q^S(\cdot)$ and $f_h^S(\cdot)$ denote non-linear transformation functions that convert $d_s$ to $d_f$ dimensions. We apply a gate function to automatically filter out dialog history that is irrelevant to the current question as follows:
$$\mathrm{gate}^C = \sigma(W_{\mathrm{gate}}^C [s^q, \tilde{s}^h] + b_{\mathrm{gate}}^C)$$
$$e^C = \mathrm{gate}^C \circ [s^q, \tilde{s}^h],$$
where $\sigma(\cdot)$ is the sigmoid function and $W_{\mathrm{gate}}^C \in \mathbb{R}^{2d_s \times 2d_s}$ and $b_{\mathrm{gate}}^C \in \mathbb{R}^{2d_s \times 1}$ are trainable parameters. Note that $e^C \in \mathbb{R}^{2d_s \times 1}$ is the context-matching representation, which selectively combines the contextual information of the question and the question-relevant dialog history.
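A minimal PyTorch sketch of the Context Matching module follows, assuming batched sentence vectors for the question and the dialog history rounds. It mirrors the equations above (element-wise fusion, softmax over rounds, sigmoid gate); the layer choices inside the non-linear transformations are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextMatching(nn.Module):
    """Sketch of the Context Matching module: attention of the question sentence
    vector over per-round history sentence vectors, followed by a sigmoid gate.
    Dimension names mirror the text (d_s, d_f); illustrative re-implementation."""
    def __init__(self, d_s, d_f):
        super().__init__()
        self.f_q = nn.Sequential(nn.Linear(d_s, d_f), nn.ReLU())
        self.f_h = nn.Sequential(nn.Linear(d_s, d_f), nn.ReLU())
        self.proj = nn.Linear(d_f, 1)
        self.gate = nn.Linear(2 * d_s, 2 * d_s)

    def forward(self, s_q, s_h):
        # s_q: (B, d_s) question context; s_h: (B, T, d_s) per-round history context
        z = self.proj(self.f_q(s_q).unsqueeze(1) * self.f_h(s_h)).squeeze(-1)  # (B, T)
        a = F.softmax(z, dim=-1)                                   # attention over rounds
        s_h_tilde = torch.bmm(a.unsqueeze(1), s_h).squeeze(1)      # attended history (B, d_s)
        fused = torch.cat([s_q, s_h_tilde], dim=-1)                # [s^q, s~^h]
        e_c = torch.sigmoid(self.gate(fused)) * fused              # gated context-matching repr.
        return e_c, a                                              # a (= a^S) is reused by Topic Aggregation
```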

3.2.2. Topic Aggregation Module

The topic of the question is generally expressed in a single word or phrase and is likely to be connected with clues (i.e., the topics of previous questions in the dialog history). We design the Topic Aggregation module to combine the clues associated with the question topic by exploiting the initial word embeddings (i.e., GloVe) to represent their original meaning. Specifically, this module leverages the word-level sequential representations of the question and dialog history, $\{u_i^q\}_{i=1}^{n_q} \in \mathbb{R}^{d_u \times n_q}$ and $\{u_{r,j}^h\}_{j=1}^{n_{h_r}} \in \mathbb{R}^{d_u \times n_{h_r}}$, respectively. A dot-product attention mechanism is employed to selectively focus on the words in the dialog history that are relevant to the question topic as follows:
$$z_{r,ij}^W = f_q^W(u_i^q)^\top f_h^W(u_{r,j}^h)$$
$$a_{r,ij}^W = \exp(z_{r,ij}^W) \Big/ \sum_{j=1}^{n_{h_r}} \exp(z_{r,ij}^W)$$
$$\tilde{w}_{r,i}^h = \sum_{j=1}^{n_{h_r}} a_{r,ij}^W\, w_{r,j}^h,$$
where $f_q^W(\cdot)$ and $f_h^W(\cdot)$ are non-linear transformation functions that convert $d_u$ to $d_f$ dimensions. The question-guided history feature for each round, $\tilde{w}_{r,i}^h$, is computed as a weighted sum of the word embeddings, which represent the original meanings of the words. The attended representation $\{\tilde{w}_i^h\}_{i=1}^{n_q} \in \mathbb{R}^{d_w \times n_q}$ is computed by aggregating over all history rounds $\{\tilde{w}_{r,i}^h\}_{r=0}^{t-1}$, weighted by the attention scores $a_r^S$ of the Context Matching module, as follows:
$$\tilde{w}_i^h = \sum_{r=0}^{t-1} a_r^S\, \tilde{w}_{r,i}^h.$$
Similar to the Context Matching module, a gate operation adaptively filters out irrelevant topic-guided clues at the word level:
$$\mathrm{gate}_i^T = \sigma(W_{\mathrm{gate}}^T [w_i^q, \tilde{w}_i^h] + b_{\mathrm{gate}}^T)$$
$$e_i^T = \mathrm{gate}_i^T \circ [w_i^q, \tilde{w}_i^h],$$
where $W_{\mathrm{gate}}^T \in \mathbb{R}^{2d_w \times 2d_w}$ and $b_{\mathrm{gate}}^T \in \mathbb{R}^{2d_w \times 1}$ are trainable parameters. Note that $\{e_i^T\}_{i=1}^{n_q} \in \mathbb{R}^{2d_w \times n_q}$ are topic-aggregation representations that adaptively encode the question topic and the various history clues associated with it.
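The sketch below illustrates the Topic Aggregation module under the same assumptions: word-level dot-product attention from question tokens to the history tokens of every round, aggregation over rounds with the Context Matching scores $a_r^S$, and a word-level gate over the concatenated GloVe embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicAggregation(nn.Module):
    """Sketch of the Topic Aggregation module; shapes follow the notation above.
    Illustrative re-implementation."""
    def __init__(self, d_u, d_w, d_f):
        super().__init__()
        self.f_q = nn.Sequential(nn.Linear(d_u, d_f), nn.ReLU())
        self.f_h = nn.Sequential(nn.Linear(d_u, d_f), nn.ReLU())
        self.gate = nn.Linear(2 * d_w, 2 * d_w)

    def forward(self, u_q, u_h, w_q, w_h, a_s):
        # u_q: (B, Nq, d_u)    BiLSTM states of question tokens
        # u_h: (B, T, Nh, d_u) BiLSTM states of history tokens per round
        # w_q: (B, Nq, d_w)    GloVe embeddings of question tokens
        # w_h: (B, T, Nh, d_w) GloVe embeddings of history tokens
        # a_s: (B, T)          round-level scores from Context Matching
        q = self.f_q(u_q).unsqueeze(1)                             # (B, 1, Nq, d_f)
        h = self.f_h(u_h)                                          # (B, T, Nh, d_f)
        z = torch.matmul(q, h.transpose(-1, -2))                   # (B, T, Nq, Nh)
        a_w = F.softmax(z, dim=-1)                                 # attention over history words
        w_tilde_r = torch.matmul(a_w, w_h)                         # per-round clue (B, T, Nq, d_w)
        w_tilde = (a_s[:, :, None, None] * w_tilde_r).sum(dim=1)   # aggregate rounds: (B, Nq, d_w)
        fused = torch.cat([w_q, w_tilde], dim=-1)                  # [w^q, w~^h]: (B, Nq, 2*d_w)
        e_t = torch.sigmoid(self.gate(fused)) * fused              # topic-aggregation representation
        return e_t
```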

3.2.3. Modality Alignment Module

Given the output representations of the Context Matching module $e^C \in \mathbb{R}^{2d_s \times 1}$ and the Topic Aggregation module $\{e_i^T\}_{i=1}^{n_q} \in \mathbb{R}^{2d_w \times n_q}$, the Modality Alignment module aligns them with the visual features $\{v_k\}_{k=1}^{n_v} \in \mathbb{R}^{d_v \times n_v}$ via a two-step alignment. First, topic-view alignment performs a soft alignment between the heterogeneous modalities at the topic level before mapping the high-level contextual representation onto the visual features. We utilize dot-product attention to represent the visually relevant topic-aggregation embeddings as follows:
$$z_{ik}^T = f_W(e_i^T)^\top f_v^T(v_k)$$
$$a_{ik}^T = \exp(z_{ik}^T) \Big/ \sum_{i=1}^{n_q} \exp(z_{ik}^T)$$
$$\tilde{e}_k^T = \sum_{i=1}^{n_q} a_{ik}^T\, e_i^T,$$
where $f_W(\cdot)$ and $f_v^T(\cdot)$ are non-linear transformation functions that embed the two modality representations into the same $d_f$ dimensions. We then obtain the fused feature vectors by concatenating the visual and attended topic-aggregation features and applying a Multi-Layer Perceptron (MLP):
$$m_k^T = \mathrm{MLP}([v_k, \tilde{e}_k^T]).$$
Note that $\{m_k^T\}_{k=1}^{n_v} \in \mathbb{R}^{d_v \times n_v}$ is the topic-view aligned representation, which aligns the visual contents with the salient question topics. This is passed to the context-view alignment step as follows:
$$\hat{m}_k^C = [m_k^T, e^C]$$
$$z_k^C = W(\mathrm{L2Norm}(f_m^C(\hat{m}_k^C) \circ f_C(e^C))) + b$$
$$a_k^C = \mathrm{softmax}(z_k^C)$$
$$\tilde{m}^C = \sum_{k=1}^{n_v} a_k^C\, \hat{m}_k^C,$$
where $f_m^C(\cdot)$ and $f_C(\cdot)$ are non-linear transformation functions that convert $\hat{m}_k^C$ and $e^C$ into the same $d_f$ dimensions, and $W \in \mathbb{R}^{d_f \times 1}$ is a projection matrix. Note that $\tilde{m}^C \in \mathbb{R}^{(d_v + 2d_s) \times 1}$ is the context-view aligned representation, which is realigned using the context-matching representation. These multiple alignment processes allow the model to understand the semantic intent of the question from complementary views and to effectively align the corresponding heterogeneous multimodal inputs. Finally, this enhanced feature is fed into a single-layer Feed-Forward Neural Network (FFNN) with a ReLU activation function:
$$m^{enc} = \max(0, W\tilde{m}^C + b),$$
where $W \in \mathbb{R}^{(d_v + 2d_s) \times d_{enc}}$ and $b \in \mathbb{R}^{d_{enc} \times 1}$ are trainable parameters. Note that $m^{enc} \in \mathbb{R}^{d_{enc} \times 1}$ is the multi-view aligned representation, which is fed into either a discriminative or a generative decoder.
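The following sketch shows the two-step Modality Alignment described above: topic-view alignment between the word-level topic representations and the object features, followed by context-view alignment against the context-matching vector and a final ReLU projection. The exact MLP and normalization configurations are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAlignment(nn.Module):
    """Sketch of the two-step Modality Alignment module. d_t corresponds to 2*d_w
    (topic features), d_s2 to 2*d_s (context features). Illustrative only."""
    def __init__(self, d_t, d_v, d_s2, d_f, d_enc):
        super().__init__()
        self.f_t = nn.Sequential(nn.Linear(d_t, d_f), nn.ReLU())
        self.f_v = nn.Sequential(nn.Linear(d_v, d_f), nn.ReLU())
        self.mlp = nn.Sequential(nn.Linear(d_v + d_t, d_v), nn.ReLU())
        self.f_m = nn.Sequential(nn.Linear(d_v + d_s2, d_f), nn.ReLU())
        self.f_c = nn.Sequential(nn.Linear(d_s2, d_f), nn.ReLU())
        self.proj = nn.Linear(d_f, 1)
        self.out = nn.Linear(d_v + d_s2, d_enc)

    def forward(self, e_t, v, e_c):
        # e_t: (B, Nq, d_t) topic-aggregation repr.; v: (B, Nv, d_v) objects; e_c: (B, d_s2)
        z_t = torch.matmul(self.f_v(v), self.f_t(e_t).transpose(1, 2))  # (B, Nv, Nq)
        a_t = F.softmax(z_t, dim=-1)                                    # attend question words per object
        e_t_aligned = torch.matmul(a_t, e_t)                            # (B, Nv, d_t)
        m_t = self.mlp(torch.cat([v, e_t_aligned], dim=-1))             # topic-view aligned (B, Nv, d_v)
        m_hat = torch.cat([m_t, e_c.unsqueeze(1).expand(-1, m_t.size(1), -1)], dim=-1)
        z_c = self.proj(F.normalize(self.f_m(m_hat) * self.f_c(e_c).unsqueeze(1), dim=-1)).squeeze(-1)
        a_c = F.softmax(z_c, dim=-1)                                    # attention over objects
        m_c = torch.bmm(a_c.unsqueeze(1), m_hat).squeeze(1)             # context-view aligned vector
        return torch.relu(self.out(m_c))                                # multi-view aligned m_enc
```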

3.3. Answer Decoder

While existing studies mainly utilize only a discriminative decoder to rank the answer candidates, we adopt not only a discriminative decoder but also a generative decoder. The discriminative decoder ranks all candidate answers by calculating a similarity score (dot product) between the output representation and the sentence representation of each candidate. In contrast, the generative decoder takes only the ground truth answer and predicts the next token at each time step (i.e., it does not use negative responses during training). The discriminative decoder is described in Section 3.3.1 and the generative decoder in Section 3.3.2.

3.3.1. Discriminative Decoder

We use the last hidden states of the forward LSTM to encode the sentence representations of the answer candidates, denoted as $s^a = \{u_{i,n_a}^a\}_{i=1}^{100} \in \mathbb{R}^{d_a \times 100}$. We rank the candidates according to the dot products between $s^a$ and the multi-view aligned representation $m^{enc}$ (i.e., the output representation of the encoder), then apply the softmax function to obtain a probability distribution over the candidates, denoted as $p = \mathrm{softmax}((s^a)^\top m^{enc})$. Note that the dimension of each answer candidate representation is the same as that of the encoder output. We use the cross-entropy loss as the discriminative objective function, formulated as:
$$\mathcal{L}_{Disc} = -\sum_{i=1}^{100} y_i \log p_i,$$
where $y_i$ is the $i$-th element of the one-hot encoded vector of the ground truth answer.
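A minimal sketch of the discriminative decoder follows: dot-product scores between the encoder output and the 100 candidate sentence vectors, a softmax, and cross-entropy against the ground-truth index. Batched shapes are assumed for illustration.

```python
import torch
import torch.nn.functional as F

def discriminative_scores_and_loss(m_enc, s_a, gt_index):
    """Sketch of the discriminative decoder.
    m_enc: (B, d_enc) encoder outputs; s_a: (B, 100, d_enc) candidate sentence
    vectors; gt_index: (B,) index of the ground-truth answer."""
    scores = torch.bmm(s_a, m_enc.unsqueeze(-1)).squeeze(-1)   # (B, 100) dot-product scores
    log_p = F.log_softmax(scores, dim=-1)
    loss = F.nll_loss(log_p, gt_index)                         # = -sum_i y_i log p_i (batch mean)
    return scores, loss
```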

3.3.2. Generative Decoder

Unlike most previous approaches, which take only a discriminative approach, we also train our model in a generative manner [7]. During the training phase, we use a two-layer LSTM to predict the next token given the previous tokens of the ground truth answer. Specifically, the hidden state $u_{gt,k}^a \in \mathbb{R}^{d_a}$ at the $k$-th time step is computed from the current word embedding $w_k^a$ and the previous hidden state $u_{gt,k-1}^a$ as follows:
$$u_{gt,k}^a = \mathrm{LSTM}(w_k^a, u_{gt,k-1}^a).$$
The initial hidden state of the LSTM, $u_{gt,0}^a$, is initialized with $m^{enc}$ (i.e., the output representation of the encoder). When generating each token, we compute the likelihood of each ground truth token $\{p_k\}_{k=1}^{n_a}$ as follows:
$$p_k = \mathrm{softmax}_j(W_k u_{gt,k}^a + b_k),$$
where $W_k \in \mathbb{R}^{d_a \times |V|}$ is a projection matrix and $|V|$ is the size of the vocabulary. Note that $j$ is the index of the $k$-th word in the vocabulary and $p_k \in \mathbb{R}^{|V|}$ is the probability score of the $k$-th ground truth token. In the training phase, the model is optimized by minimizing the sum of negative log-likelihoods as follows:
$$\mathcal{L}_{Gen} = -\sum_{k=1}^{n_a} \log p_k.$$
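The sketch below illustrates the generative decoder under teacher forcing: a two-layer LSTM whose initial hidden state is derived from $m^{enc}$ predicts each ground-truth token, and the loss is the summed negative log-likelihood. How $m^{enc}$ is mapped to the LSTM's initial states and the padding index are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerativeDecoder(nn.Module):
    """Sketch of the generative decoder (teacher forcing). Padding index 0 and
    the tanh mapping of m_enc to the initial states are assumptions."""
    def __init__(self, vocab_size, d_a, d_enc, embed):
        super().__init__()
        self.embed = embed                                    # shared word embedding layer
        self.init_h = nn.Linear(d_enc, 2 * d_a)               # map m_enc to both layers' h_0
        self.lstm = nn.LSTM(embed.embedding_dim, d_a, num_layers=2, batch_first=True)
        self.vocab_proj = nn.Linear(d_a, vocab_size)

    def forward(self, m_enc, answer_in, answer_out):
        # answer_in: <s> a_1 ... a_{n-1}; answer_out: a_1 ... a_n (shifted targets)
        h0 = torch.tanh(self.init_h(m_enc)).view(-1, 2, self.lstm.hidden_size)
        h0 = h0.transpose(0, 1).contiguous()                  # (num_layers, B, d_a)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(self.embed(answer_in), (h0, c0))   # (B, n_a, d_a)
        logits = self.vocab_proj(out)                         # (B, n_a, |V|)
        # negative log-likelihood of the ground-truth tokens, ignoring padding
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               answer_out.reshape(-1), ignore_index=0)
```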

3.3.3. Multi-Task Learning

We perform multi-task learning by combining the discriminative and generative decoders, with the joint objective denoted as $\mathcal{L}_{Multi} = \mathcal{L}_{Disc} + \mathcal{L}_{Gen}$. For evaluation, we simply average the probability distributions of the two decoders. Multi-task learning substantially improves performance with respect to the NDCG metric.
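A minimal sketch of the multi-task combination, assuming the per-decoder losses and candidate probability distributions have already been computed as above:

```python
def multitask_loss_and_eval(loss_disc, loss_gen, p_disc, p_gen):
    """Sketch of Section 3.3.3: sum the two decoder losses for training and
    average their candidate probability distributions at evaluation time."""
    loss_multi = loss_disc + loss_gen            # L_Multi = L_Disc + L_Gen
    p_eval = 0.5 * (p_disc + p_gen)              # averaged (B, 100) distributions
    return loss_multi, p_eval
```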

3.3.4. Fine-Tuning on Dense Annotations

In the discriminative setting, MVAN mainly focuses on finding the one-hot ground truth answer. However, many candidates may be semantically relevant to the ground truth, so they should also be ranked highly. For example, more than one valid answer to Q6 in Figure 1 exists in the candidate pool (“no”, “no,”, “nope”, “i don’t think so”, and “not that i can see”). To evaluate how well the model generalizes, the visual dialog challenge (https://visualdialog.org/challenge/2018 (accessed on 26 March 2021)) organizers released relevance scores (i.e., dense annotations) for each candidate answer, denoted as $\{d_i\}_{i=1}^{100}$, where $d_i \in [0, 1]$. Once MVAN is trained with the discriminative decoder, we fine-tune the model on the dense annotations (we utilize both the training and validation sets for fine-tuning, since only 2k cases are provided for each split) to rank semantically relevant answers higher. Most existing works [17,18,22,24] have shown that fine-tuning on dense annotations significantly boosts performance in NDCG but simultaneously deteriorates performance in non-NDCG metrics. To alleviate this trade-off between NDCG and non-NDCG metrics during fine-tuning, we design a joint loss function that combines the relevance loss and the discriminative loss as follows:
$$\mathcal{L}_{Rel} = -\sum_{i=1}^{100} d_i \log p_i,$$
$$\mathcal{L}_{Dense} = \alpha \mathcal{L}_{Disc} + \mathcal{L}_{Rel}.$$
To examine the effect of the discriminative loss, we conduct experiments in which $\alpha$ is increased from 0 to 2 in increments of 0.2 (see Figure 3). We set the final value of $\alpha$ to 1 to obtain the most balanced performance.
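A minimal sketch of the joint fine-tuning objective, assuming the candidate scores, dense relevance scores, and ground-truth index are given in batched form:

```python
import torch
import torch.nn.functional as F

def dense_finetune_loss(scores, relevance, gt_index, alpha=1.0):
    """Sketch of L_Dense = alpha * L_Disc + L_Rel, where L_Rel is a soft
    cross-entropy against the dense relevance scores d_i in [0, 1].
    alpha = 1 is the value chosen in the text."""
    log_p = F.log_softmax(scores, dim=-1)                 # (B, 100)
    loss_rel = -(relevance * log_p).sum(dim=-1).mean()    # L_Rel = -sum_i d_i log p_i
    loss_disc = F.nll_loss(log_p, gt_index)               # L_Disc = -sum_i y_i log p_i
    return alpha * loss_disc + loss_rel
```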

4. Experiments

4.1. Experimental Setup

4.1.1. Datasets

We used the VisDial v1.0 dataset (https://visualdialog.org/data (accessed on 26 March 2021)) [7] to evaluate our proposed model. The VisDial v1.0 training split consists of 123k MS-COCO [29] images, and an additional 10k images from the Flickr dataset were used to construct the validation and test splits, which contain 2k and 8k images, respectively. Each image has 10 consecutive question–answer pairs. Each question is paired with a list of 100 candidate answers, one of which is the ground truth. Additionally, VisDial v1.0 provides dense annotations for the 100 answer candidates of each question on subsets of the training and validation splits.

4.1.2. Evaluation Metrics

Following the work of Das et al. [7], we evaluated our proposed model using several retrieval metrics: (1) mean rank of the ground truth response (Mean), (2) recall at k (R@k, k ∈ {1, 5, 10}), which evaluates where the ground truth is positioned in the sorted candidate list, and (3) MRR [30]. NDCG was also introduced as a primary metric for the VisDial v1.0 dataset; it decreases when the model gives a low ranking to candidate answers with high relevance scores. MRR evaluates the precision of the model based on where the ground truth answer is ranked, whereas NDCG evaluates the relative relevance of the predicted answers. In this work, we compare our model with the baselines mainly on both the NDCG and MRR metrics, following the most recent visual dialog challenge (the 2018–2019 VisDial challenges selected winners based solely on NDCG, whereas the 2020 challenge picked winners based on the average rank of NDCG and MRR). Specifically, we rank all the baselines, including ours, on each metric individually (i.e., NDCG and MRR) and then average the ranks across the two metrics (denoted as AVG in the following results).
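As an illustration, the AVG measure can be computed as in the sketch below: rank all models on each metric separately (higher is better) and average the two ranks per model. Tie handling in the reported AVG column may differ; the function and variable names are illustrative.

```python
def average_rank(results):
    """Sketch of the AVG measure. `results` maps model name -> (ndcg, mrr)."""
    def ranks(metric_idx):
        ordered = sorted(results, key=lambda m: results[m][metric_idx], reverse=True)
        return {m: r + 1 for r, m in enumerate(ordered)}        # 1 = best
    ndcg_rank, mrr_rank = ranks(0), ranks(1)
    return {m: (ndcg_rank[m] + mrr_rank[m]) / 2 for m in results}
```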

4.1.3. Training Details

Our model is implemented in the PyTorch framework [31] based on the open source starter code (https://github.com/batra-mlp-lab/visdial-challenge-starter-pytorch (accessed on 26 March 2021)) from the work of Das et al. [7]. The question and dialog history are represented using separate BiLSTMs with 512 hidden units. The maximum sequence lengths of the question and dialog history are set to 20 and 40, respectively. We set the batch size to 32 and use the Adam optimizer [32] with an initial learning rate of 0.00001, which is gradually increased to 0.001 until epoch 2 and then decayed at epochs 6 and 7 with a decay rate of 0.1. Our code is publicly available (https://github.com/taesunwhang/MVAN-VisDial (accessed on 26 March 2021)).
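The learning-rate schedule described above can be sketched as follows; whether the warmup is applied per epoch or per step is an assumption of this sketch, as is the use of `LambdaLR`.

```python
import torch

def build_optimizer_and_scheduler(model_params):
    """Sketch of the schedule: linear warmup from 1e-5 to the base learning rate
    1e-3 over the first 2 epochs, then decay by a factor of 0.1 at epochs 6 and 7."""
    base_lr, warmup_start = 1e-3, 1e-5
    optimizer = torch.optim.Adam(model_params, lr=base_lr)

    def lr_lambda(epoch):
        if epoch < 2:                          # warmup phase (per-epoch granularity assumed)
            frac = epoch / 2
            return (warmup_start + frac * (base_lr - warmup_start)) / base_lr
        factor = 1.0
        if epoch >= 6:
            factor *= 0.1                      # first decay at epoch 6
        if epoch >= 7:
            factor *= 0.1                      # second decay at epoch 7
        return factor

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```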

4.2. Quantitative Results

4.2.1. Discriminative Setting

We compare the results of our proposed model with previously published results on the VisDial v1.0 dataset for the following methods: Late Fusion (LF) [7], Hierarchical Recurrent Encoder (HRE) [7], Memory Network (MN) [7], Graph Neural Network (GNN) [12], Co-reference Neural Module Network (CorefNMN) [4], Recursive Visual Attention (RVA) [6], Synergistic Network [9], Dual Encoding Visual Dialogue (DualVD) [33], Context-Aware Graph (CAG) [14], History-Aware Co-Attention (HACAN) [34], Consensus Dropout Fusion (CDF) [11], Dual Attention Network (DAN) [5], and Factor Graph Attention (FGA) [13].
Table 1 reports the quantitative results on VisDial v1.0 under the discriminative decoder setting. Our MVAN model outperforms the previous state-of-the-art methods with respect to NDCG and AVG and shows competitive results on the non-NDCG metrics. Specifically, MVAN improves NDCG from 57.59 to 59.37 over the strongest baseline and is ranked first in terms of the average rank of NDCG and MRR. We also report the results of an ensemble of 10 independent models trained with different random seeds, which further improves NDCG from 59.37 to 60.92 and MRR from 64.84 to 66.38. These results indicate that MVAN not only predicts accurately, as shown by the non-NDCG metrics (i.e., MRR, R@k, and Mean), but also generalizes well, as shown by the NDCG score, because this metric considers several relevant answers to be correct.

4.2.2. Multi-Task Learning

As shown in Table 2, we report the results of our MVAN model, which is trained using multi-task learning. Our proposed approach performs better with respect to all metrics than Recurrent Dual Attention Network (ReDAN) [10], which averages the ranking results of the discriminative and generative model, and Light-weight Transformer for Many Inputs (LTMI) [24], which employs multi-task learning but uses only discriminative decoder outputs for evaluation.

4.2.3. Fine-Tuning on Dense Annotations

Table 3 reports the results of fine-tuning on dense annotations. Fine-tuning on the relevance scores of the candidate answers significantly boosts NDCG. However, existing works [17,18,21,22,24] show that the performance on non-NDCG metrics deteriorates in return. To alleviate this inconsistency between NDCG and non-NDCG metrics, we fine-tune the model with the joint loss introduced in Section 3.3.4. Figure 3 shows how MRR and NDCG change as $\alpha$, the weight of the discriminative loss, varies from 0 to 2. As $\alpha$ increases from 0 to 1, MRR improves rapidly while NDCG remains high. In contrast, NDCG decreases sharply with only a small improvement in MRR as $\alpha$ increases from 1 to 2. Based on this observation, we set $\alpha$ to 1 and obtain significant improvements on both metrics. In the baseline comparisons, our MVAN model is ranked first in MRR in both the single and ensemble settings (single: 56.06, ensemble: 65.16). In addition, MVAN is ranked first and second in AVG for the single and ensemble comparisons, respectively, despite a slightly lower NDCG.

4.2.4. Number of Dialog History

We varied the amount of dialog history to evaluate its impact on model performance with respect to the two major metrics (i.e., MRR and NDCG). The results in Figure 4 are the average MRR and NDCG over the dialog turns of the validation set; they show that as the amount of dialog history increases, MRR tends to improve gradually while NDCG deteriorates. This analysis indicates that history information decreases the NDCG score but substantially boosts the other metrics.

4.3. Ablation Study

We conducted ablation studies on the VisDial v1.0 validation split to evaluate the influence of each component of our model. The Modality Alignment module is not ablated because it handles the visual features. We use the same discriminative decoder for all ablations to exclude the impact of multi-task learning.
In Table 4, the first rows of each block indicate the impact of each module in our model. Since the two modules (i.e., Context Matching and Topic Aggregation) are interdependent, we employ simple visual features instead of topic-aggregation representation for MVAN w/o Topic Aggregation, whereas we simply remove context-matching representation for MVAN w/o Context Matching. Both models obtain slightly lower performance with respect to all evaluation metrics than MVAN. We can hence infer that the two modules are complementary with respect to each other and our model integrates these complementary characteristics well for the task.
Recent approaches [11,17,24] reported a trade-off relationship between the two primary metrics (i.e., NDCG and MRR) in the visual dialog task. We also found this trade-off through ablative experiments with and without dialog history features (see Table 4). Specifically, adding dialog history features improves the MRR score by 3.54% on average, whereas the NDCG score decreases by 1.92% on average. We observe that the model tends to predict answers more precisely (i.e., it has a better MRR score) when dialog history features are added. This may imply that question-related clues in the dialog history are important factors in reasoning about the ground truth, but that they hinder the model's generalization ability (i.e., they lower the NDCG score).

4.4. Qualitative Analysis

To qualitatively demonstrate the advantages of our model, we visualize the attention scores of each module through examples from the VisDial v1.0 validation set in Figure 5. The attention scores of the Context Matching module, highlighted in blue, show that our model selectively focuses on contextual information as the semantic intent of the question changes. The tendency for the caption (i.e., H0) to receive the highest attention score implies that the caption contains global information describing the image. In addition, the top three visual contents with high attention scores in each image lead to the potential interpretation that our model is capable of explicitly aligning the semantic intent (i.e., highlighted in yellow) of the question and visual contents through Modality Alignment module. In more detail, the attention scores of the dialog history, highlighted in red, indicate how our model captures topic-relevant clues through previous dialog history.
As shown in Figure 5a, comparing two examples, we see that the model no longer focuses on “6” and “people” in H0 because those words are not related to the topic of the current question (i.e.,“drinking”). In the example in Figure 5b, when answering Q4 (the left dialog), the model pays more attention to the question-relevant clue such as “women” in H0, while no longer focusing on it when answering Q6 (the question topic changes from “tennis outfits” to “background”). These qualitative results show that our model successfully pays attention to visual and textual information connected to the semantic intent of the question.

4.5. Error Analysis

We analyzed examples in the VisDial v1.0 validation set for which our model obtained a score of 0 on the R@10 metric. The errors can be categorized into three groups: (1) Subjective judgment: our model tends to make wrong predictions for questions about age, weather, and appearance that involve subjective judgment, although its answers might still be acceptable (Figure 6a,b). (2) Ambiguous questions: our model may focus on the wrong visual contents, for instance the left and right side walls rather than the rear wall, when faced with an ambiguous question (Figure 6c). (3) Wrong multimodal alignment: when the dialog history includes multiple entities (e.g., “boys”, “pizzas”, and “toppings”) that can be referenced by a single pronoun (i.e., “them”), MVAN may be confused as to which entity the pronoun refers to (Figure 6d).

5. Conclusions

In this paper, we introduced MVAN for the visual dialog task. MVAN effectively determines the semantic intent of the current question and captures question-relevant information through complementary modules and sequential alignment processes. We empirically evaluated our model on VisDial v1.0, where it outperformed existing state-of-the-art models. Moreover, we not only suggested plausible factors affecting the trade-off between evaluation metrics but also enhanced the interpretability of the multi-level attention through detailed visualization. As the main focus of this paper is on the encoder, an in-depth exploration of decoders remains limited. In future work, we plan to adopt the latest generative models for the decoder and to incorporate vision–language pre-trained models [18,20] to examine whether the performance of MVAN can be further improved. In addition, we will study an efficient ranking method that improves performance on both NDCG and non-NDCG metrics.

Author Contributions

Conceptualization, S.P. and T.W.; data curation, S.P. and T.W.; formal analysis, T.W. and Y.Y.; funding acquisition/project administration/supervision, H.L.; investigation, S.P.; methodology/software, S.P. and T.W.; visualization, S.P.; review and editing, Y.Y.; original draft, S.P. and T.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2018-0-01405) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation); by an IITP grant funded by the Korean government (MSIT) (No. 2020-0-00368, A Neural-Symbolic Model for Knowledge Acquisition and Inference Techniques); and by the MSIT, Korea, under the ICT Creative Consilience program (IITP-2021-2020-0-01819) supervised by the IITP.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The data can be found here: VisDial v1.0: https://visualdialog.org/data (accessed on 26 March 2021).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. More Qualitative Results

Figure A1. Visualization of the reasoning process for each module in MVAN.
Figure A2. Visualization of the reasoning process for each module in MVAN.

References

  1. Hudson, D.A.; Manning, C.D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6700–6709. [Google Scholar]
  2. Zellers, R.; Bisk, Y.; Farhadi, A.; Choi, Y. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6720–6731. [Google Scholar]
  3. Seo, P.H.; Lehrmann, A.; Han, B.; Sigal, L. Visual reference resolution using attention memory for visual dialog. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 3719–3729. [Google Scholar]
  4. Kottur, S.; Moura, J.M.; Parikh, D.; Batra, D.; Rohrbach, M. Visual coreference resolution in visual dialog using neural module networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 153–169. [Google Scholar]
  5. Kang, G.C.; Lim, J.; Zhang, B.T. Dual Attention Networks for Visual Reference Resolution in Visual Dialog. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 2024–2033. [Google Scholar]
  6. Niu, Y.; Zhang, H.; Zhang, M.; Zhang, J.; Lu, Z.; Wen, J.R. Recursive visual attention in visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6679–6688. [Google Scholar]
  7. Das, A.; Kottur, S.; Gupta, K.; Singh, A.; Yadav, D.; Moura, J.M.; Parikh, D.; Batra, D. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 326–335. [Google Scholar]
  8. Wu, Q.; Wang, P.; Shen, C.; Reid, I.; Van Den Hengel, A. Are you talking to me? Reasoned visual dialog generation through adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6106–6115. [Google Scholar]
  9. Guo, D.; Xu, C.; Tao, D. Image-question-answer synergistic network for visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 10434–10443. [Google Scholar]
  10. Gan, Z.; Cheng, Y.; Kholy, A.; Li, L.; Liu, J.; Gao, J. Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6463–6474. [Google Scholar]
  11. Kim, H.; Tan, H.; Bansal, M. Modality-balanced models for visual dialogue. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
  12. Zheng, Z.; Wang, W.; Qi, S.; Zhu, S.C. Reasoning visual dialogs with structural and partial observations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6669–6678. [Google Scholar]
  13. Schwartz, I.; Yu, S.; Hazan, T.; Schwing, A.G. Factor graph attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2039–2048. [Google Scholar]
  14. Guo, D.; Wang, H.; Zhang, H.; Zha, Z.J.; Wang, M. Iterative Context-Aware Graph Inference for Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–18 June 2020. [Google Scholar]
  15. Andreas, J.; Rohrbach, M.; Darrell, T.; Klein, D. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 39–48. [Google Scholar]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  17. Murahari, V.; Batra, D.; Parikh, D.; Das, A. Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
  18. Wang, Y.; Joty, S.; Lyu, M.R.; King, I.; Xiong, C.; Hoi, S.C. Vd-bert: A unified vision and dialog transformer with bert. arXiv 2020, arXiv:2004.13278. [Google Scholar]
  19. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  20. Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 13–23. [Google Scholar]
  21. Qi, J.; Niu, Y.; Huang, J.; Zhang, H. Two Causal Principles for Improving Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–18 June 2020. [Google Scholar]
  22. Agarwal, S.; Bui, T.; Lee, J.Y.; Konstas, I.; Rieser, V. History for Visual Dialog: Do we really need it? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020. [Google Scholar]
  23. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 41–48. [Google Scholar]
  24. Nguyen, V.Q.; Suganuma, M.; Okatani, T. Efficient Attention Mechanism for Visual Dialog that can Handle All the Interactions between Multiple Inputs. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
  25. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086. [Google Scholar]
  26. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  27. Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.J.; Shamma, D.A.; et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 2017, 123, 32–73. [Google Scholar] [CrossRef] [Green Version]
  28. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
  29. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  30. Voorhees, E.M. The TREC-8 Question Answering Track Report. In Proceedings of the TREC, Gaithersburg, MD, USA, 7–19 November 1999; Volume 99, pp. 77–82. [Google Scholar]
  31. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 8024–8035. [Google Scholar]
  32. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  33. Jiang, X.; Yu, J.; Qin, Z.; Zhuang, Y.; Zhang, X.; Hu, Y.; Wu, Q. DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
  34. Yang, T.; Zha, Z.J.; Zhang, H. Making History Matter: History-Advantage Sequence Training for Visual Dialog. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 2561–2569. [Google Scholar]
Figure 1. Example of a visual dialog task. The text color indicates the dialog topic (e.g., “people”, “food”, and “household goods”).
Figure 2. Model architecture of Multi-View Attention Network (MVAN). Bi-Directional Long Short-Term Memory (BiLSTM) layers for the question (blue) and the dialog history (green) are shared in the Topic Aggregation Module and Context Matching Module, respectively.
Figure 3. Performance of Multi-View Attention Network (MVAN) according to the ratio of relevance loss and discriminative loss on the VisDial v1.0 validation set.
Figure 4. Performance of MVAN with different amounts of dialog history on the VisDial v1.0 validation set.
Figure 5. Qualitative results on the VisDial v1.0 validation set. We visualize the different attention scores for each module: (1) attention scores from the Topic Aggregation module and Context Matching module are highlighted in red and blue, respectively; (2) the semantic intent of the current question represented via the topic-view alignment step is highlighted in yellow; and (3) the top three attention scores of visual features from the context-view alignment step are represented by bounding boxes with adjusted transparency in the given image. The numbers in brackets indicate the rank of the correct answer predicted by our model. Darker colors indicate higher attention scores. More qualitative results are given in Appendix A.
Figure 6. Error analysis on the VisDial v1.0 validation set. We analyze the examples for which the model scored 0 with respect to the R@10 metric. The errors are categorized into (1) subjective judgment, (2) ambiguous questions, and (3) wrong multimodal alignment. (a,b) belong to the first category, (c) to the second, and (d) to the third.
Table 1. Results on VisDial v1.0 (test-std). Existing works [17,18] based on pre-trained models are not reported for a fair comparison. The numbers in brackets indicate the rankings of the models on the Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR) metrics. The lower block lists ensemble models.
Model | AVG ↓ | NDCG ↑ | MRR ↑ | R@1 ↑ | R@5 ↑ | R@10 ↑ | Mean ↓
Single models:
LF [7] | 12 | 45.31 (13) | 55.42 (12) | 40.95 | 72.45 | 82.83 | 5.95
HRE [7] | 12 | 45.46 (12) | 54.16 (13) | 39.93 | 70.45 | 81.50 | 6.41
MN [7] | 11 | 47.50 (11) | 55.49 (11) | 40.98 | 72.30 | 83.30 | 5.92
GNN [12] | 10 | 52.82 (10) | 61.37 (10) | 47.33 | 77.98 | 87.83 | 4.57
CorefNMN [4] | 9 | 54.70 (9) | 61.50 (9) | 47.55 | 78.10 | 88.80 | 4.40
RVA [6] | 8 | 55.59 (8) | 63.03 (7) | 49.03 | 80.40 | 89.83 | 4.18
DualVD [33] | 7 | 56.32 (7) | 63.23 (5) | 49.25 | 80.23 | 89.70 | 4.11
Synergistic [9] | 6 | 57.32 (3) | 62.20 (8) | 47.90 | 80.43 | 89.95 | 4.17
CAG [14] | 5 | 56.64 (6) | 63.49 (4) | 49.85 | 80.63 | 90.15 | 4.11
DAN [5] | 4 | 57.59 (2) | 63.20 (6) | 49.63 | 79.75 | 89.35 | 4.30
HACAN [34] | 3 | 57.17 (4) | 64.22 (3) | 50.88 | 80.63 | 89.45 | 4.20
FGA [13] | 2 | 56.90 (5) | 66.20 (1) | 52.75 | 82.92 | 91.07 | 3.80
MVAN (ours) | 1 | 59.37 (1) | 64.84 (2) | 51.45 | 81.12 | 90.65 | 3.97
Ensemble models:
Synergistic [9] | 5 | 57.88 (4) | 63.42 (5) | 49.30 | 80.77 | 90.68 | 3.97
CDF [11] | 2 | 59.49 (2) | 64.40 (4) | 50.90 | 81.18 | 90.40 | 3.99
DAN [5] | 2 | 59.36 (3) | 64.92 (3) | 51.28 | 81.60 | 90.88 | 3.92
FGA [13] | 2 | 57.20 (5) | 69.30 (1) | 55.65 | 86.73 | 94.05 | 3.14
MVAN (ours) | 1 | 60.92 (1) | 66.38 (2) | 53.20 | 82.45 | 91.85 | 3.68
Table 2. Results of different methods of combining discriminative and generative models on VisDial v1.0 (test-std). * indicates that the model was trained using multi-task learning. The lower block lists ensemble models.
Model | AVG ↓ | NDCG ↑ | MRR ↑ | R@1 ↑ | R@5 ↑ | R@10 ↑ | Mean ↓
Single models:
ReDAN [10] | 2 | 61.86 (2) | 53.13 (3) | 41.38 | 66.07 | 74.50 | 8.91
LTMI * [24] | 2 | 60.92 (3) | 60.65 (2) | 47.00 | 77.03 | 87.75 | 4.90
MVAN * (ours) | 1 | 63.15 (1) | 63.02 (1) | 49.43 | 79.48 | 89.40 | 4.38
Ensemble models:
ReDAN+ [10] | 3 | 64.47 (2) | 53.74 (3) | 42.45 | 64.68 | 75.68 | 6.64
LTMI * [24] | 1 | 66.53 (1) | 63.19 (2) | 49.18 | 80.45 | 89.75 | 4.14
MVAN * (ours) | 2 | 63.22 (3) | 66.28 (1) | 53.87 | 82.08 | 89.65 | 4.61
Table 3. Results of fine-tuning on dense annotations on the VisDial v1.0 test-std dataset. The models trained with the discriminative decoder are also used for ensembling. The lower block lists ensemble models.
Model | AVG ↓ | NDCG ↑ | MRR ↑ | R@1 ↑ | R@5 ↑ | R@10 ↑ | Mean ↓
Single models:
MCA [22] | 4 | 72.47 (4) | 37.68 (4) | 20.67 | 56.67 | 72.12 | 8.89
Visdial-BERT [17] | 1 | 74.47 (2) | 50.74 (2) | 37.95 | 64.13 | 80.00 | 6.28
VD-BERT [18] | 1 | 74.54 (1) | 46.72 (3) | 33.15 | 61.58 | 77.15 | 7.18
MVAN (ours) | 1 | 73.07 (3) | 56.06 (1) | 44.38 | 68.50 | 81.18 | 5.98
Ensemble models:
P1_P2 [21] | 4 | 74.91 (2) | 49.13 (5) | 36.68 | 62.96 | 78.55 | 7.03
LTMI [24] | 4 | 74.88 (3) | 52.14 (4) | 38.93 | 66.60 | 80.65 | 66.53
MReal-BDAI [21] | 2 | 74.02 (4) | 52.62 (2) | 40.03 | 68.85 | 79.15 | 6.76
VD-BERT [18] | 1 | 75.35 (1) | 51.17 (4) | 38.90 | 62.82 | 77.98 | 6.69
MVAN (ours) | 2 | 71.40 (5) | 65.16 (1) | 52.88 | 79.50 | 88.55 | 4.27
Table 4. Ablations of our approaches on the VisDial v1.0 validation dataset.
Model | Context-Level History | Topic-Level History | NDCG ↑ | MRR ↑ | R@1 ↑ | R@5 ↑ | R@10 ↑ | Mean ↓
MVAN | ✓ | ✓ | 60.17 | 65.33 | 51.86 | 82.40 | 90.90 | 3.88
MVAN | – | – | 62.33 | 61.79 | 47.61 | 79.30 | 88.81 | 4.42
w/o Topic Aggregation | ✓ | N/A | 58.50 | 64.63 | 50.84 | 81.64 | 90.50 | 3.97
w/o Topic Aggregation | – | N/A | 60.57 | 61.32 | 47.19 | 78.59 | 88.40 | 4.55
w/o Context Matching | N/A | ✓ | 57.06 | 64.15 | 50.51 | 81.15 | 89.83 | 4.12
w/o Context Matching | N/A | – | 58.60 | 60.36 | 46.09 | 77.71 | 87.64 | 4.73
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
