Multi-View Attention Networks for Visual Dialog

Visual dialog is a challenging vision-language task in which an agent answers a series of questions visually grounded in a given image. Resolving the visual dialog task requires a high-level understanding of various multimodal inputs (e.g., question, dialog history, image, and answer). Specifically, an agent must 1) understand the question-relevant dialog history and 2) focus on the question-relevant visual contents among the diverse visual contents in a given image. In this paper, we propose Multi-View Attention Network (MVAN), which considers complementary views of multimodal inputs based on attention mechanisms. MVAN effectively captures question-relevant information from the dialog history through two different textual views (i.e., Topic Aggregation and Context Matching), and integrates multimodal representations with a two-step fusion process. Experimental results on the VisDial v1.0 and v0.9 benchmarks show the effectiveness of our proposed model, which outperforms previous state-of-the-art methods with respect to all evaluation metrics.


Introduction
As a part of the interdisciplinary research that combines natural language processing with computer vision, a wide variety of vision-language tasks (e.g., visual question answering (VQA), image captioning, and referring expressions) have been introduced in recent years. Considerable efforts in this field have advanced the capabilities of artificial intelligence agents a step further, but an agent's comprehension of multimodal information is still far from human-level reasoning and cognitive ability (Hudson and Manning, 2019; Zellers et al., 2019).

[Figure 1: Example of a visual dialog task. The text background color indicates the dialog topic (e.g., "people", "food", and "household goods").]
The visual dialog task is similar to VQA in that it requires the agent to answer a question grounded in an image, but differs in that the agent must answer a series of questions by attending to both the given image and the previous dialog. This task is more challenging than other vision-language tasks because the agent must selectively capture the visual contents related to the question topic, which changes as the dialog proceeds. For example, Figure 1 illustrates how the question topics change at each dialog turn (e.g., "household goods", "people", and "food"). Furthermore, answering questions that contain pronouns (e.g., "they", "it", "he", "she", and "them") can be more difficult because the agent must consider which entity in the dialog history the pronoun refers to. To be specific, when the agent encounters question 5 (Q5), it should resolve what "they" refers to (i.e., "kids" and "boys" in the caption and the Q4-A4 pair) and ground the referent in the image.
Several recent studies have approached the visual dialog task from the perspective of visual co-reference resolution. Kottur et al. (2018) proposed neural module networks that effectively link pronouns to referring expressions at the word level. In addition, Kang et al. (2019) adapted a self-attention mechanism (Vaswani et al., 2017) based on sentence-level representations of the question and dialog history to focus on question-relevant history. Niu et al. (2019) proposed a recursive attention mechanism to capture question-relevant dialog history and ground the related visual contents in the image. However, resolving pronoun ambiguity does not always lead to an understanding of the semantic intent of the question. For example, when answering Q6 in Figure 1, the agent must explicitly understand that the question is asking whether "other snacks" exist (i.e., the model should attend more to "other snacks" than to "they"). In this respect, the dialog agent should be capable of accurately determining the semantic intent of the given question and then leveraging question-relevant information from the dialog history and visual contents.
To this end, this paper proposes Multi-View Attention Network (MVAN), which consists of three main modules: 1) Context Matching, 2) Topic Aggregation, and 3) Modality Fusion. First, the Context Matching module uses an attention mechanism to effectively represent the contextual information of the dialog history that is relevant to the question. Second, the Topic Aggregation module captures topic-guided clues in the dialog history at the word level. In both modules, we apply a gate function to selectively capture textual information that interacts with the current question topic. Lastly, the Modality Fusion module performs two sequential fusion steps to integrate the textual outputs of the previous modules with the visual features. The main contributions of this paper are as follows.
• We propose MVAN to effectively represent multimodal inputs with two different views and combine them with visual contents through multiple fusion steps.
• Experimental results on the VisDial v1.0 and v0.9 benchmarks show that our proposed model outperforms the previous state-of-the-art methods with respect to all evaluation metrics.
• Visualization of the multi-level attention score demonstrates that MVAN explicitly understands the semantic intent of the question, which leads to a reasonable interpretation of leveraging various multimodal inputs.

Related Work
VQA Over the last few years, large-scale VQA datasets, such as VQA 1.0 (Antol et al., 2015), VQA 2.0 (Goyal et al., 2017), COCO-QA (Ren et al., 2015a), and GQA (Hudson and Manning, 2019), have accelerated development in the vision-language field. Lu et al. (2016) introduced a co-attention mechanism to jointly exploit both image attention and question attention, and more advanced approaches have been studied in Nguyen and Okatani (2018). Bilinear approaches that replace element-wise addition or concatenation as modality fusion methods have also been proposed (Fukui et al., 2016; Kim et al., 2016, 2018).

Visual Dialog Visual dialog is a task proposed by Das et al. (2017) that requires the dialog agent to answer the current question by exploiting both the image and the dialog history. Das et al. (2017) also introduced encoder-decoder models such as late fusion, hierarchical recurrent networks, and memory networks as baseline methods. Most previous approaches to the visual dialog task are based on attention mechanisms. DAN (Kang et al., 2019) uses a multi-head attention mechanism (Vaswani et al., 2017), and Gan et al. (2019) proposed a multi-step reasoning method to fuse the given multimodal inputs. FGA (Schwartz et al., 2019) considers all the interactions of connected entities based on a factor graph. RvA (Niu et al., 2019) fuses question and image representations via recursive trees. Other approaches proposed different training methods for the visual dialog task, including a reinforcement mechanism (Yang et al., 2019) and adversarial learning (Lu et al., 2017). Recently, Qi et al. (2019) proposed causal intervention algorithms that can be composed into any visual dialog model, and Murahari et al. (2019) introduced a fine-tuning method using a pre-trained model.

Model
In the visual dialog task (Das et al., 2017), a dialog agent is given a set of multimodal inputs at each dialog turn $t$. This input set consists of an image $I$, a current question $Q_t$, the dialog history $H_t = \{C, (Q_1, A^{gt}_1), \dots, (Q_{t-1}, A^{gt}_{t-1})\}$, which consists of the image caption $C$ and $t-1$ consecutive question-answer pairs, and a set of answer candidates. The agent is then required to answer the question by either discriminating or generating a correct answer.

Visual Features
We employ a bottom-up attention mechanism (Anderson et al., 2018) to represent the objects appearing in an image. Visual features of object regions $V = \{v_k\}_{k=1}^{n_v} \in \mathbb{R}^{d_v \times n_v}$, where $n_v$ is the number of detected objects (ranging from 10 to 100), are adaptively extracted from a Faster R-CNN (Ren et al., 2015b) pre-trained on Visual Genome (Krishna et al., 2017).
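Because the number of detected objects varies per image, the extracted features must be padded to a common size before batching. The following is a minimal sketch of one way to do this, not part of the paper; `pad_visual_features` and `max_objects` are illustrative names.

```python
# A minimal sketch (assumed helper, not from the paper): batching a variable
# number of detected object features (10-100 per image) into a fixed-size
# tensor with a boolean mask so downstream attention can ignore padded slots.
import torch

def pad_visual_features(feature_list, max_objects=100):
    """feature_list: list of [n_v_i, d_v] tensors from a Faster R-CNN."""
    d_v = feature_list[0].size(1)
    batch = torch.zeros(len(feature_list), max_objects, d_v)
    mask = torch.zeros(len(feature_list), max_objects, dtype=torch.bool)
    for i, feats in enumerate(feature_list):
        n_v = feats.size(0)
        batch[i, :n_v] = feats   # copy real features; the rest stays zero
        mask[i, :n_v] = True     # mark valid (non-padded) object slots
    return batch, mask
```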

Language Features
We first embed three different text inputs: the current question, the dialog history, and the answer candidates. The word embedding layer is initialized with pre-trained GloVe embeddings (Pennington et al., 2014). We then feed the word embeddings into a bi-directional long short-term memory (BiLSTM) network to encode a sequential representation of each input. Specifically, each word in question $Q$ is embedded as $x^q = \{w^q_i\}_{i=1}^{n_q}$, where $n_q$ is the number of words in the sequence. Each word embedding is fed into the BiLSTM layer as follows:

$\overrightarrow{h}^q_i = \overrightarrow{\text{LSTM}}(w^q_i, \overrightarrow{h}^q_{i-1}), \quad \overleftarrow{h}^q_i = \overleftarrow{\text{LSTM}}(w^q_i, \overleftarrow{h}^q_{i+1})$

The word-level sequential representation of each token is constructed by concatenating the hidden states of the forward and backward LSTMs, denoted as $u^q_i = [\overrightarrow{h}^q_i; \overleftarrow{h}^q_i]$. The word-level representation of the dialog history $u^h_{r,i}$ is constructed using the same process with different BiLSTM layers. For the answer candidates, we use a separate unidirectional LSTM because their sequences are shorter than those of the questions.
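As a concrete illustration of this encoder, the following PyTorch sketch embeds a token sequence with a GloVe-initialized layer and a BiLSTM, returning both word-level and sentence-level representations. Class and variable names are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """GloVe-initialized embedding followed by a BiLSTM; returns both the
    word-level states u and a sentence-level state s (a sketch)."""
    def __init__(self, glove_weights, hidden_size):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.bilstm = nn.LSTM(glove_weights.size(1), hidden_size,
                              batch_first=True, bidirectional=True)

    def forward(self, tokens):                    # tokens: [batch, n_q]
        w = self.embed(tokens)                    # [batch, n_q, d_emb]
        u, (h_n, _) = self.bilstm(w)              # u: [batch, n_q, 2*hidden]
        # Sentence representation: concat final forward and backward states.
        s = torch.cat([h_n[0], h_n[1]], dim=-1)   # [batch, 2*hidden]
        return u, s
```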

Multi-View Attention Network
Context Matching Module Contextual (sentence-level) representations are constructed by concatenating the last hidden states of the forward and backward LSTMs for the question and each round of the dialog history, denoted as $s^q$ and $s^h_r$, respectively. We then apply an attention mechanism to focus on question-relevant history from the dialog history series. The Context Matching module takes the contextual representation of the question $s^q \in \mathbb{R}^{d_s \times 1}$ and the dialog history $s^h = \{s^h_0, s^h_1, \dots, s^h_{t-1}\} \in \mathbb{R}^{d_s \times t}$ and outputs question-relevant history features as follows:

$z^S_r = W^\top (f^S_q(s^q) \circ f^S_h(s^h_r)), \quad a^S = \text{softmax}(z^S), \quad \tilde{s}^h = \sum_{r=0}^{t-1} a^S_r s^h_r$

where $\circ$ is element-wise multiplication, $W \in \mathbb{R}^{d_f \times 1}$ is a projection matrix, and $f^S_q(\cdot)$ and $f^S_h(\cdot)$ denote non-linear transformation functions. We apply a gate function to effectively integrate the context-level features from the current question and dialog history as follows:

$\text{gate}^S = \sigma(W^S_{gate}[s^q; \tilde{s}^h] + b^S_{gate}), \quad e^S = \text{gate}^S \circ [s^q; \tilde{s}^h]$

where $\sigma(\cdot)$ is a sigmoid function and $W^S_{gate} \in \mathbb{R}^{2d_s \times 2d_s}$ and $b^S_{gate} \in \mathbb{R}^{2d_s}$ are trainable parameters. Note that $e^S \in \mathbb{R}^{2d_s}$ is a context-aware representation that effectively integrates the sequential information of the question and the question-relevant dialog history.
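A minimal PyTorch sketch of the Context Matching computation described above, under the assumption that the non-linear transformations $f^S_q$ and $f^S_h$ are single linear layers with ReLU; layer and class names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextMatching(nn.Module):
    """Sentence-level attention over dialog history followed by a gate,
    a sketch of the module described above."""
    def __init__(self, d_s, d_f):
        super().__init__()
        self.f_q = nn.Sequential(nn.Linear(d_s, d_f), nn.ReLU())
        self.f_h = nn.Sequential(nn.Linear(d_s, d_f), nn.ReLU())
        self.proj = nn.Linear(d_f, 1, bias=False)   # W in R^{d_f x 1}
        self.gate = nn.Linear(2 * d_s, 2 * d_s)

    def forward(self, s_q, s_h):  # s_q: [b, d_s], s_h: [b, t, d_s]
        z = self.proj(self.f_q(s_q).unsqueeze(1) * self.f_h(s_h))  # [b, t, 1]
        a_s = F.softmax(z, dim=1)              # attention over history rounds
        s_h_att = (a_s * s_h).sum(dim=1)       # weighted history: [b, d_s]
        cat = torch.cat([s_q, s_h_att], dim=-1)
        g = torch.sigmoid(self.gate(cat))
        e_s = g * cat                          # context-aware feature e^S
        return e_s, a_s.squeeze(-1)            # a^S reused by Topic Aggregation
```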
Topic Aggregation Module The Topic Aggregation module leverages the word-level sequential representations of the question and dialog history, $u^q_i$ and $u^h_{r,j}$. A dot-product attention mechanism is employed to selectively focus on the words in the dialog history that are relevant to the question topic as follows:

$z^W_{r,i,j} = f^W_q(u^q_i)^\top f^W_h(u^h_{r,j}), \quad a^W_{r,i} = \text{softmax}(z^W_{r,i}), \quad \tilde{w}^h_{r,i} = \sum_j a^W_{r,i,j} w^h_{r,j}$

where $f^W_q(\cdot)$ and $f^W_h(\cdot)$ are non-linear transformation functions. The question-guided history feature for each round $\tilde{w}^h_{r,i}$ is computed as a weighted sum of the history word embeddings $w^h_{r,j}$, which represent the original meanings of the words. The attended representation $\bar{w}^h_i$ is computed by aggregating over all history rounds $\{\tilde{w}^h_{r,i}\}_{r=0}^{t-1}$, weighted by the attention scores $a^S_r$ of the Context Matching module, as follows:

$\bar{w}^h_i = \sum_{r=0}^{t-1} a^S_r \tilde{w}^h_{r,i}$

Similar to the Context Matching module, a gate operation merges the topic-guided history representation with that of the question at the word level:

$\text{gate}^W_i = \sigma(W^W_{gate}[u^q_i; \bar{w}^h_i] + b^W_{gate}), \quad e^W_i = \text{gate}^W_i \circ [u^q_i; \bar{w}^h_i]$

where $W^W_{gate} \in \mathbb{R}^{2d_w \times 2d_w}$ and $b^W_{gate} \in \mathbb{R}^{2d_w}$ are trainable parameters. Note that $e^W \in \mathbb{R}^{2d_w \times n_q}$ is a topic-aware feature that represents the question topic and the various history clues associated with it.
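The following PyTorch sketch mirrors this word-level attention and gating, under the assumption that the question word features and the history word embeddings share the dimension $d_w$; the einsum index names and class name are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicAggregation(nn.Module):
    """Word-level dot-product attention from question words to history
    words, aggregated across rounds with the Context Matching scores a^S.
    A hedged sketch; exact projections may differ from the paper."""
    def __init__(self, d_w, d_f):
        super().__init__()
        self.f_q = nn.Sequential(nn.Linear(d_w, d_f), nn.ReLU())
        self.f_h = nn.Sequential(nn.Linear(d_w, d_f), nn.ReLU())
        self.gate = nn.Linear(2 * d_w, 2 * d_w)

    def forward(self, u_q, w_h, a_s):
        # u_q: [b, n_q, d_w] question word features; w_h: [b, t, n_h, d_w]
        # history word embeddings; a_s: [b, t] round-level attention scores.
        b, t, n_h, d_w = w_h.shape
        scores = torch.einsum('bqd,brhd->brqh',
                              self.f_q(u_q), self.f_h(w_h))  # [b, t, n_q, n_h]
        a_w = F.softmax(scores, dim=-1)
        # Weighted sum of the *original* word embeddings, as in the text.
        w_att = torch.einsum('brqh,brhd->brqd', a_w, w_h)    # per-round summary
        w_bar = (a_s.view(b, t, 1, 1) * w_att).sum(dim=1)    # [b, n_q, d_w]
        cat = torch.cat([u_q, w_bar], dim=-1)                # [b, n_q, 2*d_w]
        g = torch.sigmoid(self.gate(cat))
        return g * cat                                       # e^W
```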
Modality Fusion Module Given the output representations of the Context Matching module $e^S \in \mathbb{R}^{2d_s \times 1}$ and the Topic Aggregation module $e^W \in \mathbb{R}^{2d_w \times n_q}$, the Modality Fusion module integrates them with the visual features $\{v_k\}_{k=1}^{n_v} \in \mathbb{R}^{d_v \times n_v}$ via a two-step fusion (i.e., topic-view fusion and context-view fusion). The module first combines representations from the heterogeneous modalities (i.e., vision and language) at the word level. We utilize dot-product attention to represent the visually relevant question as follows:

$z^W_{k,i} = f^W(e^W_i)^\top f^W_v(v_k), \quad a^W_k = \text{softmax}(z^W_k), \quad \tilde{e}^W_k = \sum_i a^W_{k,i} e^W_i$

where $f^W(\cdot)$ and $f^W_v(\cdot)$ are non-linear transformation functions that embed the two modality representations into the same embedding space. We then obtain the fused vector by concatenating the visual features and the visually relevant question features and applying a multi-layer perceptron (MLP):

$m^W_k = \text{MLP}([v_k; \tilde{e}^W_k])$

Note that $m^W \in \mathbb{R}^{d_v \times n_v}$ is a topic-view fusion representation that combines the specific question topic and the visual contents. This fused embedding is further enhanced by the following context-view fusion:

$z^S_k = W^\top (f^S_m(m^W_k) \circ f^S(e^S)), \quad a^S = \text{softmax}(z^S), \quad \tilde{m}^S = \sum_k a^S_k m^W_k$

where $f^S_m(\cdot)$ and $f^S(\cdot)$ are non-linear transformation functions and $W \in \mathbb{R}^{d_f \times 1}$ is a projection matrix. Note that $\tilde{m}^S$ is a context-view fusion representation, which is fed into a single-layer feed-forward neural network with a ReLU activation function:

$m^{enc} = \text{ReLU}(W_{enc}[\tilde{m}^S; e^S] + b_{enc})$

where $W_{enc} \in \mathbb{R}^{(d_v+d_u) \times d_{enc}}$ and $b_{enc} \in \mathbb{R}^{d_{enc}}$ are trainable parameters, with $d_u$ the dimension of $e^S$. Note that $m^{enc}$ is the multi-view fusion representation, which is fed into either a discriminative or generative decoder to select the most likely response from among the candidates.
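A hedged PyTorch sketch of the two-step fusion: topic-view fusion attends question words per object, and context-view fusion pools the fused object features with $e^S$. All layer names are illustrative, and the non-linear transformations are assumed to be linear-plus-ReLU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityFusion(nn.Module):
    """Two-step fusion sketch: topic-view (word-level) fusion with each
    object feature, then context-view attention pooling with e^S."""
    def __init__(self, d_w, d_v, d_s, d_f, d_enc):
        super().__init__()
        self.f_w = nn.Sequential(nn.Linear(2 * d_w, d_f), nn.ReLU())
        self.f_v = nn.Sequential(nn.Linear(d_v, d_f), nn.ReLU())
        self.mlp = nn.Sequential(nn.Linear(d_v + 2 * d_w, d_v), nn.ReLU())
        self.f_m = nn.Sequential(nn.Linear(d_v, d_f), nn.ReLU())
        self.f_s = nn.Sequential(nn.Linear(2 * d_s, d_f), nn.ReLU())
        self.proj = nn.Linear(d_f, 1, bias=False)
        self.out = nn.Linear(d_v + 2 * d_s, d_enc)

    def forward(self, e_w, e_s, v):
        # e_w: [b, n_q, 2*d_w], e_s: [b, 2*d_s], v: [b, n_v, d_v]
        # Topic-view fusion: attend over question words for each object.
        scores = torch.einsum('bvd,bqd->bvq', self.f_v(v), self.f_w(e_w))
        a = F.softmax(scores, dim=-1)
        e_w_att = torch.einsum('bvq,bqd->bvd', a, e_w)    # [b, n_v, 2*d_w]
        m_w = self.mlp(torch.cat([v, e_w_att], dim=-1))   # m^W: [b, n_v, d_v]
        # Context-view fusion: pool objects guided by the context feature e^S.
        z = self.proj(self.f_m(m_w) * self.f_s(e_s).unsqueeze(1))
        a_v = F.softmax(z, dim=1)
        m_s = (a_v * m_w).sum(dim=1)                      # m~^S: [b, d_v]
        return F.relu(self.out(torch.cat([m_s, e_s], dim=-1)))  # m^enc
```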

Answer Decoder
Discriminative Decoder We use the last hidden states of the forward LSTM to encode sentence representations of the answer candidates, denoted as $s^a$. We rank the candidates according to the dot products between the candidate representations $s^a$ and the multi-view fusion representation $m^{enc}$, and then apply the softmax function to obtain a probability distribution over the candidates, denoted as $p = \text{softmax}((s^a)^\top m^{enc})$. Note that the dimension of each answer candidate representation is the same as that of the encoder output. We use a multi-class cross-entropy loss as the discriminative objective function, formulated as

$\mathcal{L}_D = -\sum_i y_i \log p_i$

where $y$ is the one-hot encoded vector of the ground-truth answer.
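In code, this objective reduces to a standard cross entropy over the candidate scores. A minimal sketch, assuming 100 candidates per question and an integer ground-truth index; the helper name is illustrative.

```python
import torch
import torch.nn.functional as F

def discriminative_loss(s_a, m_enc, gt_index):
    """s_a: [batch, 100, d_enc] candidate encodings; m_enc: [batch, d_enc]
    multi-view fusion feature; gt_index: [batch] ground-truth answer index."""
    logits = torch.einsum('bcd,bd->bc', s_a, m_enc)  # dot-product scores
    # Multi-class cross entropy over the 100 answer candidates.
    return F.cross_entropy(logits, gt_index)
```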
Generative Decoder Unlike most previous approaches, which take only a discriminative approach, we also train our model in a generative manner (Das et al., 2017). During the training phase, we use a two-layer LSTM to predict the next token given the previous tokens in the answer sequence. The initial hidden state of the LSTM is set to the encoder output representation. We compute the likelihood of each ground-truth token, denoted as $\{p_k\}_{k=1}^{n_a}$, and train the model by minimizing the sum of the negative log-likelihoods as follows:

$\mathcal{L}_G = -\sum_{k=1}^{n_a} \log p_k$

Multi-task Learning We perform multi-task learning by combining the discriminative and generative decoders, with the joint objective $\mathcal{L} = \mathcal{L}_D + \mathcal{L}_G$. For evaluation, we simply average the probability distributions of the two decoders. Multi-task learning substantially improves performance with respect to the normalized discounted cumulative gain (NDCG) metric.
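A sketch of the generative objective and the joint loss; note that `F.cross_entropy` mean-reduces over tokens by default, whereas the paper's formulation sums, so the two differ only by a constant factor per sequence. Names are illustrative.

```python
import torch.nn.functional as F

def generative_loss(token_logits, gt_tokens, pad_id=0):
    """token_logits: [batch, n_a, vocab] from the two-layer LSTM decoder;
    gt_tokens: [batch, n_a] ground-truth answer tokens (pad_id ignored)."""
    # Per-token negative log-likelihood (mean-reduced; the paper sums).
    return F.cross_entropy(token_logits.transpose(1, 2), gt_tokens,
                           ignore_index=pad_id)

# Multi-task objective as stated above: L = L_D + L_G.
# loss = discriminative_loss(s_a, m_enc, gt_index) \
#        + generative_loss(token_logits, gt_tokens)
```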

Experimental Setup
Datasets We use the VisDial v0.9 and v1.0 datasets to evaluate our proposed model. VisDial v0.9 (Das et al., 2017) consists of 123k MS-COCO (Lin et al., 2014) images and their captions. The training and validation splits of VisDial v0.9 contain 83k and 40k images, respectively, and each image has 10 consecutive question-answer pairs. VisDial v1.0, which supersedes VisDial v0.9, has 123k training images formed by combining the training and validation splits of VisDial v0.9. An additional 10k images from the Flickr dataset are used to construct the validation and test splits of VisDial v1.0, which contain 2k and 8k images, respectively. Unlike the previous version of the dataset, dense relevance annotations for each candidate answer are provided in the validation and test splits.

Evaluation Metrics
We evaluated our proposed model using several retrieval metrics, following Das et al. (2017): 1) mean rank of the ground-truth response (Mean); 2) recall at k (k = {1, 5, 10}), denoted as R@k, which evaluates where the ground truth is positioned in the sorted list; and 3) mean reciprocal rank (MRR) (Voorhees et al., 1999). NDCG was also introduced as a primary metric for the VisDial v1.0 dataset; it decreases when the model gives a low ranking to candidate answers with high relevance scores. MRR evaluates the precision of the model based on where the single ground-truth answer is ranked, whereas NDCG evaluates the relative relevance of all predicted answers.
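To make the metric definitions concrete, here is a hedged sketch of how they might be computed; `retrieval_metrics` and `ndcg` are illustrative helpers, and the NDCG variant shown (NDCG@K with K equal to the number of relevant candidates) follows the VisDial evaluation convention as we understand it, not code from the paper.

```python
import numpy as np

def retrieval_metrics(ranks):
    """ranks: 1-based rank of the ground-truth answer for each question."""
    ranks = np.asarray(ranks, dtype=np.float64)
    return {'MRR': (1.0 / ranks).mean(),
            **{f'R@{k}': (ranks <= k).mean() for k in (1, 5, 10)},
            'Mean': ranks.mean()}

def ndcg(scores, relevance):
    """scores: [n_cand] model scores; relevance: [n_cand] dense relevance
    annotations. Computes NDCG@K with K = number of relevant candidates."""
    k = int((relevance > 0).sum())
    order = np.argsort(-scores)[:k]          # top-K predicted candidates
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = (relevance[order] * discounts).sum()
    ideal = (np.sort(relevance)[::-1][:k] * discounts).sum()
    return dcg / ideal if ideal > 0 else 0.0
```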
Results on VisDial v1.0 and v0.9 Table 1 reports the quantitative results on VisDial v1.0 and v0.9 under the discriminative decoder setting. On VisDial v1.0, our MVAN model outperforms the previous state-of-the-art methods with respect to all evaluation metrics. Specifically, MVAN improves NDCG from 57.59 to 59.37 and MRR from 64.22 to 64.84. In addition, we observe an improvement in Mean from 4.11 to 3.97 and improvements in R@k of approximately 0.4%. Similar results can be seen for R@5 and Mean on VisDial v0.9. We also report the results of an ensemble of 10 independent models trained with different initial seeds, which yields an average performance improvement of 1.3% across all metrics.
These results indicate that our MVAN model not only has accurate prediction ability, as indicated by the non-NDCG metrics (i.e., MRR, R@k, and Mean), but also a strong generalization capability, as indicated by the NDCG score, because this metric considers several relevant answers to be correct.
Results on multi-task learning As shown in Table 2, we report the results of our MVAN model, which was trained using multi-task learning (see Section 3.3). Our proposed approach performs better with respect to all metrics than ReDAN (Gan et al., 2019), which averages the ranking results of the discriminative and generative model, and Tohoku (Nguyen et al., 2019), which employs multi-task learning but uses only discriminative decoder outputs for evaluation.

Number of dialog history rounds
We experimented with the amount of dialog history provided to the model to evaluate its impact on the two major metrics (i.e., MRR and NDCG). The results in Figure 3 show that as the amount of dialog history increases, the MRR performance of our model tends to gradually improve, whereas the NDCG performance deteriorates. This analysis shows that history information decreases the NDCG score but substantially boosts the other metrics.

[Table 3: Ablations of our approaches on the VisDial v1.0 validation dataset.]

Ablation study
We conducted ablation studies on the VisDial v1.0 validation split to evaluate the influence of each component of our model. The Modality Fusion module is not ablated because it handles the visual features. We use the same discriminative decoder model (Das et al., 2017) for all ablations to exclude the impact of multi-task learning. In Table 3, the first row of each block indicates the impact of each module in our model. Because the two modules (i.e., Context Matching and Topic Aggregation) are interdependent, we employ simple visual features instead of topic-aware features for the Context Matching model, whereas we simply remove the context-aware features for the Topic Aggregation model (see Section 3.2). Both models obtain slightly lower performance than the full MVAN model with respect to all evaluation metrics. We can hence infer that the two modules are complementary to each other and that our model integrates these complementary characteristics well for the task.
Recent approaches (Murahari et al., 2019; Kim et al., 2020; Nguyen et al., 2019) have reported a trade-off between the two primary metrics (i.e., NDCG and MRR) in the visual dialog task. We also found this trade-off in ablative experiments with and without dialog history features (see Table 3). Specifically, adding dialog history features improves the MRR score by 3.54% on average, whereas the NDCG score decreases by 1.92% on average. We observe that the model tends to predict answers more precisely (i.e., it achieves a better MRR score) when the dialog history features are added. This may imply that question-related clues in the dialog history are important factors in reasoning about the ground-truth answer, but they hinder the model's generalization ability (i.e., they lower the NDCG score).

Qualitative Analysis
To qualitatively demonstrate the advantages of our model, we visualize the attention scores of each module on examples from the VisDial v1.0 validation set in Figure 4. The attention scores of the Context Matching module, highlighted in blue, show that our model selectively focuses on contextual information as the semantic intent of the question changes. The tendency for the caption (i.e., H0) to receive the highest score implies that the caption contains global information describing the image. In addition, the top three visual regions with the highest attention scores in each image, together with the attention scores over the current question, suggest that our model captures the semantic intent of the question correctly and determines which visual contents are accordingly required. In more detail, the attention scores over the dialog history, highlighted in red, indicate how our model attends to topic-relevant clues in the previous dialog history.
Comparing the two examples in Figure 4(a), we see that the model no longer focuses on "6" and "people" in H0 because those words are not related to the topic of the current question (i.e., "drinking"). In the example in Figure 4(b), when answering Q4 (the left dialog), the model pays more attention to question-relevant clues such as "women" in H0, while capturing "tennis", "court", and "rackets" correspondingly as the question topic changes from "tennis outfits" to "background". These qualitative results show that our model successfully attends to the visual and textual information connected to the semantic intent of the question.

Error Analysis
We analyzed examples from the VisDial v1.0 validation set for which our model obtained a score of 0 on the R@10 metric. The errors in Figure 5 can be categorized into three groups: 1) Subjective judgment: our model tends to make wrong predictions for questions about age, weather, and appearance that involve subjective judgment, for which its answers might nevertheless be acceptable (Figures 5(a) and (b)). 2) Ambiguous questions: our model can focus on the wrong visual contents, for instance the left and right side walls rather than the rear wall, when faced with an ambiguous question (Figure 5(c)). 3) Wrong co-reference resolution: when the dialog history includes multiple entities (e.g., "boys", "pizzas", and "toppings") that can be referenced by a single pronoun (i.e., "them"), MVAN can become confused as to which entity the pronoun refers to (Figure 5(d)).

Conclusion
In this paper, we introduced MVAN for the visual dialog task. MVAN effectively captures question-relevant dialog history and visual contents by focusing on the semantic intent of the current question. We empirically evaluated our model on VisDial v1.0 and v0.9, where it outperforms existing state-of-the-art models. Moreover, we not only suggest plausible factors underlying the trade-off between the evaluation metrics, but also enhance the interpretability of multi-level attention through detailed visualization. In future work, we aim to develop a complementary model by adding sequential information about the dialog history, in contrast to our proposed model, which relies only on attention scores. We also plan to incorporate the latest pre-training methods (Murahari et al., 2019) into our model to improve its performance.