Deep Modular Bilinear Attention Network for Visual Question Answering

Visual Question Answering (VQA) is a multimodal task: given an image and a question about it, the model must determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use dot-product attention to compute intra-modality and inter-modality attention between visual and language features. In this paper, we instead use the Bilinear Attention Network (BAN) method to compute attention. We propose a deep multimodality bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) to construct inter-modality and intra-modality relations. These two units are the core of the framework and can be cascaded in depth. In addition, we encode the question with the dynamic word vectors of BERT (Bidirectional Encoder Representations from Transformers) and then process the question features further with self-attention. We sum the result with the features obtained by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome dataset for augmentation, our model reaches an accuracy of 70.85% on the test-std split of VQA 2.0.


Introduction
The goal of Visual Question Answering (VQA) [1] is to build a question answering system with human-like intelligence that can recognize object categories, spatial relationships, and other information in a given image. VQA has broad application scenarios and far-reaching significance for the development of artificial intelligence (see Figure 1).
Our model can be applied to an assistive robot for blind people. Images and audio of the surroundings can be captured by the robot's hardware sensors and used as input to our model, which can help blind users perceive surrounding objects.
The most challenging problem in VQA is establishing the association between each region in the image and the words in the question; a VQA model must be able to align the image and the text semantically. It has to understand the content of the image and also find the answer that corresponds to the question, which makes the task more demanding than image understanding alone.
MCB [2], MFB [3], and MUTAN [4] capture the high-level interactions between question and image features through feature fusion. However, their scope of use is limited, and they are not easy to apply to other VQA models.
Attention mechanisms [5][6][7][8] are very important in deep learning and have been applied successfully to the VQA task. Attention-based models focus on the key information. Anderson et al. [9] proposed a bottom-up and top-down attention mechanism and won the VQA Challenge 2017. They use a concatenation-based attention mechanism to obtain image attention guided by the question. However, the model ignores the relationship between each word and each image region. BAN [10] focuses on exploring the inter-modality relations between word-region pairs but ignores the intra-modality relations. MCAN [11], DFAF [12], and CMCN [13] explore inter-modality and intra-modality relations simultaneously and achieve good results. MCAN proposes a deep Modular Co-Attention Network that consists of Modular Co-Attention (MCA) layers cascaded in depth.
Inspired by MCAN, we design two basic attention units (BAN-GA and BAN-SA) that are combined and cascaded in depth. Our work uses bilinear attention to construct inter-modality and intra-modality relations between visual and language features. Along with visual attention, learning textual attention is also very important, so we use the pre-trained language model BERT [14] to encode the question. In addition, we use a self-attention unit to process the question features further. With these components, our model achieves better performance.
Almost all VQA models use the attention mechanism. Compared with BUTD, however, we use a co-attention mechanism instead of question-guided image attention; for question embedding, we use sentence vectors instead of word vectors to better express the characteristics of the question. Compared with fusion methods, we use BAN to model the relationships between modalities, and we also attend to the relationships within each modality. Compared with MCAN, our model uses a bilinear attention network instead of the more customary dot-product attention.
The contributions of this paper are summarized as follows:
• We propose a deep multimodality bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) to construct inter-modality and intra-modality relations between visual and language features. BAN-GA and BAN-SA are the core of the framework, and they can be cascaded in depth. Unlike other models, we use bilinear attention instead of dot-product attention to compute the inter-modality and intra-modality attention. Our experiments show that this yields more refined and richer features.
• We encode the text with the dynamic word vectors of BERT. We then use multi-head self-attention to process the text features and sum them with the features obtained in the previous step before the final classification. This further improves the model's accuracy, indicating that the two components work well together.
• We visualize the attention of the model and the experimental results, which helps in understanding the interaction between multimodal features. Extensive ablation experiments show that each module in the model contributes to its performance.

Attention
Attention mechanisms [5][6][7][8] focus on the main areas of images and questions and ignore irrelevant information. Various attention mechanisms have brought significant progress to VQA and have become a standard component of VQA models. Our model is also inspired by the attention mechanism. Early attention methods use the question to find the related areas in the image.

High-Level Attributes and Knowledge
Refs. [15][16][17][18] address visual question answering with the help of information from an external knowledge base. Answering some questions requires more than understanding the visual content of the image. For example, to answer "how many mammals are there in the picture?", the model first needs to know whether the animals in the picture are mammals. This kind of question can only be answered with the help of external knowledge. Some studies combine VQA with a knowledge base, and some datasets specifically target this kind of method, such as KB-VQA and FVQA. Answering complex questions therefore requires acquiring knowledge from outside the image.

VQA Pre-Training
Most VQA methods use two separately pre-trained models: a visual model trained on ImageNet [19] and VG [20], and word embeddings for language features. Because features trained separately may not be optimal for joint visual and language understanding, the development of joint pre-training models [21][22][23][24] for visual and language tasks has recently become a hot topic.

Feature Fusion
In the early stage, multimodal fusion [3,4,25] was performed by concatenation or element-wise multiplication, and bilinear fusion used bilinear pooling to fuse multimodal features and capture high-level interactions. However, these methods are computationally expensive. Many approximate fusion methods, including MCB [2], MLB [26], and MUTAN [4], were therefore proposed, and they have shown better performance with fewer parameters.

Deep Modular Bilinear Attention Network
In this section, we elaborate on the proposed model for the VQA task. The overview of the proposed model is illustrated in Figure 2. Each layer of our model is composed of two basic units, BAN-GA and BAN-SA. BAN-SA represents the bilinear self-attention network, and BAN-GA represents the bilinear guided attention network. We will describe the composition of these two basic units in detail below.

Question and Image Encoding
A VQA question is a sequence of words, which we encode with BERT into q. Each question is limited to a maximum of 14 words: extra words are discarded (only the first 14 words are kept), and questions with fewer than 14 words are padded with zero vectors.
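The truncation and zero-padding step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `pad_or_truncate` and the toy 4-dimensional vectors are assumptions made for the example, while the limit of 14 words and the 768-dimensional size come from the text.

```python
def pad_or_truncate(word_vectors, max_len=14, dim=768):
    """Return exactly max_len vectors: keep the first max_len, zero-pad the rest."""
    kept = [list(v) for v in word_vectors[:max_len]]
    # range() of a non-positive number is empty, so long questions get no padding.
    padding = [[0.0] * dim for _ in range(max_len - len(kept))]
    return kept + padding

# Toy usage with 4-dimensional vectors instead of 768 to keep the example small.
short_question = [[1.0, 2.0, 3.0, 4.0]] * 5
fixed = pad_or_truncate(short_question, max_len=14, dim=4)
```

Here a five-word question is extended with nine zero vectors, while a question longer than 14 words would simply lose its tail.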
BERT [14] is a language representation model whose main advantage is its use of a bidirectional Transformer. It is pre-trained with multi-task learning on two objectives: predicting target words and predicting the next sentence. Other language representation models include word2vec [28], GloVe [29], ELMo [30], and GPT-2 [31]. Word2vec is a static method: although it is broadly applicable, it cannot be dynamically optimized for specific tasks. GloVe uses a co-occurrence matrix and considers local and global information at the same time.
We embed these words into 768-dimensional feature vectors using a pre-trained BERT model.
We first look up each word in the BERT vocabulary and convert it to its index, where $I_t$ denotes the vocabulary index of the word at position $t$ in the question.
The indices are then passed through BERT to obtain $Q \in \mathbb{R}^{d_q \times N}$, the sequence of question representations, where $d_q = 768$ is the output dimension of BERT. During training, the BERT parameters are fine-tuned using the question-answering loss.
Because Faster R-CNN [27] performs well on a variety of object recognition tasks, we use it for image feature extraction.
Following common practice, we use Faster R-CNN with the ResNet model to detect M objects in an image. We denote the object-level image features as $V \in \mathbb{R}^{d_o \times M}$. During training, we fine-tune the last layer of the Faster R-CNN detector and normalize the resulting features. The features are the output of RCNN(·), which denotes image feature extraction by the Faster R-CNN model.

Multi-Glimpse Bilinear Guided-Attention Network
Here, we introduce a bilinear attention network to capture the relationship between each word of the question and each region feature of the image. On the one hand, the bilinear model reduces the input dimension and the amount of computation. On the other hand, it yields a more detailed co-attention. Figure 3 presents the multi-glimpse extension. We use the bilinear method to compute a bilinear attention map $G_{GA}$ between the image and the question. We then use $G_{GA}$ to integrate the image region features $V$ and the question embedding $Q$ into a joint embedding, which is computed element by element (the k-th joint embedding). For convenience, we refer to this whole computation as the bilinear attention network. We compute multiple bilinear attention maps and integrate them with residual connections; compared with summation and concatenation, the residual form gives better results.
The outputs $H_i$ of successive glimpses have the same size, with $\max(i) = g$, where $g$ is the number of glimpses. We use $H$ to denote the output of the last glimpse.
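To make the idea of a bilinear attention map and the k-th joint embedding more concrete, the following is a minimal pure-Python sketch in the style of the low-rank bilinear attention of BAN [10]. It is an illustration under stated assumptions, not the exact parameterization of this paper: the function names, the argument names `U`, `V`, `p`, `U2`, `V2`, the softmax over the whole word-region map, and the toy sizes are all assumptions, and the residual multi-glimpse stacking is omitted.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def bilinear_attention_map(words, regions, U, V, p):
    """words: N x d_q, regions: M x d_o, U: d_q x K, V: d_o x K, p: length K.
    Returns an N x M attention map whose entries sum to 1."""
    w_proj = matmul(words, U)      # N x K
    r_proj = matmul(regions, V)    # M x K
    logits = [[sum(pk * a * b for pk, a, b in zip(p, w, r)) for r in r_proj]
              for w in w_proj]
    peak = max(max(row) for row in logits)  # subtract the max for numerical stability
    exps = [[math.exp(x - peak) for x in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[x / total for x in row] for row in exps]

def joint_embedding(words, regions, U2, V2, attn):
    """k-th output: sum over i, j of attn[i][j] * (words U2)[i][k] * (regions V2)[j][k]."""
    w_proj = matmul(words, U2)
    r_proj = matmul(regions, V2)
    return [sum(attn[i][j] * w_proj[i][k] * r_proj[j][k]
                for i in range(len(w_proj)) for j in range(len(r_proj)))
            for k in range(len(w_proj[0]))]

# Toy sizes (assumed): N=2 words with d_q=2, M=3 regions with d_o=2, rank K=2.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
attn = bilinear_attention_map(words, regions, identity, identity, [1.0, 1.0])
joint = joint_embedding(words, regions, identity, identity, attn)
```

Every word-region pair receives its own weight in `attn`, which is what distinguishes this form of co-attention from attention that produces a single weight per region.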

Multi-Glimpse Bilinear Self-Attention Network
The structure of the Bilinear Self-Attention Network is similar to that of the Bilinear Guided-Attention Network. An overview of the Multi-glimpse Bilinear Self-Attention Network is shown in Figure 4. Both inputs of the Bilinear Self-Attention Network are H. After obtaining the integrated features H of the question and image, and inspired by the self-attention mechanism, we process the integrated features further to obtain finer features. The computation still follows the BAN method described above; whereas previously the inputs were the two features V and Q, the input is now the single fused feature H. Here $U_{SA} \in \mathbb{R}^{d_q \times K}$, $V_{SA} \in \mathbb{R}^{d_q \times K}$, $P_{GA} \in \mathbb{R}^{K}$, and $H \in \mathbb{R}^{d_K \times N}$ are variables to be learned. In the i-th output, $W^{SA}_i \in \mathbb{R}^{d_q \times K}$ projects the joint embeddings to the same dimension as Q.

Multi-Head Self-Attention
After obtaining the question features from BERT, we process them further with self-attention, illustrated in Figure 5. In self-attention, Q (Query), K (Key), and V (Value) are the same. First, we compute the dot products of the query with all the keys and divide each by $\sqrt{d}$, where $d$ is the dimension of the question feature. We then apply a softmax to obtain the attention weights. Here $Q, K, V \in \mathbb{R}^{n \times d}$ are the query, key, and value matrices, $d$ is the feature dimension, and $n$ is the number of words in the question.
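A single-head version of this computation can be sketched in a few lines. This is a minimal illustration that uses the scaling by the square root of d from the standard Transformer formulation and omits learned projections; the function name `self_attention` and the toy input are assumptions for the example.

```python
import math

def self_attention(x):
    """x: n x d list of word features. Query, key, and value are all x."""
    d = len(x[0])
    output = []
    for query in x:
        # Scaled dot products of this query with every key.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in x]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the values.
        output.append([sum(w * value[c] for w, value in zip(weights, x))
                       for c in range(d)])
    return output

# Toy input: n=3 words, d=2 features each.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(features)
```

Each output row is a convex combination of the input rows, so the result has the same n x d shape as the input.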
To obtain a better feature representation, we use a multi-head mechanism. Each head is an independent scaled dot-product attention operation. Here $W^O \in \mathbb{R}^{h d_h \times d}$ and $W^Q_i, W^K_i, W^V_i \in \mathbb{R}^{d \times d_h}$ are learned projection matrices, $d$ is the feature dimension, $h$ is the number of heads, and we set $d_h = d/h$.
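For reference, the standard multi-head formulation from the Transformer literature, which matches the matrix shapes listed above, can be written as follows. It is given here as the commonly used form and is assumed, not confirmed, to be the one intended in this section.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
\\
\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right)
\\
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^{O}
```

With $d_h = d/h$, concatenating the $h$ heads gives a feature of size $h d_h = d$, so $W^{O}$ maps back to the original dimension. In the standard formulation each head scales by the square root of its own key dimension $d_h$.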
We feed the text features q extracted from BERT into the multi-head attention mechanism and denote the processed result as $Q_{SA} \in \mathbb{R}^{d_q \times N}$.

Feature Fusion and Answer Prediction
After obtaining the image features and the question features, we perform feature fusion. The image feature vector has 2048 dimensions, and the question vector has 768 dimensions. The representations of the question q and the image v are passed through linear layers and then combined with a simple Hadamard product. Here $W_a \in \mathbb{R}^{d_q \times d_q}$ and $W_b \in \mathbb{R}^{d_q \times d_q}$ are the projection parameters, and we use LayerNorm to stabilize training. The resulting vector $h \in \mathbb{R}^{d_y}$ is referred to as the joint embedding of the question and the image features and is then fed to the output classifier.
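The fusion step can be illustrated with a small sketch. Several details here are assumptions made for the example and are not taken from the paper: the bias-free linear layers, applying the layer normalization after the Hadamard product, the function names, and the toy dimensions (a 2-dimensional question vector and a 3-dimensional image vector projected to a common size of 2).

```python
import math

def linear(x, weight):
    """Apply a bias-free linear layer; weight has one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def fuse(q, v, W_a, W_b):
    """Project both inputs to a common size, multiply element-wise, normalize."""
    q_proj = linear(q, W_a)
    v_proj = linear(v, W_b)
    return layer_norm([a * b for a, b in zip(q_proj, v_proj)])

# Toy usage with hypothetical weights.
q = [1.0, 2.0]
v = [3.0, 4.0, 5.0]
W_a = [[1.0, 0.0], [0.0, 1.0]]
W_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
h = fuse(q, v, W_a, W_b)
```

The Hadamard product requires both projections to have the same length, which is why the two inputs are first mapped to a common dimension.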
After obtaining the fused feature s, we pass it to a two-layer MLP for classification, where $W_c \in \mathbb{R}^{d_z \times 2C}$ and $W_d \in \mathbb{R}^{2C \times C}$ are the projection parameters and $d_z$ is set to 3129.

Loss Function
We use the binary cross-entropy (BCE) loss to train our model, where y is the occurrence probability of the ground-truth answer and ŷ is the prediction.
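As an illustration, the standard binary cross-entropy over a set of candidate answers can be computed as in the sketch below. Averaging over the candidates and the clamping constant `eps` are assumptions of this example; the paper does not state whether the loss is summed or averaged.

```python
import math

def bce_loss(targets, predictions, eps=1e-12):
    """Mean binary cross-entropy between soft targets y and predictions y_hat."""
    total = 0.0
    for y, y_hat in zip(targets, predictions):
        y_hat = min(max(y_hat, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))
    return total / len(targets)

# Toy usage: three candidate answers with soft ground-truth scores.
targets = [1.0, 0.3, 0.0]
good = bce_loss(targets, [0.9, 0.3, 0.1])
bad = bce_loss(targets, [0.1, 0.9, 0.9])
```

Because the targets are probabilities instead of hard labels, an answer given by only some annotators still contributes a partial training signal.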

Datasets
The VQA task has many datasets, including COCO-QA, FM-IQA, Visual Genome [20], and VQA v2 [1]. We use the VQA v2 dataset for training and testing.
The dataset is split in advance into train, val, and test sets, which contain 248,349, 121,512, and 244,302 questions, respectively; its 204k images come from the Microsoft COCO dataset. All questions are divided into three types: yes/no, count, and other. Each question has ten free-form answers.
There are at least three questions per image, and 5.4 questions per image on average. Each question has ten ground-truth answers annotated by ten different people, and the people who provide the answers are not the same as those who ask the questions. The accuracy of a predicted answer is computed as $\mathrm{acc}(ans) = \min\left(\frac{\#\text{humans that said } ans}{3},\, 1\right)$, where ans is the answer predicted by the VQA model.
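The accuracy formula can be expressed directly in code. This is a minimal sketch: the function name `vqa_accuracy` is an assumption, and answer-string normalization is omitted, so answers are compared by exact string match.

```python
def vqa_accuracy(predicted, human_answers):
    """min(number of annotators who gave the predicted answer / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3, 1.0)

# Toy usage with ten annotator answers for one question.
annotations = ["two"] * 2 + ["three"] * 8
partial = vqa_accuracy("two", annotations)    # 2 of 10 annotators agree
full = vqa_accuracy("three", annotations)     # 8 of 10 annotators agree
none = vqa_accuracy("four", annotations)      # no annotator agrees
```

A prediction receives full credit as soon as at least three annotators gave the same answer, and partial credit otherwise.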

Experimental Setup
The dimension of the image features is set to $d_o = 2048$, and the dimension of the question representation $d_q$ is set to 768. The question length t is 14. Following the approach in [10], the number of candidate answers $d_z$ is set to 3129, which results from keeping answers that occur at least nine times across unique questions. We set the number of glimpses to 4, the batch size to 128, and the base learning rate to 0.001. After the 18th epoch, the learning rate is reduced to 1/10 of its previous value. In addition, gradient clipping and dropout are used. Adamax [32], a variant of Adam, is used to optimize our model. All experiments are implemented in PyTorch and performed on a workstation with an RTX 3090 GPU.
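One possible reading of this learning-rate schedule is a single step decay, sketched below. The interpretation that the rate drops once after epoch 18 and then stays constant is an assumption, as are the 1-based epoch numbering and the function name `learning_rate`; any warm-up or further decay steps are not described in the text and are not modeled.

```python
def learning_rate(epoch, base_lr=0.001, decay_epoch=18, factor=0.1):
    """Return the learning rate for a 1-based epoch under a single step decay."""
    if epoch <= decay_epoch:
        return base_lr
    return base_lr * factor

# Learning rates for the epochs around the decay point.
schedule = {epoch: learning_rate(epoch) for epoch in (1, 18, 19, 20)}
```

Under this reading, epochs 1 to 18 train at the base rate and all later epochs train at one tenth of it.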

Ablation Analysis
Before the ablation experiments, we compared two image feature extraction methods, one from BUTD [9] and the other from Pythia [33]. Pythia uses detectors based on feature pyramid networks (FPN) from Detectron, with ResNeXt as the backbone and two fully connected layers (fc6 and fc7) for region classification. We ran two models on each of these two sets of image features. The results show that the Pythia image features improve accuracy, but the size of the improvement differs between models; the Pythia features are better suited to BAN. Table 1 shows the results. In later experiments, we use the Pythia image features by default.
In this section, we design ablation experiments on VQA 2.0 to verify the effectiveness of our model. For a fair comparison, we feed exactly the same features to all evaluated models, which are trained on the training set and tested on the validation set. Table 2 shows the effectiveness of the proposed components. In the first row of Table 2, we use only the Bilinear Guided-Attention Networks.
In the second row, we add the BERT model to the configuration of the first row and obtain a 1.6% improvement, which shows that dynamic word vectors improve the model's text representation ability.
In the third row, we use both the Bilinear Guided-Attention Networks and the Bilinear Self-Attention Networks, which improves accuracy by 1.58% and demonstrates the effectiveness of the Bilinear Self-Attention Networks.
In the fourth row, we add the Q-SA unit to the configuration of the first row and obtain a 1.67% improvement, which proves that the dynamic word vector can improve the model's text representation ability.
In the last row, the accuracy of the proposed method is 69.48%. The accuracy and loss curves during training for the ablation experiments are shown in Figures 6 and 7. These results support the validity of our model.
Table 3 shows the validation scores on the VQA 2.0 dataset for different numbers of glimpses of our models; DMBA-NET-L denotes a model with L layers.
Furthermore, we studied the effect of BERT's learning rate on our model. Table 4 shows the results for different BERT learning rates. When BERT's learning rate is set to lr × 0.001, the accuracy increases slightly; as the learning rate increases, the model achieves its best performance at lr × 0.02. This indicates that our model is effective and compatible with BERT.

Qualitative Analysis
In Figure 8, we visualize the attention maps of BAN-SA and BAN-GA in each layer. The figure shows that the first layer of BAN-SA does not yet identify the important part of the question. As the number of layers increases, it becomes clear which words receive large weights in the last layer. In the attention map of BAN-SA, the words 'how', 'many', and 'zebras' receive large attention weights, which indicates that the model has found the keywords of the question.
For the attention maps of BAN-GA in Figure 9, the first three layers show that the correspondence between the question and the image regions has been found, because this is a counting question. The last layer shows two regions with large weights, while the weights of the other regions are very low, which corresponds exactly to the correct answer, 'Two'. By visualizing other counting questions, we find that the features of BAN-GA in the last layer tend to have large weights in one column.
Of the 100 regions in the image, we select the three with the largest weights for visualization and mark them with boxes; the numbers on the boxes are the corresponding weights of the image regions. The boxes largely frame the two zebras in the image that the question refers to.

Comparison with the State-of-the-Art
In this section, we compare our model with state-of-the-art models on the VQA 2.0 dataset. Table 5 shows the evaluation results on the VQA 2.0 test-dev split; all results are for single models. Among them, BUTD [9] proposed the bottom-up attention method and won the VQA Challenge 2017. Compared with this model, we improve the accuracy by 5.37%. MFB [3] and MFH [25] are based on the bilinear pooling method, and our model outperforms them. In addition, the Counter model focuses on the counting questions of VQA, and our model is 2.6% higher than Counter. The MuRel [35] model is a multimodal relational network that is learned end-to-end to reason over images. Our model improves on the overall accuracy of MuRel by 2.66% on the test-dev set. The MRA-Net [36] model explores both textual and visual relations to improve performance and interpretability. Our model is 1.67% higher than MRA-Net. These results suggest that our model has a certain reasoning ability. MCAN [11] proposes a deep Modular Co-Attention Network that consists of Modular Co-Attention (MCA) layers cascaded in depth. Our model achieves comparable performance without using the Visual Genome dataset. Unlike DFAF and CMCN, we use bilinear attention instead of dot-product attention to compute the inter-modality and intra-modality attention. The experimental comparison indicates that our model is more effective.
To further examine the effect of the image attention, we randomly picked examples from different question types and visualized the attention in Figure 10. Most of the three boxed regions with the highest weights are related to the question, and the image attention focuses on the keywords of the question, which supports the effectiveness of our model. Among the incorrect examples, the first wrong prediction shows that our model is not good at recognizing uncommon objects, indicating that the training samples are insufficient and do not cover rare things. The second wrong prediction shows that the model is not good at recognizing text in the image (e.g., the name of the girl in the fourth example), which suggests a direction for improving the accuracy of the model. In the future, we can consider adding an OCR function to improve the model's ability to recognize text. These weaknesses help guide further improvements for VQA.

Conclusions
The VQA task is a significant challenge in the field of computer vision, and it has very wide application prospects. In this paper, we proposed a framework that obtains more refined visual and text representations, and we designed two basic attention units (BAN-GA and BAN-SA), which can be cascaded in depth, to explore the inter-modality and intra-modality relations. In addition, we encoded the question with the dynamic word vectors of BERT. We used multi-head self-attention to process the question features and summed them with the features obtained by BAN-GA and BAN-SA, which further improved the model's accuracy.
Based on the incorrect examples in Figure 10, we intend to focus future research on recognizing text in images.
Author Contributions: F.Y. and W.S. designed the concept of the research; F.Y. and W.S. implemented experimental design; F.Y. conducted data analysis; F.Y. wrote the draft paper; W.S. and Y.L. reviewed and edited the whole paper. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement: Publicly available datasets were analyzed in this study. Available online: https://visualqa.org/download.html.

Conflicts of Interest:
The authors declare no conflict of interest.