Path-Wise Attention Memory Network for Visual Question Answering
Abstract
1. Introduction
2. Related Work
2.1. Visual Question Answering
2.2. Attention Mechanisms
2.3. Graph Attention Network
3. The Proposed Approach
3.1. Problem Definition
3.2. Build Fine-Grained Feature Vectors
3.3. Attention Blocks with Memory
3.3.1. Standard Attention Block
3.3.2. Co-Attention with Guard Gates
3.3.3. Self-Attention with Conditioning Gates
3.3.4. Path-Wise Attention Block
3.3.5. Node Impact Factor
3.4. Answer Prediction
4. Experiment
4.1. Datasets
4.2. Implementation Details
4.3. Baselines
- Teney et al., 2018 [57] adopt an early joint embedding of the question and image together with a single-layer question-guided image attention mechanism (a minimal sketch of this style of question-guided attention is given after this list).
- DCN [25] designs a hierarchical bi-directional interaction network, stacked from several dense co-attention layers, to fuse visual and language features.
- DRAU [58] combines convolutional attention and recurrent attention to improve bi-modal fusion and reasoning.
- BLOCK [26] is a multi-modal fusion method based on tensor decomposition that captures fine-grained inter-modal interactions while preserving strong single-modal representation ability.
- Zhang et al., 2020 [59] design a question-guided top-down visual attention block and a question-guided convolutional relational reasoning block.
- MuRel [28] adopts a mechanism similar to graph attention to learn inter-region relations in the image and updates the regions' hidden representations under the guidance of the question and the image's geometric information.
- ReGAT-implicit [27] models implicit relations between visual regions via a graph attention network.
- AReg [24] prevents the VQA model from capturing language bias by introducing a question-only adversary that encourages visual grounding.
- Grand et al., 2019 [61] investigate how to alleviate language bias by introducing adversarial regularization.
- CSS-UpDn [30] forces the model to reason correctly by adding synthesized counterfactual samples to the training data, which yields better performance on VQA-CP.
- Whitehead et al., 2021 [62] utilize both labeled and unlabeled image–question pairs and separate skills from concepts to improve the model's generalization.
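Most of the attention-based baselines above share the same core operation: each image region is scored against a question representation, the scores are normalized, and the regions are pooled into a question-conditioned visual vector. The sketch below (referenced in the first item of the list) is a minimal PyTorch illustration of that question-guided attention pattern; it is not the implementation of any particular baseline or of our PAM model, and the module name, hidden size, and feature shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    """Minimal question-guided attention over image regions (illustrative sketch)."""
    def __init__(self, region_dim: int, question_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.proj_r = nn.Linear(region_dim, hidden_dim)    # project region features
        self.proj_q = nn.Linear(question_dim, hidden_dim)  # project the question vector
        self.score = nn.Linear(hidden_dim, 1)              # one scalar score per region

    def forward(self, region_feats, question_vec):
        # region_feats: (batch, m, region_dim); question_vec: (batch, question_dim)
        joint = torch.tanh(self.proj_r(region_feats) +
                           self.proj_q(question_vec).unsqueeze(1))  # (batch, m, hidden_dim)
        alpha = F.softmax(self.score(joint).squeeze(-1), dim=-1)    # (batch, m) region weights
        attended = torch.bmm(alpha.unsqueeze(1), region_feats)      # (batch, 1, region_dim)
        return attended.squeeze(1), alpha                           # pooled vector and weights

# Example with a common bottom-up feature setup (36 regions of dim 2048, 1024-d question vector):
attn = QuestionGuidedAttention(region_dim=2048, question_dim=1024)
r = torch.randn(4, 36, 2048)   # a batch of 4 images, 36 region features each
q = torch.randn(4, 1024)       # the corresponding question vectors
v, weights = attn(r, q)        # v: (4, 2048), weights: (4, 36)
```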
4.4. Results and Analysis
4.5. Ablation Study
4.6. Distribution of Attention and Prediction
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Notations
| Notation | Description |
|---|---|
| r | Final global visual vector |
| w | Final global question vector |
| – | Cumulative coefficient of the node impact factor |
| R | Image region features |
| W | Question word features |
| C | Number of categories |
| D | Hidden dimension |
| – | Number of image regions |
| – | Number of question words |
| H | Number of attention heads |
| – | Dimension of each attention head |
| – | Weights of a linear layer |
| – | Bias of a linear layer |
| Avg | Average pooling layer |
| Linear | Linear layer |
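To make the notation concrete, the sketch below shows one way these symbols map onto tensor shapes in a single multi-head attention step: image region features R (m regions of dimension D), question word features W (n words of dimension D), H heads each of dimension D/H, and average pooling to obtain the global vectors r and w that feed a classifier over C categories. It is a shape-level illustration under assumed values of m, n, and C (the table above does not preserve every symbol) and is not a reproduction of the paper's exact attention blocks.

```python
import torch
import torch.nn as nn

# Assumed dimensions: D and H follow the hyperparameter table; m, n, and C are example values.
D, H = 2048, 8          # hidden dimension, number of attention heads
d_h = D // H            # dimension of each attention head (256 here)
m, n = 36, 14           # number of image regions / question words
C = 3129                # number of answer categories (a common VQA v2 choice; assumption)

R = torch.randn(1, m, D)   # image region features
W = torch.randn(1, n, D)   # question word features

# One multi-head attention step in which region features attend to question words --
# a generic stand-in for a co-attention block, not the paper's exact formulation.
mha = nn.MultiheadAttention(embed_dim=D, num_heads=H, batch_first=True)
R_attended, _ = mha(query=R, key=W, value=W)   # (1, m, D)

# Global vectors via average pooling, then a linear classifier over the C categories.
r = R_attended.mean(dim=1)        # final global visual vector, shape (1, D)
w = W.mean(dim=1)                 # final global question vector, shape (1, D)
logits = nn.Linear(D, C)(r * w)   # element-wise fusion followed by classification, (1, C)
```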
References
- Kim, J.; Koh, J.; Kim, Y.; Choi, J.; Hwang, Y.; Choi, J.W. Robust Deep Multi-Modal Learning Based on Gated Information Fusion Network. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 90–106.
- Dou, Q.; Liu, Q.; Heng, P.A.; Glocker, B. Unpaired multi-modal segmentation via knowledge distillation. IEEE Trans. Med. Imaging 2020, 39, 2415–2425.
- Feng, D.; Haase-Schütz, C.; Rosenbaum, L.; Hertlein, H.; Glaeser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1341–1360.
- Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; Carion, N. MDETR-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1780–1790.
- Yu, H.; Zhang, C.; Li, J.; Zhang, S. Robust sparse weighted classification for crowdsourcing. IEEE Trans. Knowl. Data Eng. 2022, 1–13.
- Mun, J.; Cho, M.; Han, B. Text-guided attention model for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6077–6086.
- Jiang, M.; Huang, Q.; Zhang, L.; Wang, X.; Zhang, P.; Gan, Z.; Diesner, J.; Gao, J. TIGEr: Text-to-image grounding for image caption evaluation. arXiv 2019, arXiv:1909.02050.
- Ding, S.; Qu, S.; Xi, Y.; Wan, S. Stimulus-driven and concept-driven analysis for image caption generation. Neurocomputing 2020, 398, 520–530.
- Rohrbach, M.; Qiu, W.; Titov, I.; Thater, S.; Pinkal, M.; Schiele, B. Translating video content to natural language descriptions. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 433–440.
- Dong, J.; Li, X.; Snoek, C.G. Predicting visual features from text for image and video caption retrieval. IEEE Trans. Multimed. 2018, 20, 3377–3388.
- Ding, S.; Qu, S.; Xi, Y.; Wan, S. A long video caption generation algorithm for big video data retrieval. Future Gener. Comput. Syst. 2019, 93, 583–595.
- Wang, L.; Zhu, L.; Dong, X.; Liu, L.; Sun, J.; Zhang, H. Joint feature selection and graph regularization for modality-dependent cross-modal retrieval. J. Vis. Commun. Image Represent. 2018, 54, 213–222.
- Zhang, C.; Liu, M.; Liu, Z.; Yang, C.; Zhang, L.; Han, J. Spatiotemporal activity modeling under data scarcity: A graph-regularized cross-modal embedding approach. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
- Gao, D.; Jin, L.; Chen, B.; Qiu, M.; Li, P.; Wei, Y.; Hu, Y.; Wang, H. FashionBERT: Text and image matching with adaptive loss for cross-modal retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Xi’an, China, 25–30 July 2020; pp. 2251–2260.
- Xie, D.; Deng, C.; Li, C.; Liu, X.; Tao, D. Multi-task consistency-preserving adversarial hashing for cross-modal retrieval. IEEE Trans. Image Process. 2020, 29, 3626–3637.
- Mithun, N.C.; Sikka, K.; Chiu, H.P.; Samarasekera, S.; Kumar, R. RGB2LIDAR: Towards solving large-scale cross-modal visual localization. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 934–954.
- Zhang, C.; Song, J.; Zhu, X.; Zhu, L.; Zhang, S. HCMSL: Hybrid cross-modal similarity learning for cross-modal retrieval. ACM Trans. Multimed. Comput. Commun. Appl. 2021, 17, 1–22.
- Zhang, C.; Xie, F.; Yu, H.; Zhang, J.; Zhu, L.; Li, Y. PPIS-JOIN: A novel privacy-preserving image similarity join method. Neural Process. Lett. 2021, 54, 2783–2801.
- Zhang, C.; Zhong, Z.; Zhu, L.; Zhang, S.; Cao, D.; Zhang, J. M2guda: Multi-metrics graph-based unsupervised domain adaptation for cross-modal hashing. In Proceedings of the 2021 International Conference on Multimedia Retrieval, Taipei, Taiwan, 21–24 August 2021; pp. 674–681.
- Zhu, L.; Zhang, C.; Song, J.; Liu, L.; Zhang, S.; Li, Y. Multi-graph based hierarchical semantic fusion for cross-modal representation. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6.
- Zhu, L.; Zhang, C.; Song, J.; Zhang, S.; Tian, C.; Zhu, X. Deep Multi-Graph Hierarchical Enhanced Semantic Representation for Cross-Modal Retrieval. IEEE Multimed. 2022.
- Zhu, C.; Zhao, Y.; Huang, S.; Tu, K.; Ma, Y. Structured attentions for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1291–1300.
- Ramakrishnan, S.; Agrawal, A.; Lee, S. Overcoming language priors in visual question answering with adversarial regularization. arXiv 2018, arXiv:1810.03649.
- Nguyen, D.K.; Okatani, T. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6087–6096.
- Ben-Younes, H.; Cadene, R.; Thome, N.; Cord, M. BLOCK: Bilinear superdiagonal fusion for visual question answering and visual relationship detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8102–8109.
- Li, L.; Gan, Z.; Cheng, Y.; Liu, J. Relation-aware graph attention network for visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 10313–10322.
- Cadene, R.; Ben-Younes, H.; Cord, M.; Thome, N. MuRel: Multimodal relational reasoning for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1989–1998.
- Gao, P.; Jiang, Z.; You, H.; Lu, P.; Hoi, S.C.; Wang, X.; Li, H. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6639–6648.
- Chen, L.; Yan, X.; Xiao, J.; Zhang, H.; Pu, S.; Zhuang, Y. Counterfactual samples synthesizing for robust visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10800–10809.
- Teney, D.; Abbasnejad, E.; van den Hengel, A. Unshuffling data for improved generalization in visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1417–1427.
- Zhou, B.; Tian, Y.; Sukhbaatar, S.; Szlam, A.; Fergus, R. Simple baseline for visual question answering. arXiv 2015, arXiv:1512.02167.
- Chen, K.; Wang, J.; Chen, L.C.; Gao, H.; Xu, W.; Nevatia, R. ABC-CNN: An attention based convolutional neural network for visual question answering. arXiv 2015, arXiv:1511.05960.
- Ren, M.; Kiros, R.; Zemel, R. Exploring models and data for image question answering. Adv. Neural Inf. Process. Syst. 2015, 28.
- Shih, K.J.; Singh, S.; Hoiem, D. Where to look: Focus regions for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4613–4621.
- Yu, Z.; Yu, J.; Cui, Y.; Tao, D.; Tian, Q. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6281–6290.
- Lu, J.; Yang, J.; Batra, D.; Parikh, D. Hierarchical question-image co-attention for visual question answering. Adv. Neural Inf. Process. Syst. 2016, 29, 289–297.
- Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
- Santoro, A.; Raposo, D.; Barrett, D.G.; Malinowski, M.; Pascanu, R.; Battaglia, P.; Lillicrap, T. A simple neural network module for relational reasoning. Adv. Neural Inf. Process. Syst. 2017, 30.
- Ghosh, S.; Burachas, G.; Ray, A.; Ziskind, A. Generating natural language explanations for visual question answering using scene graphs and visual attention. arXiv 2019, arXiv:1902.05715.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
- Bapna, A.; Chen, M.X.; Firat, O.; Cao, Y.; Wu, Y. Training deeper neural machine translation models with transparent attention. arXiv 2018, arXiv:1808.07561.
- Zhang, H.; Kyaw, Z.; Chang, S.F.; Chua, T.S. Visual translation embedding network for visual relation detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5532–5540.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Xu, H.; Saenko, K. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 451–466.
- Guo, W.; Zhang, Y.; Yang, J.; Yuan, X. Re-attention for visual question answering. IEEE Trans. Image Process. 2021, 30, 6730–6743.
- Yu, T.; Yu, J.; Yu, Z.; Tao, D. Compositional attention networks with two-stream fusion for video question answering. IEEE Trans. Image Process. 2019, 29, 1204–1218.
- Jiang, J.; Chen, Z.; Lin, H.; Zhao, X.; Gao, Y. Divide and conquer: Question-guided spatio-temporal contextual attention for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11101–11108.
- Kim, N.; Ha, S.J.; Kang, J.W. Video Question Answering Using Language-Guided Deep Compressed-Domain Video Feature. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1708–1717.
- Teney, D.; Liu, L.; van Den Hengel, A. Graph-structured representations for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1–9.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28.
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555.
- Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6904–6913.
- Agrawal, A.; Batra, D.; Parikh, D.; Kembhavi, A. Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4971–4980.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
- Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2425–2433.
- Teney, D.; Anderson, P.; He, X.; Van Den Hengel, A. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4223–4232.
- Osman, A.; Samek, W. DRAU: Dual recurrent attention units for visual question answering. Comput. Vis. Image Underst. 2019, 185, 24–30.
- Zhang, W.; Yu, J.; Hu, H.; Hu, H.; Qin, Z. Multimodal feature fusion by relational reasoning and attention for visual question answering. Inf. Fusion 2020, 55, 116–126.
- Shrestha, R.; Kafle, K.; Kanan, C. Answer them all! Toward universal visual question answering models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10472–10481.
- Grand, G.; Belinkov, Y. Adversarial regularization for visual question answering: Strengths, shortcomings, and side effects. arXiv 2019, arXiv:1906.08430.
- Whitehead, S.; Wu, H.; Ji, H.; Feris, R.; Saenko, K. Separating Skills and Concepts for Novel Visual Question Answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5632–5641.
- Kim, J.H.; Jun, J.; Zhang, B.T. Bilinear attention networks. arXiv 2018, arXiv:1805.07932.
| Hyperparameters | Value |
|---|---|
| Hidden dimension | 2048 |
| Number of attention heads | 8 |
| Learning rate (lr) | 0.001 |
| Decay start epoch of lr | 10 |
| Decay interval of lr | 2 |
| Decay factor of lr | 0.5 |
| Dropout factor | 0.1 |
| Batch size | 64 |
| Epochs | 15 |
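Read together, the learning-rate rows describe a step schedule: lr starts at 0.001 and, from epoch 10 onward, is multiplied by 0.5 every 2 epochs. The snippet below encodes that interpretation; the exact decay boundaries and the zero-based epoch indexing are our assumptions about how "decay start epoch" and "decay interval" are meant.

```python
def learning_rate(epoch: int,
                  base_lr: float = 0.001,
                  decay_start: int = 10,
                  decay_interval: int = 2,
                  decay_factor: float = 0.5) -> float:
    """Step-decay schedule matching the hyperparameter table (one possible reading)."""
    if epoch < decay_start:
        return base_lr
    num_decays = 1 + (epoch - decay_start) // decay_interval
    return base_lr * (decay_factor ** num_decays)

# Over the 15 training epochs this gives:
# epochs 0-9 -> 0.001, epochs 10-11 -> 0.0005, epochs 12-13 -> 0.00025, epoch 14 -> 0.000125
for epoch in range(15):
    print(epoch, learning_rate(epoch))
```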
| Model | VQA 2.0 Test-Dev | VQA 2.0 Test-Std | VQA 2.0 val | VQA-CP Test |
|---|---|---|---|---|
| Teney et al., 2018 [57] | 65.32 | 65.67 | 63.15 | - |
| DCN [25] | 66.60 | 67.00 | - | - |
| DRAU [58] | 66.45 | 66.85 | - | - |
| BLOCK [26] | 66.41 | 67.92 | - | - |
| Zhang et al., 2020 [59] | 67.20 | 67.34 | - | - |
| MuRel [28] | 68.03 | 68.41 | 65.14 | 39.54 |
| RAMEN [60] | 65.96 | - | - | 39.21 |
| ReGAT-implicit [27] | 67.6 | 67.81 | 65.93 | 40.13 |
| Our PAM | 69.01 | 69.24 | 66.20 | 41.60 |
| AReg [24] | - | - | 62.75 | 41.17 |
| Grand et al., 2019 [61] | - | - | 51.92 | 42.33 |
| CSS-UpDn [30] | - | - | 59.21 | 41.16 |
| Teney et al., 2021 [31] | - | - | 61.08 | 42.39 |
| Whitehead et al., 2021 [62] | - | - | 61.08 | 41.71 |
| Model | VQA v2 val |
|---|---|
| PAM | 66.20 |
| -PA | 65.69 |
| -PA+CA | 65.32 |
| -PA-imp | 65.01 |
| -gate in SA | 66.11 |
| -gates in CA/SA/PA | 65.85 |
| -PA-imp-gates in CA/PA | 64.66 |
| -PA-imp-gates in SA | 65.58 |
| -PA-imp-gates in CA/SA/PA | 65.34 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).