Visual Enhancement Capsule Network for Aspect-based Multimodal Sentiment Analysis
Abstract
1. Introduction
- We present the MemCapsNet and AgCapsNet structures, which use adaptive mask memory attention and aspect-guided attention to extract aspect-related sentiment features from the text and the images, respectively, on the pose matrices of the capsules.
- We design a visual enhancement network to model the cross-modal interaction between text and image and the sentiment enhancement that the images provide to the text.
- We propose a novel VECapsNet model that uses the capsule features of text and images for aspect-based multimodal sentiment analysis. The empirical results show that our model performs satisfactorily on both the Multi-ZOL dataset and our MTCom dataset.
2. Related Work
2.1. Single-Modal Sentiment Analysis
2.2. Aspect-Based Sentiment Analysis
2.3. Multimodal Sentiment Analysis
2.4. Multimodal Fusion
3. Methodology
3.1. Problem Formalization
3.2. Visual Enhancement Capsule Network Based on Multimodal Fusion
3.2.1. MemCapsNet
- Mask mechanism. In word embedding, the input sequences of a deep model are generally required to have a uniform length, so the word vector sequences are padded. As shown in Figure 2, [pad] denotes the padding label in a word sequence. Because the attention is calculated over the global vector, invalid attention scores are produced at the [pad] positions in the second row. The proportion of invalid attention grows with the padding length, which is clearly unreasonable for short texts. Therefore, we mask the attention scores at the padding positions according to the actual length of the text, as shown in the third row of Figure 2 (a minimal code sketch of this masking step is given after this list).
- Adaptive scaling. Since we expect the attention to yield contextual features that are more relevant to the aspect phrases, we should not devote too much attention to irrelevant words. To concentrate on the words that are important for identifying sentiment, we apply an adaptive scaling mechanism to the contextual attention according to its correlation with the aspect phrases, as shown in Figure 3 (a hedged sketch of one possible scaling follows the masking sketch below). Specifically, the attention of the jth type of capsule at the lth position of matrix Q is scaled as follows:
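The masking step can be illustrated with a short sketch. The snippet below is a minimal PyTorch example of masking attention scores at [pad] positions before normalization; the function name, tensor shapes, and the use of a softmax are illustrative assumptions rather than the exact formulation used in MemCapsNet.

```python
import torch
import torch.nn.functional as F

def masked_attention(scores: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Mask attention scores at [pad] positions before normalization.

    scores:  (batch, seq_len) raw attention scores over the padded sequence.
    lengths: (batch,) actual (unpadded) length of each text.
    """
    batch, seq_len = scores.shape
    # position j is valid only when j < length of that sample
    positions = torch.arange(seq_len, device=scores.device).unsqueeze(0)  # (1, seq_len)
    pad_mask = positions >= lengths.unsqueeze(1)                          # True at [pad] slots
    # setting masked scores to -inf gives them zero weight after softmax
    scores = scores.masked_fill(pad_mask, float("-inf"))
    return F.softmax(scores, dim=-1)

# toy usage: two texts of length 3 and 5, padded to length 6
weights = masked_attention(torch.randn(2, 6), torch.tensor([3, 5]))
assert torch.allclose(weights.sum(dim=-1), torch.ones(2))
```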
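Because the scaling equation itself is not reproduced above, the following sketch only illustrates the general idea: attention weights at positions weakly correlated with the aspect representation are damped and the distribution is then renormalized. The cosine-similarity gate and the temperature tau are our assumptions, not the paper's formula.

```python
import torch
import torch.nn.functional as F

def adaptively_scaled_attention(Q: torch.Tensor, aspect: torch.Tensor,
                                attn: torch.Tensor, tau: float = 5.0) -> torch.Tensor:
    """Illustrative re-weighting of contextual attention by aspect relevance.

    Q:      (num_caps, seq_len, dim) per-capsule contextual features.
    aspect: (dim,)                   aspect-phrase representation.
    attn:   (num_caps, seq_len)      masked attention from the previous step.
    """
    # correlation of each position with the aspect phrase (assumed: cosine similarity)
    relevance = F.cosine_similarity(Q, aspect.view(1, 1, -1), dim=-1)  # (num_caps, seq_len)
    gate = torch.sigmoid(tau * relevance)   # shrink positions weakly related to the aspect
    scaled = attn * gate
    # renormalize so each capsule's attention still sums to one
    return scaled / scaled.sum(dim=-1, keepdim=True).clamp_min(1e-12)
```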
3.2.2. AgCapsNet
3.2.3. Multimodal Fusion Based on Interactive Learning
3.3. Training and Predicting
Algorithm 1: Multimodal aspect-based sentiment analysis based on a VECapsNet.

Input: Multimodal triplet test dataset. Output: The sentiment label set at the given aspect.

1. Extract the context word vectors, the aspect word vectors, and the image feature mapping set M.
2. Obtain the text hidden vectors from the Bi-LSTMs and concatenate them as the sequence H.
3. Get the pose matrix and the active matrix of H from the source capsule layer.
4. Obtain the attention from the adaptive mask attention layer.
5. Obtain the aspect mapping using Equation (16).
6. Get the pose matrix and the active matrix of each image m in M, and obtain its attention matrix from the Text2vision module according to Equations (16)–(20).
7. Compute the most relevant image index J over all images using Equation (21).
8. Compute the weighted image pose matrix.
9. Repeat:
10. Compute the Text2vision attention.
11. Obtain the global image vector using Equation (22).
12. Compute the weighted matrix.
14. Obtain the new pose and active matrices from the image convolution layer.
15. Obtain the new pose and active matrices from the text convolution layer.
16. Update the loss using Equation (35).
17. Until the accuracy on the validation dataset no longer increases over ten epochs or the default N epochs are reached.
18. Obtain the class capsules and the class active matrices from the last fully connected routing layer.
19. Retrieve the predicted label set from the aggregated distribution of the class capsules and active values.
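The control flow of Algorithm 1 can be summarized in a schematic Python sketch. All sub-modules below are placeholder callables whose names and signatures are illustrative assumptions; the sketch mirrors the order of the steps above for a single example (one pass of the repeat loop shown, loss update omitted) rather than reproducing the actual VECapsNet implementation.

```python
def vecapsnet_step(text, aspect, images, modules):
    """Schematic control flow of Algorithm 1 for one example.

    `modules` is a dict of placeholder callables (embed, bilstm, source_caps, ...);
    their names and signatures are assumptions, not the paper's implementation.
    """
    # Steps 1-2: embeddings and Bi-LSTM context encoding
    ctx_vecs, asp_vecs, feat_maps = modules["embed"](text, aspect, images)
    H = modules["bilstm"](ctx_vecs)

    # Steps 3-5: source capsules, adaptive mask attention, aspect mapping
    pose_t, act_t = modules["source_caps"](H)
    attn_t = modules["mask_attn"](pose_t, asp_vecs)
    asp_map = modules["aspect_map"](pose_t, attn_t)            # Eq. (16)

    # Steps 6-8: image capsules, per-image attention, most relevant image
    img_caps = [modules["img_caps"](m) for m in feat_maps]
    img_attn = [modules["img_attn"](p, a, asp_map) for p, a in img_caps]
    j = modules["select_image"](img_attn)                       # Eq. (21)
    pose_v = modules["weight_pose"](img_caps[j][0], img_attn[j])

    # Steps 9-15 (one iteration shown): interactive text-vision enhancement
    t2v_attn = modules["text2vision"](pose_t, pose_v)
    g_img = modules["global_image_vec"](pose_v, t2v_attn)       # Eq. (22)
    pose_t = modules["weight_pose"](pose_t, t2v_attn)
    pose_v, act_v = modules["img_conv_caps"](pose_v)
    pose_t, act_t = modules["text_conv_caps"](pose_t)

    # Steps 18-19: class capsules and prediction
    cls_t, cls_v = modules["fc_routing"](pose_t, act_t, pose_v, act_v)
    return modules["predict"](cls_t, cls_v)
```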
4. Experiment
4.1. Experimental Data and Preprocessing
4.2. Experimental Setup
4.3. Baselines
4.4. Experimental Results and Analysis
4.4.1. Aspect-Based Text Sentiment Analysis
4.4.2. Aspect-Based Image Sentiment Analysis
4.4.3. Aspect-Based Multimodal Sentiment Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
| Notation | Description |
|---|---|
| | The tth word (aspect word) |
| | The tth image |
| | Document (image set) included in the ith review |
| | The ith instance, composed of the above document and image set |
| | The tth aspect |
| | Aspect set (aspect phrase sequence) |
| | Word vector of the context word |
| | A sentiment polarity label |
| L | The set of sentiment labels |
| | Word vector of the aspect word |
| | Feature vector obtained by the forward LSTM |
| | Feature vector obtained by the backward LSTM |
| | Representation vector obtained by the Bi-LSTM |
| | Representation vector of the aspect obtained by the Bi-LSTM |
| H | Representation matrix obtained by the Bi-LSTM |
| | Size of the ith n-gram convolutional kernel |
| | Feature mapping matrix of the jth image |
| | Capsule pose matrix of a text (the jth image) from the source capsule layer |
| | Capsule active value matrix of a text (the jth image) from the source capsule layer |
| | Attention matrix from the adaptive mask attention layer |
| | Aspect mapping capsule vector |
| | Aspect-guided attention matrix of the jth image |
| | Capsule feature representation of a text (the jth image) from the capsule convolution layer |
| | Active value capsule matrix of a text (the jth image) from the capsule convolution layer |
| | Class capsule matrix of a text (its image set) produced by the fully connected routing layer |
| | Class active capsule matrix of a text (its image set) produced by the fully connected routing layer |
| Dataset | Positive (Train) | Positive (Test) | Neutral (Train) | Neutral (Test) | Negative (Train) | Negative (Test) |
|---|---|---|---|---|---|---|
| Lap14 | 987 | 341 | 460 | 169 | 866 | 218 |
| Rest14 | 2164 | 728 | 633 | 196 | 805 | 196 |
| Rest15 | 955 | 34 | 272 | 340 | 28 | 195 |
| Rest16 | 1297 | 63 | 466 | 474 | 29 | 127 |
| Star Rating | Review Number | Percentage |
|---|---|---|
| 5 | 10,000 | 23.50% |
| 4 | 10,000 | 23.50% |
| 3 | 10,000 | 23.50% |
| 2 | 7428 | 17.45% |
| 1 | 5115 | 12.02% |
| Model | Lap14 Accuracy | Lap14 F1 | Rest14 Accuracy | Rest14 F1 | Rest15 Accuracy | Rest15 F1 | Rest16 Accuracy | Rest16 F1 |
|---|---|---|---|---|---|---|---|---|
| ATAE-LSTM | 68.70 | - | 77.20 | - | - | - | - | - |
| MemNet | 70.64 | 65.17 | 79.61 | 69.64 | 77.31 | 58.28 | 85.44 | 69.55 |
| IACapsNet | 76.80 | 73.29 | 81.79 | 73.40 | 80.41 | 61.67 | 88.71 | 64.48 |
| MemCapsNet (ours) | 77.56 | 74.14 | 83.56 | 74.78 | 80.41 | 61.67 | 89.16 | 71.62 |
| Model | Multi-ZOL Accuracy | Multi-ZOL F1 | MTCom Accuracy | MTCom F1 |
|---|---|---|---|---|
| MemNet-T | 59.51 | 58.73 | 56.08 | 53.03 |
| IACapsNet-T | 58.69 | 57.47 | 53.48 | 50.77 |
| MemCapsNet (ours) | 59.47 | 58.91 | 56.32 | 54.71 |
| VGG-16-I | 44.05 | 43.67 | 33.46 | 31.86 |
| CapsNet-I | 45.10 | 44.29 | 31.41 | 30.13 |
| AgCapsNet (ours) | 46.59 | 44.36 | 41.37 | 40.16 |
| MIMN | 61.59 | 60.51 | 54.35 | 52.94 |
| VECapsNet (ours) | 61.23 | 59.63 | 57.87 | 56.81 |
| Quantile | Multi-ZOL Textual Length | Multi-ZOL Image Number | MTCom Textual Length | MTCom Image Number |
|---|---|---|---|---|
| 91 | 537 | 9 | 105 | 6 |
| 93 | 648 | 9 | 122 | 7 |
| 95 | 803 | 14 | 146 | 7 |
| 97 | 1195 | 21 | 185 | 8 |
| 99 | 1922 | 34 | 285 | 9 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).