VisdaNet: Visual Distillation and Attention Network for Multimodal Sentiment Classification
Abstract
1. Introduction
2. Related Work
2.1. Sentiment Analysis
2.2. Multimodal Sentiment Analysis
2.3. Knowledge Distillation
3. Visual Distillation and Attention Network
3.1. Knowledge Distillation/Augmentation
3.1.1. Knowledge Distillation
Algorithm 1: Knowledge distillation/augmentation process.
3.1.2. Knowledge Augmentation
3.2. Word Encoder with Word Attention
3.3. Sentence Encoder with Visual Aspect Attention Based on CLIP
3.4. Sentiment Classification
4. Experiment and Analysis
4.1. Dataset and Experimental Settings
4.2. Comparison Experiment
- TextCNN [18]: Kim et al. [18] use convolutional neural networks to extract text features, capturing key information in the text for predicting sentiment polarity. TextCNN_CLIP additionally uses the CLIP [5] model to extract an image feature representation and concatenates it with the text representation for classification (see the fusion sketch after this list).
- FastText [54]: Bojanowski et al. [54] proposed enriching word representations with sub-word information. Its network architecture is simple, yet it performs well on text classification. It is used here to create word embedding representations for comparison with BERT [23].
- BiGRU [55]: Tang et al. [55] employ gating mechanisms to address long-distance dependence in sequence modeling, yielding higher-quality text representations. BiGRU_CLIP likewise uses the CLIP [5] model to extract image features and combines them with the text representation for classification.
- HAN [2]: Yang et al. [2] proposed a hierarchical attention network. It weighs the importance of different words within each sentence and of different sentences within the document before producing a document-level text representation. HAN_CLIP additionally uses the CLIP [5] model to combine the text representation with the image feature representation for classification.
- VisdaNet (Ours): The model proposed in this paper makes full use of multimodal information for knowledge augmentation of short texts as well as knowledge distillation of long texts, which simultaneously alleviates the feature sparsity and information scarcity of short-text representations and filters task-irrelevant noise from long texts.
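The *_CLIP baselines above all share one fusion pattern: a frozen CLIP image encoder produces a 512-dimensional image embedding, which is concatenated with the text representation before the final classifier. The sketch below is a minimal illustration of that pattern, assuming PyTorch and OpenAI's `clip` package; `ConcatFusionClassifier`, `text_encoder`, and `encode_review_images` are hypothetical names used for illustration, not the authors' code.

```python
# Illustrative sketch (not the authors' implementation): how a *_CLIP baseline
# can concatenate a CLIP image embedding with a text representation.
import torch
import torch.nn as nn
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)  # 512-d image features

class ConcatFusionClassifier(nn.Module):
    def __init__(self, text_encoder, text_dim, img_dim=512, num_classes=5):
        super().__init__()
        self.text_encoder = text_encoder            # placeholder: TextCNN / BiGRU / HAN
        self.classifier = nn.Linear(text_dim + img_dim, num_classes)

    def forward(self, token_ids, image_feats):
        text_repr = self.text_encoder(token_ids)    # (batch, text_dim)
        fused = torch.cat([text_repr, image_feats], dim=-1)
        return self.classifier(fused)

@torch.no_grad()
def encode_review_images(image_paths):
    """Average the CLIP embeddings of a review's images into one 512-d vector."""
    imgs = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    feats = clip_model.encode_image(imgs).float()   # (num_images, 512)
    return feats.mean(dim=0, keepdim=True)          # (1, 512)
```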
4.3. Ablation Experiment
- VisdaNet(Full Model): The complete visual distillation and attention mechanism model proposed in this paper.
- -KnDist: The model removes the CLIP-based knowledge distillation module.
- -KnAug: The model removes the knowledge augmentation module.
- -ViAspAttn: The model removes the visual aspect attention based on CLIP (a hedged sketch of this attention step follows this list).
- -WordAttn: The model removes the word attention layer; what remains is a hierarchical structure over the reviews.
- -HiStruct: The model removes the hierarchical structure of the reviews, leaving a base model (BiGRU) that relies only on text.
- -BiGRU+Es: Replaces VisdaNet's text feature extraction module (BiGRU) with ELECTRA-small [56].
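As a point of reference for the -ViAspAttn ablation, the sketch below illustrates the kind of image-guided sentence weighting being removed: additive (Bahdanau-style) attention in which the CLIP image embedding serves as the query over BiGRU sentence vectors, with dimensions matching the hyperparameter settings reported in Section 4.1 (512-d image features, 100-d bidirectional GRU states, 100-d attention). The class name and exact scoring function are illustrative assumptions, not the paper's precise formulation.

```python
import torch
import torch.nn as nn

class VisualAspectAttention(nn.Module):
    """Hedged sketch: a CLIP image embedding weighs sentence vectors before pooling."""
    def __init__(self, sent_dim=100, img_dim=512, attn_dim=100):
        super().__init__()
        self.proj_sent = nn.Linear(sent_dim, attn_dim)
        self.proj_img = nn.Linear(img_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, sent_vecs, img_vec):
        # sent_vecs: (batch, M, sent_dim) BiGRU sentence representations
        # img_vec:   (batch, img_dim)     CLIP image embedding
        query = self.proj_img(img_vec).unsqueeze(1)              # (batch, 1, attn_dim)
        energy = torch.tanh(self.proj_sent(sent_vecs) + query)   # additive attention
        alpha = torch.softmax(self.score(energy), dim=1)         # (batch, M, 1)
        return (alpha * sent_vecs).sum(dim=1)                    # (batch, sent_dim)
```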
4.4. Knowledge Distillation Based on CLIP Visualization
4.5. Illustrative Examples
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Baltrusaitis, T.; Ahuja, C.; Morency, L.-P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443. [Google Scholar] [CrossRef] [Green Version]
- Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: San Diego, CA, USA, 2016; pp. 1480–1489. [Google Scholar]
- Truong, Q.-T.; Lauw, H.W. VistaNet: Visual aspect attention network for multimodal sentiment analysis. Proc. AAAI Conf. Artif. Intell. 2019, 33, 305–312. [Google Scholar] [CrossRef] [Green Version]
- Zhu, J.; Zhou, Y.; Zhang, J.; Li, H.; Zong, C.; Li, C. Multimodal summarization with guidance of multimodal reference. Proc. AAAI Conf. Artif. Intell. 2020, 34, 9749–9756. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Du, Y.; Liu, Y.; Peng, Z.; Jin, X. Gated attention fusion network for multimodal sentiment classification. Knowl.-Based Syst. 2022, 240, 108107. [Google Scholar] [CrossRef]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Doha, Qatar, 2014; pp. 1532–1543. [Google Scholar]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; MIT Press: Cambridge, MA, USA, 2019; pp. 13–23. [Google Scholar]
- Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. VL-BERT: Pre-training of generic visual-linguistic representations. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020; pp. 1–14. [Google Scholar]
- Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.-J.; Chang, K.-W. What does BERT with vision look at? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5265–5275. [Google Scholar]
- Yu, J.; Jiang, J.; Xia, R. Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 429–439. [Google Scholar] [CrossRef]
- Gan, C.; Wang, L.; Zhang, Z.; Wang, Z. Sparse Attention Based Separable Dilated Convolutional Neural Network for Targeted Sentiment Analysis. Knowl.-Based Syst. 2020, 188, 104827. [Google Scholar] [CrossRef]
- Chen, C.; Zhuo, R.; Ren, J. Gated Recurrent Neural Network with Sentimental Relations for Sentiment Classification. Inf. Sci. 2019, 502, 268–278. [Google Scholar] [CrossRef]
- Abid, F.; Alam, M.; Yasir, M.; Li, C. Sentiment Analysis through Recurrent Variants Latterly on Convolutional Neural Network of Twitter. Future Gener. Comput. Syst. 2019, 95, 292–308. [Google Scholar] [CrossRef]
- Zhang, Y.; Zhang, Z.; Miao, D.; Wang, J. Three-way enhanced convolutional neural networks for sentence-level sentiment classification. Inf. Sci. 2019, 477, 55–64. [Google Scholar] [CrossRef]
- Kalchbrenner, N.; Grefenstette, E.; Blunsom, P. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA, 22–27 June 2014; ACL: Stroudsburg, PA, USA, 2014; pp. 655–665. [Google Scholar]
- Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics: Doha, Qatar, 2014; pp. 1746–1751. [Google Scholar]
- Lai, S.; Xu, L.; Liu, K.; Zhao, J. Recurrent convolutional neural networks for text classification. Proc. AAAI Conf. Artif. Intell. 2015, 29, 2267–2273. [Google Scholar] [CrossRef]
- Basiri, M.E.; Nemati, S.; Abdar, M.; Cambria, E.; Acharya, U.R. ABCDM: An attention-based bidirectional CNN-RNN deep model for sentiment analysis. Future Gener. Comput. Syst. 2021, 115, 279–294. [Google Scholar] [CrossRef]
- Akhtar, M.S.; Ekbal, A.; Cambria, E. How intense are you? Predicting intensities of emotions and sentiments using stacked ensemble [Application Notes]. IEEE Comput. Intell. Mag. 2020, 15, 64–75. [Google Scholar] [CrossRef]
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers); Association for Computational Linguistics: New Orleans, LA, USA, 2018; pp. 2227–2237. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
- Valdivia, A.; Luzón, M.V.; Cambria, E.; Herrera, F. Consensus vote models for detecting and filtering neutrality in sentiment analysis. Inf. Fusion 2018, 44, 126–135. [Google Scholar] [CrossRef]
- Wang, Z.; Ho, S.-B.; Cambria, E. Multi-level fine-scaled sentiment sensing with ambivalence handling. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2020, 28, 683–697. [Google Scholar] [CrossRef]
- Jiao, W.; Lyu, M.; King, I. Real-time emotion recognition via attention gated hierarchical memory network. Proc. AAAI Conf. Artif. Intell. 2020, 34, 8002–8009. [Google Scholar] [CrossRef]
- Ghosal, D.; Majumder, N.; Gelbukh, A.; Mihalcea, R.; Poria, S. COSMIC: Common-sense knowledge for emotion identification in conversations. In Proceedings of the Findings of the Association for Computational Linguistics, EMNLP 2020, Online Event, 16–20 November 2020; pp. 2470–2481. [Google Scholar]
- Li, W.; Shao, W.; Ji, S.; Cambria, E. BiERU: Bidirectional emotional recurrent unit for conversational sentiment analysis. Neurocomputing 2022, 467, 73–82. [Google Scholar] [CrossRef]
- Borth, D.; Ji, R.; Chen, T.; Breuel, T.; Chang, S.-F. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In Proceedings of the 21st ACM International Conference on Multimedia, Barcelona, Spain, 21 October 2013; pp. 223–232. [Google Scholar]
- Yu, Y.; Lin, H.; Meng, J.; Zhao, Z. Visual and textual sentiment analysis of a microblog using deep convolutional neural networks. Algorithms 2016, 9, 41. [Google Scholar] [CrossRef] [Green Version]
- Xu, N.; Mao, W. MultiSentiNet: A deep semantic network for multimodal sentiment analysis. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6 November 2017; pp. 2399–2402. [Google Scholar]
- Xu, N.; Mao, W.; Chen, G. A co-memory network for multimodal sentiment analysis. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 27 June 2018; pp. 929–932. [Google Scholar]
- Cai, Y.; Cai, H.; Wan, X. Multi-modal sarcasm detection in twitter with hierarchical fusion model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Florence, Italy, 2019; pp. 2506–2515. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Zhao, M.; Jha, A.; Liu, Q.; Millis, B.A.; Mahadevan-Jansen, A.; Lu, L.; Landman, B.A.; Tyska, M.J.; Huo, Y. Faster mean-shift: GPU-accelerated clustering for cosine embedding-based cell segmentation and tracking. Med. Image Anal. 2021, 71, 102048. [Google Scholar] [CrossRef]
- Yao, T.; Qu, C.; Liu, Q.; Deng, R.; Tian, Y.; Xu, J.; Jha, A.; Bao, S.; Zhao, M.; Fogo, A.B.; et al. Compound figure separation of biomedical images with side loss. In Deep Generative Models, and Data Augmentation, Labelling, and Imperfections; Engelhardt, S., Oksuz, I., Zhu, D., Yuan, Y., Mukhopadhyay, A., Heller, N., Huang, S.X., Nguyen, H., Sznitman, R., Xue, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 173–183. [Google Scholar]
- Jin, B.; Cruz, L.; Goncalves, N. Pseudo RGB-D face recognition. IEEE Sens. J. 2022, 22, 21780–21794. [Google Scholar] [CrossRef]
- Zheng, Q.; Yang, M.; Yang, J.; Zhang, Q.; Zhang, X. Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process. IEEE Access 2018, 6, 15844–15869. [Google Scholar] [CrossRef]
- Wu, Y.; Guo, H.; Chakraborty, C.; Khosravi, M.; Berretti, S.; Wan, S. Edge computing driven low-light image dynamic enhancement for object detection. IEEE Trans. Netw. Sci. Eng. 2022. [Google Scholar] [CrossRef]
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
- Zagoruyko, S.; Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv 2017, arXiv:1612.03928. [Google Scholar]
- Furlanello, T.; Lipton, Z.; Tschannen, M.; Itti, L.; Anandkumar, A. Born again neural networks. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1607–1616. [Google Scholar]
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. arXiv 2020, arXiv:2012.12877. [Google Scholar]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
- Zhang, Y.; Xiang, T.; Hospedales, T.M.; Lu, H. Deep mutual learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Salt Lake City, UT, USA, 2018; pp. 4320–4328. [Google Scholar]
- Anil, R.; Pereyra, G.; Passos, A.; Ormandi, R.; Dahl, G.E.; Hinton, G.E. Large scale distributed neural network training through online distillation. arXiv 2018, arXiv:1804.03235. [Google Scholar]
- Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics: Doha, Qatar, 2014; pp. 1724–1734. [Google Scholar]
- Kim, J.-H.; On, K.-W. Hadamard product for low-rank bilinear pooling. arXiv 2016, arXiv:1610.04325. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
- Loper, E.; Bird, S. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics; Association for Computational Linguistics: Philadelphia, PA, USA, 2002; pp. 63–70. [Google Scholar]
- Tieleman, T.; Hinton, G. Lecture 6.5-RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn. 2012, 4, 26–31. [Google Scholar]
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. [Google Scholar] [CrossRef]
- Tang, D.; Qin, B.; Liu, T. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; ACL: Stroudsburg, PA, USA, 2015; pp. 1422–1432. [Google Scholar]
- Clark, K.; Luong, M.-T.; Le, Q.V. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
Datasets | City | #Reviews | Avg. #s | Max. #s | Avg. #w | Min. #w | Max. #w | #Images | Min. #Images
---|---|---|---|---|---|---|---|---|---
Train | - | 35,435 | 14.8 | 104 | 225 | 10 | 1134 | 196,280 | 3
Valid | - | 2215 | 14.8 | 104 | 226 | 12 | 1145 | 11,851 | 3
Test | BO | 315 | 13.4 | 85 | 211 | 14 | 1099 | 1654 | 3
Test | CH | 325 | 13.5 | 96 | 208 | 15 | 1095 | 1820 | 3
Test | LA | 3730 | 14.4 | 104 | 223 | 12 | 1103 | 20,254 | 3
Test | NY | 1715 | 13.4 | 95 | 219 | 14 | 1080 | 9467 | 3
Test | SF | 570 | 14.8 | 98 | 244 | 10 | 1116 | 3243 | 3
Total | - | 44,305 | 14.8 | 104 | 237.3 | 10 | 1145 | 244,569 | 3
Hyperparameters | Settings |
---|---|
optimizer | RMSprop |
learning rate | 0.001 |
batch size | 32 |
dropout rate | 0.5 |
image representation dimension | 512 |
sentence representation dimension | 512 |
word representation dimension | 512 |
GRU representation dimension | 50 |
bidirectional GRU representation dimension | 100 |
attention dimensions | 100 |
M (number of sentences in each review) | 30 |
T (number of words in each sentence) | 30 |
H (number of images per review) | 3 |
number of classes of prediction | 5 |
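For concreteness, the following is a minimal sketch of how the settings above could be wired up in PyTorch. The stand-in model and dummy data exist only so the training step runs end to end; they are placeholders, not the VisdaNet implementation.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters taken from the table above.
CONFIG = dict(
    lr=1e-3, batch_size=32, dropout=0.5,
    img_dim=512, sent_dim=512, word_dim=512,
    gru_dim=50, bigru_dim=100, attn_dim=100,
    max_sents=30, max_words=30, max_images=3, num_classes=5,
)

# Stand-in model over a fused (sentence + image) feature vector; in practice
# this would be the full VisdaNet architecture.
model = nn.Sequential(
    nn.Linear(CONFIG["sent_dim"] + CONFIG["img_dim"], CONFIG["attn_dim"]),
    nn.ReLU(),
    nn.Dropout(CONFIG["dropout"]),
    nn.Linear(CONFIG["attn_dim"], CONFIG["num_classes"]),
)
optimizer = torch.optim.RMSprop(model.parameters(), lr=CONFIG["lr"])
criterion = nn.CrossEntropyLoss()

# Dummy data with the fused feature size, just to exercise the training loop.
features = torch.randn(128, CONFIG["sent_dim"] + CONFIG["img_dim"])
labels = torch.randint(0, CONFIG["num_classes"], (128,))
loader = DataLoader(TensorDataset(features, labels),
                    batch_size=CONFIG["batch_size"], shuffle=True)

for x, y in loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```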
M | Boston | Chicago | Los Angeles | New York | San Francisco | Mean | Time Cost (s) * |
---|---|---|---|---|---|---|---|
20 | 64.13 | 66.15 | 60.80 | 61.81 | 58.77 | 61.31 | 519 |
25 | 64.44 | 65.23 | 61.26 | 60.41 | 60.88 | 61.35 | 627 |
30 | 62.86 | 62.77 | 62.57 | 62.10 | 60.70 | 62.32 | 720 |
35 | 65.48 | 67.38 | 62.25 | 61.92 | 60.88 | 62.45 | 1016 |
40 | 58.10 | 63.88 | 60.64 | 60.23 | 58.87 | 60.32 | 1214 |
H | Mean Accuracy |
---|---|
1 | 61.57 |
2 | 62.04 |
3 | 62.32 |
Model | Textual Features | Visual Features | Hierarchical Structure | Visual Aspect Attention | Knowledge Distillation/Augmentation |
---|---|---|---|---|---|
TextCNN | ✓ | - | - | - | - |
TextCNN_CLIP | ✓ | ✓ | - | - | - |
FastText | ✓ | - | - | - | - |
BiGRU | ✓ | - | - | - | - |
BiGRU_CLIP | ✓ | ✓ | - | - | - |
HAN | ✓ | - | ✓ | - | - |
HAN_CLIP | ✓ | ✓ | ✓ | - | - |
BERT | ✓ | - | - | - | - |
VistaNet | ✓ | ✓ | ✓ | ✓ | - |
GAFN | ✓ | ✓ | - | ✓ | - |
VisdaNet(Ours) | ✓ | ✓ | ✓ | ✓ | ✓ |
Model | Boston | Chicago | Los Angeles | New York | San Francisco | Mean |
---|---|---|---|---|---|---|
TextCNN | 54.32 | 54.80 | 54.03 | 53.58 | 53.04 | 53.88 |
TextCNN_CLIP | 55.61 | 55.45 | 54.36 | 54.16 | 53.47 | 54.34 |
FastText | 61.27 | 59.38 | 55.49 | 56.15 | 55.44 | 56.12 |
BiGRU | 54.94 | 56.02 | 56.45 | 58.27 | 52.80 | 56.52 |
BiGRU_CLIP | 58.69 | 57.24 | 56.60 | 57.02 | 55.48 | 56.74 |
HAN | 61.60 | 58.53 | 57.61 | 57.14 | 53.02 | 57.33 |
HAN_CLIP | 62.22 | 62.15 | 58.45 | 59.77 | 58.95 | 59.19 |
BERT | 60.13 | 60.71 | 59.17 | 58.89 | 60.24 | 59.31 |
VistaNet | 63.17 | 63.08 | 59.95 | 58.72 | 59.65 | 59.91 |
GAFN | 61.60 * | 66.20 * | 59.00 * | 61.00 * | 60.70 * | 60.10 * |
VisdaNet (Ours) | 62.86 | 62.77 | 62.57 | 62.10 | 60.70 | 62.32 |
Model | Boston | Chicago | Los Angeles | New York | San Francisco | Mean |
---|---|---|---|---|---|---|
VisdaNet (Full Model) | 62.86 | 62.77 | 62.57 | 62.10 | 60.70 | 62.32 |
-KnDist | 62.54 | 64.62 | 61.96 | 61.92 | 60.53 | 61.98 |
-KnAug | 61.90 | 61.54 | 60.99 | 59.53 | 59.30 | 60.54 |
-ViAspAttn | 62.38 | 63.47 | 59.65 | 58.85 | 57.34 | 59.56 |
-WordAttn | 59.39 | 63.39 | 58.08 | 58.58 | 58.18 | 58.54 |
-HiStruct | 56.70 | 59.01 | 55.74 | 55.59 | 54.84 | 55.83 |
-BiGRU+Es | 53.41 | 55.70 | 54.89 | 55.78 | 54.13 | 55.02 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).