Uni2Mul: A Conformer-Based Multimodal Emotion Classification Model by Considering Unimodal Expression Differences with Multi-Task Learning
Abstract
1. Introduction
2. Related Work
2.1. Visual Emotion Classification
2.2. Audio Emotion Classification
2.3. Textual Emotion Classification
2.4. Multimodal Emotion Classification
3. Methods
3.1. Unimodal Neural Networks
3.1.1. Vision
3.1.2. Audio
3.1.3. Text
3.2. Multi-Task Multimodal Fusion Network with Conformer
4. Experiment
4.1. Dataset
4.2. Parameters’ Setting
4.3. Experimental Results
5. Discussion
5.1. Ablation Study
5.1.1. Unimodal Representation
5.1.2. Multimodal Fusion
5.2. Visualization
5.2.1. Visualization of Hidden Representations
5.2.2. Visualization of Attention Weights
5.2.3. Visualization of Confusion Matrix
6. Conclusions
- (1) Unimodal models trained with IUAs learn more differentiated information and achieve better complementarity between modalities than those trained with IMAs.
- (2) The hidden representations of the pre-trained unimodal models serve as effective inputs to the fusion network, ensuring that the differentiated information learned by the unimodal models is passed unchanged to the fusion network.
- (3) The Conformer module, with its multi-head attention mechanism and convolutional kernel, attends to important intra-modal information and captures inter-modal relationships, and it performs best among the four fusion strategies compared above (a code sketch of this fusion design follows this list).
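To make points (2) and (3) concrete, the following is a minimal Keras sketch, not the authors' released implementation: frozen hidden representations exported from the pre-trained unimodal models are stacked as a three-token sequence, passed through a simplified Conformer block (feed-forward, multi-head self-attention, and convolution modules with residual connections), and trained with multi-task heads for the multimodal label and the three unimodal (IUA) labels. The hidden size, class count, loss weights, and the plain Conv1D standing in for the Conformer's gated depthwise convolution are illustrative assumptions.

```python
# Minimal sketch (assumptions noted in the text above), Keras/TensorFlow 2.x.
from tensorflow.keras import layers, Model

D = 256          # assumed size of each pre-trained unimodal hidden representation
NUM_CLASSES = 3  # assumed label set, e.g., negative / neutral / positive

def conformer_block(x, d_model=D, heads=4, ff_dim=512, kernel_size=3, rate=0.1):
    """Simplified Conformer block: FFN -> self-attention -> convolution -> FFN."""
    # First feed-forward module (the original Conformer uses half-step residuals;
    # plain residuals are used here for brevity).
    ff1 = layers.Dense(ff_dim, activation="relu")(layers.LayerNormalization()(x))
    x = layers.Add()([x, layers.Dropout(rate)(layers.Dense(d_model)(ff1))])
    # Multi-head self-attention module: weights important intra-/inter-modal tokens.
    xn = layers.LayerNormalization()(x)
    attn = layers.MultiHeadAttention(num_heads=heads, key_dim=d_model // heads)(xn, xn)
    x = layers.Add()([x, layers.Dropout(rate)(attn)])
    # Convolution module: a single Conv1D stands in for the GLU/depthwise stack.
    conv = layers.Conv1D(d_model, kernel_size, padding="same", activation="relu")(
        layers.LayerNormalization()(x))
    x = layers.Add()([x, layers.Dropout(rate)(conv)])
    # Second feed-forward module and final layer normalization.
    ff2 = layers.Dense(ff_dim, activation="relu")(layers.LayerNormalization()(x))
    x = layers.Add()([x, layers.Dropout(rate)(layers.Dense(d_model)(ff2))])
    return layers.LayerNormalization()(x)

# Hidden representations exported (and kept frozen) from the pre-trained unimodal models.
vision_in = layers.Input((D,), name="vision_repr")
audio_in = layers.Input((D,), name="audio_repr")
text_in = layers.Input((D,), name="text_repr")

# Stack the three modalities as a length-3 token sequence for the Conformer block.
tokens = layers.Concatenate(axis=1)(
    [layers.Reshape((1, D))(t) for t in (vision_in, audio_in, text_in)])
fused = layers.GlobalAveragePooling1D()(conformer_block(tokens))

# Multi-task heads: one multimodal head plus one head per unimodal (IUA) label.
outputs = [layers.Dense(NUM_CLASSES, activation="softmax", name=n)(fused)
           for n in ("multimodal", "vision", "audio", "text")]

model = Model([vision_in, audio_in, text_in], outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              # Auxiliary weights are illustrative, not the paper's values.
              loss_weights={"multimodal": 1.0, "vision": 0.3, "audio": 0.3, "text": 0.3})
model.summary()
```

Under this multi-task setup, the auxiliary unimodal heads only shape training; for evaluation one would typically read off the multimodal head.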
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Abbreviation | Stands for
---|---
Uni2Mul | Unimodal to Multimodal |
MEC | Multimodal Emotion Classification |
IUAs | Independent Unimodal Annotations |
IMAs | Identical Multimodal Annotations |
CNN | Convolutional Neural Network |
RNN | Recurrent Neural Network |
LSTM | Long Short-Term Memory |
SER | Speech Emotion Recognition |
LFCC | Linear Frequency Cepstral Coefficients |
MFCC | Mel-scale Frequency Cepstral Coefficients |
ROC | Receiver Operating Characteristic |
BiLSTM | Bidirectional LSTM |
BN | Batch Normalization |
ReLU | Rectified Linear Unit |
BERT | Bidirectional Encoder Representation from Transformers |
GRU | Gated Recurrent Unit |
VGG | Visual Geometry Group |
SVM | Support Vector Machine |
EEG | Electroencephalogram
GSR | Galvanic Skin Response |
CLIP | Contrastive Language-Image Pre-training |
GLU | Gated Linear Unit |
LN | Layer Normalization |
DP | Dropout Operation |
GAP | Global Average Pooling1D |
References
- Taboada, M.; Brooke, J.; Tofiloski, M.; Voll, K.; Stede, M. Lexicon-Based Methods for Sentiment Analysis. Comput. Linguist. 2011, 37, 267–307. [Google Scholar] [CrossRef]
- Thelwall, M.; Buckley, K.; Paltoglou, G. Sentiment Strength Detection for the Social Web. J. Am. Soc. Inf. Sci. Technol. 2012, 63, 163–173. [Google Scholar] [CrossRef]
- Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu Features? End-to-End Speech Emotion Recognition Using a Deep Convolutional Recurrent Network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5200–5204. [Google Scholar]
- Hoffmann, H.; Kessler, H.; Eppel, T.; Rukavina, S.; Traue, H.C. Expression Intensity, Gender and Facial Emotion Recognition: Women Recognize Only Subtle Facial Emotions Better than Men. Acta Psychol. 2010, 135, 278–283. [Google Scholar] [CrossRef] [PubMed]
- Collignon, O.; Girard, S.; Gosselin, F.; Roy, S.; Saint-Amour, D.; Lassonde, M.; Lepore, F. Audio-Visual Integration of Emotion Expression. Brain Res. 2008, 1242, 126–135. [Google Scholar] [CrossRef]
- Cho, J.; Pappagari, R.; Kulkarni, P.; Villalba, J.; Carmiel, Y.; Dehak, N. Deep Neural Networks for Emotion Recognition Combining Audio and Transcripts. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018. [Google Scholar]
- Pampouchidou, A.; Simantiraki, O.; Fazlollahi, A.; Pediaditis, M.; Manousos, D.; Roniotis, A.; Giannakakis, G.; Meriaudeau, F.; Simos, P.; Marias, K.; et al. Depression Assessment by Fusing High and Low Level Features from Audio, Video, and Text. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, Amsterdam, The Netherlands, 16 October 2016; pp. 27–34. [Google Scholar]
- Dardagan, N.; Brđanin, A.; Džigal, D.; Akagic, A. Multiple Object Trackers in OpenCV: A Benchmark. In Proceedings of the 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), Kyoto, Japan, 20–23 June 2021. [Google Scholar]
- Guo, W.; Wang, J.; Wang, S. Deep Multimodal Representation Learning: A Survey. IEEE Access 2019, 7, 63373–63394. [Google Scholar] [CrossRef]
- Ghaleb, E.; Niehues, J.; Asteriadis, S. Multimodal Attention-Mechanism For Temporal Emotion Recognition. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 251–255. [Google Scholar]
- Deng, J.J.; Leung, C.H.C.; Li, Y. Multimodal Emotion Recognition Using Transfer Learning on Audio and Text Data. In Computational Science and Its Applications—ICCSA 2021; Lecture Notes in Computer Science; Gervasi, O., Murgante, B., Misra, S., Garau, C., Blečić, I., Taniar, D., Apduhan, B.O., Rocha, A.M.A.C., Tarantino, E., Torre, C.M., Eds.; Springer International Publishing: Cham, Switzerland, 2021; Volume 12951, pp. 552–563. ISBN 978-3-030-86969-4. [Google Scholar]
- Li, J.; Wang, S.; Chao, Y.; Liu, X.; Meng, H. Context-Aware Multimodal Fusion for Emotion Recognition. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18 September 2022; pp. 2013–2017. [Google Scholar]
- Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; Yang, K. CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-Grained Annotations of Modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5 July 2020. [Google Scholar]
- Gunes, H.; Piccardi, M. Bi-Modal Emotion Recognition from Expressive Face and Body Gestures. J. Netw. Comput. Appl. 2007, 30, 1334–1345. [Google Scholar] [CrossRef]
- Cimtay, Y.; Ekmekcioglu, E.; Caglar-Ozhan, S. Cross-Subject Multimodal Emotion Recognition Based on Hybrid Fusion. IEEE Access 2020, 8, 168865–168878. [Google Scholar] [CrossRef]
- Huan, R.-H.; Shu, J.; Bao, S.-L.; Liang, R.-H.; Chen, P.; Chi, K.-K. Video Multimodal Emotion Recognition Based on Bi-GRU and Attention Fusion. Multimed. Tools Appl. 2021, 80, 8213–8240. [Google Scholar] [CrossRef]
- Du, Y.; Liu, Y.; Peng, Z.; Jin, X. Gated Attention Fusion Network for Multimodal Sentiment Classification. Knowl.-Based Syst. 2022, 240, 108107. [Google Scholar] [CrossRef]
- Jabid, T. Robust Facial Expression Recognition Based on Local Directional Pattern. ETRI J. 2010, 32, 784–794. [Google Scholar] [CrossRef]
- Zhu, Y.; Li, X.; Wu, G. Face Expression Recognition Based on Equable Principal Component Analysis and Linear Regression Classification. In Proceedings of the 2016 3rd International Conference on Systems and Informatics (ICSAI), Shanghai, China, 19–21 November 2016; pp. 876–880. [Google Scholar]
- Barman, A.; Dutta, P. Facial Expression Recognition Using Distance Signature Feature. In Advanced Computational and Communication Paradigms; Bhattacharyya, S., Chaki, N., Konar, D., Chakraborty, U.K., Singh, C.T., Eds.; Advances in Intelligent Systems and Computing; Springer: Singapore, 2018; Volume 706, pp. 155–163. ISBN 978-981-10-8236-8. [Google Scholar]
- Liu, S.; Tian, Y. Facial Expression Recognition Method Based on Gabor Wavelet Features and Fractional Power Polynomial Kernel PCA. In Advances in Neural Networks—ISNN 2010; Zhang, L., Lu, B.-L., Kwok, J., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6064, pp. 144–151. ISBN 978-3-642-13317-6. [Google Scholar]
- Chao, W.-L.; Ding, J.-J.; Liu, J.-Z. Facial Expression Recognition Based on Improved Local Binary Pattern and Class-Regularized Locality Preserving Projection. Signal Process. 2015, 117, 1–10. [Google Scholar] [CrossRef]
- Sánchez, A.; Ruiz, J.V.; Moreno, A.B.; Montemayor, A.S.; Hernández, J.; Pantrigo, J.J. Differential Optical Flow Applied to Automatic Facial Expression Recognition. Neurocomputing 2011, 74, 1272–1282. [Google Scholar] [CrossRef]
- Saravanan, A.; Perichetla, G.; Gayathri, D.K.S. Facial Emotion Recognition Using Convolutional Neural Networks. SN Appl. Sci. 2019, 2, 446. [Google Scholar]
- Yu, Z.; Zhang, C. Image Based Static Facial Expression Recognition with Multiple Deep Network Learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, Seattle, WA, USA, 9 November 2015; pp. 435–442. [Google Scholar]
- Ebrahimi Kahou, S.; Michalski, V.; Konda, K.; Memisevic, R.; Pal, C. Recurrent Neural Networks for Emotion Recognition in Video. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, Seattle, WA, USA, 9 November 2015; pp. 467–474. [Google Scholar]
- Ding, H.; Zhou, S.K.; Chellappa, R. FaceNet2ExpNet: Regularizing a Deep Face Recognition Net for Expression Recognition. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; pp. 118–126. [Google Scholar]
- Verma, M.; Kobori, H.; Nakashima, Y.; Takemura, N.; Nagahara, H. Facial Expression Recognition with Skip-Connection to Leverage Low-Level Features. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 51–55. [Google Scholar]
- Yang, H.; Ciftci, U.; Yin, L. Facial Expression Recognition by De-Expression Residue Learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2168–2177. [Google Scholar]
- Li, T.-H.S.; Kuo, P.-H.; Tsai, T.-N.; Luan, P.-C. CNN and LSTM Based Facial Expression Analysis Model for a Humanoid Robot. IEEE Access 2019, 7, 93998–94011. [Google Scholar] [CrossRef]
- Ming, Y.; Qian, H.; Guangyuan, L. CNN-LSTM Facial Expression Recognition Method Fused with Two-Layer Attention Mechanism. Comput. Intell. Neurosci. 2022, 2022, 1–9. [Google Scholar] [CrossRef]
- Iliou, T.; Anagnostopoulos, C.-N. Statistical Evaluation of Speech Features for Emotion Recognition. In Proceedings of the 2009 Fourth International Conference on Digital Telecommunications, Colmar, France, 20–25 July 2009; pp. 121–126. [Google Scholar]
- Wang, K.; An, N.; Li, B.N.; Zhang, Y.; Li, L. Speech Emotion Recognition Using Fourier Parameters. IEEE Trans. Affect. Comput. 2015, 6, 69–75. [Google Scholar] [CrossRef]
- Lahaie, O.; Lefebvre, R.; Gournay, P. Influence of Audio Bandwidth on Speech Emotion Recognition by Human Subjects. In Proceedings of the 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Montreal, QC, Canada, 22 July 2017; pp. 61–65. [Google Scholar]
- Bandela, S.R.; Kumar, T.K. Stressed Speech Emotion Recognition Using Feature Fusion of Teager Energy Operator and MFCC. In Proceedings of the 2017 8th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Delhi, India, 3–5 July 2017; pp. 1–5. [Google Scholar]
- Han, K.; Yu, D.; Tashev, I. Speech Emotion Recognition Using Deep Neural Network and Extreme Learning Machine. In Proceedings of the Interspeech 2014, Singapore, 14–18 September 2014. [Google Scholar]
- Mao, Q.; Dong, M.; Huang, Z.; Zhan, Y. Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks. IEEE Trans. Multimed. 2014, 16, 2203–2213. [Google Scholar] [CrossRef]
- Lee, J.; Tashev, I. High-Level Feature Representation Using Recurrent Neural Network for Speech Emotion Recognition. In Proceedings of the Interspeech 2015, Dresden, Germany, 6–10 September 2015. [Google Scholar] [CrossRef]
- Kumbhar, H.S.; Bhandari, S.U. Speech Emotion Recognition Using MFCC Features and LSTM Network. In Proceedings of the 2019 5th International Conference On Computing, Communication, Control And Automation (ICCUBEA), Pune, India, 19–21 September 2019; pp. 1–3. [Google Scholar]
- Etienne, C.; Fidanza, G.; Petrovskii, A.; Devillers, L.; Schmauch, B. CNN+LSTM Architecture for Speech Emotion Recognition with Data Augmentation. In Proceedings of the Workshop on Speech, Music and Mind (SMM 2018), Hyderabad, India, 1 September 2018; pp. 21–25. [Google Scholar]
- Atila, O.; Şengür, A. Attention Guided 3D CNN-LSTM Model for Accurate Speech Based Emotion Recognition. Appl. Acoust. 2021, 182, 108260. [Google Scholar] [CrossRef]
- Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
- Chung, Y.-A.; Hsu, W.-N.; Tang, H.; Glass, J. An Unsupervised Autoregressive Model for Speech Representation Learning. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019. [Google Scholar]
- Liu, A.T.; Li, S.-W.; Lee, H. TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 2351–2366. [Google Scholar] [CrossRef]
- Liu, A.T.; Yang, S.; Chi, P.-H.; Hsu, P.; Lee, H. Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6419–6423. [Google Scholar]
- Fan, Z.; Li, M.; Zhou, S.; Xu, B. Exploring Wav2vec 2.0 on Speaker Verification and Language Identification. In Proceedings of the Interspeech 2021, Brno, Czechia, 30 August–3 September 2021. [Google Scholar]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations, Scottsdale, AZ, USA, 2–4 May 2013. [Google Scholar]
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the International Conference on Learning Representations, Scottsdale, AZ, USA, 2–4 May 2013. [Google Scholar]
- Javed, N.; Muralidhara, B.L. Emotions During COVID-19: LSTM Models for Emotion Detection in Tweets. In Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications; Gunjan, V.K., Zurada, J.M., Eds.; Lecture Notes in Networks and Systems; Springer: Singapore, 2022; Volume 237, pp. 133–148. ISBN 9789811664069. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
- Gou, Z.; Li, Y. Integrating BERT Embeddings and BiLSTM for Emotion Analysis of Dialogue. Comput. Intell. Neurosci. 2023, 2023, 6618452. [Google Scholar] [CrossRef] [PubMed]
- Gui, L.; Zhou, Y.; Xu, R.; He, Y.; Lu, Q. Learning Representations from Heterogeneous Network for Sentiment Classification of Product Reviews. Knowl.-Based Syst. 2017, 124, 34–45. [Google Scholar] [CrossRef]
- Chen, F.; Ji, R.; Su, J.; Cao, D.; Gao, Y. Predicting Microblog Sentiments via Weakly Supervised Multimodal Deep Learning. IEEE Trans. Multimed. 2018, 20, 997–1007. [Google Scholar] [CrossRef]
- Liu, G.; Guo, J. Bidirectional LSTM with Attention Mechanism and Convolutional Layer for Text Classification. Neurocomputing 2019, 337, 325–338. [Google Scholar] [CrossRef]
- Xie, H.; Feng, S.; Wang, D.; Zhang, Y. A Novel Attention Based CNN Model for Emotion Intensity Prediction. In Natural Language Processing and Chinese Computing; Lecture Notes in Computer Science; Zhang, M., Ng, V., Zhao, D., Li, S., Zan, H., Eds.; Springer International Publishing: Cham, Switzerland, 2018; Volume 11108, pp. 365–377. ISBN 978-3-319-99494-9. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 2017 Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Akula, R.; Garibay, I. Interpretable Multi-Head Self-Attention Architecture for Sarcasm Detection in Social Media. Entropy 2021, 23, 394. [Google Scholar] [CrossRef] [PubMed]
- Pérez-Rosas, V.; Mihalcea, R.; Morency, L.-P. Utterance-Level Multimodal Sentiment Analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, 4–9 August 2013; pp. 973–982. [Google Scholar]
- Xu, N.; Mao, W. MultiSentiNet: A Deep Semantic Network for Multimodal Sentiment Analysis. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6 November 2017; pp. 2399–2402. [Google Scholar]
- Deng, D.; Zhou, Y.; Pi, J.; Shi, B.E. Multimodal Utterance-Level Affect Analysis Using Visual, Audio and Text Features. arXiv 2018, arXiv:1805.00625. [Google Scholar]
- Poria, S.; Cambria, E.; Gelbukh, A. Deep Convolutional Neural Network Textual Features and Multiple Kernel Learning for Utterance-Level Multimodal Sentiment Analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 2539–2544. [Google Scholar]
- Yu, Y.; Lin, H.; Meng, J.; Zhao, Z. Visual and Textual Sentiment Analysis of a Microblog Using Deep Convolutional Neural Networks. Algorithms 2016, 9, 41. [Google Scholar] [CrossRef]
- Li, Y.; Zhao, T.; Shen, X. Attention-Based Multimodal Fusion for Estimating Human Emotion in Real-World HRI. In Proceedings of the Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, Cambridge, UK, 23 March 2020; pp. 340–342. [Google Scholar]
- Wang, H.; Yang, M.; Li, Z.; Liu, Z.; Hu, J.; Fu, Z.; Liu, F. SCANET: Improving Multimodal Representation and Fusion with Sparse-and Cross-attention for Multimodal Sentiment Analysis. Comput. Animat. Virtual Worlds 2022, 33, e2090. [Google Scholar] [CrossRef]
- Li, P.; Li, X. Multimodal Fusion with Co-Attention Mechanism. In Proceedings of the 2020 IEEE 23rd International Conference on Information Fusion (FUSION), Rustenburg, South Africa, 6–9 July 2020. [Google Scholar] [CrossRef]
- Zhu, H.; Wang, Z.; Shi, Y.; Hua, Y.; Xu, G.; Deng, L. Multimodal Fusion Method Based on Self-Attention Mechanism. Wirel. Commun. Mob. Comput. 2020, 2020, 1–8. [Google Scholar] [CrossRef]
- Thao, H.T.P.; Balamurali, B.T.; Roig, G.; Herremans, D. AttendAffectNet–Emotion Prediction of Movie Viewers Using Multimodal Fusion with Self-Attention. Sensors 2021, 21, 8356. [Google Scholar] [CrossRef]
- Gu, D.; Wang, J.; Cai, S.; Yang, C.; Song, Z.; Zhao, H.; Xiao, L.; Wang, H. Targeted Aspect-Based Multimodal Sentiment Analysis: An Attention Capsule Extraction and Multi-Head Fusion Network. IEEE Access 2021, 9, 157329–157336. [Google Scholar] [CrossRef]
- Ahn, C.-S.; Kasun, C.; Sivadas, S.; Rajapakse, J. Recurrent Multi-Head Attention Fusion Network for Combining Audio and Text for Speech Emotion Recognition. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18 September 2022; pp. 744–748. [Google Scholar]
- Xie, B.; Sidulova, M.; Park, C.H. Robust Multimodal Emotion Recognition from Conversation with Transformer-Based Crossmodality Fusion. Sensors 2021, 21, 4913. [Google Scholar] [CrossRef] [PubMed]
- Wang, D.; Guo, X.; Tian, Y.; Liu, J.; He, L.; Luo, X. TETFN: A Text Enhanced Transformer Fusion Network for Multimodal Sentiment Analysis. Pattern Recognit. 2023, 136, 109259. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 2021 International Conference on Machine Learning, Virtual Event, 18–24 July 2021. [Google Scholar]
- Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-Augmented Transformer for Speech Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020. [Google Scholar]
- Williams, J.; Kleinegesse, S.; Comanescu, R.; Radu, O. Recognizing Emotions in Video Using Multimodal DNN Feature Fusion. In Proceedings of the Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), Melbourne, Australia, 20 July 2018; pp. 11–19. [Google Scholar]
- Zadeh, A.; Liang, P.P.; Mazumder, N.; Poria, S.; Cambria, E.; Morency, L.-P. Memory Fusion Network for Multi-View Sequential Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6558–6569. [Google Scholar]
- Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.; Morency, L.-P. Efficient Low-Rank Multimodal Fusion with Modality-Specific Factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018. [Google Scholar]
- Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. Tensor Fusion Network for Multimodal Sentiment Analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017. [Google Scholar]
Group | Model | Acc. (%)
---|---|---
Baseline | EF-LSTM [74] | 51.73
Baseline | MFN [75] | 63.89
Baseline | MULT [76] | 65.03
Baseline | LF-DNN [13] | 66.91
Baseline | MLF-DNN * [13] | 69.06
Baseline | LMF [77] | 64.38
Baseline | MLMF * [13] | 67.70
Baseline | TFN [78] | 64.46
Baseline | MTFN * [13] | 69.02
Ours | Uni2Mul-S-Conformer (w/o pre-train) | 72.21
Ours | Uni2Mul-S-Conformer | 76.15
Ours | Uni2Mul-M-Conformer (w/o pre-train) | 73.30
Ours | Uni2Mul-M-Conformer | 76.81
Modality | Name of Model | Feature | Acc. (IMAs) | F1 (IMAs) | Acc. (IUAs) | F1 (IUAs)
---|---|---|---|---|---|---
V | CNN | Image | 48.36 | 44.56 | 45.08 | 44.18
V | CNN | CLIP | 63.68 | 63.11 | 66.96 | 65.41
V | CNN-LSTM | Image | 54.49 | 38.67 | 51.20 | 35.06
V | CNN-LSTM | CLIP | 67.40 | 62.05 | 68.05 | 67.58
V | CNN-ATTN-LSTM | Image | 54.27 | 38.18 | 51.20 | 34.68
V | CNN-ATTN-LSTM | CLIP | 65.43 | 59.97 | 67.83 | 65.60
A | BiLSTM | Mel | 51.20 | 44.35 | 50.77 | 42.14
A | BiLSTM | Wave2Vec | 52.52 | 44.94 | 50.98 | 46.17
A | CNN-BiLSTM | Mel | 51.42 | 42.70 | 51.20 | 42.27
A | CNN-BiLSTM | Wave2Vec | 53.61 | 41.62 | 53.17 | 46.84
A | CNN-ATTN-BiLSTM | Mel | 52.95 | 43.24 | 49.45 | 44.36
A | CNN-ATTN-BiLSTM | Wave2Vec | 54.27 | 40.48 | 53.17 | 50.06
T | BiLSTM | Word2Vec | 54.70 | 43.98 | 53.39 | 44.53
T | BiLSTM | BERT | 69.80 | 65.64 | 74.40 | 73.18
T | ATTN-BiLSTM | Word2Vec | 54.92 | 43.16 | 54.05 | 44.62
T | ATTN-BiLSTM | BERT | 70.24 | 65.96 | 74.62 | 74.41
T | CNN-ATTN | Word2Vec | 55.36 | 42.40 | 53.39 | 45.95
T | CNN-ATTN | BERT | 70.02 | 66.87 | 75.27 | 75.10
Name of Model | Acc. (IMAs) | F1 (IMAs) | Acc. (IUAs) | F1 (IUAs) |
---|---|---|---|---|
Uni2Mul-S-Concatenate (w/o pre-train) | 66.52 | 65.47 | 68.71 | 64.60 |
Uni2Mul-S-Concatenate | 71.99 | 70.39 | 73.96 | 73.17 |
Uni2Mul-M-Concatenate (w/o pre-train) | 70.46 | 67.95 | 69.58 | 68.83 |
Uni2Mul-M-Concatenate | 72.87 | 70.35 | 75.05 | 74.36 |
Uni2Mul-S-Attention (w/o pre-train) | 69.15 | 65.61 | 67.61 | 61.37 |
Uni2Mul-S-Attention | 72.65 | 71.53 | 73.96 | 73.26 |
Uni2Mul-M-Attention (w/o pre-train) | 70.24 | 64.35 | 70.90 | 69.56 |
Uni2Mul-M-Attention | 72.87 | 71.16 | 76.37 | 75.26 |
Uni2Mul-S-Transformer (w/o pre-train) | 68.71 | 63.64 | 65.21 | 59.62 |
Uni2Mul-S-Transformer | 71.77 | 69.17 | 75.05 | 73.87 |
Uni2Mul-M-Transformer (w/o pre-train) | 69.58 | 63.77 | 71.77 | 67.96 |
Uni2Mul-M-Transformer | 73.09 | 70.06 | 76.59 | 74.68 |
Uni2Mul-S-Conformer (w/o pre-train) | 69.37 | 64.57 | 72.21 | 68.56 |
Uni2Mul-S-Conformer | 72.21 | 70.62 | 76.15 | 75.00 |
Uni2Mul-M-Conformer (w/o pre-train) | 71.55 | 68.40 | 73.30 | 71.84 |
Uni2Mul-M-Conformer | 73.09 | 71.50 | 76.81 | 75.08 |