Multimodal Emotion Recognition Using Modality-Wise Knowledge Distillation
Abstract
1. Introduction
2. Method
2.1. Typical Multimodal Emotion Recognition Model and Optimization Imbalance Phenomenon
2.2. Modality-Wise Knowledge Distillation
2.3. Combination with Other Regularization Methods
Algorithm 1: MKD with optional imbalance mitigation methods (OGM-GE or MMCosine)
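The algorithm pseudocode itself is not reproduced in this excerpt. As a rough, illustrative sketch only, the snippet below shows one generic way a modality-wise distillation objective could be combined with the usual task loss: the fused multimodal logits act as a soft-label teacher for each modality-specific classifier through the temperature-scaled distillation loss of Hinton et al. [27]. The function names, the teacher-student direction, and the weight `alpha` are assumptions made for illustration; they are not claimed to match the paper's exact MKD formulation or its OGM-GE/MMCosine variants.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label knowledge distillation loss (Hinton et al. [27])."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # KL divergence between the softened teacher and student distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)

def modality_wise_kd_loss(fused_logits, per_modality_logits, labels, alpha=0.5, T=2.0):
    """Illustrative combined objective: cross-entropy on the fused prediction plus
    a distillation term tying each modality-specific classifier to the fused
    (multimodal) prediction. Direction and weighting are assumptions."""
    loss = F.cross_entropy(fused_logits, labels)
    for logits_m in per_modality_logits:  # e.g. [audio_logits, visual_logits, text_logits]
        loss = loss + alpha * kd_loss(logits_m, fused_logits.detach(), T)
    return loss

if __name__ == "__main__":
    # Toy shapes: batch of 4, 6 emotion classes, three modality branches (assumed).
    fused = torch.randn(4, 6)
    per_modality = [torch.randn(4, 6) for _ in range(3)]
    labels = torch.randint(0, 6, (4,))
    print(modality_wise_kd_loss(fused, per_modality, labels).item())
```

Detaching the teacher logits in this sketch keeps the distillation term from pulling the fused head toward its own students; whether the actual method does this, or distills in the opposite direction, cannot be determined from this excerpt.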
3. Experiments
3.1. Experimental Configurations
3.2. Results
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Kim, E.; Shin, J.W. DNN-based Emotion Recognition Based on Bottleneck Acoustic Features and Lexical Features. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6720–6724. [Google Scholar]
- Kossaifi, J.; Toisoul, A.; Bulat, A.; Panagakis, Y.; Hospedales, T.M.; Pantic, M. Factorized Higher-Order CNNs with an Application to Spatio-Temporal Emotion Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6060–6069. [Google Scholar]
- Ahn, Y.; Lee, S.J.; Shin, J.W. Cross-Corpus Speech Emotion Recognition Based on Few-Shot Learning and Domain Adaptation. IEEE Signal Process. Lett. 2021, 28, 1190–1194. [Google Scholar] [CrossRef]
- Ahn, Y.; Lee, S.J.; Shin, J.W. Multi-Corpus Speech Emotion Recognition for Unseen Corpus Using Corpus-Wise Weights in Classification Loss. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 131–135. [Google Scholar]
- Ahn, Y.; Han, S.; Lee, S.; Shin, J.W. Speech Emotion Recognition Incorporating Relative Difficulty and Labeling Reliability. Sensors 2024, 24, 4111. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Y.; Chen, M.; Huang, D.; Wu, D.; Li, Y. iDoctor: Personalized and Professionalized Medical Recommendations Based on Hybrid Matrix Factorization. Future Gener. Comput. Syst. 2017, 66, 30–35. [Google Scholar] [CrossRef]
- Katsis, C.D.; Rigas, G.; Goletsis, Y.; Fotiadis, D.I. Emotion Recognition in Car Industry. In Emotion Recognition; Wang, W., Ed.; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2015; Chapter 20; pp. 515–544. [Google Scholar]
- Cowie, R.; Douglas-Cowie, E.; Tsapatsoulis, N.; Votsis, G.; Kollias, S.; Fellenz, W.; Taylor, J.G. Emotion Recognition in Human-Computer Interaction. IEEE Signal Process. Mag. 2001, 18, 32–80. [Google Scholar] [CrossRef]
- Yoon, S.; Byun, S.; Jung, K. Multimodal speech emotion recognition using audio and text. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 112–118. [Google Scholar]
- Chen, B.; Cao, Q.; Hou, M.; Zhang, Z.; Lu, G.; Zhang, D. Multimodal Emotion Recognition with Temporal and Semantic Consistency. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3592–3603. [Google Scholar] [CrossRef]
- Sun, L.; Liu, B.; Tao, J.; Lian, Z. Multimodal Cross- and Self-Attention Network for Speech Emotion Recognition. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 4275–4279. [Google Scholar]
- Yang, D.; Huang, S.; Liu, Y.; Zhang, L. Contextual and Cross-Modal Interaction for Multi-Modal Speech Emotion Recognition. IEEE Signal Process. Lett. 2022, 29, 2093–2097. [Google Scholar] [CrossRef]
- Middya, A.I.; Nag, B.; Roy, S. Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities. Knowl.-Based Syst. 2022, 244, 108580. [Google Scholar] [CrossRef]
- Rajan, V.; Brutti, A.; Cavallaro, A. Is cross-attention preferable to self-attention for multi-modal emotion recognition? In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 4693–4697. [Google Scholar]
- Peng, X.; Wei, Y.; Deng, A.; Wang, D.; Hu, D. Balanced multimodal learning via on-the-fly gradient modulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 8238–8247. [Google Scholar]
- Fan, Y.; Xu, W.; Wang, H.; Wang, J.; Guo, S. PMR: Prototypical Modal Rebalance for Multimodal Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 20029–20038. [Google Scholar]
- Xu, R.; Feng, R.; Zhang, S.-X.; Hu, D. MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023. [Google Scholar]
- Xie, J.; Wang, J.; Wang, Q.; Yang, D.; Gu, J.; Tang, Y.; Varatnitski, Y.I. A multimodal fusion emotion recognition method based on multitask learning and attention mechanism. Neurocomputing 2023, 556, 126649. [Google Scholar] [CrossRef]
- Sebe, N.; Cohen, I.; Huang, T.S. Multimodal emotion recognition. In Handbook of Pattern Recognition and Computer Vision; World Scientific: Singapore, 2005; pp. 387–409. [Google Scholar]
- Haq, S.; Jackson, P.J.B. Multimodal Emotion Recognition. In Machine Audition: Principles, Algorithms and Systems; Wang, W., Ed.; IGI Global: Hershey, PA, USA, 2011; pp. 398–423. [Google Scholar]
- Geetha, A.V.; Mala, T.; Priyanka, D.; Uma, E. Multimodal Emotion Recognition with deep learning: Advancements, challenges, and future directions. Inf. Fusion 2024, 105, 102218. [Google Scholar]
- Fu, Y.; Yuan, S.; Zhang, C.; Cao, J. Emotion Recognition in Conversations: A Survey Focusing on Context, Speaker Dependencies, and Fusion Methods. Electronics 2023, 12, 4714. [Google Scholar] [CrossRef]
- Ma, H.; Wang, J.; Lin, H.; Zhang, B.; Zhang, Y.; Xu, B. A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations. IEEE Trans. Multimed. 2023, 26, 776–788. [Google Scholar] [CrossRef]
- Chen, F.; Shao, J.; Zhu, S.; Shen, H.T. Multivariate, multi-frequency and multimodal: Rethinking graph neural networks for emotion recognition in conversation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10761–10770. [Google Scholar]
- Zhang, X.; Cui, W.; Hu, B.; Li, Y. A Multi-Level Alignment and Cross-Modal Unified Semantic Graph Refinement Network for Conversational Emotion Recognition. IEEE Trans. Affect. Comput. 2024, 15, 1553–1566. [Google Scholar] [CrossRef]
- Sari, L.; Singh, K.; Zhou, J.; Torresani, L.; Singhal, N.; Saraf, Y. A multi-view approach to audio-visual speaker verification. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6194–6198. [Google Scholar]
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
- Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
- Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R. CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 2014, 5, 377–390. [Google Scholar] [CrossRef] [PubMed]
- Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
- Ahn, C.S.; Kasun, L.L.C.; Sivadas, S.; Rajapakse, J.C. Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 744–748. [Google Scholar]
- Perez, E.; Strub, F.; De Vries, H.; Dumoulin, V.; Courville, A. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2018; Volume 32, pp. 1–9. [Google Scholar]
- Kiela, D.; Grave, E.; Joulin, A.; Mikolov, T. Efficient large-scale multi-modal classification. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2018; Volume 32, pp. 1–9. [Google Scholar]
- Yun, S.; Park, J.; Lee, K.; Shin, J. Regularizing class-wise predictions via self-knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13876–13885. [Google Scholar]
- Alain, G.; Bengio, Y. Understanding intermediate layers using linear classifier probes. arXiv 2016, arXiv:1610.01644. [Google Scholar]
Method | CREMA-D #Param | CREMA-D ACC (%) | IEMOCAP #Param | IEMOCAP UA (%)
---|---|---|---|---
Multimodal | 22.4M | 55.9 | 1.6M | 64.20
OGM-GE [15] | 22.4M | 62.2 * | - | -
PMR [16] | 22.9M | 61.8 * | - | -
MMCosine [17] | 22.9M | 66.4 * | 1.6M | 61.80
Uni-sum [26] | 22.4M | 55.9 | 1.6M | 63.60
All-sum [26] | 44.7M | 60.3 | 3.3M | 68.00
MWCE | 22.4M | 60.8 | 1.6M | 65.70
Self-KD [34] | 22.4M | 60.3 | 1.6M | 64.40
MKD | 22.4M | 67.7 | 1.6M | 67.50
MKD+ [15] | 22.4M | 68.4 | - | -
MKD+ [16] | 22.9M | 67.9 | - | -
MKD+ [17] | 22.4M | 69.3 | 1.6M | 66.90
All-sum (MKD) | 44.7M | 67.1 | 3.3M | 68.70
Method | CREMA-D Audio | CREMA-D Visual | IEMOCAP Audio | IEMOCAP Visual | IEMOCAP Text
---|---|---|---|---|---
Unimodal | 57.5 | 27.3 | 45.1 | 53.2 | 51.3
Multimodal | 57.0 | 18.6 | 43.2 | 50.8 | 50.6
MKD | 62.5 | 29.2 | 45.7 | 54.7 | 52.0
Audio | Visual | Text | CREMA-D | IEMOCAP
---|---|---|---|---
✗ | ✗ | ✗ | 55.9 | 64.2
✓ | ✗ | ✗ | 63.5 | 66.1
✗ | ✓ | ✗ | 62.9 | 66.1
✗ | ✗ | ✓ | - | 65.9
✓ | ✓ | ✗ | 67.7 | 66.4
✓ | ✗ | ✓ | - | 66.4
✗ | ✓ | ✓ | - | 66.2
✓ | ✓ | ✓ | - | 67.5
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).