Multimodal Emotion Recognition in Conversations Using Transformer and Graph Neural Networks
Abstract
1. Introduction
- We design and present the MTG-ERC multimodal conversational emotion recognition model. MTG-ERC consists of a multi-level Transformer fusion module and a directed multi-relational graph fusion module. These two modules are combined through a gated fusion mechanism to integrate global (long-range, cross-modal) and local (short-term, speaker-aware) emotional features; a minimal sketch of this gating is given after this list. MTG-ERC yields consistent improvements (approximately 1% absolute in both accuracy and weighted F1 on IEMOCAP and MELD) over a strong graph-based baseline, and we discuss the statistical significance of these gains rather than claiming large margins.
- We introduce a multi-level Transformer fusion module. This module uses multi-head self-attention (intra-modal) together with cross-modal attention (inter-modal) to capture interactions within and between modalities and to model long-range conversational context. This branch focuses on global dependency modelling across the dialogue.
- We introduce a directed multi-relational graph fusion module. This module represents the conversation as a directed graph with multiple explicitly defined relation types (e.g., intra-utterance cross-modal links, self-loops, temporally directed edges). It aggregates short-range contextual signals and speaker-dependent emotional shifts, enabling fine-grained local reasoning that complements the Transformer branch.
- We conduct performance evaluations and ablation studies on the public IEMOCAP [3] and MELD [4] datasets. We compare MTG-ERC against strong recent baselines, report averaged results across multiple runs, and analyze variance and paired significance to assess robustness. We also analyze the computational characteristics of the two modules to clarify the accuracy–efficiency trade-off and guide deployment considerations.
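To make the gated combination of the two branches concrete, the following is a minimal, illustrative sketch in PyTorch, not the authors' released code; the module name `GatedBranchFusion`, the shared feature dimension `d`, and the sigmoid-gate formulation are assumptions for illustration only.

```python
# Illustrative sketch (assumed names and gating form), not the published implementation.
import torch
import torch.nn as nn

class GatedBranchFusion(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        # The gate is computed from the concatenated outputs of the two branches.
        self.gate = nn.Linear(2 * d, d)

    def forward(self, h_global: torch.Tensor, h_local: torch.Tensor) -> torch.Tensor:
        # h_global: per-utterance features from the multi-level Transformer branch, shape (N, d)
        # h_local:  per-utterance features from the directed graph branch, shape (N, d)
        z = torch.sigmoid(self.gate(torch.cat([h_global, h_local], dim=-1)))
        # Convex, element-wise combination of global and local emotional features.
        return z * h_global + (1 - z) * h_local

# Example: fuse branch outputs for a 10-utterance dialogue with d = 128.
fused = GatedBranchFusion(128)(torch.randn(10, 128), torch.randn(10, 128))
```

A learned gate of this kind lets the model decide, per utterance and per feature dimension, how much to rely on long-range Transformer context versus local graph context.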
2. Related Works
2.1. Multimodal Information Fusion
2.2. Conversational Emotion Recognition
3. Methodology
- (1)
- Unimodal feature extraction module: This module extracts features for each modality. Text features are extracted with a Transformer network, while audio and visual features are extracted with fully connected networks.
- (2)
- Multi-level Transformer fusion module: Intra-modal and inter-modal Transformer networks capture interactions within and between the modality sequences of a conversation, integrating intra-modal and inter-modal information and modelling long-range context.
- (3)
- Directed multi-relational graph fusion module: A directed graph with multiple relation types is constructed to learn local context representations, capture the relationships between utterances and modalities, and enhance the expression of local features.
- (4)
- Multimodal emotion classification module: The weight of each modality is adjusted dynamically with a softmax function. The global and local context features produced by the multi-level Transformer fusion module and the directed multi-relational graph fusion module are concatenated, achieving a comprehensive integration of global and local conversational emotion features, and an emotion label is output for each utterance (a minimal code sketch of this step follows this list).
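As a concrete illustration of module (4), the sketch below shows one way the softmax-based modality weighting, the concatenation (splicing) of global and local context features, and the per-utterance label prediction could be wired together. The class name `EmotionClassifier`, the single linear scoring layer, and the exact concatenation order are assumptions, not the published implementation.

```python
# Illustrative sketch of the multimodal emotion classification module (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionClassifier(nn.Module):
    def __init__(self, d: int, num_classes: int):
        super().__init__()
        self.modality_scores = nn.Linear(d, 1)           # one score per modal feature
        self.classifier = nn.Linear(3 * d, num_classes)  # weighted summary + global + local

    def forward(self, modal_feats, h_global, h_local):
        # modal_feats: (N, M, d) per-utterance unimodal features for M modalities
        # h_global / h_local: (N, d) outputs of the Transformer / graph fusion branches
        scores = self.modality_scores(modal_feats).squeeze(-1)   # (N, M)
        weights = F.softmax(scores, dim=-1).unsqueeze(-1)        # dynamic modality weights (N, M, 1)
        weighted = (weights * modal_feats).sum(dim=1)            # (N, d)
        # Splice the weighted modal summary with the global and local context
        # features, then predict one emotion label per utterance.
        h = torch.cat([weighted, h_global, h_local], dim=-1)     # (N, 3d)
        return self.classifier(h)                                # logits, shape (N, num_classes)

# Example with hypothetical sizes: 10 utterances, 3 modalities, d = 128, 6 emotion classes.
logits = EmotionClassifier(128, 6)(
    torch.randn(10, 3, 128), torch.randn(10, 128), torch.randn(10, 128))
pred = logits.argmax(dim=-1)                                     # predicted label per utterance
```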
3.1. Task Definition
3.2. Feature Extraction
3.3. Speaker Embedding
3.4. Multi-Level Transformer Fusion Module
Intra-Modal and Inter-Modal Transformers
3.5. Single-Modality Gated Fusion
3.6. Directed Multi-Relational Graph Fusion Module
3.6.1. Graph Structure
- Node division: Each utterance generates three modal nodes, one each for the text, audio, and visual modalities; each node is initialised with the corresponding unimodal feature vector of that utterance.
- Edge division: The edges included in the edge set depend on the scope of context considered during modelling. If the context of each utterance covered all other utterances in the conversation, a fully connected graph would be generated in which every vertex has an edge to every other vertex (including itself). Because this is computationally expensive, we only consider local neighbour information and use a context window to limit its scope. The parameters P and F denote the number of past and future utterances connected, respectively. In our implementation, we use an asymmetric context window [P, F] = [9, 5]: for each utterance u_t, we create directed edges to up to 9 preceding and 5 following utterances. We adopt an asymmetric window because the emotional state is typically influenced more by accumulated past context (speaker history, escalation), while only short-range future reactions are informative. We also compared symmetric windows such as [5, 5] and wider windows such as [9, 9]; [9, 5] provided the best balance between classification performance and graph size, so we use this setting in all experiments. A minimal sketch of this edge construction follows.
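The following is an illustrative sketch of the windowed edge construction described above; the function name `build_edges` and the relation labels are assumptions, and the authors' actual edge typing may differ. It creates self-loops, intra-utterance cross-modal edges, and temporally directed intra-modal edges limited by the [P, F] = [9, 5] window.

```python
# Sketch of windowed, multi-relational edge construction (illustrative, not the released code).
from itertools import product

def build_edges(num_utterances: int, p: int = 9, f: int = 5):
    modalities = ("t", "a", "v")  # text, audio, visual node per utterance

    def node(t: int, m: str) -> int:
        # Flat node index: three modal nodes per utterance.
        return t * len(modalities) + modalities.index(m)

    edges = []  # (source, target, relation) triples
    for t in range(num_utterances):
        # Self-loop for every modal node of utterance t.
        for m in modalities:
            edges.append((node(t, m), node(t, m), "self"))
        # Cross-modal edges inside the same utterance.
        for m1, m2 in product(modalities, repeat=2):
            if m1 != m2:
                edges.append((node(t, m1), node(t, m2), "cross_modal"))
        # Temporally directed intra-modal edges, limited to P past / F future utterances.
        for m in modalities:
            for s in range(max(0, t - p), t):
                edges.append((node(s, m), node(t, m), "past_to_current"))
            for s in range(t + 1, min(num_utterances, t + f + 1)):
                edges.append((node(s, m), node(t, m), "future_to_current"))
    return edges

# Example: edge list for a 12-utterance dialogue with the default [9, 5] window.
edges = build_edges(num_utterances=12)
```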
3.6.2. Graph Learning
3.7. Multimodal Emotion Classification
3.7.1. Multimodal Gated Fusion
3.7.2. Emotion Label Generation
4. Experiments
4.1. Dataset
4.2. Experimental Parameter Settings and Evaluation Indicators
4.3. Comparison with Baselines
4.4. Ablation Study
4.5. Confusion Matrix Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Fu, Y.; Yuan, S.; Zhang, C.; Cao, J. Emotion Recognition in Conversations: A Survey Focusing on Context, Speaker Dependencies, and Fusion Methods. Electronics 2023, 12, 4714.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
- Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359.
- Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 527–536.
- Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor Fusion Network for Multimodal Sentiment Analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; pp. 1103–1114.
- Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.B.; Morency, L.P. Efficient Low-rank Multimodal Fusion With Modality-Specific Factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; pp. 2247–2256.
- Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6558–6569.
- Sahay, S.; Okur, E.; Kumar, S.H.; Nachman, L. Low Rank Fusion based Transformers for Multimodal Sequences. In Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML); ACL: Seattle, WA, USA, 2020; pp. 29–34.
- Majumder, N.; Poria, S.; Hazarika, D.; Mihalcea, R.; Gelbukh, A.; Cambria, E. DialogueRNN: An attentive RNN for emotion detection in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6818–6825.
- Ghosal, D.; Majumder, N.; Gelbukh, A.; Mihalcea, R.; Poria, S. COSMIC: COmmonSense knowledge for eMotion Identification in Conversations. In Findings of the Association for Computational Linguistics: EMNLP 2020; ACL: Seattle, WA, USA, 2020; pp. 2470–2481.
- Hu, D.; Wei, L.; Huai, X. DialogueCRN: Contextual Reasoning Networks for Emotion Recognition in Conversations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; pp. 7042–7052.
- Shen, W.; Chen, J.; Quan, X.; Xie, Z. DialogXL: All-in-one XLNet for multi-party conversation emotion recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 13789–13797.
- Kim, T.; Vossen, P. EmoBERTa: Speaker-aware emotion recognition in conversation with RoBERTa. arXiv 2021, arXiv:2108.12009.
- Li, S.; Yan, H.; Qiu, X. Contrast and generation make BART a good dialogue emotion recognizer. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 11002–11010.
- Hu, G.; Lin, T.E.; Zhao, Y.; Lu, G.; Wu, Y.; Li, Y. UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 7837–7851.
- Ghosal, D.; Majumder, N.; Poria, S.; Chhaya, N.; Gelbukh, A. DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 154–164.
- Ishiwatari, T.; Yasuda, Y.; Miyazaki, T.; Goto, J. Relation-aware graph attention networks with relational position encodings for emotion recognition in conversations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 7360–7370.
- Hu, J.; Liu, Y.; Zhao, J.; Jin, Q. MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; pp. 5666–5675.
- Li, J.; Wang, X.; Lv, G.; Zeng, Z. GraphCFC: A Directed Graph Based Cross-Modal Feature Complementation Approach for Multimodal Conversational Emotion Recognition. IEEE Trans. Multimed. 2023, 26, 77–89.
- Schlichtkrull, M.; Kipf, T.N.; Bloem, P.; Van Den Berg, R.; Titov, I.; Welling, M. Modeling relational data with graph convolutional networks. In The Semantic Web, Proceedings of the 15th International Conference, ESWC 2018, Heraklion, Greece, 3–7 June 2018; Springer International Publishing: Berlin/Heidelberg, Germany, 2018; pp. 593–607.
- Yun, S.; Jeong, M.; Kim, R.; Kang, J.; Kim, H.J. Graph transformer networks. Adv. Neural Inf. Process. Syst. 2019, 32.



Statistics of the IEMOCAP and MELD datasets (number of dialogues and utterances in the training and test splits).

| Datasets | Dialogues (Train) | Dialogues (Test) | Utterances (Train) | Utterances (Test) |
|---|---|---|---|---|
| IEMOCAP (6-way) | 120 | 31 | 5810 | 1623 |
| IEMOCAP (4-way) | 120 | 31 | 3600 | 943 |
| MELD | 1153 | 280 | 11098 | 2610 |
Per-class results (%), overall accuracy (Acc), and weighted F1 (w-F1) on IEMOCAP (6-way).

| Model | Happy | Sad | Neutral | Angry | Excited | Frustrated | Acc | w-F1 |
|---|---|---|---|---|---|---|---|---|
| DialogueRNN [9] | 33.18 | 78.80 | 59.21 | 65.28 | 71.86 | 58.91 | 63.40 | 62.75 |
| DialogueGCN [17] | 42.75 | 80.88 | 63.54 | 64.19 | 63.08 | 61.21 | 65.25 | 64.18 |
| MMGCN [18] | 42.34 | 78.67 | 61.73 | 69.00 | 74.33 | 62.32 | 65.43 | 66.22 |
| ICON [22] | 32.80 | 74.41 | 60.69 | 68.23 | 68.41 | 66.28 | 64.02 | 63.54 |
| COGMEN [23] | 51.90 | 81.70 | 68.60 | 66.00 | 75.30 | 58.20 | 68.20 | 67.60 |
| GraphCFC [19] | 43.23 | 84.45 | 64.13 | 71.37 | 78.19 | 62.74 | 69.13 | 68.91 |
| MTG-ERC | 55.59 | 82.09 | 65.62 | 69.62 | 75.56 | 66.64 | 70.12 | 70.23 |
Overall accuracy (Acc) and weighted F1 (w-F1, %) on IEMOCAP (4-way).

| Model | Acc | w-F1 |
|---|---|---|
| BC-LSTM [24] | 75.20 | 75.13 |
| CHFusion [25] | 76.59 | 76.80 |
| COGMEN [23] | 83.29 | 83.21 |
| MTG-ERC | 84.77 | 84.51 |
Per-class results (%), overall accuracy (Acc), and weighted F1 (w-F1) on MELD.

| Model | Neutral | Surprise | Fear | Sadness | Joy | Disgust | Angry | Acc | w-F1 |
|---|---|---|---|---|---|---|---|---|---|
| BC-LSTM [24] | 73.80 | 47.70 | 5.40 | 25.10 | 51.30 | 5.20 | 38.40 | 57.50 | 55.90 |
| DialogueRNN [9] | 76.56 | 49.40 | 1.20 | 23.80 | 50.70 | 1.70 | 41.50 | 56.10 | 55.90 |
| DialogueGCN [17] | - | - | - | - | - | - | - | 57.97 | 58.10 |
| MMGCN [18] | 76.33 | 48.15 | - | 26.74 | 53.02 | - | 46.09 | 58.34 | 58.65 |
| EmoCaps [26] | 77.12 | 63.19 | 3.03 | 42.52 | 57.50 | 7.69 | 57.54 | - | 64.00 |
| UniMSE [15] | - | - | - | - | - | - | - | 65.09 | 65.51 |
| GraphCFC [19] | 75.48 | 48.30 | 1.90 | 26.33 | 51.58 | 3.30 | 46.41 | 61.42 | 62.08 |
| MTG-ERC | 78.91 | 59.72 | 18.31 | 40.23 | 64.10 | 15.30 | 54.96 | 65.25 | 65.32 |
Accuracy (Acc) and weighted F1 (w-F1, %) for different modality settings (T = text, A = audio, V = visual) on IEMOCAP.

| Modality Settings | IEMOCAP (6-Way) Acc | IEMOCAP (6-Way) w-F1 | IEMOCAP (4-Way) Acc | IEMOCAP (4-Way) w-F1 |
|---|---|---|---|---|
| T | 66.17 | 66.19 | 82.08 | 81.99 |
| A | 57.81 | 56.07 | 63.31 | 61.17 |
| V | 42.04 | 41.78 | 46.34 | 46.36 |
| T + A | 68.22 | 68.35 | 83.35 | 83.32 |
| V + T | 66.42 | 66.44 | 81.55 | 81.54 |
| A + V | 59.88 | 60.18 | 65.14 | 63.70 |
| MTG-ERC | 70.13 | 70.23 | 84.77 | 84.51 |
Ablation results: accuracy (Acc) and weighted F1 (w-F1, %) on IEMOCAP when individual components are removed.

| Model Variant | IEMOCAP (6-Way) Acc | IEMOCAP (6-Way) w-F1 | IEMOCAP (4-Way) Acc | IEMOCAP (4-Way) w-F1 |
|---|---|---|---|---|
| w/o inter transformer | 69.32 | 69.47 | 83.13 | 82.97 |
| w/o intra transformer | 68.44 | 68.63 | 82.38 | 82.16 |
| w/o | 68.35 | 68.52 | 82.02 | 81.97 |
| w/o | 67.98 | 68.12 | 82.43 | 82.37 |
| MTG-ERC | 70.13 | 70.23 | 84.77 | 84.51 |