Article

Multimodal Emotion Recognition in Conversations Using Transformer and Graph Neural Networks

by Hua Jin, Tian Yang, Letian Yan, Changda Wang and Xuehua Song *
School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(22), 11971; https://doi.org/10.3390/app152211971
Submission received: 29 August 2025 / Revised: 31 October 2025 / Accepted: 8 November 2025 / Published: 11 November 2025

Abstract

To comprehensively capture conversational emotion information within and between modalities, address the challenge of global and local feature modelling in conversation, and enhance the accuracy of multimodal conversation emotion recognition, we present a model called Multimodal Transformer and GNN for Emotion Recognition in Conversations (MTG-ERC). The model incorporates a multi-level Transformer fusion module that employs multi-head self-attention and cross-modal attention mechanisms to effectively capture interaction patterns within and between modalities. To address the shortcomings of attention-mechanism-based models in capturing short-term dependencies, we introduce a directed multi-relational graph fusion module, which employs directed graphs and multiple relation types to achieve efficient multimodal information fusion and to model short-term, speaker-dependent emotional shifts. By integrating the outputs of these two modules, the MTG-ERC model effectively combines global and local conversational emotion features and enhances intra-modal and inter-modal emotional interactions. The proposed model shows consistent improvements (around 1% absolute) in both accuracy and weighted F1 on the IEMOCAP and MELD datasets when compared with other baseline models. This highlights the model’s strong performance indicators and validates its effectiveness in comparison to existing models.

1. Introduction

In the context of the information age, a variety of conversational scenarios have become an integral part of our lives, giving rise to the emerging field of Emotion Recognition in Conversations (ERC). ERC is concerned with identifying subtle emotional states within conversational interactions and represents a crucial aspect of affective computing. The model performs emotion recognition for each utterance based on the conversational context and speaker information, using a multi-category classification approach. In particular, this entails accurately classifying each utterance into one of several predefined emotion categories (such as joy, anger, sadness, or surprise), rather than performing only a simple positive/negative polarity analysis, which enables the model to capture complex emotional changes.
ERC plays a crucial role in various application areas, including conversation emotion monitoring, social media sentiment analysis, and emotion perception in dialogue systems. ERC also shows potential applications in recommendation systems, customer service feedback, and healthcare. At the same time, ERC systems must operate under practical constraints such as latency, interpretability, and privacy when deployed in real-world interactive settings.
The most commonly used methods for modelling conversation context can be categorized into three groups: those based on recurrent neural networks, those based on attention mechanisms, and those based on graph neural networks [1]. Recurrent neural networks process sequence data in a stepwise manner, retaining the order information of the time series, but may suffer from inefficiency and vanishing/exploding gradients for long sequences. Methods based on attention mechanisms, exemplified by Transformer [2], capture global context through self-attention mechanisms. However, they may under-emphasize fine-grained, short-term, speaker-dependent emotional shifts in local conversational windows. Graph neural networks (GNNs) are capable of effectively capturing detailed local emotional information by collecting contextual information around the discourse. However, GNNs can find it more difficult to efficiently capture long-range, cross-modal dependencies at the conversation level.
The field of multimodal conversation emotion recognition currently faces two core challenges. The first is how to capture the information interactions within and between the modalities of an utterance. The second is how to simultaneously acquire global and local conversational emotion features. Recent research has demonstrated the efficacy of attention mechanisms in multimodal fusion, facilitating the capture of pertinent information across modalities; however, such approaches often overlook the emotional information carried within each modality itself. To this end, we present a multi-level Transformer fusion module, which employs multi-head self-attention and cross-modal attention mechanisms to effectively capture information interactions within and between modalities, thereby enriching the emotional information. To better capture speaker-conditioned short-term dependencies and complement global attention with local reasoning, we further introduce a directed multi-relational graph fusion module, which constructs a directed conversational graph with multiple relation types and fuses multimodal information within a bounded local context. By integrating these two branches, the model jointly captures global long-range dependencies and local short-range emotional cues.
In summary, the main contributions of this work are as follows:
  • We design and present the MTG-ERC multimodal conversational emotion recognition model. MTG-ERC consists of a multi-level Transformer fusion module and a directed multi-relational graph fusion module. These two modules are combined through a gated fusion mechanism to integrate global (long-range, cross-modal) and local (short-term, speaker-aware) emotional features. MTG-ERC yields consistent improvements (approximately 1% absolute in both accuracy and weighted F1 on IEMOCAP and MELD) over a strong graph-based baseline, and we discuss the statistical significance of these gains rather than claiming large margins.
  • We introduce a multi-level Transformer fusion module. This module uses multi-head self-attention (intra-modal) together with cross-modal attention (inter-modal) to capture interactions within and between modalities and to model long-range conversational context. This branch focuses on global dependency modelling across the dialogue.
  • We introduce a directed multi-relational graph fusion module. This module represents the conversation as a directed graph with multiple explicitly defined relation types (e.g., intra-utterance cross-modal links, self-loops, temporally directed edges). It aggregates short-range contextual signals and speaker-dependent emotional shifts, enabling fine-grained local reasoning that complements the Transformer branch.
  • We conduct performance evaluations and ablation studies on the public IEMOCAP [3] and MELD [4] datasets. We compare MTG-ERC against strong recent baselines, report averaged results across multiple runs, and analyze variance and paired significance to assess robustness. We also analyze the computational characteristics of the two modules to clarify the accuracy–efficiency trade-off and guide deployment considerations.

2. Related Works

2.1. Multimodal Information Fusion

Deep learning has given rise to a significant body of research on multimodal methods. In 2017, Zadeh et al. [5] put forth the TFN model, which simulates the interaction of unimodal, bimodal, and trimodal modes through the calculation of Cartesian products. Nevertheless, the tensor fusion of TFN necessitates a considerable number of parameters, which consequently increases the computational complexity and memory consumption. To address this issue, Liu et al. [6] proposed the LMF model in 2018, which employs low-rank tensors for multimodal fusion, effectively reducing the model complexity and the number of parameters. Despite the success of these methods in multimodal fusion, they have not yet addressed the issue of temporal alignment across different modalities. To address this challenge, Tsai et al. [7] proposed the MulT model in 2019, which employs a pairwise cross-modal attention mechanism to capture the interaction between different time points and different modalities. However, this approach may under-represent the fine-grained information within each individual modality. In 2020, Sahay et al. [8] proposed the LMF-MulT model, which combined the advantages of MulT and LMF. It uses LMF to obtain multimodal representation and realizes the interaction fusion between modalities through a cross-modal attention mechanism. However, since LMF uses a low-rank matrix representation, some high-dimensional information may be lost, thereby making it harder to retain very subtle affective cues and complex interactions.
Building on these developments, we introduce a multi-level Transformer fusion module (referred to in this work as the Multi-Level Transformer Fusion Module, MTFM). This module combines intra-modal self-attention and inter-modal cross-attention to model both within-modality and cross-modality interactions. In this way, it captures long-range conversational context and enriches the emotional information present in each modality without relying solely on early fusion or purely pairwise cross-modal attention.

2.2. Conversational Emotion Recognition

The ERC model improves sentiment classification by modelling contextual information. There are three types of ERC model: those based on recurrent neural networks, those based on attention mechanisms, and those based on graph neural networks.
In 2019, Majumder et al. [9] proposed the DialogueRNN model, which is based on recurrent neural networks. This model employs three types of GRU to update the speaker features, context, and emotions in the conversation, respectively, thereby effectively capturing the dependencies between speakers and the conversation. Subsequently, Ghosal et al. [10] proposed the COSMIC model in 2020, which incorporates external common-sense knowledge into the DialogueRNN to enhance model performance. The model’s ability to comprehend emotions is enhanced through the introduction of an external knowledge base; however, this reliance on external sources may be a limitation. In 2021, Hu et al. [11] further introduced a multi-round reasoning module on the bidirectional LSTM to model emotions from a cognitive perspective. Nevertheless, models based on recurrent neural networks may suffer from vanishing or exploding gradients and reduced efficiency when modelling very long conversational dependencies.
Among the models based on the attention mechanism, Shen et al. [12] proposed the DialogXL model in 2021 with the objective of obtaining longer-term contextual information in the field of conversation emotion recognition. In 2021, Kim et al. [13] proposed the EmoBERTa model, which incorporates speaker information to facilitate the characteristics of speakers in the conversation and their relationships, and utilizes this information to enhance the accuracy of ERC. In 2022, Li et al. [14] proposed the CoG-BART model. The system conducts supervised contrastive learning to improve the model’s capacity for processing contextual information. However, it should be noted that a substantial amount of labelled data is required for supervised learning. Hu et al. [15] also proposed the Unimse model in 2022 with the objective of unifying the tasks of Multimodal Sentiment Analysis (MSA) and ERC. The model conducts modal fusion at both the syntactic and semantic levels, while also implementing contrastive learning between modalities and samples, with the aim of better capturing the differences and consistency between emotions. However, purely attention-based models may under-emphasize short-term, speaker-dependent emotional shifts in local conversational windows.
Among the models based on graph neural networks, Ghosal et al. [16] proposed the DialogueGCN model in 2019. This model regards each dialogue as a graph, associates each utterance with the surrounding utterances, and uses the graph structure to capture the complex relationship between utterances. In 2020, Ishiwatari et al. [17] proposed the RGAT model, which incorporates position encoding into DialogueGCN and combines positional information to enhance the accuracy of ERC. In 2021, Hu et al. [18] proposed the MMGCN model, which extends the DialogueGCN model to encompass multimodal scenarios. It performs graph convolution operations on the utterance representations of different modalities, thereby effectively capturing the interaction between different modalities. In 2023, Li et al. [19] proposed the GraphCFC module. This approach effectively models context and interaction information; however, it does not explicitly parameterize or analyze the full set of directed relational edge types in the conversation graph, which may limit fine-grained relational reasoning. More recent work (2023–2025) has begun to explore hybrid architectures that integrate Transformer-style global reasoning with graph-based local structure for conversational emotion recognition, highlighting the need to jointly model both long-range dependencies and local speaker-aware cues.
In this work, we introduce a directed multi-relational graph fusion module (DMRGF), which represents each conversation as a directed graph with multiple explicitly defined relation types (including intra-utterance cross-modal links, self-loops, and temporally directed edges). This module performs localized message passing to capture short-term, speaker-conditioned emotional dynamics. When combined with the multi-level Transformer fusion module described above, the overall MTG-ERC framework integrates global (long-range, cross-modal) and local (short-range, speaker-aware) conversational emotion features. We emphasize this integration as a reproducible formulation rather than claiming an entirely new paradigm.

3. Methodology

This paper presents a multimodal conversational emotion recognition model named MTG-ERC, which is composed of four main modules:
(1) Unimodal feature extraction module: responsible for extracting the features of each modality. Text features are extracted using a Transformer network, while audio and visual features are extracted through fully connected networks.
(2) Multi-level Transformer fusion module: intra-modal and inter-modal Transformer networks capture the interactive information in utterance sequences, integrating intra-modal and inter-modal interactions.
(3) Directed multi-relational graph fusion module: a directed graph with multiple relation types is constructed to learn local context representations, capture the relationships between utterances and modalities, and enhance the expression of local features.
(4) Multimodal emotion classification module: the weight of each modality is adjusted dynamically using a softmax function. The global and local context features obtained from the multi-level Transformer fusion module and the directed multi-relational graph fusion module are concatenated, achieving a comprehensive integration of global and local conversational emotion features, and the emotion label of each utterance is output.
The comprehensive framework of the model is illustrated in Figure 1.

3.1. Task Definition

A conversation consists of a sequence of consecutive utterances $u_1, u_2, \dots, u_N$ and speakers $s_1, s_2, \dots, s_M$. Each utterance $u_i$ comprises three modalities, text, audio, and vision, represented as $u_i^l$, $u_i^a$, and $u_i^v$. The objective of the ERC task is to assign to each utterance a label denoting the emotional state of the speaker, drawn from a predefined set of emotion labels $Y = \{y_1, y_2, \dots, y_K\}$.

3.2. Feature Extraction

For each utterance $u_i$, each modality conveys distinct emotional cues. The utterance representation is derived from the raw text, audio, and visual inputs $u_i^l \in \mathbb{R}^{d_l}$, $u_i^a \in \mathbb{R}^{d_a}$, and $u_i^v \in \mathbb{R}^{d_v}$, where $d_l$, $d_a$, and $d_v$ denote the feature dimensions of each modality.
For the textual modality, a Transformer model is used to extract the semantic features $x_i^l$ from $u_i^l$:
$$x_i^l = \mathrm{Transformer}(u_i^l, W_{trans}^l)$$
where $W_{trans}^l$ is the Transformer parameter to be learned.
For the audio and visual modalities, fully connected networks are employed for feature extraction:
$$x_i^a = \mathrm{FC}(u_i^a; W_{fc}^a), \qquad x_i^v = \mathrm{FC}(u_i^v; W_{fc}^v)$$
where $W_{fc}^a$ and $W_{fc}^v$ are trainable parameters of the fully connected layers.
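A minimal sketch of this extraction step is given below, assuming a PyTorch implementation; the class name `UnimodalFeatureExtractor`, the hidden size, and the number of encoder layers are illustrative choices rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class UnimodalFeatureExtractor(nn.Module):
    """Sketch: Transformer encoder for text, fully connected layers for audio/vision."""
    def __init__(self, d_l=768, d_a=100, d_v=512, d_model=256):
        super().__init__()
        self.text_proj = nn.Linear(d_l, d_model)  # project raw text features to model size
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.audio_fc = nn.Linear(d_a, d_model)   # FC extractor for audio
        self.visual_fc = nn.Linear(d_v, d_model)  # FC extractor for vision

    def forward(self, u_l, u_a, u_v):
        # u_l: (batch, N, d_l), u_a: (batch, N, d_a), u_v: (batch, N, d_v)
        x_l = self.text_encoder(self.text_proj(u_l))  # semantic text features x_i^l
        x_a = torch.relu(self.audio_fc(u_a))          # audio features x_i^a
        x_v = torch.relu(self.visual_fc(u_v))         # visual features x_i^v
        return x_l, x_a, x_v
```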

3.3. Speaker Embedding

To capture speaker information in the utterance sequence, a speaker embedding module is designed that maps each speaker in the conversation to a vector, so that the representation more accurately reflects the speaker's characteristics.
$$S_{emb} = \mathrm{Embedding}(S, M)$$
where $S$ denotes the set of speakers, $M$ the total number of participants in the conversation, and $S_{emb}$ the speaker embedding matrix.
The extracted utterance features are further enhanced by adding the corresponding speaker embeddings:
$$X^\tau = \eta S_{emb} + x^\tau, \qquad \tau \in \{l, a, v\}$$
where $x^\tau$ denotes the global context representation of the entire conversation after feature extraction for the text, audio, or visual modality, $X^\tau$ denotes the corresponding representation after speaker-embedding enhancement, and $\eta \in (0, 1)$ is the contribution rate.
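As a concrete illustration, a sketch of this enhancement step is shown below, assuming PyTorch; the embedding dimension and the value of the contribution rate eta are illustrative.

```python
import torch
import torch.nn as nn

class SpeakerEnhancement(nn.Module):
    """Sketch: add speaker embeddings to utterance features, scaled by a contribution rate eta."""
    def __init__(self, num_speakers, d_model=256, eta=0.5):
        super().__init__()
        self.embed = nn.Embedding(num_speakers, d_model)  # S_emb
        self.eta = eta                                    # contribution rate in (0, 1)

    def forward(self, x_tau, speaker_ids):
        # x_tau: (batch, N, d_model) features of one modality; speaker_ids: (batch, N) integer ids
        s_emb = self.embed(speaker_ids)                   # (batch, N, d_model)
        return self.eta * s_emb + x_tau                   # X_tau = eta * S_emb + x_tau
```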

3.4. Multi-Level Transformer Fusion Module

Intra-Modal and Inter-Modal Transformers

Intra-modal and Inter-modal Transformer networks have been developed with the objective of simulating the interaction within and between modalities in multimodal discourse sequences. These networks are designated as Intra-Transformer and Inter-Transformer, respectively, and their structural configurations are illustrated in Figure 2.
Previous research has concentrated on efficient interaction between the emotional cues of different modalities, with little attention paid to the emotional characteristics within each modality itself. This paper therefore presents the Intra-Transformer network, which employs multi-head self-attention to capture long-distance dependencies in the feature sequence, thus enhancing the emotional feature representation of each modality.
The Intra-Transformer network takes the global context representation $X^\tau$ of each modality as the input $Q$, $K$, and $V$, captures the intra-modal interactions between utterances, and enhances the emotional representation of the utterance sequence.
$$X_{\tau \to \tau} = \mathrm{IntraTransformer}(X^\tau, X^\tau, X^\tau)$$
where $\tau \in \{l, a, v\}$.
Although the data of different modalities are heterogeneous, each modality contains the sentiment information of the samples. Building on the MulT model, this paper presents an Inter-Transformer network, which combines multimodal feature data in pairs and inputs them into the cross-modal attention network. Additionally, the network employs auxiliary modal data to cyclically reinforce the target modal data, thereby supplementing the sentiment information in the target modal features.
The Inter-Transformer network sets $Q = X^\tau$ and $K = V = X^\upsilon$, so that the $\tau$ modality obtains information from the $\upsilon$ modality, thereby capturing the interaction between modalities.
$$X_{\upsilon \to \tau} = \mathrm{InterTransformer}(X^\tau, X^\upsilon, X^\upsilon)$$
where $\tau \in \{l, a, v\}$ and $\upsilon \in \{l, a, v\} \setminus \{\tau\}$.
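The following sketch illustrates how a single attention block can serve both roles, assuming PyTorch's `nn.MultiheadAttention`; the dimensions, residual connection, and layer normalization are illustrative rather than the exact Intra-/Inter-Transformer architecture.

```python
import torch
import torch.nn as nn

class CrossModalAttentionBlock(nn.Module):
    """Sketch: one attention block used both as Intra-Transformer (Q = K = V = X_tau)
    and as Inter-Transformer (Q = X_tau, K = V = X_upsilon)."""
    def __init__(self, d_model=256, n_heads=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, query, key_value):
        # query: target-modality sequence, key_value: source-modality sequence
        out, _ = self.attn(query, key_value, key_value)
        return self.norm(query + out)  # residual connection

# Usage: intra-modal vs. inter-modal enhancement on toy inputs
block = CrossModalAttentionBlock()
X_l, X_a = torch.randn(1, 20, 256), torch.randn(1, 20, 256)
X_ll = block(X_l, X_l)  # Intra-Transformer: text attends to itself
X_al = block(X_l, X_a)  # Inter-Transformer: text (query) absorbs audio information
```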

3.5. Single-Modality Gated Fusion

A gated neural network is used to filter out unimportant information in $X_{\upsilon \to \tau}$ while retaining the useful information.
$$g_{\upsilon \to \tau} = \sigma(W_{\upsilon \to \tau} X_{\upsilon \to \tau})$$
$$X_{\upsilon \to \tau} = X_{\upsilon \to \tau} \odot g_{\upsilon \to \tau}$$
where $W_{\upsilon \to \tau}$ is a weight matrix, $\sigma$ is the sigmoid function, $g_{\upsilon \to \tau}$ is the resulting gate, and $\odot$ denotes element-wise multiplication.
$X_{\tau \to \tau}$, $X_{\upsilon_1 \to \tau}$, and $X_{\upsilon_2 \to \tau}$ are then concatenated and passed through a fully connected layer to generate the enhanced sequence representations of the text, audio, and visual modalities.
$$X^\tau = W^\tau [X_{\tau \to \tau}; X_{\upsilon_1 \to \tau}; X_{\upsilon_2 \to \tau}] + b_m$$
where $W^\tau$ and $b_m$ are trainable parameters.
Let $X^\tau = [x_\tau^1; x_\tau^2; \dots; x_\tau^N]$, where $x_\tau^i$ is the $\tau$-modal representation of utterance $u_i$.
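A minimal sketch of this gated fusion is shown below, assuming PyTorch; the gate parameterization and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class GatedModalityFusion(nn.Module):
    """Sketch: sigmoid gate on each cross-modal stream, then concatenation and a linear layer."""
    def __init__(self, d_model=256):
        super().__init__()
        self.gate1 = nn.Linear(d_model, d_model)    # W for X_{upsilon1 -> tau}
        self.gate2 = nn.Linear(d_model, d_model)    # W for X_{upsilon2 -> tau}
        self.out = nn.Linear(3 * d_model, d_model)  # W_tau, b_m

    def forward(self, x_tt, x_u1t, x_u2t):
        x_u1t = x_u1t * torch.sigmoid(self.gate1(x_u1t))  # filter cross-modal stream 1
        x_u2t = x_u2t * torch.sigmoid(self.gate2(x_u2t))  # filter cross-modal stream 2
        # concatenate intra- and inter-modal streams and project to the enhanced X_tau
        return self.out(torch.cat([x_tt, x_u1t, x_u2t], dim=-1))
```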

3.6. Directed Multi-Relational Graph Fusion Module

3.6.1. Graph Structure

The structure of a conversation with $N$ utterances can be represented by a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{R})$, where $\mathcal{V}$, $\mathcal{E}$, and $\mathcal{R}$ denote the nodes, edges, and relation types over the text, audio, and visual modalities. To better model the relationships between and within modalities, this module treats each modality of each utterance as a node of the graph and uses directed edges to connect past and future utterances.
  • Node division: Each utterance $u_i$ generates three modal nodes, denoted as $u_i^l$, $u_i^a$, and $u_i^v$, whose feature vectors are $x_i^l$, $x_i^a$, and $x_i^v$, respectively.
  • Edge division: The edges included in the edge set depend on the scope of context considered during modelling. If the context of each utterance considered every other utterance in the conversation, the result would be a fully connected graph in which every vertex has an edge to all other vertices (including itself). Because this requires a significant amount of computation, this paper only considers local neighbour information and uses a context window $[P, F]$ to limit the scope, where $P$ and $F$ denote the numbers of past and future utterances connected, respectively. In our implementation, we use an asymmetric context window $[P, F] = [9, 5]$: for each utterance $u_t$, we create directed edges to up to 9 preceding utterances and 5 following utterances. We adopt an asymmetric window because the emotional state is typically influenced more by accumulated past context (speaker history, escalation), while only short-range future reactions are informative. We also compared symmetric windows such as [5, 5] and wider windows such as [9, 9]; [9, 5] provided the best balance between classification performance and graph size, so we use this setting in all experiments. A sketch of the resulting graph construction is given after this list.
The edge set is represented as $\{(u_i^\tau, u_j^\upsilon, r_{ij})\} \subseteq \mathcal{E}$, $\tau, \upsilon \in \{l, a, v\}$, where each edge denotes an interaction between nodes $u_i^\tau$ and $u_j^\upsilon$ with relation type $r_{ij} \in \mathcal{R}$. This module considers two sets of relations, $R_{intra}$ and $R_{inter}$. $R_{intra}$ captures the internal connections among the three modalities within the same utterance, reflecting the interaction of multiple modalities; in addition, each modality has a self-loop to reinforce its own information. Formula (10) lists the 9 edge types of $R_{intra}$:
$$R_{intra} = \{(u_i^a, u_i^l), (u_i^v, u_i^l), (u_i^l, u_i^l), (u_i^l, u_i^a), (u_i^v, u_i^a), (u_i^a, u_i^a), (u_i^l, u_i^v), (u_i^a, u_i^v), (u_i^v, u_i^v)\}$$
$R_{inter}$ captures the impact of the context (past and future utterances) of the same modality on the current utterance, reflecting the interaction between different time steps:
$$R_{inter} = \{u_j^\tau \rightarrow u_i^\tau \mid i - P < j < i\} \cup \{u_i^\tau \rightarrow u_j^\tau \mid i < j < i + F\}$$
where the first set denotes connections from past utterances to the current utterance, the second set denotes connections from the current utterance to future utterances, and $P$ and $F$ denote the numbers of past and future related utterances.
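The following sketch illustrates how such a graph could be assembled, with one node per (utterance, modality) pair, the 9 intra-utterance relation types and 6 temporal relation types described above; the helper function and relation indexing are illustrative, not the authors' exact construction.

```python
import torch

def build_conversation_graph(num_utterances, P=9, F=5, modalities=("l", "a", "v")):
    """Sketch: build directed edges for the conversation graph.
    Returns (edge_index, edge_type); relation ids 0-8 are intra-utterance, 9-14 temporal."""
    node = lambda i, m: i * len(modalities) + modalities.index(m)  # node per (utterance, modality)
    src, dst, rel = [], [], []
    # R_intra: cross-modal links and self-loops inside the same utterance (9 relation types)
    intra = [(a, b) for b in modalities for a in modalities]
    for i in range(num_utterances):
        for r, (m_from, m_to) in enumerate(intra):
            src.append(node(i, m_from)); dst.append(node(i, m_to)); rel.append(r)
    # R_inter: same-modality temporal edges within the [P, F] context window
    for i in range(num_utterances):
        for m_idx, m in enumerate(modalities):
            for j in range(max(0, i - P), i):                        # past -> current
                src.append(node(j, m)); dst.append(node(i, m)); rel.append(9 + m_idx)
            for j in range(i + 1, min(num_utterances, i + F + 1)):   # current -> future
                src.append(node(i, m)); dst.append(node(j, m)); rel.append(12 + m_idx)
    return torch.tensor([src, dst]), torch.tensor(rel)
```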

3.6.2. Graph Learning

To differentiate between the various types of edges and nodes and to capture more nuanced structural information and feature representations, this module employs the Relational Graph Convolutional Network (RGCN) [20]. The node representation for each relation type $r \in \mathcal{R}$ is inferred using a mapping function $f(H, W_r)$ with weight matrix $W_r$; the final node representation $\sum_{r \in \mathcal{R}} f(H, W_r)$ is obtained by combining the 15 different edge types.
The formula for the $i$-th utterance is:
$$g_i^\tau = \sum_{r \in \mathcal{R}} \sum_{j \in N_r(i)} \frac{1}{|N_r(i)|} W_r x_j^\tau + W_0 x_i^\tau$$
where $N_r(i)$ denotes the set of neighbour nodes that have relation $r$ with node $i$, and $W_r$ and $W_0$ are learnable parameters of the RGCN.
In order to further extract information from the graph structure data, this module employs the Graph Transformer network [21]. The Graph Transformer network comprises layers containing a self-attention mechanism, which enables each node to attend to the information of other nodes in the graph, thereby enhancing the feature expression ability.
The $g_i^\tau$ obtained from the RGCN is fed into the Graph Transformer network as follows:
$$o_i^\tau = \Big\Vert_{c=1}^{C} \Big( W_1 g_i^\tau + \sum_{j \in N(i)} a_{i,j}^\tau W_2 g_j^\tau \Big)$$
where $N(i)$ is the set of nodes adjacent to node $i$, $W_1$ and $W_2$ are learnable parameters of the Graph Transformer network, and $C$ is the number of concatenated attention heads.
The attention coefficient $a_{i,j}^\tau$ for node $j$ is computed as follows:
$$a_{i,j}^\tau = \mathrm{softmax}\!\left( \frac{(W_3 g_i^\tau)^{\mathsf{T}} (W_4 g_j^\tau)}{\sqrt{d}} \right)$$
where $W_3$ and $W_4$ are learnable parameters and $d$ is the feature dimension.
After aggregation over the entire graph, we obtain a new vector representation:
$$G^\tau = [o_1^\tau, o_2^\tau, \dots, o_N^\tau]$$
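A minimal sketch of this two-stage graph learning step is shown below, assuming PyTorch Geometric's `RGCNConv` and `TransformerConv` as stand-ins for the RGCN and Graph Transformer layers; the layer sizes, head count, and random toy graph are illustrative.

```python
import torch
from torch_geometric.nn import RGCNConv, TransformerConv

# Sketch: relational graph convolution followed by a graph transformer layer.
num_nodes, d_in, d_hid, num_relations = 60, 256, 256, 15
x = torch.randn(num_nodes, d_in)                   # node features x_i^tau
edge_index = torch.randint(0, num_nodes, (2, 500)) # toy edges (use build_conversation_graph in practice)
edge_type = torch.randint(0, num_relations, (500,))

rgcn = RGCNConv(d_in, d_hid, num_relations=num_relations)  # relation-aware aggregation -> g_i^tau
graph_tf = TransformerConv(d_hid, d_hid // 4, heads=4)     # attention over neighbours -> o_i^tau

g = torch.relu(rgcn(x, edge_index, edge_type))
o = graph_tf(g, edge_index)                                # (num_nodes, d_hid) local context representation
```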

3.7. Multimodal Sentiment Classification

3.7.1. Multimodal Gated Fusion

A gating mechanism utilizing a Softmax function is devised to facilitate the dynamic learning of the modal weight associated with each utterance. In particular, the final multimodal representation of an utterance is determined by the following Equation (16):
$$[g_t^i; g_a^i; g_v^i] = \mathrm{softmax}([W x_t^i; W x_a^i; W x_v^i]), \qquad [\tilde{g}_t^i; \tilde{g}_a^i; \tilde{g}_v^i] = \mathrm{softmax}([W o_t^i; W o_a^i; W o_v^i])$$
where $W$ is the weight matrix and $g_t^i$, $g_a^i$, $g_v^i$ (respectively $\tilde{g}_t^i$, $\tilde{g}_a^i$, $\tilde{g}_v^i$) are the weights learned for the text, audio, and visual representations of the Transformer branch (respectively the graph branch).
The ultimate multimodal representation of each utterance is produced through the weighted addition of the enhanced representations of each modality.
$$x^i = \sum_{\tau \in \{t, a, v\}} x_\tau^i \, g_\tau^i, \qquad o^i = \sum_{\tau \in \{t, a, v\}} o_\tau^i \, \tilde{g}_\tau^i$$
where $x^i$ and $o^i$ are the multimodal representations of utterance $u_i$ in the Transformer branch and the graph branch, respectively.
The multimodal sequences generated by the multi-level Transformer fusion module and the directed multi-relational graph fusion module are given by:
$$X = [x^1; x^2; \dots; x^N]$$
$$G = [o^1, o^2, \dots, o^N]$$
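A sketch of this softmax gating is shown below, assuming PyTorch; the shared scoring layer is one illustrative parameterization of the weight matrix W, applied identically to the Transformer-branch and graph-branch representations.

```python
import torch
import torch.nn as nn

class SoftmaxModalityGate(nn.Module):
    """Sketch: softmax gate that weights the text/audio/visual representations of each
    utterance and sums them into a single multimodal vector (used for both branches)."""
    def __init__(self, d_model=256):
        super().__init__()
        self.score = nn.Linear(d_model, 1)  # shared weight W producing one score per modality

    def forward(self, x_t, x_a, x_v):
        # each input: (N, d_model) per-modality utterance representations
        scores = torch.cat([self.score(x_t), self.score(x_a), self.score(x_v)], dim=-1)  # (N, 3)
        g = torch.softmax(scores, dim=-1)                # modality weights g_t, g_a, g_v
        stacked = torch.stack([x_t, x_a, x_v], dim=-2)   # (N, 3, d_model)
        return (g.unsqueeze(-1) * stacked).sum(dim=-2)   # weighted sum -> x^i (or o^i)
```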

3.7.2. Emotion Label Generation

The global and local context representations generated by the multi-level Transformer fusion module and the directed multi-relational graph fusion module are concatenated, as illustrated in Equation (20):
$$H = \mathrm{Fusion}(X, G) = [[x^1; x^2; \dots; x^N], [o^1, o^2, \dots, o^N]]$$
where $\mathrm{Fusion}(\cdot)$ denotes the concatenation operation.
$H$ is fed into a fully connected layer for dimensionality reduction, after which prediction and classification are performed to obtain the emotion label of utterance $u_i$. The relevant calculations are shown in Equations (21)–(23):
$$p_i = \mathrm{ReLU}(\phi_0 h_i + b_0)$$
$$q_i = \mathrm{softmax}(\phi_1 p_i + b_1)$$
$$y_i = \arg\max(q_i)$$
where ϕ 0 and ϕ 1 are learnable parameters.
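A minimal sketch of this classification head is given below, assuming PyTorch; the hidden size and class count are illustrative.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Sketch: concatenate Transformer-branch and graph-branch representations,
    reduce dimensionality, and predict class probabilities (Equations (21)-(23))."""
    def __init__(self, d_model=256, num_classes=6):
        super().__init__()
        self.reduce = nn.Linear(2 * d_model, d_model)  # phi_0, b_0
        self.out = nn.Linear(d_model, num_classes)     # phi_1, b_1

    def forward(self, x_i, o_i):
        h_i = torch.cat([x_i, o_i], dim=-1)            # H = Fusion(X, G)
        p_i = torch.relu(self.reduce(h_i))
        return torch.softmax(self.out(p_i), dim=-1)    # q_i

# Usage on toy inputs: predicted labels y_i for 10 utterances
clf = EmotionClassifier()
x_i, o_i = torch.randn(10, 256), torch.randn(10, 256)
y_i = clf(x_i, o_i).argmax(dim=-1)
```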

4. Experiments

4.1. Dataset

The MTG-ERC model is evaluated on two datasets: IEMOCAP and MELD. The IEMOCAP dataset comprises conversational data from ten actors engaged in emotional interactions, with emotion labels divided into six categories. This paper employs two experimental settings: a 4-category setting (angry, sad, happy, neutral) and a 6-category setting. The MELD dataset comprises 1433 dialogues from the TV series “Friends”, with video, text, and audio data. Its emotion labels are divided into seven categories: neutral, surprise, fear, sadness, joy, disgust, and anger. Table 1 shows the distribution of training and test samples in the two datasets.

4.2. Experimental Parameter Settings and Evaluation Indicators

In the IEMOCAP dataset, text features are extracted by sBERT with a feature dimension of 768, audio features are extracted by OpenSmile with a feature dimension of 100 and visual features are extracted by OpenFace with a feature dimension of 512.
In the MELD dataset, text features are extracted by the RoBERTa Large model with a feature dimension of 1024, audio features are extracted by OpenSmile with a feature dimension of 300 and visual features are extracted by DenseNet with a feature dimension of 342.
The MTG-ERC model uses Adam as the optimizer, the learning rate is set to 0.0003, the dropout rate is set to 0.5, and the numbers of attention heads in the Graph Transformer and the Intra/Inter-Transformer networks are 7 and 2, respectively. The context window $[P, F]$ is set to [9,5]. The learning rate and dropout values were selected based on validation performance: we performed a small grid search over learning rates {0.0001, 0.0003, 0.0005} and dropout rates {0.3, 0.5, 0.7} on the development split of IEMOCAP, and (0.0003, 0.5) provided the best trade-off between convergence speed and generalization. We keep the same configuration for MELD for consistency and reproducibility.
The model employs two evaluation indicators: accuracy (ACC) and weighted F1 (w-F1). The weighted F1 weights the F1 score of each category by its share of utterances in the dataset; given the uneven distribution of emotion samples, this reflects the model’s overall performance more accurately. The calculation method is illustrated in Formula (25).
$$w\text{-}F1 = \frac{\sum_{k=1}^{K} N_k \times F1_k}{\sum_{k=1}^{K} N_k}$$
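As a sanity check, the weighted F1 of Formula (25) coincides with scikit-learn's support-weighted average, as the short sketch below illustrates on toy labels.

```python
from sklearn.metrics import f1_score

# Sketch: Formula (25) weights each class's F1 by its support N_k,
# which is exactly what scikit-learn's average="weighted" computes.
y_true = [0, 0, 1, 2, 2, 2, 3]  # illustrative emotion labels
y_pred = [0, 1, 1, 2, 2, 0, 3]

w_f1 = f1_score(y_true, y_pred, average="weighted")
print(f"weighted F1 = {w_f1:.4f}")
```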

4.3. Comparison with Baselines

This study compares the proposed MTG-ERC model with baseline methods in the multimodal emotion classification setting of the IEMOCAP dataset. In the tables, bold font indicates the best result for each emotion and underlining indicates the second-best result. The results are presented in Table 2 and Table 3. In the six-category setting, the MTG-ERC model achieves an accuracy of 70.12% and a weighted F1 score of 70.23%, an improvement of 0.99% in accuracy and 1.32% in weighted F1 over the GraphCFC model. Its F1 scores for the happy and frustrated categories are the best, while those for the sad, neutral, angry, and excited categories are the second best. In the four-category setting, MTG-ERC outperforms the baseline models, attaining an accuracy of 84.77% and a weighted F1 score of 84.51%, an improvement of 1.48% in accuracy and 1.30% in weighted F1 over the COGMEN model.
We also compare MTG-ERC with the baseline methods on the MELD dataset for multimodal emotion classification; the results are presented in Table 4. In the seven-category setting, the MTG-ERC model achieves an accuracy of 65.23% and a weighted F1 score of 65.32%, with the best per-class results for the neutral, fear, joy, and disgust categories. Notably, MTG-ERC performs well in differentiating between similar emotions: its F1 scores reach 18.31% for fear and 15.30% for disgust, markedly higher than the baselines.
Figure 3 depicts the confusion matrices of predicted versus true labels for each emotion in the 6-way and 4-way settings of the IEMOCAP dataset. The confusion matrix serves as a visualization tool that gives a more intuitive view of the emotion recognition results; the colour depth in the matrix corresponds to the classification accuracy, with darker colours indicating higher accuracy.

4.4. Ablation Study

In order to evaluate the relative effectiveness of different modal tasks, the three modalities of text (T), audio (A) and vision (V) are combined and compared. The detailed performance results of each combination are presented in Table 5.
As demonstrated in Table 5, the accuracy and weighted F1 value of the multimodal emotion recognition task are superior to those of the unimodal task, thereby substantiating the significance of multimodal emotion recognition. Furthermore, the experimental results demonstrate that the unimodal tasks of text and audio contribute more to emotion recognition than the visual unimodal tasks.
To assess the efficacy of the multi-level Transformer fusion module and the directed multi-relational graph fusion module in multimodal emotion recognition, ablation experiments were conducted on the components of these two modules.
As shown in Table 6, the Inter-Transformer module, the Intra-Transformer module, and the edge-relationship modelling all contribute to the accuracy of the ERC task. Removing the $R_{intra}$ edges caused the largest decline, with reductions of 2.15% in accuracy and 2.11% in weighted F1 on the IEMOCAP (6-way) dataset. This suggests that the edge-relationship modelling in the directed multi-relational graph fusion module plays a pivotal role in multimodal conversational emotion recognition.
The analysis in Table 5 indicates that the visual modality contributes less to the final performance compared to the textual and audio modalities. This observation can be attributed to several factors:
Inherent Information Density in Conversational Settings: In dyadic or multi-party conversations (as in IEMOCAP and MELD), the primary carriers of emotional information are often the linguistic content (text) and the paralinguistic cues (audio, e.g., tone, pitch). Visual cues, such as facial expressions, while valuable, can be more subtle, transient, or even intentionally suppressed in certain social contexts. Consequently, their standalone discriminative power for emotion classification is inherently lower than that of the other two modalities.
Challenges in Visual Data Quality: The visual stream is particularly susceptible to practical issues such as varying lighting conditions, partial occlusions, non-frontal face orientations, and lower effective resolution in video data. These factors introduce noise and sparsity into the visual feature representations, making them less reliable than the relatively clean and stable features extracted from text and audio.
Limitations of Feature Representation: Our model relies on standard visual feature extractors (e.g., OpenFace). While these tools provide robust low-level facial action unit descriptors, they may not fully capture the high-level, dynamic, and complex temporal patterns of facial expressions that are crucial for distinguishing between certain fine-grained emotions (e.g., frustration vs. sadness). This represents a general challenge in the field rather than a specific shortcoming of our fusion architecture.
This discussion underscores that the lower contribution of the visual modality is likely a reflection of the dataset characteristics and inherent challenges in visual emotion recognition. Therefore, future work could focus on developing more sophisticated and dynamic visual feature extractors to better harness the complementary information that the visual channel provides.

4.5. Confusion Matrix Analysis

To gain deeper insights into the model’s limitations and error patterns, we conducted a thorough analysis of the confusion matrices on the IEMOCAP dataset, as visualized in Figure 3. This analysis reveals which specific emotion pairs are most frequently confused by the MTG-ERC model.
In the 6-way classification setting (Figure 3a), we observe several consistent confusion patterns:
Happiness vs. Excitement: The most prominent confusion occurs between happy and excited. A significant number of excited utterances are misclassified as happy. This is likely because both emotions share similar high-arousal acoustic properties (e.g., increased pitch and speech rate) and often occur in positive contexts, making them challenging to distinguish, especially for audio-centric models.
Sadness vs. Frustration: There is a notable bidirectional confusion between sad and frustrated. Both are negative, low-valence emotions. The model sometimes struggles to capture the subtle differences, possibly because frustration can be a manifestation or a precursor to sadness in conversational contexts, and their textual expressions can be similar.
Anger vs. Frustration: The model also confuses angry with frustrated. These are both high-arousal, negative emotions. The distinction is often nuanced and context-dependent, relying on the intensity of the emotion, which the model may not always perfectly discern.
In the 4-way setting (Figure 3b), which consolidates some of the finer categories, the confusion between happy and the removed excited category is naturally absorbed, leading to a generally cleaner matrix. However, the challenge of distinguishing between the core negative emotions (sad, angry, frustrated consolidated) persists, indicating an inherent difficulty in fine-grained negative emotion classification.
Implications and Limitations: These findings highlight the primary limitation of our current model: its difficulty in disambiguating emotion pairs that are either acoustically proximate (e.g., happy/excited) or semantically and contextually similar in their negative valence (e.g., sad/frustrated/angry). This suggests that future work could benefit from designing more nuanced feature extractors capable of capturing subtle acoustic differences or incorporating external commonsense knowledge to better understand the contextual cues that differentiate these complex emotions.

5. Conclusions

The field of sentiment analysis is currently witnessing a growing interest in multimodal conversational emotion recognition, as the information derived from different modalities can reinforce each other, thereby enhancing the robustness of the system. This paper presents a unified framework for the problem of conversational emotion recognition, designated MTG-ERC. This model is designed to capture the emotional information conveyed within and between modalities, as well as the emotional characteristics of global and local contexts, through the utilization of a multi-level Transformer fusion module and a directed multi-relational graph fusion module (for local, speaker-dependent conversational structure). Experimental results on the IEMOCAP and MELD datasets show consistent improvements (approximately 1% absolute in both accuracy and weighted F1) over a strong graph-based baseline. We further include ablation studies to isolate the contribution of each branch and to analyze robustness.

Author Contributions

Formal analysis, L.Y.; Data curation, C.W.; Writing—original draft, T.Y.; Writing—review & editing, H.J.; Project administration, X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Fu, Y.; Yuan, S.; Zhang, C.; Cao, J. Emotion Recognition in Conversations: A Survey Focusing on Context, Speaker Dependencies, and Fusion Methods. Electronics 2023, 12, 4714. [Google Scholar] [CrossRef]
  2. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  3. Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
  4. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July—2 August 2019; pp. 527–536. [Google Scholar]
  5. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor Fusion Network for Multimodal Sentiment Analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; pp. 1103–1114. [Google Scholar]
  6. Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.B.; Morency, L.P. Efficient Low-rank Multimodal Fusion With Modality-Specific Factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; pp. 2247–2256. [Google Scholar]
  7. Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the Conference Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6558–6569. [Google Scholar]
  8. Sahay, S.; Okur, E.; Kumar, S.H.; Nachman, L. Low Rank Fusion based Transformers for Multimodal Sequences. In Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML); ACL: Seattle, WA, USA, 2020; pp. 29–34. [Google Scholar]
  9. Majumder, N.; Poria, S.; Hazarika, D.; Mihalcea, R.; Gelbukh, A.; Cambria, E. Dialoguernn: An attentive rnn for emotion detection in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6818–6825. [Google Scholar]
  10. Ghosal, D.; Majumder, N.; Gelbukh, A.; Mihalcea, R.; Poria, S. COSMIC: COmmonSense knowledge for eMotion Identification in Conversations. In Findings of the Association for Computational Linguistics: EMNLP 2020; ACL: Seattle, WA, USA, 2020; pp. 2470–2481. [Google Scholar]
  11. Hu, D.; Wei, L.; Huai, X. DialogueCRN: Contextual Reasoning Networks for Emotion Recognition in Conversations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; pp. 7042–7052. [Google Scholar]
  12. Shen, W.; Chen, J.; Quan, X.; Xie, Z. Dialogxl: All-in-one xlnet for multi-party conversation emotion recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 13789–13797. [Google Scholar]
  13. Kim, T.; Vossen, P. Emoberta: Speaker-aware emotion recognition in conversation with roberta. arXiv 2021, arXiv:2108.12009. [Google Scholar]
  14. Li, S.; Yan, H.; Qiu, X. Contrast and generation make bart a good dialogue emotion recognizer. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 11002–11010. [Google Scholar]
  15. Hu, G.; Lin, T.E.; Zhao, Y.; Lu, G.; Wu, Y.; Li, Y. UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 7837–7851. [Google Scholar]
  16. Ghosal, D.; Majumder, N.; Poria, S.; Chhaya, N.; Gelbukh, A. DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 154–164. [Google Scholar]
  17. Ishiwatari, T.; Yasuda, Y.; Miyazaki, T.; Goto, J. Relation-aware graph attention networks with relational position encodings for emotion recognition in conversations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 7360–7370. [Google Scholar]
  18. Hu, J.; Liu, Y.; Zhao, J.; Jin, Q. MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; pp. 5666–5675. [Google Scholar]
  19. Li, J.; Wang, X.; Lv, G.; Zeng, Z. GraphCFC: A Directed Graph Based Cross-Modal Feature Complementation Approach for Multimodal Conversational Emotion Recognition. IEEE Trans. Multimed. 2023, 26, 77–89. [Google Scholar] [CrossRef]
  20. Schlichtkrull, M.; Kipf, T.N.; Bloem, P.; Van Den Berg, R.; Titov, I.; Welling, M. Modeling relational data with graph convolutional networks. In The Semantic Web, Proceedings of the 15th International Conference, ESWC 2018, Heraklion, Greece, 3–7 June 2018; Springer International Publishing: Berlin/Heidelberg, Germany, 2018; pp. 593–607. [Google Scholar]
  21. Yun, S.; Jeong, M.; Kim, R.; Kang, J.; Kim, H.J. Graph transformer networks. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar] [CrossRef]
Figure 1. MTG-ERC model.
Figure 2. Intra-Transformer structure (a) and Inter-Transformer structure (b).
Figure 3. Confusion matrix of IEMOCAP (6-way) (a) and confusion matrix of IEMOCAP (4-way) (b).
Table 1. Distribution of dialogues and utterances.
Datasets | Dialogues (Train / Test) | Utterances (Train / Test)
IEMOCAP (6-way) | 120 / 31 | 5810 / 1623
IEMOCAP (4-way) | 120 / 31 | 3600 / 943
MELD | 1153 / 280 | 11098 / 2610
Table 2. Experimental results of each model on the IEMOCAP (6-way) dataset.
Model | Happy | Sad | Neutral | Angry | Excited | Frustrated | Acc | w-F1
DialogueRNN [9] | 33.18 | 78.80 | 59.21 | 65.28 | 71.86 | 58.91 | 63.40 | 62.75
DialogueGCN [17] | 42.75 | 80.88 | 63.54 | 64.19 | 63.08 | 61.21 | 65.25 | 64.18
MMGCN [18] | 42.34 | 78.67 | 61.73 | 69.00 | 74.33 | 62.32 | 65.43 | 66.22
ICON [22] | 32.80 | 74.41 | 60.69 | 68.23 | 68.41 | 66.28 | 64.02 | 63.54
COGMEN [23] | 51.90 | 81.70 | 68.60 | 66.00 | 75.30 | 58.20 | 68.20 | 67.60
GraphCFC [19] | 43.23 | 84.45 | 64.13 | 71.37 | 78.19 | 62.74 | 69.13 | 68.91
MTG-ERC | 55.59 | 82.09 | 65.62 | 69.62 | 75.56 | 66.64 | 70.12 | 70.23
Table 3. Experimental results of each model on the IEMOCAP (4-way) dataset.
Model | Acc | w-F1
BC-LSTM [24] | 75.20 | 75.13
CHFusion [25] | 76.59 | 76.80
COGMEN [23] | 83.29 | 83.21
MTG-ERC | 84.77 | 84.51
Table 4. Experimental results of each model on the MELD dataset.
Model | Neutral | Surprise | Fear | Sadness | Joy | Disgust | Angry | Acc | w-F1
BC-LSTM [24] | 73.80 | 47.70 | 5.40 | 25.10 | 51.30 | 5.20 | 38.40 | 57.50 | 55.90
DialogueRNN [9] | 76.56 | 49.40 | 1.20 | 23.80 | 50.70 | 1.70 | 41.50 | 56.10 | 55.90
DialogueGCN [17] | - | - | - | - | - | - | - | 57.97 | 58.10
MMGCN [18] | 76.33 | 48.15 | - | 26.74 | 53.02 | - | 46.09 | 58.34 | 58.65
EmoCaps [26] | 77.12 | 63.19 | 3.03 | 42.52 | 57.50 | 7.69 | 57.54 | - | 64.00
UniMSE [15] | - | - | - | - | - | - | - | 65.09 | 65.51
GraphCFC [19] | 75.48 | 48.30 | 1.90 | 26.33 | 51.58 | 3.30 | 46.41 | 61.42 | 62.08
MTG-ERC | 78.91 | 59.72 | 18.31 | 40.23 | 64.10 | 15.30 | 54.96 | 65.25 | 65.32
Table 5. Ablation experiments on each modality on IEMOCAP (6-way) and IEMOCAP (4-way) datasets.
Modality Settings | Acc (6-Way) | w-F1 (6-Way) | Acc (4-Way) | w-F1 (4-Way)
T | 66.17 | 66.19 | 82.08 | 81.99
A | 57.81 | 56.07 | 63.31 | 61.17
V | 42.04 | 41.78 | 46.34 | 46.36
T + A | 68.22 | 68.35 | 83.35 | 83.32
V + T | 66.42 | 66.44 | 81.55 | 81.54
A + V | 59.88 | 60.18 | 65.14 | 63.70
MTG-ERC | 70.13 | 70.23 | 84.77 | 84.51
Table 6. Ablation experiments of each module on IEMOCAP (6-way) and IEMOCAP (4-way) datasets.
Module Setting | Acc (6-Way) | w-F1 (6-Way) | Acc (4-Way) | w-F1 (4-Way)
w/o inter-transformer | 69.32 | 69.47 | 83.13 | 82.97
w/o intra-transformer | 68.44 | 68.63 | 82.38 | 82.16
w/o $R_{inter}$ | 68.35 | 68.52 | 82.02 | 81.97
w/o $R_{intra}$ | 67.98 | 68.12 | 82.43 | 82.37
MTG-ERC | 70.13 | 70.23 | 84.77 | 84.51
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
