1. Introduction
With the increasing digitalization of human–computer interaction, emotion recognition has emerged as a pivotal domain in the pursuit of high-order cognitive intelligence [1], garnering significant academic interest and promising application prospects [2]. Historically, constrained by data acquisition technologies and algorithmic bottlenecks, early research primarily focused on sentiment intensity classification within unimodal text [3]. However, cognitive neuroscience reveals that human emotional expression is inherently a complex process governed by multimodal synergy [4]. Beyond the explicit semantic information conveyed by text [5], nonverbal signals—including vocal prosody, facial expressions, and body gestures—carry a wealth of irreplaceable emotional cues [6]. Moreover, significant complementarity and correlation exist across these modalities [7]. Recent studies confirm that by deeply fusing heterogeneous data from linguistic, acoustic, and visual sources, multimodal emotion recognition (MER) can effectively resolve semantic ambiguities (e.g., irony and metaphors) prevalent in unimodal contexts. This integration facilitates the construction of robust multimodal representations [8], leading to a more precise understanding of complex human affective states. Despite its potential, mainstream MER methodologies still face several core challenges:
First, modal heterogeneity leads to the “semantic gap” and feature entanglement. Text, audio, and video exhibit fundamental differences in representation formats, sampling frequencies, and data distributions [9]. Conventional fusion methods often force different modalities into a shared subspace, neglecting distinct physical properties such as the semantic relatedness of text, the temporal dynamics of audio, and the spatial correlations of imagery. This oversight complicates the alignment of heterogeneous data. Furthermore, raw features are frequently confounded by emotion-irrelevant private attributes [10], such as background noise, speaker identity, and ambient lighting. Without an explicit decoupling mechanism, emotional semantics become highly entangled with modality-specific noise, causing models to overfit to environmental artifacts rather than the essence of emotional features, which severely weakens cross-scene generalization.
Second, cross-modal interactions exhibit dynamic, time-varying characteristics. In continuous interactions, the emotional contribution of different modalities fluctuates over time [11]. For instance, during active speech, acoustic prosody and textual semantics may dominate; however, during silences or pauses, facial expressions and subtle gestures often provide critical emotional cues [12]. Traditional static fusion paradigms—such as feature concatenation or fixed-weight attention mechanisms—struggle to capture these nonlinear, time-varying intermodal correlations. Consequently, they fail to handle complex scenarios like asynchrony or missing modalities [13]. Moreover, such methods are prone to introducing intramodal noise (e.g., background chatter) and intermodal redundancy (e.g., positive sentiment expressed simultaneously in text and vision), which hampers predictive performance.
Furthermore, emotion recognition faces challenges in temporal modeling and data distribution. The evolution of emotional states exhibits long-range dependencies. Traditional recurrent neural networks are often limited by vanishing gradients when processing long sequences, making it difficult to capture distant contextual dependencies [14]. Simultaneously, emotion datasets in naturalistic settings frequently suffer from severe long-tail distributions, where samples are concentrated in neutral categories while extreme or specific emotions remain sparse [15]. This imbalance causes models to favor majority classes, thereby diminishing the recognition efficacy for critical minority emotions.
To address these challenges, we propose the Dynamic Heterogeneous Graph Temporal Network (DHGTN), an end-to-end multimodal emotion recognition framework. Adopting a strategy of “divide-and-conquer, dynamic fusion, and global reasoning”, DHGTN systematically tackles the aforementioned hurdles.
The main contributions of this work are summarized as follows:
Contrastive learning-based multimodal feature decoupling: We construct a “Shared–Private” subspace projection mechanism utilizing Wav2vec 2.0, VideoMAE, and BERT backbones. By imposing orthogonality constraints and self-supervised contrastive constraints, we explicitly disentangle emotion-common semantics from modality-specific noise. This design minimizes redundancy and ensures robust feature alignment without relying on unverified semantic assumptions.
Dynamic heterogeneous graph temporal fusion: Moving beyond static fusion, we introduce a dynamic graph structure that is fully connected at each timestep to capture instantaneous cross-modal correlations. We utilize a graph attention mechanism to adaptively assign weights to both intramodal and intermodal edges, enabling the model to track time-varying interactions with high granularity.
Comprehensive composite optimization system: We propose a multidimensional hybrid loss function that integrates shared space contrastive loss, temporal consistency loss, and supervised contrastive loss. This system works in tandem with the global reasoning capabilities of Transformers to effectively resolve long-range dependencies and mitigate the impact of class imbalance in emotional data.
Superior empirical performance: Extensive experiments on the IEMOCAP and MELD benchmarks demonstrate that DHGTN significantly outperforms state-of-the-art baselines. Specifically, our method achieves a weighted F1-score improvement of 2.03% on IEMOCAP and 1.35% on MELD compared to the runner-up model, validating the effectiveness of our dynamic interaction modeling.
2. Related Works
2.1. Multimodal Emotion Recognition
Unimodal data typically captures only localized emotional cues, making it difficult to comprehensively model complex and fluid affective states. Consequently, multimodal emotion recognition seeks to integrate complementary information from textual, acoustic, and visual modalities to overcome the representational limitations of a single source, thereby significantly enhancing accuracy and robustness. Existing multimodal fusion mechanisms are generally categorized into two paradigms: representation-learning-based fusion and attention-based interaction.
In the realm of representation learning, research focuses on designing effective strategies for the unified representation and integration of multimodal features. Zadeh et al. [16] introduced the Tensor Fusion Network (TFN), which utilizes Cartesian products to construct multidimensional tensors, explicitly modeling all potential unimodal, bimodal, and trimodal interactions. While the TFN excels at capturing high-order interactive features, the exponential growth of feature dimensionality incurs prohibitive computational costs. To mitigate this, Liu et al. [17] proposed Low-rank Multimodal Fusion (LMF), leveraging low-rank tensor decomposition to maintain performance while significantly reducing overhead. Subsequently, Mai et al. [18] developed the Hierarchical Feature Fusion Network (HFFN), which employs a hierarchical tensor decomposition architecture to consider both local and global interactions, enhancing the extraction of latent structures in multimodal temporal data.
In attention-based methods, the Transformer architecture has become the de facto standard due to its global self-attention mechanism. Its capability to capture deep contextual representations has been verified not only in standard NLP benchmarks but also in specialized automated analysis systems [19], confirming its robustness in feature extraction. By computing attention weights in parallel, these models dynamically focus on key timesteps or modal features that contribute significantly to emotional prediction, effectively capturing long-range dependencies. For instance, Tsai et al. [20] proposed the Multimodal Transformer (MulT), which achieves feature alignment and interaction via directional pairwise cross-modal attention. Rahman et al. [21] introduced a multimodal adaptation gate (MAG) to inject nonverbal information into pretrained language models, achieving effective fusion of linguistic knowledge and multimodal perception. Furthermore, Ou et al. [22] designed the Multimodal Local and Global Attention Network (MMLGAN) to integrate diverse representations and generate discriminative emotional features.
Despite these advancements, existing fusion methods face critical challenges. First, the lack of explicit spatiotemporal calibration often allows noise to infiltrate cross-modal features during high-order interactions. Second, the issue of feature entanglement remains unresolved; failing to decouple modal-specific attributes from shared emotional features results in redundant information interfering with the final representation, thus undermining model robustness.
2.2. Graph Neural Networks
Graph neural networks offer a novel perspective for MER by modeling multimodal data as topological structures with complex dependencies, leveraging their superiority in processing non-Euclidean data. Moreover, the robustness of GNNs for text classification has been validated in diverse practical applications [23], providing empirical evidence of their effectiveness in capturing latent semantic dependencies. Under the GNN framework, modalities or temporal segments are treated as nodes, while interactions are represented as edges. Through message-passing mechanisms, GNNs dynamically aggregate neighborhood information to capture deep-seated emotional correlations.
Researchers have achieved significant results by designing sophisticated graph topologies for context and cross-modal interaction modeling. Hu et al. [24] proposed the Multimodal Graph Convolutional Network (MMGCN), which constructs a heterogeneous conversational graph with multimodal nodes and applies spectral convolutions to model long-range cross-modal dependencies. To achieve finer control over information flow, Li et al. [25] introduced the Graph and Attention-based Multisource Integration Framework (GA2MIF), which utilizes multihead directed graph attention and pairwise cross-channel attention to decouple and collaboratively optimize contextual modeling and cross-modal interaction. Similarly, Lu et al. [26] proposed the Bistream Graph Multimodal Fusion (BiGMF) architecture, which constructs independent unimodal and cross-modal graphs to capture intramodal temporal dependencies and intermodal pairwise interactions, respectively.
To address structural complexity and data quality issues, subsequent works have further expanded the boundaries of GNNs. Lian et al. [27] designed the Graph Completion Network (GCNet), utilizing topological completion to handle incomplete conversational data by inferring missing semantic and emotional links. Wei et al. [28] developed the Dialogue and Event Relation Aware GCN (DER-GCN), which employs weighted multirelational graphs and self-supervised masked graph autoencoders to capture causal dependencies between speakers and events. Furthermore, Du et al. [29] proposed the Hierarchical Graph Contrastive Learning (HGCLLG) network, which maximizes the mutual information between local and global views of a single utterance, enabling the model to learn fine-grained local features and high-level global semantics.
However, these GNN-based MER methods exhibit two critical limitations regarding dynamic interactions and feature purity. First, most existing approaches, such as HGCLLG and BiGMF, predominantly rely on static graph topologies. They assume that intermodal connection weights remain fixed throughout a dialogue, which contradicts the psychological reality that modal importance fluctuates instantaneously (e.g., audio dominates during shouting, while facial expressions dominate during silence). Although recent works like MDH have introduced dynamic hypergraphs to model high-order correlations, they often prioritize structural complexity over the purification of input features. This leads to the second limitation: the lack of explicit feature decoupling. State-of-the-art methods like HCIL and MDH typically perform graph message passing directly on entangled feature representations. Without a prior “Shared–Private” decoupling mechanism, modality-specific noise (e.g., background clutter or identity information) is inevitably propagated through the graph structure, interfering with the synthesis of emotion-common semantics. In contrast, our DHGTN conceptually differs by integrating a prefusion decoupling stage with a frame-level dynamic graph. This ensures that the graph topology evolves based on “purified” emotional semantics rather than noisy raw features, offering a more robust mechanism than simple dynamic association.
3. Methodology
The overall architecture of the proposed DHGTN is illustrated in Figure 1. As depicted, the data flow operates through four hierarchical stages to transform raw multimodal signals into emotion predictions:
Modality-specific encoding: Raw input streams (text, audio, vision) are first processed by independent pretrained backbones (BERT, Wav2vec 2.0, VideoMAE) to extract high-dimensional feature sequences.
Feature decoupling: To suppress noise, these features are projected into orthogonal “Shared” (emotion-common) and “Private” (modality-specific) subspaces via a contrastive learning mechanism.
Dynamic graph fusion: The purified “Shared” features serve as nodes in a dynamic heterogeneous graph. At each timestep, a graph attention network dynamically computes edge weights to model instantaneous cross-modal interactions.
Temporal reasoning and classification: Finally, the sequence of fused graph representations is fed into a global Transformer to capture long-range dependencies, followed by a classifier that outputs the final emotion probability.
3.1. Modality-Specific Feature Encoding
This stage aims to transform raw, highly heterogeneous multimodal input streams into a unified sequence of high-dimensional spatiotemporal feature representations, while addressing the inherent temporal heterogeneity across modalities. The model receives synchronized input sequences from $M$ modalities, denoted as $X = \{X^m\}_{m=1}^{M}$. Here, $m \in \{t, a, v\}$, representing {Text, Audio, Vision}, and $T$ denotes the number of timesteps. Given the substantial discrepancies in data structures and semantic levels across signals, we adopt a “divide-and-conquer” strategy. Specifically, dedicated pretrained backbone networks are deployed for each modality to extract high-level semantic feature sequences $H^m = \mathrm{Encoder}_m(X^m) \in \mathbb{R}^{T \times d}$, where $d$ is the encoded feature dimension.

Specifically, the Textual Modality employs the pretrained BERT model to extract contextual embeddings from the input transcripts. The Acoustic Modality adopts Wav2vec 2.0, a self-supervised framework pretrained on raw audio waveforms, to capture paralinguistic cues such as prosody and intonation. The Visual Modality uses VideoMAE to obtain robust spatiotemporal representations from the video frames. Following feature extraction, the module outputs three sets of temporally aligned but semantically heterogeneous feature sequences $\{H^t, H^a, H^v\}$.
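As a minimal PyTorch sketch of this “divide-and-conquer” stage: the paper uses BERT, Wav2vec 2.0, and VideoMAE as backbones, but here each backbone is replaced by a simple linear stand-in, and all dimensions (768/512/1024 inputs, a common dimension of 256) are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class ModalityEncoders(nn.Module):
    """Stand-ins for the pretrained backbones (BERT / Wav2vec 2.0 / VideoMAE).

    Each modality keeps its own dedicated encoder; here a linear layer maps
    each modality's raw per-timestep features to a common dimension d.
    """
    def __init__(self, dims, d=256):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {m: nn.Linear(dim, d) for m, dim in dims.items()})

    def forward(self, inputs):
        # inputs: {modality: (T, dim_m)} -> aligned outputs {modality: (T, d)}
        return {m: self.encoders[m](x) for m, x in inputs.items()}

T = 10  # number of timesteps
enc = ModalityEncoders({"text": 768, "audio": 512, "vision": 1024})
feats = enc({"text": torch.randn(T, 768),
             "audio": torch.randn(T, 512),
             "vision": torch.randn(T, 1024)})
```

The outputs are temporally aligned but still semantically heterogeneous; the decoupling stage below operates on these per-modality sequences.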
3.2. Multimodal Feature Decoupling and Alignment
To address the issue of feature entanglement caused by modal heterogeneity—where raw features contain a mixture of critical emotional semantics and emotion-irrelevant modality-specific noise—this module constructs a feature decoupling mechanism. The goal is to explicitly separate the representation of each modality into two orthogonal latent subspaces: an Emotion-Shared Space, which captures common emotional semantics across modalities, and a Modality-Private Space, which retains modality-specific noise and private attributes. Additionally, a contrastive learning mechanism is employed to enforce semantic alignment within the shared features.
For any modality $m$ at timestep $i$ with feature $h_i^m$, we decouple the representation via two independent linear projection networks $f_s(\cdot)$ and $f_p(\cdot)$:

$$s_i^m = W_s h_i^m, \qquad p_i^m = W_p h_i^m,$$

where $s_i^m$ represents the shared feature containing general emotional semantics, and $p_i^m$ represents the private feature related to modality-specific noise. $W_s$ and $W_p$ are learnable projection matrices.

To ensure that the shared feature $s_i^m$ and private feature $p_i^m$ are informationally independent, we impose an orthogonality constraint $\mathcal{L}_{orth}$ to minimize the mutual information between them:

$$\mathcal{L}_{orth} = \left\| W_s^{\top} W_p \right\|_F^2,$$

where $\|\cdot\|_F$ denotes the Frobenius norm. Minimizing this term enforces the two projection matrices to be mutually orthogonal.

Simultaneously, to enforce high semantic alignment among shared features from different modalities while suppressing interference from private features, we introduce a self-supervised multimodal contrastive learning loss $\mathcal{L}_{con}$. Based on the InfoNCE framework, we construct positive and negative sample pairs. The positive samples are the shared features $s_i^m$ and $s_i^{m'}$ ($m' \neq m$) of different modalities at the same timestep $i$. The negative samples include the shared features of the current moment $s_i^m$ paired with shared features from all other timesteps $s_j^{m'}$ ($j \neq i$), as well as private features $p_j^{m'}$ from all timesteps. By calculating the similarity between the target shared feature and its positive/negative sets, the contrastive loss is defined as follows:

$$\mathcal{L}_{con} = -\frac{1}{T}\sum_{i=1}^{T} \sum_{m} \log \frac{\sum_{s^{+} \in \mathcal{P}(i,m)} \exp\!\big(\mathrm{sim}(s_i^m, s^{+}) / \tau\big)}{\sum_{z \in \mathcal{N}(i,m)} \exp\!\big(\mathrm{sim}(s_i^m, z) / \tau\big)},$$

where $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity, $\tau$ is a temperature coefficient, $\mathcal{P}(i,m)$ denotes the set of shared features from all other modalities at timestep $i$, and $\mathcal{N}(i,m)$ is the corresponding negative set. The output of this stage is the decoupled and aligned shared feature sequence $S = \{\, s_i^m \mid i = 1, \dots, T;\ m \in \{t, a, v\} \,\}$.
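The Shared–Private decoupling, orthogonality constraint, and InfoNCE loss above can be sketched in a few lines of PyTorch. This is a simplified illustration under our own naming (`Decoupler`, `info_nce` are not the paper's identifiers), and the negative set is passed in explicitly rather than assembled from all timesteps:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoupler(nn.Module):
    """Shared-Private projection of one modality's features."""
    def __init__(self, d):
        super().__init__()
        self.W_s = nn.Linear(d, d, bias=False)  # shared (emotion-common) projection
        self.W_p = nn.Linear(d, d, bias=False)  # private (modality-specific) projection

    def forward(self, h):
        return self.W_s(h), self.W_p(h)

    def orth_loss(self):
        # || W_s^T W_p ||_F^2 pushes the two subspaces toward orthogonality
        return (self.W_s.weight.T @ self.W_p.weight).pow(2).sum()

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE over cosine similarities: one positive (another modality's
    shared feature at the same timestep) vs. a list of negative tensors."""
    sim = lambda a, b: F.cosine_similarity(a, b, dim=-1) / tau
    pos = torch.exp(sim(anchor, positive))
    neg = torch.stack([torch.exp(sim(anchor, n)) for n in negatives]).sum(dim=0)
    return -torch.log(pos / (pos + neg)).mean()

dec = Decoupler(d=16)
h = torch.randn(4, 16)            # 4 timesteps of one modality's encoded features
s, p = dec(h)                     # shared and private components
loss = info_nce(s, torch.randn(4, 16), [torch.randn(4, 16), p.detach()]) \
       + dec.orth_loss()
```

In practice the positive and negative tensors would come from the other modalities' shared features and the private features across timesteps, as the text describes.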
3.3. Dynamic Heterogeneous Graph Temporal Network
To capture the time-varying nature of emotional contributions and complex modal interplay, this module designs a dynamic heterogeneous graph that adaptively learns connection weights between modalities at each timestep. For every timestep $i$, a dynamic heterogeneous graph $\mathcal{G}_i = (\mathcal{V}_i, \mathcal{E}_i)$ is constructed based on the shared features $\{s_i^t, s_i^a, s_i^v\}$. The node set $\mathcal{V}_i$ contains three nodes representing the shared modal features at the current moment. To comprehensively model the relationships, the graph is designed as fully connected, ensuring that each modality can attend to all others. The edge set $\mathcal{E}_i$ comprises two types of connections: intramodal edges (self-loops), which allow each modality to preserve its unique characteristics, and intermodal edges, which capture latent cross-modal synergies. The weights of these edges are determined by a dynamically computed graph attention mechanism.

We utilize graph attention to calculate the dynamic attention weight $\alpha_i^{mn}$ from modality $n$ to modality $m$, reflecting the contribution of $n$ to the emotional information of $m$. Given the fixed and small number of modalities ($M = 3$), the computational cost for this graph construction is $O(M^2)$ per timestep, resulting in a total complexity of $O(T M^2)$, which is computationally efficient. The attention coefficients are computed as follows:

$$\alpha_i^{mn} = \frac{\exp\!\big(\mathrm{LeakyReLU}\big(a^{\top} [\, W s_i^m \,\|\, W s_i^n \,]\big)\big)}{\sum_{k \in \mathcal{V}_i} \exp\!\big(\mathrm{LeakyReLU}\big(a^{\top} [\, W s_i^m \,\|\, W s_i^k \,]\big)\big)},$$

where $W$ and $a$ are learnable weight parameters, and $\|$ denotes the concatenation operation.

These weights implement context-adaptive dynamic information flow control. Subsequently, neighborhood information is aggregated using the calculated attention weights to update the feature representation of each node:

$$\tilde{s}_i^m = \sigma\!\Big( \sum_{n \in \mathcal{V}_i} \alpha_i^{mn}\, W s_i^n \Big),$$

where $\sigma(\cdot)$ is a nonlinear activation function.

The updated node feature $\tilde{s}_i^m$ effectively assimilates contextual information from other modalities. Following the graph attention update, an average pooling operation is applied to compress the overall state of the graph $\mathcal{G}_i$ into a single vector $g_i$, representing the instantaneous global multimodal snapshot at time $i$:

$$g_i = \frac{1}{M} \sum_{m \in \{t, a, v\}} \tilde{s}_i^m.$$

Ultimately, this process yields a temporally ordered sequence of frame-level fused features $G = [\, g_1, g_2, \dots, g_T \,]$.
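The per-timestep graph update can be sketched as a single-head, GAT-style attention over the three modality nodes. This is our simplified reading of the module (class name, single head, and `tanh` activation are illustrative assumptions), not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGraphFusion(nn.Module):
    """Fully connected graph over the modality nodes at one timestep,
    with attention weights recomputed from the current inputs."""
    def __init__(self, d):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)   # shared node transform
        self.a = nn.Linear(2 * d, 1, bias=False)  # attention scoring vector

    def forward(self, nodes):                  # nodes: (M, d) shared features
        h = self.W(nodes)
        M = h.size(0)
        # concatenated pairs [h_i || h_j] for every ordered pair, incl. self-loops
        pairs = torch.cat([h.unsqueeze(1).expand(M, M, -1),
                           h.unsqueeze(0).expand(M, M, -1)], dim=-1)
        alpha = F.softmax(F.leaky_relu(self.a(pairs)).squeeze(-1), dim=-1)  # (M, M)
        updated = torch.tanh(alpha @ h)        # aggregate neighborhood information
        return updated.mean(dim=0)             # average pooling -> snapshot g_i

fusion = DynamicGraphFusion(d=16)
g_t = fusion(torch.randn(3, 16))               # text / audio / vision nodes
```

Because the attention weights are recomputed at every timestep from the current shared features, the effective edge weights evolve with the dialogue, which is the "dynamic" behavior the text describes.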
3.4. Temporal Reasoning and Composite Optimization
The fundamental challenge in emotion recognition lies in decoding subtle affective correlations embedded across long temporal spans. This module is designed to bridge long-range temporal dependencies and mitigate data imbalance, transforming discrete frame-level snapshots into a coherent, global temporal understanding. First, a shared task adaptation layer is introduced to project the graph-fused features $g_i$ into a latent task space:

$$z_i = W_z g_i + b_z,$$

where $W_z$ and $b_z$ are learnable parameters.

Next, to effectively model long-range affective dependencies, the adapted sequence $Z = [z_1, \dots, z_T]$ is fed into a Transformer encoder $\mathrm{TransEnc}(\cdot)$ containing $N$ stacked layers. This captures the evolution of emotional states and distant contextual dependencies via the global self-attention mechanism:

$$Z' = \mathrm{TransEnc}(Z).$$

Finally, a fully connected classifier $W_c$ and a Softmax function are applied to $Z'$ to obtain the predicted probability $\hat{y}$ for the $C$ emotion classes:

$$\hat{y} = \mathrm{Softmax}(W_c \bar{z} + b_c),$$

where $\bar{z}$ represents the result of the $Z'$ sequence after pooling or taking the final token.
Furthermore, to enhance the model’s discriminative power and address the long-tail distribution of data, we design a multidimensional composite loss function $\mathcal{L}_{task}$:

$$\mathcal{L}_{task} = \mathcal{L}_{scl} + \mathcal{L}_{focal} + \mathcal{L}_{dice},$$

where the supervised contrastive loss ($\mathcal{L}_{scl}$) is based on the final features $\bar{z}$; it utilizes label information to construct positive and negative pairs, optimizing representation quality at the geometric level to enhance intraclass compactness and interclass separability. The focal loss ($\mathcal{L}_{focal}$) introduces a modulation factor $(1 - \hat{y})^{\gamma}$ that reduces the weight of easy-to-classify examples, forcing the model to focus on hard samples and correcting attentional bias at the probability distribution level. The Dice loss ($\mathcal{L}_{dice}$), commonly used in segmentation, serves as an auxiliary term to optimize decision boundaries for minority classes. Because it is insensitive to sample counts, it effectively balances the contribution of the various classes and improves recognition performance for tail emotions; by optimizing the set overlap between predicted and ground truth distributions, it further alleviates emotional bias caused by class imbalance. Specifically, the focal loss focuses on hard samples by down-weighting easy ones:

$$\mathcal{L}_{focal} = -\frac{1}{B} \sum_{b=1}^{B} \big(1 - \hat{y}_{b,c_b}\big)^{\gamma} \log \hat{y}_{b,c_b},$$

where $\gamma$ is the focusing parameter set to 2.0, $B$ denotes the batch size (the number of samples in the current minibatch), and $\hat{y}_{b,c_b}$ is the predicted probability of the ground truth class $c_b$ for sample $b$. The Dice loss is adopted to optimize the F1-score directly for imbalanced classes:

$$\mathcal{L}_{dice} = 1 - \frac{2 \sum_{b} \hat{y}_{b,c_b}\, y_{b,c_b} + \epsilon}{\sum_{b} \hat{y}_{b,c_b} + \sum_{b} y_{b,c_b} + \epsilon},$$

where $y_{b,c_b}$ is the one-hot ground truth and $\epsilon$ is a smoothing constant. The total optimization objective $\mathcal{L}$ is defined as follows:

$$\mathcal{L} = \mathcal{L}_{task} + \lambda_1 \mathcal{L}_{con} + \lambda_2 \mathcal{L}_{orth} + \lambda_3 \mathcal{L}_{temp},$$

where $\mathcal{L}_{con}$ is the contrastive learning loss, $\mathcal{L}_{orth}$ is the orthogonal constraint loss, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weighting hyperparameters, and $\mathcal{L}_{temp}$ is a regularization term that maintains temporal consistency. In our experiments, we empirically set the weighting hyperparameters to balance the auxiliary constraints with the main classification task. By jointly optimizing these terms, DHGTN achieves robust emotion recognition capabilities with high adaptability to imbalanced data.
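A minimal sketch of the focal and Dice terms on a toy batch, assuming the standard formulations of both losses (the supervised contrastive and temporal consistency terms are omitted, and the function names are ours):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal term: down-weights easy examples via (1 - p_t)^gamma."""
    logp = F.log_softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return -(((1 - logp.exp()) ** gamma) * logp).mean()

def dice_loss(logits, targets, eps=1.0):
    """Soft Dice overlap between the predicted distribution and one-hot targets,
    insensitive to per-class sample counts."""
    probs = F.softmax(logits, dim=-1)
    onehot = F.one_hot(targets, probs.size(-1)).float()
    inter = (probs * onehot).sum(dim=0)
    return (1 - (2 * inter + eps) / (probs.sum(dim=0) + onehot.sum(dim=0) + eps)).mean()

logits = torch.randn(8, 6)        # batch of 8 utterances, six emotion classes
targets = torch.randint(0, 6, (8,))
l_task = focal_loss(logits, targets) + dice_loss(logits, targets)
```

Both terms vanish toward zero as predictions become confidently correct, which is what lets them re-focus gradient mass on hard, minority-class samples.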
4. Experiments
4.1. Experimental Datasets
To comprehensively evaluate the effectiveness and generalization capability of the proposed method in multimodal emotion recognition (MER) tasks, we selected two benchmark datasets: IEMOCAP and MELD. These datasets cover a wide range of interaction patterns, ranging from controlled laboratory environments to naturalistic conversational scenarios. Their rich emotional annotation categories enable a thorough assessment of model performance under complex conditions. The specific data partitions for training, validation, and testing, along with the sample counts, are detailed in Table 1.
IEMOCAP is one of the most widely used multimodal datasets in the field of affective computing [30]. It contains over 150 dyadic sessions recorded by 10 professional actors, comprising approximately 12 h of dialogue and over 10,000 emotional utterance samples. IEMOCAP provides pixel-level aligned video, audio, and text transcripts, along with fine-grained facial motion capture data. The samples are annotated with six discrete emotion labels: happy, sad, angry, frustrated, excited, and neutral. In our experiments, we follow the standard “leave-one-session-out” strategy, using the first four sessions for training and validation, and the last session for testing, to objectively assess the model’s performance in speaker-independent scenarios.
MELD [31] is derived from the TV sitcom Friends, containing over 1400 dialogue scenes and approximately 13,000 utterances. MELD exhibits significant “in-the-wild” characteristics, featuring complex background noise, varying numbers of speakers, and dynamic camera angles. It provides synchronized audio, visual, and textual modalities for each utterance, making it suitable for evaluating robustness in complex acoustic environments and multiparty contexts. The dataset is annotated with seven discrete emotion categories: anger, disgust, fear, joy, neutral, sadness, and surprise. For our experiments, we adopt the standard fixed partition, which splits the dataset into training, validation, and testing sets at a ratio of approximately 80%, 10%, and 10%, respectively.
4.2. Experimental Settings and Evaluation Metrics
This study employs two core evaluation metrics widely recognized in the MER domain: accuracy (Acc) and weighted F1-score (W-F1) [32]. Higher values for both metrics indicate superior model performance. Accuracy measures the proportion of correctly predicted samples to the total number of samples, intuitively reflecting the model’s overall classification capability. The weighted F1-score calculates the weighted average of the F1-score for each category based on the proportion of samples. W-F1 effectively mitigates the issue of inflated performance scores caused by overfitting to majority classes, providing a more objective assessment of recognition efficacy on minority emotion classes.
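For concreteness, the W-F1 metric can be computed in a few lines of plain Python (this is the standard weighted-average definition, equivalent to scikit-learn's `f1_score(..., average="weighted")`, not code from the paper):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with class-frequency (support) weights."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (n / total) * f1   # weight each class by its sample share
    return score

# perfect predictions give W-F1 = 1.0
assert weighted_f1([0, 1, 1, 2], [0, 1, 1, 2]) == 1.0
```

Because each class's F1 is weighted by its support, a model that only fits the majority class is penalized on every misclassified minority sample, which is why W-F1 is preferred on long-tailed emotion data.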
We executed all model training and inference tasks on a robust Linux-based computing environment featuring an NVIDIA GeForce RTX 3080Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA). The software architecture was built upon Python 3.10, leveraging the advanced capabilities of the PyTorch 2.1.2 framework in conjunction with the CUDA 11.8 computing platform to optimize training speed and resource utilization. For clarity and reproducibility, specific settings regarding the training strategy are provided in Table 2, which details the exact hyperparameter values used to achieve the reported results.
Furthermore, to assess the model’s scalability in real-world contexts, we evaluated its computational efficiency on the IEMOCAP dataset. The proposed DHGTN contains 27.5 M parameters. In terms of computational speed, conducting inference on a single NVIDIA GeForce RTX 3080Ti GPU results in an average latency of 37.4 ms per utterance (with a batch size of 1). The total training time for convergence (80 epochs) is approximately 4.5 h. These concrete measurements confirm that while DHGTN incorporates dynamic graph modeling, it maintains a reasonable trade-off between recognition accuracy and computational cost, making it feasible for deployment in time-sensitive applications.
4.3. Comparative Analysis of Experimental Results
In order to provide a rigorous evaluation of our model within the context of multimodal emotion recognition, we perform a comparative analysis against a spectrum of representative baselines. The selected models span distinct architectural paradigms and evolutionary phases of the field, including established classics and recent state-of-the-art advancements. To ensure fair and rigorous evaluation of experimental results, we trained all competing baseline models under identical experimental conditions. All models were trained and tested on exactly the same data splits for IEMOCAP and MELD. This approach eliminates performance variations arising from differing feature properties or data partitioning. These baselines are summarized as follows:
- (1) TFN: A classic fusion paradigm that constructs a high-dimensional fusion tensor by calculating the Cartesian product between modalities, explicitly modeling high-order nonlinear interactions across all modality combinations.
- (2) LMF: An efficient improvement over TFN that utilizes low-rank tensor decomposition to assign low-rank factors to each modality. This approximates the full high-dimensional interaction tensor, preserving intermodal combinatorial capabilities while reducing computational overhead.
- (3) MFN [33]: Designed for multimodal sequential data, it employs a multiview gated memory network to jointly model modality-specific sequences and cross-modal interactions over time, aiming to capture long-term dependencies.
- (4) MulT: A representative architecture applying Transformers to the multimodal domain. It utilizes cross-modal attention mechanisms to capture long-range dynamic cross-modal interactions without the need for explicit alignment.
- (5) MISA [34]: Based on the concept of representation disentanglement, it projects multimodal inputs into two distinct subspaces—“Modality-Invariant” and “Modality-Specific”—to enhance robustness and reduce information redundancy.
- (6) TMBL [35]: Reconstructs the Transformer architecture by introducing bimodal and trimodal binding mechanisms combined with fine-grained convolution modules, aiming to strengthen deep coordination and fusion of different data types.
- (7) HGCLLG [29]: A hierarchical graph contrastive learning framework that explores local interactions within single utterances and global context relationships between utterances via graph structures, optimizing multimodal representations through graph gating mechanisms.
- (8) HCIL [36]: Proposes a hybrid interaction framework that integrates intramodal dynamics, intermodal fusion, and cross-modal learning mechanisms to maximize the synergistic effect of different modal information on emotional expression.
- (9) TETFN [37]: Adopts a text-centric strategy, leveraging powerful textual semantic features to enhance and calibrate feature extraction for acoustic and visual modalities, generating a unified and efficient multimodal representation.
- (10) MDH [38]: Utilizes a dynamic hypergraph convolutional network, allowing edges to connect multiple nodes to capture high-order modal relationships. Its “dynamic” nature allows for adaptive adjustment of the graph structure based on inputs, flexibly modeling complex associations between modalities.
To verify the effectiveness of the method, we conducted comparative experiments on the IEMOCAP and MELD datasets. We report the average performance and standard deviation over five independent runs to rule out stochastic effects. The specific results are presented in Table 3 and Table 4, respectively.
Specifically, on the IEMOCAP dataset, our method achieved improvements of approximately 1.74% and 2.03% in Acc and W-F1, respectively, over the second-best-performing model (MDH), and improvements of 7.95% and 7.62% over the classic MulT model. On the MELD dataset, our method maintained robust performance, surpassing the second-best MDH by approximately 1.35% in W-F1 and the classic MulT by 7.36%. These results fully demonstrate that the proposed dynamic heterogeneous graph interaction mechanism effectively captures not only intramodal temporal dependencies but also the dynamic interactions of emotional cues across modalities, proving highly effective in detecting subtle emotional shifts in dialogue. Crucially, the proposed composite optimization strategy significantly alleviates the long-tail issue. Detailed class-wise analysis reveals that DHGTN achieves the most notable gains on minority emotions compared to the baseline MDH. For instance, on the “Fear” class (which accounts for fewer than 7% of samples), our model improves the accuracy by roughly 7.30% (from 57.87% to 65.17%). This confirms that the focal and Dice losses effectively prevent the model from overfitting to majority classes like “Neutral”.
Comparison with tensor fusion and sequence models: While TFN and LMF capture high-order relationships via tensor computation, they neglect the dynamic changes in the temporal dimension of long sequences, leading to suboptimal performance on long-dialogue datasets like IEMOCAP. Although MFN introduces memory mechanisms for sequence processing, its gating mechanism struggles with complex modal interactions in multiparty scenarios. In contrast, our method leverages dynamically updated graph nodes to simultaneously account for temporal information and modal interactions, thereby significantly outperforming these classic models across all metrics.
Comparison with attention-based models: MulT, MISA, and TMBL utilize attention mechanisms effectively for multimodal feature alignment. However, MISA is prone to introducing feature redundancy during disentanglement; MulT struggles to capture the implicit topological structure of dialogues without explicit graph guidance; and TMBL remains focused on local feature sequences without achieving global relational representation. By combining the attention advantages of Transformers with the structural strengths of graph neural networks, our method retains key cross-modal information while enhancing contextual understanding. This results in higher classification accuracy, particularly in the short-text, fast-paced dialogue scenarios typical of MELD.
Comparison with feature enhancement and GNN models: TETFN relies on powerful pretrained text models to guide acoustic and visual modalities; consequently, its performance is heavily dependent on the quality and informational content of the text modality, limiting its utility in certain real-world scenarios. Static-graph-based models like HGGLG and HCIL lack the flexibility to handle instantaneous changes in emotional states. While MDH employs a dynamic hypergraph mechanism, it incurs high computational complexity and focuses its dynamics on modeling high-order modal relationships rather than the real-time adjustment of relational weights between dialogue units. The proposed method instead treats emotional dynamics as a real-time evolution of a graph structure, allowing fine-grained adjustment of connection structures and weights based on current inputs and historical context. Without relying on heavy text augmentation or complex hypergraphs, our method achieves state-of-the-art performance on both datasets through efficient structural learning, demonstrating the advantage of our approach in modeling dialogue relational dynamics.
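For intuition, adjusting an edge weight from the current inputs plus historical context can be sketched as follows; this is a toy illustration only, and `beta` and the exponential-moving-average decay are assumed hyperparameters, not the paper's actual formulation.

```python
import math

def edge_weight(h_i, h_j, history, beta=0.5):
    """Hypothetical dynamic edge weight between two utterance nodes:
    blends the cosine similarity of the current node features with an
    accumulated historical term, so connections adapt as the dialogue
    evolves."""
    dot = sum(a * b for a, b in zip(h_i, h_j))
    ni = math.sqrt(sum(a * a for a in h_i)) or 1.0
    nj = math.sqrt(sum(b * b for b in h_j)) or 1.0
    current = dot / (ni * nj)                     # similarity of current inputs
    return beta * current + (1 - beta) * history  # mix in dialogue history

def update_history(history, new_weight, decay=0.9):
    """Exponential moving average carries past interaction strength
    forward to the next timestep."""
    return decay * history + (1 - decay) * new_weight
```

A static graph would fix these weights once; here they are recomputed at every timestep, which is the distinction drawn above against HGGLG and HCIL.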
4.4. Ablation Studies
4.4.1. Impact of Individual Modules
We conducted ablation studies to systematically evaluate the validity and contribution of each core component. The experimental results are presented in
Table 5, and the visualization results are shown in
Figure 2. The “w/o Feature Decoupling and Alignment Module” variant removes the explicit separation of modal features into shared and private subspaces: modality-specific features are used directly for graph node initialization, and the shared-space contrastive loss term is dropped. The “w/o Dynamic Graph Fusion Mechanism” variant eliminates the construction of the dynamic heterogeneous graph and the spatial fusion process of the graph attention network, replacing the global graph vector with simple feature concatenation. The “w/o Composite Regularization Constraints” variant removes all auxiliary loss terms, including the shared-space contrastive loss, temporal consistency loss, and supervised contrastive loss, replacing the classification objective with a standard cross-entropy loss.
The results indicate that our proposed method significantly outperforms all ablation variants across every evaluation metric, showing that the tight synergy between components is key to optimal performance. The ablation analysis also reveals a clear hierarchy. First, removing the Feature Decoupling and Alignment Module caused an acute performance drop, with W-F1 decreasing by approximately 8.34% on IEMOCAP and 5.51% on MELD, demonstrating its central role in addressing multimodal heterogeneity: by explicitly decoupling features and aligning semantics, it avoids information redundancy and prevents overfitting to modality-specific noise, thereby enhancing the robustness of emotional representations. Second, replacing the Dynamic Graph Fusion Mechanism with simple feature concatenation produced a similarly severe decline, with Acc dropping by approximately 4.63% on IEMOCAP and 5.57% on MELD. This confirms that simple linear combinations fail to model high-order nonlinear interactions between modalities, and that the module’s fine-grained, structured modeling of instantaneous interactions is the critical mechanism behind high-quality multimodal fusion. Finally, removing the Composite Regularization Constraints decreased W-F1 by approximately 4.02% on IEMOCAP and 3.13% on MELD, underscoring the importance of optimization constraints for deep networks: this module mitigates the long-tail distribution issues and prediction jitter prevalent in natural emotional data, endowing the model with more robust and discriminative predictions.
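To make the decoupling idea concrete, a bare-bones sketch of the shared/private split and a cosine-based shared-space alignment term might look like the following; the projection weights and the exact loss form are illustrative assumptions, not the module's actual implementation.

```python
import math

def project(x, W):
    """Linear projection: one row of W per output dimension."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def decouple(x, W_shared, W_private):
    """Hypothetical split of one modality's feature vector into a
    shared (modality-invariant) part and a private (modality-specific)
    part via two separate projections."""
    return project(x, W_shared), project(x, W_private)

def shared_contrastive(s_a, s_b):
    """Toy shared-space alignment term: pulls the shared embeddings of
    two modalities together (higher cosine similarity, lower loss).
    This is the term removed in the first ablation variant."""
    dot = sum(a * b for a, b in zip(s_a, s_b))
    na = math.sqrt(sum(a * a for a in s_a)) or 1.0
    nb = math.sqrt(sum(b * b for b in s_b)) or 1.0
    return 1.0 - dot / (na * nb)
```

In the “w/o Feature Decoupling” variant, only the raw modality features would feed the graph nodes and no alignment term would be applied.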
Furthermore, to differentiate the contribution of the proposed fusion architecture from the strength of the pretrained feature extractors, we conducted an additional ablation study using simpler backbones, replacing BERT, wav2vec 2.0, and VideoMAE with GloVe+LSTM, openSMILE, and ResNet-50, respectively. As shown in the “w/o Traditional Backbones” row in
Table 5, this model’s Acc dropped to 60.89% on IEMOCAP and to 56.41% on MELD. While this drop highlights the importance of the rich semantic representations provided by advanced pretraining, performance remains strong relative to early multimodal baselines. This demonstrates that the dynamic heterogeneous graph structure captures cross-modal correlations effectively even with limited feature expressiveness, proving that the model’s effectiveness does not rest solely on powerful backbones.
In summary, all components make indispensable contributions to the stable and high-performance emotion recognition achieved by DHGTN.
4.4.2. Impact of Multimodal Fusion
To investigate the contribution of different modalities and their complementary mechanisms in emotion recognition, we designed a progressive ablation experiment ranging from unimodal to bimodal and finally full modal configurations. Here, T, A, and V denote text, acoustic, and visual modalities, respectively.
Table 6 records the performance under different modality combinations, and the trends are visualized in
Figure 3.
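The progressive configurations can be enumerated mechanically; a small helper, assuming only the T/A/V labels defined above:

```python
from itertools import combinations

MODALITIES = ("T", "A", "V")  # text, acoustic, visual

def ablation_configs():
    """List the unimodal, bimodal, and full-modality settings used in
    the progressive ablation (7 configurations in total)."""
    return [combo for r in (1, 2, 3) for combo in combinations(MODALITIES, r)]
```

Each configuration corresponds to one row of the modality-combination results, from single modalities up to the full T + A + V setting.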
The results show that the T + V + A configuration achieved the best performance on both datasets. Removing any single modality resulted in statistically significant declines across all core evaluation metrics. This strongly supports the “Modality Complementarity Hypothesis”, suggesting that each modality provides irreplaceable emotional information, and their organic integration constructs the most robust emotional representation.
Meanwhile, the results differ markedly across datasets. On the MELD dataset, the T + V combination outperformed T + A, indicating that facial expressions play a critical role in semantic disambiguation within short dialogues. Conversely, on the IEMOCAP dataset, the V + A combination achieved competitive results even without text, which we attribute to IEMOCAP’s substantial improvisation, where variations in actors’ prosody become key features for distinguishing emotions. Consequently, the model must be able to weight modalities adaptively according to the characteristics of the data.
Beyond the standard modality ablation, in which models are retrained on specific subsets, we further evaluated the robustness of the fully trained DHGTN model against missing modalities during the inference phase. This simulates real-world hardware failures (e.g., camera occlusion or microphone noise) by masking the input features of a specific modality (setting its vectors to zero) without retraining. As illustrated in
Figure 4, DHGTN demonstrates remarkable resilience. For instance, in the “Missing Visual” scenario, the performance decline on IEMOCAP is significantly lower than that of baseline models. This stability validates that our “Shared Private” decoupling mechanism successfully encodes modality-invariant emotional semantics, allowing the model to recover core information from the remaining modalities even when one source is corrupted.
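The inference-time masking described above amounts to zeroing one modality's feature vector while leaving the others untouched; a minimal sketch (the dictionary layout is a hypothetical interface, not the model's actual API):

```python
def mask_modality(features, missing):
    """Simulate a sensor failure at inference time by zeroing the
    chosen modality's feature vector, without any retraining, as in
    the missing-modality robustness test."""
    return {m: ([0.0] * len(v) if m == missing else list(v))
            for m, v in features.items()}
```

Because the vector length is preserved, the graph topology and downstream layers run unchanged; only the information content of one node type is removed.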
4.5. Case Analysis
To deeply explore the decision logic and robustness of DHGTN in real-world conversational scenarios, we selected three representative samples from the MELD test set for qualitative analysis. Detailed results are shown in
Table 7.
In Case 1, the text conveys a clear negative signal (“Pressure”). Visually, the subject exhibits open palm gestures and a serious facial expression, while the audio presents a rapid tempo with distinct stress. Through the Transformer fusion module, DHGTN effectively aligns the textual semantics with these “anxiety” features and identifies the emotion not as low-arousal Sadness but as cathartic Anger, yielding a correct prediction.
In Case 2, the text content is merely a functional instruction. Visually, although the subject frowns, the overall posture leans towards a normal communicative state. The audio pitch remains relatively stable without obvious aggression. DHGTN avoids over-interpreting this local visual noise. By capturing the global stability of the multimodal state, it accurately controls the emotional boundary and classifies the sample as Neutral, demonstrating robustness against complex background contexts.
In Case 3, although the text contains positive vocabulary, it is prone to being misinterpreted as irony or sarcasm in specific contexts. The acoustic modality extracts high-pitched intonation and a brisk pace, while the visual modality captures typical expressions of pleasure. Quantitative analysis of the graph attention layer reveals that, at the moment of the utterance, the dynamic graph mechanism adaptively shifts dominance from text (0.15 weight) to audio (0.45 weight) and vision (0.40 weight). This instantaneous weight adjustment allows the model to override the semantic ambiguity of the text by focusing on the high-pitched intonation and exaggerated smile, correctly identifying the emotion as Joy, whereas static baselines misclassify it as Neutral or Positive.
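The reported weight shift (text 0.15, audio 0.45, vision 0.40) is consistent with a softmax over per-modality relevance scores, as is standard in attention layers; a toy sketch, with purely illustrative scores:

```python
import math

def modality_attention(scores):
    """Softmax over per-modality relevance scores; the resulting
    weights sum to 1, so raising the audio/vision scores necessarily
    pulls dominance away from text (scores here are illustrative)."""
    m = max(scores.values())                           # subtract max for stability
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}
```

A dynamic graph recomputes such scores per utterance, which is what allows the momentary reweighting described in Case 3; a static fusion scheme would keep one fixed set of weights for the whole dialogue.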
In conclusion, the case study demonstrates that our method exhibits stable performance across different types of multimodal interaction patterns, possessing excellent generalization capabilities and robustness.
5. Conclusions
To address core challenges in multimodal emotion recognition—specifically severe modal heterogeneity, the difficulty of modeling deep interactions, and interference from redundant information—this paper proposes a novel approach based on a dynamic heterogeneous graph network. Extensive experiments on the IEMOCAP and MELD benchmarks demonstrate that the proposed method significantly outperforms state-of-the-art baselines in key metrics, including accuracy and weighted F1. These results substantiate the model’s effectiveness and generalization capability in handling complex cross-modal interactions and fine-grained emotion classification tasks.
However, the proposed framework has distinct boundaries regarding its applicability in real-world scenarios. First, the reliance on complete and synchronized multimodal data is a strong idealization. In uncontrolled environments, missing modalities (e.g., occlusion) or signal asynchrony may disrupt the graph topology and degrade performance; future iterations must incorporate missing-modality imputation mechanisms to handle incomplete views. Second, the dynamic nature of the graph, which updates weights at every timestep, introduces higher computational overhead than static methods. While our reported latency benefits from GPU parallelization of the matrix multiplications required for attention-weight calculation, deployment on hardware lacking massive parallelism may incur increased latency due to sequential processing. Nevertheless, given the small graph scale, the framework remains viable for edge deployment, particularly when optimization techniques such as quantization are applied.
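As a rough illustration of the quantization option mentioned above, symmetric int8 weight quantization works as follows; this is a generic sketch, not tied to any specific deployment toolchain used with DHGTN.

```python
def quantize_int8(weights):
    """Map float weights to int8 with one symmetric scale factor,
    trading a small approximation error for reduced memory and
    cheaper integer arithmetic on edge hardware."""
    scale = (max(abs(w) for w in weights) / 127.0) or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [qi * scale for qi in q]
```

The round-trip error per weight is bounded by the scale, which is why small graphs with few attention parameters tend to tolerate quantization well.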
In light of these limitations, future research will focus on establishing highly robust and lightweight multimodal emotion computing architectures. Key directions include exploring network pruning and knowledge distillation to reduce the computational cost of dynamic graphs for real-time applications, as well as developing generative approaches to infer missing semantic cues. These advancements will ensure the model remains robust even when one or more modalities are unavailable, balancing high performance with the adaptability required for large-scale, real-time human–computer interaction.