1. Introduction
With the increasing digitalization of human–computer interaction, emotion recognition has emerged as a pivotal domain in the pursuit of high-order cognitive intelligence [1], garnering significant academic interest and promising application prospects [2]. Historically, constrained by data acquisition technologies and algorithmic bottlenecks, early research primarily focused on sentiment intensity classification within unimodal text [3]. However, cognitive neuroscience reveals that human emotional expression is inherently a complex process governed by multimodal synergy [4]. Beyond the explicit semantic information conveyed by text [5], nonverbal signals—including vocal prosody, facial expressions, and body gestures—carry a wealth of irreplaceable emotional cues [6]. Moreover, significant complementarity and correlation exist across these modalities [7]. Recent studies confirm that by deeply fusing heterogeneous data from linguistic, acoustic, and visual sources, multimodal emotion recognition (MER) can effectively resolve semantic ambiguities (e.g., irony and metaphors) prevalent in unimodal contexts. This integration facilitates the construction of robust multimodal representations [8], leading to a more precise understanding of complex human affective states. Despite its potential, mainstream MER methodologies still face several core challenges:
First, modal heterogeneity leads to the “semantic gap” and feature entanglement. Text, audio, and video exhibit fundamental differences in representation formats, sampling frequencies, and data distributions [9]. Conventional fusion methods often force different modalities into a shared subspace, neglecting distinct physical properties such as the semantic relatedness of text, the temporal dynamics of audio, and the spatial correlations of imagery. This oversight complicates the alignment of heterogeneous data. Furthermore, raw features are frequently confounded by emotion-irrelevant private attributes [10], such as background noise, speaker identity, and ambient lighting. Without an explicit decoupling mechanism, emotional semantics become highly entangled with modality-specific noise, causing models to overfit to environmental artifacts rather than the essence of emotional features, which severely weakens cross-scene generalization.
Second, cross-modal interactions exhibit dynamic, time-varying characteristics. In continuous interactions, the emotional contribution of different modalities fluctuates over time [11]. For instance, during active speech, acoustic prosody and textual semantics may dominate; however, during silences or pauses, facial expressions and subtle gestures often provide critical emotional cues [12]. Traditional static fusion paradigms—such as feature concatenation or fixed-weight attention mechanisms—struggle to capture these nonlinear, time-varying intermodal correlations. Consequently, they fail to handle complex scenarios like asynchrony or missing modalities [13]. Moreover, such methods are prone to introducing intramodal noise (e.g., background chatter) and intermodal redundancy (e.g., positive sentiment expressed simultaneously in text and vision), which hampers predictive performance.
Furthermore, emotion recognition faces challenges in temporal modeling and data distribution. The evolution of emotional states exhibits long-range dependencies. Traditional recurrent neural networks are often limited by vanishing gradients when processing long sequences, making it difficult to capture distant contextual dependencies [14]. Simultaneously, emotion datasets in naturalistic settings frequently suffer from severe long-tail distributions, where samples are concentrated in neutral categories while extreme or specific emotions remain sparse [15]. This imbalance causes models to favor majority classes, thereby diminishing the recognition efficacy for critical minority emotions.
To address these challenges, we propose the Dynamic Heterogeneous Graph Temporal Network (DHGTN), an end-to-end multimodal emotion recognition framework. Adopting a strategy of “divide-and-conquer, dynamic fusion, and global reasoning”, DHGTN systematically tackles the aforementioned hurdles.
The main contributions of this work are summarized as follows:
Contrastive learning-based multimodal feature decoupling: We construct a “Shared–Private” subspace projection mechanism utilizing Wav2vec 2.0, VideoMAE, and BERT backbones. By imposing orthogonality constraints and self-supervised contrastive constraints, we explicitly disentangle emotion-common semantics from modality-specific noise. This design minimizes redundancy and ensures robust feature alignment without relying on unverified semantic assumptions.
Dynamic heterogeneous graph temporal fusion: Moving beyond static fusion, we introduce a dynamic graph structure that is fully connected at each timestep to capture instantaneous cross-modal correlations. We utilize a graph attention mechanism to adaptively assign weights to both intramodal and intermodal edges, enabling the model to track time-varying interactions with high granularity.
Comprehensive composite optimization system: We propose a multidimensional hybrid loss function that integrates shared space contrastive loss, temporal consistency loss, and supervised contrastive loss. This system works in tandem with the global reasoning capabilities of Transformers to effectively resolve long-range dependencies and mitigate the impact of class imbalance in emotional data.
Superior empirical performance: Extensive experiments on the IEMOCAP and MELD benchmarks demonstrate that DHGTN significantly outperforms state-of-the-art baselines. Specifically, our method achieves a weighted F1-score improvement of 2.03% on IEMOCAP and 1.35% on MELD compared to the runner-up model, validating the effectiveness of our dynamic interaction modeling.
2. Related Works
2.1. Multimodal Emotion Recognition
Unimodal data typically captures only localized emotional cues, making it difficult to comprehensively model complex and fluid affective states. Consequently, multimodal emotion recognition seeks to integrate complementary information from textual, acoustic, and visual modalities to overcome the representational limitations of a single source, thereby significantly enhancing accuracy and robustness. Existing multimodal fusion mechanisms are generally categorized into two paradigms: representation-learning-based fusion and attention-based interaction.
In the realm of representation learning, research focuses on designing effective strategies for the unified representation and integration of multimodal features. Zadeh et al. [16] introduced the Tensor Fusion Network (TFN), which utilizes Cartesian products to construct multidimensional tensors, explicitly modeling all potential unimodal, bimodal, and trimodal interactions. While the TFN excels at capturing high-order interactive features, the exponential growth of feature dimensionality incurs prohibitive computational costs. To mitigate this, Liu et al. [17] proposed Low-rank Multimodal Fusion (LMF), leveraging low-rank tensor decomposition to maintain performance while significantly reducing overhead. Subsequently, Mai et al. [18] developed the Hierarchical Feature Fusion Network (HFFN), which employs a hierarchical tensor decomposition architecture to consider both local and global interactions, enhancing the extraction of latent structures in multimodal temporal data.
In attention-based methods, the Transformer architecture has become the de facto standard due to its global self-attention mechanism. Its capability to capture deep contextual representations has been verified not only in standard NLP benchmarks but also in specialized automated analysis systems [19], confirming its robustness in feature extraction. By computing attention weights in parallel, these models dynamically focus on key timesteps or modal features that contribute significantly to emotional prediction, effectively capturing long-range dependencies. For instance, Tsai et al. [20] proposed the Multimodal Transformer (MulT), which achieves feature alignment and interaction via directional pairwise cross-modal attention. Rahman et al. [21] introduced a multimodal adaptation gate (MAG) to inject nonverbal information into pretrained language models, achieving effective fusion of linguistic knowledge and multimodal perception. Furthermore, Ou et al. [22] designed the Multimodal Local and Global Attention Network (MMLGAN) to integrate diverse representations and generate discriminative emotional features.
Despite these advancements, existing fusion methods face critical challenges. First, the lack of explicit spatiotemporal calibration often allows noise to infiltrate cross-modal features during high-order interactions. Second, the issue of feature entanglement remains unresolved; failing to decouple modal-specific attributes from shared emotional features results in redundant information interfering with the final representation, thus undermining model robustness.
2.2. Graph Neural Networks
Graph neural networks offer a novel perspective for MER by modeling multimodal data as topological structures with complex dependencies, leveraging their superiority in processing non-Euclidean data. Moreover, the robustness of GNNs for text classification has been validated in diverse practical applications [23], providing empirical evidence of their effectiveness in capturing latent semantic dependencies. Under the GNN framework, modalities or temporal segments are treated as nodes, while interactions are represented as edges. Through message-passing mechanisms, GNNs dynamically aggregate neighborhood information to capture deep-seated emotional correlations.
Researchers have achieved significant results by designing sophisticated graph topologies for context and cross-modal interaction modeling. Hu et al. [24] proposed the Multimodal Graph Convolutional Network (MMGCN), which constructs a heterogeneous conversational graph with multimodal nodes and applies spectral convolutions to model long-range cross-modal dependencies. To achieve finer control over information flow, Li et al. [25] introduced the Graph and Attention-based Multisource Integration Framework (GA2MIF), which utilizes multihead directed graph attention and pairwise cross-channel attention to decouple and collaboratively optimize contextual modeling and cross-modal interaction. Similarly, Lu et al. [26] proposed the Bistream Graph Multimodal Fusion (BiGMF) architecture, which constructs independent unimodal and cross-modal graphs to capture intramodal temporal dependencies and intermodal pairwise interactions, respectively.
To address structural complexity and data quality issues, subsequent works have further expanded the boundaries of GNNs. Lian et al. [27] designed the Graph Completion Network (GCNet), utilizing topological completion to handle incomplete conversational data by inferring missing semantic and emotional links. Wei et al. [28] developed the Dialogue and Event Relation Aware GCN (DER-GCN), which employs weighted multirelational graphs and self-supervised masked graph autoencoders to capture causal dependencies between speakers and events. Furthermore, Du et al. [29] proposed the Hierarchical Graph Contrastive Learning (HGCLLG) network, which maximizes the mutual information between local and global views of a single utterance, enabling the model to learn fine-grained local features and high-level global semantics.
However, these GNN-based MER methods exhibit two critical limitations regarding dynamic interactions and feature purity. First, most existing approaches, such as HGCLLG and BiGMF, predominantly rely on static graph topologies. They assume that intermodal connection weights remain fixed throughout a dialogue, which contradicts the psychological reality that modal importance fluctuates instantaneously (e.g., audio dominates during shouting, while facial expressions dominate during silence). Although recent works like MDH have introduced dynamic hypergraphs to model high-order correlations, they often prioritize structural complexity over the purification of input features. This leads to the second limitation: the lack of explicit feature decoupling. State-of-the-art methods like HCIL and MDH typically perform graph message passing directly on entangled feature representations. Without a prior “Shared–Private” decoupling mechanism, modality-specific noise (e.g., background clutter or identity information) is inevitably propagated through the graph structure, interfering with the synthesis of emotion-common semantics. In contrast, our DHGTN conceptually differs by integrating a prefusion decoupling stage with a frame-level dynamic graph. This ensures that the graph topology evolves based on “purified” emotional semantics rather than noisy raw features, offering a more robust mechanism than simple dynamic association.
3. Methodology
The overall architecture of the proposed DHGTN is illustrated in Figure 1. As depicted, the data flow operates through four hierarchical stages to transform raw multimodal signals into emotion predictions:
Modality-specific encoding: Raw input streams (text, audio, vision) are first processed by independent pretrained backbones (BERT, Wav2vec 2.0, VideoMAE) to extract high-dimensional feature sequences.
Feature decoupling: To suppress noise, these features are projected into orthogonal “Shared” (emotion-common) and “Private” (modality-specific) subspaces via a contrastive learning mechanism.
Dynamic graph fusion: The purified “Shared” features serve as nodes in a dynamic heterogeneous graph. At each timestep, a graph attention network dynamically computes edge weights to model instantaneous cross-modal interactions.
Temporal reasoning and classification: Finally, the sequence of fused graph representations is fed into a global Transformer to capture long-range dependencies, followed by a classifier that outputs the final emotion probability.
3.1. Modality-Specific Feature Encoding
This stage aims to transform raw, highly heterogeneous multimodal input streams into a unified sequence of high-dimensional spatiotemporal feature representations, while addressing the inherent temporal heterogeneity across modalities. The model receives synchronized input sequences from $M$ modalities, denoted as $X = \{X^m\}_{m=1}^{M}$. Here, $m \in \{t, a, v\}$, representing {Text, Audio, Vision}, and $T$ denotes the number of timesteps. Given the substantial discrepancies in data structures and semantic levels across signals, we adopt a “divide-and-conquer” strategy. Specifically, dedicated pretrained backbone networks are deployed for each modality to extract high-level semantic feature sequences $H^m = \mathrm{Encoder}_m(X^m) \in \mathbb{R}^{T \times d}$, where $d$ is the encoded feature dimension.

Specifically, the Textual Modality employs the pretrained BERT model to extract contextual embeddings from the input transcripts. The Acoustic Modality adopts Wav2vec 2.0, a self-supervised framework pretrained on raw audio waveforms, to capture paralinguistic cues such as prosody and intonation. The Visual Modality uses VideoMAE to obtain robust spatiotemporal representations from the video frames. Following feature extraction, the module outputs three sets of temporally aligned but semantically heterogeneous feature sequences $\{H^t, H^a, H^v\}$.
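As a minimal PyTorch sketch of this “divide-and-conquer” stage: the paper uses BERT, Wav2vec 2.0, and VideoMAE as backbones, but here each backbone is replaced by a simple linear stand-in, and all dimensions (768/512/1024 inputs, a common dimension of 256) are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class ModalityEncoders(nn.Module):
    """Stand-ins for the pretrained backbones (BERT / Wav2vec 2.0 / VideoMAE).

    Each modality keeps its own dedicated encoder; here a linear layer maps
    each modality's raw per-timestep features to a common dimension d.
    """
    def __init__(self, dims, d=256):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {m: nn.Linear(dim, d) for m, dim in dims.items()})

    def forward(self, inputs):
        # inputs: {modality: (T, dim_m)} -> aligned outputs {modality: (T, d)}
        return {m: self.encoders[m](x) for m, x in inputs.items()}

T = 10  # number of timesteps
enc = ModalityEncoders({"text": 768, "audio": 512, "vision": 1024})
feats = enc({"text": torch.randn(T, 768),
             "audio": torch.randn(T, 512),
             "vision": torch.randn(T, 1024)})
```

The outputs are temporally aligned but still semantically heterogeneous; the decoupling stage below operates on these per-modality sequences.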
3.2. Multimodal Feature Decoupling and Alignment
To address the issue of feature entanglement caused by modal heterogeneity—where raw features contain a mixture of critical emotional semantics and emotion-irrelevant modality-specific noise—this module constructs a feature decoupling mechanism. The goal is to explicitly separate the representation of each modality into two orthogonal latent subspaces: an Emotion-Shared Space, which captures common emotional semantics across modalities, and a Modality-Private Space, which retains modality-specific noise and private attributes. Additionally, a contrastive learning mechanism is employed to enforce semantic alignment within the shared features.
For any modality $m$ at timestep $i$ with feature $h_i^m$, we decouple the representation via two independent linear projection networks $f_s(\cdot)$ and $f_p(\cdot)$:

$$s_i^m = W_s h_i^m, \qquad p_i^m = W_p h_i^m,$$

where $s_i^m$ represents the shared feature containing general emotional semantics, and $p_i^m$ represents the private feature related to modality-specific noise. $W_s$ and $W_p$ are learnable projection matrices.

To ensure that the shared feature $s_i^m$ and private feature $p_i^m$ are informationally independent, we impose an orthogonality constraint $\mathcal{L}_{orth}$ to minimize the mutual information between them:

$$\mathcal{L}_{orth} = \left\| W_s^{\top} W_p \right\|_F^2,$$

where $\|\cdot\|_F$ denotes the Frobenius norm. Minimizing this term enforces the two projection matrices to be mutually orthogonal.

Simultaneously, to enforce high semantic alignment among shared features from different modalities while suppressing interference from private features, we introduce a self-supervised multimodal contrastive learning loss $\mathcal{L}_{con}$. Based on the InfoNCE framework, we construct positive and negative sample pairs. The positive samples are the shared features $s_i^m$ and $s_i^{m'}$ ($m' \neq m$) of different modalities at the same timestep $i$. The negative samples include the shared features of the current moment $s_i^m$ paired with shared features from all other timesteps $s_j^{m'}$ ($j \neq i$), as well as private features $p_j^{m'}$ from all timesteps. By calculating the similarity between the target shared feature and its positive/negative sets, the contrastive loss is defined as follows:

$$\mathcal{L}_{con} = -\frac{1}{T}\sum_{i=1}^{T} \sum_{m} \log \frac{\sum_{s^{+} \in \mathcal{P}(i,m)} \exp\!\big(\mathrm{sim}(s_i^m, s^{+}) / \tau\big)}{\sum_{z \in \mathcal{N}(i,m)} \exp\!\big(\mathrm{sim}(s_i^m, z) / \tau\big)},$$

where $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity, $\tau$ is a temperature coefficient, $\mathcal{P}(i,m)$ denotes the set of shared features from all other modalities at timestep $i$, and $\mathcal{N}(i,m)$ is the corresponding negative set. The output of this stage is the decoupled and aligned shared feature sequence $S = \{\, s_i^m \mid i = 1, \dots, T;\ m \in \{t, a, v\} \,\}$.
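The Shared–Private decoupling, orthogonality constraint, and InfoNCE loss above can be sketched in a few lines of PyTorch. This is a simplified illustration under our own naming (`Decoupler`, `info_nce` are not the paper's identifiers), and the negative set is passed in explicitly rather than assembled from all timesteps:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoupler(nn.Module):
    """Shared-Private projection of one modality's features."""
    def __init__(self, d):
        super().__init__()
        self.W_s = nn.Linear(d, d, bias=False)  # shared (emotion-common) projection
        self.W_p = nn.Linear(d, d, bias=False)  # private (modality-specific) projection

    def forward(self, h):
        return self.W_s(h), self.W_p(h)

    def orth_loss(self):
        # || W_s^T W_p ||_F^2 pushes the two subspaces toward orthogonality
        return (self.W_s.weight.T @ self.W_p.weight).pow(2).sum()

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE over cosine similarities: one positive (another modality's
    shared feature at the same timestep) vs. a list of negative tensors."""
    sim = lambda a, b: F.cosine_similarity(a, b, dim=-1) / tau
    pos = torch.exp(sim(anchor, positive))
    neg = torch.stack([torch.exp(sim(anchor, n)) for n in negatives]).sum(dim=0)
    return -torch.log(pos / (pos + neg)).mean()

dec = Decoupler(d=16)
h = torch.randn(4, 16)            # 4 timesteps of one modality's encoded features
s, p = dec(h)                     # shared and private components
loss = info_nce(s, torch.randn(4, 16), [torch.randn(4, 16), p.detach()]) \
       + dec.orth_loss()
```

In practice the positive and negative tensors would come from the other modalities' shared features and the private features across timesteps, as the text describes.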
3.3. Dynamic Heterogeneous Graph Temporal Network
To capture the time-varying nature of emotional contributions and complex modal interplay, this module designs a dynamic heterogeneous graph that adaptively learns connection weights between modalities at each timestep. For every timestep $i$, a dynamic heterogeneous graph $\mathcal{G}_i = (\mathcal{V}_i, \mathcal{E}_i)$ is constructed based on the shared features $\{s_i^t, s_i^a, s_i^v\}$. The node set $\mathcal{V}_i$ contains three nodes representing the shared modal features at the current moment. To comprehensively model the relationships, the graph is designed as fully connected, ensuring that each modality can attend to all others. The edge set $\mathcal{E}_i$ comprises two types of connections: intramodal edges (self-loops), which allow each modality to preserve its unique characteristics, and intermodal edges, which capture latent cross-modal synergies. The weights of these edges are determined by a dynamically computed graph attention mechanism.

We utilize graph attention to calculate the dynamic attention weight $\alpha_i^{mn}$ from modality $n$ to modality $m$, reflecting the contribution of $n$ to the emotional information of $m$. Given the fixed and small number of modalities ($M = 3$), the computational cost for this graph construction is $O(M^2)$ per timestep, resulting in a total complexity of $O(T M^2)$, which is computationally efficient. The attention coefficients are computed as follows:

$$\alpha_i^{mn} = \frac{\exp\!\big(\mathrm{LeakyReLU}\big(a^{\top} [\, W s_i^m \,\|\, W s_i^n \,]\big)\big)}{\sum_{k \in \mathcal{V}_i} \exp\!\big(\mathrm{LeakyReLU}\big(a^{\top} [\, W s_i^m \,\|\, W s_i^k \,]\big)\big)},$$

where $W$ and $a$ are learnable weight parameters, and $\|$ denotes the concatenation operation.

These weights implement context-adaptive dynamic information flow control. Subsequently, neighborhood information is aggregated using the calculated attention weights to update the feature representation of each node:

$$\tilde{s}_i^m = \sigma\!\Big( \sum_{n \in \mathcal{V}_i} \alpha_i^{mn}\, W s_i^n \Big),$$

where $\sigma(\cdot)$ is a nonlinear activation function.

The updated node feature $\tilde{s}_i^m$ effectively assimilates contextual information from other modalities. Following the graph attention update, an average pooling operation is applied to compress the overall state of the graph $\mathcal{G}_i$ into a single vector $g_i$, representing the instantaneous global multimodal snapshot at time $i$:

$$g_i = \frac{1}{M} \sum_{m \in \{t, a, v\}} \tilde{s}_i^m.$$

Ultimately, this process yields a temporally ordered sequence of frame-level fused features $G = [\, g_1, g_2, \dots, g_T \,]$.
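The per-timestep graph update can be sketched as a single-head, GAT-style attention over the three modality nodes. This is our simplified reading of the module (class name, single head, and `tanh` activation are illustrative assumptions), not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGraphFusion(nn.Module):
    """Fully connected graph over the modality nodes at one timestep,
    with attention weights recomputed from the current inputs."""
    def __init__(self, d):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)   # shared node transform
        self.a = nn.Linear(2 * d, 1, bias=False)  # attention scoring vector

    def forward(self, nodes):                  # nodes: (M, d) shared features
        h = self.W(nodes)
        M = h.size(0)
        # concatenated pairs [h_i || h_j] for every ordered pair, incl. self-loops
        pairs = torch.cat([h.unsqueeze(1).expand(M, M, -1),
                           h.unsqueeze(0).expand(M, M, -1)], dim=-1)
        alpha = F.softmax(F.leaky_relu(self.a(pairs)).squeeze(-1), dim=-1)  # (M, M)
        updated = torch.tanh(alpha @ h)        # aggregate neighborhood information
        return updated.mean(dim=0)             # average pooling -> snapshot g_i

fusion = DynamicGraphFusion(d=16)
g_t = fusion(torch.randn(3, 16))               # text / audio / vision nodes
```

Because the attention weights are recomputed at every timestep from the current shared features, the effective edge weights evolve with the dialogue, which is the "dynamic" behavior the text describes.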
3.4. Temporal Reasoning and Composite Optimization
The fundamental challenge in emotion recognition lies in decoding subtle affective correlations embedded across long temporal spans. This module is designed to bridge long-range temporal dependencies and mitigate data imbalance, transforming discrete frame-level snapshots into a coherent, global temporal understanding. First, a shared task adaptation layer is introduced to project the graph-fused features $g_i$ into a latent task space:

$$z_i = W_z g_i + b_z,$$

where $W_z$ and $b_z$ are learnable parameters.

Next, to effectively model long-range affective dependencies, the adapted sequence $Z = [z_1, \dots, z_T]$ is fed into a Transformer encoder $\mathrm{TransEnc}(\cdot)$ containing $N$ stacked layers. This captures the evolution of emotional states and distant contextual dependencies via the global self-attention mechanism:

$$Z' = \mathrm{TransEnc}(Z).$$

Finally, a fully connected classifier $W_c$ and a Softmax function are applied to $Z'$ to obtain the predicted probability $\hat{y}$ for the $C$ emotion classes:

$$\hat{y} = \mathrm{Softmax}(W_c \bar{z} + b_c),$$

where $\bar{z}$ represents the result of the $Z'$ sequence after pooling or taking the final token.
Furthermore, to enhance the model’s discriminative power and address the long-tail distribution of data, we design a multidimensional composite loss function $\mathcal{L}_{task}$:

$$\mathcal{L}_{task} = \mathcal{L}_{scl} + \mathcal{L}_{focal} + \mathcal{L}_{dice},$$

where the supervised contrastive loss ($\mathcal{L}_{scl}$) is based on the final features $\bar{z}$; it utilizes label information to construct positive and negative pairs, optimizing representation quality at the geometric level to enhance intraclass compactness and interclass separability. The focal loss ($\mathcal{L}_{focal}$) introduces a modulation factor $(1 - \hat{y})^{\gamma}$ that reduces the weight of easy-to-classify examples, forcing the model to focus on hard samples and correcting attentional bias at the probability distribution level. The Dice loss ($\mathcal{L}_{dice}$), commonly used in segmentation, serves as an auxiliary term to optimize decision boundaries for minority classes. Because it is insensitive to sample counts, it effectively balances the contribution of the various classes and improves recognition performance for tail emotions; by optimizing the set overlap between predicted and ground truth distributions, it further alleviates emotional bias caused by class imbalance. Specifically, the focal loss focuses on hard samples by down-weighting easy ones:

$$\mathcal{L}_{focal} = -\frac{1}{B} \sum_{b=1}^{B} \big(1 - \hat{y}_{b,c_b}\big)^{\gamma} \log \hat{y}_{b,c_b},$$

where $\gamma$ is the focusing parameter set to 2.0, $B$ denotes the batch size (the number of samples in the current minibatch), and $\hat{y}_{b,c_b}$ is the predicted probability of the ground truth class $c_b$ for sample $b$. The Dice loss is adopted to optimize the F1-score directly for imbalanced classes:

$$\mathcal{L}_{dice} = 1 - \frac{2 \sum_{b} \hat{y}_{b,c_b}\, y_{b,c_b} + \epsilon}{\sum_{b} \hat{y}_{b,c_b} + \sum_{b} y_{b,c_b} + \epsilon},$$

where $y_{b,c_b}$ is the one-hot ground truth and $\epsilon$ is a smoothing constant. The total optimization objective $\mathcal{L}$ is defined as follows:

$$\mathcal{L} = \mathcal{L}_{task} + \lambda_1 \mathcal{L}_{con} + \lambda_2 \mathcal{L}_{orth} + \lambda_3 \mathcal{L}_{temp},$$

where $\mathcal{L}_{con}$ is the contrastive learning loss, $\mathcal{L}_{orth}$ is the orthogonal constraint loss, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weighting hyperparameters, and $\mathcal{L}_{temp}$ is a regularization term that maintains temporal consistency. In our experiments, we empirically set the weighting hyperparameters to balance the auxiliary constraints with the main classification task. By jointly optimizing these terms, DHGTN achieves robust emotion recognition capabilities with high adaptability to imbalanced data.
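A minimal sketch of the focal and Dice terms on a toy batch, assuming the standard formulations of both losses (the supervised contrastive and temporal consistency terms are omitted, and the function names are ours):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal term: down-weights easy examples via (1 - p_t)^gamma."""
    logp = F.log_softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return -(((1 - logp.exp()) ** gamma) * logp).mean()

def dice_loss(logits, targets, eps=1.0):
    """Soft Dice overlap between the predicted distribution and one-hot targets,
    insensitive to per-class sample counts."""
    probs = F.softmax(logits, dim=-1)
    onehot = F.one_hot(targets, probs.size(-1)).float()
    inter = (probs * onehot).sum(dim=0)
    return (1 - (2 * inter + eps) / (probs.sum(dim=0) + onehot.sum(dim=0) + eps)).mean()

logits = torch.randn(8, 6)        # batch of 8 utterances, six emotion classes
targets = torch.randint(0, 6, (8,))
l_task = focal_loss(logits, targets) + dice_loss(logits, targets)
```

Both terms vanish toward zero as predictions become confidently correct, which is what lets them re-focus gradient mass on hard, minority-class samples.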
4. Experiments
4.1. Experimental Datasets
To comprehensively evaluate the effectiveness and generalization capability of the proposed method in multimodal emotion recognition (MER) tasks, we selected two benchmark datasets: IEMOCAP and MELD. These datasets cover a wide range of interaction patterns, ranging from controlled laboratory environments to naturalistic conversational scenarios. Their rich emotional annotation categories enable a thorough assessment of model performance under complex conditions. The specific data partitions for training, validation, and testing, along with the sample counts, are detailed in Table 1.
IEMOCAP is one of the most widely used multimodal datasets in the field of affective computing [30]. It contains over 150 dyadic sessions recorded by 10 professional actors, comprising approximately 12 h of dialogue and over 10,000 emotional utterance samples. IEMOCAP provides pixel-level aligned video, audio, and text transcripts, along with fine-grained facial motion capture data. The samples are annotated with six discrete emotion labels: happy, sad, angry, frustrated, excited, and neutral. In our experiments, we follow the standard “leave-one-session-out” strategy, using the first four sessions for training and validation, and the last session for testing, to objectively assess the model’s performance in speaker-independent scenarios.
MELD [31] is derived from the TV sitcom Friends, containing over 1400 dialogue scenes and approximately 13,000 utterances. MELD exhibits significant “in-the-wild” characteristics, featuring complex background noise, varying numbers of speakers, and dynamic camera angles. It provides synchronized audio, visual, and textual modalities for each utterance, making it suitable for evaluating robustness in complex acoustic environments and multiparty contexts. The dataset is annotated with seven discrete emotion categories: anger, disgust, fear, joy, neutral, sadness, and surprise. For our experiments, we adopt the standard fixed partition, which splits the dataset into training, validation, and testing sets at a ratio of approximately 80%, 10%, and 10%, respectively.
4.2. Experimental Settings and Evaluation Metrics
This study employs two core evaluation metrics widely recognized in the MER domain: accuracy (Acc) and weighted F1-score (W-F1) [32]. Higher values for both metrics indicate superior model performance. Accuracy measures the proportion of correctly predicted samples to the total number of samples, intuitively reflecting the model’s overall classification capability. The weighted F1-score calculates the weighted average of the F1-score for each category based on the proportion of samples. W-F1 effectively mitigates the issue of inflated performance scores caused by overfitting to majority classes, providing a more objective assessment of recognition efficacy on minority emotion classes.
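For concreteness, the W-F1 metric can be computed in a few lines of plain Python (this is the standard weighted-average definition, equivalent to scikit-learn's `f1_score(..., average="weighted")`, not code from the paper):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with class-frequency (support) weights."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (n / total) * f1   # weight each class by its sample share
    return score

# perfect predictions give W-F1 = 1.0
assert weighted_f1([0, 1, 1, 2], [0, 1, 1, 2]) == 1.0
```

Because each class's F1 is weighted by its support, a model that only fits the majority class is penalized on every misclassified minority sample, which is why W-F1 is preferred on long-tailed emotion data.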
We executed all model training and inference tasks on a robust Linux-based computing environment featuring an NVIDIA GeForce RTX 3080Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA). The software architecture was built upon Python 3.10, leveraging the advanced capabilities of the PyTorch 2.1.2 framework in conjunction with the CUDA 11.8 computing platform to optimize training speed and resource utilization. For clarity and reproducibility, specific settings regarding the training strategy are provided in Table 2, which details the exact hyperparameter values used to achieve the reported results.
Furthermore, to assess the model’s scalability in real-world contexts, we evaluated its computational efficiency on the IEMOCAP dataset. The proposed DHGTN contains 27.5 M parameters. In terms of computational speed, conducting inference on a single NVIDIA GeForce RTX 3080Ti GPU results in an average latency of 37.4 ms per utterance (with a batch size of 1). The total training time for convergence (80 epochs) is approximately 4.5 h. These concrete measurements confirm that while DHGTN incorporates dynamic graph modeling, it maintains a reasonable trade-off between recognition accuracy and computational cost, making it feasible for deployment in time-sensitive applications.
4.3. Comparative Analysis of Experimental Results
In order to provide a rigorous evaluation of our model within the context of multimodal emotion recognition, we perform a comparative analysis against a spectrum of representative baselines. The selected models span distinct architectural paradigms and evolutionary phases of the field, including established classics and recent state-of-the-art advancements. To ensure fair and rigorous evaluation of experimental results, we trained all competing baseline models under identical experimental conditions. All models were trained and tested on exactly the same data splits for IEMOCAP and MELD. This approach eliminates performance variations arising from differing feature properties or data partitioning. These baselines are summarized as follows:
- (1) TFN: A classic fusion paradigm that constructs a high-dimensional fusion tensor by calculating the Cartesian product between modalities, explicitly modeling high-order nonlinear interactions across all modality combinations.
- (2) LMF: An efficient improvement over TFN that utilizes low-rank tensor decomposition to assign low-rank factors to each modality. This approximates the full high-dimensional interaction tensor, preserving intermodal combinatorial capabilities while reducing computational overhead.
- (3) MFN [33]: Designed for multimodal sequential data, it employs a multiview gated memory network to jointly model modality-specific sequences and cross-modal interactions over time, aiming to capture long-term dependencies.
- (4) MulT: A representative architecture applying Transformers to the multimodal domain. It utilizes cross-modal attention mechanisms to capture long-range dynamic cross-modal interactions without the need for explicit alignment.
- (5) MISA [34]: Based on the concept of representation disentanglement, it projects multimodal inputs into two distinct subspaces—“Modality-Invariant” and “Modality-Specific”—to enhance robustness and reduce information redundancy.
- (6) TMBL [35]: Reconstructs the Transformer architecture by introducing bimodal and trimodal binding mechanisms combined with fine-grained convolution modules, aiming to strengthen deep coordination and fusion of different data types.
- (7) HGCLLG [29]: A hierarchical graph contrastive learning framework that explores local interactions within single utterances and global context relationships between utterances via graph structures, optimizing multimodal representations through graph gating mechanisms.
- (8) HCIL [36]: Proposes a hybrid interaction framework that integrates intramodal dynamics, intermodal fusion, and cross-modal learning mechanisms to maximize the synergistic effect of different modal information on emotional expression.
- (9) TETFN [37]: Adopts a text-centric strategy, leveraging powerful textual semantic features to enhance and calibrate feature extraction for acoustic and visual modalities, generating a unified and efficient multimodal representation.
- (10) MDH [38]: Utilizes a dynamic hypergraph convolutional network, allowing edges to connect multiple nodes to capture high-order modal relationships. Its “dynamic” nature allows for adaptive adjustment of the graph structure based on inputs, flexibly modeling complex associations between modalities.
To verify the effectiveness of the method, we conducted comparative experiments on the IEMOCAP and MELD datasets. We report the average performance and standard deviation over five independent runs to rule out stochastic effects. The specific results are presented in Table 3 and Table 4, respectively.
Specifically, on the IEMOCAP dataset, our method achieved improvements of approximately 1.74% and 2.03% in Acc and W-F1, respectively, over the second-best-performing model (MDH), and improvements of 7.95% and 7.62% over the classic MulT model. On the MELD dataset, our method maintained robust performance, surpassing the second-best MDH by approximately 1.35% in W-F1 and the classic MulT by 7.36%. These results fully demonstrate that the proposed dynamic heterogeneous graph interaction mechanism effectively captures not only intramodal temporal dependencies but also the dynamic interactions of emotional cues across modalities, proving highly effective in detecting subtle emotional shifts in dialogue. Crucially, the proposed composite optimization strategy significantly alleviates the long-tail issue. Detailed class-wise analysis reveals that DHGTN achieves the most notable gains on minority emotions compared to the baseline MDH. For instance, on the “Fear” class (which accounts for fewer than 7% of samples), our model improves the accuracy by roughly 7.30% (from 57.87% to 65.17%). This confirms that the focal and Dice losses effectively prevent the model from overfitting to majority classes like “Neutral”.
Comparison with tensor fusion and sequence models: While TFN and LMF capture high-order relationships via tensor computation, they neglect the dynamic changes in the temporal dimension of long sequences, leading to suboptimal performance on long-dialogue datasets like IEMOCAP. Although MFN introduces memory mechanisms for sequence processing, its gating mechanism struggles with complex modal interactions in multiparty scenarios. In contrast, our method leverages dynamically updated graph nodes to simultaneously account for temporal information and modal interactions, thereby significantly outperforming these classic models across all metrics.
Comparison with attention-based models: MulT, MISA, and TMBL utilize attention mechanisms effectively for multimodal feature alignment. However, MISA is prone to introducing feature redundancy during disentanglement; MulT struggles to capture the implicit topological structure of dialogues without explicit graph guidance; and TMBL remains focused on local feature sequences without achieving global relational representation. By combining the attention advantages of Transformers with the structural strengths of graph neural networks, our method retains key cross-modal information while enhancing contextual understanding. This results in higher classification accuracy, particularly in the short-text, fast-paced dialogue scenarios typical of MELD.
Comparison with feature enhancement and GNN models: TETFN relies on powerful pretrained text models to guide acoustic and visual modalities; consequently, its performance is heavily dependent on the quality and informational content of the text modality, limiting its utility in certain real-world scenarios. Static-graph-based models like HGGLG and HCIL lack the flexibility to handle instantaneous changes in emotional states. While MDH employs a dynamic hypergraph mechanism, it incurs high computational complexity and focuses its dynamics on modeling high-order modal relationships rather than the real-time adjustment of relational weights between dialogue units. The proposed method instead treats emotional dynamics as a real-time evolution of a graph structure, allowing fine-grained adjustment of connection structures and weights based on current inputs and historical context. Without relying on heavy text augmentation or complex hypergraphs, our method achieves state-of-the-art performance on both datasets through efficient structural learning, demonstrating the advantage of our approach in modeling dialogue relational dynamics.
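For intuition, adjusting an edge weight from the current inputs plus historical context can be sketched as follows; this is a toy illustration only, and `beta` and the exponential-moving-average decay are assumed hyperparameters, not the paper's actual formulation.

```python
import math

def edge_weight(h_i, h_j, history, beta=0.5):
    """Hypothetical dynamic edge weight between two utterance nodes:
    blends the cosine similarity of the current node features with an
    accumulated historical term, so connections adapt as the dialogue
    evolves."""
    dot = sum(a * b for a, b in zip(h_i, h_j))
    ni = math.sqrt(sum(a * a for a in h_i)) or 1.0
    nj = math.sqrt(sum(b * b for b in h_j)) or 1.0
    current = dot / (ni * nj)                     # similarity of current inputs
    return beta * current + (1 - beta) * history  # mix in dialogue history

def update_history(history, new_weight, decay=0.9):
    """Exponential moving average carries past interaction strength
    forward to the next timestep."""
    return decay * history + (1 - decay) * new_weight
```

A static graph would fix these weights once; here they are recomputed at every timestep, which is the distinction drawn above against HGGLG and HCIL.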
4.4. Ablation Studies
4.4.1. Impact of Individual Modules
We conducted ablation studies to systematically evaluate the validity and contribution of each core component. The experimental results are presented in
Table 5, and the visualization results are shown in
Figure 2. The “w/o Feature Decoupling and Alignment Module” variant removes the explicit separation of modal features into shared and private subspaces: modality-specific features are used directly for graph node initialization, and the shared-space contrastive loss term is dropped. The “w/o Dynamic Graph Fusion Mechanism” variant eliminates the construction of the dynamic heterogeneous graph and the spatial fusion process of the graph attention network, replacing the global graph vector with simple feature concatenation. The “w/o Composite Regularization Constraints” variant removes all auxiliary loss terms, including the shared-space contrastive loss, temporal consistency loss, and supervised contrastive loss, replacing the classification objective with a standard cross-entropy loss.
The results indicate that our proposed method significantly outperforms all ablation variants across every evaluation metric, showing that the tight synergy between components is key to optimal performance. The ablation analysis also reveals a clear hierarchy. First, removing the Feature Decoupling and Alignment Module caused an acute performance drop, with W-F1 decreasing by approximately 8.34% on IEMOCAP and 5.51% on MELD, demonstrating its central role in addressing multimodal heterogeneity: by explicitly decoupling features and aligning semantics, it avoids information redundancy and prevents overfitting to modality-specific noise, thereby enhancing the robustness of emotional representations. Second, replacing the Dynamic Graph Fusion Mechanism with simple feature concatenation produced a similarly severe decline, with Acc dropping by approximately 4.63% on IEMOCAP and 5.57% on MELD. This confirms that simple linear combinations fail to model high-order nonlinear interactions between modalities, and that the module’s fine-grained, structured modeling of instantaneous interactions is the critical mechanism behind high-quality multimodal fusion. Finally, removing the Composite Regularization Constraints decreased W-F1 by approximately 4.02% on IEMOCAP and 3.13% on MELD, underscoring the importance of optimization constraints for deep networks: this module mitigates the long-tail distribution issues and prediction jitter prevalent in natural emotional data, endowing the model with more robust and discriminative predictions.
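To make the decoupling idea concrete, a bare-bones sketch of the shared/private split and a cosine-based shared-space alignment term might look like the following; the projection weights and the exact loss form are illustrative assumptions, not the module's actual implementation.

```python
import math

def project(x, W):
    """Linear projection: one row of W per output dimension."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def decouple(x, W_shared, W_private):
    """Hypothetical split of one modality's feature vector into a
    shared (modality-invariant) part and a private (modality-specific)
    part via two separate projections."""
    return project(x, W_shared), project(x, W_private)

def shared_contrastive(s_a, s_b):
    """Toy shared-space alignment term: pulls the shared embeddings of
    two modalities together (higher cosine similarity, lower loss).
    This is the term removed in the first ablation variant."""
    dot = sum(a * b for a, b in zip(s_a, s_b))
    na = math.sqrt(sum(a * a for a in s_a)) or 1.0
    nb = math.sqrt(sum(b * b for b in s_b)) or 1.0
    return 1.0 - dot / (na * nb)
```

In the “w/o Feature Decoupling” variant, only the raw modality features would feed the graph nodes and no alignment term would be applied.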
Furthermore, to differentiate the contribution of the proposed fusion architecture from the strength of the pretrained feature extractors, we conducted an additional ablation study using simpler backbones, replacing BERT, wav2vec 2.0, and VideoMAE with GloVe+LSTM, openSMILE, and ResNet-50, respectively. As shown in the “w/o Traditional Backbones” row in
Table 5, this model’s Acc dropped to 60.89% on IEMOCAP and to 56.41% on MELD. While this drop highlights the importance of the rich semantic representations provided by advanced pretraining, performance remains strong relative to early multimodal baselines. This demonstrates that the dynamic heterogeneous graph structure captures cross-modal correlations effectively even with limited feature expressiveness, proving that the model’s effectiveness does not rest solely on powerful backbones.
In summary, all components make indispensable contributions to the stable and high-performance emotion recognition achieved by DHGTN.
4.4.2. Impact of Multimodal Fusion
To investigate the contribution of different modalities and their complementary mechanisms in emotion recognition, we designed a progressive ablation experiment ranging from unimodal to bimodal and finally full modal configurations. Here, T, A, and V denote text, acoustic, and visual modalities, respectively.
Table 6 records the performance under different modality combinations, and the trends are visualized in
Figure 3.
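The progressive configurations can be enumerated mechanically; a small helper, assuming only the T/A/V labels defined above:

```python
from itertools import combinations

MODALITIES = ("T", "A", "V")  # text, acoustic, visual

def ablation_configs():
    """List the unimodal, bimodal, and full-modality settings used in
    the progressive ablation (7 configurations in total)."""
    return [combo for r in (1, 2, 3) for combo in combinations(MODALITIES, r)]
```

Each configuration corresponds to one row of the modality-combination results, from single modalities up to the full T + A + V setting.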
The results show that the T + V + A configuration achieved the best performance on both datasets. Removing any single modality resulted in statistically significant declines across all core evaluation metrics. This strongly supports the “Modality Complementarity Hypothesis”, suggesting that each modality provides irreplaceable emotional information, and their organic integration constructs the most robust emotional representation.
Meanwhile, the results differ markedly across datasets. On the MELD dataset, the T + V combination outperformed T + A, indicating that facial expressions play a critical role in semantic disambiguation within short dialogues. Conversely, on the IEMOCAP dataset, the V + A combination achieved competitive results even without text, which we attribute to IEMOCAP’s substantial improvisation, where variations in actors’ prosody become key features for distinguishing emotions. Consequently, the model must be able to weight modalities adaptively according to the characteristics of the data.
Beyond the standard modality ablation, in which models are retrained on specific subsets, we further evaluated the robustness of the fully trained DHGTN model against missing modalities during the inference phase. This simulates real-world hardware failures (e.g., camera occlusion or microphone noise) by masking the input features of a specific modality (setting its vectors to zero) without retraining. As illustrated in
Figure 4, DHGTN demonstrates remarkable resilience. For instance, in the “Missing Visual” scenario, the performance decline on IEMOCAP is significantly lower than that of baseline models. This stability validates that our “Shared Private” decoupling mechanism successfully encodes modality-invariant emotional semantics, allowing the model to recover core information from the remaining modalities even when one source is corrupted.
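The inference-time masking described above amounts to zeroing one modality's feature vector while leaving the others untouched; a minimal sketch (the dictionary layout is a hypothetical interface, not the model's actual API):

```python
def mask_modality(features, missing):
    """Simulate a sensor failure at inference time by zeroing the
    chosen modality's feature vector, without any retraining, as in
    the missing-modality robustness test."""
    return {m: ([0.0] * len(v) if m == missing else list(v))
            for m, v in features.items()}
```

Because the vector length is preserved, the graph topology and downstream layers run unchanged; only the information content of one node type is removed.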
4.5. Case Analysis
To deeply explore the decision logic and robustness of DHGTN in real-world conversational scenarios, we selected three representative samples from the MELD test set for qualitative analysis. Detailed results are shown in
Table 7.
In Case 1, the text conveys a clear negative signal (“Pressure”). Visually, the subject exhibits open palm gestures and a serious facial expression, while the audio presents a rapid tempo with distinct stress. Through the Transformer fusion module, DHGTN effectively aligns the textual semantics with these “anxiety” features and identifies the emotion not as low-arousal Sadness but as cathartic Anger, yielding a correct prediction.
In Case 2, the text content is merely a functional instruction. Visually, although the subject frowns, the overall posture leans towards a normal communicative state. The audio pitch remains relatively stable without obvious aggression. DHGTN avoids over-interpreting this local visual noise. By capturing the global stability of the multimodal state, it accurately controls the emotional boundary and classifies the sample as Neutral, demonstrating robustness against complex background contexts.
In Case 3, although the text contains positive vocabulary, it is prone to being misinterpreted as irony or sarcasm in specific contexts. The acoustic modality extracts high-pitched intonation and a brisk pace, while the visual modality captures typical expressions of pleasure. Quantitative analysis of the graph attention layer reveals that, at the moment of the utterance, the dynamic graph mechanism adaptively shifts dominance from text (0.15 weight) to audio (0.45 weight) and vision (0.40 weight). This instantaneous weight adjustment allows the model to override the semantic ambiguity of the text by focusing on the high-pitched intonation and exaggerated smile, correctly identifying the emotion as Joy, whereas static baselines misclassify it as Neutral or Positive.
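The reported weight shift (text 0.15, audio 0.45, vision 0.40) is consistent with a softmax over per-modality relevance scores, as is standard in attention layers; a toy sketch, with purely illustrative scores:

```python
import math

def modality_attention(scores):
    """Softmax over per-modality relevance scores; the resulting
    weights sum to 1, so raising the audio/vision scores necessarily
    pulls dominance away from text (scores here are illustrative)."""
    m = max(scores.values())                           # subtract max for stability
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}
```

A dynamic graph recomputes such scores per utterance, which is what allows the momentary reweighting described in Case 3; a static fusion scheme would keep one fixed set of weights for the whole dialogue.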
In conclusion, the case study demonstrates that our method exhibits stable performance across different types of multimodal interaction patterns, possessing excellent generalization capabilities and robustness.
5. Conclusions
To address core challenges in multimodal emotion recognition—specifically severe modal heterogeneity, the difficulty of modeling deep interactions, and interference from redundant information—this paper proposes a novel approach based on a dynamic heterogeneous graph network. Extensive experiments on the IEMOCAP and MELD benchmarks demonstrate that the proposed method significantly outperforms state-of-the-art baselines in key metrics, including accuracy and weighted F1. These results substantiate the model’s effectiveness and generalization capability in handling complex cross-modal interactions and fine-grained emotion classification tasks.
However, the proposed framework has distinct boundaries regarding its applicability in real-world scenarios. First, the reliance on complete and synchronized multimodal data is a strong idealization. In uncontrolled environments, missing modalities (e.g., occlusion) or signal asynchrony may disrupt the graph topology and degrade performance; future iterations must incorporate missing-modality imputation mechanisms to handle incomplete views. Second, the dynamic nature of the graph, which updates weights at every timestep, introduces higher computational overhead than static methods. While our reported latency benefits from GPU parallelization of the matrix multiplications required for attention-weight calculation, deployment on hardware lacking massive parallelism may incur increased latency due to sequential processing. Nevertheless, given the small graph scale, the framework remains viable for edge deployment, particularly when optimization techniques such as quantization are applied.
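As a rough illustration of the quantization option mentioned above, symmetric int8 weight quantization works as follows; this is a generic sketch, not tied to any specific deployment toolchain used with DHGTN.

```python
def quantize_int8(weights):
    """Map float weights to int8 with one symmetric scale factor,
    trading a small approximation error for reduced memory and
    cheaper integer arithmetic on edge hardware."""
    scale = (max(abs(w) for w in weights) / 127.0) or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [qi * scale for qi in q]
```

The round-trip error per weight is bounded by the scale, which is why small graphs with few attention parameters tend to tolerate quantization well.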
In light of these limitations, future research will focus on establishing highly robust and lightweight multimodal emotion computing architectures. Key directions include exploring network pruning and knowledge distillation to reduce the computational cost of dynamic graphs for real-time applications, as well as developing generative approaches to infer missing semantic cues. These advancements will ensure the model remains robust even when one or more modalities are unavailable, balancing high performance with the adaptability required for large-scale, real-time human–computer interaction.