Article

A Complementary Fusion Framework for Robust Multimodal Emotion Recognition

1 Department of Electronic Engineering, Chosun University, Gwangju 61452, Republic of Korea
2 Department of New Industry Convergence, Chosun University, Gwangju 61452, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2025, 14(22), 4444; https://doi.org/10.3390/electronics14224444
Submission received: 1 October 2025 / Revised: 1 November 2025 / Accepted: 12 November 2025 / Published: 14 November 2025
(This article belongs to the Special Issue Emerging Trends in Multimodal Human-Computer Interaction)

Abstract

This paper presents a novel dual-stream framework for multimodal emotion recognition, engineered to address the varying complexity inherent in emotional expressions. The proposed architecture uniquely integrates a graph embedding as an auxiliary modality to explicitly model temporal correlations between utterances, and processes features through two complementary sub-models operating in parallel: a Cross-Attention Transformer Mixture-of-Experts (MoE) model and a Sum-Product Linear MoE model. The former deciphers nuanced and ambiguous emotions, such as ‘fear’ and ‘sadness’, by leveraging a deep cross-attention mechanism to model intricate, bidirectional dependencies between textual and acoustic features. The latter is a lightweight model optimized to efficiently recognize clear and intuitive emotions, like ‘anger’ and ‘happiness’, through simple element-wise operations. The final prediction is derived from an average ensemble of logits from both sub-models, ensuring a robust and balanced classification. Evaluated on the Korean AI-Hub and English IEMOCAP datasets, the framework achieves state-of-the-art accuracy of 0.8071 and demonstrates excellent cross-lingual generalization with an accuracy of 0.7823. Empirical results validate the complementary design, confirming that the specialized models synergistically enhance overall performance.

1. Introduction

The rapid advancement of artificial intelligence (AI) is fundamentally transforming the way humans and computers interact. Central to this technological evolution is emotion recognition technology, which accurately perceives the emotional states of users and responds appropriately. This technology plays a pivotal role in realizing more natural and efficient interactions [1]. Emotion recognition contributes to providing real-time, personalized environments by automatically identifying a user’s state, even without explicit emotional expressions [2,3]. It has become a core technology in various applications such as intelligent interfaces, conversational AI, and affective robotics, with its applicability expanding across diverse industries, including education, healthcare, automotive, and gaming [4]. Consequently, there is a continuously growing demand for more precise and reliable emotion recognition systems.
Early research in emotion recognition predominantly focused on analyzing emotions using single sources of information, or unimodal data, such as text, speech, and video [5,6,7,8,9,10,11]. While these approaches achieved a certain level of success based on the unique characteristics of each modality, they have evident limitations in fully capturing the complexity and diversity of human emotions. Text-based analysis is useful for understanding the explicit meaning of sentences but can easily miss contextual nuances or non-verbal cues [12]. Speech-based recognition, which analyzes acoustic features like intonation and speech rate, can suffer from performance degradation due to its sensitivity to background noise and the speaker’s personal characteristics [13]. Similarly, video-based recognition analyzes emotions through facial expressions or gestures but is heavily influenced by environmental factors such as lighting and camera angles, which can impair its generalization capabilities [14]. These unimodal approaches have shown inherent limitations in improving emotion recognition performance due to their respective structural constraints.
To overcome these limitations, multimodal emotion recognition, which integrates multiple modalities to analyze emotions, has recently garnered significant attention [15,16]. This approach enhances the accuracy of emotion recognition by combining the strengths of each modality and improves its applicability in various environments [17]. It is now considered a key direction for the advancement and practical implementation of emotion recognition technology [18,19,20]. However, many existing multimodal studies have limitations, as they often simply concatenate or combine feature vectors in parallel, failing to effectively reflect the sophisticated interactions between the two modalities [21].
In this paper, we propose A Complementary Fusion Framework for Robust Multimodal Emotion Recognition, a model designed to reflect both the temporal context and structural correlation between modalities. The proposed model utilizes text and speech embeddings as primary modalities and utterance timing-based graph embeddings as an auxiliary modality to simultaneously learn temporal context and structural relationships. The model consists of two complementary sub-models, each employing a specialized fusion strategy to tackle these different facets of emotion. The academic and practical contributions of this study are as follows:
  • Information Integration using Time-Aligned Graph Embedding: We model the connection structure between utterances in a graph format based on the temporal information of text and speech. By using this as an auxiliary modality, we effectively capture the temporal context and structural relationships of the dialogue, which have often been overlooked in existing models.
  • Design of a Complementary Dual-Fusion Strategy: We propose a dual strategy that analyzes emotions with high affective complexity using a Cross-Attention Transformer structure, while analyzing those with high clarity using a Sum-Product Linear structure. This division is motivated by established affective computing theories, such as Russell’s Circumplex Model [22], which differentiate emotions based on dimensions like valence and arousal. We hypothesized that ‘clear’ emotions like ‘anger’ and ‘happiness’, which are typically high arousal, present distinct signals suitable for efficient, lightweight fusion. Conversely, ‘complex’ emotions like ‘fear’ and ‘sadness’, which are often low-arousal or ambiguous, exhibit subtle or even contradictory signals, such as strong text with weak audio. This necessitates a deep, bidirectional fusion model to decipher their intricate inter-modal dependencies. This presents a novel multimodal recognition architecture that can flexibly respond to various types of emotions.
  • Quantitative Validation on Diverse Public Datasets: Through experiments on both the Korean AI-Hub Emotion in Dialogue dataset and the English IEMOCAP dataset, we empirically demonstrate the superiority and linguistic generalization capabilities of the proposed model.
  • Practical Design for Industrial Applications: The model was designed with practical applicability in various industrial fields, such as education and healthcare, in mind. In particular, the proposed parallel structure and graph embedding enhance scalability and suitability for real-world application environments.
This study was designed not only to achieve simple performance improvements but also to consider practical applicability in real-world environments. We expect that it will make a substantial contribution, both academically and industrially, to the advancement of multimodal emotion recognition technology.
The remainder of this paper is organized as follows: Section 2 reviews related work in unimodal and multimodal emotion recognition, as well as MoE architectures. Section 3 details the proposed complementary fusion framework, including the input embeddings (text, speech, graph) and the architectures of the two parallel sub-models. Section 4 presents the experimental setup, datasets, comparative results, and detailed analyses of our findings. Finally, Section 5 concludes the paper, summarizes the contributions, and discusses limitations and future work.

2. Related Work

2.1. Text- and Speech-Based Emotion Recognition

Initial research in emotion recognition predominantly focused on unimodal approaches. Among these, text and speech have been the most extensively researched modalities, as they play the most central role in human communication. The two modalities differ in their method of conveying emotion and the nature of the information they contain, and thus, each possesses distinct advantages and clear technical limitations.
Text-based emotion recognition, which explicitly infers emotions from linguistic information, identifies emotional states by analyzing vocabulary and syntactic structures within sentences. With the recent advancement of deep learning, studies leveraging pre-trained models such as BERT to generate sophisticated vectorized representations of contextual meaning have shown high performance. Furthermore, there have been attempts in text-based emotion recognition to identify sequential patterns of emotions within sentences [23] or to maximize expressive power through hybrid models that combine multiple deep learning architectures like CNN and Bi-GRU [5]. Additionally, to enhance the accuracy of emotion analysis, studies have proposed using emoticon information alongside text [24] or integrating the sentiment scores of words into embeddings, as in the LeBERT model [8]. These studies demonstrate that the choice of embedding method and model architecture significantly impacts performance in text-based sentiment analysis. However, text carries an inherent problem of ambiguity, where the same words can be interpreted differently depending on context or intonation. It also has a fundamental limitation in its ability to capture non-verbal information, such as emotional intensity or subtle nuances.
Speech-based emotion recognition captures the implicit aspects of emotion by analyzing prosodic features of an utterance, such as intonation, speed, stress, and timbre. These features serve as crucial cues for providing information that is difficult to grasp from text alone, such as the depth or tension of an emotion. In early research, algorithms were proposed to generate personalized emotion patterns through PLP (Perceptual Linear Prediction) analysis [25]. More recently, the primary trend has been to extract standard acoustic features like MFCCs (Mel-Frequency Cepstral Coefficients) and use them as input for various deep learning models. Methods have been proposed to learn spatio-temporal patterns from Mel-Spectrograms using 2D-CNNs [10] or to precisely model the time-series characteristics of speech by combining BiLSTM-Transformer structures [6]. Moreover, other studies have focused on integrating multiple levels of acoustic information such as MFCC, Spectrogram, and wav2vec2 using a Co-Attention mechanism to concentrate on important features [26], or have compared and analyzed performance by fusing various speech features based on CCA (Canonical Correlation Analysis) [27]. However, speech signals suffer from a persistent challenge: they are highly sensitive to external factors such as the speaker’s unique characteristics, background noise, and differences in recording equipment, making it difficult to ensure model robustness.
In conclusion, text and speech exist in a complementary relationship, representing different dimensions of emotion. If text conveys the logical content of an emotion, speech communicates the emotional color and intensity embedded within that content. Therefore, to overcome the limitations inherent in each modality and to more accurately understand complex human emotions, a multimodal approach that integratively analyzes both sources of information is essential.

2.2. Multimodal Emotion Recognition

To overcome the inherent limitations of the unimodal approaches previously discussed, multimodal emotion recognition, which integrates and analyzes multiple modalities, has emerged as a key area of research. Since human emotions are expressed simultaneously through multiple channels, such as linguistic content, vocal tone, and facial expressions, comprehensively utilizing a variety of modalities is essential for enhancing the accuracy and robustness of emotion recognition.
The performance of a multimodal system is largely dependent on how effectively data with different characteristics are fused. Fusion strategies are broadly categorized into Early Fusion and Late Fusion, depending on the stage and method of information integration. Early fusion is an approach where feature vectors extracted from each modality are concatenated into a single, unified vector, which is then fed into a single classification model. While this method has the advantage of learning low-level interactions between modalities, it requires temporal synchronization of the data and often entails high computational complexity. In contrast, late fusion involves first generating individual predictions from independent models for each modality, and then aggregating these results to make a final decision. Although this approach is robust to the absence of information from a modality, it has the disadvantage of potentially missing fine-grained, feature-level interactions. Hybrid Fusion is a compromise between these two approaches, offering greater accuracy and flexibility but tending to result in more complex model architectures.
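To make the distinction concrete, the following minimal PyTorch sketch contrasts the two strategies; the classifier modules, feature dimensions, and seven-class output are illustrative assumptions rather than code from the cited literature.

import torch
import torch.nn as nn

text_feat = torch.randn(8, 768)    # batch of pre-extracted text features (hypothetical sizes)
speech_feat = torch.randn(8, 768)  # batch of pre-extracted speech features

# Early fusion: concatenate the feature vectors, then train a single classifier on the joint vector.
early_clf = nn.Linear(768 + 768, 7)                                   # 7 emotion classes (assumed)
early_logits = early_clf(torch.cat([text_feat, speech_feat], dim=-1))

# Late fusion: classify each modality independently, then aggregate the predictions.
text_clf, speech_clf = nn.Linear(768, 7), nn.Linear(768, 7)
late_logits = (text_clf(text_feat) + speech_clf(speech_feat)) / 2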
Recently, research has actively moved beyond simple feature combination toward Transformer-based fusion architectures to model the sophisticated interactions between modalities. These models primarily leverage the Cross-Attention mechanism, which enables dynamic information exchange by using information from one modality as a query to reference information from another. For instance, models like MemoCMT and HyFusER have achieved high recognition performance by learning the inter-dependencies between text and speech features through a Cross-Modal Transformer architecture [28,29]. Furthermore, other approaches have been introduced, such as using a Graph Convolutional Neural Network (DER-GCN) to model the relationships between speakers to consider conversational context [30], or an asymmetric fusion network (CFN-ESA) that utilizes text as the primary modality and acoustic and visual information as auxiliary modalities [31].
However, despite these advancements, many existing Transformer-based multimodal models still possess architectures that fail to effectively capture inter-modal interactions. Furthermore, they continue to show limitations in integrating the temporal flow of emotions with the structural context between utterances. To address these limitations, this paper proposes a novel fusion model that can simultaneously reflect both the complementarity and the time-series structure of modalities by explicitly integrating graph-based information derived from the time-aligned structure of text and speech.

2.3. MoE Architecture

The MoE (Mixture of Experts) architecture has garnered attention as a solution for maximizing computational efficiency while preserving the representational capacity of large-scale neural networks. MoE partitions the entire network into multiple independent “expert” sub-networks and selectively activates only a small subset of the most suitable experts based on the characteristics of the input data. The core of the MoE architecture consists of a Gating Network and multiple Expert Networks. For a given input, the Gating Network evaluates the importance of each expert, computes probabilistic weights, and thereby determines which experts to activate. In contrast, a standard dense model performs computations using all of its parameters for every input. The MoE architecture, however, utilizes only a subset of experts for each input. This provides the distinct advantage of allowing the total number of model parameters to be significantly scaled up while keeping the computational cost required for inference constant.
To implement this sparse activation, the Top-K Routing strategy is widely used. This method selects only the top K experts based on the scores computed by the gating network. Pioneering studies such as the Sparsely Gated MoE and the Switch Transformer demonstrated the feasibility of efficiently operating massive models containing thousands of experts by proposing Top-K and Top-1 routing methods, respectively [32,33].
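As a minimal illustration of Top-K routing (a generic sketch, not code from [32,33]), the gating function below scores all experts, keeps only the K highest-scoring ones per input, and renormalizes their weights so that only those experts need to be evaluated.

import torch
import torch.nn as nn
import torch.nn.functional as F

def top_k_gate(x, gate, k=2):
    # x: (batch, dim); gate: nn.Linear mapping dim -> num_experts
    scores = gate(x)                                  # raw gating scores, (batch, num_experts)
    topk_scores, topk_idx = scores.topk(k, dim=-1)    # keep only the K best experts per input
    weights = F.softmax(topk_scores, dim=-1)          # renormalize over the selected experts
    return topk_idx, weights                          # which experts to run, and how to mix them

gate = nn.Linear(256, 8)                              # assumed sizes: 256-dim input, 8 experts
expert_ids, mix_weights = top_k_gate(torch.randn(4, 256), gate, k=2)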
This MoE structure is inherently well-suited for multimodal emotion recognition, which deals with heterogeneous information. Since text and speech differ in their expressive methods and information structures, individual experts can be trained to specialize in the characteristics of each modality or specific types of emotions. The study by Goyal [34] proposed an MoE-based architecture to dynamically combine predictions from speech and video data, demonstrating superior performance over static fusion methods. This suggests that the Gating Network can flexibly handle the complexity and uncertainty of emotions by selecting the optimal combination of experts based on the input’s characteristics and context.
This study applies the strengths of the MoE architecture to multimodal emotion recognition, aiming to sophisticatedly integrate the disparate information structures of text and speech. We use the embeddings generated by the different fusion methods proposed earlier as input to the MoE structure, thereby guiding each expert to specialize in specific emotional expressions. Through this approach, we aim to maximize the overall representational capacity of the model and achieve more precise and robust recognition across diverse emotion types.

3. Proposed Mechanism

In this study, we propose the “A Complementary Fusion Framework for Robust Multimodal Emotion Recognition” model to maximize the accuracy of emotion recognition by leveraging text, speech, and the temporal relationship between the two modalities. The overall architecture of the proposed model is illustrated in Figure 1.
The model first extracts respective embeddings from the input text and speech data. Subsequently, it generates a graph embedding based on the time-alignment information between the two data streams, facilitated by the Whisper model. These three types of embeddings are then fed into two independent, parallel paths. The first path consists of a Cross-Attention Transformer-based model, which focuses on learning the complex and subtle interactions between the modalities. The second path is a Sum-Product Linear-based model, designed to efficiently extract salient emotional features. Each path produces an independent emotion prediction through its MoE structure. Finally, these two predictions are combined using an average ensemble method to yield a more stable and accurate final emotion recognition.

3.1. Input Embeddings

To capture the diverse characteristics of emotion, the proposed model utilizes three types of information as input. Each type of information is converted into a vector form understandable by the model through a distinct embedding process. These are a text embedding that encapsulates the semantic content of the text; a speech embedding that reflects the prosodic features of the audio; and a graph embedding that structures the temporal relationship between the two modalities.

3.1.1. Text Embedding

The objective of the text embedding is to precisely extract both the explicit and contextual emotional information embedded within an utterance. To achieve this, we adopted KoELECTRA, a pre-trained language model known for its high performance on Korean data. The text embedding generation process consists of three stages: Tokenization, Feature Extraction, and Sequence Normalization.
First, in the Tokenization stage, the input Korean utterance is decomposed into a sequence of tokens using KoELECTRA’s subword tokenizer. During this process, special tokens, [CLS] and [SEP], are automatically appended to denote the beginning and end of the sentence, respectively. Next, during Feature Extraction, the tokenized sequence is fed into the KoELECTRA model to extract the last hidden state vectors. These vectors are rich in contextual information for each token. Finally, in the Sequence Normalization stage, the length of all sequences is fixed to 128 to standardize the model’s input size. As our analysis revealed that all utterances used in the experiment consisted of fewer than 128 tokens, sequences shorter than this length are padded with [PAD] tokens to meet the required dimension. Ultimately, each utterance is converted into a tensor of size (128, 768), which serves as the input for the subsequent fusion model. This entire process is summarized in Algorithm 1.
Algorithm 1. Text Embedding using KoELECTRA
1: procedure GenerateTextEmbedding(text_input)
2:   tokenizer ← LoadTokenizer(“koelectra-base-v3”)
3:   model ← LoadModel(“koelectra-base-v3”)
4:   input_features ← tokenizer(text_input, padding = “max_length”, max_length = 128, truncation = True)
5:   model_output ← model(input_features)
6:   text_embedding ← model_output.last_hidden_state
7:   return text_embedding
8: end procedure
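A runnable counterpart of Algorithm 1 using the Hugging Face transformers library is sketched below. The checkpoint name monologg/koelectra-base-v3-discriminator is an assumption (the paper specifies only “koelectra-base-v3”), and the code mirrors the pseudocode rather than the authors’ exact implementation.

import torch
from transformers import AutoTokenizer, AutoModel

CKPT = "monologg/koelectra-base-v3-discriminator"     # assumed public KoELECTRA base v3 checkpoint
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModel.from_pretrained(CKPT)

def generate_text_embedding(text_input):
    # Tokenize, pad/truncate to 128 tokens, and extract the last hidden states.
    inputs = tokenizer(text_input, padding="max_length", max_length=128,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state                  # shape (1, 128, 768)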

3.1.2. Speech Embedding

The speech embedding focuses on capturing the prosodic and non-verbal cues of emotion, such as the speaker’s intonation, stress, and speed, which are difficult to discern from text alone. For this purpose, we employed HuBERT, a self-supervised learning model pre-trained on large-scale unlabeled speech data. The speech embedding generation process consists of three stages: Preprocessing, Feature Extraction, and Sequence Normalization. First, in the Preprocessing stage, the raw input audio waveform is resampled to a 16 kHz sampling rate to conform to the model’s standard input format. Next, during Feature Extraction, the preprocessed audio signal sequentially passes through the CNN encoder and Transformer blocks of the HuBERT model. Through this process, the final hidden state vectors, which reflect the complex acoustic characteristics of the speech, are extracted. Finally, the Sequence Normalization stage is necessary because the sequence length of the embeddings output by HuBERT is variable, depending on the original audio length. To align with the text modality, this length is standardized to a fixed size of 128. For sequences shorter than 128, zero-filled padding vectors are appended for the remainder of the length. Conversely, if a sequence exceeds 128, adaptive average pooling is applied to downsample it to the target length, thereby minimizing information loss. Through this process, all speech data is converted into a fixed-size tensor of (128, 768). This entire process is summarized in Algorithm 2.
Algorithm 2. Speech Embedding using HuBERT
1: procedure GenerateSpeechEmbedding(speech_path)
2:   extractor ← LoadFeatureExtractor(“hubert-base-ls960”)
3:   model ← LoadModel(“hubert-base-ls960”)
4:   audio, sr ← LoadAudio(speech_path, sr = 16,000)
5:   input_features ← extractor(audio, sampling_rate = sr, return_tensors = “pt”)
6:   model_output ← model(input_features)
7:   speech_embedding ← model_output.last_hidden_state
8:   if speech_embedding.shape[1] < 128 then
9:     pad ← 128 − speech_embedding.shape[1]
10:      speech_embedding ← Pad(speech_embedding, (0, 0, 0, pad))
11:  else if speech_embedding.shape[1] > 128 then
12:      speech_embedding ← AdaptiveAvgPool1D(Permute(speech_embedding, (0, 2, 1)), output_size = 128)
13:      speech_embedding ← Permute(speech_embedding, (0, 2, 1))
14:  end if
15:  return speech_embedding
16: end procedure
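The speech pipeline in Algorithm 2 can likewise be sketched with transformers and torchaudio. The checkpoint name facebook/hubert-base-ls960 is an assumption, and adaptive average pooling is used for the downsampling branch, matching Algorithm 2.

import torch
import torch.nn.functional as F
import torchaudio
from transformers import AutoFeatureExtractor, HubertModel

CKPT = "facebook/hubert-base-ls960"                   # assumed public HuBERT base checkpoint
extractor = AutoFeatureExtractor.from_pretrained(CKPT)
model = HubertModel.from_pretrained(CKPT)

def generate_speech_embedding(speech_path):
    waveform, sr = torchaudio.load(speech_path)
    waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)   # mono, 16 kHz
    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        emb = model(**inputs).last_hidden_state       # (1, T, 768); T depends on audio length
    if emb.shape[1] < 128:                            # pad short sequences with zero vectors
        emb = F.pad(emb, (0, 0, 0, 128 - emb.shape[1]))
    elif emb.shape[1] > 128:                          # downsample long sequences to 128 frames
        emb = F.adaptive_avg_pool1d(emb.permute(0, 2, 1), 128).permute(0, 2, 1)
    return emb                                        # (1, 128, 768)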

3.1.3. Graph Embedding

To explicitly model the temporal continuity and structural relationships within utterances, this research introduces graph embedding as a key auxiliary modality. The graph embedding generation process consists of three stages: Temporal Alignment, Graph Construction, and Graph Attention Learning. First, in the Temporal Alignment stage, we leverage Whisper, a large-scale multilingual speech recognition model, to extract the precise start and end times of each token’s utterance from the audio file, measured in seconds. This alignment process maps discrete text tokens to corresponding continuous speech frames. Next, during Graph Construction, we build a bipartite graph. This graph consists of two distinct sets of nodes: one set representing the text token embeddings and the other representing the speech frame embeddings. Based on the temporal information from Whisper, an edge is established between a text node and a speech node if the speech frame’s timestamp falls within the start and end times of that specific text token. Crucially, these initial structural edges are unweighted, representing only the existence of a temporal correspondence. The edge weight is not a predefined scalar; rather, the strength of the connection is a parameter to be learned dynamically by the subsequent attention mechanism. For instance, if the token for the word ‘happy’ was uttered between 1.2 s and 1.5 s in the audio file, direct (unweighted) edges are formed between the ‘happy’ node and all speech feature nodes corresponding to that specific time window. This process constructs a graph that explicitly models the fine-grained temporal alignment between semantic content and acoustic expression, as visualized in Figure 2.
Finally, in the Graph Attention Learning stage, the constructed graph is used as input to a GAT (Graph Attention Network). The GAT is specifically chosen for this task because it computes the “edge weights” dynamically during the learning process. Instead of treating all connected neighbors equally, the GAT employs a self-attention mechanism to assign different importance scores to different neighbors. This allows the model to learn, for example, which speech frames within an utterance window are most indicative of the emotion associated with a specific text token. Through this process, the model obtains a high-dimensional vector that structurally captures the dynamic and weighted interactions between the modalities. This entire process is summarized in Algorithm 3.
Algorithm 3. Graph Embedding Construction
1: procedure GenerateGraphEmbedding (text_embeddings, speech_embeddings, speech_path)
2:   whisper_model ← LoadWhisperModel()
3:   audio ← LoadAndResampleAudio(speech_path, target_sr = 16,000)
4:   token_timestamps ← Transcribe(whisper_model, audio, get_timestamps = True)
5:   speech_frame_times ← GetFrameTimestamps(speech_embeddings)
6:   edge_list ← []
7:   for i, token in enumerate(token_timestamps) do
8:      for j, frame in enumerate(speech_frame_times) do
9:         if frame.time ≥ token.start and frame.time ≤ token.end then
10:           edge_list.append([i, j])
11:        end if
12:      end for
13:   end for
14:   edge_index ← ConvertToEdgeIndex(edge_list)
15:   gat_model ← LoadGATModel()
16:   final_graph_embedding ← gat_model(text_nodes = text_embeddings, speech_nodes = speech_embeddings, edge_index = edge_index)
17:   return final_graph_embedding
18: end procedure
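To clarify how the temporal alignment translates into graph structure and learned edge weights, the sketch below builds the alignment edges from per-token timestamps and applies a simplified single-head attention over them. It is an illustrative stand-in for the GAT layer (a library GAT implementation would typically be used in practice), and all shapes and names are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def build_alignment_edges(token_times, frame_times):
    # token_times: list of (start, end) seconds per text token; frame_times: list of frame timestamps.
    # An edge (i, j) is created when speech frame j falls inside the span of text token i.
    return [(i, j) for i, (start, end) in enumerate(token_times)
                   for j, t in enumerate(frame_times) if start <= t <= end]

class BipartiteGraphAttention(nn.Module):
    # Simplified single-head attention from aligned speech frames to each text token.
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)            # scores one (text node, speech node) pair

    def forward(self, text_nodes, speech_nodes, edges):
        out = text_nodes.clone()
        for i in range(text_nodes.shape[0]):
            nbr = [j for (ti, j) in edges if ti == i]                 # frames aligned to token i
            if not nbr:
                continue
            nbr_feats = speech_nodes[nbr]                             # (num_neighbors, dim)
            pairs = torch.cat([text_nodes[i].expand_as(nbr_feats), nbr_feats], dim=-1)
            alpha = F.softmax(self.score(pairs).squeeze(-1), dim=0)   # learned, dynamic edge weights
            out[i] = text_nodes[i] + (alpha.unsqueeze(-1) * nbr_feats).sum(dim=0)
        return out

edges = build_alignment_edges([(1.2, 1.5)], [1.25, 1.35, 1.45, 2.0])  # the ‘happy’ example above
fused = BipartiteGraphAttention()(torch.randn(1, 768), torch.randn(4, 768), edges)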

3.2. Cross-Attention Transformer MoE

The Cross-Attention Transformer MoE model is designed to precisely analyze nuanced situations where emotional expressions are difficult to interpret using a single modality alone or involve a complex entanglement of multiple meanings. The core of this model lies in its use of the Cross-Attention Transformer architecture to bidirectionally learn the deep interactions between two heterogeneous modalities: text and speech. This approach allows the model to compensate for the uncertainty in one modality’s information by leveraging contextual cues from the other, thereby constructing a more robust and sophisticated emotional representation.
The model’s operations begin with Feature Dimension Reduction. The input text embedding (t) and speech embedding (s) are each passed through an independent 1D convolutional layer. This serves as a bottleneck, compressing the essential features from the high-dimensional original embeddings to enhance the efficiency of subsequent computations and prevent overfitting. This process is defined by Equations (1) and (2), respectively.
$t_b = \mathrm{Dropout}(\mathrm{ReLU}(\mathrm{Conv1d}_{\mathrm{text}}(t)))$ (1)
$s_b = \mathrm{Dropout}(\mathrm{ReLU}(\mathrm{Conv1d}_{\mathrm{speech}}(s)))$ (2)
The two dimensionally reduced embedding vectors serve as inputs to the Multi-Head Cross-Attention operation. First, with the text embedding as the Query and the speech embedding as the Key and Value, the model learns which temporal features in the speech each text token should attend to. Conversely, with the speech embedding as the Query and the text embedding as the Key and Value, the model learns which semantic cues are most relevant to each audio frame. This bidirectional exchange is represented by Equation (3) and Equation (4), respectively. The internal calculation of attention weights follows the Scaled Dot-Product Attention mechanism, as shown in Equation (5).
$T_{attn} = \mathrm{MultiHeadAttention}(Q = t_b,\, K = s_b,\, V = s_b)$ (3)
$S_{attn} = \mathrm{MultiHeadAttention}(Q = s_b,\, K = t_b,\, V = t_b)$ (4)
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$ (5)
The two attention output vectors, generated through bidirectional Cross-Attention, are each subjected to average pooling along the sequence dimension. This process transforms them into fixed-size vectors that summarize the information of the entire sequence, as described by Equation (6). Subsequently, these two vectors are concatenated with the graph embedding (g), generated in the Input Embedding section, to form the final fused vector. This is represented by Equation (7).
$P_t = \frac{1}{L}\sum_{i=1}^{L} T_{attn,i}, \quad P_s = \frac{1}{L}\sum_{i=1}^{L} S_{attn,i}$ (6)
$f_{fused} = [\,P_t\,;\,P_s\,;\,g\,]$ (7)
The final fused vector is then passed to the MoE model. Within the MoE model, a gating network first analyzes the characteristics of the input fused vector to compute a preference score for each of the five pre-defined expert networks. To maximize computational efficiency and the division of roles among the experts, this model adopts a Top-1 Hard Routing approach. That is, only the single expert with the highest score computed by the gating network is selected and activated, while the computations for all other experts are omitted. This expert selection process follows the principles of Equations (8) and (9), with the distinction that the weight for the selected expert is set to 1, while the weights for all others are set to 0.
$\mathrm{MoE}(x) = \sum_{i=1}^{N} g_i(x)\,E_i(x)$ (8)
$\sum_{i=1}^{N} g_i(x) = 1 \quad \text{and} \quad g_i(x) \in [0, 1]$ (9)
The selected expert, which has a two-layer MLP (Multi-Layer Perceptron) architecture, processes the input vector to output the logits for the final emotion classes. Thus, the Cross-Attention Transformer MoE model combines a sophisticated cross-referencing mechanism with an efficient expert selection structure to deeply analyze and classify complex emotional representations. The algorithm for the Cross-Attention Transformer architecture is presented in Algorithm 4.
Algorithm 4. Cross-Attention Transformer
1: procedure CrossAttentionLayer(query_input, key_input, value_input)
2:   query ← LinearProjection(query_input)
3:   key ← LinearProjection(key_input)
4:   value ← LinearProjection(value_input)
5:   Reshape query, key, value to shape (B, H, L, D/H)
6:   attn ← Softmax((query · keyᵀ) / √D)
7:   attention_output ← attn · value
8:   final_output ← Reshape and project attention_output to shape (B, L, D)
9:   return final_output
10: end procedure
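The MoE head attached to the fused vector is not shown in Algorithm 4; a minimal PyTorch sketch of the Top-1 hard-routing head described above is given below. The hidden width of 256 is an assumption; the seven output classes follow the AI-Hub label set.

import torch
import torch.nn as nn

class Top1MoEHead(nn.Module):
    # Gating network plus two-layer MLP experts with Top-1 hard routing (illustrative sketch).
    def __init__(self, in_dim, num_classes=7, num_experts=5, hidden=256):
        super().__init__()
        self.gate = nn.Linear(in_dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))
            for _ in range(num_experts)])

    def forward(self, fused):                               # fused: (batch, in_dim)
        expert_idx = self.gate(fused).argmax(dim=-1)        # select a single expert per sample
        logits = torch.stack([self.experts[int(k)](x)       # only the chosen expert is evaluated
                              for x, k in zip(fused, expert_idx)])
        return logits                                       # (batch, num_classes)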

3.3. Sum-Product Linear MoE

The Sum-Product Linear MoE model is a lightweight architecture optimized for the fast and efficient recognition of cases where emotional expressions are relatively clear and intuitive. This model aims to avoid the high computational complexity of the Cross-Attention Transformer and effectively capture the core interactions between modalities through concise mathematical operations. It was designed under the premise that when emotional signals are distinct, sufficiently high recognition performance can be achieved without complex dynamic interaction modeling.
The model’s operational process begins with Feature Vectorization. The input text and speech embeddings each pass through an independent 1D convolutional layer followed by average pooling along the sequence dimension. This transforms them into single global vectors that compress the information of the entire sequence. This process serves to summarize the core features of each modality into a fixed-size vector.
Based on these two global vectors, element-wise sum and product operations are performed to generate two new types of interaction feature vectors. The sum operation has the effect of linearly combining and mutually reinforcing the features of both modalities, while the product operation captures the non-linear interaction that occurs when specific dimensions of both modalities are strongly activated simultaneously. This interaction feature generation process is defined by Equation (10).
$\mathrm{sum} = t + s, \quad \mathrm{prod} = t \odot s$ (10)
The two generated interaction vectors, the original global vectors, and the graph embedding from the Input Embedding section are all concatenated along the channel dimension to form a single, unified fused vector. This high-dimensional vector then undergoes a Linear Transformation to be adjusted into the final representation vector for subsequent classification. This process is defined by Equation (11).
$f_{fused} = \mathrm{ReLU}(W_f\,[\,t\,;\,s\,;\,\mathrm{sum}\,;\,\mathrm{prod}\,] + b_f)$ (11)
Identical to the Cross-Attention Transformer MoE model, the final fused vector is used as input to the MoE model. The gating network dynamically selects and activates the most suitable single expert (Top-1) based on the characteristics of the input vector. The selected expert then outputs the final emotion classification logits through a two-layer MLP. In this way, the Sum-Product Linear MoE model combines intuitive operations with an efficient expert selection structure to perform fast and accurate predictions for distinct emotional expressions. The algorithm for the Sum-Product Linear architecture is presented in Algorithm 5.
Algorithm 5. Sum-Product Linear
1: procedure SumProductFusion(text_vector, speech_vector)
2:   sum_vector ← text_vector + speech_vector
3:   product_vector ← text_vector⊙speech_vector
4:   concatenated_vector ← Concat(text_vector, speech_vector,
                                                                      sum_vector, product_vector)
5:   fused_vector ← LinearProjection(concatenated_vector)
6:   return fused_vector
7: end procedure
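A compact PyTorch rendering of this path (the feature-vectorization step plus Algorithm 5) might look as follows; the channel sizes are assumptions, and per Section 3.3 the graph embedding g would also be concatenated before the final linear projection.

import torch
import torch.nn as nn

class SumProductFusion(nn.Module):
    # Conv1d + average pooling per modality, then element-wise sum/product interactions (sketch).
    def __init__(self, in_dim=768, hid=256, fused_dim=256):
        super().__init__()
        self.text_conv = nn.Conv1d(in_dim, hid, kernel_size=1)
        self.speech_conv = nn.Conv1d(in_dim, hid, kernel_size=1)
        self.proj = nn.Linear(4 * hid, fused_dim)                     # [t; s; sum; prod] -> fused vector

    def forward(self, text_emb, speech_emb):                          # both: (batch, 128, 768)
        t = self.text_conv(text_emb.transpose(1, 2)).mean(dim=-1)     # global text vector (batch, hid)
        s = self.speech_conv(speech_emb.transpose(1, 2)).mean(dim=-1) # global speech vector
        fused = torch.cat([t, s, t + s, t * s], dim=-1)               # sum and product interactions
        return torch.relu(self.proj(fused))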

3.4. Multimodal Emotion Recognition Model

The final model proposed in this paper performs the ultimate emotion recognition by integrating the two independent yet complementary sub-models described previously: the Cross-Attention Transformer MoE and the Sum-Product Linear MoE. As emotions exist on a spectrum of complexity and clarity, our model maximizes the overall system’s flexibility and robustness by operating two specialized, parallel paths, each dedicated to handling emotions with different characteristics.
To integrate the predictions from the two sub-models, this study adopts an average ensemble method. This approach does not average the final output probabilities from each model. Instead, it averages the raw linear output values before they pass through the Softmax activation function, that is, the logits, to derive a more stable conclusion.
The specific operation of the final model is as follows: Given a single input data point (comprising text, speech, and graph embeddings), both sub-models receive it simultaneously and perform their respective predictions independently. The Cross-Attention Transformer MoE model, through its complex interaction analysis, outputs a logit vector for the emotion classes, denoted as y_c. The Sum-Product Linear MoE model, via its efficient operations, outputs its own logit vector for the emotion classes, denoted as y_s.
The logit vectors produced by the two models are then averaged element-wise. This process can be viewed as a “pre-consensus” stage for the predictions. If both models output a high logit value for a specific emotion class, the averaged value will also remain high. Conversely, if one model has low confidence in its prediction, resulting in a low logit value for a certain class, it will have a mitigating effect on the high value from the other model. This plays a crucial role in enhancing the overall system’s stability by reducing the impact of a potential error from a single model on the final prediction.
Finally, the Softmax activation function is applied to the averaged logit vector to calculate the final emotion probability distribution, where the sum of probabilities for all classes equals 1. The class with the highest probability is then determined as the final predicted emotion. This logit-averaging ensemble process is formally defined by Equation (12).
$y_{res} = \mathrm{Softmax}\!\left(\frac{1}{2}(y_c + y_s)\right)$ (12)
This method is known to be statistically more stable than averaging the final probabilities from each model. It also prevents the potential distortion of the probability distribution that can occur when a specific model makes an overly confident misclassification.
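The ensemble step itself reduces to a few lines; the sketch below assumes the two sub-models already return per-class logit vectors.

import torch

def ensemble_predict(logits_cat, logits_spl):
    # Average the raw logits from the two sub-models, then apply Softmax (Equation (12)).
    probs = torch.softmax((logits_cat + logits_spl) / 2, dim=-1)
    return probs.argmax(dim=-1), probs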

4. Experimental Evaluation

4.1. Experimental Data

The development and core performance evaluation of the model proposed in this study were primarily conducted using a large-scale Korean dataset derived from real-world conversational settings. Subsequently, to validate that the developed model was not overfitted to a specific language or data characteristics—that is, to confirm its generalization performance—additional experiments were conducted using a standard international English benchmark dataset.

4.1.1. AI-HuB Dataset

The primary dataset for this study is the “Dialogue Voice Dataset for Emotion Classification,” constructed and released by the National Information Society Agency (NIA) of Korea [35]. The dataset consists of pairs of high-quality Korean speech files… and their corresponding accurately transcribed text data. The dataset encompasses seven core emotion categories: ‘angry,’ ‘sadness,’ ‘neutral,’ ‘happiness,’ ‘fear,’ ‘disgust,’ and ‘surprise.’
To enhance the stability of model training and the reliability of the results, we applied processes to ensure label consistency and mitigate severe class imbalance. First, for label consistency, the data was annotated by five expert annotators. We recognized that a strict unanimity (5-out-of-5) criterion could filter out naturally complex or subtle emotional expressions. Therefore, we adopted a majority-vote criterion. We retained all utterances where a clear majority (three or more annotators) agreed on the emotion label. Samples that were highly ambiguous, with no majority consensus (e.g., a 2-2-1 split), were excluded from the analysis. We judged that these non-consensus utterances would act as label noise during model training rather than representing unreliable data. Second, to mitigate class imbalance, we first examined the data distribution and found that the ‘sadness’ class was significantly overrepresented. To prevent the model from being biased towards this single dominant class, we performed random down-sampling on the ‘sadness’ class only. It is crucial to note that this was a minimal intervention, not an attempt to artificially balance the dataset. As shown in Table 1, even after this process, the dataset remains significantly imbalanced: ‘sadness’ (24.15%) and ‘angry’ (24.15%) are still the largest classes, far outnumbering classes like ‘surprise’ (2.97%). This approach preserved the natural, imbalanced data distribution while reducing the extreme skew from a single class.

4.1.2. IEMOCAP Dataset

To demonstrate that the performance of the model, developed and optimized using the preceding AI-Hub dataset, represents a generalized capability not confined to the Korean language, the IEMOCAP dataset was additionally utilized in the final validation stage [36]. IEMOCAP is the most widely used English-based, standard benchmark dataset in the field of emotion recognition research. This dataset consists of speech and text data from professional actors performing emotions under specific scenarios. In this validation experiment, the data was refined using criteria similar to those applied to the AI-Hub dataset. After excluding classes with few utterance samples or ambiguous emotional boundaries and removing ultra-short utterances lacking sufficient semantic information, a final set of 5124 utterances across five classes ‘neutral,’ ‘angry,’ ‘sad,’ ‘excited,’ and ‘happy’ was used. The purpose of this final validation experiment is to show that the proposed model operates robustly even in a different linguistic context and under disparate data collection environments.

4.2. Performance Evaluation and Analysis

In this section, we quantitatively evaluate the performance of our proposed model using the AI-Hub dataset introduced in Section 4.1. Through a multi-faceted analysis against various comparative models, we further demonstrate the superiority and structural validity of the proposed architecture.

4.2.1. Experimental Environment and Evaluation Metrics

The model’s performance was comprehensively evaluated using four standard metrics for classification tasks: Accuracy, Precision, Recall, and F1-Score. Accuracy represents the proportion of correctly predicted samples out of the total samples. Precision refers to the proportion of true positives among the samples predicted as a specific emotion, while Recall indicates the proportion of true positives that the model correctly identified out of all actual samples of a specific emotion class. The F1-Score is the harmonic mean of Precision and Recall, representing the balance between these two metrics. The detailed definitions of each metric are provided in Table 2.
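These metrics can be computed directly with scikit-learn; macro averaging is shown here as an assumption, since the averaging scheme is not stated in the text.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def report(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                       average="macro", zero_division=0)
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}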
To ensure the reproducibility of the proposed model, the key hyperparameters used for training are specified in Table 3. The dataset was divided into training and test sets using an 8:2 ratio, stratified by emotion labels. We used the Adam optimization algorithm for both models. For the Cross-Attention Transformer MoE, we additionally applied a OneCycleLR scheduler with a maximum learning rate (max_lr) of 1 × 10⁻⁴ (using a linear annealing strategy). For the Sum-Product Linear MoE, we used Adam with a fixed learning rate of 1 × 10⁻⁴ (without a scheduler). A batch size of 32 was used for both. Training was set for a maximum of 100 epochs, and we employed early stopping to prevent overfitting: training was terminated if the validation accuracy did not improve for 5 consecutive epochs, and the model with the best performance was saved. A dropout rate of 0.1 was applied to both models. Both were configured with 5 expert networks, and the Top-1 Hard Routing strategy was adopted for MoE routing.
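The optimization setup in Table 3 translates roughly into the PyTorch loop below. The model and data are dummy stand-ins so the snippet is self-contained, and the OneCycleLR scheduler applies to the Cross-Attention Transformer MoE only.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(16, 7)                                    # dummy stand-in for the real model
train_loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randint(0, 7, (64,))), batch_size=32)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(            # CAT-MoE only; SPL-MoE uses a fixed LR
    optimizer, max_lr=1e-4, total_steps=100 * len(train_loader), anneal_strategy="linear")
loss_fn = nn.CrossEntropyLoss()

best_acc, patience, wait = 0.0, 5, 0
for epoch in range(100):                                    # at most 100 epochs
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        scheduler.step()
    val_acc = 0.0                                           # placeholder: compute validation accuracy here
    if val_acc > best_acc:
        best_acc, wait = val_acc, 0
        torch.save(model.state_dict(), "best_model.pt")     # keep the best checkpoint
    else:
        wait += 1
        if wait >= patience:                                # early stopping after 5 stagnant epochs
            break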

4.2.2. Comparative Models and Experimental Design

To conduct a multi-faceted analysis of the proposed model’s performance, we established four categories of comparative models: Unimodal Models, Simple Multimodal Models, Existing Transformer-based Models, and our Proposed Models. The Unimodal Models category measures the baseline performance of each modality by independently using information from text (BiLSTM), speech (CNN), and graph (MLP), respectively. The Simple Multimodal Models category evaluates the performance of traditional fusion methods, such as combining text and speech features at an early stage (Early Fusion) or combining the prediction results of individual models at a later stage (Late Fusion). The Existing Transformer-based Models category includes KoHMT and HyFusER, recent Transformer-based multimodal models that have demonstrated superior performance, as comparative benchmarks. The Proposed Models category evaluates the performance of the two sub-models proposed in this paper (the Cross-Attention Transformer MoE and the Sum-Product Linear MoE) and the final ensembled model that combines them. These categories are summarized in Table 4.

4.2.3. Overall Performance Comparison and Analysis

In this section, to demonstrate the performance of our proposed model, we conduct a hierarchical comparative analysis, starting from Unimodal Models and progressing to Simple Multimodal Models and Existing Transformer-based Models. This systematic comparison clearly illustrates how each component of the proposed model contributes to its overall performance. First, to verify the effectiveness of multimodal fusion, we evaluated the individual performance of each modality and the performance of traditional fusion methods. The results are summarized in Table 5. Among the unimodal models, the text-based model (BiLSTM) achieved the highest performance with an accuracy of 0.7540. This suggests that the explicit semantic information of emotions serves as a powerful cue for classification. The speech-based model (CNN) recorded an accuracy of 0.7281, while the model using only graph embeddings (MLP), which represent the temporal relationships between utterances, showed the lowest performance at 0.5637. This indicates that graph information alone is insufficient for specifying emotions, and its value is only realized when combined with the semantic and acoustic information of other modalities. Consequently, it was confirmed that no single modality is sufficient to fully represent the complexity of human emotions. In the simple multimodal models, two contrasting results were observed with traditional fusion methods. The Early Fusion method, which simply concatenates text and speech features at an early stage, achieved an accuracy of 0.7576, showing only a marginal improvement compared to the text-only unimodal model. This demonstrates the difficulty of creating synergy by combining two heterogeneous feature spaces without modeling their deep interactions. In contrast, the Late Fusion (ensemble) method, which averages the independent predictions from each modality, achieved a relatively high accuracy of 0.7913. While this showcases the general strength of ensemble techniques in mutually compensating for the prediction errors of individual models, it has an inherent limitation in that it fails to learn the dynamic interactions between modalities during the model training process.
To compensate for the limitations of these simple architectures, Transformer-based multimodal emotion recognition models have recently been proposed. A representative example is KoHMT, which enhances the interaction between text and speech by combining modality-specific Transformer encoders with a Cross-Modal Attention structure. The KoHMT model achieved an accuracy of 0.7771, demonstrating higher performance than the simple Early Fusion method. Subsequently, HyFusER was designed as an extension of KoHMT’s architecture, introducing a Dual Cross-Modal Attention mechanism to perform information exchange between modalities with greater precision. The HyFusER model recorded an accuracy of 0.7944.
The model proposed in this paper is designed to not only embrace the advantages of the Transformer architecture but also to reflect the time-series context and structural interactions that existing architectures have not considered. We maximized the respective strengths of our sub-models by constructing a parallel architecture: the Cross-Attention Transformer MoE model, which handles the complex relationships between text and speech, and the Sum-Product Linear MoE model, which focuses on learning from emotional clarity. Finally, the performance was enhanced by average-ensembling the two models. Our proposed model achieved an accuracy of 0.8071, recording the highest performance among all comparative models, including the existing Transformer-based ones. Table 6 presents a performance comparison of the Transformer-based models and our proposed model.
To validate the statistical reliability of these results, we conducted 10 independent training and evaluation runs for each of our proposed sub-models and the final ensemble model. The detailed statistics, including the mean, standard deviation (SD), and 95% confidence intervals (CI), are presented in Table 7. The results demonstrate high stability across all models, as indicated by the very low standard deviations (e.g., Ensemble Accuracy SD = 0.0014). Most importantly, these results provide strong statistical validation for our complementary fusion strategy. The 95% confidence interval for the Ensemble model’s accuracy [0.8063, 0.8083] does not overlap with the intervals for either the CAT-MoE model [0.7918, 0.7962] or the SPL-MoE model [0.7825, 0.7877]. This non-overlapping CI statistically confirms that the performance gain from the ensemble is significant and not a result of random chance, empirically proving that the two sub-models contribute synergistically to the final, superior result.
Through this comparison with diverse fusion strategies and architectures, we have demonstrated that the design of our proposed model is not a mere structural extension but a systematic strategy that substantively reflects the complexity and precision of emotion recognition. These results suggest that in the future design of multimodal emotion recognition models, structural fusion and the role definition of each modality can make a decisive contribution to performance enhancement. Figure 3 presents a visual comparison of the accuracies of each model based on these results. The visualization clearly confirms that the performance of the proposed model surpasses that of all comparative models.
The superiority of the proposed model lies not only in its overall performance improvement but also in the differentiated roles and complementary relationship of its sub-models, which specialize in different types of emotions. This observed specialization provides the empirical justification for the dual-strategy hypothesis introduced in Section 1. The following analysis of the confusion matrices and the emotion-wise accuracy confirms precisely how these specialized models synergistically contribute to the final ensemble performance. To analyze this, we visualized the confusion matrix for each of the models—the Cross-Attention Transformer MoE model, the Sum-Product Linear MoE model, and the final ensembled model—and also used a radar chart to provide an at-a-glance comparison of the classification accuracy for each emotion type. Figure 4 visualizes the confusion matrix for the Cross-Attention Transformer MoE model. The results show that its classification performance for ‘fear’ and ‘sadness’ is higher than that of the SPL-MoE model. These two emotions are generally considered complex, as their expressions are often ambiguous and manifest in various forms depending on the context. For instance, ‘fear’ can be subdivided into sub-emotions like anxiety, confusion, and vigilance, while ‘sadness’ has a broad emotional spectrum and great diversity in expression, encompassing feelings such as resignation, depression, and loss. Since these emotions require the precise capture of subtle interactions between text and speech, we conclude that the structural advantages of the Cross-Attention Transformer MoE model, which reflects bidirectional information flow based on its Cross-Attention architecture, operated effectively.
In contrast, the analysis of the confusion matrix for the Sum-Product Linear MoE model in Figure 5 revealed superior performance on ‘anger,’ ‘disgust,’ and ‘happiness.’ These emotions are typically characterized by distinct signals in expressions, intonation, and word choice, making their emotional meaning more intuitive and their mode of expression relatively straightforward. Emotions of this nature are often better suited to linear inference rather than complex time-series computations or deep, multi-layered architectures. Accordingly, the Sum-Product Linear MoE model, concisely built around sum and product operations, demonstrated high accuracy, indicating that this approach is highly effective for recognizing such distinct emotions.
For a more intuitive comparison, the confusion matrix of the proposed ensemble model is presented in Figure 6. These results demonstrate that the strengths of each sub-model are combined in a complementary manner, leading to a balanced improvement in overall performance. For example, the classification of ‘anger,’ where the Cross-Attention Transformer MoE model showed relatively lower performance, was compensated for by the Sum-Product Linear MoE model. Conversely, the classification of ‘fear,’ a relative weakness for the Sum-Product Linear MoE model, was enhanced through the architectural strengths of the Cross-Attention Transformer MoE model.
This trend is also clearly confirmed in the radar chart in Figure 7. In the radar chart, the proposed model (red line) either surpasses or maintains a similar level of performance compared to each sub-model (green for the Cross-Attention Transformer MoE and blue for the Sum-Product Linear MoE) across all emotion types. It showed particularly high and consistent performance for the emotions of ‘fear,’ ‘sadness,’ and ‘happiness.’ However, it is difficult to definitively conclude that these characteristics are discretely handled by specific experts, and this can be considered a direction for future analysis.
This analysis suggests the necessity of differentiating modality integration structures and learning strategies according to the characteristics of emotional expressions, such as their complexity versus their clarity. Furthermore, it demonstrates that the complementary architectural design of our study contributes not only to performance optimization based on emotion type but also to ensuring real-time applicability.
The proposed MoE model employs Top-1 hard routing for efficiency. To validate this architectural choice, we conducted an additional experiment analyzing the trade-off between performance and computational cost based on the routing strategy. To evaluate this performance-efficiency balance, we performed emotion recognition tasks by varying only the routing strategy while maintaining an identical model architecture and hyperparameters. The results are presented in Table 8. As shown, using Top-2 routing yielded a minor accuracy improvement of approximately 0.0042, demonstrating a negligible performance difference compared to the Top-1 method. However, the Top-2 strategy requires activating twice as many experts as Top-1 during inference, which proportionally increases the computational cost. Given that the significant increase in computational overhead far outweighs this marginal performance gain, we concluded that Top-1 hard routing is the optimal choice, striking the most reasonable balance in the performance-efficiency trade-off.
Furthermore, to demonstrate the inherent efficiency and real-time processing capabilities of Top-1 routing, we constructed large-scale MoE models (over 200 million total parameters) by expanding the number of experts to 1000. We then measured their actual inference latency, with the results shown in Table 9. Notably, the Sum-Product Linear MoE model achieved an exceptionally fast inference latency of just 0.69 ms, despite possessing over 206 million total parameters. The Cross-Attention Transformer MoE model also demonstrated real-time capability, processing inferences at 2.08 ms with over 200 million parameters. These results strongly support the core efficiency principle of MoE: the Top-1 routing strategy maintains a low and constant computational cost per input, regardless of the total parameter size of the model.
All preceding experiments were conducted based on the AI-Hub Korean emotion dialogue dataset, with a focus on demonstrating the superiority of the proposed model. However, results optimized for a single language and dataset are insufficient to fully substantiate the model’s generalizability and reliability. Therefore, we conducted an additional validation experiment using the representative English emotion recognition dataset, IEMOCAP, to verify its robustness. This dataset includes both text and speech data, allowing for the construction of a multimodal emotion recognition experimental environment. A key difference in our preprocessing pipeline was the change in the pre-trained language model used for text embedding. Since the Korean-specific KoELECTRA model used in the AI-Hub experiments is inapplicable to English data, it was replaced with the ELECTRA-base model. The speech data processing still utilized the same CNN-based feature extractor, and the graph embedding retained its identical structure. This IEMOCAP-based experiment was designed to ensure consistency in the experimental environment by preserving the core architecture with only minimal necessary modifications.
For the unimodal models, the text-based bidirectional LSTM model recorded an accuracy of 0.5766, the speech-based CNN model an accuracy of 0.6615, and the graph embedding-based MLP model an accuracy of 0.5805, all showing relatively low performance. The simple multimodal models, including both Early and Late Fusion methods, also showed accuracies in the range of 0.67 to 0.70. In contrast, the Cross-Attention Transformer MoE and the Sum-Product Linear MoE achieved accuracies of 0.7768 and 0.7757, respectively. Ultimately, our final proposed model attained the highest performance with an accuracy of 0.7823, demonstrating that it secures both structural consistency and generalization capability. The detailed performance metrics for all experimental models are summarized in Table 10. This table is structured to allow for a comprehensive comparison not only of accuracy but also of Precision, Recall, and F1-Score, playing a crucial role in securing the reliability and objectivity of our experimental conditions.
In this section, the experimental results on the AI-Hub dataset confirmed that the proposed model outperforms simple multimodal fusion methods and existing Transformer-based models. Furthermore, we empirically demonstrated that the architectural design, which exploits the strengths of each sub-model for different emotion types, contributes decisively to the performance gains. The additional validation on IEMOCAP verified that the model generalizes structurally across languages and datasets. Together, these results suggest that the multimodal emotion recognition model based on the proposed MoE-CFT technique is a meaningful alternative for improving the accuracy and universality of emotion recognition.

5. Conclusions

This study proposed a novel multimodal emotion recognition model architecture that primarily uses text and speech, supplemented by graph embeddings generated from the temporal information between utterances. The proposed model consists of two expert sub-models that employ different strategies based on the characteristics of the emotional signals: the Cross-Attention Transformer MoE model and the Sum-Product Linear MoE model. The Cross-Attention Transformer MoE model showed its strength in handling ambiguous or complex signals, while the Sum-Product Linear MoE model demonstrated high efficiency for signals with clear expressions. The two models were trained independently, and their predictions were then integrated using an average ensemble technique, maximizing their complementary performance.
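A minimal sketch of this average-ensemble step is given below; tensor shapes and variable names are illustrative only and do not reflect the authors' code.

```python
# Average-ensemble of the two sub-models' class logits, followed by argmax.
import torch

def ensemble_predict(logits_cross_attn: torch.Tensor,
                     logits_sum_product: torch.Tensor) -> torch.Tensor:
    """Average the class logits of both sub-models and return predicted labels."""
    avg_logits = (logits_cross_attn + logits_sum_product) / 2.0
    return avg_logits.argmax(dim=-1)

# Example with a batch of 4 utterances and 7 emotion classes (illustrative shapes)
preds = ensemble_predict(torch.randn(4, 7), torch.randn(4, 7))
```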
For performance validation, experiments on the Korean AI-Hub emotion dialogue dataset showed that the proposed model achieved state-of-the-art performance, with an accuracy of 0.8071, surpassing existing unimodal and Transformer-based models. Furthermore, it achieved a high accuracy of 0.7823 on the English IEMOCAP dataset, demonstrating that the proposed architecture possesses excellent generalization capabilities that are not confined to a specific language or dataset.
In conclusion, this study presented an innovative fusion strategy that simultaneously considers the expressive characteristics of emotions and the multimodality of data. This enables stable and precise emotion recognition in various application domains, such as affective dialogue systems and AI counseling assistants.
Despite these contributions, this study has several limitations that open avenues for future research. First, the model’s performance, particularly the graph embedding, is dependent on the accuracy of the temporal alignment provided by the Whisper model. Errors or ‘timing noise’ from the ASR model could propagate and affect performance. Second, our data preprocessing, which involved unanimity filtering, could potentially introduce a selection bias by removing “naturally ambiguous” real-world emotions, although this was necessary to reduce label noise for fair architectural comparison. Third, beyond technical limitations, the application of emotion recognition technology itself carries significant ethical implications. While beneficial in domains like healthcare and education, this technology could be misused for invasive emotional surveillance, manipulative advertising, or biased decision-making in areas like hiring or law enforcement. Future development must therefore be paired with robust ethical guidelines, transparency, and privacy-preserving frameworks to ensure responsible deployment.
Future work will focus on addressing these limitations. Beyond the simple integration of visual information, we plan to incorporate broader contextual modalities, such as speaker identity and dialogue history, to better understand context-dependent emotions. Furthermore, we will focus on enhancing the model’s robustness, specifically by testing its performance on unprocessed, “noisy” real-world datasets and exploring methods to reduce dependency on precise temporal alignment. Finally, we intend to investigate more sophisticated and learnable routing mechanisms for the MoE architecture to more flexibly allocate resources based on signal complexity, moving beyond the current ‘clear’ vs. ‘complex’ division.

Author Contributions

Conceptualization, M.-H.Y.; methodology, M.-H.Y. and K.-C.K.; software, M.-H.Y.; validation, J.-H.S.; resources, M.-H.Y. and K.-C.K.; writing—original draft preparation, M.-H.Y.; writing—review and editing, M.-H.Y., K.-C.K. and J.-H.S.; supervision, J.-H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2023R1A2C1006149).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The AI-Hub and IEMOCAP datasets used in this study are publicly available resources as cited in [35,36]. Furthermore, all source code used for data preprocessing, model implementation, and analysis to ensure reproducibility is publicly available at: https://github.com/moungho/moung.git (accessed on 1 September 2025).

Acknowledgments

The authors would like to acknowledge and thank all reviewers for their constructive and helpful reviews.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xie, Z.; Guan, L. Multimodal information fusion of audiovisual emotion recognition using novel information theoretic tools. In Proceedings of the 2013 IEEE International Conference on Multimedia and Expo (ICME), San Jose, CA, USA, 15–19 July 2013; IEEE: New York, NY, USA, 2013; pp. 1–6. [Google Scholar]
  2. Busso, C.; Bulut, M.; Narayanan, S. Toward effective automatic recognition systems of emotion in speech. In Social Emotions in Nature and Artifact: Emotions in Human and Human-Computer Interaction; Oxford University Press: New York, NY, USA, 2013; pp. 110–127. [Google Scholar]
  3. Kim, T.; Vossen, P. EmoBERTa: Speaker-aware emotion recognition in conversation with RoBERTa. arXiv 2021, arXiv:2108.12009. [Google Scholar]
  4. Soleymani, M.; Garcia, D.; Jou, B.; Schuller, B.; Chang, S.-F.; Pantic, M. A survey of multimodal sentiment analysis. Image Vis. Comput. 2017, 65, 3–14. [Google Scholar] [CrossRef]
  5. Bharti, S.K.; Kumar, P.; Kumari, K.; Rashid, M.; Faiyaz, M.; Kumar, V.; Kumar, R. Text-based emotion recognition using deep learning approach. Comput. Intell. Neurosci. 2022, 2022, 9523756. [Google Scholar] [CrossRef]
  6. Kim, S.; Lee, S.-P. A BiLSTM–transformer and 2D CNN architecture for emotion recognition from speech. Electronics 2023, 12, 4034. [Google Scholar] [CrossRef]
  7. Sabbeh, S.F.; Fasihuddin, H.A. A comparative analysis of word embedding and deep learning for Arabic sentiment classification. Electronics 2023, 12, 1425. [Google Scholar] [CrossRef]
  8. Mutinda, J.; Mwangi, W.; Okeyo, G. Sentiment analysis of text reviews using lexicon-enhanced Bert embedding (LeBERT) model with convolutional neural network. Appl. Sci. 2023, 13, 1445. [Google Scholar] [CrossRef]
  9. Li, H.; Ma, Y.; Ma, Z.; Zhu, H. Weibo text sentiment analysis based on BERT and deep learning. Appl. Sci. 2021, 11, 10774. [Google Scholar] [CrossRef]
  10. Reggiswarashari, F.; Sihwi, S.W. Speech emotion recognition using 2D-convolutional neural network. Int. J. Electr. Comput. Eng. 2022, 12, 6594–6601. [Google Scholar] [CrossRef]
  11. Hazra, S.K.; Ema, R.R.; Galib, S.M.; Kabir, S.; Adnan, N. Emotion recognition of human speech using deep learning method and MFCC features. Spec. Syst. Data Process. 2022, 4, 161–172. [Google Scholar] [CrossRef]
  12. Alvarez-Gonzalez, N.; Kaltenbrunner, A.; Gómez, V. Uncovering the limits of text-based emotion detection. arXiv 2021, arXiv:2109.01900. [Google Scholar] [CrossRef]
  13. George, S.M.; Ilyas, P.M. A review on speech emotion recognition: A survey, recent advances, challenges, and the influence of noise. Neurocomputing 2024, 568, 127015. [Google Scholar] [CrossRef]
  14. Ren, Z.; Ortega, J.; Wang, Y.; Chen, Z.; Guo, Y.; Yu, S.X.; Whitney, D. VEATIC: Video-based emotion and affect tracking in context dataset. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2024; pp. 4467–4477. [Google Scholar]
  15. Kim, Y.-J.; Roh, K.; Chae, D. Feature-based emotion recognition model using multimodal data. In Proceedings of the 2023 Korean Computer Congress (KCC), Seoul, Republic of Korea, 21–23 June 2023; Korean Institute of Information Scientists and Engineers: Seoul, Republic of Korea, 2023; pp. 2169–2171. [Google Scholar]
  16. Park, H. Enhancement of multimodal emotion recognition classification model through weighted average ensemble of KoBART and CNN models. In Proceedings of the 2023 Korean Computer Congress (KCC), Seoul, Republic of Korea, 21–23 June 2023; Korean Institute of Information Scientists and Engineers: Seoul, Republic of Korea, 2023; pp. 2157–2159. [Google Scholar]
  17. Gladys, A.A.; Vetriselvi, V. Survey on multimodal approaches to emotion recognition. Neurocomputing 2023, 556, 126693. [Google Scholar] [CrossRef]
  18. Lian, H.; Lu, C.; Li, S.; Zhao, Y.; Tang, C.; Zong, Y. A survey of deep learning-based multimodal emotion recognition: Speech, text, and face. Entropy 2023, 25, 1440. [Google Scholar] [CrossRef] [PubMed]
  19. Ramaswamy, M.P.A.; Palaniswamy, S. Multimodal emotion recognition: A comprehensive review, trends, and challenges. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2024, 14, e1563. [Google Scholar] [CrossRef]
  20. Shou, Y.; Meng, T.; Ai, W.; Yin, N.; Li, K. A comprehensive survey on multi-modal conversational emotion recognition with deep learning. arXiv 2023, arXiv:2312.05735. [Google Scholar] [CrossRef]
  21. Wang, Y.; Gu, Y.; Yin, Y.; Han, Y.; Zhang, H.; Wang, S.; Li, C.; Quan, D. Multimodal transformer augmented fusion for speech emotion recognition. Front. Neurorobot. 2023, 17, 1205391. [Google Scholar] [CrossRef]
  22. Russell, J.A. A Circumplex Model of Affect. J. Personal. Soc. Psychol. 1980, 39, 1161–1178. [Google Scholar]
  23. Wang, X.; Kou, L.; Sugumaran, V.; Luo, X.; Zhang, H. Emotion correlation mining through deep learning models on natural language text. arXiv 2020, arXiv:2007.14071. [Google Scholar] [CrossRef]
  24. Habib, M.A.; Akhand, M.A.H.; Kamal, M.A.S. Emotion recognition from microblog managing emoticon with text and classifying using 1D CNN. arXiv 2023, arXiv:2301.02971. [Google Scholar] [CrossRef]
  25. Jung, B.; Cheun, S.; Kim, Y.; Kim, S. An emotion recognition technique using speech signals. J. Korean Inst. Intell. Syst. 2008, 18, 494–500. [Google Scholar] [CrossRef]
  26. Zou, H.; Si, Y.; Chen, C.; Rajan, D.; Chng, E.S. Speech emotion recognition with co-attention based multi-level acoustic information. arXiv 2022, arXiv:2203.15326. [Google Scholar]
  27. Jo, A.-H.; Kwak, K.-C. Feature level fusion based on canonical correlation analysis for Korean speech emotion recognition. In Proceedings of the 2023 KIIT Summer Conference, Jeju, Republic of Korea, 21–23 June 2023; pp. 244–248. [Google Scholar]
  28. Khan, M.; Tran, P.-N.; Pham, N.T.; El Saddik, A.; Othmani, A. MemoCMT: Multimodal emotion recognition using cross-modal transformer-based feature fusion. Sci. Rep. 2025, 15, 5473. [Google Scholar] [CrossRef] [PubMed]
  29. Yi, M.-H.; Kwak, K.-C.; Shin, J.-H. HyFusER: Hybrid multimodal transformer for emotion recognition using dual cross modal attention. Appl. Sci. 2025, 15, 1053. [Google Scholar] [CrossRef]
  30. Ai, W.; Shou, Y.; Meng, T.; Yin, N.; Li, K. DER-GCN: Dialogue and event relation-aware graph convolutional neural network for multimodal dialogue emotion recognition. arXiv 2024, arXiv:2312.10579. [Google Scholar] [CrossRef]
  31. Li, J.; Wang, X.; Liu, Y.; Zeng, Z. CFN-ESA: A cross-modal fusion network with emotion-shift awareness for dialogue emotion recognition. IEEE Trans. Affect. Comput. 2024, 15, 1919–1933. [Google Scholar] [CrossRef]
  32. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv 2017, arXiv:1701.06538. [Google Scholar]
  33. Fedus, W.; Zoph, B.; Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv 2021, arXiv:2101.03961. [Google Scholar]
  34. Goyal, A.; Kumar, N.; Guha, T.; Narayanan, S.S. A multimodal mixture-of-experts model for dynamic emotion prediction in movies. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; IEEE: New York, NY, USA, 2016; pp. 2825–2829. [Google Scholar]
  35. Dialogue Voice Dataset for Emotion Classification. Available online: https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=263 (accessed on 31 August 2025).
  36. IEMOCAP Database. Available online: https://sail.usc.edu/iemocap/ (accessed on 13 May 2025).
Figure 1. Overall architecture of the proposed framework. Text, Speech, and Graph embeddings are processed in parallel by two complementary sub-models. Their final logits are combined via an average ensemble to produce the emotion prediction.
Figure 2. Visualization of temporal alignment between text tokens (top) and speech frames (bottom). The gray lines indicate the correspondence between a token and the specific audio frames that fall within its utterance timestamp.
Figure 3. Comparison of model accuracy based on the AI-Hub dataset. The proposed ensemble model (I) achieves the highest accuracy (0.8071), outperforming all unimodal (A, B, C) and existing Transformer-based models (E, F), and other compared methods (D, G, H), which validates the complementary fusion strategy.
Figure 4. Cross-Attention Transformer MoE confusion matrix. This model demonstrates strong performance on complex emotions.
Figure 5. Sum-Product Linear MoE confusion matrix. This model demonstrates strong performance on clear emotions.
Figure 6. Proposed model confusion matrix. The ensemble leverages the strengths of both sub-models to provide a balanced performance.
Figure 7. Emotion-wise model accuracy. This chart empirically validates the complementary hypothesis. The Cross-Attention Transformer MoE excels at ‘fear’, while the Sum-Product Linear MoE excels at ‘anger’, and the final ensemble effectively integrates these strengths to achieve the best overall performance.
Table 1. AI-Hub Dataset Emotion Distribution.
Emotion Class | Data Count (%)
Anger | 7333 (24.15)
Sadness | 7333 (24.15)
Neutral | 6261 (20.62)
Happiness | 3601 (11.86)
Fear | 2588 (8.52)
Disgust | 2339 (7.70)
Surprise | 903 (2.97)
Total | 30,358 (100)
Table 2. Performance Evaluation Indicators.
Indicator | Formula | Description
Accuracy | (TP + TN) / (TP + TN + FP + FN) | Proportion of correct predictions
Precision | TP / (TP + FP) | Proportion of true positives among positive predictions
Recall | TP / (TP + FN) | Proportion of true positives among actual positives
F1-Score | 2 · Precision · Recall / (Precision + Recall) | Harmonic mean of Precision and Recall
Table 3. Key Hyperparameters of the Proposed Model.
Hyperparameter | Cross-Attention Transformer | Sum-Product Linear MoE
Optimization Algorithm | Adam | Adam
LR Scheduler | OneCycleLR | None
Learning Rate | 1 × 10−4 (max) | 1 × 10−4 (fixed)
Dropout | 0.1 | 0.1
Batch Size | 32 | 32
Max Epochs | 100 | 100
Early Stopping | Patience = 5 | Patience = 5
Number of Experts | 5 | 5
Table 4. Experimental Comparison Model Configuration.
Category | Modality | Model
Unimodal Models | Text | Bidirectional LSTM
Unimodal Models | Speech | CNN
Unimodal Models | Graph Embedding | MLP
Simple Multimodal Models | Text + Speech | Early Fusion, Late Fusion
Existing Transformer-based Models | Text + Speech | KoHMT, HyFusER
Proposed Models | Text + Speech + Graph Embedding | Cross-Attention Transformer and Sum-Product Linear Late Fusion
Table 5. Unimodal and Simple Multimodal Model Performance Comparison.
Modal | Fusion Method | Accuracy | Precision | Recall | F1-Score
Unimodal | Text (BiLSTM) | 0.7540 | 0.7558 | 0.7540 | 0.7541
Unimodal | Speech (CNN) | 0.7281 | 0.7372 | 0.7281 | 0.7302
Unimodal | Graph (MLP) | 0.5637 | 0.5750 | 0.5637 | 0.5539
Multimodal | Early Fusion | 0.7576 | 0.7615 | 0.7576 | 0.7580
Multimodal | Late Fusion | 0.7913 | 0.8011 | 0.7913 | 0.7938
Table 6. Performance Comparison of Transformer-Based Models and the Proposed Model.
Model | Accuracy | Precision | Recall | F1-Score
KoHMT | 0.7771 | 0.7797 | 0.7771 | 0.7778
HyFusER | 0.7944 | 0.7937 | 0.7944 | 0.7936
Cross-Attention Transformer MoE | 0.7940 | 0.7961 | 0.7804 | 0.7865
Sum-Product Linear MoE | 0.7851 | 0.7918 | 0.7624 | 0.7721
Proposed model | 0.8073 | 0.8139 | 0.7898 | 0.7997
Table 7. Statistical validation of the proposed models over 10 independent runs.
Model | Metric | Mean | SD | 95% CI
Cross-Attention Transformer MoE | Accuracy | 0.7940 | 0.0031 | [0.7918, 0.7962]
Cross-Attention Transformer MoE | F1-Score | 0.7865 | 0.0025 | [0.7847, 0.7883]
Sum-Product Linear MoE | Accuracy | 0.7851 | 0.0036 | [0.7825, 0.7877]
Sum-Product Linear MoE | F1-Score | 0.7721 | 0.0038 | [0.7694, 0.7748]
Proposed Model | Accuracy | 0.8073 | 0.0014 | [0.8063, 0.8083]
Proposed Model | F1-Score | 0.7997 | 0.0022 | [0.7982, 0.8013]
Table 8. Performance Comparison by Routing Strategy.
Routing Strategy | Accuracy | Precision | Recall | F1-Score
Top-1 | 0.8071 | 0.8064 | 0.7953 | 0.7997
Top-2 | 0.8113 | 0.8122 | 0.7985 | 0.8043
Table 9. Real-time Processing Performance of Large-Scale MoE Models.
Model | Number of Experts | Number of Parameters | Inference Speed (ms)
Cross-Attention Transformer MoE | 1000 | 210,892,911 | 2.08
Sum-Product Linear MoE | 1000 | 206,525,935 | 0.69
Table 10. IEMOCAP-based experimental model performance comparison.
Model | Accuracy | Precision | Recall | F1-Score
Text (BiLSTM) | 0.5766 | 0.5240 | 0.5766 | 0.5450
Speech (CNN) | 0.6615 | 0.6521 | 0.6615 | 0.6481
Graph (MLP) | 0.5850 | 0.5263 | 0.5850 | 0.5521
Early Fusion | 0.6859 | 0.6746 | 0.6859 | 0.6768
Late Fusion | 0.6702 | 0.6657 | 0.6702 | 0.6652
Cross-Attention Transformer MoE | 0.7768 | 0.7773 | 0.7880 | 0.7803
Sum-Product Linear MoE | 0.7757 | 0.7802 | 0.7773 | 0.7766
Proposed model | 0.7823 | 0.7832 | 0.7870 | 0.7843
