Dual Routing Mixture-of-Experts for Multi-Scale Representation Learning in Multimodal Emotion Recognition

Chae, Da-Eun; Lee, Seok-Pil

doi:10.3390/electronics14244972


by Da-Eun Chae ¹ and Seok-Pil Lee ²,*

¹ Department of Computer Science, Graduate School, Sangmyung University, Seoul 03016, Republic of Korea
² Department of Computer Science, Sangmyung University, Seoul 03016, Republic of Korea
* Author to whom correspondence should be addressed.

Electronics 2025, 14(24), 4972; https://doi.org/10.3390/electronics14244972

Submission received: 6 November 2025 / Revised: 11 December 2025 / Accepted: 17 December 2025 / Published: 18 December 2025

(This article belongs to the Special Issue Artificial Intelligence and Advanced Signal Processing Techniques and Their Applications)


Abstract

Multimodal emotion recognition (MER) often relies on single-scale representations that fail to capture the hierarchical structure of emotional signals. This paper proposes a Dual Routing Mixture-of-Experts (MoE) model that dynamically selects between local (fine-grained) and global (contextual) representations extracted from speech and text encoders. The framework first obtains local–global embeddings using WavLM and RoBERTa, then employs a scale-aware routing mechanism to activate the most informative expert before bidirectional cross-attention fusion. Experiments on the IEMOCAP dataset show that the proposed model achieves stable performance across all folds, reaching an average unweighted accuracy (UA) of 75.27% and weighted accuracy (WA) of 74.09%. The model consistently outperforms single-scale baselines and simple concatenation methods, confirming the importance of dynamic multi-scale cue selection. Ablation studies highlight that neither local-only nor global-only representations are sufficient, while routing behavior analysis reveals emotion-dependent scale preferences—such as strong reliance on local acoustic cues for anger and global contextual cues for low-arousal emotions. These findings demonstrate that emotional expressions are inherently multi-scale and that scale-aware expert activation provides a principled approach beyond conventional single-scale fusion.

Keywords: speech emotion recognition; multimodal learning; self-supervised learning; mixture-of-experts; dual routing; cross-attention; hierarchical representation; IEMOCAP

1. Introduction

Speech Emotion Recognition (SER) plays a crucial role in various applications—including human–computer interaction, virtual assistants, education, healthcare, and emotion-aware interfaces—because it enables systems to understand a speaker’s internal emotional state and support more natural interactions [1,2,3]. Recently, multimodal emotion recognition (MER), which jointly utilizes both speech and text information, has gained significant attention due to its ability to more accurately capture the complex characteristics of emotional expression [4]. Acoustic elements such as intonation, stress, and speaking rate, together with the semantic content of text, provide complementary information, and numerous models have been proposed to integrate these heterogeneous sources to infer emotional states. In particular, with the introduction of Transformer-based architectures and self-supervised learning (SSL) models such as Wav2Vec 2.0 [5], HuBERT [6], and WavLM [7], MER performance has steadily improved [8]. However, due to the complexity of emotional signals and variations between utterances, several challenges remain.

Recent multimodal emotion recognition models such as MuLT [9], MISA [10], and MFDR [11] have achieved performance gains by adopting cross-modal attention or modality-specific representation learning to effectively capture interactions between modalities [12,13]. Nevertheless, these state-of-the-art models primarily rely on single-scale representations at the utterance level, which limits their capacity to reflect the diverse spatiotemporal characteristics of emotional cues. For example, in text, certain words may convey strong emotional signals within a short span, whereas in speech, the overall prosodic pattern across the entire utterance may determine the emotional flow. Despite this, prior studies have struggled to simultaneously leverage fine-grained temporal cues and utterance-level prosody, and simple utterance-level fusion approaches remain insufficient for capturing such subtle signals [14]. Consequently, existing models face inherent limitations in fully representing the multi-scale emotional structure of expressive speech.

Emotional signals possess a layered structure that cannot be fully captured by a single-level representation. In text, local semantic cues—such as emotionally charged words, emphasized expressions, or subtle shifts in polarity—play a central role, whereas in speech, global prosody, such as overall intonation flow or speaking style, substantially contributes to emotional interpretation. Meanwhile, speech-based studies have reported that specific short segments within an utterance exert a stronger influence on emotion classification, highlighting the importance of local acoustic cues [15]. Multi-scale acoustic modeling, which integrates information across segments of varying lengths, has been shown to improve SER performance [16,17]. In addition, segment-level pooling and SSL-based local representations have proven effective for capturing the fine-grained structure of emotional signals [18]. More recently, multi-scale acoustic representations have been experimentally demonstrated to enhance the stability and generalization of emotion recognition models [19]. Therefore, relying on a single temporal scale risks missing important emotional cues depending on the utterance type or speaker characteristics, and incorporating multi-scale feature extraction that combines local–global granularity is essential for developing more robust and generalizable MER systems.

However, simply merging multi-scale representations or integrating them through cross-attention is insufficient for actively selecting between local and global information, whose relative importance varies depending on modality and speaking context. Emotional signals change dynamically with factors such as utterance type, speaker style, and emotional intensity, making it necessary for the model to automatically adjust its focus according to input characteristics. Mixture-of-Experts (MoE)–based routing methods are well aligned with these requirements because they provide an input-dependent expert selection mechanism. The effectiveness of such routing has been demonstrated in models like Switch Transformer and GShard [20,21], and recent efforts such as ST-MoE have further strengthened routing stability [22], drawing significant attention to MoE-based selection mechanisms across different domains. Furthermore, MoEfication research has shown that MoE structures promote representation specialization depending on input characteristics, enhancing adaptability [23]. In speech-based emotion recognition, MoE structures have also been reported to effectively integrate SSL features with spectral cues [24]. Although FuseMoE adopted an MoE structure to achieve fleximodal fusion in multimodal settings, it did not consider scale-aware expert selection between local and global experts [25]. Similarly, the recent MER model CAG-MoE combines cross-attention with MoE but still does not provide scale-aware expert selection [26]. To address these limitations, this study introduces a Dual Routing MoE framework that adaptively selects local or global representations according to contextual conditions, enabling more fine-grained utilization of multi-scale information in MER.

The main contributions of this study are as follows.

We construct a multi-scale representation framework that simultaneously extracts local cues and global prosody from speech and text, effectively reflecting the spatiotemporal diversity of emotional signals.
To address the varying importance of local/global representations depending on utterance context and modality characteristics, we propose a Dual Routing MoE structure that selects experts at each scale based on input properties. This enables the model to dynamically adjust its focus across scales and perform more precise information selection.
To integrate the multi-scale information selected by the proposed multi-scale extraction and Dual Routing MoE in a complementary manner across modalities, we apply Bidirectional Cross-Attention, allowing the local/global cues of each modality to naturally interact with the representation of the other. This ensures that fine-grained cues and global patterns are effectively conveyed during multimodal interaction, enabling a balanced representation of both detailed and holistic emotional structures.

2. Related Work

2.1. Speech & Multimodal Emotion Recognition

Early research in SER primarily relied on traditional acoustic features such as MFCCs combined with classical classifiers including SVM, LDA, and hierarchical decision trees [27,28,29,30]. With the emergence of deep learning, SER performance improved substantially through CNN-, RNN-, and attention-based architectures, which enhanced the quality of feature representations. More recently, SSL models such as Wav2Vec 2.0, HuBERT, and WavLM have further advanced SER by providing robust contextualized representations [31].

However, speech alone is often insufficient for fully reflecting the semantic aspects of emotion. As a result, MER, which jointly leverages the semantic cues of text and the prosodic cues of speech, has become an important research direction. By combining the complementary properties of the two modalities, MER offers higher stability and generalization than unimodal approaches, and has been widely studied using large-scale conversational datasets such as IEMOCAP [32] and MELD [33].

2.2. Multimodal Fusion Approaches

The core challenge in multimodal emotion recognition is the effective integration of information across different modalities. Representation-level fusion has improved MER performance by separating modality-specific representations or learning modality alignment and cross-modal dependencies. For example, MuLT learns inter-modal relationships without explicit temporal alignment using a cross-modal Transformer [9], while MISA separates modality-invariant and modality-specific representations to obtain refined emotional features [10].

However, these models primarily rely on single-scale representations and therefore struggle to simultaneously capture the momentary emotional signals (local cues) in text and the overall prosodic patterns (global cues) in speech. Attention-based approaches—such as cross-attention and bidirectional attention—can model modal interactions more precisely [34], but they do not selectively emphasize local cues or explicitly differentiate between temporal scales. In other words, existing fusion models effectively capture inter-modal dependencies but remain structurally limited in leveraging both fine-grained and global cues through a multi-scale framework.

2.3. Mixture-of-Experts for Representation Selection

MoE provides a selective routing mechanism that dynamically chooses experts based on input characteristics, making it effective for enhancing representation specialization and scalability. Sparsely-Gated MoE [35], Switch Transformer [20], Expert Choice Routing [36] and ST-MoE [22] have demonstrated the efficiency of such routing mechanisms across diverse input spaces. MoEfication [23] further emphasized the advantages of selective learning by analyzing how MoE structures autonomously specialize experts according to input distributions.

In the emotion recognition domain, studies have applied MoE architectures to integrate SSL-based speech features with spectral cues [24]. In MER, models such as CAG-MoE, which combines cross-attention with MoE [26], FuseMoE for fleximodal fusion [25], and the multimodal contrastive MoE structure LIMoE [37], have been proposed. However, these approaches largely remain at the level of modality-aware or input-dependent expert selection and do not provide scale-aware routing that distinguishes between local and global temporal cues in emotional signals. Thus, while existing MoE structures are effective for modality mixture, they are not sufficiently extended to address the MER-specific challenge of multi-scale cue selection.

3. Method

In this study, we propose a multimodal emotion recognition model that extracts local–global representations from speech (WavLM) and text (RoBERTa [38]), selects the most suitable scale-specific representation for each modality through a Dual Routing MoE, and subsequently integrates the complementary information of the two modalities using bidirectional Cross-Attention.

Figure 1 provides an overview of the overall architecture.

The global representation $h^{g}$ extracted from each modality reflects the utterance-level semantic and prosodic structure, while the local representation $h^{l}$ captures fine-grained emotional cues at specific temporal segments or token-level units. These two representations are then fed into the Dual Routing MoE, where either the local or global expert is selectively activated depending on modality and utterance characteristics. The final modality-specific representations generated through routing, $z^{a}$ and $z^{t}$, are transformed into mutually conditioned representations via bidirectional Cross-Attention, and their combined output is fed into the classifier.

3.1. Audio Encoder

The speech signal is resampled to 16 kHz and then fed into WavLM-Base to obtain frame-level hidden representations $s_{1}, \dots, s_{T}$, where $T$ is the number of frames.

The global embedding $h_{a}^{g}$, which reflects the overall information of the entire utterance, is computed by applying mean pooling over all frames:

$$h_{a}^{g} = \frac{1}{T} \sum_{t=1}^{T} s_{t} \tag{1}$$

Meanwhile, the local embedding is computed using a VAD-based Multi-Head Self-Attention (MHSA) pooling structure. WebRTC VAD is first applied to remove non-speech intervals that are emotionally irrelevant, and a valid speech frame set $S = \{t \mid \mathrm{VAD}(t) = 1\}$ is selected.

A single MHSA layer is then applied to the selected frames to obtain contextualized representations $\tilde{s}_{t}$ that reflect interactions across segments:

$$\tilde{s}_{1:|S|} = \mathrm{MHSA}(s_{1:|S|}) \tag{2}$$

The MHSA output contains segment-level emotional cues, and the local embedding $h_{a}^{l}$ is computed by applying mean pooling:

$$h_{a}^{l} = \frac{1}{|S|} \sum_{t \in S} \tilde{s}_{t} \tag{3}$$

This SAP-like structure is designed with reference to multi-scale contextual representation learning methods proposed in recent multimodal emotion recognition studies [18]. The process of selecting speech segments and applying MHSA to capture inter-segment interactions inherits the structural advantages of existing work. By stably preserving both the prosodic cues and segment-level fine-grained information captured by WavLM, this design naturally normalizes the scale difference between local and global representations at the encoder level.
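The pooling steps in Equations (1)–(3) can be illustrated with a small dependency-free sketch. It is not the authors' implementation: the learned multi-head layer is replaced here by a single-head scaled dot-product self-attention with identity projections (an assumption made to keep the example short), and the frame vectors and VAD mask are toy values.

```python
import math

def mean_pool(frames):
    """Average a list of equal-length vectors (used in Eq. 1 and Eq. 3)."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames):
    """Single-head self-attention with identity Q/K/V projections.

    ASSUMPTION: a simplified stand-in for the learned MHSA layer of Eq. (2).
    """
    dim = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, frames)) for d in range(dim)])
    return out

def audio_embeddings(frames, vad_mask):
    """Return (h_global, h_local) for one utterance."""
    h_global = mean_pool(frames)                               # Eq. (1): all T frames
    speech = [f for f, keep in zip(frames, vad_mask) if keep]  # S = {t | VAD(t) = 1}
    h_local = mean_pool(self_attention(speech))                # Eq. (2) then Eq. (3)
    return h_global, h_local

# Toy utterance: 4 frames of dimension 2; the third frame is flagged as non-speech.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
vad_mask = [1, 1, 0, 1]
h_g, h_l = audio_embeddings(frames, vad_mask)
```

In this sketch the global vector averages every frame, including the one VAD rejects, whereas the local vector is built only from the speech frames after they have attended to one another.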

3.2. Text Encoder

Text input is processed through RoBERTa-Base to obtain token embeddings $\{u_{1}, \dots, u_{N}\}$. The global embedding is constructed by concatenating the CLS embeddings extracted from the last four encoder layers:

$$h_{t}^{g} = [c^{(L-3)}; c^{(L-2)}; c^{(L-1)}; c^{(L)}] \tag{4}$$

The local embedding is computed using an SAP-like approach that mirrors the structure used for audio while being adapted to the characteristics of text. PAD, CLS, and SEP tokens are removed, and a set of content tokens $\{u_{i} \mid i \in C\}$ is selected.

A single Multi-Head Self-Attention (MHSA) layer is then applied to these content tokens:

$$\tilde{u}_{i} = \mathrm{MHSA}(u_{i} \mid i \in C) \tag{5}$$

The MHSA output is a contextualized token representation that captures interactions among content words, and the text local embedding is computed through mean pooling:

$$h_{t}^{l} = \frac{1}{|C|} \sum_{i \in C} \tilde{u}_{i} \tag{6}$$

The local embedding of text reflects the same multi-scale design philosophy, applying a SAP-like contextual pooling structure tailored to textual representations. This naturally aligns the local–global scales of audio and text at the encoder stage. By maintaining a local–global structure similar to audio SAP, this design more effectively captures token-level semantic information and contextual interactions within the text modality.
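As a small illustration of Equation (4) and of the content-token selection, the sketch below concatenates the CLS vectors of the last four layers and builds the index set $C$ by dropping special tokens. The function names, the toy layer count, and the 2-dimensional vectors are hypothetical; the special-token strings are the ones used by the RoBERTa tokenizer. The attention and pooling steps of Equations (5) and (6) follow the same pattern as for audio and are omitted here.

```python
# RoBERTa's CLS, SEP and PAD tokens are written "<s>", "</s>" and "<pad>".
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}

def global_text_embedding(cls_per_layer):
    """Concatenate the CLS vectors of the last four encoder layers (Eq. 4)."""
    last_four = cls_per_layer[-4:]
    return [x for layer_vec in last_four for x in layer_vec]

def content_token_indices(tokens):
    """Index set C: positions whose token is not PAD, CLS or SEP."""
    return [i for i, tok in enumerate(tokens) if tok not in SPECIAL_TOKENS]

# Toy input: 6 layers with 2-dimensional CLS vectors
# (RoBERTa-Base itself has 12 layers with 768-dimensional hidden states).
cls_per_layer = [[float(layer), float(layer) + 0.5] for layer in range(1, 7)]
tokens = ["<s>", "I", "hate", "this", "</s>", "<pad>", "<pad>"]

h_t_global = global_text_embedding(cls_per_layer)
C = content_token_indices(tokens)
```

Because four layer outputs are concatenated, the vector in Equation (4) has four times the encoder's hidden size.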

3.3. Dual Routing MoE

As shown in Figure 2, the global and local embeddings of each modality are concatenated into the multi-scale embedding $[h^{g}; h^{l}]$, which is used as the input to the Gate MLP. The Gate MLP determines the selection ratio between the local expert and the global expert through routing scores.

The output of the Gate MLP is defined as follows:

$$g = W_{2}\, \sigma(W_{1} [h^{g}; h^{l}]) \tag{7}$$

where $\sigma$ denotes the GELU activation.

The routing probabilities are computed through a softmax operation:

$$\alpha = \mathrm{softmax}(g) = [\alpha_{\mathrm{local}}, \alpha_{\mathrm{global}}] \tag{8}$$

The transformations of the local and global experts are defined as independent linear mappings:

$$z_{\mathrm{local}} = W_{l} h^{l}, \quad z_{\mathrm{global}} = W_{g} h^{g} \tag{9}$$

The final routed representation is the weighted sum of the two experts:

$$z = \alpha_{\mathrm{local}} z_{\mathrm{local}} + \alpha_{\mathrm{global}} z_{\mathrm{global}} \tag{10}$$

This structure enables the model to automatically capture the scale-dependent characteristics of emotional signals and adjust the relative importance of local and global cues in a learnable manner.
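Equations (7)–(10) can be traced numerically with the following minimal sketch for a single modality. It follows the linear-expert form of Equation (9) and omits bias terms, as the equations do; all weight values are arbitrary toy numbers chosen for illustration, and the function names are hypothetical.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dual_route(h_g, h_l, W1, W2, W_l, W_g):
    """Return the routed vector z and the routing weights (alpha_local, alpha_global)."""
    hidden = [gelu(v) for v in matvec(W1, h_g + h_l)]  # sigma(W1 [h^g; h^l])
    g = matvec(W2, hidden)                             # Eq. (7): two routing scores
    alpha_local, alpha_global = softmax(g)             # Eq. (8)
    z_local = matvec(W_l, h_l)                         # Eq. (9), local expert
    z_global = matvec(W_g, h_g)                        # Eq. (9), global expert
    z = [alpha_local * a + alpha_global * b            # Eq. (10)
         for a, b in zip(z_local, z_global)]
    return z, (alpha_local, alpha_global)

# Toy setup: 2-dimensional embeddings, gate hidden size 3, identity experts.
h_g, h_l = [1.0, 0.0], [0.0, 2.0]
W1 = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.5, -0.1, 0.2], [0.3, 0.1, 0.0, -0.4]]
W2 = [[0.5, -0.3, 0.2], [-0.1, 0.4, 0.6]]
identity = [[1.0, 0.0], [0.0, 1.0]]

z, (a_local, a_global) = dual_route(h_g, h_l, W1, W2, identity, identity)
```

Because the two weights sum to one, the routed vector is a convex combination of the expert outputs; with identity experts, each coordinate of `z` in this toy case is just the corresponding routing weight times the input value.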

3.4. Bidirectional Cross-Attention Fusion

After obtaining the routed representations of audio and text through Dual Routing, bidirectional Cross-Attention is performed to exchange complementary information between the two modalities, as illustrated in Figure 3.

The Query, Key, and Value for Audio-to-Text attention are defined as:

$$Q_{a} = W_{Q}^{(a)} z^{a}, \quad K_{t} = W_{K}^{(t)} z^{t}, \quad V_{t} = W_{V}^{(t)} z^{t} \tag{11}$$

The attention output is computed as:

$$\hat{z}^{t \leftarrow a} = \mathrm{Softmax}\left(\frac{Q_{a} K_{t}^{\top}}{\sqrt{d}}\right) V_{t} \tag{12}$$

Text-to-Audio attention is computed in the same manner but in the opposite direction, and the two representations are concatenated to form the final fused representation:

$$z_{\mathrm{fuse}} = [\hat{z}^{a \leftarrow t} \oplus \hat{z}^{t \leftarrow a}] \tag{13}$$

This bidirectional architecture provides higher expressive power than single-direction attention or simple concatenation, as it balances complementary information from both modalities.
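The two attention directions and the concatenation in Equations (11)–(13) can be sketched as follows. Several simplifications are assumptions made for illustration: the projection matrices are taken to be the identity, a single head is used, $z^{a}$ and $z^{t}$ are treated as short sequences of vectors so that the softmax is non-trivial, and each attended sequence is mean-pooled before concatenation.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys (Eq. 12)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def mean_pool(rows):
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]

def bidirectional_fusion(z_a, z_t):
    """Concatenate both attention directions into one fused vector (Eq. 13)."""
    audio_queries_text = cross_attention(z_a, z_t, z_t)  # Q from audio, K/V from text (Eq. 11)
    text_queries_audio = cross_attention(z_t, z_a, z_a)  # the opposite direction
    return mean_pool(text_queries_audio) + mean_pool(audio_queries_text)

# Toy routed representations: 2 audio vectors and 3 text vectors of dimension 2.
z_a = [[1.0, 0.0], [0.0, 1.0]]
z_t = [[1.0, 1.0], [0.0, 0.0], [2.0, 0.0]]
z_fuse = bidirectional_fusion(z_a, z_t)
```

The fused vector has twice the dimension of a single modality, one half per attention direction, which is what the classifier would consume in this simplified setting.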

3.5. Loss Function

To ensure stable training of the Dual Routing MoE with multi-scale representations, this study adopts a total loss composed of three components.

The overall learning objective is defined as:

$$L = L_{\mathrm{focal}} + \lambda_{\mathrm{branch}} L_{\mathrm{branch}} + \lambda_{\mathrm{balance}} L_{\mathrm{balance}} \tag{14}$$

For the main classification loss, a weighted focal loss is employed to mitigate class imbalance and increase sensitivity to hard samples. It is defined as:

$$L_{\mathrm{focal}} = -\sum_{c=1}^{C} w_{c} (1 - \hat{p}_{c})^{\gamma} y_{c} \log(\hat{p}_{c}) \tag{15}$$

This loss assigns higher weights to hard samples than to easy ones, allowing the model to learn emotional class boundaries more effectively.

Next, to prevent a collapse in which routing becomes concentrated on either the local or global expert, a balance loss is introduced. The balance loss encourages higher entropy in routing probabilities, thereby alleviating expert-selection imbalance:

$$L_{\mathrm{balance}} = -(\alpha_{\mathrm{local}} \log \alpha_{\mathrm{local}} + \alpha_{\mathrm{global}} \log \alpha_{\mathrm{global}}) \tag{16}$$

This loss ensures that both experts remain involved during training, thereby maintaining diversity and stability in scale selection (local/global routing).

Because the Dual Routing MoE processes different levels of emotional cues through the local and global experts, auxiliary supervision is applied to ensure that each expert independently acquires emotion classification capability. Each expert produces emotion probabilities through an independent classification head, and the branch loss is defined as the sum of two cross-entropy losses:

$$\begin{aligned}
L_{\mathrm{branch}} &= L_{CE}^{\mathrm{local}} + L_{CE}^{\mathrm{global}}, \\
L_{CE}^{\mathrm{local}} &= -\sum_{c=1}^{C} w_{c} y_{c} \log(\hat{p}_{c}^{\mathrm{local}}), \\
L_{CE}^{\mathrm{global}} &= -\sum_{c=1}^{C} w_{c} y_{c} \log(\hat{p}_{c}^{\mathrm{global}})
\end{aligned} \tag{17}$$

This auxiliary loss encourages each expert to learn representations that are independently capable of emotion classification, providing an additional safeguard against routing collapse.

Furthermore, by ensuring that both the local and global experts learn meaningful emotional information within their respective representation spaces, this design enhances the stability and interpretability of the final routing decisions.
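The three loss terms in Equations (14)–(17) can be combined as in the sketch below, written for a single sample with a one-hot label so that each sum over classes reduces to the target class. One point is an assumption rather than something stated in the text: Equation (16) is the routing entropy itself, and since the balance term is described as encouraging higher entropy while the total loss is minimized, the sketch adds the negative entropy. The sketch also handles one routing distribution only; how the audio and text routers are combined is not specified, so that is left out.

```python
import math

def weighted_focal_loss(probs, target, class_weights, gamma=2.0):
    """Eq. (15) for a one-hot label: only the target class contributes."""
    p = probs[target]
    return -class_weights[target] * (1.0 - p) ** gamma * math.log(p)

def weighted_cross_entropy(probs, target, class_weights):
    """One term of Eq. (17) for a one-hot label."""
    return -class_weights[target] * math.log(probs[target])

def routing_entropy(alpha):
    """Right-hand side of Eq. (16): entropy of the routing probabilities."""
    return -sum(a * math.log(a) for a in alpha if a > 0.0)

def total_loss(p_fused, p_local, p_global, alpha, target, class_weights,
               gamma=2.0, lambda_branch=0.2, lambda_balance=0.01):
    """Eq. (14) with the hyperparameter values reported in Section 4.2."""
    focal = weighted_focal_loss(p_fused, target, class_weights, gamma)
    branch = (weighted_cross_entropy(p_local, target, class_weights)
              + weighted_cross_entropy(p_global, target, class_weights))
    # ASSUMPTION: negative entropy, so that minimizing the loss raises entropy.
    balance = -routing_entropy(alpha)
    return focal + lambda_branch * branch + lambda_balance * balance

# Toy 4-class example (class order and all numbers are illustrative only).
weights = [1.0, 1.0, 1.0, 1.0]
p_fused = [0.7, 0.1, 0.1, 0.1]
p_local = [0.6, 0.2, 0.1, 0.1]
p_global = [0.5, 0.2, 0.2, 0.1]

loss_balanced = total_loss(p_fused, p_local, p_global, (0.5, 0.5), 0, weights)
loss_collapsed = total_loss(p_fused, p_local, p_global, (0.99, 0.01), 0, weights)
```

Under the stated sign assumption, the evenly split routing yields the smaller total loss of the two, which matches the described purpose of discouraging collapse onto one expert.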

4. Experiment

4.1. Dataset

This study used the IEMOCAP (Interactive Emotional Dyadic Motion Capture) dataset [32] for experiments on speech- and text-based multimodal emotion recognition. IEMOCAP is a publicly available multimodal dataset of English dyadic dialogues with emotion annotations across speech, text, and video modalities, and it is widely used in SER studies. The dataset consists of five sessions, each recorded by a different pair of one male and one female speaker (10 speakers in total). Each session includes both scripted and spontaneous speech. All utterances are annotated with emotion labels, and motion-capture and video information are also provided; however, this study used only the speech and text modalities. Because each session features a different pair of speakers, model evaluation was conducted using leave-one-session-out (LOSO) cross-validation, in which one session is held out for testing in each of the five folds. This is a widely adopted evaluation strategy in SER research, as it enables verification of speaker-independent generalization performance. To maintain consistency with prior SER studies, the experiments focused on four primary emotion classes: 1103 angry, 1084 sad, 1636 happy (combining “happy” and “excited”), and 1708 neutral samples, totaling 5531 utterances. These four classes were chosen from the more than ten emotion labels in the original IEMOCAP annotation because they have relatively sufficient data volume and ensure high comparability across studies.
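The LOSO protocol and the class counts above can be expressed in a few lines. The session identifiers 1 to 5 and the function name are illustrative; how a validation subset is drawn from the non-held-out sessions is not specified in the text and is therefore not modeled.

```python
def loso_folds(sessions):
    """Yield (remaining_sessions, held_out_session) pairs, one per session."""
    for held_out in sessions:
        remaining = [s for s in sessions if s != held_out]
        yield remaining, held_out

SESSIONS = [1, 2, 3, 4, 5]
CLASS_COUNTS = {"angry": 1103, "sad": 1084, "happy": 1636, "neutral": 1708}

folds = list(loso_folds(SESSIONS))
total_utterances = sum(CLASS_COUNTS.values())

for remaining, held_out in folds:
    print(f"test on session {held_out}, develop on sessions {remaining}")
```

Since every session has its own speaker pair, no test speaker appears among the sessions used for model development in any fold.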

4.2. Implementation Details

Preprocessing and Encoder Configuration: Audio waveforms were sampled at 16 kHz and processed using the WavLM-Base encoder. WebRTC-based VAD was applied to remove non-speech regions, and utterances exceeding 3 s were segmented into 2–3 s chunks. Global audio embeddings were obtained via mean pooling, while local embeddings were computed by applying a single MHSA layer to VAD-selected speech frames. For the text branch, RoBERTa-Base was employed with a maximum sequence length of 128. Both WavLM and RoBERTa were partially fine-tuned by unfreezing only the last four layers.
Model Structure: The Dual Routing MoE architecture received 768-dimensional local and global embeddings from each modality. Routing weights were produced through a two-layer gating MLP with a hidden size of 128. Experts were implemented as MLPs with GELU activation and a dropout rate of 0.1. Fusion between audio and text was performed using two bidirectional cross-attention layers, each with four attention heads. The overall loss combined focal loss ($\gamma = 2.0$), an auxiliary branch loss weighted by $\lambda_{\mathrm{branch}} = 0.2$, and a routing-balance loss weighted by $\lambda_{\mathrm{balance}} = 0.01$.
Training Settings: Models were optimized using AdamW with separate learning rates for SSL encoder parameters ($3 \times 10^{-5}$) and the MoE/fusion layers ($1 \times 10^{-4}$). A cosine-annealing scheduler with two warm-up epochs was applied. Training was conducted for up to 40 epochs with a batch size of 8, weight decay of 0.01, gradient clipping, and early stopping with a patience of 3. Class weights were additionally applied to address label imbalance.
Hardware Environment: All experiments were performed in a Python 3.10 environment using PyTorch 2.5.1. Training was executed on an NVIDIA GeForce RTX 4070 Ti GPU.
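The learning-rate schedule described in the training settings can be sketched as below. The text only states that cosine annealing with two warm-up epochs was used, so the linear warm-up shape, the per-epoch update granularity, and the decay towards zero over the full 40 epochs are assumptions of this sketch; the function name is hypothetical.

```python
import math

def lr_at_epoch(epoch, base_lr, warmup_epochs=2, total_epochs=40):
    """Learning rate for a 0-indexed epoch: linear warm-up, then cosine decay."""
    if epoch < warmup_epochs:
        # ASSUMPTION: linear ramp that reaches base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

ENCODER_LR = 3e-5  # SSL encoder parameters
HEAD_LR = 1e-4     # MoE and fusion layers

encoder_schedule = [lr_at_epoch(e, ENCODER_LR) for e in range(40)]
head_schedule = [lr_at_epoch(e, HEAD_LR) for e in range(40)]
```

Both parameter groups follow the same shape and differ only in their peak value, which keeps the pretrained encoders on a smaller step size than the newly initialized routing and fusion layers.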

5. Result

The results of the experiments conducted on all five sessions of IEMOCAP are summarized in Table 1. The proposed Dual Routing-based multimodal model achieved an average UA of 75.27% and WA of 74.09%, demonstrating overall stable performance. Here, UA (unweighted accuracy) represents the mean recall across classes, while WA (weighted accuracy) denotes the accuracy weighted by class distribution, which is equivalent to the overall accuracy. Despite differences in speaker composition or speaking style across sessions, the model maintained a UA of at least 72% in all folds, and achieved comparatively high performance in fold 2 (UA 79.6%) and fold 4 (WA 75.3%). These results suggest that the multi-scale representation and Dual Routing architecture effectively adapt to speaker- and session-dependent variations in utterance characteristics. Furthermore, the small performance deviations across sessions indicate that the model does not overly rely on any specific session or speaker, exhibiting consistent performance across diverse speaking environments.
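The two metrics can be computed as in the following sketch, where UA is the mean of the per-class recalls and WA is the fraction of correctly classified utterances. The labels are a toy example and do not correspond to the reported results.

```python
from collections import Counter

def ua_wa(y_true, y_pred):
    """Return (UA, WA, per-class recall) for two equal-length label lists."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {c: correct[c] / totals[c] for c in totals}
    ua = sum(recalls.values()) / len(recalls)   # every class counts equally
    wa = sum(correct.values()) / len(y_true)    # every utterance counts equally
    return ua, wa, recalls

# Toy imbalanced case: 2 "angry" and 6 "neutral" utterances.
y_true = ["angry"] * 2 + ["neutral"] * 6
y_pred = ["angry", "angry"] + ["neutral"] * 3 + ["angry"] * 3

ua, wa, recalls = ua_wa(y_true, y_pred)
print(f"UA = {ua:.3f}, WA = {wa:.3f}")
```

In the toy case the minority class is recognized perfectly and the majority class only half the time, so UA (0.75) exceeds WA (0.625); this is the kind of gap that makes it useful to report both values on an imbalanced dataset.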

The performance comparison with existing multimodal emotion recognition models reported on IEMOCAP is summarized in Table 2. SERVER exhibited a basic performance level of approximately 63% UA, while IA-MMTF and BERT–RoBERTa BiLSTM reported stable results in the range of 72–74%. Meanwhile, the recent state-of-the-art model MFDR achieved UA 77% and WA 75.7%, representing one of the highest accuracies reported to date. The proposed Dual Routing model recorded UA 75.27% and WA 74.09%, which is lower than MFDR but comparable to or slightly better than models such as BERT–RoBERTa BiLSTM. Notably, whereas most existing models rely on single-scale representations, the proposed model introduces a Dual Routing mechanism that adaptively adjusts the importance of local–global representations based on input characteristics, offering a distinctive perspective for selectively leveraging multi-scale emotional cues. As demonstrated in the ablation results and routing analysis, this structural property plays a crucial role in adapting to emotion- and session-specific variations in speaking patterns. Therefore, although the performance gap with the latest SOTA model is acknowledged, this study provides academic value by presenting a novel structural direction—scale-aware expert selection—that has not been sufficiently addressed in previous MER research.

5.1. Ablation Study

To examine the contribution of the key components of the proposed model—namely the multi-scale representation and the Dual Routing mechanism—an ablation study was conducted. Four model variants were compared: a model using only local representations, a model using only global representations, a concat model that simply merges local and global features, and the full Dual Routing-based model. UA and WA were measured for each model.

As shown in Table 3, the Default model achieved the highest performance in both UA and WA. Single-scale models that use only local or only global representations performed worse than the Default model in both cases, demonstrating that emotional information cannot be sufficiently captured using a single-level representation alone. Moreover, the concat model, which simply combines local and global features, also showed lower performance than the Default model across all metrics, indicating that simple feature aggregation cannot properly reflect the varying importance of emotional cues depending on utterance context. These results suggest that the Dual Routing mechanism, which selects between local and global experts based on input characteristics, is a key factor in effectively utilizing multi-scale information and directly contributes to the performance improvement of the model.

5.2. Confusion Matrix Analysis

To more precisely analyze the classification performance for each emotion class across the entire test set, we examined the recall-based confusion matrix, as shown in Figure 4. The Angry class achieved the highest recall at 81.41%, indicating that it was classified most reliably. This can be attributed to the strong prosodic intensity and distinctive local acoustic cues that typically appear in angry utterances, making them relatively easier to distinguish from other emotions. In contrast, Happy showed a misclassification rate of 13.75% toward Neutral. This tendency can arise when happy utterances exhibit low arousal or lack strong prosodic expression, resulting in neutral-like characteristics under certain speaking conditions.

The Sad class recorded a recall of 74.45%, but it also showed the highest confusion toward Neutral, with an 18.08% misclassification rate—the largest among the four emotions. This is because sad utterances generally exhibit low energy and slower tempo, which creates prosodic similarities with neutral speech. The Neutral class had the lowest recall at 67.10% and exhibited misclassifications toward both Happy and Sad. Neutral utterances tend to contain weaker emotional cues and lack distinctive features, resulting in ambiguous decision boundaries. Consequently, they can be easily confused with emotions in different directions.

5.3. Routing Behavior Analysis

The local/global expert selection ratio analyzed by session and emotion serves as an important indicator of how the model utilizes multi-scale information depending on input characteristics, as shown in Figure 5. In the audio modality, clear variations were observed across sessions. For example, in Session 2, the selection ratio of local cues was high, whereas in Session 4, the selection of global cues increased overwhelmingly. This reflects substantial differences among sessions in terms of speaking habits, emotional expression patterns, and conversational context, demonstrating that the Dual Routing mechanism responds sensitively to such variations and selects the appropriate scale of features accordingly. In Session 3, local and global cues were selected at nearly equal proportions, which can be interpreted as indicating that the utterances in this session tend to require both local cues and global prosody.

In contrast, the text modality exhibited consistently high utilization of global cues across all sessions, with particularly high global selection ratios observed in Sessions 4 and 5. Since text-based emotional signals rely heavily on sentence-level semantic structure and contextual flow, these results show that text routing operates in a stable and session-robust manner.

Emotion-wise analysis of local/global utilization also revealed notable differences, as shown in Figure 6. For audio, Angry showed strong dominance of local cues, while Happy and Neutral exhibited higher proportions of global cue selection. Sad demonstrated a balanced pattern, with local and global cues selected at nearly equal ratios. This indicates that the prosodic structure differs across emotions: emotions like anger, characterized by abrupt changes in energy, depend more on local cues, whereas emotions with more static and stable prosody, such as neutral or happy, rely more heavily on global cues. The text modality showed consistently high selection of global cues across all emotions, reflecting the central role of semantic context in emotion judgment.

Overall, the routing behavior analysis empirically validates the necessity of multi-scale representations and the practical contribution of the Dual Routing mechanism.
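A selection ratio of this kind could be computed from logged routing weights as sketched below. The text does not state how an utterance is counted as a local or global selection, so the rule used here (local when $\alpha_{\mathrm{local}} > 0.5$) is an assumption, and the records are made-up toy values rather than results from the experiments.

```python
from collections import defaultdict

def local_selection_ratio(records):
    """Map each group (e.g., emotion or session) to the share of utterances routed locally.

    `records` is an iterable of (group, alpha_local) pairs.
    ASSUMPTION: an utterance counts as a local selection when alpha_local > 0.5.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [n_local, n_total]
    for group, alpha_local in records:
        if alpha_local > 0.5:
            counts[group][0] += 1
        counts[group][1] += 1
    return {group: n_local / n_total for group, (n_local, n_total) in counts.items()}

# Made-up routing weights for illustration only.
records = [
    ("angry", 0.9), ("angry", 0.7), ("angry", 0.4),
    ("neutral", 0.2), ("neutral", 0.6), ("neutral", 0.1), ("neutral", 0.3),
]
ratios = local_selection_ratio(records)
```

Grouping by session instead of emotion only requires changing the first element of each record; the global ratio is one minus the local ratio for every group.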

6. Discussion

This study proposed a Dual Routing-based MER model that integrates local–global representations, motivated by the observation that emotional signals in speech and text cannot be sufficiently captured by single-scale representations alone. The experimental results clearly demonstrate that the proposed architecture effectively reflects fine-grained components of emotional signals and addresses the limitations inherent in previous single-scale approaches.

In the ablation experiments (Table 3), models using only local or only global representations reached 74.06% and 74.15% UA, respectively, and simple feature concatenation reached 74.24% UA, compared with 75.27% UA for Dual Routing. WA showed the same trend (71.84%, 72.45%, and 72.10% versus 74.09%). The proposed Dual Routing structure selectively activates local or global experts according to input characteristics, enabling more appropriate utilization of multi-scale information. These findings indicate that emotional signals require cues at different temporal scales depending on context and utterance conditions, and that fixed single-scale representations are insufficient to capture such diversity.
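To make the selective activation concrete, the sketch below follows the description in the Figure 2 caption: local and global embeddings are concatenated, passed through a gate MLP, and used to softly weight two scale-specific experts. It is an illustrative sketch under stated assumptions, not the authors' implementation. The layer sizes, the random initialization, the single linear layer per expert, and the choice to feed each expert only the embedding of its own scale are all assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_linear(rng, n_out, n_in):
    """Random (weight, bias) pair; weight is a list of n_out rows."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def linear(x, layer):
    """Compute W x + b for a (weight, bias) pair."""
    weight, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def dual_route(h_local, h_global, gate_hidden, gate_out, expert_local, expert_global):
    # Gate MLP: concatenation -> hidden layer (ReLU) -> two logits -> softmax.
    hidden = [max(0.0, v) for v in linear(h_local + h_global, gate_hidden)]
    g_local, g_global = softmax(linear(hidden, gate_out))
    # Assumption: each expert transforms the embedding of its own scale.
    e_local = linear(h_local, expert_local)
    e_global = linear(h_global, expert_global)
    # Soft routing: convex combination of the two expert outputs.
    routed = [g_local * a + g_global * b for a, b in zip(e_local, e_global)]
    return routed, (g_local, g_global)


rng = random.Random(0)
dim = 4  # toy embedding size, chosen only for illustration
gate_hidden = init_linear(rng, 8, 2 * dim)
gate_out = init_linear(rng, 2, 8)
expert_local = init_linear(rng, dim, dim)
expert_global = init_linear(rng, dim, dim)
h_local = [rng.gauss(0.0, 1.0) for _ in range(dim)]
h_global = [rng.gauss(0.0, 1.0) for _ in range(dim)]

routed, (g_local, g_global) = dual_route(
    h_local, h_global, gate_hidden, gate_out, expert_local, expert_global
)
print(f"gate weights: local={g_local:.3f}, global={g_global:.3f}")
```

Because the two gate weights sum to one, the larger weight can be read directly as the preferred scale for that input, which is what makes the routing behavior inspectable.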

The routing behavior analysis provides important qualitative insight into this structural mechanism. In the audio modality, the proportion of local/global cue utilization varied markedly across sessions, reflecting differences in speaking habits, emotional expression patterns, and conversational context. However, such session-wise differences cannot be attributed solely to variation in speaker style. Since IEMOCAP does not maintain a uniform emotional distribution across sessions, it is also possible that imbalances in emotional labels contributed to the observed session-level variation. Although this study does not perform a quantitative analysis of session-wise emotional distributions, acknowledging this possibility allows for a more careful interpretation of session variability.

In contrast, the text modality showed strong and consistent reliance on global cues across all sessions, reflecting the natural linguistic tendency that sentence-level semantic structure plays a central role in emotion judgment. This analysis suggests that multi-scale representation offers benefits beyond performance improvement, contributing to a deeper structural understanding of emotional signals.

Emotion-wise analysis also revealed clear distinctions. The Angry emotion exhibited the highest utilization of local cues, consistent with its strong local acoustic bursts, whereas Happy and Neutral showed higher reliance on global cues, consistent with their smoother and more stable prosody. Sad displayed a balanced pattern, with local and global cues selected at nearly equal rates. These differences indicate that prosodic patterns vary across emotions and that fixed representation approaches struggle to clearly separate emotional boundaries.

The confusion matrix analysis further highlighted these structural properties. Angry achieved the highest classification performance due to its strong arousal and distinctive prosodic patterns, while Neutral exhibited the lowest recall and was frequently confused with other emotions because of its weak emotional cues and ambiguous boundaries. In particular, misclassifications from Happy to Neutral and from Sad to Neutral were high, reflecting the shared prosodic characteristics of low-arousal emotions. These results support the conclusion that the scale characteristics of emotional signals affect both classification performance and misclassification patterns.
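The recall-normalized confusion matrix (Figure 4) and the two accuracy metrics are closely related, as the following sketch shows. It assumes the usual speech emotion recognition convention that UA is the mean of per-class recalls and WA is the overall accuracy. The counts are hypothetical and do not reproduce the paper's results.

```python
def row_normalize(cm):
    """Divide each row (true class) by its total so the diagonal holds recall."""
    return [[n / sum(row) for n in row] for row in cm]


def unweighted_accuracy(cm):
    """UA: mean of per-class recalls, so every class counts equally."""
    recalls = [row[i] / sum(row) for i, row in enumerate(cm)]
    return sum(recalls) / len(recalls)


def weighted_accuracy(cm):
    """WA: correct predictions over all samples, so frequent classes dominate."""
    correct = sum(row[i] for i, row in enumerate(cm))
    total = sum(sum(row) for row in cm)
    return correct / total


labels = ["angry", "happy", "neutral", "sad"]
# Hypothetical counts: rows are true classes, columns are predictions.
cm = [
    [80, 5, 10, 5],
    [5, 70, 20, 5],
    [20, 30, 120, 30],
    [2, 3, 20, 75],
]

normalized = row_normalize(cm)
for label, row in zip(labels, normalized):
    print(label, [round(v, 2) for v in row])
print("UA:", unweighted_accuracy(cm))
print("WA:", weighted_accuracy(cm))
```

In this toy matrix the largest class (neutral) has the lowest recall, so WA falls below UA, which is the same direction as the gap between the averages reported in Table 1.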

Although the proposed model shows slightly lower absolute performance than MFDR [11], the latest state-of-the-art (SOTA) model in the comparison (75.27% UA and 74.09% WA versus 77.0% UA and 75.7% WA; Table 2), its contribution lies not in outperforming existing models but in addressing an underexplored structural aspect of MER research: multi-scale cue selection. While conventional MoE-based models primarily focus on expert routing based on modality differences, this study introduces a selective expert activation mechanism that considers the temporal scale inherent in emotional signals. This design allows the model to reflect the hierarchical structure of emotional expression more precisely, which in turn supports interpretability, representational power, and adaptability to contextual variation.

In summary, this study provides academic value by presenting a new direction in representation design that accounts for the intrinsic structure of emotional signals rather than merely pursuing performance improvement. The experimental results and routing analysis clearly demonstrate that multi-scale information plays a critical role in MER and establish a foundation for future research combining the proposed approach with various self-supervised models, lightweight routing mechanisms, or hierarchical fusion strategies.

7. Conclusions

This study proposed a Dual Routing-based multimodal emotion recognition model that selectively utilizes local–global representations according to input characteristics, motivated by the fact that emotional signals cannot be sufficiently captured by single-scale representations in either temporal or semantic dimensions. The proposed model simultaneously incorporates multi-scale cues from speech and text modalities and automatically adjusts their relative importance through Dual Routing, thereby more faithfully capturing the hierarchical structure of emotional expression that existing MER models often fail to reflect.

The experimental results showed that although the absolute performance of the proposed model was somewhat lower than that of the latest SOTA model, the ablation and routing behavior analyses confirmed that multi-scale cue selection plays a meaningful role in the actual emotion recognition process. This supports the conclusion that the main contribution of this study lies not in performance competition but in presenting a new modeling paradigm that reflects the structural properties of emotional signals. In particular, the concept of scale-aware expert activation complements limitations of previous modality-centered MoE approaches and plays an important role in explaining the variability and context dependency of emotional signals.

Future work may extend the proposed multi-scale selection mechanism by integrating it with more lightweight routing structures, various self-supervised encoders, or hierarchical fusion strategies. Additionally, visualizing cue-selection patterns and validating the approach across different languages, corpora, and domains would help establish the model’s generalization and practical value.

Overall, this study introduces a novel structural approach to addressing the scale-dependent cue selection problem—an essential aspect of multimodal emotion recognition—and provides foundational insights that highlight the importance of multi-scale information in MER.

Author Contributions

Conceptualization, D.-E.C. and S.-P.L.; methodology, D.-E.C.; investigation, D.-E.C.; writing—original draft preparation, D.-E.C.; writing—review and editing, S.-P.L.; project administration, S.-P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in the IEMOCAP corpus (https://sail.usc.edu/iemocap/, accessed on 16 December 2025), which is publicly accessible for research purposes.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SER	Speech Emotion Recognition
MER	Multimodal Emotion Recognition
MoE	Mixture-of-Experts
SSL	Self-Supervised Learning

References

Mohamed, A.; Lee, H.; Borgholt, L.; Havtorn, J.D.; Edin, J.; Igel, C.; Kirchhoff, K.; Li, S.-W.; Livescu, K.; Maaløe, L.; et al. Self-Supervised Speech Representation Learning: A Review. IEEE J. Sel. Top. Signal Process. 2022, 16, 1179–1210. [Google Scholar] [CrossRef]
Shah Fahad, M.d.; Ranjan, A.; Yadav, J.; Deepak, A. A Survey of Speech Emotion Recognition in Natural Environment. Digit. Signal Process. 2021, 110, 102951. [Google Scholar] [CrossRef]
Trinh Van, L.; Dao Thi Le, T.; Le Xuan, T.; Castelli, E. Emotional Speech Recognition Using Deep Neural Networks. Sensors 2022, 22, 1414. [Google Scholar] [CrossRef] [PubMed]
Yoon, S.; Byun, S.; Jung, K. Multimodal Speech Emotion Recognition Using Audio and Text. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 112–118. [Google Scholar]
Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 12449–12460. [Google Scholar]
Hsu, W.-N.; Bolte, B.; Tsai, Y.-H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
Morais, E.; Hoory, R.; Zhu, W.; Gat, I.; Damasceno, M.; Aronowitz, H. Speech Emotion Recognition Using Self-Supervised Features. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 6922–6926. [Google Scholar]
Tsai, Y.-H.H.; Bai, S.; Pu Liang, P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. Proc. Conf. Assoc. Comput. Linguist Meet 2019, 2019, 6558–6569. [Google Scholar] [CrossRef]
Hazarika, D.; Zimmermann, R.; Poria, S. MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; Association for Computing Machinery: New York, NY, USA; pp. 1122–1131. [Google Scholar]
Zhao, Z.; Gao, T.; Wang, H.; Schuller, B. MFDR: Multiple-Stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition. In Proceedings of the Interspeech 2024, ISCA, Kos Island, Greece, 1 September 2024; pp. 3719–3723. [Google Scholar]
K.D., N.; Patil, A. Multimodal Emotion Recognition Using Cross-Modal Attention and 1D Convolutional Neural Networks. In Proceedings of the Interspeech 2020, ISCA, Online, 25 October 2020; pp. 4243–4247. [Google Scholar]
Shang, Y.; Fu, T. Multimodal Fusion: A Study on Speech-Text Emotion Recognition with the Integration of Deep Learning. Intell. Syst. Appl. 2024, 24, 200436. [Google Scholar] [CrossRef]
Sebastian, J.; Pierucci, P. Fusion Techniques for Utterance-Level Emotion Recognition Combining Speech and Transcripts. In Proceedings of the Interspeech 2019, ISCA, Graz, Austria, 15 September 2019; pp. 51–55. [Google Scholar]
Leygue, T.; Sabourin, A.; Bolzmacher, C.; Bouchigny, S.; Anastassova, M.; Pham, Q.-C. Explainable Speech Emotion Recognition through Attentive Pooling: Insights from Attention-Based Temporal Localization. arXiv 2025, arXiv:2506.15754. [Google Scholar]
Xu, M.; Zhang, F.; Cui, X.; Zhang, W. Speech Emotion Recognition with Multiscale Area Attention and Data Augmentation. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6 June 2021. [Google Scholar]
Peng, Z.; Lu, Y.; Pan, S.; Liu, Y. Efficient Speech Emotion Recognition Using Multi-Scale CNN and Attention. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6 June 2021; pp. 3020–3024. [Google Scholar]
Hyeon, J.; Oh, Y.-H.; Lee, Y.-J.; Choi, H.-J. Enhancing Speech Emotion Recognition through Segmental Average Pooling of Self-Supervised Learning Features. In Proceedings of the 2025 IEEE International Conference on Big Data and Smart Computing (BigComp), Kota Kinabalu, Malaysia, 9–12 February 2025; pp. 191–198. [Google Scholar]
Wang, Y.; Huang, J.; Zhao, Z.; Lan, H.; Zhang, X. Speech Emotion Recognition Using Multi-Scale Global–Local Representation Learning with Feature Pyramid Network. Appl. Sci. 2024, 14, 11494. [Google Scholar] [CrossRef]
Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. J. Mach. Learn. Res. 2022, 23, 1–39. [Google Scholar]
Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; Chen, Z. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv 2020, arXiv:2006.16668. [Google Scholar]
Zoph, B.; Bello, I.; Kumar, S.; Du, N.; Huang, Y.; Dean, J.; Shazeer, N.; Fedus, W. ST-MoE: Designing Stable and Transferable Sparse Expert Models. arXiv 2022, arXiv:2202.08906. [Google Scholar]
Zhang, Z.; Lin, Y.; Liu, Z.; Li, P.; Sun, M.; Zhou, J. MoEfication: Transformer Feed-Forward Layers Are Mixtures of Experts. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 877–890. [Google Scholar]
Hyeon, J.; Oh, Y.-H.; Lee, Y.-J.; Choi, H.-J. Improving Speech Emotion Recognition by Fusing Self-Supervised Learning and Spectral Features via Mixture of Experts. Data Knowl. Eng. 2024, 150, 102262. [Google Scholar] [CrossRef]
Han, X.; Nguyen, H.; Harris, C.; Ho, N.; Saria, S. FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion. Adv. Neural Inf. Process. Syst. 2024, 37, 67850–67900. [Google Scholar]
Mengara Mengara, A.G.; Moon, Y. CAG-MoE: Multimodal Emotion Recognition with Cross-Attention Gated Mixture of Experts. Mathematics 2025, 13, 1907. [Google Scholar] [CrossRef]
Wang, Y.; Hu, W. Speech Emotion Recognition Based on Improved MFCC. In Proceedings of the 2nd International Conference on Computer Science and Application Engineering, New York, NY, USA, 22 October 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1–7. [Google Scholar]
Indra, J.; Shankar, R.K.; Priya, R.D. Speech Emotion Recognition Using Support Vector Machine and Linear Discriminant Analysis. In Proceedings of the Intelligent Systems Design and Applications; Abraham, A., Pllana, S., Casalino, G., Ma, K., Bajaj, A., Eds.; Springer Nature: Cham, Switzerland, 2023; pp. 482–492. [Google Scholar]
Lee, C.-C.; Mower, E.; Busso, C.; Lee, S.; Narayanan, S. Emotion Recognition Using a Hierarchical Binary Decision Tree Approach. Speech Commun. 2011, 53, 1162–1171. [Google Scholar] [CrossRef]
El Ayadi, M.; Kamel, M.S.; Karray, F. Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases. Pattern Recognit. 2011, 44, 572–587. [Google Scholar] [CrossRef]
Latif, S.; Zaidi, A.; Cuayahuitl, H.; Shamshad, F.; Shoukat, M.; Usama, M.; Qadir, J. Transformers in Speech Processing: A Survey. arXiv 2023, arXiv:2303.11607. [Google Scholar]
Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive Emotional Dyadic Motion Capture Database. Lang Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019. [Google Scholar]
Khan, M.; Tran, P.-N.; Pham, N.T.; El Saddik, A.; Othmani, A. MemoCMT: Multimodal Emotion Recognition Using Cross-Modal Transformer-Based Feature Fusion. Sci. Rep. 2025, 15, 5473. [Google Scholar] [CrossRef]
Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv 2017, arXiv:1701.06538. [Google Scholar]
Zhou, Y.; Lei, T.; Liu, H.; Du, N.; Huang, Y.; Zhao, V.; Dai, A.M.; Chen, Z.; Le, Q.V.; Laudon, J. Mixture-of-Experts with Expert Choice Routing. Adv. Neural Inf. Process. Syst. 2022, 35, 7103–7114. [Google Scholar]
Mustafa, B.; Riquelme, C.; Puigcerver, J.; Jenatton, R.; Houlsby, N. Multimodal Contrastive Learning with LIMoE: The Language-Image Mixture of Experts. Adv. Neural Inf. Process. Syst. 2022, 35, 9564–9576. [Google Scholar]
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
Pham, N.T.; Dang, D.N.M.; Pham, B.N.H.; Nguyen, S.D. SERVER: Multi-Modal Speech Emotion Recognition Using Transformer-Based and Vision-Based Embeddings. In Proceedings of the 2023 8th International Conference on Intelligent Information Technology; Association for Computing Machinery, New York, NY, USA, 13 July 2023; pp. 234–238. [Google Scholar]
Guo, L.; Wang, L.; Dang, J.; Fu, Y.; Liu, J.; Ding, S. Emotion Recognition with Multimodal Transformer Fusion Framework Based on Acoustic and Lexical Information. IEEE MultiMed. 2022, 29, 94–103. [Google Scholar] [CrossRef]
Zhang, S.; Feng, Y.; Ren, Y.; Guo, Z.; Yu, R.; Li, R.; Xing, P. Multi-Modal Emotion Recognition Based on Wavelet Transform and BERT-RoBERTa: An Innovative Approach Combining Enhanced BiLSTM and Focus Loss Function. Electronics 2024, 13, 3262. [Google Scholar] [CrossRef]

Figure 1. Overall architecture of the proposed Dual Routing-based multimodal emotion recognition model. The model extracts local–global representations from audio and text, performs scale-aware expert routing, and integrates both modalities through bidirectional cross-attention.

Figure 2. Dual Routing Mixture-of-Experts module. Local and global embeddings are concatenated, gated through an MLP, and softly routed to scale-specific experts to produce a routed representation.

Figure 3. Bidirectional Cross-Attention fusion. Audio and text routed representations attend to each other, and the resulting cross-attended features are fused to generate the final multimodal representation.

Figure 4. Normalized recall-based confusion matrix for the four emotion classes on the IEMOCAP dataset.

Figure 5. Selection frequencies of local and global experts in the Dual Routing mechanism. (a) Audio expert utilization across sessions; (b) Text expert utilization across sessions.

Figure 6. Emotion-wise distribution of local versus global expert activation. (a) Audio local/global ratio by emotion; (b) Text local/global ratio by emotion.

Table 1. Emotion recognition performance (UA/WA) across the five IEMOCAP sessions.

Session	UA (%)	WA (%)
S1	73.90	70.97
S2	79.60	77.03
S3	74.28	74.02
S4	75.28	75.36
S5	73.29	73.09
Average	75.27	74.09
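
The reported averages can be checked directly from the per-session values in Table 1. The short sketch below assumes the averages are plain arithmetic means over the five sessions; the variable names are illustrative.

```python
# Per-session results copied from Table 1 (percent).
ua = {"S1": 73.9, "S2": 79.6, "S3": 74.28, "S4": 75.28, "S5": 73.29}
wa = {"S1": 70.97, "S2": 77.03, "S3": 74.02, "S4": 75.36, "S5": 73.09}

# Assumption: the "Average" row is the unweighted mean across sessions.
mean_ua = sum(ua.values()) / len(ua)
mean_wa = sum(wa.values()) / len(wa)

print(f"mean UA = {mean_ua:.2f}, mean WA = {mean_wa:.2f}")
```

Both means agree with the Average row after rounding to two decimals.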

Table 2. Comparison of the proposed model with previous multimodal emotion recognition approaches on IEMOCAP.

Methods	UA (%)	WA (%)
SERVER [39]	63.0	63.1
IA-MMTF [40]	72.5	72.0
BERT-RoBERTa BiLSTM [41]	73.95	74.27
MFDR [11]	77.0	75.7
Ours	75.27	74.09

Table 3. Ablation study evaluating the contributions of local, global, and multi-scale routing components.

Methods	UA (%)	WA (%)
Local-Only	74.06	71.84
Global-Only	74.15	72.45
Local–Global Concat	74.24	72.10
Local–Global Dual Routing	75.27	74.09

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
