1. Introduction
With the rapid development of Artificial Intelligence (AI) and multimodal learning [
1,
2,
3], Emotion Recognition (ER) [
4], particularly Bimodal Emotion Recognition (BER) [
5] and Multimodal Emotion Recognition (MER) [
6], has become a core component of effective Human-Computer Interaction (HCI) [
7,
8,
9]. By leveraging multiple modalities such as audio, text, and visual data, MER systems are capable of capturing the complex and subtle nature of human affective states, which are rarely conveyed through a single channel [
10,
11]. These systems have seen increasing adoption in various domains, including mental health monitoring, personalized learning, and affect-aware virtual assistants [
12,
13,
14]. However, BER systems based on audio–text analysis are more widely adopted, as they combine linguistic content with paralinguistic features (prosody, tone). Their practicality in privacy-sensitive applications (e.g., voice assistants, telehealth) further drives their preference over visual modalities [
15,
16].
The development of ER [
17,
18] models is based on psychological theories of emotion classification. An influential framework is Ekman’s basic emotions theory [
19], which identifies six universal emotions: happiness, sadness, anger, fear, disgust, and surprise. Another widely used model is Russell’s circumplex model [
20], which maps emotions onto a two-dimensional plane defined by arousal (high to low) and valence (pleasant to unpleasant). These psychological theories provide a conceptual foundation for computational methods used in emotion classification tasks [
21].
Recent developments in ER have seen a strong shift toward multimodal [
22] and Transformer-based [
23] architectures that leverage pre-trained encoders for audio and text, with increasingly sophisticated attention-based or graph-based fusion mechanisms. These models demonstrate impressive performance on benchmark corpora and continue to evolve with the integration of newer techniques such as State-Space Models (SSMs) (e.g., Mamba) [
24], adaptive attention [
25,
26], and unsupervised [
27,
28,
29] or weakly supervised learning [
30,
31]. However, despite their success, several key challenges persist. Firstly, many State-of-the-Art (SOTA) models are trained and evaluated on English corpora, which limits their effectiveness in multilingual contexts. The ability to use different languages simultaneously is important for a wide variety of multilingual human–computer interaction systems [
32]. Secondly, modality fusion often lacks fine-grained alignment: most methods either fuse at a high level or focus on short-range dependencies, failing to capture nuanced cross-modal dynamics over time. Moreover, existing corpora suffer from class imbalance, variation in utterance lengths, and limited emotional diversity, which can lead to overfitting and poor generalization in real-world applications. Finally, current data augmentation strategies tend to be generic, lacking task-specific adaptation or emotional semantics, particularly in low-resource or cross-lingual settings. As a result, no existing method addresses linguistic diversity, fine-grained cross-modal alignment across variable-length utterances, and context-specific data augmentation within a unified framework.
To address these limitations, a BER method is proposed that integrates audio and text modalities using cross-lingual encoders fine-tuned jointly on Russian Emotional Speech Dialogs (RESD) (Russian) [
33] (
https://huggingface.co/datasets/Aniemore/resd, accessed on 28 September 2025) and Multimodal EmotionLines Dataset (MELD) (English) [
34] (
https://github.com/declare-lab/MELD, accessed on 28 September 2025) corpora. To combine the outputs of unimodal encoders, a fusion model is introduced, which is based on a Transformer architecture with cross-modal attention, enabling feature interaction across time and feature axes. To mitigate overfitting and improve generalization, three complementary data augmentation strategies are developed and systematically evaluated.
Two sampling-based augmentation methods are introduced. First, the Stacked Data Sampling (SDS) method concatenates short utterances into longer, semantically diverse samples, reducing the variability in sequence lengths and enriching the corpus. Second, the Template-based Utterance Generation (TUG) method applies generative augmentation using Large Language Models (LLMs) to synthesize emotionally relevant text samples, which are then converted to speech using high-fidelity Text-to-Speech (TTS) models. This method improves class balance and enhances emotional diversity in training. Finally, an LLM-guided label smoothing strategy is proposed that replaces conventional uniform smoothing with semantically aware target distributions. Rather than assigning equal weight to non-target emotions, the LLM generates soft labels that reflect nuanced emotional co-occurrence based on the input context.
This leads to the following research questions:
RQ1: Can a cross-lingual BER model improve generalization across languages?
RQ2: Can a hierarchical cross-modal attention mechanism effectively handle variable utterance lengths?
RQ3: Do corpus-specific augmentation strategies improve the generalization of cross-lingual BER models?
The main contributions of the article are as follows:
A Cross-Lingual BER model is proposed that integrates Mamba-based cross-lingual unimodal encoders with a Transformer-based cross-modal fusion model, achieving model generalization across English and Russian corpora.
Two complementary data sampling methods are introduced: segment-level SDS and TUG based on emotional TTS generation, both designed to enrich emotional variability and mitigate class imbalance.
A label smoothing method, Label Smoothing Generation based on LLM (LS-LLM), is developed, which uses an LLM to produce context-dependent soft labels, reducing model overconfidence by accounting for the complex nature of emotional co-occurrence.
The rest of the article is organized as follows: In
Section 2, SOTA ER methods and data augmentation strategies are described.
Section 3 provides a detailed explanation of the proposed method.
Section 4 outlines the research corpora, performance measures, and experimental results.
Section 5 discusses the advantages and limitations of the proposed method, as well as its applicability in intelligent systems and research questions. Finally,
Section 6 summarizes the key findings and outlines future research directions.
2. Related Work
This section reviews SOTA ER methods, covering unimodal, bimodal, and multimodal methods, with a focus on multimodal fusion techniques. It also discusses modern data augmentation strategies that mitigate overfitting to the training data and address class imbalance.
2.1. State-of-the-Art Methods for Emotion Recognition
Early ER methods have primarily focused on single modalities such as video (e.g., facial expressions [
35], body language [
36]), text (e.g., sentiment [
37] and affective language [
38]), or audio (e.g., prosody [
39] and speech patterns [
40,
41]). Although these methods have demonstrated effectiveness in controlled settings, they often struggle to generalize to real-world scenarios where emotional expression is inherently multimodal.
To overcome these limitations, recent research [
4,
42,
43] has increasingly focused on BER systems that integrate complementary information from multiple channels. These systems are driven by the dual nature of spoken language: while text captures semantic content, audio encodes the way speech is delivered via intonation, rhythm, pitch, and other prosodic features. Crucially, these modalities exhibit complementary strengths: text-based models often struggle with linguistically neutral emotional content, whereas audio-based systems are sensitive to speaker variability and noise. By fusing both, the BER models create richer and more robust emotional representations [
44,
45]. As a result, ER has emerged as a core task in affective computing [
46], driven by advances in deep learning [
17,
18] and representation learning [
47,
48].
Various multimodal corpora have been developed to support the training and evaluation of ER models.
Table 1 presents an overview of commonly used corpora in English and Russian. Despite their differences in size and modality, many of these corpora face recurring issues such as class imbalance, annotation noise, and limited linguistic diversity, all of which restrict the generalizability of trained models. The generalizability of ER depends not only on corpus selection but also on robust preprocessing and feature extraction, critical steps that shape emotion modeling. As each modality (audio, text) encodes emotions differently, each requires specialized techniques to extract informative features. To align and integrate features across modalities, attention mechanisms [
18,
49,
50] have become a standard design element in BER architectures. Self-attention is effective for capturing long-range dependencies in text, while cross-modal attention allows the system to relate semantic and prosodic cues to improve emotional inference.
A comparative overview of ER methods is shown in
Table 2. Although earlier methods relied on hand-crafted features [
57,
58,
59], current methods leverage deep neural encoders trained directly on raw or minimally processed data [
60,
61,
62]. Textual encoders are commonly based on large pre-trained Transformer models such as Bidirectional Encoder Representations from Transformers (BERT) [
63], Robustly optimized BERT approach (RoBERTa) [
64], A Lite BERT (ALBERT) [
65], or multilingual embeddings such as Jina-v3 [
66] and Jina-v4 (
https://huggingface.co/jinaai/jina-embeddings-v4, accessed on 28 September 2025). For audio, recent methods rely on self-supervised learning models such as Wav2Vec [
32,
67,
68,
69], Extended HuBERT (ExHuBERT) [
70], and Whisper [
71,
72,
73]. For modality fusion, Transformer-based [
74,
75,
76,
77], Graph-based [
42,
78,
79,
80] and Mamba-based [
81,
82,
83,
84] models are predominantly employed, as they allow dynamic integration of information from both linguistic and acoustic modalities.
2.1.1. Transformer-Based Methods
Transformer-based architectures represent one of the most prominent directions in BER, utilizing attention mechanisms to align and integrate linguistic and acoustic characteristics. These models differ in their fusion strategies, ranging from early attention pooling to deep co-attentional encoding, and in how they refine input representations [
103].
Several recent models enhance speech processing through modality-specific attention. DropFormer [
92], for instance, applies Drop Attention to emphasize emotion-relevant segments and includes a Token Dropout Module to suppress irrelevant information. Similarly, Zhao et al. [
93] propose a dynamically refined multimodal framework that integrates Sliding Adaptive Window Attention (SAWA), a Gated Context Perception Unit (GCPU), and Dynamic Frame Convolution (DFC) to mitigate information misalignment and improve acoustic–textual interaction. García et al. [
94] investigate deep learning-based beamforming methods using simulated acoustic data in Human-Robot Interaction (HRI) scenarios. Sun et al. [
95] present the Multi-perspective Fusion Search Network (MFSN) based on Transformer that separates and optimally fuses textual and speech-related emotional content.
Other Transformer-based methods focus on cross-modal encoding. Ryumina et al. [
18] present a gated attention fusion mechanism that aggregates emotion features from RoBERTa and Wav2Vec2.0 across multiple corpora, achieving strong cross-corpus generalization. Delbrouck et al. [
86] design a modular Transformer encoder with co-attention and a glimpse layer to jointly model emotional signals from text and audio. Phukan et al. [
97] conduct a comprehensive evaluation of paralinguistic pre-trained models, demonstrating that TRILLsson, a model trained on non-semantic speech tasks, consistently outperforms others such as Whisper [
71], XLS-R [
104], and Waveform-based Language Model (WavLM) [
105].
Extending attention-based fusion to video or context modeling, Huang et al. [
101] introduce MM-NodeFormer, a Transformer model that dynamically reweights textual, acoustic, and visual inputs according to their emotional salience, treating audio and video as auxiliary to text. Luo et al. [
76] present a cross-modal attention consistency framework that regularizes representations from different modalities through consistency loss. Zhao et al. [
77] propose R1-Omni (
https://huggingface.co/StarJiaxing/R1-Omni-0.5B, accessed on 28 September 2025), a Reinforcement Learning (RL)-enhanced model that optimizes fusion weights for better robustness to distribution shifts and improves interpretability by modeling modality-specific contributions. Similarly, Kim and Cho [
74] target audio–text interaction using a Cross-modal Transformer and Focus-Attention, built atop RoBERTa and a Convolutional Neural Network (CNN)-Bidirectional LSTM network (BiLSTM) acoustic stack, representing one class of attention-based fusion methods. Leem et al. [
75] propose a dual-branch attention fusion model that dynamically weighs acoustic and linguistic inputs based on their relevance.
2.1.2. Graph-Based Methods
Beyond Transformers, several works explore graph-based architectures to explicitly model the relational structure of multimodal features [
106]. These methods [
107,
108] often aim to model both intra- and inter-modality interactions using graph convolutions or message-passing mechanisms.
Li et al. [
78] introduce a decoupled multimodal distillation framework that separates shared and modality-specific features, applying graph-based knowledge transfer to improve fusion. Focusing on conversational BER, Shi et al. [
79] introduce a two-stage fusion strategy: the first stage extracts emotional cues from text using Knowledge-based Word Relation Tagging (KWRT), while the second enriches speech input through a prosody-aware acoustic module. FrameERC [
80] applies framelet-based multi-frequency graph convolution to model emotional dynamics in conversations, and incorporates a dual-reminder fusion mechanism to balance high- and low-frequency emotional cues, reducing overreliance on textual signals.
2.1.3. Mamba-Based Methods
Recent research [
24,
109,
110,
111] directions explore alternatives to attention-based architectures using linear SSMs, such as Mamba, which offer improved scalability and efficiency for long-context emotion modeling. Mamba and its successors replace explicit attention with input-dependent recurrence, enabling linear-time inference and reduced memory usage [
112,
113].
Building on this idea, Coupled Mamba [
81] introduces inter-modal state coupling, allowing synchronized state transitions across modalities while preserving their individual dynamics. Wang et al. [
82] further apply model distillation to transfer knowledge from large Transformer models to compact Mamba-based Recurrent Neural Networks (RNNs) with minimal performance loss. OmniMamba [
83] (
https://huggingface.co/hustvl/OmniMamba, accessed on 28 September 2025) extends this paradigm into a unified multimodal framework, incorporating decoupled vocabularies, task-specific Low-Rank Adaptation (LoRA) [
114], and a two-stage training scheme. Finally, Gate DeltaNet [
84] enhances Mamba2 [
113] by combining gated recurrence with biologically inspired delta-rule updates and memory erasure, achieving improved long-sequence modeling.
2.2. State-of-the-Art Methods for Data Augmentation
Data augmentation strategies have become essential to improve the generalizability and robustness of ER models [
115]. These techniques help alleviate issues such as data imbalance [
116], noise [
117], and incomplete modalities [
102] by artificially generating various training examples. Recent studies propose several innovative augmentation strategies [
118,
119,
120] that can be directly applied to both unimodal and multimodal models.
For example, Wang et al. [
45] introduce IMDer, a diffusion-based model that recovers missing modalities by mapping noise back into their original distribution space, using available modalities as semantic conditions. Similarly, Wang et al. [
89] propose a method based on Mixup [
121] to enhance emotional representations in speech data by learning robust features from unbalanced corpora. Several studies have explored more feature-centric methods. Malik et al. [
122] present a modality dropout strategy, randomly omitting modalities during training to improve model robustness against missing or noisy inputs, while Gong et al. [
91] introduce LanSER, which derives weak emotion labels from speech transcripts via LLM analysis. Zhang and Li [
44] explore modality-aware data augmentation by perturbing features within and between modalities to create synthetic samples.
Recently, diffusion-based augmentation has emerged as a powerful tool for generating realistic and diverse emotional samples, particularly under low-resource conditions. Methods such as Diff-II [
123] and DiffEmotionVC [
124] have achieved significant improvements in affective recognition benchmarks through the synthesis of expressive multimodal signals.
In addition, several studies address the challenge of data imbalance in ER. Ritter-Gutierrez et al. [
98] propose a Dataset Distillation method using Generative Adversarial Network (GAN), which generates a smaller synthetic corpus while preserving key discriminative features. Similarly, Leem et al. [
99] develop the Keep or Delete framework, which selectively removes noisy frames during training, improving robustness in real-world scenarios. Other methods focus on handling domain shifts in cross-corpus tasks. For example, Hierarchical Distribution Adaptation [
28] mitigates domain shifts by employing nested Transformers to align emotion features across source and target corpora.
Continuing the exploration of data augmentation strategies, Stanley et al. [
125] investigate data augmentation using pitch-shift, time-stretch, and vocal tract length perturbation to generate more varied emotion representations. The Emotion Open Deep Network (EODN) [
100] addresses data imbalance and unseen emotional categories by employing an open-set recognition strategy combined with the Pleasure-Arousal-Dominance model of emotion (PAD)-based emotional space mapping. Furthermore, Purohit and Magimai-Doss [
126] propose a framework that enhances data efficiency and generalization through curriculum learning and adversarial data augmentation, focusing on pitch and formant perturbations.
On the other hand, Wu et al. [
127] introduce a novel method to cross-lingual ER that models inter-rater ambiguity in emotion labels by applying a new loss function called Belief Mismatch Loss. This method improves noise robustness by representing emotion labels as probability distributions to enable more flexible label matching. Similarly, Khan et al. [
128] present a semi-automated framework for ER on user feedback from low-rated software applications, using ChatGPT (
https://chatgpt.com, accessed on 28 September 2025) as both an annotator and mediator to improve emotion annotation quality based on a grounded emotion taxonomy. The SemEval-2025 [
129] shared task on multilingual ER further highlights this trend. It covers over 30 languages and continues to expand its corpora, underscoring the growing need for balanced data and robust annotation strategies across diverse linguistic contexts.
In ER, Franceschini et al. [
130] propose a novel unsupervised method based on modality-pairwise contrastive learning that eliminates the need for explicit emotion labels or conventional data augmentation. Their method outperforms supervised and unsupervised baselines by fusing text, audio, and visual features only at inference time. Padi et al. [
87] combine BERT-based text encoders with a ResNet-based speech encoder pre-trained on speaker recognition, incorporating transfer learning and spectrogram augmentation. Additionally, Dutta et al. [
131] introduce EmoComicNet, a multi-task model that uses emotion-oriented feature augmentation and a fusion strategy adaptable to missing modalities, including text and images.
Finally, Lee [
90] propose a MER framework that employs a sampling strategy, prioritizing emotionally ambiguous samples and using contrastive learning with modality-aware hard sample mining, thus improving discriminative feature learning across audio and text inputs. Sun et al. [
102] present an ER model that incorporates pseudo-label-based contrastive learning combined with modality dropout to enhance representation learning under noisy and incomplete modality conditions.
Together, these diverse methods demonstrate the growing importance of data augmentation in improving the performance and generalization of ER models, particularly in situations with incomplete or noisy data.
3. Proposed Method
The proposed Cross-Lingual Bimodal Emotion Recognition method pipeline is illustrated in
Figure 1. The method takes an audio recording as input. Whisper [
71] is used for Automatic Speech Recognition (ASR) [
132] to extract orthographic transcriptions from the audio. Then both audio and text signals are passed through their respective frozen encoders for low-level feature extraction. A distinctive characteristic of the method is its cross-lingual unimodal encoders, which are trained to recognize emotions in two different languages and cultures: English and Russian. These encoders produce mid-level features that serve as inputs to the bimodal fusion model. The fusion model shares similar characteristics with the unimodal encoders and is also trained on augmented data to improve its generalization capability. To achieve this, utterances shorter than a target duration are supplemented with other utterances bearing the same label using the SDS method. The target labels are then softened using an LLM-based method (LS-LLM), enabling the model to detect complex compound emotions. The method outputs probability distributions across seven emotions: anger, disgust, fear, joy/happy, neutral, sad, and surprise/enthusiasm.
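To make the processing order concrete, the following minimal Python sketch traces a single utterance through the stages described above. The wrapper names (transcribe, audio_encoder, text_encoder, fusion_model) are hypothetical placeholders for the components in Figure 1, not the authors' released API.

```python
# A minimal sketch of the inference pipeline in Figure 1; the wrapper names
# (transcribe, audio_encoder, text_encoder, fusion_model) are hypothetical
# placeholders for the components described in the text, not a released API.
import torch

EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sad", "surprise"]

def recognize_emotion(waveform: torch.Tensor,
                      transcribe,        # Whisper-based ASR callable
                      audio_encoder,     # frozen Wav2Vec2.0 + trained Mamba head
                      text_encoder,      # frozen Jina-v3 + trained Mamba head
                      fusion_model):     # Transformer-based cross-modal fusion
    text = transcribe(waveform)                      # orthographic transcription
    audio_feats = audio_encoder(waveform)            # mid-level acoustic features
    text_feats = text_encoder(text)                  # mid-level textual features
    logits = fusion_model(audio_feats, text_feats)   # bimodal fusion
    probs = torch.softmax(logits, dim=-1)            # distribution over 7 emotions
    return dict(zip(EMOTIONS, probs.squeeze(0).tolist()))
```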
This section describes the full methodological process of the proposed cross-lingual BER method. To ensure replicability and justify design decisions, ablation studies comparing alternative components are included. The final configuration used in the main experiments is explicitly indicated in each subsection. The detailed description of the method components is provided in the following
Section 3.1,
Section 3.2,
Section 3.3,
Section 3.4 and
Section 3.5.
3.1. Audio-Based Emotion Recognition
In the research corpora, audio signals vary in sampling rate and duration [
133]. To ensure consistency, all audio signals are resampled at a uniform rate of 16 kHz. To account for varying durations, zero-padding is applied to the embeddings during batch formation, aligning them to the length of the longest sequence in the batch. Acoustic embeddings are extracted using two pre-trained models and their performance is compared across four different temporal models.
The first embedding extractor is based on the Wav2Vec2.0 emotion model pre-trained on the MSP-Podcast corpus [
32,
134] (
https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim, accessed on 28 September 2025). The second extractor uses the ExHuBERT emotion model pre-trained on 37 benchmark corpora [
70] (
https://huggingface.co/amiriparian/ExHuBERT, accessed on 28 September 2025). Both models generate embeddings of size
T × 1024, where
T is dynamically determined by the input audio signal duration. The embeddings feed into four temporal models: Long Short-Term Memory network (LSTM) [
135], Extended LSTM (xLSTM) [
136], Transformer [
137] and Mamba [
112]. The LSTM model captures temporal dependencies through recurrent connections, effectively modeling short- and mid-term contextual dynamics. However, its sequential nature limits parallelization and long-range dependency modeling. The xLSTM model modifies this structure by introducing multiplicative gating within its recurrent dynamics, which enhances gradient stability and allows for more efficient propagation of information over time. In contrast, Transformer leverages self-attention to capture global dependencies between all time steps simultaneously, enabling efficient context integration across long utterances, albeit with higher computational cost proportional to the sequence length squared. Mamba introduces a recent state-space model that replaces explicit recurrence and attention with linear dynamical updates, providing efficient long-context modeling with linear time complexity and reduced parameter count. Each model is trained using architecture-specific and training hyperparameters, which are optimized during training on validation subsets.
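As an illustration of the low-level acoustic feature extraction step, the sketch below loads the Wav2Vec2.0 checkpoint referenced above from Hugging Face, resamples the waveform to 16 kHz, and returns the T × 1024 hidden-state matrix consumed by the temporal models. Loading the emotion-tuned checkpoint with the generic Wav2Vec2Model class is an assumption made for brevity; it keeps the backbone and ignores the checkpoint's task-specific head.

```python
# A sketch of T x 1024 acoustic embedding extraction with a frozen Wav2Vec2.0 encoder.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CKPT = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
encoder = Wav2Vec2Model.from_pretrained(CKPT).eval()       # frozen low-level encoder

def audio_embeddings(path: str) -> torch.Tensor:
    wave, sr = torchaudio.load(path)
    wave = torchaudio.functional.resample(wave.mean(dim=0), sr, 16000)  # mono, 16 kHz
    inputs = extractor(wave.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state       # (1, T, 1024)
    return hidden.squeeze(0)                               # T depends on duration
```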
3.2. Text-Based Emotion Recognition
In the research corpora, the utterance length varies between samples. To ensure uniform input dimensions during vectorization, zero-padding is applied to the embeddings up to the maximum token count. This threshold is set to 95 (sub)tokens for the Jina-v3 [
66] and XLM-RoBERTa [
64] models and to 329 tokens for the CANINE-c model [
138]. The latter operates on character-level tokens rather than word-level tokens. For each token, XLM-RoBERTa and CANINE-c extract 768 features, while Jina-v3 produces 1024 features. The extracted embeddings feed into four temporal models: LSTM [
135], xLSTM [
136], Transformer [
137], and Mamba [
112]. Each model is trained with architecture-specific hyperparameters, which are optimized during training on validation subsets. Among the evaluated encoders, Mamba is selected for both audio and text modalities in the final model due to its superior performance on long sequences and computational efficiency, as detailed in
Section 4.3 and
Section 4.4.
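For the text branch, the sketch below illustrates token-level feature extraction with zero-padding to the fixed length of 95 (sub)tokens. XLM-RoBERTa is used here because its loading path is standard; the Jina-v3 encoder is used analogously but produces 1024-dimensional token embeddings and may require trust_remote_code when loaded.

```python
# A sketch of fixed-length (95-token) text embedding extraction with a frozen encoder.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base").eval()   # frozen

def text_embeddings(utterance: str, max_tokens: int = 95) -> torch.Tensor:
    inputs = tokenizer(utterance, padding="max_length", truncation=True,
                       max_length=max_tokens, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state             # (1, 95, 768)
    return hidden.squeeze(0)
```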
3.3. Data Sampling Method
In this research, data sampling refers to the generation of new training samples to improve model performance. Two different data sampling strategies are investigated, each designed to address specific challenges inherent in the corpus.
The first strategy focuses on audio data sampling. New audio samples are generated during training by setting a segment length of
T seconds. If the input sample is shorter than the target duration, it is concatenated with another randomly selected sample that shares the same label. The corresponding text transcription is then extracted using the Whisper ASR model [
71]. This strategy, referred to as SDS, ensures uniform input lengths while preserving label consistency. In addition, it mitigates potential biases that may arise from uneven sample durations between classes.
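A minimal sketch of SDS is shown below. The names samples_by_label (a mapping from emotion label to a pool of mono 16 kHz waveforms) and transcribe (any Whisper-based ASR callable) are hypothetical helpers, and trimming the stacked signal to exactly the target length is one possible design choice rather than something prescribed above.

```python
# A sketch of Stacked Data Sampling (SDS): short utterances are extended by
# concatenating randomly chosen utterances that share the same emotion label.
import random
import torch

def stacked_data_sample(waveform: torch.Tensor, label: str,
                        samples_by_label: dict, transcribe,
                        target_seconds: float, sr: int = 16000):
    target_len = int(target_seconds * sr)
    parts, total = [waveform], waveform.shape[-1]
    while total < target_len:                       # extend with same-label utterances
        extra = random.choice(samples_by_label[label])
        parts.append(extra)
        total += extra.shape[-1]
    stacked = torch.cat(parts, dim=-1)[..., :target_len]   # optional trim to T seconds
    text = transcribe(stacked)                      # re-transcribe the stacked audio
    return stacked, text, label
```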
The second method employs TTS generation. In this pipeline, an emotion label is randomly sampled, after which a textual utterance is generated using ChatGPT-5 (
https://openai.com/index/introducing-gpt-5/, accessed on 28 September 2025), following a predefined template-based prompt. The prompt is designed to produce short, emotionally rich utterances that match the expressive, conversational style of the MELD corpus [
34] as illustrated in
Figure 2. The generated text, paired with its non-verbal tag (e.g., sighs, laughs, and others), is subsequently fed into the DIA-TTS model (
https://huggingface.co/nari-labs/Dia-1.6B, accessed on 28 September 2025) to synthesize the corresponding audio signal as illustrated in
Figure 3. This strategy, denoted as TUG, facilitates controlled data augmentation by targeting emotional and linguistic imbalances in the corpus. Notably, it allows for the precise adjustment of affective speech parameters without relying on additional human annotation.
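The sketch below outlines the TUG pipeline at the same level of abstraction. The callables generate_text and synthesize_speech are hypothetical wrappers around the LLM and the DIA-TTS model, the prompt merely paraphrases the template idea from Figure 2, and the tag mapping shows only an illustrative subset of the supported non-verbal tags.

```python
# A sketch of Template-based Utterance Generation (TUG): sample an emotion, generate
# a matching utterance with an LLM, add a non-verbal tag, and synthesize speech.
import random

EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sad", "surprise"]
TAGS = {"joy": "(laughs)", "sad": "(sighs)", "anger": "(groans)"}   # illustrative subset

def generate_tug_sample(generate_text, synthesize_speech):
    emotion = random.choice(EMOTIONS)
    prompt = (f"Write one short, conversational utterance in the style of the MELD "
              f"corpus that clearly expresses the emotion '{emotion}'.")
    utterance = generate_text(prompt)                        # LLM-generated text
    tagged = f"{TAGS.get(emotion, '')} {utterance}".strip()  # inline non-verbal tag
    audio = synthesize_speech(tagged)                        # expressive TTS output
    return audio, utterance, emotion
```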
Both sampling strategies are applied exclusively to the training subset to ensure that the evaluation measures reflect true generalization performance. To achieve a balance between data diversity and the model’s performance, the probability of generated samples (the proportion of augmented to original data) is iteratively tuned during experimentation.
3.4. Label Smoothing Method
Previous studies [
17,
139,
140,
141,
142] show that label smoothing improves generalization and stabilizes training by reducing model overconfidence. The technique smooths the one-hot target vector by mixing it with a uniform distribution.
Instead of assigning uniform probabilities to non-target classes, as in conventional label smoothing, zero-shot [
143] LLMs are used to generate semantically informed soft labels. During training, each ground-truth label is smoothed using the emotion probabilities predicted by the LLM from the input text. This method ensures that the target emotion receives the highest probability, while the remaining probability mass is realistically distributed among other relevant emotions.
This method, which is termed LS-LLM, produces context-aware label distributions that capture the subtleties and overlaps inherent in emotional expression, providing a more informative supervision signal. By aligning training labels with plausible emotional interpretations, LS-LLM encourages better generalization and robustness in ambiguous or multi-emotion cases. To generate these semantically informed soft labels, various prompts used to guide LLMs were tested, as illustrated in
Figure 4. It was found that prompt formulations enforcing strict probability formatting and distributional constraints produced the most reliable outputs. Simpler variants often resulted in invalid or less informative distributions.
Eight lightweight LLMs are compared: Falcon3-3B-Instruct (
https://huggingface.co/tiiuae/Falcon3-3B-Instruct, accessed on 28 September 2025), Phi-4-mini-instruct (
https://huggingface.co/microsoft/Phi-4-mini-instruct, accessed on 28 September 2025), Gemma-3-4b-it (
https://huggingface.co/google/gemma-3-4b-it, accessed on 28 September 2025), Qwen3-4B (
https://huggingface.co/Qwen/Qwen3-4B, accessed on 28 September 2025), Falcon-H1-3B-Instruct (
https://huggingface.co/tiiuae/Falcon-H1-3B-Instruct, accessed on 28 September 2025), Gemma-3n-E2B-it (
https://huggingface.co/google/gemma-3n-E2B-it, accessed on 28 September 2025), SmolLM3-3B (
https://huggingface.co/HuggingFaceTB/SmolLM3-3B, accessed on 28 September 2025) and Qwen3-4B-Instruct-2507 (
https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507, accessed on 28 September 2025). The characteristics of the selected LLMs are presented in
Table 3. These models were selected because they are small enough to run efficiently (with a range of parameters from approximately 3B to 5.5B) and yet powerful enough to understand emotionally rich text. All the models are openly available, perform well on language tasks, and differ in their designs and the number of layers. This allows us to compare their performance. To improve model performance, the proportion (probability) of LLM-generated soft labels to one-hot vectors was iteratively optimized during experimentation.
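As a sketch of how LS-LLM targets can be constructed, the code below follows one plausible reading of the description above: with probability p, a training sample's one-hot target is replaced by the LLM-predicted soft label. The callable llm_soft_label, which would wrap one of the LLMs from Table 3 together with the prompt from Figure 4, is a hypothetical placeholder.

```python
# A sketch of LS-LLM target construction under the stated assumptions.
import random
import torch

def ls_llm_target(one_hot: torch.Tensor, text: str, llm_soft_label,
                  p: float = 0.2) -> torch.Tensor:
    if random.random() >= p:
        return one_hot                       # keep the original hard label
    soft = llm_soft_label(text)              # 7-dim distribution predicted by the LLM
    return soft / soft.sum()                 # renormalize against malformed outputs
```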
3.5. Bimodal Emotion Recognition
Audio–text fusion is performed at both prediction-level and feature-level representations. For prediction-level fusion, a weighted combination of modality-specific probabilities is computed:

$\hat{p} = w_a p_a + w_t p_t,$

where $\hat{p}$ is the final predicted probability distribution; $p_a$ and $p_t$ are the audio and text modality predictions, respectively; $w_a$ and $w_t$ are learnable modality weights (constrained to sum to one, $w_a + w_t = 1$, via softmax normalization). This model is referred to as Weighted Probability Fusion (WPF).
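A minimal PyTorch sketch of WPF under these definitions is given below: learnable logits are softmax-normalized into modality weights that combine the unimodal probability vectors. The same module extends to a third input when the bimodal prediction is also fused, as described later in this section.

```python
# A sketch of Weighted Probability Fusion (WPF) with softmax-normalized weights.
import torch
import torch.nn as nn

class WeightedProbabilityFusion(nn.Module):
    def __init__(self, num_sources: int = 2):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_sources))  # one logit per prediction source

    def forward(self, *probs: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.logits, dim=0)                 # weights sum to 1
        return sum(w_i * p_i for w_i, p_i in zip(w, probs))   # fused emotion distribution
```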
For feature-level fusion, several model architectures are proposed that dynamically account for the importance of features across modalities. The architectures of the proposed BER models are shown in
Figure 5. The components of the proposed architectures are shown in
Figure 6. A key characteristic of these architectures is their ability to calculate statistical information from contextual representations at two levels: (1) the feature level, which identifies the most informative features for the model’s final prediction; and (2) the temporal level, which determines the most relevant timesteps (for audio) or word sequences (for text) in the final prediction.
The intra- and inter-modal relationships between the two modalities are implemented through several distinct models:
Cross-modal Transformer (BiFormer). The input data consist of audio and text features with different dimensionalities. A feature resampling block is applied to normalize the input to a unified dimensionality. This block consists of (1) a fully connected layer that transforms the feature dimension, with the number of neurons determined during training; and (2) a temporal alignment layer that standardizes the audio feature length to match the text features (fixed at 95 timesteps) using linear interpolation. The modified input is passed through two Transformer blocks configured for cross-modal attention. In the first block, audio features act as queries ($Q_a$) and the text features serve as keys/values ($K_t$ and $V_t$). The second block reverses this configuration, with text features as queries ($Q_t$) and audio features as keys/values ($K_a$ and $V_a$). The number of Transformer layers in each block is optimized during training. Each block produces an attention-weighted value matrix (in which the query-key interactions dynamically consider the importance of features), enhancing discriminative features and reducing less informative ones. These transformed matrices are then combined with their original unimodal representations ($X_a$ or $X_t$). The resulting features are subjected to statistical pooling that averages both the temporal (along the sequence length) and feature dimensions. The concatenated vector is fed into a classifier to predict probabilities over seven emotion categories. A minimal sketch of this cross-modal attention pattern is provided after this list.
BiGraphFormer. In this model, a second level of modality fusion is introduced using two pairs of graph-based layers. Each pair takes as input either the temporally averaged audio and text features or the feature-wise averaged audio and text representations. The output of each Graph Fusion Layer is a set of weighted representations, where the weights result from the mutual interactions between all node pairs in a fully connected graph. The outputs (weighted temporal and feature-based representations) are further averaged, passed through fully connected layers, and concatenated. The resulting concatenated vector is fed into a classifier to predict probabilities over seven emotion categories.
BiGatedGraphFormer. This model adds a third level of modality fusion using two pairs of gated layers. Each gated layer implements an attention mechanism with learnable forget/amplify gates, automatically suppressing or enhancing input features based on their relevance to the prediction task. Unlike the previous attention mechanisms, these gated layers generate new temporal and feature-based representations that account for the varying importance of features from both modalities. These representations are passed through fully connected layers and concatenated with the averaged graph-based outputs. The resulting concatenated vector is fed into a classifier to predict probabilities over seven emotion categories.
BiMamba. A simple feature concatenation method using the Mamba-based architecture is also implemented. This architecture efficiently models long-range dependencies in the joint feature space while maintaining linear computational complexity.
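As referenced in the BiFormer description, the sketch below illustrates the cross-modal attention pattern: temporal alignment of audio features to 95 steps, two attention blocks with swapped query/key-value roles, residual combination with the unimodal streams, and statistical pooling over both the temporal and feature dimensions. The layer sizes are illustrative defaults, not the tuned hyperparameters from Table 9, and single attention layers stand in for the multi-layer Transformer blocks.

```python
# A sketch of the BiFormer-style cross-modal attention and pooling pattern.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiFormerSketch(nn.Module):
    def __init__(self, dim_audio=1024, dim_text=1024, d_model=256, heads=8,
                 steps=95, num_classes=7):
        super().__init__()
        self.steps = steps
        self.proj_a = nn.Linear(dim_audio, d_model)     # feature resampling (audio)
        self.proj_t = nn.Linear(dim_text, d_model)      # feature resampling (text)
        self.attn_a2t = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.attn_t2a = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.classifier = nn.Linear(2 * (d_model + steps), num_classes)

    def forward(self, audio, text):
        # audio: (B, T_a, dim_audio) with variable T_a; text: (B, 95, dim_text)
        audio = F.interpolate(audio.transpose(1, 2), size=self.steps,
                              mode="linear", align_corners=False).transpose(1, 2)
        a, t = self.proj_a(audio), self.proj_t(text)
        a2t, _ = self.attn_a2t(query=a, key=t, value=t)   # audio queries, text keys/values
        t2a, _ = self.attn_t2a(query=t, key=a, value=a)   # text queries, audio keys/values
        a, t = a + a2t, t + t2a                           # combine with unimodal streams
        stats = torch.cat([a.mean(dim=1), a.mean(dim=2),  # temporal and feature pooling
                           t.mean(dim=1), t.mean(dim=2)], dim=-1)
        return self.classifier(stats)                     # logits over 7 emotions
```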
The BiFormer model is adopted as the primary fusion mechanism in the final model, as it achieves the best trade-off between performance and complexity across both research corpora. BiGraphFormer, BiGatedGraphFormer, and BiMamba are included for completeness and are not part of the main pipeline. Additionally, for the best-performing models, WPF is applied, which takes into account: (1) the importance of unimodal probability predictions (audio-only and text-only); (2) the multimodal probability predictions derived from cross-modal interactions. Finally, different data sampling and augmentation strategies (SDS, TUG, and LS-LLM) are applied to the final best-performing model. For all models, the architecture-specific and training hyperparameters are optimized during training on validation subsets.
4. Experimental Research
The experiments are conducted in several stages. First, the unimodal encoders are fine-tuned. Unlike many SOTA solutions, the proposed method trains encoders on two corpora covering the Russian and English languages, which improves the ability to generalize to new corpora. The SOTA pre-trained encoders are kept frozen, while new temporal models are trained on top of their representations.
Then multi-corpus unimodal features are extracted from the temporal models and used as input for the bimodal models. This multi-stage training process allows us to create a BER model that demonstrates strong generalizability on new data, including different languages and cultural emotional expressions.
In this article, the best model configurations and training hyperparameters are selected based on the performance in the development subsets, and the performance measures obtained in the test subsets are reported.
4.1. Research Corpora
For this research, two corpora are selected: MELD [
34] and RESD [
33]. Both corpora pertain to the domain of dyadic interaction; however, they differ in language (English and Russian). This choice supports the development of multilingual models capable of analyzing cross-cultural and linguistic communication patterns.
The MELD corpus contains audiovisual data extracted from the TV series Friends, annotated for seven emotions: joy, sad, fear, anger, surprise, disgust, and neutral. The audio recordings include considerable background noise (e.g., applause, laughter, etc.). In contrast, the RESD corpus exclusively consists of laboratory-recorded acoustic data annotated for seven emotions as well: happy, sad, fear, anger, enthusiasm, disgust, and neutral. To enable joint model training on both corpora, the following emotions are aligned: joy and happy; enthusiasm and surprise. The distribution of emotion utterances across the subsets is presented in
Figure 7. Both corpora provide fixed subsets; however, since the RESD corpus contains only training and test subsets, its test subset is additionally used for validation. The MELD corpus exhibits a pronounced class imbalance, whereas in RESD the distribution of utterances across classes is generally balanced.
Table 4 provides a summary of key statistics for the MELD and RESD corpora. MELD contains significantly more utterances (9989 in Train, 1109 in Development, and 2610 in Test) compared to RESD (1116 in Train and 280 in Test). In terms of utterance characteristics, MELD tends to have shorter utterances (averaging 14–15 (sub)tokens and 3.1–3.3 s in duration). In contrast, RESD utterances are longer on average, with 24 (sub)tokens and durations of 6.0–6.1 s. Notably, MELD’s Test subset includes a 305-s outlier. Despite these differences, both corpora exhibit similar utterance length distributions across subsets, although MELD demonstrates greater variability in maximum durations.
Data augmentation using the TUG method was restricted to the MELD corpus due to the DIA-TTS model’s incompatibility with Russian. Approximately 23,000 synthetic utterances and corresponding audio clips were generated. The number of samples was determined according to the class distribution imbalance in the original corpus, with minority classes augmented until their size equaled that of the dominant (neutral) class. To enhance emotional expressiveness, supported inline tags (e.g., sighs, laughs, and others) were inserted into the generated utterances to guide the DIA-TTS model in producing prosodically varied and affectively rich speech outputs.
4.2. Performance Measures and Loss Function
To evaluate the effectiveness of the proposed methods, various performance measures are used: Unweighted Average Recall (UAR), Weighted Average Recall (WAR), Macro F1-score (MF1), and Weighted F1-score (WF1). These measures are calculated as follows:

$\mathrm{UAR} = \frac{1}{C}\sum_{i=1}^{C}\frac{TP_i}{TP_i + FN_i}, \qquad \mathrm{WAR} = \sum_{i=1}^{C}\frac{N_i}{N}\cdot\frac{TP_i}{TP_i + FN_i},$

$\mathrm{MF1} = \frac{1}{C}\sum_{i=1}^{C}\frac{2\,TP_i}{2\,TP_i + FP_i + FN_i}, \qquad \mathrm{WF1} = \sum_{i=1}^{C}\frac{N_i}{N}\cdot\frac{2\,TP_i}{2\,TP_i + FP_i + FN_i},$

where $C$ is the number of classes; $TP_i$ is the number of true positives for class $i$; $FN_i$ is the number of false negatives for class $i$; $N_i$ is the number of utterances in class $i$; $N$ is the total number of utterances; $FP_i$ is the number of false positives for class $i$. Both weighted ($\mathrm{WAR}$ and $\mathrm{WF1}$) and unweighted ($\mathrm{UAR}$ and $\mathrm{MF1}$) measures are used to evaluate the performance of the proposed methods, as the class utterances are unbalanced. Although weighted measures primarily reflect model performance in majority classes, unweighted measures provide a more accurate assessment of overall performance, capturing both majority and minority class predictions.
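Under these definitions, UAR and WAR coincide with macro- and weighted-averaged recall, and MF1 and WF1 with macro- and weighted-averaged F1, so a minimal evaluation sketch using scikit-learn is:

```python
# A sketch of the four evaluation measures via their scikit-learn equivalents.
from sklearn.metrics import recall_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "UAR": recall_score(y_true, y_pred, average="macro"),
        "WAR": recall_score(y_true, y_pred, average="weighted"),
        "MF1": f1_score(y_true, y_pred, average="macro"),
        "WF1": f1_score(y_true, y_pred, average="weighted"),
    }

# Example: evaluate([0, 1, 2, 2], [0, 2, 2, 2])
```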
The weighted cross-entropy is employed as the loss function, defined as

$\mathcal{L} = -\sum_{i=1}^{C} w_i\, y_i \log(\hat{y}_i),$

where $C$ is the number of classes; $w_i$ is the weight assigned to class $i$ to mitigate class imbalance; $y_i$ is the ground-truth label of emotion class $i$; $\hat{y}_i$ corresponds to the predicted probability of class $i$. The weighting helps to address class imbalance by assigning higher weights to underrepresented classes, thereby preventing the model from being biased towards the majority-represented classes.
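A minimal PyTorch sketch of this loss is shown below; the inverse-frequency weighting used here is one common choice for setting $w_i$ and is an assumption, not necessarily the exact scheme used by the authors.

```python
# A sketch of the weighted cross-entropy loss; class weights are derived from
# per-class utterance counts (inverse-frequency weighting, an assumed scheme).
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits: torch.Tensor, labels: torch.Tensor,
                           class_counts: torch.Tensor) -> torch.Tensor:
    weights = class_counts.sum() / (len(class_counts) * class_counts.float())
    return F.cross_entropy(logits, labels, weight=weights)

# Example with 7 emotion classes and a batch of 4 utterances:
logits = torch.randn(4, 7)
labels = torch.tensor([0, 4, 4, 6])
counts = torch.tensor([1100, 270, 270, 1740, 4710, 680, 1210])  # illustrative counts
loss = weighted_cross_entropy(logits, labels, counts)
```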
4.3. Audio-Based Emotion Recognition
Experiments are conducted with several temporal models such as LSTM [
135], xLSTM [
136], Transformer [
137], and Mamba [
112]. The optimal configurations are determined through a grid search on different combinations of model hyperparameters and training settings. The best-performing configurations of audio-based models are presented in
Table 5. Among the audio-based models, Transformer has the best computational efficiency, with the fewest parameters (0.53 M) and the fastest training time (50 s/epoch), despite its larger disk space requirement (9.6 MB). LSTM has the longest training time (91 s/epoch) and the largest number of parameters (1.58 M). xLSTM reduces both the size (1.18 M) and training time (73 s/epoch) compared to LSTM, making it more efficient. Mamba offers a smaller model (0.79 M parameters, 3 MB) and faster training (51 s/epoch) than LSTM, while closely matching Transformer’s efficiency and using significantly less storage space.
The audio-based experimental results are summarized in
Table 6. Across all tested temporal models, Wav2Vec2.0-based embeddings consistently outperform ExHuBERT embeddings. This suggests that increasing training data volume (250 h vs. 150 h) may be more beneficial for building generalizable embeddings than expanding domain coverage (1 vs. 37 domains).
The comparative analysis of temporal models shows no substantial performance differences on MELD, while the Mamba architecture significantly outperforms the LSTM, xLSTM and Transformer models on RESD. This disparity likely arises from Mamba’s enhanced capacity to model long-range temporal dependencies.
Figure 8 shows the confusion matrices obtained by the audio-based model. Despite joint training, the performance of the model correlates with the distribution of training samples. In MELD, classes such as Disgust, Fear, and Sad, which are the least frequent, achieve the lowest recall values (0.0%, 2.0%, and 19.7%, respectively). Although RESD is a more balanced corpus, class imbalance still affects the recall for the following emotions: Disgust (48.6%), Neutral (52.6%), Sad (53.1%) and Surprise (52.5%). Regarding cross-lingual differences in acoustic expression, emotions in English tend to be confused with Anger and Happy more often, while in Russian, they are often misclassified as Fear. This may be due to language-specific prosodic patterns, such as the use of a higher pitch or more distinct intonation to convey emotional intensity.
For bimodal fusion, the feature matrices extracted from Mamba’s final hidden states are employed in all subsequent experiments.
Table 6 presents the performance results of SOTA audio-based methods evaluated on the research corpora. To date, only one method (
https://github.com/aniemore/Aniemore, accessed on 28 September 2025) has been reported for RESD. This method demonstrates performance over 80% in all measures, while the same model shows a sharp decline for MELD, falling below 20%. This discrepancy indicates a clear issue of model overfitting to a single corpus.
Comparable or superior performance is achieved relative to other methods on the MELD corpus, while maintaining about 60% accuracy on the RESD corpus. Thus, the proposed audio-based method currently shows the best ER performance compared to SOTA audio-based methods and also demonstrates multilingual generalization capabilities.
4.4. Text-Based Emotion Recognition
The best configurations of text-based models, determined using the same methodology as for audio-based models, are presented in
Table 7. The comparison of the computational cost of text-based models shows clear trade-offs. The Transformer has the smallest size (1.58 M parameters, 6.03 MB), followed by Mamba (3.28 M, 12.50 MB), xLSTM (8.15 M, 31.08 MB), and LSTM (12.61 M, 48.09 MB). Surprisingly, Mamba, despite its small size, is the slowest to train (392 s/epoch), significantly slower than even the recurrent models and the Transformer (45 s/epoch). Although Mamba is often considered computationally efficient due to its linear-time architecture, the obtained results do not support this claim under the current experimental setup.
The text-based experimental results are shown in
Table 8. On average, Jina-v3-based and XLM-RoBERTa-based embeddings demonstrate comparable performance, while CANINE-c underperforms across all measures regardless of the employed temporal model. These results indicate that character-level tokenization is less effective than word-level tokenization for this task. Furthermore, despite being fine-tuned on only 30 languages, Jina-v3 outperforms XLM-RoBERTa, which was pre-trained on 100 languages. This suggests that training on a smaller but more carefully selected language set enables the model to capture more robust linguistic patterns and achieve better generalization capabilities.
The experimental results of the temporal models reveal distinct performance patterns across corpora: LSTM demonstrates superior effectiveness on MELD, while Mamba outperforms others (including xLSTM and Transformer) on RESD. This divergence arises from fundamental differences in token distribution between the corpora. In MELD, most words are known tokens, resulting in shorter informative sequences (with zero-padding beyond actual content). LSTM handles such compact sequences effectively. In RESD, texts contain more unknown words that split into (sub)tokens, creating longer sequences. Mamba performs better here due to (1) efficient long-sequence processing; (2) better handling of (sub)token patterns.
Figure 9 shows the confusion matrices obtained by the text-based model. The impact of class imbalance persists. In MELD, classes such as Disgust, Fear, and Sad achieve the lowest recall values (11.8%, 14.0%, and 23.1%, respectively), although these are higher than for the audio-based model. In RESD, Sad and Surprise have lower recognition, with recall values of 25.0% and 17.5%, respectively. Cross-class confusions differ across languages. In MELD, the model confuses Fear with Sad and Anger with Happy. In RESD, however, it confuses Fear with Surprise, as well as Anger with Fear. These patterns suggest that language- and culture-specific ways of expressing emotions affect model performance beyond mere data imbalance, particularly through lexical and phrasing conventions that link distinct emotions to similar textual cues in each language.
For bimodal fusion, the 95 × 512 feature matrices extracted from Mamba’s final hidden states are employed in all subsequent experiments.
Table 8 presents the performance results of SOTA text-based methods evaluated on the MELD corpus. To the best of current knowledge, there are no published results for the text modality on the RESD corpus. The results on the MELD demonstrate that the proposed model’s performance remains below SOTA models. This gap stems from multilingual processing challenges, as Russian and English emotional patterns diverge significantly. Despite this limitation, the proposed model offers significant practical advantages: (1) it maintains operational capacity across languages where monolingual models typically fail; (2) it eliminates language-specific tuning requirements.
4.5. Bimodal-Based Emotion Recognition
The best model configurations for BER are presented in
Table 9. The analysis of optimal hyperparameters reveals key trends across bimodal architectures. Transformer-based models (BiFormer, BiGraphFormer, and BiGatedGraphFormer) have five layers, a hidden size of 256, and eight attention heads. In contrast, BiMamba reaches peak performance with fewer layers (four) but a larger hidden size (512). The training hyperparameters remain consistent across the models. In terms of computational cost, BiFormer is the most effective (9 s/epoch, 4.40 M parameters). Graph-based variants, on the other hand, require more computational resources (10–11 s/epoch, up to 6.37 M parameters). BiMamba is the slowest to train (16 s/epoch) and uses more parameters (4.89 M) than BiFormer.
The bimodal-based experimental results are shown in
Table 10. These results demonstrate that bimodal models consistently outperform unimodal baselines. Among them, BiFormer achieves the best (lowest) rank of 5.88, outperforming the text- and audio-based baselines (14.50 and 14.88, respectively), which corresponds to improvements of 8.62 and 9.00 in rank reduction. The WPF strategy, which aggregates predictions from both unimodal (audio/text) and bimodal models, exhibits variable effectiveness across different corpora. For example, on the MELD corpus, WPF significantly enhances performance (BiMamba-WPF reaches an UAR of 41.47%), suggesting that shorter utterances may benefit from weighted consensus between modalities. However, on the RESD corpus, this method performs worse. The BiGraphFormer model performs well on the RESD corpus, achieving a UAR of 68.51%. It uses a fully connected graph structure to process all audio–text feature pairs. This structure explicitly models cross-modal interactions and is likely more efficient for longer sequences. Notably, gated fusion variants (BiGatedGraphFormer) do not improve performance on either corpus. This indicates that Transformer- and graph-based attention mechanisms may be sufficient for cross-modal integration in the current experimental setting. Similarly, Mamba-based architectures (BiMamba) show limited effectiveness on research corpora. Their simplified feature concatenation lacks explicit cross-modal attention mechanisms that are required for robust multimodal fusion.
BiFormer’s superior performance over graph-based models on the MELD corpus suggests that standard attention mechanisms are sufficient to capture modality interactions in brief utterances. In contrast, RESD’s longer sequences benefit from explicit relational modeling through graph structures. Interestingly, WPF’s effectiveness appears to be inversely related to utterance length and complexity: it is effective in MELD’s concise dialogues but struggles with the more elaborate interactions in RESD, where joint bimodal learning becomes crucial. These observations are consistent with the corpus statistics: MELD contains shorter, more uniform utterances (on average 14 (sub)tokens and 3.1 s), which favor probabilistic fusion, whereas RESD consists of longer sequences (on average 24 (sub)tokens and 6.0 s), requiring deeper architectural integration of modalities. The results demonstrate that while WPF provides a straightforward yet effective enhancement for simpler corpora, more sophisticated bimodal architectures are essential to learn the complex multimodal interactions.
Table 10 also shows the high-performance results for various augmentation techniques in combination with the BiFormer model. More detailed results for all augmentation techniques with varying probabilities are presented in
Figure 10. The BiFormer model was selected for evaluation due to its consistently superior high rank performance across corpora. The results demonstrate the distinct effects of various sampling and label smoothing strategies on the MELD and RESD corpora. The SDS method yields the most notable improvement with a sampling probability of
p = 0.4, achieving a rank of 5.00, surpassing the baseline value of 5.88. The performance gain is particularly evident in the MELD corpus, where SDS increases the UAR from 38.54% to 39.57%. Given MELD’s elevated background noise and the high variability and brevity of its utterances, these results suggest that structured audio concatenation combined with Whisper-based transcription helps mitigate acoustic inconsistencies. As expected, no improvement was observed on the RESD corpus, confirming that SDS is most effective in scenarios with substantial variation in utterance duration.
In contrast, the TUG method failed to yield performance gains on either corpus. This finding highlights the limitations of current publicly available TTS models, such as DIA-TTS, in generating emotionally expressive and authentic-sounding speech. Moreover, the artificial nature of the TUG-generated utterance–audio pairs may introduce artifacts that negatively affect downstream performance, particularly in tasks that require fine-grained affective interpretation.
Using LLMs for label smoothing (LS-LLM) has varying effects on performance, with the most notable improvements observed on the RESD corpus. The Qwen3-4B variant achieves the best rank of 4.00 at p = 0.2, increasing the UAR for the RESD corpus from 67.89% to 68.05%. This suggests that soft labels derived from LLMs enable more accurate detection of compound emotions in clean laboratory-recorded corpora such as RESD, where emotional distinctions tend to be more subtle and less obscured by noise. While other LLMs, such as Phi-4-mini-instruct and Falcon-H1-3B-Instruct, also offer benefits compared to the baseline model, especially for MELD, they underperform relative to Qwen3-4B. This difference is due to the stronger instruction-following ability of Qwen3-4B and its more nuanced representation of emotional context, which leads to higher-quality soft labels, indicating that the quality of label smoothing depends on the LLM’s semantic understanding capabilities. Moreover, the optimal smoothing probability (p = 0.2) suggests that excessive label interpolation may reduce the effectiveness of ground-truth annotations.
The combination of the two best augmentation methods (SDS with
p = 0.4 and LS-LLM with
p = 0.2) yielded the best rank of 2.63, improving on the baseline rank of 5.88. Specifically, the UAR gain for MELD was 0.69% (39.23% vs. 38.54%), while for RESD it reached 1.01% (68.90% vs. 67.89%). These results demonstrate significant corpus-specific effects on data augmentation method performance. The integration of the audio and text modalities improves overall performance in both corpora, as shown by the higher diagonal values and improved recall in
Figure 11 compared to
Figure 8 and
Figure 9. For instance, Sad recall increases from 19.7% (by audio) and 23.1% (by text) to 35.1% (by bimodal) in MELD and from 25% (by text) and 53.1% (by audio) to 62.5% (by bimodal) in RESD. However, there are still language-specific patterns of confusion: in MELD, Fear is confused with Angry, and in RESD, Anger and Fear continue to be misclassified. This suggests that a bimodal fusion helps but does not completely eliminate emotion confusion caused by differences in how emotions are expressed in language.
Therefore, when corpus utterances exhibit duration imbalance, simply extending shorter utterances by concatenating additional samples of the same emotion proves beneficial. This method offers two advantages: (1) it normalizes utterance lengths; and (2) it generates novel training utterances. The combined effect substantially improves model performance. For corpora annotated under strict single-label protocols, LLM-based label smoothing helps uncover complex compound emotions. This method enhances prediction accuracy by reducing the model’s overconfidence and incorporating distributional patterns identified by LLM. Optimal performance on both corpora is achieved by combining multiple augmentation strategies. While individual methods may demonstrate corpus-specific effectiveness, their combination compensates for respective limitations and balances overall performance.
5. Discussion
In this research, a Cross-Lingual Bimodal Emotion Recognition method is proposed that uses two modality-specific encoders. Each encoder utilizes pre-trained, frozen Transformer-based models to extract low-level features (e.g., Wav2Vec2.0 for audio and Jina-v3 for text). These features are then processed by temporal encoders based on the Mamba architecture, which are trained from scratch to capture mid-level representations. These modality-specific embeddings are then integrated through a Transformer-based cross-modal fusion model, also trained from scratch, to produce the final emotion probabilities.
To enhance the model’s generalization capability across corpora with differing linguistic and acoustic characteristics, two data augmentation strategies are introduced: (1) SDS, which concatenates short utterances with others sharing the same emotion to create more context-rich emotional expressions of fixed duration; and (2) LS-LLM, which smooths the emotion distribution for ambiguous or multi-emotion utterances that are likely to have been mislabeled due to limitations in the annotation process.
The sequential application and comprehensive evaluation of all proposed solutions enabled the model to achieve strong performance on both corpora under joint training. For MELD, the final bimodal model with data augmentation improves UAR by 11.23% in absolute terms over the weakest unimodal configuration (28.00% audio-only vs. 36.17% text-only vs. 38.54% bimodal vs. 39.23% bimodal with augmentation). Similarly, for RESD, the absolute UAR gain reaches 31.70% across the same configurations (60.79% vs. 37.20% vs. 67.89% vs. 68.90%).
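For clarity, the short snippet below reproduces how these absolute gains follow from the reported UAR values; the numbers are copied from the figures above and the computation is otherwise trivial.

```python
# Absolute UAR gain: final bimodal-with-augmentation model vs. the weakest
# unimodal configuration (values taken from the results above, in %).
meld = {"audio": 28.00, "text": 36.17, "bimodal": 38.54, "bimodal+aug": 39.23}
resd = {"audio": 60.79, "text": 37.20, "bimodal": 67.89, "bimodal+aug": 68.90}

for name, scores in [("MELD", meld), ("RESD", resd)]:
    gain = scores["bimodal+aug"] - min(scores["audio"], scores["text"])
    print(f"{name}: absolute UAR gain = {gain:.2f}%")   # 11.23% and 31.70%
```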
However, the proposed method has several limitations. The effectiveness of label smoothing via LLMs depends on careful prompt engineering, which is difficult to generalize. Although lightweight Mamba encoders improve efficiency, they may be less accurate than larger models. The use of LLMs increases the computational cost and training latency. Finally, extending the model to include the visual modality and temporal uncertainty could further improve its robustness in real-world settings.
An analysis of the computational cost of the models is also provided. The total size of the models used in the proposed method is approximately 19.10 GB, corresponding to 257.53 M parameters. Only about 3% of these parameters are trainable; the remainder belong to the frozen Whisper, Jina-v3, and Wav2Vec2.0 models. All models were trained on an NVIDIA A100 with 80 GB of Graphics Processing Unit (GPU) memory [147]. Training a single epoch of the audio encoder takes approximately 120 s with a batch size of 16, the text encoder around 80 s with a batch size of 32, and the BiFormer model approximately 70 s with a batch size of 64. The proposed method supports real-time inference: processing 1 s of audio–text data takes 0.11 s on a GPU (NVIDIA GeForce RTX 4080) and 0.70 s on a CPU (Intel Core i7), i.e., a real-time factor below 1, meaning audio is processed faster than its duration. The inference time scales linearly with the length of the input audio.
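A simple way to estimate such a real-time factor is sketched below; `infer_fn` is a hypothetical callable wrapping the full audio–text pipeline, and the wall-clock timing shown here is a basic measurement rather than the exact benchmarking protocol used above.

```python
import time

def real_time_factor(infer_fn, audio_chunk, duration_s):
    """RTF = processing time / audio duration; values below 1.0 indicate
    faster-than-real-time processing."""
    start = time.perf_counter()
    infer_fn(audio_chunk)                      # run the full BER pipeline
    elapsed = time.perf_counter() - start
    return elapsed / duration_s

# Example usage with a dummy pipeline standing in for the real models.
rtf = real_time_factor(lambda x: sum(x), list(range(16_000)), duration_s=1.0)
```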
5.1. RQ1: Can a Cross-Lingual BER Model Improve Generalization Across Languages?
Figure 12 shows the word-attention distribution of the proposed BER model on semantically equivalent Russian and English sentence pairs, revealing different lexical salience patterns. In the top pair, despite language differences, the model predicts Neutral (75.5% for Russian, 99.4% for English), with attention concentrated on lexically aligned terms, suggesting partial cross-lingual alignment. In contrast, the bottom pair exposes a critical limitation: while both sentences express Sad, the model attends to literal, morphologically salient components rather than recognizing idiomatic or culturally entrenched expressions of grief. Notably, the Russian phrase, while semantically rich, lacks the conventional emotional framing found in the model’s training data. Its literal interpretation of “bleeding heart” evokes physical trauma rather than psychological sorrow, leading to a misattribution of emotion. This highlights the challenges of cross-lingual emotion modeling, which stem not only from structural incongruences but also from the lack of culturally specific idioms in research corpora.
These results indicate that while the cross-lingual BER model achieves partial alignment for structurally similar expressions, it struggles with native-speaker-specific emotional idioms that do not appear in research corpora. Thus, cross-lingual generalization is possible but remains fragile without explicit modeling of linguistic and cultural diversity.
It should be noted that the model’s performance was evaluated on English and Russian, which are both Indo-European languages with relatively similar syntactic and semantic structures. Extending this model to other languages (such as Arabic, Korean, and others) would require emotion-annotated data and dedicated experiments. This is an important area for future research. Nevertheless, the proposed model is based on the multilingual Jina-v3 encoder [66], which was pre-trained on over 30 languages and aligns semantically similar words across languages in a shared embedding space. Thus, the proposed cross-lingual BER model can, in principle, be applied to any supported language. However, without emotion-labeled data, it cannot reliably interpret culture-specific emotional expressions.
5.2. RQ2: Can a Hierarchical Cross-Modal Attention Mechanism Effectively Handle Variable Utterance Lengths?
This research question is addressed by the results in Table 10 and the corresponding analysis in Section 4.5. On MELD, which consists of short utterances (approximately 3.6 s long, with around 8 words each), the BiFormer model with a single cross-modal attention mechanism outperforms the other models. Simpler fusion techniques, such as WPF, also work well in this setting. This implies that for short and fragmented expressions, global attention is sufficient to detect cross-modal emotions, and adding a hierarchical cross-modal attention mechanism based on graph or gated fusion only introduces unnecessary complexity. In contrast, on the RESD corpus, which contains longer and more expressive utterances (average duration 6.0 s, approximately 24 words), the hierarchical graph-based BiGraphFormer model outperforms the simpler BiFormer. This suggests that modeling pairwise audio–text interactions in two stages is beneficial when emotional signals develop over time and require deeper integration. Mamba-based fusion models perform poorly on both corpora, despite Mamba’s reputation for efficiently handling sequences of arbitrary length, which suggests that linear state-space modeling alone is not sufficient for accurate BER.
In summary, it is important to adapt the fusion strategy depending on the length of the utterances. For short utterances, a lightweight cross-modal fusion model is more effective, while for longer utterances, a hierarchical cross-modal fusion model leads to improved performance. This finding supports the need for adaptive or hierarchical model architectures in real-world ER systems, which can adapt to different dialogue styles and contexts.
5.3. RQ3: Do Corpus-Specific Augmentation Strategies Improve the Generalization of Cross-Lingual BER Models?
This question is addressed by the results in Table 10 and Figure 10, which evaluate augmentation strategies using the BiFormer model as a consistent backbone. On the noisy and fragmented MELD, SDS, which extends short utterances by stacking same-emotion samples, improves performance by stabilizing input length and reducing acoustic variability. However, it brings no gain on the cleaner and more structured RESD corpus. Conversely, LS-LLM is particularly effective for RESD, as longer utterances often contain more nuanced or complex emotional cues that a single-label annotation scheme may not capture. By generating a context-aware distribution of soft labels, LS-LLM helps recover this affective complexity. On MELD, where utterances are short and noisy, the signal is less reliable, limiting the benefit of label smoothing. The TUG method, which creates emotional speech using TTS, fails to improve performance on either corpus, confirming that current SOTA TTS models cannot reliably generate emotional speech. Combining SDS and LS-LLM yields complementary improvements on both corpora, showing that corpus-aware augmentation, especially when multiple strategies are used together, enhances cross-lingual generalization.
In summary, the choice of augmentation strategies should be based on the utterance characteristics. SDS can be used for variable-length or noisy utterances, while LS-LLM can be used for long or nuanced emotional utterances. Combining these strategies can improve overall performance, as it allows for adaptation and context-awareness in data augmentation, which benefits generalization.
5.4. Comparison with State-of-the-Art Methods
The SOTA methods are predominantly developed under a single-corpus setup, which limits their generalization to unseen data. In contrast, this research focuses on a multi-corpus setup to enhance cross-domain robustness. Nevertheless, a comparative evaluation of the proposed method against the SOTA methods in a single-corpus setup was also conducted. The results of this comparison are presented in Table 11 and demonstrate that the proposed method outperforms the SOTA on both target measures, highlighting the advantages of the Transformer-based bimodal fusion model and the proposed data augmentation strategies.
5.5. Application of the Proposed Method in Intelligent Systems
To demonstrate the practical applicability of the proposed Cross-Lingual Bimodal Emotion Recognition system, an interactive prototype was implemented using the Gradio library (https://www.gradio.app, accessed on 28 September 2025). Figure 13 illustrates the user interface of the developed system. The system accepts audio input recorded via a microphone. For transcript extraction, the Whisper model is employed, providing accurate and robust transcription across multiple languages. The recognized emotions are visualized as bar plots showing the predictions of the unimodal models (text-only and audio-only) as well as of the bimodal model.
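A minimal sketch of such a Gradio interface is shown below; `predict_emotions` is a hypothetical stub that returns placeholder probabilities in place of the actual Whisper transcription and unimodal/bimodal models, and component arguments may need adjustment for the installed Gradio version.

```python
import random
import gradio as gr

EMOTIONS = ["Neutral", "Happy", "Sad", "Angry", "Fear", "Disgust", "Surprise"]

def predict_emotions(audio_path):
    """Stub in place of the real pipeline (Whisper transcription followed by
    the audio-only, text-only, and bimodal models). Returns three
    label-to-probability dictionaries with random placeholder values."""
    def fake_scores():
        raw = [random.random() for _ in EMOTIONS]
        total = sum(raw)
        return {e: v / total for e, v in zip(EMOTIONS, raw)}
    return fake_scores(), fake_scores(), fake_scores()

demo = gr.Interface(
    fn=predict_emotions,
    inputs=gr.Audio(type="filepath", label="Record or upload speech"),
    outputs=[gr.Label(label="Audio-only"),
             gr.Label(label="Text-only"),
             gr.Label(label="Bimodal")],
    title="Cross-Lingual Bimodal Emotion Recognition (prototype sketch)",
)

if __name__ == "__main__":
    demo.launch()
```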
This system serves as a proof of concept and can be easily integrated into a range of real-world expert applications, including customer support services, voice assistants, emergency dispatch systems, and other interactive platforms requiring emotional awareness [149,150,151]. Such a deployment highlights the potential of BER technologies to enhance HCI in cross-lingual emotion modeling.
6. Conclusions
This research addresses the task of Cross-Lingual Bimodal Emotion Recognition by proposing a novel method that integrates multilingual Mamba-based encoders with multilingual Transformer-based cross-modal fusion. The proposed method achieves robust generalization across two linguistically and structurally diverse corpora: MELD (English) and RESD (Russian). To enhance the generalizability of the method, two data augmentation strategies are introduced: SDS and LS-LLM. The proposed improvements lead to an increase in performance over the SOTA method: for MELD, the improvement is 0.58% (68.31% vs. 67.73%), and for RESD, it reaches 4.32% (85.25% vs. 80.93%). These results confirm the effectiveness of the proposed method in both cross-lingual and monolingual setups.
Unimodal feature extractors are systematically evaluated using various temporal encoders (LSTM, xLSTM, Transformer, and Mamba). In audio-based experiments, the Wav2Vec2.0 embeddings outperform ExHuBERT, and Mamba achieves superior performance on longer utterances. For text, word-level models such as Jina-v3 and XLM-RoBERTa outperform the character-level CANINE-c model. Mamba again shows better efficiency compared to other temporal models.
To perform bimodal fusion, models are built based on three attention mechanisms: Transformer-based, graph-based, and gated-based attention. Among these, the Transformer-based model (BiFormer) shows the best average performance, with relative improvements of 16.23% and 7.14% over the text and audio baselines, respectively. The choice of attention mechanism proves corpus-dependent: Transformer-based attention is sufficient for short utterances (MELD), whereas longer and more complex sequences (RESD) benefit from combined Transformer- and graph-based cross-modal integration.
The proposed data augmentation strategies also demonstrate clear corpus-specific benefits. SDS improves performance on the noisy, short, and variable-duration utterances of MELD, while LS-LLM enhances recognition on the cleaner and more emotionally nuanced data of RESD. Notably, combining both strategies yields complementary gains, underscoring their synergistic potential. A third augmentation strategy, TUG, which synthesizes emotional audio–text pairs to mitigate class imbalance, is also evaluated; however, it proves ineffective due to the current inability of TTS models to reliably generate emotionally expressive speech.
In future work, the framework will be expanded to incorporate visual modalities, enabling comprehensive affective computing in dynamic, real-world environments.
Author Contributions
Conceptualization, E.R. and D.R.; methodology, E.R., A.A., T.A., D.K. and D.R.; software, D.R.; validation, E.R., A.A., T.A. and D.K.; formal analysis, E.R. and A.A.; investigation, E.R., A.A., T.A. and D.K.; resources, E.R., A.A., T.A., D.K. and D.R.; data curation, D.R.; writing—original draft preparation, E.R. and A.A.; writing—review and editing, E.R. and D.R.; visualization, E.R. and A.A.; supervision, D.R.; project administration, D.R.; funding acquisition, D.R. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by the Basic Research Program at the National Research University Higher School of Economics (HSE University) and made use of computational resources provided by the university’s HPC facilities.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| AI | Artificial Intelligence |
| HCI | Human-Computer Interaction |
| MER | Multimodal Emotion Recognition |
| ER | Emotion Recognition |
| BER | Bimodal Emotion Recognition |
| SER | Speech Emotion Recognition |
| UAR | Unweighted Average Recall |
| WAR | Weighted Average Recall |
| MF1 | Macro F1-score |
| WF1 | Weighted F1-score |
| SDS | Stacked Data Sampling |
| GDS | Generated Data Sampling |
| TUG | Template-based Utterance Generation |
| GAN | Generative Adversarial Network |
| CER | Continuous Emotion Recognition |
| SOTA | State-of-the-Art |
| WPF | Weighted Probability Fusion |
| LLM | Large Language Model |
| LS-LLM | Label Smoothing Generation based on LLM |
| TTS | Text-to-Speech |
| RESD | Russian Emotional Speech Dialogs |
| MELD | Multimodal EmotionLines Dataset |
| ASR | Automatic Speech Recognition |
| SAWA | Sliding Adaptive Window Attention |
| GCPU | Gated Context Perception Unit |
| DFC | Dynamic Frame Convolution |
| MCFN | Modality-Collaborative Fusion Network |
| SSM | State-Space Model |
| BERT | Bidirectional Encoder Representations from Transformers |
| RoBERTa | Robustly optimized BERT approach |
| ALBERT | A Lite BERT |
| HuBERT | Hidden-unit BERT |
| ExHuBERT | Extended HuBERT |
| CNN | Convolutional Neural Network |
| DCNN | Deep CNN |
| LSTM | Long Short-Term Memory network |
| BiLSTM | Bidirectional LSTM network |
| WavLM | Waveform-based Language Model |
| HRI | Human-Robot Interaction |
| MTL | Multi-Task Learning |
| DANN | Domain-Adversarial Neural Network |
| GPT | Generative Pre-trained Transformer |
| CM-ARR | Cross-Modal Alignment, Reconstruction, and Refinement |
| MFSN | Multi-perspective Fusion Search Network |
| MFCC | Mel-Frequency Cepstral Coefficients |
| RL | Reinforcement Learning |
| GloVe | Global Vectors for Word Representation |
| COVAREP | COllaborative Voice Analysis REPository |
| BiGRU | Bidirectional Gated Recurrent Unit |
| CLIP | Contrastive Language-Image Pretraining |
| KWRT | Knowledge-based Word Relation Tagging |
| RNN | Recurrent Neural Network |
| LoRA | Low-Rank Adaptation |
| E-ODN | Emotion Open Deep Network |
| PAD | Pleasure-Arousal-Dominance model of emotion |
| xLSTM | Extended LSTM |
| TelME | Teacher-leading Multimodal fusion network for Emotion recognition in conversation |
| GPU | Graphics Processing Unit |
| TCN | Temporal Convolutional Network |
References
- Liu, W.; Qiu, J.L.; Zheng, W.L.; Lu, B.L. Comparing Recognition Performance and Robustness of Multimodal Deep Learning Models for Multimodal Emotion Recognition. IEEE Trans. Cogn. Dev. Syst. 2021, 14, 715–729. [Google Scholar] [CrossRef]
- Geetha, A.; Mala, T.; Priyanka, D.; Uma, E. Multimodal Emotion Recognition with Deep Learning: Advancements, Challenges, and Future Directions. Inf. Fusion 2024, 105, 102218. [Google Scholar] [CrossRef]
- Wu, Y.; Mi, Q.; Gao, T. A Comprehensive Review of Multimodal Emotion Recognition: Techniques, Challenges, and Future Directions. Biomimetics 2025, 10, 418. [Google Scholar] [CrossRef] [PubMed]
- Ai, W.; Zhang, F.; Shou, Y.; Meng, T.; Chen, H.; Li, K. Revisiting Multimodal Emotion Recognition in Conversation from the Perspective of Graph Spectrum. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 11418–11426. [Google Scholar] [CrossRef]
- Dikbiyik, E.; Demir, O.; Dogan, B. BiMER: Design and Implementation of a Bimodal Emotion Recognition System Enhanced by Data Augmentation Techniques. IEEE Access 2025, 13, 64330–64352. [Google Scholar] [CrossRef]
- Khan, M.; Tran, P.N.; Pham, N.T.; El Saddik, A.; Othmani, A. MemoCMT: Multimodal Emotion Recognition using Cross-Modal Transformer-based Feature Fusion. Sci. Rep. 2025, 15, 5473. [Google Scholar] [CrossRef]
- Poria, S.; Cambria, E.; Bajpai, R.; Hussain, A. A Review of Affective Computing: From Unimodal Analysis to Multimodal Fusion. Inf. Fusion 2017, 37, 98–125. [Google Scholar] [CrossRef]
- Dixit, C.; Satapathy, S.M. Deep CNN with Late Fusion for Real Time Multimodal Emotion Recognition. Expert Syst. Appl. 2024, 240, 122579. [Google Scholar] [CrossRef]
- Huang, C.; Lin, Z.; Han, Z.; Huang, Q.; Jiang, F.; Huang, X. PAMoE-MSA: Polarity-Aware Mixture of Experts Network for Multimodal Sentiment Analysis. Int. J. Multimed. Inf. Retr. 2025, 14, 7. [Google Scholar] [CrossRef]
- Li, Q.; Gkoumas, D.; Sordoni, A.; Nie, J.Y.; Melucci, M. Quantum-Inspired Neural Network for Conversational Emotion Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 13270–13278. [Google Scholar] [CrossRef]
- Li, B.; Fei, H.; Liao, L.; Zhao, Y.; Teng, C.; Chua, T.S.; Ji, D.; Li, F. Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition. In Proceedings of the ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 5923–5934. [Google Scholar] [CrossRef]
- Tapaswi, M.; Zhu, Y.; Stiefelhagen, R.; Torralba, A.; Urtasun, R.; Fidler, S. MovieQA: Understanding Stories in Movies through Question-Answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4631–4640. [Google Scholar] [CrossRef]
- Lin, Z.; Madotto, A.; Shin, J.; Xu, P.; Fung, P. MoEL: Mixture of Empathetic Listeners. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 121–132. [Google Scholar] [CrossRef]
- Zhang, F.; Chai, L. A Review of Research on Micro-Expression Recognition Algorithms based on Deep Learning. Neural Comput. Appl. 2024, 36, 17787–17828. [Google Scholar] [CrossRef]
- Gao, Y.; Shi, H.; Chu, C.; Kawahara, T. Speech Emotion Recognition with Multi-level Acoustic and Semantic Information Extraction and Interaction. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 1060–1064. [Google Scholar] [CrossRef]
- Yadav, A.; Gupta, A. An Emotion-Driven, Transformer-based Network for Multimodal Fake News Detection. Int. J. Multimed. Inf. Retr. 2024, 13, 7. [Google Scholar] [CrossRef]
- Ryumina, E.; Markitantov, M.; Ryumin, D.; Kaya, H.; Karpov, A. Zero-Shot Audio-Visual Compound Expression Recognition Method based on Emotion Probability Fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–21 June 2024; pp. 4752–4760. [Google Scholar] [CrossRef]
- Ryumina, E.; Ryumin, D.; Axyonov, A.; Ivanko, D.; Karpov, A. Multi-Corpus Emotion Recognition Method based on Cross-Modal Gated Attention Fusion. Pattern Recognit. Lett. 2025, 190, 192–200. [Google Scholar] [CrossRef]
- Ekman, P.; Dalgleish, T.; Power, M. Handbook of Cognition and Emotion; Wiley Online Library: Hoboken, NJ, USA, 1999. [Google Scholar]
- Russell, J. A Circumplex Model of Affect. J. Personal. Soc. Psychol. 1980, 39, 1161–1178. [Google Scholar] [CrossRef]
- Zhang, Y.; Wang, M.; Wu, Y.; Tiwari, P.; Li, Q.; Wang, B.; Qin, J. DialogueLLM: Context and Emotion Knowledge-tuned Large Language Models for Emotion Recognition in Conversations. Neural Netw. 2025, 192, 107901. [Google Scholar] [CrossRef] [PubMed]
- Khalane, A.; Makwana, R.; Shaikh, T.; Ullah, A. Evaluating Significant Features in Context-Aware Multimodal Emotion Recognition with XAI Methods. Expert Syst. 2025, 42, e13403. [Google Scholar] [CrossRef]
- Zhu, X.; Wang, Y.; Cambria, E.; Rida, I.; López, J.S.; Cui, L.; Wang, R. RMER-DT: Robust Multimodal Emotion Recognition in Conversational Contexts based on Diffusion and Transformers. Inf. Fusion 2025, 123, 103268. [Google Scholar] [CrossRef]
- Liang, Y.; Wang, Z.; Liu, F.; Liu, M.; Yao, Y. Mamba-VA: A Mamba-based Approach for Continuous Emotion Recognition in Valence-Arousal Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 11–12 June 2025; pp. 5651–5656. [Google Scholar]
- Zhang, T.; Li, S.; Chen, B.; Yuan, H.; Chen, C.P. AIA-Net: Adaptive Interactive Attention Network for Text–Audio Emotion Recognition. IEEE Trans. Cybern. 2023, 53, 7659–7671. [Google Scholar] [CrossRef]
- Wu, L.; Bai, L.; Cheng, W.; Cheng, Z.; Chen, G. Adaptive Alignment and Time Aggregation Network for Speech-Visual Emotion Recognition. IEEE Signal Process. Lett. 2025, 32, 1181–1185. [Google Scholar] [CrossRef]
- Mote, P.; Sisman, B.; Busso, C. Unsupervised Domain Adaptation for Speech Emotion Recognition using K-Nearest Neighbors Voice Conversion. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 1045–1049. [Google Scholar] [CrossRef]
- Lu, C.; Zong, Y.; Zhao, Y.; Lian, H.; Qi, T.; Schuller, B.; Zheng, W. Hierarchical Distribution Adaptation for Unsupervised Cross-Corpus Speech Emotion Recognition. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 3739–3743. [Google Scholar] [CrossRef]
- Deng, Y.; Bian, J.; Wu, S.; Lai, J.; Xie, X. Multiplex Graph Aggregation and Feature Refinement for Unsupervised Incomplete Multimodal Emotion Recognition. Inf. Fusion 2025, 114, 102711. [Google Scholar] [CrossRef]
- Jin, X.; Wang, J.; Qin, H.; Ning, X.; Xin, T.; Lin, Y. Group-Wise Relation Mining for Weakly-Supervised Fine-Grained Multimodal Emotion Recognition. Neural Netw. 2025, 190, 107543. [Google Scholar] [CrossRef]
- Tran, M.; Yin, Y.; Soleymani, M. SetPeER: Set-based Personalized Emotion Recognition with Weak Supervision. IEEE Trans. Affect. Comput. 2025, 1–15. [Google Scholar] [CrossRef]
- Wagner, J.; Triantafyllopoulos, A.; Wierstorf, H.; Schmitt, M.; Burkhardt, F.; Eyben, F.; Schuller, B.W. Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10745–10759. [Google Scholar] [CrossRef]
- Lubenets, I.; Davidchuk, N.; Amentes, A. Aniemore: A Toolkit for Animation and Emotion Recognition. GitHub Repository. 2023. Available online: https://github.com/aniemore/Aniemore (accessed on 28 September 2025).
- Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019; pp. 527–536. [Google Scholar] [CrossRef]
- Yi, Y.; Zhou, Y.; Wang, T.; Zhou, J. Advances in Video Emotion Recognition: Challenges and Trends. Sensors 2025, 25, 3615. [Google Scholar] [CrossRef] [PubMed]
- Noroozi, F.; Corneanu, C.A.; Kamińska, D.; Sapiński, T.; Escalera, S.; Anbarjafari, G. Survey on Emotional Body Gesture Recognition. IEEE Trans. Affect. Comput. 2018, 12, 505–523. [Google Scholar] [CrossRef]
- Deng, J.; Ren, F. A Survey of Textual Emotion Recognition and its Challenges. IEEE Trans. Affect. Comput. 2021, 14, 49–67. [Google Scholar] [CrossRef]
- García-Hernández, R.A.; Luna-García, H.; Celaya-Padilla, J.M.; García-Hernández, A.; Reveles-Gómez, L.C.; Flores-Chaires, L.A.; Delgado-Contreras, J.R.; Rondon, D.; Villalba-Condori, K.O. A Systematic Literature Review of Modalities, Trends, and Limitations in Emotion Recognition, Affective Computing, and Sentiment Analysis. Appl. Sci. 2024, 14, 7165. [Google Scholar] [CrossRef]
- El Ayadi, M.; Kamel, M.S.; Karray, F. Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases. Pattern Recognit. 2011, 44, 572–587. [Google Scholar] [CrossRef]
- Khalil, R.A.; Jones, E.; Babar, M.I.; Jan, T.; Zafar, M.H.; Alhussain, T. Speech Emotion Recognition using Deep Learning Techniques: A Review. IEEE Access 2019, 7, 117327–117345. [Google Scholar] [CrossRef]
- Wani, T.M.; Gunawan, T.S.; Qadri, S.A.A.; Kartiwi, M.; Ambikairajah, E. A Comprehensive Review of Speech Emotion Recognition Systems. IEEE Access 2021, 9, 47795–47814. [Google Scholar] [CrossRef]
- Fu, C.; Qian, F.; Su, K.; Su, Y.; Wang, Z.; Shi, J.; Liu, Z.; Liu, C.; Ishi, C.T. HiMul-LGG: A Hierarchical Decision Fusion-based Local-Global Graph Neural Network for Multimodal Emotion Recognition in Conversation. Neural Netw. 2025, 181, 106764. [Google Scholar] [CrossRef]
- Dutta, S.; Ganapathy, S. LLM Supervised Pre-training for Multimodal Emotion Recognition in Conversations. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar] [CrossRef]
- Zhang, X.; Li, Y. A Dual Attention-based Modality-Collaborative Fusion Network for Emotion Recognition. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 1468–1472. [Google Scholar] [CrossRef]
- Wang, Y.; Li, Y.; Cui, Z. Incomplete Multimodality-Diffused Emotion Recognition. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023; pp. 17117–17128. [Google Scholar]
- Zhou, Y.; Liang, X.; Chen, H.; Zhao, Y.; Chen, X.; Yu, L. Triple Disentangled Representation Learning for Multimodal Affective Analysis. Inf. Fusion 2025, 114, 102663. [Google Scholar] [CrossRef]
- Botteghi, N.; Poel, M.; Brune, C. Unsupervised Representation Learning in Deep Reinforcement Learning: A Review. IEEE Control Syst. 2025, 45, 26–68. [Google Scholar] [CrossRef]
- Zhou, H.; Huang, S.; Xu, Y. UA-FER: Uncertainty-Aware Representation Learning for Facial Expression Recognition. Neurocomputing 2025, 621, 129261. [Google Scholar] [CrossRef]
- Xie, J.; Wang, J.; Wang, Q.; Yang, D.; Gu, J.; Tang, Y.; Varatnitski, Y.I. A Multimodal Fusion Emotion Recognition Method based on Multitask Learning and Attention Mechanism. Neurocomputing 2023, 556, 126649. [Google Scholar] [CrossRef]
- Kalateh, S.; Estrada-Jimenez, L.A.; Nikghadam-Hojjati, S.; Barata, J. A Systematic Review on Multimodal Emotion Recognition: Building Blocks, Current State, Applications, and Challenges. IEEE Access 2024, 12, 103976–104019. [Google Scholar] [CrossRef]
- Bujnowski, P.; Kuzma, B.; Paziewski, B.; Rutkowski, J.; Marhula, J.; Bordzicka, Z.; Andruszkiewicz, P. SAMSEMO: New Dataset for Multilingual and Multimodal Emotion Recognition. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 2925–2929. [Google Scholar] [CrossRef]
- Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower Provost, E.; Kim, S.; Chang, J.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
- Bagher Zadeh, A.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia, 15–20 July 2018; pp. 2236–2246. [Google Scholar] [CrossRef]
- Fan, W.; Xu, X.; Xing, X.; Chen, W.; Huang, D. LSSED: A Large-Scale Dataset and Benchmark for Speech Emotion Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 641–645. [Google Scholar] [CrossRef]
- Nojavanasghari, B.; Baltrušaitis, T.; Hughes, C.E.; Morency, L.P. EmoReact: A Multimodal Approach and Dataset for Recognizing Emotional Responses in Children. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI), Tokyo, Japan, 12–16 November 2016; pp. 137–144. [Google Scholar] [CrossRef]
- Sokolov, A.; Minkin, F.; Savushkin, N.; Karpov, N.; Kutuzov, O.; Kondratenko, V. Dusha Dataset. GitHub Repository. 2022. Available online: https://github.com/salute-developers/golos/tree/master/dusha#dusha-dataset (accessed on 28 September 2025).
- Ververidis, D.; Kotropoulos, C. Emotional Speech Recognition: Resources, Features, and Methods. Speech Commun. 2006, 48, 1162–1181. [Google Scholar] [CrossRef]
- Han, K.; Yu, D.; Tashev, I. Speech Emotion Recognition using Deep Neural Network and Extreme Learning Machine. In Proceedings of the Interspeech, Singapore, 14–18 September 2014; pp. 223–227. [Google Scholar] [CrossRef]
- Papakostas, M.; Spyrou, E.; Giannakopoulos, T.; Siantikos, G.; Sgouropoulos, D.; Mylonas, P.; Makedon, F. Deep Visual Attributes vs. Hand-Crafted Audio Features on Multidomain Speech Emotion Recognition. Computation 2017, 5, 26. [Google Scholar] [CrossRef]
- Ai, W.; Shou, Y.; Meng, T.; Li, K. DER-GCN: Dialog and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialog Emotion Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 4908–4921. [Google Scholar] [CrossRef]
- Yaddaden, Y. Efficient Dynamic Emotion Recognition from Facial Expressions using Statistical Spatio-Temporal Geometric Features. Big Data Cogn. Comput. 2025, 9, 213. [Google Scholar] [CrossRef]
- Labib, F.H.; Elagamy, M.; Saleh, S.N. EmoBERTa-X: Advanced Emotion Classifier with Multi-Head Attention and DES for Multilabel Emotion Classification. Big Data Cogn. Comput. 2025, 9, 48. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
- Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-Lingual Representation Learning at Scale. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Seattle, WA, USA, 5–10 July 2020; pp. 8440–8451. [Google Scholar] [CrossRef]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26 April–1 May 2020; pp. 1–17. [Google Scholar]
- Sturua, S.; Mohr, I.; Kalim Akram, M.; Günther, M.; Wang, B.; Krimmel, M.; Wang, F.; Mastrapas, G.; Koukounas, A.; Wang, N.; et al. Jina Embeddings V3: Multilingual Text Encoder with Low-Rank Adaptations. In Proceedings of the European Conference on Information Retrieval (ECIR), Lucca, Italy, 6–10 April 2025; pp. 123–129. [Google Scholar] [CrossRef]
- Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. Wav2vec: Unsupervised Pre-Training for Speech Recognition. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 3465–3469. [Google Scholar] [CrossRef]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual-only, 6–12 December 2020; pp. 12449–12460. [Google Scholar]
- Li, F.; Luo, J.; Xia, W. WavFusion: Towards Wav2vec 2.0 Multimodal Speech Emotion Recognition. In Proceedings of the MultiMedia Modeling, Nara, Japan, 8–10 January 2025; pp. 325–336. [Google Scholar] [CrossRef]
- Amiriparian, S.; Packań, F.; Gerczuk, M.; Schuller, B.W. ExHuBERT: Enhancing HuBERT through Block Extension and Fine-Tuning on 37 Emotion Datasets. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 2635–2639. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; Mcleavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
- Goron, E.; Asai, L.; Rut, E.; Dinov, M. Improving Domain Generalization in Speech Emotion Recognition with Whisper. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11631–11635. [Google Scholar] [CrossRef]
- Fukuda, R.; Kano, T.; Ando, A.; Ogawa, A. Speech Emotion Recognition based on Large-Scale Automatic Speech Recognizer. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar] [CrossRef]
- Kim, K.; Cho, N. Focus-Attention-Enhanced Crossmodal Transformer with Metric Learning for Multimodal Speech Emotion Recognition. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 2673–2677. [Google Scholar] [CrossRef]
- Leem, S.G.; Fulford, D.; Onnela, J.P.; Gard, D.; Busso, C. Computation and Memory Efficient Noise Adaptation of Wav2Vec2.0 for Noisy Speech Emotion Recognition with Skip Connection Adapters. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 1888–1892. [Google Scholar] [CrossRef]
- Luo, J.; Phan, H.; Reiss, J. Fine-tuned RoBERTa Model with a CNN-LSTM Network for Conversational Emotion Recognition. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 2413–2417. [Google Scholar] [CrossRef]
- Zhao, J.; Wei, X.; Bo, L. R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning. GitHub Repository. 2025. Available online: https://github.com/HumanMLLM/R1-Omni (accessed on 28 September 2025).
- Li, Y.; Wang, Y.; Cui, Z. Decoupled Multimodal Distilling for Emotion Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 6631–6640. [Google Scholar] [CrossRef]
- Shi, H.; Liang, Z.; Yu, J. Emotional Cues Extraction and Fusion for Multi-Modal Emotion Prediction and Recognition in Conversation. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 4074–4078. [Google Scholar] [CrossRef]
- Li, M.; Shi, J.; Bai, L.; Huang, C.; Jiang, Y.; Lu, K.; Wang, S.; Hancock, E.R. FrameERC: Framelet Transform based Multimodal Graph Neural Networks for Emotion Recognition in Conversation. Pattern Recognit. 2025, 161, 111340. [Google Scholar] [CrossRef]
- Li, W.; Zhou, H.; Yu, J.; Song, Z.; Yang, W. Coupled Mamba: Enhanced Multimodal Fusion with Coupled State Space Model. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 10–15 December 2024; pp. 59808–59832. [Google Scholar]
- Wang, J.; Paliotta, D.; May, A.; Rush, A.M.; Dao, T. The Mamba in the Llama: Distilling and Accelerating Hybrid Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 10–15 December 2024; pp. 62432–62457. [Google Scholar]
- Zou, J.; Liao, B.; Zhang, Q.; Liu, W.; Wang, X. OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models. GitHub Repository. 2025. Available online: https://github.com/hustvl/OmniMamba (accessed on 28 September 2025).
- Yang, S.; Kautz, J.; Hatamizadeh, A. Gated Delta Networks: Improving Mamba2 with Delta Rule. In Proceedings of the International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025; pp. 1–21. [Google Scholar]
- Yang, Z.; Hirschberg, J. Predicting Arousal and Valence from Waveforms and Spectrograms using Deep Neural Networks. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 3092–3096. [Google Scholar] [CrossRef]
- Delbrouck, J.B.; Tits, N.; Brousmiche, M.; Dupont, S. A Transformer-based Joint-Encoding for Emotion Recognition and Sentiment Analysis. In Proceedings of the Grand-Challenge and Workshop on Multimodal Language (Challenge-HML), Seattle, WA, USA, 10 July 2020; pp. 1–7. [Google Scholar] [CrossRef]
- Padi, S.; Sadjadi, S.O.; Manocha, D.; Sriram, R.D. Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based Models. In Proceedings of the Speaker and Language Recognition Workshop (Odyssey), Beijing, China, 28 June–1 July 2022; pp. 407–414. [Google Scholar] [CrossRef]
- Lei, S.; Dong, G.; Wang, X.; Wang, K.; Wang, S. InstructERC: Reforming Emotion Recognition in Conversation with Multi-Task Retrieval-Augmented Large Language Models. GitHub Repository. 2023. Available online: https://github.com/LIN-SHANG/InstructERC (accessed on 28 September 2025).
- Wang, S.; Gudnason, J.; Borth, D. Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 351–355. [Google Scholar] [CrossRef]
- Lee, S.w. Diverse Feature Mapping and Fusion via Multitask Learning for Multilingual Speech Emotion Recognition. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 3944–3948. [Google Scholar] [CrossRef]
- Gong, T.; Belanich, J.; Somandepalli, K.; Nagrani, A.; Eoff, B.; Jou, B. LanSER: Language-Model Supported Speech Emotion Recognition. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 2408–2412. [Google Scholar] [CrossRef]
- Mai, J.; Xing, X.; Chen, W.; Xu, X. DropFormer: A Dynamic Noise-Dropping Transformer for Speech Emotion Recognition. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 2645–2649. [Google Scholar] [CrossRef]
- Zhao, Z.; Gao, T.; Wang, H.; Schuller, B. MFDR: Multiple-Stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 3719–3723. [Google Scholar] [CrossRef]
- García, R.; Mahu, R.; Grágeda, N.; Luzanto, A.; Bohmer, N.; Busso, C.; Becerra Yoma, N. Speech Emotion Recognition with Deep Learning Beamforming on a Distant Human-Robot Interaction Scenario. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 3215–3219. [Google Scholar] [CrossRef]
- Sun, H.; Zhang, F.; Gao, Y.; Zhang, S.; Lian, Z.; Feng, J. MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 4703–4707. [Google Scholar] [CrossRef]
- Wu, H.; Chou, H.C.; Chang, K.W.; Goncalves, L.; Du, J.; Jang, J.S.R.; Lee, C.C.; Lee, H.Y. Open-Emotion: A Reproducible EMO-Superb for Speech Emotion Recognition Systems. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Macao, China, 2–5 December 2024; pp. 510–517. [Google Scholar] [CrossRef]
- Phukan, O.C.; Kashyap, G.S.; Buduru, A.B.; Sharma, R. Are Paralinguistic Representations All that is Needed for Speech Emotion Recognition? In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 4698–4702. [Google Scholar] [CrossRef]
- Ritter-Gutierrez, F.; Huang, K.P.; Wong, J.H.M.; Ng, D.; Lee, H.-y.; Chen, N.F.; Chng, E.S. Dataset-Distillation Generative Model for Speech Emotion Recognition. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 2640–2644. [Google Scholar] [CrossRef]
- Leem, S.G.; Fulford, D.; Onnela, J.P.; Gard, D.; Busso, C. Keep, Delete, or Substitute: Frame Selection Strategy for Noise-Robust Speech Emotion Recognition. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 3734–3738. [Google Scholar] [CrossRef]
- Ma, L.; Shen, L.; Li, R.; Zhang, H.; Qian, K.; Hu, B.; Schuller, B.W.; Yamamoto, Y. E-ODN: An Emotion Open Deep Network for Generalised and Adaptive Speech Emotion Recognition. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 4293–4297. [Google Scholar] [CrossRef]
- Huang, Z.; Mak, M.W.; Lee, K.A. MM-NodeFormer: Node Transformer Multimodal Fusion for Emotion Recognition in Conversation. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 4069–4073. [Google Scholar] [CrossRef]
- Sun, H.; Zhao, S.; Li, S.; Kong, X.; Wang, X.; Zhou, J.; Kong, A.; Chen, Y.; Zeng, W.; Qin, Y. Enhancing Emotion Recognition in Incomplete Data: A Novel Cross-Modal Alignment, Reconstruction, and Refinement Framework. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar] [CrossRef]
- Li, B.; Fei, H.; Li, F.; Chua, T.s.; Ji, D. Multimodal Emotion-Cause pair Extraction with Holistic Interaction and Label Constraint. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 21, 1–19. [Google Scholar] [CrossRef]
- Babu, A.; Wang, C.; Tjandra, A.; Lakhotia, K.; Xu, Q.; Goyal, N.; Singh, K.; von Platen, P.; Saraf, Y.; Pino, J.; et al. XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 2278–2282. [Google Scholar] [CrossRef]
- Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
- Zhang, Y.; Jia, A.; Wang, B.; Zhang, P.; Zhao, D.; Li, P.; Hou, Y.; Jin, X.; Song, D.; Qin, J. M3GAT: A Multi-Modal, Multi-Task Interactive Graph Attention Network for Conversational Sentiment Analysis and Emotion Recognition. ACM Trans. Inf. Syst. 2023, 42, 13. [Google Scholar] [CrossRef]
- Lian, H.; Lu, C.; Chang, H.; Zhao, Y.; Li, S.; Li, Y.; Zong, Y. AMGCN: An Adaptive Multi-Graph Convolutional Network for Speech Emotion Recognition. Speech Commun. 2025, 168, 103184. [Google Scholar] [CrossRef]
- Zhou, Y.; Zhou, Y.; Yang, Y.; Liu, Y.; Huang, J.; Zhao, S.; Su, R.; Wang, L.; Yan, N. Emotion-Guided Graph Attention Networks for Speech-based Depression Detection under Emotion-Inducting Tasks. In Proceedings of the Interspeech, Rotterdam, The Netherlands, 17–21 August 2025; pp. 469–473. [Google Scholar] [CrossRef]
- Liu, Y.; Chen, X.; Li, Y.; Wang, L.; Zhao, Z. Multi-Stage Confidence-Guided Diffusion and Emotional Bidirectional Mamba for Robust Speech Emotion Recognition. IEEE Signal Process. Lett. 2025, 32, 2184–2188. [Google Scholar] [CrossRef]
- Zhang, T.; Chen, Z.; Du, J. Multimodal Mamba Model for Emotion Recognition in Conversations. In Proceedings of the International Conference on Machine Learning and Computing, Nanjing, China, 6–9 February 2025; pp. 262–273. [Google Scholar] [CrossRef]
- Chen, G.; Liao, Y.; Zhang, D.; Yang, W.; Mai, Z.; Xu, C. Multimodal Emotion Recognition via the Fusion of Mamba and Liquid Neural Networks with Cross-Modal Alignment. Electronics 2025, 14, 3638. [Google Scholar] [CrossRef]
- Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In Proceedings of the Conference on Language Modeling (CoLM), University of Pennsylvania, Philadelphia, PA, USA, 7–9 October 2024; pp. 1–32. [Google Scholar]
- Dao, T.; Gu, A. Transformers are SSMs: Generalized Models and Efficient Algorithms through Structured State Space Duality. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024; pp. 10041–10071. [Google Scholar]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022; pp. 1–13. [Google Scholar]
- Mai, S.; Zeng, Y.; Hu, H. Learning by Comparing: Boosting Multimodal Affective Computing through Ordinal Learning. In Proceedings of the ACM on Web Conference, Sydney, NSW, Australia, 28 April–2 May 2025; pp. 2120–2134. [Google Scholar] [CrossRef]
- Ma, F.; Li, Y.; Ni, S.; Huang, S.L.; Zhang, L. Data Augmentation for Audio-Visual Emotion Recognition with an Efficient Multimodal Conditional GAN. Appl. Sci. 2022, 12, 527. [Google Scholar] [CrossRef]
- Tiwari, U.; Soni, M.; Chakraborty, R.; Panda, A.; Kopparapu, S.K. Multi-Conditioning and Data Augmentation using Generative Noise Model for Speech Emotion Recognition in Noisy Conditions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7194–7198. [Google Scholar] [CrossRef]
- Li, A.; Wu, M.; Ouyang, R.; Wang, Y.; Li, F.; Lv, Z. A Multimodal-Driven Fusion Data Augmentation Framework for Emotion Recognition. IEEE Trans. Artif. Intell. 2025, 6, 2083–2097. [Google Scholar] [CrossRef]
- Bouchelligua, W.; Al-Dayil, R.; Algaith, A. Effective Data Augmentation Techniques for Arabic Speech Emotion Recognition using Convolutional Neural Networks. Appl. Sci. 2025, 15, 2114. [Google Scholar] [CrossRef]
- Avci, U. A Comprehensive Analysis of Data Augmentation Methods for Speech Emotion Recognition. IEEE Access 2025, 13, 111647–111669. [Google Scholar] [CrossRef]
- Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. Mixup: Beyond Empirical Risk Minimization. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–13. [Google Scholar]
- Malik, M.I.; Latif, S.; Jurdak, R.; Schuller, B.W. A Preliminary Study on Augmenting Speech Emotion Recognition using a Diffusion Model. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 646–650. [Google Scholar] [CrossRef]
- Wang, Y.; Chen, L. Inversion Circle Interpolation: Diffusion-based Image Augmentation for Data-Scarce Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Denver, Colorado, 3–7 June 2025; pp. 25560–25569. [Google Scholar] [CrossRef]
- Su, X.; Yang, B.; Yi, X.; Cao, Y. DiffEmotionVC: A Dual-Granularity Disentangled Diffusion Framework for Any-to-Any Emotional Voice Conversion. In Proceedings of the Interspeech, Rotterdam, The Netherlands, 17–21 August 2025; pp. 4393–4397. [Google Scholar] [CrossRef]
- Stanley, E.; DeMattos, E.; Klementiev, A.; Ozimek, P.; Clarke, G.; Berger, M.; Palaz, D. Emotion Label Encoding using Word Embeddings for Speech Emotion Recognition. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 2418–2422. [Google Scholar] [CrossRef]
- Purohit, T.; Magimai-Doss, M. Emotion Information Recovery Potential of Wav2Vec2 Network Fine-tuned for Speech Recognition Task. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar] [CrossRef]
- Wu, Y.T.; Wu, J.; Sethu, V.; Lee, C.C. Can Modelling Inter-Rater Ambiguity Lead to Noise-Robust Continuous Emotion Predictions? In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 3714–3718. [Google Scholar] [CrossRef]
- Khan, N.D.; Khan, J.A.; Li, J.; Ullah, T.; Zhao, Q. Leveraging Large Language Model ChatGPT for Enhanced Understanding of End-User Emotions in Social Media Feedbacks. Expert Syst. Appl. 2025, 261, 125524. [Google Scholar] [CrossRef]
- Muhammad, S.H.; Ousidhoum, N.; Abdulmumin, I.; Yimam, S.M.; Wahle, J.P.; Lima Ruas, T.; Beloucif, M.; De Kock, C.; Belay, T.D.; Ahmad, I.S.; et al. SemEval-2025 Task 11: Bridging the Gap in Text-based Emotion Detection. In Proceedings of the International Workshop on Semantic Evaluation (SemEval), Vienna, Austria, 31 July–1 August 2025; pp. 2558–2569. [Google Scholar]
- Franceschini, R.; Fini, E.; Beyan, C.; Conti, A.; Arrigoni, F.; Ricci, E. Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss. In Proceedings of the International Conference on Pattern Recognition (ICPR), Montréal, QC, Canada, 21–25 August 2022; pp. 2589–2596. [Google Scholar] [CrossRef]
- Dutta, A.; Biswas, S.; Das, A.K. EmoComicNet: A Multi-Task Model for Comic Emotion Recognition. Pattern Recognit. 2024, 150, 110261. [Google Scholar] [CrossRef]
- Ryumin, D.; Ivanko, D.; Ryumina, E. Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices. Sensors 2023, 23, 2284. [Google Scholar] [CrossRef]
- Carson, A.; Välimäki, V.; Wright, A.; Bilbao, S. Resampling Filter Design for Multirate Neural Audio Effect Processing. IEEE Trans. Audio Speech Lang. Process. 2025, 33, 2163–2174. [Google Scholar] [CrossRef]
- Lotfian, R.; Busso, C. Building Naturalistic Emotionally Balanced Speech Corpus by Retrieving Emotional Speech from Existing Podcast Recordings. IEEE Trans. Affect. Comput. 2019, 10, 471–483. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Beck, M.; Pöppel, K.; Spanring, M.; Auer, A.; Prudnikova, O.; Kopp, M.; Klambauer, G.; Brandstetter, J.; Hochreiter, S. xLSTM: Extended Long Short-Term Memory. Adv. Neural Inf. Process. Syst. (NeurIPS) 2024, 37, 107547–107603. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 1–11. [Google Scholar]
- Clark, J.H.; Garrette, D.; Turc, I.; Wieting, J. Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. Trans. Assoc. Comput. Linguist. 2022, 10, 73–91. [Google Scholar] [CrossRef]
- Müller, R.; Kornblith, S.; Hinton, G.E. When does Label Smoothing Help? In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 1–10. [Google Scholar]
- Zhang, C.B.; Jiang, P.T.; Hou, Q.; Wei, Y.; Han, Q.; Li, Z.; Cheng, M.M. Delving Deep Into Label Smoothing. IEEE Trans. Image Process. 2021, 30, 5984–5996. [Google Scholar] [CrossRef]
- Axyonov, A.; Ryumin, D.; Ivanko, D.; Kashevnik, A.; Karpov, A. Audio-Visual Speech Recognition In-the-Wild: Multi-Angle Vehicle Cabin Corpus and Attention-based Method. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 8195–8199. [Google Scholar] [CrossRef]
- Zhu, J.; Zhao, S.; Jiang, J.; Xu, Z.; Tang, W.; Yao, H. Learning Class Prototypes for Visual Emotion Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar] [CrossRef]
- Ryumina, E.; Markitantov, M.; Axyonov, A.; Ryumin, D.; Dolgushin, M.; Karpov, A. Zero-Shot Multimodal Compound Expression Recognition Approach using Off-the-Shelf Large Visual-Language Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Honolulu, HI, USA, 19–23 October 2025; pp. 71–79. [Google Scholar]
- Devvrit, F.; Kudugunta, S.; Kusupati, A.; Dettmers, T.; Chen, K.; Dhillon, I.; Tsvetkov, Y.; Hajishirzi, H.; Kakade, S.; Farhadi, A.; et al. MatFormer: Nested Transformer for Elastic Inference. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 10–15 December 2025; pp. 140535–140564. [Google Scholar]
- Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
- Yun, T.; Lim, H.; Lee, J.; Song, M. TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Mexico City, Mexico, 16–21 June 2024; pp. 82–95. [Google Scholar] [CrossRef]
- Kostenetskiy, P.; Chulkevich, R.; Kozyrev, V. HPC Resources of the Higher School of Economics. J. Phys. Conf. Ser. 2021, 1740, 012050. [Google Scholar] [CrossRef]
- Wu, Y.; Zhang, S.; Li, P. Multi-Modal Emotion Recognition in Conversation based on Prompt Learning with Text-Audio Fusion Features. Sci. Rep. 2025, 15, 8855. [Google Scholar] [CrossRef]
- Zhang, X.; Shi, S.; Li, Y.; Ma, W.; Sun, P.; Zhang, M. Feature-Enhanced Neural Collaborative Reasoning for Explainable Recommendation. ACM Trans. Inf. Syst. 2024, 43, 7. [Google Scholar] [CrossRef]
- Tang, X.; Li, Z.; Sun, X.; Xu, X.; Zhang, M.L. ZzzMate: A Self-Conscious Emotion-Aware Chatbot for Sleep Intervention. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 26 April–1 May 2025; pp. 1–7. [Google Scholar] [CrossRef]
- Schedl, M.; Lex, E.; Tkalcic, M. Psychological Aspects in Retrieval and Recommendation. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, Padua, Italy, 13–18 July 2025; pp. 4114–4117. [Google Scholar] [CrossRef]
Figure 1.
The pipeline of the proposed Cross-Lingual Bimodal Emotion Recognition method.
Figure 2.
Illustration of prompt formulations for ChatGPT-5 to generate short, emotionally rich utterances aligned with the expressive and conversational style of the MELD corpus.
Figure 3.
Illustration of the DIA-TTS model employed in the TUG pipeline, which converts generated textual utterances enriched with verbal tags (e.g., sighs, laughs) into expressive audio signals, enabling controlled augmentation of emotional speech data.
Figure 4.
Illustration of prompt formulations for the LS-LLM. The best-performing prompt enforces strict formatting and distributional constraints, while alternative prompts were prone to errors and produced less informative outputs.
Figure 5.
Architectures of BER models. The architectures range from the simple BiFormer, which uses a Transformer-based cross-modal attention mechanism, to the hierarchical BiGatedGraphFormer, which combines Transformer-based, graph-based, and gated-based cross-modal mechanisms.
Figure 6.
Components of the proposed architectures of BER models. The graph fusion layer is a component of the BiGraphFormer and BiGatedGraphFormer models. The gated fusion layer is a specific component of the BiGatedGraphFormer, while feature resampling is used in all models.
Figure 7. Distributions of emotions in utterances across subsets.
Figure 8. Confusion matrices obtained for the best-performing audio-based model.
Figure 9. Confusion matrices obtained for the best-performing text-based model.
Figure 10. Bimodal-based experimental results using different data augmentation methods. The results are sorted by their rank across all measures. The UAR metric is also shown, as it is particularly informative for imbalanced class distributions. p denotes the augmentation probability.
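For reference, the four measures reported in the tables and figures (UAR, MF1, WAR, WF1) follow their standard definitions and can be computed with scikit-learn as below; this mapping is assumed, since the evaluation code itself is not reproduced here.

```python
# Standard definitions of the reported measures (assumed to match the
# evaluation protocol): UAR = macro-averaged recall, WAR = weighted recall
# (equal to accuracy), MF1/WF1 = macro-/weighted-averaged F1.
from sklearn.metrics import f1_score, recall_score


def emotion_metrics(y_true, y_pred) -> dict:
    return {
        "UAR": recall_score(y_true, y_pred, average="macro"),
        "WAR": recall_score(y_true, y_pred, average="weighted"),
        "MF1": f1_score(y_true, y_pred, average="macro"),
        "WF1": f1_score(y_true, y_pred, average="weighted"),
    }


# Example: emotion_metrics([0, 1, 1, 2], [0, 1, 2, 2])
```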
Figure 11. Confusion matrices obtained for the best-performing bimodal-based model.
Figure 12. Visualization of the BER model’s attention for two pairs of semantically equivalent sentences in Russian and English.
Figure 13. User interface of the interactive prototype developed for the BER system.
Table 1. Comparison of ER corpora: languages, modalities, and speech characteristics.
| Corpus | Language(s) | Size | Modalities | Speakers (N) | Speech Type | Interaction Type |
|---|---|---|---|---|---|---|
| SAMSEMO [51] | English, German, Spanish, Polish, Korean | 23,086 | Audio, Video, Text | ∼1390 | Natural | Monologue |
| IEMOCAP [52] | English | 10,000 | Audio, Video, Text, Head movements | 10 | Both | Dialogue |
| CMU-MOSEI [53] | English | 23,453 | Audio, Video, Text | 1000 | Natural | Monologue |
| MELD [34] | English | 13,708 | Audio, Video, Text | 407 (6 main) | Acted | Dialogue |
| LSSED [54] | English | 147,025 | Audio | 820 | Natural | Monologue |
| EmoReact [55] | Russian | 1102 | Audio, Video | 63 (children) | Natural | Dialogue |
| Dusha [56] | Russian | 300,000+ | Audio | 2068 | Both | Monologue |
| RESD [33] | Russian | 3400 | Audio, Text | unknown | Acted | Dialogue |
Table 2. Overview of recent ER methods using audio and text modalities.
| Year | Method | Linguistic Features | Acoustic Features | Visual Features | Modality Fusion | Corpora Used |
|---|---|---|---|---|---|---|
| 2018 | Yang and Hirschberg [85] | – | CNN, BiLSTM | – | – | SEMAINE, RECOLA |
| 2020 | Delbrouck et al. [86] | GloVe | DCNN | R(2+1)D-152 | Modular Co-Attention, Glimpse Layer | CMU-MOSEI |
| 2021 | Fan et al. [54] | – | DCNN, LSTM-CNN | – | – | LSSED |
| 2022 | Padi et al. [87] | BERT (fine-tuned) | ResNet-34 (transfer-learned) | – | Score-level late fusion | IEMOCAP |
| 2023 | Zhang and Li [44] | RoBERTa | DCNN | – | MCFN | IEMOCAP, MELD |
| 2023 | Wang et al. [45] | BERT | COVAREP | Facet | Diffusion-based modality recovery | CMU-MOSI, CMU-MOSEI |
| 2023 | Kim and Cho [74] | RoBERTa (GPT-2 Tokenizer) | Wav2Vec2.0 (CNN, BiLSTM) | – | Cross-modal Transformer, Focus-Attention | IEMOCAP |
| 2023 | Li et al. [78] | GloVe, BERT | COVAREP | Facet | Graph-based Knowledge Distillation | IEMOCAP, MOSEI, AFEW, CMU-MOSEI |
| 2023 | Lei et al. [88] | LLM | – | – | – | IEMOCAP, MELD, DailyDialog |
| 2023 | Wang et al. [89] | – | Emotion Extractor (with Mixup) | – | – | IEMOCAP |
| 2023 | Lee [90] | – | HuBERT | – | MTL, DANN, late combination | IEMOCAP, JTES |
| 2023 | Gong et al. [91] | GPT-2, BERT | ResNet-50 + Whisper ASR | – | Weak labels via pre-trained LLM | IEMOCAP, CREMA-D |
| 2024 | Shi et al. [79] | RoBERTa | WavLM | – | Two-stage fusion with prosody enhancement | IEMOCAP, MELD |
| 2024 | Mai et al. [92] | – | DropFormer (Transformer, WavLM) | – | – | IEMOCAP, MELD |
| 2024 | Zhao et al. [93] | BERT | WavLM | – | SAWA, GCPU, DFC | IEMOCAP, CMU-MOSEI |
| 2024 | García et al. [94] | – | Deep Learning Beamforming | – | – | MSP-Podcast, Real HRI Data |
| 2024 | Sun et al. [95] | ASR, BERT | MFCC | – | MFSN | IEMOCAP |
| 2024 | Wu et al. [96] | – | 15 SSLMs (e.g., WavLM, HuBERT) | – | – | IEMOCAP, CREMA-D, MSP-Podcast, BIIC-Podcast, IMPROV, NNIME |
| 2024 | Phukan et al. [97] | – | TRILLsson | – | – | SUPERB, EMO-SUPERB, ML-SUPERB |
| 2024 | Ritter-Gutierrez et al. [98] | – | HuBERT | – | – | IEMOCAP |
| 2024 | Leem et al. [99] | – | WavLM-Large, 1D CNN | – | – | MSP-Podcast |
| 2024 | Ma et al. [100] | – | ResNet-50, CNN, Emotion Open Deep Network | – | – | IEMOCAP |
| 2024 | Huang et al. [101] | RoBERTa, BiGRU | Wav2Vec2.0, BiGRU | CLIP, BiGRU | NodeFormer Fusion | IEMOCAP, MELD |
| 2025 | Ryumina et al. [18] | RoBERTa | Wav2Vec2.0 | ResNet-50 | Gated attention fusion | IEMOCAP, MOSEI, MELD, AFEW |
| 2025 | Zhao et al. [77] | LLM | – | LLM | RL with verifiable reward based multimodal fusion | MAFW, DFEW, EMER |
| 2025 | Li et al. [80] | RoBERTa | OpenSmile | DenseNet | Graph Framelet, Dual-Reminder Attention | IEMOCAP, MELD |
| 2025 | Sun et al. [102] | BERT | Wav2Vec2.0 | – | CM-ARR (alignment, reconstruction, refinement) | IEMOCAP, MSP-IMPROV |
| 2025 | The proposed method | Jina-v3, Mamba | Wav2Vec2.0, Mamba | – | Transformer-based cross-modal fusion, SDS, LS-LLM | MELD, RESD |
Table 3. Comparison of lightweight LLM characteristics.
| Model | Parameters (B) | Layers | Architecture | Release Date |
|---|---|---|---|---|
| Falcon3-3B-Instruct | 3.23 | 22 | Decoder-only | December 2024 |
| Phi-4-mini-instruct | 3.84 | 32 | Decoder-only | February 2025 |
| Gemma-3-4b-it | 4.3 | 34 | Decoder-only | March 2025 |
| Qwen3-4B | 4.02 | 36 | Decoder-only | April 2025 |
| Falcon-H1-3B-Instruct | 3.15 | 32 | Decoder-only | May 2025 |
| Gemma-3n-E2B-it | 5.44 | 30 | MatFormer [144] | June 2025 |
| SmolLM3-3B | 3.08 | 36 | Decoder-only | July 2025 |
| Qwen3-4B-Instruct-2507 | 4.02 | 36 | Decoder-only | August 2025 |
Table 4. General statistics of MELD and RESD corpora.
| Statistics | MELD Train | MELD Development | MELD Test | RESD Train | RESD Test |
|---|---|---|---|---|---|
| # of utterances | 9989 | 1109 | 2610 | 1116 | 280 |
| Max. # of (sub)tokens in utterances | 95 | 59 | 64 | 88 | 85 |
| Avg. # of (sub)tokens in utterances | 14.28 | 14.17 | 14.61 | 23.73 | 24.08 |
| Max. duration of utterances, sec. | 41.05 | 28.54 | 304.96 | 20.14 | 17.16 |
| Avg. duration of utterances, sec. | 3.14 | 3.12 | 3.29 | 6.07 | 5.93 |
Table 5. Optimal parameters of the audio-based model and training process.
| Hyperparameter | LSTM | xLSTM | Transformer | Mamba |
|---|---|---|---|---|
| Model hyperparameters | | | | |
| Layer number, {1, 2, 3, 4} | 2 | 1 | 2 | 2 |
| Head number, {1, 2, 4, 8, 16} | – | 1 | 4 | – |
| Hidden size, {64, 128, 256} | 128 | 128 | 128 | 256 |
| Feedforward hidden size, {128, 256, 512, 1024} | – | – | 512 | – |
| Pos. encoding type, {fixed, learnable} | – | – | fixed | – |
| Pooling, {cls, mean} | – | – | mean | – |
| Bidirectional, {True, False} | True | – | – | – |
| Kernel size, {–, 2, 4} | – | – | – | – |
| Number of sLSTM blocks, {0, 1, 2} | – | 1 | – | – |
| Number of mLSTM blocks, {0, 1, 2, 3} | – | 2 | – | – |
| Training hyperparameters | | | | |
| Optimizer, {Adam, AdamW, SGD} | Adam | Adam | Adam | Adam |
| Learning rate, log_scale [1 × 10−3, 1 × 10−4] | ≈5.5 × 10−4 | ≈1.4 × 10−4 | ≈6.4 × 10−4 | ≈1.4 × 10−4 |
| Batch size, {8, 16, 32} | 16 | 16 | 16 | 16 |
| Dropout, {0.1, 0.2, 0.3, 0.4} | 0.2 | 0.1 | 0.4 | 0.2 |
| Normalization, {True, False} | False | False | False | False |
| Loss, {cross-entropy, focal} | cross-entropy | cross-entropy | cross-entropy | cross-entropy |
| Computational cost of model | | | | |
| Number of parameters | 1.58 M | 1.18 M | 0.53 M | 0.79 M |
| Model weight | 6 MB | 4.5 MB | 9.6 MB | 3 MB |
| Learning time per epoch in seconds | 91 | 73 | 50 | 51 |
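The search spaces in Table 5 (for example, the log-scale learning-rate range) can be encoded with a hyperparameter optimization library such as Optuna. The sketch below is illustrative only: the tables do not state which search tool or objective was actually used, and train_and_validate is a placeholder for the real training loop.

```python
# Illustrative encoding of the Table 5 search space as an Optuna objective.
# Optuna and the train_and_validate stub are assumptions, not the paper's setup.
import random

import optuna


def train_and_validate(params: dict) -> float:
    # Placeholder for the real training/validation loop (e.g., returning validation UAR).
    return random.random()


def objective(trial: optuna.Trial) -> float:
    params = {
        "num_layers": trial.suggest_categorical("num_layers", [1, 2, 3, 4]),
        "hidden_size": trial.suggest_categorical("hidden_size", [64, 128, 256]),
        "lr": trial.suggest_float("lr", 1e-4, 1e-3, log=True),  # log-scale range from Table 5
        "batch_size": trial.suggest_categorical("batch_size", [8, 16, 32]),
        "dropout": trial.suggest_categorical("dropout", [0.1, 0.2, 0.3, 0.4]),
        "loss": trial.suggest_categorical("loss", ["cross-entropy", "focal"]),
    }
    return train_and_validate(params)


# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=50)
```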
Table 6. Audio-based experimental results. Rank is calculated using Friedman’s test [145].
| Extractor | Model | MELD UAR | MELD MF1 | MELD WAR | MELD WF1 | RESD UAR | RESD MF1 | RESD WAR | RESD WF1 | Rank |
|---|---|---|---|---|---|---|---|---|---|---|
| Wav2Vec2.0 | LSTM | 28.43 | 29.81 | 52.15 | 49.12 | 34.04 | 33.41 | 35.36 | 34.34 | 4.88 |
| Wav2Vec2.0 | xLSTM | 28.75 | 30.12 | 52.71 | 49.19 | 39.53 | 38.76 | 41.12 | 40.03 | 3.00 |
| Wav2Vec2.0 | Transformer | 26.79 | 27.58 | 52.45 | 48.41 | 42.22 | 40.86 | 43.21 | 41.68 | 3.88 |
| Wav2Vec2.0 | Mamba | 28.00 | 29.34 | 53.56 | 49.28 | 60.79 | 61.77 | 61.79 | 62.15 | 2.25 |
| ExHuBERT | LSTM | 24.37 | 25.32 | 51.46 | 45.42 | 33.06 | 29.43 | 35.36 | 31.33 | 7.75 |
| ExHuBERT | xLSTM | 24.94 | 26.06 | 52.89 | 46.63 | 34.20 | 30.39 | 36.53 | 32.48 | 6.13 |
| ExHuBERT | Transformer | 26.88 | 27.18 | 50.61 | 47.26 | 31.27 | 28.66 | 33.57 | 30.41 | 7.50 |
| ExHuBERT | Mamba | 29.60 | 29.55 | 50.08 | 48.24 | 38.91 | 37.15 | 40.71 | 38.84 | 4.63 |
| TelME [146] | – | – | – | – | 46.60 | – | – | – | – | 8.00 |
| WavLM [33] | – | 18.42 | 12.05 | 17.14 | 10.92 | 80.97 | 80.62 | 81.07 | 80.93 | 5.13 |
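One plausible way to obtain the Rank column in Table 6 is to rank the models within each of the eight metric columns and then average these Friedman ranks per model. The sketch below illustrates this on a subset of Table 6 rows, so its values will not match the reported ranks, and the exact protocol used may differ.

```python
# Hedged sketch of average Friedman ranks across metric columns
# (one plausible reading of the Rank column; the exact protocol may differ).
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# rows = models, columns = UAR/MF1/WAR/WF1 on MELD and RESD (higher is better);
# only a subset of Table 6 rows is shown for illustration.
scores = np.array([
    [28.43, 29.81, 52.15, 49.12, 34.04, 33.41, 35.36, 34.34],  # Wav2Vec2.0 + LSTM
    [28.75, 30.12, 52.71, 49.19, 39.53, 38.76, 41.12, 40.03],  # Wav2Vec2.0 + xLSTM
    [28.00, 29.34, 53.56, 49.28, 60.79, 61.77, 61.79, 62.15],  # Wav2Vec2.0 + Mamba
])

# Rank the models within every metric column (rank 1 = highest score).
ranks = np.apply_along_axis(lambda col: rankdata(-col), 0, scores)
average_rank = ranks.mean(axis=1)                # one average rank per model
statistic, p_value = friedmanchisquare(*scores)  # Friedman test across metric columns
```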
Table 7. Optimal parameters of the text-based model and training process.
| Hyperparameter | LSTM | xLSTM | Transformer | Mamba |
|---|---|---|---|---|
| Model hyperparameters | | | | |
| Layers number, {1, 2, 3} | 2 | 2 | 3 | 2 |
| Heads number, {1, 2, 4, 8} | – | 1 | 8 | – |
| Hidden size, {64, 128, 256, 512} | 512 | 256 | 256 | 512 |
| Kernel size, {2, 4, 8} | – | – | – | 4 |
| Bidirectional, {True, False} | True | – | – | – |
| Number of sLSTM blocks, {0, 1, 2} | – | 1 | – | – |
| Number of mLSTM blocks, {0, 1, 2, 3} | – | 1 | – | – |
| Training hyperparameters | | | | |
| Optimizer, {Adam, AdamW, SGD} | AdamW | Adam | Adam | Adam |
| Learning rate, {1 × 10−3, 1 × 10−4, 1 × 10−5} | 1 × 10−4 | 1 × 10−4 | 1 × 10−4 | 1 × 10−4 |
| Batch size, {16, 32, 64} | 32 | 32 | 32 | 32 |
| Dropout, {0, 0.1, 0.2} | 0.1 | 0.2 | 0.1 | 0 |
| Computational cost of model | | | | |
| Number of parameters | 12.61 M | 8.15 M | 1.58 M | 3.28 M |
| Model weight | 48.09 MB | 31.08 MB | 6.03 MB | 12.50 MB |
| Learning time per epoch in seconds | 54 | 140 | 45 | 392 |
Table 8. Text-based experimental results.
| Extractor | Model | MELD UAR | MELD MF1 | MELD WAR | MELD WF1 | RESD UAR | RESD MF1 | RESD WAR | RESD WF1 | Rank |
|---|---|---|---|---|---|---|---|---|---|---|
| Jina-v3 | LSTM | 38.05 | 37.84 | 55.67 | 56.07 | 34.09 | 34.04 | 35.00 | 34.81 | 3.75 |
| Jina-v3 | xLSTM | 37.89 | 37.63 | 55.38 | 55.73 | 35.07 | 34.98 | 35.92 | 35.64 | 3.88 |
| Jina-v3 | Transformer | 41.33 | 38.65 | 53.95 | 55.60 | 33.80 | 33.44 | 33.93 | 33.73 | 4.00 |
| Jina-v3 | Mamba | 36.17 | 36.76 | 56.44 | 55.27 | 37.20 | 37.18 | 37.50 | 37.36 | 3.63 |
| XLM-RoBERTa | LSTM | 38.50 | 37.52 | 56.36 | 56.64 | 30.45 | 29.10 | 30.71 | 29.37 | 5.63 |
| XLM-RoBERTa | xLSTM | 38.39 | 37.43 | 56.01 | 56.24 | 30.61 | 29.23 | 30.79 | 29.46 | 5.75 |
| XLM-RoBERTa | Transformer | 39.12 | 37.21 | 53.49 | 54.88 | 32.74 | 31.53 | 32.86 | 31.79 | 5.88 |
| XLM-RoBERTa | Mamba | 39.44 | 38.09 | 55.52 | 56.60 | 31.71 | 31.06 | 32.50 | 31.70 | 4.50 |
| CANINE-c | LSTM | 32.55 | 31.95 | 49.62 | 49.79 | 23.14 | 23.07 | 23.57 | 23.19 | 10.88 |
| CANINE-c | xLSTM | 32.65 | 32.11 | 49.84 | 50.02 | 24.28 | 24.13 | 24.78 | 24.36 | 9.88 |
| CANINE-c | Transformer | 32.41 | 31.67 | 48.51 | 49.66 | 22.30 | 21.32 | 22.86 | 21.78 | 12.00 |
| CANINE-c | Mamba | 32.88 | 32.52 | 49.58 | 49.57 | 27.34 | 27.50 | 26.95 | 27.01 | 9.75 |
| TelME [146] | – | – | – | – | 66.57 | – | – | – | – | 1.00 |
Table 9. Optimal parameters of the bimodal-based model and training process.
| Hyperparameter | BiFormer | BiGraphFormer | BiGatedGraphFormer | BiMamba |
|---|---|---|---|---|
| Model hyperparameters | | | | |
| Transformer layer number, {2, 3, 4, 5} | 5 | 5 | 5 | – |
| Transformer hidden size, {128, 256, 512} | 256 | 256 | 256 | – |
| Transformer head number, {2, 4, 8, 16} | 8 | 8 | 8 | – |
| Graph head number, {2, 4, 8, 16} | – | 8 | 8 | – |
| Gated/Graph hidden size, {128, 256, 512} | – | – | 256 | – |
| Output feature size, {128, 256, 512} | 256 | 256 | 256 | 512 |
| Mamba hidden size, {128, 256, 512} | – | – | – | 512 |
| Mamba layer number, {2, 3, 4, 5} | – | – | – | 4 |
| Mamba kernel size, {2, 3, 4, 5} | – | – | – | 4 |
| Mamba D state, {2, 4, 8, 16} | – | – | – | 16 |
| Training hyperparameters | | | | |
| Optimizer, {Adam, AdamW, SGD, Lion} | Adam | Adam | Adam | Adam |
| Learning rate, {1 × 10−3, 1 × 10−4, 1 × 10−5} | 1 × 10−5 | 1 × 10−5 | 1 × 10−5 | 1 × 10−4 |
| Batch size, {16, 32, 64} | 64 | 64 | 64 | 64 |
| Dropout, {0, 0.1, 0.15, 0.2} | 0.15 | 0.15 | 0.15 | 0.15 |
| Scheduler, {plateau, cosine, cosine with restarts} | plateau | plateau | plateau | plateau |
| Computational cost of model | | | | |
| Number of parameters | 4.40 M | 5.82 M | 6.37 M | 4.89 M |
| Model weight | 16.8 MB | 22.2 MB | 24.3 MB | 18.6 MB |
| Learning time per epoch in seconds | 9 | 10 | 11 | 16 |
Table 10. Bimodal-based experimental results. p denotes the augmentation probability.
| Extractor | Model | MELD UAR | MELD MF1 | MELD WAR | MELD WF1 | RESD UAR | RESD MF1 | RESD WAR | RESD WF1 | Rank |
|---|---|---|---|---|---|---|---|---|---|---|
| Wav2Vec2.0 | Mamba | 28.00 | 29.34 | 53.56 | 49.28 | 60.79 | 61.77 | 61.79 | 62.15 | 14.50 |
| Jina-v3 | Mamba | 36.17 | 36.76 | 56.44 | 55.27 | 37.20 | 37.18 | 37.50 | 37.36 | 14.88 |
| Wav2Vec2.0- and Jina-v3-Mamba | WPF | 35.60 | 37.88 | 59.89 | 57.01 | 56.79 | 56.72 | 56.59 | 56.54 | 11.50 |
| Wav2Vec2.0- and Jina-v3-Mamba | BiFormer | 38.54 | 38.86 | 57.05 | 56.51 | 67.89 | 67.93 | 68.57 | 68.41 | 5.88 |
| Wav2Vec2.0- and Jina-v3-Mamba | BiGraphFormer | 36.94 | 37.55 | 56.82 | 56.18 | 68.51 | 68.30 | 68.93 | 68.70 | 6.75 |
| Wav2Vec2.0- and Jina-v3-Mamba | BiGatedGraphFormer | 37.12 | 37.44 | 56.40 | 55.90 | 68.24 | 68.33 | 68.57 | 68.50 | 7.50 |
| Wav2Vec2.0- and Jina-v3-Mamba | BiMamba | 36.96 | 37.42 | 56.90 | 55.73 | 66.30 | 66.50 | 66.43 | 66.54 | 10.38 |
| Wav2Vec2.0- and Jina-v3-Mamba | BiFormer-WPF | 38.08 | 38.88 | 58.62 | 57.43 | 62.73 | 62.84 | 63.21 | 63.03 | 7.88 |
| Wav2Vec2.0- and Jina-v3-Mamba | BiGraphFormer-WPF | 38.01 | 38.16 | 56.55 | 56.41 | 64.14 | 64.04 | 64.64 | 64.44 | 9.75 |
| Wav2Vec2.0- and Jina-v3-Mamba | BiGatedGraphFormer-WPF | 37.83 | 37.70 | 55.67 | 55.62 | 62.53 | 62.39 | 63.21 | 62.79 | 11.88 |
| Wav2Vec2.0- and Jina-v3-Mamba | BiMamba-WPF | 41.47 | 41.06 | 55.32 | 53.94 | 59.66 | 59.67 | 60.00 | 59.99 | 11.00 |
| Wav2Vec2.0- and Jina-v3-Mamba | BiFormer + SDS (p = 0.4) | 39.57 | 39.89 | 58.51 | 57.82 | 66.71 | 66.97 | 67.14 | 67.28 | 5.00 |
| Wav2Vec2.0- and Jina-v3-Mamba | BiFormer + LS-LLM (Falcon-H1-3B-Instruct, p = 0.3) | 38.40 | 38.89 | 58.01 | 56.81 | 67.01 | 67.28 | 67.50 | 67.62 | 6.25 |
| Wav2Vec2.0- and Jina-v3-Mamba | BiFormer + LS-LLM (Phi-4-mini-instruct, p = 0.2) | 38.53 | 39.14 | 58.35 | 57.25 | 66.91 | 67.12 | 67.50 | 67.49 | 5.63 |
| Wav2Vec2.0- and Jina-v3-Mamba | BiFormer + LS-LLM (Qwen3-4B, p = 0.2) | 39.03 | 39.29 | 58.08 | 57.14 | 68.05 | 68.14 | 68.57 | 68.51 | 4.00 |
| Wav2Vec2.0- and Jina-v3-Mamba | BiFormer + SDS (p = 0.4) + LS-LLM (Qwen3-4B, p = 0.2) | 39.23 | 39.30 | 57.39 | 57.18 | 68.90 | 69.10 | 69.29 | 69.38 | 2.63 |
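The augmentation probability p in Table 10 can be read as the per-sample chance that SDS or LS-LLM augmentation is applied during training. A generic probability gate is sketched below; augment_fn is a placeholder, since the augmentation internals are not shown in this table.

```python
# Generic probability gate for on-the-fly augmentation (illustrative only;
# augment_fn stands in for SDS or LS-LLM augmentation).
import random
from typing import Callable, TypeVar

Sample = TypeVar("Sample")


def maybe_augment(sample: Sample, augment_fn: Callable[[Sample], Sample], p: float = 0.2) -> Sample:
    """Apply augment_fn to a training sample with probability p, else keep the original."""
    return augment_fn(sample) if random.random() < p else sample


# Example: maybe_augment(("audio.wav", "I can't believe it!", "surprise"),
#                        augment_fn=lambda s: s, p=0.4)
```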
Table 11. Comparison with SOTA methods. A, T, and AT denote audio, text, and audio–text modalities, respectively.
| Method | Modality | WAR | WF1 |
|---|---|---|---|
| RESD | | | |
| WavLM [33] | A | 81.07 | 80.93 |
| Wav2Vec2.0- and Jina-v3-Mamba, BiFormer | AT | 83.93 | 84.88 |
| Wav2Vec2.0- and Jina-v3-Mamba, BiFormer + SDS + LS-LLM | AT | 84.29 | 85.25 |
| MELD | | | |
| RoBERTa, DCNN, Modality-Collaborative Fusion Network (MCFN) [44] | AT | 64.50 | 62.20 |
| RoBERTa, WavLM, Transformer-based fusion [79] | AT | 64.36 | 63.73 |
| RoBERTa + BiGRU, Wav2Vec2.0 + BiGRU, Concat. [101] | AT | 66.03 | 64.37 |
| RoBERTa, Data2vec, Attention-based modality shifting fusion [146] | AT | – | 67.19 |
| BERT, OpenSMILE, Cross-attention, Temporal Convolutional Network (TCN) [148] | AT | – | 67.73 |
| Wav2Vec2.0- and Jina-v3-Mamba, BiFormer | AT | 64.64 | 67.98 |
| Wav2Vec2.0- and Jina-v3-Mamba, BiFormer + SDS + LS-LLM | AT | 64.90 | 68.31 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).