Review Reports
- Elena Ryumina1,2,
- Alexandr Axyonov1,2 and
- Timur Abdulkadirov1
- et al.
Reviewer 1: Dennis Lourenço Paulino; Reviewer 2: Anonymous; Reviewer 3: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The authors present a study that is of interest, in which they propose a cross-lingual bimodal emotion recognition method. This method combines audio/text encoders with a transformer-based cross-modal fusion method. Notwithstanding the potential of the present study to make a substantial contribution to the enhancement of bimodal emotion recognition, it is recommended that some minor issues be resolved.
- Despite the authors' clear exposition of the objectives and contributions of the study, it is recommended that research questions be formulated in the conclusion of the "Introduction" section to enhance the scientific rigor of the objectives. It is recommended that the "Discussion" be organized into subsections corresponding to each research question discussed, as outlined in the formulated research questions (e.g. 5.1 RQ 1 - …, 5.2 RQ2 - …).
- The authors have provided a substantial bibliography. Nevertheless, a total of 12 papers are cited from the “arXiv” platform, the majority of which are preprints and therefore not subject to the peer-review process. As is correct, citations for non-peer-reviewed sources should be made in the text itself (as the authors have done with other sources, such as huggingface.co).
In conclusion, the authors have presented a study that only requires minor improvements, and which is of great interest.
Author Response
We sincerely thank the reviewer for the valuable time and effort spent reviewing our manuscript. The comments are all valuable and very helpful for revising and improving our article, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope meet with approval.
The different fonts used in this document are as follows:
- Reviewers’ original comments are reproduced in blue font.
- Plain font indicates our answers to the reviewers’ comments.
- Text reproduced from the paper is shown in red font.
Comment 1: Despite the authors' clear exposition of the objectives and contributions of the study, it is recommended that research questions be formulated in the conclusion of the "Introduction" section to enhance the scientific rigor of the objectives. It is recommended that the "Discussion" be organized into subsections corresponding to each research question discussed, as outlined in the formulated research questions (e.g. 5.1 RQ 1 - …, 5.2 RQ2 - …).
Reply: We thank the reviewer for pointing out this comment. In response, we have clearly formulated the three research questions (RQ1-RQ3) at the end of Section 1 (Introduction) to make the study's objectives clearer and enhance scientific rigor. Additionally, Section 5 (Discussion) has been expanded to include Subsections 5.1 (RQ1), 5.2 (RQ2), and 5.3 (RQ3), each directly addressing the corresponding research question with a focused analysis and interpretation of the results.
Lines 95-100: This leads to the following research questions:
- RQ1: Can a cross-lingual BER model improve generalization across languages?
- RQ2: Can a hierarchical cross-modal attention mechanism effectively handle variable utterance lengths?
- RQ3: Do corpus-specific augmentation strategies improve the generalization of cross-lingual BER models?
Lines 878-904: 5.1. RQ1: Can a cross-lingual BER model improve generalization across languages?
Figure 9 shows the word attention distribution of the proposed BER model on semantically equivalent Russian and English sentence pairs, revealing different lexical salience patterns. In the top pair, despite language differences, the model predicts Neutral (75.5% for Russian, 99.4% for English), with attention concentrated on lexically aligned terms, suggesting partial cross-lingual alignment. In contrast, the bottom pair exposes a critical limitation: while both sentences express Sadness, the model attends to literal, morphologically salient components rather than recognizing idiomatic or culturally entrenched expressions of grief. Notably, the Russian phrase, while semantically rich, lacks the conventional emotional framing found in the model’s training data. Its literal interpretation of “bleeding heart” evokes physical trauma rather than psychological sorrow, leading to a misattribution of emotion. This highlights the challenges of cross-lingual emotion modeling, which are not only caused by structural incongruences but also by the lack of culturally specific idioms in research corpora.
These results indicate that while the cross-lingual BER model achieves partial alignment for structurally similar expressions, it struggles with native-speaker-specific emotional idioms that do not appear in research corpora. Thus, cross-lingual generalization is possible but remains fragile without explicit modeling of linguistic and cultural diversity.
It should be noted that the model’s performance was evaluated on English and Russian, which are both Indo-European languages with relatively similar syntactic and semantic structures. Extending this model to other languages (such as Arabic, Korean and others) would require emotion-annotated data and dedicated experiments. This is an important area for future research. Nevertheless, the proposed model is based on the multilingual Jina-v3 encoder [67], which was pre-trained on over 30 languages and aligns semantically similar words across languages in a shared embedding space. Thus, the proposed cross-lingual BER model can, in principle, be applied to any supported language. However, without emotion-labeled data, it cannot reliably interpret culture-specific emotional expressions.
Lines 905-927: 5.2. RQ2: Can a hierarchical cross-modal attention mechanism effectively handle variable utterance lengths?
This research question is addressed by the results in Table 10 and the corresponding analysis in Section 4.5. On MELD, which consists of short utterances (approximately 3.6 seconds long, with around 8 words each), the BiFormer model with one attention mechanism outperforms other models. Simpler fusion techniques, such as WPF, also work well in this setting. This implies that for short and fragmented expressions, global attention is sufficient to detect cross-modal emotions. Adding a hierarchical cross-modal attention mechanism using a graph- or gate-based fusion introduces unnecessary complexity in this case. In contrast, on the RESD corpus, which contains longer and more expressive utterances (average duration 6.0 seconds, approximately 24 words), the hierarchical graph-based BiGraphFormer model outperforms the simple BiFormer model. This suggests that modeling pairwise audio-text interactions in two stages can be beneficial when emotional signals develop over time and require more in-depth integration. Mamba-based models demonstrate low performance on both corpora, despite Mamba's reputation for efficiently handling sequences regardless of length. This suggests that linear state-space modeling alone is not sufficient for accurate BER.
In summary, it is important to adapt the fusion strategy depending on the length of the utterances. For short utterances, a lightweight cross-modal fusion model is more effective, while for longer utterances, a hierarchical cross-modal fusion model leads to improved performance. This finding supports the need for adaptive or hierarchical model architectures in real-world ER systems, which can adapt to different dialogue styles and contexts.
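To make the fusion mechanism under discussion concrete, the following is a minimal sketch of Transformer-style bidirectional cross-modal attention in PyTorch. The class and dimension names (CrossModalFusion, d_model) are illustrative assumptions and do not reproduce the BiFormer implementation described in the paper.

```python
# Minimal sketch of bidirectional cross-modal attention, assuming PyTorch.
# Names and dimensions are illustrative, not the authors' BiFormer code.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_classes: int = 7):
        super().__init__()
        # Audio queries attend to text, and text queries attend to audio.
        self.audio_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_a, d_model), text: (batch, T_t, d_model)
        a_ctx, _ = self.audio_to_text(query=audio, key=text, value=text)
        t_ctx, _ = self.text_to_audio(query=text, key=audio, value=audio)
        # Mean-pool each attended sequence and fuse by concatenation.
        fused = torch.cat([a_ctx.mean(dim=1), t_ctx.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Example: 7-class emotion logits for a batch of two utterances.
logits = CrossModalFusion()(torch.randn(2, 80, 256), torch.randn(2, 24, 256))
```

A hierarchical variant such as BiGraphFormer would add a second, graph-based fusion stage on top of the attended representations, which is where the additional benefit for longer RESD utterances is observed.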
Lines 928-949: 5.3. RQ3: Do corpus-specific augmentation strategies improve the generalization of cross-lingual BER models?
This question is addressed by the results in Table 10 and Figure 8, which evaluate augmentation strategies using the BiFormer model as a consistent backbone. On the noisy and fragmented MELD corpus, SDS, which extends short utterances by stacking same-emotion samples, improves performance by stabilizing input length and reducing acoustic variability. However, it does not bring any gain on the cleaner and more structured RESD corpus. Conversely, LS-LLM is particularly effective for RESD, as longer utterances often contain more nuanced or complex emotional cues that may not be captured by a single-label annotation scheme. By generating a context-aware distribution of soft labels, LS-LLM can help recover this affective complexity. However, on MELD, where utterances are short and noisy, the signal is less reliable, limiting the benefit of label smoothing. The TUG method, which creates emotional speech using TTS, fails to improve performance on either corpus. This confirms that SOTA TTS models cannot reliably generate emotional speech. Combining SDS and LS-LLM results in complementary improvements on both corpora. This shows that corpus-aware augmentation, especially when multiple strategies are used together, enhances cross-lingual generalization.
In summary, the choice of augmentation strategies should be based on the utterance characteristics. SDS can be used for variable-length or noisy utterances, while LS-LLM can be used for long or nuanced emotional utterances. Combining these strategies can improve overall performance, as it allows for adaptation and context-awareness in data augmentation, which benefits generalization.
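As an illustration of the SDS strategy discussed above, the sketch below concatenates short same-emotion samples until a target duration is reached. The function and field names (stack_same_emotion, waveform, target_sec) are hypothetical, and the exact procedure used in the paper may differ.

```python
# Minimal sketch of Stacked Data Sampling (SDS): concatenate short
# same-emotion utterances until a target duration is reached.
# Names and data layout are illustrative assumptions only.
import random
from collections import defaultdict

def stack_same_emotion(samples, target_sec=6.0, sr=16000, seed=0):
    """samples: list of dicts with 'waveform' (list of floats), 'text', 'label'."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for s in samples:
        by_label[s["label"]].append(s)
    stacked = []
    for label, group in by_label.items():
        rng.shuffle(group)
        wav, txt = [], []
        for s in group:
            wav.extend(s["waveform"])
            txt.append(s["text"])
            if len(wav) >= target_sec * sr:      # reached the target length
                stacked.append({"waveform": wav, "text": " ".join(txt), "label": label})
                wav, txt = [], []
        if wav:                                   # keep any remainder as-is
            stacked.append({"waveform": wav, "text": " ".join(txt), "label": label})
    return stacked
```

Grouping by label before concatenation keeps each stacked sample single-emotion, which is what allows the original hard label to be retained.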
Comment 2: The authors have provided a substantial bibliography. Nevertheless, a total of 12 papers are cited from the “arXiv” platform, the majority of which are preprints and therefore not subject to the peer-review process. As is correct, citations for non-peer-reviewed sources should be made in the text itself (as the authors have done with other sources, such as huggingface.co).
Reply: We appreciate the comment from the reviewer regarding the use of preprint references. Following this suggestion, we have thoroughly revised the bibliography. All citations that previously referred to arXiv preprints have either been replaced with their corresponding peer-reviewed publications (if available), or have been converted into in-text references for non-peer-reviewed resources (e.g., Hugging Face and other model documentation). The current version of the manuscript contains no references to arXiv.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
1. The abstract is not clear enough in describing the method of the paper.
2. The method part is confusing, because the authors listed several models that may not be chosen in the end. For example, in subsection 3.1, LSTM, xLSTM, Transformer and Mamba are listed. I think these models are just tested by the authors in the process of choosing a proper model. The authors should clearly describe the final selection rather than the selection process.
3. In subsection 3.5, the relationships of BiFormer, BiGraphFormer, BiGatedGraphFormer and BiMamba are not clearly illustrated. Why are so many models listed here? Which is the final selection for the proposed method?
4. In Figure 5 and Figure 6, these parts are all isolated; it is hard to understand the relations between them.
5. Although the authors give a lot of content in the experimental part, most of it consists of comparisons of different alternative models for intermediate steps, which is not very important for the main contributions of the paper.
Author Response
We sincerely thank the reviewer for the valuable time and effort spent reviewing our manuscript. The comments are all valuable and very helpful for revising and improving our article, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope meet with approval.
The different fonts used in this document are as follows:
- Reviewers’ original comments are reproduced in blue font.
- Plain font indicates our answers to the reviewers’ comments.
- Text reproduced from the paper is shown in red font.
Comment 1: The abstract is not clear enough in describing the method of the paper.
Reply: We thank the reviewer for pointing out this comment. The abstract has been revised to provide a clearer and more detailed description of the proposed method. These additions ensure that the abstract accurately reflects the technical contribution of the manuscript.
Lines 2-24: Bimodal emotion recognition based on audio and text is widely adopted in video-constrained real-world applications such as call centers and voice assistants. However, existing systems suffer from limited cross-domain generalization and monolingual bias. To address these limitations, a cross-lingual bimodal emotion recognition method is proposed, integrating Mamba-based temporal encoders for audio (Wav2Vec2.0) and text (Jina-v3) with a Transformer-based cross-modal fusion architecture (BiFormer). Three corpus-adaptive augmentation strategies are introduced: (1) Stacked Data Sampling, in which short utterances are concatenated to stabilize sequence length; (2) Label Smoothing Generation based on Large Language Model, where the Qwen3-4B model is prompted to detect subtle emotional cues missed by annotators, producing soft labels that reflect latent emotional co-occurrences; and (3) Text-to-Utterance Generation, in which emotionally labeled utterances are generated by ChatGPT-5 and synthesized into speech using the DIA-TTS model, enabling controlled creation of affective audio-text pairs without human annotation. BiFormer is trained jointly on the English Multimodal EmotionLines Dataset and the Russian Emotional Speech Dialogs corpus, enabling cross-lingual transfer without parallel data. Experimental results show that the optimal data augmentation strategy is corpus-dependent: Stacked Data Sampling achieves the best performance on short, noisy English utterances, while Label Smoothing Generation based on Large Language Model better captures nuanced emotional expressions in longer Russian utterances. Text-to-Utterance Generation does not yield a measurable gain due to current limitations in expressive speech synthesis. When combined, the two best performing strategies produce complementary improvements, establishing new state-of-the-art performance in both monolingual and cross-lingual settings.
Comment 2: The method part is confusing, because the authors listed several models that may not be chosen in the end. For example, in subsection 3.1, LSTM, xLSTM, Transformer and Mamba are listed. I think these models are just tested by the authors in the process of choosing a proper model. The authors should clearly describe the final selection rather than the selection process.
Reply: We thank the reviewer for pointing out this comment. We understand that the inclusion of multiple architectural variants in Section 3 (Proposed Method) may initially seem like an exhaustive list of options rather than a focus on the final model. However, as this is a journal submission, we have intentionally provided a comprehensive methodological account of our design process to ensure reproducibility and transparency. The comparison of alternative components is an essential part of our research, as it allows us to:
- Understand the strengths and weaknesses of different temporal encoders for different utterance lengths and modalities.
- Assess the suitability of fusion mechanisms for different types of data (e.g., Transformer-based fusion for short MELD dialogues and graph-based fusion for longer RESD sequences).
- Provide the empirical grounding for our final model configuration.
Therefore, we believe that this level of detail enhances the scientific rigor of our work and its utility for the research community, particularly for researchers working on cross-lingual, variable-length, or related affective recognition tasks. Instead of removing these analyses, we have clarified the text to explicitly distinguish between the candidate architectures and the final selected configuration. We have also retained the ablation studies as evidence of our robust model development process.
Lines 325-328: This section describes the full methodological process of the proposed cross-lingual BER method. To ensure replicability and justify design decisions, ablation studies comparing alternative components are included. The final configuration used in the main experiments is explicitly indicated in each subsection.
Lines 385-388: Among the evaluated encoders, Mamba is selected for both audio and text modalities in the final model due to its superior performance on long sequences and computational efficiency, as detailed in Sections 4.3 and 4.4.
Comment 3: In subsection 3.5, the relationships of BiFormer, BiGraphFormer, BiGatedGraphFormer and BiMamba are not clearly illustrated. Why are so many models listed here? Which is the final selection for the proposed method?
Reply: We thank the reviewer for pointing out this comment. Please see the detailed response to comment 4. To resolve the ambiguity, we have added the following clarification to Section 3.5 (Bimodal Emotion Recognition) of the manuscript.
Lines 531-534: The BiFormer model is adopted as the primary fusion mechanism in the final model, as it achieves the best trade-off between performance and complexity across both research corpora. BiGraphFormer, BiGatedGraphFormer, and BiMamba are included for completeness and are not part of the main pipeline.
Comment 4: In Figure 5 and Figure 6, these parts are all isolated; it is hard to understand the relations between them.
Reply: We thank the reviewer for pointing out this comment. Figure 5 presents the three proposed BER models, where each subsequent model builds upon the previous one. To clarify their hierarchical relationship, we have visually distinguished the shared and inherited components using distinct colors and added a legend. We also revised the caption to explicitly describe the progression between the models.
Figure 6 illustrates the individual components used within the proposed models. The blocks in this figure are not directly connected to each other; instead, each corresponds to elements employed in the architectures shown in Figure 5. To avoid confusion, we have expanded the caption of Figure 6 to clearly explain this relationship.
Figure 5. Architectures of BER models. The architectures range from the simple BiFormer, which uses a Transformer-based cross-modal attention mechanism, to the hierarchical BiGatedGraphFormer, which combines Transformer-based, graph-based, and gate-based cross-modal mechanisms.
Figure 6. Components of the proposed architectures of BER models. The graph fusion layer is a component of the BiGraphFormer and BiGatedGraphFormer models. The gated fusion layer is a specific component of the BiGatedGraphFormer, while feature resampling is used in all models.
Comment 5: Although the authors give a lot of content in the experimental part, most of it consists of comparisons of different alternative models for intermediate steps, which is not very important for the main contributions of the paper.
Reply: We thank the reviewer for pointing out this comment. Please see the detailed response to comment 4. To resolve the ambiguity, we have added the following clarification to Section 3 (Proposed Method) of the manuscript.
Lines 325-328: This section describes the full methodological process of the proposed cross-lingual BER method. To ensure replicability and justify design decisions, ablation studies comparing alternative components are included. The final configuration used in the main experiments is explicitly indicated in each subsection.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The paper presents a Cross-Lingual Bimodal Emotion Recognition (BER) framework that combines audio and text modalities using Transformer-based cross-modal fusion and Mamba-based encoders. The study introduces several innovations. The paper is technically sound and relevant; however, some aspects must be enhanced. My observations are outlined below:
1. Keywords are too generic, please use more specific ones. Also, use expanded forms as keywords (for LLM).
2. It is recommended to avoid using acronyms in the abstract, use expanded forms.
3. I feel that the research gap should be emphasized better in the Introduction, to justify the need for the proposed approach.
4. Regarding the choice of the used architecture, what was the rationale for selecting Mamba as the core encoder compared to more established models (including Transformer, Conformer, LSTM)?
5. The authors should consider providing ablation study. The performance contribution of individual LLMs (in LS-LLM) is not quantitatively isolated. It is unclear which model yields the best label quality and why.
6. The authors should provide brief insights on computational efficiency of Mamba against Transformer models.
7. Discuss limitations in cross-lingual generalization beyond English–Russian. Broader cross-lingual validation (for example Arabic) would better demonstrate generalization, since both English and Russian are Indo-European languages.
8. Interpretability could be significantly improved. No confusion matrices or qualitative examples are provided to illustrate typical emotion misclassifications.
9. Which emotion categories are most frequently confused across languages, and how do cultural or linguistic factors contribute to these errors?
10. Regarding the computational efficiency, the authors should elaborate on how the training time and memory footprint of the Mamba-based model compare with those of Transformer or LSTM architectures.
11. Avoid using personal pronouns like we/our throughout the paper, since a scientific manuscript should be written in the third person and passive voice.
12. Thorough proofreading is highly recommended, as there are some errors and awkward constructions throughout the manuscript.
Author Response
We sincerely thank the reviewer for the valuable time and effort spent reviewing our manuscript. The comments are all valuable and very helpful for revising and improving our article, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope meet with approval.
The different fonts used in this document are as follows:
- Reviewers’ original comments are reproduced in blue font.
- Plain font indicates our answers to the reviewers’ comments.
- Text reproduced from the paper is shown in red font.
Comment 1: Keywords are too generic, please use more specific ones. Also, use expanded forms as keywords (for LLM).
Reply: We thank the reviewer for this valuable comment. The Keywords list has been carefully revised to include more specific and informative terminology that more accurately reflects the manuscript's methodological and architectural contributions. Abbreviations such as 'llm' have been replaced with their full forms ('large language model'), and additional keywords have been added to emphasize the proposed data augmentation and fusion strategies, as well as the underlying model architecture.
Lines 26-27: large language model; mamba architecture; stacked data sampling; template-based utterance generation; label smoothing generation
Comment 2: It is recommended to avoid using acronyms in the abstract, use expanded forms.
Reply: We thank the reviewer for pointing out this comment. All acronyms have been removed from the abstract, and only fully expanded forms are now used.
Lines 2-24: Bimodal emotion recognition based on audio and text is widely adopted in video-constrained real-world applications such as call centers and voice assistants. However, existing systems suffer from limited cross-domain generalization and monolingual bias. To address these limitations, a cross-lingual bimodal emotion recognition method is proposed, integrating Mamba-based temporal encoders for audio (Wav2Vec2.0) and text (Jina-v3) with a Transformer-based cross-modal fusion architecture (BiFormer). Three corpus-adaptive augmentation strategies are introduced: (1) Stacked Data Sampling, in which short utterances are concatenated to stabilize sequence length; (2) Label Smoothing Generation based on Large Language Model, where the Qwen3-4B model is prompted to detect subtle emotional cues missed by annotators, producing soft labels that reflect latent emotional co-occurrences; and (3) Text-to-Utterance Generation, in which emotionally labeled utterances are generated by ChatGPT-5 and synthesized into speech using the DIA-TTS model, enabling controlled creation of affective audio-text pairs without human annotation. BiFormer is trained jointly on the English Multimodal EmotionLines Dataset and the Russian Emotional Speech Dialogs corpus, enabling cross-lingual transfer without parallel data. Experimental results show that the optimal data augmentation strategy is corpus-dependent: Stacked Data Sampling achieves the best performance on short, noisy English utterances, while Label Smoothing Generation based on Large Language Model better captures nuanced emotional expressions in longer Russian utterances. Text-to-Utterance Generation does not yield a measurable gain due to current limitations in expressive speech synthesis. When combined, the two best performing strategies produce complementary improvements, establishing new state-of-the-art performance in both monolingual and cross-lingual settings.
Comment 3: I feel that the research gap should be emphasized better in the Introduction, to justify the need for the proposed approach.
Reply: We thank the reviewer for pointing out this comment. In the revised Introduction, we have strengthened the articulation of the research gap.
Line 55-68: However, despite their success, several key challenges persist. Firstly, many State-of-the-Art (SOTA) models are trained and evaluated on English corpora, which limits their effectiveness in multilingual contexts. The ability to use different languages simultaneously is important for a wide variety of multilingual human-computer interaction systems [33]. Secondly, modality fusion often lacks fine-grained alignment – most methods either fuse at a high level or focus on short-range dependencies, failing to capture nuanced cross-modal dynamics over time. Moreover, corpora have issues such as class imbalance, variation in utterance lengths, and limited emotional diversity, which can lead to overfitting and poor generalization in real-world applications. Finally, current data augmentation strategies tend to be generic, lacking task-specific adaptation or emotional semantics, particularly in low-resource or cross-lingual settings. Therefore, no existing method currently addresses the interaction of linguistic diversity, fine-grained cross-modal alignment across variable-length utterances, and context-specific data augmentation within a unified framework.
Comment 4: Regarding the choice of the used architecture, what was the rationale for selecting Mamba as the core encoder compared to more established models (including Transformer, Conformer, LSTM)?
Reply: We thank the reviewer for pointing out this comment. In our study, we compare four representative temporal architectures: LSTM, xLSTM, Transformer, and Mamba. We describe the strengths and limitations of each model in Section 3.1. Audio-based Emotion Recognition upon their first mention. These models were chosen to cover a wide range of sequence modeling approaches, from classical recurrent networks to modern attention- and state-space modeling-based techniques. This allows us to systematically evaluate their suitability for emotion recognition in different languages and modalities.
Line 345-357: The embeddings are fed into four temporal models: Long Short-Term Memory network (LSTM) [136], Extended LSTM (xLSTM) [137], Transformer [138] and Mamba [113]. The LSTM model captures temporal dependencies through recurrent connections, effectively modeling short- and mid-term contextual dynamics. However, its sequential nature limits parallelization and long-range dependency modeling. The xLSTM model modifies this structure by introducing multiplicative gating within its recurrent dynamics, which enhances gradient stability and allows for more efficient propagation of information over time. In contrast, Transformer leverages self-attention to capture global dependencies between all time steps simultaneously, enabling efficient context integration across long utterances, albeit with computational cost that grows quadratically with sequence length. Mamba is a recent state-space model that replaces explicit recurrence and attention with linear dynamical updates, providing efficient long-context modeling with linear time complexity and reduced parameter count.
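For clarity, the sketch below shows how frame-level embeddings could be passed through interchangeable temporal encoders of this kind, assuming PyTorch. Only the LSTM and Transformer variants are instantiated, since Mamba and xLSTM require their own packages; the helper name build_temporal_encoder is illustrative and not taken from the paper.

```python
# Minimal sketch of swapping temporal encoders over frame-level embeddings,
# assuming PyTorch. Mamba and xLSTM are omitted (separate packages).
import torch
import torch.nn as nn

def build_temporal_encoder(kind: str, d_model: int = 256):
    if kind == "lstm":
        lstm = nn.LSTM(d_model, d_model, batch_first=True)
        return lambda x: lstm(x)[0]               # (batch, T, d_model)
    if kind == "transformer":
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        return nn.TransformerEncoder(layer, num_layers=2)
    raise ValueError(f"unknown encoder: {kind}")

frames = torch.randn(2, 120, 256)                  # e.g., frame embeddings from an audio encoder
for kind in ("lstm", "transformer"):
    pooled = build_temporal_encoder(kind)(frames).mean(dim=1)   # utterance-level vector
    print(kind, pooled.shape)                      # torch.Size([2, 256])
```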
Comment 5: The authors should consider providing ablation study. The performance contribution of individual LLMs (in LS-LLM) is not quantitatively isolated. It is unclear which model yields the best label quality and why.
Reply: We thank the reviewer for pointing out this comment. In the original submission, the performance of individual LLMs within LS-LLM was presented alongside other augmentation methods in Figure 8, arranged by an aggregated ranking. This interleaved layout made it difficult to compare the LLMs directly and may have led to confusion. In response to this comment, we have revised the figure to separate method families into distinct groups: baseline, SDS ablation, TUG ablation, LS-LLM ablation, and best results. Within the LS-LLM group, we have grouped and ordered the results for each LLM for easier comparison. Color-coding and an updated legend help to visually distinguish between the groups while maintaining a compact presentation, which is especially important given the number of experiments included. The refined analysis shows that Qwen3-4B achieves the highest label quality, particularly at p=0.2, achieving the best UAR scores on both MELD and RESD corpora. We attribute this to its superior instruction-following ability and emotionally coherent label generation. These findings are presented in Section 4.5 Bimodal-based Emotion Recognition, specifically in Figure 8 and lines 660-666.
Lines 799-805: While the other LLMs, such as Phi-4-mini-instruct and Falcon-H1-3B-Instruct, also offer benefits compared to the baseline model, especially for MELD, they underperform relative to Qwen3-4B. This difference is due to the stronger ability of Qwen3-4B to follow instructions and its more nuanced representation of emotional context, which leads to higher-quality soft labels. This indicates that the quality of label smoothing depends on the LLM’s semantic understanding capabilities.
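To illustrate the general idea behind LS-LLM, the sketch below mixes a one-hot target with an LLM-derived emotion distribution using a smoothing factor p and trains with cross-entropy on the resulting soft labels. The mixing scheme and function name (llm_smoothed_loss) are assumptions for illustration only, not the exact formulation used in the paper.

```python
# Minimal sketch of label smoothing with an LLM-derived distribution,
# assuming PyTorch. The mixing scheme is an illustrative assumption.
import torch
import torch.nn.functional as F

def llm_smoothed_loss(logits, hard_labels, llm_dist, p=0.2):
    """logits: (B, C); hard_labels: (B,) class ids; llm_dist: (B, C) soft labels."""
    one_hot = F.one_hot(hard_labels, num_classes=logits.size(-1)).float()
    target = (1.0 - p) * one_hot + p * llm_dist     # convex mix of hard and soft targets
    return F.cross_entropy(logits, target)          # cross-entropy accepts soft targets

logits = torch.randn(4, 7, requires_grad=True)
hard = torch.tensor([0, 3, 3, 6])
llm = torch.softmax(torch.randn(4, 7), dim=-1)      # stand-in for an LLM-produced distribution
loss = llm_smoothed_loss(logits, hard, llm, p=0.2)
loss.backward()
```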
Comment 6: The authors should provide brief insights on computational efficiency of Mamba against Transformer models.
Reply: We thank the reviewer for pointing out this comment. To address this suggestion, we have included a comparison of computational efficiency between Mamba, Transformer, and other models in the revised manuscript. The updated Tables 5, 7, and 9 now report additional metrics such as the number of parameters, training time per epoch, and inference latency for all architectures. These measurements have been added to provide a clearer view of model efficiency.
Comment 7: Discuss limitations in cross-lingual generalization beyond English–Russian. Broader cross-lingual validation (for example Arabic) would better demonstrate generalization, since both English and Russian are Indo-European languages.
Reply: We thank the reviewer for pointing out this comment. We fully agree that evaluating the proposed method on languages other than English and Russian, particularly on typologically distant ones like Arabic, would provide a more robust validation. Our method uses the multilingual Jina-v3 encoder, which supports over 30 languages, including Arabic, and can in principle be applied to any language. However, conducting reliable evaluations on additional languages would require emotion-annotated data and validation by native speakers, which is beyond the scope of this work. Additionally, as non-native speakers of non-Indo-European languages, we are not able to confidently assess the cultural appropriateness of emotion labels and expressions in those languages. This limitation is acknowledged in Section 5.1. RQ1: Can a cross-lingual BER model improve generalization across languages?, lines 740-748, and we consider cross-lingual evaluation across diverse language families to be an important area for future research.
Lines 896-904: It should be noted that the model’s performance was evaluated on English and Russian, which are both Indo-European languages with relatively similar syntactic and semantic structures. Extending this model to other languages (such as Arabic, Korean and others) would require emotion-annotated data and dedicated experiments. This is an important area for future research. Nevertheless, the proposed model is based on the multilingual Jina-v3 encoder [67], which was pre-trained on over 30 languages and aligns semantically similar words across languages in a shared embedding space. Thus, the proposed cross-lingual BER model can, in principle, be applied to any supported language. However, without emotion-labeled data, it cannot reliably interpret culture-specific emotional expressions.
Comment 8: Interpretability could be significantly improved. No confusion matrices or qualitative examples are provided to illustrate typical emotion misclassifications.
Reply: We thank the reviewer for this valuable comment. In the revised version, we have incorporated interpretability analyses to better illustrate model behavior. Specifically, confusion matrices for both the MELD and RESD corpora and each model (audio-based, text-based, and bimodal-based) have been added to visualize the distribution of emotion predictions and identify typical misclassification patterns.
Line 639-648: Figure 8 shows the confusion matrices obtained by the audio-based model. Despite joint training, the performance of the model correlates with the distribution of training samples. In MELD, classes such as Disgust, Fear, and Sad, which are the least frequent, achieve the lowest recall values (0.0%, 2.0%, and 19.7%, respectively). Although RESD is a more balanced corpus, class imbalance still affects the recall for the following emotions: Disgust (48.6%), Neutral (52.6%), Sad (53.1%) and Surprise (52.5%). Regarding cross-lingual differences in acoustic expression, emotions in English tend to be confused with Anger and Happy more often, while in Russian, they are often misclassified as Fear. This may be due to language-specific prosodic patterns, such as the use of a higher pitch or more distinct intonation to convey emotional intensity.
Line 692-702: Figure 9 shows the confusion matrices obtained by the text-based model. The impact of class imbalance remains. In MELD, classes such as Disgust, Fear, and Sad achieve the lowest recall values (11.8%, 14.0%, and 23.1%, respectively), although these are higher than for the audio-based model. In RESD, Sad and Surprise have lower recognition, with recall values of 25.0% and 17.5%, respectively. Cross-class confusions differ across languages. In MELD, the model confuses Fear with Sad and Anger with Happy, whereas in RESD it confuses Fear with Surprise, as well as Anger with Fear. These patterns suggest that language- and culture-specific ways of expressing emotions affect model performance beyond mere data imbalance, particularly through lexical and phrasing conventions that link distinct emotions to similar textual cues in each language.
Line 811-819: The integration of the audio and text modalities improves overall performance on both corpora, as shown by the higher diagonal values and improved recall in Figure 11 compared to Figure 8 and Figure 9. For instance, Sad recall increases from 19.7% (by audio) and 23.1% (by text) to 35.1% (by bimodal) in MELD and from 25% (by text) and 53.1% (by audio) to 62.5% (by bimodal) in RESD. However, there are still language-specific patterns of confusion: in MELD, Fear is confused with Angry, and in RESD, Anger and Fear continue to be misclassified. This suggests that bimodal fusion helps but does not completely eliminate emotion confusion caused by differences in how emotions are expressed in language.
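For reference, the evaluation quoted above can be reproduced with standard tooling; the sketch below computes a row-normalized confusion matrix (per-class recall) and UAR with scikit-learn on toy arrays, which are placeholders rather than results from the paper.

```python
# Minimal sketch of per-corpus evaluation: row-normalized confusion matrix
# and UAR (mean per-class recall), using scikit-learn. Toy data only.
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = np.array([0, 1, 2, 2, 3, 3, 4, 5, 6, 6])   # gold emotion ids (7 classes)
y_pred = np.array([0, 1, 2, 3, 3, 3, 4, 5, 6, 0])   # model predictions

cm = confusion_matrix(y_true, y_pred, labels=list(range(7)), normalize="true")
uar = recall_score(y_true, y_pred, average="macro")  # UAR = unweighted average recall
print(np.round(cm * 100, 1))                         # per-class recall in percent
print(f"UAR: {uar:.3f}")
```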
Comment 9: Which emotion categories are most frequently confused across languages, and how do cultural or linguistic factors contribute to these errors?
Reply: We thank the reviewer for this valuable comment. In the revised manuscript, we have added confusion matrices for text-, audio-, and bimodal-based models (Figures 5, 7, and 9), which explicitly highlight the emotion categories most frequently confused across languages. As detailed in Response 8, we attribute these differences to language- and culture-specific expression styles, particularly in lexical choice (for text) and prosodic cues (for audio), which lead models to associate distinct emotions with similar surface forms in each language.
Comment 10: Regarding the computational efficiency, the authors should elaborate on how the training time and memory footprint of the Mamba-based model compare with those of Transformer or LSTM architectures.
Reply: In the revised manuscript, we provide a detailed comparison of computational efficiency across all architectures (see Tables 5, 7, and 9).
Line 623-630: Among the audio-based models, Transformer has the best computational efficiency, with the fewest parameters (0.53 M) and the fastest training time (50 seconds/epoch), despite its larger disk space requirement (9.6 MB). LSTM has the longest training time (91 seconds/epoch) and the highest number of parameters (1.58 M). xLSTM reduces both the size (1.18 M) and training time (73 seconds) compared to LSTM, making it more efficient. Mamba offers a smaller model (0.79 M parameters, 3 MB) and faster training (51 seconds) than LSTM, while closely matching Transformer’s efficiency and using significantly less storage space.
Line 668-675: The comparison of the computational cost of text-based models shows clear trade-offs. The Transformer has the smallest size (1.58 M parameters, 6.03 MB), followed by Mamba (3.28 M, 12.50 MB), xLSTM (8.15 M, 31.08 MB), and LSTM (12.61 M, 48.09 MB). Surprisingly, Mamba, despite its smaller size, is the slowest to train (392 seconds/epoch), significantly slower than even the recurrent models and Transformer (45 seconds/epoch). Although Mamba is often considered computationally efficient due to its linear-time architecture, the obtained results do not support this claim under the current experimental setup.
Line 736-739: In terms of computational cost, BiFormer is the most efficient (9 seconds/epoch, 4.40 M parameters). Graph-based variants, on the other hand, require more computational resources (10-11 seconds/epoch, up to 6.37 M parameters). BiMamba is the slowest to train (16 seconds/epoch) and uses more parameters (4.89 M) than BiFormer.
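The efficiency figures quoted above (parameter counts and seconds per epoch) can be measured as sketched below for any PyTorch model; the profile helper, model, and loader are placeholders, not the authors' training setup.

```python
# Minimal sketch of measuring parameter count and seconds per training epoch,
# assuming PyTorch. `model` and `loader` are placeholders.
import time
import torch

def profile(model, loader, device="cpu"):
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    opt = torch.optim.Adam(model.parameters())
    model.to(device).train()
    start = time.perf_counter()
    for audio, text, labels in loader:                 # one pass = one epoch
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(
            model(audio.to(device), text.to(device)), labels.to(device))
        loss.backward()
        opt.step()
    return n_params / 1e6, time.perf_counter() - start  # M parameters, seconds/epoch

# Example with the earlier CrossModalFusion sketch and a tiny synthetic loader:
# params_m, sec = profile(CrossModalFusion(), [(torch.randn(2, 80, 256),
#                                               torch.randn(2, 24, 256),
#                                               torch.tensor([0, 1]))])
```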
Comment 11: Avoid using personal pronouns like we/our throughout the paper, since a scientific manuscript should be written in the third person and passive voice.
Reply: We thank the reviewer for this important remark. All occurrences of personal pronouns we and our have been revised throughout the manuscript. Each instance was rewritten in passive voice or neutral third-person form to ensure consistency with formal academic writing conventions. The corresponding revisions are indicated in the text by red highlighting for clarity.
Comment 12: Thorough proofreading is highly recommended, as there are some errors and awkward constructions throughout the manuscript.
Reply: We thank the reviewer for this important remark. We have carefully reviewed the entire manuscript and made corrections for grammatical errors and awkward constructions. We have also revised the phrasing to improve clarity and fluency.
Author Response File:
Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have revised most of the issues based on the reviewer's comments.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have addressed the reviewer's comments.