Article

Cross-Corpus Speech Emotion Recognition Based on Attention-Driven Feature Refinement and Spatial Reconstruction

1
Key Laboratory of Grain Information Processing and Control, Ministry of Education, Henan University of Technology, Zhengzhou 450001, China
2
Henan Key Laboratory of Grain Photoelectric Detection and Control, Henan University of Technology, Zhengzhou 450001, China
3
School of Mechanical and Electrical Engineering, Zhengzhou Business University, Zhengzhou 451200, China
4
School of Information Science and Engineering, Southeast University, Nanjing 210096, China
5
Yunnan Chinese Language and Culture College, Yunnan Normal University, Kunming 650504, China
*
Author to whom correspondence should be addressed.
Information 2025, 16(11), 945; https://doi.org/10.3390/info16110945
Submission received: 8 September 2025 / Revised: 23 October 2025 / Accepted: 26 October 2025 / Published: 30 October 2025

Abstract

In cross-corpus scenarios, inappropriate feature-processing methods tend to cause the loss of key emotional information. Additionally, deep neural networks contain substantial redundancy, which triggers domain shift issues and impairs the generalization ability of emotion recognition systems. To address these challenges, this study proposes a cross-corpus speech emotion recognition model based on attention-driven feature refinement and spatial reconstruction. Specifically, the proposed approach consists of three key components: first, an autoencoder integrated with a multi-head attention mechanism to enhance the model’s ability to focus on the emotional components of acoustic features during the feature compression process of the autoencoder network; second, a feature refinement and spatial reconstruction module designed to further improve the extraction of emotional features, with a gating mechanism employed to optimize the feature reconstruction process; finally, the Charbonnier loss function adopted as the loss metric during training to minimize the difference between features from the source domain and target domain, thereby enhancing the cross-domain robustness of the model. Experimental results demonstrated that the proposed method achieved an average recognition accuracy of 46.75% across six sets of cross-corpus experiments, representing an improvement of 4.17% to 14.33% compared with traditional domain adaptation methods.

1. Introduction

Speech emotion recognition (SER) is a key component of human–computer interaction (HCI) and holds significant application value in fields such as intelligent customer service [1] and mental health assessment [2]. In recent years, deep-learning-based SER models [3,4] have achieved remarkable success in single-corpus scenarios, often reaching or exceeding human-level accuracy. However, in cross-corpus scenarios, SER systems still suffer from severe performance degradation due to domain shift [5], which mainly stems from differences in recording environments, linguistic and cultural backgrounds, and emotional expression styles [6]. Furthermore, deep neural networks tend to produce redundant representations that may encode paralinguistic or irrelevant acoustic factors (e.g., dialectal accents, speaking rhythms, background noise), thereby weakening model generalization across corpora [7].
Recently, domain adaptation (DA) methods have achieved significant progress in improving the robustness of SER systems, mainly through feature alignment, adversarial learning, and reconstruction strategies. For instance, Naeeni et al. [8] introduced a subspace learning–based domain adaptation approach that minimizes inter-corpus distribution mismatch. Latif et al. [3] proposed a self-supervised adversarial dual-discriminator network (ADDi/sADDi) to jointly enforce domain invariance and emotional discriminability in cross-corpus/cross-language SER. Similarly, attention-based fusion networks [9,10] and dual-stream cross-attention architectures [11] have further improved domain-invariant feature extraction and emotional discriminability. Nevertheless, current DA methods still face several challenges: feature alignment methods struggle with highly nonlinear domain shifts, adversarial approaches may suffer from unstable convergence, and reconstruction-based methods often risk losing discriminative emotion features.
To address the aforementioned issues, this paper proposes a cross-corpus SER model based on attention-driven feature refinement and spatial reconstruction. The proposed model is implemented through three sequential stages: First, an autoencoder is used to capture key information and reduce data dimensionality, and a multi-head attention mechanism is introduced to enhance the model’s ability to focus on the emotional components within acoustic features. Second, a feature refinement and spatial reconstruction unit is designed to further extract emotional features, while a gating mechanism is integrated to optimize the reconstruction process and dynamically select critical features. Finally, the Charbonnier loss function is adopted as the loss metric during training to minimize the discrepancy between features from the source domain and target domain, thereby improving the model’s generalization ability. Experimental results demonstrated that the proposed method exhibited significant performance advantages over traditional domain adaptation methods in cross-corpus scenarios.
The overall structure of this paper is organized as follows. Section 1 (Introduction) presents the research background, motivation, and main contributions. Section 2 (Methods) describes the proposed approach in detail, including the overall framework of the cross-corpus SER model based on attention-driven feature refinement and spatial reconstruction, feature processing and normalization, the design of the attention-driven feature refinement and spatial reconstruction module, and the loss and joint optimization strategy. Section 3 (Experiment) introduces the speech emotion corpora, experimental setup, and evaluation metrics. Section 4 (Results Analysis and Discussion) reports and discusses the experimental findings, including feature set comparisons, comparative experiments, ablation studies, parameter sensitivity analysis, and confusion matrix results. Section 5 (Conclusions) summarizes the paper and outlines potential directions for future research.

2. Methods

2.1. Framework of Cross-Corpus Speech Emotion Recognition Model Based on Attention-Driven Feature Refinement and Spatial Reconstruction

To further refine feature representations, eliminate substantial redundant information in the feature processing process, and enhance the discriminability of emotional features, a cross-corpus speech emotion recognition method based on attention-driven feature refinement and spatial reconstruction is proposed. This method can not only deeply refine feature representations but also remove redundant information, thereby effectively improving model performance. The specific model framework is illustrated in Figure 1.
In the attention feature refinement and spatial reconstruction (AFSR) module shown in Figure 1, the input features pass through the encoder, bottleneck, decoder, and fully connected (FC) layers and are finally normalized using Softmax for classification. In the encoder, “Linear” refers to the fully connected layer, “BatchNorm” to batch normalization, “ELU” to the exponential linear unit activation function, and “MHA” to multi-head attention.

2.2. Feature Processing and Normalization

Two speech corpora were selected as the training corpus and the test corpus, respectively. The original speech data underwent preprocessing: acoustic low-level descriptors were extracted from the speech segments, and statistical functionals were applied to the resulting values. The obtained values were concatenated into vectors, which served as the speech emotion features. Subsequently, the features were normalized; this study adopted the min-max normalization method, given by the following formula:
$$x' = \frac{x - \min}{\max - \min}$$
In the formula, $x$ denotes the value of a single feature, $x'$ its normalized value, $\min$ the minimum value of the column in which the feature is located, and $\max$ the maximum value of that column. For the experiments, the feature set specified by the INTERSPEECH 2010 Paralinguistic Challenge [12] was selected as the model input; it contains 1582 dimensions in total. It includes 34 basic low-level descriptors (LLDs), such as Mel-frequency cepstral coefficients (MFCCs) and line spectral pairs (LSPs), together with their 34 corresponding delta coefficients. Applying 21 statistical functionals to these low-level descriptors yields 1428 dimensions. In addition, applying 19 statistical functionals to 4 pitch-based low-level descriptors and their corresponding delta coefficients generates 152 dimensions. The F0 onset time and the utterance duration are taken as the last two features, giving the 1582-dimensional speech feature vector summarized in Table 1. To maintain consistency with other researchers and ensure experimental reproducibility, this study used the openSMILE open-source toolkit [13] to extract these features from the raw speech.
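As a concrete illustration of this preprocessing step, the following is a minimal NumPy sketch of column-wise min-max normalization applied to an utterance-by-feature matrix of IS10 features; the array shapes and variable names are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def min_max_normalize(features: np.ndarray) -> np.ndarray:
    """Column-wise min-max normalization of an (N, 1582) IS10 feature matrix."""
    col_min = features.min(axis=0, keepdims=True)
    col_max = features.max(axis=0, keepdims=True)
    # Guard against constant columns to avoid division by zero.
    denom = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (features - col_min) / denom

# Illustrative usage with randomly generated placeholder features.
source_feats = np.random.rand(535, 1582)   # e.g., one corpus worth of IS10 vectors
source_norm = min_max_normalize(source_feats)
assert source_norm.min() >= 0.0 and source_norm.max() <= 1.0
```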
To facilitate research and ensure the comparability of results, these features have been standardized through a series of organized challenges, establishing them as unified research benchmarks. The key feature sets are as follows: the INTERSPEECH 2009 Emotion Challenge Feature Set [14], which provides a basic emotion recognition feature set consisting mainly of low-level descriptors (LLDs) and their statistical features; the INTERSPEECH 2010 Paralinguistic Challenge Feature Set [12], which expands the 2009 feature set by adding more statistical functionals and low-level descriptors, offering more comprehensive acoustic coverage; the INTERSPEECH 2011 Speaker State Challenge Feature Set [15], which further extends the feature set with new speaker-state-related features such as jitter and shimmer; the INTERSPEECH 2012 and 2013 Challenge Feature Sets [16,17], which further expand and optimize the feature sets with additional advanced functionals and multimodal features; and the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [18], built on the Geneva Minimalistic Acoustic Parameter Set (GeMAPS), which adds more features related to emotion and non-verbal information, such as energy, fundamental frequency (F0) variation, and formants. These feature sets provide standardized feature selections for emotion recognition and non-verbal information processing, enabling the comparability of results across studies. Detailed information about the feature sets is presented in Table 2.

2.3. Attention-Driven Feature Refinement and Spatial Reconstruction

The model consists of a symmetric encoder–decoder. Features from the source domain and target domain undergo feature compression via the autoencoder network to extract key features and reduce redundancy. The encoding–decoding process is as follows:
$$h = f(W \cdot X + b), \qquad X' = f(W' \cdot h + b')$$
where X denotes the input sample feature; X′ denotes the output sample feature; f ( · ) denotes the activation function; W and b represent the weight matrix and bias in the encoding process, respectively; and W′ and b′ correspond to those in the decoding process, respectively. The mean-squared error (MSE) [19] is adopted as the reconstruction loss function, and its specific formulation is as follows:
$$L_{re1} = \frac{1}{N}\sum_{i=1}^{N}\left\| X_s^{i} - X_s'^{\,i} \right\|^{2}, \qquad L_{re2} = \frac{1}{N}\sum_{i=1}^{N}\left\| X_t^{i} - X_t'^{\,i} \right\|^{2}$$
where $X$ denotes the original input feature vector and $X'$ the feature vector reconstructed by the decoder, with the subscripts $s$ and $t$ indicating the source and target domains; $L_{re1}$ denotes the source-domain reconstruction loss and $L_{re2}$ the target-domain reconstruction loss. The reconstructed vectors have the same dimensionality as the original feature vectors.
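For illustration, the following PyTorch sketch shows a symmetric encoder–decoder of this kind together with the two MSE reconstruction losses; the layer widths, depth, and bottleneck size are assumptions chosen for readability rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Symmetric encoder-decoder; layer sizes here are illustrative placeholders."""
    def __init__(self, in_dim: int = 1582, hidden: int = 512, bottleneck: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden), nn.ELU(),
            nn.Linear(hidden, bottleneck), nn.BatchNorm1d(bottleneck), nn.ELU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, hidden), nn.BatchNorm1d(hidden), nn.ELU(),
            nn.Linear(hidden, in_dim),
        )

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)       # h  = f(W·X + b)
        x_rec = self.decoder(h)   # X' = f(W'·h + b')
        return h, x_rec

model = AutoEncoder()
mse = nn.MSELoss()
x_src, x_tgt = torch.randn(32, 1582), torch.randn(32, 1582)   # dummy source/target batches
_, rec_src = model(x_src)
_, rec_tgt = model(x_tgt)
L_re1 = mse(rec_src, x_src)   # source-domain reconstruction loss
L_re2 = mse(rec_tgt, x_tgt)   # target-domain reconstruction loss
```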
To further refine the extracted features, an attention mechanism is introduced to enhance the extraction of key features and suppress irrelevant ones. Specifically, the attention mechanism is embedded as a separate module into the linear layers of the encoder and decoder within the autoencoder, respectively. Given a normalized feature map X, we generate the query matrix Q, key matrix K, and value matrix V from X as follows:
$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$
The attention calculation can be defined as follows:
$$A = f\!\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V$$
where A denotes the estimated attention, B denotes the learnable relative positional bias, and f ( · ) is the scoring function. Notably, we perform weight calculation for different “heads” in parallel; these heads are concatenated and then fused via linear projection.
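A minimal PyTorch sketch of this multi-head attention with a learnable relative positional bias is given below; the head count, dimensions, and the treatment of the utterance-level feature as a short sequence are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Scaled dot-product attention with a learnable relative bias B (dimensions are illustrative)."""
    def __init__(self, dim: int = 128, heads: int = 4, seq_len: int = 1):
        super().__init__()
        self.heads, self.d = heads, dim // heads
        self.w_q, self.w_k, self.w_v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)                                  # fuses the concatenated heads
        self.bias = nn.Parameter(torch.zeros(heads, seq_len, seq_len))   # learnable relative bias B

    def forward(self, x: torch.Tensor) -> torch.Tensor:                 # x: (batch, seq, dim)
        b, n, _ = x.shape
        q = self.w_q(x).view(b, n, self.heads, self.d).transpose(1, 2)
        k = self.w_k(x).view(b, n, self.heads, self.d).transpose(1, 2)
        v = self.w_v(x).view(b, n, self.heads, self.d).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d ** 0.5 + self.bias     # QK^T / sqrt(d) + B
        attn = F.softmax(scores, dim=-1)                                 # f(.) as the scoring function
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)               # concatenate the heads
        return self.proj(out)                                            # linear projection / fusion

mha = MultiHeadAttention()
refined = mha(torch.randn(32, 1, 128))   # e.g., a bottleneck feature treated as a length-1 sequence
```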
After the autoencoder network integrated with the attention mechanism enhances the feature representation capability, the features output by the decoder are used to complete emotion classification on the source domain. The classification loss for this process is as follows:
$$L_{cls} = -\frac{1}{B}\sum_{i=1}^{B}\sum_{c=1}^{5} y_{ic}\,\log \hat{y}_{ic}$$
In the formula, $B$ represents the batch size during training, $y_{ic}$ equals 1 if sample $i$ belongs to emotion category $c$ and 0 otherwise, and $\hat{y}_{ic}$ denotes the predicted probability that sample $i$ belongs to the $c$-th emotion category. Through training on the source-domain input samples and their labels, the network’s capability to process emotional representations is enhanced.
For the intermediate feature R at the bottleneck layer, the redundancy of the features is exploited to further refine the spatial representation through the feature refinement and reconstruction unit. Specifically, this unit consists of two steps: separation and reconstruction (as shown in Figure 2). The proposed module was developed as an extension of the traditional normalization-based feature-weighting concept [20], with the aim of enhancing emotion-related feature discrimination in cross-domain scenarios. The purpose of separation is to isolate features rich in information from those carrying less information.
We use the scaling factors of the batch normalization (BN) layer to evaluate the information content of different features [21]. This normalization-based separation approach follows a general principle widely used in deep learning, and similar feature-weighting and separation concepts have been adopted in other domains such as image feature reconstruction [22]. Specifically, given an intermediate feature $R \in \mathbb{R}^{B \times C}$, where $B$ is the batch size and $C$ is the feature dimension, $R$ is standardized by subtracting the mean $\mu$ and dividing by the standard deviation $\sigma$:
$$R_{out} = BN(R) = \gamma\,\frac{R - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + \beta$$
where $\mu$ and $\sigma$ are the mean and standard deviation of $R$, $\varepsilon$ is a small positive constant added for numerical stability, and $\gamma$ and $\beta$ are trainable affine parameters. The normalized weights are calculated as follows:
$$W_{\gamma} = \{\omega_i\} = \left\{\frac{\gamma_i}{\sum_{j=1}^{C}\gamma_j}\right\}, \qquad i, j = 1, 2, \ldots, C$$
Then, the weights of the feature map weighted by W γ are mapped to the range of (0, 1) via the sigmoid function and gated using a threshold. We set the weights above the threshold to 1 to obtain the informative weights W 1 , and set those below the threshold to 0 to obtain the non-informative weights W 2 .
$$W = \mathrm{Gate}\big(\mathrm{Sigmoid}\big(W_{\gamma}\,(BN(R))\big)\big)$$
Finally, we multiply the input feature $R$ by $W_1$ and $W_2$, respectively, to obtain two weighted features: the information-rich feature $R_1^{\omega}$ and the information-poor feature $R_2^{\omega}$. To reduce spatial redundancy, we further apply a reconstruction operation. A straightforward option would be to add the information-rich and information-poor features directly to produce a refined feature while saving space; instead of this direct addition, however, this paper adopts a cross-reconstruction method that fully combines the two weighted features and enhances the information flow between them. Subsequently, we concatenate the cross-reconstructed features $R^{\omega 1}$ and $R^{\omega 2}$ to obtain the spatially refined feature $R^{\omega}$. The entire reconstruction operation can be expressed as follows:
$$R_1^{\omega} = W_1 \otimes R, \quad R_2^{\omega} = W_2 \otimes R, \quad R_{11}^{\omega} \oplus R_{22}^{\omega} = R^{\omega 1}, \quad R_{21}^{\omega} \oplus R_{12}^{\omega} = R^{\omega 2}, \quad R^{\omega 1} \cup R^{\omega 2} = R^{\omega}$$
where ⊗ denotes element-wise multiplication, ⊕ denotes element-wise addition, and ∪ denotes concatenation. After applying the feature refinement and spatial reconstruction module to the intermediate input feature R, we not only separate the information-rich features from the information-poor features but also reconstruct them to enhance representative features and suppress redundant features in the spatial dimension.
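The following PyTorch sketch illustrates this separate-and-reconstruct unit under a few assumptions not fixed by the text: the gate threshold is set to 0.5, the BN scaling factors are taken in absolute value, and each weighted feature is split in half along the feature dimension for the cross-reconstruction.

```python
import torch
import torch.nn as nn

class FeatureRefineReconstruct(nn.Module):
    """Separation + cross-reconstruction of a bottleneck feature R (threshold and split are assumptions)."""
    def __init__(self, dim: int = 128, threshold: float = 0.5):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)
        self.threshold = threshold

    def forward(self, r: torch.Tensor) -> torch.Tensor:   # r: (batch, dim)
        r_bn = self.bn(r)
        gamma = self.bn.weight.abs()
        w_gamma = gamma / gamma.sum()                      # normalized BN scaling factors W_gamma
        w = torch.sigmoid(w_gamma * r_bn)                  # map weighted features into (0, 1)
        w1 = (w >= self.threshold).float()                 # informative weights W_1
        w2 = (w < self.threshold).float()                  # non-informative weights W_2
        r1, r2 = w1 * r, w2 * r                            # information-rich / information-poor features
        r11, r12 = torch.chunk(r1, 2, dim=1)               # split each part for cross-reconstruction
        r21, r22 = torch.chunk(r2, 2, dim=1)
        rw1, rw2 = r11 + r22, r21 + r12                    # cross-wise addition
        return torch.cat([rw1, rw2], dim=1)                # concatenation -> spatially refined R^w

unit = FeatureRefineReconstruct()
refined = unit(torch.randn(32, 128))                       # dummy bottleneck batch
```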

2.4. Loss and Joint Optimization

The distance metric loss adopted in this paper is the Charbonnier loss, also known as Charbonnier distance or a smooth variant of L1 loss. Mathematically, the Charbonnier loss is defined as follows:
$$\ell\big(I, \hat{I}\big) = \sqrt{\left\| I - \hat{I} \right\|^{2} + \varepsilon^{2}}$$
where $I$ represents the true value, $\hat{I}$ represents the predicted value, and $\varepsilon$ is a very small constant used to ensure the differentiability of the function; it avoids a zero gradient when $I = \hat{I}$ and is usually set to $10^{-3}$. Here, the loss is used to measure the feature difference between the source domain and the target domain. It combines the robustness of the L1 loss with the smoothness of the L2 loss, which makes optimization more stable.
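A minimal sketch of this distance is shown below, assuming the element-wise Charbonnier penalty is averaged over a batch of paired source- and target-domain features (the exact pairing and aggregation are not specified in the text).

```python
import torch

def charbonnier_loss(source_feat: torch.Tensor, target_feat: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Charbonnier distance: a smooth, differentiable variant of the L1 loss."""
    return torch.sqrt((source_feat - target_feat) ** 2 + eps ** 2).mean()

L_charb = charbonnier_loss(torch.randn(32, 128), torch.randn(32, 128))   # dummy feature batches
```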
This paper uses the stochastic gradient descent (SGD) optimizer to optimize the model. The model is trained on the input data, and parameters are continuously updated to obtain the optimal result. The joint loss is constructed as follows:
$$L_{all} = \gamma L_{re1} + \lambda L_{re2} + \alpha L_{cls} + \beta L_{l}$$
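Putting the terms together, the sketch below shows one SGD update under this joint objective; the weighting coefficients are placeholders (their sensitivity is analyzed in Section 4.4), and the dummy module stands in for the full model described above.

```python
import torch
import torch.nn as nn

def joint_loss(l_re1, l_re2, l_cls, l_charb, gamma=1.0, lam=1.0, alpha=1.0, beta=1.0):
    """L_all = gamma*L_re1 + lambda*L_re2 + alpha*L_cls + beta*L_l (coefficients are placeholders)."""
    return gamma * l_re1 + lam * l_re2 + alpha * l_cls + beta * l_charb

# One illustrative SGD step; the four scalar losses here are dummies standing in for the
# reconstruction, classification, and Charbonnier terms computed by the modules sketched above.
dummy_model = nn.Linear(16, 1)
parts = [dummy_model(torch.randn(8, 16)).pow(2).mean() for _ in range(4)]
optimizer = torch.optim.SGD(dummy_model.parameters(), lr=0.01)
loss = joint_loss(*parts)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```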

3. Experiment

3.1. Speech Emotion Corpus

To evaluate the performance of the proposed model, extensive experiments were conducted using the Berlin Emotional Speech Corpus [23], the eNTERFACE Speech Emotion Corpus [24], and the CASIA Chinese Speech Emotion Corpus [25]. These three corpora exhibit substantial diversity in language, recording conditions, and emotional category distributions, and they have been widely adopted in previous cross-corpus speech emotion recognition studies, providing a reliable basis for evaluating the cross-domain generalization capability of the proposed model. Although other speech emotion datasets such as SAVEE and AESDD also contain aligned emotional categories, they were not included in this study: their relatively small sample sizes could lead to unstable model training and lower statistical reliability in cross-corpus evaluations, and their unbalanced gender and speaker distributions limit their representativeness for robust domain generalization research. The basic information of the three selected corpora is summarized in Table 3.
The Berlin corpus is a German emotional speech corpus recorded at the Technical University of Berlin and is one of the most widely used corpora in speech emotion recognition. It was created by 10 actors simulating 7 types of emotion across 10 sentences, and 535 valid speech samples were retained after auditory perception tests. The eNTERFACE corpus is an audio-visual emotional dataset recorded in English by 42 participants from 14 countries, with a total of 1287 speech samples. The CASIA Chinese Emotional Speech Corpus was recorded by the Institute of Automation, Chinese Academy of Sciences (CASIA), with 4 professional speakers and a total of 1200 speech samples.

3.2. Experimental Setup and Selection of Evaluation Metrics

Based on the three speech emotion corpora, six groups of cross-corpus speech emotion recognition tasks were designed in the experiment. For each group of cross-corpus speech emotion recognition tasks, the common emotions between the training corpus and the test corpus were selected for evaluation. The specific task settings are shown in Table 4.
The learning rate and batch size were set to 0.01 and 32, respectively, and the number of iterations was set to 5040. The unweighted average recall (UAR) was used as the evaluation metric to assess the performance of the different models.
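For reference, UAR is the unweighted mean of the per-class recalls, so every shared emotion class contributes equally regardless of its sample count; a minimal NumPy sketch:

```python
import numpy as np

def unweighted_average_recall(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """UAR: mean of per-class recalls, giving each emotion class equal weight."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

# Illustrative call with dummy label arrays (5 shared emotion classes).
y_true = np.random.randint(0, 5, size=200)
y_pred = np.random.randint(0, 5, size=200)
print(unweighted_average_recall(y_true, y_pred))
```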

4. Results Analysis and Discussion

4.1. Comparison of Feature Sets

To demonstrate the performance of our algorithm across different feature sets, widely used speech feature sets were selected, including those provided by the INTERSPEECH 2009–2013 challenges [12,14,15,16,17] and the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [18]. Detailed information about these feature sets is given in Table 2. The specific experimental results are shown in Table 5.
The following conclusions can be drawn from the analysis of these results. First, the proposed algorithm achieved reasonable average recognition rates with every feature set, which further supports its effectiveness. Second, although increasing the feature dimensionality is generally considered helpful for recognition performance, this effect was not pronounced in cross-corpus emotion recognition, likely because different feature sets were designed for different recognition tasks, such as emotion recognition or speaker recognition; selecting an appropriate feature set is therefore crucial. Compared with the smaller feature sets (such as IS09 and eGeMAPS), the larger feature sets (e.g., IS10) performed better, possibly because they contain more features that are useful for the task, or because certain feature combinations reinforce the recognition effect. Considering the overall recognition performance, the IS10 feature set performed best, so it was selected as the main feature set for the subsequent experiments.

4.2. Comparative Experiments

To comprehensively evaluate the proposed algorithm, its experimental results were compared with those of other algorithms in the field, including traditional algorithms and deep learning algorithms. The experimental results are shown in Table 6. The proposed algorithm achieved the best average recognition rate across the six groups of experiments, with a final average recognition rate of 46.75%, which is 4.17–14.33% higher than that of the other algorithms. This indicates that the proposed AFSR model can more effectively capture emotional features and suppress the interference of noise and irrelevant information in cross-corpus speech emotion recognition tasks.
Baseline algorithms
(1) SVM: Support Vector Machine. Its core goal is to find an optimal hyperplane in the feature space to maximize the margin between samples of different classes, with the linear kernel selected as the kernel function.
(2) TCA: By finding a common feature subspace between two domains, it reduces the difference between them, thereby improving the generalization ability of the model on the target domain.
Transfer learning algorithms
(1) TSDSL [26]: It learns a common feature subspace across different corpora by introducing discriminative learning and norm penalty, thereby obtaining the most discriminative features.
(2) JDAR [27]: It learns a regression matrix by jointly considering the marginal probability distribution and conditional probability distribution between the training and test speech signals. This alleviates the difference in their feature distributions in the subspace spanned by the learned regression matrix.
Deep domain adaptation algorithms
(1) DASA [28]: It acquires low-dimensional emotional information with strong representation through a deep autoencoder, and then performs feature alignment in the low-dimensional space by combining subdomain adaptation.
(2) DANN [29]: It uses adversarial multi-task training to extract common representations between the source domain and target domain.

4.3. Ablation Experiments

To verify the contribution of each component of the proposed algorithm and evaluate the overall effect of the model, a set of ablation experiments was designed, as shown in Figure 3. Three configurations were considered: “no FRSR” omits the feature refinement and spatial reconstruction module, “no re” omits the attention-integrated autoencoder reconstruction loss, and “AFSR” is the full proposed algorithm.
The experimental results show that the proposed algorithm achieved the best performance in all six groups of experiments, whereas omitting either component led to a decline in performance. This result not only verifies the effectiveness of the proposed algorithm but also highlights the key roles of the feature refinement and spatial reconstruction module and of the attention-integrated autoencoder reconstruction loss. Specifically, the feature refinement and spatial reconstruction module further refines the features output by the autoencoder and optimizes the feature reconstruction process, which helps enhance the model’s discriminative and generalization abilities. The attention-integrated autoencoder reconstruction loss, in turn, enables the model to better capture key information and compress the data dimensionality. This finding emphasizes the importance of the collaborative work of the modules and their joint contribution to cross-corpus speech emotion recognition performance.

4.4. Parameter Sensitivity Analysis

To further analyze the stability of the proposed algorithm, an analysis of loss parameters was conducted. Figure 4 presents the radar charts for parameter stability analysis, where (a), (b), and (c) correspond to the radar charts for the classification loss parameter α , the feature distribution distance loss parameter β , and the reconstruction loss parameters γ and λ , respectively.
First, from an overall perspective, adjustments to parameters had differential impacts on model performance, and different cross-corpus tasks exhibited significant differences in parameter sensitivity. This indicates that parameter adjustment is crucial for the stability and accuracy of the model. Next, we analyzed each parameter individually. For parameter α in (a), it can be clearly observed that different tasks and different parameter settings had a significant impact on model performance, with obvious fluctuations in the curves. This suggests that the stability of this parameter is poor. For parameter β in (b), curves such as the purple one (C-e) and the yellow one (e-B) remain flat in most intervals, and only the cyan curve (e-C) shows a sharp rise when β = 100. This indicates that the model has strong robustness to the β parameter. For parameters γ and λ in (c), the six cross-corpus task curves remain highly overlapping when the two parameters change. Particularly within the range of 1–100, the spacing between lines of different colors is less than the 10% scale range. This strong stability also indicates that γ and λ are non-sensitive parameters of the model.

4.5. Confusion Matrix

To evaluate the classification discriminability of the proposed model across different emotion categories, a confusion matrix-based experimental analysis was conducted. Figure 5 presents the heatmaps under the AFSR framework, where AN, DI, FE, HA, SA, NE, and AM denote anger, disgust, fear, happiness, sadness, neutral, and surprise, respectively.
From Figure 5, it can be observed that the model demonstrated the most stable recognition performance for sadness in most tasks, with the darkest diagonal regions, indicating a higher recognition rate under cross-corpus conditions. Sad speech typically exhibits a low pitch, slow speech rate, and energy concentrated in lower frequency ranges, features that are relatively consistent across corpora, allowing the model to generalize effectively. In the C→B and e→B tasks, anger also showed strong discriminability, suggesting that the AFSR model effectively captures the energy concentration and dynamic speech changes characteristic of high-arousal emotions. Similarly, surprise (AM) achieved relatively high recognition rates in certain tasks, indicating the model’s ability to recognize emotions characterized by strong acoustic fluctuations.
However, the model’s discriminability for mid-arousal emotions was weaker. Disgust and fear exhibited generally low recognition rates and were frequently misclassified as neutral or sadness. This suggests that the model still struggles to distinguish emotions with acoustically similar and ambiguous boundaries. Particularly in the B→e and C→e tasks, confusion between happiness and fear was evident. These two emotions shared similar high-frequency energy distribution and excitation levels, increasing the difficulty of accurate classification by the model.
To quantitatively analyze the results of the confusion matrix experiments, Table 7 presents the macro-averaged precision (Macro-Precision), macro-averaged recall (Macro-Recall), and macro-averaged F1-score (Macro-F1) calculated from the confusion matrix data across six cross-corpus tasks. These metrics provide a statistical evaluation of the model’s classification performance across different corpus combinations. The results indicate that the model’s performance varies significantly depending on the training and testing corpus pairs.
Overall, the C→B configuration had the highest macro-F1 score, suggesting that the model trained on Corpus C generalizes well when evaluated on Corpus B, indicating that the emotional and prosodic feature distributions between the two corpora are relatively consistent. Similarly, the e→B configuration also performed well, further demonstrating that Corpus B shares strong transferable acoustic features with other corpora. In contrast, the C→e and B→e experiments produced the lowest macro-F1 scores, indicating a significant domain discrepancy between Corpus e and the other corpora. This difference may arise from variations in recording environments, emotional expression intensity, or speaker distributions, which reduce the model’s ability to transfer across domains. Additionally, the macro-precision and macro-recall values in most experiments are relatively close, suggesting that the model maintains a relatively balanced prediction tendency during classification, with no significant bias toward any specific emotion class.
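For completeness, a minimal NumPy sketch of how the macro-averaged metrics in Table 7 can be derived from a confusion matrix is shown below; the convention that rows correspond to true labels and columns to predictions is an assumption.

```python
import numpy as np

def macro_metrics(cm: np.ndarray):
    """Macro precision/recall/F1 from a confusion matrix (rows = true labels, cols = predictions)."""
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # per-class precision
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # per-class recall
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision.mean(), recall.mean(), f1.mean()

# Illustrative 5x5 confusion matrix for five shared emotion classes.
cm = np.random.randint(0, 30, size=(5, 5))
print(macro_metrics(cm))
```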

5. Conclusions

To reduce the substantial redundancy in deep neural networks, this paper proposes a cross-corpus speech emotion recognition method based on attention-driven feature refinement and spatial reconstruction, with the aim of reducing redundancy and improving cross-domain emotion recognition performance. First, the autoencoder module effectively captures key information and compresses the data dimensionality, reducing the negative impact of redundant features on model performance. Second, the multi-head attention mechanism enhances the model’s ability to focus on the emotional components of acoustic features while suppressing interference from noise and irrelevant information. Additionally, the feature refinement and spatial reconstruction unit further optimizes the feature representation, improving the model’s discriminative ability. Finally, the introduction of the Charbonnier loss function further enhances the model’s generalization ability and robustness, enabling it to adapt to different datasets and emotion categories.

Author Contributions

Conceptualization, H.T. and Y.J.; methodology, Y.J. and Q.L.; software, Y.J.; validation, Q.L. and Y.J.; formal analysis, H.T., L.Z. and Z.Y.; investigation, Y.J.; resources, Z.Y.; data curation, Y.J.; writing—original draft preparation, Y.J.; writing—review and editing, H.T.; visualization, Q.L.; supervision, L.Z.; project administration, Z.Y.; funding acquisition, H.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research project was funded in part by Innovative Funds Plan of Henan University of Technology (2022ZKCJ13), Open Project of Scientific Research Platform of Henan University of Technology Grain Information Processing Center (no. KFJJ2023011), Natural Science Project of Henan Provincial Department of Science and Technology, Technology Research Projects (No. 242102211027), and Fund of the Institute of Complexity Science, Henan University of Technology (CSKFJJ-2025-49).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in public repositories. These data were derived from the following resources available in the public domain: Berlin Database of Emotional Speech (Emo-DB), available at http://emodb.bilderbar.info/docu/#emodb (accessed on 7 September 2025); eNTERFACE’05 Audio-Visual Emotion Database, available at https://enterface.net/enterface05/emotion.html (accessed on 7 September 2025); and CASIA Chinese Emotional Speech Database, available at https://gitcode.com/open-source-toolkit/115f2 (accessed on 7 September 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, S.; Liu, R.; Tao, X.; Zhao, X. Deep Cross-Corpus Speech Emotion Recognition: Recent Advances and Perspectives. Front. Neurorobot. 2021, 15, 784514. [Google Scholar] [CrossRef] [PubMed]
  2. Singh, J.; Saheer, L.B.; Faust, O. Speech Emotion Recognition Using Attention Model. Int. J. Environ. Res. Public Health 2023, 20, 5140. [Google Scholar] [CrossRef] [PubMed]
  3. Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Schuller, B. Self-Supervised Adversarial Domain Adaptation for Cross-Corpus and Cross-Language Speech Emotion Recognition. arXiv 2022, arXiv:2204.08625. [Google Scholar] [CrossRef]
  4. Tao, H.; Yu, H.; Liu, M.; Fu, H.L.; Zhu, C.H.; Xie, Y. A Semi-Supervised High-Quality Pseudo Labels Algorithm Based on Multi-Constraint Optimization for Speech Deception Detection. Comput. Speech Lang. 2024, 85, 101586. [Google Scholar]
  5. Cao, X.; Jia, M.; Ru, J.; Li, Y.; Zhang, S. Cross-Corpus Speech Emotion Recognition Using Subspace Learning and Domain Adaptation. J. Audio Speech Music Process. 2022, 2022, 32. [Google Scholar] [CrossRef]
  6. Yang, J.; Liu, J.; Huang, K.; Xia, J.; Zhu, Z.; Zhang, H. Single- and Cross-Lingual Speech Emotion Recognition Based on WavLM Domain Emotion Embedding. Electronics 2024, 13, 1380. [Google Scholar] [CrossRef]
  7. Pastor, M.A.; Ribas, D.; Ortega, A.; Miguel, A.; Lleida, E. Cross-Corpus Training Strategy for Speech Emotion Recognition Using Self-Supervised Representations. Appl. Sci. 2023, 13, 9062. [Google Scholar] [CrossRef]
  8. Naeeni, N.; Nasersharif, B. Feature and Classifier-Level Domain Adaptation in DistilHuBERT for Cross-Corpus Speech Emotion Recognition. Comput. Biol. Med. 2025, 194, 110510. [Google Scholar] [CrossRef] [PubMed]
  9. Naderi, N.; Nasersharif, B. Cross-Corpus Speech Emotion Recognition Using Transfer Learning and Attention-Based Fusion of Wav2Vec2 and Prosody Features. Knowl.-Based Syst. 2023, 277, 110814. [Google Scholar] [CrossRef]
  10. Jiang, P.; Xu, X.; Tao, H.; Zhao, L.; Zou, C. Convolutional-Recurrent Neural Networks with Multiple Attention Mechanisms for Speech Emotion Recognition. IEEE Trans. Cogn. Dev. Syst. 2021, 14, 1564–1573. [Google Scholar] [CrossRef]
  11. Yu, S.; Meng, J.; Fan, W.; Chen, Y.; Zhu, B.; Yu, H.; Xie, Y.; Sun, Q. Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion. Electronics 2024, 13, 2191. [Google Scholar] [CrossRef]
  12. Schuller, B.; Steidl, S.; Batliner, A.; Burkhardt, F.; Devillers, L.; Müller, C.; Narayanan, S.S. The Interspeech 2010 paralinguistic challenge. In Proceedings of the Interspeech 2010, Makuhari, Japan, 26–30 September 2010; pp. 2794–2797. [Google Scholar]
  13. Eyben, F.; Wollmer, M.; Schuller, B. Opensmile: The munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy, 25–29 October 2010; pp. 1459–1462. [Google Scholar]
  14. Schuller, B.; Steidl, S.; Batliner, A. The Interspeech 2009 Emotion Challenge. In Proceedings of the INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, Brighton, UK, 6–10 September 2009; pp. 312–315. [Google Scholar] [CrossRef]
  15. Schuller, B.; Steidl, S.; Batliner, A.; Schiel, F.; Krajewski, J. The Interspeech 2011 Speaker State Challenge. In Proceedings of the 12th Annual Conference of the International Speech Communication Association, Florence, Italy, 27–31 August 2011; pp. 3201–3204. [Google Scholar]
  16. Schuller, B.; Steidl, S.; Batliner, A.; Nöth, E.; Vinciarelli, A.; Burkhardt, F.; van Son, R.; Weninger, F.; Eyben, F.; Bocklet, T.; et al. The Interspeech 2012 Speaker Trait Challenge. In Proceedings of the 13th Annual Conference of the International Speech Communication Association, Portland, OR, USA, 9–13 September 2012. [Google Scholar]
  17. Schuller, B.; Steidl, S.; Batliner, A.; Vinciarelli, A.; Scherer, K.; Ringeval, F.; Chetouani, M.; Weninger, F.; Eyben, F.; Marchi, E.; et al. The Interspeech 2013 Computational Paralinguistics Challenge: Social Signal, Conflict, Emotion, Autism. In Proceedings of the 14th Annual Conference of the International Speech Communication Association, Lyon, France, 25–29 August 2013; pp. 148–152. [Google Scholar]
  18. Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Trans. Affect. Comput. 2015, 7, 190–202. [Google Scholar] [CrossRef]
  19. Barkan, O.; Tsiris, D. Deep Synthesizer Parameter Estimation. In Proceedings of the ICASSP 2019—IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 3887–3891. [Google Scholar]
  20. Wu, X.; Xu, X.; Liu, J.; Wang, H.; Hu, B.; Nie, F. Supervised Feature Selection with Orthogonal Regression and Feature Weighting. arXiv 2019, arXiv:1910.03787. [Google Scholar] [CrossRef] [PubMed]
  21. Li, J.; Wen, Y.; He, L. SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 6153–6162. [Google Scholar]
  22. Yao, J.; Zhu, Z.; Yuan, M.; Li, L.; Wang, M. The Detection of Maize Leaf Disease Based on an Improved Real-Time Detection Transformer Model. Symmetry 2025, 17, 808. [Google Scholar] [CrossRef]
  23. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.; Weiss, B. A Database of German Emotional Speech. In Proceedings of the Interspeech 2005-Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, 4–8 September 2005; Volume 5, pp. 1517–1520. [Google Scholar]
  24. Martin, O.; Kotsia, I.; Macq, B.; Pitas, I. The eNTERFACE’05 Audiovisual Emotion Database. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW’06), Washington, DC, USA, 3–7 April 2006; p. 8. [Google Scholar]
  25. Tao, J.; Liu, F.; Zhang, M.; Jia, H. Design of Speech Corpus for Mandarin Text-to-Speech. In Proceedings of the Blizzard Challenge 2008 Workshop, Brisbane, Australia, 21 September 2008. [Google Scholar]
  26. Zhang, W.; Song, P. Transfer Sparse Discriminant Subspace Learning for Cross-Corpus Speech Emotion Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 28, 307–318. [Google Scholar] [CrossRef]
  27. Zhang, J.; Jiang, L.; Zong, Y.; Zheng, W.; Zhao, L. Cross-Corpus Speech Emotion Recognition Using Joint Distribution Adaptive Regression. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3790–3794. [Google Scholar]
  28. Zhuang, Z.H.; Fu, H.L.; Tao, H.W. Cross-Corpus Speech Emotion Recognition Based on Deep Autoencoder Subdomain Adaptation. J. Comput. Appl. Res. (J. Comput. Appl.) 2021, 38, 3279–3282+3348. [Google Scholar]
  29. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain-Adversarial Training of Neural Networks. In Advances in Computer Vision and Pattern Recognition; Springer: Cham, Switzerland, 2017; pp. 189–209. [Google Scholar]
Figure 1. Framework of cross-corpus speech emotion recognition model based on attention-driven feature refinement and spatial reconstruction.
Figure 2. Feature refinement and spatial reconstruction.
Figure 3. Ablation experiments.
Figure 4. Parameter sensitivity analysis.
Figure 5. Confusion matrix visualization results.
Table 1. 1582-dimensional features from the INTERSPEECH 2010 Paralinguistic Challenge specified feature set.
No | Speech Feature | Quantity
1 | Loudness | 42
2 | MFCC | 630
3 | Log Mel bands | 336
4 | LSP frequencies | 336
5 | F0 envelope | 42
6 | Voiced frequency distribution | 42
7 | F0 fundamental frequency | 38
8 | Local jitter | 38
9 | Consecutive jitter frame pairs | 38
10 | Local shimmer | 38
11 | F0 onset time | 1
12 | Duration | 1
Table 2. Speech feature sets.
Year | Feature Set Name | Global Feature Dimensionality | Temporal Feature Dimensionality
2009 | INTERSPEECH 2009 Emotion Challenge Feature Set | 384 | 32
2010 | INTERSPEECH 2010 Paralinguistic Challenge Feature Set | 1582 | 76
2011 | INTERSPEECH 2011 Speaker State Challenge Feature Set | 4368 | 120
2012 | INTERSPEECH 2012 Speaker Trait Challenge Feature Set | 5757 | 120
2013 | INTERSPEECH 2013 ComParE Emotion Sub-Challenge Feature Set | 6373 | 130
2022 | eGeMAPS Feature Set | 88 | 25
Table 3. Basic information of the three speech corpora.
No | Speech Emotion Corpus | Language | Number of Speech Samples | Emotion Categories
1 | Berlin | German | 535 | 7
2 | eNTERFACE | English | 1287 | 6
3 | CASIA | Chinese | 1200 | 6
Table 4. Setup of cross-corpus speech emotion recognition tasks.
Source Domain | Target Domain | Shared Emotion Types
eNTERFACE (e) | Berlin (B) | Anger, disgust, fear, joy, sadness
Berlin (B) | eNTERFACE (e) | Anger, disgust, fear, joy, sadness
Berlin (B) | CASIA (C) | Anger, fear, joy, neutral, sadness
CASIA (C) | Berlin (B) | Anger, fear, joy, neutral, sadness
eNTERFACE (e) | CASIA (C) | Anger, fear, joy, sadness, surprise
CASIA (C) | eNTERFACE (e) | Anger, fear, joy, sadness, surprise
Table 5. Comparison of UAR percentages across feature sets.
Feature Set | B-C | B-e | C-e | C-B | e-B | e-C | UAR
IS09 | 37.33 | 35.78 | 29.28 | 41.01 | 55.11 | 34.28 | 38.80
IS10 | 44.96 | 44.31 | 36.84 | 62.91 | 58.48 | 35.10 | 47.10
IS11 | 33.51 | 38.09 | 32.39 | 24.17 | 52.17 | 32.03 | 35.39
IS12 | 31.63 | 39.93 | 33.75 | 24.34 | 55.34 | 35.69 | 37.10
IS13 | 41.95 | 38.98 | 34.95 | 58.00 | 55.44 | 37.97 | 44.55
eGeMAPS | 42.86 | 36.34 | 36.90 | 55.24 | 48.81 | 39.93 | 43.35
Table 6. UAR percentages for each cross-corpus task.
Algorithm | B-C | B-e | C-e | C-B | e-B | e-C | UAR
SVM | 37.80 | 32.47 | 25.69 | 44.12 | 32.00 | 27.40 | 33.25
TCA | 37.70 | 31.23 | 26.02 | 39.50 | 33.68 | 26.40 | 32.42
TSDSL | 37.40 | 35.44 | 33.25 | 56.74 | 47.41 | 32.50 | 40.46
JDAR | 38.60 | 38.14 | 28.43 | 49.58 | 48.74 | 30.30 | 38.97
DANN | 42.89 | 36.53 | 29.17 | 57.64 | 52.67 | 36.60 | 42.58
DASA | 41.40 | 40.11 | 32.09 | 51.47 | 52.35 | 36.10 | 42.25
AFSR | 44.96 | 44.31 | 36.84 | 62.91 | 58.48 | 35.10 | 47.10
Table 7. Quantitative analysis based on the confusion matrix.
Cross-Corpus Setting | Macro-Precision | Macro-Recall | Macro-F1
B → C | 0.4304 | 0.4343 | 0.4266
B → e | 0.3886 | 0.4028 | 0.3850
C → B | 0.6004 | 0.5710 | 0.5753
C → e | 0.3266 | 0.3249 | 0.3250
e → B | 0.5487 | 0.5690 | 0.5519
e → C | 0.3825 | 0.3754 | 0.3664