Article

Frame and Utterance Emotional Alignment for Speech Emotion Recognition

1 Department of Computer Science, Graduate School, Sangmyung University, Seoul 03016, Republic of Korea
2 Department of Computer Science, Sangmyung University, Seoul 03016, Republic of Korea
* Author to whom correspondence should be addressed.
Future Internet 2025, 17(11), 509; https://doi.org/10.3390/fi17110509
Submission received: 29 September 2025 / Revised: 1 November 2025 / Accepted: 3 November 2025 / Published: 5 November 2025

Abstract

Speech Emotion Recognition (SER) is important for applications such as Human–Computer Interaction (HCI) and emotion-aware services. Traditional SER models rely on utterance-level labels, aggregating frame-level representations through pooling operations. However, emotional states can vary across frames within an utterance, making it difficult for models to learn consistent and robust representations. To address this issue, we propose two auxiliary loss functions, Emotional Attention Loss (EAL) and Frame-to-Utterance Alignment Loss (FUAL). The proposed approach uses a Classification token (CLS) self-attention pooling mechanism, where the CLS summarizes the entire utterance sequence. EAL encourages frames of the same emotion to align closely with the CLS while separating frames of different classes, and FUAL enforces consistency between frame-level and utterance-level predictions to stabilize training. Model training proceeds in two stages: Stage 1 fine-tunes the wav2vec 2.0 backbone with Cross-Entropy (CE) loss to obtain stable frame embeddings, and Stage 2 jointly optimizes CE, EAL, and FUAL within the CLS-based pooling framework. Experiments on the IEMOCAP four-class dataset demonstrate that our method consistently outperforms baseline models, showing that the proposed losses effectively address representation inconsistencies and improve SER performance. This work advances Artificial Intelligence by improving the ability of models to understand human emotions through speech.

1. Introduction

SER is the task of classifying emotional states from a speaker’s audio signal. Recent advances in deep learning and Self-supervised Learning (SSL) frameworks have significantly improved SER performance [1,2]. SSL models leverage large-scale unlabeled audio data to learn generalized acoustic representations, enabling efficient adaptation to downstream tasks using only a limited amount of labeled data [3,4,5]. Among these, models such as wav2vec 2.0, HuBERT, and WavLM are widely adopted as backbone architectures in SER [6,7,8].
Traditional SER systems generate frame-level embeddings using an SSL backbone and then aggregate them into a single utterance-level representation through pooling operations. The resulting utterance-level representation is passed to a classifier for final emotion prediction [9,10]. However, this approach has two key limitations. First, emotional content is often non-uniformly distributed within an utterance, meaning that some frames may express different emotions than others. Since conventional SER models are trained using only utterance-level labels, they struggle to learn consistent intra-class frame representations, leading to a mismatch between frame-level and utterance-level features and causing training instability [11]. Second, traditional pooling methods such as mean or weighted sum cannot explicitly capture relationships between frames within the same emotion class. As a result, even utterances belonging to the same emotion category may produce inconsistent frame embeddings, obscuring decision boundaries and degrading overall performance.
To address these challenges, we propose two auxiliary loss functions: EAL and FUAL. Both loss functions are designed to improve representation learning by modeling the relationship between frames and the entire utterance, while requiring only labels at the utterance level. EAL encourages frames of the same emotion to cluster around a central representation while pushing frames of different classes apart, thereby sharpening class boundaries. FUAL enforces consistency between frame-level and utterance-level predictions, ensuring that each frame reflects the overall emotional structure of the utterance.
To effectively implement these objectives, we employ CLS-based self-attention pooling. The CLS token aggregates information from all frames into a single utterance-level representation through Multi-Head Attention (MHA), while simultaneously influencing the update of frame-level embeddings [12,13,14]. In this process, EAL and FUAL jointly shape the learning procedure, enabling the CLS to capture a concise and representative utterance-level context.
The training process proceeds in two stages. In Stage 1, the SSL backbone is fine-tuned with CE loss only, stabilizing frame-level embeddings for downstream tasks [15]. In Stage 2, the model from Stage 1 is used as initialization, and EAL and FUAL are jointly optimized within the CLS-based self-attention pooling framework to refine emotional alignment and improve recognition performance.
Experiments on the IEMOCAP 4-class dataset demonstrate that the proposed method consistently outperforms baseline models in both Unweighted Accuracy (UA) and Weighted Accuracy (WA).
As Speech Emotion Recognition is a key component of affective computing and Human-Centered Artificial Intelligence, the proposed method directly contributes to advancing AI systems capable of understanding and responding to human emotions.
The remainder of this paper is organized as follows. Section 2 reviews related work on Speech Emotion Recognition and Self-Supervised Learning. Section 3 presents the proposed method and training strategy. Section 4 describes the experimental setup and results. Section 5 provides a discussion, and Section 6 concludes the paper.

2. Related Work

2.1. Traditional SER Approaches

Earlier SER research manually extracted hand-designed features such as MFCCs, pitch, energy, and mel spectrograms from audio signals and classified emotions with traditional machine learning algorithms such as Support Vector Machines (SVM) or K-Nearest Neighbors (KNN). However, these features could not sufficiently capture the complex, nonlinear patterns of emotion, and their generalization across datasets was not robust [16,17,18].
Subsequent advances in deep learning enabled the application of Convolutional Neural Network (CNN)- and Recurrent Neural Network (RNN)-based models to SER, significantly improving performance [19,20]. While RNN-based models proved particularly strong at learning time-series audio information, limitations persisted regarding data scarcity and the ability to learn diverse emotional expressions with precision [21].
To address this, SSL-based models, which leverage large-scale unlabeled audio data to learn powerful representations, have recently become widely used as the backbone for SER.

2.2. Wav2vec 2.0

SSL models learn generalizable frame-level representations from unlabeled audio data, achieving remarkable performance on diverse downstream audio tasks with only a small amount of labeled data. A representative model, wav2vec 2.0, learns rich acoustic representations by masking parts of the raw waveform and predicting the masked segments [6].
It uses a convolutional feature encoder to capture local patterns and a Transformer network to model long-range temporal dependencies. Through contrastive learning, wav2vec 2.0 distinguishes true masked segments from negative samples, producing robust and discriminative features. Due to these characteristics, wav2vec 2.0 has become a widely used backbone for SER, offering superior performance to CNN and RNN-based models and reducing reliance on large labeled datasets.

2.3. Pooling Techniques for Utterance-Level Representations

SER models integrate frame-level audio representations into a single utterance-level vector, which is then used to classify emotion. Previous studies employed simple techniques such as average pooling or max pooling, but these methods have limitations: they either treat all frames equally or keep only the most extreme activations while discarding the rest.
To improve on this, attentive pooling was proposed, generating speech representations that assign higher weights to important frames through learned weights [22]. However, in this approach the weights are determined only indirectly during model training, making it difficult to guarantee that emotionally significant frames are directly reflected. Furthermore, since the relationships between frames are not explicitly captured, information from unrelated frames can be mixed together. This increases the likelihood of noise and unnecessary information being included, ultimately limiting the final prediction performance.

2.4. Frame Representation Alignment in SER

Existing SER research primarily relies on utterance-level labels, meaning that the roles and relationships of individual frames are not explicitly learned. Attention pooling methods assign higher weights to frames that are relatively more important for classification, based on the relationships between frames, to generate an utterance-level representation. As a result, even among utterances belonging to the same emotion class, frame representations may fail to align consistently, and frames sharing the same emotion may be represented in highly inconsistent ways.
Such inconsistencies blur the boundaries between emotion classes and introduce unnecessary variability into the training process, ultimately degrading the final prediction performance. Therefore, to enhance both the performance and generalization ability of SER models, a new approach is required that enables mutual learning between utterance-level and frame-level representations.

3. Proposed Method

This study uses SSL-based audio frame embeddings to learn stable and emotionally consistent utterance-level representations through a two-stage training procedure. The overall two-stage training framework is illustrated in Figure 1.
In Stage 1, the input audio is fed through the SSL model to extract frame-level embeddings. Frame-specific weights are then computed via attentive pooling and used to form a weighted sum that yields the utterance-level representation. This representation is trained with CE loss, enabling the SSL model to encode emotional information stably in its frame-level embeddings.
In Stage 2, the SSL model trained in Stage 1 is used as initialization, and the frame representations that stably retain emotional information serve as input. At this stage, a CLS token is added, and information from all frames is integrated via MHA into a global utterance vector. The utterance representation generated by CLS-based self-attention pooling is then used for final emotion classification, with CE loss, EAL, and FUAL applied jointly throughout this process.

3.1. Stage 1: SSL Fine-Tuning with Attentive Pooling

We employ wav2vec 2.0 as the SSL backbone to extract frame-level embeddings $H = \{h_1, h_2, \ldots, h_T\}$ from the input waveform $x$. These frame embeddings are then integrated using an attentive pooling module to generate an utterance-level representation $u$, which is subsequently passed to the final classifier for emotion prediction.
The frame-wise attention weights $\alpha_t$ and the utterance-level representation $u$ are computed as follows:
$$s_t = w^{\top} h_t + b, \quad \alpha_t = \frac{\exp(s_t)}{\sum_{k=1}^{T} \exp(s_k)}, \quad u = \sum_{t=1}^{T} \alpha_t h_t \quad (1)$$
The utterance-level representation $u$ obtained in (1) is trained using CE loss, defined as:
$$\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log \hat{y}_c \quad (2)$$
where $C$ is the number of emotion classes, $y_c$ is the one-hot ground-truth label, and $\hat{y}_c$ is the softmax probability for class $c$.
Through this process, the frame-level weights $\alpha_t$ are stabilized to accurately reflect emotional information. Thus, Stage 1 focuses on learning a stable SSL-based initial model, providing a strong foundation for Stage 2. This step ensures that the subsequent application of auxiliary losses in Stage 2 is stable and effective, facilitating smooth training and improving overall performance.
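To make the Stage-1 pipeline concrete, the following is a minimal PyTorch sketch of attentive pooling followed by the CE objective in Equations (1) and (2). The module name, layer sizes, and dummy tensors are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Minimal sketch of the Stage-1 attentive pooling in Eq. (1)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)             # s_t = w^T h_t + b

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, T, dim) frame embeddings from wav2vec 2.0
        s = self.score(h)                          # (batch, T, 1)
        alpha = torch.softmax(s, dim=1)            # frame weights alpha_t
        return (alpha * h).sum(dim=1)              # utterance vector u

# Stage-1 objective: plain cross-entropy on the pooled representation (Eq. (2)).
pool, clf = AttentivePooling(768), nn.Linear(768, 4)
h = torch.randn(8, 200, 768)                       # dummy frame embeddings
loss = nn.CrossEntropyLoss()(clf(pool(h)), torch.randint(0, 4, (8,)))
```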

3.2. Stage 2: CLS-Based Self-Attention Pooling & Joint Objectives

In Stage 2, the SSL backbone from Stage 1 is used for initialization. A CLS token is prepended to the frame sequence and passed through a multi-head attention encoder to produce frame representations $f_t$ and a global utterance-level representation $c$.
These outputs are aggregated to form the final utterance-level vector $u$. At this stage, CE loss, EAL, and FUAL are jointly optimized to enhance both frame-level alignment and utterance-level consistency.

3.2.1. Self-Attention Pooling with CLS

The frame embedding sequence from the SSL backbone is given by $H = \{h_1, h_2, \ldots, h_T\}$. A CLS token $h_{CLS}$ is prepended to form the input sequence:
$$H' = \{h_{CLS}, h_1, h_2, \ldots, h_T\} \quad (3)$$
The sequence $H'$ is then processed by the multi-head attention encoder $\mathrm{Enc}$:
$$[c, f_1, f_2, \ldots, f_T] = \mathrm{Enc}(H') \quad (4)$$
where $c$ represents the CLS output summarizing the entire utterance, and $f_t$ represents the final frame-level representations. By concentrating the utterance information into $c$, this structure provides a foundation for EAL to guide emotion-specific alignment.
The similarity scores between the CLS and each frame are normalized with a softmax function to produce attention weights $\alpha_t$. These weights determine the contribution of each frame to the final utterance-level representation, analogous to the pooling process in Stage 1.
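A minimal sketch of this CLS-based self-attention pooling is shown below, assuming a single standard Transformer encoder layer with the four heads and zero dropout reported in Section 4.3; the class and variable names are illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

class CLSSelfAttentionPooling(nn.Module):
    """Sketch of Stage-2 pooling: a learnable CLS token is prepended and
    mixed with the frames by a Transformer encoder layer (Eqs. (3)-(4))."""
    def __init__(self, dim: int = 768, heads: int = 4):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.enc = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dropout=0.0, batch_first=True)

    def forward(self, h: torch.Tensor):
        # h: (batch, T, dim) frame embeddings from the Stage-1 backbone
        cls = self.cls.expand(h.size(0), -1, -1)     # one CLS token per utterance
        out = self.enc(torch.cat([cls, h], dim=1))
        c, f = out[:, 0], out[:, 1:]                 # CLS output c, frame outputs f_t
        # CLS-frame similarity scores -> attention weights alpha_t
        alpha = torch.softmax((f * c.unsqueeze(1)).sum(dim=-1), dim=1)
        u = (alpha.unsqueeze(-1) * f).sum(dim=1)     # final utterance-level vector u
        return c, f, alpha, u
```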

3.2.2. Emotional Attention Loss (EAL)

EAL encourages frames within the same emotion class to align closely with the CLS while separating frames from different classes.
The cosine similarity $s_t$ between the CLS output and each frame is calculated as follows:
$$s_t = \frac{c^{\top} f_t}{\lVert c \rVert \, \lVert f_t \rVert} \quad (5)$$
To enforce alignment, a softplus-margin loss is applied so that emotionally salient frames maintain a similarity above the margin $m$ [23].
The frame-wise loss for each frame is defined as:
$$\ell_t = \log\left(1 + e^{\,m - s_t}\right) \quad (6)$$
The overall EAL is computed as the mean of these losses across all frames:
$$\mathcal{L}_{EAL} = \frac{1}{T} \sum_{t=1}^{T} \ell_t \quad (7)$$
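A short sketch of this computation is given below. It assumes that every frame of an utterance is pulled toward that utterance's CLS output, so that separation between classes arises from utterances of different emotions having different CLS anchors; the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def emotional_attention_loss(c: torch.Tensor, f: torch.Tensor,
                             margin: float = 0.8) -> torch.Tensor:
    """Sketch of EAL (Eqs. (5)-(7)): softplus-margin loss on the cosine
    similarity between the CLS output c (batch, dim) and frames f (batch, T, dim)."""
    s = F.cosine_similarity(c.unsqueeze(1), f, dim=-1)   # s_t: (batch, T)
    l = F.softplus(margin - s)                           # log(1 + exp(m - s_t))
    return l.mean()                                      # average over all frames
```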
While EAL improves class boundary clarity, the softmax function can still amplify small differences in similarity scores, causing attention weights to become overly concentrated on certain frames. The margin mitigates this effect by limiting updates for frames already close to the CLS, but it does not fully resolve the disproportionate allocation of attention.
As a result, EAL alone may not fully prevent this weight concentration, which can make training less stable and reduce the diversity of frame-level representations.

3.2.3. Frame-to-Utterance Alignment Loss (FUAL)

Using only EAL may cause the model to over-focus on a few specific frames, which can lead to unstable representations for other frames. To address this issue, FUAL is introduced to ensure that frame-level predictions are consistent with the utterance-level prediction distribution. Each frame output is passed through a softmax function to generate a probability distribution $p_t$, and the utterance representation is also passed through a softmax function to produce the utterance-level probability distribution $q$. The frame-level predictions are then aggregated using the attention weights $\alpha_t$ to form a weighted average distribution $\bar{p}$.
Finally, FUAL minimizes the Kullback–Leibler (KL) divergence between these two distributions as follows [24]:
$$\mathcal{L}_{FUAL} = \mathrm{KL}\left(q \,\|\, \bar{p}\right) \quad (8)$$
FUAL minimizes this divergence to encourage consistency between frame-level and utterance-level predictions, resulting in more stable and coherent emotional representations across both levels.
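The sketch below shows one way to compute this loss in PyTorch; the tensor shapes, the shared classifier that produces both frame-level and utterance-level logits, and the clamping for numerical stability are assumptions.

```python
import torch.nn.functional as F

def frame_to_utterance_alignment_loss(frame_logits, utt_logits, alpha):
    """Sketch of FUAL (Eq. (8)): KL(q || p_bar) between the utterance-level
    distribution q and the attention-weighted average p_bar of frame distributions."""
    p = F.softmax(frame_logits, dim=-1)              # p_t: (batch, T, C)
    p_bar = (alpha.unsqueeze(-1) * p).sum(dim=1)     # weighted average over frames
    q = F.softmax(utt_logits, dim=-1)                # q: (batch, C)
    kl = q * (q.clamp_min(1e-8).log() - p_bar.clamp_min(1e-8).log())
    return kl.sum(dim=-1).mean()                     # averaged over the batch
```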

3.2.4. Total Loss

The final loss function combines CE, EAL, and FUAL:
$$\mathcal{L}_{total} = \mathcal{L}_{CE} + \lambda_{EAL}\,\mathcal{L}_{EAL} + \lambda_{FUAL}\,\mathcal{L}_{FUAL} \quad (9)$$
where $\lambda_{EAL}$ and $\lambda_{FUAL}$ are weighting factors controlling the influence of the auxiliary losses. By optimizing $\mathcal{L}_{total}$, the model improves utterance-level classification accuracy while ensuring that frame-level representations remain aligned and consistent with the overall emotional structure.
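Putting the pieces together, a minimal Stage-2 training step might look as follows, reusing the hypothetical modules and loss functions sketched above and the weights that performed best in the ablation of Section 4.4.2 ($\lambda_{EAL} = \lambda_{FUAL} = 0.05$, $m = 0.8$).

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: the pooling module returns (c, f, alpha, u) and a shared
# linear head produces utterance logits from u and frame logits from f.
pool, clf = CLSSelfAttentionPooling(768), torch.nn.Linear(768, 4)
h = torch.randn(8, 200, 768)                 # dummy frame embeddings
labels = torch.randint(0, 4, (8,))           # dummy utterance-level labels
c, f, alpha, u = pool(h)
utt_logits, frame_logits = clf(u), clf(f)

# Eq. (9): CE + weighted auxiliary losses.
loss_total = (F.cross_entropy(utt_logits, labels)
              + 0.05 * emotional_attention_loss(c, f, margin=0.8)
              + 0.05 * frame_to_utterance_alignment_loss(frame_logits, utt_logits, alpha))
loss_total.backward()
```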

4. Experiments

In this section, we present the dataset, training configurations, baseline models, and experimental results used to evaluate the performance of the proposed method.

4.1. Dataset

For performance evaluation, this study utilized the IEMOCAP corpus [25], a widely adopted benchmark dataset for SER. The corpus is multi-modal, containing audio, visual, and lexical modalities; however, only the audio modality was used in this work. It comprises five sessions, each consisting of dyadic conversations between one male and one female speaker.
To prevent speaker information leakage, we followed a leave-one-session-out five-fold cross-validation strategy. In each fold, four sessions were used for training; within the held-out session, utterances from the male speaker served as the validation set and utterances from the female speaker were used as the test set.
We focused on four emotion categories: Happy (including Excited), Sad, Neutral, and Angry. In total, 5531 audio utterances were selected for training and evaluation; their distribution across the four classes and five sessions is summarized in Table 1.
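An illustrative sketch of this split protocol is shown below; the metadata fields and toy records are assumptions standing in for the actual IEMOCAP file lists.

```python
# Toy records standing in for IEMOCAP utterance metadata (fields are assumptions).
all_utterances = [
    {"session": s, "gender": g, "path": f"Ses0{s}_{g}_example.wav"}
    for s in range(1, 6) for g in ("M", "F")
]

def split_fold(utterances, test_session):
    """One fold of the leave-one-session-out protocol described above."""
    train = [u for u in utterances if u["session"] != test_session]
    held = [u for u in utterances if u["session"] == test_session]
    val = [u for u in held if u["gender"] == "M"]    # male speaker -> validation
    test = [u for u in held if u["gender"] == "F"]   # female speaker -> test
    return train, val, test

folds = [split_fold(all_utterances, s) for s in range(1, 6)]  # five folds
```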

4.2. Evaluation Metrics

The model performance was evaluated using UA and WA. UA accounts for class imbalance by equally weighting all classes, providing a fair assessment of performance across different emotion categories. In contrast, WA measures the overall accuracy based on the total number of samples, reflecting the model’s performance across the entire dataset.
$$\mathrm{UA} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{N_c}, \qquad \mathrm{WA} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} N_c} \quad (10)$$
where $C$ denotes the number of classes, $TP_c$ is the number of correct predictions for class $c$, and $N_c$ is the total number of samples in class $c$.
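A minimal NumPy sketch of these two metrics, assuming every class appears at least once in the ground truth:

```python
import numpy as np

def ua_wa(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int = 4):
    """Unweighted and weighted accuracy as defined in Eq. (10)."""
    per_class = [(y_pred[y_true == c] == c).mean() for c in range(num_classes)]
    ua = float(np.mean(per_class))          # class-balanced average of TP_c / N_c
    wa = float((y_pred == y_true).mean())   # overall accuracy, sum TP_c / sum N_c
    return ua, wa

ua, wa = ua_wa(np.array([0, 1, 2, 3, 1]), np.array([0, 1, 2, 0, 1]))
```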
In addition, a Confusion Matrix was used to visually inspect class-wise prediction tendencies, allowing for an intuitive understanding of potential misclassification patterns.

4.3. Implementation Details

All experiments were conducted on a single NVIDIA RTX 5080 GPU using the PyTorch 2.8.0 framework with CUDA 12.8 and Python 3.10.8. The training process consisted of two sequential stages. In Stage 1, the SSL backbone was fine-tuned to obtain stable frame-level representations using only CE loss. In Stage 2, the model was jointly trained with CE and the proposed auxiliary losses to refine frame-to-utterance emotional alignment. The CNN-based feature extractor remained frozen in both stages to preserve general speech representations. AdamW was used as the optimizer with different learning rates (LR) for the backbone, attention module, and classification head. The pretrained model, facebook/wav2vec2-base, was obtained from the Hugging Face Transformers library.
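As an illustration of the per-module learning rates, the sketch below builds the Stage-2 AdamW optimizer with the values from Table 2; the placeholder modules merely stand in for the wav2vec 2.0 backbone, the CLS-based pooling module, and the classification head.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the three trainable parts of the Stage-2 model.
backbone, pooling, classifier = nn.Linear(768, 768), nn.Linear(768, 768), nn.Linear(768, 4)

# AdamW with per-module learning rates and weight decay taken from Table 2 (Stage 2).
optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},
        {"params": pooling.parameters(), "lr": 3e-4},
        {"params": classifier.parameters(), "lr": 3e-4},
    ],
    weight_decay=1e-2,
)
```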
Table 2 summarizes the core hyperparameter settings for both stages, allowing a direct comparison of key experimental configurations.
In Stage 2, CLS-based self-attention pooling was configured with four attention heads, a depth of one, and no dropout. The hyperparameters of both EAL and FUAL were determined experimentally, and further analysis is provided in Section 4.4.2 (Ablation Study).

4.4. Results

4.4.1. Baselines

For comparison, two baseline models were established using the same dataset and the SSL backbone pre-trained in stage 1 to ensure a fair and consistent evaluation setting. The first baseline employed a conventional self-attention pooling mechanism, while the second utilized a CLS-based self-attention pooling approach, where the CLS serves as a global summary representation of the entire utterance. These baselines were designed to evaluate the effect of incorporating a global anchor token in emotion representation learning.
Building upon the CLS-based self-attention pooling baseline, we further examined the impact of the proposed auxiliary loss functions by integrating EAL and FUAL step by step. Specifically, three CLS-based configurations were tested: one with EAL alone, another with FUAL alone, and a third with both EAL and FUAL applied together to form the proposed model. This progressive approach allowed us to systematically analyze the contribution of each individual loss function as well as their combined effect on overall performance.
All experiments were conducted using five-fold cross-validation, and performance was evaluated using UA and WA.
As shown in Table 3, the CLS-based self-attention pooling showed moderate performance improvements compared to conventional self-attention pooling, indicating that incorporating a global summary token can provide a more stable foundation for utterance-level representation. When either EAL or FUAL was added to the CLS-based framework, the model exhibited additional improvements over the CLS baseline, suggesting that each loss plays a supportive role in enhancing frame-level alignment and prediction consistency. When both losses were applied together, the proposed model achieved the best overall performance among all configurations, demonstrating clear performance gains compared to the baseline. These results confirm that the proposed auxiliary losses are effective within the CLS-based framework and provide consistent performance improvements under the given experimental setting.

4.4.2. Ablation Study

To analyze the contribution of each component in the proposed framework, we conducted an ablation study focusing on the auxiliary losses, EAL and FUAL. The experiments were designed to investigate the effect of key hyperparameters that influence alignment learning.
EAL includes two key hyperparameters, the margin $m$ and the loss weight $\lambda_{EAL}$. The margin $m$ controls the degree of separation between frame-level representations, and we explored three values: 0.4, 0.6, and 0.8. The loss weight $\lambda_{EAL}$ determines how much influence the EAL term has on the overall loss function, and we tested three levels: 0.05, 0.10, and 0.15. Because these parameters can affect both the stability of training and the final recognition performance, we systematically investigated their impact through a structured hyperparameter search.
To conduct this study, we evaluated every possible combination of the margin and the loss weight, resulting in nine different experimental settings. All experiments were performed using five-fold cross-validation in order to provide reliable performance estimates. The results, reported in terms of UA and WA, are presented in Table 4.
As shown in Table 4, the performance of the model varied depending on the specific values of the margin and the loss weight. When the margin was either too low or too high, the improvement in accuracy was less consistent, which suggests that an appropriate balance is needed between encouraging frame clustering and maintaining sufficient separation. A similar trend was observed for the loss weight. When the loss weight was increased beyond a certain level, the performance did not continue to improve and slightly decreased in some cases, indicating that excessive weighting of EAL can interfere with the optimization of other objectives. Among all the tested combinations, the setting with a margin of 0.8 and a loss weight of 0.05 produced the most stable and consistent results.
To further validate the robustness of the proposed framework, an additional ablation study was conducted by varying $\lambda_{FUAL}$ while fixing the best EAL parameters ($m = 0.8$, $\lambda_{EAL} = 0.05$). The results are summarized in Table 5.
As shown in Table 5, the model achieved the highest performance when $\lambda_{FUAL} = 0.05$. Increasing the loss weight beyond this value slightly degraded performance, indicating that an excessive emphasis on FUAL can dominate the optimization and reduce generalization. These findings suggest that moderate weighting of FUAL effectively complements EAL, maintaining stable alignment between frame- and utterance-level representations.

4.4.3. Visualization Analysis

To examine the effect of EAL and FUAL on aligning emotional representations across entire utterances, we visualized the learned embeddings using t-SNE [26].
As shown in Figure 2a,b, the comparison suggests that in Figure 2b, the utterance-level emotion distributions appear more clearly differentiated, showing a relatively clear boundary around the CLS token. This implies that the proposed loss functions help improve the separation and organization of emotion representations at the utterance level, indicating their potential in structuring the emotional feature space.

4.4.4. Confusion Matrix

Figure 3 shows the confusion matrix of the final proposed model evaluated using five-fold cross-validation. Each cell indicates the percentage of predicted classes for a specific true emotion class, clearly visualizing the model’s tendency to misclassify particular emotions. The results indicate that while the model achieves high accuracy for the Angry and Sad classes, it exhibits relatively lower accuracy for the Happy and Neutral classes, showing notable confusion between these two emotion categories.
As shown in Table 1, both Happy and Neutral classes contain significantly more utterances than other emotions, which may lead to class imbalance and bias the model toward these categories. In addition, the boundary between Happy and Neutral emotions is often ambiguous in conversational speech, making them difficult to distinguish clearly. These factors together likely contribute to the observed confusion between the two emotions.

4.4.5. Comparison with Other SER Systems

As shown in Table 6, the proposed model achieved a UA of 75.3% and a WA of 74.2%, demonstrating clear performance improvements over existing models such as SMW-CAT and ShiftCNN. Moreover, the proposed method achieved comparable performance to the state-of-the-art FLEA model, highlighting its effectiveness.
Importantly, our approach requires only utterance-level labels for training and does not rely on frame-level annotations. This demonstrates that the model can effectively achieve high SER performance without complex modules or additional labeling efforts, making it both efficient and scalable for practical applications.

4.4.6. Cross-Dataset Evaluation

To further assess the generalizability of the proposed method, additional experiments were conducted on two benchmark datasets, CREMA-D [31] and RAVDESS [32]. The CREMA-D dataset contains 7442 audiovisual clips of six emotions (happy, sad, neutral, anger, disgust, and fear) recorded by 91 actors [31]. The RAVDESS dataset includes 1440 utterances from 24 actors expressing eight emotions (calm, happy, sad, angry, fearful, surprise, disgust, and neutral) [32].
As shown in Table 7, the proposed CLS-based model achieved higher UA and WA scores than the baseline model on both datasets, confirming consistent performance improvements. These results demonstrate that the proposed method maintains stable recognition capability across different emotional corpora, validating its robustness beyond the IEMOCAP dataset.

5. Discussion

This study aimed to address inconsistencies between frame-level and utterance-level representations in SER. To this end, we introduced two auxiliary loss functions: EAL and FUAL. EAL promotes clustering of frames belonging to the same emotion class around the CLS while separating frames of different classes, whereas FUAL encourages consistency between frame-level and utterance-level predictions, preventing the model from over-focusing on a few dominant frames. The CLS also plays a key role in integrating frame information into a coherent utterance-level representation, providing a stable reference point for subsequent alignment learning.
The results in Table 3 show that incorporating the CLS into the pooling mechanism provided modest improvements over conventional self-attention pooling. Adding EAL or FUAL individually led to further gains, while combining both losses achieved the best overall performance, demonstrating their complementary effects. t-SNE visualizations confirmed this by showing clearer and more distinct emotion clusters when both losses were applied, indicating that the proposed framework reshapes the representation space to better capture emotional structures.
The ablation analysis further highlighted the importance of carefully tuning alignment-related hyperparameters such as the margin and loss weights. These parameters influenced training stability and overall recognition performance, emphasizing the need for empirical balance to achieve stable and effective optimization.
The confusion analysis indicated that most emotions were accurately recognized, though minor overlap was observed between acoustically similar categories such as Happy and Neutral. In addition, comparison with previous methods demonstrated that our approach outperforms prior models such as SMW-CAT and ShiftCNN while reaching performance levels comparable to the state-of-the-art FLEA model. Importantly, this was achieved without requiring frame-level annotations, highlighting the practicality and scalability of our method.
To further validate its robustness, additional experiments were conducted on the CREMA-D and RAVDESS datasets. The proposed method maintained consistent performance across these benchmarks, confirming its generalizability across different emotional speech conditions.
Future studies will aim to further assess adaptability using more diverse and spontaneous speech data.
In summary, integrating EAL and FUAL within a CLS-based framework effectively improves both alignment and representation consistency in SER. This approach provides a simple yet effective way to enhance emotion recognition using only utterance-level labels, supporting its potential for broader practical applications.

6. Conclusions

This study proposed a CLS-based framework that integrates two auxiliary loss functions, EAL and FUAL, to address inconsistencies between frame-level and utterance-level representations in SER. The proposed method leverages a two-stage training strategy, where the wav2vec 2.0 backbone is first fine-tuned to obtain stable frame embeddings, and then EAL and FUAL are jointly optimized within a CLS-based self-attention pooling mechanism.
Through experiments on the IEMOCAP dataset, we demonstrated that the combination of EAL and FUAL effectively improves alignment and representation consistency, leading to stable training and higher recognition accuracy without requiring frame-level annotations. Additional experiments on the CREMA-D and RAVDESS datasets further confirmed that the proposed approach maintains consistent performance across different corpora, supporting its generalizability to varied emotional speech conditions. Furthermore, comparison with existing SER systems showed that the proposed method achieves performance comparable to recent state-of-the-art models while maintaining a simpler and more efficient architecture.
In conclusion, this framework provides a practical and interpretable solution for enhancing SER using only utterance-level labels, achieving a good balance between model simplicity, scalability, and recognition performance. Future work will extend this approach to multimodal emotion recognition by incorporating visual or textual modalities, and will explore its application to more diverse and spontaneous speech data to evaluate adaptability in real-world scenarios.

Author Contributions

Conceptualization, S.B. and S.-P.L.; methodology, S.B.; investigation, S.B.; writing—original draft preparation, S.B.; writing—review and editing, S.-P.L.; project administration, S.-P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Experiments used publicly available datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Atmaja, B.T.; Sasou, A. Evaluating self-supervised speech representations for speech emotion recognition. IEEE Access 2022, 10, 124396–124407. [Google Scholar] [CrossRef]
  2. Naini, A.R.; Kohler, M.A.; Richerson, E.; Robinson, D.; Busso, C. Generalization of self-supervised learning-based representations for cross-domain speech emotion recognition. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: New York, NY, USA, 2024; pp. 12031–12035. [Google Scholar] [CrossRef]
  3. Morais, E.; Hoory, R.; Zhu, W.; Gat, I.; Damasceno, M.; Aronowitz, H. Speech emotion recognition using self-supervised features. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 6922–6926. [Google Scholar] [CrossRef]
  4. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised Pre-Training for Speech Recognition. Proc. Interspeech 2019, 2019, 3465–3469. [Google Scholar] [CrossRef]
  5. Fang, Y.; Xing, X.; Xu, X.; Zhang, W. Exploring Downstream Transfer of Self-Supervised Features for Speech Emotion Recognition. Proc. Interspeech 2023, 2023, 3627–3631. [Google Scholar] [CrossRef]
  6. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2020; Volume 33, pp. 12449–12460. [Google Scholar]
  7. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
  8. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
  9. Hyeon, J.; Oh, Y.H.; Lee, Y.J.; Choi, H.J. Improving speech emotion recognition by fusing self-supervised learning and spectral features via mixture of experts. Data Knowl. Eng. 2024, 150, 102262. [Google Scholar] [CrossRef]
  10. Kakouros, S.; Stafylakis, T.; Mošner, L.; Burget, L. Speech-based emotion recognition with self-supervised models using attentive channel-wise correlations and label smoothing. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar] [CrossRef]
  11. Nagase, R.; Fukumori, T.; Yamashita, Y. Speech Emotion Recognition Using Sequences of Fine-grained Emotion Labels with Phoneme Class Attributes. APSIPA Trans. Signal Inf. Process. 2025, 14, e17. [Google Scholar] [CrossRef]
  12. Chang, H.-S.; Sun, R.-Y.; Ricci, K.; McCallum, A. Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 821–854. [Google Scholar] [CrossRef]
  13. Chen, W.; Liang, Y.; Ma, Z.; Zheng, Z.; Chen, X. EAT: Self-supervised pre-training with efficient audio transformer. arXiv 2024, arXiv:2401.03497. [Google Scholar]
  14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS 2017); NeurIPS: San Diego, CA, USA, 2017; Volume 30, pp. 5998–6008. [Google Scholar]
  15. De Boer, P.T.; Kroese, D.P.; Mannor, S.; Rubinstein, R.Y. A tutorial on the cross-entropy method. Ann. Oper. Res. 2005, 134, 19–67. [Google Scholar] [CrossRef]
  16. Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; IEEE: New York, NY, USA, 2017; pp. 2227–2231. [Google Scholar] [CrossRef]
  17. Rajamani, S.T.; Rajamani, K.T.; Mallol-Ragolta, A.; Liu, S.; Schuller, B. A novel attention-based gated recurrent unit and its efficacy in speech emotion recognition. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 6294–6298. [Google Scholar] [CrossRef]
  18. Li, R.; Wu, Z.; Jia, J.; Zhao, S.; Meng, H. Dilated residual network with multi-head self-attention for speech emotion recognition. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: New York, NY, USA, 2019; pp. 6675–6679. [Google Scholar] [CrossRef]
  19. Issa, D.; Demirci, M.F.; Yazici, A. Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process. Control 2020, 59, 101894. [Google Scholar] [CrossRef]
  20. De Lope, J.; Grana, M. A hybrid time-distributed deep neural architecture for speech emotion recognition. Int. J. Neural Syst. 2022, 32, 2250024. [Google Scholar] [CrossRef] [PubMed]
  21. Li, Y.; Wang, Y.; Yang, X.; Im, S.K. Speech emotion recognition based on Graph-LSTM neural network. EURASIP J. Audio Speech Music. Process. 2023, 2023, 40. [Google Scholar] [CrossRef]
  22. Santos, C.D.; Tan, M.; Xiang, B.; Zhou, B. Attentive pooling networks. arXiv 2016, arXiv:1602.03609. [Google Scholar] [CrossRef]
  23. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; JMLR Workshop and Conference Proceedings: New York, NY, USA, 2011; pp. 315–323. [Google Scholar]
  24. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  25. Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
  26. Maaten, L.V.D.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  27. Chen, L.W.; Rudnicky, A. Exploring wav2vec 2.0 fine tuning for improved speech emotion recognition. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar] [CrossRef]
  28. He, Y.; Minematsu, N.; Saito, D. Multiple acoustic features speech emotion recognition using cross-attention transformer. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar] [CrossRef]
  29. Shen, S.; Liu, F.; Zhou, A. Mingling or misalignment? temporal shift for speech emotion recognition with pre-trained representations. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar] [CrossRef]
  30. Li, Q.; Gao, Y.; Wang, C.; Deng, Y.; Xue, J.; Han, Y.; Li, Y. Frame-level emotional state alignment method for speech emotion recognition. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: New York, NY, USA, 2024; pp. 11486–11490. [Google Scholar] [CrossRef]
  31. Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 2014, 5, 377–390. [Google Scholar]
  32. Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Architecture of the Frame-Align SER Model.
Figure 2. t-SNE visualization of utterance-level embeddings. (a) Baseline; (b) Proposed.
Figure 3. Confusion matrix of the proposed method.
Table 1. Number of utterances per emotion and per session.

Session | Angry | Happy | Sad  | Neutral | Total
1       | 229   | 278   | 194  | 384     | 1085
2       | 137   | 327   | 197  | 362     | 1023
3       | 240   | 286   | 305  | 320     | 1151
4       | 327   | 303   | 143  | 258     | 1031
5       | 170   | 442   | 245  | 384     | 1241
Total   | 1103  | 1636  | 1084 | 1708    | 5531
Table 2. Hyperparameter settings for Stage 1 and Stage 2.

Hyperparameter               | Stage 1           | Stage 2
Backbone                     | wav2vec 2.0 base  | Stage-1 checkpoint
Epochs                       | 40                | 30
Batch size                   | 32                | 32
Backbone LR                  | 1e-4              | 1e-5
Classifier LR                | 1e-4              | 3e-4
Attention LR                 | 5e-4              | 3e-4
Weight decay                 | 1e-2              | 1e-2
Pooling                      | attentive pooling | CLS + self-attention pooling
Trainable transformer layers | 12                | 6
Table 3. Performance of the proposed methods.

Method                           | UA (%) | WA (%)
Self-attention pooling           | 73.29  | 71.88
CLS-based self-attention pooling | 73.50  | 72.33
+EAL                             | 74.16  | 72.86
+FUAL                            | 74.96  | 73.80
+EAL + FUAL                      | 75.30  | 74.22
Table 4. Ablation study of margin ($m$) and loss weight ($\lambda_{EAL}$) on UA/WA (%).

$\lambda_{EAL}$ | m = 0.4     | m = 0.6     | m = 0.8
0.05            | 74.80/73.54 | 74.54/73.39 | 75.30/74.22
0.10            | 75.15/74.07 | 75.10/73.84 | 74.60/73.65
0.15            | 74.62/73.39 | 74.59/73.39 | 75.11/74.10
Table 5. Ablation study of loss weight ($\lambda_{FUAL}$) on UA and WA.

$\lambda_{FUAL}$ | UA (%) | WA (%)
0.05             | 75.30  | 74.22
0.10             | 74.81  | 73.80
0.15             | 74.70  | 73.42
Table 6. Comparison with other SER systems.

System        | UA (%) | WA (%)
P-TAPT [27]   | 74.3   | -
SMW-CAT [28]  | 74.2   | 72.8
ShiftCNN [29] | 74.8   | 72.8
Proposed      | 75.3   | 74.2
FLEA [30]     | 75.7   | 74.7
Table 7. Cross-dataset comparison between the baseline and proposed model.

Dataset | Model    | UA (%) | WA (%)
CREMA-D | Baseline | 78.00  | 77.73
CREMA-D | Proposed | 78.76  | 78.53
RAVDESS | Baseline | 90.11  | 90.07
RAVDESS | Proposed | 93.13  | 92.72
