Multi-Task Learning-Based Speech Emotion Recognition Using Pre-Trained Acoustic Model

Wang, Xiaoyu; Yao, Kai; Yi, Ying

doi:10.3390/app16105166


by Xiaoyu Wang, Kai Yao and Ying Yi *

School of Mechanical Engineering and Electronic Information, China University of Geosciences, Wuhan 430074, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(10), 5166; https://doi.org/10.3390/app16105166

Submission received: 14 April 2026 / Revised: 16 May 2026 / Accepted: 18 May 2026 / Published: 21 May 2026


Abstract

Accurate recognition of human emotions is crucial for human–computer interaction, and speech, as an important external manifestation of emotion, has attracted significant attention. Existing speech emotion recognition (SER) methods are predominantly based on single-task learning, which inadequately models speaker variability and other latent factors in speech, thereby limiting recognition performance. In this paper, a multi-task learning-based SER method leveraging a pre-trained acoustic model is proposed. Speech emotion recognition is treated as the primary task, while speaker recognition, gender recognition, and automatic speech recognition are introduced as auxiliary tasks. A multi-task learning framework based on hard parameter sharing is constructed to guide the model to learn shared acoustic representations that simultaneously encode emotional category characteristics, speaker identity, and other relevant information. Experiments conducted on the IEMOCAP dataset demonstrate that the proposed model achieves weighted accuracy (WA) and unweighted accuracy (UA) of 83.24% and 83.36%, respectively, under five-fold cross-validation, and 83.86% and 84.23%, respectively, under ten-fold cross-validation. In both settings, the proposed method outperforms the baseline models, confirming its effectiveness in improving speech emotion recognition performance.

Keywords:

speech emotion recognition; multi-task learning; speaker recognition; gender recognition; automatic speech recognition

1. Introduction

Emotion recognition, as an important component of affective computing, aims to enable computational systems to perceive and understand human emotional states, thereby enhancing the intelligence of human–computer interaction. With the rapid development of artificial intelligence, emotion recognition has demonstrated broad application value in fields such as intelligent interaction [1], mental health monitoring [2], public safety [3], and education and training [4]. Depending on the source of emotional information, emotion recognition can be conducted based on multiple modalities, including facial expressions, speech signals, textual data, and electroencephalogram (EEG) signals. Among these, speech, as one of the most natural and direct carriers of human emotional expression, is easy to acquire, requires no additional sensing devices, and contains rich and subtle emotional information. Therefore, speech emotion recognition (SER) has gradually become an important branch of emotion recognition and has attracted widespread attention.

Speech emotion recognition (SER) typically involves two core steps: feature extraction and classification. In traditional machine learning-based approaches, handcrafted acoustic features are first extracted from speech signals and then fed into classifiers for emotion recognition. Commonly used acoustic features include prosodic features (e.g., fundamental frequency and energy) and spectral features (e.g., Mel-frequency cepstral coefficients and formants). Related studies have shown that fundamental frequency (pitch) and its continuous modeling play an important role in speech emotion expression [5]. Representative classification algorithms include support vector machine (SVM) [6], k-nearest neighbor (KNN) [7], Naive Bayes, and multilayer perceptron (MLP). Koolagudi et al. [8] investigated combinations of spectral and prosodic features based on dataset characteristics and enhanced discriminative capability by extracting statistical descriptors, selecting optimal feature subsets for different classifiers. To study cross-language and cross-gender speech emotion recognition, Costantini et al. [9] employed SVM, Naive Bayes, and MLP to model acoustic features, and found that cross-gender recognition is more challenging than cross-language recognition, with gender differences exerting a greater impact on emotion recognition than language differences. Although traditional machine learning methods have achieved certain progress in early SER research, they rely heavily on manually designed acoustic features with limited representational capacity. Moreover, these methods remain insufficient in modeling the complex temporal dynamics of speech signals.

With the advancement of deep learning, researchers have increasingly applied neural networks to SER to further explore deep emotional representations embedded in acoustic signals and improve recognition performance. Designing dedicated neural network architectures for different types of acoustic features has become a classical and widely adopted paradigm. Existing studies mainly focus on spectrogram-based features and temporal acoustic features. Due to their image-like time–frequency representations, spectrograms are typically modeled using convolutional neural networks (CNNs) and their variants to extract discriminative time–frequency patterns [10,11,12,13]. In contrast, temporal acoustic features such as Mel-frequency cepstral coefficients (MFCCs) are usually represented as one-dimensional sequences, emphasizing the dynamic evolution of speech over time. For such features, long short-term memory (LSTM) networks are commonly employed to capture temporal dependencies and contextual information. Shahin et al. [14] addressed the limitation of CNNs in exploiting temporal information by using second-order delta MFCC features as input, extracting temporal correlations via a dual-channel LSTM, and performing emotion recognition with a compressed capsule network. To effectively capture dynamic emotional variations in speech, Yang et al. [15] fused short-term and prosodic features into low-dimensional composite representations and modeled temporal information using LSTM, demonstrating effectiveness across multiple datasets. To better adapt to temporal modeling tasks, some studies [16,17] have extended the standard CNN to one-dimensional convolution (1D CNN) structures. However, these supervised learning methods inevitably rely on large-scale labeled datasets, resulting in high data collection and annotation costs. To alleviate the reliance on labeled data, self-supervised learning (SSL) methods have been proposed, such as Wav2vec [18], Wav2vec 2.0 [19], HuBERT [20], and WavLM [21]. These approaches leverage large-scale unlabeled speech data to learn more general and robust speech representations. Owing to their strong representational capability and superior generalization performance, they have been increasingly applied in SER research. Wagner et al. [22] fine-tuned Wav2vec 2.0 and HuBERT to analyze the impact of model scale and pre-training data on SER performance and generalization. Chakhtouna et al. [23] employed pre-trained speech models as frozen feature extractors and combined them with SVM to perform SER on small-scale datasets, demonstrating the transferability of self-supervised speech representations. Chen et al. [24] proposed a fine-tuning method termed P-TAPT for Wav2vec 2.0, which modifies the training objective to learn more discriminative contextual emotional representations. Nevertheless, speech signals contain complex and diverse information, and single-task SER frameworks often overlook other factors that may influence emotional expression, such as speaker identity and gender, thereby limiting recognition accuracy and generalization performance.

To address the limitations of single-task SER in exploiting latent information in speech, researchers have begun to explore multi-task learning (MTL) frameworks that jointly model speech emotion recognition with other related tasks. By simultaneously learning multiple tasks within a unified framework, additional auxiliary information can be introduced, enabling the model to capture latent correlations among tasks and thereby enhance its modeling capability and generalization performance. Li et al. [25] proposed an end-to-end MTL framework that jointly trains emotion classification and gender classification tasks. Their approach directly extracts features from speech spectrograms rather than relying on handcrafted features, and employs a self-attention mechanism to focus on emotionally salient temporal segments of speech. Cai et al. [26] constructed an MTL framework based on Wav2vec 2.0 to jointly perform automatic speech recognition (ASR) and emotion classification, improving SER performance through shared task representations. Lee et al. [27] leveraged MTL to map the same speech input into different feature subspaces to extract complementary features, which are then fused based on feature correlations, thereby enhancing emotion recognition performance. Tzeng et al. [28] proposed an MTL framework based on self-supervised speech representations, integrating speech enhancement and SER tasks. By sharing SSL representations and adopting a progressive unfreezing strategy, their method improves emotion recognition performance under noisy conditions while maintaining speech enhancement quality. However, existing MTL-based approaches still have limitations in auxiliary task design. Most studies focus on a single auxiliary task or simple combinations, lacking systematic exploration of task complementarity and collaborative optimization. Furthermore, Ryumina et al. 
[29] proposed the SSL-MEPR framework, which integrates self-supervised learning, multi-task learning, and cross-domain learning, and achieves cross-task knowledge transfer and joint optimization for multimodal emotion and personality recognition through Graph Attention Fusion and a modified GradNorm-based dynamic weighting strategy. Their study demonstrates that combining self-supervised learning with multi-task collaborative optimization can effectively improve representation learning and cross-domain generalization ability in affective computing tasks. Overall, because auxiliary-task design has not yet been explored systematically, the ability of existing MTL-based SER methods to exploit latent emotional information in speech remains limited.

Therefore, this paper introduces a multi-task learning framework and proposes a speech emotion recognition method based on pre-trained acoustic models. The proposed framework takes speech emotion recognition (SER) as the primary task, while speaker recognition (SR), gender recognition (GR), and automatic speech recognition (ASR) are jointly optimized as auxiliary tasks through a hard parameter-sharing architecture to learn shared acoustic representations. Different from existing Wav2vec2.0-based multi-task SER methods that mainly improve performance by simply introducing additional auxiliary task branches, this work further systematically investigates the effects of task collaboration, task competition, and loss-weight balancing on shared representation learning. Through hierarchical ablation experiments on different auxiliary-task weight combinations, the proposed framework is able to alleviate negative transfer among tasks while learning more robust emotion-related acoustic representations, thereby improving the performance and generalization ability of speech emotion recognition.

The remainder of this paper is organized as follows. Section 2 introduces the proposed speech emotion recognition method, with a detailed description of each module in the model. Section 3 presents the experimental setup, including comparative and ablation studies, along with visualization results to validate the effectiveness of the proposed method. Section 4 concludes this paper and discusses directions for future work.

2. Theoretical Method

The architecture of the proposed MTL-based speech emotion recognition network using pre-trained acoustic models is illustrated in Figure 1. The following sections describe the model in detail based on three aspects: shared acoustic feature extraction, task-specific decoding, and joint loss optimization.

2.1. Shared Acoustic Feature Extraction

To mitigate the potential effects of speaker variability, gender differences, and semantic information on emotional expression, the proposed MTL framework incorporates SR, GR, and ASR as auxiliary tasks to assist SER. Given the strong correlations among these tasks, an MTL architecture based on hard parameter sharing is adopted, where multiple tasks jointly constrain the shared layers to extract more expressive acoustic features.

In the field of speech emotion recognition, self-supervised pre-trained speech models have demonstrated strong representation capabilities, with representative models including Wav2vec 2.0, HuBERT, and WavLM. These models leverage SSL to extract high-quality and generalizable speech representations from large-scale unlabeled data, which can be effectively applied to downstream speech-related tasks. Therefore, the proposed framework employs a pre-trained Wav2vec 2.0 model to extract shared acoustic features, and further optimizes feature representations through a multi-task joint learning mechanism.

Let the input raw speech signal be denoted as $x$, the pre-trained model as $P_{\sigma}$, and the corresponding model parameters as $\sigma$. After processing by Wav2vec 2.0, the shared feature representation $v \in \mathbb{R}^{T \times d}$ is obtained, where $T$ denotes the frame-level sequence length (i.e., the number of time steps) and $d$ represents the feature dimension of each frame (typically 768).

$$v = P_{\sigma}(x) \tag{1}$$

The feature v is used as the shared input for subsequent downstream tasks. Through backpropagation, gradients from different tasks jointly update and fine-tune the parameters of the shared layers, enabling the feature v to preserve emotional discriminability while incorporating complementary information from multiple tasks.
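As a rough illustration of how the frame count $T$ in Equation (1) relates to the length of the raw waveform, the sketch below computes the output length of a stack of unpadded 1-D convolutions. The kernel sizes and strides are an assumption: they are those of the standard Wav2vec 2.0 base feature encoder, which the article does not list, and the function names are hypothetical.

```python
# Assumed (not stated in the article): kernel sizes and strides of the
# standard Wav2vec 2.0 base convolutional feature encoder.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples):
    """Return T, the number of frames produced for num_samples input samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return max(length, 0)


def feature_shape(num_samples, d=768):
    """Shape (T, d) of the shared representation v for one utterance."""
    return (num_frames(num_samples), d)


# One second of 16 kHz audio under the assumed encoder configuration.
print(feature_shape(16000))
```

Under these assumptions, the encoder emits roughly one frame every 20 ms, so $T$ grows linearly with utterance duration while $d$ stays fixed.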

2.2. Task-Specific Decoding

After shared feature extraction, the shared representation v is fed into four different downstream tasks for joint training. In this paper, a multi-task speech emotion recognition framework is designed, where SER serves as the primary task, and SR, GR, and ASR are introduced as auxiliary tasks.

2.2.1. Primary Task: Speech Emotion Recognition

Task Formulation: In the MTL framework, SER serves as the primary task, aiming to predict the corresponding emotion category based on the input speech features. This task can be formulated as a multi-class classification problem, where an emotion prediction function is applied to the input speech to perform classification, and the category with the highest prediction score among all emotion classes is selected as the final output. Formally, the SER task can be expressed as follows:

$$y_{e}^{*} = \underset{y_{e} \in \{1, \dots, E\}}{\arg\max} \; f_{\mathrm{SER}}(x) \tag{2}$$

where $x$ denotes the speech input, $f_{\mathrm{SER}}(\cdot)$ represents the emotion prediction function, which assigns a score to each candidate label, $E$ denotes the number of emotion categories (so that $\{1, \dots, E\}$ is the label set), $y_{e}$ ranges over the candidate emotion labels, and $y_{e}^{*}$ is the predicted emotion label.

Implementation: In the proposed method, the feature $v$ obtained from the pre-trained model is first passed through an average pooling layer. Pooling over the time axis reduces the dimensionality of the acoustic features while preserving important feature information, which improves the efficiency of the downstream tasks. The result is a one-dimensional feature vector $v^{*} \in \mathbb{R}^{d}$:

$$v^{*} = \mathrm{AvgPool}(v) = \frac{1}{T} \sum_{t=0}^{T-1} v_{t} \tag{3}$$

where $v_{t} \in \mathbb{R}^{d}$ is the feature of frame $t$. The resulting feature is then fed into a fully connected layer $\mathrm{FC}_{\mathrm{SER}}$, which maps it into the emotion category space and outputs the logits $z_{e} \in \mathbb{R}^{E}$:

$$z_{e} = \mathrm{FC}_{\mathrm{SER}}(v^{*}) \tag{4}$$

In this study, a four-class discrete emotion model is adopted, i.e., $E = 4$.
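The pooling and projection steps of Equations (3) and (4), followed by the argmax of Equation (2), can be sketched in plain Python. This is only a toy illustration: the weights, bias, and frame values are hypothetical, and the actual model is implemented in PyTorch with learned parameters.

```python
def mean_pool(frames):
    """Average a T x d list of frame vectors into one d-dimensional vector (Eq. 3)."""
    t = len(frames)
    d = len(frames[0])
    return [sum(frame[k] for frame in frames) / t for k in range(d)]


def linear(weights, bias, x):
    """Fully connected layer: one logit per row of `weights` (Eq. 4)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def predict(weights, bias, frames):
    """Pool, project to logits, and return (argmax index, logits) (Eq. 2)."""
    logits = linear(weights, bias, mean_pool(frames))
    best = max(range(len(logits)), key=logits.__getitem__)
    return best, logits


# Toy example: T = 3 frames, d = 2 features, E = 4 emotion classes.
frames = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]  # hypothetical
bias = [0.0, 0.0, 0.0, 0.0]                                   # hypothetical
label, logits = predict(weights, bias, frames)
print(label, logits)
```

The GR and SR heads described in the following subsections have the same pool-then-project form, differing only in the number of output classes.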

2.2.2. Auxiliary Task: Gender Recognition

Task Formulation: Due to differences in vocal tract structure and physiological characteristics between males and females, their speech exhibits noticeable variations in fundamental frequency, formant distribution, and energy intensity. These differences lead to variations in the acoustic expression of the same emotional state across speakers of different genders. Therefore, gender is an important factor influencing emotional expression. Based on this observation, GR is introduced as an auxiliary task in the proposed multi-task speech emotion recognition framework. The objective of this task is to identify the gender of the speaker from the input speech signal. The GR task can be formulated as follows:

$$y_{g}^{*} = \underset{y_{g} \in \{1, \dots, G\}}{\arg\max} \; f_{\mathrm{GR}}(x) \tag{5}$$

where $x$ denotes the speech input, $f_{\mathrm{GR}}(\cdot)$ represents the gender prediction function, which assigns a score to each candidate label, $G$ denotes the number of gender categories, $y_{g}$ ranges over the candidate gender labels, and $y_{g}^{*}$ is the predicted gender label.

Implementation: The processing of the GR task is similar to that of the SER task. First, the same average pooling layer is applied to reduce the dimensionality of the feature, yielding $v^{*}$. The resulting feature is then fed into a fully connected layer $\mathrm{FC}_{\mathrm{GR}}$, which maps it into the gender category space and outputs the logits $z_{g} \in \mathbb{R}^{G}$:

$$z_{g} = \mathrm{FC}_{\mathrm{GR}}(v^{*}) \tag{6}$$

where $G = 2$, corresponding to male and female.

2.2.3. Auxiliary Task: Speaker Recognition

Task Formulation: Different speakers exhibit significant variations in speaking habits, vocal styles, and acoustic feature distributions. As a result, even when expressing the same emotion, their acoustic representations may differ. Meanwhile, emotional information and speaker identity are inherently coupled in speech signals, which increases the difficulty of speech emotion recognition to some extent. In this work, SR is introduced as an auxiliary task. By incorporating task-level constraints, speaker-related information is implicitly injected, enabling the model to learn inter-speaker variability within the MTL framework. The objective of the SR task is to identify the speaker identity label corresponding to the input speech, which can be formulated as follows:

$$y_{s}^{*} = \underset{y_{s} \in \{1, \dots, S\}}{\arg\max} \; f_{\mathrm{SR}}(x) \tag{7}$$

where $x$ denotes the speech input, $f_{\mathrm{SR}}(\cdot)$ represents the speaker prediction function, which assigns a score to each candidate label, $S$ denotes the number of speaker categories, $y_{s}$ ranges over the candidate speaker identity labels, and $y_{s}^{*}$ is the predicted speaker identity label.

Implementation: The processing of the SR task is similar to that of the SER and GR tasks. First, the same average pooling layer is applied to reduce the dimensionality of the feature, yielding $v^{*}$. The resulting feature is then fed into a fully connected layer $\mathrm{FC}_{\mathrm{SR}}$, which maps it into the speaker category space and outputs the logits $z_{s} \in \mathbb{R}^{S}$:

$$z_{s} = \mathrm{FC}_{\mathrm{SR}}(v^{*}) \tag{8}$$

In this study, experiments are conducted on the IEMOCAP dataset, where $S = 10$, including five male speakers and five female speakers.

2.2.4. Auxiliary Task: Automatic Speech Recognition

Task Formulation: The expression of speech emotion is not only related to acoustic features but also closely associated with the linguistic content of speech. In this work, ASR is introduced as an auxiliary task in the multi-task speech emotion recognition framework, enabling the shared acoustic features to better preserve speech content and temporal structure while learning emotional information. The objective of the ASR task is to convert the input speech signal into the corresponding text sequence. Unlike classification tasks such as SER, ASR is a typical sequence-to-sequence (Seq2Seq) task. Given a speech feature sequence $Z = [z_{1}, z_{2}, \dots, z_{T}]$, where $T$ denotes the number of time frames, the goal of ASR is to find the most probable text sequence $W^{*}$, as shown in Equation (9):

$$W^{*} = \underset{W \in V^{*}}{\arg\max} \; P(W \mid Z) \tag{9}$$

where $V$ denotes the vocabulary, $V^{*}$ represents the set of all possible text sequences of arbitrary length constructed from the vocabulary, $W = [w_{1}, w_{2}, \dots, w_{L}]$ is a candidate text sequence, and $L$ denotes the length of the text sequence. $P(W \mid Z)$ represents the probability of generating the text sequence $W$ given the speech feature sequence $Z$.

Implementation: In the proposed method, the feature $v$ is fed, without pooling, into a fully connected layer $\mathrm{FC}_{\mathrm{ASR}}$. This layer maps the shared features into the character-level vocabulary space, producing a frame-level character prediction score matrix $z_{w} \in \mathbb{R}^{T \times |V|}$, where each row corresponds to the prediction score vector (logits) at a time step.

$$z_{w} = \mathrm{FC}_{\mathrm{ASR}}(v) \tag{10}$$

The vocabulary used in this study consists of 26 English letters and 6 punctuation symbols, resulting in a total of 32 characters, i.e., $|V| = 32$.
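To make the frame-level score matrix of Equation (10) more concrete, the sketch below shows one common way of turning such scores into text: greedy CTC decoding, which takes the best symbol per frame, merges consecutive repeats, and drops blanks. The article does not describe a decoding step (ASR is used only as a training-time auxiliary task), so this is an illustrative assumption; the four-symbol vocabulary and the blank index are hypothetical.

```python
BLANK = 0                            # assumed index of the CTC blank symbol
VOCAB = ["<blank>", "a", "b", "c"]   # hypothetical toy vocabulary


def greedy_ctc_decode(logits, vocab=VOCAB, blank=BLANK):
    """Decode a T x |V| score matrix by per-frame argmax, then collapse."""
    best = [max(range(len(row)), key=row.__getitem__) for row in logits]
    chars = []
    prev = None
    for idx in best:
        # Keep a symbol only if it differs from the previous frame
        # and is not the blank.
        if idx != prev and idx != blank:
            chars.append(vocab[idx])
        prev = idx
    return "".join(chars)


# Frames whose best symbols are a, a, blank, a, b, b, blank.
one_hot = {0: [5.0, 0.0, 0.0, 0.0], 1: [0.0, 5.0, 0.0, 0.0], 2: [0.0, 0.0, 5.0, 0.0]}
scores = [one_hot[i] for i in [1, 1, 0, 1, 2, 2, 0]]
print(greedy_ctc_decode(scores))
```

The blank between the two runs of "a" is what allows a repeated character to survive the collapse step.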

2.3. Multi-Task Collaborative Learning Mechanism and Joint Loss Construction

The design of the loss function determines the optimization direction and the effectiveness of the collaborative mechanism in MTL. The proposed framework involves a combination of classification tasks and a sequence task; therefore, a joint loss function is adopted. In the proposed MTL framework, four tasks are included: SER, SR, GR, and ASR. Among them, SER, GR, and SR are multi-class classification tasks, which are optimized using the cross-entropy loss, denoted as $L_{\mathrm{CE\_SER}}$, $L_{\mathrm{CE\_GR}}$, and $L_{\mathrm{CE\_SR}}$, respectively.

$$L_{\mathrm{CE\_SER}} = -\frac{1}{N} \sum_{j=1}^{N} \sum_{e=1}^{E} t_{e}^{j} \log(p_{e}^{j}) \tag{11}$$

$$L_{\mathrm{CE\_GR}} = -\frac{1}{N} \sum_{j=1}^{N} \sum_{g=1}^{G} t_{g}^{j} \log(p_{g}^{j}) \tag{12}$$

$$L_{\mathrm{CE\_SR}} = -\frac{1}{N} \sum_{j=1}^{N} \sum_{s=1}^{S} t_{s}^{j} \log(p_{s}^{j}) \tag{13}$$

where $N$ denotes the number of samples, and $E$, $G$, and $S$ represent the numbers of emotion, gender, and speaker categories, respectively. For simplicity, a class index $c$ is introduced. In the SER task, $c$ corresponds to $e$; in the GR task, $c$ corresponds to $g$; and in the SR task, $c$ corresponds to $s$. $t_{c}^{j}$ denotes the element of the one-hot label vector of the $j$-th sample at class $c$. Specifically, $t_{c}^{j} = 1$ if the ground-truth class of the sample is $c$, and $t_{c}^{j} = 0$ otherwise. $p_{c}^{j}$ represents the probability that the $j$-th sample is predicted to belong to class $c$, which is obtained by applying the softmax function to the prediction score vector $z^{j}$ of the $j$-th sample:

$$p_{c}^{j} = \mathrm{softmax}(z^{j})_{c} = \frac{\exp(z_{c}^{j})}{\sum_{c'} \exp(z_{c'}^{j})} \tag{14}$$
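Because the label vector is one-hot, the inner sum in Equations (11)–(13) reduces to the negative log-probability of the ground-truth class. The following minimal sketch computes the softmax of Equation (14) and the resulting batch-averaged cross-entropy; class indices are 0-based here, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one score vector (Eq. 14)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, targets):
    """Mean of -log p_c over N samples, with c the ground-truth class index."""
    losses = [-math.log(softmax(z)[c]) for z, c in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


# Uniform scores over 4 classes give a loss of ln(4), whatever the target is.
print(cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2]))
```

The same function covers the SER, GR, and SR losses; only the number of classes in each score vector changes.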

For the ASR task, due to the length mismatch and the lack of alignment between the input speech and output text sequences, the Connectionist Temporal Classification (CTC) loss [30] is adopted. The loss for the ASR task is denoted as $L_{\mathrm{CTC\_ASR}}$.

$$L_{\mathrm{CTC\_ASR}} = -\frac{1}{N} \sum_{j=1}^{N} \ln P(\pi^{j} \mid z_{w}^{j}) \tag{15}$$

where $N$ denotes the number of samples, $z_{w}^{j}$ represents the frame-level prediction score matrix for the $j$-th sample, $\pi^{j}$ denotes the corresponding ground-truth label sequence, and $P(\pi^{j} \mid z_{w}^{j})$ represents the probability of generating the ground-truth sequence $\pi^{j}$ given the output $z_{w}^{j}$. This probability is obtained by summing over all possible alignment paths that can map to $\pi^{j}$.
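The sum over alignment paths in Equation (15) is usually computed with a forward (dynamic programming) recursion rather than by enumeration. The sketch below implements that recursion for a single utterance and checks it against brute-force enumeration on a toy input. It takes per-frame probabilities (i.e., softmax-normalized rows of $z_{w}$) and assumes that index 0 is the blank symbol; both choices are assumptions of this example. The article does not say how the loss is computed in practice; a PyTorch implementation would more likely rely on the built-in `torch.nn.CTCLoss`.

```python
import math
from itertools import product


def ctc_neg_log_likelihood(probs, target, blank=0):
    """-ln P(target | probs) via the CTC forward recursion.

    probs: T rows of per-frame probabilities (each row sums to 1).
    target: label indices without blanks.
    """
    ext = [blank]
    for label in target:
        ext += [label, blank]            # blank-interleaved target
    S, T = len(ext), len(probs)
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    likelihood = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0 else math.inf


def collapse(path, blank=0):
    """Merge consecutive repeats, then remove blanks."""
    out, prev = [], None
    for k in path:
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return out


def brute_force_nll(probs, target, blank=0):
    """Reference value: enumerate every path that collapses to target."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        if collapse(path, blank) == list(target):
            total += math.prod(probs[t][k] for t, k in enumerate(path))
    return -math.log(total)


probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]  # toy values
print(ctc_neg_log_likelihood(probs, [1, 2]), brute_force_nll(probs, [1, 2]))
```

The recursion costs on the order of $T \times (2L + 1)$ operations, whereas enumeration grows exponentially with $T$, which is why the forward form is the practical one. Production implementations also work in log space to avoid underflow on long utterances.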

To simultaneously optimize the performance of the four tasks during training, the model combines the above loss functions into a weighted joint loss function. This weighted summation strategy is adopted to balance the contributions of different tasks, as they involve heterogeneous objectives (classification and sequence modeling) with potentially different loss scales and convergence behaviors. By introducing task-specific weights, the model can prevent any single task from dominating the optimization process and facilitate more stable and effective multi-task learning. Such a formulation is widely used in multi-task learning as a simple yet effective mechanism for coordinating multiple objectives.

$$L_{\mathrm{loss}} = L_{\mathrm{CE\_SER}} + \alpha \cdot L_{\mathrm{CE\_GR}} + \beta \cdot L_{\mathrm{CE\_SR}} + \gamma \cdot L_{\mathrm{CTC\_ASR}} \tag{16}$$

where $\alpha$, $\beta$, and $\gamma$ are hyperparameters that weight the GR, SR, and ASR losses, respectively, relative to the SER loss. The combination used in this work was determined through a grid search, with the aim of letting multi-task collaborative training improve SER performance while limiting negative transfer.
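The weighted sum of Equation (16) and a grid search over the three weights can be sketched as follows. The candidate values and the `evaluate` callback are hypothetical: the article does not state which grid was searched, and in practice `evaluate` would train the model with the given weights and return a validation score such as UA.

```python
from itertools import product


def joint_loss(l_ser, l_gr, l_sr, l_asr, alpha, beta, gamma):
    """Weighted joint loss of Eq. 16; the SER loss has a fixed weight of 1."""
    return l_ser + alpha * l_gr + beta * l_sr + gamma * l_asr


def weight_grid(alphas, betas, gammas):
    """All (alpha, beta, gamma) combinations of the candidate values."""
    return list(product(alphas, betas, gammas))


def select_weights(candidates, evaluate):
    """Return the weight triple with the highest validation score.

    `evaluate(alpha, beta, gamma)` is a hypothetical callback.
    """
    return max(candidates, key=lambda w: evaluate(*w))


# Hypothetical candidate values, for illustration only.
grid = weight_grid([0.1, 0.5], [0.1], [0.1, 0.5, 1.0])


def dummy_score(alpha, beta, gamma):
    # Stand-in for "train and validate"; peaks at alpha = 0.5, gamma = 0.1.
    return -(alpha - 0.5) ** 2 - (gamma - 0.1) ** 2


print(select_weights(grid, dummy_score))
```

Setting a weight to zero removes the corresponding auxiliary task from the objective, which is a convenient way to express single-task and dual-task ablation settings with the same code path.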

3. Experimental Results

3.1. Dataset

IEMOCAP (Interactive Emotional Dyadic Motion Capture Database) [31] is a widely used multimodal dataset in emotion recognition research. It contains 12 h of audiovisual recordings, including audio, textual transcriptions, video, phonetic details, and facial expressions. The dataset was collected from 10 professional actors (5 male and 5 female) performing dyadic interactions, with both scripted and spontaneous emotional expressions. To facilitate comparison with previous studies in this field, the “happy” and “excited” categories were merged into a single “happy” category. A total of 5531 audio utterances were used, corresponding to four emotions: angry (1103), sad (1084), happy (1636), and neutral (1708).

3.2. Evaluation Metrics

This study adopts two metrics that are standard in speech emotion recognition, Weighted Accuracy (WA) and Unweighted Accuracy (UA), to assess the performance of the model.

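$$\mathrm{WA} = \frac{\sum_{i=1}^{L} TP_{i}}{\sum_{i=1}^{L} (TP_{i} + FN_{i})} \tag{17}$$

$$\mathrm{UA} = \frac{1}{L} \sum_{i=1}^{L} \frac{TP_{i}}{TP_{i} + FN_{i}} \tag{18}$$

where $TP_{i}$ denotes the number of true positives for class $i$, $FN_{i}$ denotes the number of false negatives for class $i$, and $L$ is the total number of classes. Since $TP_{i} + FN_{i}$ is the number of samples in class $i$, WA is the overall accuracy over all samples, whereas UA is the mean of the per-class recalls and is therefore insensitive to class imbalance.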
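A minimal sketch of the two metrics in Equations (17) and (18), computed directly from label lists. The toy labels are made up to show how the two values diverge when classes are imbalanced; classes are taken to be those present in the ground truth.

```python
def weighted_accuracy(y_true, y_pred):
    """WA: correctly classified samples divided by all samples (Eq. 17)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class recalls over the classes in y_true (Eq. 18)."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy example: class 0 has four samples, class 1 has two.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(weighted_accuracy(y_true, y_pred), unweighted_accuracy(y_true, y_pred))
```

Here WA is 4/6 because four of the six samples are correct, while UA is the average of the recalls 3/4 and 1/2, so the smaller class counts as much as the larger one.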

To further validate the effectiveness of the proposed MTL framework, the auxiliary tasks are evaluated using Accuracy (ACC) and Word Error Rate (WER). For the GR and SR classification tasks, the class distribution in the IEMOCAP dataset is relatively balanced, so ACC is adopted as the evaluation metric.

$$\mathrm{ACC} = \frac{\sum_{i=1}^{C} TP_{i}}{N} \tag{19}$$

where $C$ denotes the number of classes, $TP_{i}$ denotes the number of true positives for class $i$, and $N$ denotes the total number of samples.

For the ASR task, WER is used as the evaluation metric to measure the error rate between the predicted text and the reference text.

$$\mathrm{WER} = \frac{S + D + I}{N} \tag{20}$$

where $S$, $D$, and $I$ represent the numbers of substitution, deletion, and insertion errors, respectively, and $N$ denotes the total number of words in the reference text. A lower WER indicates better ASR performance.
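The numerator of Equation (20) is the word-level edit distance between the reference and the predicted text. The sketch below computes it with the standard dynamic programming recurrence; whitespace tokenization and the example sentences are assumptions of this illustration, and the reference is assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions (S + D + I)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion of a reference word
                cur[j - 1] + 1,            # insertion of a hypothesis word
                prev[j - 1] + (r != h),    # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of Eq. 20, using whitespace tokenization."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted against the reference length, WER can exceed 1 when the hypothesis is much longer than the reference.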

3.3. Implementation Details

In this study, the raw audio signals from the IEMOCAP dataset were used, all resampled to 16 kHz. The model was implemented using the PyTorch 2.3.0 framework, with specific parameter settings shown in Table 1. To comprehensively evaluate the proposed method, standard random five-fold and ten-fold cross-validation strategies were adopted. In both settings, the dataset samples were randomly divided into training and validation subsets while maintaining the overall class distribution. For statistical reliability, all experiments were conducted multiple times with different fixed random seeds. Specifically, each five-fold cross-validation experiment was repeated 5 times, and each ten-fold cross-validation experiment was repeated 10 times. The final results are reported as mean ± standard deviation across these runs. It should be noted that the auxiliary tasks (SR, GR, and ASR) are used only during the training phase to assist representation learning and optimize the shared encoder, while only the SER task is evaluated during the testing phase. Due to the high computational cost of repeated training under cross-validation and multi-task learning settings, statistical significance tests (e.g., t-test) are not performed.
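The protocol described above (random folds that preserve the class distribution, repeated with different seeds and summarized as mean ± standard deviation) can be sketched as follows. This is only one plausible reading: the article does not give its splitting code, and whether the reported deviation is the sample or the population standard deviation is not stated, so the use of `statistics.stdev` is an assumption. The class counts are those reported for IEMOCAP in Section 3.1.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, k, seed):
    """Split sample indices into k folds that roughly preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)   # deal class members round-robin
    return folds


def summarize(scores):
    """Mean and sample standard deviation across repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)


# Class sizes from Section 3.1: angry, sad, happy, neutral.
labels = [0] * 1103 + [1] * 1084 + [2] * 1636 + [3] * 1708
folds = stratified_folds(labels, k=5, seed=0)
print([len(f) for f in folds])
```

In each round, one fold serves as the validation subset and the remaining folds form the training subset; repeating the whole procedure with a new seed yields one additional score for `summarize`.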

3.4. Comparative Experiments

Under the above cross-validation settings, this section compares the proposed method with several representative SER models on the IEMOCAP dataset to validate its effectiveness. The selected comparison models are briefly introduced as follows:

(1): MCRVT [32]: A speech emotion recognition model based on multi-level feature fusion. It enhances spectrogram features using multi-function attention, incorporates high-level semantic features from WavLM, and leverages both contrastive reconstruction networks and cross-attention fusion to explore complementary information.
(2): MS-Swinformer + DMTL [33]: A speech emotion recognition framework based on multi-scale fusion and MTL. It first extracts time–frequency features from speech using multi-scale convolutions, then incorporates attention mechanisms within the Swin Transformer structure to model long-range contextual dependencies. A dynamic MTL strategy is also proposed to jointly optimize high-level semantic features from Wav2vec and low-level acoustic features from MFCC, achieving optimal fusion of multi-source information.
(3): ENT [34]: An Emotion Neural Transducer model that achieves fine-grained speech emotion recognition through joint training with ASR. It extends the traditional Transducer structure by introducing an emotion joint network, modeling emotion class distributions on the alignment grid of acoustic and linguistic representations, forming an “emotion alignment grid.” Max-pooling is applied on the alignment grid to enhance the model’s ability to distinguish emotional frames from non-emotional frames.
(4): FENT [34]: An improved version of ENT. Unlike ENT, FENT separates the prediction of the blank symbol and token prediction, allowing the blank symbol to serve simultaneously as a special placeholder for ASR and as an indicator of emotion, enabling finer-grained capture of frame-level emotion dynamics.
(5): MMER [35]: MMER is a multimodal MTL method. It models both textual and audio modalities through early fusion and cross-modal self-attention. In addition to the primary task of emotion classification, three auxiliary tasks, including ASR and two contrastive learning-based tasks, are jointly optimized to improve the model’s recognition performance.
(6): Self-attention CNN-BLSTM [25]: An end-to-end spectrogram-based emotion recognition method. Self-attention guides the model to focus on emotion-relevant segments, while an MTL framework leveraging the correlation between emotion and gender further improves recognition performance.
(7): MS-SENet [36]: A multi-scale feature fusion network based on squeeze-and-excitation (SE) blocks. Using MFCC as input, multi-scale convolutions extract time–frequency features, which are reweighted by SE to enhance their effectiveness. Skip connections and spatial dropout layers are incorporated to prevent overfitting and increase network depth. Finally, TIM-Net [37] captures bidirectional emotion dependencies.
(8): Wav2vec2.0 + MTL [26]: A Wav2vec2.0-based multi-task speech emotion recognition model, with SER as the primary task and ASR as an auxiliary task.
(9): Co-attention [38]: A multi-level acoustic feature fusion network. It extracts MFCC, spectrograms, and high-level self-supervised representations from Wav2vec from raw speech, processes each feature through separate encoders, and performs final fusion using a co-attention mechanism.

Comparative experiments were conducted on the IEMOCAP dataset. Some baseline results were directly cited from the original studies, while the remaining baseline methods were reproduced under the same experimental settings in this work. The experimental results are presented in Table 2. In five-fold cross-validation, the proposed model achieved a WA of 83.24 ± 0.75% and a UA of 83.36 ± 0.73%; in ten-fold cross-validation, WA and UA reached 83.86 ± 0.66% and 84.23 ± 0.63%, respectively. Under both validation strategies, the model outperformed the comparative models in terms of WA and UA, demonstrating competitive recognition accuracy.

We also report the computational cost of the proposed model. The model requires 35.436 GFLOPs and achieves an inference time of 5.703 ms per utterance on an NVIDIA GeForce RTX 4080 GPU. This inference efficiency suggests that the proposed model is suitable for practical deployment in both real-time and near-real-time speech emotion recognition systems. In particular, the low latency enables its application in scenarios such as human–computer interaction, online affective computing, and speech-based intelligent systems. Although the model employs a multi-branch architecture during training, only the inference path of the SER task is retained during testing, ensuring efficient deployment without significant computational overhead.

3.5. Ablation Study

The proposed method is based on an MTL framework with a shared acoustic encoder, where different tasks impose constraints on the shared acoustic features to enhance speech emotion recognition performance. Based on this framework, a hierarchical ablation study is conducted from two perspectives: shared acoustic representation modeling and task collaboration mechanism. At the shared acoustic representation level, the modeling capability of the shared encoder is compared by substituting different pre-trained acoustic models, as discussed in Section 3.5.1. At the task collaboration level, while keeping the shared encoder and network structure consistent, the weights of different auxiliary tasks are adjusted to analyze their impact on SER performance. Specifically, Section 3.5.2, Section 3.5.3 and Section 3.5.4 investigate the effects of individual auxiliary tasks (GR, SR, and ASR) on SER, Section 3.5.5 explores the impact of dual-task combinations, and Section 3.5.6 further examines the performance of joint optimization with all three auxiliary tasks.

3.5.1. Selection of Pre-Trained Acoustic Models

Feature extraction is a core step in speech emotion recognition, as the representational capability of the features directly affects the model’s final recognition performance. Compared with traditional acoustic features, high-level acoustic features extracted from pre-trained models have demonstrated superior performance across multiple speech-related tasks. Currently, mainstream pre-trained acoustic models include Wav2vec2.0, HuBERT, and WavLM. To identify the model best suited as the shared encoder in the proposed MTL framework, this section conducts comparative experiments by substituting different pre-trained acoustic models as the shared encoder while keeping all other experimental settings identical. The results are shown in Figure 2.

3.5.2. Effect of the GR Auxiliary Task on SER Performance

In speech emotion expression, gender differences impose unique constraints through inherent acoustic characteristics, which in turn affect the quality of emotion-related feature representations. In the experiments of this section, the weights of the SR task β and the ASR task γ are set to 0. By varying the weight of the gender recognition task α (set to 0, 0.2, 0.4, 0.6, and 1), the impact of GR as a single auxiliary task on the SER main task is analyzed.
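The role of the three weights can be made concrete with a small sketch. It assumes the joint objective is the SER loss plus a weighted sum of the auxiliary losses, L = L_SER + α·L_GR + β·L_SR + γ·L_ASR, which is consistent with the statement that setting all auxiliary weights to 0 leaves only the SER task. The function name and the loss values are hypothetical.

```python
def joint_loss(l_ser, l_gr, l_sr, l_asr, alpha=0.0, beta=0.0, gamma=0.0):
    """Weighted multi-task objective (assumed form, see the lead-in)."""
    return l_ser + alpha * l_gr + beta * l_sr + gamma * l_asr


# Hypothetical per-batch loss values, for illustration only.
losses = dict(l_ser=1.20, l_gr=0.60, l_sr=2.00, l_asr=35.0)

baseline = joint_loss(**losses)            # alpha = beta = gamma = 0: SER only
gr_only = joint_loss(**losses, alpha=0.4)  # GR ablation: beta = gamma = 0
```

With β = γ = 0, the SR and ASR terms vanish regardless of their magnitude, so only the GR branch contributes additional gradient to the shared encoder.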

Figure 3a,b show the changes in WA and UA, respectively, over 100 training epochs for different values of the weight α. Here, α = 0 serves as the baseline, where the MTL framework contains only the SER task. As α increases, the emotion recognition accuracy first rises and then declines, reaching the best performance at α = 0.4. This is because moderate gender supervision effectively constrains gender-related features, promoting the disentanglement between gender and emotion representations, and enabling the model to learn more generalizable emotional features. However, when the gender constraint becomes overly strong, it introduces competition between tasks, causing the model to overemphasize gender differences and thus weaken its ability to capture emotional dynamics. This observation indicates that there exists a dynamic balance between collaboration and competition among tasks in the multi-task learning framework.

Figure 4 shows the GR accuracy under different values of α. In the baseline group, the GR task is not included, so the learned shared features are not constrained by gender supervision, resulting in a gender recognition accuracy of approximately 50%, close to the random level for a binary classification. As α gradually increases from 0, the convergence speed of the GR task significantly improves, and the accuracy increases, reaching its peak at α = 1. However, when α approaches 1, the convergence rate noticeably slows down, and the improvement in accuracy becomes smaller.

Therefore, considering both SER and GR performance, α = 0.4 is selected as the optimal setting.

3.5.3. Effect of the SR Auxiliary Task on SER Performance

Due to individual differences in emotional expression among speakers, the SR task is introduced to help the model distinguish speaker-specific characteristics from emotion-related features. In this experiment, α and γ are set to 0, and β is varied (0, 0.2, 0.6, 0.8, and 1) to analyze the effect of SR as an auxiliary task on the SER performance.

Figure 5a,b show the changes in WA and UA for different β values. Following the setup in Section 3.5.2, β = 0 serves as the baseline. As β increases from 0 to 0.6, the emotion recognition accuracy improves, indicating that incorporating speaker information can help the model better discriminate emotions across different speakers. The best performance is achieved when β = 0.6. However, further increasing β causes the model to focus more on fitting speaker identity rather than emotion information, leading to a decrease in accuracy.

In addition, the SR accuracy under different β values was analyzed, as shown in Figure 6. For the baseline without the SR task, the SR accuracy was close to the random level for ten classes. As β gradually increased from 0, the convergence speed of the SR task increased significantly, and the accuracy improved, reaching the best performance at β = 1. However, when β exceeded 0.6, both the convergence speed and the accuracy improvement slowed down noticeably.

In summary, β = 0.6 was selected as the optimal setting.

3.5.4. Effect of the ASR Auxiliary Task on SER Performance

Automatic speech recognition serves as an auxiliary task, providing semantic-level constraints to enhance the model’s emotion recognition capability. With α and β set to 0, the effect of the ASR single auxiliary task on SER is analyzed by varying γ (set to 0, 0.1, 0.2, 0.4, and 1).

Figure 7a,b illustrate the changes in WA and UA under different γ values, with γ = 0 again serving as the baseline. As γ increases from 0 to 0.2, the emotion recognition accuracy gradually improves, indicating that adding the ASR task can help the model better discriminate emotions, and the best performance is achieved at γ = 0.2. However, as γ increases from 0.2 to 1, the model’s convergence speed slows down.

In addition, Figure 8 presents the WER of ASR under different γ values. In the baseline group, where the ASR task is not included, WER is 1. Introducing the ASR task shows a significant effect even with a small weight, achieving the lowest WER when γ = 0.2. However, as γ increases from 0.2 to 1, the ASR performance exhibits a trend similar to that observed for SER, indicating a reduced optimization effect for the ASR task.
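To make the WER figures easier to interpret, the following sketch computes word error rate as the word-level edit distance divided by the number of reference words. This is the standard definition rather than the authors' evaluation script, and the example sentences are made up. One plausible reading of the baseline value is that a decoder emitting no words produces only deletions, which gives a WER of exactly 1.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("i am so happy today", "i am happy today"))  # one deletion
print(word_error_rate("i am so happy today", ""))                  # every word deleted
```

Because insertions are counted, WER can exceed 1 when the hypothesis is much longer than the reference.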

In summary, γ = 0.2 is selected as the optimal setting.

3.5.5. Effect of Joint Training with Two Auxiliary Tasks on SER Performance

Through the above ablation experiments on single auxiliary tasks, the independent contributions and optimal weights of each task have been clarified. Specifically, when α = 0.4, the GR task provides the best improvement for SER; when β = 0.6, the SR task provides the best improvement for SER; and when γ = 0.2, the ASR task provides the best improvement for SER. This subsection further explores the synergistic effects of any two auxiliary tasks, providing an empirical basis for the subsequent joint optimization of all three tasks.

First, the effect of joint training with GR and SR on SER is analyzed. A discrete search space is constructed around the optimal weights of the single tasks, with α ∈ {0, 0.2, 0.4, 0.6} and β ∈ {0.2, 0.4, 0.6, 0.8} for combination experiments, resulting in a total of 16 groups. The WA results of SER are shown in Figure 9. When α = 0.2 and β = 0.4, the synergistic effect is maximized. Compared with the optimal weights when GR and SR assist individually, the optimal weights of both tasks shift toward smaller values during joint training. This may be because GR and SR focus on partially correlated information; if both weights are too large, it may lead to redundancy. Smaller weights allow the two auxiliary tasks to impose appropriate constraints on the shared representation, thereby improving recognition accuracy.

Next, the effect of joint training of GR and ASR on SER was analyzed. A discrete search space was constructed around the optimal single-task weights, with α ∈ {0, 0.2, 0.4, 0.6, 0.8} and γ ∈ {0.1, 0.2, 0.3} for combination experiments, totaling 15 groups. The experimental results are shown in Figure 10, where the best synergistic effect was achieved when α = 0.6 and γ = 0.2. It can also be observed that the overall performance of GR and ASR joint training is better than that of GR and SR joint training. This may be because GR and ASR tasks focus on more distinct aspects of information, providing more complementary signals, whereas SR and GR tasks are somewhat correlated, limiting their contribution to overall performance improvement.

Finally, the effect of joint training of SR and ASR on SER was analyzed. Combination experiments were conducted with γ ∈ {0, 0.1, 0.2, 0.3} and β ∈ {0.4, 0.6, 0.8}. The results are shown in Figure 11, where the best synergistic effect was achieved when β = 0.6 and γ = 0.1. It can be observed that a smaller γ provides a better auxiliary effect for speech emotion recognition, indicating that within the proposed framework, the ASR task weight should not be too large, as an excessively high weight may interfere with the main task. From the perspective of β, as the weight increases, the performance of emotion recognition initially improves and then decreases. The optimal performance was obtained at β = 0.6, consistent with the trend observed when only the SR auxiliary task was applied, suggesting that the SR task contributes to SER performance within an appropriate optimal range.
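The three search spaces described in this subsection can be enumerated as follows. The sketch only builds the weight grids and counts the runs; `train_and_eval` in the comment is a hypothetical placeholder for a full training run. The second weight of the GR+ASR grid is taken to be γ (the ASR weight), as implied by the reported optimum α = 0.6, γ = 0.2.

```python
from itertools import product

# Candidate weights per auxiliary-task pair, as listed in the text.
grids = {
    "GR+SR": {"alpha": [0, 0.2, 0.4, 0.6], "beta": [0.2, 0.4, 0.6, 0.8]},
    "GR+ASR": {"alpha": [0, 0.2, 0.4, 0.6, 0.8], "gamma": [0.1, 0.2, 0.3]},
    "SR+ASR": {"gamma": [0, 0.1, 0.2, 0.3], "beta": [0.4, 0.6, 0.8]},
}


def configurations(grid):
    """Yield one {weight_name: value} dict per point of the grid."""
    names = list(grid)
    for values in product(*(grid[n] for n in names)):
        config = {"alpha": 0.0, "beta": 0.0, "gamma": 0.0}  # unused weight stays 0
        config.update(zip(names, values))
        yield config


for pair, grid in grids.items():
    runs = list(configurations(grid))
    # Each entry would be passed to a (hypothetical) train_and_eval(**config).
    print(pair, len(runs))
```

The counts for the first two grids are 16 and 15, matching the numbers of groups stated above; the SR+ASR grid contains 12 combinations.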

3.5.6. Performance Analysis of Joint Optimization with Three Auxiliary Tasks

Based on the experimental results of single and dual auxiliary tasks, this section further explores the overall performance when all three auxiliary tasks are simultaneously involved in training. Combining the findings from single-task and dual-task fusion experiments, it was observed that the SER performance is relatively stable when γ is optimal. Therefore, γ ∈ {0.1, 0.2} was selected for experimentation. The optimal ranges for GR and SR task weights exhibit some fluctuations, so α ∈ {0.2, 0.4, 0.6} and β ∈ {0.4, 0.6, 0.8} were used in the experiments. The results under different weight combinations are shown in Table 3. It can be seen that when α = 0.4, β = 0.6, and γ = 0.1, SER achieves the best performance.
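As a cross-check on this search space, the sketch below enumerates the 3 × 3 × 2 = 18 weight combinations and ranks the WA and UA values transcribed from Table 3. Treating WA as the selection criterion is an assumption, and the variable names are illustrative.

```python
from itertools import product

alphas, betas, gammas = [0.2, 0.4, 0.6], [0.4, 0.6, 0.8], [0.1, 0.2]

# (WA, UA) per configuration, transcribed from Table 3 in row order
# (alpha varies slowest, gamma fastest).
scores = [
    (0.8137, 0.8228), (0.8179, 0.8215), (0.8258, 0.8364),
    (0.8194, 0.8220), (0.8201, 0.8214), (0.8183, 0.8189),
    (0.8226, 0.8302), (0.8168, 0.8273), (0.8324, 0.8336),
    (0.8232, 0.8324), (0.8310, 0.8321), (0.8221, 0.8294),
    (0.8217, 0.8221), (0.8113, 0.8198), (0.8271, 0.8281),
    (0.8205, 0.8222), (0.8232, 0.8247), (0.8192, 0.8216),
]

configs = list(product(alphas, betas, gammas))
assert len(configs) == len(scores) == 18

results = dict(zip(configs, scores))
best_by_wa = max(results, key=lambda cfg: results[cfg][0])
best_by_ua = max(results, key=lambda cfg: results[cfg][1])
print(best_by_wa, results[best_by_wa])
print(best_by_ua, results[best_by_ua])
```

With the Table 3 values, ranking by WA returns (α, β, γ) = (0.4, 0.6, 0.1), the configuration chosen in the text, whereas the single highest UA entry in the table is 0.8364 at (0.2, 0.6, 0.1).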

Based on the experimental analysis of the effects of single-task, dual-task, and multiple auxiliary-task settings on model performance, we observe that when the weight of auxiliary tasks exceeds a certain range, the performance of the SER task degrades. This indicates the presence of task interference and potential negative transfer between tasks. However, when an appropriate weighting coefficient is applied, the multi-task learning framework consistently outperforms the single-task baseline, demonstrating that the auxiliary tasks provide complementary information for representation learning rather than introducing harmful interference.

These observations suggest that negative transfer does not dominate the optimization process in the proposed framework, as the shared encoder is still able to learn task-invariant acoustic representations under a properly balanced weighting strategy. Therefore, the performance improvement can be attributed to a controllable trade-off between shared representation learning and task-specific optimization.

To more intuitively compare the impact of different task settings on model performance, we systematically organized the above ablation results and constructed Table 4, where “√” indicates that the corresponding task is included in the model.
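The gains over the SER-only baseline can be read off Table 4 with a few lines of arithmetic. The values below are transcribed from that table; the dictionary keys are shorthand labels introduced here for the auxiliary tasks enabled in each row.

```python
# (WA, UA) transcribed from Table 4; keys name the auxiliary tasks used.
table4 = {
    "none": (0.7733, 0.7871),
    "GR": (0.7971, 0.8102),
    "SR": (0.7989, 0.8104),
    "ASR": (0.8190, 0.8206),
    "GR+SR": (0.8062, 0.8143),
    "GR+ASR": (0.8245, 0.8174),
    "SR+ASR": (0.8261, 0.8287),
    "GR+SR+ASR": (0.8324, 0.8336),
}

base_wa, base_ua = table4["none"]
# Absolute gain in percentage points relative to the SER-only row.
gains = {
    name: (round((wa - base_wa) * 100, 2), round((ua - base_ua) * 100, 2))
    for name, (wa, ua) in table4.items()
}

for name, (d_wa, d_ua) in gains.items():
    print(f"{name:<10} WA {d_wa:+.2f} pts   UA {d_ua:+.2f} pts")
```

For the full three-task configuration this gives a gain of 5.91 percentage points in WA and 4.65 in UA over the SER-only baseline.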

3.5.7. Comparison of Fixed and Adaptive Loss Weighting Strategies

In multi-task learning, the design of loss weighting strategies plays a crucial role in balancing the optimization among different tasks. To further investigate the effectiveness of the proposed weighting scheme, we compare the commonly used fixed loss weighting strategy with an adaptive loss weighting method.

Specifically, the fixed weighting strategy assigns constant coefficients to each task loss during training, i.e., the optimal weight configuration identified in Section 3.5.6 (α = 0.4, β = 0.6, γ = 0.1), which remains unchanged throughout the entire training process. In contrast, adaptive weighting strategies dynamically adjust task weights based on training signals such as gradient magnitudes or uncertainty estimation. In this study, we evaluate several representative adaptive methods, including Uncertainty Weighting, GradNorm, and Dynamic Weight Averaging (DWA). The experimental results are reported in Table 5.
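As an illustration of how an adaptive scheme differs from fixed weights, the sketch below implements Dynamic Weight Averaging in its commonly described form: each task's weight follows the ratio of its two most recent epoch losses, passed through a softmax with temperature T and scaled by the number of tasks K. The DWA settings used in the experiments are not stated here, so T = 2, the task names, and the loss histories are assumptions made for illustration.

```python
import math


def dwa_weights(loss_history, temperature=2.0):
    """Dynamic Weight Averaging weights for the next epoch.

    loss_history maps task name -> list of average epoch losses so far.
    With fewer than two recorded epochs, all weights default to 1.
    """
    tasks = list(loss_history)
    k = len(tasks)
    if any(len(loss_history[t]) < 2 for t in tasks):
        return {t: 1.0 for t in tasks}
    # Relative descent rate: a ratio near 1 means the task loss has stalled.
    rates = {t: loss_history[t][-1] / loss_history[t][-2] for t in tasks}
    exps = {t: math.exp(rates[t] / temperature) for t in tasks}
    norm = sum(exps.values())
    return {t: k * exps[t] / norm for t in tasks}


# Hypothetical loss curves: GR converges fast, ASR slowly.
history = {
    "GR": [0.70, 0.35],
    "SR": [2.30, 1.84],
    "ASR": [40.0, 38.0],
}
weights = dwa_weights(history)
```

The weights sum to K and shift toward tasks whose loss is decreasing more slowly, so they change from epoch to epoch, unlike the constant α, β, and γ of the fixed strategy.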

The experimental results demonstrate that, within the proposed framework, the fixed weighting strategy outperforms adaptive weighting methods in terms of both performance and stability. This may be attributed to the high homogeneity of the speech-related tasks in this study, where all tasks share the same acoustic input space and exhibit strong feature correlations. In this case, manually designed fixed weights can effectively balance the contributions of different tasks without introducing additional optimization noise.

In contrast, adaptive weighting methods dynamically adjust task weights during training, which may intensify competition among tasks and adversely affect the stability of model convergence.

4. Conclusions

This paper proposes a multi-task speech emotion recognition network based on a pre-trained acoustic model. Taking speech emotion recognition as the primary task, speaker recognition, gender recognition, and automatic speech recognition are introduced as auxiliary tasks for joint training, thereby enhancing the model’s ability to capture latent acoustic factors and enabling effective modeling of speaker individuality, gender characteristics, and linguistic content differences. As a result, the performance and generalization capability of speech emotion recognition are further improved. Experimental results on the IEMOCAP dataset demonstrate that the proposed method achieves competitive performance. Under five-fold cross-validation, the WA and UA reach 83.24% and 83.36%, respectively, while under ten-fold cross-validation, they reach 83.86% and 84.23%, respectively. It should be noted that the experiments in this work are mainly conducted on the IEMOCAP dataset, and further evaluations on additional public speech emotion datasets, cross-corpus scenarios, and complex noisy environments have not yet been performed. Therefore, the generalization capability of the proposed model in different domains and real-world scenarios still requires further investigation. In future work, more public speech emotion datasets will be introduced for cross-corpus evaluations, together with noise robustness tests and domain-shift analysis, to provide a more comprehensive assessment of the model’s generalization performance. In addition, model compression techniques such as pruning and quantization will be further explored to improve deployment efficiency in practical applications.

Author Contributions

Conceptualization, X.W.; methodology, X.W.; software, K.Y.; validation, X.W.; formal analysis, X.W.; investigation, X.W.; resources, Y.Y.; data curation, K.Y.; writing—original draft preparation, X.W.; writing—review and editing, Y.Y.; visualization, X.W.; supervision, Y.Y.; project administration, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Chhikara, P.; Singh, P.; Tekchandani, R.; Kumar, N.; Guizani, M. Federated Learning Meets Human Emotions: A Decentralized Framework for Human-Computer Interaction for IoT Applications. IEEE Internet Things J. 2021, 8, 6949–6962.
2. Zhao, Z.; Bao, Z.; Zhang, Z.; Deng, J.; Cummins, N.; Wang, H.; Tao, J.; Schuller, B. Automatic Assessment of Depression From Speech via a Hierarchical Attention Transfer Network and Attention Autoencoders. IEEE J. Sel. Top. Signal Process. 2020, 14, 423–434.
3. Miranda Calero, J.A.; Rituerto-González, E.; Luis-Mingueza, C.; Canabal, M.F.; Barcenas, A.R.; Lanza-Gutierrez, J.M.; Pelaez-Moreno, C.; Lopez-Ongil, C. Bindi: Affective Internet of Things to Combat Gender-Based Violence. IEEE Internet Things J. 2022, 9, 21174–21193.
4. Dehbozorgi, N.; Kunuku, M.T. Exploring the Influence of Emotional States in Peer Interactions on Students’ Academic Performance. IEEE Trans. Educ. 2024, 67, 405–412.
5. Al-Radhi, M.S.; Csapó, T.G.; Németh, G. Adaptive refinements of pitch tracking and HNR estimation within a vocoder for statistical parametric speech synthesis. Appl. Sci. 2019, 9, 2460.
6. Kwon, O.W.; Chan, K.; Hao, J.; Lee, T.-W. Emotion recognition by speech signals. In Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech 2003), Geneva, Switzerland, 1–4 September 2003; pp. 125–128.
7. Umamaheswari, J.; Akila, A. An Enhanced Human Speech Emotion Recognition Using Hybrid of PRNN and KNN. In Proceedings of the 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), Faridabad, India, 2019; IEEE: New York, NY, USA, 2019; pp. 177–183.
8. Koolagudi, S.G.; Murthy, Y.V.S.; Bhaskar, S.P. Choice of a classifier, based on properties of a dataset: Case study-speech emotion recognition. Int. J. Speech Technol. 2018, 21, 167–183.
9. Costantini, G.; Parada-Cabaleiro, E.; Casali, D.; Cesarini, V. The Emotion Probe: On the Universality of Cross-Linguistic and Cross-Gender Speech Emotion Recognition via Machine Learning. Sensors 2022, 22, 2461.
10. Huang, Z.; Dong, M.; Mao, Q.; Zhan, Y. Speech Emotion Recognition Using CNN. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 2014; ACM: New York, NY, USA, 2014; pp. 801–804.
11. Chen, M.; He, X.; Yang, J.; Zhang, H. 3-D Convolutional Recurrent Neural Networks with Attention Model for Speech Emotion Recognition. IEEE Signal Process. Lett. 2018, 25, 1440–1444.
12. Mustaqeem, N.; Kwon, S. A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition. Sensors 2019, 20, 183.
13. Li, D.; Yang, H.; Song, Z.; Wang, Z. MSMF-MIL: Multi-Scale Mixed Feature Based Multiple Instance Learning for Speech Emotion Recognition. IEEE Trans. Consum. Electron. 2025, 71, 7539–7550.
14. Shahin, I.; Hindawi, N.; Nassif, A.B.; Alhudhaif, A.; Polat, K. Novel dual-channel long short-term memory compressed capsule networks for emotion recognition. Expert Syst. Appl. 2022, 188, 116080.
15. Yang, Z.; Li, Z.; Zhou, S.; Zhang, L.; Serikawa, S. Speech emotion recognition based on multi-feature speed rate and LSTM. Neurocomputing 2024, 601, 128177.
16. Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 2019, 47, 312–323.
17. Liu, M.; Raj, A.N.J.; Rajangam, V.; Ma, K.; Zhuang, Z.; Zhuang, S. Multiscale-multichannel feature extraction and classification through one-dimensional convolutional neural network for Speech emotion recognition. Speech Commun. 2024, 156, 103010.
18. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised Pre-training for Speech Recognition. In Proceedings of Interspeech 2019, Graz, Austria, 2019; International Speech Communication Association: Grenoble, France, 2019; pp. 3465–3469.
19. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), 2020; Neural Information Processing Systems Foundation, Inc. (NeurIPS): San Diego, CA, USA, 2020; pp. 12449–12460.
20. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460.
21. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518.
22. Wagner, J.; Triantafyllopoulos, A.; Wierstorf, H.; Schmitt, M.; Burkhardt, F.; Eyben, F.; Schuller, B.W. Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10745–10759.
23. Chakhtouna, A.; Sekkate, S.; Adib, A. Unveiling embedded features in Wav2vec2 and HuBERT msodels for Speech Emotion Recognition. Procedia Comput. Sci. 2024, 232, 2560–2569.
24. Chen, L.W.; Rudnicky, A. Exploring Wav2vec 2.0 Fine Tuning for Improved Speech Emotion Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 2023; IEEE: New York, NY, USA, 2023; pp. 1–5.
25. Li, Y.; Zhao, T.; Kawahara, T. Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning. In Proceedings of Interspeech 2019, Graz, Austria, 2019; International Speech Communication Association: Grenoble, France, 2019; pp. 2803–2807.
26. Cai, X.; Yuan, J.; Zheng, R.; Huang, L.; Church, K. Speech Emotion Recognition with Multi-Task Learning. In Proceedings of Interspeech 2021, Brno, Czech Republic, 2021; International Speech Communication Association: Grenoble, France, 2021; pp. 4508–4512.
27. Lee, S.W. Diverse Feature Mapping and Fusion via Multitask Learning for Multilingual Speech Emotion Recognition. In Proceedings of Interspeech 2023, Dublin, Ireland, 2023; International Speech Communication Association: Grenoble, France, 2023; pp. 3944–3948.
28. Tzeng, J.T.; Leem, S.G.; Salman, A.N.; Lee, C.-C.; Busso, C. Noise-Robust Speech Emotion Recognition Using Shared Self-Supervised Representations with Integrated Speech Enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025; IEEE: New York, NY, USA, 2025; pp. 1–5.
29. Ryumina, E.; Axyonov, A.; Koryakovskaya, D.; Abdulkadirov, T.; Egorova, A.; Fedchin, S.; Zaburdaev, A.; Ryumin, D. SSL-MEPR: A Semi-Supervised Multi-Task Cross-Domain Learning Framework for Multimodal Emotion and Personality Recognition. Mach. Learn. Knowl. Extr. 2026, 8, 56.
30. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 369–376.
31. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359.
32. Li, X.H.; Liu, Z.T.; Zou, Y.J.; She, J.; Hirota, K. MCRVT: Multi-Hierarchical Cross-Reconstruction Networks With Versatile Transformer for Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2025, 16, 2189–2199.
33. Lan, D.; Cheng, H. MS-Swinformer and DMTL: Multi-scale spatial fusion and dynamic multi-task learning for speech emotion recognition. Comput. Speech Lang. 2026, 99, 101908.
34. Shen, S.; Gao, Y.; Liu, F.; Wang, H.; Zhou, A. Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 2024; IEEE: New York, NY, USA, 2024; pp. 10111–10115.
35. Ghosh, S.; Tyagi, U.; Ramaneswaran, S.; Srivastava, H.; Manocha, D. MMER: Multimodal Multi-task Learning for Speech Emotion Recognition. In Proceedings of Interspeech 2023, Dublin, Ireland, 2023; International Speech Communication Association: Grenoble, France, 2023; pp. 1209–1213.
36. Li, M.; Zheng, Y.; Li, D.; Wu, Y.; Wang, Y.; Fei, H. MS-SENet: Enhancing Speech Emotion Recognition Through Multi-Scale Feature Fusion with Squeeze-and-Excitation Blocks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 2024; IEEE: New York, NY, USA, 2024; pp. 12271–12275.
37. Ye, J.; Wen, X.C.; Wei, Y.; Xu, Y.; Liu, K.; Shan, H. Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach for Speech Emotion Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 2023; IEEE: New York, NY, USA, 2023; pp. 1–5.
38. Zou, H.; Si, Y.; Chen, C.; Rajan, D.; Chng, E.S. Speech Emotion Recognition with Co-Attention Based Multi-Level Acoustic Information. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022; IEEE: New York, NY, USA, 2022; pp. 7367–7371.

Figure 1. Network architecture of multi-task speech emotion recognition based on pre-trained acoustic models.

Figure 2. Effect of different pre-trained acoustic models on emotion recognition accuracy.

Figure 3. Accuracy of SER under different values of α: (a) WA; (b) UA.

Figure 4. GR accuracy under different α.

Figure 5. Accuracy of SER under different values of β: (a) WA; (b) UA.

Figure 6. SR accuracy under different β.

Figure 7. Accuracy of SER under different values of γ: (a) WA; (b) UA.

Figure 8. WER of ASR under different γ.

Figure 9. Effect of joint training of GR and SR on SER.

Figure 10. Effect of joint training of GR and ASR on SER.

Figure 11. Effect of joint training of SR and ASR on SER.

Table 1. Experimental parameter settings.

Parameter	Value
Optimizer	AdamW
Learning Rate	10⁻⁵
Batch Size	8
Training Epochs	100

Table 2. Comparison of accuracy for different models on the IEMOCAP dataset.

Model	k-Fold	WA (%)	UA (%)	FLOPs (G)	Inference Time (ms)
MCRVT [32]	5	73.02	71.57	-	-
MS-Swinformer + DMTL [33]	5	71.12	72.31	-	-
FENT [34]	5	71.84	72.37	-	-
ENT [34]	5	72.43	73.88	-	-
MMER [35]	5	81.20	-	138.77	16.01
Self-attention CNN-BLSTM [25]	5	81.60	82.80	26.10	9.62
Ours	5	83.24 ± 0.75	83.36 ± 0.73	35.44	5.81
Co-attention [38]	10	71.64	72.70	45.09	3.51
MS-Swinformer + DMTL [33]	10	72.68	73.45	-	-
MS-SENet [36]	10	73.38	73.67	-	-
Wav2vec2.0 + MTL [26]	10	78.15	-	17.31	5.52
Ours	10	83.86 ± 0.66	84.23 ± 0.63	35.44	5.81

Table 3. Emotion recognition performance under joint optimization of three auxiliary tasks.

ID	α	β	γ	WA	UA
1	0.2	0.4	0.1	0.8137	0.8228
2	0.2	0.4	0.2	0.8179	0.8215
3	0.2	0.6	0.1	0.8258	0.8364
4	0.2	0.6	0.2	0.8194	0.8220
5	0.2	0.8	0.1	0.8201	0.8214
6	0.2	0.8	0.2	0.8183	0.8189
7	0.4	0.4	0.1	0.8226	0.8302
8	0.4	0.4	0.2	0.8168	0.8273
9	0.4	0.6	0.1	0.8324	0.8336
10	0.4	0.6	0.2	0.8232	0.8324
11	0.4	0.8	0.1	0.8310	0.8321
12	0.4	0.8	0.2	0.8221	0.8294
13	0.6	0.4	0.1	0.8217	0.8221
14	0.6	0.4	0.2	0.8113	0.8198
15	0.6	0.6	0.1	0.8271	0.8281
16	0.6	0.6	0.2	0.8205	0.8222
17	0.6	0.8	0.1	0.8232	0.8247
18	0.6	0.8	0.2	0.8192	0.8216

Table 4. Comparison of different task configurations.

SER	GR	SR	ASR	WA	UA
√				0.7733	0.7871
√	√			0.7971	0.8102
√		√		0.7989	0.8104
√			√	0.8190	0.8206
√	√	√		0.8062	0.8143
√	√		√	0.8245	0.8174
√		√	√	0.8261	0.8287
√	√	√	√	0.8324	0.8336

Table 5. Comparison of different loss weighting strategies.

Method	WA	UA
Ours	0.8324	0.8336
Uncertainty Weighting	0.8140	0.8255
GradNorm	0.8201	0.8220
DWA	0.8172	0.8193

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, X.; Yao, K.; Yi, Y. Multi-Task Learning-Based Speech Emotion Recognition Using Pre-Trained Acoustic Model. Appl. Sci. 2026, 16, 5166. https://doi.org/10.3390/app16105166

