Multi-Task Learning-Based Speech Emotion Recognition Using Pre-Trained Acoustic Model
Abstract
1. Introduction
2. Theoretical Method
2.1. Shared Acoustic Feature Extraction
2.2. Task-Specific Decoding
2.2.1. Primary Task: Speech Emotion Recognition
2.2.2. Auxiliary Task: Gender Recognition
2.2.3. Auxiliary Task: Speaker Recognition
2.2.4. Auxiliary Task: Automatic Speech Recognition
2.3. Multi-Task Collaborative Learning Mechanism and Joint Loss Construction
3. Experimental Results
3.1. Dataset
3.2. Evaluation Metrics
3.3. Implementation Details
3.4. Comparative Experiments
- (1) MCRVT [32]: A speech emotion recognition model based on multi-level feature fusion. It enhances spectrogram features using multi-function attention, incorporates high-level semantic features from WavLM, and leverages both contrastive reconstruction networks and cross-attention fusion to explore complementary information.
- (2) MS-Swinformer + DMTL [33]: A speech emotion recognition framework based on multi-scale fusion and MTL. It first extracts time–frequency features from speech using multi-scale convolutions, then incorporates attention mechanisms within the Swin Transformer structure to model long-range contextual dependencies. A dynamic MTL strategy is also proposed to jointly optimize high-level semantic features from Wav2vec and low-level acoustic features from MFCC, achieving optimal fusion of multi-source information.
- (3) ENT [34]: An Emotion Neural Transducer model that achieves fine-grained speech emotion recognition through joint training with ASR. It extends the traditional Transducer structure by introducing an emotion joint network that models emotion class distributions on the alignment grid of acoustic and linguistic representations, forming an “emotion alignment grid.” Max-pooling is applied on the alignment grid to enhance the model’s ability to distinguish emotional frames from non-emotional frames.
- (4) FENT [34]: An improved version of ENT. Unlike ENT, FENT separates blank-symbol prediction from token prediction, allowing the blank symbol to serve simultaneously as a special placeholder for ASR and as an indicator of emotion, which enables finer-grained capture of frame-level emotion dynamics.
- (5) MMER [35]: A multimodal MTL method. It models both textual and audio modalities through early fusion and cross-modal self-attention. In addition to the primary task of emotion classification, three auxiliary tasks (ASR and two contrastive learning-based tasks) are jointly optimized to improve the model’s recognition performance.
- (6) Self-attention CNN-BLSTM [25]: An end-to-end spectrogram-based emotion recognition method. Self-attention guides the model to focus on emotion-relevant segments, while an MTL framework leveraging the correlation between emotion and gender further improves recognition performance.
- (7) MS-SENet [36]: A multi-scale feature fusion network based on squeeze-and-excitation (SE) blocks. Using MFCC as input, multi-scale convolutions extract time–frequency features, which are reweighted by SE to enhance their effectiveness. Skip connections and spatial dropout layers are incorporated to prevent overfitting and increase network depth. Finally, TIM-Net [37] captures bidirectional emotion dependencies.
- (8) Wav2vec2.0 + MTL [26]: A Wav2vec2.0-based multi-task speech emotion recognition model, with SER as the primary task and ASR as an auxiliary task.
- (9) Co-attention [38]: A multi-level acoustic feature fusion network. From raw speech, it extracts MFCCs, spectrograms, and high-level self-supervised Wav2vec representations, processes each feature through a separate encoder, and performs the final fusion using a co-attention mechanism.
3.5. Ablation Study
3.5.1. Selection of Pre-Trained Acoustic Models
3.5.2. Effect of the GR Auxiliary Task on SER Performance
3.5.3. Effect of the SR Auxiliary Task on SER Performance
3.5.4. Effect of the ASR Auxiliary Task on SER Performance
3.5.5. Effect of Joint Training with Two Auxiliary Tasks on SER Performance
3.5.6. Performance Analysis of Joint Optimization with Three Auxiliary Tasks
3.5.7. Comparison of Fixed and Adaptive Loss Weighting Strategies
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Chhikara, P.; Singh, P.; Tekchandani, R.; Kumar, N.; Guizani, M. Federated Learning Meets Human Emotions: A Decentralized Framework for Human-Computer Interaction for IoT Applications. IEEE Internet Things J. 2021, 8, 6949–6962.
2. Zhao, Z.; Bao, Z.; Zhang, Z.; Deng, J.; Cummins, N.; Wang, H.; Tao, J.; Schuller, B. Automatic Assessment of Depression From Speech via a Hierarchical Attention Transfer Network and Attention Autoencoders. IEEE J. Sel. Top. Signal Process. 2020, 14, 423–434.
3. Miranda Calero, J.A.; Rituerto-González, E.; Luis-Mingueza, C.; Canabal, M.F.; Barcenas, A.R.; Lanza-Gutierrez, J.M.; Pelaez-Moreno, C.; Lopez-Ongil, C. Bindi: Affective Internet of Things to Combat Gender-Based Violence. IEEE Internet Things J. 2022, 9, 21174–21193.
4. Dehbozorgi, N.; Kunuku, M.T. Exploring the Influence of Emotional States in Peer Interactions on Students’ Academic Performance. IEEE Trans. Educ. 2024, 67, 405–412.
5. Al-Radhi, M.S.; Csapó, T.G.; Németh, G. Adaptive refinements of pitch tracking and HNR estimation within a vocoder for statistical parametric speech synthesis. Appl. Sci. 2019, 9, 2460.
6. Kwon, O.W.; Chan, K.; Hao, J.; Lee, T.-W. Emotion recognition by speech signals. In Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech 2003), Geneva, Switzerland, 1–4 September 2003; pp. 125–128.
7. Umamaheswari, J.; Akila, A. An Enhanced Human Speech Emotion Recognition Using Hybrid of PRNN and KNN. In Proceedings of the 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), Faridabad, India, 2019; IEEE: New York, NY, USA, 2019; pp. 177–183.
8. Koolagudi, S.G.; Murthy, Y.V.S.; Bhaskar, S.P. Choice of a classifier, based on properties of a dataset: Case study-speech emotion recognition. Int. J. Speech Technol. 2018, 21, 167–183.
9. Costantini, G.; Parada-Cabaleiro, E.; Casali, D.; Cesarini, V. The Emotion Probe: On the Universality of Cross-Linguistic and Cross-Gender Speech Emotion Recognition via Machine Learning. Sensors 2022, 22, 2461.
10. Huang, Z.; Dong, M.; Mao, Q.; Zhan, Y. Speech Emotion Recognition Using CNN. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 2014; ACM: New York, NY, USA, 2014; pp. 801–804.
11. Chen, M.; He, X.; Yang, J.; Zhang, H. 3-D Convolutional Recurrent Neural Networks with Attention Model for Speech Emotion Recognition. IEEE Signal Process. Lett. 2018, 25, 1440–1444.
12. Mustaqeem, N.; Kwon, S. A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition. Sensors 2019, 20, 183.
13. Li, D.; Yang, H.; Song, Z.; Wang, Z. MSMF-MIL: Multi-Scale Mixed Feature Based Multiple Instance Learning for Speech Emotion Recognition. IEEE Trans. Consum. Electron. 2025, 71, 7539–7550.
14. Shahin, I.; Hindawi, N.; Nassif, A.B.; Alhudhaif, A.; Polat, K. Novel dual-channel long short-term memory compressed capsule networks for emotion recognition. Expert Syst. Appl. 2022, 188, 116080.
15. Yang, Z.; Li, Z.; Zhou, S.; Zhang, L.; Serikawa, S. Speech emotion recognition based on multi-feature speed rate and LSTM. Neurocomputing 2024, 601, 128177.
16. Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 2019, 47, 312–323.
17. Liu, M.; Raj, A.N.J.; Rajangam, V.; Ma, K.; Zhuang, Z.; Zhuang, S. Multiscale-multichannel feature extraction and classification through one-dimensional convolutional neural network for Speech emotion recognition. Speech Commun. 2024, 156, 103010.
18. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised Pre-training for Speech Recognition. In Proceedings of Interspeech 2019, Graz, Austria, 2019; International Speech Communication Association: Grenoble, France, 2019; pp. 3465–3469.
19. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), 2020; Neural Information Processing Systems Foundation, Inc. (NeurIPS): San Diego, CA, USA, 2020; pp. 12449–12460.
20. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460.
21. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518.
22. Wagner, J.; Triantafyllopoulos, A.; Wierstorf, H.; Schmitt, M.; Burkhardt, F.; Eyben, F.; Schuller, B.W. Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10745–10759.
23. Chakhtouna, A.; Sekkate, S.; Adib, A. Unveiling embedded features in Wav2vec2 and HuBERT models for Speech Emotion Recognition. Procedia Comput. Sci. 2024, 232, 2560–2569.
24. Chen, L.W.; Rudnicky, A. Exploring Wav2vec 2.0 Fine Tuning for Improved Speech Emotion Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 2023; IEEE: New York, NY, USA, 2023; pp. 1–5.
25. Li, Y.; Zhao, T.; Kawahara, T. Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning. In Proceedings of Interspeech 2019, Graz, Austria, 2019; International Speech Communication Association: Grenoble, France, 2019; pp. 2803–2807.
26. Cai, X.; Yuan, J.; Zheng, R.; Huang, L.; Church, K. Speech Emotion Recognition with Multi-Task Learning. In Proceedings of Interspeech 2021, Brno, Czech Republic, 2021; International Speech Communication Association: Grenoble, France, 2021; pp. 4508–4512.
27. Lee, S.W. Diverse Feature Mapping and Fusion via Multitask Learning for Multilingual Speech Emotion Recognition. In Proceedings of Interspeech 2023, Dublin, Ireland, 2023; International Speech Communication Association: Grenoble, France, 2023; pp. 3944–3948.
28. Tzeng, J.T.; Leem, S.G.; Salman, A.N.; Lee, C.-C.; Busso, C. Noise-Robust Speech Emotion Recognition Using Shared Self-Supervised Representations with Integrated Speech Enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025; IEEE: New York, NY, USA, 2025; pp. 1–5.
29. Ryumina, E.; Axyonov, A.; Koryakovskaya, D.; Abdulkadirov, T.; Egorova, A.; Fedchin, S.; Zaburdaev, A.; Ryumin, D. SSL-MEPR: A Semi-Supervised Multi-Task Cross-Domain Learning Framework for Multimodal Emotion and Personality Recognition. Mach. Learn. Knowl. Extr. 2026, 8, 56.
30. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 369–376.
31. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359.
32. Li, X.H.; Liu, Z.T.; Zou, Y.J.; She, J.; Hirota, K. MCRVT: Multi-Hierarchical Cross-Reconstruction Networks With Versatile Transformer for Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2025, 16, 2189–2199.
33. Lan, D.; Cheng, H. MS-Swinformer and DMTL: Multi-scale spatial fusion and dynamic multi-task learning for speech emotion recognition. Comput. Speech Lang. 2026, 99, 101908.
34. Shen, S.; Gao, Y.; Liu, F.; Wang, H.; Zhou, A. Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 2024; IEEE: New York, NY, USA, 2024; pp. 10111–10115.
35. Ghosh, S.; Tyagi, U.; Ramaneswaran, S.; Srivastava, H.; Manocha, D. MMER: Multimodal Multi-task Learning for Speech Emotion Recognition. In Proceedings of Interspeech 2023, Dublin, Ireland, 2023; International Speech Communication Association: Grenoble, France, 2023; pp. 1209–1213.
36. Li, M.; Zheng, Y.; Li, D.; Wu, Y.; Wang, Y.; Fei, H. MS-SENet: Enhancing Speech Emotion Recognition Through Multi-Scale Feature Fusion with Squeeze-and-Excitation Blocks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 2024; IEEE: New York, NY, USA, 2024; pp. 12271–12275.
37. Ye, J.; Wen, X.C.; Wei, Y.; Xu, Y.; Liu, K.; Shan, H. Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach for Speech Emotion Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 2023; IEEE: New York, NY, USA, 2023; pp. 1–5.
38. Zou, H.; Si, Y.; Chen, C.; Rajan, D.; Chng, E.S. Speech Emotion Recognition with Co-Attention Based Multi-Level Acoustic Information. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022; IEEE: New York, NY, USA, 2022; pp. 7367–7371.

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | $10^{-5}$ |
| Batch Size | 8 |
| Training Epochs | 100 |

| Model | k-Fold | WA (%) | UA (%) | FLOPs (G) | Inference Time (ms) |
|---|---|---|---|---|---|
| MCRVT [32] | 5 | 73.02 | 71.57 | - | - |
| MS-Swinformer + DMTL [33] | 5 | 71.12 | 72.31 | - | - |
| FENT [34] | 5 | 71.84 | 72.37 | - | - |
| ENT [34] | 5 | 72.43 | 73.88 | - | - |
| MMER [35] | 5 | 81.20 | - | 138.77 | 16.01 |
| Self-attention CNN-BLSTM [25] | 5 | 81.60 | 82.80 | 26.10 | 9.62 |
| Ours | 5 | 83.24 ± 0.75 | 83.36 ± 0.73 | 35.44 | 5.81 |
| Co-attention [38] | 10 | 71.64 | 72.70 | 45.09 | 3.51 |
| MS-Swinformer + DMTL [33] | 10 | 72.68 | 73.45 | - | - |
| MS-SENet [36] | 10 | 73.38 | 73.67 | - | - |
| Wav2vec2.0 + MTL [26] | 10 | 78.15 | - | 17.31 | 5.52 |
| Ours | 10 | 83.86 ± 0.66 | 84.23 ± 0.63 | 35.44 | 5.81 |
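
WA and UA in the table above are read here as weighted accuracy (overall accuracy across all utterances) and unweighted accuracy (the mean of per-class recall), which is the usual convention in speech emotion recognition. A minimal sketch under that assumption follows; the function names and the toy labels are illustrative and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy, class-imbalanced example (hypothetical labels).
y_true = ["neu", "neu", "neu", "neu", "ang", "sad"]
y_pred = ["neu", "neu", "neu", "ang", "ang", "neu"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

On imbalanced data the two metrics diverge, because UA gives a rare class the same influence as a frequent one.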

| ID | α | β | γ | WA | UA |
|---|---|---|---|---|---|
| 1 | 0.2 | 0.4 | 0.1 | 0.8137 | 0.8228 |
| 2 | 0.2 | 0.4 | 0.2 | 0.8179 | 0.8215 |
| 3 | 0.2 | 0.6 | 0.1 | 0.8258 | 0.8364 |
| 4 | 0.2 | 0.6 | 0.2 | 0.8194 | 0.8220 |
| 5 | 0.2 | 0.8 | 0.1 | 0.8201 | 0.8214 |
| 6 | 0.2 | 0.8 | 0.2 | 0.8183 | 0.8189 |
| 7 | 0.4 | 0.4 | 0.1 | 0.8226 | 0.8302 |
| 8 | 0.4 | 0.4 | 0.2 | 0.8168 | 0.8273 |
| 9 | 0.4 | 0.6 | 0.1 | 0.8324 | 0.8336 |
| 10 | 0.4 | 0.6 | 0.2 | 0.8232 | 0.8324 |
| 11 | 0.4 | 0.8 | 0.1 | 0.8310 | 0.8321 |
| 12 | 0.4 | 0.8 | 0.2 | 0.8221 | 0.8294 |
| 13 | 0.6 | 0.4 | 0.1 | 0.8217 | 0.8221 |
| 14 | 0.6 | 0.4 | 0.2 | 0.8113 | 0.8198 |
| 15 | 0.6 | 0.6 | 0.1 | 0.8271 | 0.8281 |
| 16 | 0.6 | 0.6 | 0.2 | 0.8205 | 0.8222 |
| 17 | 0.6 | 0.8 | 0.1 | 0.8232 | 0.8247 |
| 18 | 0.6 | 0.8 | 0.2 | 0.8192 | 0.8216 |
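
The grid above can be read as a sweep over three fixed weights in a joint loss. A minimal sketch, assuming the joint loss has the form $L = L_{SER} + \alpha L_{GR} + \beta L_{SR} + \gamma L_{ASR}$. The unit weight on the SER loss, the mapping of α, β, γ to GR, SR, and ASR, and the per-task loss values are assumptions made for illustration and are not stated in the table.

```python
from itertools import product


def joint_loss(l_ser, l_gr, l_sr, l_asr, alpha, beta, gamma):
    """Assumed form: primary SER loss plus weighted auxiliary losses."""
    return l_ser + alpha * l_gr + beta * l_sr + gamma * l_asr


# Candidate values as they appear in the table: 3 x 3 x 2 = 18 settings,
# enumerated in the same order as the ID column.
grid = list(product([0.2, 0.4, 0.6], [0.4, 0.6, 0.8], [0.1, 0.2]))

# Setting ID 9 in the table: alpha = 0.4, beta = 0.6, gamma = 0.1.
alpha, beta, gamma = grid[8]

# Hypothetical per-task loss values for a single batch.
total = joint_loss(1.0, 0.5, 2.0, 3.0, alpha, beta, gamma)
print(f"alpha={alpha}, beta={beta}, gamma={gamma}, joint loss={total:.2f}")
```

In a training loop the four inputs would be the per-batch losses of the task heads, and `total` would be the quantity that is backpropagated.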

| SER | GR | SR | ASR | WA | UA |
|---|---|---|---|---|---|
| √ | | | | 0.7733 | 0.7871 |
| √ | √ | | | 0.7971 | 0.8102 |
| √ | | √ | | 0.7989 | 0.8104 |
| √ | | | √ | 0.8190 | 0.8206 |
| √ | √ | √ | | 0.8062 | 0.8143 |
| √ | √ | | √ | 0.8245 | 0.8174 |
| √ | | √ | √ | 0.8261 | 0.8287 |
| √ | √ | √ | √ | 0.8324 | 0.8336 |
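
The eight rows above cover SER alone plus every subset of the three auxiliary tasks (1 + 3 + 3 + 1 = 8 configurations). The following is a short sketch of how such an ablation schedule could be enumerated. The ordering it produces and the names `AUX_TASKS` and `ablation_configs` are illustrative assumptions, not the authors' code.

```python
from itertools import combinations

# Auxiliary tasks named in the table header (GR, SR, ASR).
AUX_TASKS = ("GR", "SR", "ASR")


def ablation_configs(primary="SER", aux=AUX_TASKS):
    """Return the primary task combined with every subset of auxiliary tasks."""
    configs = []
    for n in range(len(aux) + 1):
        for subset in combinations(aux, n):
            configs.append((primary,) + subset)
    return configs


configs = ablation_configs()
for tasks in configs:
    print(" + ".join(tasks))
```

Each tuple lists the task heads whose losses would be active in one training run.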

| Loss weighting strategy | WA | UA |
|---|---|---|
| Ours (fixed weights) | 0.8324 | 0.8336 |
| Uncertainty Weight | 0.8140 | 0.8255 |
| GradNorm | 0.8201 | 0.8220 |
| DWA | 0.8172 | 0.8193 |
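
Of the adaptive strategies listed above, DWA is assumed here to be Dynamic Weight Average, which sets each task weight from the ratio of that task's two most recent losses. A minimal sketch under that assumption follows; the temperature of 2.0 and the loss values are illustrative and are not the paper's settings.

```python
import math


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Dynamic Weight Average: softmax over loss ratios, scaled to sum to K."""
    # Ratio near or above 1 means the task's loss is falling slowly (or rising).
    ratios = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    k = len(ratios)
    norm = sum(exps)
    return [k * e / norm for e in exps]


# Hypothetical losses for four tasks at epochs t-1 and t-2.
prev = [0.9, 0.5, 0.7, 0.6]
prev_prev = [1.0, 1.0, 1.0, 1.0]

weights = dwa_weights(prev, prev_prev)
print([round(w, 4) for w in weights])
```

Tasks whose loss decreased the least receive the largest weight, and the weights always sum to the number of tasks. A fixed-weight scheme keeps the weights constant across epochs.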

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Wang, X.; Yao, K.; Yi, Y. Multi-Task Learning-Based Speech Emotion Recognition Using Pre-Trained Acoustic Model. Appl. Sci. 2026, 16, 5166. https://doi.org/10.3390/app16105166

