MSFL: Explainable Multitask-Based Shared Feature Learning for Multilingual Speech Emotion Recognition
Abstract
1. Introduction
- (1) For model generalization, the proposed model uses gradient-normalization-based MTL to jointly learn the tasks of emotion recognition and language recognition, where the gradient-normalization method dynamically adjusts the gradient magnitude of each task during training (a minimal sketch follows this list). With the LSTM-attention structure in MSFL, the two tasks learn features from different perspectives, which better regularizes the model and uncovers high-level discriminative MSFs.
- (2) For model interpretability, the validity and generalization of the model are explained in multilingual scenarios from the perspective of MSFs. The MSFs are ranked by the weights of the attention mechanism in MSFL, and the top-ranked features are compared with the monolingual features of the three datasets, revealing both differences and commonalities between the monolingual features and the MSFs.
- (3) This study provides both the technical idea and an interpretability perspective, from model building to feature analysis. In particular, the feature ranking and analysis lay a theoretical foundation for aggregating multi-language, multi-corpus data to alleviate data sparsity and advance research on multilingual SER.
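Contribution (1) relies on gradient normalization for loss balancing, in the spirit of GradNorm (Chen et al., 2018; see the references). The following is a minimal sketch of such a weight-update loss, assuming a learnable weight per task and access to the parameters of the last shared layer; the function and argument names (`initial_losses`, `shared_params`, `alpha`) are illustrative, not the authors' code.

```python
# GradNorm-style auxiliary loss for balancing task gradients (illustrative sketch).
import torch

def gradnorm_weight_loss(task_losses, initial_losses, task_weights, shared_params, alpha=1.5):
    """Auxiliary loss whose gradient updates the learnable per-task weights."""
    weighted = [w * L for w, L in zip(task_weights, task_losses)]
    # Gradient norm of each weighted task loss w.r.t. the last shared layer.
    g_norms = torch.stack([
        torch.cat([g.flatten() for g in torch.autograd.grad(
            wl, shared_params, retain_graph=True, create_graph=True)]).norm()
        for wl in weighted])
    g_avg = g_norms.mean().detach()
    # Relative inverse training rate: tasks that improved less get a larger target norm.
    ratios = torch.stack([(L / L0).detach() for L, L0 in zip(task_losses, initial_losses)])
    target = (g_avg * (ratios / ratios.mean()) ** alpha).detach()
    return torch.abs(g_norms - target).sum()
```

In a training loop, `task_weights` would be a learnable tensor initialized to ones; this auxiliary loss is backpropagated only into the task weights, which are then renormalized to sum to the number of tasks, while the main update uses the weighted sum of the task losses.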
2. Related Work
2.1. Deep Learning for Speech Emotion Recognition
2.2. Multi-Task Learning for Speech Emotion Recognition
3. Proposed Model
4. Experimental Setup
4.1. Corpora
4.1.1. EMO-DB
4.1.2. CASIA
4.1.3. SAVEE
4.2. Speech Features Extraction
4.3. Model Configuration
5. Experiments and Results
5.1. Experiment I: Different Weight Adjustment Methods for MSFL
5.2. Experiment II: Comparison with Different Models
5.3. Experiment III: Multilingual Shared Features Ranking and Analysis
- EMO-DB: HNR, Spectral Slope 0–500 Hz, Harmonic difference H1–H2, Hammarberg index, MFCC-3, F0 Pitch, and Formant-2 frequency;
- CASIA: Spectral Flux, Spectral Slope 0–500 Hz and 500–1500 Hz, Formant-2 bandwidth, Hammarberg index, MFCC-1, Formant-1 frequency, and Loudness;
- SAVEE: F0 Pitch, Spectral Slope 0–500 Hz and 500–1500 Hz, Formant-2 bandwidth, Spectral flux.
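The per-corpus lists above come from ranking features by attention weight. Below is a minimal, hypothetical sketch of such a ranking, assuming per-utterance attention weights over the 88 functionals (e.g., as returned by the architecture sketch in Section 4.3); the helper name and arguments are illustrative, not the paper's code.

```python
# Rank acoustic functionals by mean attention weight (illustrative helper).
import numpy as np

def rank_features(attention_weights, feature_names, top_k=10):
    """attention_weights: (n_utterances, n_features) array of attention weights."""
    mean_w = np.asarray(attention_weights).mean(axis=0)   # average weight per feature
    order = np.argsort(mean_w)[::-1]                       # descending by weight
    return [(feature_names[i], float(mean_w[i])) for i in order[:top_k]]
```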
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Dellaert, F.; Polzin, T.; Waibel, A. Recognizing Emotion in Speech. In Proceedings of the Fourth International Conference on Spoken Language Processing, ICSLP ’96, Philadelphia, PA, USA, 3–6 October 1996; Volume 3, pp. 1970–1973. [Google Scholar]
- Savchenko, A.V.; Savchenko, L.V.; Makarov, I. Classifying Emotions and Engagement in Online Learning Based on a Single Facial Expression Recognition Neural Network. IEEE Trans. Affect. Comput. 2022, 13, 2132–2143. [Google Scholar] [CrossRef]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
- Zhong, P.; Wang, D.; Miao, C. EEG-Based Emotion Recognition Using Regularized Graph Neural Networks. IEEE Trans. Affect. Comput. 2022, 13, 1290–1301. [Google Scholar] [CrossRef]
- Li, H.F.; Chen, J.; Ma, L.; Bo, H.J.; Xu, C.; Li, H.W. Dimensional Speech Emotion Recognition Review. Ruan Jian Xue Bao/J. Softw. 2020, 31, 2465–2491. (In Chinese) [Google Scholar]
- Kakuba, S.; Poulose, A.; Han, D.S. Attention-Based Multi-Learning Approach for Speech Emotion Recognition with Dilated Convolution. IEEE Access 2022, 10, 122302–122313. [Google Scholar] [CrossRef]
- Jiang, P.; Xu, X.; Tao, H.; Zhao, L.; Zou, C. Convolutional-Recurrent Neural Networks with Multiple Attention Mechanisms for Speech Emotion Recognition. IEEE Trans. Cogn. Dev. Syst. 2022, 14, 1564–1573. [Google Scholar] [CrossRef]
- Guo, L.; Wang, L.; Dang, J.; Chng, E.S.; Nakagawa, S. Learning Affective Representations Based on Magnitude and Dynamic Relative Phase Information for Speech Emotion Recognition. Speech Commun. 2022, 136, 118–127. [Google Scholar] [CrossRef]
- Vögel, H.-J.; Süß, C.; Hubregtsen, T.; Ghaderi, V.; Chadowitz, R.; André, E.; Cummins, N.; Schuller, B.; Härri, J.; Troncy, R.; et al. Emotion-Awareness for Intelligent Vehicle Assistants: A Research Agenda. In Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems, Gothenburg, Sweden, 28 May 2018; pp. 11–15. [Google Scholar]
- Tanko, D.; Dogan, S.; Burak Demir, F.; Baygin, M.; Engin Sahin, S.; Tuncer, T. Shoelace Pattern-Based Speech Emotion Recognition of the Lecturers in Distance Education: ShoePat23. Appl. Acoust. 2022, 190, 108637. [Google Scholar] [CrossRef]
- Huang, K.-Y.; Wu, C.-H.; Su, M.-H.; Kuo, Y.-T. Detecting Unipolar and Bipolar Depressive Disorders from Elicited Speech Responses Using Latent Affective Structure Model. IEEE Trans. Affect. Comput. 2020, 11, 393–404. [Google Scholar] [CrossRef]
- Merler, M.; Mac, K.-N.C.; Joshi, D.; Nguyen, Q.-B.; Hammer, S.; Kent, J.; Xiong, J.; Do, M.N.; Smith, J.R.; Feris, R.S. Automatic Curation of Sports Highlights Using Multimodal Excitement Features. IEEE Trans. Multimed. 2019, 21, 1147–1160. [Google Scholar] [CrossRef]
- Vogt, T.; André, E. Improving Automatic Emotion Recognition from Speech via Gender Differentiation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06); European Language Resources Association (ELRA): Genoa, Italy, 2006. [Google Scholar]
- Mill, A.; Allik, J.; Realo, A.; Valk, R. Age-Related Differences in Emotion Recognition Ability: A Cross-Sectional Study. Emotion 2009, 9, 619–630. [Google Scholar] [CrossRef] [Green Version]
- Latif, S.; Qayyum, A.; Usman, M.; Qadir, J. Cross Lingual Speech Emotion Recognition: Urdu vs. Western Languages. In Proceedings of the 2018 International Conference on Frontiers of Information Technology (FIT), Islamabad, Pakistan, 19 December 2018; pp. 88–93. [Google Scholar]
- Ding, N.; Sethu, V.; Epps, J.; Ambikairajah, E. Speaker Variability in Emotion Recognition—An Adaptation Based Approach. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 5101–5104. [Google Scholar]
- Feraru, S.M.; Schuller, D.; Schuller, B. Cross-Language Acoustic Emotion Recognition: An Overview and Some Tendencies. In Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), Xi’an, China, 21–24 September 2015; pp. 125–131. [Google Scholar]
- Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Trans. Affect. Comput. 2016, 7, 190–202. [Google Scholar] [CrossRef]
- Schuller, B.; Steidl, S.; Batliner, A.; Burkhardt, F.; Devillers, L.; Müller, C.; Narayanan, S.S. The INTERSPEECH 2010 Paralinguistic Challenge. In Proceedings of the Interspeech 2010, ISCA, Chiba, Japan, 26–30 September 2010; pp. 2794–2797. [Google Scholar]
- Qadri, S.A.A.; Gunawan, T.S.; Kartiwi, M.; Mansor, H.; Wani, T.M. Speech Emotion Recognition Using Feature Fusion of TEO and MFCC on Multilingual Databases. In Proceedings of the Recent Trends in Mechatronics Towards Industry 4.0; Ab. Nasir, A.F., Ibrahim, A.N., Ishak, I., Mat Yahya, N., Zakaria, M.A., Abdul Majeed, A.P.P., Eds.; Springer: Singapore, 2022; pp. 681–691. [Google Scholar]
- Origlia, A.; Galatà, V.; Ludusan, B. Automatic Classification of Emotions via Global and Local Prosodic Features on a Multilingual Emotional Database. In Proceedings of the Fifth International Conference Speech Prosody 2010, Chicago, IL, USA, 10–14 May 2010. [Google Scholar]
- Bandela, S.R.; Kumar, T.K. Stressed Speech Emotion Recognition Using Feature Fusion of Teager Energy Operator and MFCC. In Proceedings of the 2017 8th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Delhi, India, 3–5 July 2017; pp. 1–5. [Google Scholar]
- Rao, K.S.; Koolagudi, S.G. Robust Emotion Recognition Using Sentence, Word and Syllable Level Prosodic Features. In Robust Emotion Recognition Using Spectral and Prosodic Features; Rao, K.S., Koolagudi, S.G., Eds.; SpringerBriefs in Electrical and Computer Engineering; Springer: New York, NY, USA, 2013; pp. 47–69. ISBN 978-1-4614-6360-3. [Google Scholar]
- Araño, K.A.; Gloor, P.; Orsenigo, C.; Vercellis, C. When Old Meets New: Emotion Recognition from Speech Signals. Cogn. Comput. 2021, 13, 771–783. [Google Scholar] [CrossRef]
- Wang, C.; Ren, Y.; Zhang, N.; Cui, F.; Luo, S. Speech Emotion Recognition Based on Multi-feature and Multi-lingual Fusion. Multimed. Tools Appl. 2022, 81, 4897–4907. [Google Scholar] [CrossRef]
- Sun, L.; Chen, J.; Xie, K.; Gu, T. Deep and Shallow Features Fusion Based on Deep Convolutional Neural Network for Speech Emotion Recognition. Int. J. Speech Technol. 2018, 21, 931–940. [Google Scholar] [CrossRef]
- Yao, Z.; Wang, Z.; Liu, W.; Liu, Y.; Pan, J. Speech Emotion Recognition Using Fusion of Three Multi-Task Learning-Based Classifiers: HSF-DNN, MS-CNN and LLD-RNN. Speech Commun. 2020, 120, 11–19. [Google Scholar] [CrossRef]
- Al-onazi, B.B.; Nauman, M.A.; Jahangir, R.; Malik, M.M.; Alkhammash, E.H.; Elshewey, A.M. Transformer-Based Multilingual Speech Emotion Recognition Using Data Augmentation and Feature Fusion. Appl. Sci. 2022, 12, 9188. [Google Scholar] [CrossRef]
- Issa, D.; Fatih Demirci, M.; Yazici, A. Speech Emotion Recognition with Deep Convolutional Neural Networks. Biomed. Signal Process. Control. 2020, 59, 101894. [Google Scholar] [CrossRef]
- Li, X.; Akagi, M. Improving Multilingual Speech Emotion Recognition by Combining Acoustic Features in a Three-Layer Model. Speech Commun. 2019, 110, 1–12. [Google Scholar] [CrossRef]
- Heracleous, P.; Yoneyama, A. A Comprehensive Study on Bilingual and Multilingual Speech Emotion Recognition Using a Two-Pass Classification Scheme. PLoS ONE 2019, 14, e0220386. [Google Scholar] [CrossRef] [Green Version]
- Sagha, H.; Matějka, P.; Gavryukova, M.; Povolny, F.; Marchi, E.; Schuller, B. Enhancing Multilingual Recognition of Emotion in Speech by Language Identification. In Proceedings of the Interspeech 2016, ISCA, San Francisco, CA, USA, 8 September 2016; pp. 2949–2953. [Google Scholar]
- Bertero, D.; Kampman, O.; Fung, P. Towards Universal End-to-End Affect Recognition from Multilingual Speech by ConvNets. arXiv 2019, arXiv:1901.06486. [Google Scholar]
- Neumann, M.; Vu, N.T. Cross-Lingual and Multilingual Speech Emotion Recognition on English and French. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5769–5773. [Google Scholar]
- Zehra, W.; Javed, A.R.; Jalil, Z.; Khan, H.U.; Gadekallu, T.R. Cross Corpus Multi-Lingual Speech Emotion Recognition Using Ensemble Learning. Complex Intell. Syst. 2021, 7, 1845–1854. [Google Scholar] [CrossRef]
- Sultana, S.; Iqbal, M.Z.; Selim, M.R.; Rashid, M.M.; Rahman, M.S. Bangla Speech Emotion Recognition and Cross-Lingual Study Using Deep CNN and BLSTM Networks. IEEE Access 2022, 10, 564–578. [Google Scholar] [CrossRef]
- Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Schuller, B.W. Self Supervised Adversarial Domain Adaptation for Cross-Corpus and Cross-Language Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2022. [Google Scholar] [CrossRef]
- Tamulevičius, G.; Korvel, G.; Yayak, A.B.; Treigys, P.; Bernatavičienė, J.; Kostek, B. A Study of Cross-Linguistic Speech Emotion Recognition Based on 2D Feature Spaces. Electronics 2020, 9, 1725. [Google Scholar] [CrossRef]
- Fu, C.; Dissanayake, T.; Hosoda, K.; Maekawa, T.; Ishiguro, H. Similarity of Speech Emotion in Different Languages Revealed by a Neural Network with Attention. In Proceedings of the 2020 IEEE 14th International Conference on Semantic Computing (ICSC), San Diego, CA, USA, 3–5 February 2020; pp. 381–386. [Google Scholar]
- Caruana, R. Multitask Learning. Mach. Learn. 1997, 28, 41–75. [Google Scholar] [CrossRef]
- Lee, S. The Generalization Effect for Multilingual Speech Emotion Recognition across Heterogeneous Languages. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 5881–5885. [Google Scholar]
- Zhang, Y.; Liu, Y.; Weninger, F.; Schuller, B. Multi-Task Deep Neural Network with Shared Hidden Layers: Breaking down the Wall between Emotion Representations. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4990–4994. [Google Scholar]
- Sharma, M. Multi-Lingual Multi-Task Speech Emotion Recognition Using Wav2vec 2.0. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6907–6911. [Google Scholar]
- Gerczuk, M.; Amiriparian, S.; Ottl, S.; Schuller, B.W. EmoNet: A Transfer Learning Framework for Multi-Corpus Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2021. [Google Scholar] [CrossRef]
- Akçay, M.B.; Oğuz, K. Speech Emotion Recognition: Emotional Models, Databases, Features, Preprocessing Methods, Supporting Modalities, and Classifiers. Speech Commun. 2020, 116, 56–76. [Google Scholar] [CrossRef]
- Wang, W.; Cao, X.; Li, H.; Shen, L.; Feng, Y.; Watters, P. Improving Speech Emotion Recognition Based on Acoustic Words Emotion Dictionary. Nat. Lang. Eng. 2020, 27, 747–761. [Google Scholar] [CrossRef]
- Hsu, J.-H.; Su, M.-H.; Wu, C.-H.; Chen, Y.-H. Speech Emotion Recognition Considering Nonverbal Vocalization in Affective Conversations. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1675–1686. [Google Scholar] [CrossRef]
- Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Epps, J. Direct Modelling of Speech Emotion from Raw Speech. In Proceedings of the Interspeech 2019, ISCA, Graz, Austria, 15 September 2019; pp. 3920–3924. [Google Scholar]
- Wu, X.; Cao, Y.; Lu, H.; Liu, S.; Wang, D.; Wu, Z.; Liu, X.; Meng, H. Speech Emotion Recognition Using Sequential Capsule Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3280–3291. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Wang, J.; Xue, M.; Culhane, R.; Diao, E.; Ding, J.; Tarokh, V. Speech Emotion Recognition with Dual-Sequence LSTM Architecture. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6474–6478. [Google Scholar]
- Graves, A.; Jaitly, N.; Mohamed, A. Hybrid Speech Recognition with Deep Bidirectional LSTM. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, 8–12 December 2013; pp. 273–278. [Google Scholar]
- Wang, Y.; Zhang, X.; Lu, M.; Wang, H.; Choe, Y. Attention Augmentation with Multi-Residual in Bidirectional LSTM. Neurocomputing 2020, 385, 340–347. [Google Scholar] [CrossRef]
- Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic Speech Emotion Recognition Using Recurrent Neural Networks with Local Attention. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2227–2231. [Google Scholar]
- Hu, D.; Wei, L.; Huai, X. DialogueCRN: Contextual Reasoning Networks for Emotion Recognition in Conversations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 7042–7052. [Google Scholar]
- Zhang, Y.; Yang, Q. An Overview of Multi-Task Learning. Natl. Sci. Rev. 2018, 5, 30–43. [Google Scholar] [CrossRef] [Green Version]
- Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Qadir, J.; Schuller, B.W. Survey of Deep Representation Learning for Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2021. [Google Scholar] [CrossRef]
- Zhang, Z.; Wu, B.; Schuller, B. Attention-Augmented End-to-End Multi-Task Learning for Emotion Prediction from Speech. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6705–6709. [Google Scholar]
- Li, Y.; Zhao, T.; Kawahara, T. Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning. In Proceedings of the Interspeech 2019, ISCA, Graz, Austria, 15 September 2019; pp. 2803–2807. [Google Scholar]
- Fu, C.; Liu, C.; Ishi, C.T.; Ishiguro, H. An End-to-End Multitask Learning Model to Improve Speech Emotion Recognition. In Proceedings of the 2020 28th European Signal Processing Conference (EUSIPCO), Virtual, 18–21 January 2021; pp. 1–5. [Google Scholar]
- Li, X.; Lu, G.; Yan, J.; Zhang, Z. A Multi-Scale Multi-Task Learning Model for Continuous Dimensional Emotion Recognition from Audio. Electronics 2022, 11, 417. [Google Scholar] [CrossRef]
- Thung, K.-H.; Wee, C.-Y. A Brief Review on Multi-Task Learning. Multimed. Tools Appl. 2018, 77, 29705–29725. [Google Scholar] [CrossRef]
- Xia, R.; Liu, Y. A Multi-Task Learning Framework for Emotion Recognition Using 2D Continuous Space. IEEE Trans. Affect. Comput. 2017, 8, 3–14. [Google Scholar] [CrossRef]
- Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Epps, J.; Schuller, B.W. Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2022, 13, 992–1004. [Google Scholar] [CrossRef] [Green Version]
- Atmaja, B.T.; Akagi, M. Dimensional Speech Emotion Recognition from Speech Features and Word Embeddings by Using Multitask Learning. APSIPA Trans. Signal Inf. Process. 2020, 9, e17. [Google Scholar] [CrossRef]
- Kim, J.-W.; Park, H. Multi-Task Learning for Improved Recognition of Multiple Types of Acoustic Information. IEICE Trans. Inf. Syst. 2021, E104.D, 1762–1765. [Google Scholar] [CrossRef]
- Chen, Z.; Badrinarayanan, V.; Lee, C.-Y.; Rabinovich, A. GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 3 July 2018; pp. 794–803. [Google Scholar]
- Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A Database of German Emotional Speech. In Proceedings of the Interspeech 2005, ISCA, Lisbon, Portugal, 4–8 September 2005; pp. 1517–1520. [Google Scholar]
- Pan, S.; Tao, J.; Li, Y. The CASIA Audio Emotion Recognition Method for Audio/Visual Emotion Challenge 2011. In Proceedings of the Affective Computing and Intelligent Interaction; D’Mello, S., Graesser, A., Schuller, B., Martin, J.-C., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 388–395. [Google Scholar]
- Jackson, P.; Haq, S. Surrey Audio-Visual Expressed Emotion (SAVEE) Database; University of Surrey: Guildford, UK, 2014. [Google Scholar]
- El Ayadi, M.; Kamel, M.S.; Karray, F. Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases. Pattern Recognit. 2011, 44, 572–587. [Google Scholar] [CrossRef]
- Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu Features? End-to-End Speech Emotion Recognition Using a Deep Convolutional Recurrent Network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5200–5204. [Google Scholar]
- He, Y.; Feng, X.; Cheng, C.; Ji, G.; Guo, Y.; Caverlee, J. MetaBalance: Improving Multi-Task Recommendations via Adapting Gradient Magnitudes of Auxiliary Tasks. In Proceedings of the ACM Web Conference 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 2205–2215. [Google Scholar]
- Eyben, F.; Weninger, F.; Schuller, B. Affect Recognition in Real-Life Acoustic Conditions—A New Perspective on Feature Selection. In Proceedings of the Interspeech 2013, ISCA, Lyon, France, 25–29 August 2013; pp. 2044–2048. [Google Scholar]
- Ryumina, E.; Dresvyanskiy, D.; Karpov, A. In Search of a Robust Facial Expressions Recognition Model: A Large-Scale Visual Cross-Corpus Study. Neurocomputing 2022, 514, 435–450. [Google Scholar] [CrossRef]
- Antoniadis, P.; Filntisis, P.P.; Maragos, P. Exploiting Emotional Dependencies with Graph Convolutional Networks for Facial Expression Recognition; IEEE Computer Society: Washington, DC, USA, 2021; pp. 1–8. [Google Scholar]
- Kakuba, S.; Poulose, A.; Han, D.S. Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features. IEEE Access 2022, 10, 125538–125551. [Google Scholar] [CrossRef]
Corpus | Language | Size | Actors | Sampling Rate | Neu | Ang | Hap | Sad
---|---|---|---|---|---|---|---|---
EMO-DB | German | 339 | 10 (5M, 5F) | 16 kHz | 79 | 127 | 71 | 62
CASIA | Chinese | 800 | 4 (2M, 2F) | 22.05 kHz | 200 | 200 | 200 | 200
SAVEE | English | 300 | 4 (4M) | 44.1 kHz | 120 | 60 | 60 | 60
Parameter Group | Low-Level Descriptors (LLDs) | Functionals | Amount
---|---|---|---
Frequency | Pitch (F0) | (V): am, stddevNorm, 20th, 50th, and 80th percentiles, range of the 20th to 80th percentile, mean and standard deviation of the slope of rising/falling signal parts | 10
Frequency | Jitter | (V): am, stddevNorm | 2
Frequency | Formant 1, 2, and 3 frequency | (V): am, stddevNorm | 6
Frequency | Formant 1, 2, and 3 bandwidth | (V): am, stddevNorm | 6
Energy | Shimmer | (V): am, stddevNorm | 2
Energy | Loudness | am, stddevNorm, 20th, 50th, and 80th percentiles, range of the 20th to 80th percentile, mean and standard deviation of the slope of rising/falling signal parts | 10
Energy | Harmonics-to-Noise Ratio (HNR) | (V): am, stddevNorm | 2
Spectral | Alpha ratio | (V): am, stddevNorm; (UV): am | 3
Spectral | Hammarberg index | (V): am, stddevNorm; (UV): am | 3
Spectral | Spectral slope 0–500 Hz and 500–1500 Hz | (V): am, stddevNorm; (UV): am | 6
Spectral | Formant 1, 2, and 3 relative energy | (V): am, stddevNorm | 6
Spectral | Harmonic difference H1–H2 | (V): am, stddevNorm | 2
Spectral | Harmonic difference H1–A3 | (V): am, stddevNorm | 2
Spectral | MFCC 1–4 | am, stddevNorm; (V): am, stddevNorm | 16
Spectral | Spectral flux | am, stddevNorm; (V): am, stddevNorm; (UV): am | 5

Here am denotes the arithmetic mean, stddevNorm the standard deviation normalized by the mean, and (V)/(UV) functionals computed over voiced/unvoiced regions, respectively.
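These LLDs and functionals correspond to the GeMAPS/eGeMAPS parameter set; a minimal extraction sketch using the opensmile Python package is shown below. The eGeMAPSv02 configuration (88 functionals per utterance, consistent with the 88 attention nodes in Section 4.3) and the file path are assumptions, not details confirmed by the paper.

```python
# Extract utterance-level eGeMAPS functionals with openSMILE (assumed setup).
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,        # 88 functionals per utterance
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("example.wav")            # hypothetical path; 1 x 88 DataFrame
print(list(features.columns[:3]))                       # e.g., F0semitoneFrom27.5Hz_sma3nz_amean, ...
```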
Parameter | Value
---|---
Number of cells in LSTM | 32
Number of nodes in attention | 88
Number of nodes in FC1 | 512
Number of nodes in FC2 | 300
Number of nodes in softmax1 (emotion classes) | 4
Number of nodes in softmax2 (language classes) | 3
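For concreteness, the sketch below shows a shared LSTM-attention trunk with the layer sizes in the table and two task heads (4 emotions, 3 languages) in PyTorch. Treating the 88 functionals as a length-88 input sequence, using a unidirectional LSTM, and the ReLU activations are assumptions; this illustrates the configuration rather than reproducing the authors' implementation.

```python
# Illustrative MSFL-style shared trunk with emotion and language heads (assumed layout).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFLSketch(nn.Module):
    def __init__(self, lstm_cells=32, n_emotions=4, n_languages=3):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=lstm_cells, batch_first=True)
        self.attn_score = nn.Linear(lstm_cells, 1)        # one score per feature position
        self.fc1 = nn.Linear(lstm_cells, 512)
        self.fc2 = nn.Linear(512, 300)
        self.emotion_head = nn.Linear(300, n_emotions)    # softmax1: 4 emotion classes
        self.language_head = nn.Linear(300, n_languages)  # softmax2: 3 languages

    def forward(self, x):                                 # x: (batch, 88) functionals
        h, _ = self.lstm(x.unsqueeze(-1))                 # (batch, 88, 32)
        attn = torch.softmax(self.attn_score(h).squeeze(-1), dim=1)  # (batch, 88)
        context = torch.bmm(attn.unsqueeze(1), h).squeeze(1)         # (batch, 32)
        shared = F.relu(self.fc2(F.relu(self.fc1(context))))
        return self.emotion_head(shared), self.language_head(shared), attn

emo_logits, lang_logits, attn_weights = MSFLSketch()(torch.randn(8, 88))
```

The returned `attn_weights` are the per-feature attention weights that the feature-ranking analysis in Section 5.3 builds on.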
Purpose | Experiment | Description | Result
---|---|---|---
Model generalization | Experiment I | Verify the effect of the gradient-normalization method introduced into MTL on the model's generalization ability | The gradient-normalization method outperforms the other weight-adjustment methods
Model generalization | Experiment II | Compare the generalization ability of MSFL with that of baseline models | MSFL achieves better results than most models, with an average improvement of 3.37–4.49%
Model interpretability | Experiment III | Analyze and compare the MSFs with the monolingual features | The top 10 MSFs cover most of the top-ranked monolingual features
Model | Metric | EMO-DB | CASIA | SAVEE | Avg. (Emotion) | Language Recognition
---|---|---|---|---|---|---
E_MSFL | UAR | 82.58% | 82.20% | 76.67% | 80.77% | 99.20%
E_MSFL | ACC | 82.80% | 82.86% | 77.43% | 81.09% | 98.96%
M_MSFL | UAR | 83.86% | 83.08% | 75.75% | 80.89% | 99.10%
M_MSFL | ACC | 83.77% | 82.99% | 76.81% | 81.19% | 98.83%
G_MSFL | UAR | 85.24% | 84.44% | 75.33% | 81.66% | 99.35%
G_MSFL | ACC | 85.29% | 84.41% | 75.89% | 81.86% | 99.13%
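UAR here is the unweighted average recall (per-class recall averaged with equal class weight) and ACC the overall accuracy; the following minimal scikit-learn sketch on hypothetical labels shows how both are computed.

```python
# UAR = unweighted average recall; ACC = overall accuracy (hypothetical labels).
from sklearn.metrics import recall_score, accuracy_score

y_true = [0, 0, 1, 2, 3, 3]   # hypothetical emotion labels
y_pred = [0, 1, 1, 2, 3, 2]
uar = recall_score(y_true, y_pred, average="macro")  # mean of per-class recalls
acc = accuracy_score(y_true, y_pred)                 # fraction of correct predictions
print(f"UAR={uar:.2%}, ACC={acc:.2%}")
```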
Models | EMO-DB | CASIA | SAVEE | Average
---|---|---|---|---
MSFL_ST | 81.93% | 79.05% | 71.14% | 77.37%
MTL_DNN | 81.14% | 82.30% | 72.03% | 78.49%
MT-SHL-DNN (2019) [42] | 82.34% | - | - | -
CAbiLS (2020) [39] | 75.61% | 64.33% | - | -
Ensemble Model (2021) [35] | 89.75% | - | 69.31% | -
MSFL (proposed) | 85.29% | 84.41% | 75.89% | 81.86%
Ranking | Multilingual Shared Features (MSFs) | EMO-DB Features (EFs) | CASIA Features (CFs) | SAVEE Features (SFs)
---|---|---|---|---
1 | slopeV0-500_sma3nz_amean | HNRdBACF_sma3nz_amean | spectralFlux_sma3_stddevNorm | F0semitoneFrom27.5Hz_sma3nz_pctlrange0-2
2 | F0semitoneFrom27.5Hz_sma3nz_pctlrange0-2 | slopeV0-500_sma3nz_amean | slopeV500-1500_sma3nz_amean | F0semitoneFrom27.5Hz_sma3nz_amean
3 | slopeUV500-1500_sma3nz_amean | StddevUnvoicedSegmentLength | slopeV0-500_sma3nz_amean | slopeV500-1500_sma3nz_amean
4 | HNRdBACF_sma3nz_amean | logRelF0-H1-H2_sma3nz_amean | F2bandwidth_sma3nz_amean | F0semitoneFrom27.5Hz_sma3nz_percentile80.0
5 | spectralFlux_sma3_stddevNorm | equivalentSoundLevel_dBp | hammarbergIndexV_sma3nz_amean | F2bandwidth_sma3nz_stddevNorm
6 | loudness_sma3_percentile20.0 | hammarbergIndexV_sma3nz_amean | mfcc1V_sma3nz_amean | slopeV0-500_sma3nz_amean
7 | spectralFluxUV_sma3nz_amean | HNRdBACF_sma3nz_stddevNorm | slopeUV0-500_sma3nz_amean | F0semitoneFrom27.5Hz_sma3nz_meanRisingSlope
8 | loudnessPeaksPerSec | mfcc3_sma3_stddevNorm | F1frequency_sma3nz_stddevNorm | StddevUnvoicedSegmentLength
9 | loudness_sma3_stddevFallingSlope | F0semitoneFrom27.5Hz_sma3nz_pctlrange0-2 | loudness_sma3_percentile20.0 | spectralFluxV_sma3nz_stddevNorm
10 | F3bandwidth_sma3nz_amean | F2frequency_sma3nz_stddevNorm | hammarbergIndexV_sma3nz_stddevNorm | F0semitoneFrom27.5Hz_sma3nz_stddevRisingSlope