Review of Automatic Estimation of Emotions in Speech
Abstract
1. Introduction
2. Emotions
2.1. Theories of Emotion
2.2. Physical Measures of Emotion in Speech
2.3. Emotional Effects on Intonation
2.4. Emotional Effects on Voice Quality
2.5. Multi-Lingual Factors
2.6. Perception of Emotion from Speech
3. Acoustic Analysis of Speech
3.1. Basic Digital Signal Processing for SER
3.2. Basic Spectral Measures
3.3. More Advanced Spectral Envelope Measures
3.4. Time Windows for Frames in Speech Analysis
3.5. Intonation Features
4. Models to Use Measures of Analysis for Speech Emotion Classification
4.1. Embedding
4.2. Hidden Markov Models (HMMs)
4.3. Support Vector Machines (SVMs)
4.4. Other Approaches
5. Task Variability and Evaluation of SER
6. Datasets for SER
7. Neural (Machine Learning) Techniques for Emotion Classification of Speech
7.1. Basics of ANNs
7.2. Common Architectures for ANNs
7.2.1. Convolutional Neural Networks (CNNs)
7.2.2. Recurrent Neural Networks (RNNs)
7.2.3. Attention
7.2.4. Typical ANN Operation for SER
7.3. Supervised and Self-Supervised Methods
8. Discussion
9. Conclusions
Funding
Data Availability Statement
Conflicts of Interest
List of Acronyms
Acronym | Definition |
---|---|
ANN | artificial neural network |
AR | autoregressive |
ASR | automatic speech recognition |
ASV | automatic speaker verification |
CNN | convolutional neural network |
DFT | discrete Fourier transform |
E2E | end-to-end machine learning system |
F0 | fundamental frequency of vocal cord vibration |
FBEs | filter bank energies |
GMM | Gaussian mixture model |
HMM | hidden Markov model |
HNR | harmonics-to-noise ratio |
LLDs | low-level descriptors |
LPC | linear predictive coding |
MFCCs | mel frequency cepstral coefficients |
NAQ | normalized amplitude quotient |
RNN | recurrent neural network |
SER | speech emotion recognition |
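To make the signal-processing acronyms above concrete, the following sketch (an illustration added here, not part of the original article) extracts FBEs, MFCCs, and an F0 track from a single utterance. It assumes the open-source librosa library and a placeholder file name; the 25 ms windows with 10 ms hops at a 16 kHz sampling rate are common choices, not prescriptions from the review.

```python
# A minimal sketch (not from the article) mapping several acronyms above to
# concrete features via the open-source librosa library. The file name is a
# placeholder; 25 ms windows with 10 ms hops at 16 kHz are common choices.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)

# FBEs: log mel filter bank energies, one vector per 10 ms frame.
fbes = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
)

# MFCCs: cepstral coefficients derived from the log mel energies.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# F0: fundamental-frequency track from the pYIN tracker (NaN where unvoiced).
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr)

# LLDs like these are commonly summarized per utterance by statistics
# (means, variances, percentiles) before classification.
print(mfccs.shape, fbes.shape, np.nanmean(f0))
```

Utterance-level statistics of such low-level descriptors were the standard inputs to the GMM, HMM, and SVM classifiers of Section 4 before end-to-end ANNs became dominant.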
References
- Bond, C.F., Jr.; DePaulo, B.M. Accuracy of deception judgments. Personal. Soc. Psychol. Rev. 2006, 10, 214–234.
- Cowie, R.; Douglas-Cowie, E.; Tsapatsoulis, N.; Votsis, G.; Kollias, S.; Fellenz, W.; Taylor, J.G. Emotion recognition in human-computer interaction. IEEE Signal Process. Mag. 2001, 18, 32–80.
- Ververidis, D.; Kotropoulos, C. Emotional speech recognition: Resources, features, and methods. Speech Commun. 2006, 48, 1162–1181.
- El Ayadi, M.; Kamel, M.S.; Karray, F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognit. 2011, 44, 572–587.
- Koolagudi, S.G.; Rao, K.S. Emotion recognition from speech: A review. Int. J. Speech Technol. 2012, 15, 99–117.
- Khalil, R.A.; Jones, E.; Babar, M.I.; Jan, T.; Zafar, M.H.; Alhussain, T. Speech emotion recognition using deep learning techniques: A review. IEEE Access 2019, 7, 117327–117345.
- Schuller, B. Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends. Commun. ACM 2018, 61, 90–99.
- Scherer, K.R. Vocal communication of emotion: A review of research paradigms. Speech Commun. 2003, 40, 227–256.
- Wang, Y.; Song, W.; Tao, W.; Liotta, A.; Yang, D.; Li, X.; Gao, S.; Sun, Y.; Ge, W.; Zhang, W.; et al. A systematic review on affective computing: Emotion models, databases, and recent advances. Inf. Fusion 2022, 83, 19–52.
- Akçay, M.B.; Oğuz, K. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 2020, 116, 56–76.
- Jahangir, R.; Teh, Y.W.; Hanif, F.; Mujtaba, G. Deep learning approaches for speech emotion recognition: State of the art and research challenges. Multimed. Tools Appl. 2021, 80, 23745–23812.
- Mower, E.; Metallinou, A.; Lee, C.C.; Kazemzadeh, A.; Busso, C.; Lee, S.; Narayanan, S. Interpreting ambiguous emotional expressions. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction and Workshops, Amsterdam, The Netherlands, 10–12 September 2009; pp. 1–8.
- Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 2019, 47, 312–323.
- Devillers, L.; Vidrascu, L.; Lamel, L. Challenges in real-life emotion annotation and machine learning based detection. Neural Netw. 2005, 18, 407–422.
- Schuller, B.; Reiter, S.; Muller, R.; Al-Hames, M.; Lang, M.; Rigoll, G. Speaker independent speech emotion recognition by ensemble classification. In Proceedings of the IEEE International Conference on Multimedia and Expo, Amsterdam, The Netherlands, 6–8 July 2005; pp. 864–867.
- Antoniou, N.; Katsamanis, A.; Giannakopoulos, T.; Narayanan, S. Designing and Evaluating Speech Emotion Recognition Systems: A reality check case study with IEMOCAP. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5.
- Russell, J.A. A circumplex model of affect. J. Personal. Soc. Psychol. 1980, 39, 1161.
- Tracy, J.L.; Randles, D. Four models of basic emotions: A review of Ekman and Cordaro, Izard, Levenson, and Panksepp and Watt. Emot. Rev. 2011, 3, 397–405.
- Plutchik, R. The Emotions; University Press of America: Lanham, MD, USA, 1991.
- Ortony, A.; Clore, G.L.; Collins, A. The Cognitive Structure of Emotions; Cambridge University Press: Cambridge, UK, 2022.
- Leventhal, H. Toward a comprehensive theory of emotion. Adv. Exp. Soc. Psychol. 1980, 13, 139–207.
- Plutchik, R.; Kellerman, H. (Eds.) Theories of Emotion; Academic Press: Cambridge, MA, USA, 2013.
- Banse, R.; Scherer, K.R. Acoustic profiles in vocal emotion expression. J. Personal. Soc. Psychol. 1996, 70, 614.
- Williams, C.E.; Stevens, K.N. Emotions and speech: Some acoustical correlates. J. Acoust. Soc. Am. 1972, 52, 1238–1250.
- Jin, Q.; Li, C.; Chen, S.; Wu, H. Speech emotion recognition with acoustic and lexical features. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; IEEE: New York, NY, USA, 2015; pp. 4749–4753.
- Lee, C.M.; Yildirim, S.; Bulut, M.; Kazemzadeh, A.; Busso, C.; Deng, Z.; Lee, S.; Narayanan, S. Emotion recognition based on phoneme classes. In Proceedings of the International Conference on Spoken Language Processing, Jeju Island, Republic of Korea, 4–8 October 2004.
- Hsu, J.H.; Su, M.H.; Wu, C.H.; Chen, Y.H. Speech emotion recognition considering nonverbal vocalization in affective conversations. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1675–1686.
- Lian, Z.; Liu, B.; Tao, J. CTNet: Conversational transformer network for emotion recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 985–1000.
- Deng, J.; Ren, F. A survey of textual emotion recognition and its challenges. IEEE Trans. Affect. Comput. 2021, 14, 49–67.
- Xu, H.; Liu, B.; Shu, L.; Yu, P.S. BERT post-training for review reading comprehension and aspect-based sentiment analysis. arXiv 2019, arXiv:1904.02232, 2324–2335.
- Murray, I.R.; Arnott, J.L. Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. J. Acoust. Soc. Am. 1993, 93, 1097–1108.
- Batliner, A.; Fischer, K.; Huber, R.; Spilker, J.; Nöth, E. How to find trouble in communication. Speech Commun. 2003, 40, 117–143.
- Nwe, T.L.; Foo, S.W.; De Silva, L.C. Speech emotion recognition using hidden Markov models. Speech Commun. 2003, 41, 603–623.
- Burkhardt, F.; Sendlmeier, W.F. Verification of acoustical correlates of emotional speech using formant-synthesis. In Proceedings of the ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, Northern Ireland, UK, 5–7 September 2000.
- Suni, A.; Šimko, J.; Aalto, D.; Vainio, M. Hierarchical representation and estimation of prosody using continuous wavelet transform. Comput. Speech Lang. 2017, 45, 123–136.
- Spanias, A.S. Speech coding: A tutorial review. Proc. IEEE 1994, 82, 1541–1582.
- Valstar, M.; Gratch, J.; Schuller, B.; Ringeval, F.; Lalanne, D.; Torres Torres, M.; Scherer, S.; Stratou, G.; Cowie, R.; Pantic, M. AVEC 2016: Depression, mood, and emotion recognition workshop and challenge. In Proceedings of the International Workshop on Audio/Visual Emotion Challenge, Amsterdam, The Netherlands, 16 October 2016; pp. 3–10.
- Lim, N. Cultural differences in emotion: Differences in emotional arousal level between the East and the West. Integr. Med. Res. 2016, 5, 105–109.
- Castaldo, F.; Colibro, D.; Dalmasso, E.; Laface, P.; Vair, C. Compensation of nuisance factors for speaker and language recognition. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 1969–1978.
- Anagnostopoulos, C.N.; Iliou, T.; Giannoukos, I. Features and classifiers for emotion recognition from speech: A survey from 2000 to 2011. Artif. Intell. Rev. 2015, 43, 155–177.
- Lee, C.M.; Narayanan, S.S. Toward detecting emotions in spoken dialogs. IEEE Trans. Speech Audio Process. 2005, 13, 293–303.
- Lugger, M.; Yang, B. The relevance of voice quality features in speaker independent emotion recognition. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, HI, USA, 15–20 April 2007; IEEE: New York, NY, USA, 2007; pp. IV-17–IV-20.
- Sun, R.; Moore, E.; Torres, J.F. Investigating glottal parameters for differentiating emotional categories with similar prosodics. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 19–24 April 2009; IEEE: New York, NY, USA, 2009; pp. 4509–4512.
- Sundberg, J.; Patel, S.; Bjorkner, E.; Scherer, K.R. Interdependencies among voice source parameters in emotional speech. IEEE Trans. Affect. Comput. 2011, 2, 162–174.
- Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366.
- Makhoul, J. Spectral analysis of speech by linear prediction. IEEE Trans. Audio Electroacoust. 1973, 21, 140–148.
- Yu, D.; Deng, L. Automatic Speech Recognition; Springer: Berlin/Heidelberg, Germany, 2016.
- Busso, C.; Deng, Z.; Yildirim, S.; Bulut, M.; Lee, C.M.; Kazemzadeh, A.; Lee, S.; Neumann, U.; Narayanan, S. Analysis of emotion recognition using facial expressions, speech and multimodal information. In Proceedings of the International Conference on Multimodal Interfaces, State College, PA, USA, 13–15 October 2004; pp. 205–211.
- Dellaert, F.; Polzin, T.; Waibel, A. Recognizing emotion in speech. ICSLP 1996, 3, 1970–1973.
- Rabiner, L.; Cheng, M.; Rosenberg, A.; McGonegal, C. A comparative performance study of several pitch detection algorithms. IEEE Trans. Acoust. Speech Signal Process. 1976, 24, 399–418.
- Wang, K.; An, N.; Li, B.N.; Zhang, Y.; Li, L. Speech emotion recognition using Fourier parameters. IEEE Trans. Affect. Comput. 2015, 6, 69–75.
- de Lope, J.; Graña, M. An ongoing review of speech emotion recognition. Neurocomputing 2023, 528, 1–11.
- Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Schuller, B. Self supervised adversarial domain adaptation for cross-corpus and cross-language speech emotion recognition. IEEE Trans. Affect. Comput. 2022, 14, 1912–1926.
- Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; IEEE: New York, NY, USA, 2016; pp. 5200–5204.
- Yenigalla, P.; Kumar, A.; Tripathi, S.; Singh, C.; Kar, S.; Vepa, J. Speech Emotion Recognition Using Spectrogram & Phoneme Embedding. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3688–3692.
- Pepino, L.; Riera, P.; Ferrer, L. Emotion recognition from speech using wav2vec 2.0 embeddings. arXiv 2021, arXiv:2104.03502, 3400–3404.
- Schuller, B.; Rigoll, G.; Lang, M. Hidden Markov model-based speech emotion recognition. In Proceedings of the 2003 International Conference on Multimedia and Expo. ICME ’03. Proceedings (Cat. No.03TH8698), Baltimore, MD, USA, 6–9 July 2003; IEEE: New York, NY, USA, 2003; p. II-1.
- Lin, Y.L.; Wei, G. Speech emotion recognition based on HMM and SVM. Int. Conf. Mach. Learn. Cybern. 2005, 4898–4901.
- Schölkopf, B.; Smola, A. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, USA, 2002.
- Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
- Zong, Y.; Zheng, W.; Zhang, T.; Huang, X. Cross-corpus speech emotion recognition based on domain-adaptive least-squares regression. IEEE Signal Process. Lett. 2016, 23, 585–589.
- Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 2015, 7, 190–202.
- Eyben, F.; Wöllmer, M.; Schuller, B. OpenSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the ACM International Conference on Multimedia, Firenze, Italy, 29 October 2010; pp. 1459–1462.
- Leem, S.G.; Fulford, D.; Onnela, J.P.; Gard, D.; Busso, C. Not all features are equal: Selection of robust features for speech emotion recognition in noisy environments. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 6447–6451.
- Hautamäki, R.G.; Kinnunen, T.; Hautamäki, V.; Laukkanen, A.M. Automatic versus human speaker verification: The case of voice mimicry. Speech Commun. 2015, 72, 13–31.
- Pappagari, R.; Villalba, J.; Żelasko, P.; Moro-Velazquez, L.; Dehak, N. Copypaste: An augmentation method for speech emotion recognition. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 6324–6328.
- Wang, J.; Xue, M.; Culhane, R.; Diao, E.; Ding, J.; Tarokh, V. Speech emotion recognition with dual-sequence LSTM architecture. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: New York, NY, USA, 2020; pp. 6474–6478.
- Lin, L.I.-K. A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989, 45, 255–268.
- Schuller, B.; Steidl, S.; Batliner, A. The Interspeech 2009 emotion challenge. In Proceedings of the Interspeech 2009, Brighton, UK, 6–10 September 2009; pp. 312–315.
- Kim, J.; Englebienne, G.; Truong, K.P.; Evers, V. Towards speech emotion recognition “in the wild” using aggregated corpora and deep multi-task learning. arXiv 2017, arXiv:1708.03920.
- Zhou, Y.; Liang, X.; Gu, Y.; Yin, Y.; Yao, L. Multi-classifier interactive learning for ambiguous speech emotion recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 695–705.
- Lotfian, R.; Busso, C. Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Trans. Affect. Comput. 2017, 10, 471–483.
- Busso, C.; Bulut, M.; Lee, C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.; Lee, S.; Narayanan, S. IEMOCAP: Interactive emotional dyadic motion capture database. J. Lang. Resour. Eval. 2008, 42, 335–359.
- Busso, C.; Parthasarathy, S.; Burmania, A.; AbdelWahab, M.; Sadoughi, N.; Provost, E.M. MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception. IEEE Trans. Affect. Comput. 2016, 8, 67–80.
- Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R. CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 2014, 5, 377–390.
- Costantini, G.; Iaderola, I.; Paoloni, A.; Todisco, M. EMOVO corpus: An Italian emotional speech database. In Proceedings of the International Conference on Language Resources and Evaluation, Reykjavik, Iceland, 26–31 May 2014; pp. 3501–3504.
- Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the Interspeech 2005, Lisbon, Portugal, 4–8 September 2005; pp. 1517–1520.
- Martin, O.; Kotsia, I.; Macq, B.; Pitas, I. The eNTERFACE’05 audio-visual emotion database. In Proceedings of the International Conference on Data Engineering Workshops, Atlanta, GA, USA, 3–7 April 2006; p. 8.
- Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391.
- Grimm, M.; Kroschel, K.; Narayanan, S. The Vera am Mittag German audio-visual emotional speech database. In Proceedings of the IEEE International Conference on Multimedia and Expo, Hannover, Germany, 23–26 June 2008; pp. 865–868.
- Wu, T.; Yang, Y.; Wu, Z.; Li, D. MASC: A speech corpus in Mandarin for emotion analysis and affective speaker recognition. In Proceedings of the IEEE Odyssey—The Speaker and Language Recognition Workshop, San Juan, Puerto Rico, 28–30 June 2006; pp. 1–5.
- Tickle, A. English and Japanese speakers’ emotion vocalisation and recognition: A comparison highlighting vowel quality. In Proceedings of the ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, Northern Ireland, UK, 5–7 September 2000.
- Batliner, A.; Steidl, S.; Nöth, E. Releasing a thoroughly annotated and processed spontaneous emotional database: The FAU Aibo Emotion Corpus. In Proceedings of the Satellite Workshop of LREC 2008 on Corpora for Research on Emotion and Affect, Marrakech, Morocco, 26 May 2008.
- Garcia-Cuesta, E.; Salvador, A.B.; Páez, D.G. EmoMatchSpanishDB: Study of speech emotion recognition machine learning models in a new Spanish elicited database. Multimed. Tools Appl. 2024, 83, 13093–13112.
- Makarova, V.; Petrushin, V.A. RUSLANA: A database of Russian emotional utterances. In Proceedings of the 7th International Conference on Spoken Language Processing, Denver, CO, USA, 16–20 September 2002; pp. 2041–2044.
- Parry, J.; Palaz, D.; Clarke, G.; Lecomte, P.; Mead, R.; Berger, M.; Hofer, G. Analysis of Deep Learning Architectures for Cross-Corpus Speech Emotion Recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1656–1660.
- Schuller, B.; Vlasenko, B.; Eyben, F.; Wöllmer, M.; Stuhlsatz, A.; Wendemuth, A.; Rigoll, G. Cross-corpus acoustic emotion recognition: Variances and strategies. IEEE Trans. Affect. Comput. 2010, 1, 119–131.
- Cai, X.; Yuan, J.; Zheng, R.; Huang, L.; Church, K. Speech emotion recognition with multi-task learning. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 4508–4512.
- Chen, L.W.; Rudnicky, A. Exploring wav2vec 2.0 fine tuning for improved speech emotion recognition. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5.
- Wagner, J.; Triantafyllopoulos, A.; Wierstorf, H.; Schmitt, M.; Burkhardt, F.; Eyben, F.; Schuller, B.W. Dawn of the transformer era in speech emotion recognition: Closing the valence gap. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10745–10759.
- Feng, T.; Narayanan, S. PEFT-SER: On the use of parameter efficient transfer learning approaches for speech emotion recognition using pre-trained speech models. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction, Cambridge, MA, USA, 10–13 September 2023; pp. 1–8.
- Gerczuk, M.; Amiriparian, S.; Ottl, S.; Schuller, B.W. EmoNet: A transfer learning framework for multi-corpus speech emotion recognition. IEEE Trans. Affect. Comput. 2021, 14, 1472–1487.
- He, S.; Zheng, X.; Zeng, D.; Luo, C.; Zhang, Z. Exploring entrainment patterns of human emotion in social media. PLoS ONE 2016, 11, e0150630.
- Han, K.; Yu, D.; Tashev, I. Speech emotion recognition using deep neural network and extreme learning machine. In Proceedings of the Interspeech 2014, Singapore, 14–18 September 2014; pp. 223–227.
- Satt, A.; Rozenberg, S.; Hoory, R. Efficient emotion recognition from speech using deep learning on spectrograms. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 1089–1093.
- Chen, L.; Mao, X.; Xue, Y.; Cheng, L.L. Speech emotion recognition: Features and classification models. Digit. Signal Process. 2012, 22, 1154–1160.
- Fayek, H.M.; Lech, M.; Cavedon, L. Evaluating deep learning architectures for speech emotion recognition. Neural Netw. 2017, 92, 60–68.
- Bartz, C.; Herold, T.; Yang, H.; Meinel, C. Language identification using deep convolutional recurrent neural networks. In Proceedings of the Neural Information Processing Conference, Long Beach, CA, USA, 4–9 December 2017; pp. 880–889.
- Issa, D.; Demirci, M.F.; Yazici, A. Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process. Control 2020, 59, 101894.
- Aftab, A.; Morsali, A.; Ghaemmaghami, S.; Champagne, B. LIGHT-SERNET: A lightweight fully convolutional neural network for speech emotion recognition. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 6912–6916.
- Aldeneh, Z.; Provost, E.M. Using regional saliency for speech emotion recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; IEEE: New York, NY, USA, 2017; pp. 2741–2745.
- Li, Y.; Zhao, T.; Kawahara, T. Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2803–2807.
- Majumder, N.; Poria, S.; Hazarika, D.; Mihalcea, R.; Gelbukh, A.; Cambria, E. DialogueRNN: An attentive RNN for emotion detection in conversations. AAAI Conf. Artif. Intell. 2019, 33, 6818–6825.
- Zazo, R.; Lozano-Diez, A.; Gonzalez-Dominguez, J.; Toledano, D.T.; Gonzalez-Rodriguez, J. Language identification in short utterances using long short-term memory (LSTM) recurrent neural networks. PLoS ONE 2016, 11, e0146917.
- Lee, J.; Tashev, I. High-level feature representation using recurrent neural network for speech emotion recognition. In Proceedings of the Interspeech 2015, Dresden, Germany, 6–10 September 2015.
- Rajamani, S.T.; Rajamani, K.T.; Mallol-Ragolta, A.; Liu, S.; Schuller, B. A novel attention-based gated recurrent unit and its efficacy in speech emotion recognition. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 6294–6298.
- Peng, Z.; Lu, Y.; Pan, S.; Liu, Y. Efficient speech emotion recognition using multi-scale CNN and attention. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 3020–3024.
- Liu, J.; Liu, Z.; Wang, L.; Guo, L.; Dang, J. Speech emotion recognition with local-global aware deep representation learning. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: New York, NY, USA, 2020; pp. 7174–7178.
- Xu, M.; Zhang, F.; Cui, X.; Zhang, W. Speech emotion recognition with multiscale area attention and data augmentation. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 6319–6323.
- Zhu, W.; Li, X. Speech emotion recognition with global-aware fusion on multi-scale feature representation. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 6437–6441.
- Cai, W.; Cai, D.; Huang, S.; Li, M. Utterance-level end-to-end language identification using attention-based CNN-BLSTM. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: New York, NY, USA, 2019; pp. 5991–5995.
- Zhao, S.; Gholaminejad, A.; Ding, G.; Gao, Y.; Han, J.; Keutzer, K. Personalized emotion recognition by personality-aware high-order learning of physiological signals. ACM Trans. Multimed. Comput. Commun. Appl. 2019, 15, 1–18.
- Tarantino, L.; Garner, P.N.; Lazaridis, A. Self-Attention for Speech Emotion Recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2578–2582.
- Zou, H.; Si, Y.; Chen, C.; Rajan, D.; Chng, E.S. Speech emotion recognition with co-attention based multi-level acoustic information. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 7367–7371.
- Morais, E.; Hoory, R.; Zhu, W.; Gat, I.; Damasceno, M.; Aronowitz, H. Speech emotion recognition using self-supervised features. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 6922–6926.
- Villalba, J.; Brümmer, N.; Dehak, N. End-to-end versus embedding neural networks for language recognition in mismatched conditions. In Proceedings of the Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, 26–29 June 2018.
- Chatziagapi, A.; Paraskevopoulos, G.; Sgouropoulos, D.; Pantazopoulos, G.; Nikandrou, M.; Giannakopoulos, T.; Katsamanis, A.; Potamianos, A.; Narayanan, S. Data Augmentation Using GANs for Speech Emotion Recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 171–175.
- Tiwari, U.; Soni, M.; Chakraborty, R.; Panda, A.; Kopparapu, S.K. Multi-conditioning and data augmentation using generative noise model for speech emotion recognition in noisy conditions. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: New York, NY, USA, 2020; pp. 7194–7198.
- Chang, J.; Scherer, S. Learning representations of emotional speech with deep convolutional generative adversarial networks. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; IEEE: New York, NY, USA, 2017; pp. 2746–2750.
- Su, B.H.; Lee, C.C. A conditional cycle emotion GAN for cross-corpus speech emotion recognition. In Proceedings of the IEEE Spoken Language Technology Workshop, Shenzhen, China, 19–22 January 2021; pp. 351–357.
- Malik, I.; Latif, S.; Jurdak, R.; Schuller, B. A preliminary study on augmenting speech emotion recognition using a diffusion model. In Proceedings of the Interspeech 2023, Dublin, Ireland, 20–24 August 2023.
- Peng, L.; Zhang, Z.; Pang, T.; Han, J.; Zhao, H.; Chen, H.; Schuller, B.W. Customising General Large Language Models for Specialised Emotion Recognition Tasks. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: New York, NY, USA, 2024; pp. 11326–11330.
- Ling, S.; Salazar, J.; Liu, Y.; Kirchhoff, K. BERTPHONE: Phonetically-aware encoder representations for utterance-level speaker and language recognition. In Proceedings of the Speaker and Language Recognition Workshop (Odyssey 2020), Tokyo, Japan, 1–5 November 2020; pp. 9–16.
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460.
- Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518.
- Fan, Z.; Li, M.; Zhou, S.; Xu, B. Exploring wav2vec 2.0 on speaker verification and language identification. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 1509–1513.
- Latif, S.; Rana, R.; Younis, S.; Qadir, J.; Epps, J. Transfer learning for improving speech emotion classification accuracy. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 257–261.
- Wu, W.; Zhang, C.; Woodland, P.C. Emotion recognition by fusing time synchronous and time asynchronous representations. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 6269–6273.
- Chen, W.; Xing, X.; Xu, X.; Yang, J.; Pang, J. Key-sparse transformer for multimodal speech emotion recognition. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 6897–6901.
Name | Language | Categories | Comments | Citation |
---|---|---|---|---|
MSP-PODCAST | English | 8 | 100 h | [72] |
IEMOCAP | English | 9 | 12 h | [16,73] |
MSP-IMPROV | English | 4 | 9 h | [74] |
CREMA-D | English | 6 | 7442 clips | [75] |
EMOVO | Italian | 6 | 6 actors, 14 sentences | [76] |
EmoDB | German | 7 | 10 actors, 10 sentences | [77] |
eNTERFACE | English | 6 | 42 subjects | [78] |
RAVDESS | English | 7 | 24 actors | [79] |
Vera am Mittag | German | Large range | 12 h; spontaneous | [80] |
MASC | Mandarin | 5 | 68 actors | [81] |
— | Japanese/English | 5 | 90 sentences | [82] |
FAU Aibo Emotion Corpus | German | 12 | spontaneous from children; 9 h | [83] |
EmoMatchSpanishDB | Spanish | 6 | 50 subjects | [84] |
RUSLANA | Russian | 6 | 61 subjects | [85] |
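As a usage illustration for corpora like those above, the sketch below (an added example, not part of the review) builds (file, emotion, speaker) records from RAVDESS, whose filenames encode the emotion and actor. The field positions and emotion codes follow the dataset's public documentation and should be checked against its README. The sketch also shows a speaker-independent split, since testing on speakers seen in training inflates SER scores.

```python
# A hedged sketch (not from the article) that turns one corpus above, RAVDESS,
# into (file, emotion, speaker) records. The filename fields and emotion codes
# follow the dataset's public documentation; verify against its README.
from pathlib import Path

EMOTION_CODES = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess(path: Path) -> dict:
    # Filenames look like 03-01-05-01-02-01-12.wav:
    # modality-channel-emotion-intensity-statement-repetition-actor.
    fields = path.stem.split("-")
    return {"file": str(path), "emotion": EMOTION_CODES[fields[2]], "actor": int(fields[6])}

items = [parse_ravdess(p) for p in Path("RAVDESS").rglob("*.wav")]

# Speaker-independent split: hold out whole actors, since testing on speakers
# seen in training inflates SER accuracy.
test_actors = {21, 22, 23, 24}
train = [r for r in items if r["actor"] not in test_actors]
test = [r for r in items if r["actor"] in test_actors]
```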
Architecture | Pluses | Minuses |
---|---|---|
Fully connected feedforward (FFNN) | Simple structure | Excessive number of parameters |
Convolutional neural network (CNN) | Smooths local data | Only operates locally |
Recurrent neural network (RNN) | Exploits distant data | Many feedback parameters |
Attention | Emphasis on correlations | Ignores time ordering |
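The trade-offs in this table can be read directly from a small model. The PyTorch sketch below (illustrative only; the layer sizes and four-class output are assumptions, not an architecture from the review) stacks the ingredients in the order many SER systems use them: convolutions smooth local frame patterns, self-attention correlates distant frames (and, lacking positional encodings here, ignores time ordering), and a small feedforward head classifies.

```python
# A minimal PyTorch sketch (an illustration, not an architecture from the
# review) combining the rows of the table above. All sizes are assumptions.
import torch
import torch.nn as nn

class TinySER(nn.Module):
    def __init__(self, n_feats: int = 40, n_classes: int = 4, d_model: int = 64):
        super().__init__()
        # CNN: local smoothing/feature extraction along the time axis.
        self.conv = nn.Sequential(
            nn.Conv1d(n_feats, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Attention: weighs correlations between all frame pairs; without
        # positional encodings it is blind to time ordering.
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_feats), e.g. per-frame FBEs or MFCCs.
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, d_model)
        h, _ = self.attn(h, h, h)
        return self.head(h.mean(dim=1))  # mean-pool over frames, then classify

logits = TinySER()(torch.randn(2, 300, 40))  # two utterances, 300 frames each
```

An RNN layer could replace or precede the attention block to restore sensitivity to ordering, at the cost of the feedback parameters noted in the table.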