Review

Review of Automatic Estimation of Emotions in Speech

by
Douglas O’Shaughnessy
Telecommunication Department, INRS, University of Quebec, 800 de la Gauchetiere West, Montreal, QC H5A 1K6, Canada
Appl. Sci. 2025, 15(10), 5731; https://doi.org/10.3390/app15105731
Submission received: 23 March 2025 / Revised: 13 May 2025 / Accepted: 14 May 2025 / Published: 20 May 2025
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)

Abstract

Identification of emotions exhibited in utterances is useful for many applications, e.g., assisting with handling telephone calls or psychological diagnoses. This paper reviews methods to identify emotions from speech signals. We examine the information in speech that helps to estimate emotion, from the points of view of both production and perception. As machine approaches to recognize emotion in speech often have much in common with other speech tasks, such as automatic speaker verification and speech recognition, we compare these processes. Many methods of emotion recognition derive from research on pattern recognition in other areas, e.g., image and text recognition, especially recent machine learning methods. We show that speech is very different from most other signals that can be recognized, and that emotion identification differs from other speech applications. This review is primarily aimed at non-experts (more algorithmic detail is present in the cited literature), but it also contains much discussion of interest to experts.

1. Introduction

This paper reviews algorithms to estimate emotions that are exhibited in spoken utterances. Applications include remote personal assistance and medical diagnoses. To motivate some of the choices for emotion estimation by machine algorithms, we also discuss ways that speakers display emotions and how listeners judge emotions in speech.
Diverse information is present in speech signals. By speaking, a person communicates information (ideas and messages) to listeners. In addition, from speech, a listener can attempt to identify who is talking, what language is spoken, and the emotional state of the speaker. Emotions are the focus of this review, i.e., automatic spoken emotion recognition (SER). This work does not examine estimation of intentional speaker deception [1], which relates to lie detection and differs from emotion issues, as liars tend to hide emotion in their speech. While emotion identification can benefit from cues beyond utterances, e.g., visual and textual multi-modal analysis, we restrict the scope here to audio, to simplify the discussion.
This current work goes beyond earlier reviews [2,3,4,5,6,7,8,9,10,11], and examines many concepts about SER, especially recent neural-network approaches. It compares machine and human SER, and notes how one may improve accuracy with better features for neural methods. Unlike some other recent reviews, it does not exhaustively list recent papers, but focuses on SER concepts and their motivation.
Among the major applications for speech signals, SER may be the most difficult and complex task, as definitions of emotions can be broad, the range of emotions is large, and the acoustic effects of emotion on speech extend over wide temporal and spectral ranges and in complex ways. In speech, the effects of emotion vary temporally, are expressed and perceived over multiple modalities, are often inherently ambiguous, and may vary within a dialog [12]. As evidence of the difficulty of SER, both humans and machines have SER accuracy rates that are below performance levels in most other speech applications. Differences in the ways people express emotion may require “personalized” SER, i.e., techniques adapted to individuals [13].
We define emotional concepts in Section 2, first noting various theories and categories for emotion, then describing physical measures of emotion in speech; thereafter, we have sub-sections on the effects of emotion on two major aspects of speech: intonation and voice quality. Vocal tract (VT) shape is a prime determiner of phonetic information in speech, and the focus of automatic speech recognition (ASR), but is likely not a major component of emotion; thus we focus on intonation and voice quality, which can be heavily influenced by emotion. We examine how human listeners perceive emotion from speech, as SER can emulate this behavior.
As VT shape affects speech spectra, which are heavily used in SER, we examine the major methods for speech analysis in Section 3. This longer section describes how computers can process speech to extract acoustic features that can be useful for speech applications. We start with simple energy and spectra, and examine the traditional methods found in most of speech processing: linear predictive coding (LPC), mel frequency cepstral coefficients (MFCCs), filter banks, and time windows. We also delve more into the components of intonation analysis: durations, pitch, and amplitude.
Section 4 examines various mathematical models that have been applied to SER: hidden Markov models (HMMs), embeddings, Support Vector Machines (SVMs), and Low-Level-Descriptors (LLDs). Section 5 notes the great variability among SER tasks, and how to evaluate SER performance. Section 6 describes many of the datasets that are commonly used for SER.
The extensive Section 7 describes the operation of artificial neural network (ANN) techniques, which are now in standard use for most SER. The review here of ANNs is not intended to be comprehensive, but to introduce basic ideas about ANNs and their use for SER. The first sub-section describes the fundamental units and architecture of ANNs. Then, we examine the most common variants: convolutional neural networks (CNNs), recurrent neural networks (RNNs), attention, and Transformer. Examples of state-of-the-art SER are given. We end this section with the important aspect of supervised and unsupervised training.
A discussion ensues in Section 8, and we conclude the paper with suggestions to improve SER, focusing on the better use of acoustic features.

2. Emotions

Emotions are mental states, or “feelings”, often in reaction to events that people experience. Degrees of emotion vary along continuous scales. While emotions are often categorized with simple discrete labels such as anger, fear, happiness, and sorrow, feelings can range widely, and often do not fit neatly into simple categories. This variability makes designing quantitative recognition experiments more difficult than for other classification tasks involving speech, where the text of an utterance, the identity of a speaker, or the language used can each be defined objectively.
Thus, utterances in various databases used to test SER are typically identified with categorical emotion labels (often: anger, happiness/joy, sadness, disgust, surprise, fear, contempt), where a “neutral” condition may be defined as the absence of specific emotion. While most SER research has relied on tests with speech labeled (by listeners) with distinct categories of emotion, such labeling is often unreliable, as emotions are often nuanced, and listeners have varied interpretations [14]. Indeed, the choice of which classes to include for emotion categorization in surveys tends to bias the responses of listeners. There are many potential names for emotional states, and they can overlap with each other. Quantified representations of emotion must be able to take account of the diversity of human annotation; affective experiences are inherently subjective. SER is often less accurate than ASR, especially for speech of unfamiliar speakers [15], while human SER can exceed machine performance [16].

2.1. Theories of Emotion

There is no consensus on a structure to comprehensively describe pertinent aspects of emotion, nor on how to represent emotion in mathematical models. Evidence of emotion ranges over continuous scales [17], but is also often described by specific named categories [18]. This has led to a wide range of theories on how to qualify and quantify emotion [19,20], both in how it is expressed by humans and how it is perceived by others [21,22].
Three measures are widely accepted for continuous scales for emotion (recently termed dimensional emotion): arousal, valence, and dominance: (1) Arousal (activation and activity are similar terms) corresponds to the intensity of an emotion. (2) Valence concerns pleasantness on a positive/negative scale, e.g., joy is positive, anger negative. (3) Dominance is the degree of control by an (assumed) stimulus causing the emotion; such a third dimension is needed for emotion, as anger is dominant (+) and fear is submissive (−), while both are low valence and high arousal.
Different theoretical models of emotion (in particular, discrete categories versus dimensional models) have not found distinctive differences in terms of SER implementation. Outputs in most SER evaluations have been based on the accuracy of specific emotional labels, as determined in datasets verified by expert listeners. SER systems usually aim to predict discrete emotion labels, although some estimate dimensional ratings.

2.2. Physical Measures of Emotion in Speech

In terms of the acoustics of speech, emotion is often linked to intonation (also called suprasegmentals or prosody), which embodies changes in the fundamental frequency (F0) of the vibration of vocal cords (in quasi-periodic voiced sounds), the intensity of speech waveforms, and durations of linguistic sound units (phonemes, syllables, and words). All three physical acoustic measures vary with emotion, but in complex ways [23].
In humans, the sympathetic nervous system becomes aroused when one experiences strong emotions such as anger, joy, or fear. This may include a higher heart rate and blood pressure, changes in respiratory motion, more sub-glottal pressure, mouth dryness, and muscle tremor [24]. The speech that results may become more intense, faster, and with more energy at higher frequency, higher mean F0, and larger F0 range. On the other hand, arousal of the parasympathetic nervous system occurs with sadness, and blood pressure and heart rate decrease, with speech that is slower, with lower F0, and with less high-frequency energy.
From speech, listeners (or machines) can judge emotion from either (or both of) linguistic information (the lexical content of what is said) or paralinguistic information (acoustics of how it is said) [25]. Linguistic data, obtained using ASR and natural language processing (NLP), may be more helpful to estimate valence, but such data are language-specific [26]. Paralinguistic data are more useful for arousal and dominance, and generalize more readily across languages and emotions; such data normally refer to acoustic analysis of spoken words, but also nonverbal sounds as well [27]. Some SER systems are multi-modal, integrating audio, textual, and/or visual data [28]. This review paper focuses on speech applications, and does not consider text-based or multi-modal emotion recognition [29,30].

2.3. Emotional Effects on Intonation

Research has shown clear links between emotion and intonation in verbal communication [8,31]. Statistical measures of F0, energy (intensity), and speaking rate have been widely observed for emotion recognition [32,33]. In particular, speech of high arousal (e.g., anger and happiness) occurs with an increase in average F0, wider F0 range, and decrease in spectral tilt (this last one not dealing with intonation). Tests with synthetic speech have shown the relevance of intonation for perceptual cues to emotion in speech [34].
Nonetheless, effects of intonation are often neglected in many diverse speech systems (including SER), as intonation varies widely and ranges over greater time periods than phonemes [35]. Most speech analysis instead uses spectral information about very brief segments via analysis time windows lasting 10–30 ms [36]; these presume local stationarity inside each window, and can thus accommodate the generally dynamic, non-stationary nature of speech. Analysis performed every 10 ms (100 Hz frame rate) is typical for many speech applications, and extending such durations to hundreds of ms, which is needed to exploit intonation patterns, has eluded many speech algorithms.

2.4. Emotional Effects on Voice Quality

Emotion can cause involuntary changes in glottal behavior, e.g., in jitter, shimmer, and the harmonics-to-noise ratio (HNR) [2]. Jitter is the variability of durations of pitch periods in successive vibratory glottal cycles, whereas shimmer is variation in speech waveform amplitude between periods. (Note that the traditional “pitch period” is physically realized as quasi-periodicity, as humans do not produce exact repetitions; coarticulation of the VT (movements across phonemes) also renders small changes from period to period). Breathiness may occur in angry and happy speech, and vocal fry in sad and relaxed speech. Harsh voice (irregularity in voicing) has been found in fear speech.
The HNR measures the relative level of noise in sonorant spectra, i.e., the ratio between amplitudes of periodic and aperiodic components. For speech analysis, there are other pertinent quality measures such as the Normalized Amplitude Quotient, the Quasi-Open Quotient, and spectral tilt [37]. All of these may be affected by emotion, but have rarely been directly exploited in recent SER, where research is dominated by use of simpler acoustic measures (as many classical phonetic features have been difficult to integrate into standard neural speech classification systems).

2.5. Multi-Lingual Factors

Many social and contextual factors shape the expression and perception of emotion. Cultural and linguistic differences exist in how emotions are vocally expressed and interpreted [38], which can significantly affect model performance and generalizability. Anecdotal evidence notes higher levels of arousal in Western cultures. Section 6 notes multilingual SER research, but most studies examine only one language.

2.6. Perception of Emotion from Speech

Human listeners can estimate aspects of emotion based on short speech portions, but details about one’s listening strategy are hard to find. Perceptual tests with synthetic speech show that aspects of intonation are major cues to emotion [34]. Synthetic utterances were heard as fearful when they had a high F0, a broad F0 range, falsetto voice, or a fast speech rate. Sadness was heard for speech with a narrow F0 range, slow speaking rate, and breathiness. Faster speech rate and tense phonation were judged as angry speech. Joy correlated with a broader pitch range and a faster rate. Bored speech had a lowered mean F0, a reduced F0 range, and breathy or creaky voice. While some of these cues relate to broad details of the spectral envelope, most relate to intonation. Thus, intonational features are likely more useful for SER than the Mel spectra that are more often used.
When asked to select from among five common emotions, listeners achieve 60–70% accuracy in general, except for “disgust” (which likely has a more varied set of acoustic cues than the standard set (fear, anger, happiness, sadness)) [8,23]. While this performance is well above chance (20%), lack of more definitive results is evidence of the level of difficulty of the SER task.

3. Acoustic Analysis of Speech

Like many applications for speech, classical SER methods start with suitable processing of the input speech signal to: (1) eliminate less useful aspects of signals, (2) reduce data size, (3) focus on relevant acoustics, and (4) transform audio into pertinent features. For almost all speech applications, extracting pertinent information requires analysis or processing of the input (i.e., the time waveform of speech, as from a microphone capturing the variations in air pressure that make up the sound of speech). Analysis is especially helpful for speech, as compared to other types of data such as video, because speech is highly encoded: concepts in the brain transform into many dynamic neural commands to VT organs (lips, tongue, velum), which then move synchronously as air from the lungs passes through the tubes of the VT. The resulting speech waveform is thus a signal with highly indirect information about its relevant linguistic content.
Many physical signals other than speech (e.g., video and general audio) display direct relationships to their desired content (e.g., shapes of image objects; sounds from physical events). For speech, on the other hand, the sequence of individual time samples has little direct relationship to useful information, including emotions, ideas, and the language used. The spectral envelope of the transfer function of the VT is of main concern in most applications for speech, because it relates to VT shape, which speakers control intentionally. For SER, one must include intonational analysis as well, as VT variations appear to be a lesser component of emotion.

3.1. Basic Digital Signal Processing for SER

Like most physical data of interest (e.g., diverse signals such as images, temperature, odors), speech is always changing. For use in digital computers, signals are periodically sampled at a high rate (exceeding the Nyquist rate—double the highest signal frequency), to retain information for a pertinent frequency range in each application (e.g., in speech by telephone, up to 3.2 kHz [36]). Then, processes, including neural networks, convert the data sample sequence to representations that are more relevant to perform classification, e.g., into estimated text or classes of emotion.
Despite the large set of objectives among different applications for speech, very similar analysis methods are regularly used to convert high-dimensional speech data to much lower-dimensional representations for classification, across many speech applications: ASR, automatic speaker verification (ASV), language identification, and SER. It is generally assumed that most useful information in speech comes from intentional speaker behavior in VT motion. However, this assumption may not hold for medical diagnosis using speech, for physiological aspects of ASV, or for emotions. Nonetheless, such general processing is often applied in all speech domains, sometimes for lack of suitable alternatives. A major aim of speech processing has been to derive a compact representation with few dimensions that highlights relevant features. However, defining such features is complex (which has led instead to general machine learning methods; see Section 7.3). To accommodate the many variations across speakers and acoustical environments, different adaptation approaches have also been used [39].
For SER, intentions of the speaker are not a primary objective, as speakers focus on articulating a textual message (the lexical content of the speech), and emotional aspects are generally secondary, if at all intentional, for speakers. When speaking with emotion, a person is generally aware of emotions (and may indeed alter VT articulation to further emphasize emotional aspects, with intent), but is primarily focused on conveying the textual content. Thus, SER need not focus on acoustic aspects that mainly deal with VT shape (despite the fact that many SER systems indeed do). Instead, much research has shown that intonation (which has little to do with VT shape) is likely the major acoustical factor for SER [40].
Some modern SER systems (often termed “end-to-end” or E2E) avoid any direct intonational or spectral analysis of speech, assuming that all specific processing not directly for the task objective risks discarding relevant information. Indeed, recent SER has focused heavily on methods that tend to derive general speech features helpful for understanding and coding the phonetic information in speech, rather than for emotions. Nonetheless, as most SER does some form of analysis, we examine acoustic processing in this section.
Common spectral features used for SER include: MFCCs, linear prediction cepstral coefficients, formant frequencies (VT resonances), modulation spectral features, and bandwidths of formant frequencies [41]. Voice quality features for SER include shimmer, jitter, and NAQ, as they are related to characteristics of glottal excitation [42,43,44]. These are detailed below.

3.2. Basic Spectral Measures

The time waveform (or shape) of speech signals varies significantly, in part due to unintended phase variations, which derive from complex airflow variations, but may be affected by emotion. However, most classical speech analysis has focused instead on its spectral amplitude, as that contains information useful to estimate the intent of speech. The elementary discrete Fourier transform (DFT) is the simplest analysis method commonly used, as it displays energy as a function of frequency [36]. A DFT transforms N consecutive time samples (during a local time window, which is shifted periodically in time, to handle the dynamic speech signal) into N samples of spectral amplitude and phase (N is usually 256–512). In theory, there is no data reduction or information loss by this invertible transform (however, for digital purposes, the quantization necessary causes inevitable loss, to be minimized with more bits of representation). While DFT amplitudes furnish more pertinent information for speech classification than time samples, the entire DFT has excessive detail for most applications, and thus needs data reduction to be efficient.
One common variant of the DFT is filter bank energies (FBEs), as they smooth DFT amplitudes over a broad series of restricted frequency spans (e.g., bandpass filtering). As useful speech information is distributed non-uniformly in frequency, the nonlinear Mel scale is often used in frequency [45]; it has filters of equal bandwidth up to 1 kHz, and then logarithmic spacing; this corresponds to the reduced resolution of audition at higher frequencies (effects that derive from the tapered width of the basilar membrane inside the inner ear). ASR typically uses 20–100 of these filters to cover the full frequency range, which reduces data well below a 512-sample DFT, while nonetheless keeping adequate information about the spectra for applications of speech.
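To make the preceding concrete, the short sketch below (in Python, using NumPy and librosa) computes short-time DFT magnitudes and 40 Mel filter-bank energies at a 100 Hz frame rate; the file name, sampling rate, and filter count are illustrative assumptions, not values prescribed by any particular SER system.

```python
# Sketch: short-time DFT magnitudes and Mel filter-bank energies (FBEs).
# Assumes 16 kHz audio in a hypothetical file "utterance.wav";
# frame parameters (25 ms window, 10 ms shift) follow Section 3.4.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)

n_fft = 512                      # DFT size (zero-padded 25 ms window)
win = int(0.025 * sr)            # 400 samples
hop = int(0.010 * sr)            # 160 samples -> 100 frames/s

# Complex DFT per frame; keep only magnitudes (phase is discarded, as in the text)
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop,
                        win_length=win, window="hamming"))

# 40 triangular Mel filters smooth the DFT into far fewer energy values per frame
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=40)
fbe = mel_fb @ (S ** 2)          # (40, num_frames) filter-bank energies
log_fbe = np.log(fbe + 1e-10)    # log compression, as commonly used
print(log_fbe.shape)
```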

3.3. More Advanced Spectral Envelope Measures

A frequent spectral analysis method is MFCCs; they resemble FBEs (with the Mel scale), but convert the data to an efficient, reduced set of parameters. Thirteen MFCCs are enough for telephone speech to discriminate the center frequencies of formants to within 100 Hz, which is accurate enough to distinguish close vowels (e.g., /i-e/). The name “cepstral” refers to the mapping (via the logarithm) of the multiplication of spectra (i.e., VT filtering of the VT excitation) into a linear addition of terms for the excitation and the envelope transfer function; this permits separation of the filter and excitation of speech, and is useful because these two are handled distinctly both in speech perception and in common speech coding such as LPC.
LPC is a common spectral analysis method, and uses an all-pole autoregressive (AR) model of the speech spectrum [46], which directly focuses on VT resonances (AR corresponds to the digital feedback speech synthesis model, used for reconstruction of coded speech, as it represents resonances well). LPC analysis is directly related to speech, as it assumes sounds from a VT, unlike MFCCs or FBEs. Basic LPC, as found in speech telephony for decades, uses ten parameters for the narrow bandwidth (300–3200 Hz) of commercial networks. During the 1970s, ASR commonly used LPC, and then used MFCCs until recently. Modern SER systems do not use LPC, as MFCCs or FBEs (or other models) perform better; LPC is fine for coding, where its 10 parameters model the spectral envelope well, but they carry little detail relevant for emotion estimation.
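As an illustration of the two envelope representations just described, the hedged sketch below extracts 13 MFCCs per frame and fits a 10th-order LPC (AR) model to one frame using librosa; the file path and frame position are placeholders.

```python
# Sketch: 13 MFCCs per frame (standard), plus a 10th-order LPC fit on one frame
# for comparison. Assumes the same hypothetical 16 kHz "utterance.wav" as above.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per 10 ms frame: a compact spectral-envelope representation
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=512, hop_length=int(0.010 * sr))
print(mfcc.shape)                          # (13, num_frames)

# All-pole (AR) model of a single 25 ms frame (taken ~0.5 s into the file);
# 10 coefficients, as in narrow-band telephone coding (rarely used in modern SER)
frame = y[8000:8000 + int(0.025 * sr)]
a = librosa.lpc(frame, order=10)           # AR polynomial coefficients, a[0] = 1
print(a)
```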

3.4. Time Windows for Frames in Speech Analysis

One usually determines acoustic features inside a window frame (e.g., of 25 ms duration), which is periodically shifted by 10 ms. This de facto use of 100 frames/s is common in diverse speech applications, as it has enough detail to follow VT coarticulation movements, but is reduced enough to minimize computation. An average phone lasts approximately 80 ms, so extracting roughly eight frames per phone tracks VT motion sufficiently. These acoustic features can use Mel-spectrograms of 80 or so dimensions, a 512-point DFT, and a window such as Hamming [26].
Such standard spectral vectors per frame typically have 13–16 MFCC numerical values, which correspond to spectra of static VT positions. To partly compensate for use of these brief analysis windows, and to also include wider-range temporal information, each spectral vector can be augmented with a set of difference (delta) parameters, which model the change between successive frames (thus a measure of VT velocity). A further set of delta–delta values can then approximate acceleration of VT motion. These three sets may be useful for ASR [47], but tripling the number of parameters is a cost that appears not sufficiently advantageous for SER.
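A minimal sketch of the delta and delta-delta augmentation described above, again using librosa; the 13-coefficient base and the audio file are assumptions for illustration.

```python
# Sketch: augmenting per-frame MFCCs with delta (velocity) and delta-delta
# (acceleration) parameters, tripling the feature dimension as described above.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)           # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # (13, T)

delta = librosa.feature.delta(mfcc)                        # frame-to-frame change
delta2 = librosa.feature.delta(mfcc, order=2)              # change of the change

features = np.vstack([mfcc, delta, delta2])                # (39, T)
print(features.shape)
```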

3.5. Intonation Features

Much research has noted that human listeners exploit the perceived intonation of speech to help distinguish emotions [48,49]. However, most recent SER does not, preferring simpler Mel spectra. Automatically estimating durations of detailed phonetic sequences is not easy, and often needs ASR, which may be unreliable. It is, however, simpler to estimate average speaking rate, which may be a good feature for SER.
As for F0, there are many estimation algorithms [50]. However, integrating information about F0 into the SER methods is difficult, as intonation covers many speech frames. SER generally uses frame-based spectral amplitude data, exploiting information about VT resonances that are relevant over brief durations. While the use of both F0 and durations occurred in expert-system approaches for SER [49], most recent neural approaches do not directly use intonation. This lack may be a major reason for relatively weak SER accuracy levels.
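The sketch below estimates frame-level F0 with librosa's pYIN implementation and then reduces it to simple utterance-level intonation statistics of the kind discussed here; the 65–400 Hz search range and the use of voiced-frame fraction as a crude rate-related proxy are assumptions, not established SER features.

```python
# Sketch: frame-level F0 with the pYIN tracker, then utterance-level intonation
# statistics (mean, variability, range). Paths and the search range are
# illustrative assumptions.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)

f0_voiced = f0[~np.isnan(f0)]              # keep only voiced frames
stats = {
    "f0_mean_hz": float(np.mean(f0_voiced)),
    "f0_std_hz": float(np.std(f0_voiced)),
    "f0_range_hz": float(np.max(f0_voiced) - np.min(f0_voiced)),
    "voiced_fraction": float(np.mean(voiced_flag)),   # crude rate-related proxy
}
print(stats)
```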

4. Models to Use Measures of Analysis for Speech Emotion Classification

The aim of all methods of analysis for speech (FBE, DFT, LPC, MFCC, intonation) is to yield data representations that are more useful than the (raw) waveform time samples. Neural methods applied to speech inherently “learn” a version of this through their automatic training, but SER may be more accurate and efficient if such networks could be guided by direct use of intonation and spectral features. SER and ASR can benefit from lexical data in speech, i.e., information from words and phonemes, which is seen in intonation and VT resonances. Applications that instead produce speech output (e.g., enhancement, coding, synthesis) also retain phase, but such is not relevant for ASR, and also of less utility for SER [51]. The requirements of ASR differ much from those of SER, because ASR has to identify a sequence of phones and words, whereas SER looks for (less obvious) clues to emotion, which may include different acoustical information.
Most current SER uses neural networks that directly learn from data, and does not employ specific techniques such as MFCCs [52]. Before we discuss neural approaches, we first review other methods used recently for SER. Early SER research was largely developed from prior ASR methods. Such SER methods use this process: (1) front-end acoustic analysis to obtain a phonetic representation, and (2) a back-end classifier. Other recent SER methods may use a single-stage E2E method [53,54]; we discuss those in Section 7.2.

4.1. Embedding

As described later, ANNs have diverse architectures, consisting of many varied layers of nodes, possibly with feedback, where each network layer usually has input and output vectors with fixed numbers of values. However, speech input data vary greatly in duration. For speech coding and ASR, speech is partitioned into fixed 10–25 ms frames for repeated, periodic processing. For SER and ASV, however, the output is only classification for a given section of speech, but their input has large differences in duration. This requires an embedding, which maps data of variable length to a fixed length set [55,56].
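A minimal sketch of one common embedding strategy, statistics pooling, which maps any number of frame-level feature vectors to a single fixed-length vector; this is only one of many possible embeddings (others are learned inside the network, as discussed in Section 7).

```python
# Minimal sketch of an "embedding" in the sense above: mapping a variable-length
# sequence of frame features (e.g., 13 MFCCs per frame) to one fixed-length
# vector via mean and standard-deviation pooling.
import numpy as np

def statistics_pooling(frames: np.ndarray) -> np.ndarray:
    """frames: (num_frames, feat_dim), where num_frames varies per utterance."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

short = np.random.randn(120, 13)    # 1.2 s utterance at 100 frames/s
long = np.random.randn(800, 13)     # 8 s utterance
print(statistics_pooling(short).shape, statistics_pooling(long).shape)  # both (26,)
```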

4.2. Hidden Markov Models (HMMs)

For ASR, a related application to SER, the most common modeling technique until recently has been HMMs [57,58]. With a maximum likelihood approach, one seeks to maximize P(T|S), i.e., the most likely candidate text T for an input speech utterance S, where T is a word sequence and S a set of spectral vectors from speech analysis. Using Bayes’ theorem, one instead maximizes P(S|T) P(T), where P(T) is a language model and P(S|T) an acoustic model (as P(T|S) is far too complex to use, given the immense number of possible speech signals S). HMMs can be models of linguistic units of various size (e.g., phonemes or words).
Each HMM consists of a sequence of states, representing spectral patterns for shapes of the VT used to generate the sounds. To handle temporal variability, HMMs use timing transitions between states for successive frames. These transitions have trained probabilities (modeling speech durations), and a probability density function (often a weighted set of Gaussians, called a Gaussian Mixture Model (GMM)) is estimated for each state’s spectrum. Phoneme HMMs often depend on the acoustic/articulatory context of immediate neighboring phonemes, to model VT coarticulation. For SER, which does not need individual decisions per frame, a small number of emotion classifications suffices; hence, single-state GMMs may be used [33].
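The sketch below illustrates the single-state GMM approach mentioned above: one GMM per emotion is fit to frame-level features, and a test utterance is assigned the emotion whose model gives the highest total log-likelihood. It uses scikit-learn with random data as stand-ins for real MFCC frames; component counts and dimensions are illustrative.

```python
# Sketch of single-state GMM emotion classification: one GMM per emotion is
# trained on frame-level spectral vectors, and a test utterance is assigned to
# the emotion whose GMM gives the highest total log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(train_frames_by_emotion, n_components=16):
    """train_frames_by_emotion: dict emotion -> (num_frames, feat_dim) array."""
    gmms = {}
    for emotion, frames in train_frames_by_emotion.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", max_iter=200)
        gmms[emotion] = gmm.fit(frames)
    return gmms

def classify(gmms, utterance_frames):
    """Pick the emotion whose GMM best explains the utterance's frames."""
    scores = {e: g.score_samples(utterance_frames).sum() for e, g in gmms.items()}
    return max(scores, key=scores.get)

# Toy usage with random "features" in place of real MFCC frames
rng = np.random.default_rng(0)
train = {"anger": rng.normal(1.0, 1.0, (500, 13)),
         "sadness": rng.normal(-1.0, 1.0, (500, 13))}
gmms = train_gmms(train)
print(classify(gmms, rng.normal(1.0, 1.0, (200, 13))))   # likely "anger"
```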

4.3. Support Vector Machines (SVMs)

A common general binary classifier called the SVM [59,60] is simple and efficient in discriminating between two classes (e.g., emotions) in very high-dimensional feature spaces. For SER, an SVM can be trained to estimate a hyperplane in representation space, to separate utterances with a specific emotion from all other utterances [58]. SVMs use low-dimensional feature vectors (e.g., average cepstral vectors) that are transformed into separable high-dimensional vectors via kernel transformations (the “kernel trick”).
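A minimal scikit-learn sketch of the binary SVM formulation just described, using one fixed-length feature vector per utterance and an RBF kernel; the synthetic data and hyperparameters are placeholders.

```python
# Sketch: a binary SVM with an RBF kernel separating one emotion from the rest,
# using one fixed-length feature vector per utterance (e.g., averaged cepstra).
# Data here are synthetic stand-ins for real utterance-level features.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.5, 1.0, (100, 26)),    # "target emotion" utterances
               rng.normal(-0.5, 1.0, (100, 26))])  # all other utterances
y = np.array([1] * 100 + [0] * 100)                # 1 = target emotion

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print(clf.predict(rng.normal(0.5, 1.0, (5, 26))))  # mostly 1s expected
```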

4.4. Other Approaches

Earlier ways to do SER include kernel regression, maximum likelihood Bayes classification, least-squares regression, and k-nearest-neighbor methods [49,61]. In recent years, several feature sets have been developed that combine low-level acoustic features. GeMAPS [54,62] is a set of combined spectral and intonational features often used for SER. The feature extraction tool openSMILE is popular software to find both long-term and short-term features, often called Low-Level Descriptors (LLDs) [63]. Proficient choice of such features can help SER accuracy [64]. These hand-crafted features were popular for SER in the 2010s, but have mostly been replaced by self-supervised learning (SSL) features (Section 7.3).
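For reference, the sketch below shows how eGeMAPS functionals can be extracted with the openSMILE Python wrapper; the package, feature-set, and function names follow that tool's documented interface, but should be treated as assumptions here rather than part of any cited SER system.

```python
# Sketch using the openSMILE Python wrapper ("opensmile" package) to extract
# eGeMAPS functionals, i.e., utterance-level summaries of LLDs.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,        # 88 functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("utterance.wav")          # pandas DataFrame, 1 row
print(features.shape)
```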

5. Task Variability and Evaluation of SER

Applications for SER vary considerably, unlike for many other speech tasks. In speech coding, the aim is direct, i.e., efficient reduction of data sequences while retaining quality (both intelligibility and naturalness). For enhancement of speech, the objective is reducing distortion (echo and noise) while also retaining quality. Text-to-speech also needs speech quality, while varied voices and emotions are useful. ASR criteria are also straightforward, i.e., speech understanding.
For SER, on the other hand, while determining which emotion corresponds to given speech seems like a simple aim, SER applications have a wide span. SER factors include: (1) the range of outputs (e.g., a fixed set of a small number of “classic” emotion labels), (2) the audio quality, (3) the length of speech used to test and that used to train the SER models, (4) whether emotion changes during an utterance, and (5) whether emotion is natural or simulated.
For ASR or ASV, one typically compares automatic recognition to how well native listeners comprehend speech or recognize speakers. Nonetheless, most current SER research uses tests for closed sets using several emotion labels, which is often difficult for human listeners, especially if one uses large numbers of overlapping emotion labels, e.g., beyond a few basic emotions. One can develop algorithms using different amounts of speech for training, and may obtain better performance than humans (which is also found for some ASV tasks when using unfamiliar speakers) [65].
As SER tasks are variable, so also are evaluation measures. SER with closed-set answers can be evaluated with basic accuracy (the percentage of test utterances whose emotions are correctly identified). Some SER tasks are binary (i.e., does given speech correspond to a specific emotion?); in this case, the equal error rate (EER) is often used: the percentage at which false rejections (wrongly deciding that a sample with the target emotion does not have it) equal false acceptances (wrongly deciding that a sample without the target emotion has it).
A common way to evaluate SER performance is the weighted-F1 score [66]. This measure combines recall and precision of a classifier into a single metric by use of their harmonic mean. Accuracy, precision, and recall are common ways to measure success in pattern recognition, and are defined below in terms of the numbers of recognition successes and failures in tests with data examples (Equations (1)–(4)).
Accuracy = (true positives + true negatives)/total number of examples
Precision = true positives/(true positives + false positives)
Recall = true positives/(true positives + false negatives)
F1 score = 2 × Recall × Precision/(Recall + Precision).
(The weighted-F1 score takes account of the proportion of true instances for each label).
Unweighted accuracy (UA) is often a primary metric, but some reports use weighted accuracy (WA) [67]. WA weights per-class results by class frequency (so it reflects label imbalance), while UA averages over classes equally. The concordance correlation coefficient provides an estimate of how well a predicted distribution matches the actual one [68], and is a typical measure for evaluating SER.
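The short sketch below computes the measures in Equations (1)–(4) with scikit-learn for a toy set of emotion predictions; the overall accuracy and the mean per-class recall (balanced accuracy) shown correspond to the two accuracy notions discussed above.

```python
# Sketch: accuracy, precision, recall, and label-weighted F1 for toy predictions.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, balanced_accuracy_score)

y_true = ["anger", "anger", "sadness", "joy", "joy", "joy", "neutral"]
y_pred = ["anger", "sadness", "sadness", "joy", "neutral", "joy", "neutral"]

print("overall accuracy   :", accuracy_score(y_true, y_pred))
print("mean per-class rec.:", balanced_accuracy_score(y_true, y_pred))
print("precision (weighted):",
      precision_score(y_true, y_pred, average="weighted", zero_division=0))
print("recall (weighted)   :", recall_score(y_true, y_pred, average="weighted"))
print("weighted F1         :", f1_score(y_true, y_pred, average="weighted"))
```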

6. Datasets for SER

Many SER systems need training datasets that are labeled and also use ASR (to estimate the text of an emotional utterance), while recent neural approaches may not need these. The diverse set of applications and ways to evaluate SER leads to use of many acoustic datasets. To compare SER processes, typical datasets are used. Because neural network classifiers are based fully on training examples, the distribution of such data is crucial. For scientific control, most SER databases have speech by actors simulating different emotions, including some official “challenges” [37,69]. Some “in-the-wild” databases, often from YouTube, use unprompted emotional speech [70]. Most databases ignore inter-rater ambiguity [71], and assign distinct labels to speech (Table 1).
Perhaps the largest English SER corpus, MSP-PODCAST has speech from 62,140 speaking turns in podcasts. In comparison, many cloud-based ASR models use thousands of hours of training speech; as a result, ASR in general has higher accuracy than SER. Popular for SER experiments, the IEMOCAP database has five sessions, each with two actors (one man, one woman): 10,039 total turns with nine emotion labels (anger, happiness, excitement, sadness, frustration, fear, surprise, other, neutral). It has both scripted and improvised conversations, with an imbalanced class distribution [16]. MSP-IMPROV has 8438 improvised speaking turns with four emotions (happiness, sadness, anger, neutral). Table 1 lists other databases with actors simulating various emotions in text-independent, text-dependent, and improvised scenarios.
Many SER systems train and then test using the same database. Instead, to increase robustness, training and testing on different sets is called cross-corpus [86,87]. In general, SER datasets are much smaller than those used for ASR, as general ASR has access to huge numbers of normal utterances, whereas SER sets are often limited to actor speech. As a result, many SER systems instead start from a general ASR platform, and fine tune with SER datasets [88]. In particular, SER has seen much use of HuBERT and Wav2Vec2 models [89,90], i.e., use of SSL speech models. An alternative is Whisper, a Transformer model trained with weakly supervised learning [91].
In the available databases of emotional speech, most have an insufficient number of utterances and speakers to allow effective single-corpus training of large neural networks. Thus, cross- and multi-corpus SER is common. However, databases differ greatly in recording settings (e.g., acted, forced, or natural) and language. Strategies such as domain adaptation or adapter transfer learning can improve SER performance [53,92]. Best practices for evaluation include the importance of cross-corpus testing, robustness to spontaneous versus acted emotion, and generalization across speaker identities and recording conditions. Nonetheless, there is no agreement on choice of acoustic features for SER, and significant differences in performance occur between acted and natural emotional SER.

7. Neural (Machine Learning) Techniques for Emotion Classification of Speech

Applications in many fields (other than SER) that process data now primarily use ANNs, as they are powerful and can exploit an extensive range of data [93]. Nonlinear processing permits ANNs to internalize patterns that are very complex, but typically needs huge network models and much computation, which often means devices with few resources (e.g., cellular phones) cannot be used. ANNs transform data or recognize patterns, using automatic training based on many examples. An ANN maps an input sequence (e.g., speech) and uses a nonlinearity to form an output sequence (e.g., for SER, a set of likelihoods for the possible target emotions). The last decade has seen a major research shift toward neural SER [94], whether using spectral input [95] or direct use of speech waveforms.

7.1. Basics of ANNs

The primary element (or node) of an ANN models the action of biological neurons. The basic ANN is a multi-layer perceptron (MLP), which has layers (sets) of nodes; each node feeds its output as input to the nodes of the next layer, with no feedback. For classification such as SER, training an MLP can be seen as an optimization of the combined placement of hyperplane boundaries for class regions in a representation space that can be very complex [96]. The initial ANN layer obtains its input as a vector of speech waveform samples (or derived features). Intermediate “hidden” layers furnish refined data for use in classification; the numbers of nodes vary greatly, and are often chosen empirically.
In natural neural networks, dendrites in a neuron route electrical signals to succeeding axons; each axon yields a binary output consisting of a brief pulse when the weighted sum of inputs exceeds a threshold. The output of an ANN node (model of a neuron) is specified by an activation function y = φ(w · x + b), where w is an N-dimensional weight vector, b is a scalar bias, x is an N-dimensional input vector, and φ(.) is a nonlinearity (e.g., sigmoid, rectifier, or tanh) [94]. For each model node, its weight and bias parameters determine the location of a hyperplane in a virtual space; the binary output maps into either side of the hyperplane. For an application such as SER, each ANN node potentially provides some information toward the emotion classification. Huge networks, having millions of parameters, are typically required to attain sufficient system precision for any ANN application such as SER, owing to the large variability of input data.
To train the parameters of a neural network, initial estimates for node biases and multiplier weights can be chosen randomly (or can be pre-trained using unlabeled data, which are usually more easily obtained). Ensuing training alters these parameters incrementally, during successive training cycles called “epochs”, using a criterion to minimize a loss (cost) function. A direct criterion such as SER classification accuracy is not feasible, because ANN training requires a differentiable “loss”, to allow a product chain of derivatives (through the network) that shows the direction and amount to change parameters at each training iteration, using steepest gradient descent.
Loss functions provide an approximation of a penalty for SER errors. These functions vary greatly across applications, but a common one is cross-entropy (CE) [88]. CE maximizes the log-likelihood of training data (which, for Gaussian output models, corresponds to minimization of least-squares (L2) error). CE is also an identification loss, usually with Softmax output units or a variant such as Angular Softmax. Softmax applies an exponential function to the elements of its input vector, and normalizes these values by the sum of all the exponentials. This guarantees that the components of the output vector add to 1, and are thus suitable to interpret as a probability distribution.
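A minimal NumPy sketch of the quantities defined in this sub-section: a single node's activation y = φ(w · x + b), a Softmax over emotion logits, and the cross-entropy loss against a target class; the dimensions and the tanh nonlinearity are arbitrary choices for illustration.

```python
# Minimal sketch: one node's activation, a softmax output over emotion classes,
# and the cross-entropy loss against a target class index.
import numpy as np

def node(x, w, b, phi=np.tanh):
    return phi(np.dot(w, x) + b)          # scalar output of one model neuron

def softmax(z):
    z = z - np.max(z)                     # numerical stability
    e = np.exp(z)
    return e / e.sum()                    # components sum to 1

def cross_entropy(probs, target_index):
    return -np.log(probs[target_index] + 1e-12)

x = np.random.randn(26)                   # e.g., a pooled feature vector
w, b = np.random.randn(26), 0.1
print("node output:", node(x, w, b))

logits = np.random.randn(4)               # 4 emotion classes
p = softmax(logits)
print("probs:", p, "CE loss:", cross_entropy(p, target_index=2))
```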

7.2. Common Architectures for ANNs

The simplest fully connected feedforward (FFNN) MLP structure has all nodes in each layer feed into all nodes in the succeeding layer. For most applications, such an approach is often overly general and too costly in terms of size and computation. As useful information in speech data is spread very non-uniformly across both frequency and time, fully connected approaches are inefficient [97]. Most ANN applications instead use different versions of the more efficient components noted below (Table 2).

7.2.1. Convolutional Neural Networks (CNNs)

For many tasks of classification, pertinent information is localized in limited spans of data (e.g., objects in images; formants in speech). Thus, ANNs may contain “convolutional” layers of nodes that process data locally. If viewing input data from a representation in two dimensions (e.g., time-frequency, a wide-band spectral display for speech, where the frequency axis is linear or logarithmic), data are multiplied by a square weight matrix over a small range (e.g., 3 × 3), and results are summed (pooled) spatially [98]. A sort of CNN that applies 1-dimensional convolution (multiplication) in time is called time-delay NN; such models have no recurrence (feedback).
As an example for SER, one may combine different acoustic features in a one-dimensional array of mean values, fed into a 1-D CNN (Refs. [99,100] show an efficient CNN SER). Applying CNNs to “emotionally salient” regions of speech [101,102] has shown that information about emotion is not uniformly distributed in speech (see more on this idea in Section 7.2.3).
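To make the 1-D CNN idea concrete, the PyTorch sketch below maps a sequence of 40-dimensional log filter-bank frames to logits over four emotions, averaging over time so that utterances of different lengths yield one decision; this is a generic illustration, not a reproduction of the cited models.

```python
# Sketch (not any specific published model): a small 1-D CNN mapping 40-dim
# log-FBE frames to logits over four emotions, pooling over time.
import torch
import torch.nn as nn

class TinySER1DCNN(nn.Module):
    def __init__(self, n_feats=40, n_emotions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_feats, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool1d(1)     # average over all frames
        self.out = nn.Linear(64, n_emotions)

    def forward(self, x):                       # x: (batch, n_feats, num_frames)
        h = self.pool(self.conv(x)).squeeze(-1)
        return self.out(h)                      # emotion logits

model = TinySER1DCNN()
logits = model(torch.randn(2, 40, 300))         # two 3 s utterances at 100 frames/s
print(logits.shape)                             # torch.Size([2, 4])
```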

7.2.2. Recurrent Neural Networks (RNNs)

To take better advantage of the uneven distribution of pertinent data in speech across frequency and across time, more selective use of nodes is useful. CNNs exploit data correlations over small local ranges, but recurrence can be helpful for correlations over long durations. RNNs use models with temporal feedback, with distributed hidden states that store information about past input [103]. Network gates (input, forget, and output) control the flow of data across layers. Typical variants are long short-term memory (LSTM) networks; residual (skip) connections, as in ResNet, are also widely used [93,104]. For SER, effective high-level features are found with RNNs, which are more robust to long-range contextual effects [105,106]. In addition, multi-scale CNNs can be used to extract temporal and spectral features over variable durations [107,108,109,110].

7.2.3. Attention

Recently, a modeling approach called attention has been prominent for ANNs. It places emphasis for classification on specific sections of data, to exploit the non-uniform distribution of information [102]. As an algorithm, attention is determined as data correlation using matrix operations (e.g., dot products) that combine queries (inputs), keys (features), and values (wanted outputs). The queries derive from a decoder layer that appears earlier in an ANN, and keys and values derive from encoder outputs. Attention can apply in time, in frequency (across channels of signal spectra), or across layers of networks [111]. The essential application of attention is a useful modification to otherwise uniform ANN structures. However, the simple mechanisms of correlation that are used are often inadequate to utilize well the complex information distribution in data like speech.
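A minimal NumPy sketch of (scaled) dot-product attention as just described: queries are correlated with keys, and the resulting weights emphasize selected values; the single-head, self-attention setting and the dimensions are illustrative.

```python
# Sketch of scaled dot-product attention: queries are correlated with keys to
# weight the values; a single head over one utterance's frame features.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q: (Tq, d), K: (Tk, d), V: (Tk, dv) -> (Tq, dv)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # correlation of queries with keys
    weights = softmax(scores, axis=-1)        # emphasis on selected frames
    return weights @ V

T, d = 100, 64                                # e.g., 100 frames, 64-dim features
X = np.random.randn(T, d)
out = attention(X, X, X)                      # self-attention over one utterance
print(out.shape)                              # (100, 64)
```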
Neural models that use attention are termed Transformers. Such recent E2E methods use attention, but no recurrence, to emphasize selected aspects of data that exist over wide ranges. In recurrent networks [112], on the other hand, inputs are treated in sequence. Transformer instead does not use ideas of token sequences, but uses a separate embedding table with time-positional encodings [55]. Nonlinear functions do monotonic mappings in time.
Transformers need more computation than most other ANNs. Variants of attention include self-attention [113] and cross-modal attention [114]. Transformers are now common in SER [90]. The Transformer architecture has the capacity to learn general structural information from high-dimensional data, and can take advantage of much unlabeled data through self- and un-supervised learning. However, a Transformer with a full-attention mechanism has a computational cost that grows quadratically with sequence length, and some SER uses long utterances with dynamic emotion.

7.2.4. Typical ANN Operation for SER

Like many applications of machine learning, SER uses different combinations of the modules examined in this sub-section (i.e., attention, FFNN, CNN, RNN). In many tasks, ANN design has often been without proper motivation, i.e., empirical with arbitrary architecture choices, as ANNs automatically learn all from training examples, using different loss functions. This is especially true for SER, as there is less known about how people identify emotions than is known for ASR or speech coding. This knowledge gap causes use of many approaches for SER [9,114,115]. (We see similar broad approaches to ASV [65], as what distinguishes speakers is also harder to discern than the clearer phonetic aspects that are of prime importance for ASR or speech coding).
For many applications, classical pattern recognition uses a sequence of components, including: preprocessing, normalization, determining parameters, then (sometimes) features, classification, and finally post-processing. Early stages are named as front-end, with others as back-end. Such modular and sequential approaches are easily understood and designed, but perform sub-optimally when individually trained. Many recent neural techniques called E2E networks, on the other hand, use one classifier structure with a simple loss function [116].

7.3. Supervised and Self-Supervised Methods

Whether to use task-specific labeled data for training is an important factor in estimation performance. Recognition models that rely entirely on supervised training often have lower accuracy, as access is limited to existing labeled data. Current public emotional datasets are insufficient to train a robust supervised learning model, as their diversity of speakers and conditions is too limited. In the related field of ASR, many classification approaches use supervised training, where dynamics of spectra can be estimated from large training databases by examining short sequences of speech, e.g., context-dependent phones or senones (e.g., three phones in a row, or triphones). However, as emotional speech data are sparse, SER often uses general methods that require fewer labeled data. One can generalize models by data augmentation, which randomly deforms existing training data, e.g., by artificially adding reverberation, noise, or other distortion [109,116,117]. However, most speech augmentation techniques are those derived for ASR, which may not be appropriate for SER, e.g., altering speaking rate clearly affects emotion. In addition, little research has been performed on SER using noisy speech [118].
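As one concrete augmentation example, the sketch below adds noise to a speech signal at a chosen signal-to-noise ratio; whether such a distortion preserves the perceived emotion must be verified for SER use, and the signals here are random stand-ins.

```python
# Sketch: simple additive-noise augmentation at a chosen signal-to-noise ratio,
# one of the distortions mentioned above.
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = np.resize(noise, speech.shape)            # loop/crop noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(p_speech / p_noise_scaled) == snr_db
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

speech = np.random.randn(16000)       # stand-in for 1 s of 16 kHz speech
noise = np.random.randn(8000)
noisy = add_noise(speech, noise, snr_db=10.0)
print(noisy.shape)
```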
A related domain adaptation approach to deal with the lack of suitable SER data is generative adversarial networks (GANs) [117,119,120]. This technique contrasts actual (true) and synthetic (false) samples of data using two networks: a generator and a discriminator. The generator formulates data in a low-dimensional latent space; then, the discriminator learns to discriminate between “real” training data and “false” generator outputs. Generators maximize the error rate of the discriminator, while discriminators reduce their own error rate. GANs can represent phonetically or phonologically meaningful information. Diffusion models provide an alternative to GANs [121]. Recently, large language models (LLMs), e.g., ChatGPT and Claude, have been applied to SER [122]. LLMs are pre-trained on vast text corpora, instead of speech, and have found vast applications. Pre-trained models use massive amounts of unlabeled data to learn general speech representations, which are then fine-tuned for specific tasks, reducing the need for extensive labeled datasets.
For many machine learning tasks (including SER), a common way to address current database limitations is to use SSL models. They may encode a diversity of phonetic, prosodic, and speaker information [115,123]. The popular wav2vec2 model [124], WavLM [125], and HuBERT [27] are self-supervised, i.e., trained to determine representations of speech automatically from unlabeled data. Which features these models abstract from data depends greatly on their choice of loss functions and their application as encoder/decoder. They compress input data to (a reduced set of) latent representations, and a measure of success is the quality of their reconstructed output data. Their action is thus quite different from feature-based coders (e.g., LPC). Such SSL models suffer from the general lack of interpretability of most neural models, which leads to great difficulty in disentangling emotional cues from speaker- or context-specific information in speech. These all remain very active areas of research.
Wav2vec2 has a CNN encoder, randomly masks (i.e., omits) portions of input speech in a latent space, and then tries recovering via a contrastive loss criterion between the learnable quantizations and reconstructed versions. A quantization module converts latent speech representations to discrete copies, which are then used as targets. The entire model trains to solve a contrastive proxy task; this needs identification of the true quantized latent speech representations for a masked time unit within a set of distractors. The HuBERT method employs a similar approach: a 1-D CNN, and then a Transformer encoder. The 1D CNN converts input (raw) speech into low-level features; the Transformer generates high-level features via self-attention. Both wav2vec2 and HuBERT have a CNN feature extractor and Transformer encoder.
This SSL approach has been applied to various speech applications, and appears to extract relevant phonetic patterns that can fruitfully be used. Like most machine learning methods, however, any intuitive interpretation of the features is extremely difficult. Thus, one assumes that relevant features are being obtained because the final classification outcome may be improved over earlier methods. Typically, the output features of this SSL training are about 1000 for each 1 s time window of input; the models often have hundreds of millions of parameters.
So, much recent SER pre-trains a model with SSL (a general upstream operation, which applies to many speech processing applications) using a large amount of unlabeled speech data (e.g., wav2vec 2.0 or HuBERT), and then fine tunes the general model with the contextual and acoustic information for SER (a downstream task) [56,89]. For SER, the model that derives from training that applies to various speech tasks is used, with fine tuning for the specific SER task. A smoothed pooling layer transforms the frame-level context representations from the wav2vec encoder into representations at the sentence level, and classification uses a fully connected layer for each speech sample by emotion [115,126].
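A hedged sketch of this upstream/downstream pipeline using the Hugging Face transformers package: a pre-trained wav2vec 2.0 encoder (the checkpoint name below is an assumption) produces frame-level representations that are mean-pooled to a sentence-level vector and passed to a small emotion classifier; in a real system the pooling and classifier, and often the encoder, would be fine-tuned on labeled SER data.

```python
# Sketch of the SSL pipeline described above: frame-level wav2vec 2.0 features,
# mean pooling to a sentence-level embedding, and a linear emotion classifier.
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
classifier = nn.Linear(encoder.config.hidden_size, 4)   # 4 emotion classes

waveform = torch.randn(16000 * 3)                       # stand-in for 3 s of 16 kHz speech
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    frames = encoder(**inputs).last_hidden_state        # (1, num_frames, 768)
pooled = frames.mean(dim=1)                             # sentence-level embedding
logits = classifier(pooled)                             # emotion logits
print(logits.shape)                                     # torch.Size([1, 4])
```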
Transfer learning [127] exploits models from other speech areas (e.g., ASR or ASV) to improve SER. Joint training or multi-task learning between ASR and SER can consider ASR and SER loss simultaneously during training [88,128].

8. Discussion

This paper reviewed modern methods of automatic spoken emotion recognition. Although most current SER does not exploit classical phonetic features directly, we first looked at the acoustic information in speech that assists people to distinguish emotion, as these may help in future SER research. Since listeners focus on specific phonetic and intonational acoustic features when they distinguish emotion in speech, it is reasonable to presume that SER could gain by guiding its automatic analysis and network using trends that are found in human listening.
As approaches to discern emotion often share much with the related applications of ASR and ASV, we compared these three applications. As many methods for SER derive from general pattern recognition methods developed in other application areas, we observed how speech is very different from other signals to be recognized, and how SER objectives differ from those of other speech applications. We noted that most SER trials have used closed sets of target emotions, which does not reflect applications in practice.
Current trends for neural processing have dominated SER (as well as many other applications). These trends will persist, as ANNs have great power as nonlinear processors that can learn automatically. Research in artificial intelligence no longer uses earlier expert-system approaches for many practical tasks, instead using systems based on automatic machine learning. Nonetheless, there are many cases of ANNs that effectively learn features that humans employ in their pattern recognition.
Most neural SER networks are huge, using many millions of parameters, and rely heavily on computational resources (and thus cannot run on devices with few resources). Such models are also constrained by their requirements for large amounts of training data and processing power, and by their use of very simple loss functions and gradient training. SER methods derive mostly from algorithms developed for other applications (typically image processing). It may well be useful to focus ANNs instead on acoustic features that are both specific to speech and helpful to distinguish how emotions differ, which is why we focused on such details, rather than simply describing SER methods.
The recent trend towards E2E methods exploits a search for a simple, uniform, global model that does not use any human analysis, assuming that any analysis prior to automatic machine learning risks suboptimal decisions. Yet, the typical training method of gradient descent is clearly suboptimal to handle data patterns that are highly complex. Speech is highly encoded, as noted in this paper, with correlations spread non-uniformly and widely across time and frequency. The attention methods of correlation help, but current methods are too simple to exploit efficiently the information that listeners employ for speech perception tasks such as SER.

9. Conclusions

Recent SERs tend to use large, pre-trained Transformer models [129], but their ability to generalize and be robust in training a single model using diverse sets of databases is less clear. They tend to need significant domain and corpus adaptation. How to generalize fine-tuned self-supervised SER representations across different domains is important, especially when testing in different conditions.
Most recent SER work has ignored earlier research about the ways speakers show emotion in speech, as well as how listeners perceive emotion. That research noted the importance of intonational cues, particularly F0 and speaking rate, which were commonly exploited in earlier SER. However, fitting intonation into neural SER models has been difficult, as current SER models prefer general use of spectrograms or SSL analysis. Since it is clear that emotional aspects of speech use intonation, although in complex ways, it may be better to feed ANNs with acoustic features that are carefully crafted, rather than with raw speech waveform samples or with elementary spectral measures (FBEs or MFCCs). These were used with the LLDs noted earlier, but have been little used for SER in the last few years.
At the least, one should likely estimate simple intonation measures that are known to correlate well with emotion, such as F0 statistics (frame-based mean and variance) and speaking rate, and include them explicitly alongside other spectral features. Current SER systems rely too heavily on very basic acoustic measures such as waveforms or mel-spectrograms, or on features learned implicitly from audio by SSL models such as HuBERT and wav2vec 2.0. Such general methods have been attractive for speaker-characteristic tasks such as ASV and SER, where the direct phonetic links that are useful for ASR (VT shapes identifying phonemes) are not readily known. However, the direct links between emotion and intonation, which are distributed in complex ways across time, are difficult to abstract via general neural methods.
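As a minimal sketch of what such explicit intonation measures could look like alongside conventional spectral features (assuming the librosa and numpy packages; the pyin pitch tracker, the onset-rate proxy for speaking rate, and the simple utterance statistics are illustrative choices, not the method of any cited system):

```python
import numpy as np
import librosa

def intonation_plus_spectral(path: str) -> np.ndarray:
    """Concatenate crude intonation statistics with utterance-mean MFCCs."""
    y, sr = librosa.load(path, sr=16000)

    # F0 track; pyin marks unvoiced frames as NaN.
    f0, _, _ = librosa.pyin(y,
                            fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"),
                            sr=sr)
    f0_voiced = f0[~np.isnan(f0)]
    f0_mean = float(np.mean(f0_voiced)) if f0_voiced.size else 0.0
    f0_var = float(np.var(f0_voiced)) if f0_voiced.size else 0.0

    # Very rough speaking-rate proxy: acoustic onsets per second.
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    rate = len(onsets) / (len(y) / sr)

    # Conventional spectral summary: utterance-mean MFCCs.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

    return np.concatenate([[f0_mean, f0_var, rate], mfcc])
```

Such a vector could then be appended to the spectrogram or SSL inputs of a neural model, giving the kind of hybrid input argued for above.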

Funding

This research was funded by NSERC grant number 142610.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflicts of interest.

List of Acronyms

ANN: artificial neural network
AR: autoregressive
ASR: automatic speech recognition
ASV: automatic speaker verification
CNN: convolutional neural network
DFT: discrete Fourier transform
E2E: end-to-end machine learning system
F0: fundamental frequency of vocal cord vibration
FBEs: filter bank energies
GMM: Gaussian mixture model
HMM: hidden Markov model
HNR: harmonics-to-noise ratio
LLDs: low-level descriptors
LPC: linear predictive coding
MFCCs: mel frequency cepstral coefficients
NAQ: normalized amplitude quotient
RNN: recurrent neural network
SER: spoken emotion recognition
SSL: self-supervised learning
VT: vocal tract

References

  1. Bond, C.F., Jr.; DePaulo, B.M. Accuracy of deception judgments. Personal. Soc. Psychol. Rev. 2006, 10, 214–234. [Google Scholar] [CrossRef] [PubMed]
  2. Cowie, R.; Douglas-Cowie, E.; Tsapatsoulis, N.; Votsis, G.; Kollias, S.; Fellenz, W.; Taylor, J.G. Emotion recognition in human-computer interaction. IEEE Signal Process. Mag. 2001, 18, 32–80. [Google Scholar] [CrossRef]
  3. Ververidis, D.; Kotropoulos, C. Emotional speech recognition: Resources, features, and methods. Speech Commun. 2006, 48, 1162–1181. [Google Scholar] [CrossRef]
  4. El Ayadi, M.; Kamel, M.S.; Karray, F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognit. 2011, 44, 572–587. [Google Scholar] [CrossRef]
  5. Koolagudi, S.G.; Rao, K.S. Emotion recognition from speech: A review. Int. J. Speech Technol. 2012, 15, 99–117. [Google Scholar] [CrossRef]
  6. Khalil, R.A.; Jones, E.; Babar, M.I.; Jan, T.; Zafar, M.H.; Alhussain, T. Speech emotion recognition using deep learning techniques: A review. IEEE Access 2019, 7, 117327–117345. [Google Scholar] [CrossRef]
  7. Schuller, B. Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends. Commun. ACM 2018, 61, 90–99. [Google Scholar] [CrossRef]
  8. Scherer, K.R. Vocal communication of emotion: A review of research paradigms. Speech Commun. 2003, 40, 227–256. [Google Scholar] [CrossRef]
  9. Wang, Y.; Song, W.; Tao, W.; Liotta, A.; Yang, D.; Li, X.; Gao, S.; Sun, Y.; Ge, W.; Zhang, W.; et al. A systematic review on affective computing: Emotion models, databases, and recent advances. Inf. Fusion 2022, 83, 19–52. [Google Scholar] [CrossRef]
  10. Akçay, M.B.; Oğuz, K. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 2020, 116, 56–76. [Google Scholar] [CrossRef]
  11. Jahangir, R.; Teh, Y.W.; Hanif, F.; Mujtaba, G. Deep learning approaches for speech emotion recognition: State of the art and research challenges. Multimed. Tools Appl. 2021, 80, 23745–23812. [Google Scholar] [CrossRef]
  12. Mower, E.; Metallinou, A.; Lee, C.C.; Kazemzadeh, A.; Busso, C.; Lee, S.; Narayanan, S. Interpreting ambiguous emotional expressions. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction and Workshops, Amsterdam, The Netherlands, 10–12 September 2009; pp. 1–8. [Google Scholar]
  13. Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 2019, 47, 312–323. [Google Scholar]
  14. Devillers, L.; Vidrascu, L.; Lamel, L. Challenges in real-life emotion annotation and machine learning based detection. Neural Netw. 2005, 18, 407–422. [Google Scholar] [CrossRef]
  15. Schuller, B.; Reiter, S.; Muller, R.; Al-Hames, M.; Lang, M.; Rigoll, G. Speaker independent speech emotion recognition by ensemble classification. In Proceedings of the IEEE International Conference on Multimedia and Expo, Amsterdam, The Netherlands, 6–8 July 2005; pp. 864–867. [Google Scholar]
  16. Antoniou, N.; Katsamanis, A.; Giannakopoulos, T.; Narayanan, S. Designing and Evaluating Speech Emotion Recognition Systems: A reality check case study with IEMOCAP. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar]
  17. Russell, J.A. A circumplex model of affect. J. Personal. Soc. Psychol. 1980, 39, 1161. [Google Scholar] [CrossRef]
  18. Tracy, J.L.; Randles, D. Four models of basic emotions: A review of Ekman and Cordaro, Izard, Levenson, and Panksepp and Watt. Emot. Rev. 2011, 3, 397–405. [Google Scholar] [CrossRef]
  19. Plutchik, R. The Emotions; University Press of America: Lanham, MD, USA, 1991. [Google Scholar]
  20. Ortony, A.; Clore, G.L.; Collins, A. The Cognitive Structure of Emotions; Cambridge University Press: Cambridge, UK, 2022. [Google Scholar]
  21. Leventhal, H. Toward a comprehensive theory of emotion. Adv. Exp. Soc. Psychol. 1980, 13, 139–207. [Google Scholar]
  22. Plutchik, R.; Kellerman, H. (Eds.) Theories of Emotion; Academic Press: Cambridge, MA, USA, 2013. [Google Scholar]
  23. Banse, R.; Scherer, K.R. Acoustic profiles in vocal emotion expression. J. Personal. Soc. Psychol. 1996, 70, 614. [Google Scholar] [CrossRef]
  24. Williams, C.E.; Stevens, K.N. Emotions and speech: Some acoustical correlates. J. Acoust. Soc. Am. 1972, 52, 1238–1250. [Google Scholar] [CrossRef]
  25. Jin, Q.; Li, C.; Chen, S.; Wu, H. Speech emotion recognition with acoustic and lexical features. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; IEEE: New York, NY, USA, 2015; pp. 4749–4753. [Google Scholar]
  26. Lee, C.M.; Yildirim, S.; Bulut, M.; Kazemzadeh, A.; Busso, C.; Deng, Z.; Lee, S.; Narayanan, S. Emotion recognition based on phoneme classes. In Proceedings of the International Conference on Spoken Language Processing, Jeju Island, Republic of Korea, 4–8 October 2004. [Google Scholar]
  27. Hsu, J.H.; Su, M.H.; Wu, C.H.; Chen, Y.H. Speech emotion recognition considering nonverbal vocalization in affective conversations. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1675–1686. [Google Scholar] [CrossRef]
  28. Lian, Z.; Liu, B.; Tao, J. CTNet: Conversational transformer network for emotion recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 985–1000. [Google Scholar] [CrossRef]
  29. Deng, J.; Ren, F. A survey of textual emotion recognition and its challenges. IEEE Trans. Affect. Comput. 2021, 14, 49–67. [Google Scholar] [CrossRef]
  30. Xu, H.; Liu, B.; Shu, L.; Yu, P.S. BERT post-training for review reading comprehension and aspect-based sentiment analysis. arXiv 2019, arXiv:1904.02232, 2324–2335. [Google Scholar]
  31. Murray, I.R.; Arnott, J.L. Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. J. Acoust. Soc. Am. 1993, 93, 1097–1108. [Google Scholar] [CrossRef] [PubMed]
  32. Batliner, A.; Fischer, K.; Huber, R.; Spilker, J.; Nöth, E. How to find trouble in communication. Speech Commun. 2003, 40, 117–143. [Google Scholar] [CrossRef]
  33. Nwe, T.L.; Foo, S.W.; De Silva, L.C. Speech emotion recognition using hidden Markov models. Speech Commun. 2003, 41, 603–623. [Google Scholar] [CrossRef]
  34. Burkhardt, F.; Sendlmeier, W.F. Verification of acoustical correlates of emotional speech using formant-synthesis. In Proceedings of the ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, Northern Ireland, UK, 5–7 September 2000. [Google Scholar]
  35. Suni, A.; Šimko, J.; Aalto, D.; Vainio, M. Hierarchical representation and estimation of prosody using continuous wavelet transform. Comput. Speech Lang. 2017, 45, 123–136. [Google Scholar] [CrossRef]
  36. Spanias, A.S. Speech coding: A tutorial review. Proc. IEEE 1994, 82, 1541–1582. [Google Scholar] [CrossRef]
  37. Valstar, M.; Gratch, J.; Schuller, B.; Ringeval, F.; Lalanne, D.; Torres Torres, M.; Scherer, S.; Stratou, G.; Cowie, R.; Pantic, M. Avec 2016: Depression, mood, and emotion recognition workshop and challenge. In Proceedings of the International Workshop on Audio/Visual Emotion Challenge, Amsterdam, The Netherlands, 16 October 2016; pp. 3–10. [Google Scholar]
  38. Lim, N. Cultural differences in emotion: Differences in emotional arousal level between the East and the West. Integr. Med. Res. 2016, 5, 105–109. [Google Scholar] [CrossRef]
  39. Castaldo, F.; Colibro, D.; Dalmasso, E.; Laface, P.; Vair, C. Compensation of nuisance factors for speaker and language recognition. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 1969–1978. [Google Scholar] [CrossRef]
  40. Anagnostopoulos, C.N.; Iliou, T.; Giannoukos, I. Features and classifiers for emotion recognition from speech: A survey from 2000 to 2011. Artif. Intell. Rev. 2015, 43, 155–177. [Google Scholar] [CrossRef]
  41. Lee, C.M.; Narayanan, S.S. Toward detecting emotions in spoken dialogs. IEEE Trans. Speech Audio Process. 2005, 13, 293–303. [Google Scholar]
  42. Lugger, M.; Yang, B. The relevance of voice quality features in speaker independent emotion recognition. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, HI, USA, 15–20 April 2007; IEEE: New York, NY, USA, 2007; pp. IV-17–IV-20. [Google Scholar]
  43. Sun, R.; Moore, E.; Torres, J.F. Investigating glottal parameters for differentiating emotional categories with similar prosodics. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 19–24 April 2009; IEEE: New York, NY, USA, 2009; pp. 4509–4512. [Google Scholar]
  44. Sundberg, J.; Patel, S.; Bjorkner, E.; Scherer, K.R. Interdependencies among voice source parameters in emotional speech. IEEE Trans. Affect. Comput. 2011, 2, 162–174. [Google Scholar] [CrossRef]
  45. Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366. [Google Scholar] [CrossRef]
  46. Makhoul, J. Spectral analysis of speech by linear prediction. IEEE Trans. Audio Electroacoust. 1973, 21, 140–148. [Google Scholar] [CrossRef]
  47. Yu, D.; Deng, L. Automatic Speech Recognition; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  48. Busso, C.; Deng, Z.; Yildirim, S.; Bulut, M.; Lee, C.M.; Kazemzadeh, A.; Lee, S.; Neumann, U.; Narayanan, S. Analysis of emotion recognition using facial expressions, speech and multimodal information. In Proceedings of the International Conference on Multimodal Interfaces, State College, PA, USA, 13–15 October 2004; pp. 205–211. [Google Scholar]
  49. Dellaert, F.; Polzin, T.; Waibel, A. Recognizing emotion in speech. ICSLP 1996, 3, 1970–1973. [Google Scholar]
  50. Rabiner, L.; Cheng, M.; Rosenberg, A.; McGonegal, C. A comparative performance study of several pitch detection algorithms. IEEE Trans. Acoust. Speech Signal Process. 1976, 24, 399–418. [Google Scholar] [CrossRef]
  51. Wang, K.; An, N.; Li, B.N.; Zhang, Y.; Li, L. Speech emotion recognition using Fourier parameters. IEEE Trans. Affect. Comput. 2015, 6, 69–75. [Google Scholar] [CrossRef]
  52. de Lope, J.; Graña, M. An ongoing review of speech emotion recognition. Neurocomputing 2023, 528, 1–11. [Google Scholar] [CrossRef]
  53. Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Schuller, B. Self supervised adversarial domain adaptation for cross-corpus and cross-language speech emotion recognition. IEEE Trans. Affect. Comput. 2022, 14, 1912–1926. [Google Scholar] [CrossRef]
  54. Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; IEEE: New York, NY, USA, 2016; pp. 5200–5204. [Google Scholar]
  55. Yenigalla, P.; Kumar, A.; Tripathi, S.; Singh, C.; Kar, S.; Vepa, J. Speech Emotion Recognition Using Spectrogram & Phoneme Embedding. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3688–3692. [Google Scholar]
  56. Pepino, L.; Riera, P.; Ferrer, L. Emotion recognition from speech using wav2vec 2.0 embeddings. arXiv 2021, arXiv:2104.03502, 3400–3404. [Google Scholar]
  57. Schuller, B.; Rigoll, G.; Lang, M. Hidden Markov model-based speech emotion recognition. In Proceedings of the 2003 International Conference on Multimedia and Expo. ICME ’03. Proceedings (Cat. No.03TH8698), Baltimore, MD, USA, 6–9 July 2003; IEEE: New York, NY, USA, 2003; p. II-1. [Google Scholar]
  58. Lin, Y.L.; Wei, G. Speech emotion recognition based on HMM and SVM. Int. Conf. Mach. Learn. Cybern. 2005, 4898–4901. [Google Scholar]
  59. Schölkopf, B.; Smola, A. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, USA, 2002. [Google Scholar]
  60. Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
  61. Zong, Y.; Zheng, W.; Zhang, T.; Huang, X. Cross-corpus speech emotion recognition based on domain-adaptive least-squares regression. IEEE Signal Process. Lett. 2016, 23, 585–589. [Google Scholar] [CrossRef]
  62. Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 2015, 7, 190–202. [Google Scholar] [CrossRef]
  63. Eyben, F.; Wöllmer, M.; Schuller, B. OpenSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the ACM International Conference on Multimedia, Firenze, Italy, 29 October 2010; pp. 1459–1462. [Google Scholar]
  64. Leem, S.G.; Fulford, D.; Onnela, J.P.; Gard, D.; Busso, C. Not all features are equal: Selection of robust features for speech emotion recognition in noisy environments. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 6447–6451. [Google Scholar]
  65. Hautamäki, R.G.; Kinnunen, T.; Hautamäki, V.; Laukkanen, A.M. Automatic versus human speaker verification: The case of voice mimicry. Speech Commun. 2015, 72, 13–31. [Google Scholar] [CrossRef]
  66. Pappagari, R.; Villalba, J.; Żelasko, P.; Moro-Velazquez, L.; Dehak, N. Copypaste: An augmentation method for speech emotion recognition. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 6324–6328. [Google Scholar]
  67. Wang, J.; Xue, M.; Culhane, R.; Diao, E.; Ding, J.; Tarokh, V. Speech emotion recognition with dual-sequence LSTM architecture. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: New York, NY, USA, 2020; pp. 6474–6478. [Google Scholar]
  68. Lawrence, I.; Lin, K. A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989, 45, 255–268. [Google Scholar]
  69. Schuller, B.; Steidl, S.; Batliner, A. The Interspeech 2009 emotion challenge. In Proceedings of the Interspeech 2009, Brighton, UK, 6–10 September 2009; pp. 312–315. [Google Scholar]
  70. Kim, J.; Englebienne, G.; Truong, K.P.; Evers, V. Towards speech emotion recognition “in the wild” using aggregated corpora and deep multi-task learning. arXiv 2017, arXiv:1708.03920. [Google Scholar]
  71. Zhou, Y.; Liang, X.; Gu, Y.; Yin, Y.; Yao, L. Multi-classifier interactive learning for ambiguous speech emotion recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 695–705. [Google Scholar] [CrossRef]
  72. Lotfian, R.; Busso, C. Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Trans. Affect. Comput. 2017, 10, 471–483. [Google Scholar] [CrossRef]
  73. Busso, C.; Bulut, M.; Lee, C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.; Lee, S.; Narayanan, S. IEMOCAP: Interactive emotional dyadic motion capture database. J. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
  74. Busso, C.; Parthasarathy, S.; Burmania, A.; AbdelWahab, M.; Sadoughi, N.; Provost, E.M. MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception. IEEE Trans. Affect. Comput. 2016, 8, 67–80. [Google Scholar] [CrossRef]
  75. Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R. CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 2014, 5, 377–390. [Google Scholar] [CrossRef] [PubMed]
  76. Costantini, G.; Iaderola, I.; Paoloni, A.; Todisco, M. EMOVO corpus: An Italian emotional speech database. In Proceedings of the International Conference on Language Resources and Evaluation, Reykjavik, Iceland, 26–31 May 2014; pp. 3501–3504. [Google Scholar]
  77. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the Interspeech 2005, Lisbon, Portugal, 4–8 September 2005; pp. 1517–1520. [Google Scholar]
  78. Martin, O.; Kotsia, I.; Macq, B.; Pitas, I. The eNTERFACE’05 audio-visual emotion database. In Proceedings of the International Conference on Data Engineering Workshops, Helsinki, Finland, 20 May 2016; p. 8. [Google Scholar]
  79. Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef] [PubMed]
  80. Grimm, M.; Kroschel, K.; Narayanan, S. The Vera am Mittag German audio-visual emotional speech database. In Proceedings of the IEEE International Conference on Multimedia and Expo, Hannover, Germany, 23–26 June 2008; pp. 865–868. [Google Scholar]
  81. Wu, T.; Yang, Y.; Wu, Z.; Li, D. MASC: A speech corpus in Mandarin for emotion analysis and affective speaker recognition. In Proceedings of the IEEE Odyssey—The Speaker and Language Recognition Workshop, San Juan, Puerto Rico, 28–30 June 2006; pp. 1–5. [Google Scholar]
  82. Tickle, A. English and Japanese speakers’ emotion vocalisation and recognition: A comparison highlighting vowel quality. In Proceedings of the ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, Northern Ireland, UK, 5–7 September 2000. [Google Scholar]
  83. Batliner, A.; Steidl, S.; Nöth, E. Releasing a thoroughly annotated and processed spontaneous emotional database: The FAU Aibo Emotion Corpus. In Proceedings of the Satellite Workshop of LREC 2008 on Corpora for Research on Emotion and Affect, Marrakech, Morocco, 26 May 2008. [Google Scholar]
  84. Garcia-Cuesta, E.; Salvador, A.B.; Pãez, D.G. EmoMatchSpanishDB: Study of speech emotion recognition machine learning models in a new Spanish elicited database. Multimed. Tools Appl. 2024, 83, 13093–13112. [Google Scholar] [CrossRef]
  85. Makarova, V.; Petrushin, V.A. RUSLANA: A database of Russian emotional utterances. In Proceedings of the 7th International Conference on Spoken Language Processing, Denver, CO, USA, 16–20 September 2002; pp. 2041–2044. [Google Scholar]
  86. Parry, J.; Palaz, D.; Clarke, G.; Lecomte, P.; Mead, R.; Berger, M.; Hofer, G. Analysis of Deep Learning Architectures for Cross-Corpus Speech Emotion Recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1656–1660. [Google Scholar]
  87. Schuller, B.; Vlasenko, B.; Eyben, F.; Wöllmer, M.; Stuhlsatz, A.; Wendemuth, A.; Rigoll, G. Cross-corpus acoustic emotion recognition: Variances and strategies. IEEE Trans. Affect. Comput. 2010, 1, 119–131. [Google Scholar] [CrossRef]
  88. Cai, X.; Yuan, J.; Zheng, R.; Huang, L.; Church, K. Speech emotion recognition with multi-task learning. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 4508–4512. [Google Scholar]
  89. Chen, L.W.; Rudnicky, A. Exploring wav2vec 2.0 fine tuning for improved speech emotion recognition. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar]
  90. Wagner, J.; Triantafyllopoulos, A.; Wierstorf, H.; Schmitt, M.; Burkhardt, F.; Eyben, F.; Schuller, B.W. Dawn of the transformer era in speech emotion recognition: Closing the valence gap. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10745–10759. [Google Scholar] [CrossRef]
  91. Feng, T.; Narayanan, S. PEFT-SER: On the use of parameter efficient transfer learning approaches for speech emotion recognition using pre-trained speech models. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction, Cambridge, MA, USA, 10–13 September 2023; pp. 1–8. [Google Scholar]
  92. Gerczuk, M.; Amiriparian, S.; Ottl, S.; Schuller, B.W. Emonet: A transfer learning framework for multi-corpus speech emotion recognition. IEEE Trans. Affect. Comput. 2021, 14, 1472–1487. [Google Scholar] [CrossRef]
  93. He, S.; Zheng, X.; Zeng, D.; Luo, C.; Zhang, Z. Exploring entrainment patterns of human emotion in social media. PLoS ONE 2016, 11, e0150630. [Google Scholar] [CrossRef]
  94. Han, K.; Yu, D.; Tashev, I. Speech emotion recognition using deep neural network and extreme learning machine. In Proceedings of the Interspeech 2014, Singapore, Singapore, 14–18 September 2014; pp. 223–227. [Google Scholar]
  95. Satt, A.; Rozenberg, S.; Hoory, R. Efficient emotion recognition from speech using deep learning on spectrograms. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 1089–1093. [Google Scholar]
  96. Chen, L.; Mao, X.; Xue, Y.; Cheng, L.L. Speech emotion recognition: Features and classification models. Digit. Signal Process. 2012, 22, 1154–1160. [Google Scholar] [CrossRef]
  97. Fayek, H.M.; Lech, M.; Cavedon, L. Evaluating deep learning architectures for speech emotion recognition. Neural Netw. 2017, 92, 60–68. [Google Scholar] [CrossRef]
  98. Bartz, C.; Herold, T.; Yang, H.; Meinel, C. Language identification using deep convolutional recurrent neural networks. In Proceedings of the Neural Information Processing Conference, Long Beach, CA, USA, 4–9 December 2017; pp. 880–889. [Google Scholar]
  99. Issa, D.; Demirci, M.F.; Yazici, A. Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process. Control 2020, 59, 101894. [Google Scholar] [CrossRef]
  100. Aftab, A.; Morsali, A.; Ghaemmaghami, S.; Champagne, B. LIGHT-SERNET: A lightweight fully convolutional neural network for speech emotion recognition. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 6912–6916. [Google Scholar]
  101. Aldeneh, Z.; Provost, E.M. Using regional saliency for speech emotion recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; IEEE: New York, NY, USA, 2017; pp. 2741–2745. [Google Scholar]
  102. Li, Y.; Zhao, T.; Kawahara, T. Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2803–2807. [Google Scholar]
  103. Majumder, N.; Poria, S.; Hazarika, D.; Mihalcea, R.; Gelbukh, A.; Cambria, E. DialogueRNN: An attentive RNN for emotion detection in conversations. AAAI Conf. Artif. Intell. 2019, 33, 6818–6825. [Google Scholar] [CrossRef]
  104. Zazo, R.; Lozano-Diez, A.; Gonzalez-Dominguez, J.; Toledano, D.T.; Gonzalez-Rodriguez, J. Language identification in short utterances using long short-term memory (LSTM) recurrent neural networks. PLoS ONE 2016, 11, e0146917. [Google Scholar] [CrossRef] [PubMed]
  105. Lee, J.; Tashev, I. High-level feature representation using recurrent neural network for speech emotion recognition. In Proceedings of the Interspeech 2015, Dresden, Germany, 6–10 September 2015. [Google Scholar]
  106. Rajamani, S.T.; Rajamani, K.T.; Mallol-Ragolta, A.; Liu, S.; Schuller, B. A novel attention-based gated recurrent unit and its efficacy in speech emotion recognition. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 6294–6298. [Google Scholar]
  107. Peng, Z.; Lu, Y.; Pan, S.; Liu, Y. Efficient speech emotion recognition using multi-scale CNN and attention. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 3020–3024. [Google Scholar]
  108. Liu, J.; Liu, Z.; Wang, L.; Guo, L.; Dang, J. Speech emotion recognition with local-global aware deep representation learning. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: New York, NY, USA, 2020; pp. 7174–7178. [Google Scholar]
  109. Xu, M.; Zhang, F.; Cui, X.; Zhang, W. Speech emotion recognition with multiscale area attention and data augmentation. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 6319–6323. [Google Scholar]
  110. Zhu, W.; Li, X. Speech emotion recognition with global-aware fusion on multi-scale feature representation. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 6437–6441. [Google Scholar]
  111. Cai, W.; Cai, D.; Huang, S.; Li, M. Utterance-level end-to-end language identification using attention-based CNN-BLSTM. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: New York, NY, USA, 2019; pp. 5991–5995. [Google Scholar]
  112. Zhao, S.; Gholaminejad, A.; Ding, G.; Gao, Y.; Han, J.; Keutzer, K. Personalized emotion recognition by personality-aware high-order learning of physiological signals. ACM Trans. Multimed. Comput. Commun. Appl. 2019, 15, 1–18. [Google Scholar] [CrossRef]
  113. Tarantino, L.; Garner, P.N.; Lazaridis, A. Self-Attention for Speech Emotion Recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2578–2582. [Google Scholar]
  114. Zou, H.; Si, Y.; Chen, C.; Rajan, D.; Chng, E.S. Speech emotion recognition with co-attention based multi-level acoustic information. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 7367–7371. [Google Scholar]
  115. Morais, E.; Hoory, R.; Zhu, W.; Gat, I.; Damasceno, M.; Aronowitz, H. Speech emotion recognition using self-supervised features. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 6922–6926. [Google Scholar]
  116. Villalba, J.; Brümmer, N.; Dehak, N. End-to-end versus embedding neural networks for language recognition in mismatched conditions. In Proceedings of the Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, 26–29 June 2018. [Google Scholar]
  117. Chatziagapi, A.; Paraskevopoulos, G.; Sgouropoulos, D.; Pantazopoulos, G.; Nikandrou, M.; Giannakopoulos, T.; Katsamanis, A.; Potamianos, A.; Narayanan, S. Data Augmentation Using GANs for Speech Emotion Recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 171–175. [Google Scholar]
  118. Tiwari, U.; Soni, M.; Chakraborty, R.; Panda, A.; Kopparapu, S.K. Multi-conditioning and data augmentation using generative noise model for speech emotion recognition in noisy conditions. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: New York, NY, USA, 2020; pp. 7194–7198. [Google Scholar]
  119. Chang, J.; Scherer, S. Learning representations of emotional speech with deep convolutional generative adversarial networks. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; IEEE: New York, NY, USA, 2017; pp. 2746–2750. [Google Scholar]
  120. Su, B.H.; Lee, C.C. A conditional cycle emotion gan for cross corpus speech emotion recognition. In Proceedings of the IEEE Spoken Language Technology Workshop, Shenzhen, China, 19–22 January 2021; pp. 351–357. [Google Scholar]
  121. Malik, I.; Latif, S.; Jurdak, R.; Schuller, B. A preliminary study on augmenting speech emotion recognition using a diffusion model. In Proceedings of the Interspeech 2023, Dublin, Ireland, 20–24 August 2023. [Google Scholar]
  122. Peng, L.; Zhang, Z.; Pang, T.; Han, J.; Zhao, H.; Chen, H.; Schuller, B.W. Customising General Large Language Models for Specialised Emotion Recognition Tasks. In Proceedings of the ICASSP 2024–2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: New York, NY, USA, 2024; pp. 11326–11330. [Google Scholar]
  123. Ling, S.; Salazar, J.; Liu, Y.; Kirchhoff, K. BERTPHONE: Phonetically-aware encoder representations for utterance-level speaker and language recognition. In Proceedings of the Speaker and Language Recognition Workshop (Odyssey 2020), Tokyo, Japan, 1–5 November 2020; pp. 9–16. [Google Scholar]
  124. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
  125. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
  126. Fan, Z.; Li, M.; Zhou, S.; Xu, B. Exploring wav2vec 2.0 on speaker verification and language identification. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 1509–1513. [Google Scholar]
  127. Latif, S.; Rana, R.; Younis, S.; Qadir, J.; Epps, J. Transfer learning for improving speech emotion classification accuracy. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 257–261. [Google Scholar]
  128. Wu, W.; Zhang, C.; Woodland, P.C. Emotion recognition by fusing time synchronous and time asynchronous representations. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 6269–6273. [Google Scholar]
  129. Chen, W.; Xing, X.; Xu, X.; Yang, J.; Pang, J. Key-sparse transformer for multimodal speech emotion recognition. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 6897–6901. [Google Scholar]
Table 1. Common datasets for SER.

| Name | Language | Categories | Comments | Citation |
| --- | --- | --- | --- | --- |
| MSP-PODCAST | English | 8 | 100 h | [72] |
| IEMOCAP | English | 9 | 12 h | [16,73] |
| MSP-IMPROV | English | 4 | 9 h | [74] |
| CREMA-D | English | 6 | 7442 clips | [75] |
| EMOVO | Italian | 6 | 6 actors, 14 sentences | [76] |
| EmoDB | German | 7 | 10 actors, 10 sentences | [77] |
| eNTERFACE | English | 6 | 42 subjects | [78] |
| RAVDESS | English | 7 | 24 actors | [79] |
| Vera am Mittag | German | Large range | 12 h; spontaneous | [80] |
| MASC | Mandarin | 5 | 68 actors | [81] |
|  | Japanese/English | 5 | 90 sentences | [82] |
| FAU Aibo Emotion Corpus | German | 12 | spontaneous from children; 9 h | [83] |
| EmoMatchSpanishDB | Spanish | 6 | 50 subjects | [84] |
| RUSLANA | Russian | 6 | 61 subjects | [85] |
Table 2. Major ANN components.

| Architecture | Pluses | Minuses |
| --- | --- | --- |
| Fully connected feedforward (FFNN) | Simple structure | Excessive number of parameters |
| Convolutional neural network (CNN) | Smooths local data | Only operates locally |
| Recurrent neural network (RNN) | Exploits distant data | Many feedback parameters |
| Attention | Emphasis on correlations | Ignores time ordering |
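As a rough, hypothetical illustration of how the components in Table 2 are often combined for SER (a sketch assuming PyTorch; the layer sizes and the simple additive attention pooling are arbitrary choices, not drawn from any cited model), a convolutional layer smooths local time-frequency detail, a recurrent layer carries longer-range context, and attention weights pool the frames into an utterance-level decision:

```python
import torch
import torch.nn as nn

class TinySER(nn.Module):
    """CNN (local smoothing) -> GRU (distant context) -> attention pooling."""
    def __init__(self, n_mels: int = 80, hidden: int = 128, n_emotions: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.attn_score = nn.Linear(hidden, 1)  # one scalar weight per frame
        self.out = nn.Linear(hidden, n_emotions)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames)
        h = torch.relu(self.conv(mel)).transpose(1, 2)  # (batch, frames, hidden)
        h, _ = self.rnn(h)                              # longer-range context
        w = torch.softmax(self.attn_score(h), dim=1)    # attention over frames
        pooled = (w * h).sum(dim=1)                     # weighted frame average
        return self.out(pooled)                         # emotion logits

logits = TinySER()(torch.randn(2, 80, 300))  # two 3 s utterances of mel frames
```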