Article

Review and Comparative Analysis of Databases for Speech Emotion Recognition

1 Laboratory of Digital Signal Processing, Department of Engineering, University of Messina, 98122 Messina, Italy
2 National Inter-University Consortium for Telecommunications, Research Unit at University of Messina, 98122 Messina, Italy
3 Dipartimento di Scienze Politiche e Giuridiche, University of Messina, 98122 Messina, Italy
4 Department of Biomedical and Dental Sciences and Morphofunctional Imaging, University of Messina, Via Consolare Valeria, 1, 98125 Messina, Italy
5 Department of Electrical and Electronic Engineering, University of Cagliari, Via Marengo, 2, 09123 Cagliari, Italy
6 National Inter-University Consortium for Telecommunications, Research Unit at University of Cagliari, Via Marengo, 2, 09123 Cagliari, Italy
* Author to whom correspondence should be addressed.
Data 2025, 10(10), 164; https://doi.org/10.3390/data10100164
Submission received: 5 September 2025 / Revised: 3 October 2025 / Accepted: 10 October 2025 / Published: 14 October 2025

Abstract

Speech emotion recognition (SER) has become increasingly important in areas such as healthcare, customer service, robotics, and human–computer interaction. The progress of this field depends not only on advances in algorithms but also on the databases that provide the training material for SER systems. These resources set the boundaries for how well models can generalize across speakers, contexts, and cultures. In this paper, we present a narrative review and comparative analysis of emotional speech corpora released up to mid-2025, bringing together both psychological and technical perspectives. Rather than following a systematic review protocol, our approach focuses on providing a critical synthesis of more than fifty corpora covering acted, elicited, and natural speech. We examine how these databases were collected, how emotions were annotated, their demographic diversity, and their ecological validity, while also acknowledging the limits of available documentation. Beyond description, we identify recurring strengths and weaknesses, highlight emerging gaps, and discuss recent usage patterns to offer researchers both a practical guide for dataset selection and a critical perspective on how corpus design continues to shape the development of robust and generalizable SER systems.

1. Introduction

Human beings express emotions through multiple channels, the most common being facial expressions, speech, body language, gestures, and writing [1]. In recent years, several systems have been developed to identify emotions from speech signals. Classical automatic speech recognition systems, however, have paid little attention to the paralinguistic content of speech, such as gender, emotion, and personality, even though this information is essential for effective communication and mutual understanding. Speech emotion recognition (SER) has therefore attracted growing interest as a means of investigating emotional states through speech signals. The ability to recognize emotions has the potential to significantly enhance human–computer interaction (HCI), making it more intuitive and responsive to users’ emotional states [2]. SER has diverse applications across various domains, including robotics, mobile services, and call centers. For example, in [3], the authors propose a new SER model based on deep neural networks to improve human–robot interaction. In [4], the authors developed a smartphone application powered by cloud computing that identifies emotions in real time using a standard speech corpus. Furthermore, in [5], the authors propose a call redistribution method based on SER that achieves a remarkable reduction in waiting time for the most urgent callers.
More recently, the applications of SER have expanded across healthcare, customer service, HCI and psychological assessment [6,7]. For instance, the integration of SER into smart home technologies and mental health monitoring systems highlights its potential in monitoring emotional well-being, particularly in personalized healthcare contexts [8]. Since affective disorders are associated with specific voice patterns, in [9], the authors explored a generalizable approach to evaluate clinical depression and remission from voice using transfer learning. They found that the SER model was able to accurately identify the presence of depression, providing a successful method for early diagnosis.
Despite its wide range of applications, SER remains a challenging task. Emotions vary not only between individuals but also across cultures and contexts, and their acoustic expression often overlaps across categories. Variability in recording conditions, imbalance in class distributions, and the difficulty of capturing spontaneous affect further complicate system design. Among these issues, the question of how emotions are represented and annotated in data is particularly critical: the lack of consensus on categorization schemes and the inherent subjectivity of labeling make it difficult to build consistent training corpora. Several key databases have therefore emerged in SER research, each addressing these challenges in different ways and serving distinct purposes depending on the type of emotional speech being studied.
Speech corpora used in SER are collected under various conditions, ranging from controlled laboratory settings with scripted speech to spontaneous, real-world interactions. The selection of speech types significantly impacts the training and performance of SER systems, as the variability in emotional expression across different contexts and datasets poses challenges in building models that generalize well [10,11]. Choosing the right dataset, therefore, plays a decisive role in the design and performance of SER systems, shaping their ability to generalize across speakers, maintain reliable annotations, and adapt to cultural differences. Although SER has been the subject of numerous surveys, most of them concentrate on features, models, and performance comparisons rather than the databases themselves. When datasets are mentioned, they are usually reduced to compact tables listing the language, number of speakers, or targeted emotions, and the most recent reviews include corpora only up to 2021 [12,13]. To the best of our knowledge, no dedicated review has yet examined emotional speech corpora in their own right, how they are collected, the contexts they represent, and the challenges they introduce for system design. This is the gap our paper aims to address.
In this work, we provide a broad and up-to-date comparative review of emotional speech databases released up to mid-2025, combining both technical and psychological perspectives. Rather than aiming for exhaustiveness, our selection balances historical “pillar” corpora (e.g., DES, SUSAS, EMO-DB), widely used benchmarks (e.g., IEMOCAP, AIBO), linguistically and culturally diverse resources (e.g., CASIA Mandarin, ITA-DB, INTERFACE), and more recent or domain-specific datasets, such as those built from stress speech or emergency calls. We also include innovative corpora that introduced new methodologies or collection processes, setting precedents for subsequent work in the field. The depth of description varies depending on the documentation available; some corpora, even important ones, are reported with limited details in their original papers, and this naturally constrains the coverage we can provide. Beyond description, we offer a critical perspective on recurring issues, including the trade-off between clarity in acted corpora and ecological validity in natural ones, inconsistencies in labeling and annotation models, and gaps in cultural and linguistic diversity. Finally, we complement this analysis with a synthesis of trends across databases and a discussion of the resources that are actually being used in recent SER studies. In doing so, we hope to provide researchers not only with a practical guide to available datasets but also with insights into how corpus design choices shape the robustness and generalizability of SER systems.
The remainder of this paper is organized as follows. Section 2 describes our review methodology, Section 3 provides an overview of SER systems, and Section 4 introduces the emotional models. In Section 5, we discuss the required characteristics of databases for SER, while Section 6 provides an extensive overview of the existing SER corpora available in the literature. Section 7 presents an in-depth analysis of the considered SER corpora, and finally, Section 8 concludes the paper.

2. Review Methodology

Our work adopts a narrative review approach, which differs from a systematic review in both scope and purpose. A systematic review (e.g., PRISMA-based [14]) is typically employed to answer a narrowly defined research question using formalized inclusion/exclusion criteria and replicable selection protocols [15]. In contrast, a narrative review is more appropriate when the objective is to provide a broad, critical synthesis across a heterogeneous field, as is the case with speech emotion recognition databases [16].
To conduct this review, we surveyed the literature on SER corpora up to mid-2025, collecting fifty-two databases across acted, elicited, and natural speech. The selection emphasizes representativeness and relevance: historical benchmarks, widely adopted resources, linguistically diverse corpora, and recent domain-specific or innovative datasets were all included. Where possible, we extracted comparable parameters (e.g., number of speakers, speech type, languages, annotation procedures, recording conditions) and reported them in Table 2. Additional details, such as corpus size in utterances or hours, are included in the individual corpus descriptions.
To move beyond an inventory, we also applied a comparative framework based on four cross-cutting dimensions: scope, contents, physical existence, and language composition. This structured lens allows readers to compare corpora in a consistent way while acknowledging differences in available documentation. Moreover, in Section 7 we provide a synthesis of patterns and trends, complemented by a simple quality index and statistical views. Although this approach is not a systematic review, it ensures transparency, reproducibility of descriptive metrics, and critical interpretation, consistent with the aims of a narrative review.

3. Speech Emotion Recognition Systems

Speech emotion recognition (SER) systems are designed to automatically detect emotions from speech signals by leveraging several key stages, as illustrated in Figure 1, each of which is crucial for accurate emotion classification. The first step involves choosing the appropriate emotional models to represent the emotions being classified. These can be broadly classified into categorical (or discrete) emotional models, which identify distinct emotions such as happiness, sadness, and anger, and dimensional models, which map emotions onto continuous dimensions like valence and arousal. Matveev et al. [17] emphasized the importance of selecting emotional models tailored to specific demographics, such as children, as their speech patterns differ from those of adults. Emotional models are discussed in Section 4.
Following model selection, SER systems need to rely on a variety of speech databases for training, which can be categorized into three main types: acted, where emotions are simulated by actors; natural, which captures spontaneous emotional expressions in real-world scenarios; and elicited, where specific emotions are induced in controlled conditions. The availability of diverse corpora across multiple languages is necessary for the development of effective SER systems in different application contexts [18]. Once the speech data is acquired, preprocessing steps are applied to clean the speech signal and remove irrelevant noise. This stage includes tasks such as framing, where the speech signal is divided into overlapping segments; windowing, which reduces signal discontinuities at the edges of these segments; noise reduction; silence removal; and normalization to standardize the signal’s amplitude [19]. Preprocessing techniques, including data augmentation methods such as deep convolutional generative adversarial networks (DCGANs), have also been employed to enhance the robustness of SER systems against unbalanced data and variations in speech data [20].
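To make the preprocessing stage more concrete, the following sketch (in Python, using only NumPy) illustrates amplitude normalization, framing into overlapping segments, Hamming windowing, and a simple energy-based silence removal. The frame length, hop size, and silence threshold are illustrative assumptions rather than values prescribed by any particular corpus or system.

```python
import numpy as np

def preprocess(signal, sr, frame_ms=25, hop_ms=10, silence_db=-40.0):
    """Minimal preprocessing sketch: normalize, frame, window, and drop silent frames."""
    signal = np.asarray(signal, dtype=np.float64)
    # Amplitude normalization to the [-1, 1] range
    signal = signal / (np.max(np.abs(signal)) + 1e-12)

    # Framing: split the signal into overlapping segments
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    if len(signal) < frame_len:                      # pad very short signals
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx]

    # Windowing: taper frame edges to reduce discontinuities at segment boundaries
    frames = frames * np.hamming(frame_len)

    # Simple energy-based silence removal (the threshold is an illustrative choice)
    energy_db = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    return frames[energy_db > silence_db]
```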
After preprocessing, feature extraction becomes the focus. Different features provide different insights into the emotional content of speech. Prosodic features such as pitch, energy, and duration are essential in capturing variations in the speaker’s intonation and rhythm, which are often indicative of emotional states [21]. In addition to prosodic features, spectral features, such as Mel-frequency cepstral coefficients (MFCC) and spectrogram analysis, are commonly employed to capture the nuances of emotional speech [22,23]. Voice quality features, such as jitter and shimmer, help capture the subtle variations in vocal cords and breath control that correspond to different emotional states [19]. Recent advancements have introduced hybrid feature extraction methods that combine traditional acoustic features with deep learning-derived features, enhancing the model’s ability to discern emotional states [24].
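As a concrete illustration of the feature families discussed above, the sketch below extracts MFCCs as spectral descriptors together with frame-level RMS energy and a fundamental-frequency (F0) track as prosodic descriptors, using the librosa library, and summarizes them into a fixed-length utterance vector. The specific parameter values (sampling rate, number of coefficients, F0 search range) are illustrative choices, not settings mandated by any SER benchmark.

```python
import numpy as np
import librosa

def extract_features(path, n_mfcc=13):
    """Illustrative spectral and prosodic descriptors for a single utterance."""
    y, sr = librosa.load(path, sr=16000)              # mono audio at 16 kHz

    # Spectral features: MFCC trajectory over frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Prosodic features: frame-level energy and fundamental frequency (F0)
    rms = librosa.feature.rms(y=y)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
    f0 = f0[voiced_flag]                              # keep voiced frames only

    # Summarize frame-level trajectories into a fixed-length utterance vector
    stats = lambda m: np.hstack([m.mean(axis=-1), m.std(axis=-1)])
    return np.hstack([stats(mfcc), stats(rms),
                      [np.nanmean(f0) if f0.size else 0.0,
                       np.nanstd(f0) if f0.size else 0.0]])
```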
The final stage in a SER system is classification, where the emotional state is predicted. SER systems commonly use two types of classifiers: classical classifiers and deep learning models. Classical classifiers, such as support vector machines (SVMs) and decision trees, have traditionally been employed, particularly for smaller datasets, and have proven effective when combined with carefully engineered features [25]. However, with the rise of larger datasets and the increasing complexity of speech patterns, deep learning models have become the dominant approach in this field. Convolutional neural networks (CNNs) and long short-term memory (LSTM) networks are frequently employed, even in combination, due to their ability to capture temporal dependencies in speech data [26,27]. Additionally, innovative approaches such as attention mechanisms and knowledge distillation have been proposed to enhance model performance [28,29]. These approaches aim to improve the system’s ability to focus on the most relevant emotional features, particularly in noisy or complex data environments.
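Continuing the pipeline, a classical classifier can be trained directly on such utterance-level vectors. The following sketch, assuming a feature matrix X and emotion labels y have already been assembled (for instance with the extraction function sketched above), fits a standardized RBF-kernel support vector machine with scikit-learn; it is intended as a minimal baseline, not a reproduction of any specific system from the literature.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

def train_svm_baseline(X, y, seed=0):
    """Fit a simple SVM baseline on utterance-level feature vectors."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)

    # Standardize features, then train an RBF-kernel SVM
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
    clf.fit(X_tr, y_tr)

    # Report per-emotion precision, recall, and F1 on the held-out split
    print(classification_report(y_te, clf.predict(X_te)))
    return clf
```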

4. Emotional Models

An emotion is a complex human state that causes psychological and physiological changes; it is described as a response to internal or external events, and it is associated with several feelings, thoughts, and behaviors. Because emotions have provided efficient solutions to recurrent adaptive problems throughout evolutionary history, they are regarded as a product of biological evolution. Emotions are also related to mood, temperament, and personality [30]. To avoid ambiguity, this paper will mainly use the term “emotional states” to refer to intense, short-lived reactions to internal or external stimuli, accompanied by physiological and behavioral changes. In a few instances, the term “affective states” will be used, which refers to a broader concept that also includes less intense and longer-lasting states, in addition to emotions. There are numerous theories regarding emotion classification, but the two main models on which existing SER systems are based are the categorical and dimensional emotional models.

4.1. Categorical Emotional Model

The categorical emotional model relies on a set of predefined discrete emotional categories, often referred to as the “basic emotions”. Indeed, Ekman and Friesen identified six basic emotions—happiness, sadness, anger, disgust, fear, and surprise—along with a neutral emotion, which are universally recognized and shared by cultures all over the world [31,32]. Thus, the core objective of SER systems based on the categorical emotional model is to classify speech into one of these discrete emotional labels through the analysis of acoustic and linguistic features extracted from the speech signal. While the categorical model offers a clear and structured framework for emotion classification, it struggles to capture the full spectrum of human emotions, potentially leading to misclassifications when dealing with subtle emotional shifts or mixed states [33,34]. Additionally, these models can struggle to distinguish between similar emotions. The acoustic characteristics of anger and frustration, for instance, can exhibit significant overlap, making it difficult for categorical models to differentiate between them due to shared features like increased pitch and energy [35,36]. Furthermore, cultural variations in emotional expression pose a challenge for categorical models. A model trained on data from one culture might not generalize well to another, leading to misinterpretations of emotional cues across cultural boundaries [37]. Finally, categorical models, by their discrete nature, struggle to capture the dynamic nature of emotions. However, research is increasingly focusing on alternative approaches that can provide a more comprehensive and nuanced understanding of the emotional landscape conveyed through speech.

4.2. Dimensional Emotional Model

Continuous emotion representation offers a promising alternative to the categorical model, moving beyond discrete labels and capturing the nuances of human emotions on continuous scales. The two most common approaches are two-dimensional (2D) and three-dimensional (3D) emotional space models. In the first model, emotions are divided into valence (ranging from pleasant to unpleasant) and arousal (ranging from calm to excited). The Circumplex model, illustrated in Figure 2, describes the emotions along the valence and arousal axes, and suggests a mapping with discrete emotions based on the combination of the positive and negative states of valence and arousal [38]. The second model is the valence–arousal–dominance (VAD) model, which maps three continuous dimensions: valence (positive to negative), arousal (high or low activation), and dominance (feeling in control or feeling controlled) [39]. These expanded models aim to capture a broader spectrum of emotional experiences.
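As a small worked example of how a dimensional prediction can be related back to discrete labels, the sketch below places a few emotion categories at rough positions in the valence–arousal plane and assigns a predicted (valence, arousal) point to the nearest one. The coordinates are illustrative approximations loosely inspired by the Circumplex model, not canonical values from the literature.

```python
import numpy as np

# Rough, illustrative (valence, arousal) anchors on a [-1, 1] scale
ANCHORS = {
    "happiness": ( 0.8,  0.5),
    "anger":     (-0.6,  0.8),
    "fear":      (-0.7,  0.6),
    "sadness":   (-0.7, -0.5),
    "calm":      ( 0.6, -0.6),
    "neutral":   ( 0.0,  0.0),
}

def nearest_category(valence, arousal):
    """Map a continuous (valence, arousal) prediction to the closest discrete label."""
    point = np.array([valence, arousal])
    return min(ANCHORS, key=lambda k: np.linalg.norm(point - np.array(ANCHORS[k])))

# Example: a mildly negative, highly aroused prediction falls closest to anger here
print(nearest_category(-0.5, 0.7))   # -> 'anger' (with these illustrative anchors)
```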
Unlike categorical models with pre-defined labels, continuous models require data annotated with continuous emotion ratings on multiple scales. This annotation process can be subjective and time-consuming, requiring human experts to rate emotions, with consistency being a major hurdle [40]. Training SER models for continuous representations also presents complexities compared to categorical models. Finally, interpreting the results of continuous models can be difficult. Visualizing and understanding the complex relationships within multidimensional emotion spaces necessitates advanced visualization techniques and expertise in the domain of affective computing [41]. In line with our narrative review approach, rather than applying a systematic coding protocol, we adopted a structured comparative framework to analyze the available SER databases. This framework is designed to highlight the critical aspects that influence both the usability and generalizability of the corpora. Our aim is not to exhaustively quantify every dataset parameter but to provide a consistent set of dimensions through which heterogeneous databases can be compared.

5. Characteristics of Databases for SER

SER systems require appropriate databases for providing accurate recognition of human emotions from speech signals. To support comparative evaluation, we organize our narrative synthesis around four critical criteria:
  • Scope: What the corpus is meant to capture and how the speech is obtained: acted, elicited, or natural; the target task or application; and whether recordings are made in the laboratory or “in the wild”. Scope defines the design brief and drives the downstream choices reflected in contents and language composition [42].
  • Physical Existence: Whether the corpus can actually be obtained and reused. This covers access and licensing, the quality of documentation and metadata (including collection protocols), persistent identifiers (e.g., a DOI), and well-defined train/validation/test splits; these elements are what make benchmarking and replication possible [43].
  • Contents: The concrete, measurable properties: number of speakers and total duration (hours/utterances), recording/channel conditions and sampling rate, class distribution, the label schema (categorical or dimensional), who performs the labeling (self/observer/expert/crowd) and any inter-rater reliability, transcripts and other annotation layers (e.g., lexical/semantic), and any additional modalities when present (audio with or without video/physiological signals). These fields enable like-for-like comparisons and support both recognition and synthesis uses [44].
  • Language Composition: The linguistic and cultural coverage: which languages are included, relevant dialects/accents, possible code-switching, and the associated demographic breadth. Being explicit here improves cross-cultural generalization and fairness; for transparency, a multilingual corpus should be noted as contributing to more than one language category [45].
These criteria do not constitute a systematic coding scheme but rather provide a structured lens for critical comparison, balancing technical rigor with interpretive insight. The four proposed criteria were chosen because they represent cross-cutting dimensions that, unlike other operational aspects (e.g., cost or collection effort), directly impact the quality and usability of datasets, supporting both technical robustness and scientific validity. Furthermore, these criteria highlight how the technical and psychological perspectives are closely interconnected. In our framework, the “Scope” of a database defines the intended context (posed/induced/natural; task; lab vs. in-the-wild) and drives choices realized under “Contents” (e.g., speaker diversity, label schema) and “Language Composition” (linguistic/cultural coverage). Likewise, the database’s contents concern not only the type of linguistic material collected but also the corpus’s ability to reflect real psychological processes, such as listeners’ perception of emotions. Building on these foundational criteria, speech corpora used for developing SER systems can be categorized into three types: acted, elicited, or natural speech. The key characteristics of databases based on these three different types of speech are summarized in Table 1. The table offers a structured overview of the main corpus types, providing qualitative guidance within our narrative review rather than a systematic coding scheme. Each type of database comes with distinct advantages and limitations that impact its use in SER systems. This four-dimensional framework is a central element of our narrative review methodology: it ensures transparency in how we compare heterogeneous corpora while maintaining the breadth and flexibility that a narrative review requires.

5.1. Acted Speech

Acted speech, specifically within the context of emotional speech databases, involves recordings by trained professional actors such as radio artists, theatre artists, or individuals capable of expressing various emotions while delivering linguistically neutral sentences in soundproof studios. The speakers act as if they were in a specific emotional state, e.g., being scared, angry, or sad. They usually have to memorize and rehearse a set of scripts containing the desired emotions or use the Stanislavski method, a self-induction technique that involves recalling a situation when the target emotion was felt strongly [46]. These recordings are conducted across multiple sessions to capture variations in expressiveness and speech production mechanisms.
This method is one of the most reliable ways to record emotional speech across a complete range of emotions, and more than 60% of emotional speech databases are simulated [47]. Acted emotional speech corpora, often referred to as simulated emotional databases, are recognized for their intensity and depth of emotion, representing what is sometimes termed “full-blown” emotions [48]. Research suggests that acted or simulated emotions tend to be more expressive than natural ones due to the deliberate and controlled nature of the performances [49].
However, while acted speech can provide rich datasets for studying emotion, it may lack the spontaneity and authenticity of emotions found in natural speech contexts. A key problem lies in its lack of ecological validity [50]. These databases often rely on trained actors portraying specific emotions while reading pre-selected, neutral sentences as monologues [51]. This scripted approach fails to capture the natural flow and context of real-world conversations. Human emotional expression is highly contextual, influenced by the dynamics of dialogue, the relationship between speakers, and the surrounding environment [52].

5.2. Elicited Speech

Elicited speech is collected by evoking specific emotions in speakers. Several mood induction procedures (MIPs) can be used to elicit both positive and negative emotional states. Among them, the most efficient techniques have proven to be visual stimuli (such as films or images), music, autobiographical recall, situational procedures, and imagery [53]. These methods are effective in inducing a range of basic emotions, including anger, disgust, surprise, happiness, fear, and sadness. Other studies have used hypothetical scenarios designed to elicit specific emotions [54]. Participants might engage in conversations with an interviewer on emotionally charged topics, respond to prompts from a computer program, or watch videos designed to elicit certain feelings. A promising research field is virtual reality (VR), which allows complete immersion in a virtual scenario and gives the feeling of being “there”. Therefore, VR is considered an ecologically valid paradigm for studying emotion [55].
Unlike acted speech, where actors portray emotions based on scripts, elicited speech aims to capture more natural emotional responses. However, participants might have varying degrees of awareness about the intended emotional elicitation. While elicited speech offers a valuable balance between control and naturalness in data collection, it presents a trade-off. The controlled scenarios enable researchers to capture emotional nuances and analyze corresponding speech patterns. This controlled environment facilitates reproducibility and standardization in data collection, essential for training and evaluating SER systems [50]. However, elicited speech may lack the natural variability and spontaneity of real-world interactions, potentially limiting the generalizability of SER models to authentic speech scenarios [49].

5.3. Natural Speech

Natural speech corpora consist of spontaneous speech recorded in real-world scenarios, such as talk shows, call center recordings, radio talks, and environmental recordings [56]. They are prized for their ecological validity, capturing authentic and natural emotional expressions often exhibited without the speaker’s awareness [57]. This allows researchers to analyze genuine emotional reactions in response to real-world situations. Furthermore, the rich variability of emotional expression in natural speech data provides a more comprehensive training ground for SER systems, enabling them to recognize emotions across different contexts and speaking styles [58].
However, collecting and utilizing natural speech data presents significant challenges. Ethical and legal considerations arise when capturing recordings without explicit participant consent [46]. Privacy is an additional issue, as natural speech data may often contain identifiable personal information [59]. Natural speech carries specific risks because voice is a biometric identifier. An ethics review should determine the consent model (explicit consent where feasible; opt-out or waiver only under narrowly defined conditions), with extra safeguards for vulnerable populations such as children or patients. Data minimization and de-identification are essential: remove names and contacts, mask free-text metadata, consider voice transformations if open release is planned, and document any residual re-identification risk. For media-derived corpora (film, TV, online platforms), researchers must address copyright/neighboring rights and terms of service; fair use or fair dealing depends on jurisdiction and rarely permits redistribution of raw audio. When sharing, use clear dataset licenses and, where needed, controlled-access repositories with usage agreements (research-only, no re-identification, no commercial use). Finally, plan for annotator well-being (possible exposure to distressing content), and report ethics approval, consent procedures, and access controls. Additionally, the spontaneous nature of these recordings can lead to mildly expressed emotions, which are more difficult for SER systems to recognize and annotate accurately [50]. Furthermore, the variety of sources used for natural speech data collection presents challenges. Sources like TV shows, interviews, customer service interactions, cockpit recordings during abnormal conditions, and public conversations can introduce substantial background noise and variability in speech styles [60]. Finally, the subjective nature of emotion categorization presents another hurdle. Annotating emotions in natural speech data often relies heavily on expert opinion, which can be highly debatable depending on the chosen emotion classification scheme [52].

5.4. How Foundational Criteria Map to Speech Types

“Scope” anchors the selection of speech type (acted/elicited/natural; application/task; laboratory vs. in-the-wild), while “Contents” specifies how this choice is operationalized (representation and annotation of emotions: categorical vs. dimensional, intensity and mixed states, who labels (self/observer), inter-rater reliability). These decisions have predictable consequences for physical existence (availability, licensing, documentation) and language composition (cultural and demographic coverage). Acted speech, produced by professional or non-professional actors, ensures high control and clear labeling, but at the cost of ecological validity and often limited cultural diversity. Elicited speech represents a compromise: it is more natural than acted data but can still be influenced by experimental settings, tends to display moderate or variable intensity, and usually achieves lower inter-rater agreement than acted speech. Natural speech maximizes ecological variability and expressive validity, but it poses greater challenges for annotation, typically yielding lower inter-rater agreement, and may also raise issues of rights and consent, particularly when drawn from publicly available media.

6. Available Databases for SER

In this section, we present a comprehensive overview of the databases for speech emotion recognition available in the literature. Table 2 compares these databases in terms of corpus name, reference, year of publication, speech language(s), speech type (natural, elicited, or acted), number of speakers, considered emotions, voice recording conditions, and emotion annotation methods. It summarizes the key parameters of the reviewed databases, serving as a descriptive reference to support the comparative analysis developed in our narrative review. To complement Table 2, Figure 3 provides an at-a-glance descriptive view of the inventory: the blue bars show the raw distributions of corpora by language and by speech type; the quality-weighted views (orange bars) are interpreted in Section 7.

6.1. DES

The Danish Emotional Speech (DES) corpus ([61], 1997) is a compact yet impactful resource designed to explore the perception and recognition of emotional speech in the Danish language. It consists of approximately 30 min of recordings by four professional actors (two male, two female) from Aarhus Theatre, delivering a diverse range of utterances in five emotional states: neutral, surprise, happiness, sadness, and anger. The speech material includes two isolated words (“yes” and “no”), nine sentences (four questions and five statements), and two longer passages of fluent speech. The recordings were made in an acoustically dampened studio using high-quality microphones and digital equipment to ensure optimal sound quality for acoustic analyses. This carefully controlled setup emphasizes clarity while capturing emotional variability across different speech contexts. The annotation process was validated through a perceptual test involving 20 listeners (10 male, 10 female) aged 18–59 years, primarily university staff. Listeners were tasked with identifying the expressed emotions, achieving an overall recognition rate of 67.3%. Recognition rates varied across speech types, with passages yielding the highest accuracy (76.3%), followed by words (67.5%) and sentences (65.3%). Notably, confusion between sadness and neutral emotions highlighted challenges in distinguishing subtle affective cues. Listener feedback further underscored the complexity of emotion perception, providing valuable insights for refining emotional speech modeling.
For its time, the DES corpus was a pioneering effort in the study of emotional speech, especially in the Danish language. Its compact design and inclusion of multiple speech types make it an accessible yet valuable resource for foundational research in affective computing and speech synthesis. The use of professional actors and controlled recording conditions set a precedent for later emotional speech datasets. Although limited in size, its focus on high-quality, contextually varied emotional expressions offered unique insights into emotion perception and recognition. The dataset remains relevant as a historical benchmark for understanding the evolution of emotional speech research and its methodological foundations.

6.2. SUSAS

The Speech Under Simulated and Actual Stress (SUSAS) database ([62], 1997) is a pioneering resource designed to analyze speech variability under stress, providing a comprehensive framework for understanding the impact of emotional and environmental stressors on speech production. The corpus encompasses recordings from 32 speakers (13 female and 19 male) across various stress conditions, capturing 16,000 utterances in total. It spans five distinct domains: (1) talking styles, including slow, fast, loud, soft, angry, clear, and questioning tones; (2) single tracking tasks simulating speech produced in noisy environments or under the Lombard effect; (3) dual tracking tasks involving compensatory and acquisition responses under task-induced workload; (4) motion-fear tasks captured in real-world scenarios like amusement park rides (e.g., “Scream Machine” and “Free Fall”) and helicopter missions, simulating extreme physical stress; and (5) psychiatric analysis speech reflecting emotional states such as depression, fear, and anxiety. Utterances include a 35-word vocabulary set tailored to aircraft communication contexts, emphasizing its applicability to aviation and high-stress environments.
Actual stress data were recorded in extreme scenarios such as roller-coaster rides and Apache helicopter missions, which involved high G-forces and genuine fear responses.
The annotation process focused on categorizing stress and emotional states across diverse speaking scenarios. It emphasized distinguishing subtle variations in stress levels and emotional cues within real-world and simulated high-stress conditions. Categorical labels such as “stressed” or “neutral” were applied, tailored to contexts like aviation and emergency communication. Annotations also accounted for speech affected by environmental factors, such as the Lombard effect, and task-induced workload stress, capturing nuanced vocal variations under pressure. While inter-rater reliability metrics were not explicitly reported, the combined use of subjective human assessments and automated tools aimed to ensure consistency. This methodology enhances the corpus’s utility for studying stress-induced speech dynamics and emotion recognition in high-stakes applications, offering valuable insights into the interplay of stress and vocal expression in realistic scenarios.
As one of the earliest databases to focus on stressed speech, the SUSAS corpus laid the groundwork for subsequent research in stress-resilient speech recognition, emotion recognition, and speech synthesis. Its integration of diverse stress domains and detailed annotations provides a unique platform for investigating how stress influences speech, making it an invaluable resource for both theoretical and applied studies in affective computing, forensic linguistics, and adaptive speech technologies.

6.3. CREST

The Expressive Speech Processing (ESP) project, initiated in Spring 2000 and lasting five years, was a pivotal part of the JST/CREST (Core Research for Evolutional Science and Technology) initiative, funded by the Japan Science and Technology Agency ([63], 2001). This research aimed to collect a comprehensive database (CREST: the expressive speech database) of spontaneous, expressive speech tailored to meet the requirements of speech technology, particularly for concatenative synthesis. The project focused on statistical modeling and parameterization of paralinguistic speech data, developing mappings between the acoustic characteristics of speaking style and speaker intention or state, and implementing and testing prototypes of software algorithms in real-world applications. Emphasizing practical applications, the ESP project prioritized states likely to occur during interactions with information-providing or service-providing devices, such as emotional states (e.g., amusement) and emotion-related attitudes (e.g., doubt, annoyance, surprise).
Given the potential language-specific nature of expressive speech, the database encompasses materials in three languages: Japanese (60%), Chinese (20%), and English (20%). The target was to collect and annotate 1000 h of speech data over five years, primarily sourced from non-professional, volunteer subjects in everyday conversational situations, as well as emotional speech in television broadcasts, DVDs, and videos. To capture truly natural speech, a “Pirelli-Calendar” approach was employed, inspired by photographers who used to take 1000 rolls of film to produce a 12-photo calendar. Volunteers were equipped with long-term recorders, capturing samples of their day-to-day vocal interactions throughout the day. This extensive data collection method was aimed at ensuring adequate and representative coverage. However, annotating this vast amount of data was a monumental task, and automatic transcription using speech recognition posed significant challenges.

6.4. INTERFACE

The INTERFACE Emotional Speech Database ([64], 2002) is a multilingual corpus developed to support research in SER across four languages: Slovenian, English, Spanish, and French. The dataset includes recordings from 20 professional actors (10 male, 10 female), approximately five actors per language. Each actor recorded speech stimuli designed to evoke six primary emotions, namely, anger, sadness, happiness, fear, disgust, and surprise, alongside neutral speech. The linguistic diversity in the corpus extends to accents and dialects within each language, ensuring comprehensive coverage for cross-linguistic emotion analysis. Each language subset includes approximately 1200 recordings, resulting in a total of 4800 audio samples. The speech material comprises a mix of isolated words, short sentences, and longer passages, reflecting varied syntactic and semantic structures to explore emotional expression across different speech contexts.
The recordings were conducted in a sound-treated studio environment to minimize noise and ensure high fidelity. High-quality condenser microphones and professional audio interfaces captured the audio signals. The actors read from carefully designed prompts displayed on screens to maintain uniformity, while also allowing for the spontaneity needed to evoke authentic emotions. Each recording session lasted approximately four hours per actor, with periodic breaks to ensure sustained vocal quality and emotional performance. For some languages, a Portable Laryngograph was employed to capture additional vocal characteristics, enriching the dataset’s acoustic features. Recordings were conducted in two separate sessions spaced two weeks apart. The annotation process combined actor self-assessments with evaluations from five independent raters per language, enabling a comparison between intended and perceived emotions. Subjective evaluations of the Spanish and Slovenian subsets provided further insights into the accuracy and intensity of emotional expressions. For the Spanish subset, 16 non-professional listeners evaluated 56 utterances (seven per emotion, including long and short versions), while 11 listeners assessed 64 utterances for the Slovenian subset. Listeners identified primary emotions, rated intensity on a scale of 1 to 5, and could mark secondary emotions if necessary. Annotations included both primary and secondary emotion labels, providing additional granularity for mixed or overlapping emotional states. Inter-rater agreement, measured using Krippendorff’s alpha, indicated moderate to high reliability, although exact values were not consistently reported. Feedback from annotators highlighted cultural and linguistic variations in emotional perception, reflecting the complexity of annotating multilingual datasets. This combination of robust training, inclusion of secondary labels, and supplementary evaluation tests enhances the dataset’s value for cross-cultural emotion recognition.
The database’s multilingual and multimodal design marked a significant contribution to emotion recognition research, particularly in cross-cultural contexts. By providing a standardized resource for analyzing emotional speech across multiple languages, it supported advancements in affective computing, multilingual speech synthesis, and early machine learning-based emotion recognition systems. This corpus has informed subsequent research efforts, offering a valuable foundation for exploring the relationship between language, culture, and emotion in the context of human–computer interaction and multilingual communication. Its influence is reflected in later datasets and studies that adopted and refined its approaches to meet evolving research needs.

6.5. SmartKom

The SmartKom Multimodal Corpus ([65], 2002) is a dataset developed under the German SmartKom project, designed to advance research in multimodal HCI. This corpus integrates acoustic, visual, and tactile modalities, collected through Wizard-of-Oz experiments that simulate realistic system interactions. The dataset includes recordings from 45 participants (25 female and 20 male) across 90 sessions, with a diverse demographic breakdown. Recording sessions were conducted across three technical scenarios (Public, Home, Mobile), enhancing SmartKom’s applicability for various real-world environments. In SmartKom Public, the system functioned as a publicly accessible information interface, enabling users to perform tasks such as making cinema reservations or obtaining navigation details. The SmartKom Home scenario simulated an intelligent personal assistant designed for domestic environments, capable of managing tasks such as scheduling appointments or controlling home appliances. The SmartKom Mobile setup demonstrated the capabilities of a portable communication assistant, enabling users to interact with the system while on the move, such as checking emails or receiving navigation guidance. Audio data were captured across 10 channels at a high sampling rate to ensure detailed acoustic analysis. Video recordings included frontal and lateral views of participants, graphical overlays of gestures, and infrared visuals to track hand movements, all recorded in high-quality formats to support in-depth analysis.
Speech data were annotated with orthographic transcriptions and prosodic markers, such as stress and pitch contours, allowing for a detailed analysis of vocal patterns. Gestural annotations employed segmentation in the 2D plane, mapping hand and pen trajectories with high precision, supported by infrared data from the SIVIT gesture analyzer and graphical tablet inputs. User states were categorized through a combination of facial expression analysis and vocal cues, leveraging advanced annotation protocols to distinguish emotional nuances. However, the paper does not explicitly report inter-annotator agreement or validation metrics, which are crucial for ensuring reliability in subjective annotations. A standout feature of this process is the synchronization of all annotations within a QuickTime framework, which aligns multiple data streams, including audio, video, and gesture coordinates, with millisecond accuracy. The use of the BAS Partitur File (BPF) format further enhances the corpus by consolidating linguistic, gestural, and user-state annotations into a single, unified dataset.
The SmartKom Multimodal Corpus contributed significantly to research in multimodal interaction by integrating speech, gestures, and user states in realistic scenarios. Its detailed annotations and synchronized data streams supported advancements in adaptive interfaces and emotion recognition. By being publicly accessible, it encouraged broader research use and influenced the development of subsequent multimodal corpora, shaping methodologies in human–computer interaction studies.

6.6. SympaFly

The SympaFly database ([66], 2003) was developed as part of the SMARTKOM project to study user behavior and emotional states in fully automated speech dialogue systems for flight reservations and bookings. The covered emotions include joyful, neutral, emphatic, surprised, ironic, compassionate, helpless, panic, touchy, and angry. The dataset captures dialogues across three stages of system development, reflecting progressive improvements in system performance and variations in user strategies. The first stage (S1) refers to the first usability test and comprises 110 dialogues with 2291 user turns and 11,581 words. The second stage (S2) includes 98 dialogues with 2674 user turns and 9964 words, during which the performance of the dialogue manager showed improvement. The third stage (S3) contains 62 dialogues with 1900 user turns and 7655 words, collected using the same experimental settings as S1. Dividing the corpus into these three developmental stages highlights user adaptation to varying system performance levels, providing a unique perspective on the interplay between system reliability and user frustration. Recordings were conducted over standardized telephone channels under controlled experimental conditions, with users completing tasks such as providing flight details, frequent flyer IDs, and payment information. These standardized setups ensured consistent data collection across all stages but may limit the applicability of the data to more modern multimodal systems integrating visual or tactile inputs.
The annotation process focused on holistic user states and prosodic features, classifying emotional states into five categories: positive (e.g., joyful), neutral, pronounced (e.g., emphatic), weak negative (e.g., surprised, ironic, compassionate), and strong negative (e.g., helpless, panic, angry, touchy). Annotation was performed by two independent annotators followed by a consensus process, which provided some assurance of reliability, though the paper does not report inter-annotator agreement metrics. Annotations highlighted distinctions between categories, such as strong negative and positive, with prosodic features like hyper-articulation and emphasis being more pronounced in strong negative states, reflecting heightened emotional intensity. In contrast, overlapping features in neutral and weak negative states, such as pauses and subtle tonal shifts, posed challenges for consistent classification. Dialogue success, annotated on a four-point scale, correlated with emotional states: positive and neutral states appeared more frequently in successful dialogues, while strong negative states were associated with unsuccessful interactions. Additionally, conversational peculiarities like repetitions provided insights into user strategies when facing system limitations.
The originality of the SympaFly database lies in its focus on capturing emotional states within real-world human–machine dialogues, moving beyond prototypical or acted emotions to reflect interaction-centered states. By documenting user interactions across three stages of system development, the database provides insights into how system performance impacts user frustration, success rates, and adaptive behaviors. At the time of its release, it addressed the need for datasets capturing naturalistic emotional and interactional patterns in speech-based automated systems. While limited to telephone interactions and predefined scenarios, its contribution remains significant for understanding the interplay between emotion, prosody, and dialogue outcomes, forming a basis for advancements in dialogue system evaluation and user behavior analysis.

6.7. AIBO

The You Stupid Tin Box Corpus is a cross-linguistic dataset developed to study children’s speech and emotional responses during interactions with the Sony AIBO robot ([67], 2004). This corpus emphasizes spontaneous emotional speech, leveraging a Wizard-of-Oz experimental setup to provoke a wide range of natural emotions. The data was collected from 51 German children aged 10–13 and 30 English children aged 4–14, in scenarios where the AIBO robot behaved either obediently or disobediently, simulating varying levels of system performance. The recordings included approximately 9.2 h of German speech and 1.5 h of English speech, after filtering silences and irrelevant segments. The setup involved the children giving spoken instructions to the AIBO robot to complete tasks such as navigating a map or avoiding “poisoned” cups. The robot’s actions were pre-determined in the disobedient mode, intentionally provoking emotional responses such as frustration, joy, or irritation. Recordings were conducted using high-quality wireless microphones in controlled environments like classrooms and multimedia studios. The dataset also includes accompanying video recordings, though these are not publicly accessible due to privacy restrictions.
The annotation focused on emotional user states and prosodic peculiarities. Emotional labels were derived by multiple annotators, who categorized each word into classes such as joyful, angry, surprised, irritated, emphatic, and neutral. Prosodic annotations included phenomena like pauses, emphasis, shouting, and clear articulation. A majority-vote approach was used to ensure reliability in labeling. The corpus provides a granular view of interactional dynamics by aligning children’s verbal reactions with the robot’s pre-scripted actions.
The AIBO Corpus is notable for its focus on spontaneous, naturalistic emotions in a real-world setting, making it a significant departure from acted emotional datasets. Its cross-linguistic component and robust annotation processes make it highly relevant for SER, particularly in studying how users, especially children, respond emotionally to autonomous systems. This originality and naturalistic approach position it as a foundational resource for both emotion modeling and human–robot interaction studies.

6.8. Real-Life Call Center

The Real-Life Call Center Corpora ([68], 2004) were collected to analyze emotional manifestations in task-oriented spoken dialog within call center interactions. The study utilizes two corpora: CORPUS 1, recorded at a stock exchange customer service center, and CORPUS 2, recorded at a capital bank service center. Together, the corpora contain a total of 2850 recorded dialog samples and over 100,000 annotated words, with CORPUS 1 contributing 8 h of recordings and CORPUS 2 contributing 5 h. These datasets capture transactional and problem-solving scenarios, where clients express a range of emotions based on the task demands. The recording setup involved natural telephone conversations between agents and clients discussing services such as stock transactions, account management, and technical issues. In CORPUS 1, longer dialog samples with heightened emotional intensity were observed due to the urgency and stakes of stock transactions, whereas CORPUS 2 included shorter dialog samples with more moderate emotional expressions, primarily focused on routine banking issues. All conversations were transcribed and annotated with detailed semantic, dialogic, and emotional markers. The annotation process employed a two-dimensional emotional framework that represented activation (emotional intensity) and evaluation (valence). A total of four primary emotion classes—anger, fear, satisfaction, and neutral—were labeled, along with nuanced emotions such as irritation and anxiety to capture the complexity of real-life dialog. Inter-annotator agreement, measured using the Kappa statistic, was higher for CORPUS 1 compared to CORPUS 2, reflecting the clearer emotional cues in the former due to heightened emotional stakes. Prosodic and lexical cues such as fundamental frequency (F0), rhythm, and word choice were analyzed for their role in signaling emotional states.
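For readers less familiar with the agreement statistic mentioned above, the short snippet below shows how a chance-corrected Kappa score can be computed for two annotators’ categorical labels using scikit-learn; the label sequences are invented for illustration and do not come from the corpora described here.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over the same ten dialog turns
annotator_1 = ["anger", "neutral", "fear", "neutral", "satisfaction",
               "anger", "neutral", "anger", "fear", "neutral"]
annotator_2 = ["anger", "neutral", "fear", "anger", "satisfaction",
               "anger", "neutral", "neutral", "fear", "neutral"]

# Cohen's kappa corrects raw agreement for the agreement expected by chance
kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```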
The Real-Life Call Center Corpora are notable for their focus on realistic emotional dialog, capturing task-dependent emotional expressions in naturalistic call center interactions. By combining detailed annotations of lexical and prosodic features with a context-driven approach, the corpora provide valuable insights into how emotions manifest in real-world communication. Their focus on spontaneous emotional expressions, paired with a structured annotation process, makes them particularly relevant for understanding emotional dynamics in task-oriented dialog systems.

6.9. IEMO

The Italian Emotional (IEMO) Speech Corpus was developed to study emotional variations in speech and facilitate emotional speech synthesis ([69], 2004). The dataset includes recordings from three professional Italian speakers (one female and two males), all experienced in recording tasks, ensuring consistency in their performances. Each speaker recorded 25 sentences in four emotional styles—happiness, sadness, anger, and neutral—providing a baseline for analysis. However, the exclusion of more nuanced emotions (e.g., fear, surprise) limits its applicability to scenarios requiring a broader emotional spectrum. Sentences were approximately 10 words long, designed to be phonetically balanced, and intentionally devoid of semantic emotional content to prevent bias in speaker performances. Recordings were conducted in a sound-treated room using high-quality microphones and digital acquisition equipment. The controlled recording environment ensured acoustic clarity, making the dataset well-suited for prosodic and acoustical analysis.
The annotation process aimed to capture emotional variations in speech through the direct labeling of emotional categories—happiness, sadness, anger, and neutral. These categories were informed by prosodic parameters, including fundamental frequency (F0), energy, and syllable duration, which were calculated for each utterance and compared to the neutral baseline to quantify emotional differences. Automated phonetic alignment tools were employed to achieve syllable-level segmentation, with manual corrections applied to address extralinguistic artifacts like pauses and hesitations. The process effectively balanced precision with natural variation, accommodating speaker-dependent nuances, particularly in anger, which ranged from “hot” to “cold”. While this variability adds realism to the dataset, it also highlights the challenges of achieving consistent annotations for nuanced emotional states.
This corpus is notable as the first Italian emotional speech database, addressing a significant gap in Italian emotional speech resources. Its focus on phonetically balanced, semantically neutral sentences and quantification of prosodic features offered a structured and reproducible methodology for studying emotional speech. At the time, it laid the groundwork for advancements in emotional speech synthesis and remains a valuable resource for research on prosodic correlates of emotions and their applications in speech technology.

6.10. EMODB

The German Emotional Speech Dataset ([46], 2005) was developed to provide a controlled resource for studying emotional expression in speech. The dataset features recordings from 10 actors (5 male, 5 female) performing 10 predefined sentences in seven distinct emotional categories: anger, fear, joy, sadness, disgust, boredom, and neutral. These sentences were carefully designed to be phonetically diverse and contextually neutral, ensuring their suitability for expressing various emotions without introducing linguistic bias. The actors used the Stanislavski method, recalling personal experiences to evoke genuine emotions while maintaining a consistent speaking style. In total, the corpus contains approximately 800 utterances, recorded in an anechoic chamber using high-quality equipment, including a Sennheiser MKH 40 microphone and a Tascam DAT recorder, ensuring acoustic clarity. Electro-glottographic data were also captured to analyze voice quality and phonatory behavior.
Emotional utterances, recorded with high precision, underwent perceptual tests by 20 listeners to ensure recognizability and naturalness. Utterances recognized with an accuracy above 80% and judged as natural by more than 60% of listeners were included for further analysis. Each utterance was phonetically labeled using narrow transcription supported by visual analysis of oscillograms and spectrograms. Annotations incorporated phonatory and articulatory settings, voice characteristics, and stress levels for comprehensive emotional analysis. The use of both perceptual tests and phonetic analysis ensures consistency, capturing clear and interpretable emotional expressions.
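The selection rule described above (recognition above 80% and naturalness above 60%) amounts to a simple filter over listener ratings. The sketch below illustrates that filter with pandas; the utterance identifiers, column names, and values are invented and do not reflect the dataset's actual format.

```python
# Hypothetical sketch of the EmoDB-style selection rule: keep utterances
# recognized above 80% and judged natural by more than 60% of listeners.
import pandas as pd

ratings = pd.DataFrame({
    "utterance_id": ["utt_001", "utt_002", "utt_003"],
    "recognition_rate": [0.95, 0.72, 0.88],  # fraction choosing the intended emotion
    "naturalness_rate": [0.75, 0.80, 0.55],  # fraction judging the utterance natural
})

selected = ratings[(ratings["recognition_rate"] > 0.80) &
                   (ratings["naturalness_rate"] > 0.60)]
print(selected["utterance_id"].tolist())  # ['utt_001']
```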
EmoDB’s significance lies in its carefully controlled design, providing high-quality acoustic recordings paired with detailed phonetic and emotional annotations. By using acted emotions, the database achieves a balance between emotional clarity and experimental reproducibility. As one of the earliest publicly available emotional speech datasets, EmoDB has become a cornerstone in the field of speech emotion recognition, cited widely for its role in standardizing methodologies and enabling detailed analysis of acoustic and prosodic features across emotions.

6.11. EMOTV1

The EMOTV1 corpus ([70], 2005) was developed to analyze real-life emotions in multimodal contexts, focusing on the interactions of vocal, visual, and contextual cues. This dataset comprises 51 video clips of French TV interviews, featuring 48 speakers discussing 24 diverse topics such as politics, society, and health. The total duration of the corpus is 12 min, with clip lengths ranging from 4 to 43 s. The primary aim was to study mixed and non-basic emotions in natural settings, contrasting with acted datasets. Clips were selected based on the visibility of multimodal behaviors, including facial expressions, gestures, and gaze, providing a rich dataset for analyzing spontaneous emotional expressions.
The annotation process involved two annotators who labeled emotions using the Anvil tool under three conditions: audio-only, video-only, and combined audio-video. Emotional segments were created when annotators perceived consistent behaviors, resulting in a final agreed segmentation through the union of audio-based and the intersection of video-based annotations. Labels were selected from 14 predefined categories, including anger, joy, sadness, despair, surprise, and neutral, as well as intensity and valence dimensions. Inter-annotator agreement was quantified using Kappa for categorical variables and Cronbach’s alpha for intensity and valence, with higher agreement observed in audio-only conditions compared to combined audio-video, highlighting challenges in annotating multimodal emotions. The results revealed ambiguities in non-basic emotions, including blended and masked emotions, and sequences of consecutive emotions, offering a nuanced view of naturalistic emotional expressions.
The uniqueness of EMOTV1 stems from its focus on contextually rich, spontaneous emotional data, offering a resource for exploring multimodal emotional behaviors. By employing diverse annotation schemes and addressing the complexities of naturalistic emotions, the corpus contributes to research on emotion recognition and its potential applications in areas like conversational agents and human–computer interaction. While its limited size and visual constraints pose challenges, EMOTV1 provides a valuable foundation for studying real-life emotional expressions in realistic scenarios.

6.12. CEMO

The CEMO Corpus ([71], 2006) captures naturalistic emotional interactions recorded in a French medical emergency call center. The corpus is particularly valuable for studying real-life emotional expressions in high-stress, goal-oriented scenarios. It includes 20 h of speech recorded from 688 dialog samples, involving 784 unique callers and 7 agents. On average, each dialog contains 48 turns, providing a rich dataset for analyzing emotional dynamics in critical interactions. The callers include patients or third parties, such as family members or caregivers, reflecting diverse perspectives in medical emergencies. The recording environment ensured the naturalness of emotional expressions while maintaining ethical standards. Recordings were made imperceptibly during real medical calls, preserving the authenticity of the emotional content. Anonymity and privacy were strictly respected, with personal information removed, and the corpus is not publicly distributed. Contextual metadata was annotated, including call origin (e.g., patient or medical center), role in the dialog (caller or agent), reason for the call (e.g., immediate help, medical advice), and decision outcomes. Additional details such as the acoustic quality (e.g., noise, mobile vs. fixed phone), caller demographics (e.g., age, gender, accent), and health or mental state (e.g., hoarseness, grogginess, intoxication) were also labeled. The setup is well-designed to capture naturalistic emotional expressions in real-world scenarios.
The dataset reflects high-stakes, emotionally charged situations that are often difficult to replicate in controlled environments. However, the corpus does not explicitly describe how background noise was mitigated during annotation or pre-processing, which could influence the accuracy of emotional labeling. The corpus shows a strong commitment to privacy through anonymization of personal data and restricted access, ensuring ethical use of sensitive information. Recordings were made unobtrusively during medical calls, preserving natural behavior. However, the paper does not clarify whether explicit consent or institutional approval was obtained, leaving some room for improvement in transparency. Despite this, the privacy safeguards effectively balance ethical considerations with research value. The emotional annotation used a hierarchical framework combining coarse-grained categories and fine-grained subcategories. Coarse categories included fear, anger, sadness, hurt, relief, and compassion, with fine-grained labels like stress, panic, annoyance, and resignation, resulting in 20 distinct emotional labels. This demonstrates methodological rigor. Segments were annotated with one or two labels to account for emotional mixtures, such as relief/anxiety or compassion/annoyance, which are frequent in real-life interactions. Inter-annotator agreement was assessed through Kappa scores (0.35 for agents, 0.57 for callers) and validated with a re-annotation process, achieving 85% coherence across annotators over time. A perceptive test involving 43 participants further evaluated the annotations, achieving 85% agreement between expert and naive annotators. Even though inter-annotator agreement for agent dialog is relatively low, which could raise concerns about label consistency, the re-annotation process improved coherence. Providing more details about the annotators’ training and the criteria for label assignment could further validate the methodology. The inclusion of a perceptual test involving both experts and naive annotators is commendable, as it enhances the reliability of the annotations by cross-validating emotional perceptions. The corpus’s value lies in its focus on spontaneous emotional expressions in high-stakes medical interactions. Its integration of contextual metadata, hierarchical labels, and detailed annotations makes it a key resource for advancing speech emotion recognition in applied settings like medical communication and human–computer interaction.
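A hierarchical scheme with coarse and fine labels and up to two labels per segment can be represented with a very small data structure. The sketch below is purely illustrative: the field names are assumptions, not the corpus's actual annotation format, and the label values are taken from the categories cited above.

```python
# Minimal sketch of a hierarchical annotation record in the style described
# above: a coarse category, an optional finer sub-label, and at most two
# labels per segment to capture emotional mixtures. Field names are illustrative.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EmotionLabel:
    coarse: str                  # e.g., "fear", "relief", "compassion"
    fine: Optional[str] = None   # e.g., "stress", "panic", "annoyance"

@dataclass
class Segment:
    speaker_role: str            # "caller" or "agent"
    labels: List[EmotionLabel]   # one or two labels per segment

seg = Segment(
    speaker_role="caller",
    labels=[EmotionLabel("relief"), EmotionLabel("fear", "anxiety")],
)
assert 1 <= len(seg.labels) <= 2
```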

6.13. eNTERFACE’05

The audio-visual emotion database eNTERFACE’05 was designed as a reference resource for testing and evaluating emotion recognition algorithms based on video, audio, or combined audio-visual data ([72], 2006). The original material was collected from 42 subjects of 14 different nationalities (9 from Belgium; 7 each from Turkey and France; 6 from Spain; 4 from Greece; and 1 each from Italy, Austria, Cuba, Slovakia, Brazil, the U.S.A., Croatia, Canada, and Russia). Among the 42 subjects, 81% were men and 19% were women. Emotional speech was obtained through elicitation: although the participants came from different countries and had different mother tongues, all the experiments were conducted in English. Each subject listened to six successive short stories, each designed to elicit a particular emotion (anger, disgust, fear, happiness, sadness, surprise), and uttered five emotion-specific sentences after each emotion was elicited. The procedure began with the subject listening carefully to a short story to become immersed in the given situation. Once prepared, the subject read, memorized, and pronounced each of the five proposed utterances, one at a time, representing different reactions to the situation. Subjects were instructed to convey as much expressiveness as possible, focusing on the intended emotion in their delivery. If the result was deemed satisfactory, the process proceeded to the next emotion; otherwise, the subject was asked to repeat the attempt. When a subject struggled to express an emotion effectively, the experimenters provided guidance based on their understanding of typical emotional expressions. In some cases, the experimenters opted not to repeat the process if they concluded that satisfactory results were unlikely with the subject in question.
In their article [72], the authors present both the situations (the stories told) used to elicit each specific emotion and the sentences the speakers had to utter as reactions. The speech signal was recorded using a high-quality microphone specifically designed for speech recordings, positioned approximately 30 cm below the subject’s mouth. The recording room measured approximately ten square meters, and the doors remained closed at all times to prevent external sounds from interfering with the experiments. Audio was recorded at a 48,000 Hz sampling rate in an uncompressed stereo 16-bit format. Of the 42 participants in the database, 25 (60%) successfully conveyed all six emotions, producing five convincing reactions for each proposed situation. The remaining 17 participants (40%) were unable to convey the intended emotions in all their reactions, resulting in unusable samples that were excluded from the dataset.

6.14. SAFE

The Situation Analysis in a Fictional and Emotional (SAFE) Corpus ([73], 2006) was developed to study extreme fear-related emotions in dynamic and abnormal situations. The corpus consists of 400 audiovisual sequences extracted from 30 recent movies, with sequence durations ranging from 8 s to 5 min. These scenes were selected to illustrate emotions in contexts involving both normal and abnormal situations, such as natural disasters (fires, earthquakes, floods) and psychological or physical threats (kidnapping, aggression). The dataset contains a total of 7 h of recordings, with speech constituting 76% of the data, and features approximately 4073 speech segments spoken by 400 different speakers, covering a range of accents and genders (47% male, 32% female, 1% child). The recording environment reflects the variability and complexity of real-world conditions, including overlapping speech, environmental noise, and background music, which are annotated in the corpus. Speech segments were rated on a four-level audio quality scale, ranging from inaudible to clean and realistic sound recordings. Approximately 71% of the speech data comes from abnormal situations, enhancing the dataset’s relevance for studying emotional speech under high-stakes, dynamic conditions.
Annotation was conducted using a multimodal tool (ANVIL) and included two levels of emotional descriptors: categorical (fear, other negative emotions, neutral, and positive emotions) and dimensional (intensity, evaluation, and reactivity). Fear was further categorized into subtypes, such as stress, terror, and anxiety, to capture nuanced variations. Two annotators, one a native English speaker and one bilingual, labeled the emotional content independently, with inter-annotator agreement assessed in subsequent evaluations. A supplementary blind annotation, focusing on audio cues alone, was performed to assess fear detection without video context. The annotation scheme is methodologically sound; however, greater transparency about inter-annotator agreement values and annotator training could improve confidence in the reliability of the annotations. Despite these gaps, the methodology aligns well with the corpus’s goals and makes it a valuable resource for studying complex emotional dynamics in real-world scenarios.
The originality of this corpus lies in its focus on extreme fear-related emotions in dynamic and contextually rich scenarios, which are underrepresented in existing corpora. Its emphasis on capturing emotional manifestations within task-dependent contexts makes it a useful resource for studying emotion detection in challenging and high-stakes environments. Additionally, the corpus provides insights into the interplay between environmental factors, speaker variability, and emotional expression, contributing to research on speech emotion recognition in complex, real-world conditions.

6.15. EmoTaboo

This corpus ([74], 2007) was developed to capture multimodal emotional expressions during human–human interactions in a task-oriented context. The dataset focuses on emotions elicited during a game scenario, where participants played an adapted version of the Taboo game designed to provoke spontaneous emotional reactions. The corpus consists of approximately 8 h of video and audio data, recorded from 10 pairs of participants, yielding a total of 20 individual sessions. Participants alternated roles as a “mime” or “guesser” in the game, tasked with describing or guessing specific words under time constraints, introducing elements of stress, frustration, and amusement. The recording setup used a controlled lab environment, where four camera angles captured participants’ facial expressions and upper body gestures, while high-quality microphones recorded speech. The experiment was structured to elicit a variety of emotional responses through challenging word prompts, time pressure, and penalties for incorrect guesses. While the controlled setting ensured high-quality multimodal recordings, the use of a confederate (a scripted participant) in some interactions could introduce potential biases in elicited emotional responses, as their actions were designed to provoke specific reactions.
The annotation process employed a hierarchical framework to label emotional expressions, mental states, and communication acts. Annotators could select up to five emotional labels per segment, allowing the dataset to reflect the complexity of human emotions. A total of 21 emotion labels were used, including amusement, frustration, stress, pride, and embarrassment, along with dimensions such as intensity and valence to capture subtleties in emotional expression. The annotations also included cognitive and social states, providing insights into both individual emotions and interpersonal dynamics. Annotator agreement and validation were pursued through iterative processes, supporting reliable emotional labeling. Overall, the annotation scheme captures the nuanced nature of emotional interactions; while allowing up to five labels per segment adds flexibility, the fixed list of labels may constrain the ability to annotate unanticipated or subtle emotional states.
The EmoTaboo Corpus stands out for its multimodal integration of speech, gestures, and facial expressions in a dyadic interaction setting. Its focus on spontaneous emotional reactions in task-oriented interactions makes it a valuable resource for studying emotion dynamics in real-time communication. While its lab-controlled environment may limit its generalizability to fully naturalistic settings, the corpus provides a rich dataset for exploring the interplay of multiple modalities in emotional expression.

6.16. CASIA

The CASIA Mandarin corpus is a Chinese emotional corpus recorded by the Institute of Automation, Chinese Academy of Sciences, and designed for Mandarin speech synthesis research ([75], 2008). CASIA comprises 9600 short Mandarin utterances covering six different emotions: sadness, anger, fear, surprise, happiness, and neutral. The audio samples were recorded by four speakers (two males and two females, aged 25 to 45) in a professional recording studio equipped with a dedicated sound card and large-diaphragm microphones.

6.17. IEMOCAP

The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database ([50], 2008) is a comprehensive resource developed at the University of Southern California to study multimodal emotional communication through speech, gestures, and facial expressions. The corpus consists of approximately 12 h of data, collected from ten professional actors (five male and five female) participating in five recording sessions. Each session involved a pair of actors performing both scripted and improvised scenarios designed to elicit a wide range of emotional expressions. The covered emotions include happiness, anger, sadness, frustration, excitement, and neutral, with additional blended emotional states to reflect the complexity of real interactions. By integrating both scripted and spontaneous elements, the setup balances the naturalness of emotions with the consistency needed for controlled analysis, while the use of 61 motion capture markers, high-quality audio, and two high-resolution cameras ensures comprehensive multimodal data collection. This setup makes it a versatile and reliable resource for studying the interplay between verbal and nonverbal emotional communication. The annotation process employs a robust dual framework, using both categorical descriptors (happiness, anger, sadness, frustration, and neutral) and dimensional attributes (valence, activation, and dominance) rated on a five-point scale, allowing the corpus to capture both discrete and continuous aspects of emotional states. The involvement of six annotators, with overlapping turns reviewed by three evaluators, ensures reliability while accounting for blended emotions and nuanced expressions. This approach enhances the dataset’s flexibility and granularity, making it well-suited for analyzing real-life emotional complexities.
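The dual annotation scheme described above can be pictured as a per-turn record that carries both a categorical vote and dimensional ratings. The sketch below is a hedged illustration of that idea only: the record layout, identifiers, and ratings are invented and do not correspond to the corpus's actual distribution format.

```python
# Sketch (not the corpus's actual file format) combining the two annotation
# views described above: a categorical label plus valence/activation/dominance
# scores on a 1-5 scale, averaged over several evaluators.
from statistics import mean

turn = {
    "turn_id": "example_turn_001",            # hypothetical identifier
    "category_votes": ["frustration", "anger", "frustration"],
    "vad_ratings": [                           # one (V, A, D) triple per evaluator
        (2, 4, 3),
        (1, 4, 4),
        (2, 5, 3),
    ],
}

# Majority vote for the categorical label
category = max(set(turn["category_votes"]), key=turn["category_votes"].count)
# Mean of each dimension across evaluators
valence, activation, dominance = (mean(dim) for dim in zip(*turn["vad_ratings"]))
print(category, round(valence, 2), round(activation, 2), round(dominance, 2))
```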
The IEMOCAP Corpus is widely regarded as a pioneering resource in the field of multimodal emotion research, combining innovative technology and rigorous methodology to set a benchmark for emotion recognition studies. Its integration of speech, gestures, and facial expressions, paired with its detailed annotation framework, has significantly influenced subsequent work in SER, affective computing, and human–computer interaction. Its foundational contributions continue to make it a critical resource for advancing research in these domains.

6.18. ITA-DB

The Italian Emotional Database (ITA-DB) Corpus ([76], 2008) is an Italian emotional speech database developed to support emotion recognition in judicial contexts. This corpus addresses the lack of Italian datasets for studying emotional speech, with a focus on the specific dynamics of courtroom debates. It comprises 391 samples of emotional speech, sourced from 40 movies and TV series dubbed by Italian professional actors. The database includes five emotional categories: anger, fear, joy, sadness, and neutral, representing emotions deemed relevant to judicial scenarios. The use of high-quality dubbed audio ensures clear recordings and speaker diversity, as the samples include multiple actors across various productions. However, the reliance on acted speech from entertainment media may not fully capture the complexity of emotional expressions in judicial settings. Courtroom proceedings often involve nuanced emotional states such as confusion, frustration, resignation, or moral indignation, which are difficult to simulate authentically. These complexities are not covered by the selected emotional categories, potentially limiting the corpus’s applicability to the judicial domain. Annotation was based on the predefined emotional categories, with an effort to balance representation across classes. However, the paper lacks details about the annotation process, such as the criteria used for labeling and the reliability of the annotations.
The originality of the ITA-DB Corpus lies in its attempt to create an Italian-specific emotional speech database for a specialized application, addressing a gap in existing resources. However, the choice of emotions and the reliance on acted content may limit its ability to fully capture the subtle and layered emotions typical of courtroom interactions. While it represents a valuable step forward for Italian emotion recognition research, the corpus highlights the need for more context-specific data to better support applications in judicial and legal environments. This observation emphasizes the importance of developing a more comprehensive Italian emotional speech corpus that reflects the complexity of real-world emotional expressions in specialized domains.

6.19. VAM

The Vera am Mittag Corpus ([77], 2008) is a German audio-visual emotional speech database developed to study spontaneous emotional expressions in natural, unscripted interactions. The corpus was recorded from 12 broadcasts of the German TV talk show “Vera am Mittag” (translated as “Vera at Noon”), aired on the Sat.1 channel between December 2004 and February 2005. It comprises 12 h of data segmented into 45 dialogues and further into 1421 utterances from 104 speakers aged 16 to 69 years, with 70% aged 35 or younger, reflecting a broad demographic. Only the set of “good” and “very good” speakers was considered, for a total of 47 speakers (11 males and 36 females). While the public entertainment setting adds to the naturalistic quality of the corpus, elements such as background noise, applause, and audience or moderator influence may introduce variability during analysis. The setup captures audio-visual signals from spontaneous and emotionally rich discussions, focusing on personal topics such as romantic affairs, family disputes, and friendship crises. The discussions are moderated by a host and involve 2 to 5 participants per dialogue, creating an environment conducive to emotional variability. The segmentation includes extracting individual utterances, storing audio files, and accompanying visual frames.
Emotion annotations utilize a dimensional framework, evaluating utterances along three primitives: valence (negative to positive), activation (calm to excited), and dominance (weak to strong). The annotation process involved 17 human evaluators for the initial set (499 utterances) and 6 evaluators for the extended set, ensuring rigorous labeling. The Self-Assessment Manikin (SAM) was used as the evaluation tool, with a five-point scale per dimension whose ratings were mapped to values between −1 and +1, capturing the nuanced nature of emotions beyond discrete categories. This framework is particularly well-suited for analyzing spontaneous and mixed emotional states, with the involvement of multiple evaluators adding robustness to the labeling process. The resulting corpus covers a diverse emotional range, though emotions lean towards neutral or negative states due to the topics discussed.
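Mapping five-point SAM ratings onto the [−1, +1] range and averaging over evaluators can be done with a few lines of code. The sketch below assumes a simple linear mapping and invented evaluator ratings; it is an illustration of the general idea, not the exact procedure used by the corpus authors.

```python
# Sketch of mapping five-point SAM ratings to [-1, +1] and averaging across
# evaluators; the linear mapping and the ratings below are assumptions.
import numpy as np

def sam_to_unit_range(rating_1_to_5):
    return (rating_1_to_5 - 3) / 2.0   # 1 -> -1, 3 -> 0, 5 -> +1

# Hypothetical ratings from several evaluators for one utterance:
# each row is (valence, activation, dominance) on the 1-5 SAM scale.
ratings = np.array([[2, 4, 3],
                    [1, 5, 4],
                    [2, 4, 3]], dtype=float)

mapped = sam_to_unit_range(ratings)
print(mapped.mean(axis=0))   # per-dimension scores in [-1, +1]
```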
This corpus is notable for its spontaneity, stemming from real-life, unscripted interactions, and its use of both audio and visual modalities. Its dimensional annotation scheme allows for a nuanced understanding of emotion transitions and person-specific expression patterns.

6.20. SAVEE

Haq and Jackson ([78], 2010) compiled an audio-visual database using recordings from four English male actors portraying seven emotions in a controlled setting. The dataset includes six basic emotions—anger, disgust, fear, happiness, sadness, and surprise—along with a neutral state. Each actor contributed 120 utterances, resulting in a total of 480 sentences. Each recording session included 15 phonetically balanced sentences from the TIMIT database for each emotion: 3 common sentences, 2 emotion-specific sentences, and 10 generic sentences unique to each emotion. During the recordings, the emotions to be acted, and corresponding text prompts were displayed on a monitor in front of the actors. The audio was recorded using a Beyerdynamic microphone. Evaluation of the dataset was carried out by 10 participants, including 5 native English speakers and 5 individuals who had resided in the UK for over a year.

6.21. IITKGP-SEHSC

The Indian Institute of Technology Kharagpur Simulated Emotion Hindi Speech Corpus (IITKGP-SEHSC) is a pioneering emotional speech dataset specifically created to analyze emotions in Hindi, filling a significant gap in the resources available for Indian languages ([79], 2011). This dataset was developed using professional artists from Gyanavani FM radio station in Varanasi, India, ensuring high-quality and consistent emotional expressions. It includes 12,000 utterances, based on 15 neutral Hindi text prompts performed by 10 speakers (5 male, 5 female) across eight emotional categories: anger, disgust, fear, happiness, neutral, sadness, sarcasm, and surprise. Each emotion contains 1500 utterances, with the total duration of the corpus reaching approximately nine hours. The inclusion of sarcasm, while not a basic emotion, adds a unique layer of complexity to the dataset, addressing practical nuances in emotional communication that are often overlooked in traditional corpora.
The data collection process took place in a quiet environment using a single SHURE dynamic cardioid microphone (C660N) at a distance of one foot from the speaker. Speech signals were recorded at a sampling rate of 16 kHz with 16-bit resolution, ensuring high fidelity. Sessions were scheduled on alternate days to capture natural variations in human speech production, and each artist recorded all 15 sentences consecutively for a single emotion to maintain coherence. The use of professional radio artists and a structured recording protocol ensured consistent quality, while the setup provided a robust foundation for capturing both nuanced and exaggerated emotional expressions. The emotions expressed in the database were evaluated through subjective listening tests by 25 postgraduate and research students at IIT Kharagpur. These tests assessed the naturalness and clarity of the simulated emotions, achieving an average recognition accuracy of 71% for male speakers and 74% for female speakers. Confusion matrices highlighted the reliability of the dataset, particularly for emotions such as anger and sadness, which were effectively classified. While the reliance on subjective evaluation provided meaningful insights, further details on inter-annotator agreement metrics or annotation guidelines could enhance transparency and consistency in the labeling process.
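Recognition accuracies of the kind reported for these listening tests are derived from a confusion matrix of intended versus perceived emotions. The sketch below shows that computation on an invented matrix with a reduced set of emotions; the counts are illustrative only.

```python
# Sketch of deriving per-emotion and average recognition accuracy from a
# listening-test confusion matrix; the counts below are invented.
import numpy as np

emotions = ["anger", "sadness", "happiness"]
# Rows: intended emotion, columns: emotion chosen by listeners.
confusion = np.array([[40,  5,  5],
                      [ 6, 38,  6],
                      [ 8,  7, 35]])

per_class = confusion.diagonal() / confusion.sum(axis=1)
for emo, acc in zip(emotions, per_class):
    print(f"{emo}: {acc:.0%}")
print(f"average accuracy: {per_class.mean():.0%}")
```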
The IITKGP-SEHSC corpus is an important resource in the field of emotional speech research, offering a linguistically diverse dataset tailored to Hindi. By including prosodic and spectral analyses and addressing both basic emotions and nuanced expressions like sarcasm, it expands the scope of research into emotional communication. This corpus represents a meaningful step toward developing emotion recognition systems for diverse linguistic and cultural contexts, contributing to the broader understanding of emotional expressions in speech.

6.22. ITA-DB-RE (Real Emotions)

The Italian Emotional Database (Real Emotions) was developed to address the limitations of acted emotional databases by capturing authentic emotional expressions in judicial contexts ([80], 2011). This corpus focuses on real-life emotional states, recorded during 30 trials across seven courts in Italy, resulting in a total of 135 h of audio recordings. An experienced Italian female labeler performed a manual segmentation of the recordings to isolate speech samples, removing noisy segments and overlapping speakers. This process resulted in 522 speech samples with durations ranging from 2 to 25 s and an average length of 18 s. To address the overrepresentation of neutral samples, the dataset was refined to a balanced set of 175 samples, consisting of 88 neutral, 68 angry, and 19 sad segments.
While these emotions are relevant to courtroom proceedings, the limited selection may not fully capture other significant states such as frustration or confusion, which are also common in such high-stakes environments. The participants included judges (46 samples), witnesses (67 samples), lawyers (29 samples), and prosecutors (33 samples), with a gender distribution of 95 males and 80 females, reflecting the diverse roles and perspectives in the courtroom. The recording setup ensured naturalistic data collection by capturing live courtroom proceedings without intervention, preserving the spontaneity and authenticity of emotional expressions. The annotation process involved manual labeling of the samples into neutral, anger, and sadness, carried out by a team of two experienced and one naïve labeler. However, the paper does not elaborate on the specific criteria used for labeling or the agreement metrics between the annotators, leaving some aspects of the methodology’s consistency and reliability open to interpretation. Despite this, the resulting dataset reflects a well-curated selection of real-world emotional expressions tied to the unique context of judicial proceedings.
This corpus represents a significant advancement from the earlier Italian corpus (ITA-DB [76]), which relied on controlled, performed emotions sourced from movies and TV series. The real-life setting of the ITA-DB-RE database represents a substantial step forward in depicting the complexities of genuine emotional interactions, transitioning from acted to authentic emotional data within the specialized context of courtroom interactions. By capturing real-world emotional dynamics, this corpus addresses the need for datasets reflecting genuine interactions in specialized domains. Its relevance lies in supporting emotion recognition systems for judicial and legal applications, though its limited methodological focus leaves some gaps in documenting crucial details. Additionally, given the sensitivity of the courtroom context, the lack of detailed information about how recordings were obtained, including participant consent or ethical approvals, raises some uncertainty regarding compliance with privacy and data protection standards, which were less commonly documented at the time. Nonetheless, the corpus marks a notable step forward in Italian emotional corpora by incorporating authentic emotional expressions relevant to high-pressure environments like courtrooms.

6.23. TESS

The Toronto Emotional Speech Set (TESS [81], 2011) is an English-language emotional speech database designed to examine the recognition of emotions across different age groups. It includes 2800 audio recordings, created by embedding 200 semantically neutral target words into the carrier phrase Say the word __. These words were articulated by two female actors: one aged 26 and the other 64, both native English speakers from the Toronto area. The dataset captures seven distinct emotional states: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral, providing a broad emotional range suitable for research in affective computing and emotion recognition. To ensure high-quality audio and minimize noise interference, the recordings were conducted under controlled conditions. Both actors underwent audiometric testing to confirm normal hearing thresholds, ensuring reliable delivery of emotional expressions. Their backgrounds in musical training and university education contributed to the clarity and consistency of their portrayals. Furthermore, the use of phonemically balanced words embedded in a standardized carrier phrase enhanced the precision of emotional prosody analysis, making the dataset highly suitable for detailed emotion studies.
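The 2800 stimuli follow directly from the factorial design of 200 target words, two actors, and seven emotions. The sketch below enumerates that design; the word list is a placeholder, since the actual 200 target words are not listed here.

```python
# Sketch of the factorial design described above: 200 target words x 2 actors
# x 7 emotions = 2800 stimuli. The word list is a placeholder.
from itertools import product

target_words = [f"word_{i:03d}" for i in range(200)]   # placeholder for the 200 target words
actors = ["younger_female", "older_female"]
emotions = ["anger", "disgust", "fear", "happiness",
            "pleasant_surprise", "sadness", "neutral"]

stimuli = [
    {"actor": a, "emotion": e, "prompt": f"Say the word {w}"}
    for a, e, w in product(actors, emotions, target_words)
]
print(len(stimuli))  # 2800
```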
Evaluations focused on listener perception, achieving above-chance accuracy in emotion recognition, which underscores the reliability of the actors’ portrayals. Although the reliance on acted emotions ensures consistency, it may limit the applicability of the dataset to naturalistic contexts. Nonetheless, the TESS corpus provides valuable insights into how emotions are perceived across different age groups, making it a specialized resource for studies in speech emotion recognition, affective computing, and age-related auditory processing. By addressing a gap in emotional speech datasets focusing on older and younger speakers, it contributes to understanding the interplay of age and emotion in vocal communication, though its scope remains limited due to the small number of speakers.

6.24. SEMAINE

The SEMAINE corpus ([82], 2012) is a detailed multimodal database developed to study emotionally colored conversations between humans and artificial agents, highlighting the spontaneous emergence of emotions within specific contexts. The corpus includes 959 conversations recorded from 150 participants, each lasting approximately 5 min, resulting in 80 h of synchronized audiovisual data. High-quality recordings were captured using five high-resolution cameras and four microphones, ensuring robust multimodal integration. The corpus spans multiple emotional categories, including anger, disgust, fear, happiness, sadness, surprise, and neutral, providing a comprehensive foundation for affective computing research.
The interactions are divided into three scenarios: Solid SAL, Semi-automatic SAL, and Automatic SAL, balancing human-driven and automated system responses to capture diverse emotional expressions. In Solid SAL, a human operator simulated emotional stances using nonverbal cues like eye contact, producing 190 recordings from 95 character interactions via a teleprompter for natural exchanges. Semi-automatic SAL utilized scripted responses with varied feedback types (audio-visual, video-only, filtered audio), yielding 144 degraded and 44 baseline recordings to examine communication breakdowns. Automatic SAL, featuring an autonomous system detecting facial actions, gestures, and prosodic cues, generated 964 recordings across two versions—one fully capable of nonverbal communication and another limited in skills. The annotation methodology was rigorous, combining multiple raters with the FEELtrace system to capture valence and activation dimensions continuously over time. In Solid SAL, user clips were annotated by eight raters for 4 sessions, six raters for 17 sessions, and at least three raters for the remaining sessions. Operator clips had four raters for three sessions, while the rest were annotated by a single rater. Eleven user sessions in Semi-automatic SAL were annotated by two raters, while all Automatic SAL sessions were annotated by a single rater. The process incorporated five core traces—valence, activation, power, anticipation/expectation, and intensity—along with optional descriptors to capture nuanced emotional and interaction dynamics. Solid SAL featured the most detailed annotations, including both core dimensions and optional traces, while Semi-automatic SAL maintained a similar level with fewer raters. In Automatic SAL, engagement tracing focused on user responses to the autonomous system, emphasizing real-time interaction quality. This structured approach provided a nuanced representation of emotional dynamics, ensuring granularity while acknowledging some variability in rater numbers across scenarios.
By integrating high-quality audiovisual recordings, diverse emotional categories, and sophisticated annotation techniques, the SEMAINE corpus addresses critical gaps in multimodal emotion research. The three interaction scenarios play a crucial role in capturing a wide spectrum of emotional exchanges, ranging from human-driven interactions to autonomous system responses. This focus on real-time, emotionally rich interactions bridges the divide between human-agent and human–human communication studies. The corpus is a significant resource for developing emotionally aware systems and has become a valuable tool in the field of affective computing and emotion recognition.

6.25. CREMA-D

The Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D [83], 2014) is an English-language multimodal dataset designed to study emotion perception across three modalities: audio-only, visual-only, and audio-visual. It includes 7442 clips from 91 actors (48 male, 43 female) of diverse ages (20–74 years, mean 36) and ethnicities, expressing six basic emotions: happy, sad, anger, fear, disgust, and neutral. The dataset is structured around 12 semantically neutral sentences, performed under the guidance of professional theater directors to ensure consistent and expressive emotional portrayals. Recordings were conducted in a sound-attenuated environment, with actors seated against a green screen, using high-quality audio and video equipment to capture clear signals. This controlled setup, combined with a diverse actor pool, enhances the dataset’s demographic variability and suitability for multimodal emotion analysis. The annotation process utilized 2443 raters, likely through a well-established crowd-sourcing platform, to ensure broad participation and diversity in evaluations. Each rater annotated only a subset of clips, enabling efficient task distribution. Multiple evaluations per clip ensured consensus ratings, distinguishing matching, non-matching, and ambiguous samples. The dataset achieved moderate inter-rater agreement overall (Krippendorff’s alpha of 0.42) but strong agreement for unambiguous clips (alpha of 0.79), confirming reliability for high-quality samples. Recognition rates also varied significantly across modalities, with audio-visual stimuli achieving the highest accuracy (63.6%), followed by visual-only (58.2%) and audio-only (40.9%), highlighting the complementary nature of multimodal emotion analysis.
The CREMA-D corpus is a significant contribution to emotional speech and multimodal datasets, offering a diverse and large-scale resource with controlled recording conditions and extensive crowd-sourced annotations. It captures emotional expressions across modalities, providing valuable insights into single and multimodal emotion perception. Its design has supported advancements in affective computing, human–computer interaction, and emotion recognition research, solidifying its place as a widely used resource in the field.

6.26. EMOVO

EMOVO ([84], 2014) is the first Italian emotional speech database specifically designed to study and develop speech emotion recognition systems. The corpus consists of recordings from six professional actors (three male and three female), aged between 23 and 30 years, who performed 14 sentences in seven emotional states: neutral, disgust, fear, anger, joy, surprise, and sadness. These emotions, referred to as the “Big Six” plus neutral, are widely recognized in emotional speech research. The sentences included a mix of semantically neutral and nonsense phrases to avoid semantic bias in emotion recognition. This careful design ensures the corpus’s applicability in both research and practical applications.
Conducted in the laboratories of the Fondazione Ugo Bordoni in Rome, the recordings were made using professional-grade equipment, including Shure SM58LC microphones and a Marantz PMD670 digital recorder. Each actor recorded all 14 sentences in the seven emotional states, resulting in 588 utterances, with approximately 10 min of material per actor. Actors were encouraged to move naturally, introducing some variability in signal intensity due to changes in distance from the microphone, a minor factor that may influence processing. Emotional performances were validated through a subjective emotion discrimination test with 24 listeners, achieving an 80% overall recognition accuracy. Neutral, anger, and sadness were the most easily recognized emotions, while joy and disgust were less distinct. This validation, though limited to categorical assessments, confirmed the reliability of the actors’ performances and the robustness of the dataset for emotion recognition tasks.
The EMOVO Corpus has become a cornerstone for SER research in Italian, addressing the lack of emotional speech datasets in this linguistic context. Its inclusion of the “Big Six” emotions plus neutral, controlled laboratory conditions, and rigorous validation have made it a benchmark resource. The corpus has greatly influenced Italian emotional speech research, supporting advancements in human–computer interaction and speech synthesis, and remains a key resource for developing and evaluating emotion recognition systems.

6.27. CHEAVD

The CASIA Chinese Natural Emotional Audio-Visual Database (CHEAVD [85], 2016) is a large-scale, multimodal resource designed to support research in multimodal emotion recognition, natural language understanding, and human–computer interaction. The dataset includes 140 min of emotion-rich audio-visual segments extracted from 34 films, 2 TV series, and 4 TV shows. The selection process excluded science fiction and fantasy genres, prioritized materials featuring actors’ original voices over dubbed versions, and focused on Mandarin-speaking content with minimal accents. This corpus contains 2600 segments, each lasting 1 to 19 s, with an average duration of 3.3 s. It features recordings of 238 speakers, balanced across genders (52.5% male, 47.5% female), and spanning a wide age range from 11 to 62 years, segmented into six categories, underscoring its utility for robust, speaker-independent emotion analysis. Data was carefully selected to represent real-life emotional expressions, prioritizing scenarios closely tied to daily life. Raw materials were drawn from films and television series reflecting realistic environments, chat shows, talk shows, and impromptu speech programs. Strict criteria for segmentation ensured high-quality data, with segments containing only one speaker’s speech and facial expressions, minimal noise, and complete utterances.
The annotation process for CHEAVD employed a multi-step strategy to ensure nuanced and contextually relevant emotional labels. Four native Chinese annotators labeled each segment, focusing on both primary and secondary emotions, resulting in a rich annotation framework. The process spanned 26 emotion categories, encompassing basic emotions like happy, sad, and angry, as well as nuanced states such as shy, sarcastic, and hesitant. Notably, the inclusion of labels for fake or suppressed emotions added depth, capturing the complex interplay between internal emotional states and external expressions. While the wide range of categories enriches the dataset’s granularity, it also posed challenges in maintaining annotation consistency, as reflected in moderate inter-annotator agreement (Cohen’s kappa values around 0.5). This outcome highlights the inherent difficulty of labeling nuanced emotions and balancing granularity with reliability. The final annotations were curated through consensus discussions to mitigate discrepancies and achieve a coherent framework. This rigorous annotation process underscores the dataset’s potential for advancing research in naturalistic emotion modeling and its ability to support both categorical and continuous emotional analyses.
CHEAVD’s integration of multimodal data (audio, visual, and speech) and its emphasis on capturing non-prototypical and subtle emotional expressions make it a noteworthy resource. By addressing cultural and linguistic gaps in existing emotion datasets, it provides a tailored platform for Mandarin emotion recognition and cross-cultural studies. The inclusion of baseline emotion recognition experiments using LSTM-RNN models with a soft attention mechanism, achieving an average accuracy of 56% for six major emotions, highlights its practical relevance for developing and evaluating advanced multimodal systems.

6.28. MSP-IMPROV

The MSP-IMPROV corpus ([86], 2016) is a multimodal emotional dataset designed to explore emotion perception and recognition, balancing naturalness and control. It includes 8438 dyadic conversational turns from 12 actors (6 males, 6 females) recorded across six sessions. The dataset comprises target-improvised (652 samples), target-read (620 samples), other-improvised (4381 samples), and natural interactions (2785 samples), covering emotions such as happiness, sadness, anger, and neutrality. Recorded in a soundproof booth, it combines dyadic interactions with a mix of scripted sentences and improvised dialogue to capture emotional depth. The corpus incorporates both audio and visual data, allowing for a comprehensive analysis of emotional expression through speech, facial expressions, and body language. Actors, recruited from a theater program, were guided by designed hypothetical scenarios to blend scripted content with spontaneous expressions, ensuring both authenticity and control while enhancing the emotional dynamics of the interactions.
The annotation process used crowd-sourcing via Amazon Mechanical Turk to evaluate the emotional content of the samples. A reference set of 652 Target-Improvised sentences was preannotated to monitor evaluator performance in real time. This approach ensured unreliable annotators were identified and stopped mid-task. The dataset achieved a Fleiss’ Kappa statistic of 0.487, which indicates moderate agreement among evaluators and is comparable to other spontaneous emotional corpora. Agreement levels varied slightly by subset: Target-Improvised (k = 0.497), Target-Read (k = 0.479), Other-Improvised (k = 0.458), and Natural Interaction (k = 0.487). The Target-Improvised sentences were evaluated by an average of 28.2 annotators, while other subsets had at least five annotators, ensuring robust emotional ratings. This method combined a reference set with dynamic evaluator monitoring, demonstrating methodological rigor in emotional corpus annotation.
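Fleiss' kappa, used here to summarize agreement among many crowd-sourced raters, can be computed from a turns-by-raters matrix of categorical codes. The sketch below uses statsmodels with an invented ratings matrix; it illustrates the metric rather than the corpus's actual evaluation pipeline.

```python
# Sketch of computing Fleiss' kappa for multi-rater categorical labels with
# statsmodels; the ratings matrix below is invented for illustration.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows: conversational turns, columns: annotators; values: emotion codes
# (0 = neutral, 1 = happiness, 2 = anger, 3 = sadness).
ratings = np.array([
    [1, 1, 1, 2, 1],
    [0, 0, 0, 0, 3],
    [2, 2, 1, 2, 2],
    [3, 3, 3, 3, 3],
])

table, _ = aggregate_raters(ratings)   # per-turn counts for each category
print(f"Fleiss' kappa: {fleiss_kappa(table, method='fleiss'):.3f}")
```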
The MSP-IMPROV corpus bridges the gap between acted and naturalistic datasets by integrating controlled recording conditions with the spontaneity of improvisation. Its focus on dyadic interactions and audiovisual modalities addresses critical gaps in emotional speech research, providing a valuable resource for studying emotion perception and recognition. The corpus has contributed significantly to advances in affective computing, human–computer interaction, and multimodal emotion analysis, establishing itself as an important dataset for research on realistic emotional communication.

6.29. NNIME

The NTHU-NTUA Chinese Interactive Multimodal Emotion Corpus (NNIME) is a large-scale, publicly available resource designed for the analysis of multimodal emotional interactions in Mandarin Chinese ([87], 2017). It was developed through collaboration between engineers and drama experts, emphasizing the study of dyadic human interactions to mirror real-life emotional exchanges. The dataset includes recordings of 44 professional actors (24 females, 20 males) aged 19–30, all trained in dramatic arts and native Mandarin speakers. These actors were paired into 22 dyads (seven female–female, ten female–male, five male–male) and performed spontaneous interactions targeting six emotional states: anger, sadness, happiness, frustration, neutral, and surprise. Each session lasted approximately 3 min, resulting in a total of 102 sessions and 11 h of synchronized multimodal data, including audio, video, and electrocardiogram (ECG) recordings.
The recording environment reflects meticulous planning to balance methodological rigor and ecological validity. Sessions were conducted in controlled settings modeled after daily-life contexts, such as dormitories or living rooms, enhancing the authenticity of interactions. High-definition camcorders and wireless microphones ensured high-quality audiovisual recordings, while wearable ECG devices collected physiological signals, offering unique insights into the interplay between external behavioral cues and internal emotional states. Synchronization across modalities was achieved using a clapboard technique, enabling seamless multimodal analysis. Emotion annotations are particularly comprehensive, employing a multi-perspective approach with peer reports (44 raters), director assessments (1 rater), self-reports (1 rater), and observer evaluations (4 raters), providing a total of 49 unique perspectives. Discrete annotations covered categorical emotions and valence-activation ratings on a 1-to-5 scale, while continuous annotations captured dynamic emotional flows using the FEELtrace tool, rated by four naive observers. This dual focus ensures both categorical clarity and temporal granularity in the emotional data. Post-processing involved meticulous segmentation of 6701 utterances into speech and non-verbal categories like laughter and sobbing, alongside ECG signal de-noising for deriving heart rate variability features. While the breadth of perspectives enriches the dataset, managing such variability requires robust calibration to ensure consistency across raters.
NNIME’s distinctiveness lies in its integration of external behaviors and internal physiological responses, providing a resource for exploring the relationship between expressed and felt emotions. Its emphasis on dyadic interactions and multimodal data collection, including audio, video, and physiological signals, offers valuable opportunities for studying interaction dynamics and the interplay of verbal, non-verbal, and physiological cues. By addressing the scarcity of large-scale Mandarin datasets, NNIME contributes meaningfully to cross-cultural emotion research and supports advancements in affective computing and emotion recognition systems.

6.30. AESDD

The Acted Emotional Speech Dynamic Database (AESDD) ([88], 2018) was developed to address the limitations of existing emotional speech databases, taking inspiration from the SAVEE database as a reference. It comprises acted speech utterances and is designed to expand continuously over time. To create the initial version of the AESDD, five professional actors (two male and three female), aged between 25 and 30 years, were hired. All AESDD utterances are in Greek.
The recordings took place in the sound studio of the Laboratory of Electronic Media at Aristotle University of Thessaloniki, Greece, which provided an ideal acoustic environment for high-quality recordings. The spoken and recorded phrases were sourced from theatrical scripts. Specifically, 19 utterances were selected from various plays to form the database, chosen for the emotional ambiguity of their context. These 19 sentences were performed by the actors in Greek, across five different emotional states: happiness, sadness, anger, fear, and disgust. Additionally, for each emotion, one extra improvised utterance was recorded, and multiple recordings were made for some sentences, resulting in approximately 500 emotional speech utterances (5 actors × 5 emotions × 20 utterances). A dramatology expert supervised the recordings, offering guidance to the actors and making necessary adjustments to ensure the quality and appropriateness of the acted speech. During preprocessing, all utterances were appropriately normalized to a peak of −3 dB.
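Peak normalization to −3 dB of the kind mentioned above is a simple gain adjustment. The sketch below shows one way to apply it with the soundfile library; the file name is a placeholder, and this is an illustration of the operation rather than the preprocessing script used by the database authors.

```python
# Sketch of peak-normalizing an utterance to -3 dBFS, as described above;
# the input file name is a placeholder.
import numpy as np
import soundfile as sf

y, sr = sf.read("utterance.wav")
target_peak = 10 ** (-3 / 20)        # -3 dBFS expressed as a linear gain
peak = np.max(np.abs(y))
if peak > 0:
    y = y * (target_peak / peak)     # scale so the loudest sample reaches -3 dBFS
sf.write("utterance_normalized.wav", y, sr)
```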

6.31. ANAD

The development of the Arabic Natural Audio Dataset (ANAD) ([89], 2018) is motivated by the goal of aiding hearing-impaired and deaf individuals in enhancing their daily communication. By integrating an effective emotion recognition system with a reliable speech-to-text system, the aim is to enable successful phone communication between deaf or hearing-impaired individuals and others. To achieve this, the researchers focused on collecting natural phone call recordings to build the corpus. Eight videos of live calls between an anchor and an individual outside the studio were downloaded from online Arabic talk shows. All the videos are publicly available; accordingly, the authors concluded that no copyright issues were associated with their use. Eighteen human labelers were tasked with listening to the videos and categorizing each one as happy, angry, or surprised, with the average result used to label each video. The videos were then segmented into turns between callers and receivers, with silence, laughter, and noisy segments removed. Each remaining chunk was automatically divided into 1-s speech units, resulting in a final corpus of 1384 records, comprising 505 happy, 137 surprised, and 741 angry units.
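Splitting a cleaned speech chunk into fixed one-second units, as described above, is a straightforward slicing operation. The sketch below illustrates it with librosa and soundfile; the file name, sampling rate, and the decision to drop the trailing partial unit are assumptions for illustration, not the authors' exact procedure.

```python
# Sketch of dividing a cleaned speech chunk into fixed one-second units, in the
# spirit of the segmentation described above; the file name is a placeholder.
import librosa
import soundfile as sf

y, sr = librosa.load("call_chunk.wav", sr=16000)
unit_len = sr                                        # one second of samples
units = [y[i:i + unit_len] for i in range(0, len(y), unit_len)]
units = [u for u in units if len(u) == unit_len]     # drop the trailing partial unit

for idx, u in enumerate(units):
    sf.write(f"call_chunk_unit_{idx:03d}.wav", u, sr)
print(f"{len(units)} one-second units written")
```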

6.32. CaFE

The Canadian French Emotional (CaFE) speech dataset is the first emotional speech dataset in Canadian French ([90], 2018). The dataset comprises six sentences, pronounced by six male and six female actors, in six basic emotions (sadness, happiness, anger, fear, disgust, and surprise) and one neutral state, with two different intensities. Actors recorded their lines individually in a professional soundproof room. A Blue Microphones Yeti Pro USB microphone, mounted on a tripod with a pop filter, was used for recording; it was set to cardioid mode and connected to a remote Acer Swift 3 laptop via USB. Recording was performed at 192 kHz/24-bit, and actors were positioned freely in front of the microphone. A recording session typically lasted one hour per actor, with three to five takes recorded for each sentence, emotion, and intensity level. Particular attention was paid to the choice of sentences. They were selected to be emotionally neutral from a semantic point of view yet well suited to being uttered with various emotions, reasonably easy to pronounce, and composed of the same number of syllables (eight). Statistics on the phonemic distribution of the chosen sentences are given in the paper, showing that they approach the Wioland distribution of French phonemes (measured on a large corpus combining spoken (radio broadcast) and written (literary texts) French).

6.33. CMU-MOSEI

The CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset is a detailed and diverse resource for multimodal sentiment and emotion analysis ([91], 2018). It consists of 23,453 annotated video segments extracted from 3228 online videos, primarily sourced from YouTube, and features 1000 distinct speakers. The dataset achieves a balanced demographic representation, with 57% male and 43% female speakers, spanning a wide range of ages, accents, and speaking styles. These segments are derived from 250 different topics, offering a mix of conversational styles and naturalistic monologues in English. The corpus spans six emotional categories—happiness, sadness, anger, disgust, fear, and surprise—while also including sentiment annotations on a continuous scale from −3 (highly negative) to 3 (highly positive), enabling fine-grained sentiment analysis.
The design leverages naturalistic video monologues sourced from online platforms, capturing authentic emotional expressions across text, audio, and visual modalities. Its speaker diversity and multimodal alignment provide a strong foundation for generalizable research, though variability in recording quality due to the reliance on public content presents a minor challenge. Each video segment is temporally aligned across modalities, ensuring coherence, with an average duration of approximately 7.28 s. The dataset also includes over 56,000 aligned modality features, offering extensive opportunities for machine learning applications. The annotation process involved over 6000 human raters, who assigned sentiment polarity and emotional intensity scores to each segment. By including both categorical emotion labels and sentiment polarity ratings on a continuous scale, the dataset captures a nuanced representation of affective states. Inter-rater agreement was validated, with efforts to ensure consistency despite the inherent subjectivity of rating emotions. While the specific methodology for training annotators or resolving disagreements is not extensively detailed, the large-scale annotation effort demonstrates a robust approach. This dual focus on categorical and intensity measures adds depth to the dataset, supporting fine-grained analyses of sentiment and emotion dynamics.
This corpus represents a valuable contribution to multimodal affective computing. Its large scale, integration of text, audio, and visual modalities, and focus on naturalistic emotional expressions address important gaps in existing datasets. By providing a diverse speaker pool and well-aligned multimodal features, the corpus supports research in emotion recognition, sentiment analysis, and multimodal machine learning. Its considered design and detailed annotations make it a valuable resource for studying the intricacies of emotional and sentiment interactions in human communication.

6.34. RAVDESS

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS [11], 2018) is a multimodal dataset designed to support research in emotion perception and recognition in a neutral North American accent. It includes 7356 recordings from 24 professional actors (12 male, 12 female; age range = 21–33 years; M = 26.0 years; SD = 3.75). The actors self-identified as Caucasian (N = 20), East-Asian (N = 2), and Mixed (N = 2), with one identifying as East-Asian/Caucasian and another as Black-Canadian/First Nations/Caucasian. The dataset comprises 4320 speech recordings and 3036 song recordings. Each actor produced 104 distinct vocalizations, comprising 60 spoken utterances and 44 song utterances, covering three modality conditions: audio-visual (face and voice), video-only (face, no voice), and audio-only (voice, no face). The dataset spans eight emotional categories (neutral, calm, happy, sad, angry, fearful, surprise, and disgust) for speech and six emotional categories (neutral, calm, happy, sad, angry, and fearful) for song, with emotions expressed at two intensity levels (normal and strong). Actors repeated each vocalization twice, ensuring robustness and diversity within the dataset. The corpus uniquely explores emotional expressions in both speech and song, offering a balanced design that supports versatile applications in speech emotion recognition and multimodal analyses.
The recording sessions were conducted in a controlled studio environment, ensuring consistency and high-quality output. Professional-grade equipment, including synchronized audio and video recording setups, captured speech, facial expressions, and body language with minimal interference. Actors followed a structured protocol, performing two speech sentences and two sung phrases for each emotional state and intensity, resulting in a balanced and comprehensive dataset. Stimuli were presented at a high resolution and processed in sound-attenuated booths using Sennheiser HD 518 headphones to ensure clarity during validation. The distinction between intense and normal vocal expressions enhances the dataset’s versatility, with intense expressions aiding in emotional clarity and normal expressions offering representations closer to real-life emotional nuances. The annotation process involved 319 undergraduate students (76% female, 24% male, mean age = 20.55, SD = 4.65) from Ryerson University, who evaluated each stimulus for emotional category, intensity, and genuineness. Each stimulus received 10 ratings across these three evaluation scales, resulting in a total of 220,680 annotations. Raters classified emotions using a forced-choice response format with options like neutral, calm, happy, sad, and others, alongside a “none of these are correct” escape option, ensuring flexibility in judgments. Plutchik’s wheel of emotion was incorporated to provide structure and improve clarity in categorizing emotional states. To ensure reliability, a test-retest task with 72 additional raters validated the consistency of the ratings, demonstrating high inter-rater agreement for most emotional categories and further solidifying the quality of the dataset. The decision to involve untrained participants in the annotation process proved effective, as evidenced by the high accuracy rates achieved during validation, with 80% accuracy for audio-video, 75% for video-only, and 60% for audio-only conditions.
The RAVDESS corpus is regarded as a benchmark in the field due to its multimodal design, methodological rigor, and inclusion of both speech and song. Its balanced approach to emotion categorization and intensity scaling, coupled with its large-scale validation process, ensures a high level of reliability and utility. This corpus has significantly contributed to the fields of affective computing, speech emotion recognition, and multimodal emotion research, serving as a foundation for modeling complex emotional dynamics in human communication. It is important to note that the RAVDESS paper is comprehensive and meticulously detailed, providing extensive insights into the dataset’s development and validation processes. While this summary focuses on the key aspects of the corpus, some of the nuanced methodology and data intricacies may be oversimplified or omitted for brevity, underscoring the depth and scope of the original work.

6.35. SHEMO

The Sharif Emotional Speech Database (ShEMO [92], 2018) is a large-scale, validated resource for Persian. The database contains 3000 semi-natural utterances, amounting to 3 h and 25 min of speech data sourced from online radio plays. ShEMO features speech samples from 87 native Persian speakers, encompassing five basic emotions—anger, fear, happiness, sadness, and surprise—along with a neutral state. Fifty radio plays from various genres, including comedy, romance, crime, thriller, and drama, were selected as potential sources of emotional speech. Level differences across the audio streams were balanced using Audacity. Since the majority of streams (approximately 90%) had a sampling frequency of 44.1 kHz, the streams with lower sampling rates were upsampled using cubic interpolation. Furthermore, all stereo-recorded streams were converted to mono. Each stream was segmented into smaller parts such that each segment covered the speech of only one speaker without any background noise or effects. Twelve annotators (6 males, 6 females) labeled the emotional states of the utterances, with final labels determined through majority voting. The annotators were all native speakers of Persian with no hearing impairment or psychological problems; their mean age was 24.25 years (SD = 5.25 years), ranging from 17 to 33 years. The inter-annotator agreement, measured by Cohen's kappa, is 0.64, indicating "substantial agreement".
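The stream-harmonization step described above (downmixing stereo recordings to mono and upsampling lower-rate streams to 44.1 kHz via cubic interpolation) is conceptually simple. The following minimal sketch illustrates one way to reproduce it, assuming the soundfile and SciPy libraries rather than the tools actually used by the ShEMO authors; file names are hypothetical.

```python
import numpy as np
import soundfile as sf
from scipy.interpolate import CubicSpline

def harmonize_stream(in_path, out_path, target_sr=44100):
    """Downmix to mono and, if needed, upsample to target_sr via cubic interpolation."""
    audio, sr = sf.read(in_path)
    if audio.ndim > 1:                      # stereo (or multi-channel) -> mono
        audio = audio.mean(axis=1)
    if sr < target_sr:                      # upsample only streams below 44.1 kHz
        t_src = np.arange(len(audio)) / sr
        t_dst = np.arange(int(len(audio) * target_sr / sr)) / target_sr
        audio = CubicSpline(t_src, audio)(t_dst)
        sr = target_sr
    sf.write(out_path, audio, sr)

# Hypothetical usage on a single radio-play stream.
# harmonize_stream("radio_play_raw.wav", "radio_play_44k1_mono.wav")
```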

6.36. BAVED

The Basic Arabic Vocal Emotions Database (BAVED) ([93], 2019) is a collection of recorded Arabic words spoken with various expressed emotions. It includes 1935 audio files recorded from 61 speakers (45 males and 16 females, aged 18 to 23 years). BAVED covers seven words, each recorded at three levels of expressed emotion: Level 0 denotes low emotional intensity (e.g., a tired speaker), Level 1 the neutral level (i.e., how the speaker normally expresses themselves during the day), and Level 2 a high level of either positive or negative emotion (happiness, joy, sadness, anger, etc.). All audio files were sampled at 16 kHz with one audio channel and a bitrate of 256 kbps.

6.37. MELD

The Multimodal EmotionLines Dataset (MELD) is a comprehensive multimodal emotional conversational corpus designed to address the complexities of emotion recognition in multi-party conversations ([94], 2019). Derived from the popular English-language sitcom Friends, the dataset includes 13,000 utterances from 1433 dialogues, significantly expanding its predecessor, EmotionLines, by integrating audio and visual modalities alongside text. The dataset spans seven emotion categories: anger, disgust, fear, joy, sadness, surprise, and neutral. Each utterance is annotated with its emotional label and sentiment, further enhanced by multimodal cues such as facial expressions, vocal intonations, and textual content. This multimodal approach enhances the understanding of nuanced emotions, with visual and auditory cues, such as vocal changes and facial expressions, aiding in accurately capturing emotions like surprise. The use of Friends provides a structured framework for studying emotions in conversational settings, offering a diverse range of emotional expressions practical for training emotion recognition models. While the sitcom’s nature may amplify some portrayals, the dataset remains contextually rich. Multi-party interactions, overlapping speech, and emotional shifts reflect conversational complexities, complemented by high-quality multimodal cues, making it a valuable resource for multimodal emotion recognition research. Annotations were performed by three annotators per utterance, achieving a Fleiss’ Kappa score of 0.43 for emotion labels, reflecting moderate agreement. This process marked a methodological shift from its predecessor, EmotionLines, which utilized annotations via Amazon Mechanical Turk with five workers per utterance and a majority voting scheme, achieving a lower Fleiss’ Kappa score of 0.34. To address the challenges encountered in EmotionLines, MELD introduced an improved annotation strategy, incorporating multimodal cues (audio and visual) alongside textual data to improve context and accuracy. Sentiment annotations in MELD achieved a higher Fleiss’ Kappa of 0.91, indicating robust sentiment labeling. Disagreements in emotional labels were resolved by discarding ambiguous annotations, resulting in a more refined and reliable dataset. By capturing the dynamics between speakers and using the multimodal context, MELD’s annotation process better aligns with the complexities of conversational emotion recognition.
MELD’s contribution lies in its focus on multiparty conversational data, addressing an area that has been relatively underexplored in emotion recognition research. By offering a large-scale dataset with multimodal annotations in English and a subset translated into Chinese, it complements existing resources like IEMOCAP and SEMAINE, which primarily focus on dyadic interactions. MELD’s emphasis on conversational context and inclusion of multi-speaker dynamics make it a valuable resource for advancing research in affective computing, dialogue systems, and multimodal emotion recognition. While not without limitations, its unique design and scalability establish it as an important tool in the study of emotional communication.
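Agreement figures such as MELD's Fleiss' kappa of 0.43 for emotions and 0.91 for sentiment are central to judging annotation quality, so it may help to see how such a score is obtained. The sketch below is a minimal illustration assuming the statsmodels library; the three-annotator label matrix is entirely invented and is not taken from MELD.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical labels: rows are utterances, columns are three annotators;
# the integers 0..6 stand for seven emotion categories.
labels = np.array([
    [0, 0, 3],
    [4, 4, 4],
    [6, 6, 1],
    [2, 2, 2],
])

# Convert per-rater labels into an utterance-by-category count table,
# then compute Fleiss' kappa over that table.
counts, _ = aggregate_raters(labels)
print(f"Fleiss' kappa = {fleiss_kappa(counts):.2f}")
```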

6.38. MSP-PODCAST

The MSP-PODCAST corpus ([95], 2019) is a large-scale, English-language, naturalistic emotional speech database designed to address the limitations of existing resources by leveraging publicly available podcast recordings. The dataset begins with over 84,000 speaking turns extracted from 403 podcasts, covering diverse topics, speakers, and conversational styles. These segments were processed using advanced machine learning models for emotional content retrieval, followed by manual validation to ensure high-quality speech samples. The final corpus contains samples ranging from 2.75 to 11 s, annotated for both categorical emotions (happiness, sadness, anger, disgust, surprise, neutral) and dimensional attributes (valence, arousal, dominance). The podcast-based approach introduces authentic emotional expressions in dynamic conversational contexts, making it significantly more naturalistic than acted datasets. The recording setup capitalized on the inherent diversity of podcasts, with their professional and semi-professional recording conditions ensuring clear and faithful audio. This naturalistic setting captures spontaneous emotional expressions in varied real-world scenarios, enhancing the ecological validity of the corpus. The annotation process employed crowd-sourcing via Amazon Mechanical Turk, where each sample was evaluated by at least five annotators. Real-time tracking of inter-evaluator agreement and quality control measures ensured reliable labeling. This robust annotation framework resulted in a balanced corpus that effectively represents the valence-arousal space, supporting both categorical and dimensional analyses of emotion.
The MSP-PODCAST corpus stands out for its innovative, scalable methodology, combining automated and manual techniques to create a large, emotionally balanced dataset. Its ability to capture naturalistic emotional speech across diverse contexts fills a key gap in emotional speech research. As a benchmark for emotion recognition, it has significantly influenced affective computing and human–computer interaction, advancing machine learning, conversational agents, and multimodal emotion analysis.

6.39. DEMoS

The Database of Elicited Mood in Speech corpus (DEMoS [96], 2020) is an Italian emotional speech database created to address the lack of emotional speech resources in Italian and to advance research in speech emotion recognition. It comprises 9697 samples, including 9365 emotional and 332 neutral samples, collected from 68 speakers (23 females and 45 males), predominantly engineering students aged approximately 23.7 years. The corpus captures seven emotional states: anger, sadness, happiness, fear, surprise, disgust, and guilt, providing a rich variety of affective expressions. Notably, guilt, a rarely represented emotion in similar datasets, adds a unique dimension, enhancing its relevance for real-world applications. While the reliance on isolated utterances ensures clarity, it may limit the contextualization of emotions compared to interactive speech corpora.
The recording environment was designed to elicit genuine emotional speech in a controlled setting. Sessions took place in a semi-dark and quiet room, minimizing distractions and observer effects. High-quality recording equipment, including professional-grade microphones, ensured clear and faithful audio capture. Participants interacted with a computer interface that facilitated mood induction procedures (MIPs), such as music, autobiographical recall, film scenes, and empathy-based scripts. These methods encouraged dynamic shifts in valence and arousal, enabling the capture of authentic emotional expressions. Speech samples were manually segmented to prioritize syntactic and prosodic naturalness, with a mean sample duration of approximately 2.9 s. The annotation process combined self-assessments by participants with external evaluations conducted by three experts in affective computing, ensuring a focus on “prototypical” samples. Ambiguous cases, such as those from participants failing an emotional awareness test (alexithymia) or with acting experience, were excluded to preserve authenticity. While the dual evaluation approach strengthened the dataset’s quality, limited details provided on inter-annotator agreement and labeling criteria make it challenging to fully assess the consistency of the annotations.
The DEMoS corpus stands out for its use of MIPs, a well-controlled recording setup, and the inclusion of guilt as an emotion, addressing a gap in Italian emotional speech research. By focusing on prototypical samples, it provides a valuable resource for studying and modeling emotions in speech, contributing significantly to affective computing and emotion recognition in the Italian context.

6.40. MEAD

The Multi-view Emotional Audio-Visual Dataset (MEAD) ([97], 2020) is a talking-face video corpus featuring 60 actors and actresses expressing eight different emotions at three intensity levels. The primary goal of its development is to enable the synthesis of natural emotional reactions in realistic talking-face video generation. The study employs eight emotion categories (angry, disgust, contempt, fear, happy, sad, surprise, and neutral) and three levels of emotion intensity, chosen for their intuitive alignment with human perception. The first level, defined as weak, represents subtle but noticeable facial movements. The second level, medium, corresponds to the typical expression of the emotion, reflecting its normal state. The third level, strong, is characterized by the most exaggerated expression of the emotion, involving intense movements in the associated facial areas. The phonetically diverse TIMIT speech corpus served as the basis for defining the audio speech content. Sentences were carefully selected to cover all phonemes across each emotion category. The sentences within each emotion category were divided into three parts: 3 common sentences, 7 emotion-specific sentences, and 20 generic sentences. Fluent English speakers aged 20 to 35 with prior acting experience were recruited for the study. To assess their acting skills, candidates were asked to imitate video samples of each emotion performed at different intensity levels by a professional actor. The guidance team evaluated the candidates' performances based on their ability to replicate the expressions in the videos, ensuring that the main features of the emotions were conveyed accurately and naturally.
Before the recording process, training sessions were provided to help speakers achieve the desired emotional states. Subsequently, an emotion arousal session was conducted to elevate their emotional state, enabling them to deliver the extreme expressions required for level 3 intensity. Most speakers were recorded in the order weak, strong, and then medium intensity, as mastering medium intensity became easier once the speaker had experienced both extremes of the emotion. The quality of the dataset was evaluated with two main objectives: (i) determining whether the emotions performed by the actors can be accurately recognized and (ii) assessing whether the three levels of emotion intensity can be correctly distinguished. For this experiment, 100 volunteers, aged 18 to 40, were recruited from universities. Data from six actors in the MEAD dataset, four males and two females, were randomly selected, and two types of experiments were conducted: emotion classification and intensity classification. In the user study on emotion category discrimination, the average accuracy was 85%. The authors also report results on emotion intensity discrimination for the captured snippets; however, it should be borne in mind that the evaluators used both audio and video to perform their classification.

6.41. SUBESCO

The SUST Bangla Emotional Speech Corpus (SUBESCO) is currently the largest available emotional speech database for the Bangla language ([98], 2021). It comprises voice data from 20 professional speakers, evenly divided into 10 males and 10 females, aged 21 to 35 years (mean = 28.05, SD = 4.85). Audio recording was conducted in two phases, with 5 males and 5 females participating in each phase. Gender balance in the corpus was maintained by ensuring an equal number of male and female speakers and raters. The dataset includes recordings of seven emotional states (anger, disgust, fear, happiness, sadness, surprise, and neutral) for 10 sentences, with five trials preserved for each emotional expression. Consequently, the total number of utterances is 10 sentences × 5 repetitions × 7 emotions × 20 speakers = 7000 recordings. Each sentence has a fixed duration of 4 s, with only silences removed while retaining complete words. The total duration of the database therefore amounts to 7000 recordings × 4 s = 28,000 s, i.e., 466 min 40 s, or roughly 7 h 46 min 40 s. Standard Bangla was chosen as the basis for preparing the text data used to develop the emotional corpus.
Initially, a list of twenty grammatically correct sentences was created, ensuring they could be expressed in all target emotions. Subsequently, three linguistic experts selected 10 sentences from this list for the database preparation. The final text dataset includes 7 vowels, 23 consonants, and 3 fricatives, covering all five major groups of consonant articulation. It also incorporates 6 diphthongs and 1 nasalization from the Bangla IPA. The audio recordings were conducted in an anechoic sound studio. Inside the recording room, the speaker was seated and given a dialogue script containing all the sentences arranged in sequential order. A condenser microphone (Wharfedale Pro DM5.0s) mounted on a suitable microphone stand (Hercules) was provided for the speaker, and the average distance between the microphone and the speaker's mouth was 6 cm. The speakers were professional artists and, accordingly, were well acquainted with the Stanislavski method for self-inducing the desired emotions. They were instructed to convey the emotions in a manner that made the recordings sound as natural as possible and were given unlimited time to prepare themselves to accurately express the intended emotions. Human subjects (25 males and 25 females) were engaged to label the utterances. Each rater assessed a set of recordings during Phase 1 and, after a one-week break, re-evaluated the same set of recordings in Phase 2. In the first phase, the raters assessed all seven emotions, whereas in the second phase the emotion "Disgust" was excluded and the remaining six emotions were evaluated. This exclusion aimed to investigate whether "Disgust" causes confusion with similar emotions. All the raters were students from various disciplines and schools at Shahjalal University. They were all physically and mentally healthy and aged over 18 years at the time of the evaluation. None of the raters had participated in the recording sessions. As native Bangla speakers, they were proficient in reading, writing, and understanding the language. To prevent any bias in their perception, the raters were not given prior training on the recordings. Each audio set was assigned to two raters, one male and one female, so that every audio clip was rated twice, with input from both genders in each phase.
Kappa statistics and intra-class correlation (ICC) were used to assess the reliability of the rating exercises. A two-way ANOVA was also conducted following the evaluation task to examine the variability and interaction of the main factors: gender and emotion. The inter-rater reliability for Phase 1 yielded a mean Kappa value of 0.58, indicating moderate agreement between raters. In Phase 2, the mean Kappa value increased to 0.69, reflecting substantial agreement. The intra-class correlation scores were ICC = 0.75 for single measurements and ICC = 0.99 for average measurements, indicating high reliability. The ANOVA results revealed that emotion had a significant main effect on recognition rates. Additionally, the Kruskal–Wallis test showed a statistically significant difference in average emotion recognition rates across different emotions. From the two-way ANOVA analysis, it was determined that the rater’s gender had a significant main effect on emotion recognition rates. However, there was no evidence to suggest that the gender of the speakers influenced the recognizability of emotions, as no specific gender consistently expressed more recognizable emotions than the other.
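For readers who wish to run a comparable reliability analysis on their own rating data, a two-way ANOVA with a rater-gender by emotion interaction can be set up as sketched below. This is a generic illustration assuming pandas and statsmodels; the toy recognition outcomes are invented and do not reproduce the SUBESCO results.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical per-rating outcomes (1 = intended emotion recognized, 0 = not).
df = pd.DataFrame({
    "recognized":   [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    "rater_gender": ["M", "F"] * 6,
    "emotion":      ["anger", "anger", "fear", "fear", "disgust", "disgust"] * 2,
})

# Two-way ANOVA with interaction, mirroring the gender x emotion design.
model = ols("recognized ~ C(rater_gender) * C(emotion)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```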

6.42. EMOVIE

EMOVIE ([99], 2021) is the first public Mandarin emotional speech dataset. It was collected from seven movies, belonging to the feature and comedy categories, with natural and expressive speech. The raw audio tracks were extracted from the movie files using the ffmpeg tool. A total of 9724 samples, amounting to 4.18 h of audio, were collected. Human annotators labeled the emotion polarity of each speech sample on a scale from −1 (negative) to 1 (positive) with a step of 0.5, where 0 denotes a neutral emotion. Samples with polarities of −0.5 and 0.5 account for 79% of the total (4573 and 3171 samples, respectively), followed by 0 (1783 samples), −1 (179 samples), and 1 (78 samples).
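Extracting the audio track from a movie file with ffmpeg, as done for EMOVIE, can be scripted as in the example below. The file names and the mono/16 kHz output settings are assumptions chosen for illustration, not the parameters reported by the EMOVIE authors.

```python
import subprocess

# -vn drops the video stream, -ac 1 downmixes to mono, -ar sets the sample rate,
# and -c:a pcm_s16le writes an uncompressed 16-bit WAV file.
subprocess.run([
    "ffmpeg", "-i", "movie.mp4",   # hypothetical input movie file
    "-vn", "-ac", "1", "-ar", "16000",
    "-c:a", "pcm_s16le",
    "movie_audio.wav",             # hypothetical output track
], check=True)
```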

6.43. MCAESD

The Mandarin Chinese auditory emotions stimulus database (MCAESD [100], 2022) is an emotional auditory stimulus database composed of Chinese pseudo-sentences recorded by six professional actors (3 males and 3 females, mean age 30.5 years) in Mandarin Chinese. The stimulus set was composed of pseudo-sentences (each with 12 syllables (12 characters), including three keywords of two syllables each) created from Chinese disyllabic words with a high frequency of occurrence, based on the Chinese newspaper People's Daily database. During the recording session, the actors were asked to read the target sentences acting six emotions (happiness, sadness, anger, fear, disgust, and pleasant surprise) plus a neutral state. Moreover, they were asked to vocalize two intensity levels for each emotion (except for neutral): normal and strong. Normal intensity refers to the general emotional intensity of daily-life communication, while strong intensity refers to a much more vivid and profound emotion than the normal one. In addition, all emotional categories were vocalized in two sentence patterns: declarative and interrogative. Stimuli were recorded in a professional recording studio using a SONY UTX-B03 wireless microphone and digitized at a 44.1 kHz sampling rate with 64-bit resolution on two channels. Finally, each actor read 40 different pseudo-sentences, and each sentence was spoken 26 times. After a selection process, 6059 high-quality recordings were retained, for a total of 4361 pseudo-sentence stimuli included in the database. Each recording was validated through an online platform with 40 native Chinese listeners (240 participants in total) in terms of the recognition accuracy of the intended emotion portrayal.

6.44. PEMO

The Punjabi Emotional Speech Database (PEMO [101], 2022) is an emotional speech dataset for Punjabi, a traditional language spoken in the Indian state of Punjab. PEMO includes 12 h and 35 min of speech recorded from 60 native Punjabi speakers (aged 20 to 45 years), for a total of 22,000 utterances encompassing four emotions: anger, happiness, sadness, and neutral. The utterances were derived from Punjabi movies and are sampled at 44.1 kHz in a mono audio channel using the PRAAT software (6.4.45). Three annotators with a thorough knowledge of the Punjabi language categorized the emotional content of each utterance. The label most commonly assigned by the annotators was selected as the final label for each utterance, whereas utterances for which no common label was reached were removed from the database. The annotation process achieved an average emotion recognition rate of about 95%.
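This labeling rule (keep the label chosen most often and discard utterances without a clear winner) recurs across several of the corpora reviewed here, for example in ShEMO and ASED in closely related forms. A minimal, library-free sketch of such an aggregation step is shown below; the label lists are hypothetical.

```python
from collections import Counter

def aggregate_labels(annotations):
    """Majority vote over the labels given to one utterance.

    Returns the winning label, or None when no single label is more
    frequent than the runner-up (the utterance would then be discarded).
    """
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no clear majority -> drop the utterance
    return counts[0][0]

# Hypothetical three-annotator labels for two utterances.
print(aggregate_labels(["anger", "anger", "neutral"]))  # -> "anger"
print(aggregate_labels(["happy", "sad", "neutral"]))    # -> None (discarded)
```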

6.45. BanglaSER

BanglaSER ([102], 2022) is a Bangla speech-audio dataset for SER. It contains a total of 1467 audio recordings collected from 34 nonprofessional actors (17 male, 17 female) for five emotional states, i.e., angry, happy, neutral, sad, and surprise. Three trials were conducted for each emotional state. The categorical emotional states were evaluated by 15 human validators, and the recognition rate of the intended emotion was approximately 80.5%. The actors were asked to pronounce three lexically matched Bangla statements in a Bengali accent, whose meanings are "It's twelve o'clock", "I knew something like this would happen", and "What kind of gift is this?". The speech audio data were recorded using the smartphones' default recording applications, laptops, and microphones. Recordings lasted between 3 and 4 s, and surrounding noise was removed using the Audacity software (3.7.5).

6.46. MNITJ-SEHSD

The Malaviya National Institute of Technology Jaipur Simulated Emotion Hindi Speech Database (MNITJ-SEHSD) is an emotional speech database in Hindi ([103], 2023). The database is designed to simulate five different emotions (happy, angry, neutral, sad, and fear) using emotionally neutral text cues, allowing the speakers to portray the emotions without bias. The audio was recorded from 10 speakers (5 males and 5 females, aged 21 to 27 years) using an omnidirectional microphone at a sampling frequency of 44.1 kHz and later downsampled to 16 kHz. For each emotion, a total of 100 utterances were recorded. Each sentence contained six words, except for two sentences with five words and one sentence with seven words.
A subjective evaluation was carried out by three experts to compute emotion recognition performance. Forty utterances from each class were randomly selected, for a total of 200 emotional utterances. The three evaluators had to assign one of the emotion categories to each sample three times. The achieved average emotion recognition rate is 71%.

6.47. IESC

The Indian Emotional Speech Corpora (IESC [104], 2023) is an emotional speech database in English spoken by eight North Indian speakers (five males and three females). The database contains 600 emotional audio files recorded in five emotions, i.e., neutral, happy, anger, sad, and fear. All the audio files were recorded in a closed room using a speech recorder app on a mobile phone to avoid external noise. Headphones with a microphone were also used to prevent sound leakage and provide noise cancellation during the recording.

6.48. ASED

The Amharic Speech Emotion Dataset (ASED [105], 2023) is the first SER dataset for the Amharic language, covering four dialects, namely, Gojjam, Wollo, Shewa, and Gonder. In total, 65 participants (25 female, 40 male), aged from 20 to 40 years, participated in the recording of speech audio files containing five different emotions, i.e., neutral, fear, happy, sad, and angry. For each of the five emotions, five sentences expressing that emotion were composed in Amharic. The recording was performed in a quiet room to obtain speech signals with minimum noise. Since professional audio equipment was not available, six Huawei Nova 4 mobile phones were used to record the audio files. An Android-based speech recording software app was installed and set up to capture the speech utterances at a 16 kHz sampling rate at 16 bits. The recording software displayed the text for participants one sentence at a time and indicated the required emotion. Every recording was independently reviewed by eight judges, and a recording was only accepted for inclusion in the ASED dataset if five or more judges agreed. The final dataset consists of 2474 recordings, each between 2 and 4 s in length: 522 neutral, 510 fear, 486 happy, 470 sad, and 486 angry.

6.49. EmoMatchSpanishDB

EmoMatchSpanishDB ([106], 2023) is the first Spanish-language database of elicited emotional speech, produced by 50 non-actors (30 males and 20 females, mean age of 33.9 years). EmoMatchSpanishDB is a subset of the full original dataset, EmoSpanishDB, which includes all recorded audios that reached consensus after a crowd-sourced validation process. EmoMatchSpanishDB, instead, includes only the audio data whose perceived emotion also matches the originally elicited emotion.
The 23 phonemes of the Spanish spoken in the central area of Spain were used to create a total of 12 sentences replicating regular conversation. Furthermore, none of these sentences carries an emotional semantic connotation, so as to avoid any emotional influence on the speakers. The 12 selected sentences were uttered (elicited) by the 50 participants seven times each, once for each of the considered emotions, i.e., anger, disgust, fear, happiness, sadness, surprise, and neutral. In total, 4200 emotional raw audio samples were collected. Audio data were recorded in a noise-free professional radio studio in PCM format, with a sampling rate of 48 kHz and a bit depth of 16 bits (no audio compression). A dynamic mono-channel cardioid microphone (Sennheiser MD421) and the AudioPlus (AEQ) software (3.0) were used to record the audio signals.
A perception test was conducted via crowd-sourcing to manually label each recorded audio sample with an emotion. This process involved 194 native Spanish speakers. A total of 3550 audios labeled with an emotion (reaching consensus among different evaluators) compose EmoSpanishDB, whereas EmoMatchSpanishDB includes the 2020 audios whose label also matched the originally elicited emotion.
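The two-stage filtering that separates EmoSpanishDB from EmoMatchSpanishDB (first requiring evaluator consensus, then requiring the consensus label to match the elicited emotion) can be expressed as two simple data-frame filters. The sketch below is purely illustrative, assuming pandas and an invented record layout.

```python
import pandas as pd

# Hypothetical records: one row per audio, with the elicited (target) emotion
# and the label that reached consensus in the perception test (None if no consensus).
df = pd.DataFrame({
    "audio_id":  ["a1", "a2", "a3", "a4"],
    "elicited":  ["anger", "fear", "joy", "sadness"],
    "consensus": ["anger", "surprise", None, "sadness"],
})

emo_spanish = df.dropna(subset=["consensus"])              # EmoSpanishDB-style subset
emo_match = emo_spanish[emo_spanish["consensus"]
                        == emo_spanish["elicited"]]        # EmoMatchSpanishDB-style subset
print(emo_match)
```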

6.50. nEMO

The nEMO dataset ([107], 2024) is a novel corpus of emotional speech in Polish. nEMO adopted an acted approach: each of the nine involved actors (four female and five male, aged between 20 and 30 years) was required to record the same set of utterances for six emotional states: anger, fear, happiness, sadness, surprise, and a neutral state. A total of 90 sentences were created, each containing at least one uncommon Polish phoneme while remaining usable in everyday conversation. During the recording sessions, actors were given explicit instructions to focus on depicting a single emotional state at a time, and feedback was provided constantly to support and guide all participants. Each recording session was conducted in a home setting, to better reflect a natural environment, involved one actor, and lasted approximately two hours. The utterances were captured using a cardioid condenser microphone at a 192 kHz sampling rate, equipped with a sponge and pop filter to reduce background noise and plosive artifacts. The nEMO dataset underwent human evaluation of the recorded emotional speech, and only recordings that accurately captured the intended emotional state were included. The resulting dataset contains a total of 4481 audio recordings, corresponding to more than three hours of speech.

6.51. CAVES

The Cantonese Audio-Visual Emotional Speech (CAVES [108], 2024) dataset consists of auditory and visual recordings of ten native speakers of Cantonese. Cantonese is a tonal language (with two more lexical tones than Mandarin) primarily spoken in Southern China, particularly in the provinces of Guangdong and Guangxi. The CAVES dataset contains the six basic emotions, i.e., anger, disgust, fear, happiness, sadness, and surprise, plus a neutral expression that serves as a baseline. A set of semantically neutral carrier sentences was selected so that the six emotions could be expressed for each sentence without semantic interference. Fifty sentences were selected from the Cantonese Hearing In Noise Test (CHINT) sentence list, chosen to provide good coverage of the different lexical tones in both initial and final sentence positions. Ten native speakers of Cantonese (five females and five males) participated in the recording, which was conducted in a sound-attenuated booth. A video monitor was used to present the stimulus sentence, while a video camera and a microphone were used to capture the participants' faces and utterances, respectively. Fifteen native Cantonese perceivers completed a forced-choice emotion identification task to validate the recorded emotion expressions.

6.52. Emozionalmente

Emozionalmente ([109], 2025) is an Italian acted corpus of emotional speech designed to ensure comparability with the Italian Emovo database. Indeed, similarly to Emovo, Emozionalmente recruited actors to simulate emotions while speaking scripted sentences, and it includes six emotions (anger, disgust, fear, joy, sadness, and surprise) plus the neutral state.
Eighteen sentences were constructed ad hoc to be semantically neutral and easily readable with different emotional tones. These sentences contain everyday vocabulary and cover all Italian phonemes in various positions and contexts. A custom crowd-sourcing web app was developed to collect audio samples from voluntary participants and to let participants evaluate recordings submitted by others, by labeling the perceived emotion and indicating whether the audio was clear or noisy. Recordings were therefore captured using the participants' own device microphones, which introduced natural variability in audio quality. The collected corpus included 11,404 audio samples, reduced to 6902 samples after a data-cleaning process that removed noisy samples and samples with inconsistent emotion labels. These samples were recorded by a total of 431 Italian actors (131 males, 299 females, and 1 listed as "other"), with an average age of 31.28 years.
A subjective evaluation was conducted to assess the effectiveness of Emozionalmente in conveying emotions. It involved 829 individuals, who provided a total of 34,510 evaluations, 5 per audio sample. A recognition accuracy of 66% was achieved, which demonstrates the utility and representativeness of the Emozionalmente database.

7. Discussion

In this section, we step back from the individual corpus descriptions to look at the bigger picture. The goal is to see what common patterns emerge across languages, speech types, and data collection strategies, and to reflect on what these patterns mean for SER research. We then turn to Table 3, which offers a complementary view by showing which corpora are actually being used in recent studies. Taken together, these perspectives help clarify where the field currently stands and where it still needs to move forward.

7.1. Synthesis of the Reviewed Corpora

Looking across the databases we reviewed, what stands out first is the variety of languages represented. English remains the dominant choice, with well-established corpora like IEMOCAP, CREMA-D, RAVDESS, and eNTERFACE’05, but important resources also exist in German (EMO-DB, AIBO, VAM), French (EMOTV1, CEMO, CaFE), Italian (EMOVO, Emozionalmente), Mandarin Chinese (CASIA), Spanish (EmoMatchSpanishDB), Hindi (IITKGP-SEHSC, MNITJ-SEHSD), Punjabi (PEMO), and even Danish (DES), alongside explicitly multilingual efforts such as CREST and INTERFACE. Despite this diversity, English and a handful of European languages still dominate the SER landscape. While simple counts highlight these imbalances, they do not fully capture differences in dataset size, diversity, or impact. To address this, we introduced a quality index (Q) that combines intrinsic features (number of speakers and emotions) with external indicators (citations per year as a proxy for adoption and recency to reflect current relevance). Each factor was normalized and weighted using the analytical hierarchy process (AHP), with more importance given to citations and speaker diversity while still considering emotional breadth and recency. This allows us to assess not only how many corpora exist, but also their weighted contribution to the field.
Q = 0.2615 S + 0.0872 E + 0.6114 C + 0.0400 R,
where S, E, C, and R denote the normalized number of speakers, number of emotions, citations per year, and recency, respectively. Figure 3a illustrates this imbalance: while several languages are represented, English accounts for both the largest number of corpora and the highest cumulative quality index (Q). Other languages such as German, French, Italian, and Mandarin also show substantial contributions, but the long tail of underrepresented languages highlights how limited cross-cultural coverage remains. It is worth noting that this linguistic imbalance reflects the publication record of the SER community over the last decades; however, a recent trend shows growing attention to underrepresented languages in SER systems.
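To make the construction of the index concrete, the sketch below computes Q for a few invented corpora using the AHP-derived weights reported above. Since the exact normalization is not spelled out here, a simple min-max scaling of each factor is assumed, with the publication year standing in for recency.

```python
import numpy as np

# AHP-derived weights for [speakers, emotions, citations per year, recency].
WEIGHTS = np.array([0.2615, 0.0872, 0.6114, 0.0400])

def quality_index(raw):
    """Compute Q for an (n_corpora, 4) array of raw factor values.

    Each column is min-max normalized to [0, 1] before the weighted sum;
    constant columns are left at zero to avoid division by zero.
    """
    raw = np.asarray(raw, dtype=float)
    span = raw.max(axis=0) - raw.min(axis=0)
    norm = (raw - raw.min(axis=0)) / np.where(span == 0, 1, span)
    return norm @ WEIGHTS

# Hypothetical corpora: [speakers, emotions, citations/year, publication year].
print(quality_index([[24, 8, 350, 2018],
                     [10, 7, 12, 2005],
                     [68, 7, 40, 2020]]))
```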
The same imbalance appears when we look at speech types. Most corpora are acted, recorded in controlled settings to ensure clarity and reliable labels, such as DES, EMO-DB, IEMOCAP, CASIA, EMOVO, RAVDESS, and CREMA-D. By contrast, truly natural corpora are rarer but vital for realism, such as call-center conversations and TV interviews in French (CEMO, EMOTV1), courtroom recordings in Italian (ITA-DB), German talk shows (VAM), or large-scale online material like CMU-MOSEI. Between these two extremes, elicited speech databases like SmartKom, eNTERFACE’05, EmoTaboo, and EmoMatchSpanishDB try to balance spontaneity with reproducibility. Figure 3b confirms this trend: acted corpora dominate in raw numbers, but natural and elicited corpora achieve competitive average quality indexes. This suggests that, although they are fewer in number, natural databases contribute valuable diversity and robustness to the SER field. Another difference lies in how speech material is sourced. The majority of datasets are purpose-recorded in studios or labs with standardized protocols, but a smaller group draws directly from existing media: films and TV for SAFE, ITA-DB, and EMOTV1, or online platforms such as YouTube for CMU-MOSEI and TV talk shows for ANAD. More recent corpora even experiment with crowd-sourcing during either collection or labeling, as seen in EmoMatchSpanishDB and Emozionalmente.
Taken together, these results illustrate how the SER field still leans heavily on ad-hoc studio recordings, but it is gradually moving toward more natural and diverse sources of emotional speech, which promise greater ecological validity and broader cultural coverage. While this overview highlights broad trends in languages, speech types, and collection methods, Table 3 provides a complementary perspective by showing which of these corpora are most often used in recent SER studies.

7.2. Usage of Corpora in Recent SER Studies

Table 3 gives a snapshot of the datasets that researchers have been relying on in early 2024. Unsurprisingly, the familiar acted corpora—IEMOCAP, CREMA-D, RAVDESS, and EMO-DB—still dominate. They remain the go-to benchmarks because they are clear, accessible, and deeply embedded in the community, making them hard to replace for fair comparisons. At the same time, there are signs of change: natural and multimodal datasets such as MSP-Podcast, MELD, and CMU-MOSEI are appearing more often, especially in studies that test robustness or combine speech with other modalities. Newer corpora in less-represented languages, like PEMO (Punjabi), EmoMatchSpanishDB (Spanish), or BAVED (Arabic), show up only occasionally, reflecting both their novelty and the slower pace at which they are adopted. Taken together, the picture is consistent with what we observed in the corpus review: the field still leans heavily on a small group of acted datasets, but interest in more diverse, spontaneous, and culturally varied resources is clearly growing.

7.3. Emerging Trends and Gaps

Looking at the bigger picture, a few clear trends and gaps stand out. The SER field still leans heavily on acted datasets in English and a handful of other major languages, even though these conditions risk producing systems that perform well in the lab but less reliably in real-world settings. At the same time, interest is steadily growing in natural and multimodal corpora, which are harder to collect and annotate, but much more representative of everyday communication. Progress has also been made in releasing datasets for languages beyond English, such as Hindi, Punjabi, Spanish, and Italian; yet, these resources remain relatively scarce, and the lack of large-scale multilingual corpora continues to limit cross-cultural generalization. Finally, annotation practices are far from consistent: some corpora use categorical labels, others rely on dimensional scales, and quality control varies widely. Addressing these issues will require both new collection strategies, such as carefully designed crowd-sourcing, and stronger collaboration within the community to build larger, more standardized, and culturally diverse emotional speech resources. However, the comparative analysis of the various datasets also reveals several psychological and ethical limitations. The predominance of acted speech, together with the scarcity of mixed or nuanced emotions, improves experimental control but reduces ecological validity, i.e., fidelity to real-world affect. Furthermore, limited linguistic and cultural diversity can bias the psychological representativeness of the corpora. Finally, the use of natural datasets is often constrained by ethical and legal requirements (e.g., consent and privacy). From a practical standpoint, responsibly releasing natural-speech corpora requires a lawful basis and clear consent, data minimization and de-identification, secure access with transparent licensing (open or controlled, consistent with copyright and platform terms), and documentation of ethics approval. Where these conditions cannot be met, distribution should be limited (for example, on-site access or sharing only derived features) or avoided.

8. Conclusions

In this article, we set out to fill a gap in the SER literature by focusing not on algorithms or features, but on the datasets that provide the foundation for the field. To this end, we deliberately adopted a narrative review and comparative analysis rather than a systematic review, since our aim was to provide a broad and critical synthesis across a heterogeneous landscape of corpora.
First, we revisited the conceptual ground of SER, outlining the main system components and the two dominant psychological models of emotion, categorical and dimensional, which continue to shape the way emotions are represented and annotated. This first part is not simply background; it discusses how insights from psychology inform technical design choices, and how mismatches between theoretical models and practical datasets can create challenges for SER research.
The heart of the review is the detailed analysis of more than fifty corpora, which we examined through multiple lenses: their linguistic and cultural coverage, the types of speech they capture (acted, elicited, or natural), their collection methods, and their annotation strategies. Beyond listing characteristics, we emphasized the psychological assumptions built into these corpora and provided a critical reading of their strengths and weaknesses. Acted datasets remain the backbone of the field, offering clarity and reliability but at the cost of ecological validity and subtle emotional variation. Natural and elicited corpora, by contrast, provide richer and more realistic affective behavior, though they suffer from annotation difficulties and limited availability. Our synthesis highlighted not only the variety of languages now represented but also the continuing imbalance in favor of English and European languages, and our critical lens drew attention to persistent gaps in cross-cultural representation, annotation consistency, and methodological transparency. By combining insights from emotion theory, system design, and corpus analysis, this review offers researchers a consolidated and practical guide for navigating the landscape of SER data.
We further complement this perspective by summarizing in a table (Table 3) the corpora that are actively used in recent research, confirming the dominance of a handful of benchmarks while highlighting a gradual shift toward natural and multimodal resources. Taken together, these analyses show that while the field has made important progress, it is still constrained by a narrow set of conditions and languages, raising concerns about robustness, fairness, and generalizability.
Looking ahead, we see clear directions for progress: the creation of larger multilingual and multicultural corpora, the design of more consistent and transparent annotation protocols, and more innovative collection strategies, such as carefully designed crowd-sourcing that balances scale with quality. These efforts will be critical if future SER systems are to move beyond narrow benchmarks and achieve the ecological validity required for real-world applications. In summary, our contribution lies in providing a narrative review and comparative framework (scope, contents, physical existence, language composition) that makes the field’s diversity transparent and its descriptive parameters replicable, while maintaining the interpretive depth that a narrative review allows. We hope this work will not only help researchers select appropriate datasets for their work but also spark renewed attention to the design and development of corpora, which remain the cornerstone of reliable and generalizable speech emotion recognition.

Author Contributions

Conceptualization, S.S.; methodology, S.S., O.S., S.C., C.M., A.F., and S.P.; validation, S.S. and A.F.; formal analysis, S.S., O.S., and G.E.; investigation, S.S., S.C., and C.M.; resources, S.S.; data curation, S.S.; writing—original draft preparation, S.S., O.S., G.E., A.F., and S.P.; writing—review and editing, S.S., O.S., G.E., A.F., and S.P.; visualization, S.S.; supervision, S.S.; project administration, S.S. and L.A.; funding acquisition, S.S. and L.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the European Union-Next Generation EU under the Italian National Recovery and Resilience Plan (NRRP), Mission 4, Component 2, Investment 1.3, CUP C49J24000240004, partnership on “Telecommunications of the Future” (PE00000001-program “RESTART”).

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Munot, R.; Nenkova, A. Emotion impacts speech recognition performance. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, Minneapolis, MN, USA, 3–5 June 2019; pp. 16–21. [Google Scholar]
  2. Alnuaim, A.A.; Zakariah, M.; Shukla, P.K.; Alhadlaq, A.; Hatamleh, W.A.; Tarazi, H.; Sureshbabu, R.; Ratna, R. Human-computer interaction for recognizing speech emotions using multilayer perceptron classifier. J. Healthc. Eng. 2022, 2022, 6005446. [Google Scholar] [CrossRef]
  3. Lakomkin, E.; Zamani, M.A.; Weber, C.; Magg, S.; Wermter, S. On the robustness of speech emotion recognition for human-robot interaction with deep neural networks. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; IEEE: Piscataway, NJ, USA; pp. 854–860. [Google Scholar]
  4. Alshamsi, H.; Kepuska, V.; Alshamsi, H.; Meng, H. Automated speech emotion recognition on smart phones. In Proceedings of the 2018 9th IEEE Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 8–10 November 2018; IEEE: Piscataway, NJ, USA; pp. 44–50. [Google Scholar]
  5. Bojanić, M.; Delić, V.; Karpov, A. Effect of Emotion Distribution on a Call Processing for an Emergency Call Center. In Proceedings of the 2020 28th Telecommunications Forum (TELFOR), Belgrade, Serbia, 24–25 November 2020; IEEE: Piscataway, NJ, USA; pp. 1–4. [Google Scholar]
  6. Abbaschian, B.J.; Sierra-Sosa, D.; Elmaghraby, A. Deep learning techniques for speech emotion recognition, from databases to models. Sensors 2021, 21, 1249. [Google Scholar] [CrossRef] [PubMed]
  7. Kim, H.; Hong, T. Enhancing emotion recognition using multimodal fusion of physiological, environmental, personal data. Expert Syst. Appl. 2024, 249, 123723. [Google Scholar] [CrossRef]
  8. Wu, X.; Zhang, Q. Intelligent aging home control method and system for internet of things emotion recognition. Front. Psychol. 2022, 13, 882699. [Google Scholar] [CrossRef] [PubMed]
  9. Hansen, L.; Zhang, Y.P.; Wolf, D.; Sechidis, K.; Ladegaard, N.; Fusaroli, R. A generalizable speech emotion recognition model reveals depression and remission. Acta Psychiatr. Scand. 2022, 145, 186–199. [Google Scholar] [CrossRef] [PubMed]
  10. Gerczuk, M.; Amiriparian, S.; Ottl, S.; Schuller, B.W. Emonet: A transfer learning framework for multi-corpus speech emotion recognition. IEEE Trans. Affect. Comput. 2021, 14, 1472–1487. [Google Scholar] [CrossRef]
  11. Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef]
  12. Kaur, K.; Singh, P. Trends in speech emotion recognition: A comprehensive survey. Multimed. Tools Appl. 2023, 82, 29307–29351. [Google Scholar] [CrossRef]
  13. Mohmad Dar, G.H.; Delhibabu, R. Speech Databases, Speech Features, and Classifiers in Speech Emotion Recognition: A Review. IEEE Access 2024, 12, 151122–151152. [Google Scholar] [CrossRef]
  14. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef]
  15. Ghosh, A.; Choudhury, S. Understanding different types of review articles: A primer for early career researchers. Indian J. Psychiatry 2025, 67, 535–541. [Google Scholar] [CrossRef]
  16. Ferrari, R. Writing narrative style literature reviews. Med. Writ. 2015, 24, 230–235. [Google Scholar] [CrossRef]
  17. Matveev, Y.; Matveev, A.; Frolova, O.; Lyakso, E.; Ruban, N. Automatic speech emotion recognition of younger school age children. Mathematics 2022, 10, 2373. [Google Scholar] [CrossRef]
  18. Tank, V.P.; Hadia, S. Creation of speech corpus for emotion analysis in Gujarati language and its evaluation by various speech parameters. Int. J. Electr. Comput. Eng. 2020, 10, 4752–4758. [Google Scholar] [CrossRef]
  19. Kadiri, S.R.; Gangamohan, P.; Gangashetty, S.V.; Alku, P.; Yegnanarayana, B. Excitation features of speech for emotion recognition using neutral speech as reference. Circuits Syst. Signal Process. 2020, 39, 4459–4481. [Google Scholar] [CrossRef]
  20. Baek, J.Y.; Lee, S.P. Enhanced speech emotion recognition using dcgan-based data augmentation. Electronics 2023, 12, 3966. [Google Scholar] [CrossRef]
  21. Liu, M.; Raj, A.N.J.; Rajangam, V.; Ma, K.; Zhuang, Z.; Zhuang, S. Multiscale-multichannel feature extraction and classification through one-dimensional convolutional neural network for Speech emotion recognition. Speech Commun. 2024, 156, 103010. [Google Scholar] [CrossRef]
  22. Alluhaidan, A.S.; Saidani, O.; Jahangir, R.; Nauman, M.A.; Neffati, O.S. Speech emotion recognition through hybrid features and convolutional neural network. Appl. Sci. 2023, 13, 4750. [Google Scholar] [CrossRef]
  23. Saumard, M. Enhancing Speech Emotions Recognition Using Multivariate Functional Data Analysis. Big Data Cogn. Comput. 2023, 7, 146. [Google Scholar] [CrossRef]
  24. Sun, L.; Li, Q.; Fu, S.; Li, P. Speech emotion recognition based on genetic algorithm–decision tree fusion of deep and acoustic features. ETRI J. 2022, 44, 462–475. [Google Scholar] [CrossRef]
  25. Welivita, A.; Xie, Y.; Pu, P. Fine-grained emotion and intent learning in movie dialogues. arXiv 2020, arXiv:2012.13624. [Google Scholar] [CrossRef]
  26. Abdelhamid, A.A.; El-Kenawy, E.S.M.; Alotaibi, B.; Amer, G.M.; Abdelkader, M.Y.; Ibrahim, A.; Eid, M.M. Robust speech emotion recognition using CNN+ LSTM based on stochastic fractal search optimization algorithm. IEEE Access 2022, 10, 49265–49284. [Google Scholar] [CrossRef]
  27. Falahzadeh, M.R.; Farsa, E.Z.; Harimi, A.; Ahmadi, A.; Abraham, A. 3d convolutional neural network for speech emotion recognition with its realization on intel cpu and nvidia gpu. IEEE Access 2022, 10, 112460–112471. [Google Scholar] [CrossRef]
  28. Dai, Y.; Li, Y.; Chen, D.; Li, J.; Lu, G. Multimodal Decoupled Distillation Graph Neural Network for Emotion Recognition in Conversation. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9910–9924. [Google Scholar] [CrossRef]
  29. Yun, H.I.; Park, J.S. End-to-end emotional speech recognition using acoustic model adaptation based on knowledge distillation. Multimed. Tools Appl. 2023, 82, 22759–22776. [Google Scholar] [CrossRef] [PubMed]
  30. Kerkeni, L.; Serrestou, Y.; Mbarki, M.; Raoof, K.; Mahjoub, M.A. A review on speech emotion recognition: Case of pedagogical interaction in classroom. In Proceedings of the 2017 International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), Fez, Morocco, 22–24 May 2017; IEEE: Piscataway, NJ, USA; pp. 1–7. [Google Scholar]
  31. Ekman, P.; Friesen, W.V. Constants across cultures in the face and emotion. J. Personal. Soc. Psychol. 1971, 17, 124. [Google Scholar] [CrossRef]
  32. Ekman, P. Basic Emotions. In Handbook of Cognition and Emotion; John Wiley & Sons, Ltd.: West Sussex, UK, 1999; Chapter 3; pp. 45–60. [Google Scholar] [CrossRef]
  33. Cowen, A.S.; Keltner, D. What the face displays: Mapping 28 emotions conveyed by naturalistic expression. Am. Psychol. 2020, 75, 349. [Google Scholar] [CrossRef]
  34. Scarantino, A.; Griffiths, P. Don’t give up on basic emotions. Emot. Rev. 2011, 3, 444–454. [Google Scholar] [CrossRef]
  35. Laukka, P.; Elfenbein, H.A. Cross-cultural emotion recognition and in-group advantage in vocal expression: A meta-analysis. Emot. Rev. 2021, 13, 3–11. [Google Scholar] [CrossRef]
  36. Dirzyte, A.; Antanaitis, F.; Patapas, A. Law enforcement officers’ ability to recognize emotions: The role of personality traits and Basic needs’ satisfaction. Behav. Sci. 2022, 12, 351. [Google Scholar] [CrossRef] [PubMed]
  37. Jack, R.E.; Garrod, O.G.; Yu, H.; Caldara, R.; Schyns, P.G. Facial expressions of emotion are not culturally universal. Proc. Natl. Acad. Sci. USA 2012, 109, 7241–7244. [Google Scholar] [CrossRef] [PubMed]
  38. Russell, J.A. A circumplex model of affect. J. Personal. Soc. Psychol. 1980, 39, 1161. [Google Scholar] [CrossRef]
  39. Khare, S.K.; Blanes-Vidal, V.; Nadimi, E.S.; Acharya, U.R. Emotion recognition and artificial intelligence: A systematic review (2014–2023) and research recommendations. Inf. Fusion 2024, 102, 102019. [Google Scholar] [CrossRef]
  40. Sharma, K.; Castellini, C.; Stulp, F.; Van den Broek, E.L. Continuous, real-time emotion annotation: A novel joystick-based analysis framework. IEEE Trans. Affect. Comput. 2017, 11, 78–84. [Google Scholar] [CrossRef]
  41. Calvo, R.A.; D’Mello, S. Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Trans. Affect. Comput. 2010, 1, 18–37. [Google Scholar] [CrossRef]
  42. Guo, R.; Guo, H.; Wang, L.; Chen, M.; Yang, D.; Li, B. Development and application of emotion recognition technology—A systematic literature review. BMC Psychol. 2024, 12, 95. [Google Scholar] [CrossRef] [PubMed]
  43. Geetha, A.; Mala, T.; Priyanka, D.; Uma, E. Multimodal Emotion Recognition with deep learning: Advancements, challenges, and future directions. Inf. Fusion 2024, 105, 102218. [Google Scholar]
  44. Younis, E.M.; Mohsen, S.; Houssein, E.H.; Ibrahim, O.A.S. Machine learning for human emotion recognition: A comprehensive review. Neural Comput. Appl. 2024, 36, 8901–8947. [Google Scholar] [CrossRef]
  45. Pereira, P.; Moniz, H.; Carvalho, J.P. Deep emotion recognition in textual conversations: A survey. Artif. Intell. Rev. 2025, 58, 1–37. [Google Scholar] [CrossRef]
  46. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the Interspeech, Lisbon, Portugal, 4–8 September 2005; Volume 5, pp. 1517–1520. [Google Scholar]
  47. Schröder, M. Emotional speech synthesis: A review. In Proceedings of the Seventh European Conference on Speech Communication and Technology, Aalborg, Denmark, 3–7 September 2001; pp. 561–564. [Google Scholar]
  48. Cowie, R.; Douglas-Cowie, E.; Tsapatsoulis, N.; Votsis, G.; Kollias, S.; Fellenz, W.; Taylor, J.G. Emotion recognition in human-computer interaction. IEEE Signal Process. Mag. 2001, 18, 32–80. [Google Scholar] [CrossRef]
  49. Douglas-Cowie, E.; Campbell, N.; Cowie, R.; Roach, P. Emotional speech: Towards a new generation of databases. Speech Commun. 2003, 40, 33–60. [Google Scholar] [CrossRef]
  50. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
  51. Ververidis, D.; Kotropoulos, C. Emotional speech recognition: Resources, features, and methods. Speech Commun. 2006, 48, 1162–1181. [Google Scholar] [CrossRef]
  52. Scherer, K.R. Vocal communication of emotion: A review of research paradigms. Speech Commun. 2003, 40, 227–256. [Google Scholar] [CrossRef]
  53. Gross, J.J.; Levenson, R.W. Emotion elicitation using films. Cogn. Emot. 1995, 9, 87–108. [Google Scholar] [CrossRef]
  54. Schaefer, A.; Nils, F.; Sanchez, X.; Philippot, P. Assessing the effectiveness of a large database of emotion-eliciting films: A new tool for emotion researchers. Cogn. Emot. 2010, 24, 1153–1172. [Google Scholar] [CrossRef]
  55. Parsons, T.D. Virtual reality for enhanced ecological validity and experimental control in the clinical, affective and social neurosciences. Front. Hum. Neurosci. 2015, 9, 660. [Google Scholar] [CrossRef]
  56. Schuller, B.; Batliner, A.; Steidl, S.; Seppi, D. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Commun. 2011, 53, 1062–1087. [Google Scholar] [CrossRef]
  57. Douglas-Cowie, E.; Cowie, R.; Sneddon, I.; Cox, C.; Lowry, O.; Mcrorie, M.; Martin, J.C.; Devillers, L.; Abrilian, S.; Batliner, A.; et al. The HUMAINE database: Addressing the collection and annotation of naturalistic and induced emotional data. In Proceedings of the Affective Computing and Intelligent Interaction: Second International Conference, ACII 2007, Lisbon, Portugal, 12–14 September 2007; Proceedings 2. Springer: Berlin/Heidelberg, Germany, 2007; pp. 488–500. [Google Scholar]
  58. Schuller, B.; Batliner, A. Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing; John Wiley & Sons Ltd.: Chichester, UK, 2014. [Google Scholar]
  59. Liu, Y.l.; Huang, L.; Yan, W.; Wang, X.; Zhang, R. Privacy in AI and the IoT: The privacy concerns of smart speaker users and the Personal Information Protection Law in China. Telecommun. Policy 2022, 46, 102334. [Google Scholar] [CrossRef]
  60. Schuller, B.; Steidl, S.; Batliner, A. The INTERSPEECH 2009 emotion challenge. In Proceedings of Interspeech 2009, Brighton, UK, 6–10 September 2009; pp. 312–315. [Google Scholar] [CrossRef]
  61. Engberg, I.S.; Hansen, A.V.; Andersen, O.; Dalsgaard, P. Design, recording and verification of a Danish emotional speech database. In Proceedings of the Fifth European Conference on Speech Communication and Technology, Rhodes, Greece, 22–25 September 1997. [Google Scholar]
  62. Hansen, J.H.; Bou-Ghazale, S.E.; Sarikaya, R.; Pellom, B. Getting started with SUSAS: A speech under simulated and actual stress database. In Proceedings of the Eurospeech, Rhodes, Greece, 22–25 September 1997; Volume 97, pp. 1743–1746. [Google Scholar]
  63. Campbell, N. Building a Corpus of Natural Speech–and Tools for the Processing of Expressive Speech–the JST CREST ESP Project. In Proceedings of the 7th European Conference on Speech Communication and Technology, Aalborg, Denmark, 3–7 September 2001; pp. 1525–1528. [Google Scholar]
  64. Hozjan, V.; Kacic, Z.; Moreno, A.; Bonafonte, A.; Nogueiras, A. Interface Databases: Design and Collection of a Multilingual Emotional Speech Database. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), Las Palmas, Spain, 27 May–2 June 2002. [Google Scholar]
  65. Schiel, F.; Steininger, S.; Türk, U. The SmartKom Multimodal Corpus at BAS. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), Las Palmas, Spain, 27 May–2 June 2002. [Google Scholar]
  66. Batliner, A.; Hacker, C.; Steidl, S.; Nöth, E.; Haas, J. User states, user strategies, and system performance: How to match the one with the other. In Proceedings of the ITRW on Error Handling in Spoken Dialogue Systems, Chateau d’Oex, Vaud, Switzerland, 28–31 August 2003. [Google Scholar]
  67. Batliner, A.; Hacker, C.; Steidl, S.; Nöth, E.; D’Arcy, S.; Russell, M.; Wong, M. “You Stupid Tin Box”-Children Interacting with the AIBO Robot: A Cross-linguistic Emotional Speech Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal, 26–30 May 2004. [Google Scholar]
  68. Devillers, L.; Vasilescu, I. Reliability of Lexical and Prosodic Cues in Two Real-life Spoken Dialog Corpora. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal, 26–30 May 2004. [Google Scholar]
  69. Zovato, E.; Pacchiotti, A.; Quazza, S.; Sandri, S. Towards emotional speech synthesis: A rule based approach. In Proceedings of the Fifth ISCA Workshop on Speech Synthesis, Pittsburgh, PA, USA, 14–16 June 2004. [Google Scholar]
  70. Abrilian, S.; Devillers, L.; Buisine, S.; Martin, J.C. EmoTV1: Annotation of real-life emotions for the specification of multimodal affective interfaces. In Proceedings of the HCI International, Las Vegas, NV, USA, 22–27 July 2005; Volume 401, pp. 407–408. [Google Scholar]
  71. Vidrascu, L.; Devillers, L. Real-life emotions in naturalistic data recorded in a medical call center. In Proceedings of the First International Workshop on Emotion: Corpora for Research on Emotion and Affect (International conference on Language Resources and Evaluation (LREC 2006)), Genoa, Italy, 22–28 May 2006; pp. 20–24. [Google Scholar]
  72. Martin, O.; Kotsia, I.; Macq, B.; Pitas, I. The eNTERFACE’05 audio-visual emotion database. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW’06), Atlanta, GA, USA, 3–7 April 2006; IEEE: Piscataway, NJ, USA, 2006; p. 8. [Google Scholar]
  73. Clavel, C.; Vasilescu, I.; Devillers, L.; Richard, G.; Ehrette, T.; Sedogbo, C. The SAFE Corpus: Illustrating extreme emotions in dynamic situations. In Proceedings of the First International Workshop on Emotion: Corpora for Research on Emotion and Affect (International conference on Language Resources and Evaluation (LREC 2006)), Genoa, Italy, 22–28 May 2006; pp. 76–79. [Google Scholar]
  74. Zara, A.; Maffiolo, V.; Martin, J.C.; Devillers, L. Collection and annotation of a corpus of human-human multimodal interactions: Emotion and others anthropomorphic characteristics. In Proceedings of the Affective Computing and Intelligent Interaction: Second International Conference, ACII 2007, Lisbon, Portugal, 12–14 September 2007; Proceedings 2. Springer: Berlin/Heidelberg, Germany, 2007; pp. 464–475. [Google Scholar]
  75. Tao, J.; Liu, F.; Zhang, M.; Jia, H. Design of Speech Corpus for Mandarin Text to Speech. 2008. Available online: https://api.semanticscholar.org/CorpusID:15860480 (accessed on 9 October 2025).
  76. Archetti, F.; Arosio, G.; Fersini, E.; Messina, E. Audio-based Emotion Recognition for Advanced Automatic Retrieval in Judicial Domain. In Proceedings of the 1st International Conference on ICT Solutions for Justice (ICT4Justice ’08), Thessaloniki, Greece, 24 October 2008. [Google Scholar]
  77. Grimm, M.; Kroschel, K.; Narayanan, S. The Vera am Mittag German audio-visual emotional speech database. In Proceedings of the 2008 IEEE International Conference on Multimedia and Expo, Hannover, Germany, 23–26 June 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 865–868. [Google Scholar]
  78. Haq, S.; Jackson, P.J.B. Multimodal Emotion Recognition; Machine Audition: Principles, Algorithms and Systems; Wang, W., Ed.; IGI Global: Hershey, PA, USA, 2011; Chapter 17. [Google Scholar]
  79. Koolagudi, S.G.; Reddy, R.; Yadav, J.; Rao, K.S. IITKGP-SEHSC: Hindi speech corpus for emotion analysis. In Proceedings of the 2011 International Conference on Devices and Communications (ICDeCom), Mesra, Ranchi, India, 24–25 February 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 1–5. [Google Scholar]
  80. Fersini, E.; Messina, E.; Archetti, F. Emotional states in judicial courtrooms: An experimental investigation. Speech Commun. 2012, 54, 11–22. [Google Scholar] [CrossRef]
  81. Dupuis, K.; Pichora-Fuller, M.K. Recognition of emotional speech for younger and older talkers: Behavioural findings from the Toronto Emotional Speech Set. Can. Acoust. 2011, 39, 182–183. [Google Scholar]
  82. McKeown, G.; Valstar, M.; Cowie, R.; Pantic, M.; Schroder, M. The SEMAINE Database: Annotated Multimodal Records of Emotionally Colored Conversations between a Person and a Limited Agent. IEEE Trans. Affect. Comput. 2012, 3, 5–17. [Google Scholar] [CrossRef]
  83. Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 2014, 5, 377–390. [Google Scholar] [CrossRef] [PubMed]
  84. Costantini, G.; Iaderola, I.; Paoloni, A.; Todisco, M. EMOVO corpus: An Italian emotional speech database. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; pp. 3501–3504. [Google Scholar]
  85. Li, Y.; Tao, J.; Chao, L.; Bao, W.; Liu, Y. CHEAVD: A Chinese natural emotional audio–visual database. J. Ambient. Intell. Humaniz. Comput. 2017, 8, 913–924. [Google Scholar] [CrossRef]
  86. Busso, C.; Parthasarathy, S.; Burmania, A.; AbdelWahab, M.; Sadoughi, N.; Provost, E.M. MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception. IEEE Trans. Affect. Comput. 2016, 8, 67–80. [Google Scholar] [CrossRef]
  87. Chou, H.C.; Lin, W.C.; Chang, L.C.; Li, C.C.; Ma, H.P.; Lee, C.C. NNIME: The NTHU-NTUA Chinese interactive multimodal emotion corpus. In Proceedings of the 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), San Antonio, TX, USA, 23–26 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 292–298. [Google Scholar]
  88. Vryzas, N.; Kotsakis, R.; Liatsou, A.; Dimoulas, C.A.; Kalliris, G. Speech emotion recognition for performance interaction. J. Audio Eng. Soc. 2018, 66, 457–467. [Google Scholar] [CrossRef]
  89. Klaylat, S.; Osman, Z.; Hamandi, L.; Zantout, R. Emotion recognition in Arabic speech. Analog. Integr. Circuits Signal Process. 2018, 96, 337–351. [Google Scholar] [CrossRef]
  90. Gournay, P.; Lahaie, O.; Lefebvre, R. A canadian french emotional speech dataset. In Proceedings of the 9th ACM Multimedia Systems Conference (MMSys ’18), Amsterdam, The Netherlands, 12–15 June 2018; pp. 399–402. [Google Scholar] [CrossRef]
  91. Zadeh, A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2236–2246. [Google Scholar]
  92. Mohamad Nezami, O.; Jamshid Lou, P.; Karami, M. ShEMO: A large-scale validated database for Persian speech emotion detection. Lang. Resour. Eval. 2019, 53, 1–16. [Google Scholar] [CrossRef]
  93. Aouf, A. Basic Arabic Vocal Emotions Database (BAVED). 2019. Available online: https://github.com/40uf411/Basic-Arabic-Vocal-Emotions-Dataset (accessed on 3 August 2025).
  94. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 527–536. [Google Scholar] [CrossRef]
  95. Lotfian, R.; Busso, C. Building Naturalistic Emotionally Balanced Speech Corpus by Retrieving Emotional Speech from Existing Podcast Recordings. IEEE Trans. Affect. Comput. 2019, 10, 471–483. [Google Scholar] [CrossRef]
  96. Parada-Cabaleiro, E.; Costantini, G.; Batliner, A.; Schmitt, M.; Schuller, B.W. DEMoS: An Italian emotional speech corpus: Elicitation methods, machine learning, and perception. Lang. Resour. Eval. 2020, 54, 341–383. [Google Scholar] [CrossRef]
  97. Wang, K.; Wu, Q.; Song, L.; Yang, Z.; Wu, W.; Qian, C.; He, R.; Qiao, Y.; Loy, C.C. MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation. In Proceedings of the Computer Vision–ECCV 2020, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 700–717. [Google Scholar]
  98. Sultana, S.; Rahman, M.S.; Selim, M.R.; Iqbal, M.Z. SUST Bangla Emotional Speech Corpus (SUBESCO): An audio-only emotional speech corpus for Bangla. PLoS ONE 2021, 16, 1–27. [Google Scholar] [CrossRef]
  99. Cui, C.; Ren, Y.; Liu, J.; Chen, F.; Huang, R.; Lei, M.; Zhao, Z. EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model. In Proceedings of the Interspeech 2021, Brno, Czechia, 30 August–3 September 2021; pp. 2766–2770. [Google Scholar] [CrossRef]
  100. Gong, B.; Li, N.; Li, Q.; Yan, X.; Chen, J.; Li, L.; Wu, X.; Wu, C. The Mandarin Chinese auditory emotions stimulus database: A validated set of Chinese pseudo-sentences. Behav. Res. Methods 2023, 55, 1441–1459. [Google Scholar] [CrossRef]
  101. Singla, C.; Singh, S. PEMO: A new validated dataset for Punjabi speech emotion detection. Int. J. Recent Innov. Trends Comput. Commun. 2022, 10, 52–58. [Google Scholar] [CrossRef]
  102. Das, R.K.; Islam, N.; Ahmed, M.R.; Islam, S.; Shatabda, S.; Islam, A.M. BanglaSER: A speech emotion recognition dataset for the Bangla language. Data Brief 2022, 42, 108091. [Google Scholar] [CrossRef]
  103. Chauhan, K.; Sharma, K.K.; Varma, T. MNITJ-SEHSD: A Hindi Emotional Speech Database. In Proceedings of the 2023 International Conference on Communication, Circuits, and Systems (IC3S), Bhubaneswar, India, 26–28 May 2023; pp. 1–6. [Google Scholar] [CrossRef]
  104. Singh, Y.B.; Goel, S. A lightweight 2D CNN based approach for speaker-independent emotion recognition from speech with new Indian Emotional Speech Corpora. Multimed. Tools Appl. 2023, 82, 23055–23073. [Google Scholar] [CrossRef]
  105. Retta, E.A.; Almekhlafi, E.; Sutcliffe, R.; Mhamed, M.; Ali, H.; Feng, J. A New Amharic Speech Emotion Dataset and Classification Benchmark. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 1–22. [Google Scholar] [CrossRef]
  106. Garcia-Cuesta, E.; Salvador, A.B.; Páez, D.G. EmoMatchSpanishDB: Study of speech emotion recognition machine learning models in a new Spanish elicited database. Multimed. Tools Appl. 2024, 83, 13093–13112. [Google Scholar] [CrossRef]
  107. Christop, I. nEMO: Dataset of Emotional Speech in Polish. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 22–24 May 2024; pp. 12111–12116. [Google Scholar]
  108. Chong, C.S.; Davis, C.; Kim, J. A Cantonese Audio-Visual Emotional Speech (CAVES) dataset. Behav. Res. Methods 2024, 56, 5264–5278. [Google Scholar] [CrossRef]
  109. Catania, F.; Wilke, J.W.; Garzotto, F. Emozionalmente: A Crowdsourced Corpus of Simulated Emotional Speech in Italian. IEEE Trans. Audio, Speech Lang. Process. 2025, 33, 1142–1155. [Google Scholar] [CrossRef]
  110. Sharifzadeh Jafari, Z.; Seyedin, S. A Novel Multi-Task and Ensembled Optimized Parallel Convolutional Autoencoder and Transformer for Speech Emotion Recognition. AUT J. Electr. Eng. 2024, 56, 213–226. [Google Scholar]
  111. Eriş, F.G.; Akbal, E. Enhancing speech emotion recognition through deep learning and handcrafted feature fusion. Appl. Acoust. 2024, 222, 110070. [Google Scholar] [CrossRef]
  112. Mishra, S.P.; Warule, P.; Deb, S. Speech emotion recognition using a combination of variational mode decomposition and Hilbert transform. Appl. Acoust. 2024, 222, 110046. [Google Scholar] [CrossRef]
  113. Li, H.; Zhang, X.; Duan, S.; Liang, H. Speech emotion recognition based on bi-directional acoustic-articulatory conversion. Knowl.-Based Syst. 2024, 299, 112123. [Google Scholar] [CrossRef]
  114. Li, L.; Glackin, C.; Cannings, N.; Veneziano, V.; Barker, J.; Oduola, O.; Woodruff, C.; Laird, T.; Laird, J.; Sun, Y. Investigating HuBERT-based Speech Emotion Recognition Generalisation Capability. In Proceedings of the The 23rd International Conference on Artificial Intelligence and Soft Computing 2024, Zakopane, Poland, 16–20 June 2024. [Google Scholar]
  115. Facchinetti, N.; Simonetta, F.; Ntalampiras, S. A Systematic Evaluation of Adversarial Attacks against Speech Emotion Recognition Models. Intell. Comput. 2024, 3, 0088. [Google Scholar] [CrossRef]
  116. Chen, W.; Tang, W.; Meng, Y.; Zhang, Y. An HASM-Assisted Voice Disguise Scheme for Emotion Recognition of IoT-enabled Voice Interface. IEEE Internet Things J. 2024, 11, 36397–36409. [Google Scholar] [CrossRef]
  117. Chou, H.C.; Goncalves, L.; Leem, S.G.; Salman, A.N.; Lee, C.C.; Busso, C. Minority Views Matter: Evaluating Speech Emotion Classifiers with Human Subjective Annotations by an All-Inclusive Aggregation Rule. IEEE Trans. Affect. Comput. 2024, 16, 41–55. [Google Scholar] [CrossRef]
  118. Liu, K.; Wei, J.; Zou, J.; Wang, P.; Yang, Y.; Shen, H.T. Improving Pre-trained Model-based Speech Emotion Recognition from a Low-level Speech Feature Perspective. IEEE Trans. Multimed. 2024, 26, 10623–10636. [Google Scholar] [CrossRef]
  119. Ali, S.; Naz, B.; Narejo, S.; Ahmed, Z. Alex Net-Based Speech Emotion Recognition Using 3D Mel-Spectrograms. Int. J. Innov. Sci. Technol. 2024, 6, 426–433. [Google Scholar] [CrossRef]
  120. Yue, L.; Hu, P.; Zhu, J. Gender-Driven English Speech Emotion Recognition with Genetic Algorithm. Biomimetics 2024, 9, 360. [Google Scholar] [CrossRef]
  121. Yan, J.; Li, H.; Xu, F.; Zhou, X.; Liu, Y.; Yang, Y. Speech Emotion Recognition Based on Temporal-Spatial Learnable Graph Convolutional Neural Network. Electronics 2024, 13, 2010. [Google Scholar] [CrossRef]
  122. Yu, S.; Meng, J.; Fan, W.; Chen, Y.; Zhu, B.; Yu, H.; Xie, Y.; Sun, Q. Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion. Electronics 2024, 13, 2191. [Google Scholar] [CrossRef]
  123. Goncalves, L.; Salman, A.N.; Naini, A.R.; Velazquez, L.M.; Thebaud, T.; Garcia, L.P.; Dehak, N.; Sisman, B.; Busso, C. Odyssey 2024-Speech Emotion Recognition Challenge: Dataset, Baseline Framework, and Results. Development 2024, 10, 4–54. [Google Scholar]
  124. Nfissi, A.; Bouachir, W.; Bouguila, N.; Mishara, B. Unveiling hidden factors: Explainable AI for feature boosting in speech emotion recognition. Appl. Intell. 2024, 54, 1–24. [Google Scholar] [CrossRef]
  125. Priya Dharshini, G.; Sreenivasa Rao, K. Transfer Accent Identification Learning for Enhancing Speech Emotion Recognition. Circuits Syst. Signal Process. 2024, 43, 5090–5120. [Google Scholar] [CrossRef]
  126. Haque, A.; Rao, K.S. Speech emotion recognition with transfer learning and multi-condition training for noisy environments. Int. J. Speech Technol. 2024, 27, 353–365. [Google Scholar] [CrossRef]
  127. Dabbabi, K.; Mars, A. Self-supervised Learning for Speech Emotion Recognition Task Using Audio-visual Features and Distil Hubert Model on BAVED and RAVDESS Databases. J. Syst. Sci. Syst. Eng. 2024, 33, 576–606. [Google Scholar] [CrossRef]
  128. Guo, L.; Song, Y.; Ding, S. Speaker-aware cognitive network with cross-modal attention for multimodal emotion recognition in conversation. Knowl.-Based Syst. 2024, 296, 111969. [Google Scholar] [CrossRef]
  129. Haque, A.; Rao, K.S. Hierarchical speech emotion recognition using the valence-arousal model. Multimed. Tools Appl. 2024, 84, 14029–14046. [Google Scholar] [CrossRef]
  130. Khurana, S.; Dev, A.; Bansal, P. ADAM optimised human speech emotion recogniser based on statistical information distribution of chroma, MFCC, and MBSE features. Multimed. Tools Appl. 2024, 84, 10155–10172. [Google Scholar] [CrossRef]
  131. Tyagi, S.; Szénási, S. Revolutionizing Speech Emotion Recognition: A Novel Hilbert Curve Approach for Two-Dimensional Representation and Convolutional Neural Network Classification. In Proceedings of the International Conference on Robotics in Alpe-Adria Danube Region, Cluj-Napoca, Romania, 5–7 June 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 75–85. [Google Scholar]
  132. Hama, K.; Otsuka, A.; Ishii, R. Emotion Recognition in Conversation with Multi-step Prompting Using Large Language Model. In Proceedings of the 26th International Conference on Human-Computer Interaction, Washington, DC, USA, 29 June–4 July 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 338–346. [Google Scholar]
  133. Akinpelu, S.; Viriri, S.; Adegun, A. An enhanced speech emotion recognition using vision transformer. Sci. Rep. 2024, 14, 13126. [Google Scholar] [CrossRef] [PubMed]
  134. Singla, C.; Singh, S.; Sharma, P.; Mittal, N.; Gared, F. Emotion recognition for human–computer interaction using high-level descriptors. Sci. Rep. 2024, 14, 12122. [Google Scholar] [CrossRef] [PubMed]
  135. Adebiyi, M.O.; Adeliyi, T.T.; Olaniyan, D.; Olaniyan, J. Advancements in accurate speech emotion recognition through the integration of CNN-AM model. TELKOMNIKA (Telecommun. Comput. Electron. Control.) 2024, 22, 606–618. [Google Scholar] [CrossRef]
Figure 1. Overview of speech emotion recognition system architecture.
Figure 2. The Circumplex model [38].
Figure 3. Distribution of SER corpora by language and speech type.
Table 1. Overview of advantages and limitations of different speech corpus types for SER.
Types of Speech | Advantages | Limitations
Acted | Consistent and controlled; wide range of emotions available; standardized for easy comparison. | Lacks natural authenticity; high cost to maintain control; limited context and spontaneity.
Elicited | Closer to natural emotions; stimulus-driven for context relevance. | Limited emotion range; context can still feel artificial; data imbalance issues.
Natural | Genuine, real-world expressions; abundant and lower cost; contextual information. | Noisy and harder to control; ethical and privacy concerns; data imbalance, overlapping emotions.
Table 2. Summary of speech emotion recognition databases considered in this review.
Corpus Name | Ref./Year | Language(s) | Speech Type/Speakers | Emotions | Recording Conditions | Annotation Method
DES | [61]/1997 | Danish | Acted/4 (2M, 2F) | Neutral, Surprise, Happiness, Sadness, Anger | Acoustically dampened studio | Perceptual test (20 listeners)
SUSAS | [62]/1997 | English | Elicited and natural/32+ (19M, 13F) | Talking styles (slow, fast, etc.), Stress (from tasks, environment, psychiatric analysis, etc.) | Simulated and real-world stress scenarios (e.g., amusement park rides, helicopter missions) | Categorical labels (stressed/neutral), subjective assessments, automated tools
CREST | [63]/2000–2005 | Japanese, Chinese, English | Natural/Primarily non-professional volunteers | Emotions (e.g., amusement), emotion-related attitudes (e.g., doubt, annoyance, surprise) | Everyday conversational situations, TV broadcasts, DVDs, videos | Primarily automatic transcription
INTERFACE | [64]/2002 | Slovenian, English, Spanish, French | Acted/20 (10M, 10F) | Anger, Sadness, Happiness, Fear, Disgust, Surprise, Neutral | Sound-treated studio | Actor self-assessments, 5 independent raters per language, listener evaluations for some languages
SmartKom | [65]/2002 | German | Elicited/45 (20M, 25F) | Evoked emotion categorized through a combination of facial expression analysis and vocal cues (prosodic features) | Three simulated technical scenarios (Public, Home, Mobile) | Orthographic transcriptions and prosodic markers (e.g., stress and pitch contours)
SympaFly | [66]/2003 | German | Elicited/Three sets: S1 (49M, 61F), S2 (59M, 39F), S3 (29M, 33F) | Positive (joyful), Neutral, Pronounced (emphatic), Weak Negative (surprised, ironic, compassionate), and Strong Negative (helpless, panic, angry, touchy) | Interaction with automatic dialogue system | Orthographic transliteration of dialogues, annotation of (emotional) user states, prosodic and conversational peculiarities
AIBO | [67]/2004 | German, English | Natural/German children: 51 (21M, 30F), English children: 30 | Neutral, Joyful, Surprised, Emphatic, Helpless, Touchy, Angry, Bored | Children human–robot communication | Orthographic transliteration of spoken word chain, verbal (filled pauses) and non-verbal (microphone noise)
Real-Life Call Center | [68]/2004 | French | Natural/350 dialog samples | Anger, Fear, Satisfaction, Neutral | Recording of call-center agent-client spoken dialog | Emotion annotation using task-dependent and 2-dimensional (Activation-Evaluation) scheme, prosodic parameters (F0 features)
IEMO | [69]/2004 | Italian | Acted/3 (2M, 1F) | Neutral, Angry, Happy, Sad | Acoustically treated room, high-quality microphones | Prosodic parameters calculation, labeling of emotions
EMODB | [46]/2005 | German | Acted/10 (5M, 5F) | Neutral, Anger, Fear, Joy, Sadness, Disgust, Boredom | Anechoic chamber, high-quality microphones | Utterance annotation using ESPS/waves+, phonemic segments labeled with SAMPA symbols
EMOTV1 | [70]/2005 | French | Natural/48 speakers | Anger, Despair, Disgust, Doubt, Exaltation, Fear, Irritation, Joy, Neutral, Pain, Sadness, Serenity, Surprise, Worry | Video clips of French TV interviews on 24 different topics | ANVIL tool for annotating perceived emotion, context annotation for coding context
CEMO | [71]/2006 | French | Natural/688 samples, involving 784 callers and 7 agents | Fear, Anger, Sadness, Hurt, Surprise, Relief, Compassion, Neutral | Recording of agent-caller spoken dialog from medical emergency call center | Annotation scheme using both dimensions and labels, context annotation
eNTERFACE’05 | [72]/2006 | English | Elicited/42 (34M, 8F) | Happiness, Sadness, Surprise, Anger, Disgust, Fear | High-quality microphone, specially conceived for speech recordings | Self-assessment; unsatisfying results were rejected
SAFE | [73]/2006 | English | Acted/400 speakers | Fear, Other negative emotions, Neutral, Positive emotions | Audiovisual sequences extracted from movies | Sequence segmentation, abstract descriptors, ANVIL tool
EmoTaboo | [74]/2007 | English | Elicited/10 (6M, 4F) | Amusement, Stress, Embarrassment, Satisfaction, Exasperation | Recording of spoken dialogues using high-quality microphones | Self-assessment, emotion labeling, annotation of cognitive state and context
CASIA | [75]/2008 | Mandarin Chinese | Acted/4 (2M, 2F) | Sad, Angry, Fear, Surprise, Happy, Neutral | Audio recording in a professional recording studio using high-quality microphones | Manual checking of annotation of emotions
IEMOCAP | [50]/2008 | English | Acted/10 (5M, 5F) | Anger, Sadness, Happiness, Frustration, Excitement, Neutral | Recording of spoken dialogues using high-quality microphones | Dimensional and categorical annotation of emotions
ITA-DB | [76]/2008 | Italian | Acted/Speakers from movies | Anger, Fear, Joy, Sadness, Neutral | 391 audio samples from 40 movies and TV series | Annotation based on predefined emotional categories, prosodic features
VAM | [77]/2008 | German | Natural/47 (11M, 36F) | Valence, activation, and dominance | 12 recordings of TV talk shows | Emotional content assessed by human listeners
SAVEE | [78]/2010 | English | Acted/4 (4M) | Anger, Disgust, Fear, Happiness, Sadness, Surprise, Neutral | Recording of utterances using high-quality microphones | Emotions assessed by native English speakers
IITKGP-SEHSC | [79]/2011 | Hindi | Acted/10 (5M, 5F) | Anger, Disgust, Fear, Happiness, Neutral, Sadness, Sarcasm, Surprise | Recording of sentences using high-quality microphones | Prosodic and spectral features separately extracted from the utterances, quality of expressed emotions evaluated using subjective listening tests
ITA-DB-RE | [80]/2011 | Italian | Natural/175 sentences (95M, 80F) | Neutral, Sadness, Anger | Recording of live courtroom proceedings | Experienced labeler performed a manual segmentation of the audio signal and emotion annotation of speech samples
TESS | [81]/2011 | English | Acted/2F | Anger, Disgust, Fear, Happiness, Pleasant Surprise, Sadness, Neutral | Recording of sentences in a sound-attenuating booth | Quality of expressed emotions evaluated using subjective listening tests
SEMAINE | [82]/2012 | English | Elicited/150 (57M, 93F) | Fear, Anger, Happiness, Sadness, Disgust, Contempt, Amusement | Recording of spoken dialogues using 2 high-quality microphones per person | Dimensional and categorical annotations
CREMA-D | [83]/2014 | English | Acted/91 (48M, 43F) | Happy, Sad, Anger, Fear, Disgust, Neutral | Recording sessions in a sound-attenuated environment using high-quality microphones | Emotion labels and real-value emotion intensity values collected using crowd-sourcing
EMOVO | [84]/2014 | Italian | Acted/6 (3M, 3F) | Disgust, Fear, Anger, Joy, Surprise, Sadness, Neutral | Recordings of sentences using high-quality microphones | Quality of expressed emotions evaluated using subjective listening tests
CHEAVD | [85]/2016 | Chinese | Natural, Acted/238 (125M, 113F) | Angry, Fear, Happy, Neutral, Sad, Surprise | Recordings from films, TV series, and TV shows | Non-prototypical emotional states labeled by native speakers
MSP-IMPROV | [86]/2016 | English | Acted/12 (6M, 6F) | Anger, Sadness, Happiness, Neutrality | Recording of sentences using collar microphones in a sound booth | Manual emotion annotation using crowd-sourcing
NNIME | [87]/2017 | Chinese | Acted/44 (22M, 20F) | Anger, Sadness, Happiness, Frustration, Neutral, Surprise | Audio recording using Bluetooth wireless close-up microphones | Discrete and continuous-in-time emotion annotation
AESDD | [88]/2018 | Greek | Acted/5 (2M, 3F) | Happiness, Sadness, Anger, Fear, Disgust | Audio recording in a professional sound studio | Annotation based on predefined emotional categories
ANAD | [89]/2018 | Arabic | Natural/Speakers from TV talk shows | Happiness, Anger, Surprise | Recordings from Arabic TV shows | Manual human emotion labeling
CaFE | [90]/2018 | Canadian French | Acted/12 (6M, 6F) | Sadness, Happiness, Anger, Fear, Disgust, Surprise, Neutral | Recording in a soundproof room using high-quality microphones | Annotation based on predefined emotional categories
CMU-MOSEI | [91]/2018 | English | Natural/1000 (570M, 430F) | Happiness, Sadness, Anger, Fear, Disgust, Surprise | Videos gathered from online video sharing websites | Crowd-sourcing-based emotion and sentiment annotation on a Likert scale
RAVDESS | [11]/2018 | North American English | Acted/24 (12M, 12F) | Calm, Happy, Sad, Angry, Fearful, Surprise, Disgust, Neutral | Recordings in a professional recording studio using high-quality microphones | Quality of expressed emotions evaluated using subjective listening tests
ShEMO | [92]/2018 | Persian | Natural/87 (56M, 31F) | Anger, Fear, Happiness, Sadness, Surprise, Neutral | Recordings of radio plays | Manual human emotion labeling
BAVED | [93]/2019 | Arabic | Acted/61 (45M, 16F) | Low emotion (Tired), Neutral, High emotion (Anger, Fear, Happiness, Sadness) | Recordings of spoken words | Annotation based on predefined emotional categories
MELD | [94]/2019 | English | Acted/1433 dialogues | Anger, Disgust, Fear, Joy, Sadness, Surprise, Neutral | Recordings from the TV series Friends | Manual human annotation of emotion label, sentiment, and vocal intonation
MSP-PODCAST | [95]/2019 | English | Natural/84,000 speaking turns | Angry, Sad, Happy, Surprise, Fear, Disgust, Contempt, Neutral | Podcast recordings downloaded from audio sharing websites | Manual emotion annotation (attribute and categorical) using crowd-sourcing
DEMoS | [96]/2020 | Italian | Elicited/68 (45M, 23F) | Anger, Sadness, Happiness, Fear, Surprise, Disgust, Guilt, Neutral | Speech recording in a semi-dark and quiet room using high-quality microphones | Self-assessment and external assessment of the elicited emotions
MEAD | [97]/2020 | English | Acted/60 | Angry, Disgust, Contempt, Fear, Happy, Sad, Surprise, Neutral | Recording using multi-view cameras | Quality of expressed emotions evaluated using subjective tests
SUBESCO | [98]/2021 | Bangla | Acted/20 (10M, 10F) | Anger, Disgust, Fear, Happiness, Sadness, Surprise, Neutral | Audio recordings in an anechoic sound recording studio using high-quality microphones | Quality of expressed emotions evaluated using subjective tests
EMOVIE | [99]/2021 | Mandarin Chinese | Acted/9724 audio samples | Emotion polarity: Positive, Neutral, Negative | Audio tracks from movie video clips | Human-labeled annotation of emotion polarity
MCAESD | [100]/2022 | Mandarin Chinese | Acted/6 (3M, 3F) | Happiness, Sadness, Anger, Fear, Disgust, Pleasant, Surprise, Neutral | Audio recordings in a professional recording studio using high-quality microphones | Each emotion recording validated by 40 native Chinese listeners
PEMO | [101]/2022 | Punjabi | Natural/60 | Anger, Happiness, Sadness, Neutral | Audio tracks from Punjabi movies | Manual subjective emotion annotation
BanglaSER | [102]/2022 | Bangla | Acted/34 (17M, 17F) | Anger, Happiness, Sadness, Surprise, Neutral | Speech audio recorded using smartphones and laptops | Manual subjective emotion annotation
MNITJ-SEHSD | [103]/2023 | Hindi | Acted/10 (5M, 5F) | Anger, Fear, Happy, Sad, Neutral | Audio recorded using an omnidirectional microphone | Quality of expressed emotions evaluated using subjective tests
IESC | [104]/2023 | English | Acted/8 (5M, 3F) | Anger, Fear, Happy, Sad, Neutral | Audio recorded using a speech recorder app on a mobile phone in a closed room | Annotation based on predefined emotional categories
ASED | [105]/2023 | Amharic | Acted/65 (40M, 25F) | Anger, Fear, Happy, Sad, Neutral | Audio recorded using a speech recorder app on a mobile phone in a quiet room | Manual subjective emotion annotation
EmoMatchSpanishDB | [106]/2023 | Spanish | Elicited/50 (30M, 20F) | Anger, Disgust, Fear, Happiness, Sadness, Surprise, Neutral | Audio recordings in a professional radio studio using high-quality microphones | Manual subjective emotion annotation using crowd-sourcing
nEMO | [107]/2024 | Polish | Acted/9 (5M, 4F) | Anger, Fear, Happy, Sad, Surprise, Neutral | Audio recorded in a home setting using a cardioid condenser microphone | Manual subjective emotion annotation
CAVES | [108]/2024 | Cantonese Chinese | Elicited/10 (5M, 5F) | Anger, Fear, Happy, Sad, Surprise, Disgust, Neutral | Audio recorded in a sound-attenuated booth using a video camera and a professional microphone | Manual subjective emotion annotation
Emozionalmente | [109]/2025 | Italian | Acted/431 (131M, 299F, 1 Other) | Anger, Disgust, Fear, Joy, Sadness, Surprise, Neutral | Crowd-sourcing platform, audio recorded using participants’ device microphones | Manual subjective emotion annotation using crowd-sourcing
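For readers who want to query Table 2 programmatically, the following sketch shows one possible way to encode a handful of its rows as structured records and filter them, for example by speech type and emotion coverage. The record fields and the subset of corpora shown are our own simplification of the table, not an official metadata schema.

```python
# Illustrative sketch: a few rows of Table 2 encoded as structured records so
# that corpora can be filtered programmatically. Field names and the subset of
# entries are a simplification made for this example, not an official schema.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SpeechCorpus:
    name: str
    year: int
    language: str
    speech_type: str             # "acted", "elicited", or "natural"
    num_speakers: Optional[int]  # None when the source reports samples/turns rather than speakers
    emotions: Tuple[str, ...]


CORPORA = [
    SpeechCorpus("EMODB", 2005, "German", "acted", 10,
                 ("neutral", "anger", "fear", "joy", "sadness", "disgust", "boredom")),
    SpeechCorpus("IEMOCAP", 2008, "English", "acted", 10,
                 ("anger", "sadness", "happiness", "frustration", "excitement", "neutral")),
    SpeechCorpus("RAVDESS", 2018, "North American English", "acted", 24,
                 ("calm", "happy", "sad", "angry", "fearful", "surprise", "disgust", "neutral")),
    SpeechCorpus("MSP-PODCAST", 2019, "English", "natural", None,
                 ("angry", "sad", "happy", "surprise", "fear", "disgust", "contempt", "neutral")),
]

# Example query: acted corpora covering at least seven emotion categories.
acted_rich = [c.name for c in CORPORA
              if c.speech_type == "acted" and len(c.emotions) >= 7]
print(acted_rich)  # ['EMODB', 'RAVDESS']
```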
Table 3. Speech corpora used by relevant papers on SER published in early 2024.
Paper | EMODB | PEMO | TESS | IEMOCAP | MSP-PODCAST | CREMA-D | MSP-IMPROV | RAVDESS | IITKGP-SEHSC | MELD | SAVEE | CMU-MOSEI | BAVED | EMOVO | CASIA
Z. Sharifzadeh Jafari [110]
F. Günes Eris [111]
S. P. Mishra [112]
H. Li [113]
L. Li [114]
N. Facchinetti [115]
W. Chen [116]
H.-C. Chou [117]
K. Liu [118]
S. Ali [119]
L. Yue [120]
J. Yan [121]
S. Yu [122]
L. Goncalves [123]
A. Nfissi [124]
G. Priya Dharshini [125]
A. Haque [126]
K. Dabbabi [127]
L. Guo [128]
A. Haque [129]
S. Khurana [130]
S. Tyagi [131]
K. Hama [132]
S. Akinpelu [133]
C. Singla [134]
M. O. Adebiyi [135]
Occurrences | 10 | 1 | 5 | 11 | 2 | 5 | 2 | 11 | 2 | 2 | 5 | 2 | 1 | 2 | 1
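The Occurrences row of Table 3 is simply a per-corpus count over the paper-to-corpus assignments. The sketch below shows that aggregation step on a small, purely hypothetical mapping; the paper names and assignments are placeholders, not those of Table 3.

```python
# Sketch of how an "Occurrences" row can be aggregated from a paper-to-corpora
# mapping. The mapping below is a hypothetical subset used only to show the
# counting step, not the actual assignments of Table 3.
from collections import Counter

papers_to_corpora = {
    "Paper A": ["EMODB", "RAVDESS"],
    "Paper B": ["IEMOCAP", "MSP-PODCAST"],
    "Paper C": ["EMODB", "IEMOCAP", "SAVEE"],
}

occurrences = Counter(
    corpus for corpora in papers_to_corpora.values() for corpus in corpora
)
for corpus, count in occurrences.most_common():
    print(f"{corpus}: {count}")
```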