1. Introduction
Human beings can express emotions in multidimensional ways. Among these, the most commonly used methods are facial expressions, speech, body language, gestures, and writing [
1]. In recent years, several systems have been developed to identify emotions from speech signals. Still, classical automatic speech recognition systems have paid little attention to paralinguistic information in speech, such as gender, emotion, and personality. Yet paralinguistic information is essential for efficient communication and mutual understanding. Speech emotion recognition (SER) has therefore attracted growing interest as a means of investigating emotional states through speech signals. The ability to recognize emotions has the potential to significantly enhance human–computer interaction (HCI), making it more intuitive and responsive to users’ emotional states [
2]. SER has diverse applications across various domains, including robotics, mobile services, and call centers. For example, in [
3], the authors propose a new SER model based on deep neural networks to improve human–robot interaction. Similarly, in [
4], the authors developed a smartphone application powered by cloud computing that identifies emotions in real time using a standard speech corpus. Furthermore, in [
5], the authors propose a call redistribution method based on SER that shows a remarkable reduction in waiting time for more urgent callers.
More recently, the applications of SER have expanded across healthcare, customer service, HCI and psychological assessment [
6,
7]. For instance, the integration of SER into smart home technologies and mental health monitoring systems highlights its potential in monitoring emotional well-being, particularly in personalized healthcare contexts [
8]. Since affective disorders are associated with specific voice patterns, in [
9], the authors explored a generalizable approach to evaluate clinical depression and remission from voice using transfer learning. They found that the SER model was able to accurately identify the presence of depression, providing a successful method for early diagnosis.
Despite its wide range of applications, SER remains a challenging task. Emotions vary not only between individuals but also across cultures and contexts, and their acoustic expression often overlaps across categories. Variability in recording conditions, imbalance in class distributions, and the difficulty of capturing spontaneous affect further complicate system design. Among these issues, the question of how emotions are represented and annotated in data is particularly critical: the lack of consensus on categorization schemes and the inherent subjectivity of labeling make it difficult to build consistent training corpora. Several key databases have therefore emerged in SER research, each addressing these challenges in different ways and serving distinct purposes depending on the type of emotional speech being studied.
Speech corpora used in SER are collected under various conditions, ranging from controlled laboratory settings with scripted speech to spontaneous, real-world interactions. The selection of speech types significantly impacts the training and performance of SER systems, as the variability in emotional expression across different contexts and datasets poses challenges in building models that generalize well [
10,
11]. Choosing the right dataset, therefore, plays a decisive role in the design and performance of SER systems, shaping their ability to generalize across speakers, maintain reliable annotations, and adapt to cultural differences. Although SER has been the subject of numerous surveys, most of them concentrate on features, models, and performance comparisons rather than the databases themselves. When datasets are mentioned, they are usually reduced to compact tables listing the language, number of speakers, or targeted emotions, and the most recent reviews include corpora only up to 2021 [
12,
13]. To the best of our knowledge, no dedicated review has yet examined emotional speech corpora in their own right, how they are collected, the contexts they represent, and the challenges they introduce for system design. This is the gap our paper aims to address.
In this paper, we provide a broad and up-to-date comparative review of emotional speech databases up to mid-2025, combining both technical and psychological perspectives. Rather than aiming for exhaustiveness, our selection balances historical “pillar” corpora (e.g., DES, SUSAS, EMO-DB), widely used benchmarks (e.g., IEMOCAP, AIBO), linguistically and culturally diverse resources (e.g., CASIA Mandarin, ITA-DB, INTERFACE), and more recent or domain-specific datasets, such as those built from stress speech or emergency calls. We also include innovative corpora that introduced new methodologies or collection processes, setting precedents for subsequent work in the field. The depth of description varies depending on the documentation available; some corpora, even important ones, are reported with limited detail in their original papers, and this naturally constrains the coverage we can provide. Beyond description, we offer a critical perspective on recurring issues, including the trade-off between clarity in acted corpora and ecological validity in natural ones, inconsistencies in labeling and annotation models, and gaps in cultural and linguistic diversity. Finally, we complement this analysis with a synthesis of trends across databases and a discussion of the resources that are actually being used in recent SER studies. In doing so, we hope to provide researchers not only with a practical guide to available datasets but also with insights into how corpus design choices shape the robustness and generalizability of SER systems.
The remainder of this paper is organized as follows.
Section 3 provides an overview of SER systems, whereas
Section 4 introduces the emotional models. In
Section 5, we discuss the required characteristics of databases for SER, while
Section 6 provides an extensive overview of the existing SER corpora available in the literature.
Section 7 presents an in-depth analysis of the considered SER corpora, and finally,
Section 8 concludes the paper.
6. Available Databases for SER
In this section, we present a comprehensive overview of the databases for speech emotion recognition available in the literature.
Table 2 compares these databases in terms of corpus name, reference, year of publication, speech language(s), speech type (natural, elicited, or acted), number of speakers, considered emotions, voice recording conditions, and emotion annotation methods. It summarizes the key parameters of the reviewed databases, serving as a descriptive reference to support the comparative analysis developed in our narrative review. To complement
Table 2,
Figure 3 provides an at-a-glance descriptive view of the inventory: the blue bars show the raw distributions of corpora by language and by speech type; the quality-weighted views (orange bars) are interpreted in
Section 7.
6.1. DES
The Danish Emotional Speech (DES) corpus ([
61], 1997) is a compact yet impactful resource designed to explore the perception and recognition of emotional speech in the Danish language. It consists of approximately 30 min of recordings by four professional actors (two male, two female) from Aarhus Theatre, delivering a diverse range of utterances in five emotional states: neutral, surprise, happiness, sadness, and anger. The speech material includes two isolated words (“yes” and “no”), nine sentences (four questions and five statements), and two longer passages of fluent speech. The recordings were made in an acoustically dampened studio using high-quality microphones and digital equipment to ensure optimal sound quality for acoustic analyses. This carefully controlled setup emphasizes clarity while capturing emotional variability across different speech contexts. The annotation process was validated through a perceptual test involving 20 listeners (10 male, 10 female) aged 18–59 years, primarily university staff. Listeners were tasked with identifying the expressed emotions, achieving an overall recognition rate of 67.3%. Recognition rates varied across speech types, with passages yielding the highest accuracy (76.3%), followed by words (67.5%) and sentences (65.3%). Notably, confusion between sadness and neutral emotions highlighted challenges in distinguishing subtle affective cues. Listener feedback further underscored the complexity of emotion perception, providing valuable insights for refining emotional speech modeling.
For its time, the DES corpus was a pioneering effort in the study of emotional speech, especially in the Danish language. Its compact design and inclusion of multiple speech types make it an accessible yet valuable resource for foundational research in affective computing and speech synthesis. The use of professional actors and controlled recording conditions set a precedent for later emotional speech datasets. Although limited in size, its focus on high-quality, contextually varied emotional expressions offered unique insights into emotion perception and recognition. The dataset remains relevant as a historical benchmark for understanding the evolution of emotional speech research and its methodological foundations.
6.2. SUSAS
The Speech Under Simulated and Actual Stress (SUSAS) database ([
62], 1997) is a pioneering resource designed to analyze speech variability under stress, providing a comprehensive framework for understanding the impact of emotional and environmental stressors on speech production. The corpus encompasses recordings from 32 speakers (13 female, 19 male) across various stress conditions, capturing 16,000 utterances in total. It spans five distinct domains: (1) talking styles, including slow, fast, loud, soft, angry, clear, and questioning tones; (2) single tracking tasks simulating speech produced in noisy environments or under the Lombard effect; (3) dual tracking tasks involving compensatory and acquisition responses under task-induced workload; (4) motion-fear tasks captured in real-world scenarios like amusement park rides (e.g., “Scream Machine” and “Free Fall”) and helicopter missions, inducing extreme physical stress; and (5) psychiatric analysis speech reflecting emotional states such as depression, fear, and anxiety. Utterances include a 35-word vocabulary set tailored to aircraft communication contexts, emphasizing its applicability to aviation and high-stress environments.
Actual stress data were recorded in extreme scenarios such as roller-coaster rides and Apache helicopter missions, involving high G-forces and genuine fear responses.
The annotation process focused on categorizing stress and emotional states across diverse speaking scenarios. It emphasized distinguishing subtle variations in stress levels and emotional cues within real-world and simulated high-stress conditions. Categorical labels such as “stressed” or “neutral” were applied, tailored to contexts like aviation and emergency communication. Annotations also accounted for speech affected by environmental factors, such as the Lombard effect, and task-induced workload stress, capturing nuanced vocal variations under pressure. While inter-rater reliability metrics were not explicitly reported, the combined use of subjective human assessments and automated tools aimed to ensure consistency. This methodology enhances the corpus’s utility for studying stress-induced speech dynamics and emotion recognition in high-stakes applications, offering valuable insights into the interplay of stress and vocal expression in realistic scenarios.
As one of the earliest databases to focus on stressed speech, the SUSAS corpus laid the groundwork for subsequent research in stress-resilient speech recognition, emotion recognition, and speech synthesis. Its integration of diverse stress domains and detailed annotations provides a unique platform for investigating how stress influences speech, making it an invaluable resource for both theoretical and applied studies in affective computing, forensic linguistics, and adaptive speech technologies.
6.3. CREST
The Expressive Speech Processing (ESP) project, initiated in Spring 2000 and lasting five years, was a pivotal part of the JST/CREST (Core Research for Evolutional Science and Technology) initiative, funded by the Japan Science and Technology Agency ([
63], 2001). This research aimed to collect a comprehensive database (CREST: the expressive speech database) of spontaneous, expressive speech tailored to meet the requirements of speech technology, particularly for concatenative synthesis. The project focused on statistical modeling and parameterization of paralinguistic speech data, developing mappings between the acoustic characteristics of speaking style and speaker intention or state, and implementing and testing prototypes of software algorithms in real-world applications. Emphasizing practical applications, the ESP project prioritized states likely to occur during interactions with information-providing or service-providing devices, such as emotional states (e.g., amusement) and emotion-related attitudes (e.g., doubt, annoyance, surprise).
Given the potential language-specific nature of expressive speech, the database encompasses materials in three languages: Japanese (60%), Chinese (20%), and English (20%). The target was to collect and annotate 1000 h of speech data over five years, primarily sourced from non-professional, volunteer subjects in everyday conversational situations, as well as emotional speech in television broadcasts, DVDs, and videos. To capture truly natural speech, a “Pirelli-Calendar” approach was employed, inspired by photographers who used to take 1000 rolls of film to produce a 12-photo calendar. Volunteers were equipped with long-term recorders, capturing samples of their day-to-day vocal interactions throughout the day. This extensive data collection method was aimed at ensuring adequate and representative coverage. However, annotating this vast amount of data was a monumental task, and automatic transcription using speech recognition posed significant challenges.
6.4. INTERFACE
The INTERFACE Emotional Speech Database ([
64], 2002) is a multilingual corpus developed to support research in SER across four languages: Slovenian, English, Spanish, and French. The dataset includes recordings from 20 professional actors (10 male, 10 female), approximately five actors per language. Each actor recorded speech stimuli designed to evoke six primary emotions, namely, anger, sadness, happiness, fear, disgust, and surprise, alongside neutral speech. The linguistic diversity in the corpus extends to accents and dialects within each language, ensuring comprehensive coverage for cross-linguistic emotion analysis. Each language subset includes approximately 1200 recordings, resulting in a total of 4800 audio samples. The speech material comprises a mix of isolated words, short sentences, and longer passages, reflecting varied syntactic and semantic structures to explore emotional expression across different speech contexts.
The recordings were conducted in a sound-treated studio environment to minimize noise and ensure high fidelity. High-quality condenser microphones and professional audio interfaces captured the audio signals. The actors read from carefully designed prompts displayed on screens to maintain uniformity, while also allowing for the spontaneity needed to evoke authentic emotions. Each recording session lasted approximately four hours per actor, with periodic breaks to ensure sustained vocal quality and emotional performance. For some languages, a Portable Laryngograph was employed to capture additional vocal characteristics, enriching the dataset’s acoustic features. Recordings were conducted in two separate sessions spaced two weeks apart. The annotation process combined actor self-assessments with evaluations from five independent raters per language, enabling a comparison between intended and perceived emotions. Subjective evaluations of the Spanish and Slovenian subsets provided further insights into the accuracy and intensity of emotional expressions. For the Spanish subset, 16 non-professional listeners evaluated 56 utterances (seven per emotion, including long and short versions), while 11 listeners assessed 64 utterances for the Slovenian subset. Listeners identified primary emotions, rated intensity on a scale of 1 to 5, and could mark secondary emotions if necessary. Annotations included both primary and secondary emotion labels, providing additional granularity for mixed or overlapping emotional states. Inter-rater agreement, measured using Krippendorff’s alpha, indicated moderate to high reliability, although exact values were not consistently reported. Feedback from annotators highlighted cultural and linguistic variations in emotional perception, reflecting the complexity of annotating multilingual datasets. This combination of robust training, inclusion of secondary labels, and supplementary evaluation tests enhances the dataset’s value for cross-cultural emotion recognition.
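For readers who want to reproduce such reliability checks, the snippet below is a minimal sketch, using the open-source `krippendorff` Python package and entirely hypothetical rating matrices, of how Krippendorff’s alpha could be computed for categorical emotion labels and ordinal intensity ratings; it illustrates the metric itself rather than the evaluation pipeline actually used for INTERFACE.

```python
# Illustrative sketch (not the INTERFACE authors' pipeline): Krippendorff's alpha
# for categorical emotion labels and ordinal intensity ratings.
# Requires `pip install krippendorff`.
import numpy as np
import krippendorff

# Hypothetical data: rows = raters, columns = utterances, np.nan = missing rating.
# Categorical labels encoded as integers (e.g., 0=anger, 1=sadness, 2=happiness, 3=fear).
category_ratings = np.array([
    [0, 1, 2, 2, np.nan],
    [0, 1, 2, 1, 3],
    [0, 2, 2, 1, 3],
], dtype=float)

# Intensity ratings on a 1-5 scale for the same utterances.
intensity_ratings = np.array([
    [4, 3, 5, 2, np.nan],
    [4, 2, 5, 3, 1],
    [5, 3, 4, 3, 2],
], dtype=float)

alpha_labels = krippendorff.alpha(reliability_data=category_ratings,
                                  level_of_measurement="nominal")
alpha_intensity = krippendorff.alpha(reliability_data=intensity_ratings,
                                     level_of_measurement="ordinal")
print(f"alpha (emotion labels): {alpha_labels:.3f}")
print(f"alpha (intensity, 1-5): {alpha_intensity:.3f}")
```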
The database’s multilingual and multimodal design marked a significant contribution to emotion recognition research, particularly in cross-cultural contexts. By providing a standardized resource for analyzing emotional speech across multiple languages, it supported advancements in affective computing, multilingual speech synthesis, and early machine learning-based emotion recognition systems. This corpus has informed subsequent research efforts, offering a valuable foundation for exploring the relationship between language, culture, and emotion in the context of human–computer interaction and multilingual communication. Its influence is reflected in later datasets and studies that adopted and refined its approaches to meet evolving research needs.
6.5. SmartKom
The SmartKom Multimodal Corpus ([
65], 2002) is a dataset developed under the German SmartKom project, designed to advance research in multimodal HCI. This corpus integrates acoustic, visual, and tactile modalities, collected through Wizard-of-Oz experiments that simulate realistic system interactions. The dataset includes recordings from 45 participants (25 female and 20 male) across 90 sessions, with a diverse demographic breakdown. Recording sessions were conducted across three technical scenarios (Public, Home, Mobile), enhancing SmartKom’s applicability for various real-world environments. In SmartKom Public, the system functioned as a publicly accessible information interface, enabling users to perform tasks such as making cinema reservations or obtaining navigation details. The SmartKom Home scenario simulated an intelligent personal assistant designed for domestic environments, capable of managing tasks such as scheduling appointments or controlling home appliances. The SmartKom Mobile setup demonstrated the capabilities of a portable communication assistant, enabling users to interact with the system while on the move, such as checking emails or receiving navigation guidance. Audio data were captured across 10 channels at a high sampling rate to ensure detailed acoustic analysis. Video recordings included frontal and lateral views of participants, graphical overlays of gestures, and infrared visuals to track hand movements, all recorded in high-quality formats to support in-depth analysis.
Speech data were annotated with orthographic transcriptions and prosodic markers, such as stress and pitch contours, allowing for a detailed analysis of vocal patterns. Gestural annotations employed segmentation in the 2D plane, mapping hand and pen trajectories with high precision, supported by infrared data from the SIVIT gesture analyzer and graphical tablet inputs. User states were categorized through a combination of facial expression analysis and vocal cues, leveraging advanced annotation protocols to distinguish emotional nuances. However, the paper does not explicitly report inter-annotator agreement or validation metrics, which are crucial for ensuring reliability in subjective annotations. A standout feature of this process is the synchronization of all annotations within a QuickTime framework, which aligns multiple data streams, including audio, video, and gesture coordinates, with millisecond accuracy. The use of the BAS Partitur File (BPF) format further enhances the corpus by consolidating linguistic, gestural, and user-state annotations into a single, unified dataset.
The SmartKom Multimodal Corpus contributed significantly to research in multimodal interaction by integrating speech, gestures, and user states in realistic scenarios. Its detailed annotations and synchronized data streams supported advancements in adaptive interfaces and emotion recognition. By being publicly accessible, it encouraged broader research use and influenced the development of subsequent multimodal corpora, shaping methodologies in human–computer interaction studies.
6.6. SympaFly
The SympaFly database ([
66], 2003) was developed as part of the SMARTKOM project to study user behavior and emotional states in fully automated speech dialogue systems for flight reservations and bookings. The covered emotions include joyful, neutral, emphatic, surprised, ironic, compassionate, helpless, panic, touchy, and angry. The dataset captures dialogues across three stages of system development, reflecting progressive improvements in system performance and variations in user strategies. The first stage (S1) refers to the first usability test and comprises 110 dialogues with 2291 user turns and 11,581 words. The second stage (S2) includes 98 dialogues with 2674 user turns and 9964 words, during which the performance of the dialogue manager improved. The third stage (S3) contains 62 dialogues with 1900 user turns and 7655 words, collected using the same experimental settings as S1. Dividing the corpus into these three developmental stages highlights user adaptation to varying system performance levels, providing a unique perspective on the interplay between system reliability and user frustration. Recordings were conducted over standardized telephone channels under controlled experimental conditions, with users completing tasks such as providing flight details, frequent flyer IDs, and payment information. These standardized setups ensured consistent data collection across all stages but may limit the applicability of the data to more modern multimodal systems integrating visual or tactile inputs.
The annotation process focused on holistic user states and prosodic features, classifying emotional states into five categories: positive (e.g., joyful), neutral, pronounced (e.g., emphatic), weak negative (e.g., surprised, ironic, compassionate), and strong negative (e.g., helpless, panic, angry, touchy). Annotation by two independent annotators, followed by a consensus process, provided a degree of reliability, though the paper does not report inter-annotator agreement metrics. Annotations highlighted distinctions between categories, such as strong negative and positive, with prosodic features like hyper-articulation and emphasis being more pronounced in strong negative states, reflecting heightened emotional intensity. In contrast, overlapping features in neutral and weak negative states, such as pauses and subtle tonal shifts, posed challenges for consistent classification. Dialogue success, annotated on a four-point scale, correlated with emotional states: positive and neutral states appeared more frequently in successful dialogues, while strong negative states were associated with unsuccessful interactions. Additionally, conversational peculiarities like repetitions provided insights into user strategies during system limitations.
The originality of the SympaFly database lies in its focus on capturing emotional states within real-world human–machine dialogues, moving beyond prototypical or acted emotions to reflect interaction-centered states. By documenting user interactions across three stages of system development, the database provides insights into how system performance impacts user frustration, success rates, and adaptive behaviors. At the time of its release, it addressed the need for datasets that captured naturalistic emotional and interactional patterns in speech-based automated systems. While limited to telephone interactions and predefined scenarios, its contribution remains significant for understanding the interplay between emotion, prosody, and dialogue outcomes, forming a basis for advancements in dialogue system evaluation and user behavior analysis.
6.7. AIBO
The “You Stupid Tin Box” Corpus is a cross-linguistic dataset developed to study children’s speech and emotional responses during interactions with the Sony AIBO robot ([
67], 2004). This corpus emphasizes spontaneous emotional speech, leveraging a Wizard-of-Oz experimental setup to provoke a wide range of natural emotions. The data was collected from 51 German children aged 10–13 and 30 English children aged 4–14, in scenarios where the AIBO robot behaved either obediently or disobediently, simulating varying levels of system performance. The recordings included approximately 9.2 h of German speech and 1.5 h of English speech, after filtering silences and irrelevant segments. The setup involved the children giving spoken instructions to the AIBO robot to complete tasks such as navigating a map or avoiding “poisoned” cups. The robot’s actions were pre-determined in the disobedient mode, intentionally provoking emotional responses such as frustration, joy, or irritation. Recordings were conducted using high-quality wireless microphones in controlled environments like classrooms and multimedia studios. The dataset also includes accompanying video recordings, though these are not publicly accessible due to privacy restrictions.
The annotation focused on emotional user states and prosodic peculiarities. Emotional labels were derived by multiple annotators, who categorized each word into classes such as joyful, angry, surprised, irritated, emphatic, and neutral. Prosodic annotations included phenomena like pauses, emphasis, shouting, and clear articulation. A majority-vote approach was used to ensure reliability in labeling. The corpus provides a granular view of interactional dynamics by aligning children’s verbal reactions with the robot’s pre-scripted actions.
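As an illustration of the majority-vote scheme described above, the following is a minimal sketch that merges word-level labels from several annotators; the label set and the tie-breaking rule (falling back to neutral) are assumptions for illustration, not the exact FAU AIBO procedure.

```python
# Illustrative sketch of word-level majority voting across annotators.
# The tie-breaking rule (fall back to "neutral") is an assumption, not
# necessarily the rule used for the AIBO corpus.
from collections import Counter

def majority_label(labels, fallback="neutral"):
    """Return the majority label for one word; use `fallback` on ties."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback  # no clear majority among annotators
    return counts[0][0]

# Hypothetical annotations: one list of per-annotator labels per word.
word_annotations = [
    ["angry", "angry", "emphatic", "angry", "irritated"],
    ["neutral", "emphatic", "emphatic", "neutral", "emphatic"],
    ["joyful", "neutral", "surprised", "joyful", "joyful"],
]

consensus = [majority_label(labels) for labels in word_annotations]
print(consensus)  # ['angry', 'emphatic', 'joyful']
```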
The AIBO Corpus is notable for its focus on spontaneous, naturalistic emotions in a real-world setting, making it a significant departure from acted emotional datasets. Its cross-linguistic component and robust annotation processes make it highly relevant for SER, particularly in studying how users, especially children, respond emotionally to autonomous systems. This originality and naturalistic approach position it as a foundational resource for both emotion modeling and human–robot interaction studies.
6.8. Real-Life Call Center
The Real-Life Call Center Corpora ([
68], 2004) were collected to analyze emotional manifestations in task-oriented spoken dialog within call center interactions. The study utilizes two corpora: CORPUS 1, recorded at a stock exchange customer service center, and CORPUS 2, recorded at a capital bank service center. Together, the corpora contain a total of 2850 recorded dialog samples and over 100,000 annotated words, with CORPUS 1 contributing 8 h of recordings and CORPUS 2 contributing 5 h. These datasets capture transactional and problem-solving scenarios, where clients express a range of emotions based on the task demands. The recording setup involved natural telephone conversations between agents and clients discussing services such as stock transactions, account management, and technical issues. In CORPUS 1, longer dialog samples with heightened emotional intensity were observed due to the urgency and stakes of stock transactions, whereas CORPUS 2 included shorter dialog with more moderate emotional expressions, primarily focused on routine banking issues. All conversations were transcribed and annotated with detailed semantic, dialogic, and emotional markers. The annotation process employed a two-dimensional emotional framework that represented activation (emotional intensity) and evaluation (valence). A total of four primary emotion classes—anger, fear, satisfaction, and neutral—were labeled, along with nuanced emotions like irritation and anxiety to capture the complexity of real-life dialog. Inter-annotator agreement, measured using the Kappa statistic, was higher for CORPUS 1 than for CORPUS 2, reflecting the clearer emotional cues in the former due to heightened emotional stakes. Prosodic and lexical cues such as fundamental frequency (F0), rhythm, and word choice were analyzed for their role in signaling emotional states.
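For readers unfamiliar with the Kappa statistic mentioned above, the snippet below is a small sketch of chance-corrected agreement between two annotators computed with scikit-learn; the labels are hypothetical and the example is not tied to the actual call-center annotations.

```python
# Illustrative sketch: Cohen's kappa between two annotators over the four
# primary classes used in the call-center corpora (hypothetical labels).
from sklearn.metrics import cohen_kappa_score

annotator_a = ["anger", "neutral", "fear", "satisfaction", "anger", "neutral", "fear", "neutral"]
annotator_b = ["anger", "neutral", "anger", "satisfaction", "anger", "fear", "fear", "neutral"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```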
The Real-Life Call Center Corpora are notable for their focus on realistic emotional dialog, capturing task-dependent emotional expressions in naturalistic call center interactions. By combining detailed annotations of lexical and prosodic features with a context-driven approach, the corpora provide valuable insights into how emotions manifest in real-world communication. Their focus on spontaneous emotional expressions, paired with a structured annotation process, makes them particularly relevant for understanding emotional dynamics in task-oriented dialog systems.
6.9. IEMO
The Italian Emotional (IEMO) Speech Corpus was developed to study emotional variations in speech and facilitate emotional speech synthesis ([
69], 2004). The dataset includes recordings from three professional Italian speakers (one female and two males), all experienced in recording tasks, ensuring consistency in their performances. Each speaker recorded 25 sentences in four emotional styles—happiness, sadness, anger, and neutral—providing a baseline for analysis. However, the exclusion of more nuanced emotions (e.g., fear, surprise) limits its applicability to scenarios requiring a broader emotional spectrum. Sentences were approximately 10 words long, designed to be phonetically balanced, and intentionally devoid of semantic emotional content to prevent bias in speaker performances. Recordings were conducted in a sound-treated room using high-quality microphones and digital acquisition equipment. The controlled recording environment ensured acoustic clarity, making the dataset well-suited for prosodic and acoustical analysis.
The annotation process aimed to capture emotional variations in speech through the direct labeling of emotional categories—happiness, sadness, anger, and neutral. These categories were informed by prosodic parameters, including fundamental frequency (F0), energy, and syllable duration, which were calculated for each utterance and compared to the neutral baseline to quantify emotional differences. Automated phonetic alignment tools were employed to achieve syllable-level segmentation, with manual corrections applied to address extralinguistic artifacts like pauses and hesitations. The process effectively balanced precision with natural variation, accommodating speaker-dependent nuances, particularly in anger, which ranged from “hot” to “cold”. While this variability adds realism to the dataset, it also highlights the challenges of achieving consistent annotations for nuanced emotional states.
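The prosodic comparison described above can be approximated with open-source tools. The following is a minimal sketch, assuming hypothetical file paths, of extracting mean F0, mean energy, and utterance duration with librosa and expressing an emotional utterance relative to a neutral baseline; it mirrors the general idea rather than the exact measurements of the IEMO study.

```python
# Illustrative sketch (not the original IEMO analysis): extract mean F0,
# mean RMS energy, and duration for an utterance and compare them with a
# neutral-baseline utterance. File paths are hypothetical placeholders.
import numpy as np
import librosa

def prosodic_profile(path):
    y, sr = librosa.load(path, sr=None)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    return {
        "mean_f0_hz": float(np.nanmean(f0)),          # ignore unvoiced frames
        "mean_rms": float(librosa.feature.rms(y=y).mean()),
        "duration_s": len(y) / sr,
    }

neutral = prosodic_profile("speaker1_neutral_s01.wav")  # hypothetical path
angry = prosodic_profile("speaker1_anger_s01.wav")      # hypothetical path

# Relative change of each prosodic parameter with respect to the neutral baseline.
for key in neutral:
    delta = (angry[key] - neutral[key]) / neutral[key] * 100
    print(f"{key}: {delta:+.1f}% vs. neutral")
```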
This corpus is notable as the first Italian emotional speech database, addressing a significant gap in Italian emotional speech resources. Its focus on phonetically balanced, semantically neutral sentences and quantification of prosodic features offered a structured and reproducible methodology for studying emotional speech. At the time, it laid the groundwork for advancements in emotional speech synthesis and remains a valuable resource for research on prosodic correlates of emotions and their applications in speech technology.
6.10. EMODB
The German Emotional Speech Dataset ([
46], 2005) was developed to provide a controlled resource for studying emotional expression in speech. The dataset features recordings from 10 actors (5 male, 5 female) performing 10 predefined sentences in seven distinct emotional categories: anger, fear, joy, sadness, disgust, boredom, and neutral. These sentences were carefully designed to be phonetically diverse and contextually neutral, ensuring their suitability for expressing various emotions without introducing linguistic bias. The actors used the Stanislavski method, recalling personal experiences to evoke genuine emotions while maintaining a consistent speaking style. In total, the corpus contains approximately 800 utterances, recorded in an anechoic chamber using high-quality equipment, including a Sennheiser MKH 40 microphone and a Tascam DAT recorder, ensuring acoustic clarity. Electro-glottographic data were also captured to analyze voice quality and phonatory behavior.
Emotional utterances, recorded with high precision, underwent perceptual tests by 20 listeners to ensure recognizability and naturalness. Utterances recognized with an accuracy above 80% and judged as natural by more than 60% of listeners were included for further analysis. Each utterance was phonetically labeled using narrow transcription supported by visual analysis of oscillograms and spectrograms. Annotations incorporated phonatory and articulatory settings, voice characteristics, and stress levels for comprehensive emotional analysis. The use of both perceptual tests and phonetic analysis ensures consistency, capturing clear and interpretable emotional expressions.
EmoDB’s significance lies in its carefully controlled design, providing high-quality acoustic recordings paired with detailed phonetic and emotional annotations. By using acted emotions, the database achieves a balance between emotional clarity and experimental reproducibility. As one of the earliest publicly available emotional speech datasets, EmoDB has become a cornerstone in the field of speech emotion recognition, cited widely for its role in standardizing methodologies and enabling detailed analysis of acoustic and prosodic features across emotions.
6.11. EMOTV1
The EMOTV1 corpus ([
70], 2005) was developed to analyze real-life emotions in multimodal contexts, focusing on the interactions of vocal, visual, and contextual cues. This dataset comprises 51 video clips of French TV interviews, featuring 48 speakers discussing 24 diverse topics such as politics, society, and health. The total duration of the corpus is 12 min, with clip lengths ranging from 4 to 43 s. The primary aim was to study mixed and non-basic emotions in natural settings, contrasting with acted datasets. Clips were selected based on the visibility of multimodal behaviors, including facial expressions, gestures, and gaze, providing a rich dataset for analyzing spontaneous emotional expressions.
The annotation process involved two annotators who labeled emotions using the Anvil tool under three conditions: audio-only, video-only, and combined audio-video. Emotional segments were created when annotators perceived consistent behaviors, resulting in a final agreed segmentation through the union of audio-based and the intersection of video-based annotations. Labels were selected from 14 predefined categories, including anger, joy, sadness, despair, surprise, and neutral, as well as intensity and valence dimensions. Inter-annotator agreement was quantified using Kappa for categorical variables and Cronbach’s alpha for intensity and valence, with higher agreement observed in audio-only conditions compared to combined audio-video, highlighting challenges in annotating multimodal emotions. The results revealed ambiguities in non-basic emotions, including blended and masked emotions, and sequences of consecutive emotions, offering a nuanced view of naturalistic emotional expressions.
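As a brief illustration of the dimensional agreement measure mentioned above, the snippet below sketches Cronbach’s alpha over hypothetical intensity ratings, with annotators treated as “items” and segments as observations; it is a generic formulation, not the EMOTV1 evaluation code.

```python
# Illustrative sketch: Cronbach's alpha over intensity ratings, treating each
# annotator as an "item" and each emotional segment as an observation.
import numpy as np

def cronbach_alpha(ratings):
    """ratings: 2-D array, rows = segments, columns = annotators."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                           # number of annotators
    item_vars = ratings.var(axis=0, ddof=1)        # variance of each annotator's ratings
    total_var = ratings.sum(axis=1).var(ddof=1)    # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical intensity ratings (scale 1-5) from two annotators on six segments.
intensity = [
    [4, 5],
    [2, 2],
    [3, 4],
    [5, 5],
    [1, 2],
    [4, 3],
]
print(f"Cronbach's alpha: {cronbach_alpha(intensity):.2f}")
```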
The uniqueness of EMOTV1 stems from its focus on contextually rich, spontaneous emotional data, offering a resource for exploring multimodal emotional behaviors. By employing diverse annotation schemes and addressing the complexities of naturalistic emotions, the corpus contributes to research on emotion recognition and its potential applications in areas like conversational agents and human–computer interaction. While its limited size and visual constraints pose challenges, EMOTV1 provides a valuable foundation for studying real-life emotional expressions in realistic scenarios.
6.12. CEMO
The CEMO Corpus ([
71], 2006) captures naturalistic emotional interactions recorded in a French medical emergency call center. The corpus is particularly valuable for studying real-life emotional expressions in high-stress, goal-oriented scenarios. It includes 20 h of speech recorded from 688 dialog samples, involving 784 unique callers and 7 agents. On average, each dialog contains 48 turns, providing a rich dataset for analyzing emotional dynamics in critical interactions. The callers include patients or third parties, such as family members or caregivers, reflecting diverse perspectives in medical emergencies. The recording environment ensured the naturalness of emotional expressions while maintaining ethical standards. Recordings were made unobtrusively during real medical calls, preserving the authenticity of the emotional content. Anonymity and privacy were strictly respected, with personal information removed, and the corpus is not publicly distributed. Contextual metadata was annotated, including call origin (e.g., patient or medical center), role in the dialog (caller or agent), reason for the call (e.g., immediate help, medical advice), and decision outcomes. Additional details such as the acoustic quality (e.g., noise, mobile vs. fixed phone), caller demographics (e.g., age, gender, accent), and health or mental state (e.g., hoarseness, grogginess, intoxication) were also labeled. The setup is well-designed to capture naturalistic emotional expressions in real-world scenarios.
The dataset reflects high-stakes, emotionally charged situations that are often difficult to replicate in controlled environments. However, the corpus does not explicitly describe how background noise was mitigated during annotation or pre-processing, which could influence the accuracy of emotional labeling. The corpus shows a strong commitment to privacy through the anonymization of personal data and restricted access, ensuring the ethical use of sensitive information. Recordings were made unobtrusively during medical calls, preserving natural behavior. However, the paper does not clarify whether explicit consent or institutional approval was obtained, leaving some room for improvement in transparency. Despite this, the privacy safeguards effectively balance ethical considerations with research value.
The emotional annotation used a hierarchical framework combining coarse-grained categories and fine-grained subcategories, reflecting methodological rigor. Coarse categories included fear, anger, sadness, hurt, relief, and compassion, with fine-grained labels like stress, panic, annoyance, and resignation, resulting in 20 distinct emotional labels. Segments were annotated with one or two labels to account for emotional mixtures, such as relief/anxiety or compassion/annoyance, which are frequent in real-life interactions. Inter-annotator agreement was assessed through Kappa scores (0.35 for agents, 0.57 for callers) and validated with a re-annotation process, achieving 85% coherence across annotators over time. A perceptive test involving 43 participants further evaluated the annotations, achieving 85% agreement between expert and naive annotators. Even though inter-annotator agreement for agent dialog is relatively low, which could raise concerns about label consistency, the re-annotation process improved coherence. Providing more details about the annotators’ training and the criteria for label assignment could further validate the methodology. The inclusion of a perceptual test involving both experts and naive annotators is commendable, as it enhances the reliability of the annotations by cross-validating emotional perceptions.
The corpus’s value lies in its focus on spontaneous emotional expressions in high-stakes medical interactions. Its integration of contextual metadata, hierarchical labels, and detailed annotations makes it a key resource for advancing speech emotion recognition in applied settings like medical communication and human–computer interaction.
6.13. eNTERFACE’05
The audio-visual emotion database (eNTERFACE’05) was designed as a reference resource for testing and evaluating emotion recognition algorithms based on video, audio, or combined audio-visual data ([
72], 2006). The original material was collected from recordings of 42 subjects of 14 different nationalities (9 from Belgium; 7 each from Turkey and France; 6 from Spain; 4 from Greece; and 1 each from Italy, Austria, Cuba, Slovakia, Brazil, the U.S.A., Croatia, Canada, and Russia). Among the 42 subjects, 81% were men and 19% were women. The approach used to obtain recordings in different emotional states was elicitation. Although the participants came from different countries and had different mother tongues, all the experiments were conducted in English. Each subject listened to six successive short stories, each of them eliciting a particular emotion (anger, disgust, fear, happiness, sadness, surprise). Five emotion-specific sentences had to be uttered after each emotion was elicited. The procedure begins with the subject listening carefully to a short story to immerse themselves in the given situation. Once prepared, the subject reads, memorizes, and pronounces each of the five proposed utterances, one at a time, representing different reactions to the situation. Subjects are instructed to convey as much expressiveness as possible, focusing on eliciting the intended emotion in their delivery. If the result is deemed satisfactory, the process proceeds to the next emotion; otherwise, the subject is asked to repeat the attempt. When a subject struggled to express an emotion effectively, the experimenters provided guidance based on their understanding of typical emotional expressions. In some cases, the experimenters opted not to repeat the process if they concluded that satisfactory results were unlikely with the subject in question.
In their article [
72], the authors present both the situations (the stories told) used to elicit each specific emotion and the sentences the speakers had to utter as reactions. The speech signal was recorded using a high-quality microphone specifically designed for speech recordings. The microphone was positioned approximately 30 cm below the subject’s mouth. The recording room measured approximately ten square meters. To prevent external sounds from interfering with the experiments, the doors remained closed at all times. The audio was recorded at a sampling rate of 48,000 Hz in an uncompressed stereo 16-bit format. Of the 42 participants in the database, 25 (60%) successfully conveyed all six emotions, producing five convincing reactions for each proposed situation. The remaining 17 participants (40%) were unable to convey the intended emotions in all their reactions, resulting in unusable samples that were excluded from the dataset.
6.14. SAFE
The Situation Analysis in a Fictional and Emotional (SAFE) Corpus ([
73], 2006) was developed to study extreme fear-related emotions in dynamic and abnormal situations. The corpus consists of 400 audiovisual sequences extracted from 30 recent movies, with sequence durations ranging from 8 s to 5 min. These scenes were selected to illustrate emotions in contexts involving both normal and abnormal situations, such as natural disasters (fires, earthquakes, floods) and psychological or physical threats (kidnapping, aggression). The dataset contains a total of 7 h of recordings, with speech constituting 76% of the data, and features approximately 4073 speech segments spoken by 400 different speakers, covering a range of accents and genders (47% male, 32% female, 1% child). The recording environment reflects the variability and complexity of real-world conditions, including overlapping speech, environmental noise, and background music, which are annotated in the corpus. Speech segments were rated on a four-level audio quality scale, ranging from inaudible to clean and realistic sound recordings. Approximately 71% of the speech data comes from abnormal situations, enhancing the dataset’s relevance for studying emotional speech under high-stakes, dynamic conditions.
Annotation was conducted using a multimodal tool (ANVIL) and included two levels of emotional descriptors: categorical (fear, other negative emotions, neutral, and positive emotions) and dimensional (intensity, evaluation, and reactivity). Fear was further categorized into subtypes, such as stress, terror, and anxiety, to capture nuanced variations. Two annotators, one native English speaker and one bilingual, labeled the emotional content independently, with inter-annotator agreement assessed in subsequent evaluations. A supplementary blind annotation, focusing on audio cues alone, was performed to assess fear detection without video context. The annotation methodology is robust overall; however, greater transparency about inter-annotator agreement and annotator training could improve confidence in the reliability of the annotations. Despite these gaps, the methodology aligns well with the corpus’s goals and makes it a valuable resource for studying complex emotional dynamics in real-world scenarios.
The originality of this corpus lies in its focus on extreme fear-related emotions in dynamic and contextually rich scenarios, which are underrepresented in existing corpora. Its emphasis on capturing emotional manifestations within task-dependent contexts makes it a useful resource for studying emotion detection in challenging and high-stakes environments. Additionally, the corpus provides insights into the interplay between environmental factors, speaker variability, and emotional expression, contributing to research on speech emotion recognition in complex, real-world conditions.
6.15. EmoTaboo
This corpus ([
74], 2007) was developed to capture multimodal emotional expressions during human–human interactions in a task-oriented context. The dataset focuses on emotions elicited during a game scenario, where participants played an adapted version of the Taboo game designed to provoke spontaneous emotional reactions. The corpus consists of approximately 8 h of video and audio data, recorded from 10 pairs of participants, yielding a total of 20 individual sessions. Participants alternated roles as a “mime” or “guesser” in the game, tasked with describing or guessing specific words under time constraints, introducing elements of stress, frustration, and amusement. The recording setup used a controlled lab environment, where four camera angles captured participants’ facial expressions and upper body gestures, while high-quality microphones recorded speech. The experiment was structured to elicit a variety of emotional responses through challenging word prompts, time pressure, and penalties for incorrect guesses. While the controlled setting ensured high-quality multimodal recordings, the use of a confederate (a scripted participant) in some interactions could introduce potential biases in elicited emotional responses, as their actions were designed to provoke specific reactions.
The annotation process employed a hierarchical framework to label emotional expressions, mental states, and communication acts. Annotators could select up to five emotional labels per segment, allowing the dataset to reflect the complexity of human emotions. A total of 21 emotion labels were used, including amusement, frustration, stress, pride, and embarrassment, along with dimensions such as intensity and valence to capture subtleties in emotional expression. The annotations also included cognitive and social states, providing insights into both individual emotions and interpersonal dynamics. Annotator agreement and validation were achieved through iterative processes, ensuring reliable emotional labeling. Ultimately, the annotation process is robust, utilizing a hierarchical framework that includes 21 emotion labels and dimensions like intensity and valence, capturing the nuanced nature of emotional interactions. While allowing up to five labels per segment adds flexibility, the fixed list of labels may constrain the ability to annotate unanticipated or subtle emotional states.
The EmoTaboo Corpus stands out for its multimodal integration of speech, gestures, and facial expressions in a dyadic interaction setting. Its focus on spontaneous emotional reactions in task-oriented interactions makes it a valuable resource for studying emotion dynamics in real-time communication. While its lab-controlled environment may limit its generalizability to fully naturalistic settings, the corpus provides a rich dataset for exploring the interplay of multiple modalities in emotional expression.
6.16. CASIA
The CASIA Mandarin corpus is a Chinese emotional corpus recorded by the Institute of Automation, Chinese Academy of Sciences, and designed for Mandarin speech synthesis research ([
75], 2008). CASIA is composed of 9600 short Mandarin utterances covering six different emotions, i.e., sadness, anger, fear, surprise, happiness, and neutral. The audio samples were recorded by four speakers (two males and two females, aged 25 to 45) in a professional recording studio equipped with a dedicated sound card and large-diaphragm microphones.
6.17. IEMOCAP
The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database ([
50], 2008) is a comprehensive resource developed at the University of Southern California to study multimodal emotional communication through speech, gestures, and facial expressions. The corpus consists of approximately 12 h of data, collected from ten professional actors (five male and five female) participating in five recording sessions. Each session involved a pair of actors performing both scripted and improvised scenarios designed to elicit a wide range of emotional expressions. The covered emotions include happiness, anger, sadness, frustration, excitement, and neutral, with additional blended emotional states to reflect the complexity of real interactions. By integrating both scripted and spontaneous elements, the setup balances the naturalness of emotions with the consistency needed for controlled analysis, while the use of 61 motion capture markers, high-quality audio, and two high-resolution cameras ensures comprehensive multimodal data collection. This setup makes it a versatile and reliable resource for studying the interplay between verbal and nonverbal emotional communication. The annotation process employs a robust dual framework, using both categorical descriptors (happiness, anger, sadness, frustration, and neutral) and dimensional attributes (valence, activation, and dominance) rated on a five-point scale, allowing the corpus to capture both discrete and continuous aspects of emotional states. The involvement of six annotators, with overlapping turns reviewed by three evaluators, ensures reliability while accounting for blended emotions and nuanced expressions. This approach enhances the dataset’s flexibility and granularity, making it well-suited for analyzing real-life emotional complexities.
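To make the dual annotation framework concrete, the sketch below shows, under simplified assumptions, how a consensus categorical label (majority vote, discarded when no majority exists) and averaged valence/activation/dominance scores could be derived from several evaluators’ judgments; it approximates common practice with IEMOCAP-style labels rather than the corpus’s official tooling.

```python
# Illustrative sketch of merging multiple evaluators' judgments for one
# utterance: majority category (or None if no majority) plus mean VAD scores.
# This approximates common practice with IEMOCAP-style annotations and is not
# the official release tooling.
from collections import Counter
from statistics import mean

def merge_annotations(categories, vad_ratings):
    """categories: list of labels; vad_ratings: list of (val, act, dom) on 1-5 scales."""
    label, votes = Counter(categories).most_common(1)[0]
    consensus = label if votes > len(categories) / 2 else None  # None = no majority
    valence, activation, dominance = (mean(dim) for dim in zip(*vad_ratings))
    return consensus, {"valence": valence, "activation": activation, "dominance": dominance}

# Hypothetical judgments from three evaluators on one dialogue turn.
labels = ["frustration", "frustration", "anger"]
vad = [(2, 4, 3), (2, 4, 4), (1, 5, 4)]
print(merge_annotations(labels, vad))
# -> consensus 'frustration', with mean VAD of roughly 1.67, 4.33, 3.67
```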
The IEMOCAP Corpus is widely regarded as a pioneering resource in the field of multimodal emotion research, combining innovative technology and rigorous methodology to set a benchmark for emotion recognition studies. Its integration of speech, gestures, and facial expressions, paired with its detailed annotation framework, has significantly influenced subsequent work in SER, affective computing, and human–computer interaction. Its foundational contributions continue to make it a critical resource for advancing research in these domains.
6.18. ITA-DB
The Italian Emotional Database (ITA-DB) Corpus ([
76], 2008) is an Italian emotional speech database developed to support emotion recognition in judicial contexts. This corpus addresses the lack of Italian datasets for studying emotional speech, with a focus on the specific dynamics of courtroom debates. It comprises 391 samples of emotional speech, sourced from 40 movies and TV series dubbed by Italian professional actors. The database includes five emotional categories: anger, fear, joy, sadness, and neutral, representing emotions deemed relevant to judicial scenarios. The use of high-quality dubbed audio ensures clear recordings and speaker diversity, as the samples include multiple actors across various productions. However, the reliance on acted speech from entertainment media may not fully capture the complexity of emotional expressions in judicial settings. Courtroom proceedings often involve nuanced emotional states such as confusion, frustration, resignation, or moral indignation, which are difficult to simulate authentically. These complexities are not covered by the selected emotional categories, potentially limiting the corpus’s applicability to the judicial domain. Annotation was based on the predefined emotional categories, with an effort to balance representation across classes. However, the paper lacks details about the annotation process, such as the criteria used for labeling and the reliability of the annotations.
The originality of the ITA-DB Corpus lies in its attempt to create an Italian-specific emotional speech database for a specialized application, addressing a gap in existing resources. However, the choice of emotions and the reliance on acted content may limit its ability to fully capture the subtle and layered emotions typical of courtroom interactions. While it represents a valuable step forward for Italian emotion recognition research, the corpus highlights the need for more context-specific data to better support applications in judicial and legal environments. This observation emphasizes the importance of developing a more comprehensive Italian emotional speech corpus that reflects the complexity of real-world emotional expressions in specialized domains.
6.19. VAM
The Vera am Mittag Corpus ([
77], 2008) is a German audio-visual emotional speech database developed to study spontaneous emotional expressions in natural, unscripted interactions. The corpus was recorded from 12 broadcasts of the German TV talk show “Vera am Mittag” (translated as “Vera at Noon”), aired on the Sat.1 channel between December 2004 and February 2005. It comprises 12 h of data segmented into 45 dialogues and further into 1421 utterances from 104 speakers aged 16 to 69 years, with 70% aged 35 or younger, reflecting a broad demographic. Only the set of “good” and “very good” speakers was considered, for a total of 47 speakers (11 males and 36 females). While the public entertainment setting adds to the naturalistic quality of the corpus, elements such as background noise, applause, and audience or moderator influence may introduce variability during analysis. The setup captures audio-visual signals from spontaneous and emotionally rich discussions, focusing on personal topics such as romantic affairs, family disputes, and friendship crises. The discussions are moderated by a host and involve 2 to 5 participants per dialogue, creating an environment conducive to emotional variability. The segmentation includes extracting individual utterances, storing audio files, and accompanying visual frames.
Emotion annotations utilize a dimensional framework, evaluating utterances along three primitives: valence (negative to positive), activation (calm to excited), and dominance (weak to strong). The annotation process involved 17 human evaluators for the initial set (499 utterances) and 6 evaluators for the extended set, ensuring rigorous labeling. The Self-Assessment Manikins (SAMs) were used as an evaluation tool on a 5-point scale ranging from −1 to +1 for each dimension, capturing the nuanced nature of emotions beyond discrete categories. This framework is particularly well-suited for analyzing spontaneous and mixed emotional states, with the involvement of multiple evaluators adding robustness to the labeling process. The resulting corpus covers a diverse emotional range, though emotions lean towards neutral or negative states due to the topics discussed.
This corpus is notable for its spontaneity, stemming from real-life, unscripted interactions, and its use of both audio and visual modalities. Its dimensional annotation scheme allows for a nuanced understanding of emotion transitions and person-specific expression patterns.
6.20. SAVEE
Haq and Jackson ([
78], 2010) compiled an audio-visual database using recordings from four English male actors portraying seven emotions in a controlled setting. The dataset includes six basic emotions—anger, disgust, fear, happiness, sadness, and surprise—along with a neutral state. Each actor contributed 120 utterances, resulting in a total of 480 sentences. Each recording session included 15 phonetically balanced sentences from the TIMIT database for each emotion: 3 common sentences, 2 emotion-specific sentences, and 10 generic sentences unique to each emotion. During the recordings, the emotions to be acted and the corresponding text prompts were displayed on a monitor in front of the actors. The audio was recorded using a Beyerdynamic microphone. Evaluation of the dataset was carried out by 10 participants, including 5 native English speakers and 5 individuals who had resided in the UK for over a year.
6.21. IITKGP-SEHSC
The Indian Institute of Technology Kharagpur Simulated Emotion Hindi Speech Corpus (IITKGP-SEHSC) is a pioneering emotional speech dataset specifically created to analyze emotions in Hindi, filling a significant gap in the resources available for Indian languages ([
79], 2011). This dataset was developed using professional artists from Gyanavani FM radio station in Varanasi, India, ensuring high-quality and consistent emotional expressions. It includes 12,000 utterances, based on 15 neutral Hindi text prompts performed by 10 speakers (5 male, 5 female) over 10 recording sessions across eight emotional categories: anger, disgust, fear, happiness, neutral, sadness, sarcasm, and surprise. Each emotion thus contains 1500 utterances (15 prompts × 10 speakers × 10 sessions), with the total duration of the corpus reaching approximately nine hours. The inclusion of sarcasm, while not a basic emotion, adds a unique layer of complexity to the dataset, addressing practical nuances in emotional communication that are often overlooked in traditional corpora.
The data collection process took place in a quiet environment using a single SHURE dynamic cardioid microphone (C660N) at a distance of one foot from the speaker. Speech signals were recorded at a sampling rate of 16 kHz with 16-bit resolution, ensuring high fidelity. Sessions were scheduled on alternate days to capture natural variations in human speech production, and each artist recorded all 15 sentences consecutively for a single emotion to maintain coherence. The use of professional radio artists and a structured recording protocol ensured consistent quality, while the setup provided a robust foundation for capturing both nuanced and exaggerated emotional expressions. The emotions expressed in the database were evaluated through subjective listening tests by 25 postgraduate and research students at IIT Kharagpur. These tests assessed the naturalness and clarity of the simulated emotions, achieving an average recognition accuracy of 71% for male speakers and 74% for female speakers. Confusion matrices highlighted the reliability of the dataset, particularly for emotions such as anger and sadness, which were effectively classified. While the reliance on subjective evaluation provided meaningful insights, further details on inter-annotator agreement metrics or annotation guidelines could enhance transparency and consistency in the labeling process.
The IITKGP-SEHSC corpus is an important resource in the field of emotional speech research, offering a linguistically diverse dataset tailored to Hindi. By including prosodic and spectral analyses and addressing both basic emotions and nuanced expressions like sarcasm, it expands the scope of research into emotional communication. This corpus represents a meaningful step toward developing emotion recognition systems for diverse linguistic and cultural contexts, contributing to the broader understanding of emotional expressions in speech.
6.22. ITA-DB-RE (Real Emotions)
The Italian Emotional Database (Real Emotions) was developed to address the limitations of acted emotional databases by capturing authentic emotional expressions in judicial contexts ([
80], 2011). This corpus focuses on real-life emotional states, recorded during 30 trials across seven courts in Italy, resulting in a total of 135 h of audio recordings. An experienced Italian female labeler performed a manual segmentation of the recordings to isolate speech samples, removing noisy segments and overlapping speakers. This process resulted in 522 speech samples with durations ranging from 2 to 25 s and an average length of 18 s. To address the overrepresentation of neutral samples, the dataset was refined to a balanced set of 175 samples, consisting of 88 neutral, 68 angry, and 19 sad segments.
While these emotions are relevant to courtroom proceedings, the limited selection may not fully capture other significant states such as frustration or confusion, which are also common in such high-stakes environments. The participants included judges (46 samples), witnesses (67 samples), lawyers (29 samples), and prosecutors (33 samples), with a gender distribution of 95 males and 80 females, reflecting the diverse roles and perspectives in the courtroom. The recording setup ensured naturalistic data collection by capturing live courtroom proceedings without intervention, preserving the spontaneity and authenticity of emotional expressions. The annotation process involved manual labeling of the samples into neutral, anger, and sadness, carried out by a team of two experienced and one naïve labeler. However, the paper does not elaborate on the specific criteria used for labeling or the agreement metrics between the annotators, leaving some aspects of the methodology’s consistency and reliability open to interpretation. Despite this, the resulting dataset reflects a well-curated selection of real-world emotional expressions tied to the unique context of judicial proceedings.
This corpus represents a significant advancement from the earlier Italian corpus (ITA-DB [
76]), which relied on controlled, performed emotions sourced from movies and TV series. The real-life setting of the ITA-DB-RE database represents a substantial step forward in depicting the complexities of genuine emotional interactions, transitioning from acted to authentic emotional data within the specialized context of courtroom interactions. By capturing real-world emotional dynamics, this corpus addresses the need for datasets reflecting genuine interactions in specialized domains. Its relevance lies in supporting emotion recognition systems for judicial and legal applications, though its limited methodological focus leaves some gaps in documenting crucial details. Additionally, given the sensitivity of the courtroom context, the lack of detailed information about how recordings were obtained, including participant consent or ethical approvals, raises some uncertainty regarding compliance with privacy and data protection standards, which were less commonly documented at the time. Nonetheless, the corpus marks a notable step forward in Italian emotional corpora by incorporating authentic emotional expressions relevant to high-pressure environments like courtrooms.
6.23. TESS
The Toronto Emotional Speech Set (TESS [
81], 2011) is an English-language emotional speech database designed to examine the recognition of emotions across different age groups. It includes 2800 audio recordings, created by embedding 200 semantically neutral target words into the carrier phrase “Say the word __”. These words were articulated by two female actors: one aged 26 and the other 64, both native English speakers from the Toronto area. The dataset captures seven distinct emotional states: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral, providing a broad emotional range suitable for research in affective computing and emotion recognition. To ensure high-quality audio and minimize noise interference, the recordings were conducted under controlled conditions. Both actors underwent audiometric testing to confirm normal hearing thresholds, ensuring reliable delivery of emotional expressions. Their backgrounds in musical training and university education contributed to the clarity and consistency of their portrayals. Furthermore, the use of phonemically balanced words embedded in a standardized carrier phrase enhanced the precision of emotional prosody analysis, making the dataset highly suitable for detailed emotion studies.
Evaluations focused on listener perception, achieving above-chance accuracy in emotion recognition, which underscores the reliability of the actors’ portrayals. Although the reliance on acted emotions ensures consistency, it may limit the applicability of the dataset to naturalistic contexts. Nonetheless, the TESS corpus provides valuable insights into how emotions are perceived across different age groups, making it a specialized resource for studies in speech emotion recognition, affective computing, and age-related auditory processing. By addressing a gap in emotional speech datasets focusing on older and younger speakers, it contributes to understanding the interplay of age and emotion in vocal communication, though its scope remains limited due to the small number of speakers.
6.24. SEMAINE
The SEMAINE corpus ([
82], 2012) is a detailed multimodal database developed to study emotionally colored conversations between humans and artificial agents, highlighting the spontaneous emergence of emotions within specific contexts. The corpus includes 959 conversations recorded from 150 participants, each lasting approximately 5 min, resulting in 80 h of synchronized audiovisual data. High-quality recordings were captured using five high-resolution cameras and four microphones, ensuring robust multimodal integration. The corpus spans multiple emotional categories, including anger, disgust, fear, happiness, sadness, surprise, and neutral, providing a comprehensive foundation for affective computing research.
The interactions are divided into three scenarios: Solid SAL, Semi-automatic SAL, and Automatic SAL, balancing human-driven and automated system responses to capture diverse emotional expressions. In Solid SAL, a human operator simulated emotional stances using nonverbal cues like eye contact, producing 190 recordings from 95 character interactions via a teleprompter for natural exchanges. Semi-automatic SAL utilized scripted responses with varied feedback types (audio-visual, video-only, filtered audio), yielding 144 degraded and 44 baseline recordings to examine communication breakdowns. Automatic SAL, featuring an autonomous system detecting facial actions, gestures, and prosodic cues, generated 964 recordings across two versions—one fully capable of nonverbal communication and another limited in skills. The annotation methodology was rigorous, combining multiple raters with the FEELtrace system to capture valence and activation dimensions continuously over time. In Solid SAL, user clips were annotated by eight raters for 4 sessions, six raters for 17 sessions, and at least three raters for the remaining sessions. Operator clips had four raters for three sessions, while the rest were annotated by a single rater. Eleven user sessions in Semi-automatic SAL were annotated by two raters, while all Automatic SAL sessions were annotated by a single rater. The process incorporated five core traces—valence, activation, power, anticipation/expectation, and intensity—along with optional descriptors to capture nuanced emotional and interaction dynamics. Solid SAL featured the most detailed annotations, including both core dimensions and optional traces, while Semi-automatic SAL maintained a similar level with fewer raters. In Automatic SAL, engagement tracing focused on user responses to the autonomous system, emphasizing real-time interaction quality. This structured approach provided a nuanced representation of emotional dynamics, ensuring granularity while acknowledging some variability in rater numbers across scenarios.
By integrating high-quality audiovisual recordings, diverse emotional categories, and sophisticated annotation techniques, the SEMAINE corpus addresses critical gaps in multimodal emotion research. The three interaction scenarios play a crucial role in capturing a wide spectrum of emotional exchanges, ranging from human-driven interactions to autonomous system responses. This focus on real-time, emotionally rich interactions bridges the divide between human-agent and human–human communication studies. The corpus is a significant resource for developing emotionally aware systems and has become a valuable tool in the field of affective computing and emotion recognition.
6.25. CREMA-D
The Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D [
83], 2014) is an English-language multimodal dataset designed to study emotion perception across three modalities: audio-only, visual-only, and audio-visual. It includes 7442 clips from 91 actors (48 male, 43 female) of diverse ages (20–74 years, mean 36) and ethnicities, expressing six basic emotions: happy, sad, anger, fear, disgust, and neutral. The dataset is structured around 12 semantically neutral sentences, performed under the guidance of professional theater directors to ensure consistent and expressive emotional portrayals. Recordings were conducted in a sound-attenuated environment, with actors seated against a green screen, using high-quality audio and video equipment to capture clear signals. This controlled setup, combined with a diverse actor pool, enhances the dataset’s demographic variability and suitability for multimodal emotion analysis. The annotation process utilized 2443 raters, likely through a well-established crowd-sourcing platform, to ensure broad participation and diversity in evaluations. Each rater annotated only a subset of clips, enabling efficient task distribution. Multiple evaluations per clip ensured consensus ratings, distinguishing matching, non-matching, and ambiguous samples. The dataset achieved moderate inter-rater agreement overall (Krippendorff’s alpha of 0.42) but strong agreement for unambiguous clips (alpha of 0.79), confirming reliability for high-quality samples. Recognition rates also varied significantly across modalities, with audio-visual stimuli achieving the highest accuracy (63.6%), followed by visual-only (58.2%) and audio-only (40.9%), highlighting the complementary nature of multimodal emotion analysis.
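Agreement figures of the kind reported above can, in principle, be computed from a raters-by-clips label matrix; the sketch below is not the CREMA-D pipeline but an illustration using the third-party krippendorff Python package with hypothetical data, where NaN marks clips a given rater did not evaluate.

```python
# Minimal sketch: nominal Krippendorff's alpha for crowd-sourced emotion labels.
# Requires the third-party package: pip install krippendorff
import numpy as np
import krippendorff

# Rows = raters, columns = clips; integer-coded emotion labels
# (e.g., 0=neutral, 1=happy, 2=sad, 3=anger, 4=fear); NaN = not rated.
ratings = np.array([
    [0,      1, 2, np.nan, 4],
    [0,      1, 2, 3,      4],
    [np.nan, 1, 0, 3,      4],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```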
The CREMA-D corpus is a significant contribution to emotional speech and multimodal datasets, offering a diverse and large-scale resource with controlled recording conditions and extensive crowd-sourced annotations. It captures emotional expressions across modalities, providing valuable insights into single and multimodal emotion perception. Its design has supported advancements in affective computing, human–computer interaction, and emotion recognition research, solidifying its place as a widely used resource in the field.
6.26. EMOVO
EMOVO ([
84], 2014) is described by its authors as the first Italian emotional speech database specifically designed to study and develop speech emotion recognition systems. The corpus consists of recordings from six professional actors (three male and three female), aged between 23 and 30 years, who performed 14 sentences in seven emotional states: neutral, disgust, fear, anger, joy, surprise, and sadness. These emotions, referred to as the “Big Six” plus neutral, are widely recognized in emotional speech research. The sentences included a mix of semantically neutral and nonsense phrases to avoid semantic bias in emotion recognition. This careful design ensures the corpus’s applicability in both research and practical applications.
Conducted in the laboratories of the Fondazione Ugo Bordoni in Rome, the recordings were made using professional-grade equipment, including Shure SM58LC microphones and a Marantz PMD670 digital recorder. Each actor recorded all 14 sentences in the seven emotional states, resulting in 588 utterances, with approximately 10 min of material per actor. Actors were encouraged to move naturally, introducing some variability in signal intensity due to changes in distance from the microphone, a minor factor that may influence processing. Emotional performances were validated through a subjective emotion discrimination test with 24 listeners, achieving an 80% overall recognition accuracy. Neutral, anger, and sadness were the most easily recognized emotions, while joy and disgust were less distinct. This validation, though limited to categorical assessments, confirmed the reliability of the actors’ performances and the robustness of the dataset for emotion recognition tasks.
The EMOVO Corpus has become a cornerstone for SER research in Italian, addressing the lack of emotional speech datasets in this linguistic context. Its inclusion of the “Big Six” emotions plus neutral, controlled laboratory conditions, and rigorous validation have made it a benchmark resource. The corpus has greatly influenced Italian emotional speech research, supporting advancements in human–computer interaction and speech synthesis, and remains a key resource for developing and evaluating emotion recognition systems.
6.27. CHEAVD
The CASIA Chinese Natural Emotional Audio-Visual Database (CHEAVD [
85], 2016) is a large-scale, multimodal resource designed to support research in multimodal emotion recognition, natural language understanding, and human–computer interaction. The dataset includes 140 min of emotion-rich audio-visual segments extracted from 34 films, 2 TV series, and 4 TV shows. The selection process excluded science fiction and fantasy genres, prioritized materials featuring actors’ original voices over dubbed versions, and focused on Mandarin-speaking content with minimal accents. This corpus contains 2600 segments, each lasting 1 to 19 s, with an average duration of 3.3 s. It features recordings of 238 speakers, balanced across genders (52.5% male, 47.5% female), and spanning a wide age range from 11 to 62 years, grouped into six age categories, underscoring its utility for robust, speaker-independent emotion analysis. Data was carefully selected to represent real-life emotional expressions, prioritizing scenarios closely tied to daily life. Raw materials were drawn from films and television series reflecting realistic environments, chat shows, talk shows, and impromptu speech programs. Strict criteria for segmentation ensured high-quality data, with segments containing only one speaker’s speech and facial expressions, minimal noise, and complete utterances.
The annotation process for CHEAVD employed a multi-step strategy to ensure nuanced and contextually relevant emotional labels. Four native Chinese annotators labeled each segment, focusing on both primary and secondary emotions, resulting in a rich annotation framework. The process spanned 26 emotion categories, encompassing basic emotions like happy, sad, and angry, as well as nuanced states such as shy, sarcastic, and hesitant. Notably, the inclusion of labels for fake or suppressed emotions added depth, capturing the complex interplay between internal emotional states and external expressions. While the wide range of categories enriches the dataset’s granularity, it also posed challenges in maintaining annotation consistency, as reflected in moderate inter-annotator agreement (Cohen’s kappa values around 0.5). This outcome highlights the inherent difficulty of labeling nuanced emotions and balancing granularity with reliability. The final annotations were curated through consensus discussions to mitigate discrepancies and achieve a coherent framework. This rigorous annotation process underscores the dataset’s potential for advancing research in naturalistic emotion modeling and its ability to support both categorical and continuous emotional analyses.
CHEAVD’s integration of multimodal data (audio, visual, and speech) and its emphasis on capturing non-prototypical and subtle emotional expressions make it a noteworthy resource. By addressing cultural and linguistic gaps in existing emotion datasets, it provides a tailored platform for Mandarin emotion recognition and cross-cultural studies. The inclusion of baseline emotion recognition experiments using LSTM-RNN models with a soft attention mechanism, achieving an average accuracy of 56% for six major emotions, highlights its practical relevance for developing and evaluating advanced multimodal systems.
6.28. MSP-IMPROV
The MSP-IMPROV corpus ([
86], 2016) is a multimodal emotional dataset designed to explore emotion perception and recognition, balancing naturalness and control. It includes 8438 dyadic conversational turns from 12 actors (6 males, 6 females) recorded across six sessions. The dataset comprises target-improvised (652 samples), target-read (620 samples), other-improvised (4381 samples), and natural interactions (2785 samples), covering emotions such as happiness, sadness, anger, and neutrality. Recorded in a soundproof booth, it combines dyadic interactions with a mix of scripted sentences and improvised dialogue to capture emotional depth. The corpus incorporates both audio and visual data, allowing for a comprehensive analysis of emotional expression through speech, facial expressions, and body language. Actors, recruited from a theater program, were guided by designed hypothetical scenarios to blend scripted content with spontaneous expressions, ensuring both authenticity and control while enhancing the emotional dynamics of the interactions.
The annotation process used crowd-sourcing via Amazon Mechanical Turk to evaluate the emotional content of the samples. A reference set of 652 Target-Improvised sentences was preannotated to monitor evaluator performance in real time. This approach ensured unreliable annotators were identified and stopped mid-task. The dataset achieved a Fleiss’ Kappa statistic of 0.487, which indicates moderate agreement among evaluators and is comparable to other spontaneous emotional corpora. Agreement levels varied slightly by subset: Target-Improvised (k = 0.497), Target-Read (k = 0.479), Other-Improvised (k = 0.458), and Natural Interaction (k = 0.487). The Target-Improvised sentences were evaluated by an average of 28.2 annotators, while other subsets had at least five annotators, ensuring robust emotional ratings. This method combined a reference set with dynamic evaluator monitoring, demonstrating methodological rigor in emotional corpus annotation.
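For readers unfamiliar with the metric, the subset-level values above come from a standard Fleiss' kappa computation over categorical labels. A minimal illustration with hypothetical labels and the statsmodels library is sketched below; note that Fleiss' kappa assumes an equal number of ratings per item, whereas the number of MSP-IMPROV annotators varied across samples.

```python
# Minimal sketch: Fleiss' kappa over categorical emotion labels (hypothetical data).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = samples, columns = annotators; values are emotion codes
# (e.g., 0=neutral, 1=happy, 2=sad, 3=angry).
labels = np.array([
    [1, 1, 1, 2, 1],
    [3, 3, 3, 3, 3],
    [0, 2, 0, 0, 2],
    [2, 2, 1, 2, 2],
])

table, _ = aggregate_raters(labels)          # samples x categories count table
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.3f}")
```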
The MSP-IMPROV corpus bridges the gap between acted and naturalistic datasets by integrating controlled recording conditions with the spontaneity of improvisation. Its focus on dyadic interactions and audiovisual modalities addresses critical gaps in emotional speech research, providing a valuable resource for studying emotion perception and recognition. The corpus has contributed significantly to advances in affective computing, human–computer interaction, and multimodal emotion analysis, establishing itself as an important dataset for research on realistic emotional communication.
6.29. NNIME
The NTHU-NTUA Chinese Interactive Multimodal Emotion Corpus (NNIME) is a large-scale, publicly available resource designed for the analysis of multimodal emotional interactions in Mandarin Chinese ([
87], 2017). It was developed through collaboration between engineers and drama experts, emphasizing the study of dyadic human interactions to mirror real-life emotional exchanges. The dataset includes recordings of 44 professional actors (24 females, 20 males) aged 19–30, all trained in dramatic arts and native Mandarin speakers. These actors were paired into 22 dyads (seven female–female, ten female–male, five male–male) and performed spontaneous interactions targeting six emotional states: anger, sadness, happiness, frustration, neutral, and surprise. Each session lasted approximately 3 min, resulting in a total of 102 sessions and 11 h of synchronized multimodal data, including audio, video, and electrocardiogram (ECG) recordings.
The recording environment reflects meticulous planning to balance methodological rigor and ecological validity. Sessions were conducted in controlled settings modeled after daily-life contexts, such as dormitories or living rooms, enhancing the authenticity of interactions. High-definition camcorders and wireless microphones ensured high-quality audiovisual recordings, while wearable ECG devices collected physiological signals, offering unique insights into the interplay between external behavioral cues and internal emotional states. Synchronization across modalities was achieved using a clapboard technique, enabling seamless multimodal analysis. Emotion annotations are particularly comprehensive, employing a multi-perspective approach with peer reports (44 raters), director assessments (1 rater), self-reports (1 rater), and observer evaluations (4 raters), providing a total of 49 unique perspectives. Discrete annotations covered categorical emotions and valence-activation ratings on a 1-to-5 scale, while continuous annotations captured dynamic emotional flows using the FEELtrace tool, rated by four naive observers. This dual focus ensures both categorical clarity and temporal granularity in the emotional data. Post-processing involved meticulous segmentation of 6701 utterances into speech and non-verbal categories like laughter and sobbing, alongside ECG signal de-noising for deriving heart rate variability features. While the breadth of perspectives enriches the dataset, managing such variability requires robust calibration to ensure consistency across raters.
NNIME’s distinctiveness lies in its integration of external behaviors and internal physiological responses, providing a resource for exploring the relationship between expressed and felt emotions. Its emphasis on dyadic interactions and multimodal data collection, including audio, video, and physiological signals, offers valuable opportunities for studying interaction dynamics and the interplay of verbal, non-verbal, and physiological cues. By addressing the scarcity of large-scale Mandarin datasets, NNIME contributes meaningfully to cross-cultural emotion research and supports advancements in affective computing and emotion recognition systems.
6.30. AESDD
The Acted Emotional Speech Dynamic Database (AESDD [
88], 2018) was developed to address the limitations of existing emotional speech databases, taking inspiration from the SAVEE database as a reference. It comprises acted speech utterances, and it is designed to be continuously expanding. To create the initial version of the AESDD, five professional actors, aged between 25 and 30 years, were hired. The group included two male actors and three female actors. All the AESDD database utterances are in Greek.
The recordings took place in the sound studio of the Laboratory of Electronic Media at Aristotle University of Thessaloniki, Greece, which provided an ideal acoustic environment for high-quality recordings. The spoken and recorded phrases were sourced from theatrical scripts. Specifically, 19 utterances were selected from various plays to form the database, chosen for the emotional ambiguity of their context. These 19 sentences were performed by the actors in Greek, across five different emotional states: happiness, sadness, anger, fear, and disgust. Additionally, for each emotion, one extra improvised utterance was recorded, and multiple recordings were made for some sentences, resulting in approximately 500 emotional speech utterances (5 actors × 5 emotions × 20 utterances). A dramatology expert supervised the recordings, offering guidance to the actors and making necessary adjustments to ensure the quality and appropriateness of the acted speech. During preprocessing, all utterances were appropriately normalized to a peak of −3 dB.
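The peak-normalization step mentioned above can be sketched as follows; the use of numpy/soundfile and the file names are assumptions for illustration, not the tooling used for AESDD.

```python
# Minimal sketch: scale a waveform so its absolute peak sits at -3 dBFS.
import numpy as np
import soundfile as sf

def normalize_peak(in_path: str, out_path: str, target_dbfs: float = -3.0) -> None:
    audio, sr = sf.read(in_path)
    peak = np.max(np.abs(audio))
    if peak > 0:
        target_amp = 10.0 ** (target_dbfs / 20.0)   # -3 dBFS is roughly 0.708
        audio = audio * (target_amp / peak)
    sf.write(out_path, audio, sr)

normalize_peak("utterance_raw.wav", "utterance_norm.wav")  # placeholder file names
```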
6.31. ANAD
The development of the Arabic Natural Audio Dataset (ANAD) ([
89], 2018) is motivated by the goal of aiding hearing-impaired and deaf individuals in enhancing their daily communication. By integrating an effective emotion recognition system with a reliable speech-to-text system, the aim is to enable successful phone communication between deaf or hearing-impaired individuals and others. To achieve this, the researchers focused on collecting natural phone call recordings to build the corpus. Eight videos of live calls between an anchor and an individual outside the studio were downloaded from online Arabic talk shows. All the videos are publicly available; accordingly, the authors concluded that no copyright issues are associated with their use. Eighteen human labelers were tasked with listening to the videos and categorizing each one as happy, angry, or surprised, with the label chosen most often across labelers assigned to each video. The videos were then segmented into turns between callers and receivers, with silence, laughter, and noisy segments removed. Each remaining chunk was automatically divided into 1-s speech units, resulting in a final corpus of 1384 records, comprising 505 happy, 137 surprised, and 741 angry units.
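The chunk-to-unit step described above can be approximated with the short sketch below, which assumes mono WAV input and the soundfile package; how the original procedure handled trailing audio shorter than one second is not documented, so only full-length units are kept here.

```python
# Minimal sketch: split a cleaned speech chunk into consecutive 1-second units.
import soundfile as sf

def split_into_units(in_path: str, unit_s: float = 1.0):
    audio, sr = sf.read(in_path)
    unit_len = int(unit_s * sr)
    units = [audio[i:i + unit_len]
             for i in range(0, len(audio) - unit_len + 1, unit_len)]
    return units, sr

units, sr = split_into_units("call_chunk.wav")          # placeholder file name
for k, unit in enumerate(units):
    sf.write(f"call_chunk_unit{k:03d}.wav", unit, sr)
```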
6.32. CaFE
The Canadian French Emotional (CaFE) speech dataset is the first emotional speech dataset in Canadian French ([
90], 2018). The dataset comprises six sentences, pronounced by six male and six female actors, in six basic emotions (sadness, happiness, anger, fear, disgust, and surprise) and a neutral state, with two different intensities. Actors recorded their lines individually in a professional soundproof room. A Blue Microphones Yeti Pro USB microphone, mounted on a tripod with a pop filter, was used for recording. The microphone was set to cardioid mode and connected to a remote Acer Swift 3 laptop via USB. Recording was performed at 192 kHz/24-bit. Actors were positioned freely in front of the microphone. A recording session typically lasted one hour per actor, with three to five takes recorded for each sentence, emotion, and intensity level. Particular attention was paid by the authors to the choice of the sentences: they were selected to be emotionally neutral from a semantic point of view yet well suited to being uttered with various emotions, reasonably easy to pronounce, and composed of the same number of syllables (eight). Statistics on the phonemic distribution of the chosen sentences are given in the paper, showing that they approach the Wioland distribution of French phonemes (measured on a large corpus combining spoken (radio broadcast) and written (literary texts) French).
6.33. CMU-MOSEI
The CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset is a detailed and diverse resource for multimodal sentiment and emotion analysis ([
91], 2018). It consists of 23,453 annotated video segments extracted from 3228 online videos, primarily sourced from YouTube, and features 1000 distinct speakers. The dataset achieves a balanced demographic representation, with 57% male and 43% female speakers, spanning a wide range of ages, accents, and speaking styles. These segments are derived from 250 different topics, offering a mix of conversational styles and naturalistic monologues in English. The corpus spans six emotional categories—happiness, sadness, anger, disgust, fear, and surprise—while also including sentiment annotations on a continuous scale from −3 (highly negative) to 3 (highly positive), enabling fine-grained sentiment analysis.
The design leverages naturalistic video monologues sourced from online platforms, capturing authentic emotional expressions across text, audio, and visual modalities. Its speaker diversity and multimodal alignment provide a strong foundation for generalizable research, though variability in recording quality due to the reliance on public content presents a minor challenge. Each video segment is temporally aligned across modalities, ensuring coherence, with an average duration of approximately 7.28 s. The dataset also includes over 56,000 aligned modality features, offering extensive opportunities for machine learning applications. The annotation process involved over 6000 human raters, who assigned sentiment polarity and emotional intensity scores to each segment. By including both categorical emotion labels and sentiment polarity ratings on a continuous scale, the dataset captures a nuanced representation of affective states. Inter-rater agreement was validated, with efforts to ensure consistency despite the inherent subjectivity of rating emotions. While the specific methodology for training annotators or resolving disagreements is not extensively detailed, the large-scale annotation effort demonstrates a robust approach. This dual focus on categorical and intensity measures adds depth to the dataset, supporting fine-grained analyses of sentiment and emotion dynamics.
This corpus represents a valuable contribution to multimodal affective computing. Its large scale, integration of text, audio, and visual modalities, and focus on naturalistic emotional expressions address important gaps in existing datasets. By providing a diverse speaker pool and well-aligned multimodal features, the corpus supports research in emotion recognition, sentiment analysis, and multimodal machine learning. Its considered design and detailed annotations make it a valuable resource for studying the intricacies of emotional and sentiment interactions in human communication.
6.34. RAVDESS
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS [
11], 2018) is a multimodal dataset designed to support research in emotion perception and recognition in a neutral North American accent. It includes 7356 recordings from 24 professional actors (12 male, 12 female; age range = 21–33 years; M = 26.0 years; SD = 3.75). The actors self-identified as Caucasian (N = 20), East-Asian (N = 2), and Mixed (N = 2), with one identifying as East-Asian/Caucasian and another as Black-Canadian/First Nations/Caucasian. The dataset comprises 4320 speech recordings and 3036 song recordings. Each actor produced 104 distinct vocalizations, comprising 60 spoken utterances and 44 song utterances, covering three modality conditions: audio-visual (face and voice), video-only (face, no voice), and audio-only (voice, no face). The dataset spans eight emotional categories (neutral, calm, happy, sad, angry, fearful, surprise, and disgust) for speech and six emotional categories (neutral, calm, happy, sad, angry, and fearful) for song, with emotions expressed at two intensity levels (normal and strong). Actors repeated each vocalization twice, ensuring robustness and diversity within the dataset. The corpus uniquely explores emotional expressions in both speech and song, offering a balanced design that supports versatile applications in speech emotion recognition and multimodal analyses.
The recording sessions were conducted in a controlled studio environment, ensuring consistency and high-quality output. Professional-grade equipment, including synchronized audio and video recording setups, captured speech, facial expressions, and body language with minimal interference. Actors followed a structured protocol, performing two speech sentences and two sung phrases for each emotional state and intensity, resulting in a balanced and comprehensive dataset. Stimuli were presented at a high resolution and processed in sound-attenuated booths using Sennheiser HD 518 headphones to ensure clarity during validation. The distinction between intense and normal vocal expressions enhances the dataset’s versatility, with intense expressions aiding in emotional clarity and normal expressions offering representations closer to real-life emotional nuances. The annotation process involved 319 undergraduate students (76% female, 24% male, mean age = 20.55, SD = 4.65) from Ryerson University, who evaluated each stimulus for emotional category, intensity, and genuineness. Each stimulus received 10 ratings across these three evaluation scales, resulting in a total of 220,680 annotations. Raters classified emotions using a forced-choice response format with options like neutral, calm, happy, sad, and others, alongside a “none of these are correct” escape option, ensuring flexibility in judgments. Plutchik’s wheel of emotion was incorporated to provide structure and improve clarity in categorizing emotional states. To ensure reliability, a test-retest task with 72 additional raters validated the consistency of the ratings, demonstrating high inter-rater agreement for most emotional categories and further solidifying the quality of the dataset. The decision to involve untrained participants in the annotation process proved effective, as evidenced by the high accuracy rates achieved during validation, with 80% accuracy for audio-video, 75% for video-only, and 60% for audio-only conditions.
The RAVDESS corpus is regarded as a benchmark in the field due to its multimodal design, methodological rigor, and inclusion of both speech and song. Its balanced approach to emotion categorization and intensity scaling, coupled with its large-scale validation process, ensures a high level of reliability and utility. This corpus has significantly contributed to the fields of affective computing, speech emotion recognition, and multimodal emotion research, serving as a foundation for modeling complex emotional dynamics in human communication. It is important to note that the RAVDESS paper is comprehensive and meticulously detailed, providing extensive insights into the dataset’s development and validation processes. While this summary focuses on the key aspects of the corpus, some of the nuanced methodology and data intricacies may be oversimplified or omitted for brevity, underscoring the depth and scope of the original work.
6.35. SHEMO
The Sharif Emotional Speech Database (ShEMO [
92], 2018) is a large-scale and validated resource for Persian. The database contains 3000 semi-natural utterances, amounting to 3 h and 25 min of speech data sourced from online radio plays. ShEMO features speech samples from 87 native Persian speakers, encompassing five basic emotions—anger, fear, happiness, sadness, and surprise—along with a neutral state. Fifty radio plays from various genres, including comedy, romance, crime, thriller, and drama, were selected as potential sources of emotional speech. The differences in audio streams were balanced using Audacity. Since the majority of streams (approximately 90%) had a sampling frequency of 44.1 kHz, the streams with lower sampling rates were upsampled using the cubic interpolation technique. Furthermore, all stereo-recorded streams were converted to mono. Each stream was segmented into smaller parts such that each segment would cover the speech sample of only one speaker without any background noise or effect. Twelve annotators (6 males, 6 females) labeled the emotional states of the utterances, with final labels determined through majority voting. The annotators were all native speakers of Persian with no hearing impairment or psychological problems. The mean age of the annotators was 24.25 years (SD = 5.25 years), ranging from 17 to 33 years. The inter-annotator agreement, measured by Cohen’s kappa statistic, is 0.64, indicating “substantial agreement”.
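The resampling and down-mixing described above can be approximated as in the sketch below, which uses scipy's cubic interpolation and a simple channel average; the original work performed these steps in Audacity, so the code is only an assumed equivalent with placeholder file names.

```python
# Minimal sketch: cubic-interpolation upsampling to 44.1 kHz and stereo-to-mono down-mix.
import numpy as np
import soundfile as sf
from scipy.interpolate import interp1d

def to_mono_44k(in_path: str, out_path: str, target_sr: int = 44100) -> None:
    audio, sr = sf.read(in_path)
    if audio.ndim == 2:                       # stereo -> mono by averaging channels
        audio = audio.mean(axis=1)
    if sr != target_sr:                       # upsample via cubic interpolation
        t_old = np.arange(len(audio)) / sr
        t_new = np.arange(int(len(audio) * target_sr / sr)) / target_sr
        audio = interp1d(t_old, audio, kind="cubic",
                         bounds_error=False, fill_value=0.0)(t_new)
    sf.write(out_path, audio, target_sr)

to_mono_44k("radio_play_segment.wav", "radio_play_segment_44k.wav")
```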
6.36. BAVED
The Basic Arabic Vocal Emotions Database (BAVED) ([
93], 2019) is a collection of recorded Arabic words spoken with various expressed emotions. It includes 1935 audio files recorded from 61 speakers (45 males and 16 females, aged 18 to 23 years). BAVED includes seven words, each recorded at three levels of expressed emotion, where Level 0 is a low level of emotion (e.g., tired), Level 1 is the neutral level (i.e., how the speaker normally speaks during the day), and Level 2 is a high level of either positive or negative emotion (happiness, joy, sadness, anger, etc.). All audio files were sampled at 16 kHz with one audio channel and a bitrate of 256 kbps.
6.37. MELD
The Multimodal EmotionLines Dataset (MELD) is a comprehensive multimodal emotional conversational corpus designed to address the complexities of emotion recognition in multi-party conversations ([
94], 2019). Derived from the popular English-language sitcom Friends, the dataset includes 13,000 utterances from 1433 dialogues, significantly expanding its predecessor, EmotionLines, by integrating audio and visual modalities alongside text. The dataset spans seven emotion categories: anger, disgust, fear, joy, sadness, surprise, and neutral. Each utterance is annotated with its emotional label and sentiment, further enhanced by multimodal cues such as facial expressions, vocal intonations, and textual content. This multimodal approach enhances the understanding of nuanced emotions, with visual and auditory cues, such as vocal changes and facial expressions, aiding in accurately capturing emotions like surprise. The use of Friends provides a structured framework for studying emotions in conversational settings, offering a diverse range of emotional expressions practical for training emotion recognition models. While the sitcom’s nature may amplify some portrayals, the dataset remains contextually rich. Multi-party interactions, overlapping speech, and emotional shifts reflect conversational complexities, complemented by high-quality multimodal cues, making it a valuable resource for multimodal emotion recognition research. Annotations were performed by three annotators per utterance, achieving a Fleiss’ Kappa score of 0.43 for emotion labels, reflecting moderate agreement. This process marked a methodological shift from its predecessor, EmotionLines, which utilized annotations via Amazon Mechanical Turk with five workers per utterance and a majority voting scheme, achieving a lower Fleiss’ Kappa score of 0.34. To address the challenges encountered in EmotionLines, MELD introduced an improved annotation strategy, incorporating multimodal cues (audio and visual) alongside textual data to improve context and accuracy. Sentiment annotations in MELD achieved a higher Fleiss’ Kappa of 0.91, indicating robust sentiment labeling. Disagreements in emotional labels were resolved by discarding ambiguous annotations, resulting in a more refined and reliable dataset. By capturing the dynamics between speakers and using the multimodal context, MELD’s annotation process better aligns with the complexities of conversational emotion recognition.
MELD’s contribution lies in its focus on multiparty conversational data, addressing an area that has been relatively underexplored in emotion recognition research. By offering a large-scale dataset with multimodal annotations in English and a subset translated into Chinese, it complements existing resources like IEMOCAP and SEMAINE, which primarily focus on dyadic interactions. MELD’s emphasis on conversational context and inclusion of multi-speaker dynamics make it a valuable resource for advancing research in affective computing, dialogue systems, and multimodal emotion recognition. While not without limitations, its unique design and scalability establish it as an important tool in the study of emotional communication.
6.38. MSP-PODCAST
The MSP-PODCAST corpus ([
95], 2019) is a large-scale, English-language, naturalistic emotional speech database designed to address the limitations of existing resources by leveraging publicly available podcast recordings. The dataset begins with over 84,000 speaking turns extracted from 403 podcasts, covering diverse topics, speakers, and conversational styles. These segments were processed using advanced machine learning models for emotional content retrieval, followed by manual validation to ensure high-quality speech samples. The final corpus contains samples ranging from 2.75 to 11 s, annotated for both categorical emotions (happiness, sadness, anger, disgust, surprise, neutral) and dimensional attributes (valence, arousal, dominance). The podcast-based approach introduces authentic emotional expressions in dynamic conversational contexts, making it significantly more naturalistic than acted datasets. The recording setup capitalized on the inherent diversity of podcasts, with their professional and semi-professional recording conditions ensuring clear and faithful audio. This naturalistic setting captures spontaneous emotional expressions in varied real-world scenarios, enhancing the ecological validity of the corpus. The annotation process employed crowd-sourcing via Amazon Mechanical Turk, where each sample was evaluated by at least five annotators. Real-time tracking of inter-evaluator agreement and quality control measures ensured reliable labeling. This robust annotation framework resulted in a balanced corpus that effectively represents the valence-arousal space, supporting both categorical and dimensional analyses of emotion.
The MSP-PODCAST corpus stands out for its innovative, scalable methodology, combining automated and manual techniques to create a large, emotionally balanced dataset. Its ability to capture naturalistic emotional speech across diverse contexts fills a key gap in emotional speech research. As a benchmark for emotion recognition, it has significantly influenced affective computing and human–computer interaction, advancing machine learning, conversational agents, and multimodal emotion analysis.
6.39. DEMoS
The Database of Elicited Mood in Speech corpus (DEMoS [
96], 2020) is an Italian emotional speech database created to address the lack of emotional speech resources in Italian and to advance research in speech emotion recognition. It comprises 9697 samples, including 9365 emotional and 332 neutral samples, collected from 68 speakers (23 females and 45 males), predominantly engineering students with a mean age of approximately 23.7 years. The corpus captures seven emotional states: anger, sadness, happiness, fear, surprise, disgust, and guilt, providing a rich variety of affective expressions. Notably, guilt, a rarely represented emotion in similar datasets, adds a unique dimension, enhancing its relevance for real-world applications. While the reliance on isolated utterances ensures clarity, it may limit the contextualization of emotions compared to interactive speech corpora.
The recording environment was designed to elicit genuine emotional speech in a controlled setting. Sessions took place in a semi-dark and quiet room, minimizing distractions and observer effects. High-quality recording equipment, including professional-grade microphones, ensured clear and faithful audio capture. Participants interacted with a computer interface that facilitated mood induction procedures (MIPs), such as music, autobiographical recall, film scenes, and empathy-based scripts. These methods encouraged dynamic shifts in valence and arousal, enabling the capture of authentic emotional expressions. Speech samples were manually segmented to prioritize syntactic and prosodic naturalness, with a mean sample duration of approximately 2.9 s. The annotation process combined self-assessments by participants with external evaluations conducted by three experts in affective computing, ensuring a focus on “prototypical” samples. Ambiguous cases, such as those from participants failing an emotional awareness test (alexithymia) or with acting experience, were excluded to preserve authenticity. While the dual evaluation approach strengthened the dataset’s quality, limited details provided on inter-annotator agreement and labeling criteria make it challenging to fully assess the consistency of the annotations.
The DEMoS corpus stands out for its use of MIPs, a well-controlled recording setup, and the inclusion of guilt as an emotion, addressing a gap in Italian emotional speech research. By focusing on prototypical samples, it provides a valuable resource for studying and modeling emotions in speech, contributing significantly to affective computing and emotion recognition in the Italian context.
6.40. MEAD
The Multi-view Emotional Audio-Visual Dataset (MEAD) ([
97], 2020) is a talking-face video corpus featuring 60 actors and actresses expressing eight different emotions at three varying intensity levels. The primary goal of its development is to enable the synthesis of natural emotional reactions in realistic talking-face video generation. The study employs eight emotion categories (angry, disgust, contempt, fear, happy, sad, surprise, and neutral), and three levels of emotion intensity, chosen for their intuitive alignment with human perception. The first level, defined as weak, represents subtle but noticeable facial movements. The second level, medium, corresponds to the typical expression of the emotion, reflecting its normal state. The third level, strong, is characterized by the most exaggerated expressions of the emotion, involving intense movements in the associated facial areas. The phonetically diverse TIMIT speech corpus served as the basis for defining the audio speech content. Sentences were carefully selected to cover all phonemes across each emotion category. The sentences within each emotion category were divided into three parts: 3 common sentences, 7 emotion-specific sentences, and 20 generic sentences. Fluent English speakers aged 20 to 35 with prior acting experience were recruited for the study. To assess their acting skills, candidates were asked to imitate video samples of each emotion performed at different intensity levels by a professional actor. The guidance team evaluated the candidates’ performances based on their ability to replicate the expressions in the videos, ensuring that the main features of the emotions were conveyed accurately and naturally.
Before the recording process, training sessions were provided to help speakers achieve the desired emotional states. Subsequently, an emotion arousal session was conducted to elevate their emotional state, enabling them to deliver extreme expressions required for level 3 intensity. Most speakers were recorded in the order of weak, strong, and medium intensities, as mastering medium intensity became easier when the speaker had experienced both extremes of the emotion. The quality of the dataset was evaluated focusing on two main objectives: (i) determining whether the emotions performed by actors can be accurately recognized and (ii) assessing whether the three levels of emotion intensity can be correctly distinguished. For this experiment, 100 volunteers, aged 18 to 40, were recruited from universities. Data from six actors in the MEAD dataset, including four males and two females, were randomly selected and two types of experiments were conducted: emotion classification and intensity classification. In the user study on emotion category discrimination, the average accuracy was 85%. The authors also report results on emotion intensity discrimination for the captured snippets. However, it should be taken into account that the evaluators used both audio and video to perform their classification.
6.41. SUBESCO
The SUST Bangla Emotional Speech Corpus (SUBESCO) is currently the largest available emotional speech database for the Bangla language ([
98], 2021). It comprises voice data from 20 professional speakers, evenly divided into 10 males and 10 females, aged 21 to 35 years (mean = 28.05, SD = 4.85). Audio recording was conducted in two phases, with 5 males and 5 females participating in each phase. Gender balance in the corpus was maintained by ensuring an equal number of male and female speakers and raters. The dataset includes recordings of seven emotional states (anger, disgust, fear, happiness, sadness, surprise and neutral) for 10 sentences, with five trials preserved for each emotional expression. Consequently, the total number of utterances is calculated as 10 sentences × 5 repetitions × 7 emotions × 20 speakers = 7000 recordings. Each sentence has a fixed duration of 4 s, with only silences removed while retaining complete words. The total duration of the database amounts to 7000 recordings × 4 s = 28,000 s = 466 min 40 s = 7 h 46 min 40 s. Standard Bangla was chosen as the basis for preparing the text data used to develop the emotional corpus.
Initially, a list of twenty grammatically correct sentences was created, ensuring they could be expressed in all target emotions. Subsequently, three linguistic experts selected 10 sentences from this list for the database preparation. The final text dataset includes 7 vowels, 23 consonants, and 3 fricatives, covering all five major groups of consonant articulation. It also incorporates 6 diphthongs and 1 nasalization from the Bangla IPA. The audio recordings were conducted in an anechoic sound studio. Inside the recording room, the speaker was seated and given a dialogue script containing all the sentences arranged in sequential order. A condenser microphone (Wharfedale Pro DM5.0s) mounted on a suitable microphone stand (Hercules) was provided for the speaker. The average distance between the microphone and the speaker’s mouth was 6 cm. Speakers were professional artists and, accordingly, were well acquainted with the Stanislavski method for self-inducing desired emotions. They were instructed to convey the emotions in a manner that made the recordings sound as natural as possible. The speakers were also given unlimited time to prepare themselves to accurately express the intended emotions. Human subjects (25 males and 25 females) were engaged to label the utterances. Each rater assessed a set of recordings during Phase 1 and, after a one-week break, re-evaluated the same set of recordings in Phase 2. In the first phase, the raters assessed all seven emotions. However, in the second phase, the emotion “Disgust” was excluded, while the remaining six emotions were evaluated. This exclusion aimed to investigate whether “Disgust” causes confusion with other, similar emotions. All the raters were students from various disciplines and schools at Shahjalal University. They were all physically and mentally healthy and aged over 18 years at the time of the evaluation. None of the raters had participated in the recording sessions. As native Bangla speakers, they were proficient in reading, writing, and understanding the language. To prevent any bias in their perception, the raters were not given prior training on the recordings. Each audio set was assigned to two raters, one male and one female, to ensure that every audio clip was rated twice, with input from both genders in each phase.
Kappa statistics and intra-class correlation (ICC) were used to assess the reliability of the rating exercises. A two-way ANOVA was also conducted following the evaluation task to examine the variability and interaction of the main factors: gender and emotion. The inter-rater reliability for Phase 1 yielded a mean Kappa value of 0.58, indicating moderate agreement between raters. In Phase 2, the mean Kappa value increased to 0.69, reflecting substantial agreement. The intra-class correlation scores were ICC = 0.75 for single measurements and ICC = 0.99 for average measurements, indicating high reliability. The ANOVA results revealed that emotion had a significant main effect on recognition rates. Additionally, the Kruskal–Wallis test showed a statistically significant difference in average emotion recognition rates across different emotions. From the two-way ANOVA analysis, it was determined that the rater’s gender had a significant main effect on emotion recognition rates. However, there was no evidence to suggest that the gender of the speakers influenced the recognizability of emotions, as no specific gender consistently expressed more recognizable emotions than the other.
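Because each audio set was judged by exactly two raters, per-set agreement of the kind aggregated above can be computed with Cohen's kappa; the sketch below uses scikit-learn with hypothetical label sequences and does not reproduce the study's exact averaging into the reported phase-level values.

```python
# Minimal sketch: Cohen's kappa for one audio set judged by two raters (hypothetical labels).
from sklearn.metrics import cohen_kappa_score

rater_a = ["anger", "fear", "happiness", "neutral", "sadness", "surprise"]
rater_b = ["anger", "fear", "happiness", "sadness", "sadness", "surprise"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa for this audio set: {kappa:.2f}")
```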
6.42. EMOVIE
EMOVIE ([
99], 2021) is described by its authors as the first publicly available Mandarin emotional speech dataset. It was collected from seven movies, belonging to the feature and comedy categories, containing natural and expressive speech. The raw audio tracks were extracted from the movie files using the ffmpeg tool. A total of 9724 samples, amounting to 4.18 h of audio, was collected. Human annotation of the emotion polarity of the speech samples was performed on a scale from −1 (negative) to 1 (positive) with a step of 0.5, where 0 denotes neutral. Samples with a polarity of ‘−0.5’ and ‘0.5’ account for 79% of the total (4573 and 3171 samples, respectively), followed by ‘0’ (1783 samples), ‘−1’ (179 samples), and ‘1’ (78 samples).
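The ffmpeg-based extraction step can be illustrated with the hedged sketch below: a Python wrapper around a standard ffmpeg invocation that drops the video stream and writes a mono 16-bit PCM WAV file. The chosen output format and sampling rate are illustrative assumptions, not the settings documented for EMOVIE.

```python
# Minimal sketch: extract the audio track of a movie file via ffmpeg.
import subprocess

def extract_audio(video_path: str, wav_path: str, sr: int = 16000) -> None:
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",                    # drop the video stream
         "-acodec", "pcm_s16le",   # 16-bit PCM WAV
         "-ar", str(sr),           # assumed target sampling rate
         "-ac", "1",               # mono
         wav_path],
        check=True,
    )

extract_audio("movie_clip.mp4", "movie_clip.wav")   # placeholder file names
```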
6.43. MCAESD
The Mandarin Chinese auditory emotions stimulus database (MCAESD [
100], 2022) is an emotional auditory stimuli database composed of Chinese pseudo-sentences recorded by six professional actors (3 males and 3 females, mean age 30.5 years) in Mandarin Chinese. The stimulus set was composed of pseudo-sentences (each having 12 syllables (12 characters), including three keywords of two syllables each) constructed from high-frequency Chinese disyllabic words, based on the Chinese newspaper People’s Daily database. During the recording session, the actors were asked to read the target sentences while acting six emotions (happiness, sadness, anger, fear, disgust, and pleasant surprise) plus the neutral state. Moreover, they were asked to vocalize two intensity levels for each emotion (except for neutral): normal and strong. Normal intensity refers to the general emotional intensity of daily-life communication, while strong intensity refers to a much more vivid and profound expression than the normal one. In addition, all emotional categories were vocalized in two types of sentence pattern: declarative and interrogative. Stimuli were recorded in a professional recording studio using a SONY UTX-B03 wireless microphone and digitized at a 44.1 kHz sampling rate with 64-bit resolution on two channels. Finally, each actor read 40 different pseudo-sentences, and each sentence was spoken 26 times. After a selection process, 6059 high-quality recordings were retained, for a total of 4361 pseudo-sentence stimuli included in the database. Each recording was validated through an online platform with 40 native Chinese listeners (240 participants in total) in terms of the recognition accuracy of the intended emotion portrayal.
6.44. PEMO
The Punjabi Emotional Speech Database (PEMO [
101], 2022) is an emotional speech dataset for Punjabi, a language traditionally spoken in the Punjab State of India. PEMO includes 12 h and 35 min of speech recorded from 60 native Punjabi speakers (aged 20 to 45 years), for a total of 22,000 utterances covering four emotions: anger, happiness, sadness, and neutral. The utterances were derived from Punjabi movies and are sampled at 44.1 kHz in a mono audio channel using the PRAAT software (6.4.45). Three annotators with a thorough knowledge of the Punjabi language categorized the emotional content of each utterance. The label most commonly assigned by the annotators was selected as the final label for each utterance, whereas utterances for which no common label was reached were removed from the database. The annotation process achieved an average emotion recognition rate of about 95%.
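The agreement-based labeling rule described above is simple to implement; the sketch below is a rough approximation, assuming exactly three annotator labels per utterance, and is not the authors’ actual code.

```python
# Sketch of an agreement-based labeling step: an utterance is kept only if at
# least two of its three annotators assigned the same emotion label.
from collections import Counter

def agreed_label(labels):
    """Return the most common label if at least two annotators agree, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None

# Hypothetical annotations: utterance id -> labels from three annotators.
annotations = {
    "utt_001": ["anger", "anger", "neutral"],
    "utt_002": ["happiness", "sadness", "neutral"],  # no agreement -> removed
    "utt_003": ["sadness", "sadness", "sadness"],
}

final_labels = {utt: lab for utt, labs in annotations.items()
                if (lab := agreed_label(labs)) is not None}
print(final_labels)  # {'utt_001': 'anger', 'utt_003': 'sadness'}
```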
6.45. BanglaSER
BanglaSER ([
102], 2022) is a Bangla speech-audio dataset for SER. BanglaSER contains a total of 1467 audio recordings collected from 34 nonprofessional actors (17 male, 17 female) portraying five emotional states, i.e., angry, happy, neutral, sad, and surprise. Three trials were conducted for each emotional state. The categorical emotional states were evaluated by 15 human validators, and the recognition rate of the intended emotion was approximately 80.5%. The actors were asked to pronounce three lexically matched Bangla statements in a Bengali accent, whose meanings are “It’s twelve o’clock”, “I knew something like this would happen”, and “What kind of gift is this?”. The speech audio data were recorded using a smartphone’s default recording application, a laptop, and a microphone. Recordings last between 3 and 4 s, and surrounding noise was removed using the Audacity software (3.7.5).
6.46. MNITJ-SEHSD
The Malaviya National Institute of Technology Jaipur Simulated Emotion Hindi Speech Database (MNITJ-SEHSD) is an emotional speech database in Hindi ([
103], 2023). The database is designed to simulate five different emotions (happy, angry, neutral, sad, and fear) using neutral text prompts, allowing the speakers to enact the emotions without bias. The audio was recorded from 10 speakers (5 males and 5 females, aged 21 to 27 years) using an omnidirectional microphone at a sampling frequency of 44.1 kHz and was later downsampled to 16 kHz. A total of 100 utterances were recorded for each emotion. Each sentence contained six words, except for two sentences with five words and one sentence with seven words.
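Downsampling from 44.1 kHz to 16 kHz is a routine preprocessing step; a minimal sketch using librosa is shown below, with file names chosen purely for illustration and no claim that this matches the authors’ pipeline.

```python
# Minimal sketch: load a 44.1 kHz recording and downsample it to 16 kHz.
# File names are placeholders; librosa is used here only as an example tool.
import librosa
import soundfile as sf

# librosa.load resamples on the fly when a target sampling rate is given.
audio, sr = librosa.load("utterance_44k.wav", sr=16000)
sf.write("utterance_16k.wav", audio, sr)
print(f"Saved {len(audio) / sr:.2f} s of audio at {sr} Hz")
```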
A subjective evaluation was carried out by three experts to estimate emotion recognition performance. Forty utterances from each class were randomly selected, for a total of 200 emotional utterances. Each evaluator assigned one of the emotion categories to each sample three times. The average emotion recognition rate achieved was 71%.
6.47. IESC
The Indian Emotional Speech Corpora (IESC [
104], 2023) is an emotional speech database in the English language spoken by eight North Indian speakers (five males and three females). The database contains 600 emotional audio files recorded in five emotions, i.e., neutral, happy, anger, sad, and fear. All the audio files were recorded using a speech recorder app on a mobile phone in a closed room to avoid external noise. Headphones with a microphone were also used to prevent sound leakage and to provide noise cancellation during the recording.
6.48. ASED
The Amharic Speech Emotion Dataset (ASED [
105], 2023) is the first SER dataset for the Amharic language, covering four dialects, namely, Gojjam, Wollo, Shewa, and Gonder. In total, 65 participants (25 female, 40 male), aged 20 to 40 years, recorded speech audio files covering five emotions, i.e., neutral, fear, happy, sad, and angry. For each of the five emotions, five sentences expressing that emotion were composed in Amharic. The recording was performed in a quiet room to obtain speech signals with minimal noise. Since professional audio equipment was not available, six Huawei Nova 4 mobile phones were used to record the audio files. An Android-based speech recording app was installed and configured to capture the utterances at a 16 kHz sampling rate with 16-bit resolution. The recording software displayed the text for participants one sentence at a time and indicated the required emotion. Every recording was independently reviewed by eight judges, and a recording was accepted for inclusion in the ASED dataset only if five or more judges agreed. The final dataset consists of 2474 recordings, each between 2 and 4 s in length: 522 neutral, 510 fear, 486 happy, 470 sad, and 486 angry.
6.49. EmoMatchSpanishDB
The EmoMatchSpanishDB ([
106], 2023) is the first Spanish-language database of elicited emotional voices acted out by 50 non-actors (30 males and 20 females, mean age 33.9 years). The EmoMatchSpanishDB is a subset of the full original dataset, EmoSpanishDB, which includes all recorded audios that received a consensus label after a crowd-sourcing validation process. The EmoMatchSpanishDB, instead, includes only the audio data whose consensus emotion also matches the originally elicited emotion.
The 23 phonemes of the Spanish spoken in the central area of Spain were used to create a total of 12 sentences that replicate regular conversation. None of these sentences carries an emotional semantic connotation, so as to avoid any emotional influence on the speakers. The 12 selected sentences were acted (elicited) by the 50 speakers seven times, once for each of the considered emotions, i.e., anger, disgust, fear, happiness, sadness, surprise, and neutral, yielding a total of 4200 raw emotional audio samples. Audio data were recorded in a noise-free professional radio studio in PCM format, with a sampling rate of 48 kHz and a bit depth of 16 bits (no audio compression). A dynamic mono-channel cardioid microphone (Sennheiser MD421) and the AudioPlus (AEQ) software (3.0) were used to record the audio signals.
A perception test was conducted via crowd-sourcing to manually label all the recorded audio samples with an emotion; this process involved 194 native Spanish speakers. The 3550 audios for which the evaluators reached a consensus label compose EmoSpanishDB, whereas EmoMatchSpanishDB includes the 2020 audios whose consensus label also matches the originally elicited emotion.
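To make the relationship between the two subsets concrete, the sketch below mimics the selection logic, assuming that a consensus label is already available for each audio; the records and field layout are illustrative placeholders rather than the authors’ data.

```python
# Sketch of the two-stage selection: audios with a consensus crowd label form
# an EmoSpanishDB-style set, and those whose consensus label equals the
# elicited emotion form an EmoMatchSpanishDB-style subset (illustrative data).
records = [
    # (audio id, elicited emotion, consensus label from the perception test)
    ("a01", "anger",   "anger"),
    ("a02", "fear",    "surprise"),
    ("a03", "sadness", "sadness"),
]

# All audios that obtained a consensus label.
emo_spanish = {audio_id: label for audio_id, _, label in records}

# Subset whose consensus label matches the originally elicited emotion.
emo_match = {audio_id: label for audio_id, elicited, label in records
             if label == elicited}

print(len(emo_spanish), len(emo_match))  # 3 2
```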
6.50. nEMO
The nEMO dataset ([
107], 2024) is a novel corpus of emotional speech in Polish. nEMO adopted an acted approach: each of the nine actors involved (four female and five male, aged between 20 and 30 years) was required to record the same set of utterances for six emotional states: anger, fear, happiness, sadness, surprise, and the neutral state. A total of 90 sentences were created, each containing at least one uncommon phoneme of the Polish language while remaining suitable for everyday conversation. During the recording sessions, actors were instructed to depict a single emotional state at a time, and feedback was provided constantly to support and guide all participants. Each recording session was conducted in a home setting, to better reflect a natural environment, involved one actor, and lasted approximately two hours. The utterances were captured using a cardioid condenser microphone at a 192 kHz sampling rate, equipped with a foam windscreen and pop filter to reduce background noise and plosive bursts. The recorded emotional speech underwent human evaluation, and only recordings that accurately captured the intended emotional state were included. The resulting dataset contains a total of 4481 audio recordings, corresponding to more than three hours of speech.
6.51. CAVES
The Cantonese Audio-Visual Emotional Speech (CAVES [
108], 2024) dataset consists of auditory and visual recordings of ten native speakers of Cantonese. Cantonese is a tonal language (with two more phonetic tones than Mandarin) primarily spoken in Southern China, particularly in the provinces of Guangdong and Guangxi. The CAVES dataset covers the six basic emotions, i.e., anger, disgust, fear, happiness, sadness, and surprise, plus a neutral expression that serves as a baseline. A set of semantically neutral carrier sentences was selected so that the six emotions could be expressed on each sentence without semantic interference. Fifty sentences were selected from the Cantonese Hearing In Noise Test (CHINT) sentence list, chosen to provide good coverage of the different lexical tones in both initial and final sentence positions. Ten native speakers of Cantonese (five females and five males) participated in the recording, which was conducted in a sound-attenuated booth. A video monitor was used to present the stimulus sentences, while a video camera and a microphone were used to capture the participants’ faces and utterances, respectively. Fifteen native Cantonese perceivers completed a forced-choice emotion identification task to validate the recorded emotional expressions.
6.52. Emozionalmente
Emozionalmente ([
109], 2025) is an Italian acted corpus of emotional speech designed to ensure comparability with the Italian Emovo database. Similarly to Emovo, Emozionalmente recruited actors to simulate emotions while speaking scripted sentences, and it includes six emotions (anger, disgust, fear, joy, sadness, and surprise) plus the neutral state.
Eighteen sentences were constructed ad hoc to be semantically neutral and easily readable with different emotional tones; they contain everyday vocabulary and cover all Italian phonemes in various positions and contexts. A custom crowd-sourcing web app was developed to collect audio samples from voluntary participants and to let them evaluate recordings submitted by others, e.g., by labeling the emotion and indicating whether the audio was clear or noisy. Recordings were therefore captured using participants’ device microphones, which introduced natural variability in audio quality. The collected corpus included 11,404 audio samples, reduced to 6902 samples after a data cleaning process that removed noisy samples and samples with inconsistent emotion labels. These samples were recorded by a total of 431 Italian actors (131 males, 299 females, and 1 listed as “other”), with an average age of 31.28 years.
A subjective evaluation was conducted to assess the effectiveness of Emozionalmente in conveying emotions. It involved 829 individuals, who provided a total of 34,510 evaluations, 5 per audio sample. A recognition accuracy of 66% was achieved, which demonstrates the utility and representativeness of the Emozionalmente database.