Search Results (94)

Search Parameters:
Keywords = emotional speech databases

34 pages, 3911 KB  
Article
PAD-Guided Multimodal Hybrid Contrastive Emotion Recognition upon STEM-E2VA Dataset
by Shufei Duan, Wenjie Zhang, Liangqi Li, Ting Zhu, Fangyu Zhao, Fujiang Li and Huizhi Liang
Multimodal Technol. Interact. 2026, 10(4), 38; https://doi.org/10.3390/mti10040038 - 2 Apr 2026
Viewed by 629
Abstract
Speech emotion recognition still faces several challenges: the representation capability of single-modal information is limited, discrete emotion annotations struggle to capture continuous emotional transitions, and multimodal fusion methods still suffer from modal structural differences and cross-sample alignment issues. To address these, this study undertakes work from both data and model perspectives. On the data side, a Chinese multimodal database, STEM-E2VA, was constructed, synchronously collecting four modalities of data: articulatory kinematics, acoustics, glottal signals, and videos. It covers seven discrete emotion categories and employs PAD continuous annotation. By integrating discrete and continuous dimensional annotations, it better represents the distinction between strong and weak emotions under the same discrete emotion label. Concurrently, to handle biases in PAD annotations, we employed the SCL-90 psychological questionnaire to analyze annotators’ cognitive and emotional perceptions, thereby ensuring data reliability. On the model side, this paper proposes a multimodal supervised contrastive fusion network incorporating PAD perception. It employs a PAD-enhanced hybrid contrastive loss function to optimize intra-modal and inter-modal feature alignment. Utilizing a cross-attention mechanism combined with a GRU–Transformer network for temporal feature extraction, it achieves deep fusion of multimodal information, reducing inter-modal discrepancies and cross-class confusion. Experiments demonstrate that the proposed method achieves 85.47% accuracy in discrete sentiment recognition on STEM-E2VA, with a substantial reduction in RMSE for PAD dimension prediction. It also exhibits excellent generalization capability on IEMOCAP, providing a novel framework for integrating discrete and continuous sentiment representations.

37 pages, 7239 KB  
Review
The Cortico-Cortical and Subcortical Circuits of the Human Brain Language Centers Including the Dual Limbic and Language Functioning Fiber Tracts
by Arash Kamali, Nithya P. Narayana, Anastasia Loiko, Anusha Gandhi, Paul E. Schulz, Nitin Tandon, Manish N. Shah, Vinodh A. Kumar, Larry A. Kramer, Jay-Jiguang Zhu, Haris Sair, Roy F. Riascos and Khader M. Hasan
Brain Sci. 2026, 16(2), 142; https://doi.org/10.3390/brainsci16020142 - 28 Jan 2026
Viewed by 1749
Abstract
Background/Objectives: In recent years, MRI-based diffusion-weighted tractography techniques have uncovered additional white matter pathways that have significant roles in language processing and production. In this review, we aim to outline the major language centers of the brain and major language pathways along with association tracts that serve dual roles in both the language and limbic systems. According to the current dual-stream model of language processing, the brain’s language network is organized into a dorsal stream, responsible for mapping sound to articulation, and a ventral stream, which maps sound to meaning. Materials and Methods: The literature cited in this manuscript was identified through targeted searches of the PubMed database. Priority was given to peer-reviewed human studies, including original neuroimaging, cadaveric validation, and intraoperative stimulation studies. Non-peer-reviewed sources and publications lacking clear anatomical or functional correlation to language pathways were excluded. Results: Advances in functional MRI and diffusion weighted imaging techniques have revealed a more interconnected network, expanding our understanding beyond the classical dual-stream model of language processing. The Kamali limbic model proposed distinct ventral and dorsal limbic networks. Notably, several fiber pathways within the ventral limbic network may subserve both language and limbic functions. The association tracts with dual limbic-language functions form a critical basis for understanding the pathophysiology of language disorders accompanied by cognitive and emotional comorbidities observed in dyslexia, speech apraxia, aphasia, autism spectrum disorder, schizophrenia and post-traumatic stress disorder. 
Conclusions: Visualizing the language center and interconnected dual language and limbic fiber tracts highlights the importance of integrating language, executive function, and emotion in developing disease models and designing effective, targeted treatments for patients.
(This article belongs to the Section Cognitive, Social and Affective Neuroscience)

14 pages, 639 KB  
Article
Recognising Emotions from the Voice: A tDCS and fNIRS Double-Blind Study on the Role of the Cerebellum in Emotional Prosody
by Sharon Mara Luciano, Laura Sagliano, Alessia Salzillo, Luigi Trojano and Francesco Panico
Brain Sci. 2025, 15(12), 1327; https://doi.org/10.3390/brainsci15121327 - 13 Dec 2025
Cited by 2 | Viewed by 896
Abstract
Background: Emotional prosody refers to the variations in pitch, pause, melody, rhythm, and stress of pronunciation conveying emotional meaning during speech. Although several studies demonstrated that the cerebellum is involved in the network subserving recognition of emotional facial expressions, there is only preliminary evidence suggesting its possible contribution to recognising emotional prosody by modulating the activity of cerebello-prefrontal circuits. The present study aims to further explore the role of the left and right cerebellum in the recognition of emotional prosody in a sample of healthy individuals who were required to identify emotions (happiness, anger, sadness, surprise, disgust, and neutral) from vocal stimuli selected from a validated database (EMOVO corpus). Methods: Anodal transcranial direct current stimulation (tDCS) was used in offline mode to modulate cerebellar activity before the emotional prosody recognition task, and functional near-infrared spectroscopy (fNIRS) was used to monitor stimulation-related changes in oxy- and deoxy-haemoglobin (O2HB and HHB) in prefrontal areas (PFC). Results: Right cerebellar stimulation reduced reaction times in the recognition of all emotions (except neutral and disgust) as compared to both the sham and left cerebellar stimulation, while accuracy was not affected by the stimulation. Haemodynamic data revealed that right cerebellar stimulation reduced O2HB and increased HHB in the PFC bilaterally relative to the other stimulation conditions. Conclusions: These findings are consistent with the involvement of the right cerebellum in modulating emotional processing and in regulating cerebello-prefrontal circuits.

58 pages, 744 KB  
Article
Review and Comparative Analysis of Databases for Speech Emotion Recognition
by Salvatore Serrano, Omar Serghini, Giulia Esposito, Silvia Carbone, Carmela Mento, Alessandro Floris, Simone Porcu and Luigi Atzori
Data 2025, 10(10), 164; https://doi.org/10.3390/data10100164 - 14 Oct 2025
Viewed by 6617
Abstract
Speech emotion recognition (SER) has become increasingly important in areas such as healthcare, customer service, robotics, and human–computer interaction. The progress of this field depends not only on advances in algorithms but also on the databases that provide the training material for SER systems. These resources set the boundaries for how well models can generalize across speakers, contexts, and cultures. In this paper, we present a narrative review and comparative analysis of emotional speech corpora released up to mid-2025, bringing together both psychological and technical perspectives. Rather than following a systematic review protocol, our approach focuses on providing a critical synthesis of more than fifty corpora covering acted, elicited, and natural speech. We examine how these databases were collected, how emotions were annotated, their demographic diversity, and their ecological validity, while also acknowledging the limits of available documentation. Beyond description, we identify recurring strengths and weaknesses, highlight emerging gaps, and discuss recent usage patterns to offer researchers both a practical guide for dataset selection and a critical perspective on how corpus design continues to shape the development of robust and generalizable SER systems.

25 pages, 1822 KB  
Article
Emotion Recognition from Speech in a Subject-Independent Approach
by Andrzej Majkowski and Marcin Kołodziej
Appl. Sci. 2025, 15(13), 6958; https://doi.org/10.3390/app15136958 - 20 Jun 2025
Cited by 4 | Viewed by 4452
Abstract
The aim of this article is to critically and reliably assess the potential of current emotion recognition technologies for practical applications in human–computer interaction (HCI) systems. The study made use of two databases: one in English (RAVDESS) and another in Polish (EMO-BAJKA), both containing speech recordings expressing various emotions. The effectiveness of recognizing seven and eight different emotions was analyzed. A range of acoustic features, including energy features, mel-cepstral features, zero-crossing rate, fundamental frequency, and spectral features, were utilized to analyze the emotions in speech. Machine learning techniques such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and support vector machines with a cubic kernel (cubic SVMs) were employed in the emotion classification task. The research findings indicated that the effective recognition of a broad spectrum of emotions in a subject-independent approach is limited. However, significantly better results were obtained in the classification of paired emotions, suggesting that emotion recognition technologies could be effectively used in specific applications where distinguishing between two particular emotional states is essential. To ensure a reliable and accurate assessment of the emotion recognition system, care was taken to divide the dataset in such a way that the training and testing data contained recordings of completely different individuals. In the RAVDESS database, the highest classification accuracies for pairs of emotions were achieved for Angry–Fearful (0.8), Angry–Happy (0.86), Angry–Neutral (1.0), Angry–Sad (1.0), Angry–Surprise (0.89), Disgust–Neutral (0.91), and Disgust–Sad (0.96). In the EMO-BAJKA database, the highest classification accuracies for pairs of emotions were for Joy–Neutral (0.91), Surprise–Neutral (0.80), Surprise–Fear (0.91), and Neutral–Fear (0.91).
(This article belongs to the Special Issue New Advances in Applied Machine Learning)
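
The subject-independent protocol described in this abstract (training and testing on recordings of completely different individuals) can be sketched as a split by speaker rather than by recording. This is only an illustration under stated assumptions: the function name `speaker_disjoint_split`, the 25% test fraction, and the toy records are hypothetical and are not taken from the article.

```python
import random


def speaker_disjoint_split(records, test_fraction=0.25, seed=0):
    """Split (speaker_id, item) records so that no speaker is on both sides.

    Hypothetical helper: the article does not publish its splitting code.
    """
    speakers = sorted({speaker for speaker, _ in records})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    # Hold out whole speakers, never individual recordings.
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [r for r in records if r[0] not in test_speakers]
    test = [r for r in records if r[0] in test_speakers]
    return train, test


# Toy data (assumption): 8 speakers with 3 recordings each.
records = [(f"spk{s}", f"clip{s}_{i}.wav") for s in range(8) for i in range(3)]
train, test = speaker_disjoint_split(records)
```

Splitting at the speaker level is what keeps a classifier from scoring well merely by recognising a voice it has already heard during training.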

15 pages, 4273 KB  
Article
Speech Emotion Recognition: Comparative Analysis of CNN-LSTM and Attention-Enhanced CNN-LSTM Models
by Jamsher Bhanbhro, Asif Aziz Memon, Bharat Lal, Shahnawaz Talpur and Madeha Memon
Signals 2025, 6(2), 22; https://doi.org/10.3390/signals6020022 - 9 May 2025
Cited by 13 | Viewed by 7266
Abstract
Speech Emotion Recognition (SER) technology helps computers understand human emotions in speech, which fills a critical niche in advancing human–computer interaction and mental health diagnostics. The primary objective of this study is to enhance SER accuracy and generalization through innovative deep learning models. Despite its importance in various fields like human–computer interaction and mental health diagnosis, accurately identifying emotions from speech can be challenging due to differences in speakers, accents, and background noise. The work proposes two innovative deep learning models to improve SER accuracy: a CNN-LSTM model and an Attention-Enhanced CNN-LSTM model. These models were tested on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), collected between 2015 and 2018, which comprises 1440 audio files of male and female actors expressing eight emotions. Both models achieved impressive accuracy rates of over 96% in classifying emotions into eight categories. By comparing the CNN-LSTM and Attention-Enhanced CNN-LSTM models, this study offers comparative insights into modeling techniques, contributes to the development of more effective emotion recognition systems, and offers practical implications for real-time applications in healthcare and customer service.

28 pages, 530 KB  
Article
Advancing Spanish Speech Emotion Recognition: A Comprehensive Benchmark of Pre-Trained Models
by Alex Mares, Gerardo Diaz-Arango, Jorge Perez-Jacome-Friscione, Hector Vazquez-Leal, Luis Hernandez-Martinez, Jesus Huerta-Chua, Andres Felipe Jaramillo-Alvarado and Alfonso Dominguez-Chavez
Appl. Sci. 2025, 15(8), 4340; https://doi.org/10.3390/app15084340 - 14 Apr 2025
Cited by 4 | Viewed by 5020
Abstract
Feature extraction for speech emotion recognition (SER) has evolved from handcrafted techniques through deep learning methods to embeddings derived from pre-trained models (PTMs). This study presents the first comparative analysis focused on using PTMs for Spanish SER, evaluating six models (Whisper, Wav2Vec 2.0, WavLM, HuBERT, TRILLsson, and CLAP) across six emotional speech databases: EmoMatchSpanishDB, MESD, MEACorpus, EmoWisconsin, INTER1SP, and EmoFilm. We propose a robust framework combining layer-wise feature extraction with Leave-One-Speaker-Out validation to ensure interpretable model comparisons. Our method significantly outperforms existing state-of-the-art benchmarks, notably achieving high F1 scores on EmoMatchSpanishDB (88.32%), INTER1SP (99.83%), and MEACorpus (92.53%). Layer-wise analyses reveal optimal emotional representation extraction at early layers in 24-layer models and middle layers in larger architectures. Additionally, TRILLsson exhibits remarkable generalization in speaker-independent evaluations, highlighting the necessity of strategic model selection, fine-tuning, and language-specific adaptations to maximize SER performance for Spanish.
(This article belongs to the Section Computing and Artificial Intelligence)
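
Leave-One-Speaker-Out validation, named in this abstract, holds out every recording of one speaker per fold and trains on all the others. The generator below is a minimal sketch of that fold construction; the name `leave_one_speaker_out` and the toy speaker labels are assumptions for illustration, not the authors' implementation.

```python
def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices) per speaker.

    Hypothetical helper illustrating the fold structure only.
    """
    for held_out in sorted(set(speaker_ids)):
        train_idx = [i for i, s in enumerate(speaker_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(speaker_ids) if s == held_out]
        yield held_out, train_idx, test_idx


# Toy labels (assumption): one entry per utterance, naming its speaker.
speaker_ids = ["a", "a", "b", "c", "c", "c"]
folds = list(leave_one_speaker_out(speaker_ids))
```

The number of folds equals the number of distinct speakers, so the cost of the evaluation grows with corpus size, which is one reason it is usually paired with fixed pre-trained feature extractors.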

21 pages, 6196 KB  
Article
Building a Gender-Bias-Resistant Super Corpus as a Deep Learning Baseline for Speech Emotion Recognition
by Babak Abbaschian and Adel Elmaghraby
Sensors 2025, 25(7), 1991; https://doi.org/10.3390/s25071991 - 22 Mar 2025
Cited by 1 | Viewed by 1588
Abstract
The focus on Speech Emotion Recognition has dramatically increased in recent years, driven by the need for automatic speech-recognition-based systems and intelligent assistants to enhance user experience by incorporating emotional content. While deep learning techniques have significantly advanced SER systems, their robustness concerning speaker gender and out-of-distribution data has not been thoroughly examined. Furthermore, standards for SER remain rooted in landmark papers from the 2000s, even though modern deep learning architectures can achieve comparable or superior results to the state of the art of that era. In this research, we address these challenges by creating a new super corpus from existing databases, providing a larger pool of samples. We benchmark this dataset using various deep learning architectures, setting a new baseline for the task. Additionally, our experiments reveal that models trained on this super corpus demonstrate superior generalization and accuracy and exhibit lower gender bias compared to models trained on individual databases. We further show that traditional preprocessing techniques, such as denoising and normalization, are insufficient to address inherent biases in the data. However, our data augmentation approach effectively shifts these biases, improving model fairness across gender groups and emotions and, in some cases, fully debiasing the models.
(This article belongs to the Special Issue Emotion Recognition and Cognitive Behavior Analysis Based on Sensors)

15 pages, 587 KB  
Systematic Review
AI Applications to Reduce Loneliness Among Older Adults: A Systematic Review of Effectiveness and Technologies
by Yuyi Yang, Chenyu Wang, Xiaoling Xiang and Ruopeng An
Healthcare 2025, 13(5), 446; https://doi.org/10.3390/healthcare13050446 - 20 Feb 2025
Cited by 39 | Viewed by 16641
Abstract
Background/Objectives: Loneliness among older adults is a prevalent issue, significantly impacting their quality of life and increasing the risk of physical and mental health complications. The application of artificial intelligence (AI) technologies in behavioral interventions offers a promising avenue to overcome challenges in designing and implementing interventions to reduce loneliness by enabling personalized and scalable solutions. This study systematically reviews the AI-enabled interventions in addressing loneliness among older adults, focusing on the effectiveness and underlying technologies used. Methods: A systematic search was conducted across eight electronic databases, including PubMed and Web of Science, for studies published up to 31 January 2024. Inclusion criteria were experimental studies involving AI applications to mitigate loneliness among adults aged 55 and older. Data on participant demographics, intervention characteristics, AI methodologies, and effectiveness outcomes were extracted and synthesized. Results: Nine studies were included, comprising six randomized controlled trials and three pre–post designs. The most frequently implemented AI technologies included speech recognition (n = 6) and emotion recognition and simulation (n = 5). Intervention types varied, with six studies employing social robots, two utilizing personal voice assistants, and one using a digital human facilitator. Six studies reported significant reductions in loneliness, particularly those utilizing social robots, which demonstrated emotional engagement and personalized interactions. Three studies reported non-significant effects, often due to shorter intervention durations or limited interaction frequencies. Conclusions: AI-driven interventions show promise in reducing loneliness among older adults. 
Future research should focus on long-term, culturally competent solutions that integrate quantitative and qualitative findings to optimize intervention design and scalability.

17 pages, 3001 KB  
Article
Performance Improvement of Speech Emotion Recognition Using ResNet Model with Data Augmentation–Saturation
by Minjeong Lee and Miran Lee
Appl. Sci. 2025, 15(4), 2088; https://doi.org/10.3390/app15042088 - 17 Feb 2025
Cited by 3 | Viewed by 2157
Abstract
Over the past five years, the proliferation of virtual reality platforms and the advancement of metahuman technologies have underscored the importance of natural interaction and emotional expression. As a result, there has been significant research activity focused on developing emotion recognition techniques based on speech data. Despite significant progress in emotion recognition research for the Korean language, a shortage of speech databases applicable to such research has been regarded as the most critical problem in this field, leading to overfitting issues in several models developed by previous studies. To address the issue of overfitting caused by limited data availability in the field of Korean speech emotion recognition (SER), this study focuses on integrating the data augmentation–saturation (DA-S) technique into a traditional ResNet model to enhance SER performance. The DA-S technique enhances data augmentation by adjusting the saturation of an image. We used 11,192 utterances provided by AI-HUB, which were converted into images to extract features such as pitch and intensity of speech. The DA-S technique was then applied to this dataset, using weights of 0 and 2, increasing the number of utterances to 33,576. This augmented dataset was utilized to classify four emotion categories: happiness, sadness, anger, and neutrality. The results of this study showed that the proposed model using the DA-S technique overcame overfitting issues. Furthermore, its performance for SER increased by 34.19% compared to that of existing ResNet models not using the DA-S technique. This demonstrates that the DA-S technique effectively enhances model performance with limited data and may be applicable to specific areas such as stress monitoring and mental health support.
(This article belongs to the Special Issue Advanced Technologies and Applications of Emotion Recognition)
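
The abstract states that saturation weights of 0 and 2 turn 11,192 utterance images into 33,576, which is consistent with keeping each original and adding one copy per weight (11,192 × 3 = 33,576). The sketch below shows one common way to adjust saturation, by blending each pixel with its grey value. Whether DA-S uses exactly this formula is an assumption; the function names and the flat-list image representation are hypothetical.

```python
def clamp(value):
    """Round to the nearest integer and keep it inside the 8-bit range."""
    return max(0, min(255, round(value)))


def adjust_saturation(pixel, weight):
    """Blend an RGB pixel with its grey value.

    weight 0 gives greyscale, 1 leaves the pixel unchanged, and values
    above 1 exaggerate colour. Assumed formula, not taken from the article.
    """
    r, g, b = pixel
    grey = 0.299 * r + 0.587 * g + 0.114 * b  # ITU-R BT.601 luma weights
    return tuple(clamp(grey + weight * (c - grey)) for c in (r, g, b))


def augment(image, weights=(0.0, 2.0)):
    """Return the original image plus one adjusted copy per weight."""
    return [image] + [[adjust_saturation(p, w) for p in image] for w in weights]


# Toy "image" (assumption): a flat list of RGB pixels.
image = [(200, 100, 50), (10, 20, 30)]
variants = augment(image)
```

With two weights each image yields three training samples, which matches the threefold growth reported in the abstract.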

20 pages, 917 KB  
Article
Developing a Dataset of Audio Features to Classify Emotions in Speech
by Alvaro A. Colunga-Rodriguez, Alicia Martínez-Rebollar, Hugo Estrada-Esquivel, Eddie Clemente and Odette A. Pliego-Martínez
Computation 2025, 13(2), 39; https://doi.org/10.3390/computation13020039 - 5 Feb 2025
Cited by 9 | Viewed by 6172
Abstract
Emotion recognition in speech has gained increasing relevance in recent years, enabling more personalized interactions between users and automated systems. This paper presents the development of a dataset of features obtained from RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) to classify emotions in speech. The paper highlights audio processing techniques such as silence removal and framing to extract features from the recordings. The features are extracted from the audio signals using spectral techniques, time-domain analysis, and the discrete wavelet transform. The resulting dataset is used to train a neural network and the support vector machine learning algorithm. Cross-validation is employed for model training. The developed models were optimized using a software package that performs hyperparameter tuning to improve results. Finally, the emotional classification outcomes were compared. The results showed an emotion classification accuracy of 0.654 for the perceptron neural network and 0.724 for the support vector machine algorithm, demonstrating satisfactory performance in emotion classification.
(This article belongs to the Section Computational Engineering)

17 pages, 944 KB  
Review
Addressing the Challenges in Pediatric Facial Fractures: A Narrative Review of Innovations in Diagnosis and Treatment
by Gabriel Mulinari-Santos, Amanda Paino Santana, Paulo Roberto Botacin and Roberta Okamoto
Surgeries 2024, 5(4), 1130-1146; https://doi.org/10.3390/surgeries5040090 - 13 Dec 2024
Cited by 7 | Viewed by 5067
Abstract
Background/Objectives: Pediatric facial fractures present unique challenges due to the anatomical, physiological, and developmental differences in children’s facial structures. The growing facial bones in children complicate diagnosis and treatment. This review explores the advancements and complexities in managing pediatric facial fractures, focusing on innovations in diagnosis, treatment strategies, and multidisciplinary care. Methods: A narrative review was conducted, synthesizing data from English-language articles published between 2001 and 2024. Relevant studies were identified through databases such as PubMed, Scopus, Lilacs, Embase, and SciELO using keywords related to pediatric facial fractures. This narrative review focuses on anatomical challenges, advancements in diagnostic techniques, treatment approaches, and the role of interdisciplinary teams in management. Results: Key findings highlight advancements in imaging technologies, including three-dimensional computed tomography (3D CT) and magnetic resonance imaging (MRI), which have improved fracture diagnosis and preoperative planning. Minimally invasive techniques and bioresorbable implants have revolutionized treatment, reducing trauma and enhancing recovery. The integration of multidisciplinary teams, including pediatricians, psychologists, and speech therapists, has become crucial in addressing both the physical and emotional needs of patients. Emerging technologies such as 3D printing and computer-assisted navigation are shaping future treatment approaches. Conclusions: The management of pediatric facial fractures has significantly advanced due to innovations in imaging, surgical techniques, and the growing importance of interdisciplinary care. Despite these improvements, long-term follow-up remains critical to monitor potential complications. Ongoing research and collaboration are essential to refine treatment strategies and improve long-term outcomes for pediatric patients with facial trauma. 

25 pages, 2085 KB  
Article
How Much Does the Dynamic F0 Curve Affect the Expression of Emotion in Utterances?
by Tae-Jin Yoon
Appl. Sci. 2024, 14(23), 10972; https://doi.org/10.3390/app142310972 - 26 Nov 2024
Cited by 2 | Viewed by 2514
Abstract
The modulation of vocal elements, such as pitch, loudness, and duration, plays a crucial role in conveying both linguistic information and the speaker’s emotional state. While acoustic features like fundamental frequency (F0) variability have been widely studied in emotional speech analysis, accurately classifying emotion remains challenging due to the complex and dynamic nature of vocal expressions. Traditional analytical methods often oversimplify these dynamics, potentially overlooking intricate patterns indicative of specific emotions. This study examines the influences of emotion and temporal variation on dynamic F0 contours in the analytical framework, utilizing a dataset valuable for its diverse emotional expressions. However, the analysis is constrained by the limited variety of sentences employed, which may affect the generalizability of the findings to broader linguistic contexts. We utilized the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), focusing on eight distinct emotional states performed by 24 professional actors. Sonorant segments were extracted, and F0 measurements were converted into semitones relative to a 100 Hz baseline to standardize pitch variations. By employing Generalized Additive Mixed Models (GAMMs), we modeled non-linear trajectories of F0 contours over time, accounting for fixed effects (emotions) and random effects (individual speaker variability). Our analysis revealed that incorporating emotion-specific, non-linear time effects and individual speaker differences significantly improved the model’s explanatory power, ultimately explaining up to 66.5% of the variance in the F0. The inclusion of random smooths for time within speakers captured individual temporal modulation patterns, providing a more accurate representation of emotional speech dynamics. The results demonstrate that dynamic modeling of F0 contours using GAMMs enhances the accuracy of emotion classification in speech. 
This approach captures the nuanced pitch patterns associated with different emotions and accounts for individual variability among speakers. The findings contribute to a deeper understanding of the vocal expression of emotions and offer valuable insights for advancing speech emotion recognition systems.
(This article belongs to the Special Issue Advances and Applications of Audio and Speech Signal Processing)
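The semitone conversion mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the standard definition of a semitone (12 per octave, so 12 × log2 of the frequency ratio) and the 100 Hz baseline stated in the abstract; the function names and the convention of marking unvoiced frames with `None` are hypothetical and are not taken from the article.

```python
import math
from typing import List, Optional


def hz_to_semitones(f0_hz: float, ref_hz: float = 100.0) -> float:
    """Convert an F0 value in Hz to semitones relative to ref_hz.

    Uses the standard relation: semitones = 12 * log2(f0 / ref).
    The 100 Hz default follows the baseline named in the abstract.
    """
    if f0_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(f0_hz / ref_hz)


def contour_to_semitones(
    contour_hz: List[Optional[float]], ref_hz: float = 100.0
) -> List[Optional[float]]:
    """Convert an F0 contour frame by frame.

    Assumption (not from the article): unvoiced frames are marked
    as None and are passed through unchanged.
    """
    return [
        None if f0 is None else hz_to_semitones(f0, ref_hz)
        for f0 in contour_hz
    ]


if __name__ == "__main__":
    # One octave above the baseline is 12 semitones; one octave below is -12.
    print(contour_to_semitones([100.0, 200.0, None, 50.0]))
```

Expressing F0 on a logarithmic scale relative to a fixed baseline makes pitch excursions comparable across speakers with different vocal ranges, which is presumably why the conversion precedes the modeling step.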

22 pages, 336 KB  
Article
Multimodal Emotion Recognition Based on Facial Expressions, Speech, and Body Gestures
by Jingjie Yan, Peiyuan Li, Chengkun Du, Kang Zhu, Xiaoyang Zhou, Ying Liu and Jinsheng Wei
Electronics 2024, 13(18), 3756; https://doi.org/10.3390/electronics13183756 - 21 Sep 2024
Cited by 10 | Viewed by 6077
Abstract
Research on multimodal emotion recognition based on facial expressions, speech, and body gestures is crucial for upcoming intelligent human–computer interfaces. However, it is a very difficult task, and this combination of modalities has seldom been studied in past years. Based on the GEMEP and Polish databases, this contribution focuses on trimodal emotion recognition from facial expressions, speech, and body gestures, including feature extraction, feature fusion, and multimodal classification of the three modalities. In particular, for feature fusion, two novel algorithms, supervised least squares multiset kernel canonical correlation analysis (SLSMKCCA) and sparse supervised least squares multiset kernel canonical correlation analysis (SSLSMKCCA), are presented to carry out efficient facial expression, speech, and body gesture feature fusion. Unlike traditional multiset kernel canonical correlation analysis (MKCCA) algorithms, our SLSMKCCA algorithm is supervised and is based on the least squares form. The SSLSMKCCA algorithm is implemented by combining SLSMKCCA with a sparse term (L1 norm). Moreover, two effective solving algorithms for SLSMKCCA and SSLSMKCCA are presented, which use the alternating least squares and augmented Lagrangian multiplier methods, respectively. Extensive experimental results on the popular public GEMEP and Polish databases show that the recognition rate of multimodal emotion recognition is, on average, superior to that of bimodal and monomodal emotion recognition, and that the presented SLSMKCCA and SSLSMKCCA fusion methods both obtain very high recognition rates, especially the SSLSMKCCA fusion method.
(This article belongs to the Special Issue Applied AI in Emotion Recognition)

39 pages, 6629 KB  
Article
A Combined CNN Architecture for Speech Emotion Recognition
by Rolinson Begazo, Ana Aguilera, Irvin Dongo and Yudith Cardinale
Sensors 2024, 24(17), 5797; https://doi.org/10.3390/s24175797 - 6 Sep 2024
Cited by 16 | Viewed by 8353
Abstract
Emotion recognition through speech is a technique employed in various scenarios of Human–Computer Interaction (HCI). Existing approaches have achieved significant results; however, limitations persist, and those concerning the quantity and diversity of data are more notable when deep learning techniques are used. The lack of a standard in feature selection leads to continuous development and experimentation. Choosing and designing the appropriate network architecture constitutes another challenge. This study addresses the challenge of recognizing emotions in the human voice using deep learning techniques, proposing a comprehensive approach and developing preprocessing and feature selection stages, while constructing a dataset called EmoDSc by combining several available databases. The synergy between spectral features and spectrogram images is investigated. Independently, the weighted accuracy obtained using only spectral features was 89%, while using only spectrogram images, the weighted accuracy reached 90%. These results, although surpassing previous research, highlight the strengths and limitations of each feature type when operating in isolation. Based on this exploration, a neural network architecture composed of a CNN1D, a CNN2D, and an MLP that fuses spectral features and spectrogram images is proposed. The model, supported by the unified dataset EmoDSc, demonstrates a remarkable accuracy of 96%.
(This article belongs to the Special Issue Emotion Recognition Based on Sensors (3rd Edition))