MDPI - Publisher of Open Access Journals

34 pages, 3911 KB

Open Access Article

PAD-Guided Multimodal Hybrid Contrastive Emotion Recognition upon STEM-E²VA Dataset

by Shufei Duan, Wenjie Zhang, Liangqi Li, Ting Zhu, Fangyu Zhao, Fujiang Li and Huizhi Liang

Multimodal Technol. Interact. 2026, 10(4), 38; https://doi.org/10.3390/mti10040038 - 2 Apr 2026

Viewed by 629

Abstract

Speech emotion recognition still faces several challenges: single-modal information has limited representational capability, discrete emotion annotations struggle to capture continuous emotional transitions, and multimodal fusion methods continue to suffer from structural differences between modalities and cross-sample alignment problems. To address these, this study works from both the data and the model perspectives. For data, a Chinese multimodal database, STEM-E²VA, was constructed by synchronously collecting four modalities: articulatory kinematics, acoustics, glottal signals, and video. It covers seven discrete emotion categories and employs PAD continuous annotation. By integrating discrete and continuous dimensional annotations, it better represents the distinction between strong and weak emotions under the same discrete emotion label. To address biases in the PAD annotations, we employed the SCL-90 psychological questionnaire to analyze annotators’ cognitive and emotional perceptions, thereby ensuring data reliability. For the model, this paper proposes a multimodal supervised contrastive fusion network incorporating PAD perception. It employs a PAD-enhanced hybrid contrastive loss function to optimize intra-modal and inter-modal feature alignment. Using a cross-attention mechanism combined with a GRU–Transformer network for temporal feature extraction, it achieves deep fusion of multimodal information, reducing inter-modal discrepancies and cross-class confusion. Experiments demonstrate that the proposed method achieves 85.47% accuracy in discrete sentiment recognition on STEM-E²VA, with a substantial reduction in RMSE for PAD dimension prediction. It also exhibits excellent generalization capability on IEMOCAP, providing a novel framework for integrating discrete and continuous sentiment representations.

37 pages, 7239 KB

Open Access Review

The Cortico-Cortical and Subcortical Circuits of the Human Brain Language Centers Including the Dual Limbic and Language Functioning Fiber Tracts

by Arash Kamali, Nithya P. Narayana, Anastasia Loiko, Anusha Gandhi, Paul E. Schulz, Nitin Tandon, Manish N. Shah, Vinodh A. Kumar, Larry A. Kramer, Jay-Jiguang Zhu, Haris Sair, Roy F. Riascos and Khader M. Hasan

Brain Sci. 2026, 16(2), 142; https://doi.org/10.3390/brainsci16020142 - 28 Jan 2026

Viewed by 1749

Abstract

Background/Objectives: In recent years, MRI-based diffusion-weighted tractography techniques have uncovered additional white matter pathways that have significant roles in language processing and production. In this review, we aim to outline the major language centers of the brain and major language pathways along with association tracts that serve dual roles in both the language and limbic systems. According to the current dual-stream model of language processing, the brain’s language network is organized into a dorsal stream, responsible for mapping sound to articulation, and a ventral stream, which maps sound to meaning. Materials and Methods: The literature cited in this manuscript was identified through targeted searches of the PubMed database. Priority was given to peer-reviewed human studies, including original neuroimaging, cadaveric validation, and intraoperative stimulation studies. Non-peer-reviewed sources and publications lacking clear anatomical or functional correlation to language pathways were excluded. Results: Advances in functional MRI and diffusion weighted imaging techniques have revealed a more interconnected network, expanding our understanding beyond the classical dual-stream model of language processing. The Kamali limbic model proposed distinct ventral and dorsal limbic networks. Notably, several fiber pathways within the ventral limbic network may subserve both language and limbic functions. The association tracts with dual limbic-language functions form a critical basis for understanding the pathophysiology of language disorders accompanied by cognitive and emotional comorbidities observed in dyslexia, speech apraxia, aphasia, autism spectrum disorder, schizophrenia and post-traumatic stress disorder. 
Conclusions: Visualizing the language center and interconnected dual language and limbic fiber tracts highlights the importance of integrating language, executive function, and emotion in developing disease models and designing effective, targeted treatments for patients.

(This article belongs to the Section Cognitive, Social and Affective Neuroscience)

14 pages, 639 KB

Open Access Article

Recognising Emotions from the Voice: A tDCS and fNIRS Double-Blind Study on the Role of the Cerebellum in Emotional Prosody

by Sharon Mara Luciano, Laura Sagliano, Alessia Salzillo, Luigi Trojano and Francesco Panico

Brain Sci. 2025, 15(12), 1327; https://doi.org/10.3390/brainsci15121327 - 13 Dec 2025

Cited by 2 | Viewed by 896

Abstract

Background: Emotional prosody refers to the variations in pitch, pause, melody, rhythm, and stress of pronunciation conveying emotional meaning during speech. Although several studies demonstrated that the cerebellum is involved in the network subserving recognition of emotional facial expressions, there is only preliminary evidence suggesting its possible contribution to recognising emotional prosody by modulating the activity of cerebello-prefrontal circuits. The present study aims to further explore the role of the left and right cerebellum in the recognition of emotional prosody in a sample of healthy individuals who were required to identify emotions (happiness, anger, sadness, surprise, disgust, and neutral) from vocal stimuli selected from a validated database (EMOVO corpus). Methods: Anodal transcranial Direct Current Stimulation (tDCS) was used in offline mode to modulate cerebellar activity before the emotional prosody recognition task, and functional near-infrared spectroscopy (fNIRS) was used to monitor stimulation-related changes in oxy- and deoxy-haemoglobin (O2HB and HHB) in prefrontal areas (PFC). Results: Right cerebellar stimulation reduced reaction times in the recognition of all emotions (except neutral and disgust) as compared to both the sham and left cerebellar stimulation, while accuracy was not affected by the stimulation. Haemodynamic data revealed that right cerebellar stimulation reduced O2HB and increased HHB in the PFC bilaterally relative to the other stimulation conditions. Conclusions: These findings are consistent with the involvement of the right cerebellum in modulating emotional processing and in regulating cerebello-prefrontal circuits.

(This article belongs to the Topic The Relationship Between Bodily, Autonomic, and Communicative Behaviors and the Experiential and Cognitive Aspects of Emotion)

58 pages, 744 KB

Open Access Article

Review and Comparative Analysis of Databases for Speech Emotion Recognition

by Salvatore Serrano, Omar Serghini, Giulia Esposito, Silvia Carbone, Carmela Mento, Alessandro Floris, Simone Porcu and Luigi Atzori

Data 2025, 10(10), 164; https://doi.org/10.3390/data10100164 - 14 Oct 2025

Viewed by 6617

Abstract

Speech emotion recognition (SER) has become increasingly important in areas such as healthcare, customer service, robotics, and human–computer interaction. The progress of this field depends not only on advances in algorithms but also on the databases that provide the training material for SER systems. These resources set the boundaries for how well models can generalize across speakers, contexts, and cultures. In this paper, we present a narrative review and comparative analysis of emotional speech corpora released up to mid-2025, bringing together both psychological and technical perspectives. Rather than following a systematic review protocol, our approach focuses on providing a critical synthesis of more than fifty corpora covering acted, elicited, and natural speech. We examine how these databases were collected, how emotions were annotated, their demographic diversity, and their ecological validity, while also acknowledging the limits of available documentation. Beyond description, we identify recurring strengths and weaknesses, highlight emerging gaps, and discuss recent usage patterns to offer researchers both a practical guide for dataset selection and a critical perspective on how corpus design continues to shape the development of robust and generalizable SER systems.

25 pages, 1822 KB

Open Access Article

Emotion Recognition from Speech in a Subject-Independent Approach

by Andrzej Majkowski and Marcin Kołodziej

Appl. Sci. 2025, 15(13), 6958; https://doi.org/10.3390/app15136958 - 20 Jun 2025

Cited by 4 | Viewed by 4452

Abstract

The aim of this article is to critically and reliably assess the potential of current emotion recognition technologies for practical applications in human–computer interaction (HCI) systems. The study made use of two databases: one in English (RAVDESS) and another in Polish (EMO-BAJKA), both containing speech recordings expressing various emotions. The effectiveness of recognizing seven and eight different emotions was analyzed. A range of acoustic features, including energy features, mel-cepstral features, zero-crossing rate, fundamental frequency, and spectral features, were utilized to analyze the emotions in speech. Machine learning techniques such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and support vector machines with a cubic kernel (cubic SVMs) were employed in the emotion classification task. The research findings indicated that the effective recognition of a broad spectrum of emotions in a subject-independent approach is limited. However, significantly better results were obtained in the classification of paired emotions, suggesting that emotion recognition technologies could be effectively used in specific applications where distinguishing between two particular emotional states is essential. To ensure a reliable and accurate assessment of the emotion recognition system, care was taken to divide the dataset in such a way that the training and testing data contained recordings of completely different individuals. The highest classification accuracies for pairs of emotions were achieved for Angry–Fearful (0.8), Angry–Happy (0.86), Angry–Neutral (1.0), Angry–Sad (1.0), Angry–Surprise (0.89), Disgust–Neutral (0.91), and Disgust–Sad (0.96) in the RAVDESS database. In the EMO-BAJKA database, the highest classification accuracies for pairs of emotions were for Joy–Neutral (0.91), Surprise–Neutral (0.80), Surprise–Fear (0.91), and Neutral–Fear (0.91).
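
The subject-independent split mentioned in this abstract (training and testing data containing recordings of completely different individuals) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `speaker_independent_split`, the `speaker` key, the 25% test fraction, and the fixed seed are all assumptions made for the example.

```python
import random


def speaker_independent_split(samples, test_fraction=0.25, seed=0):
    """Split samples so that no speaker appears in both train and test.

    `samples` is a list of dicts with a "speaker" key (a hypothetical layout).
    """
    # Collect the distinct speakers and shuffle them reproducibly.
    speakers = sorted({s["speaker"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(speakers)

    # Hold out whole speakers, never individual recordings.
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    train = [s for s in samples if s["speaker"] not in test_speakers]
    test = [s for s in samples if s["speaker"] in test_speakers]
    return train, test
```

Splitting by speaker rather than by recording keeps a classifier from scoring well simply by recognizing voices it has already heard during training.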

(This article belongs to the Special Issue New Advances in Applied Machine Learning)

15 pages, 4273 KB

Open Access Editor’s Choice Article

Speech Emotion Recognition: Comparative Analysis of CNN-LSTM and Attention-Enhanced CNN-LSTM Models

by Jamsher Bhanbhro, Asif Aziz Memon, Bharat Lal, Shahnawaz Talpur and Madeha Memon

Signals 2025, 6(2), 22; https://doi.org/10.3390/signals6020022 - 9 May 2025

Cited by 13 | Viewed by 7266

Abstract

Speech Emotion Recognition (SER) technology helps computers understand human emotions in speech, which fills a critical niche in advancing human–computer interaction and mental health diagnostics. The primary objective of this study is to enhance SER accuracy and generalization through innovative deep learning models. Despite its importance in these fields, accurately identifying emotions from speech can be challenging due to differences in speakers, accents, and background noise. The work proposes two deep learning models to improve SER accuracy: a CNN-LSTM model and an Attention-Enhanced CNN-LSTM model. These models were tested on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), collected between 2015 and 2018, which comprises 1440 audio files of male and female actors expressing eight emotions. Both models achieved accuracy rates of over 96% in classifying emotions into eight categories. By comparing the CNN-LSTM and Attention-Enhanced CNN-LSTM models, this study offers comparative insights into modeling techniques, contributes to the development of more effective emotion recognition systems, and offers practical implications for real-time applications in healthcare and customer service.

28 pages, 530 KB

Open Access Article

Advancing Spanish Speech Emotion Recognition: A Comprehensive Benchmark of Pre-Trained Models

by Alex Mares, Gerardo Diaz-Arango, Jorge Perez-Jacome-Friscione, Hector Vazquez-Leal, Luis Hernandez-Martinez, Jesus Huerta-Chua, Andres Felipe Jaramillo-Alvarado and Alfonso Dominguez-Chavez

Appl. Sci. 2025, 15(8), 4340; https://doi.org/10.3390/app15084340 - 14 Apr 2025

Cited by 4 | Viewed by 5020

Abstract

Feature extraction for speech emotion recognition (SER) has evolved from handcrafted techniques through deep learning methods to embeddings derived from pre-trained models (PTMs). This study presents the first comparative analysis focused on using PTMs for Spanish SER, evaluating six models (Whisper, Wav2Vec 2.0, WavLM, HuBERT, TRILLsson, and CLAP) across six emotional speech databases: EmoMatchSpanishDB, MESD, MEACorpus, EmoWisconsin, INTER1SP, and EmoFilm. We propose a robust framework combining layer-wise feature extraction with Leave-One-Speaker-Out validation to ensure interpretable model comparisons. Our method significantly outperforms existing state-of-the-art benchmarks, notably in F1 score on EmoMatchSpanishDB (88.32%), INTER1SP (99.83%), and MEACorpus (92.53%). Layer-wise analyses reveal optimal emotional representation extraction at early layers in 24-layer models and middle layers in larger architectures. Additionally, TRILLsson exhibits remarkable generalization in speaker-independent evaluations, highlighting the necessity of strategic model selection, fine-tuning, and language-specific adaptations to maximize SER performance for Spanish.

(This article belongs to the Section Computing and Artificial Intelligence)

21 pages, 6196 KB

Open Access Article

Building a Gender-Bias-Resistant Super Corpus as a Deep Learning Baseline for Speech Emotion Recognition

by Babak Abbaschian and Adel Elmaghraby

Sensors 2025, 25(7), 1991; https://doi.org/10.3390/s25071991 - 22 Mar 2025

Cited by 1 | Viewed by 1588

Abstract

The focus on Speech Emotion Recognition has dramatically increased in recent years, driven by the need for automatic speech-recognition-based systems and intelligent assistants to enhance user experience by incorporating emotional content. While deep learning techniques have significantly advanced SER systems, their robustness concerning speaker gender and out-of-distribution data has not been thoroughly examined. Furthermore, standards for SER remain rooted in landmark papers from the 2000s, even though modern deep learning architectures can achieve comparable or superior results to the state of the art of that era. In this research, we address these challenges by creating a new super corpus from existing databases, providing a larger pool of samples. We benchmark this dataset using various deep learning architectures, setting a new baseline for the task. Additionally, our experiments reveal that models trained on this super corpus demonstrate superior generalization and accuracy and exhibit lower gender bias compared to models trained on individual databases. We further show that traditional preprocessing techniques, such as denoising and normalization, are insufficient to address inherent biases in the data. However, our data augmentation approach effectively shifts these biases, improving model fairness across gender groups and emotions and, in some cases, fully debiasing the models.

(This article belongs to the Special Issue Emotion Recognition and Cognitive Behavior Analysis Based on Sensors)

15 pages, 587 KB

Open Access Systematic Review

AI Applications to Reduce Loneliness Among Older Adults: A Systematic Review of Effectiveness and Technologies

by Yuyi Yang, Chenyu Wang, Xiaoling Xiang and Ruopeng An

Healthcare 2025, 13(5), 446; https://doi.org/10.3390/healthcare13050446 - 20 Feb 2025

Cited by 39 | Viewed by 16641

Abstract

Background/Objectives: Loneliness among older adults is a prevalent issue, significantly impacting their quality of life and increasing the risk of physical and mental health complications. The application of artificial intelligence (AI) technologies in behavioral interventions offers a promising avenue to overcome challenges in designing and implementing interventions to reduce loneliness by enabling personalized and scalable solutions. This study systematically reviews the AI-enabled interventions in addressing loneliness among older adults, focusing on the effectiveness and underlying technologies used. Methods: A systematic search was conducted across eight electronic databases, including PubMed and Web of Science, for studies published up to 31 January 2024. Inclusion criteria were experimental studies involving AI applications to mitigate loneliness among adults aged 55 and older. Data on participant demographics, intervention characteristics, AI methodologies, and effectiveness outcomes were extracted and synthesized. Results: Nine studies were included, comprising six randomized controlled trials and three pre–post designs. The most frequently implemented AI technologies included speech recognition (n = 6) and emotion recognition and simulation (n = 5). Intervention types varied, with six studies employing social robots, two utilizing personal voice assistants, and one using a digital human facilitator. Six studies reported significant reductions in loneliness, particularly those utilizing social robots, which demonstrated emotional engagement and personalized interactions. Three studies reported non-significant effects, often due to shorter intervention durations or limited interaction frequencies. Conclusions: AI-driven interventions show promise in reducing loneliness among older adults. 
Future research should focus on long-term, culturally competent solutions that integrate quantitative and qualitative findings to optimize intervention design and scalability.

(This article belongs to the Special Issue Quality of Life and Mental Health of People with Disabilities and Chronic Illnesses in the Digital Era)

17 pages, 3001 KB

Open Access Article

Performance Improvement of Speech Emotion Recognition Using ResNet Model with Data Augmentation–Saturation

by Minjeong Lee and Miran Lee

Appl. Sci. 2025, 15(4), 2088; https://doi.org/10.3390/app15042088 - 17 Feb 2025

Cited by 3 | Viewed by 2157

Abstract

Over the past five years, the proliferation of virtual reality platforms and the advancement of metahuman technologies have underscored the importance of natural interaction and emotional expression. As a result, there has been significant research activity focused on developing emotion recognition techniques based on speech data. Despite significant progress in emotion recognition research for the Korean language, a shortage of speech databases applicable to such research has been regarded as the most critical problem in this field, leading to overfitting issues in several models developed by previous studies. To address the issue of overfitting caused by limited data availability in the field of Korean speech emotion recognition (SER), this study focuses on integrating the data augmentation–saturation (DA-S) technique into a traditional ResNet model to enhance SER performance. The DA-S technique enhances data augmentation by adjusting the saturation of an image. We used 11,192 utterances provided by AI-HUB, which were converted into images to extract features such as pitch and intensity of speech. The DA-S technique was then applied to this dataset, using weights of 0 and 2, increasing the number of utterances to 33,576. This augmented dataset was utilized to classify four emotion categories: happiness, sadness, anger, and neutrality. The results of this study showed that the proposed model using the DA-S technique overcame overfitting issues. Furthermore, its performance for SER increased by 34.19% compared to that of existing ResNet models not using the DA-S technique. This demonstrates that the DA-S technique effectively enhances model performance with limited data and may be applicable to specific areas such as stress monitoring and mental health support.
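
The arithmetic behind the augmentation in this abstract is that each original image yields two extra copies (saturation weights 0 and 2), so the dataset grows threefold: 11,192 × 3 = 33,576. The sketch below illustrates that idea only. The abstract does not state how saturation is adjusted, so the grayscale-blend formula, the luma coefficients, and the names `adjust_saturation` and `augment` are assumptions for illustration rather than the authors' implementation.

```python
def adjust_saturation(pixel, factor):
    """Scale the saturation of one RGB pixel (0-255 per channel).

    Assumed method: blend each channel with the pixel's gray level.
    factor 0 gives grayscale, 1 leaves the pixel unchanged, 2 exaggerates color.
    """
    r, g, b = pixel
    gray = 0.299 * r + 0.587 * g + 0.114 * b  # common luma weights (assumption)

    def blend(channel):
        value = gray + factor * (channel - gray)
        return max(0, min(255, round(value)))  # clamp to the valid range

    return (blend(r), blend(g), blend(b))


def augment(images, factors=(0.0, 2.0)):
    """Return each original image followed by one copy per saturation factor."""
    augmented = []
    for image in images:
        augmented.append(image)
        for factor in factors:
            augmented.append(
                [[adjust_saturation(p, factor) for p in row] for row in image]
            )
    return augmented
```

Here an image is a plain nested list of RGB tuples so the example stays dependency-free; a real pipeline would more likely use an image library's saturation or color enhancement routine.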

(This article belongs to the Special Issue Advanced Technologies and Applications of Emotion Recognition)

20 pages, 917 KB

Open Access Article

Developing a Dataset of Audio Features to Classify Emotions in Speech

by Alvaro A. Colunga-Rodriguez, Alicia Martínez-Rebollar, Hugo Estrada-Esquivel, Eddie Clemente and Odette A. Pliego-Martínez

Computation 2025, 13(2), 39; https://doi.org/10.3390/computation13020039 - 5 Feb 2025

Cited by 9 | Viewed by 6172

Abstract

Emotion recognition in speech has gained increasing relevance in recent years, enabling more personalized interactions between users and automated systems. This paper presents the development of a dataset of features obtained from RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) to classify emotions in speech. The paper highlights audio processing techniques such as silence removal and framing to extract features from the recordings. The features are extracted from the audio signals using spectral techniques, time-domain analysis, and the discrete wavelet transform. The resulting dataset is used to train a neural network and a support vector machine. Cross-validation is employed for model training. The developed models were optimized using a software package that performs hyperparameter tuning to improve results. Finally, the emotion classification outcomes were compared. The results showed an emotion classification accuracy of 0.654 for the perceptron neural network and 0.724 for the support vector machine, demonstrating satisfactory performance in emotion classification.

(This article belongs to the Section Computational Engineering)

17 pages, 944 KB

Open Access Review

Addressing the Challenges in Pediatric Facial Fractures: A Narrative Review of Innovations in Diagnosis and Treatment

by Gabriel Mulinari-Santos, Amanda Paino Santana, Paulo Roberto Botacin and Roberta Okamoto

Surgeries 2024, 5(4), 1130-1146; https://doi.org/10.3390/surgeries5040090 - 13 Dec 2024

Cited by 7 | Viewed by 5067

Abstract

Background/Objectives: Pediatric facial fractures present unique challenges due to the anatomical, physiological, and developmental differences in children’s facial structures. The growing facial bones in children complicate diagnosis and treatment. This review explores the advancements and complexities in managing pediatric facial fractures, focusing on innovations in diagnosis, treatment strategies, and multidisciplinary care. Methods: A narrative review was conducted, synthesizing data from English-language articles published between 2001 and 2024. Relevant studies were identified through databases such as PubMed, Scopus, Lilacs, Embase, and SciELO using keywords related to pediatric facial fractures. This narrative review focuses on anatomical challenges, advancements in diagnostic techniques, treatment approaches, and the role of interdisciplinary teams in management. Results: Key findings highlight advancements in imaging technologies, including three-dimensional computed tomography (3D CT) and magnetic resonance imaging (MRI), which have improved fracture diagnosis and preoperative planning. Minimally invasive techniques and bioresorbable implants have revolutionized treatment, reducing trauma and enhancing recovery. The integration of multidisciplinary teams, including pediatricians, psychologists, and speech therapists, has become crucial in addressing both the physical and emotional needs of patients. Emerging technologies such as 3D printing and computer-assisted navigation are shaping future treatment approaches. Conclusions: The management of pediatric facial fractures has significantly advanced due to innovations in imaging, surgical techniques, and the growing importance of interdisciplinary care. Despite these improvements, long-term follow-up remains critical to monitor potential complications. Ongoing research and collaboration are essential to refine treatment strategies and improve long-term outcomes for pediatric patients with facial trauma. 

25 pages, 2085 KB

Open Access Article

How Much Does the Dynamic F0 Curve Affect the Expression of Emotion in Utterances?

by Tae-Jin Yoon

Appl. Sci. 2024, 14(23), 10972; https://doi.org/10.3390/app142310972 - 26 Nov 2024

Cited by 2 | Viewed by 2514

Abstract

The modulation of vocal elements, such as pitch, loudness, and duration, plays a crucial role in conveying both linguistic information and the speaker’s emotional state. While acoustic features like fundamental frequency (F0) variability have been widely studied in emotional speech analysis, accurately classifying emotion remains challenging due to the complex and dynamic nature of vocal expressions. Traditional analytical methods often oversimplify these dynamics, potentially overlooking intricate patterns indicative of specific emotions. This study examines how emotion and temporal variation shape dynamic F0 contours, using a dataset valued for its diverse emotional expressions. However, the analysis is constrained by the limited variety of sentences employed, which may affect the generalizability of the findings to broader linguistic contexts. We utilized the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), focusing on eight distinct emotional states performed by 24 professional actors. Sonorant segments were extracted, and F0 measurements were converted into semitones relative to a 100 Hz baseline to standardize pitch variations. By employing Generalized Additive Mixed Models (GAMMs), we modeled non-linear trajectories of F0 contours over time, accounting for fixed effects (emotions) and random effects (individual speaker variability). Our analysis revealed that incorporating emotion-specific, non-linear time effects and individual speaker differences significantly improved the model’s explanatory power, ultimately explaining up to 66.5% of the variance in F0. The inclusion of random smooths for time within speakers captured individual temporal modulation patterns, providing a more accurate representation of emotional speech dynamics. The results demonstrate that dynamic modeling of F0 contours using GAMMs enhances the accuracy of emotion classification in speech. This approach captures the nuanced pitch patterns associated with different emotions and accounts for individual variability among speakers. The findings contribute to a deeper understanding of the vocal expression of emotions and offer valuable insights for advancing speech emotion recognition systems.
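
The semitone conversion mentioned in the abstract can be illustrated with the standard formula st = 12 · log2(F0 / ref). The sketch below is not the authors' code. The function name `hz_to_semitones` and the sample contour values are assumptions made for illustration. Only the 100 Hz baseline comes from the abstract.

```python
import math


def hz_to_semitones(f0_hz: float, ref_hz: float = 100.0) -> float:
    """Convert an F0 value in Hz to semitones relative to ref_hz.

    Uses the standard definition: 12 semitones per octave (a doubling
    of frequency). The 100 Hz default mirrors the baseline named in
    the abstract; the function name is hypothetical.
    """
    if f0_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(f0_hz / ref_hz)


# Hypothetical F0 contour in Hz (illustrative values, not RAVDESS data).
contour_hz = [100.0, 150.0, 200.0]
contour_st = [hz_to_semitones(f) for f in contour_hz]
print(contour_st)
```

On this scale, a value at the baseline maps to 0 semitones and a value one octave above it (200 Hz) maps to 12. This puts speakers with different pitch ranges on a comparable logarithmic axis.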

(This article belongs to the Special Issue Advances and Applications of Audio and Speech Signal Processing)

22 pages, 336 KB

Open AccessArticle

Multimodal Emotion Recognition Based on Facial Expressions, Speech, and Body Gestures

by Jingjie Yan, Peiyuan Li, Chengkun Du, Kang Zhu, Xiaoyang Zhou, Ying Liu and Jinsheng Wei

Electronics 2024, 13(18), 3756; https://doi.org/10.3390/electronics13183756 - 21 Sep 2024

Cited by 10 | Viewed by 6077

Abstract

Research on multimodal emotion recognition based on facial expressions, speech, and body gestures is crucial for upcoming intelligent human–computer interfaces. However, it is a very difficult task, and this combination of modalities has seldom been studied in past years. Based on the GEMEP and Polish databases, this contribution focuses on trimodal emotion recognition from facial expressions, speech, and body gestures, including feature extraction, feature fusion, and multimodal classification of the three modalities. In particular, for feature fusion, two novel algorithms, supervised least squares multiset kernel canonical correlation analysis (SLSMKCCA) and sparse supervised least squares multiset kernel canonical correlation analysis (SSLSMKCCA), are presented to carry out efficient fusion of facial expression, speech, and body gesture features. Unlike traditional multiset kernel canonical correlation analysis (MKCCA) algorithms, our SLSMKCCA algorithm is a supervised version and is based on the least squares form. The SSLSMKCCA algorithm is implemented by combining SLSMKCCA with a sparse term (L1 norm). Moreover, two effective solving algorithms for SLSMKCCA and SSLSMKCCA are also presented, which use the alternating least squares and augmented Lagrangian multiplier methods, respectively. Extensive experimental results on the popular public GEMEP and Polish databases show that the recognition rate of multimodal emotion recognition is superior to that of bimodal and monomodal emotion recognition on average, and that the presented SLSMKCCA and SSLSMKCCA fusion methods both obtain very high recognition rates, especially the SSLSMKCCA fusion method.

(This article belongs to the Special Issue Applied AI in Emotion Recognition)

39 pages, 6629 KB

Open AccessArticle

A Combined CNN Architecture for Speech Emotion Recognition

by Rolinson Begazo, Ana Aguilera, Irvin Dongo and Yudith Cardinale

Sensors 2024, 24(17), 5797; https://doi.org/10.3390/s24175797 - 6 Sep 2024

Cited by 16 | Viewed by 8353

Abstract

Emotion recognition through speech is a technique employed in various scenarios of Human–Computer Interaction (HCI). Existing approaches have achieved significant results; however, limitations persist, with the quantity and diversity of data being more notable when deep learning techniques are used. The lack of a standard in feature selection leads to continuous development and experimentation. Choosing and designing the appropriate network architecture constitutes another challenge. This study addresses the challenge of recognizing emotions in the human voice using deep learning techniques, proposing a comprehensive approach and developing preprocessing and feature selection stages, while constructing a dataset called EmoDSc by combining several available databases. The synergy between spectral features and spectrogram images is investigated. Independently, the weighted accuracy obtained using only spectral features was 89%, while using only spectrogram images, the weighted accuracy reached 90%. These results, although surpassing previous research, highlight the strengths and limitations of each feature type when operating in isolation. Based on this exploration, a neural network architecture composed of a CNN1D, a CNN2D, and an MLP that fuses spectral features and spectrogram images is proposed. The model, supported by the unified dataset EmoDSc, demonstrates a remarkable accuracy of 96%.

(This article belongs to the Special Issue Emotion Recognition Based on Sensors (3rd Edition))

Search Results (94)
