Search Results (41)

Search Parameters:
Keywords = audiovisual database

39 pages, 3511 KB  
Systematic Review
From Senses to Memory During Childhood: A Systematic Review and Bayesian Meta-Analysis Exploring Multisensory Processing and Working Memory Development
by Areej A. Alhamdan, Hayley E. Pickering, Melanie J. Murphy and Sheila G. Crewther
Eur. J. Investig. Health Psychol. Educ. 2025, 15(8), 157; https://doi.org/10.3390/ejihpe15080157 - 12 Aug 2025
Viewed by 1791
Abstract
Multisensory processing has long been recognized to enhance perception, cognition, and actions in adults. However, there is currently limited understanding of how multisensory stimuli, in comparison to unisensory stimuli, contribute to the development of both motor and verbally assessed working memory (WM) in children. Thus, the current study aimed to systematically review and meta-analyze the associations between the multisensory processing of auditory and visual stimuli, and performance on simple and more complex WM tasks, in children from birth to 15 years old. We also aimed to determine whether there are differences in WM capacity for audiovisual compared to unisensory auditory or visual stimuli alone after receptive and spoken language develop. Following PRISMA guidelines, a systematic search of PsycINFO, MEDLINE, Embase, PubMed, CINAHL and Web of Science databases identified that 21 out of 3968 articles met the inclusion criteria for Bayesian meta-analysis and the AXIS risk of bias criteria. The results showed at least extreme/decisive evidence for associations between verbal and motor reaction times on multisensory tasks and a variety of visual and auditory WM tasks, with verbal multisensory stimuli contributing more to verbally assessed WM capacity than unisensory auditory or visual stimuli alone. Furthermore, a meta-regression confirmed that age significantly moderates the observed association between multisensory processing and both visual and auditory WM tasks, indicating that verbal- and motor-assessed multisensory processing contribute differentially to WM performance, and to different age-determined extents. These findings have important implications for school-based learning methods and other educational activities where the implementation of multisensory stimuli is likely to enhance outcomes. Full article

15 pages, 4273 KB  
Article
Speech Emotion Recognition: Comparative Analysis of CNN-LSTM and Attention-Enhanced CNN-LSTM Models
by Jamsher Bhanbhro, Asif Aziz Memon, Bharat Lal, Shahnawaz Talpur and Madeha Memon
Signals 2025, 6(2), 22; https://doi.org/10.3390/signals6020022 - 9 May 2025
Cited by 5 | Viewed by 3815
Abstract
Speech Emotion Recognition (SER) technology helps computers understand human emotions in speech, which fills a critical niche in advancing human–computer interaction and mental health diagnostics. The primary objective of this study is to enhance SER accuracy and generalization through innovative deep learning models. Despite its importance in various fields like human–computer interaction and mental health diagnosis, accurately identifying emotions from speech can be challenging due to differences in speakers, accents, and background noise. The work proposes two innovative deep learning models to improve SER accuracy: a CNN-LSTM model and an Attention-Enhanced CNN-LSTM model. These models were tested on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), collected between 2015 and 2018, which comprises 1440 audio files of male and female actors expressing eight emotions. Both models achieved impressive accuracy rates of over 96% in classifying emotions into eight categories. By comparing the CNN-LSTM and Attention-Enhanced CNN-LSTM models, this study offers comparative insights into modeling techniques, contributes to the development of more effective emotion recognition systems, and offers practical implications for real-time applications in healthcare and customer service. Full article
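For readers who want to prototype a comparable baseline, the sketch below builds a small CNN-LSTM emotion classifier over MFCC inputs in Keras. It is only a generic illustration of the architecture family named in the abstract, not the authors' model; the input shape, layer sizes, and the eight-class output are assumptions.

```python
# Minimal CNN-LSTM sketch for 8-class speech emotion recognition (illustrative only).
# Assumes inputs are MFCC matrices of shape (time_steps, n_mfcc), e.g. padded RAVDESS clips.
from tensorflow.keras import layers, models

def build_cnn_lstm(time_steps=300, n_mfcc=40, n_classes=8):
    inputs = layers.Input(shape=(time_steps, n_mfcc))
    # 1-D convolutions extract local spectral-temporal patterns.
    x = layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(128, kernel_size=5, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)
    # The LSTM summarizes the sequence of convolutional feature frames.
    x = layers.LSTM(128)(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_cnn_lstm()
model.summary()
```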

20 pages, 917 KB  
Article
Developing a Dataset of Audio Features to Classify Emotions in Speech
by Alvaro A. Colunga-Rodriguez, Alicia Martínez-Rebollar, Hugo Estrada-Esquivel, Eddie Clemente and Odette A. Pliego-Martínez
Computation 2025, 13(2), 39; https://doi.org/10.3390/computation13020039 - 5 Feb 2025
Cited by 3 | Viewed by 3959
Abstract
Emotion recognition in speech has gained increasing relevance in recent years, enabling more personalized interactions between users and automated systems. This paper presents the development of a dataset of features obtained from RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) to classify emotions in speech. The paper highlights audio processing techniques such as silence removal and framing to extract features from the recordings. The features are extracted from the audio signals using spectral techniques, time-domain analysis, and the discrete wavelet transform. The resulting dataset is used to train a neural network and the support vector machine learning algorithm. Cross-validation is employed for model training. The developed models were optimized using a software package that performs hyperparameter tuning to improve results. Finally, the emotional classification outcomes were compared. The results showed an emotion classification accuracy of 0.654 for the perceptron neural network and 0.724 for the support vector machine algorithm, demonstrating satisfactory performance in emotion classification. Full article
(This article belongs to the Section Computational Engineering)
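As a rough, hedged illustration of the kind of pipeline the abstract describes (silence removal, feature extraction, and SVM classification with cross-validation), the snippet below uses librosa and scikit-learn; the feature set, trim threshold, and the `files`/`labels` variables are assumptions, not the authors' actual processing.

```python
# Illustrative feature-extraction and SVM sketch (not the authors' code).
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def extract_features(path):
    y, sr = librosa.load(path, sr=None)
    y, _ = librosa.effects.trim(y, top_db=30)            # crude silence removal
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # spectral features
    zcr = librosa.feature.zero_crossing_rate(y)          # time-domain feature
    rms = librosa.feature.rms(y=y)
    return np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1), zcr.mean(), rms.mean()])

# Usage (assuming `files`, a list of RAVDESS wav paths, and `labels`, their emotion codes):
#   X = np.vstack([extract_features(f) for f in files])
#   scores = cross_val_score(SVC(kernel="rbf", C=10), X, labels, cv=5)
#   print("mean cross-validated accuracy:", scores.mean())
```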

17 pages, 1741 KB  
Review
Effectiveness of Non-Pharmacological Interventions in Reducing Dental Anxiety Among Children with Special Needs: A Scoping Review with Conceptual Map
by Zuhair Motlak Alkahtani
Children 2025, 12(2), 165; https://doi.org/10.3390/children12020165 - 29 Jan 2025
Viewed by 3510
Abstract
Background: Children with special needs often need tailored approaches to oral healthcare to address their unique needs effectively. It is essential to analyze the effectiveness of non-pharmacological management in reducing dental anxiety among special needs children during dental treatment. Methods: Five electronic databases, PubMed, Scopus, Web of Science, Embase, and Google Scholar, were searched from 2007 to August 2024 for randomized control trials and observational studies comparing the effectiveness of non-pharmacological techniques in reducing dental anxiety during invasive and noninvasive dental treatment. The primary outcomes of the studied intervention were reduced dental anxiety and improved behavior during dental treatment. The conceptual map was created to understand the need for assessment and behavior management for special needs children (SN). Results: Nineteen articles qualified for the final analysis from 250 screened articles. Included studies evaluated the effect of strategies applied clinically, such as audio–visual distraction, sensory-adapted environment, and virtual reality. The included studies measured the trivial to large effect of measured interventions and supported non-pharmacological interventions in clinical settings. Conclusions: Most basic non-pharmacological interventions showed a trivial to large reduction in dental anxiety among SN patients. The conceptual map developed in this study supports the need for non-pharmacological interventions as they are cost-effective and create a positive environment in dental clinics. However, more studies need to focus on non-pharmacological behavior interventions in SN children to support the findings of this scoping review. Full article
(This article belongs to the Section Pediatric Dentistry & Oral Medicine)

25 pages, 2085 KB  
Article
How Much Does the Dynamic F0 Curve Affect the Expression of Emotion in Utterances?
by Tae-Jin Yoon
Appl. Sci. 2024, 14(23), 10972; https://doi.org/10.3390/app142310972 - 26 Nov 2024
Viewed by 1558
Abstract
The modulation of vocal elements, such as pitch, loudness, and duration, plays a crucial role in conveying both linguistic information and the speaker’s emotional state. While acoustic features like fundamental frequency (F0) variability have been widely studied in emotional speech analysis, accurately classifying emotion remains challenging due to the complex and dynamic nature of vocal expressions. Traditional analytical methods often oversimplify these dynamics, potentially overlooking intricate patterns indicative of specific emotions. This study examines the influences of emotion and temporal variation on dynamic F0 contours in the analytical framework, utilizing a dataset valuable for its diverse emotional expressions. However, the analysis is constrained by the limited variety of sentences employed, which may affect the generalizability of the findings to broader linguistic contexts. We utilized the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), focusing on eight distinct emotional states performed by 24 professional actors. Sonorant segments were extracted, and F0 measurements were converted into semitones relative to a 100 Hz baseline to standardize pitch variations. By employing Generalized Additive Mixed Models (GAMMs), we modeled non-linear trajectories of F0 contours over time, accounting for fixed effects (emotions) and random effects (individual speaker variability). Our analysis revealed that incorporating emotion-specific, non-linear time effects and individual speaker differences significantly improved the model’s explanatory power, ultimately explaining up to 66.5% of the variance in the F0. The inclusion of random smooths for time within speakers captured individual temporal modulation patterns, providing a more accurate representation of emotional speech dynamics. The results demonstrate that dynamic modeling of F0 contours using GAMMs enhances the accuracy of emotion classification in speech. This approach captures the nuanced pitch patterns associated with different emotions and accounts for individual variability among speakers. The findings contribute to a deeper understanding of the vocal expression of emotions and offer valuable insights for advancing speech emotion recognition systems. Full article
(This article belongs to the Special Issue Advances and Applications of Audio and Speech Signal Processing)
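The semitone conversion relative to a 100 Hz baseline mentioned in the abstract follows the standard formula st = 12·log2(F0/100); a minimal sketch, assuming F0 values in Hz have already been extracted:

```python
# Convert F0 values in Hz to semitones relative to a reference frequency (standard formula).
import numpy as np

def hz_to_semitones(f0_hz, ref_hz=100.0):
    return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)

print(hz_to_semitones([100.0, 200.0, 250.0]))   # -> [ 0.  12.  ~15.86]
```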

23 pages, 3094 KB  
Article
Risk and Complexity Assessment of Autonomous Vehicle Testing Scenarios
by Zhiyuan Wei, Hanchu Zhou and Rui Zhou
Appl. Sci. 2024, 14(21), 9866; https://doi.org/10.3390/app14219866 - 28 Oct 2024
Cited by 7 | Viewed by 3178
Abstract
Autonomous vehicles (AVs) must fulfill adequate safety requirements before formal application, and performing an effective functional evaluation to verify vehicle safety requires extensive testing in different scenarios. However, it is crucial to rationalize the application of different scenarios to support different testing needs; thus, one of the current challenges limiting the development of AVs is the critical evaluation of scenarios, i.e., the lack of quantitative criteria for scenario design. This study introduces a method using the Spherical Fuzzy-Analytical Network Process (SF-ANP) to evaluate these scenarios, addressing their inherent risks and complexities. The method involves constructing a five-layer model to decompose scenario elements and using SF-ANP to calculate weights based on element interactions. The study evaluates 700 scenarios from the China In-depth Traffic Safety Study–Traffic Accident (CIMSS-TA) database, incorporating fuzzy factors and element weights. Virtual simulation of vehicles in the scenarios was performed using Baidu Apollo, and the performance of the scenarios was assessed by collecting the vehicle test results. The correlation between the obtained alternative safety indicators and the quantitative values confirms the validity and scientific soundness of this approach. This will provide valuable guidance for categorizing AV test scenarios and selecting corresponding scenarios to challenge different levels of vehicle functionality. At the same time, it can be used as a design basis to generate a large number of effective scenarios, accelerating the construction of scenario libraries and promoting the commercialization of AVs. Full article
(This article belongs to the Section Transportation and Future Mobility)

13 pages, 2148 KB  
Systematic Review
The Role of Different Feedback Devices in the Survival of Patients in Cardiac Arrest: Systematic Review with Meta-Analysis
by Luca Gambolò, Pasquale Di Fronzo, Giuseppe Ristagno, Sofia Biserni, Martina Milazzo, Delia Marta Socaci, Leopoldo Sarli, Giovanna Artioli, Antonio Bonacaro and Giuseppe Stirparo
J. Clin. Med. 2024, 13(19), 5989; https://doi.org/10.3390/jcm13195989 - 8 Oct 2024
Cited by 6 | Viewed by 1757
Abstract
Background: Cardiac arrest is a critical condition affecting approximately 1 in every 1000 people in Europe. Feedback devices have been developed to enhance the quality of chest compressions during CPR, but their clinical impact remains uncertain. This study aims to evaluate the effect of feedback devices on key clinical outcomes in adult patients experiencing both out-of-hospital (OHCA) and in-hospital cardiac arrest (IHCA). The primary objective is to assess their impact on the return of spontaneous circulation (ROSC); secondary objectives include the evaluation of neurological status and survival to discharge. Methods: A systematic review was conducted following PRISMA guidelines, utilizing databases including PubMed, Scopus, Web of Science, and Embase. Studies published between 2000 and 2023 comparing CPR with and without the use of feedback devices were included. A fixed-effects network meta-analysis was performed for ROSC and survival, while a frequentist meta-analysis was conducted for neurological outcomes. Results: Twelve relevant studies met the inclusion criteria. The meta-analysis results showed that the use of audiovisual feedback devices significantly increases the likelihood of ROSC (OR 1.26, 95% CI 1.13–1.41, p < 0.0001) and survival (OR 1.52, 95% CI 1.27–1.81, p < 0.0001) compared to CPR without feedback. However, the effect of metronomes did not reach statistical significance. Conclusions: Feedback devices, particularly audiovisual ones, are associated with improved clinical outcomes in cardiac arrest patients. Their use should be encouraged in both training settings and real-life emergency scenarios to enhance survival rates and ROSC. However, further studies are needed to confirm long-term impacts and to explore the potential benefits of metronomes. Full article
(This article belongs to the Section Epidemiology & Public Health)
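As a hedged illustration of how odds ratios are typically combined in a fixed-effect framework (a simpler, pairwise relative of the network meta-analysis used in the review), the sketch below pools study-level ORs by inverse-variance weighting on the log scale; the input values are hypothetical placeholders, not the review's data.

```python
# Inverse-variance fixed-effect pooling of odds ratios (generic method sketch).
import numpy as np

def pooled_or(ors, ci_lows, ci_highs):
    log_or = np.log(ors)
    se = (np.log(ci_highs) - np.log(ci_lows)) / (2 * 1.96)   # SE recovered from the 95% CI width
    w = 1.0 / se**2                                          # inverse-variance weights
    pooled = np.sum(w * log_or) / np.sum(w)
    pooled_se = np.sqrt(1.0 / np.sum(w))
    ci = np.exp([pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se])
    return np.exp(pooled), ci

# Hypothetical placeholder studies (NOT values from the review), for illustration only:
print(pooled_or(np.array([1.2, 1.4]), np.array([1.0, 1.1]), np.array([1.5, 1.8])))
```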

21 pages, 1315 KB  
Review
The Use of Audiovisual Distraction Tools in the Dental Setting for Pediatric Subjects with Special Healthcare Needs: A Review and Proposal of a Multi-Session Model for Behavioral Management
by Massimo Pisano, Alessia Bramanti, Giuseppina De Benedetto, Carmen Martin Carreras-Presas and Federica Di Spirito
Children 2024, 11(9), 1077; https://doi.org/10.3390/children11091077 - 2 Sep 2024
Cited by 5 | Viewed by 3608
Abstract
Background: A Special Health Care Need (SHCN) is characterized by any type of physical, mental, sensorial, cognitive, emotional, or developmental condition that requires medical treatment, specialized services, or healthcare interventions. These conditions can negatively impact oral health, as SHCN children can hardly cooperate or communicate and experience higher levels of dental fear/anxiety, which interfere with regular appointments. The present narrative review aims to analyze the use of audiovisual (AV) tools in the dental setting for the management of SHCN children during dental treatment and to evaluate their effectiveness in anxiety/behavior control from the child, dentist, and care-giver perspectives. This analysis leads to the proposal of a new multi-session model for the behavioral management of SHCN pediatric subjects. Methods: An electronic search of the MEDLINE/PubMed, Scopus, and Web of Science databases was carried out, and through this analysis a new model was proposed, the “UNISA-Virtual Stepwise Distraction model”, a multi-session workflow combining traditional behavior management and the progressive introduction of AV media to familiarize the SHCN child with the dental setting and manage behavior. Results: AV tools helped in most cases to manage SHCN behavior and decreased stress in both the dentist and the child during dental treatments. Care-givers also welcomed AV distractors, reporting positive feedback on using them during future treatments. Conclusions: The present narrative review found increasing evidence for the use of AV media as distraction tools for SHCN pediatric subjects during dental treatment. In the majority of the studies, AV tools proved to be effective for the management of anxiety, dental fear, and behavior in the dental setting. Full article

18 pages, 2938 KB  
Article
Facial Animation Strategies for Improved Emotional Expression in Virtual Reality
by Hyewon Song and Beom Kwon
Electronics 2024, 13(13), 2601; https://doi.org/10.3390/electronics13132601 - 2 Jul 2024
Cited by 6 | Viewed by 4195
Abstract
The portrayal of emotions by virtual characters is crucial in virtual reality (VR) communication. Effective communication in VR relies on a shared understanding, which is significantly enhanced when virtual characters authentically express emotions that align with their spoken words. While human emotions are often conveyed through facial expressions, existing facial animation techniques have mainly focused on lip-syncing and head movements to improve naturalness. This study investigates the influence of various factors in facial animation on the emotional representation of virtual characters. We conduct a comparative and analytical study using an audio-visual database, examining the impact of different animation factors. To this end, we utilize a total of 24 voice samples, representing 12 different speakers, with each emotional voice segment lasting approximately 4–5 s. Using these samples, we design six perceptual experiments to investigate the impact of facial cues—including facial expression, lip movement, head motion, and overall appearance—on the expression of emotions by virtual characters. Additionally, we engaged 20 participants to evaluate and select appropriate combinations of facial expressions, lip movements, head motions, and appearances that align with the given emotion and its intensity. Our findings indicate that emotional representation in virtual characters is closely linked to facial expressions, head movements, and overall appearance. Conversely, lip-syncing, which has been a primary focus in prior studies, seems less critical for conveying emotions, as its accuracy is difficult to perceive with the naked eye. The results of our study can significantly benefit the VR community by aiding in the development of virtual characters capable of expressing a diverse range of emotions. Full article

31 pages, 9940 KB  
Article
Combining Transformer, Convolutional Neural Network, and Long Short-Term Memory Architectures: A Novel Ensemble Learning Technique That Leverages Multi-Acoustic Features for Speech Emotion Recognition in Distance Education Classrooms
by Eman Abdulrahman Alkhamali, Arwa Allinjawi and Rehab Bahaaddin Ashari
Appl. Sci. 2024, 14(12), 5050; https://doi.org/10.3390/app14125050 - 10 Jun 2024
Cited by 5 | Viewed by 2595
Abstract
Speech emotion recognition (SER) is a technology that can be applied to distance education to analyze speech patterns and evaluate speakers’ emotional states in real time. It provides valuable insights and can be used to enhance students’ learning experiences by enabling the assessment of their instructors’ emotional stability, a factor that significantly impacts the effectiveness of information delivery. Students demonstrate different engagement levels during learning activities, and assessing this engagement is important for controlling the learning process and improving e-learning systems. An important aspect that may influence student engagement is their instructors’ emotional state. Accordingly, this study used deep learning techniques to create an automated system for recognizing instructors’ emotions in their speech when delivering distance learning. This methodology entailed integrating transformer, convolutional neural network, and long short-term memory architectures into an ensemble to enhance the SER. Feature extraction from audio data used Mel-frequency cepstral coefficients; chroma; a Mel spectrogram; the zero-crossing rate; spectral contrast, centroid, bandwidth, and roll-off; and the root-mean square, with subsequent optimization processes such as adding noise, conducting time stretching, and shifting the audio data. Several transformer blocks were incorporated, and a multi-head self-attention mechanism was employed to identify the relationships between the input sequence segments. The preprocessing and data augmentation methodologies significantly enhanced the precision of the results, with accuracy rates of 96.3%, 99.86%, 96.5%, and 85.3% for the Ryerson Audio–Visual Database of Emotional Speech and Song, Berlin Database of Emotional Speech, Surrey Audio–Visual Expressed Emotion, and Interactive Emotional Dyadic Motion Capture datasets, respectively. Furthermore, it achieved 83% accuracy on another dataset created for this study, the Saudi Higher-Education Instructor Emotions dataset. The results demonstrate the considerable accuracy of this model in detecting emotions in speech data across different languages and datasets. Full article
(This article belongs to the Special Issue Computer Vision and AI for Interactive Robotics)
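A minimal sketch of the acoustic feature set and augmentation steps listed in the abstract, assuming librosa is available; the exact parameters, frame-level pooling, and the transformer/CNN/LSTM ensemble used by the authors are not reproduced here.

```python
# Sketch of the named feature set and augmentations (illustrative, not the authors' code).
import numpy as np
import librosa

def acoustic_features(y, sr):
    feats = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).mean(axis=1),
        librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1),
        librosa.feature.melspectrogram(y=y, sr=sr).mean(axis=1),
        librosa.feature.zero_crossing_rate(y).mean(axis=1),
        librosa.feature.spectral_contrast(y=y, sr=sr).mean(axis=1),
        librosa.feature.spectral_centroid(y=y, sr=sr).mean(axis=1),
        librosa.feature.spectral_bandwidth(y=y, sr=sr).mean(axis=1),
        librosa.feature.spectral_rolloff(y=y, sr=sr).mean(axis=1),
        librosa.feature.rms(y=y).mean(axis=1),
    ]
    return np.concatenate(feats)

def augment(y, sr):
    noisy = y + 0.005 * np.random.randn(len(y))             # add noise
    stretched = librosa.effects.time_stretch(y, rate=1.1)   # time stretching
    shifted = np.roll(y, sr // 10)                          # shift audio by 0.1 s
    return [noisy, stretched, shifted]
```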

22 pages, 6938 KB  
Article
Streamline Intelligent Crowd Monitoring with IoT Cloud Computing Middleware
by Alexandros Gazis and Eleftheria Katsiri
Sensors 2024, 24(11), 3643; https://doi.org/10.3390/s24113643 - 4 Jun 2024
Cited by 1 | Viewed by 3307
Abstract
This article introduces a novel middleware that utilizes cost-effective, low-power computing devices like Raspberry Pi to analyze data from wireless sensor networks (WSNs). It is designed for indoor settings like historical buildings and museums, tracking visitors and identifying points of interest. It serves as an evacuation aid by monitoring occupancy and gauging the popularity of specific areas, subjects, or art exhibitions. The middleware employs a basic form of the MapReduce algorithm to gather WSN data and distribute it across available computer nodes. Data collected by RFID sensors on visitor badges is stored on mini-computers placed in exhibition rooms and then transmitted to a remote database after a preset time frame. Utilizing MapReduce for data analysis and a leader election algorithm for fault tolerance, this middleware showcases its viability through metrics, demonstrating applications like swift prototyping and accurate validation of findings. Despite using simpler hardware, its performance matches resource-intensive methods involving audiovisual and AI techniques. This design’s innovation lies in its fault-tolerant, distributed setup using budget-friendly, low-power devices rather than resource-heavy hardware or methods. Successfully tested at a historical building in Greece (M. Hatzidakis’ residence), it is tailored for indoor spaces. This paper compares its algorithmic application layer with other implementations, highlighting its technical strengths and advantages. Particularly relevant in the wake of the COVID-19 pandemic and general monitoring middleware for indoor locations, this middleware holds promise in tracking visitor counts and overall building occupancy. Full article
(This article belongs to the Section Internet of Things)
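As a conceptual illustration of the basic MapReduce idea the middleware applies to WSN data, the toy sketch below counts badge readings per room; the record format and room names are hypothetical and unrelated to the actual system.

```python
# Toy map-reduce style aggregation of RFID badge readings per room (conceptual sketch only).
from collections import defaultdict

readings = [  # hypothetical (badge_id, room) tuples collected by the room mini-computers
    ("badge1", "room_a"), ("badge2", "room_a"), ("badge1", "room_b"),
]

def map_phase(records):
    # Emit one (room, 1) pair per reading.
    return [(room, 1) for _, room in records]

def reduce_phase(pairs):
    # Sum the counts per room key.
    counts = defaultdict(int)
    for room, n in pairs:
        counts[room] += n
    return dict(counts)

print(reduce_phase(map_phase(readings)))  # {'room_a': 2, 'room_b': 1}
```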

7 pages, 167 KB  
Editorial
Special Issue on IberSPEECH 2022: Speech and Language Technologies for Iberian Languages
by José L. Pérez-Córdoba, Francesc Alías-Pujol and Zoraida Callejas
Appl. Sci. 2024, 14(11), 4505; https://doi.org/10.3390/app14114505 - 24 May 2024
Viewed by 1147
Abstract
This Special Issue presents the latest advances in research and novel applications of speech and language technologies based on the works presented at the sixth edition of the IberSPEECH conference, held in Granada in 2022, paying special attention to those focused on Iberian languages. IberSPEECH is the international conference of the Special Interest Group on Iberian Languages (SIG-IL) of the International Speech Communication Association (ISCA) and the Spanish Thematic Network on Speech Technologies (Red Temática en Tecnologías del Habla, or RTTH for short). Several researchers were invited to extend the contributions presented at IberSPEECH2022 due to their interest and quality. As a result, the Special Issue is composed of 11 papers that cover different research topics related to speech perception, speech analysis and enhancement, speaker verification and identification, speech production and synthesis, and natural language processing, together with several applications and evaluation challenges. Full article

15 pages, 483 KB  
Article
A Feature Selection Algorithm Based on Differential Evolution for English Speech Emotion Recognition
by Liya Yue, Pei Hu, Shu-Chuan Chu and Jeng-Shyang Pan
Appl. Sci. 2023, 13(22), 12410; https://doi.org/10.3390/app132212410 - 16 Nov 2023
Cited by 4 | Viewed by 2219
Abstract
The automatic identification of emotions from speech holds significance in facilitating interactions between humans and machines. To improve the recognition accuracy of speech emotion, we extract mel-frequency cepstral coefficients (MFCCs) and pitch features from raw signals, and an improved differential evolution (DE) algorithm is utilized for feature selection based on K-nearest neighbor (KNN) and random forest (RF) classifiers. The proposed multivariate DE (MDE) adopts three mutation strategies to solve the slow convergence of the classical DE and maintain population diversity, and employs a jumping method to avoid falling into local traps. The simulations are conducted on four public English speech emotion datasets: eNTERFACE05, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Surrey Audio-Visual Expressed Emotion (SAVEE) database, and the Toronto Emotional Speech Set (TESS), which cover a diverse range of emotions. The MDE algorithm is compared with PSO-assisted biogeography-based optimization (BBO_PSO), DE, and the sine cosine algorithm (SCA) on emotion recognition error, number of selected features, and running time. MDE obtains errors of 0.5270, 0.5044, 0.4490, and 0.0420 on eNTERFACE05, RAVDESS, SAVEE, and TESS with the KNN classifier, and errors of 0.4721, 0.4264, 0.3283, and 0.0114 with the RF classifier. The proposed algorithm demonstrates excellent performance in emotion recognition accuracy, and it finds meaningful acoustic features from MFCCs and pitch. Full article
(This article belongs to the Special Issue Recent Applications of Explainable AI (XAI))
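A hedged sketch of wrapper-style feature selection with a classical DE/rand/1 loop scored by a KNN classifier is shown below; it uses a stand-in dataset and omits the paper's multivariate mutation strategies and jumping method, so it conveys only the general idea.

```python
# Simple DE-style wrapper feature selection scored with KNN (generic sketch, not the paper's MDE).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)       # stand-in data; the paper selects speech features (MFCCs, pitch)
n_feat, pop_size, gens, F, CR = X.shape[1], 20, 30, 0.5, 0.9

def fitness(vec):
    mask = vec > 0.5                    # threshold the continuous DE vector into a feature mask
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(5), X[:, mask], y, cv=3).mean()

pop = rng.random((pop_size, n_feat))
fit = np.array([fitness(p) for p in pop])
for _ in range(gens):
    for i in range(pop_size):
        idx = rng.choice([j for j in range(pop_size) if j != i], size=3, replace=False)
        a, b, c = pop[idx]
        mutant = np.clip(a + F * (b - c), 0.0, 1.0)          # DE/rand/1 mutation
        cross = rng.random(n_feat) < CR
        trial = np.where(cross, mutant, pop[i])
        f = fitness(trial)
        if f >= fit[i]:                                       # greedy selection
            pop[i], fit[i] = trial, f
best_mask = pop[np.argmax(fit)] > 0.5
print("selected feature indices:", np.where(best_mask)[0], "CV accuracy:", fit.max())
```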

14 pages, 406 KB  
Article
English Speech Emotion Classification Based on Multi-Objective Differential Evolution
by Liya Yue, Pei Hu, Shu-Chuan Chu and Jeng-Shyang Pan
Appl. Sci. 2023, 13(22), 12262; https://doi.org/10.3390/app132212262 - 13 Nov 2023
Cited by 7 | Viewed by 1669
Abstract
Speech signals carry speakers’ emotional states as well as language information, which is very important for human–computer interaction systems that recognize speakers’ emotions. Feature selection is a common method for improving recognition accuracy. In this paper, we propose a multi-objective optimization method based on differential evolution (MODE-NSF) that maximizes recognition accuracy and minimizes the number of selected features (NSF). First, Mel-frequency cepstral coefficient (MFCC) features and pitch features are extracted from the speech signals. Then, the proposed algorithm implements feature selection, where the NSF guides the initialization, crossover, and mutation of the algorithm. We used four English speech emotion datasets, and K-nearest neighbor (KNN) and random forest (RF) classifiers, to validate the performance of the proposed algorithm. The results illustrate that MODE-NSF is superior to other multi-objective algorithms in terms of the hypervolume (HV), inverted generational distance (IGD), Pareto optimal solutions, and running time. MODE-NSF achieved an accuracy of 49% using eNTERFACE05, 53% using the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), 76% using the Surrey Audio-Visual Expressed Emotion (SAVEE) database, and 98% using the Toronto Emotional Speech Set (TESS). MODE-NSF obtained good recognition results, which provides a basis for the establishment of emotional models. Full article
(This article belongs to the Special Issue Multi-Modal Deep Learning and Its Applications)

31 pages, 13108 KB  
Article
Speech Emotion Recognition Using Convolutional Neural Networks with Attention Mechanism
by Konstantinos Mountzouris, Isidoros Perikos and Ioannis Hatzilygeroudis
Electronics 2023, 12(20), 4376; https://doi.org/10.3390/electronics12204376 - 23 Oct 2023
Cited by 16 | Viewed by 7055
Abstract
Speech emotion recognition (SER) is an interesting and difficult problem to handle. In this paper, we deal with it through the implementation of deep learning networks. We have designed and implemented six different deep learning networks: a deep belief network (DBN), a simple deep neural network (SDNN), an LSTM network (LSTM), an LSTM network with the addition of an attention mechanism (LSTM-ATN), a convolutional neural network (CNN), and a convolutional neural network with the addition of an attention mechanism (CNN-ATN), with the aim, apart from solving the SER problem, of testing the impact of the attention mechanism on the results. Dropout and batch normalization techniques are also used to improve the generalization ability (prevention of overfitting) of the models as well as to speed up the training process. The Surrey Audio–Visual Expressed Emotion (SAVEE) database and the Ryerson Audio–Visual Database of Emotional Speech and Song (RAVDESS) were used for the training and evaluation of our models. The results showed that the networks with the addition of the attention mechanism did better than the others. Furthermore, they showed that the CNN-ATN was the best among the tested networks, achieving an accuracy of 74% for the SAVEE database and 77% for RAVDESS, and exceeding existing state-of-the-art systems for the same datasets. Full article
(This article belongs to the Special Issue Feature Papers in Computer Science & Engineering)
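To illustrate the general idea of adding an attention mechanism on top of convolutional features (not the paper's exact CNN-ATN), the sketch below defines a simple attention-pooling layer in Keras that weights time frames before classification; the input shape and layer sizes are assumptions.

```python
# Illustrative attention pooling over CNN feature frames (generic sketch, not the paper's model).
import tensorflow as tf
from tensorflow.keras import layers

class AttentionPooling(layers.Layer):
    """Learns one score per time frame and returns the attention-weighted sum of frames."""
    def build(self, input_shape):
        self.w = self.add_weight(shape=(int(input_shape[-1]), 1),
                                 initializer="glorot_uniform", name="att_w")
    def call(self, x):                                            # x: (batch, time, features)
        scores = tf.squeeze(tf.tensordot(x, self.w, axes=1), -1)  # (batch, time)
        alpha = tf.nn.softmax(scores, axis=-1)                    # attention weights over time
        return tf.reduce_sum(x * tf.expand_dims(alpha, -1), axis=1)

# Example: attention pooling over Conv1D frames before an 8-class softmax head.
inp = layers.Input(shape=(300, 40))                               # assumed (frames, features) input
feat = layers.Conv1D(64, 5, padding="same", activation="relu")(inp)
out = layers.Dense(8, activation="softmax")(AttentionPooling()(feat))
model = tf.keras.Model(inp, out)
model.summary()
```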