Search Results (45)

Search Parameters:
Keywords = EMO-DB

24 pages, 5649 KB  
Article
Bangla Speech Emotion Recognition Using Deep Learning-Based Ensemble Learning and Feature Fusion
by Md. Shahid Ahammed Shakil, Fahmid Al Farid, Nitun Kumar Podder, S. M. Hasan Sazzad Iqbal, Abu Saleh Musa Miah, Md Abdur Rahim and Hezerul Abdul Karim
J. Imaging 2025, 11(8), 273; https://doi.org/10.3390/jimaging11080273 - 14 Aug 2025
Viewed by 1050
Abstract
Emotion recognition in speech is essential for enhancing human–computer interaction (HCI) systems. Despite progress in Bangla speech emotion recognition, challenges remain, including low accuracy, speaker dependency, and poor generalization across emotional expressions. Previous approaches often rely on traditional machine learning or basic deep learning models, struggling with robustness and accuracy in noisy or varied data. In this study, we propose a novel multi-stream deep learning feature fusion approach for Bangla speech emotion recognition, addressing the limitations of existing methods. Our approach begins with various data augmentation techniques applied to the training dataset, enhancing the model’s robustness and generalization. We then extract a comprehensive set of handcrafted features, including Zero-Crossing Rate (ZCR), chromagram, spectral centroid, spectral roll-off, spectral contrast, spectral flatness, Mel-Frequency Cepstral Coefficients (MFCCs), Root Mean Square (RMS) energy, and Mel-spectrogram. Although these features are used as 1D numerical vectors, some of them are computed from time–frequency representations (e.g., chromagram, Mel-spectrogram) that can themselves be depicted as images, which is conceptually close to imaging-based analysis. These features capture key characteristics of the speech signal, providing valuable insights into the emotional content. Sequentially, we utilize a multi-stream deep learning architecture to automatically learn complex, hierarchical representations of the speech signal. This architecture consists of three distinct streams: the first stream uses 1D convolutional neural networks (1D CNNs), the second integrates 1D CNN with Long Short-Term Memory (LSTM), and the third combines 1D CNNs with bidirectional LSTM (Bi-LSTM). These models capture intricate emotional nuances that handcrafted features alone may not fully represent. For each of these models, we generate predicted scores and then employ ensemble learning with a soft voting technique to produce the final prediction. This fusion of handcrafted features, deep learning-derived features, and ensemble voting enhances the accuracy and robustness of emotion identification across multiple datasets. Our method demonstrates the effectiveness of combining various learning models to improve emotion recognition in Bangla speech, providing a more comprehensive solution compared with existing methods. We utilize three primary datasets—SUBESCO, BanglaSER, and a merged version of both—as well as two external datasets, RAVDESS and EMODB, to assess the performance of our models. Our method achieves impressive results with accuracies of 92.90%, 85.20%, 90.63%, 67.71%, and 69.25% for the SUBESCO, BanglaSER, merged SUBESCO and BanglaSER, RAVDESS, and EMODB datasets, respectively. These results demonstrate the effectiveness of combining handcrafted features with deep learning-based features through ensemble learning for robust emotion recognition in Bangla speech. Full article
(This article belongs to the Section Computer Vision and Pattern Recognition)
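A minimal sketch of the handcrafted feature set and the soft-voting step named above, using librosa; the sampling rate, MFCC count, frame-wise mean pooling, and equal stream weights are illustrative assumptions rather than the authors' exact configuration.

```python
import numpy as np
import librosa

def handcrafted_features(path: str, sr: int = 16000) -> np.ndarray:
    y, sr = librosa.load(path, sr=sr)
    feats = [
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        librosa.feature.spectral_flatness(y=y),
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40),
        librosa.feature.rms(y=y),
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128),
    ]
    # Collapse each time-varying descriptor to its frame-wise mean -> one 1D vector.
    return np.concatenate([f.mean(axis=1) for f in feats])

def soft_vote(prob_cnn, prob_cnn_lstm, prob_cnn_bilstm):
    # Ensemble the three streams by averaging their class-probability outputs.
    return np.argmax((prob_cnn + prob_cnn_lstm + prob_cnn_bilstm) / 3.0, axis=-1)
```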

28 pages, 530 KB  
Article
Advancing Spanish Speech Emotion Recognition: A Comprehensive Benchmark of Pre-Trained Models
by Alex Mares, Gerardo Diaz-Arango, Jorge Perez-Jacome-Friscione, Hector Vazquez-Leal, Luis Hernandez-Martinez, Jesus Huerta-Chua, Andres Felipe Jaramillo-Alvarado and Alfonso Dominguez-Chavez
Appl. Sci. 2025, 15(8), 4340; https://doi.org/10.3390/app15084340 - 14 Apr 2025
Cited by 1 | Viewed by 2941
Abstract
Feature extraction for speech emotion recognition (SER) has evolved from handcrafted techniques through deep learning methods to embeddings derived from pre-trained models (PTMs). This study presents the first comparative analysis focused on using PTMs for Spanish SER, evaluating six models—Whisper, Wav2Vec 2.0, WavLM, HuBERT, TRILLsson, and CLAP—across six emotional speech databases: EmoMatchSpanishDB, MESD, MEACorpus, EmoWisconsin, INTER1SP, and EmoFilm. We propose a robust framework combining layer-wise feature extraction with Leave-One-Speaker-Out validation to ensure interpretable model comparisons. Our method significantly outperforms existing state-of-the-art benchmarks, notably achieving F1-scores of 88.32% on EmoMatchSpanishDB, 99.83% on INTER1SP, and 92.53% on MEACorpus. Layer-wise analyses reveal optimal emotional representation extraction at early layers in 24-layer models and middle layers in larger architectures. Additionally, TRILLsson exhibits remarkable generalization in speaker-independent evaluations, highlighting the necessity of strategic model selection, fine-tuning, and language-specific adaptations to maximize SER performance for Spanish. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
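The speaker-independent protocol mentioned above can be expressed as a short Leave-One-Speaker-Out loop; the classifier, feature matrix, and macro-F1 aggregation below are placeholder assumptions, not the paper's benchmark setup.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def loso_f1(X: np.ndarray, y: np.ndarray, speakers: np.ndarray) -> float:
    scores = []
    # Each fold holds out every utterance from one speaker.
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], clf.predict(X[test_idx]), average="macro"))
    return float(np.mean(scores))  # mean macro-F1 across held-out speakers
```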

18 pages, 1911 KB  
Article
Enhancing Embedded Space with Low-Level Features for Speech Emotion Recognition
by Lukasz Smietanka and Tomasz Maka
Appl. Sci. 2025, 15(5), 2598; https://doi.org/10.3390/app15052598 - 27 Feb 2025
Cited by 2 | Viewed by 785
Abstract
This work proposes an approach that builds a feature space by combining the representation obtained through unsupervised learning with manually selected features defining the prosody of the utterances. In the experiments, we used two time-frequency representations (Mel and CQT spectrograms) and the EmoDB and RAVDESS databases. As the results show, the proposed system improved the classification accuracy of both representations: by 1.29% for CQT and 3.75% for the Mel spectrogram compared to a typical CNN architecture on the EmoDB dataset, and by 3.02% for CQT and 0.63% for the Mel spectrogram in the case of RAVDESS. Additionally, the results show a significant increase of around 14% in classification performance for the happiness and disgust emotions using Mel spectrograms, and around 20% for happiness and disgust using CQT, in the case of the best models trained on EmoDB. On the other hand, for the models that achieved the highest result on the RAVDESS database, the most significant improvement was observed in the classification of the neutral state, around 16%, using the Mel spectrogram; for the CQT representation, the most significant improvement occurred for fear and surprise, around 9%. The average results for all prepared models also showed a positive impact of the method on the classification quality of most emotional states. For the EmoDB database, the highest average improvement was observed for happiness (14.6%), while for the other emotions it ranged from 1.2% to 8.7%. The only exception was sadness, for which the classification quality decreased on average by 1% when using the Mel spectrogram. In turn, for the RAVDESS database, the most significant improvement again occurred for happiness (7.5%), while for the other emotions it ranged from 0.2% to 7.1%, except for disgust and calm, whose classification deteriorated for the Mel spectrogram and the CQT representation, respectively. Full article
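The two representations and the idea of appending prosodic descriptors to a learned embedding can be sketched as follows; the specific prosodic statistics (pitch and energy moments) are assumptions, since the paper's exact low-level feature set is not listed here.

```python
import numpy as np
import librosa

def mel_and_cqt(y: np.ndarray, sr: int):
    # The two time-frequency inputs used in the experiments.
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
    cqt = librosa.amplitude_to_db(np.abs(librosa.cqt(y=y, sr=sr)))
    return mel, cqt

def extend_embedding(embedding: np.ndarray, y: np.ndarray, sr: int) -> np.ndarray:
    # Append simple prosodic statistics (assumed: pitch and energy moments)
    # to the embedding obtained from unsupervised learning.
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)   # coarse pitch track
    rms = librosa.feature.rms(y=y)[0]
    prosody = np.array([np.nanmean(f0), np.nanstd(f0), rms.mean(), rms.std()])
    return np.concatenate([embedding, prosody])      # enlarged feature space
```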

14 pages, 1413 KB  
Article
Enhanced Speech Emotion Recognition Using Conditional-DCGAN-Based Data Augmentation
by Kyung-Min Roh and Seok-Pil Lee
Appl. Sci. 2024, 14(21), 9890; https://doi.org/10.3390/app14219890 - 29 Oct 2024
Cited by 3 | Viewed by 1891
Abstract
With the advancement of Artificial Intelligence (AI) and the Internet of Things (IoT), research in the field of emotion detection and recognition has been actively conducted worldwide in modern society. Among this research, speech emotion recognition has gained increasing importance in various areas of application such as personalized services, enhanced security, and the medical field. However, subjective emotional expressions in voice data can be perceived differently by individuals, and issues such as data imbalance and limited datasets fail to provide the diverse situations necessary for model training, thus limiting performance. To overcome these challenges, this paper proposes a novel data augmentation technique using Conditional-DCGAN, which combines CGAN and DCGAN. This study analyzes the temporal signal changes using Mel-spectrograms extracted from the Emo-DB dataset and applies a loss function calculation method borrowed from reinforcement learning to generate data that accurately reflects emotional characteristics. To validate the proposed method, experiments were conducted using a model combining CNN and Bi-LSTM. The results, including augmented data, achieved significant performance improvements, reaching WA 91.46% and UAR 91.61%, compared to using only the original data (WA 79.31%, UAR 78.16%). These results outperform similar previous studies, such as those reporting WA 84.49% and UAR 83.33%, demonstrating the positive effects of the proposed data augmentation technique. This study presents a new data augmentation method that enables effective learning even in situations with limited data, offering a progressive direction for research in speech emotion recognition. Full article
(This article belongs to the Special Issue Human Activity Recognition (HAR) in Healthcare, 2nd Edition)
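A toy conditional generator in the DCGAN style gives the flavor of label-conditioned Mel-spectrogram synthesis; the layer sizes, 64x64 output patch, and seven-class conditioning are assumptions, not the paper's Conditional-DCGAN configuration.

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    def __init__(self, z_dim=100, n_emotions=7, emb_dim=32):
        super().__init__()
        self.label_emb = nn.Embedding(n_emotions, emb_dim)  # emotion label conditioning
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + emb_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(True),
            nn.ConvTranspose2d(32, 1, 4, 2, 1), nn.Tanh(),   # 1 x 64 x 64 spectrogram patch
        )

    def forward(self, z, labels):
        # Concatenate noise with the label embedding, then decode to a spectrogram.
        cond = torch.cat([z, self.label_emb(labels)], dim=1)
        return self.net(cond.unsqueeze(-1).unsqueeze(-1))
```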

17 pages, 6610 KB  
Article
Fusion of PCA and ICA in Statistical Subset Analysis for Speech Emotion Recognition
by Rafael Kingeski, Elisa Henning and Aleksander S. Paterno
Sensors 2024, 24(17), 5704; https://doi.org/10.3390/s24175704 - 2 Sep 2024
Cited by 3 | Viewed by 2150
Abstract
Speech emotion recognition is key to many fields, including human–computer interaction, healthcare, and intelligent assistance. While acoustic features extracted from human speech are essential for this task, not all of them contribute to emotion recognition effectively. Thus, reduced numbers of features are required within successful emotion recognition models. This work aimed to investigate whether splitting the features into two subsets based on their distribution and then applying commonly used feature reduction methods would impact accuracy. Filter reduction was employed using the Kruskal–Wallis test, followed by principal component analysis (PCA) and independent component analysis (ICA). A set of features was investigated to determine whether the indiscriminate use of parametric feature reduction techniques affects the accuracy of emotion recognition. For this investigation, data from three databases—Berlin EmoDB, SAVEE, and RAVDESS—were organized into subsets according to their distribution before applying PCA and ICA. The results showed a reduction from 6373 features to 170 for the Berlin EmoDB database with an accuracy of 84.3%; a final size of 130 features for SAVEE, with a corresponding accuracy of 75.4%; and 150 features for RAVDESS, with an accuracy of 59.9%. Full article
(This article belongs to the Special Issue Emotion Recognition Based on Sensors (3rd Edition))
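The reduction pipeline described above roughly corresponds to the sketch below: a Kruskal–Wallis filter followed by PCA and ICA on distribution-based subsets; the normality test used for the split, the significance level, and the component counts are assumptions.

```python
import numpy as np
from scipy.stats import kruskal, shapiro
from sklearn.decomposition import PCA, FastICA

def reduce_features(X, y, alpha=0.05, n_pca=100, n_ica=50):
    # 1) Kruskal-Wallis filter: keep features that differ across emotion classes.
    keep = [j for j in range(X.shape[1])
            if kruskal(*[X[y == c, j] for c in np.unique(y)]).pvalue < alpha]
    X = X[:, keep]
    # 2) Split the survivors by distribution (normality test is an assumed criterion):
    #    near-normal features go to PCA, the rest to ICA.
    normal = np.array([shapiro(X[:, j]).pvalue > alpha for j in range(X.shape[1])])
    X_pca = PCA(n_components=min(n_pca, int(normal.sum()))).fit_transform(X[:, normal])
    X_ica = FastICA(n_components=min(n_ica, int((~normal).sum()))).fit_transform(X[:, ~normal])
    return np.hstack([X_pca, X_ica])
```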

33 pages, 2134 KB  
Article
A Methodical Framework Utilizing Transforms and Biomimetic Intelligence-Based Optimization with Machine Learning for Speech Emotion Recognition
by Sunil Kumar Prabhakar and Dong-Ok Won
Biomimetics 2024, 9(9), 513; https://doi.org/10.3390/biomimetics9090513 - 26 Aug 2024
Cited by 3 | Viewed by 1303
Abstract
Speech emotion recognition (SER) tasks are conducted to extract emotional features from speech signals. The characteristic parameters are analyzed, and the speech emotional states are judged. At present, SER is an important aspect of artificial psychology and artificial intelligence, as it is widely implemented in many applications in the human–computer interface, medical, and entertainment fields. In this work, six transforms, namely, the synchrosqueezing transform, fractional Stockwell transform (FST), K-sine transform-dependent integrated system (KSTDIS), flexible analytic wavelet transform (FAWT), chirplet transform, and superlet transform, are initially applied to speech emotion signals. Once the transforms are applied and the features are extracted, the essential features are selected using three techniques: the Overlapping Information Feature Selection (OIFS) technique followed by two biomimetic intelligence-based optimization techniques, namely, Harris Hawks Optimization (HHO) and the Chameleon Swarm Algorithm (CSA). The selected features are then classified with the help of ten basic machine learning classifiers, with special emphasis given to the extreme learning machine (ELM) and twin extreme learning machine (TELM) classifiers. An experiment is conducted on four publicly available datasets, namely, EMOVO, RAVDESS, SAVEE, and Berlin Emo-DB. The best results are obtained as follows: the Chirplet + CSA + TELM combination obtains a classification accuracy of 80.63% on the EMOVO dataset, the FAWT + HHO + TELM combination obtains a classification accuracy of 85.76% on the RAVDESS dataset, the Chirplet + OIFS + TELM combination obtains a classification accuracy of 83.94% on the SAVEE dataset, and, finally, the KSTDIS + CSA + TELM combination obtains a classification accuracy of 89.77% on the Berlin Emo-DB dataset. Full article
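Of the classifiers emphasized above, the extreme learning machine has a particularly compact form: a random hidden layer with a closed-form readout. The sketch below shows that basic ELM only; the transforms, feature selection, and TELM variant are not reproduced, and the hidden size is an assumption.

```python
import numpy as np

class ELM:
    def __init__(self, n_hidden=512, seed=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(seed)

    def fit(self, X, y):
        # Random, untrained hidden layer; y is assumed to hold integer labels 0..K-1.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)
        Y = np.eye(y.max() + 1)[y]          # one-hot targets
        self.beta = np.linalg.pinv(H) @ Y   # output weights solved in closed form
        return self

    def predict(self, X):
        return np.argmax(np.tanh(X @ self.W + self.b) @ self.beta, axis=1)
```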

17 pages, 3781 KB  
Article
MelTrans: Mel-Spectrogram Relationship-Learning for Speech Emotion Recognition via Transformers
by Hui Li, Jiawen Li, Hai Liu, Tingting Liu, Qiang Chen and Xinge You
Sensors 2024, 24(17), 5506; https://doi.org/10.3390/s24175506 - 25 Aug 2024
Cited by 9 | Viewed by 4508
Abstract
Speech emotion recognition (SER) is not only a ubiquitous aspect of everyday communication, but also a central focus in the field of human–computer interaction. However, SER faces several challenges, including difficulties in detecting subtle emotional nuances and the complicated task of recognizing speech emotions in noisy environments. To effectively address these challenges, we introduce a Transformer-based model called MelTrans, which is designed to distill critical clues from speech data by learning core features and long-range dependencies. At the heart of our approach is a dual-stream framework. Using the Transformer architecture as its foundation, MelTrans deciphers broad dependencies within speech mel-spectrograms, facilitating a nuanced understanding of emotional cues embedded in speech signals. Comprehensive experimental evaluations on the EmoDB (92.52%) and IEMOCAP (76.54%) datasets demonstrate the effectiveness of MelTrans. These results highlight MelTrans’s ability to capture critical cues and long-range dependencies in speech data, setting a new benchmark within the context of these specific datasets. These results highlight the effectiveness of the proposed model in addressing the complex challenges posed by SER tasks. Full article
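Treating mel-spectrogram frames as a token sequence for a Transformer encoder, as described above, can be sketched as follows; the model width, depth, pooling, and class count are assumptions rather than MelTrans's actual architecture.

```python
import torch
import torch.nn as nn

class MelEncoder(nn.Module):
    def __init__(self, n_mels=128, d_model=256, n_heads=8, n_layers=4, n_classes=7):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=512,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, mel):               # mel: (batch, time, n_mels)
        h = self.encoder(self.proj(mel))  # self-attention captures long-range dependencies
        return self.head(h.mean(dim=1))   # average-pool over time, then classify
```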

23 pages, 11201 KB  
Article
Feature-Enhanced Multi-Task Learning for Speech Emotion Recognition Using Decision Trees and LSTM
by Chun Wang and Xizhong Shen
Electronics 2024, 13(14), 2689; https://doi.org/10.3390/electronics13142689 - 10 Jul 2024
Cited by 7 | Viewed by 2144
Abstract
Speech emotion recognition (SER) plays an important role in human–computer interaction (HCI) technology and has a wide range of application scenarios in medicine, psychotherapy, and other fields. In recent years, with the development of deep learning, many researchers have combined feature extraction technology with deep learning technology to extract more discriminative emotional information. However, a single speech emotion classification task makes it difficult to effectively utilize feature information, resulting in feature redundancy. Therefore, this paper uses speech feature enhancement (SFE) as an auxiliary task to provide additional information for the SER task. This paper combines Long Short-Term Memory networks (LSTM) with soft decision trees and proposes a multi-task learning framework based on a decision tree structure. Specifically, it trains the LSTM network by computing the distances of features at different leaf nodes in the soft decision tree, thereby achieving enhanced speech feature representation. The results show that the algorithm achieves 85.6% accuracy on the EMO-DB dataset and 81.3% accuracy on the CASIA dataset. This represents an improvement of 11.8% over the baseline on the EMO-DB dataset and 14.9% on the CASIA dataset, proving the effectiveness of the method. Additionally, we conducted cross-database experiments, real-time performance analysis, and noise environment analysis to validate the robustness and practicality of our method. These additional analyses further demonstrate that our approach performs reliably across different databases, maintains real-time processing capabilities, and is robust to noisy environments. Full article
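The multi-task idea, a shared recurrent trunk with a classification head and an auxiliary feature-enhancement head, is sketched below; the soft-decision-tree distance loss itself is not reproduced, and all layer sizes and the loss weighting are assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskSER(nn.Module):
    def __init__(self, n_feats=40, hidden=128, n_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)
        self.cls_head = nn.Linear(hidden, n_classes)   # main SER task
        self.enh_head = nn.Linear(hidden, n_feats)     # auxiliary feature-enhancement task

    def forward(self, x):                  # x: (batch, time, n_feats)
        h, _ = self.lstm(x)
        return self.cls_head(h[:, -1]), self.enh_head(h)

# Joint training objective (assumed weighting):
# loss = cross_entropy(logits, labels) + 0.5 * mse(enhanced_feats, target_feats)
```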

18 pages, 445 KB  
Article
Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion
by Shaode Yu, Jiajian Meng, Wenqing Fan, Ye Chen, Bing Zhu, Hang Yu, Yaoqin Xie and Qiurui Sun
Electronics 2024, 13(11), 2191; https://doi.org/10.3390/electronics13112191 - 4 Jun 2024
Cited by 7 | Viewed by 3463
Abstract
Speech emotion recognition (SER) aims to recognize human emotions through in-depth analysis of audio signals. However, it remains challenging to encode emotional cues and to fuse the encoded cues effectively. In this study, dual-stream representation is developed, and both full training and fine-tuning of different deep networks are employed for encoding emotion patterns. Specifically, a cross-attention fusion (CAF) module is designed to integrate the dual-stream output for emotion recognition. Using different dual-stream encoders (fully training a text processing network and fine-tuning a pre-trained large language network), the CAF module is compared to three other fusion modules on three databases. The SER performance is quantified with weighted accuracy (WA), unweighted accuracy (UA), and F1-score (F1S). The experimental results suggest that the CAF outperforms the other three modules and leads to promising performance on the databases (EmoDB: WA, 97.20%; UA, 97.21%; F1S, 0.8804; IEMOCAP: WA, 69.65%; UA, 70.88%; F1S, 0.7084; RAVDESS: WA, 81.86%; UA, 82.75%; F1S, 0.8284). It is also found that fine-tuning a pre-trained large language network achieves superior representation compared to fully training a text processing network. In a future study, improved SER performance could be achieved through the development of a multi-stream representation of emotional cues and the incorporation of a multi-branch fusion mechanism for emotion recognition. Full article
(This article belongs to the Special Issue Applied AI in Emotion Recognition)
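A cross-attention fusion block in this spirit can be written with standard multi-head attention, each stream attending to the other before a joint classifier; the dimensions, pooling, and head count below are assumptions, not the paper's CAF module.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_classes=7):
        super().__init__()
        self.attn_ab = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(2 * d_model, n_classes)

    def forward(self, a, b):             # a, b: (batch, seq, d_model) from the two streams
        a2b, _ = self.attn_ab(query=a, key=b, value=b)   # stream A attends to stream B
        b2a, _ = self.attn_ba(query=b, key=a, value=a)   # stream B attends to stream A
        fused = torch.cat([a2b.mean(dim=1), b2a.mean(dim=1)], dim=-1)
        return self.head(fused)
```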

16 pages, 1293 KB  
Article
A Generation of Enhanced Data by Variational Autoencoders and Diffusion Modeling
by Young-Jun Kim and Seok-Pil Lee
Electronics 2024, 13(7), 1314; https://doi.org/10.3390/electronics13071314 - 31 Mar 2024
Viewed by 1807
Abstract
In the domain of emotion recognition in audio signals, the clarity and precision of emotion delivery are of paramount importance. This study aims to augment and enhance the emotional clarity of waveforms (wav) using a technique called stable diffusion. Datasets from EmoDB and RAVDESS, two well-known repositories of emotional audio clips, were utilized as the main sources for all experiments. We used a ResNet-based emotion recognition model to evaluate emotion recognition on the augmented waveforms after emotion embedding and enhancement, and compared the data before and after enhancement. The results showed that applying a mel-spectrogram-based diffusion model to the existing waveforms enlarges the salience of the embedded emotions, resulting in better identification. This augmentation has significant potential to advance the field of emotion recognition and synthesis, paving the way for improved applications in these areas. Full article
(This article belongs to the Special Issue Data Privacy and Cybersecurity in Mobile Crowdsensing)
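Because the enhancement operates on mel-spectrograms, waveforms have to be mapped to the mel domain and back; a minimal round trip with librosa is sketched below, with the diffusion model itself replaced by a placeholder callable (an assumption, not the paper's generator).

```python
import numpy as np
import librosa

def mel_roundtrip(y: np.ndarray, sr: int, enhance=lambda m: m) -> np.ndarray:
    # Waveform -> mel-spectrogram, optional enhancement, then Griffin-Lim inversion.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel = enhance(mel)                                          # stand-in for the diffusion model
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr)     # reconstruct a waveform
```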

19 pages, 868 KB  
Article
Combining wav2vec 2.0 Fine-Tuning and ConLearnNet for Speech Emotion Recognition
by Chenjing Sun, Yi Zhou, Xin Huang, Jichen Yang and Xianhua Hou
Electronics 2024, 13(6), 1103; https://doi.org/10.3390/electronics13061103 - 17 Mar 2024
Cited by 3 | Viewed by 4904
Abstract
Speech emotion recognition poses challenges due to the varied expression of emotions through intonation and speech rate. In order to reduce the loss of emotional information during the recognition process and to enhance the extraction and classification of speech emotions, thus improving speech emotion recognition performance, we propose a novel twofold approach. Firstly, a feed-forward network with skip connections (SCFFN) is introduced to fine-tune wav2vec 2.0 and extract emotion embeddings. Subsequently, ConLearnNet is employed for emotion classification. ConLearnNet comprises three steps: feature learning, contrastive learning, and classification. Feature learning transforms the input, while contrastive learning encourages similar representations for samples from the same category and discriminative representations for different categories. Experimental results on the IEMOCAP and EMO-DB datasets demonstrate the superiority of our proposed method compared to state-of-the-art systems. We achieve a WA and UAR of 72.86% and 72.85% on IEMOCAP, and 97.20% and 96.41% on EMO-DB, respectively. Full article
(This article belongs to the Special Issue New Advances in Affective Computing)
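Extracting utterance-level wav2vec 2.0 embeddings, the front end fine-tuned above, can be sketched with Hugging Face Transformers; the checkpoint name and mean pooling are assumptions, and the SCFFN fine-tuning and ConLearnNet stages are not shown.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def utterance_embedding(waveform, sr=16000):
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, frames, 768) frame-level features
    return hidden.mean(dim=1).squeeze(0)             # mean-pool to one utterance vector
```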

14 pages, 1015 KB  
Communication
A New Network Structure for Speech Emotion Recognition Research
by Chunsheng Xu, Yunqing Liu, Wenjun Song, Zonglin Liang and Xing Chen
Sensors 2024, 24(5), 1429; https://doi.org/10.3390/s24051429 - 22 Feb 2024
Cited by 11 | Viewed by 3342
Abstract
Deep learning has driven breakthroughs in emotion recognition in many fields, especially speech emotion recognition (SER). As an important part of SER, the extraction of the most relevant acoustic features has always attracted researchers' attention. To address the problem that the emotional information in speech signals is dispersed and that existing models cannot comprehensively integrate local and global information, this paper presents a network model based on a gated recurrent unit (GRU) and multi-head attention. We evaluate our proposed emotion model on the IEMOCAP and Emo-DB corpora. The experimental results show that the network model based on Bi-GRU and multi-head attention significantly outperforms traditional network models on multiple evaluation indicators. We also apply the model to a speech sentiment analysis task; on the CH-SIMS and MOSI datasets, it shows excellent generalization performance. Full article
(This article belongs to the Special Issue Advanced-Sensors-Based Emotion Sensing and Recognition)
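A minimal Bi-GRU plus multi-head self-attention classifier along the lines described above; hidden sizes, head count, and pooling are assumptions.

```python
import torch
import torch.nn as nn

class BiGRUAttention(nn.Module):
    def __init__(self, n_feats=40, hidden=128, n_heads=4, n_classes=7):
        super().__init__()
        self.gru = nn.GRU(n_feats, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, n_heads, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                 # x: (batch, time, n_feats)
        h, _ = self.gru(x)                # local context from the Bi-GRU
        a, _ = self.attn(h, h, h)         # global context from multi-head self-attention
        return self.head(a.mean(dim=1))
```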

15 pages, 2722 KB  
Article
Optimizing Speech Emotion Recognition with Deep Learning and Grey Wolf Optimization: A Multi-Dataset Approach
by Suryakant Tyagi and Sándor Szénási
Algorithms 2024, 17(3), 90; https://doi.org/10.3390/a17030090 - 20 Feb 2024
Cited by 6 | Viewed by 3242
Abstract
Machine learning and speech emotion recognition are rapidly evolving fields, significantly impacting human-centered computing. Machine learning enables computers to learn from data and make predictions, while speech emotion recognition allows computers to identify and understand human emotions from speech. These technologies contribute to the creation of innovative human–computer interaction (HCI) applications. Deep learning algorithms, capable of learning high-level features directly from raw data, have given rise to new emotion recognition approaches employing models trained on advanced speech representations like spectrograms and time–frequency representations. This study introduces CNN and LSTM models with GWO optimization, aiming to determine optimal parameters for achieving enhanced accuracy within a specified parameter set. The proposed CNN and LSTM models with GWO optimization underwent performance testing on four diverse datasets—RAVDESS, SAVEE, TESS, and EMODB. The results indicated superior performance of the models compared to linear and kernelized SVM, with or without GWO optimizers. Full article
(This article belongs to the Special Issue Bio-Inspired Algorithms)
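Grey Wolf Optimization reduces to a small population loop in which candidate hyperparameter vectors move toward the three best wolves; the objective function, bounds, and population settings below are placeholders, not the paper's search space.

```python
import numpy as np

def gwo(objective, bounds, n_wolves=10, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    X = rng.uniform(lo, hi, size=(n_wolves, len(lo)))        # initial wolf positions
    for t in range(n_iter):
        fitness = np.array([objective(x) for x in X])
        alpha, beta, delta = X[np.argsort(fitness)[:3]]      # three best wolves lead the pack
        a = 2 - 2 * t / n_iter                               # shifts exploration -> exploitation
        for i in range(n_wolves):
            candidates = []
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(len(lo)), rng.random(len(lo))
                A, C = 2 * a * r1 - a, 2 * r2
                candidates.append(leader - A * np.abs(C * leader - X[i]))
            X[i] = np.clip(np.mean(candidates, axis=0), lo, hi)
    return X[np.argmin([objective(x) for x in X])]

# e.g. best = gwo(lambda p: val_error(lr=p[0], dropout=p[1]), [(1e-4, 1e-2), (0.1, 0.5)])
```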

17 pages, 402 KB  
Article
Cross-Corpus Multilingual Speech Emotion Recognition: Amharic vs. Other Languages
by Ephrem Afele Retta, Richard Sutcliffe, Jabar Mahmood, Michael Abebe Berwo, Eiad Almekhlafi, Sajjad Ahmad Khan, Shehzad Ashraf Chaudhry, Mustafa Mhamed and Jun Feng
Appl. Sci. 2023, 13(23), 12587; https://doi.org/10.3390/app132312587 - 22 Nov 2023
Cited by 10 | Viewed by 2564
Abstract
In a conventional speech emotion recognition (SER) task, a classifier for a given language is trained on a pre-existing dataset for that same language. However, where training data for a language do not exist, data from other languages can be used instead. We experiment with cross-lingual and multilingual SER, working with Amharic, English, German, and Urdu. For Amharic, we use our own publicly available Amharic Speech Emotion Dataset (ASED). For English, German and Urdu, we use the existing RAVDESS, EMO-DB, and URDU datasets. We followed previous research in mapping labels for all of the datasets to just two classes: positive and negative. Thus, we can compare performance on different languages directly and combine languages for training and testing. In Experiment 1, monolingual SER trials were carried out using three classifiers, AlexNet, VGGE (a proposed variant of VGG), and ResNet50. The results, averaged for the three models, were very similar for ASED and RAVDESS, suggesting that Amharic and English SER are equally difficult. Similarly, German SER is more difficult, and Urdu SER is easier. In Experiment 2, we trained on one language and tested on another, in both directions for each of the following pairs: Amharic↔German, Amharic↔English, and Amharic↔Urdu. The results with Amharic as the target suggested that using English or German as the source gives the best result. In Experiment 3, we trained on several non-Amharic languages and then tested on Amharic. The best accuracy obtained was several percentage points greater than the best accuracy in Experiment 2, suggesting that a better result can be obtained when using two or three non-Amharic languages for training than when using just one non-Amharic language. Overall, the results suggest that cross-lingual and multilingual training can be an effective strategy for training an SER classifier when resources for a language are scarce. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
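The two-class mapping and the train-on-one-language, test-on-another protocol can be sketched as follows; the exact assignment of labels to the positive class is an assumption based on the positive/negative grouping described above, and the classifier is a placeholder.

```python
import numpy as np

# Assumed grouping: anything not in this set is treated as negative.
POSITIVE = {"happy", "happiness", "calm", "neutral", "surprise"}

def to_binary(label: str) -> int:
    return 1 if label.lower() in POSITIVE else 0   # 1 = positive, 0 = negative

def cross_lingual_eval(clf, X_src, y_src, X_tgt, y_tgt):
    # Train on the source-language corpus (e.g., EMO-DB), test on the target (e.g., ASED).
    clf.fit(X_src, [to_binary(lab) for lab in y_src])
    preds = clf.predict(X_tgt)
    return float(np.mean(preds == np.array([to_binary(lab) for lab in y_tgt])))
```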

14 pages, 2260 KB  
Article
A BiLSTM–Transformer and 2D CNN Architecture for Emotion Recognition from Speech
by Sera Kim and Seok-Pil Lee
Electronics 2023, 12(19), 4034; https://doi.org/10.3390/electronics12194034 - 25 Sep 2023
Cited by 20 | Viewed by 6217
Abstract
The significance of emotion recognition technology is continuing to grow, and research in this field enables artificial intelligence to accurately understand and react to human emotions. This study aims to enhance the efficacy of emotion recognition from speech by using dimensionality reduction algorithms for visualization, effectively outlining emotion-specific audio features. As a model for emotion recognition, we propose a new model architecture that combines the bidirectional long short-term memory (BiLSTM)–Transformer and a 2D convolutional neural network (CNN). The BiLSTM–Transformer processes audio features to capture the sequence of speech patterns, while the 2D CNN handles Mel-Spectrograms to capture the spatial details of audio. To validate the proficiency of the model, the 10-fold cross-validation method is used. The methodology proposed in this study was applied to Emo-DB and RAVDESS, two major emotion recognition from speech databases, and achieved high unweighted accuracy rates of 95.65% and 80.19%, respectively. These results indicate that the use of the proposed transformer-based deep learning model with appropriate feature selection can enhance performance in emotion recognition from speech. Full article
(This article belongs to the Special Issue Theories and Technologies of Network, Data and Information Security)
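The 10-fold evaluation protocol reported above corresponds to a short cross-validation loop; the model constructor is a placeholder, the inputs are assumed to be NumPy arrays, and balanced accuracy is used here as a stand-in for unweighted accuracy.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import balanced_accuracy_score

def ten_fold_unweighted_accuracy(build_model, X, y, seed=0):
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model = build_model().fit(X[train_idx], y[train_idx])   # fresh model per fold
        scores.append(balanced_accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(scores))   # mean per-class (unweighted) accuracy over 10 folds
```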
