MDPI - Publisher of Open Access Journals

23 pages, 4407 KB

Open Access Article

Empirical Comparison between Deep and Classical Classifiers for Speaker Verification in Emotional Talking Environments

by Ali Bou Nassif, Ismail Shahin, Mohammed Lataifeh, Ashraf Elnagar and Nawel Nemmour

Information 2022, 13(10), 456; https://doi.org/10.3390/info13100456 - 27 Sep 2022

Cited by 4 | Viewed by 2996

Abstract

Speech signals carry various pieces of information about the speaker, such as age, gender, accent, language, health, and emotions. Emotions are conveyed through modulations of facial and vocal expressions. This paper empirically compares the performance of classical classifiers, namely Gaussian Mixture Model (GMM), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Artificial Neural Network (ANN), with that of deep learning classifiers, namely Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and Gated Recurrent Unit (GRU), in addition to the i-vector approach, on a text-independent speaker verification task in neutral and emotional talking environments. The deep models undergo hyperparameter tuning using the Grid Search optimization algorithm. The models are trained and tested on a private Arabic Emirati Speech Database, the Ryerson Audio–Visual Database of Emotional Speech and Song (RAVDESS), and the public Crowd-Sourced Emotional Multimodal Actors (CREMA) database. Evaluation is carried out using Equal Error Rate (EER) and Area Under the Curve (AUC) scores. The experimental results illustrate that deep architectures do not necessarily outperform classical classifiers. Among the classical classifiers, the GMM yields the lowest EER values and the best AUC scores across all datasets. In addition, the i-vector model surpasses all the fine-tuned deep models (CNN, LSTM, and GRU) on both evaluation metrics in neutral as well as emotional speech. Moreover, the GMM outperforms the i-vector on the Emirati and RAVDESS databases.

(This article belongs to the Special Issue Signal Processing Based on Convolutional Neural Network)
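
The abstract above reports Equal Error Rate (EER) and Area Under the Curve (AUC) without spelling out how they are obtained. The following is a minimal sketch of one common way to compute both from verification scores. It is not the authors' evaluation code: the function name `eer_and_auc`, the convention that a trial is accepted when its score is at or above the threshold, and the toy scores are all assumptions made for illustration.

```python
import numpy as np

def eer_and_auc(scores, labels):
    """Return (EER, AUC) for verification trials.

    scores: higher means "more likely the claimed speaker" (assumed convention).
    labels: 1 for a genuine (target) trial, 0 for an impostor trial.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]

    # Sweep thresholds from strictest to loosest; +inf rejects every trial,
    # so the ROC curve starts at (0, 0) and ends at (1, 1).
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    tpr = np.array([(pos >= t).mean() for t in thresholds])  # true accept rate
    fpr = np.array([(neg >= t).mean() for t in thresholds])  # false accept rate
    fnr = 1.0 - tpr                                          # false reject rate

    # EER: operating point where false accepts and false rejects are closest;
    # with a finite set of thresholds we average the two rates at that point.
    i = int(np.argmin(np.abs(fpr - fnr)))
    eer = float((fpr[i] + fnr[i]) / 2.0)

    # AUC: trapezoidal area under the ROC curve (TPR as a function of FPR).
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return eer, auc

# Toy example (made-up scores, not data from the paper).
eer, auc = eer_and_auc([0.9, 0.6, 0.4, 0.1], [1, 0, 1, 0])
print(f"EER={eer:.2f}, AUC={auc:.2f}")
```

A lower EER and a higher AUC both indicate better verification performance, which is why the abstract describes the best model as having the lowest EER and the best AUC.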


17 pages, 1754 KB

Open Access Article

Transformer-Based Multilingual Speech Emotion Recognition Using Data Augmentation and Feature Fusion

by Badriyya B. Al-onazi, Muhammad Asif Nauman, Rashid Jahangir, Muhmmad Mohsin Malik, Eman H. Alkhammash and Ahmed M. Elshewey

Appl. Sci. 2022, 12(18), 9188; https://doi.org/10.3390/app12189188 - 14 Sep 2022

Cited by 49 | Viewed by 8121

Abstract

In recent years, data science has been applied in a variety of real-life applications such as human-computer interaction, computer gaming, mobile services, and emotion evaluation. Among this wide range of applications, speech emotion recognition (SER) is an emerging and challenging research topic. Earlier SER studies used handcrafted features that gave the best results but failed to maintain accuracy when applied in complex scenarios. Later, deep learning techniques that automatically learn features from speech signals were used for SER. Deep learning-based SER techniques overcome the accuracy issues, yet there are still significant gaps in the reported methods. In particular, studies using lightweight CNNs failed to learn optimal features from composite acoustic signals. This study proposes a novel SER model to overcome these limitations. We focus on Arabic vocal emotions in particular because they have received relatively little attention in research. The proposed model performs data augmentation before feature extraction. The 273 derived features are fed as input to a transformer model for emotion recognition. The model was applied to four datasets: BAVED, EMO-DB, SAVEE, and EMOVO. The experimental findings demonstrate the robust performance of the proposed model compared to existing techniques. The proposed SER model achieved 95.2%, 93.4%, 85.1%, and 91.7% accuracy on the BAVED, EMO-DB, SAVEE, and EMOVO datasets, respectively. The highest accuracy was obtained on the BAVED dataset, indicating that the proposed model is well suited to Arabic vocal emotions.

(This article belongs to the Special Issue Machine Learning for Language and Signal Processing)


Search Results (2)
