Article

Speech Emotion Recognition on MELD and RAVDESS Datasets Using CNN

Computer Science Department, University of Technology, Baghdad 10066, Iraq
* Author to whom correspondence should be addressed.
Information 2025, 16(7), 518; https://doi.org/10.3390/info16070518
Submission received: 10 May 2025 / Revised: 8 June 2025 / Accepted: 19 June 2025 / Published: 21 June 2025
(This article belongs to the Special Issue Artificial Intelligence Methods for Human-Computer Interaction)

Abstract

Speech emotion recognition (SER) plays a vital role in enhancing human–computer interaction (HCI) and can be applied in affective computing, virtual support, and healthcare. This research presents a high-performance SER framework based on a lightweight 1D Convolutional Neural Network (1D-CNN) and a multi-feature fusion technique. Rather than employing spectrograms as image-based input, frame-level features (Mel-Frequency Cepstral Coefficients, Mel-spectrograms, and Chroma vectors) are computed across the signal to preserve temporal information and reduce computational cost. The model attained classification accuracies of 94.0% on MELD (multi-party conversations) and 91.9% on RAVDESS (acted speech). Ablation experiments demonstrate that the integration of complementary features significantly outperforms the use of any single feature as a baseline. Data augmentation techniques, including Gaussian noise and time shifting, enhance model generalisation. The proposed method demonstrates significant potential for real-time, audio-only emotion recognition on embedded or resource-constrained devices.

1. Introduction

Emotion is an important aspect of human communication, carrying cues about a speaker’s intention, mental state, and general mood. The recognition of emotion from speech is important for human–computer interaction (HCI) systems, affective computing, virtual assistants, e-learning systems, and healthcare technology [1]. In these applications, machines should understand not only spoken words but also how something is said. Speech emotion recognition has therefore emerged as a basic building block in the development of human-oriented artificial intelligence systems.
The objective of SER is to categorise speech signals into predefined emotional types such as anger, happiness, sadness, and fear. Emotion recognition from speech remains a difficult task because of inherent challenges such as speaker variability, variation in speaking style, the presence of background noise, and the ambiguous nature of emotional expression [2]. In addition, spontaneous speech, multiparty conversations, and noisy backgrounds are pervasive in practice and can greatly degrade the performance of speech emotion recognition [3].
Traditional speech emotion recognition (SER) systems typically used hand-crafted acoustic features such as prosodic (pitch, energy, and speech rate) and spectral (Mel-Frequency Cepstral Coefficients (MFCCs), formants, and spectral flux) features [4]. The features were typically used as input for shallow classifiers, e.g., Support Vector Machines (SVMs) [5], Gaussian Mixture Models (GMMs) [6], Hidden Markov Models (HMMs) [7], and Random Forests. Although effective with small and clean datasets, these methodologies failed to generalise effectively to more varied and noisy speech environments.
The advent of deep learning revolutionised speech emotion recognition by facilitating automatic feature extraction from raw or light-processed audio signals. Early deep learning techniques included Convolutional Neural Networks (CNNs) to extract spectral and spatial patterns from voice spectrograms [8], while Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks were used to model temporal dependencies [9]. Hybrid architectures such as CNN-LSTM integrated these benefits, resulting in substantial enhancements. However, CNN-based methods using 2D spectrograms still risk losing nuanced temporal information by transforming sequential signals into image-like features.
In recent years, there has been a significant shift towards transformer-based designs and self-supervised learning in speech and voice emotion recognition tasks. Transformers are the predominant structures for sequential modelling, owing to their self-attention mechanism. Initiatives such as SER-Transformer [10] and EmoBERTa have demonstrated superior performance in speech emotion recognition by effectively extracting long-range dependencies and contextual information from raw multimodal or audio inputs.
Multimodal speech emotion recognition (SER) has garnered significant attention, incorporating speech alongside additional modalities such as facial expressions, physiological signs, and text [11,12]. Multimodal fusion has significantly enhanced accuracy, although it incurs increased model complexity and necessitates synchronised multimodal datasets, which may not always be accessible in real-world scenarios.
Contrastive learning [13] and domain-adaptive [14] SER methodologies have emerged to address domain changes (e.g., differences in speakers or datasets) and enhance robustness under unseen conditions.
Despite these developments, real-time and resource-constrained environments, such as mobile applications, conversational agents, and embedded systems, necessitate lightweight, efficient, and speech-only models. Prior studies have employed CNNs for speech emotion recognition using features such as MFCCs and spectrograms [15,16], while 1D-CNNs have been shown to effectively capture short-term temporal dependencies [17]. However, most current systems are constrained, relying either on a single feature type or on computationally demanding architectures. Our study addresses these issues by presenting a lightweight, real-time model that integrates MFCC, Mel-spectrogram, and Chroma information through parallel 1D-CNN streams. This feature fusion strategy enables the model to exploit complementary spectral, prosodic, and tonal cues without compromising computational efficiency, thereby improving both the accuracy and the deployment feasibility of the model in realistic, resource-constrained scenarios.
This study introduces a feature-fusion-based 1D Convolutional Neural Network (1D-CNN) architecture that directly processes sequential acoustic feature representations, in contrast to the conventional use of 2D spectrogram images in classic CNN-based emotion recognition techniques. Instead of converting speech signals into spectrogram images for 2D convolution, the model employs numerically derived features (MFCCs, Mel-spectrogram matrices, and Chroma vectors) that preserve temporal and spectral information in their original sequential format [4]. The features are integrated and fed into the 1D-CNN to analyse the temporal dynamics of speech signals more effectively. The architecture facilitates the efficient use of spectral, prosodic, and harmonic details without the computational demands linked to 2D image processing. The system is evaluated using two representative datasets: MELD, which represents realistic conversational speech, and RAVDESS, which features controlled acted emotional responses. The proposed model demonstrates superior performance while maintaining computational efficiency, as substantiated by extensive experiments, including ablation tests and comparative analyses, thereby validating its suitability for real-time emotion-aware systems.

2. Related Works

2.1. Classical and Traditional Approaches to Deep Learning

Speech emotion recognition (SER) has been extensively examined using both traditional machine learning and deep learning methodologies.
Özseven [15] employed texture-based analysis methods, including Wavelet Decomposition (WD), Gabor Filters (GFs), the Gray-Level Co-occurrence Matrix (GLCM), and the Histogram of Oriented Gradients (HOG), to enhance the expressiveness of features extracted from spectrogram images. Combining the individual texture-based features with acoustic features yielded a notable improvement in classification performance on datasets such as SAVEE, EMO-DB, and eNTERFACE’05, despite the suboptimal results of the texture-based features alone.
Liu et al. [16] introduced a hybrid method that combines speaker-dependent and speaker-independent acoustic features within an Extreme Learning Machine Decision Tree (ELM-DT), resulting in approximately 89% precision on the CASIA dataset. Tuncer et al. [17] established a sophisticated pipeline that incorporates the Tunable Q-Factor Wavelet Transform (TQWT), the Travelling Salesman Problem (TSP) for feature transformation, and the INCA feature selector, culminating in SVM classification. The performance of the methodology reached over 90% precision rates in the RAVDESS, EMO-DB, SAVEE, and EMOVO datasets.
Huang et al. [18] introduced a set of hybrid models combining CNNs and SVMs, in which a 2D CNN extracted features from spectrogram magnitudes and an SVM performed classification. The proposed system achieved 88.3% (speaker-dependent) and 85.2% (speaker-independent) accuracy on the EMO-DB dataset. Wu et al. [19] proposed the use of modulation spectral features (MSFs) in addition to MFCCs and prosodic features, together with LDA, achieving an accuracy of 85% on speaker-independent tests.
Notable contributions include Lampropoulos and Tsihrintzis [20], who utilised MPEG-7 audio descriptors, achieving 83.93% precision on EMO-DB, and Wang et al. [21], who created Fourier Parameter (FP)-based features that resulted in 73% precision on a six-class SER problem.
Zeng et al. [22] proposed a Gated Residual Network (GResNet) with spectrogram-based input with an accuracy of 65.97% on the RAVDESS dataset. In contrast, Popova et al. [23] showed that employing transfer learning on pre-trained deep neural networks with Mel-spectrograms could enhance accuracy without necessitating complete model retraining.
While these methods established the foundation for SER pipelines, they are often limited by their dependence on singular feature types (e.g., spectrograms) and their inability to effectively capture long-term temporal dependencies and multimodal relationships.

2.2. Modern Deep Learning Methodologies (2020–2024)

Recent SER systems are increasingly adopting end-to-end learning approaches, leveraging large datasets and enhanced computing resources. They utilise self-supervised learning (SSL) and transformer-based architectures to overcome the limitations of traditional feature engineering pipelines.

2.2.1. Transformer-Based Architectures

Transformer-based models have significantly advanced speech emotion recognition by leveraging attention mechanisms to capture long-range dependencies in speech. The SER-Transformer [24] employed positional encoding and self-attention mechanisms, attaining an accuracy of 82.4% on RAVDESS and 85.3% on CREMA-D.

2.2.2. Self-Supervised Learning Models

Self-supervised methods like HuBERT [25] and Wav2Vec 2.0 [26] undergo pre-training on unlabelled audio to develop transferable representations. These methods allow the models to more effectively generalise across domains and eliminate the need for manually created features. More recently, Dabbabi and Mars [25] showed that by exploiting a small self-supervised model named DistilHuBERT, they were able to achieve close to state-of-the-art performance based on audio–visual features, with 90.79% on the RAVDESS dataset and 82.35% on the BAVED corpus, suggesting its effectiveness in emotion recognition while having a lower computational cost. On the other hand, Pepino et al. [26] have shown that the Wav2Vec 2.0 embeddings alone can improve emotion identification performance on well-researched benchmarks like RAVDESS and IEMOCAP by acting as direct input with basic neural classifiers.

2.2.3. Multimodal Speech Emotion Recognition

Multimodal fusion has been explored to enhance emotion recognition accuracy, integrating speech, facial expressions, and textual context. Chudasama et al. [27] introduced M2FNet, a transformer-based fusion model integrating audio and text features, achieving an accuracy of 91.2% on MELD. Li et al. [28] introduced a multimodal conversation model utilising graphs and an attention network, achieving an accuracy of 89.3% on the IEMOCAP dataset. While these methods demonstrate high accuracy, their dependence on multimodal data restricts scalability in voice-only contexts.

2.2.4. Lightweight and Embedded Systems

Sadok et al. [29] presented a quantisation-aware model (VQ-MAE-S-12) for edge deployment, achieving 84.1% accuracy on RAVDESS while minimising model size and runtime. Table 1 presents the performance metrics of some representative SER models based on traditional and modern deep learning techniques.

3. Methodology

This paper presents a novel process for recognising emotions from speech, which holds significance in several artificial intelligence applications. Figure 1 depicts the proposed method. We begin by describing the datasets utilised in our study; we then explain the suggested framework, which starts with feature extraction, followed by the preprocessing stage and the implementation of the baseline deep learning model.

3.1. Datasets

For this study, we leverage two highly regarded datasets: the Multimodal EmotionLines Dataset (MELD) [30] and the Ryerson Audio–Visual Database of Emotional Speech and Song (RAVDESS) [31]. These datasets are widely recognised in speech emotion recognition (SER) research for their diversity and richness in emotional expression.
  • MELD Dataset: This dataset consists of 13,000 utterances sourced from multi-party conversations in the television show Friends. Each utterance is categorised into one of seven emotional labels: neutral, sadness, surprise, anger, disgust, fear, and happiness. The dataset is publicly available at https://affective-meld.github.io/ (accessed on 16 July 2024).
  • RAVDESS Dataset: This dataset contains 7356 audio–video recordings featuring actors portraying eight emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprised. However, since calmness is often acoustically indistinguishable from neutrality, we exclude it and focus on the remaining seven primary emotions for this study. The dataset is publicly available at https://zenodo.org/records/1188976 (accessed on 29 October 2024).
To optimise model performance and minimise processing time, we extract only the audio components (in WAV format) from the video samples. This approach ensures that the analysis remains focused on the essential acoustic features while reducing computational overhead.

3.2. Feature Extraction and Augmentation

Feature extraction plays a pivotal role in enhancing model performance, as indicated by earlier research [8]. We utilise the Librosa audio library for feature extraction, computing three spectral representations of audio samples:
  a. Mel-Frequency Cepstral Coefficients (MFCCs):
    • Capture short-term power spectrum features of audio signals.
    • Simulate human auditory perception by mapping frequencies to the Mel scale, which aligns more closely with how humans perceive sound.
    • Have shown strong performance in recognising speech patterns and identifying emotions in audio.
To support the multi-feature extraction approach, we constructed Figure 2, which displays two primary spectral representations calculated with the Librosa library. A Mel-spectrogram, a graphical depiction of how energy in various frequency bands changes over time, is displayed on the left-hand side. This visualisation captures both pitch and timbre (sound quality) elements that are crucial for emotion detection. The Chroma spectrogram on the right can track melodic and harmonic content because it is sensitive to pitch-class information, i.e., the tonal energy in each of the 12 pitch classes. To improve emotion categorisation by offering additional spectral and harmonic cues from the input audio, these two representations are combined with MFCCs in the feature fusion block (see Figure 2).
The following formulations serve as the theoretical basis for our feature extraction process. The time–frequency representation of the signal is obtained using the short-time Fourier transform (STFT), given by Equation (1) below:
\[ \mathrm{STFT}\{x\}(m,\omega) = \sum_{n=-\infty}^{\infty} x[n]\, w[n-m]\, e^{-j\omega n} \tag{1} \]
where
  • x[n]: input audio signal in the time domain;
  • w[n − m]: window function (e.g., Hamming, Hann) centred on frame m;
  • m: time shift, or frame index;
  • ω: angular frequency bin;
  • e^{−jωn}: complex exponential basis function (for the Fourier transform);
  • STFT(m, ω): time–frequency representation of the signal.
The Mel-Frequency Cepstral Coefficients are computed by calculating the Discrete Cosine Transform of the log of the Mel-scaled power spectrum, as shown in Equation (2):
\[ \mathrm{MFCC}(t,n) = \sum_{k=1}^{K} \log\!\big(S_k(t)\big) \cdot \cos\!\left[\frac{n\left(k-\tfrac{1}{2}\right)\pi}{K}\right] \tag{2} \]
  • t: frame or time index;
  • n: MFCC coefficient index (e.g., 1st, 2nd, …);
  • K: number of Mel filter banks;
  • S_k(t): power or energy of the signal in the k-th Mel filter for frame t;
  • log(S_k(t)): logarithmic power spectrum (simulates human loudness perception);
  • cos[·]: DCT basis, used to decorrelate the features.
  b. Mel-scaled spectrogram:
    • Converts audio signals into a time–frequency representation, providing a detailed analysis of sound over time.
    • Delivers richer frequency resolution, effectively capturing the texture and nuances of the voice.
  c. Chroma feature:
    • Extracts tonal and harmonic information from audio signals.
    • Substantially contributes to detecting intonation and pitch variations, which are essential for capturing emotional expression.
The Fourier transform is applied to obtain the energy spectrum, onto which the Mel-frequency scale is then mapped. Pitch classes and harmony are more challenging to capture accurately with the Mel-scaled spectrogram and MFCCs, although these features remain helpful for identifying and tracking timbre changes in a sound file. This issue is addressed using Chromagrams. The short-time Fourier transform (STFT) was utilised in this experiment. In contrast to MFCCs and Mel-scaled spectrograms, the spectral contrast feature provides a more comprehensive spectral analysis of sound. All extracted feature sets are concatenated into a unified input representation, serving as the training data for the CNN.
To enhance the generalisation of training and improve robustness against environmental noise, we employed augmentation techniques. First, random Gaussian noise was added to replicate real-life scenarios involving speech and background sounds. The Gaussian noise was generated using Equation (3), the probability density function of a normal distribution:
\[ p(z) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(z-\mu)^{2}/2\sigma^{2}} \tag{3} \]
  • μ is the average value of z.
  • σ is its standard deviation.
Second, time shifting was performed by randomly moving the audio onset forward or backward by up to 25% of the signal length. These techniques jointly increased the diversity of the data and helped the models handle the spontaneity of real human communication. Training on both the original and augmented versions of the data allowed the convolutional model to learn from a broader range of realistic conditions, which improved its precision and reduced its sensitivity to varying audio conditions in real-world use.
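As an illustration, both augmentations operate directly on the raw waveform. The following is a minimal sketch under the assumption that the noise is scaled relative to the signal amplitude; the exact noise variance and shift sampling used in our experiments are not specified here, so the values below are placeholders.

```python
import numpy as np

def add_gaussian_noise(y: np.ndarray, noise_factor: float = 0.005) -> np.ndarray:
    """Add zero-mean Gaussian noise (Equation (3)) scaled by the signal amplitude."""
    noise = np.random.normal(loc=0.0, scale=1.0, size=y.shape)
    return y + noise_factor * np.max(np.abs(y)) * noise

def time_shift(y: np.ndarray, max_fraction: float = 0.25) -> np.ndarray:
    """Randomly shift the audio onset forward or backward by up to 25% of the signal length."""
    max_shift = int(len(y) * max_fraction)
    return np.roll(y, np.random.randint(-max_shift, max_shift + 1))
```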
During the multi-feature fusion stage, we extracted MFCCs (40 coefficients), Mel-spectrograms (128 bins), and Chroma vectors (12 dimensions) for each speech signal. To guarantee temporal synchronisation among feature types, all features are calculated using an identical frame length of 25 ms and a hop size of 10 ms. Consequently, the temporal dimension is maintained across all feature matrices.
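To make this setup concrete, the following is a minimal sketch of the frame-level feature extraction described above, using the Librosa library. A 16 kHz sampling rate is assumed (not stated in the text), so a 25 ms frame and a 10 ms hop correspond to 400 and 160 samples; the file path is a placeholder.

```python
import librosa
import numpy as np

SR = 16000                  # assumed sampling rate
N_FFT = int(0.025 * SR)     # 25 ms analysis frame -> 400 samples
HOP = int(0.010 * SR)       # 10 ms hop -> 160 samples

def extract_features(wav_path: str):
    """Return the MFCC, log-Mel, and Chroma matrices, each with the same number of frames."""
    y, sr = librosa.load(wav_path, sr=SR, mono=True)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, n_fft=N_FFT, hop_length=HOP)      # (40, T)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, n_fft=N_FFT, hop_length=HOP)
    log_mel = librosa.power_to_db(mel)                                                   # (128, T)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=N_FFT, hop_length=HOP)        # (12, T)

    return mfcc, log_mel, chroma
```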
Each feature matrix is subsequently fed into an individual 1D-CNN branch. The outputs of the branches are feature maps of dimensions (T, F′), where T represents the number of time steps and F′ denotes the quantity of learnt filters. We synchronise their outputs along the temporal dimension by employing zero-padding and up-sampling to ensure uniformity in T across all branches. The aligned feature maps are subsequently concatenated along the feature dimension, resulting in a fused representation of dimensions (T, 3 × F′). The integrated feature sequence is subsequently input into multiple fully connected layers for unified representation learning and classification. This implicit fusion allows the model to internally integrate complementary cues from spectral, prosodic, and harmonic information while maintaining feature-level specificity.
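The alignment and concatenation step can be sketched in a few lines. This is an illustrative rendering, not the actual implementation, in which branch outputs of shape (T_i, F′) are zero-padded to a common length and joined along the feature axis, as described above.

```python
import numpy as np

def fuse_branch_outputs(feature_maps: list[np.ndarray]) -> np.ndarray:
    """Zero-pad each branch output (T_i, F') to a common length T,
    then concatenate along the feature axis to obtain (T, 3 * F')."""
    T = max(fm.shape[0] for fm in feature_maps)
    padded = [np.pad(fm, ((0, T - fm.shape[0]), (0, 0))) for fm in feature_maps]
    return np.concatenate(padded, axis=1)

# Example: three branch outputs with slightly different lengths and 32 learnt filters each
fused = fuse_branch_outputs(
    [np.random.rand(98, 32), np.random.rand(100, 32), np.random.rand(99, 32)])
print(fused.shape)   # (100, 96)
```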

3.3. CNN Model

The proposed SER model is developed using 1D-CNN to directly extract sequential speech features from audio signals. This model processes 1D sequential data directly, in contrast to traditional CNNs that handle 2D spectrogram images, thereby preserving essential temporal dependencies in the speech signal. This approach eliminates the need for image-based transformations and significantly reduces computational overhead. Figure 3 presents the complete architecture of the model, which includes multiple convolutional, pooling, and fully connected layers.
The model accepts a multi-channel input representation, where each channel corresponds to one of the extracted feature sets: MFCCs, Mel-scaled spectrograms, and Chroma vectors. The network is composed of three convolutional blocks that progressively extract hierarchical representations, while regularisation techniques are used to combat overfitting. The first convolutional block utilises a 1D convolutional layer with 128 filters and a kernel size of 5, followed by batch normalisation to stabilize the learning process and reduce internal covariate shifts. A Rectified Linear Unit (ReLU) introduces non-linearity, followed by a max-pooling layer with a pool size of 2 to reduce the temporal dimension and computational cost. A dropout layer with a rate of 0.4 is employed to reduce overfitting by randomly deactivating neurones during the training phase.
The second block performs a convolution with 64 filters and a kernel size of 3 to extract mid-level features, followed by batch normalisation, ReLU activation, max-pooling, and dropout at a reduced rate of 0.3. The third convolutional block comprises 32 filters with a kernel size of 3 and uses a lower dropout rate of 0.2, so that progressively more neurones remain active as the network approaches the classification layers. Following the convolutional layers, a fully connected dense layer of size 128 is employed to acquire high-level abstract representations of the extracted features. Finally, a Softmax output layer provides class probabilities for the emotion categories. The model is optimised to balance complexity and speed, demonstrating effective performance on both acted and naturalistic speech corpora without relying on computationally intensive 2D CNN architectures. The integration of dropout and batch normalisation across all layers enhances network robustness and significantly reduces the risk of severe overfitting, particularly in scenarios involving limited datasets.
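The layer configuration described above can be summarised in a short sketch. The framework is not named in the text, so this is an assumed Keras/TensorFlow rendering that treats the fused (T, F) feature sequence as a single multi-channel input with seven output classes.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_ser_cnn(time_steps: int, n_features: int, n_classes: int = 7) -> tf.keras.Model:
    """1D-CNN with the three convolutional blocks described in Section 3.3."""
    return models.Sequential([
        layers.Input(shape=(time_steps, n_features)),
        # Block 1: 128 filters, kernel size 5, dropout 0.4
        layers.Conv1D(128, 5, padding="same"),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.MaxPooling1D(2),
        layers.Dropout(0.4),
        # Block 2: 64 filters, kernel size 3, dropout 0.3
        layers.Conv1D(64, 3, padding="same"),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.MaxPooling1D(2),
        layers.Dropout(0.3),
        # Block 3: 32 filters, kernel size 3, dropout 0.2
        layers.Conv1D(32, 3, padding="same"),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.MaxPooling1D(2),
        layers.Dropout(0.2),
        # Classification head: dense layer of size 128 followed by Softmax over emotion classes
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
```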

3.4. Training and Evaluation

The proposed model was independently trained on the RAVDESS and MELD datasets to evaluate its generalisability across acted and conversational speech data. Prior to training, both datasets were divided into training and test sets while maintaining the original class distribution through stratified sampling. Two methods of splitting were employed: a 70:30 split and an 80:20 split. The 80:20 split yielded favourable results, which are reported here.
The Adam optimiser, characterised by its adaptive learning rate and momentum, was employed for model optimisation in speech tasks. The learning rate was set to 0.0001, and the categorical cross-entropy loss function was utilised for parameter updates, appropriate for multi-class classification tasks, such as SER. A mini-batch size of 64 was chosen to balance training speed and stability, and the model was trained for 120 epochs, which was determined to be sufficient for convergence without exhibiting signs of overfitting in preliminary experiments.
Training was conducted on a workstation equipped with an Intel Core i7-8550U CPU, 16 GB of RAM, and an NVIDIA GTX 1650 GPU. The equipment was manufactured by Lenovo Group Limited, headquartered in Beijing, China. To prevent overfitting, we implemented an early stopping strategy based on validation loss to select the optimal training model checkpoint. The model’s performance was evaluated using standard metrics, including accuracy, precision, recall, and F1-score, assessed on the held-out test set. An ablation study was conducted to examine the contribution of each feature combination (MFCC, Mel, and Chroma) in the model. We conducted statistical significance testing between the proposed method and baseline models, including CNN and CNN-LSTM, using a paired t-test, with p < 0.05 deemed significant.
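A training and evaluation loop matching these settings might look as follows. This is a hedged sketch that reuses the build_ser_cnn function from the sketch in Section 3.3 and substitutes random tensors for the fused features; the validation fraction and early-stopping patience are assumptions, as they are not reported above.

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import classification_report

# Dummy tensors standing in for the fused features (samples, T, F) and one-hot labels;
# F = 180 corresponds to 40 MFCC + 128 Mel + 12 Chroma dimensions.
x_train = np.random.rand(256, 200, 180).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 7, 256), 7)
x_test = np.random.rand(64, 200, 180).astype("float32")
y_test = tf.keras.utils.to_categorical(np.random.randint(0, 7, 64), 7)

model = build_ser_cnn(time_steps=200, n_features=180)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True   # assumed patience
)

model.fit(
    x_train, y_train,
    validation_split=0.1,    # assumed validation fraction
    epochs=120,
    batch_size=64,
    callbacks=[early_stop],
)

# Accuracy, precision, recall, and F1-score on the held-out test set
y_pred = np.argmax(model.predict(x_test), axis=1)
print(classification_report(np.argmax(y_test, axis=1), y_pred, digits=3))
```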
In contrast to prior research by Eom et al. [4] that employed 1D-CNNs on MFCCs for speech emotion recognition, our model incorporates a novel multi-branch architecture that simultaneously analyses three complementary audio representations: MFCCs, Mel-spectrograms, and Chroma vectors. Each branch comprises a distinct 1D-CNN pipeline, designed to independently acquire temporal and spectral information from its associated feature. The intermediate representations are then aligned and integrated by a learnable fusion layer, which determines the contribution of each feature type; these contributions are acquired through fully connected layers and dropout-based regularisation. Unlike simple concatenation or early fusion, our method ensures that each stream influences the final emotion embedding.
Unlike traditional 1D-CNNs in Speech Emotion Recognition (e.g., [18]), which utilise shallow CNN layers on a singular feature stream (typically MFCCs or log-Mel spectrograms), our model is distinguished by its multi-branch temporal encoding and feature-specific processing. Instead of consolidating all heterogeneous acoustic features into a unified space, we establish three parallel 1D-CNN pipelines tailored to MFCC, Mel-spectrogram, and Chroma, respectively, enabling the model to capture distinct temporal features from various perceptual dimensions of the speech signal.
Moreover, our fusion mechanism not only concatenates the output but also utilises a trainable alignment component to dynamically weigh and integrate each feature map based on its relevance for emotion classification. This architecture allows the model to capture intricate feature interactions and outperform single-stream or early-fusion CNNs. Conversely, prior CNN-based approaches (e.g., [18]) typically employ uniform convolutional filters for all inputs, disregarding feature heterogeneity.

4. Results and Discussion

This section presents a systematic evaluation of the proposed feature-fusion-based 1D-CNN for SER. The model’s performance was assessed using two benchmark datasets: the MELD dataset and the RAVDESS dataset. Evaluation metrics included accuracy, precision, recall, F1-score, learning curves, ablation experiments, and comparisons with state-of-the-art models.

4.1. Performance on RAVDESS Dataset

The results from the RAVDESS dataset indicated that the proposed model demonstrated strong performance in classifying acted emotional speech, achieving an overall accuracy of 91.9%. Figure 4 illustrates the rapid convergence of the model during the training process, with training accuracy achieving approximately 99.9% after several epochs. The validation accuracy shows a significant increase, reaching 91.9%, which is higher than the baseline of approximately 85%, without any indications of overfitting, such as fluctuations in accuracy.
The corresponding training and validation loss curves, presented in Figure 5, reveal a consistent decline throughout the training process. While minor fluctuations are observed in the validation loss, they are likely due to the variability in speaker expressions and emotional intensity present in the RAVDESS dataset. Nonetheless, the overall trend supports the conclusion that the model is capable of learning stable and discriminative representations for emotion recognition from acted speech.

4.2. Performance on MELD Dataset

On the MELD corpus, which contains overlapping conversational speech, the model demonstrated excellent performance, achieving a classification accuracy of 94.0%. As shown in Figure 6, both the training and validation accuracy curves exhibit rapid learning and stable convergence. The validation curve closely tracks the training curve, reaching 94.0% by the end of training, confirming the model’s capacity to generalise effectively to complex real-world scenarios.
As depicted in Figure 7, the loss curves indicate consistent reductions in both training and validation loss. Although minor fluctuations are present, the overall trend reflects effective learning and stable optimisation across epochs. These results highlight the model’s ability to handle spontaneous speech, speaker variability, and conversational noise, all of which are characteristic challenges in real-world SER applications.

4.3. Ablation Study

An ablation study was performed to assess the impact of the proposed multi-feature fusion strategy and to quantify the contribution of each feature to overall performance. Table 2 summarises the accuracies achieved under three configurations:
(1)
Using MFCCs alone.
(2)
Combining MFCCs with Mel-spectrograms.
(3)
Integrating the full combination of MFCCs, Mel-spectrograms, and Chroma features.
The results demonstrate that relying solely on MFCCs results in accuracy levels of 85.0% for RAVDESS and 87.1% for MELD. Adding Mel-spectrograms yields a notable performance improvement of approximately 4–5%. The highest accuracy, 91.9% on RAVDESS and 94.0% on MELD, was achieved with the full feature fusion, confirming the complementary nature of spectral, prosodic, and harmonic features in enhancing emotion recognition.

4.4. Evaluation Metrics Overview

In addition to accuracy, the model’s performance was measured in terms of precision, recall, and F1-score (Table 3), providing a well-rounded assessment.

4.5. Comparative Analysis with State-of-the-Art Methods

To validate the effectiveness of the proposed 1D-CNN model, its performance was compared against several commonly used baseline models and recent state-of-the-art (SOTA) approaches, evaluated separately on the MELD and RAVDESS datasets.
Table 4 presents a numerical comparison on the MELD dataset. The proposed model achieved an accuracy of 94.0% on the MELD dataset, significantly surpassing both traditional and advanced deep learning models. For example, Ho et al. [32] utilised a Recurrent Neural Network (RNN)-based multi-level attention model to recognise contextual emotional cues in conversation, attaining only 61.2% accuracy, potentially due to its inability to capture intricate inter-utterance dependencies. Hu et al. [33] implemented a contextual LSTM with speaker-level factorisation, while Li et al. [28] utilised a graph-based attention network to model the relationships among speakers and modalities in the dialogue history. Both methods attained an accuracy of 68.28%, indicating the challenge of modelling long-range conversational dependencies or multimodal interactions.
On the contrary, Chudasama et al. [27] proposed M2FNet, a transformer-based multimodal fusion network that integrates text, audio, and visual signals for emotion recognition in conversational contexts. Despite its holistic approach, it achieved a suboptimal performance of 66.7% due to its larger model size and increased dependence on synchronisation across all modalities. Our model relies solely on audio and does not encounter these limitations, yet it substantially surpasses all these methods. In comparison to CNN-X [8], which achieves an accuracy of 82.9% using 2D Convolutional Neural Networks with spectrogram images, our 1D-CNN demonstrates substantial advancement. Our methodology utilises the extracted sequential numerical features (MFCCs, Mel-spectrograms, and Chroma vectors), thereby preserving temporal information without necessitating costly signal transformations into two-dimensional images. This validates our design choice and illustrates the efficiency and efficacy of the 1D-CNN fusion-based architecture for representing emotional cues from speech in a resource-constrained environment.
This considerable improvement highlights two critical aspects of the proposed approach:
  • The effectiveness of directly modelling temporal dependencies using 1D convolution, which captures the sequential nature of speech more efficiently than 2D CNNs.
  • The advantages of the feature fusion strategy, which integrates MFCCs, Mel-spectrograms, and Chroma features to provide a more comprehensive representation of emotional speech characteristics.
These findings confirm the superiority of the proposed model and its potential for real-world deployment in emotion-aware applications.
Table 5 summarises the comparative results on the RAVDESS dataset. In our experiments, the proposed model attained an accuracy of 91.9% on the RAVDESS dataset, exceeding the performance of previously evaluated state-of-the-art benchmark models on the same dataset. Sadok et al. [29] introduced the VQ-MAE-S-12, a quantisation-aware transformer architecture optimised for efficient deployment on edge devices. Although the model is compressed through vector quantisation, its multi-head self-attention mechanism and masked autoencoding remain computationally intensive and memory-demanding, and it achieved an accuracy of 84.1%.
In another recent study, Jiménez et al. [35] utilised xlsr-Wav2Vec2.0, a commonly employed self-supervised transformer pre-trained on multilingual speech data. This architecture employs contextual embeddings of unannotated audio and subsequently fine-tunes with task-specific classifiers. Despite attaining a commendable accuracy of 86.7%, its substantial size—frequently exceeding 90 million parameters—and elevated inference costs impede its applicability in real-time or embedded applications without extensive pruning or distillation.
The consistent superiority of the proposed model across both MELD and RAVDESS datasets can be attributed to three key design choices:
  • The use of 1D convolutions, which effectively preserve the temporal dependencies present in speech signals.
  • The integration of complementary features (MFCCs, Mel-spectrograms, and Chroma), resulting in a richer and more informative representation of emotional speech.
  • The application of data augmentation techniques (e.g., Gaussian noise, time shifting), which improve the model’s robustness to speaker variability and varying recording conditions.
These findings collectively demonstrate that the proposed model is highly effective for speech emotion recognition, offering robust performance across both controlled (acted) and real-world (spontaneous) speech datasets.
When compared to traditional 2D CNN models such as CNN-X, the proposed 1D-CNN offers superior temporal feature extraction without depending on spectrogram image representations. Furthermore, the model showed better generalisation capability than hybrid models like DF-ERC, which, despite combining CNN and LSTM layers, struggled to generalise effectively on the MELD dataset.

4.6. Confusion Matrix Analysis

To more finely assess model performance on particular emotion categories, we examined confusion matrices created on both benchmark datasets.
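For reference, confusion matrices of this kind can be produced with scikit-learn as in the brief sketch below; the label order and the random predictions are placeholders standing in for the model's actual test-set output.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Assumed MELD label order; y_true / y_pred would be the test labels and model predictions
labels = ["neutral", "sadness", "surprise", "anger", "disgust", "fear", "happiness"]
y_true = np.random.randint(0, 7, 500)
y_pred = np.random.randint(0, 7, 500)

cm = confusion_matrix(y_true, y_pred)
ConfusionMatrixDisplay(cm, display_labels=labels).plot(xticks_rotation=45)
plt.tight_layout()
plt.show()
```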

4.6.1. MELD Dataset

The confusion matrix obtained when the proposed model is evaluated on the MELD dataset is shown in Figure 8. The model performs well across all emotion categories, with particularly high accuracies for sadness (394 correct recognitions), disgust (386), and anger (392). Little confusion is observed among the classes, indicating that the model adapts fairly well to the complex conversational context that MELD provides. Although there are some minor misclassifications between related emotions (e.g., neutral vs. sadness), the overall results indicate that the model benefits from MELD’s conversational context, which provides distinctive emotional cues for more accurate recognition.

4.6.2. RAVDESS Dataset

The confusion matrix of the emotion recognition model on the RAVDESS dataset is illustrated in Figure 9. The model achieves high predictive accuracy across most emotion categories, particularly sadness (311 correct predictions), happiness (308), and fear (288). The anger and disgust categories show comparably consistent results, with 285 and 249 instances accurately predicted, respectively. Some confusion exists between surprise and neutral expressions, with 36 instances of surprise misclassified as neutral, likely due to prosodic or vocal characteristics of the expressive delivery. Despite these misclassifications, the confusion matrix overall demonstrates that the model effectively distinguishes emotions in the acted recordings of the RAVDESS dataset.

4.7. Model Efficiency

Our architecture’s primary innovations include feature-aware parallelism, temporal depth utilising 1D-CNNs, and a design optimised for deployment efficiency, all aimed at achieving superior performance and real-time implementation in speech-only emotion recognition.
This architecture can capture a wider range of emotional cues—MFCCs offer high resolution in spectral energy, Mel-spectrograms deliver perceptually scaled frequency bands, and Chroma features highlight harmonic content. The network is engineered to be lightweight, comprising approximately 1.2 million parameters and exhibiting low FLOPs per inference, suitable for deployment in resource-constrained and real-time applications. The advantages are further illustrated in Table 6, which compares model size and computational requirements across current SER architectures.
To further assess the model’s real-time capabilities, inference was benchmarked on a Lenovo LOQ 15 laptop equipped with an 8-core Intel Core i7-13650HX processor (up to 4.9 GHz), 16 GB of DDR5 RAM, and an NVIDIA GeForce RTX 4060 GPU. We performed inference solely on the CPU to simulate an edge environment with constrained resources. On average, the model processed a standard 4 s utterance in approximately 34 ms (including feature computation and prediction), with peak memory usage remaining below 90 MB during inference. These results demonstrate that the model can operate in real time (under a 100 ms delay), and its minimal memory and computational requirements facilitate deployment on embedded or mobile platforms without GPU support.
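A simple way to reproduce this kind of latency measurement is sketched below. It reuses the build_ser_cnn sketch from Section 3.3, assumes a 16 kHz sampling rate, and times feature extraction plus CPU prediction on a dummy 4 s waveform, so absolute numbers will differ from those reported above.

```python
import time
import numpy as np
import librosa

SR, N_FFT, HOP = 16000, 400, 160                         # assumed 16 kHz, 25 ms frames, 10 ms hop
y = np.random.randn(4 * SR).astype(np.float32)           # dummy 4-second utterance
model = build_ser_cnn(time_steps=401, n_features=180)    # 401 frames for 4 s at a 10 ms hop

times = []
for _ in range(20):
    start = time.perf_counter()
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=40, n_fft=N_FFT, hop_length=HOP)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=SR, n_mels=128, n_fft=N_FFT, hop_length=HOP))
    chroma = librosa.feature.chroma_stft(y=y, sr=SR, n_fft=N_FFT, hop_length=HOP)
    fused = np.concatenate([mfcc, mel, chroma], axis=0).T[np.newaxis, ...]   # (1, 401, 180)
    model.predict(fused, verbose=0)
    times.append((time.perf_counter() - start) * 1000.0)

print(f"mean end-to-end latency: {np.mean(times):.1f} ms")
```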

4.8. Limitations and Future Work

Although data augmentation enhanced the model’s capacity to manage variation in speakers and background noise, certain difficulties persisted despite its commendable performance. Misclassifications occurred more frequently for nuanced or blended emotion classes. In datasets such as MELD, where speaker overlap and conversational interactivity are common, the model at times had difficulty distinguishing between low-arousal emotions such as “neutral” and “sad.”
Future enhancements could include
  • Integrating attention mechanisms for enhanced context awareness.
  • Taking advantage of multimodal information (i.e., integrating audio with transcript embeddings).
  • Using domain adaptation methods to more effectively manage speaker and recording variation between data sets.
Although our model demonstrates good results on benchmark datasets such as MELD and RAVDESS, these datasets are comparatively clean, properly annotated, and captured in controlled environments. They do not, however, adequately address the complicated characteristics of spontaneous and authentic speech, including background noise, speech interruptions, speaker overlap, and diverse recording environments. This limitation emphasises the necessity of evaluating our model’s compatibility in noisy or challenging conditions.
In future work, we will further evaluate our methodology on more realistic datasets such as CREMA-D and IEMOCAP. Additionally, we will develop noise-augmented versions of existing benchmarks to simulate real-world acoustic variations. We intend to investigate domain adaptation techniques and noise-resistant feature-space preprocessing to enhance model generalisability and stability, especially in embedded or field environments.

5. Conclusions

This study introduced a lightweight yet effective 1D Convolutional Neural Network (1D-CNN) model for Speech Emotion Recognition (SER), utilising a combination of MFCCs, Mel-spectrograms, and Chroma features to extract complementary acoustic information. In contrast with typical 2D CNN-based approaches that use spectrogram images as input, the model takes sequential audio features as input while maintaining temporal resolution at a significantly reduced computational cost.
Comprehensive assessments on two benchmarks, the conversational speech dataset MELD and the acted speech dataset RAVDESS, demonstrated the model’s generalisability across a wide range of speech settings. It attained 94.0% accuracy on MELD and 91.9% on RAVDESS, surpassing conventional classifiers, hybrid CNN-LSTM baselines, and a variety of other recent transformer-based and multimodal deep learning models.
Combining Gaussian noise injection and time shifting resulted in a significant improvement in robustness against variation in speaker utterances as well as background noise. Ablation experiments verified that the complete combination of MFCCs, Mel-spectrograms, and Chroma features yielded a statistically significant improvement over single-feature models. Even with these advantages, some shortcomings remain, including difficulty in capturing subtle emotional states (e.g., “fear” vs. “surprise”, or neutral-affect overlap). Future research could add attention mechanisms, transformer encoders, and audio–text multimodal integration to improve discrimination of fine-grained or ambiguous emotion signals.

Author Contributions

Conceptualisation, G.T.W. and S.H.S.; methodology, S.H.S.; software, G.T.W.; validation, G.T.W. and S.H.S.; formal analysis, G.T.W.; investigation, S.H.S.; resources, G.T.W. and S.H.S.; data curation, G.T.W. and S.H.S.; writing—original draft preparation, G.T.W.; writing—review and editing, G.T.W. and S.H.S.; visualisation, G.T.W.; supervision, S.H.S.; project administration, S.H.S.; funding acquisition, G.T.W. and S.H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available at https://affective-meld.github.io/ (accessed on 16 July 2024), https://zenodo.org/records/1188976 (accessed on 29 October 2024).

Acknowledgments

The authors acknowledge the use of ChatGPT-4 on 24 January 2025, primarily for content rewriting to enhance clarity and readability. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Poria, S.; Cambria, E.; Bajpai, R.; Hussain, A. A review of affective computing: From unimodal analysis to multimodal fusion. Inf. Fusion 2017, 37, 98–125. [Google Scholar] [CrossRef]
  2. Akçay, M.B.; Oğuz, K. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 2020, 116, 56–76. [Google Scholar] [CrossRef]
  3. AlHanai, T.; Ghassemi, M. Predicting latent narrative mood using audio and physiologic data. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, AAAI 2017, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar] [CrossRef]
  4. Eom, Y.; Bang, J. Speech emotion recognition based on 2D CNN and improved MFCC features. J. Inf. Commun. Converg. Eng. 2021, 19, 148–154. [Google Scholar]
  5. Bhavan, A.; Chauhan, P.; Hitkul; Shah, R.R. Bagged support vector machines for emotion recognition from speech. Knowl.-Based Syst. 2019, 184, 104886. [Google Scholar] [CrossRef]
  6. Tashev, I.J.; Wang, Z.-Q.; Godin, K. Speech emotion recognition based on Gaussian Mixture Models and Deep Neural Networks. In Proceedings of the 2017 Information Theory and Applications Workshop (ITA), La Jolla, CA, USA, 12–17 February 2017; pp. 1–4. [Google Scholar] [CrossRef]
  7. Fahad, M.S.; Deepak, A.; Pradhan, G.; Yadav, J. DNN-HMM-Based Speaker-Adaptive Emotion Recognition Using MFCC and Epoch-Based Features. Circuits Syst. Signal Process. 2021, 40, 466–489. [Google Scholar] [CrossRef]
  8. Dal Rí, F.A.; Ciardi, F.C.; Conci, N. Speech Emotion Recognition and Deep Learning: An Extensive Validation Using Convolutional Neural Networks. IEEE Access 2023, 11, 116638–116649. [Google Scholar] [CrossRef]
  9. Li, D.; Liu, J.; Yang, Z.; Sun, L.; Wang, Z. Speech emotion recognition using recurrent neural networks with directional self-attention. Expert Syst. Appl. 2021, 173, 114683. [Google Scholar] [CrossRef]
  10. Hazmoune, S.; Bougamouza, F. Using transformers for multimodal emotion recognition: Taxonomies and state of the art review. Eng. Appl. Artif. Intell. 2024, 133, 108339. [Google Scholar] [CrossRef]
  11. Alarcão, S.M.; Fonseca, M.J. Emotions recognition using EEG signals: A survey. IEEE Trans. Affect. Comput. 2019, 10, 374–393. [Google Scholar] [CrossRef]
  12. Zhang, H.; Huang, Z.; Shang, Z.; Zhang, P.; Yan, Y. LinearSpeech: Parallel Text-to-Speech with Linear Complexity. In Proceedings of the Interspeech 2021, ISCA, Brno, Czech Republic, 30 August–3 September 2021; pp. 4129–4133. [Google Scholar] [CrossRef]
  13. Tzirakis, P.; Zhang, J.; Schuller, B.W. End-to-End Speech Emotion Recognition Using Deep Neural Networks. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5089–5093. [Google Scholar] [CrossRef]
  14. Abdelwahab, M.; Busso, C. Supervised domain adaptation for emotion recognition from speech. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5058–5062. [Google Scholar] [CrossRef]
  15. Özseven, T. Investigation of the effect of spectrogram images and different texture analysis methods on speech emotion recognition. Appl. Acoust. 2018, 142, 70–77. [Google Scholar] [CrossRef]
  16. Liu, Y.; Sun, H.; Chen, G.; Wang, Q.; Zhao, Z.; Lu, X.; Wang, L. Multi-Level Knowledge Distillation for Speech Emotion Recognition in Noisy Conditions. In Proceedings of the Interspeech 2023, Dublin, Ireland, 20–24 August 2023; pp. 1893–1897. [Google Scholar] [CrossRef]
  17. Subasi, A.; Tuncer, T.; Dogan, S.; Tanko, D.; Sakoglu, U. EEG-based emotion recognition using tunable Q wavelet transform and rotation forest ensemble classifier. Biomed. Signal Process. Control. 2021, 68, 102648. [Google Scholar] [CrossRef]
  18. Gayathri, R.; Arun Kumar, B.; Inbanathan, S.; Karthick, S. Speech emotion recognition using CNN-LSTM. Int. J. Sci. Res. Eng. Manag. 2023, 7. [Google Scholar] [CrossRef]
  19. Wu, S.; Falk, T.H.; Chan, W.Y. Automatic speech emotion recognition using modulation spectral features. Speech Commun. 2011, 53, 768–785. [Google Scholar] [CrossRef]
  20. Lampropoulos, A.S.; Tsihrintzis, G.A. Evaluation of MPEG-7 descriptors for speech emotional recognition. In Proceedings of the 2012 8th International Conference on Intelligent Information Hiding and Multimedia Signal Processing, IIH-MSP 2012, Piraeus-Athens, Greece, 18–20 July 2012. [Google Scholar] [CrossRef]
  21. Wang, K.; An, N.; Li, B.N.; Zhang, Y.; Li, L. Speech emotion recognition using Fourier parameters. IEEE Trans. Affect. Comput. 2015, 6, 69–75. [Google Scholar] [CrossRef]
  22. Zeng, Y.; Mao, H.; Peng, D.; Yi, Z. Spectrogram based multi-task audio classification. Multimed. Tools Appl. 2019, 78, 3705–3722. [Google Scholar] [CrossRef]
  23. Popova, A.S.; Rassadin, A.G.; Ponomarenko, A.A. Emotion Recognition in Sound. In Advances in Neural Computation, Machine Learning, and Cognitive Research; Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Eds.; Springer: Cham, Switzerland, 2018; pp. 117–124. [Google Scholar] [CrossRef]
  24. Tang, X.; Lin, Y.; Dang, T.; Zhang, Y.; Cheng, J. Speech Emotion Recognition Via CNN-Transformer and Multidimensional Attention Mechanism. Speech Commun. 2025, 171, 103242. [Google Scholar] [CrossRef]
  25. Dabbabi, K.; Mars, A. Self-supervised Learning for Speech Emotion Recognition Task Using Audio-visual Features and Distil Hubert Model on BAVED and RAVDESS Databases. J. Syst. Sci. Syst. Eng. 2024, 33, 576–606. [Google Scholar] [CrossRef]
  26. Pepino, L.; Riera, P.; Ferrer, L. Emotion recognition from speech using wav2vec 2.0 embeddings. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, Brno, Czech Republic, 30 August–3 September 2021. [Google Scholar] [CrossRef]
  27. Chudasama, V.; Kar, P.; Gudmalwar, A.; Shah, N.; Wasnik, P.; Onoe, N. M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–24 June 2022. [Google Scholar]
  28. Li, B.; Fei, H.; Liao, L.; Zhao, Y.; Teng, C.; Chua, T.-S.; Ji, D.; Li, F. Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition. In Proceedings of the MM 2023: The 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023. [Google Scholar] [CrossRef]
  29. Sadok, S.; Leglaive, S.; Seguier, R. A Vector Quantized Masked Autoencoder for Speech Emotion Recognition. In Proceedings of the ICASSPW 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings, Rhodes Island, Greece, 4–10 June 2023. [Google Scholar] [CrossRef]
  30. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the ACL 2019—57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019. [Google Scholar] [CrossRef]
  31. Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef]
  32. Ho, N.-H.; Yang, H.-J.; Kim, S.-H.; Lee, G. Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network. IEEE Access 2020, 8, 61672–61686. [Google Scholar]
  33. Hu, D.; Bao, Y.; Wei, L.; Zhou, W.; Hu, S. Supervised adversarial contrastive learning for emotion recognition in conversations. arXiv 2023, arXiv:2306.01505. [Google Scholar]
  34. Satt, A.; Rozenberg, S.; Hoory, R. Efficient emotion recognition from speech using deep learning on spectrograms. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 1089–1093. [Google Scholar]
  35. Luna-Jiménez, C.; Kleinlein, R.; Griol, D.; Callejas, Z.; Montero, J.M.; Fernández-Martínez, F. A proposal for multimodal emotion recognition using aural transformers and action units on ravdess dataset. Appl. Sci. 2021, 12, 327. [Google Scholar] [CrossRef]
Figure 1. Block diagram of the proposed SER pipeline.
Figure 2. Left: Mel-spectrogram showing the frequency distribution over time for a sample from the RAVDESS dataset. Right: Chroma spectrogram representing tonal intensity across pitch classes for the same sample. Both were extracted using the Librosa library and used in the fusion step to enhance emotional tone interpretation.
Figure 3. One-dimensional Convolutional Neural Network architecture used in emotion classification.
Figure 4. Accuracy rates on RAVDESS measured across 120 epochs.
Figure 5. Model training versus testing loss for RAVDESS using the proposed model.
Figure 6. Accuracy rates on MELD measured across 120 epochs.
Figure 7. Model training versus testing loss for MELD using the proposed model.
Figure 8. The confusion matrix of the suggested 1D-CNN model on the MELD dataset.
Figure 9. Confusion matrix of the suggested 1D-CNN model on the RAVDESS dataset.
Table 1. Summary of SER-related studies.

Study | Architecture | Dataset | Key Contribution
[15] | Texture + Acoustic Fusion | SAVEE, EMO-DB, eNTERFACE’05 | Improving image features
[16] | Hybrid ELM-DT | CASIA | 89% precision
[17] | SVM + TQWT | RAVDESS, EMO-DB, SAVEE, EMOVO | 90% accuracy
[18] | 2D CNN + SVM | EMO-DB | 88.3% precision
[19] | MFCC + MSF + LDA | Recorded dataset | 85% accuracy
[20] | MPEG-7 features | EMO-DB | 83.93% precision
[22] | GResNet | RAVDESS | 65.97% accuracy
[24] | Transformer + Self-Attention | RAVDESS, CREMA-D | 82.4% (RAVDESS), 85.3% (CREMA-D)
[25] | DistilHuBERT | RAVDESS, BAVED | 90.79% (RAVDESS), 82.35% (BAVED)
[27] | M2FNet | MELD | 91.2% accuracy
[28] | Graph-Attention Multimodal model | IEMOCAP | 89.3% accuracy
[29] | VQ-MAE-S-12 | RAVDESS | 84.1% accuracy for embedded systems
Table 2. Effect of feature combination on recognition accuracy (%) for RAVDESS and MELD.

Feature Combination | Accuracy (RAVDESS) | Accuracy (MELD)
MFCC only | 85.0% | 87.1%
MFCC + Mel | 89.5% | 91.0%
MFCC + Mel + Chroma | 91.9% | 94.0%
Table 3. Performance indicators for the model presented.

Metric | RAVDESS (%) | MELD (%)
Accuracy | 91.9 | 94.0
Precision | 90.5 | 93.2
Recall | 91.1 | 93.8
F1-Score | 90.8 | 93.5
Table 4. Comparative performance results on the MELD dataset.

Study/Model | Year | Approach | Accuracy (%)
Ho et al. [32] | 2020 | RNN + multi-level fusion attention | 61.2
Chudasama et al. [27] | 2022 | M2FNet (multimodal fusion: text + audio + visual) | 66.7
Hu et al. [33] | 2023 | SACL-LSTM | 68.28
Li et al. [28] | 2023 | Attention-based multimodal | 68.28
Ciardi et al. [8] | 2023 | 2D CNN with spectrogram inputs | 82.9
Proposed Model | 2025 | 1D CNN + feature fusion | 94.00
Table 5. Comparative performance on the RAVDESS dataset.

Study/Model | Year | Approach | Accuracy (%)
Chauhan et al. [5] | 2019 | Ensemble + SVM | 75.69
Satt et al. [34] | 2017 | Deep CNN | 73
Sadok et al. [29] | 2023 | VQ-MAE-S-12 | 84.10
Jiménez et al. [35] | 2022 | xlsr-Wav2Vec2.0 | 86.70
Proposed Model | 2025 | 1D CNN + feature fusion | 91.90
Table 6. Comparison of different SER models’ size and computational cost.

Model | Params (M) | FLOPs (MF) | Input Modality
VQ-MAE-S-12 [29] | 4.3 | 250–300 | Audio
Wav2Vec2.0 [35] | 94 | >1000 | Audio
M2FNet [27] | >10 | Not mentioned | Audio + Text + Visual
Proposed Model | 1.2 | 115 | Audio (MFCC + Mel + Chroma)
