Speech Emotion Recognition: Comparative Analysis of CNN-LSTM and Attention-Enhanced CNN-LSTM Models

Bhanbhro, Jamsher; Memon, Asif Aziz; Lal, Bharat; Talpur, Shahnawaz; Memon, Madeha

doi:10.3390/signals6020022

Open AccessArticle

Speech Emotion Recognition: Comparative Analysis of CNN-LSTM and Attention-Enhanced CNN-LSTM Models

by

Jamsher Bhanbhro

^1,*

,

Asif Aziz Memon

²

,

Bharat Lal

¹

,

Shahnawaz Talpur

³

and

Madeha Memon

³

¹

DIMES Department, University of Calabria, 87036 Rende, Italy

²

Computer Science Department, Dawood University of Engineering and Technology, Karachi 75300, Pakistan

³

Computer Systems Engineering Department, Mehran University of Engineering and Technology, Jamshoro 76062, Pakistan

^*

Author to whom correspondence should be addressed.

Signals 2025, 6(2), 22; https://doi.org/10.3390/signals6020022

Submission received: 28 February 2025 / Revised: 10 April 2025 / Accepted: 25 April 2025 / Published: 9 May 2025

Download

Browse Figures

Versions Notes

Abstract

:

Speech Emotion Recognition (SER) technology helps computers understand human emotions in speech, which fills a critical niche in advancing human–computer interaction and mental health diagnostics. The primary objective of this study is to enhance SER accuracy and generalization through innovative deep learning models. Despite its importance in various fields like human–computer interaction and mental health diagnosis, accurately identifying emotions from speech can be challenging due to differences in speakers, accents, and background noise. The work proposes two innovative deep learning models to improve SER accuracy: a CNN-LSTM model and an Attention-Enhanced CNN-LSTM model. These models were tested on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), collected between 2015 and 2018, which comprises 1440 audio files of male and female actors expressing eight emotions. Both models achieved impressive accuracy rates of over 96% in classifying emotions into eight categories. By comparing the CNN-LSTM and Attention-Enhanced CNN-LSTM models, this study offers comparative insights into modeling techniques, contributes to the development of more effective emotion recognition systems, and offers practical implications for real-time applications in healthcare and customer service.

Keywords:

speech emotion recognition; CNN-LSTM; attention mechanism; deep learning; audio processing; SER

1. Introduction

Speech signals are the most natural and practical form of human communication. They convey not only linguistic information but also a wealth of non-linguistic cues, including facial expressions and emotions. In the past decade, Speech Emotion Recognition (SER) has emerged as a promising research field, enabling computers to understand the emotional nuances in speech [1,2]. SER examines paralinguistic voice quality cues (paralinguistics refers to non-verbal elements of communication, such as tone, pitch, volume, and facial expressions, which convey meaning beyond words), such as pitch, intonation, rhythm, and speech rate, to understand speakers’ emotional states. The applications of SER extend to healthcare and education, where systems can monitor emotional well-being and provide personalized support such as tailored therapy sessions or adaptive learning environments. As technology continues to evolve, SER holds the potential to enhance human–computer interaction (HCI) and deepen our understanding of human communication.

Recognizing emotions from speech signals is a complex task for several reasons. One of the main challenges is the lack of accurate and balanced speech emotion datasets, which restricts the development and evaluation of SER systems [2]. Creating high-quality speech emotion databases requires significant effort and time, as they must include a wide range of speakers across different genders, ages, languages, cultures, and emotional expressions. Additionally, emotions are often conveyed through sentences rather than individual words, making it even harder to identify emotions from speech signals. These challenges are further complicated by the need to account for cultural and linguistic differences (e.g., variations in tone or expression across languages) in emotional expression, as well as the subjective nature of emotions (i.e., how individuals perceive and express feelings differently). All these factors play a crucial role in shaping the development and performance of SER systems [1].

However, researchers are actively working to overcome these challenges, aiming to improve the accuracy and efficiency of emotion recognition from speech signals. Their efforts seek to enhance how machines understand the intricate emotional cues (e.g., subtle shifts in tone or pace) in spoken language. Beyond the words and information conveyed, the speech signal also carries the implicit emotional state (e.g., happiness or frustration) of the speaker [1]. An efficient SER system, which effectively reflects the speaker’s emotions by separating acoustic components, lays the foundation for more effective HCI. SER systems are not only essential but also hold significant scientific value in health, human–machine interactions, and various other areas like behavioral analysis and customer support. The potential of these systems in enhancing communication for individuals with speech impairments or in environments where non-verbal cues are crucial further highlights their importance in interpreting and facilitating emotional expression through alternative channels such as assistive technologies or virtual assistants.

The primary aims of this research are to develop and compare two deep learning models for SER, addressing the following research question: How do CNN-LSTM and Attention-Enhanced CNN-LSTM models differ in accuracy and generalization for emotion recognition? We developed these models to recognize emotions from the RAVDESS dataset [3]. This study includes a brief literature review of SER. The main aim is to compare two different deep learning models: one using a 2D CNN with LSTM and the other incorporating a bidirectional LSTM and an attention layer. We detail the methodology for creating and using these models from scratch, including techniques for improving model generalization by adding AWGN. This paper contributes to the field by providing a comprehensive comparison of these advanced models, demonstrating their effectiveness in emotion recognition, and offering insights into their practical implementation and potential applications in real-world scenarios.

2. Literature Review

Traditionally, machine learning models in SER have relied on hand-crafted features such as signal energy, voice pitch, entropy, crossing rate, Mel-frequency cepstral coefficients (MFCC) [4,5,6,7], and chroma-based features [8]. However, the effectiveness of these models often depends on the specific features selected, leading to uncertainty in their performance (e.g., inconsistent accuracy across datasets). Ongoing research has explored new features and algorithms that aim to capture the complex dynamics of feature sequences reflecting human emotions. However, the challenge remains in identifying the attributes most closely connected to different emotions to facilitate accurate predictions.

Since 2015, significant advancements in deep learning techniques and increased processing capacity have led to the development of more efficient end-to-end SER systems. These systems can rapidly extract information from spectrograms or raw waveforms [9,10], eliminating the need for manual feature extraction. Many studies suggest that employing deep learning (CNN and RNN) models based on spectrograms and raw waveforms can enhance SER performance [11,12,13]. Tocoglu et al. (2019) analyzed 205,000 Turkish tweets using the Tweepy Python module, finding CNNs to achieve 87% accuracy [14]. Kamyab et al. (2021) proposed an attention-based CNN and Bi-LSTM model for sentiment analysis, combining TF-IDF and GloVe word embeddings to achieve over 90% accuracy across multiple datasets [15]. Previous work in audio emotion recognition achieved lower performance. Authors in [16] proposed hybrid MFCCT-CNN features that demonstrate superior accuracy in recent studies. A multi-task deep neural network using ResNets and a gate mechanism showed potential in SER tasks with 64.68% accuracy on the RAVDESS dataset [17]. Furthermore, Hidden Markov-based SER techniques were surveyed by [18], each addressing different challenges in emotion recognition. A hybrid model combining CNN with LSTM for sentiment analysis on RAVDESS achieved 90% accuracy [19], while BERT models [20] excelled in emotion detection in text with 92% accuracy. Innovative approaches like combining DNN with GNN for audio signal emotion recognition and multimodal sentiment analysis were also explored [21]. A lightweight neural network [22], optimized for real-time emotion recognition, achieved 80% accuracy. Several other studies [3,23,24,25,26,27,28,29,30,31] have also made significant contributions. However, these studies often lack direct comparisons on the same dataset or detailed ablation studies, limiting their insights into model generalization. Creating sophisticated SER systems is no small task, as it demands a considerable amount of labeled training data. Without enough data, models may fail to capture the nuances of speech emotions accurately, while a lack of data can result in models becoming too specialized, limiting their broader applicability.

The primary objective of this study is to train and evaluate two hybrid models using the RAVDESS dataset, focusing on achieving the highest possible accuracy while ensuring robust generalization across diverse conditions. Our major contribution lies in improving model generalization and increasing accuracy through novel architectures and noise augmentation. Additionally, we aim to conduct a comprehensive comparison of the models’ performance across various training techniques and parameters to enrich insights into their efficacy and versatility, addressing a gap in prior work by validating generalization on RAVDESS and partially on SAVEE for cross-dataset insight and comparing with mainstream methods (Table 1). This research seeks to enhance both models’ optimization strategies and the broader understanding of SER systems. However, this research not only advances the technical aspects of SER but also explores the ethical implications and best practices for responsible technology use.

3. Methodology

This section explains the approach used in this study, including data collection, pre-processing, and model development. Each step details the research process to ensure the reliability of the findings.

3.1. Dataset

The publicly available Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset was chosen for this research, as it provides a wide range of emotional expressions in both audio and video formats [3]. RAVDESS, collected between 2015 and 2018, includes 1440 audio files from 24 professional actors (12 male and 12 female) expressing eight emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprise. Each emotion is recorded at two intensity levels (normal and strong), except for neutral, enhancing its suitability for SER. Its advantages include high-quality recordings and balanced gender representation, though it is limited to English and controlled settings, potentially reducing real-world variability. For this study, only audio data were used, which were extracted by isolating the speech component from the multimodal dataset, ignoring video elements. Figure 1 shows samples for each class.

3.2. Pre-Processing

This subsection outlines the pre-processing steps to prepare RAVDESS audio data for SER models. Neural Networks yield efficient results in processing images, and CNN algorithms form the backbone of this research. While RAVDESS contains audio files, processing these audio files into images was mandatory to leverage CNN capabilities. Prior to giving the data directly to the models, the dataset is initially transformed into a form suitable for model processing. Pre-processing is a very important step because it transforms the data into a suitable format. Each audio signal is processed to create an image representation called the Mel spectrogram using mathematical transformations. To mitigate noise from multimodal data (e.g., background sounds or actor variability), audio was standardized to a 16 kHz sample rate, and silent segments were trimmed. First, the audio signal is divided into short overlapping segments using a process called windowing with a 25 ms window and 10 ms overlap. Each segment undergoes a Fourier Transform to convert it into the frequency domain, where it is represented as a spectrum of frequencies. These spectra are then transformed into a Mel scale, which approximates how humans perceive pitch. This transformation helps to emphasize perceptually relevant features in the audio. The resulting Mel spectrogram is essentially a 2D image, where time is represented on the horizontal axis and frequency on the vertical axis. Additionally, color intensities or brightness in the image correspond to the magnitude of each frequency component. Figure 2 shows a Mel spectrogram sample generated after preprocessing.

Converting Audio to Mel Spectrograms The mathematics of converting audio into spectrograms is mentioned below:

x (n) = \sum_{m = 0}^{M - 1} w (m) \cdot s (n + m)

(1)

where

x (n)

is the windowed signal,

w (m)

is the window function, and

s (n + m)

is the original signal.

Each windowed segment undergoes a Fourier Transform to convert it into the frequency domain:

X (k) = \sum_{n = 0}^{N - 1} x (n) \cdot e^{- j \frac{2 π}{N} k n}

(2)

where

X (k)

is the frequency spectrum of the segment.

The frequencies are then mapped onto the Mel scale using the following formula:

M (f) = 2595 {log}_{10} (1 + \frac{f}{700})

(3)

The pre-processing phase optimizes the data for SER models by adjusting sample rates and employing noise modulation techniques (detailed in Section 3.3) to ensure the proposed models are exposed to a realistic and challenging array of audio inputs, enhancing their ability to generalize and accurately identify emotions in varied conditions, such as noisy real-world environments.

3.3. Models Description

The proposed models integrate convolutions and recurrent neural networks to effectively process and classify emotions from audio data. Data were split into 80% training, 10% validation, and 10% testing sets with temporal separation (early recordings for training, later for testing) to avoid overfitting. Here is a comprehensive explanation of how both models work, supported by mathematical background and specific parameters, with their reproducibility ensured through shared configurations.

3.3.1. Time Distributed 2D CNN-LSTM Model

The following steps represent the structure of the first model. Stack time refers to the sequential processing of time-distributed data, such as Mel spectrogram chunks, to capture temporal dependencies [13].

We designed the model as shown in Figure 3. This model starts with six time-distributed 2D CNN blocks that process segmented Mel spectrogram chunks, which are the spectra of the speech audio files obtained after pre-processing. The first two convolutional blocks apply a 5 × 5 kernel with a stride of 1 and padding of 2, while the remaining four use a 3 × 3 kernel with padding of 1. This configuration ensures that spatial dimensions are preserved throughout most convolutional operations. The initial convolution layers in each block extract local features from the input spectrograms. Batch normalization follows each convolution to stabilize and accelerate training, as described in Equation (4).

The activation function used throughout the model is ELU (Exponential Linear Unit), applied after normalization to introduce non-linearity and improve the learning of complex patterns (using Equation (5)). After each activation, max pooling is applied. The first three convolutional blocks use a 2 × 2 kernel with stride 2, while the latter three use a 4 × 4 kernel with stride 4 (Equation (6)). Dropout is applied after every pooling layer with a probability of 0.2 (Equation (7)) to prevent overfitting.

The convolutional blocks progressively increase the number of filters in the following sequence: 8, 16, 32, 64, 128, and 256. This enables the model to learn increasingly complex and abstract features from the Mel spectrograms, capturing variations relevant to different emotional expressions in speech.

After the convolutional layers, the output is flattened and passed to a fully connected dense layer. This dense layer acts as a dimensionality reducer and feature combiner before the features are forwarded to a final softmax classifier. The output layer uses a linear transformation (Equation (8)) followed by a softmax activation function (Equation (9)) to produce probabilities for each emotion class.

For example, consider a scenario in which there is an audio clip of someone speaking happily. This audio is first converted into a Mel spectrogram, highlighting the unique frequency and time characteristics of the speech. The time-distributed CNN layers analyze this spectrogram, extracting important features like intensity and pitch variations that signify happiness. These processed features are then passed through the dense layers to accurately classify the emotion as happy.

\hat{X} = \frac{X - μ}{σ}

(4)

where

μ

and

σ

are the mean and standard deviation of the batch.

f (x) = \{\begin{matrix} x & if x > 0 \\ α (e^{x} - 1) & if x \leq 0 \end{matrix}

(5)

where

α

is a small positive constant (typically 1.0).

Y_{i, j} = max_{m, n} X_{i + m, j + n}

(6)

y = x \cdot Bernoulli (p)

(7)

where p is the dropout probability (0.2).

y = W \cdot x + b

(8)

softmax (x_{i}) = \frac{e^{x_{i}}}{\sum_{j} e^{x_{j}}}

(9)

3.3.2. Stacked Time Distributed 2D CNN–Bidirectional LSTM with Attention

The second model builds on the first model Section 3.3.1 by adding a 2D LSTM and an attention layer. The primary enhancement is the addition of a bidirectional LSTM, which processes speech data in both forward and reverse directions, allowing the model to gain a deeper understanding of emotions in speech (Architecture is shown in Figure 4). Additionally, an attention mechanism is included to improve performance. This mechanism focuses on the most crucial segments of the speech, assigning weights to portions that carry significant emotional cues. This focused approach helps the model identify emotional signals more accurately, thereby improving its emotion classification accuracy. The bidirectional LSTM layer processes the input sequence in both forward and backward directions. This is represented mathematically as follows:

{\vec{h}}_{t} = LSTM (x_{t}, {\vec{h}}_{t - 1})

(10)

{\overset{\leftarrow}{h}}_{t} = LSTM (x_{t}, {\overset{\leftarrow}{h}}_{t + 1})

(11)

where

{\vec{h}}_{t}

and

{\overset{\leftarrow}{h}}_{t}

are the hidden states of the forward and backward LSTMs at time step t. The final hidden state

h_{t}

is obtained by concatenating these two states:

h_{t} = [{\vec{h}}_{t}; {\overset{\leftarrow}{h}}_{t}]

(12)

This bidirectional processing allows the model to capture dependencies from both past and future contexts, providing a more comprehensive understanding of the sequential data.

Moreover, the attention mechanism is applied to the output of the bidirectional LSTM. The attention mechanism computes a context vector c as a weighted sum of the hidden states, where the weights

α_{t}

are determined by the relevance of each hidden state:

α_{t} = \frac{exp (e_{t})}{\sum_{i = 1}^{T} exp (e_{i})}

(13)

e_{t} = tanh (W_{a} h_{t} + b_{a})

(14)

c = \sum_{t = 1}^{T} α_{t} h_{t}

(15)

where

W_{a}

and

b_{a}

are learnable parameters. The context vector c captures the most relevant information from the entire sequence, allowing the model to focus on the crucial parts of the input when making predictions.

Finally, the context vector c is passed through a dense layer, followed by a softmax activation to produce the final emotion classification probabilities:

y = softmax (W_{c} c + b_{c})

(16)

While developing proposed models, it was observed that the balanced data were causing overfitting, meaning the models were performing well on the training data but not generalizing well to new, unseen data (test). To address this issue, we introduced noise to the data, which is an important step in enhancing the model’s generalization capabilities, unlike simpler methods like LIWC, which rely on linguistic categorization [32]. The noise was added during the data augmentation stage. We used Additive White Gaussian Noise (AWGN) and uniform noise to augment the training data, unlike Google’s attention model, which focuses on sequence weighting without noise handling [33]. The process involves generating white Gaussian noise and adding it to the original signal. First, the signal and the generated noise were normalized. The normalization is based on the number of bits used to represent each sample in the audio signal, typically 16 bits for high-quality audio. This normalization and addition of noise in both models is mathematically given by the following:

norm_constant = 2^{(num_bits - 1)}

(17)

signal_norm = \frac{signal}{norm_constant}

(18)

noise_norm = \frac{noise}{norm_constant}

(19)

Next, the power of the signal and the noise was computed:

s_power = \frac{\sum {signal_norm}^{2}}{signal_len}

(20)

n_power = \frac{\sum {noise_norm}^{2}}{signal_len}

(21)

A random Signal-to-Noise Ratio (SNR) was generated within a specified range (15–30 dB) to determine the level of noise to be added. The target SNR is computed as follows:

target_snr = random . randint (snr_low, snr_high)

(22)

The noise was then scaled according to the desired SNR:

K = \sqrt{(\frac{s_power}{n_power}) \cdot 10^{- \frac{target_snr}{10}}}

(23)

Finally, the noise was added to the original signal:

noisy_signal = signal + K \cdot noise

(24)

This process was repeated multiple times to create several augmented versions of each original signal, unlike traditional methods, which may not address noise robustness. The augmented signals were then added to the training dataset, effectively increasing its size and diversity. This helped in preventing the model from overfitting by ensuring it was exposed to a wider variety of data during training, improving generalization beyond RAVDESS to potential real-world noisy conditions.

3.4. Ablation Study

To verify the contribution of key architectural components, we conducted ablation experiments on both models. In Model 2, removing the LSTM layer led to a significant drop in accuracy from 98.1% to 95%, highlighting its role in capturing temporal dependencies. Additionally, omitting the dropout layer resulted in increased overfitting, evidenced by a 15% rise in validation loss. For both models. Furthermore, excluding the bidirectional LSTM decreased accuracy to 89%, confirming its value in enhancing contextual understanding. These results validate the necessity of each component to achieve high performance and strong generalization.

4. Results and Discussion

This section evaluates the models’ performances both qualitatively and quantitatively using metrics such as loss, accuracy, precision, recall, and F1 score. Confusion matrices are used to effectively compare the classification capabilities of both models.

In this study, noise augmentation was applied to both models using two different techniques. Model 1 was trained with uniformly distributed noise added to the signal using a custom uniform noise injection method, while Model 2 was augmented using additive white Gaussian noise (AWGN), as described in the provided implementation. The augmentation methods improve generalization and reduce overfitting, all without modifying the underlying model architecture.

To ensure fair evaluation and avoid time-out issues, training was conducted on a GPU (NVIDIA RTX 3080) with a batch size of 32, completing 564 epochs within the reported times (Table 2). There are various techniques available for improving generalization, such as regularization, early stopping, and simplifying the model by removing layers or blocks. However, in this study, we opted to keep the model unchanged, using simple noise augmentation to maintain overall simplicity. Both uniform noise and AWGN contributed to a significant reduction in validation loss, suggesting improved model generalization.

Confusion matrices are essential for evaluating classification tasks. Figure 5 provides a comprehensive comparison of both models, before and after noise augmentation, as well as their final test performance. Figure 5a,b show the validation results of Model 1 before and after adding uniform noise, respectively. Initially, Model 1 struggles to differentiate between emotionally similar classes like sad and calm, as seen in Figure 5a. After uniform noise augmentation, Figure 5b shows significant improvements across most emotions, especially in reducing misclassification of neutral, which had fewer training samples.

Figure 5c,d present the validation confusion matrices for Model 2 before and after applying additive white Gaussian noise (AWGN). While Figure 5c shows relatively better initial performance compared to Model 1, confusion still exists between classes such as surprise, fear, disgust, and calm. After applying AWGN, as shown in Figure 5d, Model 2 exhibits clearer class boundaries, with improved classification across all emotional states.

Figure 5e,f display the final test set confusion matrices for both models. Model 1 (Figure 5e), trained with uniform noise, achieves a solid accuracy of 96.5%, correctly predicting dominant emotions like happy, angry and others. Model 2 (Figure 5f), trained with AWGN and enhanced with attention mechanisms, reaches a peak accuracy of 98.1%, with very few misclassifications and strong generalization across the emotional spectrum. These matrices confirm the contribution of both uniform noise and AWGN-based augmentation, along with architectural design, toward improved emotion recognition performance.

A c c u r a c y = \frac{T P + T N}{T P + T N + F P + F N}

(25)

P r e c i s i o n = \frac{T P}{T P + F P}

(26)

R e c a l l = \frac{T P}{T P + F N}

(27)

F 1 S c o r e = \frac{2 \cdot P r e c i s i o n \cdot R e c a l l}{P r e c i s i o n + R e c a l l}

(28)

These metrics provide comprehensive details about the performance of classification models, helping to assess their ability to correctly classify emotions and balance precision and recall [32].

The confusion matrices provide insight into the predictive accuracy of each model across various emotional states. Model 1, trained with uniform noise, achieved a notable accuracy of 96.5%, with the highest performance in identifying emotions such as happy, angry and others. Model 2, which was further enhanced with an attention layer and trained using AWGN, achieved an improved accuracy of 98.1%. It demonstrated stronger generalization and better distinction between closely related emotional states such as neutral and calm, while maintaining high accuracy across all emotion classes.

The comparison Table 2 highlights the training efficiency and accuracy across different epochs, presenting the steady improvement and stabilization of model performance over time. The convergence rate is also very high. The performance metrics Table 3 shows that Model 2 achieves a better balance in precision, recall, and F1 score. This improvement is due to the attention mechanism’s ability to highlight important features in the data. The detailed examination of these metrics shows that while both models perform well, the addition of attention mechanisms in Model 2 significantly enhances performance, especially in accurately classifying nuanced emotional expressions.

Figure 6 presents the training and validation losses for both Model 1 (a) and Model 2 (b) before and after the addition of noise augmentation. The figure illustrates a notable contrast in training and validation losses before and after the introduction of noise. Initially, both models displayed high training loss that gradually decreased, indicating stabilization through learning. However, after a few epochs, both models experienced an increase in validation loss, indicating potential overfitting. One important thing to note here is that the second model has smoother graphs compared to the first model. Additionally, the second model exhibits higher validation loss before the addition of noise. This suggests that the model is effectively memorizing the data but struggling with the underlying patterns. This overfitting might be attributed to the complexity of the model as the dataset is well-balanced and data mismatch is not the primary cause.

This study highlights the importance of architectural choices in deep learning models for SER, advancing the field beyond prior works (Table 1). Compared to [17] (64.68% on RAVDESS), our models achieve superior accuracy (96.5% and 98.1%) due to uniform noise, AWGN, and attention enhancements, unlike Google’s model, which lacks noise robustness [33]. The organization of CNN and LSTM blocks within a network is important, as the unique capabilities of each layer must be effectively utilized. CNN layers extract spatial features from spectrograms, while LSTM units capture the temporal dependencies essential for SER. The sequence and interaction between these blocks determine the model’s overall learning effectiveness and feature integration. Additionally, incorporating attention mechanisms can improve the network’s focus, allowing for a more detailed analysis of emotional cues. Ultimately, the configuration and composition of the network architecture play a critical role in the model’s ability to achieve high accuracy in SER tasks.

Table 1 in the literature review section presents notable prior work in the field of SER. Comparing our results with those studies shows that our proposed models, especially Model 2, perform exceptionally well in terms of accuracy and generalization.

For instance, Ref. [17] used ResNets and achieved an accuracy of 64.68% on RAVDESS. Ref. [19] applied CNN-LSTM on the same dataset and reported 90% accuracy. Similarly, Ref. [16] using CNN achieved 92%. Other recent works [21,22] achieved between 70 and 80% on different datasets using neural architectures.

In contrast, our Model 1 reached 96.5% accuracy using uniform noise augmentation, and Model 2 reached 98.1% accuracy by combining AWGN with an attention mechanism. These results confirm the effectiveness of our architecture and augmentation strategy, especially when tested on a clean and balanced dataset like RAVDESS. Furthermore, preliminary testing on the SAVEE dataset showed promising cross-dataset generalization without retraining, highlighting the robustness of our models.

As seen in several research comparisons in Table 1, many studies have shown positive results. However, the proposed models in this study stand out due to their unique architecture and generalization techniques. While previous studies have achieved good accuracy, this study models consistently perform better. Overall, models not only achieve higher accuracy but also show better generalization and robustness against overfitting, which are not prominently addressed in most of the studies mentioned earlier.

5. Conclusions

We present advanced techniques for SER by creating two hybrid models. Both models perform well despite the challenges in SER systems due to input variability. The bidirectional LSTM model offers slightly higher accuracy due to its bidirectional learning and attention mechanism, although it provides minimal additional advantages. The second model, while more complex and requiring extensive training, makes slightly better predictions. This precision can be attributed to a well-balanced dataset, strategic data augmentation to reduce losses, and careful management of convolution blocks and layers. Furthermore, the models exhibit high generalizability, which proves effective even in noisy data conditions. Selecting the right hyperparameters and using the Mel Spectrogram as input allow the models to achieve excellent results.

SER is vital for AI and research and focuses on recognizing emotions from speech signals by creating advanced models. This work utilized stacked CNN and LSTM layers to analyze Mel spectrograms. The first model used these networks effectively, while the second model incorporated a bidirectional LSTM with an attention layer to enhance feature detection.

Both models demonstrated strong classification accuracy, with the first model reaching 96.5% and the second model achieving 98.1%. Model 2’s attention mechanism improved its predictive performance but made it more complex and slower to train compared to Model 1.

This research demonstrates the significant potential of the proposed hybrid models for SER applications, though certain limitations remain. The current evaluation was conducted primarily on the RAVDESS dataset, which features English-language speech with North American accents. To further test generalizability, we also evaluated the model on the SAVEE [34] dataset, another widely used English speech emotion dataset. The model correctly classified 5 out of 7 randomly selected emotion samples (tested using our pre-trained model without retraining on SAVEE), indicating promising generalization across similar English-accented datasets. However, generalizability to non-English or multilingual contexts remains unverified. Due to linguistic and cultural variations in emotional expression, additional testing on diverse datasets is needed for broader applicability. Moreover, this study did not incorporate multimodal analysis (e.g., combining audio and video); this may limit real-world applicability in complex scenarios. Despite these constraints, both models demonstrated strong capabilities in controlled environments. This comparison underscores the importance of thoughtful architectural design. Future work will focus on refining these models, experimenting with advanced feature extraction techniques, and expanding evaluations to include larger multilingual datasets and noisy, multimodal environments in the real world, including customer service and mental health monitoring systems.

Author Contributions

J.B. gathered the materials and conducted the experiments. All of the authors conceived the experiments. Specifically, A.A.M. and B.L. dealt with designing the methodology and worked on the literature review, while S.T. and M.M. analyzed the results and thoroughly reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by PNRR project FAIR—Future AI Research (PE00000013), Spoke 9—Green-aware AI, under the NRRP MUR program funded by the NextGenerationEU.

Data Availability Statement

The dataset used in this research is the RAVDESS dataset, publicly available at accessed on 12 March 2023 https://zenodo.org/record/1188976 and https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio. Model code and configurations are available upon request from the corresponding author to ensure reproducibility.

Acknowledgments

The authors would like to thank Luigi Palopoli for scientific support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Kadiri, S.R.; Gangamohan, P.; Gangashetty, S.V.; Alku, P.; Yegnanarayana, B. Excitation Features of Speech for Emotion Recognition Using Neutral Speech as Reference. Circuits Syst. Signal Process. 2020, 39, 4459–4481. [Google Scholar] [CrossRef]
Singh, Y.B.; Goel, S. Survey on Human Emotion Recognition: Speech Database, Features and Classification. In Proceedings of the 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), Greater Noida, India, 12–13 October 2018; pp. 298–301. [Google Scholar]
Livingstone, S.; Russo, F. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef] [PubMed]
El Ayadi, M.; Kamel, M.S.; Karray, F. Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases. Pattern Recognit. 2011, 44, 572–587. [Google Scholar] [CrossRef]
Pratama, A.; Sihwi, S.W. Speech Emotion Recognition Model Using Support Vector Machine Through MFCC Audio Feature. In Proceedings of the 2022 14th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia, 18–19 October 2022; pp. 303–307. [Google Scholar]
Nancy, M.; Kumar, G.S.; Doshi, P.; Shaw, S. Audio Based Emotion Recognition Using Mel Frequency Cepstral Coefficient and Support Vector Machine. J. Comput. Theor. Nanosci. 2018, 15, 2255–2258. [Google Scholar] [CrossRef]
Prabakaran, D.; Sriuppili, S. Speech Processing: MFCC Based Feature Extraction Techniques—An Investigation. J. Phys. Conf. Ser. 2021, 1717, 012009. [Google Scholar] [CrossRef]
Nasim, A.S.; Chowdory, R.H.; Dey, A.; Das, A. Recognizing Speech Emotion Based on Acoustic Features Using Machine Learning. In Proceedings of the 2021 International Conference on Advanced Computer Science and Information Systems (ICACSIS), Depok, Indonesia, 23–25 October 2021; pp. 1–7. [Google Scholar]
Gao, M.; Dong, J.; Zhou, D.; Zhang, Q.; Yang, D. End-to-End SER Based on a One-Dimensional Convolutional Neural Network. In Proceedings of the 2019 ACM International Conference on Intelligent Autonomous Systems (ICIAI), Guilin, China, 23–25 October 2019; pp. 78–82. [Google Scholar]
Ma, X.; Wu, Z.; Jia, J.; Xu, M.; Meng, H.; Cai, L. Emotion Recognition from Variable-Length Speech Segments Using Deep Learning on Spectrograms. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3683–3687. [Google Scholar]
Neumann, M.; Vu, N.T. Improving Speech Emotion Recognition with Unsupervised Representation Learning on Unlabeled Speech. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 7390–7394. [Google Scholar]
Zhan, Y.; Yuan, X. Audio Post-Processing Detection and Identification Based on Audio Features. In Proceedings of the 2017 International Conference on Wavelet Analysis and Pattern Recognition (ICWAPR), Ningbo, China, 9–12 July 2017; pp. 154–158. [Google Scholar]
Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2227–2231. [Google Scholar]
Tocoglu, M.A.; Ozturkmenoglu, O.; Alpkocak, A. Emotion Analysis From Turkish Tweets Using Deep Neural Networks. IEEE Access 2019, 7, 183061–183069. [Google Scholar] [CrossRef]
Kamyab, M.; Liu, G.; Adjeisah, M. Attention-Based CNN and Bi-LSTM Model Based on TF-IDF and GloVe Word Embedding for Sentiment Analysis. Appl. Sci. 2021, 11, 11255. [Google Scholar] [CrossRef]
Alluhaidan, A.S.; Saidani, O.; Jahangir, R.; Nauman, M.A.; Neffati, O.S. Speech Emotion Recognition Through Hybrid Features and Convolutional Neural Network. Appl. Sci. 2023, 13, 4750. [Google Scholar] [CrossRef]
Han, T.; Zhang, Z.; Ren, M.; Dong, C.; Jiang, X.; Zhuang, Q. Speech Emotion Recognition Based on Deep Residual Shrinkage Network. Electronics 2023, 12, 2512. [Google Scholar] [CrossRef]
Nwe, T.L.; Foo, S.W.; De Silva, L.C. Speech Emotion Recognition Using Hidden Markov Models. Speech Commun. 2003, 41, 603–623. [Google Scholar] [CrossRef]
Bhanbhro, J.; Talpur, S.; Memon, A.A. Speech Emotion Recognition Using Deep Learning Hybrid Models. In Proceedings of the 2022 International Conference on Emerging Technologies in Electronics, Computing and Communication (ICETECC), Jamshoro, Pakistan, 7–9 December 2022; pp. 1–5. [Google Scholar]
Abas, A.R.; Elhenawy, I.; Zidan, M.; Othman, M. BERT-CNN: A Deep Learning Model for Detecting Emotions from Text. Comput. Mater. Contin. 2021, 71, 2943–2961. [Google Scholar]
Li, Y.; Wang, Y.; Yang, X.; Im, S.-K. Speech Emotion Recognition Based on Graph-LSTM Neural Network. EURASIP J. Audio Speech Music Process. 2023, 2023, 40. [Google Scholar] [CrossRef]
Mustaqeem; Kwon, S. Att-Net: Enhanced Emotion Recognition System Using Lightweight Self-Attention Module. Appl. Soft Comput. 2021, 102, 107101. [Google Scholar] [CrossRef]
Chuang, Z.J.; Wu, C.H. Emotion recognition from text using neural networks. Neural Comput. 2003, 15, 2047–2085. [Google Scholar]
Bourlard, H.; Morgan, N. A new approach for speech emotion recognition using HMM. IEEE Signal Process. Lett. 1996, 3, 89–91. [Google Scholar]
Kipyatkova, I. LSTM-Based Language Models for Very Large Vocabulary Continuous Russian Speech Recognition System. In Speech and Computer; Springer International Publishing: Cham, Switzerland, 2019; pp. 219–226. [Google Scholar]
Zhang, Y.; Du, J.; Wang, Z.-R.; Zhang, J. Attention Based Fully Convolutional Network for Speech Emotion Recognition. arXiv 2018, arXiv:1806.01506. [Google Scholar]
Meher, S.S.; Ananthakrishna, T. Dynamic Spectral Subtraction on AWGN Speech. In Proceedings of the 2015 2nd International Conference on Signal Processing and Integrated Networks (SPIN), Noida, India, 19–20 February 2015; pp. 92–97. [Google Scholar]
Mujaddidurrahman, A.; Ernawan, F.; Wibowo, A.; Sarwoko, E.A.; Sugiharto, A.; Wahyudi, M.D.R. Speech Emotion Recognition Using 2D-CNN with Data Augmentation. In Proceedings of the 2021 International Conference on Software Engineering & Computer Systems and 4th International Conference on Computational Science and Information Management (ICSECS-ICOCSIM), Pekan, Malaysia, 24–26 August 2021; pp. 685–689. [Google Scholar]
Zhu, Z.; Dai, W.; Hu, Y.; Li, J. Speech Emotion Recognition Model Based on Bi-GRU and Focal Loss. Pattern Recognit. Lett. 2020, 140, 358–365. [Google Scholar] [CrossRef]
Phan, H.; Hertel, L.; Maass, M.; Mertins, A. Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 3653–3657. [Google Scholar]
Asiya, U.A.; Kiran, V.K. Speech Emotion Recognition—A Deep Learning Approach. In Proceedings of the 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud), Palladam, India, 11–13 November 2021; pp. 867–871. [Google Scholar]
Pennebaker, J.W.; Francis, M.E.; Booth, R.J. Linguistic Inquiry and Word Count: LIWC 2001; Lawrence Erlbaum Associates: Mahwah, NJ, USA, 2001. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
Haq, S.; Jackson, P.J.B. Speaker-dependent audio-visual emotion recognition. In Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP’08), Norwich, UK, 10–13 September 2009. [Google Scholar]

Figure 1. RAVDESS dataset.

Figure 2. Mel Spectrogram Representation of the audio signal.

Figure 3. Model 1: Proposed system schema showing 2D CNN-LSTM architecture.

Figure 4. Model 2: Proposed system schema showing 2D CNN-BiLSTM architecture.

Figure 5. Confusionmatrices showing performance before and after applying uniform noise for both models, along with final test set performance. Each subfigure is labeled (a) through (f) to indicate the following: (a) Model 1 validation before noise; (b) Model 1 validation after noise; (c) Model 2 validation before noise; (d) Model 2 validation after noise; (e) Model 1 final test performance; (f) Model 2 final test performance.

Figure 6. Training and validation loss curves for both models. (a) Model 1 uniform noise; (b) Model 2 AWGN.

Table 1. Overview of significant studies in Speech Emotion Recognition (SER).

Study Ref.	Technique	Data Used	Key Findings	Accuracy
[9,10]	End-to-end SER	Various	Swift information extraction; no manual features	-
[11,12,13]	CNN-LSTM, others	Various	Enhanced SER performance	-
[14]	CNN	Turkish tweets	Excelling in text analysis	87%
[15]	LSTM, CNN	Various	Improved sentiment analysis	94%
[16]	CNN	Audio	Novel approach; modest success	63%
[17]	ResNets	Various	Promising in SER tasks	70+%
[19]	CNN-LSTM	RAVDESS	High accuracy in sentiment analysis	90%
[20]	BERT	Text emotion	Excellent in text-based detection	92+%
[21]	DNN, GNN	Audio, multimodal	Innovative in audio signal recognition	70–88%
[22]	NN	RAVDESS	Real-time emotion recognition	80%

Table 2. A comparison of the training epochs, time, and accuracy of both models.

Models	Epochs	Training Time (Minutes)	Accuracy
Model 1	60	20	60%
	130	43	80%
	200	71	96.5%
Model 2	60	31	67%
	130	68	85%
	200	103	98.1%

Table 3. Comparing precision, recall, and F1 score for both models, along with overall accuracy.

Class	Precision		Recall		F1-Score		Accuracy (%)
	Model 1	Model 2	Model 1	Model 2	Model 1	Model 2
Surprise	1.00	1.00	1.00	0.95	1.00	0.97	96.5 (M1) 98.1 (M2)
Neutral	1.00	1.00	1.00	1.00	1.00	1.00
Calm	0.90	0.95	0.95	1.00	0.92	0.97
Happy	0.95	1.00	0.95	1.00	0.95	1.00
Sad	0.95	1.00	0.95	1.00	0.95	1.00
Angry	0.90	0.95	0.95	1.00	0.92	0.97
Fear	1.00	0.95	1.00	1.00	1.00	0.97
Disgust	1.00	0.95	0.95	0.90	0.97	0.92

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Bhanbhro, J.; Memon, A.A.; Lal, B.; Talpur, S.; Memon, M. Speech Emotion Recognition: Comparative Analysis of CNN-LSTM and Attention-Enhanced CNN-LSTM Models. Signals 2025, 6, 22. https://doi.org/10.3390/signals6020022

AMA Style

Bhanbhro J, Memon AA, Lal B, Talpur S, Memon M. Speech Emotion Recognition: Comparative Analysis of CNN-LSTM and Attention-Enhanced CNN-LSTM Models. Signals. 2025; 6(2):22. https://doi.org/10.3390/signals6020022

Chicago/Turabian Style

Bhanbhro, Jamsher, Asif Aziz Memon, Bharat Lal, Shahnawaz Talpur, and Madeha Memon. 2025. "Speech Emotion Recognition: Comparative Analysis of CNN-LSTM and Attention-Enhanced CNN-LSTM Models" Signals 6, no. 2: 22. https://doi.org/10.3390/signals6020022

APA Style

Bhanbhro, J., Memon, A. A., Lal, B., Talpur, S., & Memon, M. (2025). Speech Emotion Recognition: Comparative Analysis of CNN-LSTM and Attention-Enhanced CNN-LSTM Models. Signals, 6(2), 22. https://doi.org/10.3390/signals6020022

Article Menu

Speech Emotion Recognition: Comparative Analysis of CNN-LSTM and Attention-Enhanced CNN-LSTM Models

Abstract

1. Introduction

2. Literature Review

3. Methodology

3.1. Dataset

3.2. Pre-Processing

3.3. Models Description

3.3.1. Time Distributed 2D CNN-LSTM Model

3.3.2. Stacked Time Distributed 2D CNN–Bidirectional LSTM with Attention

3.4. Ablation Study

4. Results and Discussion

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI