Article

Performance Improvement of Speech Emotion Recognition Systems by Combining 1D CNN and LSTM with Data Augmentation

Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(11), 2436; https://doi.org/10.3390/electronics12112436
Submission received: 2 May 2023 / Revised: 25 May 2023 / Accepted: 25 May 2023 / Published: 27 May 2023
(This article belongs to the Special Issue Recent Advances in Data Science and Information Technology)

Abstract

In recent years, the increasing popularity of smart mobile devices has made the interaction between devices and users, particularly through voice, more crucial. By enabling smart devices to better understand users’ emotional states through voice data, it becomes possible to provide more personalized services. This paper proposes a novel machine learning model for speech emotion recognition called CLDNN, which combines convolutional neural networks (CNN), long short-term memory neural networks (LSTM), and deep neural networks (DNN). To design a system that closely resembles the human auditory system in recognizing audio signals, this article uses the Mel-frequency cepstral coefficients (MFCCs) of the audio data as the input of the machine learning model. First, the MFCCs of the voice signal are extracted. Local feature learning blocks (LFLBs) composed of one-dimensional CNNs are then employed to calculate the feature values of the data. As audio signals are time-series data, the resulting feature values from the LFLBs are fed into the LSTM layer to enhance learning at the time-series level. Finally, fully connected layers are used for classification and prediction. The experimental evaluation of the proposed model utilizes three databases: RAVDESS, EMO-DB, and IEMOCAP. The results demonstrate that the LSTM model effectively models the features extracted by the 1D CNN due to the time-series characteristics of speech signals. Additionally, the data augmentation method applied in this paper proves beneficial in improving the recognition accuracy and stability of the system across the different databases. Furthermore, according to the experimental results, the proposed system achieves superior recognition rates compared to related research in speech emotion recognition.

1. Introduction

Over the years, there has been significant progress in the field of speech emotion recognition, starting from the era of digital voice transmission to the emergence of self-attention mechanisms. In the early stages, with the advent of digital voice transmission, researchers focused on developing adaptive playout delay control mechanisms for packetized audio transmitted over the internet. An example of such work is the paper [1] by Marco Roccetti et al. in May 2001, which aimed to address the challenges of transmitting audio efficiently over the internet. Over the last two decades, however, the research topics of voice media have shifted from the internet to terminal devices due to the progress of computer hardware. In the current decade, the research topic receiving significant attention in the field of voice media is speech recognition.
As technology advanced, researchers recognized the importance of understanding and recognizing emotions conveyed through speech. They realized that speech carries valuable emotional cues that can enhance human–computer interaction and communication systems. Consequently, the field of speech emotion recognition began to gain attention. One notable advancement in this area is the incorporation of self-attention mechanisms. Self-attention mechanisms are a type of neural network architecture that enables the model to focus on different parts of the input sequence when processing information. This attention mechanism allows the model to weigh the importance of different speech features or context, which has proven valuable for speech emotion recognition tasks. In the context of self-attention mechanisms, using LSTM (long short-term memory) neural networks can be particularly beneficial for speech emotion recognition. This is because speech emotion recognition involves addressing a time-series problem, where long sentences or utterances are used to determine the emotional content of speech. LSTM networks are a type of recurrent neural network (RNN) that excel in processing sequential data over time. They are specifically designed to handle the challenge of capturing long-term dependencies in sequential data, making them suitable for tasks involving time-series analysis, such as speech emotion recognition.
In summary, the journey of speech emotion recognition has evolved from addressing the challenges of digital voice transmission to recognizing the importance of emotional cues in speech. The introduction of self-attention mechanisms has further advanced the field, allowing for improved recognition and understanding of emotions conveyed through speech.
Research on speech recognition has hence attracted more and more attention recently due to the aforementioned history of voice media and the fact that speech is vital for human communication and is considered one of our fundamental forms of expression. Moreover, a speech signal contains a lot of information, such as semantics [2], speaker identity [3], language type [4], speaker emotion [5,6,7,8,9,10,11,12], and so on. Among them, emotion recognition from speech is a technology that many organizations and companies have been eager to develop in recent years, for example, Amazon Web Services (AWS) [13], Google [14], and NVIDIA [15]. Automatic speech emotion recognition has thus gained increasing importance in this era.
Automatic speech emotion recognition [16] can be performed using artificial intelligence technology that identifies emotional states, such as sadness, happiness, anger, and anxiety, from speech by analyzing features from tone, intonation, and rhythm, etc. The applications of speech emotion recognition technology are very extensive. For example, in the customer service industry [17], it can quickly determine whether customers are satisfied and whether they need more support by identifying the emotional state from their speech. In the medical field [18], speech emotion recognition technology can help doctors better understand the emotional state of patients and provide more humane treatment plans. In the entertainment industry [19], speech emotion recognition technology can be applied in gaming, music, and other scenarios to provide more personalized and immersive experiences by recognizing the emotional state of players or audiences.
A new trend in the applications of speech emotion recognition is speech emotion detection on portable devices. With the global popularity of smart mobile devices, speech has become an indispensable means of interaction between the devices and users. If a terminal device can recognize the user’s current emotional state from speech signals, the interaction can move beyond repetitive and robotic responses toward a more humane and personalized experience. This not only enhances the enjoyment of interaction between users and terminal devices but also enables the terminal device to provide more active and personalized services.
However, there are many challenges in speech emotion recognition technology. The diversity of speech signals produced by different people in different languages affects the accuracy of recognition. In addition, speech signals are often subject to environmental noise, microphone distortion, and other interference, which can make speech emotion recognition even more difficult. The most significant challenge to be overcome is the subjectivity of emotions. Different individuals may perceive and interpret different emotions from the same speech signal, making it challenging to achieve consistent and accurate emotion recognition across different users. Consequently, this paper aims to develop a speech emotion recognition system with better accuracy and more stable performance by combining a convolutional neural network (CNN) and long short-term memory (LSTM) neural networks with a data augmentation technique. It is worth mentioning that the convolutional neural network is a neural network architecture and a machine learning method that is widely applied in many applications; see, for example, the papers [20,21,22,23,24].
Convolutional neural network architectures can be divided into two types: one-dimensional (1D) convolution and two-dimensional (2D) convolution. For 2D convolution, the speech data need to undergo a transformation from the time domain to the frequency domain before being used as input for convolution. Hence, the calculations of 2D convolution are more complicated than those of 1D convolution. Considering the computation cost and power consumption of wearable or portable smart devices, this paper proposes a machine learning model based on a 1D convolutional neural network for speech emotion recognition. Compared with other related studies, the experimental results show that the method proposed in this paper has a higher recognition rate and higher data generalizability.
This paper is organized as follows. Section 2 introduces related work on speech emotion recognition. Subsequently, the databases used for the experiments in this paper, the proposed feature extraction method, and the data augmentation method are introduced in Section 3. Section 4 then describes the proposed model for modeling speech emotion signals; the experimental results are also presented in this section. Section 5 discusses the experimental results of Section 4. Finally, conclusions and directions for future research on speech emotion recognition are given in Section 6.

2. Related Works

This section will discuss the related papers in the area of speech emotion recognition. The paper [25] proposes a model, called convolutional long short-term memory fully connected deep neural networks (ConvLSTM-FCN), for processing time-series data such as speech signals or video sequences. The ConvLSTM-FCN model combines the distinctive features of convolutional neural networks and long short-term memory models and uses a fully connected deep neural network for classification.
The paper [5] proposes a method for speech emotion recognition based on ResNet. Firstly, each speech datum is segmented into multiple frames. Then, the K-means algorithm is used to cluster the frames, and each frame is replaced by the centroid of the cluster to which it is assigned. The one-dimensional audio signals are transformed into a two-dimensional spectrum using the short-time Fourier transform, and the ResNet-101 model is trained on the frequency-domain data. Finally, a bidirectional long short-term memory neural network is used to model the time-series features of the frames. This work uses the RAVDESS, EMO-DB, and IEMOCAP databases for training and testing and achieves accuracies of 77.02%, 85.57%, and 72.25%, respectively. In [6], the authors preprocess the speech signals with noise-removal algorithms. Then, the speech signals are converted into two-dimensional spectrograms as inputs for training the convolutional neural networks. The paper employs the RAVDESS and IEMOCAP databases for training and testing the proposed model and reports accuracies of 81.75% and 79.50% for the RAVDESS and IEMOCAP datasets, respectively. The paper [7] proposes a speech emotion recognition method that utilizes support vector machines (SVM). The method extracts the Mel-frequency cepstral coefficients of the speech data as feature values and uses support vector machine algorithms to perform non-linear classification. The accuracies obtained by this method on the RAVDESS and EMO-DB databases are 75.69% and 92.45%, respectively.
In paper [8], MFCCs, Log-power on a Mel Spectrogram, and a chromagram are used as features for the speech signals. These features are then fed into residual bidirectional LSTM (RBLSTM) and attention-based multi-learning model (ABMD) models for training and classification. The RBLSTM model consists of residual bidirectional LSTM blocks and multi-head attention mechanism blocks. This architecture enables the model to capture essential features in the time-series data and analyze the dependencies between these features across different time lengths. The ABMD model uses residual dilated causal convolution blocks and dilated convolution layers with multi-head attention to analyze the global correlations between features in a parallel manner. This speech emotion recognition method achieves an accuracy of 85.89% on the RAVDESS database and 95.93% on the EMO-DB database.
A speech emotion recognition method that first performs data augmentation on the speech signal is proposed in [9]. This paper uses the Left Mel Spectrogram segment, Right Mel Spectrogram segment, Mid Mel Spectrogram segment, and Side Mel Spectrogram segment to obtain Mel Spectrograms of different segment levels. These Mel Spectrograms are then combined into multi-dimensional data and fed into 2D Local Feature Learning Blocks for local feature extraction. Finally, two LSTM layers are applied for feature learning on the time-series level. The experimental results show that the model trained and tested on the IEMOCAP database achieves an average accuracy of 88.80%.
The aforementioned studies focus mainly on the exploration of effective features. Other studies have focused on developing innovative models for speech emotion recognition. The paper [10] proposes 1D and 2D convolutional neural networks combined with LSTM and compares their performance. Both deep learning models consist of the same number of local feature learning blocks and LSTM layers. The 1D CNN-LSTM model uses raw speech data as input, while the 2D CNN-LSTM model uses the Log-power of the speech signal on the Mel Spectrogram as input. When trained and tested on the EMO-DB database, the 2D CNN-LSTM model achieved an accuracy of 95.89%, which is 9.16% higher than that of the 1D CNN-LSTM model. Similarly, the paper [11] proposes a ConvLSTM neural network model based on a two-dimensional CNN. Unlike the general CNN-LSTM architecture, ConvLSTM directly applies the convolutional layer as part of the LSTM layer at each time step. The model proposed in that study achieved an accuracy of 80.00% on the RAVDESS database and 75.00% on the IEMOCAP database.
Finally, to improve the recognition rates, the data augmentation technique is promising. Paper [26] discusses how to deal with class imbalance in datasets when applying deep learning for classification tasks. One approach is to use data augmentation techniques to increase the diversity of the dataset and improve the model’s performance on minority classes. Based on the experimental results presented in [26], the utilization of data augmentation techniques can result in a significant improvement in the training efficiency of the model. In fact, the results suggest that incorporating data augmentation can lead to more than a two-fold enhancement in training efficiency. However, data augmentation techniques need to be adjusted and optimized for different tasks and data types. Appropriate evaluation and validation are also required. Additionally, data augmentation may lead to overfitting issues, so it is necessary to control the scope and degree of data augmentation. An LDA-based data augmentation algorithm for acoustic scene classification is proposed in [27]. This algorithm transforms the raw audio data into feature vectors in LDA space and generates new data samples with differentiated features. This can expand the dataset. Experimental results show that by using the LDA algorithm for data augmentation, the performance of acoustic scene classification can be improved. Paper [28] proposes a speech emotion recognition method that uses data augmentation. Firstly, the database is expanded using data augmentation techniques such as pitch shift, adding white noise, and time stretch. Then, seven speech features are extracted. These features include MFCCs, delta MFCCs, delta-delta MFCCs, spectral contrast, Mel Spectrogram, chromagram, and tonnetz. These seven features are fused into feature sequences and used as input for the CNN model. The speech emotion recognition method proposed in the paper achieves accuracies of 90.6% and 96.7% on the RAVDESS and EMO-DB databases, respectively.
In order to address the room for improvement in the recognition rate observed in previous studies, this paper proposes modifications to the CLDNN (Convolutional, LSTM, and Deep Neural Network) architecture proposed in [25]. The objective is to enhance the recognition rate by integrating one-dimensional CNN, LSTM, and DNN components. By combining these different layers, the model aims to leverage the strengths of each component, such as the CNN’s ability to capture local patterns, the LSTM’s ability to model temporal dependencies, and the DNN’s ability to learn complex representations. The goal is to achieve better accuracy and performance in speech emotion recognition compared to previous approaches.

3. Database and Feature Extraction

3.1. Databases

Three databases are used for model training and validation in the experiments of this paper: RAVDESS [29], EMO-DB [30], and IEMOCAP [31]. In order to ensure the reliability of the machine learning model, the Scikit-Learn package is used to randomly select 10% of the dataset as test data, and the remaining 90% of the data are randomly divided into training and validation data at a ratio of 9:1. This paper adopts 10-fold cross-validation to train and test the model. Figure 1 shows the cross-validation process used in this paper.
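For readers who wish to reproduce this splitting scheme, the following is a minimal sketch using Scikit-Learn. The array names X and y, the random seeds, and the placeholder data are hypothetical, and the exact way the hold-out split interacts with the 10-fold loop follows our reading of Figure 1 rather than code released by the authors.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

# Hypothetical placeholders: X holds one feature vector per utterance,
# y holds the integer emotion labels (e.g., 8 classes for RAVDESS).
X = np.random.rand(2452, 20)
y = np.random.randint(0, 8, size=2452)

# Randomly select 10% of the dataset as test data.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.10, random_state=0)

# Randomly divide the remaining 90% into training and validation data at a ratio of 9:1.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.10, random_state=0)

# 10-fold cross-validation over the non-test data, as illustrated in Figure 1.
for fold, (tr_idx, va_idx) in enumerate(KFold(n_splits=10, shuffle=True).split(X_rest)):
    X_tr, y_tr = X_rest[tr_idx], y_rest[tr_idx]
    X_va, y_va = X_rest[va_idx], y_rest[va_idx]
    # train and evaluate the model on this fold here
    pass
```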
RAVDESS (The Ryerson Audio-Visual Database of Emotional Speech and Song) is a database for speech and expression emotion recognition created by Ryerson University’s Voice Research Lab in Canada. The RAVDESS database contains audio and video samples recorded from 24 actors (12 male and 12 female) and includes 8 different emotions (angry, calm, disgust, fear, happy, neutral, sad, and surprised). The audio data are recorded in English and consist of both sung and spoken utterances. In total, there are 2452 speech samples in the database. The number and proportion of data for each emotion label in the RAVDESS database are shown in Table 1.
The EMO-DB (Berlin Database of Emotional Speech) database is a speech emotion database created by a research group at the Technical University of Berlin in 1997. The database contains audio data recorded from 10 German actors (5 male and 5 female) in 7 emotional states (angry, boredom, disgust, fear, happy, neutral, and sad). In total, there are 535 speech samples in this database. The audio data in this database were recorded in German and have been widely used for training and testing in different speech emotion recognition studies. Table 2 shows the number and proportion of data for each emotional label in the EMO-DB database.
IEMOCAP (Interactive Emotional Dyadic Motion Capture) is a commonly used database for speech emotion recognition studies, created by the Speech and Communication Technology Lab at the University of Southern California. The database consists of English speech recordings from 10 professional actors (5 male and 5 female) that include acted scenes (such as arguments and negotiations); the recordings are labeled with emotion intensity scores from 1 to 5 and are synchronized with facial expression data. This database is highly suitable for training and testing models for speech emotion recognition and multimodal emotion analysis. In this study, we refer to [5] and use the angry, happy, neutral, and sad data from the IEMOCAP database for model training and testing. The dataset consists of a total of 5531 audio samples. The number and proportion of data for each emotion label in the IEMOCAP database are shown in Table 3.
The raw audio data from these databases will be converted into Mel-frequency cepstral coefficients (MFCCs). Then, the MFCCs will be used as the input of the neural network model for training.

3.2. Data Augmentation

When training a machine learning model, an insufficient amount of data can directly degrade the trained model. Therefore, increasing the amount of effective training and testing data through data augmentation can help the machine learning model learn more useful features and avoid under-fitting. In this paper, we apply a data augmentation algorithm to expand the total amount of audio data. The researchers refer to the results presented in [27,28] to compare various data augmentation methods and, based on these comparisons, choose two relatively simple methods with good results: adding noise and shifting pitch. In this paper, the noise ratio is set to 0.035. The speech signal with added noise is calculated in two steps. First, the maximum value of the speech signal is multiplied by the noise ratio to obtain the noise amplitude. Then, noise of this amplitude is added to the original speech signal to obtain the speech signal with added noise. For pitch shifting, this paper applies the Fourier transform to the original speech signal. Then, the frequency of the frequency-domain signal is multiplied by a logarithmic phase factor of value 0.7. Finally, the signal is transformed back to the time domain through the inverse Fourier transform to achieve pitch shifting.
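The snippet below is a minimal sketch of these two augmentation steps in Python. The Gaussian form of the added noise, the file name speech.wav, and the use of librosa’s resampling-based pitch_shift (with n_steps = -6, roughly corresponding to multiplying frequencies by 0.7) are assumptions on our part; the paper itself describes a Fourier-domain implementation with a factor of 0.7.

```python
import numpy as np
import librosa

def add_noise(signal, noise_ratio=0.035):
    # Noise amplitude = noise ratio x maximum value of the signal, as described above.
    noise_amp = noise_ratio * np.max(signal)
    # The paper does not state the noise distribution; white Gaussian noise is assumed here.
    return signal + noise_amp * np.random.normal(size=signal.shape)

def shift_pitch(signal, sr, n_steps=-6):
    # Stand-in for the paper's Fourier-domain shift with a factor of 0.7
    # (log2(0.7) * 12 is approximately -6 semitones).
    return librosa.effects.pitch_shift(y=signal, sr=sr, n_steps=n_steps)

signal, sr = librosa.load("speech.wav", sr=None)   # hypothetical file name
augmented = [
    signal,                              # raw signal
    add_noise(signal),                   # added noise
    shift_pitch(signal, sr),             # shifted pitch
    shift_pitch(add_noise(signal), sr),  # added noise and shifted pitch
]
```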
After applying the data augmentation, the researchers obtain audio signals with added noise, with adjusted pitch, and with both added noise and adjusted pitch. This increases the total amount of audio data to four times the original. Figure 2 shows examples of the signals resulting from the data augmentation methods used in this paper.

3.3. Feature Value

In this work, the MFCCs (Mel-frequency cepstral coefficients) [12] are extracted from the speech signals to represent the audio data. By extracting the MFCCs, the audio signals are transformed into a set of feature values that mimic the human auditory system’s response to different frequency components in the speech signals. MFCCs are the most commonly used speech features for speech recognition and speaker recognition. The extraction process of MFCCs is shown in Figure 3.
The steps of extracting MFCCs are briefly introduced as follows.
Step 1 
Pre-emphasis enhances the high-frequency part of the speech signal. This step simulates the human ear’s automatic gain for high-frequency waves. Equation (1) gives the calculation of pre-emphasis.
$$ S(n) = \mathrm{Data}(n) - 0.95\,\mathrm{Data}(n-1), \qquad 1 \le n \le \mathrm{Length}(\mathrm{Data}), \tag{1} $$
in which $\mathrm{Data}(n)$ is the original signal and $S(n)$ is the resulting signal after pre-emphasis.
Step 2 
Framing helps the designed system to better analyze the relationship between the signal and its variation over time. The length of each frame is set to 256 points, and the overlap rate between frames is set to 50% in this study.
Step 3 
To reduce the interference caused by discontinuities between frames when the fast Fourier transform is applied, every frame is multiplied by the Hamming window in Equation (2).
$$ W(n) = \begin{cases} 0.54 - 0.46\cos\!\left(\dfrac{2 n \pi}{N-1}\right), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \tag{2} $$
Step 4 
The logarithmic power is obtained by passing the spectrum computed by the Fourier transform through the triangular filter bank in Equation (3).
$$ B_m(k) = \begin{cases} \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\[2mm] \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\[2mm] 0, & \text{otherwise} \end{cases} \qquad 1 \le m \le M, \tag{3} $$
in which $M$ is the number of filters and $f(m)$ is the center frequency of the $m$-th filter.
Step 5 
The discrete cosine transform of the logarithmic power is used to obtain the MFCCs of audio signals based on Equations (4) and (5).
$$ Y(m) = \log\!\left( \sum_{k=f(m-1)}^{f(m+1)} \left| X(k) \right|^2 B_m(k) \right) \tag{4} $$
$$ C_x(n) = \frac{1}{M} \sum_{m=1}^{M} Y(m) \cos\!\left( \frac{n \pi \left( m - \tfrac{1}{2} \right)}{M} \right) \tag{5} $$
in which $X(k)$ is the spectrum of each frame of speech and $C_x(n)$ are the resulting MFCCs.
The data dimension after MFCC extraction is 20 in this paper. To facilitate the training of 1D CNNs, the researchers reshape each MFCC into a one-dimensional data format using the reshape function of the NumPy package in Python.
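As a minimal sketch of this feature extraction pipeline, the snippet below uses librosa as a stand-in for the authors’ own MFCC code. The file name, the Hamming window, the 256-point frame with 50% overlap, and the final flattening with NumPy follow the description above, while everything else (e.g., the handling of the first sample in pre-emphasis) is an assumption.

```python
import numpy as np
import librosa

signal, sr = librosa.load("speech.wav", sr=None)   # hypothetical file name

# Pre-emphasis as in Equation (1): S(n) = Data(n) - 0.95 * Data(n - 1).
emphasized = np.append(signal[0], signal[1:] - 0.95 * signal[:-1])

# 20 MFCCs per frame; 256-point frames with 50% overlap and a Hamming window,
# as described in Steps 2 and 3.
mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=20,
                            n_fft=256, hop_length=128, window="hamming")

# Reshape the MFCC matrix into a one-dimensional format for the 1D CNN.
mfcc_1d = mfcc.reshape(-1)
```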

4. Methods and Experiments

In this section, the two models used in this paper for speech emotion recognition are introduced. The first is a 1D CNN-DNN model, and the second is a 1D CLDNN model. Some experiments are conducted, and the numerical results are presented in this section.
Moreover, in the experiments, this paper uses the NVIDIA GeForce RTX 3090 graphics card for ML training and testing, and the Intel i7-10700 processor is used for the data augmentation and MFCCs extraction. The experimental environment is shown in Table 4.

4.1. CNN-DNN Model

The architecture and training process of the CNN-DNN model applied in this paper is shown in Figure 4. Mel-frequency cepstral coefficients are extracted from the audio data after the data augmentation and are used as the input of the model. The model calculates feature maps using five LFLBs composed of one-dimensional CNN and then uses one hidden layer to extract feature values. Finally, the Softmax function is used as the activation function of the output layer to obtain the classification results.
The zero-padding method is applied to all convolutional layers. The activation function Rectified Linear Unit (ReLU) is used in the experiments. Moreover, each convolutional layer is equipped with a max pooling layer and a batch normalization layer to help the feature values converge and avoid gradient vanishing. The parameters of the CNN-DNN model used in this paper are shown in Table 5.
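The paper does not state which deep learning framework was used; the sketch below expresses the Table 5 configuration in Keras as one possible realization. The input length of 20, the "same" padding of the pooling layers, and the Adam optimizer are assumptions not specified in the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

def lflb(x, filters, kernel_size, pool_size):
    # Local feature learning block: Conv1D (zero padding, ReLU) + BatchNorm + MaxPooling.
    x = layers.Conv1D(filters, kernel_size, strides=1, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    return layers.MaxPooling1D(pool_size=pool_size, strides=2, padding="same")(x)

num_classes = 8                      # e.g., 8 RAVDESS emotion labels
inputs = keras.Input(shape=(20, 1))  # 20 MFCC values reshaped to 1D (assumed input length)

x = lflb(inputs, 256, 5, 5)          # LFLB 1
x = lflb(x, 128, 5, 5)               # LFLB 2
x = lflb(x, 128, 5, 5)               # LFLB 3
x = lflb(x, 64, 3, 5)                # LFLB 4
x = lflb(x, 64, 3, 3)                # LFLB 5
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)
x = layers.BatchNormalization()(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)

cnn_dnn = keras.Model(inputs, outputs)
cnn_dnn.compile(optimizer="adam",                       # optimizer is an assumption
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
```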
To observe the impact of the data augmentation on model training, the researchers train the same model architecture with the same parameters on two different training datasets: the raw audio data and the augmented audio data. The 10-fold cross-validation method is used to verify the performance of the designed model. Taking the RAVDESS database as an example, training with the raw data achieved an average accuracy of 63.41% with a standard deviation of 1.34%. After the data augmentation, the average accuracy increased to 91.87% with a smaller standard deviation of 1.00%. This is a 28.46% improvement in accuracy over using the raw audio data. The experimental results indicate that the training data can significantly affect the performance of the CNN-DNN model, and that the data generated by the augmentation conducted in this paper are effective for model training. The confusion matrix of the experimental results for the RAVDESS database using the raw audio data to train the CNN-DNN model is shown in Figure 5, and the confusion matrix of the experimental results using the augmented data for training is shown in Figure 6. It can be seen from these two figures that the recognition rate for each emotion also increases after data augmentation.
For the database EMO-DB, the researchers conduct the same experiments as that for RAVDESS. The EMO-DB database contains only 535 samples. After the data augmentation, the total number of samples increases to 2140. By conducting 10-fold cross-validation, the model achieved an average accuracy of 88.22% with a standard deviation of 6.3%. Figure 7 shows the confusion matrix of the experimental results for the EMO-DB database.
Subsequently, the same experiments as those for RAVDESS and EMO-DB databases are conducted on IEMOCAP database. The IEMOCAP database includes four different emotion labels and contains 5531 raw audio samples. After data augmentation, a total number of 22,124 audio samples are obtained. Although it has the least number of emotion labels among the three databases, it has the largest amount of data. By conducting 10-fold cross-validation experiments, the model achieved an average accuracy of 91.04% with a standard deviation of 0.6%. Figure 8 shows the confusion matrix for the experiments on IEMOCAP database.
Without considering the speaker or language, the differences between the three databases used in this paper are as follows. The RAVDESS database has different pronunciations (singing and speaking), the most emotion categories, and a more evenly distributed sample proportion. The EMO-DB database has the smallest total number of samples and a less uniform distribution. The IEMOCAP database has the largest amount of data with only four emotion categories. The CNN-DNN model achieved an average accuracy of around 90% for all three databases in cross-validation. In other words, with a sufficient data size, the CNN-DNN model has a certain degree of generalizability to different databases. To further enhance the performance of the designed model, the researchers add a long short-term memory neural network to the CNN-DNN model to strengthen feature learning at the time-series level.

4.2. CLDNN Model

Convolutional neural networks focus more on learning local features. However, since audio is temporal sequence data, training with recurrent neural networks can help the model learn features over time. Hence, a CLDNN model is proposed in this paper by adding a long short-term memory (LSTM) network between the convolutional layers and the hidden layers. This enables the model to analyze the relationship between the feature values and their variation over time during the training process. The architecture and training process of the model are shown in Figure 9.
In this paper, the output dimension of the LSTM layer is set to 50, and the parameters of the CLDNN model are listed in Table 6.
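Again assuming Keras as the framework, the following sketch is one reading of Table 6 and Figure 9: the five Conv1D layers feed an LSTM with output dimension 50, whose final hidden state feeds the dense layers. Table 6 lists a Flatten layer before the LSTM; here the LSTM consumes the time-distributed feature maps directly, which is an interpretation on our part, and the input length and optimizer are again assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 8
inputs = keras.Input(shape=(20, 1))   # 20 MFCC values reshaped to 1D (assumed input length)

x = layers.Conv1D(256, 5, strides=1, padding="same", activation="relu")(inputs)
x = layers.Conv1D(128, 5, strides=1, padding="same", activation="relu")(x)
x = layers.Conv1D(128, 5, strides=1, padding="same", activation="relu")(x)
x = layers.Conv1D(64, 3, strides=1, padding="same", activation="relu")(x)
x = layers.Conv1D(64, 3, strides=1, padding="same", activation="relu")(x)

# LSTM with output dimension 50 models the temporal relationship between the features.
x = layers.LSTM(50)(x)

x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)

cldnn = keras.Model(inputs, outputs)
cldnn.compile(optimizer="adam",                        # optimizer is an assumption
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```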
Similarly, the researchers take the RAVDESS database as an example to observe the impact of the data augmentation on model training. Using the raw audio data for 10-fold cross-validation, the researchers obtain an average accuracy of 68.70% with a standard deviation of 2.54%. After the data augmentation, the cross-validation yields an average accuracy of 95.52% with a standard deviation of 0.47%. For the CLDNN model, the average accuracy thus increases by 26.82% after the data augmentation. This once again verifies that data augmentation is an effective method for speech emotion recognition. Figure 10 and Figure 11 show the confusion matrices of the experimental results on the RAVDESS database before and after the data augmentation, respectively.
For the database EMO-DB, the researchers conduct the same experiments as those for RAVDESS. Performing 10-fold cross-validation for the training of CLDNN model on the EMO-DB database, the researchers obtain an average accuracy of 95.84% with a standard deviation of 1.75%. This indicates a slight improvement of performance compared to the performance of the CNN-DNN model. The confusion matrix of the experimental results for cross-validation on the EMO-DB database is shown in Figure 12.
Subsequently, the same experiments as those for the RAVDESS and EMO-DB databases are conducted on the IEMOCAP database. In many related studies, even with fewer emotion labels, the accuracy of models tested on the IEMOCAP database is usually lower. However, the CLDNN model achieved a 96.21% accuracy with a standard deviation of 0.39% in the cross-validation experiments on the IEMOCAP database. The confusion matrix of the experimental results is shown in Figure 13. Additionally, the validation results on the IEMOCAP database are comparable to those on the other databases. This comparison indicates that the use of different databases has little impact on the proposed method for speech emotion recognition, which again verifies the generality and stability of the proposed method.

5. Results and Discussion

Comparing the training results of the CNN-DNN model and the CLDNN model, the researchers find that, without data augmentation, the average accuracy of the cross-validation experiments for the CLDNN model is 5.29% higher than that of the CNN-DNN model. Additionally, after the data augmentation, the average accuracy of the cross-validation experiments for the CLDNN model is 3.65% higher than that of the CNN-DNN model. The comparison is shown in Table 7. This indicates that adding LSTM layers can improve the training performance by learning features at the time-sequence level.
Three performance indexes, namely precision, recall, and F1-score, are adopted in this paper to evaluate the performance of the proposed method. They are calculated based on Equations (6)–(8).
$$ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{6} $$
$$ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{7} $$
$$ \mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{8} $$
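These indexes (and the confusion matrices shown in Figures 10–13) can be computed directly from the fold-wise predictions; a minimal sketch with Scikit-Learn, using hypothetical y_true and y_pred arrays, is given below.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical ground-truth labels and model predictions for one validation fold.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

# Per-class precision, recall, and F1-score as defined in Equations (6)-(8).
print(classification_report(y_true, y_pred, digits=4))

# Confusion matrix of the kind shown in Figures 10-13.
print(confusion_matrix(y_true, y_pred))
```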
Table 8 shows the precision, recall, and F1-score of each emotion label recognized by using the CLDNN model with 10-fold cross-validation experiments on RAVDESS database.
It can be seen that the precisions of ‘fear’ and ‘sad’ are lower than those of the other emotions. This is likely because utterances of ‘fear’ and ‘sad’ have a relatively low-tone expression, so emotions with a similarly low tone of voice may be mistakenly predicted as belonging to these two labels. On the other hand, the recalls of ‘disgust’ and ‘surprised’ show poorer performance. This is likely because the model does not capture the features of these two emotions as well as it does for the other emotions. As a result, these two emotions are more likely to be mistakenly predicted as each other.
The precision, recall, and F1-score of the CLDNN model, calculated through cross-validation experiments on the EMO-DB database, are shown in Table 9. The precision for the emotion ‘sad’ is 100%, indicating that its false prediction rate is essentially zero, the lowest among all emotions. The precision for the emotion ‘happy’ is the lowest. By examining the confusion matrix in Figure 12, it is found that the model tends to wrongly predict the emotion ‘angry’ as ‘happy’. This is possibly because both emotions have a similarly high-tone expression.
Table 10 shows the precision, recall, and F1-score of the CLDNN model calculated from the 10-fold cross-validation experiments on the IEMOCAP database. The variation of these values across emotion labels is smaller than for the RAVDESS and EMO-DB databases. This is possibly because the IEMOCAP database has the largest total amount of data with only four emotion categories, which makes the designed model less prone to incorrect predictions between different emotions.
Table 11, Table 12 and Table 13 compare the proposed 1D CLDNN model with other related studies on the RAVDESS, EMO-DB, and IEMOCAP databases in terms of the average accuracy of cross-validation experiments. The CLDNN model proposed in this paper achieves the highest accuracy on both the RAVDESS and IEMOCAP databases, a substantial improvement over the other studies. Additionally, the recognition accuracy of this model is around 95% for all three databases. As for the EMO-DB database, the accuracy of the proposed CLDNN model (95.84%) is only slightly lower (by less than 0.1%) than those of [8] (95.93%) and [10] (95.89%). Hence, it can be concluded that the proposed CLDNN model can extract important features from the audio data in these three databases and has a certain degree of data generalization.

6. Conclusions and Future Work

This paper applied data augmentation techniques, namely adding noise and adjusting pitch, to expand the effective training and testing datasets. Combining CNN, LSTM, and DNN, this paper proposed a 1D CLDNN model to improve the recognition rate for speech emotion recognition. MFCCs were extracted from the audio signals and used as inputs to train the CLDNN model. The 10-fold cross-validation experiments conducted on the RAVDESS, EMO-DB, and IEMOCAP databases resulted in accuracies of 95.52%, 95.84%, and 96.21%, respectively. The overall recognition rate is higher than those of other related studies. Moreover, the experimental results show that the variability of the samples from these three different databases for training and testing has little effect on the prediction results. This verifies the generality and stability of the proposed model.
According to the findings of this paper, there are two issues that are important to address for speech emotion recognition. The first issue is the collection of speech emotion databases. Most of the open databases for speech emotion are recorded by actors, and the variability of the signals between different databases is likely to be wide depending on the actors’ performances. Hence, finding databases recorded by native speakers is important for research in speech emotion recognition. The second issue is cross-language speech emotion recognition. Many languages are used in different countries, and designing a speech emotion recognition system that can recognize emotions from speech in different languages is a practical problem. Based on the findings of this paper, a deeper neural network is a promising way to address this problem. However, the costs of hardware and power consumption would become a heavy burden for a company or an organization.
In the future, other speech features such as VAD (voice activity detection) [32], ZCR (zero-crossing rate), and RMS (root mean square) [33] can be extracted and used as inputs to train the proposed 1D CLDNN model. This is proposed not only in pursuit of higher speech emotion recognition performance but also to enable different kinds of applications, for example, cough and spoofing detection. Additionally, integration with an image-based emotion recognition model can be explored to develop a multimodal neural network model [34] for enhancing the performance of the designed model. This will be further investigated for deployment on embedded devices to reduce power consumption and computational complexity.

Author Contributions

Methodology, S.-T.P.; Software, H.-J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council, grant number MOST 109-2221-E-390-014-MY2. The APC was funded by the National University of Kaohsiung.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Roccetti, M.; Ghini, V.; Pau, G.; Salomoni, P.; Bonfigli, M.E. Design and experimental evaluation of an adaptive playout delay control mechanism for packetized audio for use over the internet. Multimed. Tools Appl. 2001, 14, 23–53. [Google Scholar] [CrossRef]
  2. Weng, Z.; Qin, Z.; Tao, X.; Pan, C.; Liu, G.; Li, G.Y. Deep Learning Enabled Semantic Communications with Speech Recognition and Synthesis. arXiv 2022, arXiv:2205.04603. [Google Scholar] [CrossRef]
  3. Chung, J.S.; Nagrani, A.; Zisserman, A. Voxceleb2: Deep Speaker Recognition. arXiv 2018, arXiv:1806.05622. [Google Scholar]
  4. Valk, J.; Alumae, T. Voxlingua107: A Dataset for Spoken Language Recognition. In Proceedings of the IEEE Spoken Language Technology Workshop, Shenzhen, China, 19–22 January 2021. [Google Scholar]
  5. Sajjad, M.; Kwon, S. Clustering Based Speech Emotion Recognition by Incorporating Learned Features and Deep BiLSTM. IEEE Access 2020, 8, 79861–79875. [Google Scholar]
  6. Kwon, S. A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition. Sensors 2020, 20, 183. [Google Scholar]
  7. Bhavan, A.; Chauhan, P.; Shah, R.R. Bagged Support Vector Machines for Emotion Recognition from Speech. Knowl. Based Syst. 2019, 184, 104886. [Google Scholar] [CrossRef]
  8. Kakuba, S.; Poulose, A.; Han, D.S. Attention-Based Multi-Learning Approach for Speech Emotion Recognition with Dilated Convolution. IEEE Access 2022, 10, 122302–122313. [Google Scholar] [CrossRef]
  9. Amjad, A.; Khan, L.; Chang, H.T. Recognizing Semi-Natural and Spontaneous Speech Emotions Using Deep Neural Networks. IEEE Access 2022, 10, 37149–37163. [Google Scholar] [CrossRef]
  10. Zhao, J.; Mao, X.; Chen, L. Speech Emotion Recognition Using Deep 1D & 2D CNN LSTM Networks. Biomed. Signal Process Control 2019, 47, 312–323. [Google Scholar]
  11. Khan, M.; Kwon, S. CLSTM: Deep Feature-Based Speech Emotion Recognition Using the Hierarchical ConvLSTM Network. Mathematics 2020, 8, 2133. [Google Scholar]
  12. Chen, Z.Q.; Pan, S.T. Integration of Speech and Consecutive Facial Image for Emotion Recognition Based on Deep Learning. Master’s Thesis, National University of Kaohsiung, Kaohsiung, Taiwan, 2021. [Google Scholar]
  13. Amazon Polly. Available online: https://aws.amazon.com/polly/ (accessed on 5 April 2023).
  14. Google Speech-to-Text. Available online: https://cloud.google.com/speech-to-text (accessed on 5 April 2023).
  15. NVIDIA Emotion Classification. Available online: https://docs.nvidia.com/tao/tao-toolkit/text/emotion_classification/emotion_classification (accessed on 5 April 2023).
  16. Abbaschian, B.J.; Sosa, D.S.; Elmaghraby, A. Deep Learning Techniques for Speech Emotion Recognition, from Databases to Models. Sensors 2021, 21, 1249. [Google Scholar] [CrossRef] [PubMed]
  17. Han, W.; Jiang, T.; Li, Y.; Schuller, B.; Ruan, H. Ordinal Learning for Emotion Recognition in Customer Service Calls. In Proceedings of the ICASSP, Barcelona, Spain, 4–8 May 2020; pp. 6494–6498. [Google Scholar]
  18. Dhuheir, M.; Albaseer, A.; Baccour, E.; Erbad, A.; Abdallah, M.; Hamdi, M. Emotion Recognition for Healthcare Surveillance Systems using Neural Networks: A Survey. In Proceedings of the International Wireless Communications and Mobile Computing, Harbin City, China, 28 June–2 July 2021; pp. 681–687. [Google Scholar]
  19. Gomez-Canon, J.S.; Cano, E.; Eerola, T.; Herrera, P.; Hu, X.; Yang, Y.H.; Gomez, E. Music Emotion Recognition: Toward New, Robust Standards in Personalized and Context-Sensitive Applications. IEEE Signal Process. Mag. 2021, 38, 106–114. [Google Scholar] [CrossRef]
  20. Zhang, R.; Yin, Z.; Wu, Z.; Zhou, S. A novel automatic modulation classification method using attention mechanism and hybrid parallel neural network. Appl. Sci. 2021, 11, 1327. [Google Scholar] [CrossRef]
  21. Kulin, M.; Kazaz, T.; Poorter, E.D.; Moerman, I. A survey on machine learning-based performance improvement of wireless networks: PHY, MAC and network layer. Electronics 2021, 10, 318. [Google Scholar] [CrossRef]
  22. Mirmozaffari, M.; Yazdani, M.; Boskabadi, A.; Dolatsara, H.A.; Kabirifar, K.; Golilarz, N.A. A novel machine learning approach combined with optimization models for eco-efficiency evaluation. Appl. Sci. 2020, 10, 5210. [Google Scholar] [CrossRef]
  23. Noroznia, H.; Gandomkar, M.; Nikoukar, J.; Aranizadeh, A.; Mirmozaffari, M. A Novel Pipeline Age Evaluation: Considering Overall Condition Index and Neural Network Based on Measured Data. Mach. Learn. Knowl. Extr. 2023, 5, 252–268. [Google Scholar] [CrossRef]
  24. Han, X.; Chen, S.; Chen, M.; Yang, J. Radar specific emitter identification based on open-selective kernel residual network. Digit. Signal Process. 2023, 134, 103913. [Google Scholar] [CrossRef]
  25. Sainath, T.N.; Vinyals, O.; Senior, A.; Sak, H. Convolutional Long Short-Term Memory Fully Connected Deep Neural Networks. In Proceedings of the ICASSP, South Brisbane, QLD, Australia, 19–24 April 2015; pp. 4580–4584. [Google Scholar]
  26. Johnson, J.M.; Khoshgoftaar, T.M. Survey on Deep Learning with Class Imbalance. J. Big Data 2019, 6, 27. [Google Scholar] [CrossRef]
  27. Leng, Y.; Zhao, W.; Lin, C.; Sun, C.; Wang, R.; Yuan, Q.; Li, D. LDA-Based Data Augmentation Algorithm for Acoustic Scene Classification. Knowl. Based Syst. 2022, 195, 105600. [Google Scholar] [CrossRef]
  28. Jahangir, R.; Teh, Y.W.; Mujtaba, G.; Alroobaea, R.; Shaikh, Z.H.; Ali, I. Convolutional Neural Network-based Cross-corpus Speech Emotion Recognition with Data Augmentation and Features Fusion. Mach. Vis. Appl. 2022, 33, 41. [Google Scholar] [CrossRef]
  29. Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef] [PubMed]
  30. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A Database of German Emotional Speech. In Proceedings of the Interspeech 2005—Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, 4–8 September 2005. [Google Scholar]
  31. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Samuel, K.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive Emotional Dyadic Motion Capture Database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
  32. Wang, K.; An, N.; Li, B.; Zhang, Y.; Li, L. Speech Emotion Recognition Using Fourier Parameters. IEEE Trans. Affect. Comput. 2015, 6, 69–75. [Google Scholar] [CrossRef]
  33. Móstoles, R.; Griol, D.; Callejas, Z.; Fernández-Martínez, F. A Proposal for Emotion Recognition using Speech Features, Transfer Learning and Convolutional Neural Networks. Iberspeech 2021, 12, 55–60. [Google Scholar]
  34. Wu, H.J.; Pan, S.-T. Combination of 1D CNN and LSTM for Realization of Speech Emotion Recognition. In Proceedings of the 27th International Conference on Technologies and Applications of Artificial Intelligence (TAAI 2022), Tainan, Taiwan, 1–3 December 2022. [Google Scholar]
Figure 1. The process of model training and testing using 10-fold cross-validation for the experiments with various datasets.
Figure 2. Examples of the augmented audio signals: (a) raw audio signal; (b) audio signal with added noise; (c) audio signal with pitch shifting; (d) audio signal with added noise and pitch shifting.
Figure 3. The process for extracting MFCCs for audio signals.
Figure 4. Architecture and training process of the CNN-DNN model with MFCC features for speech emotion recognition.
Figure 5. Confusion matrix of accuracy of each emotion for the raw audio data in RAVDESS database by using CNN-DNN model.
Figure 6. Confusion matrix of accuracy of each emotion after data augmentation for RAVDESS database by using CNN-DNN model.
Figure 7. Confusion matrix of accuracy of each emotion after data augmentation for EMO-DB database by using CNN-DNN model.
Figure 8. Confusion matrix of accuracy of each emotion after data augmentation for IEMOCAP database by using CNN-DNN model.
Figure 9. Architecture and training process of the proposed 1D CLDNN model with MFCC features for speech emotion recognition.
Figure 10. Confusion matrix of accuracy of each emotion for the raw audio data in RAVDESS database by using 1D CLDNN model.
Figure 11. Confusion matrix of accuracy of each emotion after data augmentation for RAVDESS database by using 1D CLDNN model.
Figure 12. Confusion matrix of accuracy of each emotion after data augmentation for EMO-DB database by using 1D CLDNN model.
Figure 13. Confusion matrix of accuracy of each emotion after data augmentation for IEMOCAP database by using 1D CLDNN model.
Table 1. The quantity and proportion of data for each emotion in the RAVDESS database.
Label | Number of Data | Proportion
Angry | 376 | 15.33%
Calm | 376 | 15.33%
Disgust | 192 | 7.83%
Fear | 376 | 15.33%
Happy | 376 | 15.33%
Neutral | 188 | 7.39%
Sad | 376 | 15.33%
Surprised | 192 | 7.83%
Total | 2452 | 100%
Table 2. The quantity and proportion of data for each emotion in the EMO-DB database.
Label | Number of Data | Proportion
Angry | 127 | 23.74%
Boredom | 81 | 15.14%
Disgust | 46 | 8.6%
Fear | 69 | 12.9%
Happy | 71 | 13.27%
Neutral | 79 | 14.77%
Sad | 62 | 11.59%
Total | 535 | 100%
Table 3. The quantity and proportion of data for each emotion in the IEMOCAP database.
Label | Number of Data | Proportion
Angry | 1103 | 19.94%
Happy | 1636 | 29.58%
Neutral | 1708 | 30.88%
Sad | 1084 | 19.6%
Total | 5531 | 100%
Table 4. The research equipment and environment used in the experiment.
Experimental Environment
CPU | Intel® Core™ i7-10700 CPU 2.90 GHz (Manufacturer: Intel Corporation, Santa Clara, CA, USA)
GPU | NVIDIA GeForce RTX 3090 32 GB (Manufacturer: NVIDIA Corporation, Santa Clara, CA, USA)
IDE | Jupyter notebook (Python 3.7.6)
Cross-validation | 10-fold
Table 5. Parameters used for the CNN-DNN model in this paper.
Layer | Information
LFLB 1 | Conv1D (input): filters = 256, kernel_size = 5, strides = 1; BatchNormalization; MaxPooling1D: pool_size = 5, strides = 2
LFLB 2 | Conv1D: filters = 128, kernel_size = 5, strides = 1; BatchNormalization; MaxPooling1D: pool_size = 5, strides = 2
LFLB 3 | Conv1D: filters = 128, kernel_size = 5, strides = 1; BatchNormalization; MaxPooling1D: pool_size = 5, strides = 2
LFLB 4 | Conv1D: filters = 64, kernel_size = 3, strides = 1; BatchNormalization; MaxPooling1D: pool_size = 5, strides = 2
LFLB 5 | Conv1D: filters = 64, kernel_size = 3, strides = 1; BatchNormalization; MaxPooling1D: pool_size = 3, strides = 2
Flatten | —
Dense | units = 256, activation = “relu”
BatchNormalization | —
Dense (output) | activation = “softmax”
Table 6. Parameters used for the CLDNN model in this paper.
Layer | Information
Conv1D (input) | filters = 256, kernel_size = 5, strides = 1
Conv1D | filters = 128, kernel_size = 5, strides = 1
Conv1D | filters = 128, kernel_size = 5, strides = 1
Conv1D | filters = 64, kernel_size = 3, strides = 1
Conv1D | filters = 64, kernel_size = 3, strides = 1
Flatten | —
LSTM | units = 50
Dense | units = 256, activation = “relu”
Dense (output) | units = 8, activation = “softmax”
Table 7. The average accuracy of cross-validation for the CNN-DNN and CLDNN models using raw audio data and audio data enhancement through the data augmentation.
Model | Raw Audio Data | Data Augmentation
CNN-DNN | 63.41% | 91.87%
CLDNN | 68.70% | 95.52%
Table 8. The precision, recall, and F1-score on each emotion recognized by using CLDNN model with data augmentation of the RAVDESS database.
Label | Precision | Recall | F1-Score
Angry | 93.54% | 93.85% | 93.69%
Calm | 93.53% | 95.99% | 94.74%
Disgust | 95.02% | 88.86% | 91.84%
Fear | 88.91% | 91.56% | 90.22%
Happy | 90.52% | 90.96% | 90.74%
Neutral | 92.84% | 91.07% | 91.95%
Sad | 88.35% | 91.52% | 89.91%
Surprised | 97.42% | 88.18% | 92.57%
Average | 92.52% | 91.50% | 91.96%
Table 9. The precision, recall, and F1-score on each emotion recognized by using CLDNN model with data augmentation of the Emo-DB database.
Label | Precision | Recall | F1-Score
Angry | 98.04% | 92.59% | 95.24%
Boredom | 95.53% | 95.86% | 95.70%
Disgust | 95.10% | 92.38% | 93.72%
Fear | 94.65% | 100% | 97.25%
Happy | 90.44% | 98.15% | 94.14%
Neutral | 95.71% | 95.71% | 95.71%
Sad | 100% | 99.60% | 99.80%
Average | 95.64% | 96.33% | 95.94%
Table 10. The precision, recall, and F1-score on each emotion recognized by using CLDNN model with data augmentation of the IEMOCAP database.
Label | Precision | Recall | F1-Score
Angry | 95.92% | 96.92% | 96.42%
Happy | 96.18% | 95.53% | 95.86%
Neutral | 96.38% | 96.02% | 96.20%
Sad | 96.26% | 96.63% | 96.60%
Average | 96.19% | 96.28% | 96.27%
Table 11. Comparison of the results of cross-validation for the proposed CLDNN model with other related research on the RAVDESS database.
Methods | Cross-Validation | Accuracy
M. Sajjad et al. [5] | 5-fold | 77.02%
S. Kwon [6] | 5-fold | 79.50%
A. Bhavan et al. [7] | 10-fold | 75.69%
S. Kakuba et al. [8] | 10-fold | 85.89%
Proposed model | 10-fold | 95.52%
Table 12. Comparison of the results of cross-validation for the proposed CLDNN model with other related research on the EMO-DB database.
Methods | Cross-Validation | Accuracy
M. Sajjad et al. [5] | 5-fold | 85.57%
A. Bhavan et al. [7] | 10-fold | 92.45%
S. Kakuba et al. [8] | 10-fold | 95.93%
J. Zhao et al. [10] | 5-fold | 95.89%
Proposed model | 10-fold | 95.84%
Table 13. Comparison of the results of cross-validation for the proposed CLDNN model with other related research on the IEMOCAP database.
Methods | Cross-Validation | Accuracy
M. Sajjad et al. [5] | 5-fold | 72.25%
S. Kwon [6] | 5-fold | 81.75%
A. Amjad et al. [9] | 5-fold | 88.80%
Proposed model | 10-fold | 96.21%