Electronics · Open Access Article · 20 September 2023

Enhanced Speech Emotion Recognition Using DCGAN-Based Data Augmentation

1 Department of Computer Science, Graduate School, Sangmyung University, Seoul 03016, Republic of Korea
2 Department of Intelligent IoT, Sangmyung University, Seoul 03016, Republic of Korea
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Theories and Technologies of Network, Data and Information Security

Abstract

Although emotional speech recognition has received increasing attention in research and applications, it remains challenging due to the diversity and complexity of emotions and the limited size of available datasets. To address these limitations, we propose a novel approach that uses a DCGAN to augment data from the RAVDESS and EmoDB databases. We then assess the efficacy of emotion recognition on mel-spectrogram data using a model that combines a CNN and BiLSTM. The preliminary experimental results show that the proposed technique improves emotional speech recognition performance. These findings provide directions for further development in the field of emotional speech recognition and point toward practical applications.

1. Introduction

Speech recognition technology plays a crucial role in enriching and facilitating more intuitive human–machine interactions. Notably, the significance of emotional speech recognition is evident from its broadening applications across various domains including smart homes, healthcare, entertainment, customer service, and sentiment analysis. Initially, emotion recognition in speech research focused on probabilistic models like hidden Markov models (HMMs) and Gaussian mixture models (GMMs) [1,2,3,4,5]. With the emergence of deep learning, the study of emotion recognition through neural networks has become prevalent [6,7,8,9,10,11]. Nevertheless, due to the complexity and diversity of emotions and the challenge of subjective evaluation, precise emotional speech recognition remains a major challenge.
One of the major challenges hindering progress in these studies is the limited availability of high-quality emotional speech data. In the realm of image processing, popular datasets such as CIFAR10 [12], ImageNet [13], and MNIST [14] have been extensively utilized to train deep learning models. However, no emotional speech datasets of comparable scale exist. Among emotional speech datasets, IEMOCAP [15], EmoDB [16], and RAVDESS [17] are frequently used in research, but they are relatively small compared with image datasets. To overcome these data limitations, we propose utilizing deep convolutional generative adversarial networks (DCGANs) [18] to augment speech data in the form of mel-spectrograms. Although DCGANs are primarily used for image data augmentation, this work explores their application to mel-spectrograms, time-frequency representations of speech that effectively capture the components of emotion.
In addition, this study investigates the effectiveness of combining convolutional neural networks (CNNs) and bidirectional long short-term memory (BiLSTM) [19] to accurately identify emotions from mel-spectrogram data. The combination of these techniques offers significant potential for improving the performance of emotional speech recognition and points toward future technological development and real-world applications.
The structure of this paper is as follows. First, prior research on GANs, DCGANs, and mel-spectrograms is reviewed. The following section outlines the methodology, covering the databases used, data preprocessing, data augmentation with DCGAN, and the CNN+BiLSTM model design. The Experiments and Results section then validates the performance of the proposed approach through diverse experiments. Finally, the Discussion section analyzes the implications of these outcomes, the limitations of the research, and potential avenues for future studies.

3. Proposed Method

3.1. Data Preprocessing

This study utilizes the RAVDESS and EmoDB emotional speech databases. Although both contain various emotional states, this research concentrates on anger, disgust, fear, happiness, neutral, and sadness. Table 1 summarizes the quantity of speech data for each emotional state in each database.
Table 1. Data distribution by emotion in RAVDESS and EmoDB datasets.
One of the crucial stages in utilizing speech data is data preprocessing. This step removes extraneous components and transforms the data into a form that meets the model's requirements while retaining the distinctive features of the speech. We first employ envelope detection to eliminate silent and redundant segments of the speech data. Envelope detection, implemented with the librosa package, helps identify the primary variations in an audio signal and efficiently eliminates silence [28]. The process uses the Short-Time Fourier Transform (STFT) [29] to divide the audio signal into multiple frames and locate the maximum amplitude in each frame. These maxima are concatenated to form an envelope, which is then used to isolate and remove the silent portions from the original audio signal. As a result, the preprocessed data retain only the essential audio content. Figure 1 illustrates a comparison between the original speech and the speech with silent parts eliminated via envelope detection.
Figure 1. (a) Original speech waveform; (b) speech waveform after envelope detection.
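As a rough illustration of this preprocessing step, the sketch below removes low-energy frames using a frame-wise amplitude envelope. The frame length, silence threshold, non-overlapping framing, and the file name speech.wav are illustrative assumptions, not the authors' exact settings.

```python
# Minimal sketch of envelope-based silence removal, assuming non-overlapping
# frames and a fixed amplitude threshold (both illustrative choices).
import librosa
import numpy as np

def remove_silence(y, frame_length=2048, threshold=0.02):
    """Keep only frames whose amplitude envelope exceeds the threshold."""
    # Split the signal into non-overlapping frames; each column is one frame.
    frames = librosa.util.frame(y, frame_length=frame_length,
                                hop_length=frame_length)
    # The per-frame maximum of the rectified waveform serves as a simple envelope.
    envelope = np.abs(frames).max(axis=0)
    keep = envelope > threshold
    kept_frames = [frames[:, i] for i in range(frames.shape[1]) if keep[i]]
    return np.concatenate(kept_frames) if kept_frames else y

y, sr = librosa.load("speech.wav", sr=None)  # hypothetical input file
y_trimmed = remove_silence(y)
```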
After envelope detection, the data were converted into mel-spectrograms using the mel-spectrogram function of the librosa package. Each mel-spectrogram was then converted to the dB scale via Equation (4), which reduces its dynamic range and supports more consistent and efficient model training.
S_dB = 10 · log₁₀(S / ref)    (4)
The “ref” is set to the maximum value of the mel-spectrogram. To compress the large dynamic range common in real-world environments and facilitate model training, a dB scale is applied to the mel-spectrogram. Figure 2 displays a mel-spectrogram that has been processed in this manner.
Figure 2. Mel-spectrogram after dB scaling.
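For reference, a minimal sketch of this conversion is shown below; librosa.power_to_db computes 10·log₁₀(S/ref), and ref=np.max reproduces the choice of ref described above. The number of mel bands is an illustrative assumption.

```python
# Sketch of mel-spectrogram extraction and dB scaling (Equation (4)),
# continuing from the silence-removal sketch above.
import librosa
import numpy as np

S = librosa.feature.melspectrogram(y=y_trimmed, sr=sr, n_mels=128)  # assumed n_mels
S_db = librosa.power_to_db(S, ref=np.max)  # 10 * log10(S / ref) with ref = max(S)
```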

3.2. Mel-Spectrogram Data Augmentation Using DCGAN

In this study, we used a DCGAN to augment the speech data. The model was trained on mel-spectrograms obtained from the original speech data, and the trained generator was then used to produce new mel-spectrograms. A mini-batch technique was used owing to memory limitations during training, and the model's layers were constructed with the PyTorch deep learning framework.
The generator receives a random noise vector in latent space and transforms it into image-like data. In our model, we employ an initial fully connected linear layer to expand the latent vectors into 2D tensors. Subsequently, we use four transposed convolution layers to incrementally enhance the image resolution, thus yielding the final image. Batch normalization and ReLU activation functions are applied after each transposed convolution layer to ensure network stability. The generator’s final layer uses the tanh activation function to confine the output within the range of [–1, 1].
The discriminator is a model that takes in image data to classify whether an image is genuine or generated by a generator. The framework for the discriminator comprises four convolutional layers, each incorporating batch normalization and a Leaky ReLU activation function. The last convolutional layer generates a single value, indicating the probability of an image being authentic. The sigmoid activation function then outputs the probability value within the range of [0, 1]. Table 2 shows the overall design structure for both networks.
Table 2. Architectures of the generator and discriminator for DCGAN-based mel-spectrogram augmentation.
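The sketch below illustrates one way the described generator and discriminator could be realized in PyTorch. The latent dimension, channel counts, and the assumed 64 × 64 single-channel mel-spectrogram resolution are illustrative guesses; Table 2 specifies the authors' actual configuration.

```python
# Minimal PyTorch sketch of a DCGAN generator and discriminator along the
# lines described above (latent size, channels, and 64x64 output are assumed).
import torch.nn as nn

LATENT_DIM = 100  # assumed latent vector size

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        # Fully connected layer expands the latent vector to a 4x4 feature map.
        self.fc = nn.Linear(LATENT_DIM, 512 * 4 * 4)
        self.deconv = nn.Sequential(
            # Four transposed convolutions progressively upsample 4x4 -> 64x64.
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 1, 4, 2, 1), nn.Tanh(),  # output confined to [-1, 1]
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 512, 4, 4)
        return self.deconv(x)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            # Strided convolutions downsample the 64x64 input mel-spectrogram.
            nn.Conv2d(1, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),
            nn.Conv2d(256, 1, 8, 1, 0), nn.Sigmoid(),  # single real/fake probability
        )

    def forward(self, x):
        return self.conv(x).view(-1)
```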
The mel-spectrogram data produced by this process are illustrated in Figure 3. The generated mel-spectrograms, labeled as "fake", were then combined with the original data and used to train the emotion recognition model.
Figure 3. Generated mel-spectrogram using DCGAN.

3.3. Model Architecture: CNN-BiLSTM Emotional Speech Recognition

In this study, the original mel-spectrograms extracted from the original speech data and the mel-spectrograms generated with DCGAN were combined to form the final dataset. This dataset was used as input to an emotion recognition model combining a CNN and BiLSTM. The structure of the model is as follows. The first CNN module applies a convolutional layer with 64 3 × 3 filters, followed by batch normalization, a ReLU activation function, and 2 × 2 max pooling. The second module applies a convolution with 128 3 × 3 filters, followed by batch normalization, a ReLU activation function, and 4 × 4 max pooling. The third CNN module applies a convolution with 256 4 × 4 filters, followed by batch normalization, a ReLU activation function, and 4 × 4 max pooling. To avoid overfitting, dropout is applied in each module. The output of the CNN modules is reshaped into the input of the LSTM and passed through a BiLSTM layer containing 256 LSTM units. Finally, the output passes through a dense layer of 128 units with L2 regularization, followed by a dense layer with a softmax activation function that produces the final output, the class probabilities.
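A minimal PyTorch sketch of such a CNN-BiLSTM classifier is given below. The input shape, dropout rate, padding, and the way the CNN feature maps are turned into an LSTM sequence (averaging over the frequency axis and treating time as the sequence dimension) are assumptions for illustration; L2 regularization on the dense layer would typically be realized via the optimizer's weight decay, and the softmax is left to the cross-entropy loss.

```python
# Illustrative CNN-BiLSTM classifier; filter counts, kernel and pooling sizes
# follow the description above, everything else is an assumption.
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, n_classes=6, dropout=0.3):
        super().__init__()
        def block(c_in, c_out, k, pool):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, k, padding=k // 2),
                nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
                nn.MaxPool2d(pool), nn.Dropout(dropout),
            )
        self.cnn = nn.Sequential(
            block(1, 64, 3, 2),     # 64 3x3 filters, 2x2 pooling
            block(64, 128, 3, 4),   # 128 3x3 filters, 4x4 pooling
            block(128, 256, 4, 4),  # 256 4x4 filters, 4x4 pooling
        )
        self.bilstm = nn.LSTM(input_size=256, hidden_size=256,
                              batch_first=True, bidirectional=True)
        self.fc1 = nn.Linear(2 * 256, 128)   # dense layer of 128 units
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):                    # x: (batch, 1, n_mels, time)
        f = self.cnn(x)                      # (batch, 256, H, W)
        # Collapse the frequency axis and treat the time axis as the sequence.
        f = f.mean(dim=2).permute(0, 2, 1)   # (batch, W, 256)
        h, _ = self.bilstm(f)
        h = torch.relu(self.fc1(h[:, -1]))   # last time step of the BiLSTM
        return self.out(h)                   # logits; softmax applied in the loss
```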

4. Experiment

4.1. Experimental Setting

In our experiments, we used the original mel-spectrograms extracted from the RAVDESS and EmoDB databases as well as augmented mel-spectrograms generated with DCGAN as datasets. The model combines a CNN and BiLSTM, and we chose the RMSprop [30] optimizer to ensure stable gradient updates and rapid convergence. The data were divided into training, test, and validation sets at a ratio of 7:1.5:1.5 to maintain the stability of the model while allowing sufficient training and evaluation. We used the ReduceLROnPlateau method to dynamically adjust the learning rate and keep the optimization process stable: the initial learning rate was set to 0.001 and was reduced whenever performance stopped improving during training, which yielded better model performance.
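A sketch of this training configuration is shown below, reusing the CNNBiLSTM class from the previous sketch. The batch size, scheduler factor and patience, and the `dataset` object holding (mel-spectrogram, label) pairs are hypothetical; only the RMSprop optimizer, the 0.001 initial learning rate, the ReduceLROnPlateau schedule, and the 70/15/15 split come from the text.

```python
# Illustrative training setup, assuming `dataset` yields (mel, label) pairs.
import torch
from torch.utils.data import DataLoader, random_split

model = CNNBiLSTM(n_classes=6)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)  # initial LR 0.001
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)  # assumed factor/patience

n = len(dataset)
n_train, n_val = int(0.7 * n), int(0.15 * n)
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n - n_train - n_val])  # 7:1.5:1.5 split
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)  # assumed batch size

# After each epoch, step the scheduler on the validation loss:
# scheduler.step(val_loss)
```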
The main purpose of the performance evaluation is to determine how effective data augmentation with DCGAN is. To this end, we compared, for both RAVDESS and EmoDB, the performance obtained with the original data alone against that obtained with the original data combined with the augmented data. Weighted accuracy (WA) and unweighted accuracy (UA) were used as performance measures; both are commonly applied in speech emotion recognition, especially when the emotion classes are imbalanced. WA weights the accuracy of each class by its frequency, mirroring the class distribution in the dataset, and thus measures overall accuracy. UA, in contrast, averages the per-class accuracies equally, without taking the data imbalance into account.
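The following sketch makes the two metrics concrete for a confusion matrix C in which C[i, j] counts samples of true class i predicted as class j; the function name and interface are illustrative.

```python
# WA = overall accuracy (per-class recall weighted by class frequency);
# UA = unweighted mean of the per-class recalls (balanced accuracy).
import numpy as np

def weighted_unweighted_accuracy(C):
    C = np.asarray(C, dtype=float)
    per_class_recall = np.diag(C) / C.sum(axis=1)
    wa = np.diag(C).sum() / C.sum()
    ua = per_class_recall.mean()
    return wa, ua
```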

4.2. Results

In this experiment, we assessed performance on two datasets: RAVDESS and EmoDB. For each dataset, we compared the performance of using only the original data with that of incorporating the augmented data, resulting in a total of four data configurations. Table 3 summarizes the results.
Table 3. Comparison of WA and UA for original and augmented datasets.
In the RAVDESS dataset, solely utilizing the original data resulted in a weighted accuracy (WA) of 64.8% and an unweighted accuracy (UA) of 64.2%. However, when the augmented data were incorporated, there was a marked improvement, registering 72.3% for both WA and UA. Similarly, for the EmoDB dataset, the original data yielded a WA of 80.6% and a UA of 82.6%, while incorporating augmented data led to a WA and UA of 90.4% and 91.3%, respectively. These results demonstrate that the performance of the speech emotion recognition model improves when utilizing the DCGAN-based data augmentation technique. Table 4, Table 5, Table 6 and Table 7 present the experimental findings, specifically focusing on the UA values in the confusion matrices. The colored numbers in each table indicate the highest score for each emotion.
Table 4. Confusion matrix for RAVDESS dataset (%).
Table 5. Confusion matrix for RAVDESS+augmented dataset (%).
Table 6. Confusion matrix for EmoDB dataset (%).
Table 7. Confusion matrix for EmoDB+augmented dataset (%).

5. Conclusions

In this study, we proposed a method for augmenting emotional speech data using DCGAN. With the proposed method, a speech emotion recognition model was trained on the original data together with the augmented mel-spectrogram data generated from the RAVDESS and EmoDB datasets. The experiments indicate that including DCGAN-generated data in the training set leads to significant improvements in model performance compared with using only the original data. Model performance was evaluated with two standard metrics, WA and UA.
This research demonstrates that data augmentation with generative models such as DCGAN is an effective way to build high-performance speech emotion recognition models, especially when the emotional speech dataset is small. In future work, we will further verify the generality of the proposed method using other generative models and other speech datasets. Furthermore, we believe a deeper study of the characteristics of the augmented data and the resulting performance changes in the speech emotion recognition model is necessary.

Author Contributions

Conceptualization, J.-Y.B. and S.-P.L.; methodology, J.-Y.B.; investigation, J.-Y.B.; writing—original draft preparation, J.-Y.B.; writing—review and editing, S.-P.L.; project administration, S.-P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the 2023 Research Grant from Sangmyung University.

Data Availability Statement

Experiments used publicly available datasets.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BiLSTM: Bidirectional long short-term memory;
CNN: Convolutional neural network;
DCGAN: Deep convolutional generative adversarial network;
GAN: Generative adversarial network;
GMM: Gaussian mixture model;
HMM: Hidden Markov model;
NLP: Natural language processing;
ReLU: Rectified linear unit;
STFT: Short-time Fourier transform;
WA: Weighted accuracy;
UA: Unweighted accuracy

References

  1. Schuller, B.; Rigoll, G.; Lang, M. Hidden Markov model-based speech emotion recognition. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Hong Kong, China, 6–10 April 2003; IEEE: Piscataway, NJ, USA, 2003; Volume 2, p. I-401.
  2. Nogueiras, A.; Moreno, A.; Bonafonte, A.; Mariño, J.B. Speech emotion recognition using hidden Markov models. In Proceedings of the Seventh European Conference on Speech Communication and Technology, Aalborg, Denmark, 3–7 September 2001.
  3. Lin, Y.-L.; Wei, G. Speech emotion recognition based on HMM and SVM. In Proceedings of the 2005 International Conference on Machine Learning and Cybernetics, Guangzhou, China, 18–21 August 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 8, pp. 4898–4901.
  4. Lanjewar, R.B.; Mathurkar, S.; Patel, N. Implementation and comparison of speech emotion recognition system using Gaussian Mixture Model (GMM) and K-Nearest Neighbor (K-NN) techniques. Procedia Comput. Sci. 2015, 49, 50–57.
  5. Hu, H.; Xu, M.-X.; Wu, W. GMM supervector based SVM with spectral features for speech emotion recognition. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), Honolulu, HI, USA, 16–20 April 2007; IEEE: Piscataway, NJ, USA, 2007; Volume 4, pp. IV-413–IV-416.
  6. Abbaschian, B.J.; Sierra-Sosa, D.; Elmaghraby, A. Deep learning techniques for speech emotion recognition, from databases to models. Sensors 2021, 21, 1249.
  7. Issa, D.; Demirci, M.F.; Yazici, A. Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process. Control 2020, 59, 101894.
  8. Meng, H.; Yan, T.; Yuan, F.; Wei, H. Speech emotion recognition from 3D log-mel spectrograms with deep learning network. IEEE Access 2019, 7, 125868–125881.
  9. Makhmudov, F.; Kutlimuratov, A.; Akhmedov, F.; Abdallah, M.S.; Cho, Y.-I. Modeling speech emotion recognition via attention-oriented parallel CNN encoders. Electronics 2022, 11, 4047.
  10. Mustaqeem; Kwon, S. A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors 2019, 20, 183.
  11. Abdelhamid, A.A.; El-Kenawy, E.-S.M.; Alotaibi, B.; Amer, G.M.; Abdelkader, M.Y.; Ibrahim, A.; Eid, M.M. Robust speech emotion recognition using CNN+LSTM based on stochastic fractal search optimization algorithm. IEEE Access 2022, 10, 49265–49284.
  12. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
  13. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255.
  14. LeCun, Y. The MNIST Database of Handwritten Digits. 1998. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 31 August 2023).
  15. Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359.
  16. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, 4–8 September 2005.
  17. Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391.
  18. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015.
  19. Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610.
  20. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9.
  21. Zeiler, M.D.; Taylor, G.W.; Fergus, R. Adaptive deconvolutional networks for mid and high level feature learning. In Proceedings of the 2011 International Conference on Computer Vision, Washington, DC, USA, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 2018–2025.
  22. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 448–456.
  23. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814.
  24. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the ICML Workshop on Deep Learning for Audio, Speech and Language Processing, Atlanta, GA, USA, 16 June 2013; p. 3.
  25. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
  26. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
  27. Jackson, P.; Haq, S. Surrey Audio-Visual Expressed Emotion (SAVEE) Database; University of Surrey: Guildford, UK, 2014.
  28. Librosa. Available online: https://librosa.org (accessed on 13 August 2023).
  29. Allen, J. Short term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 1977, 25, 235–238.
  30. Tieleman, T.; Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn. 2012, 4, 26–31.
