Applied Sciences
  • Article
  • Open Access

22 October 2021

A Novel Heterogeneous Parallel Convolution Bi-LSTM for Speech Emotion Recognition

1 School of Computer Science, Qinghai Normal University, Xining 810008, China
2 The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, China
3 Department of Computer Science, School of Engineering and Computer Science, Baylor University, One Bear Place #97141, Waco, TX 76798, USA
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Selected Papers from 16th National Conference on Man-Machine Speech Communication (NCMMSC2021)

Abstract

Speech emotion recognition is a substantial component of natural language processing (NLP). It places strict requirements on the effectiveness of both feature extraction and the acoustic model. With that in mind, a Heterogeneous Parallel Convolution Bi-LSTM model is proposed to address these challenges. It consists of two heterogeneous branches: the left one contains two dense layers and a Bi-LSTM layer, while the right one contains a dense layer, a convolution layer, and a Bi-LSTM layer. The model exploits spatiotemporal information more effectively and achieves unweighted average recalls of 84.65%, 79.67%, and 56.50% on the benchmark databases EMODB, CASIA, and SAVEE, respectively. Compared with previous research results, the proposed model consistently achieves better performance.

1. Introduction

In recent years, the rapid progress of technology has made smart devices increasingly attractive in our daily life. Intelligent services such as chatbots, psychological diagnosis assistants, intelligent healthcare, sales advertising, and intelligent entertainment consider not only the completion of services but also the humanization of communication between human and computer [1]. How to implement an intelligent human–machine interface has therefore become an important issue. In spoken dialog applications, leading organizations use chatbots to improve their customer service and generate good business results [2]. Beyond customer engagement, empathy, which is highly related to emotion, has been incorporated into the design of dialogue systems to improve the user experience in human–computer interaction (HCI). More importantly, being empathetic is a necessary step for a dialogue system to be perceived as a social character by users. Realizing humanized HCI driven by this emotional motivation is therefore research of far-reaching significance.
Emotion plays an important role in the perception, attention, memory, and decision-making processes of human beings, and human speech contains a wealth of emotional information [3]. People can perceive emotion from different speech signals and therefore capture emotional changes from speech. As a vital part of human-to-human communication, speech emotion recognition is performed automatically and subconsciously by humans [4]. Thus, to achieve better HCI, speech emotion recognition must be handled smoothly so that machines can detect emotional information from human speech in real time.
Speech Emotion Recognition (SER) aims to simulate the emotional perception process by which human beings find and decipher the emotional information contained in speech [5]. In the past decades, SER has attracted widespread attention from researchers, and many tremendous achievements have been made. For example, Reeves et al. found that people tend to treat computers as if they were intelligent and emotion-aware [6]. This demonstrates a growing need for agents with proper affective behavior and affective understanding in areas such as interactive robots and story-telling agents [7,8]. With the fast development of Artificial Intelligence (AI), HCI becomes increasingly convenient and friendly when emotions are added to machines. To make HCI more harmonious and intelligent, it is urgent to enable AI to recognize speech emotions so that machines or robots can act in a human-like manner. Hence, SER research has strong academic and practical value.

3. Methods

Evolving from the preliminary models Bi-LSTM and CNN [9,18], the proposed deep learning model HPCB can efficiently process temporal coherence information in both the spatial and time domains, thanks to its well-designed heterogeneous parallel learning architecture that exploits the advantages of CNN and Bi-LSTM.

Heterogeneous Parallel Conv-BiLSTM (HPCB)

HPCB contains two heterogeneous branches, as shown in Figure 1. The purpose of designing the two heterogeneous branches is to project the original data into different transformation spaces for calculation, so as to better represent the original emotional speech.
Figure 1. The topology of the proposed deep learning model HPCB.
The left branch contains two dense layers and a Bi-LSTM layer, and it processes the temporal information of the input data. The number of neurons in each dense layer is 512, and the number of memory cells in the Bi-LSTM layer is 256.
The right branch contains a dense layer, a convolution layer, and a Bi-LSTM layer, and it handles the spatiotemporal information of the input data. The number of neurons in the dense layer and in the one-dimensional convolution layer is 512, and the number of memory cells in the Bi-LSTM layer is 256. The 1D convolution is used to extract the spatial information of speech emotion signals along the time dimension, and the Bi-LSTM is used to extract context information from both the forward and backward directions of the speech.
To represent emotional speech more completely, the features extracted from the left and right branches are fused through a Concatenate(·) operation, whose output is the joint feature matrix. This operation increases the dimension of the features describing the original data, but the information corresponding to each individual feature dimension does not increase.
A Softmax(·) function is used to classify emotions according to the emotional signals from the concatenation layer, which concatenates and fuses the information from the two heterogeneous branches. The number of neurons in the Softmax(·) layer is equal to the number of emotion categories in the corresponding database.
The proposed parallel learning architecture accelerates convergence in deep learning; it also helps capture and retrieve spatiotemporal coherence information, which plays an essential role in improving the learning performance of the model.
The proposed HPCB employs a valid convolution operation, performing the convolution only along the time dimension of the tensor. This means that the convolution kernel moves strictly inside the one-dimensional tensor. The output h of the convolution is calculated as:
$h = f(h_1 \ast F_{S \times N})$,
where $h_1$ denotes the output of the dense layer, $F = [k_1, k_2, \ldots, k_{512}]$ denotes the convolution kernel, $N$ denotes the number of filters and is set to 512, and $S$ denotes the stride and is set to 1 by default.
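To make the operation concrete, a minimal TensorFlow sketch is given below. The kernel width (3), the ReLU nonlinearity, and the 100-step input length are assumptions not stated above; the 512 filters, stride 1, and "valid" padding follow the description.

```python
import tensorflow as tf

# Minimal sketch of the valid, stride-1 convolution described above.
# Assumed: kernel width 3, ReLU nonlinearity f(.), and an input of 100 time steps.
conv = tf.keras.layers.Conv1D(
    filters=512,       # N = 512 filters
    kernel_size=3,     # assumed kernel width
    strides=1,         # S = 1 by default
    padding="valid",   # the kernel moves strictly inside the tensor
    activation="relu", # assumed nonlinearity
)

h1 = tf.random.normal((1, 100, 512))  # output of the 512-unit dense layer: (batch, time, features)
h = conv(h1)
print(h.shape)                         # (1, 98, 512): valid padding shortens the time axis
```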
Bi-LSTM is adept at context modeling on time series data. Unlike a traditional neural network, it has connections between any two neurons in the same hidden layer. The Bi-LSTM receives its input from the convolution layer and helps the HPCB model extract spatial and temporal coherence emotion features more effectively.
The outputs $y_{LB}$ and $y_{RB}$ of the left and right branches are concatenated in the concatenate layer to merge information:
$F_c = \mathrm{concatenate}(y_{LB}, y_{RB})$.
On top of the HPCB model, there is an output layer using Softmax(·) to classify emotion. Note that HPCB employs Adam optimization in its learning procedure. Compared to the original Bi-LSTM or CNN, HPCB automatically extracts information in both the spatial and time domains within a parallel learning architecture, exploiting the advantages of the two models.
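The Keras sketch below assembles the topology described above. The layer widths (512 dense units, 512 convolution filters, 256 Bi-LSTM memory cells per direction), the concatenation, the Softmax output, and the Adam optimizer follow the text; the sequence length, the 72-dimensional frame features, the kernel width, and the ReLU activations are assumptions, so this is an illustrative reconstruction rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_hpcb(time_steps, feat_dim, num_classes, kernel_size=3):
    """Illustrative reconstruction of the HPCB topology (Figure 1)."""
    inputs = layers.Input(shape=(time_steps, feat_dim))

    # Left branch: two dense layers followed by a Bi-LSTM (temporal information).
    left = layers.Dense(512, activation="relu")(inputs)
    left = layers.Dense(512, activation="relu")(left)
    y_lb = layers.Bidirectional(layers.LSTM(256))(left)

    # Right branch: dense layer, valid 1D convolution, Bi-LSTM (spatiotemporal information).
    right = layers.Dense(512, activation="relu")(inputs)
    right = layers.Conv1D(512, kernel_size, strides=1, padding="valid",
                          activation="relu")(right)
    y_rb = layers.Bidirectional(layers.LSTM(256))(right)

    # Concatenate layer: F_c = concatenate(y_LB, y_RB), then a Softmax output layer.
    f_c = layers.Concatenate()([y_lb, y_rb])
    outputs = layers.Dense(num_classes, activation="softmax")(f_c)

    model = Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example: a 7-class model (e.g., the EMODB emotion set) on 100-frame, 72-D inputs.
model = build_hpcb(time_steps=100, feat_dim=72, num_classes=7)
model.summary()
```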

4. Experimental Evaluations

The proposed model HPCB outperforms its peers on three benchmark datasets described in Section 4.1.

4.1. Databases

To verify the effectiveness of HPCB in SER, it was tested on three benchmark databases: EMO-DB [24], CASIA [25], and SAVEE [26]. EMO-DB is a German corpus containing 535 emotional sentences in total. It covers 10 speakers and 7 emotions, namely, boredom (B), anger (A), fear (F), sadness (S), disgust (D), happiness (H), and neutral (N).
CASIA is a Chinese corpus constructed by the Institute of Automation, Chinese Academy of Sciences. The public CASIA corpus contains 1200 utterances, and the average length of each audio clip is about 1.9 s. There are 4 speakers, and each speaker records 300 utterances of the same texts. There are 6 emotions, namely, anger (A), fear (F), happy (H), neutral (N), sad (Sa), and surprise (Su).
SAVEE is an English corpus containing 4 speakers and 7 emotions, namely, anger (A), disgust (D), fear (F), happiness (H), sadness (Sa), surprise (Su), and neutral (N). The number of neutral samples is 120, while each remaining class has 60. In total, there are 480 utterances.
Figure 2 shows t-SNE visualizations of the databases EMODB, CASIA, and SAVEE. As shown in Figure 2a, even when the samples of database EMODB are projected into the two-dimensional space spanned by the t-SNE bases, the degree of confusion remains large. The t-SNE visualization implies that the dataset is highly nonlinear and that the corresponding SER classification is a nonlinearly inseparable problem. Figure 2b,c shows similar situations for the samples of databases CASIA and SAVEE. Both also appear to be nonlinearly inseparable problems, as there are no clear boundaries between classes and samples of different types are sometimes intertwined.
Figure 2. The t-SNE visualizations of the benchmark databases: EMODB, CASIA, and SAVEE. (a) The t-SNE visualization of corpus EMODB. (b) The t-SNE visualization of corpus CASIA. (c) The t-SNE visualization of corpus SAVEE.
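Visualizations like Figure 2 can be produced with the common t-SNE implementation in scikit-learn; the sketch below uses random data as a stand-in for the real corpus features, and the perplexity value is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in for the utterance-level feature matrix (n_samples x 72) and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 72))
y = rng.integers(0, 7, size=500)      # e.g., 7 emotion classes as in EMODB

# Project to 2-D with t-SNE and color the points by emotion label.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=8)
plt.title("t-SNE projection of utterance-level features")
plt.show()
```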

4.2. Feature Extraction

We conducted the following feature extraction for the three databases in this study. Each utterance was segmented into frames with a 25 ms window and a 10 ms shift. Each frame was Z-normalized. For each frame, 32-D Low-Level Descriptor (LLD) features, including 12-D Chroma [23] and 20-D MFCC [24], were extracted. High-Level Statistical Functions (HSF), such as the mean of Chroma and the mean, variance, and maximum of MFCC, were then calculated. In total, 72-D acoustic features were used as the input of the model.
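A possible realization of this pipeline with librosa is sketched below. The 16 kHz sampling rate and the exact normalization order are assumptions; the 25 ms window, 10 ms shift, 20-D MFCC, 12-D Chroma, and the HSF statistics follow the description above.

```python
import numpy as np
import librosa

def extract_hsf(path, sr=16000):
    """Sketch of the 72-D feature extraction: 32-D frame-level LLDs
    (20-D MFCC + 12-D Chroma) followed by utterance-level HSFs."""
    y, _ = librosa.load(path, sr=sr)          # sampling rate is an assumption
    n_fft = int(0.025 * sr)                   # 25 ms window
    hop = int(0.010 * sr)                     # 10 ms shift

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=n_fft, hop_length=hop)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop)

    # Z-normalize each frame (column) of the stacked 32-D LLD matrix.
    lld = np.vstack([mfcc, chroma])           # shape: (32, n_frames)
    lld = (lld - lld.mean(axis=0)) / (lld.std(axis=0) + 1e-8)
    mfcc, chroma = lld[:20], lld[20:]

    # HSFs: mean of Chroma (12-D) plus mean, variance, and maximum of MFCC (3 x 20-D).
    return np.concatenate([chroma.mean(axis=1),
                           mfcc.mean(axis=1),
                           mfcc.var(axis=1),
                           mfcc.max(axis=1)])  # 72-D utterance-level vector
```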

4.3. Experimental Setup

All experiments were performed on a PC running Windows 10 with 64 GB of RAM and a 2.10 GHz CPU with 40 cores and 80 logical processors. To accelerate computing, two RTX 2080 Ti GPUs were used. All models were implemented with the TensorFlow toolkit [39].
To prevent possible overfitting, dropout was applied in all layers during the training stage. The dropout rate was 0.5, the batch size was 32, and the number of epochs was 100. In addition, Adam [40] was adopted as the optimizer.
The datasets EMODB, CASIA, and SAVEE do not provide separate training and testing sets; therefore, a speaker-independent (SI) strategy was employed for the train–test partition. The samples of each database were randomly divided into 5 equal parts: 4 parts were used as training data, while the remaining one was used as the testing set. Experiments were repeated 10 times, and the average value over all trials was computed. The confusion matrix and evaluation measures such as precision, unweighted average recall (UAR), accuracy, and F1-score were employed to evaluate the performance.
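Assuming X holds the model inputs, y holds integer emotion labels, and build_hpcb is the illustrative reconstruction from Section 3, the protocol above can be sketched as follows; the exact handling of folds within each repetition is not specified in the text and is therefore an assumption.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import recall_score

def evaluate_hpcb(X, y, num_classes, repeats=10):
    """Sketch of the protocol: random 5-way split (4 parts train, 1 part test),
    repeated 10 times, reporting the mean UAR (macro-averaged recall)."""
    uars = []
    for seed in range(repeats):
        kf = KFold(n_splits=5, shuffle=True, random_state=seed)
        train_idx, test_idx = next(iter(kf.split(X)))   # one random 80/20 split per repeat (assumed)
        model = build_hpcb(X.shape[1], X.shape[2], num_classes)  # sketch from Section 3
        model.fit(X[train_idx], y[train_idx], batch_size=32, epochs=100, verbose=0)
        pred = model.predict(X[test_idx]).argmax(axis=1)
        uars.append(recall_score(y[test_idx], pred, average="macro"))  # UAR
    return float(np.mean(uars))
```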

4.4. The Performance of HPCB and Its Peer Methods

To analyze the generalization ability of the model HPCB on the datasets EMODB, CASIA, and SAVEE, confusion matrices were obtained by averaging 10 experimental results, as shown in Figure 3. The diagonal entries of each confusion matrix represent the recall rates. The prediction results of the three confusion matrices are summarized as follows.
Figure 3. The confusion matrices of HPCB on the datasets EMODB, CASIA, and SAVEE. (a) Confusion matrix of HPCB on database EMODB. (b) Confusion matrix of HPCB on database CASIA. (c) Confusion matrix of HPCB on database SAVEE.
First, on the test sets of databases EMODB, CASIA, and SAVEE, the average UARs of the model HPCB are 84.65%, 79.67%, and 56.50% respectively. Obviously, it achieves the best performance on the EMODB database.
Second, on the test set of the EMODB database, the emotions Fear (F) and Sadness (S) achieve 100.00% recall, a very impressive result that has rarely been achieved in the previous literature. Similarly, the emotions Neutral (N) and Surprise (Su) achieve 95.35% and 89.36% recall on the test set of the CASIA database, and the emotions Happiness (H) and Neutral (N) achieve 81.25% and 92.00% recall on the test set of the SAVEE database.
Third, on the test set of the EMODB database, the emotions Boredom (B) and Neutral (N) form an easily confused pair, as do Happiness (H) and Anger (A). On the test set of the CASIA database, the emotions Fear (F) and Sadness (Sa) form an easily confused pair. On the test set of the SAVEE database, the emotions Anger (A), Disgust (D), and Fear (F) show low recognition performance.
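Row-normalized, run-averaged matrices of the kind shown in Figure 3 can be computed as sketched below; y_true_runs and y_pred_runs are assumed to hold the test labels and predictions of the repeated runs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def averaged_confusion(y_true_runs, y_pred_runs, num_classes):
    """Sketch: row-normalize each run's confusion matrix (so the diagonal holds
    per-class recall) and average the matrices over the repeated runs."""
    mats = []
    for y_true, y_pred in zip(y_true_runs, y_pred_runs):
        cm = confusion_matrix(y_true, y_pred, labels=list(range(num_classes))).astype(float)
        cm /= np.maximum(cm.sum(axis=1, keepdims=True), 1.0)  # each row sums to 1
        mats.append(cm)
    return np.mean(mats, axis=0)
```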
Table 1, Table 2 and Table 3 summarize the performance improvements of HPCB in terms of UAR with respect to the related peer methods on the databases CASIA, EMODB, and SAVEE. Among them, the published results of [9,41,42,43,44] are used directly as baselines. The model of [45] was originally proposed for automatic speech recognition, and when the researchers in [46,47,48,49] applied it to speech emotion recognition, the databases they used were inconsistent with those used in this study. Therefore, this study adopted the model structures proposed in [45,46,47,48,49] and verified their performance on the three databases used here. The final results are shown in Table 1, Table 2 and Table 3.
Table 1. Performance comparisons (%) of the model HPCB to those of the peers in the literature on the CASIA database.
Table 2. Performance comparisons (%) of the model HPCB to those of the peers in the literature on the EMODB database.
Table 3. Performance comparisons (%) of the model HPCB to those of the peers in the literature on the SAVEE database.
The proposed HPCB model achieves much better performance on the CASIA and EMODB databases than the previous models, reaching 79.67% and 84.65% UAR, respectively. On the SAVEE database, HPCB achieves 56.50%, only 2.90% lower than the result reported in [9]. This suggests that the proposed model has good robustness and generalization.

5. Discussion

In this section, we further analyze the effectiveness and robustness of the proposed system on the databases CASIA, EMODB, and SAVEE. We use common evaluation metrics, such as weighted and unweighted accuracy and the F1-score, to estimate the class-level and overall accuracy. To measure the prediction performance with respect to the actual and predicted labels, the confusion matrix of each dataset is shown; the confusion between the actual and predicted labels of each class appears in the corresponding rows and columns of the matrix. We conducted comprehensive experiments on the three datasets to report the prediction performance in terms of precision, recall, F1-score, and weighted and unweighted results, and we chose an optimal model combination for an efficient SER system.
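The class-level precision, recall, and F1-score, together with the weighted and unweighted (macro) averages referred to above, can be obtained with scikit-learn's classification report; the label vectors below are tiny placeholders rather than real predictions.

```python
from sklearn.metrics import classification_report

# Placeholder labels and predictions; in practice these are the pooled
# test-fold labels and HPCB predictions for a given corpus.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]
print(classification_report(y_true, y_pred,
                            target_names=["anger", "happiness", "neutral"],
                            digits=4))
```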
The experimental results show that, for SER on different datasets, HPCB achieves higher performance than the other methods. The advantage of a CNN is that it shares convolution kernels and performs feature extraction automatically, making it suitable for high-dimensional data. At the same time, however, the pooling layer loses a lot of valuable information by ignoring the correlation between the local and the whole, which prevents CNNs from reaching high accuracy on time series. When dealing with tasks related to time series, LSTM is usually more appropriate; however, for classification tasks (including SER), LSTM alone has an obvious performance disadvantage. HPCB therefore gains a clear advantage in mapping the variables, preserving more of the valuable information, and it also performs well on tasks sensitive to time series. It can be concluded that the proposed model has excellent generalization ability for SER tasks.

6. Conclusions

In this study, a novel heterogeneous parallel acoustic model called HPCB was proposed for speech emotion recognition. It exploits spatiotemporal information more effectively. It is characterized by its two heterogeneous branches: the left one is composed of two dense layers and a Bi-LSTM layer, while the right one is composed of a dense layer, a convolution layer, and a Bi-LSTM layer. The 72-D high-level statistical function (HSF) features were calculated to verify the robustness and generalization of the model HPCB. Experimental results on the databases EMO-DB and CASIA suggest that HPCB demonstrates stable, leading advantages over the previous methods in the literature.
In the future, the effectiveness of HPCB will be further verified by applying it to other emotion databases and analyzing possible overfitting risks. HPCB can also be extended to other audio recognition or image classification problems owing to its superior learning capabilities. Furthermore, it will be compared with other deep learning models, such as the generative adversarial network (GAN), as well as with zero-shot learning techniques, in SER.
The proposed model demonstrates impressive performance in SER by retrieving spatiotemporal information through deep learning, which suggests that extracting spatiotemporal signals could be essential to achieving high-performance SER. On the other hand, how to decrease the possible overfitting risk is another interesting topic to explore further, because the proposed HPCB, with its complicated learning architecture, may still face overfitting even though a 0.5 dropout ratio is employed during learning. We plan to evaluate its learning performance in comparison with other peer deep learning models to examine whether the integration of different types of neural networks leads to an increase in overfitting and, if so, how to overcome it efficiently.

Author Contributions

Conceptualization, H.H. (Henry Han) and H.H. (Heming Huang); methodology, H.Z.; software, H.Z.; validation, H.H. (Heming Huang) and H.Z.; formal analysis, H.H. (Henry Han) and H.H. (Heming Huang); investigation, H.H. (Henry Han) and H.H. (Heming Huang); resources, H.Z.; data curation, H.Z.; writing—original draft preparation, H.Z.; writing—review and editing, H.Z., H.H. (Henry Han) and H.H. (Heming Huang); visualization, H.H. (Henry Han); supervision, H.H. (Heming Huang); project administration, H.H. (Heming Huang); funding acquisition, H.H. (Heming Huang). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the project "Research on Speech Recognition of Tibetan Amdo Dialect Based on Deep Transfer Learning", grant number 62066039. It was also supported by the Key Laboratory of Tibetan Information Processing, Ministry of Education; the Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province (Grant 2020-ZJ-Y05); and the Tibetan Information Engineering Technology Research Center of Qinghai Province.

Acknowledgments

The authors thank the School of Computer Science of Qinghai Normal University for its support.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hsu, J.H.; Su, M.H.; Wu, C.H.; Chen, Y.H. Speech emotion recognition considering nonverbal vocalization in affective conversations. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1675–1686. [Google Scholar] [CrossRef]
  2. Zhou, L.; Gao, J.; Li, D.; Shum, H.-Y. The design and implementation of xiaoice, an empathetic social chatbot. Comput. Linguist. 2020, 46, 1–62. [Google Scholar] [CrossRef]
  3. Brosch, T.; Scherer, K.R.; Grandjean, D.; Sander, D. The impact of emotion on perception, attention, memory, and decision making. Swiss. Med. Wkly. 2013, 143, w13786. [Google Scholar] [CrossRef] [PubMed]
  4. Tzirakis, P.; Zhang, J.H.; Schuller, B.W. End-to-end speech emotion recognition using deep neural networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5089–5093. [Google Scholar]
  5. Tahon, M.; Devillers, L. Towards a small set of robust acoustic features for emotion recognition: Challenges. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 16–28. [Google Scholar] [CrossRef] [Green Version]
  6. Reeves, B.; Nass, C.I. The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places; Cambridge University Press: Cambridge, UK, 1996. [Google Scholar]
  7. Koolagudi, S.G.; Rao, K.S. Emotion recognition from speech: A review. Int. J. Speech Technol. 2012, 15, 99–117. [Google Scholar] [CrossRef]
  8. Vogt, T.; Andre, E.; Wagner, J. Automatic recognition of emotions from speech: A review of the literature and recommendations for practical realization. Affect Emot. Hum. Comput. Interact. 2008, 48, 75–91. [Google Scholar]
  9. Jiang, P.; Fu, H.; Tao, H.; Lei, P.; Zhao, L. Parallelized convolutional recurrent neural network with spectral features for speech emotion recognition. IEEE Access 2019, 9, 90368–90376. [Google Scholar] [CrossRef]
  10. Zhang, Y.; Liu, Y.; Weninger, F.; Schuller, B. Multi-task deep neural network with shared hidden layers: Breaking down the wall between emotion representations. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4490–4494. [Google Scholar]
  11. Lotfian, R.; Busso, C. Formulating emotion perception as a probabilistic model with application to categorical emotion classification. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction (ACII), San Antonio, TX, USA, 23–26 October 2017; pp. 415–420. [Google Scholar]
  12. Schuller, B.; Batliner, A.; Steidl, S.; Seppi, D. Recognizing realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Commun. 2011, 53, 1062–1087. [Google Scholar] [CrossRef] [Green Version]
  13. Ayadi, M.E.; Kamel, M.; Karray, F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognit. 2011, 44, 572–587. [Google Scholar] [CrossRef]
  14. Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2227–2231. [Google Scholar]
  15. Schmitt, M.; Ringeval, F.; Schuller, B. At the border of acoustics and linguistics: Bag-of-audio-words for the recognition of emotions in speech. In Proceedings of the INTERSPEECH, San Francisco, CA, USA, 8–12 September 2016; pp. 495–499. [Google Scholar]
  16. Lee, J.; Tashev, I. High-level feature representation using recurrent neural network for speech emotion recognition. In Proceedings of the INTERSPEECH, Dresden, Germany, 6–10 September 2015; pp. 1–4. [Google Scholar]
  17. Han, K.; Yu, D.; Tashev, I. Speech emotion recognition using deep neural network and extreme learning machine. In Proceedings of the INTERSPEECH, Singapore, 7–10 September 2014; pp. 223–227. [Google Scholar]
  18. Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5200–5204. [Google Scholar]
  19. Neumann, M.; Vu, N.T. Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech. arXiv 2017, arXiv:1706.00612. [Google Scholar]
  20. Satt, A.; Rozenberg, S.; Hoory, R. Efficient emotion recognition from speech using deep learning on spectrograms. In Proceedings of the INTERSPEECH, Stockholm, Sweden, 24–27 August 2017; pp. 1089–1093. [Google Scholar]
  21. Li, P.; Song, Y.; McLoughlin, I.V.; Guo, W.; Dai, L.R. An attention pooling based representation learning method for speech emotion recognition. In Proceedings of the INTERSPEECH, Hyderabad, India, 2–6 September 2018; pp. 3087–3091. [Google Scholar]
  22. Garg, U.; Agarwal, S.; Gupta, S.; Dutt, R.; Singh, D. Prediction of emotions from the audio speech signals using MFCC, MEL and Chroma. In Proceedings of the International Conference on Computational Intelligence and Communication Networks (CICN), Bhimtal, India, 25–26 September 2020; pp. 1–5. [Google Scholar]
  23. Kumbhar, H.S.; Bhandari, S.U. Speech emotion recognition using MFCC features and LSTM network. In Proceedings of the International Conference on Computing, Communication, Control and Automation (ICCUBEA), Pune, India, 19–21 September 2019; pp. 1–3. [Google Scholar]
  24. Cirakman, O.; Gunsel, B. Online speaker emotion tracking with a dynamic state transition model. In Proceedings of the International Conference on Pattern Recognition (ICPR), Cancún, Mexico, 4–8 December 2016; pp. 307–312. [Google Scholar]
  25. Wang, K.; An, N.; Li, B.N.; Zhang, Y.; Li, L. Speech emotion recognition using Fourier parameters. IEEE Trans. Affect. Comput. 2015, 25, 69–75. [Google Scholar] [CrossRef]
  26. Kim, Y.; Provost, E.M. ISLA: Temporal segmentation and labeling for audio-visual emotion recognition. IEEE Trans. Affect. Comput. 2019, 10, 196–208. [Google Scholar] [CrossRef]
  27. New, T.L.; Foo, S.W.; Silva, L. Speech emotion recognition using hidden Markov models. Speech Commun. 2003, 41, 603–623. [Google Scholar]
  28. Neiberg, D.; Elenius, K.; Laskowski, K. Emotion recognition in spontaneous speech using GMMs. In Proceedings of the INTERSPEECH, Pittsburgh, PA, USA, 17–21 September 2006; pp. 809–812. [Google Scholar]
  29. Kokane, A.; Ram Mohana Reddy, G. Multiclass SVM-based language independent emotion recognition using selective speech features. In Proceedings of the International Conference on Advances in Computing, Communications and Informatics (ICACCI), New Delhi, India, 24–27 September 2014; pp. 1069–1073. [Google Scholar]
  30. Fu, L.Q.; Mao, X.; Chen, L.J. Relative speech emotion recognition based artificial neural network. In Proceedings of the Pacific Asia Conference on Language, Information and Computing (PACLIC), Wuhan, China, 19–20 December 2018; pp. 140–144. [Google Scholar]
  31. Xu, H.H.; Gao, J.; Yuan, J. Application of speech emotion recognition in intelligent household robot. In Proceedings of the International Conference on Artificial Intelligence and Computational Intelligence (AICI), Sanya, China, 23–24 October 2010; pp. 537–541. [Google Scholar]
  32. Schuller, B.W. Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends. Commun. ACM 2018, 61, 90–99. [Google Scholar] [CrossRef]
  33. Khaki, H.; Erzin, E. Use of affect based interaction classification for continuous emotion tracking. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2881–2885. [Google Scholar]
  34. Xie, Y.; Liang, R.; Liang, Z.; Huang, C.; Zou, C.; Schuller, B. Speech emotion classification using attention-based LSTM. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1675–1685. [Google Scholar] [CrossRef]
  35. Mao, Q.; Dong, M.; Huang, Z.; Zhan, Y. Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans. Multimed. 2014, 16, 2203–2213. [Google Scholar] [CrossRef]
  36. Chen, M.; He, X.; Yang, J.; Zhang, H. 3-D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Process. Lett. 2018, 25, 1440–1444. [Google Scholar] [CrossRef]
  37. Zhao, Z.; Zheng, Y.; Zhang, Z.; Wang, H.; Zhao, Y.; Li, C. Exploring spatio-temporal representations by integrating attention-based Bidirectional-LSTM-RNNs and FCNs for speech emotion recognition. In Proceedings of the INTERSPEECH, Hyderabad, India, 2–6 September 2018; pp. 272–276. [Google Scholar]
  38. Sainath, T.N.; Vinyals, O.; Senior, A.; Sak, H. Convolutional, long short-term memory, fully connected deep neural networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; pp. 4580–4584. [Google Scholar]
  39. Suen, H.Y.; Hung, K.E.; Lin, C.L. TensorFlow-based automatic personality recognition used in asynchronous video interviews. IEEE Access 2019, 7, 61018–61023. [Google Scholar] [CrossRef]
  40. Zou, F.; Shen, L.; Jie, Z.; Zhang, W.; Liu, W. A sufficient condition for convergences of Adam and RMSProp. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 11119–11127. [Google Scholar]
  41. Liu, Z.-T.; Xie, Q.; Wu, M.; Cao, W.-H.; Mei, Y.; Mao, J.-W. Speech emotion recognition based on an improved brain emotion learning model. Neurocomputing 2018, 309, 145–156. [Google Scholar] [CrossRef]
  42. Sun, Y.; Wen, G.; Wang, J. Weighted spectral features based on local Hu moments for speech emotion recognition. Biomed. Signal Process. Control 2015, 18, 80–90. [Google Scholar] [CrossRef]
  43. Wen, G.; Li, H.; Huang, J.; Li, D.; Xun, E. Random deep belief networks for recognizing emotions from speech signals. Comput. Intell. Neurosci. 2017, 5, 1–9. [Google Scholar] [CrossRef]
  44. Tao, H.; Liang, R.; Zha, C.; Zhang, X.; Zhao, L. Spectral features based on local Hu moments of Gabor spectrograms for speech emotion recognition. IEICE Trans. Inf. Syst. 2016, 99, 2186–2189. [Google Scholar] [CrossRef] [Green Version]
  45. Sainath, T.N.; Weiss, R.J.; Senior, A.; Wilson, K.W.; Vinyals, O. Learning the speech front-end with raw waveform CLDNNs. In Proceedings of the INTERSPEECH, Dresden, Germany, 6–10 September 2015; pp. 1–5. [Google Scholar]
  46. Dai, T.; Zhu, L.; Wang, Y.; Carley, K.M. Attentive stacked denoising autoencoder with Bi-LSTM for personalized context-aware citation recommendation. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 553–568. [Google Scholar] [CrossRef]
  47. Wang, H.; Zhao, D.Q. Emotion analysis of Microblog based on emotion dictionary and Bi-GRU. In Proceedings of the Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC), Dalian, China, 14–16 April 2020; pp. 197–200. [Google Scholar]
  48. Lee, K.H.; Kim, D.H. Design of a convolutional neural network for speech emotion recognition. In Proceedings of the International Conference on Information and Communication Technology Convergence (ICTC), Jeju, Korea, 21–23 October 2020; pp. 1332–1335. [Google Scholar]
  49. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic routing between capsules. In Proceedings of the Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 1–11. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
