Self-Relation Attention and Temporal Awareness for Emotion Recognition via Vocal Burst
Abstract
1. Introduction
- The Self-Relation Attention and Temporal Awareness (SRA-TA) module captures meaningful information not only from the essential parts of the audio signal but also from the temporal dynamics of the latent features extracted by the HuBERT [14] model (a minimal illustrative sketch is given after this list).
- The result improves slightly when a Mel-spectrogram, which encodes the frequency and loudness characteristics of the vocal burst (VB), is used as an additional input.
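The following PyTorch sketch illustrates the general idea behind such a module: a self-attention weight for each frame, a relation-attention weight computed between each frame and the attention-pooled global feature, and a bidirectional GRU that supplies temporal awareness before pooling and regression. It is a minimal sketch only; the layer sizes, sigmoid-based attention scoring, bidirectional GRU, and final sigmoid regressor are assumptions for illustration, not the authors' exact SRA-TA implementation.

```python
import torch
import torch.nn as nn

class SRATASketch(nn.Module):
    """Illustrative self-relation-attention + temporal-awareness head (not the authors' code)."""

    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 256, num_outputs: int = 10):
        super().__init__()
        self.self_attn = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())           # per-frame importance
        self.relation_attn = nn.Sequential(nn.Linear(2 * feat_dim, 1), nn.Sigmoid())   # frame vs. global feature
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)  # temporal awareness
        self.regressor = nn.Sequential(nn.Linear(2 * hidden_dim, num_outputs), nn.Sigmoid())  # emotion intensities

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: [B, T, D] frame-level latent features (e.g., HuBERT hidden states)
        alpha = self.self_attn(frames)                               # [B, T, 1] self-attention weights
        global_feat = (alpha * frames).sum(1) / alpha.sum(1)         # [B, D] attention-weighted global feature
        paired = torch.cat([frames, global_feat.unsqueeze(1).expand_as(frames)], dim=-1)
        beta = self.relation_attn(paired)                            # [B, T, 1] relation-attention weights
        seq, _ = self.gru(alpha * beta * frames)                     # [B, T, 2H] temporally aware sequence
        return self.regressor(seq.mean(dim=1))                       # [B, num_outputs] predicted intensities

# Example: 4 clips, 150 frames of 1024-dim HuBERT-large features each -> torch.Size([4, 10])
# out = SRATASketch()(torch.randn(4, 150, 1024))
```

In this arrangement, the two attention weights decide which frames matter, while the recurrent layer supplies the temporal context that plain attention pooling would discard.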
2. Related Works
2.1. Feature Selection for Speech Emotion Recognition
2.2. Attention Mechanism for Speech Emotion Recognition
2.3. Self-Supervised Learning Model for Speech Emotion Recognition
3. Materials and Methods
3.1. Overview
3.2. Dataset and Pre-Processing
3.3. Feature Extractor
3.4. Self-Relation Attention and Temporal Awareness Module
3.5. Multi-Label Regression Module
3.6. Loss Function
4. Results
4.1. Experimental Setup
4.2. Evaluation Metrics
4.3. Experimental Results
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
SER | Speech Emotion Recognition |
SRA-TA | Self-Relation Attention and Temporal Awareness |
SA | Self-Attention |
RA | Relation-Attention |
CCC | Concordance Correlation Coefficient |
VB | Vocal Burst |
SSL | Self-Supervised Learning |
LLD | Low-Level Descriptors |
MFCC | Mel-Frequency Cepstral Coefficients |
eGeMAPS | extended Geneva Minimalistic Acoustic Parameter Set |
SVM | Support Vector Machine |
CNN | Convolutional Neural Network |
ELM | Extreme Learning Machine |
AM | Attention Mechanism |
LSTM | Long Short-Term Memory |
GRU | Gated Recurrent Unit |
FC | Fully Connected |
RNN | Recurrent Neural Network |
References
1. Darwin, C.; Prodger, P. The Expression of the Emotions in Man and Animals; Oxford University Press: Oxford, UK, 1998.
2. Akçay, M.B.; Oğuz, K. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 2020, 116, 56–76.
3. Simon-Thomas, E.R.; Keltner, D.J.; Sauter, D.; Sinicropi-Yao, L.; Abramson, A. The voice conveys specific emotions: Evidence from vocal burst displays. Emotion 2009, 9, 838–844.
4. Schröder, M. Experimental study of affect bursts. Speech Commun. 2003, 40, 99–116.
5. Scherer, K.R. Expression of emotion in voice and music. J. Voice 1995, 9, 235–248.
6. Baird, A.; Tzirakis, P.; Brooks, J.A.; Gregory, C.B.; Schuller, B.; Batliner, A.; Cowen, A. The ACII 2022 Affective Vocal Bursts Workshop & Competition: Understanding a critically understudied modality of emotional expression. arXiv 2022, arXiv:2207.03572.
7. Cowen, A.; Baird, A.; Tzirakis, P.; Opara, M.; Kim, L.; Brooks, J.; Metrick, J. The Hume Vocal Burst Competition Dataset (H-VB) | Raw Data. 2022. Available online: https://zenodo.org/record/6308780#.Y6ParhVByUk (accessed on 28 February 2022).
8. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9650–9660.
9. Misra, I.; van der Maaten, L. Self-Supervised Learning of Pretext-Invariant Representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6707–6717.
10. Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A Survey on Contrastive Self-Supervised Learning. Technologies 2020, 9, 2.
11. Triantafyllopoulos, A.; Liu, S.; Schuller, B.W. Deep speaker conditioning for speech emotion recognition. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6.
12. Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1227–1236.
13. Liu, S.; Mallol-Ragolta, A.; Parada-Cabaleiro, E.; Qian, K.; Jing, X.; Kathan, A.; Hu, B.; Schuller, B.W. Audio self-supervised learning: A survey. arXiv 2022, arXiv:2203.01205.
14. Hsu, W.-N.; Bolte, B.; Tsai, Y.-H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A.H. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460.
15. Ntalampiras, S.; Fakotakis, N. Modeling the temporal evolution of acoustic parameters for speech emotion recognition. IEEE Trans. Affect. Comput. 2011, 3, 116–125.
16. Koolagudi, S.G.; Murthy, Y.V.; Bhaskar, S.P. Choice of a classifier, based on properties of a dataset: Case study-speech emotion recognition. Int. J. Speech Technol. 2018, 21, 167–183.
17. Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 2015, 7, 190–202.
18. Weninger, F.; Eyben, F.; Schuller, B.W.; Mortillaro, M.; Scherer, K.R. On the acoustics of emotion in audio: What speech, music, and sound have in common. Front. Psychol. 2013, 4, 292.
19. Han, K.; Yu, D.; Tashev, I. Speech emotion recognition using deep neural network and extreme learning machine. In Proceedings of the Interspeech 2014, Singapore, 14–18 September 2014.
20. Stuhlsatz, A.; Meyer, C.; Eyben, F.; Zielke, T.; Meier, G.; Schuller, B. Deep neural networks for acoustic emotion recognition: Raising the benchmarks. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 5688–5691.
21. Papakostas, M.; Spyrou, E.; Giannakopoulos, T.; Siantikos, G.; Sgouropoulos, D.; Mylonas, P.; Makedon, F. Deep visual attributes vs. hand-crafted audio features on multidomain speech emotion recognition. Computation 2017, 5, 26.
22. Wang, Z.-Q.; Tashev, I. Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5150–5154.
23. Badshah, A.M.; Ahmad, J.; Rahim, N.; Baik, S.W. Speech emotion recognition from spectrograms with deep convolutional neural network. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, Republic of Korea, 13–15 February 2017; pp. 1–5.
24. Hajarolasvadi, N.; Demirel, H. 3D CNN-based speech emotion recognition using k-means clustering and spectrograms. Entropy 2019, 21, 479.
25. Li, P.; Song, Y.; McLoughlin, I.V.; Guo, W.; Dai, L.-R. An attention pooling based representation learning method for speech emotion recognition. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3087–3091.
26. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
27. Hsiao, P.-W.; Chen, C.-P. Effective attention mechanism in dynamic models for speech emotion recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 2526–2530.
28. Lee, J.; Tashev, I. High-level feature representation using recurrent neural network for speech emotion recognition. In Proceedings of the Interspeech 2015, Dresden, Germany, 6–10 September 2015.
29. Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2227–2231.
30. Zheng, C.; Wang, C.; Jia, N. An ensemble model for multi-level speech emotion recognition. Appl. Sci. 2019, 10, 205.
31. Li, Y.; Zhao, T.; Kawahara, T. Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2803–2807.
32. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460.
33. Prasad, L.V.S.V.; Seth, A.; Ghosh, S.; Umesh, S. Analyzing the factors affecting usefulness of Self-Supervised Pre-trained Representations for Speech Recognition. arXiv 2022, arXiv:2203.16973.
34. Xin, D.; Takamichi, S.; Saruwatari, H. Exploring the Effectiveness of Self-supervised Learning and Classifier Chains in Emotion Recognition of Nonverbal Vocalizations. arXiv 2022, arXiv:2206.10695.
35. Kahn, J.; Rivière, M.; Zheng, W.; Kharitonov, E.; Xu, Q.; Mazaré, P.-E.; Karadayi, J.; Liptchinsky, V.; Collobert, R.; Fuegen, C.; et al. Libri-Light: A benchmark for ASR with limited or no supervision. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7669–7673.
36. Weyand, T.; Araujo, A.; Cao, B.; Sim, J. Google Landmarks Dataset v2-A large-scale benchmark for instance-level recognition and retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2575–2584.
37. Meng, D.; Peng, X.; Wang, K.; Qiao, Y. Frame attention networks for facial expression recognition in videos. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 3866–3870.
38. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555.
39. ACII A-VB2022—Hume AI|ML. Available online: https://www.competitions.hume.ai/avb2022 (accessed on 7 November 2022).
40. Lin, L.I.-K. A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989, 45, 255–268.
41. de Santana Correia, A.; Colombini, E.L. Attention, please! A survey of neural attention models in deep learning. Artif. Intell. Rev. 2022, 55, 6037–6124.
42. Wagner, J.; Triantafyllopoulos, A.; Wierstorf, H.; Schmitt, M.; Eyben, F.; Schuller, B.W. Dawn of the transformer era in speech emotion recognition: Closing the valence gap. arXiv 2022, arXiv:2203.07378.
43. Atmaja, B.T.; Sasou, A. Predicting Affective Vocal Bursts with Finetuned wav2vec 2.0. arXiv 2022, arXiv:2209.13146.
44. Nguyen, D.-K.; Pant, S.; Ho, N.-H.; Lee, G.-S.; Kim, S.-H.; Yang, H.-J. Fine-tuning Wav2vec for Vocal-burst Emotion Recognition. arXiv 2022, arXiv:2210.00263.
45. Hallmen, T.; Mertes, S.; Schiller, D.; André, E. An Efficient Multitask Learning Architecture for Affective Vocal Burst Analysis. arXiv 2022, arXiv:2209.13914.
46. Karas, V.; Triantafyllopoulos, A.; Song, M.; Schuller, B.W. Self-Supervised Attention Networks and Uncertainty Loss Weighting for Multi-Task Emotion Recognition on Vocal Bursts. arXiv 2022, arXiv:2209.07384.
Model | Mean CCC |
---|---|
Baseline [6] | 0.5686 |
Wav2vec2-large | 0.6902 |
HuBERT-large | 0.7012 |
DINO | 0.5920 |
HuBERT-large + SRA | 0.7211 |
HuBERT-large + TA | 0.7127 |
HuBERT-large + SRA-TA | 0.7265 |
HuBERT-large + DINO + SRA-TA | 0.7303 |
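The backbone rows in the table above rely on frame-level latent features taken from pretrained self-supervised models. The snippet below is a hedged, minimal sketch of how such features can be extracted with the HuggingFace transformers library; the checkpoint name, mono downmix, resampling, and waveform normalization are assumptions for illustration, not the authors' exact pipeline.

```python
import torch
import torchaudio
from transformers import HubertModel

# Checkpoint name is an assumption (Libri-Light-pretrained HuBERT-large); other HuBERT checkpoints work similarly.
model = HubertModel.from_pretrained("facebook/hubert-large-ll60k").eval()

@torch.no_grad()
def extract_frames(wav_path: str) -> torch.Tensor:
    """Return frame-level latent features of shape [T, 1024] for one vocal-burst clip."""
    waveform, sr = torchaudio.load(wav_path)                            # [channels, samples]
    waveform = waveform.mean(dim=0)                                     # downmix to mono
    if sr != 16000:                                                     # HuBERT expects 16 kHz audio
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    waveform = (waveform - waveform.mean()) / (waveform.std() + 1e-7)   # zero-mean, unit-variance input
    hidden = model(waveform.unsqueeze(0)).last_hidden_state             # [1, T, 1024], roughly 50 frames/s
    return hidden.squeeze(0)

# frames = extract_frames("sample_vocal_burst.wav")  # hypothetical file path
```

A head such as the SRA-TA sketch shown earlier can then be trained on these frame sequences while the backbone is kept frozen or fine-tuned.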
Dataset | Awe | Excitement | Amusement | Awkwardness | Fear | Horror | Distress | Triumph | Sadness | Surprise | Mean CCC |
---|---|---|---|---|---|---|---|---|---|---|---|
Validation | 0.8084 | 0.6895 | 0.7886 | 0.6080 | 0.7614 | 0.7370 | 0.6959 | 0.6813 | 0.7069 | 0.8125 | 0.7303 |
Test | 0.8140 | 0.6817 | 0.7956 | 0.6100 | 0.7623 | 0.7362 | 0.6935 | 0.6778 | 0.7128 | 0.8113 | 0.7295 |
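The per-emotion scores in both tables are concordance correlation coefficients (CCC) between predicted and annotated intensities, and the final column is their unweighted mean over the ten emotions. A minimal reference implementation of the metric is sketched below, using population (biased) variance estimates, a common convention; it is not the challenge organizers' official evaluation script.

```python
import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance correlation coefficient between two 1-D arrays.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y)) ** 2)
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()                  # population variances (ddof=0)
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()       # population covariance
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

# Mean CCC is the simple average over the ten emotions, e.g. for hypothetical
# [N, 10] label matrices t (targets) and p (predictions):
# mean_ccc = np.mean([ccc(t[:, k], p[:, k]) for k in range(10)])
```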
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
MDPI and ACS Style
Trinh, D.-L.; Vo, M.-C.; Kim, S.-H.; Yang, H.-J.; Lee, G.-S. Self-Relation Attention and Temporal Awareness for Emotion Recognition via Vocal Burst. Sensors 2023, 23, 200. https://doi.org/10.3390/s23010200
AMA Style
Trinh D-L, Vo M-C, Kim S-H, Yang H-J, Lee G-S. Self-Relation Attention and Temporal Awareness for Emotion Recognition via Vocal Burst. Sensors. 2023; 23(1):200. https://doi.org/10.3390/s23010200
Chicago/Turabian Style
Trinh, Dang-Linh, Minh-Cong Vo, Soo-Hyung Kim, Hyung-Jeong Yang, and Guee-Sang Lee. 2023. "Self-Relation Attention and Temporal Awareness for Emotion Recognition via Vocal Burst" Sensors 23, no. 1: 200. https://doi.org/10.3390/s23010200
APA Style
Trinh, D.-L., Vo, M.-C., Kim, S.-H., Yang, H.-J., & Lee, G.-S. (2023). Self-Relation Attention and Temporal Awareness for Emotion Recognition via Vocal Burst. Sensors, 23(1), 200. https://doi.org/10.3390/s23010200