Electronics
  • Article
  • Open Access

4 June 2024

Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion

1. School of Information and Communication Engineering, Communication University of China, Beijing 100024, China
2. School of Aerospace Science and Technology, Xidian University, Xi’an 710126, China
3. Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
4. Center of Information & Network Technology, Beijing Normal University, Beijing 100875, China
This article belongs to the Special Issue Applied AI in Emotion Recognition

Abstract

Speech emotion recognition (SER) aims to recognize human emotions through in-depth analysis of audio signals. However, it remains challenging to encode emotional cues and to fuse the encoded cues effectively. In this study, a dual-stream representation is developed, and both full training and fine-tuning of different deep networks are employed for encoding emotion patterns. Specifically, a cross-attention fusion (CAF) module is designed to integrate the dual-stream output for emotion recognition. Using different dual-stream encoders (fully training a text processing network and fine-tuning a pre-trained large language network), the CAF module is compared with three other fusion modules on three databases. The SER performance is quantified with weighted accuracy (WA), unweighted accuracy (UA), and F1-score (F1S). The experimental results suggest that the CAF outperforms the other three modules and leads to promising performance on the databases (EmoDB: WA, 97.20%; UA, 97.21%; F1S, 0.8804; IEMOCAP: WA, 69.65%; UA, 70.88%; F1S, 0.7084; RAVDESS: WA, 81.86%; UA, 82.75%; F1S, 0.8284). It is also found that fine-tuning a pre-trained large language network yields better representations than fully training a text processing network. In a future study, improved SER performance could be achieved through the development of a multi-stream representation of emotional cues and the incorporation of a multi-branch fusion mechanism for emotion recognition.
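The abstract does not spell out the internals of the CAF module, but the general idea of cross-attention fusion between two modality streams can be illustrated with a minimal sketch. All shapes, function names, and the mean-pooling step below are illustrative assumptions, not the paper's actual design: each stream attends to the other, and the two attended outputs are pooled and concatenated into one fused embedding.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(stream_a, stream_b):
    """Fuse two streams of shape (seq_len, dim) with cross-attention.

    Queries from stream A attend over keys/values from stream B, and
    vice versa; the two attended sequences are mean-pooled over time
    and concatenated (a hypothetical simplification of a CAF module).
    """
    d = stream_a.shape[-1]
    # A attends to B: scaled dot-product attention
    attn_ab = softmax(stream_a @ stream_b.T / np.sqrt(d)) @ stream_b
    # B attends to A
    attn_ba = softmax(stream_b @ stream_a.T / np.sqrt(d)) @ stream_a
    # pool over time and concatenate into a fixed-size fused embedding
    return np.concatenate([attn_ab.mean(axis=0), attn_ba.mean(axis=0)])

rng = np.random.default_rng(0)
acoustic = rng.standard_normal((50, 128))  # e.g. frame-level audio features
textual = rng.standard_normal((20, 128))   # e.g. token embeddings from a language model
fused = cross_attention_fuse(acoustic, textual)
print(fused.shape)  # (256,)
```

The fused vector would then feed a classifier head over the emotion categories; in practice the projections for queries, keys, and values are learned rather than the identity mapping used here.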
