Article

Combining wav2vec 2.0 Fine-Tuning and ConLearnNet for Speech Emotion Recognition

1
School of Electronics and Information Engineering, South China Normal University, Foshan 528234, China
2
Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117583, Singapore
3
School of Cyber Security, Guangdong Polytechnic Normal University, Guangzhou 510640, China
*
Authors to whom correspondence should be addressed.
Electronics 2024, 13(6), 1103; https://doi.org/10.3390/electronics13061103
Submission received: 26 January 2024 / Revised: 11 March 2024 / Accepted: 15 March 2024 / Published: 17 March 2024
(This article belongs to the Special Issue New Advances in Affective Computing)

Abstract

Speech emotion recognition is challenging because emotions are expressed through varied intonation and speech rate. To reduce the loss of emotional information during recognition and to improve the extraction and classification of speech emotions, we propose a novel two-fold approach. First, a feed-forward network with skip connections (SCFFN) is introduced to fine-tune wav2vec 2.0 and extract emotion embeddings. Then, ConLearnNet is employed for emotion classification in three steps: feature learning, contrastive learning, and classification. Feature learning transforms the input, while contrastive learning encourages similar representations for samples of the same category and discriminative representations across different categories. Experimental results on the IEMOCAP and EMO-DB datasets demonstrate that the proposed method outperforms state-of-the-art systems, achieving a WA and UAR of 72.86% and 72.85% on IEMOCAP, and 97.20% and 96.41% on EMO-DB, respectively.
Keywords: speech emotion recognition (SER); wav2vec 2.0; contrastive learning
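The contrastive-learning step described in the abstract can be illustrated with a minimal NumPy sketch of a supervised contrastive loss: same-class embeddings are pulled together and different-class embeddings pushed apart. The function name, temperature value, and batch construction here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Illustrative supervised contrastive loss over a batch of embeddings.

    Encourages similar representations for samples of the same emotion
    category and discriminative representations across categories.
    (Hypothetical sketch; hyperparameters are assumptions.)
    """
    labels = np.asarray(labels)
    # L2-normalize so the dot product is cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)  # an anchor is never its own pair
    # Row-wise log-softmax (numerically stable).
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - (row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True)))
    # Positives: other samples carrying the same emotion label.
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    # Average over anchors that actually have at least one positive.
    return per_anchor[pos.any(axis=1)].mean()
```

On a toy batch, embeddings clustered by class yield a lower loss than embeddings mixed across classes, which is the behavior the contrastive step relies on to separate emotion categories.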

Share and Cite

MDPI and ACS Style

Sun, C.; Zhou, Y.; Huang, X.; Yang, J.; Hou, X. Combining wav2vec 2.0 Fine-Tuning and ConLearnNet for Speech Emotion Recognition. Electronics 2024, 13, 1103. https://doi.org/10.3390/electronics13061103


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
