Emotion Recognition Using EEG Signals and Audiovisual Features with Contrastive Learning
Abstract
1. Introduction
- We propose a multimodal emotion recognition framework that combines audio-visual signals (the stimulus) with EEG signals (the response), so that both the stimulus and the response are taken into account.
- We integrate modality-specific networks and temporal convolutional networks (TCNs) into the modal encoders to extract spatiotemporal representations of the multimodal data, and we employ contrastive learning to capture intra-modal, inter-modal, and inter-class relationships in a shared embedding space.
- We use cross-modal attention mechanisms to strengthen the interactions between the extracted representations and to focus on the most salient information from each modality.
- We show that the proposed method outperforms the compared methods on benchmark datasets and on our own collected dataset.
2. Related Work
2.1. Multimodal Emotion Recognition
2.2. Contrastive Learning
2.3. Cross-Modal Attention
3. Proposed Method
3.1. Multimodal Encoder
3.1.1. Spatial Encoder
3.1.2. Temporal Encoder
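The temporal encoder builds on temporal convolutional networks (TCNs). As a rough illustration only, the sketch below shows the core TCN operation, a dilated causal 1-D convolution, in plain Python. The function names `causal_dilated_conv1d` and `receptive_field`, the kernel values, and the dilation schedule are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Sequence


def causal_dilated_conv1d(
    x: Sequence[float], kernel: Sequence[float], dilation: int = 1
) -> List[float]:
    """Dilated causal convolution: y[t] uses only x[t], x[t-d], x[t-2d], ..."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation  # look back k * dilation steps
            if idx >= 0:  # positions before the start act as zero padding
                acc += weight * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input steps seen by one output step after stacking the layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Hypothetical toy signal and a two-layer stack with dilations 1 and 2.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
layer1 = causal_dilated_conv1d(signal, kernel=[0.5, 0.5], dilation=1)
layer2 = causal_dilated_conv1d(layer1, kernel=[0.5, 0.5], dilation=2)
```

Stacking layers with growing dilation widens the receptive field without adding parameters per layer, which is the usual reason for choosing a TCN for long sequences.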
3.2. Contrastive Learning for Multimodal Representation
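As stated in the Introduction, contrastive learning is used to relate embeddings within and across modalities in a shared space. The sketch below illustrates only the inter-modal part with a generic InfoNCE-style loss in plain Python. It is not the paper's loss, whose exact form is not given in this text. The function names `l2_normalize` and `info_nce`, the temperature of 0.1, and the toy embeddings are assumptions for illustration.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(
    anchors: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """Mean InfoNCE loss: anchors[i] should match positives[i] and no other positives[j]."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of this anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Hypothetical EEG and video embeddings for two samples.
eeg = [[1.0, 0.0], [0.0, 1.0]]
video_aligned = [[0.9, 0.1], [0.1, 0.9]]
video_shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(eeg, video_aligned)
loss_shuffled = info_nce(eeg, video_shuffled)
```

The loss is small when each EEG embedding is closest to the video embedding of the same sample and large when the pairing is broken. Intra-modal and inter-class terms would need additional positive and negative sets that this sketch leaves out.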
3.3. Cross-Modality Attention and Classifier
- Video-Audio CMA:
- Video-EEG CMA:
- Audio-EEG CMA:
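The three pairings above can be sketched with generic scaled dot-product attention, in which frames of one modality act as queries over the frames of another. This is a minimal illustration under stated assumptions: it omits the learned projections a real attention layer would have, the choice of which modality supplies the queries is a guess, and the names `softmax` and `cross_modal_attention` and the toy features are hypothetical.

```python
import math
from typing import List, Sequence

Frames = Sequence[Sequence[float]]


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries: Frames, keys: Frames, values: Frames) -> List[List[float]]:
    """Each query frame (modality A) attends over all frames of modality B."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value frames, one output vector per query frame.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return attended


# Hypothetical per-frame features, two frames per modality.
video = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.5, 0.5], [1.0, 0.0]]
eeg = [[0.0, 2.0], [2.0, 0.0]]

video_audio = cross_modal_attention(video, audio, audio)
video_eeg = cross_modal_attention(video, eeg, eeg)
audio_eeg = cross_modal_attention(audio, eeg, eeg)

# Concatenate the three CMA outputs frame by frame before classification.
fused = [va + ve + ae for va, ve, ae in zip(video_audio, video_eeg, audio_eeg)]
```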
| Algorithm 1 Multimodal emotion recognition training algorithm. |
|---|
| Require: Multimodal dataset |
| Ensure: Trained model parameters |
| // Stage 1: Pre-training encoders |
| for each pre-training epoch do |
| for each mini-batch do |
| end for |
| end for |
| // Stage 2: Fine-tuning with CMA and classifier |
| Freeze the pre-trained encoders |
| for each fine-tuning epoch do |
| for each mini-batch do |
| // Cross-Modality Attention between modality pairs |
| Compute the Video-Audio, Video-EEG, and Audio-EEG CMA outputs |
| Concatenate all CMA outputs |
| end for |
| end for |
| return the trained model parameters |
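The control flow of Algorithm 1 can be mirrored by a small schematic. The sketch below is only a skeleton: the losses and update rules are not recoverable from the listing, so each parameter update is replaced by a counter, and the names `Module` and `train` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Module:
    """Stand-in for a trainable component; `updates` counts optimizer steps."""
    name: str
    trainable: bool = True
    updates: int = 0

    def step(self) -> None:
        if self.trainable:  # frozen modules ignore update requests
            self.updates += 1


def train(
    batches: Sequence[object], pretrain_epochs: int, finetune_epochs: int
) -> Tuple[List[Module], Module, Module]:
    encoders = [Module("video_encoder"), Module("audio_encoder"), Module("eeg_encoder")]
    cma = Module("cross_modality_attention")
    classifier = Module("classifier")

    # Stage 1: pre-train the encoders only.
    for _ in range(pretrain_epochs):
        for _batch in batches:
            for encoder in encoders:
                encoder.step()

    # Stage 2: freeze the encoders, then fine-tune CMA and classifier.
    for encoder in encoders:
        encoder.trainable = False
    for _ in range(finetune_epochs):
        for _batch in batches:
            for module in (*encoders, cma, classifier):
                module.step()  # no effect on the frozen encoders

    return encoders, cma, classifier


encoders, cma, classifier = train(batches=[0, 1, 2], pretrain_epochs=2, finetune_epochs=4)
```

Freezing the encoders in the second stage means that only the attention modules and the classifier adapt to the labels, while the pre-trained representations stay fixed.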
4. Experimental Results
4.1. Evaluation Datasets
- DEAP: The DEAP dataset contains EEG and peripheral signals collected from 32 participants (16 males and 16 females between the ages of 19 and 37). EEG signals were recorded while each participant watched 40 music video clips. Each participant rated their level of arousal, valence, dominance, and preference on a continuous scale from 1 to 9 using a Self-Assessment Manikin (SAM). Each trial contained 63 s of EEG signals, with the first 3 s serving as the baseline signal. The EEG signals were recorded at a sampling rate of 512 Hz using 32 electrodes. For this study, EEG data from 20 participants (10 males and 10 females) were selected for the experiment.
- SEED: The SEED dataset contains EEG and eye movement signals collected from 15 participants (7 males and 8 females). For this study, data from 10 participants (5 males and 5 females) were selected. Each participant’s EEG signals were collected while watching 15 Chinese movie clips approximately 4 min in length, designed to evoke positive, neutral, and negative emotions. The signals collected from 62 electrodes had a sampling rate of 1 kHz, which was then downsampled to 200 Hz. After watching each film clip, each participant recorded an emotion label for each video as negative (−1), neutral (0), or positive (1).
- DEHBA: The DEHBA dataset is a human EEG dataset collected during emotional audiovisual stimulation. EEG data were measured while subjects watched video clips designed to elicit four emotional states: (1) happy, (2) sad, (3) angry, and (4) relaxed. These states are defined on a plane with axes representing arousal and valence from the circumplex model of affect: “happy” corresponds to high valence and high arousal (HVHA), “angry” corresponds to low valence and high arousal (LVHA), “sad” corresponds to low valence and low arousal (LVLA), and “relaxed” corresponds to high valence and low arousal (HVLA). Researchers selected 100 videos (25 for each emotional state) based on their ability to elicit strong emotions without relying on language understanding. These videos were validated by 30 college students, who rated the intensity of their emotions after viewing each clip. EEG data were collected from 30 participants using a 36-channel electrode cap at a sampling rate of 1 kHz; for this study, data from 12 participants (6 males and 6 females) were selected for analysis. The participants reported their emotional responses and rated the intensity of the emotions they experienced after viewing each video. This feedback was used to refine data selection and evaluate the results.
- MTIY: The Movie Trailer In YouTube (MTIY) dataset was constructed from 50 movie trailer videos retrieved from YouTube using the search term “movie trailer”. The videos covered five genres—science fiction, comedy, action, horror, and romance—with 10 videos in each genre, and each video was 60 s long. Subjects were instructed to watch all 50 videos, and an Emotiv headset was used to obtain EEG signals, with EEG features extracted every second. The EEG data were collected using 14 electrodes. The EEG features were collected using 36 electrodes. For this study, data from 16 participants (8 males and 8 females) were used. Each subject rated the level of arousal, valence, dominance, and preference on a continuous scale from 1 to 9 after watching all of the videos.
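The figures quoted in the dataset descriptions translate into simple sample counts. The sketch below works through them for a DEAP trial (63 s at 512 Hz with a 3 s baseline) and for the SEED downsampling (1 kHz to 200 Hz), and encodes the DEHBA quadrants as a lookup table. The helper names are hypothetical, and plain striding is used as a stand-in for downsampling. A real pipeline would low-pass filter first, and the text does not say how the authors resampled or whether they used the baseline segment.

```python
from typing import Dict, List, Sequence, Tuple


def split_trial(
    samples: Sequence[float], sampling_rate_hz: int, baseline_seconds: int = 3
) -> Tuple[Sequence[float], Sequence[float]]:
    """Split one trial into its leading baseline segment and the stimulus segment."""
    cut = baseline_seconds * sampling_rate_hz
    return samples[:cut], samples[cut:]


def decimate_by_striding(samples: Sequence[float], source_hz: int, target_hz: int) -> List[float]:
    """Naive downsampling by keeping every n-th sample (no anti-aliasing filter)."""
    if source_hz % target_hz != 0:
        raise ValueError("source rate must be an integer multiple of the target rate")
    return list(samples[:: source_hz // target_hz])


# DEHBA emotions as (valence, arousal) quadrants of the circumplex model.
DEHBA_QUADRANTS: Dict[str, Tuple[str, str]] = {
    "happy": ("high", "high"),    # HVHA
    "angry": ("low", "high"),     # LVHA
    "sad": ("low", "low"),        # LVLA
    "relaxed": ("high", "low"),   # HVLA
}

deap_trial = list(range(63 * 512))  # placeholder samples for one channel
baseline, stimulus = split_trial(deap_trial, sampling_rate_hz=512)

seed_one_second = list(range(1000))  # placeholder: one second at 1 kHz
seed_200hz = decimate_by_striding(seed_one_second, source_hz=1000, target_hz=200)
```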
4.2. Experimental Set-Up
- VE-BiLSTM [79]: This method employs a two-layer bidirectional LSTM network. It performs feature-level fusion by concatenating video and EEG features as input, where the video and EEG features are each 1024-dimensional. The first LSTM layer has 1024 hidden units, and the second LSTM layer has 256 hidden units. The final recognition is performed using a softmax layer on top of the concatenated forward and backward hidden states from the second Bi-LSTM layer.
- AVE-RT [45]: This method combines EEG, audio, and visual features for emotion recognition through feature-level fusion. It extracts power spectral density features from the EEG signals across five frequency bands, audio features using eGeMAPS [80], and visual features, including the luminance coefficient and color energy. These multimodal features are concatenated at the feature level and fed into a random tree classifier for emotion recognition.
- AVE-KELM [81]: This method combines video content and EEG signals. It extracts audio-visual features from video clips and EEG features using wavelet packet decomposition (WPD). The video features are selected using double input symmetrical relevance (DISR), while EEG features are selected by a decision tree (DT). The selected features from both modalities are then combined at the decision level using a kernel-based extreme learning machine (ELM) for final emotion recognition.
- AVE-LSTM [82]: This method integrates the audio, video, and EEG modalities for emotion recognition. Each modality has its own feature extractor, and LSTM networks are used for emotion recognition. Specifically, audio features are derived from MFCC, video features are extracted using VGG19, and EEG features are obtained through PCA after bootstrapping. The outputs from each LSTM are individually used for emotion recognition, and the final emotion prediction is achieved through decision fusion of these results. While the original approach also incorporated EMG data, our implementation excluded this modality.
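The baselines above differ mainly in where the modalities are merged: at the feature level (concatenation before a single classifier) or at the decision level (combining per-modality predictions). The sketch below contrasts the two in plain Python. Weighted averaging of class probabilities is only one common decision-fusion rule and is an assumption here, since the baselines' exact fusion rules are not reproduced. The function names and the toy probabilities are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def feature_level_fusion(*feature_vectors: Sequence[float]) -> List[float]:
    """Concatenate per-modality feature vectors into one classifier input."""
    fused: List[float] = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


def decision_level_fusion(
    per_modality_probs: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> Tuple[int, List[float]]:
    """Average class probabilities across modalities and pick the best class."""
    count = len(per_modality_probs)
    weights = weights or [1.0 / count] * count
    num_classes = len(per_modality_probs[0])
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, per_modality_probs))
        for c in range(num_classes)
    ]
    return fused.index(max(fused)), fused


# Feature level: two 1024-dimensional vectors, as described for VE-BiLSTM.
video_features = [0.0] * 1024
eeg_features = [0.0] * 1024
fused_features = feature_level_fusion(video_features, eeg_features)

# Decision level: hypothetical two-class outputs of audio, video, and EEG models.
label, fused_probs = decision_level_fusion([[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
```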
4.3. Experimental Results
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Andalibi, N.; Buss, J. The human in emotion recognition on social media: Attitudes, outcomes, risks. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; pp. 1–16.
2. Dubey, A.; Shingala, B.; Panara, J.R.; Desai, K.; Sahana, M. Digital Content Recommendation System through Facial Emotion Recognition. Int. J. Res. Appl. Sci. Eng. Technol. 2023, 11, 1272–1276.
3. Pepa, L.; Spalazzi, L.; Capecci, M.; Ceravolo, M.G. Automatic emotion recognition in clinical scenario: A systematic review of methods. IEEE Trans. Affect. Comput. 2021, 14, 1675–1695.
4. Caruelle, D.; Shams, P.; Gustafsson, A.; Lervik-Olsen, L. Affective computing in marketing: Practical implications and research opportunities afforded by emotionally intelligent machines. Mark. Lett. 2022, 33, 163–169.
5. Jafari, M.; Shoeibi, A.; Khodatars, M.; Bagherzadeh, S.; Shalbaf, A.; García, D.L.; Gorriz, J.M.; Acharya, U.R. Emotion recognition in EEG signals using deep learning methods: A review. Comput. Biol. Med. 2023, 165, 107450.
6. Lin, W.; Li, C. Review of studies on emotion recognition and judgment based on physiological signals. Appl. Sci. 2023, 13, 2573.
7. Karnati, M.; Seal, A.; Bhattacharjee, D.; Yazidi, A.; Krejcar, O. Understanding deep learning techniques for recognition of human emotions using facial expressions: A comprehensive survey. IEEE Trans. Instrum. Meas. 2023, 72, 1–31.
8. Hashem, A.; Arif, M.; Alghamdi, M. Speech emotion recognition approaches: A systematic review. Speech Commun. 2023, 154, 102974.
9. Mittal, T.; Mathur, P.; Bera, A.; Manocha, D. Affect2mm: Affective analysis of multimedia content using emotion causality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5661–5671.
10. Srivastava, D.; Singh, A.K.; Tapaswi, M. How You Feelin’? Learning Emotions and Mental States in Movie Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2517–2528.
11. Wang, S.; Ji, Q. Video affective content analysis: A survey of state-of-the-art methods. IEEE Trans. Affect. Comput. 2015, 6, 410–430.
12. Wang, Y.; Song, W.; Tao, W.; Liotta, A.; Yang, D.; Li, X.; Gao, S.; Sun, Y.; Ge, W.; Zhang, W.; et al. A systematic review on affective computing: Emotion models, databases, and recent advances. Inf. Fusion 2022, 83, 19–52.
13. Goncalves, L.; Busso, C. Robust audiovisual emotion recognition: Aligning modalities, capturing temporal information, and handling missing features. IEEE Trans. Affect. Comput. 2022, 13, 2156–2170.
14. Ezzameli, K.; Mahersia, H. Emotion recognition from unimodal to multimodal analysis: A review. Inf. Fusion 2023, 99, 101847.
15. Ahmed, N.; Al Aghbari, Z.; Girija, S. A systematic survey on multimodal emotion recognition using learning algorithms. Intell. Syst. Appl. 2023, 17, 200171.
16. Wei, Y.; Hu, D.; Tian, Y.; Li, X. Learning in audio-visual context: A review, analysis, and new perspective. arXiv 2022, arXiv:2208.09579.
17. Huang, Y.; Du, C.; Xue, Z.; Chen, X.; Zhao, H.; Huang, L. What makes multi-modal learning better than single (provably). Adv. Neural Inf. Process. Syst. 2021, 34, 10944–10956.
18. Ma, Y.; Hao, Y.; Chen, M.; Chen, J.; Lu, P.; Košir, A. Audio-visual emotion fusion (AVEF): A deep efficient weighted approach. Inf. Fusion 2019, 46, 184–192.
19. Hossain, M.S.; Muhammad, G. Emotion recognition using deep learning approach from audio-visual emotional big data. Inf. Fusion 2019, 49, 69–78.
20. Ghaleb, E.; Popa, M.; Asteriadis, S. Metric learning-based multimodal audio-visual emotion recognition. IEEE Multimed. 2019, 27, 37–48.
21. Praveen, R.G.; Granger, E.; Cardinal, P. Cross attentional audio-visual fusion for dimensional emotion recognition. In Proceedings of the 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), Jodhpur, India, 15–18 December 2021; pp. 1–8.
22. Chen, S.; Tang, J.; Zhu, L.; Kong, W. A multi-stage dynamical fusion network for multimodal emotion recognition. Cogn. Neurodyn. 2023, 17, 671–680.
23. Zali-Vargahan, B.; Charmin, A.; Kalbkhani, H.; Barghandan, S. Semisupervised Deep Features of Time-Frequency Maps for Multimodal Emotion Recognition. Int. J. Intell. Syst. 2023, 2023, 3608115.
24. Perez-Gaspar, L.A.; Caballero-Morales, S.O.; Trujillo-Romero, F. Multimodal emotion recognition with evolutionary computation for human-robot interaction. Expert Syst. Appl. 2016, 66, 42–61.
25. Kim, D.H.; Baddar, W.J.; Jang, J.; Ro, Y.M. Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition. IEEE Trans. Affect. Comput. 2017, 10, 223–236.
26. Hao, M.; Cao, W.H.; Liu, Z.T.; Wu, M.; Xiao, P. Visual-audio emotion recognition based on multi-task and ensemble learning with multiple features. Neurocomputing 2020, 391, 42–51.
27. Farhoudi, Z.; Setayeshi, S. Fusion of deep learning features with mixture of brain emotional learning for audio-visual emotion recognition. Speech Commun. 2021, 127, 92–103.
28. Majumder, N.; Hazarika, D.; Gelbukh, A.; Cambria, E.; Poria, S. Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowl.-Based Syst. 2018, 161, 124–133.
29. Sarvestani, R.R.; Boostani, R. FF-SKPCCA: Kernel probabilistic canonical correlation analysis. Appl. Intell. 2017, 46, 438–454.
30. Deldari, S.; Xue, H.; Saeed, A.; He, J.; Smith, D.V.; Salim, F.D. Beyond just vision: A review on self-supervised representation learning on multimodal and temporal data. arXiv 2022, arXiv:2206.02353.
31. Vempati, R.; Sharma, L.D. A systematic review on automated human emotion recognition using electroencephalogram signals and artificial intelligence. Results Eng. 2023, 18, 101027.
32. Rainville, P.; Bechara, A.; Naqvi, N.; Damasio, A.R. Basic emotions are associated with distinct patterns of cardiorespiratory activity. Int. J. Psychophysiol. 2006, 61, 5–18.
33. Kreibig, S.D. Autonomic nervous system activity in emotion: A review. Biol. Psychol. 2010, 84, 394–421.
34. Sarvakar, K.; Senkamalavalli, R.; Raghavendra, S.; Kumar, J.S.; Manjunath, R.; Jaiswal, S. Facial emotion recognition using convolutional neural networks. Mater. Today Proc. 2023, 80, 3560–3564.
35. Ye, J.; Wen, X.C.; Wei, Y.; Xu, Y.; Liu, K.; Shan, H. Temporal modeling matters: A novel temporal emotional modeling approach for speech emotion recognition. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
36. Can, Y.S.; Mahesh, B.; André, E. Approaches, applications, and challenges in physiological emotion recognition—A tutorial overview. Proc. IEEE 2023, 111, 1287–1313.
37. Chakravarthi, B.; Ng, S.C.; Ezilarasan, M.; Leung, M.F. EEG-based emotion recognition using hybrid CNN and LSTM classification. Front. Comput. Neurosci. 2022, 16, 1019776.
38. Antoniadis, P.; Pikoulis, I.; Filntisis, P.P.; Maragos, P. An audiovisual and contextual approach for categorical and continuous emotion recognition in-the-wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–11 October 2021; pp. 3645–3651.
39. Zhang, Y.H.; Huang, R.; Zeng, J.; Shan, S. M3F: Multi-modal continuous valence-arousal estimation in the wild. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 632–636.
40. Mocanu, B.; Tapu, R.; Zaharia, T. Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning. Image Vis. Comput. 2023, 133, 104676.
41. Udahemuka, G.; Djouani, K.; Kurien, A.M. Multimodal Emotion Recognition Using Visual, Vocal and Physiological Signals: A Review. Appl. Sci. 2024, 14, 8071.
42. Li, Z.; Zhang, G.; Dang, J.; Wang, L.; Wei, J. Multi-modal emotion recognition based on deep learning of EEG and audio signals. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Virtual, 18–22 July 2021; pp. 1–6.
43. Song, B.C.; Kim, D.H. Hidden emotion detection using multi-modal signals. In Proceedings of the Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 8–13 May 2021; pp. 1–7.
44. Liang, Z.; Zhang, X.; Zhou, R.; Zhang, L.; Li, L.; Huang, G.; Zhang, Z. Cross-individual affective detection using EEG signals with audio-visual embedding. Neurocomputing 2022, 510, 107–121.
45. Xing, B.; Zhang, H.; Zhang, K.; Zhang, L.; Wu, X.; Shi, X.; Yu, S.; Zhang, S. Exploiting EEG signals and audiovisual feature fusion for video emotion recognition. IEEE Access 2019, 7, 59844–59861.
46. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 1597–1607.
47. Akbari, H.; Yuan, L.; Qian, R.; Chuang, W.H.; Chang, S.F.; Cui, Y.; Gong, B. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Adv. Neural Inf. Process. Syst. 2021, 34, 24206–24221.
48. Dissanayake, V.; Seneviratne, S.; Rana, R.; Wen, E.; Kaluarachchi, T.; Nanayakkara, S. Sigrep: Toward robust wearable emotion recognition with contrastive representation learning. IEEE Access 2022, 10, 18105–18120.
49. Jiang, W.B.; Li, Z.; Zheng, W.L.; Lu, B.L. Functional emotion transformer for EEG-assisted cross-modal emotion recognition. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 1841–1845.
50. Tang, J.; Ma, Z.; Gan, K.; Zhang, J.; Yin, Z. Hierarchical multimodal-fusion of physiological signals for emotion recognition with scenario adaption and contrastive alignment. Inf. Fusion 2024, 103, 102129.
51. Yang, D.; Huang, S.; Liu, Y.; Zhang, L. Contextual and cross-modal interaction for multi-modal speech emotion recognition. IEEE Signal Process. Lett. 2022, 29, 2093–2097.
52. Praveen, R.G.; de Melo, W.C.; Ullah, N.; Aslam, H.; Zeeshan, O.; Denorme, T.; Pedersoli, M.; Koerich, A.L.; Bacon, S.; Cardinal, P.; et al. A joint cross-attention model for audio-visual fusion in dimensional emotion recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2486–2495.
53. Zhao, J.; Ru, G.; Yu, Y.; Wu, Y.; Li, D.; Li, W. Multimodal music emotion recognition with hierarchical cross-modal attention network. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6.
54. Praveen, R.G.; Alam, J. Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4803–4813.
55. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673.
56. Xiao, R.; Ding, C.; Hu, X. Time Synchronization of Multimodal Physiological Signals through Alignment of Common Signal Types and Its Technical Considerations in Digital Health. J. Imaging 2022, 8, 120.
57. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
58. Hershey, S.; Chaudhuri, S.; Ellis, D.P.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135.
59. Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100.
60. Shao, W.; Xiao, R.; Rajapaksha, P.; Wang, M.; Crespi, N.; Luo, Z.; Minerva, R. Video anomaly detection with NTCN-ML: A novel TCN for multi-instance learning. Pattern Recognit. 2023, 143, 109765.
61. Singhania, D.; Rahaman, R.; Yao, A. C2F-TCN: A framework for semi-and fully-supervised temporal action segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11484–11501.
62. Zhou, W.; Lu, J.; Xiong, Z.; Wang, W. Leveraging TCN and Transformer for effective visual-audio fusion in continuous emotion recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5756–5763.
63. Ishaq, M.; Khan, M.; Kwon, S. TC-Net: A Modest & Lightweight Emotion Recognition System Using Temporal Convolution Network. Comput. Syst. Sci. Eng. 2023, 46, 3355–3369.
64. Lemaire, Q.; Holzapfel, A. Temporal convolutional networks for speech and music detection in radio broadcast. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR 2019), Delft, The Netherlands, 4–8 November 2019.
65. Li, C.; Chen, B.; Zhao, Z.; Cummins, N.; Schuller, B.W. Hierarchical attention-based temporal convolutional networks for eeg-based emotion recognition. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 1240–1244.
66. Bi, J.; Wang, F.; Ping, J.; Qu, G.; Hu, F.; Li, H.; Han, S. FBN-TCN: Temporal convolutional neural network based on spatial domain fusion brain networks for affective brain–computer interfaces. Biomed. Signal Process. Control 2024, 94, 106323.
67. Yang, L.; Wang, Y.; Ouyang, R.; Niu, X.; Yang, X.; Zheng, C. Electroencephalogram-based emotion recognition using factorization temporal separable convolution network. Eng. Appl. Artif. Intell. 2024, 133, 108011.
68. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763.
69. Wang, Z.; Wu, Z.; Agarwal, D.; Sun, J. Medclip: Contrastive learning from unpaired medical images and text. arXiv 2022, arXiv:2210.10163.
70. Guzhov, A.; Raue, F.; Hees, J.; Dengel, A. Audioclip: Extending clip to image, text and audio. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 7–13 May 2022; pp. 976–980.
71. Geng, X.; Liu, H.; Lee, L.; Schuurmans, D.; Levine, S.; Abbeel, P. Multimodal masked autoencoders learn transferable representations. arXiv 2022, arXiv:2205.14204.
72. Mai, S.; Zeng, Y.; Zheng, S.; Hu, H. Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis. IEEE Trans. Affect. Comput. 2022, 14, 2276–2289.
73. Huang, G.; Ma, F. Concad: Contrastive learning-based cross attention for sleep apnea detection. In Proceedings of the Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track: European Conference, ECML PKDD 2021, Bilbao, Spain, 13–17 September 2021; Proceedings, Part V 21. Springer: Berlin/Heidelberg, Germany, 2021; pp. 68–84.
74. Zhou, R.; Zhou, H.; Shen, L.; Chen, B.Y.; Zhang, Y.; He, L. Integrating Multimodal Contrastive Learning and Cross-Modal Attention for Alzheimer’s Disease Prediction in Brain Imaging Genetics. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkiye, 5–8 December 2023; pp. 1806–1811.
75. Nguyen, C.V.T.; Mai, A.T.; Le, T.S.; Kieu, H.D.; Le, D.T. Conversation Understanding using Relational Temporal Graph Neural Networks with Auxiliary Cross-Modality Interaction. arXiv 2023, arXiv:2311.04507.
76. Krishna, D.; Patil, A. Multimodal Emotion Recognition Using Cross-Modal Attention and 1D Convolutional Neural Networks. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 4243–4247.
77. Koelstra, S.; Muhl, C.; Soleymani, M.; Lee, J.S.; Yazdani, A.; Ebrahimi, T.; Pun, T.; Nijholt, A.; Patras, I. Deap: A database for emotion analysis; using physiological signals. IEEE Trans. Affect. Comput. 2011, 3, 18–31.
78. Zheng, W.L.; Lu, B.L. Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks. IEEE Trans. Auton. Ment. Dev. 2015, 7, 162–175.
79. Ogawa, T.; Sasaka, Y.; Maeda, K.; Haseyama, M. Favorite video classification based on multimodal bidirectional LSTM. IEEE Access 2018, 6, 61401–61409.
80. Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 2015, 7, 190–202.
81. Duan, L.; Ge, H.; Yang, Z.; Chen, J. Multimodal fusion using kernel-based ELM for video emotion recognition. In Proceedings of the ELM-2015 Volume 1: Theory, Algorithms and Applications (I); Springer: Berlin/Heidelberg, Germany, 2016; pp. 371–381.
82. Chen, J.; Ro, T.; Zhu, Z. Emotion recognition with audio, video, EEG, and EMG: A dataset and baseline approaches. IEEE Access 2022, 10, 13229–13242.
83. Asokan, A.R.; Kumar, N.; Ragam, A.V.; Shylaja, S. Interpretability for multimodal emotion recognition using concept activation vectors. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–8.
84. Polo, E.M.; Mollura, M.; Lenatti, M.; Zanet, M.; Paglialonga, A.; Barbieri, R. Emotion recognition from multimodal physiological measurements based on an interpretable feature selection method. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Virtual, 1–5 November 2021; pp. 989–992.
85. Liu, B.; Guo, J.; Chen, C.P.; Wu, X.; Zhang, T. Fine-grained interpretability for EEG emotion recognition: Concat-aided grad-CAM and systematic brain functional network. IEEE Trans. Affect. Comput. 2023, 15, 671–684.
86. Zhao, S.; Hong, X.; Yang, J.; Zhao, Y.; Ding, G. Toward Label-Efficient Emotion and Sentiment Analysis. Proc. IEEE 2023, 111, 1159–1197.
87. Qiu, S.; Chen, Y.; Yang, Y.; Wang, P.; Wang, Z.; Zhao, H.; Kang, Y.; Nie, R. A review on semi-supervised learning for EEG-based emotion recognition. Inf. Fusion 2023, 104, 102190.
88. Ma, H.; Wang, J.; Lin, H.; Zhang, B.; Zhang, Y.; Xu, B. A transformer-based model with self-distillation for multimodal emotion recognition in conversations. IEEE Trans. Multimed. 2023, 26, 776–788.
89. Aslam, M.H.; Pedersoli, M.; Koerich, A.L.; Granger, E. Multi Teacher Privileged Knowledge Distillation for Multimodal Expression Recognition. arXiv 2024, arXiv:2408.09035.
90. Sun, T.; Wei, Y.; Ni, J.; Liu, Z.; Song, X.; Wang, Y.; Nie, L. Muti-modal Emotion Recognition via Hierarchical Knowledge Distillation. IEEE Trans. Multimed. 2024, 26, 9036–9046.




| Rating Values (RVs) | Valence | Arousal | Dominance | 
|---|---|---|---|
| 1 ≤ RVs ≤ 5 | Low | Low | Low | 
| 6 ≤ RVs ≤ 9 | High | High | High |

| Rating Values (RVs) | Valence | Arousal | Dominance |
|---|---|---|---|
| 1 ≤ RVs ≤ 3 | Negative | Activated | Controlled | 
| 4 ≤ RVs ≤ 6 | Neutral | Moderate | Moderate | 
| 7 ≤ RVs ≤ 9 | Positive | Deactivated | Overpowered | 
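The two rating tables above amount to a threshold lookup. The sketch below encodes the two-level bins and the three-level valence bins by their lower bounds. Because the tables list integer ranges while the ratings are continuous, a fractional rating such as 5.5 falls between two rows. Assigning it to the lower bin is an assumption of this sketch, not something the tables specify, and the names `TWO_LEVEL`, `THREE_LEVEL_VALENCE`, and `rating_to_level` are hypothetical.

```python
from typing import List, Tuple

# Each bin is (lower bound, label), listed in ascending order.
Bins = List[Tuple[int, str]]

TWO_LEVEL: Bins = [(1, "Low"), (6, "High")]
THREE_LEVEL_VALENCE: Bins = [(1, "Negative"), (4, "Neutral"), (7, "Positive")]


def rating_to_level(rating: float, bins: Bins, scale_max: float = 9.0) -> str:
    """Map a 1-9 rating to the label of the last bin whose lower bound it reaches."""
    if not bins[0][0] <= rating <= scale_max:
        raise ValueError(f"rating {rating} is outside the {bins[0][0]}-{scale_max} scale")
    label = bins[0][1]
    for lower_bound, name in bins:
        # Assumption: values between two integer rows stay in the lower bin.
        if rating >= lower_bound:
            label = name
    return label


two_level_example = [rating_to_level(r, TWO_LEVEL) for r in (1, 5, 5.5, 6, 9)]
three_level_example = [rating_to_level(r, THREE_LEVEL_VALENCE) for r in (3, 4, 6, 7)]
```

The arousal and dominance columns of the three-level table can be handled the same way by swapping in their own labels.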

| Methods | Valence Accuracy (%) | Valence F1 (%) | Arousal Accuracy (%) | Arousal F1 (%) | Dominance Accuracy (%) | Dominance F1 (%) |
|---|---|---|---|---|---|---|
| VE-BiLSTM [79] | 71.8 | 71.4 | 70.1 | 70.2 | 71.5 | 71.3 | 
| AVE-KELM [81] | 78.3 | 78.1 | 76.2 | 76.6 | 77.9 | 78.1 | 
| AVE-LSTM [82] | 82.6 | 83.1 | 80.6 | 80.3 | 82.1 | 81.9 | 
| AVE-RT [45] | 85.7 | 85.5 | 82.4 | 82.2 | 85.2 | 84.8 | 
| Proposed Method | 93.4 | 93.2 | 91.7 | 92.0 | 93.5 | 93.2 |

| Methods | Valence Accuracy (%) | Valence F1 (%) | Arousal Accuracy (%) | Arousal F1 (%) | Dominance Accuracy (%) | Dominance F1 (%) |
|---|---|---|---|---|---|---|
| VE-BiLSTM [79] | 64.6 | 64.2 | 64.3 | 63.9 | 63.7 | 63.5 | 
| AVE-KELM [81] | 73.7 | 73.5 | 73.2 | 74.1 | 72.4 | 72.1 | 
| AVE-LSTM [82] | 78.5 | 77.8 | 77.1 | 76.9 | 76.8 | 77.2 | 
| AVE-RT [45] | 80.1 | 80.3 | 79.5 | 80.2 | 80.7 | 80.5 | 
| Proposed Method | 89.3 | 89.6 | 88.6 | 88.2 | 89.2 | 89.5 |

| Methods | DEAP: Four-Level Accuracy (%) | DEAP: Four-Level F1 (%) | SEED: Three-Level Accuracy (%) | SEED: Three-Level F1 (%) |
|---|---|---|---|---|
| VE-BiLSTM [79] | 60.1 | 59.4 | 69.3 | 70.2 | 
| AVE-KELM [81] | 67.5 | 68.2 | 75.6 | 74.9 | 
| AVE-LSTM [82] | 69.3 | 70.2 | 77.3 | 78.5 | 
| AVE-RT [45] | 75.5 | 78.4 | 81.5 | 81.3 | 
| Proposed Method | 83.2 | 84.1 | 90.9 | 91.2 |

| Methods | DEHBA Accuracy (%) | DEHBA F1 (%) | MTIY Accuracy (%) | MTIY F1 (%) |
|---|---|---|---|---|
| VE-BiLSTM [79] | 80.3 | 80.4 | 75.6 | 74.3 | 
| AVE-KELM [81] | 83.4 | 82.7 | 78.3 | 79.2 | 
| AVE-LSTM [82] | 85.3 | 84.1 | 80.2 | 81.1 | 
| AVE-RT [45] | 87.5 | 86.6 | 82.6 | 83.0 | 
| Proposed Method | 96.5 | 96.5 | 91.6 | 92.7 |

| Modality | Accuracy (%) | F1 (%) |
|---|---|---|
| Audio + Video | 78.4 | 77.8 | 
| Audio + EEG | 82.5 | 81.9 | 
| Video + EEG | 84.6 | 84.2 | 
| Audio + EEG + Video | 96.5 | 96.5 |

| Condition | Accuracy (%) | F1 (%) |
|---|---|---|
| Without Contrastive Learning | 92.5 | 91.3 | 
| Without Cross-Modal Attention | 94.3 | 94.0 | 
| Without Contrastive Learning and Cross-Modal Attention | 91.2 | 92.1 | 
| Proposed Method | 96.5 | 96.5 |

| Method | Accuracy (%) | F1 (%) |
|---|---|---|
| Proposed method (all samples) | 96.5 | 96.3 | 
| Proposed method (audio energy-based selection) | 97.4 | 98.1 | 
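All result tables above report accuracy and F1. For reference, the sketch below computes both from predicted and true labels in plain Python. The text does not state which F1 averaging was used, so the macro average shown here is an assumption, and the function names and toy labels are hypothetical.

```python
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Hypothetical three-class predictions (e.g. negative, neutral, positive).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

acc_percent = 100.0 * accuracy(y_true, y_pred)
f1_percent = 100.0 * macro_f1(y_true, y_pred)
```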
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Lee, J.-H.; Kim, J.-Y.; Kim, H.-G. Emotion Recognition Using EEG Signals and Audiovisual Features with Contrastive Learning. Bioengineering 2024, 11, 997. https://doi.org/10.3390/bioengineering11100997