GMHCA-MCBILSTM: A Gated Multi-Head Cross-Modal Attention-Based Network for Emotion Recognition Using Multi-Physiological Signals
Abstract
1. Introduction
- To enhance the model’s robustness to scale variations in physiological signals, we propose a novel Multi-Scale Convolutional Bidirectional Temporal Network (MC-BiLSTM). This network features a flexible, multi-branch parallel architecture that can be adapted to task requirements by adjusting convolutional kernel sizes. It extracts multi-level features through these kernels and employs a cross-scale feature fusion mechanism to integrate global semantics with local details, thereby significantly improving emotion recognition performance (a minimal illustrative sketch is provided after this list).
- To achieve efficient fusion of the three modalities, especially when they are weakly correlated, we design a Gated Multi-Head Cross-Attention (GMHCA) module. The network built on this module dynamically constrains attention weights; by concatenating the dynamically gated fusion results of the pairwise modality combinations with the original EEG features, it effectively exploits inter-modal relationships and substantially boosts emotion recognition accuracy.
- Systematic experiments and ablation studies on the DEAP dataset demonstrate that the proposed model achieves superior classification accuracy in subject-dependent tasks; furthermore, cross-subject validation on the SEED-IV dataset confirms the model’s robustness and generalization capability.
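To make the MC-BiLSTM branch described above concrete, the following is a minimal PyTorch sketch of one multi-scale convolutional branch followed by a BiLSTM. The kernel sizes (3/5/7), channel counts, and the plain channel-wise concatenation used as cross-scale fusion are illustrative assumptions and do not reproduce the authors' exact configuration.

```python
# Minimal sketch of a multi-scale convolutional BiLSTM branch (illustrative only).
import torch
import torch.nn as nn

class MultiScaleBiLSTMBranch(nn.Module):
    def __init__(self, in_channels, kernel_sizes=(3, 5, 7), conv_channels=16, hidden=32):
        super().__init__()
        # One parallel convolutional branch per kernel size (multi-scale extraction).
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_channels, conv_channels, k, padding=k // 2),
                nn.BatchNorm1d(conv_channels),
                nn.ReLU(),
            )
            for k in kernel_sizes
        ])
        # Cross-scale fusion here is plain channel-wise concatenation.
        self.bilstm = nn.LSTM(conv_channels * len(kernel_sizes), hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, x):                                          # x: (batch, channels, time)
        feats = torch.cat([b(x) for b in self.branches], dim=1)    # (batch, C*scales, time)
        out, _ = self.bilstm(feats.transpose(1, 2))                # (batch, time, 2*hidden)
        return out[:, -1, :]                                       # last-step summary feature

# Toy usage: an 8-channel segment with 128 time steps.
branch = MultiScaleBiLSTMBranch(in_channels=8)
feature = branch(torch.randn(4, 8, 128))                           # -> shape (4, 64)
```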
2. Related Works
2.1. Behavioral Representation in Emotion Recognition
2.2. Neurophysiological Representation in Emotion Recognition
3. Network and Model
3.1. Algorithm Architecture and Theoretical Basis
3.2. Design of the Multi-Scale Convolutional Bidirectional Temporal Network
3.3. Design of the Gated Multi-Head Cross-Attention Module
- A two-layer fully connected network is used to learn the gating weights for each modality;
- The Sigmoid activation function outputs gating coefficients in the range [0, 1];
- A complementarity metric is calculated based on feature space differences (a minimal sketch of this gating scheme follows this list).
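The following PyTorch sketch illustrates the gated cross-attention scheme summarized above: EEG features query a peripheral modality through multi-head cross-attention, a two-layer fully connected gating network with a Sigmoid output produces a gate value in [0, 1], and a complementarity term is derived from a feature-space difference. The layer sizes, the exact complementarity statistic, and the final fusion rule are illustrative assumptions rather than the paper's equations (d_model is set to 24 here only so that PyTorch's built-in attention accepts three heads).

```python
# Illustrative sketch of gated multi-head cross-attention (not the authors' exact formulas).
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, d_model=24, num_heads=3, gate_units=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Two-layer fully connected gating network with a Sigmoid output in [0, 1].
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, gate_units),
            nn.ReLU(),
            nn.Linear(gate_units, 1),
            nn.Sigmoid(),
        )

    def forward(self, eeg, other):
        # eeg, other: (batch, time, d_model); EEG acts as the query modality.
        attended, _ = self.cross_attn(query=eeg, key=other, value=other)
        # Complementarity as a simple feature-space difference statistic (assumed form).
        complementarity = (eeg - attended).abs().mean(dim=1)            # (batch, d_model)
        pooled = attended.mean(dim=1)                                   # (batch, d_model)
        g = self.gate(torch.cat([pooled, complementarity], dim=-1))     # (batch, 1)
        # Gated combination of cross-modal and EEG features (assumed fusion rule).
        return g.unsqueeze(1) * attended + (1.0 - g).unsqueeze(1) * eeg

fusion = GatedCrossAttention()
fused = fusion(torch.randn(4, 10, 24), torch.randn(4, 10, 24))          # -> (4, 10, 24)
```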
4. Experimental Process and Result Analysis
4.1. Experimental Setup
4.2. Experimental Results and Analysis
4.3. Failure Case Analysis and Discussion
- Signal Quality Issues: We found that some misclassified samples were accompanied by significant physiological artifacts (e.g., motion artifacts). For example, intense head movements generate high-frequency noise in EEG and EOG signals, which the model misinterpreted as high-arousal emotional features, leading to incorrect judgments. This highlights the need for more robust artifact detection and removal modules in future research.
- Individual Variability Challenges: Model performance decreased significantly in cross-subject tests. This indicates that the model’s generalization capability remains limited when processing data from new subjects not included in the training set. The primary reason may lie in the inherent differences in physiological baseline levels and response intensities among individuals, which the model failed to fully normalize.
- Inter-Modal Conflicts: We observed that when one modality (e.g., EDA) was contaminated by external factors (e.g., temperature changes), the gated fusion mechanism could be misled, assigning excessive weight to the noisy modality and thereby overshadowing correct information from other modalities. This suggests that future fusion strategies should incorporate an evaluation of signal-to-noise ratios across modalities.
- Uncertainty in Emotion Labels: Emotions are inherently subjective and continuous. The simplified binary classification labels may fail to accurately describe certain neutral or mixed emotional states, causing these borderline samples to become sources of classification errors.
5. Conclusions
- To address the heterogeneity of EEG, EOG, and EDA signals (e.g., differences in frequency bands and temporal dynamics), a multi-scale convolutional and bidirectional LSTM fusion module (MC-BiLSTM) is designed to jointly extract spatiotemporal features;
- The GMHCA module is introduced to improve cross-modal fusion efficiency by computing inter-modal correlations in parallel and dynamically adjusting gating weights.
- Insufficient cross-subject generalization: experiments reveal significant differences in optimal channel selection among individuals (e.g., Subject 11’s accuracy improves by 12% after tri-modal fusion). Future work could adopt a meta-learning framework that builds a prior knowledge base from brain functional connectivity topology and dynamically optimizes personalized channel configurations through few-shot learning.
- Real-time optimization: The current model’s computational complexity may limit its efficiency in embedded deployment. Further research on model quantization and pruning strategies is needed.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Pseudocode of the Proposed Tri-Modal Emotion Recognition Model
Algorithm A1: Tri-Modal Emotion Recognition Model with Cross-Modal Attention and Dynamic Gate Fusion
Require: EEG, EOG, EDA signals; labels
Ensure: Trained model and evaluation results
1: Step 1: Data Loading and Preprocessing
2: Load EEG, EOG, EDA, and label files
3: Select relevant channels for each modality
4: Replace missing or infinite values with valid numbers
5: Convert labels into binary classes (valence ≥ 5 → positive, else negative)
6: Step 2: Model Definition
7: Define MultiHeadCrossAttention layer:
Project features into Q, K, V |
Split into multiple heads, compute attention weights |
Fuse heads and obtain cross-modal representation |
8: Define DynamicGateFusion layer:
Compute a complementarity score from the feature-space difference between the two modalities
Apply the gating network (two-layer fully connected + Sigmoid) to generate the gate value
Fuse features by weighting the cross-modal and primary-modality features with the gate value
9: Step 3: Modality-specific Feature Extraction
10: EEG branch: Multi-scale convolution → BiLSTM
11: EOG branch: Multi-scale convolution → BiLSTM
12: EDA branch: Multi-scale convolution → BiLSTM
13: Step 4: Cross-modal Fusion
14: Compute cross-modal attention: EEG–EOG and EEG–EDA
15: Apply dynamic gate fusion with EEG as primary modality
16: Concatenate fused features with EEG features
17: Fully connected layers + Dropout
18: Output classification via Sigmoid activation
19: Step 5: Training and Evaluation
20: Split data into training, validation, and test sets
21: Train model with early stopping
22: Evaluate model: Accuracy, Recall, F1-score
23: Save training history and results
24: Plot confusion matrix
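To complement Algorithm A1, the snippet below sketches the label preparation of Step 1 and the evaluation metrics of Step 5 using NumPy and scikit-learn. The array names, the placeholder data, and the 60/20/20 split are assumptions for illustration; only the NaN/Inf replacement, the valence ≥ 5 binarization, and the reported metrics (accuracy, recall, F1-score) follow the pseudocode.

```python
# Sketch of Steps 1 and 5 of Algorithm A1 (illustrative; placeholder data and split).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, f1_score

def prepare_labels(valence):
    # Step 1: replace missing or infinite values, then binarize (valence >= 5 -> positive).
    valence = np.nan_to_num(valence, nan=0.0, posinf=0.0, neginf=0.0)
    return (valence >= 5).astype(int)

def evaluate(y_true, y_pred):
    # Step 5: the three metrics reported in the paper.
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

# Toy usage with random data standing in for extracted features and ratings.
X = np.random.randn(200, 32)
y = prepare_labels(np.random.uniform(1, 9, size=200))
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
# ... train the tri-modal model on (X_train, y_train), validate with early stopping,
# then call evaluate(y_test, model_predictions) on the held-out test set ...
```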
References
1. Shu, L.; Xie, J.; Yang, M.; Li, Z.; Li, Z.; Liao, D.; Xu, X.; Yang, X. A review of emotion recognition using physiological signals. Sensors 2018, 18, 2074.
2. Kim, D.; Lee, J.; Woo, Y.; Jeong, J.; Kim, C.; Kim, D.K. Deep learning application to clinical decision support system in sleep stage classification. J. Pers. Med. 2022, 12, 136.
3. Wang, Z.; Wang, Y.; Hu, C.; Yin, Z.; Song, Y. Transformers for EEG-based emotion recognition: A hierarchical spatial information learning model. IEEE Sens. J. 2022, 22, 4359–4368.
4. Araújo, T.; Teixeira, J.P.; Rodrigues, P.M. Smart-data-driven system for Alzheimer disease detection through electroencephalographic signals. Bioengineering 2022, 9, 141.
5. Zheng, W.L.; Zhu, J.Y.; Lu, B.L. Identifying stable patterns over time for emotion recognition from EEG. IEEE Trans. Affect. Comput. 2017, 10, 417–429.
6. Zhang, H.; Zhao, X.; Wu, Z.; Sun, B.; Li, T. Motor imagery recognition with automatic EEG channel selection and deep learning. J. Neural Eng. 2021, 18, 016004.
7. Alarcao, S.M.; Fonseca, M.J. Emotions recognition using EEG signals: A survey. IEEE Trans. Affect. Comput. 2017, 10, 374–393.
8. Zhang, J.; Yin, Z.; Chen, P.; Nichele, S. Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review. Inf. Fusion 2020, 59, 103–126.
9. Sadowska, K.; Turnwald, M.; O’Neil, T.; Maust, D.T.; Gerlach, L.B. Reply to: Comment on “Behavioral Symptoms and Treatment Challenges for Patients Living With Dementia”. J. Am. Geriatr. Soc. 2025.
10. Rubin, M.; Cutillo, G.; Viti, V.; Margoni, M.; Preziosa, P.; Zanetta, C.; Bellini, A.; Moiola, L.; Fanelli, G.F.; Rocca, M.A.; et al. MOGAD-related epilepsy: A systematic characterization of age-dependent clinical, fluid, imaging and neurophysiological features. J. Neurol. 2025, 272, 508.
11. Huang, F.; Yang, C.; Weng, W.; Chen, Z.; Zhang, Z. CM-FusionNet: A cross-modal fusion fatigue detection method based on electroencephalogram and electrooculogram. Comput. Electr. Eng. 2025, 123, 110204.
12. Wang, S.; Guo, G.; Xu, S. Monitoring physical and mental activities with skin conductance. Nat. Electron. 2025, 8, 294–295.
13. Mayerl, C.J.; German, R.Z. Muscle Function and Electromyography: (almost) 70 years since Doty and Bosma (1956). J. Neurophysiol. 2025, 134, 337–346.
14. Kumar, G.; Varshney, N. Hybrid deep-CNN and Bi-LSTM model with attention mechanism for enhanced ECG-based heart disease diagnosis. Phys. Eng. Sci. Med. 2025, 1–11.
15. Zhuang, Y.; Lin, L.; Tong, R.; Liu, J.; Iwamoto, Y.; Chen, Y.W. G-GCSN: Global graph convolution shrinkage network for emotion perception from gait. In Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops, Kyoto, Japan, 30 November–4 December 2020.
16. Bougourzi, F.; Dornaika, F.; Mokrani, K.; Taleb-Ahmed, A.; Ruichek, Y. Fusing Transformed Deep and Shallow features (FTDS) for image-based facial expression recognition. Expert Syst. Appl. 2020, 156, 113459.
17. Wu, Y.; Zhang, S.; Li, P. Multi-modal emotion recognition in conversation based on prompt learning with text-audio fusion features. Sci. Rep. 2025, 15, 8855.
18. Pane, E.S.; Wibawa, A.D.; Purnomo, M.H. Improving the accuracy of EEG emotion recognition by combining valence lateralization and ensemble learning with tuning parameters. Cogn. Process. 2019, 20, 405–417.
19. Yousefipour, B.; Rajabpour, V.; Abdoljabbari, H.; Sheykhivand, S.; Danishvar, S. An Ensemble Deep Learning Approach for EEG-Based Emotion Recognition Using Multi-Class CSP. Biomimetics 2024, 9, 761.
20. Wu, X.; Ju, X.; Dai, S.; Li, X.; Li, M. Multi-source domain adaptation for EEG emotion recognition based on inter-domain sample hybridization. Front. Hum. Neurosci. 2024, 18, 1464431.
21. Liu, Y.J.; Yu, M.; Zhao, G.; Song, J.; Ge, Y.; Shi, Y. Real-time movie-induced discrete emotion recognition from EEG signals. IEEE Trans. Affect. Comput. 2017, 9, 550–562.
22. Arya, R.; Singh, J.; Kumar, A. A survey of multidisciplinary domains contributing to affective computing. Comput. Sci. Rev. 2021, 40, 100399.
23. Montembeault, M.; Brando, E.; Charest, K.; Tremblay, A.; Roger, É.; Duquette, P.; Rouleau, I. Multimodal emotion perception in young and elderly patients with multiple sclerosis. Mult. Scler. Relat. Disord. 2022, 58, 103478.
24. Sousa, A.; d’Aquin, M.; Zarrouk, M.; Holloway, J. Person-Independent Multimodal Emotion Detection for Children with High-Functioning Autism. 2020; pp. 14–20. Available online: https://www.academia.edu/126627984/Person_Independent_Multimodal_Emotion_Detection_for_Children_with_High_Functioning_Autism (accessed on 1 September 2025).
25. Zheng, W.L.; Liu, W.; Lu, Y.; Lu, B.L.; Cichocki, A. Emotionmeter: A multimodal framework for recognizing human emotions. IEEE Trans. Cybern. 2018, 49, 1110–1122.
26. Rayatdoost, S.; Rudrauf, D.; Soleymani, M. Expression-guided EEG representation learning for emotion recognition. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3222–3226.
27. Jiménez-Guarneros, M.; Fuentes-Pineda, G. Multi-modal supervised domain adaptation with a multi-level alignment strategy and consistent decision boundaries for cross-subject emotion recognition from EEG and eye movement signals. Knowl.-Based Syst. 2025, 315, 113238.
28. Wu, M.; Teng, W.; Fan, C.; Pei, S.; Li, P.; Pei, G.; Li, T.; Liang, W.; Lv, Z. Multimodal Emotion Recognition based on EEG and EOG Signals evoked by the Video-odor Stimuli. IEEE Trans. Neural Syst. Rehabil. Eng. 2024, 32, 3496–3505.
29. Sun, W.; Yan, X.; Su, Y.; Wang, G.; Zhang, Y. MSDSANet: Multimodal emotion recognition based on multi-stream network and dual-scale attention network feature representation. Sensors 2025, 25, 2029.
30. Li, G.; Chen, N.; Zhu, H.; Li, J.; Xu, Z.; Zhu, Z. Uncertainty-Aware Graph Contrastive Fusion Network for multimodal physiological signal emotion recognition. Neural Netw. 2025, 187, 107363.
31. Huang, Y.; Yang, J.; Liao, P.; Pan, J. Fusion of facial expressions and EEG for multimodal emotion recognition. Comput. Intell. Neurosci. 2017, 2017, 2107451.
32. Wang, Z.; Wang, Y. Emotion recognition based on multimodal physiological electrical signals. Front. Neurosci. 2025, 19, 1512799.
33. Li, C.; Bao, Z.; Li, L.; Zhao, Z. Exploring temporal representations by leveraging attention-based bidirectional LSTM-RNNs for multi-modal emotion recognition. Inf. Process. Manag. 2020, 57, 102185.
34. Zhang, J.; Zhu, L.; Kong, W.; Zhang, J.; Cao, J.; Cichocki, A. Reinforcement Learning Decoding Method of Multi-User EEG Shared Information Based on Mutual Information Mechanism. IEEE J. Biomed. Health Inform. 2025, 29, 6588–6598.
35. Redwan, U.G.; Zaman, T.; Mizan, H.B. Spatio-temporal CNN-BiLSTM dynamic approach to emotion recognition based on EEG signal. Comput. Biol. Med. 2025, 192, 110277.
36. Tang, X.; Qi, Y.; Zhang, J.; Liu, K.; Tian, Y.; Gao, X. Enhancing EEG and sEMG fusion decoding using a multi-scale parallel convolutional network with attention mechanism. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 32, 212–222.
37. Liu, X.; Li, T.; Tang, C.; Xu, T.; Chen, P.; Bezerianos, A.; Wang, H. Emotion recognition and dynamic functional connectivity analysis based on EEG. IEEE Access 2019, 7, 143293–143302.
38. Thiruselvam, S.; Reddy, M.R. Frontal EEG correlation based human emotion identification and classification. Phys. Eng. Sci. Med. 2024, 48, 121–132.
39. Chao, H.; Dong, L.; Liu, Y.; Lu, B. Emotion recognition from multiband EEG signals using CapsNet. Sensors 2019, 19, 2212.
40. Tang, H.; Liu, W.; Zheng, W.L.; Lu, B.L. Multimodal emotion recognition using deep neural networks. Neural Inf. Process. 2017, 10637, 811–819.
41. Wu, X.; Zheng, W.L.; Li, Z.; Lu, B.L. Investigating EEG-based functional connectivity patterns for multimodal emotion recognition. J. Neural Eng. 2022, 19, 016012.
42. Chen, J.; Hu, B.; Xu, L.; Moore, P.; Su, Y. Feature-level fusion of multimodal physiological signals for emotion recognition. In Proceedings of the 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Washington, DC, USA, 9–12 November 2015; pp. 395–399.
43. Zhang, Z.; Yu, N.; Bian, Y.; Yan, J. Research on emotion recognition methods based on multi-modal physiological signal feature fusion. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi (J. Biomed. Eng.) 2025, 42, 17–23.
44. Rodriguez Aguiñaga, A.; Ramirez Ramirez, M.; Salgado Soto, M.d.C.; Quezada Cisnero, M.d.l.A. A Multimodal Low Complexity Neural Network Approach for Emotion Recognition. Hum. Behav. Emerg. Technol. 2024, 2024, 5581443.
45. Zhao, Y.; Chen, D. Expression EEG Multimodal Emotion Recognition Method Based on the Bidirectional LSTM and Attention Mechanism. Comput. Math. Methods Med. 2021, 2021, 9967592.
46. Gao, H.; Cai, Z.; Wang, X.; Wu, M.; Liu, C. Multimodal Fusion of Behavioral and Physiological Signals for Enhanced Emotion Recognition Via Feature Decoupling and Knowledge Transfer. IEEE J. Biomed. Health Inform. 2025, 1–11.
47. Li, P.; Li, A.; Li, X.; Lv, Z. Cross-Subject Emotion Recognition with CT-ELCAN: Leveraging Cross-Modal Transformer and Enhanced Learning-Classify Adversarial Network. Bioengineering 2025, 12, 528.
48. Ma, Z.; Li, A.; Tang, J.; Zhang, J.; Yin, Z. Multimodal emotion recognition by fusing complementary patterns from central to peripheral neurophysiological signals across feature domains. Eng. Appl. Artif. Intell. 2025, 143, 110004.
Parameter | Value |
---|---|
Optimizer | Adam
Batch size | 16
Loss function | Binary cross-entropy
Learning rate schedule | ReduceLROnPlateau
Num-heads | 3
D-model | 16
Alpha | 0.7
Beta | 0.3
Gate-units | 8
Epochs | 10
GPU | NVIDIA RTX 3060
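For reproducibility, the snippet below shows how the tabulated training settings (Adam, binary cross-entropy, batch size 16, 10 epochs, and a ReduceLROnPlateau schedule) could be wired together in PyTorch. The placeholder model, the initial learning rate, and the scheduler's factor and patience are assumptions that the table does not specify.

```python
# Sketch of the tabulated training configuration (placeholder model and data).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # placeholder
criterion = nn.BCELoss()                                      # binary cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)     # initial LR is an assumption
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.5, patience=2)

loader = DataLoader(TensorDataset(torch.randn(160, 64),
                                  torch.randint(0, 2, (160, 1)).float()),
                    batch_size=16, shuffle=True)

for epoch in range(10):                                       # Epochs = 10
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    scheduler.step(loss)   # in practice, step on the validation loss each epoch
```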
Modality | Accuracy (%) | Recall (%) | F1-Score (%) |
---|---|---|---|
EEG | 81.77 | 79.41 | 80.86 |
EOG | 75.46 | 73.29 | 73.76 |
EDA | 71.51 | 70.37 | 70.79 |
EEG+EOG | 86.79 | 85.67 | 86.13 |
EEG+EDA | 82.57 | 80.77 | 81.61 |
EEG+EOG+EDA | 89.45 | 88.31 | 89.01 |
Ablation Method | Accuracy (%) | Recall (%) | F1-Score (%) |
---|---|---|---|
Multi-scale | 85.38 | 84.90 | 85.37 |
BiLSTM | 84.18 | 83.67 | 84.79 |
Multi-head | 87.53 | 87.16 | 87.34 |
GMHCA | 89.01 | 88.03 | 88.37 |
Subject | EEG (%) | EEG+EOG (%) | Subject | EEG (%) | EEG+EOG (%) |
---|---|---|---|---|---|
1 | 90.17 | 93.41 | 9 | 93.45 | 96.33 |
2 | 86.48 | 91.76 | 10 | 89.22 | 93.28 |
3 | 88.20 | 93.19 | 11 | 89.03 | 90.75 |
4 | 87.61 | 92.64 | 12 | 92.19 | 94.01 |
5 | 86.94 | 92.97 | 13 | 88.79 | 89.64 |
6 | 86.17 | 91.35 | 14 | 90.26 | 93.47 |
7 | 87.45 | 92.08 | 15 | 89.31 | 90.27 |
8 | 90.49 | 95.79 | Avg | 89.05 ± 1.93 | 92.73 ± 1.80 |
Method | Modalities | Accuracy (%) | EEG Channels |
---|---|---|---|
Chao et al. [39] | EEG+FACE | 66.73 | 32 |
Tang et al. [40] | EEG+EOG+EMG+EDA | 83.82 | 32 |
Wu et al. [41] | EEG+EOG | 86.61 | 32 |
Chen et al. [42] | EEG+EOG | 87.98 | 32 |
Zhang et al. [43] | EEG+EMG+EDA | 80.19 | 32 |
Rodriguez Aguiñaga et al. [44] | EEG+ECG+EDA | 86 | 13 |
Zhao et al. [45] | EEG+Expression | 86.8 | 32 |
Gao et al. [46] | EEG+ECG | 65.84 | 32 |
Li et al. [47] | EEG+EOG+EMG+EDA | 70.82 | 32 |
Ma et al. [48] | EEG+EMG+EDA | 77.33 | 32 |
Proposed method | EEG+EOG+EDA | 89.45 | 8 |