Article

GMHCA-MCBILSTM: A Gated Multi-Head Cross-Modal Attention-Based Network for Emotion Recognition Using Multi-Physiological Signals

School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
*
Author to whom correspondence should be addressed.
Algorithms 2025, 18(10), 664; https://doi.org/10.3390/a18100664
Submission received: 22 August 2025 / Revised: 8 October 2025 / Accepted: 15 October 2025 / Published: 20 October 2025
(This article belongs to the Special Issue Machine Learning in Medical Signal and Image Processing (4th Edition))

Abstract

To address the limitations of the single-modal electroencephalogram (EEG), such as its single physiological dimension, weak anti-interference ability, and inability to fully reflect emotional states, this paper proposes a gated multi-head cross-attention module (GMHCA) for the multimodal fusion of EEG, electrooculography (EOG), and electrodermal activity (EDA). This attention module employs three independent, parallel attention computation units to assign independent attention weights to different feature subsets across modalities. Combined with a modality complementarity metric, the gating mechanism suppresses redundant heads and enhances the information transmission of key heads, and multi-head concatenation fuses the cross-modal interaction results obtained from different perspectives. For the backbone network, a multi-scale convolution and bidirectional long short-term memory network (MC-BiLSTM) is designed for feature extraction, tailored to the characteristics of each modality. Experiments show that this method, which primarily fuses eight-channel EEG with peripheral physiological signals, achieves an emotion recognition accuracy of 89.45%, a 7.68% improvement over single-modal EEG. In addition, in cross-subject experiments conducted on the SEED-IV dataset, the EEG+EOG modality achieved a classification accuracy of 92.73%. Both results were significantly better than the baseline methods, demonstrating the effectiveness of the proposed GMHCA module architecture and MC-BiLSTM feature extraction network for multimodal fusion. Through the novel attention gating mechanism, higher recognition accuracy is achieved while significantly reducing the number of EEG channels, providing new attention-based and gated-fusion ideas for multimodal emotion recognition in resource-constrained environments.

1. Introduction

Emotion, as a core component of human psychological activity, not only affects an individual’s mental health but may also induce various mental disorders through the neuroendocrine system. In social interaction scenarios, effective emotional expression significantly influences interpersonal relationships, emotional communication efficiency, and health management outcomes [1]. With the rapid development of brain–computer interface (BCI) technology [2,3,4], EEG [5], as a technique for capturing brain signals from the scalp surface, can utilize BCI to detect neural activity signals corresponding to different states. EEG has garnered widespread attention in emotion recognition due to its ability to reflect neural activity in the cerebral cortex in real time. In practical applications, EEG signal acquisition systems based on wearable devices typically consist of multi-electrode EEG caps and data processing terminals. The raw EEG data collected by this system undergoes preprocessing, feature extraction, and pattern recognition to obtain real-time emotional state information, which can then guide users in emotion regulation or control external devices through feedback mechanisms [6]. However, single-modal EEG signals have limitations in emotion recognition, including low signal-to-noise ratio, susceptibility to interference, significant individual differences, poor generalization, and challenges in temporal alignment for dynamic emotion capture. Moreover, due to the non-stationary and random nature of EEG signals, their acquisition process is often affected by various interferences, including eye movement, electromyographic (EMG) signals, environmental noise, and individual differences among subjects [7].
To overcome these challenges, recent studies have investigated fusion techniques for multimodal physiological signals. Traditional emotion recognition relying on a single physiological modality is highly susceptible to environmental noise, limiting its ability to reliably detect nuanced affective states. Empirical comparisons demonstrate that multimodal fusion approaches consistently outperform unimodal methods in recognition accuracy across diverse datasets. This superiority arises from two key factors: one being the complementary nature of multidimensional emotional features extracted from different modalities, and the other the ability of cross-modal interaction modeling to reveal latent correlation patterns, thereby improving model generalizability. Advanced research further indicates that multimodal fusion optimizes joint feature representations, which not only increases robustness against noise but also constructs a richer emotional feature space. This space combines intra-modal discriminative attributes with inter-modal relational features, resulting in substantially enhanced emotion classification performance [8]. The application field of multimodal emotion recognition is shown in Figure 1.
In response to the limitations observed in existing research, such as insufficient utilization of modal complementarity, the absence of dynamic weight adjustment mechanisms, and inadequate modeling of cross-modal interactions, this paper innovatively designs the GMHCA module, which quantifies the complementarity strength between modalities through gating units. It enables effective information fusion even when modality correlations are weak, overcoming the performance bottleneck of traditional methods in fusing weakly correlated modalities. Meanwhile, the introduced gating mechanism can dynamically adjust fusion weights according to feature differences between modalities, suppressing interference from redundant information while enhancing the contribution of key modalities, thus significantly improving the interpretability and effectiveness of the fusion process. In addition, the multi-head parallel attention mechanism captures complex inter-modal relationships from multiple feature subspaces, effectively addressing the information loss issue in simple fusion methods. When integrated with the MC-BiLSTM network, it significantly enhances the model’s generalization ability and robustness, laying a technical foundation for the development of portable emotion recognition systems for practical applications.
The main contributions of this paper are as follows:
  • To enhance the model’s robustness to scale variations in physiological signals, we propose a novel Multi-Scale Convolutional Bidirectional Temporal Network (MC-BiLSTM). This network features a flexible, multi-branch parallel architecture that can be adapted to task requirements by adjusting convolutional kernel sizes. It extracts multi-level features through these kernels and employs a cross-scale feature fusion mechanism to integrate global semantics with local details, thereby significantly improving emotion recognition performance.
  • To achieve efficient fusion of triple modalities, especially when they are weakly correlated, we innovatively design a Gated Multi-Head Cross-Attention (GMHCA) module. The network constructed with this module can dynamically constrain attention weights. By concatenating the dynamically gated fusion results of pairwise modalities with the original EEG features, it effectively leverages inter-modal relationships, leading to a substantial boost in emotion recognition accuracy.
  • Systematic experiments and ablation studies conducted on the DEAP dataset demonstrate that the proposed model achieves superior classification accuracy in subject-dependent tasks; furthermore, cross-subject generalization validation on the SEED-IV dataset confirms the model’s exceptional robustness and generalization capability.
The subsequent chapters of this paper are organized as follows. Section 2 introduces emotion recognition methods related to behavioral and neurophysiological representations; Section 3 elaborates on the proposed GMHCA-MCBILSTM network structure, including the design of the MC-BiLSTM module and the GMHCA module; Section 4 presents the experimental setup, results, and analysis on the DEAP and SEED-IV datasets; and Section 5 summarizes the paper and discusses future research directions.

2. Related Works

From the perspective of emotional representation, research can be divided into focusing on behavioral representation [9] or neurophysiological representation [10]. Behavioral representation mainly includes explicit expressions such as facial expressions, body movements, and speech tones, while neurophysiological representation involves intrinsic physiological indicators such as EEG, EOG [11], EDA [12], electromyogram (EMG) [13], and electrocardiogram (ECG) [14].

2.1. Behavioral Representation in Emotion Recognition

Emotional expression is accompanied by changes in external behavior; therefore, emotions can be indirectly recognized by analyzing certain bodily changes. Behavioral feature-based emotion recognition infers emotional states by analyzing multidimensional information such as individual behavior patterns, verbal expressions, and facial expressions, and typically covers body posture and movement recognition, facial expression analysis, and speech and text features. For body posture, Zhuang et al. [15] proposed a gait-based global graph convolutional contraction network (G-GCSN) for four types of emotions (happy, sad, neutral, and angry), which achieved excellent performance in emotion recognition tasks, with an accuracy of 81.5%. For facial expressions, Bougourzi et al. [16] proposed a recognition method based on fusing transformed deep and shallow features (FTDS), which achieved recognition accuracies of 98.27%, 89.65%, and 74.07% for six basic emotions (anger, disgust, fear, happiness, sadness, and surprise) on the public datasets CK+, CASIA, and MMI, respectively. For speech and text, Wu et al. [17] proposed a multimodal emotion recognition method that combines text and audio modalities; through rapid learning and feature fusion strategies, it significantly improved the accuracy and cross-linguistic applicability of emotion recognition in conversations, with a maximum improvement of 4.39% (F1 = 74.80%) on the IEMOCAP corpus, providing an important reference for subsequent research.
Behavioral representation has the advantages of high recognition accuracy and wide applicability in emotion recognition, but it also has obvious limitations, such as susceptibility to subjective consciousness control or behavioral disguise interference, low temporal resolution, and limited ability to capture instantaneous and subtle emotional changes. In contrast, neurophysiological characterization exhibits superior objectivity, real-time performance, and anti-interference ability; however, this method is still constrained by high equipment costs, complex signal interpretation, and relatively limited applicability in natural scenarios. Nevertheless, in rigorous scientific research and certain specific clinical diagnostic scenarios, neurophysiological representation of emotion recognition still demonstrates irreplaceable value.

2.2. Neurophysiological Representation in Emotion Recognition

In single-modal EEG emotion recognition research, the PANE team [18] proposed a strategy combining emotional lateralization and ensemble learning. Under four different channel sequences and combinations, time-domain, frequency-domain, and wavelet features of EEG signals were extracted, and the random forest (RF) method achieved the highest accuracy of 75.6% on the DEAP dataset. Yousefipour et al. [19] proposed a three-stage classification method in which the EEG processing stage adopts Multi-Class Common Spatial Pattern (MCCSP) technology; using Higuchi fractal dimension features on the DEAP dataset, they achieved a highest average accuracy of 89.38%, providing a reliable emotion detection method. Wu et al. [20] proposed a multi-source domain adaptation common branch network for EEG emotion recognition and introduced a novel sample mixing method. This approach incorporates target domain information by directionally mixing samples from the source and target domains without increasing the overall sample size, thereby enhancing the effectiveness of conditional distribution alignment in domain adaptation; the average accuracy on the SEED dataset is 90.27%.
Although some progress has been made in unimodal research, which has advantages such as small data volume, a simple acquisition and analysis process, and high recognition accuracy, its inherent shortcomings are gradually becoming apparent: the representational capacity of a single physiological signal is limited, making it difficult to fully capture the complexity of emotional changes. To overcome these limitations, researchers have proposed a series of multimodal fusion methods that attempt to integrate information from multiple physiological signals to improve emotion recognition accuracy [21,22,23,24]. Multimodal physiological signal fusion has become an important research trend, and related emotion recognition studies have become a hotspot. Current multimodal fusion methods mainly include decision-level, observation-level, and feature-level fusion. Feature-level fusion strategies first extract features independently for each modality and then integrate cross-modal features. Zheng et al. [25] proposed a deep autoencoder model based on EEG differential entropy features and five-dimensional EOG features; although the bimodal feature fusion achieves the best average accuracy of 85.11%, the generalization ability of the model in cross-modal scenarios still needs to be improved. Rayatdoost’s team [26] designed a cross-modal encoder that integrates EEG, EMG, and EOG features; although a high-dimensional physiological feature space was constructed, the recognition accuracy of the multimodal feature fusion did not meet expectations, reaching only 73.5%.
Decision-level fusion adopts a hierarchical processing mechanism, first constructing independent classification models for each modality and then integrating results through weighted voting or fuzzy integrals. Guarneros et al. [27] proposed a multimodal supervised domain adaptation method (MACDB) based on EEG and eye movement signals, which achieved cross-subject emotion recognition through a multi-level alignment strategy and consistent decision boundaries. The average accuracy on the SEED, SEED-IV, and SEED-V datasets reached 86.68%, 85.03%, and 86.48%, respectively. However, this method did not explore more complex network architectures, which may limit further performance improvement. Wu et al. [28] utilized video–odor patterns as stimulus materials to record EEG and EOG signals and proposed a hybrid fusion (HF) method combining a Transformer and joint training to enhance emotion recognition performance, achieving a classification accuracy of 89.50%. Sun et al. [29] proposed a multimodal emotion recognition model that integrates multi-scale feature representation and attention mechanisms; its feature extraction stage includes a multi-stream network module and a dual-scale attention module for extracting shallow EEG features. Using full-channel EEG, the recognition accuracy reached over 90%. Li et al. [30] proposed an uncertainty-aware graph contrastive fusion network (UAGCFNet) for multimodal physiological signal emotion recognition, its main innovation being the construction of an uncertainty-aware graph convolutional network (UAGCN). In the subject-dependent scenario, the model’s accuracy on the valence and arousal dimensions of the DEAP dataset reached 88.71% and 87.13%, respectively. Hang et al. [31] proposed two multimodal fusion methods based on EEG and facial expression detection, which improved the classification accuracy of EEG–expression fusion; however, due to the limitations of the network design, the system exhibits significant computational bottlenecks, with an accuracy of only 82.75%. Wang et al. [32] proposed a multimodal emotion recognition model (Att-1DCNN-GRU) based on the fusion of EEG and ECG signals; it combines one-dimensional convolutional neural networks with attention mechanisms and gated recurrent units, achieving an accuracy of over 90% and excellent performance. Li’s team [33] proposed a BiLSTM attention network that converts physiological signals into spectral features, trains independent classification models for EEG, ECG, and EDA, and uses an attention-weighted decision fusion mechanism to obtain the final recognition. Compared with end-to-end fusion methods, this cascaded decision fusion scheme preserves the independence of cross-modal discriminative information, but its classification accuracy is only 82.5%.
Research has shown that emotion recognition methods integrating multimodal data can effectively exploit cross-modal complementary information and enhance the model’s ability to represent emotional states through feature interaction. It can be considered that emotion recognition based on neurophysiological representations is generally superior to emotion recognition based on behavioral representations in specific fields. Feature-level multimodal data fusion significantly improves the model’s representation of emotional states through cross-modal feature interaction and complementary information integration. This article innovatively uses a cross-attention mechanism to achieve multimodal feature fusion, significantly improving the classification performance of the model and achieving excellent recognition accuracy. For objectivity and effectiveness, the performance of these methods is compared with that of the proposed approach in Section 4, with a corresponding table.

3. Network and Model

In recent years, researchers have constructed multimodal emotion recognition models by fusing peripheral physiological signals such as EOG and EDA with EEG to capture physiological changes caused by emotions in different body parts and inter-modal interactions. Researchers have successively proposed various innovative emotion recognition algorithms and models, achieving significant improvements in recognition robustness and generalization ability.

3.1. Algorithm Architecture and Theoretical Basis

The design philosophy of the GMHCA module originates from multimodal learning theory, particularly the concepts of Cooperative Representation Learning and Cross-Modal Alignment. Different modalities such as EEG (reflecting brain activity), EOG (eye movement behavior), and EDA (electrodermal activity) are complementary in the process of perceiving emotions; they capture emotion-induced physiological changes from different perspectives. Therefore, in the fusion stage, the unique discriminative features of each modality should be preserved, and a fused space with cooperative representation ability should be constructed. Furthermore, from the perspective of information theory, the complementarity between modalities can be quantitatively modeled through mutual information (MI) [34]. If the MI between two modalities is low, it indicates that they share less information and have low redundancy, and their integration can yield richer emotional representation features. The dynamic gating mechanism introduced in GMHCA is based on this principle. By calculating the difference index between modalities and combining attention features, it adaptively adjusts the contribution weights of each modality, highlighting the role of modalities with strong complementarity in the final classification, thereby improving the overall discriminative performance. Meanwhile, to address the issues of a single perspective and limited expressive power in traditional single-head attention in cross-modal modeling, GMHCA has designed a multi-head parallel mechanism, enabling the model to capture complex nonlinear modal interactions across multiple feature subspaces. Each attention head models the semantic dependencies between modalities from different perspectives, while the gating mechanism further controls the transmission of redundant information and enhances the modeling ability of key interaction channels. Based on this theoretical framework, the specific components of the proposed model are detailed below, including the MC-BiLSTM network (Section 3.2) and the GMHCA module (Section 3.3).
The overall framework of the algorithm is shown in Figure 2. Specifically, in view of the limitations of traditional single-scale convolutional networks, which have a fixed receptive field and struggle to capture local subtle changes simultaneously during feature extraction, this algorithm designs a hierarchical multi-scale convolutional module for multi-modal signals. It achieves multi-granularity feature fusion by stacking convolutional layers with different kernel sizes in parallel, and introduces a bidirectional LSTM network to model physiological signals with strong temporal dependencies such as EEG, EOG, and EDA, establishing long-term and short-term contextual associations in the temporal dimension. To address the inherent limitations of traditional single-head attention mechanisms in cross-modal learning tasks, such as having a single perspective and low parallel computing efficiency, this paper innovatively designs a gated multi-head cross-attention module. This module constructs multiple sets of independent and learnable mapping matrices, allowing each set of matrices to learn modal interaction patterns from different feature subspaces, ultimately generating a fused representation rich in interaction information. This significantly improves the computational speed and classification accuracy of multi-modal fusion.
The specific implementation process after integrating the MC-BiLSTM network with the GMHCA module is as follows. The MC-BiLSTM network uses EEG, EOG, and EDA signal datasets as input sequences. Through the multi-scale convolutional network, multi-scale features are simultaneously learned and fused. A cross-scale feature fusion mechanism captures both global semantic information and local detail features. The bidirectional LSTM layer learns sequential dependencies among features, adapting to time-series data classification. After being processed by a fully connected layer, the output is generated. The GMHCA module achieves the efficient feature fusion of three modalities through an innovative multi-head attention concatenation mechanism. The gating mechanism dynamically adjusts the attention weight distribution of each modality. Each attention head focuses on capturing cross-modal interaction patterns at different levels. The fused multimodal features first undergo nonlinear transformation through a fully connected layer, followed by regularization via a Dropout layer to mitigate overfitting. Finally, classification probability values are output through a fully connected layer with a Sigmoid activation function.
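As an overview of this pipeline, the following is a minimal, self-contained Keras sketch; the input shapes, layer sizes, and the simplified branch() and gated_cross_fuse() stand-ins are illustrative assumptions rather than the authors' implementation, with more detailed sketches of the two modules given after Section 3.2 and Section 3.3.

import tensorflow as tf
from tensorflow.keras import layers, Model

def branch(x, units=32):
    # Simplified stand-in for one MC-BiLSTM branch (a fuller sketch follows Section 3.2).
    x = layers.Conv1D(units, 3, padding="same", activation="relu")(x)
    return layers.Bidirectional(layers.LSTM(units))(x)

def gated_cross_fuse(q, kv):
    # Simplified stand-in for the GMHCA fusion (a fuller sketch follows Section 3.3):
    # cross-attention output blended with the query features through a sigmoid gate.
    attn = tf.squeeze(layers.Attention()([q[:, None, :], kv[:, None, :]]), axis=1)
    gate = layers.Dense(1, activation="sigmoid")(tf.abs(q - kv))
    return gate * attn + (1.0 - gate) * q

# Illustrative input shapes (time steps, channels); the real values depend on preprocessing.
eeg_in = layers.Input(shape=(512, 8))   # eight-channel EEG
eog_in = layers.Input(shape=(512, 2))   # EOG
eda_in = layers.Input(shape=(512, 1))   # EDA

f_eeg, f_eog, f_eda = branch(eeg_in), branch(eog_in), branch(eda_in)
fused = layers.Concatenate()([gated_cross_fuse(f_eeg, f_eog),   # EEG-EOG interaction
                              gated_cross_fuse(f_eeg, f_eda),   # EEG-EDA interaction
                              f_eeg])                           # untouched EEG features
x = layers.Dropout(0.5)(layers.Dense(64, activation="relu")(fused))
output = layers.Dense(1, activation="sigmoid")(x)               # binary valence/arousal

model = Model([eeg_in, eog_in, eda_in], output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])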

3.2. Design of the Multi-Scale Convolutional Bidirectional Temporal Network

Traditional single-branch networks adopt a single-scale architecture, implementing feature extraction through the linear stacking of convolutional layers, pooling layers, and fully connected layers. Although such networks have fewer parameters and are efficient to train, their fixed receptive fields make it difficult to capture multi-scale features, limiting classification performance. This paper uses a multi-scale convolutional neural network for feature extraction. Unlike single-branch networks, multi-scale networks adopt a multi-branch parallel structure, introducing convolutional kernels of different sizes at different levels to process input data. This design enables the network to simultaneously learn and fuse multi-scale features, exhibiting stronger robustness to scale variations in input data. The cross-scale feature fusion mechanism captures both global semantic information and local detail features, improving classification accuracy through feature fusion. Additionally, the multi-branch architecture provides greater design flexibility, allowing researchers to adjust convolutional kernel sizes in each branch to adapt to different classification tasks.
To fully model the complex bidirectional long-range dependencies in physiological signals and achieve optimal performance and stability with limited data scales, this paper adopts a Bidirectional Long Short-Term Memory network (BiLSTM) [35]. With its unique forward and backward propagation mechanism, the model can simultaneously integrate both historical and future contextual information, significantly enhancing its ability to characterize the dynamic evolution process of emotional states. Compared to parameter-heavy Transformer models, BiLSTM offers higher data efficiency and lower risk of overfitting, whereas relative to the simplified Gated Recurrent Unit (GRU), its more refined gating mechanism and independent cell state provide stronger capabilities for capturing long-term dependencies, making it more suitable for analyzing subtle temporal patterns such as those found in EEG signals. This architecture constructs a feature representation system that integrates bidirectional temporal dynamics by processing forward and backward temporal dependencies in parallel, thereby modeling historical states and future trends collaboratively and ultimately improving the completeness of feature extraction and prediction accuracy. Specifically, the forward LSTM layer captures temporal patterns from past to present, while the backward LSTM layer extracts temporal correlation features from future to present. Their synergy enhances the model’s ability to resolve complex temporal dependencies, demonstrating significant advantages in tasks requiring comprehensive context.
For the feature differences among EEG, EOG, and EDA signals, this paper designs targeted multi-scale convolutional schemes. EEG signals exhibit neural electrical activity with millisecond-level temporal precision, where discriminative features span multiple timescales—from short-term transient events (e.g., event-related potentials) to long-term slow cortical potentials. Similarly, EOG signals reflect eye movements characterized by rapid, discrete artifacts such as blinks and saccades, which possess distinctive temporal morphologies. The raw physiological signals contain multi-scale temporal patterns that are critical for emotion recognition. To effectively capture these multi-scale patterns, we designed a multi-scale convolutional module. Therefore, in designing the multi-scale convolutional module, Branch 1 employs 1 × 1 and 3 × 3 convolutional kernels to extract local detailed features and achieve cross-channel information fusion, Branch 2 gradually expands the receptive field while maintaining resolution through two stacked 3 × 3 convolutions to capture medium-range temporal dependencies, and Branch 3 further extends the receptive field using a 5 × 5 convolution to capture longer contextual information. Finally, the outputs of the different branches are fused via feature concatenation, forming a rich multi-scale representation that significantly enhances the model’s ability to characterize both transient activities and sustained patterns, as shown in Figure 3a,b. In contrast, as a low-frequency signal, EDA primarily reflects changes in skin conductance and is closely associated with sympathetic nervous activity. Its phasic and tonic components evolve slowly, requiring a larger receptive field to capture long-term trends. To this end, Branch 1 utilizes 3 × 3 and 5 × 5 convolutional kernels to capture both local fluctuations and medium-range variations, Branch 2 employs a 5 × 5 convolution to extract patterns at intermediate temporal scales, and Branch 3 adopts a 7 × 7 kernel to comprehensively cover slow long-term trends, thereby effectively characterizing skin conductance responses related to emotional arousal. This design enables the model to adaptively extract features aligned with the inherent properties of the signal, providing information-rich and highly discriminative multi-scale feature representations for subsequent temporal modeling with BiLSTM, as shown in Figure 3c.
To address the issues of gradient degradation and feature reuse in traditional deep networks, while adapting to the high noise, low signal-to-noise ratio, and individual variability of physiological signals, residual connections are introduced between branches to fuse the features generated by different scale branches during information transmission. The multi-scale features output by each branch are concatenated through a Concat operation in the channel dimension. This design not only increases the diversity of feature propagation paths in the network, enabling the model to learn richer signal features and enhance its ability to handle complex classification tasks, but also strengthens the network’s representation ability for data detail information and global structure through the transfer and reuse of low-level features to higher levels. In addition, it facilitates the subsequent fusion of the three modalities.
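The sketch below illustrates one such modality branch, with parallel convolution stacks, channel-wise concatenation, a residual shortcut, and a BiLSTM; treating the paper's k × k kernels as one-dimensional temporal kernels, and the filter and unit counts, are assumptions of this sketch.

from tensorflow.keras import layers

def mc_bilstm_branch(x, kernel_plan, filters=32, lstm_units=64):
    # kernel_plan gives one tuple of stacked kernel sizes per parallel branch, e.g.
    #   EEG/EOG: [(1, 3), (3, 3), (5,)]    EDA: [(3, 5), (5,), (7,)]
    branches = []
    for kernels in kernel_plan:
        b = x
        for k in kernels:                                  # stacked convolutions of one branch
            b = layers.Conv1D(filters, k, padding="same", activation="relu")(b)
        branches.append(b)

    multi_scale = layers.Concatenate()(branches)           # cross-scale feature concatenation

    # Residual-style reuse of low-level features: project the raw input to the same
    # channel width and add it to the concatenated multi-scale representation.
    shortcut = layers.Conv1D(filters * len(kernel_plan), 1, padding="same")(x)
    fused = layers.Add()([multi_scale, shortcut])

    # BiLSTM integrates forward and backward temporal context of the fused features.
    return layers.Bidirectional(layers.LSTM(lstm_units))(fused)

# Example usage:
# f_eeg = mc_bilstm_branch(eeg_input, [(1, 3), (3, 3), (5,)])
# f_eda = mc_bilstm_branch(eda_input, [(3, 5), (5,), (7,)])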

3.3. Design of the Gated Multi-Head Cross-Attention Module

The key challenge of multimodal fusion lies in how to effectively integrate heterogeneous modal data. Due to significant differences in feature/gradient distributions across modalities, traditional methods (such as parameter averaging) struggle to explore cross-modal interactions and complementary characteristics. Furthermore, for non-independent and identically distributed (non-IID) multimodal data, the original aggregation methods may lose key information or even introduce noise, thereby weakening the model’s depth of multimodal modeling and leading to a significant decrease in accuracy and generalization ability. To address these issues, this section proposes a gated multi-head cross-attention module that achieves effective integration of multimodal information through deep semantic fusion guided by an attention mechanism and an adaptive gating mechanism based on modal complementarity [36]. Specifically, in the global aggregation stage, by focusing on and weighting information flows between different modalities, the deep fusion of modal features is achieved. Compared to traditional flat-tiled or weighted averaging fusion methods, cross-attention can perform weighted recombination of one modality based on the features of another modality, highlighting key features and eliminating interference between modalities, thus preserving modal complementary information as completely as possible in a distributed environment. First, cross-modal attention weights are calculated for EOG and EEG to capture dependencies, and then the weighted features and original features are input into dynamic gating to achieve adaptive adjustment of the contribution of each modality. The same process is applied between EDA and EEG. Finally, by concatenating the fused features of EEG, EOG, and EDA, a complete multimodal representation is constructed, achieving fusion among the three modalities. The original EEG features are not subjected to complex transformations by the attention mechanism and dynamic gating, preserving the integrity of single-modality information, which may include detailed features that are easily filtered out during cross-modal processing. Combining the two allows the model to leverage optimized information from cross-modal interactions without losing the inherent characteristics of the original signals, enhancing the representation ability for complex information.
The structure of the GMHCA module is shown in Figure 4. The upper part of the figure illustrates the processing flow of basic cross-attention, where lines of different colors are used to depict the changes in attention and modal addition. The lower part presents the specific implementation method of the gated multi-head cross-attention module. The feature dimension is first divided into three subspaces (i.e., three heads), with each head independently learning different inter-modal interaction patterns. Through parallel multi-head computation, complex inter-modal relationships (e.g., nonlinear relationships, local and global dependencies) are captured, avoiding the limitations of single-head attention. Each attention head assigns independent attention weights to different feature subsets across modalities. Subsequently, combined with a modality complementarity metric (e.g., feature differences), the gating mechanism suppresses the role of redundant heads and strengthens the information transmission of key heads. Through multi-head concatenation, the cross-modal interaction results from different perspectives are comprehensively integrated.
The self-attention mechanism is shown in Figure 5. Mathematically, the core computation of the attention mechanism involves three key matrices: Query, Key, and Value. Its essence lies in calculating similarity through matrix operations between Query and Key, obtaining a weight distribution after scaling, and then performing weighted aggregation with the Value matrix to generate the output representation. This series of matrix operations constitutes the basic computational paradigm of the attention mechanism.
The multi-head cross-attention mechanism extracts input data features from different representation angles by computing multiple independent attention subspaces in parallel, thereby more comprehensively modeling the complex structure of the data. During computation, the mechanism first projects the input into multiple subspaces, calculates attention features for each subspace, and finally concatenates the output tensors of all subspaces to form comprehensive features with multidimensional representation capabilities. The calculation formula for the multi-head cross-attention mechanism is as follows:
CrossAttention(Q, K, V) = α · softmax(QK^T / √d) · V + β · Q
head_i = CrossAttention(Q_i, K_i, V_i)
where CrossAttention(·) is the cross-attention function; α and β are weight coefficients; K^T is the transpose of K; d is the feature dimension; head_i represents the output of the i-th attention head; and Q_i, K_i, and V_i are the Q, K, and V matrices of the i-th attention head, respectively. The scaled dot product of Q and K^T is normalized by the softmax function to obtain the attention weight matrix, which is multiplied by the value matrix V and then weighted and fused with the original query vector Q to generate the output of a single attention head, head_i. Finally, concatenating the results of all attention heads constructs the complete multi-head cross-attention representation. Considering the complementarity differences among multimodal physiological signals (EEG, EOG, and EDA), this paper introduces learnable gating units to quantify inter-modal complementarity strength. The implementation is as follows:
  • A two-layer fully connected network is used to learn the gating weights for each modality;
  • The Sigmoid activation function outputs gating coefficients in the range [0, 1];
  • A complementarity metric is calculated based on feature space differences.
Here, G represents the gating coefficient matrix, W and b are trainable parameters, and δ is the Sigmoid function.
In the design of the GMHCA module, the gating values are dynamically generated through a dedicated subnetwork based on inter-modal feature differences, serving to estimate the complementary strength between modalities. This design is inspired by mutual information theory; a larger feature difference f ( X i ) generally implies lower mutual information between modalities, indicating stronger potential complementarity. Instead of explicitly computing mutual information—which is often intractable in high-dimensional spaces—the model uses this measurable feature difference as the input to a compact neural network. Through a data-driven approach, it learns to map feature differences to fusion weights. This strategy not only circumvents the complexity of directly computing mutual information but also effectively leverages complementary relationships among multimodal physiological signals. The gating weights are calculated using the following formula:
g_i = δ(W_2 · ReLU(W_1 · f(X_i) + b_1) + b_2)
f(X_i) = |X_i − X_j|
where X_i represents the input modality feature; X_j represents the modality feature interacting with X_i; f(X_i) denotes the absolute difference between the two modalities in the feature space, quantifying their feature dissimilarity; W_1, W_2, b_1, and b_2 are learnable parameters of the gating subnetwork; and δ denotes the Sigmoid activation function, which constrains the gating value within the range [0, 1].
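As a concrete illustration, a minimal TensorFlow sketch of this gating subnetwork is given below; the hidden width (cf. the Gate-units hyperparameter in Section 4.1) and any layout beyond the formula itself are assumptions.

import tensorflow as tf
from tensorflow.keras import layers

def complementarity_gate(x_i, x_j, gate_units=16):
    # g_i = sigmoid(W2 · ReLU(W1 · |x_i - x_j| + b1) + b2), following the formula above.
    # x_i, x_j: feature vectors of the two interacting modalities, shape (batch, d).
    diff = tf.abs(x_i - x_j)                                     # f(X_i) = |X_i - X_j|
    hidden = layers.Dense(gate_units, activation="relu")(diff)   # W1, b1
    return layers.Dense(1, activation="sigmoid")(hidden)         # W2, b2 -> g_i in [0, 1]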
In dynamic fusion, A_i is the output of the multi-head cross-attention, and the gating value g_i is used to dynamically adjust the fusion ratio between the original features and the attention-enhanced features. The specific calculation formula is as follows:
F_i = g_i · A_i + (1 − g_i) · X_i
Taking EEG and EOG as an example, this design enables the model to dynamically adjust the fusion strategy based on the actual complementary strength between modalities. The calculation formula between these two modalities is as follows:
F_eeg-eog = g_eeg · A_eeg-eog + (1 − g_eeg) · X_eeg
The final fusion among EEG, EOG, and EDA modalities is achieved through concatenation. By integrating a mutual information-based gating mechanism with a multi-head cross-modal attention mechanism, GMHCA realizes a multimodal modeling approach that combines theoretical support with practical effectiveness, enhancing the model’s adaptability and expressiveness in multi-source physiological signals. The pseudo-code implementation of this module’s computational workflow and data fusion strategy is detailed in Appendix A.
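To make the fusion flow concrete, the sketch below assembles the multi-head cross-attention of the equation above and the gated fusion into two helper functions, reusing the complementarity_gate sketch given earlier. The head count, projection width, and α, β values mirror the hyperparameters reported in Section 4.1, but the exact layer layout is an assumption rather than the authors' implementation (see Appendix A for the original pseudo-code).

import tensorflow as tf
from tensorflow.keras import layers

def multi_head_cross_attention(q_feat, kv_feat, num_heads=3, d_model=16, alpha=1.0, beta=1.0):
    # Per head: alpha * softmax(Q K^T / sqrt(d)) V + beta * Q, then concatenate the heads.
    heads = []
    for _ in range(num_heads):
        q = layers.Dense(d_model)(q_feat)            # per-head query projection
        k = layers.Dense(d_model)(kv_feat)           # per-head key projection
        v = layers.Dense(d_model)(kv_feat)           # per-head value projection
        scores = tf.matmul(q[:, None, :], k[:, None, :], transpose_b=True) / (d_model ** 0.5)
        weights = tf.nn.softmax(scores, axis=-1)
        attended = tf.squeeze(tf.matmul(weights, v[:, None, :]), axis=1)
        heads.append(alpha * attended + beta * q)    # head_i
    return layers.Concatenate()(heads)               # multi-head concatenation

def gmhca_fuse(x_primary, x_other, num_heads=3, d_model=16):
    # F = g * A + (1 - g) * X, reusing complementarity_gate from the sketch above;
    # projecting both inputs to the fused width for shape compatibility is an
    # assumed implementation detail.
    attn = multi_head_cross_attention(x_primary, x_other, num_heads, d_model)
    x_p = layers.Dense(num_heads * d_model)(x_primary)
    x_o = layers.Dense(num_heads * d_model)(x_other)
    gate = complementarity_gate(x_p, x_o)
    return gate * attn + (1.0 - gate) * x_p

# Tri-modal fusion: concatenate the two gated pairs with the original EEG features, e.g.
# fused = layers.Concatenate()([gmhca_fuse(f_eeg, f_eog), gmhca_fuse(f_eeg, f_eda), f_eeg])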

4. Experimental Process and Result Analysis

This study employs the DEAP multimodal emotion dataset for experimental analysis, and the experiments adopt an intra-subject setting. The DEAP dataset is a publicly available multimodal database designed for emotion analysis, which recorded multi-channel physiological signals of 32 participants while they watched 40 one-minute music videos. These signals include 32-channel EEG, EOG, EMG, and galvanic skin response (GSR), sampled at 128 Hz (after downsampling). After each trial, participants rated their emotional state on a 1–9 scale across four dimensions: Valence, Arousal, Dominance, and Liking. In total, 32 × 40 = 1280 samples were generated. In this study, scores higher than 5 are classified as the “high” category and scores lower than 5 as the “low” category; based on this criterion, labeled data for the valence and arousal binary classification tasks were constructed.
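A minimal sketch of this label binarization, assuming the self-assessment ratings have already been loaded as an array (random values are used here only to keep the snippet self-contained):

import numpy as np

# `ratings` stands for the per-trial labels loaded from the dataset (1-9 scale).
ratings = np.random.uniform(1, 9, size=(1280, 4))    # Valence, Arousal, Dominance, Liking
valence_labels = (ratings[:, 0] > 5).astype(int)     # 1 = "high", 0 = "low"
arousal_labels = (ratings[:, 1] > 5).astype(int)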
The DEAP dataset also includes peripheral physiological signals such as EOG and EDA for multimodal emotion recognition. This section details the key parameter configurations of the network architecture, conducts experiments based on the DEAP multimodal physiological signal dataset, and provides a systematic analysis of the experimental results.

4.1. Experimental Setup

To explore high-performance recognition methods for low-channel, wearable systems, researchers have conducted a series of experiments. Based on the team’s prior work and neurophysiological evidence from the literature [37,38], brain regions closely associated with emotions are selected, including the frontal lobe (FP2, AF4, F8), temporal lobe (T8), parietal lobe (CP1, Pz), and occipital lobe (PO3, O1), as shown in Figure 6. The final eight-channel EEG combination is determined as CP1, Pz, PO3, O1, T8, F8, AF4, and FP2.
Due to the weak amplitude of EOG signals and their high susceptibility to environmental interference, this study adopts Discrete Wavelet Transform (DWT) to preprocess the original physiological signals, aiming to effectively eliminate motion artifacts and baseline drift. DWT is characterized by multiresolution analysis, which enables the decomposition of signals into sub-bands with different frequencies. This allows for the accurate separation of high-frequency noise (e.g., motion artifacts) and low-frequency interference (e.g., baseline drift). By applying threshold processing or removing the coefficients of corresponding sub-bands, the effective physiological features can be retained while noise is significantly suppressed, thereby improving signal quality. This provides a more reliable data foundation for subsequent emotional feature extraction and classification.
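A minimal PyWavelets sketch of such DWT-based denoising is given below; the wavelet family, decomposition level, and universal-threshold rule are assumptions, since the paper specifies only that DWT with thresholding or sub-band removal is used.

import numpy as np
import pywt

def dwt_denoise(signal, wavelet="db4", level=4):
    # Decompose, soft-threshold the detail bands (high-frequency artifacts), and zero the
    # coarsest approximation band (baseline drift), then reconstruct.
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745            # noise estimate from finest band
    thr = sigma * np.sqrt(2 * np.log(len(signal)))
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    coeffs[0] = np.zeros_like(coeffs[0])                      # remove slow baseline drift
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

# Example: cleaned_eog = dwt_denoise(raw_eog_channel)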
This study implements the proposed model in Python 3.8 using the TensorFlow 2.7 deep learning framework. To comprehensively evaluate model performance, the experiments adopt 10-fold cross-validation to ensure the reliability and generalizability of the statistical results. The original dataset is randomly and uniformly divided into 10 subsets (‘folds’) of similar size, and 10 rounds of training and testing are conducted. In each round, 90% of the data are used as the training set and the remaining 10% as the test set, and the performance metrics of the model on that round’s test set are calculated. Finally, the average of the performance indicators over the 10 rounds is taken as the final estimate of the model’s performance on unknown data. The specific experimental environment configuration is shown in Table 1.
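A sketch of this protocol with scikit-learn, where X, y, and build_model() are placeholders for the preprocessed data and the network of Section 3:

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=42)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = build_model()                                   # hypothetical network constructor
    model.fit(X[train_idx], y[train_idx], epochs=10, verbose=0)
    fold_scores.append(model.evaluate(X[test_idx], y[test_idx], verbose=0)[1])
print(f"mean accuracy over 10 folds: {np.mean(fold_scores):.4f}")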
In the network, the learning rate is scheduled with the ReduceLROnPlateau callback, which automatically reduces the learning rate when the validation loss stops decreasing: if the validation loss does not decrease within 3 epochs, the learning rate is multiplied by 0.1. D-model denotes the output dimension of each head in the multi-head attention mechanism and the dimension of the linear transformation, i.e., the input data are linearly transformed to a dimension of 16, and the output dimension of each head is also 16. Num-heads denotes the number of heads in the multi-head attention mechanism. In computing the attention output, alpha and beta are weight coefficients used to balance the attention results and the original query vector: alpha controls the weight of the attention results, and beta controls the weight of the original query vector. Gate-units denotes the number of neurons in the first fully connected layer of the gating network, which is used to calculate fusion weights between modalities; its value determines the complexity and expressive power of the gating network. The number of training epochs is 10, with early stopping enabled. Training is performed on an NVIDIA RTX 3060 GPU.
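The corresponding Keras callbacks could be configured as follows; the early-stopping patience is an assumed value, since only the presence of early stopping is stated:

from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # Multiply the learning rate by 0.1 if the validation loss has not decreased for 3 epochs.
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=3, verbose=1),
    # Early stopping with an assumed patience of 5 epochs.
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
]
# model.fit(..., epochs=10, validation_split=0.1, callbacks=callbacks)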

4.2. Experimental Results and Analysis

To evaluate the performance of the proposed gated multi-head cross-attention multimodal fusion network in emotion classification, PCA is applied to reduce the dimensionality of the fused multimodal features. A random sample from Subject 1 is selected for visualization, as shown in Figure 7, where purple and yellow represent different labels. It can be observed that positive valence samples (yellow) and negative valence samples (purple) form distinct clusters, indicating significantly improved discriminability between categories after multimodal fusion. The results demonstrate that the fused multimodal features, processed by the model, effectively enhance the discriminability of emotional features. The model better captures the inherent differences between emotion categories.
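A sketch of this visualization step, with fused_features and labels as placeholders for the model's fusion-layer outputs and the valence labels of one subject:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

z = PCA(n_components=2).fit_transform(fused_features)       # project fused features to 2-D
plt.scatter(z[:, 0], z[:, 1], c=labels, cmap="viridis", s=15)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.title("PCA of fused multimodal features")
plt.show()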
The confusion matrix for classifying random subjects using only EEG signals shows that correctly predicted samples account for 80.4%, as shown in Figure 8a. After multimodal fusion with the other two modalities using the proposed network, correctly predicted samples account for 91.88% of all samples, as shown in Figure 8b. This fully validates the effectiveness of the proposed gated multi-head cross-attention network in emotion recognition tasks.
On the basis of classification accuracy, this study further evaluates model performance using two additional key indicators: macro-average recall and F1 score. Recall is computed with respect to the ground-truth samples and reflects the model’s ability to identify positive samples, i.e., the proportion of true positives that are correctly predicted. The F1 score is a comprehensive indicator balancing precision and recall; with β = 1, it weighs the impact of false positives (FP) and false negatives (FN) equally, making it well suited to imbalanced data. The experiment uses macro-averaging to compute overall metrics for multiclass tasks, avoiding evaluation bias caused by class-size differences. After verifying the algorithm’s effectiveness, the average classification accuracy for different modality combinations is evaluated using data from the 32 subjects of the DEAP dataset. The experimental results are shown in Table 2.
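These metrics can be computed with scikit-learn as follows, where y_true and y_pred are placeholders for the test labels and the model's predictions:

from sklearn.metrics import accuracy_score, f1_score, recall_score

acc = accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")
macro_f1 = f1_score(y_true, y_pred, average="macro")   # beta = 1: FP and FN weighted equally
print(f"accuracy={acc:.4f}  macro-recall={macro_recall:.4f}  macro-F1={macro_f1:.4f}")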
From the data in the table, it can be seen that compared to single-modal EEG, the classification accuracy after fusing the three modalities reaches 89.45%, a 7.68% improvement over single-modal EEG. Among the three single modalities, EOG achieves higher classification accuracy than EDA, possibly because EDA reflects sympathetic nervous system arousal and is sensitive to emotions, making it susceptible to environmental factors such as temperature, humidity, individual skin characteristics, and motion artifacts. In the emotion classification task on the DEAP dataset, EOG signals exhibit higher stability, with changes more directly related to emotional states. Additionally, interference factors can be effectively controlled through preprocessing, enabling the model to more accurately capture emotion-related features.
For better visualization, the classification accuracy data for valence dimensions from the 32 subjects using single-modal EEG and multimodal fusion are plotted, as shown in Figure 9. It can be observed that for some subjects (e.g., Subject 11), the classification accuracy increases significantly after three-modal fusion, indicating that their EEG signals may not contain sufficient features, and EOG and EDA signals effectively complement the EEG signals.
To verify the effectiveness of the proposed method, systematic ablation experiments are designed to evaluate the contribution of each module. First, the multi-scale module is removed, and BiLSTM and GMHCA are used for three-modal fusion, achieving an accuracy of 85.38%. Next, the BiLSTM module is removed, and the multi-scale module and GMHCA are used for fusion, achieving an accuracy of 84.18%. When the gating mechanism in GMHCA is removed, the accuracy is 87.53%. When the multi-head mechanism in GMHCA is removed, the accuracy is 89.01%. The ablation experiment results are shown in Table 3. Since EEG, EOG, and EDA are all time-series signals, removing the BiLSTM module weakens the modeling of temporal dependencies, leading to a significant drop in accuracy. Removing the gating fusion mechanism decreases the accuracy by 1.92% relative to the full model’s 89.45%, and removing the multi-head attention mechanism decreases it by 0.44%, indicating that both mechanisms contribute to the fusion of features between different modalities, with the gating mechanism providing the larger gain.
This study further validated the multimodal fusion method on the SEED-IV dataset, this time in a cross-subject setting. The dataset comprises 4-class data (happy, sad, fearful, and neutral) from 15 subjects. For the SEED-IV dataset, we employed the same GMHCA-MCBiLSTM model architecture as used for the DEAP dataset to maintain methodological consistency and validate its generalizability. The only adaptation was at the input layer, to accommodate the corresponding modality dimensions of the SEED-IV data (which excludes EDA). All other hyperparameters (such as learning rate, batch size, and network layer structures) remained unchanged. The evaluation on the SEED-IV dataset employed the Leave-One-Subject-Out Cross-Validation (LOSO-CV) protocol, a standard approach for addressing individual differences and assessing model generalizability. Specifically, the dataset contains experimental sessions from 15 subjects. In each validation round, the data from one subject were used as the test set, while the data from the remaining 14 subjects constituted the training set. This process was repeated 15 times, ensuring that each subject was used exactly once as the test set; the final reported accuracy is the average ± standard deviation of the 15 test results (one per subject). Experimental results showed that the average classification accuracy of the single-modal EEG signals was 89.05%, while the multimodal method combining EEG and EOG signals improved the average accuracy to 92.73%, an improvement of 3.68 percentage points. The standard deviation of the EEG-only accuracy is 1.93%, while that of the EEG+EOG fusion accuracy is 1.80%, indicating the stability of the results. As shown in Table 4, the average classification accuracy across three experiments for all 15 subjects demonstrates that the multimodal fusion method significantly outperforms the single-modal approach, effectively verifying the validity of the proposed method.
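A sketch of the LOSO-CV protocol with scikit-learn, where X, y, subject_ids, and build_model() are placeholders for the SEED-IV features, labels, per-sample subject indices, and the network constructor:

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

logo = LeaveOneGroupOut()
subject_acc = []
for train_idx, test_idx in logo.split(X, y, groups=subject_ids):
    model = build_model()                                   # hypothetical network constructor
    model.fit(X[train_idx], y[train_idx], epochs=10, verbose=0)
    subject_acc.append(model.evaluate(X[test_idx], y[test_idx], verbose=0)[1])
print(f"LOSO accuracy: {np.mean(subject_acc):.4f} ± {np.std(subject_acc):.4f}")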
As a supplementary validation component for the model’s generalization ability, and in consideration of the original research design intent as well as practical page limitations, we randomly selected and presented the confusion matrix of one subject, which objectively reflects the general performance of the model on the SEED-IV dataset. Figure 10 shows the confusion matrix comparison of randomly selected participant 1 before and after multimodal fusion. The results demonstrate that the number of correctly identified samples increased by approximately 3% after multimodal fusion compared to the single-modal approach. Notably, the improvements were particularly evident for the happy, sad, and neutral emotional states. These findings confirm that the proposed network effectively enhances emotion recognition classification accuracy.
As shown in Table 5, the performance of the proposed method is compared with previous studies; traditional machine learning methods serve as a baseline performance reference. Chao et al. [39] used a capsule network for emotion recognition on EEG signals from the DEAP database, achieving an accuracy of 66.73% in the valence dimension. Tang et al. [40] achieved an accuracy of 83.82% by fusing multiple modalities. Wu et al. [41] extracted features using brain functional networks and fused EEG and EOG, achieving a binary classification accuracy of 86.61% in the valence dimension. Chen et al. [42] used SVM and KNN for multimodal classification, achieving an accuracy of 83.98%. Zhang Zhiwen et al. [43] adopted a feature-weight fusion method, achieving an accuracy of 80.19% with three modalities. Adrian et al. [44] proposed a low-complexity neural network model with low preprocessing requirements for emotion recognition by fusing multimodal physiological signals such as EEG, ECG, and EDA, achieving a classification accuracy of 86%. Zhao et al. [45] combined BCN feature extraction, an attention mechanism, and a bidirectional LSTM fusion network to achieve a recognition accuracy of 86.8% for multimodal (expression+EEG) emotion recognition. Gao et al. [46] proposed a novel multimodal decoupling knowledge distillation framework for multimodal emotion recognition, which achieved optimal performance on two benchmark datasets, with an accuracy of 65.84% on DEAP. Li et al. [47] proposed a novel framework for cross-subject emotion recognition, CT-ELCAN (Cross-modal Transformer with Enhanced Learning-Classifying Adversarial Network), whose accuracy in the valence dimension of the DEAP dataset reached 70.82%. Ma et al. [48] proposed a central-to-peripheral complementary fusion network (C2PCI Net) for multimodal physiological signal emotion recognition, with a classification accuracy of up to 77.33% on the DEAP dataset. Compared with these studies, this paper introduces a gating mechanism into cross-modal attention fusion: using eight-channel EEG, EOG, and EDA, it achieves an accuracy of 89.45% on the DEAP dataset, and using eight-channel EEG and EOG, it achieves an accuracy of 92.73% on the SEED-IV dataset. This not only validates the effectiveness of the proposed method but also provides key technical support for the engineering implementation of portable wearable emotion recognition devices.

4.3. Failure Case Analysis and Discussion

Although the proposed GMHCA-MCBiLSTM model performed well overall, we summarized several scenarios that may lead to model failures by analyzing misclassified samples:
  • Signal Quality Issues: We found that some misclassified samples were accompanied by significant physiological artifacts (e.g., motion artifacts). For example, intense head movements generate high-frequency noise in EEG and EOG signals, which the model misinterpreted as high-arousal emotional features, leading to incorrect judgments. This highlights the need for more robust artifact detection and removal modules in future research.
  • Individual Variability Challenges: Model performance decreased significantly in cross-subject tests. This indicates that the model’s generalization capability remains limited when processing data from new subjects not included in the training set. The primary reason may lie in the inherent differences in physiological baseline levels and response intensities among individuals, which the model failed to fully normalize.
  • Inter-Modal Conflicts: We observed that when one modality (e.g., EDA) was contaminated by external factors (e.g., temperature changes), the gated fusion mechanism could be misled, assigning excessive weight to the noisy modality and thereby overshadowing correct information from other modalities. This suggests that future fusion strategies should incorporate an evaluation of signal-to-noise ratios across modalities.
  • Uncertainty in Emotion Labels: Emotions are inherently subjective and continuous. The simplified binary classification labels may fail to accurately describe certain neutral or mixed emotional states, causing these borderline samples to become sources of classification errors.
These failure cases highlight directions for further model optimization, including developing more advanced personalized calibration techniques, designing noise-insensitive fusion mechanisms, and adopting more refined continuous emotion labels.

5. Conclusions

This study proposes an end-to-end multimodal emotion recognition network architecture, with core innovations including the following:
  • For the heterogeneity of EEG, EOG, and EDA signals (e.g., differences in frequency bands and temporal dynamics), a multi-scale convolutional and bidirectional LSTM fusion module (MC-BiLSTM) is designed to achieve collaborative extraction of spatiotemporal features;
  • The GMHCA module is introduced, optimizing cross-modal information fusion efficiency by computing inter-modal correlations and dynamically adjusting gating weights in parallel.
With a lightweight configuration using only eight-channel EEG+EOG+EDA, the model achieves a classification accuracy of 89.45%, significantly outperforming existing three-modal fusion solutions and confirming its application advantages in resource-constrained scenarios.
Despite achieving the expected goals, the following improvements are still needed:
  • Insufficient cross-subject generalization ability: Experiments reveal significant differences in optimal channel selection among individuals (e.g., Subject 11’s accuracy improves by 12% after three-modal fusion). Future work could adopt a meta-learning framework to construct a prior knowledge base based on brain functional connectivity topology, dynamically optimizing personalized channel configurations through few-shot learning.
  • Real-time optimization: The current model’s computational complexity may limit its efficiency in embedded deployment. Further research on model quantization and pruning strategies is needed.

Author Contributions

Innovation points, X.L., Y.L. (Yanbo Li), Y.L. (Yuhang Li) and Y.Y.; verification, X.L., Y.L. (Yanbo Li) and Y.L. (Yuhang Li); data organization, Y.L. (Yanbo Li) and Y.L. (Yuhang Li); writing—first draft preparation, X.L., Y.L. (Yanbo Li), and Y.L. (Yuhang Li); writing—review and editing, X.L., Y.L. (Yanbo Li) and Y.Y.; supervision, X.L. and Y.Y.; project management, X.L. and Y.Y.; funding acquisition, X.L. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 62174134 and Grant 52205577, in part by the Shaanxi Innovation Capability Support Project under Grant 2021TD-25, and in part by the Xi’an Key Industrial Chain Key Core Technology Research Projects under Grant 23LLRH0044.

Data Availability Statement

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Acknowledgments

The authors thank Li Yuhang for proposing ideas, Li Yanbo for preparing the initial draft, and Yang Yuan for supervision. The authors also thank the funding institutions and Pang Zhibo of the Intelligent Systems Department at KTH Royal Institute of Technology for his assistance with this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Pseudocode of the Proposed Tri-Modal Emotion Recognition Model

Algorithm A1: Tri-Modal Emotion Recognition Model with Cross-Modal Attention and Dynamic Gate Fusion
    Require: EEG, EOG, EDA signals; labels
    Ensure: Trained model and evaluation results
      1: Step 1: Data Loading and Preprocessing
      2: Load EEG, EOG, EDA, and label files
      3: Select relevant channels for each modality
      4: Replace missing or infinite values with valid numbers
      5: Convert labels into binary classes (valence ≥ 5 → positive, else negative)
      6: Step 2: Model Definition
      7: Define MultiHeadCrossAttention layer:
         Project features into Q, K, V
         Split into multiple heads, compute attention weights
         Fuse heads and obtain cross-modal representation
      8: Define DynamicGateFusion layer:
         Compute complementarity = |x − other|
         Apply gating network to generate gate value
         Fuse features: fusion = gate · attn_output + (1 − gate) · x
      9: Step 3: Modality-specific Feature Extraction
      10: EEG branch: Multi-scale convolution → BiLSTM
      11: EOG branch: Multi-scale convolution → BiLSTM
      12: EDA branch: Multi-scale convolution → BiLSTM
      13: Step 4: Cross-modal Fusion
      14: Compute cross-modal attention: EEG–EOG and EEG–EDA
      15: Apply dynamic gate fusion with EEG as primary modality
      16: Concatenate fused features with EEG features
      17: Fully connected layers + Dropout
      18: Output classification via Sigmoid activation
      19: Step 5: Training and Evaluation
      20: Split data into training, validation, and test sets
      21: Train model with early stopping
      22: Evaluate model: Accuracy, Recall, F1-score
      23: Save training history and results
      24: Plot confusion matrix
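As a complement to Algorithm A1, the sketch below shows, in PyTorch, one possible realization of the three building blocks the pseudocode names: a modality-specific multi-scale convolution and BiLSTM branch (step 3), the MultiHeadCrossAttention layer (step 7), and the DynamicGateFusion layer (step 8). The hyperparameters follow Table 1 (d_model = 16, three heads, eight gate units), while the kernel sizes, channel counts, per-head width (kept equal to d_model because 16 is not divisible by 3), and all class names are illustrative assumptions rather than the authors' released implementation.

import torch
import torch.nn as nn


class MultiScaleBiLSTMBranch(nn.Module):
    """Modality-specific feature extractor (Algorithm A1, step 3): parallel 1-D
    convolutions with different kernel sizes, concatenated and fed to a BiLSTM."""

    def __init__(self, in_channels, d_model=16, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(in_channels, d_model, k, padding=k // 2) for k in kernel_sizes]
        )
        self.bilstm = nn.LSTM(
            input_size=d_model * len(kernel_sizes),
            hidden_size=d_model // 2,   # bidirectional output is then d_model wide
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, x):
        # x: (batch, channels, time) raw multichannel signal
        feats = torch.cat([torch.relu(conv(x)) for conv in self.convs], dim=1)
        out, _ = self.bilstm(feats.transpose(1, 2))   # (batch, time, d_model)
        return out


class MultiHeadCrossAttention(nn.Module):
    """Cross-modal attention (step 7): queries from the primary modality (EEG),
    keys/values from an auxiliary modality; heads are concatenated and re-projected."""

    def __init__(self, d_model=16, num_heads=3):
        super().__init__()
        self.num_heads, self.d_head = num_heads, d_model
        self.q_proj = nn.Linear(d_model, num_heads * self.d_head)
        self.k_proj = nn.Linear(d_model, num_heads * self.d_head)
        self.v_proj = nn.Linear(d_model, num_heads * self.d_head)
        self.out_proj = nn.Linear(num_heads * self.d_head, d_model)

    def forward(self, x_q, x_kv):
        # x_q: (B, Tq, d_model), x_kv: (B, Tk, d_model)
        B, Tq, _ = x_q.shape
        Tk = x_kv.shape[1]
        q = self.q_proj(x_q).view(B, Tq, self.num_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x_kv).view(B, Tk, self.num_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x_kv).view(B, Tk, self.num_heads, self.d_head).transpose(1, 2)
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.d_head ** 0.5
        weights = torch.softmax(scores, dim=-1)        # attention over auxiliary time steps
        heads = torch.matmul(weights, v)               # (B, heads, Tq, d_head)
        heads = heads.transpose(1, 2).reshape(B, Tq, -1)  # concatenate heads
        return self.out_proj(heads)                    # (B, Tq, d_model)


class DynamicGateFusion(nn.Module):
    """Gated fusion (step 8): gate driven by the complementarity |x - other|;
    output is gate * attn_output + (1 - gate) * x, with x the primary feature."""

    def __init__(self, d_model=16, gate_units=8):
        super().__init__()
        self.gate_net = nn.Sequential(
            nn.Linear(d_model, gate_units), nn.ReLU(),
            nn.Linear(gate_units, 1), nn.Sigmoid(),
        )

    def forward(self, x, other, attn_output):
        # all inputs: (B, d_model) pooled modality features
        gate = self.gate_net(torch.abs(x - other))     # (B, 1), in [0, 1]
        return gate * attn_output + (1.0 - gate) * x


if __name__ == "__main__":
    B, T = 16, 128                                     # batch size 16 as in Table 1
    eeg = torch.randn(B, 8, T)                         # eight EEG channels
    eog = torch.randn(B, 2, T)                         # EOG channel count is an assumption
    eeg_feat = MultiScaleBiLSTMBranch(8)(eeg)          # (B, T, 16)
    eog_feat = MultiScaleBiLSTMBranch(2)(eog)
    attn_out = MultiHeadCrossAttention()(eeg_feat, eog_feat).mean(dim=1)  # pool over time
    fused = DynamicGateFusion()(eeg_feat.mean(dim=1), eog_feat.mean(dim=1), attn_out)
    print(fused.shape)                                 # torch.Size([16, 16])

Under these assumptions, gate values close to 1 let the cross-modal attention output dominate, while values close to 0 fall back to the primary EEG representation, mirroring the suppression of redundant information described for GMHCA.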

References

  1. Shu, L.; Xie, J.; Yang, M.; Li, Z.; Li, Z.; Liao, D.; Xu, X.; Yang, X. A review of emotion recognition using physiological signals. Sensors 2018, 18, 2074. [Google Scholar] [CrossRef]
  2. Kim, D.; Lee, J.; Woo, Y.; Jeong, J.; Kim, C.; Kim, D.K. Deep learning application to clinical decision support system in sleep stage classification. J. Pers. Med. 2022, 12, 136. [Google Scholar] [CrossRef]
  3. Wang, Z.; Wang, Y.; Hu, C.; Yin, Z.; Song, Y. Transformers for EEG-based emotion recognition: A hierarchical spatial information learning model. IEEE Sens. J. 2022, 22, 4359–4368. [Google Scholar] [CrossRef]
  4. Araújo, T.; Teixeira, J.P.; Rodrigues, P.M. Smart-data-driven system for Alzheimer disease detection through electroencephalographic signals. Bioengineering 2022, 9, 141. [Google Scholar] [CrossRef]
  5. Zheng, W.L.; Zhu, J.Y.; Lu, B.L. Identifying stable patterns over time for emotion recognition from EEG. IEEE Trans. Affect. Comput. 2017, 10, 417–429. [Google Scholar] [CrossRef]
  6. Zhang, H.; Zhao, X.; Wu, Z.; Sun, B.; Li, T. Motor imagery recognition with automatic EEG channel selection and deep learning. J. Neural Eng. 2021, 18, 016004. [Google Scholar] [CrossRef]
  7. Alarcao, S.M.; Fonseca, M.J. Emotions recognition using EEG signals: A survey. IEEE Trans. Affect. Comput. 2017, 10, 374–393. [Google Scholar] [CrossRef]
  8. Zhang, J.; Yin, Z.; Chen, P.; Nichele, S. Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review. Inf. Fusion 2020, 59, 103–126. [Google Scholar] [CrossRef]
  9. Sadowska, K.; Turnwald, M.; O’Neil, T.; Maust, D.T.; Gerlach, L.B. Reply to: Comment on “Behavioral Symptoms and Treatment Challenges for Patients Living With Dementia”. J. Am. Geriatr. Soc. 2025. [Google Scholar] [CrossRef] [PubMed]
  10. Rubin, M.; Cutillo, G.; Viti, V.; Margoni, M.; Preziosa, P.; Zanetta, C.; Bellini, A.; Moiola, L.; Fanelli, G.F.; Rocca, M.A.; et al. MOGAD-related epilepsy: A systematic characterization of age-dependent clinical, fluid, imaging and neurophysiological features. J. Neurol. 2025, 272, 508. [Google Scholar] [CrossRef]
  11. Huang, F.; Yang, C.; Weng, W.; Chen, Z.; Zhang, Z. CM-FusionNet: A cross-modal fusion fatigue detection method based on electroencephalogram and electrooculogram. Comput. Electr. Eng. 2025, 123, 110204. [Google Scholar] [CrossRef]
  12. Wang, S.; Guo, G.; Xu, S. Monitoring physical and mental activities with skin conductance. Nat. Electron. 2025, 8, 294–295. [Google Scholar] [CrossRef]
  13. Mayerl, C.J.; German, R.Z. Muscle Function and Electromyography:(almost) 70 years since Doty and Bosma (1956). J. Neurophysiol. 2025, 134, 337–346. [Google Scholar] [CrossRef] [PubMed]
  14. Kumar, G.; Varshney, N. Hybrid deep-CNN and Bi-LSTM model with attention mechanism for enhanced ECG-based heart disease diagnosis. Phys. Eng. Sci. Med. 2025, 1–11. [Google Scholar] [CrossRef] [PubMed]
  15. Zhuang, Y.; Lin, L.; Tong, R.; Liu, J.; Iwamot, Y.; Chen, Y.W. G-gcsn: Global graph convolution shrinkage network for emotion perception from gait. In Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  16. Bougourzi, F.; Dornaika, F.; Mokrani, K.; Taleb-Ahmed, A.; Ruichek, Y. Fusing Transformed Deep and Shallow features (FTDS) for image-based facial expression recognition. Expert Syst. Appl. 2020, 156, 113459. [Google Scholar] [CrossRef]
  17. Wu, Y.; Zhang, S.; Li, P. Multi-modal emotion recognition in conversation based on prompt learning with text-audio fusion features. Sci. Rep. 2025, 15, 8855. [Google Scholar] [CrossRef]
  18. Pane, E.S.; Wibawa, A.D.; Purnomo, M.H. Improving the accuracy of EEG emotion recognition by combining valence lateralization and ensemble learning with tuning parameters. Cogn. Process. 2019, 20, 405–417. [Google Scholar] [CrossRef]
  19. Yousefipour, B.; Rajabpour, V.; Abdoljabbari, H.; Sheykhivand, S.; Danishvar, S. An Ensemble Deep Learning Approach for EEG-Based Emotion Recognition Using Multi-Class CSP. Biomimetics 2024, 9, 761. [Google Scholar] [CrossRef]
  20. Wu, X.; Ju, X.; Dai, S.; Li, X.; Li, M. Multi-source domain adaptation for EEG emotion recognition based on inter-domain sample hybridization. Front. Hum. Neurosci. 2024, 18, 1464431. [Google Scholar] [CrossRef]
  21. Liu, Y.J.; Yu, M.; Zhao, G.; Song, J.; Ge, Y.; Shi, Y. Real-time movie-induced discrete emotion recognition from EEG signals. IEEE Trans. Affect. Comput. 2017, 9, 550–562. [Google Scholar] [CrossRef]
  22. Arya, R.; Singh, J.; Kumar, A. A survey of multidisciplinary domains contributing to affective computing. Comput. Sci. Rev. 2021, 40, 100399. [Google Scholar] [CrossRef]
  23. Montembeault, M.; Brando, E.; Charest, K.; Tremblay, A.; Roger, É.; Duquette, P.; Rouleau, I. Multimodal emotion perception in young and elderly patients with multiple sclerosis. Mult. Scler. Relat. Disord. 2022, 58, 103478. [Google Scholar] [CrossRef]
  24. Sousa, A.; d’Aquin, M.; Zarrouk, M.; Holloway, J. Person-Independent Multimodal Emotion Detection for Children with High-Functioning Autism. 2020. pp. 14–20. Available online: https://www.academia.edu/126627984/Person_Independent_Multimodal_Emotion_Detection_for_Children_with_High_Functioning_Autism (accessed on 1 September 2025).
  25. Zheng, W.L.; Liu, W.; Lu, Y.; Lu, B.L.; Cichocki, A. Emotionmeter: A multimodal framework for recognizing human emotions. IEEE Trans. Cybern. 2018, 49, 1110–1122. [Google Scholar] [CrossRef]
  26. Rayatdoost, S.; Rudrauf, D.; Soleymani, M. Expression-guided EEG representation learning for emotion recognition. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3222–3226. [Google Scholar] [CrossRef]
  27. Jiménez-Guarneros, M.; Fuentes-Pineda, G. Multi-modal supervised domain adaptation with a multi-level alignment strategy and consistent decision boundaries for cross-subject emotion recognition from EEG and eye movement signals. Knowl.-Based Syst. 2025, 315, 113238. [Google Scholar] [CrossRef]
  28. Wu, M.; Teng, W.; Fan, C.; Pei, S.; Li, P.; Pei, G.; Li, T.; Liang, W.; Lv, Z. Multimodal Emotion Recognition based on EEG and EOG Signals evoked by the Video-odor Stimuli. IEEE Trans. Neural Syst. Rehabil. Eng. 2024, 32, 3496–3505. [Google Scholar] [CrossRef]
  29. Sun, W.; Yan, X.; Su, Y.; Wang, G.; Zhang, Y. MSDSANet: Multimodal emotion recognition based on multi-stream network and dual-scale attention network feature representation. Sensors 2025, 25, 2029. [Google Scholar] [CrossRef] [PubMed]
  30. Li, G.; Chen, N.; Zhu, H.; Li, J.; Xu, Z.; Zhu, Z. Uncertainty-Aware Graph Contrastive Fusion Network for multimodal physiological signal emotion recognition. Neural Netw. 2025, 187, 107363. [Google Scholar] [CrossRef] [PubMed]
  31. Huang, Y.; Yang, J.; Liao, P.; Pan, J. Fusion of facial expressions and EEG for multimodal emotion recognition. Comput. Intell. Neurosci. 2017, 2017, 2107451. [Google Scholar] [CrossRef]
  32. Wang, Z.; Wang, Y. Emotion recognition based on multimodal physiological electrical signals. Front. Neurosci. 2025, 19, 1512799. [Google Scholar] [CrossRef]
  33. Li, C.; Bao, Z.; Li, L.; Zhao, Z. Exploring temporal representations by leveraging attention-based bidirectional LSTM-RNNs for multi-modal emotion recognition. Inf. Process. Manag. 2020, 57, 102185. [Google Scholar] [CrossRef]
  34. Zhang, J.; Zhu, L.; Kong, W.; Zhang, J.; Cao, J.; Cichocki, A. Reinforcement Learning Decoding Method of Multi-User EEG Shared Information Based on Mutual Information Mechanism. IEEE J. Biomed. Health Inform. 2025, 29, 6588–6598. [Google Scholar] [CrossRef]
  35. Redwan, U.G.; Zaman, T.; Mizan, H.B. Spatio-temporal CNN-BiLSTM dynamic approach to emotion recognition based on EEG signal. Comput. Biol. Med. 2025, 192, 110277. [Google Scholar] [CrossRef]
  36. Tang, X.; Qi, Y.; Zhang, J.; Liu, K.; Tian, Y.; Gao, X. Enhancing EEG and sEMG fusion decoding using a multi-scale parallel convolutional network with attention mechanism. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 32, 212–222. [Google Scholar] [CrossRef] [PubMed]
  37. Liu, X.; Li, T.; Tang, C.; Xu, T.; Chen, P.; Bezerianos, A.; Wang, H. Emotion recognition and dynamic functional connectivity analysis based on EEG. IEEE Access 2019, 7, 143293–143302. [Google Scholar] [CrossRef]
  38. Thiruselvam, S.; Reddy, M.R. Frontal EEG correlation based human emotion identification and classification. Phys. Eng. Sci. Med. 2024, 48, 121–132. [Google Scholar] [CrossRef]
  39. Chao, H.; Dong, L.; Liu, Y.; Lu, B. Emotion recognition from multiband EEG signals using CapsNet. Sensors 2019, 19, 2212. [Google Scholar] [CrossRef]
  40. Tang, H.; Liu, W.; Zheng, W.L.; Lu, B.L. Multimodal emotion recognition using deep neural networks. Neural Inf. Process. 2017, 10637, 811–819. [Google Scholar] [CrossRef]
  41. Wu, X.; Zheng, W.L.; Li, Z.; Lu, B.L. Investigating EEG-based functional connectivity patterns for multimodal emotion recognition. J. Neural Eng. 2022, 19, 016012. [Google Scholar] [CrossRef] [PubMed]
  42. Chen, J.; Hu, B.; Xu, L.; Moore, P.; Su, Y. Feature-level fusion of multimodal physiological signals for emotion recognition. In Proceedings of the 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Washington, DC, USA, 9–12 November 2015; pp. 395–399. [Google Scholar] [CrossRef]
  43. Zhang, Z.; Yu, N.; Bian, Y.; Yan, J. Research on emotion recognition methods based on multi-modal physiological signal feature fusion. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi J. Biomed. Eng. Shengwu Yixue Gongchengxue Zazhi 2025, 42, 17–23. [Google Scholar] [CrossRef]
  44. Rodriguez Aguiñaga, A.; Ramirez Ramirez, M.; Salgado Soto, M.d.C.; Quezada Cisnero, M.d.l.A. A Multimodal Low Complexity Neural Network Approach for Emotion Recognition. Hum. Behav. Emerg. Technol. 2024, 2024, 5581443. [Google Scholar] [CrossRef]
  45. Zhao, Y.; Chen, D. Expression EEG Multimodal Emotion Recognition Method Based on the Bidirectional LSTM and Attention Mechanism. Comput. Math. Methods Med. 2021, 2021, 9967592. [Google Scholar] [CrossRef]
  46. Gao, H.; Cai, Z.; Wang, X.; Wu, M.; Liu, C. Multimodal Fusion of Behavioral and Physiological Signals for Enhanced Emotion Recognition Via Feature Decoupling and Knowledge Transfer. IEEE J. Biomed. Health Inform. 2025, 1–11. [Google Scholar] [CrossRef] [PubMed]
  47. Li, P.; Li, A.; Li, X.; Lv, Z. Cross-Subject Emotion Recognition with CT-ELCAN: Leveraging Cross-Modal Transformer and Enhanced Learning-Classify Adversarial Network. Bioengineering 2025, 12, 528. [Google Scholar] [CrossRef] [PubMed]
  48. Ma, Z.; Li, A.; Tang, J.; Zhang, J.; Yin, Z. Multimodal emotion recognition by fusing complementary patterns from central to peripheral neurophysiological signals across feature domains. Eng. Appl. Artif. Intell. 2025, 143, 110004. [Google Scholar] [CrossRef]
Figure 1. Applications of emotion recognition in several fields, classified into categories.
Figure 2. Overall architecture of the multimodal fusion network (including MC-BiLSTM and GMHCA) for EEG emotion recognition.
Figure 3. Multi-scale convolutional network structures designed for the different modalities: (a) EEG network structure; (b) EOG network structure; (c) EDA network structure.
Figure 4. Schematic diagram of GMHCA structure including Q, K, V, and corresponding module names.
Figure 5. Schematic diagram of the cross-attention mechanism applied in the GMHCA module.
Figure 6. The eight EEG channels, selected based on the literature and the team's previous work, are circled in orange.
Figure 7. Multi-modal fusion scatter plot analysis of EEG, EOG, and EDA, with purple and yellow indicating dispersion.
Figure 8. Confusion matrices for different modality configurations on the DEAP dataset: (a) EEG single modality only; (b) multimodal fusion.
Figure 9. Classification accuracy histogram for multimodal fusion on the DEAP dataset, where blue represents the EEG-only modality and orange represents the result after three-modal fusion.
Figure 10. Confusion matrices for single-modality and multimodal fusion on the SEED-IV dataset: (a) EEG single modality only; (b) multimodal fusion.
Table 1. Main network parameters.
Parameter | Value
Optimizer | Adam
Batch size | 16
Loss function | Binary cross-entropy loss
Learning rate | ReduceLROnPlateau
Num-heads | 3
D-model | 16
Alpha | 0.7
Beta | 0.3
Gate-units | 8
Epoch | 10
GPU | NVIDIA RTX3060
The numbers do not include units.
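For completeness, one way the Table 1 settings could be wired into a training loop is sketched below (assuming a PyTorch implementation); the stand-in network, tensor shapes, and the ReduceLROnPlateau factor and patience values are illustrative assumptions, not the authors' code.

import torch
from torch import nn
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in network and data: the real GMHCA-MCBiLSTM model and the
# DEAP/SEED-IV loaders are not shown here, only the Table 1 training settings.
model = nn.Sequential(nn.Linear(48, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
x_train, y_train = torch.randn(128, 48), torch.randint(0, 2, (128, 1)).float()
x_val, y_val = torch.randn(32, 48), torch.randint(0, 2, (32, 1)).float()
train_loader = DataLoader(TensorDataset(x_train, y_train), batch_size=16, shuffle=True)

criterion = nn.BCELoss()                          # binary cross-entropy loss (Table 1)
optimizer = torch.optim.Adam(model.parameters())  # Adam optimizer (Table 1)
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=2)

for epoch in range(10):                           # 10 epochs, batch size 16 (Table 1)
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val).item()
    scheduler.step(val_loss)                      # reduce LR when validation loss plateaus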
Table 2. Modal ablation test results (%).
Modal | ACC | Recall | F1-Score
EEG | 81.77 | 79.41 | 80.86
EOG | 75.46 | 73.29 | 73.76
EDA | 71.51 | 70.37 | 70.79
EEG+EOG | 86.79 | 85.67 | 86.13
EEG+EDA | 82.57 | 80.77 | 81.61
EEG+EOG+EDA | 89.45 | 88.31 | 89.01
Table 3. Module ablation experimental results (%).
Ablation Method | Accuracy | Recall | F1-Score
Multi-scale | 85.38 | 84.90 | 85.37
BiLSTM | 84.18 | 83.67 | 84.79
Multi-head | 87.53 | 87.16 | 87.34
GMHCA | 89.01 | 88.03 | 88.37
Table 4. Multi-modal fusion results of SEED-IV dataset (%).
Subject | EEG | EEG+EOG | Subject | EEG | EEG+EOG
1 | 90.17 | 93.41 | 9 | 93.45 | 96.33
2 | 86.48 | 91.76 | 10 | 89.22 | 93.28
3 | 88.20 | 93.19 | 11 | 89.03 | 90.75
4 | 87.61 | 92.64 | 12 | 92.19 | 94.01
5 | 86.94 | 92.97 | 13 | 88.79 | 89.64
6 | 86.17 | 91.35 | 14 | 90.26 | 93.47
7 | 87.45 | 92.08 | 15 | 89.31 | 90.27
8 | 90.49 | 95.79 | Avg | 89.05 ± 1.93 | 92.73 ± 1.80
Table 5. Comparison of accuracy of different algorithms (%).
Method | Modal | Accuracy | EEG Channels
Chao et al. [39] | EEG+FACE | 66.73 | 32
Tang et al. [40] | EEG+EOG+EMG+EDA | 83.82 | 32
Wu et al. [41] | EEG+EOG | 86.61 | 32
Chen et al. [42] | EEG+EOG | 87.98 | 32
Zhang et al. [43] | EEG+EMG+EDA | 80.19 | 32
Adrian et al. [44] | EEG+ECG+EDA | 86 | 13
Zhao et al. [45] | EEG+Emotion | 86.8 | 32
Gao et al. [46] | EEG+ECG | 65.84 | 32
Li et al. [47] | EEG+EOG+EMG+EDA | 70.82 | 32
Ma et al. [48] | EEG+EMG+EDA | 77.33 | 32
Proposed method | EEG+EOG+EDA | 89.45 | 8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
