1. Introduction
With the rapid advancement of artificial intelligence, emotion recognition has emerged as a cornerstone of human–computer interaction (HCI), enabling machines to perceive human affective states and facilitate naturalistic responses [1]. While facial expressions [2] and speech signals [3] are widely utilized in current research, they are susceptible to subjective manipulation and deliberate masking. In contrast, physiological signals are directly regulated by the autonomic nervous system (ANS) and the peripheral nervous system (PNS), offering an objective and authentic reflection of an individual's psychophysiological state [4]. Furthermore, the evolution of wearable technology has enabled the real-time capture of subtle physiological variations, significantly propelling research into physiology-based emotion recognition [5].
To empower machines to infer human intentions precisely, a principled model for defining emotions is fundamental. Although discrete models enumerate a limited set of basic emotions, continuous dimensional approaches have gained prominence in affective computing for their ability to refine emotional states and capture nuanced affective dynamics. The most prevalent framework is Russell's Valence–Arousal (V-A) model, which maps emotions onto a two-dimensional plane where valence represents the degree of positivity and arousal signifies intensity. This model provides a quantitative foundation for fine-grained multi-class emotion representation [6].
Commonly used signals include electroencephalography (EEG), electrocardiography (ECG), galvanic skin response (GSR), photoplethysmography (PPG), and electromyography (EMG) [7]. Due to their non-invasive nature, EEG and PPG signals have been extensively utilized in emotion recognition tasks. As a direct modality for capturing cerebral electrical activity, EEG provides profound insights into cognitive and emotional states. PPG signals reflect cardiovascular activity regulated by the ANS, offering critical complementary information from a PNS perspective. Recent studies have demonstrated the efficacy of these modalities: Lu et al. [8] proposed a Convolution–Multilayer Perceptron Network (CMLP-Net) that effectively captures shared spatiotemporal features across EEG signal windows, achieving 98.65% valence classification accuracy on the DEAP dataset. Qiao et al. [9] developed a Generative Adversarial Network (GAN) integrated with an attention mechanism to augment brain cognitive maps, thereby enhancing recognition accuracy to 94.87% for quaternary classification on the SEED dataset. Furthermore, Han et al. [10] introduced an Emotional Multi-scale Convolutional Neural Network (EMCNN) that enriches the input data by integrating both time- and frequency-domain information from PPG signals, attaining a quaternary accuracy of 86.1% on the DEAP dataset.
However, given the inherent complexity of human affect, the information capacity of unimodal signals remains limited, making it challenging to comprehensively capture the nuanced variations of emotional states. Integrating diverse physiological modalities can enhance information complementarity and improve recognition robustness. Consequently, deep learning-based multimodal fusion has become the mainstream paradigm.
In contemporary research, various advanced frameworks have been proposed to address multimodal challenges. For instance, the DAIL model, introduced by Li et al. [11], utilizes domain adaptation techniques to account for inter-subject variability; when peripheral physiological signals from the DEAP dataset were employed for feature extraction, classification accuracies of 86.6% for valence and 84.4% for arousal were reported. Wang et al. [12] introduced Husformer, an end-to-end multimodal Transformer designed to learn interactive representations across four modalities (EEG, EMG, GSR, and eye blink) via cross-modal mechanisms, surpassing 90% accuracy in binary classification tasks. Furthermore, the modified Cross-Modal Transformer (MCMT) model, developed by Li et al. [13], addresses the issue of cross-modal inconsistency by designating EEG as the primary modality supplemented by various auxiliary signals, reaching an overall recognition precision of 92.88% on the DEAP dataset.
Despite these advancements in binary or ternary classification tasks, existing models often struggle to address the challenges of fine-grained emotion recognition effectively. The research bottlenecks are twofold: first, the inherent class imbalance and data scarcity of physiological datasets limit model generalization in high-dimensional classification; second, the exchange of deep features between EEG and PNS signals during fusion remains insufficient.
To overcome these challenges, a novel multimodal emotion recognition framework, namely the Self-Attention Wasserstein Generative Adversarial Network with Bidirectional Cross-Modal Attention (SAWGAN-BDCMA), is introduced in the present study. The primary contributions of this work are as follows:
To expand the dataset scale, a Self-Attention Wasserstein GAN (SAWGAN) is introduced to synthesize high-quality samples through an adaptive weighting strategy and a signal smoothing mechanism, effectively alleviating the class imbalance problem (a reference form of the underlying Wasserstein objective is given after this list).
To facilitate high-efficiency feature representation, a dual-branch architecture is employed to process heterogeneous physiological signals independently, fully mining intra-modal spatiotemporal representations.
To optimize the integration of deep physiological features, a Bidirectional Cross-Modal Attention (BDCMA) module is proposed to adaptively assign importance weights across modalities, enhancing information complementarity and fusion efficiency.
The classification performance of SAWGAN-BDCMA was validated using the DEAP and ECSMP datasets, achieving an improvement of up to 14.01% over state-of-the-art (SOTA) multimodal methods.
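For reference, generators of this family are typically trained against a Wasserstein critic; the gradient-penalty (WGAN-GP) form of the critic loss is shown below as a generic baseline, where \(\mathbb{P}_r\) and \(\mathbb{P}_g\) denote the real and generated signal distributions, \(\mathbb{P}_{\hat{x}}\) samples interpolates between them, and \(\lambda\) is the conventional penalty weight. SAWGAN's adaptive weighting and signal-smoothing terms are specific to this work and are detailed in Section 2; they are not reproduced here.

\[
\mathcal{L}_{D} \;=\; \mathbb{E}_{\tilde{x}\sim\mathbb{P}_{g}}\!\left[D(\tilde{x})\right] \;-\; \mathbb{E}_{x\sim\mathbb{P}_{r}}\!\left[D(x)\right] \;+\; \lambda\,\mathbb{E}_{\hat{x}\sim\mathbb{P}_{\hat{x}}}\!\left[\left(\left\lVert\nabla_{\hat{x}}D(\hat{x})\right\rVert_{2}-1\right)^{2}\right]
\]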
The remainder of this paper is organized as follows: Section 2 details the architectural design of the proposed SAWGAN-BDCMA framework and the underlying principles of its constitutive modules. Section 3 presents the experimental configuration and performance evaluations conducted on the DEAP and ECSMP datasets. Section 4 provides an in-depth discussion of the unimodal testing results and the analysis of cross-modal fusion weights. Finally, Section 5 concludes the paper and outlines potential directions for future research.
4. Discussion
To ensure high fidelity in the augmented data, a rigorous statistical and physiological evaluation of the SAWGAN-generated signals was conducted. The synthetic EEG and PPG signals exhibit minimal statistical deviation from the real distributions, with mean differences of 0.0267 and 0.0112, respectively, while small differences in correlation structure confirm that inter-channel dependencies are effectively preserved. Crucially, the physiological integrity of the generated PPG was validated via heart rate variability (HRV) analysis. The mean RR-interval difference was merely 0.0026 s, an error well below a single sampling period. Despite minor discrepancies in the standard deviation of normal-to-normal intervals (SDNN, 0.0247 s) and the root mean square of successive differences (RMSSD, 0.0419 s), the overall physiological profile remains highly consistent with the original signals. These results substantiate that the SAWGAN module synthesizes signals that are both mathematically and physiologically valid, effectively enhancing emotion recognition performance.
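For concreteness, the time-domain HRV statistics cited above follow their standard definitions; the NumPy sketch below computes them from detected pulse-peak times. The peak-detection step and array names are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

def hrv_metrics(peak_times_s: np.ndarray) -> dict:
    """Time-domain HRV statistics from pulse-peak times given in seconds."""
    rr = np.diff(peak_times_s)  # RR (inter-beat) intervals in seconds
    return {
        "mean_rr": float(rr.mean()),                         # mean RR interval
        "sdnn": float(rr.std(ddof=1)),                       # SD of normal-to-normal intervals
        "rmssd": float(np.sqrt(np.mean(np.diff(rr) ** 2))),  # RMS of successive differences
    }

# Illustrative comparison of real vs. generated PPG (hypothetical peak arrays):
# real, fake = hrv_metrics(real_peaks), hrv_metrics(generated_peaks)
# abs(real["mean_rr"] - fake["mean_rr"])  # corresponds to the 0.0026 s figure above
```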
Furthermore, the influence of various activation functions within the ViT-CBAM module was investigated. Comparative evaluations on the DEAP and ECSMP datasets demonstrated that models utilizing smooth, non-linear functions, such as GELU and ELU, consistently outperformed ReLU and LeakyReLU (as detailed in Table 7). Specifically, the GELU-based configuration yielded 3.67% and 1.48% performance increments over ReLU and ELU, respectively, in the DEAP quaternary task, while attaining performance parity with ELU in the ECSMP six-class task. This disparity likely stems from GELU's superior capability in capturing the stochastic and non-linear dynamics of physiological signals such as EEG. Moreover, GELU exhibits inherent architectural synergy with the Transformer backbone, facilitating the preservation of richer discriminative features during cross-modal fusion and ensuring robust recognition across complex affective states.
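As an illustration, the activation comparison reduces to a one-line change in a Transformer-style feed-forward block; the PyTorch sketch below uses hypothetical layer widths rather than the exact ViT-CBAM configuration.

```python
import torch.nn as nn

def feed_forward(dim: int = 256, hidden: int = 1024, act: str = "gelu") -> nn.Sequential:
    """Transformer feed-forward block with a configurable activation (hypothetical sizes)."""
    activations = {"relu": nn.ReLU(), "leaky_relu": nn.LeakyReLU(),
                   "elu": nn.ELU(), "gelu": nn.GELU()}
    return nn.Sequential(nn.Linear(dim, hidden), activations[act], nn.Linear(hidden, dim))
```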
Table 8 illustrates the performance disparity between single-modal signals and the proposed multimodal framework. Following identical preprocessing and feature extraction protocols, EEG consistently exhibits superior discriminative power over PPG, underscoring the heightened sensitivity of cortical neural activity to emotional fluctuations compared to peripheral physiological responses. However, integrating both modalities via SAWGAN-BDCMA yields a significant performance leap. While PPG shows lower standalone accuracy, it contributes vital complementary information from the ANS. These gains validate that the BDCMA mechanism effectively leverages the synergy between central and peripheral signals, capturing emotional nuances inaccessible to any single modality.
Figure 11 and Figure 12 illustrate the average attention weights assigned by the BDCMA model across datasets. On the DEAP dataset, EEG consistently receives higher weights (exceeding 0.6) for valence-related classification, reflecting its direct link to central nervous system activity, while PPG contributions rise in the arousal-related tasks (HAHV and LALV), confirming its efficacy in capturing autonomic cardiovascular dynamics. On the ECSMP dataset, this modality-specific synergy is further nuanced across granular emotion categories. EEG weights peak for 'happy' and 'anger', highlighting distinct cortical patterns in high-intensity states, whereas PPG weighting increases for 'fear' and 'sad', indicating stronger peripheral physiological involvement. These results validate that EEG serves as the primary informational pillar, while PPG provides critical complementary cues, allowing BDCMA to adaptively prioritize the most discriminative modality based on emotional context.
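The mechanism behind these weights can be summarized compactly: each modality attends over the other, and a learned gate produces the per-sample modality weights reported in Figures 11 and 12. The PyTorch sketch below is a generic two-way cross-attention with such a gate, offered under assumed dimensions and gating scheme rather than as the exact BDCMA implementation.

```python
import torch
import torch.nn as nn

class BidirectionalCrossModalAttention(nn.Module):
    """Generic two-way cross-attention: each modality attends to the other,
    then a learned gate weights the two enriched streams (assumed design)."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.eeg_to_ppg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ppg_to_eeg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, 2)  # yields the per-sample modality weights

    def forward(self, eeg: torch.Tensor, ppg: torch.Tensor) -> torch.Tensor:
        # eeg: (batch, T_eeg, dim); ppg: (batch, T_ppg, dim)
        eeg_enriched, _ = self.eeg_to_ppg(eeg, ppg, ppg)  # EEG queries attend to PPG
        ppg_enriched, _ = self.ppg_to_eeg(ppg, eeg, eeg)  # PPG queries attend to EEG
        eeg_vec, ppg_vec = eeg_enriched.mean(dim=1), ppg_enriched.mean(dim=1)
        w = torch.softmax(self.gate(torch.cat([eeg_vec, ppg_vec], dim=-1)), dim=-1)
        # w[:, 0] / w[:, 1] are the EEG / PPG weights, cf. Figures 11 and 12
        return w[:, :1] * eeg_vec + w[:, 1:] * ppg_vec    # fused (batch, dim) embedding
```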
To comprehensively assess the model's complexity, the parameters of each module were quantified, and both floating-point operations (FLOPs) and GPU memory usage were evaluated, as summarized in Table 9. The total number of parameters is approximately 2.23 M, with the SAWGAN generator and discriminator containing 0.71 M and 0.79 M parameters, respectively. The remaining components maintain parameter efficiency while jointly supporting feature extraction and multimodal fusion.
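Counts like those in Table 9 follow from the standard PyTorch idiom below; the module names in the usage comment are placeholders for the components of Table 9.

```python
def count_parameters_m(module) -> float:
    """Trainable parameter count of a torch.nn.Module, in millions."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad) / 1e6

# e.g. count_parameters_m(generator) -> ~0.71, count_parameters_m(discriminator) -> ~0.79
```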
5. Conclusions
The SAWGAN-BDCMA multimodal fusion model is developed to overcome the insufficient information exchange and suboptimal cross-modal fusion of conventional emotion recognition systems. SAWGAN-based augmentation first improved the balance and diversity of the raw EEG and PPG data, increasing robustness to physiological variability. The dual-branch extractor then captured time-frequency features from EEG and dynamic cardiovascular features from PPG. Finally, the BDCMA module performed deep bidirectional interaction and adaptive fusion, exploiting the complementarity between modalities. On the DEAP dataset, SAWGAN-BDCMA achieved 94.25% (binary) and 87.93% (quaternary) accuracy, while on the ECSMP dataset it reached a peak accuracy of 97.49% for six-class recognition. Furthermore, the analysis of attention weights across emotion tasks confirms the model's ability to adaptively weight each modality according to sample properties, which contributes to its robust recognition performance.
Future work will integrate additional physiological channels, such as EMG, GSR and ECG, to further enrich multimodal representations. In parallel, coupling lightweight neural architectures with wearable sensing hardware is expected to enable real-time multimodal emotion recognition, facilitating practical deployment of affective computing in intelligent interaction and health monitoring.