1. Introduction
With the rapid advancement of artificial intelligence, emotion recognition has emerged as a cornerstone of human–computer interaction (HCI), enabling machines to perceive human affective states and facilitate naturalistic responses [1]. While facial expressions [2] and speech signals [3] are widely utilized in current research, they are susceptible to subjective manipulation and deliberate masking. In contrast, physiological signals are directly regulated by the autonomic nervous system (ANS) and the peripheral nervous system (PNS), offering an objective and authentic reflection of an individual's psychophysiological state [4]. Furthermore, the evolution of wearable technology has enabled the real-time capture of subtle physiological variations, significantly propelling research into physiology-based emotion recognition [5].
To empower machines to infer human intentions precisely, a principled model for defining emotions is fundamental. Although discrete models enumerate a limited set of basic emotions, continuous dimensional approaches have gained prominence in affective computing for their ability to refine emotional states and capture nuanced affective dynamics. The most prevalent framework is Russell's Valence–Arousal (V-A) model, which maps emotions onto a two-dimensional plane where valence represents the degree of positivity and arousal signifies intensity. This model provides a quantitative foundation for fine-grained multi-class emotion representation [6].
Commonly used signals include electroencephalography (EEG), electrocardiography (ECG), galvanic skin response (GSR), photoplethysmography (PPG), and electromyography (EMG) [7]. Due to their non-invasive nature, EEG and PPG signals have been extensively utilized in emotion recognition tasks. As a direct modality for capturing cerebral electrical activity, EEG provides profound insights into cognitive and emotional states. PPG signals reflect cardiovascular activity regulated by the ANS, offering critical complementary information from a PNS perspective. Recent studies have demonstrated the efficacy of these modalities: Lu et al. [8] proposed a Convolution–Multilayer Perceptron Network (CMLP-Net) that effectively captures shared spatiotemporal features across EEG signal windows, achieving 98.65% valence classification accuracy on the DEAP dataset. Qiao et al. [9] developed a Generative Adversarial Network (GAN) integrated with an attention mechanism to augment brain cognitive maps, thereby enhancing recognition accuracy to 94.87% for quaternary classification on the SEED dataset. Furthermore, Han et al. [10] introduced an Emotional Multi-scale Convolutional Neural Network (EMCNN) that enriches the input data by integrating both time- and frequency-domain information from PPG signals, attaining a quaternary accuracy of 86.1% on the DEAP dataset.
However, given the inherent complexity of human affect, the information capacity of unimodal signals remains limited, making it challenging to comprehensively capture the nuanced variations of emotional states. Integrating diverse physiological modalities can enhance information complementarity and improve recognition robustness. Consequently, deep learning-based multimodal fusion has become the mainstream paradigm.
In contemporary research, various advanced frameworks have been proposed to address multimodal challenges. For instance, the DAIL model, introduced by Li et al. [11], utilizes domain adaptation techniques to account for inter-subject variability; when peripheral physiological signals from the DEAP dataset were employed for feature extraction, classification accuracies of 86.6% for valence and 84.4% for arousal were reported. Wang et al. [12] introduced Husformer, an end-to-end multimodal Transformer designed to learn interactive representations across four modalities (EEG, EMG, GSR, and eye blink) via cross-modal mechanisms, surpassing 90% accuracy in binary classification tasks. Furthermore, the modified Cross-Modal Transformer (MCMT) model, developed by Li et al. [13], addresses the issue of cross-modal inconsistency by designating EEG as the primary modality supplemented by various auxiliary signals, reaching an overall recognition precision of 92.88% on the DEAP dataset.
Despite these advancements in binary or ternary classification tasks, existing models often struggle to address the challenges of fine-grained emotion recognition effectively. The research bottlenecks are twofold: first, the inherent class imbalance and data scarcity of physiological datasets limit model generalization in high-dimensional classification; second, the exchange of deep features between EEG and PNS signals during fusion remains insufficient.
To overcome these challenges, a novel multimodal emotion recognition framework, namely the Self-Attention Wasserstein Generative Adversarial Network with Bidirectional Cross-Modal Attention (SAWGAN-BDCMA), is introduced in the present study. The primary contributions of this work are as follows:
To expand the dataset scale, a Self-Attention Wasserstein GAN (SAWGAN) is introduced to synthesize high-quality samples through an adaptive weighting strategy and a signal smoothing mechanism, effectively alleviating the class imbalance problem (a reference form of the underlying Wasserstein objective is given after this list).
To facilitate high-efficiency feature representation, a dual-branch architecture is employed to process heterogeneous physiological signals independently, fully mining intra-modal spatiotemporal representations.
To optimize the integration of deep physiological features, a Bidirectional Cross-Modal Attention (BDCMA) module is proposed to adaptively assign importance weights across modalities, enhancing information complementarity and fusion efficiency.
The classification performance of SAWGAN-BDCMA was validated using the DEAP and ECSMP datasets, achieving an improvement of up to 14.01% over state-of-the-art (SOTA) multimodal methods.
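For reference, generators of this family are typically trained against a Wasserstein critic; the gradient-penalty (WGAN-GP) form of the critic loss is shown below as a generic baseline, where \(\mathbb{P}_r\) and \(\mathbb{P}_g\) denote the real and generated signal distributions, \(\mathbb{P}_{\hat{x}}\) samples interpolates between them, and \(\lambda\) is the conventional penalty weight. SAWGAN's adaptive weighting and signal-smoothing terms are specific to this work and are detailed in Section 2; they are not reproduced here.

\[
\mathcal{L}_{D} \;=\; \mathbb{E}_{\tilde{x}\sim\mathbb{P}_{g}}\!\left[D(\tilde{x})\right] \;-\; \mathbb{E}_{x\sim\mathbb{P}_{r}}\!\left[D(x)\right] \;+\; \lambda\,\mathbb{E}_{\hat{x}\sim\mathbb{P}_{\hat{x}}}\!\left[\left(\left\lVert\nabla_{\hat{x}}D(\hat{x})\right\rVert_{2}-1\right)^{2}\right]
\]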
The remainder of this paper is organized as follows: Section 2 details the architectural design of the proposed SAWGAN-BDCMA framework and the underlying principles of its constitutive modules. Section 3 presents the experimental configuration and performance evaluations conducted on the DEAP and ECSMP datasets. Section 4 provides an in-depth discussion of the unimodal testing results and the analysis of cross-modal fusion weights. Finally, Section 5 concludes the paper and outlines potential directions for future research.
4. Discussion
To ensure high fidelity in the augmented data, a rigorous statistical and physiological evaluation of the SAWGAN-generated signals was conducted. The synthetic EEG and PPG signals exhibit minimal statistical deviation from the real distributions, with mean differences of 0.0267 and 0.0112, respectively, while small differences in correlation structure confirm that inter-channel dependencies are effectively preserved. Crucially, the physiological integrity of the generated PPG was validated via heart rate variability (HRV) analysis. The mean RR-interval difference was merely 0.0026 s, an error well below a single sampling period. Despite minor discrepancies in the standard deviation of normal-to-normal intervals (SDNN, 0.0247 s) and the root mean square of successive differences (RMSSD, 0.0419 s), the overall physiological profile remains highly consistent with the original signals. These results substantiate that the SAWGAN module synthesizes signals that are both mathematically and physiologically valid, effectively enhancing emotion recognition performance.
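For concreteness, the time-domain HRV statistics cited above follow their standard definitions; the NumPy sketch below computes them from detected pulse-peak times. The peak-detection step and array names are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

def hrv_metrics(peak_times_s: np.ndarray) -> dict:
    """Time-domain HRV statistics from pulse-peak times given in seconds."""
    rr = np.diff(peak_times_s)  # RR (inter-beat) intervals in seconds
    return {
        "mean_rr": float(rr.mean()),                         # mean RR interval
        "sdnn": float(rr.std(ddof=1)),                       # SD of normal-to-normal intervals
        "rmssd": float(np.sqrt(np.mean(np.diff(rr) ** 2))),  # RMS of successive differences
    }

# Illustrative comparison of real vs. generated PPG (hypothetical peak arrays):
# real, fake = hrv_metrics(real_peaks), hrv_metrics(generated_peaks)
# abs(real["mean_rr"] - fake["mean_rr"])  # corresponds to the 0.0026 s figure above
```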
Furthermore, the influence of various activation functions within the ViT-CBAM module was investigated. Comparative evaluations on the DEAP and ECSMP datasets demonstrated that models utilizing smooth, non-linear functions, such as GELU and ELU, consistently outperformed ReLU and LeakyReLU (as detailed in Table 7). Specifically, the GELU-based configuration yielded 3.67% and 1.48% performance increments over ReLU and ELU, respectively, in the DEAP quaternary task, while attaining performance parity with ELU in the ECSMP six-class task. This disparity likely stems from GELU's superior capability in capturing the stochastic and non-linear dynamics of physiological signals such as EEG. Moreover, GELU exhibits inherent architectural synergy with the Transformer backbone, facilitating the preservation of richer discriminative features during cross-modal fusion and ensuring robust recognition across complex affective states.
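As an illustration, the activation comparison reduces to a one-line change in a Transformer-style feed-forward block; the PyTorch sketch below uses hypothetical layer widths rather than the exact ViT-CBAM configuration.

```python
import torch.nn as nn

def feed_forward(dim: int = 256, hidden: int = 1024, act: str = "gelu") -> nn.Sequential:
    """Transformer feed-forward block with a configurable activation (hypothetical sizes)."""
    activations = {"relu": nn.ReLU(), "leaky_relu": nn.LeakyReLU(),
                   "elu": nn.ELU(), "gelu": nn.GELU()}
    return nn.Sequential(nn.Linear(dim, hidden), activations[act], nn.Linear(hidden, dim))
```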
Table 8 illustrates the performance disparity between single-modal signals and the proposed multimodal framework. Following identical preprocessing and feature extraction protocols, EEG consistently exhibits superior discriminative power over PPG, underscoring the heightened sensitivity of cortical neural activity to emotional fluctuations compared to peripheral physiological responses. However, integrating both modalities via SAWGAN-BDCMA yields a significant performance leap. While PPG shows lower standalone accuracy, it contributes vital complementary information from the ANS. These gains validate that the BDCMA mechanism effectively leverages the synergy between central and peripheral signals, capturing emotional nuances inaccessible to any single modality.
Figure 11 and Figure 12 illustrate the average attention weights assigned by the BDCMA model across datasets. On the DEAP dataset, EEG consistently receives higher weights (exceeding 0.6) for valence-related classification, reflecting its direct link to central nervous system activity, while PPG contributions rise in the arousal-related tasks (HAHV and LALV), confirming its efficacy in capturing autonomic cardiovascular dynamics. On the ECSMP dataset, this modality-specific synergy is further nuanced across granular emotion categories. EEG weights peak for 'happy' and 'anger', highlighting distinct cortical patterns in high-intensity states, whereas PPG weighting increases for 'fear' and 'sad', indicating stronger peripheral physiological involvement. These results validate that EEG serves as the primary informational pillar, while PPG provides critical complementary cues, allowing BDCMA to adaptively prioritize the most discriminative modality based on emotional context.
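The mechanism behind these weights can be summarized compactly: each modality attends over the other, and a learned gate produces the per-sample modality weights reported in Figures 11 and 12. The PyTorch sketch below is a generic two-way cross-attention with such a gate, offered under assumed dimensions and gating scheme rather than as the exact BDCMA implementation.

```python
import torch
import torch.nn as nn

class BidirectionalCrossModalAttention(nn.Module):
    """Generic two-way cross-attention: each modality attends to the other,
    then a learned gate weights the two enriched streams (assumed design)."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.eeg_to_ppg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ppg_to_eeg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, 2)  # yields the per-sample modality weights

    def forward(self, eeg: torch.Tensor, ppg: torch.Tensor) -> torch.Tensor:
        # eeg: (batch, T_eeg, dim); ppg: (batch, T_ppg, dim)
        eeg_enriched, _ = self.eeg_to_ppg(eeg, ppg, ppg)  # EEG queries attend to PPG
        ppg_enriched, _ = self.ppg_to_eeg(ppg, eeg, eeg)  # PPG queries attend to EEG
        eeg_vec, ppg_vec = eeg_enriched.mean(dim=1), ppg_enriched.mean(dim=1)
        w = torch.softmax(self.gate(torch.cat([eeg_vec, ppg_vec], dim=-1)), dim=-1)
        # w[:, 0] / w[:, 1] are the EEG / PPG weights, cf. Figures 11 and 12
        return w[:, :1] * eeg_vec + w[:, 1:] * ppg_vec    # fused (batch, dim) embedding
```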
To comprehensively assess the model's complexity, the parameters of each module were quantified, and both floating-point operations (FLOPs) and GPU memory usage were evaluated, as summarized in Table 9. The total number of parameters is approximately 2.23 M, with the SAWGAN generator and discriminator containing 0.71 M and 0.79 M parameters, respectively. The remaining components maintain parameter efficiency while jointly supporting feature extraction and multimodal fusion.
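Counts like those in Table 9 follow from the standard PyTorch idiom below; the module names in the usage comment are placeholders for the components of Table 9.

```python
def count_parameters_m(module) -> float:
    """Trainable parameter count of a torch.nn.Module, in millions."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad) / 1e6

# e.g. count_parameters_m(generator) -> ~0.71, count_parameters_m(discriminator) -> ~0.79
```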
5. Conclusions
The SAWGAN-BDCMA multimodal fusion model is developed to overcome the insufficient information exchange and suboptimal cross-modal fusion of conventional emotion recognition systems. SAWGAN-based augmentation first improved the balance and diversity of the raw EEG and PPG data, increasing robustness to physiological variability. The dual-branch extractor then captured time-frequency features from EEG and dynamic cardiovascular features from PPG. Finally, the BDCMA module performed deep bidirectional interaction and adaptive fusion, exploiting the complementarity between modalities. On the DEAP dataset, SAWGAN-BDCMA achieved 94.25% (binary) and 87.93% (quaternary) accuracy, while on the ECSMP dataset it reached a peak accuracy of 97.49% for six-class recognition. Furthermore, the analysis of attention weights across emotion tasks confirms the model's ability to adaptively weight each modality according to sample properties, which contributes to its robust recognition performance.
Future work will integrate additional physiological channels, such as EMG, GSR and ECG, to further enrich multimodal representations. In parallel, coupling lightweight neural architectures with wearable sensing hardware is expected to enable real-time multimodal emotion recognition, facilitating practical deployment of affective computing in intelligent interaction and health monitoring.