1. Introduction
Emotion is a psycho-physiological process triggered by conscious or unconscious perceptions of objects or situations and is often linked with feelings, temperament, personality, and motivation [1]. Emotion recognition has emerged as a critical research area in human–computer interaction and affective computing, with broad application prospects in fields such as psychotherapy and education [2,3,4,5,6,7,8,9]. In recent years, emotion recognition based on electroencephalogram (EEG) signals has gained significant attention, as EEG provides physiological data that directly reflect human emotional states and offers advantages such as resistance to external interference and high spatiotemporal resolution [10,11].
Prior to the widespread adoption of deep learning in EEG research, several automatic or semi-automatic EEG recognition methods had been proposed [12,13,14,15,16]. With the rise of deep learning techniques, a growing number of scholars have applied these methods to emotion recognition, surpassing the performance of traditional approaches and achieving results previously unattainable. For example, Tripathi et al. [17] used Convolutional Neural Networks (CNNs) to approximate functions of a large number of typically unknown inputs and classify users' emotions from EEG data in the DEAP dataset. Their approach outperformed a Support Vector Machine (SVM) binary classification model [3] by 4.51% and 4.96% on the valence and arousal dimensions, respectively, and surpassed emotion classification methods based on Bayesian classifiers and supervised learning [18] by 13.39% and 6.58%. These findings demonstrate that neural networks can serve as powerful classifiers for brain signals, surpassing traditional learning techniques. In the same year, Al-Nafjan et al. [19] applied Deep Neural Networks (DNNs) to recognize user emotions from power spectral density (PSD) and frontal asymmetry features in EEG signals, achieving an average accuracy of 82.0% for binary classification on the DEAP dataset. This highlights the significant performance gains DNNs offer for emotion recognition, especially when large training datasets are available.
Attention mechanisms play an indispensable role in human perception, particularly in filtering, integrating, and interpreting information [20]. In traditional deep learning models, feature fusion is often achieved through simple operations such as weighted averaging or concatenation, which may not fully capture the differences in importance among various features. The integration of attention mechanisms allows models to learn the relevance of features automatically and adjust their weights based on importance, resulting in more effective feature fusion. Liu et al. [21] introduced the 3DCANN model, which incorporated an EEG channel attention learning module to extract discriminative features from continuous multi-channel EEG signals, highlighting the variability of EEG signals across different emotional states. Tao et al. [22] proposed the ACRNN model, which integrated self-attention mechanisms into Recurrent Neural Networks (RNNs) to focus on the temporal information in EEG signals, adaptively assigning weights to different channels via a channel-wise attention mechanism. Zhang et al. [23] proposed a novel two-step spatial-temporal emotion recognition framework, combining local and global temporal self-attention networks to improve recognition performance, and introduced a new emotion localization task to identify segments with stronger emotional signals. Building on the development of attention mechanisms, Liu et al. [24] explored Transformer-based multi-head attention, proposing four variant transformer frameworks to investigate the relationship between emotions and EEG spatial-temporal features, thus demonstrating the importance of modeling spatial-temporal feature correlations for emotion recognition. These studies clearly indicate that incorporating attention mechanisms can substantially enhance recognition performance.
In this paper, we propose a novel hybrid model for EEG-based emotion recognition, named ECA-ResDNN. The main contributions of this work are as follows:
To address challenges such as noise, artifacts, discontinuities, drift, and distortion in EEG signal data, we introduce a novel preprocessing method that integrates Generative Adversarial Networks (GANs) and fuzzy set theory. This approach enhances the clarity and stability of EEG signals, improving the accuracy and robustness of the algorithm.
A novel Deep Neural Network is employed for EEG-based emotion recognition, leveraging its ability to capture intricate features within EEG signals. Additionally, an attention mechanism is incorporated to enhance the model’s sensitivity and ability to differentiate emotional information. This combination enables the model to better interpret and represent the emotional content of EEG signals, leading to more accurate and reliable emotion recognition.
To further enhance the robustness and accuracy of the classification model in handling uncertainty and noisy data, we propose a hybrid loss function that integrates cross-entropy loss with fuzzy set loss. This approach aims to combine the efficiency of cross-entropy loss in classification tasks with the advantages of fuzzy set loss in dealing with noise and uncertainty.
Comparative experiments were conducted against classical models, including CNNs, CNN-GRU, CNN-LSTM, and DNNs, as well as state-of-the-art methods such as SPD + SVM [25] and GLFANet [26]. The results demonstrate that ECA-ResDNN achieves superior accuracy and robustness compared to existing emotion recognition models. These findings validate the effectiveness of the proposed hybrid model in enhancing classification performance.
2. Materials and Methods
2.1. DEAP Dataset
The DEAP dataset was created by Koelstra et al. [1] from Queen Mary University of London in collaboration with other institutions. EEG signals were recorded using the ActiveTwo system (Biosemi B.V., Amsterdam, The Netherlands), with 32 active AgCl electrodes placed according to the international 10–20 system and a sampling rate of 512 Hz [27]. In addition to EEG, peripheral physiological signals were also recorded, including EOG (electrooculography, with four facial electrodes capturing eye movement signals) and EMG (electromyography, with four electrodes placed on the zygomaticus major and trapezius muscles to capture muscle activity signals). Furthermore, physiological sensors were placed on the left hand, measuring pulse oximetry, temperature, and galvanic skin response (GSR).
A total of 32 participants (16 males and 16 females, aged between 19 and 37 years, with a mean age of 26.9 years) were recruited. All participants were in good physical and mental health, had no history of neurological or psychiatric disorders, and were all right-handed. Prior to the experiment, participants were required to read and acknowledge the experimental instructions and procedures. The instructions included guidelines on minimizing movement artifacts and emotional tension, which could introduce noise into EEG recordings. During the experiment, each participant watched 40 one-minute video clips sequentially while their EEG data were recorded in real-time. The data collection process for the DEAP dataset is illustrated in Figure 1.
In each experimental session, the current experiment number was displayed to inform participants of their progress. This was followed by a five-second baseline recording to capture initial brain activity. Subsequently, a one-minute music video was presented, forming the core of the experimental procedure. After the video, participants were asked to self-assess their arousal, valence, and other emotional states, with their responses reflecting various affective conditions. To minimize fatigue and ensure data accuracy, participants took short breaks every 20 trials, during which experimental equipment and electrode placements were checked and adjusted if necessary.
2.2. Data Augmentation
The Wasserstein Generative Adversarial Network (WGAN) [28] is an advanced architecture within the framework of Generative Adversarial Networks (GANs) that introduces the Wasserstein distance as a new metric to measure the discrepancy between generated and real samples. Unlike the Jensen–Shannon (JS) divergence [29] commonly used in traditional GANs, the Wasserstein distance has been shown to perform more effectively, particularly in the field of emotion recognition [30]. As illustrated in Figure 2, Q and P represent the probability distributions of the generated and real samples, respectively, and the arrow represents the transformation of Q towards P under the Wasserstein distance framework. To transform Q into P, one can imagine using a bulldozer to move the "dirt" (i.e., probability mass) within Q, gradually reshaping it to match P. The average shortest distance the bulldozer must travel during this process is defined as the Wasserstein distance, mathematically expressed as follows:

$$W(P, Q) = \inf_{\gamma \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \gamma} \left[ \lVert x - y \rVert \right] \quad (1)$$

In Equation (1), Π(P, Q) represents the set of all possible joint distributions γ between the real distribution P and the generated distribution Q, ‖x − y‖ denotes the distance between a real sample x and a generated sample y, and the expectation is the expected value of this distance over pairs of samples drawn from γ.
During the training process, two distinct loss functions are utilized, one for the discriminator and one for the generator, as follows:

$$L_D = \mathbb{E}_{\tilde{x} \sim P_g} \left[ D(\tilde{x}) \right] - \mathbb{E}_{x \sim P_r} \left[ D(x) \right] \quad (2)$$

$$L_G = -\mathbb{E}_{\tilde{x} \sim P_g} \left[ D(\tilde{x}) \right] \quad (3)$$

In Equations (2) and (3), D(·) denotes the discriminator (critic) output, P_r is the real data distribution, and P_g is the distribution of generated samples.
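As a minimal sketch of how Equations (2) and (3) translate into code, assuming the standard WGAN formulation of [28] (the paper does not specify how the Lipschitz constraint is enforced, e.g., weight clipping versus a gradient penalty), the two losses can be written as:

```python
import tensorflow as tf

def critic_loss(real_scores: tf.Tensor, fake_scores: tf.Tensor) -> tf.Tensor:
    # Equation (2): the critic widens the score gap between real and
    # generated samples by minimizing E[D(fake)] - E[D(real)].
    return tf.reduce_mean(fake_scores) - tf.reduce_mean(real_scores)

def generator_loss(fake_scores: tf.Tensor) -> tf.Tensor:
    # Equation (3): the generator raises the critic's score on generated
    # samples, i.e., minimizes -E[D(fake)].
    return -tf.reduce_mean(fake_scores)
```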
2.3. Data Preprocessing
In this study, EEG signals from 32 channels of the DEAP dataset are initially selected. After configuring the electrode layout and reference signal, a 50 Hz notch filter and a 4–45 Hz band-pass filter are applied to remove noise and highlight the frequency components of interest, thereby enhancing the signal quality. To further improve signal purity, Independent Component Analysis (ICA) and wavelet transform are applied to remove ocular and muscle artifacts. Following this, based on event information extracted from the stimulus channel and a specified time window, the raw EEG data are segmented into a series of epochs with a window size of 2 s and a step size of 0.125 s.
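The paper does not name a specific toolbox for this pipeline; as a hedged illustration, the same steps can be sketched with the MNE-Python library (the file name, ICA component count, and component-selection step are assumptions):

```python
import mne

# Hypothetical path; DEAP distributes one BioSemi .bdf recording per participant.
raw = mne.io.read_raw_bdf("s01.bdf", preload=True)
raw.pick_types(eeg=True)  # keep the 32 EEG channels

# 4-45 Hz band-pass and 50 Hz notch, matching the filters described above.
raw.filter(l_freq=4.0, h_freq=45.0)
raw.notch_filter(freqs=50.0)

# ICA for ocular/muscle artifact removal (identifying which components
# to exclude is study-specific and omitted here).
ica = mne.preprocessing.ICA(n_components=20, random_state=0)
ica.fit(raw)
raw = ica.apply(raw)

# Fixed-length epochs: 2 s windows with a 0.125 s step,
# i.e., 1.875 s of overlap between consecutive windows.
events = mne.make_fixed_length_events(raw, duration=2.0, overlap=1.875)
epochs = mne.Epochs(raw, events, tmin=0.0, tmax=2.0, baseline=None, preload=True)
```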
Next, the Fuzzy C-Means (FCM) [31] algorithm is employed to cluster the epoch data, remove noise components, and downsample the signals to a target sampling frequency of 128 Hz. Unlike traditional filtering methods that rely on predefined basis functions, FCM utilizes soft clustering to assign membership probabilities to each data point, enabling a more flexible and adaptive noise removal process. This characteristic is particularly advantageous for EEG signals, which exhibit high variability and overlapping frequency components. The processed EEG signal spans 63 s and consists of 3 s of transition time between video segments and 60 s of actual video stimulus presentation. The objective function of FCM is defined as follows:

$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m} \, \lVert x_i - c_j \rVert^{2} \quad (4)$$

In Equation (4), n represents the number of data points, c is the number of clusters, u_{ij} is the degree of membership of the i-th data point in the j-th cluster, m is the fuzziness parameter, x_i is the i-th data point, and c_j is the center of the j-th cluster.
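A minimal NumPy sketch of the alternating updates that minimize Equation (4) is given below; the cluster count, fuzziness m, and how the resulting memberships are then used to flag noise components are assumptions, since the paper does not fix these hyperparameters:

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal FCM per Equation (4). X: (n, d) data; returns the
    membership matrix U of shape (n, c) and centers of shape (c, d)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)  # each row sums to 1

    for _ in range(n_iter):
        Um = U ** m
        # Centers: fuzzily weighted means, c_j = sum_i u_ij^m x_i / sum_i u_ij^m.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None], axis=2)
        dist = np.fmax(dist, 1e-12)  # avoid division by zero
        # Membership update: u_ij = d_ij^(-2/(m-1)) / sum_k d_ik^(-2/(m-1)).
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.linalg.norm(U_new - U) < tol:
            return U_new, centers
        U = U_new
    return U, centers
```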
The power spectral density (PSD) plots of the epoch data after noise removal and downsampling are shown in Figure 3. The colors of the dots represent different EEG channels, while the lines of the corresponding colors indicate the variation in PSD for the respective channels. For each epoch, the EEG signal from the 3 s video transition period is used as a baseline to eliminate any EEG activity unrelated to the video stimulus. This results in a 60 s EEG signal sequence. Subsequently, feature extraction is performed on the processed data.
2.4. Feature Extraction
The Fast Fourier Transform (FFT) is an efficient algorithm for the Discrete Fourier Transform (DFT) [32], enabling the rapid conversion of time-domain signals into frequency-domain signals. If the input time-domain signal x(t) satisfies the following condition

$$\int_{-\infty}^{\infty} \lvert x(t) \rvert \, dt < \infty \quad (5)$$

then x(t) can undergo a continuous Fourier transform with the following formula:

$$X(f) = \int_{-\infty}^{\infty} x(t) \, e^{-i 2 \pi f t} \, dt \quad (6)$$

In Equation (6), X(f) represents the output frequency-domain signal, f is the frequency, and i is the imaginary unit. However, since both the time-domain and frequency-domain signals are discrete in digital signal processing, the continuous Fourier transform is not directly computable on a computer. Therefore, the Discrete Fourier Transform (DFT) is commonly used. For a discrete signal sample x[n], the DFT result is given by the following:

$$X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-i 2 \pi k n / N} \quad (7)$$

In Equation (7), N is the length of the signal, and k is the frequency index, ranging from 0 to N − 1.
The Fast Fourier Transform (FFT) takes advantage of the symmetry, periodicity, and reducibility of Fourier coefficients to optimize the calculation of the DFT, significantly improving computational efficiency. The basic FFT algorithms are divided into two main types: the time-decimation method and the frequency-decimation method. This design employs a time-window-based approach. Assuming that the data within the k-th time window are represented as x_k[n], the spectrum X_k[m] is computed as follows:

$$X_k[m] = \sum_{n=0}^{M-1} x_k[n] \, e^{-i 2 \pi m n / M} \quad (8)$$

In Equation (8), M is the number of sampling points within each time window.
To mitigate noise, spikes, and interference while preserving the relevant signal information, a Least Mean Squares (LMS) filter is applied in this design. The filter updates its coefficients based on the error between the input signal and the desired output, minimizing the mean square error. The weight update rule is the following:

$$w(n+1) = w(n) + \mu \, e(n) \, x(n) \quad (9)$$

In Equation (9), μ is the learning rate, e(n) is the prediction error, and x(n) is the input signal vector.
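A minimal NumPy sketch of the Equation (9) update follows; the filter length and step size μ are illustrative choices, as the paper does not report its LMS parameters:

```python
import numpy as np

def lms_filter(x, d, n_taps=8, mu=0.01):
    """Minimal LMS adaptive filter per Equation (9).

    x: input signal; d: desired (reference) signal.
    Returns the filter output y and the final weight vector w."""
    w = np.zeros(n_taps)
    y = np.zeros_like(x, dtype=float)
    for n in range(n_taps, len(x)):
        x_vec = x[n - n_taps:n][::-1]  # most recent samples first
        y[n] = w @ x_vec               # filter output
        e = d[n] - y[n]                # prediction error e(n)
        w = w + mu * e * x_vec         # Equation (9) weight update
    return y, w
```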
Additionally, window functions are widely used in spectral analysis to effectively suppress the effects of signal truncation and improve processing accuracy. In this design, the Hanning window is selected due to its ability to balance spectral resolution and leakage suppression, making it particularly suitable for EEG signals with overlapping frequency components. The formula for the Hanning window is the following:

$$w(n) = 0.5 \left( 1 - \cos \frac{2 \pi n}{N - 1} \right), \quad n = 0, 1, \ldots, N - 1 \quad (10)$$

In Equation (10), N represents the window length, and n denotes the index of the sample point within the window. A longer window provides higher frequency resolution but may blur temporal details, while a shorter window improves temporal resolution but increases spectral leakage. N is carefully selected based on the EEG sampling rate and epoch segmentation to ensure an optimal trade-off between frequency resolution, temporal precision, and noise suppression.
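Putting Equations (8) and (10) together, a short sketch of the windowed spectrum computation might look as follows; the window length of 256 samples is an assumption chosen to match a 2 s epoch at the 128 Hz target rate from Section 2.3:

```python
import numpy as np

def windowed_spectrum(epoch, fs=128, n_window=256):
    """Hanning-windowed FFT of one EEG time window (Equations (8) and (10)).

    epoch: 1-D signal for one window; fs: sampling rate in Hz;
    n_window: window length M (256 samples = 2 s at 128 Hz)."""
    x = epoch[:n_window] * np.hanning(n_window)   # Equation (10) taper
    spectrum = np.fft.rfft(x)                     # Equation (8) via the FFT
    freqs = np.fft.rfftfreq(n_window, d=1.0 / fs)
    return freqs, np.abs(spectrum)

# Band powers for the five EEG rhythms can then be pooled from
# freqs/spectrum, e.g., alpha as the 8-13 Hz bins.
```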
Finally, all feature vectors and labels are saved as .npy files. The results of feature extraction for the EEG signal data from the first participant are shown in Figure 4.
2.5. ECA Module
In recent years, channel attention mechanisms have demonstrated great potential in enhancing the performance of neural networks. The ECA (Efficient Channel Attention) module is inspired by the SE (Squeeze-and-Excitation) attention module [33] and the CBAM (Convolutional Block Attention Module) [34], offering an efficient channel attention mechanism while maintaining computational efficiency. EEG signals are commonly divided into five frequency bands: delta (1–3 Hz), theta (4–7 Hz), alpha (8–13 Hz), beta (14–29 Hz), and gamma (30–47 Hz) [35]. Research has shown that the beta and gamma bands of EEG signals exhibit the strongest response to emotions, followed by the alpha band, while the theta band shows the weakest response [36,37]. Based on this characteristic, the ECA module assigns different weights to these frequency bands, focusing on the frequency bands most relevant to emotional responses in the emotion classification task, thereby improving classification performance.
The structure of the ECA module is shown in Figure 5. Let the input feature map be X ∈ R^{h×w×c}, where h, w, and c represent the height, width, and number of channels, respectively. First, global average pooling (GAP) is applied along the spatial dimensions to obtain a 1 × 1 × C feature map, where C represents the number of channels. By aggregating spatial information through global average pooling, global features are extracted, which simplifies computational complexity while preserving the overall information of the input feature map. The calculation for the feature weights is as follows:

$$g_c = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} X_{i, j, c} \quad (11)$$
The core idea of the ECA attention mechanism is to use a window function to perform a weighted summation over each channel of the feature map, calculating the weight for each channel. This helps effectively capture the interaction between channels and prevents the loss of channel information. This window function can be a one-dimensional convolutional kernel. As shown in Figure 5, local cross-channel interactions are captured through 1D convolution without reducing dimensions, where the kernel size k determines the range of interaction. Since k is related to the channel dimension c, larger channel sizes result in stronger long-range interactions, while smaller channel sizes lead to stronger short-range interactions. Therefore, the following adaptive method is used to determine the kernel size:

$$k = \psi(C) = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}} \quad (12)$$

In Equation (12), γ and b are constants, and |·|_odd denotes the closest odd integer. In this study, the number of input channels C is set to 3, with both γ and b set to 1. Finally, after passing through the Sigmoid function (denoted by σ in Figure 5), the channel attention feature map is obtained. This feature map is then element-wise multiplied with the original input feature map to produce the final output feature map with channel attention.
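A minimal Keras sketch of the module as described in Figure 5 and Equations (11) and (12) is given below; the class name is illustrative and not taken from the paper's code:

```python
import math
import tensorflow as tf
from tensorflow.keras import layers

class ECA(layers.Layer):
    """Efficient Channel Attention per Equations (11) and (12)."""

    def __init__(self, gamma=1, b=1, **kwargs):
        super().__init__(**kwargs)
        self.gamma, self.b = gamma, b

    def build(self, input_shape):
        self.channels = int(input_shape[-1])
        # Equation (12): adaptive kernel size, rounded to the nearest odd integer.
        t = int(abs(math.log2(self.channels) / self.gamma + self.b / self.gamma))
        k = t if t % 2 == 1 else t + 1
        self.conv = layers.Conv1D(1, kernel_size=k, padding="same", use_bias=False)

    def call(self, x):
        # Equation (11): global average pooling over the spatial dimensions.
        g = tf.reduce_mean(x, axis=[1, 2])       # (batch, c)
        g = tf.expand_dims(g, axis=-1)           # (batch, c, 1)
        w = self.conv(g)                         # 1D conv across channels
        w = tf.sigmoid(tf.squeeze(w, axis=-1))   # channel weights in (0, 1)
        w = tf.reshape(w, [-1, 1, 1, self.channels])
        return x * w                             # channel-wise reweighting
```

With C = 3 and γ = b = 1, Equation (12) gives t = 2, which rounds to the kernel size k = 3 used here.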
2.6. ECA-ResDNN Model
In traditional residual networks, a max pooling layer is often added after each residual block to reduce computational cost and model complexity while retaining significant features. However, frequent pooling may cause information loss, impacting performance and generalization. To address this, we add a max pooling layer only after the final convolutional layer. As illustrated in Figure 6, the ECA-ResNet model consists of convolutional layers, batch normalization layers, three attention mechanism residual blocks, a max pooling layer, and a flatten layer.
The parameter configuration for each attention mechanism residual block is summarized in Table 1. Each residual block includes convolutional layers, batch normalization layers, ReLU activation layers, and an ECA module. By setting the first dimension of the output shape to "None", the model can accommodate batch data of varying sizes during training, which enhances its flexibility and generalizability. The mathematical formulation for the attention mechanism residual block is as follows:

$$y = F(x) + x \quad (13)$$

In Equation (13), F(x) represents the output after convolution, batch normalization, ReLU activation, and the ECA attention mechanism; x is the input feature map, and y is the output feature map after applying residual learning.
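As a sketch of Equation (13), one residual block can be written as follows, reusing the ECA layer sketched in Section 2.5; the internal depth and filter counts are assumptions, since the paper's exact values are listed in Table 1:

```python
from tensorflow.keras import layers

def eca_residual_block(x, filters, kernel_size=3):
    """Attention-mechanism residual block per Equation (13): y = F(x) + x.

    F(x) = Conv -> BN -> ReLU -> Conv -> BN -> ECA; the skip connection
    adds the input back before the final activation. A 1x1 projection
    on the shortcut would be needed if channel counts differed."""
    shortcut = x
    y = layers.Conv2D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, kernel_size, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = ECA()(y)                     # channel attention from Section 2.5
    y = layers.Add()([shortcut, y])  # Equation (13)
    return layers.ReLU()(y)
```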
To effectively capture both temporal and spectral features from EEG signals and achieve accurate emotion classification, this paper introduces the ECA-ResDNN (Efficient Channel Attention Residual Deep Neural Network) model. This model integrates attention mechanisms, residual networks, and deep neural networks, allowing it to deeply mine the intrinsic features of EEG signals.
As shown in Figure 7, the model consists of four main components:
(1) Data Input Layer: Accepts a preprocessed three-dimensional feature matrix from various EEG signal channels.
(2) Spectral Feature Extraction Layer: The ECA-ResNet model with an attention mechanism captures significant spectral information from each time slice.
(3) Temporal Feature Extraction Layer: A deep neural network extracts temporal features from the ECA-ResNet output, enabling higher-level abstraction.
(4) Fully Connected Layer: Final classification using a softmax activation function to categorize the input into emotional states.
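To make the four-component structure concrete, the following Keras sketch assembles the pipeline, reusing the eca_residual_block sketch above; all layer widths are illustrative assumptions rather than the paper's Table 1 configuration:

```python
from tensorflow.keras import layers, models

def build_eca_resdnn(input_shape, n_classes):
    """Sketch of the four-component ECA-ResDNN pipeline of Figure 7."""
    inputs = layers.Input(shape=input_shape)       # (1) data input layer

    # (2) spectral feature extraction: ECA-ResNet
    x = layers.Conv2D(32, 3, padding="same")(inputs)
    x = layers.BatchNormalization()(x)
    for _ in range(3):                             # three ECA residual blocks
        x = eca_residual_block(x, 32)
    x = layers.MaxPooling2D()(x)                   # single pooling at the end
    x = layers.Flatten()(x)

    # (3) temporal feature extraction: deep neural network
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)

    # (4) fully connected classification head with softmax
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```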
3. Experimental Design
3.1. Experimental Parameter Settings
The experiments in this study were conducted using a consistent software and hardware environment. The experimental setup was performed on an HP laptop (Compal, Hangzhou, China) equipped with the TensorFlow 2.10.0 framework, an NVIDIA GeForce RTX 2080 Super graphics card (NVIDIA Corporation, Santa Clara, CA, USA), and an Intel® Core™ i7-10750H CPU (Intel Corporation, Chengdu, China). The entire training process took approximately 36 h and was distributed over a two-week period.
After preprocessing and feature extraction, the dataset was split into training and testing sets based on a specific indexing rule. One out of every eight rows was selected as the test set, with the remaining rows used as the training set. This data partitioning method was applied uniformly across all experiments to ensure consistency. Additionally, to handle outliers in the labels, any label with a value of 9 was replaced with 8.99.
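The indexing rule and the boundary-label adjustment are simple to express in NumPy; the following sketch assumes hypothetical file and variable names for the saved feature and label arrays from Section 2.4:

```python
import numpy as np

features = np.load("features.npy")  # hypothetical outputs of feature extraction
labels = np.load("labels.npy")

idx = np.arange(len(features))
test_mask = (idx % 8 == 0)          # one out of every eight rows as the test set

x_train, y_train = features[~test_mask], labels[~test_mask]
x_test, y_test = features[test_mask], labels[test_mask]

# Outlier handling: ratings at the scale boundary (9) become 8.99.
y_train = np.where(y_train == 9, 8.99, y_train)
y_test = np.where(y_test == 9, 8.99, y_test)
```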
For the training sessions, the batch size was set to 256. The number of epochs was configured to 120 for binary classification and 240 for eight-class classification tasks. The log display mode (verbose) was set to 1, allowing for progress updates during training.
3.2. Loss Function
This study introduces a novel loss function that integrates cross-entropy loss and fuzzy set loss, aiming to synergize the advantages of both methods for enhanced performance. The cross-entropy loss is effective for classification tasks, while the fuzzy set loss helps address uncertainty and noise in the data. The combined loss function is defined as follows:

$$L = \lambda \, L_{CE} + (1 - \lambda) \, L_{fuzzy} \quad (14)$$

In Equation (14), L_CE represents the cross-entropy loss, L_fuzzy represents the fuzzy set loss, and λ is a hyperparameter in the range [0, 1] used to adjust the relative weight between the two loss functions.
The cross-entropy loss, L_CE, measures the difference between the predicted probability distribution and the true labels. Its mathematical expression is given by the following, where C is the number of classes, y_i is the one-hot encoded true label, and ŷ_i is the predicted probability for the i-th class:

$$L_{CE} = -\sum_{i=1}^{C} y_i \log \hat{y}_i \quad (15)$$
The fuzzy set loss, L_fuzzy, introduces a fuzziness parameter α to control the sensitivity of the loss function to misclassifications. Its expression is given in Equation (16). In Equation (16), α is a positive parameter that adjusts the degree of fuzziness in the loss function, allowing it to better handle uncertainty and noise in the data.
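A sketch of the combined loss of Equation (14) is given below. Since Equation (16) is not reproduced here, the fuzzy set term uses an assumed focal-style weighting, (1 − ŷ)^α applied to the cross-entropy, chosen only because it matches the stated role of α in modulating sensitivity to misclassifications; the paper's exact expression should be substituted where available:

```python
import tensorflow as tf

def hybrid_loss(lam=0.5, alpha=2.0):
    """Equation (14): L = lam * L_CE + (1 - lam) * L_fuzzy.

    The fuzzy term below is an ASSUMED form, not the paper's
    Equation (16); lam and alpha are illustrative values."""
    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0)
        # Equation (15): categorical cross-entropy over one-hot labels.
        ce = -tf.reduce_sum(y_true * tf.math.log(y_pred), axis=-1)
        # Assumed fuzzy set loss: (1 - p)^alpha down-weights confident hits,
        # emphasizing uncertain or misclassified samples.
        fuzzy = -tf.reduce_sum(
            y_true * tf.pow(1.0 - y_pred, alpha) * tf.math.log(y_pred), axis=-1)
        return lam * ce + (1.0 - lam) * fuzzy  # Equation (14)
    return loss

# Usage: model.compile(optimizer="adam", loss=hybrid_loss(lam=0.5))
```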
5. Conclusions
In this paper, we proposed a novel hybrid model, ECA-ResDNN, for EEG-based emotion recognition, aiming to address key challenges such as noise, artifacts, and signal distortions in EEG data. By integrating an advanced preprocessing method that combines Generative Adversarial Networks (GANs) and fuzzy set theory, the clarity and stability of EEG signals were significantly enhanced. The adoption of a Deep Neural Network in conjunction with an attention mechanism allowed the model to more effectively capture and represent emotional features within EEG data. Furthermore, the introduction of a hybrid loss function, which combines cross-entropy loss with fuzzy set loss, optimized the training process, thus improving the model’s sensitivity to misclassification and enhancing its generalization capabilities.
Experimental comparisons demonstrated that the proposed ECA-ResDNN model outperforms existing models in both accuracy and robustness for emotion recognition tasks. These results validate the effectiveness and reliability of our approach, suggesting promising applications in fields such as affective computing, mental health monitoring, and human–computer interaction.
In future work, we plan to explore further enhancements through multimodal fusion and the implementation of real-time systems, which will enable deployment in practical, real-world scenarios. Additionally, current source density (CSD) transformation offers a more localized representation of cortical activity by reducing volume conduction effects, thereby improving source localization and neural activity decomposition. Given its potential to enhance EEG feature extraction, we will investigate its applicability to emotion recognition and conduct comparative analyses with conventional referencing techniques.