Article

CMHFE-DAN: A Transformer-Based Feature Extractor with Domain Adaptation for EEG-Based Emotion Recognition

1 Faculty of Sciences and Techniques, Hassan First University, Settat 26000, Morocco
2 National School for Applied Science, Hassan First University, Berrechid 26100, Morocco
* Author to whom correspondence should be addressed.
Information 2025, 16(7), 560; https://doi.org/10.3390/info16070560
Submission received: 25 May 2025 / Revised: 26 June 2025 / Accepted: 27 June 2025 / Published: 30 June 2025

Abstract

EEG-based emotion recognition (EEG-ER) through deep learning models has gained more attention in recent years, with more researchers focusing on architecture, feature extraction, and generalisability. This paper presents a novel end-to-end deep learning framework for EEG-ER, combining temporal feature extraction, self-attention mechanisms, and adversarial domain adaptation. The architecture entails a multi-stage 1D CNN for spatiotemporal features from raw EEG signals, followed by a transformer-based attention module for long-range dependencies, and a domain-adversarial neural network (DANN) module with gradient reversal to enable a powerful subject-independent generalisation by learning domain-invariant features. Experiments on benchmark datasets (DEAP, SEED, DREAMER) demonstrate that our approach achieves a state-of-the-art performance, with a significant improvement in cross-subject recognition accuracy compared to non-adaptive frameworks. The architecture tackles key challenges in EEG emotion recognition, including generalisability, inter-subject variability, and temporal dynamics modelling. The results highlight the effectiveness of combining convolutional feature learning with adversarial domain adaptation for robust EEG-ER.

1. Introduction

Emotion recognition (ER) is a growing field, and deep learning models have driven remarkable progress across its applications. Electroencephalogram (EEG) signals, offering a non-invasive and temporally precise measure of brain activity, are well suited to emotion classification and have enabled promising advances in brain–computer interfaces (BCI), affective computing, and human–computer interaction.
Recent advancements in deep learning have revolutionised EEG-based emotion recognition (EEG-ER), overcoming longstanding challenges such as inter-subject variability, noise sensitivity, and the need for manually extracted features [1]. Although manually extracted features, such as power spectral density (PSD) and the fast Fourier transform (FFT) [2], differential entropy (DE), and the continuous wavelet transform (CWT) [3], have greatly improved emotion prediction accuracy, they hinder progress toward real-time EEG-ER. This motivates end-to-end models that require little to no manual data preparation.
Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [4,5] remain the cornerstone of EEG emotion recognition due to their exceptional ability to capture spatiotemporal patterns from raw EEG signals; together with attention-based mechanisms, they achieve high accuracy in classifying emotional states. Recent work introduced a progressive CNN [6] that hierarchically refines features, significantly reducing computational overhead while improving classification performance. Another approach combined a CNN with a graph-based extension [7], creating a GECNN model that explicitly models neurophysiological connectivity patterns between electrodes using graph convolutional operations, resulting in a more biologically plausible feature representation. For temporal dynamics, hybrid CNN-RNN models have also made an impact: a proposed CNN-BiLSTM-Attention framework [5] demonstrates how attention mechanisms can selectively weigh informative temporal segments, while another model, ACRNN [8], shows the benefits of dedicated spatial (CNN) and temporal (RNN) processing pathways.
Despite these significant advancements in deep learning for EEG-ER, and despite the rapid attention that transformer-based architectures have gained, research has yet to fully explore the potential of this architecture in the domain of ER. The EEG Conformer [9] represents a hybrid approach, where local CNN features serve as input tokens for a transformer encoder that learns global interactions to capture both fine-grained spatial patterns and broad temporal contexts. Meanwhile, the STS-Transformer [10] eliminates traditional preprocessing steps through learned stem-patch embeddings, demonstrating that end-to-end transformer architectures can match or exceed the performance of carefully engineered pipelines. However, given EEG signals' susceptibility to noise, inter-subject variability, and dataset-specific biases, it is important, especially in ER, to test a model's robustness and its generalisation across multiple datasets.
Testing the robustness of a model involves achieving a consistent performance across diverse subjects and recording conditions. Recent domain adaptation (DA) techniques have made significant strides in this direction: an inter-subject transfer learning approach [4] employs deep CNNs with adversarial training to learn subject-invariant features, while another work introduces hierarchical fusion networks [11] with scenario-adaptive pretraining to improve cross-dataset generalisation. Moreover, synthetic data generation methods such as ELM-W-AE [3] provide another promising solution by creating artificial EEG samples that preserve emotional discriminability while augmenting limited training sets.
Despite these advancements, there remains a critical need for a real-time, cross-subject adaptive EEG emotion recognition system. Admittedly, while some of these works achieve exceptionally high accuracy [3,5], the use of handcrafted features contributes to this result, and the reliance on a single dataset for evaluation might suggest potential data leakage. Furthermore, the end-to-end models [4,8], despite their ability to recognise emotion directly from raw EEG signals, still lack generalisability to unseen subjects. Consequently, it is essential to validate a model on different datasets to prove its robustness, and on unseen subjects to test its generalisability.
Unquestionably, several studies display remarkable advantages over other works [6,7,10], namely their end-to-end architectures, their use of state-of-the-art models, and their extensive evaluation protocols across different datasets under subject-dependent (SD) and subject-independent (SI) analyses. However, their accuracy drops by approximately 10% from SD to SI evaluation, which suggests the need to explore further ways of achieving the desired consistency.
For all of these reasons, and to further narrow this gap, our main contributions are as follows:
(1)
A novel end-to-end convolutional multi-head attention feature extractor with a domain adaptation network, CMHFE-DAN, forming a complete pipeline with little to no human intervention and bringing real-world application closer.
(2)
A transformer-based convolutional feature extractor, automating the feature extraction phase.
(3)
Testing the model’s generalisation on multiple datasets with different signal acquisition methods, designs, and labelling, namely DEAP, SEED, and DREAMER datasets, to prove our CMHFE-DAN’s adaptability and generalisability.
(4)
Adding a DAN module to the feature extractor and applying it to the DEAP and DREAMER datasets for cross-subject validation, all while maintaining computational efficiency for BCI applications.

2. State of the Art

Recent research has introduced several innovative approaches to improve EEG-ER, particularly focusing on cross-subject generalisation, feature extraction, noise reduction, and transformer-based architecture.
To enhance model performance, studies exploring DA techniques developed a novel loss function [12] that minimises intra-class variation while maximising inter-class separation, significantly improving cross-subject accuracy. Similarly, an LSTM-based autoencoder was combined with a CNN and attention mechanism [13] to extract subject-invariant features, while domain-invariant feature extraction [14] was employed to strengthen generalisation across datasets without compromising privacy. Moreover, an end-to-end CNN-based architecture [15] was compared with traditional machine learning methods across different extracted features. RHPRNet [11] is a hierarchical multimodal network that extracts spatial-frequency EEG patterns, applies inter-domain and inter-modality encoders, and fuses features using contrastive alignment.
Attention-based and transformer-based architectures have also seen significant advancements. An attention-based recurrent graph convolutional neural network [16] was introduced for multimodal emotion recognition, effectively integrating EEG with other biosignals. Spatio-temporal modelling has been another key focus, with the proposal of a novel multi-source DA framework [17] with a dedicated spatiotemporal feature extractor. Furthermore, multi-scale residual networks (MSRN), multi-task learning (MTL), and connectivity features [18] were combined to reduce individual differences and capture dynamic brain-region interactions across emotional states. Another notable contribution came from [19], which proposed a deep network comprising channel-specific feature encoders, a fusion module, and a multi-task classifier for arousal, valence, and dominance prediction. A model using a transformer encoder [20] to capture time-frequency features was introduced, followed by a spatial-temporal graph attention network (STGATE) with a dynamic adjacency matrix to learn EEG channel relationships. CNNs were used for noise filtering [9] before applying transformers for high-level feature extraction, resulting in better interpretability. To further enhance EEG feature extraction, a multi-kernel CNN [21] was tested across multiple datasets. These methods highlight the importance of both local and global feature representations in EEG analysis.
Collectively, these studies are condensed in Table 1; they illustrate the field’s progression toward hybrid architectures that integrate CNNs, transformers, and DA techniques to address the challenges of cross-subject variability and noise robustness in EEG-ER.
Many of these studies rely on CNNs for their strong feature extraction capability; however, when combined with MSRN [18], custom variants [15], or multi-kernel designs [21], these models were less accurate than those combining CNNs with an attention mechanism or transformer architecture [16,20]. This led us to propose a transformer-based CNN using 1D convolutional layers, which capture the temporal patterns typical of EEG signals, with a fixed small kernel size of 3, which minimises parameters and training time while still capturing fine-grained temporal patterns [15,18,20]. Moreover, stacking this kernel size across multiple Conv1D layers effectively captures long-range dependencies, and the MHSA mechanism focuses on relevant features while enabling local-to-global receptive fields.
Undoubtedly, one of the main observations from these works is that a model tailored for SD emotion recognition achieves higher accuracy than one designed for SI analysis: without DA, accuracy dropped from SD to SI by 24% [19], 23% [9], and 21% [21]. Meanwhile, the models that used DA focused only on SI emotion recognition [12,17]. To this end, combining the proposed feature extractor with a DA network should reduce this gap further.

3. Proposed Model and Method

3.1. Proposed Model

The proposed architecture combines temporal feature extraction, attention mechanisms, and adversarial DA to create a robust framework for EEG-ER. The model consists of three primary components: a deep convolutional feature extractor for raw EEG processing, a transformer-based attention module for capturing long-range dependencies, and a domain-adversarial neural network (DANN) component for cross-subject generalisation.
The model processes EEG signals, after permutation, through a feature extraction pipeline beginning with four consecutive 1D convolutional layers (Conv1D) with filter counts of 64, 128, 256, and 128, respectively, and a kernel size of 3. Batch normalisation follows each convolution to stabilise training. The output is a temporal feature map of dimension T × 128, where T represents the time dimension reduced through strided convolutions.
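As a rough illustration, a minimal Keras sketch of this convolutional front-end is given below. The filter counts, kernel size of 3, batch normalisation, and dropout follow the description above; the ReLU activations, the stride of 2, and the permutation to a (time, channels) layout are assumptions made for this sketch.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_conv_feature_extractor(n_channels, n_timestamps):
    """Sketch of the four-stage Conv1D feature extractor (assumed details noted above)."""
    inputs = layers.Input(shape=(n_channels, n_timestamps))
    # Permute so the convolution runs over time: (time, channels).
    x = layers.Permute((2, 1))(inputs)
    for filters in (64, 128, 256, 128):
        x = layers.Conv1D(filters, kernel_size=3, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.Dropout(0.5)(x)
    # Output: temporal feature map of shape (T, 128).
    return tf.keras.Model(inputs, x, name="conv_feature_extractor")
```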
The multi-head self-attention (MHSA) mechanism allows the model to attend to different aspects of the EEG dynamics in parallel [22]. This component is the foremost module in the transformer; it uses the standard key-value-query (K, V, Q) with scaled dot-product attention, and the output is a weighted sum of the values:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{k}}\right)V,$$
where k is the dimension of the temporal features. The attention outputs are then concatenated and transformed as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O},$$
where
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}),$$
and the W parameter matrices are learned.
The temporal features are processed through an MHSA module with 4 parallel attention heads. Each head operates on segmented feature dimensions (32-dimensional sub-spaces), and then the output is combined with the original features through residual connections, followed by a layer normalisation.
The attention mechanism is followed by a feed-forward network (FFN) that performs a non-linear transformation of the attended features. It consists of two fully connected layers: an expansion layer with 512 units (using ReLU activation), followed by a projection layer that restores the original feature dimension. This bottleneck design, combined with residual connections, is used to maintain stable gradients through feature transformation. A normalisation layer is then applied. This part of the FFN, combined with MHSA, is similar to the transformer encoder block [22], but optimised for 1D temporal signals.
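For concreteness, a minimal sketch of this attention-plus-FFN block in Keras is shown below, assuming the 4 heads on 32-dimensional sub-spaces, the 512-unit expansion, and the residual connections with layer normalisation described above; the use of `layers.MultiHeadAttention` and the absence of dropout inside the block are assumptions of this sketch.

```python
from tensorflow.keras import layers

def transformer_block(x, num_heads=4, key_dim=32, ffn_units=512, model_dim=128):
    """Sketch of the MHSA + feed-forward block applied to the (T, 128) feature map."""
    # Multi-head self-attention over the temporal dimension.
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
    x = layers.LayerNormalization()(x + attn)    # residual connection + layer norm

    # Feed-forward network: expand to 512 units, project back to model_dim.
    ffn = layers.Dense(ffn_units, activation="relu")(x)
    ffn = layers.Dense(model_dim)(ffn)
    return layers.LayerNormalization()(x + ffn)  # residual connection + layer norm
```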
For final classification, the temporal features are condensed through global average pooling before passing through a 128-unit dense layer (ReLU activated), and finally the FC classification layer.
The CMHFE model is slightly altered to add the DANN component: a max-pooling layer is inserted between the conv block and the MHSA block, and the features are then passed through global average pooling and two task-specific dense layers (128 and 64 units). The architecture of the whole model is presented in Figure 1. The DANN [23,24] component introduces adversarial training to learn subject-invariant representations, using a gradient reversal layer (GRL) between the shared feature extractor and the domain classifier; the feature extractor's output serves as input to both the domain classifier and the label predictor. During forward propagation, the GRL $R_{\lambda}$ acts as an identity transform:
$$R_{\lambda}(x) = x,$$
where x represents the feature vector. During backpropagation, however, the GRL multiplies the gradient flowing back from the domain classifier by the negative scaling factor $-\lambda$:
$$\frac{dR_{\lambda}}{dx} = -\lambda I,$$
where I is the identity matrix and λ controls the adversarial strength. This reversal forces the feature extractor to learn domain-invariant representations by confusing the domain classifier.
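A minimal TensorFlow sketch of such a gradient reversal layer is given below, following the standard DANN formulation; the wrapper class and its `lam` argument are illustrative names rather than the authors' exact implementation.

```python
import tensorflow as tf

def make_reverse_gradient(lam):
    """Return a function that is the identity forward and scales gradients by -lam backward."""
    @tf.custom_gradient
    def reverse_gradient(x):
        def grad(dy):
            return -lam * dy  # gradient reversal: multiply by the negative scaling factor
        return tf.identity(x), grad
    return reverse_gradient


class GradientReversalLayer(tf.keras.layers.Layer):
    """Keras layer sitting between the shared feature extractor and the domain classifier."""
    def __init__(self, lam=1.0, **kwargs):
        super().__init__(**kwargs)
        self.lam = lam
        self._reverse = make_reverse_gradient(lam)

    def call(self, inputs):
        return self._reverse(inputs)
```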
Samples from both the source and target domains are passed during training, together with a binary domain label that identifies each sample’s origin. While the target domain samples are unlabelled, the source domain samples have known class labels. The objective is to develop a model that, when tested, can correctly predict the labels of the target domain samples.
The domain classifier consists of two dense layers (128 and 64 units with ReLU) followed by a single sigmoid output for binary domain prediction. Simultaneously, the label predictor head maps the features into 2 dense layers corresponding to separate arousal and valence predictors (2D outputs, each with linear activation). This dual-head design explicitly models the circumplex model of affect while sharing low-level feature representations.
The complete model is trained with a composite loss function L. The emotion loss $L_{E}$ combines binary cross-entropy (BCE) terms for arousal and valence, and the domain loss $L_{d}$ also uses BCE to maximise domain confusion. The total loss L is defined as:
$$L = L_{E} + \lambda L_{d}.$$
During training, the model applies L2 weight regularisation (L2 = 0.001) to the Dense and Conv layers to constrain model complexity, and dropout (rate = 0.5) to prevent over-reliance on specific temporal correlations. Dropout after each convolution prevents the co-adaptation of features; this configuration reduces overfitting and improves generalisation.
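Putting the two objectives together, a hedged sketch of the composite loss is shown below; the `from_logits` choice for the emotion heads follows their linear activation, the plain BCE for the domain head follows its sigmoid output, and the equal weighting of the arousal and valence terms inside L_E is an assumption.

```python
import tensorflow as tf

# Emotion heads use linear activation (logits); the domain head ends in a sigmoid.
emotion_bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
domain_bce = tf.keras.losses.BinaryCrossentropy()

def composite_loss(y_arousal, y_valence, y_domain,
                   p_arousal, p_valence, p_domain, lam=0.5):
    """Sketch of the composite objective L = L_E + lambda * L_d."""
    emotion_loss = emotion_bce(y_arousal, p_arousal) + emotion_bce(y_valence, p_valence)  # L_E
    domain_loss = domain_bce(y_domain, p_domain)                                          # L_d
    return emotion_loss + lam * domain_loss
```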

3.2. Datasets

The model CMHFE-DAN is validated using three widely used benchmark datasets for EEG-ER, namely DEAP, SEED, and DREAMER (each dataset with its unique characteristics in terms of experimental design, signal acquisition, and emotional labelling), for a more comprehensive evaluation of our model.

3.2.1. DEAP Dataset

The Database for Emotion Analysis using Physiological Signals (DEAP) [25,26] is a well-established multimodal dataset containing EEG recordings from 32 participants viewing 40 one-minute music videos designed to elicit varied emotional states. Each trial includes EEG signals captured with a 32-channel Biosemi ActiveTwo system according to the 10–20 international placement, along with self-reported valence, arousal, and dominance ratings (1–9). The dataset is pre-processed to remove artefacts (EOG/EMG), downsampled to 128 Hz, and bandpass-filtered (4–45 Hz) to enhance signal quality. Each trial spans 63 s, comprising a 60 s stimulus signal and a 3 s baseline signal (pre-stimulus resting state), facilitating a comparative analysis of emotional responses.

3.2.2. SEED Dataset

The SEED (SJTU Emotion EEG Dataset) [27,28], developed by Shanghai Jiao Tong University, focuses on discrete emotion classification (positive, neutral, negative). It includes 62-channel EEG recordings from 15 participants, each undergoing 15 trials across three sessions, where stimuli consist of carefully selected 4 min video clips presented in a randomised, non-repeating emotional sequence to minimise bias. The data is pre-processed to remove artefacts, downsampled to 200 Hz, and recorded following the 10–20 international system.

3.2.3. DREAMER Dataset

The DREAMER dataset [29] introduces a wearable, low-cost EEG/ECG setup. It includes recordings from 23 participants at a sampling rate of 128 Hz using an Emotiv EPOC EEG headset (14 channels) and a Shimmer2 ECG sensor, where emotional responses are self-assessed in terms of valence, arousal, and dominance with (1–5) ratings after watching 18 film clips. We use 2.5 as a threshold for a binary classification of arousal and valence. Unlike DEAP and SEED, DREAMER uses a portable data acquisition tool, making it valuable for potential real-time emotion recognition applications.
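Since DREAMER's self-assessments are on a 1–5 scale, the binary labels follow from the 2.5 threshold mentioned above; a trivial sketch (the helper name is ours) is:

```python
import numpy as np

def binarise_dreamer_labels(ratings, threshold=2.5):
    """Map DREAMER's 1-5 valence/arousal ratings to binary low/high classes."""
    return (np.asarray(ratings) > threshold).astype(int)

# Example: ratings [1, 2, 3, 4, 5] -> labels [0, 0, 1, 1, 1]
```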

3.3. Methods and Hyperparameters

The model is implemented in Paperspace Gradient ML Platform’s notebooks [30] in TensorFlow/Keras v2.15.0 using an NVIDIA A4000 GPU.
Both architectures of the model are trained using the Adam optimiser. Table 2 summarises the amount of data used for each of the three datasets before segmentation, including the number of subjects, channels, and trials; the sampling rate in Hz; the total number of trials (trials × subjects); and the total number of signals (trials × channels × subjects). Table 3 presents the parameter choices after fine-tuning, such as batch size and learning rate, which are optimised through a grid search. For computational efficiency, all data are segmented using a 4 s window (w = 4 s) with a 2 s overlap (s = 2 s), ensuring smoother feature extraction, less information loss at window boundaries, and the benefits of data augmentation via segmentation and reconstruction [31]. After segmentation, the number of resulting segments (NS) varies between datasets. For the DEAP dataset, all signals have a fixed length of 8064 timestamps, equivalent to 63 s, and the NS per signal is calculated as ((t − w)/s) + 1 = 30 segments; thus, the NS per subject is 1200, resulting in a shape of (segments, channels, timestamps) = (1200, 32, 512). Due to different timestamp lengths in the DREAMER and SEED datasets, the NS varies per trial. When summed, the resulting shapes are (1843, 14, 512) for DREAMER and (5031, 62, 800) for SEED.
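The sliding-window arithmetic above can be sketched as follows (the function name and the per-trial array layout are our own illustration):

```python
import numpy as np

def segment_trial(trial, fs=128, window_s=4, step_s=2):
    """Split one (channels, timestamps) trial into overlapping 4 s windows with a 2 s step."""
    w, s = window_s * fs, step_s * fs
    t = trial.shape[1]
    n_segments = (t - w) // s + 1          # NS = ((t - w) / s) + 1
    return np.stack([trial[:, i * s:i * s + w] for i in range(n_segments)])

# DEAP check: t = 8064 samples (63 s at 128 Hz), w = 512, s = 256
# -> (8064 - 512) // 256 + 1 = 30 segments per trial, each of shape (32, 512).
```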

4. Results and Discussion

The evaluation protocol assesses the model's performance in both SD and SI scenarios; this dual evaluation strategy offers critical insight into the model's ability to recognise emotions within individual subjects while also examining its generalisation to completely unseen users. The SEED dataset is not used for the SI analysis because of the computational cost of training on the whole dataset at once.

4.1. Within-Subject Analysis

Also referred to as the SD analysis, this evaluation uses a 5-fold cross-validation (CV) approach that maintains the temporal structure of the EEG recordings while ensuring balanced class representation across folds: each subject's data is partitioned into five non-overlapping segments, with the model trained on four folds and tested on the remaining fold. This is repeated five times so that each segment serves as the test set exactly once, preventing evaluation bias. Performance metrics are calculated for each subject individually before computing grand averages across all participants, allowing us to examine both overall performance and inter-subject variability.
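A compact sketch of this per-subject protocol is given below; `build_model` is a hypothetical factory returning a compiled CMHFE model (with an accuracy metric), the stratified folds without shuffling approximate the balanced, temporally ordered split described above, and the epoch/batch settings shown are the DEAP SD values from Table 3.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def subject_dependent_cv(X, y, build_model, n_splits=5):
    """5-fold CV over one subject's segments X with labels y; returns mean and std accuracy."""
    accs = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=False)
    for train_idx, test_idx in skf.split(X, y):
        model = build_model()
        model.fit(X[train_idx], y[train_idx], epochs=200, batch_size=32, verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        accs.append(acc)
    return float(np.mean(accs)), float(np.std(accs))
```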
To evaluate the model’s performance, accuracy serves as our primary metric, reported alongside standard deviation (std). The SEED dataset evaluation additionally incorporates confusion matrices to visualise classification patterns across its three discrete emotion categories (positive, neutral, negative).

4.1.1. Results

For the SEED dataset, the accuracy of each session is estimated using 5-fold CV and then averaged across subjects. As presented in Figure 2, CMHFE's average accuracy never drops below 70% for any subject, with the lowest accuracy of 70.91% for subject #5 and the highest of 94.21% for subject #9; eight subjects obtain an accuracy above 80%, and the mean accuracy across all subjects is 82.81%.
Another way to evaluate the model, rather than plotting the accuracy of each session, is to consider the overall accuracy of all sessions: two experiments achieve an accuracy below 62%; 46% of the experiments obtain an accuracy above 88%; 77% achieve above 75%; and 95% achieve above 62%.
Figure 3 shows that CMHFE generally predicts the three classes correctly, except for experiment #1 of subject #2, where the model predicts all classes as neutral; this experiment obtains an accuracy just above chance level for a three-class problem. For the other experiments shown, namely those of the subjects with the lowest average accuracies, CMHFE confuses the neutral class with the negative one in experiment #3 of subject #1, experiment #3 of subject #3, and experiment #1 of subject #4.
Regarding the DREAMER dataset, Figure 4 shows that CMHFE achieves average arousal accuracies above 80% for all subjects except #1, #12, #16, and #17, with the lowest accuracy of 66.25% for subject #12 and the highest of 98.80% for subject #7. For valence, eight subjects obtain an average below 80%, with the lowest accuracy of 68.3% for subject #16. Across all subjects, the average valence accuracy is 80.89% with a std of 9.8%, and the average arousal accuracy is 87.12% with a std of 8%.
For the DEAP dataset, Figure 5 shows only one subject (#27) with both average accuracies below 70%, and one subject (#18) with only arousal below 70%. The average arousal accuracy across all subjects is 82.66% with a std of 8.68%, and the average valence accuracy is 82.29% with a std of 9.3%.
Table 4 summarises the results of the SD analysis, showing the average accuracy across subjects for SEED, and the average valence and arousal accuracy for both DEAP and DREAMER datasets.

4.1.2. Discussion

The observed variations in model performance across subjects and sessions might suggest fundamental differences in how individuals express or experience emotions, as reflected in their EEG patterns. Nevertheless, the model maintains a stable accuracy of approximately 80% across all datasets, demonstrating robust generalisability. Its relative struggle to predict emotional states consistently across sessions for certain subjects highlights the dynamic nature of emotional responses. These findings underline the value of our CMHFE model, which effectively captures both universal and subject-specific emotional signatures through its feature learning architecture, proving particularly valuable for developing personalised emotion recognition systems, where models can be adaptively fine-tuned to an individual's EEG patterns. The CMHFE architecture naturally supports this personalisation by separating fundamental emotional features from subject-specific variations.

4.2. Cross-Subject Generalisation

Also referred to as the SI analysis, this evaluation uses a leave-one-subject-out CV (LOSO-CV) protocol, which provides the most rigorous test of model generalisability by completely separating training and testing subjects. In this framework, data from N−1 subjects forms the training set while the remaining subject's data serves as the test set, iterating N times so that each subject serves as the test set exactly once, with performance metrics computed exclusively on the held-out subject's data.
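A hedged sketch of this loop is shown below; `subject_data` and `build_dann_model` are hypothetical names, the epoch/batch settings follow the SI values in Table 3, and the model is assumed to be compiled with a single emotion output and an accuracy metric, with the adversarial use of the held-out subject's unlabelled segments as the target domain omitted for brevity.

```python
import numpy as np

def loso_cv(subject_data, build_dann_model):
    """subject_data: dict {subject_id: (X, y)}. Returns per-subject test accuracy."""
    results = {}
    for test_subject, (X_test, y_test) in subject_data.items():
        # All remaining subjects form the (labelled) source domain.
        X_train = np.concatenate([X for s, (X, _) in subject_data.items() if s != test_subject])
        y_train = np.concatenate([y for s, (_, y) in subject_data.items() if s != test_subject])
        model = build_dann_model()
        model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=0)
        _, acc = model.evaluate(X_test, y_test, verbose=0)
        results[test_subject] = acc
    return results
```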
To evaluate the model, a set of metrics is used, with accuracy remaining our primary summary statistic. We supplement it with the area under the Receiver Operating Characteristic curve (AUC-ROC), along with precision, recall, and F1-score, offering granular insight into class-specific performance. They are defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
$$F1\text{-}\mathrm{score} = 2\cdot\frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN},$$
where TP, TN, FN, and FP denote true positives, true negatives, false negatives, and false positives, respectively.
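In practice, these metrics can be computed directly from the held-out predictions; a brief sketch using scikit-learn is shown below (the 0.5 decision threshold on the predicted probabilities is an assumption):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def summarise_predictions(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, recall, F1, and AUC for one held-out subject."""
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),  # threshold-independent
    }
```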

4.2.1. Results

For the DREAMER LOSO-CV in Figure 6, CMHFE-DAN shows good overall arousal accuracy, with only 5 subjects below 70% (#1, #3, #14, #17, and #23) and all the other metrics in the same range. The model struggles more with valence accuracy: only 10 subjects are above 70%, and the lowest valence accuracy is 44.44% for subject #2, which is below chance level, as is the case for 2 other subjects. The average arousal and valence accuracies across all subjects are 79.83% and 64.73%, respectively.
Figure 7a, on the left, presents the different metrics for valence and arousal, showing a generally good distribution, except for the arousal precision of subjects #3 and #11, which falls below chance level. Figure 7b, on the right, further emphasises the stability of the model across all metrics.

4.2.2. Discussion

We place particular emphasis on the AUC-ROC metric, as it provides a threshold-independent assessment of model discriminability, which is important given individual differences in emotion expression. Precision and recall also help identify whether the model exhibits systematic biases in emotion classification, while the F1-score offers a balanced view of performance.
The DREAMER valence metrics, while lower and showing greater variability, maintain a reasonable distribution, with most values clustered between 55% and 75%, suggesting that the model captures valence-related patterns despite the difficulty of generalising well to this dimension; the good arousal metrics further support this conclusion. The DEAP results reinforce these findings, with arousal and valence achieving comparable performances of 79.91% and 78.20% accuracy, respectively. Notably, the high recall scores of 92.83% for arousal and 92.12% for valence indicate the model's strength in identifying true positive cases, while the AUC values of 85.95% and 86.35% confirm its discriminative capability across emotional states, indicating that the model maintains a robust balance between sensitivity and specificity, making it suitable for practical emotion recognition systems and demonstrating overall model effectiveness.
The stability of these metrics underscores the model's ability to handle cross-subject variability, which suggests the effectiveness of the DAN module added to our feature extractor in selecting domain-invariant features. While the gradient reversal ensures that the feature distributions of the two domains are made similar, the higher performance for arousal compared to valence on DREAMER may reflect arousal's more distinct physiological signatures and valence being a more challenging emotion category. This suggests that while the model performs well overall, future refinements could focus on improving precision for arousal and reducing variability in the other class predictions, potentially through a more powerful DA technique enabling adaptive, class-specific tuning.

4.3. Comparison with SOTA

Results and Discussion

Variations in evaluation settings across the existing research make comparison challenging, which is why we compare against works that (1) use the SEED, DEAP, or DREAMER datasets; (2) use an end-to-end deep learning model; and (3) report SI and SD analyses. The SOTA works we compare with are all cited in Section 2.
We compared with MultiT-S ConvNet [21], ERTNet [9], EEG-Peri [11], EEG-DML [12], CVCNN [15], AE-LSTM-ATT-CNN [13], aDAPE [14], MSDA-SFE [17], MTL-MSRN [18], DCNN + GAT-MHA [19], Mul-AT-RGCN-DAN [16], and STGATE [20].
Table 5 shows the superiority of our method compared to previous works. For the DEAP SD analysis, CMHFE has better arousal accuracy than the best SOTA work (underlined) by a 0.46% margin, while for valence it outperforms it by 4.69%. Moreover, for the SI analysis on DEAP, the results are outstanding, outclassing the SOTA by far: CMHFE-DAN achieves 6.44% and 4.07% better arousal and valence accuracy, respectively, than the best work. Few works report the AUC metric; for the one that does, our model surpasses it by nearly 20%. As for the DREAMER SI analysis, our valence accuracy falls short of the second best by 0.25% and of the best by 12.71%, while our arousal accuracy outperforms the best by 4.65% and the second best by 16.2%.
The size of the model is another crucial assessment metric; Table 6 contrasts the average accuracy and corresponding model size of the designs that specify their number of trainable parameters. Although CMHFE-DAN has fewer parameters than three of the SOTA models, it still maintains good accuracy. Additionally, while MultiT-S ConvNet has fewer parameters, it falls well short in accuracy.
The proposed architecture demonstrates a significant 20% improvement in cross-subject generalisation compared to non-adaptive baselines [9,19,21], as validated on the DEAP and DREAMER benchmark datasets. This shows the capacity of our DAN module to align the source and target domains, and thus its capability to learn domain-invariant features. Furthermore, the trade-off between the size of the proposed model and its accuracy proves CMHFE-DAN's good balance between capacity and generalisation. By leveraging the DEAP and DREAMER datasets, our framework ensures robust validation across diverse emotional models, acquisition setups, and application scenarios, enhancing the reliability and generalisability of our findings. The inclusion of both SD and SI evaluations provides a comprehensive assessment of model capabilities, from personalised systems tailored to individual users to general-purpose solutions effective across heterogeneous populations.

5. Conclusions and Future Work

To summarise, this work's main contribution is a novel end-to-end emotion recognition model containing a feature extractor based on a 1D CNN and an MHSA module, which extracts fine-grained temporal features while also capturing long-range dependencies. A DAN module was added to the feature extractor to generalise to unseen subjects.
The proposed CMHFE-DAN framework represents a significant step toward robust, generalisable EEG-based emotion recognition by integrating transformer-based feature extraction, DA, and cross-subject validation while maintaining computational efficiency. It further demonstrates improved generalisation across multiple datasets (DEAP, SEED, DREAMER) and addresses key challenges in EEG-ER, including inter-subject variability. It has proved its capacity compared to the state-of-the-art models with a stable, consistent accuracy of ≈80% for SD analysis and ≈79% for SI analysis, further reducing the significant gap in a model’s generalisation from SD to SI in SOTA.
However, there are still many limitations, such as dataset heterogeneity and the trade-off between model complexity and real-time performance. These pave the way for more innovation and future research focusing on enhancing model transparency via attention visualisation or biologically inspired filters, and optimising efficiency for edge deployment [32,33,34], which would need model compression using pruning and quantisation techniques. Another promising future application could be reducing reliance on labelled data through masked EEG pretraining while ensuring privacy via decentralised training paradigms.

Author Contributions

Methodology, Software, Formal Analysis, Resources, Data Curation, Writing—Original Draft Preparation, M.H.; Review and Editing, M.H., A.E. and S.B.A.; Supervision, A.E. and S.B.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. They are available at: DEAP: http://www.eecs.qmul.ac.uk/mmv/datasets/deap/download.html (accessed on 19 May 2025), SEED: https://bcmi.sjtu.edu.cn/home/seed/seed.html (accessed on 19 May 2025), DREAMER: https://zenodo.org/records/546113 (accessed on 19 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Craik, A.; He, Y.; Contreras-Vidal, J.L. Deep Learning for Electroencephalogram (EEG) Classification Tasks: A Review. J. Neural Eng. 2019, 16, 031001. [Google Scholar] [CrossRef] [PubMed]
  2. Mridula, D.T.; Ahmed Ferdaus, A.; Pias, T.S. Exploring Emotions in EEG: Deep Learning Approach with Feature Fusion. In Proceedings of the 2023 26th International Conference on Computer and Information Technology (ICCIT), Cox’s Bazar, Bangladesh, 13–15 December 2023; pp. 1–6. [Google Scholar]
  3. Ari, B.; Siddique, K.; Alçin, Ö.F.; Aslan, M.; Şengür, A.; Mehmood, R.M. Wavelet ELM-AE Based Data Augmentation and Deep Learning for Efficient Emotion Recognition Using EEG Recordings. IEEE Access 2022, 10, 72171–72181. [Google Scholar] [CrossRef]
  4. Fahimi, F.; Zhang, Z.; Goh, W.B.; Lee, T.-S.; Ang, K.K.; Guan, C. Inter-Subject Transfer Learning with an End-to-End Deep Convolutional Neural Network for EEG-Based BCI. J. Neural Eng. 2019, 16, 026007. [Google Scholar] [CrossRef] [PubMed]
  5. Huang, Z.; Ma, Y.; Wang, R.; Li, W.; Dai, Y. A Model for EEG-Based Emotion Recognition: CNN-Bi-LSTM with Attention Mechanism. Electronics 2023, 12, 3188. [Google Scholar] [CrossRef]
  6. Chen, Z.; Yang, R.; Huang, M.; Li, F.; Lu, G.; Wang, Z. EEGProgress: A Fast and Lightweight Progressive Convolution Architecture for EEG Classification. Comput. Biol. Med. 2024, 169, 107901. [Google Scholar] [CrossRef]
  7. Song, T.; Zheng, W.; Liu, S.; Zong, Y.; Cui, Z.; Li, Y. Graph-Embedded Convolutional Neural Network for Image-Based EEG Emotion Recognition. IEEE Trans. Emerg. Top. Comput. 2022, 10, 1399–1413. [Google Scholar] [CrossRef]
  8. Tao, W.; Li, C.; Song, R.; Cheng, J.; Liu, Y.; Wan, F.; Chen, X. EEG-Based Emotion Recognition via Channel-Wise Attention and Self Attention. IEEE Trans. Affect. Comput. 2023, 14, 382–393. [Google Scholar] [CrossRef]
  9. Liu, R.; Chao, Y.; Ma, X.; Sha, X.; Sun, L.; Li, S.; Chang, S. ERTNet: An Interpretable Transformer-Based Framework for EEG Emotion Recognition. Front. Neurosci. 2024, 18, 1320645. [Google Scholar] [CrossRef]
  10. Zheng, W.; Pan, B. A Spatiotemporal Symmetrical Transformer Structure for EEG Emotion Recognition. Biomed. Signal Process. Control. 2024, 87, 105487. [Google Scholar] [CrossRef]
  11. Tang, J.; Ma, Z.; Gan, K.; Zhang, J.; Yin, Z. Hierarchical Multimodal-Fusion of Physiological Signals for Emotion Recognition with Scenario Adaption and Contrastive Alignment. Inf. Fusion 2024, 103, 102129. [Google Scholar] [CrossRef]
  12. Alameer, H.R.A.; Salehpour, P.; Hadi Aghdasi, S.; Feizi-Derakhshi, M.-R. Cross-Subject EEG-Based Emotion Recognition Using Deep Metric Learning and Adversarial Training. IEEE Access 2024, 12, 130241–130252. [Google Scholar] [CrossRef]
  13. Arjun; Rajpoot, A.S.; Panicker, M.R. Subject Independent Emotion Recognition Using EEG Signals Employing Attention Driven Neural Networks. Biomed. Signal Process. Control. 2022, 75, 103547. [Google Scholar] [CrossRef]
  14. Bethge, D.; Hallgarten, P.; Grosse-Puppendahl, T.; Kari, M.; Mikut, R.; Schmidt, A.; Özdenizci, O. Domain-Invariant Representation Learning from EEG with Private Encoders. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 7–13 May 2022; pp. 1236–1240. [Google Scholar]
  15. Chen, J.X.; Zhang, P.W.; Mao, Z.J.; Huang, Y.F.; Jiang, D.M.; Zhang, Y.N. Accurate EEG-Based Emotion Recognition on Combined Features Using Deep Convolutional Neural Networks. IEEE Access 2019, 7, 44317–44328. [Google Scholar] [CrossRef]
  16. Chen, J.; Liu, Y.; Xue, W.; Hu, K.; Lin, W. Multimodal EEG Emotion Recognition Based on the Attention Recurrent Graph Convolutional Network. Information 2022, 13, 550. [Google Scholar] [CrossRef]
  17. Guo, W.; Xu, G.; Wang, Y. Multi-Source Domain Adaptation with Spatio-Temporal Feature Extractor for EEG Emotion Recognition. Biomed. Signal Process. Control. 2023, 84, 104998. [Google Scholar] [CrossRef]
  18. Li, J.; Hua, H.; Xu, Z.; Shu, L.; Xu, X.; Kuang, F.; Wu, S. Cross-Subject EEG Emotion Recognition Combined with Connectivity Features and Meta-Transfer Learning. Comput. Biol. Med. 2022, 145, 105519. [Google Scholar] [CrossRef]
  19. Priyasad, D.; Fernando, T.; Denman, S.; Sridharan, S.; Fookes, C. Affect Recognition from Scalp-EEG Using Channel-Wise Encoder Networks Coupled with Geometric Deep Learning and Multi-Channel Feature Fusion. Knowl. Based Syst. 2022, 250, 109038. [Google Scholar] [CrossRef]
  20. Li, J.; Pan, W.; Huang, H.; Pan, J.; Wang, F. STGATE: Spatial-Temporal Graph Attention Network with a Transformer Encoder for EEG-Based Emotion Recognition. Front. Hum. Neurosci. 2023, 17, 1169949. [Google Scholar] [CrossRef]
  21. Emsawas, T.; Morita, T.; Kimura, T.; Fukui, K.; Numao, M. Multi-Kernel Temporal and Spatial Convolution for EEG-Based Emotion Classification. Sensors 2022, 22, 8250. [Google Scholar] [CrossRef]
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
  23. Ganin, Y.; Lempitsky, V. Unsupervised Domain Adaptation by Backpropagation. arXiv 2015, arXiv:1409.7495. [Google Scholar]
  24. Osumi, K.; Yamashita, T.; Fujiyoshi, H. Domain Adaptation Using a Gradient Reversal Layer with Instance Weighting. In Proceedings of the 2019 16th International Conference on Machine Vision Applications (MVA), Tokyo, Japan, 27–31 May 2019; pp. 1–5. [Google Scholar]
  25. Scherer, K.R. What Are Emotions? And How Can They Be Measured? Soc. Sci. Inf. 2005, 44, 695–729. [Google Scholar] [CrossRef]
  26. Koelstra, S.; Muhl, C.; Soleymani, M.; Jong-Seok, L.; Yazdani, A.; Ebrahimi, T.; Pun, T.; Nijholt, A.; Patras, I. DEAP: A Database for Emotion Analysis; Using Physiological Signals. IEEE Trans. Affect. Comput. 2012, 3, 18–31. [Google Scholar] [CrossRef]
  27. Zheng, W.-L.; Lu, B.-L. Investigating Critical Frequency Bands and Channels for EEG-Based Emotion Recognition with Deep Neural Networks. IEEE Trans. Auton. Ment. Dev. 2015, 7, 162–175. [Google Scholar] [CrossRef]
  28. Duan, R.-N.; Zhu, J.-Y.; Lu, B.-L. Differential Entropy Feature for EEG-Based Emotion Classification. In Proceedings of the 2013 6th International IEEE/EMBS Conference on Neural Engineering (NER), San Diego, CA, USA, 6–8 November 2013; pp. 81–84. [Google Scholar]
  29. Katsigiannis, S.; Ramzan, N. DREAMER: A Database for Emotion Recognition Through EEG and ECG Signals From Wireless Low-Cost Off-the-Shelf Devices. IEEE J. Biomed. Health Inform. 2018, 22, 98–107. [Google Scholar] [CrossRef]
  30. NVIDIA H100 for AI & ML Workloads|Cloud GPU Platform|Paperspace. Available online: https://www.paperspace.com/ (accessed on 19 May 2025).
  31. Song, Y.; Zheng, Q.; Liu, B.; Gao, X. EEG Conformer: Convolutional Transformer for EEG Decoding and Visualization. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 31, 710–719. [Google Scholar] [CrossRef]
  32. Lu, L.; Wu, D.; Chen, C. Visualization Method of Emotional EEG Features Based on Convolutional Neural Networks. In Proceedings of the 2024 IEEE 25th China Conference on System Simulation Technology and its Application (CCSSTA), Tianjin, China, 21–23 July 2024; pp. 188–193. [Google Scholar]
  33. Ludwig, S.; Bakas, S.; Adamos, D.A.; Laskaris, N.; Panagakis, Y.; Zafeiriou, S. EEGminer: Discovering Interpretable Features of Brain Activity with Learnable Filters. J. Neural Eng. 2024, 21, 036010. [Google Scholar] [CrossRef]
  34. Kim, S.; Kim, T.-S.; Lee, W.H. Accelerating 3D Convolutional Neural Network with Channel Bottleneck Module for EEG-Based Emotion Recognition. Sensors 2022, 22, 6813. [Google Scholar] [CrossRef]
Figure 1. The architecture of the CMHFE and CMHFE-DAN.
Figure 2. The mean accuracy for the SEED dataset with (a) the mean accuracy across subjects, and (b) the number of experiments per accuracy range.
Figure 3. The confusion matrices of the first 5 subjects with their 3 sessions: each confusion matrix corresponds to the median of the 5-fold CV for the experiment. The X-axis presents the predicted class, and the Y-axis the true one.
Figure 4. The average (a) arousal accuracy and (b) valence accuracy using 5-fold CV for each subject.
Figure 5. Arousal and valence accuracies for each subject on the DEAP dataset using 5-fold CV.
Figure 6. The metrics accuracy, precision, and f1-score for the DREAMER dataset for (a) valence, and (b) arousal.
Figure 7. The DEAP dataset’s valence and arousal accuracy, precision, f1-score, and AUC (a) for all subjects and (b) the average, adding recall to it.
Table 1. State-of-the-Art (SOTA), with SD, SI, CV, and LOSO standing for subject-dependent, subject-independent, cross-validation, and leave-one-subject-out, respectively.
| Cited Reference/Year | Features | Model | Dataset | Accuracy | Validation Approach |
| --- | --- | --- | --- | --- | --- |
| [12]/2024 | Differential entropy (DE) | Self-organised graph neural network (SOGNN) | SEED, SEED-GER, and DEAP | 91.70 (SEED); 67.15 (DEAP) | LOSO-CV |
| [15]/2019 | Frequency bands/power spectral density | Deep convolutional neural network | DEAP | 100 (all features); 61.76 (normalised raw data) | SD analysis |
| [13]/2022 | Raw EEG | LSTM 1 with channel-attention autoencoder/CNN 2 with attention | DEAP, SEED, and CHB-MIT | 65.9 ± 9.5, 69.5 ± 9.7 (DEAP); 76.7 ± 8.5 (SEED) | LOSO-CV |
| [21]/2022 | Raw EEG | Multi-kernel temporal and spatial convolution network | DEAP, SEED | SD [79.9; 86]; SI [52.05; 54.6] | SD 5-fold CV; SI LOSO-CV |
| [9]/2024 | Raw EEG | Hybrid CNN and transformer | DEAP and SEED-V | SD [77.15; -]; SI [61.75; -] | SD 10-fold CV |
| [17]/2023 | Raw EEG | Multi-source domain adaptation | SEED, SEED-IV, and DEAP | [91.65; 73.92; 69.68] | LOSO-CV |
| [18]/2022 | Frequency bands and connectivity features: PCC, PLV, and PLI | Multi-scale residual network (CNN) | DEAP and SEED | [71.60; 87.05] | LOSO-CV |
| [19]/2022 | Raw EEG | SincNet encoders/MHA 3-graph convolution networks | DEAP and DREAMER | SD [66.6; 88]; SI [79.72; 64.98] | SD 5-fold CV; SI LOSO-CV |
| [16]/2022 | Temporal features and frequency-domain feature DE | Attention recurrent graph convolutional network | DEAP | SD 92.5; SI 73.80 | SD 5-fold CV; SI LOSO-CV |
| [11]/2024 | 8 features | Hierarchical multimodal network | DEAP, SEED-IV and -V | SD [69.39; 56.82; 50.24]; SI [58.19; 42.38; 32.96] | SD 5-fold CV; SI LOSO-CV |
| [20]/2023 | Frequency bands | Transformer learning block and spatial-temporal graph attention | SEED, SEED-IV, and DREAMER | [90.37; 76.43; 77.44] | LOSO-CV |

1 LSTM: long short-term memory. 2 CNN: convolutional neural network. 3 MHA: multi-head attention.
Table 2. Application scenarios.
| | DEAP | DREAMER | SEED |
| --- | --- | --- | --- |
| Subjects (S) | 32 | 23 | 15 |
| EEG Channels (C) | 32 | 14 | 62 |
| Trials (T) | 40 | 18 | 15 (3 sessions) |
| Sampling Rate (Hz) | 128 | 128 | 200 |
| Total Trials (T × S) | 1280 | 414 | 675 |
| Total Signals (T × S × C) | 40,960 | 5796 | 41,805 |
Table 3. Parameter choices.
| Parameters | DEAP SD | DREAMER SD | SEED SD | DEAP SI | DREAMER SI |
| --- | --- | --- | --- | --- | --- |
| Epoch number | 200 | 200 | 100 | 100 | 100 |
| Learning rate | 0.001 | 0.0001 | 0.001 | 0.0001 | 0.0001 |
| Batch size | 32 | 16 | 16 | 32 | 32 |
| Optimiser | Adam | Adam | Adam | Adam | Adam |
| Loss function | BCE 1 | BCE | SCCE 2 | BCE | BCE |
| Random seed | 42 | 42 | 42 | 42 | 42 |
| Lambda (λ) | - | - | - | 0.5 | 0.4 |

1 BCE: binary cross-entropy; 2 SCCE: sparse categorical cross-entropy.
Table 4. The average accuracy for DEAP, DREAMER, and SEED datasets.
| Dataset | Valence Accuracy | Arousal Accuracy |
| --- | --- | --- |
| DEAP | 82.66 ± 8.68 | 82.29 ± 9.3 |
| DREAMER | 80.89 ± 9.8 | 87.12 ± 8 |
| SEED | 82.81 (three-class) | |
Table 5. Comparison with SOTA.
SD DEAP (accuracy)

| Architecture | Valence | Arousal |
| --- | --- | --- |
| MultiT-S ConvNet | 77.6 ± 5.8 | 82.2 ± 6.5 |
| ERTNet | 73.31 | 80.99 |
| EEG-Peri | 74.17 ± 5.34 | 73.34 ± 7.94 |
| CMHFE (Ours) | 82.29 ± 7.03 | 82.66 ± 7.55 |

SI DEAP (accuracy)

| Architecture | Valence | Arousal |
| --- | --- | --- |
| EEG-DML | 67.15 | - |
| CVCNN | 65.51 | 61.76 |
| AE-LSTM-ATT-CNN | 65.9 ± 9.5 | 69.5 ± 9.7 |
| MultiT-S ConvNet | 52.6 ± 8.8 | 51.5 ± 10.4 |
| ERTNet | 59.60 | 63.90 |
| MSDA-SFE | 69.26 | 70.10 |
| MTL-MSRN | 71.29 ± 7.67 | 71.92 ± 6.79 |
| DCNN + GAT-MHA | 69.43 | 69.72 |
| Mul-AT-RGCN-DAN | 74.13 | 73.47 |
| CMHFE-DAN (Ours) | 78.20 | 79.91 |

SI DEAP (AUC)

| Architecture | Valence | Arousal |
| --- | --- | --- |
| CVNN | 65.51 | 61.76 |
| CMHFE-DAN (Ours) | 86.35 | 85.95 |

SI DREAMER (accuracy)

| Architecture | Valence | Arousal |
| --- | --- | --- |
| DCNN + GAT-MHA | 64.98 | 63.71 |
| STGATE | 77.44 | 75.26 |
| CMHFE-DAN (Ours) | 64.73 | 79.91 |

The bold numbers refer to the best result in the section, and the underlined numbers refer to the best result for the SOTA in the section.
Table 6. The number of trainable parameters compared with SOTA and their respective average accuracy for SI-DEAP.
| Model | Size | Accuracy |
| --- | --- | --- |
| DCNN + GAT-MHA | 7 M | 69.57 |
| MTL-MSRN | 1 M | 71.7 |
| AE-LSTM-ATT-CNN | 3 M | 67.7 |
| MultiT-S ConvNet | 30 K | 52.05 |
| CMHFE-DAN (Ours) | 450 K | 79.05 |

The bold numbers refer to the best result.
