1. Introduction
Emotion recognition has become a cornerstone of affective computing and human–computer interaction, with profound implications for mental health monitoring, adaptive user interfaces, and clinical diagnosis [1,2]. Among various physiological signals, electroencephalography (EEG) offers a direct window into the brain’s emotional processing. Its high temporal resolution and objective nature make it an ideal modality for capturing genuine emotional states, as EEG signals reflect neural activity that is difficult to consciously manipulate [3]. Advances in deep learning have enabled models to achieve impressive accuracies, often exceeding 95%, on single, standardized datasets such as DEAP and the SJTU Emotion EEG Dataset (SEED) [4,5].
However, this success in controlled settings masks a critical challenge that severely impedes real-world application: the poor generalization of models across different datasets, a problem known as the cross-corpus generalization challenge [6,7]. When a model trained on one dataset is tested on another, its performance often plummets by 20–30% [8,9]. This performance degradation arises from significant “domain shifts” caused by a combination of factors: inter-subject physiological variability (e.g., brain anatomy, skull thickness) and inter-dataset discrepancies in experimental protocols, emotion elicitation stimuli, recording equipment, and ambient noise [3,10]. This gap between laboratory performance and practical applicability remains the primary hurdle to deploying robust EEG-based emotion recognition systems.
To bridge this domain gap, researchers have explored a variety of computational strategies. Early attempts relied on conventional supervised learning, which failed to generalize due to overfitting to source-specific features. Subsequently, domain adaptation techniques [11], such as Domain-Adversarial Neural Networks (DANN), were introduced to align feature distributions between source and target domains [5]. However, these methods often presuppose a closer proximity between domain distributions than typically exists in diverse EEG datasets, limiting their effectiveness. More recent efforts have incorporated advanced deep learning architectures. Attention mechanisms, particularly multi-head self-attention from Transformers, have shown promise in adaptively focusing on the most informative EEG channels and temporal segments [4,9]. Concurrently, self-supervised and contrastive learning paradigms have emerged as a powerful means to learn subject-invariant representations from unlabeled data, reducing the costly dependence on manual annotation [3,12]. These methods learn robust features by leveraging the inherent structure of EEG signals, for instance, by predicting temporal context or contrasting different augmentations of the same sample.
Despite these advancements, existing methods still exhibit significant limitations when faced with the complexities of cross-corpus EEG data. Each approach, while addressing a piece of the puzzle, fails to provide a comprehensive solution. The primary shortcomings of current state-of-the-art approaches are systematically summarized in Table 1.
Traditional domain adaptation methods like DANN assume that source and target distributions can be aligned through adversarial training. However, this assumption breaks down in cross-corpus EEG scenarios due to: (1) hardware variations—different EEG systems use varying sampling rates, electrode materials, and amplifier characteristics, creating systematic differences in signal properties; (2) protocol differences—SEED uses movie clips while other datasets may use images or music, resulting in fundamentally different neural response patterns; (3) subject heterogeneity—age, cultural background, and individual brain anatomy create irreducible inter-subject variability. When these differences accumulate, forcing alignment between such disparate distributions can cause negative transfer, where the model learns spurious correlations that harm performance. To systematically address these multifaceted challenges identified in
Table 1, our proposed Cross-corpus Attention-based Transfer Enhancement (CATE) framework introduces targeted solutions: (1) against overfitting in supervised learning, CATE employs two-stage training with self-supervised pre-training to learn generalizable features rather than memorizing source-specific patterns; (2) against the distribution assumptions of domain adaptation, CATE’s dual-view strategy learns inherently robust representations without forced alignment, with Noise-Enhanced Representation Modeling (NERM) treating domain variations as noise and Wavelet Transform Representation Modeling (WTRM) extracting domain-invariant multi-scale patterns; (3) against single-view limitations, the dual-view architecture simultaneously captures complementary information, with NERM providing frequency-domain robustness and WTRM capturing temporal-frequency dynamics through wavelet decomposition; and (4) against augmentation dependency, CATE introduces EEG-specific augmentation strategies, NERM’s adaptive noise injection and WTRM’s wavelet-based augmentation, which preserve emotional content while ensuring robust feature learning. This design enables CATE to overcome the limitations of existing approaches.
Several recent architectures have attempted to address these limitations through more sophisticated self-supervised and contrastive learning approaches. Generic contrastive methods adapted for EEG, including SimCLR [13] and Mixup variants [14], employ augmentation strategies unsuited to EEG’s non-stationary nature and operate on single-view representations, missing complementary multi-domain information. GMSS [15] applies SimCLR-based contrastive learning with meiosis augmentation but focuses solely on temporal features, neglecting crucial spatial channel interactions. MV-SSTMA [16] employs masked autoencoding across multiple domains but shows significant performance degradation in self-supervised settings. JCFA [17] uses dual-view contrastive learning in the time-frequency domain and incorporates graph convolution during fine-tuning to model channel relationships; however, it suffers from prohibitively long training times due to the computational complexity of graph operations and lacks explicit mechanisms for handling domain-specific noise while assuming similar cross-domain distributions, an assumption that often fails in practice.
These limitations motivate our CATE framework, which takes a fundamentally different approach to domain adaptation. Unlike existing methods that attempt to force alignment between disparate domains or rely on computationally expensive graph structures, CATE addresses domain shift through two synergistic mechanisms: (1) NERM actively learns to filter out domain-specific artifacts by training on adaptively noise-perturbed data, forcing the model to focus on robust, emotion-relevant patterns rather than superficial recording characteristics; and (2) WTRM captures domain-invariant differential entropy patterns across frequency bands that persist across different recording conditions and protocols. This dual-view architecture, combined with efficient attention-based feature fusion instead of graph convolution, enables CATE to achieve superior cross-corpus performance with significantly reduced computational overhead compared with existing approaches. Concretely, CATE is built on a comprehensive two-stage framework that synergistically combines self-supervised pre-training with supervised fine-tuning. Its core is a dual-view pre-training strategy that learns complementary representations from two perspectives: NERM, which builds robustness to domain-specific noise and artifacts, and WTRM, which captures essential frequency-domain differential entropy characteristics of EEG signals. During fine-tuning, attention mechanisms fuse these complementary features to perform robust emotion classification, and a multi-component loss function, integrating alignment, contrastive, and style diversity objectives, ensures that the learned features are consistent, discriminative, and generalizable.
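To make the dual-view idea concrete, the sketch below illustrates how a NERM-style adaptive noise perturbation, a WTRM-style wavelet perturbation, and a contrastive alignment term could be realized in PyTorch. The function names, the use of the PyWavelets library, the noise scale, and the temperature are illustrative assumptions, not the exact CATE implementation described in the Methods section.

```python
import torch
import pywt  # PyWavelets, assumed available for the wavelet-based view


def nerm_view(x: torch.Tensor, noise_scale: float = 0.1) -> torch.Tensor:
    """Noise-enhanced view: inject noise whose magnitude adapts to each
    sample's own variability, emulating domain-specific recording artifacts."""
    per_sample_std = x.std(dim=-1, keepdim=True)
    return x + noise_scale * per_sample_std * torch.randn_like(x)


def wtrm_view(x: torch.Tensor, wavelet: str = "db4", level: int = 3) -> torch.Tensor:
    """Wavelet view: decompose, lightly attenuate detail coefficients, and
    reconstruct, exposing multi-scale structure while keeping the envelope."""
    coeffs = pywt.wavedec(x.detach().cpu().numpy(), wavelet, level=level, axis=-1)
    coeffs = [coeffs[0]] + [0.9 * c for c in coeffs[1:]]
    recon = pywt.waverec(coeffs, wavelet, axis=-1)
    return torch.as_tensor(recon[..., : x.shape[-1]], dtype=x.dtype)


def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Contrastive term pulling the two views of the same sample together."""
    z1 = torch.nn.functional.normalize(z1, dim=-1)
    z2 = torch.nn.functional.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0), device=z1.device)
    return torch.nn.functional.cross_entropy(logits, targets)
```

In this reading, the two augmented views are passed through their respective encoders, and the contrastive term is one component of the multi-objective loss described above.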
This work makes several key contributions to cross-corpus EEG-based emotion recognition:
We propose a novel dual-view self-supervised pre-training strategy that comprehensively captures complementary signal characteristics—noise robustness and multi-scale frequency patterns—via the NERM and WTRM views, respectively.
We design a multi-component optimization objective that effectively balances feature consistency, discriminability, and diversity, providing a solid foundation for cross-corpus generalization.
We develop an attention-based transfer enhancement mechanism that dynamically fuses multi-scale features while maintaining computational efficiency.
Through extensive experiments on the SEED series datasets, we demonstrate that CATE achieves state-of-the-art performance on six cross-corpus tasks, with accuracy improvements of 5.42% to 15.65% over existing methods, thereby setting a new benchmark for robust, practical EEG-based emotion recognition.
We pioneer the comprehensive evaluation of cross-corpus emotion recognition across multiple SEED datasets (SEED, SEED-IV, and SEED-V), particularly introducing SEED-V with its challenging five-class emotion taxonomy into cross-corpus research for the first time. Through extensive experimentation comparing multiple state-of-the-art models on these diverse emotion classification tasks (3-class, 4-class, and 5-class), we establish new benchmarks for multi-class cross-corpus EEG emotion recognition and demonstrate the scalability of our approach to increasingly complex emotion taxonomies.
3. Experimental Setting
3.1. Datasets
The SEED series of datasets is an emotional electroencephalogram (EEG) database established by the BCMI Laboratory at Shanghai Jiao Tong University. This series includes three main datasets, each targeting classification for a different number of emotional categories.
The SEED dataset (original version) focuses on three-category emotion recognition (negative, neutral, positive). It involved 15 subjects participating in 3 experimental sessions, each watching 15 emotion-inducing film clips. EEG signals were recorded using a 62-channel ESI Neuroscan system, and various EEG features such as differential entropy (DE) and differential asymmetry (DASM) were extracted.
The SEED-IV dataset expands to four-category classification (neutral, sad, fear, and happy emotions). It also involved 15 subjects participating in 3 experimental sessions, but each session included 24 film clips.
The SEED-V dataset further expands to five-category classification (disgust, fear, sad, neutral, happy). Each subject participated in 3 experimental sessions, watching 15 film clips per session, with both EEG and eye-tracking data recorded.
3.2. Pre-Processing
After collecting the raw EEG data, signal preprocessing and feature extraction are performed. To increase the signal-to-noise ratio (SNR), the raw EEG signals are first downsampled to a 200 Hz sampling rate, followed by a band-pass filter from 1 Hz to 75 Hz. Subsequently, feature extraction is carried out. Differential Entropy (DE), as referenced in the literature [18], has the ability to distinguish patterns in different frequency bands. Therefore, we chose DE features as the input data for our model.
For one subject in one session in these three databases, a set of data is given in the form of channels (62) × trials (15 for SEED, 24 for SEED-IV, 15 for SEED-V) × frequency bands (5). We merge the channels and frequency bands, so each sample becomes a 310-dimensional vector (62 channels × 5 bands) and the data take the form trials × 310.
Finally, all data are organized into feature matrices of size 3394 × 310 per subject per session for SEED, 851/832/822 × 310 per subject per session (for the three sessions, respectively) for SEED-IV, and 681/541/601 × 310 per subject per session for SEED-V. The corresponding label vectors have lengths 3394, 851/832/822, and 681/541/601, respectively.
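As a rough illustration of this pipeline, the sketch below computes per-band differential entropy over 1 s windows and flattens 62 channels × 5 bands into a 310-dimensional feature vector. The band edges, window length, filter order, and feature ordering are assumptions based on common SEED practice, not the released feature-extraction code.

```python
import numpy as np
from scipy.signal import butter, filtfilt

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14),
         "beta": (14, 31), "gamma": (31, 50)}  # Hz; assumed band edges
FS = 200  # Hz, after downsampling


def differential_entropy(segment: np.ndarray) -> float:
    """DE of an approximately Gaussian band-limited segment: 0.5*ln(2*pi*e*var)."""
    return 0.5 * np.log(2 * np.pi * np.e * (np.var(segment) + 1e-12))  # guard log(0)


def de_features(eeg: np.ndarray, win_sec: float = 1.0) -> np.ndarray:
    """eeg: (62, T) downsampled, band-passed signal -> (n_windows, 310) DE features."""
    win = int(win_sec * FS)
    n_windows = eeg.shape[1] // win
    n_ch = eeg.shape[0]
    feats = np.zeros((n_windows, n_ch * len(BANDS)))
    for b, (lo, hi) in enumerate(BANDS.values()):
        bb, aa = butter(4, [lo / (FS / 2), hi / (FS / 2)], btype="band")
        filtered = filtfilt(bb, aa, eeg, axis=1)
        for w in range(n_windows):
            seg = filtered[:, w * win:(w + 1) * win]
            for ch in range(n_ch):
                feats[w, b * n_ch + ch] = differential_entropy(seg[ch])
    return feats  # 62 channels x 5 bands = 310 features per window
```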
3.3. Evaluation Metrics
Accuracy is used as a measure to evaluate the EEG classification, which is calculated as follows:

$$\mathrm{ACC} = \frac{A}{\mathit{Num}},$$

where A refers to the number of samples correctly classified by the algorithm, and Num refers to the total number of samples to be classified.
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets and Baselines
We evaluated our proposed model, CATE, on three publicly available EEG emotion recognition datasets: SEED [19], SEED-IV [20], and SEED-V [21]. We conducted a comprehensive comparison against five state-of-the-art (SOTA) methods: DANN [22], GECNN [23], E2STN [24], JCFA [17], and Mixup [14]. For a fair comparison, all baselines were implemented using their officially reported hyperparameters.
4.1.2. Evaluation Protocol
We followed a strict subject-independent, cross-corpus protocol with supervised target-domain fine-tuning applied consistently across all methods. Importantly, all baseline models and our proposed CATE method were evaluated under the identical protocol, where target domain labels were used during fine-tuning. The experiments were conducted across six transfer tasks between the datasets. We use accuracy (ACC) and its standard deviation across all subjects in the test set as the primary evaluation metric. The specific trial splits for fine-tuning and testing in each task were as follows:
SEED ↔ SEED-IV (3-class): For SEED, 9 trials were used for fine-tuning and 6 for testing. For SEED-IV (3-class subset), 12 trials were used for fine-tuning and 6 for testing.
SEED ↔ SEED-V (3-class): For SEED, 9 trials for fine-tuning, 6 for testing. For SEED-V (3-class subset), 6 trials for fine-tuning, 3 for testing.
SEED-IV ↔ SEED-V (4-class): For SEED-IV (4-class), 16 trials for fine-tuning, 8 for testing. For SEED-V (4-class subset), 8 trials for fine-tuning, 4 for testing. (A configuration sketch summarizing these splits follows this list.)
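For readability, the protocol above can be summarized as a small configuration table in code. The sketch is purely illustrative; it assumes, as stated in the protocol, that the listed fine-tuning/testing trial counts refer to the target-domain data of each transfer task, and the `TransferSplit` structure is our own naming rather than part of any released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferSplit:
    source: str
    target: str
    finetune_trials: int  # target-domain trials used for supervised fine-tuning
    test_trials: int      # held-out target-domain trials used for testing


SPLITS = [
    TransferSplit("SEED", "SEED-IV (3-class)", finetune_trials=12, test_trials=6),
    TransferSplit("SEED-IV (3-class)", "SEED", finetune_trials=9, test_trials=6),
    TransferSplit("SEED", "SEED-V (3-class)", finetune_trials=6, test_trials=3),
    TransferSplit("SEED-V (3-class)", "SEED", finetune_trials=9, test_trials=6),
    TransferSplit("SEED-IV (4-class)", "SEED-V (4-class)", finetune_trials=8, test_trials=4),
    TransferSplit("SEED-V (4-class)", "SEED-IV (4-class)", finetune_trials=16, test_trials=8),
]
```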
4.1.3. Implementation Details
Models were implemented in Python 3.8 using PyTorch 1.9.0 and trained on an NVIDIA Tesla A10 GPU. The self-supervised pre-training was conducted for 200 epochs with a batch size of 256. The subsequent fine-tuning was performed for 20 epochs with a batch size of 128.
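A minimal two-stage training skeleton matching the stated epochs and batch sizes is sketched below; the optimizer, learning rate, and loss callables are placeholders, since the text does not specify them.

```python
import torch

PRETRAIN_EPOCHS, PRETRAIN_BATCH_SIZE = 200, 256
FINETUNE_EPOCHS, FINETUNE_BATCH_SIZE = 20, 128


def run_stage(model, loader, epochs, loss_fn, lr=1e-3):
    """Generic training loop used for both stages; loss_fn defines the objective."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            loss = loss_fn(model, batch)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Stage 1: dual-view self-supervised pre-training (labels unused).
# run_stage(model, pretrain_loader, PRETRAIN_EPOCHS, ssl_loss)
# Stage 2: supervised fine-tuning on the target-domain fine-tuning trials.
# run_stage(model, finetune_loader, FINETUNE_EPOCHS, supervised_loss)
```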
4.2. Cross-Corpus Performance Analysis
Table 2 presents the performance comparison across all six cross-corpus tasks. Our proposed CATE model consistently and significantly outperforms all baseline methods, establishing a new state-of-the-art. Notably, CATE demonstrates not only higher mean accuracy but also lower standard deviation in most cases, indicating superior performance and stability.
For instance, in the SEED → SEED-IV and SEED-IV → SEED tasks, CATE achieves accuracies of 80.72% and 80.70%, surpassing the second-best method, JCFA, by large margins of 15.20% and 14.37%, respectively. This substantial improvement highlights the effectiveness of our dual-view pre-training strategy. The Noise-Enhanced Representation Modeling (NERM) view learns to discard domain-specific noise stemming from different acquisition hardware, while the Wavelet Transform Representation Modeling (WTRM) view captures domain-invariant, multi-scale temporal-frequency patterns crucial for emotion recognition. The synergy between these views enables the model to learn robust and generalizable EEG representations.
The performance gains are also evident in the more challenging tasks involving SEED-V, which features a more complex emotion taxonomy. In the 4-class transfer between SEED-IV and SEED-V, CATE achieves a 13.41% improvement over JCFA. This scenario, involving transfer between two distinct 4-class paradigms, underscores CATE’s ability to handle complex domain shifts and partially overlapping emotion categories.
Further analysis of the confusion matrices (Figure 3, Figure 4 and Figure 5) reveals CATE’s discriminative power. For example, in the SEED → SEED-IV task (Figure 3a), CATE significantly reduces the confusion between ‘positive’ and ‘neutral’ emotions compared to JCFA (Figure 3b). Similarly, in the challenging 4-class transfer between SEED-IV and SEED-V (Figure 5c), CATE improves the recognition of ‘fear’ and reduces its confusion with ‘sad’, suggesting its capability to distinguish between nuanced emotional states. This confirms that our model learns more discriminative features, rather than just overfitting to the source domain’s characteristics.
4.3. Ablation Study
To validate the contributions of the core components of CATE, we conducted an extensive ablation study across all six tasks. We designed three variants: (1) w/o NERM, which removes the noise-enhanced view; (2) w/o WTRM, which removes the wavelet-based view; and (3) w/o pretrain, which trains the full model from scratch without self-supervised pre-training.
The results are summarized in Table 3. The full CATE model achieves the best performance across all settings, demonstrating the efficacy of its integrated design.
Removing the NERM component (‘w/o NERM’) causes the most significant performance degradation in most tasks (e.g., a 7.06% drop in the SEED → SEED-IV task). This confirms that modeling and mitigating domain-specific noise is critical for cross-corpus generalization.
Removing the WTRM component (‘w/o WTRM’) also leads to a notable performance decline, highlighting the importance of capturing multi-scale temporal-frequency features, which serve as robust, emotion-related biomarkers.
Training the model from scratch (‘w/o pretrain’) results in inferior performance compared to the full model. This validates that the dual-view self-supervised pre-training provides a powerful initialization, enabling the model to learn more transferable representations before fine-tuning on limited labeled data.
Table 3. Ablation study results: average accuracy and standard deviation (%) of different model configurations across all cross-dataset tasks.
Method | SEED → SEED-IV | SEED-IV → SEED | SEED → SEED-V | SEED-V → SEED | SEED-IV → SEED-V | SEED-V → SEED-IV
---|---|---|---|---|---|---
w/o NERM | 73.66 ± 18.84 | 74.50 ± 17.83 | 52.36 ± 23.52 | 79.85 ± 12.60 | 66.53 ± 20.76 | 70.51 ± 16.47
w/o WTRM | 73.82 ± 19.13 | 78.99 ± 18.32 | 50.37 ± 20.48 | 74.61 ± 14.23 | 64.44 ± 15.53 | 66.62 ± 14.83
w/o pretrain | 77.52 ± 18.69 | 76.25 ± 16.47 | 62.36 ± 19.82 | 79.05 ± 10.23 | 62.40 ± 19.51 | 71.86 ± 15.94
Full (CATE) | 80.72 ± 14.27 | 80.70 ± 10.09 | 68.01 ± 19.33 | 81.65 ± 9.35 | 69.48 ± 16.77 | 75.17 ± 14.01
In summary, the ablation study empirically proves that the superior performance of CATE stems from the powerful synergy between its noise-robust modeling (NERM), multi-scale feature extraction (WTRM), and the effective feature initialization provided by the self-supervised pre-training paradigm.
5. Discussion
The empirical results robustly demonstrate that CATE sets a new state-of-the-art in cross-corpus EEG emotion recognition. Our discussion focuses on interpreting these performance gains by linking them to our model’s core architectural innovations, contextualizing its advantages over existing methods, and outlining the broader implications of this work.
5.1. Interpretation of CATE’s Performance
The consistent and substantial outperformance of CATE across all six transfer tasks validates our central hypothesis: a dual-view, self-supervised pre-training framework can effectively learn domain-invariant representations from heterogeneous EEG data. The most significant accuracy improvements (over 14 percentage points against the next-best method in the SEED ↔ SEED-IV (3-class) tasks) are not merely incremental gains but signify a fundamental advantage in generalization capability.
This advantage stems directly from the synergistic interplay of our two complementary views. The Noise-Enhanced Representation Modeling (NERM) view forces the model to become robust against superficial, domain-specific signal characteristics—akin to learning to recognize a voice despite different background noises or microphone qualities. This is crucial for cross-corpus tasks where recording equipment and environments inevitably vary. Concurrently, the Wavelet Transform Representation Modeling (WTRM) view captures the intrinsic, multi-scale temporal-frequency dynamics that are foundational to emotional expression in EEG signals. The ablation study (Table 3) provides clear evidence for this synergy: removing either view results in a significant performance drop, with the removal of NERM being particularly detrimental, confirming that tackling domain noise is a primary challenge.
Furthermore, the multi-head attention mechanism in the fine-tuning stage proves essential for adapting these robustly pre-trained features to specific classification tasks. As seen in the confusion matrices (e.g., Figure 5), CATE excels at distinguishing between nuanced and often-confused emotional states like ‘sad’ and ‘fear’. This suggests that the attention mechanism effectively learns to focus on the subtle yet discriminative neural signatures captured by the dual-view encoders, a capability that simpler feature extractors may lack.
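As a concrete, though simplified, picture of this fusion step, the sketch below treats the two view embeddings as a two-token sequence and lets multi-head self-attention weight their contributions before classification. The embedding dimension, head count, and the use of `nn.MultiheadAttention` are assumptions for illustration rather than the exact CATE fusion module.

```python
import torch
import torch.nn as nn


class DualViewFusion(nn.Module):
    """Illustrative attention-based fusion of NERM and WTRM embeddings."""

    def __init__(self, dim: int = 128, heads: int = 4, n_classes: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, z_nerm: torch.Tensor, z_wtrm: torch.Tensor) -> torch.Tensor:
        # Stack the two view embeddings as a 2-token sequence and let
        # self-attention weight their contributions per sample.
        tokens = torch.stack([z_nerm, z_wtrm], dim=1)  # (B, 2, dim)
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.head(fused.mean(dim=1))            # (B, n_classes)
```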
5.2. Visualization Analysis of Feature Representations
To gain deeper insights into how each component of CATE contributes to effective emotion representation learning, we conducted t-SNE visualization analysis of the learned features across different model configurations. We randomly selected 200 samples from the SEED dataset and employed t-SNE to project the high-dimensional feature representations into a two-dimensional space.
Figure 6 presents the t-SNE projections of feature representations from four model variants, providing intuitive evidence of each component’s contribution to discriminative feature learning.
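The visualization protocol can be reproduced along the following lines; the perplexity, random seed, and plotting details are arbitrary choices not specified in the text.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_tsne(features: np.ndarray, labels: np.ndarray, title: str) -> None:
    """features: (200, d) embeddings from one model variant; labels: (200,)."""
    proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    for cls, name in enumerate(["negative", "neutral", "positive"]):
        mask = labels == cls
        plt.scatter(proj[mask, 0], proj[mask, 1], s=10, label=name)
    plt.title(title)
    plt.legend()
    plt.show()
```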
From the perspective of an ablation study, the visualization results reveal several critical insights:
First, regarding the relative advantages of the full CATE model: among the four configurations, the complete model demonstrates the best feature organization. While overlaps between the three emotion categories still exist (which is common in EEG-based emotion recognition), the complete model exhibits a more regular feature-point distribution with relatively clearer inter-class boundaries than the other configurations. This indicates that the synergistic effect of the dual-view architecture and the pre-training strategy indeed contributes to learning more discriminative feature representations.
Second, regarding the contribution of the NERM view encoder: When the NERM component is removed, feature distribution significantly deteriorates, manifesting as increased class overlap, particularly between neutral and negative emotions. This distribution degradation validates our quantitative findings that NERM view’s noise-enhanced modeling is crucial for extracting robust features that can distinguish subtle emotional differences in the presence of domain-specific noise. The absence of this component leads to the model’s difficulty in discriminating fine-grained emotional variations.
Third, regarding the effectiveness of the WTRM view encoder: Removing the WTRM component also compromises feature discriminability, though to a slightly lesser extent than removing NERM. The clusters show increased dispersion and partial overlap, particularly in the boundary regions between emotions. This suggests that wavelet-based multi-scale temporal-frequency features provide complementary information essential for capturing the full spectrum of emotion-related neural dynamics.
Most significantly, regarding the importance of the pre-training strategy: Compared to the full model, the model without pre-training exhibits the poorest feature organization, with all three emotion categories heavily intermingled. Despite achieving reasonable classification accuracy through supervised fine-tuning alone, the lack of self-supervised pre-training results in features that fail to capture the underlying structure of emotional states. This powerfully demonstrates that our dual-view pre-training strategy learns fundamental representations that supervised learning alone cannot achieve.
Core findings from the ablation study: The comprehensive ablation analysis reveals distinct contributions from each component across different transfer scenarios:
(1) Component-wise Analysis: Removing either the NERM or WTRM view consistently degrades performance, but their relative importance varies across tasks. NERM shows more significant impact in challenging cross-corpus scenarios (e.g., a 7.06% drop in SEED → SEED-IV), while WTRM’s contribution is more pronounced in tasks involving complex emotion taxonomies (e.g., notable degradation in SEED-V-related transfers).
(2) Pre-training vs. Architecture: The comparison between component removal and pre-training ablation reveals an interesting pattern. While removing individual components causes consistent performance drops, the impact of removing pre-training varies significantly across tasks. In some cases (e.g., SEED → SEED-IV), the architectural components (NERM/WTRM) contribute more than pre-training, while in others (e.g., SEED-IV → SEED-V), pre-training shows greater importance. This suggests that the value of self-supervised pre-training is task-dependent and complements rather than replaces good architectural design.
(3) Synergistic Integration: The full model consistently outperforms all ablated versions, confirming that the optimal performance emerges from the synergistic combination of dual-view architecture and self-supervised pre-training, rather than from any single component dominance.
The t-SNE analysis provides compelling visual evidence supporting our architectural design choices. The synergistic combination of NERM and WTRM views, enhanced through self-supervised pre-training, creates a feature space where emotions are naturally organized along meaningful dimensions. This inherent structure in the learned representations explains CATE’s superior cross-corpus generalization capability—by learning features that capture the essential characteristics of emotional states rather than dataset-specific patterns, the model can effectively transfer knowledge across different experimental paradigms and recording conditions.
5.3. Comparison with Previous Cross-Corpus Methods
5.3.1. Comparison with Traditional Domain Adaptation Methods
DANN (2016), as a classical domain adversarial method, attempts to align feature distributions between source and target domains through adversarial training. However, our experimental results show that DANN performs poorly across all tasks (e.g., only 45.20% on SEED → SEED-IV), which is 35.52 percentage points lower than CATE. This substantial gap exposes DANN’s fundamental limitation: its assumption that source and target distributions can be directly matched is brittle for highly disparate EEG domains. In contrast, CATE does not force explicit alignment but rather learns core representations that are inherently robust to domain shifts, proving to be a more effective strategy.
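To make the contrast concrete, the sketch below shows the gradient-reversal mechanism at the heart of DANN-style adversarial alignment, the very step that presumes the two distributions can be matched. It follows the standard formulation rather than any particular released implementation.

```python
import torch
from torch.autograd import Function


class GradReverse(Function):
    """Identity in the forward pass; reversed, scaled gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversing the gradient pushes the feature extractor to make source and
        # target features indistinguishable to the domain classifier.
        return -ctx.lambd * grad_output, None


def domain_adversarial_loss(features, domain_labels, domain_head, lambd: float = 1.0):
    reversed_feats = GradReverse.apply(features, lambd)
    logits = domain_head(reversed_feats)
    return torch.nn.functional.cross_entropy(logits, domain_labels)
```

When the source and target distributions differ as strongly as in cross-corpus EEG, minimizing this adversarial term can induce the negative transfer discussed in Section 1, which is precisely what CATE avoids by not forcing alignment.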
GECNN (2021) introduced graph-embedded convolution to model spatial relationships between EEG channels. While achieving 57.25% accuracy on SEED → SEED-IV, an improvement over DANN, it still falls 23.47 percentage points below CATE. GECNN’s primary limitation is its reliance on fixed spatial patterns that may not generalize across different recording setups and protocols. Without explicit mechanisms for handling domain-specific artifacts or capturing multi-scale spectral dynamics, GECNN struggles with the inherent variability in cross-corpus scenarios. Our dual-view architecture addresses these gaps through complementary noise-resilient (NERM) and wavelet-based spectral (WTRM) representations, achieving superior domain adaptation.
5.3.2. Comparison with Recent Self-Supervised Methods
E2STN (2025) employs an emotion style transfer network, achieving 61.24% on SEED → SEED-IV. Despite attempting to mitigate domain differences through style transfer, it still lags behind CATE by 19.48 percentage points. E2STN’s main issue is that its style transfer mechanism assumes emotional “style” can be cleanly separated from content, but emotional expression in EEG signals is often tightly coupled with multiple aspects of the signal.
JCFA (2024) represents the closest competitor to CATE, employing dual-view contrastive learning with graph convolution. However, even this state-of-the-art method only achieves 65.52% on SEED → SEED-IV, 15.20 percentage points lower than CATE. More importantly, JCFA’s training time is prohibitively long due to the computational complexity of graph operations. Another critical limitation of JCFA is its lack of explicit noise-handling mechanisms, whereas our NERM view is specifically designed to filter out domain-specific artifacts.
5.4. Limitations and Future Work
Despite its strong performance, this study has limitations that open avenues for future research. First, our validation was conducted on the SEED family of datasets. While these are standard benchmarks, testing CATE’s generalization capabilities on a wider array of datasets with even greater diversity (e.g., DEAP, AMIGOS) is a necessary next step. Second, the current CATE model operates in an offline setting. For many real-world applications, such as real-time adaptive interfaces or clinical monitoring, online adaptation capabilities are required.
Future work will, therefore, focus on three key directions: (1) Model Lightweighting: We will explore model compression and knowledge distillation techniques to create a more efficient version of CATE suitable for deployment on edge devices. (2) Broader-Scale Validation: We will extend our evaluation to more diverse and larger-scale EEG datasets to further probe the limits of CATE’s generalization. (3) Online and Multimodal Extension: We plan to develop an online adaptation mechanism for CATE and investigate the fusion of EEG data with other modalities (e.g., eye-tracking, ECG) to build even more robust and comprehensive emotion recognition systems.
6. Conclusions
In this paper, we introduced CATE, a novel two-stage deep learning framework designed to overcome the critical challenge of cross-corpus generalization in EEG-based emotion recognition. By synergistically combining a dual-view self-supervised pre-training strategy with an attention-based fine-tuning mechanism, CATE effectively learns robust, transferable representations from complex and noisy EEG signals. The core innovation lies in its complementary Noise-Enhanced (NERM) and Wavelet Transform (WTRM) modeling views, which equip the model to handle domain-specific noise while capturing essential temporal-frequency patterns of emotional states.
Extensive experiments across six challenging transfer tasks using the SEED, SEED-IV, and SEED-V datasets demonstrate the definitive superiority of our approach. CATE consistently set a new state-of-the-art, outperforming existing methods by substantial margins of up to 15.65 percentage points. These results, supported by a thorough ablation study, validate the architectural choices and confirm the efficacy of each component within the CATE framework. By significantly bridging the gap between single-dataset performance and real-world applicability, this work represents a meaningful advancement towards deploying practical and reliable EEG-based affective computing systems. CATE provides a robust and promising paradigm for future research in cross-domain physiological signal analysis.