Figure 1.
Comparison of emotion classification task formulations in the Valence–Arousal space. (a) Binary valence only. (b) Binary arousal only. (c) Binary V/A as two independent tasks. (d) Four-quadrant joint classification adopted in this work, distinguishing HVHA (excitement), HVLA (contentment), LVHA (stress/anger), and LVLA (sadness). The four-quadrant task has an entropy of $H = \log_2 4 = 2$ bits, twice that of any binary formulation ($H = \log_2 2 = 1$ bit).
Figure 2.
The complete data processing and feature fusion pipeline. EEG, autonomic, and cross-modal features are extracted in parallel from the raw physiological signals and combined into a 150-dimensional feature vector.
Figure 3.
Detailed architecture of the hybrid CNN–LSTM–Transformer model. The model processes the input feature vector to capture local, temporal, and global patterns before classifying it into one of the four emotion quadrants.
Figure 4.
Dynamic physiological signatures of emotion. (a) EEG alpha power fluctuations. (b) Arousal indicators, including GSR tonic level and SCR count. (c) Arousal-related respiratory patterns. (d) Peripheral temperature shows a slight correlation with valence. (e) Heatmap illustrating cross-modal physiological system dynamics.
Figure 5.
Distribution of the 40 experimental trials in the 2D valence–arousal emotion space. The ground-truth labels form four distinct but partially overlapping clusters.
Figure 6.
Histograms of physiological features by emotion quadrant. (a) EEG alpha power distributions differ across the four emotion quadrants. (b) GSR tonic level distinguishes high-arousal from low-arousal states.
Figure 7.
Ranking of cross-modal feature importance for emotion recognition.
Figure 8.
The 15 features that differ most between HVHA and LVLA. The heatmap shows uncorrected p-values from three statistical tests; dark red marks low (significant) p-values.
Figure 9.
Comparison of significant features identified by different statistical tests (uncorrected $p < 0.05$).
Figure 10.
Distribution of features by effect size category and statistical significance.
Figure 11.
Violin plots showing the distribution of Cohen’s d effect sizes across all physiological modalities for the HVHA vs. LVLA comparison.
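Figures 10 to 12 summarise features by Cohen’s d. The captions do not say which variant of d was used. The sketch below assumes the common pooled-standard-deviation form. It computes d for one feature between HVHA and LVLA trials, then averages |d| per modality to produce a Figure 12 style ranking. The helper names, feature names, and toy values are hypothetical and are not data from this study.

```python
import math
import statistics
from collections import defaultdict


def cohens_d(group_a, group_b):
    """Cohen's d using a pooled standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


def mean_abs_d_by_modality(features):
    """features: {name: (modality, hvha_values, lvla_values)} -> {modality: mean |d|}."""
    per_modality = defaultdict(list)
    for modality, hvha, lvla in features.values():
        per_modality[modality].append(abs(cohens_d(hvha, lvla)))
    return {m: statistics.mean(ds) for m, ds in per_modality.items()}


# Hypothetical toy values for illustration only.
toy_features = {
    "eeg_alpha_T8": ("EEG", [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]),
    "gsr_tonic": ("GSR", [5.0, 6.0, 7.0], [3.0, 4.0, 5.0]),
}
ranking = sorted(mean_abs_d_by_modality(toy_features).items(), key=lambda kv: -kv[1])
print(ranking)
```

Ranking by the mean of |d| rather than signed d keeps features that are higher in LVLA from cancelling features that are higher in HVHA. The captions do not confirm whether the authors made this choice.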
Figure 12.
Modality importance ranking based on mean Cohen’s d effect sizes.
Figure 13.
Analysis of data quality and consistency. (a) EEG power data exhibited no extreme outliers. (b) Two trials were flagged as outliers because of extreme GSR responses. (c) GSR activity aligned consistently with self-reported arousal across trials.
Figure 14.
Pearson correlation matrix between cross-modal features and emotion labels. The heatmap shows the strength and direction of the linear correlation for every pair of variables, revealing how physiological systems couple to the emotional dimensions.
Figure 15.
Principal component analysis (PCA) of the feature space. (a) Emotion quadrant projection highlighting class overlap. (b,c) Valence- and arousal-colored projections showing that PC1 and PC2 capture the primary emotional dimensions. (d) Explained variance plot illustrating the retained data dimensionality. (e) Dominant physiological feature loadings contributing to the principal components.
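Table 3 reports two quantities taken from this PCA: the share of variance explained by the top two components, and the number of components needed to reach 95%. The sketch below shows how these are typically computed from a trial-by-feature matrix. It assumes plain PCA via SVD on z-scored features, since the exact preprocessing is not stated here. The random matrix is a placeholder, not study data, and the function names are hypothetical.

```python
import numpy as np


def explained_variance_ratio(X):
    """PCA via SVD on a (trials x features) matrix, z-scored per feature."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # constant features stay at zero
    Z = (X - X.mean(axis=0)) / std
    singular_values = np.linalg.svd(Z, compute_uv=False)
    var = singular_values ** 2
    return var / var.sum()


def n_components_for(ratio, target=0.95):
    """Smallest k whose cumulative explained variance reaches `target`."""
    cumulative = np.cumsum(ratio)
    return min(int(np.searchsorted(cumulative, target)) + 1, len(ratio))


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 150))   # placeholder: 40 trials x 150 features
ratio = explained_variance_ratio(X)
print(f"top-2 share: {ratio[:2].sum():.3f}, components for 95%: {n_components_for(ratio)}")
```

A low top-2 share together with a large 95% component count is the pattern Table 3 reads as "information is broadly distributed".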
Figure 16.
Modality importance and synergy analysis. (a) Pie chart illustrating the percentage contribution of key physiological modalities within the feature set. (b) Heatmap representing inter-modality synergy and interaction strength.
Figure 17.
Training and validation performance of the proposed hybrid CNN–LSTM–Transformer model. (a) Accuracy over training epochs. (b) Loss convergence across epochs. (c) Validation F1 score progression, demonstrating rapid and stable learning behavior.
Figure 18.
(a) Model complexity measured by the number of trainable parameters. (b) Training time per epoch. (c) Inference speed in samples per second. (d) Accuracy versus model complexity trade-off.
Figure 19.
Confusion matrices for the five evaluated models. (a) Baseline CNN model. (b) CNN + LSTM hybrid. (c) CNN + Transformer hybrid. (d) LSTM + Transformer hybrid. (e) Full hybrid CNN–LSTM–Transformer model, which shows the strongest diagonal (correct-classification) pattern.
Table 1.
Summary of the three-phase experimental and modelling procedure.
| Phase | Activity | Details |
|---|---|---|
| Phase 1 | Signal Acquisition | Multimodal physiological signals (EEG, GSR, BVP, respiration, temperature, EMG) recorded continuously during stimulus presentation using a BioSemi ActiveTwo system at 512 Hz, subsequently downsampled to 128 Hz. |
| Phase 2 | Preprocessing and Feature Extraction | Raw signals underwent Butterworth filtering, artefact removal, signal decomposition, and feature engineering, yielding a 150-dimensional normalised feature vector per trial. |
| Phase 3 | Model Training and Evaluation | The hybrid CNN–LSTM–Transformer model was trained using an 80/20 stratified split with five-fold cross-validation and evaluated against ablated variants and on the external DREAMER dataset. |
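Phase 3 combines an 80/20 stratified split with five-fold cross-validation. The table does not say how the folds were drawn. The standard-library sketch below assumes the following: 20% of each class is held out as a test set, and the remaining 80% is dealt round-robin into five folds so that each fold keeps roughly the class proportions. The function names and the 10-trials-per-quadrant labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx), holding out ~test_frac of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test += idx[:n_test]
        train += idx[n_test:]
    return sorted(train), sorted(test)


def stratified_folds(indices, labels, k=5, seed=0):
    """Deal indices class by class into k folds so proportions are preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    pos = 0                              # global counter keeps fold sizes balanced
    for idx in by_class.values():
        rng.shuffle(idx)
        for i in idx:
            folds[pos % k].append(i)
            pos += 1
    return [sorted(f) for f in folds]


# Hypothetical labels: 40 trials, 10 per quadrant.
labels = ["HVHA", "HVLA", "LVHA", "LVLA"] * 10
train_idx, test_idx = stratified_split(labels)
folds = stratified_folds(train_idx, labels)
print(len(train_idx), len(test_idx), [len(f) for f in folds])
```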
Table 2.
Phase distortion comparison between single-pass IIR and zero-phase filtfilt implementations for the Butterworth filter.
| Filter Method | Mean Delay (ms) | Max Delay (ms) | Attenuation at Cutoff (dB) | Nyquist Atten. (dB) | Phase Distortion |
|---|---|---|---|---|---|
| lfilter (single-pass IIR) | 10.1 | 32.1 | 3.0 | | Present |
| filtfilt (forward-backward) | 0.0 | 0.0 | 6.0 | | Strictly zero |
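The 3.0 dB vs. 6.0 dB and nonzero vs. zero delay entries follow from how forward-backward filtering works. filtfilt runs the filter forward and then backward. Its effective magnitude response is therefore |H|², which doubles the attenuation at the cutoff, while the phase shifts of the two passes cancel. The SciPy sketch below checks this on a Butterworth band-pass filter. The 4–45 Hz band and fourth order are assumptions, since the tables do not give them, so the printed delay will not reproduce the values in Table 2.

```python
import numpy as np
from scipy.signal import butter, filtfilt, freqz, group_delay

FS = 128.0             # Hz, sampling rate after downsampling (Table 1)
LOW, HIGH = 4.0, 45.0  # Hz, assumed EEG pass band (not given in the tables)
ORDER = 4              # assumed Butterworth order

b, a = butter(ORDER, [LOW, HIGH], btype="bandpass", fs=FS)

# Attenuation at the lower cutoff: one pass applies |H|, filtfilt applies |H|^2.
_, h = freqz(b, a, worN=np.array([LOW]), fs=FS)
single_pass_db = -20 * np.log10(np.abs(h[0]))
filtfilt_db = -20 * np.log10(np.abs(h[0]) ** 2)

# Group delay (in samples) of the single causal pass, inside the pass band.
_, gd = group_delay((b, a), w=np.linspace(LOW + 1, HIGH - 1, 64), fs=FS)
mean_delay_ms = 1000 * gd.mean() / FS

# Impulse test: the forward-backward response is the autocorrelation of the
# impulse response, so its largest magnitude sits exactly at the impulse.
x = np.zeros(1024)
x[512] = 1.0
peak_filtfilt = int(np.argmax(np.abs(filtfilt(b, a, x))))

print(f"cutoff attenuation: {single_pass_db:.2f} dB vs {filtfilt_db:.2f} dB")
print(f"single-pass mean group delay: {mean_delay_ms:.1f} ms")
print(f"filtfilt impulse peak at sample {peak_filtfilt} (impulse at 512)")
```

filtfilt is non-causal because it needs the whole signal before it can produce output. That is acceptable for offline trial-level feature extraction but not for streaming use.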
Table 3.
Summary of key findings from the physiological feature analysis and their architectural implications for the proposed hybrid model.
| No. | Finding | Evidence | Architectural Implication |
|---|---|---|---|
| 1 | No linear separability | Projection onto the top two principal components explains only 11.1% of the variance; all four quadrants overlap substantially with no clear boundaries | Non-linear modeling is essential; linear classifiers are fundamentally insufficient for this feature space |
| 2 | Emotion information is broadly distributed | Reaching 95% of the variance requires 38 principal components; discriminative features span EEG (d up to 1.42), respiration (), and GSR tonic level | No single modality suffices; a high-capacity multimodal architecture is required |
| 3 | EEG leads, peripherals complement | T8 channel: mean (3 sig. features); Fp2: ; respiration: ; GSR: | All modalities must be retained; peripheral signals add discriminative power for valence-related states that EEG alone cannot capture |
| 4 | Cross-modal interactions are predictive | EEG–EMG interaction score: 0.119; BVP–respiration ratio: 0.103; synergy (autonomic–GSR), (autonomic–AF4) | Relationships between signals carry emotion-relevant information, directly motivating the Transformer-based cross-modal attention layer |
| 5 | Feature selection carries a false-discovery risk | At $p < 0.05$ (uncorrected), ∼65 of 1307 features may be false positives; FDR correction retains 389 features and Bonferroni correction only 21 | Results are exploratory; confirmatory studies should apply FDR correction (controlling the false discovery rate) or Bonferroni correction (controlling the family-wise error rate) |
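Finding 5 contrasts uncorrected testing with FDR and Bonferroni control. The sketch below writes the three decision rules in plain Python. It assumes the FDR procedure was Benjamini–Hochberg, which the table does not name, and the p-values are made up for illustration. At α = 0.05, the expected number of false positives among m = 1307 features is 0.05 × 1307 ≈ 65 if every feature were truly null. This matches the ∼65 quoted in the table.

```python
def uncorrected(pvals, alpha=0.05):
    """Reject each hypothesis whose p-value falls below alpha."""
    return [p < alpha for p in pvals]


def bonferroni(pvals, alpha=0.05):
    """Controls the family-wise error rate: test each p against alpha / m."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Controls the false discovery rate (Benjamini-Hochberg step-up)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Largest rank k (1-based) with p_(k) <= k * alpha / m; reject ranks 1..k.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


expected_false_positives = 0.05 * 1307   # ~65 if every feature were null
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]  # made up
print(sum(uncorrected(toy_p)), sum(benjamini_hochberg(toy_p)), sum(bonferroni(toy_p)))
```

The ordering uncorrected ≥ FDR ≥ Bonferroni in the number of retained features mirrors the 389 vs. 21 contrast in the table. Bonferroni is the strictest because it bounds the chance of even one false positive.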
Table 4.
Ablation study results under intra-subject evaluation on the DEAP dataset.
| Model Variant | Accuracy (%) | F1-Score | LVLA Recall (%) |
|---|---|---|---|
| CNN-only | 81.3 | 0.79 | 68.4 |
| CNN + LSTM | 87.9 | 0.85 | 82.1 |
| CNN + Transformer | 88.4 | 0.86 | 84.7 |
| LSTM + Transformer | 86.2 | 0.84 | 79.3 |
| Full Hybrid (Our Model) | 91.2 | 0.88 | 91.3 |
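Table 4 reports accuracy, F1-score, and LVLA recall, and all three can be read off a 4 × 4 confusion matrix such as those in Figure 19. The tables do not say whether F1 is macro- or weighted-averaged, so the standard-library sketch below assumes macro-averaging. In the matrix, rows are true quadrants and columns are predicted quadrants. The matrix itself is hypothetical and is not one of the matrices in Figure 19.

```python
QUADRANTS = ["HVHA", "HVLA", "LVHA", "LVLA"]


def metrics_from_confusion(cm):
    """cm[i][j] = number of trials of true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    recall, f1 = [], []
    for c in range(n):
        tp = cm[c][c]
        actual = sum(cm[c])                          # row sum: true members
        predicted = sum(cm[r][c] for r in range(n))  # column sum: predictions
        rec = tp / actual if actual else 0.0
        prec = tp / predicted if predicted else 0.0
        recall.append(rec)
        f1.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return accuracy, sum(f1) / n, dict(zip(QUADRANTS, recall))


# Hypothetical confusion matrix for illustration only.
cm = [
    [8, 1, 1, 0],
    [1, 7, 0, 2],
    [0, 1, 9, 0],
    [0, 1, 0, 9],
]
acc, macro_f1, recall = metrics_from_confusion(cm)
print(f"accuracy={acc:.3f} macro-F1={macro_f1:.3f} LVLA recall={recall['LVLA']:.2f}")
```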
Table 5.
Cross-dataset generalization results (DEAP → DREAMER).
| Training → Testing | Accuracy (%) | F1-Score | LVLA Recall (%) |
|---|---|---|---|
| DEAP → DEAP | 91.2 | 0.88 | 91.3 |
| DEAP → DREAMER | 76.4 | 0.73 | 81.2 |
Table 6.
Feature alignment strategies applied for DEAP → DREAMER cross-dataset evaluation.
| Alignment Aspect | Strategy Applied | Remarks |
|---|---|---|
| EEG Channel Mapping | 14 DREAMER channels (AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, AF4) matched to nearest DEAP equivalents via 10–20 system coordinates | 18 unmatched DEAP channels imputed with subject-wise channel means from DREAMER EEG; feature vector dimensionality preserved |
| Absent Peripheral Modalities (BVP, Respiration, Temperature) | Features derived from these modalities set to zero in the DREAMER feature vector | Conservative strategy avoids synthetic signal generation; partially accounts for the 91.2% → 76.4% accuracy drop, as the model was trained on non-zero values |
| GSR Harmonisation | Same tonic/phasic decomposition pipeline (cvxEDA, Section 3.4.3) applied uniformly to both datasets | Ensures consistency in skin conductance feature extraction across recording hardware |
| ECG as BVP Surrogate | DREAMER’s ECG processed to extract HRV features: RMSSD, SDNN, mean RR interval | Serves as a partial substitute for BVP-derived cardiac features absent in DREAMER |
| Feature Normalisation | All features z-score normalised using DREAMER’s own per-subject statistics prior to inference | Mitigates distributional shift from hardware and recording condition differences |
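The ECG-as-BVP-surrogate row lists three time-domain HRV features derived from DREAMER's ECG. The sketch below computes them from the RR-interval series, assuming R-peak times have already been detected (peak detection itself is not shown). SDNN uses the sample standard deviation, which is one common convention. The R-peak times are hypothetical.

```python
import math
import statistics


def hrv_features(r_peak_times_s):
    """Time-domain HRV from R-peak times in seconds; outputs are in ms."""
    rr = [1000 * (b - a) for a, b in zip(r_peak_times_s, r_peak_times_s[1:])]
    successive = [b - a for a, b in zip(rr, rr[1:])]
    return {
        "mean_rr_ms": statistics.mean(rr),
        "sdnn_ms": statistics.stdev(rr),   # sample SD of all RR intervals
        "rmssd_ms": math.sqrt(statistics.mean(d * d for d in successive)),
    }


# Hypothetical R-peak times in seconds.
peaks = [0.0, 0.8, 1.6, 2.5, 3.3, 4.2]
print(hrv_features(peaks))
```

RMSSD reflects beat-to-beat (short-term) variability, while SDNN reflects overall variability across the window.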
Table 7.
Emotion classification task formulations in the Valence–Arousal space.
| Formulation | What It Classifies | Classes | Entropy |
|---|---|---|---|
| Binary V | Positive vs. negative valence; arousal ignored | 2 | $\log_2 2 = 1$ bit |
| Binary A | High vs. low arousal; valence ignored | 2 | $\log_2 2 = 1$ bit |
| Binary V/A | Each dimension independently; conflates, e.g., HVHA (excitement) with LVHA (stress) | 2 per task | 1 bit each |
| Proposed Method | Joint HVHA/HVLA/LVHA/LVLA as a single problem | 4 | $\log_2 4 = 2$ bits |
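The entropy column assumes equiprobable classes, in which case $H = \log_2 K$ bits for $K$ classes. The short check below computes Shannon entropy directly. The skewed distribution is a hypothetical example showing that class imbalance lowers H below the uniform value.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete label distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


binary = entropy_bits([0.5, 0.5])             # Binary V or Binary A
four_quadrant = entropy_bits([0.25] * 4)      # HVHA / HVLA / LVHA / LVLA
skewed = entropy_bits([0.4, 0.3, 0.2, 0.1])   # hypothetical imbalanced quadrants
print(binary, four_quadrant, round(skewed, 3))
```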
Table 8.
Comparison with recent state-of-the-art emotion recognition methods (2023–2025).
| Reference | Model | Dataset | Task | Performance |
|---|---|---|---|---|
| [7] | BiLSTM–Transformer + Autoencoder | DEAP | Binary V/A | 94.0% |
| [8] | Attention Spatio-Temporal Fusion | DEAP | Binary V/A | 98.9% |
| [9] | 3D-CNN + CRNN (EEG + Video) | DEAP | Binary V/A | 91.8% |
| [10] | Parallel Hybrid Deep Model | DEAP | Binary V/A | 95.2% |
| Proposed Method | CNN–LSTM–Transformer | DEAP | Four-Quadrant V–A | 91.2% |
Table 9.
Qualitative comparative analysis of EEG-based emotion recognition methods.
| Ref. | Method | Task | Protocol | Performance | Key Comparison |
|---|---|---|---|---|---|
| [8] | Hybrid attention spatio-temporal fusion; EEG (32 ch) | Binary V/A | 5-fold CV | 98.57% V, 98.93% A | High binary accuracy but limited scalability. Our model captures LVLA subtleties with ∼20% fewer parameters. |
| [14] | CNN–LSTM; EEG + GSR | Binary A | k-fold CV | 97.39% A | Ignores valence dimension. Transformer layers enable full quadrant mapping, improving LVLA recall by 17%. |
| [9] | 3D-CNN + CRNN; EEG + Video | Binary V/A | LOSO | 91.75% A, 91.86% V | Requires facial video. Our physiology-only model improves privacy and wearable applicability. |
| [7] | BiLSTM–Transformer + AE; EEG | Binary V/A | 6-subject CV | 94% | Limited cohort (n = 6). Our evaluation spans 32 subjects with multimodal fusion. |
| [12] | LSTM + GWO; EEG (22 ch) | Binary V/A | Intra-CV | 92.5% V, 81.25% A | Reduced channels but ignores peripherals. PCA reveals cross-modal correlations (), aiding four-quadrant classification. |
| [13] | InceptionResNetV2; EEG (9 ch) | Binary V | LOSO | 72.81% | Highlights cross-subject difficulty. CAR preprocessing improves confidence to 89–93%. |
| [2] | SVM/KNN Ensemble; PPG/GSR | Binary V/A | 80/20 split | 66% | Peripheral-only signals miss neural cues. Hybrid fusion improves F1 by 0.15. |
| [15] | FCDGELM; EEG | Binary V/A | Intra-CV | 69.67% | Unimodal limitation. Temporal-global modeling improves SNR and accuracy by ∼12%. |
| [16] | LSTM (raw EEG) | Binary V/A | Intra-subject | 85.65% A | Lacks spatial modeling. CNN + Transformer improves complex-state accuracy by ∼5.6%. |
| [17] | Multi-column CNN; EEG | Binary V/A | Subject-based | 90.01% V | High sampling overhead. The hybrid model matches performance with 40% fewer samples. |
| [10] | AlexNet–DenseNet + PCA + SVM | Binary/Tri | Train-test | 95.54% V | Multi-stage pipeline. End-to-end CNN–LSTM–Transformer achieves 91.2% four-quadrant accuracy in unified training. |