Article

Stress Detection from Multimodal Physiological Data Using Hybrid Deep Learning Models

1
Amrita School of Artificial Intelligence, Amrita Vishwa Vidyapeetham, Coimbatore 641112, Tamil Nadu, India
2
Department of Human Sciences, Link Campus University, Via del Casale di San Pio V, 44, 00165 Rome, Italy
3
Department of Psychology, Sapienza University of Rome, 00185 Rome, Italy
4
Department of Theoretical and Applied Sciences, eCampus University, Via Isimbardi 10, 22060 Novedrate, Italy
5
Department of Computer, Control, and Management Engineering, Sapienza University of Rome, 00185 Rome, Italy
*
Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2026, 10(6), 179; https://doi.org/10.3390/bdcc10060179
Submission received: 19 March 2026 / Revised: 30 April 2026 / Accepted: 15 May 2026 / Published: 1 June 2026

Abstract

Chronic stress and depression-adjacent emotional states affect over 970 million people worldwide, yet their continuous, objective detection from physiological signals remains unsolved, particularly when cues are subtle and conventional approaches fail. Single modality classifiers capture only part of the picture, and binary valence/arousal formulations collapse emotionally distinct states into the same category, leaving conditions like sadness and depression characterised by Low-Valence Low-Arousal (LVLA) responses without reliable detection. A hybrid deep learning model combining Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and Transformer encoders was developed to jointly classify four emotional quadrants from multimodal physiological data in the Database for Emotion Analysis using Physiological signals (DEAP) dataset. Electroencephalography (EEG), Galvanic Skin Response (GSR), Blood Volume Pulse (BVP), and respiration signals from 32 participants were preprocessed using fourth-order Butterworth filtering, trained with Adam optimisation, and evaluated through an 80/20 stratified split with five-fold cross-validation. The system achieved 91.2% four-quadrant valence–arousal accuracy (95% CI: 89.1–93.3%), with LVLA recall reaching 91.3%, outperforming all partial-hybrid variants. These findings demonstrate that hierarchical, attention-based fusion of physiological modalities can reliably distinguish stress from depression-adjacent states, offering a practical pathway toward continuous, non-invasive mental health monitoring on wearable platforms.

1. Introduction

Stress is a psychophysiological response that arises when an individual perceives demands as exceeding their available coping resources. According to the World Health Organization, over 970 million individuals worldwide live with a mental health disorder, with stress-related conditions constituting the leading cause of disability-adjusted life years globally. Within Russell’s circumplex model of affect, emotional states are placed along two orthogonal dimensions, valence (positive–negative) and arousal (high–low), with stress occupying the Low-Valence High-Arousal (LVHA) quadrant and depression localized to the Low-Valence Low-Arousal (LVLA) quadrant. Distinguishing between these two states carries significant clinical implications: despite overlapping self-reported symptom profiles, they require fundamentally different therapeutic interventions.
Continuous physiological monitoring is needed because clinical assessments sample sparsely in time and depend on subjective interpretation. Unimodal EEG classifiers miss arousal indicators in 24% of DEAP instances [1], which indicates that single-signal techniques cannot capture the full complexity of emotion. Each modality measures a different process: EEG tracks neuronal oscillations, Galvanic Skin Response (GSR) measures skin conductance responses, Blood Volume Pulse (BVP) measures pulse variability, and respiration measures breathing patterns. While wearable technology allows for real-time tracking, unimodal approaches classify only 76% of emotions accurately [2]. We therefore combined CNNs for EEG spatial characteristics, LSTMs for time-series dynamics, and Transformers for inter-signal connections. Unimodal accuracy was found to be 72%, with failures on 10-s arousal fades (e.g., Trial 15), whereas cross-validating alpha desynchronization against GSR rises raised fusion accuracy to 89%. DEAP’s 32 subjects and 40 music videos yielded EEG and peripheral recordings for all participants, facial recordings for 22, and post-stimulus ratings of valence, arousal, dominance, and familiarity. Last.fm tags and surveys were used to select the stimuli, and the dataset was used for initial correlation and fusion analyses [3].
Despite their potential, single-modality systems frequently cannot capture the complete range of emotions. EEG readings may show subtle neural changes that coincide with pronounced peripheral responses such as GSR. Integrating physiological data is advantageous because it provides a holistic view: the underlying relationships between physiological data streams suggest that combining them can increase the accuracy and reliability of emotion recognition.
Multimodal fusion solutions [2] are essential for combining multiple data sources, and multimodal physiological data integration is more accurate than unimodal analysis. The three main fusion methods are early fusion, which concatenates raw features; late fusion, which integrates classifier outputs; and hybrid fusion, which mixes both. The fusion strategy must be chosen so that it captures complex relationships between data streams without adding excessive model complexity or causing overfitting.
Given these limitations, this study examines whether explicitly modeling the hierarchical and cross-modal interactions improves emotion identification. Emotion recognition is a multi-scale dependency learning problem that involves brain activity, autonomic reflexes, and cardiovascular dynamics over several temporal ranges. A hybrid CNN–LSTM–Transformer architecture integrates spatial EEG representations, short-term temporal dynamics, and long-range cross-modal interactions in a single learning paradigm to operationalize this formulation. We investigate classification performance, physiological validity, statistical discriminability, and modality synergy across emotional states using the DEAP dataset as a controlled benchmark. This approach explains why existing models struggle with some emotional categories, notably low-arousal circumstances, and shows how attention-based modeling of delayed physiological interactions might overcome these constraints.

1.1. Related Work

Existing approaches to physiological emotion recognition fall into two broad categories: unimodal Electroencephalography (EEG)-based methods and multimodal fusion systems. Unimodal EEG classifiers have demonstrated reasonable performance on binary classification tasks, yet their reliance on a single signal source inherently limits their ability to capture the full spectrum of affective states—neural oscillations alone do not encode the autonomic and cardiovascular dynamics that accompany emotional arousal and valence shifts. Multimodal approaches incorporating peripheral signals such as GSR, BVP, and respiration improve coverage, but peripheral-only systems in turn miss the cortical cues that EEG uniquely provides.
A more fundamental limitation shared across both categories is task formulation. The overwhelming majority of published work frames emotion recognition as a binary problem: classifying valence alone, arousal alone, or each independently as a separate two-class task. As illustrated in Figure 1, this conflates emotionally and clinically distinct states. High arousal, for instance, encompasses both excitement (High-Valence High-Arousal (HVHA)) and stress/anger (Low-Valence High-Arousal (LVHA)), states with opposing valence and very different implications for mental health monitoring. Four-quadrant joint Valence–Arousal (V–A) classification, by contrast, simultaneously resolves all four emotional regions of the circumplex model and carries twice the information-theoretic entropy of any binary formulation (H = 2.0 bits vs. H = 1.0 bit per binary dimension), making it a substantially harder and more clinically expressive task.
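The entropy comparison above can be checked with a few lines of arithmetic. The sketch below assumes perfectly balanced classes, which is the case in which the 2.0-bit and 1.0-bit figures hold exactly; the function name is illustrative.

```python
import math

def uniform_entropy_bits(n_classes: int) -> float:
    """Shannon entropy (bits) of a uniform distribution over n_classes labels."""
    p = 1.0 / n_classes
    return -sum(p * math.log2(p) for _ in range(n_classes))

binary_bits = uniform_entropy_bits(2)    # one binary dimension (valence or arousal)
quadrant_bits = uniform_entropy_bits(4)  # joint four-quadrant V-A label
print(binary_bits, quadrant_bits)
```

With imbalanced quadrants the label entropy would be lower than 2.0 bits, so the figure is best read as an upper bound.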
Within this more challenging formulation, the Low-Valence Low-Arousal (LVLA) quadrant corresponding to sadness and depression-adjacent states is the hardest to classify reliably. The associated physiological responses are subtle, temporally delayed, and easily obscured by inter-subject variability, yet this quadrant is precisely the most relevant for early detection of mood disorders and chronic stress. No existing binary-formulation method reports performance on this quadrant in isolation. The present work specifically targets this gap by adopting a joint four-quadrant V–A protocol and reporting per-quadrant performance, with particular attention to LVLA recall.
The major contributions of this work are summarized as follows:
  • We demonstrate the first full integration of CNN, LSTM, and Transformer for multimodal physiological emotion recognition, yielding a 6.25% accuracy increase over the best partial hybrid.
  • The framework offers advanced recognition of low-valence low-arousal (LVLA) emotional states, with an F1-score of 0.913, a 33% improvement above CNN-only baselines.
  • Ablation over five architectural variants shows the significance of multi-scale dependency modeling, with any component removal resulting in a 4.75–11.25% accuracy loss.
  • In data-scarce circumstances, a high-capacity model (532:1 parameter–sample ratio) trained using coordinated regularization techniques achieves 91.98 ± 1.70% cross-validation accuracy.
  • Ablation experiments, ANOVA with effect size analysis, per-class performance measures, confusion matrix analysis, modality contribution assessment, and cross-dataset validation for four-quadrant emotion classification are performed on the DEAP dataset.

1.2. Objectives and Hypothesis

This study aims to develop and evaluate a hybrid CNN–LSTM–Transformer framework for reliable four-quadrant valence–arousal emotion classification from multimodal physiological signals, with particular emphasis on the clinically critical LVLA emotional state. To this end, five specific objectives are pursued: extracting and statistically validating discriminative features from EEG, GSR, BVP, and respiration signals; designing and training a hybrid architecture that integrates spatial, temporal, and cross-modal physiological dependencies; quantifying each architectural component’s contribution through systematic ablation; evaluating cross-dataset generalisability by testing the DEAP-trained model on DREAMER without retraining; and benchmarking performance against recent 2023–2025 state-of-the-art approaches.
It is hypothesised that explicitly modelling hierarchical and cross-modal physiological interactions will yield significantly higher LVLA recall than any single-component or partial-hybrid variant, reflecting the assumption that no individual modality or temporal scale alone is sufficient for reliable four-quadrant emotion classification.

2. Dataset Description

DEAP (Database for Emotion Analysis using Physiological signals) [4] is a publicly available multimodal dataset intended for the analysis of human emotional states. The collection consists of recordings from 32 individuals, each of whom watched 40 one-minute music video segments. EEG and peripheral physiological data were collected from all 32 participants, and frontal facial recordings from 22 of them. After each video stimulus, participants provided self-reported evaluations of their emotional responses across the dimensions of arousal, valence, liking (like/dislike), dominance, and familiarity. The video stimuli were selected through a process that incorporated affective tags from Last.fm, a video highlight identification tool, and an online evaluation instrument. The initial analysis of the dataset encompassed participant rating evaluations, correlations between EEG and ratings, and single-trial classification of emotional states using diverse modalities and decision-fusion methodologies.

3. Methodology

This study adopts a within-subject observational design using the publicly available DEAP dataset. No new human participants were recruited. The DEAP dataset was originally collected under ethical oversight at Queen Mary University of London; all recordings were fully anonymized before public release and were used in accordance with the dataset’s terms of access.

3.1. Participants

The DEAP dataset comprises recordings from 32 healthy adult participants (16 female, 16 male; mean age 26.9 years). Each participant observed 40 one-minute music video clips selected to elicit a range of emotional responses. Following each stimulus, participants provided self-reported ratings of valence, arousal, dominance, and liking on a 9-point Self-Assessment Manikin (SAM) scale. The 32-subject cohort constitutes the full available dataset and is consistent with sample sizes reported in comparable physiological emotion recognition studies.
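To make the link between the SAM ratings and the four-quadrant labels concrete, the following minimal sketch maps valence and arousal ratings to quadrant indices. The cut at the scale midpoint (5.0) and the function and constant names are assumptions for illustration; the text does not state the threshold that was actually used.

```python
import numpy as np

QUADRANTS = ("HVHA", "HVLA", "LVHA", "LVLA")

def quadrant_labels(valence, arousal, midpoint=5.0):
    """Map 1-9 SAM ratings to quadrant indices (midpoint is an assumed cut)."""
    valence = np.asarray(valence, dtype=float)
    arousal = np.asarray(arousal, dtype=float)
    high_v = valence > midpoint
    high_a = arousal > midpoint
    # Index order follows QUADRANTS: 0 HVHA, 1 HVLA, 2 LVHA, 3 LVLA.
    return np.where(high_v, np.where(high_a, 0, 1), np.where(high_a, 2, 3))

labels = quadrant_labels([7.5, 8.0, 2.0, 3.1], [6.9, 2.5, 7.7, 1.8])
names = [QUADRANTS[i] for i in labels]
print(names)
```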

3.2. Software and Implementation Details

Python 3.8 was used for all experiments in a standard environment. TensorFlow (≥2.13.0) served as the deep learning framework, while NumPy (≥1.24.0) and Pandas (≥2.0.0) were used for numerical computation and data organization. SciPy (≥1.10.0) and MNE (≥1.5.0) were used for signal processing, with MNE handling EEG preprocessing and neurophysiological signal management. The Scikit-learn (≥1.3.0) library provided general machine learning tools. CVXOPT (≥1.3.0) handled GSR signal analysis via cvxEDA, while pyEDFlib (≥0.1.30) read physiological recordings in EDF/BDF formats. Results were visualized using Matplotlib (≥3.7.0) and Seaborn (≥0.12.0). Together, these libraries support preprocessing, feature extraction, multimodal fusion, and the training and testing of the hybrid CNN-LSTM-Transformer model.

3.3. Procedure

Data collection and processing proceeded in three phases, summarised in Table 1.

3.4. Mathematical Foundation

This section explains the mathematical principles that support the fundamental data-preprocessing and signal-analysis methodologies employed in this investigation.

3.4.1. Data Loading and Structuring

The first step is to load raw physiological data from BioSemi Data Format (BDF) files. This process is a file I/O operation in which multichannel data is parsed and loaded into a numerical matrix, $D \in \mathbb{R}^{C \times S}$, where $C$ is the number of channels and $S$ is the total number of samples. Metadata, comprising the sampling frequency $f_s$ and channel designations, is extracted directly from the file headers. This matrix layout offers a uniform framework for subsequent modality-specific processing.

3.4.2. Signal Filtering and Artifact Removal

Butterworth Filtering
Butterworth IIR filters remove noise and isolate frequency bands. The coefficients $b$ and $a$ are obtained from butter(N, Wn, btype), where $N$ is the filter order (typically 2–4) and $W_n = f_c / (f_s / 2)$ is the cutoff frequency normalised to the Nyquist frequency. Each pass of the zero-phase forward-backward filtfilt applies the difference equation:
$$y[n] = \sum_{k=0}^{N} b_k\, x[n-k] - \sum_{k=1}^{N} a_k\, y[n-k]$$
All filtering operations, for EEG ($N = 4$) and peripheral signals ($N = 2$), employ scipy.signal.filtfilt exclusively, producing a zero-phase response with no group delay on any channel. lfilter is not used anywhere in the pipeline, preserving cross-modal temporal alignment to within one sample.
Phase distortion was formally evaluated by comparing single-pass causal IIR filtering against the zero-phase forward-backward filtfilt implementation. A causal $N = 4$ Butterworth filter introduces a frequency-dependent group delay that, for the EEG alpha-band cutoff of 12 Hz at $f_s = 128$ Hz, produces a mean passband delay of 10.1 ms with a maximum of 32.1 ms, which is sufficient to desynchronise co-occurring EEG and GSR events that manifest within tens of milliseconds of each other. By contrast, filtfilt applies the filter in the forward direction followed immediately by the reverse direction, exactly cancelling all phase shifts and yielding a mean and maximum passband group delay of provably 0.0 ms across all frequencies. The trade-off is an increase in Nyquist attenuation from 240 dB (single-pass) to 480 dB (forward-backward) and a doubling of passband ripple from 3.0 dB to 6.0 dB; for the broadband physiological signals used here, this ripple level is acceptable and does not distort the spectral features extracted downstream. The results of this comparison are summarised in Table 2.
We selected $N = 4$ over $N = 2$, which reduced ocular artifacts by 12% (SNR gain: frontal variance $\sigma^2 = 18.4 \to 16.2$, $t(31) = 2.3$, $p = 0.03$). In Trial 22, a sharp GSR spike at $t = 15.3$ s coincided with EEG alpha suppression (8–12 Hz power drop from 18.2 to 12.7 $\mu\mathrm{V}^2$). Minimum-phase filtering would delay the GSR spike to $t = 15.7$ s, a 0.4 s error that breaks synchrony and causes the classifier to miss the arousal signature. The zero-phase filtfilt implementation eliminates this misalignment entirely, ensuring that every modality (EEG, GSR, BVP, Respiration, and Temperature) remains temporally aligned to within one sample in the multimodal fusion pipeline.
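The difference between causal and zero-phase filtering can be illustrated on a synthetic pulse. The sketch below is not the study's pipeline: the Gaussian test signal, its width, and the variable names are assumptions, while the filter order (4), cutoff (12 Hz), and sampling rate (128 Hz) follow the text.

```python
import numpy as np
from scipy.signal import butter, filtfilt, lfilter

fs = 128.0                                       # sampling rate in Hz
b, a = butter(4, 12.0 / (fs / 2), btype="low")   # N = 4, 12 Hz cutoff

# Synthetic test signal: a single Gaussian bump centred at t = 2 s.
t = np.arange(0, 4, 1 / fs)
x = np.exp(-0.5 * ((t - 2.0) / 0.05) ** 2)

y_causal = lfilter(b, a, x)    # single forward pass: phase-shifted output
y_zero = filtfilt(b, a, x)     # forward-backward pass: zero-phase output

peak_in = int(np.argmax(x))
lag_causal = int(np.argmax(y_causal)) - peak_in  # positive means delayed
lag_zero = int(np.argmax(y_zero)) - peak_in
print(f"causal lag: {lag_causal} samples, zero-phase lag: {lag_zero} samples")
```

The causal output peaks a few samples late, whereas the forward-backward output peaks at the same sample as the input, which is the property the fusion pipeline relies on.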
Threshold-Based Artifact Detection
For EEG signals, artifacts are detected by thresholding the absolute signal values against a multiple of the channel’s standard deviation. An artifact mask $A$ is computed where the signal exceeds a set threshold:
$$A = |x| > \tau \cdot \sigma_x$$
Here, $\tau$ is a scaling factor (e.g., 5) and $\sigma_x$ is the standard deviation of the signal $x$. The channel quality $q$ is then calculated as $q = 1 - r$, where $r$ is the ratio of artifact samples to total samples. Channels with quality below a certain threshold (e.g., 80%) are excluded from subsequent analysis.
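A minimal sketch of the mask and quality computation follows, using the example values from the text ($\tau = 5$, 80% quality threshold). The synthetic three-channel data, the injected spikes, and the function name are illustrative assumptions.

```python
import numpy as np

def channel_quality(x, tau=5.0):
    """Return the artifact mask A = |x| > tau * sigma_x and quality q = 1 - r."""
    x = np.asarray(x, dtype=float)
    mask = np.abs(x) > tau * np.std(x)
    r = mask.mean()                 # fraction of samples flagged as artifacts
    return mask, 1.0 - r

rng = np.random.default_rng(0)
eeg = rng.normal(0.0, 1.0, size=(3, 1000))   # 3 synthetic channels
eeg[1, ::100] += 50.0                        # inject 10 large spikes in channel 1

qualities = np.array([channel_quality(ch)[1] for ch in eeg])
good = qualities >= 0.80                     # channels kept for later steps
spike_mask, spike_q = channel_quality(eeg[1])
print(qualities, good)
```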
Common Average Referencing (CAR)
To reduce common-mode noise across EEG channels, Common Average Referencing is applied. The re-referenced signal $x_{\mathrm{ref}}$ for each channel $c$ at each sample $s$ is calculated by subtracting the mean of all high-quality (“good”) channels at that sample:
$$x_{\mathrm{ref}}[c, s] = x[c, s] - \frac{1}{G} \sum_{g=1}^{G} x[g, s]$$
where $G$ is the number of good channels identified during artifact detection and $g$ indexes those channels.
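The re-referencing step can be sketched in a few lines of NumPy. The boolean mask of good channels is assumed to come from the artifact-detection step; the data and names below are illustrative.

```python
import numpy as np

def common_average_reference(x, good):
    """Subtract the per-sample mean of the good channels from every channel."""
    x = np.asarray(x, dtype=float)
    good = np.asarray(good, dtype=bool)
    reference = x[good].mean(axis=0, keepdims=True)   # shape (1, S)
    return x - reference

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 256)) + 3.0          # 4 channels sharing a common offset
good = np.array([True, True, False, True])   # channel 2 excluded from the mean
x_ref = common_average_reference(x, good)
print(x_ref[good].mean(axis=0)[:3])          # close to zero at every sample
```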

3.4.3. Signal Processing Techniques

Tonic/Phasic Decomposition of GSR
The Galvanic Skin Response (GSR) signal is decomposed into its slow-moving tonic component and fast-moving phasic component, which contains Skin Conductance Responses (SCRs). This is primarily achieved using cvxEDA, a convex optimization approach that models the phasic component as the sparse driver of the signal. It solves the following objective function:
$$\min_{p,\,t} \; \lVert p - \delta \rVert_2^2 + \lambda \lVert D t \rVert_1$$
where p is the phasic component, t is the tonic component, δ is the sudomotor nerve driver, D is a differencing matrix, and λ is a regularization parameter. In cases where this method fails, a fallback using high-pass and low-pass Butterworth filters is employed to separate the components.
The cvxEDA optimization used a fixed regularization parameter λ = 10 (Greco et al. [5] default via NeuroKit2), a convergence threshold of $1 \times 10^{-6}$, and a maximum of 500 iterations, applied uniformly across all trials. To verify consistency across all trials, separation quality was quantified through signal reconstruction fidelity, defined as the $L_2$ norm of the difference between the original GSR signal and the sum of the recovered tonic and phasic components. The reconstruction error had a mean of 0.003 ± 0.001 and a maximum of 0.005 across all trials, indicating near-perfect additive decomposition. Phasic variance was stable across trials (mean 0.050, std 0.007), confirming that the optimisation converged to physiologically consistent solutions. The overall cvxEDA success rate was 97.5%. Where the solver failed to converge, a second-order Butterworth fallback ($N = 2$) was used, with a high-pass at 0.05 Hz for the phasic component and a low-pass at 0.05 Hz for the tonic component. Note that EEG filtering uses $N = 4$ throughout, while all peripheral signal fallbacks use $N = 2$, which is sufficient for their lower frequency content.
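The Butterworth fallback described above (not the cvxEDA solver itself) can be sketched as follows. The order and 0.05 Hz cutoff follow the text; the sampling rate, the synthetic GSR trace, and the function name are assumptions for illustration.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def gsr_fallback_decompose(gsr, fs, cutoff_hz=0.05, order=2):
    """Split GSR into tonic (low-pass) and phasic (high-pass) parts, zero-phase."""
    wn = cutoff_hz / (fs / 2)
    b_lo, a_lo = butter(order, wn, btype="low")
    b_hi, a_hi = butter(order, wn, btype="high")
    tonic = filtfilt(b_lo, a_lo, gsr)
    phasic = filtfilt(b_hi, a_hi, gsr)
    return tonic, phasic

fs = 128.0                                   # assumed sampling rate in Hz
t = np.arange(0, 60, 1 / fs)                 # one 60-s epoch
gsr = 5.0 + 0.3 * np.sin(2 * np.pi * 0.5 * t)   # slow level plus fast activity
tonic, phasic = gsr_fallback_decompose(gsr, fs)
print(tonic.mean(), phasic.std())
```

On this synthetic trace the tonic output stays near the constant level of 5.0, while the phasic output retains the 0.5 Hz oscillation.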
Peak Detection
To extract event-based metrics such as heart rate and respiratory rate, a peak detection algorithm is applied to the relevant filtered signals (e.g., BVP for heartbeats, Respiration for breaths). The algorithm identifies local maxima that satisfy specific constraints, such as a minimum height $h$ and a minimum distance $d$ from neighboring peaks, where $x[i] > x[i-1]$ and $x[i] > x[i+1]$.
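One way to realise this step is with scipy.signal.find_peaks, whose height and distance arguments correspond to the constraints $h$ and $d$. The thresholds, the synthetic 1.2 Hz pulse wave, and the helper name below are assumptions, not values taken from the study.

```python
import numpy as np
from scipy.signal import find_peaks

def rate_per_minute(signal, fs, min_height, min_interval_s):
    """Detect peaks and convert the mean inter-peak interval to events per minute."""
    distance = max(1, int(min_interval_s * fs))
    peaks, _ = find_peaks(signal, height=min_height, distance=distance)
    if len(peaks) < 2:
        return peaks, float("nan")
    mean_interval_s = np.mean(np.diff(peaks)) / fs
    return peaks, 60.0 / mean_interval_s

fs = 128.0
t = np.arange(0, 30, 1 / fs)
bvp = np.sin(2 * np.pi * 1.2 * t)            # 1.2 Hz corresponds to 72 beats/min
peaks, bpm = rate_per_minute(bvp, fs, min_height=0.5, min_interval_s=0.4)
print(len(peaks), round(bpm, 1))
```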
Envelope Extraction for EMG
An envelope is extracted to assess muscle activity from Electromyography (EMG) signals. The signal is first rectified by calculating its absolute value, $r = |x|$. A low-pass Butterworth filter is then applied to the rectified signal to provide a smooth envelope, $e = L(r)$, representing the intensity of muscle contraction over time.
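The rectify-then-smooth procedure is sketched below. The 5 Hz cutoff, second-order filter, sampling rate, and synthetic burst are assumptions chosen for illustration, since the text does not give the envelope filter settings.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def emg_envelope(x, fs, cutoff_hz=5.0, order=2):
    """Rectify (r = |x|) and low-pass filter (e = L(r)) to get the envelope."""
    rectified = np.abs(x)
    b, a = butter(order, cutoff_hz / (fs / 2), btype="low")
    return filtfilt(b, a, rectified)

fs = 128.0
t = np.arange(0, 3, 1 / fs)
rng = np.random.default_rng(0)
emg = 0.05 * rng.normal(size=t.size)          # resting baseline noise
burst = (t >= 1.0) & (t < 2.0)
emg[burst] = rng.normal(size=burst.sum())     # simulated contraction burst

env = emg_envelope(emg, fs)
inside = env[(t >= 1.2) & (t < 1.8)].mean()
outside = env[t < 0.7].mean()
print(inside, outside)
```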
Gradient Calculation for Temperature
The rate of change in skin temperature is analyzed by calculating the first derivative of the smoothed temperature signal using a central difference formula:
$$x'[i] = \frac{x[i+1] - x[i-1]}{2}$$
This computation emphasizes intervals of swift temperature variation, which may signify autonomic nervous system reactions.
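The central difference can be written directly in NumPy. In this sketch the endpoints, where the formula is undefined, are left as NaN; that choice and the example temperature values are assumptions. The result is in units per sample.

```python
import numpy as np

def central_difference(x):
    """x'[i] = (x[i+1] - x[i-1]) / 2 for interior samples; endpoints are NaN."""
    x = np.asarray(x, dtype=float)
    dx = np.full_like(x, np.nan)
    dx[1:-1] = (x[2:] - x[:-2]) / 2.0
    return dx

temp = np.array([32.0, 32.1, 32.4, 32.9, 33.0])   # illustrative values in deg C
grad = central_difference(temp)
print(grad)
```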

3.4.4. Statistical Validation and Effect Size Analysis

A set of complementary statistical tests was used to verify the discriminative potential of the extracted features between emotion quadrants (HVHA versus LVLA). Each test examines a different aspect of the distributional differences and effect magnitude.
Student’s T-Test
The independent samples t-test determines whether two groups have significantly different means. The t-statistic is calculated for each feature $x$:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
where $\bar{x}_i$, $s_i^2$, and $n_i$ represent the mean, variance, and sample size of group $i$, respectively. This parametric test assumes normally distributed data and tests the null hypothesis $H_0: \mu_1 = \mu_2$ at significance level $\alpha = 0.05$.
Mann–Whitney U Test
Non-parametric Mann–Whitney U tests determine whether two independent samples come from different distributions:
$$U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1$$
where R 1 is the sum of ranks for the first group. This distribution-free test makes no assumptions about data normality, making it robust for physiological signals that often exhibit skewness or outliers.
Cohen’s d Effect Size
Whereas p-values indicate statistical significance, Cohen’s d quantifies the practical magnitude of a difference:
$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\mathrm{pooled}}}, \qquad s_{\mathrm{pooled}} = \sqrt{\frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}}$$
Effect sizes are interpreted as small ($|d| < 0.3$), medium ($0.3 \le |d| < 0.8$), or large ($|d| \ge 0.8$).
Kolmogorov–Smirnov (KS) Test
KS tests whether two samples are from different probability distributions by evaluating the greatest distance between their empirical cumulative distribution functions:
$$D = \sup_x \left| F_1(x) - F_2(x) \right|$$
The KS test is sensitive to distribution shape, spread, and location, unlike location-based tests.
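The four measures in this section can be computed together with SciPy, as in the hedged sketch below. The synthetic HVHA and LVLA samples and the helper names are assumptions. The t-statistic formula given above uses separate group variances, which corresponds to ttest_ind with equal_var=False.

```python
import numpy as np
from scipy import stats

def cohens_d(x1, x2):
    """Cohen's d with the pooled standard deviation defined in the text."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * np.var(x1, ddof=1)
                  + (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(x1) - np.mean(x2)) / np.sqrt(pooled_var)

def compare_groups(x1, x2):
    """Run the t-test, Mann-Whitney U test, and KS test on one feature."""
    _, t_p = stats.ttest_ind(x1, x2, equal_var=False)
    _, u_p = stats.mannwhitneyu(x1, x2, alternative="two-sided")
    ks_d, ks_p = stats.ks_2samp(x1, x2)
    return {"t_p": t_p, "u_p": u_p, "ks_D": ks_d, "ks_p": ks_p,
            "d": cohens_d(x1, x2)}

rng = np.random.default_rng(0)
hvha = rng.normal(1.0, 1.0, size=200)   # synthetic feature values, group 1
lvla = rng.normal(0.0, 1.0, size=200)   # synthetic feature values, group 2
result = compare_groups(hvha, lvla)
print(result)
```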

3.4.5. Dimensionality Reduction

Principal Component Analysis (PCA) was employed to reduce the dimensionality of the extensive feature space. PCA is a linear transformation that projects data onto a new set of orthogonal axes known as principal components. The components are ordered so that the first principal component (PC1) corresponds to the direction of maximum variance in the data, PC2 to the direction of second-largest variance, and so forth. This is accomplished through the eigendecomposition of the covariance matrix of the dataset, $\Sigma$. The eigenvectors of $\Sigma$ are the principal components, whereas their associated eigenvalues, $\lambda$, indicate the variance explained by each component.
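The eigendecomposition route described above can be sketched with NumPy alone. The synthetic data (three latent directions embedded in ten features), the 95% target, and the function names are illustrative assumptions.

```python
import numpy as np

def pca_eig(X):
    """PCA via eigendecomposition of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    return eigvals, eigvecs, explained

def n_components_for(explained, target=0.95):
    """Smallest number of components whose cumulative variance reaches target."""
    return int(np.searchsorted(np.cumsum(explained), target) + 1)

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3)) * np.array([5.0, 3.0, 2.0])
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(500, 10))

eigvals, eigvecs, explained = pca_eig(X)
k95 = n_components_for(explained, 0.95)
print(k95, explained[:4].round(3))
```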

3.5. Model Architecture

The data processing pipeline in Figure 2 begins with raw physiological signals from the DEAP dataset. These signals include multi-channel EEG, GSR, EMG, Blood Volume Pulse, Respiration, and Temperature, and they are processed in parallel streams. Signals from the central nervous system, such as EEG, undergo strict filtering, artifact removal via threshold-based detection, and Common Average Referencing (CAR) to maximize signal quality. At the same time, autonomic signals are analyzed using techniques such as the tonic/phasic decomposition of the galvanic skin response (GSR) and envelope extraction of electromyography (EMG). The processed data yields a comprehensive set of features, encompassing frequency-band powers from EEG (e.g., Alpha, Beta), event-related metrics from autonomic signals (e.g., SCR, HR), and synthesized cross-modal interaction features. Finally, these features are combined and normalized into a single 150-dimensional feature vector, which serves as the classifier’s input.
The basis of our categorization system is the hybrid CNN-LSTM-Transformer model illustrated in Figure 3. This architecture is designed to leverage the complex relationships inside the integrated feature vector. The 150-dimensional input is first converted into a 2D tensor of 15 × 10 (steps × features) to produce a pseudo-temporal sequence.
Sinusoidal positional encoding [6] was applied to this pseudo-temporal sequence to preserve ordinal relationships among the 15 feature groups. Each position p and dimension i receives the encoding:
$$\mathrm{PE}(p, 2i) = \sin\!\left(\frac{p}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad \mathrm{PE}(p, 2i+1) = \cos\!\left(\frac{p}{10000^{2i/d_{\mathrm{model}}}}\right)$$
Sinusoidal encoding was chosen over learned positional embeddings because it generalises to sequence lengths unseen during training and requires no additional parameters. The encoding was validated to be injective across all 15 positions (zero duplicate position vectors detected), with monotonically increasing $L_2$ separation between positions farther apart in the pseudo-sequence ($L_2(t_0, t_{14}) = 2.249$ vs. $L_2(t_0, t_1) = 0.972$), ensuring that the self-attention mechanism can distinguish all feature groups. The 15-step pseudo-sequence does not represent raw temporal samples; it represents ordered feature groups extracted from 60-s physiological epochs. The Transformer’s self-attention thus learns cross-group relationships between different physiological feature classes rather than sample-level temporal dynamics, which are handled by the preceding Bidirectional LSTM layer.
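The encoding and the injectivity check can be reproduced with a short NumPy sketch. It assumes $d_{\mathrm{model}} = 10$, matching the 15 × 10 reshape; under that assumption the computed distances come out close to the two values quoted above. Function and variable names are illustrative.

```python
import numpy as np

def sinusoidal_encoding(n_positions=15, d_model=10):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    positions = np.arange(n_positions)[:, None]          # shape (15, 1)
    i = np.arange(d_model // 2)[None, :]                 # shape (1, 5)
    angles = positions / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding()
n_unique = len(np.unique(pe.round(12), axis=0))   # 15 means no duplicate rows
d_0_1 = np.linalg.norm(pe[0] - pe[1])             # L2 distance, positions 0 and 1
d_0_14 = np.linalg.norm(pe[0] - pe[14])           # L2 distance, positions 0 and 14
print(n_unique, round(d_0_1, 3), round(d_0_14, 3))
```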
This tensor is fed into a 1D CNN block that uses convolutional layers, Batch Normalization, and MaxPooling to extract local spatial patterns from the feature set. The output sequence is subsequently relayed to a Bidirectional LSTM layer, which captures temporal and sequential dependencies in both forward and backward directions.
A Transformer Encoder layer is employed to capture long-range global dependencies. The multi-head self-attention mechanism evaluates the relative importance of all features, enabling the model to capture complex cross-modal interactions. The Transformer output is flattened and further processed by several Dense layers, combining Dropout and L2 regularization for final classification. The softmax activation function computes the probabilities for each of the four emotional quadrants: HVHA (Excitement), HVLA (Contentment), LVHA (Stress/Anger), and LVLA (Sadness), thus concluding the classification process.
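The layer ordering described in this section can be sketched in Keras as follows. This is an illustrative outline rather than the authors' model: every hyperparameter (filter count, LSTM units, number of heads, dropout rate, L2 weight) is an assumption, since the text does not list them, and the positional encoding step is omitted for brevity.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_hybrid_model(n_features=150, steps=15, n_classes=4):
    """CNN -> BiLSTM -> Transformer encoder -> Dense head (assumed sizes)."""
    inputs = keras.Input(shape=(n_features,))
    x = layers.Reshape((steps, n_features // steps))(inputs)   # 15 x 10

    # 1D CNN block: local patterns across neighbouring feature groups.
    x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling1D(pool_size=2)(x)

    # Bidirectional LSTM: sequential dependencies in both directions.
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)

    # One Transformer encoder layer: self-attention plus feed-forward,
    # each followed by a residual connection and layer normalisation.
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)
    x = layers.LayerNormalization()(layers.Add()([x, attn]))
    ff = layers.Dense(128, activation="relu")(x)
    ff = layers.Dense(128)(ff)
    x = layers.LayerNormalization()(layers.Add()([x, ff]))

    # Classification head with L2 regularisation and Dropout.
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)

model = build_hybrid_model()
probs = model(np.zeros((2, 150), dtype="float32")).numpy()
print(model.output_shape, probs.sum(axis=1))
```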

3.6. Training, Fusion, and Evaluation Strategy

The methodology for training and evaluating the model emphasized a hybrid multimodal fusion technique to amalgamate various physiological and facial feature sets efficiently. The system integration technique included several essential phases, from data collection through wearable sensors to the final categorization output.
A feature-level fusion method was implemented, concatenating extracted features from all physiological modalities before classification. The Transformer encoder layer provides attention-based cross-modal feature integration, enabling the model to learn weighted combinations of multimodal signals. This design choice was based on a preliminary analysis of the DEAP dataset, which revealed minimal inter-modality correlations, as evidenced by an EEG-GSR correlation of 0.07, indicating that the modalities provide complementary rather than redundant information. To substantiate that low correlation reflects genuine complementarity rather than noise, mutual information (MI) analysis was conducted as a model-free, non-linear measure of statistical dependence between each modality and the class label. EEG was found to carry substantial target-relevant information (MI = 0.144 bits), while GSR carries modest but non-negligible information (MI = 0.015 bits). The cross-modal MI between EEG and GSR was very low (MI = 0.004 bits), confirming that the information GSR encodes about the class label is largely orthogonal to, rather than redundant with, what EEG already provides. By the data processing inequality, if GSR were pure noise, its MI with the class label would be indistinguishable from zero; the observed value of 0.015 bits is non-trivial at this sample size and reflects genuine physiological signal content. This complementarity was further confirmed empirically: EEG-only achieves 95.4% cross-validated accuracy, GSR-only achieves 70.6%, and EEG + GSR fusion achieves 96.2%, a gain of 0.8 percentage points over EEG alone. If GSR contributed only noise, fusion would not improve on EEG-only performance and could degrade it through noise injection.
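A simple way to estimate mutual information between a continuous feature and the class label is a histogram (plug-in) estimate, sketched below. The estimator, the number of bins, and the synthetic data are assumptions; the text does not specify how MI was computed. Plug-in estimates are biased upward for small samples, so small values should be compared against a shuffled-label baseline.

```python
import numpy as np

def mutual_information_bits(x, y, bins=8):
    """Plug-in estimate of I(X; Y) in bits for continuous x and discrete y."""
    # Quantile binning gives roughly equal-count bins with indices 0..bins-1.
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    x_binned = np.digitize(x, edges)
    _, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((bins, y_idx.max() + 1))
    np.add.at(joint, (x_binned, y_idx), 1)       # joint count table
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
y = rng.integers(0, 4, size=5000)                # four quadrant labels
informative = y + rng.normal(0.0, 1.0, size=5000)
noise = rng.normal(size=5000)

mi_informative = mutual_information_bits(informative, y)
mi_noise = mutual_information_bits(noise, y)
print(round(mi_informative, 3), round(mi_noise, 3))
```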
These results are consistent with the physiological basis of complementarity established in the affective computing literature [2], wherein cortical oscillations (EEG) and autonomic sympathetic activation (GSR) encode distinct and non-overlapping dimensions of emotional arousal. The implementation plan specified the use of deep learning frameworks like TensorFlow (version ≥2.13.0) for facial data processing. Conversely, traditional machine learning models such as SVMs or XGBoost may be used to extract physiological features before the final fusion stage.
A cross-validation approach was used to provide a comprehensive and unbiased evaluation of the model’s generalization ability. The classification system’s overall performance was assessed using conventional metrics, including accuracy, Mean Squared Error (MSE), and F1-score. This assessment approach was designed to verify the model’s effectiveness across diverse data subsets and to provide a reliable measure of its practical applicability.
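Per-quadrant recall and F1-score follow directly from the confusion matrix, as in this NumPy sketch. The toy labels and the function name are illustrative only and are not results from the study.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes=4):
    """Confusion matrix (rows = true class), accuracy, recall, and F1 per class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    tp = np.diag(cm).astype(float)
    recall = tp / np.maximum(cm.sum(axis=1), 1)
    precision = tp / np.maximum(cm.sum(axis=0), 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / cm.sum()
    return cm, accuracy, recall, f1

# Class indices: 0 HVHA, 1 HVLA, 2 LVHA, 3 LVLA (toy example).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 0, 2, 2, 3, 2]
cm, accuracy, recall, f1 = per_class_metrics(y_true, y_pred)
print(accuracy, recall, f1.round(3))
```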

Feature Selection and Dimensionality

Preprocessing and feature engineering across all physiological modalities yielded 1307 features. Electroencephalography (EEG) features included frequency-band powers (delta, theta, alpha, beta, and gamma), channel-wise statistical descriptors, spatial features, and hemisphere asymmetry indices from 32 channels. Galvanic skin response (GSR) elements included tonic level, phasic components, SCR counts, SCR amplitude statistics, and rise and recovery time. EMG features included envelope-based statistics, gradients, activation patterns, and spectral properties. Electroculography (EOG), blood volume pulse (BVP), respiration, skin temperature, and peripheral physiological signals affected heart rate variability, respiratory rate and depth, temperature gradients, and blink detection.
ANOVA F-tests were used to find the most discriminative features for four-quadrant emotion categorization. The univariate F-statistic was performed individually for each feature across emotional groups and ranked by statistical significance. Top 150 features were chosen for model training based on this rating.
Note that initial feature selection used uncorrected p-values with a significance threshold of α = 0.05 . Under the null hypothesis, roughly 65 false-positive features are expected by chance ( 1307 × 0.05 ). This method is suitable for exploratory analysis, but confirmatory studies should use multiple-comparison corrections such as the Benjamini–Hochberg FDR correction at q = 0.05 or the Bonferroni correction to control the family-wise error rate. The preliminary post-hoc analysis shows that FDR correction retains 389 features, while Bonferroni correction ( α = 0.05 / 1307 3.8 × 10 5 ) retains only 21 features, indicating a moderate to substantial risk of false discoveries in the exploratory feature set.
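The ranking and the two corrections can be sketched with SciPy and NumPy as below. The synthetic data (200 features, 10 of them informative), the group sizes, and the helper names are assumptions for illustration; only the thresholds ($\alpha = 0.05$, $q = 0.05$) come from the text.

```python
import numpy as np
from scipy import stats

def anova_rank(X, y):
    """One-way ANOVA F-test per feature column across the label groups."""
    groups = [X[y == c] for c in np.unique(y)]
    f_vals, p_vals = stats.f_oneway(*groups)   # computed along axis 0
    return f_vals, p_vals

def benjamini_hochberg(p_vals, q=0.05):
    """Boolean mask of features retained by the BH step-up procedure."""
    p = np.asarray(p_vals)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    keep = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])      # largest rank meeting the bound
        keep[order[: k + 1]] = True
    return keep

rng = np.random.default_rng(0)
y = np.repeat(np.arange(4), 100)               # 4 quadrants, 100 samples each
X = rng.normal(size=(400, 200))
X[:, :10] += y[:, None] * 1.0                  # first 10 features are informative

f_vals, p_vals = anova_rank(X, y)
top_k = np.argsort(f_vals)[::-1][:10]          # rank by F-statistic
uncorrected = p_vals < 0.05
bh = benjamini_hochberg(p_vals, q=0.05)
bonferroni = p_vals < 0.05 / p_vals.size
expected_false_positives = 1307 * 0.05         # about 65, as stated in the text
print(uncorrected.sum(), bh.sum(), bonferroni.sum())
```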
The final 150 features balance model expressiveness against training sample availability. Thirty-eight principal components are needed to explain 95% of the variance, indicating that emotionally relevant information is spread across many physiological features rather than concentrated in a few.

4. Physiological Feature Analysis

This section presents a statistical analysis of the physiological features extracted from the Database for Emotion Analysis using Physiological signals (DEAP) dataset. The goal is to characterize the discriminative structure of the multimodal feature space before model training, establish the physiological validity of the signals, and motivate the multimodal fusion strategy adopted in Section 5.

4.1. Dynamic Physiological Signatures of Emotional States

EEG alpha-band power serves as an index of relaxation and attention. Over 40 trials, Figure 4a exhibits oscillations, with peaks (e.g., trials 9 and 34) indicating greater relaxation and troughs (e.g., trial 20) indicating increased attentiveness. These fluctuations motivate a temporally resolved physiological assessment. Autonomic reactions were measured with Galvanic Skin Response. A high positive correlation exists between its tonic component (tonic level) and its phasic component (SCR count) (see Figure 4b). More SCRs were seen in trials with higher baseline arousal (13, 22, 32), showing that the stimuli affected sympathetic activity and that GSR tracks arousal. Additionally, arousal produced distinct “respiratory signatures” (see Figure 4c) in respiration rate and depth. Low-arousal trials (dark purple) produce faster, shallower breaths than high-arousal trials (bright yellow/green), implying that breathing patterns vary with arousal and attentiveness. Psychological valence was examined using peripheral skin temperature. Figure 4d indicates a slight positive correlation between high valence and mild temperature rises, presumably due to peripheral vasodilation. Low-valence states show no such warming, suggesting that temperature is a subtle but useful valence signal. To synthesize these signals, cross-modal feature patterns were examined. The heatmap in Figure 4 shows temporal linkages between systems such as BVP–Respiration and EEG–EMG. Around trials 5, 10, and 20, alternating positive and negative correlations demonstrate dynamic coupling and dissociation, indicating how cardiac, respiratory, and brain processes combine to generate unified emotional states. Together, these observations indicate that multimodal fusion is important for characterizing complex human emotion.

4.2. Feature-Level Analysis and Ground Truth Distribution

To understand the foundation of classification, we first examined the distribution of emotional ground-truth labels in the valence–arousal space. Figure 5 illustrates that all 40 trials are uniformly distributed over the four quadrants (HVHA, HVLA, LVHA, and LVLA), confirming a balanced four-class dataset. The quadrants form discrete clusters, yet the overlap around the neutral center (valence = 5, arousal = 5) underscores the difficulty of differentiating mild or ambiguous emotional states and therefore the need for more discriminative features and sophisticated modeling.
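As a concrete illustration of the four-quadrant labels, the sketch below maps a (valence, arousal) rating pair to a quadrant. It assumes ratings on DEAP's 1–9 scale, a split at the neutral midpoint of 5, and ties assigned to the low class. The threshold, the tie rule, and the `margin` used to flag near-neutral trials are assumptions made for illustration, not the study's stated labelling rule.

```python
def quadrant(valence, arousal, threshold=5.0):
    """Map a rating pair to one of HVHA, HVLA, LVHA, LVLA.

    Assumption: ratings strictly above `threshold` count as "high".
    """
    v = "HV" if valence > threshold else "LV"
    a = "HA" if arousal > threshold else "LA"
    return v + a

def is_near_neutral(valence, arousal, threshold=5.0, margin=0.5):
    """Flag trials close to the neutral center, where quadrants overlap.

    `margin` is a hypothetical tolerance chosen only for this example.
    """
    return abs(valence - threshold) <= margin and abs(arousal - threshold) <= margin

print(quadrant(7.2, 8.0))         # HVHA
print(quadrant(3.1, 2.4))         # LVLA
print(is_near_neutral(5.2, 4.9))  # True
```

Trials flagged by `is_near_neutral` are the ones for which a small change in self-report would flip the class label, which is why they are the hardest to classify.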
Subsequently, we evaluated whether essential physiological characteristics exhibit distinct patterns across emotional categories. Figure 6 illustrates the distributions of EEG alpha power and GSR tonic level. Alpha power systematically fluctuates with emotional states (Figure 6a): low-arousal categories (HVLA, yellow; LVLA, blue) exhibit sharper peaks, signifying stable, tranquil, or low-energy neural activity, while high-arousal states (HVHA, red; LVHA, purple) present broader distributions, denoting increased variability during elevated emotional involvement.
The correlation between GSR and arousal is equally evident (Figure 6b). High-arousal states (red) and low-arousal states (blue) exhibit distinct distributions, with high-arousal trials showing higher tonic conductance values. These distinct feature patterns confirm that physiological data have robust, class-dependent fingerprints that machine learning algorithms can exploit.
Finally, we assessed the relevance of features derived from multimodal interaction. Figure 7 displays cross-modal features ranked by their average importance for emotion recognition. The EEG–Electromyography (EMG) interaction (importance score: 0.119) and the BVP–respiration ratio (score: 0.103) are the most significant cross-modal predictors. This finding emphasizes that brain activity, muscle tension, and cardiorespiratory dynamics are key indicators of emotional state, and that the relationships between physiological systems, not only their individual signals, carry emotion-relevant information that varies systematically across all four V–A quadrants. These cross-modal patterns directly motivate the Transformer-based attention mechanism adopted in the proposed model.

4.2.1. Feature Discrimination Analysis

Figure 8 presents a heatmap of the top 15 discriminative features between the HVHA and LVLA emotional states. Color intensity represents the negative logarithm of the uncorrected p-values for the t-test, the Mann–Whitney U test, and the Kolmogorov–Smirnov (KS) test. Fp2_power_theta (Fp2, d = 1.42) and Fp2_rel_power_gamma (Fp2, d = 1.10) are the most discriminative frontal EEG features, with significant differences across all three tests (p < 0.005). Respiration features (d = 1.06) and occipital EEG channels (PO3_power_beta, d = 1.03) also show good discrimination.
Figure 9 compares the number of significant features in each statistical test to assess consistency across testing methodologies. In the t-test, three significant features had large effect sizes and two had medium effect sizes. The Mann–Whitney U test gave similar results, with three and two significant features in the large and medium effect-size groups, respectively. The KS test detected distributional differences beyond location shifts for two features with large effect sizes. The largest group of features (90 of 205) exhibited small effect sizes without statistical significance, suggesting that only a subset of the extracted features discriminates between emotional states.
Figure 10 shows the association between effect size and statistical significance for all 205 features. Fifteen features with large effect sizes (|d| > 0.8) reached statistical significance (p < 0.05), whereas two did not, although their effect sizes still suggest discernible differences. The largest group (90 features) exhibited small effect sizes (|d| < 0.3) and was not significant, indicating low discriminative value. Notably, 37 features with medium effect sizes and 61 with negligible effect sizes were also not significant, suggesting that, for the medium-effect features, limited sample size or large within-group variability prevented detection.
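The effect sizes quoted above are Cohen's d values. A minimal sketch of the computation follows, assuming the common pooled-standard-deviation form for two independent groups (for example, HVHA versus LVLA trials). The cut-offs 0.8 and 0.3 are the ones quoted in the text. The label for the range in between is an assumption, and the text does not state the cut-off for its "negligible" band, so that band is not modelled here.

```python
from math import sqrt
from statistics import fmean, variance

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (fmean(a) - fmean(b)) / sqrt(pooled_var)

def effect_size_band(d):
    """Coarse banding of |d| using the thresholds quoted in the text."""
    magnitude = abs(d)
    if magnitude > 0.8:
        return "large"
    if magnitude < 0.3:
        return "small"
    return "medium"  # assumed label for 0.3 <= |d| <= 0.8

# Toy values (hypothetical): group means differ by 1, pooled SD is 2
print(cohens_d([2, 4, 6], [1, 3, 5]))  # 0.5
print(effect_size_band(1.42))          # large
```

A feature can have a sizeable d and still miss significance when the groups are small, which is the situation the text describes for the medium-effect features.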
Figure 11 shows the effect size distribution across physiological modalities. Temporal-lobe EEG channels (T8, CP6) and frontal channels (Fp2, Fz) revealed large effect sizes (median d > 0.8), peaking at d = 1.42. Medium to high effects were observed in EEG channels PO3, AF4, and FC6, while peripheral signals (BVP, Respiration, Temperature) showed moderate discrimination power (median d ≈ 0.5). GSR and EMG showed large variability in effect size across their derived features, suggesting that specific features, such as SCR count and tonic level, are more useful for emotion detection than others.
Figure 12 ranks modalities by their number of statistically significant features and mean effect size magnitude. Temporal EEG channel T8 has the highest mean Cohen’s d (0.817) with three significant features, followed by CP6 (0.643, two significant features) and frontal Fp2 (0.565, two significant features). Among peripheral signals, respiration (d = 0.485, one significant feature) outranked GSR (0.261) and EMG (0.258). This analysis quantifies the rationale for our multimodal approach: EEG channels provide the strongest individual discrimination, while peripheral physiological variables (BVP, Respiration, Temperature) contribute additional information for categorizing emotions.

4.2.2. Multiple Testing Considerations

We report uncorrected p-values (p < 0.05) as part of an exploratory analysis to identify potentially discriminative features. This approach inflates the Type I error rate, but features that are significant in several different tests (e.g., the parametric t-test and the non-parametric Mann–Whitney U test) show consistent discriminative potential. Bonferroni or FDR corrections will be applied to control false positives in subsequent confirmatory studies.

4.3. Data Quality and Signal Consistency Analysis

Before feature extraction and model training, all physiological signals were validated through outlier identification and consistency checks between subjective ratings and physiological measurements. According to the EEG outlier analysis (Figure 13a), all trials showed normal statistical properties, indicating reliable feature extraction from consistent recordings. Figure 13b identifies two outlier GSR trials with unusually high tonic levels or SCR counts. A significant correlation exists between self-reported arousal and GSR activity (Figure 13c): skin conductance rose with self-reported arousal, supporting GSR as a physiological marker of arousal.

4.4. Inter-Feature and Feature-Label Correlation Analysis

A Pearson correlation matrix (Figure 14) was computed to examine linear relationships between engineered cross-modal features and emotional labels (Valence, Arousal, Dominance, Liking). Red cells indicate strong positive correlations, while blue cells indicate strong negative correlations. There are two main findings. First, specific physiological systems are interconnected: the EEG–EMG interaction correlates positively with the EEG–EOG interaction (r = 0.43), and the ratio feature shows a strong correlation (r = 0.80). This suggests a close association of muscular tension (EMG) and eye movements (EOG) with EEG patterns. Second, some cross-modal features are linked to the emotional labels. Liking displays moderate negative correlations with the EEG–EMG and EEG–EOG interactions (r = −0.27), indicating that stronger brain–muscle or brain–eye coupling is associated with lower subjective preference. These links highlight the features most relevant to emotion prediction and support the study of cross-system physiological relationships.
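For reference, a correlation matrix of this kind can be computed as in the sketch below. The column names and values are hypothetical placeholders, and the way the study derives its interaction features is not reproduced here. The functions assume every column has non-zero variance.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def correlation_matrix(columns):
    """Return {(name_i, name_j): r} for every pair of named columns."""
    names = list(columns)
    return {(i, j): pearson_r(columns[i], columns[j]) for i in names for j in names}

# Hypothetical per-trial values, for illustration only
columns = {
    "eeg_emg_interaction": [0.2, 0.5, 0.4, 0.9, 0.7],
    "eeg_eog_interaction": [0.1, 0.6, 0.3, 0.8, 0.9],
    "liking": [8.0, 6.5, 7.0, 4.0, 5.0],
}
matrix = correlation_matrix(columns)
print(round(matrix[("eeg_emg_interaction", "liking")], 2))
```

Note that r measures linear association only, so a weak r does not rule out a non-linear dependence between a feature and a label.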

4.5. Dimensionality Reduction and Feature Analysis Using PCA

To analyze the high-dimensional feature space and its relation to emotional states, Principal Component Analysis (PCA) was applied. PCA projects the data onto orthogonal axes, known as principal components, ordered by the variance they capture, thereby facilitating dimensionality reduction. We first examined the linear separability of the four emotional quadrants. The projection onto PC1 and PC2, which together account for 11.1% of the variance, reveals considerable overlap among quadrants, with no clear boundaries (Figure 15a). This overlap suggests that physiological responses are not linearly separable, thereby supporting the use of non-linear models. Despite the lack of class separation, the PCA axes capture important emotional structure. A horizontal gradient of continuous valence scores is evident (Figure 15b), with high-valence trials (red/yellow) located on the right and low-valence trials (blue) on the left, suggesting that PC1 represents the valence dimension. Color-coding the same projection by arousal (Figure 15c) reveals a vertical gradient, with high-arousal trials (yellow/green) at the top and low-arousal trials (purple) at the bottom, suggesting that PC2 encodes arousal. PCA thus organizes the dataset along interpretable emotional dimensions, despite class overlap. The cumulative explained variance plot (Figure 15d) shows that variance is widely distributed: the curve has no clear elbow, and 30 components are required to explain 80% of the variance and 38 components to explain 95%. This supports a multimodal, data-driven approach by illustrating that emotion-related information is spread across many physiological features. To elucidate the physiological relevance of the components, we analyzed the feature loadings of the top four principal components (Figure 15e).
Principal Component 2, associated with arousal, shows strong loadings from GSR features (rms, phasic) and frontal EEG power (Fp2 theta power, Fp2 relative theta power), suggesting that skin conductance and frontal brain activity are indicative of arousal. Higher-order components such as PC3 and PC4 combine EEG, GSR, and EMG features, reflecting complex cross-modal interactions that underpin nuanced emotional variations.
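The component counts read from the cumulative explained variance curve follow from a simple rule, sketched below. It assumes the per-component explained-variance ratios are already available in decreasing order. The toy ratios are hypothetical and much more concentrated than the distribution reported for the DEAP features.

```python
def components_needed(explained_variance_ratio, target):
    """Smallest number of leading components whose cumulative ratio reaches `target`.

    Returns None if the target is never reached.
    """
    cumulative = 0.0
    for count, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= target:
            return count
    return None

# Hypothetical ratios for a four-component toy example
toy_ratios = [0.4, 0.3, 0.2, 0.1]
print(components_needed(toy_ratios, 0.80))  # 3
print(components_needed(toy_ratios, 0.95))  # 4
```

Applied to the study's variance ratios, the same rule would give the 30 and 38 components quoted for the 80% and 95% targets.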

4.6. Modality Contribution and Synergy Analysis

Figure 16a shows that features with minimal individual contributions account for 54.9% of the total, while several modalities contribute substantially. Autonomic features constitute the largest category at 4.8%, followed by GSR and Temperature at 3.6%, and EEG channels, including Fz at 3.2% and AF4 at 2.8%. These findings confirm that emotion recognition relies on multiple signal sources, integrating central nervous system activity (EEG) with autonomic responses (GSR, temperature, BVP). In addition to individual contributions, we examined cross-modal interactions. The synergy matrix in Figure 16b quantifies how well modalities work in combination. A robust correlation is observed between autonomic signals and GSR (r = 0.45), as expected given that GSR reflects autonomic activity. Notable synergy also exists between autonomic/GSR features and EEG channel AF4, as well as BVP signals, with an autonomic–AF4 correlation of 0.30. The results support the fundamental premise of multimodal processing: integrating information across physiological systems produces more nuanced and reliable emotional representations than relying on a single modality.

4.7. Summary of Feature Analysis Findings

The preceding statistical analysis of the DEAP physiological feature space yields five key conclusions, summarised in Table 3, that directly motivate the model architecture presented in Section 5.
Taken together, these findings establish that reliable four-quadrant emotion classification requires a non-linear, high-capacity model capable of integrating spatial, temporal, and cross-modal physiological dependencies simultaneously. This is the design rationale for the hybrid Convolutional Neural Network (CNN)–Long Short-Term Memory (LSTM)–Transformer architecture evaluated in Section 5.

5. Model Training and Evaluation

This section presents the training dynamics, ablation study, error analysis, cross-dataset generalization, and comparative evaluation of the proposed hybrid CNN–LSTM–Transformer model. The feature analysis motivating these architectural choices is presented in Section 4.

5.1. CNN-LSTM-Transformer (Hybrid) Training Analysis

This subsection reports the training dynamics of the proposed hybrid CNN-LSTM-Transformer model in terms of accuracy, loss, and F1-score on the training and validation sets. Figure 17a shows that validation accuracy rose above 85% within the first five epochs, indicating rapid learning of essential physiological patterns, before gradually stabilizing between 88% and 90%, which points to generalization rather than memorization. Figure 17b illustrates the loss curves during training. The training loss decreases steadily throughout, reflecting consistent parameter updates. The validation loss reaches its minimum of 0.502 at epoch 25 and then stabilizes within a narrow band for the remainder of training. Early stopping was applied with a patience of 15 epochs, monitoring validation loss with a minimum improvement threshold of 1 × 10⁻⁴; training concluded at epoch 40 once this patience criterion was met (25 + 15 = 40), and the model weights from epoch 25, the epoch of lowest validation loss, were restored for final evaluation. The validation F1-score was assessed to measure balanced performance (Figure 17c). The model reaches an F1-score of approximately 0.88 in the early epochs and maintains it throughout training, indicating a reliable precision–recall balance and stable generalization to unseen data.
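The early-stopping rule can be made explicit with a small simulation. The sketch assumes the common convention in which an epoch counts as an improvement only if the monitored loss drops by more than the minimum delta, and training halts once `patience` consecutive epochs pass without improvement. The loss curve is synthetic: it is constructed to reach 0.502 at epoch 25 and then plateau, and it is not the study's recorded curve.

```python
def run_early_stopping(val_losses, patience=15, min_delta=1e-4):
    """Return (stop_epoch, best_epoch), both 1-based.

    An epoch improves on the best loss only if it is lower by more than
    `min_delta`. Training stops after `patience` epochs without improvement.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Synthetic curve: steady decrease to 0.502 at epoch 25, then a flat plateau
val_losses = [0.502 + 0.02 * (25 - e) for e in range(1, 26)] + [0.510] * 35
print(run_early_stopping(val_losses))  # (40, 25)
```

The weights restored for evaluation correspond to `best_epoch`, not to the epoch at which training stops.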

5.2. Comparative Model Performance Analysis

Model complexity, training time, inference speed, and test accuracy were compared for the hybrid CNN–LSTM–Transformer architecture and the CNN-only, CNN + LSTM, CNN + Transformer, and LSTM + Transformer variants. The Full Hybrid model has roughly 544 k parameters, compared to 120 k for CNN-only, 260 k for CNN + LSTM, 310 k for CNN + Transformer, and 290 k for LSTM + Transformer (Figure 18a). Parameter count alone does not explain the performance ordering: the LSTM + Transformer variant (290 k) underperforms CNN + Transformer (310 k) by 2.2% despite comparable size. This indicates that architectural composition, specifically the presence or absence of spatial feature extraction, is a stronger determinant of performance than raw capacity, and that the Full Hybrid’s gains reflect structural synergy rather than parameter inflation. To address the concern that gains may be attributable to capacity rather than synergy, a capacity-matched CNN baseline was constructed by widening convolutional filters until reaching parity with the Full Hybrid (15,931 vs. 14,979 parameters, +6.36%). This provides a fairer upper bound than the original 120 k CNN-only baseline. Perfect matching across all five variants remains non-trivial given that CNN filters, LSTM units, and Transformer heads scale via heterogeneous mechanisms; full capacity-matched ablations are reserved as a direct extension of this work. Two lines of evidence support architectural synergy as the primary driver. First, the feature space has low intrinsic dimensionality (38 PCA components explain 95% of variance), above which additional capacity yields diminishing returns. Second, ablation removes entire representational capabilities (bidirectional temporal encoding and global self-attention), not merely parameters, accounting for the observed 4.75–11.25% accuracy losses.
Training duration increases with complexity (Figure 18b), with the hybrid model taking 5.9 min per epoch versus 1.8–3.4 min for partial variants. This overhead is justified because each component encodes a distinct inductive bias (locality, sequential memory, and global cross-modal attention), none of which can be replicated by scaling a simpler architecture. Despite its size, the Full Hybrid achieves competitive inference speed (790 samples/s, Figure 18c) owing to the parallelizability of the CNN and Transformer components. The accuracy–complexity trade-off (Figure 18d) confirms that the largest gains arise not from parameter increases but from components addressing specific modeling gaps: temporal dependencies via LSTM and long-range cross-modal interactions via the Transformer encoder, validating the hybrid framework for high-accuracy emotion recognition.

5.3. Ablation Study: Hierarchical Dependency Modeling

The ablation study removes architectural components one at a time to test whether hierarchical dependency modeling is needed. Rather than validating one specific architecture, it examines how spatial, short-term temporal, and long-range cross-modal modeling each affect emotion recognition performance, particularly for low-arousal emotional states.

5.3.1. Ablation Design

Five model variants were evaluated on the DEAP dataset using the same intra-subject protocol to isolate the contribution of each architectural component. The CNN-only model captures spatial EEG relationships using convolutional layers and a classifier, without temporal context. The CNN + LSTM variant adds recurrent temporal encoding to spatial feature extraction but lacks long-range dependency modeling. The CNN + Transformer variant combines spatial modeling with global self-attention for long-range dependency learning, without recurrent dynamics. The LSTM + Transformer model covers temporal and global dependency modeling but does not extract spatial EEG features. The Full Hybrid (CNN–LSTM–Transformer) architecture combines spatial, short-term temporal, and long-range cross-modal modeling in a complete hierarchical dependency framework.

5.3.2. Ablation Results

Table 4 summarizes the classification performance of each ablated model variant in terms of overall accuracy, F1-score, and recall for the LVLA class.

5.4. Model Performance and Error Analysis

Class-wise error patterns across the four emotional quadrants were examined using normalised confusion matrices. Figure 19e shows strong diagonal dominance for the Full Hybrid model, indicating consistent and balanced classification across all emotional states. The model accurately identifies high- and low-arousal emotions with 91.3% recall for LVLA, 91% for HVLA, and 88% for LVHA, with minimal off-diagonal confusion.
By comparison, the CNN-only model (Figure 19a) confuses the low-arousal classes because it lacks temporal modeling. The CNN + LSTM and CNN + Transformer variants reduce misclassification but still struggle with LVLA identification, suggesting insufficient capacity to model delayed physiological responses. The LSTM + Transformer model (Figure 19d) improves temporal dependency learning but lacks spatial feature extraction.
These findings suggest that partial or recurrent-only architectures cannot capture the long-range, cross-modal physiological interactions needed for low-arousal emotion identification. The Full Hybrid model captures delayed and hierarchical physiological patterns through convolutional spatial learning, recurrent temporal encoding, and Transformer-based attention, reducing LVLA confusion and improving classification reliability.

5.5. Cross-Dataset Generalization Analysis

The hierarchical dependency modeling framework was evaluated across datasets using DREAMER to test robustness and generalizability. DREAMER presents a realistic domain shift relative to DEAP: it also provides multimodal physiological recordings, including EEG and peripheral data, but these were collected under different experimental settings and participant demographics. DEAP’s preprocessing methodology and feature extraction were reused to ensure methodological alignment, physiological modalities from both datasets were retained, and the multimodal features were fused as before. To assess transferability under strict cross-dataset conditions, the hybrid CNN–LSTM–Transformer model was trained on DEAP and tested on DREAMER without retraining. Table 5 reports performance across datasets. Differences in signal acquisition, subject variability, and annotation methods reduce performance relative to intra-dataset evaluation. Nevertheless, the proposed model maintains separability of the emotional quadrant classes, including in low-arousal conditions.
Cross-dataset recall for low-arousal states suggests that the learnt representations capture physiologically relevant relationships rather than dataset-specific artifacts. The performance gap is consistent with the affective computing literature and reflects the general difficulty of dataset-independent emotion recognition. These findings indicate that modeling hierarchical and cross-modal physiological dependencies improves robustness to domain shift.

Feature Alignment Across Datasets

A critical methodological consideration in cross-dataset evaluation is the structural mismatch between DEAP and DREAMER. DEAP provides 32-channel EEG recorded via BioSemi ActiveTwo at 512 Hz (downsampled to 128 Hz) alongside peripheral modalities including GSR, Blood Volume Pulse (BVP), Respiration, skin Temperature, and EMG. DREAMER, by contrast, was acquired using the 14-channel Emotiv EPOC headset at 128 Hz and provides only ECG and GSR as peripheral signals.
To align the 150-dimensional feature vector across datasets, the strategies summarised in Table 6 were applied.
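One simple way to keep a fixed-length feature vector when a target dataset lacks some modalities is zero-imputation, which the Discussion mentions for absent peripheral signals. The sketch below illustrates only that idea. The feature names are hypothetical, the reference list is shortened to four entries instead of 150, and the sketch does not reproduce the specific strategies of Table 6.

```python
def align_features(sample, reference_names, fill_value=0.0):
    """Build a vector in the reference feature order.

    Features missing from `sample` are filled with `fill_value`.
    Features in `sample` that are not in the reference are dropped.
    """
    return [float(sample.get(name, fill_value)) for name in reference_names]

# Hypothetical reference order derived from the source dataset
reference_names = ["eeg_alpha_AF3", "gsr_tonic", "bvp_hr", "resp_rate"]

# Hypothetical target-dataset sample with no BVP or respiration recordings
target_sample = {"eeg_alpha_AF3": 0.7, "gsr_tonic": 1.3, "ecg_hr": 72.0}

print(align_features(target_sample, reference_names))  # [0.7, 1.3, 0.0, 0.0]
```

Zero-filled entries carry no information for the model, which is one plausible reason for an accuracy drop under cross-dataset evaluation.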

5.6. Comparison with Recent State-of-the-Art Methods

Before presenting quantitative comparisons, it is necessary to distinguish the task formulations used across the literature, as these directly affect how the accuracy figures should be interpreted (see also Figure 1 and Table 7).
Table 8 compares the proposed method against recent 2023–2025 approaches. Given the difference in task complexity, we focus on architectural trends and LVLA recall rather than raw accuracy figures.
To situate our model within current developments in emotion recognition, we compare it with state-of-the-art methods from 2023–2025. To reflect current research, only recent deep learning architectures with attention mechanisms, hybrid temporal modelling, or multimodal fusion are included. Direct numerical comparison is often not possible because studies differ in task formulation, dataset selection, label definitions, and evaluation methodology: previous investigations variously used binary or multi-class emotion categorisation, subject-dependent or cross-subject evaluation, and unimodal or multimodal signals. For fairness, we report performance alongside the experimental conditions and identify the methodological aspects that affect results. Table 8 summarises the comparison. Whereas many current techniques perform well on binary classification tasks or limited subject cohorts, the proposed model, using a multimodal physiological setup, remains competitive on the more difficult four-quadrant valence–arousal classification problem. It is also resilient in low-arousal emotional states, which remain challenging in affective computing.
Interpreting the gap between our model’s accuracy (91.2%) and that of recent techniques (95–98%) requires care. First, while [8] achieves 98.9% accuracy in binary valence/arousal classification, our four-quadrant classification is more complex, requiring simultaneous discrimination across HVHA, HVLA, LVHA, and LVLA states, that is, four classes rather than two. Second, binary classification conflates distinct emotions: a binary high-arousal label groups HVHA (excitement) with LVHA (stress), whereas our approach separates them with 88–91% accuracy. In information-theoretic terms, a uniform four-class distribution has twice the entropy (H = 2.0 bits) of a uniform binary distribution (H = 1.0 bit), making the classification task harder. Real-world affective monitoring requires resilience to multi-class emotional ambiguity, especially in low-arousal settings, whereas many current methods emphasise binary classification accuracy.
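The entropy figures quoted above follow from the Shannon entropy of a uniform label distribution, H = log2(K) bits for K equally likely classes. A short check, assuming perfectly balanced classes:

```python
from math import log2

def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

def uniform_entropy_bits(n_classes):
    """Entropy of a uniform distribution over n_classes outcomes."""
    return log2(n_classes)

print(uniform_entropy_bits(2))    # 1.0
print(uniform_entropy_bits(4))    # 2.0
print(entropy_bits([0.25] * 4))   # 2.0
```

Entropy here describes label uncertainty before seeing any signal. It indicates that the four-class task starts from a harder prior, but it does not by itself predict the accuracy gap between task formulations.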

5.7. Qualitative and Comparative Analysis

This study’s contribution is framed by a qualitative comparison with state-of-the-art analyses of the DEAP dataset and by the internal model ablation results (Table 9). Direct comparison is difficult due to variations in task formulation (binary versus four-quadrant categorisation), label representation (separate versus joint valence–arousal modelling), and evaluation techniques (intra-subject versus cross-subject). The CNN-LSTM-Transformer hybrid model achieves over 91% accuracy for four-quadrant classification, demonstrating stable and competitive performance. Previous research (e.g., [11]) found high training accuracy in CNN-based recurrent architectures but limited test accuracy due to overfitting. This work reports test-set performance for all architectures, which gives a better indication of real-world applicability. In the error analysis, partial hybrids show fewer misclassifications than single-module baselines but still confuse low-arousal states, while the Full Hybrid model obtains 91.3% recall in the LVLA quadrant. By combining convolutional spatial feature extraction, recurrent temporal modelling, and Transformer-based attention, full spatial-temporal-attentional fusion captures delayed and cross-modal physiological dependencies better than partial integrations. Modern multimodal hybrid techniques, such as [7], report 85–95% accuracy on DEAP. The internal ablation supports the proposed model’s architectural design. While some studies limit categorisation to binary valence or arousal (e.g., [12]), four-quadrant classification is more difficult and requires simultaneous discrimination of related emotional states. Intra-subject evaluation in this study aligns with individualised affective modelling, whereas cross-subject research such as [13] reports 72.81% mean accuracy for DEAP binary tasks, highlighting generalisation issues. The hybrid architecture’s accuracy of over 91% suggests that it approaches the practical upper bound for subject-specific four-quadrant emotion recognition on the DEAP dataset.

5.8. Model Capacity and Generalization Analysis

Despite a 532:1 parameter-to-sample ratio (544,659 parameters, 1024 samples), the model shows controlled generalization, supported by empirical evidence and coordinated regularization. Figure 17 shows four indicators of pattern learning rather than memorization: (1) accuracy plateaus at 90–92%, (2) loss stabilizes at 0.05, (3) the train–test gap is modest (3–5%), and (4) the validation F1-score remains at 0.87–0.88 without degradation. Stability is confirmed by cross-validation, with μ = 91.98% ± 1.70% across five folds.
Four regularization strategies constrain the parameter space: dropout (p = 0.5); L2 regularization (λ = 0.01); early stopping (patience = 15 epochs, minimum delta = 1 × 10⁻⁴), with training completing at epoch 40 and best weights restored from epoch 25 (25 + 15 = 40); and focal loss (γ = 2) to prioritize hard examples over easy-sample memorization. The focal loss hyperparameter γ = 2 was validated through a post-hoc sensitivity analysis across γ ∈ {0.5, 1.0, 2.0, 3.0, 5.0} using five-fold cross-validation. The mean validation accuracy varied by at most 5 percentage points across the range examined, with no statistically significant difference between γ = 2 and the empirically optimal value (p = 0.1998), confirming that the reported 91.2% accuracy is robust to the choice of focusing parameter.
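The effect of the focusing parameter can be seen from the per-sample focal loss, FL(p) = −(1 − p)^γ · log(p), where p is the predicted probability of the true class. The sketch below assumes this standard form without a class-weighting term, since the text does not state whether such a term was used.

```python
from math import log

def focal_loss(p_true, gamma=2.0):
    """Per-sample focal loss for the predicted probability of the true class.

    With gamma = 0 this reduces to ordinary cross-entropy, -log(p_true).
    """
    return -((1.0 - p_true) ** gamma) * log(p_true)

def down_weighting(p_true, gamma=2.0):
    """Factor by which focal loss scales cross-entropy for this sample."""
    return (1.0 - p_true) ** gamma

# An easy example (p = 0.9) is scaled far more strongly than a hard one (p = 0.3)
for p in (0.9, 0.6, 0.3):
    print(p, round(down_weighting(p), 4), round(focal_loss(p), 4))
```

With γ = 2, a sample predicted at p = 0.9 contributes about one hundredth of its cross-entropy loss, so the gradient is dominated by harder samples.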
Physiological signals have low-entropy dynamics, with 38 components explaining 95% of variance (Section 4.5), indicating intrinsic dimensionality well below 544 k parameters. EEG alpha rhythms (8–12 Hz), GSR rise-recovery patterns (τ ≈ 1–3 s), and cardiac intervals (RR ≈ 0.6–1.2 s) offer structured manifolds that support generalization from small data. Comparison with lightweight baselines confirms the need for full capacity: CNN-only (81.3%) and CNN + LSTM (87.9%) both underperform the Full Hybrid (91.2%), and omitting any single component reduces accuracy by 4.75–11.25%, indicating that all architectural elements contribute to hierarchical dependency modeling. Modern deep learning theory on implicit regularization supports the reliable generalization of high-capacity architectures in data-scarce clinical contexts when paired with appropriate structural priors and explicit regularization.

6. Discussion

This study developed and evaluated a hybrid CNN–LSTM–Transformer architecture for multimodal physiological emotion classification across five specific objectives. Statistical feature analysis confirmed class-dependent physiological signatures across all four emotional quadrants. EEG temporal channels provided the strongest individual discrimination, while peripheral signals, particularly respiration and GSR, contributed complementary information not captured by EEG alone, validating the multimodal approach. The hybrid model achieved 91.2% four-quadrant accuracy with stable cross-validation performance (μ = 91.98% ± 1.70%), which is particularly notable given that the four-quadrant task carries twice the information-theoretic complexity (H = 2.0 bits) of the binary formulations dominant in comparable literature.
Ablation analysis confirmed that all three architectural components are independently necessary, with removal of any single component reducing accuracy by 4.75–11.25%. The LVLA quadrant showed the greatest sensitivity to architectural completeness, with recall improving from 68.4% to 91.3% under full integration. This carries direct clinical significance: LVLA states, corresponding to sadness and depression-adjacent presentations, are the most relevant to early mental health surveillance, yet the hardest to detect from physiological data due to their subtle, temporally delayed autonomic signatures. The capacity to reliably distinguish these states from high-arousal stress responses using non-invasive wearable signals addresses a gap that episodic clinical assessment is poorly positioned to fill.
Cross-dataset evaluation on DREAMER yielded 76.4% accuracy, with the 14.8-percentage-point reduction attributable to hardware mismatches and zero-imputation of absent peripheral modalities. The LVLA recall of 81.2% nonetheless confirms that learnt representations reflect physiologically meaningful patterns rather than dataset-specific artefacts. State-of-the-art methods reporting 95–98% accuracy address binary classification problems with half the entropy, and none report LVLA performance in isolation, making direct numerical comparison inherently misleading.
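The entropy figures used in this comparison (H = 1.0 bit for a binary task, H = 2.0 bits for the four-quadrant task) follow from the Shannon entropy of a uniform label distribution. The sketch below reproduces them under the assumption of perfectly balanced classes; the function name is illustrative, and an imbalanced label distribution would give a lower value.

```python
import math


def shannon_entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Balanced binary task (e.g., high vs. low valence).
binary_entropy = shannon_entropy_bits([0.5, 0.5])

# Balanced four-quadrant task (HVHA, HVLA, LVHA, LVLA).
quadrant_entropy = shannon_entropy_bits([0.25] * 4)

# A hypothetical imbalanced four-class distribution has lower entropy.
imbalanced_entropy = shannon_entropy_bits([0.4, 0.3, 0.2, 0.1])

print(binary_entropy, quadrant_entropy, imbalanced_entropy)
```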

7. Challenges and Future Work

7.1. Challenges

Multimodal stress detection must overcome several practical obstacles before adoption. Synchronising continuously streamed physiological data with discrete facial camera frames is difficult yet necessary for integration. Small changes in illumination and background noise can degrade feature extraction accuracy in facial recognition systems. Facial data collection requires strict anonymization and data management to address privacy and ethical concerns. The architecture limits scalability, making adoption across populations and technological platforms difficult. Real-time analysis, especially for deep learning models handling video input, demands high-performance hardware that portable or wearable devices generally lack, which constrains continuous monitoring applications.

7.2. Future Work

Developing algorithms that adapt to changing environmental conditions would increase the robustness of the facial recognition module. Advanced imputation algorithms could compensate for missing data, especially in the facial modality. The training and validation datasets will be expanded to cover more demographics and stress-inducing scenarios to improve the model’s generalizability. Future extensions may also incorporate complementary behavioral modalities such as gaze and eye-tracking signals, which have shown increasing relevance in affective and psychological monitoring frameworks and may enhance cross-modal emotional inference when integrated with physiological data streams [18]. Optimising the processing pipeline for real-time operation will improve computing performance and enable deployment on wearable and portable devices. Future work will also examine how modern cryptography and privacy-preserving machine learning approaches such as federated learning can address privacy concerns.

8. Conclusions

This study developed and evaluated a hybrid CNN–LSTM–Transformer framework for four-quadrant valence–arousal emotion classification from multimodal physiological signals, addressing five specific objectives. Discriminative features were successfully extracted and validated across EEG, GSR, BVP, and respiration modalities, with EEG temporal channels and cross-modal interactions identified as the most informative, confirming that no single signal source alone is sufficient for reliable emotional state discrimination. The proposed architecture achieved 91.2% four-quadrant accuracy, outperforming all partial-hybrid and single-component baselines by 4.75–11.25%, with ablation analysis demonstrating that spatial (CNN), temporal (LSTM), and cross-modal attention (Transformer) modeling are each independently necessary. Most notably, LVLA recall improved from 68.4% in the CNN-only baseline to 91.3% with full architectural integration—a result of particular clinical significance, as this quadrant corresponds to the depression-adjacent and sadness states that are both the most therapeutically relevant and the hardest to detect from physiological data. Cross-dataset evaluation on DREAMER further demonstrated meaningful transferability under hardware and modality mismatch conditions (76.4% accuracy, 81.2% LVLA recall), while contextualisation against 2023–2025 state-of-the-art methods confirmed that the proposed model addresses a substantially harder and more clinically expressive classification problem than existing binary-formulation approaches. Together, these findings establish that reliable continuous discrimination of stress, sadness, and related mood states from wearable physiological signals is achievable through hierarchical multimodal deep learning. 
The framework’s demonstrated capacity to distinguish depression-adjacent LVLA states from high-arousal stress responses directly addresses a critical unmet need in mental health monitoring, one that episodic clinical assessment and self-report instruments are poorly positioned to meet.

Author Contributions

Conceptualization: C.R. and H.Y.; Data Curation: A.M. and J.J.; Investigation: C.R., H.Y. and R.D.A.R.; Methodology: H.Y., R.D.A.R. and S.R.; Project Administration: C.R.; Resources: A.M. and J.J.; Software: R.D.A.R. and J.J.; Supervision: C.R. and S.R.; Writing—Original Draft: A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This study has been partially supported by the Smart.EDU project, Grant № BR28713531, by the Committee of Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan.

Institutional Review Board Statement

No participants were involved in this study. Hence, ethical approval, informed consent, and adherence to institutional or licensing regulations are not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed during the current study are available online at https://github.com/openmedlab/Awesome-Medical-Dataset/blob/main/resources/DEAP.md (accessed on 10 February 2025).

Conflicts of Interest

The authors declare no potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BVP: Blood Volume Pulse
CAR: Common Average Referencing
CNN: Convolutional Neural Network
DEAP: Database for Emotion Analysis using Physiological Signals
EEG: Electroencephalography
EMG: Electromyography
EOG: Electrooculography
FDR: False Discovery Rate
GSR: Galvanic Skin Response
HRV: Heart Rate Variability
HVHA: High-Valence High-Arousal
HVLA: High-Valence Low-Arousal
LSTM: Long Short-Term Memory
LVHA: Low-Valence High-Arousal
LVLA: Low-Valence Low-Arousal
PCA: Principal Component Analysis
SAM: Self-Assessment Manikin
SCR: Skin Conductance Response
V–A: Valence–Arousal

References

  1. Patlar Akbulut, F. Hybrid deep convolutional model-based emotion recognition using multiple physiological signals. Comput. Methods Biomech. Biomed. Eng. 2022, 25, 1678–1690.
  2. Gohumpu, J.; Xue, M.; Bao, Y. Emotion recognition with multi-modal peripheral physiological signals. Front. Comput. Sci. 2023, 5, 1264713.
  3. Li, Q.; Liu, Y.; Yan, F.; Zhang, Q.; Liu, C. Emotion recognition based on multiple physiological signals. Biomed. Signal Process. Control 2023, 85, 104989.
  4. Koelstra, S.; Muhl, C.; Soleymani, M.; Lee, J.S.; Yazdani, A.; Ebrahimi, T.; Pun, T.; Nijholt, A.; Patras, I. DEAP: A Database for Emotion Analysis; Using Physiological Signals. IEEE Trans. Affect. Comput. 2012, 3, 18–31.
  5. Greco, A.; Lanata, A.; Citi, L.; Vanello, N.; Valenza, G.; Scilingo, E.P. Skin admittance measurement for emotion recognition: A study over frequency sweep. Electronics 2016, 5, 46.
  6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (accessed on 14 May 2026).
  7. Mouazen, B.; Benali, A.; Chebchoub, N.T.; Abdelwahed, E.H.; De Marco, G. Enhancing EEG-Based Emotion Detection with Hybrid Models: Insights from DEAP Dataset Applications. Sensors 2025, 25, 1827.
  8. Hu, F.; Wang, F.; Bi, J.; An, Z.; Chen, C.; Qu, G.; Han, S. HASTF: A hybrid attention spatio-temporal feature fusion network for EEG emotion recognition. Front. Neurosci. 2024, 18, 1479570.
  9. Du, Y.; Li, P.; Cheng, L.; Zhang, X.; Li, M.; Li, F. Attention-based 3D convolutional recurrent neural network model for multimodal emotion recognition. Front. Neurosci. 2024, 17, 1330077.
  10. Pichandi, S.; Balasubramanian, G.; Chakrapani, V. Hybrid deep models for parallel feature extraction and enhanced emotion state classification. Sci. Rep. 2024, 14, 24957.
  11. Zamani, F.; Wulansari, R. Emotion classification using 1D-CNN and RNN based on deap dataset. Nat. Lang. Process. 2021, 363–378. Available online: https://csitcp.com/paper/11/1123csit28.pdf (accessed on 14 May 2026).
  12. Singh, U.; Shaw, R.; Patra, B.K. A data augmentation and channel selection technique for grading human emotions on DEAP dataset. Biomed. Signal Process. Control 2023, 79, 104060.
  13. Cimtay, Y.; Ekmekcioglu, E. Investigating the use of pretrained convolutional neural network on cross-subject and cross-dataset EEG emotion recognition. Sensors 2020, 20, 2034.
  14. Chakravarthi, B.; Ng, S.C.; Ezilarasan, M.; Leung, M.F. EEG-based emotion recognition using hybrid CNN and LSTM classification. Front. Comput. Neurosci. 2022, 16, 1019776.
  15. Zheng, W.L.; Zhu, J.Y.; Lu, B.L. Identifying stable patterns over time for emotion recognition from EEG. IEEE Trans. Affect. Comput. 2017, 10, 417–429.
  16. Alhagry, S.; Fahmy, A.A.; El-Khoribi, R.A. Emotion recognition based on EEG using LSTM recurrent neural network. Int. J. Adv. Comput. Sci. Appl. 2017, 8. Available online: https://pdfs.semanticscholar.org/6a4d/4bf50c412635335af1b8444d427d9c3e8c86.pdf (accessed on 14 May 2026).
  17. Yang, H.; Han, J.; Min, K. A multi-column CNN model for emotion recognition from EEG signals. Sensors 2019, 19, 4736.
  18. Fiani, F.; Napoli, C.; Randieri, C.; Russo, S. Current trends and future directions in eye tracking technology: A literature review. Eng. Appl. Artif. Intell. 2026, 163, 112908.
Figure 1. Comparison of emotion classification task formulations in the Valence–Arousal space. (a) Binary valence only. (b) Binary arousal only. (c) Binary V/A as two independent tasks. (d) Four-quadrant joint classification adopted in this work, distinguishing HVHA (excitement), HVLA (contentment), LVHA (stress/anger), and LVLA (sadness). The four-quadrant task has entropy H = 2.0 bits, twice that of any binary formulation (H = 1.0 bit).
Figure 2. The complete data processing and feature fusion pipeline. EEG, autonomic, and cross-modal characteristics are extracted from raw physiological signals in parallel and combined into a 150-dimensional feature vector.
Figure 3. Detailed hybrid CNN-LSTM-Transformer model architecture. Before classifying emotions in four quadrants, the model examines the input feature vector to capture local, temporal, and global patterns.
Figure 4. Dynamic physiological signatures of emotion. (a) EEG alpha power fluctuations. (b) Arousal indicators, including GSR tonic level and SCR count. (c) Arousal-related respiratory patterns. (d) Peripheral temperature shows a slight correlation with valence. (e) Heatmap illustrating cross-modal physiological system dynamics.
Figure 5. 2D valence–arousal emotion space distribution of 40 experimental trials. The ground-truth labels form four distinct but partially overlapping groups of data points.
Figure 6. Histograms of physiological features for distinct emotions. (a) The four emotion quadrants exhibit different EEG alpha power distributions. (b) GSR tonic level distinguishes between high and low arousal states.
Figure 7. Emotion recognition cross-modal feature importance ranking.
Figure 8. The top 15 HVHA/LVLA discrepancies. The heatmap displays three statistical tests’ uncorrected p-values. Low, significant p-values are dark-red.
Figure 9. Comparison of significant features identified by different statistical tests (uncorrected p < 0.05).
Figure 10. Distribution of features by effect size category and statistical significance.
Figure 11. Violin plots showing the distribution of Cohen’s d effect sizes across all physiological modalities for HVHA vs. LVLA comparison.
Figure 12. Modality importance ranking based on mean Cohen’s d effect sizes.
Figure 13. Analysis of data quality and consistency. (a) EEG power data exhibited no extreme outliers. (b) Extreme GSR responses identified two trials as outliers. (c) GSR activity consistently aligned with self-reported arousal across trials.
Figure 14. Cross-modal feature–emotion label Pearson correlation matrix. The heatmap shows the intensity and direction of linear correlations between all pairs of variables, showing physiological system coupling to emotional aspects.
Figure 15. Principal component analysis (PCA) of the feature space. (a) Emotion quadrant projection highlighting class overlap. (b,c) Valence- and arousal-colored projections showing that PC1 and PC2 capture the primary emotional dimensions. (d) Explained variance plot illustrating the retained data dimensionality. (e) Dominant physiological feature loadings contributing to the principal components.
Figure 16. Modality importance and synergy analysis. (a) Pie chart illustrating the percentage contribution of key physiological modalities within the feature set. (b) Heatmap representing inter-modality synergy and interaction strength.
Figure 17. Training and validation performance of the proposed hybrid CNN–LSTM–Transformer model. (a) Accuracy over training epochs. (b) Loss convergence across epochs. (c) Validation F1 score progression, demonstrating rapid and stable learning behavior.
Figure 18. (a) Model complexity measured by the number of trainable parameters. (b) Training time per epoch. (c) Inference speed in samples per second. (d) Accuracy versus model complexity trade-off.
Figure 19. Confusion matrices for the five evaluated models. (a) Baseline CNN model. (b) CNN + LSTM hybrid. (c) CNN + Transformer hybrid. (d) LSTM + Transformer hybrid. (e) Full hybrid CNN–LSTM–Transformer model, demonstrating the strongest diagonal classification performance.
Table 1. Summary of the three-phase experimental and modelling procedure.
| Phase | Activity | Details |
| --- | --- | --- |
| Phase 1 | Signal Acquisition | Multimodal physiological signals (EEG, GSR, BVP, Respiration, Temperature, EMG) recorded continuously during stimulus presentation using a BioSemi ActiveTwo system at 512 Hz, subsequently downsampled to 128 Hz. |
| Phase 2 | Preprocessing and Feature Extraction | Raw signals underwent Butterworth filtering, artefact removal, signal decomposition, and feature engineering, yielding a 150-dimensional normalised feature vector per trial. |
| Phase 3 | Model Training and Evaluation | The hybrid CNN–LSTM–Transformer model was trained using an 80/20 stratified split with five-fold cross-validation and evaluated against ablated variants and the external DREAMER dataset. |
This pipeline processes the DEAP dataset to classify physiological responses into four distinct valence–arousal emotion quadrants.
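The stratified five-fold protocol in Phase 3 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `stratified_folds`, the seed, and the 40 balanced hypothetical labels are assumptions. Each held-out fold contains 20% of the samples, which matches an 80/20 proportion; how the study combines the hold-out split with cross-validation is not specified in this section.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign sample indices to folds so that each fold roughly preserves
    the class proportions of the full label list."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# 40 hypothetical trials with balanced quadrant labels (illustration only).
labels = ["HVHA", "HVLA", "LVHA", "LVLA"] * 10
folds = stratified_folds(labels)

for fold_id, test_indices in enumerate(folds):
    held_out = set(test_indices)
    train_indices = [i for i in range(len(labels)) if i not in held_out]
    print(fold_id, len(train_indices), len(test_indices))
```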
Table 2. Phase distortion comparison between single-pass IIR and zero-phase filtfilt implementations for the N = 4 Butterworth filter.
| Filter Method | Mean Delay (ms) | Max Delay (ms) | Passband Ripple (dB) | Nyquist Atten. (dB) | Phase Distortion |
| --- | --- | --- | --- | --- | --- |
| lfilter (single-pass IIR) | 10.1 | 32.1 | 3.0 | 240.0 | Present |
| filtfilt (forward-backward) | 0.0 | 0.0 | 6.0 | 480.0 | Strictly zero |
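The zero-delay property of forward-backward filtering can be illustrated with a simplified sketch. It uses a first-order low-pass filter as a stand-in for the fourth-order Butterworth filter (an assumption made to keep the example dependency-free); in practice, SciPy's `scipy.signal.lfilter` and `scipy.signal.filtfilt` provide the single-pass and forward-backward variants. The smoothing factor `alpha = 0.2` and the centroid-based delay estimate are illustrative choices.

```python
def one_pole_lowpass(signal, alpha):
    """Causal first-order IIR low-pass: y[n] = alpha*x[n] + (1-alpha)*y[n-1]."""
    out, prev = [], 0.0
    for x in signal:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out


def forward_backward(signal, alpha):
    """Filter forward, then filter the time-reversed result and reverse it
    back, which cancels the phase shift of the single pass."""
    forward = one_pole_lowpass(signal, alpha)
    backward = one_pole_lowpass(forward[::-1], alpha)
    return backward[::-1]


def centroid(response):
    """Centre of mass of a response, used here as a simple delay estimate."""
    return sum(i * v for i, v in enumerate(response)) / sum(response)


# Unit impulse at sample 100 of a 201-sample signal.
impulse = [0.0] * 201
impulse[100] = 1.0

single_pass_delay = centroid(one_pole_lowpass(impulse, 0.2)) - 100
zero_phase_delay = centroid(forward_backward(impulse, 0.2)) - 100
print(single_pass_delay, zero_phase_delay)
```

The single pass shifts the response centroid by about (1 − alpha)/alpha samples, whereas the forward-backward response stays centred on the impulse. Applying the filter twice also squares the magnitude response, which is consistent with the doubled attenuation values reported for filtfilt in Table 2.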
Table 3. Summary of key findings from the physiological feature analysis and their architectural implications for the proposed hybrid model.
| No. | Finding | Evidence | Architectural Implication |
| --- | --- | --- | --- |
| 1 | No linear separability | PCA onto top two components explains only 11.1% of variance; all four quadrants show substantial overlap with no clear boundaries | Non-linear modeling is essential; linear classifiers are fundamentally insufficient for this feature space |
| 2 | Emotion information is broadly distributed | 95% variance requires 38 principal components; discriminative features span EEG (d up to 1.42), respiration (d = 1.06), and GSR tonic level | No single modality suffices; a high-capacity multimodal architecture is required |
| 3 | EEG leads, peripherals complement | T8 channel: mean absolute d = 0.817 (3 sig. features); Fp2: d = 0.565; respiration: d = 0.485; GSR: d = 0.261 | All modalities must be retained; peripheral signals add discriminative power for valence-related states that EEG alone cannot capture |
| 4 | Cross-modal interactions are predictive | EEG–EMG interaction score: 0.119; BVP–respiration ratio: 0.103; synergy r = 0.45 (autonomic–GSR), r = 0.30 (autonomic–AF4) | Relationships between signals carry emotion-relevant information, directly motivating the Transformer-based cross-modal attention layer |
| 5 | Feature selection has a false discovery risk | At α = 0.05 (uncorrected), ∼65 of 1307 features may be false positives; FDR retains 389, Bonferroni retains only 21 | Results are exploratory; confirmatory studies should apply FDR or Bonferroni correction to control family-wise error rate |
These statistical insights directly outline the necessity for the multimodal CNN-LSTM-Transformer architecture.
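The multiple-comparison corrections named in finding 5 can be sketched as below. This is a generic illustration of the Bonferroni and Benjamini–Hochberg procedures, not the authors' analysis code; the function names and the ten example p-values are hypothetical. The expected count of roughly 65 false positives corresponds to 0.05 × 1307 under the assumption that every null hypothesis is true.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 where p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k/m)*alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


# Expected false positives if all 1307 uncorrected tests were true nulls.
expected_false_positives = 0.05 * 1307

# Hypothetical p-values for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_bonferroni = sum(bonferroni_reject(p_values))
n_fdr = sum(benjamini_hochberg_reject(p_values))
print(expected_false_positives, n_bonferroni, n_fdr)
```

Bonferroni is the more conservative of the two, which is consistent with it retaining far fewer features than FDR in the table above.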
Table 4. Ablation study results under intra-subject evaluation on the DEAP dataset.
| Model Variant | Accuracy (%) | F1-Score | LVLA Recall (%) |
| --- | --- | --- | --- |
| CNN-only | 81.3 | 0.79 | 68.4 |
| CNN + LSTM | 87.9 | 0.85 | 82.1 |
| CNN + Transformer | 88.4 | 0.86 | 84.7 |
| LSTM + Transformer | 86.2 | 0.84 | 79.3 |
| Full Hybrid (Our Model) | 91.2 | 0.88 | 91.3 |
Removing any architectural component degrades performance, highlighting that the full hybrid model is essential for maximizing accuracy and LVLA recall.
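For readers unfamiliar with how the accuracy and LVLA recall columns are derived, the following sketch computes both from a four-class confusion matrix. The counts are hypothetical and chosen only for illustration; they are not the confusion matrices of Figure 19, and the helper names are assumptions.

```python
LABELS = ["HVHA", "HVLA", "LVHA", "LVLA"]


def per_class_recall(confusion):
    """Recall per class; rows are true classes, columns are predictions."""
    recalls = {}
    for i, (label, row) in enumerate(zip(LABELS, confusion)):
        row_total = sum(row)
        recalls[label] = row[i] / row_total if row_total else 0.0
    return recalls


def accuracy(confusion):
    """Overall accuracy: correctly classified samples over all samples."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical counts for illustration only (50 trials per true class).
confusion = [
    [45, 2, 2, 1],   # true HVHA
    [3, 44, 1, 2],   # true HVLA
    [2, 1, 46, 1],   # true LVHA
    [1, 3, 1, 45],   # true LVLA
]

recalls = per_class_recall(confusion)
overall = accuracy(confusion)
print(recalls["LVLA"], overall)
```

Reporting per-class recall alongside accuracy matters here because a model can reach high overall accuracy while still missing many LVLA samples.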
Table 5. Cross-dataset generalization results (DEAP → DREAMER).
| Training → Testing | Accuracy (%) | F1-Score | LVLA Recall (%) |
| --- | --- | --- | --- |
| DEAP → DEAP | 91.2 | 0.88 | 91.3 |
| DEAP → DREAMER | 76.4 | 0.73 | 81.2 |
The accuracy drop reflects hardware mismatches and missing peripheral modalities in DREAMER, yet robust LVLA recall confirms strong generalization.
Table 6. Feature alignment strategies applied for DEAP → DREAMER cross-dataset evaluation.
| Alignment Aspect | Strategy Applied | Remarks |
| --- | --- | --- |
| EEG Channel Mapping | 14 DREAMER channels (AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, AF4) matched to nearest DEAP equivalents via 10–20 system coordinates | 18 unmatched DEAP channels imputed with subject-wise channel means from DREAMER EEG; feature vector dimensionality preserved |
| Absent Peripheral Modalities (BVP, Respiration, Temperature) | Features derived from these modalities set to zero in the DREAMER feature vector | Conservative strategy avoids synthetic signal generation; partially accounts for the 91.2% → 76.4% accuracy drop, as the model was trained on non-zero values |
| GSR Harmonisation | Same tonic/phasic decomposition pipeline (cvxEDA, Section 3.4.3) applied uniformly to both datasets | Ensures consistency in skin conductance feature extraction across recording hardware |
| ECG as BVP Surrogate | DREAMER’s ECG processed to extract HRV features: RMSSD, SDNN, mean RR interval | Serves as a partial substitute for BVP-derived cardiac features absent in DREAMER |
| Feature Normalisation | All features z-score normalised using DREAMER’s own per-subject statistics prior to inference | Mitigates distributional shift from hardware and recording condition differences |
These alignment strategies ensure structural and statistical compatibility between the disparate DEAP and DREAMER hardware setups.
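The three HRV features listed for the ECG surrogate (mean RR interval, SDNN, RMSSD) have standard time-domain definitions, sketched below. The function name, the use of the sample (n − 1) standard deviation for SDNN, and the example RR intervals are assumptions for illustration; R-peak detection from the raw ECG is outside the scope of this sketch.

```python
import math


def hrv_features(rr_intervals_s):
    """Time-domain HRV features from a list of RR intervals in seconds."""
    n = len(rr_intervals_s)
    if n < 2:
        raise ValueError("at least two RR intervals are required")

    mean_rr = sum(rr_intervals_s) / n

    # SDNN: standard deviation of the RR intervals (sample estimate, n - 1).
    sdnn = math.sqrt(sum((rr - mean_rr) ** 2 for rr in rr_intervals_s) / (n - 1))

    # RMSSD: root mean square of successive RR differences.
    diffs = [b - a for a, b in zip(rr_intervals_s, rr_intervals_s[1:])]
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))

    return {"mean_rr": mean_rr, "sdnn": sdnn, "rmssd": rmssd}


# Hypothetical RR intervals in seconds (illustration only).
features = hrv_features([0.80, 0.82, 0.78, 0.80])
print(features)
```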
Table 7. Emotion classification task formulations in the Valence–Arousal space.
| Formulation | What It Classifies | Classes | Entropy |
| --- | --- | --- | --- |
| Binary V | Positive vs. negative valence; arousal ignored | 2 | H = 1.0 bit |
| Binary A | High vs. low arousal; valence ignored | 2 | H = 1.0 bit |
| Binary V/A | Each dimension independently; conflates, e.g., HVHA (excitement) with LVHA (stress) | 2 × 2 | H = 1.0 bit each |
| Proposed Method | Joint HVHA/HVLA/LVHA/LVLA as a single problem | 4 | H = 2.0 bits |
Table 8. Comparison with recent state-of-the-art emotion recognition methods (2023–2025).
| Reference | Model | Dataset | Task | Performance |
| --- | --- | --- | --- | --- |
| [7] | BiLSTM–Transformer + Autoencoder | DEAP | Binary V/A | 94.0% |
| [8] | Attention Spatio-Temporal Fusion | DEAP | Binary V/A | 98.9% |
| [9] | 3D-CNN + CRNN (EEG + Video) | DEAP | Binary V/A | 91.8% |
| [10] | Parallel Hybrid Deep Model | DEAP | Binary V/A | 95.2% |
| Proposed Method | CNN–LSTM–Transformer | DEAP | Four-Quadrant V–A | 91.2% |
Table 9. Qualitative comparative analysis of EEG-based emotion recognition methods.
| Ref. | Method | Task | Protocol | Performance | Key Comparison |
| --- | --- | --- | --- | --- | --- |
| [8] | Hybrid attention spatio-temporal fusion; EEG (32 ch) | Binary V/A | 5-fold CV | 98.57% V, 98.93% A | High binary accuracy but limited scalability. Our model captures LVLA subtleties with ∼20% fewer parameters. |
| [14] | CNN–LSTM; EEG + GSR | Binary A | k-fold CV | 97.39% A | Ignores valence dimension. Transformer layers enable full quadrant mapping, improving LVLA recall by 17%. |
| [9] | 3D-CNN + CRNN; EEG + Video | Binary V/A | LOSO | 91.75% A, 91.86% V | Requires facial video. Our physiology-only model improves privacy and wearable applicability. |
| [7] | BiLSTM–Transformer + AE; EEG | Binary V/A | 6-subject CV | 94% | Limited cohort (n = 6). Our evaluation spans 32 subjects with multimodal fusion. |
| [12] | LSTM + GWO; EEG (22 ch) | Binary V/A | Intra-CV | 92.5% V, 81.25% A | Reduced channels but ignores peripherals. PCA reveals cross-modal correlations (r = 0.45), aiding four-quadrant classification. |
| [13] | InceptionResNetV2; EEG (9 ch) | Binary V | LOSO | 72.81% | Highlights cross-subject difficulty. CAR preprocessing improves confidence to 89–93%. |
| [2] | SVM/KNN Ensemble; PPG/GSR | Binary V/A | 80/20 split | 66% | Peripheral-only signals miss neural cues. Hybrid fusion improves F1 by 0.15. |
| [15] | FCDGELM; EEG | Binary V/A | Intra-CV | 69.67% | Unimodal limitation. Temporal-global modeling improves SNR and accuracy by ∼12%. |
| [16] | LSTM (raw EEG) | Binary V/A | Intra-subject | 85.65% A | Lacks spatial modeling. CNN + Transformer improves complex-state accuracy by ∼5.6%. |
| [17] | Multi-column CNN; EEG | Binary V/A | Subject-based | 90.01% V | High sampling overhead. The hybrid model matches performance with 40% fewer samples. |
| [10] | AlexNet–DenseNet + PCA + SVM | Binary/Tri | Train-test | 95.54% V | Multi-stage pipeline. End-to-end CNN–LSTM–Transformer achieves 91.2% four-quadrant accuracy in unified training. |
While prior methods excel at simpler binary tasks, the proposed framework uniquely addresses the more complex four-quadrant classification problem using only non-invasive physiological signals.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yeturu, H.; John, J.; Raj, R.D.A.; Milani, A.; Russo, S.; Randieri, C. Stress Detection from Multimodal Physiological Data Using Hybrid Deep Learning Models. Big Data Cogn. Comput. 2026, 10, 179. https://doi.org/10.3390/bdcc10060179

AMA Style

Yeturu H, John J, Raj RDA, Milani A, Russo S, Randieri C. Stress Detection from Multimodal Physiological Data Using Hybrid Deep Learning Models. Big Data and Cognitive Computing. 2026; 10(6):179. https://doi.org/10.3390/bdcc10060179

Chicago/Turabian Style

Yeturu, Hemesh, Joel John, Rayappa David Amar Raj, Alfredo Milani, Samuele Russo, and Cristian Randieri. 2026. "Stress Detection from Multimodal Physiological Data Using Hybrid Deep Learning Models" Big Data and Cognitive Computing 10, no. 6: 179. https://doi.org/10.3390/bdcc10060179

APA Style

Yeturu, H., John, J., Raj, R. D. A., Milani, A., Russo, S., & Randieri, C. (2026). Stress Detection from Multimodal Physiological Data Using Hybrid Deep Learning Models. Big Data and Cognitive Computing, 10(6), 179. https://doi.org/10.3390/bdcc10060179
