1. Introduction
Emotional states profoundly impact how individuals perceive, process, and respond to information, affecting both cognition and psychological well-being. While positive emotions foster psychological resilience, negative emotions are often linked to stress-related health issues [
1,
2]. Accurate emotion recognition is therefore essential across a wide range of applications, including mental health monitoring, adaptive human–computer interaction, and affective computing [
3]. However, traditional emotion recognition methods relying on subjective assessments and manual annotations are time-consuming and prone to bias [
4].
Recent advancements in artificial intelligence (AI), particularly deep learning, have significantly improved automated emotion recognition by leveraging physiological signals such as electroencephalography (EEG), electrooculography (EOG), electromyography (EMG), skin temperature (SKT), photoplethysmography (PPG), galvanic skin response (GSR), and respiratory rate (RR) [5]. Although previous studies have explored deep learning with multimodal physiological signals, most have used only a subset of these modalities rather than integrating them comprehensively. Challenges also remain in feature selection, computational efficiency, and model scalability for real-time applications, and the integration of deep learning with multimodal physiological signals is still constrained by limited datasets and model optimisation strategies.
Multimodal emotion recognition has gained increasing attention as it enables the integration of complementary information from multiple physiological signals to improve robustness and classification performance. Different modalities capture distinct aspects of emotional processing: EEG reflects neural activity, PPG captures cardiovascular responses, EOG records eye movement patterns, EMG measures muscle activity related to facial expressions, SKT reflects temperature changes associated with emotional states, and GSR represents skin conductance variations linked to arousal and stress. By combining these heterogeneous signals, multimodal approaches can better address the subjectivity and variability inherent in emotional responses, leading to more reliable emotion prediction systems [
6].
To address these challenges, this study proposes a multimodal fusion framework for emotion recognition, incorporating EEG, PPG, GSR, EOG, EMG, SKT, and RR. Using the DEAP dataset, we preprocess signals, extract relevant features, and employ modality-specific subnetworks before integrating them into a unified deep learning model.
While multimodal deep learning approaches have been widely explored, this study investigates the effectiveness of Temporal Convolutional Networks (TCN) [
7] and Bidirectional Long Short-Term Memory (Bi-LSTM) [
8] for multimodal physiological emotion recognition. In addition to comparing temporal modelling capabilities, the study systematically examines the influence of different evaluation protocols, including window-wise, block-wise, and trial-wise validation strategies, to assess the impact of temporal information leakage on reported performance. Optimised feature selection and advanced learning rate scheduling techniques are further employed to improve accuracy while reducing computational overhead.
By exploring deep learning-based multimodal fusion strategies, this study aims to enhance the efficiency and scalability of emotion recognition systems, contributing to the development of robust, real-time applications in affective computing and related fields.
The main contributions of this study are summarised as follows:
A multimodal deep learning framework is proposed for emotion recognition by integrating seven physiological modalities, including EEG, PPG, GSR, EOG, EMG, SKT, and RR, enabling comprehensive exploitation of both central and peripheral physiological responses.
A leakage-aware evaluation framework is introduced through the systematic investigation of window-wise, block-wise, non-overlapping block-wise, and trial-wise partitioning strategies, providing a more realistic assessment of model generalisation and quantifying the impact of temporal information leakage on reported emotion recognition performance.
A Temporal Convolutional Network (TCN)-based architecture is developed and evaluated against a Bi-LSTM baseline to investigate the effectiveness of temporal feature learning from multimodal physiological signals.
A modality-wise ablation study is conducted to quantify the contribution of individual physiological signals and to evaluate the effectiveness of multimodal fusion for emotion recognition.
Extensive experiments on the DEAP dataset demonstrate that the proposed framework achieves competitive performance while maintaining computational efficiency, supporting its potential applicability in real-time affective computing systems.
The remainder of this paper is organised as follows. Section 2 presents the background and reviews related work in multimodal emotion recognition. Section 3 describes the proposed methodology, including data preprocessing, feature extraction, model architecture, and evaluation protocols. Section 4 presents experimental results, including performance analysis, modality-wise ablation studies, and the investigation of temporal information leakage under different validation strategies. Finally, Section 5 concludes the paper and outlines future research directions.
2. Background
In recent years, numerous studies have explored emotion recognition using physiological signals and machine learning techniques. Verma and Tiwary [
7] developed a multimodal fusion framework combining Discrete Wavelet Transform with classifiers such as Support Vector Machine (SVM), Multilayer Perceptron (MLP), and K-Nearest Neighbours (KNN) to classify emotions from EEG, GSR, and skin temperature. Yin and Zhao [
8] introduced a multiple-fusion-layer-based ensemble classifier using stacked autoencoders (MESAE) to filter noise and extract stable features for arousal and valence classification. Khalili et al. [
9] examined quadratic discriminant classifiers on EEG and peripheral signals, highlighting EEG’s effectiveness, particularly when nonlinear features were employed. Chen et al. [
10] proposed a multimodal fusion approach using EEG and peripheral signals, achieving optimal performance for arousal classification. García [
11] applied Gaussian Process Latent Variable Models (GP-LVM) to map high-dimensional physiological data into a low-dimensional latent space, improving classification accuracy. Wagner et al. [
12] investigated emotion recognition from music-induced stimuli using EMG, ECG, and skin conductivity, achieving high recognition rates with machine learning classifiers.
A notable study [
13] introduced the MED4 dataset, recording EEG, PPG, speech, and facial images under controlled conditions. It demonstrated that EEG outperforms speech in emotion recognition and that fusing EEG with speech enhances robustness in noisy environments. The study also leveraged Temporal Convolutional Networks (TCNs) to improve temporal feature extraction and classification performance. Similarly, one study [
14] demonstrated the effectiveness of TCNs for learning long-term temporal features from physiological signals, overcoming RNN limitations. Integrating the Connectionist Temporal Classification (CTC) algorithm further improved alignment of temporal features with sentiment labels. The results on the AMIGOS dataset showed that the TCN-based approach outperformed traditional methods, highlighting its potential for emotion recognition.
More recent research has increasingly shifted toward deep learning approaches for EEG-based emotion recognition. Bi-LSTM networks have demonstrated effectiveness in modelling temporal dependencies in physiological signals [
15], while CNN-RNN hybrid frameworks have improved spatial–temporal feature extraction [
16]. Other studies have explored graph convolutional networks and transformer-based architectures to better capture inter-channel and temporal relationships in EEG signals [
17]. The DEAP dataset [
18] has become one of the most widely used benchmarks for evaluating multimodal physiological emotion recognition models.
Recent studies on the DEAP dataset have investigated multimodal fusion and advanced temporal modelling strategies for emotion recognition. Ensemble-based learning frameworks combining EEG, EMG, and EOG signals have demonstrated the effectiveness of heterogeneous physiological fusion for affect classification [
19]. Deep canonical correlation analysis has also been explored to learn shared feature representations across modalities for arousal and valence recognition [
20]. In addition, attention-based transformer architectures and CNN–LSTM hybrid models have shown strong performance in spatial–temporal representation learning and temporal decoding of EEG signals [
21,
22,
23]. Transfer-learning approaches have been further investigated to improve robustness across subjects and modalities [
24], while multimodal human–computer interaction studies integrating EEG with facial video have highlighted the complementary relationship between behavioural and physiological affective cues [
25]. Together, these studies demonstrate growing interest in scalable and generalised deep learning frameworks for multimodal affective computing.
An important methodological consideration in physiological emotion recognition is the temporal segmentation of continuous signals. Many studies divide recordings into short overlapping windows to capture dynamic emotional variations and increase the number of training samples [
26,
27,
28]. However, such overlapping segmentation can introduce information leakage when windows from the same trial are distributed across training and validation sets, potentially leading to inflated performance estimates [
29]. To address this issue, more rigorous evaluation strategies, such as trial-wise and block-wise partitioning, have been proposed [
30,
31]. In particular, block-wise segmentation groups temporally adjacent windows into the same dataset partition, reducing the likelihood that highly correlated samples cross the train–validation boundary. This preserves the independence between training and validation data and provides a more reliable assessment of model generalisation.
Recent studies have increasingly emphasised the importance of evaluation methodology in EEG-based emotion recognition. A comprehensive review by Ma et al. [
32] summarised numerous subject-independent deep learning studies and demonstrated that cross-subject evaluation on the DEAP dataset remains considerably more challenging than conventional within-subject settings due to substantial inter-subject variability. Brookshire et al. [
29] discussed how data leakage can artificially inflate EEG deep learning performance and reduce the reliability of reported results. Lei et al. [
31] investigated the impact of trial-wise splitting and temporal leakage in EEG-based emotion classification, showing that improper partitioning strategies may lead to overly optimistic accuracies. Kukhilava et al. [
30] proposed a unified evaluation framework for EEG emotion recognition and highlighted inconsistencies in validation methodologies across existing studies. Similarly, Del Pup et al. [
33] demonstrated that data partitioning strategies significantly influence EEG deep learning performance, while Liu et al. [
34] introduced a comprehensive benchmark framework emphasising standardised and reproducible evaluation procedures.
In this study, a block-wise evaluation strategy is adopted to reduce temporal overlap between training and validation samples. Although this approach yields more conservative performance estimates than conventional window-wise splitting, it provides a more realistic and less biased assessment of model generalisation capability.
In addition to methodological studies on data leakage and evaluation protocols, several recent subject-independent EEG emotion recognition approaches have further demonstrated the challenges of cross-subject generalisation on the DEAP dataset. Quan et al. [
35] proposed a deep learning framework for subject-independent EEG emotion recognition and reported accuracies of 79.59% for arousal and 81.19% for valence, highlighting the difficulty of maintaining robust performance under cross-subject evaluation settings. He et al. [
36] introduced a domain adaptation-based approach for EEG emotion recognition, achieving 63.25% arousal and 64.33% valence accuracy, while emphasising the importance of reducing domain discrepancies between subjects. Together, these studies further confirm that subject-independent EEG emotion recognition remains substantially more challenging than conventional within-subject evaluation and reinforce the importance of rigorous and leakage-aware validation strategies.
Overall, the literature indicates that DEAP continues to serve as a central benchmark for evaluating modality integration, feature representation, and temporal modelling strategies in affective computing. However, differences in modality selection, binary versus multi-class formulations, and cross-subject versus within-subject evaluation protocols make direct comparison challenging and highlight the need for reproducible multimodal pipelines with clearly defined learning objectives. Despite these advances, challenges such as computational complexity, overfitting, and limited real-time scalability persist. In particular, the integration of TCNs into multimodal frameworks remains underexplored, even though their ability to model long-term dependencies and their support for parallelisation offer clear benefits.
3. Methodology
3.1. Data Collection
The DEAP dataset [
18], a widely used benchmark for emotion recognition, consists of multimodal physiological recordings from 32 participants (16 males, 16 females) aged between 19 and 37 years. The study was conducted in a controlled laboratory environment where participants watched 40 one-minute-long music video clips selected to elicit a range of emotional responses. These videos were chosen based on prior affective ratings to ensure diversity in emotional stimuli. During the experiment, physiological signals, including EEG, EOG, EMG, GSR, RR, SKT, and PPG, were recorded at 512 Hz; the pre-processed release used in this study provides them downsampled to 128 Hz. EEG signals were captured from 32 channels following the international 10–20 system. After watching each video, participants self-reported their emotional states using a 9-point scale for arousal and valence. These subjective ratings serve as the ground truth labels for emotion classification. For this study, arousal and valence ratings were binarised, formulating a two-class classification problem for each dimension.
For model training, the DEAP data were loaded in the pre-processed Python format, where each participant’s file contains 40 trials. Each participant’s data are stored as a three-dimensional array with the shape (40 trials × 40 channels × 8064 samples), corresponding to 32 EEG and 8 peripheral physiological channels; at 128 Hz, the 8064 samples of each trial span 63 s. In this study, 11 EEG channels were selected based on prior literature and combined with six peripheral physiological channels (EOG, EMG, GSR, RR, SKT, and PPG), resulting in a final configuration of 17 channels for analysis. This multimodal configuration enables a comprehensive assessment of both central nervous system activity (EEG) and peripheral autonomic physiological responses. The corresponding label array has the shape (40 × 4), representing valence, arousal, dominance, and liking, of which only the valence and arousal dimensions were used for classification.
3.2. Preprocessing and Feature Extraction
To prepare the data for deep learning, each physiological signal underwent a multi-stage preprocessing and segmentation pipeline designed to enhance signal quality and reduce redundancy. First, baseline noise and artefacts were removed using the DEAP dataset’s preprocessing procedures, including filtering and normalisation. From the 40 available channels, 11 EEG channels were selected and combined with six peripheral physiological signals (EOG, EMG, GSR, RR, SKT, and PPG), resulting in 17 channels used for subsequent analysis.
Each trial (8064 samples) was segmented into overlapping windows of 2 s (256 samples) with a 0.125 s step (16 samples). For each window, a Fast Fourier Transform (FFT) was applied to extract spectral band-power features across five frequency bands commonly used in emotion-related EEG and physiological analysis. Band-power features from all selected channels were concatenated into a single feature vector per window.
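The windowing and band-power step described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the band edges in `BANDS` are hypothetical (the five bands are not specified in the text), the band power is taken as the mean squared FFT magnitude within each band, and the function names are chosen here for illustration only.

```python
import numpy as np

FS = 128    # sampling rate of the pre-processed signals (Hz)
WIN = 256   # 2 s window
STEP = 16   # 0.125 s step

# Hypothetical band edges in Hz; the text states five bands but not their limits.
BANDS = {
    "theta": (4, 8),
    "alpha": (8, 12),
    "low_beta": (12, 16),
    "high_beta": (16, 25),
    "gamma": (25, 45),
}


def window_starts(n_samples, win=WIN, step=STEP):
    """Start indices of all full windows that fit into the signal."""
    return list(range(0, n_samples - win + 1, step))


def band_powers(segment, fs=FS, bands=BANDS):
    """Mean spectral power of a 1-D segment within each frequency band."""
    power = np.abs(np.fft.rfft(segment)) ** 2
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
    return np.array([power[(freqs >= lo) & (freqs < hi)].mean()
                     for lo, hi in bands.values()])


def trial_features(trial):
    """Map a (channels, samples) trial to a (windows, channels * bands) matrix."""
    rows = []
    for start in window_starts(trial.shape[1]):
        window = trial[:, start:start + WIN]
        # Concatenate the band powers of every channel into one feature vector.
        rows.append(np.concatenate([band_powers(ch) for ch in window]))
    return np.stack(rows)
```

Under these settings, an 8064-sample trial yields 489 windows, and 17 channels with five bands give an 85-dimensional feature vector per window.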
To mitigate the overestimation of model performance caused by the high overlap between consecutive windows, a block-wise splitting strategy was initially employed. The sequence of overlapping windows from each trial was partitioned into contiguous, disjoint blocks, each containing eight consecutive windows. Given the windowing configuration, these blocks represent short, highly correlated temporal segments of the original signal rather than independent samples. Blocks were constructed sequentially without shuffling to preserve temporal dependencies, and a deterministic partitioning scheme was applied in which every fourth block was assigned to the validation set, while the remaining blocks were used for training. This strategy reduces the likelihood of highly similar windows appearing in both training and validation sets; however, some temporal correlation remains because windows in neighbouring blocks still overlap substantially. To further investigate the influence of evaluation methodology on reported performance, additional experiments using larger step sizes, non-overlapping block-wise partitioning, and strict trial-wise validation were conducted. These evaluation protocols are described in
Section 4.6.
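The deterministic block-wise partitioning can be illustrated with a short sketch. Two details are assumptions made here and not stated in the text: the validation blocks are taken to be the 4th, 8th, 12th, and so on, and a final block with fewer than eight windows is kept and assigned by the same rule. The function name `blockwise_split` is hypothetical.

```python
def blockwise_split(n_windows, block_size=8, val_every=4):
    """Split window indices of one trial into training and validation sets.

    Consecutive windows are grouped into blocks of `block_size`; every
    `val_every`-th block goes to validation, the rest to training.
    """
    train_idx, val_idx = [], []
    for block_id, start in enumerate(range(0, n_windows, block_size)):
        block = list(range(start, min(start + block_size, n_windows)))
        # Assumption: the 4th, 8th, 12th, ... block is used for validation.
        if (block_id + 1) % val_every == 0:
            val_idx.extend(block)
        else:
            train_idx.extend(block)
    return train_idx, val_idx


# 489 windows per trial follow from 8064 samples, a 256-sample window and a 16-sample step.
train_idx, val_idx = blockwise_split(489)
```

With 489 windows per trial, this rule places 120 windows in the validation set and 369 in the training set, roughly a 75/25 split.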
All extracted features and corresponding labels were stored as NumPy arrays to enable efficient data loading during model training. Training and validation were performed in a within-subject manner using the block-wise configuration to provide a fair, reproducible, and more rigorous assessment while reducing temporal information leakage between training and validation samples.
Evaluation Protocols
Several evaluation protocols were investigated to assess the impact of temporal information leakage on emotion recognition performance. First, a conventional window-wise strategy was considered, in which overlapping windows were randomly partitioned between training and validation sets. Second, a block-wise strategy grouped neighbouring windows into contiguous temporal blocks before partitioning. Third, a leakage-reduced block-wise protocol was implemented by increasing the window step size, thereby reducing overlap between adjacent windows. Fourth, a non-overlapping block-wise protocol was adopted using a step size equal to the window length. Finally, a strict trial-wise protocol was employed in which entire trials were assigned exclusively to either the training or validation set. These protocols enabled a systematic investigation of how temporal overlap influences reported classification performance.
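To make the effect of window overlap on these protocols concrete, the sketch below counts how many validation windows share raw samples with at least one training window under a block-wise split of a single trial. It is an illustrative calculation, not the study's evaluation code: the block assignment rule (every fourth block to validation) follows the text, while the intermediate step size of 128 samples is a hypothetical example of a "leakage-reduced" setting.

```python
def leaky_fraction(n_samples=8064, win=256, step=16, block_size=8, val_every=4):
    """Fraction of validation windows that share raw samples with a training window."""
    starts = list(range(0, n_samples - win + 1, step))
    # Every `val_every`-th block of `block_size` consecutive windows is validation.
    is_val = [((i // block_size) + 1) % val_every == 0 for i in range(len(starts))]
    train = [s for s, v in zip(starts, is_val) if not v]
    val = [s for s, v in zip(starts, is_val) if v]
    # Two windows of equal length overlap when their starts differ by less than `win`.
    leaky = sum(any(abs(v - t) < win for t in train) for v in val)
    return leaky / len(val)


overlapping = leaky_fraction(step=16)      # original 0.125 s step
reduced = leaky_fraction(step=128)         # hypothetical larger step
non_overlapping = leaky_fraction(step=256) # step equal to the window length
```

With the original 16-sample step, every validation window overlaps some training window, whereas a step equal to the window length removes shared samples entirely. Trial-wise validation removes them by construction, since windows from different trials never share samples.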
3.3. Temporal Convolutional Networks (TCN) and Bi-LSTM for Time-Series Modelling
Temporal Convolutional Networks (TCNs) [
37] have recently gained significant attention as a powerful alternative to recurrent neural networks (RNNs) for time-series and sequential data modelling. Unlike traditional recurrent architectures such as Long Short-Term Memory (LSTM) and Bidirectional LSTM (Bi-LSTM) [
38], which rely on sequential processing of data over time steps, TCNs leverage the strengths of convolutional operations to capture dependencies across long temporal spans in a more efficient manner. Specifically, TCNs employ causal convolutions, which ensure that the prediction at any time step is based only on the current and previous inputs, thereby preserving temporal order and avoiding information leakage from the future.
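As a minimal illustration of causality, the following pure-Python sketch implements a single-channel causal dilated convolution with implicit zero padding on the left. The function name and the list-based interface are chosen here for clarity and are not taken from any library.

```python
def causal_dilated_conv1d(x, kernel, dilation=1):
    """Causal 1-D convolution: output[t] depends only on x[t], x[t - d], x[t - 2d], ..."""
    k = len(kernel)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - (k - 1 - j) * dilation  # taps only look backwards in time
            if idx >= 0:                      # positions before the start act as zeros
                acc += w * x[idx]
        out.append(acc)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = causal_dilated_conv1d(x, [1.0, 1.0, 1.0], dilation=2)
```

Changing any later input value leaves earlier outputs unchanged, which is the property that prevents leakage from future time steps.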
Another important feature of TCNs is the use of dilated convolutions, which systematically increase the receptive field of the convolutional filters without requiring a proportional increase in network depth. Through this mechanism, TCNs can capture both short- and long-range dependencies while maintaining computational efficiency. This design contrasts with recurrent approaches, where modelling long-term dependencies is often hindered by vanishing or exploding gradients. By expanding the receptive field exponentially with each layer, TCNs can learn temporal patterns over extended time spans without recurrent connections. Moreover, the convolutional nature of TCNs enables parallel processing of entire sequences, which substantially improves training speed compared to sequential models like Bi-LSTM.
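The growth of the receptive field can be checked with a short calculation. The formula below assumes stride-1 causal convolutions and a fixed number of convolutions per dilation level; how many convolutions each level contains depends on the specific TCN implementation, so `convs_per_level` and `n_stacks` are illustrative parameters and not values confirmed by the text.

```python
def receptive_field(kernel_size, dilations, n_stacks=1, convs_per_level=1):
    """Number of input time steps that influence one output step."""
    # Each convolution with dilation d extends the field by (kernel_size - 1) * d.
    return 1 + n_stacks * convs_per_level * (kernel_size - 1) * sum(dilations)


dilated = receptive_field(3, [1, 2, 4, 8])       # dilations doubling per level
undilated = receptive_field(3, [1, 1, 1, 1])     # same depth without dilation
stacked = receptive_field(3, [1, 2, 4, 8], n_stacks=3)
```

For a kernel size of 3 and four levels, doubling the dilation covers 31 time steps instead of 9 at the same depth, and repeating the stack three times extends this to 91.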
In this study, we employ TCNs as the primary model for emotion recognition using multimodal physiological signals, including EEG, PPG, GSR, EOG, EMG, RR, and SKT. These signals inherently contain complex temporal dynamics that require models capable of efficiently extracting both local and global dependencies. The detailed architecture of the proposed TCN model is illustrated in
Figure 1.
For benchmarking and comparison, we also implement a Bi-LSTM model, as it is one of the most widely used architectures in time-series analysis due to its ability to capture bidirectional contextual information. However, despite its effectiveness, Bi-LSTM suffers from limitations such as higher computational cost, slower training, and reduced scalability for real-time applications. In addition, recurrent architectures often struggle to capture very long-range dependencies efficiently, a challenge that TCNs address through their hierarchical convolutional design.
By conducting a comparative analysis between TCN and Bi-LSTM, we aim to highlight the advantages of TCNs in terms of both predictive performance and computational efficiency. The goal of this work is to demonstrate the suitability of TCNs for physiological signal-based emotion recognition, emphasising their ability to extract meaningful temporal representations while maintaining stable training dynamics. This evaluation not only validates the robustness of TCNs but also provides insights into their practical applicability for large-scale and real-time affective computing systems.
3.4. Model Architecture
In contrast to recurrent models such as Bi-LSTM, this study adopts a TCN-based architecture for modelling physiological signals. The TCN structure is specifically designed to handle long-range temporal dependencies while offering superior parallelization and stability during training. The architecture is composed of multiple temporal convolutional layers, which progressively extract temporal features from the input signals. Each convolutional layer is embedded within residual blocks, ensuring that information can flow across layers without degradation and improving the model’s ability to learn deep temporal representations.
Feature vectors from each modality were concatenated along the feature dimension prior to input into the shared TCN backbone, forming a unified multimodal representation. To enhance generalisation and prevent overfitting, the architecture incorporates dropout layers and batch normalisation after convolutional operations. Dropout helps reduce reliance on specific filters, while batch normalisation stabilises learning and accelerates convergence. At the classification stage, fully connected layers are employed to map the extracted temporal features into two discrete classes for arousal and valence, enabling effective binary emotion recognition.
The proposed TCN model was implemented using the TensorFlow/Keras deep-learning framework. The network consisted of three stacked TCN layers with 64, 128, and 32 filters, respectively. Each TCN layer employed a kernel size of 3 and dilation factors of [1, 2, 4, 8] to capture both short- and long-term temporal dependencies. Batch normalisation and dropout (rate = 0.15) were applied after each TCN layer to improve training stability and reduce overfitting. For classification, a fully connected dense layer with 16 neurons and ReLU activation was used, followed by a single-neuron output layer with a sigmoid activation function for binary classification. The main architectural parameters are summarised in
Table 1.
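One possible way to express this configuration with plain Keras layers is sketched below. It should be read as an approximation under stated assumptions, not as the authors' implementation: each "TCN layer" is modelled as a sequence of causal `Conv1D` residual blocks (one per dilation factor), the time axis is collapsed with global average pooling, and the input shape is left as a parameter because it is not specified in the text. Only the filter counts, kernel size, dilations, dropout rate, dense width, loss, and learning rate come from the paper.

```python
# Hyperparameters taken from the text; everything else below is an assumption.
TCN_CONFIG = {
    "filters": (64, 128, 32),
    "kernel_size": 3,
    "dilations": (1, 2, 4, 8),
    "dropout": 0.15,
    "dense_units": 16,
    "learning_rate": 1e-4,
}


def build_tcn(timesteps, n_features, cfg=TCN_CONFIG):
    """Build a TCN-style binary classifier from plain Keras layers."""
    # Imported lazily so the configuration can be inspected without TensorFlow.
    from tensorflow import keras
    from tensorflow.keras import layers

    def residual_block(x, n_filters, dilation):
        y = layers.Conv1D(n_filters, cfg["kernel_size"], padding="causal",
                          dilation_rate=dilation, activation="relu")(x)
        if x.shape[-1] != n_filters:
            # 1x1 convolution so the skip path matches the channel count.
            x = layers.Conv1D(n_filters, 1)(x)
        return layers.Add()([x, y])

    inputs = keras.Input(shape=(timesteps, n_features))
    x = inputs
    for n_filters in cfg["filters"]:
        for dilation in cfg["dilations"]:
            x = residual_block(x, n_filters, dilation)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(cfg["dropout"])(x)
    # Assumption: the time axis is collapsed by average pooling.
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(cfg["dense_units"], activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=cfg["learning_rate"]),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

A call such as `build_tcn(timesteps=8, n_features=85)` would correspond to sequences of eight windows with 85 band-power features each; both numbers are hypothetical choices for illustration.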
Training was performed using the Adam optimizer (learning rate = 0.0001) with a ReduceLROnPlateau scheduler that halved the learning rate when the validation loss plateaued (patience = 5). Binary cross-entropy loss was used, and the model was trained for 230 epochs with a batch size of 256.
A block-wise train–validation split was initially adopted for model development. Subsequently, additional leakage-aware evaluation protocols, including non-overlapping block-wise and trial-wise validation, were investigated to assess the impact of temporal information leakage on classification performance. All experiments followed a within-subject training protocol. Separate networks were trained for arousal and valence prediction.
Overall, the TCN architecture offers several advantages over recurrent counterparts. It enables faster training by exploiting convolutional operations, ensures stable gradient flow through residual connections, and scales effectively to long sequences due to dilated convolutions. These properties make the proposed TCN model well-suited for real-time physiological signal-based emotion recognition, where both computational efficiency and accurate temporal modelling are critical. The architecture is illustrated in
Figure 2.
3.4.1. Bidirectional LSTM Model for Comparison
To provide a comparative baseline for the proposed TCN architecture, a Bidirectional Long Short-Term Memory (Bi-LSTM) model was implemented to capture sequential dependencies in both forward and backward directions. The Bi-LSTM is particularly effective in modelling long-term temporal relationships, allowing the network to retain past and future contextual information in time-series data.
The model consisted of three bidirectional LSTM layers with 64, 128, and 32 hidden units, respectively, so that temporal features were extracted in both the forward and backward directions. Dropout (rate = 0.15) was applied after each LSTM block, and batch normalisation was used to stabilise and accelerate convergence. The network concluded with a fully connected dense layer of 16 ReLU-activated units and a sigmoid output layer for binary classification.
Training was performed using the Adam optimizer with an initial learning rate of 0.0001. Binary cross-entropy loss was used for optimisation. The model was trained for 230 epochs with a batch size of 256. To reduce temporal leakage caused by highly overlapping windows, a block-wise train–validation splitting strategy was adopted, in which consecutive windows from the same trial were grouped into blocks and each block was assigned entirely to either the training or validation set. All experiments were implemented in Python using the TensorFlow deep learning framework under a within-subject evaluation protocol consistent with the proposed TCN model.
3.4.2. Ablation Study for Modality Contribution
To assess the contribution of individual physiological modalities, an ablation study was conducted using the same TCN architecture. Models were trained on subsets of modalities, including EEG only, EEG with EOG and EMG, EEG with EOG, EMG, and SKT, and all modalities combined. This analysis provides quantitative insight into the incremental benefit of including peripheral signals alongside EEG and supports the evaluation of the proposed early-fusion approach.
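Because the modalities are fused by concatenating per-channel features, each ablation setting amounts to selecting a subset of feature columns. The sketch below illustrates this under assumptions made here: a hypothetical channel order (11 EEG channels followed by the six peripheral channels) and five band-power features per channel.

```python
N_BANDS = 5  # band-power features per channel

# Hypothetical channel order: 11 EEG channels followed by six peripheral channels.
CHANNELS = ["EEG"] * 11 + ["EOG", "EMG", "GSR", "RR", "SKT", "PPG"]

# Modality subsets named in the ablation study.
ABLATIONS = {
    "EEG only": {"EEG"},
    "EEG + EOG + EMG": {"EEG", "EOG", "EMG"},
    "EEG + EOG + EMG + SKT": {"EEG", "EOG", "EMG", "SKT"},
    "All modalities": set(CHANNELS),
}


def feature_columns(modalities):
    """Indices of the feature columns that belong to the given modalities."""
    cols = []
    for ch_idx, modality in enumerate(CHANNELS):
        if modality in modalities:
            cols.extend(range(ch_idx * N_BANDS, (ch_idx + 1) * N_BANDS))
    return cols


n_features = {name: len(feature_columns(mods)) for name, mods in ABLATIONS.items()}
```

Under these assumptions the four settings use 55, 65, 70, and 85 input features, respectively, while the network architecture stays the same.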
3.5. Adaptive Learning Rate Adjustment
Learning rate adjustment is essential for balancing the speed and stability of model convergence. If the learning rate is too small, training progresses slowly, while a high learning rate can cause instability and prevent the model from reaching an optimal solution. Adaptive adjustment [
39] methods dynamically modify the learning rate based on validation metrics, such as loss or accuracy, rather than following predefined schedules. When performance plateaus, these methods reduce the learning rate, allowing for smaller, more precise updates. This helps the model escape stagnation and continue improving, especially in later training stages where large updates could hinder fine-tuning.
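The plateau-based rule can be written in a few lines. The class below is a simplified stand-in for a scheduler such as Keras's `ReduceLROnPlateau` and omits options such as cooldown periods and minimum improvement thresholds; the halving factor, patience of 5, and initial learning rate follow the training settings reported in Section 3.4, while the class name and interface are hypothetical.

```python
class PlateauScheduler:
    """Reduce the learning rate when the validation loss stops improving."""

    def __init__(self, lr=1e-4, factor=0.5, patience=5, min_lr=0.0):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.wait = 0  # epochs since the last improvement

    def step(self, val_loss):
        """Update the state with one epoch's validation loss and return the learning rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr


scheduler = PlateauScheduler()
# Two improving epochs followed by five epochs without improvement.
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]]
```

In this example the learning rate stays at its initial value for the first six epochs and is halved only after the fifth consecutive epoch without improvement.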
4. Results and Discussion
4.1. Model Performance Overview
Unless otherwise specified, the detailed analyses in this section, including convergence dynamics and computational efficiency, use the valence classification task as a representative case. This avoids redundancy in reporting, as the arousal and valence models share identical architectures and training protocols and differ only in being trained separately. Arousal classification results are reported separately in
Section 4.7 and summarised under the same experimental conditions.
To provide a more rigorous evaluation, a block-wise data partitioning strategy was initially adopted, where temporally adjacent overlapping windows were grouped into contiguous blocks and assigned exclusively to either the training or validation set. This approach reduces the risk of temporal information leakage caused by highly overlapping windows and provides a more conservative assessment of model generalisation compared with conventional window-wise validation.
4.2. Accuracy and Convergence Analysis
The performance of Temporal Convolutional Networks (TCN) was evaluated against a Bi-LSTM model, summarised in
Table 2. The table presents both accuracy and efficiency metrics: training accuracy, validation accuracy, average training time per epoch, and inference latency per sample. Inference latency was measured per 5 s input window.
TCN outperforms Bi-LSTM in both training and validation accuracy. It achieves a training accuracy of 97.26% and a validation accuracy of 88.42%, exceeding Bi-LSTM by approximately 9–10 percentage points. These results suggest that TCN’s causal and dilated convolutions are more effective in capturing both short- and long-term temporal dependencies in physiological signals. Relative to Bi-LSTM, the gap between training and validation accuracy (about 8.8 percentage points for TCN) also suggests that TCN generalises better and is less prone to overfitting.
It is worth noting that when conventional window-wise splitting is applied—where individual overlapping windows are randomly partitioned—the TCN model achieves training and validation accuracy exceeding 90%. However, such splitting may allow temporally overlapping segments from the same trial to appear in both training and validation sets, potentially inflating performance estimates due to shared signal content. By contrast, the adopted block-wise partitioning strategy groups adjacent windows into contiguous segments before splitting, reducing the likelihood that highly correlated samples appear in both training and validation sets. This results in a more conservative yet more realistic evaluation of classification performance and model generalisation.
4.3. Computational Efficiency
TCN was also more efficient computationally: it required approximately 14.12 s per training epoch, about 8% less than Bi-LSTM at 15.32 s. This improvement is consistent with the parallelised convolutional operations in TCN, which process whole sequences at once. Inference latency was also lower for TCN (0.32 ms/sample) than for Bi-LSTM (0.42 ms/sample), a reduction of roughly 24%, supporting TCN’s suitability for real-time emotion recognition systems.
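The relative savings implied by these figures can be verified with a short calculation. The comparison against the 0.125 s window step is an illustrative assumption about what "real-time" would require here, namely that one prediction must be produced before the next window arrives.

```python
def relative_reduction(baseline, candidate):
    """Fractional reduction of `candidate` relative to `baseline`."""
    return (baseline - candidate) / baseline


epoch_saving = relative_reduction(15.32, 14.12)   # training time per epoch (s)
latency_saving = relative_reduction(0.42, 0.32)   # inference latency (ms/sample)

# Assumption: a new window arrives every 0.125 s (the step size), i.e. every 125 ms.
step_ms = 0.125 * 1000
headroom = step_ms / 0.32  # how many TCN inferences fit into one step interval
```

These values correspond to roughly 7.8% less training time per epoch and 23.8% lower latency, with the reported TCN latency being several hundred times shorter than the interval between consecutive windows.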
4.4. Learning Dynamics
The accuracy vs. epochs plot (Figure 3) illustrates the learning progression of the models over training iterations. The TCN model (top) exhibits a steady and rapid increase in accuracy, reaching a plateau at higher epochs, while the Bi-LSTM and LSTM models (bottom) show a comparatively slower convergence. The smoother and more stable accuracy curve of TCN indicates improved learning efficiency and reduced fluctuations, reflecting its better capacity to capture temporal dependencies in physiological signals. The gap between training and validation accuracy remains minimal, further confirming stable generalisation under the stricter block-wise evaluation setting.
Dynamic learning rate adjustment strategies played a crucial role in enhancing model convergence and overall performance. Adaptive scheduling fine-tuned network parameters, preventing overshooting and improving stability during training. As seen in the accuracy vs. epochs plot, the learning curve exhibits distinct step-like increases in accuracy, which align with scheduled reductions in the learning rate. These jumps coincide with the moments when the optimiser’s step size is reduced, allowing the model to make finer weight updates and settle more precisely into a minimum of the loss instead of oscillating around it. Such adaptive techniques also helped stabilise the later stages of training. The results emphasise the importance of learning rate management in training deep neural networks for physiological signal-based emotion recognition.
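One common form of adaptive scheduling reduces the learning rate when validation accuracy stops improving. The sketch below shows that idea in plain Python; the specific scheduler, reduction factor, patience, and initial learning rate used in this study are not stated here, so all values and the class name are assumptions, and the accuracy history is made up for illustration.

```python
class PlateauScheduler:
    """Reduce the learning rate by `factor` when validation accuracy has not
    improved for `patience` consecutive epochs (hypothetical helper)."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        """Update the schedule with this epoch's accuracy and return the new rate."""
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


scheduler = PlateauScheduler(lr=1e-3)
history = [0.60, 0.70, 0.70, 0.69, 0.75, 0.75, 0.74]  # made-up validation accuracies
lrs = [scheduler.step(acc) for acc in history]
```

In this toy run the rate is halved after each pair of non-improving epochs, producing the kind of piecewise-constant schedule that would line up with step-like changes in an accuracy curve.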
4.5. Ablation Study Results
The ablation study (Table 3) was conducted to evaluate the contribution of each physiological modality to the performance of the proposed TCN model. As shown, the model trained using EEG only achieves a baseline validation accuracy of 80.5%, indicating that central nervous system activity alone provides meaningful but limited discriminative information for emotion recognition.
When peripheral modalities (EOG and EMG) are added, the validation accuracy increases to 81.8%, demonstrating that ocular and muscular signals contribute complementary temporal features that enhance classification capability. Further inclusion of SKT leads to a more substantial improvement, raising validation accuracy to 85.3%. This indicates that thermoregulatory responses provide additional discriminative information that is not captured by EEG and facial-related signals alone.
The full multimodal configuration achieves the highest validation accuracy (88.42%), confirming that combining central (EEG) and multiple peripheral physiological signals yields the most robust performance. Although the training accuracy slightly decreases compared to the previous configuration (92.9% vs. 93.3%), the validation accuracy improves, suggesting better generalisation and reduced overfitting.
Overall, the ablation results demonstrate that each additional modality contributes complementary information, validating the effectiveness of the early-fusion strategy adopted in the TCN framework and highlighting the importance of multimodal physiological integration for emotion recognition.
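The ablation configurations described above can be sketched as channel selections followed by early fusion, that is, stacking the chosen channels into a single input before the network sees them. The channel indices below assume the 40-channel layout of the preprocessed DEAP recordings (32 EEG channels followed by 8 peripheral channels), and the configuration names are taken from the text rather than from Table 3; the helper is hypothetical.

```python
# Assumed channel layout of a preprocessed DEAP window (40 channels).
MODALITY_CHANNELS = {
    "EEG": list(range(0, 32)),
    "EOG": [32, 33],
    "EMG": [34, 35],
    "GSR": [36],
    "RR": [37],
    "PPG": [38],
    "SKT": [39],
}

# Ablation steps as described in the text (names are illustrative).
ABLATION_CONFIGS = {
    "EEG only": ["EEG"],
    "EEG + EOG + EMG": ["EEG", "EOG", "EMG"],
    "EEG + EOG + EMG + SKT": ["EEG", "EOG", "EMG", "SKT"],
    "Full multimodal": ["EEG", "EOG", "EMG", "SKT", "GSR", "RR", "PPG"],
}


def early_fusion(window, modalities):
    """Stack the channels of the selected modalities into one
    (channels x time) input, in the order the modalities are listed."""
    rows = []
    for name in modalities:
        rows.extend(window[ch] for ch in MODALITY_CHANNELS[name])
    return rows


# Toy window: 40 channels x 640 time steps, each channel filled with its index.
window = [[float(ch)] * 640 for ch in range(40)]
n_channels = {name: len(early_fusion(window, mods))
              for name, mods in ABLATION_CONFIGS.items()}
```

Because fusion happens at the input, each ablation step only changes the number of input channels (32, 36, 37, and 40 under these assumptions) while the rest of the network stays the same.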
4.6. Impact of Evaluation Protocols and Temporal Leakage
To investigate the influence of temporal information leakage on emotion recognition performance, several evaluation strategies were examined, including conventional window-wise validation, leakage-reduced block-wise validation, non-overlapping block-wise validation, and strict trial-wise validation, summarised in Table 4.
The results indicate that performance decreases as stricter evaluation protocols are adopted. While conventional window-wise validation achieves the highest accuracy, it is susceptible to temporal information leakage caused by highly overlapping windows. Increasing the step size and employing block-wise partitioning reduces leakage and produces more conservative performance estimates. The trial-wise protocol provides the most rigorous evaluation by ensuring complete separation of trials between training and validation sets, thereby offering the most realistic assessment of model generalisation. Although trial-wise evaluation yields lower accuracy than window-wise and block-wise approaches, the resulting performance estimates are more reliable because they better reflect the model’s ability to generalise to unseen trials. This observation is particularly important for the DEAP dataset, where the number of independent trials is limited, making leakage-aware evaluation essential for obtaining realistic and reproducible performance estimates.
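A trial-wise protocol of the kind described above can be sketched by assigning every window of a trial to the same side of the split. The example assumes 40 trials per participant, as in DEAP, with 56 windows per trial and a 20% validation fraction; these window counts and the function name are illustrative assumptions.

```python
import random


def trialwise_split(trial_ids, val_fraction, seed=0):
    """Split window indices so that all windows of a trial stay together.
    trial_ids[i] is the trial that window i was cut from."""
    trials = sorted(set(trial_ids))
    random.Random(seed).shuffle(trials)
    n_val = max(1, round(val_fraction * len(trials)))
    val_trials = set(trials[:n_val])
    train_idx = [i for i, t in enumerate(trial_ids) if t not in val_trials]
    val_idx = [i for i, t in enumerate(trial_ids) if t in val_trials]
    return train_idx, val_idx


# Toy example: 40 trials with 56 windows each (illustrative values).
trial_ids = [t for t in range(40) for _ in range(56)]
train_idx, val_idx = trialwise_split(trial_ids, val_fraction=0.2)

train_trials = {trial_ids[i] for i in train_idx}
val_trials = {trial_ids[i] for i in val_idx}
```

Since no trial contributes windows to both sets, overlapping windows can never straddle the split, at the cost of having only a handful of independent validation trials per participant.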
4.7. Comparison with Previous Studies
Direct comparison between emotion recognition studies remains challenging because reported accuracies are strongly influenced by the adopted evaluation methodology, including segmentation strategy, overlap configuration, subject dependency, and data partitioning protocol. Recent reviews have highlighted that evaluation protocols can substantially affect reported performance and that studies employing different validation strategies should not be compared directly. In particular, conventional window-wise and random within-subject partitioning approaches may introduce temporal information leakage when highly overlapping windows are distributed across both training and validation sets, leading to overly optimistic performance estimates. Conversely, rigorous evaluation protocols, including non-overlapping, block-wise, trial-wise, cross-subject, cross-domain, and subject-independent approaches, provide more conservative but more reliable assessments of model generalisation.
To facilitate a fair comparison, previous studies are divided into two categories according to their evaluation methodology. Table 5 summarises studies that employed conventional evaluation protocols susceptible to temporal information leakage, whereas Table 6 presents studies adopting more rigorous evaluation and generalisation-oriented validation strategies. Following recent recommendations in the literature, studies employing fundamentally different evaluation protocols are reported separately to avoid misleading direct comparisons. Although all studies were evaluated using the DEAP dataset, differences in preprocessing pipelines, modality selection, feature extraction methods, and validation protocols should be considered when interpreting the reported accuracies.
The studies presented in Table 5 generally report high recognition accuracies; however, their evaluation protocols may permit temporal information leakage between training and validation sets, particularly when highly overlapping windows are employed. This can result in overly optimistic estimates of model performance and may partially explain the exceptionally high accuracies reported by several studies.
Compared with the studies presented in Table 5, the accuracies reported in Table 6 are noticeably lower. However, these results provide a more realistic estimate of model generalisation because they are obtained under stricter evaluation protocols that reduce or eliminate temporal information leakage. The studies in Table 6 adopt rigorous validation methodologies, including trial-wise, cross-subject, cross-domain, and leakage-aware block-wise evaluation, all of which aim to provide more reliable performance estimates than conventional window-wise validation. The proposed framework achieved 68.13% accuracy under strict trial-wise validation and 69.71% under non-overlapping block-wise evaluation, demonstrating competitive performance compared with other rigorous evaluation approaches. Furthermore, the substantial performance difference between Table 5 and Table 6 highlights the influence of evaluation methodology on reported emotion recognition accuracy. Although the trial-wise and non-overlapping block-wise accuracies are lower than those obtained using conventional window-wise validation, they offer a more reliable assessment of real-world performance. This observation is particularly important for the DEAP dataset, where the number of independent trials is limited, and the use of highly overlapping windows can otherwise lead to inflated performance estimates.
A further limitation of the present study is that the primary experiments were conducted under a within-subject evaluation framework. Although within-subject validation reduces inter-subject variability and enables a focused assessment of temporal modelling performance, it does not fully capture the challenges associated with generalising emotion recognition models to unseen individuals. Physiological responses to emotional stimuli can vary substantially across participants, making cross-subject emotion recognition considerably more difficult. Consequently, future research should further investigate subject-independent and cross-subject validation strategies to evaluate the robustness and practical applicability of the proposed framework in real-world scenarios.
5. Conclusions
Recognising emotions through multiple physiological modalities simultaneously is crucial for enabling real-time emotion recognition and improving classification accuracy. Traditional approaches often rely on single-modality signals, which limits their robustness and generalisability. This study addresses three key gaps in the field: the lack of effective fusion strategies for diverse physiological inputs, the challenge of handling high-dimensional time-series data, and the underutilisation of efficient temporal modelling techniques.
To address these challenges, we proposed a deep learning framework that integrates EEG, PPG, GSR, EOG, EMG, RR, and SKT using an early-fusion strategy combined with advanced temporal modelling. A TCN was employed as the primary architecture and benchmarked against Bi-LSTM to evaluate sequential modelling effectiveness.
A key contribution of this work is the investigation of leakage-aware evaluation strategies, including block-wise and trial-wise validation protocols, to reduce information leakage caused by highly overlapping windows and provide a more realistic assessment of model generalisation.
Experimental results on the DEAP dataset demonstrate that the proposed TCN model achieves a validation accuracy of 88.42% for valence and 86.35% for arousal under the stricter block-wise protocol, outperforming the Bi-LSTM baseline. In addition, the modality-wise ablation study confirms that integrating peripheral physiological signals alongside EEG progressively improves classification performance, with the full multimodal configuration achieving the highest validation accuracy. These findings highlight the complementary nature of central and peripheral signals and validate the effectiveness of the proposed fusion framework.
Furthermore, an additional investigation was conducted to examine the influence of evaluation protocols and temporal information leakage on emotion recognition performance. The results demonstrated that classification accuracy decreases as increasingly rigorous validation strategies are adopted, with trial-wise and non-overlapping block-wise protocols providing more conservative but more reliable estimates of model generalisation. These findings highlight the importance of leakage-aware evaluation methodologies and suggest that conventional window-wise validation may overestimate performance when highly overlapping windows are used.
While the proposed framework demonstrates strong performance, certain limitations should be acknowledged. The study relies on a relatively small sample size and employs a within-subject evaluation protocol, which may restrict generalisation to unseen participants. Future work will focus on expanding the dataset by collecting custom multimodal physiological recordings and exploring cross-subject validation to further enhance the robustness and applicability of the model in real-world scenarios.