1. Introduction
Accurately and objectively recognizing emotional states is crucial for advancing human–computer interaction; it also underpins the development of empathetic artificial intelligence and accurate mental health diagnosis. Although subjective self-reports have historically been used to measure emotions, this approach is prone to bias, notably social desirability bias. Electroencephalography (EEG), by contrast, offers an objective view of the neural dynamics of affect through direct, high-temporal-resolution measurement of the brain’s electrical activity, making EEG-based emotion recognition a cornerstone technology for next-generation brain–computer interfaces (BCIs) [1]. Such systems can adjust in real time to the emotional and cognitive states of their users. However, several fundamental issues inherent to the EEG signal itself, including its poor signal-to-noise ratio, high dimensionality, and notable inter-individual variability, hinder the transition from laboratory experiments to reliable, practical BCI applications [2].
Computational methods for decoding emotions from EEG have developed extensively. Early approaches were dominated by classical machine learning techniques such as Support Vector Machines (SVM) and k-Nearest Neighbors (k-NN), which rely primarily on handcrafted features such as differential entropy (DE) and power spectral density (PSD) [3,4,5]. Although deep learning models currently dominate, these classical models remain relevant: Nandi et al. [6] and Reis et al. [7] use online Logistic Regression; Xefteris et al. [8] combine SVM, MLP, and Random Forests with handcrafted features; and Fordson et al. [9] improve accuracy by incorporating non-EEG data through a graph based on demographic (Gender–Age) relations. However, reliance on manually crafted features can limit classification performance, since such features may not adequately capture the complex, non-linear spatiotemporal patterns that characterize emotional brain processes. Deep learning offers an end-to-end alternative, using Convolutional Neural Networks (CNNs) to automatically extract spatial characteristics from EEG topographies [10]. Yet grid convolutions misrepresent the neurophysiological reality: the brain is a graph of interconnected functional regions, not a grid of pixels. This realization has made Graph Neural Networks (GNNs), and especially Graph Convolutional Networks (GCNs), the focus of current research [11]. GCNs explicitly model the brain’s functional connectivity by representing EEG electrodes as nodes and their interactions as edges, bringing significant improvements in classification accuracy. Many studies propose hybrid architectures that combine GCN spatial features with LSTMs [12] or attention-based BiGRUs [13] for temporal modeling, designing networks for global–local feature aggregation. Another approach uses GCNs for global topology and CNNs for local patterns [14]. Other works introduce novel neural network components [15] and optimization strategies [16] to enhance model capability or efficiency.
Fusing multiple graph structures [17] can provide a more comprehensive representation for emotion recognition, but a considerable gap persists in how these graph structures are defined. Most GCNs currently applied to EEG build adjacency matrices based only on Euclidean distance or static neuroanatomical priors, ignoring how the signals change over the full time window [18]; they therefore fail to capture the dynamic, time-evolving character of brain network interactions during emotional experiences, which is essential for accurate emotion recognition. In response, self-organized networks that learn the graph structure directly from the input [19,20] and architectures that combine GCNs with Transformers [21] to capture long-term temporal dependencies [22,23] have become increasingly popular. At the same time, domain adaptation strategies that employ adversarial discriminators to obtain subject-invariant feature representations address the crucial problem of subject variability [24,25]. Despite these advances, the research community has primarily focused on improving accuracy, often overlooking the critical component of interpretability. For a BCI to be genuinely reliable and valuable in real-world applications, users must be able to comprehend why a specific emotional state was inferred, and the opacity of current models remains a significant obstacle.
Another critical aspect of emotion recognition is label granularity. While the Circumplex model of affect explicitly defines valence and arousal as continuous dimensions, the majority of researchers [26,27,28] divide each into two mutually exclusive categories, positive and negative, thereby reducing the Circumplex model to a dichotomous one; forcing this continuum into two coarse bins is a drastic simplification. Furthermore, many practical scenarios require more than a binary output. An adaptive learning system, for example, should be able to differentiate between “slightly confused,” “frustrated,” and “completely overwhelmed” to adjust the lesson difficulty appropriately, and a high-resolution model should provide this actionable granularity.
Therefore, there is a clear and immediate need for a computational framework that not only delivers state-of-the-art accuracy but also grounds its decisions in neurophysiological plausibility. This work hypothesizes that an interpretable self-organized graph attention network can simultaneously achieve high performance and provide transparent, neurophysiologically plausible explanations for its classifications.
Our main contribution can be summarized as follows:
Creating an end-to-end deep learning architecture with no extensive feature engineering.
Proposing a dynamic method for constructing the graph connectivity.
Designing a spatio-temporal attention module to capture temporal dependencies and spatial connectivity.
Integrating interpretability within our model to explore the neuroscientific explanation behind its attention mechanism.
Accurately recognizing two, four, and nine classes from a circumplex model, and three classes from a discrete model.
We evaluated IRSTGANet on the publicly available SEED and DEAP datasets under a rigorous 5-fold cross-validation protocol. Our results demonstrate that IRSTGANet not only outperforms existing state-of-the-art methods in classification accuracy but also, through visualization of its dynamic graphs and attention weights, generates interpretable representations.
2. State of the Art
Combining GNN with Transformer and attention models to capture spatiotemporal dependencies is a current trend. This is demonstrated in the Emotion Transformer (EmT), which utilizes a Temporal Contextual Transformer (TCT) module [
22] and the Multi-View Graph Transformer (MVGT) that integrates data across temporal, frequency, and spatial domains [
29]. This transformer architecture can also be combined with adaptive graph convolution to fuse features [
30]. In a similar vein, a Spatial-Temporal Graph Attention network with a Transformer encoder (STGATE) was created [
31], and a cascade using a Scale-aware Adaptive GCN and a Cross-EEG Transformer (CET) to model multiscale characteristics was suggested [
23]. Additionally, a multi-dimensional attention mechanism is employed in the AttGraph model to weight the discriminative EEG features and guide attention for feature selection [
32], while domain adaptation is utilized in the multimodal Mul-AT-RGCN model to align feature distributions across individuals [
26]. In summary, extensive research has demonstrated that the attention mechanism can effectively capture prominent EEG features, which is why, in this paper, we use a GAT block to weight the functional connectivity between EEG electrodes.
Another critical area of innovation is establishing the graph structure for GNNs, moving beyond static anatomical distances to represent the dynamic and functional nature of brain activity. This is addressed by generating dynamic adjacency matrices: some studies use Dynamic Time Warping (DTW) to capture time-domain relationships [33] or employ spatiotemporal attention for multi-level functional dependencies [18], while others learn dynamic inter-channel relationships [20] or leverage Granger causality to build more informative graphs for contrastive learning [34]. Earlier, the value of functional connectivity was studied by constructing graphs based on Phase-Locking Value (PLV) [
11,
27], while another work compared both structural (adjacency) and functional (PLV) connectivity, finding functional links to be more discriminative [
35]. In this direction, the Self-Organized Graph Neural Network (SOGNN) dynamically constructs a unique graph for each EEG signal [
19]. It is also extended by the Variational Spatial and Gaussian Temporal (VSGT) model, which utilizes a variational Bayesian approach to identify dynamic spatial dependencies [
36], and by a model that generates task-specific adjacency matrices [
13]. AT-DGNN [
37] and GraphEmotionNet [
24] also employ dynamic graph learning to better capture the evolving functional connectivity of the brain. For these reasons, we adopted a dynamic graph construction in our model: the adjacency matrix is initialized first, and the prominent brain connectivity is then learned through training.
One of the primary challenges addressed by various hybrid architecture concepts is the effective integration of spatial and temporal information. The Spatial–Temporal Graph Attention Network (STFCGAT) [
38] addressed this by using functional connectivity to achieve robust performance, in the same way that MSL-TGNN [
39] used a temporal learner in a multi-scale manner to extract dependencies, while the effective fusion of a Residual Network for spatial features with a Graph Attention Network to model channel connections was demonstrated in a later work [
40]. To obtain the advantages of both architectures, we combined the residual concept with our spatio-temporal graph attention network.
In the realm of explainability, some of the previously mentioned works integrated interpretability with their models to assess their neurophysiological plausibility, while others had interpretability as their main objective when identifying speech-related EEG biomarkers for Parkinson’s diagnosis [
41], or when using spectral graph theory to identify key control points in brain networks [
42]. We were inspired by these works to also incorporate interpretability into our model, providing explanations behind its conclusions.
To address the inherent uncertainty in emotion annotation, current research has focused on label-space optimization: rather than learning from fixed labels treated as absolute truth, these systems treat labels as noisy or incomplete signals to be collaboratively adjusted against learned brain representations. Notable examples include the CoAdapt framework [
43], which collaboratively aligns latent EEG features and associated annotations to improve label consistency, and Partial Label Learning for EEG [
44], which models candidate–label sets to resolve ambiguity. These methods highlight the importance of understanding the annotation process to enhance the robustness of emotion decoding systems.
However, multiclass classification from circumplex models has received little to no attention, particularly in graph models. Given that the circumplex model encompasses a wealth of information, reducing it to a dichotomous positive–negative model is a significant loss of information. A previously mentioned model using GCN in combination with CNN [
14] explored a four-class classification, and the Echo State Network (ESN) [
45] explored using four classes and eight classes when adding the dominance dimension from the DEAP dataset. For these reasons, we deemed it necessary to develop a multiclass classification system based on a dimensional model.
3. Materials and Methods
In this section, the datasets used will be discussed, along with the preprocessing and graph construction, the detailed model architecture, and the training setup, including the metrics used for evaluation.
3.1. Datasets
For a more thorough assessment, IRSTGANet is evaluated on two widely used EEG emotion recognition benchmarks: the DEAP dataset [
46,
47] and the SEED dataset [
48,
49]. Although the DEAP dataset is older, it remains among the most widely used datasets to this day, as does the SEED dataset. Another reason for choosing these datasets is their distinct features: different numbers of electrodes, assessment methods, subject numbers, and emotional models, which would provide a more nuanced evaluation of our model.
3.1.1. DEAP Dataset
DEAP is the most widely used emotion recognition dataset; it records EEG and peripheral physiological signals from 32 subjects while they watch 40 one-minute videos, which participants rate (1–9) on valence, arousal, and dominance. These signals were obtained with the 32-channel Biosemi ActiveTwo system in accordance with the 10/20 system. We used the preprocessed dataset, which was downsampled to 128 Hz and segmented into 63 s trials, comprising 60 s of stimulus and 3 s of pre-stimulus baseline.
3.1.2. SEED Dataset
SEED focuses on categorizing distinct emotions (negative, neutral, and positive). As with the DEAP dataset, we used the preprocessed version, which comprises 45 experiments, with 15 subjects repeating the experiment three times (sessions), and each experiment contains 15 trials. The data were downsampled to 200 Hz; the recordings were made using the 62-channel cap following the 10/20 system; and the stimuli were carefully selected 4 min videos.
3.2. Preprocessing
The datasets for this study underwent initial preprocessing, as described in the previous section. However, to prepare the data for deep learning models that analyze temporal dynamics, additional preprocessing was necessary: normalization to stabilize model training, and segmentation of the continuous EEG signals. The raw signals comprise 32 electrodes for DEAP and 62 for SEED; each DEAP trial contains 8064 time points, corresponding to approximately 63 s of neural activity, while SEED trials vary in length. The signals were divided into shorter windows with 50% overlap, equivalent to a 2 s stride; this overlap also serves as a powerful form of data augmentation [50]. For DEAP and SEED, window sizes L of 512 and 800 data points, respectively, were chosen, yielding segments of approximately 4 s each that effectively capture evolving emotional processes. The 4 s window size is widely used in the literature [8,16,36]; thus, we adopted the same duration across both datasets for consistency.
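As an illustration, the segmentation step can be sketched as follows (a minimal numpy sketch; the function name and stacking strategy are our own, not taken from the released code):

```python
import numpy as np

def segment_trial(trial, window, overlap=0.5):
    """Split a (channels, time) EEG trial into overlapping windows.

    Window sizes follow the paper: 512 samples for DEAP (128 Hz) and
    800 for SEED (200 Hz), both ~4 s with 50% overlap (a 2 s stride).
    """
    stride = int(window * (1 - overlap))
    n_channels, n_samples = trial.shape
    starts = range(0, n_samples - window + 1, stride)
    return np.stack([trial[:, s:s + window] for s in starts])

# DEAP: 32 channels, 8064 samples (~63 s at 128 Hz), 512-sample windows
trial = np.random.randn(32, 8064)
segments = segment_trial(trial, window=512)
print(segments.shape)  # (30, 32, 512): 30 overlapping windows per trial
```

With a 256-sample stride, each 8064-sample DEAP trial yields 30 windows, roughly doubling the number of training examples relative to non-overlapping segmentation.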
In our framework, we constructed a graph from a segment of multichannel EEG data, which is better suited to modeling relational information [
31]. A graph is defined as follows:

G = (N, E),

where N is the set of nodes and E is the set of edges. To construct this graph from the EEG, we treat each of the C brain electrodes (channels) as a distinct node in N:

N = {n_1, n_2, …, n_C}.

The raw signal captured by each electrode over a defined time window of length L constitutes the initial feature vector for its corresponding node, so the node features form a matrix X ∈ R^(C×L), and edges in E represent the relationships between these nodes:

E = {e_ij | n_i, n_j ∈ N},

which models the functional connection between electrodes n_i and n_j. The topology of the entire graph is encapsulated in a C × C adjacency matrix A, where each element a_ij quantifies the strength of the connection between nodes n_i and n_j:

A ∈ R^(C×C), a_ij ≥ 0.
In our approach, the learnable adjacency matrix is initially defined from static anatomical distances, with every electrode connected to every other electrode, including itself, via self-loops. It is essential to note that this initial connectivity does not presuppose which connections are meaningful; rather than relying on pre-defined neuroanatomical priors, the model is tasked with dynamically learning the strength and importance of these connections during training. This process automatically identifies and weights the specific brain routes and inter-channel interactions that are highly discriminative in classifying emotional states.
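A minimal sketch of such an initialization is shown below, assuming hypothetical electrode coordinates and a Gaussian kernel over inter-electrode distances (the paper does not specify the exact distance-to-weight mapping; this kernel is one common choice for distance-based adjacency):

```python
import numpy as np

# Hypothetical 3D electrode coordinates (not the actual montage files);
# in the model, A initialized this way is subsequently learned in training.
rng = np.random.default_rng(0)
coords = rng.normal(size=(32, 3))  # 32 channels, as in DEAP

# Gaussian kernel over pairwise Euclidean distances: close electrodes
# start with strong connections, distant ones with weak connections.
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
A = np.exp(-dist**2 / (2 * dist.std()**2))
np.fill_diagonal(A, 1.0)  # self-loops, as described above

print(A.shape)  # (32, 32)
```

The resulting matrix is dense (every electrode connected to every other, including itself), so no connection is ruled out a priori; training is then free to strengthen or suppress individual entries.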
For label engineering, as discussed in the dataset section, the DEAP ratings range from 1 to 9 and were used to create 2, 4, and 9 classes. We first used 5 as a threshold to define the dichotomous model for valence and arousal. Using this same threshold, we combined low and high levels to create four classes: Low-Valence-Low-Arousal, Low-Valence-High-Arousal, High-Valence-Low-Arousal, and High-Valence-High-Arousal, which correspond in the discrete model to Sad, Angry, Relaxed, and Happy, respectively. We then used 3 and 6 as thresholds to create three-level valence and arousal, and combined low, medium, and high levels to create nine classes: LALV, MALV, HALV, LAMV, MAMV, HAMV, LAHV, MAHV, and HAHV, which correspond in the discrete model to Depressed, Bored, Angry, Sad, Neutral, Nervous, Relaxed, Happy, and Excited, respectively. The SEED dataset already uses a discrete model of negative, neutral, and positive emotion. With this, we provide a powerful, neurophysiologically grounded input to the subsequent feature extractor.
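The label mapping can be made concrete with a short sketch (the exact boundary handling, e.g. whether a rating of exactly 5 counts as high, is our assumption; the paper states only the threshold values):

```python
def to_level(rating, thresholds):
    """Map a 1-9 rating to a discrete level index given ascending thresholds."""
    return sum(rating >= t for t in thresholds)

def deap_labels(valence, arousal):
    """Derive the 2-, 4-, and 9-class labels from DEAP valence/arousal ratings."""
    v2, a2 = to_level(valence, (5,)), to_level(arousal, (5,))      # binary, threshold 5
    four = 2 * v2 + a2          # LVLA=0, LVHA=1, HVLA=2, HVHA=3
    v3, a3 = to_level(valence, (3, 6)), to_level(arousal, (3, 6))  # low/medium/high
    nine = 3 * v3 + a3          # LALV=0, MALV=1, ..., HAHV=8
    return v2, a2, four, nine

print(deap_labels(7, 2))  # (1, 0, 2, 6): high valence, low arousal ("Relaxed")
```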
3.3. Model Architecture
The IRSTGANet model is a spatio-temporal graph network that couples a temporal feature extractor with attention-based graph reasoning over EEG channels in four stages: temporal feature encoding, temporal projection (dimension control), stacked residual GAT blocks, and global aggregation with a linear classification head. This architecture, as seen in
Figure 1, targets fast temporal fluctuations within channels and structured spatial dependencies across electrodes in EEG for emotion recognition.
The feature extractor is a crucial step designed to extract meaningful temporal patterns from the raw EEG signal of each node. It comprises a stack of three one-dimensional convolutional (Conv1D) layers, which function as a hierarchical processor capturing features at progressively broader temporal scales. The first layer expands the single input channel to 64 feature maps with a kernel size of 15 and a stride of 7, detecting fine-grained, short-duration patterns in the signal. The second layer, with 128 filters, a kernel of 11, and a stride of 5, builds on these initial features to recognize more complicated waveforms. The third Conv1D layer refines the representation into 256 channels, using a kernel of 7 and a stride of 3 to capture high-level temporal motifs. Finally, an Adaptive Average Pooling layer compresses the convolutional output to a fixed length of 8. Each convolutional layer is followed by Batch Normalization and a Gaussian Error Linear Unit (GELU) activation function. The projection block then transforms the extracted temporal features, taking as input a flattened feature tensor of size 256 × 8 = 2048, which originates from the 256-channel, 8-time-point output of the preceding Adaptive Average Pooling layer. Its main component is a linear layer (2048, 64) that projects the high-dimensional temporal features to a compact 64-dimensional temporal embedding (temb), followed by Layer Normalization, GELU activation, and a Dropout layer (0.5) to stabilize training.
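The temporal shapes implied by this stack can be checked with the standard Conv1D output-length formula (a sketch; zero padding is assumed, as the paper does not state the padding used):

```python
def conv1d_out_len(n, kernel, stride, padding=0):
    """Standard Conv1D output length (dilation 1): floor((n + 2p - k)/s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def backbone_len(n):
    """Trace a window of n samples through the three Conv1D layers above."""
    for kernel, stride in [(15, 7), (11, 5), (7, 3)]:
        n = conv1d_out_len(n, kernel, stride)
    return n

print(backbone_len(512))  # DEAP window -> 3 time steps
print(backbone_len(800))  # SEED window -> 5 time steps
```

Both window lengths collapse to a handful of time steps, which the Adaptive Average Pooling layer then maps to the fixed length of 8 regardless of input length, giving the 256 × 8 = 2048 flattened vector fed to the projection block.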
Through the network, the data is activated using the GELU [51] activation function [21], which is approximated as follows:

GELU(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
The projected features are processed through residual blocks, each consisting of two stacked Graph Attention (GAT) layers. Every GAT layer learns attention weights to dynamically highlight important inter-electrode interactions. The first GAT layer in the initial block takes the temb = 64 output from the projection block as its input and produces a hidden dimension of size 128, whereas the remaining GAT layers use and output this same hidden dimension. The first layer of each block employs two attention heads and is followed by GELU activation, whereas the second layer uses one attention head and is followed by Batch Normalization and GELU to stabilize and accelerate training, and dropout (0.2) to prevent overfitting. A residual connection after the first GAT block maintains earlier representations, enabling subsequent layers to refine rather than overwrite spatial information. The model also captures attention weights for interpretability purposes.
The core of the spatial feature learning is a two-layer GATv2Conv architecture, similar to previous works [31,39], which dynamically models complex functional relationships among EEG electrodes; the first GATv2Conv layer employs multi-head attention, allowing each node to select salient connections among its neighbors. GATv2 [52] can be defined as follows, taking as input the node feature matrix X ∈ R^(C×F), with F being the feature dimension of each channel and C being the number of channels (nodes). The scoring function e represents the attention coefficients, and α_ij is the normalized attention weight across neighbor nodes using Softmax:

e(h_i, h_j) = a^T LeakyReLU(W · [h_i ∥ h_j]),

α_ij = Softmax_j(e(h_i, h_j)) = exp(e(h_i, h_j)) / Σ_{k∈N(i)} exp(e(h_i, h_k)),

h′_i = σ(Σ_{j∈N(i)} α_ij · W h_j),

where LeakyReLU is the nonlinear activation function, W ∈ R^(F′×F) is the learned weight matrix, F′ is the feature dimension of the output nodes, ∥ denotes concatenation, and a^T is the transpose of the attention weight vector a ∈ R^(2F′). After the multi-head attention module, the attention coefficients are combined, resulting in the final representation of the transformed features using the normalized attention weights.
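The scoring and normalization above can be sketched in numpy for a single head (an illustrative re-implementation, not the PyTorch Geometric GATv2Conv code; shapes and names are our own):

```python
import numpy as np

def gatv2_attention(H, A, W, a, slope=0.2):
    """Single-head GATv2-style attention on a dense adjacency mask.

    H: (C, F) node features; A: (C, C) mask (nonzero = edge, incl. self-loops);
    W: (F, Fp) weight matrix; a: (2*Fp,) attention vector.
    """
    HW = H @ W                                            # (C, Fp)
    C = H.shape[0]
    # pair[i, j] = [W h_i || W h_j]
    pair = np.concatenate([np.repeat(HW[:, None, :], C, axis=1),
                           np.repeat(HW[None, :, :], C, axis=0)], axis=-1)
    z = np.where(pair > 0, pair, slope * pair)            # LeakyReLU before a (GATv2 order)
    e = z @ a                                             # scores e_ij
    e = np.where(A > 0, e, -np.inf)                       # keep only real edges
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)                 # softmax over neighbors
    return att @ HW, att

rng = np.random.default_rng(1)
H, W = rng.normal(size=(5, 8)), rng.normal(size=(8, 4))
a = rng.normal(size=(8,))                                 # 2 * Fp = 8
out, att = gatv2_attention(H, np.ones((5, 5)), W, a)
print(att.sum(axis=1))  # each row sums to 1
```

The `att` matrix is exactly the per-edge importance that the interpretability analysis later aggregates into a channel–channel importance matrix.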
This process learns adaptive “importance weights” for each link, determining which inter-channel relationships are most relevant for emotion recognition. The set of context-aware node properties is then fed into a second GATv2Conv layer, which refines the representations to capture higher-order spatial dependencies. Moreover, the attention weights from these layers can be visualized. Before classification, a global mean pooling layer is employed to obtain a graph-level representation. This graph embedding is then passed to the classification head, a multi-layer perceptron that maps the high-level representation to the final emotion classes. The classification head comprises three linear layers that progressively halve the hidden dimension: the first is (128, 64), the second (64, 32), and the third projects the 32-dimensional representation to the output logits corresponding to the number of classes, thereby extracting the most discriminative features for the task. A GELU activation function and dropout (with rates of 0.5 and 0.3) follow the first two linear layers to provide regularization and prevent overfitting, and a normalization layer precedes the activation function in the first layer to ensure stable and efficient training.
Our architecture presents a distinct approach to spatiotemporal feature learning in EEG emotion recognition. First, for temporal modeling, we depart from the common reliance on recurrent networks or standard transformer encoders by employing a stacked 1D convolutional backbone, which provides a more parameter-efficient approach to capturing local-to-global temporal dynamics and serves as a powerful multi-scale feature extractor, eliminating the need for a separate sequential modeling step. Second, we propose a specialized, residually stacked Graph Attention Network (GAT) block that enhances conventional GATs by incorporating residual connections across GAT blocks with stacked GAT layers, thereby mitigating over-smoothing issues and facilitating deeper graph propagation without loss of information.
3.4. Interpretability
We used Integrated Gradients (IG) [
53] as a feature-attribution method to quantify each channel’s contribution to the model’s class scores, where the integral of integrated gradients can be approximated by summing gradients at small intervals along the path from the baseline. We also exploited the model’s self-attention weights (from GAT blocks) to obtain edge-level importances between channels and constructed a channel–channel importance matrix by aggregating edge scores within each graph. For visualization, saliency maps were generated using IG and projected onto 2D scalp layouts (topomaps), and the learned attention matrices were analyzed to identify the strongest pairwise connections between channels (the top-k edge motifs) and the channels with the highest aggregated attention weights.
At evaluation, for each batch per subject, we computed IG with a zero baseline, and for each class, we integrated the gradient of the class logit with respect to the input along the straight path from baseline to the sample, resulting in a vector of per-channel attributions, which is then averaged across samples and across folds to obtain stable per-class IG maps. In parallel, we extracted attention weights from the last GAT block, normalized scores per graph, and accumulated them into an edge-importance matrix, from which we computed node connectivity via row and column sums and listed the top-k edges. To visualize spatial structure, we mapped channel names to a standard montage and plotted per-class IG topomaps with MNE, ensuring a shared color scale across classes.
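The Riemann-sum approximation of IG described above can be sketched as follows (a toy example with an analytic gradient, not the attribution pipeline used in practice; `grad_fn` stands in for backpropagated gradients of a class logit):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline=None, steps=64):
    """Riemann-sum (midpoint) approximation of Integrated Gradients.

    grad_fn(x) returns the gradient of the class score at x; the zero
    baseline matches the setup described above.
    """
    if baseline is None:
        baseline = np.zeros_like(x)
    alphas = (np.arange(steps) + 0.5) / steps   # midpoints of the path
    grads = np.mean([grad_fn(baseline + a * (x - baseline)) for a in alphas],
                    axis=0)
    return (x - baseline) * grads

# Toy score f(x) = sum(w * x**2) with analytic gradient 2*w*x; the exact
# attributions are w * x**2, and completeness gives sum(IG) = f(x) - f(0).
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 1.0, -0.7])
ig = integrated_gradients(lambda z: 2 * w * z, x)
print(ig, ig.sum())
```

The completeness property (attributions summing to the difference in class score between the sample and the baseline) is what makes the resulting per-channel maps directly comparable across classes.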
3.5. Training Setup and Metrics
PyTorch 2.3.1 and PyTorch Geometric 2.6.1 were used to implement the IRSTGANet model on an NVIDIA A6000 GPU, utilizing the Paperspace platform [
54]. The model was trained for 100 epochs using a learning rate of 0.001, a batch size of 64, and the Adam optimizer. Additionally, to ensure robust training and control overfitting, we employed a regularization strategy that included dropout, batch normalization, L2 weight decay, and early stopping; further hyperparameters are detailed in
Table 1.
When segmenting the data, windows were grouped by (subject, trial) rather than treated individually, to prevent windows from the same trial from appearing in both the training and testing sets. During training, each subject’s data is split into five folds; we rotate through them so that each fold serves once as the test set, and the subject’s accuracy is the average across the five folds. This is known as 5-fold Cross-Validation (5-fold CV).
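The trial-grouped split can be sketched as follows (our own illustrative GroupKFold-style assignment, not the exact evaluation code):

```python
import numpy as np

def trial_grouped_folds(trial_ids, k=5, seed=0):
    """Assign each window a fold based on its trial, so that windows from
    the same trial never appear in both train and test sets."""
    trials = np.unique(trial_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(trials)
    fold_of_trial = {t: i % k for i, t in enumerate(trials)}
    return np.array([fold_of_trial[t] for t in trial_ids])

# 40 DEAP trials for one subject, 30 windows each
trial_ids = np.repeat(np.arange(40), 30)
folds = trial_grouped_folds(trial_ids)
print(np.bincount(folds))  # 240 windows per fold (8 trials x 30 windows)
```

Splitting at the window level instead would leak near-duplicate, overlapping windows of the same trial into the test set and inflate accuracy, which is exactly what the grouping prevents.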
The metrics used to evaluate the model are as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN),

Precision = TP / (TP + FP),

Recall = TP / (TP + FN),

F1 = 2 × (Precision × Recall) / (Precision + Recall),

with TP true positives, TN true negatives, FN false negatives, and FP false positives. Accuracy is used as the primary metric, complemented by confusion matrices to visualize performance across the four and nine classes.
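These definitions translate directly into code (a minimal sketch using hypothetical confusion counts):

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(binary_metrics(tp=90, tn=85, fp=15, fn=10))
```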
4. Results
The evaluation metrics used to assess the performance of the proposed IRSTGANet model included accuracy, precision, recall, and F1-score for both valence and arousal, as well as for the four labels, the nine labels of the DEAP dataset, and the three emotion classes of the SEED dataset.
Figure 2 shows the results of the binary evaluation on the DEAP dataset, with
Figure 2a illustrating that the model achieved an average accuracy/F1-score of 96.28%/96.31% and 96.82%/97.03% for valence and arousal, respectively. Arousal slightly outperforms valence.
Figure 2b presents the detailed accuracy for each subject, showing that subject #15 achieved the best accuracy, at 99.49% and 97.92% for valence and arousal, respectively, while subject #4 achieved the lowest, at 91.49% and 94.19%. The trend of arousal outperforming valence is also visible in this figure.
For the multiclassification, IRSTGANet performed exceedingly well, as seen in
Figure 3, for both the four classes (quadrant) in
Figure 3a, and nine classes (nonant) in
Figure 3b, with average accuracies of 93.76% and 85.15%, respectively. The four-class results were consistent, with no subject falling below 80% and the best reaching 98.43%. The nine-class results showed more difficulty, with the lowest accuracies, down to 63.94%, for subjects #3, #14, #19, and #30, and the best reaching 96.38%; nevertheless, most subjects exceeded 80%. The model maintained consistently strong performance across complementary metrics, with an average precision, recall, and F1 of 95.2%, 91.35%, and 92.06%, respectively, for the quadrant classification, and of 90.77%, 85.65%, and 86.38%, respectively, for the nonant emotion recognition.
While the average accuracy provides an idea of the model’s overall performance, the confusion matrix offers a more nuanced view of each class’s performance, as shown in
Figure 4. The accuracy of the four classes in
Figure 4a is overall balanced, improving by roughly 1% per class from sad to happy: sad has the lowest accuracy at 92%, while happy has the highest at 95%. The nine-class recognition in
Figure 4b shows some difficulty with the neutral, nervous, and happy labels; in addition, 9% of the depressed samples are misclassified as excited.
For the SEED dataset, as shown in
Figure 5, the model performed exceptionally well on subjects #3, #11, #12, #14, and #15, reaching 100% across all metrics, with subject #6 slightly behind at 98% on all metrics. Apart from difficulties with subjects #10 and #13, performance was stable for the remaining six subjects.
Figure 5a provides a glimpse of the overall performance summarized in
Figure 5b, where the model achieved an average accuracy of 94.04%; the lowest accuracy, 77.33%, was recorded for subject #13.
Because many of the strongest connections were self-loops, these were omitted for clearer interpretability, leaving only the strong connections between distinct channels.
Figure 6 summarizes the top-ranked attention edges extracted from the deepest GAT block of the proposed model.
Figure 6b shows that the highest normalized edge weights correspond primarily to within-hemisphere fronto–parietal connections, notably F7 → P7 on the left and FC4 → P2 on the right. These links appeared repeatedly across attention heads and subjects, suggesting that the model consistently emphasizes information flow from anterior to posterior regions within each hemisphere.
Figure 6a presents the top attention-derived edge motifs with the strongest connections, mainly linking frontal and parietal regions within the same hemisphere, such as FC2 → TP8, F2 → P8, T7 → P5, and F7 → P7. Several central–parietal and fronto–occipital links (Cz → P1, FC4 → P2, C2 → PO7, CP3 → PO4, FT8 → Oz) were also identified, while additional motifs, including FC6 → AF4, AF4 → Fp2, and AF7 → TP10, suggest the involvement of right fronto–central and anterior regions in emotional information processing.
Figure 7 presents a summary of the IG attribution scores across the three emotion classes. A consistent, strong right frontal–temporal focus demonstrated the highest contribution across all classes, highlighting the significant role of this region in emotion discrimination, particularly for the negative class. A moderate contribution from the frontal central region is observed for the positive class. The neutral class displayed a comparable pattern, highlighting the midline electrodes, which indicates increased engagement of central regions and a moderate contribution from the right parietal region.
Figure 8 shows the most significant channels, with the highest values being FC6, FC, and FT8, followed by F2, FCz, Cz, F6, F4, and CP3, then C2, FC2, Fz, AF3, C4, and CP1, emphasizing the previous results on the involvement of the frontal and central regions.
The experimental findings demonstrate that combining temporal convolutions with residual graph attention mechanisms significantly enhances the model’s ability to capture both spatial and temporal information, and the attention maps indicate that the learned node connectivity maps onto recognized emotion-related brain regions.
5. Discussion
Performance is reported in Section 4 as the average accuracy across all subjects. Due to a limitation in the evaluation pipeline, the within-subject variance across cross-validation folds was not retained for analysis; therefore, the standard deviation (SD) shown in this section reflects inter-subject variability, calculated from the spread of individual subjects’ mean accuracies. Future work will include comprehensive reporting of both within- and across-subject variance metrics to fully characterize model stability.
The proposed IRSTGANet model achieved exceptional results, with an average accuracy of 96.28% ± 1.82% and 96.82% ± 1.92% for valence and arousal, respectively; 93.76% ± 3.96% for the four classes LVLA, LVHA, HVLA, and HVHA (Sad, Angry, Relaxed, and Happy); 85.15% ± 8.37% for the nine classes LALV, MALV, HALV, LAMV, MAMV, HAMV, LAHV, MAHV, and HAHV (Depressed, Bored, Angry, Sad, Neutral, Nervous, Relaxed, Happy, and Excited); and 94.04% ± 6.70% for the three classes negative, neutral, and positive.
The results demonstrate that combining a temporal feature extractor with residual graph attention layers effectively captures both short-term neural changes and long-range spatial relationships, yielding not only high, stable accuracy across datasets, experiments, and emotion labels, but also high precision and recall, reflecting the model’s sensitivity in emotion recognition, together with a balanced F1 score.
Additionally, the model accurately recognized multiclass labels, a challenging task, demonstrating its ability to capture the complex spatial dependencies among emotions.
These results indicate that the architecture processes emotion at multiple scales: the first level detects specific temporal patterns in individual electrodes, the second captures interactions among neighboring brain regions, and the third models the complex interactions among all brain regions. The interpretability results reinforce this view, identifying important nodes at the first level, node-to-node importance at the second, and the region responsible for an emotion at the third.
IRSTGANet is designed for efficiency, with 696,546 trainable parameters, and requires 0.15 GFLOPs (150 million operations) per forward pass; on an NVIDIA RTX A6000, this yields an inference latency of 11.216 ms and a peak memory usage of 333.8 MB. The model’s low computational footprint makes it a good candidate for deployment on systems with moderate processing capabilities, although whether it meets the latency requirements of a specific real-time, closed-loop system depends on the target hardware and optimization. Our current proof-of-concept results therefore demonstrate that the core architecture is not prohibitively complex and provides a foundation that can be further optimized for edge deployment.
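Inference latency of the kind quoted above is commonly estimated by timing repeated forward passes after a warm-up phase and reporting the median; the sketch below uses a stand-in workload in place of the actual model, so the function and numbers are purely illustrative:

```python
import time
import statistics

def measure_latency_ms(forward, warmup=10, runs=100):
    """Median wall-clock latency of one forward pass, in milliseconds."""
    for _ in range(warmup):          # warm caches/allocators before timing
        forward()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        forward()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)

# stand-in workload in place of the real model's forward pass
dummy_forward = lambda: sum(i * i for i in range(10_000))
latency_ms = measure_latency_ms(dummy_forward)
```

The median is preferred over the mean here because occasional scheduler or garbage-collection stalls would otherwise inflate the estimate; on a GPU, one would additionally need to synchronize the device before reading the clock.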
5.1. Interpretability
Notably, the spatial distribution highlighted by the node analysis complements the connectivity patterns observed in the attention-based edge motifs, as the same right fronto–central and midline regions (e.g., FC6, FC4, FT8, FCz, Cz) that emerged as important in the channel importance also appear as major sources in the connectivity importance, such as the pathways from FC4 to P2 and from F7 to P7. This convergence between two independent interpretability methods strengthens the evidence that the model consistently relies on functionally meaningful fronto–parietal interactions during emotion recognition. In other words, both node-level attributions and graph-level attention emphasize the same brain regions, reinforcing the biological plausibility of the learned representations. Additionally, the strong self-loop attention (omitted from the visualizations for clarity) suggests that the network also retains node-specific information within key channels. Overall, the attention distribution suggests a within-hemisphere, structured pattern of recurrent fronto–parietal interactions and central-node reinforcement, rather than diffuse or random connectivity.
The neurobiological foundation of emotion recognition can be understood through a network of brain systems responsible for the evaluation, integration, and regulation of emotion. The medial prefrontal cortex has different roles, with the dorsal/posterior sectors handling initial emotional appraisal and expression, and the ventral/anterior regions regulating and extinguishing these responses, forming a hierarchical control mechanism. The lateral prefrontal cortex integrates cognitive and affective signals to facilitate goal-directed behavior, and this process is enhanced by subcortical structures like the basal forebrain, which aligns cortical activity with emotional and motivational states [55]. Additionally, shared neural representation, illustrated by vicarious neuronal activity, means that the brain activates similar regions merely by observing someone else, creating an evoked experience that supports understanding, empathy, and social learning in areas such as the anterior cingulate cortex, insula, and amygdala during emotional experiences; this underlies the concept of emotional contagion, especially for negative emotions [56].
In light of this neurobiological framework, the model’s interpretability results gain validity: the model not only achieves high accuracy but also reflects the functional architecture of emotion in the EEG. The learned graph is not random; rather, it is an emergent representation of the same integration connections (lateral prefrontal cortex), control hierarchies (medial prefrontal midline), and broadcast pathways that neuroscience has identified as the core systems for affective processing. Our contribution demonstrates that this architecture can be learned entirely from raw signals and that the particular features our model highlights (right-lateralized fronto–central pathways) possess the discriminative capability required for advanced emotion recognition.
5.2. Ablation Studies
We conducted ablation studies to examine the effects of the model’s main components, first excluding the temporal feature extractor (TFE) from IRSTGANet and then the stacked residual GAT blocks (SRGAT). Removing the temporal components isolates the contribution of spatial graph attention, while replacing the GAT blocks with a simple GCN quantifies the gain attributable to the SRGAT block. Both scenarios were tested with two different initial graph types, one fully connected (fc) and the other based on 10–20 connectivity.
As presented in Table 2, excluding the SRGAT block dropped performance by approximately 9%, with valence accuracy falling from 96.28% to 86.87% and arousal from 96.82% to 87.77%. Removing the TFE block caused a smaller decrease, from 96.28% to 94.90%, corresponding to a 1.38% drop in valence and a 1.48% drop in arousal. Performance without SRGAT is also less robust, as indicated by the larger standard deviation. The choice between fc and 10–20 initialization is insignificant when SRGAT is omitted, while with SRGAT a slight effect of ≈0.5% is observed, suggesting that adjacency matrix initialization matters even though the connections are learned during training. The slight outperformance of fc for valence and of 10–20 for arousal suggests a direction for further task-specific analysis.
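The two initial graph types compared above can be sketched as adjacency matrices; the electrode coordinates and distance threshold below are illustrative assumptions, not the montage geometry used in the experiments:

```python
import numpy as np

def fully_connected_adjacency(n):
    """Fully connected (fc) graph: every channel pair linked, no self-loops."""
    return np.ones((n, n)) - np.eye(n)

def montage_adjacency(positions, radius):
    """Distance-based 10-20-style graph: link electrodes closer than `radius`.

    positions: (n, 2) array of 2-D electrode coordinates (illustrative).
    """
    pos = np.asarray(positions, dtype=float)
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    adj = (dist < radius).astype(float)
    np.fill_diagonal(adj, 0.0)        # no self-loops
    return adj

# four illustrative electrodes on a unit head layout
pos = np.array([[0.0, 0.8], [0.0, 0.4], [0.0, 0.0], [0.8, 0.0]])
A_fc = fully_connected_adjacency(4)          # dense: 12 directed edges
A_1020 = montage_adjacency(pos, radius=0.5)  # sparse: only nearby pairs
```

Either matrix serves only as the starting point for the learned graph; as noted above, the attention mechanism adapts the connections during training, which is why the initialization effect is small but not zero.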
These ablation results demonstrate that the full IRSTGANet architecture achieves its superior performance through the integrated contribution of all its components. The critical role of the SRGAT block is especially evident, confirming that its dynamic modeling of inter-channel spatial relations is essential for accurately characterizing emotional brain activity.
5.3. Comparison with the State of the Art
Relative to previous graph-based EEG models, IRSTGANet was compared on the DEAP dataset with AT-DGNN-Gen [37], LTS-GAT [25], P-GCNN [11], VSGT [36], GARG [9], ResGAT [40], CNN + GA Features [8], RSTAGNN [16], CADD-DCCNN [57], Mul-AT-RGCN [26], MSL-TGNN [39], ESN [45], GLFANet [14], and DGAT [58], as shown in Table 3.
Our model outperformed all others on the binary classification of valence and arousal, surpassing the best model, GLFANet, by 1.83% in average accuracy. For the four-class task, IRSTGANet outperformed ESN (4-class) by 10.16% and GLFANet (4-class) by 0.84%, while still surpassing most of the binary classifiers. For the nine-class task, our model outperformed ESN (8-class) by 6.45% and fell behind only eight of the sixteen presented models, thus surpassing half of the SOTA.
IRSTGANet was compared with the following models on the SEED dataset: P-GCNN [11], SOGNN [19], CADD-DCCNN [57], STGATE [31], GLFANet [14], STAFNet [30], and DGAT [58]. The results are presented in Table 4. Our model surpassed P-GCNN by 9.69%, SOGNN by 7.23%, CADD-DCCNN by 6.63%, STGATE by 3.67%, GLFANet by 0.85%, and STAFNet by a margin of 0.26%, thus surpassing all the SOTA in terms of accuracy.
To summarize, our approach offers three main advantages: enhanced temporal feature extraction through convolutional encoding, stable multi-layer spatial reasoning via residual attention, and interpretable measures of electrode-level, electrode-neighboring-level, and region-level importance.
Several limitations must be considered. The window size used in segmentation has not yet been shown to be optimal, as each task has its own specifics. A key consideration in interpreting the results is the nature of the emotional labels in the DEAP and SEED datasets, which use stimulus elicitation: participants self-report ratings of ‘felt’ emotion in response to emotional videos, and the EEG is presumed to capture the evoked emotional experience. However, this paradigm mixes sensory decoding, cognitive appraisal, and the generation of felt emotion, making it difficult to isolate pure internal feelings from perception. Thus, while the model is effective in this context, its applicability to context-free emotional experiences requires further validation. The findings are most relevant to fields such as neuromarketing, media studies, and specific biofeedback therapies, where emotional stimuli are defined and engagement is measured, and are less validated for emotional states arising purely from internal thought without an external, known trigger, as in diagnosing mood disorders or interpreting everyday social interactions. Moreover, real-world applicability remains a consistent limitation.
Despite the model’s success, there remains room for improvement. The first step is to test its generalization to unseen subjects or recording setups [24,25,59] and to incorporate other data modalities, such as fNIRS and peripheral physiological signals [60,61,62], which could enhance robustness. Second, exploring data augmentation to address limited dataset sizes [28,63], simplifying the model and input [20], and reducing dimensionality [64] could help mitigate the graph structure’s high complexity.