An Interpretable Residual Spatio-Temporal Graph Attention Network for Multiclass Emotion Recognition from EEG

Hilali, Manal; Ezzati, Abdellah; Ben Alla, Said; El Badaoui, Ahmed

doi:10.3390/signals7010016

Open AccessArticle

An Interpretable Residual Spatio-Temporal Graph Attention Network for Multiclass Emotion Recognition from EEG

by

Manal Hilali

^1,*

,

Abdellah Ezzati

¹,

Said Ben Alla

²

and

Ahmed El Badaoui

¹

Faculty of Sciences and Techniques, Hassan First University, Settat 26000, Morocco

²

National School for Applied Science, Hassan First University, Berrechid 26100, Morocco

^*

Author to whom correspondence should be addressed.

Signals 2026, 7(1), 16; https://doi.org/10.3390/signals7010016

Submission received: 14 November 2025 / Revised: 29 December 2025 / Accepted: 30 January 2026 / Published: 5 February 2026

Download

Browse Figures

Versions Notes

Abstract

Automatic emotion recognition based on EEG has been a key research frontier in recent years, involving the direct extraction of emotional states from brain dynamics. However, existing deep learning approaches often treat EEG either as a sequence or as a static spatial map, thereby failing to jointly capture the temporal evolution and spatial dependencies underlying emotional responses. To address this limitation, we propose an Interpretable Residual Spatio-Temporal Graph Attention Network (IRSTGANet) that integrates temporal convolutional encoding with residual graph-attention blocks. The temporal module enhances short-term EEG dynamics, while the graph-attention layers learn adaptive node connectivity relationships and preserve contextual information through residual links. Evaluated on the DEAP and SEED datasets, the proposed model achieved exceptional performance on valence and arousal, as well as four-class and nine-class classification on the DEAP dataset and on the three-class SEED dataset, exceeding state-of-the-art methods. These results demonstrate that combining temporal enhancement with residual graph attention yields both improved recognition performance and interpretable insights into emotion-related neural connectivity.

Keywords:

electroencephalography; emotion recognition; graph neural network; DEAP; SEED; multi-classification

1. Introduction

Accurately and objectively recognizing emotional states is crucial for advancing human–computer interaction, which also implies the development of artificial intelligence solutions with empathy, as well as accurate mental health diagnosis. Although subjective self-reports have historically been used to measure emotions, this approach is prone to bias, often referred to as social desirability bias. Electroencephalography (EEG), on the other hand, offers an objective view of the neural dynamics of affect by providing a direct, high-temporal-resolution measurement of the brain’s electrical activity, making EEG-based emotion recognition a cornerstone technology for next-generation brain–computer interfaces (BCIs) [1]. These intriguing systems can instantly adjust to the emotional and cognitive states of their users. However, several fundamental issues inherent to the EEG signal itself, including its poor signal-to-noise ratio, high dimensionality, and notable inter-individual variability, delay the transition from laboratory experiments to reliable, practical BCI applications [2].

There has been extensive development in computer methods for decoding emotions from EEG, namely, in Support Vector Machines (SVM) and k-Nearest Neighbors (k-NN), which are two basic machine learning techniques that primarily rely on manually created characteristics such as differential entropy (DE) and power spectral density (PSD), which dominated early approaches [3,4,5]. Although deep learning models currently dominate, these classical models continue to be relevant. Arijit Nandi et al. focus on using online Logistic Regression [6], as do Reis et al. [7]; Xefteris et al. [8] use SVM, MLP, and Random Forests, in combination with handcrafted features; and Perry Fordson et al. [9], aim to improve accuracy by including non-EEG data with a graph based on demographic relations (Gender–Age). However, their model’s reliance on manually created features may result in less-than-ideal classification performance, as these features may not adequately capture the complicated, non-linear spatiotemporal patterns that define emotional brain processes. To address this gap, the development of deep learning has provided an end-to-end solution, utilizing Convolutional Neural Networks (CNNs) to automatically extract spatial characteristics from EEG topographies [10]. However, these convolutions fundamentally misrepresent the neurophysiological reality, as the brain is a graph of interconnected functional regions rather than a grid of pixels. Graph Neural Networks (GNNs), and especially Graph Convolutional Networks (GCNs), have become the focus of current research due to this realization [11], with GCNs being able to explicitly depict the brain’s functional connectivity by portraying EEG electrodes as nodes and their interactions as edges, bringing significant improvements in classification accuracy. Many studies propose hybrid architectures that combine GCNs’ spatial features with LSTMs [12] or attention-based BiGRUs for temporal modeling [13], thus designing a network for global–local feature aggregation. Another approach utilizes GCNs for global topology and CNNs for local patterns [14]. Other works introduce novel neural network components [15] and optimization strategies to enhance model capability or efficiency [16].

Fusing multiple graph structures [17] could provide a more comprehensive representation for emotion recognition, although a considerable gap persists in how these graph structures are defined, as the majority of GCNs currently used for processing EEG signals create adjacency matrices based only on Euclidean distance or static neuroanatomical priors, ignoring the temporal changes in signals over the entire time window [18], and thus fail in capturing the dynamic, time-evolving character of brain network interactions during emotional experiences that is essential for accurate emotion recognition. As a result, self-organized networks that learn the graph structure directly from the input [19,20] and architectures that combine GCNs with Transformers [21] to capture long-term temporal dependencies [22,23] are recent advances that have become increasingly popular in the field. At the same time, domain adaptation strategies that utilize adversarial discriminators to obtain subject-invariant feature representations are being employed to address the crucial problem of subject variability [24,25]. Despite these advanced developments, the research community has primarily focused on improving accuracy, often overlooking the critical component of interpretability, which is essential for real-world applications in BCIs, as users must be able to comprehend why a specific emotional state was inferred for a BCI to be genuinely reliable and valuable, which is a significant obstacle to the use of such models.

Another critical aspect of ER is that, while the Circumplex model of affect explicitly defines valence and arousal as continuous dimensions, the majority of researchers [26,27,28] divide them into two mutually exclusive categories, positive and negative, for both valence and arousal, thereby reducing the Circumplex model to a dichotomous model, but forcing this continuum into two coarse bins is a drastic simplification. Furthermore, many practical scenarios require more than a binary output; for example, an adaptive learning system should be able to differentiate between “slightly confused,” “frustrated,” or “completely overwhelmed” to adjust the lesson difficulty appropriately, and a high-resolution model should provide this actionable granularity.

Therefore, there is a clear and immediate need for a revolutionary computational framework that not only delivers cutting-edge accuracy but also provides significance to the neurophysiological plausibility of its decisions. This work hypothesizes that an interpretable self-organized graph attention network can simultaneously achieve high performance and provide transparent, neurophysiologically plausible explanations for its classifications.

Our main contribution can be summarized as follows:

Creating an end-to-end deep learning architecture with no extensive feature engineering.
Proposing a dynamic method for the connectivity of the graph construction.
Designing a spatio-temporal attention module to capture temporal dependencies and spatial connectivity.
Integrating interpretability within our model to explore the neuroscientific explanation behind its attention mechanism.
Accurately recognizing—from a circumplex model—two, four, and nine classes, and three classes from a discrete model.

We evaluated IRSTGANet on the publicly available SEED and DEAP datasets under a rigorous 5-fold cross-validation protocol. Our results demonstrate that IRSTGANet not only outperforms existing state-of-the-art methods in classification accuracy but also, through visualization of its dynamic graphs and attention weights, generates interpretable representations.

2. State of the Art

Combining GNN with Transformer and attention models to capture spatiotemporal dependencies is a current trend. This is demonstrated in the Emotion Transformer (EmT), which utilizes a Temporal Contextual Transformer (TCT) module [22] and the Multi-View Graph Transformer (MVGT) that integrates data across temporal, frequency, and spatial domains [29]. This transformer architecture can also be combined with adaptive graph convolution to fuse features [30]. In a similar vein, a Spatial-Temporal Graph Attention network with a Transformer encoder (STGATE) was created [31], and a cascade using a Scale-aware Adaptive GCN and a Cross-EEG Transformer (CET) to model multiscale characteristics was suggested [23]. Additionally, a multi-dimensional attention mechanism is employed in the AttGraph model to weight the discriminative EEG features and guide attention for feature selection [32], while domain adaptation is utilized in the multimodal Mul-AT-RGCN model to align feature distributions across individuals [26]. In conclusion, extensive research has demonstrated that the attention mechanism can effectively capture prominent EEG features, which is why, in this paper, we used the GAT block to weight the functional connectivity between EEG electrodes effectively.

Another critical area of innovation is in establishing the graph structure for GNNs, which extends beyond static anatomical distances to represent the dynamic and functional nature of brain activity. This is addressed by generating dynamic adjacency matrices; using Dynamic Time Warping (DTW) to capture time–domain relationships [33] and employing spatiotemporal attention for multi-level functional dependencies [18] have been explored in some case studies, as have learning dynamic inter-channel relationships [20] and leveraging Granger causality to build more informative graphs for contrastive learning [34]. Earlier, the value of functional connectivity was studied by constructing graphs based on Phase-Locking Value (PLV) [11,27], while another work compared both structural (adjacency) and functional (PLV) connectivity, finding functional links to be more discriminative [35]. In this direction, the Self-Organized Graph Neural Network (SOGNN) dynamically constructs a unique graph for each EEG signal [19]. It is also extended by the Variational Spatial and Gaussian Temporal (VSGT) model, which utilizes a variational Bayesian approach to identify dynamic spatial dependencies [36], and by a model that generates task-specific adjacency matrices [13]. AT-DGNN [37] and GraphEmotionNet [24] employ dynamic graph learning too, to better capture the evolving functional connectivity of the brain. For these reasons, we chose to adopt a dynamic graph construction in our model by fixing the adjacency matrix first, then learning the prominent brain connectivity through training.

One of the primary challenges addressed by various hybrid architecture concepts is the effective integration of spatial and temporal information. The Spatial–Temporal Graph Attention Network (STFCGAT) [38] addressed this by using functional connectivity to achieve robust performance, in the same way that MSL-TGNN [39] used a temporal learner in a multi-scale manner to extract dependencies, while the effective fusion of a Residual Network for spatial features with a Graph Attention Network to model channel connections was demonstrated in a later work [40]. To obtain the advantages of both architectures, we combined the residual concept with our spatio-temporal graph attention network.

In the realm of explainability, some of the previously mentioned works integrated interpretability with their models to assess their neurophysiological plausibility, while others had interpretability as their main objective when identifying speech-related EEG biomarkers for Parkinson’s diagnosis [41], or when using spectral graph theory to identify key control points in brain networks [42]. We were inspired by these works to also incorporate interpretability into our model, providing explanations behind its conclusions.

To address the inherent uncertainty in emotion annotation, current research has focused on label-space optimization and on systems that learn from fixed labels, treating labels as noisy or incomplete signals that should be collaboratively adjusted against learned brain representations rather than treated as absolute truth. Notable examples include the CoAdapt framework [43], which collaboratively aligns latent EEG features and associated annotations to improve label consistency, and Partial Label Learning for EEG [44], which models candidate–label sets to resolve ambiguity. These methods highlight the importance of understanding the annotation process to enhance the robustness of emotion decoding systems.

However, multiclass classification from circumplex models has received little to no attention, particularly in graph models. Given that the circumplex model encompasses a wealth of information, reducing it to a dichotomous positive–negative model is a significant loss of information. A previously mentioned model using GCN in combination with CNN [14] explored a four-class classification, and the Echo State Network ESN [45] explored using four classes and eight classes when adding the dominance dimension from the DEAP dataset. For these reasons, we deemed it necessary to develop a multiclass classification system based on a dimensional model.

3. Materials and Methods

In this section, the datasets used will be discussed, along with the preprocessing and graph construction, the detailed model architecture, and the training setup, including the metrics used for evaluation.

3.1. Datasets

For a more thorough assessment, IRSTGANet is evaluated on two widely used EEG emotion recognition benchmarks: the DEAP dataset [46,47] and the SEED dataset [48,49]. Although the DEAP dataset is older, it remains among the most widely used datasets to this day, as does the SEED dataset. Another reason for choosing these datasets is their distinct features: different numbers of electrodes, assessment methods, subject numbers, and emotional models, which would provide a more nuanced evaluation of our model.

3.1.1. DEAP Dataset

DEAP is the most widely used emotion recognition dataset; it records EEG and peripheral physiological signals from 32 subjects while they watch 40 one-minute videos that participants rate (1–9) on valence, arousal, and dominance. These signals were obtained with the 32-channel Biosemi ActiveTwo system in accordance with the 10/20 system. We used the preprocessed dataset, which was downsampled to 128 Hz and divided into 63 s segments, comprising 60 s of stimulus and 3 s of pre-stimulus.

3.1.2. SEED Dataset

SEED focuses on categorizing distinct emotions (negative, neutral, and positive). As with the DEAP dataset, we used the preprocessed version, which comprises 45 experiments, with 15 subjects repeating the experiment three times (sessions), and each experiment contains 15 trials. The data were downsampled to 200 Hz; the recordings were made using the 62-channel cap following the 10/20 system; and the stimuli were carefully selected 4 min videos.

3.2. Preprocessing

The datasets for this study underwent initial preprocessing, as described in the previous section. However, to prepare this data for deep learning models that analyze temporal dynamics, additional preprocessing was necessary. This included normalization to stabilize model training and segmentation of the continuous EEG signals. The raw signals, comprising 32 electrodes for DEAP and 62 for SEED, were recorded at 8064 time points for DEAP, corresponding to approximately 63 s of neural activity. They were divided into shorter, overlapping windows with 50% overlap, equivalent to two seconds. This overlap serves as a powerful form of data augmentation [50]. SEED has different timestamp lengths. For DEAP and SEED, window sizes L of 512 and 800 data points, respectively, were chosen, yielding segments of approximately 4 s each that effectively capture evolving emotional processes. The 4 s window size is widely used in the literature [8,16,36]; thus, we adopted the same 4 s duration across datasets for consistency.

In our framework, we constructed a graph from a segment of multichannel EEG data, which is better suited to modeling relational information [31]. A graph is defined as follows:

G = (N, E)

(1)

where N is the set of nodes, and E is the set of edges. To construct this graph from the EEG, we treat each of the brain electrodes (channels), C, as a distinct node in N:

N = \{n_{i}| i = 1, . ., C}

(2)

The raw signal captured by each electrode over a defined time window constitutes the initial feature vector for its corresponding node, and edges in E represent the relationships between these nodes:

E = \{e_{i j}| n_{i}, n_{j} \in N}

(3)

which models the functional connection between electrodes

n_{i}

and

n_{j}

. The topology of the entire graph is encapsulated in a C × C adjacency matrix A, where each element quantifies the strength

a_{i j}

of the connection between nodes:

A = {a_{i j}}

(4)

In our approach, the learnable adjacency matrix is initially defined from static anatomical distances, with every electrode connected to every other electrode, including itself, via self-loops. It is essential to note that this initial connectivity does not presuppose which connections are meaningful; rather than relying on pre-defined neuroanatomical priors, the model is tasked with dynamically learning the strength and importance of these connections during training. This process automatically identifies and weights the specific brain routes and inter-channel interactions that are highly discriminative in classifying emotional states.

For label engineering, as discussed in the dataset section, the labels for the DEAP dataset were rated from 1 to 9, to create 2, 4, and 9 classes. We initially used 5 as a threshold to define the dichotomous model for valence and arousal; using this same threshold we used a combination of low and high to create four classes, which were Low-Valence-Low-Arousal, Low-Valence-High-Arousal, High-Valence-Low-Arousal, and High-Valence-High-Arousal which could be equivalent in the discrete model to Sad, Angry, Relaxed, and Happy, respectively; then, we used 3 and 6 as thresholds to create three-level valence and arousal, and used a combination of low, medium, and high to create nine classes: LALV, MALV, HALV, LAMV, MAMV, HAMV, LAHV, MAHV, and HAHV, which could be equivalent in the discrete model to Depressed, Bored, Angry, Sad, Neutral, Nervous, Relaxed, Happy, and Excited, respectively. As for the SEED dataset, it already uses a discrete model of negative, neutral, and positive emotion. With this, we provide a powerful, neurophysiologically grounded input to the subsequent feature extractor.

3.3. Model Architecture

The IRSTGANet model is a spatio-temporal graph network that couples a temporal feature extractor with attention-based graph reasoning over EEG channels in four stages: temporal feature encoding, temporal projection (dimension control), stacked residual GAT blocks, and global aggregation with a linear classification head. This architecture, as seen in Figure 1, targets fast temporal fluctuations within channels and structured spatial dependencies across electrodes in EEG for emotion recognition.

The feature extractor is a crucial step designed to extract meaningful temporal patterns from the raw EEG signal of each node, comprising a stack of three one-dimensional convolutional (Conv1D) layers, which function as a hierarchical processor to capture features at progressively broader temporal scales. The first layer increases the single input channel to 64 feature maps with a kernel size of 15 and a stride of 7, thereby looking for fine-grained, short-duration patterns in the signal before passing these to the second layer, which has 128 filters, a kernel of 11, and a stride of 5, and expands on these initial properties to recognize more complicated waveforms, then the third Conv1D layer refines the representation into 256 channels, using a kernel of 7 and a stride of 3 to capture high-level temporal motifs. Finally, an Adaptive Average Pooling layer is used to compress the convolution’s output into a fixed size of 8. Each convolutional layer is followed by Batch Normalization and a Gaussian Error Linear Unit (GELU) activation function. Then, the projection block transforms the temporal features extracted by the feature extractor, taking as input a flattened feature tensor with dimensions of 256 × 8 = 2048, which originates from the 256-channel, 8-time-point output of the preceding Adaptive Average Pooling layer. The main part of this block is the linear layer (2048, 64), which projects high-dimensional temporal features to a compact 64-dimensional temporal embedding (temb), followed by Layer Normalization, GELU activation, and a Dropout layer (0.5) to stabilize training.

Through the network, the data is activated using the GELU [51] activation function [21], which is approximated with the following:

G E L U (x) = x σ (1.702 x)

(5)

The projected features are processed through residual blocks, each consisting of two stacked Graph Attention (GAT) layers. Every GAT layer learns attention weights to dynamically highlight important inter-electrode interactions. The first GAT layer in the initial block takes the temb = 64 output from the projection block as its input and produces a hidden dimension of size 128, whereas the remaining GAT layers use and output this same hidden dimension. The first layer of each block employs two attention heads and is followed by GELU activation, whereas the second layer uses one attention head and is followed by Batch Normalization and GELU to stabilize and accelerate training, and dropout (0.2) to prevent overfitting. A residual connection after the first GAT block maintains earlier representations, enabling subsequent layers to refine rather than overwrite spatial information. The model also captures attention weights for interpretability purposes.

The core of the spatial feature learning is a two-layer GATv2Conv architecture, similar to previous works [31,39], which dynamically models complex functional relationships among EEG electrodes, where the first GATv2Conv layer employs multi-head attention, allowing each node to select salient connections among its neighbors. GATv2 [52] could be defined as follows, considering Equation (2) as input

n_{C} \in R^{F}

, with F being the feature dimension of each channel and C being the number of channels (nodes). e is a scoring function that represents attention coefficients, and

α

is the normalized attention weight across neighbor nodes using Softmax:

e_{i j} = L e a k y R e L U (a^{T} \cdot [W n_{i} ∥ W n_{j}])

(6)

α_{i j} = {S o f t m a x}_{j} (e_{i j}) = \frac{\exp (e_{i j})}{\sum_{k \in C_{i}} \exp (e_{i k})}

(7)

where LeakyReLU is the nonlinear activation function, W

\in R^{F \times F^{'}}

is the learned weight matrix,

F^{'}

the feature dimension of the output nodes,

∥

denotes the concatenation, and

a^{T}

the transpose of the attention weight vector. After the multi-head attention module, the attention coefficients are combined, resulting in the final representation of the transformed features using the normalized attention weights.

n_{i}^{'} = σ (\sum_{j \in C_{i}} α_{i j} \cdot W n_{j})

(8)

This process learns adaptive “importance weights” for each link, determining which inter-channel relationships are most relevant for emotion recognition. The set of context-aware node properties is then fed into a second GATv2Conv layer, which refines the representations to capture higher-order spatial dependencies. Moreover, the attention weights from these layers can be visualized. Before classification, a global mean pooling layer is employed to obtain a graph-level representation. This graph embedding is then passed to the classification head, a multi-layer perceptron that maps the high-level representation to the final emotion classes. The classification head comprises three linear layers that progressively reduce dimensionality by halving the hidden dimension, which is equivalent for the first layer to (128, 64), for the second (64, 32), and the third linear layer takes the 32 and projects it to the output logits corresponding to the number of classes, thereby extracting the most discriminative features for the task. A GELU activation function and dropout (with rates of 0.5 and 0.3) are applied after each linear layer to provide regularization and prevent overfitting, and a normalization layer precedes the activation function in the first layer to ensure stable and efficient training.

Our architecture presents a distinct approach to spatiotemporal feature learning in EEG emotion recognition. First, for temporal modeling, we depart from the common reliance on recurrent networks or standard transformer encoders by employing a stacked 1D convolutional backbone, which provides a more parameter-efficient approach to capturing local-to-global temporal dynamics and serves as a powerful multi-scale feature extractor, eliminating the need for a separate sequential modeling step. Second, we propose a specialized, residually stacked Graph Attention Network (GAT) block that enhances conventional GATs by incorporating residual connections across GAT blocks with stacked GAT layers, thereby mitigating over-smoothing issues and facilitating deeper graph propagation without loss of information.

3.4. Interpretability

We used Integrated Gradients (IG) [53] as a feature-attribution method to quantify each channel’s contribution to the model’s class scores, where the integral of integrated gradients can be approximated by summing gradients at small intervals along the path from the baseline. We also exploited the model’s self-attention weights (from GAT blocks) to obtain edge-level importances between channels and constructed a channel–channel importance matrix by aggregating edge scores within each graph. For visualization, saliency maps were generated using IG and projected onto 2D scalp layouts (topomaps), and the learned attention matrices were analyzed to identify the strongest pairwise connections between channels (the top-k edge motifs) and the channels with the highest aggregated attention weights.

At evaluation, for each batch per subject, we computed IG with a zero baseline, and for each class, we integrated the gradient of the class logit with respect to the input along the straight path from baseline to the sample, resulting in a vector of per-channel attributions, which is then averaged across samples and across folds to obtain stable per-class IG maps. In parallel, we extracted attention weights from the last GAT block, normalized scores per graph, and accumulated them into an edge-importance matrix, from which we computed node connectivity via row and column sums and listed the top-k edges. To visualize spatial structure, we mapped channel names to a standard montage and plotted per-class IG topomaps with MNE, ensuring a shared color scale across classes.

3.5. Training Setup and Metrics

PyTorch 2.3.1 and PyTorch Geometric 2.6.1 were used to implement the IRSTGANet model on an NVIDIA A6000 GPU, utilizing the Paperspace platform [54]. The model was trained for 100 epochs using a learning rate of 0.001, a batch size of 64, and the Adam optimizer. Additionally, to ensure robust training and control overfitting, we employed a regularization strategy that included dropout, batch normalization, L2 weight decay, and early stopping; further hyperparameters are detailed in Table 1.

When segmenting the data, it was organized by (subject, trial) instead of individual windows to prevent the same trial from being included in both training and testing sets. During training, we split each subject into five segments and shuffle through them five times so that each time one segment is the test set, and the accuracy of the subject is the average of the accuracies across the five folds, which is known as 5-fold Cross Validation (5-fold CV).

The metrics used to evaluate the model are as follows:

P r e c i s i o n = \frac{T P}{T P + F P}

(9)

R e c a l l = \frac{T P}{T P + F N}

(10)

F 1 - s c o r e = 2 \frac{P r e c i s i o n * R e c a l l}{P r e c i s i o n + R e c a l l}

(11)

A c c u r a c y = \frac{T P + T N}{T P + F P + T N + F N}

(12)

with TP true positives, TN true negatives, FN false negatives, and FP false positives. In addition to confusion matrices to visualize performance across the four and nine classes, accuracy is used as the primary metric.

4. Results

The evaluation metrics used to assess the performance of the proposed IRSTGANet model included accuracy, precision, recall, and F1-score for both valence and arousal, as well as for the four labels, the nine labels of the DEAP dataset, and the three emotion classes of the SEED dataset.

Figure 2 shows the results of the binary evaluation on the DEAP dataset, with Figure 2a illustrating that the model achieved an average accuracy/F1-score of 96.28%/96.31% and 96.82%/97.03% for valence and arousal, respectively. Arousal slightly outperforms valence. Figure 2b presents the detailed accuracy for each subject, showing that subject #15 achieved the best accuracy, with percentages of 99.49% and 97.92% for valence and arousal, respectively. The lowest accuracy was achieved by subject #4, reaching 91.49% and 94.19% for valence and arousal, respectively. The same note on arousal outperforming valence is emphasized in this figure.

For the multiclassification, IRSTGANet performed exceedingly well, as seen in Figure 3, for both the four classes (quadrant) in Figure 3a, and nine classes (nonant) in Figure 3b, with an average accuracy of 93.76% and 85.15%, respectively. The first showed consistency, with no subject falling below 80% and the best reaching 98.43%, while the second showed a slight struggle, with subjects #3, #14, #19, and #30 achieving the lowest accuracy of 63.94% and the best reaching 96.38%; however, most of the subjects were over 80%. The model maintained consistently strong performance across complementary metrics, with an average precision, recall, and F1 of 95.2%, 91.35%, and 92.06%, respectively, for the quadrant classification, and of 90.77%, 85.65%, and 86.38%, respectively, for the nonant emotion recognition.

While the average accuracy provides an idea of the model’s overall performance, the confusion matrix offers a more nuanced view of each class’s performance, as shown in Figure 4. The accuracy of the four classes in Figure 4a is overall balanced, with a slight 1% improvement in each class moving from sad to happy. Notably, sad has the least accuracy at 92%, while happy has the best accuracy at 95%. The emotion recognition for the nine classes in Figure 4b shows a small struggle in recognizing the neutral, nervous, and happy labels; also, 9% of the depressed labels across the dataset are falsely denoted as excited.

For the SEED dataset, as shown in Figure 5, the model performed exceptionally well on subjects #3, #11, #12, #14, and #15, with a performance of 100% across all metrics, and subject #6, which was slightly behind with a performance of 98% for all metrics. Except for the struggle with subjects #10 and #13, the model showed a stable performance for the other six subjects. Figure 5a provides a glimpse of the overall performance in Figure 5b, where the model achieved an average accuracy of 94.04% and the lowest accuracy for subject #13, with a value of 77.33%.

For greater interpretability, and because many of the main connections were self-loops, these were omitted, leaving only the strong connections from one channel to another. Figure 6 summarizes the top-ranked attention edges extracted from the deepest GAT block of the proposed model. Figure 6b shows that the highest normalized edge weights correspond primarily to within-hemisphere fronto–parietal connections, notably F7 → P7 on the left and FC4 → P2 on the right. These links appeared repeatedly across attention heads and subjects, suggesting a strong and consistent emphasis on model information flow from anterior to posterior within each hemisphere. Figure 6a presents the top attention-derived edge motifs with the strongest connections, mainly linking frontal and parietal regions within the same hemisphere, such as FC2 → TP8, F2 → P8, T7 → P5, and F7 → P7. Also, several central–parietal and fronto–occipital links Cz → P1, FC4 → P2, C2 → PO7, CP3 → PO4, FT8 → Oz were identified, while the additional motifs, including FC6 → AF4, AF4 → Fp2, and AF7 → TP10, suggest the involvement of right fronto–central and anterior regions in emotional information processing.

Figure 7 presents a summary of the IG attribution scores across the three emotion classes. A consistent, strong right frontal–temporal focus demonstrated the highest contribution across all classes, highlighting the significant role of this region in emotion discrimination, particularly for the negative class. A moderate contribution from the frontal central region is observed for the positive class. The neutral class displayed a comparable pattern, highlighting the midline electrodes, which indicates increased engagement of central regions and a moderate contribution from the right parietal region.

Figure 8 shows the most significant channels, with the highest values being FC6, FC, and FT8, followed by F2, FCz, Cz, F6, F4, and CP3, then C2, FC2, Fz, AF3, C4, and CP1, emphasizing the previous results on the involvement of the frontal and central regions.

The experimental findings demonstrate that combining temporal convolutions with residual graph attention mechanisms significantly enhances the model’s ability to capture both spatial and temporal information, and the attention maps indicate that node connectivity is physically explained by recognized emotion-related brain regions.

5. Discussion

Performance is reported in the Section 4 as the average accuracy across all subjects. Due to a limitation in the evaluation pipeline, the within-subject variance across cross-validation folds was not retained for analysis; therefore, the standard deviation (SD) shown in this section reflects inter-subject variability, calculated from the spread of individual subjects’ mean accuracies. Future work will include comprehensive reporting of both within- and across-subject variance metrics to fully characterize model stability.

The proposed IRSTGANet model achieved exceptional results, with an average accuracy of 96.28% ± 1.82% and 96.82% ± 1.92% for valence and arousal, respectively, 93.76% ± 3.96% for the four classes LVLA, LVHA, HVLA, and HVHA (Sad, Angry, Relaxed, and Happy), 85.15% ± 8.37% for the nine classes LALV, MALV, HALV, LAMV, MAMV, HAMV, LAHV, MAHV, and HAHV (Depressed, Bored, Angry, Sad, Neutral, Nervous, Relaxed, Happy, and Excited), and 94.04% ± 6.70%for the three classes negative, neutral, and positive.

The results demonstrate that utilizing both a temporal feature extractor and residual graph attention layers effectively captures both short-term neural changes and long-range spatial relationships, with not only a high stable accuracy across datasets, experiments, and emotion labels, but also high precision and recall values showing the sensitivity of the model in emotion recognition, all while delivering a balanced F1 score.

Additionally, the model accurately recognized multiclass labels, a challenging task, demonstrating its ability to capture the complex spatial dependencies among emotions.

The model’s architecture showed, from the previous results, a multi-scale emotion-processing capability, with the first level looking for specific time patterns in individual electrodes, the second level looking at how brain regions in the same area interact with each other, and the third level looking at how all the brain regions interact with each other in a complicated way. This is further emphasized by the interpretability results, which identify important nodes at the first level, node-to-node importance at the second level, and the region responsible for an emotion at the third level.

IRSTGANet is designed for efficiency, with 696,546 trainable parameters, and requires 0.15 GFLOPs (150 million operations) per forward pass on an NVIDIA RTX A6000, yielding an inference latency of 11.216 ms and a peak memory usage of 333.8 MB. While the model’s low computational footprint, 0.15 GFLOPs, makes it a good candidate for deployment in systems with moderate processing capabilities, whether it meets the latency requirements of a specific real-time, closed-loop system depends on the target hardware and optimization; thus, our current results demonstrate that the core architecture, confined in a Proof of Concept, is not prohibitively complex and provides a foundation that can be further optimized for edge deployment.

5.1. Interpretability

Notably, the spatial distribution highlighted by the node analysis complements the connectivity patterns observed in the attention-based edge motifs, as the same right fronto–central and midline regions (e.g., FC6, FC4, FT8, FCz, Cz) that emerged as important in the channel importance also appear as major sources in the connectivity importance, such as the pathways from FC4 to P2 and from F7 to P7. This convergence between two independent interpretability methods strengthens the evidence that the model consistently relies on functionally meaningful fronto–parietal interactions during emotion recognition. In other words, both node-level attributions and graph-level attention emphasize the same brain regions, reinforcing the biological plausibility of the learned representations. Additionally, the omission of self-loop attention suggests that the network also retains node-specific information within key channels. Overall, the attention distribution suggests a within-hemisphere, structured pattern of recurrent fronto–parietal interactions and central-node reinforcement, rather than diffuse or random connectivity.

The neurobiological foundation of emotion recognition can be understood through a network of brain systems responsible for the evaluation, integration, and regulation of emotion. The medial prefrontal cortex has different roles, with the dorsal/posterior sectors handling initial emotional appraisal and expression, and the ventral/anterior regions regulating and extinguishing these responses, forming a hierarchical control mechanism. The lateral prefrontal cortex integrates cognitive and affective signals to facilitate goal-directed behavior, and this process is enhanced by subcortical structures like the basal forebrain, which aligns cortical activity with emotional and motivational states [55]. Additionally, shared neural representation, illustrated by vicarious neuronal activity, means that the brain activates similar regions just by observing someone else, creating an evoked experience for understanding, empathy, and social learning in areas like the anterior cingulate cortex, insula, and amygdala during emotional experiences, which supports the concept of emotional contagion, especially for negative emotions [56].

In light of this foundational neurobiological framework, our model’s interpretability is validated, as it has not only achieved high accuracy but also clarified the functional architecture of emotion in the EEG, as the learned graph is not random, but rather an emergent representation of the same integration connections (lateral prefrontal cortex), control hierarchies (medial prefrontal cortex midline), and broadcast pathways that neuroscience has identified as the core systems for affective processing. Our contribution demonstrates that this architecture can be learned in its entirety from raw signals and that the particular design choices and features of our structure (right-lateralized fronto–central pathways) possess the discriminative capability for advanced emotion recognition.

5.2. Ablation Studies

We conducted ablation studies to further examine the effects of the main components of our model, beginning by excluding the temporal feature extractor (TFE) from IRSTGANet, followed by the stacked residual GAT blocks (SRGAT). Removing the temporal components isolates spatial graph attention contributions, and removing the GAT blocks while using a simple GCN allows us to quantify the gain attributable to the TFE block, with both scenarios being tested using two different initial graph types, one fully connected (fc) and the other based on 10–20 connectivity.

As presented in Table 2, when excluding the SRGAT block, the performance dropped by approximately ≈9%, as the valence and arousal accuracy dropped from 96.28% to 86.87% and from 96.82% to 87.77%. When removing the TFE block, the decrease from 96.28% to 94.90% indicates a 1.38% decrease in valence and a 1.48% decrease in arousal. It is also essential to note that performance without SRGAT is less robust, as indicated by the larger standard deviation. The effect of fc or 10–20, when the SRGAT is omitted, is insignificant, while with SRGAT, a slight effect of ≈0.5% could be seen, which suggests that adjacency matrix initialization has an effect, even though the connections are learned during training. However, the slight outperformance of fc for valence and 10–20 for arousal may give way to potential further research on task-specific analysis.

These ablation results demonstrate that the full IRSTGANet architecture achieves its superior performance through the integrated contribution of all its components. The critical role of the SRGAT block is especially evident, confirming that its dynamic modeling of inter-channel spatial relations is essential for accurately characterizing emotional brain activity.

5.3. Comparison with the State of the Art

Relative to previous graph-based EEG models, IRSTGANet was compared for the DEAP dataset with AT-DGNN-Gen [37], LTS-GAT [25], P-GCNN [11], VSGT [36], GARG [9], ResGAT [40], CNN + GA Features [8], RSTAGNN [16], CADD-DCCNN [57], Mul-AT-RGCN [26], MSL-TGNN [39], ESN [45], GLFANet [14], and DGAT [58], as shown in Table 3.

Our model, for the binary classification of valence and arousal, outperformed all other models by a significant margin, surpassing the best GLFANet by 1.83% in terms of average accuracy. For the four classes, IRSTGANet outperformed ESN(4-classes) by 10.16% and GLFANet (4-classes) by 0.84%, while still surpassing most of the binary classifications. For the nine classes, our model outperformed the ESN(8-classes) by 6.45% while falling behind only eight of the sixteen presented models, thus surpassing half of the SOTA.

IRSTGANet was compared with the following models on the SEED dataset: P-GCNN [11], SOGNN [19], CADD-DCCNN [57], STGATE [31], GLFANet [14], STAFNet [30], and DGAT [58]. The results are presented in Table 4. Our model surpassed P-GCNN by 9.69%, SOGNN by 7.23%, CADD-DCCNN by 6.63%, STGATE by 3.67%, GLFANet by 0.85%, and STAFNet by a margin of 0.26%, thus surpassing all the SOTA in terms of accuracy.

To summarize, our approach offers three main advantages: enhanced temporal feature extraction through convolutional encoding, stable multi-layer spatial reasoning via residual attention, and interpretable measures of electrode-level, electrode-neighboring-level, and region-level importance.

Several limitations must be considered, including the window size, which is often used in segmentation but has not yet been shown to be optimal, as each task has its own specifics. It is also important to note that a key consideration in interpreting results is the nature of the emotional labels derived from the DEAP and SEED datasets, which use a stimulus-elicitation in which participants self-report a rating of ‘felt’ emotion in response to emotional videos, where the EEG is presumed to capture the evoked emotional experience of the participant. However, it mixes sensory decoding, cognitive appraisal, and the generation of felt emotion, making it challenging to isolate pure internal feelings from perception. Thus, while the model is effective in this context, its applicability to context-free emotional experiences requires further validation, as the findings are most relevant to fields such as neuromarketing, media studies, and specific biofeedback therapies, where emotional stimuli are defined, and engagement is measured, and less validated for emotional states arising purely from internal thought without an external, known trigger such as diagnosing mood disorders or interpreting everyday social interactions. Moreover, the real-world applicability remains a consistent limitation.

Despite the model’s success, there remains room for improvement; the first step is to test its generalization to unseen subjects or recording setups [24,25,59] and to incorporate other data modalities, such as fNIRS and peripheral physiological signals [60,61,62], which could enhance robustness. Second, exploring data augmentation to address limited dataset sizes [28,63], simplifying the model and input [20], and reducing dimensionality [64] could help mitigate the graph structure’s high complexity.

Author Contributions

Software, Formal Analysis, Resources, Data Curation, Writing—Original Draft Preparation, M.H.; Methodology, M.H. and A.E.B.; Review and Editing, M.H., A.E., S.B.A. and A.E.B.; Supervision, A.E. and S.B.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. They are available at: DEAP: http://www.eecs.qmul.ac.uk/mmv/datasets/deap/download.html (accessed on 25 September 2025), SEED: https://zenodo.org/records/546113, (accessed on 25 September 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

Wierciński, T.; Rock, M.; Zwierzycki, R.; Zawadzka, T.; Zawadzki, M. Emotion Recognition from Physiological Channels Using Graph Neural Network. Sensors 2022, 22, 2980. [Google Scholar] [CrossRef]
Mohammadi, H.; Karwowski, W. Graph Neural Networks in Brain Connectivity Studies: Methods, Challenges, and Future Directions. Brain Sci. 2025, 15, 17. [Google Scholar] [CrossRef]
Fan, Z.; Chen, F.; Xia, X.; Liu, Y. EEG Emotion Classification Based on Graph Convolutional Network. Appl. Sci. 2024, 14, 726. [Google Scholar] [CrossRef]
Fernandes, J.V.M.R.; de Alexandria, A.R.; Marques, J.A.L.; de Assis, D.F.; Motta, P.C.; Silva, B.R.d.S. Emotion Detection from EEG Signals Using Machine Deep Learning Models. Bioengineering 2024, 11, 782. [Google Scholar] [CrossRef]
Zhang, G.; Yu, M.; Liu, Y.-J.; Zhao, G.; Zhang, D.; Zheng, W. SparseDGCNN: Recognizing Emotion From Multichannel EEG Signals. IEEE Trans. Affect. Comput. 2023, 14, 537–548. [Google Scholar] [CrossRef]
Nandi, A.; Xhafa, F.; Subirats, L.; Fort, S. Real-Time Emotion Classification Using EEG Data Stream in E-Learning Contexts. Sensors 2021, 21, 1589. [Google Scholar] [CrossRef]
Reis, S.; Pinto-Coelho, L.; Sousa, M.; Neto, M.; Silva, M. Advancing Emotion Recognition: EEG Analysis and Machine Learning for Biomedical Human–Machine Interaction. BioMedInformatics 2025, 5, 5. [Google Scholar] [CrossRef]
Xefteris, V.-R.; Tsanousa, A.; Georgakopoulou, N.; Diplaris, S.; Vrochidis, S.; Kompatsiaris, I. Graph Theoretical Analysis of EEG Functional Connectivity Patterns and Fusion with Physiological Signals for Emotion Recognition. Sensors 2022, 22, 8198. [Google Scholar] [CrossRef]
Perry Fordson, H.; Xing, X.; Guo, K.; Xu, X. Emotion Recognition With Knowledge Graph Based on Electrodermal Activity. Front. Neurosci. 2022, 16, 911767. [Google Scholar] [CrossRef]
Schirrmeister, R.T.; Springenberg, J.T.; Fiederer, L.D.J.; Glasstetter, M.; Eggensperger, K.; Tangermann, M.; Hutter, F.; Burgard, W.; Ball, T. Deep Learning with Convolutional Neural Networks for EEG Decoding and Visualization. Hum. Brain Mapp. 2017, 38, 5391–5420. [Google Scholar] [CrossRef]
Wang, Z.; Tong, Y.; Heng, X. Phase-Locking Value Based Graph Convolutional Neural Networks for Emotion Recognition. IEEE Access 2019, 7, 93711–93722. [Google Scholar] [CrossRef]
Yang, Y.; Zhao, H.; Hao, Z.; Shi, C.; Zhou, L.; Yao, X. Recognition of Brain Activities via Graph-Based Long Short-Term Memory-Convolutional Neural Network. Front. Neurosci. 2025, 19, 1546559. [Google Scholar] [CrossRef]
Jiang, C.; Dai, Y.; Ding, Y.; Chen, X.; Li, Y.; Tang, Y. TSANN-TG: Temporal–Spatial Attention Neural Networks with Task-Specific Graph for EEG Emotion Recognition. Brain Sci. 2024, 14, 516. [Google Scholar] [CrossRef] [PubMed]
Liu, S.; Zhao, Y.; An, Y.; Zhao, J.; Wang, S.-H.; Yan, J. GLFANet: A Global to Local Feature Aggregation Network for EEG Emotion Recognition. Biomed. Signal Process. Control 2023, 85, 104799. [Google Scholar] [CrossRef]
Xue, Q.; Song, Y.; Wu, H.; Cheng, Y.; Pan, H. Graph Neural Network Based on Brain Inspired Forward-Forward Mechanism for Motor Imagery Classification in Brain-Computer Interfaces. Front. Neurosci. 2024, 18, 1309594. [Google Scholar] [CrossRef]
Devi, C.A.; Renuka, D.K. EEG Based Emotion Analysis Using Reinforced Spatio-Temporal Attentive Graph Neural and Context Net Techniques. J. Autom. Mob. Robot. Intell. Syst. 2024, 18, 61–68. [Google Scholar] [CrossRef]
Li, M.; Qiu, M.; Kong, W.; Zhu, L.; Ding, Y. Fusion Graph Representation of EEG for Emotion Recognition. Sensors 2023, 23, 1404. [Google Scholar] [CrossRef]
Deng, X.; Xu, D.; Huo, H.; Hong, X.; Liu, B. Multi-Level Hierarchical Dynamic Graph Convolutional Networks for Motor Imagery EEG Analysis. Neurocomputing 2025, 626, 129594. [Google Scholar] [CrossRef]
Li, J.; Li, S.; Pan, J.; Wang, F. Cross-Subject EEG Emotion Recognition With Self-Organized Graph Neural Network. Front. Neurosci. 2021, 15, 611653. [Google Scholar] [CrossRef]
Zhu, X.; Liu, G.; Zhao, L.; Rong, W.; Sun, J.; Liu, R. Emotion Classification from Multi-Band Electroencephalogram Data Using Dynamic Simplifying Graph Convolutional Network and Channel Style Recalibration Module. Sensors 2023, 23, 1917. [Google Scholar] [CrossRef]
Chen, W.; Liao, Y.; Dai, R.; Dong, Y.; Huang, L. EEG-Based Emotion Recognition Using Graph Convolutional Neural Network with Dual Attention Mechanism. Front. Comput. Neurosci. 2024, 18, 1416494. [Google Scholar] [CrossRef]
Ding, Y.; Tong, C.; Zhang, S.; Jiang, M.; Li, Y.; Lim, K.J.; Guan, C. EmT: A Novel Transformer for Generalized Cross-Subject EEG Emotion Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 10381–10393. [Google Scholar] [CrossRef]
Guo, Y.; Zhang, B.; Fan, X.; Shen, X.; Peng, X. A Comprehensive Interaction in Multiscale Multichannel EEG Signals for Emotion Recognition. Mathematics 2024, 12, 1180. [Google Scholar] [CrossRef]
Ye, W.; Wang, J.; Chen, L.; Dai, L.; Sun, Z.; Liang, Z. Adaptive Spatial–Temporal Aware Graph Learning for EEG-Based Emotion Recognition. Cyborg Bionic Syst. 2024, 5, 0088. [Google Scholar] [CrossRef]
Zhu, Y.; Gan, K.; Yin, Z. Locally Temporal-Spatial Pattern Learning with Graph Attention Mechanism for EEG-Based Emotion Recognition. arXiv 2022, arXiv:2208.11087. [Google Scholar]
Chen, J.; Liu, Y.; Xue, W.; Hu, K.; Lin, W. Multimodal EEG Emotion Recognition Based on the Attention Recurrent Graph Convolutional Network. Information 2022, 13, 550. [Google Scholar] [CrossRef]
Gao, P.; Zheng, X.; Wang, T.; Zhang, Y. Graph Convolutional Neural Network Based Emotion Recognition with Brain Functional Connectivity Network. Int. J. Crowd Sci. 2024, 8, 195–204. [Google Scholar] [CrossRef]
Gilakjani, S.S.; Osman, H.A. A Graph Neural Network for EEG-Based Emotion Recognition With Contrastive Learning and Generative Adversarial Neural Network Data Augmentation. IEEE Access 2024, 12, 113–130. [Google Scholar] [CrossRef]
Cui, Y.; Liu, X.; Liang, J.; Fu, Y. MVGT: A Multi-View Graph Transformer Based on Spatial Relations for EEG Emotion Recognition. In Proceedings of the 32nd International Conference, ICONIP 2025, Okinawa, Japan, 20–24 November 2025. [Google Scholar]
Hu, F.; He, K.; Qian, M.; Liu, X.; Qiao, Z.; Zhang, L.; Xiong, J. STAFNet: An Adaptive Multi-Feature Learning Network via Spatiotemporal Fusion for EEG-Based Emotion Recognition. Front. Neurosci. 2024, 18, 1519970. [Google Scholar] [CrossRef] [PubMed]
Li, J.; Pan, W.; Huang, H.; Pan, J.; Wang, F. STGATE: Spatial-Temporal Graph Attention Network with a Transformer Encoder for EEG-Based Emotion Recognition. Front. Hum. Neurosci. 2023, 17, 1169949. [Google Scholar] [CrossRef]
Zhang, S.; Chu, C.; Zhang, X.; Zhang, X. EEG Emotion Recognition Using AttGraph: A Multi-Dimensional Attention-Based Dynamic Graph Convolutional Network. Brain Sci. 2025, 15, 615. [Google Scholar] [CrossRef] [PubMed]
Chen, Y.; Xu, X.; Bian, X.; Qin, X. EEG Emotion Recognition Based on Ordinary Differential Equation Graph Convolutional Networks and Dynamic Time Wrapping. Appl. Soft Comput. 2024, 152, 111181. [Google Scholar] [CrossRef]
Wang, L.; Wang, S.; Jin, B.; Wei, X. GC-STCL: A Granger Causality-Based Spatial–Temporal Contrastive Learning Framework for EEG Emotion Recognition. Entropy 2024, 26, 540. [Google Scholar] [CrossRef]
Almohammadi, A.; Wang, Y.-K. Revealing Brain Connectivity: Graph Embeddings for EEG Representation Learning and Comparative Analysis of Structural and Functional Connectivity. Front. Neurosci. 2024, 17, 1288433. [Google Scholar] [CrossRef]
Liu, C.; Zhou, X.; Xiao, J.; Zhu, Z.; Zhai, L.; Jia, Z.; Liu, Y. VSGT: Variational Spatial and Gaussian Temporal Graph Models for EEG-Based Emotion Recognition. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence Main Track, Jeju, Republic of Korea, 3–9 August 2024; Volume 4, pp. 3078–3086. [Google Scholar]
Xiao, M.; Zhu, Z.; Xie, K.; Jiang, B. MEEG and AT-DGNN: Improving EEG Emotion Recognition with Music Introducing and Graph-Based Learning. In Proceedings of the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisbon, Portugal, 3–6 December 2024. [Google Scholar]
Li, Z.; Zhang, G.; Wang, L.; Wei, J.; Dang, J. Emotion Recognition Using Spatial-Temporal EEG Features through Convolutional Graph Attention Network. J. Neural Eng. 2023, 20, 016046. [Google Scholar] [CrossRef]
Wang, X.; Li, C.; Liu, Y.; Liang, W.; Li, K.; Poniszewska-Maranda, A. A Spatio-Temporal Graph Neural Network for EEG Emotion Recognition Based on Regional and Global Brain. Comput. Sci. Inf. Syst. 2025, 22, 971–989. [Google Scholar] [CrossRef]
Chao, H.; Cao, Y.; Liu, Y. Corrigendum: Multi-Channel EEG Emotion Recognition Using Residual Graph Attention Neural Network. Front. Neurosci. 2023, 17, 1283644. [Google Scholar] [CrossRef] [PubMed]
Zhao, S.; Dai, G.; Li, J.; Zhu, X.; Huang, X.; Li, Y.; Tan, M.; Wang, L.; Fang, P.; Chen, X.; et al. An Interpretable Model Based on Graph Learning for Diagnosis of Parkinson’s Disease with Voice-Related EEG. npj Digit. Med. 2024, 7, 3. [Google Scholar] [CrossRef] [PubMed]
Messaoud, R.B.; Du, V.L.; Bousfiha, C.; Corsi, M.-C.; Gonzalez-Astudillo, J.; Kaufmann, B.C.; Venot, T.; Couvy-Duchesne, B.; Migliaccio, R.; Rosso, C.; et al. Correction: Low-Dimensional Controllability of Brain Networks. PLoS Comput. Biol. 2025, 21, e1013601. [Google Scholar] [CrossRef]
Gong, X.; Chen, Y.; Zhang, P.; Peng, Y.; Fang, J.; Cichocki, A. CoAdapt: Collaborative Adaptation Between Latent EEG Feature Representation and Annotation for Emotion Decoding. IEEE Trans. Instrum. Meas. 2025, 74, 2541116. [Google Scholar] [CrossRef]
Zhang, G.; Etemad, A. Partial Label Learning for Emotion Recognition From EEG. IEEE Trans. Affect. Comput. 2025, 16, 2381–2395. [Google Scholar] [CrossRef]
Liu, Y.; Liang, R.; Xu, S.; Guo, X. Structural Investigations of Multi-Reservoir Echo State Networks for EEG-Based Emotion Classification. Neurocomputing 2025, 632, 129856. [Google Scholar] [CrossRef]
Scherer, K.R. What Are Emotions? And How Can They Be Measured? Soc. Sci. Inf. 2005, 44, 695–729. [Google Scholar] [CrossRef]
Koelstra, S.; Muhl, C.; Soleymani, M.; Lee, J.-S.; Yazdani, A.; Ebrahimi, T.; Pun, T.; Nijholt, A.; Patras, I. DEAP: A Database for Emotion Analysis; Using Physiological Signals. IEEE Trans. Affect. Comput. 2012, 3, 18–31. [Google Scholar] [CrossRef]
Zheng, W.-L.; Lu, B.-L. Investigating Critical Frequency Bands and Channels for EEG-Based Emotion Recognition with Deep Neural Networks. IEEE Trans. Auton. Ment. Dev. 2015, 7, 162–175. [Google Scholar] [CrossRef]
Duan, R.-N.; Zhu, J.-Y.; Lu, B.-L. Differential Entropy Feature for EEG-Based Emotion Classification. In Proceedings of the 2013 6th International IEEE/EMBS Conference on Neural Engineering (NER), San Diego, CA, USA, 6–8 November 2013; pp. 81–84. [Google Scholar]
Song, Y.; Zheng, Q.; Liu, B.; Gao, X. EEG Conformer: Convolutional Transformer for EEG Decoding and Visualization. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 31, 710–719. [Google Scholar] [CrossRef]
Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2023, arXiv:1606.08415. [Google Scholar]
Brody, S.; Alon, U.; Yahav, E. How Attentive Are Graph Attention Networks? arXiv 2022, arXiv:2105.14491. [Google Scholar] [CrossRef]
Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning—Volume 70, Sydney, NSW, Australia, 6–11 August 2017; pp. 3319–3328. [Google Scholar]
NVIDIA H100 for AI & ML Workloads|Cloud GPU Platform|Paperspace. Available online: https://www.paperspace.com/ (accessed on 19 May 2025).
The Emotional Brain. In Conn’s Translational Neuroscience; Academic Press: Cambridge, MA, USA, 2017; pp. 635–656.
Antonelli, F.; Bernardi, F.; Koul, A.; Novembre, G.; Papaleo, F. Emotions in Multi-Brain Dynamics: A Promising Research Frontier. Neurosci. Biobehav. Rev. 2025, 168, 105965. [Google Scholar] [CrossRef] [PubMed]
Li, C.; Bian, N.; Zhao, Z.; Wang, H.; Schuller, B.W. Multi-View Domain-Adaptive Representation Learning for EEG-Based Emotion Recognition. Inf. Fusion 2024, 104, 102156. [Google Scholar] [CrossRef]
Ding, S.; Wang, K.; Jiang, W.; Xu, C.; Bo, H.; Ma, L.; Li, H. DGAT: A Dynamic Graph Attention Neural Network Framework for EEG Emotion Recognition. Front. Psychiatry 2025, 16, 1633860. [Google Scholar] [CrossRef]
Ahuja, C.; Sethia, D. SS-EMERGE—Self-Supervised Enhancement for Multidimension Emotion Recognition Using GNNs for EEG. Sci. Rep. 2025, 15, 14254. [Google Scholar] [CrossRef] [PubMed]
Huang, W.; Chen, Y.; Jiang, X.; Zhang, T.; Chen, Q. GJFusion: A Channel-Level Correlation Construction Method for Multimodal Physiological Signal Fusion. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 20, 60. [Google Scholar] [CrossRef]
Qu, J.; Cui, L.; Guo, W.; Bu, L.; Wang, Z. Development of a Novel Machine Learning-Based Approach for Brain Function Assessment and Integrated Software Solution. Adv. Eng. Inform. 2024, 60, 102461. [Google Scholar] [CrossRef]
Bai, X.; Tan, J.; Hu, H.; Zhang, C.; Gu, D. Multi-Modal Physiological Signal Fusion for Emotion Classification: A Multi-Head Attention Approach. J. Phys. Conf. Ser. 2023, 2637, 012047. [Google Scholar] [CrossRef]
Mohajelin, F.; Sheykhivand, S.; Shabani, A.; Danishvar, M.; Danishvar, S.; Lahijan, L.Z. Automatic Recognition of Multiple Emotional Classes from EEG Signals through the Use of Graph Theory and Convolutional Neural Networks. Sensors 2024, 24, 5883. [Google Scholar] [CrossRef]
Asghar, M.A.; Khan, M.J.; Fawad, X.; Amin, Y.; Rizwan, M.; Rahman, M.; Badnava, S.; Mirjavadi, S.S. EEG-Based Multi-Modal Emotion Recognition Using Bag of Deep Features: An Optimal Feature Selection Approach. Sensors 2019, 19, 5218. [Google Scholar] [CrossRef]

Figure 1. The architecture of the IRSTGANet model (a) preprocessing, (b) temporal feature extractor, (c) projection block, (d) stacked residual GAT blocks with N = 2, global aggregation, (e) classification head, and (f) the four emotion classes used.

Figure 2. The DEAP dataset binary evaluation using 5-fold CV: (a) the average accuracy and F1-score; (b) the detailed accuracy for each subject.

Figure 3. The DEAP dataset’s all metrics evaluation using 5-fold CV for each subject: (a) all metrics for four-class classification; (b) all metrics for nine-class classification.

Figure 4. The normalized average confusion matrix: (a) four-class classification; (b) nine-class classification.

Figure 5. The evaluation metrics for the SEED dataset using 5-fold CV: (a) accuracy for all subjects; (b) all metrics for each subject.

Figure 6. Visualization of the important edges for the SEED dataset: (a) the top 20 crucial edges; (b) the channel–channel importance for all channels.

Figure 7. Attention visualization for the three classes of the SEED dataset.

Figure 8. The top ten most essential channels.

Table 1. The IRSTGANet Parameters.

Parameter	Value
Epochs	100
Learning rate	0.001
Weigh decay	0.0001
Batch size	64
Optimizer	Adam
Temb	64
Hidden dim	128
Number of gat block	2
Number of Att * heads	2
seed	42

* Attention.

Table 2. Ablation study results using IRSTGANet for the binary classification on DEAP, where w/o refers to without.

Model	Valence		Arousal
Model	Accuracy	F1	Accuracy	F1
w/o TFE w fc	94.90 ± 2.43	94.97 ± 2.73	95.34 ± 2.14	95.53 ± 2.01
w/o TFE w 10–20	94.19 ± 3.01	94.41 ± 3.12	95.80 ± 2.38	96.01 ± 2.21
w/o SRGAT w fc	86.87 ± 8.50	86.92 ± 9.94	87.77 ± 6.86	88.06 ± 6.86
w/o SRGAT w 10–20	86.75 ± 8.34	87.12 ± 8.90	87.77 ± 6.81	88.22 ± 6.70
IRSTGANet	96.28 ± 1.82	96.31 ± 2.00	96.82 ± 1.92	97.03 ± 1.80

Table 3. Comparison of the results on the DEAP dataset with the state of the art.

Model	Year	Methods	Valence (V)	Arousal (A)	Avg Accuracy
AT-DGNN-Gen	2024	Pre: minimal + five frequency bands TFE: kernel + MHA + convolution Spatial: DGNN	--	--	60.55
LTS-GAT	2022	Pre: minimal + frequency bands + DE TFE: attention+ LSTM Spatial: a GAT layer	63.39	66.74	65.06
P-GCNN	2019	Pre: minimal + DE, PSD, DASM, RASM, and DCAU TFE: GCNN Spatial: PLV connectivity	73.31	77.03	75.17
VSGT	2024	Pre: minimal TFE: Gaussian Temporal Encoder Spatial: Variational Spatial Encoder	74.15	78.95	76.55
ResGAT	2023	Pre: minimal + Time domain features TFE: ResNet module Spatial: GAT layer	87.06	89.26	88.16
CNN + GA Features	2022	Pre: minimal TFE: 12 statistical features Spatial: 9 graph connectivity mesures	88.27	90.84	89.55
RSTAGNN	2024	Pre: minimal + statistical, time domain, and entropy features TFE: global PCA + ContextNet Spatial: temporal attention GNN	95.30	87.50	91.4
CADD-DCCNN	2024	Pre: minimal + DE + frequency bands TFE: attention + dilated causal CNN Spatial: cross attention	90.97	92.42	91.69
Mul-AT-RGCN	2022	Pre: minimal TFE: Convolutional block attention Spatial: GCN + BiLSTM	93.19	91.82	92.5
MSL-TGNN	2025	Pre: minimal TFE: BiGRU Spatial: Global GAT	93.09	93.74	93.41
DGAT	2025	Pre: minimal + DE + frequency bands TFE: -- Spatial: GAT	93.55 ± 3.89	93.19 ± 3.65	93.37 ± 3.77
ESN	2025	Pre: minimal TFE: Echo State Network Spatial: Transformer +LSTM + CNN	93.9	94.2	94.05
ESN (4-classes)			-	-	83.6
ESN (8-classes)			-	-	78.7
GLFANet	2023	Pre: minimal + DE + frequency bands TFE: GCN + CNN Spatial: electrode positions based on 10–20 system	94.53 ± 1.02	94.91 ± 1.05	94.72 ± 1.03
GLFANet (4-classes)	2023		-	-	92.92
IRSTGANet (ours)		Pre: minimal TFE: stacked 1D convolutions Spatial: stacked residual GAT blocks	96.28 ± 1.82	96.82 ± 1.92	96.55 ± 1.87
IRSTGANet (4-classes)			-	-	93.76 ± 3.96
IRSTGANet (9-classes)			-	-	85.15 ± 8.37

- This field is omitted because it is irrelevant, as with multiclassification, V and A are irrelevant. -- this field is omitted because it was not given.

Table 4. Comparison of the results in % on the SEED dataset with the state of the art.

Model	Accuracy
P-GCNN [11]	84.35
SOGNN [19]	86.81
CADD-DCCNN [57]	87.41
STGATE [31]	90.37
GLFANet [14]	93.19 ± 1.30
STAFNet [30]	93.78
DGAT [58]	94.00 ± 2.6
IRSTGANet	94.04 ± 6.70

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hilali, M.; Ezzati, A.; Ben Alla, S.; El Badaoui, A. An Interpretable Residual Spatio-Temporal Graph Attention Network for Multiclass Emotion Recognition from EEG. Signals 2026, 7, 16. https://doi.org/10.3390/signals7010016

AMA Style

Hilali M, Ezzati A, Ben Alla S, El Badaoui A. An Interpretable Residual Spatio-Temporal Graph Attention Network for Multiclass Emotion Recognition from EEG. Signals. 2026; 7(1):16. https://doi.org/10.3390/signals7010016

Chicago/Turabian Style

Hilali, Manal, Abdellah Ezzati, Said Ben Alla, and Ahmed El Badaoui. 2026. "An Interpretable Residual Spatio-Temporal Graph Attention Network for Multiclass Emotion Recognition from EEG" Signals 7, no. 1: 16. https://doi.org/10.3390/signals7010016

APA Style

Hilali, M., Ezzati, A., Ben Alla, S., & El Badaoui, A. (2026). An Interpretable Residual Spatio-Temporal Graph Attention Network for Multiclass Emotion Recognition from EEG. Signals, 7(1), 16. https://doi.org/10.3390/signals7010016

Article Menu

An Interpretable Residual Spatio-Temporal Graph Attention Network for Multiclass Emotion Recognition from EEG

Abstract

1. Introduction

2. State of the Art

3. Materials and Methods

3.1. Datasets

3.1.1. DEAP Dataset

3.1.2. SEED Dataset

3.2. Preprocessing

3.3. Model Architecture

3.4. Interpretability

3.5. Training Setup and Metrics

4. Results

5. Discussion

5.1. Interpretability

5.2. Ablation Studies

5.3. Comparison with the State of the Art

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI