Emotion Classification Based on Transformer and CNN for EEG Spatial–Temporal Feature Learning

Objectives: The temporal and spatial information of electroencephalogram (EEG) signals is crucial for recognizing features in emotion classification models, but its use still relies heavily on manual feature extraction. The transformer model is capable of automatic feature extraction; however, its potential has not been fully explored in the classification of emotion-related EEG signals. To address these challenges, the present study proposes a novel model based on transformer and convolutional neural networks (TCNN) for EEG spatial–temporal (EEG ST) feature learning and automatic emotion classification. Methods: The proposed EEG ST-TCNN model utilizes positional encoding (PE) and multi-head attention to perceive channel positions and timing information in EEG signals. Two parallel transformer encoders in the model are used to extract spatial and temporal features from emotion-related EEG signals, and a CNN is used to aggregate the EEG's spatial and temporal features, which are subsequently classified using Softmax. Results: The proposed EEG ST-TCNN model achieved an accuracy of 96.67% on the SEED dataset and accuracies of 95.73%, 96.95%, and 96.34% for the arousal–valence, arousal, and valence dimensions, respectively, on the DEAP dataset. Conclusions: The results demonstrate the effectiveness of the proposed ST-TCNN model, which outperforms recent relevant studies in emotion classification. Significance: The proposed EEG ST-TCNN model has the potential to be used for EEG-based automatic emotion recognition.


Introduction
Emotion is a comprehensive representation of an individual's subjective experience and behavior, encompassing various feelings, behaviors, and thoughts. It significantly influences an individual's perception and attitude, and its identification has potential applications [1]. There are two primary types of methods used for emotion recognition [2]. One involves external responses evoked by emotions, such as facial expressions and gestures [3], while the other focuses on internal responses induced by emotions, such as electroencephalogram (EEG) and electrocardiogram signals, along with other physiological signals [4]. Compared with non-physiological signals, physiological signals are less easily controlled subjectively by individuals and are difficult to conceal [5]. Among the several physiological signals, EEG is an electrical signal generated by the central nervous system. Nunez et al. discovered that EEG signals exhibit certain associations with physiological events such as sleep patterns, neurological disorders, and emotional states [6]. Existing studies have investigated the recognition of individual emotional states using EEG [7], suggesting that EEG holds promise as a physiological signal for emotion recognition.
The current methods for emotion recognition based on EEG fall into two primary categories: traditional machine learning and deep learning. Bhardwaj et al. [8] classified seven different emotions based on EEG signals, preprocessed the data using filtering and an independent component analysis (ICA), and then extracted the energy and power spectral density (PSD) as features. The average classification accuracies achieved using a support vector machine (SVM) and linear discriminant analysis (LDA) were 74.13% and 66.50%, respectively. Wang et al. [9] utilized the minimum redundancy maximum relevance method to extract key EEG features for four types of emotions and compared its classification accuracy to that of k-nearest neighbors (KNN), multi-layer perceptron, and SVM, demonstrating that frequency domain features combined with SVM achieved an average accuracy of 66.51%. Xiao et al. [10] proposed a 4D attention neural network that converts the raw EEG into a four-dimensional spatial-spectral-temporal representation. Subsequently, a CNN was used to process both spectral and spatial information, while attention mechanisms were integrated into a bidirectional long short-term memory network (Bi-LSTM) for processing temporal information. The deep learning model achieved average accuracies of 96.90% and 97.39% in the valence and arousal dimensions, respectively, for the DEAP dataset [11]. Additionally, it achieved an average accuracy of 96.25% for the SEED dataset [12], encompassing three types of emotions: positive, neutral, and negative. Furthermore, it achieved an accuracy of 86.77% for the SEED-IV dataset [13], which includes four types of emotions: happy, sad, fear, and neutral.
An et al. [13] proposed an EEG emotion recognition algorithm based on 3D feature fusion and a convolutional autoencoder (CAE). First, the differential entropy (DE) features from various frequency bands of EEG were fused to construct 3D features of the EEG signals, which preserve the spatial information between channels. Then, the constructed 3D features were input into the CAE for emotion recognition. The deep learning model achieved an average accuracy of 89.49% for the valence dimension and 90.76% for the arousal dimension when evaluated on the DEAP dataset.
The aforementioned studies demonstrate the effectiveness of traditional machine learning and deep learning techniques in EEG-based emotion classification involving the extraction and selection of an optimal feature set from raw EEG data. However, the process of extracting features from raw EEG signals may result in the loss of valuable information, which the model then has no opportunity to learn. In this study, transformers are utilized to automatically extract the emotion-related spatial-temporal features from EEG data, aiming to mitigate the loss of valuable information. Additionally, a CNN is used to aggregate these extracted spatial-temporal features as a means to address the aforementioned challenges.
The transformer model, initially proposed by Vaswani et al. in 2017 [14], has demonstrated remarkable success in the fields of natural language processing [15] and computer vision [16]. Unlike CNN, RNN, and LSTM networks, this model overcomes the limitations of local receptive fields and enables concurrent consideration of information across all positions in a sequence. Consequently, it excels at capturing global relationships and exhibits superior performance in calculating correlations among features within long sequences while effectively processing long-term dependencies. In this study, we anticipate that leveraging the power of the transformer model will enable effective extraction of the spatial-temporal features of emotion-related EEG signals, thereby improving the accuracy of emotion classification.

Preprocessing
The study conducted by Lashgari et al. [18] demonstrates that segmenting and reassembling data in the time domain preserves information, enables data expansion, and enhances classification accuracy. Previous research has demonstrated that a 3 s time window leads to improved classification accuracy [19]. Therefore, this study employs a 3 s time window for EEG segmentation.
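The windowing step can be sketched as follows. This is a minimal NumPy sketch; the function name and the 200 Hz sampling rate are illustrative assumptions, not values stated in this section.

```python
import numpy as np

def segment_eeg(eeg, fs, win_sec=3.0):
    """Split a (channels, samples) EEG recording into non-overlapping
    windows of win_sec seconds; trailing samples that do not fill a
    full window are discarded."""
    win = int(fs * win_sec)
    n_seg = eeg.shape[1] // win
    return np.stack([eeg[:, i * win:(i + 1) * win] for i in range(n_seg)])

# Example: a 62-channel recording at an assumed 200 Hz, 10 s long,
# yields three 3 s segments of 600 samples each.
x = np.random.randn(62, 2000)
segs = segment_eeg(x, fs=200)
print(segs.shape)  # (3, 62, 600)
```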

Experimental Platform
The entire experimental process was conducted in an environment with an Intel(R) Core i5-12400F processor and an NVIDIA RTX 3080 Ti GPU. The model was implemented using Python 3.7 and the PyTorch deep learning framework.

Experimental Procedures
The flowchart in Figure 1 illustrates the procedure of emotion classification using the transformer and a CNN based on EEG spatial and temporal feature learning. Initially, the raw EEG signals are preprocessed and segmented; then, emotion-related EEG spatial and temporal features are extracted by the transformer-based module. Finally, the results are generated by a prediction layer comprising CNN, max-pooling (MaxPooling), fully connected (FC), and Softmax layers.

Transformer Encoder
The encoder of the transformer model used in this study is illustrated in Figure 2. The transformer encoder includes scaled dot-product attention, addition and normalization (Add and Norm), multi-head attention, and a feed-forward neural network. The scaled dot-product attention mechanism in the transformer encoder is closely connected to the multi-head attention mechanism. In the scaled dot-product attention, the input sequence is linearly transformed to obtain query, key, and value vectors. A scaled dot-product operation is then performed to calculate the attention score, dividing the dot product by the square root of the dimension of the query and key vectors; this ensures gradient stability. Because multiple heads perform independent scaled dot-product operations, the multi-head attention mechanism enables parallel focus on various aspects of the input sequence. By concatenating and finally linearly transforming these head outputs, a comprehensive representation is obtained for each position. Through this tight coupling of scaled dot-product attention and multi-head attention, the transformer encoder achieves efficient encoding and representation of the input sequences.

Scaled Dot-Product Attention
The scaled dot-product attention is depicted in Figure 2a. Initially, the input data are multiplied by three different weight matrices to obtain the query vector (Q), key vector (K), and value vector (V), respectively. Subsequently, the dot product of Q and K is divided by the scaling factor √(d_k) (where d_k represents the dimension of the query vector), followed by computation of the weights using a Softmax function. Finally, these weights are multiplied by the value vector V to obtain the weighted result, as in Equation (1):

Attention(Q, K, V) = Softmax(QK^T / √(d_k)) V    (1)
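Equation (1) can be sketched directly in NumPy. This is a minimal illustration; the function and variable names are ours.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) attention scores
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise Softmax
    return weights @ V, weights

seq_len, d_k = 5, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 8); each row of w sums to 1
```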

Multi-Head Attention
In Figure 2b, the multi-head attention mechanism consists of h scaled dot-product attention layers. Each scaled dot-product attention layer focuses on the information found in a different subspace. This structure allows the model to concurrently process various aspects of the correlations, thereby comprehensively capturing the features of the input data. Subsequently, the attention representations from these diverse heads are concatenated to form the final multi-head attention representation, calculated using Equation (2), where W^O is the weight matrix of the output:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (2)
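Equation (2) can likewise be sketched in NumPy. This is a self-contained, shape-level illustration with randomly initialized weights; the helper names are ours.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, h, rng):
    """Concat(head_1, ..., head_h) W_O: each head runs scaled dot-product
    attention in its own d/h-dimensional subspace."""
    seq, d = X.shape
    d_k = d // h
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.standard_normal((d, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_k))  # scaled dot-product attention
        heads.append(A @ V)
    W_O = rng.standard_normal((d, d))
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 64))
out = multi_head_attention(X, h=8, rng=rng)
print(out.shape)  # (10, 64): same shape as the input sequence
```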

Transformer Encoder
The transformer encoder in Figure 2c is mainly composed of two modules. The first module includes a multi-head attention layer and a normalization layer, with the latter being used to improve stability and accelerate convergence during training. Additionally, residual connections are used between these layers to facilitate information flow and alleviate the issue of gradient vanishing. The second module is composed of a feedforward neural network layer and a normalization layer. The feedforward neural network nonlinearly maps the features obtained from the multi-head attention mechanism, contributing to the model's ability to capture distinctive features within the input sequence. Residual connections are also utilized between these layers. This architectural design aims to fully leverage the self-attention mechanism and feedforward network in the transformer model to effectively capture contextual information from the EEG signals and improve classification performance.
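The two modules can be sketched as follows. This is a NumPy shape-level sketch with toy stand-ins for the attention and feed-forward sublayers, not the trained model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector to zero mean, unit variance."""
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(x, attn, ffn):
    # Module 1: multi-head attention + residual connection + normalization
    x = layer_norm(x + attn(x))
    # Module 2: feed-forward network + residual connection + normalization
    return layer_norm(x + ffn(x))

# Toy stand-ins for the two sublayers, used only to check shapes and flow.
x = np.random.randn(10, 64)
y = encoder_block(x, attn=lambda v: v * 0.1, ffn=lambda v: np.maximum(v, 0))
print(y.shape)  # (10, 64)
```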

Transformer and CNN Models for Learning Emotion-Related EEG Temporal and Spatial Features
This study proposes a novel emotion classification model, named EEG spatial-temporal transformer and CNN (EEG ST-TCNN), which is based on EEG spatial-temporal feature learning, as illustrated in Figure 3.The model effectively uses both the spatial and temporal information embedded in EEG signals.
The EEG ST-TCNN model consists of three components. The first component is the input module of the model, where the raw EEG signals are concurrently input into the model in both spatial and temporal arrangements. The batch_size, channel, and time_point were set to 128, 62, and 600, respectively, during the experimentation on the SEED dataset. Similarly, for the DEAP dataset, the batch_size was set to 128, while the channel and time_point were adjusted to 32 and 384, respectively. To capture the EEG sequence information effectively, positional encoding (PE) is applied to embed the input EEG. PE plays a crucial role in understanding the relationship between element position and order when processing sequence data with transformers, thereby enhancing their performance in various natural language processing and sequence modeling tasks [14]. In this study, we adopt the method of relative positional encoding, implemented using Equations (3) and (4):

PE(pos, 2i) = sin(pos / 10000^(2i/d))    (3)

PE(pos, 2i + 1) = cos(pos / 10000^(2i/d))    (4)

where PE(pos, 2i) and PE(pos, 2i + 1) denote the two elements of the positional encoding at position pos and dimension i, respectively, and d denotes the dimension of the embedding vector.
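The sinusoidal form of [14] can be sketched in NumPy as follows. The function name is ours, and the 600 × 62 shape merely mirrors the SEED time_point and channel settings above.

```python
import numpy as np

def positional_encoding(seq_len, d):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = np.arange(seq_len)[:, None]          # positions 0..seq_len-1
    i = np.arange(0, d, 2)[None, :]            # even dimension indices
    angle = pos / np.power(10000.0, i / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angle)                # even dims: sine
    pe[:, 1::2] = np.cos(angle)                # odd dims: cosine
    return pe

pe = positional_encoding(600, 62)
print(pe.shape)  # (600, 62); at pos 0 the sines are 0 and the cosines are 1
```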

In the second component, two transformer encoders are used to extract the deep spatial and temporal features from the input EEG. The third component is a prediction layer consisting of two convolution layers (each with a 3 × 3 kernel and 64 convolutional kernels), one MaxPooling layer (with a 2 × 2 window), one FC layer, and one Softmax layer. Within the prediction layer, the spatial and temporal features extracted by the transformer encoders are concatenated. This concatenated representation is then processed by the CNN and MaxPooling layers to aggregate the features and capture local patterns. Subsequently, the processed features are passed to the FC layer before being classified into different emotional states by Softmax.
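The three components can be sketched in PyTorch, the framework used in this study. This is a minimal sketch, not the authors' exact implementation: the encoder depth, padding, branch naming, and FC sizing are our assumptions, and the channel-token branch uses 2 heads here because nn.TransformerEncoderLayer requires the head count to divide the model dimension (62 channels is not divisible by 8).

```python
import torch
import torch.nn as nn

class STTCNN(nn.Module):
    """Shape-level sketch of the ST-TCNN layout: two parallel transformer
    encoders (one attending over time points, one over channels), followed
    by a CNN + MaxPooling prediction layer."""
    def __init__(self, n_channels=62, n_times=600, n_classes=3, h=8):
        super().__init__()
        # Branch 1: tokens = channels, features = time points.
        self.channel_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=n_times, nhead=h,
                                       batch_first=True), num_layers=1)
        # Branch 2: tokens = time points, features = channels.
        # nhead=2 is a sketch choice so it divides n_channels.
        self.time_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=n_channels, nhead=2,
                                       batch_first=True), num_layers=1)
        # Prediction layer: two 3x3 convs with 64 kernels, 2x2 max pooling.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2))
        self.fc = nn.LazyLinear(n_classes)

    def forward(self, x):                       # x: (batch, channel, time)
        a = self.channel_enc(x)                 # (batch, channel, time)
        b = self.time_enc(x.transpose(1, 2)).transpose(1, 2)
        f = torch.cat([a, b], dim=1)            # concatenate the two branches
        f = self.cnn(f.unsqueeze(1))            # aggregate with CNN + pooling
        return self.fc(f.flatten(1))            # logits; Softmax lives in the loss

# Small configuration purely to keep the example fast.
model = STTCNN(n_channels=32, n_times=128, n_classes=2)
logits = model(torch.randn(2, 32, 128))
print(logits.shape)  # torch.Size([2, 2])
```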

Model Training Strategy and Process
In order to train and validate the proposed model, ten-fold cross-validation was used: the EEG data of each subject were divided into ten folds, of which nine were used for training and one for testing. Finally, the average accuracy across folds was taken as the classification result. In the multi-head attention mechanism of the transformer, the number of heads h was set to 8, and cross-entropy with an L2 regularization term was used as the loss function. The Adam optimizer was used for optimization. During training, the learning rate and batch_size were set to 0.0001 and 128, respectively. Additionally, a dropout rate of 0.3 was used to prevent overfitting, and the ReLU function was used as the activation function.
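The fold construction can be sketched as follows. The helper name and the shuffling scheme are our assumptions; the paper does not specify how samples were ordered before splitting.

```python
import numpy as np

def ten_fold_indices(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs for ten-fold cross-validation:
    nine folds train, one fold tests, rotating over all ten."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, 10)
    for k in range(10):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(10) if j != k])
        yield train, test

splits = list(ten_fold_indices(100))
print(len(splits))  # 10 train/test splits; the reported result is the mean over folds
```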

Evaluation Metrics
The classification performance of the model is evaluated using accuracy (Acc), precision (P), recall (R), and F1-score as evaluation metrics, calculated using Equations (5)-(8):

Acc = (TP + TN) / (TP + TN + FP + FN)    (5)

P = TP / (TP + FP)    (6)

R = TP / (TP + FN)    (7)

F1 = 2 × P × R / (P + R)    (8)

where TP is the number of positive-class samples predicted as positive, TN the number of negative-class samples predicted as negative, FP the number of negative-class samples predicted as positive, and FN the number of positive-class samples predicted as negative.
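Equations (5)-(8) can be sketched for the binary case as follows; the function name is ours.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from the confusion-matrix
    counts TP, TN, FP, FN."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / (tp + tn + fp + fn)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return acc, p, r, f1

acc, p, r, f1 = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(acc, p, r, f1)  # 0.6, then precision = recall = F1 = 2/3
```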

Classification Performance
We conducted independent experiments on both the SEED and DEAP datasets. Four models were systematically obtained by progressively removing modules from the full model: EEG ST-TCNN, the model without the CNN (EEG ST-T), the model with only the spatial arrangement as input (EEG S-T), and the model with only the temporal arrangement as input (EEG T-T). The classification results of these four models were compared to validate the effectiveness of the removed modules. Figure 4 shows the accuracy of emotion recognition for these models. Both the EEG ST-TCNN and EEG ST-T models achieved high accuracy on the SEED and DEAP datasets. Compared to the EEG ST-T model, the EEG ST-TCNN model demonstrated improvements of 0.69%, 1.93%, 0.78%, and 1.9% in the positive-neutral-negative, arousal-valence, arousal, and valence dimensions, respectively. The average accuracy and variance of the four models in the different dimensions are presented in Table 1, while Table 2 displays the t-test results for the accuracies of these models across the different dimensions. The results of EEG ST-TCNN and EEG ST-T did not show significant differences in the arousal dimension, whereas there were significant differences between the results of EEG ST-TCNN and the other models in the positive-neutral-negative, arousal-valence, and valence dimensions.
Tables 3-6 present the accuracy, precision, recall, and F1-score achieved in the experiments on the SEED and DEAP datasets for the EEG ST-TCNN, EEG ST-T, EEG S-T, and EEG T-T models. The experimental results demonstrate that concurrently using the raw EEG in both spatial and temporal arrangements as input is superior to using the spatial or temporal arrangement alone. Compared with EEG ST-T, integrating the CNN into the model leads to improvements across the various dimensions.

Discussion
To address the prevalent challenges in current EEG-based emotion recognition, which frequently relies on manual feature extraction and the selection of an optimal feature set, this paper proposes a novel EEG ST-TCNN model.
The robustness of the proposed model was validated through independent experiments conducted on two publicly available datasets in this study. To validate the effectiveness of each module in the model, we progressively removed one module and compared the experimental results. As illustrated in Figure 4 and Tables 3-6, the EEG S-T model exhibits inferior performance compared to the EEG T-T model overall. However, the EEG S-T model contributes to enhancing the spatial location information. Notably, concurrently using the spatial and temporal arrangements of the raw EEG data as input is superior to using a single spatial or temporal arrangement alone. Both EEG ST-TCNN and EEG ST-T demonstrate superior suitability for emotion classification based on EEG signals. Compared with EEG ST-T, EEG ST-TCNN exhibits improved performance across the different dimensions. Importantly, EEG ST-TCNN excels at integrating spatial and temporal information from the raw EEG signals, thereby enhancing the accuracy of emotion recognition.
In this study, the temporal and spatial information from the raw EEG signals is taken into consideration, while its frequency domain features are not utilized. Relative positional encoding is used in this research; however, the impact of different positional encoding methods on emotion classification based on EEG signals remains unexplored. It is noteworthy that Wu et al. [24] demonstrated that various positional encoding methods can affect transformer performance. Furthermore, individual differences among subjects were not taken into account in this study. Importantly, other physiological signals also contribute to the task of emotion recognition. Sun et al. [25] proposed a bimodal method combining functional near-infrared spectroscopy (fNIRS) and EEG to identify emotions, and their results indicated superior performance of the fNIRS+EEG method compared to using only fNIRS or EEG.
In our future research, we will combine frequency domain features with the spatial and temporal information of raw EEG signals and investigate the impact of different positional encoding methods on model classification results prior to integrating EEG temporal, spatial, and frequency features into the model. Furthermore, we intend to conduct multi-modal and cross-subject emotion recognition research, with the expectation that the proposed model will further enhance the performance of emotion classification based on EEG signals. Compared to offline emotion recognition, we believe that real-time emotion classification based on EEG holds greater significance and potential applications. Real-time emotion recognition can provide individuals with immediate feedback, thus enhancing the effectiveness of emotion regulation. In future work, we will pay more attention to the complexity of algorithms and endeavor to classify emotions in real time based on EEG signals.

Figure 1 .
Figure 1. The flowchart of emotion classification using the transformer and a CNN based on EEG spatial-temporal feature learning. (a) Acquisition of input data; (b) data preprocessing steps, during which EEG signals are segmented; (c) the transformer-based module, which extracts the spatial and temporal features of the EEG signals; (d) the prediction layer, which aggregates the extracted temporal and spatial features and performs the classification; (e) the classification results.


Figure 3.
Figure 3. The EEG ST-TCNN model. Note: PE (positional encoding); (batch_size, channel, time_point) and (batch_size, time_point, channel) represent the temporal and spatial arrangements of the EEG signals, respectively.

Figure 4.
Figure 4. The accuracy of the four models in classifying emotions on the SEED and DEAP datasets.


Figure 6.
Figure 6. The confusion matrices of the four models in the arousal-valence dimension. Note: (a-d) respectively represent the results obtained by EEG ST-TCNN, EEG ST-T, EEG S-T, and EEG T-T.


Figure 7.
Figure 7. The confusion matrices of the four models in the arousal dimension. Note: (a-d) respectively represent the results obtained by EEG ST-TCNN, EEG ST-T, EEG S-T, and EEG T-T.


Figure 8 .
Figure 8. The confusion matrices of the four models in the valence dimension. Note: (a-d) respectively represent the results obtained by EEG ST-TCNN, EEG ST-T, EEG S-T, and EEG T-T.

Comparison of the Results Obtained Using Other Methods
Shen et al. [2] first calculated the differential entropy (DE) features from different EEG channel signals and then converted them into a four-dimensional structure. The structure integrated the frequency domain, time domain, and spatial domain features of the EEG signals. These structured data were then input into a four-dimensional convolutional recurrent neural network (4D-CRNN) for training. Liu et al. [20] constructed an undirected graph based on the spatial relationships between EEG electrodes. They utilized the differential entropy features of the EEG signals to represent the nodes of the undirected graph. Furthermore, they proposed a model for emotion recognition based on EEG signals using a global-to-local feature aggregation network (GLFANet), into which the undirected graph was fed. Zheng et al. [21] constructed an EEG electrode location matrix corresponding to the brain region distribution, thereby reconstructing the EEG data, and used a combined graph convolutional neural network and LSTM model (GCN + LSTM) to extract the spatial and temporal features of the EEG signals. Liu et al. [22] proposed a model that combines a convolutional neural network (CNN), sparse autoencoder, and deep neural network (CNN-SAE-DNN), integrating the frequency-domain features and spatial location information of EEG signals to construct a two-dimensional data input for the model. Yang et al.
[23] utilized various frequency-domain features of EEG signals to construct three-dimensional data, which were then fed into a continuous convolutional neural network (continuous CNN). Table 7 compares the classification performance of these classical or cutting-edge deep learning methods with that of the proposed EEG ST-TCNN model on the SEED and DEAP datasets. The results demonstrate that the proposed model exhibits varying degrees of improvement in classification performance on both the SEED and DEAP datasets.

Table 1 .
The average accuracy and variance of the four models across the different dimensions.

Table 2 .
t-test results: Comparing the accuracies across the different dimensions for the four models.

Table 3 .
The accuracy, precision, recall, and F1-score of the four models in the positive-neutral-negative dimension.

Table 4 .
The accuracy, precision, recall, and F1-score of the four models in the arousal-valence dimension.

Table 5 .
The accuracy, precision, recall, and F1-score of the four models in the arousal dimension.

Table 6 .
The accuracy, precision, recall, and F1-score of the four models in the valence dimension.

Table 7 .
Classification performance of some classic or state-of-the-art deep learning methods compared with that of the proposed model on the SEED and DEAP datasets.P-N-N represents positive-neutral-negative, A stands for arousal, and V denotes valence.