Emotion Recognition from Spatio-Temporal Representation of EEG Signals via 3D-CNN with Ensemble Learning Techniques

The recognition of emotions is one of the most challenging issues in human–computer interaction (HCI). EEG signals are widely adopted as a method for recognizing emotions because of their ease of acquisition, mobility, and convenience. Deep neural networks (DNNs) have provided excellent results in emotion recognition studies. Most studies, however, use other methods to extract handcrafted features, such as the Pearson correlation coefficient (PCC), Principal Component Analysis (PCA), and the Higuchi fractal dimension (HFD), even though DNNs are capable of generating meaningful features. Furthermore, most earlier studies largely ignored spatial information between the different channels, focusing mainly on time-domain and frequency-domain representations. This study utilizes a pre-trained 3D-CNN MobileNet model with transfer learning on the spatio-temporal representation of EEG signals to extract features for emotion recognition. In addition to fully connected layers, hybrid models were explored using other decision layers such as the multilayer perceptron (MLP), k-nearest neighbor (KNN), extreme learning machine (ELM), XGBoost (XGB), random forest (RF), and support vector machine (SVM). Additionally, this study investigates the effect of post-processing, or filtering, of the output labels. Extensive experiments were conducted on the SJTU Emotion EEG Dataset (SEED) (three classes) and SEED-IV (four classes) datasets, and the results obtained were comparable to the state-of-the-art. The hybrid 3D-CNN with ELM classifier achieved maximum accuracies of 89.18% and 81.60% on the SEED and SEED-IV datasets, respectively. Post-filtering improved the emotion classification performance of the hybrid 3D-CNN with ELM model to 90.85% and 83.71% on the SEED and SEED-IV datasets, respectively. Accordingly, spatio-temporal features extracted from the EEG, along with ensemble classifiers, were found to be the most effective in recognizing emotions compared to state-of-the-art methods.


Introduction
An emotion is a psycho-physiological experience resulting from a conscious or unconscious perception of a situation, object, or characteristic. It is often related to mood, temperament, and personality [1]. Emotions are vital aspects of human existence and play an imperative role in our lives, and the ability to understand them is crucial for human–computer interaction. The main contributions of this study are as follows:

• A 3D-CNN model pre-trained using transfer learning was used to extract features from spatio-temporal 3D representations of EEG signals. The spatial information from 62 electrodes was used to create the input modality. Using 3D-CNN and transfer learning with post-filtering, this is the first time that emotion classification has been performed on the SEED datasets based on spatio-temporal features.
• Apart from the traditional fully connected layers (used for classification after feature extraction from the CNN), other major classifiers, including k-nearest neighbor (KNN), extreme learning machine (ELM), XGBoost, and random forest, were used in hybrid models. Furthermore, the post-filtering of output labels was studied.
• A comprehensive set of results is presented to demonstrate the accuracy and efficiency of the proposed approaches on the SEED and SEED-IV datasets. Both the individual subject's accuracy and the average subject's accuracy across subjects are reported. The reporting of each subject's accuracy enhances transparency and provides a baseline against which other researchers can compare their work. The computation time to evaluate EEG signals is also reported to demonstrate the efficiency of the proposed methodologies.

Related Works
The use of deep neural networks (DNNs) for emotion recognition has received considerable attention and achieved notable success in recent years. This section reviews previous literature on emotion identification from EEG signals using DNNs. Using statistical features (mean, median, mode, and range) with shallow classifiers (Naive Bayes, KNN, decision trees, and SVM), 75% accuracy was achieved on the DEAP dataset [12]. Qing et al. [13] achieved 74.87% accuracy on the SEED dataset and 62.63% on the DEAP dataset using an ensemble model (EM) of shallow classifiers, including k-nearest neighbor (KNN), decision tree (DT), and random forest (RF), with a soft-voting strategy. Chen et al. [14] showed that DNN-based approaches outperform shallow classifiers in recognizing emotions. Taran et al. [15] proposed a combination of sample entropy (SampEn), Tsallis entropy (TE), Higuchi fractal dimension (HFD), and Hurst exponent (HE) features with a multiclass least-squares support vector machine (LS-SVM) model. They employed empirical mode decomposition (EMD)/intrinsic mode function (IMF) filters to clean the data, along with variational mode decomposition (VMD) filters to ensure data integrity, and achieved an accuracy of 90.63% on their dataset for four-class emotion classification (happiness, sadness, fear, and neutral). Using a CNN-SAE (sparse autoencoder)-DNN model combined with the Pearson correlation coefficient (PCC) between channels as a feature, Liu et al. [16] achieved 96.77% accuracy on the SEED dataset, which is considered state-of-the-art.
To make use of the spatial information contained in EEG signals, several graph-based techniques have been studied. Song et al. [17] proposed a dynamic graph convolutional neural network (DGCNN) model with handcrafted features, namely differential entropy (DE), power spectral density (PSD), the differential asymmetric feature (DASM), the rational asymmetric feature (RASM), and differential causality (DCAU), to classify emotions. They achieved an accuracy of 90.4% (three emotions) on the SEED dataset. In a subsequent study, Zhang et al. [18] applied the same features to a graph convolutional broad network model and increased the accuracy to 94.24% on the SEED dataset. Zhong et al. [19] adopted regularized graph neural networks (RGNN) on pre-computed differential entropy features of the SEED and SEED-IV datasets and achieved a state-of-the-art performance of 79.34% on the SEED-IV dataset.
Convolutional neural networks (CNNs) are neural networks with one or more convolutional layers. They are generally used for image processing, classification, segmentation, and other autocorrelated data processing [20]. As a result of their computational efficiency, CNNs are highly effective at detecting and learning important features without any intervention from humans. CNNs can be classified based on their convolutional kernel dimension. 2D CNNs use 2D convolutional kernels and utilize context across the height and width of 2D frames (spatial features) to make predictions. However, they are inherently incapable of leveraging information from adjacent frames. Three-dimensional CNNs solve this problem, as they are the 3D equivalent of two-dimensional CNNs [21].
In recent years, 3D-CNNs have achieved considerable success in processing spatio-temporal information, such as in action recognition [1,7,22], and this capability has also been used in several studies to recognize emotions. 3D convolutional kernels can handle voxel information from adjacent frames, making them powerful models for learning representations of volumetric data, such as videos and 3D medical images (MRI, CT scans) [23]. Since 3D-CNNs have been used in several studies to effectively extract features from videos, the current study used them to extract features from 3D spatio-temporal representations of EEG data [6,24,25]. Salama et al. [24] used a 3D-CNN model to classify emotions, creating a 3D representation of the data as the model input; they achieved accuracies of 87.44% for arousal (two classes) and 88.49% for valence (two classes) on the DEAP dataset. Cho et al. [6] used two different end-to-end 3D-CNN architectures, C3D and R(2+1)D, to extract features and achieved an accuracy of 99.73% (four classes) on the DEAP dataset. The authors proposed a novel method to represent EEG signals as a 3D spatio-temporal block by placing the channels at each sampling time at their original positions, after which interpolation of the 2D EEG frames was used to reconstruct the 3D signals. A transfer learning method based on a pre-trained Inception-ResNet-V2 model was used by Cimtay et al. [26] to recognize emotions; they achieved cross-subject accuracies of 86.56% for two classes (positive-negative) and 78.34% for three classes (positive-neutral-negative) on the SEED dataset. In [25], EEG data from three international open-source datasets (DREAMER, SEED, and DEAP) were used to classify emotions with a 3D-CNN; the maximum mean classification rate was 97.64% on the SEED dataset.
Most earlier studies did not use the spatial information between adjacent electrodes, which, according to some studies, is an important aspect of the input. Unlike most studies, which handcrafted features for the DNN input using techniques such as the Pearson correlation coefficient (PCC), Principal Component Analysis (PCA), the Higuchi fractal dimension (HFD), and entropy measures, the current study created a simple 3D representation from the raw EEG signals that preserves both the spatial and temporal information of the data. Additionally, this study relied entirely on the DNN's ability to extract meaningful features that could be fed into the classifier for emotion classification. Furthermore, this work used a pre-trained 3D-CNN model with transfer learning as the DNN model. Earlier studies on the SEED datasets that used 3D CNNs did not take spatial information into account together with a transfer learning approach for emotion recognition, although some studies have used graph-based neural networks that exploit spatial information. The novelty of this paper is the use of 3D CNNs with spatio-temporal features to recognize emotions on the SEED datasets. The spatial information from 62 electrodes was used to create the input modality. As far as we know, this is the first time that transfer learning has been used on the SEED datasets to classify emotions using spatio-temporal features with a 3D-CNN and an ensemble classifier.

Dataset
The experiments were conducted using two datasets developed at SJTU: SEED and SEED-IV. The SEED dataset is classified into three categories (positive, negative, and neutral), and the SEED-IV dataset into four categories (happiness, sadness, fear, and neutral). Classification was subject-dependent, i.e., each model was trained and tested individually for each subject, and the average score across all subjects was reported; no cross-validation was performed. The 3D MobileNet models were evaluated in phases for each of the 15 subjects using an 80:20 train-test split ratio.
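As a sketch of this protocol, the per-subject 80:20 split might look as follows. This is an illustrative helper, not code from the study; the `streams`/`labels` dictionaries and the function name `subject_dependent_split` are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def subject_dependent_split(streams, labels, test_size=0.2, seed=0):
    """Subject-dependent protocol: one independent 80:20 split per subject.

    streams, labels: hypothetical dicts mapping subject id -> arrays of
    3D EEG blocks and emotion labels. A fixed random_state keeps the
    results reproducible, and stratification keeps the classes balanced.
    """
    splits = {}
    for subj in streams:
        splits[subj] = train_test_split(
            streams[subj], labels[subj],
            test_size=test_size, random_state=seed, stratify=labels[subj])
    return splits  # subj -> (X_train, X_test, y_train, y_test)
```

Each subject's model is then trained and evaluated only on that subject's own split, and the accuracies are averaged across subjects.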

SEED
A multimodal dataset called SEED [11] was developed by researchers in the Brain-like Computing and Machine Intelligence (BCMI) laboratory at Shanghai Jiao Tong University (SJTU). It consists of EEG signals collected from 15 subjects (7 males and 8 females; mean age: 23.27 years; standard deviation: ±2.37). A 62-channel ESI NeuroScan System was used to record the subjects' responses while they watched 15 clips (around 4 min long) from Chinese films. The film clips were carefully selected to evoke a range of emotions: positive, neutral, and negative. Each subject performed the experiment three times, with an interval of approximately one week between repetitions. Finally, the data were downsampled from a 1000 Hz to a 200 Hz sampling rate, and a band-pass filter of 0-75 Hz was applied to remove noise and artifacts.

SEED-IV
As a multimodal dataset, SEED-IV [11] contains the EEG signals of 15 subjects (7 males and 8 females) ranging in age from 20 to 24 years. A 62-channel ESI NeuroScan System was used to record EEG signals while the subjects watched film clips. Seventy-two short film clips of about two minutes each were used to induce four target emotions: happiness, sadness, fear, and neutrality. The experiments were conducted in three sessions on different days, with 24 trials per session (6 trials per emotion). Initially, the data were collected at a sampling frequency of 1000 Hz but were later downsampled to 200 Hz. A band-pass filter of 1-75 Hz was applied to filter out noise and remove artifacts.

Spatio-Temporal Representation of EEG
The SEED and SEED-IV datasets were acquired using the 62-channel ESI NeuroScan System to record the participants' EEG signals. The system records a one-dimensional signal (amplitude vs. time). For a film clip of T s, 200 × T samples were collected for each of the 62 electrodes at a sampling rate of 200 Hz.
The EEG recordings at a certain timestamp t (where t is the time in seconds, t = 0, 1, . . . , N − 1, and N is the total number of samples) can be represented as a one-dimensional vector

C_t = [c_t^1, c_t^2, . . . , c_t^62]^T,

where c_t^n is the recording of the nth channel at timestamp t.
The entire recording can then be represented by stacking such 1D vectors into a 2D matrix of dimension 62 × N:

X = [C_0, C_1, . . . , C_(N−1)].

Many earlier studies [27-29] have used this 2D representation of EEG signals, which, although containing temporal information, lacks the spatial distribution of the electrodes: information about adjacent channels and symmetrical channels is ignored. Multiple studies [5,6] have highlighted the importance of this information.
Some works [5,6] have presented an approximate representation of EEG signals that maps the electrode arrangement on the scalp onto a 2D plane. The method has been used for the 14-channel (DREAMER) and 32-channel (DEAP) EEG datasets [5] and, in the current study, was extended to the 62-channel SEED and SEED-IV datasets.
As shown in Figure 1, channels were arranged in a 9 × 9 matrix, which represents spatial information. It is noteworthy that this representation does not accurately portray the actual arrangement, so the spatial information is slightly distorted.
In the 2D sparse representation, the empty boxes indicate that the corresponding electrodes are absent. The empty boxes were filled using interpolation to make the representation dense; interpolation was performed using radial basis functions (RBFs) with Gaussian basis functions [6]. A simple illustration of the 2D representation of the EEG signals, the 2D EEG frames after interpolation, and the 3D EEG representation is shown in Figure 2. The 3D EEG stream, S, is then created by stacking the 2D frames one after another:

S = [f_t, f_(t+1), . . . , f_(t+w−1)],

where f_t is the 2D frame at timestamp t and w is the length of the time window. Based on previous studies, one second was determined to be an appropriate time window for recognizing emotions [6]; as a result, w was set to 200 (the sampling frequency). Before concatenation into 3D EEG streams, the 2D frames were resized from 9 × 9 to 64 × 64 to make the spatial dimensions comparable to the temporal dimension. Figures 3 and 4 show the average of all subjects' 2D representations of the different emotions in the SEED-IV and SEED datasets, respectively.
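The frame construction above can be sketched as follows. This is a minimal illustration, not the authors' code: the electrode-to-grid coordinates are passed in as an assumed `coords` array (the actual 9 × 9 layout is given in Figure 1), and SciPy's Gaussian RBF interpolator stands in for the interpolation described in [6].

```python
import numpy as np
from scipy.interpolate import Rbf

def eeg_to_3d_stream(eeg, coords, grid=9, size=64, w=200):
    """Turn a (62, w) EEG window into a (w, size, size) spatio-temporal block.

    eeg:    (62, w) raw samples for one time window.
    coords: (62, 2) assumed (row, col) position of each electrode in the
            9 x 9 grid of Figure 1.
    """
    xs = np.linspace(0, grid - 1, size)
    gy, gx = np.meshgrid(xs, xs, indexing='ij')
    frames = []
    for t in range(w):
        # Gaussian RBF interpolation fills the empty grid cells and, by
        # evaluating on a 64 x 64 lattice, resizes the frame in one step.
        rbf = Rbf(coords[:, 0], coords[:, 1], eeg[:, t], function='gaussian')
        frames.append(rbf(gy, gx))
    # Stack 2D frames along time: S = [f_t, ..., f_(t+w-1)].
    return np.stack(frames)
```

With w = 200 this yields the 200 × 64 × 64 block that is fed to the 3D-CNN.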

Spatio-Temporal Learning Based on 3D-CNNs
A resource-efficient 3D-CNN network was used for emotion recognition in this study. The 3D resource-efficient CNNs were developed from the well-known resource-efficient 2D CNNs [32]. Several portable, wearable, wireless, low-cost, off-the-shelf devices are available on the market today, which allows effective computing methods to be utilized in everyday life. As lightweight networks, resource-efficient models are ideal for mobile and embedded applications, since they can be integrated with portable, wearable EEG devices. The following pre-trained models were tested for emotion recognition: 3D-MobileNet, 3D-ShuffleNet, 3D-MobileNetV2, 3D-ShuffleNetV2, and 3D-EfficientNet (available at https://github.com/okankop/Efficient-3DCNNs, accessed on 3 January 2023). Among these models, 3D-MobileNet reported the highest accuracy and the lowest computational complexity [33]. Therefore, 3D-MobileNet-based transfer learning was implemented on the SEED datasets to recognize emotions.
The use of transfer learning is well-studied. Transfer learning is the process of improving learning on a new task by transferring knowledge from a related task that has already been learned [34], and several earlier studies have used it to recognize emotions. A study by Feng K and Chaspari T (2020) found that transfer learning can be applied to speech, video, images, and physiological signals related to emotion [35]. Cimtay et al. [26] employed a state-of-the-art pre-trained Inception-ResNet model and achieved excellent performance on the SEED and DEAP datasets. In this study, a 3D MobileNet pre-trained on the Jester dataset was used for transfer learning. The Jester gesture recognition dataset contains labeled video clips of humans performing basic, predefined hand gestures in front of a webcam or laptop camera [22]. The 3D MobileNet network was extended with dense (fully connected) layers to provide greater depth and accuracy when classifying complex data [26]. The weights of the pre-trained 3D MobileNet were frozen, and only the dense-layer weights were trained. As a result, the model was less prone to overfitting, and computation time was reduced. The different blocks of the proposed 3D-MobileNet architecture are described in Table 1, and the MobileNet block is shown in Figure 5. The training parameters of the proposed 3D-MobileNet are specified in Table 2. Regarding the input clip in Table 1: because the 3D MobileNet was pre-trained on RGB data, the 1 × 200 (temporal) × 64 × 64 (spatial) data were concatenated three times to obtain the required three channels.
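The freezing scheme described above can be sketched in PyTorch. This is a self-contained illustration under stated assumptions: a placeholder feature extractor stands in for the actual pre-trained 3D MobileNet, and the class name `EmotionNet` is invented for the example.

```python
import torch
import torch.nn as nn

class EmotionNet(nn.Module):
    """Transfer-learning sketch: frozen 3D-CNN backbone + trainable dense head."""

    def __init__(self, backbone, feat_dim, n_classes):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False            # freeze pre-trained weights
        self.head = nn.Sequential(             # only these weights are trained
            nn.Linear(feat_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, n_classes),        # softmax is applied by the loss
        )

    def forward(self, x):
        # Replicate a single-channel 1 x 200 x 64 x 64 clip to three
        # channels, mimicking the RGB input the pre-trained model expects.
        if x.shape[1] == 1:
            x = x.repeat(1, 3, 1, 1, 1)
        with torch.no_grad():                  # the backbone stays frozen
            feats = self.backbone(x)
        return self.head(feats.flatten(1))
```

Since only the head's parameters carry gradients, an optimizer built from `net.head.parameters()` trains exactly the dense layers, as described above.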
The last dense layer consists of neurons equal to the number of output classes; the softmax activation function calculates the class probabilities, and ArgMax selects the label with the greatest probability. The ReLU (rectified linear unit) activation function, abbreviated as such in Figure 5, was used in the model. Mathematically, it is defined as g(z) = max{0, z}. Due to its good performance and ease of training, ReLU has become the default activation function for many neural networks. Table 3 shows the detailed specification of the fully connected layers in the proposed model.

After training the CNN model with the multilayer perceptron (MLP) classifier, the model was used to extract features. Deep-learning features were collected from dense layer-1 (1024 neurons). These features were then fed as input to other classifiers, which were then trained. In this study, k-nearest neighbor (KNN), support vector machine (SVM), extreme gradient boosting (XGB), random forest (RF), and extreme learning machine (ELM) were used as classifiers (decision layers) (Figure 6).

• SVM: a supervised machine learning algorithm used for classification. The SVM algorithm seeks an N-dimensional hyperplane that distinctly separates the data points.
• XGB: a scalable, distributed machine learning library implementing gradient-boosted decision trees (GBDT). It performs regression, classification, and ranking using parallel tree boosting.
• RF: a supervised machine learning algorithm widely used for classification and regression. Decision trees are constructed from different samples; the majority vote is taken for classification, and the average for regression.
• KNN: one of the simplest machine learning algorithms, based on supervised learning. The method uses proximity to classify or predict the grouping of an individual data point.
• ELM: feedforward neural networks with one hidden layer, capable of learning more quickly than gradient-based methods.
A grid search was employed to determine the best hyperparameters for the classifiers. The results for all the above-discussed classifiers are presented in this study.
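The feature-then-classifier pipeline with grid search can be sketched as follows. The random `features` array is a stand-in for the dense layer-1 activations (in the study these come from the trained 3D-CNN), and the hyperparameter grids shown are illustrative, not the ones actually searched.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in for dense layer-1 features (n_samples, 1024) and labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(120, 1024))
labels = rng.integers(0, 3, size=120)       # three SEED classes

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=0)

# One hyperparameter grid per decision layer; grid search picks the best
# setting by cross-validation on the training split.
grids = {
    'KNN': GridSearchCV(KNeighborsClassifier(),
                        {'n_neighbors': [3, 5, 7]}, cv=3),
    'SVM': GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=3),
}
for name, grid in grids.items():
    grid.fit(X_tr, y_tr)
    preds = grid.predict(X_te)              # per-second emotion labels
```

The same pattern extends to the RF, XGB, and ELM decision layers used in the study.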

Post-Filtering on Output Classes
Due to the high sensitivity of EEG to noise, it is very important to filter the signals prior to use. In previous studies on medical diagnostics, EEG recordings were smoothed using various filters, such as median, mode, mean, and smoothing filters. Emotions are intense feelings that last for a short period [27]. They are mental states that affect physiological and psychological feelings; they have a natural beginning, a natural lifespan, and a natural end. According to modern neurology, the average duration of an emotion in the human brain is approximately 90 s [28]. It is therefore reasonable to assume that the emotions of a healthy individual with effective emotional regulation will remain constant (or not change) for some small interval T. A prior study proposed post-filtering of the output classes based on this assumption [28], and the post-filtering process in this study is similar to the processes used in previous studies.

Figure 7 shows a 10 s EEG recording and the post-filtering process; a window size of five seconds operating on a 10 s EEG clip is used for illustration purposes. The study in [26] likewise suggests that the emotional state remains the same for some short time interval T and used a post-filtering window size of six seconds. Here, a mode filter with a 5 s window was applied to the 10 s output labels. Labels predicted incorrectly by the model are highlighted in red, labels predicted correctly are green, and yellow indicates a label changed by the mode filter. The model predicted the third label in window 1 as sadness (S); since happiness (H) was the mode of the emotions in that window, sadness (S) could be changed to happiness (H). Likewise, for windows 2 and 3, the fear (F) and neutral (N) predictions in the respective windows could be changed to happiness (H).
Here, the window shifts by one second each time.
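One plausible reading of this sliding mode filter can be written in a few lines. This is an illustrative sketch, not the authors' exact implementation:

```python
from collections import Counter

def mode_filter(labels, w=5):
    """Slide a w-second window one second at a time over per-second labels;
    within each window, replace minority labels with the window's mode."""
    out = list(labels)
    for start in range(len(out) - w + 1):
        window = out[start:start + w]
        mode = Counter(window).most_common(1)[0][0]
        out[start:start + w] = [mode] * w
    return out
```

Applied to the 10 s example above, a lone sadness (S) prediction inside a window dominated by happiness (H) is replaced by H, matching the yellow-label corrections in Figure 7.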

Performance Assessment
The objective of this study was subject-dependent emotion recognition. To maintain the balance of the dataset, the EEG recordings of each subject were split into an 80:20 training-test ratio. The models were trained individually for all 15 subjects. The seed value was fixed to make a fair comparison between the models and to make the results reproducible. After training, the trained model was applied to the test dataset. The performance of the CNN-based hybrid models was evaluated using the following metrics.

Experimental Results and Discussion
This section summarizes the main findings of this study on emotion recognition from EEG signals using 3D-CNN MobileNet-based models. PyTorch was used to implement the 3D MobileNet. An Intel® Core™ i9-10920X CPU running at 3.50 GHz and an Nvidia Corporation 2204 GPU running Ubuntu were used for the experiments. Accuracy and cross-entropy loss values were used to evaluate the convergence of the model. The model weights were saved at the point of convergence, i.e., the lowest loss value, to prevent overfitting.

3D-CNN MobileNet Model with MLP Classifier (Traditional 3D-CNN Model)
Three-dimensional CNN models with MLP classifiers were trained for each subject and then evaluated on the test dataset. For both datasets, the CNN-MLP model produced results comparable to the state-of-the-art: 78.32% accuracy on the SEED-IV dataset and 88.58% on the SEED dataset across 15 subjects (Tables 4a and 5a). Among all classifiers examined, the ELM reported higher accuracy than the MLP on both datasets.

3D-CNN MobileNet Hybrid Model
This phase involves extracting features from the models trained in Phase 1. The extracted features were then used to train different classifiers, namely SVM, random forest, XGBoost, KNN, and extreme learning machine. The classifiers' hyperparameters were optimized using the grid search technique. Compared to the CNN-MLP model, the CNN hybrid models showed significant improvement; in particular, the CNN-ELM hybrid model performed best, with 81.60% accuracy on the SEED-IV dataset (Table 4a) and 89.18% accuracy on the SEED dataset (Table 5a). The confusion matrices of the best classifier outputs are shown in Tables 4b and 5b for the SEED-IV and SEED datasets, respectively.
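The hybrid phase, extracting features from the trained network and running a grid search over a downstream classifier, can be illustrated with a minimal KNN example. The feature vectors here stand in for the penultimate-layer activations of the trained 3D MobileNet; the function names and the grid of k values are assumptions, and the paper's actual grid search covered several classifier families.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k):
    """Classify each test feature by majority vote among its k
    Euclidean-nearest training features."""
    preds = []
    for x in test_X:
        dists = np.linalg.norm(train_X - x, axis=1)
        nearest = train_y[np.argsort(dists)[:k]]
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

def grid_search_k(train_X, train_y, val_X, val_y, grid=(1, 3, 5, 7)):
    """Pick the hyperparameter k with the best validation accuracy."""
    best_k, best_acc = None, -1.0
    for k in grid:
        acc = float(np.mean(knn_predict(train_X, train_y, val_X, k) == val_y))
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```

In practice a library such as scikit-learn's GridSearchCV would be used for the same purpose; the loop above only makes the selection criterion explicit.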

3D-CNN Model with Post-Filtering
In this phase, the output labels from Phases 1 and 2 were post-filtered using mode filters. Table 6 shows the results after applying post-filtering to the labels. Window sizes from 5 to 15 s were used, and increasing the window size resulted in a significant increase in average accuracy (Figure 8), as mispredictions were corrected by replacing them with the window's mode emotion. The performance of the proposed 3D-CNN with ELM is shown in Tables 7 and 8 for the SEED-IV and SEED datasets, respectively. In Table 8, only subject 6 reported a lower accuracy among the 15 subjects. Several factors may explain this: (a) the signals might be corrupted by noise and other external interference, (b) the subject may not have cooperated during the experiment, or (c) the subject might have already participated in a similar experiment, and that bias might affect the model.
Table 9 compares the results of the current study with earlier studies on the SEED and SEED-IV datasets. Many earlier studies have used graph neural networks and recurrent neural networks (RNNs) to classify emotions using the SEED or DREAMER datasets [17,18,29,36]. To classify emotions, researchers have introduced a broad learning system method in graph convolutional neural networks [29]. Researchers have also studied different numbers of EEG channels for detecting emotions in order to reduce the complexity of the emotion recognition system, finding that systems with fewer EEG channels can detect emotions more accurately than systems with a greater number of channels. Spatio-temporal information was recently used to classify emotions using facial expression data and EEG. Subject-specific and subject-independent emotion classifications have been conducted by some researchers using public databases [17].
To the best of our knowledge, no prior research has applied transfer learning or ensemble classifiers to spatio-temporal features for emotion recognition. The comparable state-of-the-art results of this study demonstrate the capability of the 3D-CNN model to extract and learn spatio-temporal information from EEG signals. They also show that pre-trained 3D MobileNet models with transfer learning can extract features suitable for ELM and MLP classifiers. Tables 10 and 11 illustrate the performance of the proposed CNN models in terms of the computation time required to evaluate one minute of EEG data at a 200 Hz sampling frequency. This time includes loading the EEG data, loading the CNN model, preprocessing the dataset, and extracting and evaluating features. Tables 10 and 11 report averages across five trials to eliminate discrepancies. For one minute of EEG recording, the proposed model recognized emotions in 8-8.5 s.
For real-time implementation of emotion recognition using EEG signals, the following parameters could be considered for accurate and robust emotion detection:
• Latency: An emotion recognition system (ERS) must be highly responsive and recognize the user's emotions without delay. High latency may hamper ERS performance and effectiveness in real-time applications.
• Precision/Accuracy: The ERS must be sufficiently accurate to distinguish between different users' emotions. Making the ERS as accurate as possible is also necessary to improve the user experience.
• Adaptability and robustness: The ERS must adapt to identify different users' emotions, and its performance should not be affected by any external or internal noise.
• User-friendliness: The ERS should be easy to use and convenient to configure so that it recognizes the emotions of different users depending on their environment. If the setup is bulky and inconvenient to carry, people may be reluctant to use it.
Moreover, researchers face several challenges when developing real-time emotion recognition systems for human-computer interactive devices:
• Data processing: Real-time applications require efficient algorithms and hardware to process data in real time.
• Generalizability: The model should be robust and capable of delivering high performance for new users without prior knowledge.
• Integration: Integrating emotion recognition models into the HCI architecture remains a major challenge. EEG-based emotion recognition systems require a seamless and efficient method for feeding data from the EEG setup into the model.
It is important to note that this study has a few limitations. For example, only the 3D Efficient MobileNet pre-trained model with transfer learning was used; other pre-trained 3D-CNN models could be explored for better results. Additionally, only subject-dependent emotion recognition was investigated here, while subject-independent emotion recognition remains a challenge. Furthermore, future work could study the extracted features and their explainability.

Conclusions
This study built on the success of 3D-CNNs in video analysis, owing to their capacity to extract and learn temporal features in addition to spatial features. The EEG signals are represented in 3D spatio-temporal space by first converting the 1D raw EEG streams into 2D spatial streams and then stacking the 2D spatial streams into 3D EEG block streams. A 3D MobileNet network with transfer learning was used to extract and learn features from the 3D EEG blocks, and additional pool and dense layers were added to the CNN network to enhance its classification capability. On the SEED-IV dataset, four classes of samples were classified: happiness, sadness, fear, and neutral, with an accuracy of 78.32%. On the SEED dataset, an accuracy of 88.58% was achieved for classifying the samples into three groups: positive, neutral, and negative.
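The conversion from 1D raw EEG streams to 3D blocks can be sketched as follows. This is an illustrative assumption of the construction: the grid size and the electrode-to-grid mapping are placeholders, whereas the real layout follows the SEED 62-channel electrode montage.

```python
import numpy as np

def to_3d_block(eeg, channel_pos, grid=(9, 9)):
    """Stack per-sample 2D spatial frames into a 3D EEG block.

    eeg:          array of shape (n_channels, n_samples), raw 1D streams.
    channel_pos:  (row, col) grid position for each electrode, i.e. a
                  2D scalp layout (illustrative; the real montage differs).
    Returns an array of shape (n_samples, grid_h, grid_w): one sparse
    2D spatial frame per time sample, stacked along the time axis.
    """
    n_channels, n_samples = eeg.shape
    block = np.zeros((n_samples,) + grid, dtype=eeg.dtype)
    for ch, (r, c) in enumerate(channel_pos):
        block[:, r, c] = eeg[ch]  # scatter this channel onto its grid cell
    return block
```

A toy 2-channel recording of two samples mapped onto a 2x2 grid yields a (2, 2, 2) block, with each channel's stream occupying its own grid cell across time; the resulting blocks are what the 3D MobileNet consumes.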
Additionally, the performance of hybrid models was examined, in which the features extracted by the 3D-CNN network were fed into different classifiers (XGBoost, random forest, support vector machine, k-nearest neighbor, and extreme learning machine) in addition to the MLP classifier (dense and pool layers). On the SEED-IV and SEED datasets, the 3D-CNN-ELM hybrid model delivered significant improvements in performance, with accuracies of 81.60% and 89.18%, respectively. Since the emotions of a healthy person vary very little over a short period, it can be assumed that they remain constant for that time. Based on this assumption, the model's performance was investigated when post-filtering the output labels with mode filters, using time windows ranging from 5 to 15 s.
The accuracy of the model increased as the time window of the mode filter increased; the CNN-ELM hybrid model with post-filtering over a 15 s window achieved an accuracy of 87.50%. The proposed model could be used for emotion recognition in HCI-related fields, healthcare, etc.