A Fusion Framework for Confusion Analysis in Learning Based on EEG Signals

Abstract: Human–computer interaction (HCI) plays a significant role in modern education, and emotion recognition is essential in the field of HCI. The potential of emotion recognition in education remains to be explored. Confusion is the primary cognitive emotion during learning and significantly affects student engagement. Recent studies show that electroencephalogram (EEG) signals, obtained through electrodes placed on the scalp, are valuable for studying brain activity and identifying emotions. In this paper, we propose a fusion framework for confusion analysis in learning based on EEG signals, combining feature extraction and temporal self-attention. This framework capitalizes on the strengths of traditional feature extraction and deep-learning techniques, integrating local time-frequency features and global representation capabilities. We acquire localized time-frequency features by partitioning EEG samples into time slices and extracting Power Spectral Density (PSD) features. We introduce the Transformer architecture to capture the comprehensive EEG characteristics and utilize a multi-head self-attention mechanism to extract the global dependencies among the time slices. Subsequently, we employ a classification module based on a fully connected layer to classify confusion emotions accurately. To assess the effectiveness of our method in the educational cognitive domain, we conduct thorough experiments on a public dataset CAL, designed for confusion analysis during the learning process. In both subject-dependent and subject-independent experiments, our method attained an accuracy/F1 score of 90.94%/0.94 and 66.08%/0.65 for the binary classification task and an accuracy/F1 score of 87.59%/0.87 and 41.28%/0.41 for the four-class classification task. It demonstrated superior performance and stronger generalization capabilities than traditional machine learning classifiers and end-to-end methods. The evidence demonstrates that our proposed framework is effective and feasible in recognizing cognitive emotions.


Introduction
In modern education, human-computer interaction (HCI) plays a crucial role, with emotion recognition being particularly significant in the field of HCI. By accurately identifying and understanding students' emotional states, educational systems can better respond to their needs and provide personalized support. Emotion recognition technology can assist educators in determining whether students are experiencing confusion, frustration, or focus during the learning process, enabling timely adoption of appropriate teaching strategies and supportive measures [1][2][3]. Therefore, the importance of emotion recognition in HCI and education is self-evident. It optimizes the teaching process, enhances learning outcomes, and provides students with more personalized support and guidance. Confusion is more common than other emotions in the learning process [4][5][6]. Although confusion is an unpleasant emotion, addressing confusion during controllable periods has been shown to be beneficial for learning [7][8][9], as it promotes active student engagement in learning activities. However, research on learning confusion is still in its early stages and requires further exploration.
Electroencephalography (EEG) records the aggregated electrical activity of neurons in the cortex of the human brain and is therefore considered a physiological indicator of that activity. Compared to non-physiological indicators such as facial expressions and gestures, EEG offers a relatively objective assessment of emotions, making it a reliable tool for emotion recognition [10].
Traditionally, the classification of EEG signals relies on manual feature extractors and machine learning classifiers [11], such as Naive Bayes, SVM, and Random Forest. Although deep-learning architectures are a more recent introduction, they have consistently improved performance [12]. Convolutional Neural Networks (CNNs) and Long Short-Term Memory Networks (LSTMs) are the primary architectures employed [13]. However, employing CNNs for feature extraction primarily focuses on local aspects, hindering temporal information perception. Although LSTM-based approaches exhibit commendable performance, they also struggle with global temporal representation. Various attempts with end-to-end hybrid networks [14] have been made. However, these endeavors have resulted in models with excessively intricate architectures, leading to sluggish convergence rates or even failures to converge. Furthermore, end-to-end methodologies lack the advantages of conventional feature extraction methods in representing EEG signals. The Transformer [15] has showcased its formidable capabilities in natural language processing (NLP), owing to its significant advantage in comprehending global semantics. However, its application in EEG systems is still an area that requires further exploration.
In light of the limitations of end-to-end network models, as well as the disadvantages of CNN and LSTM, and considering the advantages of traditional feature extraction methods and the Transformer network structure, this paper proposes a fusion framework combining feature extraction and a self-attention mechanism. Specifically, the EEG signals are first sliced, and frequency-domain features are extracted. These features are then tokenized into temporal tokens. Subsequently, the self-attention mechanism of the Transformer encoder layer is employed to capture temporal correlations. Finally, the extracted features are integrated using a fully connected layer to derive the classification results, which are then used for confusion analysis. We summarize the contributions of this paper as follows:
1. We present a fusion framework that integrates the strengths of traditional feature extraction and deep learning to analyze confusion during the learning process. This framework enables targeted guidance by assessing the cognitive level of students.
2. By harnessing the robust capabilities of the multi-head self-attention mechanism, we capture global contextual representations of long EEG segments, which proves beneficial for predicting confusion emotions.
3. In both subject-dependent and subject-independent experiments, we compare our framework with traditional machine learning classifiers and end-to-end methods. The experimental results demonstrate the superiority of our framework.
The rest of this paper is organized as follows: Section 2 presents the related work. Section 3 describes the methods. Section 4 presents the experiments and discusses the results. Section 5 presents the conclusion and future research directions.

Related Work
Confusion in learning refers to feeling perplexed or uncertain while absorbing knowledge or solving problems. Given its shared attributes with emotions, it is a nascent study area, primarily exploring confusion's classification as an emotion or affective state. Confusion is deemed a cognitive emotion, indicating a state of cognitive imbalance [9,16]. Individuals are encouraged to introspect and deliberate upon the material to redress this imbalance and facilitate progress, enabling a more profound comprehension. Consequently, when confused, individuals tend to activate profound cognitive processes to pursue enhanced learning outcomes. The investigation into confusion within the learning context remains in its preliminary stages.
Using EEG to recognize human emotions during various activities, including learning, is an area currently being explored. Recent research has focused on using electroencephalography to study cognitive states and emotions for educational purposes. These studies focus on attention or engagement [17,18], cognitive load, and some basic emotions such as happiness and fear. For example, researchers [19] used an EEG-based brain-computer interface (BCI) to record EEG in the FP1 region to track changes in attention. By utilizing visual and auditory cues, such as rhythmic hand raising, adaptive proxy robots can help students shift their attention when their attention falls below a preset threshold. The research results indicate that this BCI can improve learning performance.
Most traditional EEG-based classification methods rely on two steps: feature extraction and classification, and emotion classification is no exception. Many researchers have focused on exploring effective EEG features for classification, and the advancement of machine learning methods and technologies has significantly contributed to the development of these traditional methods. There have been attempts using the Common Spatial Pattern (CSP) algorithm [20], such as the FBCSP algorithm [21], which filters signals through filter banks, computes CSP energy features for each signal output through time filters, and then selects and classifies these features. Despite enhancements to the original CSP method, these techniques solely focus on analyzing the CSP energy dimension, disregarding the incorporation of temporal contextual information. Kaneshiro et al. [11] applied Principal Component Analysis (PCA) to extract feature vectors of specific sizes from minimally preprocessed EEG signals, followed by training a classifier based on Linear Discriminant Analysis (LDA). Karimi-Rouzbahani et al. [22] explored the discriminative power of many statistical and mathematical features, and their experiments on three datasets showed that multi-valued features like wavelet coefficients and the theta frequency band performed better. Zheng et al. [23] investigated the pivotal frequency bands and channels of multichannel EEG data in emotion recognition. Jensen & Tesche [24] and Bashivan et al. [25] demonstrated through experiments that cortical oscillatory activity associated with memory operations primarily exists in the theta (4-7 Hz), alpha (8-13 Hz), and beta (13-30 Hz) frequency bands. The studies above utilize traditional machine learning classifiers to explore critical frequency bands and channels; nevertheless, traditional machine learning classifiers do not demonstrate any performance advantages. In addition, optimizing the feature extractor and the classifier separately can lead to a suboptimal overall solution.
Compared to traditional methods, end-to-end deep networks eliminate the need for manual feature extraction. For most EEG applications, it has been observed that shallow models yield good results, while deep models might lead to performance degradation [12,13]. In particular, CNN-based classifiers with shallow architectures and few parameters have been widely utilized: DeepConvNet [12], EEGNet [26], ResNet [27], and other variants [28]. However, due to the limitations imposed by kernel size, CNNs learn features only within local receptive fields and cannot capture the long-term dependencies that are crucial for time series analysis. Furthermore, recurrent neural networks (RNNs) and long short-term memory (LSTM) networks have been introduced to capture the temporal features for EEG classification [29,30]. However, these models cannot be trained in parallel, and the dependencies calculated by hidden states quickly vanish after a few time steps, making it challenging to capture global temporal dependencies. Moreover, end-to-end methods insist on utilizing deep networks to learn from raw signals, often overlooking the advantages of manual feature extraction, and complex networks can lead to difficulties in model convergence.

Methods
Transformer [15] is an emerging neural network architecture that originated in machine translation tasks. In recent years, it has gained remarkable prominence in natural language processing. However, its application to emotion recognition based on EEG data remains an area requiring further research. In this paper, we combine feature extraction with Transformer for EEG classification. Drawing on the idea of Transformer, we first extract local temporal and frequency features and then adopt the self-attention mechanism to capture global temporal features.
The overview of the proposed framework is depicted in Figure 1. It comprises three modules: preprocessing, multi-head self-attention, and a fully connected classifier. In the preprocessing module, the noise and artifacts of EEG signals are filtered out. Then, the temporal and frequency-domain information, encapsulating crucial local features, is extracted. Next, the multi-head self-attention module extracts long-term features by learning the global correlations between different temporal positions. Finally, utilizing the features extracted in the previous steps, which encapsulate spatial, frequency, and temporal information, a classifier composed of fully connected layers is adopted to output the classification results.

Preprocessing
Since the majority of EEG signals are concentrated within the range of 1 Hz to 50 Hz, a bandpass filter with a range of 1 Hz to 50 Hz was selected. This filtering procedure serves a dual purpose, eliminating low-frequency baseline drift, electrocardiographic (ECG) interference, and other high-frequency noise while effectively removing the most prominent power line interference (typically 50 Hz in China). Furthermore, EEG signals overlap with the electrooculogram (EOG) and electromyogram (EMG) signals in the frequency band. Therefore, relying solely on a single bandpass filter is insufficient to eliminate interference from EOG and EMG. This study adopted fast independent component analysis (FastICA) to eliminate artifacts from EOG and EMG.
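The bandpass step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the Butterworth family, the filter order, and the function name `bandpass_1_50` are assumptions, and the FastICA artifact-removal step is left out because the text does not say how artifact components were selected.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 250.0  # sampling rate in Hz, as reported for the CAL dataset


def bandpass_1_50(eeg, fs=FS, order=4):
    """Zero-phase 1-50 Hz bandpass applied along the time axis.

    eeg: array of shape (channels, samples).
    The Butterworth design and its order are assumptions for illustration.
    """
    # Second-order sections are numerically safer than (b, a) coefficients.
    sos = butter(order, [1.0, 50.0], btype="bandpass", fs=fs, output="sos")
    # Forward-backward filtering avoids phase distortion of the EEG.
    return sosfiltfilt(sos, eeg, axis=-1)
```

Applied to a multichannel recording, the function returns an array of the same shape with slow drift and content above 50 Hz strongly attenuated.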
EEG comprises multiple time series, one per electrode of the collection device, corresponding to different spatial positions on the cerebral cortex. As with audio signals, frequency-domain features are the most salient. Thus, the spectrogram of the signals is typically employed for analysis. Among frequency-domain analysis methods, Power Spectral Density (PSD) analysis is commonly adopted, and most previous studies have used it to investigate epilepsy and hypnosis [31,32]. This method extracts frequency features that effectively detect cognitive and motor tasks [33]. Moreover, the PSD method consistently exhibits the highest robustness and effectiveness in extracting distinctive spectral patterns to differentiate motor imagery EEG signals accurately [34].
In this paper, Welch's method is used to extract the power spectrum features of EEG signals. The data sequence is divided into windowed segments, producing modified periodograms [35]. For the EEG signal $x(n)$ of a given channel, we first divide it into $L$ segments of $M$ sampling points each, so that the $i$-th segment $x_i(n)$ can be denoted as:
$$x_i(n) = x(n + iD), \quad n = 0, 1, \ldots, M-1, \quad i = 0, 1, \ldots, L-1,$$
where $iD$ is the starting point of the $i$-th segment. This forms $L$ data segments. The resulting modified periodograms are:
$$P_i(f) = \frac{1}{MU}\left|\sum_{n=0}^{M-1} x_i(n)\, w(n)\, e^{-j 2\pi f n}\right|^2.$$
Here, $U$ is the power normalization factor of the window function and is chosen such that:
$$U = \frac{1}{M}\sum_{n=0}^{M-1} w^2(n),$$
where $w(n)$ is the window function. The average of these modified periodograms gives Welch's power spectrum as follows:
$$P_W(f) = \frac{1}{L}\sum_{i=0}^{L-1} P_i(f).$$
EEG signals predominantly reside within the 1 Hz to 50 Hz range, categorized into five frequency bands, depicted in Table 1. In this paper, the above method is applied to extract the PSD features of all channels and five frequency bands of EEG signals. Instead of manually selecting several of the five frequency bands for PSD feature extraction, the key features are extracted with the powerful ability of the Transformer deep-learning network architecture.
We divided each sample into 0.25 s slices. The sum of values within the five frequency bands is computed for each time slice and used as a separate measurement for each channel. This approach ensures that the extracted features encompass both frequency-domain and temporal information. PSD features for each time slice are extracted separately. This results in a sample with dimensions of $[W, B \times C]$, where $W$ is the number of time slices, $B$ is the number of frequency bands, and $C$ is the number of channels. Figure 1 illustrates the above process.
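The slicing and band-power extraction described above can be sketched as follows. This is a hedged illustration only: the band edges in `BANDS` are commonly used values assumed here (the paper's Table 1 defines the actual bands), the Hann window and the zero-padding length `nfft` are assumptions, and `psd_tokens` is a hypothetical helper name. Each 0.25 s slice is treated as a single Welch segment, which reduces to one modified periodogram per slice.

```python
import numpy as np
from scipy.signal import welch

FS = 250  # Hz, sampling rate reported for the CAL dataset
# Assumed band edges in Hz; the paper's Table 1 defines the actual bands.
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 50)}


def psd_tokens(sample, fs=FS, slice_sec=0.25, nfft=250):
    """Turn one EEG sample of shape (C, T) into tokens of shape (W, B*C)."""
    n_channels, n_samples = sample.shape
    slice_len = int(slice_sec * fs)        # 62 samples at 250 Hz
    n_slices = n_samples // slice_len      # W
    tokens = np.empty((n_slices, n_channels * len(BANDS)))
    for w in range(n_slices):
        chunk = sample[:, w * slice_len:(w + 1) * slice_len]
        # One Hann-windowed segment per slice, zero-padded to nfft points
        # so that every band contains at least a few frequency bins.
        freqs, pxx = welch(chunk, fs=fs, window="hann",
                           nperseg=slice_len, nfft=nfft, axis=-1)
        band_power = [pxx[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
                      for lo, hi in BANDS.values()]
        # (C, B) flattened so each channel's B band values stay contiguous.
        tokens[w] = np.stack(band_power, axis=1).reshape(-1)
    return tokens
```

For a 4 s, eight-channel sample this yields 16 tokens of 40 values each; a slice this short has coarse spectral resolution, so neighboring bands share some leakage.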

Multi-Head Self-Attention
Due to the continuous nature of neural activity, the context-dependent representation between different time segments would contribute to EEG classification. This module uses self-attention to learn global temporal information of EEG features. The multi-head self-attention mechanism enables the model to attend to information from different representation subspaces from various channels. Each self-attention head, $h \in [1, 2, \ldots, H]$, with $H$ being the total number of heads, relies on $Q_c$ (queries), $K_c$ (keys), and $V_c$ (values) vectors for token assessment. Within the features of each time segment, the frequency band features of different channels are concatenated sequentially. Given this, $H$ is set to the number of channels, so the time slice representations are projected to latent representations $Q_c, K_c, V_c \in \mathbb{R}^{1 \times B}$, where $c \in [1, 2, \ldots, C]$. The "queries-keys" pair aims to map the key slices to the query slices based on their in-between representational similarity, calculated as their scaled dot-product, followed by the SoftMax operation [15]. The resultant matrix is again multiplied with $V_c$ to calculate the representational context as the aggregation of the self-attentional interactions. For a set of $C$ slices and a single self-attention head $h$, the representational context is calculated as
$$\mathrm{head}_h = \mathrm{Attention}(Q_c, K_c, V_c) = \mathrm{SoftMax}\left(\frac{Q_c K_c^{\top}}{\sqrt{d_k}}\right) V_c,$$
where $d_k$ is the dimension of the keys [15]. Multiple projections of $Q_c$, $K_c$, and $V_c$ calculate the respective self-attention heads, and their outputs become concatenated to form the aggregate of multiple heads, as given by
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H).$$
In this module, $N$ multi-head self-attention layers are employed. Finally, the output vector of this module is flattened, and a fully connected layer is utilized as the classifier to obtain EEG classification results.
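To make the per-channel head layout concrete, the following NumPy sketch runs one multi-head self-attention pass over the time-slice tokens, with one head per channel. It is a simplified illustration under stated assumptions: random matrices stand in for the learned query, key, and value projections, each head projects only its own channel's band features, and the function name `per_channel_self_attention` is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def per_channel_self_attention(tokens, n_channels, rng):
    """tokens: (W, C*B) -> context (W, C*B) and attention weights (C, W, W)."""
    n_slices, d_model = tokens.shape
    n_bands = d_model // n_channels
    # One head per channel: (W, C*B) -> (C, W, B).
    x = tokens.reshape(n_slices, n_channels, n_bands).transpose(1, 0, 2)
    # Random matrices stand in for the learned Q/K/V projections (assumption).
    w_q, w_k, w_v = (rng.standard_normal((n_channels, n_bands, n_bands))
                     for _ in range(3))
    q, k, v = x @ w_q, x @ w_k, x @ w_v                   # each (C, W, B)
    # Scaled dot-product similarity between every pair of time slices.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(n_bands)  # (C, W, W)
    attn = softmax(scores, axis=-1)
    heads = attn @ v                                      # (C, W, B)
    # Concatenate the heads back along the feature axis.
    context = heads.transpose(1, 0, 2).reshape(n_slices, d_model)
    return context, attn
```

Each row of an attention matrix sums to one, so every output token is a weighted average of the value vectors of all time slices, which is how global temporal context enters each slice.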

Dataset
We utilize a publicly available dataset called CAL, designed explicitly for confusion analysis in learning. The CAL dataset was first used in [1]. It focuses on cognitive emotions during the learning process, including four categories (confused, non-confused, guess, and think-right). Raven's Progressive Matrices [36] are employed as confusion stimuli to design the experiment. A total of 25 subjects participated in this experiment. Subjects watch ten scene pictures, each of which lasts 10 s. Next, they view and perform 48 tests, each lasting a maximum of 15 s. Data from 23 subjects were obtained because an unexpected equipment problem caused data collection to fail for two participants. The participants evaluate their level of confusion for each test item at the end of the trials. OpenBCI is employed as the EEG collector, which has eight channels (Fp1, Fp2, C3, C4, T5, T6, O1, O2) and a 250 Hz sampling rate, depicted in Figure 2. Table 2 summarizes the relevant information of the CAL dataset. Each trial is segmented with a non-overlapping three-second time window. Each segment is regarded as one data sample during the model training.

Experiment Settings
We follow the settings outlined in [1] and conduct experiments involving binary classification (confused and non-confused) and four-class classification (confused, non-confused, guess, and think-right). In addition, we consider the following two scenarios to validate the proposed methods.
(1) Subject-dependent: in the subject-dependent experiments, the model is trained on data from all subjects. Specifically, 70% of the EEG data from all experiments for each participant are allocated as the training set, while the remaining 30% serve as the testing set.
(2) Subject-independent: the subject-independent experiments emphasize the differences between subjects to test the method's generalization ability. Specifically, EEG data are divided across subjects with a split of 70%/30%, where the data from 16 subjects are used for training, and the data from the remaining seven subjects are used for testing.
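The two splitting scenarios can be sketched as follows. This is an illustrative sketch, assuming each sample carries a subject identifier in an array `subject_ids`; the random shuffling, the seed, and both function names are assumptions, since the text does not specify how samples or subjects were assigned to each side of the split.

```python
import numpy as np


def subject_dependent_split(subject_ids, train_frac=0.7, seed=0):
    """Split every subject's samples 70/30; returns train and test indices."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for s in np.unique(subject_ids):
        idx = rng.permutation(np.flatnonzero(subject_ids == s))
        cut = int(round(train_frac * len(idx)))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(test_idx)


def subject_independent_split(subject_ids, n_train_subjects=16, seed=0):
    """Hold out whole subjects: no test subject is seen during training."""
    rng = np.random.default_rng(seed)
    subjects = rng.permutation(np.unique(subject_ids))
    train_mask = np.isin(subject_ids, subjects[:n_train_subjects])
    return np.flatnonzero(train_mask), np.flatnonzero(~train_mask)
```

With 23 subjects, the second function trains on 16 subjects and tests on the remaining seven, which corresponds to the roughly 70%/30% cross-subject split described above.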
To evaluate the classification performance of various methods, we consider the outcomes of conventional machine learning classifiers (Naive Bayes, SVM, and Random Forest) based on PSD features as presented in [1], as well as end-to-end methods (LSTM [30], ResNet [37], and EEGNet [26]). Furthermore, we conduct experiments using our approach without extracting PSD features to explore the benefits of feature extraction.
When extracting the samples of each category, we set a sliding window of 4 s to segment the data according to the setting in [1,38,39] to increase the sample size, i.e., each experimental sample contains 4 s of EEG data. At the same time, to solve the data imbalance problem, we set overlapping parts of different lengths: 0.25 s for confused, 0.75 s for non-confused, 0.5 s for think-right, and 0.75 s for guess.
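A possible reading of this segmentation is sketched below. It assumes that the quoted values are the overlap between consecutive 4 s windows (so the step is the window length minus the overlap); the text could also be read differently, so the sketch should be taken as one interpretation. The names `OVERLAP_SEC` and `segment_trial` are hypothetical.

```python
import numpy as np

FS = 250  # Hz, sampling rate reported for the CAL dataset
# Class-specific overlap in seconds, as listed in the text.
OVERLAP_SEC = {"confused": 0.25, "non-confused": 0.75,
               "think-right": 0.5, "guess": 0.75}


def segment_trial(trial, label, fs=FS, window_sec=4.0):
    """Cut one trial (C, T) into 4 s windows of shape (n_windows, C, win)."""
    win = int(window_sec * fs)
    # Assumption: consecutive windows share OVERLAP_SEC[label] seconds.
    step = win - int(OVERLAP_SEC[label] * fs)
    starts = range(0, trial.shape[1] - win + 1, step)
    windows = [trial[:, s:s + win] for s in starts]
    if not windows:  # trial shorter than one window
        return np.empty((0, trial.shape[0], win))
    return np.stack(windows)
```

Under this reading, a larger overlap produces more windows from the same trial, which is how the class-specific values would rebalance the sample counts.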
In this paper, the MNE [40] library in Python is adopted for data preprocessing operations. Our method is implemented with the PyTorch framework in Python 3.8 with an NVIDIA GeForce RTX 3090 GPU. Using the same hyperparameter settings, we fix random seeds to repeat the experiment for different methods. We train the model using the Adam optimizer with a learning rate of $1 \times 10^{-4}$, which is a common starting learning rate. The Adam optimizer combines the benefits of Momentum and RMSprop and adaptively adjusts the learning rate. During training, the batch size and dropout rate are set to 32 and 0.5, respectively. A batch size of 32 is a relatively small value chosen to improve the model's generalization performance, and dropout with a rate of 0.5 is a common regularization technique that can reduce overfitting.
We set the number of multi-head self-attention layers N to 6, the number of heads H to the number of channels with a value of 14, and the dimension of the feed-forward layer to 2048. The number of parameters of the model in this configuration is 1M.
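The configuration above can be sketched in PyTorch roughly as follows. This is a minimal sketch, not the authors' implementation: the class name `ConfusionTransformer` is hypothetical, the token shape assumes 16 time slices, five bands, and the eight CAL channels (so the head count here is eight rather than the value quoted in the text), and no positional encoding is added because the text does not describe one.

```python
import torch
from torch import nn


class ConfusionTransformer(nn.Module):
    """Transformer encoder over PSD tokens followed by a linear classifier."""

    def __init__(self, n_slices=16, n_channels=8, n_bands=5, n_classes=2,
                 n_layers=6, ff_dim=2048, dropout=0.5):
        super().__init__()
        d_model = n_channels * n_bands  # token width B*C
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_channels, dim_feedforward=ff_dim,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Flattened encoder output feeds one fully connected layer.
        self.classifier = nn.Linear(n_slices * d_model, n_classes)

    def forward(self, tokens):                 # tokens: (batch, W, B*C)
        encoded = self.encoder(tokens)         # (batch, W, B*C)
        return self.classifier(encoded.flatten(start_dim=1))


model = ConfusionTransformer()
# Adam with the learning rate reported in the text.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```

With these assumed sizes the sketch has roughly one million trainable parameters, most of them in the feed-forward sublayers of the six encoder layers.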

Analysis of Results
Tables 3 and 4 present the experimental results of different methods in subject-dependent and subject-independent experiments on the CAL dataset. The binary classification results of the subject-dependent experiments demonstrate that our approach significantly improves accuracy by 33.19%, 23.06%, and 20.77%, respectively, compared to conventional machine learning methods (Naive Bayes, SVM, and Random Forest). Additionally, we can observe that the CNN-based end-to-end deep-learning methods, ResNet and EEGNet, perform well (with accuracies of 80.61% and 72.02%, respectively), indicating strong feature extraction capabilities of CNN-based methods. However, due to the limited receptive field of CNN, it struggles to capture global features. In contrast, the Transformer architecture based on the self-attention mechanism excels at capturing global information. Experimental results further confirm the advantages of the Transformer architecture: our method, based on Transformer, achieves an average accuracy increase of 18.47% and 9.88% compared to the CNN-based EEGNet and ResNet. Furthermore, we can observe that the end-to-end LSTM network performs impressively in the binary classification tasks of the subject-dependent experiments, achieving an accuracy of 73.45%. However, when faced with ultra-long time series data such as physiological signals, the end-to-end LSTM still loses information during training, leaving it unable to capture global features. In contrast, our Transformer-based approach can capture global temporal context information, resulting in a 17.04% improvement in accuracy compared to the end-to-end LSTM method.
In the subject-dependent experiments for the four-class classification task, our method also exhibits strong performance with an accuracy of 87.59%. Compared to the best-performing traditional machine learning method, Naive Bayes, our method achieves an improvement of 17.87% in accuracy. Additionally, compared to the top-performing end-to-end method, ResNet, our method shows an accuracy improvement of 14.49%.
Moreover, in subject-independent experiments, our approach achieves the highest accuracy of 68.08% and 41.74% in binary and four-class classifications, respectively.Compared to other methods, our approach achieves an accuracy improvement of 1.62% in binary classification and 0.45% in four-class classification.
The F1 score, which is based on precision and recall, is also an important evaluation metric [41]. We present the F1 score of the binary and four-class classification tasks in Figures 3 and 4. Our approach achieved F1 scores of 0.90, 0.87, 0.65, and 0.41 in the experiments. Compared to the best-performing methods, our approach demonstrates improvements of 0.10, 0.14, 0.04, and 0.10, respectively.
To evaluate the contribution of the feature extraction, we compare against directly inputting the raw signal into the Transformer encoding layer without extracting PSD features. Comparing the last two rows of Tables 3 and 4, it is found that there is a decrease in performance. These ablation study results indicate PSD's effectiveness in representing EEG. The results also demonstrate the effectiveness of the feature extraction part in our proposed framework.
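For reference, the F1 score combines precision and recall as their harmonic mean. The sketch below computes it per class and averages the results; the macro-averaging and the function name `macro_f1` are assumptions, since the text does not state which averaging was used for the reported scores.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        # F1 is the harmonic mean of precision and recall.
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Unlike accuracy, this score drops sharply when one class is rarely predicted correctly, which makes it informative for the imbalanced four-class setting.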
We also provide the accuracy and loss during the model training process in Figure 5. The training time of the deep-learning model is also an important parameter. We compared the convergence time of all methods, as shown in Figure 6. Our study shows that our method has a significant advantage over the end-to-end deep-learning methods in terms of training time. Our method offers a more efficient model that can save computing resources and time.
In addition to the faster training time, our method demonstrates better performance. The combination of faster training time and improved performance makes our method compelling for cognitive emotion identification. By reducing the computational burden and achieving better results, our method provides a practical and effective solution for this task. These advantages can have significant implications for real-world applications where efficiency and accuracy are crucial.

Analysis of Confusion Matrix
We provide the confusion matrices of our approach for binary and four-class classification tasks for the subject-dependent and subject-independent experiments, as shown in Figures 7 and 8, respectively.
Our method performs well in identifying cognitive emotions during the subject-dependent experiment in binary and four-class setups. In the binary task, the model exhibits excellent discrimination between confused and non-confused emotions. In the four-class task, the recognition of confused emotions is slightly worse than that of the other three types of emotions. A total of 13% of the confused samples were identified as think-right, suggesting that similar EEG patterns may be generated when subjects are confused or think-right. However, the identification performance is relatively poor in the subject-independent experiment due to the variations among different subjects. In the case of binary classification, the model demonstrates good recognition of confused emotions among different subjects, but it struggles with identifying non-confused emotions. The same pattern can be observed in the four-class scenario, where the recognition performance is relatively better for confused emotions. This suggests that there is a certain degree of similarity in the EEG of different individuals when they are confused during learning, while there is a significant difference in the EEG of emotions other than confusion. These findings illustrate that addressing the differences among subjects poses a significant challenge.
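Percentages such as the share of confused samples predicted as think-right correspond to entries of a row-normalized confusion matrix. The sketch below shows one way to compute such a matrix; it assumes integer class labels and that the reported figures are normalized per true class, which the text does not state explicitly. The function name is hypothetical.

```python
import numpy as np


def row_normalized_confusion(y_true, y_pred, n_classes):
    """Entry (i, j): fraction of true-class-i samples predicted as class j."""
    counts = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for classes with no samples are left as zeros.
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)
```

The diagonal then gives the per-class recall, and off-diagonal entries show which classes are mistaken for one another.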

Figure 1. Overview of the proposed framework.

Figure 2. The eight channel locations (colored orange) of the CAL dataset, using the international 10-20 system.

Figure 3. F1 score of different methods for the subject-dependent experiments.

Figure 5. Accuracy and loss during the model training process.

Figure 6. Training time for convergence of different methods.

Figure 7. Confusion matrix of our method for the subject-dependent experiments.

Table 1. Five frequency bands of EEG signals.

Table 2. Summary of information on the CAL dataset.

Table 3. Subject-dependent experiment of different methods on CAL dataset (Acc/F1 score). "w/" stands for "with" and "w/o" for "without". The best results are highlighted in bold.

Table 4. Subject-independent experiment of different methods on CAL dataset (Acc/F1 score). "w/" stands for "with" and "w/o" for "without". The best results are highlighted in bold.